Tinygrad:
7x speed improvement for LLaMA in under 10 lines of code? Yes, you read that right. All by adding a single line that lets OpenMP parallelize the global and local index loops in the clang backend.
(Before/after benchmark screenshots)
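For illustration, here's a sketch of the idea — not tinygrad's actual generated code, the kernel below is made up — showing how one OpenMP pragma on the outer index loop of an emitted C kernel spreads the work across cores:

```c
// Hypothetical elementwise kernel like the ones the clang backend emits:
// a flat loop over the global index. The pragma is the "single line" that
// lets OpenMP split the iterations across threads.
void kernel(float *out, const float *a, const float *b, int n) {
  #pragma omp parallel for  // the one added line
  for (int gidx = 0; gidx < n; gidx++) {
    out[gidx] = a[gidx] * b[gidx] + 1.0f;
  }
}
```

Compile with `cc -fopenmp`; without the flag the pragma is ignored and the kernel still computes the same result, just single-threaded.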
Tinygrad day 5:
LLaMA now runs on the CUDA backend!
It's still very slow at 1100ms/token compared to 180ms on the OpenCL backend (I'm not even going to mention @ggerganov's times, as they blow us out of the water).
I'm proud to announce the beta release of Figura chat -
an uncensored, privacy-first AI companion platform running at lightning-fast speeds.
For more information on how to participate for free, join our Discord using the link below.
Sad news everybody :(
A couple of weeks ago I proposed replacing the 1GB VRAM chips with 2GB ones to get a 48GB 3090; today a tinygrad community member told me that such a configuration doesn't even POST.
Just got reminded of a cool old project of mine: rendering realtime video in the terminal as text, at decent quality.
Not all of it got upstreamed to mpv, though; I still have the patches somewhere in my archive.
@yacineMTB
@cto_junior
tinygrad does tensor operations. These operations are lazy (executed only when the result is needed). A layer operates on the OPs (a custom tinygrad IR) and optimizes them (fusing, constant folding, etc.), then that IR gets executed by a backend. 1/n
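A toy sketch of that pipeline — the names and node shapes here are made up, tinygrad's real IR is different — showing lazy graph building, a constant-folding pass, and a "backend" that only evaluates when the result is needed:

```c
#include <stdlib.h>

typedef enum { OP_CONST, OP_ADD, OP_MUL } OpKind;

// one node of a lazy expression graph
typedef struct Op {
  OpKind kind;
  float value;            // used when kind == OP_CONST
  struct Op *lhs, *rhs;   // used for OP_ADD / OP_MUL
} Op;

static Op *op_const(float v) {
  Op *o = calloc(1, sizeof(Op));
  o->kind = OP_CONST;
  o->value = v;
  return o;
}

static Op *op_bin(OpKind k, Op *l, Op *r) {
  Op *o = calloc(1, sizeof(Op));
  o->kind = k;
  o->lhs = l;
  o->rhs = r;
  return o;
}

// "optimizer" pass: collapse any subtree whose inputs are both constants
static Op *fold(Op *o) {
  if (o->kind == OP_CONST) return o;
  o->lhs = fold(o->lhs);
  o->rhs = fold(o->rhs);
  if (o->lhs->kind == OP_CONST && o->rhs->kind == OP_CONST) {
    float l = o->lhs->value, r = o->rhs->value;
    return op_const(o->kind == OP_ADD ? l + r : l * r);
  }
  return o;
}

// "backend": nothing actually computes until this is called
static float run(Op *o) {
  if (o->kind == OP_CONST) return o->value;
  float l = run(o->lhs), r = run(o->rhs);
  return o->kind == OP_ADD ? l + r : l * r;
}
```

Building `(2 + 3) * 4` just allocates three nodes; `fold` reduces the whole graph to a single `OP_CONST` before the backend ever runs it.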
@cto_junior
The doc site linked is a very good source; others I can recommend are the introduction to NVPTX () and the ENCCS guide (). Those are the ones I've read, at least.
tinygrad day (night) 6:
cudacpu is now 5x faster, and 20 more tests pass!
No more segfaults, but I still have to do what I feared most: fuzzing PTX assembly for bugs in the emulator...
Passed 2**10 followers. Thank you all!
As for tinygrad updates: pycuda stopped working out of nowhere (I've since moved to another server), and there are still numerous edge cases in the assembler. So that's that.
I like how for anime people that have some audience on twitter, their profile picture becomes their brand. I'm kinda stuck with this konata one I guess
I've noticed that Twitter is all about traction. When you lose it, you have to keep rolling the dice, hoping for a banger post that jumpstarts it again. Kinda destructive but understandable; it promotes being chronically online.
It's embarrassing how slow even high-end Windows laptops are on battery compared to M{1,2,3} MacBooks. My 13th gen i9 struggles with inference even on small models. Hell, it's sluggish even on USB-C power.
@MaximePeabody
I don't think it's going for exact matches, as there are too many variables, from microphone differences alone to things like natural reverb; I'd bet it's "just" a kNN.
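To make the guess concrete — and it is only a guess about how the product works — matching by nearest neighbor just means reducing each clip to a feature vector and returning the closest stored one. A minimal 1-NN lookup, with made-up dimensions:

```c
#include <float.h>

#define DIM 4  // feature vector size, chosen arbitrarily for the sketch

// squared Euclidean distance between two feature vectors
static float dist2(const float *a, const float *b) {
  float d = 0.0f;
  for (int i = 0; i < DIM; i++) {
    float t = a[i] - b[i];
    d += t * t;
  }
  return d;
}

// index of the reference vector closest to the query (1-NN)
static int nearest(const float db[][DIM], int n, const float *query) {
  int best = 0;
  float best_d = FLT_MAX;
  for (int i = 0; i < n; i++) {
    float d = dist2(db[i], query);
    if (d < best_d) {
      best_d = d;
      best = i;
    }
  }
  return best;
}
```

A fuzzy match like this tolerates exactly the variation mentioned above: a noisy query vector still lands near its clean reference.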
By anime pfps, for anime pfps.
Together with
@AnikiRip
we're building the BEST AI companion platform available.
If you're someone who's interested in digital companionship, we want to talk to you!
Join the Discord (on the site) for launch updates.
The early stage of the first server in our 3-server cluster is ready: balthazar. It's still little, but it will grow into a 7x4090 open-air monster in time. Expect a blog post soon on fitting two air-cooled 4090s into a standard-sized case.
Oh wow, forget fixing bugs. I just git-stashed the new changes and reimplemented them from scratch, and it seems to be working. Unreal. I diffed the changes: it was a 0 vs 1 index issue that was causing some weird behavior with aligning stores and loads. I'm glad it's over.
@gf_256
I'd guess that first, ret is not emitted because of the infinite loop; then the loop, having no side effects, is removed as UB. That leaves main as just an empty label with no ret, so the next instruction executed is the one in unreachable()
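A safe, runnable illustration of the first half of that guess: a loop with no observable side effects is dead code, so the optimizer may delete it outright. The full program shape being discussed is sketched only in the comment, since it relies on undefined behavior (C11 lets the compiler assume side-effect-free loops terminate) and should not be run:

```c
// The program under discussion looks roughly like:
//
//   int main(void) { while (1) { } }  // side-effect-free infinite loop: UB
//   void unreachable(void) { ... }    // laid out after main in the binary
//
// At -O2 the loop (and the ret) can vanish, so execution falls through
// into unreachable(). The finite version below is well-defined and shows
// the same elimination: the loop body does nothing, so it gets removed.
int count_to(int n) {
  for (int i = 0; i < n; i++) {
    // no side effects: the whole loop is eligible for dead-code elimination
  }
  return n;
}
```

Compiling `count_to` with `clang -O2 -S` and inspecting the assembly shows the loop is gone; the UB variant's behavior depends entirely on compiler, flags, and code layout.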