Project #2: LLM Visualization
So I created a web-page to visualize a small LLM, of the sort that's behind ChatGPT. Rendered in 3D, it shows all the steps to run a single token inference. (link in bio)
Here's a technical guide on how I wrote & structured some of the LLM Visualization. A few people have been asking about it, so I thought I'd write it up. Lots of code screenshots, so not for everyone.
It also contains a walkthrough/guide of the steps, as well as a few interactive elements to play with.
Why, you ask? For what purpose did I put all the time & effort into this project?
With this, you can see the whole thing at once. You can see where the computation takes place, its complexity, and relative sizes of the tensors & weights.
Oh yeah, the link is here: Works best on desktop (sorry mobile). Left-click drag, right-click rotate, scroll to zoom. And hover over the tensor cells. Blue cells are weights/parameters, green cells are intermediate values. Each cell is a single number!
The model with all the animations is tiiny, to make it tractable. For comparison, I threw in a few of the larger models (GPT-2, GPT-3), render-only.
And when you see what it takes to just produce a single value in a mat-mul, the sheer scale of these things becomes apparent.
There's a real advantage to unpacking a set of abstractions, flattening them out. Abstractions can be useful for terseness and management, but they can be a real blocker to seeing the big picture.
Well, I hope you find it interesting. Let me know your thoughts! And if someone makes it through the walkthrough and finds it a little ~incomplete towards the end, I might even get around to fixing it (my attention has largely turned to other projects, oops)
@Algomancer
Yup, it's all here: and the LLM code is under /src/llm (oops a bit of a scope change)
But things that are more useful are 1) motivation & 2) reach/visibility. I've got another project coming up, which is, uhh, probably more ambitious
What about understanding what each layer does? Uhh, sorry, won't be much help.
The project just came out of "Let's build a 3D viz!", so the scope is a bit limited. It's more: here's a way to learn & digest the algorithm, and perhaps think about how to optimize the process.
I also learnt a good amount of GL (dFdx, fwidth, UBOs, instancing), and animation approaches. So, uhh, even if no-one sees this, the project definitely has some value to me.
As for what I got out of creating this: before I made it, I mostly knew how image convolution nets worked, but language-based models seemed kinda magical in comparison.
Well, now I know them in a fair amount of detail!
That's certainly been quite the response! 200k site visits, 1M X views, #1 on HN for a day or so, and plenty of very positive feedback.
Here's a quick doc I put together with my sister a couple months ago, "setting expectations" I guess.
Some quick notes:
* took me maybe 200 hours
* written in TypeScript with Next.js, React for anything DOM-related
* all the 3D stuff is written directly against WebGL2
* the GPT algo itself runs in WASM, written in Odin (nice lang @TheGingerBill!)
* here are my client-side deps:
Project #1: Robotic arm
I'm in the process of designing/building a little robotic arm. I'm using little servos + laser-cut wood as the base materials (it's what I have).
The purpose/utility? Hmm haven't put much thought into that. Not important tbh.
The actual eval & population of the green-block values is done in WASM, written in Odin. It runs at init in a few ms (not optimized, eg uses naive matmul). We then pass that data to the GPU in texture-maps. The intro animation of the entire process is all done after-the-fact.
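For a sense of what "naive matmul" means here, a minimal TypeScript sketch (hypothetical names; the actual eval is written in Odin, this just mirrors the flat row-major layout that a texture-map upload would use):

```typescript
// Naive O(m*k*n) matrix multiply over flat row-major Float32Arrays,
// the same layout you'd hand to the GPU as a texture-map.
function matmul(
  a: Float32Array, // m x k
  b: Float32Array, // k x n
  m: number, k: number, n: number,
): Float32Array {
  const out = new Float32Array(m * n);
  for (let i = 0; i < m; i++) {
    for (let j = 0; j < n; j++) {
      let sum = 0;
      for (let p = 0; p < k; p++) {
        // one multiply-add per element of the dot product
        sum += a[i * k + p] * b[p * n + j];
      }
      out[i * n + j] = sum;
    }
  }
  return out;
}
```

For a tiny model run once at init, this is plenty fast; the "few ms" figure makes sense given how small the weight matrices are.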
@cunha_tristan
In a proper impl, you have all steps running on the GPU (CPU<-->GPU bandwidth is super low). And then the ops themselves are dominated by the dot products in the mat-muls, i.e. a series of multiply-adds. Getting mat-muls to run fast is a bit of an art, to hide high vram latency.
But when we have the blocks hovered, and a row/column/cell is highlighted (+ other animation effects), we split the blocks into sub-blocks (splitGrid). That way they can take unique color/opacity as needed. With a bit of careful maths, the texture-map lookups remain consistent.
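The "careful maths" is essentially an offset: each sub-block remembers where it sits inside its parent, so a cell's texture index is unchanged by the split. A sketch (hypothetical names, not the actual splitGrid code):

```typescript
// A sub-block produced by splitting, keeping its offset within the parent block.
interface SubBlock {
  offsetX: number; // cell offset of this sub-block within the parent
  offsetY: number;
  w: number;       // sub-block size in cells
  h: number;
}

// Map a cell's local coords in the sub-block back to its index in the
// parent block's texture-map, so lookups stay consistent after a split.
function cellTexIndex(sub: SubBlock, localX: number, localY: number, parentW: number): number {
  return (sub.offsetY + localY) * parentW + (sub.offsetX + localX);
}
```

Because the index only depends on parent-relative coords, splitting and re-merging blocks never invalidates the cell data on the GPU.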
So this was a really fun project, engineering-wise. Lots of experimenting with new techniques & approaches. Naturally this was 10x harder than I'd planned. And I've got a new, unrelated one on the go, which is probably more ambitious, oops.
Block rendering: each block is a single six-sided 3D cube (two tris per side). Drawing the individual cell-value circles is done in-shader, querying float32 texture-maps. So the circles & the white grids are all done per-pixel in the fragment shader.
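A CPU-side analogue of that per-pixel logic might look like this (hypothetical sketch; the real version is GLSL in the fragment shader, with fwidth-based antialiasing and a float32 texture fetch for each cell's value):

```typescript
// Given a UV coordinate on a block face, work out which cell the pixel
// lands in and whether it falls inside that cell's circle.
function cellCircleHit(
  u: number, v: number,   // face coords in [0, 1)
  cols: number, rows: number, // grid dimensions of the block face
): { cx: number; cy: number; inside: boolean } {
  const cx = Math.floor(u * cols);
  const cy = Math.floor(v * rows);
  // local coords within the cell, centered on the cell middle
  const lx = u * cols - cx - 0.5;
  const ly = v * rows - cy - 0.5;
  // circle radius in cell units (0.5 would touch the cell edges)
  const inside = Math.hypot(lx, ly) < 0.4;
  return { cx, cy, inside };
}
```

Since this runs per-pixel, the grid lines and circles stay crisp at any zoom, with no extra geometry per cell.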
Of note, the block generation for the small LLM is run for every frame, and the code looks like this.
Very terse, but there are ~50 unique blocks.
Each block carries the info for looking up its cell data in the on-GPU texture-maps, plus the block's dep structure (for hover & anim).
A few acknowledgements:
@karpathy's minGPT repo was vital to ensuring my GPT impl. was working correctly (plus his YT vids are great), @telmudic for that first QT that got me off the ground, and my sister for getting me to actually have a plan & set a release deadline.
@finbarrtimbers
I figured it out by, uhh, fully implementing one. And I only understood that Q/K/V comes from hashtable/dict jargon (query, key, value) like 90% of the way through
Maybe I'll get around to animating those last few pages, or fixing mobile, but no promises.
I'll write a thread or two on how I actually wrote the app, because that's interesting to me.
Here are some walkthrough code samples. First we have the commentary strings (template strings are handy here), interleaved with these t_<name> variables, which provide the sequencing for the animations. Below that, we have the logic for animating the blocks, making direct changes to them.
Today I redesigned the carriage between the two base servos, since the old one had a few fit issues (nasty interference; broken bits). This one's way better. I do the layout/design in Inkscape because it's familiar. Then the construction looks something like this:
I still messed up though. The resulting distance between the pivot points (red) was about 1.5mm too short for the lower arm. So need to adjust in the designs (blue). A pain to start over, so fixed with a wooden spacer for a nice snug fit.
Most of the logic is executed within a requestAnimationFrame (rAF) loop, within the ~top-level runProgram() function.
The IProgramState has all the state hanging off of it, some cross-frame, and some generated per-frame.
This JS logic all takes ~5ms.
We have a per-chapter value t, and when we create each t_<name> variable, it figures out its local t value from that, which ranges from 0 -> 1.
These 0 -> 1 values are then used to drive the animations, typically via lerp functions.
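In sketch form (hypothetical names; the real t_<name> machinery does more bookkeeping per frame), deriving a local 0 -> 1 value from the chapter t and driving a lerp looks like:

```typescript
// Map the chapter-level t into a named phase's [start, end) window,
// clamped and normalized to 0..1.
function phaseT(chapterT: number, start: number, end: number): number {
  return Math.min(1, Math.max(0, (chapterT - start) / (end - start)));
}

// Standard linear interpolation, used to drive most animated properties.
function lerp(a: number, b: number, t: number): number {
  return a + (b - a) * t;
}

// e.g. a block fades in over the first 20% of the chapter:
// const t_fadeIn = phaseT(chapterT, 0.0, 0.2);
// block.opacity = lerp(0, 1, t_fadeIn);
```

Because every phase reduces to a clamped 0..1 value, the whole chapter can be scrubbed backwards and forwards for free.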
I did a bunch of other mini-projects in this, for the fun (I'll spare details).
* The text layout code for the hover tooltips (2D; could have been DOM?)
* Constructing & rendering the ribbons with beziers
* A mimalloc-inspired memory allocator
* The ToC diagram highlighting
@__frye
serious answer from youtube-for-amateurs: try to dig it up from other parts of your road
or maybe other projects have a pile of surplus somewhere
Oh yeah, I threw in a Stripe tip-jar on my website (to be clear, this is just a hobby project & I have a full-time job). P.S.
@stripe
,
@patio11
, your onboarding process is smooth as silk.
@ccreikey
@Algomancer
Oh yeah, I wrote the actual GPT algorithm in Odin that runs in WASM. Just did naive mat-mul though.
Originally wrote a webgl-based GPT evaluator, but perf was terrible for some reason (odin=2ms, webgl=500ms!?!?).
@goblincodes
first thing is to check the dev tools css styles for that text: check the computed values & the style inheritances for differences
also check chrome & firefox to rule out per-domain browser overrides like full page zoom
@goblincodes
when the css gets minified, there's a bug where they clamp opacity values to [0, 1], which breaks for percentages. fix is to use 0.2 instead of 20% etc
i ran into that issue like 2 years ago oh no
@dhomochameleon
there might be confusion about what the pattern is since on a phone it's easy to get cross-eyed across more than one of the pattern repeats
the main one looks like a lion 2 me, but only when I'm gently cross-eyed. the 2nd/3rd ones have extra pop-out layers & don't make sense
@MorlockP
5. It seems like Tesla are really pulling on that late-structural-combination thread for mfg efficiency. It's risky of course: modern unibody design is a bit of a dark art, so a lot of care & skill has to go into keeping structural integrity, crash behavior, etc. up to scratch
@tautologer
Seems reasonable, but +1 for redis pub/sub. Big thing with SSE is the max 6 conns per [domain+browser], so if you have a bunch of tabs open you run into trouble. Can use visibilitychange to deal with that, though
@NakramR
Cheers! Yeah it's a bit unfinished towards the end (haven't animated much). But with the amount of traction it's getting... worth tidying up I think
@pepijndevos
@MuzafferKal_
Yeah so will I. Esp with context window & the new tricks that came in this year.
Also it's worth noting the incremental cost of processing a new token is limited to just one of the columns in the embedding (provided the intermediates are stored across evals)
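In sketch form (hypothetical, single-head, with the usual 1/sqrt(d) scaling omitted for brevity): with the intermediates cached, the new token only needs its own query against the stored keys/values.

```typescript
// One attention step for a newly-appended token, reusing a K/V cache.
// Earlier tokens' keys/values are never recomputed.
function attendNewToken(
  q: number[],          // query vector for the new token only
  cachedK: number[][],  // keys for all tokens so far
  cachedV: number[][],  // values for all tokens so far
): number[] {
  // dot product of q against each cached key
  const scores = cachedK.map(k => k.reduce((s, ki, i) => s + ki * q[i], 0));
  // numerically-stabilized softmax over the scores
  const m = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - m));
  const z = exps.reduce((a, b) => a + b, 0);
  // weighted sum of cached values
  const dim = cachedV[0].length;
  const out = new Array(dim).fill(0) as number[];
  exps.forEach((e, t) => {
    for (let i = 0; i < dim; i++) out[i] += (e / z) * cachedV[t][i];
  });
  return out;
}
```

So per new token the attention cost grows linearly with the context length, rather than re-running the full quadratic pass.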
@ollybot_redux
Have you looked into the JBP route? From my own relatively mild encounters with anxiety/depression, I think finding the right sort of frame is quite valuable. And JBP's frame was pretty novel/useful, and one that I hadn't really encountered in therapy or elsewhere
@MorlockP
4. One of the key driving differences here is that the battery forms part of the structure, and can be attached later in the process. You can see this partially done with the Cybertruck, and Munro appreciate this even though it helps with just 2 seats.
@goblincodes
quick prediction: on localhost:3000, your page has a zoom of 150% (ctrl-0 to fix)
rem is affected by page zoom, but vmin isn't + responsive mode ignores page zoom
@ollybot_redux
here it is so far
i decided to impulse buy a cheapo cnc laser router so thought I'd better make use of it
wiring and software to do (mostly), and then we'll see if it can lift its own weight, and maybe even something else
@cheascake
hello I'd like to add a submission
def baz(xlst):
    cnt = minCnt = 0
    for x in xlst + [0]:
        if not x and cnt and (cnt < minCnt or not minCnt):
            minCnt = cnt
        cnt = cnt + 1 if x else 0
    return minCnt
@MorlockP
2. This assembly phase is:
Long (many stations; cars are big)
Complicated (all sub-assemblies need to be inserted into a semi-enclosed shell with relatively small openings), and
Linear (any stoppage halts all stations immediately)