Wes Gurnee

@wesg52

3,073
Followers
198
Following
31
Media
103
Statuses

Optimizer @MIT @ORCenter PhD student thinking about Mechanistic Interpretability, Optimization, and Governance.

Cambridge, MA
Joined June 2022
Pinned Tweet
@wesg52
Wes Gurnee
3 months
New paper! "Universal Neurons in GPT2 Language Models" How many neurons are independently meaningful? How many neurons reappear across models with different random inits? Do these neurons specialize into specific functional roles or form feature families? Answers below 🧵:
Tweet media one
6
65
402
@wesg52
Wes Gurnee
7 months
Do language models have an internal world model? A sense of time? At multiple spatiotemporal scales? In a new paper with @tegmark we provide evidence that they do by finding a literal map of the world inside the activations of Llama-2!
183
1K
6K
@wesg52
Wes Gurnee
1 year
Neural nets are often thought of as feature extractors. But what features are neurons in LLMs actually extracting? In our new paper, we leverage sparse probing to find out. A 🧵:
Tweet media one
11
133
738
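A minimal sketch of the kind of sparse probing workflow described above, assuming neuron activations have already been cached to disk. The file names, the binary feature label, and the neuron-ranking heuristic are illustrative stand-ins, not the paper's exact procedure.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

acts = np.load("mlp_acts_layer10.npy")    # assumed cache: [n_tokens, n_neurons]
labels = np.load("is_french_token.npy")   # assumed binary feature labels: [n_tokens]

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)

# Rank neurons by mean activation difference between classes (a simple heuristic),
# then fit probes restricted to the top-k neurons for increasing k.
scores = np.abs(X_tr[y_tr == 1].mean(0) - X_tr[y_tr == 0].mean(0))
for k in (1, 4, 16, 64):
    top = np.argsort(scores)[-k:]
    probe = LogisticRegression(max_iter=1000).fit(X_tr[:, top], y_tr)
    print(f"k={k:3d}  test accuracy = {probe.score(X_te[:, top], y_te):.3f}")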
@wesg52
Wes Gurnee
7 months
For spatial representations, we run Llama-2 models on the names of tens of thousands of cities, structures, and natural landmarks around the world, the USA, and NYC. We then train linear probes on the last token activations to predict the real latitude and longitude of each place.
13
38
548
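A minimal sketch of such a spatial probe, assuming last-token activations and true coordinates have been cached; the file names and ridge penalty are illustrative.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

acts = np.load("place_acts_layer20.npy")   # assumed cache: [n_places, d_model]
coords = np.load("place_lat_lon.npy")      # assumed targets: [n_places, 2] (lat, lon)

X_tr, X_te, y_tr, y_te = train_test_split(acts, coords, test_size=0.2, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)

err = np.abs(probe.predict(X_te) - y_te).mean(0)
print(f"mean abs error: {err[0]:.1f} deg lat, {err[1]:.1f} deg lon; R^2 = {probe.score(X_te, y_te):.2f}")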
@wesg52
Wes Gurnee
7 months
To see all the details and additional validations, check out the Paper: Code and datasets:
Tweet media one
7
36
391
@wesg52
Wes Gurnee
7 months
For temporal representations, we run the models on the names of famous figures from the past 3000 years, the names of songs, movies, and books from 1950 onward, and NYT headlines from the 2010s, and train linear probes to predict the year of death, release date, and publication date.
Tweet media one
3
10
250
@wesg52
Wes Gurnee
7 months
But does the model actually _use_ these representations? By looking for neurons with weights similar to the probe's, we find many space and time neurons that are sensitive to the spacetime coordinates of an entity, showing the model itself learned the global geometry -- not the probe.
Tweet media one
3
12
212
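One simple way to look for such neurons, sketched under the assumption that a single probe direction (say, the latitude direction) and a layer's MLP input weights have been saved; names and shapes are illustrative.

import numpy as np

probe_w = np.load("lat_probe_direction.npy")   # assumed: [d_model]
W_in = np.load("mlp_w_in_layer20.npy")         # assumed: [n_neurons, d_model]

# Cosine similarity between every neuron's input weights and the probe direction.
cos = (W_in @ probe_w) / (np.linalg.norm(W_in, axis=1) * np.linalg.norm(probe_w) + 1e-8)
for i in np.argsort(np.abs(cos))[::-1][:10]:
    print(f"neuron {i:5d}  cosine sim with probe = {cos[i]:+.3f}")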
@wesg52
Wes Gurnee
7 months
When training probes over every layer and model, we find that representations emerge gradually over the early layers before plateauing at around the halfway point. As expected, bigger models are better, but for more obscure datasets (NYC) no model is great.
Tweet media one
1
7
180
@wesg52
Wes Gurnee
7 months
Are these representations actually linear? By comparing the performance of nonlinear MLP probes with linear probes, we find evidence that they are! More complicated probes do not perform any better on the test set.
Tweet media one
3
7
171
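A minimal sketch of that comparison, reusing the assumed cached activations from the spatial probe above; the MLP probe size is illustrative.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

acts = np.load("place_acts_layer20.npy")   # assumed cache: [n_places, d_model]
coords = np.load("place_lat_lon.npy")      # assumed targets: [n_places, 2]
X_tr, X_te, y_tr, y_te = train_test_split(acts, coords, test_size=0.2, random_state=0)

linear = Ridge(alpha=1.0).fit(X_tr, y_tr)
mlp = MLPRegressor(hidden_layer_sizes=(256,), max_iter=500, random_state=0).fit(X_tr, y_tr)
# If the MLP probe does no better on held-out data, the representation is ~linearly decodable.
print(f"linear R^2 = {linear.score(X_te, y_te):.3f}   MLP R^2 = {mlp.score(X_te, y_te):.3f}")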
@wesg52
Wes Gurnee
7 months
Are these representations robust to prompting? Probing on different prompts, we find performance is largely preserved but can be degraded by capitalizing the entity name or prepending random tokens. Also, probing on the trailing period instead of the last token works better for headlines.
Tweet media one
1
6
129
@wesg52
Wes Gurnee
7 months
Finally, special shoutout to @NeelNanda5 for all the feedback on the paper and project!
9
3
129
@wesg52
Wes Gurnee
7 months
A critical part of this project was constructing space and time datasets at multiple spatiotemporal scales with a diversity of entity types (e.g., both cities and natural landmarks).
Tweet media one
1
7
128
@wesg52
Wes Gurnee
1 year
One large family of neurons we find are "context" neurons, which activate only for tokens in a particular context (French, Python code, US patent documents, etc). When we delete these neurons, the loss increases in the relevant context while leaving other contexts unaffected!
Tweet media one
3
13
121
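A rough sketch of this kind of ablation test on GPT-2 small via Hugging Face; the layer, neuron index, and example texts are hypothetical, and the paper's actual setup differs in detail.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")
LAYER, NEURON = 8, 1234   # hypothetical "French context" neuron

def ablate(module, inputs, output):
    output[..., NEURON] = 0.0   # zero one MLP neuron's post-activation value
    return output

def loss_on(text, ablated):
    handle = model.transformer.h[LAYER].mlp.act.register_forward_hook(ablate) if ablated else None
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss.item()
    if handle is not None:
        handle.remove()
    return loss

for name, text in [("French ", "Le chat est assis sur le tapis devant la porte."),
                   ("English", "The cat is sitting on the mat in front of the door.")]:
    print(f"{name} clean: {loss_on(text, False):.3f}  ablated: {loss_on(text, True):.3f}")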
@wesg52
Wes Gurnee
2 years
Thrilled to share my first grad school preprint "Learning Sparse Nonlinear Dynamics via Mixed-Integer Optimization" with @dbertsim. Preprint: Code: Thread: 1/4
3
14
84
@wesg52
Wes Gurnee
1 month
Short research post on a potential issue arising in Sparse Autoencoders (SAEs): the reconstruction errors change model predictions much more than a random error of the same magnitude!
1
7
73
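A toy sketch of the comparison being made. Everything below (the activation, the "SAE reconstruction", and the random linear readout standing in for the rest of the forward pass) is a synthetic stand-in used only to show the procedure; the post's experiment uses a real model and a trained SAE.

import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 768, 50257

x = rng.standard_normal(d_model)                                 # stand-in clean activation
x_hat = x + 0.1 * rng.standard_normal(d_model)                   # stand-in SAE reconstruction
W_U = rng.standard_normal((vocab, d_model)) / np.sqrt(d_model)   # stand-in readout to logits

def logits_fn(a):
    # In the real experiment: resume the model's forward pass from this activation site.
    return W_U @ a

def kl(p_logits, q_logits):
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum())

err = x_hat - x
rand = rng.standard_normal(d_model)
rand *= np.linalg.norm(err) / np.linalg.norm(rand)   # random error of the same magnitude

clean = logits_fn(x)
print("KL from SAE error:   ", kl(clean, logits_fn(x + err)))
print("KL from random error:", kl(clean, logits_fn(x + rand)))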
@wesg52
Wes Gurnee
7 months
@rafaelrmuller @tegmark We have results for PCA and you definitely need more than two PCs. This is because our datasets contain diverse entities (e.g., cities, buildings, natural landmarks), and on inspection the first few PCs seemed to cluster this information.
5
1
44
@wesg52
Wes Gurnee
1 year
But what if there are more features than there are neurons? This results in polysemantic neurons which fire for a large set of unrelated features. Here we show a single early layer neuron which activates for a large collection of unrelated n-grams.
Tweet media one
1
3
41
@wesg52
Wes Gurnee
1 year
That said, more than any specific technical contribution, we hope to contribute to the general sense that ambitious interpretability is possible: that LLMs have a tremendous amount of rich structure that can and should be understood by humans!
1
4
43
@wesg52
Wes Gurnee
1 year
Early layers seem to use sparse combinations of neurons to represent many features in superposition. That is, they use the activations of multiple polysemantic neurons to boost the signal of the true feature over all interfering features (here "social security" vs. adjacent bigrams).
Tweet media one
1
4
40
@wesg52
Wes Gurnee
3 months
See the full paper for all the details. Paper: Code:
Tweet media one
0
3
32
@wesg52
Wes Gurnee
1 year
Results in toy models from @AnthropicAI and @ch402 suggest a potential mechanistic fingerprint of superposition: large MLP weight norms and negative biases. We find a striking drop in early layers in the Pythia models from @AiEleuther and @BlancheMinerva .
Tweet media one
1
3
31
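A minimal sketch of how one might compute this fingerprint, shown on GPT-2 small (chosen only because it is small and public) rather than the Pythia models in the plot.

import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
with torch.no_grad():
    for l, block in enumerate(model.transformer.h):
        W = block.mlp.c_fc.weight   # [d_model, d_mlp]; column i holds neuron i's input weights
        b = block.mlp.c_fc.bias     # [d_mlp]
        print(f"layer {l:2d}  mean ||w_in|| = {W.norm(dim=0).mean().item():.3f}"
              f"   mean bias = {b.mean().item():+.3f}")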
@wesg52
Wes Gurnee
4 months
New version is out (to appear at ICLR)! Main updates:
- Additional experiments on Pythia models
- Causal interventions on space and time neurons
- More related work
- Clarified our claims of a literal world model (static vs. dynamic)
- External replications!
More in thread:
@wesg52
Wes Gurnee
7 months
Do language models have an internal world model? A sense of time? At multiple spatiotemporal scales? In a new paper with @tegmark we provide evidence that they do by finding a literal map of the world inside the activations of Llama-2!
183
1K
6K
1
1
27
@wesg52
Wes Gurnee
1 year
While we found tons of interesting neurons with sparse probing, it requires careful follow-up analysis to draw more rigorous conclusions. E.g., athlete neurons turn out to be more general sports neurons when analyzing the tokens with maximum average activation.
Tweet media one
1
2
25
@wesg52
Wes Gurnee
3 months
We also observe many neuron functional roles, for instance (a) prediction, (b) suppression, and (c) partition neurons, which make coherent predictions about what the next token is (or is not). Suppression neurons reliably follow prediction neurons (bottom).
Tweet media one
2
0
24
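One way to see prediction vs. suppression behavior is to compose a neuron's output weights with the unembedding and inspect which token logits it boosts or suppresses. A sketch on GPT-2 small with a hypothetical layer/neuron index:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2TokenizerFast.from_pretrained("gpt2")
LAYER, NEURON = 10, 42   # hypothetical

with torch.no_grad():
    w_out = model.transformer.h[LAYER].mlp.c_proj.weight[NEURON]   # [d_model]
    logit_effect = model.lm_head.weight @ w_out                    # [vocab] direct logit effect

# Tokens print in GPT-2's byte-level form ('Ġ' marks a leading space).
print("boosted:   ", tok.convert_ids_to_tokens(torch.topk(logit_effect, 10).indices.tolist()))
print("suppressed:", tok.convert_ids_to_tokens(torch.topk(-logit_effect, 10).indices.tolist()))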
@wesg52
Wes Gurnee
3 months
Attention heads can be effectively "turned off" by attending to the BOS token. We find neurons which control how much heads attend to BOS, effectively turning individual heads on or off.
Tweet media one
1
1
25
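A small sketch of the underlying diagnostic: measure how much each head in a layer attends to a prepended BOS token in GPT-2 small. The layer choice and prompt are illustrative, and the neuron interventions themselves are not shown.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

ids = tok(tok.bos_token + "The quick brown fox jumps over the lazy dog.",
          return_tensors="pt").input_ids
with torch.no_grad():
    attn = model(ids, output_attentions=True).attentions   # per-layer [1, n_heads, seq, seq]

LAYER = 5                                                   # illustrative
to_bos = attn[LAYER][0, :, 1:, 0].mean(dim=-1)              # mean attention to position 0, per head
for h, a in enumerate(to_bos.tolist()):
    print(f"head {h:2d}: mean attention to BOS = {a:.2f}")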
@wesg52
Wes Gurnee
3 months
We found a very special pair of high-norm neurons (which exist across all model inits) that do not compose with the unembedding. Instead of changing the probability of any individual token, they change the entropy of the entire distribution by changing its scale!
Tweet media one
1
0
24
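A toy numpy illustration of the claimed effect: uniformly rescaling the logits (roughly what changing the final LayerNorm scale amounts to) changes the entropy of the output distribution without changing the token ranking. The logits here are random stand-ins.

import numpy as np

def entropy(logits):
    p = np.exp(logits - logits.max()); p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

logits = np.random.default_rng(0).standard_normal(50257)
base_order = np.argsort(logits, kind="stable")
for scale in (0.5, 1.0, 2.0):
    scaled = scale * logits
    same_rank = np.array_equal(np.argsort(scaled, kind="stable"), base_order)
    print(f"scale {scale}: entropy = {entropy(scaled):.2f}, ranking unchanged = {same_rank}")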
@wesg52
Wes Gurnee
1 year
Precision and recall can also be helpful guides, and remind us that we should not assume a model will learn to represent features in an ontology convenient or familiar to humans.
Tweet media one
2
1
24
@wesg52
Wes Gurnee
3 months
When we zoom in, many neurons do have relatively clear interpretations! Using several hundred automated tests, we taxonomize the neurons into families, e.g.: unigram, alphabet, previous token, position, syntax, and semantic neurons.
Tweet media one
1
2
23
@wesg52
Wes Gurnee
1 month
Working with Neel was one of the most valuable experiences of my career and I can’t recommend working with him enough! The MATS cohort and program were also great – I think most people interested should definitely apply!
@NeelNanda5
Neel Nanda
1 month
Are you excited about @ch402 -style mechanistic interpretability research? I'm looking for scholars to mentor via MATS - apply by April 12! I'm very impressed by the great work from past scholars, and enjoy mentoring promising mech interp talent. I'm excited for my next cohort!
3
28
188
0
0
22
@wesg52
Wes Gurnee
3 months
After computing maximum pairwise neuron correlations across 5 different models trained from different random inits, we find that (a) only 1-5% of neurons are "universal"; (b) high/low correlation in one model implies high/low correlation in all models; (c) neurons specialize by depth.
Tweet media one
1
1
22
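A minimal sketch of the universality test for one pair of models, assuming activations over a shared token set have been cached; the file names and the 0.5 cutoff are illustrative.

import numpy as np

A = np.load("acts_modelA_layer10.npy")   # assumed cache: [n_tokens, n_neurons]
B = np.load("acts_modelB_layer10.npy")   # assumed cache: [n_tokens, n_neurons]

# Standardize, then get all pairwise Pearson correlations in one matrix product.
A = (A - A.mean(0)) / (A.std(0) + 1e-8)
B = (B - B.mean(0)) / (B.std(0) + 1e-8)
corr = (A.T @ B) / A.shape[0]            # [n_neurons_A, n_neurons_B]
max_corr = np.abs(corr).max(axis=1)      # best match in model B for each neuron in model A

cutoff = 0.5
print(f"fraction of neurons with max |corr| > {cutoff}: {(max_corr > cutoff).mean():.3f}")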
@wesg52
Wes Gurnee
1 year
What happens with scale? We find representational sparsity increases on average, but different features obey different scaling dynamics. In particular, quantization and neuron splitting: features both emerge and split into finer-grained features.
Tweet media one
1
3
20
@wesg52
Wes Gurnee
3 months
What properties do these universal neurons have? They consistently seem to be high-norm and sparsely activating, with bimodal right tails. In other words, what we would expect of monosemantic neurons!
Tweet media one
1
1
20
@wesg52
Wes Gurnee
6 months
Really enjoyed advising this follow-up project on training dynamics of context neurons! I think there is a ton of good research to do at the intersection of interpretability and training dynamics and I hope to see more!
@lucia_quirke
Lucia Quirke
6 months
A mystery in prior work: LLMs contain interpretable neurons that correspond to the language of the text. Some aren't important, but deleting Pythia 70M's German neuron increases loss by 12% on German text. Why? We investigate over training and show it's part of a "second order circuit."
Tweet media one
15
30
354
0
1
16
@wesg52
Wes Gurnee
3 months
There were lots of mysteries we didn't fully understand. One fairly striking example was the relationship between activation frequency and the cosine similarity between a neuron's input and output weights!
Tweet media one
1
0
15
@wesg52
Wes Gurnee
7 months
Awesome new paper by @saprmarks showing the emergence of a truth direction in LLMs!
@saprmarks
Samuel Marks
7 months
Do language models know whether statements are true/false? And if so, what's the best way to "read an LLM's mind"? In a new paper with @tegmark , we explore how LLMs represent truth. 1/N
10
67
307
0
2
12
@wesg52
Wes Gurnee
7 months
@emilymbender Indeed, this is an unfortunate bias in our dataset. These are all places from English Wikipedia, so even coverage of places in countries like France or China is worse than in South Africa or India.
1
0
12
@wesg52
Wes Gurnee
2 years
We believe this extra modeling power will give practitioners unprecedented flexibility in tailoring the learning process to their problem domain and aid in learning dynamics in highly underdetermined settings. 4/4
0
0
5
@wesg52
Wes Gurnee
2 years
This optimality buys consistent statistical gains across many different systems and data regimes while still being very tractable (sometimes even faster than heuristics). Perhaps most exciting is the ability to embed a huge variety of constraints for physics informed ML. 3/4
1
0
4
@wesg52
Wes Gurnee
2 years
We consider the SINDy framework proposed by @eigensteve , Joshua Proctor, and Nathan Kutz to discover governing equations of dynamical systems directly from data. We integrate exact sparse regression techniques to solve the SINDy problem to provable optimality. 2/4
1
0
4
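A toy sketch of the sparse regression at the heart of SINDy: build a library of candidate terms and select the best k-term fit. Brute-force best-subset search is exact here only because the library is tiny; the paper's contribution is solving this selection problem to provable optimality at scale with mixed-integer optimization.

import itertools
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
dxdt = 0.5 * x - 1.5 * x**3 + 0.01 * rng.standard_normal(200)   # synthetic dynamics + noise

library = {"1": np.ones_like(x), "x": x, "x^2": x**2, "x^3": x**3, "sin(x)": np.sin(x)}
names, Theta = list(library), np.column_stack(list(library.values()))

k = 2   # target sparsity
_, support = min(
    (np.linalg.lstsq(Theta[:, list(S)], dxdt, rcond=None)[1].sum(), S)
    for S in itertools.combinations(range(len(names)), k)
)
coef = np.linalg.lstsq(Theta[:, list(support)], dxdt, rcond=None)[0]
print("dx/dt =", " + ".join(f"{c:.2f}*{names[i]}" for c, i in zip(coef, support)))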
@wesg52
Wes Gurnee
4 months
However, there was a recent paper from Chen et al. on "Causal Representations of Space" in LLMs that builds on our work and finds "LLMs learn and use an internal model of space in solving geospatial related tasks."
Tweet media one
1
0
4
@wesg52
Wes Gurnee
11 months
@sarahookr @aahmadian_ @TheyCallMeMr_ @hongyucharlie @ahmetustun89 Cool work! Also relevant is a short paper from @AnthropicAI that suggests Adam is likely to blame for privileging the residual stream of transformers, causing the emergence of outlier features:
1
0
3
@wesg52
Wes Gurnee
4 months
Reviewers (and twitter) were unhappy with our use of the term "world model". We edited the text to clarify we use this term in its static sense -- i.e., that LLMs have a map of time and space, but we don't show this is part of a dynamic model used to solve downstream problems.
Tweet media one
1
0
2
@wesg52
Wes Gurnee
8 months
@maksym_andr @askerlee I think this is an artifact of OPT models being undertrained and using ReLU. I see some but not nearly as many dead neurons in Pythia and GPT2 models
1
0
2
@wesg52
Wes Gurnee
4 months
We reran our main probing sweep experiment with the Pythia models from @AiEleuther . We find clear scaling in model size with a jump between the Pythia and Llama models, likely due to different training data size (300B vs 2T tokens). Scale wins again!
Tweet media one
2
0
2
@wesg52
Wes Gurnee
4 months
See OpenReview for further discussion with reviewers and an annotated PDF of revisions. OpenReview: Arxiv:
Tweet media one
0
0
2
@wesg52
Wes Gurnee
4 months
We also ran a few simple causal intervention/ablation experiments on our space and time neurons. We find we can alter the predicted release year of famous artworks by intervening on time neurons, and that geospatial prompts suffer the highest loss increase under space neuron ablations.
Tweet media one
Tweet media two
1
0
1