Wes Gurnee

@wesg52

3,073
Followers
198
Following
31
Media
103
Statuses

Optimizer @MIT @ORCenter PhD student thinking about Mechanistic Interpretability, Optimization, and Governance.

Cambridge, MA
Joined June 2022
Pinned Tweet
@wesg52
Wes Gurnee
3 months
New paper! "Universal Neurons in GPT2 Language Models" How many neurons are independently meaningful? How many neurons reappear across models with different random inits? Do these neurons specialize into specific functional roles or form feature families? Answers below 🧵:
Tweet media one
6
65
402
@wesg52
Wes Gurnee
7 months
Do language models have an internal world model? A sense of time? At multiple spatiotemporal scales? In a new paper with @tegmark we provide evidence that they do by finding a literal map of the world inside the activations of Llama-2!
183
1K
6K
@wesg52
Wes Gurnee
1 year
Neural nets are often thought of as feature extractors. But what features are neurons in LLMs actually extracting? In our new paper, we leverage sparse probing to find out. A 🧵:
Tweet media one
11
133
738
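A minimal sketch of the kind of sparse probing workflow described above, assuming neuron activations have already been cached to disk. The file names, the binary feature label, and the neuron-ranking heuristic are illustrative stand-ins, not the paper's exact procedure.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

acts = np.load("mlp_acts_layer10.npy")    # assumed cache: [n_tokens, n_neurons]
labels = np.load("is_french_token.npy")   # assumed binary feature labels: [n_tokens]

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)

# Rank neurons by mean activation difference between classes (a simple heuristic),
# then fit probes restricted to the top-k neurons for increasing k.
scores = np.abs(X_tr[y_tr == 1].mean(0) - X_tr[y_tr == 0].mean(0))
for k in (1, 4, 16, 64):
    top = np.argsort(scores)[-k:]
    probe = LogisticRegression(max_iter=1000).fit(X_tr[:, top], y_tr)
    print(f"k={k:3d}  test accuracy = {probe.score(X_te[:, top], y_te):.3f}")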
@wesg52
Wes Gurnee
7 months
For spatial representations, we run Llama-2 models on the names of tens of thousands of cities, structures, and natural landmarks around the world, the USA, and NYC. We then train linear probes on the last token activations to predict the real latitude and longitude of each place.
13
38
548
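A minimal sketch of such a spatial probe, assuming last-token activations and true coordinates have been cached; the file names and ridge penalty are illustrative.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

acts = np.load("place_acts_layer20.npy")   # assumed cache: [n_places, d_model]
coords = np.load("place_lat_lon.npy")      # assumed targets: [n_places, 2] (lat, lon)

X_tr, X_te, y_tr, y_te = train_test_split(acts, coords, test_size=0.2, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)

err = np.abs(probe.predict(X_te) - y_te).mean(0)
print(f"mean abs error: {err[0]:.1f} deg lat, {err[1]:.1f} deg lon; R^2 = {probe.score(X_te, y_te):.2f}")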
@wesg52
Wes Gurnee
7 months
To see all the details and additional validations, check out the Paper: Code and datasets:
Tweet media one
7
36
391
@wesg52
Wes Gurnee
7 months
For temporal representations, we run the models on the names of famous figures from the past 3000 years, the names of songs, movies, and books from 1950 onward, and NYT headlines from the 2010s, and train linear probes to predict the year of death, release date, and publication date.
Tweet media one
3
10
250
@wesg52
Wes Gurnee
7 months
But does the model actually _use_ these representations? By looking for neurons with weights similar to the probe's, we find many space and time neurons that are sensitive to the spacetime coordinates of an entity, showing the model itself learned the global geometry -- not the probe.
Tweet media one
3
12
212
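One simple way to look for such neurons, sketched under the assumption that a single probe direction (say, the latitude direction) and a layer's MLP input weights have been saved; names and shapes are illustrative.

import numpy as np

probe_w = np.load("lat_probe_direction.npy")   # assumed: [d_model]
W_in = np.load("mlp_w_in_layer20.npy")         # assumed: [n_neurons, d_model]

# Cosine similarity between every neuron's input weights and the probe direction.
cos = (W_in @ probe_w) / (np.linalg.norm(W_in, axis=1) * np.linalg.norm(probe_w) + 1e-8)
for i in np.argsort(np.abs(cos))[::-1][:10]:
    print(f"neuron {i:5d}  cosine sim with probe = {cos[i]:+.3f}")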
@wesg52
Wes Gurnee
7 months
When training probes over every layer and model, we find that representations emerge gradually over the early layers before plateauing at around the halfway point. As expected, bigger models are better, but for more obscure datasets (NYC) no model is great.
Tweet media one
1
7
180
@wesg52
Wes Gurnee
7 months
Are these representations actually linear? By comparing the performance of nonlinear MLP probes with linear probes, we find evidence that they are! More complicated probes do not perform any better on the test set.
Tweet media one
3
7
171
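A minimal sketch of that comparison, reusing the assumed cached activations from the spatial probe above; the MLP probe size is illustrative.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

acts = np.load("place_acts_layer20.npy")   # assumed cache: [n_places, d_model]
coords = np.load("place_lat_lon.npy")      # assumed targets: [n_places, 2]
X_tr, X_te, y_tr, y_te = train_test_split(acts, coords, test_size=0.2, random_state=0)

linear = Ridge(alpha=1.0).fit(X_tr, y_tr)
mlp = MLPRegressor(hidden_layer_sizes=(256,), max_iter=500, random_state=0).fit(X_tr, y_tr)
# If the MLP probe does no better on held-out data, the representation is ~linearly decodable.
print(f"linear R^2 = {linear.score(X_te, y_te):.3f}   MLP R^2 = {mlp.score(X_te, y_te):.3f}")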
@wesg52
Wes Gurnee
7 months
Are these representations robust to prompting? Probing on different prompts, we find performance is largely preserved but can be degraded by capitalizing the entity name or prepending random tokens. Also, probing on the trailing period instead of the last token works better for headlines.
Tweet media one
1
6
129
@wesg52
Wes Gurnee
7 months
Finally, special shoutout to @NeelNanda5 for all the feedback on the paper and project!
9
3
129
@wesg52
Wes Gurnee
7 months
A critical part of this project was constructing space and time datasets at multiple spatiotemporal scales with a diversity of entity types (e.g., both cities and natural landmarks).
Tweet media one
1
7
128
@wesg52
Wes Gurnee
1 year
One large family of neurons we find are "context" neurons, which activate only for tokens in a particular context (French, Python code, US patent documents, etc). When we delete these neurons, the loss increases in the relevant context while leaving other contexts unaffected!
Tweet media one
3
13
121
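A rough sketch of this kind of ablation test on GPT-2 small via Hugging Face; the layer, neuron index, and example texts are hypothetical, and the paper's actual setup differs in detail.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")
LAYER, NEURON = 8, 1234   # hypothetical "French context" neuron

def ablate(module, inputs, output):
    output[..., NEURON] = 0.0   # zero one MLP neuron's post-activation value
    return output

def loss_on(text, ablated):
    handle = model.transformer.h[LAYER].mlp.act.register_forward_hook(ablate) if ablated else None
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss.item()
    if handle is not None:
        handle.remove()
    return loss

for name, text in [("French ", "Le chat est assis sur le tapis devant la porte."),
                   ("English", "The cat is sitting on the mat in front of the door.")]:
    print(f"{name} clean: {loss_on(text, False):.3f}  ablated: {loss_on(text, True):.3f}")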
@wesg52
Wes Gurnee
2 years
Thrilled to share my first grad school preprint "Learning Sparse Nonlinear Dynamics via Mixed-Integer Optimization" with @dbertsim. Preprint: Code: Thread: 1/4
3
14
84
@wesg52
Wes Gurnee
1 month
Short research post on a potential issue arising in Sparse Autoencoders (SAEs): the reconstruction errors change model predictions much more than a random error of the same magnitude!
1
7
73
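A toy sketch of the comparison being made. Everything below (the activation, the "SAE reconstruction", and the random linear readout standing in for the rest of the forward pass) is a synthetic stand-in used only to show the procedure; the post's experiment uses a real model and a trained SAE.

import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 768, 50257

x = rng.standard_normal(d_model)                                 # stand-in clean activation
x_hat = x + 0.1 * rng.standard_normal(d_model)                   # stand-in SAE reconstruction
W_U = rng.standard_normal((vocab, d_model)) / np.sqrt(d_model)   # stand-in readout to logits

def logits_fn(a):
    # In the real experiment: resume the model's forward pass from this activation site.
    return W_U @ a

def kl(p_logits, q_logits):
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum())

err = x_hat - x
rand = rng.standard_normal(d_model)
rand *= np.linalg.norm(err) / np.linalg.norm(rand)   # random error of the same magnitude

clean = logits_fn(x)
print("KL from SAE error:   ", kl(clean, logits_fn(x + err)))
print("KL from random error:", kl(clean, logits_fn(x + rand)))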
@wesg52
Wes Gurnee
7 months
@rafaelrmuller @tegmark We have results for PCA and you definitely need more than two PCs. This is because our datasets contain diverse entities (e.g., cities, buildings, natural landmarks), and on inspection the first few PCs seemed to cluster this information.
5
1
44
@wesg52
Wes Gurnee
1 year
But what if there are more features than there are neurons? This results in polysemantic neurons which fire for a large set of unrelated features. Here we show a single early layer neuron which activates for a large collection of unrelated n-grams.
Tweet media one
1
3
41
@wesg52
Wes Gurnee
1 year
That said, more than any specific technical contribution, we hope to contribute to the general sense that ambitious interpretability is possible: that LLMs have a tremendous amount of rich structure that can and should be understood by humans!
1
4
43
@wesg52
Wes Gurnee
1 year
Early layers seem to use sparse combinations of neurons to represent many features in superposition. That is, they use the activations of multiple polysemantic neurons to boost the signal of the true feature over all interfering features (here "social security" vs. adjacent bigrams).
Tweet media one
1
4
40
@wesg52
Wes Gurnee
3 months
See the full paper for all the details. Paper: Code:
Tweet media one
0
3
32
@wesg52
Wes Gurnee
1 year
Results in toy models from @AnthropicAI and @ch402 suggest a potential mechanistic fingerprint of superposition: large MLP weight norms and negative biases. We find a striking drop in early layers in the Pythia models from @AiEleuther and @BlancheMinerva .
Tweet media one
1
3
31
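A minimal sketch of how one might compute this fingerprint, shown on GPT-2 small (chosen only because it is small and public) rather than the Pythia models in the plot.

import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
with torch.no_grad():
    for l, block in enumerate(model.transformer.h):
        W = block.mlp.c_fc.weight   # [d_model, d_mlp]; column i holds neuron i's input weights
        b = block.mlp.c_fc.bias     # [d_mlp]
        print(f"layer {l:2d}  mean ||w_in|| = {W.norm(dim=0).mean().item():.3f}"
              f"   mean bias = {b.mean().item():+.3f}")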
@wesg52
Wes Gurnee
4 months
New version is out (to appear at ICLR)! Main updates:
- Additional experiments on Pythia models
- Causal interventions on space and time neurons
- More related work
- Clarified our claims of a literal world model (static vs. dynamic)
- External replications!
More in thread:
@wesg52
Wes Gurnee
7 months
Do language models have an internal world model? A sense of time? At multiple spatiotemporal scales? In a new paper with @tegmark we provide evidence that they do by finding a literal map of the world inside the activations of Llama-2!
183
1K
6K
1
1
27
@wesg52
Wes Gurnee
1 year
While we found tons of interesting neurons with sparse probing, it requires careful follow-up analysis to draw more rigorous conclusions. E.g., athlete neurons turn out to be more general sports neurons when analyzing the tokens with maximum average activation.
Tweet media one
1
2
25
@wesg52
Wes Gurnee
3 months
We also observe many neuron functional roles, for instance (a) prediction, (b) suppression, and (c) partition neurons, which make coherent predictions about what the next token is (or is not). Suppression neurons reliably follow prediction neurons (bottom).
Tweet media one
2
0
24
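One way to see prediction vs. suppression behavior is to compose a neuron's output weights with the unembedding and inspect which token logits it boosts or suppresses. A sketch on GPT-2 small with a hypothetical layer/neuron index:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2TokenizerFast.from_pretrained("gpt2")
LAYER, NEURON = 10, 42   # hypothetical

with torch.no_grad():
    w_out = model.transformer.h[LAYER].mlp.c_proj.weight[NEURON]   # [d_model]
    logit_effect = model.lm_head.weight @ w_out                    # [vocab] direct logit effect

# Tokens print in GPT-2's byte-level form ('Ġ' marks a leading space).
print("boosted:   ", tok.convert_ids_to_tokens(torch.topk(logit_effect, 10).indices.tolist()))
print("suppressed:", tok.convert_ids_to_tokens(torch.topk(-logit_effect, 10).indices.tolist()))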
@wesg52
Wes Gurnee
3 months
Attention heads can be effectively "turned off" by attending to the BOS token. We find neurons which control how much heads attend to BOS, effectively turning individual heads on or off.
Tweet media one
1
1
25
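A small sketch of the underlying diagnostic: measure how much each head in a layer attends to a prepended BOS token in GPT-2 small. The layer choice and prompt are illustrative, and the neuron interventions themselves are not shown.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

ids = tok(tok.bos_token + "The quick brown fox jumps over the lazy dog.",
          return_tensors="pt").input_ids
with torch.no_grad():
    attn = model(ids, output_attentions=True).attentions   # per-layer [1, n_heads, seq, seq]

LAYER = 5                                                   # illustrative
to_bos = attn[LAYER][0, :, 1:, 0].mean(dim=-1)              # mean attention to position 0, per head
for h, a in enumerate(to_bos.tolist()):
    print(f"head {h:2d}: mean attention to BOS = {a:.2f}")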
@wesg52
Wes Gurnee
3 months
We found a very special pair of high-norm neurons (which exist across all model inits) that do not compose with the unembedding. Instead of changing the probability of any individual token, they change the entropy of the entire distribution by changing its scale!
Tweet media one
1
0
24
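A toy numpy illustration of the claimed effect: uniformly rescaling the logits (roughly what changing the final LayerNorm scale amounts to) changes the entropy of the output distribution without changing the token ranking. The logits here are random stand-ins.

import numpy as np

def entropy(logits):
    p = np.exp(logits - logits.max()); p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

logits = np.random.default_rng(0).standard_normal(50257)
base_order = np.argsort(logits, kind="stable")
for scale in (0.5, 1.0, 2.0):
    scaled = scale * logits
    same_rank = np.array_equal(np.argsort(scaled, kind="stable"), base_order)
    print(f"scale {scale}: entropy = {entropy(scaled):.2f}, ranking unchanged = {same_rank}")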
@wesg52
Wes Gurnee
1 year
Precision and recall can also be helpful guides, and remind us that we should not assume a model will learn to represent features in an ontology convenient or familiar to humans.
Tweet media one
2
1
24
@wesg52
Wes Gurnee
3 months
When we zoom in, many neurons do have relatively clear interpretations! Using several hundred automated tests, we taxonomize the neurons into families, e.g.: unigram, alphabet, previous token, position, syntax, and semantic neurons.
Tweet media one
1
2
23
@wesg52
Wes Gurnee
1 month
Working with Neel was one of the most valuable experiences of my career and I can’t recommend working with him enough! The MATS cohort and program were also great – I think most people interested should definitely apply!
@NeelNanda5
Neel Nanda
1 month
Are you excited about @ch402 -style mechanistic interpretability research? I'm looking for scholars to mentor via MATS - apply by April 12! I'm very impressed by the great work from past scholars, and enjoy mentoring promising mech interp talent. I'm excited for my next cohort!
3
28
188
0
0
22
@wesg52
Wes Gurnee
3 months
After computing maximum pairwise neuron correlations across 5 different models trained from different random inits, we find that (a) only 1-5% of neurons are "universal"; (b) high/low correlation in one model implies high/low correlation in all models; (c) neurons specialize by depth.
Tweet media one
1
1
22
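A minimal sketch of the universality test for one pair of models, assuming activations over a shared token set have been cached; the file names and the 0.5 cutoff are illustrative.

import numpy as np

A = np.load("acts_modelA_layer10.npy")   # assumed cache: [n_tokens, n_neurons]
B = np.load("acts_modelB_layer10.npy")   # assumed cache: [n_tokens, n_neurons]

# Standardize, then get all pairwise Pearson correlations in one matrix product.
A = (A - A.mean(0)) / (A.std(0) + 1e-8)
B = (B - B.mean(0)) / (B.std(0) + 1e-8)
corr = (A.T @ B) / A.shape[0]            # [n_neurons_A, n_neurons_B]
max_corr = np.abs(corr).max(axis=1)      # best match in model B for each neuron in model A

cutoff = 0.5
print(f"fraction of neurons with max |corr| > {cutoff}: {(max_corr > cutoff).mean():.3f}")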
@wesg52
Wes Gurnee
1 year
What happens with scale? We find representational sparsity increases on average, but different features obey different scaling dynamics. In particular, quantization and neuron splitting: features both emerge and split into finer-grained features.
Tweet media one
1
3
20
@wesg52
Wes Gurnee
3 months
What properties do these universal neurons have? They consistently seem to be high-norm and sparsely activating, with bimodal right tails. In other words, what we would expect of monosemantic neurons!
Tweet media one
1
1
20
@wesg52
Wes Gurnee
6 months
Really enjoyed advising this follow-up project on training dynamics of context neurons! I think there is a ton of good research to do at the intersection of interpretability and training dynamics and I hope to see more!
@lucia_quirke
Lucia Quirke
6 months
A mystery in prior work: LLMs contain interpretable neurons that correspond to the language of the text. Some aren't important, but deleting Pythia 70M's German neuron increases loss by 12% on German text. Why? We investigate over training and show it's part of a "second order circuit."
Tweet media one
15
30
354
0
1
16
@wesg52
Wes Gurnee
3 months
There were lots of mysteries we didn't fully understand. One fairly striking example was the relationship between activation frequency and the cosine similarity between a neuron's input and output weights!
Tweet media one
1
0
15
@wesg52
Wes Gurnee
7 months
Awesome new paper by @saprmarks showing the emergence of a truth direction in LLMs!
@saprmarks
Samuel Marks
7 months
Do language models know whether statements are true/false? And if so, what's the best way to "read an LLM's mind"? In a new paper with @tegmark , we explore how LLMs represent truth. 1/N
10
67
307
0
2
12
@wesg52
Wes Gurnee
7 months
@emilymbender Indeed, this is an unfortunate bias in our dataset. These are all places from English Wikipedia, so even coverage of places in countries like France or China is worse than in South Africa or India.
1
0
12
@wesg52
Wes Gurnee
2 years
We believe this extra modeling power will give practitioners unprecedented flexibility in tailoring the learning process to their problem domain and aid in learning dynamics in highly underdetermined settings. 4/4
0
0
5
@wesg52
Wes Gurnee
2 years
This optimality buys consistent statistical gains across many different systems and data regimes while still being very tractable (sometimes even faster than heuristics). Perhaps most exciting is the ability to embed a huge variety of constraints for physics informed ML. 3/4
1
0
4
@wesg52
Wes Gurnee
2 years
We consider the SINDy framework proposed by @eigensteve , Joshua Proctor, and Nathan Kutz to discover governing equations of dynamical systems directly from data. We integrate exact sparse regression techniques to solve the SINDy problem to provable optimality. 2/4
1
0
4
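A toy sketch of the sparse regression at the heart of SINDy: build a library of candidate terms and select the best k-term fit. Brute-force best-subset search is exact here only because the library is tiny; the paper's contribution is solving this selection problem to provable optimality at scale with mixed-integer optimization.

import itertools
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
dxdt = 0.5 * x - 1.5 * x**3 + 0.01 * rng.standard_normal(200)   # synthetic dynamics + noise

library = {"1": np.ones_like(x), "x": x, "x^2": x**2, "x^3": x**3, "sin(x)": np.sin(x)}
names, Theta = list(library), np.column_stack(list(library.values()))

k = 2   # target sparsity
_, support = min(
    (np.linalg.lstsq(Theta[:, list(S)], dxdt, rcond=None)[1].sum(), S)
    for S in itertools.combinations(range(len(names)), k)
)
coef = np.linalg.lstsq(Theta[:, list(support)], dxdt, rcond=None)[0]
print("dx/dt =", " + ".join(f"{c:.2f}*{names[i]}" for c, i in zip(coef, support)))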
@wesg52
Wes Gurnee
4 months
However, there was a recent paper from Chen et al. on "Causal Representations of Space" in LLMs that builds on our work and finds "LLMs learn and use an internal model of space in solving geospatial related tasks."
Tweet media one
1
0
4
@wesg52
Wes Gurnee
11 months
@sarahookr @aahmadian_ @TheyCallMeMr_ @hongyucharlie @ahmetustun89 Cool work! Also relevant is a short paper from @AnthropicAI that suggests Adam is likely to blame for privileging the residual stream of transformers, causing the emergence of outlier features:
1
0
3
@wesg52
Wes Gurnee
4 months
Reviewers (and twitter) were unhappy with our use of the term "world model". We edited the text to clarify we use this term in its static sense -- i.e., that LLMs have a map of time and space, but we don't show this is part of a dynamic model used to solve downstream problems.
Tweet media one
1
0
2
@wesg52
Wes Gurnee
8 months
@maksym_andr @askerlee I think this is an artifact of OPT models being undertrained and using ReLU. I see some but not nearly as many dead neurons in Pythia and GPT2 models
1
0
2
@wesg52
Wes Gurnee
4 months
We reran our main probing sweep experiment with the Pythia models from @AiEleuther . We find clear scaling in model size with a jump between the Pythia and Llama models, likely due to different training data size (300B vs 2T tokens). Scale wins again!
Tweet media one
2
0
2
@wesg52
Wes Gurnee
4 months
See OpenReview for further discussion with reviewers and an annotated PDF of revisions. OpenReview: Arxiv:
Tweet media one
0
0
2
@wesg52
Wes Gurnee
4 months
We also ran a few simple causal intervention/ablation experiments on our space and time neurons. We find we can alter the predicted release year of famous artworks by intervening on time neurons, and that geospatial prompts suffer the highest loss increase under space neuron ablations.
Tweet media one
Tweet media two
1
0
1