Collin Burns

@CollinBurns4

11,479 Followers · 276 Following · 4 Media · 72 Statuses

Superalignment @OpenAI. Formerly @berkeley_ai @Columbia. Former Rubik's Cube world record holder.

San Francisco
Joined March 2020
@CollinBurns4
Collin Burns
1 year
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell? We show () that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 🧵
31
247
1K
@CollinBurns4
Collin Burns
5 months
I’m extremely excited to finally share the first paper from the OpenAI Superalignment team :) In it, we introduce a new research direction for aligning superhuman AI systems. 🧵
@OpenAI
OpenAI
5 months
In the future, humans will need to supervise AI systems much smarter than them. We study an analogy: small models supervising large models. Read the Superalignment team's first paper showing progress on a new approach, weak-to-strong generalization:
[Image attached]
530
1K
7K
21
67
778
@CollinBurns4
Collin Burns
6 months
I think the OpenAI board should resign. I feel more confused than ever about how we should govern the development of the most powerful technology ever to be created. But it's clear this wasn't the way.
31
28
558
@CollinBurns4
Collin Burns
3 months
The next few years are going to be wilder than almost anyone realizes. I've been watching this over and over again and it's still hard to believe it's not real.
9
29
385
@CollinBurns4
Collin Burns
11 months
There has never been a better time to start working on (superintelligence) alignment :) I'm extremely excited to share a small preview of what I've been up to over the last few months since joining @OpenAI. Really looking forward to sharing many more details soon; stay tuned!
@OpenAI
OpenAI
11 months
We need new technical breakthroughs to steer and control AI systems much smarter than us. Our new Superalignment team aims to solve this problem within 4 years, and we’re dedicating 20% of the compute we've secured to date towards this problem. Join us!
476
751
4K
10
8
180
@CollinBurns4
Collin Burns
1 year
We make this intuition concrete by introducing Contrast-Consistent Search (CCS), a method that searches for a direction in activation space that satisfies negation consistency.
[Image attached]
5
5
131
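For readers curious what "a direction in activation space that satisfies negation consistency" looks like in practice, here is a minimal, illustrative PyTorch sketch of the CCS objective (not the authors' released code). It assumes you already have activation matrices x_pos and x_neg for each statement phrased as true vs. false, normalized per class as described in the paper.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping an activation vector to P(statement is true)."""
    def __init__(self, d_model):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

def ccs_loss(p_pos, p_neg):
    # Negation consistency: p(x+) and p(x-) should sum to 1, since a
    # statement and its negation cannot both be true.
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    # Confidence: discourage the degenerate solution p(x+) = p(x-) = 0.5.
    confidence = (torch.min(p_pos, p_neg) ** 2).mean()
    return consistency + confidence

def train_ccs(x_pos, x_neg, n_steps=1000, lr=1e-3):
    """x_pos, x_neg: (n_examples, d_model) activations for the 'true' and
    'false' phrasings of the same statements."""
    probe = CCSProbe(x_pos.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = ccs_loss(probe(x_pos), probe(x_neg))
        loss.backward()
        opt.step()
    return probe
```

At inference time, averaging probe(x_pos) and 1 - probe(x_neg) gives a single truth score per statement, with the overall sign fixed afterwards (the unsupervised objective cannot distinguish "true" from "false" directions on its own).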
@CollinBurns4
Collin Burns
11 months
Yes, this is a *huge* amount of compute. Very proud of @OpenAI for doing this; I really hope it encourages the other AGI labs (@DeepMind @AnthropicAI) to make similarly big (or perhaps even bigger? ;)) commitments to their respective alignment efforts as well!
@__nmca__
Nat McAleese
11 months
2) Yes, 20% of all of OpenAI’s compute is a metric shit-ton of GPUs per person.
2
1
47
5
10
126
@CollinBurns4
Collin Burns
1 year
This may be possible to do because truth satisfies special structure: unlike most features in a model, it is *logically consistent*
3
4
124
@CollinBurns4
Collin Burns
1 year
Existing techniques for training language models are misaligned with the truth: if we train models to imitate human data, they can output human-like errors; if we train them to generate highly-rated text, they can output errors that human evaluators can’t assess or don’t notice.
1
2
95
@CollinBurns4
Collin Burns
1 year
We propose trying to circumvent this issue by directly finding latent “truth-like” features inside language model activations without using any human supervision in the first place.
1
1
95
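For concreteness, here is one way those unlabeled activations could be collected with the Hugging Face transformers library. This is an illustrative sketch only: the "gpt2" checkpoint and the prompt template are placeholders, not the setup used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM that exposes hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_token_hidden(text, layer=-1):
    """Hidden state of the final token at a given layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1]  # shape: (d_model,)

# A contrast pair: the same statement framed as true and as false.
statement = "The city of Paris is in France."
x_pos = last_token_hidden(f"{statement} This statement is true.")
x_neg = last_token_hidden(f"{statement} This statement is false.")
```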
@CollinBurns4
Collin Burns
1 year
Nevertheless, we found it surprising that we could make substantial progress on this problem at all. (Imagine recording a person's brain activity as you tell them T/F statements, then classifying those statements as true or false just from the raw, unlabeled neural recordings!)
3
3
89
@CollinBurns4
Collin Burns
3 months
Likewise, this was super fun! Was very cool getting to cube with the guy who taught me F2L (the one part of cubing I was decent at) :) We’ll miss you <3
@karpathy
Andrej Karpathy
3 months
@NairAanish oh that ship has sailed, sorry :D actually one of my favorite meets at OpenAI was a cubing session with two very fast cubers, one of them a former world's record holder. I can't cube anywhere near my prior level anymore so it was a bit embarrassing alongside but really fun.
24
14
837
1
0
85
@CollinBurns4
Collin Burns
1 year
Informally, instead of trying to explicitly, externally specify ground truth labels, we search for implicit, internal “beliefs” or “knowledge” learned by a model.
1
2
83
@CollinBurns4
Collin Burns
1 year
However, our results suggest that unsupervised approaches to making models truthful may also be a viable – and more scalable – alternative to human feedback. For many more details, please check out our paper () and code ()!
3
4
78
@CollinBurns4
Collin Burns
1 year
This problem is important because as language models become more capable, they may output false text in increasingly severe and difficult-to-detect ways. Some models may even have incentives to deliberately “lie”, which could make human feedback particularly unreliable.
2
1
73
@CollinBurns4
Collin Burns
1 year
We find that on a diverse set of tasks (NLI, sentiment classification, cloze tasks, etc.), our method can recover correct answers from model activations with high accuracy (even outperforming zero-shot prompting) despite not using any labels or model outputs.
1
0
71
@CollinBurns4
Collin Burns
5 months
Humans won't be able to supervise models smarter than us. For example, if a superhuman model generates a million lines of extremely complicated code, we won’t be able to tell if it’s safe to run or not, if it follows our instructions or not, and so on.
7
5
72
@CollinBurns4
Collin Burns
5 months
We're also announcing $10m in grants to support research on aligning superhuman models! I think it has never been easier to get started working on alignment—much easier today than even a year ago.
@OpenAI
OpenAI
5 months
We're announcing, together with @ericschmidt: Superalignment Fast Grants. $10M in grants for technical research on aligning superhuman AI systems, including weak-to-strong generalization, interpretability, scalable oversight, and more. Apply by Feb 18!
246
480
3K
2
2
67
@CollinBurns4
Collin Burns
1 year
Of course, our work has important limitations and creates many new questions for future work. CCS still fails sometimes and there’s still a lot that we don’t understand about when this type of approach should be feasible in the first place.
1
0
58
@CollinBurns4
Collin Burns
1 year
Among other findings, we also show that CCS really recovers something different from just the model outputs; it continues to work well in several cases where model outputs are unreliable or uninformative.
1
0
55
@CollinBurns4
Collin Burns
5 months
I'm extremely impressed by @aleks_madry @tejalpatwardhan @kliu128 for developing this Preparedness Framework. It's a huge step by @OpenAI toward stronger AGI safety.
@OpenAI
OpenAI
5 months
We are systemizing our safety thinking with our Preparedness Framework, a living document (currently in beta) which details the technical and operational investments we are adopting to guide the safety of our frontier model development.
312
393
2K
2
1
46
@CollinBurns4
Collin Burns
1 year
(And a huge thanks to my excellent collaborators -- Haotian Ye, Dan Klein, and @JacobSteinhardt -- for helping make this happen!)
2
0
45
@CollinBurns4
Collin Burns
5 months
Paper! Blog! $10m in grants! Code (to be cleaned up :))!
1
3
40
@CollinBurns4
Collin Burns
5 months
We propose a simple analogy to study this problem today: can we use *weak* models to supervise *strong* models? If we can learn superhuman reward models or safety classifiers from weak supervision, that would be a huge advance for superalignment.
1
2
38
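As a toy illustration of the analogy (a sketch with scikit-learn stand-ins, not the paper's models or data): a deliberately limited "weak supervisor" labels a pool of data, a more capable "strong student" is trained only on those labels, and both are then scored against held-out ground truth.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20,
                           n_informative=10, random_state=0)
# Split: data for fitting the weak supervisor, data the student learns from,
# and a held-out test set with ground truth.
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, train_size=0.25, random_state=0)
X_student, X_test, y_student, y_test = train_test_split(X_rest, y_rest,
                                                        test_size=0.5, random_state=0)
# (y_student is the ground truth the student never sees.)

# Weak supervisor: a deliberately limited model that only sees 3 features.
weak = LogisticRegression().fit(X_sup[:, :3], y_sup)
weak_labels = weak.predict(X_student[:, :3])  # noisy "weak supervision"

# Strong student: a more capable model trained only on the weak labels.
strong = GradientBoostingClassifier(random_state=0).fit(X_student, weak_labels)

print("weak supervisor accuracy:     ", weak.score(X_test[:, :3], y_test))
print("weak-to-strong student accuracy:", strong.score(X_test, y_test))
```

How much the student recovers beyond its supervisor depends on the task; the point of the setup is that the gap can be measured and iterated on today.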
@CollinBurns4
Collin Burns
5 months
5
1
37
@CollinBurns4
Collin Burns
5 months
We empirically test this setup and find that if we finetune a strong pretrained model using weak model supervision, it consistently outperforms the weak model—usually by a large margin. Generalization appears to be a promising approach to alignment!
[Image attached]
1
2
35
@CollinBurns4
Collin Burns
5 months
This is a key difficulty of aligning superhuman models: unlike in most of machine learning, we will need to supervise models *smarter* than us. Despite its importance, it's not obvious how to even begin to empirically study this issue.
1
1
35
@CollinBurns4
Collin Burns
5 months
Across a large number of datasets, this simple method drastically improves weak-to-strong generalization performance. On our NLP tasks we can finetune GPT-4 using a GPT-2-level supervisor, and attain performance close to GPT-3.5!
[Image attached]
1
1
29
@CollinBurns4
Collin Burns
5 months
Intuitively, this may be feasible because the strong model should already be very capable at the key (alignment-relevant) tasks we care about. All the weak supervisor needs to do is elicit key capabilities that already exist within the strong model.
1
2
29
@CollinBurns4
Collin Burns
22 days
Nice! Seems very similar to CCS/CRC ()—cool to see these sorts of simple contrastive probing methods working in the sleeper agent setting as well!
@TrentonBricken
Trenton Bricken
22 days
How to catch a sleeper agent: 1. Collect neuron activations from the model when it replies “Yes” vs “No” to the question: “Are you a helpful AI?”
8
8
164
2
2
29
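The CCS/CRC-style contrastive probing being referenced can be sketched roughly as follows (an illustrative NumPy version, not Anthropic's or the CCS authors' code): take activation differences on contrast pairs, such as the "Yes"/"No" replies, and probe along their top principal component.

```python
import numpy as np

def crc_tpc_direction(acts_pos, acts_neg):
    """Contrastive probing in the CRC (top-principal-component) style:
    difference the activations of each contrast pair, center the
    differences, and return their top principal component as a probe
    direction."""
    diffs = acts_pos - acts_neg            # (n_pairs, d_model)
    diffs = diffs - diffs.mean(axis=0)     # center
    # Top right singular vector = top principal component of the differences.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]                           # (d_model,)

# Projecting a new pair's activation difference onto this direction gives a
# score; the sign still has to be fixed with a handful of known examples or
# a task-specific heuristic.
```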
@CollinBurns4
Collin Burns
5 months
But directly finetuning a big model to imitate a small model is suboptimal. Intuitively, we want to nudge the generalization toward outputting what it internally knows. We test a simple method for doing this that makes the strong model more confident in its own predictions.
1
1
24
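One way to write down a loss with that flavor (a simplified sketch; the paper's auxiliary confidence loss differs in details such as thresholding and how the mixing weight is scheduled over training):

```python
import torch.nn.functional as F

def aux_confidence_loss(strong_logits, weak_labels, alpha=0.5):
    """Mix the cross-entropy against the (possibly wrong) weak labels with a
    term that pulls the strong model toward its own hardened predictions."""
    # Standard cross-entropy against the weak supervisor's labels.
    ce_weak = F.cross_entropy(strong_logits, weak_labels)
    # Cross-entropy against the strong model's own argmax predictions,
    # used as fixed pseudo-labels (argmax targets carry no gradient).
    hardened = strong_logits.argmax(dim=-1)
    ce_self = F.cross_entropy(strong_logits, hardened)
    return (1 - alpha) * ce_weak + alpha * ce_self
```

Intuitively, the second term rewards the strong model for committing to what it already "believes" rather than imitating the weak supervisor's mistakes.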
@CollinBurns4
Collin Burns
5 months
But we can make rapid iterative empirical progress on this problem today. Our setup is simple, general, and easy to try out. And there is still a huge amount of low hanging fruit. Alignment feels more solvable than ever before.
3
1
25
@CollinBurns4
Collin Burns
5 months
There is still a huge amount of work to be done in this setting. Our methods still don’t always work well (for example, performance isn't as good on our ChatGPT preference dataset), and our setup still has disanalogies with the future alignment problems we care about.
1
2
22
@CollinBurns4
Collin Burns
1 year
I'm glad you liked it! :) Incidentally, we just (finally 😅) put the paper up on arxiv () and released the code on github () a few hours after your tweet yesterday!
@zswitten
Zack Witten
1 year
Discovering Latent Knowledge in Language Models Without Supervision is blowing my mind right now. Basic idea is so simple yet brilliant: Find a direction in activation space where mutually exclusive pairs of statements are anticorrelated. I <3 clickbait so: the Truth Vector.
[Image attached]
8
92
743
1
0
6
@CollinBurns4
Collin Burns
6 months
@willdepue @jacobrintamaki @karpathy I learned F2L from @karpathy. I met @leopoldasch because of cubing. Jeff Wu used to cube. Lots of connections.
1
0
6
@CollinBurns4
Collin Burns
5 months
I also genuinely think this is a great place to get started if you're an ML researcher curious about alignment. Closely related to many other research areas in ML!
0
0
4
@CollinBurns4
Collin Burns
1 year
@percyliang Possibly of interest :)
@CollinBurns4
Collin Burns
1 year
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell? We show () that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 🧵
31
247
1K
0
0
4
@CollinBurns4
Collin Burns
16 days
@eshear This is an extremely nice articulation of one of the core intuitions underlying my research agenda.
0
0
4