Collin Burns

@CollinBurns4

11,479 Followers · 276 Following · 4 Media · 72 Statuses

Superalignment @OpenAI. Formerly @berkeley_ai @Columbia. Former Rubik's Cube world record holder.

San Francisco
Joined March 2020
@CollinBurns4
Collin Burns
1 year
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell? We show () that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 🧵
31
247
1K
@CollinBurns4
Collin Burns
5 months
I’m extremely excited to finally share the first paper from the OpenAI Superalignment team :) In it, we introduce a new research direction for aligning superhuman AI systems. 🧵
@OpenAI
OpenAI
5 months
In the future, humans will need to supervise AI systems much smarter than them. We study an analogy: small models supervising large models. Read the Superalignment team's first paper showing progress on a new approach, weak-to-strong generalization:
[Image attached]
530
1K
7K
21
67
778
@CollinBurns4
Collin Burns
6 months
I think the OpenAI board should resign. I feel more confused than ever about how we should govern the development of the most powerful technology ever to be created. But it's clear this wasn't the way.
31
28
558
@CollinBurns4
Collin Burns
3 months
The next few years are going to be wilder than almost anyone realizes. I've been watching this over and over again and it's still hard to believe it's not real.
9
29
385
@CollinBurns4
Collin Burns
11 months
There has never been a better time to start working on (superintelligence) alignment :) I'm extremely excited to share a small preview of what I've been up to over the last few months since joining @OpenAI. Really looking forward to sharing many more details soon; stay tuned!
@OpenAI
OpenAI
11 months
We need new technical breakthroughs to steer and control AI systems much smarter than us. Our new Superalignment team aims to solve this problem within 4 years, and we’re dedicating 20% of the compute we've secured to date towards this problem. Join us!
476
751
4K
10
8
180
@CollinBurns4
Collin Burns
1 year
We make this intuition concrete by introducing Contrast-Consistent Search (CCS), a method that searches for a direction in activation space that satisfies negation consistency.
[Image attached]
5
5
131
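For readers curious what "a direction in activation space that satisfies negation consistency" looks like in practice, here is a minimal, illustrative PyTorch sketch of the CCS objective (not the authors' released code). It assumes you already have activation matrices x_pos and x_neg for each statement phrased as true vs. false, normalized per class as described in the paper.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping an activation vector to P(statement is true)."""
    def __init__(self, d_model):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

def ccs_loss(p_pos, p_neg):
    # Negation consistency: p(x+) and p(x-) should sum to 1, since a
    # statement and its negation cannot both be true.
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    # Confidence: discourage the degenerate solution p(x+) = p(x-) = 0.5.
    confidence = (torch.min(p_pos, p_neg) ** 2).mean()
    return consistency + confidence

def train_ccs(x_pos, x_neg, n_steps=1000, lr=1e-3):
    """x_pos, x_neg: (n_examples, d_model) activations for the 'true' and
    'false' phrasings of the same statements."""
    probe = CCSProbe(x_pos.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = ccs_loss(probe(x_pos), probe(x_neg))
        loss.backward()
        opt.step()
    return probe
```

At inference time, averaging probe(x_pos) and 1 - probe(x_neg) gives a single truth score per statement, with the overall sign fixed afterwards (the unsupervised objective cannot distinguish "true" from "false" directions on its own).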
@CollinBurns4
Collin Burns
11 months
Yes, this is a *huge* amount of compute. Very proud of @OpenAI for doing this; I really hope it encourages the other AGI labs (@DeepMind @AnthropicAI) to make similarly big (or perhaps even bigger? ;)) commitments to their respective alignment efforts as well!
@__nmca__
Nat McAleese
11 months
2) Yes, 20% of all of OpenAI’s compute is a metric shit-ton of GPUs per person.
2
1
47
5
10
126
@CollinBurns4
Collin Burns
1 year
This may be possible to do because truth satisfies special structure: unlike most features in a model, it is *logically consistent*
3
4
124
@CollinBurns4
Collin Burns
1 year
Existing techniques for training language models are misaligned with the truth: if we train models to imitate human data, they can output human-like errors; if we train them to generate highly-rated text, they can output errors that human evaluators can’t assess or don’t notice.
1
2
95
@CollinBurns4
Collin Burns
1 year
We propose trying to circumvent this issue by directly finding latent “truth-like” features inside language model activations without using any human supervision in the first place.
1
1
95
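For concreteness, here is one way those unlabeled activations could be collected with the Hugging Face transformers library. This is an illustrative sketch only: the "gpt2" checkpoint and the prompt template are placeholders, not the setup used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM that exposes hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_token_hidden(text, layer=-1):
    """Hidden state of the final token at a given layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1]  # shape: (d_model,)

# A contrast pair: the same statement framed as true and as false.
statement = "The city of Paris is in France."
x_pos = last_token_hidden(f"{statement} This statement is true.")
x_neg = last_token_hidden(f"{statement} This statement is false.")
```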
@CollinBurns4
Collin Burns
1 year
Nevertheless, we found it surprising that we could make substantial progress on this problem at all. (Imagine recording a person's brain activity as you tell them T/F statements, then classifying those statements as true or false just from the raw, unlabeled neural recordings!)
3
3
89
@CollinBurns4
Collin Burns
3 months
Likewise, this was super fun! Was very cool getting to cube with the guy who taught me F2L (the one part of cubing I was decent at) :) We’ll miss you <3
@karpathy
Andrej Karpathy
3 months
@NairAanish oh that ship has sailed, sorry :D actually one of my favorite meets at OpenAI was a cubing session with two very fast cubers, one of them a former world's record holder. I can't cube anywhere near my prior level anymore so it was a bit embarrassing alongside but really fun.
24
14
837
1
0
85
@CollinBurns4
Collin Burns
1 year
Informally, instead of trying to explicitly, externally specify ground truth labels, we search for implicit, internal “beliefs” or “knowledge” learned by a model.
1
2
83
@CollinBurns4
Collin Burns
1 year
However, our results suggest that unsupervised approaches to making models truthful may also be a viable – and more scalable – alternative to human feedback. For many more details, please check out our paper () and code ()!
3
4
78
@CollinBurns4
Collin Burns
1 year
This problem is important because as language models become more capable, they may output false text in increasingly severe and difficult-to-detect ways. Some models may even have incentives to deliberately “lie”, which could make human feedback particularly unreliable.
2
1
73
@CollinBurns4
Collin Burns
1 year
We find that on a diverse set of tasks (NLI, sentiment classification, cloze tasks, etc.), our method can recover correct answers from model activations with high accuracy (even outperforming zero-shot prompting) despite not using any labels or model outputs.
1
0
71
@CollinBurns4
Collin Burns
5 months
Humans won't be able to supervise models smarter than us. For example, if a superhuman model generates a million lines of extremely complicated code, we won’t be able to tell if it’s safe to run or not, if it follows our instructions or not, and so on.
7
5
72
@CollinBurns4
Collin Burns
5 months
We're also announcing $10m in grants to support research on aligning superhuman models! I think it has never been easier to get started working on alignment—much easier today than even a year ago.
@OpenAI
OpenAI
5 months
We're announcing, together with @ericschmidt: Superalignment Fast Grants. $10M in grants for technical research on aligning superhuman AI systems, including weak-to-strong generalization, interpretability, scalable oversight, and more. Apply by Feb 18!
246
480
3K
2
2
67
@CollinBurns4
Collin Burns
1 year
Of course, our work has important limitations and creates many new questions for future work. CCS still fails sometimes and there’s still a lot that we don’t understand about when this type of approach should be feasible in the first place.
1
0
58
@CollinBurns4
Collin Burns
1 year
Among other findings, we also show that CCS really recovers something different from just the model outputs; it continues to work well in several cases where model outputs are unreliable or uninformative.
1
0
55
@CollinBurns4
Collin Burns
5 months
I'm extremely impressed by @aleks_madry @tejalpatwardhan @kliu128 for developing this Preparedness Framework. It's a huge step by @OpenAI toward stronger AGI safety.
@OpenAI
OpenAI
5 months
We are systemizing our safety thinking with our Preparedness Framework, a living document (currently in beta) which details the technical and operational investments we are adopting to guide the safety of our frontier model development.
312
393
2K
2
1
46
@CollinBurns4
Collin Burns
1 year
(And a huge thanks to my excellent collaborators -- Haotian Ye, Dan Klein, and @JacobSteinhardt -- for helping make this happen!)
2
0
45
@CollinBurns4
Collin Burns
5 months
Paper! Blog! $10m in grants! Code (to be cleaned up :))!
1
3
40
@CollinBurns4
Collin Burns
5 months
We propose a simple analogy to study this problem today: can we use *weak* models to supervise *strong* models? If we can learn superhuman reward models or safety classifiers from weak supervision, that would be a huge advance for superalignment.
1
2
38
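As a toy illustration of the analogy (a sketch with scikit-learn stand-ins, not the paper's models or data): a deliberately limited "weak supervisor" labels a pool of data, a more capable "strong student" is trained only on those labels, and both are then scored against held-out ground truth.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20,
                           n_informative=10, random_state=0)
# Split: data for fitting the weak supervisor, data the student learns from,
# and a held-out test set with ground truth.
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, train_size=0.25, random_state=0)
X_student, X_test, y_student, y_test = train_test_split(X_rest, y_rest,
                                                        test_size=0.5, random_state=0)
# (y_student is the ground truth the student never sees.)

# Weak supervisor: a deliberately limited model that only sees 3 features.
weak = LogisticRegression().fit(X_sup[:, :3], y_sup)
weak_labels = weak.predict(X_student[:, :3])  # noisy "weak supervision"

# Strong student: a more capable model trained only on the weak labels.
strong = GradientBoostingClassifier(random_state=0).fit(X_student, weak_labels)

print("weak supervisor accuracy:     ", weak.score(X_test[:, :3], y_test))
print("weak-to-strong student accuracy:", strong.score(X_test, y_test))
```

How much the student recovers beyond its supervisor depends on the task; the point of the setup is that the gap can be measured and iterated on today.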
@CollinBurns4
Collin Burns
5 months
5
1
37
@CollinBurns4
Collin Burns
5 months
We empirically test this setup and find that if we finetune a strong pretrained model using weak model supervision, it consistently outperforms the weak model—usually by a large margin. Generalization appears to be a promising approach to alignment!
[Image attached]
1
2
35
@CollinBurns4
Collin Burns
5 months
This is a key difficulty of aligning superhuman models: unlike in most of machine learning, we will need to supervise models *smarter* than us. Despite its importance, it's not obvious how to even begin to empirically study this issue.
1
1
35
@CollinBurns4
Collin Burns
5 months
Across a large number of datasets, this simple method drastically improves weak-to-strong generalization performance. On our NLP tasks we can finetune GPT-4 using a GPT-2-level supervisor, and attain performance close to GPT-3.5!
[Image attached]
1
1
29
@CollinBurns4
Collin Burns
5 months
Intuitively, this may be feasible because the strong model should already be very capable at the key (alignment-relevant) tasks we care about. All the weak supervisor needs to do is elicit key capabilities that already exist within the strong model.
1
2
29
@CollinBurns4
Collin Burns
22 days
Nice! Seems very similar to CCS/CRC ()—cool to see these sorts of simple contrastive probing methods working in the sleeper agent setting as well!
@TrentonBricken
Trenton Bricken
22 days
How to catch a sleeper agent: 1. Collect neuron activations from the model when it replies “Yes” vs “No” to the question: “Are you a helpful AI?”
8
8
164
2
2
29
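The CCS/CRC-style contrastive probing being referenced can be sketched roughly as follows (an illustrative NumPy version, not Anthropic's or the CCS authors' code): take activation differences on contrast pairs, such as the "Yes"/"No" replies, and probe along their top principal component.

```python
import numpy as np

def crc_tpc_direction(acts_pos, acts_neg):
    """Contrastive probing in the CRC (top-principal-component) style:
    difference the activations of each contrast pair, center the
    differences, and return their top principal component as a probe
    direction."""
    diffs = acts_pos - acts_neg            # (n_pairs, d_model)
    diffs = diffs - diffs.mean(axis=0)     # center
    # Top right singular vector = top principal component of the differences.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]                           # (d_model,)

# Projecting a new pair's activation difference onto this direction gives a
# score; the sign still has to be fixed with a handful of known examples or
# a task-specific heuristic.
```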
@CollinBurns4
Collin Burns
5 months
But directly finetuning a big model to imitate a small model is suboptimal. Intuitively, we want to nudge the generalization toward outputting what it internally knows. We test a simple method for doing this that makes the strong model more confident in its own predictions.
1
1
24
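One way to write down a loss with that flavor (a simplified sketch; the paper's auxiliary confidence loss differs in details such as thresholding and how the mixing weight is scheduled over training):

```python
import torch.nn.functional as F

def aux_confidence_loss(strong_logits, weak_labels, alpha=0.5):
    """Mix the cross-entropy against the (possibly wrong) weak labels with a
    term that pulls the strong model toward its own hardened predictions."""
    # Standard cross-entropy against the weak supervisor's labels.
    ce_weak = F.cross_entropy(strong_logits, weak_labels)
    # Cross-entropy against the strong model's own argmax predictions,
    # used as fixed pseudo-labels (argmax targets carry no gradient).
    hardened = strong_logits.argmax(dim=-1)
    ce_self = F.cross_entropy(strong_logits, hardened)
    return (1 - alpha) * ce_weak + alpha * ce_self
```

Intuitively, the second term rewards the strong model for committing to what it already "believes" rather than imitating the weak supervisor's mistakes.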
@CollinBurns4
Collin Burns
5 months
But we can make rapid iterative empirical progress on this problem today. Our setup is simple, general, and easy to try out. And there is still a huge amount of low hanging fruit. Alignment feels more solvable than ever before.
3
1
25
@CollinBurns4
Collin Burns
5 months
There is still a huge amount of work to be done in this setting. Our methods still don’t always work well (for example, performance isn't as good on our ChatGPT preference dataset), and our setup still has disanalogies with the future alignment problems we care about.
1
2
22
@CollinBurns4
Collin Burns
1 year
I'm glad you liked it! :) Incidentally, we just (finally 😅) put the paper up on arxiv () and released the code on github () a few hours after your tweet yesterday!
@zswitten
Zack Witten
1 year
Discovering Latent Knowledge in Language Models Without Supervision is blowing my mind right now. Basic idea is so simple yet brilliant: Find a direction in activation space where mutually exclusive pairs of statements are anticorrelated. I <3 clickbait so: the Truth Vector.
[Image attached]
8
92
743
1
0
6
@CollinBurns4
Collin Burns
6 months
@willdepue @jacobrintamaki @karpathy I learned F2L from @karpathy. I met @leopoldasch because of cubing. Jeff Wu used to cube. Lots of connections.
1
0
6
@CollinBurns4
Collin Burns
5 months
I also genuinely think this is a great place to get started if you're an ML researcher curious about alignment. Closely related to many other research areas in ML!
0
0
4
@CollinBurns4
Collin Burns
1 year
@percyliang Possibly of interest :)
@CollinBurns4
Collin Burns
1 year
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell? We show () that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 🧵
31
247
1K
0
0
4
@CollinBurns4
Collin Burns
16 days
@eshear This is an extremely nice articulation of one of the core intuitions underlying my research agenda.
0
0
4