Jan Leike @janleike Twitter profile | Pikagi

Pikagi

Jan Leike

@janleike

43,920

Followers

322

Following

19

Media

532

Statuses

ML Researcher, co-leading Superalignment @OpenAI . Optimizing for a post-AGI future where humanity flourishes.

San Francisco, USA

https://t.co/Uvp4pU8R0f

Joined March 2016

Don't wanna be here? Send us removal request.

Pinned Tweet

@janleike

Jan Leike

10 months

Our new goal is to solve alignment of superintelligence within the next 4 years. OpenAI is committing 20% of its compute to date towards this goal. Join us in researching how to best spend this compute to solve the problem!

Tweet card media

Introducing Superalignment

We need scientific and technical breakthroughs to steer and control AI systems much smarter than us. To solve this problem within four years, we’re starting a new team, co-led by Ilya Sutskever and...

115

182

1K

Last Seen Profiles

@kissmyspaz

@Mr_Seasons_

@WichitaState

@rurimatsumura

@BakchosGlass

@Xgames3502

@chapfnbr

@Staytonankrom

@ashirotyan

@Hey_hey_hey_1

@bwats

@BhawkTommyHawk

@IrishTim5

@takaSPKHokkaido

@mixuew

@cryptic_tits

@LCNAU

@mcgrathbrains

@najahii1

@CrystabelKelsey

@PeteButtigieg

@_jumpingforjoy_

@mo_kolade

@AlexisMalavolt

@TheRealBerghaus

@MariaritaTesto1

@jessycal__

@nin_ebooks

@GinaWatchesTV

@cakes_kue

@HUNTERS0RA

@limpbizkit

@cevanscchs

@thiicksnoww

@PurpleM8n

@solochicktravel

@janleike

Jan Leike

5 months

I think the OpenAI board should resign

98

200

3K

@janleike

Jan Leike

1 year

With the InstructGPT paper we found that our models generalized to follow instructions in non-English even though we almost exclusively trained on English. We still don't know why. I wish someone would figure this out.

160

347

3K

@janleike

Jan Leike

5 months

I have been working all weekend with the OpenAI leadership team to help with this crisis

104

73

3K

@janleike

Jan Leike

5 months

Super excited about our new research direction for aligning smarter-than-human AI: We finetune large models to generalize from weak supervision—using small models instead of humans as weak supervisors. Check out our new paper:

Tweet media one

75

321

2K

@janleike

Jan Leike

5 months

I think the OpenAI board should resign

37

96

2K

@janleike

Jan Leike

2 months

Today we're releasing a tool we've been using internally to analyze transformer internals - the Transformer Debugger! It combines both automated interpretability and sparse autoencoders, and it allows rapid exploration of models without writing code.

18

194

2K

@janleike

Jan Leike

6 months

The names for "precision" and "recall" seem so unintuitive to me, I have probably opened the Wikipedia article for them dozens of times. Does anyone know a good mnemonic for them?

125

28

1K

@janleike

Jan Leike

1 year

Really exciting new work on automated interpretability: We ask GPT-4 to explain firing patterns for individual neurons in LLMs and score those explanations.

Tweet card media

Language models can explain neurons in language models

We use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and...

28

233

1K

@janleike

Jan Leike

1 year

Before we scramble to deeply integrate LLMs everywhere in the economy, can we pause and think whether it is wise to do so? This is quite immature technology and we don't understand how it works. If we're not careful we're setting ourselves up for a lot of correlated failures.

112

154

1K

@janleike

Jan Leike

4 months

humans built machines that talk to us like people do and everyone acts like this is normal now. it's pretty nuts

49

100

1K

@janleike

Jan Leike

4 months

I'm very excited that today OpenAI adopts its new preparedness framework! This framework spells out our strategy for measuring and forecasting risks, and our commitments to stop deployment and development if safety mitigations are ever lagging behind.

Tweet card media

The study of frontier AI risks has fallen far short of what is possible and where we need to be. To address this gap and systematize our safety thinking, we are adopting the initial version of our...

65

137

972

@janleike

Jan Leike

2 years

Extremely exciting alignment research milestone: Using reinforcement learning from human feedback, we've trained GPT-3 to be much better at following human intentions.

Tweet card media

Aligning language models to follow instructions

We’ve trained language models that are much better at following user intentions than GPT-3 while also making them more truthful and less toxic, using techniques developed through our alignment...

10

141

895

@janleike

Jan Leike

1 year

Reinforcement learning from human feedback won't scale. It fundamentally assumes that humans can evaluate what the AI system is doing. This will not be true once AI becomes smarter than humans.

59

86

866

@janleike

Jan Leike

2 years

This is one of the craziest plots I have ever seen. World GDP follows a power law that holds over many orders of magnitude and extrapolates to infinity (!) by 2047. Clearly this trend can't continue forever. But whatever happens, the next 25 years are going to be pretty nuts.

Tweet media one

90

99

852

@janleike

Jan Leike

1 year

This is your periodic reminder that aligning smarter-than-human AI systems with human values is an open research problem.

56

85

704

@janleike

Jan Leike

2 years

How will we solve the alignment problem for AGI? I've been working on this question for almost 10 years now. Our current path is very promising: 1/

Tweet card media

Our approach to alignment research

We are improving our AI systems’ ability to learn from human feedback and to assist humans at evaluating AI. Our goal is to build a sufficiently aligned AI system that can help us solve all other...

42

90

615

@janleike

Jan Leike

3 years

Last week I joined @OpenAI to lead their alignment effort. Very exicited to be part of the team!

12

12

484

@janleike

Jan Leike

1 year

Check out OpenAI's new text-davinci-003! Same underlying model as text-davinci-002 but more aligned. Would love to hear feedback about it!

47

46

461

@janleike

Jan Leike

2 years

This is the most important plot of alignment lore: Whenever you optimize a proxy, you make progress on your true objective for a while. At some point you start overoptimizing and do worse on your true objective (hard to know when). This applies to all proxy measures ever.

Tweet media one

15

60

479

@janleike

Jan Leike

1 year

Web4 is when the internet you're browsing is just sampled from a language model

26

35

465

@janleike

Jan Leike

2 months

This is still an early stage research tool, but we are releasing to let others play with and build on it! Check it out:

Tweet card media

GitHub - openai/transformer-debugger

Contribute to openai/transformer-debugger development by creating an account on GitHub.

7

65

452

@janleike

Jan Leike

5 months

We're distributing $1e7 in grants for research on making superhuman models safer and more aligned. If you've always wanted to work on this, now is your time! Apply by Feb 18:

Tweet card media

Superalignment Fast Grants

We’re launching $10M in grants to support technical research towards the alignment and safety of superhuman AI systems, including weak-to-strong generalization, interpretability, scalable oversight,...

19

60

444

@janleike

Jan Leike

1 year

I fondly remember the days when people were arguing intensely whether AI is bee level or rat level.

23

21

429

@janleike

Jan Leike

9 months

An important test for humanity will be whether we can collectively decide not to open source LLMs that can reliably survive and spread on their own. Once spreading, LLMs will get up to all kinds of crime, it'll be hard to catch all copies, and we'll fight over who's responsible

188

55

409

@janleike

Jan Leike

3 years

We're hiring research engineers for alignment work at @OpenAI ! If you're excited about finetuning gpt3-sized language models to be better at following human intentions, then this is for you! Apply here:

6

76

357

@janleike

Jan Leike

2 years

Really looking forward to working with the legendary Scott Aaronson!

Tweet card media

I have some exciting news (for me, anyway). Starting next week, I’ll be going on leave from UT Austin for one year, to work at OpenAI. They’re the creators of the astonishing GPT-3 and …

scottaaronson.blog

11

32

345

@janleike

Jan Leike

8 months

Jailbreaking LLMs through input images might end up being a nasty problem. It's likely much harder to defend against than text jailbreaks because it's a continuous space. Despite a decade of research we don't know how to make vision models adversarially robust.

38

40

335

@janleike

Jan Leike

1 year

The alignment problem is very tractable. We haven't figured out how to solve it yet, but with focus and dedication we will.

63

31

310

@janleike

Jan Leike

11 months

Really interesting result on using LLMs to do math: Supervising every step works better than only checking the answer. Some thoughts how this matters for alignment 👇

Tweet card media

Improving mathematical reasoning with process supervision

We've trained a model to achieve a new state-of-the-art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) instead of simply rewarding the correct...

15

55

309

@janleike

Jan Leike

1 year

GPT-4 is safer and more aligned than any other OpenAI has deployed before. Yet it's not perfect. There is still a lot to do to improve safety and we're planning to make updates over the coming months. Huge congrats to the team on all the progress! 🎉

20

19

289

@janleike

Jan Leike

1 year

It's been heartening to see so many more people lately starting to take existential risk from AI seriously and speaking up about it. It's a first step towards solving the problem.

24

23

280

@janleike

Jan Leike

4 years

Today was my last day at @DeepMind . It's been an amazing journey; I've learned so many things and got to work with so many amazing people! Excited for what comes next!

16

11

278

@janleike

Jan Leike

2 years

Super exciting new research milestone on alignment: We trained language models to assist human feedback! Our models help humans find 50% more flaws in summaries than they would have found unassisted.

10

48

275

@janleike

Jan Leike

7 months

If you're into practical alignment, consider applying to @lilianweng 's team. They're building some really exciting stuff: - Automatically extract intent from a fine-tuning dataset - Make models robust to jailbreaks - Detect & mitigate harmful use - ...

Tweet card media

Lilian Weng on LinkedIn: Research Scientist, Safety | 11 comments

My team, Safety Systems, is working on the practical side of alignment at OpenAI: Building systems to enable safe deployment of powerful AI models. Our work… | 11 comments on LinkedIn

www.linkedin.com

13

32

254

@janleike

Jan Leike

19 days

The superalignment fast grants are now decided! We got a *ton* of really strong applications, so unfortunately we had to say no to many we're very excited about. There is still so much good research waiting to be funded. Congrats to all recipients!

13

14

242

@janleike

Jan Leike

9 months

Great conversation with @robertwiblin on how alignment is one of the most interesting ML problems, what the Superalignment Team is working on, what roles we're hiring for, what's needed to reach an awesome future, and much more 👇 Check it out 👇

Tweet card media

Jan Leike on OpenAI's massive push to make superintelligence safe in 4 years or less

"...the vast power of superintelligence could be very dangerous... currently, we don't have a solution for steering a potentially superintelligent AI..."

15

39

229

@janleike

Jan Leike

6 months

@ml_anon22 True, but you can remember them using this picture

Tweet media one

1

11

229

@janleike

Jan Leike

1 year

New blog post on why I'm excited about OpenAI's approach to alignment, including some responses to common objections:

Tweet card media

Why I’m optimistic about our alignment approach

Some arguments in favor and responses to common objections

aligned.substack.com

9

26

211

@janleike

Jan Leike

2 years

Every organization attempting to build AGI should be transparent about their alignment plans.

12

16

205

@janleike

Jan Leike

5 years

The agent alignment problem may be one of the biggest obstacles for using ML to improve people’s lives. Today I’m very excited to share a research direction for how we’ll aim to solve alignment at @DeepMindAI . Blog post: Paper:

Tweet media one

4

38

201

@janleike

Jan Leike

6 months

@alexeyguzey @SimonLermenAI We'll have some evidence to share soon

7

9

191

@janleike

Jan Leike

1 year

If AI ever goes rogue, just remember to make yourself really tall. It will be intimidated and leave you alone.

23

22

180

@janleike

Jan Leike

5 years

How do we uncover failures in ML models that occur too rarely during testing? How do we prove their absence? Very excited about the work by @DeepMindAI ’s Robust & Verified AI team that sheds light on these questions! Check out their blog post:

0

49

175

@janleike

Jan Leike

2 years

RSA was published 45 years ago and yet the universally accepted way to digitally sign a document is to make an indecipherable squiggle on a touch screen that no one ever checks.

3

8

175

@janleike

Jan Leike

1 year

New documentation on language models used in OpenAI's research is up, including some more info on different InstructGPT variants:

Tweet card media

OpenAI Platform

Explore developer resources, tutorials, API docs, and dynamic examples to get the most out of OpenAI's platform.

platform.openai.com

6

28

177

@janleike

Jan Leike

1 year

Everyone has a right to know whether they are interacting with a human or AI. Language models like ChatGPT are good at posing as humans. So we trained a classifier to distinguish between AI-written and human-written text. But it's not fully reliable.

23

20

157

@janleike

Jan Leike

1 year

Y'all should stop using logprob-based evals for language models. I.e. don't craft two reference responses and calculate logP(good response | prompt) - logP(bad response | prompt). This wouldn't actually measure what you care about!

6

14

159

@janleike

Jan Leike

5 years

Very excited to deliver the #icml2019 tutorial on #safeml tomorrow together with @csilviavr ! Be prepared for fairness, human-in-the-loop RL, and a general overview of the field. And lots of memes!

Tweet media one

3

19

156

@janleike

Jan Leike

19 days

Some statistics on the superalignment fast grants: We funded 50 out of ~2,700 applications, awarding a total of $9,895,000. Median grant size: $150k Average grant size: $198k Smallest grant size: $50k Largest grant size: $500k Grantees: Universities: $5.7m (22) Graduate…

11

16

152

@janleike

Jan Leike

1 year

AI not killing everyone is too low of a bar. I want humans to live out the most amazing future.

16

8

151

@janleike

Jan Leike

1 year

One of my favorite parts of the GPT-4 release is that we asked an external auditor to check if the model is dangerous. This project lead by @BethMayBarnes tested if GPT-4 could autonomously survive and spread. (The answer is no.) More details here:

Tweet card media

Update on ARC's recent eval efforts

evals.alignment.org

17

16

146

@janleike

Jan Leike

2 months

It supports both neurons and attention heads. You can intervene on the forward pass by ablating individual neurons and see what changes. In short, it's a quick and easy way to discover circuits manually.

2

6

140

@janleike

Jan Leike

1 year

This problem is sometimes called *scalable oversight*. There are several ideas how to do this, and how to measure that we're making progress. The path I'm very excited for is using models like ChatGPT to assist humans at evaluating other AI systems.

10

9

136

@janleike

Jan Leike

2 years

Well explained blog post about over-optimizing reward models using simple best-of-n sampling: By Jacob Hilton and @nabla_theta

Tweet card media

Measuring Goodhart’s law

Goodhart’s law famously says: “When a measure becomes a target, it ceases to be a good measure.” Although originally from economics, it’s something we have to grapple with at OpenAI when figuring out...

4

24

133

@janleike

Jan Leike

6 months

Hm, maybe "if you need to look up precision and recall on wikipedia every time, you'll achieve high precision and low recall"

1

0

134

@janleike

Jan Leike

10 months

I'm super excited to be co-leading the team together with @ilyasut . Most of our previous alignment team has joined the new superalignment team, and we're welcoming many new people from OpenAI and externally. I feel very lucky to get to work with so many super talented people!

11

3

133

@janleike

Jan Leike

5 months

Kudos especially to @CollinBurns4 for being the visionary behind this work, @Pavel_Izmailov for all the great scientific inquisition, @ilyasut for stoking the fires, @janhkirchner and @leopoldasch for moving things forward every day. Amazing ✨

9

9

133

@janleike

Jan Leike

1 year

I recommend talking to the model to explore what it can help you best with. Try out how it works for your use case and probe it adversarially. Think of edge cases. Don't rush to hook it up to important infrastructure before you're familiar with how it behaves for your use case.

2

9

129

@janleike

Jan Leike

2 years

I wish there were better resources for machine learning researchers who are interested to dive into alignment research

12

6

128

@janleike

Jan Leike

4 years

Submtting a NeurIPS paper and unsure how to write your broader impact statement? This blog post will guide you through it! Comes with a few concrete examples, too. By Carolyn Ashurst, @Manderljung , @carinaprunkl , @yaringal , and Allan Dafoe.

1

29

128

@janleike

Jan Leike

5 months

We find that large models generally do better than their weak supervisor (a smaller model), but not by much. This suggests reward models won't be much better than their human supervisors. In other words: RLHF won't scale.

6

10

126

@janleike

Jan Leike

3 years

There are a lot of exciting things in the Codex paper, but my favorite titbit is the misalignment evaluations by @BethMayBarnes : Subtly buggy code in the context makes the model more likely to write buggy code, and this discrepancy gets larger as the models get bigger!

Tweet media one

4

24

122

@janleike

Jan Leike

10 months

20% of compute is not a small amount and I'm very impressed that OpenAI is willing to allocate resources at this scale. It's the largest investment in alignment ever made, and it's probably more than humanity has spent on alignment research in total so far.

6

8

120

@janleike

Jan Leike

5 months

For lots of important tasks we don't have ground truth supervision: Is this statement true? Is this code buggy? We want to elicit the strong model's capabilities on these tasks without access to ground truth. This is pretty central to aligning superhuman models.

2

4

119

@janleike

Jan Leike

5 months

The glorious details are in the paper: If you want to build on this, here is some open source code to get you started:

2

9

106

@janleike

Jan Leike

10 months

@ESYudkowsky We'll stare at the empirical data as it's coming in: 1. We can measure progress locally on various parts of our research roadmap (e.g. for scalable oversight) 2. We can see how well alignment of GPT-5 will go 3. We'll monitor closely how quickly the tech develops

15

4

98

@janleike

Jan Leike

2 months

Big congrats to the team! 🎉 @mildseasoning , Steven Bills, @HenkTillman , @tomdlt10 , @nickcammarata , @nabla_theta , @jachiam0 , Cathy Yeh, @WuTheFWasThat , and William Saunders

6

3

98

@janleike

Jan Leike

6 months

This is very cool work! Especially the unsupervised version of this technique seems promising for superhuman models.

@andyzou_jiaming

Andy Zou

@andyzou_jiaming

7 months

LLMs can hallucinate and lie. They can be jailbroken by weird suffixes. They memorize training data and exhibit biases. 🧠 We shed light on all of these phenomena with a new approach to AI transparency. 🧵 Website: Paper:

Tweet media one

27

255

1K

17

9

94

@janleike

Jan Leike

1 year

The fastest typing humans paid at US minimum wage would be 200x more expensive than gpt-3.5-turbo

6

3

92

@janleike

Jan Leike

5 months

Amazing work, so proud of the team! @CollinBurns4 @Pavel_Izmailov @janhkirchner @bobabowen @nabla_theta @leopoldasch @cynnjjs @AdrienLE @ManasJoglekar @ilyasut @WuTheFWasThat and many others It's an honor to work with y'all! :excited-superalignment:

6

3

90

@janleike

Jan Leike

10 months

@michhuan @OpenAI @NPCollapse @ilyasut Alignment is not binary and there is a big difference between aligning human level systems and aligning superintelligence. Making roughly human-level AI aligned enough to solve alignment is much easier than solving alignment once and for all

12

7

86

@janleike

Jan Leike

10 months

@ManifoldMarkets @ilyasut @leopoldasch @OpenAI excited to beat these odds

4

6

87

@janleike

Jan Leike

5 months

But even our simple technique can significantly improve weak-to-strong generalization. This is great news: we can make measurable progress on this problem today! I believe more progress in this direction will help us align superhuman models.

Tweet media one

2

5

83

@janleike

Jan Leike

6 years

New paper on teaching RL agents to understand the meaning of instructions. Instead of manually specifying rewards, we learn them from goal-state examples. With @DBahdanau , Felix Hill, @edwardfhughes , @pushmeet , and @egrefen !

1

36

83

@janleike

Jan Leike

1 year

@lyfaradey afaik you can use text-davinci-002, text-davinci-003, and ChatGPT in Japanese and Chinese just fine

10

4

79

@janleike

Jan Leike

6 months

If you are worried about risks from frontier model capabilities, consider applying to the new Preparedness team! If we can measure exactly how dangerous models are, the conversation around this will become more grounded. Exciting that this new team is taking on the challenge!

@OpenAI

OpenAI

6 months

We are building a new Preparedness team to evaluate, forecast, and protect against the risks of highly-capable AI—from today's models to AGI. Goal: a quantitative, evidence-based methodology, beyond what is accepted as possible:

289

397

2K

6

7

77

@janleike

Jan Leike

1 year

If you're interested in this topic, we're hiring, especially research engineers! If you think this might be you, please apply here:

Tweet card media

Developing safe and beneficial AI systems requires people from a wide range of disciplines and backgrounds. We’re always looking for curious minds to join our team.

10

6

77

@janleike

Jan Leike

1 year

What should we be aligning to when we're building AI systems like ChatGPT? I'm excited about this idea based on simulated deliberative democracy. Would love to hear what y'all think :)

Tweet card media

A proposal for importing society’s values

Building towards Coherent Extrapolated Volition with language models

aligned.substack.com

15

11

77

@janleike

Jan Leike

6 years

Interested in getting into machine learning research and AI safety in particular? @80000Hours recently interviewed me about this. Check out the podcast:

Tweet card media

An AI safety researcher on how to become an AI safety researcher

Want to help steer the 21st Century’s most transformative technology? DeepMind Research Scientist Dr Jan Leike recommends you first complete an...

0

21

74

@janleike

Jan Leike

2 years

How do we get from responsible AI as an afterthought to responsible AI being part of the entire development process?

19

7

76

@janleike

Jan Leike

1 year

Incentives are the most powerful force in the universe. Stronger that any other physical force. E.g. if you commit enough money to have a train float in the air, it will float.

14

7

72

@janleike

Jan Leike

9 months

Had a great time chatting with @dfrsrchtwts about our Superalignment plans. If you want to learn more about what our team is up to and hear my latest thoughts about alignment, check it out:

@dfrsrchtwts

Daniel Filan research-tweets

9 months

Yet another new episode of AXRP, where I speak with @janleike about OpenAI's new superalignment team!

2

5

45

4

3

62

@janleike

Jan Leike

4 years

How do you measure the distance between two reward functions? Our EPIC distance is invariant to reward shaping, can be approximated efficiently, and is predictive of policy training success and transfer! New paper with @ARGleave , @MichaelD1729 et al.

Tweet media one

0

9

70

@janleike

Jan Leike

2 years

What could a once-and-for-all solution to the alignment problem actually look like? It'll be very different from what we do today. This is my attempt to sketch it out:

Tweet card media

What could a solution to the alignment problem look like?

A high-level view on the elusive once-and-for-all solution

aligned.substack.com

7

3

69

@janleike

Jan Leike

10 months

Alignment is fundamentally a machine learning problem, and we need the world's best ML talent to solve it. We're looking for engineers, researchers, and research managers. If this could be you, please apply:

14

5

69

@janleike

Jan Leike

5 months

If you're interested in working on this topic more seriously, check out our grants:

@janleike

Jan Leike

5 months

We're distributing $1e7 in grants for research on making superhuman models safer and more aligned. If you've always wanted to work on this, now is your time! Apply by Feb 18:

19

60

444

4

3

69

@janleike

Jan Leike

2 years

For comparison: We spent <2% of the pretraining compute on fine-tuning and collect a few 10,000s of human labels and demos. Our 1.3b parameter models (GPT-2 sized!) are preferred over a prompted 175b parameter GPT-3.

2

4

65

@janleike

Jan Leike

9 months

Just landed in Honolulu! HMU if you're at ICML and want to talk about superalignment!

5

1

64

@janleike

Jan Leike

3 years

I really love this new paper showing how single neurons respond across modalities in CLIP models. Opens up a new avenue of new typographic attacks to fool these kinds of models. By @gabeeegoooh , @ch402 , and others.

1

11

65

@janleike

Jan Leike

1 year

@michael_nielsen I agree that these questions are important, but we don't need a definitive answer in order to make progress on alignment. Right now we don't even know how to make them reliably follow anyone's intent, or do things we all agree on. Mitigating misuse of AI is a different problem.

13

2

65

@janleike

Jan Leike

2 years

So many alignment plans revolve around "we'll convince everyone to not do X." Maybe you can buy some time, but people will do X anyway. We should instead spend our time trying to figure out how to make X aligned & safe.

15

1

62

@janleike

Jan Leike

5 years

Multiparty computation is awesome because it lets multiple parties train a model without seeing the weights. But there are fundamental limits to making it scalable: >24x overhead! Our new paper addresses this problem. w/ @MiljanMartic @iamtrask et al.

Tweet media one

1

20

64

@janleike

Jan Leike

10 months

Why 4 years? It's a very ambitious goal, and we might not succeed. But I'm optimistic that it can be done. There is a lot of uncertainty how much time we'll have, but the technology might develop very quickly over the next few years. I'd rather have alignment be solved too soon

5

4

63

@janleike

Jan Leike

1 year

That was like 4 years ago...

2

2

63

@janleike

Jan Leike

1 year

A common misconception is that alignment is binary. A system can be more aligned with you or less aligned with you.

15

1

62

@janleike

Jan Leike

3 years

I'm very interested in techniques for supervising models to do tasks that are difficult for humans to evaluate. To study this, we trained a model on summarizing entire books! Read more 👇

Tweet card media

Summarizing books with human feedback

Scaling human oversight of AI systems for tasks that are difficult to evaluate.

1

4

60

@janleike

Jan Leike

8 months

Thanks for all the great questions, @robertwiblin ! Very grateful for the opportunity to share more about our research plans.

@robertwiblin

Robert Wiblin

8 months

I interviewed OpenAI's Head of Alignment @janleike on their new superalignment project on which they're spending $100m's in an attempt to figure out how to make superhuman AI not go rogue and wreck everything:

6

14

92

3

7

52

@janleike

Jan Leike

1 year

Constitutional AI doesn't let you avoid labeling data by writing down some rules. You still need to figure out how good your rules are. So you need to label a validation set. Then you'll get some accuracy on the validation set. How can you increase this accuracy?

3

3

57

@janleike

Jan Leike

1 year

It's still early days, but it's been cool to see some interesting trends: 1. Later layers are harder to explain than earlier ones 2. Simple interventions into pretraining can improve explainability of neurons 3. Simple tricks like iterative refinements can improve explanations

3

4

56