@andriy_mulyar
@sleepinyourhat
@srush_nlp
@chrmanning
@mdredze
@ChrisGPotts
Basically, we built the engine (GPT-4), but not yet the seatbelts, brakes, windshield wipers, ABS, etc.
Continual learning/model editing. Models are getting better but not more updateable. Also safety mechanisms for open-sourced models, or tools to detect machine-generated text.
ChatGPT (and others) generate very fluent (but not always truthful) text.
Some worry that teachers, news-readers (like you!), and society in general will be swamped with AI-generated content.
That's why we built DetectGPT, a method for detecting if text comes from an LM.
RLHF is the 🪄 getting us from GPT-3 to ChatGPT.
But RLHF is hard! Need to train a reward model, then do RL on a big LM (w/ expensive sampling & tuning)
𝙊𝙧 𝙙𝙤 𝙮𝙤𝙪?
Introducing Direct Preference Optimization (DPO), a simple classification loss provably equivalent to RLHF
RLHF is powerful; it lets us fine-tune LLMs to be more useful.
What if we could do RLHF… without fine-tuning???
Excited to share Emulated Fine-Tuning (EFT)!
EFT lets us “emulate” what we would have gotten if we did RLHF on a new model, without actually doing the RLHF!
DPO (fast, simple, performant RLHF) code is here!
With DPO there's 𝗻𝗼 𝗿𝗲𝘄𝗮𝗿𝗱 𝗺𝗼𝗱𝗲𝗹 𝗼𝗿 𝗥𝗟 𝗻𝗲𝗲𝗱𝗲𝗱.
It's finally easy to fine-tune llama from human preferences 😊
Can't wait to see the cool models people train with it 🤓
advisors say I should be “not doing research” & “getting a job”
alas due to recent RLHF DPO/IPO/PPO debates I wrote a 1pg mini-paper
tldr: assuming noisy pref data gives a 'conservative DPO', might make DPO stabler late in training (& looks like IPO)
🧵
ChatGPT users know the dreaded “as of my knowledge cutoff…”
Can we keep LLMs up-to-date with continual fine-tuning?
Our EMNLP paper shows LMs may remember only a *tiny* fraction of the info they see in a data stream
It also shows meta-learning can improve knowledge uptake 🥹
Curious how to take the RL out of RLHF?
Come check out our
#ICML2023
workshop poster for Direct Preference Optimization (aka, how to optimize the RLHF objective with a simple classification loss)!
Meeting Room 316 AB, 10am/12:20/2:45 Hawaii time
Okay the DPO repo now supports conservative DPO (cDPO) & IPO loss!
cDPO/IPO both optimize the policy only until some fixed margin of improvement is met, rather than optimizing "forever" like DPO
tbh unclear how much of a difference this makes- we'll see!
Something interesting about Tulu-70b is that it gives short responses *that are still rated really highly by GPT-4.*
IMO this could be a signal that the model's improvement is more meaningful, since you can get GPT-4 to like you just by giving long responses
Check out our new 70B DPO model here:
AFAIK currently the best model on AlpacaEval with a public finetuning set!
More details once the AI sphere calms down a bit... 😅
IMO this is the wrong definition of "hallucination"
At least, it's not helpful- LM systems will never be 100% factual
I define "hallucination" to be **verbalized miscalibration**
i.e. the model expresses confidence it doesn't actually hold
What does this mean? Explained below
# On the "hallucination problem"
I always struggle a bit when I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.
We direct their dreams with prompts. The prompts start the dream, and based on the…
Very excited to share this demo of DetectGPT! Looking forward to feedback of all kinds, even re: my questionable web design skills...
Demo:
More info:
🚨 THIS IS A RESEARCH PROJECT; DO NOT USE FOR PRODUCTION/ANYTHING IMPORTANT 🚨
PSA:
***the point of dpo is NOT to skip reward modeling***
***the point of dpo is to skip EVERYTHING BUT reward modeling***
thank you for coming to my ted talk ❤️
(yes the paper could have explained this more clearly)
Three models, three different answers 😎
Claude 3 is AGI confirmed
Separately, what will it take to get a model to actually ask "do you want the answer for the beginning or end of day 4"? This question as stated is ambiguous
Attn-free models are sweet
Must be careful though not to assume matching ppl of transformers means we've matched transformers' ability to few-shot learn, fine-tune well, recall from long ctx, etc.
these depend on arch's inductive bias.
They could be worse, or stronger already!
It was great to see a lot of excitement about attention-free models
@NeurIPSConf
! We had great conversations with many people interested in next-gen architectures for language models.
Pic from "Systems for foundation models and foundation models for systems" by Chris Ré
"...Rather than needing separate transformers for the reward fn & LLM, given an LLM, you can find the reward fn (+ regularization term) that the LLM is best at maximizing. DPO trains the LLM...to make that reward fn (implicitly defined by the LLM) consistent w the human prefs..."
@Diyi_Yang
gave a wonderful overview of possibilities & immediate challenges of learning from human feedback today
SUCH a juicy area to dive into right now
**hint hint junior PhD students work on this k thanks**
More fun than I expected answering qs about large language models; hope people find some interesting/useful tidbits in here.
Thanks
@Stanford
for the opportunity!
Ever-larger language models are unwieldy for both researchers and maintainers.
To enhance re-usability and increase their useful lifetime, we'd like to tweak them without full re-training/fine-tuning.
MEND edits models with 10^6 to >10^10 params in one fine-tuning step. (1/8)
Large language models (LLMs) often make mistakes that are difficult to correct.
We study the problem of quickly editing these models:
Paper:
Code:
w/
@_eric_mitchell_
, C. Lin,
@ABosselut
,
@chrmanning
thread 🧵👇
CaMeLS code is here 🐫🐫🐫
Use it to train models that identify what tokens contain the most important information in a document, without any per-token annotations!
Paper link, in case you missed it:
@ericmitchellai
Eric is awesome - I learned so much from his mentorship throughout this project. And huge thanks to
@chelseabfinn
&
@chrmanning
for their support 🙏 .
The CaMeLS code is now available at 🐪🐫.
My DMs + email are open if y'all want to chat more 😁.
Data is an underexplored frontier in preference-based LLM training; for any RL*F with a learned reward (PPO, DPO, IPO, etc.), the preference data is really the limiting factor!
Exploration during training (eg PPO sampling) is **useless** if your reward fn is inaccurate there!!
🔥 More is less for DPO, high quality matters!
📢 Dropping our first open dataset and LLM of the year:
💾Meet distilabel Orca Pairs DPO, an improved version of the now famous dataset from
@intel
🏛️And a new OpenHermes model outperforming baselines with 54% fewer DPO pairs
🧵
ICML tip: upload your paper pdf and review to Claude/GPT4/Gemini w prompt
"What do you think is the main point of the paper? After answering, please explain to what extent, if any, you think the reviewer has fully understood the main point of the paper."
Better than therapy 🥹
Some extra analysis we did:
On Anthropic-HH-helpful, peek at the response length & winrate for different hparams for each method; each dot is a ckpt every 20k steps
tldr: there’s a p clear relationship between length/winrate; DPO escapes the frontier, but that could be luck
So, since we are posting science straight to twitter now,
@ericmitchellai
and I have some updates for potential overfitting in DPO.
TL;DR: we compared DPO to IPO and cDPO (DPO + label noise) on 3 different datasets, and we didn't observe any significant advantage (yet). 🧵->
Fantastic stuff, congrats to all authors
the self-improvement is super super cool to see... but the impact of the scoring prompt is particularly intriguing
The difficulty of multiple choice eval reminds me a bit of the "generative AI paradox"?
🚨New paper!🚨
Self-Rewarding LMs
- LM itself provides its own rewards on own generations via LLM-as-a-Judge during Iterative DPO
- Reward modeling ability improves during training rather than staying fixed
...opens the door to superhuman feedback?
🧵(1/5)
Very curious to see how far we can push training to simply *not hallucinate.* It won't give us perfect models, but it seems like a really meaningful (more than 50%) reduction in factual errors might be possible. Needs to be scaled up 😀 full thread on the way!
Fine-tuning Language Models for Factuality
paper page:
The fluency and creativity of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines. Yet language models are prone…
Leaving lovely Singapore for NeurIPS in Nola- reach out (DM here is easiest) if you'd like to chat RLHF/DPO/alignment generally, LLM reasoning, continual learning, uncertainty in LLMs, or other stuff related to reliable & safe AI!
By the way, I am on the job market this year :)
There are other goodies in the experiments… for example, we explore robustness of detection to machine-generated text that has been partially revised.
Check out the paper for more (and website for code/demo soon)
Stanford PhD student
@ericmitchellai
will be stepping into our 🎥 studio to answer the internet’s questions on
#ChatGPT
, GPT-4, and other large language models. What do you want to know?
I think
@argilla_io
is on to something
We need dedicated, streamlined tooling for collecting (continual) feedback
Going beyond static datasets (eg, doing *multiple* rounds of DPO with online preferences) is low hanging fruit for improving open source LLMs IMO
🚀 Open-source AI strikes again! Announcing Notux 8x7B, a fine-tune of Mixtral Instruct with high-quality chat data and DPO.
Notux is now the top-ranked MoE on the Open LLM leaderboard.
v cool work
@ethayarajh
@winniethexu
et al.! ❤️ non-pref learning!
DPO's update rule pushes up chosen & down rejected
interesting to compare w/ KTO's update rule
looks ~similar (when we have 👍 & 👎 examples for the same prompt), but different learning rate
(check my math?)
📢The problem in model alignment no one talks about — the need for preference data, which costs $$$ and time!
Enter Kahneman-Tversky Optimization (KTO), which matches or exceeds DPO without paired preferences.
And with it, the largest-ever suite of feedback-aligned LLMs. 🧵
Had a ton of conversations at EMNLP/NeurIPS about the potential of RL(non-human)F for improving capabilities like factuality, reasoning, and coding.
We did it for factuality here:
Awesome to see it work for reasoning too! Down with human feedback!!
🔥Excited to share our latest work:
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations.
With Math-Shepherd, Mistral-7B fine-tuned on MetaMATH achieves accuracy rates of 89.1% and 43.5% on GSM8K and MATH, respectively.
Paper:
Awesome work addressing one potential issue with DPO (the reward modeling objective only weakly constrains the policy update).
Maybe you'd get similar results by simply pushing down all the tokens in the vocab with low prior prob in the neg sequence, not just the observed token?
[CL] Some things are more CRINGE than others: Preference Optimization with the Pairwise Cringe Loss
J Xu, A Lee, S Sukhbaatar, J Weston [Meta] (2023)
- Preference learning is commonly used to align large language models, using pairwise labels indicating…
Finally set aside some time to learn
@Gradio
. Worth the time investment if you often find yourself needing to share quick ML demos. Thanks to
@_akhaliq
+ others from
@Gradio
for the help porting over the DetectGPT demo!
Check out the HF space here:
Just got to Honolulu for
#ICML2023
and consumed my first of hopefully many poke bowls for the week.
Reach out (email, DM) if you want to hang out or chat LLMs/continual learning/RLHF (or anything)!
PS- I'm on the job market this fall & happy to chat about that too 😀
@CJessCooke
@benji_smith
I'm genuinely confused and would love to understand- how does this project harm authors?
My understanding is that it counts the number of positive/negative sentiment words in a book and makes that info public.
Come by the
#NeurIPS2023
Instruction Following workshop (room 220-222) to see our work on:
*Emulated fine-tuning*: RLHF without fine-tuning!
*Fine-tuning for factuality*: how to fine-tune LLMs directly for factuality, reducing hallucination by >50%
RIGHT NOW!!!
People say cos making LLMs will never release pre-training data bc of liability
I get that, but why not just release a teeny tiny near-IID sample of the data?
Enough to learn the rough stats, format, etc.?
Can filter out anything "incriminating"
From a convo
@neurips
...
In light of all of the discussion of "self-awareness", remember that we are still *explicitly telling* our systems:
- who they are
- what they do and don't know (stuff up to August 2023)
- what their personality/tendencies are
Is personality emergent or... part of the prompt?
One way to detect LM text is with a trained classifier (another LM). This works, but can overfit to the models/topics it was trained on.
Instead, if we can access the LM itself, we can use its own log probabilities to do detection *zero-shot*, without any training at all!
At EMNLP! hmu/DM to chat reliability in LMs (factuality/reasoning/alignment/...), or just grab sambal stingray 😀
ALSO
the amazing
@NathanHu12
&
@kattian_
(both on PhD market!) are presenting work on keeping LMs up-to-date & mitigating overconfidence, respectively
CHAT W THEM!
Many asking about state of AI-generated text detection. Progress is exciting!
But I cannot overstate:
NO EXISTING TOOL is ready to justify real-world disciplinary decisions. This is still research; we don't even have open/standardized benchmarks yet!
See
We call this version of EFT “up-scaling” a small fine-tuned model
We can recover most of improvements in factuality that we would see if we *had* fine-tuned the larger model (bottom row of results), without fine-tuning!
Again, no new sampling hyperparams here
Based on his performance in our last interview prep session (i.e.: 😬) I think
@archit_sharma97
probably should have spent his last 30 minutes grinding leetcode rather than "hacking" my twitter
just my 2 cents ❤️🥰😘
Super excited to see folks having success with DPO as an alternative to PPO-based RLHF; congrats Lewis & the rest of the team!
Train your own DPO LLMs either with TRL's DPOTrainer () or with the standalone DPO repo ()
Have fun 😀
Here's a simple recipe to train a 7B model that outperforms Llama2 70B on MT Bench 🥇
1. SFT Mistral 7B on the UltraChat dataset
2. Align the SFT model to the UltraFeedback dataset with "direct preference optimisation" (DPO)
Demo:
More details in the 🧵
Here's the conceptual punchline.
☹️ Current RLHF: train a reward model to align w/ human prefs. THEN train a policy (w/ e.g. PPO) to maximize reward
😊 DPO: directly train a policy (without RL!) that is the optimal policy for an implicit reward function aligned w/ human prefs
Presenting this work RIGHT NOW, led by
@NathanHu12
!
Come hear about LLMs that autonomously keep themselves more up-to-date with the world!!
We can do this with ✨✨ m e t a - l e a r n i n g ✨✨
Poster 𝟮𝟯𝗕!!!
+ our alg is called CaMeLS 🐪, just another reason to come 🥹
Can LLMs keep themselves up to date by reading the news?
Fine-tuning on news articles doesn't work.
Using meta-learning, we can reweight news article tokens so that fine-tuning works.
@NathanHu12
&
@ericmitchellai
presenting this work at
#EMNLP2023
this week!
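A minimal sketch of the token-reweighting idea from the tweet above (my own illustration, not the CaMeLS code; `weight_model` is a hypothetical scorer that outputs per-token importance weights, and the LM is assumed to be an HF-style causal model):

```python
import torch.nn.functional as F

def weighted_adaptation_step(lm, weight_model, doc_ids, optimizer):
    """One online-adaptation step: reweight the next-token loss on a news
    article by how informative each token is, then take a gradient step."""
    logits = lm(doc_ids).logits[:, :-1, :]            # predict token t+1 from prefix
    targets = doc_ids[:, 1:]
    token_nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    weights = weight_model(doc_ids)[:, 1:].detach()   # importance weights, frozen at adaptation time
    loss = (weights * token_nll).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```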
“One needs to learn to love and enjoy the little things in life. One also needs to discover one’s true calling and then should do everything to pursue the selected path,” - wise words
@archit_sharma97
What does this decomposition get us?
Say we fine-tune a llama-7b model (somehow); we can “emulate” the result of doing that same fine-tuning procedure on a 70B model by sampling from:
70b-base lps + (7b-fine-tuned lps - 7b-base lps)
No fine-tuning (or sampling hparams) needed!
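A minimal sketch of that per-token combination (my own illustration, assuming HF-style causal LMs sharing a tokenizer; model names are placeholders, not the paper's code):

```python
import torch

@torch.no_grad()
def eft_upscaled_logprobs(input_ids, big_base, small_base, small_ft):
    """Emulated fine-tuning for the next token:
    big-base logprobs + (small-fine-tuned logprobs - small-base logprobs)."""
    def next_token_logprobs(model):
        logits = model(input_ids).logits[:, -1, :]
        return torch.log_softmax(logits, dim=-1)

    combined = (next_token_logprobs(big_base)
                + next_token_logprobs(small_ft)
                - next_token_logprobs(small_base))
    # Re-normalize, then sample the next token as usual.
    return torch.log_softmax(combined, dim=-1)
```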
The code for DetectGPT is now available on the project website (along with the demo):
Thanks to everyone who's tried out the demo ❤️ The feedback has been really interesting (keep it coming!)
practically, this is a basically trivial change to the DPO loss, and it ends up looking qualitatively a lot like IPO
would be AMAZING if someone did some careful comparisons between these (at least DPO/cDPO/IPO) and reported back...
I'm really supposed to be writing job apps rn
@kchonyc
Thank you for communicating the status clearly and regularly! Hard to control what reviewers do, but as an author it's nice to at least hear from organizers what the plan/status is, even if things are incomplete/delayed.
Hmm new mistral models have toggleable safety tuning behavior through a system prompt
Curious to see if this is more jailbreakable (has prior work studied this?)
Can't actually unlearn any bad behaviors this way (not that we're really unlearning bad stuff currently)
super annoying to prompt an LLM w/ high-level feedback, only for it to apply the feedback in places it shouldn't
see Moritz's 🧵 on how RL(Verbal)F reduces "feedback overgeneralization"
we also do some math analyzing an increasingly popular pattern for generating RL*F prefs 👇
[1/6] Excited to share “RLVF: Learning from Verbal Feedback without Overgeneralization”
Our method C3PO fine-tunes an LLM from 1 sentence of feedback and decreases overgeneralization (=applying the feedback when it should not be applied).
Details:
Interested in detecting text generated by language models? Come see poster
#609
in Exhibit Hall 1 at
#ICML2023
**today** from 11am-12:30pm!
You can also come to the oral presentation in Ballroom C (oral session B1) at 4:32pm 😊
Quick q:
What do we expect the log probability function to look like in the neighborhood of a model sample?
We hypothesized that a model's samples are usually in local maxima of its log probability function, or more generally, in areas of negative curvature.
Spoiler: they are!
But how to measure curvature of an LM's logprob, you ask??
We can approximate *directional second derivatives* by perturbing the text a bit with T5 & comparing the logprob under the LM before and after.
Add Hutchinson's trace estimator & we get approximate trace of the Hessian.
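Concretely, the statistic is roughly a "perturbation discrepancy" like this (a minimal sketch, not the released code; `logprob_fn` and `perturb_fn` are placeholder helpers, with T5 mask-filling standing behind `perturb_fn`):

```python
import numpy as np

def perturbation_discrepancy(text, logprob_fn, perturb_fn, n_perturbations=25):
    """Approximate the local curvature of the LM's log-prob around `text`.

    logprob_fn(t): total log-probability of string t under the LM we query.
    perturb_fn(t): a lightly rewritten version of t (e.g. T5 mask-fill).
    """
    original = logprob_fn(text)
    perturbed = [logprob_fn(perturb_fn(text)) for _ in range(n_perturbations)]
    # A large positive gap means `text` sits near a local max of log-prob
    # (negative curvature) -- the signature of a model sample.
    return original - float(np.mean(perturbed))
```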
Here’s the idea (summary in pic):
Rearrange to get the reward as a function of the optimal policy for that reward fn
Now a single substitution turns our simple Bradley-Terry loss on reward functions into a simple loss on policies
Bam, RLHF without RL!
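A rough PyTorch sketch of what that substitution gives you as a loss (my paraphrase, not the official DPO repo code; function and argument names are placeholders, and the logprobs are assumed to be summed over each response's tokens):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Binary cross-entropy on the implicit reward margin.

    Each argument is a batch of summed per-token log probs of a response
    under the policy being trained or the frozen reference model.
    """
    # Implicit reward of a response: beta * (policy logprob - reference logprob).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry on those rewards collapses to a simple logistic loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```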
AI-powered search engines (New Bing, etc.) are super cool!
Do they understand the flow of time, i.e., that the "right answer" to a question is fluid?
🧵 w/ example: "when does REAL ID go into effect?"
so, RLHF reward modeling (for both DPO and “OG” RLHF) trains with a binary cross entropy loss where our predicted probability is sigmoid(r(chosen) - r(rejected)) and the target prob is 1
but are we really confident enough in our training data to use a target prob of exactly 1???
i don't know how important this 'stop optimizing after a fixed amount of improvement' thing is, but based on some anecdotes I've heard from larger labs trying DPO at larger scale/on huge pref datasets, I have a hunch this could be helpful
@Teknium1
This is normal. One intuition I have is that after SFT, you're near a local maximum for logprob of chosen, and any way you change the model will decrease logprob of chosen.
ie, it's worth lowering the logprob of good stuff if we lower logprob of bad stuff more
DetectGPT takes this approximate Hessian trace, and simply thresholds it to get a detector.
Hessian trace very negative? Probably a model sample!
Turns out this quantity discriminates between human-written and model-generated text very well, for various models and scales.
EFT is similar in spirit to the lovely concurrent “Reward-Augmented Decoding” by
@HaikangDeng
@colinraffel
However the DPO logprob-ratio reward parameterization used in EFT means we don’t need a separate forward pass for each possible next token!
Does it work?
DetectGPT consistently improves AUROC (prob. a random pair of fake/human text is correctly classified) over existing zero-shot methods, for models with 100M to 175B parameters.
It's also competitive with supervised classifiers, outperforming them in some domains.
More work showing how careful (or careless!) data selection hugely impacts model quality.
DPO (offline RL generally) is powerful, but you still need to train on data worth learning from!
The data specifies the implicit reward function you're ultimately optimizing, after all...
How good is AI Feedback, and does it really help us improve LLMs? 🧠
A new paper, “A Critical Evaluation of AI Feedback for Aligning Large Language Models,” takes a critical look at the effectiveness of AI feedback when doing RLHF (DPO).
Experiment:
1️⃣ Create a dataset of…
one supposed issue with RLHF/DPO (which hasn’t really been proven on real-world problems) is that this loss will try to increase the reward of the ‘preferred’ response and decrease the reward of the ‘dispreferred’ response *forever*...
this might be undesirable with noisy prefs
Come see
@kattian_
's work on the ability of RLHF'd LLMs to *directly verbalize* probabilities (yes, that's right, as tokens) that are actually pretty well calibrated! (Usually better than the log probs!!! 🤯🤯🤯)
Poster 14B
R I G H T N O W until 3:30 SG time!!
since we’re probably not 100% confident in our preference labels, our target prob should be something more like 1 - eps
in this case, we get a slightly different BCE loss that stops optimizing after some fixed amount of improvement, so we don’t just “optimize forever”
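In code, that's roughly a one-line label-smoothing tweak to the DPO loss (my sketch, hypothetical names; eps is the assumed label-noise rate):

```python
import torch.nn.functional as F

def cdpo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps, beta=0.1, eps=0.1):
    """Conservative DPO: same reward margin as DPO, but the target
    probability for the preference label is 1 - eps instead of 1."""
    margin = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    # With prob 1 - eps the label is right, with prob eps it's flipped.
    return (-(1 - eps) * F.logsigmoid(margin)
            - eps * F.logsigmoid(-margin)).mean()
```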
@sirbayes
I think
@BlancheMinerva
would know best...
But here the authors find some memorization even at very low rates of duplication:
I think the tldr is large models can memorize even with very little (no?) duplication
Stella can correct me if I'm wrong 😅😅
Biden starts with "K"??
Some confident reasoning/factual errors with GPT-4 & (IMO) surprising out-of-control repetition.
Unsure how to interpret GPT-4's reasoning abilities given these results. Test scores are impressive, but this model clearly couldn't be a lawyer/doctor/etc
But how? Well, one simple approach is to measure the log probability of the text under the model. High log prob -> model sample.
This works, but DetectGPT takes a different approach that turns out to be consistently more accurate in our experiments.
oh also cDPO and IPO are both basically trivial to implement in the DPO codebase, they are very small modifications to the loss, will probably add these soon next time i'm not applying to jobs (or someone could PR 🥺)
Check out the paper for more theoretical analysis of DPO (led by
@rm_rafailov
) as well as additional experiments.
Don’t worry, we also did a human study to make sure our GPT-4-based eval metrics were sensible 🙂
We can even sample from *combinations* of rewards, w/o re-training each combination!
We train one 7B model for helpfulness, one for harmlessness & interpolate to produce a frontier.
The frontier is strictly improved by up-scaling the 7B rewards to 70B, all w/o fine-tuning!
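My guess at the form of the interpolation, extending the up-scaling formula from the EFT tweet earlier in this feed (not taken from the paper; λ is my notation for the mixing weight):

```latex
\log \tilde{p}_\lambda(y_t \mid x, y_{<t}) \;\propto\;
  \log p^{\text{base}}_{70\mathrm{B}}
  + \lambda \big(\log p^{\text{help}}_{7\mathrm{B}} - \log p^{\text{base}}_{7\mathrm{B}}\big)
  + (1-\lambda)\big(\log p^{\text{harmless}}_{7\mathrm{B}} - \log p^{\text{base}}_{7\mathrm{B}}\big)
```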
Cool to see this eval!! Hard to say two algs will be the same in all conditions, but great to see independent verification that DPO is indeed v similar to PPO, while being a whole lot simpler/cheaper
Ofc still v curious about if there are "corner cases" where DPO/PPO differ
Belatedly, I finally had a chance to update the AlpacaFarm paper with DPO results.
TL;DR: DPO performs similarly to RLHF+PPO but is much more memory-friendly. Previously, PPO fine-tuning took ~2 hours on 8 A100 GPUs. Our DPO runs take about the same time on 4 GPUs. DPO with LoRA…
Excited for this, and a bit surprised it took this long; low-hanging fruit tbh for increasing the immersiveness/usefulness of LLMs. Maybe harder from the eng perspective than research, depending on the implementation.
Personalization a big theme of 2024?
Honestly enjoying talking to Gemini Ultra so far. Its conversational style is pretty natural and empathetic without being too moralizing.
Fails the apple test though 😅
ht r/localllama
@hazmo89
@benji_smith
Believe it or not, some people really do try to understand viewpoints that differ from their own.
However I can see you don't make a habit of doing so.
@goodside
Training your new model to minimize cross entropy with samples from the existing LM is (in expectation) equivalent to minimizing the (forward) KL between the old (teacher) and new (student) models.
ie your new "pretraining" data is just samples from your starting model.
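One way to see that equivalence (a standard identity, my notation): writing p for the teacher and q_θ for the student,

```latex
\mathbb{E}_{x \sim p}\big[-\log q_\theta(x)\big]
  \;=\; H(p) \;+\; \mathrm{KL}\big(p \,\|\, q_\theta\big)
```

and H(p) doesn't depend on θ, so minimizing the cross entropy over θ minimizes the forward KL.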
Antoine "back in my day GPTs didn't have numbers yet" Bosselut throwing down the "what even is mechinterp" gauntlet right from the start
@ABosselut
We can actually interpret the logprob difference term as a reward function that *would* take us from the base/reference model to the fine-tuned model, if we did RL (see the DPO paper!)
In EFT, we’re just re-weighting the base model logprobs with the reward, token by token!
One of the coolest winners of the inverse scaling prize IMO was failure of modus tollens (reasoning incorrectly when presented with P -> Q, ~Q as inputs).
RLHF maybe helps? GPT-3.5 and GPT-4 get this right, but Bard can't overcome the prior that dogs aren't the only pet.
Let’s back up.
RLHF is all about learning from preferences, i.e. a dataset D of prompts + chosen/rejected model response for each prompt.
We 𝙘𝙤𝙪𝙡𝙙 train a reward model matching these human prefs using D + the Bradley-Terry preference model (for ex) as a loss function
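For reference, the Bradley-Terry model and the resulting reward-modeling loss look like this (standard form, my notation; r_φ is the reward model, y_w/y_l the chosen/rejected responses):

```latex
p(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big), \qquad
\mathcal{L}(\phi) = -\,\mathbb{E}_{(x, y_w, y_l) \sim D}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]
```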
@Axmortl
@benji_smith
To clarify, at no point did Benji ever train any generator model that produces text of any kind. He just computed basic statistics about the books & shared small snippets (like Google Books). This is completely different from models like midjourney/gpt-4, which produce content.
There are lots more analyses in the paper, again led by Nathan, showing how weights generalize across datasets, where the gains from weighted adaptation come from, and more.
Definitely check it out!!
Code coming ASAP :)