Eric

@ericmitchellai

3,675 Followers · 492 Following · 97 Media · 573 Statuses

I like AI & music. Working on making LLMs easier & safer to use. Final year PhD student at Stanford advised by Chelsea Finn & Chris Manning.

United States
Joined December 2017
Pinned Tweet
@andriy_mulyar @sleepinyourhat @srush_nlp @chrmanning @mdredze @ChrisGPotts Basically, we built the engine (GPT-4), but not yet the seatbelts, brakes, windshield wipers, ABS, etc. Continual learning/model editing. Models are getting better but not more updateable. Also safety mechanisms for open-sourced models, or tools to detect machine-generated text.
1
1
43
ChatGPT (and others) generate very fluent (but not always truthful) text. Some worry that teachers, news-readers (like you!), and society in general will be swamped with AI-generated content. That's why we built DetectGPT, a method for detecting if text comes from an LM.
Tweet media one
48
241
1K
@ericmitchellai
Eric
11 months
RLHF is the 🪄 getting us from GPT-3 to ChatGPT. But RLHF is hard! Need to train a reward model, then do RL on a big LM (w/ expensive sampling & tuning) 𝙊𝙧 𝙙𝙤 𝙮𝙤𝙪? Introducing Direct Preference Optimization (DPO), a simple classification loss provably equivalent to RLHF
Tweet media one
22
119
761
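A minimal sketch of the DPO classification loss described in the tweet above, assuming per-sequence log-probabilities have already been computed under the policy and a frozen reference model; the function and argument names are illustrative, not taken from the official repo.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: a binary classification loss on preference pairs.

    Each argument is a tensor of summed token log-probs for the
    chosen/rejected responses under the policy or the frozen reference.
    """
    # Implicit rewards: beta * log(pi(y|x) / pi_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry-style cross-entropy: prefer chosen over rejected
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```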
@ericmitchellai
Eric
6 months
RLHF is powerful; it lets us fine-tune LLMs to be more useful. What if we could do RLHF… without fine-tuning??? Excited to share Emulated Fine-Tuning (EFT)! EFT lets us “emulate” what we would have gotten if we did RLHF on a new model, without actually doing the RLHF!
Tweet media one
12
96
534
@ericmitchellai
Eric
11 months
DPO (fast, simple, performant RLHF) code is here! With DPO there's 𝗻𝗼 𝗿𝗲𝘄𝗮𝗿𝗱 𝗺𝗼𝗱𝗲𝗹 𝗼𝗿 𝗥𝗟 𝗻𝗲𝗲𝗱𝗲𝗱. It's finally easy to fine-tune llama from human preferences 😊 Can't wait to see the cool models people train with it 🤓
7
81
406
@ericmitchellai
Eric
4 months
After 5 years of PyTorch, I have had enough of doing ML on easy mode. Starting today, I am switching exclusively to TF for a real challenge 🫡🫡
25
13
369
@ericmitchellai
Eric
6 months
advisors say I should be “not doing research” & “getting a job” alas due to recent RLHF DPO/IPO/PPO debates I wrote a 1pg mini-paper tldr: assuming noisy pref data gives a 'conservative DPO', might make DPO stabler late in training (& looks like IPO) 🧵
9
28
280
@ericmitchellai
Eric
4 months
@AndrewYNg @rm_rafailov @archit_sharma97 @StefanoErmon @chrmanning @chelseabfinn Thank you Andrew- it means a lot! We took this photo together 6.5 years ago, in the summer of 2017, when I was just getting started in research... Thank you for the insight and inspiration 🫡
Tweet media one
1
1
259
@ericmitchellai
Eric
6 months
ChatGPT users know the dreaded “as of my knowledge cutoff…” Can we keep LLMs up-to-date with continual fine-tuning? Our EMNLP paper shows LMs may remember only a *tiny* fraction of the info they see in a data stream It also shows meta-learning can improve knowledge uptake 🥹
Tweet media one
6
26
200
@ericmitchellai
Eric
10 months
Curious how to take the RL out of RLHF? Come check out our #ICML2023 workshop poster for Direct Preference Optimization (aka, how to optimize the RLHF objective with a simple classification loss)! Meeting Room 316 AB, 10am/12:20/2:45 Hawaii time
Tweet media one
@ericmitchellai
Eric
11 months
RLHF is the 🪄 getting us from GPT-3 to ChatGPT. But RLHF is hard! Need to train a reward model, then do RL on a big LM (w/ expensive sampling & tuning) 𝙊𝙧 𝙙𝙤 𝙮𝙤𝙪? Introducing Direct Preference Optimization (DPO), a simple classification loss provably equivalent to RLHF
Tweet media one
22
119
761
5
25
203
@ericmitchellai
Eric
6 months
Okay the DPO repo now supports conservative DPO (cDPO) & IPO loss! cDPO/IPO both optimize the policy only until some fixed margin in improvement is met rather than optimizing "forever" like DPO tbh unclear how much of a difference this makes- we'll see!
1
17
128
@ericmitchellai
Eric
6 months
Something interesting about Tulu-70b is that it gives short responses *that are still rated really highly by GPT-4.* IMO this could be a signal that the model's improvement is more meaningful, since you can get GPT-4 to like you just by giving long responses
Tweet media one
@hamishivi
Hamish Ivison
6 months
Check out our new 70B DPO model here: AFAIK currently the best model on AlpacaEval with a public finetuning set! More details once the AI sphere calms down a bit... 😅
4
45
250
1
18
110
@ericmitchellai
Eric
5 months
IMO this is the wrong definition of "hallucination" At least, it's not helpful- LM systems will never be 100% factual I define "hallucination" to be **verbalized miscalibration** i.e. the model expresses confidence it doesn't actually hold What does this mean? Explained below
Tweet media one
@karpathy
Andrej Karpathy
5 months
# On the "hallucination problem" I always struggle a bit with I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines. We direct their dreams with prompts. The prompts start the dream, and based on the…
758
3K
15K
8
19
104
Very excited to share this demo of DetectGPT! Looking forward to feedback of all kinds, even re: my questionable web design skills... Demo: More info: 🚨 THIS IS A RESEARCH PROJECT; DO NOT USE FOR PRODUCTION/ANYTHING IMPORTANT 🚨
5
28
100
@ericmitchellai
Eric
5 months
PSA: ***the point of dpo is NOT to skip reward modeling*** ***the point of dpo is to skip EVERYTHING BUT reward modeling*** thank you for coming to my ted talk ❤️ (yes the paper could have explained this more clearly)
4
4
78
@ericmitchellai
Eric
2 months
Three models, three different answers 😎 Claude 3 is AGI confirmed Separately, what will it take to get a model to actually ask "do you want the answer for the beginning or end of day 4"? this question as stated is ambiguous
Tweet media one
5
6
79
Welcome Stanford CS PhD admits!! Please reach out (DM, email, messenger pigeon) if you'd like to chat :)
0
2
75
@ericmitchellai
Eric
5 months
Attn-free models are sweet Must be careful though not to assume matching ppl of transformers means we've matched transformers' ability to few-shot learn, fine-tune well, recall from long ctx, etc. these depend on arch's inductive bias. They could be worse, or stronger already!
@AiEleuther
EleutherAI
5 months
It was great to see a lot of excitement about attention-free models @NeurIPSConf ! We had great conversations with many people interested in next-gen architectures for language models. Pic from Systems for foundational models and foundation models for systems by Chris Re
Tweet media one
1
3
30
3
2
69
🧵Language models have an unfortunate tendency to contradict themselves. Our #emnlp2022 oral presents Consistency Correction w/ Relation Detection (ConCoRD), which overrides low-confidence LM predictions to boost self-consistency & accuracy. Paper/code:
Tweet media one
1
14
69
@ericmitchellai
Eric
4 months
"...Rather than needing separate transformers for the reward fn & LLM, given an LLM, you can find the reward fn (+ regularization term) that the LLM is best at maximizing. DPO trains the LLM...to make that reward fn (implicitly defined by the LLM) consistent w the human prefs..."
@stanfordnlp
Stanford NLP Group
4 months
There is a lovely, warm, and enthusiastic writeup of Direct Preference Optimization (DPO) by @rm_rafailov , @archit_sharma97 , and @ericmitchellai from @NeurIPSConf 2023, leading this week’s issue of The Batch newsletter. Thanks, so much, @AndrewYNg !
Tweet media one
2
52
244
1
9
60
@ericmitchellai
Eric
5 months
@Diyi_Yang gave a wonderful overview of possibilities & immediate challenges of learning from human feedback today SUCH a juicy area to dive into right now **hint hint junior PhD students work on this k thanks**
Tweet media one
2
4
61
More fun than I expected answering qs about large language models; hope people find some interesting/useful tidbits in here. Thanks @Stanford for the opportunity!
@Stanford
Stanford University
1 year
Are they sentient? Are they safe? Will they take my job? Stanford PhD student @ericmitchellai answers the internet’s questions on AI chatbots 👇
11
20
88
4
4
58
@ericmitchellai
Eric
3 years
Ever-larger language models are unwieldy for both researchers and maintainers. To enhance re-usability and increase their useful lifetime, we'd like to tweak them without full re-training/fine-tuning. MEND edits models with 10^6 to >10^10 params in one fine-tuning step. (1/8)
@chelseabfinn
Chelsea Finn
3 years
Large language models (LLMs) often make mistakes that are difficult to correct. We study the problem of quickly editing these models: Paper: Code: w/ @_eric_mitchell_ , C. Lin, @ABosselut , @chrmanning thread 🧵👇
4
121
557
2
19
60
@ericmitchellai
Eric
6 months
CaMeLS code is here 🐫🐫🐫 Use it to train models that identify what tokens contain the most important information in a document, without any per-token annotations! Paper link, in case you missed it:
Tweet media one
@NathanHu12
Nathan Hu
6 months
@ericmitchellai Eric is awesome - I learned so much from his mentorship throughout this project. And huge thanks to @chelseabfinn & @chrmanning for their support 🙏. The CaMeLS code is now available at 🐪🐫. My DMs + email are open if y'all want to chat more 😁.
0
5
24
0
17
59
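A rough sketch of the idea described in these tweets: a weighting model scores how informative each token is, and fine-tuning uses a per-token reweighted cross-entropy. This is an illustration of weighted adaptation only; the meta-learning procedure that actually trains the weighting model in CaMeLS is not shown.

```python
import torch.nn.functional as F

def weighted_adaptation_loss(lm_logits, labels, token_weights):
    """Fine-tuning loss where each token's cross-entropy is rescaled by an
    importance weight (e.g. produced by a learned weighting model).

    lm_logits: (batch, seq, vocab); labels: (batch, seq);
    token_weights: (batch, seq) non-negative importance scores.
    """
    per_token_ce = F.cross_entropy(
        lm_logits.transpose(1, 2), labels, reduction="none")  # (batch, seq)
    # Up-weight informative tokens, down-weight boilerplate ones
    return (token_weights * per_token_ce).sum() / token_weights.sum()
```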
@ericmitchellai
Eric
4 months
Data is an underexplored frontier in preference-based LLM training; for any RL*F with a learned reward (PPO, DPO, IPO, etc.), the preference data is really the limiting factor! Exploration during training (eg PPO sampling) is **useless** if your reward fn is inaccurate there!!
@argilla_io
Argilla
4 months
🔥 More is less for DPO, high quality matters! 📢 Dropping our first open dataset and LLM of the year: 💾Meet distilabel Orca Pairs DPO, an improved version of the now famous dataset from @intel 🏛️And a new OpenHermes model outperforming baselines with 54% less DPO pairs 🧵
Tweet media one
5
46
232
1
18
58
@ericmitchellai
Eric
1 month
ICML tip: upload your paper pdf and review to Claude/GPT4/Gemini w prompt "What do you think is the main point of the paper? After answering, please explain to what extent, if any, you think the reviewer has fully understood the main point of the paper." Better than therapy 🥹
1
3
57
@ericmitchellai
Eric
5 months
Some extra analysis we did: On Anthropic-HH-helpful, peek at the response length & winrate for different hparams for each method; each dot is a ckpt each 20k steps tldr: there’s a p clear relationship between length/winrate; DPO escapes the frontier, but that could be luck
Tweet media one
@archit_sharma97
Archit Sharma
5 months
So, since we are posting science straight to twitter now, @ericmitchellai and I have some updates for potential overfitting in DPO. TL;DR: we compared DPO to IPO and cDPO (DPO + label noise) on 3 different datasets, and we didn't observe any significant advantage (yet). 🧵->
Tweet media one
7
32
177
2
13
54
@ericmitchellai
Eric
4 months
Fantastic stuff, congrats to all authors. The self-improvement is super super cool to see... but the impact of the scoring prompt is particularly intriguing. The difficulty of multiple choice eval reminds me a bit of the "generative AI paradox"?
@jaseweston
Jason Weston
4 months
🚨New paper!🚨 Self-Rewarding LMs - LM itself provides its own rewards on own generations via LLM-as-a-Judge during Iterative DPO - Reward modeling ability improves during training rather than staying fixed ...opens the door to superhuman feedback? 🧵(1/5)
Tweet media one
5
223
1K
1
6
52
@ericmitchellai
Eric
6 months
Very curious to see how far we can push training to simply *not hallucinate.* It won't give us perfect models, but it seems like really meaningful (more than 50%) reduction in factual errors might be possible. Needs to be scaled up 😀 full thread on the way!
@_akhaliq
AK
6 months
Fine-tuning Language Models for Factuality paper page: The fluency and creativity of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines. Yet language models are prone…
Tweet media one
6
48
256
0
9
49
@ericmitchellai
Eric
5 months
Leaving lovely Singapore for NeurIPS in Nola- reach out (DM here is easiest) if you'd like to chat RLHF/DPO/alignment generally, LLM reasoning, continual learning, uncertainty in LLMs, or other stuff related to reliable & safe AI! By the way, I am on the job market this year :)
0
2
49
There are other goodies in the experiments… for example, we explore robustness of detection to machine-generated text that has been partially revised. Check out the paper for more (and website for code/demo soon)
1
8
47
@ericmitchellai
Eric
5 months
Is a flight from SFO to Changi long enough to implement PPO-RLHF from scratch? and actually make it work? ?????
4
1
44
Bring on the questions! + tune in to the final video for the chance to see me stare off at more bright lights! maybe?
@Stanford
Stanford University
1 year
Stanford PhD student @ericmitchellai will be stepping into our 🎥 studio to answer the internet’s questions on #ChatGPT , GPT-4, and other large language models. What do you want to know?
Tweet media one
52
17
105
3
4
40
@ericmitchellai
Eric
4 months
I think @argilla_io is on to something We need dedicated, streamlined tooling for collecting (continual) feedback Going beyond static datasets (eg, doing *multiple* rounds of DPO with online preferences) is low hanging fruit for improving open source LLMs IMO
@argilla_io
Argilla
4 months
🚀 Open-source AI strikes again! Announcing Notux 8x7B, a fine-tune of Mixtral Instruct with high-quality chat data and DPO. Notux now the top ranked MoE on the Open LLM leaderboard.
Tweet media one
8
84
436
1
2
39
@ericmitchellai
Eric
5 months
v cool work @ethayarajh @winniethexu et al.! ❤️ non-pref learning! DPO's update rule pushes up chosen & down rejected interesting to compare w/ KTO's update rule looks ~similar (when we have 👍 & 👎 examples for the same prompt), but different learning rate (check my math?)
Tweet media one
@ethayarajh
Kawin Ethayarajh
5 months
📢The problem in model alignment no one talks about — the need for preference data, which costs $$$ and time! Enter Kahneman-Tversky Optimization (KTO), which matches or exceeds DPO without paired preferences. And with it, the largest-ever suite of feedback-aligned LLMs. 🧵
Tweet media one
19
130
699
2
9
35
@ericmitchellai
Eric
4 months
If only people gave me the respect ChatGPT does
Tweet media one
5
0
35
@ericmitchellai
Eric
4 months
Had a ton of conversations at EMNLP/NeurIPS about the potential of RL(non-human)F for improving capabilities like factuality, reasoning, and coding. We did it for factuality here: Awesome to see it work for reasoning too! Down with human feedback!!
@sybilhyz
Peiyi Wang
4 months
🔥Excited to share our latest work: Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. With Math-Shepherd, Mistral-7B fine-tuned on MetaMATH achieves accuracy rates of 89.1% and 43.5% on GSM8K and MATH, respectively. Paper:
17
49
228
1
1
35
@ericmitchellai
Eric
4 months
Awesome work addressing one potential issue with DPO (the reward modeling objective only weakly constrains the policy update). Maybe you'd get similar results by simply pushing down all the tokens in the vocab with low prior prob in the neg sequence, not just the observed token?
@fly51fly
fly51fly
4 months
[CL] Some things are more CRINGE than others: Preference Optimization with the Pairwise Cringe Loss J Xu, A Lee, S Sukhbaatar, J Weston [Meta] (2023) - Preference learning is commonly used to align large language models, using pairwise labels indicating…
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
6
44
1
2
33
Finally set aside some time to learn @Gradio . Worth the time investment if you often find yourself needing to share quick ML demos. Thanks to @_akhaliq + others from @Gradio for the help porting over the DetectGPT demo! Check out the HF space here:
1
6
32
@ericmitchellai
Eric
10 months
Just got to Honolulu for #ICML2023 and consumed my first of hopefully many poke bowls for the week. Reach out (email, DM) if you want to hang out or chat LLMs/continual learning/RLHF (or anything)! PS- I'm on the job market this fall & happy to chat about that too 😀
2
1
31
@ericmitchellai
Eric
9 months
@CJessCooke @benji_smith I'm genuinely confused and would love to understand- how does this project harm authors? My understanding is that it counts the number of positive/negative sentiment words in a book and makes that info public.
4
0
29
@ericmitchellai
Eric
4 months
@Teknium1 master is a strong word but I've trained one or two DPO models in my day
1
0
29
@ericmitchellai
Eric
5 months
Come by the #NeurIPS2023 Instruction Following workshop (room 220-222) to see our work on: *Emulated fine-tuning*: RLHF without fine-tuning! *Fine-tuning for factuality*: how to fine-tune LLMs directly for factuality, reducing hallucination by >50% RIGHT NOW!!!
1
5
28
@ericmitchellai
Eric
5 months
People say companies making LLMs will never release pre-training data bc of liability. I get that, but why not just release a teeny tiny near-IID sample of the data? Enough to learn the rough stats, format, etc.? Can filter out anything "incriminating" From a convo @neurips ...
3
3
29
@ericmitchellai
Eric
2 months
In light of all of the discussion of "self-awareness", remember that we are still *explicitly telling* our systems: - who they are - what they do and don't know (stuff up to August 2023) - what their personality/tendencies are Is personality emergent or... part of the prompt?
@AmandaAskell
Amanda Askell
2 months
Here is Claude 3's system prompt! Let me break it down 🧵
Tweet media one
122
544
3K
1
0
28
One way to detect LM text is with a trained classifier (another LM). This works, but can overfit to the models/topics it was trained on. Instead, if we can access the LM itself, we can use its own log probabilities to do detection *zero-shot*, without any training at all!
4
5
27
@ericmitchellai
Eric
5 months
At EMNLP! hmu/DM to chat reliability in LMs (factuality/reasoning/alignment/...), or just grab sambal stingray 😀 ALSO the amazing @NathanHu12 & @kattian_ (both on PhD market!) are presenting work on keeping LMs up-to-date & mitigating overconfidence, respectively CHAT W THEM!
1
2
27
Many asking about state of AI-generated text detection. Progress is exciting! But I cannot overstate: NO EXISTING TOOL is ready to justify real-world disciplinary decisions. This is still research; we don't even have open/standardized benchmarks yet! See
2
6
27
@ericmitchellai
Eric
6 months
We call this version of EFT “up-scaling” a small fine-tuned model We can recover most of the improvements in factuality that we would see if we *had* fine-tuned the larger model (bottom row of results), without fine-tuning! Again, no new sampling hyperparams here
Tweet media one
2
2
27
@ericmitchellai
Eric
4 months
Based on his performance in our last interview prep session (i.e.: 😬) I think @archit_sharma97 probably should have spent his last 30 minutes grinding leetcode rather than "hacking" my twitter just my 2 cents ❤️🥰😘
@ericmitchellai
Eric
4 months
If only people gave me the respect ChatGPT does
Tweet media one
5
0
35
1
0
26
@ericmitchellai
Eric
7 months
Super excited to see folks having success with DPO as an alternative to PPO-based RLHF; congrats Lewis & the rest of the team! Train your own DPO LLMs either with TRL's DPOTrainer () or with the standalone DPO repo () Have fun 😀
@_lewtun
Lewis Tunstall
7 months
Here's a simple recipe to train a 7B model that outperforms Llama2 70B on MT Bench 🥇 1. SFT Mistral 7B on the UltraChat dataset 2. Align the SFT model to the UltraFeedback dataset with "direct preference optimisation" (DPO) Demo: More details in the 🧵
Tweet media one
20
202
955
1
0
27
@ericmitchellai
Eric
11 months
Here's the conceptual punchline. ☹️ Current RLHF: train a reward model to align w/ human prefs. THEN train a policy (w/ e.g. PPO) to maximize reward 😊 DPO: directly train a policy (without RL!) that is the optimal policy for an implicit reward function aligned w/ human prefs
1
2
27
@ericmitchellai
Eric
5 months
Presenting this work RIGHT NOW, led by @NathanHu12 ! Come hear about LLMs that autonomously keep themselves more up-to-date with the world!! We can do this with ✨✨ m e t a - l e a r n i n g ✨✨ Poster 𝟮𝟯𝗕!!! + our alg is called CaMeLS 🐪, just another reason to come 🥹
@chelseabfinn
Chelsea Finn
5 months
Can LLMs keep themselves up to date by reading the news? Fine-tuning on news articles doesn't work. Using meta-learning, we can reweight news article tokens so that fine-tuning works. @NathanHu12 & @ericmitchellai presenting this work at #EMNLP2023 this week!
Tweet media one
8
34
209
0
7
25
@ericmitchellai
Eric
2 months
How fortunate I am to work with such a wise city lad 🙇
@rtaori13
Rohan Taori
2 months
“One needs to learn to love and enjoy the little things in life. One also needs to discover one’s true calling and then should do everything to pursue the selected path,” - wise words @archit_sharma97
3
4
50
1
0
24
@ericmitchellai
Eric
6 months
What does this decomposition get us? Say we fine-tune a llama-7b model (somehow); we can “emulate” the result of doing that same fine-tuning procedure on a 70B model by sampling from: 70b-base lps + (7b-fine-tuned lps - 7b-base lps) No fine-tuning (or sampling hparams) needed!
Tweet media one
1
3
24
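A minimal sketch of the combination rule quoted in the tweet above (large-base log-probs plus the small fine-tuned-minus-base difference); real decoding loops, tokenizer alignment across model sizes, and any renormalization details are omitted, and names here are illustrative.

```python
import torch

def eft_upscaled_logits(base_large_logits, ft_small_logits, base_small_logits):
    """Emulated fine-tuning ("up-scaling"): combine next-token log-probs as
    large-base + (small-fine-tuned - small-base), then sample as usual.

    All inputs are next-token logits of shape (batch, vocab) for the same
    prefix; the models are assumed to share a tokenizer.
    """
    logp_large = torch.log_softmax(base_large_logits, dim=-1)
    logp_ft = torch.log_softmax(ft_small_logits, dim=-1)
    logp_base = torch.log_softmax(base_small_logits, dim=-1)
    # The small-model log-prob difference acts as a per-token reward
    return logp_large + (logp_ft - logp_base)
```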
@ericmitchellai
Eric
6 months
IPO and cDPO (maybe "robust DPO" is a better name?) models are cooking @archit_sharma97 crafting some artisanal RLAIF 🧑‍🍳
3
0
24
The code for DetectGPT is now available on the project website (along with the demo): Thanks to everyone who's tried out the demo ❤️ The feedback has been really interesting (keep it coming!)
3
5
23
@ericmitchellai
Eric
6 months
practically, this is a basically trivial change to the DPO loss, and it ends up looking qualitatively a lot like IPO would be AMAZING if someone did some careful comparisons between these (at least DPO/cDPO/IPO) and reported back... I'm really supposed to be writing job apps rn
2
0
23
@kchonyc Thank you for communicating the status clearly and regularly! Hard to control what reviewers do, but as an author it's nice to at least hear from organizers what the plan/status is, even if things are incomplete/delayed.
0
0
23
@ericmitchellai
Eric
11 months
This project was an unbelievably exciting and educational experience w/ co-leads @rm_rafailov and @archit_sharma97 + our insightful advisors @chelseabfinn , @chrmanning , and @StefanoErmon Check out the paper + code teaser on arXiv (full code soon!):
3
1
23
@ericmitchellai
Eric
5 months
Hmm new mistral models have toggleable safety tuning behavior through a system prompt Curious to see if this is more jailbreakable (has prior work studied this?) Can't actually unlearn any bad behaviors this way (not that we're really unlearning bad stuff currently)
Tweet media one
3
3
23
@ericmitchellai
Eric
3 months
super annoying to prompt an LLM w/ high-level feedback, only for it to apply the feedback in places it shouldn't. See Moritz's 🧵 on how RL(Verbal)F reduces "feedback overgeneralization". We also do some math analyzing an increasingly popular pattern for generating RL*F prefs 👇
@at_code_wizard
Moritz Stephan
3 months
[1/6] Excited to share “RLVF: Learning from Verbal Feedback without Overgeneralization” Our method C3PO fine-tunes an LLM from 1 sentence of feedback and decreases overgeneralization (=applying the feedback when it should not be applied). Details:
2
19
89
2
1
22
@ericmitchellai
Eric
10 months
Interested in detecting text generated by language models? Come see poster #609 in Exhibit Hall 1 at #ICML2023 **today** from 11am-12:30pm! You can also come to the oral presentation in Ballroom C (oral session B1) at 4:32pm 😊
Tweet media one
0
3
22
Quick q: What do we expect the log probability function to look like in the neighborhood of a model sample? We hypothesized that a model's samples are usually in local maxima of its log probability function, or more generally, in areas of negative curvature. Spoiler: they are!
Tweet media one
13
3
22
But how to measure curvature of an LM's logprob, you ask?? We can approximate *directional second derivatives* by perturbing the text a bit with T5 & comparing the logprob under the LM before and after. Add Hutchinson's trace estimator & we get approximate trace of the Hessian.
Tweet media one
3
5
22
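A sketch of the zero-shot scoring rule described in this thread, under the assumption that `log_prob` returns the source LM's log-probability of a passage and `perturb` returns a lightly rewritten version (e.g. via T5 mask-filling); both callables are placeholders for this illustration, not the paper's API.

```python
def detectgpt_score(text, log_prob, perturb, n_perturbations=100):
    """Approximate the local curvature of the LM's log-prob around `text`:
    log p(text) minus the mean log-prob of perturbed rewrites of it.
    """
    original = log_prob(text)
    perturbed = [log_prob(perturb(text)) for _ in range(n_perturbations)]
    score = original - sum(perturbed) / len(perturbed)
    # A large positive score suggests `text` sits near a local maximum of
    # log p, i.e. is likely model-generated; thresholding this score gives
    # the detector.
    return score
```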
@ericmitchellai
Eric
11 months
Here’s the idea (summary in pic): Rearrange to get the reward as a function of the optimal policy for that reward fn Now a single substitution turns our simple Bradley-Terry loss on reward functions into a simple loss on policies Bam, RLHF without RL!
Tweet media one
1
2
21
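The substitution summarized in this thread, written out as a sketch of the key step (notation as in the DPO paper):

```latex
% Optimal policy of the KL-constrained reward maximization problem:
%   \pi_r(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\exp\!\big(r(x,y)/\beta\big)
% Rearranging gives the reward as a function of its optimal policy:
r(x,y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
% Substituting into the Bradley-Terry loss (the Z(x) terms cancel):
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]
```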
So much fun working on this with: @yoonholeee @SashaKhazatsky @chrmanning @chelseabfinn Also extremely grateful for the support of Stanford's Center for Research on Foundation Models @StanfordCRFM in running experiments on some very large LMs!!
7
2
21
AI-powered search engines (New Bing, etc.) are super cool! Do they understand the flow of time, i.e., that the "right answer" to a question is fluid? 🧵 w/ example: "when does REAL ID go into effect?"
1
4
20
@ericmitchellai
Eric
6 months
so, RLHF reward modeling (for both DPO and “OG” RLHF) trains with a binary cross entropy loss where our predicted probability is sigmoid(r(chosen) - r(rejected)) and the target prob is 1 but are we really confident enough in our training data to use a target prob of exactly 1???
2
0
19
@ericmitchellai
Eric
6 months
i don't know how important this 'stop optimizing after a fixed amount of improvement' thing is, but based on some anecdotes I've heard from larger labs trying DPO at larger scale/on huge pref datasets, I have a hunch this could be helpful
1
0
19
@ericmitchellai
Eric
4 months
@Teknium1 This is normal. One intuition I have is that after SFT, you're near a local maximum for logprob of chosen, and any way you change the model will decrease logprob of chosen. ie, it's worth lowering the logprob of good stuff if we lower logprob of bad stuff more
1
1
18
DetectGPT takes this approximate Hessian trace, and simply thresholds it to get a detector. Hessian trace very negative? Probably a model sample! Turns out this quantity discriminates between human-written and model-generated text very well, for various models and scales.
Tweet media one
3
2
18
@ericmitchellai
Eric
6 months
EFT is similar in spirit to the lovely concurrent “Reward-Augmented Decoding” by @HaikangDeng @colinraffel However the DPO logprob-ratio reward parameterization used in EFT means we don’t need a separate forward pass for each possible next token!
1
1
17
@ericmitchellai
Eric
5 months
Apparently my math photo comparing DPO/KTO isn't showing? Reposting to try again? @winniethexu
Tweet media one
@ericmitchellai
Eric
5 months
v cool work @ethayarajh @winniethexu et al.! ❤️ non-pref learning! DPO's update rule pushes up chosen & down rejected interesting to compare w/ KTO's update rule looks ~similar (when we have 👍 & 👎 examples for the same prompt), but different learning rate (check my math?)
Tweet media one
2
9
35
3
3
17
Does it work? DetectGPT consistently improves AUROC (prob. a random pair of fake/human text is correctly classified) over existing zero-shot methods, for models with 100M to 175B parameters. It's also competitive with supervised classifiers, outperforming them in some domains.
Tweet media one
2
2
16
@ericmitchellai
Eric
2 months
More work showing how careful (or careless!) data selection hugely impacts model quality. DPO (offline RL generally) is powerful, but you still need to train on data worth learning from! The data specifies the implicit reward function you're ultimately optimizing, after all...
@_philschmid
Philipp Schmid
2 months
How good is AI Feedback, and does it really help us improve LLMs? 🧠 A new paper, “A Critical Evaluation of AI Feedback for Aligning Large Language Models,” takes a critical look at the effectiveness of AI feedback when doing RLHF (DPO). Experiment: 1️⃣ Create a dataset of…
Tweet media one
2
15
89
2
2
16
@ericmitchellai
Eric
6 months
one supposed issue with RLHF/DPO (which hasn’t really been proven on real-world problems) is that this loss will try to increase the reward of the ‘preferred’ response and decrease the reward of the ‘dispreferred’ response *forever*... this might be undesirable with noisy prefs
1
0
16
@ericmitchellai
Eric
5 months
Come see @kattian_ 's work on the ability of RLHF'd LLMs to *directly verbalize* probabilities (yes that's right, as tokens) that are actually pretty well calibrated! (Usually better than the log probs!!! 🤯🤯🤯) Poster 14B R I G H T N O W until 3:30 SG time!!
1
1
16
@ericmitchellai
Eric
6 months
since we’re probably not 100% confident in our preference labels, our target prob should be something more like 1 - eps in this case, we get a slightly different BCE loss that stops optimizing after some fixed amount of improvement, so we don’t just “optimize forever”
1
0
15
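A sketch of the label-smoothed ("conservative") variant these tweets describe: treat each preference label as correct with probability 1 - eps rather than exactly 1, so the loss stops pushing once a fixed reward margin is reached. Argument names follow the earlier DPO sketch and are illustrative.

```python
import torch.nn.functional as F

def cdpo_loss(chosen_rewards, rejected_rewards, eps=0.1):
    """BCE on implicit rewards with a smoothed target of 1 - eps instead
    of 1; the gradient vanishes once the reward margin is large enough,
    rather than growing forever as in plain DPO.

    chosen_rewards / rejected_rewards: beta-scaled log-prob ratios.
    """
    logits = chosen_rewards - rejected_rewards
    return -((1 - eps) * F.logsigmoid(logits)
             + eps * F.logsigmoid(-logits)).mean()
```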
@ericmitchellai
Eric
4 months
@sirbayes I think @BlancheMinerva would know best... But here the authors find some memorization even at very low rates of duplication: I think the tldr is large models can memorize even with very little (no?) duplication Stella can correct me if I'm wrong 😅😅
1
0
14
Biden starts with "K"?? Some confident reasoning/factual errors with GPT-4 & (IMO) surprising out-of-control repetition. Unsure how to interpret GPT-4's reasoning abilities given these results. Test scores are impressive, but this model clearly couldn't be a lawyer/doctor/etc
Tweet media one
Tweet media two
Tweet media three
2
3
14
But how? Well, one simple approach is to measure the log probability of the text under the model. High log prob -> model sample. This works, but DetectGPT takes a different approach that turns out to be consistently more accurate in our experiments.
1
2
13
@ericmitchellai
Eric
6 months
oh also cDPO and IPO are both basically trivial to implement in the DPO codebase, they are very small modifications to the loss, will probably add these soon next time i'm not applying to jobs (or someone could PR 🥺)
0
0
13
@ericmitchellai
Eric
11 months
Check out the paper for more theoretical analysis of DPO (led by @rm_rafailov ) as well as additional experiments. Don’t worry, we also did a human study to make sure our GPT-4-based eval metrics were sensible 🙂
Tweet media one
1
0
13
@ericmitchellai
Eric
6 months
We can even sample from *combinations* of rewards, w/o re-training each combination! We train one 7B model for helpfulness, one for harmlessness & interpolate to produce a frontier. The frontier is strictly improved by up-scaling the 7B rewards to 70B, all w/o fine-tuning!
Tweet media one
1
2
13
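Under the same assumptions as the earlier EFT sketch, interpolating two emulated rewards is just a weighted sum of the small-model log-prob differences; a hypothetical illustration:

```python
def eft_interpolated_logits(base_large_logp, help_small_logp,
                            harm_small_logp, base_small_logp, alpha=0.5):
    """Mix a 'helpful' and a 'harmless' fine-tuned small model by
    interpolating their implicit rewards, then up-scale with the large base.

    All arguments are next-token log-prob tensors over the vocabulary;
    alpha trades off helpfulness (1.0) against harmlessness (0.0).
    """
    reward_help = help_small_logp - base_small_logp
    reward_harm = harm_small_logp - base_small_logp
    return base_large_logp + alpha * reward_help + (1 - alpha) * reward_harm
```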
@ericmitchellai
Eric
5 months
Cool to see this eval!! Hard to say two algs will be the same in all conditions, but great to see independent verification that DPO is indeed v similar to PPO, while being a whole lot simpler/cheaper Ofc still v curious about if there are "corner cases" where DPO/PPO differ
@lxuechen
Xuechen Li
5 months
Belatedly, I finally had a chance to update the AlpacaFarm paper with DPO results. TL;DR: DPO performs similarly to RLHF+PPO but is much more memory-friendly. Previously, PPO fine-tuning took ~2 hours on 8 A100 GPUs. Our DPO runs take about the same time on 4 GPUs. DPO with LoRA…
Tweet media one
4
27
235
0
1
13
@ericmitchellai
Eric
4 months
Excited for this, and a bit surprised it took this long; low-hanging fruit tbh for increasing the immersiveness/usefulness of LLMs. Maybe harder from the eng perspective than research, depending on the implementation. Personalization a big theme of 2024?
@AndrewCurran_
Andrew Curran
4 months
Personalization just went live.
Tweet media one
57
128
2K
1
0
13
@ericmitchellai
Eric
3 months
Honestly enjoying talking to Gemini Ultra so far. Its conversational style is pretty natural and empathetic without being too moralizing. Fails the apple test though 😅 ht r/localllama
Tweet media one
0
0
12
@ericmitchellai
Eric
6 months
also big hugs to @JoeyHejna for looking over this "paper" before i blast it into cyberspace
1
0
12
@ericmitchellai
Eric
9 months
@hazmo89 @benji_smith Believe it or not, some people really do try to understand viewpoints that differ from their own. However I can see you don't make a habit of doing so.
0
0
12
@ericmitchellai
Eric
9 months
@goodside Training your new model to minimize cross entropy with samples from the existing LM is (in expectation) equivalent to minimizing the (forward) KL between the old (teacher) and new (student) models. ie your new "pretraining" data is just samples from your starting model.
2
0
12
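The equivalence claimed in this reply, written out as a one-line identity (teacher p, student q):

```latex
% Expected cross-entropy of the student q on samples from the teacher p:
\mathbb{E}_{y \sim p}\!\left[-\log q(y)\right] = H(p) + \mathrm{KL}\!\left(p \,\|\, q\right)
% H(p) does not depend on q, so minimizing the left-hand side over q
% is the same as minimizing the forward KL from teacher to student.
```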
@ericmitchellai
Eric
5 months
Antoine "back in my day GPTs didn't have numbers yet" Bosselut throwing down the "what even is mechinterp" gauntlet right from the start ⁦ @ABosselut
Tweet media one
1
0
11
@ericmitchellai
Eric
6 months
We can actually interpret the logprob difference term as a reward function that *would* take us from the base/reference model to the fine-tuned model, if we did RL (see the DPO paper!) In EFT, we’re just re-weighting the base model logprobs with the reward, token by token!
Tweet media one
Tweet media two
1
1
11
@ericmitchellai
Eric
7 months
@bitcloud Peter apparently missed the fact that the post in question was an April Fool's joke...
1
0
10
One of the coolest winners of the inverse scaling prize IMO was failure of modus tollens (reasoning incorrectly when presented with P -> Q, ~Q as inputs). RLHF maybe helps? GPT-3.5 and GPT-4 get this right, but Bard can't overcome the prior that dogs aren't the only pet.
Tweet media one
Tweet media two
Tweet media three
2
0
11
@ericmitchellai
Eric
11 months
Let’s back up. RLHF is all about learning from preferences, i.e. a dataset D of prompts + chosen/rejected model response for each prompt. We 𝙘𝙤𝙪𝙡𝙙 train a reward model matching these human prefs using D + the Bradley-Terry preference model (for ex) as a loss function
Tweet media one
1
0
11
@ericmitchellai
Eric
9 months
@Axmortl @benji_smith To clarify, at no point did Benji ever train any generator model that produces text of any kind. He just computed basic statistics about the books & shared small snippets (like Google Books). This is completely different from models like midjourney/gpt-4, which produce content.
Tweet media one
Tweet media two
1
0
11
@ericmitchellai
Eric
6 months
There are lots more analyses in the paper, again led by Nathan, showing how weights generalize across datasets, where the gains from weighted adaptation come from, and more. Definitely check it out!! Code coming ASAP :)
Tweet media one
Tweet media two
1
1
10