@andriy_mulyar
@sleepinyourhat
@srush_nlp
@chrmanning
@mdredze
@ChrisGPotts
Basically, we built the engine (GPT-4), but not yet the seatbelts, brakes, windshield wipers, ABS, etc.
Continual learning/model editing. Models are getting better but not more updateable. Also safety mechanisms for open-sourced models, or tools to detect machine-generated text.
ChatGPT (and others) generate very fluent (but not always truthful) text.
Some worry that teachers, news-readers (like you!), and society in general will be swamped with AI-generated content.
That's why we built DetectGPT, a method for detecting if text comes from an LM.
RLHF is the 🪄 getting us from GPT-3 to ChatGPT.
But RLHF is hard! Need to train a reward model, then do RL on a big LM (w/ expensive sampling & tuning)
𝙊𝙧 𝙙𝙤 𝙮𝙤𝙪?
Introducing Direct Preference Optimization (DPO), a simple classification loss provably equivalent to RLHF
RLHF is powerful; it lets us fine-tune LLMs to be more useful.
What if we could do RLHF… without fine-tuning???
Excited to share Emulated Fine-Tuning (EFT)!
EFT lets us “emulate” what we would have gotten if we did RLHF on a new model, without actually doing the RLHF!
DPO (fast, simple, performant RLHF) code is here!
With DPO there's 𝗻𝗼 𝗿𝗲𝘄𝗮𝗿𝗱 𝗺𝗼𝗱𝗲𝗹 𝗼𝗿 𝗥𝗟 𝗻𝗲𝗲𝗱𝗲𝗱.
It's finally easy to fine-tune llama from human preferences 😊
Can't wait to see the cool models people train with it 🤓
advisors say I should be “not doing research” & “getting a job”
alas due to recent RLHF DPO/IPO/PPO debates I wrote a 1pg mini-paper
tldr: assuming noisy pref data gives a 'conservative DPO', might make DPO stabler late in training (& looks like IPO)
🧵
ChatGPT users know the dreaded “as of my knowledge cutoff…”
Can we keep LLMs up-to-date with continual fine-tuning?
Our EMNLP paper shows LMs may remember only a *tiny* fraction of the info they see in a data stream
It also shows meta-learning can improve knowledge uptake 🥹
Curious how to take the RL out of RLHF?
Come check out our
#ICML2023
workshop poster for Direct Preference Optimization (aka, how to optimize the RLHF objective with a simple classification loss)!
Meeting Room 316 AB, 10am/12:20/2:45 Hawaii time
Okay the DPO repo now supports conservative DPO (cDPO) & IPO loss!
cDPO/IPO both optimize the policy only until some fixed margin of improvement is met, rather than optimizing "forever" like DPO
tbh unclear how much of a difference this makes- we'll see!
Something interesting about Tulu-70b is that it gives short responses *that are still rated really highly by GPT-4.*
IMO this could be a signal that the model's improvement is more meaningful, since you can get GPT-4 to like you just by giving long responses
Check out our new 70B DPO model here:
AFAIK currently the best model on AlpacaEval with a public finetuning set!
More details once the AI sphere calms down a bit... 😅
IMO this is the wrong definition of "hallucination"
At least, it's not helpful- LM systems will never be 100% factual
I define "hallucination" to be **verbalized miscalibration**
i.e. the model expresses confidence it doesn't actually hold
What does this mean? Explained below
# On the "hallucination problem"
I always struggle a bit when I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.
We direct their dreams with prompts. The prompts start the dream, and based on the…
Very excited to share this demo of DetectGPT! Looking forward to feedback of all kinds, even re: my questionable web design skills...
Demo:
More info:
🚨 THIS IS A RESEARCH PROJECT; DO NOT USE FOR PRODUCTION/ANYTHING IMPORTANT 🚨
PSA:
***the point of dpo is NOT to skip reward modeling***
***the point of dpo is to skip EVERYTHING BUT reward modeling***
thank you for coming to my ted talk ❤️
(yes the paper could have explained this more clearly)
Three models, three different answers 😎
Claude 3 is AGI confirmed
Separately, what will it take to get a model to actually ask "do you want the answer for the beginning or end of day 4"? This question as stated is ambiguous
Attn-free models are sweet
Must be careful though not to assume matching ppl of transformers means we've matched transformers' ability to few-shot learn, fine-tune well, recall from long ctx, etc.
these depend on arch's inductive bias.
They could be worse, or stronger already!
It was great to see a lot of excitement about attention-free models
@NeurIPSConf
! We had great conversations with many people interested in next-gen architectures for language models.
Pic from "Systems for foundation models and foundation models for systems" by Chris Ré
"...Rather than needing separate transformers for the reward fn & LLM, given an LLM, you can find the reward fn (+ regularization term) that the LLM is best at maximizing. DPO trains the LLM...to make that reward fn (implicitly defined by the LLM) consistent w the human prefs..."
@Diyi_Yang
gave a wonderful overview of possibilities & immediate challenges of learning from human feedback today
SUCH a juicy area to dive into right now
**hint hint junior PhD students work on this k thanks**
More fun than I expected answering qs about large language models; hope people find some interesting/useful tidbits in here.
Thanks
@Stanford
for the opportunity!
Ever-larger language models are unwieldy for both researchers and maintainers.
To enhance re-usability and increase their useful lifetime, we'd like to tweak them without full re-training/fine-tuning.
MEND edits models with 10^6 to >10^10 params in one fine-tuning step. (1/8)
Large language models (LLMs) often make mistakes that are difficult to correct.
We study the problem of quickly editing these models:
Paper:
Code:
w/
@_eric_mitchell_
, C. Lin,
@ABosselut
,
@chrmanning
thread 🧵👇
CaMeLS code is here 🐫🐫🐫
Use it to train models that identify what tokens contain the most important information in a document, without any per-token annotations!
Paper link, in case you missed it:
@ericmitchellai
Eric is awesome - I learned so much from his mentorship throughout this project. And huge thanks to
@chelseabfinn
&
@chrmanning
for their support 🙏 .
The CaMeLS code is now available at 🐪🐫.
My DMs + email are open if y'all want to chat more 😁.
Data is an underexplored frontier in preference-based LLM training; for any RL*F with a learned reward (PPO, DPO, IPO, etc.), the preference data is really the limiting factor!
Exploration during training (eg PPO sampling) is **useless** if your reward fn is inaccurate there!!
🔥 More is less for DPO, high quality matters!
📢 Dropping our first open dataset and LLM of the year:
💾Meet distilabel Orca Pairs DPO, an improved version of the now famous dataset from
@intel
🏛️And a new OpenHermes model outperforming baselines with 54% fewer DPO pairs
🧵
ICML tip: upload your paper pdf and review to Claude/GPT4/Gemini w prompt
"What do you think is the main point of the paper? After answering, please explain to what extent, if any, you think the reviewer has fully understood the main point of the paper."
Better than therapy 🥹
Some extra analysis we did:
On Anthropic-HH-helpful, peek at the response length & winrate for different hparams for each method; each dot is a ckpt every 20k steps
tldr: there’s a p clear relationship between length/winrate; DPO escapes the frontier, but that could be luck
So, since we are posting science straight to twitter now,
@ericmitchellai
and I have some updates for potential overfitting in DPO.
TL;DR: we compared DPO to IPO and cDPO (DPO + label noise) on 3 different datasets, and we didn't observe any significant advantage (yet). 🧵->
Fantastic stuff, congrats to all authors
the self-improvement is super super cool to see... but the impact of the scoring prompt is particularly intriguing
The difficulty of multiple choice eval reminds me a bit of the "generative AI paradox"?
🚨New paper!🚨
Self-Rewarding LMs
- LM itself provides its own rewards on own generations via LLM-as-a-Judge during Iterative DPO
- Reward modeling ability improves during training rather than staying fixed
...opens the door to superhuman feedback?
🧵(1/5)
Very curious to see how far we can push training to simply *not hallucinate.* It won't give us perfect models, but it seems like a really meaningful (more than 50%) reduction in factual errors might be possible. Needs to be scaled up 😀 full thread on the way!
Fine-tuning Language Models for Factuality
paper page:
The fluency and creativity of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines. Yet language models are prone…
Leaving lovely Singapore for NeurIPS in Nola- reach out (DM here is easiest) if you'd like to chat RLHF/DPO/alignment generally, LLM reasoning, continual learning, uncertainty in LLMs, or other stuff related to reliable & safe AI!
By the way, I am on the job market this year :)
There are other goodies in the experiments… for example, we explore robustness of detection to machine-generated text that has been partially revised.
Check out the paper for more (and website for code/demo soon)
Stanford PhD student
@ericmitchellai
will be stepping into our 🎥 studio to answer the internet’s questions on
#ChatGPT
, GPT-4, and other large language models. What do you want to know?
I think
@argilla_io
is on to something
We need dedicated, streamlined tooling for collecting (continual) feedback
Going beyond static datasets (eg, doing *multiple* rounds of DPO with online preferences) is low hanging fruit for improving open source LLMs IMO
🚀 Open-source AI strikes again! Announcing Notux 8x7B, a fine-tune of Mixtral Instruct with high-quality chat data and DPO.
Notux is now the top-ranked MoE on the Open LLM leaderboard.
v cool work
@ethayarajh
@winniethexu
et al.! ❤️ non-pref learning!
DPO's update rule pushes up chosen & down rejected
interesting to compare w/ KTO's update rule
looks ~similar (when we have 👍 & 👎 examples for the same prompt), but different learning rate
(check my math?)
📢The problem in model alignment no one talks about — the need for preference data, which costs $$$ and time!
Enter Kahneman-Tversky Optimization (KTO), which matches or exceeds DPO without paired preferences.
And with it, the largest-ever suite of feedback-aligned LLMs. 🧵
Had a ton of conversations at EMNLP/NeurIPS about the potential of RL(non-human)F for improving capabilities like factuality, reasoning, and coding.
We did it for factuality here:
Awesome to see it work for reasoning too! Down with human feedback!!
🔥Excited to share our latest work:
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations.
With Math-Shepherd, Mistral-7B fine-tuned on MetaMATH achieves accuracy rates of 89.1% and 43.5% on GSM8K and MATH, respectively.
Paper:
Awesome work addressing one potential issue with DPO (the reward modeling objective only weakly constrains the policy update).
Maybe you'd get similar results by simply pushing down all the tokens in the vocab with low prior prob in the neg sequence, not just the observed token?
[CL] Some things are more CRINGE than others: Preference Optimization with the Pairwise Cringe Loss
J Xu, A Lee, S Sukhbaatar, J Weston [Meta] (2023)
- Preference learning is commonly used to align large language models, using pairwise labels indicating…
Finally set aside some time to learn
@Gradio
. Worth the time investment if you often find yourself needing to share quick ML demos. Thanks to
@_akhaliq
+ others from
@Gradio
for the help porting over the DetectGPT demo!
Check out the HF space here:
Just got to Honolulu for
#ICML2023
and consumed my first of hopefully many poke bowls for the week.
Reach out (email, DM) if you want to hang out or chat LLMs/continual learning/RLHF (or anything)!
PS- I'm on the job market this fall & happy to chat about that too 😀
@CJessCooke
@benji_smith
I'm genuinely confused and would love to understand- how does this project harm authors?
My understanding is that it counts the number of positive/negative sentiment words in a book and makes that info public.
Come by the
#NeurIPS2023
Instruction Following workshop (room 220-222) to see our work on:
*Emulated fine-tuning*: RLHF without fine-tuning!
*Fine-tuning for factuality*: how to fine-tune LLMs directly for factuality, reducing hallucination by >50%
RIGHT NOW!!!
People say cos making LLMs will never release pre-training data bc of liability
I get that, but why not just release a teeny tiny near-IID sample of the data?
Enough to learn the rough stats, format, etc.?
Can filter out anything "incriminating"
From a convo
@neurips
...
In light of all of the discussion of "self-awareness", remember that we are still *explicitly telling* our systems:
- who they are
- what they do and don't know (stuff up to August 2023)
- what their personality/tendencies are
Is personality emergent or... part of the prompt?
One way to detect LM text is with a trained classifier (another LM). This works, but can overfit to the models/topics it was trained on.
Instead, if we can access the LM itself, we can use its own log probabilities to do detection *zero-shot*, without any training at all!
At EMNLP! hmu/DM to chat reliability in LMs (factuality/reasoning/alignment/...), or just grab sambal stingray 😀
ALSO
the amazing
@NathanHu12
&
@kattian_
(both on PhD market!) are presenting work on keeping LMs up-to-date & mitigating overconfidence, respectively
CHAT W THEM!
Many asking about state of AI-generated text detection. Progress is exciting!
But I cannot overstate:
NO EXISTING TOOL is ready to justify real-world disciplinary decisions. This is still research; we don't even have open/standardized benchmarks yet!
See
We call this version of EFT “up-scaling” a small fine-tuned model
We can recover most of improvements in factuality that we would see if we *had* fine-tuned the larger model (bottom row of results), without fine-tuning!
Again, no new sampling hyperparams here
Based on his performance in our last interview prep session (i.e.: 😬) I think
@archit_sharma97
probably should have spent his last 30 minutes grinding leetcode rather than "hacking" my twitter
just my 2 cents ❤️🥰😘
Super excited to see folks having success with DPO as an alternative to PPO-based RLHF; congrats Lewis & the rest of the team!
Train your own DPO LLMs either with TRL's DPOTrainer () or with the standalone DPO repo ()
Have fun 😀
Here's a simple recipe to train a 7B model that outperforms Llama2 70B on MT Bench 🥇
1. SFT Mistral 7B on the UltraChat dataset
2. Align the SFT model to the UltraFeedback dataset with "direct preference optimisation" (DPO)
Demo:
More details in the 🧵
Here's the conceptual punchline.
☹️ Current RLHF: train a reward model to align w/ human prefs. THEN train a policy (w/ e.g. PPO) to maximize reward
😊 DPO: directly train a policy (without RL!) that is the optimal policy for an implicit reward function aligned w/ human prefs
Presenting this work RIGHT NOW, led by
@NathanHu12
!
Come hear about LLMs that autonomously keep themselves more up-to-date with the world!!
We can do this with ✨✨ m e t a - l e a r n i n g ✨✨
Poster 𝟮𝟯𝗕!!!
+ our alg is called CaMeLS 🐪, just another reason to come 🥹
Can LLMs keep themselves up to date by reading the news?
Fine-tuning on news articles doesn't work.
Using meta-learning, we can reweight news article tokens so that fine-tuning works.
@NathanHu12
&
@ericmitchellai
presenting this work at
#EMNLP2023
this week!
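A minimal sketch of the token-reweighting idea from the tweet above (my own illustration, not the CaMeLS code; `weight_model` is a hypothetical scorer that outputs per-token importance weights, and the LM is assumed to be an HF-style causal model):

```python
import torch.nn.functional as F

def weighted_adaptation_step(lm, weight_model, doc_ids, optimizer):
    """One online-adaptation step: reweight the next-token loss on a news
    article by how informative each token is, then take a gradient step."""
    logits = lm(doc_ids).logits[:, :-1, :]            # predict token t+1 from prefix
    targets = doc_ids[:, 1:]
    token_nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    weights = weight_model(doc_ids)[:, 1:].detach()   # importance weights, frozen at adaptation time
    loss = (weights * token_nll).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```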
“One needs to learn to love and enjoy the little things in life. One also needs to discover one’s true calling and then should do everything to pursue the selected path,” - wise words
@archit_sharma97
What does this decomposition get us?
Say we fine-tune a llama-7b model (somehow); we can “emulate” the result of doing that same fine-tuning procedure on a 70B model by sampling from:
70b-base lps + (7b-fine-tuned lps - 7b-base lps)
No fine-tuning (or sampling hparams) needed!
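A minimal sketch of that per-token combination (my own illustration, assuming HF-style causal LMs sharing a tokenizer; model names are placeholders, not the paper's code):

```python
import torch

@torch.no_grad()
def eft_upscaled_logprobs(input_ids, big_base, small_base, small_ft):
    """Emulated fine-tuning for the next token:
    big-base logprobs + (small-fine-tuned logprobs - small-base logprobs)."""
    def next_token_logprobs(model):
        logits = model(input_ids).logits[:, -1, :]
        return torch.log_softmax(logits, dim=-1)

    combined = (next_token_logprobs(big_base)
                + next_token_logprobs(small_ft)
                - next_token_logprobs(small_base))
    # Re-normalize, then sample the next token as usual.
    return torch.log_softmax(combined, dim=-1)
```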
The code for DetectGPT is now available on the project website (along with the demo):
Thanks to everyone who's tried out the demo ❤️ The feedback has been really interesting (keep it coming!)
practically, this is a basically trivial change to the DPO loss, and it ends up looking qualitatively a lot like IPO
would be AMAZING if someone did some careful comparisons between these (at least DPO/cDPO/IPO) and reported back...
I'm really supposed to be writing job apps rn
@kchonyc
Thank you for communicating the status clearly and regularly! Hard to control what reviewers do, but as an author it's nice to at least hear from organizers what the plan/status is, even if things are incomplete/delayed.
Hmm new mistral models have toggleable safety tuning behavior through a system prompt
Curious to see if this is more jailbreakable (has prior work studied this?)
Can't actually unlearn any bad behaviors this way (not that we're really unlearning bad stuff currently)
super annoying to prompt an LLM w/ high-level feedback, only for it to apply the feedback in places it shouldn't
see Moritz's 🧵 on how RL(Verbal)F reduces "feedback overgeneralization"
we also do some math analyzing an increasingly popular pattern for generating RL*F prefs 👇
[1/6] Excited to share “RLVF: Learning from Verbal Feedback without Overgeneralization”
Our method C3PO fine-tunes an LLM from 1 sentence of feedback and decreases overgeneralization (=applying the feedback when it should not be applied).
Details:
Interested in detecting text generated by language models? Come see poster
#609
in Exhibit Hall 1 at
#ICML2023
**today** from 11am-12:30pm!
You can also come to the oral presentation in Ballroom C (oral session B1) at 4:32pm 😊
Quick q:
What do we expect the log probability function to look like in the neighborhood of a model sample?
We hypothesized that a model's samples are usually in local maxima of its log probability function, or more generally, in areas of negative curvature.
Spoiler: they are!
But how to measure curvature of an LM's logprob, you ask??
We can approximate *directional second derivatives* by perturbing the text a bit with T5 & comparing the logprob under the LM before and after.
Add Hutchinson's trace estimator & we get approximate trace of the Hessian.
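Concretely, the statistic is roughly a "perturbation discrepancy" like this (a minimal sketch, not the released code; `logprob_fn` and `perturb_fn` are placeholder helpers, with T5 mask-filling standing behind `perturb_fn`):

```python
import numpy as np

def perturbation_discrepancy(text, logprob_fn, perturb_fn, n_perturbations=25):
    """Approximate the local curvature of the LM's log-prob around `text`.

    logprob_fn(t): total log-probability of string t under the LM we query.
    perturb_fn(t): a lightly rewritten version of t (e.g. T5 mask-fill).
    """
    original = logprob_fn(text)
    perturbed = [logprob_fn(perturb_fn(text)) for _ in range(n_perturbations)]
    # A large positive gap means `text` sits near a local max of log-prob
    # (negative curvature) -- the signature of a model sample.
    return original - float(np.mean(perturbed))
```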
Here’s the idea (summary in pic):
Rearrange to get the reward as a function of the optimal policy for that reward fn
Now a single substitution turns our simple Bradley-Terry loss on reward functions into a simple loss on policies
Bam, RLHF without RL!
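A rough PyTorch sketch of what that substitution gives you as a loss (my paraphrase, not the official DPO repo code; function and argument names are placeholders, and the logprobs are assumed to be summed over each response's tokens):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Binary cross-entropy on the implicit reward margin.

    Each argument is a batch of summed per-token log probs of a response
    under the policy being trained or the frozen reference model.
    """
    # Implicit reward of a response: beta * (policy logprob - reference logprob).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry on those rewards collapses to a simple logistic loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```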
AI-powered search engines (New Bing, etc.) are super cool!
Do they understand the flow of time, i.e., that the "right answer" to a question is fluid?
🧵 w/ example: "when does REAL ID go into effect?"
so, RLHF reward modeling (for both DPO and “OG” RLHF) trains with a binary cross entropy loss where our predicted probability is sigmoid(r(chosen) - r(rejected)) and the target prob is 1
but are we really confident enough in our training data to use a target prob of exactly 1???
i don't know how important this 'stop optimizing after a fixed amount of improvement' thing is, but based on some anecdotes I've heard from larger labs trying DPO at larger scale/on huge pref datasets, I have a hunch this could be helpful
@Teknium1
This is normal. One intuition I have is that after SFT, you're near a local maximum for logprob of chosen, and any way you change the model will decrease logprob of chosen.
ie, it's worth lowering the logprob of good stuff if we lower logprob of bad stuff more
DetectGPT takes this approximate Hessian trace, and simply thresholds it to get a detector.
Hessian trace very negative? Probably a model sample!
Turns out this quantity discriminates between human-written and model-generated text very well, for various models and scales.
EFT is similar in spirit to the lovely concurrent “Reward-Augmented Decoding” by
@HaikangDeng
@colinraffel
However the DPO logprob-ratio reward parameterization used in EFT means we don’t need a separate forward pass for each possible next token!
Does it work?
DetectGPT consistently improves AUROC (prob. a random pair of fake/human text is correctly classified) over existing zero-shot methods, for models with 100M to 175B parameters.
It's also competitive with supervised classifiers, outperforming them in some domains.
More work showing how careful (or careless!) data selection hugely impacts model quality.
DPO (offline RL generally) is powerful, but you still need to train on data worth learning from!
The data specifies the implicit reward function you're ultimately optimizing, after all...
How good is AI Feedback, and does it really help us improve LLMs? 🧠
A new paper, “A Critical Evaluation of AI Feedback for Aligning Large Language Models,” takes a critical look at the effectiveness of AI feedback when doing RLHF (DPO).
Experiment:
1️⃣ Create a dataset of…
one supposed issue with RLHF/DPO (which hasn’t really been proven on real-world problems) is that this loss will try to increase the reward of the ‘preferred’ response and decrease the reward of the ‘dispreferred’ response *forever*...
this might be undesirable with noisy prefs
Come see
@kattian_
's work on the ability of RLHF'd LLMs to *directly verbalize* probabilities (yes, that's right, as tokens) that are actually pretty well calibrated! (Usually better than the log probs!!! 🤯🤯🤯)
Poster 14B
R I G H T N O W until 3:30 SG time!!
since we’re probably not 100% confident in our preference labels, our target prob should be something more like 1 - eps
in this case, we get a slightly different BCE loss that stops optimizing after some fixed amount of improvement, so we don’t just “optimize forever”
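In code, that's roughly a one-line label-smoothing tweak to the DPO loss (my sketch, hypothetical names; eps is the assumed label-noise rate):

```python
import torch.nn.functional as F

def cdpo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps, beta=0.1, eps=0.1):
    """Conservative DPO: same reward margin as DPO, but the target
    probability for the preference label is 1 - eps instead of 1."""
    margin = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    # With prob 1 - eps the label is right, with prob eps it's flipped.
    return (-(1 - eps) * F.logsigmoid(margin)
            - eps * F.logsigmoid(-margin)).mean()
```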
@sirbayes
I think
@BlancheMinerva
would know best...
But here the authors find some memorization even at very low rates of duplication:
I think the tldr is large models can memorize even with very little (no?) duplication
Stella can correct me if I'm wrong 😅😅
Biden starts with "K"??
Some confident reasoning/factual errors with GPT-4 & (IMO) surprising out-of-control repetition.
Unsure how to interpret GPT-4's reasoning abilities given these results. Test scores are impressive, but this model clearly couldn't be a lawyer/doctor/etc
But how? Well, one simple approach is to measure the log probability of the text under the model. High log prob -> model sample.
This works, but DetectGPT takes a different approach that turns out to be consistently more accurate in our experiments.
oh also cDPO and IPO are both basically trivial to implement in the DPO codebase, they are very small modifications to the loss, will probably add these soon next time i'm not applying to jobs (or someone could PR 🥺)
Check out the paper for more theoretical analysis of DPO (led by
@rm_rafailov
) as well as additional experiments.
Don’t worry, we also did a human study to make sure our GPT-4-based eval metrics were sensible 🙂
We can even sample from *combinations* of rewards, w/o re-training each combination!
We train one 7B model for helpfulness, one for harmlessness & interpolate to produce a frontier.
The frontier is strictly improved by up-scaling the 7B rewards to 70B, all w/o fine-tuning!
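My guess at the form of the interpolation, extending the up-scaling formula from the EFT tweet earlier in this feed (not taken from the paper; λ is my notation for the mixing weight):

```latex
\log \tilde{p}_\lambda(y_t \mid x, y_{<t}) \;\propto\;
  \log p^{\text{base}}_{70\mathrm{B}}
  + \lambda \big(\log p^{\text{help}}_{7\mathrm{B}} - \log p^{\text{base}}_{7\mathrm{B}}\big)
  + (1-\lambda)\big(\log p^{\text{harmless}}_{7\mathrm{B}} - \log p^{\text{base}}_{7\mathrm{B}}\big)
```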
Cool to see this eval!! Hard to say two algs will be the same in all conditions, but great to see independent verification that DPO is indeed v similar to PPO, while being a whole lot simpler/cheaper
Ofc still v curious about if there are "corner cases" where DPO/PPO differ
Belatedly, I finally had a chance to update the AlpacaFarm paper with DPO results.
TL;DR: DPO performs similarly to RLHF+PPO but is much more memory-friendly. Previously, PPO fine-tuning took ~2 hours on 8 A100 GPUs. Our DPO runs take about the same time on 4 GPUs. DPO with LoRA…
Excited for this, and a bit surprised it took this long; low-hanging fruit tbh for increasing the immersiveness/usefulness of LLMs. Maybe harder from the eng perspective than research, depending on the implementation.
Personalization a big theme of 2024?
Honestly enjoying talking to Gemini Ultra so far. Its conversational style is pretty natural and empathetic without being too moralizing.
Fails the apple test though 😅
ht r/localllama
@hazmo89
@benji_smith
Believe it or not, some people really do try to understand viewpoints that differ from their own.
However I can see you don't make a habit of doing so.
@goodside
Training your new model to minimize cross entropy with samples from the existing LM is (in expectation) equivalent to minimizing the (forward) KL between the old (teacher) and new (student) models.
ie your new "pretraining" data is just samples from your starting model.
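One way to see that equivalence (a standard identity, my notation): writing p for the teacher and q_θ for the student,

```latex
\mathbb{E}_{x \sim p}\big[-\log q_\theta(x)\big]
  \;=\; H(p) \;+\; \mathrm{KL}\big(p \,\|\, q_\theta\big)
```

and H(p) doesn't depend on θ, so minimizing the cross entropy over θ minimizes the forward KL.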
Antoine "back in my day GPTs didn't have numbers yet" Bosselut throwing down the "what even is mechinterp" gauntlet right from the start
@ABosselut
We can actually interpret the logprob difference term as a reward function that *would* take us from the base/reference model to the fine-tuned model, if we did RL (see the DPO paper!)
In EFT, we’re just re-weighting the base model logprobs with the reward, token by token!
One of the coolest winners of the inverse scaling prize IMO was failure of modus tollens (reasoning incorrectly when presented with P -> Q, ~Q as inputs).
RLHF maybe helps? GPT-3.5 and GPT-4 get this right, but Bard can't overcome the prior that dogs aren't the only pet.
Let’s back up.
RLHF is all about learning from preferences, i.e. a dataset D of prompts + chosen/rejected model response for each prompt.
We 𝙘𝙤𝙪𝙡𝙙 train a reward model matching these human prefs using D + the Bradley-Terry preference model (for ex) as a loss function
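For reference, the Bradley-Terry model and the resulting reward-modeling loss look like this (standard form, my notation; r_φ is the reward model, y_w/y_l the chosen/rejected responses):

```latex
p(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big), \qquad
\mathcal{L}(\phi) = -\,\mathbb{E}_{(x, y_w, y_l) \sim D}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]
```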
@Axmortl
@benji_smith
To clarify, at no point did Benji ever train any generator model that produces text of any kind. He just computed basic statistics about the books & shared small snippets (like Google Books). This is completely different from models like midjourney/gpt-4, which produce content.
There are lots more analyses in the paper, again led by Nathan, showing how weights generalize across datasets, where the gains from weighted adaptation come from, and more.
Definitely check it out!!
Code coming ASAP :)