📢 Excited to announce our new paper
Language-guided world models: A model-based approach to AI control
• We develop LWMs: world models that can read texts to capture new environment dynamics
• These models enable humans to efficiently control agents by providing language
😠It is still ridiculous to me how much money/time was wasted simply because people don't read some old papers.
💡If you want to know why REINFORCE/A2C is better than PPO, read our paper:
We have identified all of the common issues for you:
The objective mismatch issue raised in John Schulman's ICML talk was already foreseen by our paper () 6 years ago. Sadly, it wasn't cited nearly enough.
Biased opinion: our paper deserves more readers and acknowledgement. In fact, not OpenAI's papers, but
Why does RL-tuning hurt calibration of LLMs? The RL objective can be written as a reverse KL divergence, which encourages mode-seeking behavior (i.e., a peaky distribution). The RL+translation community studied this phenomenon a long time ago (, )
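A toy numerical illustration of why the reverse direction is mode-seeking (nothing here is from the cited papers; the distributions and the Gaussian-bump family are made up for the demo):

```python
import math

xs = range(10)

def normalize(w):
    z = sum(w)
    return [v / z for v in w]

def bump(c, s):
    # unimodal Gaussian-shaped distribution centered at c with width s
    return normalize([math.exp(-((x - c) ** 2) / (2 * s * s)) for x in xs])

# Bimodal "data" distribution with modes at 2 and 7
p = normalize([math.exp(-((x - 2) ** 2)) + math.exp(-((x - 7) ** 2)) for x in xs])

def kl(a, b):
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

candidates = [(c, s) for c in xs for s in (0.5, 1, 2, 3, 4)]
# Reverse KL (the RL direction): KL(q || p) -> hugs a single mode
rev_c, rev_s = min(candidates, key=lambda cs: kl(bump(*cs), p))
# Forward KL (the MLE direction): KL(p || q) -> spreads over both modes
fwd_c, fwd_s = min(candidates, key=lambda cs: kl(p, bump(*cs)))

print("reverse KL picks center", rev_c, "width", rev_s)
print("forward KL picks center", fwd_c, "width", fwd_s)
```

The reverse-KL fit collapses onto one mode with a narrow (peaky) distribution, while the forward-KL fit goes wide to cover both modes.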
Working on calibration/uncertainty for LLMs, which papers should I cite? Guo et al. () is pretty popular but it is about classification tasks. Calibration on sequences comes with distinct challenges.
📢Learning is all about communication. And who is the master of communication? Humans!
😯Our new paper enables AI agents to learn more like humans.
🔥Our agents define and share increasingly abstract intentions over time, and as a result, learn with progressive efficiency.
Our HANNA paper on Visual Navigation with Natural Multimodal Assistance has been accepted to
#emnlp2019
. New task/dataset/model/learning algorithm for leveraging vision-and-language human assistance in object-finding tasks in photo-realistic environments! (with
@haldaume3
)
After a wonderfullll year at Princeton, I am excited to join CHAI Berkeley, working with Prof. Russell and Prof. Dragan to continue my effort to make AI communicate more effectively with humans. Connect with me if you are interested in learning from language feedback, learning to ask
🚀 Dive into the untold story of Alignment via Human Feedback from an NLP perspective! This paper brilliantly encapsulates the epoch often overlooked in surveys written by RL groups. An absolute must-read for newcomers in the field! 📚
Do AI agents know what they want?
Can they ask specific questions that faithfully reflect their intrinsic needs?
We develop a general decision-making framework for simultaneously learning 𝙬𝙝𝙚𝙣 𝙖𝙣𝙙 𝙬𝙝𝙖𝙩 𝙩𝙤 𝙖𝙨𝙠 (w/
@ybisk
@haldaume3
)
This great work confirms my intuition: people have rediscovered problems of RLHF that were observed and documented many years ago when the method was first tried on machine translation. The finding in this paper is similar to . People, especially
"Less (tuning) is more for alignment" is an intriguing hypothesis. Is alignment tuning really that “superficial”⁉️ 🤔 If so, how so? 🤔 Can any straightforward analysis explain this? 🤔 What if I tell you “no tuning can also be great for alignment”? 🫢 😉 If you’re interested in
Maybe it's time to move beyond rewards and start 𝘁𝗮𝗹𝗸𝗶𝗻𝗴 properly to our ML agents!
Our ILIAD
#ICML2021
paper formulates a learning framework where natural language is the only communication medium used by the teacher.
Blog:
Happy to introduce 𝗚𝗹𝗼𝗯𝗮𝗹 𝗩𝗼𝗶𝗰𝗲𝘀, an evaluation dataset for multilingual and cross-lingual summarization in 15 languages (w.
@haldaume3
).
New materials for studying translation quality in downstream tasks, zero-shot learning, etc.
#NLProc
#summarization
#multilingual
Passing false-belief tests = model HAS theory of mind
Passing false-belief tests ≠ model USES theory of mind to perform tasks
Our
#ACL2023
paper: formulates 𝑻𝒂𝒔𝒌-𝑶𝒓𝒊𝒆𝒏𝒕𝒆𝒅 cognitive capabilities, which are used to perform tasks.
Very delighted to receive an Outstanding paper award at
@tom_icml2023
. It is a great honor to be acknowledged by experts in a domain you have only recently ventured into :)
HANNA: Visual Navigation with Multimodal Natural Assistance is online
Our agent finds objects in photo-realistic environments by learning to query simulated humans for instructions.
Paper:
Github:
I woke up to this wonderful paper!!!
@KreutzerJulia
(the RLHF veteran) and
@CohereForAI
have done it! They show REINFORCE beats PPO convincingly and propose a better version. Only those who understand the past can shape the future.
I wrote a thought piece showing RLHF = variational inference on Bayesian cognitive model (generalized RSA). I hope that realizing this connection can help better understand recent developments on LLMs and inspire future research.
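For readers who want the gist: the connection rests on a standard identity (my notation; a sketch, not the piece's full derivation). The KL-regularized RLHF objective has a closed-form optimum, so optimizing it is exactly a KL-minimization, i.e., variational inference with that optimum playing the role of the posterior:

```latex
% KL-regularized RLHF objective
\max_\pi \; \mathbb{E}_{y \sim \pi(\cdot|x)}\big[r(x,y)\big]
  - \beta \, \mathrm{KL}\big(\pi(\cdot|x) \,\|\, \pi_{\mathrm{ref}}(\cdot|x)\big)
% has the closed-form optimum
\pi^*(y|x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y|x) \exp\!\big(r(x,y)/\beta\big)
% and maximizing the objective is equivalent to minimizing
\mathrm{KL}\big(\pi(\cdot|x) \,\|\, \pi^*(\cdot|x)\big)
```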
Nice work! Remember that the SOTA LLMs do not implement SOTA learning algorithms. Imitation learning was less popular because of the expert query cost. But the cost is now much cheaper with LLMs as experts. Much cool IL work from the past can now find its way into real-world
New paper! Learning to Generate Better Than Your LLM ()
RLHF has become a powerful paradigm for fine-tuning LLMs, but we only use general-purpose RL algorithms. We introduce a new algorithmic paradigm that takes advantage of additional feedback for learning.
When a language model guides a human, giving false instructions can frustrate them or even put them in danger.
We propose a cost-effective method for detecting hallucinations in navigation instructions.
More about our
#EMNLP2023
findings paper⬇️ (1/n)
My opinion on SORA as a world model (ignore this post if you think of it as just a video-editing tool):
- Generating high-resolution, realistic outputs makes it hard to use SORA as a planner. We should have more work on planning with abstract representations of the world (e.g.,
As humans, we influence others' worldview to shape their behavior.
A kid asks his mom if he can go swimming at a nearby lake. The mom says: "there was a drowning accident over there last year." After listening to that, the kid chooses to stay home.
Here, instead of giving an
📢Internship at CHAI Berkeley. Apply by Nov 13.
Opportunity to work with a group of leading experts in AI safety. I am particularly looking for students who are interested in learning from language feedback, and learning to ask questions.
Do language-to-world models like OpenAI SORA excite you? They excite us too! In this recent paper, we lay out a vision for this type of model. Not just video-creation tools, they will enable humans to collaborate with AI safely and control it easily.
The code has been released. Check it out!
The hardest paper I have ever been a part of, in terms of arguments, experimental setup, and technical depth. Could not have done it without help from amazing co-authors, and the open-minded reviewers. Learning from language is challenging but (to me) it is the future of AI!
"Interactive Learning from Activity Description", led by the fantastic
@khanhxuannguyen
with Dipendra Misra, Robert Schapire, Miro Dudík (
@MSFTResearch
) has been accepted to
#ICML2021
!
First time co-organizing a workshop at a major conference. Great interactive audience, wonderful talks and discussions about
#interactiveNLP
. The simultaneous interpretation was still awkward, but everyone seemed happy. Thank you all for contributing to this experience :D
I had a wonderful visit and learned about cool research at NYU and FAIR thanks to the hospitality and generosity of
@kchonyc
@W4ngatang
and
@uralik1
. Thank you very much and wish you all the best!
Right when I asked the question: Google Gemma uses the good old REINFORCE! This confirms my belief that the algorithm doesn't really matter (hyper-tuning matters though). What you should care about is the structure in the data and how to formulate the problem in a way
RLHF details in
@GoogleDeepMind
's Gemma:
* Confirm Google uses REINFORCE algo
* KL penalty in reward to SFT distribution (like InstructGPT), would be in addition to policy KL
* "we relied on a high capacity model" big RMs >> small, as Anthropic results have shown
More soon.
Nguyen and O'Connor () and Kuleshov and Liang () are the first papers on calibration for sequences. They formulate and discuss challenges to this problem. Consider reading and citing these papers if you work on this topic :)
No offense to my Chinese friends. But if you are speaking to a general audience and you are unsure that they are all from China, use the term "𝗟𝘂𝗻𝗮𝗿 𝗡𝗲𝘄 𝗬𝗲𝗮𝗿". In Vietnam, we call it "Tết Nguyên Đán" (if anyone cares about inclusiveness).
This is great! It might imply that we have been doing actor-critic the wrong way the whole time? Actor-critic seems like coordinate descent, but the problem is that the coordinates are correlated?
🔥Major Breakthrough in
#RLHF
! Traditional approaches fall short in characterizing policy-driven data dependency. Introducing PARL: a Unified Stochastic Bilevel Formulation. One of the FIRST provable solutions to
#Alignment
. 🚀 Essential for ethical AI! 📄
(5/7) Julia Kreutzer is a veteran on this topic. She has authored many papers analyzing the feasibility of learning translation systems from human feedback (those with Sokolov, and , ).
I had a fantastic internship at
@MSFTResearch
working with
@debadeepta
@chris_brockett
and Bill Dolan on empowering navigation agents with the ability to leverage help from humans. Human-assisted AI agents can accomplish tasks that surpass their knowledge and skill levels.
When in doubt, people ask for help. What if our personal digital assistants could do the same? Microsoft researchers have created a novel method of training agents to strategically ask for assistance during vision-language tasks:
#CVPR2019
By the way,
@a1zhang
is on the PhD market this year. He is smart, diligent, and productive, and is experienced with vision&language research. Grab him while you can 😃
I am super proud of my collaborators
@a1zhang
@JensTuyls
, Albert Lin, and
@karthik_r_n
. The problem turned out to be much more challenging than we had anticipated, but we didn't give up. Our paper has just tackled an easy version of the general problem. We hope it will spark
I should say that the scope of this tweet is text gen. The history of RL from humans of course dates way further back than this (e.g. TAMER by Knox and Stone, Littman et al., etc.)
@fhuszar
“Enough” does not mean “efficient”. A two-layer neural network with sufficient width can approximate any function. But the width could grow exponentially with the complexity of the function. Deep nets are more efficient function approximators.
Another goal of this work is to construct an agent that asks increasingly abstract questions to reduce the effort of the human assisting it.
When I started my PhD, I asked my advisor about every little detail. But near the end, we mostly exchanged high-level ideas.
Now I am
(1/7) In terms of RL for text gen, cite Ranzato+15 () and Shen+ () who pioneer training text generators to optimize rewards, and Bahdanau+17 () who attempt the first actor-critic solution.
I wonder if there has been work comparing DPO and PPO with simpler RL algorithms like A2C or even REINFORCE for fine-tuning LLMs. DPO can be interpreted as actor-critic with a cool math trick to obtain a reliable critic for free (i.e. use the policy itself as the critic). It also has a
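To unpack the "critic for free" reading (my paraphrase of the DPO derivation, not the authors' exact wording): the closed-form optimum of the KL-regularized objective lets the policy itself express an implicit reward, which then plugs into the Bradley-Terry preference likelihood with the intractable partition function canceling:

```latex
% Implicit reward expressed by the policy itself
r(x,y) = \beta \log \frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta \log Z(x)
% Substituting into the Bradley-Terry model, Z(x) cancels across the pair:
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)} \log \sigma\!\left(
      \beta \log \frac{\pi(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)}
    - \beta \log \frac{\pi(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)} \right)
```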
(6/7) All of these works happened before or around the time of Christiano+17 () who introduce the now well-known method for learning from rankings, and Stiennon+20 () who apply the method with real humans on text summarization.
(4/7) Our 2017 paper () is the first to present and simulate the risk of using user ratings for training text generators. People have different opinions; one's opinion varies over time. We show RL is robust to granularity and skew in rewards, but not to variance.
(0/7) To some people, RLHF means "learn a reward model from human rankings and RL on it". But the term literally conveys a much broader meaning: any RL method that can learn from any type of human scalar feedback.
Imitation learning and reinforcement learning have taken us really far. But I can't teach my AI complex things efficiently if I keep talking to it using primitive actions and rewards. I want our conversation to evolve to be more efficient over time.
Be careful! The bias argument of PyTorch's Linear is True by default. If you do NMT or LM and forget to turn it off, the pre-softmax linear layer's weights may not be valid embeddings.
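A minimal sketch of the fix (standard PyTorch modules and the usual weight-tying pattern; the sizes are arbitrary):

```python
import torch
import torch.nn as nn

vocab, d = 100, 16
emb = nn.Embedding(vocab, d)

# Pre-softmax projection: turn the bias OFF so its weight can double as embeddings
out = nn.Linear(d, vocab, bias=False)
out.weight = emb.weight            # weight tying: both are (vocab, d)

h = torch.randn(2, d)              # fake decoder states
logits = out(h)                    # (2, vocab): plain dot products with embeddings

# With bias=True, each score would be <h, e_w> + b_w, so the rows of
# out.weight alone would no longer explain the model's word scores.
assert out.bias is None
```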
We give our agents these elements and press the button. Bam!!! Progressively efficient learning emerges. Our agents convey increasingly abstract intentions over time.
The discussion on VLN reminds me of our motivation for creating VLNA (). The first thing we changed was to replace initial detailed instructions with high-level instructions, essentially removing the assumption that the requester knows the task solutions...
The need for open data & benchmarks in modern ML research has led to an outpouring of
#NLProc
data creation. But
@harm_devries
,
@DBahdanau
& I suggest the low ecological validity of most of this data undermines the resulting research. Comments welcome!
A qualitative example: here, the Observational (no language) model mistakenly captures the movement patterns of the queen and the whale entities. It also misrecognizes the whale as an enemy. GPTHard is an approach that leverages ChatGPT to ground descriptions to entities. It
@DrJimFan
We did Sora+Genie but at a much more humble scale :p
Still, we realize that the problem of grounding language to dynamics is extremely difficult. With immense data, maybe you will generalize well in-distribution, but achieving true compositional
Looking for a new challenge because SOTA of
@panderson_me
VLN advanced too much
@cvpr2019
? Come check out VNLA, where an agent learns to request and understand human assistance in object-finding tasks. Novel imitation learning framework for language feedback!
On the grand stage of
@emnlp2019
,
@kchonyc
serves the community with his wisdom.
The historical journey of how neural language generation was revived and took the spotlight of NLP research. Tips: Be 𝗰𝘂𝗿𝗶𝗼𝘂𝘀, and if you don't have "attention", 𝗶𝗻𝘃𝗲𝗻𝘁 it!
We demonstrate this scenario in Messenger. Without ever interacting with the real environment, our LWM-based agent can raise its final performance significantly by effectively incorporating language feedback (EMMA-LWM vs. Observational).
(2/7) In those works, rewards given to the model were dense and computed automatically (BLEU). Sokolov+15,16,17 (, ) are among the first to really think about learning from human ratings, modeling the problem as bandit learning.
@DrJimFan
@yoavgo
@johnschulman2
yeah, the (learned) reward function may be still imperfect but the (unconfirmed) hypothesis is that evaluation is easier than generation so the reward function may still be of higher quality than a policy learned with the same amount of labeling effort.
@SemanticScholar
There are a lot of authors that have the same name as mine. SS seems to merge all of them into a single page. Why not let the authors create their own page and add papers?
To implement this approach, we need world models that make it easy for humans to adapt them. Traditional world models can be adapted with only observations, which are inadequate for humans to convey intentions. We develop world models that can be adapted through language.
I strongly encourage
@GoogleDeepMind
to acknowledge the early work on RLHF for text generation that pioneers the use of REINFORCE on this problem. Simplicity prevails!
The model-based approach is not only human-compatible but practically efficient: because an agent's policies are optimized w.r.t. a world model, changing that model systematically shifts all the policies.
(3/7) "Bandit" is important because naturally you could only ask a human to give one rating for a whole text. Sokolov's formulation characterizes how difficult the problem is compared to video-game dense-reward RL problems.
We find that the standard Transformer architecture struggles to generalize compositionally, and augment it with a more effective attention mechanism. More details and results are in the paper.
AI control has mainly taken a model-free approach: constructing agents made of black-box policies, then directly updating the policies to change their behavior.
In contrast, a model-based approach constructs agents with explicit mental states and enables humans to easily
and “alignment” is the new name for RL for structured prediction… (I guess that is not the originally intended meaning but that is what it turns out to be now)
Convince me I'm wrong: Generative AI is the new name for structured prediction.
An interviewer asked for a def of GenAI & offhand: "an AI system that generates a complex output at once (vs a single prediction)"
I later realized that's ≈identical to the def of SP I'd give ~2005
For those who are unfamiliar: this is the past I talked about. I apologize if you have seen this slide too often recently. But not enough people have seen it.
Finally, we illustrate a promising application of LWMs, in which these models enable agents to generate and discuss plans with humans before execution.
This makes agents safer, more interpretable, and more robust!
In this setting, humans can not only provide action-correcting
@yoavgo
@johnschulman2
i think viewing the llm as having a fixed knowledge graph is slightly misleading; by instruction-tuning you also add knowledge and modify the knowledge graph. the issue to me is overgeneralization: instead of learning just the taught knowledge, the llm also learns hallucination behavior.
Our modeling approach converts trajectories into sequences of tokens and trains a Transformer as a world model to auto-regressively generate those sequences.
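The pipeline can be sketched like this (the grid/action encoding is my own toy scheme for illustration; the paper's actual tokenizer may differ):

```python
# Toy tokenization of trajectories for an autoregressive world model.
ACTIONS = ["up", "down", "left", "right"]

def tokenize_obs(grid):
    # flatten a small symbolic grid into one token per cell
    return [f"obs:{cell}" for row in grid for cell in row]

def tokenize_trajectory(traj):
    """traj = [(obs_0, act_0), (obs_1, act_1), ...] -> flat token list."""
    tokens = []
    for obs, act in traj:
        tokens += tokenize_obs(obs) + [f"act:{act}"]
    return tokens

traj = [
    ([["wall", "agent"], ["key", "empty"]], "down"),
    ([["wall", "empty"], ["agent", "empty"]], "left"),
]
tokens = tokenize_trajectory(traj)

# Autoregressive training pairs: the Transformer predicts token t+1
# from all tokens up to t.
inputs, targets = tokens[:-1], tokens[1:]
print(tokens[:5])
```

Each step contributes its observation tokens followed by one action token, so the model learns to roll the environment forward token by token.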
@amritsinghbedi3
Another remark is whether the current formulation would result in overly conservative agents because an easy way to optimize the objective is to make the data distribution have very low support. RLHF is known to hurt calibration. This problem has also been studied in machine
We first construct a benchmark based on the Messenger environment. There, a model needs to interpret a language manual to predict environment dynamics. This is a hard language-grounding problem. A model has to learn representations of entities, correctly extract textual features,