Khanh Nguyen @khanhxuannguyen Twitter profile

Pinned Tweet

Khanh Nguyen

4 months

📢 Excited to announce our new paper Language-guided world models: A model-based approach to AI control • We develop LWMs: world models that can read texts to capture new environment dynamics • These models enable humans to efficiently control agents by providing language

4

47

200

Last Seen Profiles

@AboutTheBBC

@lyoiguu

@ApedERC

@PanitiaPensi

@ETBU_Baseball

@phiopsia

@TemboLukas79834

@yesevi0

@Phill9991849169

@BitcoinMajlis

@catmourie

@Quincyw05337351

@Ricto_R6

@eliseomoro

@DavidHogan66791

@BoyGeorge

@PBRTX_Scouting

@CoachWatkins91

@snsigaming1

@letsgowa

@xessa95

@ChrisLynnHedges

@cpaaustralia

@enijahjacquee

@CornellBaseball

@clamnuts

@Doug_Paulley

@riaghad

@JordanSweett

@KingDon_za

@JackermanDev

@ShaneSaint

@CorpusBaseball

@HardeeWill

@ECUTigersBSB

@RobBrownBoro

Khanh Nguyen

@khanhxuannguyen

1 year

The RLHF page of HuggingFace () misses many important citations. Here are some classical RLHF papers that you should cite and why.

Illustrating Reinforcement Learning from Human Feedback (RLHF)

huggingface.co

4

60

343

Khanh Nguyen

@khanhxuannguyen

3 months

😠It is still ridiculous to me how much money/time was wasted simply because people don't read some old papers. 💡If you want to know why REINFORCE/A2C is better than Actor-Critic, read our paper: We have identified all of the common issues for you: -

7

26

175

Khanh Nguyen

@khanhxuannguyen

10 months

The objective mismatch issue raised in John Schumann's ICML talk was already foreseen by our paper () 6 years ago. Sadly it wasn't cited nearly enough. Biased opinion: our paper deserves more readers and acknowledgement. In fact, not OpenAI's papers, but

Reinforcement Learning for Bandit Neural Machine Translation with...

Machine translation is a natural candidate problem for reinforcement learning from human feedback: users provide quick, dirty ratings on candidate translations to guide a system to improve. Yet,...

arxiv.org

1

20

145

Khanh Nguyen

@khanhxuannguyen

1 year

Why RL-tuning hurts calibration of LLMs? RL objective can be written as a reverse KL divergence which encourages mode-seeking behavior (i.e. peaky distribution). RL+translation has studied this phenomenon a long time ago (, )

On the Weaknesses of Reinforcement Learning for Neural Machine Translation

Reinforcement learning (RL) is frequently used to increase performance in text generation tasks, including machine translation (MT), notably through the use of Minimum Risk Training (MRT) and...

arxiv.org

4

11

91

Khanh Nguyen

@khanhxuannguyen

1 year

Working on calibration/uncertainty for LLMs, which papers should I cite? Guo et al. () is pretty popular but it is about classification tasks. Calibration on sequences comes with distinct challenges.

On Calibration of Modern Neural Networks

Confidence calibration -- the problem of predicting probability estimates representative of the true correctness likelihood -- is important for classification models in many applications. We...

arxiv.org

3

14

92

Khanh Nguyen

@khanhxuannguyen

2 years

I have finally graduated, thanks to tremendous support from my research mentors ( @haldaume3 @brendan642 @debadeepta , Dipendra Misra, and others), my family, and friends. I will be a postdoc @princeton_nlp and later @CHAI_Berkeley . Looking for opportunities to give talks :P

14

5

83

Khanh Nguyen

@khanhxuannguyen

7 months

📢Learning is all about communication. And who is the master of communication? Humans! 😯Our new paper enables AI agents to learn more like humans. 🔥Our agents define and share increasingly abstract intentions over time, and as a result, learn with progressive efficiency.

2

15

72

Khanh Nguyen

@khanhxuannguyen

1 year

This paper is so awesome!!! Debunking the “emergent capabilities” myth is the key to break the LLM recipe.

Are Emergent Abilities of Large Language Models a Mirage?

Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities...

arxiv.org

3

13

72

Khanh Nguyen

@khanhxuannguyen

6 months

Made this slide for my recent talk.

2

11

67

Khanh Nguyen

@khanhxuannguyen

5 years

Our HANNA paper on Visual Navigation with Natural Multimodal Assistance has been accepted to #emnlp2019 . New task/dataset/model/learning algorithm for leveraging vision-and-language human assistance in object-finding tasks in photo-realistic environments! (with @haldaume3 )

2

14

65

Khanh Nguyen

@khanhxuannguyen

9 months

After a wonderfullll year at Princeton, I am excited to join CHAI Berkeley, working Prof. Russell and Prof. Dragan to continue my effort to make AI communicate more effectively with humans. Connect with me if you are interested in learning from language feedback, learning to ask

10

2

62

Khanh Nguyen

@khanhxuannguyen

7 months

🚀 Dive into the untold story of Alignment via Human Feedback from an NLP perspective! This paper brilliantly encapsulates the epoch often overlooked in surveys written by RL groups. An absolute must-read for newcomers in the field! 📚

The Past, Present and Better Future of Feedback Learning in Large...

Human feedback is increasingly used to steer the behaviours of Large Language Models (LLMs). However, it is unclear how to collect and incorporate feedback in a way that is efficient, effective...

arxiv.org

0

22

59

Khanh Nguyen

@khanhxuannguyen

3 years

Do AI agents know what they want? Can they ask specific questions that faithfully reflect their intrinsic needs? We develop a general decision-making framework for simultaneously learning 𝙬𝙝𝙚𝙣 𝙖𝙣𝙙 𝙬𝙝𝙖𝙩 𝙩𝙤 𝙖𝙨𝙠 (w/ @ybisk @haldaume3 )

3

10

58

Khanh Nguyen

@khanhxuannguyen

3 months

This great work confirms my intuition: people have rediscovered problems of RLHF that was observed and documented many years ago when the method was first tried on machine translation. The finding in this paper is similar to . People, especially

On the Weaknesses of Reinforcement Learning for Neural Machine Translation

Reinforcement learning (RL) is frequently used to increase performance in text generation tasks, including machine translation (MT), notably through the use of Minimum Risk Training (MRT) and...

arxiv.org

Bill Yuchen Lin 🤖

@billyuchenlin

3 months

"Less (tuning) is more for alignment" is an intriguing hypothesis. Is alignment tuning really that “superficial”⁉️ 🤔 If so, how so? 🤔 Can any straightforward analysis explain this? 🤔 What if I tell you “no tuning can also be great for alignment”? 🫢 😉 If you’re interested in

10

66

325

2

8

55

Khanh Nguyen

@khanhxuannguyen

7 years

Our code for Reinforcement Learning + Neural Machine Translation: @nlproc

GitHub - khanhptnk/bandit-nmt

Contribute to khanhptnk/bandit-nmt development by creating an account on GitHub.

github.com

Khanh Nguyen

@khanhxuannguyen

7 years

Our #emnlp2017 work improves machine translation with simulated ratings using reinforcement learning #nlproc

2

8

43

0

11

50

Khanh Nguyen

@khanhxuannguyen

3 years

Maybe it's time to move beyond rewards and start 𝘁𝗮𝗹𝗸𝗶𝗻𝗴 properly to our ML agents! Our ILIAD #ICML2021 paper formulates a learning framework where natural language is the only communication medium used by the teacher. Blog:

1

20

51

Khanh Nguyen

@khanhxuannguyen

5 years

Happy to introduce 𝗚𝗹𝗼𝗯𝗮𝗹 𝗩𝗼𝗶𝗰𝗲𝘀, an evaluation dataset for multilingual and cross-lingual summarization in 15 languages (w. @haldaume3 ). New materials for studying translation quality in downstream task, zero-shot learning, etc. #NLProc #summarization #multilingual

1

10

49

Khanh Nguyen

@khanhxuannguyen

1 year

Passing false-belief tests = model HAS theory of mind Passing false-belief tests ≠ model USES theory of mind to perform tasks Our #ACL2023 paper: formulates 𝑻𝒂𝒔𝒌-𝑶𝒓𝒊𝒆𝒏𝒕𝒆𝒅 cognitive capabilities, which are used to perform tasks.

1

11

45

Khanh Nguyen

@khanhxuannguyen

7 years

Our #emnlp2017 work improves machine translation with simulated ratings using reinforcement learning #nlproc

2

8

43

Khanh Nguyen

@khanhxuannguyen

10 months

Very delighted to receive an Outstanding paper award at @tom_icml2023 . It is a great honor to be acknowledged by experts in the domain you have just recently ventured into :)

Khanh Nguyen

@khanhxuannguyen

1 year

Passing false-belief tests = model HAS theory of mind Passing false-belief tests ≠ model USES theory of mind to perform tasks Our #ACL2023 paper: formulates 𝑻𝒂𝒔𝒌-𝑶𝒓𝒊𝒆𝒏𝒕𝒆𝒅 cognitive capabilities, which are used to perform tasks.

1

11

45

4

7

40

Khanh Nguyen

@khanhxuannguyen

5 years

HANNA: Visual Navigation with Multimodal Natural Assistance is online Our agent finds objects in photo-realistic environments by learning to query simulated humans for instructions. Paper: Github:

1

11

40

Khanh Nguyen

@khanhxuannguyen

2 years

Accepted at #icml2022 :)

Khanh Nguyen

@khanhxuannguyen

3 years

Do AI agents know what they want? Can they ask specific questions that faithfully reflect their intrinsic needs? We develop a general decision-making framework for simultaneously learning 𝙬𝙝𝙚𝙣 𝙖𝙣𝙙 𝙬𝙝𝙖𝙩 𝙩𝙤 𝙖𝙨𝙠 (w/ @ybisk @haldaume3 )

3

10

58

0

7

37

Khanh Nguyen

@khanhxuannguyen

3 months

I woke up to this wonderful paper!!! @KreutzerJulia (the RLHF veteran) and @CohereForAI has done it! They show REINFORCE beats PPO convincingly and propose a better version. Only those who understand the past can shape the future.

Arash Ahmadian

@aahmadian_

3 months

PPO has been cemented as the defacto RL algorithm for RLHF. But… is this reputation + complexity merited?🤔 Our new work revisits PPO from first principles🔎 📜 w @chriscremer_ @mgalle @mziizm @KreutzerJulia Olivier Pietquin @ahmetustun89 @sarahookr

13

96

474

1

5

36

Khanh Nguyen

@khanhxuannguyen

11 months

I wrote a thought piece showing RLHF = variational inference on Bayesian cognitive model (generalized RSA). I hope that realizing this connection can help better understand recent developments on LLMs and inspire future research.

0

6

30

Khanh Nguyen

@khanhxuannguyen

11 months

Nice work! Remember that the SOTA LLMs do not implement SOTA learning algorithms. Imitation learning was less popular because of the expert query cost. But the cost is now much cheaper with LLMs as experts. Many cool IL work in the past now can find their ways into real-world

Kianté Brantley

@xkianteb

11 months

New paper! Learning to Generate Better Than Your LLM () RLHF has become a powerful paradigm for fine-tuning LLM, but we only use general-purpose RL algorithms. We introduce new algorithmic paradigm that takes advantage of additional feedback for learning.

1

74

372

0

4

31

Khanh Nguyen

@khanhxuannguyen

7 months

When a language model guides a human, giving false instructions can frustrate them or even put them in danger. We propose a cost-effective method for detecting hallucinations in navigation instructions. More about our #EMNLP2023 findings paper⬇️ (1/n)

2

5

27

Khanh Nguyen

@khanhxuannguyen

3 months

My opinion on SORA as a world model (ignore this post if you think of it as just a video-editing tool): - Generating high-resolution, realistic outputs makes it hard to use SORA as a planner. We should have more work on planning with abstract representations of the world (e.g.,

Khanh Nguyen

@khanhxuannguyen

4 months

As humans, we influence others' worldview to shape their behavior. A kid asks his mom if he can go swimming at a nearby lake. The mom says: "there was a drowning accident over there last year." After listening to that, the kid chooses to stay home. Here, instead of giving an

1

0

5

4

2

27

Khanh Nguyen

@khanhxuannguyen

7 months

📢Internship at CHAI Berkeley. Apply by Nov 13. Opportunity to work with a group of leading experts in AI safety. I am particularly looking for students who are interested in learning from language feedback, and learning to ask questions.

Center for Human-Compatible Artificial Intelligence

boards.greenhouse.io

0

4

24

Khanh Nguyen

@khanhxuannguyen

3 months

Do language-to-world models like OpenAI SORA excite you? We are too! In this recent paper, we lay out a vision for this type of models. Not just video-creating tool, they will enable humans to collaborate safely and control AI easily. The code has been released. Check it out!

Khanh Nguyen

@khanhxuannguyen

4 months

📢 Excited to announce our new paper Language-guided world models: A model-based approach to AI control • We develop LWMs: world models that can read texts to capture new environment dynamics • These models enable humans to efficiently control agents by providing language

4

47

200

0

2

22

Khanh Nguyen

@khanhxuannguyen

5 years

Watch our agent, HANNA, find objects by asking for directions along the way. Full demo on Youtube: Paper: Github:

0

7

19

Khanh Nguyen

@khanhxuannguyen

6 years

first week @MSFTResearch #notatnaacl #halfolks

1

2

18

Khanh Nguyen

@khanhxuannguyen

3 years

The hardest paper I have ever been a part of, both in terms of arguments, experimental setup, and technical depth. Could not achieve without help from amazing co-authors, and the open-minded reviewers. Learning from language is challenging but (to me) it is the future of AI!

Patrick Shafto

@patrickshafto

3 years

"Interactive Learning from Activity Description", led by the fantastic @khanhxuannguyen with Dipendra Misra, Robert Schapire, Miro Dudík ( @MSFTResearch ) has been accepted to #ICML2021 !

0

7

0

2

16

Khanh Nguyen

@khanhxuannguyen

1 year

First time co-organize a workshop at a major conference. Great interactive audience, wonderful talks and discussions about #interactiveNLP . Simultaneous interpretation still awkward. Everyone seemed to be happy. Thank you all for contributing to this experience :D

1

3

14

Khanh Nguyen

@khanhxuannguyen

5 years

7 predictions of Abhinav Gupta at the "Computer Vision after 5 years" workshop @cvpr2019 1/7

1

13

Khanh Nguyen

@khanhxuannguyen

5 years

I had a wonderful visit and learned about cool research at NYU and FAIR thanks to the hospitality and generosity of @kchonyc @W4ngatang and @uralik1 . Thank you very much and wish you all the best!

Kyunghyun Cho

@kchonyc

5 years

Khan Nguyen at NYU on empowering navigation agents with human assistance cc his advisor @haldaume3

3

1

16

1

0

12

Khanh Nguyen

@khanhxuannguyen

3 months

Just right when I just asked the question. Google Gemma uses the old good REINFORCE! This confirms my belief that the algorithm doesn't really matter (hyper-tuning matters though). What you should care about is the structure in the data and how to formulate the problem in a way

Nathan Lambert

@natolambert

3 months

RLHF details in @GoogleDeepMind 's Gemma: * Confirm Google uses REINFORCE algo * KL penalty in reward to SFT distribution (like InstructGPT), would be in addition to policy KL * "we relied on a high capacity model" big RMs >> small, as Anthropic results have shown More soon.

4

13

246

1

0

11

Khanh Nguyen

@khanhxuannguyen

1 year

Nguyen and O'Connor () and Kuleshov and Liang () are the first papers on calibration for sequences. They formulate and discuss challenges to this problem. Consider reading and citing these papers if you work on this topic :)

0

1

11

Khanh Nguyen

@khanhxuannguyen

3 years

No offense to my Chinese friends. But if you are speaking to a general audience and you are unsure that they are all from China, use the term "𝗟𝘂𝗻𝗮𝗿 𝗡𝗲𝘄 𝗬𝗲𝗮𝗿". In Vietnam, we call it "Tết Nguyên Đán" (if anyone cares about inclusiveness).

1

0

11

Khanh Nguyen

@khanhxuannguyen

7 months

This is great! It might imply that we have been doing actor-critic the wrong way the whole time? Actor critic seems like coordiante descent but the problem is that the coordinates are correlated?

Furong Huang

@furongh

7 months

🔥Major Breakthrough in #RLHF ! Traditional approaches fall short in characterizing policy-driven data dependency. Introducing PARL: a Unified Stochastic Bilevel Formulation. One of the FIRST provable solutions to #Alignment . 🚀 Essential for ethical AI! 📄

2

15

93

3

2

10

Khanh Nguyen

@khanhxuannguyen

6 months

Non-OpenAI papers on the slide: [Ranzanto et al.] [Bahdanau et al.] [Sokolov et al.] [Nguyen et al.] [Kreutzer et al.] [Kirk et al.] (survey)

Bandit Structured Prediction for Neural Sequence-to-Sequence Learning

Bandit structured prediction describes a stochastic optimization framework where learning is performed from partial feedback. This feedback is received in the form of a task loss evaluation to a...

arxiv.org

0

3

10

Khanh Nguyen

@khanhxuannguyen

9 years

Model uncertainty must be correct to be useful. Posterior calibration for NLP models http://t.co/5NJUQXhxWE #nlproc #mlearning #datascience

2

4

9

Khanh Nguyen

@khanhxuannguyen

2 years

Presenting this work today (Thu) at #ICML2022 in Room 310 at 11.30. Poster at 18.00 EDT :)

Khanh Nguyen

@khanhxuannguyen

3 years

Do AI agents know what they want? Can they ask specific questions that faithfully reflect their intrinsic needs? We develop a general decision-making framework for simultaneously learning 𝙬𝙝𝙚𝙣 𝙖𝙣𝙙 𝙬𝙝𝙖𝙩 𝙩𝙤 𝙖𝙨𝙠 (w/ @ybisk @haldaume3 )

3

10

58

0

6

10

Khanh Nguyen

@khanhxuannguyen

5 years

@umdclip @ml_umd @umdcs students presenting their work at EMNLP'19 in Hong Kong. A memorable event: first EMNLP paper for @swetaagrawal20 and last for @yogarshi and Weiwei Yang as PhD candidates.

1

3

8

Khanh Nguyen

@khanhxuannguyen

1 year

(5/7) Julia Kreutzer is a veteran on this topic. She authors so many papers that analyze the feasbility of learning translation systems from human feedback (those with Sokolov, and , ).

Can Neural Machine Translation be Improved with User Feedback?

We present the first real-world application of methods for improving neural machine translation (NMT) with human reinforcement, based on explicit and implicit user feedback collected on the eBay...

arxiv.org

1

0

9

Khanh Nguyen

@khanhxuannguyen

3 months

@aahmadian_ @chriscremer_ @mgalle @mziizm @KreutzerJulia @ahmetustun89 @sarahookr this is the comparison I have been looking for! in fact, all of the early work on RLHF for text generation employed simple algorithms like A2C and REINFORCE and they worked fine.

Khanh Nguyen

@khanhxuannguyen

6 months

Made this slide for my recent talk.

2

11

67

1

0

9

Khanh Nguyen

@khanhxuannguyen

5 years

I had a fantastic internship at @MSFTResearch working with @debadeepta @chris_brockett and Bill Dolan on empowering navigation agents with the ability to leverage help from humans. Human-assisted AI agents can accomplish tasks that surpass their knowledge and skill levels.

Microsoft Research

@MSFTResearch

5 years

When in doubt, people ask for help. What if our personal digital assistants could do the same? Microsoft researchers have created a novel method of training agents to strategically ask for assistance during vision-language tasks: #CVPR2019

0

28

59

0

2

9

Khanh Nguyen

@khanhxuannguyen

4 months

By the way, @a1zhang is on the PhD market this year. He is smart, diligent, and productive, and is experienced with vision&language research. Grab him while you can 😃

Khanh Nguyen

@khanhxuannguyen

4 months

I am super proud of my collaborators @a1zhang @JensTuyls , Albert Lin, and @karthik_r_n . The problem turned out be much more challenging than we had anticipated, but we didn't give up. Our paper has just tackled an easy version of the general problem. We hope it will spark

1

0

7

1

2

8

Khanh Nguyen

@khanhxuannguyen

1 year

I should say that the scope of this tweet is text gen. The history of RL from humans of course dates way further back than this (e.g. TAMER by Knox and Stone, Littman et al., etc.)

0

8

Khanh Nguyen

@khanhxuannguyen

3 years

@fhuszar “Enough” does not mean “efficient”. A two-layer neural network with sufficient width can approximate any function. But the width could grow exponentially with the complexity of the function. Deep nets are more efficient function appriximators.

0

1

8

Khanh Nguyen

@khanhxuannguyen

7 months

Another goal of this work is to construct an agent that asks increasingly abstract questions to reduce the effort of the human assisting it. When I started my PhD, I asked my advisor about every little detail. But near the end, we mostly exchanged high-level ideas. Now I am

Khanh Nguyen

@khanhxuannguyen

7 months

📢Learning is all about communication. And who is the master of communication? Humans! 😯Our new paper enables AI agents to learn more like humans. 🔥Our agents define and share increasingly abstract intentions over time, and as a result, learn with progressive efficiency.

2

15

72

0

1

8

Khanh Nguyen

@khanhxuannguyen

1 year

(1/7) In terms of RL for text gen, cite Ranzato+15 () and Shen+ () who pioneer training text generators to optimize rewards, and Bahdanau+17 () who attempt the first actor-critic solution.

An Actor-Critic Algorithm for Sequence Prediction

We present an approach to training neural networks to generate sequences using actor-critic methods from reinforcement learning (RL). Current log-likelihood training methods are limited by the...

arxiv.org

1

0

8

Khanh Nguyen

@khanhxuannguyen

7 years

Come to our #emnlp2017 poster at 10.30am today (Sep 10 GMT+2) on Reinforcement Learning for Neural MT with Simulated Ratings. #nlproc

1

8

Khanh Nguyen

@khanhxuannguyen

3 years

@xwang_lk You look like a rich guy who owns multiple casinos in HK in this photo :))

2

0

8

Khanh Nguyen

@khanhxuannguyen

3 months

@QuanquanGu @iampanxu Indeed, all the early RLHF papers on text generation use REINFORCE and A2C.

Khanh Nguyen

@khanhxuannguyen

6 months

Made this slide for my recent talk.

2

11

67

0

8

Khanh Nguyen

@khanhxuannguyen

3 months

I wonder if there has been work that compares DPO, PPO with simpler RL algo like A2C or even REINFORCE in fine-tuning LLMs. DPO can be interpreted as actor-critic with a cool math trick to obtain a reliable critic for free (i.e. use the policy itself as critic). It also has a

4

1

8

Khanh Nguyen

@khanhxuannguyen

1 year

(6/7) All of these works happened before or around the time of Christiano+17 () who introduce the now well-known method for learning from rankings, and Stiennon+20 () who apply the method with real humans on text summarization.

Learning to summarize from human feedback

As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often...

arxiv.org

1

0

7

Khanh Nguyen

@khanhxuannguyen

1 year

(4/7) Our 2017 paper () is first to present and simulate the risk of using user ratings for training text generators. People have different opinions; one's opinion varies over time. We show RL is robust to granularity, skew in rewards but not variance.

Reinforcement Learning for Bandit Neural Machine Translation with...

Machine translation is a natural candidate problem for reinforcement learning from human feedback: users provide quick, dirty ratings on candidate translations to guide a system to improve. Yet,...

arxiv.org

1

0

7

Khanh Nguyen

@khanhxuannguyen

1 year

(7/7) I hope this tweet conveys a better snapshot of the history of RLHF. Thanks for reading :)

1

0

7

Khanh Nguyen

@khanhxuannguyen

1 year

(0/7) To some people, RLHF means "learn a reward model from human rankings and RL on it". But the term literally conveys a much broader meaning: any RL method that can learn from any type of human scalar feedback.

1

0

7

Khanh Nguyen

@khanhxuannguyen

4 months

I am super proud of my collaborators @a1zhang @JensTuyls , Albert Lin, and @karthik_r_n . The problem turned out be much more challenging than we had anticipated, but we didn't give up. Our paper has just tackled an easy version of the general problem. We hope it will spark

1

0

7

Khanh Nguyen

@khanhxuannguyen

3 years

Paper: Code: Talk:

Interactive Learning from Activity Description (ICML 2021)

An interactive learning framework that enables training agents by only describing their activities in language.ICML 2021 Paper: https://arxiv.org/pdf/2102.07...

www.youtube.com

2

7

Khanh Nguyen

@khanhxuannguyen

7 months

Imitation learning and reinforcement learning have taken us really far. But I can't teach my AI complex things efficiently if I keep talking to it using primitive actions and rewards. I want our conversation to evolve to be more efficient over time.

1

0

7

Khanh Nguyen

@khanhxuannguyen

10 months

@StephenLCasper Nice survey but missing key citations. Please see this tweet for a deeper history of RLHF .

Khanh Nguyen

@khanhxuannguyen

1 year

The RLHF page of HuggingFace () misses many important citations. Here are some classical RLHF papers that you should cite and why.

4

60

343

2

0

7

Khanh Nguyen

@khanhxuannguyen

7 years

Be careful! The bias argument in PyTorch's linear is True by default. If you do NMT or LM and forget to turn this off, the pre-softmax linear's weight may not be valid embeddings.

1

0

7

Khanh Nguyen

@khanhxuannguyen

8 years

Finally had time to write some introduction about my research on calibration #NLP #calibration #machinelearning

0

2

5

Khanh Nguyen

@khanhxuannguyen

9 months

@ShunyuYao12 @tedsumers @karthik_r_n @cocosci_lab @princeton_nlp @PrincetonCS Share many of the opinions <3 In , I was also thinking of a two-system architecture because inference with the rigorous reasoning could be slow.

0

6

Khanh Nguyen

@khanhxuannguyen

7 months

We give our agents these elements and press the button. Bam!!! Progressively efficient learning emerges. Our agents conveys increasingly abstract intention over time.

1

0

6

Khanh Nguyen

@khanhxuannguyen

4 years

The discussion on VLN reminds me of our motivation for creating VLNA (). The first thing we changed was to replace initial detailed instructions with high-level instructions, essentially removing the assumption that the requester knows the task solutions...

Vision-based Navigation with Language-based Assistance via...

We present Vision-based Navigation with Language-based Assistance (VNLA), a grounded vision-language task where an agent with visual perception is guided via language to find objects in...

arxiv.org

Christopher Manning

@chrmanning

4 years

The need for open data & benchmarks in modern ML research has led to an outpouring of #NLProc data creation. But @harm_devries , @DBahdanau & I suggest the low ecological validity of most of this data undermines the resulting research. Comments welcome!

11

84

336

1

2

6

Khanh Nguyen

@khanhxuannguyen

4 months

A qualitative example: here, the Observational (no language) model mistakenly captures the movement patterns of the queen and the whale entities. It also misrecognizes the whale as an enemy. GPTHard is an approach that leverages ChatGPT to ground descriptions to entities. It

1

0

6

Khanh Nguyen

@khanhxuannguyen

3 months

@DrJimFan We did Sora+Genie but at a much more humble scale :p Still we realize that the problem of grounding language to dynamics is extremely difficult. With immense data, maybe you will generalize in distribution well, but achieving true compositional

Language-Guided World Models: A Model-Based Approach to AI Control

Installing probabilistic world models into artificial agents opens an efficient channel for humans to communicate with and control these agents. In addition to updating agent policies, humans can...

arxiv.org

0

2

5

Khanh Nguyen

@khanhxuannguyen

5 years

Looking for a new challenge because SOTA of @panderson_me VLN advanced too much @cvpr2019 ? Come check out VNLA, where an agent learns to request and understand human assistance in object-finding tasks. Novel imitation learning framework for language feedback!

1

2

6

Khanh Nguyen

@khanhxuannguyen

5 years

On the grand stage of @emnlp2019 , @kchonyc serves the community with his wisdom. The historical journey of how neural language generation was revived and took the spotlight of NLP research. Tips: Be 𝗰𝘂𝗿𝗶𝗼𝘂𝘀 and if don't have "attention", 𝗶𝗻𝘃𝗲𝗻𝘁 it!

1

0

6

Khanh Nguyen

@khanhxuannguyen

4 months

We demonstrate this scenario in Messenger. Without ever interacting with the real environment, our LWM-based agent can raise its final performance significantly by effectively incorporating language feedback (EMMA-LWM vs. Observational).

1

0

6

Khanh Nguyen

@khanhxuannguyen

1 year

(2/7) In those works, rewards given to the model were dense and computed automatically (BLEU). Sokolov+15,16,17 (, ) is one of the first to really think about learning from human ratings, modeling the problem as bandit learning.

Bandit Structured Prediction for Learning from Partial Feedback in...

We present an approach to structured prediction from bandit feedback, called Bandit Structured Prediction, where only the value of a task loss function at a single predicted point, instead of a...

arxiv.org

2

0

6

Khanh Nguyen

@khanhxuannguyen

1 year

@DrJimFan @yoavgo @johnschulman2 yeah, the (learned) reward function may be still imperfect but the (unconfirmed) hypothesis is that evaluation is easier than generation so the reward function may still be of higher quality than a policy learned with the same amount of labeling effort.

2

0

6

Khanh Nguyen

@khanhxuannguyen

7 years

About the OpenAI dota bot: there were other people beating it, just didn't make into the news :)

0

5

Khanh Nguyen

@khanhxuannguyen

5 years

@SemanticScholar There are a lot of authors that have the same name as mine. SS seems to merge all of them into a single page. Why not let the authors create their own page and add papers?

1

0

5

Khanh Nguyen

@khanhxuannguyen

3 months

@natolambert All the early RLHF papers on text generation use REINFORCE and A2C.

Khanh Nguyen

@khanhxuannguyen

6 months

Made this slide for my recent talk.

2

11

67

0

5

Khanh Nguyen

@khanhxuannguyen

4 months

To implement this approach, we need world models that make it easy for humans to adapt them. Traditional world models can be adapted with only observations, which are inadequate for humans to convey intentions. We develop world models that can be adapted through language.

1

0

5

Khanh Nguyen

@khanhxuannguyen

1 year

The theoretical fact that RL = Reverse KL optimization is pretty well-known and has been re-discovered multiple times (e.g., , , , ).

RL with KL penalties is better viewed as Bayesian inference

Reinforcement learning (RL) is frequently employed in fine-tuning large language models (LMs), such as GPT-3, to penalize them for undesirable features of generated sequences, such as...

arxiv.org

Khanh Nguyen

@khanhxuannguyen

1 year

Why RL-tuning hurts calibration of LLMs? RL objective can be written as a reverse KL divergence which encourages mode-seeking behavior (i.e. peaky distribution). RL+translation has studied this phenomenon a long time ago (, )

4

11

91

0

1

5

Khanh Nguyen

@khanhxuannguyen

3 months

I strongly encourage @GoogleDeepMind to acknowledge the early work on RLHF for text generation that pioneers the use of REINFORCE on this problem. Simplicity prevails!

Khanh Nguyen

@khanhxuannguyen

6 months

Made this slide for my recent talk.

2

11

67

0

5

Khanh Nguyen

@khanhxuannguyen

4 months

The model-based approach is not only human-compatible but practically efficient: because an agent's policies are optimized w.r.t. to a world model, changing that model systematically shifts all the policies.

1

0

5

Khanh Nguyen

@khanhxuannguyen

1 year

(3/7) "Bandit" is important because naturally you could only ask a human to give one rating for a whole text. Sokolov formulation characterizes how difficult the problem is compared to video-game dense-reward RL problems.

1

0

5

Khanh Nguyen

@khanhxuannguyen

3 months

I am talking about these papers and the paper mentioned in the previous tweets.

Khanh Nguyen

@khanhxuannguyen

6 months

Made this slide for my recent talk.

2

11

67

1

0

5

Khanh Nguyen

@khanhxuannguyen

8 years

NAACL 2016 best papers: flexible, LEGO-style neural architectures #naacl2016 #nlproc

0

5

Khanh Nguyen

@khanhxuannguyen

5 years

@haldaume3 This is follow-up work on (with @debadeepta , @chris_brockett , and Bill Dolan) with nat. instructions, an extension of @panderson_me R2R task to multi-round interaction, and is closely related to @_jessethomason_ Vision-and-Dialog (CVDN) task

Vision-based Navigation with Language-based Assistance via...

We present Vision-based Navigation with Language-based Assistance (VNLA), a grounded vision-language task where an agent with visual perception is guided via language to find objects in...

arxiv.org

1

2

5

Khanh Nguyen

@khanhxuannguyen

4 months

As humans, we influence others' worldview to shape their behavior. A kid asks his mom if he can go swimming at a nearby lake. The mom says: "there was a drowning accident over there last year." After listening to that, the kid chooses to stay home. Here, instead of giving an

1

0

5

Khanh Nguyen

@khanhxuannguyen

4 months

We find that the standard Transformer architecture struggle to generalize compositionally, and augment it with a more effective attention mechanism. More details and results are in the paper.

1

0

5

Khanh Nguyen

@khanhxuannguyen

4 months

AI control has mainly taken a model-free approach: constructing agents made of black-box policies, then directly updating the policies to change their behavior. In contrast, a model-based approach constructs agents with explicit mental states and enables humans to easily

1

0

5

Khanh Nguyen

@khanhxuannguyen

3 months

and “alignment” is the new name for RL for structured prediction… (I guess that is not the originally intended meaning but that is what it turns out to be now)

Hal Daumé III

@haldaume3

3 months

Convince me I'm wrong: Generative AI is the new name for structured prediction. An interviewer asked for a def of GenAI & offhand: "an AI system that generates a complex output at once (vs a single prediction)" I later realized that's ≈identical to the def of SP I'd give ~2005

20

7

108

1

5

Khanh Nguyen

@khanhxuannguyen

3 months

For those who are unfamilar. This is the past I talked about. I apologize if you have seen this slide too often recently. But not enough people have seen it.

Khanh Nguyen

@khanhxuannguyen

6 months

Made this slide for my recent talk.

2

11

67

1

4

Khanh Nguyen

@khanhxuannguyen

5 years

@haldaume3 @debadeepta @chris_brockett @panderson_me @_jessethomason_ @emnlp2019 The ACL organizers should consider putting the rebuttal period back to ACL and NAACL. While the number of papers is increasing exponentially, reducing the amount of review time spent on each paper may not be the best move to maintain/improve the review quality.

0

5

Khanh Nguyen

@khanhxuannguyen

4 months

Finally, we illustrate a promising application of LWMs, in which these models enable agents to generate and discuss plans with humans before execution. This makes agents more safe, intepretable, and robust! In this setting, humans can not only provide action-correcting

1

0

5

Khanh Nguyen

@khanhxuannguyen

7 years

My thought on the similarity between actor-critic and GAN

1

0

5

Khanh Nguyen

@khanhxuannguyen

1 year

@yoavgo @johnschulman2 i think the viewing of llm having a fixed knowledge graph is slightly misleading, by instruct-tune you also add knowledge and modify the knowledge graph. the issue to me is overgeneralization: instead of learning just the taught knowledge, llm also learns hallucination behavior.

2

0

5

Khanh Nguyen

@khanhxuannguyen

4 months

Our modeling approach converts trajectories into sequences of tokens and trains a Transformer as a world model to auto-regressively generate those sequences.

1

0

5

Khanh Nguyen

@khanhxuannguyen

7 months

@amritsinghbedi3 Another remark is whether the current formulation would result in overly conservative agents because an easy way to optimize the objective is to make the data distribution have very low support. RLHF is known to hurt calibration. This problem has also been studied in machine

1

0

4

Khanh Nguyen

@khanhxuannguyen

4 months

Website: Paper: [end]

0

5

Khanh Nguyen

@khanhxuannguyen

4 months

We first construct a benchmark based on the Messenger environment. There, a model needs to interpret a language manual to predict environment dynamics. This is a hard language-grounding problem. A model has to learn representations of entities, correctly extract textual features,

1

0

5