Aviral Kumar Profile
Aviral Kumar

@aviral_kumar2

2,461
Followers
340
Following
85
Media
143
Statuses

Research Scientist at Google DeepMind. Incoming Assistant Professor of CS & ML at CMU. PhD from UC Berkeley.

Berkeley
Joined May 2016
@aviral_kumar2
Aviral Kumar
11 months
Thrilled to share that I will be joining Carnegie Mellon @SCSatCMU as an Assistant Professor of CS and ML @CSDatCMU @mldcmu in Fall 2024. Extremely thankful to my mentors & collaborators, especially @svlevine ! Looking forward to working with amazing students & colleagues at CMU!
66
29
679
@aviral_kumar2
Aviral Kumar
7 months
Posting this a bit late, but if you are applying for a PhD in AI and are interested in decision making and reinforcement learning, please consider applying to my upcoming lab at CMU by December 13! Details about my interests and application process can be found on my website.
4
62
311
@aviral_kumar2
Aviral Kumar
2 months
Many LLM fine-tuning methods exist. Unclear which you should use & why? In our new paper, we did an extensive study of on-policy RL, supervised & offline contrastive methods (DPO, IPO) to answer this... 🧵⬇️ On-policy > offline, mode-seeking > mode-covering
Tweet media one
3
68
277
@aviral_kumar2
Aviral Kumar
4 months
Super simple code change to get value-based deep RL to scale *much* better w/ big models across the board on Atari games, robotic manipulation w/ transformers, LLM + text games, & even Chess! Just use classification loss (i.e., cross entropy), not MSE!! 🧵⬇️
Tweet media one
3
43
262
@aviral_kumar2
Aviral Kumar
4 months
How can we train LLM agents to learn from their own experience autonomously? Introducing ArCHer, a simple (i.e., a small change on top of standard RLHF) and effective way of doing so with multi-turn RL 🧵⬇️ Paper: Website:
2
41
193
@aviral_kumar2
Aviral Kumar
2 years
First tweet: Recent work showing how to train big models via offline RL on diverse, multi-game data. 2 billion sub-opt. data + offline RL => generalist policy better than data & good at fine-tuning. w/ @svlevine @agarwl_ @younggeng @georgejtucker
2
14
138
@aviral_kumar2
Aviral Kumar
8 months
A crucial component in modern ML seems to be using the *right*, quality subset of data for learning. What does this mean for offline RL? Given an offline dataset, can we also improve perf. by developing automatic ways to filter data? We answer this in our NeurIPS 2023 paper 🧵
1
13
103
@aviral_kumar2
Aviral Kumar
9 months
Human video (e.g., Ego 4D) pre-training can improve robot control, including for downstream robotic RL. But can we *also* use RL for actually doing video pre-training? Yes! Value-based offline RL can pre-train on video for your robot! Introducing V-PTR 🧵
1
14
99
@aviral_kumar2
Aviral Kumar
8 months
Can we use text-to-image diffusion models to steer robots into doing things, zero-shot? Our method, SuSIE, fine-tunes diffusion models trained for image editing to produce future subgoals from a given scene, which then drive a low-level policy. 🧵⬇️
1
20
97
@aviral_kumar2
Aviral Kumar
1 year
Interested in offline RL that rapidly improves with limited online interaction? Check out Cal-QL: a method for pre-training with offline RL to enable fast fine-tuning, that's just a 1-line code change on conservative Q-learning (CQL)! A thread 🧵...
1
18
94
@aviral_kumar2
Aviral Kumar
3 months
Our new paper on understanding why LLMs make up stuff & hallucinate, and how RL fine-tuning with an appropriate conservative reward model can mitigate these issues. Paper: A thread below 🧵⬇️ (+ check @katie_kang_ 's thread for many more details)
Tweet media one
@katie_kang_
Katie Kang
3 months
We know LLMs hallucinate, but what governs what they dream up? Turns out it's all about the "unfamiliar" examples they see during finetuning. Our new paper shows that manipulating the supervision on these special examples can steer how LLMs hallucinate 🧵
Tweet media one
11
78
368
3
8
63
@aviral_kumar2
Aviral Kumar
7 months
On my way to NOLA for #NeurIPS2023 ! We will present several works on offline RL, fast online fine-tuning, using pre-trained models for improving low-level robot control, RL pre-training on human videos, and querying VLMs for maximal efficacy in RL. Come talk to us! Details ⬇️
1
1
42
@aviral_kumar2
Aviral Kumar
10 months
Check out our work on training large transformer policies on demo and autonomous data (including failures of existing imitation policies) via offline Q-learning. Q-Transformer improves over RT-1 on real robots & provides a recipe for building ever-improving robotic systems! ⬇️
@YevgenChebotar
Yevgen Chebotar
10 months
Offline RL strikes back! In our new Q-Transformer paper, we introduce a scalable framework for offline reinforcement learning using Transformers and autoregressive Q-Learning to learn from mixed-quality datasets! Website and paper: 🧵
8
111
543
0
0
25
@aviral_kumar2
Aviral Kumar
9 months
Great collab led by @ChetBhateja , Derek & @its_dibya . w/ @Anikait_Singh_ , @manan_tomar , @QuanVng , @YevgenChebotar , @svlevine ! I was quite(?) late in posting, but check: , Paper:
@_akhaliq
AK
9 months
Robotic Offline RL from Internet Videos via Value-Function Pre-Training paper page: Pre-training on Internet data has proven to be a key ingredient for broad generalization in many modern ML systems. What would it take to enable such capabilities in
Tweet media one
2
39
183
0
1
12
@aviral_kumar2
Aviral Kumar
4 months
This project was amazing & fun led by @JesseFarebro @agarwl_ , with a number of fantastic collaborators @QuanVng Jordi Orbay Adrien Ali Taiga @YevgenChebotar @pcastr @AleksandraFaust @svlevine @xiao_ted @AlexIrpan .
0
0
11
@aviral_kumar2
Aviral Kumar
4 months
3. Chess without search Achieves AlphaZero level performance on chess, without needing any MCTS -- just distill data into the value function with a cross-entropy loss, building on top of the results in
Tweet media one
2
1
9
@aviral_kumar2
Aviral Kumar
4 months
So why does this work? We study many hypotheses and find that cross-entropy improves value-based RL's ability to deal with non-stationarity, improves representation quality, and makes it robust to noise. These are big problems in RL. Check out Sec. 5 for detailed analysis!
Tweet media one
Tweet media two
Tweet media three
1
0
7
@aviral_kumar2
Aviral Kumar
4 months
Method: Take your favorite value-based RL method (CQL for offline RL, DQN for online RL, etc.), convert the Bellman target into a categorical distribution (more on this next), and replace the MSE loss to the Bellman target with cross-entropy. And that is it!
Tweet media one
1
0
7
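A minimal sketch (in PyTorch, my own illustration rather than the authors' code) of the loss swap described in this tweet: instead of regressing Q onto the scalar Bellman target with MSE, the Q-network outputs per-bin logits and is trained with cross-entropy against a categorical version of the target. The bin discretization helpers (two-hot, HL-Gauss) are sketched after the next tweet.

```python
import torch
import torch.nn.functional as F

def mse_td_loss(q_values, td_targets):
    # Regression view: Q(s, a) is fit to the scalar TD target with squared error.
    return F.mse_loss(q_values, td_targets.detach())

def cross_entropy_td_loss(q_logits, target_probs):
    # Classification view: per-bin logits are fit to a categorical target
    # distribution (e.g., produced by two-hot or HL-Gauss) with cross-entropy.
    log_probs = F.log_softmax(q_logits, dim=-1)
    return -(target_probs.detach() * log_probs).sum(dim=-1).mean()
```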
@aviral_kumar2
Aviral Kumar
4 months
We studied many methods for converting targets into categorical distributions: 1. Two-hot ➡️ put probability mass in two consecutive bins surrounding the scalar target 2. HL-Gauss ➡️ add noise to the target value and then discretize into bins 3. C51 ➡️ cross-entropy + dist. RL
Tweet media one
1
0
5
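A hedged illustration of the first two transforms listed above (a sketch under assumed bin conventions, not the paper's implementation): two-hot splits mass across the two bins that bracket the scalar target, while HL-Gauss smears the target with a Gaussian and integrates it over the bin edges.

```python
import torch

def two_hot(target, v_min, v_max, num_bins):
    # Put probability mass on the two consecutive bins surrounding the scalar target.
    target = target.clamp(v_min, v_max)
    bin_width = (v_max - v_min) / (num_bins - 1)
    pos = (target - v_min) / bin_width                      # fractional bin index
    lower = pos.floor().long().clamp(0, num_bins - 1)
    upper = (lower + 1).clamp(max=num_bins - 1)
    upper_w = (pos - lower.float()).unsqueeze(-1)
    probs = torch.zeros(*target.shape, num_bins, dtype=target.dtype)
    probs.scatter_(-1, lower.unsqueeze(-1), 1.0 - upper_w)
    probs.scatter_add_(-1, upper.unsqueeze(-1), upper_w)
    return probs

def hl_gauss(target, v_min, v_max, num_bins, sigma):
    # Smear the scalar target with a Gaussian, then discretize by integrating
    # the Gaussian CDF over bin edges and renormalizing.
    target = target.clamp(v_min, v_max).unsqueeze(-1)
    edges = torch.linspace(v_min, v_max, num_bins + 1, dtype=target.dtype)
    cdf = torch.distributions.Normal(target, sigma).cdf(edges)
    probs = cdf[..., 1:] - cdf[..., :-1]
    return probs / probs.sum(dim=-1, keepdim=True)
```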
@aviral_kumar2
Aviral Kumar
4 months
Most LLM fine-tuning is done within a single turn. This is limiting: it does not teach the LLM how to seek information, optimize long-term metrics, or reason about its past actions. Result: verbose, non-targeted responses => ❌❌ agent problems ➡️ need multi-turn LLM fine-tuning.
Tweet media one
1
0
7
@aviral_kumar2
Aviral Kumar
1 year
This was an exciting collaboration with @mitsuhiko_nm , @simon_zhai , Anikait Singh, Max Sobol Mark, @YiMaTweets , @chelseabfinn & @svlevine . Definitely check out Sergey's detailed thread: and the website:
@svlevine
Sergey Levine
1 year
Can conservative Q-learning be used to pretrain followed by online finetuning? Turns out that naive offline RL pretraining leads to a "dip" when finetuning online, but we can fix this with a 1-line change! That's the idea in Cal-QL: A thread👇
Tweet media one
4
49
284
0
0
6
@aviral_kumar2
Aviral Kumar
2 months
Overall, this was a fun collaboration & we learned a lot! Lots of experiments, analysis in the paper: (takeaway boxes if you don't have time) w/ @FahimTajwar10 @Anikait_Singh_ @archit_sharma97 @rm_rafailov Jeff @tengyangx @StefanoErmon @chelseabfinn
0
0
6
@aviral_kumar2
Aviral Kumar
4 months
This work was an amazing, truly enjoyable collaboration, led by @YifeiZhou02 , w/ @Zanette_ai , @pan_jiayipan and @svlevine . I learned a lot working with the team!
1
0
6
@aviral_kumar2
Aviral Kumar
2 years
Broadly, I am excited about this as it presents a starting point to scale up offline RL as a pre-training method that could ingest all of the data out there. Lots of algorithmic and technical questions to explore on this front!
1
0
6
@aviral_kumar2
Aviral Kumar
4 months
Our key insight: Take any RL method for single-turn LLM fine-tuning & replace the reward model (RM that works for 1 turn) with a turn-level value model (trained with off-policy RL), accounting for future turns. Use it to provide rewards for the token policy instead of the RM.
1
0
6
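A rough schematic of that insight (hypothetical interfaces, not the released ArCHer code): a turn-level critic is trained off-policy with TD across turns, and its value replaces the single-turn reward model as the score fed to an otherwise standard token-level policy-gradient update.

```python
def archer_style_update(policy, critic, buffer, token_pg_update, gamma=0.99):
    # 1) Turn-level critic: off-policy TD over (state, utterance, reward, next_state)
    #    tuples, so its value accounts for future turns of the interaction.
    for state, utterance, reward, next_state, done in buffer.sample_turns():
        td_target = reward + (0.0 if done else gamma * critic.value(next_state))
        critic.update(state, utterance, td_target)

    # 2) Token-level policy: any single-turn RLHF-style update, except the score
    #    for a sampled utterance comes from the critic rather than a reward model.
    for state in buffer.sample_states():
        utterance = policy.generate(state)
        turn_score = critic.q(state, utterance)   # replaces RM(state, utterance)
        token_pg_update(policy, state, utterance, turn_score)
```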
@aviral_kumar2
Aviral Kumar
2 months
1. On-policy sampling improves perf. and efficiency, especially when the peak of the reward lies farther from the init / ref policy, even when the reward model is learned from the same pref dataset that methods without on-policy sampling also use, i.e., model-based > model-free
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
2 years
@svlevine @agarwl_ @younggeng @georgejtucker See Sergey's thread below: a combination of existing offline RL ideas (CQL, DR3) and C51 can make offline RL work with large models, retaining benefits of "stitching" and policy improvement.
@svlevine
Sergey Levine
2 years
A big goal for Atari is training one policy on many games. In new work, we show that offline RL (CQL) can do this well w/ big models. On suboptimal data it beats SOTA by 2.5x, finetunes to new games, brings us closer to dream of offline pre-training: ๐Ÿงต>
Tweet media one
Tweet media two
Tweet media three
3
38
172
1
0
5
@aviral_kumar2
Aviral Kumar
1 year
Check out Joey's talk at #ICLR2023 at 4pm local time (poster at 4:30 pm local time) on how we can train offline value functions for multiple levels of conservatism and then adjust the level with online data to attain improved performance.
@svlevine
Sergey Levine
2 years
Offline RL algorithms require choosing a constraint or a level of pessimism/conservatism. But what if we train a value function to support *any* level of conservatism? We study this in our new paper on confidence-conditioned offline RL: Short ๐Ÿงต:
4
14
98
0
0
4
@aviral_kumar2
Aviral Kumar
4 months
2. Generalist robotic manipulation 67% improvement and much better learning speed on top of offline RL Q-Transformer for robotic manipulation with human teleop demos + autonomous failures data
Tweet media one
@YevgenChebotar
Yevgen Chebotar
10 months
Offline RL strikes back! In our new Q-Transformer paper, we introduce a scalable framework for offline reinforcement learning using Transformers and autoregressive Q-Learning to learn from mixed-quality datasets! Website and paper: 🧵
8
111
543
1
0
4
@aviral_kumar2
Aviral Kumar
2 months
We grouped methods along two axes (Sec 3.2): (1) running on-policy rollouts against a reward model learned from pref data (like "offline model-based RL" in RL) [w/ or w/o sample reuse] (2) using a negative gradient: not just maximizing likelihood but also pushing it down (DPO, IPO)
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
2 months
3. We find that on-policy sampling + negative gradient are complementary, since on-policy DPO > on-policy PPO in our experiments (Section 5.3). DPO / IPO gradients provide a stronger learning signal than PPO... in some ways, the negative gradient helps kill variance.
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
2 months
You may want to compensate for less on-policy sampling with sample reuse (i.e., make more updates on stale data). This can help a little, but unless curated well it does hurt... (T=2 does a bit better than T=1 quickly, but then it hurts)
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
4 months
4. LLM text games Also does really well on text games / LLM agent tasks, like playing Wordle with offline RL, building on top of CQL (43% improvement)
Tweet media one
1
1
4
@aviral_kumar2
Aviral Kumar
4 months
In 2022, we did some of the first work to scale up offline RL (CQL) to big models, with multi-game Atari data. We found C51 (dist. RL) to be critical, but didn't know why.... Turns out the cross-entropy in C51 was the key: it enables RL to scale well!!
1
0
4
@aviral_kumar2
Aviral Kumar
9 months
Our recipe trains on videos with RL and then continues to run RL on the robot. Concretely: first run value-based offline RL on videos, then run offline RL on robot data (you could use RT-X data now too!) to get a general policy, then fine-tune to your task with just a few demos.
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
2 months
To sum up the big picture: 1. need to explore regions covered less by the ref policy => use on-policy sampling ✅ 2. need a strong learning signal => use negative gradients on suboptimal / negative data ✅ hence our paper title...
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
4 months
We got many great results with this method: 1. Multi-game Atari with offline CQL (C51 is our prior result): 82% improvement over it and larger gains over multi-game DT (which does not use any RL)
Tweet media one
@aviral_kumar2
Aviral Kumar
2 years
First tweet: Recent work showing how to train big models via offline RL on diverse, multi-game data. 2 billion sub-opt. data + offline RL => generalist policy better than data & good at fine-tuning. w/ @svlevine @agarwl_ @younggeng @georgejtucker
2
14
138
1
0
3
@aviral_kumar2
Aviral Kumar
1 year
We tried to understand the key ingredient that allows us to use as many gradient steps as possible to enable fast online RL. Check out our #ICLR2023 paper by @qiyang_li at 11:30 am local time on Tuesday. ICLR link:
@qiyang_li
Qiyang Li
1 year
What is the key ingredient that enables sample-efficient online RL with TD objectives? tl;dr – We find that techniques to enable sample-efficient online RL are also effective at controlling a notion of validation error. A thread 🧵: 1/N
3
18
99
0
1
3
@aviral_kumar2
Aviral Kumar
2 months
But, upon looking at the mechanisms behind them (Sec 5.2.2), these methods often extrapolate, and the likelihood of y+ in the data decreases (not everywhere; for UltraFB it does not decrease). Of course, extrapolation can be good or bad, so we can't always say offline DPO is better!
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
8 months
Overall, SuSIE is a simple recipe to use web-scale pre-training for boosting semantic generalization *and* policy precision in robot control! Code here: Awesome work led by @kvablack & @mitsuhiko_nm , w/ Pranav, @HomerWalke , @chelseabfinn , @svlevine
1
0
4
@aviral_kumar2
Aviral Kumar
8 months
**Insight:** instead of constraining the policy to the data distribution given to you, we should constrain the policy against a better, *reweighted* version of the data distribution, allowing for behavior better than the offline data while avoiding OOD actions.
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
8 months
This re-weighting of the data enables all methods to work much better across the board! We test on D4RL tasks, and @ZhangWeiHong9 did a very extensive stress test for all methods across many data compositions to verify that the trend holds.
Tweet media one
1
0
3
@aviral_kumar2
Aviral Kumar
4 months
@JesseFarebro @BlackHC Yes, I should have been more clear: as Jesse said, it is categorical cross-entropy and turning it into a classification problem that matters.
0
0
2
@aviral_kumar2
Aviral Kumar
4 months
Overall, this change to cross-entropy is super simple, addresses issues that we face in value-based offline & online RL, and works reliably in the "scaling" regime (becoming more important with big models such as transformers). Try this out on your problem & let us know!
1
0
3
@aviral_kumar2
Aviral Kumar
4 months
On WebShop specifically, ArCHer with a GPT-2 base model can improve over the perf. of the (much more capable) GPT-3.5 with ReAct and an expert prompt => ArCHer is very good at learning from rewards, autonomously!!
Tweet media one
1
0
3
@aviral_kumar2
Aviral Kumar
9 months
Video pre-training (done on Ego 4D) allows us to learn about intentions and associated outcomes (check ICVF: ), and then robotic offline RL (check PTR: ) brings in understanding of robot actions, dynamics, etc.
1
0
3
@aviral_kumar2
Aviral Kumar
8 months
SuSIE is really simple: 1. Fine-tune an image editing model on robot data to produce future sub-goals for a language command. 2. Take any goal-reaching policy, good at reaching short-term goals 3. At test time, command the policy with subgoals from your model and iterate!
Tweet media one
1
0
3
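A minimal sketch of step 3, the test-time loop (hypothetical interfaces, not the released SuSIE code): every k steps the fine-tuned image-editing model "edits" the current frame into a subgoal image for the language command, and a goal-conditioned low-level policy chases that subgoal.

```python
def susie_rollout(env, subgoal_model, goal_policy, command, horizon=200, k=20):
    obs = env.reset()
    subgoal = None
    for t in range(horizon):
        if t % k == 0:
            # Propose a plausible near-future frame for the command from the current image.
            subgoal = subgoal_model.edit(image=obs, prompt=command)
        action = goal_policy(observation=obs, goal=subgoal)
        obs, reward, done, info = env.step(action)
        if done:
            break
    return obs
```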
@aviral_kumar2
Aviral Kumar
8 months
This was a very fun collaboration led by @ZhangWeiHong9 , together with @abhishekunique7 @pulkitology and others! Paper: Some of our past work on data sharing: CDS: UDS:
1
0
3
@aviral_kumar2
Aviral Kumar
2 months
2. Negative gradients accelerate convergence of offline methods: in our bandit, we tried to use an explicit "likelihood minimizer" term (kind of like unlikelihood) on top of distilled Best-of-N and found it to be better. IPO was the best here. Similar trends in other setups..
Tweet media one
Tweet media two
1
0
3
@aviral_kumar2
Aviral Kumar
7 months
Finally, I (on behalf of William) will also present some ongoing work on **promptable representations** -- a framework for steering off-the-shelf VLMs into producing features that are particularly useful for downstream control & policy learning. More on this near the workshops!
1
0
2
@aviral_kumar2
Aviral Kumar
2 months
We also provide a theoretical result (Lemma 6.2) for this, trying to understand where the probability mass recovered by pushing down on negatives goes and when it can move to preferred-response regions.
Tweet media one
1
1
3
@aviral_kumar2
Aviral Kumar
2 months
Mode-seeking KL => faster re-organization of probability mass as long as the KL loss is not 0 (which we rarely get to on the training set). This is cool theoretically, since categorical distributions do not present misspecification like the classic 2-modes-vs-unimodal-Gaussian example
Tweet media one
Tweet media two
1
1
3
@aviral_kumar2
Aviral Kumar
2 months
We also theoretically unify these concepts of on-policy sampling + neg grad under mode-seeking losses (e.g., reverse KL) vs mode-covering losses (e.g., forward / supervised learning KL). 1. DPO, RL, on-policy ReST: mode-seeking 2. offline RWR, BoN: mode-covering (see Sec 6.1)
1
0
3
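For readers who want the two loss families written out, here is a compact paraphrase of the standard definitions being referenced (my own notation; see Sec 6.1 of the paper for the exact setup):

```latex
% Mode-covering (forward KL): maximum-likelihood / SFT-style objectives, which
% spread mass over everything the data distribution covers.
% Mode-seeking (reverse KL): on-policy RL with a KL penalty (and, per the paper,
% DPO-style updates), which concentrates mass on high-reward modes.
\begin{align}
  \mathcal{L}_{\text{cover}}(\theta)
    &= D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\|\, \pi_\theta\right)
     = \mathbb{E}_{y \sim p_{\text{data}}}\big[\log p_{\text{data}}(y) - \log \pi_\theta(y)\big],\\
  \mathcal{L}_{\text{seek}}(\theta)
    &= D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, p^{*}\right),
     \qquad p^{*}(y) \propto \pi_{\text{ref}}(y)\,\exp\!\big(r(y)/\beta\big).
\end{align}
```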
@aviral_kumar2
Aviral Kumar
8 months
The most interesting aspect to me is how SuSIE allows Internet vision-language knowledge to be used for enhancing precision of the low-level policy execution. In fact, we found SuSIE to be better than RT-X and even oracle goal-reaching due to this! Why? โฌ‡๏ธ
1
0
3
@aviral_kumar2
Aviral Kumar
2 months
In general, we found that on-policy sampling tends to be irrelevant only when the reward peak is already close to the highly likely regions of the starting / ref policy... only in one of the three problems below ("Mode length") is on-policy sampling not needed...
Tweet media one
1
0
3
@aviral_kumar2
Aviral Kumar
4 months
More detailed diagram for those interested in learning the details of the practical algo / trying it out (we would love to hear feedback if you try this out; github code link in the last tweet on this thread)
Tweet media one
1
0
3
@aviral_kumar2
Aviral Kumar
8 months
Intuition: SuSIE's diffusion model produces careful sub-goals, such that even just trying to match the arm position helps prevent imprecise / premature grasps, or a lack of awareness of object poses. This mechanism is absent in many vision-language pre-training methods in robotics
1
0
3
@aviral_kumar2
Aviral Kumar
2 months
We make some interesting takeaways... (I'll list some here, but there are many more interesting experiments in the paper + the mini summary on the website). Paper: The paper is long, but we tried to make it accessible with takeaway boxes! (e.g. below ⬇️)
Tweet media one
1
1
3
@aviral_kumar2
Aviral Kumar
8 months
This idea works very well! Many videos on the website, but quantitatively this does do better than other ways of using diffusion models and even RT-2-X, which is much bigger, trained on more data (including our robot data).
Tweet media one
1
0
3
@aviral_kumar2
Aviral Kumar
4 months
Check the paper, website for results, details (and also a theoretical justification for why this works!!) Code: Website: We are **very** excited to try ArCHer at large scales! Reach out if you have interesting agent problems.
1
0
3
@aviral_kumar2
Aviral Kumar
4 months
โžก๏ธThis "hierarchical" view can lead to several other multi-turn RL approaches for LLMs. In the paper, we study several alternate design choices in the paper (different value learning objectives, policy gradient objectives, etc). But, many designs yet remain to be explored!
1
0
3
@aviral_kumar2
Aviral Kumar
8 months
We instantiate this insight by learning importance weights that satisfy certain "Bellman-like" consistency conditions (details in the paper: ) and integrate it into off-the-shelf offline RL methods (CQL, IQL, TD3+BC).
1
0
3
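A hedged illustration of the integration step only (not the weight-learning procedure itself, which is the paper's contribution): once per-sample weights are available, they simply rescale each transition's contribution to the underlying method's critic loss.

```python
import torch

def weighted_td_loss(q_pred, td_target, weights):
    # q_pred, td_target, weights: tensors of shape (batch,). `weights` would come
    # from the paper's Bellman-consistent weight learner; here they are an input.
    per_sample = (q_pred - td_target.detach()) ** 2
    w = weights.detach()
    return (w * per_sample).sum() / w.sum().clamp(min=1e-8)
```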
@aviral_kumar2
Aviral Kumar
2 months
We designed various problems that admit different geometric & coverage relations between reward, policy coverage, and pref data coverage (like in the first figure)... A mix of bandit problems + synthetic LLM fine-tuning problems (e.g., below) + then transferred to full-scale LLMs
Tweet media one
Tweet media two
1
0
2
@aviral_kumar2
Aviral Kumar
4 months
How well does this approach work? On several text game environments and the WebShop benchmark, we find ArCHer to be 100x 🚀 more sample-efficient than PPO and much better in perf. & efficiency than other methods, when learning from its self-collected data, autonomously.
Tweet media one
1
0
2
@aviral_kumar2
Aviral Kumar
8 months
And yes, of course, you can use human video data along with robot data to fine-tune the diffusion model. We find video data (Something-Something dataset in our experiments) to boost zero-shot generalization further.
Tweet media one
1
0
2
@aviral_kumar2
Aviral Kumar
4 months
This idea is generalizable: we can view this approach as running two independent RL methods: one at the utterance level (i.e., off-policy TD) & the other at the token level (i.e., policy gradient). The utterance-level RL method sets learning targets for the token-level method.
1
0
2
@aviral_kumar2
Aviral Kumar
8 months
The most interesting aspect to me was that our re-weighting was important for generalization (1) when the initial states are different at test time (basically any real-world problem), and (2) when the dataset is small (& generalization is needed). Initial states result ⬇️
Tweet media one
1
0
2
@aviral_kumar2
Aviral Kumar
9 months
Final downstream tasks on your robot could be specified in language or in any other way. We see that the resulting policy exhibits an elevated level of generalization across objects / positions / scenes, and robustness to distractors. Some examples here on Twitter, more on the website
1
0
2
@aviral_kumar2
Aviral Kumar
1 year
Why? We chose CQL and tried to understand why performance dips when fine-tuning. Turns out CQL needs samples to "correct" its values before online learning progresses normally. Conservative values are great for offline RL, but do not match the scale of returns on online data.
1
0
2
@aviral_kumar2
Aviral Kumar
9 months
Lots more analyses in the paper -- we probe value functions learned from video and why pre-training with RL on video helps downstream RL. Overall, I am excited about understanding why / when / how / what can be pre-trained from video for robots, and V-PTR is a step towards it!
Tweet media one
1
0
2
@aviral_kumar2
Aviral Kumar
4 months
5. Atari results: single-game, MoE And finally, also does well on single-game offline & online Atari, with no apparent performance degradation with higher capacity or more updates unlike standard offline RL. Better results with MoE models building on:
Tweet media one
Tweet media two
@_akhaliq
AK
4 months
Google announces In deep reinforcement learning, a pruned network is a good network Recent work has shown that deep reinforcement learning agents have difficulty in effectively using their network parameters. We leverage prior insights into the advantages of sparse training
Tweet media one
4
84
464
2
0
2
@aviral_kumar2
Aviral Kumar
7 months
On Wednesday at 10:45 am (poster 1312), @mitsuhiko_nm @simon_zhai will present Cal-QL: , a principled and effective algorithm for online fine-tuning of offline RL policies (& works in the real world too).
Tweet media one
@aviral_kumar2
Aviral Kumar
1 year
Interested in offline RL that rapidly improves with limited online interaction? Check out Cal-QL: a method for pre-training with offline RL to enable fast fine-tuning, that's just a 1-line code change on conservative Q-learning (CQL)! A thread 🧵...
1
18
94
1
0
2
@aviral_kumar2
Aviral Kumar
1 year
Theoretically, this calibration in Cal-QL can be viewed as a way to better control the terms that show up in a decomposition of cumulative regret. In my opinion, this is interesting as it provides us with a mental model to think about offline RL + online fine-tuning more broadly.
Tweet media one
1
0
2
@aviral_kumar2
Aviral Kumar
1 year
Simple fix: just prevent offline Q-values from becoming too small. We call this "calibration" (i.e., adjusting the scale of the Q-function). This can be done by ensuring Q-values are larger than the Q-values of a reference policy, e.g., the behavior policy. This is calibrated Q-learning.
Tweet media one
1
0
2
@aviral_kumar2
Aviral Kumar
1 year
Prior methods for offline RL + online fine-tuning suffer from issues: for some, performance dips and samples are wasted on recovery; for many others, online learning improves slowly. In fact, recent work (RLPD) shows that avoiding pre-training altogether can perform better.
Tweet media one
1
0
1
@aviral_kumar2
Aviral Kumar
6 months
Here's the poster for this paper (come over to the FMDM workshop tomorrow to discuss more!)
Tweet media one
0
0
0
@aviral_kumar2
Aviral Kumar
7 months
At the FMDM & GCRL workshops (on Fri) and Robot Learning & GenPlan workshops (on Sat), @mitsuhiko_nm will present SuSIE, an approach for using pre-trained image-editing models to improve low-level control.
Tweet media one
@aviral_kumar2
Aviral Kumar
8 months
Can we use text-to-image diffusion models to steer robots into doing things, zero-shot? Our method, SuSIE, fine-tunes diffusion models trained for image editing to produce future subgoals from a given scene, which then drive a low-level policy. 🧵⬇️
1
20
97
1
0
1
@aviral_kumar2
Aviral Kumar
4 months
Scaling: we could only test a few domains with larger base models, but found ArCHer to also improve significantly with a Mistral 7B base model. I think **if your single-turn RLHF scales, ArCHer will inherit those scaling benefits** + it brings sample efficiency to the table.
Tweet media one
1
0
1
@aviral_kumar2
Aviral Kumar
3 months
While we focus on factuality fine-tuning in this paper, the understanding of how fine-tuning data affects behavior + the RLFT recipe has lots of potential to impact other LLM fine-tuning problems where we see similar issues -- hallucinations, incorrect reasoning traces, etc.
1
0
1
@aviral_kumar2
Aviral Kumar
3 months
Amazing collaboration led by @katie_kang_ , with @Eric_Wallace_ Claire Tomlin, @svlevine !
0
0
1
@aviral_kumar2
Aviral Kumar
1 year
Empirically, Cal-QL outperforms prior online fine-tuning methods. More importantly, it improves over online methods that do not use pre-training. A great point: Cal-QL benefits from advances that make standard online RL fast => lots of opportunities to make Cal-QL even faster!
Tweet media one
1
0
1
@aviral_kumar2
Aviral Kumar
8 months
Our method always improved performance, even when compared to the best hyper-parameters for the underlying offline RL method, indicating that re-weighting and standard pessimism likely offer complementary benefits.
1
0
1