Aviral Kumar Profile
Aviral Kumar

@aviral_kumar2

2,461
Followers
340
Following
85
Media
143
Statuses

Research Scientist at Google DeepMind. Incoming Assistant Professor of CS & ML at CMU. PhD from UC Berkeley.

Berkeley
Joined May 2016
@aviral_kumar2
Aviral Kumar
11 months
Thrilled to share that I will be joining Carnegie Mellon @SCSatCMU as an Assistant Professor of CS and ML @CSDatCMU @mldcmu in Fall 2024. Extremely thankful to my mentors & collaborators, especially @svlevine ! Looking forward to working with amazing students & colleagues at CMU!
66
29
679
@aviral_kumar2
Aviral Kumar
7 months
Posting this a bit late, but if you are applying for a PhD in AI and are interested in decision making and reinforcement learning, please consider applying to my upcoming lab at CMU by December 13! Details about my interests and application process can be found on my website.
4
62
311
@aviral_kumar2
Aviral Kumar
2 months
Many LLM fine-tuning methods exist. Unclear which you should use & why? In our new paper, we did an extensive study of on-policy RL, supervised & offline contrastive methods (DPO, IPO) to answer this... 🧵⬇️ On-policy > offline, mode-seeking > mode-covering
Tweet media one
3
68
277
@aviral_kumar2
Aviral Kumar
4 months
Super simple code change to get value-based deep RL to scale *much* better w/ big models across the board on Atari games, robotic manipulation w/ transformers, LLM + text games, & even Chess! Just use classification loss (i.e., cross entropy), not MSE!! 🧵⬇️
Tweet media one
3
43
262
@aviral_kumar2
Aviral Kumar
4 months
How can we train LLM agents to learn from their own experience autonomously? Introducing ArCHer, a simple (i.e., a small change on top of standard RLHF) and effective way of doing so with multi-turn RL 🧵⬇️ Paper: Website:
2
41
193
@aviral_kumar2
Aviral Kumar
2 years
First tweet: Recent work showing how to train big models via offline RL on diverse, multi-game data. 2 billion sub-opt. data + offline RL => generalist policy better than data & good at fine-tuning. w/ @svlevine @agarwl_ @younggeng @georgejtucker
2
14
138
@aviral_kumar2
Aviral Kumar
8 months
A crucial component in modern ML seems to be using the *right*, quality subset of data for learning. What does this mean for offline RL? Given an offline dataset, can we also improve perf. by developing automatic ways to filter data? We answer this in our NeurIPS 2023 paper 🧵
1
13
103
@aviral_kumar2
Aviral Kumar
9 months
Human video (e.g., Ego 4D) pre-training can improve robot control, including for downstream robotic RL. But can we *also* use RL for actually doing video pre-training? Yes! Value-based offline RL can pre-train on video for your robot! Introducing V-PTR 🧵
1
14
99
@aviral_kumar2
Aviral Kumar
8 months
Can we use text-to-image diffusion models to steer robots into doing things, zero-shot? Our method, SuSIE, fine-tunes diffusion models trained for image editing to produce future subgoals from a given scene, which then drive a low-level policy. 🧵⬇️
1
20
97
@aviral_kumar2
Aviral Kumar
1 year
Interested in offline RL that rapidly improves with limited online interaction? Check out Cal-QL: a method for pre-training with offline RL to enable fast fine-tuning, that's just a 1-line code change on conservative Q-learning (CQL)! A thread 🧵...
1
18
94
@aviral_kumar2
Aviral Kumar
3 months
Our new paper on understanding why LLMs make up stuff & hallucinate, and how RL fine-tuning with an appropriate conservative reward model can mitigate these issues. Paper: A thread below 🧵⬇️ (+ check @katie_kang_ 's thread for many more details)
Tweet media one
@katie_kang_
Katie Kang
3 months
We know LLMs hallucinate, but what governs what they dream up? Turns out it's all about the "unfamiliar" examples they see during finetuning. Our new paper shows that manipulating the supervision on these special examples can steer how LLMs hallucinate 🧵
Tweet media one
11
78
368
3
8
63
@aviral_kumar2
Aviral Kumar
7 months
On my way to NOLA for #NeurIPS2023 ! We will present several works on offline RL, fast online fine-tuning, using pre-trained models for improving low-level robot control, RL pre-training on human videos, and querying VLMs for maximal efficacy in RL. Come talk to us! Details ⬇️
1
1
42
@aviral_kumar2
Aviral Kumar
10 months
Check out our work on training large transformer policies on demo and autonomous data (including failures of existing imitation policies) via offline Q-learning. Q-Transformer improves over RT-1 on real robots & provides a recipe for building ever-improving robotic systems! ⬇️
@YevgenChebotar
Yevgen Chebotar
10 months
Offline RL strikes back! In our new Q-Transformer paper, we introduce a scalable framework for offline reinforcement learning using Transformers and autoregressive Q-Learning to learn from mixed-quality datasets! Website and paper: 🧵
8
111
543
0
0
25
@aviral_kumar2
Aviral Kumar
9 months
Great collab led by @ChetBhateja , Derek & @its_dibya . w/ @Anikait_Singh_ , @manan_tomar , @QuanVng , @YevgenChebotar , @svlevine ! I was quite(?) late in posting, but check: , Paper:
@_akhaliq
AK
9 months
Robotic Offline RL from Internet Videos via Value-Function Pre-Training paper page: Pre-training on Internet data has proven to be a key ingredient for broad generalization in many modern ML systems. What would it take to enable such capabilities in
Tweet media one
2
39
183
0
1
12
@aviral_kumar2
Aviral Kumar
4 months
This project was amazing & fun led by @JesseFarebro @agarwl_ , with a number of fantastic collaborators @QuanVng Jordi Orbay Adrien Ali Taiga @YevgenChebotar @pcastr @AleksandraFaust @svlevine @xiao_ted @AlexIrpan .
0
0
11
@aviral_kumar2
Aviral Kumar
4 months
3. Chess without search Achieves AlphaZero level performance on chess, without needing any MCTS -- just distill data into the value function with a cross-entropy loss, building on top of the results in
Tweet media one
2
1
9
@aviral_kumar2
Aviral Kumar
4 months
So why does this work? We study many hypotheses and find that cross-entropy improves value-based RL's ability to deal with non-stationarity, improves representation quality, and makes it robust to noise. These are big problems in RL. Check out Sec. 5 for detailed analysis!
Tweet media one
Tweet media two
Tweet media three
1
0
7
@aviral_kumar2
Aviral Kumar
4 months
Method: Take your favorite value-based RL method (CQL for offline RL, DQN for online RL, etc.), convert the Bellman target into a categorical distribution (more on this next), and replace the MSE loss to the Bellman target with cross-entropy. And that is it!
Tweet media one
1
0
7
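A minimal sketch (in PyTorch, my own illustration rather than the authors' code) of the loss swap described in this tweet: instead of regressing Q onto the scalar Bellman target with MSE, the Q-network outputs per-bin logits and is trained with cross-entropy against a categorical version of the target. The bin discretization helpers (two-hot, HL-Gauss) are sketched after the next tweet.

```python
import torch
import torch.nn.functional as F

def mse_td_loss(q_values, td_targets):
    # Regression view: Q(s, a) is fit to the scalar TD target with squared error.
    return F.mse_loss(q_values, td_targets.detach())

def cross_entropy_td_loss(q_logits, target_probs):
    # Classification view: per-bin logits are fit to a categorical target
    # distribution (e.g., produced by two-hot or HL-Gauss) with cross-entropy.
    log_probs = F.log_softmax(q_logits, dim=-1)
    return -(target_probs.detach() * log_probs).sum(dim=-1).mean()
```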
@aviral_kumar2
Aviral Kumar
4 months
We studied many methods for converting targets into categorical distributions: 1. Two-hot ➡️ put probability mass in two consecutive bins surrounding the scalar target 2. HL-Gauss ➡️ add noise to the target value and then discretize into bins 3. C51 ➡️ cross-entropy + dist. RL
Tweet media one
1
0
5
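A hedged illustration of the first two transforms listed above (a sketch under assumed bin conventions, not the paper's implementation): two-hot splits mass across the two bins that bracket the scalar target, while HL-Gauss smears the target with a Gaussian and integrates it over the bin edges.

```python
import torch

def two_hot(target, v_min, v_max, num_bins):
    # Put probability mass on the two consecutive bins surrounding the scalar target.
    target = target.clamp(v_min, v_max)
    bin_width = (v_max - v_min) / (num_bins - 1)
    pos = (target - v_min) / bin_width                      # fractional bin index
    lower = pos.floor().long().clamp(0, num_bins - 1)
    upper = (lower + 1).clamp(max=num_bins - 1)
    upper_w = (pos - lower.float()).unsqueeze(-1)
    probs = torch.zeros(*target.shape, num_bins, dtype=target.dtype)
    probs.scatter_(-1, lower.unsqueeze(-1), 1.0 - upper_w)
    probs.scatter_add_(-1, upper.unsqueeze(-1), upper_w)
    return probs

def hl_gauss(target, v_min, v_max, num_bins, sigma):
    # Smear the scalar target with a Gaussian, then discretize by integrating
    # the Gaussian CDF over bin edges and renormalizing.
    target = target.clamp(v_min, v_max).unsqueeze(-1)
    edges = torch.linspace(v_min, v_max, num_bins + 1, dtype=target.dtype)
    cdf = torch.distributions.Normal(target, sigma).cdf(edges)
    probs = cdf[..., 1:] - cdf[..., :-1]
    return probs / probs.sum(dim=-1, keepdim=True)
```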
@aviral_kumar2
Aviral Kumar
4 months
Most LLM fine-tuning is done within a single turn. This is limiting: it does not teach the LLM how to seek information, optimize long-term metrics, or reason about its past actions. Result: verbose, non-targeted responses => ❌❌ agent problems ➡️ need multi-turn LLM fine-tuning.
Tweet media one
1
0
7
@aviral_kumar2
Aviral Kumar
1 year
This was an exciting collaboration with @mitsuhiko_nm , @simon_zhai , Anikait Singh, Max Sobol Mark, @YiMaTweets , @chelseabfinn & @svlevine . Definitely check out Sergey's detailed thread: and the website:
@svlevine
Sergey Levine
1 year
Can conservative Q-learning be used to pretrain followed by online finetuning? Turns out that naive offline RL pretraining leads to a "dip" when finetuning online, but we can fix this with a 1-line change! That's the idea in Cal-QL: A thread👇
Tweet media one
4
49
284
0
0
6
@aviral_kumar2
Aviral Kumar
2 months
Overall, this was a fun collaboration & we learned a lot! Lots of experiments, analysis in the paper: (takeaway boxes if you don't have time) w/ @FahimTajwar10 @Anikait_Singh_ @archit_sharma97 @rm_rafailov Jeff @tengyangx @StefanoErmon @chelseabfinn
0
0
6
@aviral_kumar2
Aviral Kumar
4 months
This work was an amazing, truly enjoyable collaboration, led by @YifeiZhou02 , w/ @Zanette_ai , @pan_jiayipan and @svlevine . I learned a lot working with the team!
1
0
6
@aviral_kumar2
Aviral Kumar
2 years
Broadly, I am excited about this as it presents a starting point to scale up offline RL as a pre-training method that could ingest all of the data out there. Lots of algorithmic and technical questions to explore on this front!
1
0
6
@aviral_kumar2
Aviral Kumar
4 months
Our key insight: Take any RL method for single-turn LLM fine-tuning & replace the reward model (RM that works for 1 turn) with a turn-level value model (trained with off-policy RL), accounting for future turns. Use it to provide rewards for the token policy instead of the RM.
1
0
6
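A rough schematic of that insight (hypothetical interfaces, not the released ArCHer code): a turn-level critic is trained off-policy with TD across turns, and its value replaces the single-turn reward model as the score fed to an otherwise standard token-level policy-gradient update.

```python
def archer_style_update(policy, critic, buffer, token_pg_update, gamma=0.99):
    # 1) Turn-level critic: off-policy TD over (state, utterance, reward, next_state)
    #    tuples, so its value accounts for future turns of the interaction.
    for state, utterance, reward, next_state, done in buffer.sample_turns():
        td_target = reward + (0.0 if done else gamma * critic.value(next_state))
        critic.update(state, utterance, td_target)

    # 2) Token-level policy: any single-turn RLHF-style update, except the score
    #    for a sampled utterance comes from the critic rather than a reward model.
    for state in buffer.sample_states():
        utterance = policy.generate(state)
        turn_score = critic.q(state, utterance)   # replaces RM(state, utterance)
        token_pg_update(policy, state, utterance, turn_score)
```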
@aviral_kumar2
Aviral Kumar
2 months
1. On-policy sampling improves perf. and efficiency, especially when the peak of the reward lies farther from the init / ref policy, even when the reward model is learned from the same pref dataset that methods without on-policy sampling also use, i.e., model-based > model-free
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
2 years
@svlevine @agarwl_ @younggeng @georgejtucker See Sergey's thread below: a combination of existing offline RL ideas (CQL, DR3) and C51 can make offline RL work with large models, retaining benefits of "stitching" and policy improvement.
@svlevine
Sergey Levine
2 years
A big goal for Atari is training one policy on many games. In new work, we show that offline RL (CQL) can do this well w/ big models. On suboptimal data it beats SOTA by 2.5x, finetunes to new games, brings us closer to dream of offline pre-training: ๐Ÿงต>
Tweet media one
Tweet media two
Tweet media three
3
38
172
1
0
5
@aviral_kumar2
Aviral Kumar
1 year
Check out Joey's talk at #ICLR2023 at 4pm local time (poster at 4:30 pm local time) on how we can train offline value functions for multiple levels of conservatism and then adjust the level with online data to attain improved performance.
@svlevine
Sergey Levine
2 years
Offline RL algorithms require choosing a constraint or a level of pessimism/conservatism. But what if we train a value function to support *any* level of conservatism? We study this in our new paper on confidence-conditioned offline RL: Short ๐Ÿงต:
4
14
98
0
0
4
@aviral_kumar2
Aviral Kumar
4 months
2. Generalist robotic manipulation 67% improvement and much better learning speed on top of offline RL Q-Transformer for robotic manipulation with human teleop demos + autonomous failures data
Tweet media one
@YevgenChebotar
Yevgen Chebotar
10 months
Offline RL strikes back! In our new Q-Transformer paper, we introduce a scalable framework for offline reinforcement learning using Transformers and autoregressive Q-Learning to learn from mixed-quality datasets! Website and paper: 🧵
8
111
543
1
0
4
@aviral_kumar2
Aviral Kumar
2 months
We grouped methods along two axes (Sec 3.2): (1) running on-policy rollouts against a reward model learned from pref data (like "offline model-based RL" in RL) [w/ or w/o sample reuse] (2) using a negative gradient: not just maximizing likelihood but also pushing it down (DPO, IPO)
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
2 months
3. We find that on-policy sampling + negative gradient are complementary, since on-policy DPO > on-policy PPO in our experiments (Section 5.3). DPO / IPO gradients provide a stronger learning signal than PPO... in some ways, the negative gradient helps kill variance.
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
2 months
You may want to compensate for less on-policy sampling with sample reuse (i.e., make more updates on stale data). This can help a little, but unless curated well it does hurt... (T=2 does a bit better than T=1 quickly, but then it hurts)
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
4 months
4. LLM text games Also does really well on text games / LLM agent tasks, like playing Wordle with offline RL, building on top of CQL (43% improvement)
Tweet media one
1
1
4
@aviral_kumar2
Aviral Kumar
4 months
In 2022, we did some of the first work to scale up offline RL (CQL) to big models, with multi-game Atari data. We found C51 (dist. RL) to be critical, but didn't know why.... Turns out the cross-entropy in C51 was the key: it enables RL to scale well!!
1
0
4
@aviral_kumar2
Aviral Kumar
9 months
Our recipe trains on videos with RL and then continues to run RL on the robot. Concretely: first run value-based offline RL on videos, then run offline RL on robot data (you could use RT-X data now too!) to get a general policy, then fine-tune to your task with just a few demos.
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
2 months
To sum up the big picture: 1. need to explore regions covered less by the ref policy => use on-policy sampling ✅ 2. need a strong learning signal => use negative gradients on suboptimal / negative data ✅ hence our paper title...
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
4 months
We got many great results with this method: 1. Multi-game Atari with offline CQL (C51 is our prior result): 82% improvement over it and larger gains over multi-game DT (which does not use any RL)
Tweet media one
@aviral_kumar2
Aviral Kumar
2 years
First tweet: Recent work showing how to train big models via offline RL on diverse, multi-game data. 2 billion sub-opt. data + offline RL => generalist policy better than data & good at fine-tuning. w/ @svlevine @agarwl_ @younggeng @georgejtucker
2
14
138
1
0
3
@aviral_kumar2
Aviral Kumar
1 year
We tried to understand the key ingredient that allows us to use as many gradient steps as possible to enable fast online RL. Check out our #ICLR2023 paper by @qiyang_li at 11:30 am local time on Tuesday. ICLR link:
@qiyang_li
Qiyang Li
1 year
What is the key ingredient that enables sample-efficient online RL with TD objectives? tl;dr – We find that techniques to enable sample-efficient online RL are also effective at controlling a notion of validation error. A thread 🧵: 1/N
3
18
99
0
1
3
@aviral_kumar2
Aviral Kumar
2 months
But, upon looking at the mechanisms behind them (Sec 5.2.2), these methods often extrapolate, and the likelihood of y+ in the data decreases (not everywhere; for UltraFB it does not decrease). Of course, extrapolation can be good or bad, so we can't always say offline DPO is better!
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
8 months
Overall, SuSIE is a simple recipe to use web-scale pre-training for boosting semantic generalization *and* policy precision in robot control! Code here: Awesome work led by @kvablack & @mitsuhiko_nm , w/ Pranav, @HomerWalke , @chelseabfinn , @svlevine
1
0
4
@aviral_kumar2
Aviral Kumar
8 months
**Insight:** instead of constraining the policy to the data distribution given to you, we should constrain the policy against a better, *reweighted* version of the data distribution, allowing for behavior better than the offline data while avoiding OOD actions.
Tweet media one
1
0
4
@aviral_kumar2
Aviral Kumar
8 months
This re-weighting of the data enables all methods to work much better across the board! We test on D4RL tasks, and @ZhangWeiHong9 did a very extensive stress test for all methods across many data compositions to verify that the trend holds.
Tweet media one
1
0
3
@aviral_kumar2
Aviral Kumar
4 months
@JesseFarebro @BlackHC Yes, I should have been more clear: as Jesse said, it is categorical cross-entropy and turning it into a classification problem that matters.
0
0
2
@aviral_kumar2
Aviral Kumar
4 months
Overall, this change to cross-entropy is super simple, addresses issues that we face in value-based offline & online RL, and works reliably in the "scaling" regime (becoming more important with big models such as transformers). Try this out on your problem & let us know!
1
0
3
@aviral_kumar2
Aviral Kumar
4 months
On WebShop specifically, ArCHer with a GPT-2 base model can improve over the perf. of the (much more capable) GPT-3.5 with ReAct and an expert prompt => ArCHer is very good at learning from rewards, autonomously!!
Tweet media one
1
0
3
@aviral_kumar2
Aviral Kumar
9 months
Video pre-training (done on Ego 4D) allows us to learn about intentions and associated outcomes (check ICVF: ), and then robotic offline RL (check PTR: ) brings in understanding of robot actions, dynamics, etc.
1
0
3
@aviral_kumar2
Aviral Kumar
8 months
SuSIE is really simple: 1. Fine-tune an image editing model on robot data to produce future sub-goals for a language command. 2. Take any goal-reaching policy, good at reaching short-term goals 3. At test time, command the policy with subgoals from your model and iterate!
Tweet media one
1
0
3
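A minimal sketch of step 3, the test-time loop (hypothetical interfaces, not the released SuSIE code): every k steps the fine-tuned image-editing model "edits" the current frame into a subgoal image for the language command, and a goal-conditioned low-level policy chases that subgoal.

```python
def susie_rollout(env, subgoal_model, goal_policy, command, horizon=200, k=20):
    obs = env.reset()
    subgoal = None
    for t in range(horizon):
        if t % k == 0:
            # Propose a plausible near-future frame for the command from the current image.
            subgoal = subgoal_model.edit(image=obs, prompt=command)
        action = goal_policy(observation=obs, goal=subgoal)
        obs, reward, done, info = env.step(action)
        if done:
            break
    return obs
```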
@aviral_kumar2
Aviral Kumar
8 months
This was a very fun collaboration led by @ZhangWeiHong9 , together with @abhishekunique7 @pulkitology and others! Paper: Some of our past work on data sharing: CDS: UDS:
1
0
3
@aviral_kumar2
Aviral Kumar
2 months
2. Negative gradients accelerate convergence of offline methods: in our bandit, we tried to use an explicit "likelihood minimizer" term (kind of like unlikelihood) on top of distilled Best-of-N and found it to be better. IPO was the best here. Similar trends in other setups..
Tweet media one
Tweet media two
1
0
3
@aviral_kumar2
Aviral Kumar
7 months
Finally, I (on behalf of William) will also present some ongoing work on **promptable representations** -- a framework for steering off-the-shelf VLMs into producing features that are particularly useful for downstream control & policy learning. More on this near the workshops!
1
0
2
@aviral_kumar2
Aviral Kumar
2 months
We also provide a theoretical result (Lemma 6.2) for this, trying to understand where the probability mass recovered by pushing down on negatives goes and when it can move to preferred-response regions.
Tweet media one
1
1
3
@aviral_kumar2
Aviral Kumar
2 months
Mode-seeking KL => faster re-organization of probability mass as long as the KL loss is not 0 (which we rarely get to on the training set). This is cool theoretically, since categorical distributions do not present misspecification like the classic 2-modes-vs-unimodal-Gaussian example
Tweet media one
Tweet media two
1
1
3
@aviral_kumar2
Aviral Kumar
2 months
We also theoretically unify these concepts of on-policy sampling + neg grad under mode-seeking losses (e.g., reverse KL) vs mode-covering losses (e.g., forward / supervised learning KL). 1. DPO, RL, on-policy ReST: mode-seeking 2. offline RWR, BoN: mode-covering (see Sec 6.1)
1
0
3
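For readers who want the two loss families written out, here is a compact paraphrase of the standard definitions being referenced (my own notation; see Sec 6.1 of the paper for the exact setup):

```latex
% Mode-covering (forward KL): maximum-likelihood / SFT-style objectives, which
% spread mass over everything the data distribution covers.
% Mode-seeking (reverse KL): on-policy RL with a KL penalty (and, per the paper,
% DPO-style updates), which concentrates mass on high-reward modes.
\begin{align}
  \mathcal{L}_{\text{cover}}(\theta)
    &= D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\|\, \pi_\theta\right)
     = \mathbb{E}_{y \sim p_{\text{data}}}\big[\log p_{\text{data}}(y) - \log \pi_\theta(y)\big],\\
  \mathcal{L}_{\text{seek}}(\theta)
    &= D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, p^{*}\right),
     \qquad p^{*}(y) \propto \pi_{\text{ref}}(y)\,\exp\!\big(r(y)/\beta\big).
\end{align}
```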
@aviral_kumar2
Aviral Kumar
8 months
The most interesting aspect to me is how SuSIE allows Internet vision-language knowledge to be used for enhancing precision of the low-level policy execution. In fact, we found SuSIE to be better than RT-X and even oracle goal-reaching due to this! Why? โฌ‡๏ธ
1
0
3
@aviral_kumar2
Aviral Kumar
2 months
In general, we found that on-policy sampling tends to be irrelevant only when the reward peak is already close to the highly likely regions of the starting / ref policy... only in one of the three problems below ("Mode length") is on-policy sampling not needed...
Tweet media one
1
0
3
@aviral_kumar2
Aviral Kumar
4 months
More detailed diagram for those interested in learning the details of the practical algo / trying it out (we would love to hear feedback if you try this out; github code link in the last tweet on this thread)
Tweet media one
1
0
3
@aviral_kumar2
Aviral Kumar
8 months
Intuition: SuSIE's diffusion model produces careful sub-goals, such that even just trying to match the arm position helps prevent imprecise / premature grasps, or a lack of awareness of object poses. This mechanism is absent in many vision-language pre-training methods in robotics
1
0
3
@aviral_kumar2
Aviral Kumar
2 months
We make some interesting takeaways... (I'll list some here, but there are many more interesting experiments in the paper + the mini summary on the website). Paper: The paper is long, but we tried to make it accessible with takeaway boxes! (e.g. below ⬇️)
Tweet media one
1
1
3
@aviral_kumar2
Aviral Kumar
8 months
This idea works very well! Many videos on the website, but quantitatively this does do better than other ways of using diffusion models and even RT-2-X, which is much bigger, trained on more data (including our robot data).
Tweet media one
1
0
3
@aviral_kumar2
Aviral Kumar
4 months
Check the paper, website for results, details (and also a theoretical justification for why this works!!) Code: Website: We are **very** excited to try ArCHer at large scales! Reach out if you have interesting agent problems.
1
0
3
@aviral_kumar2
Aviral Kumar
4 months
โžก๏ธThis "hierarchical" view can lead to several other multi-turn RL approaches for LLMs. In the paper, we study several alternate design choices in the paper (different value learning objectives, policy gradient objectives, etc). But, many designs yet remain to be explored!
1
0
3
@aviral_kumar2
Aviral Kumar
8 months
We instantiate this insight by learning importance weights that satisfy certain "Bellman-like" consistency conditions (details in the paper: ) and integrate it into off-the-shelf offline RL methods (CQL, IQL, TD3+BC).
1
0
3
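A hedged illustration of the integration step only (not the weight-learning procedure itself, which is the paper's contribution): once per-sample weights are available, they simply rescale each transition's contribution to the underlying method's critic loss.

```python
import torch

def weighted_td_loss(q_pred, td_target, weights):
    # q_pred, td_target, weights: tensors of shape (batch,). `weights` would come
    # from the paper's Bellman-consistent weight learner; here they are an input.
    per_sample = (q_pred - td_target.detach()) ** 2
    w = weights.detach()
    return (w * per_sample).sum() / w.sum().clamp(min=1e-8)
```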
@aviral_kumar2
Aviral Kumar
2 months
We designed various problems that admit different geometric & coverage relations between reward, policy coverage, and pref data coverage (like in the first figure)... A mix of bandit problems + synthetic LLM fine-tuning problems (e.g., below) + then transferred to full-scale LLMs
Tweet media one
Tweet media two
1
0
2
@aviral_kumar2
Aviral Kumar
4 months
How well does this approach work? On several text game environments and the WebShop benchmark, we find ArCHer to be 100x 🚀 more sample-efficient than PPO and much better in perf. & efficiency than other methods, when learning from its self-collected data, autonomously.
Tweet media one
1
0
2
@aviral_kumar2
Aviral Kumar
8 months
And yes, of course, you can use human video data along with robot data to fine-tune the diffusion model. We find video data (Something-Something dataset in our experiments) to boost zero-shot generalization further.
Tweet media one
1
0
2
@aviral_kumar2
Aviral Kumar
4 months
This idea is generalizable: we can view this approach as running two independent RL methods: one at the utterance level (i.e., off-policy TD) & the other at the token level (i.e., policy gradient). The utterance-level RL method sets learning targets for the token-level method.
1
0
2
@aviral_kumar2
Aviral Kumar
8 months
The most interesting aspect to me was that our re-weighting was important for generalization (1) when the initial states are different at test time (basically any real-world problem), and (2) when the dataset is small (& generalization is needed). Initial states result ⬇️
Tweet media one
1
0
2
@aviral_kumar2
Aviral Kumar
9 months
Final downstream tasks on your robot could be specified in language or in any other way. We see that the resulting policy exhibits an elevated level of generalization across objects / positions / scenes, and robustness to distractors. Some examples here on Twitter, more on the website
1
0
2
@aviral_kumar2
Aviral Kumar
1 year
Why? We chose CQL and tried to understand why performance dips when fine-tuning. Turns out CQL needs samples to "correct" its values before online learning progresses normally. Conservative values are great for offline RL, but do not match the scale of returns on online data.
1
0
2
@aviral_kumar2
Aviral Kumar
9 months
Lots more analyses in the paper -- we probe value functions learned from video and why pre-training with RL on video helps downstream RL. Overall, I am excited about understanding why / when / how / what can be pre-trained from video for robots, and V-PTR is a step towards it!
Tweet media one
1
0
2
@aviral_kumar2
Aviral Kumar
4 months
5. Atari results: single-game, MoE And finally, also does well on single-game offline & online Atari, with no apparent performance degradation with higher capacity or more updates unlike standard offline RL. Better results with MoE models building on:
Tweet media one
Tweet media two
@_akhaliq
AK
4 months
Google announces In deep reinforcement learning, a pruned network is a good network Recent work has shown that deep reinforcement learning agents have difficulty in effectively using their network parameters. We leverage prior insights into the advantages of sparse training
Tweet media one
4
84
464
2
0
2
@aviral_kumar2
Aviral Kumar
7 months
On Wednesday at 10:45 am (poster 1312), @mitsuhiko_nm @simon_zhai will present Cal-QL: , a principled and effective algorithm for online fine-tuning of offline RL policies (& works in the real world too).
Tweet media one
@aviral_kumar2
Aviral Kumar
1 year
Interested in offline RL that rapidly improves with limited online interaction? Check out Cal-QL: a method for pre-training with offline RL to enable fast fine-tuning, that's just a 1-line code change on conservative Q-learning (CQL)! A thread 🧵...
1
18
94
1
0
2
@aviral_kumar2
Aviral Kumar
1 year
Theoretically, this calibration in Cal-QL can be viewed as a way to better control the terms that show up in a decomposition of cumulative regret. In my opinion, this is interesting as it provides us with a mental model to think about offline RL + online fine-tuning more broadly.
Tweet media one
1
0
2
@aviral_kumar2
Aviral Kumar
1 year
Simple fix: just prevent offline Q-values from becoming too small. We call this "calibration" (i.e., adjusting the scale of the Q-function). This can be done by ensuring Q-values are larger than the Q-values of a reference policy, e.g., the behavior policy. This is calibrated Q-learning.
Tweet media one
1
0
2
@aviral_kumar2
Aviral Kumar
1 year
Prior methods for offline RL + online fine-tuning suffer from issues: for some, performance dips and samples are wasted on recovery; for many others, online learning improves slowly. In fact, recent work (RLPD) shows that avoiding pre-training altogether can perform better.
Tweet media one
1
0
1
@aviral_kumar2
Aviral Kumar
6 months
Here's the poster for this paper (come over to the FMDM workshop tomorrow to discuss more!)
Tweet media one
0
0
0
@aviral_kumar2
Aviral Kumar
7 months
At the FMDM & GCRL workshops (on Fri) and Robot Learning & GenPlan workshops (on Sat), @mitsuhiko_nm will present SuSIE, an approach for using pre-trained image-editing models to improve low-level control.
Tweet media one
@aviral_kumar2
Aviral Kumar
8 months
Can we use text-to-image diffusion models to steer robots into doing things, zero-shot? Our method, SuSIE, fine-tunes diffusion models trained for image editing to produce future subgoals from a given scene, which then drive a low-level policy. 🧵⬇️
1
20
97
1
0
1
@aviral_kumar2
Aviral Kumar
4 months
Scaling: we could only test a few domains with larger base models, but found ArCHer to also improve significantly with a Mistral 7B base model. I think **if your single-turn RLHF scales, ArCHer will inherit those scaling benefits** + it brings sample efficiency to the table.
Tweet media one
1
0
1
@aviral_kumar2
Aviral Kumar
3 months
While we focus on factuality fine-tuning in this paper, the understanding of how fine-tuning data affects behavior + the RLFT recipe has lots of potential to impact other LLM fine-tuning problems where we see similar issues -- hallucinations, incorrect reasoning traces, etc.
1
0
1
@aviral_kumar2
Aviral Kumar
3 months
Amazing collaboration led by @katie_kang_ , with @Eric_Wallace_ Claire Tomlin, @svlevine !
0
0
1
@aviral_kumar2
Aviral Kumar
1 year
Empirically, Cal-QL outperforms prior online fine-tuning methods. More importantly, it improves over online methods that do not use pre-training. A great point: Cal-QL benefits from advances that make standard online RL fast => lots of opportunities to make Cal-QL even faster!
Tweet media one
1
0
1
@aviral_kumar2
Aviral Kumar
8 months
Our method always improved performance, even when compared to the best hyper-parameters for the underlying offline RL method, indicating that re-weighting and standard pessimism likely offer complementary benefits.
1
0
1