Robert Dadashi Profile
Robert Dadashi

@robdadashi

1,580
Followers
389
Following
6
Media
121
Statuses

reinforcement learning research @GoogleDeepMind, built RLHF layer of Bard and Gemma

Paris, France
Joined September 2014
Pinned Tweet
@robdadashi
Robert Dadashi
1 month
I am very happy to announce that Gemma 1.1 Instruct 2B and “7B” are out! Here are a few details about the new models: 1/11
13
71
376
@robdadashi
Robert Dadashi
5 years
Just released a notebook to generate the figures in "The Value Function Polytope in RL": . It's fun to play with!
0
28
101
@robdadashi
Robert Dadashi
4 years
New paper out: PWIL! A simple imitation learning method that reinforces a reward signal based on a distance to expert demonstrations. Makes Humanoid walk with a single demonstration (below). 1/
3
10
83
@robdadashi
Robert Dadashi
2 years
Very proud of our latest work: AQuaDem - Action Quantization from Demonstrations. The idea is simple:
1- Learn a state-conditioned quantization of a continuous action space from human demonstrations
2- Learn a controller in the induced MDP with a discrete action method, e.g. DQN
2
21
83
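To make the two steps above concrete, here is a minimal PyTorch sketch of the idea (illustrative only, not the paper's code): a network proposes K candidate continuous actions per state, trained on demonstrations with a simple min-over-candidates loss (the paper uses a softened variant); a discrete-action agent such as DQN then chooses among the K candidates.

```python
# Illustrative sketch of AQuaDem's two steps (not the authors' implementation).
# Assumptions: a "candidate head" proposing K continuous actions per state,
# trained with a hard min-over-candidates loss; the paper uses a soft variant.
import torch
import torch.nn as nn

class ActionCandidates(nn.Module):
    """Maps a state to K candidate continuous actions."""
    def __init__(self, state_dim: int, action_dim: int, k: int = 10):
        super().__init__()
        self.k, self.action_dim = k, action_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, k * action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).view(-1, self.k, self.action_dim)

def quantization_loss(model: ActionCandidates,
                      states: torch.Tensor,
                      demo_actions: torch.Tensor) -> torch.Tensor:
    # Only the candidate closest to the demonstrated action is penalized,
    # so different candidates can specialize to different behavior modes.
    candidates = model(states)                                          # (B, K, A)
    sq_dists = ((candidates - demo_actions[:, None, :]) ** 2).sum(-1)   # (B, K)
    return sq_dists.min(dim=1).values.mean()

# Step 2 (not shown): a DQN-style agent acts in the induced MDP by choosing
# an index k in {1, ..., K}; the environment receives candidates[:, k].
```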
@robdadashi
Robert Dadashi
4 years
We have just released the code for our latest paper on Imitation Learning: PWIL (). It's simple and concise, and yet performs strongly on MuJoCo environments. The code builds on @deepmind's Acme (which is great!)
2
13
71
@robdadashi
Robert Dadashi
5 years
During my @GoogleAI residency I have been fortunate to work with my research mentor @marcgbellemare and other great collaborators on projects that I am weirdly excited about. This led to 2 papers accepted at ICML: (1/2)
2
4
67
@robdadashi
Robert Dadashi
3 months
I am so proud to see Gemma released today! I have had a fantastic time working on post-training and RLHF with an amazing team. Cannot wait to see what the community builds with these models!
@GoogleDeepMind
Google DeepMind
3 months
Introducing Gemma: a family of lightweight, state-of-the-art open models for developers and researchers to build with AI. 🌐 We’re also releasing tools to support innovation and collaboration - as well as to guide responsible use. Get started now. →
134
542
2K
4
9
58
@robdadashi
Robert Dadashi
1 month
The training data was pretty much the same as v1.0, but we switched the RL algorithm to something new. I hope that we will be able to disclose more about this in the future :). 6/11
1
4
57
@robdadashi
Robert Dadashi
5 years
I will talk about the *mysterious* polytopes in reinforcement learning at #ICML2019, Tuesday June 11th at 5:15pm in room 104, and at 6:30 at poster 119.
1
6
50
@robdadashi
Robert Dadashi
2 years
Very proud to contribute to making RL agents more accessible and reproducible!
@GoogleDeepMind
Google DeepMind
2 years
Acme, a framework for distributed RL research, has been updated to be cleaner, more modular, and to support more agents - including offline & imitation. Try it yourself! GitHub: Quickstart: V2 Paper: 1/
8
106
497
0
8
50
@robdadashi
Robert Dadashi
1 month
Similarly to v1.0, we enforced a verbosity penalty on the models at training time, even though it means worse performance on benchmarks. If you still feel like Gemma models are too chatty, prompting with a target word count can help. 7/11
2
1
36
@robdadashi
Robert Dadashi
5 years
Proud to be part of this project tackling how we should think of statistics propagation & derive new algorithms in the context of distributional reinforcement learning
@marcgbellemare
Marc G. Bellemare
5 years
Mark Rowland's distributional RL paper on samples and statistics (& potential mismatch) is out -- big step towards understanding the method w/ @wwdabney @RobertDadashi S. Kumar R. Munos
0
23
80
0
3
34
@robdadashi
Robert Dadashi
1 month
Finally, make sure to prompt the models correctly, following the Gemma IT chat template: f"<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n". If you notice abnormalities, or if you want specific capabilities improved in future versions, please DM me :) 10/11
2
1
35
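A minimal sketch of building a prompt with these turn markers (the helper name and example prompt are mine; the word-count instruction also illustrates the verbosity tip from 7/11):

```python
# Minimal prompt builder using the Gemma IT turn markers described above.
# The helper name and example prompt are illustrative.
def gemma_it_prompt(user_message: str) -> str:
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

# Adding a target word count, as suggested in 7/11, helps curb verbosity.
print(gemma_it_prompt("Explain reinforcement learning in about 100 words."))
```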
@robdadashi
Robert Dadashi
1 month
This was done with my great collaborators: @piergsessa, @suryabhupa, @leonardhussenot, @johanferret, @olivierbachem, and of course the Gemma team. 11/11
2
1
35
@robdadashi
Robert Dadashi
5 years
On the geometric characterization of discrete Markov decision processes:
On a new framework for distributional RL that distinguishes «samples» from «statistics»:
0
3
33
@robdadashi
Robert Dadashi
1 month
The new models are better across the board (e.g. quality, instruction following, factuality, coding, reasoning) while maintaining the same standards of safety. The gains are larger for “7B” than 2B. 3/11
3
1
31
@robdadashi
Robert Dadashi
1 month
Gemma 1.1 is making a big jump on the Arena leaderboard!
@lmsysorg
lmsys.org
1 month
Exciting news - the latest Arena results are out! @cohere's Command R+ has climbed to the 6th spot, matching GPT-4-0314 level by 13K+ human votes! It's undoubtedly the **best** open model on the leaderboard now🔥 Big congrats to @cohere's incredible work & valuable contribution…
43
314
1K
1
4
32
@robdadashi
Robert Dadashi
1 month
Please be aware that lower-precision versions of the models (anything below bf16) have noticeable drops in quality. 9/11
2
2
30
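For example, with the Hugging Face Transformers API (assuming the google/gemma-1.1-7b-it checkpoint id), the weights can be kept in bfloat16; this is a usage sketch, not an official recipe:

```python
# Load Gemma 1.1 7B IT in bfloat16 to avoid the quality drop of lower precisions.
# Assumes the Hugging Face Transformers library and this checkpoint id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-1.1-7b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```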
@robdadashi
Robert Dadashi
1 month
In the same vein, Gemma models have a tendency to output itemized lists. If you don’t like bullet points, ask the model to “write paragraphs” in your prompt. 8/11
1
1
29
@robdadashi
Robert Dadashi
1 month
We mitigated the overuse of “Sure,” at the start of the model answers. 5/11
1
1
26
@robdadashi
Robert Dadashi
1 month
This update addresses some of the feedback from the community. We will continue to do so in the future :) 2/11
1
0
25
@robdadashi
Robert Dadashi
1 month
We fixed a multi-turn bug (v1.0 models sometimes refuse to answer when the user changes topic in the middle of the conversation). 4/11
1
1
24
@robdadashi
Robert Dadashi
3 years
New paper to appear at #ICML2021: Offline Reinforcement Learning with Pseudometric Learning (PLOff). PLOff first learns a metric, in the spirit of bisimulations, from offline transitions, and uses it to derive a bonus that prevents OOD action extrapolation. 1/n
1
7
18
@robdadashi
Robert Dadashi
4 years
Exhaustive exploration makes sense when we have no prior about an environment (which most novelty-based exploration bonuses assume). In this work, we use demonstrations to derive an exploration bonus, and show that we can extract the priors of the demonstrator.
@leonardhussenot
Léonard H.
4 years
Human behavior is driven by many intrinsic motivations: fun, fear, curiosity, competition, resource constraints... Instead of pushing RL agents to carry out an exhaustive exploration by modeling curiosity, can we implicitly extract all intrinsic motivations from demonstrations?
1
8
25
1
3
18
@robdadashi
Robert Dadashi
1 month
It was great to collaborate with the RecurrentGemma team on post-training (with, again, a colossal effort from @piergsessa)! I am so excited to see the applications that RecurrentGemma opens up :)
@SamuelMLSmith
Samuel L Smith
1 month
Announcing RecurrentGemma!
- A 2B model with open weights based on Griffin
- Replaces transformer with mix of gated linear recurrences and local attention
- Competitive with Gemma-2B on downstream evals
- Higher throughput when sampling long sequences
9
69
281
0
1
13
@robdadashi
Robert Dadashi
3 months
I recommend reading this, thanks @natolambert ! Here are a few additional thoughts:
@natolambert
Nathan Lambert
3 months
A brief summary on what REINFORCE is in terms of RLHF and history of RL. The algorithm known as REINFORCE is really just the vanilla policy gradient approach. The name comes from Williams 1992, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement…
3
17
122
1
1
12
@robdadashi
Robert Dadashi
2 years
We have a project website where we detail the influence of all the hyperparameters considered (for the baselines and the introduced methods) and where we provide videos of *all* resulting agents. Paper: Website: n/n
0
0
6
@robdadashi
Robert Dadashi
5 years
Also, if you are interested in representation learning in RL, I would love to chat with you!
0
0
7
@robdadashi
Robert Dadashi
5 years
Although I received a "Visual Intimidation Award" for most equations (thanks @MILAMontreal ), there will be no equation in my talk.
2
0
7
@robdadashi
Robert Dadashi
5 years
Congrats @aalitaiga !
@marcgbellemare
Marc G. Bellemare
5 years
Congrats to my PhD student @aalitaiga for winning Best Paper Award at the Exploration in RL Workshop at ICML19, "Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment"! Talk today, 11:30, Hall A. #ICML2019 #ERL19 @AaronCourville @cholodovskis @LiamFedus
4
15
124
0
0
6
@robdadashi
Robert Dadashi
5 years
I guess I know my next research project
@karol_kurach
Karol Kurach
5 years
Ever wanted to play FIFA but still be able to call it "doing research work" ? You're welcome.
0
6
41
1
0
6
@robdadashi
Robert Dadashi
2 years
The quantization step can be thought of as Behavioral Cloning with multiple actions as outputs (thus capturing the multimodality of the demonstrator’s behavior), while the second step is meant to select the “right mode” by interacting with the environment. 3/n
1
0
5
@robdadashi
Robert Dadashi
4 years
We recover near-optimal expert behaviour on all tasks considered. Joint work with my great collaborators: @leonardhussenot, Matthieu Geist and Olivier Pietquin! 6/
0
0
5
@robdadashi
Robert Dadashi
2 years
We only considered human demos and designed new algorithms for various setups: RL + demos, imitation, and RL + play data. They outperform SOTA continuous-action methods in sample efficiency and performance. Plus, our algorithms result in agents that behave similarly to the human. 2/n
1
0
4
@robdadashi
Robert Dadashi
5 years
Analytics have changed the game in basketball and baseball. Very excited to see the impact they will have on football.
@Polytechnique
École polytechnique
5 years
. @Polytechnique & @PSG_English launch the “Sports Analytics Challenge”: an exclusive project that invites candidates worldwide to take part in the #datascience & sports performance #challengexpsg ⚽️. Consult the challenge website for more information:
0
4
9
0
0
3
@robdadashi
Robert Dadashi
2 years
The second reason is that the quantization is based on the actions taken by the demonstrator. This makes it possible to capture the prior knowledge of the demonstrator, and limit the possible actions to demonstrator-like actions. Arguably, this also facilitates exploration. 5/n
1
0
4
@robdadashi
Robert Dadashi
3 months
2/n I do believe that on the algorithmic end (which is probably far less crucial than data quality), what really matters is to use an online method. I am pretty sure we will see the emergence of new RL methods for language, but I bet they will all require sampling from the policy.
1
0
4
@robdadashi
Robert Dadashi
4 years
We compare PWIL with DAC, and report results not only in terms of the original return of the task (not available in real settings) but also in terms of the Wasserstein distance between the agent and the expert. 5/
1
0
4
@robdadashi
Robert Dadashi
4 years
@neilzegh For the funniest content on Twitter, follow @laurent_dinh
1
0
4
@robdadashi
Robert Dadashi
1 month
@EugeneVinitsky In some way you can think of a reward model learned from preferences as a discriminator (and so RLHF really means IL). The SFT phase is what a lot of IL methods do in practice: start from the BC policy
1
2
4
@robdadashi
Robert Dadashi
3 months
1/n TRPO vs PPO vs REINFORCE probably also matters less in the RLHF setting because we typically tend to use the BC policy as a regularizer (and start from it rather than tabula rasa).
1
0
3
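One common way to write this (an illustrative textbook-style formulation, not necessarily the exact objective used for Gemma), with r_phi the learned reward model, pi_BC the SFT/BC policy, and beta the regularization strength:

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\big[ r_\phi(x, y) \big]
\;-\;
\beta\,\mathbb{E}_{x \sim \mathcal{D}}
\Big[ \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{BC}}(\cdot \mid x) \big) \Big]
```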
@robdadashi
Robert Dadashi
4 years
Contrary to adversarial IL methods, we bypass the min-max optimization problem and reinforce a non-stationary reward function that is not re-parameterized with interactions with the environment, and that relies on 2 hyperparameters. 4/
1
0
3
@robdadashi
Robert Dadashi
2 years
Why quantize in the first place? We argue that discrete action problems (with a reasonable number of actions) are more natural for VI-inspired methods, since the policy improvement step is immediate. 4/n
1
0
3
@robdadashi
Robert Dadashi
5 years
@RorySmith "There is no plan that contains Messi." Wrong, there is a one-man plan named @nglkante.
0
0
3
@robdadashi
Robert Dadashi
3 years
Check out our new paper: an extensive study of the experimental and design choices of GAIL-like methods!
@leonardhussenot
Léonard H.
3 years
Here is our new large-scale study on Adversarial Imitation Learning! 🤖 How to train your discriminator? How to regularize it? What direct RL agent to choose? How to optimize for training time? How does it behave with human data? Check out the answers 🥳
2
7
35
0
0
3
@robdadashi
Robert Dadashi
4 years
with hopefully a sharper version of our humanoid :)
0
0
3
@robdadashi
Robert Dadashi
3 months
3/n I am not a fan of DPO because it makes multi-objective optimization harder.
1
0
3
@robdadashi
Robert Dadashi
5 years
@jishanshaikh41 I learned from David Silver's online lectures for his RL course at UCL. You can also check out the RL course from @mlittmancs and @isbellHFh on Udacity.
0
0
2
@robdadashi
Robert Dadashi
3 years
The intuition is to move away from the idea of estimating a parametric behavior policy, and replace it with a non-parametric bonus. This encourages the learning agent to remain close to the support of logged transitions, in terms of the learned metric. 2/n
1
0
2
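A rough sketch of such a non-parametric bonus (illustrative, assuming a learned embedding phi over state-action pairs; not the paper's exact formulation):

```python
# Illustrative non-parametric support bonus: higher when the candidate
# state-action pair is close, under a learned embedding phi, to some
# transition in the offline dataset. Not the paper's exact formulation.
import numpy as np

def support_bonus(phi, dataset, state, action, scale: float = 1.0) -> float:
    """phi(s, a) -> 1D embedding vector; dataset is a list of logged (s, a) pairs."""
    z = phi(state, action)
    z_data = np.stack([phi(s, a) for s, a in dataset])
    d_min = float(np.min(np.linalg.norm(z_data - z, axis=1)))
    return float(np.exp(-d_min / scale))  # near the support -> bonus close to 1
```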
@robdadashi
Robert Dadashi
4 years
Conceptually, PWIL defines a suboptimal transport between the agent state-action pairs and the expert state-action pairs. The approach relies on a distance in an MDP; in our case we use expert demonstrations to derive a distance. 3/
1
0
2
@robdadashi
Robert Dadashi
5 years
@le_roux_nicolas @icmlconf Thanks for the mentorship Nicolas :)
0
0
2
@robdadashi
Robert Dadashi
3 years
This is joint work with my great collaborators @shidilrzf, @leonardhussenot, Nino Vieillard, Olivier Pietquin and Matthieu Geist. n/n
0
0
2
@robdadashi
Robert Dadashi
3 months
n/n FWIW, on Gemma we did use a baseline with REINFORCE, but I am not even sure this really matters because of the structure of the reward function in this setting (and, as rightfully pointed out, the batch size).
0
0
2
@robdadashi
Robert Dadashi
4 years
Idea: at the start of the episode all expert state-action pairs are available. As the agent takes action a in state s, look for the closest expert state-action pair (s*, a*), pop it, and define a reward r = exp(-d(s, a, s*, a*)). 2/
1
0
2
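A minimal sketch of that greedy matching reward (illustrative only; the released implementation additionally standardizes distances and weights the matching, and sigma here is a made-up scale):

```python
# Greedy PWIL-style reward sketch: pop the closest remaining expert (s, a)
# pair and reward the agent by how close it got. Illustrative only.
import numpy as np

class GreedyMatchingReward:
    def __init__(self, expert_pairs, sigma: float = 1.0):
        # expert_pairs: iterable of concatenated expert (state, action) vectors.
        self.remaining = [np.asarray(p, dtype=float) for p in expert_pairs]
        self.sigma = sigma

    def __call__(self, state, action) -> float:
        if not self.remaining:
            return 0.0
        sa = np.concatenate([np.asarray(state, float), np.asarray(action, float)])
        dists = [np.linalg.norm(sa - e) for e in self.remaining]
        i = int(np.argmin(dists))
        d = dists[i]
        self.remaining.pop(i)  # each expert pair can be matched only once
        return float(np.exp(-d / self.sigma))
```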
@robdadashi
Robert Dadashi
1 year
@pcastr Seems like we have a barber paradox then :)
0
0
1
@robdadashi
Robert Dadashi
4 years
@pcastr @RealAAAI Congrats Pablo !
0
0
1