S4, Mamba, and Hawk/Griffin are great – but do we really understand how they work? We fully characterize the power of gated (selective) SSMs mathematically using powerful tools from Rough Path Theory. All thanks to our math magician
@MucaCirone
🧵
Inspired by recent breakthroughs in SSMs, we propose a new architecture, Graph Recurrent Encoding by Distance (GRED), for long-range graph representation learning:
with
@orvieto_antonio
,
@bobby_he
and Thomas Hofmann (1/4)
If you are looking for a PhD position at the intersection of Deep Learning and Optimization, it's not too late to apply to my group at
@MPI_IS
and
@ELLISforEurope
Institute Tübingen!
Send a DM if you are interested :)
🚀 Thrilled to announce: I'm now with ELLIS as a PI & MPI for Intelligent Systems as an Independent Group Leader! 🌟 Tübingen is such an amazing place. On a hunt for PhD candidates passionate about deep learning & optimization! Interested? Slide into my DMs! 🔍
@ELLISforEurope
Also at HiLD
#ICML2023
,
@SamuelMLSmith
@sohamde_
, others, and I will present our work showing that
linear RNN + token-wise MLP = universal nonlinear dynamical system approximator
This is so cool! Explains S4, S5 and the LRU.
preprint:
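For intuition, here is a minimal sketch of such a block in JAX – my simplification with made-up parameter names, not the paper's construction; real LRUs use complex diagonal recurrences and careful parameterization/initialization:

```python
import jax
import jax.numpy as jnp

# Toy "linear RNN + token-wise MLP" block. The recurrence itself is linear;
# the only nonlinearity is the position-wise MLP applied to its outputs.
def linear_rnn_mlp_block(params, x):  # x: (T, d_in)
    a, B, W1, W2 = params  # hypothetical names: diagonal transition a, input map B, MLP weights
    def step(h, x_t):
        h = a * h + B @ x_t  # linear recurrence, no nonlinearity inside
        return h, h
    _, h = jax.lax.scan(step, jnp.zeros_like(a), x)  # h: (T, n)
    return jax.nn.gelu(h @ W1) @ W2  # token-wise MLP, applied per position
```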
CLS offers a unique blend of ETH and MPI; I know so many exceptional graduates!
This year, I am an associate faculty member!
Please apply via our online application portal at . The application deadline is midnight (23:59 CET) on November 15, 2023.
I have always been fascinated by muP. While muP theory is clear, the optimization perspective gives super cool and clear insights that hold at finite width.
Was a super fun project.
Let's make optimization great again!
Why can the learning rate in neural networks transfer from small to large models (in both width and depth)? It turns out that sharpness dynamics can explain it. Check out our new work!
w/
@alexmeterez
(co-first),
@orvieto_antonio
and T. Hofmann
It has been an incredible first month here at the ELLIS Institute
#Tübingen. The freedom we have is unmatched, and the environment is incredibly stimulating. Did you know Hölderlin started writing Hyperion in Tübingen? At the age of 22.
Thanks,
@ELLISforEurope
@MPI_IS
for your vision
Our Next Generation Sequence Modeling Architectures workshop proposal was accepted by ICML! We have an incredible lineup of speakers, please come say hi and consider submitting your works! :)
Feeling very fortunate to co-organize this workshop with an incredible group of researchers, Razvan Pascanu,
@orvieto_antonio
, Carmen Amo Alonso, and Maciej Wołczyk!
It's awesome to be back in Paris!!
Thanks
@BachFrancis
for hosting me this week at
@Inria
– such a wonderful place. Filling the building with thoughts on RNNs 🎃
Fun fact: Paris is the only place in the world where I managed to get my oboe repaired in 5 min... and for free 🥖🥖
🚀 Get ready to dive deep into the captivating world of artificial intelligence with us!
The Cyber Valley Podcast coming soon...
🎙️ Don’t miss our unforgettable episodes, created in collaboration with the ELLIS Institute Tübingen
#AIPodcast
#AIResearch
#ELLIS
#AI
@orvieto_antonio
I am looking for a motivated 3-month intern for a project here at MPI for Intelligent Systems & ELLIS Tübingen! If you are free in the period Nov 15 - Feb 15 and know optimization + how to code in torch/jax, please contact me! Internships are on-site only :)
Stop 2: DeepMind! It's nice to see that the atmosphere has not changed: an amazing place filled with inspiring people, like
@sohamde_
@SamuelMLSmith
– thanks for hosting me!
Very last few days to apply for a PhD at
@ELLISforEurope
. This program is getting more and more awesome every year.
If you'd like to work on unveiling some deep learning mysteries, pls apply!
In "SDEs for Minimax Optimization" we investigate the intriguing training dynamics of minimax games. It is a journey through a complex dance of optimizers, where continuous-time tools simplify the math and provide great insights.
#AISTATS
so nice to finally see what
@sohamde_
@SamuelMLSmith
@caglarml
have been up to! This is INCREDIBLE – the nicest possible read on my Cairo-Milan flight this morning :)
The new Griffin paper is really interesting and contains a lot of implementation details. The implementation is in Pallas, a JAX-like frontend for Triton/TPU lowering. They show that an associative scan is inherently worse than a linear scan in this context.…
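To make the comparison concrete, here is a toy JAX sketch (mine – not the Pallas kernel from the paper) of the two ways to evaluate the diagonal recurrence h_t = a_t * h_{t-1} + b_t; they produce identical outputs but have very different hardware profiles:

```python
import jax
import jax.numpy as jnp

def linear_scan(a, b):
    # Sequential scan: O(T) depth, one cheap step per token.
    def step(h, ab):
        a_t, b_t = ab
        h = a_t * h + b_t
        return h, h
    _, hs = jax.lax.scan(step, jnp.zeros_like(b[0]), (a, b))
    return hs

def parallel_scan(a, b):
    # Associative scan: O(log T) depth, but more total work and memory
    # traffic – which, per the paper, is why it loses to the linear scan
    # on TPU in this setting.
    def combine(c1, c2):
        a1, b1 = c1
        a2, b2 = c2
        return a1 * a2, a2 * b1 + b2
    return jax.lax.associative_scan(combine, (a, b))[1]

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.uniform(k1, (8, 4))
b = jax.random.normal(k2, (8, 4))
assert jnp.allclose(linear_scan(a, b), parallel_scan(a, b), atol=1e-5)
```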
See you at the HiLD
#ICML2023
workshop today! We have 3 posters:
- On the Universality of Linear Recurrences Followed by Nonlinear Projections
- A New Adaptive Method for Minimizing Non-negative Losses
- On the Advantage of Lion Compared to signSGD with Momentum
Pls stop by!
🚀 We show that dense generalizations of Mamba/Hawk/Griffin (Linear CDEs) are able to approximate any nonlinear sequence-to-sequence map – no MLP/GLU layer is required. This is our main technical result.
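Schematically (my notation, not the paper's), a linear CDE evolves the hidden state as

$$\mathrm{d}h_t \;=\; A\,h_t\,\mathrm{d}t \;+\; \sum_{i=1}^{d} B_i\,h_t\,\mathrm{d}x^i_t,$$

where the input path $x$ drives the dynamics; "dense" means the $B_i$ are full matrices, while diagonal structure corresponds to the idealized Mamba/Hawk/Griffin-style recurrences.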
Nesterov's acceleration mechanism is believed to be linked to the geometry of symplectic integration. I can name more than 10 papers about it. Our paper (accepted at AISTATS 2021) shows this is not the case: explicit Euler integration also leads to acceleration.
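For reference, the accelerated iterates in question (constant-momentum Nesterov, standard notation):

$$y_k = x_k + \beta\,(x_k - x_{k-1}), \qquad x_{k+1} = y_k - \eta\,\nabla f(y_k).$$

The takeaway: recovering this behavior from a continuous-time model does not require a symplectic discretization – plain explicit Euler suffices.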
Mamba: The Hard Way (v2 ).
A ton of feedback on v1 – learned a lot. This version produces results identical to CUDA, and should be faster and cleaner than v1. (I had to learn about butterfly register shuffles 🦋)
Unfortunately it is still slower than CUDA. There…
💥 We rigorously prove that Mamba collects input statistics more efficiently than S4. Chaining S6 recurrences with linear pointwise maps allows the computation of higher-order global statistics. As such, Mamba and Hawk/Griffin place less of the compute burden on the MLP.
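A toy illustration of the mechanism (mine, not the paper's construction): a first accumulator computes $h^{(1)}_t = h^{(1)}_{t-1} + x_t = \sum_{s \le t} x_s$; feeding the input multiplicatively into a second accumulator,

$$h^{(2)}_t \;=\; h^{(2)}_{t-1} + x_t\,h^{(1)}_{t-1} \;=\; \sum_{s_1 < s_2 \le t} x_{s_1}\,x_{s_2},$$

yields a second-order global statistic of the input – and chaining further layers gives higher orders.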
Hot off the presses: ResNet hyperparameter transfer across depth and width!
Tl;dr transfer for LR+schedules, momentum, L2 reg., etc. for wide ResNets and ViTs, with and without Batch/LayerNorm
w/
@lorenzo_noci
@mufan_li
@BorisHanin
@CPehlevan
Help us build the ELLIS Institute: the new call for Hector Endowed PI positions is at . The positions come with the possibility for co-appointment at Max Planck & Tübingen AI Center
#ELLISforEurope
#Tuebingen
#AI
@MPI_IS
Following our previous work, we are releasing RecurrentGemma – a fully open-source 2B model based on our Griffin architecture!
Code + weights as everyone has wished for!
Code on Github:
Weights on Kaggle:
🎙 The second episode of the
@Cyber_Valley
Podcast with our Principal Investigator
@jonasgeiping
is now available 🚀 Tune in to learn about the Safety and Efficiency of AI.
👉 Check it out:
Boosting generalization of your deep learning model with just 6 lines of code?
"Explicit Regularization in Overparametrized Models via Noise Injection", just accepted at AISTATS2023
w/
@anantraj94
@HansKersting
@BachFrancis
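The gist, as a hedged JAX sketch (names and details are mine, not the paper's exact recipe): evaluate the gradient at Gaussian-perturbed weights, then apply it to the unperturbed weights.

```python
import jax

def noise_injected_grad(loss_fn, params, batch, key, sigma=1e-2):
    # Perturb every parameter tensor with isotropic Gaussian noise (scale sigma)...
    leaves, tree = jax.tree_util.tree_flatten(params)
    keys = jax.random.split(key, len(leaves))
    noisy = jax.tree_util.tree_unflatten(
        tree,
        [p + sigma * jax.random.normal(k, p.shape) for p, k in zip(leaves, keys)],
    )
    # ...and differentiate the loss at the perturbed point; the optimizer then
    # applies this gradient to the *unperturbed* params.
    return jax.grad(loss_fn)(noisy, batch)
```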
At HiLD
#ICML2023
, Lin Xiao (Meta) and I will show an optimizer that is *provably better* than SGD – and works amazingly in Deep Learning and convex optimization. It's also as cheap as SGD, yet almost second order.
Full paper coming soon!
sample:
@giffmana
@HansKersting
@AurelienLucchi
@BachFrancis
Hi! Thanks for the comment. Our objective was not to get state of the art, but to improve over vanilla algorithms – i.e., constant-stepsize SGD and GD – using noise injection.
With schedulers, additional effects might kick in: we want to study just the effects of noise here.
Check out our
#NeurIPS2022
paper:
“Dynamics of SGD with Stochastic Polyak Stepsizes: Truly Adaptive Variants and Convergence to Exact Solution”
Joint Work with
@orvieto_antonio
,
@SimonLacosteJ
Paper:
Code:
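For context, the vanilla stochastic Polyak stepsize (from the earlier SPS paper of Loizou et al.) sets

$$\gamma_t \;=\; \frac{f_{i_t}(x_t) - f_{i_t}^*}{c\,\lVert \nabla f_{i_t}(x_t) \rVert^2},$$

where $f_{i_t}$ is the sampled loss and $f_{i_t}^*$ its minimum; the variants in the paper modify this stepsize so that the iterates converge to the exact solution rather than to a neighborhood of it.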
So apparently, according to Gemini, the best way to learn about the paper DAGs with NO TEARS () is to listen to the song Whigfield - No Tears to Cry. Actually, the song is not too bad.
Just got back from vacation, and super excited to finally release Griffin - a new hybrid LLM mixing RNN layers with Local Attention - scaled up to 14B params!
My co-authors have already posted about our amazing results, so here's a 🧵on how we got there!
Here is a *fantastic* PhD program: the International Max Planck Research School for Intelligent Systems (IMPRS-IS).
I am so lucky to be among the faculty this year :) I have room for one practical-minded PhD student in optimization for deep learning. Interested? Please apply on
Quadratic attention has been indispensable for information-dense modalities such as language... until now.
Announcing Mamba: a new SSM arch. that has linear-time scaling, ultra long context, and most importantly – outperforms Transformers everywhere we've tried.
With
@tri_dao
1/
"Simplifying Transformer Blocks" ranks easily among my favorite research papers that I've read this year.
Here, the authors look into how the standard transformer block, essential to LLMs, can be simplified without compromising convergence properties and downstream task…
Blog post!!
Rumors of the death of RNNs have been greatly exaggerated...
In this post I summarize why and how RNNs are making a comeback in ML, and what this means for theorists of neural computation.
Many thanks to
@NicolasZucchet
for help and corrections!
Now, this is really, really cool stuff. Heavy-ball is beautiful but very nasty to analyze mathematically compared to Nesterov's method. Very nice work.
#Optimization
#MachineLearning
Hey! Yuwen,
@AurelienLucchi
, and I will be hosting an ICML Zoom poster session in 20 minutes for our paper on acceleration for stochastic derivative-free optimization (). Here is the link:
Shadowing Properties of Optimization Algorithms,
#neurips2019
, poster 217 on Thursday evening. We derive a theoretical argument linking ODEs to their corresponding algorithms in optimization.
@PierreMari0n
It's a powerful decision that I fully respect. I was always curious about this, though: is it really necessary to take strict action on flying when it accounts for only ~3% of emissions (some say a bit more, some a bit less)? Flying less is good and definitely helps, but is the stigma worth it?
@giffmana
@HansKersting
@AurelienLucchi
@BachFrancis
Maybe! There are many variations: what if one adds momentum? What if you have an adaptive step? What about warmup? And what if we clip gradients? There is a lot one could explore :) Here we kept it simple.
Want to know more about the acceleration mechanism in convex optimization? Please visit our AISTATS poster in a few hours!
"Revisiting the Role of Euler Numerical Integration on Acceleration and Stability in Convex Optimization"
Meta used my 1991 ideas to train LLaMA 2, but made it insinuate that I “have been involved in harmful activities” and have not made “positive contributions to society, such as pioneers in their field.”
@Meta
& LLaMA promoter
@ylecun
should correct this ASAP. See…
@tetraduzione
Dear Grande Antonio, I actually only read the solutions to differential equations from the olive oil pattern in my fish soup. I can teach you any time.
@SamuelAinsworth
@deepcohen
@BachFrancis
@HansKersting
@AurelienLucchi
Nice discussion! Yes, everything is in expectation. But the analysis would be complex in the smoothing approach: grad(x + noise) = smoothed_grad + noise2, where noise2 crucially depends on the gradient scale. The noise will have a non-stationary distribution – harder to analyze.
@KhanovMax
I sort of agree – at inference time, RNNs and all S4 variants can model very similar functions.
Initialization and optimization are important though! We try to explain the main issues and solutions here 🤠.
We're looking for postdoctoral fellows in AI! We offer: an excellent cohort of young researchers, a dedicated GPU cluster with 300 H100s, a $100K salary (+$10K research funds), and a stunning campus. 1 hour from NYC and Philly. Renewable, i.e., it's possible to stay multiple years. Join us!