Super excited to join Harvard with a stellar group of new hires and looking forward to many new collabs with the terrific faculty there! Def sad to be leaving my wonderful UW and MSR colleagues and friends; rest assured, I'll keep up the collabs!
Thank you so much to the awards committee! Also a huge thanks to the past and current ICML chairs and organizers for all their great work for our community! 👍 It is an honor to receive this 😀😀 with such wonderful co-authors: @arkrause, Matthias, and Niranjan!
We are very pleased to announce that the #icml2020 Test of Time Award goes to "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design" by Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger.
Grateful to Priscilla Chan/Mark Zuckerberg (@ChanZuckerberg Initiative) for a generous gift to the Kempner Natural & Artificial Intelligence Institute @Harvard. Excited to work w/ @blsabatini + new colleagues to provide new educational and research opportunities.
Couldn't reconcile theory/practice with dropout for over a year. New: w/ @tengyuma & C. Wei. Turns out dropout sometimes has an implicit regularization effect! Pretty wild, just like small vs. large batch SGD. These plots def surprised us!
What actually constitutes a good representation for reinforcement learning? Lots of sufficient conditions. But what's necessary? New paper: . Surprisingly, good value (or policy) based representations just don't cut it! w/ @SimonShaoleiDu @RuosongW @lyang36
Wrapped up at the "Workshop on Theory of Deep Learning: Where next?" at IAS. The field has moved so much! e.g., the Neural Tangent Kernel (NTK) results! A few years ago, understanding DL looked hopeless. Terrific set of talks, too!
1/3 Two shots at few-shot learning: We have T tasks and N1 samples per task. How effective is pooling these samples for few-shot learning? New work: . Case 1: there is a common low-dim representation. Case 2: there is a common high-dim representation.
1/ David Blackwell. Leagues ahead of his time: "What is a good prediction strategy, and how well can you do?" While some things do not seem possible ("Looking for a p [probability] that does well against every x [an outcome] seems hopeless"), Blackwell does give us a strategy:
Also, we recently posted this work on the theory of policy gradients for reinforcement learning! A long time in the works, this paper finally gets a handle on function approximation with policy gradient methods.
1/3 Should only Bayesians be Bayesian? No. Being Bayes is super robust! An oldie but goodie from Vovk: "Competitive Online Statistics" (2001). Beautiful work showing Bayes is awesome, even if you are not a Bayesian. (post motivated by nice thoughts from @RogerGrosse @roydanroy).
1/2. No double dipping. My current worldview: the 'double dip' is not a practical concern, because tuning various hyperparams (early stopping, L2 reg, model size, etc.) on a holdout set alleviates the 'dip'. This work lends evidence to this viewpoint!
Optimal Regularization can Mitigate Double Descent
Joint work with Prayaag Venkat, @ShamKakade6, @tengyuma.
We prove in certain ridge regression settings that *optimal* L2 regularization can eliminate double descent: more data never hurts (1/n)
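Not the paper's setting, just a toy illustration of the claim (a hypothetical isotropic setup of my own): the unregularized/min-norm solution spikes near the interpolation threshold n = d, while an oracle-tuned L2 penalty keeps the excess risk decreasing in n.

```python
# Toy sketch (my own hypothetical setup, not the paper's): excess risk of
# ridge regression vs. sample size n, unregularized vs. tuned L2 penalty.
# The min-norm solution spikes near the interpolation threshold n = d
# (double descent); an oracle-tuned lambda keeps the risk decreasing in n.
import numpy as np

rng = np.random.default_rng(0)
d, noise, trials = 50, 0.5, 50
w_star = rng.normal(size=d) / np.sqrt(d)

def excess_risk(n, lam):
    risks = []
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        y = X @ w_star + noise * rng.normal(size=n)
        if lam == 0.0:  # min-norm least-squares solution
            w = np.linalg.lstsq(X, y, rcond=None)[0]
        else:           # ridge: w = (X'X + lam*I)^{-1} X'y
            w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        # for isotropic x, excess risk E[(x'(w - w*))^2] = ||w - w*||^2
        risks.append(np.sum((w - w_star) ** 2))
    return np.mean(risks)

for n in [10, 25, 45, 50, 55, 75, 150]:
    tuned = min(excess_risk(n, lam) for lam in [0.01, 0.1, 1.0, 10.0])
    print(f"n={n:4d}   lam=0 risk: {excess_risk(n, 0.0):9.3f}   tuned risk: {tuned:6.3f}")
```

The point is only qualitative: with the right lambda, the spike at n = d disappears and more data never hurts.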
1/ Playing the long game: Is long-horizon RL harder than short-horizon RL? Clearly, H-length episodes scale linearly with H in sample count, but counting learning complexity by # of episodes rather than # of samples accounts for this. So is it any harder?
Beautiful post by @BachFrancis on Chebyshev polynomials: . Handy for algorithm design. Let's not forget the wise words of Rocco Servedio, as quoted by @mrtz: "There's only one bullet in the gun. It's called the Chebyshev polynomial."
Very cool! Years ago (a little post-AlexNet) I put in (too?) many cycles trying to design such a kernel. Didn't match their performance (though I didn't have much compute back then). Pretty slick that they use a derived kernel from a ConvNet!
We have released code for computing the Convolutional Neural Tangent Kernel (CNTK) used in our paper "On Exact Computation with an Infinitely Wide Neural Net", which will appear at NeurIPS 2019.
Paper:
Code:
Predicting What You Already Know Helps: Provable Self-Supervised Learning
We analyze how predicting parts of the input from other parts (missing patch, missing word, etc.) helps to learn a representation that linearly separates the downstream task.
1/2
2/ In a seminal paper, "Minimax vs Bayes prediction" ('56) , Blackwell shows we can predict well on any sequence, using randomization: "we can improve matters by allowing randomized predictions." These ideas permeate so much of learning theory today.
Wow! This is amazing. A few years ago RL was starting to be applied in the robotics domain, with many doubters. Fast forward a handful of years and this! 👏
We've trained an AI system to solve the Rubik's Cube with a human-like robot hand.
This is an unprecedented level of dexterity for a robot, and is hard even for humans to do.
The system trains in an imperfect simulation and quickly adapts to reality:
Bellairs. Theory of DL. Day 4, penultimate session from the unstoppable Boaz Barak: average-case complexity, computational limits, and relevance to DL. Front row: Yoshua Bengio, Jean Ponce, @ylecun
Dean Foster, @dhruvmadeka, and I have been excited about the application of AI to education! We collected our thoughts here - and we're curious what people think:
Solid move from DeepMind. Knowing Emo Todorov and his work, I'm frankly surprised this one-man show is only being purchased now. Hope Emo still stays in the driver's seat for MuJoCo going forward!
We’ve acquired the MuJoCo physics simulator () and are making it free for all, to support research everywhere. MuJoCo is a fast, powerful, easy-to-use, and soon-to-be open-source simulation tool, designed for robotics research:
Amazing and congrats! I have def been wondering if the inductive biases in DeepNets and in ML methods are well suited for certain scientific domains. This settles that for structure prediction! Hoping this can eventually help with drug discovery.
In a major scientific breakthrough, the latest version of #AlphaFold has been recognised as a solution to one of biology's grand challenges - the “protein folding problem”. It was validated today at #CASP14, the biennial Critical Assessment of protein Structure Prediction (1/3)
I found this to be very informative for LLM training. The science was just super well done. Highly recommended for anyone training transformer-based LLMs.
Bellairs Research Institute ☀️🏖️. Theory of DL. Day 3: new insights from @roydanroy @KDziugaite on PAC-Bayes for DL. Possibly gives a new lens into implicit reg 🤔 @david_rolnick: cool results on expressivity of deep nets. And T. Lillicrap keeps us real on theory vs. practice!
Great to see that Mike Jordan is thinking about Decision Theory, ML, and Econ! Super important area: lots of stats/algorithmic questions that have immediate impact on practice. There are few other areas where one can say the same!
At the Bellairs Research Institute: Theory of Deep Learning workshop. Day 1: great presentations on implicit regularization from @prfsanjeevarora @suriyagnskr. Day 2: lucid explanations of NTKs from @Hoooway @jasondeanlee. Good friends, sun ☀️, and sand 🏖️ a bonus.
John was a reason I moved to AI and neuroscience from physics. In his first class, he compared the human pattern-matching algo for chess playing to Deep Blue's brute-force lookahead. I wondered if Go would be mastered in my lifetime! Wonderful to hear from John Hopfield again!
Here's my conversation with John Hopfield. Hopfield networks were one of the early ideas that catalyzed the development of deep learning. His truly original work has explored the messy world of biology through the piercing eyes of a physicist.
Revised thoughts on Neural Tangent Kernels (after understanding the regime better; h/t @SimonShaoleiDu): def a super cool idea for designing a kernel! But it does not look to help our understanding of how representations arise in deep learning. Much more is needed here!
1/5 In a new paper with @vyasnikhil96 and @ShamKakade6, we give a way to certify that a generative model does not infringe on the copyright of data that was in its training set. See for blog, but the TL;DR is...
Nice talk from Rong Ge! Learning two-layer neural nets with _finite_ width: a seriously awesome algebraic idea. Reminiscent of FOOBI (the coolest spectral algo in town!): they replace the 'rank-1 detector' in FOOBI with a 'one-neuron detector'.
Bellairs Research Institute ☀️⛱️. Theory of DL workshop, Day 2 (eve): Thanks to Yann LeCun and Yoshua Bengio for thought-provoking talks. @ylecun's title: "Questions from the 80s and 90s". Good questions indeed!!
Bellairs. Day 5. @HazanPrinceton and myself: double feature on controls+RL. + Spotlights: @maithra_raghu: meta-learning as rapid feature learning. Raman Arora: dropout, capacity control, and matrix sensing. @HanieSedghi: module criticality and generalization! And that is a wrap! 🙂
Thrilled to announce the first annual Reinforcement Learning Conference @RL_Conference, which will be held at UMass Amherst August 9-12! RLC is the first strongly peer-reviewed RL venue with proceedings, and our call for papers is now available: .
🧵 What’s the simplest failure mode of Transformers? Our #NeurIPS2023 spotlight paper identifies the “attention glitches” phenomenon, where Transformers intermittently fail to capture robust reasoning, due to undesirable architectural inductive biases. Poster: Wed 5-7pm CST, #528
Noga Alon @princeton and Joel Spencer @nyuniversity receive the 2021 Steele Prize for Mathematical Exposition for The Probabilistic Method @WileyGlobal. Now in its 4th ed., the text is invaluable for both the beginner and the experienced researcher. More...
This should be good! @SurbhiGoel_ has done some exciting work in understanding neural nets, going beyond the "linear" NTK barrier. To make progress in deep learning theory, we def need to understand these beasts in the non-linear regime.
Looking forward to this Friday at 1pm, when we'll hear from @SurbhiGoel_ about the computational complexity of learning neural networks over Gaussian marginals. We'll see some average-case hardness results as well as a poly-time algorithm for approximately learning ReLUs.
📢 Announcing seven new National Artificial Intelligence Research Institutes!
Discover the themes and the institutions that are helping advance foundational AI research to address national economic and societal priorities in the 🧵 ⬇️:
A huge congrats to MPI for hiring the terrific @mrtz as a director! Personally sad to have him across the pond, but excited to see what Moritz helps to build.
1/3 Can open democracies fight pandemics? Making a PACT to set forth transparent privacy and anonymity standards, which permit adoption of mobile tracing efforts while upholding civil liberties.
Very cool! Getting RL to work with real sample size constraints is critical. Interesting to see how it was done here. Also, looks like the application with Loon is for social good! 👏
Our most recent work is out in Nature! We're reporting on (reinforcement) learning to navigate Loon stratospheric balloons and minimizing the sim2real gap. Results from a 39-day Pacific Ocean experiment show RL keeps its strong lead in real conditions.
A nice note. Some cool tricks in these Bhatia matrix analysis books. If I understand correctly, the Russo-Dye Thm lets u (exactly) compute the largest learning rate via a maximization problem using only vectors rather than matrices (still hitting it on the 4th-moment data tensor).
What's the largest learning rate for which SGD converges? In deterministic case with Hessian H it is 2/||H||, from basic linear algebra. For SGD, an equivalent rate is 2/Tr(H), derivation from Russo-Dye theorem:
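The SGD threshold 2/Tr(H) is the quoted post's result; the deterministic 2/||H|| claim, though, is easy to sanity-check numerically. A minimal sketch (my own) on a quadratic:

```python
# Sanity check of the deterministic claim: on f(w) = 0.5 w'Hw, gradient
# descent iterates w <- (I - eta*H) w, which converge iff eta < 2/||H||
# (spectral norm). A minimal check at rates just below/above the threshold.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 20))
H = A @ A.T / 20 + np.eye(20)          # a random PSD Hessian
lr_max = 2 / np.linalg.norm(H, 2)      # the 2/||H|| threshold

for eta in [0.9 * lr_max, 1.1 * lr_max]:
    w = rng.normal(size=20)
    for _ in range(500):
        w = w - eta * (H @ w)          # one GD step on the quadratic
    print(f"eta/lr_max = {eta / lr_max:.1f}: ||w|| after 500 steps = {np.linalg.norm(w):.2e}")
```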
3/ Due to Dean Foster, my own education in online learning and sequential prediction started through first understanding Blackwell's approachability, which is a wonderful way to grasp the foundations. I signed this:
2/ This was the COLT 2018 open problem from @nanjiang_cs and Alekh, who conjectured a poly(H) lower bound. New work refutes this, showing that only logarithmically many (in H) episodes are needed to learn. So, in a minimax sense, long horizons are not more difficult than short ones!
It's hard to scale meta-learning to long inner optimizations. We introduce iMAML, which meta-learns *without* differentiating through the inner optimization path, using implicit differentiation. To appear @NeurIPSConf w/ @aravindr93 @ShamKakade6 @svlevine
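To see the trick on the simplest possible example (my own toy quadratic, not the authors' code): with inner problem phi*(theta) = argmin_phi L_in(phi) + (lam/2)||phi - theta||^2, the implicit function theorem gives the meta-gradient from the inner Hessian alone, with no backprop through the inner optimization path.

```python
# iMAML's implicit meta-gradient on a toy quadratic (my own illustrative
# instance, not the authors' code). Inner problem:
#   phi*(theta) = argmin_phi 0.5*phi'A phi - b'phi + (lam/2)*||phi - theta||^2
# so (A + lam*I) phi* = b + lam*theta, and the implicit function theorem
# gives d(phi*)/d(theta) = lam * (A + lam*I)^{-1}. Hence
#   grad_theta L_out = lam * (A + lam*I)^{-1} grad_phi L_out(phi*).
import numpy as np

rng = np.random.default_rng(2)
d, lam = 10, 1.0
M = rng.normal(size=(d, d))
A = M @ M.T / d + np.eye(d)      # inner-loss Hessian (PSD)
b = rng.normal(size=d)
target = rng.normal(size=d)      # outer loss: 0.5*||phi* - target||^2

def inner_solve(theta):
    return np.linalg.solve(A + lam * np.eye(d), b + lam * theta)

def outer_loss(theta):
    return 0.5 * np.sum((inner_solve(theta) - target) ** 2)

theta = rng.normal(size=d)
phi = inner_solve(theta)
implicit_grad = lam * np.linalg.solve(A + lam * np.eye(d), phi - target)

# verify against central finite differences of the outer loss
eps = 1e-6
fd = np.zeros(d)
for i in range(d):
    e = np.zeros(d)
    e[i] = eps
    fd[i] = (outer_loss(theta + e) - outer_loss(theta - e)) / (2 * eps)
print("max |implicit - finite diff| =", np.max(np.abs(implicit_grad - fd)))
```

In practice the matrix inverse is approximated with conjugate gradient on Hessian-vector products; the closed forms here are just for the toy quadratic.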
Huh... so this is pretty wild. It is _formally_ equivalent to the Polyak heavy ball momentum algorithm (with weight decay). Not just 'similar behavior'.
Conventional wisdom: slowly decay the learning rate (lr) when training deep nets. Empirically, some exotic lr schedules also work, e.g., cosine. New work with Zhiyuan Li: exponentially increasing lr works too! Experiments + surprising math explanation. See
This work did change my worldview of resource tradeoffs: how more compute makes up for less data. The frontier plots were quite compelling! Check out the poster for more info!
Will deep learning improve with more data, a larger model, or training for longer? "Any balanced combination of them" <– in our #NeurIPS2023 spotlight, we reveal this through the lens of gradient-based feature learning in the presence of computational-statistical gaps. 1/5
Turns out margin maximization, yes just margin maximization, implies this emergence. Some cool new mathematical techniques let us precisely derive the max margin… (yup, that observed margin of 1/(105√426) is indeed what we predict).
A nice point: better features, not better classifiers, are key. This is more generally an important point related to distribution shift: (it also comes up in RL, related to our "is a good representation sufficient?" paper).
Video summaries for our papers "Adversarial Examples Aren't Bugs, They're Features" () and "Image Synthesis with a Single Robust Classifier" () are now online. Enjoy! (@andrew_ilyas @tsiprasd @ShibaniSan @logan_engstrom Brandon Tran)
Cool stuff from @TheGregYang: tensors, neural nets, GPs, and kernels! Looks like we can derive a corresponding kernel/GP in a fairly general sense. Very curious about broader empirical comparisons to neural nets, which (potentially) draw strength from the non-linear regime!
1/ I can't teach you how to dougie, but I can teach you how to compute the Gaussian Process corresponding to an infinite-width neural network of ANY architecture, feedforward or recurrent, e.g.: resnet, GRU, transformers, etc ... RT plz 💪
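For the simplest instance of this recipe there is even a closed form: the GP kernel of a one-hidden-layer infinite-width ReLU net is the degree-1 arc-cosine kernel of Cho & Saul (up to scaling). A minimal sketch of my own (the quoted work gives the general recursive construction for arbitrary architectures), with a Monte Carlo check:

```python
# Simplest instance (my sketch, not the quoted paper's general recipe):
# the GP kernel of a one-hidden-layer infinite-width ReLU network. For
# weights w ~ N(0, I),
#   E[ReLU(w.x) ReLU(w.x')] = ||x|| ||x'|| / (2*pi) * (sin t + (pi - t) cos t),
# where t is the angle between x and x' (the degree-1 arc-cosine kernel
# of Cho & Saul, up to scaling).
import numpy as np

def relu_nngp(x, xp):
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    cos_t = np.clip(x @ xp / (nx * nxp), -1.0, 1.0)
    t = np.arccos(cos_t)
    return nx * nxp / (2 * np.pi) * (np.sin(t) + (np.pi - t) * cos_t)

# Monte Carlo check: the empirical covariance of width-100000 random ReLU
# features should approach the closed-form kernel value.
rng = np.random.default_rng(3)
d, width = 5, 100_000
x, xp = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(width, d))
mc = np.maximum(W @ x, 0.0) @ np.maximum(W @ xp, 0.0) / width
print(f"closed form: {relu_nngp(x, xp):.4f}   monte carlo: {mc:.4f}")
```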
1/5 New preprint w/ @_hanlin_zhang_, Edelman, Francanti, Venturi & Ateniese! We prove mathematically & demonstrate empirically the impossibility of strong watermarking of generative AI models. What's strong watermarking? What assumptions? See blog and 🧵
Our recent work comparing Transformers and State Space Models for sequence modeling is now on arXiv! TL;DR - we find a key disadvantage of SSMs compared to Transformers: they cannot copy from their input. 🧵
Arxiv:
Blog:
Very excited to see this finally announced & many thanks to @JeffDean, @GoogleAI and @Princeton for the ongoing support! + Fresh from the oven, research from the lab:
Also, the FOOBI (Fourth-Order-Only Blind Identification) paper: . A beautiful algo. And a catchy acronym too! It still surprises me how one can efficiently impose that rank-one constraint! Worth a read.
Bellairs Research Institute ☀️🏖️. Theory of DL. Day 3: @nadavcohen: "Gen. and Opt. in DL via Trajectories". Careful study of deep linear nets. Second time hearing about it; I now appreciate how this reveals quite different effects, relevant for DL! Also, 🏊♂️🥥 🍨!
2/3 This work shows that, under either assumption, all T*N1 samples can be used to achieve a precise notion of "few-shot" learning. Also worth pointing out the nice work by Maurer et al. . Our new work makes improvements under assumptions of a good common rep!
New post on iMAML: Meta-Learning with Implicit Gradients. Some animations, a discussion of potential limitations, and of course a Bayesian/variational interpretation.
3/3 And the seminal papers that started this line of thought: Dawid, "The prequential approach" (1984), and Foster, "Prediction in the worst case" (1991). They def influenced my thinking! Stats meets philosophy. Good stuff.
@RogerGrosse @SimonShaoleiDu Right! An interesting hypothesis test for 'deep learning' could be to see if the learned network is better than using the derived locally linear kernel. The derived kernel itself is def pretty cool (e.g. CNTK).
Attended the Dr. Martin Luther King Jr. commemorative lecture by Loretta Lynch, the 1st Black woman to serve as US Attorney General, introduced by Prof. Claudine Gay, the 1st Black president of Harvard University. The message was clear: Never Lose Infinite Hope. INFINITE HOPE. ❤️
One of my favorites from the most recent offering of CS287 Advanced Robotics?
Exam study handout summarizing all the main math in ~20pp. Incl. MaxEnt RL, CEM, LQR, Penalty Method, RRTs, Particle Filters, Policy Gradient, TRPO, PPO, Q-learning, DDPG, SAC,
I'm absolutely thrilled that @MosaicML has agreed to join @databricks as we continue on our journey to make the latest advances in deep learning efficient and accessible for everyone. The best of MosaicML is yet to come 🎉🎉🎉
@roydanroy @SimonShaoleiDu @RuosongW @lyang36 Nope. I'll be talking about recent work on policy gradient methods in RL and controls. For the following IAS workshop, I will! The rep. paper is pretty cool, in that it is still a bit puzzling to me!
Looking forward to reading this one! There isn't a compelling explanation for the unreasonable effectiveness of Adam/AdaGrad (aside from the original convex regret bounds), so this looks quite promising!
Trying out yet another deep learning optimizer? Graft its learning rates to better understand its performance: w/ @naman33k @_arohan_, Tomer Koren, and Cyril Zhang
2/2 Why double dip, you ask? Is it a theorists' concoction? Perhaps, yes. It is a compelling demo of how SGD (and GD) behave differently in the overparameterized regime; a great question to study in its own right! As an 'asymptotic', it may not be how practitioners roll...
The adoption of cellphones by Keralan fishermen is, I believe, the most stunning example of the contribution of information technology to market performance. Take a look at this graph for background: in three different regions of Kerala, phones were adopted at different times…
@ChrSzegedy Agreed. Transformers were the sauce for NMT, not unsupervised pretraining. ELMo was a different landmark: it showed the value of pretraining for numerous downstream tasks. Many researchers wondered why they didn't try it themselves (it wasn't about the architecture).
This is for SGD on least squares. Super basic problem. If I understand correctly, the _exact_ problem-dependent largest learning rate (above which divergence occurs) is 1/lambda_max( E[xx']^-1 E[||x||^2 xx'] ). This is pretty clean (and no SDP is needed to compute it).
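Typeset, the threshold above reads:

```latex
% Exact divergence threshold for SGD on least squares, as stated above:
\eta_{\max} \;=\; \frac{1}{\lambda_{\max}\!\left( \mathbb{E}[x x^\top]^{-1}\, \mathbb{E}\!\left[ \|x\|^2\, x x^\top \right] \right)}
```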
Hmmm... well, it is true that a function that is a sum of degree-3 polynomials would still be a sum of degree-3 polynomials after any linear transformation of the input (though the # of terms in the sum might be much larger). This def would be neat to try!
@ShamKakade6 Yes, early versions and relative position encodings use fixed spatial features. Some encoding of position (learned or engineered) is clearly necessary; otherwise, it is just a BOW. Rotating the sentence would be fun to try. IMO, unlikely to affect the performance significantly.
Sweet. Def want to build up my intuition on Hermite polynomials! For Gaussians, I tend to think more in terms of Isserlis' theorem (aka Wick's theorem), often a more brute-force approach for dealing with higher moments.
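For reference, the fourth-moment case of Isserlis'/Wick's theorem for a zero-mean jointly Gaussian vector: the expectation is the sum over the three pairings.

```latex
% Isserlis' (Wick's) theorem, fourth-moment case:
\mathbb{E}[x_1 x_2 x_3 x_4]
  = \mathbb{E}[x_1 x_2]\,\mathbb{E}[x_3 x_4]
  + \mathbb{E}[x_1 x_3]\,\mathbb{E}[x_2 x_4]
  + \mathbb{E}[x_1 x_4]\,\mathbb{E}[x_2 x_3]
```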
Oh yeah, batch norm is assumed, which is clearly why things aren't exploding. But still, it is quite cool that they are formally equivalent; note their scheme only tracks one param, as opposed to two with momentum.
@faoliehoek @SimonShaoleiDu @RuosongW @lyang36 No. The representation allows for near-perfect approximation of *every* possible intermediate value function! Even this (everywhere) near-perfect approximation has massive error amplification. Subtly, I'd say a good representation has to capture dynamics info to avoid this.
@bremen79 Bigger picture here: "full" Bayes averaging is quite powerful (due to mixability). That it is robust is often not reflected in the Bayesian viewpoint. Similarly, PAC-Bayes also demonstrates this power of averaging (@RogerGrosse @roydanroy). Also, yes, Bayes needs smooth losses.