Super excited to join Harvard with a stellar group of new hires and looking forward to many new collabs with the terrific faculty there! Def sad to be leaving my wonderful UW and MSR colleagues and friends; rest assured, I'll keep up the collabs!
Thank you so much to the awards committee! Also a huge thanks to the past and current ICML chairs and organizers for all their great work for our community! 👍 It is an honor to receive this 😀😀 with such wonderful co-authors: @arkrause, Matthias, and Niranjan!
We are very pleased to announce that the #icml2020 Test of Time Award goes to "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design" by Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger.
Grateful to Priscilla Chan/Mark Zuckerberg (@ChanZuckerberg Initiative) for a generous gift to the Kempner Natural & Artificial Intelligence Institute @Harvard. Excited to work w/ @blsabatini + new colleagues to provide new educational and research opportunities.
Couldn't reconcile theory/practice with dropout for over a year. New: w/ @tengyuma & C. Wei. Turns out dropout sometimes has an implicit regularization effect! Pretty wild, just like small vs. large batch SGD. These plots def surprised us!
What actually constitutes a good representation for reinforcement learning? Lots of sufficient conditions. But what's necessary? New paper: . Surprisingly, good value (or policy) based representations just don't cut it! w/ @SimonShaoleiDu @RuosongW @lyang36
Wrapped up at the "Workshop on Theory of Deep Learning: Where next?" at IAS. The field has moved so much! e.g., the Neural Tangent Kernel (NTK) results! A few years ago, understanding DL looked hopeless. Terrific set of talks, too!
1/3 Two shots at few-shot learning: We have T tasks and N1 samples per task. How effective is pooling these samples for few-shot learning? New work: . Case 1: there is a common low-dim representation. Case 2: there is a common high-dim representation.
1/ David Blackwell. Leagues ahead of his time: "What is a good prediction strategy, and how well can you do?" While some things do not seem possible ("Looking for a p [probability] that does well against every x [an outcome] seems hopeless"), Blackwell does give us a strategy:
Also, we recently posted this work on the theory of policy gradients for reinforcement learning! A long time in the works, this paper finally gets a handle on function approximation with policy gradient methods.
1/3 Should only Bayesians be Bayesian? No. Being Bayes is super robust! An oldie but goodie from Vovk: "Competitive Online Statistics" (2001). Beautiful work showing Bayes is awesome, even if you are not a Bayesian. (post motivated by nice thoughts from @RogerGrosse @roydanroy).
1/2. No double dipping. My current worldview: the 'double dip' is not a practical concern, because tuning various hyperparams (early stopping, L2 reg, model size, etc.) on a holdout set alleviates the 'dip'. This work lends evidence to this viewpoint!
Optimal Regularization can Mitigate Double Descent
Joint work with Prayaag Venkat, @ShamKakade6, @tengyuma.
We prove in certain ridge regression settings that *optimal* L2 regularization can eliminate double descent: more data never hurts (1/n)
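Not the paper's setting, just a toy illustration of the claim (a hypothetical isotropic setup of my own): the unregularized/min-norm solution spikes near the interpolation threshold n = d, while an oracle-tuned L2 penalty keeps the excess risk decreasing in n.

```python
# Toy sketch (my own hypothetical setup, not the paper's): excess risk of
# ridge regression vs. sample size n, unregularized vs. tuned L2 penalty.
# The min-norm solution spikes near the interpolation threshold n = d
# (double descent); an oracle-tuned lambda keeps the risk decreasing in n.
import numpy as np

rng = np.random.default_rng(0)
d, noise, trials = 50, 0.5, 50
w_star = rng.normal(size=d) / np.sqrt(d)

def excess_risk(n, lam):
    risks = []
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        y = X @ w_star + noise * rng.normal(size=n)
        if lam == 0.0:  # min-norm least-squares solution
            w = np.linalg.lstsq(X, y, rcond=None)[0]
        else:           # ridge: w = (X'X + lam*I)^{-1} X'y
            w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        # for isotropic x, excess risk E[(x'(w - w*))^2] = ||w - w*||^2
        risks.append(np.sum((w - w_star) ** 2))
    return np.mean(risks)

for n in [10, 25, 45, 50, 55, 75, 150]:
    tuned = min(excess_risk(n, lam) for lam in [0.01, 0.1, 1.0, 10.0])
    print(f"n={n:4d}   lam=0 risk: {excess_risk(n, 0.0):9.3f}   tuned risk: {tuned:6.3f}")
```

The point is only qualitative: with the right lambda, the spike at n = d disappears and more data never hurts.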
1/ Playing the long game: Is long-horizon RL harder than short-horizon RL? Clearly, H-length episodes scale linearly with H in sample count, but counting learning complexity by # of episodes rather than # of samples accounts for this. So is it any harder?
Beautiful post by @BachFrancis on Chebyshev polynomials: . Handy for algorithm design. Let's not forget the wise words of Rocco Servedio, as quoted by @mrtz: "There's only one bullet in the gun. It's called the Chebyshev polynomial."
Very cool! Years ago (a little post-AlexNet) I put in (too?) many cycles trying to design such a kernel. Didn't match their performance (though I didn't have much compute back then). Pretty slick that they use a derived kernel from a ConvNet!
We have released code for computing the Convolutional Neural Tangent Kernel (CNTK) used in our paper "On Exact Computation with an Infinitely Wide Neural Net", which will appear at NeurIPS 2019.
Paper:
Code:
Predicting What You Already Know Helps: Provable Self-Supervised Learning
We analyze how predicting parts of the input from other parts (missing patch, missing word, etc.) helps to learn a representation that linearly separates the downstream task.
1/2
2/ In a seminal paper, "Minimax vs Bayes prediction" ('56) , Blackwell shows we can predict well on any sequence, using randomization: "we can improve matters by allowing randomized predictions." These ideas permeate so much of learning theory today.
Wow! This is amazing. A few years ago RL was starting to be applied in the robotics domain, with many doubters. Fast forward a handful of years and this! 👏
We've trained an AI system to solve the Rubik's Cube with a human-like robot hand.
This is an unprecedented level of dexterity for a robot, and is hard even for humans to do.
The system trains in an imperfect simulation and quickly adapts to reality:
Bellairs. Theory of DL. Day 4, penultimate session from the unstoppable Boaz Barak: average-case complexity, computational limits, and relevance to DL. Front row: Yoshua Bengio, Jean Ponce, @ylecun
Dean Foster, @dhruvmadeka, and I have been excited about the application of AI to education! We collected our thoughts here - and we're curious what people think:
Solid move from DeepMind. Knowing Emo Todorov and his work, I'm frankly surprised this one-man show is only being purchased now. Hope Emo still stays in the driver's seat for MuJoCo going forward!
We’ve acquired the MuJoCo physics simulator () and are making it free for all, to support research everywhere. MuJoCo is a fast, powerful, easy-to-use, and soon-to-be open-source simulation tool, designed for robotics research:
Amazing and congrats! I have def been wondering if the inductive biases in DeepNets and in ML methods are well suited for certain scientific domains. This settles that for structure prediction! Hoping this can eventually help with drug discovery.
In a major scientific breakthrough, the latest version of #AlphaFold has been recognised as a solution to one of biology's grand challenges - the “protein folding problem”. It was validated today at #CASP14, the biennial Critical Assessment of protein Structure Prediction (1/3)
I found this to be very informative for LLM training. The science was just super well done. Highly recommended for anyone training transformer-based LLMs.
Bellairs Research Institute ☀️🏖️. Theory of DL. Day 3: new insights from @roydanroy @KDziugaite on PAC-Bayes for DL. Possibly gives a new lens into implicit reg 🤔 @david_rolnick: cool results on expressivity of deep nets. And T. Lillicrap keeps us real on theory vs. practice!
Great to see that Mike Jordan is thinking about Decision Theory, ML, and Econ! Super important area: lots of stats/algorithmic questions that have immediate impact on practice. There are few other areas where one can say the same!
At the Bellairs Research Institute: Theory of Deep Learning workshop. Day 1: great presentations on implicit regularization from @prfsanjeevarora @suriyagnskr. Day 2: lucid explanations of NTKs from @Hoooway @jasondeanlee. Good friends, sun ☀️, and sand 🏖️ a bonus.
John was a reason I moved to AI and neuroscience from physics. In his first class, he compared the human pattern-matching algo for chess playing to Deep Blue's brute-force lookahead. I wondered if Go would be mastered in my lifetime! Wonderful to hear from John Hopfield again!
Here's my conversation with John Hopfield. Hopfield networks were one of the early ideas that catalyzed the development of deep learning. His truly original work has explored the messy world of biology through the piercing eyes of a physicist.
Revised thoughts on Neural Tangent Kernels (after understanding the regime better; h/t @SimonShaoleiDu): def a super cool idea for designing a kernel! But it does not look to help our understanding of how representations arise in deep learning. Much more is needed here!
1/5 In a new paper with @vyasnikhil96 and @ShamKakade6, we give a way to certify that a generative model does not infringe on the copyright of data that was in its training set. See for blog, but the TL;DR is...
Nice talk from Rong Ge! Learning two-layer neural nets with _finite_ width: a seriously awesome algebraic idea. Reminiscent of FOOBI (the coolest spectral algo in town!): they replace the 'rank-1 detector' in FOOBI with a 'one-neuron detector'.
Bellairs Research Institute ☀️⛱️. Theory of DL workshop, Day 2 (eve): Thanks to Yann LeCun and Yoshua Bengio for thought-provoking talks. @ylecun's title: "Questions from the 80s and 90s". Good questions indeed!!
Bellairs. Day 5. @HazanPrinceton and myself: double feature on controls+RL. + Spotlights: @maithra_raghu: meta-learning as rapid feature learning. Raman Arora: dropout, capacity control, and matrix sensing. @HanieSedghi: module criticality and generalization! And that is a wrap! 🙂
Thrilled to announce the first annual Reinforcement Learning Conference @RL_Conference, which will be held at UMass Amherst August 9-12! RLC is the first strongly peer-reviewed RL venue with proceedings, and our call for papers is now available: .
🧵 What’s the simplest failure mode of Transformers? Our #NeurIPS2023 spotlight paper identifies the “attention glitches” phenomenon, where Transformers intermittently fail to capture robust reasoning, due to undesirable architectural inductive biases. Poster: Wed 5-7pm CST, #528
Noga Alon @princeton and Joel Spencer @nyuniversity receive the 2021 Steele Prize for Mathematical Exposition for The Probabilistic Method @WileyGlobal. Now in its 4th ed., the text is invaluable for both the beginner and the experienced researcher. More...
This should be good! @SurbhiGoel_ has done some exciting work in understanding neural nets, going beyond the "linear" NTK barrier. To make progress in deep learning theory, we def need to understand these beasts in the non-linear regime.
Looking forward to this Friday at 1pm, when we'll hear from @SurbhiGoel_ about the computational complexity of learning neural networks over Gaussian marginals. We'll see some average-case hardness results as well as a poly-time algorithm for approximately learning ReLUs.
📢 Announcing seven new National Artificial Intelligence Research Institutes!
Discover the themes and the institutions that are helping advance foundational AI research to address national economic and societal priorities in the 🧵 ⬇️:
A huge congrats to MPI for hiring the terrific @mrtz as a director! Personally sad to have him across the pond, but excited to see what Moritz helps to build.
1/3 Can open democracies fight pandemics? Making a PACT to set forth transparent privacy and anonymity standards, which permit adoption of mobile tracing efforts while upholding civil liberties.
Very cool! Getting RL to work with real sample size constraints is critical. Interesting to see how it was done here. Also, looks like the application with Loon is for social good! 👏
Our most recent work is out in Nature! We're reporting on (reinforcement) learning to navigate Loon stratospheric balloons and minimizing the sim2real gap. Results from a 39-day Pacific Ocean experiment show RL keeps its strong lead in real conditions.
A nice note. Some cool tricks in these Bhatia matrix analysis books. If I understand correctly, the Russo-Dye Thm lets u (exactly) compute the largest learning rate via a maximization problem using only vectors rather than matrices (still hitting it on the 4th-moment data tensor).
What's the largest learning rate for which SGD converges? In deterministic case with Hessian H it is 2/||H||, from basic linear algebra. For SGD, an equivalent rate is 2/Tr(H), derivation from Russo-Dye theorem:
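The SGD threshold 2/Tr(H) is the quoted post's result; the deterministic 2/||H|| claim, though, is easy to sanity-check numerically. A minimal sketch (my own) on a quadratic:

```python
# Sanity check of the deterministic claim: on f(w) = 0.5 w'Hw, gradient
# descent iterates w <- (I - eta*H) w, which converge iff eta < 2/||H||
# (spectral norm). A minimal check at rates just below/above the threshold.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 20))
H = A @ A.T / 20 + np.eye(20)          # a random PSD Hessian
lr_max = 2 / np.linalg.norm(H, 2)      # the 2/||H|| threshold

for eta in [0.9 * lr_max, 1.1 * lr_max]:
    w = rng.normal(size=20)
    for _ in range(500):
        w = w - eta * (H @ w)          # one GD step on the quadratic
    print(f"eta/lr_max = {eta / lr_max:.1f}: ||w|| after 500 steps = {np.linalg.norm(w):.2e}")
```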
3/ Due to Dean Foster, my own education in online learning and sequential prediction started through first understanding Blackwell's approachability, which is a wonderful way to grasp the foundations. I signed this:
2/ This was the COLT 2018 open problem from @nanjiang_cs and Alekh, who conjectured a poly(H) lower bound. New work refutes this, showing that only logarithmically many (in H) episodes are needed to learn. So, in a minimax sense, long horizons are not more difficult than short ones!
It's hard to scale meta-learning to long inner optimizations. We introduce iMAML, which meta-learns *without* differentiating through the inner optimization path, using implicit differentiation. To appear @NeurIPSConf w/ @aravindr93 @ShamKakade6 @svlevine
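To see the trick on the simplest possible example (my own toy quadratic, not the authors' code): with inner problem phi*(theta) = argmin_phi L_in(phi) + (lam/2)||phi - theta||^2, the implicit function theorem gives the meta-gradient from the inner Hessian alone, with no backprop through the inner optimization path.

```python
# iMAML's implicit meta-gradient on a toy quadratic (my own illustrative
# instance, not the authors' code). Inner problem:
#   phi*(theta) = argmin_phi 0.5*phi'A phi - b'phi + (lam/2)*||phi - theta||^2
# so (A + lam*I) phi* = b + lam*theta, and the implicit function theorem
# gives d(phi*)/d(theta) = lam * (A + lam*I)^{-1}. Hence
#   grad_theta L_out = lam * (A + lam*I)^{-1} grad_phi L_out(phi*).
import numpy as np

rng = np.random.default_rng(2)
d, lam = 10, 1.0
M = rng.normal(size=(d, d))
A = M @ M.T / d + np.eye(d)      # inner-loss Hessian (PSD)
b = rng.normal(size=d)
target = rng.normal(size=d)      # outer loss: 0.5*||phi* - target||^2

def inner_solve(theta):
    return np.linalg.solve(A + lam * np.eye(d), b + lam * theta)

def outer_loss(theta):
    return 0.5 * np.sum((inner_solve(theta) - target) ** 2)

theta = rng.normal(size=d)
phi = inner_solve(theta)
implicit_grad = lam * np.linalg.solve(A + lam * np.eye(d), phi - target)

# verify against central finite differences of the outer loss
eps = 1e-6
fd = np.zeros(d)
for i in range(d):
    e = np.zeros(d)
    e[i] = eps
    fd[i] = (outer_loss(theta + e) - outer_loss(theta - e)) / (2 * eps)
print("max |implicit - finite diff| =", np.max(np.abs(implicit_grad - fd)))
```

In practice the matrix inverse is approximated with conjugate gradient on Hessian-vector products; the closed forms here are just for the toy quadratic.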
Huh... so this is pretty wild. It is _formally_ equivalent to the Polyak heavy ball momentum algorithm (with weight decay). Not just 'similar behavior'.
Conventional wisdom: slowly decay the learning rate (lr) when training deep nets. Empirically, some exotic lr schedules also work, e.g., cosine. New work with Zhiyuan Li: exponentially increasing lr works too! Experiments + surprising math explanation. See
This work did change my worldview of resource tradeoffs: how more compute makes up for less data. The frontier plots were quite compelling! Check out the poster for more info!
Will deep learning improve with more data, a larger model, or training for longer? "Any balanced combination of them" <– in our #NeurIPS2023 spotlight, we reveal this through the lens of gradient-based feature learning in the presence of computational-statistical gaps. 1/5
Turns out margin maximization, yes just margin maximization, implies this emergence. Some cool new mathematical techniques let us precisely derive the max margin… (yup, that observed margin of 1/(105√426) is indeed what we predict).
A nice point: better features, not better classifiers, are key. This is more generally an important point related to distribution shift: (it also comes up in RL, related to our "is a good representation sufficient?" paper).
Video summaries for our papers "Adversarial Examples Aren't Bugs, They're Features" () and "Image Synthesis with a Single Robust Classifier" () are now online. Enjoy! (@andrew_ilyas @tsiprasd @ShibaniSan @logan_engstrom Brandon Tran)
Cool stuff from @TheGregYang: tensors, neural nets, GPs, and kernels! Looks like we can derive a corresponding kernel/GP in a fairly general sense. Very curious about broader empirical comparisons to neural nets, which (potentially) draw strength from the non-linear regime!
1/ I can't teach you how to dougie, but I can teach you how to compute the Gaussian Process corresponding to an infinite-width neural network of ANY architecture, feedforward or recurrent, e.g.: resnet, GRU, transformers, etc ... RT plz 💪
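For the simplest instance of this recipe there is even a closed form: the GP kernel of a one-hidden-layer infinite-width ReLU net is the degree-1 arc-cosine kernel of Cho & Saul (up to scaling). A minimal sketch of my own (the quoted work gives the general recursive construction for arbitrary architectures), with a Monte Carlo check:

```python
# Simplest instance (my sketch, not the quoted paper's general recipe):
# the GP kernel of a one-hidden-layer infinite-width ReLU network. For
# weights w ~ N(0, I),
#   E[ReLU(w.x) ReLU(w.x')] = ||x|| ||x'|| / (2*pi) * (sin t + (pi - t) cos t),
# where t is the angle between x and x' (the degree-1 arc-cosine kernel
# of Cho & Saul, up to scaling).
import numpy as np

def relu_nngp(x, xp):
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    cos_t = np.clip(x @ xp / (nx * nxp), -1.0, 1.0)
    t = np.arccos(cos_t)
    return nx * nxp / (2 * np.pi) * (np.sin(t) + (np.pi - t) * cos_t)

# Monte Carlo check: the empirical covariance of width-100000 random ReLU
# features should approach the closed-form kernel value.
rng = np.random.default_rng(3)
d, width = 5, 100_000
x, xp = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(width, d))
mc = np.maximum(W @ x, 0.0) @ np.maximum(W @ xp, 0.0) / width
print(f"closed form: {relu_nngp(x, xp):.4f}   monte carlo: {mc:.4f}")
```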
1/5 New preprint w/ @_hanlin_zhang_, Edelman, Francanti, Venturi & Ateniese! We prove mathematically & demonstrate empirically the impossibility of strong watermarking of generative AI models. What's strong watermarking? What assumptions? See blog and 🧵
Our recent work comparing Transformers and State Space Models for sequence modeling is now on arXiv! TL;DR - we find a key disadvantage of SSMs compared to Transformers: they cannot copy from their input. 🧵
Arxiv:
Blog:
Very excited to see this finally announced & many thanks to @JeffDean, @GoogleAI and @Princeton for the ongoing support! + Fresh from the oven, research from the lab:
Also, the FOOBI (Fourth-Order-Only Blind Identification) paper: . A beautiful algo. And a catchy acronym too! It still surprises me how one can efficiently impose that rank-one constraint! Worth a read.
Bellairs Research Institute ☀️🏖️. Theory of DL. Day 3: @nadavcohen: "Gen. and Opt. in DL via Trajectories". Careful study of deep linear nets. Second time hearing about it; I now appreciate how this reveals quite different effects, relevant for DL! Also, 🏊♂️🥥 🍨!
2/3 This work shows that, under either assumption, all T*N1 samples can be used to achieve a precise notion of "few-shot" learning. Also worth pointing out the nice work by Maurer et al. . Our new work makes improvements under assumptions of a good common rep!
New post on iMAML: Meta-Learning with Implicit Gradients. Some animations, a discussion of potential limitations, and of course a Bayesian/variational interpretation.
3/3 And the seminal papers that started this line of thought: Dawid, "The prequential approach" (1984), and Foster, "Prediction in the worst case" (1991). They def influenced my thinking! Stats meets philosophy. Good stuff.
@RogerGrosse @SimonShaoleiDu Right! An interesting hypothesis test for 'deep learning' could be to see if the learned network is better than using the derived locally linear kernel. The derived kernel itself is def pretty cool (e.g. CNTK).
Attended the Dr. Martin Luther King Jr. commemorative lecture by Loretta Lynch, the 1st Black woman to serve as US Attorney General, introduced by Prof. Claudine Gay, the 1st Black president of Harvard University. The message was clear: Never Lose Infinite Hope. INFINITE HOPE. ❤️
One of my favorites from the most recent offering of CS287 Advanced Robotics?
Exam study handout summarizing all the main math in ~20pp. Incl. MaxEnt RL, CEM, LQR, Penalty Method, RRTs, Particle Filters, Policy Gradient, TRPO, PPO, Q-learning, DDPG, SAC,
I'm absolutely thrilled that @MosaicML has agreed to join @databricks as we continue on our journey to make the latest advances in deep learning efficient and accessible for everyone. The best of MosaicML is yet to come 🎉🎉🎉
@roydanroy @SimonShaoleiDu @RuosongW @lyang36 Nope. I'll be talking about recent work on policy gradient methods in RL and controls. For the following IAS workshop, I will! The rep. paper is pretty cool, in that it is still a bit puzzling to me!
Looking forward to reading this one! There isn't a compelling explanation for the unreasonable effectiveness of Adam/AdaGrad (aside from the original convex regret bounds), so this looks quite promising!
Trying out yet another deep learning optimizer? Graft its learning rates to better understand its performance: w/ @naman33k @_arohan_, Tomer Koren, and Cyril Zhang
2/2 Why double dip, you ask? Is it a theorists' concoction? Perhaps, yes. It is a compelling demo of how SGD (and GD) behave differently in the overparameterized regime; a great question to study in its own right! As an 'asymptotic', it may not be how practitioners roll...
The adoption of cellphones by Keralan fishermen is, I believe, the most stunning example of the contribution of information technology to market performance. Take a look at this graph for background: in three different regions of Kerala, phones were adopted at different times…
@ChrSzegedy Agreed. Transformers were the sauce for NMT, not unsupervised pretraining. ELMo was a different landmark: it showed the value of pretraining for numerous downstream tasks. Many researchers wondered why they didn't try it themselves (it wasn't about the architecture).
This is for SGD on least squares. Super basic problem. If I understand correctly, the _exact_ problem-dependent largest learning rate (above which divergence occurs) is 1/lambda_max( E[xx']^-1 E[||x||^2 xx'] ). This is pretty clean (and no SDP is needed to compute it).
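Typeset, the threshold above reads:

```latex
% Exact divergence threshold for SGD on least squares, as stated above:
\eta_{\max} \;=\; \frac{1}{\lambda_{\max}\!\left( \mathbb{E}[x x^\top]^{-1}\, \mathbb{E}\!\left[ \|x\|^2\, x x^\top \right] \right)}
```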
Hmmm... well, it is true that a function that is a sum of degree-3 polynomials would still be a sum of degree-3 polynomials after any linear transformation of the input (though the # of terms in the sum might be much larger). This def would be neat to try!
@ShamKakade6 Yes, early versions and relative position encodings use fixed spatial features. Some encoding of position (learned or engineered) is clearly necessary; otherwise, it is just a BOW. Rotating the sentence would be fun to try. IMO, unlikely to affect the performance significantly.
Sweet. Def want to build up my intuition on Hermite polynomials! For Gaussians, I tend to think more in terms of Isserlis' theorem (aka Wick's theorem), often a more brute-force approach for dealing with higher moments.
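For reference, the fourth-moment case of Isserlis'/Wick's theorem for a zero-mean jointly Gaussian vector: the expectation is the sum over the three pairings.

```latex
% Isserlis' (Wick's) theorem, fourth-moment case:
\mathbb{E}[x_1 x_2 x_3 x_4]
  = \mathbb{E}[x_1 x_2]\,\mathbb{E}[x_3 x_4]
  + \mathbb{E}[x_1 x_3]\,\mathbb{E}[x_2 x_4]
  + \mathbb{E}[x_1 x_4]\,\mathbb{E}[x_2 x_3]
```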
Oh yeah, batch norm is assumed, which is clearly why things aren't exploding. But still, it is quite cool that they are formally equivalent; note their scheme only tracks one param, as opposed to two with momentum.
@faoliehoek @SimonShaoleiDu @RuosongW @lyang36 No. The representation allows for near-perfect approximation of *every* possible intermediate value function! Even this (everywhere) near-perfect approximation has massive error amplification. Subtly, I'd say a good representation has to capture dynamics info to avoid this.
@bremen79 Bigger picture here: "full" Bayes averaging is quite powerful (due to mixability). That it is robust is often not reflected in the Bayesian viewpoint. Similarly, PAC-Bayes also demonstrates this power of averaging (@RogerGrosse @roydanroy). Also, yes, Bayes needs smooth losses.