Sham Kakade Profile
Sham Kakade

@ShamKakade6

11,611
Followers
383
Following
7
Media
311
Statuses

Harvard Professor. Full stack ML and AI. Co-director of the Kempner Institute for the Study of Artificial and Natural Intelligence.

Joined December 2018
@ShamKakade6
Sham Kakade
3 years
Super excited to join Harvard with a stellar group of new hires and looking forward to many new collabs with the terrific faculty there! Def sad to be leaving my wonderful UW and MSR colleagues and friends; rest assured, I'll keep up the collabs!
21
15
412
@ShamKakade6
Sham Kakade
4 years
Thank you so much to the awards committee! Also a huge thanks to the past and current ICML chairs and organizers for all their great work for our community! 👍 It is an honor to receive this 😀😀 with such wonderful co-authors: @arkrause , Matthias, and Niranjan!
@haldaume3
Hal Daumé III
4 years
We are very pleased to announce that the #icml2020 Test of Time award goes to Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design by Niranjan Srinivas, Andreas Krause, Sham Kakade and Matthias Seeger >
Tweet media one
Tweet media two
3
92
471
27
31
345
@ShamKakade6
Sham Kakade
2 years
Grateful to Priscilla Chan/Mark Zuckerberg ( @ChanZuckerberg Initiative) for generous gift. Kempner Natural & Artificial Intelligence Institute @Harvard . Excited to work w/ @blsabatini +new colleagues to provide new educational and research opportunities.
11
23
301
@ShamKakade6
Sham Kakade
6 months
Can inductive biases explain mechanistic interpretability? Why do sinusoidal patterns emerge for NNs trained on modular addition? (e.g. @NeelNanda5 ) New work pins this down! w/ @depen_morwani @EdelmanBen @rosieyzh @costinoncescu ! @KempnerInst
5
36
271
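For background on why sinusoidal features are such a natural fit here (a standard trigonometric identity, not a statement from the paper above): if tokens a and b are embedded by cosines and sines of frequency-k multiples, a network can compute addition mod p through products of embedding coordinates, then read off the residue where the resulting cosine peaks.

```latex
% Angle-addition identity behind the "Fourier features" picture for modular
% addition; standard background, not a claim from the thread's paper.
\cos\!\Big(\tfrac{2\pi k(a+b)}{p}\Big)
  = \cos\!\Big(\tfrac{2\pi k a}{p}\Big)\cos\!\Big(\tfrac{2\pi k b}{p}\Big)
  - \sin\!\Big(\tfrac{2\pi k a}{p}\Big)\sin\!\Big(\tfrac{2\pi k b}{p}\Big),
\qquad a, b \in \{0, \dots, p-1\}.
```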
@ShamKakade6
Sham Kakade
4 years
couldn't reconcile theory/practice with dropout for over a year. New: w/ @tengyuma & C. Wei. turns out dropout sometimes has an implicit regularization effect! pretty wild. just like small vs. large batch sgd. these plots def surprised us!
5
28
190
@ShamKakade6
Sham Kakade
5 years
What actually constitutes a good representation for reinforcement learning? Lots of sufficient conditions. But what's necessary? New paper: . Surprisingly, good value (or policy) based representations just don't cut it! w/ @SimonShaoleiDu @RuosongW @lyang36
2
32
179
@ShamKakade6
Sham Kakade
5 years
Wrapped up at the "Workshop on Theory of Deep Learning: Where next?" at IAS. . The field has moved so much! e.g. Neural Tangent Kernel (NTK) results! A few years ago, understanding DL looked hopeless. Terrific set of talks, too!
1
8
140
@ShamKakade6
Sham Kakade
4 years
1/3 Two shots at few shot learning: We have T tasks and N1 samples per task. How effective is pooling these samples for few shot learning? New work: . Case 1: there is a common low dim representation. Case 2: there is a common high dim representation.
Tweet media one
3
17
127
@ShamKakade6
Sham Kakade
4 years
1/ David Blackwell. Leagues ahead of his time: "What is a good prediction strategy, and how well can you do?" While some things do not seem possible, "Looking for a p [probability] that does well against every x [an outcome] seems hopeless", Blackwell does give us a strategy:
Tweet media one
3
15
117
@ShamKakade6
Sham Kakade
5 years
Also, we recently posted this work on the theory of policy gradients for reinforcement learning! A long time in the works, this paper finally gets a handle on function approximation with policy gradient methods.
0
26
115
@ShamKakade6
Sham Kakade
4 years
1/3 Should only Bayesians be Bayesian? No. Being Bayes is super robust! An oldie but goodie from Vovk: "Competitive Online Statistics" 2001. Beautiful work showing Bayes is awesome, even if you are not a Bayesian. (post motivated by nice thoughts from @RogerGrosse @roydanroy ).
2
16
110
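The robustness being praised here has a one-screen illustration: a Bayesian mixture of experts under log loss is within log N of the best expert on any sequence, with no probabilistic assumptions. The sketch below is a generic exponential-weights/Bayes-mixture demo with made-up experts and data, not code from Vovk's paper.

```python
import numpy as np

# Hedged sketch: Bayesian mixture (= exponential weights under log loss) over N
# fixed probabilistic experts. Its cumulative log loss is within log(N) of the
# best expert on ANY outcome sequence. Experts and data here are illustrative.
rng = np.random.default_rng(0)
T, N = 500, 8
expert_p = rng.uniform(0.05, 0.95, size=N)       # each expert predicts a fixed P(y=1)
y = (rng.random(T) < 0.7).astype(float)          # an arbitrary binary sequence

log_w = np.zeros(N)                              # log posterior weights (uniform prior)
cum_loss = np.zeros(N + 1)                       # slot 0: mixture; slots 1..N: experts
for t in range(T):
    w = np.exp(log_w - log_w.max()); w /= w.sum()
    p_mix = float(w @ expert_p)                  # posterior-predictive forecast
    p_all = np.concatenate(([p_mix], expert_p))
    loss = -(y[t] * np.log(p_all) + (1 - y[t]) * np.log(1 - p_all))
    cum_loss += loss
    log_w -= loss[1:]                            # Bayes update = exp-weights on log loss

print("mixture:", round(cum_loss[0], 2),
      "best expert:", round(cum_loss[1:].min(), 2),
      "log(N) slack bound:", round(np.log(N), 2))
```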
@ShamKakade6
Sham Kakade
4 years
1/2. No double dipping. My current worldview: the 'double dip' is not a practical concern, because tuning the various hyperparams (early stopping, L2 reg, model size, etc) on a holdout set alleviates the 'dip'. This work lends evidence to this viewpoint!
@PreetumNakkiran
Preetum Nakkiran
4 years
Optimal Regularization can Mitigate Double Descent Joint work with Prayaag Venkat, @ShamKakade6 , @tengyuma . We prove in certain ridge regression settings that *optimal* L2 regularization can eliminate double descent: more data never hurts (1/n)
Tweet media one
6
51
225
2
20
96
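A toy numerical sketch of the point above (an illustrative random-features setup, not the paper's experiments): sweep the number of features p for ridge regression with near-zero vs. moderate regularization. Near the interpolation threshold p ≈ n, the tiny-λ test error typically spikes, while a reasonably tuned λ keeps the curve well behaved.

```python
import numpy as np

# Hedged double-descent demo with random tanh features; all settings are
# illustrative choices, not taken from the Nakkiran et al. paper.
rng = np.random.default_rng(0)

def fit_ridge(Phi, y, lam):
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

def test_mse(p, lam, n=100, n_test=2000, d_in=20, noise=0.5):
    W = rng.standard_normal((d_in, p)) / np.sqrt(d_in)     # random feature map
    beta = rng.standard_normal(d_in)                        # true linear signal
    def data(m):
        X = rng.standard_normal((m, d_in))
        return np.tanh(X @ W), X @ beta + noise * rng.standard_normal(m)
    Phi, y = data(n)
    Phi_te, y_te = data(n_test)
    w = fit_ridge(Phi, y, lam)
    return np.mean((Phi_te @ w - y_te) ** 2)

for p in [25, 50, 90, 100, 110, 200, 400]:          # n = 100 is the interpolation point
    print(p, [round(test_mse(p, lam), 3) for lam in (1e-6, 1e-1, 1.0)])
```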
@ShamKakade6
Sham Kakade
4 years
1/ Playing the long game: Is long horizon RL harder than short horizon RL? Clearly, H length episodes scale linearly with H, but counting learning complexity by # episodes rather than # samples accounts for this. So is it any harder?
2
8
83
@ShamKakade6
Sham Kakade
3 months
Repeat After Me: Transformers are Better than State Space Models at Copying
@KempnerInst
Kempner Institute at Harvard University
3 months
Check out #KempnerInstitute ’s newest blog post! Authors Samy Jelassi, @brandfonbrener , @ShamKakade6 @EranMalach show that the improved efficiency of State Space Models sacrifices some core capabilities for modern LLMs. #MachineLearning #AI
0
13
55
1
4
78
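For concreteness, this is roughly what a copy-task example looks like; the generator below is a hypothetical illustration (lengths, vocabulary, and the "<copy>" marker are my own choices, not the paper's benchmark).

```python
import random
import string

# Illustrative copy-task generator: the model sees a string followed by a copy
# marker and must reproduce the string. All specifics here are hypothetical.
def make_copy_example(min_len=5, max_len=50, alphabet=string.ascii_lowercase):
    s = "".join(random.choices(alphabet, k=random.randint(min_len, max_len)))
    return s + " <copy> ", s          # (prompt, target)

random.seed(0)
for _ in range(3):
    prompt, target = make_copy_example()
    print(repr(prompt), "->", repr(target))
```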
@ShamKakade6
Sham Kakade
4 years
Beautiful post by @BachFrancis on Chebyshev polynomials: . Handy for algorithm design. Let's not forget the wise words of Rocco Servedio, as quoted by @mrtz , "There's only one bullet in the gun. It's called the Chebyshev polynomial."
0
10
75
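For readers who want the one-line version of why Chebyshev polynomials are the "only bullet in the gun" (standard facts, not specific to the linked post):

```latex
% Three-term recurrence and trigonometric characterization:
T_0(x) = 1, \qquad T_1(x) = x, \qquad T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x),
\qquad T_n(\cos\theta) = \cos(n\theta).
% Extremal property: among monic degree-n polynomials, 2^{1-n} T_n has the
% smallest sup-norm on [-1,1], which is what drives acceleration and
% eigenvalue-gap arguments.
```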
@ShamKakade6
Sham Kakade
5 years
Very cool! Years ago (a little post AlexNet) I put in (too?) many cycles trying to design such a kernel. Didn't match their performance (though didn't have much compute back then). Pretty slick that they use a derived kernel from a ConvNet!
@RuosongW
Ruosong Wang
5 years
We have released code for computing Convolutional Neural Tangent Kernel (CNTK) used in our paper "On Exact Computation with an Infinitely Wide Neural Net", which will appear in NeurIPS 2019. Paper: Code:
1
46
209
2
11
67
@ShamKakade6
Sham Kakade
4 years
Great to see some theory on self-supervised learning. Looking forward to reading this one!
@jasondeanlee
Jason Lee
4 years
Predicting What You Already Know Helps: Provable Self-Supervised Learning We analyze how predicting parts of the input from other parts (missing patch, missing word, etc.) helps to learn a representation that linearly separates the downstream task. 1/2
Tweet media one
2
105
526
1
3
64
@ShamKakade6
Sham Kakade
4 years
2/ In a seminal paper, "Minimax vs Bayes prediction" ('56) , Blackwell shows we can predict well on any sequence, using randomization: "we can improve matters by allowing randomized predictions." These ideas permeate so much of learning theory today.
1
7
59
@ShamKakade6
Sham Kakade
5 years
Wow! This is amazing. A few years ago RL was starting to be applied in the robotics domain, with many doubters. Fast forward a handful of years and this! 👏
@OpenAI
OpenAI
5 years
We've trained an AI system to solve the Rubik's Cube with a human-like robot hand. This is an unprecedented level of dexterity for a robot, and is hard even for humans to do. The system trains in an imperfect simulation and quickly adapts to reality:
243
4K
11K
1
3
59
@ShamKakade6
Sham Kakade
4 years
Bellairs. Theory of DL. Day 4, penultimate session from the unstoppable Boaz Barak. Average case complexity, computational limits, and relevance to DL. Front row: Yoshua Bengio, Jean Ponce, @ylecun
Tweet media one
2
1
53
@ShamKakade6
Sham Kakade
10 months
Dean Foster, @dhruvmadeka , and I have been excited about the application of AI to education! We collected our thoughts here - and we're curious what people think:
0
13
53
@ShamKakade6
Sham Kakade
3 years
Solid move from DeepMind. Knowing Emo Todorov and his work, frankly surprised this one-man-show is only being purchased now. Hope Emo still stays in the driver's seat for MuJoCo going forward!
@GoogleDeepMind
Google DeepMind
3 years
We’ve acquired the MuJoCo physics simulator () and are making it free for all, to support research everywhere. MuJoCo is a fast, powerful, easy-to-use, and soon to be open-source simulation tool, designed for robotics research:
85
2K
6K
1
2
51
@ShamKakade6
Sham Kakade
3 years
🇺🇸
1
2
52
@ShamKakade6
Sham Kakade
5 months
we are growing! please apply to join a vibrant community and please spread the word.
@KempnerInst
Kempner Institute at Harvard University
5 months
The #KempnerInstitute is hiring scientists, researchers, and engineers to join our growing community! Check out our openings and apply today: #scienceforsocialgood #openscience @Harvard @ChanZuckerberg @ShamKakade6 @blsabatini
Tweet media one
0
9
24
0
8
49
@ShamKakade6
Sham Kakade
4 years
Just got back from MSR Montreal and had a great visit! Lots of cool projects going on there in RL/NLP/unsupervised learning. Thanks to @APTrizzle @momusbah @Drewch @JessMastronardi @philip_bachman for hosting me!
0
0
47
@ShamKakade6
Sham Kakade
3 years
Amazing and congrats! I have def been wondering if the inductive biases in DeepNets and in ML methods are well suited for certain scientific domains. This settles that for structure prediction! Hoping this can eventually help with drug discovery.
@GoogleDeepMind
Google DeepMind
3 years
In a major scientific breakthrough, the latest version of #AlphaFold has been recognised as a solution to one of biology's grand challenges - the “protein folding problem”. It was validated today at #CASP14 , the biennial Critical Assessment of protein Structure Prediction (1/3)
135
3K
10K
1
3
45
@ShamKakade6
Sham Kakade
5 months
I found this to be very informative for LLM training. the science was just super well done. highly recommended for anyone training transformer based LLMs.
@Mitchnw
Mitchell Wortsman
7 months
Sharing some highlights from our work on small-scale proxies for large-scale Transformer training instabilities: With fantastic collaborators @peterjliu , @Locchiu , @_katieeverett , many others (see final tweet!), @hoonkp , @jmgilmer , @skornblith ! (1/15)
Tweet media one
5
63
348
0
4
41
@ShamKakade6
Sham Kakade
4 years
Bellairs Research Institute☀️🏖️.Theory of DL. Day 3: new insights from @roydanroy @KDziugaite on PAC-Bayes for DL. possibly gives a new lens into implicit reg 🤔 @david_rolnick cool results on expressivity of deep nets. And T. Lillicrap keeps us real on theory vs. practice!
1
0
41
@ShamKakade6
Sham Kakade
4 years
Great to see that Mike Jordan is thinking about Decision Theory, ML, and Econ! Super important area: lots of stats/algorithmic questions that have immediate impact on practice. Few other areas where one can say the same!
@acmeducation
ACM Education & Learning Center
4 years
Mar 25 #ACMTechTalk "The Decision-Making Side of Machine Learning: Computational, Inferential, and Economic Perspective," w/Michael I. Jordan. @JeffDean @smolix @etzioni @erichorvitz @lexfridman @DaphneKoller @pabbeel @ShamKakade6 @suchisaria @aleks_madry
Tweet media one
0
14
38
0
2
39
@ShamKakade6
Sham Kakade
4 years
At the Bellairs Research Institute: Theory of Deep Learning workshop. Day 1: great presentations on implicit regularization from @prfsanjeevarora @suriyagnskr . Day 2: lucid explanations of NTKs from @Hoooway @jasondeanlee . Good friends, sun ☀️, and sand 🏖️ a bonus.
0
3
37
@ShamKakade6
Sham Kakade
4 years
John was a reason I moved to AI and neuroscience from physics. In his first class, he compared the human pattern-matching algo for chess playing to DeepBlue's brute force lookahead. I wondered if Go would be mastered in my lifetime! Wonderful to hear from John Hopfield again!
@lexfridman
Lex Fridman
4 years
Here's my conversation with John Hopfield. Hopfield networks were one of the early ideas that catalyzed the development of deep learning. His truly original work has explored the messy world of biology through the piercing eyes of a physicist.
Tweet media one
11
36
226
1
2
38
@ShamKakade6
Sham Kakade
5 years
revised thoughts on Neural Tangent Kernels (after understanding the regime better. h/t @SimonShaoleiDu ): def a super cool idea for designing a kernel! It does not look to be helpful for our understanding of how representations arise in deep learning. Much more needed here!
2
3
36
@ShamKakade6
Sham Kakade
1 year
Very excited about this new work with @vyasnikhil96 and @boazbaraktcs ! Provable copyright protection for generative models? See:
@boazbaraktcs
Boaz Barak
1 year
1/5 In new paper with @vyasnikhil96 and @ShamKakade6 we give a way to certify that a generative model does not infringe on the copyright of data that was in its training set. See for blog, but TL;DR is...
7
52
248
1
2
36
@ShamKakade6
Sham Kakade
5 years
Nice talk from Rong Ge! learning two layer neural nets, with _finite_ width: A seriously awesome algebraic idea. Reminiscent of FOOBI (the coolest spectral algo in town!): they replace the 'rank-1-detector' in FOOBI with a 'one-neuron-detector'.
1
6
35
@ShamKakade6
Sham Kakade
4 years
Bellairs Research Institute ☀️⛱️. Theory of DL workshop, Day 2 (eve): Thanks to Yann LeCun and Yoshua Bengio for thought provoking talks. @ylecun title: "Questions from the 80s and 90s". Good questions indeed!!
1
2
33
@ShamKakade6
Sham Kakade
4 years
Bellairs. Day 5 @HazanPrinceton and myself: double feature on controls+RL. +spotlights: @maithra_raghu : meta-learning as rapid feature learning. Raman Arora: dropout, capacity control, and matrix sensing . @HanieSedghi : module criticality and generalization! And that is a wrap!🙂
0
3
32
@ShamKakade6
Sham Kakade
6 months
Exciting! New RL Conference. Thanks to @yayitsamyzhang and others for their leadership!
@yayitsamyzhang
Amy Zhang
6 months
Thrilled to announce the first annual Reinforcement Learning Conference @RL_Conference , which will be held at UMass Amherst August 9-12! RLC is the first strongly peer-reviewed RL venue with proceedings, and our call for papers is now available: .
Tweet media one
5
61
421
0
0
30
@ShamKakade6
Sham Kakade
5 months
I found this quite thought provoking!
@BingbinL
Bingbin Liu
5 months
🧵What’s the simplest failure mode of Transformers? Our #NeurIPS2023 spotlight paper identifies the “attention glitches” phenomenon, where Transformers intermittently fail to capture robust reasoning, due to undesirable architectural inductive biases. Poster: Wed 5-7pm CST, #528
Tweet media one
4
31
225
0
2
30
@ShamKakade6
Sham Kakade
3 years
Congrats! A beautiful book indeed!
@amermathsoc
American Mathematical Society
3 years
Noga Alon @princeton and Joel Spencer @nyuniversity receive the 2021 Steele Prize for Mathematical Exposition for The Probabilistic Method @WileyGlobal . Now in its 4th ed, the text is invaluable for both the beginner and the experienced researcher. More...
Tweet media one
0
16
104
1
2
29
@ShamKakade6
Sham Kakade
4 years
This should be good! @SurbhiGoel_ has done some exciting work in understanding neural nets, going beyond the "linear" NTK barrier. To make progress in deep learning theory, we def need to understand these beasts in the non-linear regime.
@boazbaraktcs
Boaz Barak
4 years
Looking forward to this Friday at 1pm when we'll hear from @SurbhiGoel_ about the computational complexity of learning neural networks over gaussian marginals. We'll see some average-case hardness results as well as a poly-time algorithm for approximately learning ReLUs
0
1
12
0
4
29
@ShamKakade6
Sham Kakade
6 months
A downright classic. And let us take a moment to ponder on the most didactic figure of all time, seen in this beautiful paper: :)
@michael_nielsen
Michael Nielsen
6 months
This is amazing, and very beautiful:
Tweet media one
16
85
806
0
1
28
@ShamKakade6
Sham Kakade
1 year
Excited to be a part of this, with Aarti Singh who is spearheading the CMU effort!
@NSF
U.S. National Science Foundation
1 year
📢 Announcing seven new National Artificial Intelligence Research Institutes! Discover the themes and the institutions that are helping advance foundational AI research to address national economic and societal priorities in the 🧵 ⬇️:
Tweet media one
3
39
95
1
2
27
@ShamKakade6
Sham Kakade
3 years
A huge congrats to MPI for hiring the terrific @mrtz as a director! Personally sad to have him across the pond, but excited to see what Moritz helps to build.
0
0
25
@ShamKakade6
Sham Kakade
4 years
I feel like I should take the class after reading this 😂
@boazbaraktcs
Boaz Barak
4 years
GPT-3 on why Harvard students should take CS 182 this fall (bold text is prompt)
Tweet media one
3
5
62
0
0
24
@ShamKakade6
Sham Kakade
4 years
1/3 Can open democracies fight pandemics? making a PACT to set forth transparent privacy and anonymity standards, which permit adoption of mobile tracing efforts while upholding civil liberties.
Tweet media one
1
4
22
@ShamKakade6
Sham Kakade
3 years
Very cool! Getting RL to work with real sample size constraints is critical. Interesting to see how it was done here. Also, looks like the application with Loon is for social good! 👏
@marcgbellemare
Marc G. Bellemare
3 years
Our most recent work is out in Nature! We're reporting on (reinforcement) learning to navigate Loon stratospheric balloons and minimizing the sim2real gap. Results from a 39-day Pacific Ocean experiment show RL keeps its strong lead in real conditions.
23
108
768
1
1
24
@ShamKakade6
Sham Kakade
4 years
A nice note. Some cool tricks in these Bhatia matrix analyses books. If I understand correctly, Russo-Dye Thm lets u (exactly) compute the largest learning rate with a maximization problem using only vectors rather than matrices (still hitting it on the 4th-moment data tensor).
@yaroslavvb
Yaroslav Bulatov
4 years
What's the largest learning rate for which SGD converges? In deterministic case with Hessian H it is 2/||H||, from basic linear algebra. For SGD, an equivalent rate is 2/Tr(H), derivation from Russo-Dye theorem:
3
19
159
2
1
23
@ShamKakade6
Sham Kakade
4 years
Congratulations to the new Sloan Research Fellows!
0
0
23
@ShamKakade6
Sham Kakade
6 months
Super cool result on the impossibility of watermarking! (+ Kempner's new blog: Deeper Learning)
@KempnerInst
Kempner Institute at Harvard University
6 months
Deeper Learning, our new #KempnerInstitute blog is live! Check it out:  In our first post, Ben Edelman, @_hanlin_zhang_ & @boazbaraktcs show that robust #watermarking in #AI is impossible under natural assumptions. Read more:
Tweet media one
0
9
25
1
2
21
@ShamKakade6
Sham Kakade
4 years
3/ Due to Dean Foster, my own education of online learning and sequential prediction started through first understanding Blackwell's approachability, which is a wonderful way to grasp the foundations. I signed this:
1
3
22
@ShamKakade6
Sham Kakade
6 months
Theory extends to general finite groups (e.g. @bilalchughtai_ et al.). Many open questions. See paper: and blog post:
0
1
21
@ShamKakade6
Sham Kakade
3 months
It's a really great codebase and excited for future collabs with @allen_ai !
@KempnerInst
Kempner Institute at Harvard University
3 months
So excited to collaborate with @allen_ai and its partners @databricks @AMD @LUMIhpc on this groundbreaking work. Special thanks to @KempnerInst ’s co-director @ShamKakade6 and engineering lead @maxshadx !
0
3
18
0
1
22
@ShamKakade6
Sham Kakade
2 months
Big congrats to the 2024 Sloan Research Fellows!
@SloanFoundation
Sloan Foundation
2 months
We have today announced the names of the 2024 Sloan Research Fellows! Congratulations to these 126 outstanding early-career researchers:
Tweet media one
6
40
246
0
0
21
@ShamKakade6
Sham Kakade
3 years
Thank you @SusanMurphylab1 ! I am thrilled that I can finally be your colleague ❤️
@SusanMurphylab1
Susan Murphy lab
3 years
I’m thrilled that @ShamKakade6 is joining the Harvard SEAS CS faculty!!! Welcome, Sham!
0
0
48
0
1
20
@ShamKakade6
Sham Kakade
4 years
2/ This was the COLT 2018 open problem from @nanjiang_cs and Alekh, who conjectured a poly(H) lower bound. New work refutes this, showing only logarithmic in H episodes are needed to learn. So, in a minimax sense, long horizons are not more difficult than short ones!
1
0
19
@ShamKakade6
Sham Kakade
5 years
Excited to share this new work:
@chelseabfinn
Chelsea Finn
5 years
It's hard to scale meta-learning to long inner optimizations. We introduce iMAML, which meta-learns *without* differentiating through the inner optimization path using implicit differentiation. to appear @NeurIPSConf w/ @aravindr93 @ShamKakade6 @svlevine
Tweet media one
8
121
537
0
1
19
@ShamKakade6
Sham Kakade
5 years
huh... so this is pretty wild. it is _formally_ equivalent to the Polyak heavy ball momentum algorithm (with weight decay). not just 'similar behavior'.
@prfsanjeevarora
Sanjeev Arora
5 years
Conventional wisdom: slowly decay learning rate (lr) when training deep nets. Empirically, some exotic lr schedules also work, eg cosine. New work with Zhiyuan Li: exponentially increasing lr works too! Experiments + surprising math explanation. See
15
137
555
1
4
17
@ShamKakade6
Sham Kakade
3 months
Please spread the word!
@KempnerInst
Kempner Institute at Harvard University
3 months
Our post-bac fellowship application deadline is fast approaching! Read more about this program and apply today: #KempnerInstitute @EllaBatty @grez72 @ShamKakade6 @blsabatini @HarvardGSAS
Tweet media one
1
7
12
0
11
17
@ShamKakade6
Sham Kakade
5 months
this work did change my world view of resource tradeoffs: how more compute makes up for less data. the frontier plots were quite compelling! check out the poster for more info!
@EdelmanBen
Ben Edelman
5 months
Will deep learning improve with more data, a larger model, or training for longer? "Any balanced combination of them" <– in our #NeurIPS2023 spotlight, we reveal this through the lens of gradient-based feature learning in the presence of computational-statistical gaps. 1/5
Tweet media one
1
4
25
1
1
16
@ShamKakade6
Sham Kakade
6 months
Turns out margin maximization, yes just margin maximization, implies this emergence. Some cool new mathematical techniques let us precisely derive the max margin… (yup, that observed margin of 1/(105√426) is indeed what we predict).
Tweet media one
1
0
16
@ShamKakade6
Sham Kakade
5 years
Looking forward to reading this one! The original Bousquet and Elisseeff work was way ahead of its time! Epic in retrospect.
@vitalyFM
Vitaly 🇺🇦 Feldman
5 years
Indeed, really neat simplification of the bounds!
0
0
15
0
0
16
@ShamKakade6
Sham Kakade
5 years
A nice point: better features not better classifiers are key. This is more generally an important point related to distribution shift: (also comes up in RL, related to our "is a good representation sufficient" paper).
@aleks_madry
Aleksander Madry
5 years
Video summaries for our papers "Adversarial Examples Aren't Bugs They're Features" () and "Image Synthesis with a Single Robust Classifier" () are now online. Enjoy! ( @andrew_ilyas @tsiprasd @ShibaniSan @logan_engstrom Brandon Tran)
2
22
99
0
3
16
@ShamKakade6
Sham Kakade
5 months
interested in elastic ML? check out our new blog post. this should help serving foundation models on more devices and in more settings.
@KempnerInst
Kempner Institute at Harvard University
5 months
In our latest Deeper Learning blog post, the authors introduce an algorithmic method to elastically deploy large models, the #MatFormer . Read more: #KempnerInstitute @adityakusupati @snehaark @Devvrit_Khatri @Tim_Dettmers
Tweet media one
0
12
21
0
3
14
@ShamKakade6
Sham Kakade
5 years
cool stuff from @TheGregYang : Tensors, Neural Nets, GPs, and kernels! looks like we can derive a corresponding kernel/GP in a fairly general sense. very curious on broader empirical comparisons to neural nets, which (potentially) draw strength from the non-linear regime!
@TheGregYang
Greg Yang
5 years
1/ I can't teach you how to dougie but I can teach you how to compute the Gaussian Process corresponding to infinite-width neural network of ANY architecture, feedforward or recurrent, eg: resnet, GRU, transformers, etc ... RT plz💪
Tweet media one
4
109
373
1
1
15
@ShamKakade6
Sham Kakade
6 months
a really excellent result. very intuitive! (also, Kempner's new blog: Deeper Learning)
@boazbaraktcs
Boaz Barak
6 months
1/5 New preprint w @_hanlin_zhang_ , Edelman, Francanti, Venturi & Ateniese! We prove mathematically & demonstrate empirically impossibility for strong watermarking of generative AI models. What's strong watermarking? What assumptions? See blog and 🧵
5
44
255
0
2
15
@ShamKakade6
Sham Kakade
3 years
Congrats and well deserved!! 👏👏 It’s inspiring to have @madsjw as a leader in our community.
@madsjw
Stephen Wright
3 years
My Khachiyan prize talk at INFORMS yesterday was the victim of technical difficulties, so here is the script:
15
18
173
0
0
13
@ShamKakade6
Sham Kakade
3 months
Retweet after me...
@EranMalach
Eran Malach
3 months
Our recent work on the comparison between Transformers and State Space Models for sequence modeling now on arxiv! TLDR - we find a key disadvantage of SSMs compared to Transformers: they cannot copy from their input. 🧵 Arxiv: Blog:
2
53
240
1
1
14
@ShamKakade6
Sham Kakade
4 years
0
1
12
@ShamKakade6
Sham Kakade
5 years
It’s great to be a visitor here!
@HazanPrinceton
Elad Hazan
5 years
Very excited to see this finally announced & many thanks to @JeffDean , @GoogleAI and @Princeton for the ongoing support! + fresh from the oven, research from the lab:
0
5
56
2
0
12
@ShamKakade6
Sham Kakade
1 year
So excited to work with this amazing new cohort!
@boazbaraktcs
Boaz Barak
1 year
Kempner Institute announces the first cohort of research fellows starting this fall! Looking forward to learning from and collaborating with @brandfonbrener , @cogscikid , @_jennhu , @IlennaJ , @WangBinxu , @nsaphra , Eran Malach, and @t_andy_keller .
3
16
141
0
0
10
@ShamKakade6
Sham Kakade
5 years
nice talk! and an important direction to pursue; the older margin based ideas def need to be refined. so nice to see this here!
@tengyuma
Tengyu Ma
5 years
A new paper on improving the generalization of deep models (w.r.t clean or robust accuracy) by theory-inspired explicit regularizers.
0
84
421
0
0
10
@ShamKakade6
Sham Kakade
5 years
Also, the FOOBI (Fourth-Order-Only Blind Identification) paper: . A beautiful algo. And a catchy acronym too! Still surprises me how one can efficiently impose that rank one constraint! Worth a read.
0
1
9
@ShamKakade6
Sham Kakade
2 years
Congrats @timnitGebru for your efforts to provide a broader set of ideas to the community! Excited to see the work that comes out.
@DAIRInstitute
Distributed AI Research Institute is on Mastodon
2 years
We are @DAIRInstitute — an independent, community-rooted #AI research institute free from #BigTech 's pervasive influence. Founded by @timnitGebru .
26
307
1K
0
0
10
@ShamKakade6
Sham Kakade
4 years
Bellairs Research Institute ☀️🏖️. Theory of DL. Day 3: @nadavcohen : "Gen. and Opt. in DL via Trajectories". careful study of deep linear nets. second time hearing about it. now appreciate how this reveals quite different effects, relevant for DL! also, 🏊‍♂️🥥 🍨!
1
0
10
@ShamKakade6
Sham Kakade
4 years
2/3 This work shows that, under either assumption, all T*N1 samples can be used to achieve a precise notion of "few shot" learning. Also, worth pointing out nice work in Maurer et al. New work makes improvements under assumptions of good common rep!
1
1
8
@ShamKakade6
Sham Kakade
6 months
and the second post in Deeper Learning...
@KempnerInst
Kempner Institute at Harvard University
6 months
In our newest Deeper Learning #KempnerInstitute blog, authors @EdelmanBen , @depen_morwani , @costinoncescu , and @rosieyzh explain mechanistic interpretability results using known inductive biases. Read it here: @ShamKakade6 @Harvard #AI #machinelearning
Tweet media one
1
9
29
0
0
9
@ShamKakade6
Sham Kakade
5 years
nice post with some cool explanations!
@fhuszar
Ferenc Huszár
5 years
New post on iMAML: Meta Learning with Implicit Gradients some animations, discussing potential limitations and of course a Bayesian/variational interpretation
9
107
482
0
0
9
@ShamKakade6
Sham Kakade
4 years
3/3 And the seminal papers that started this line of thought: Dawid, "The prequential approach" (1984), and Foster, "Prediction in the worst case" (1991). They def influenced my thinking! stats meets philosophy. good stuff.
1
0
9
@ShamKakade6
Sham Kakade
5 years
@RogerGrosse @SimonShaoleiDu Right! An interesting hypothesis test for 'deep learning' could be to see if the learned network is better than using the derived locally linear kernel. The derived kernel itself is def pretty cool (e.g. CNTK).
0
0
9
@ShamKakade6
Sham Kakade
7 months
Also attended this powerful lecture by Loretta Lynch, the 1st Black woman to serve as US Attorney General. Grateful for all she has done. ❤️
@DrZedZha
Zed Zha, MD, FAAFP is writing
7 months
Attended the Dr. Martin Luther King Jr. commemorative lecture by Loretta Lynch, the 1st Black woman to serve as US Attorney General, introduced by Pres. Claudine Gay, the 1st Black president of Harvard University. The message was clear: Never Lose Infinite Hope. INFINITE HOPE. ❤️
Tweet media one
2
2
63
1
0
9
@ShamKakade6
Sham Kakade
3 years
Pinged Emo, and he'll have freedom to drive/add new things. So this looks like a win-win situation for the community...
0
0
9
@ShamKakade6
Sham Kakade
5 years
A nice read!
@SimonShaoleiDu
Simon Shaolei Du
5 years
Check out new blog post on deep learning theory: ultra-wide neural network and Neural Tangent Kernel.
0
13
45
0
0
9
@ShamKakade6
Sham Kakade
4 years
Nice notes!
@pabbeel
Pieter Abbeel
4 years
One of my favorites from most recent offering of CS287 Advanced Robotics? Exam study handout summarizing all the main math in ~20pp. Incl. MaxEnt RL, CEM, LQR, Penalty Method, RRTs, Particle Filters, Policy Gradient, TRPO, PPO, Q-learning, DDPG, SAC,
2
174
878
1
0
8
@ShamKakade6
Sham Kakade
10 months
Congratulations!! Excited to see what comes next!
@jefrankle
Jonathan Frankle
10 months
I'm absolutely thrilled that @MosaicML has agreed to join @databricks as we continue on our journey to make the latest advances in deep learning efficient and accessible for everyone. The best of MosaicML is yet to come 🎉🎉🎉
47
22
474
1
0
7
@ShamKakade6
Sham Kakade
5 years
@roydanroy @SimonShaoleiDu @RuosongW @lyang36 Nope. i'll be talking about recent work in policy gradient methods in RL and controls. for the following IAS workshop, I will! the rep. paper is pretty cool, in that it is still a bit puzzling to me!
1
0
7
@ShamKakade6
Sham Kakade
4 years
2/3 This motivated this line of work on Bayesian methods in the worst case: and
0
2
6
@ShamKakade6
Sham Kakade
4 years
Looking forward to reading this one! There isn't a compelling explanation for the unreasonable effectiveness of adam/adagrad (aside from the original convex regret bounds). So this looks quite promising!
@HazanPrinceton
Elad Hazan
4 years
Trying out yet another deep learning optimizer? Graft its learning rates to better understand its performance: w. @naman33k @_arohan_ Tomer Koren and Cyril Zhang
1
15
68
0
0
6
@ShamKakade6
Sham Kakade
4 years
2/2 Why double dip, you ask? Is it a theorists' concoction? Perhaps, yes. It is a compelling demo of how SGD (and GD) behave differently in the overparameterized regime; a great question to study in its own right! As an 'asymptotic', it may not be how practitioners roll...
0
1
6
@ShamKakade6
Sham Kakade
7 months
I ❤️ efficiency. A wonderful example.
@cremieuxrecueil
Crémieux
7 months
The adoption of cellphones by Keralan fishermen is, I believe, the most stunning example of the contribution of information technology to market performance. Take a look at this graph for background: in three different regions of Kerala, phones were adopted at different times.…
Tweet media one
Tweet media two
Tweet media three
95
1K
7K
0
0
5
@ShamKakade6
Sham Kakade
3 years
@ChrSzegedy Agreed. Transformers were the sauce for NMT, not unsupervised pretraining. ELMo was a different landmark: it showed the value of pretraining for numerous downstream tasks. Many researchers wondered why they didn't try it themselves (it wasn't about the architecture).
0
0
5
@ShamKakade6
Sham Kakade
4 years
This is for SGD for least squares. Super basic problem. If I understand correctly, the _exact_ problem dependent largest learning rate (above which divergence occurs) is at 1/lambda_max( E[xx']^-1 E[||x||^2 x x']). This is pretty clean (and no SDP needed to compute it).
1
0
5
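A plug-in numerical sketch of the expression above, taking the tweet's formula as given and estimating both expectations by sample averages; the data and variable names are illustrative, not from the thread.

```python
import numpy as np

# Plug-in estimate of eta_crit = 1 / lambda_max( E[xx']^{-1} E[||x||^2 xx'] )
# for SGD on least squares, compared with 2/||H|| and the 2/Tr(H) heuristic
# from the quoted tweet. Data here is a made-up anisotropic Gaussian design.
rng = np.random.default_rng(0)
n, d = 20000, 10
X = rng.standard_normal((n, d)) * rng.uniform(0.5, 2.0, size=d)

H = X.T @ X / n                                       # estimate of E[x x']
M = (X * (X**2).sum(axis=1)[:, None]).T @ X / n       # estimate of E[||x||^2 x x']

eta_crit = 1.0 / np.linalg.eigvals(np.linalg.solve(H, M)).real.max()
print("GD bound        2/||H||:", 2.0 / np.linalg.eigvalsh(H).max())
print("trace heuristic 2/Tr(H):", 2.0 / np.trace(H))
print("plug-in SGD eta_crit   :", eta_crit)
```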
@ShamKakade6
Sham Kakade
3 years
hmmm... well, it is true that a function that is a sum of degree 3 polynomials would still be a function that is a sum of degree 3 polynomials after any linear transformation of the input (but the # of terms in the sum might be much larger). this def would be neat to try!
@ChrSzegedy
Christian Szegedy
3 years
@ShamKakade6 Yes, early versions and relative position encoding use fixed spatial features. Some encoding of position (learned or engineered is clearly necessary otherwise, it is just a BOW) Rotating the sentence would be fun to try. IMO, unlikely to affect the performance significantly.
1
0
3
1
0
4
@ShamKakade6
Sham Kakade
4 years
Sweet. Def want to build up my intuition on Hermite polynomials! For Gaussians, I tend to think more in terms of Isserlis' theorem (aka Wick's theorem), often a more brute approach for dealing with higher moments.
@BachFrancis
Francis Bach
4 years
If you like Gaussian kernels and distributions, you will enjoy this month blog post on Hermite polynomials!
7
140
574
0
1
5
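For reference, the fourth-moment case of Isserlis' (Wick's) theorem mentioned above, stated for a zero-mean Gaussian vector (standard result, included only for convenience):

```latex
\mathbb{E}[X_1 X_2 X_3 X_4]
  = \Sigma_{12}\Sigma_{34} + \Sigma_{13}\Sigma_{24} + \Sigma_{14}\Sigma_{23},
\qquad \Sigma_{ij} = \mathbb{E}[X_i X_j].
% More generally, E[X_{i_1} \cdots X_{i_{2k}}] is the sum over all perfect
% pairings of products of covariances; odd moments vanish.
```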
@ShamKakade6
Sham Kakade
5 years
oh yeah. batch norm is assumed, which is clearly why things aren't exploding. But still, it is quite cool that they are formally equivalent; note their scheme only tracks one param, as opposed to two with momentum.
1
0
4
@ShamKakade6
Sham Kakade
4 years
@aravindr93 oh wow! thank you for the kind words :)
1
0
4
@ShamKakade6
Sham Kakade
5 years
@faoliehoek @SimonShaoleiDu @RuosongW @lyang36 No. The representation allows for near perfect approximation of *every* possible intermediate value function! Even this (everywhere) near perfect approximation has massive error amplification. Subtly, I'd say a good representation has to capture dynamics info to avoid this.
1
0
4
@ShamKakade6
Sham Kakade
3 months
It was a great talk with an engaged audience!
@KempnerInst
Kempner Institute at Harvard University
3 months
Thanks to @OpenAI ’s Noam Brown for joining the @KempnerInst Seminar Series to discuss CICERO, the first #AI agent to achieve human-level performance in the strategy game Diplomacy. Next in the series : Rajesh Rao on Feb 16. @polynoamial @ShamKakade6 @blsabatini @boazbaraktcs
Tweet media one
1
5
26
0
0
5
@ShamKakade6
Sham Kakade
1 year
Only 2D? Amateur. Still has nothing on @adamfungi .
@adamfungi
Adam Tauman Kalai
1 year
Results of GPT-4's attempts at "What do you get when you cross X with Y" jokes.
Tweet media one
1
0
10
0
0
4
@ShamKakade6
Sham Kakade
4 years
@bremen79 Bigger picture here: "full" Bayes averaging is quite powerful (due to mixability). That it is robust is often not reflected in the Bayesian viewpoint. Similarly, PAC-Bayes also demonstrates this power of averaging ( @RogerGrosse @roydanroy ). Also, yes, Bayes needs smooth losses.
1
0
4