Sadhika Malladi Profile
Sadhika Malladi

@SadhikaMalladi

851 Followers · 121 Following · 6 Media · 103 Statuses

CS PhD student at Princeton

Joined June 2022
@SadhikaMalladi
Sadhika Malladi
1 year
Introducing MeZO - a memory-efficient zeroth-order optimizer that can fine-tune large language models using only forward passes while remaining performant. MeZO can train a 30B model on 1x 80GB A100 GPU. Paper: Code:
9
93
456
@SadhikaMalladi
Sadhika Malladi
5 months
Blog post about how to scale training runs to highly distributed settings (i.e., large batch sizes)! Empirical insights from my long-ago work on stochastic differential equations (SDEs). Written to be accessible - give it a shot!
8
77
394
@SadhikaMalladi
Sadhika Malladi
2 months
Dataset choice is crucial in today's ML training pipeline. We ( @xiamengzhou and I) introduce desiderata for "good" data and explain how our recent algorithm, LESS, fits into the picture. Huge review of data selection algs for pre-training and fine-tuning!
2
53
202
@SadhikaMalladi
Sadhika Malladi
5 months
Announcing the 2nd Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo) at ICLR 2024! Improving our understanding helps us advance capabilities and build safer, more aligned models. Paper deadline is Feb 3! Website:
0
15
107
@SadhikaMalladi
Sadhika Malladi
10 months
I'll be presenting MeZO at the ES-FoMo workshop at #ICML2023. My talk is at 10:40am, and the poster session is at 1pm on Saturday in Ballroom A. Hope to see you there!
@SadhikaMalladi
Sadhika Malladi
1 year
Introducing MeZO - a memory-efficient zeroth-order optimizer that can fine-tune large language models using only forward passes while remaining performant. MeZO can train a 30B model on 1x 80GB A100 GPU. Paper: Code:
9
93
456
1
7
34
@SadhikaMalladi
Sadhika Malladi
6 months
We are at #NeurIPS2023! We will present this work as an oral on Wed at 4:15pm and as a poster on Wed from 5-7pm. Many of the authors are here -- stop by to chat with us!
@SadhikaMalladi
Sadhika Malladi
1 year
Introducing MeZO - a memory-efficient zeroth-order optimizer that can fine-tune large language models using only forward passes while remaining performant. MeZO can train a 30B model on 1x 80GB A100 GPU. Paper: Code:
9
93
456
0
7
32
@SadhikaMalladi
Sadhika Malladi
3 months
We are really excited to host @aleks_madry from @OpenAI at the PASS seminar on 3/26, 2pm ET! Submit your questions about the Preparedness team: , and join our mailing list to receive notifications about talks:
@PrincetonPLI
Princeton PLI
3 months
PASS seminar on 3/26, 2pm ET!
Speaker: Aleksander Madry @aleks_madry from @OpenAI
Topic: AI Preparedness
Live:
Submit questions:
Recordings later at:
1
2
12
0
2
27
@SadhikaMalladi
Sadhika Malladi
4 months
Excited to share our work on data selection for instruction tuning! Bootstrap from a few available examples to identify the useful training data in a huge pool. Interesting optimization observation along the way: shorter instructions induce massive gradient norms.
@xiamengzhou
Mengzhou Xia
4 months
Lots of instruction tuning data out there...but how to best adapt LLMs for specific queries? Don’t use ALL of the data, use LESS! 5% beats the full dataset. Can even use one small model to select data for others! Paper: Code: [1/n]
13
98
435
0
0
27
@SadhikaMalladi
Sadhika Malladi
11 months
Our new paper shows that not-so-big transformers can simulate + train an internal, not-so-small transformer over the course of a single inference pass!
@Abhishek_034
Abhishek Panigrahi
11 months
**New paper** In-context learning has been explained as simulating + training simple models at inference. We show a 2B model can run GD on an internal 125M model. Surprising simulation + AI safety implications! 1/5 w/ @SadhikaMalladi, @xiamengzhou, @prfsanjeevarora
2
49
242
0
2
19
@SadhikaMalladi
Sadhika Malladi
2 years
Enlarging batch size B speeds up distributed training, but how should we set the LR? For SGD, the famous Linear Scaling Rule suggests scaling LR linearly with B. For RMSprop/Adam, our new #NeurIPS2022 paper justifies scaling LR ~ sqrt(B) through formal SDE approximations. [1/2]
1
2
17
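As a concrete reading of the rules in the tweet above, the sketch below scales a tuned learning rate when the batch size changes: linearly for SGD and with the square root of the batch-size ratio for RMSprop/Adam. The base learning rate and batch sizes are made-up numbers for illustration, not values from the paper.

```python
# Illustration of the batch-size scaling rules described above: linear scaling for
# SGD, square-root scaling for RMSprop/Adam. Base values are made-up examples.
def scale_lr(base_lr, base_batch, new_batch, optimizer="adam"):
    ratio = new_batch / base_batch
    if optimizer == "sgd":
        return base_lr * ratio            # Linear Scaling Rule
    elif optimizer in ("adam", "rmsprop"):
        return base_lr * ratio ** 0.5     # square-root scaling from the SDE analysis
    raise ValueError(f"unknown optimizer: {optimizer}")

# e.g., tuned at batch size 256 with LR 3e-4, scaling up to batch size 4096:
print(scale_lr(3e-4, 256, 4096, "adam"))  # 0.0012
```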
@SadhikaMalladi
Sadhika Malladi
1 year
Couldn't make it to ICLR 2023 but check out my talk in the ME-FoMo workshop today!
@SadhikaMalladi
Sadhika Malladi
2 years
Why can we fine-tune (FT) huge LMs on a few data points without overfitting? We show with theory + exps that FT can be described by kernel dynamics. Joint work with @_awettig, @dingli_yu, @danqi_chen, @prfsanjeevarora. [1/8]
2
11
46
0
3
16
@SadhikaMalladi
Sadhika Malladi
4 months
@ArmenAgha Yup, this is exactly what my blog post covers! The SDE gets more complicated with Adam :)
0
3
16
@SadhikaMalladi
Sadhika Malladi
4 months
Great to see our work on tuning LLMs with forward passes (MeZO: ) extended with in-depth benchmarking on more models, tasks, and settings! And a tutorial at AAAI 24 on using ZO methods to tune large models:
@TianlongChen4
Tianlong Chen
4 months
💭 Dreaming of tuning LLMs with inference-only memory? 🤔 🌄 Check out our ZO-LLM Benchmark, revisiting ZO for LLM tuning, across 5 LLM families & 3 task complexities & 4 tuning schemes ➡️ Unveiling overlooked principles & 3 novel enhancements. 🔗
2
15
55
0
1
15
@SadhikaMalladi
Sadhika Malladi
3 months
Can RNNs replace transformers? Hot topic of debate over the past week with several competitive RNN-style models. Useful theory for grounding your intuitions about the benefits of chain-of-thought and the expressivity of different architectures!
@vfleaking
Kaifeng Lyu
3 months
Check out our new paper! We explore the representation gap between RNNs and Transformers. Theory: CoT improves RNNs but is insufficient to close the gap. Improving the capability of retrieving information from context is the key (e.g. +RAG / +1 attention).
1
5
44
2
0
14
@SadhikaMalladi
Sadhika Malladi
2 months
Super excited to see this work out! Happy to have contributed a small part to thinking about the optimization dynamics at this scale :)
@violet_zct
Chunting Zhou
2 months
How to enjoy the best of both worlds of efficient training (less communication and computation) and inference (constant KV-cache)? We introduce a new efficient architecture for long-context modeling – Megalodon that supports unlimited context length. In a controlled head-to-head
4
51
225
1
0
11
@SadhikaMalladi
Sadhika Malladi
8 months
Excited to give this talk on Thursday! Thanks for inviting me :)
@SydMathInst
SydMathInst
8 months
Announcing the next "Mathematical challenges in AI" seminar by computer scientist @SadhikaMalladi of #PrincetonU @Princeton, this Thursday 28 September at 8:00 AEST. Join us online or in-person, all info on the course webpage:
0
1
12
0
0
11
@SadhikaMalladi
Sadhika Malladi
3 months
Interesting result on how transformers can learn fairly arbitrary causal mechanisms in the data through GD. Tricky analysis!
@EshaanNichani
Eshaan Nichani
3 months
Causal self-attention encodes causal structure between tokens (e.g., induction head, learning function class in-context, n-grams). But how do transformers learn this causal structure via gradient descent? New paper with @alex_damian_ @jasondeanlee! (1/10)
6
93
415
0
0
9
@SadhikaMalladi
Sadhika Malladi
5 months
Thanks to @gaotianyu1350, @SurbhiGoel_, @BingbinL, @vfleaking, @abhi_venigalla, @xiamengzhou, and @HowardYen1 for their help. And especially thanks to @zhiyuanli_ for patiently introducing me to this topic when I first started grad school :)
0
1
9
@SadhikaMalladi
Sadhika Malladi
3 months
Thrilled to announce that we are starting a virtual seminar on alignment and safety! I’m excited to bring these discussions with experts to a broad audience. Sign up to receive email updates and stay tuned for how to submit questions for our speakers :)
@PrincetonPLI
Princeton PLI
3 months
Announcing Princeton AI Alignment and Safety Seminar (PASS): A virtual & collaborative space for diverse researchers to learn & discuss aligning increasingly capable AI models for safe behavior. Join our mailing list for updates:
0
19
60
0
0
9
@SadhikaMalladi
Sadhika Malladi
11 months
I'll be presenting our kernel-based view of fine-tuning at #ICML23 in the first poster session! 11am, Exhibit Hall 1, #442 :)
@SadhikaMalladi
Sadhika Malladi
2 years
Why can we fine-tune (FT) huge LMs on a few data points without overfitting? We show with theory + exps that FT can be described by kernel dynamics. Joint work with @_awettig, @dingli_yu, @danqi_chen, @prfsanjeevarora. [1/8]
2
11
46
0
1
8
@SadhikaMalladi
Sadhika Malladi
4 months
@giffmana I had the same Q :) GLUE-style FT seems to behave more like small-LR, full-batch GD (), so probably not enough gradient noise for SDEs/scaling rules. Instruction tuning is harder to say...my guess is it's close to SDE regime, but I have no exps on it. [1/2]
1
0
8
@SadhikaMalladi
Sadhika Malladi
1 year
MeZO works with parameter-efficient methods (LoRA and prefix tuning) and can optimize non-differentiable objectives (directly maximizing accuracy and F1 score).
1
1
7
@SadhikaMalladi
Sadhika Malladi
1 year
MeZO adapts classical zeroth-order methods to estimate the gradient in-place with two forward passes. Our theory shows how adequate pre-training allows it to avoid the classically catastrophic zeroth-order slowdown (usually proportional to the number of parameters!).
1
1
7
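For concreteness, here is a minimal sketch of the classical two-point zeroth-order (SPSA-style) gradient estimate that the tweet above describes: perturb the parameters along a random direction, run two forward passes, and project the loss difference back onto that direction. The toy loss, parameters, perturbation scale, and step size are illustrative placeholders, not the actual MeZO implementation.

```python
# Minimal sketch of a two-forward-pass zeroth-order gradient estimate (SPSA-style).
# `loss_fn`, `theta`, and `eps` are illustrative placeholders.
import numpy as np

def zo_gradient_estimate(loss_fn, theta, eps=1e-3, seed=0):
    """Estimate grad(loss_fn)(theta) from two forward passes.

    Fixing the seed lets the random direction be regenerated rather than stored,
    which is what makes this kind of estimator memory-efficient and in-place.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)      # random perturbation direction
    loss_plus = loss_fn(theta + eps * z)      # forward pass 1
    loss_minus = loss_fn(theta - eps * z)     # forward pass 2
    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    return projected_grad * z                 # gradient estimate along z

# Toy usage: one zeroth-order SGD step on a quadratic loss.
theta = np.array([1.0, -2.0, 0.5])
loss = lambda w: float(np.sum(w ** 2))
theta -= 0.1 * zo_gradient_estimate(loss, theta)
```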
@SadhikaMalladi
Sadhika Malladi
1 year
Cool follow up work to MeZO!
@ericzelikman
Eric Zelikman
1 year
Decentralized LLM fine-tuning on the normal internet needs a ton of bandwidth to send model updates (e.g. terabytes per several gradients!). Even with LoRA, this scales pretty badly with many devices. Can you get similar performance w/ just one byte per gradient? (Maybe!)
2
32
148
0
0
7
@SadhikaMalladi
Sadhika Malladi
1 year
Extensive exps on masked and autoregressive models show that MeZO outperforms memory-equivalent methods (e.g., ICL) and often performs comparably to FT.
1
1
7
@SadhikaMalladi
Sadhika Malladi
2 years
Joint work with @vfleaking, @Abhishek_034, @prfsanjeevarora. Poster session: 11am-1pm Tuesday. Link: [2/2]
1
1
2
@SadhikaMalladi
Sadhika Malladi
4 months
@giffmana Hard to compute the (empirical) NTK in instruction tuning since it's a V-way clf task and each prefix is treated as its own input. Related exps suggest grad changes a lot early on and then not so much later (Appendix F: ). But not conclusive evidence.
1
0
5
@SadhikaMalladi
Sadhika Malladi
10 months
Cool workshop by some of my friends! Submit if you have something :)
@andrew_ilyas
Andrew Ilyas
10 months
What makes ML models tick? How do we attribute model behavior to the training data, algorithm, architecture, or scale used in training? Papers (or ideas) here? Submit to ATTRIB @ NeurIPS 2023 ()! Deadline is September 23!
1
20
127
0
0
5
@SadhikaMalladi
Sadhika Malladi
8 months
Excited to see this work on using existing LLMs to efficiently construct and train powerful smaller models!
@xiamengzhou
Mengzhou Xia
8 months
We release the strongest public 1.3B and 3B models so far – the ShearedLLaMA series. Structured pruning from a large model to a small one is far more cost-effective (only 3%!) than pre-training them from scratch! Check out our paper and models at: [1/n]
20
141
771
0
0
4
@SadhikaMalladi
Sadhika Malladi
2 years
Our new work!
@prfsanjeevarora
Sanjeev Arora
2 years
Fine tuned LLMs can solve many NLP tasks. A priori, fine-tuning a huge LM on a few datapoints could lead to catastrophic overfitting. So why doesn’t it? Our theory + experiments (on GLUE) reveal that fine-tuning is often well-approximated as simple kernel-based learning. 1/2
5
33
239
0
0
4
@SadhikaMalladi
Sadhika Malladi
1 year
Exciting times to be at Princeton :) come join us!
@prfsanjeevarora
Sanjeev Arora
1 year
Princeton has a new Center for Language and Intelligence, researching LLMs + large AI models, as well as their interdisciplinary applications. Looking for postdocs/research scientists/engineers; attractive conditions.
22
116
622
0
0
4
@SadhikaMalladi
Sadhika Malladi
4 months
@jon_barron Cool idea! I guess for the settings that we would mostly use (i.e., decayed beta between 0.8 and 0.9), both of our scaling rules mostly agree. I'm wondering if there's any theory behind your idea?
3
0
3
@SadhikaMalladi
Sadhika Malladi
4 months
@yaroslavvb Good question! So the scaling rule comes out of the "time-scaling" of the SDE. eta ~ B^{1-p}, so you could feasibly derive some SDEs for other values of p. Section 4.1 of our paper provides an illustration of how you could do so in a simple setting:
0
0
3
@SadhikaMalladi
Sadhika Malladi
9 months
Proud sister!
@chatt_md
Chatt Malladi, MD
9 months
[Photo]
0
1
16
0
0
3
@SadhikaMalladi
Sadhika Malladi
11 months
I'm at #ICML23, DM me if you want to meet up!
0
0
2
@SadhikaMalladi
Sadhika Malladi
10 months
Excited to see this work out! It contains multitudes :)
@TheGregYang
Greg Yang
10 months
1/ How to scale hyperparams (e.g., learning rate) as a neural network gets wider? Esp w/ adaptive optimizers like Adam? I derived the answer (μP) in 2020 & verified it on GPT3. This required some beautiful new math that's just been completely written down w/ @EtaiLittwin 🧵👇
11
63
301
0
0
2
@SadhikaMalladi
Sadhika Malladi
4 months
@ceksudo Thanks for reading! I am actually not working on diffusion models, even though they use the SDE as a backbone. The application is a bit different there. It's something I'm learning now though, and maybe I will write a post on it later!
1
0
2
@SadhikaMalladi
Sadhika Malladi
7 months
@Kangwook_Lee @yzeng58 @edwardjhu Great work! You may be interested in our complementary ICML 23 work (), which uses an NTK-style interpretation of FT to show that LoRA doesn't modify FT dynamics much if the hyperparameters are chosen well. It also has a similar conclusion about last-layer tuning!
1
0
1
@SadhikaMalladi
Sadhika Malladi
4 months
@giffmana @deepcohen @SamuelMLSmith @HKydlicek Thanks for looking at our work! The rule we prescribe assumes that you have found the best LR for a model+dataset and you only want to scale the batch size (i.e., accelerate training). But that best LR indeed depends on the model and data! More here:
2
0
2
@SadhikaMalladi
Sadhika Malladi
2 years
We design an empirically testable condition to formalize the intuition that a pre-trained model can already do pretty well on downstream tasks. We prove that this condition ensures prompt-based FT will exhibit kernel behavior. [7/8]
1
0
2
@SadhikaMalladi
Sadhika Malladi
4 months
@typedfemale @jon_barron One good way to get intuition is to look at the simplified setting in Section 4.1 of our paper (). If you want to discuss more, maybe you can shoot me an email so we are not limited by number of characters? :)
0
0
1
@SadhikaMalladi
Sadhika Malladi
2 months
@BingbinL @KempnerInst Kempner is lucky to have you :) excited for you!!! 🎉
0
0
1
@SadhikaMalladi
Sadhika Malladi
3 months
Effectively handling text inputs in a visual format (e.g., flowcharts) is crucial for many applications!
@gaotianyu1350
Tianyu Gao
3 months
New preprint "Improving Language Understanding from Screenshots" w/ @zwcolin @AdithyaNLP @danqi_chen. We improve language understanding abilities of screenshot LMs, an emerging family of models that processes everything (including text) via visual inputs.
6
43
189
0
0
1
@SadhikaMalladi
Sadhika Malladi
4 months
@samlakig Thanks for reading it :)
0
0
1
@SadhikaMalladi
Sadhika Malladi
4 months
@KordingLab @SurbhiGoel_ Thanks for your interest -- happy to chat via email! Note these exps are only GLUE-style fine-tuning, not instruction tuning. The latter is a V-way classification on every prefix of a context, so the kernel gets massive! Also, this paper has exps in vision FT:
0
0
1
@SadhikaMalladi
Sadhika Malladi
2 years
Prior work suggested using the neural tangent kernel (NTK) to study vision FT. But the NTK describes training randomly initialized infinite-width networks with gradient descent. Does FT in LMs exhibit kernel behavior? If so, why? [2/8]
1
1
1
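To make the kernel view in the tweet above concrete, here is a minimal sketch of an empirical NTK: the kernel entry for two inputs is the inner product of their parameter gradients evaluated at the (pre-trained) weights. The tiny linear model and random inputs are placeholders, not the prompt-based LM setup studied in the paper.

```python
# Minimal sketch of an empirical NTK: K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>
# at the current (pre-trained) parameters. Tiny model and random data are placeholders.
import torch

model = torch.nn.Linear(8, 1)        # stand-in for a pre-trained network
xs = torch.randn(4, 8)               # four example inputs

def param_grad(x):
    model.zero_grad()
    model(x.unsqueeze(0)).squeeze().backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

grads = torch.stack([param_grad(x) for x in xs])   # (n_examples, n_params)
empirical_ntk = grads @ grads.T                    # (n_examples, n_examples) kernel
print(empirical_ntk)
```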
@SadhikaMalladi
Sadhika Malladi
10 months
I'll also be presenting at the Differentiable Almost Everything workshop on Friday in meeting room 310. The poster sessions are 11:30-12:30pm and 4-5pm.
0
0
1
@SadhikaMalladi
Sadhika Malladi
4 months
@oharub Thank you!
0
0
1