Sadhika Malladi Profile
Sadhika Malladi

@SadhikaMalladi

851 Followers · 121 Following · 6 Media · 103 Statuses

CS PhD student at Princeton

Joined June 2022
@SadhikaMalladi
Sadhika Malladi
1 year
Introducing MeZO - a memory-efficient zeroth-order optimizer that can fine-tune large language models using only forward passes while remaining performant. MeZO can train a 30B model on 1x 80GB A100 GPU. Paper: Code:
9
93
456
@SadhikaMalladi
Sadhika Malladi
5 months
Blog post about how to scale training runs to highly distributed settings (i.e., large batch sizes)! Empirical insights from my long-ago work on stochastic differential equations (SDEs). Written to be accessible - give it a shot!
8
77
394
@SadhikaMalladi
Sadhika Malladi
2 months
Dataset choice is crucial in today's ML training pipeline. We ( @xiamengzhou and I) introduce desiderata for "good" data and explain how our recent algorithm, LESS, fits into the picture. Huge review of data selection algs for pre-training and fine-tuning!
2
53
202
@SadhikaMalladi
Sadhika Malladi
5 months
Announcing the 2nd Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo) at ICLR 2024! Improving our understanding helps us advance capabilities and build safer, more aligned models. Paper deadline is Feb 3! Website:
0
15
107
@SadhikaMalladi
Sadhika Malladi
10 months
I'll be presenting MeZO at the ES-FoMo workshop at #ICML2023. My talk is at 10:40am, and the poster session is at 1pm on Saturday in Ballroom A. Hope to see you there!
@SadhikaMalladi
Sadhika Malladi
1 year
Introducing MeZO - a memory-efficient zeroth-order optimizer that can fine-tune large language models using only forward passes while remaining performant. MeZO can train a 30B model on 1x 80GB A100 GPU. Paper: Code:
9
93
456
1
7
34
@SadhikaMalladi
Sadhika Malladi
6 months
We are at #NeurIPS2023! We will present this work as an oral on Wed at 4:15pm and as a poster on Wed from 5-7pm. Many of the authors are here -- stop by to chat with us!
@SadhikaMalladi
Sadhika Malladi
1 year
Introducing MeZO - a memory-efficient zeroth-order optimizer that can fine-tune large language models using only forward passes while remaining performant. MeZO can train a 30B model on 1x 80GB A100 GPU. Paper: Code:
9
93
456
0
7
32
@SadhikaMalladi
Sadhika Malladi
3 months
We are really excited to host @aleks_madry from @OpenAI at the PASS seminar on 3/26, 2pm ET! Submit your questions about the Preparedness team: , and join our mailing list to receive notifications about talks:
@PrincetonPLI
Princeton PLI
3 months
PASS seminar on 3/26, 2pm ET!
Speaker: Aleksander Madry @aleks_madry from @OpenAI
Topic: AI Preparedness
Live:
Submit questions:
Recordings later at:
1
2
12
0
2
27
@SadhikaMalladi
Sadhika Malladi
4 months
Excited to share our work on data selection for instruction tuning! Bootstrap from a few available examples to identify the useful training data in a huge pool. Interesting optimization observation along the way: shorter instructions induce massive gradient norms.
@xiamengzhou
Mengzhou Xia
4 months
Lots of instruction tuning data out there...but how to best adapt LLMs for specific queries? Don’t use ALL of the data, use LESS! 5% beats the full dataset. Can even use one small model to select data for others! Paper: Code: [1/n]
13
98
435
0
0
27
@SadhikaMalladi
Sadhika Malladi
11 months
Our new paper shows that not-so-big transformers can simulate + train an internal, not-so-small transformer over the course of a single inference pass!
@Abhishek_034
Abhishek Panigrahi
11 months
**New paper** In-context learning has been explained as simulating + training simple models at inference. We show a 2B model can run GD on an internal 125M model. Surprising simulation + AI safety implications! 1/5 w/ @SadhikaMalladi, @xiamengzhou, @prfsanjeevarora
2
49
242
0
2
19
@SadhikaMalladi
Sadhika Malladi
2 years
Enlarging batch size B speeds up distributed training, but how should we set the LR? For SGD, the famous Linear Scaling Rule suggests scaling LR linearly with B. For RMSprop/Adam, our new #NeurIPS2022 paper justifies scaling LR ~ sqrt(B) through formal SDE approximations. [1/2]
1
2
17
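As a concrete reading of the rules in the tweet above, the sketch below scales a tuned learning rate when the batch size changes: linearly for SGD and with the square root of the batch-size ratio for RMSprop/Adam. The base learning rate and batch sizes are made-up numbers for illustration, not values from the paper.

```python
# Illustration of the batch-size scaling rules described above: linear scaling for
# SGD, square-root scaling for RMSprop/Adam. Base values are made-up examples.
def scale_lr(base_lr, base_batch, new_batch, optimizer="adam"):
    ratio = new_batch / base_batch
    if optimizer == "sgd":
        return base_lr * ratio            # Linear Scaling Rule
    elif optimizer in ("adam", "rmsprop"):
        return base_lr * ratio ** 0.5     # square-root scaling from the SDE analysis
    raise ValueError(f"unknown optimizer: {optimizer}")

# e.g., tuned at batch size 256 with LR 3e-4, scaling up to batch size 4096:
print(scale_lr(3e-4, 256, 4096, "adam"))  # 0.0012
```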
@SadhikaMalladi
Sadhika Malladi
1 year
Couldn't make it to ICLR 2023 but check out my talk in the ME-FoMo workshop today!
@SadhikaMalladi
Sadhika Malladi
2 years
Why can we fine-tune (FT) huge LMs on a few data points without overfitting? We show with theory + exps that FT can be described by kernel dynamics. Joint work with @_awettig, @dingli_yu, @danqi_chen, @prfsanjeevarora. [1/8]
2
11
46
0
3
16
@SadhikaMalladi
Sadhika Malladi
4 months
@ArmenAgha Yup, this is exactly what my blog post covers! The SDE gets more complicated with Adam :)
0
3
16
@SadhikaMalladi
Sadhika Malladi
4 months
Great to see our work on tuning LLMs with forward passes (MeZO: ) extended with in-depth benchmarking on more models, tasks, and settings! And a tutorial at AAAI 24 on using ZO methods to tune large models:
@TianlongChen4
Tianlong Chen
4 months
💭 Dreaming of tuning LLMs with inference-only memory? 🤔 🌄 Check out our ZO-LLM Benchmark, revisiting ZO for LLM tuning, across 5 LLM families & 3 task complexities & 4 tuning schemes ➡️ Unveiling overlooked principles & 3 novel enhancements. 🔗
2
15
55
0
1
15
@SadhikaMalladi
Sadhika Malladi
3 months
Can RNNs replace transformers? Hot topic of debate over the past week with several competitive RNN-style models. Useful theory for grounding your intuitions about the benefits of chain-of-thought and the expressivity of different architectures!
@vfleaking
Kaifeng Lyu
3 months
Check out our new paper! We explore the representation gap between RNNs and Transformers. Theory: CoT improves RNNs but is insufficient to close the gap. Improving the capability of retrieving information from context is the key (e.g. +RAG / +1 attention).
1
5
44
2
0
14
@SadhikaMalladi
Sadhika Malladi
2 months
Super excited to see this work out! Happy to have contributed a small part to thinking about the optimization dynamics at this scale :)
@violet_zct
Chunting Zhou
2 months
How to enjoy the best of both worlds of efficient training (less communication and computation) and inference (constant KV-cache)? We introduce a new efficient architecture for long-context modeling – Megalodon that supports unlimited context length. In a controlled head-to-head
4
51
225
1
0
11
@SadhikaMalladi
Sadhika Malladi
8 months
Excited to give this talk on Thursday! Thanks for inviting me :)
@SydMathInst
SydMathInst
8 months
Announcing the next "Mathematical challenges in AI" seminar by computer scientist @SadhikaMalladi of #PrincetonU @Princeton, this Thursday 28 September at 8:00 AEST. Join us online or in-person, all info on the course webpage:
0
1
12
0
0
11
@SadhikaMalladi
Sadhika Malladi
3 months
Interesting result on how transformers can learn fairly arbitrary causal mechanisms in the data through GD. Tricky analysis!
@EshaanNichani
Eshaan Nichani
3 months
Causal self-attention encodes causal structure between tokens (e.g., induction head, learning function class in-context, n-grams). But how do transformers learn this causal structure via gradient descent? New paper with @alex_damian_ @jasondeanlee! (1/10)
6
93
415
0
0
9
@SadhikaMalladi
Sadhika Malladi
5 months
Thanks to @gaotianyu1350, @SurbhiGoel_, @BingbinL, @vfleaking, @abhi_venigalla, @xiamengzhou, and @HowardYen1 for their help. And especially thanks to @zhiyuanli_ for patiently introducing me to this topic when I first started grad school :)
0
1
9
@SadhikaMalladi
Sadhika Malladi
3 months
Thrilled to announce that we are starting a virtual seminar on alignment and safety! I’m excited to bring these discussions with experts to a broad audience. Sign up to receive email updates and stay tuned for how to submit questions for our speakers :)
@PrincetonPLI
Princeton PLI
3 months
Announcing Princeton AI Alignment and Safety Seminar (PASS): A virtual & collaborative space for diverse researchers to learn & discuss aligning increasingly capable AI models for safe behavior. Join our mailing list for updates:
0
19
60
0
0
9
@SadhikaMalladi
Sadhika Malladi
11 months
I'll be presenting our kernel-based view of fine-tuning at #ICML23 in the first poster session! 11am, Exhibit Hall 1, #442 :)
@SadhikaMalladi
Sadhika Malladi
2 years
Why can we fine-tune (FT) huge LMs on a few data points without overfitting? We show with theory + exps that FT can be described by kernel dynamics. Joint work with @_awettig, @dingli_yu, @danqi_chen, @prfsanjeevarora. [1/8]
2
11
46
0
1
8
@SadhikaMalladi
Sadhika Malladi
4 months
@giffmana I had the same Q :) GLUE-style FT seems to behave more like small-LR, full-batch GD (), so probably not enough gradient noise for SDEs/scaling rules. Instruction tuning is harder to say...my guess is it's close to SDE regime, but I have no exps on it. [1/2]
1
0
8
@SadhikaMalladi
Sadhika Malladi
1 year
MeZO works with parameter-efficient methods (LoRA and prefix tuning) and can optimize non-differentiable objectives (directly maximizing accuracy and F1 score).
1
1
7
@SadhikaMalladi
Sadhika Malladi
1 year
MeZO adapts classical zeroth-order methods to estimate the gradient in-place with two forward passes. Our theory shows how adequate pre-training allows it to avoid the classically catastrophic zeroth-order slowdown (usually proportional to the number of parameters!).
1
1
7
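For concreteness, here is a minimal sketch of the classical two-point zeroth-order (SPSA-style) gradient estimate that the tweet above describes: perturb the parameters along a random direction, run two forward passes, and project the loss difference back onto that direction. The toy loss, parameters, perturbation scale, and step size are illustrative placeholders, not the actual MeZO implementation.

```python
# Minimal sketch of a two-forward-pass zeroth-order gradient estimate (SPSA-style).
# `loss_fn`, `theta`, and `eps` are illustrative placeholders.
import numpy as np

def zo_gradient_estimate(loss_fn, theta, eps=1e-3, seed=0):
    """Estimate grad(loss_fn)(theta) from two forward passes.

    Fixing the seed lets the random direction be regenerated rather than stored,
    which is what makes this kind of estimator memory-efficient and in-place.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)      # random perturbation direction
    loss_plus = loss_fn(theta + eps * z)      # forward pass 1
    loss_minus = loss_fn(theta - eps * z)     # forward pass 2
    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    return projected_grad * z                 # gradient estimate along z

# Toy usage: one zeroth-order SGD step on a quadratic loss.
theta = np.array([1.0, -2.0, 0.5])
loss = lambda w: float(np.sum(w ** 2))
theta -= 0.1 * zo_gradient_estimate(loss, theta)
```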
@SadhikaMalladi
Sadhika Malladi
1 year
Cool follow up work to MeZO!
@ericzelikman
Eric Zelikman
1 year
Decentralized LLM fine-tuning on the normal internet needs a ton of bandwidth to send model updates (e.g. terabytes per several gradients!). Even with LoRA, this scales pretty badly with many devices. Can you get similar performance w/ just one byte per gradient? (Maybe!)
2
32
148
0
0
7
@SadhikaMalladi
Sadhika Malladi
1 year
Extensive exps on masked and autoregressive models show that MeZO outperforms memory-equivalent methods (e.g., ICL) and often performs comparably to FT.
1
1
7
@SadhikaMalladi
Sadhika Malladi
2 years
Joint work with @vfleaking, @Abhishek_034, @prfsanjeevarora. Poster session: 11am-1pm Tuesday. Link: [2/2]
1
1
2
@SadhikaMalladi
Sadhika Malladi
4 months
@giffmana Hard to compute the (empirical) NTK in instruction tuning since it's a V-way clf task and each prefix is treated as its own input. Related exps suggest grad changes a lot early on and then not so much later (Appendix F: ). But not conclusive evidence.
1
0
5
@SadhikaMalladi
Sadhika Malladi
10 months
Cool workshop by some of my friends! Submit if you have something :)
@andrew_ilyas
Andrew Ilyas
10 months
What makes ML models tick? How do we attribute model behavior to the training data, algorithm, architecture, or scale used in training? Papers (or ideas) here? Submit to ATTRIB @ NeurIPS 2023 ()! Deadline is September 23!
1
20
127
0
0
5
@SadhikaMalladi
Sadhika Malladi
8 months
Excited to see this work on using existing LLMs to efficiently construct and train powerful smaller models!
@xiamengzhou
Mengzhou Xia
8 months
We release the strongest public 1.3B and 3B models so far – the ShearedLLaMA series. Structured pruning from a large model to a small one is far more cost-effective (only 3%!) than pre-training them from scratch! Check out our paper and models at: [1/n]
20
141
771
0
0
4
@SadhikaMalladi
Sadhika Malladi
2 years
Our new work!
@prfsanjeevarora
Sanjeev Arora
2 years
Fine tuned LLMs can solve many NLP tasks. A priori, fine-tuning a huge LM on a few datapoints could lead to catastrophic overfitting. So why doesn’t it? Our theory + experiments (on GLUE) reveal that fine-tuning is often well-approximated as simple kernel-based learning. 1/2
5
33
239
0
0
4
@SadhikaMalladi
Sadhika Malladi
1 year
Exciting times to be at Princeton :) come join us!
@prfsanjeevarora
Sanjeev Arora
1 year
Princeton has a new Center for Language and Intelligence, researching LLMs + large AI models, as well as their interdisciplinary applications. Looking for postdocs/research scientists/engineers; attractive conditions.
22
116
622
0
0
4
@SadhikaMalladi
Sadhika Malladi
4 months
@jon_barron Cool idea! I guess for the settings that we would mostly use (i.e., decayed beta between 0.8 and 0.9), both of our scaling rules mostly agree. I'm wondering if there's any theory behind your idea?
3
0
3
@SadhikaMalladi
Sadhika Malladi
4 months
@yaroslavvb Good question! So the scaling rule comes out of the "time-scaling" of the SDE. eta ~ B^{1-p}, so you could feasibly derive some SDEs for other values of p. Section 4.1 of our paper provides an illustration of how you could do so in a simple setting:
0
0
3
@SadhikaMalladi
Sadhika Malladi
9 months
Proud sister!
@chatt_md
Chatt Malladi, MD
9 months
[Photo]
0
1
16
0
0
3
@SadhikaMalladi
Sadhika Malladi
11 months
I'm at #ICML23, DM me if you want to meet up!
0
0
2
@SadhikaMalladi
Sadhika Malladi
10 months
Excited to see this work out! It contains multitudes :)
@TheGregYang
Greg Yang
10 months
1/ How to scale hyperparams (e.g., learning rate) as a neural network gets wider? Esp w/ adaptive optimizers like Adam? I derived the answer (μP) in 2020 & verified it on GPT3. This required some beautiful new math that's just been completely written down w/ @EtaiLittwin 🧵👇
11
63
301
0
0
2
@SadhikaMalladi
Sadhika Malladi
4 months
@ceksudo Thanks for reading! I am actually not working on diffusion models, even though they use the SDE as a backbone. The application is a bit different there. It's something I'm learning now though, and maybe I will write a post on it later!
1
0
2
@SadhikaMalladi
Sadhika Malladi
7 months
@Kangwook_Lee @yzeng58 @edwardjhu Great work! You may be interested in our complementary ICML 23 work (), which uses an NTK-style interpretation of FT to show that LoRA doesn't modify FT dynamics much if the hyperparameters are chosen well. It also has a similar conclusion about last-layer tuning!
1
0
1
@SadhikaMalladi
Sadhika Malladi
4 months
@giffmana @deepcohen @SamuelMLSmith @HKydlicek Thanks for looking at our work! The rule we prescribe assumes that you have found the best LR for a model+dataset and you only want to scale the batch size (i.e., accelerate training). But that best LR indeed depends on the model and data! More here:
2
0
2
@SadhikaMalladi
Sadhika Malladi
2 years
We design an empirically testable condition to formalize the intuition that a pre-trained model can already do pretty well on downstream tasks. We prove that this condition ensures prompt-based FT will exhibit kernel behavior. [7/8]
1
0
2
@SadhikaMalladi
Sadhika Malladi
4 months
@typedfemale @jon_barron One good way to get intuition is to look at the simplified setting in Section 4.1 of our paper (). If you want to discuss more, maybe you can shoot me an email so we are not limited by number of characters? :)
0
0
1
@SadhikaMalladi
Sadhika Malladi
2 months
@BingbinL @KempnerInst Kempner is lucky to have you :) excited for you!!! 🎉
0
0
1
@SadhikaMalladi
Sadhika Malladi
3 months
Effectively handling text inputs in a visual format (e.g., flowcharts) is crucial for many applications!
@gaotianyu1350
Tianyu Gao
3 months
New preprint "Improving Language Understanding from Screenshots" w/ @zwcolin @AdithyaNLP @danqi_chen. We improve language understanding abilities of screenshot LMs, an emerging family of models that processes everything (including text) via visual inputs.
6
43
189
0
0
1
@SadhikaMalladi
Sadhika Malladi
4 months
@samlakig Thanks for reading it :)
0
0
1
@SadhikaMalladi
Sadhika Malladi
4 months
@KordingLab @SurbhiGoel_ Thanks for your interest -- happy to chat via email! Note these exps are only GLUE-style fine-tuning, not instruction tuning. The latter is a V-way classification on every prefix of a context, so the kernel gets massive! Also, this paper has exps in vision FT:
0
0
1
@SadhikaMalladi
Sadhika Malladi
2 years
Prior work suggested using the neural tangent kernel (NTK) to study vision FT. But the NTK describes training randomly initialized infinite-width networks with gradient descent. Does FT in LMs exhibit kernel behavior? If so, why? [2/8]
1
1
1
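To make the kernel view in the tweet above concrete, here is a minimal sketch of an empirical NTK: the kernel entry for two inputs is the inner product of their parameter gradients evaluated at the (pre-trained) weights. The tiny linear model and random inputs are placeholders, not the prompt-based LM setup studied in the paper.

```python
# Minimal sketch of an empirical NTK: K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>
# at the current (pre-trained) parameters. Tiny model and random data are placeholders.
import torch

model = torch.nn.Linear(8, 1)        # stand-in for a pre-trained network
xs = torch.randn(4, 8)               # four example inputs

def param_grad(x):
    model.zero_grad()
    model(x.unsqueeze(0)).squeeze().backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

grads = torch.stack([param_grad(x) for x in xs])   # (n_examples, n_params)
empirical_ntk = grads @ grads.T                    # (n_examples, n_examples) kernel
print(empirical_ntk)
```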
@SadhikaMalladi
Sadhika Malladi
10 months
I'll also be presenting at the Differentiable Almost Everything workshop on Friday in meeting room 310. The poster sessions are 11:30-12:30pm and 4-5pm.
0
0
1
@SadhikaMalladi
Sadhika Malladi
4 months
@oharub Thank you!
0
0
1