Prateek Yadav Profile
Prateek Yadav

@prateeky2806

1,555
Followers
1,588
Following
51
Media
737
Statuses

Ph.D. at @unccs . Continual Model Adaptation and Composition. Previously @MSFTResearch , @AmazonScience , @iitmadras . UG @iiscbangalore . Opinions are my own.

North Carolina
Joined July 2014
Pinned Tweet
@prateeky2806
Prateek Yadav
6 months
Presenting ComPEFT 🗜! We compress parameter updates to facilitate efficient communication of expert models for compositional generalization. ComPEFT improves perf. 📈, while reducing storage/communication costs 📉 @LChoshen @colinraffel @mohitban47 🧵
Tweet media one
8
56
234
@prateeky2806
Prateek Yadav
7 months
Gradient Checkpointing (GC) is a hidden gem that most people take for granted. However, it has a crazy impact on reducing VRAM. @yilin_sung and I profiled the activation memory used by the LLaMA-7B model and the impact is huge! 🧵Find out more about GC 👇 cc @Tim_Dettmers
Tweet media one
7
52
276
@prateeky2806
Prateek Yadav
7 months
🚀Struggling with Memory issues in MoE models?😭 Introducing...✨MC-SMoE✨ We merge experts THEN compress/decompose merged experts➡️low-rank. Up to 80% mem reduction! 🎉 w/ @pingzli @KyriectionZhang @yilin_sung @YuCheng3 @mohitban47 @TianlongChen4 🧵👇
Tweet media one
4
75
259
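To make the "merge experts, then compress to low-rank" recipe above concrete, here is a minimal sketch; the uniform averaging, the `rank` value, and the function name are illustrative simplifications (MC-SMoE itself merges based on routing statistics rather than a plain mean).

```python
import torch

def merge_then_lowrank(experts, rank=64):
    """Sketch of 'merge similar experts, then low-rank decompose the merged expert'.
    `experts` is a list of same-shape 2D weight tensors; `rank` is illustrative."""
    merged = torch.stack(experts).mean(dim=0)           # merge the experts' weights
    U, S, Vh = torch.linalg.svd(merged, full_matrices=False)
    A = U[:, :rank] * S[:rank]                          # low-rank factors: merged ≈ A @ B
    B = Vh[:rank, :]
    return A, B                                         # store A, B instead of the full matrix
```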
@prateeky2806
Prateek Yadav
11 months
Performance degrades when merging diff task-specific models into a multitask model? Presenting TIES-Merging🪢 We find signif *Interference* b/w model params & mitigate it 👉 improves both NLP & CV merging @dtredsox13 @LChoshen @colinraffel @mohitban47 🧵
Tweet media one
6
86
255
@prateeky2806
Prateek Yadav
4 months
🎉 Thrilled to announce our MoE Expert Merging paper has been accepted to @iclr_conf as a Spotlight paper! We reduce the inference memory cost of MoE models by utilizing routing-statistics-based merging of experts to achieve up to 80% memory and 20% FLOPs reduction. 📷
@prateeky2806
Prateek Yadav
7 months
🚀Struggling with Memory issues in MoE models?😭 Introducing...✨MC-SMoE✨ We merge experts THEN compress/decompose merged experts➡️low-rank. Up to 80% mem reduction! 🎉 w/ @pingzli @KyriectionZhang @yilin_sung @YuCheng3 @mohitban47 @TianlongChen4 🧵👇
Tweet media one
4
75
259
8
27
168
@prateeky2806
Prateek Yadav
7 months
🔍 A thread on the latest @iclr_conf 2024 papers on - Mixture of Experts - Modular Models - Compositional Generalization - and related topics: Dive into the latest papers from #ICLR2024 through the list below! Let me know if I missed some relevant papers. [🧵Thread ⬇️]
4
20
164
@prateeky2806
Prateek Yadav
3 months
I am not sure who is still a disbeliever and needs to hear it. If you are not using MODEL MERGING for either pretraining/continued-finetuning/adapting your models then you are wasting a lot of compute which costs you direct $$$ 🧵
Tweet media one
3
26
166
@prateeky2806
Prateek Yadav
7 months
A very nice visual explanation of how Gradient Checkpointing works is in this blog post by @yaroslavvb . A brief summary from the blog on how GC stores some activations and uses partial forward passes for backprop. (Visualizations are from the blog)
@prateeky2806
Prateek Yadav
7 months
Gradient Checkpointing (GC) is a hidden gem that most people take for granted. However, it has a crazy impact on reducing VRAM. @yilin_sung and I profiled the activation memory used by the LLaMA-7B model and the impact is huge! 🧵Find out more about GC 👇 cc @Tim_Dettmers
Tweet media one
7
52
276
1
23
160
@prateeky2806
Prateek Yadav
8 months
🎉 Thrilled to announce our paper on TIES-Merging🪢 has been accepted to @NeurIPSConf ! We've delved into the significant Interference between task-specific model parameters when merging and found a way to mitigate it, enhancing both NLP & CV. Stay tuned for more insights! 📄✨
@prateeky2806
Prateek Yadav
11 months
Performance degrades when merging diff task-specific models into a multitask model? Presenting TIES-Merging🪢 We find signif *Interference* b/w model params & mitigate it 👉 improves both NLP & CV merging @dtredsox13 @LChoshen @colinraffel @mohitban47 🧵
Tweet media one
6
86
255
5
26
157
@prateeky2806
Prateek Yadav
7 months
🔍 Searching for @iclr_conf 2024 papers on Model Merging/Fusion & related topics: Dive into the latest advancements in model merging, fusion, and weight interpolations from #ICLR2024 through the list below! Let me know if I missed some relevant papers. [Thread ⬇️]
4
16
81
@prateeky2806
Prateek Yadav
2 years
Can we avoid forgetting while achieving forward transfer in continual learning, by finding and training subnetworks? Check out “Exclusive Supermask Subnetwork Training for Continual Learning” which works well for both vision & NLP. @mohitban47 @uncnlp 🧵
Tweet media one
3
20
82
@prateeky2806
Prateek Yadav
4 months
🎉 When pruning datasets, there is a trade-off between selecting Diverse and Difficult samples. Proud to announce that our paper D2-Pruning has been accepted to ICLR'24 @iclr_conf and uses message passing on a dataset graph to effectively navigate this trade-off 📷
@adyasha10
Adyasha Maharana
7 months
How to select important+diverse training data under a fixed data budget? 📢"D2 Pruning" --> represent datasets as sparse undirected graph & perform forward+reverse message passing to select both difficult & diverse samples. @prateeky2806 @mohitban47 🧵
Tweet media one
2
36
127
1
14
72
@prateeky2806
Prateek Yadav
10 months
Ever wondered how to continually improve your code LLM? In our new #ACL2023nlp paper, we explore Continual Learning (CL) methods for code domain: CodeTask-CL benchmark & Prompt Pooling with Teacher Forcing for CL in code domain. @AmazonScience @uncnlp 🧵
Tweet media one
1
21
58
@prateeky2806
Prateek Yadav
7 months
Check out the camera-ready version of TIES-Merging to be presented at @NeurIPSConf 2023! We have added more experiments on:
1. Merging for robustness on a single task.
2. Merging for better initialization and finetuning.
3. We show that interference exists even when merging…
Tweet media one
Tweet media two
Tweet media three
Tweet media four
@prateeky2806
Prateek Yadav
11 months
Performance degrades when merging diff task-specific models into a multitask model? Presenting TIES-Merging🪢 We find signif *Interference* b/w model params & mitigate it 👉 improves both NLP & CV merging @dtredsox13 @LChoshen @colinraffel @mohitban47 🧵
Tweet media one
6
86
255
4
7
49
@prateeky2806
Prateek Yadav
8 months
Whoever is handling the @iclr_conf account this year has a good sense of humor. I am loving these random short tweets and replies. It's making the process fun.
@iclr_conf
ICLR 2024
8 months
@_vaishnavh We strive to vibe.
0
2
44
2
0
47
@prateeky2806
Prateek Yadav
7 months
Check out our new work on dataset pruning that balances both the difficulty of samples and their diversity. We employ message passing on the dataset graph to select a dataset subset. Some interesting findings here! 👇
@adyasha10
Adyasha Maharana
7 months
How to select important+diverse training data under a fixed data budget? 📢"D2 Pruning" --> represent datasets as sparse undirected graph & perform forward+reverse message passing to select both difficult & diverse samples. @prateeky2806 @mohitban47 🧵
Tweet media one
2
36
127
0
8
44
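As a rough illustration of the graph-based "difficult + diverse" selection idea in the thread above, here is a greedy sketch; the kNN graph construction, the `gamma` decay, and all hyperparameters are my own simplifications, not the actual D2-Pruning algorithm.

```python
import torch

def graph_coreset_select(embeddings, difficulty, budget, k=10, gamma=0.5):
    """Greedy sketch: pick high-difficulty samples, then down-weight their graph
    neighbors so the selection stays diverse. Purely illustrative."""
    dists = torch.cdist(embeddings, embeddings)             # pairwise distances
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]   # sparse kNN graph (drop self)
    scores = difficulty.clone().float()
    selected = []
    for _ in range(budget):
        i = int(torch.argmax(scores))
        selected.append(i)
        scores[i] = float("-inf")
        scores[knn[i]] *= gamma    # discourage picking near-duplicates of the chosen sample
    return selected
```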
@prateeky2806
Prateek Yadav
4 months
It is amazing to see that model merging is finally getting more traction and people are extracting value out of it. Moreover, there are resources like mergekit that make it easier to quickly try out these methods. Everyone who is building any model should try merging as it…
@julien_c
Julien Chaumond
4 months
I was collating a basic collection of *Papers on model merging*: Then I found out that @osanseviero had already done a much more complete one 😈 So here goes =>
5
24
164
1
6
40
@prateeky2806
Prateek Yadav
5 months
The FOMO of missing #NeurIPS2023 despite having two papers (TIES-Merging and SeViLA) is real. However, if you want to chat or are looking for interns to work on MoE models, Instruction Tuning, Continuously updating LLMs, PEFT methods, or Model Merging, then do reach out! 🧵
1
4
37
@prateeky2806
Prateek Yadav
8 months
Thrilled to announce another paper that has been accepted to @NeurIPSConf ! In SeViLA, we self-chain BLIP2 for temporal localization and question answering on video & get new SOTA on multiple videoQA benchmarks.
@shoubin621
Shoubin Yu
1 year
🚨Can we self-refine a single image-LM for both language-aware keyframe localization & QA on videos? (=sota on multiple datasets)🚨 “Self-Chained Image-Language Model for Video Localization and Question Answering” @jmin__cho @prateeky2806 @mohitban47 🧵
Tweet media one
4
41
155
2
6
37
@prateeky2806
Prateek Yadav
1 month
Pretty strong results on long context tasks like passkey detection and book summarisation. Glad to be interning with @TsendeeMTS @tuvllms at Google this summer. Looking forward to doing something great soon.
@arankomatsuzaki
Aran Komatsuzaki
1 month
Google presents Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention 1B model that was fine-tuned on up to 5K sequence length passkey instances solves the 1M length problem
Tweet media one
28
262
1K
2
3
36
@prateeky2806
Prateek Yadav
3 months
It's great to see that research on continual & collaborative model training, to which I have dedicated 3 years, is becoming relevant in industry. @JeffDean publicly advocating for systems like this is encouraging. Multiple research topics directly fit into this vision. 🧵
@SurakshaPinnuET
Suraksha P
3 months
. @JeffDean , chief scientist, @Google DeepMind and Google Research, Grace Chung, Site Lead, and Engineering Director, Google Australia, and @ManishGuptaMG1 , Director of Google Research India talk about large language models and the current AI landscape. @ETtech @EconomicTimes
2
11
37
1
6
29
@prateeky2806
Prateek Yadav
10 months
I'm finally in Toronto & presenting my 2 Continual-Learning #ACL2023nlp papers: 1) Our ExSSNeT paper was accepted to Findings and will be presented as a spotlight on Mon 7pm and a poster on Tue 11am. 2) Our Code-CL paper is in the main conf. on Wed 11am. @mohitban47 @AmazonScience @uncnlp 🧵
@prateeky2806
Prateek Yadav
2 years
Can we avoid forgetting while achieving forward transfer in continual learning, by finding and training subnetworks? Check out “Exclusive Supermask Subnetwork Training for Continual Learning” which works well for both vision & NLP. @mohitban47 @uncnlp 🧵
Tweet media one
3
20
82
1
10
25
@prateeky2806
Prateek Yadav
3 months
Git-Theta (ICML'23) proposed a practical library for actually building models collaboratively like this. It might be useful to discuss it in the paper.
@pulkitology
Pulkit Agrawal
3 months
Presenting a method for training models from SCRATCH using LoRA: 💡20x reduction in communication 💡3x savings in memory - Find out more: - Code available to try out - Scaling to larger models ongoing - led by Jacob Huh!
Tweet media one
6
57
387
1
3
24
@prateeky2806
Prateek Yadav
3 years
@timnitGebru Can you please also tell us what #1 and #2 were so that everyone gets a better understanding?
0
0
21
@prateeky2806
Prateek Yadav
3 months
I would say TIES with DARE might be the most useful. There might be better ways to merge but they are just more complex to use and might need more data, training, gradients, etc. TIES with merging weight=1 also works pretty well.
@rasbt
Sebastian Raschka
3 months
As an LLM finetuner, I recently started getting into model merging. I wrote up a short tutorial on linear merging to introduce the topic: Btw does anyone happen to have good examples of LLMs that work well when merged via linear merging? And for…
10
153
809
1
4
22
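For readers unfamiliar with the DARE step mentioned above, here is a minimal sketch of its drop-and-rescale idea applied to a task vector (the delta between a fine-tuned model and the base model); the `drop_rate` is illustrative, and this omits the subsequent TIES-style sign election and merging.

```python
import torch

def dare(task_vector: torch.Tensor, drop_rate: float = 0.9):
    """Randomly drop most delta parameters and rescale the survivors so the
    expected update is preserved. A sketch, not any library's exact recipe."""
    keep = (torch.rand_like(task_vector) > drop_rate).float()
    return task_vector * keep / (1.0 - drop_rate)
```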
@prateeky2806
Prateek Yadav
4 months
It's fascinating yet concerning how we struggle to efficiently access existing knowledge. Even for the most world-class researchers, it is hard to find knowledge that is well documented in the form of scientific publications, even in highly active areas like Mixture of Experts (MoE).…
Tweet media one
Tweet media two
Tweet media three
1
1
19
@prateeky2806
Prateek Yadav
8 months
Eye-opening!🤯 Even after attempting to delete sensitive information from LLMs, a significant portion can still be recovered via extraction attacks. Moreover, this study presents new defense mechanisms to reduce such recovery to only 2%. Be cautious when deleting info from LLMs!
@vaidehi_patil_
Vaidehi Patil
8 months
🚨Can Sensitive Information Be Deleted From LLMs? We show that extraction attacks recover 18-38% of "deleted" knowledge! Our attack+defense framework has whitebox+blackbox attacks. New defense objectives lower attacks to 2%! @peterbhase @mohitban47 🧵
Tweet media one
1
65
265
1
4
19
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 We introduce ComPEFT, a novel method that significantly compresses the fine-tuning residuals (task vectors) of PEFT-based models. This is achieved through sparsification and quantization techniques. And guess what? No additional retraining is needed!
Tweet media one
1
1
19
@prateeky2806
Prateek Yadav
3 months
I am not sure why I missed this OG thread, which was before ChatGPT, before MoEs were cool, and even before anyone thought merging would work and be a thing.
@prateeky2806
Prateek Yadav
3 months
I am not sure who is still a disbeliever and needs to hear it. If you are not using MODEL MERGING for either pretraining/continued-finetuning/adapting your models then you are wasting a lot of compute which costs you direct $$$ 🧵
Tweet media one
3
26
166
0
2
15
@prateeky2806
Prateek Yadav
3 years
Had a great experience working with my colleagues on this paper. Check it out!
@swarnaNLP
Swarnadeep Saha
3 years
Excited to share our new work on "ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning"! Has been a long effort and a great learning experience too 🙂 Joint work w. @prateeky2806 @lbauer119 @mohitban47 @uncnlp Paper: 1/5
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
28
89
1
1
14
@prateeky2806
Prateek Yadav
7 months
As many people must be waiting: ICLR-submitted papers are now available at
1
2
13
@prateeky2806
Prateek Yadav
1 year
I'm at @NeurIPSConf till Friday, if you're interested in continual learning, model merging/adaptation, sparsity for CL, or mixture of expert models, let's chat. Shoot me a DM
0
1
13
@prateeky2806
Prateek Yadav
3 months
Experiments with model merging are ongoing, and many merges rank high on HF leaderboards. People use TIES-Merging ( @prateeky2806 ), DARE-TIES (Le Yu), or SLERP methods to do merging. There are >130 models on HF built using TIES, which are downloaded 1000s of times/week
Tweet media one
1
0
13
@prateeky2806
Prateek Yadav
3 months
@tianle_cai Definitely look at ComPEFT (released Nov'23) which does exactly what BigDelta tries but with only evaluation and for bigger models this can be done without examples. ComPEFT is definitely worth citing if not comparing against it.
@prateeky2806
Prateek Yadav
6 months
Presenting ComPEFT 🗜! We compress parameter updates to facilitate efficient communication of expert models for compositional generalization. ComPEFT improves perf. 📈, while reducing storage/communication costs 📉 @LChoshen @colinraffel @mohitban47 🧵
Tweet media one
8
56
234
1
0
13
@prateeky2806
Prateek Yadav
1 year
How to refine a single image-LM for language-aware keyframe localization & video QA, reaching Sota on multiple datasets?🎯 Introducing SeViLA! By Self-Chaining BLIP-2, we streamline 2-stage inference (localize+QA) & refine localization via QA feedback.🎥🔎💡 👇
@shoubin621
Shoubin Yu
1 year
🚨Can we self-refine a single image-LM for both language-aware keyframe localization & QA on videos? (=sota on multiple datasets)🚨 “Self-Chained Image-Language Model for Video Localization and Question Answering” @jmin__cho @prateeky2806 @mohitban47 🧵
Tweet media one
4
41
155
0
3
13
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 PEFT methods like LoRA and (IA)^3 make it possible to efficiently adapt LLMs to create expert models that specialize in new tasks or domains. Compositional generalization and model merging compose these expert models to improve zero/few-shot generalization on unseen tasks.
Tweet media one
1
1
15
@prateeky2806
Prateek Yadav
3 years
I totally disagree; there is more to life than work and research. If we have to work 90 hours/week, then when will we enjoy our lives and do other meaningful things? There needs to be a balance, which I intend to maintain as well.
@gdb
Greg Brockman
3 years
Agreed:
Tweet media one
98
104
987
0
0
12
@prateeky2806
Prateek Yadav
3 years
Check out our new work on generating multiple proof graphs for compositional reasoning to be presented at #NAACL2021 next Tuesday.
@swarnaNLP
Swarnadeep Saha
3 years
New #NAACL2021 paper (next Tues) on explaining compositional reasoning w/ multiple proof graphs "multiPRover: Generating Multiple Proofs for Improved Interpretability in Rule Reasoning" Paper: Code: @prateeky2806 @mohitban47 1/n
Tweet media one
Tweet media two
Tweet media three
1
18
40
0
4
12
@prateeky2806
Prateek Yadav
3 years
Check out our updated paper and website release of ExplaGraphs.
@swarnaNLP
Swarnadeep Saha
3 years
ExplaGraphs (to be presented at #EMNLP2021 ): Check out our website & new version with more+refined graph data, new structured models, new metrics (like graph-editdistance + graph-bertscore) & human eval + human-metric correlation😀
Tweet media one
Tweet media two
Tweet media three
2
16
52
0
2
11
@prateeky2806
Prateek Yadav
6 months
Colin is an amazing advisor and a pleasure to work with. Definitely apply if you are interested in the listed topics.
@colinraffel
Colin Raffel
6 months
Also, I am 1000% hiring PhD students this round! If you want to work on - open models - collaborative/decentralized training - building models like OSS - coordinating model ecosystems - mitigating risks you should definitely apply! Deadline is Friday 😬
12
75
458
0
0
11
@prateeky2806
Prateek Yadav
2 months
There are not many good LLMs for healthcare applications, which is a challenging problem. This is truly an amazing feat: certified doctors rate the Polaris model similar to or better than nurses across multiple dimensions. Crazy!
@subho_mpi
Subhabrata Mukherjee
2 months
When we started building a safety-focused LLM for healthcare a year back, a result like this was beyond imagination. We are excited to share some of the technical and a lot of the clinical considerations that went into building #Polaris in our 53-page technical report available…
Tweet media one
0
4
29
0
2
6
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 When applied to the LLaMA model (7B - 65B), ComPEFT-QLoRA outperformed QLoRA by 4.16% on the MMLU benchmark, with an impressive reduction in storage size of up to 26x. This is a significant gain in terms of efficiency and performance!
Tweet media one
1
1
12
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 In summary, ComPEFT compresses parameter updates via sparse ternary compression to facilitate efficient communication and retrieval of expert models. Check out the paper and the code of ComPEFT for more details! 📜 : 🖥️ :
1
1
11
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers With GC, the activation VRAM reduces to just 1.6 GB 🤯 but the backward pass takes 2 seconds. When combining GC with flash attn., the activation VRAM usage is ~0.7 GB and the time is 1.7 seconds.
1
0
8
@prateeky2806
Prateek Yadav
4 months
@ekzhang1 They just want your resume for training data
0
0
9
@prateeky2806
Prateek Yadav
11 months
Shoutout to several prev/related works in the model merging community👇 1⃣Task Arithmetic ( @gabriel_ilharco , @Mitchnw et al) 2⃣Fisher Merging (michael_matena et al) 3⃣Git Re-Basin ( @SamuelAinsworth et al)
@SamuelAinsworth
Samuel "curry-howard fanboi" Ainsworth
2 years
📜🚨📜🚨 NN loss landscapes are full of permutation symmetries, ie. swap any 2 units in a hidden layer. What does this mean for SGD? Is this practically useful? For the past 5 yrs these Qs have fascinated me. Today, I am ready to announce "Git Re-Basin"!
63
586
3K
1
1
9
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers GC + Flash Attn is 20% slower compared to not using these tricks but reduces the VRAM requirement by >95% when using batch size 1 and seq len 1400. Without GC, activations use ~39 GB of VRAM and a backward pass takes about 1.4 sec.
1
0
7
@prateeky2806
Prateek Yadav
7 months
@iclr_conf 📝 Emergent Mixture-of-Experts: Can Dense Pre-trained Transformers Benefit from Emergent Modular Structures? TL;DR: Transforming a pre-trained dense model into a modular one can alleviate negative transfer and enhance both ID and OOD capabilities. 🔗:
1
0
9
@prateeky2806
Prateek Yadav
7 months
We find several exciting and strong results. ✅ MC-SMoE gives up to 80% memory savings! ✅ 20% reduction in FLOPs! ✅ Virtually no performance loss! Tested across 8 benchmarks and compared with 6 baseline methods! 📊
Tweet media one
1
0
8
@prateeky2806
Prateek Yadav
3 months
Some major releases in the PEFT library. Thanks @sourab_m , these updates are super useful.
0
0
8
@prateeky2806
Prateek Yadav
3 years
The amount of uncertainty in such situations is traumatizing at the very least. It changes your views on life and inherent beliefs. Many things in my life have lost the false importance they had.
@SnehaAnnavarapu
Sneha Annavarapu
3 years
My covid +ve mother is in the hospital, my father (who is also covid +ve) is driving my covid +ve grandmother to a hospital...& I’m sitting here in Chicago, fully vaccinated, starting into space, calling people, giving instructions, feeling guilty, tired & useless, venting.
542
997
13K
0
1
8
@prateeky2806
Prateek Yadav
11 months
Interesting work on understanding when a big teacher model can help a smaller student model.
@swarnaNLP
Swarnadeep Saha
11 months
Can LLMs Teach Weaker Agents? Aligned teachers can intervene w/ free-text explanations using Theory of Mind (ExpUtility+Personalization) to improve students on future unexplained data🙂 Misaligned teachers hurt students😢 w/ @peterbhase @mohitban47 🧵👇
Tweet media one
2
65
187
1
3
8
@prateeky2806
Prateek Yadav
7 months
Shout out to @pingzli for the awesome effort and leading this work! Check out the paper and our code for more insights! 📚 Arxiv: Our code is also publicly available 👉
0
0
8
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 We also find that ComPEFT improves with scale. This means stronger models not only become more compressible but also show better performance post-compression.
Tweet media one
1
0
8
@prateeky2806
Prateek Yadav
11 months
Looking forward to our collaboration! 🎉 Welcome to UNC.
@jaeh0ng_yoon
Jaehong Yoon
11 months
😍I'm super excited to announce my next journey! After a great time at KAIST, I'll be working as a Postdoctoral Research Associate at UNC Chapel Hill ( @UNC ) this fall, working with Prof. Mohit Bansal ( @mohitban47 ) and faculty+students in the awesome @uncnlp and @unccs groups! 1/3
Tweet media one
12
18
121
1
1
7
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers By selectively storing activations, Gradient Checkpointing reduces memory usage, enabling the training of larger models or bigger batch sizes or longer sequences on the same hardware. It's a trade-off: Lower memory usage comes at the cost of increased computation.
1
1
7
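In practice, turning GC on is usually a one-liner. Below is a sketch with Hugging Face transformers; the model id and the `use_cache` detail are assumptions that may vary across model classes and library versions.

```python
from transformers import AutoModelForCausalLM

# Sketch: enable gradient checkpointing on a causal LM (model id is illustrative).
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
model.gradient_checkpointing_enable()  # trade extra forward compute for lower activation VRAM
model.config.use_cache = False         # the generation KV cache is typically disabled with GC during training
```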
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 We perform extensive evaluation across diverse models like T5, T0, and LLaMA (ranging from 200M to 65B parameters). ComPEFT achieves staggering compression ratios of 8x to 50x while maintaining or even improving performance in many cases.
1
0
8
@prateeky2806
Prateek Yadav
3 months
Merging might not be perfect yet, but it has proven itself enough for people to test it for their use case. My bet is that in most cases it can save a ton of compute when trying to create specialized models via Full-FT or PEFT.
1
0
8
@prateeky2806
Prateek Yadav
11 months
In summary, TIES-Merging🪢 resolves interference when merging models across diverse settings (diff modalities, model sizes, architectures, fine-tuning). Check out the paper and the code of TIES-Merging for more details! 📜: 🖥️: n/n
1
1
7
@prateeky2806
Prateek Yadav
10 months
1. ExSSNeT: 2. Code-CL: Code-CL Thread 👇
@prateeky2806
Prateek Yadav
10 months
Ever wondered how to continually improve your code LLM? In our new #ACL2023nlp paper, we explore Continual Learning (CL) methods for code domain: CodeTask-CL benchmark & Prompt Pooling with Teacher Forcing for CL in code domain. @AmazonScience @uncnlp 🧵
Tweet media one
1
21
58
0
4
7
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 However, the size of these expert models presents challenges, especially when
1️⃣ Retrieving them over high-latency networks (say, the Internet)
2️⃣ Serving multiple experts on a single GPU
For example, QLoRA on LLaMA-65B is 3.2GB in size, which is similar to a full T5-Large model (3GB)
1
0
7
@prateeky2806
Prateek Yadav
3 years
Congratulations @jayleicn , @mohitban47 , @LINJIEFUN , @luowei_zhou , @zhegan4 , tlberg and jjliu. This is so amazing! 😍
@jayleicn
Jie Lei
3 years
Yay! We won the @CVPR 2021 Best Student Paper Honorable Mention! (Top 7 out of 7000 submissions 😍) @linjiefun @luowei_zhou @zhegan4 tlberg @mohitban47 jjliu @uncnlp @unccs @msftresearch
Tweet media one
10
12
163
0
0
7
@prateeky2806
Prateek Yadav
11 months
4⃣Branch-Train-Merge ( @margs_li , @ssgrn , @Tim_Dettmers et al) 5⃣RegMean ( @XisenJ et al) 6⃣Model Ratatouille ( @ramealexandre et al)
@ramealexandre
Alexandre Ramé
1 year
Ready to give your deep models a second life? Introducing model ♻️ recycling (), improving generalization by reusing weights fine-tuned on various vision tasks. Just like you recycle your bottles and cardboards, it's time to start recycling your models too!
5
16
83
1
1
7
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers Gradient Checkpointing is all about managing memory efficiently during training so that we can train bigger models with larger batch sizes and sequence lengths. When performing backprop on a model (say with 1B parameters), there are four major components to store:
1
0
6
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers 4) Model Activations: used in the chain rule for backprop -> depend mainly on the number of layers, the model's hidden dimensions, batch size, and sequence length. So for a given model, as the batch size and seq len increase, activations start to dominate the VRAM usage.
1
1
7
@prateeky2806
Prateek Yadav
7 months
Vanilla SMoE models often suffer from:
(a) High Memory Usage 📊 due to duplicated network layers
(b) Redundancy in Experts 🔄 from the learned routing policy
Can we merge and compress SMoE experts to make them more compact? 🤷
Tweet media one
1
1
7
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers Gradient Checkpointing (GC) comes to the rescue here. ♻️ Instead of storing all intermediate activations, GC stores a subset of them & performs partial forward passes from these cached acts to recompute the rest during backprop. A balancing act between computation & memory!
1
1
6
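A minimal sketch of that mechanism with `torch.utils.checkpoint` (assuming a recent PyTorch; the toy block is illustrative): activations inside the wrapped block are discarded after the forward pass and recomputed via a partial forward pass during backprop.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """A toy residual feed-forward block standing in for a transformer layer."""
    def __init__(self, dim=1024):
        super().__init__()
        self.ff = torch.nn.Sequential(torch.nn.Linear(dim, 4 * dim),
                                      torch.nn.GELU(),
                                      torch.nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

block = Block()
x = torch.randn(8, 1024, requires_grad=True)
# Activations inside `block` are not cached; they are recomputed during backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```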
@prateeky2806
Prateek Yadav
6 months
@natanielruizg @Google For people looking for a similar idea with openly available code: check out TIES-Merging, which will appear at NeurIPS '23
@prateeky2806
Prateek Yadav
11 months
Performance degrades when merging diff task-specific models into a multitask model? Presenting TIES-Merging🪢 We find signif *Interference* b/w model params & mitigate it 👉 improves both NLP & CV merging @dtredsox13 @LChoshen @colinraffel @mohitban47 🧵
Tweet media one
6
86
255
0
2
7
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 We find that the compressed models from ComPEFT lead to better-merged models and outperform strong baselines like Task Arithmetic and TIES-Merging, improving performance in 9/12 settings and leading to an improvement of 1.4 on average.
Tweet media one
1
0
6
@prateeky2806
Prateek Yadav
11 months
We argue that current merging methods fail to account for two major sources of interference: (a) redundant parameter values pulling the average to 0. (b) disagreement on the sign of a given parameter’s values across models.
Tweet media one
1
0
6
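A toy numeric illustration of the two interference sources above (the numbers are mine, not from the paper):

```python
import torch

# (a) Redundant parameters: only model A actually uses this weight,
# but averaging with two zero deltas drags 0.9 down to 0.3.
a = torch.tensor([0.9, 0.0, 0.0])
b = torch.zeros(3)
c = torch.zeros(3)
print(torch.stack([a, b, c]).mean(dim=0))   # tensor([0.3, 0.0, 0.0])

# (b) Sign disagreement: opposite signs nearly cancel when averaged.
d = torch.tensor([0.7])
e = torch.tensor([-0.6])
print(torch.stack([d, e]).mean(dim=0))      # tensor([0.05])
```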
@prateeky2806
Prateek Yadav
4 months
Congratulations to @pingzli on his first paper, and that too as a spotlight at an amazing venue like ICLR. cc @KyriectionZhang @yilin_sung @YuCheng3 @mohitban47 @TianlongChen4
1
0
6
@prateeky2806
Prateek Yadav
1 month
@PandaAshwinee I guess soon middle school and primary school kids will be submitting to NeurIPS because otherwise they won't be able to get into high school research because it's competitive, and so on... If something is not ideal, then it doesn't mean we should double down on it.
1
0
6
@prateeky2806
Prateek Yadav
8 months
Congratulations @mohitban47 ! Getting this award from @IITKanpur is huge and also very well deserved. 🎉
@mohitban47
Mohit Bansal
8 months
Honored and humbled to receive the @IITKanpur Young Alumnus Award from my alma mater, which has been an amazing source of mentors+friends+memories and important foundation/values 🙏 All the credit for this award belongs to my mentors, students, collaborators, family/friends ❤️
20
9
218
1
0
6
@prateeky2806
Prateek Yadav
7 months
@PontiEdoardo @ndaheim_ @tmoellenhoff @IGurevych @EmtiyazKhan Thanks for thoroughly analyzing the gradient mismatch problem! I was wondering if you considered comparing it with TIES-Merging (NeurIPS'23) as the main thesis there is also to resolve the interference between the task vectors (i.e. accumulated gradients)
@prateeky2806
Prateek Yadav
8 months
🎉 Thrilled to announce our paper on TIES-Merging🪢 has been accepted to @NeurIPSConf ! We've delved into the significant Interference between task-specific model parameters when merging and found a way to mitigate it, enhancing both NLP & CV. Stay tuned for more insights! 📄✨
5
26
157
1
0
6
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers 1) Model parameters -> take 1B*K bits for K-bit precision
2) Parameter gradients -> similar to the model weights and take 1B*K bits for K-bit precision
3) Optimizer states: used for tricks like momentum -> depends on the optimizer but typically takes the same as the model parameters, so 1B*K bits
1
1
5
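A quick back-of-the-envelope calculation for the 1B-parameter example, using the simplified "1B*K bits per component" accounting from the thread (real optimizers like Adam often keep extra fp32 states, so actual numbers can be higher):

```python
# Rough memory estimate for a 1B-parameter model at K = 16-bit precision.
params = 1e9
bits_per_value = 16
gib = lambda bits: bits / 8 / 2**30   # bits -> GiB

weights   = gib(params * bits_per_value)   # ~1.86 GiB
gradients = gib(params * bits_per_value)   # ~1.86 GiB
optimizer = gib(params * bits_per_value)   # ~1.86 GiB (per the simplification above)
print(weights + gradients + optimizer)     # ~5.6 GiB, before counting activations
```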
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 ComPEFT follows 3 simple steps:
1️⃣ Decompose the task vector into a sign and a magnitude (mag) vector
2️⃣ Sparsify the mag vector to keep only the top-k values and also drop the pruned indices from the sign vec
3️⃣ Multiply the sign vec by a scalar constant alpha * std(task vector)
Tweet media one
2
0
6
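A minimal sketch of those three steps as sparse ternary compression of a task vector; the `k` and `alpha` defaults are illustrative placeholders, not the paper's settings.

```python
import torch

def compeft_compress(task_vector: torch.Tensor, k: float = 0.05, alpha: float = 1.0):
    """Sketch of ComPEFT-style compression as described in the thread above."""
    sign = torch.sign(task_vector)                 # 1) sign vector
    mag = task_vector.abs()                        # 1) magnitude vector
    num_keep = max(1, int(k * task_vector.numel()))
    topk = torch.topk(mag.flatten(), num_keep).indices
    mask = torch.zeros_like(mag.flatten())
    mask[topk] = 1.0                               # 2) keep only the top-k magnitudes
    mask = mask.view_as(mag)
    scale = alpha * task_vector.std()              # 3) one scalar replaces all magnitudes
    return scale * sign * mask                     # ternary update: {-scale, 0, +scale}
```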
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers In most cases, these benefits often outweigh the costs, especially in scenarios with memory constraints. Let me know if I missed something!
1
0
5
@prateeky2806
Prateek Yadav
11 months
7⃣Task Arithmetic in the Tangent Space ( @gortizji et al) 8⃣LMC and LTH ( @jefrankle et al) 9⃣Role of Perm. Invariance in LMC ( @rahiment et al) 🔟Cold-Fusion ( @LChoshen et al)
@rahiment
RahimEntezari
3 years
🆕The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks Our conjecture: Taking permutations into account, there is likely no barrier in the linear interpolation between SGD solutions. w @HanieSedghi @osaukh @bneyshabur 1/10
Tweet media one
9
40
226
1
2
6
@prateeky2806
Prateek Yadav
11 months
Moreover, we observe that TIES-Merging🪢 (1) Improves out-of-domain performance significantly! (2) Scales better with more tasks! (3) Additional ablations confirm that all three steps are important.
Tweet media one
Tweet media two
Tweet media three
1
1
6
@prateeky2806
Prateek Yadav
3 months
@_akhaliq Definitely look at ComPEFT (released Nov'23) which does exactly what BigDelta does only with evaluation and for bigger models we don't even need any data.
@prateeky2806
Prateek Yadav
6 months
Presenting ComPEFT 🗜! We compress parameter updates to facilitate efficient communication of expert models for compositional generalization. ComPEFT improves perf. 📈, while reducing storage/communication costs 📉 @LChoshen @colinraffel @mohitban47 🧵
Tweet media one
8
56
234
0
0
5
@prateeky2806
Prateek Yadav
11 months
We propose Trim, Elect Sign & Merge (TIES-Merging🪢), which introduces 3 new steps when merging:
1⃣Resetting weights that changed a small amount during fine-tuning.
2⃣Resolving sign conflicts.
3⃣Merging only the weights that are in alignment with the final agreed-upon sign.
Tweet media one
Tweet media two
1
1
5
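A minimal sketch of those three steps on a list of task vectors (fine-tuned weights minus base weights); `density` and `lam` are illustrative, not the paper's tuned values.

```python
import torch

def ties_merge(task_vectors, density=0.2, lam=1.0):
    """Sketch of trim -> elect sign -> disjoint merge, per the steps above."""
    trimmed = []
    for tv in task_vectors:
        # 1) Trim: reset weights that changed only a small amount during fine-tuning
        k = max(1, int(density * tv.numel()))
        thresh = tv.abs().flatten().topk(k).values.min()
        trimmed.append(torch.where(tv.abs() >= thresh, tv, torch.zeros_like(tv)))
    stacked = torch.stack(trimmed)
    # 2) Elect sign: majority sign per parameter, weighted by total magnitude
    elected_sign = torch.sign(stacked.sum(dim=0))
    # 3) Disjoint merge: average only entries agreeing with the elected sign
    agree = (torch.sign(stacked) == elected_sign) & (stacked != 0)
    merged = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
    return lam * merged

# usage sketch: merged_model = base + ties_merge([ft1 - base, ft2 - base])
```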
@prateeky2806
Prateek Yadav
11 months
TIES-Merging🪢 outperforms all other methods in diverse settings covering a range of modalities, domains, model sizes, architectures, and fine-tuning settings! It even works on parameter-efficient FT!
Tweet media one
1
0
5
@prateeky2806
Prateek Yadav
7 months
@iclr_conf 📝 LUMOS: Towards Language Agents that are Unified, Modular, and Open Source TL;DR: Offers a modular architecture for task decomposition, grounding, and execution, leveraging open-source LLMs, showing competitive performance on interactive tasks. 🔗:
1
0
5
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 Moreover, ComPEFT applied to LoRA and (IA)^3 is Pareto-optimal in terms of storage costs vs. performance compared to a wide range of existing PEFT methods.
Tweet media one
1
0
6
@prateeky2806
Prateek Yadav
3 months
I feel like merging can also play an even bigger role in continued pretraining, which is unexplored, but that is where big savings lie for people who are willing to explore it more.
1
0
5
@prateeky2806
Prateek Yadav
11 months
We further analyze the impact of different types of interference on model parameters, highlight the importance of having correct signs, and show that estimating the signs using the validation data could further improve performance.
Tweet media one
Tweet media two
Tweet media three
1
0
5
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 Furthermore, Compressed ComPEFT checkpoints perform similarly to the original uncompressed checkpoints when performing few-shot compositional generalization on the Big-Bench-Hard benchmark via LoraHub.
Tweet media one
1
0
5
@prateeky2806
Prateek Yadav
3 years
Congratulations @svjan5 on the award. I have witnessed your hard work and dedication while collaborating with you. There are a lot of good things yet to come your way!
@Indiaacm
ACM India
3 years
Shikhar V. Chosen as Recipient of 2021 ACM India Doctoral Dissertation Award for "Neural Graph Embedding Methods for Natural Language Processing." He was advised by Prof @partha_p_t and Chiranjib Bhattacharyya. Details at #ACMIndia #DoctoralDissertation
Tweet media one
7
7
54
0
0
5
@prateeky2806
Prateek Yadav
5 months
@LChoshen Not just unofficial implementations but many top models on the HF openllm leaderboard were created using TIES-Merging. @Weyaxi and many others use it frequently. Also it is integrated in many GitHub repos on merging.
1
0
4
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 We perform ablations on ComPEFT to show that all steps are necessary and that it outperforms strong baselines like STC from Federated Learning.
Tweet media one
1
0
6
@prateeky2806
Prateek Yadav
11 months
11. Multimodal Model Merging ( @yilin_sung et al)
@yilin_sung
Yi Lin Sung
1 year
Can we MERGE weights of different MODALITIES? The answer is no using naive merging. However we find an effective recipe for improving merging results significantly in “An Empirical Study of Multimodal Model Merging” 🧵👇 @linjiefun @zhegan4 @mohitban47
Tweet media one
1
44
147
0
1
5
@prateeky2806
Prateek Yadav
11 months
Model merging methods combine multiple task-specific models into 👉 one multitask model without more training. However, the weights of different models might interfere with each other, which we find can significantly harm multitask performance!
1
0
5
@prateeky2806
Prateek Yadav
7 months
Extensive analysis of our design choices highlights the best practices for merging experts. We find that (1) adaptive layerwise budget allocation, (2) router logits-based similarity, and (3) activation frequency-based expert merging with knowledge distillation work best.
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
0
4
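As one reading of "activation frequency-based expert merging", here is a sketch that weights each expert by how often the router activates it; this is my simplification and omits the budget allocation and distillation steps mentioned above.

```python
import torch

def frequency_weighted_merge(experts, activation_freqs):
    """Merge a group of experts into one, weighting by routing frequency.
    `experts`: list of same-shape weight tensors; `activation_freqs`: counts of
    how often the router picked each expert (illustrative inputs)."""
    freqs = torch.tensor(activation_freqs, dtype=torch.float32)
    weights = freqs / freqs.sum()
    stacked = torch.stack(experts)                      # (num_experts, ...)
    shape = (-1,) + (1,) * (stacked.dim() - 1)
    return (weights.view(shape) * stacked).sum(dim=0)   # weighted average of expert weights
```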
@prateeky2806
Prateek Yadav
3 months
2
0
4
@prateeky2806
Prateek Yadav
7 months
@iclr_conf 📝 CCA Merge: Merging Many Neural Networks with Canonical Correlation Analysis TL;DR: Novel fusion method using Canonical Correlation Analysis to merge many models into one with lower accuracy drops than past methods. 🔗:
1
0
4
@prateeky2806
Prateek Yadav
10 months
@Yampeleg What context length?
1
0
4
@prateeky2806
Prateek Yadav
3 years
Interesting work on bridging neural and symbolic techniques for generating proof graphs. We will be happy to chat/answer any questions you might have!
@swarnaNLP
Swarnadeep Saha
3 years
👇Happening today 9.20-10.40pm ET at #NAACL2021 session 10E: Question Answering/Interpretability. Talk video & session links below! Happy to chat about proof-graph set generation for explaining compositional/multi-hop reasoning.
1
3
10
0
0
4