Prateek Yadav Profile
Prateek Yadav

@prateeky2806

1,555
Followers
1,588
Following
51
Media
737
Statuses

Ph.D. at @unccs . Continual Model Adaptation and Composition. Previously @MSFTResearch , @AmazonScience , @iitmadras . UG @iiscbangalore . Opinions are my own.

North Carolina
Joined July 2014
Pinned Tweet
@prateeky2806
Prateek Yadav
6 months
Presenting ComPEFT 🗜! We compress parameter updates to facilitate efficient communication of expert models for compositional generalization. ComPEFT improves perf. 📈, while reducing storage/communication costs 📉 @LChoshen @colinraffel @mohitban47 🧵
Tweet media one
8
56
234
@prateeky2806
Prateek Yadav
7 months
Gradient Checkpointing (GC) is a hidden gem that most people take for granted. However, it has a crazy impact on reducing VRAM. @yilin_sung and I profiled the activation memory used by the LLaMA-7B model and the impact is huge! 🧵Find out more about GC 👇 cc @Tim_Dettmers
Tweet media one
7
52
276
@prateeky2806
Prateek Yadav
7 months
🚀Struggling with Memory issues in MoE models?😭 Introducing...✨MC-SMoE✨ We merge experts THEN compress/decompose merged experts➡️low-rank. Up to 80% mem reduction! 🎉 w/ @pingzli @KyriectionZhang @yilin_sung @YuCheng3 @mohitban47 @TianlongChen4 🧵👇
Tweet media one
4
75
259
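To make the "merge experts, then compress to low-rank" recipe above concrete, here is a minimal sketch; the uniform averaging, the `rank` value, and the function name are illustrative simplifications (MC-SMoE itself merges based on routing statistics rather than a plain mean).

```python
import torch

def merge_then_lowrank(experts, rank=64):
    """Sketch of 'merge similar experts, then low-rank decompose the merged expert'.
    `experts` is a list of same-shape 2D weight tensors; `rank` is illustrative."""
    merged = torch.stack(experts).mean(dim=0)           # merge the experts' weights
    U, S, Vh = torch.linalg.svd(merged, full_matrices=False)
    A = U[:, :rank] * S[:rank]                          # low-rank factors: merged ≈ A @ B
    B = Vh[:rank, :]
    return A, B                                         # store A, B instead of the full matrix
```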
@prateeky2806
Prateek Yadav
11 months
Performance degrades when merging diff task-specific models into a multitask model? Presenting TIES-Merging🪢 We find signif *Interference* b/w model params & mitigate it 👉 improves both NLP & CV merging @dtredsox13 @LChoshen @colinraffel @mohitban47 🧵
Tweet media one
6
86
255
@prateeky2806
Prateek Yadav
4 months
🎉 Thrilled to announce our MoE Expert Merging paper has been accepted to @iclr_conf as a Spotlight paper! We reduce the inference memory cost of MoE models by utilizing routing-statistics-based merging of experts to achieve up to 80% memory and 20% FLOPs reduction. 📷
@prateeky2806
Prateek Yadav
7 months
🚀Struggling with Memory issues in MoE models?😭 Introducing...✨MC-SMoE✨ We merge experts THEN compress/decompose merged experts➡️low-rank. Up to 80% mem reduction! 🎉 w/ @pingzli @KyriectionZhang @yilin_sung @YuCheng3 @mohitban47 @TianlongChen4 🧵👇
Tweet media one
4
75
259
8
27
168
@prateeky2806
Prateek Yadav
7 months
🔍 A thread on the latest @iclr_conf 2024 papers on - Mixture of Experts - Modular Models - Compositional Generalization - and related topics: Dive into the latest papers from #ICLR2024 through the list below! Let me know if I missed some relevant papers. [🧵Thread ⬇️]
4
20
164
@prateeky2806
Prateek Yadav
3 months
I am not sure who is still a disbeliever and needs to hear it. If you are not using MODEL MERGING for either pretraining/continued-finetuning/adapting your models then you are wasting a lot of compute which costs you direct $$$ 🧵
Tweet media one
3
26
166
@prateeky2806
Prateek Yadav
7 months
A very nice visual explanation of how Gradient Checkpointing works is in this blog post by @yaroslavvb . A brief summary from the blog on how GC stores some activations and uses partial forward passes for backprop. (Visualizations are from the blog)
@prateeky2806
Prateek Yadav
7 months
Gradient Checkpointing (GC) is a hidden gem that most people take for granted. However, it has a crazy impact on reducing VRAM. @yilin_sung and I profiled the activation memory used by the LLaMA-7B model and the impact is huge! 🧵Find out more about GC 👇 cc @Tim_Dettmers
Tweet media one
7
52
276
1
23
160
@prateeky2806
Prateek Yadav
8 months
🎉 Thrilled to announce our paper on TIES-Merging🪢 has been accepted to @NeurIPSConf ! We've delved into the significant Interference between task-specific model parameters when merging and found a way to mitigate it, enhancing both NLP & CV. Stay tuned for more insights! 📄✨
@prateeky2806
Prateek Yadav
11 months
Performance degrades when merging diff task-specific models into a multitask model? Presenting TIES-Merging🪢 We find signif *Interference* b/w model params & mitigate it 👉 improves both NLP & CV merging @dtredsox13 @LChoshen @colinraffel @mohitban47 🧵
Tweet media one
6
86
255
5
26
157
@prateeky2806
Prateek Yadav
7 months
🔍 Searching for @iclr_conf 2024 papers on Model Merging/Fusion & related topics: Dive into the latest advancements in model merging, fusion, and weight interpolations from #ICLR2024 through the list below! Let me know if I missed some relevant papers. [Thread ⬇️]
4
16
81
@prateeky2806
Prateek Yadav
2 years
Can we avoid forgetting while achieving forward transfer in continual learning, by finding and training subnetworks? Check out “Exclusive Supermask Subnetwork Training for Continual Learning” which works well for both vision & NLP. @mohitban47 @uncnlp 🧵
Tweet media one
3
20
82
@prateeky2806
Prateek Yadav
4 months
🎉 When pruning datasets, there is a trade-off between selecting Diverse and Difficult samples. Proud to announce that our paper D2-Pruning has been accepted to ICLR'24 @iclr_conf and uses message passing on a dataset graph to effectively navigate this trade-off 📷
@adyasha10
Adyasha Maharana
7 months
How to select important+diverse training data under a fixed data budget? 📢"D2 Pruning" --> represent datasets as sparse undirected graph & perform forward+reverse message passing to select both difficult & diverse samples. @prateeky2806 @mohitban47 🧵
Tweet media one
2
36
127
1
14
72
@prateeky2806
Prateek Yadav
10 months
Ever wondered how to continually improve your code LLM? In our new #ACL2023nlp paper, we explore Continual Learning (CL) methods for code domain: CodeTask-CL benchmark & Prompt Pooling with Teacher Forcing for CL in code domain. @AmazonScience @uncnlp 🧵
Tweet media one
1
21
58
@prateeky2806
Prateek Yadav
7 months
Check out the camera-ready version of TIES-Merging to be presented at @NeurIPSConf 2023! We have added more experiments on:
1. Merging for robustness on a single task.
2. Merging for better initialization and finetuning.
3. We show that interference exists even when merging…
Tweet media one
Tweet media two
Tweet media three
Tweet media four
@prateeky2806
Prateek Yadav
11 months
Performance degrades when merging diff task-specific models into a multitask model? Presenting TIES-Merging🪢 We find signif *Interference* b/w model params & mitigate it 👉 improves both NLP & CV merging @dtredsox13 @LChoshen @colinraffel @mohitban47 🧵
Tweet media one
6
86
255
4
7
49
@prateeky2806
Prateek Yadav
8 months
Whoever is handling the @iclr_conf account this year has a good sense of humor. I am loving these random short tweets and replies. It's making the process fun.
@iclr_conf
ICLR 2024
8 months
@_vaishnavh We strive to vibe.
0
2
44
2
0
47
@prateeky2806
Prateek Yadav
7 months
Check out our new work on dataset pruning that balances both the difficulty of samples and their diversity. We employ message passing on the dataset graph to select a dataset subset. Some interesting findings here! 👇
@adyasha10
Adyasha Maharana
7 months
How to select important+diverse training data under a fixed data budget? 📢"D2 Pruning" --> represent datasets as sparse undirected graph & perform forward+reverse message passing to select both difficult & diverse samples. @prateeky2806 @mohitban47 🧵
Tweet media one
2
36
127
0
8
44
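As a rough illustration of the graph-based "difficult + diverse" selection idea in the thread above, here is a greedy sketch; the kNN graph construction, the `gamma` decay, and all hyperparameters are my own simplifications, not the actual D2-Pruning algorithm.

```python
import torch

def graph_coreset_select(embeddings, difficulty, budget, k=10, gamma=0.5):
    """Greedy sketch: pick high-difficulty samples, then down-weight their graph
    neighbors so the selection stays diverse. Purely illustrative."""
    dists = torch.cdist(embeddings, embeddings)             # pairwise distances
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]   # sparse kNN graph (drop self)
    scores = difficulty.clone().float()
    selected = []
    for _ in range(budget):
        i = int(torch.argmax(scores))
        selected.append(i)
        scores[i] = float("-inf")
        scores[knn[i]] *= gamma    # discourage picking near-duplicates of the chosen sample
    return selected
```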
@prateeky2806
Prateek Yadav
4 months
It is amazing to see that model merging is finally getting more traction and people are extracting value out of it. Moreover, there are resources like mergekit that make it easier to quickly try out these methods. Everyone who is building any model should try merging as it…
@julien_c
Julien Chaumond
4 months
I was collating a basic collection of *Papers on model merging*: Then I found out that @osanseviero had already done a much more complete one 😈 So here goes =>
5
24
164
1
6
40
@prateeky2806
Prateek Yadav
5 months
The FOMO of missing #NeurIPS2023 despite having two papers (TIES-Merging and SeViLA) is real. However, if you want to chat or are looking for interns to work on MoE models, Instruction Tuning, Continuously updating LLMs, PEFT methods, or Model Merging, then do reach out! 🧵
1
4
37
@prateeky2806
Prateek Yadav
8 months
Thrilled to announce another paper that has been accepted to @NeurIPSConf ! In SeViLA, we self-chain BLIP2 for temporal localization and question answering on video & get new SOTA on multiple videoQA benchmarks.
@shoubin621
Shoubin Yu
1 year
🚨Can we self-refine a single image-LM for both language-aware keyframe localization & QA on videos? (=sota on multiple datasets)🚨 “Self-Chained Image-Language Model for Video Localization and Question Answering” @jmin__cho @prateeky2806 @mohitban47 🧵
Tweet media one
4
41
155
2
6
37
@prateeky2806
Prateek Yadav
1 month
Pretty strong results on long context tasks like passkey detection and book summarisation. Glad to be interning with @TsendeeMTS @tuvllms at Google this summer. Looking forward to doing something great soon.
@arankomatsuzaki
Aran Komatsuzaki
1 month
Google presents Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention 1B model that was fine-tuned on up to 5K sequence length passkey instances solves the 1M length problem
Tweet media one
28
262
1K
2
3
36
@prateeky2806
Prateek Yadav
3 months
It's great to see that research on continual & collaborative model training, to which I have dedicated 3 years, is becoming relevant in industry. @JeffDean publicly advocating for systems like this is encouraging. Multiple research topics directly fit into this vision. 🧵
@SurakshaPinnuET
Suraksha P
3 months
. @JeffDean , chief scientist, @Google DeepMind and Google Research, Grace Chung, Site Lead, and Engineering Director, Google Australia, and @ManishGuptaMG1 , Director of Google Research India talk about large language models and the current AI landscape. @ETtech @EconomicTimes
2
11
37
1
6
29
@prateeky2806
Prateek Yadav
10 months
I'm finally in Toronto & presenting my 2 Continual-Learning #ACL2023nlp papers: 1) Our ExSSNeT paper was accepted to Findings and will be presented as a spotlight on Mon 7pm and a poster on Tue 11am. 2) Our Code-CL paper is in the main conf. on Wed 11am. @mohitban47 @AmazonScience @uncnlp 🧵
@prateeky2806
Prateek Yadav
2 years
Can we avoid forgetting while achieving forward transfer in continual learning, by finding and training subnetworks? Check out “Exclusive Supermask Subnetwork Training for Continual Learning” which works well for both vision & NLP. @mohitban47 @uncnlp 🧵
Tweet media one
3
20
82
1
10
25
@prateeky2806
Prateek Yadav
3 months
Git-Theta (ICML'23) proposed a practical library for actually building models collaboratively like this. It might be useful to discuss it in the paper.
@pulkitology
Pulkit Agrawal
3 months
Presenting a method for training models from SCRATCH using LoRA: 💡20x reduction in communication 💡3x savings in memory - Find out more: - Code available to try out - Scaling to larger models ongoing - led by Jacob Huh!
Tweet media one
6
57
387
1
3
24
@prateeky2806
Prateek Yadav
3 years
@timnitGebru Can you please also tell us what #1 and #2 were so that everyone gets a better understanding?
0
0
21
@prateeky2806
Prateek Yadav
3 months
I would say TIES with DARE might be the most useful. There might be better ways to merge but they are just more complex to use and might need more data, training, gradients, etc. TIES with merging weight=1 also works pretty well.
@rasbt
Sebastian Raschka
3 months
As an LLM finetuner, I recently started getting into model merging. I wrote up a short tutorial on linear merging to introduce the topic: Btw does anyone happen to have good examples of LLMs that work well when merged via linear merging? And for…
10
153
809
1
4
22
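For readers unfamiliar with the DARE step mentioned above, here is a minimal sketch of its drop-and-rescale idea applied to a task vector (the delta between a fine-tuned model and the base model); the `drop_rate` is illustrative, and this omits the subsequent TIES-style sign election and merging.

```python
import torch

def dare(task_vector: torch.Tensor, drop_rate: float = 0.9):
    """Randomly drop most delta parameters and rescale the survivors so the
    expected update is preserved. A sketch, not any library's exact recipe."""
    keep = (torch.rand_like(task_vector) > drop_rate).float()
    return task_vector * keep / (1.0 - drop_rate)
```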
@prateeky2806
Prateek Yadav
4 months
It's fascinating yet concerning how we struggle to efficiently access existing knowledge. Even for the most world-class researchers, it is hard to find knowledge that is well documented in the form of scientific publications, even in highly active areas like Mixture of Experts (MoE).…
Tweet media one
Tweet media two
Tweet media three
1
1
19
@prateeky2806
Prateek Yadav
8 months
Eye-opening!🤯 Even after attempting to delete sensitive information from LLMs, a significant portion can still be recovered via extraction attacks. Moreover, this study presents new defense mechanisms to reduce such recovery to only 2%. Be cautious when deleting info from LLMs!
@vaidehi_patil_
Vaidehi Patil
8 months
🚨Can Sensitive Information Be Deleted From LLMs? We show that extraction attacks recover 18-38% of "deleted" knowledge! Our attack+defense framework has whitebox+blackbox attacks. New defense objectives lower attacks to 2%! @peterbhase @mohitban47 🧵
Tweet media one
1
65
265
1
4
19
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 We introduce ComPEFT, a novel method that significantly compresses the fine-tuning residuals (task vectors) of PEFT-based models. This is achieved through sparsification and quantization techniques. And guess what? No additional retraining is needed!
Tweet media one
1
1
19
@prateeky2806
Prateek Yadav
3 months
I am not sure why I missed this OG thread, which was before ChatGPT, before MoEs were cool, and even before anyone thought merging would work and be a thing.
@prateeky2806
Prateek Yadav
3 months
I am not sure who is still a disbeliever and needs to hear it. If you are not using MODEL MERGING for either pretraining/continued-finetuning/adapting your models then you are wasting a lot of compute which costs you direct $$$ 🧵
Tweet media one
3
26
166
0
2
15
@prateeky2806
Prateek Yadav
3 years
Had a great experience working with my colleagues on this paper. Check it out!
@swarnaNLP
Swarnadeep Saha
3 years
Excited to share our new work on "ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning"! Has been a long effort and a great learning experience too 🙂 Joint work w. @prateeky2806 @lbauer119 @mohitban47 @uncnlp Paper: 1/5
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
28
89
1
1
14
@prateeky2806
Prateek Yadav
7 months
As many people must be waiting: ICLR-submitted papers are now available at
1
2
13
@prateeky2806
Prateek Yadav
1 year
I'm at @NeurIPSConf till Friday, if you're interested in continual learning, model merging/adaptation, sparsity for CL, or mixture of expert models, let's chat. Shoot me a DM
0
1
13
@prateeky2806
Prateek Yadav
3 months
Experiments with model merging are ongoing, and many merges rank high on HF leaderboards. People use TIES-Merging ( @prateeky2806 ), DARE-TIES (Le Yu), or SLERP methods to do merging. There are >130 models on HF built using TIES, which are downloaded 1000s of times/week
Tweet media one
1
0
13
@prateeky2806
Prateek Yadav
3 months
@tianle_cai Definitely look at ComPEFT (released Nov'23) which does exactly what BigDelta tries but with only evaluation and for bigger models this can be done without examples. ComPEFT is definitely worth citing if not comparing against it.
@prateeky2806
Prateek Yadav
6 months
Presenting ComPEFT 🗜! We compress parameter updates to facilitate efficient communication of expert models for compositional generalization. ComPEFT improves perf. 📈, while reducing storage/communication costs 📉 @LChoshen @colinraffel @mohitban47 🧵
Tweet media one
8
56
234
1
0
13
@prateeky2806
Prateek Yadav
1 year
How to refine a single image-LM for language-aware keyframe localization & video QA, reaching Sota on multiple datasets?🎯 Introducing SeViLA! By Self-Chaining BLIP-2, we streamline 2-stage inference (localize+QA) & refine localization via QA feedback.🎥🔎💡 👇
@shoubin621
Shoubin Yu
1 year
🚨Can we self-refine a single image-LM for both language-aware keyframe localization & QA on videos? (=sota on multiple datasets)🚨 “Self-Chained Image-Language Model for Video Localization and Question Answering” @jmin__cho @prateeky2806 @mohitban47 🧵
Tweet media one
4
41
155
0
3
13
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 PEFT methods like LoRA and (IA)^3 make it possible to efficiently adapt LLMs to create expert models that specialize in new tasks or domains. Compositional generalization and model merging compose these expert models to improve zero/few-shot generalization on unseen tasks.
Tweet media one
1
1
15
@prateeky2806
Prateek Yadav
3 years
I totally disagree; there is more to life than work and research. If we have to work 90 hours/week, then when will we enjoy our lives and do other meaningful things? There needs to be a balance, which I intend to maintain as well.
@gdb
Greg Brockman
3 years
Agreed:
Tweet media one
98
104
987
0
0
12
@prateeky2806
Prateek Yadav
3 years
Check out our new work on generating multiple proof graphs for compositional reasoning to be presented at #NAACL2021 next Tuesday.
@swarnaNLP
Swarnadeep Saha
3 years
New #NAACL2021 paper (next Tues) on explaining compositional reasoning w/ multiple proof graphs "multiPRover: Generating Multiple Proofs for Improved Interpretability in Rule Reasoning" Paper: Code: @prateeky2806 @mohitban47 1/n
Tweet media one
Tweet media two
Tweet media three
1
18
40
0
4
12
@prateeky2806
Prateek Yadav
3 years
Check out our updated paper and website release of ExplaGraphs.
@swarnaNLP
Swarnadeep Saha
3 years
ExplaGraphs (to be presented at #EMNLP2021 ): Check out our website & new version with more+refined graph data, new structured models, new metrics (like graph-editdistance + graph-bertscore) & human eval + human-metric correlation😀
Tweet media one
Tweet media two
Tweet media three
2
16
52
0
2
11
@prateeky2806
Prateek Yadav
6 months
Colin is an amazing advisor and a pleasure to work with. Definitely apply if you are interested in the listed topics.
@colinraffel
Colin Raffel
6 months
Also, I am 1000% hiring PhD students this round! If you want to work on - open models - collaborative/decentralized training - building models like OSS - coordinating model ecosystems - mitigating risks you should definitely apply! Deadline is Friday 😬
12
75
458
0
0
11
@prateeky2806
Prateek Yadav
2 months
There are not many good LLMs for healthcare applications, which is a challenging problem. This is truly an amazing feat: certified doctors rate the Polaris model similar to or better than nurses across multiple dimensions. Crazy!
@subho_mpi
Subhabrata Mukherjee
2 months
When we started building a safety-focused LLM for healthcare a year back, a result like this was beyond imagination. We are excited to share some of the technical and a lot of the clinical considerations that went into building #Polaris in our 53-page technical report available…
Tweet media one
0
4
29
0
2
6
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 When applied to the LLaMA model (7B - 65B), ComPEFT-QLoRA outperformed QLoRA by 4.16% on the MMLU benchmark, with an impressive reduction in storage size of up to 26x. This is a significant gain in terms of efficiency and performance!
Tweet media one
1
1
12
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 In summary, ComPEFT compresses parameter updates via sparse ternary compression to facilitate efficient communication and retrieval of expert models. Check out the paper and the code of ComPEFT for more details! 📜 : 🖥️ :
1
1
11
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers With GC, the activation VRAM reduces to just 1.6 GB 🤯 but the backward pass takes 2 seconds. When combining GC with flash attn., the activation VRAM usage is ~0.7 GB and the time is 1.7 seconds.
1
0
8
@prateeky2806
Prateek Yadav
4 months
@ekzhang1 They just want your resume for training data
0
0
9
@prateeky2806
Prateek Yadav
11 months
Shoutout to several prev/related works in the model merging community👇 1⃣Task Arithmetic ( @gabriel_ilharco , @Mitchnw et al) 2⃣Fisher Merging (michael_matena et al) 3⃣Git Re-Basin ( @SamuelAinsworth et al)
@SamuelAinsworth
Samuel "curry-howard fanboi" Ainsworth
2 years
📜🚨📜🚨 NN loss landscapes are full of permutation symmetries, ie. swap any 2 units in a hidden layer. What does this mean for SGD? Is this practically useful? For the past 5 yrs these Qs have fascinated me. Today, I am ready to announce "Git Re-Basin"!
63
586
3K
1
1
9
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers GC + Flash Attn is 20% slower compared to not using these tricks but reduces the VRAM requirement by >95% when using batch size 1 and seq len 1400. Without GC, activations use ~39 GB of VRAM and a backward pass takes about 1.4 sec.
1
0
7
@prateeky2806
Prateek Yadav
7 months
@iclr_conf 📝 Emergent Mixture-of-Experts: Can Dense Pre-trained Transformers Benefit from Emergent Modular Structures? TL;DR: Transforming a pre-trained dense model into a modular one can alleviate negative transfer and enhance both ID and OOD capabilities. 🔗:
1
0
9
@prateeky2806
Prateek Yadav
7 months
We find several exciting and strong results. ✅ MC-SMoE gives up to 80% memory savings! ✅ 20% reduction in FLOPs! ✅ Virtually no performance loss! Tested across 8 benchmarks and compared with 6 baseline methods! 📊
Tweet media one
1
0
8
@prateeky2806
Prateek Yadav
3 months
Some major releases in the PEFT library. Thanks @sourab_m , these updates are super useful.
0
0
8
@prateeky2806
Prateek Yadav
3 years
The amount of uncertainty in such situations is traumatizing at the very least. It changes your views on life and inherent beliefs. Many things in my life have lost the false importance they had.
@SnehaAnnavarapu
Sneha Annavarapu
3 years
My covid +ve mother is in the hospital, my father (who is also covid +ve) is driving my covid +ve grandmother to a hospital...& I’m sitting here in Chicago, fully vaccinated, starting into space, calling people, giving instructions, feeling guilty, tired & useless, venting.
542
997
13K
0
1
8
@prateeky2806
Prateek Yadav
11 months
Interesting work on understanding when a big teacher model can help a smaller student model.
@swarnaNLP
Swarnadeep Saha
11 months
Can LLMs Teach Weaker Agents? Aligned teachers can intervene w/ free-text explanations using Theory of Mind (ExpUtility+Personalization) to improve students on future unexplained data🙂 Misaligned teachers hurt students😢 w/ @peterbhase @mohitban47 🧵👇
Tweet media one
2
65
187
1
3
8
@prateeky2806
Prateek Yadav
7 months
Shout out to @pingzli for the awesome effort and leading this work! Check out the paper and our code for more insights! 📚 Arxiv: Our code is also publicly available 👉
0
0
8
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 We also find that ComPEFT improves with scale. This means stronger models not only become more compressible but also show better performance post-compression.
Tweet media one
1
0
8
@prateeky2806
Prateek Yadav
11 months
Looking forward to our collaboration! 🎉 Welcome to UNC.
@jaeh0ng_yoon
Jaehong Yoon
11 months
😍I'm super excited to announce my next journey! After a great time at KAIST, I'll be working as a Postdoctoral Research Associate at UNC Chapel Hill ( @UNC ) this fall, working with Prof. Mohit Bansal ( @mohitban47 ) and faculty+students in the awesome @uncnlp and @unccs groups! 1/3
Tweet media one
12
18
121
1
1
7
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers By selectively storing activations, Gradient Checkpointing reduces memory usage, enabling the training of larger models or bigger batch sizes or longer sequences on the same hardware. It's a trade-off: Lower memory usage comes at the cost of increased computation.
1
1
7
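In practice, turning GC on is usually a one-liner. Below is a sketch with Hugging Face transformers; the model id and the `use_cache` detail are assumptions that may vary across model classes and library versions.

```python
from transformers import AutoModelForCausalLM

# Sketch: enable gradient checkpointing on a causal LM (model id is illustrative).
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
model.gradient_checkpointing_enable()  # trade extra forward compute for lower activation VRAM
model.config.use_cache = False         # the generation KV cache is typically disabled with GC during training
```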
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 We perform extensive evaluation across diverse models like T5, T0, and LLaMA (ranging from 200M to 65B parameters). ComPEFT achieves staggering compression ratios of 8x to 50x while maintaining or even improving performance in many cases.
1
0
8
@prateeky2806
Prateek Yadav
3 months
Merging might not be perfect yet, but it has proven itself enough for people to test it for their use case. My bet is that in most cases it can save a ton of compute when trying to create specialized models via Full-FT or PEFT.
1
0
8
@prateeky2806
Prateek Yadav
11 months
In summary, TIES-Merging🪢 resolves interference when merging models across diverse settings (diff modalities, model sizes, architectures, fine-tuning). Check out the paper and the code of TIES-Merging for more details! 📜: 🖥️: n/n
1
1
7
@prateeky2806
Prateek Yadav
10 months
1. ExSSNeT: 2. Code-CL: Code-CL Thread 👇
@prateeky2806
Prateek Yadav
10 months
Ever wondered how to continually improve your code LLM? In our new #ACL2023nlp paper, we explore Continual Learning (CL) methods for code domain: CodeTask-CL benchmark & Prompt Pooling with Teacher Forcing for CL in code domain. @AmazonScience @uncnlp 🧵
Tweet media one
1
21
58
0
4
7
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 However, the size of these expert models presents challenges, especially when
1️⃣ Retrieving them over high-latency networks (say, the Internet)
2️⃣ Serving multiple experts on a single GPU
For example, QLoRA on LLaMA-65B is 3.2GB in size, which is similar to a full T5-Large model (3GB)
1
0
7
@prateeky2806
Prateek Yadav
3 years
Congratulations @jayleicn , @mohitban47 , @LINJIEFUN , @luowei_zhou , @zhegan4 , tlberg and jjliu. This is so amazing! 😍
@jayleicn
Jie Lei
3 years
Yay! We won the @CVPR 2021 Best Student Paper Honorable Mention! (Top 7 out of 7000 submissions 😍) @linjiefun @luowei_zhou @zhegan4 tlberg @mohitban47 jjliu @uncnlp @unccs @msftresearch
Tweet media one
10
12
163
0
0
7
@prateeky2806
Prateek Yadav
11 months
4⃣Branch-Train-Merge ( @margs_li , @ssgrn , @Tim_Dettmers et al) 5⃣RegMean ( @XisenJ et al) 6⃣Model Ratatouille ( @ramealexandre et al)
@ramealexandre
Alexandre Ramé
1 year
Ready to give your deep models a second life? Introducing model ♻️ recycling (), improving generalization by reusing weights fine-tuned on various vision tasks. Just like you recycle your bottles and cardboards, it's time to start recycling your models too!
5
16
83
1
1
7
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers Gradient Checkpointing is all about managing memory efficiently during training so that we can train bigger models with larger batch sizes and sequence lengths. When performing backprop on a model (say with 1B parameters), there are four major components to store:
1
0
6
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers 4) Model Activations: used in the chain rule for backprop -> depend mainly on the number of layers, the model's hidden dimensions, batch size, and sequence length. So for a given model, as the batch size and seq len increase, activations start to dominate the VRAM usage.
1
1
7
@prateeky2806
Prateek Yadav
7 months
Vanilla SMoE models often suffer from:
(a) High Memory Usage 📊 due to duplicated network layers
(b) Redundancy in Experts 🔄 from the learned routing policy
Can we merge and compress SMoE experts to make them more compact? 🤷
Tweet media one
1
1
7
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers Gradient Checkpointing (GC) comes to the rescue here. ♻️ Instead of storing all intermediate activations, GC stores a subset of them & performs partial forward passes from these cached acts to recompute the rest during backprop. A balancing act between computation & memory!
1
1
6
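A minimal sketch of that mechanism with `torch.utils.checkpoint` (assuming a recent PyTorch; the toy block is illustrative): activations inside the wrapped block are discarded after the forward pass and recomputed via a partial forward pass during backprop.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """A toy residual feed-forward block standing in for a transformer layer."""
    def __init__(self, dim=1024):
        super().__init__()
        self.ff = torch.nn.Sequential(torch.nn.Linear(dim, 4 * dim),
                                      torch.nn.GELU(),
                                      torch.nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

block = Block()
x = torch.randn(8, 1024, requires_grad=True)
# Activations inside `block` are not cached; they are recomputed during backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```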
@prateeky2806
Prateek Yadav
6 months
@natanielruizg @Google For people looking for a similar idea with openly available code: check out TIES-Merging, which will appear at NeurIPS '23
@prateeky2806
Prateek Yadav
11 months
Performance degrades when merging diff task-specific models into a multitask model? Presenting TIES-Merging🪢 We find signif *Interference* b/w model params & mitigate it 👉 improves both NLP & CV merging @dtredsox13 @LChoshen @colinraffel @mohitban47 🧵
Tweet media one
6
86
255
0
2
7
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 We find that the compressed models from ComPEFT lead to better-merged models and outperform strong baselines like Task Arithmetic and TIES-Merging, improving performance in 9/12 settings and leading to an improvement of 1.4 on average.
Tweet media one
1
0
6
@prateeky2806
Prateek Yadav
11 months
We argue that current merging methods fail to account for two major sources of interference: (a) redundant parameter values pulling the average to 0. (b) disagreement on the sign of a given parameter’s values across models.
Tweet media one
1
0
6
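A toy numeric illustration of the two interference sources above (the numbers are mine, not from the paper):

```python
import torch

# (a) Redundant parameters: only model A actually uses this weight,
# but averaging with two zero deltas drags 0.9 down to 0.3.
a = torch.tensor([0.9, 0.0, 0.0])
b = torch.zeros(3)
c = torch.zeros(3)
print(torch.stack([a, b, c]).mean(dim=0))   # tensor([0.3, 0.0, 0.0])

# (b) Sign disagreement: opposite signs nearly cancel when averaged.
d = torch.tensor([0.7])
e = torch.tensor([-0.6])
print(torch.stack([d, e]).mean(dim=0))      # tensor([0.05])
```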
@prateeky2806
Prateek Yadav
4 months
Congratulations to @pingzli on his first paper, and that too as a spotlight at an amazing venue like ICLR. cc @KyriectionZhang @yilin_sung @YuCheng3 @mohitban47 @TianlongChen4
1
0
6
@prateeky2806
Prateek Yadav
1 month
@PandaAshwinee I guess soon middle school and primary school kids will be submitting to NeurIPS because otherwise they won't be able to get into high school research because it's competitive, and so on... If something is not ideal, then it doesn't mean we should double down on it.
1
0
6
@prateeky2806
Prateek Yadav
8 months
Congratulations @mohitban47 ! Getting this award from @IITKanpur is huge and also very well deserved. 🎉
@mohitban47
Mohit Bansal
8 months
Honored and humbled to receive the @IITKanpur Young Alumnus Award from my alma mater, which has been an amazing source of mentors+friends+memories and important foundation/values 🙏 All the credit for this award belongs to my mentors, students, collaborators, family/friends ❤️
20
9
218
1
0
6
@prateeky2806
Prateek Yadav
7 months
@PontiEdoardo @ndaheim_ @tmoellenhoff @IGurevych @EmtiyazKhan Thanks for thoroughly analyzing the gradient mismatch problem! I was wondering if you considered comparing it with TIES-Merging (NeurIPS'23) as the main thesis there is also to resolve the interference between the task vectors (i.e. accumulated gradients)
@prateeky2806
Prateek Yadav
8 months
🎉 Thrilled to announce our paper on TIES-Merging🪢 has been accepted to @NeurIPSConf ! We've delved into the significant Interference between task-specific model parameters when merging and found a way to mitigate it, enhancing both NLP & CV. Stay tuned for more insights! 📄✨
5
26
157
1
0
6
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers 1) Model parameters -> take 1B*K bits for K-bit precision
2) Parameter gradients -> similar to the model weights and take 1B*K bits for K-bit precision
3) Optimizer states: used for tricks like momentum -> depends on the optimizer but typically takes the same as the model parameters, so 1B*K bits
1
1
5
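A quick back-of-the-envelope calculation for the 1B-parameter example, using the simplified "1B*K bits per component" accounting from the thread (real optimizers like Adam often keep extra fp32 states, so actual numbers can be higher):

```python
# Rough memory estimate for a 1B-parameter model at K = 16-bit precision.
params = 1e9
bits_per_value = 16
gib = lambda bits: bits / 8 / 2**30   # bits -> GiB

weights   = gib(params * bits_per_value)   # ~1.86 GiB
gradients = gib(params * bits_per_value)   # ~1.86 GiB
optimizer = gib(params * bits_per_value)   # ~1.86 GiB (per the simplification above)
print(weights + gradients + optimizer)     # ~5.6 GiB, before counting activations
```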
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 ComPEFT follows 3 simple steps:
1️⃣ Decompose the task vector into a sign and a magnitude (mag) vector
2️⃣ Sparsify the mag vector to keep only the top-k values and also drop the pruned indices from the sign vec
3️⃣ Multiply the sign vec by a scalar constant alpha * std(task vector)
Tweet media one
2
0
6
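A minimal sketch of those three steps as sparse ternary compression of a task vector; the `k` and `alpha` defaults are illustrative placeholders, not the paper's settings.

```python
import torch

def compeft_compress(task_vector: torch.Tensor, k: float = 0.05, alpha: float = 1.0):
    """Sketch of ComPEFT-style compression as described in the thread above."""
    sign = torch.sign(task_vector)                 # 1) sign vector
    mag = task_vector.abs()                        # 1) magnitude vector
    num_keep = max(1, int(k * task_vector.numel()))
    topk = torch.topk(mag.flatten(), num_keep).indices
    mask = torch.zeros_like(mag.flatten())
    mask[topk] = 1.0                               # 2) keep only the top-k magnitudes
    mask = mask.view_as(mag)
    scale = alpha * task_vector.std()              # 3) one scalar replaces all magnitudes
    return scale * sign * mask                     # ternary update: {-scale, 0, +scale}
```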
@prateeky2806
Prateek Yadav
7 months
@yilin_sung @Tim_Dettmers In most cases, these benefits often outweigh the costs, especially in scenarios with memory constraints. Let me know if I missed something!
1
0
5
@prateeky2806
Prateek Yadav
11 months
7⃣Task Arithmetic in the Tangent Space ( @gortizji et al) 8⃣LMC and LTH ( @jefrankle et al) 9⃣Role of Perm. Invariance in LMC ( @rahiment et al) 🔟Cold-Fusion ( @LChoshen et al)
@rahiment
RahimEntezari
3 years
🆕The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks Our conjecture: Taking permutations into account, there is likely no barrier in the linear interpolation between SGD solutions. w @HanieSedghi @osaukh @bneyshabur 1/10
Tweet media one
9
40
226
1
2
6
@prateeky2806
Prateek Yadav
11 months
Moreover, we observe that TIES-Merging🪢 (1) Improves out-of-domain performance significantly! (2) Scales better with more tasks! (3) Additional ablations confirm that all three steps are important.
Tweet media one
Tweet media two
Tweet media three
1
1
6
@prateeky2806
Prateek Yadav
3 months
@_akhaliq Definitely look at ComPEFT (released Nov'23) which does exactly what BigDelta does only with evaluation and for bigger models we don't even need any data.
@prateeky2806
Prateek Yadav
6 months
Presenting ComPEFT 🗜! We compress parameter updates to facilitate efficient communication of expert models for compositional generalization. ComPEFT improves perf. 📈, while reducing storage/communication costs 📉 @LChoshen @colinraffel @mohitban47 🧵
Tweet media one
8
56
234
0
0
5
@prateeky2806
Prateek Yadav
11 months
We propose Trim, Elect Sign & Merge (TIES-Merging🪢), which introduces 3 new steps when merging:
1⃣Resetting weights that changed a small amount during fine-tuning.
2⃣Resolving sign conflicts.
3⃣Merging only the weights that are in alignment with the final agreed-upon sign.
Tweet media one
Tweet media two
1
1
5
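A minimal sketch of those three steps on a list of task vectors (fine-tuned weights minus base weights); `density` and `lam` are illustrative, not the paper's tuned values.

```python
import torch

def ties_merge(task_vectors, density=0.2, lam=1.0):
    """Sketch of trim -> elect sign -> disjoint merge, per the steps above."""
    trimmed = []
    for tv in task_vectors:
        # 1) Trim: reset weights that changed only a small amount during fine-tuning
        k = max(1, int(density * tv.numel()))
        thresh = tv.abs().flatten().topk(k).values.min()
        trimmed.append(torch.where(tv.abs() >= thresh, tv, torch.zeros_like(tv)))
    stacked = torch.stack(trimmed)
    # 2) Elect sign: majority sign per parameter, weighted by total magnitude
    elected_sign = torch.sign(stacked.sum(dim=0))
    # 3) Disjoint merge: average only entries agreeing with the elected sign
    agree = (torch.sign(stacked) == elected_sign) & (stacked != 0)
    merged = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
    return lam * merged

# usage sketch: merged_model = base + ties_merge([ft1 - base, ft2 - base])
```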
@prateeky2806
Prateek Yadav
11 months
TIES-Merging🪢 outperforms all other methods in diverse settings covering a range of modalities, domains, model sizes, architectures, and fine-tuning settings! It even works on parameter-efficient FT!
Tweet media one
1
0
5
@prateeky2806
Prateek Yadav
7 months
@iclr_conf 📝 LUMOS: Towards Language Agents that are Unified, Modular, and Open Source TL;DR: Offers a modular architecture for task decomposition, grounding, and execution, leveraging open-source LLMs, showing competitive performance on interactive tasks. 🔗:
1
0
5
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 Moreover, ComPEFT applied to LoRA and (IA)^3 is Pareto-optimal in terms of storage costs vs. performance compared to a wide range of existing PEFT methods.
Tweet media one
1
0
6
@prateeky2806
Prateek Yadav
3 months
I feel like merging can also play an even bigger role in continued pretraining, which is unexplored, but that is where big savings lie for people who are willing to explore it more.
1
0
5
@prateeky2806
Prateek Yadav
11 months
We further analyze the impact of different types of interference on model parameters, highlight the importance of having correct signs, and show that estimating the signs using the validation data could further improve performance.
Tweet media one
Tweet media two
Tweet media three
1
0
5
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 Furthermore, Compressed ComPEFT checkpoints perform similarly to the original uncompressed checkpoints when performing few-shot compositional generalization on the Big-Bench-Hard benchmark via LoraHub.
Tweet media one
1
0
5
@prateeky2806
Prateek Yadav
3 years
Congratulations @svjan5 on the award. I have witnessed your hard work and dedication while collaborating with you. There are a lot of good things yet to come your way!
@Indiaacm
ACM India
3 years
Shikhar V. Chosen as Recipient of 2021 ACM India Doctoral Dissertation Award for "Neural Graph Embedding Methods for Natural Language Processing." He was advised by Prof @partha_p_t and Chiranjib Bhattacharyya. Details at #ACMIndia #DoctoralDissertation
Tweet media one
7
7
54
0
0
5
@prateeky2806
Prateek Yadav
5 months
@LChoshen Not just unofficial implementations but many top models on the HF openllm leaderboard were created using TIES-Merging. @Weyaxi and many others use it frequently. Also it is integrated in many GitHub repos on merging.
1
0
4
@prateeky2806
Prateek Yadav
6 months
@LChoshen @colinraffel @mohitban47 We perform ablations on ComPEFT to show that all steps are necessary and that it outperforms strong baselines like STC from Federated Learning.
Tweet media one
1
0
6
@prateeky2806
Prateek Yadav
11 months
11. Multimodal Model Merging ( @yilin_sung et al)
@yilin_sung
Yi Lin Sung
1 year
Can we MERGE weights of different MODALITIES? The answer is no using naive merging. However we find an effective recipe for improving merging results significantly in “An Empirical Study of Multimodal Model Merging” 🧵👇 @linjiefun @zhegan4 @mohitban47
Tweet media one
1
44
147
0
1
5
@prateeky2806
Prateek Yadav
11 months
Model merging methods combine multiple task-specific models into 👉 one multitask model without more training. However, the weights of different models might interfere with each other, which we find can significantly harm multitask performance!
1
0
5
@prateeky2806
Prateek Yadav
7 months
Extensive analysis of our design choices highlights the best practices for merging experts. We find that (1) adaptive layerwise budget allocation, (2) router logits-based similarity, and (3) activation frequency-based expert merging with knowledge distillation work best.
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
0
4
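As one reading of "activation frequency-based expert merging", here is a sketch that weights each expert by how often the router activates it; this is my simplification and omits the budget allocation and distillation steps mentioned above.

```python
import torch

def frequency_weighted_merge(experts, activation_freqs):
    """Merge a group of experts into one, weighting by routing frequency.
    `experts`: list of same-shape weight tensors; `activation_freqs`: counts of
    how often the router picked each expert (illustrative inputs)."""
    freqs = torch.tensor(activation_freqs, dtype=torch.float32)
    weights = freqs / freqs.sum()
    stacked = torch.stack(experts)                      # (num_experts, ...)
    shape = (-1,) + (1,) * (stacked.dim() - 1)
    return (weights.view(shape) * stacked).sum(dim=0)   # weighted average of expert weights
```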
@prateeky2806
Prateek Yadav
3 months
2
0
4
@prateeky2806
Prateek Yadav
7 months
@iclr_conf 📝 CCA Merge: Merging Many Neural Networks with Canonical Correlation Analysis TL;DR: Novel fusion method using Canonical Correlation Analysis to merge many models into one with lower accuracy drops than past methods. 🔗:
1
0
4
@prateeky2806
Prateek Yadav
10 months
@Yampeleg What context length?
1
0
4
@prateeky2806
Prateek Yadav
3 years
Interesting work on bridging neural and symbolic techniques for generating proof graphs. We will be happy to chat/answer any questions you might have!
@swarnaNLP
Swarnadeep Saha
3 years
👇Happening today 9.20-10.40pm ET at #NAACL2021 session 10E: Question Answering/Interpretability. Talk video & session links below! Happy to chat about proof-graph set generation for explaining compositional/multi-hop reasoning.
1
3
10
0
0
4