We want to pretrain
Instead we finetune
Could we collaborate?
ColD Fusion:
Recycle finetuning to multitask
➡️evolve pretrained models forever
On 35 datasets
+2% improvement over RoBERTa
+7% in few shot settings
🧵
#NLProc
#MachineLearning
#NLP
#ML
#modelRecycling
Pretraining with 1 GPU and 1 day
This paper is a HUGE list of all the tricks you could think of and
what works to make training efficient given 1 GPU and 1 day
@jonasgeiping
@tomgoldsteincs
During training, your loss goes up and down up and down up and down.
But how would it go if you magically went in a straight line
from init to learnt position?
Apparently smoothly down!
On the surprising Linear Interpolation:
#scientivism
#deepRead
#MachineLearning
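Not the paper's code, just a minimal sketch of the experiment (model, data and loss are placeholders you would swap in): walk a straight line in weight space from the init to the trained weights and record the loss along the way.

```python
import copy
import torch

def loss_along_line(model_init, model_final, loss_fn, batch, steps=11):
    """Loss at evenly spaced points on the segment between two weight vectors."""
    x, y = batch
    losses = []
    for alpha in torch.linspace(0, 1, steps):
        probe = copy.deepcopy(model_init)
        with torch.no_grad():
            for p, p0, p1 in zip(probe.parameters(),
                                 model_init.parameters(),
                                 model_final.parameters()):
                # theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final
                p.copy_((1 - alpha) * p0 + alpha * p1)
            losses.append(loss_fn(probe(x), y).item())
    return losses  # the surprising finding: this curve tends to go down smoothly
```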
Is data really important for pretraining?
Could we just pretrain on 1 picture? Only synthetic text? Fractals?
A 🧵 summing the image and text papers that do just that.
and they all have a similar conclusion
How ICL 𝘦𝘮𝘦𝘳𝘨𝘦𝘴 from unsupervised data?
𝘐𝘵 𝘭𝘦𝘢𝘳𝘯𝘴 𝘧𝘳𝘰𝘮 parallel phrases
After deleting parallel parts, ICL ability was reduced by 51%; deleting random words - only 2%
🧵
@yanda_chen_
@henryzhao4321
@Zhou_Yu_AI
@hhexiy
@columbianlp
Computational (Chomskyan) hierarchies can predict OOD capabilities
Different places in the formal hierarchy - different generalizations the architecture can perform
Got your attention?
Details in 🧵
@DeepMind
I don't train from scratch, I use RoBERTa
Wait…
Why not cross-encoder/stsb-roberta? facebook/muppet-roberta?
We automatically identify the best models on 🤗 (periodically)
Just pick the best one
and finetune on your task
Data augmentation? Look no further.
Framework of 100+ "transformations" (augmentations/paraphrasing functions/filters)
Many types: emojis, linguistic... see Fig
Extendable!
A vast effort, constructed by almost a hundred authors!
#scientivism
Recycling Finetuned models, it works!
Finetuned models lie everywhere,
there must be a way to use the data and compute invested in them.
Apparently averaging their weights is such a method.
3 papers & a 🧵
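A minimal sketch of the averaging idea (checkpoint names are placeholders, not from the papers): load several finetuned models that share one architecture and average their weights.

```python
import torch
from transformers import AutoModel

# Placeholder checkpoint names; use any finetuned models sharing one architecture.
checkpoints = ["org/roberta-finetuned-a", "org/roberta-finetuned-b", "org/roberta-finetuned-c"]
models = [AutoModel.from_pretrained(c) for c in checkpoints]

avg_state = {}
for key, ref in models[0].state_dict().items():
    if ref.dtype.is_floating_point:
        avg_state[key] = torch.stack([m.state_dict()[key] for m in models]).mean(dim=0)
    else:
        avg_state[key] = ref  # keep integer buffers (e.g., position ids) untouched

fused = AutoModel.from_pretrained(checkpoints[0])
fused.load_state_dict(avg_state)  # a recycled starting point for further finetuning
```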
Labelled data is scarce, what can we do?
We can MLM on the unlabeled data, but
You can do better:
Cluster & Tune - finetune on clusters as labels
#acl2022nlp
#NLProc
#MachineLearning
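My loose sketch of the idea, not the paper's pipeline (sentence embeddings + KMeans stand in here for whatever clustering the paper actually uses): cluster the unlabeled texts, then use the cluster ids as pseudo-labels for an intermediate finetuning step.

```python
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

texts = [f"unlabeled document {i}" for i in range(1000)]  # placeholder: your unlabeled corpus
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
pseudo_labels = KMeans(n_clusters=50, random_state=0).fit_predict(embeddings)
# Intermediate step: finetune a classifier on (texts, pseudo_labels),
# then finetune again on the small labeled set for the real task.
```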
Finetuning millions of dimensions is not as complex as you may think🤯
Actually, it is quite interpretable in Euclidean space by
angles from the pretraining.
Seeds fall in small regions
Tasks in larger ones
All in some direction
What's in a layer?🕵🏻‍♀️
Representations are vectors
If only they were words...
Finding:
Any layer can be mapped well to another linearly
Simple, efficient & interpretable
& improves early exit
Story and 🧵
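Roughly what "mapped linearly" means here, with random tensors standing in for real hidden states: fit a least-squares linear map from layer i activations to layer j activations and check the reconstruction error.

```python
import torch

H_src = torch.randn(1000, 768)  # stand-in: hidden states from layer i over 1000 tokens
H_tgt = torch.randn(1000, 768)  # stand-in: hidden states from layer j for the same tokens
W = torch.linalg.lstsq(H_src, H_tgt).solution  # 768x768 linear map, least squares
reconstruction_error = torch.nn.functional.mse_loss(H_src @ W, H_tgt)
print(reconstruction_error)  # low error = layer i predicts layer j linearly
```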
About generalization of different networks
Main finding: Generalization in pretraining follows a single dimension
Different networks, architectures, seeds, sizes but:
Similar performance โ similar linguistic capabilities
@aclmeeting
accepted (#NLProc)
Summary & story 🧵
A new pretraining technique
suggests replacing MLM by predicting the representations directly
How?
We tie the input rep. to the rep. of the input we mask
& contrast it with other reps. Voilà!
Huge compute/perf gains!
@nthngdy
Éric de la Clergerie
@bensagot
Taking a moment to celebrate🥳
𝗗𝗼𝗰𝘁𝗼𝗿 Leshem Choshen
Years of research, thanks to all the collaborators
Years since the research and...
Mister is evolving to ... Doctor
Task contamination is worse than we imagined?
Models perform better on datasets that were released before the models
+ Models can generate examples from tasks (NYT vibes)
Changmao Li
@jmflanig
Predictions throughout training, across hyperparams and architectures, are yet again shown to lie on
a small manifold,
which means models learn their classification outputs similarly
Mao ...
@pratikac
#MachineLearning
#enough2skim
Residual connections are 🔥, right?
Wait, so why do we only use them to skip 1 layer?
Not only did Lucas Georges Gabriel Charpentier &
@davidsamuelcz
check it,
they found that it provides huge gains - winning the babyLM challenge🍼
Theory of Mind emerged in GPT
Children get it at 9yo
No, it doesn't read minds (yet), but
it can empathize and imagine what you know
separately from what it or others know
Code&data:
@Stanford
@stanfordnlp
@michalkosinski
"Tokens" - the magic๐ช that transforms strings to inputs
BPE and wordPiece are the same. right? Well...
It is a constant thing in pretaining there is nothing I can do about it, right? wrong...
A ๐งตon tokenization methods and regularization
🦾Cohere beats Davinci on HELM
😵‍💫But only if you also test Cohere medium
How reliable are our benchmarks really?
A fascinating 🧵 on HELM,
Reliable benchmarks
& saving X100 compute
Are you up to it?
🧵
SOLAR mixes two base-model copies
to create a larger one,
then trains it a bit more and beats other open models out there.
How? and my thoughts 🧵
@upstageai
(no author with a handle?!)
#scientivism
We share code on
@github
We share datasets on
@huggingface
But where do we share our data processing?
We prompt, clean, and filter
but on our own🥺
Unitxt
A preprocessing tool
That we can grow together
@IBMResearch
What do we know about using a fine-tuned model rather than the pretrained one?
They are sometimes much better, but what else?
A story of great
#scientivism
hypotheses and their rejections
The story of a field
Survey 🧵
Transformers are not Seq2Seq
Given a context
They predict a single token
We use this to update representations between predicted tokens
& feed a changing graph relevant to the current token
me
@oabend
#conll2020
#NLProc
Networks first learn generalizations and then memorize.
This phenomenon was presented at a recent ACL.
Are the two really different?
For some time I thought this must be the reason networks beat VC-dimension bounds.
Do you think differently?
My opinion:
LLMs act in multiple languages, but how?
Is it separate knowledge?
Unified knowledge?
Or do they translate everything into English?
tl;dr translate
@AmiiThinks
Once and for all
What is the intuition behind warming up learning rate?
I understand why it makes sense to decay the learning rate.
But why should it start small and rise?
"Only large models have emergent abilities"
This mystic mantra is challenged again
Broad 🧵 on the debate
+ new findings:
165M parameter models trained on simple English
are better
zero shot learners
Show scaling laws
But not few shot benefits (yet?)
A replacement to probing: Prompt Probing
Train the input vector (the prompt) so the probe can't learn the task by itself, only extract what the model learnt
Jiaoda Li
@ryandcotterell
@mrinmayasachan
Opposite scaling law: detection of machine-generated text is done better by smaller models
Everyone (outside
#NLProc
...) is afraid GPT would cheat for them, which pushes for detection methods
Mireshghallah
@MatternJustus
Gao
@rzshokri
@BergKirkpatrick
👶BabyLM will be back, but what did we learn?
The best papers from this year🍼
We know one mechanism that learns from 100M words (us), what are the main boosts to reach this in an LLM?
#EMNLP
#babyLM
I am sad & will not brag about X
#EMNLP2023
papers accepted
Do you know how many (children) got kidnapped? Massacred by Hamas' Army going house by house?
I will not share about it, except on this thread
But silence? Silence was too much, too little
Your comments go there
I have just found a new phenomenon:
Linear mode connectivity
What is the loss of the mid-model?
A model somewhere between converged models with different seeds?
#MachineLearning
LLMs would never acquire meaning.
Or, perhaps they already have?
An opinionated review of LMs' current state and their ability to capture meaning
a scholarly-philosophical paper by who if not
@spiantado
@FelixHill84
Is DPO Superior to PPO for LLM Alignment? No.
A comprehensive study shows that PPO is better (except for runtime and complexity of running it)
Theoretically and empirically
Just 2 years ago, we introduced the concept of
model fusing (now aka merging, introduced by others in parallel)
And it is so well adopted, it now has a survey!
We have come a long way.
Weishi Li, Yong Peng,
@Miao_Zhang_dr
@liangdingNLP
Han Hu, Li Shen
See the sheer joy of my collaborators at
#ACL2023
🤩
DissentQA
won the Best Paper AC award
This is a happy outcome of the fruitful collaboration with a group of wonderfully friendly people
@EllaNeeman
@OHonovich
@roeeaharoni
@AbendOmri
&Szpektor
Details:
d̵r̵o̵p̵o̵u̵t̵ Mixout randomly replaces weights with z̵e̵r̵o̵s̵ weights of another model (e.g. pretrained BERT) to make it closer to the other model during transfer learning (fine-tuning).
Clean and simple
@kchonyc
@iclr_conf
Mixout () is a cool way to regularize your large neural network. I quickly wrote an implementation in pytorch that works with (most) arbitrary nn.Modules:
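Not that implementation, just a functional sketch of the idea: with probability p each finetuned weight is swapped back to the pretrained value, then rescaled so the expectation stays at the finetuned weight.

```python
import torch

def mixout(finetuned_w: torch.Tensor, pretrained_w: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    """Randomly replace a fraction p of finetuned weights with the pretrained ones."""
    swap = torch.rand_like(finetuned_w) < p
    mixed = torch.where(swap, pretrained_w, finetuned_w)
    # Rescale so the expectation equals the finetuned weight.
    return (mixed - p * pretrained_w) / (1.0 - p)

# Usage: apply to each weight matrix in the forward pass while fine-tuning.
w_ft = torch.randn(768, 768, requires_grad=True)
w_pt = torch.randn(768, 768)
w_used = mixout(w_ft, w_pt, p=0.1)
```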
ROME edits model weights to replace a fact the model knew with another
But, it sometimes ruined everything in the process
Not anymore, this was an implementation problem and
R-ROME solved it
🏴‍☠️arrr-ROME🏴‍☠️
@akshatgupta57
@GopalaSpeech
@UCBerkeley
@UCSF
I can't understand how this paper is so overlooked
Human annotation was a dreadful thing to me all my PhD: costly, cumbersome, requiring my constant supervision
this is a game changer (and it's not even mine...)
But generalized reranking gets double the PR...
#scientivism
🧞Human annotations for free🥳
As each one reinvents their annotation process GENIE🧞
Just decided to standardize the whole process (API MTurk...)
Upload your process and
the next papers would be able to exactly replicate
And they pay for annotation!
#EMNLP2022livetweet
The best prompt, literally:
ยซCommand, we need you to plot a course through this turbulence and locate the source of the anomaly. Use all available data and your expertise to guide us through this challenging situation.ยป
Models respond well to positive thinking
1/n
A successor to MRT?
MAD - Reinforcement learning in Machine Translation
A new way to effectively optimize generation towards any metric
@DeepMind
Domenic Donato
@LeiYu63
Wang Ling
@redpony
We continually work on -Continual learning-
the thing is that we have catastrophic forgetting and keep reinventing methods
This, or just delving into a new field is what makes surveys so important.
I won't summarize a survey, just go read it...
Learning to rerank without reranking.
Reranking beam search (BSR) is often useful but costly; this work mimics the behavior without the need for a large beam and slow inference. (caveats below)
@yzpang97
@hhexiy
@kchonyc
oBERTa Recipe
to compress pretrained models and infer fast
Prune
Use distillation
Remove unwanted weights (sparsify)
Quantize (reduce float precision)
Avoid freezing weights
Serve and infer with better accuracy and speed
@spacemanidol
@markurtz_
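The quantization step, illustrated with plain PyTorch post-training dynamic quantization (a stand-in, not the tooling the recipe itself uses):

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base")  # stand-in checkpoint
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8  # Linear weights stored as int8
)
# Smaller on disk and faster for CPU inference; accuracy should be re-checked per task.
```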
Models know when they hallucinate: they output probabilities and can get calibrated
Shouldn't we instead make models communicate it explicitly?
When a person is unsure they just say "I am not sure, but"
@sunniesuhyoung
@QVeraLiao
@mihaela_v
Ballard Vaughan
LMs learn generalizations in the same order (which?).
Same acquisition of grammatical phenomena capabilities,
regardless of data, seed, architecture.
Humbly, fascinating work with
@saksheli
Weinshall
@AbendOmri
#NLProc
#MachineLearning
#deepRead
1/n
🧞Human annotations for free🥳
As each one reinvents their annotation process GENIE🧞
Just decided to standardize the whole process (API MTurk...)
Upload your process and
the next papers would be able to exactly replicate
And they pay for annotation!
#EMNLP2022livetweet
TinyStories: Tiny models are coherent and understand instructions
If their data is very simple
What is simple?
What 3-4 year old vocabularies allow (according to LLMs...)
@MSFTResearch
@EldanRonen
Yuanzhi Li
Worried about test data being used in training?
The LLM world is going through a data contamination crisis.
Here's us trying to do something about it:
Paper:
Blog:
w/
@clu_avi
@omerNLP
@yoavgo
Welcome the new babies!👶🍼🍼
19 pretrained models on the loose track
24 on the strict
118 on strict-small
We are proud of >30 pretraining teams submitting papers to babyLM!
FOMO?
Get updated on CoNLL or
participate next year
Let the LM improve its prompt
1️⃣Get prompt & examples
2️⃣Ask what was wrong with the prompt
3️⃣Propose new prompts
4️⃣Efficiently evaluate which prompt works best
Repeat from 2
Pryzant
@dan_iter
@jerryzli
Lee, Zhu
@mjjzha
@MSFTResearch
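A rough sketch of the loop above (my paraphrase, not the authors' code); ask_llm is a hypothetical stand-in for whichever LLM API you call.

```python
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical stand-in: plug in your LLM call here")

def accuracy(prompt: str, examples: list[tuple[str, str]]) -> float:
    """Score a candidate prompt on (input, expected_output) pairs."""
    hits = [ask_llm(f"{prompt}\n\n{x}").strip() == y for x, y in examples]
    return sum(hits) / len(hits)

def improve_prompt(prompt: str, examples: list[tuple[str, str]], rounds: int = 3) -> str:
    for _ in range(rounds):
        critique = ask_llm(f"This prompt failed on some examples. What is wrong with it?\n{prompt}")
        candidate = ask_llm(f"Rewrite the prompt to fix these issues:\n{critique}\n\nPrompt:\n{prompt}")
        if accuracy(candidate, examples) >= accuracy(prompt, examples):  # keep the better prompt
            prompt = candidate
    return prompt
```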
We are running out of a vital resource: words!
There are “only” 5 to 10 trillion high-quality words (papers, books, code) on the internet. Our AI models will have used all of that for training by 2026. Low-quality data (tweets, fanfic) will last to 2040.
How to make LMs learn an abstract language representation from learning 2 languages on the same concept?
A shared token helps, sharing output language helps more
Tianze Hua
@tianyunnn
@Brown_NLP
(no, I won't put it at the top hype-seekers)
A 1T parameters LLM
Outperforming stuff
Based on mixture of experts, sparse updates, random routing
How did I read nothing about it? Because it is Chinese?
What am I missing?
#scientivism
@Huawei
Papers keep on suggesting training on new facts or new data unseen in the pretraining.
Why isn't it a common practice - in practice?
e.g., from today:
@NickMeck
...
@XiaoxiaoLi8
@RanveerChandra
tfaktas
What do we know about model shortcuts?
LLMs surprise us in their generalization abilities, but they just as often fail and rely on the wrong features.
Why? When? How to prevent?
A new survey and a 🧵
@DuMNCH
Fengxiang He, Na Zou, Dacheng Tao
@huxia
Thoughts upon rereading about Pythia
a large set of reproducible checkpoints pretrained on consistent datasets with some exps. as well
@BlancheMinerva
@haileysch__
et al.
RL methods for MT (MRT, GANs and REINFORCE) might not get their performance boost from actually improving translations.
Moreover, there are no convergence guarantees for MRT.
For these practical and theoretical (a first for MRT?) results, see our new preprint
Nonsense inputs may make sense for LMs
Some phrases in the gibberish rubble
make models answer or regurgitate knowledge.
But what can we learn about those nonsensical phrases or from them on LMs?
@V__Cherepanova
@james_y_zou
After ChatGPT, the future belongs to multimodal LLMs. Whatโs even better? Open-sourcing.
Announcing Prismer, my teamโs latest vision-language AI, empowered by domain-expert models in depth, surface normal, segmentation, etc.
No paywall. No forms.
Interpretability and MI are a waste of research power and do not contribute to the actual advancement in the field.
Have a strong feeling about it?
Spill it in comments or help those researchers (not me):
@mariusmosbach
@megamor2
@tombrownev
@DippedRusk
Do you know how 🤗 can save an exabyte per month?
or even how much is an exabyte? (I didn't)
Dedicated model Compression can save 50%!
(avg. 25%)
Yes, it can compress quantized models as well...
@MITIBMLab
@IBMResearch
@MIT_CSAIL
@BU_Tweets
🧵
Emptying the Ocean with a Spoon: Should We Edit Models?
Model editing changes facts in retrospect, but could it ever make factual models?
@yuvalpi
@melhadad
LLMs perform tasks well even given scrambled sentences
When the model can reconstruct the unscrambled sentence it ignores order, but only then 🧵
Chen O'Donnell
@sivareddyg
@Mila_Quebec
@McGillU
English Code models are better than Chinese
on Chinese
They hallucinate less
They generalize better
If true, this defies our thoughts on LMs as domain experts
@AntGroup
(no author handles?
@ShiweiLiu9
?)
In-Context-Learning == gradient descent or disregards labels completely?!
Why not both?
Models recognize the task but also learn it
& The benefits of actual learning grow with # examples and model size
Jane Pan
@gaotianyu1350
@__howardchen
@danqi_chen
Reviewing has so many faults
Finally, there is a dataset of reviews, edits and everything else!
5 venues 5K papers 11K reviews
Enjoy!
@DyNils
@ilokuznetsov
@IGurevych
The intuitions behind warmup, a summary 🧵
I asked what are the intuitions behind warm-up (I had none).
I got many answers (and 2 papers) in the cited tweet and thought to give something back.
Now they are digestible
Thread unroll:
Once and for all
What is the intuition behind warming up learning rate?
I understand why it makes sense to decay the learning rate.
But why should it start small and rise?
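For concreteness (not from the thread), this is roughly what a linear warmup followed by linear decay looks like in PyTorch:

```python
import torch

model = torch.nn.Linear(10, 2)  # toy model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup_steps, total_steps = 100, 1000

def lr_scale(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # ramp 0 -> 1 over the warmup
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))  # then decay 1 -> 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)
# call scheduler.step() after each optimizer.step()
```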