Leshem Choshen ๐Ÿค–๐Ÿค—

@LChoshen

3,590
Followers
589
Following
632
Media
7,490
Statuses

๐Ÿฅ‡ Collaborative LLMs ๐Ÿฅˆ Opinionatedly sharing #ML & #NLP ๐Ÿฅ‰ Propagating us underdogs we owe science an alternative hype @IBMResearch & @MIT_CSAIL

Joined June 2018
Pinned Tweet
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
We want to pretrain๐Ÿคž Instead we finetune๐Ÿšฎ๐Ÿ˜” Could we collaborate?๐Ÿค— ColD Fusion: ๐Ÿ”„Recycle finetuning to multitask โžก๏ธevolve pretrained models forever On 35 datasets +2% improvement over RoBERTa +7% in few shot settings ๐Ÿงต #NLProc #MachineLearning #NLP #ML #modelRecycling
6
25
132
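Very roughly, the recycling loop described in the pinned tweet above can be pictured like the sketch below. This is not the official ColD Fusion code; `finetune_on(model, dataset)` is a hypothetical placeholder for an ordinary finetuning routine.

```python
# Minimal sketch of iteratively "recycling" finetuning into a shared base model.
import copy
import torch

def fusion_round(base, datasets, finetune_on):
    """Finetune copies of the base on each dataset, then average the resulting
    weights back into the base, which serves as the 'pretrained' model of the
    next round."""
    contributors = [finetune_on(copy.deepcopy(base), d) for d in datasets]
    fused = {}
    for name, param in base.state_dict().items():
        if param.is_floating_point():
            fused[name] = torch.stack(
                [c.state_dict()[name] for c in contributors]).mean(dim=0)
        else:
            fused[name] = param  # leave integer buffers untouched
    base.load_state_dict(fused)
    return base
```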
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
During training, your loss goes up and down up and down up and down. But how would it go if you magically went in a straight line from init to learnt position? Apparently smoothly down! On the surprising Linear Interpolation: #scientivism #deepRead #MachineLearning
Tweet media one
8
77
440
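For intuition, the probe in the tweet above amounts to evaluating models that lie on the straight line between the initial and the trained weights. A minimal sketch, assuming every state-dict entry is a float tensor and `eval_loss(model)` is your own validation-loss routine:

```python
import torch

@torch.no_grad()
def interpolation_losses(model, init_state, trained_state, eval_loss, steps=11):
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Weights on the straight line between initialization and the trained model.
        mixed = {k: (1 - alpha) * init_state[k] + alpha * trained_state[k]
                 for k in trained_state}
        model.load_state_dict(mixed)
        losses.append(eval_loss(model))
    return losses  # the surprising finding: this curve tends to decrease smoothly
```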
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
3 months
DoRA decomposes weights into magnitude and direction and surpasses LoRA quite significantly. This builds on an empirical finding that I can't wrap my head around @NVIDIAAI @nbasyl_tw @chienyi_wang @yin_hongxu @PavloMolchanov @CMHungSteven
Tweet media one
5
75
408
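Roughly, the decomposition DoRA builds on rewrites a pretrained weight as a magnitude times a unit direction and applies the LoRA-style low-rank update to the direction only. The sketch below is my reading of that idea, not NVIDIA's code, and the normalization axis is a simplification.

```python
import torch
import torch.nn as nn

class DoRAStyleLinear(nn.Module):
    """Simplified magnitude/direction reparameterization of a frozen weight."""
    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        out_dim, in_dim = weight.shape
        self.register_buffer("W0", weight.clone())                    # frozen pretrained weight
        self.m = nn.Parameter(weight.norm(p=2, dim=1, keepdim=True))  # learnable magnitude
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)       # LoRA-style factors
        self.B = nn.Parameter(torch.zeros(out_dim, rank))             # update starts at zero

    def forward(self, x):
        V = self.W0 + self.B @ self.A                     # low-rank update of the direction part
        V = V / V.norm(p=2, dim=1, keepdim=True)          # renormalize to a unit direction
        return x @ (self.m * V).t()                       # rescale by the learned magnitude
```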
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
Is data really important for pretraining? Could we just pretrain on 1 picture? Only synthetic text? Fractals? A ๐Ÿงต summarizing the image and text papers that do just that. And they all reach a similar conclusion๐Ÿค”
Tweet media one
9
63
362
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
3 months
How does ICL emerge from unsupervised data? It learns from parallel phrases. After deleting parallel parts, the ICL ability was reduced by 51%; deleting random words - only 2% ๐Ÿงต @yanda_chen_ @henryzhao4321 @Zhou_Yu_AI @hhexiy @columbianlp
Tweet media one
7
56
313
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
Computational (Chomskyan) hierarchies can predict OOD capabilities. Different formal hierarchies - different generalizations the architecture can perform. Got your attention? Details in ๐Ÿงต @DeepMind
Tweet media one
3
57
298
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
11 months
Zip it: Fuse models with themselves first Merge models trained on different tasks by correlations between activations George Stoica @dbolya @BjornerJakob Taylor Hearn @judyfhoffman @gtcomputing #deepRead
Tweet media one
3
45
259
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
I donโ€™t train from scratch, I use RoBERTa๐Ÿง Waitโ€ฆ Why not cross-encoder/stsb-roberta? facebook/muppet-roberta? We automatically identify the best models on ๐Ÿค—(periodically) Just pick the best one and finetune on your task
8
55
255
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
Data augmentation? Look no further. A framework of 100+ "transformations" (augmentations/paraphrasing functions/filters) Many types: emojis, linguistic... see Fig. Extendable! A vast effort, constructed by almost a hundred authors! #scientivism
Tweet media one
4
41
212
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
Recycling Finetuned models, it works! Finetuned models lie everywhere, there must be a way to use the data and compute invested in them. Apparently averaging their weights is such a method. 3 papers & A๐Ÿงต
Tweet media one
11
35
208
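The core recipe in the tweet above is literally a weight average. A minimal sketch, assuming all checkpoints were finetuned from the same pretrained model and share identical architectures:

```python
import torch

def average_state_dicts(state_dicts):
    """Uniformly average finetuned checkpoints, parameter by parameter."""
    avg = {}
    for name, first in state_dicts[0].items():
        if first.is_floating_point():
            avg[name] = torch.stack([sd[name] for sd in state_dicts]).mean(dim=0)
        else:
            avg[name] = first  # e.g. integer position-id buffers
    return avg

# usage sketch: model.load_state_dict(average_state_dicts([torch.load(p) for p in paths]))
```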
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
Labelled data is scarce, what can we do? We can MLM on the unlabeled data, but you can do better: Cluster & Tune - finetune on clusters as labels #acl2022nlp #NLProc #MachineLearning
Tweet media one
7
47
199
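The intermediate step the tweet describes is simple to picture. A sketch (not the paper's code), where `embed(texts)` is a hypothetical placeholder returning sentence embeddings as an array:

```python
# Cluster & Tune idea: cluster unlabeled texts, use the cluster ids as
# pseudo-labels for an extra finetuning stage, then finetune on the small
# labeled set as usual.
from sklearn.cluster import KMeans

def make_pseudo_labels(texts, embed, n_clusters=50, seed=0):
    reps = embed(texts)  # (n_texts, dim) embeddings of the unlabeled corpus
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(reps)
    return list(zip(texts, labels))

# Stage 1: finetune the pretrained LM to predict these pseudo-labels.
# Stage 2: finetune the result on the real (scarce) labeled task.
```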
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
6 months
Finetuning millions of dimensions is not as complex as you may think๐Ÿคฏ Actually, it is quite interpretable in Euclidean space by angles from the pretraining. Seeds fall in small regions Tasks in larger ones All in some direction
Tweet media one
6
30
184
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
๐Ÿ”ŽWhat's in a layer?๐ŸŒน๐Ÿ•ต๐Ÿปโ€โ™€๏ธ Representations are vectors If only they were words... Finding: Any layer can be mapped well to another linearly Simple, efficient & interpretable & improves early exit Story and ๐Ÿงต
Tweet media one
8
53
181
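The finding in the tweet above amounts to fitting an ordinary least-squares map between hidden states. A minimal sketch, where `H_src` and `H_tgt` are matrices of token representations collected from two layers on the same inputs:

```python
import torch

def fit_layer_map(H_src, H_tgt):
    """Closed-form linear map A minimizing ||H_src @ A - H_tgt||^2."""
    return torch.linalg.lstsq(H_src, H_tgt).solution

# H_src @ A then approximates the target layer's representations, which is
# what makes the mapping useful for interpretation and early exit.
```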
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
About generalization of different networks Main finding: Generalization in pretraining follows a single dimension Different networks, architectures, seeds, sizes but: Similar performance โ†’ similar linguistic capabilities @aclmeeting accepted ( #NLProc ) Summary & story ๐Ÿงต
9
28
164
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
8 months
A new pretraining technique suggests replacing MLM by predicting the representations directly How? We tie the input rep. to the inputs we mask & contrast it to other reps. Voila! Huge compute/perf gains! @nthngdy Éric de la Clergerie @bensagot
Tweet media one
7
35
163
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
Taking a moment to celebrate๐Ÿฅณ๐ŸŽ‰ ๐——๐—ผ๐—ฐ๐˜๐—ผ๐—ฟ Leshem Choshen Years of research, thanks to all the collaborators Years since the research and... Mister is evolving to ... Doctor
20
1
151
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
6 months
So many warn that evaluating with GPT favors GPT (or any LLM evaluating itself). Now it is also shown. Science, not just educated guesses. (Fig: T5, GPT, Bart each prefer their own) @yiqi_617 @NafiseSadat @chenghua_lin #enough2skim #scientivism
Tweet media one
6
22
152
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
5 months
Task contamination is worse than we imagined? Models perform better on datasets that were released before the models + Models can generate examples from tasks (NYT vibes) Changmao Li @jmflanig
Tweet media one
6
24
139
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
11 months
Predictions throughout training, hyperparams and architectures are yet again shown to lie on a small manifold, which means models learn their classification outputs similarly Mao ... @pratikac #MachineLearning #enough2skim
Tweet media one
2
27
138
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
6 months
Residual connections are ๐Ÿ”ฅ, right? Wait, so why do we only use them to skip 1 layer? Not only did Lucas Georges Gabriel Charpentier & @davidsamuelcz check it, they found that this provided huge gains - winning the babyLM challenge๐Ÿผ
Tweet media one
6
26
132
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
Theory of Mind emerged in GPT Children get it at 9yo No, it doesn't read minds (yet) but, It can empathize and imagine what you know separately from what it or others know Code&data: @Stanford @stanfordnlp @michalkosinski
7
20
131
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
One unsupervised evaluation to rule them all (no, not BLEU...) SESCORE is a general reference-based metric that requires no human annotation SoTA in Translation, Captioning and more @WendaXu2 Tuan @yujielu_10 @m2saxon @lileics @WilliamWangNLP @ucsbNLP
Tweet media one
6
29
121
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
"Tokens" - the magic๐Ÿช„ that transforms strings to inputs BPE and wordPiece are the same. right? Well... It is a constant thing in pretaining there is nothing I can do about it, right? wrong... A ๐Ÿงตon tokenization methods and regularization
Tweet media one
2
14
117
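To see how much the tokenizer choice matters, it is enough to run the same sentence through a BPE-based and a WordPiece-based vocabulary. A small illustration using the `transformers` library; the model names are just common public checkpoints, and the sentence is arbitrary:

```python
from transformers import AutoTokenizer

text = "Tokenization is a constant thing in pretraining, right?"
for name in ["gpt2", "bert-base-uncased"]:        # BPE vs. WordPiece vocabularies
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.tokenize(text))
# The segmentations differ, and regularization methods such as BPE-dropout
# deliberately sample different segmentations of the same word during training.
```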
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
9 months
๐ŸฆพCohere beats Davinci on HELM ๐Ÿ˜ตโ€๐Ÿ’ซBut only if you also test Cohere medium How reliable are our benchmarks really? A fascinating thread on HELM, reliable benchmarks & saving X100 compute Are you up to it? ๐Ÿงต
7
35
106
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
5 months
Solar mixes two base-model copies to create a larger one Then train it a bit more and beat other open models out there. How? and my thoughts ๐Ÿงต @upstageai (no author with a handle?!) #scientivism
Tweet media one
2
16
110
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
11 months
What are larger models worse at? The Inverse Scaling competition was much discussed for its novelty and the $100K prize. What did they find?
Tweet media one
3
14
108
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
8 months
Flying to start the postdoc at MIT-IBM Wish me luck ๐Ÿ›ฉ๏ธ9/11
11
1
103
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 months
We share code on @github We share datasets on @huggingface But where do we share our data processing? We prompt, clean, and filter but on our own๐Ÿฅบ Unitxt๐Ÿฆ„ A preprocessing tool That we can grow together @IBMResearch
2
14
100
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
What do we know about using a fine-tuned model rather than the pretrained one? They are sometimes much better, but what else? A story of great #scientivism hypotheses and their rejections The story of a field Survey ๐Ÿงต
Tweet media one
2
21
98
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
7 months
Is knowledge located or distributed? Delete ~1% of the parameters and This is enough to remove knowledge but leave everything else functioning Deniz Bayazit, @negarforoutan , @eric_zemingchen , @gail_w , @ABosselut
Tweet media one
1
22
98
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
4 months
Do you want the best model? Sure, but the best for trying it once, or the best after prompt engineering? It is not the same one โ˜น๏ธ On the sensitivity of LLMs to prompts: @moranmiz @gkpln3 Dan Malkin @DrorRotem @HyadataLab @GabiStanovsky
Tweet media one
5
12
94
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
Transformers are not Seq2Seq Given a context They predict a single token We use this to update representations between predicted tokens & feed a changing graph relevant to the current token me @oabend #conll2020 #NLProc
Tweet media one
2
19
93
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
Networks first learn generalizations and then memorize. This phenomenon was presented at a recent ACL. Are the two really different? For some time I thought this must be the reason networks beat VC dimension. Do you think differently? My opinion:
Tweet media one
4
13
94
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
LLMs act in multiple languages, but how? Is it separate knowledge? Unified knowledge? Or do they translate everything into English? tl;dr translate @AmiiThinks
Tweet media one
2
13
89
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
Why doesn't MixUp work in NLP? It does work in other fields, right?
Tweet media one
7
8
83
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
Chain of Thought for vision beating GPT3 by 16% and supposedly humans Text and captions are not enough, but with vision CoT does really well @zhangzhuosheng @astonzhangAZ @mli65 Hai Zhao @karypis @smolix
Tweet media one
3
22
84
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
Once and for all What is the intuition behind warming up learning rate? I understand why it makes sense to decay the learning rate. But why should it start small and rise?
12
9
76
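Not an answer to the "why" question above, just a reference for what the schedule being asked about typically looks like: a linear warmup followed by a decay. The values in this sketch are illustrative, not a recommendation.

```python
import math

def learning_rate(step, base_lr=1e-4, warmup_steps=4000, total_steps=100_000):
    if step < warmup_steps:
        return base_lr * step / warmup_steps                     # linear ramp from ~0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

# learning_rate(0) == 0, learning_rate(4000) == base_lr, learning_rate(100_000) == 0
```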
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 month
"Only large models have emergent abilities" This mystic mantra is challenged again Broad ๐Ÿงต on the debate + new findings: 165M parameter models trained on simple English are better zero shot learners Show scaling laws But not few shot benefits (yet?)
Tweet media one
4
16
71
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
A replacement to probing: Prompt Probing Train the input vector (prompt) so probing can't learn by itself, only extract what the model learnt Jiaoda Li @ryandcotterell @mrinmayasachan
Tweet media one
2
11
72
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
3 years
Linguistic features are found by probes, but where in the representation's geometry? @evanqed @jacobandreas The Low-Dimensional Linear Geometry of Contextualized Word Representations #conll2021 #EMNLP2021livetweet
Tweet media one
4
10
69
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
Opposite scaling law: detection of machine-generated text is done better by smaller models Everyone (outside #NLProc ...) is afraid GPT would cheat for them, which pushes for detection methods Mireshghallah @MatternJustus Gao @rzshokri @BergKirkpatrick
Tweet media one
3
11
65
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
5 months
๐Ÿ‘ถBabyLM will be back, but what did we learn? the best papers from this year๐Ÿผ We know one mechanism that learns from 100M words (us), what are the main boosts to reach this in an LLM? #EMNLP #babyLM
Tweet media one
4
7
65
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
7 months
I am sad & will not brag about X #EMNLP2023 papers accepted Do you know how many (children) got kidnapped? Massacred by Hamas' Army going house by house? I will not share about it, except on this thread But silence? Silence was too much, too little Your comments go there๐Ÿ‘‡
10
7
63
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
I have just found a new phenomenon: Linear mode connectivity What is the loss of the mid-model? A model somewhere between converged models with different seeds? #MachineLearning
Tweet media one
7
11
63
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
LLMs would never acquire meaning. Or, perhaps they already have? An opinionated review of LMs' current state and their ability to capture meaning, a scholarly-philosophical paper by who if not @spiantado @FelixHill84
5
9
61
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
30 days
Is DPO Superior to PPO for LLM Alignment? No. A comprehensive study shows that PPO is better (except in terms of runtime and complexity of running), theoretically and empirically
5
11
61
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
7 months
Just 2 years ago, we introduced the concept of model fusing (now known as merging, introduced by others in parallel) And it is so well adopted, it now has a survey! We have come a long way. Weishi Li Yong Peng, @Miao_Zhang_dr @liangdingNLP Han Hu, Li Shen
2
13
62
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
10 months
See the sheer joy of my collaborators at #ACL2023 ๐Ÿคฉ DissentQA won best Paper AC award This is a happy outcome of the fruitful collaboration with a group of wonderfully friendly people @EllaNeeman @OHonovich @roeeaharoni @AbendOmri &Szpektor Details:
Tweet media one
2
6
58
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
Stabilizing Training by Understanding dynamics Reducing the peakiness (entropy) of the attention provides huge stability benefits less need for LN, warmup, decay @zhaisf @EtaiLittwin @danbusbridge @jramapuram @YizheZhangNLP @thoma_gu @jsusskin #CV #NLProc
Tweet media one
3
9
59
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
4 years
dฬตrฬตoฬตpฬตoฬตuฬตtฬต Mixout randomly replaces weights with zฬตeฬตrฬตoฬตsฬต weights of another model (e.g. pretrained BERT) to make it closer to the other model while transfer learning (fine-tuning). Clean and simple @kchonyc @iclr_conf
@stephenroller
Stephen Roller
4 years
Mixout () is a cool way to regularize your large neural network. I quickly wrote an implementation in pytorch that works with (most) arbitrary nn.Modules:
0
31
169
1
5
59
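A minimal sketch of the core Mixout idea described above, not Stephen Roller's implementation and without the dropout-style rescaling the original method applies: with some probability, each finetuned weight entry is swapped back to its pretrained value.

```python
import torch

@torch.no_grad()
def mixout_step(model, pretrained_state, p=0.1):
    """Randomly reset a fraction p of each parameter's entries to the
    corresponding pretrained values (core idea only, no rescaling)."""
    for name, param in model.named_parameters():
        mask = torch.rand_like(param) < p
        param.data[mask] = pretrained_state[name].to(param.device)[mask]
```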
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
16 days
Pretrain to predict the future At each step the model predicts n tokens Performance: ๐Ÿ˜ƒ Inference time: โœ–๏ธ3 Training time: same @AIatMeta @FabianGloeckle @byoubii @b_roziere @dfpazr @syhw
Tweet media one
2
8
55
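The training objective in the tweet above can be sketched as one shared trunk with several small heads, each predicting a token further into the future. This is a simplified sketch, not Meta's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    """n_future linear heads on a shared trunk; head i predicts the token
    i steps ahead, and the per-head losses are summed."""
    def __init__(self, hidden_size, vocab_size, n_future=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden_size, vocab_size)
                                   for _ in range(n_future))

    def loss(self, trunk_states, token_ids):
        # trunk_states: (batch, seq, hidden); token_ids: (batch, seq)
        total = 0.0
        for i, head in enumerate(self.heads, start=1):
            logits = head(trunk_states[:, :-i])   # positions that have a target i steps ahead
            targets = token_ids[:, i:]
            total = total + F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                            targets.reshape(-1))
        return total
```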
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 months
ROME edits model weights to replace a fact the model knew with another But it sometimes ruined everything in the process Not anymore: this was an implementation problem and R-ROME solved it ๐Ÿดโ€โ˜ ๏ธarrr-rome๐Ÿดโ€โ˜ ๏ธ @akshatgupta57 @GopalaSpeech @UCBerkeley @UCSF
5
10
54
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
I can't understand how this paper is so overlooked Human annotation was a dreadful thing to me all my PhD: costly, cumbersome, requires my constant supervision This is a game changer (and it's not even mine...) But generalized reranking gets double the PR... #scientivism
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
๐Ÿงžโ€โ™‚๏ธHuman annotations for free๐Ÿฅณ As each one reinvents their annotation process GENIE๐Ÿงž Just decided to standardize the whole process (API MTurk...) Upload your process and the next papers would be able to exactly replicate And they pay for annotation! #EMNLP2022livetweet
1
3
48
2
6
55
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
3 months
The best prompt, literally: ยซCommand, we need you to plot a course through this turbulence and locate the source of the anomaly. Use all available data and your expertise to guide us through this challenging situation.ยป Models respond well to positive thinking 1/n
Tweet media one
2
9
52
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
9 months
Did you know: Evaluating a single model on HELM took โฑ๏ธ4K GPU hours or ๐Ÿ’ธ+10K$ in API calls?! Flash-HELMโšก๏ธ๏ธcan reduce costs by X200!
Tweet media one
7
12
50
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
Few-shot learning almost reaches traditional machine translation Xavier Garcia @whybansal @ColinCherry George Foster, Maxim Krikun @fengfangxiaoyu @melvinjohnsonp @orf_bnw #enough2skim #NLProc #neuralEmpty
Tweet media one
3
9
52
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
A successor to MRT? MAD - Reinforcement learning in Machine Translation A new way to effectively optimize generation towards any metric @DeepMind Domenic Donato @LeiYu63 Wang Ling @redpony
Tweet media one
3
7
50
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
Getting ready to write? General writing, section and latex tips just for you (and you and you). Please share any comments so we can improve them together. Good luck in @NeurIPSConf @conll_conf @ARRPreprints #NLProc #MachineLearning
3
11
50
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
21 days
We continually work on -Continual learning- The thing is that we have catastrophic forgetting and keep reinventing methods This, or just delving into a new field, is what makes surveys so important. I won't summarize a survey, just go read it...
Tweet media one
2
10
50
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
Not only scale: GPT3 results with 1/25 the size. How? By retrieval. RETRO (list of authors below, too long) @DeepMind
Tweet media one
3
10
51
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
Learning to rerank without reranking. Reranking beam search (BSR) is often useful but costly, this work mimics the behavior without the need for large beam and slow inference. (caveats below) @yzpang97 @hhexiy @kchonyc
3
13
51
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
7 months
Back in the days of 2021 there was a lovely evaluation paper: โž•Automatically identifying label errors โž•Improving scores' reliability โž•Finding examples' difficulty โž•Active Learning @EntilZhaPR @barrowjoseph @miserlis_ @robinomial @boydgraber #deepRead
Tweet media one
1
7
50
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
How to train your model? With limited text? Improved architecture, good data; more training doesn't help (?!) Details in the paper @davidsamuelcz Andrey Kutuzov @LiljaOvrelid @erikve #NLProc #MachineLearning #enough2skim
Tweet media one
1
7
50
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
8 months
chatGPT solves it at last P!=NP With 97 reasoning conversations (CoT but also others) it does what no one did before. It's AGI bro! @MicrosoftAI
Tweet media one
5
10
49
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
oBERTa: a recipe to compress pretrained models and infer fast Prune, take distillation, remove unwanted weights (sparsify), quantize (reduce float precision), avoid freezing weights - then serve and infer with better accuracy and speed @spacemanidol @markurtz_
Tweet media one
2
12
48
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
13 days
Models know when they hallucinate: they output probabilities and get calibrated Shouldn't we instead make models communicate it explicitly? When a person is unsure they just say "I am not sure, but" @sunniesuhyoung @QVeraLiao @mihaela_v Ballard Vaughan
2
6
48
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
3 years
LMs learn generalizations in the same order (which?). Same acquisition of grammatical phenomena capabilities, regardless of data, seed, architecture. Humbly, fascinating work with @saksheli Weinshall @AbendOmri #NLProc #MachineLearning #deepRead 1/n
Tweet media one
3
22
49
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
11 days
Building an Arabic pretraining corpus: 101 billion words, the largest to date Manel Aloui @HasnaChouikhi Ghaith Chaabane @haithemkchaou Chehir Dhaouadi @clusterlabai
Tweet media one
4
13
49
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
๐Ÿงžโ€โ™‚๏ธHuman annotations for free๐Ÿฅณ As each one reinvents their annotation process GENIE๐Ÿงž Just decided to standardize the whole process (API MTurk...) Upload your process and the next papers would be able to exactly replicate And they pay for annotation! #EMNLP2022livetweet
1
3
48
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
TinyStories: Tiny models are coherent and understand instructions If their data is very simple What is simple? What 3-4 year old vocabularies allow (according to LLMs...) @MSFTResearch @EldanRonen Yuanzhi Li
Tweet media one
1
5
45
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
@alon_jacovi
Alon Jacovi
1 year
Worried about test data being used in training? The LLM world is going through a data contamination crisis. Here's us trying to do something about it: Paper: Blog: w/ @clu_avi @omerNLP @yoavgo
Tweet media one
8
71
267
1
3
44
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
8 months
๐Ÿฅ๐ŸฃWelcome the new babies!๐Ÿ‘ถ๐Ÿ‘ผ๐Ÿผ 19 pretrained models on the loose track 24 on the strict 118 on strict-small We are proud of >30 pretraining teams submitting papers to babyLM! FOMO? Get updated on CoNLL or participate next year
2
5
44
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
Let the LM improve its prompt 1โƒฃGet prompt & examples 2โƒฃAsk what was wrong with the prompt 3โƒฃPropose new prompts 4โƒฃEfficiently evaluate which prompt works best Repeat from 2 Pryzant @dan_iter @jerryzli Lee, Zhu @mjjzha @MSFTResearch
Tweet media one
2
9
44
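The loop in the tweet above can be sketched in a few lines. Here `llm(text)` and `score(prompt, examples)` are hypothetical placeholders for your own model call and evaluation, and the critique wording is illustrative, not the paper's prompt.

```python
def improve_prompt(prompt, examples, llm, score, rounds=3):
    """Iteratively critique the current best prompt and propose replacements."""
    best, best_score = prompt, score(prompt, examples)
    for _ in range(rounds):
        critique = llm(f"Prompt: {best}\nExamples: {examples}\n"
                       "What is wrong with this prompt?")
        candidate = llm(f"The critique of the previous prompt was: {critique}\n"
                        "Write an improved prompt.")
        candidate_score = score(candidate, examples)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```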
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
3 years
Can language models encode the topology of colours from text only? yes! #conll2021 #EMNLP2021livetweet Mostafa Abdou, Artur Kulmizev, @daniel_hers @stellaBotte @Brown_NLP Anders Sรธgaard
Tweet media one
4
8
42
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
Time for the BabyLM challenge
@emollick
Ethan Mollick
1 year
We are running out of a vital resource: words! There are โ€œonlyโ€ 5 to 10 trillion high-quality words (papers, books, code) on the internet. Our AI models will have used all of that for training by 2026. Low-quality data (tweets, fanfic) will last to 2040.
Tweet media one
Tweet media two
78
302
2K
1
9
40
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
โžก๏ธMindblowing pretraining paradigmโฌ…๏ธ Train the same model to predict the two directions separately๐Ÿ”› Better results, more parallelization @MSFTResearch @NguynTu24128917 @eigenikos @WeizhuChen #deepRead
Tweet media one
3
11
40
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
25 days
How to make LMs learn an abstract language representation from learning 2 languages on the same concept? A shared token helps, sharing output language helps more Tianze Hua @tianyunnn @Brown_NLP (no, I won't put it at the top hype-seekers)
Tweet media one
2
6
39
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
A 1T parameters LLM Outperforming stuff Based on mixture of experts, sparse updates, random routing How did I read nothing about it? Because it is Chinese? What am I missing? #scientivism @Huawei
1
4
39
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
Needed some order in all those parameter-efficient finetuning methods? @anyabelz @M___Sabry
Tweet media one
1
5
38
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
What do we know about model shortcuts? LLMs surprise us in their generalization abilities, but they just as often fail and rely on the wrong features. Why? When? How to prevent? A new survey and a ๐Ÿงต @DuMNCH Fengxiang He, Na Zou, Dacheng Tao @huxia
5
8
37
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
5 years
RL methods for MT (MRT, GANs and REINFORCE) might not get their performance boost from actually improving translations. Moreover, there are no convergence guarantees for MRT. For these practical and theoretical (a first for MRT?) results, see our new preprint
1
10
36
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
18 days
Nonsense inputs may make sense for LMs Some phrases in the gibberish rubble make models answer or regurgitate knowledge. But what can we learn about those nonsensical phrases, or from them about LMs? @V__Cherepanova @james_y_zou
Tweet media one
3
4
36
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
What do pretrained models learn and when? Ekaterina Voloshina Oleg Serikov @rybolos
Tweet media one
2
1
36
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
You know what? I will stop sharing any LLM "news" if they don't share with me first (models or code) #thereOrIDontCare #scientivism #uShareFirst Check this paper out Thanks @deliprao for inspiration
@DrJimFan
Jim Fan
1 year
After ChatGPT, the future belongs to multimodal LLMs. Whatโ€™s even better? Open-sourcing. Announcing Prismer, my teamโ€™s latest vision-language AI, empowered by domain-expert models in depth, surface normal, segmentation, etc. No paywall. No forms.
Tweet media one
96
800
4K
1
5
35
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 months
Interpretability and MI are a waste of research power and do not contribute to the actual advancement in the field. Have a strong feeling about it? Spill it in comments or help those researchers (not me): @mariusmosbach @megamor2 @tombrownev @DippedRusk
1
6
35
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 month
Do you know how ๐Ÿค— can save an exabyte per month? or even how much is an exabyte? (I didn't) Dedicated model Compression can save 50%! (avg. 25%) Yes, it can compress quantized models as well... @MITIBMLab @IBMResearch @MIT_CSAIL @BU_Tweets ๐Ÿงต
Tweet media one
1
6
35
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 months
LLMs perform tasks well even given scrambled sentences When the model can reconstruct the unscrambled sentence it ignores order, but only then๐Ÿงต Chen O'Donnell @sivareddyg @Mila_Quebec @McGillU
Tweet media one
2
3
34
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
4 months
English code models are better than Chinese ones on Chinese They hallucinate less They generalize better If true, this defies our thoughts on LMs as domain experts @AntGroup (no author handles? @ShiweiLiu9 ?)
3
6
34
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
Larger models are better๐Ÿ˜ฑ But... Can we train smaller models to be better? Can we learn about language learning? Our baby๐Ÿ‘ถ, babyLM challenge in the @nytimes : โญ๏ธ๐ŸŒŸ @a_stadt @amuuueller @weGotlieb @jhuclsp @EvaPortelance & @sama
1
5
32
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
5 months
#neurips has rooms for VIPs With more food, sitting space etc. When did we start having classes?
Tweet media one
5
0
32
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
1 year
In-Context-Learning == gradient descent or disregards labels completely?! Why not both? Models recognize the task but also learn it & The benefits of actual learning grow with # examples and model size Jane Pan @gaotianyu1350 @__howardchen @danqi_chen
Tweet media one
2
8
30
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
The intuitions behind warmup, a summary ๐Ÿงต I asked what are the intuitions behind warm-up (I had none). I got many answers (and 2 papers) in the cited tweet and thought to give something back. Now they are digestible Thread unroll:
@LChoshen
Leshem Choshen ๐Ÿค–๐Ÿค—
2 years
Once and for all What is the intuition behind warming up learning rate? I understand why it makes sense to decay the learning rate. But why should it start small and rise?
12
9
76
4
9
29