We want to pretrain
Instead we finetune
Could we collaborate?
ColD Fusion:
Recycle finetuning to multitask
➡️evolve pretrained models forever
On 35 datasets
+2% improvement over RoBERTa
+7% in few shot settings
🧵
#NLProc
#MachineLearning
#NLP
#ML
#modelRecycling
Pretraining with 1 GPU and 1 day
This paper is a HUGE list of all the tricks you could think of and
what works to make training efficient given 1 GPU and 1 day
@jonasgeiping
@tomgoldsteincs
During training, your loss goes up and down up and down up and down.
But how would it go if you magically went in a straight line
from init to learnt position?
Apparently smoothly down!
On the surprising Linear Interpolation:
#scientivism
#deepRead
#MachineLearning
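Not the paper's code, just a minimal sketch of the experiment (model, data and loss are placeholders you would swap in): walk a straight line in weight space from the init to the trained weights and record the loss along the way.

```python
import copy
import torch

def loss_along_line(model_init, model_final, loss_fn, batch, steps=11):
    """Loss at evenly spaced points on the segment between two weight vectors."""
    x, y = batch
    losses = []
    for alpha in torch.linspace(0, 1, steps):
        probe = copy.deepcopy(model_init)
        with torch.no_grad():
            for p, p0, p1 in zip(probe.parameters(),
                                 model_init.parameters(),
                                 model_final.parameters()):
                # theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final
                p.copy_((1 - alpha) * p0 + alpha * p1)
            losses.append(loss_fn(probe(x), y).item())
    return losses  # the surprising finding: this curve tends to go down smoothly
```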
Is data really important for pretraining?
Could we just pretrain on 1 picture? Only synthetic text? Fractals?
A 🧵 summing the image and text papers that do just that.
and they all have a similar conclusion
How ICL 𝘦𝘮𝘦𝘳𝘨𝘦𝘴 from unsupervised data?
𝘐𝘵 𝘭𝘦𝘢𝘳𝘯𝘴 𝘧𝘳𝘰𝘮 parallel phrases
After deleting parallel parts, ICL ability was reduced by 51%; deleting random words - only 2%
🧵
@yanda_chen_
@henryzhao4321
@Zhou_Yu_AI
@hhexiy
@columbianlp
Computational (Chomskyan) hierarchies can predict OOD capabilities
Different places in the formal hierarchy - different generalizations the architecture can perform
Got your attention?
Details in 🧵
@DeepMind
I don't train from scratch, I use RoBERTa
Wait…
Why not cross-encoder/stsb-roberta? facebook/muppet-roberta?
We automatically identify the best models on 🤗 (periodically)
Just pick the best one
and finetune on your task
Data augmentation? Look no further.
Framework of 100+ "transformations" (augmentations/paraphrasing functions/filters)
Many types: emojis, linguistic... see Fig
Extendable!
A vast effort, constructed by almost a hundred authors!
#scientivism
Recycling Finetuned models, it works!
Finetuned models lie everywhere,
there must be a way to use the data and compute invested in them.
Apparently averaging their weights is such a method.
3 papers & a 🧵
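A minimal sketch of the averaging idea (checkpoint names are placeholders, not from the papers): load several finetuned models that share one architecture and average their weights.

```python
import torch
from transformers import AutoModel

# Placeholder checkpoint names; use any finetuned models sharing one architecture.
checkpoints = ["org/roberta-finetuned-a", "org/roberta-finetuned-b", "org/roberta-finetuned-c"]
models = [AutoModel.from_pretrained(c) for c in checkpoints]

avg_state = {}
for key, ref in models[0].state_dict().items():
    if ref.dtype.is_floating_point:
        avg_state[key] = torch.stack([m.state_dict()[key] for m in models]).mean(dim=0)
    else:
        avg_state[key] = ref  # keep integer buffers (e.g., position ids) untouched

fused = AutoModel.from_pretrained(checkpoints[0])
fused.load_state_dict(avg_state)  # a recycled starting point for further finetuning
```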
Labelled data is scarce, what can we do?
We can MLM on the unlabeled data, but
You can do better:
Cluster & Tune - finetune on clusters as labels
#acl2022nlp
#NLProc
#MachineLearning
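My loose sketch of the idea, not the paper's pipeline (sentence embeddings + KMeans stand in here for whatever clustering the paper actually uses): cluster the unlabeled texts, then use the cluster ids as pseudo-labels for an intermediate finetuning step.

```python
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

texts = [f"unlabeled document {i}" for i in range(1000)]  # placeholder: your unlabeled corpus
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
pseudo_labels = KMeans(n_clusters=50, random_state=0).fit_predict(embeddings)
# Intermediate step: finetune a classifier on (texts, pseudo_labels),
# then finetune again on the small labeled set for the real task.
```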
Finetuning millions of dimensions is not as complex as you may think🤯
Actually, it is quite interpretable in Euclidean space by
angles from the pretraining.
Seeds fall in small regions
Tasks in larger ones
All in some direction
What's in a layer?🕵🏻‍♀️
Representations are vectors
If only they were words...
Finding:
Any layer can be mapped well to another linearly
Simple, efficient & interpretable
& improves early exit
Story and 🧵
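Roughly what "mapped linearly" means here, with random tensors standing in for real hidden states: fit a least-squares linear map from layer i activations to layer j activations and check the reconstruction error.

```python
import torch

H_src = torch.randn(1000, 768)  # stand-in: hidden states from layer i over 1000 tokens
H_tgt = torch.randn(1000, 768)  # stand-in: hidden states from layer j for the same tokens
W = torch.linalg.lstsq(H_src, H_tgt).solution  # 768x768 linear map, least squares
reconstruction_error = torch.nn.functional.mse_loss(H_src @ W, H_tgt)
print(reconstruction_error)  # low error = layer i predicts layer j linearly
```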
About generalization of different networks
Main finding: Generalization in pretraining follows a single dimension
Different networks, architectures, seeds, sizes but:
Similar performance โ similar linguistic capabilities
@aclmeeting
accepted (#NLProc)
Summary & story 🧵
A new pretraining technique
suggests replacing MLM by predicting the representations directly
How?
We tie the input rep. to the rep. of the input we mask
& contrast it with other reps. Voilà!
Huge compute/perf gains!
@nthngdy
Éric de la Clergerie
@bensagot
Taking a moment to celebrate🥳
𝗗𝗼𝗰𝘁𝗼𝗿 Leshem Choshen
Years of research, thanks to all the collaborators
Years since the research and...
Mister is evolving to ... Doctor
Task contamination is worse than we imagined?
Models perform better on datasets that were released before the models
+ Models can generate examples from tasks (NYT vibes)
Changmao Li
@jmflanig
Predictions throughout training, across hyperparams and architectures, are yet again shown to lie on
a small manifold,
which means models learn their classification outputs similarly
Mao ...
@pratikac
#MachineLearning
#enough2skim
Residual connections are 🔥, right?
Wait, so why do we only use them to skip 1 layer?
Not only did Lucas Georges Gabriel Charpentier &
@davidsamuelcz
check it,
they found that it provides huge gains - winning the babyLM challenge🍼
Theory of Mind emerged in GPT
Children get it at 9yo
No, it doesn't read minds (yet), but
it can empathize and imagine what you know
separately from what it or others know
Code&data:
@Stanford
@stanfordnlp
@michalkosinski
"Tokens" - the magic๐ช that transforms strings to inputs
BPE and wordPiece are the same. right? Well...
It is a constant thing in pretaining there is nothing I can do about it, right? wrong...
A ๐งตon tokenization methods and regularization
🦾Cohere beats Davinci on HELM
😵‍💫But only if you also test Cohere medium
How reliable are our benchmarks really?
A fascinating 🧵 on HELM,
Reliable benchmarks
& saving X100 compute
Are you up to it?
🧵
SOLAR mixes two base-model copies
to create a larger one,
then trains it a bit more and beats other open models out there.
How? and my thoughts 🧵
@upstageai
(no author with a handle?!)
#scientivism
We share code on
@github
We share datasets on
@huggingface
But where do we share our data processing?
We prompt, clean, and filter
but on our own🥺
Unitxt
A preprocessing tool
That we can grow together
@IBMResearch
What do we know about using a fine-tuned model rather than the pretrained one?
They are sometimes much better, but what else?
A story of great
#scientivism
hypotheses and their rejections
The story of a field
Survey 🧵
Transformers are not Seq2Seq
Given a context
They predict a single token
We use this to update representations between predicted tokens
& feed a changing graph relevant to the current token
me
@oabend
#conll2020
#NLProc
Networks first learn generalizations and then memorize.
This phenomenon was presented at a recent ACL.
Are the two really different?
For some time I thought this must be the reason networks beat VC-dimension bounds.
Do you think differently?
My opinion:
LLMs act in multiple languages, but how?
Is it separate knowledge?
Unified knowledge?
Or do they translate everything into English?
tl;dr translate
@AmiiThinks
Once and for all
What is the intuition behind warming up learning rate?
I understand why it makes sense to decay the learning rate.
But why should it start small and rise?
"Only large models have emergent abilities"
This mystic mantra is challenged again
Broad 🧵 on the debate
+ new findings:
165M parameter models trained on simple English
are better
zero shot learners
Show scaling laws
But not few shot benefits (yet?)
A replacement to probing: Prompt Probing
Train the input vector (the prompt) so the probe can't learn the task by itself, only extract what the model learnt
Jiaoda Li
@ryandcotterell
@mrinmayasachan
Opposite scaling law: detection of machine-generated text is done better by smaller models
Everyone (outside
#NLProc
...) is afraid GPT would cheat for them, which pushes for detection methods
Mireshghallah
@MatternJustus
Gao
@rzshokri
@BergKirkpatrick
👶BabyLM will be back, but what did we learn?
The best papers from this year🍼
We know one mechanism that learns from 100M words (us), what are the main boosts to reach this in an LLM?
#EMNLP
#babyLM
I am sad & will not brag about X
#EMNLP2023
papers accepted
Do you know how many (children) got kidnapped? Massacred by Hamas' Army going house by house?
I will not share about it, except on this thread
But silence? Silence was too much, too little
Your comments go there
I have just found a new phenomenon:
Linear mode connectivity
What is the loss of the mid-model?
A model somewhere between converged models with different seeds?
#MachineLearning
LLMs would never acquire meaning.
Or, perhaps they already have?
An opinionated review of LMs' current state and their ability to capture meaning
a scholarly-philosophical paper by who if not
@spiantado
@FelixHill84
Is DPO Superior to PPO for LLM Alignment? No.
A comprehensive study shows that PPO is better (except for runtime and complexity of running it)
Theoretically and empirically
Just 2 years ago, we introduced the concept of
model fusing (now aka merging, introduced by others in parallel)
And it is so well adopted, it now has a survey!
We have come a long way.
Weishi Li, Yong Peng,
@Miao_Zhang_dr
@liangdingNLP
Han Hu, Li Shen
See the sheer joy of my collaborators at
#ACL2023
🤩
DissentQA
won the Best Paper AC award
This is a happy outcome of the fruitful collaboration with a group of wonderfully friendly people
@EllaNeeman
@OHonovich
@roeeaharoni
@AbendOmri
&Szpektor
Details:
d̵r̵o̵p̵o̵u̵t̵ Mixout randomly replaces weights with z̵e̵r̵o̵s̵ weights of another model (e.g. pretrained BERT) to make it closer to the other model during transfer learning (fine-tuning).
Clean and simple
@kchonyc
@iclr_conf
Mixout () is a cool way to regularize your large neural network. I quickly wrote an implementation in pytorch that works with (most) arbitrary nn.Modules:
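Not that implementation, just a functional sketch of the idea: with probability p each finetuned weight is swapped back to the pretrained value, then rescaled so the expectation stays at the finetuned weight.

```python
import torch

def mixout(finetuned_w: torch.Tensor, pretrained_w: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    """Randomly replace a fraction p of finetuned weights with the pretrained ones."""
    swap = torch.rand_like(finetuned_w) < p
    mixed = torch.where(swap, pretrained_w, finetuned_w)
    # Rescale so the expectation equals the finetuned weight.
    return (mixed - p * pretrained_w) / (1.0 - p)

# Usage: apply to each weight matrix in the forward pass while fine-tuning.
w_ft = torch.randn(768, 768, requires_grad=True)
w_pt = torch.randn(768, 768)
w_used = mixout(w_ft, w_pt, p=0.1)
```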
ROME edits model weights to replace a fact the model knew with another
But, it sometimes ruined everything in the process
Not anymore, this was an implementation problem and
R-ROME solved it
🏴‍☠️arrr-ROME🏴‍☠️
@akshatgupta57
@GopalaSpeech
@UCBerkeley
@UCSF
I can't understand how this paper is so overlooked
Human annotation was a dreadful thing to me all my PhD: costly, cumbersome, requiring my constant supervision
this is a game changer (and it's not even mine...)
But generalized reranking gets double the PR...
#scientivism
🧞Human annotations for free🥳
As each one reinvents their annotation process GENIE🧞
Just decided to standardize the whole process (API MTurk...)
Upload your process and
the next papers would be able to exactly replicate
And they pay for annotation!
#EMNLP2022livetweet
The best prompt, literally:
ยซCommand, we need you to plot a course through this turbulence and locate the source of the anomaly. Use all available data and your expertise to guide us through this challenging situation.ยป
Models respond well to positive thinking
1/n
A successor to MRT?
MAD - Reinforcement learning in Machine Translation
A new way to effectively optimize generation towards any metric
@DeepMind
Domenic Donato
@LeiYu63
Wang Ling
@redpony
We continually work on -Continual learning-
the thing is that we have catastrophic forgetting and keep reinventing methods
This, or just delving into a new field is what makes surveys so important.
I won't summarize a survey, just go read it...
Learning to rerank without reranking.
Reranking beam search (BSR) is often useful but costly; this work mimics the behavior without the need for a large beam and slow inference. (caveats below)
@yzpang97
@hhexiy
@kchonyc
oBERTa Recipe
to compress pretrained models and infer fast
Prune
Use distillation
Remove unwanted weights (sparsify)
Quantize (reduce float precision)
Avoid freezing weights
Serve and infer with better accuracy and speed
@spacemanidol
@markurtz_
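The quantization step, illustrated with plain PyTorch post-training dynamic quantization (a stand-in, not the tooling the recipe itself uses):

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base")  # stand-in checkpoint
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8  # Linear weights stored as int8
)
# Smaller on disk and faster for CPU inference; accuracy should be re-checked per task.
```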
Models know when they hallucinate: they output probabilities and can get calibrated
Shouldn't we instead make models communicate it explicitly?
When a person is unsure they just say "I am not sure, but"
@sunniesuhyoung
@QVeraLiao
@mihaela_v
Ballard Vaughan
LMs learn generalizations in the same order (which?).
Same acquisition of grammatical phenomena capabilities,
regardless of data, seed, architecture.
Humbly, fascinating work with
@saksheli
Weinshall
@AbendOmri
#NLProc
#MachineLearning
#deepRead
1/n
🧞Human annotations for free🥳
As each one reinvents their annotation process GENIE🧞
Just decided to standardize the whole process (API MTurk...)
Upload your process and
the next papers would be able to exactly replicate
And they pay for annotation!
#EMNLP2022livetweet
TinyStories: Tiny models are coherent and understand instructions
If their data is very simple
What is simple?
What 3-4 year old vocabularies allow (according to LLMs...)
@MSFTResearch
@EldanRonen
Yuanzhi Li
Worried about test data being used in training?
The LLM world is going through a data contamination crisis.
Here's us trying to do something about it:
Paper:
Blog:
w/
@clu_avi
@omerNLP
@yoavgo
Welcome the new babies!👶🍼🍼
19 pretrained models on the loose track
24 on the strict
118 on strict-small
We are proud of >30 pretraining teams submitting papers to babyLM!
FOMO?
Get updated on CoNLL or
participate next year
Let the LM improve its prompt
1️⃣Get prompt & examples
2️⃣Ask what was wrong with the prompt
3️⃣Propose new prompts
4️⃣Efficiently evaluate which prompt works best
Repeat from 2
Pryzant
@dan_iter
@jerryzli
Lee, Zhu
@mjjzha
@MSFTResearch
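A rough sketch of the loop above (my paraphrase, not the authors' code); ask_llm is a hypothetical stand-in for whichever LLM API you call.

```python
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical stand-in: plug in your LLM call here")

def accuracy(prompt: str, examples: list[tuple[str, str]]) -> float:
    """Score a candidate prompt on (input, expected_output) pairs."""
    hits = [ask_llm(f"{prompt}\n\n{x}").strip() == y for x, y in examples]
    return sum(hits) / len(hits)

def improve_prompt(prompt: str, examples: list[tuple[str, str]], rounds: int = 3) -> str:
    for _ in range(rounds):
        critique = ask_llm(f"This prompt failed on some examples. What is wrong with it?\n{prompt}")
        candidate = ask_llm(f"Rewrite the prompt to fix these issues:\n{critique}\n\nPrompt:\n{prompt}")
        if accuracy(candidate, examples) >= accuracy(prompt, examples):  # keep the better prompt
            prompt = candidate
    return prompt
```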
We are running out of a vital resource: words!
There are “only” 5 to 10 trillion high-quality words (papers, books, code) on the internet. Our AI models will have used all of that for training by 2026. Low-quality data (tweets, fanfic) will last to 2040.
How to make LMs learn an abstract language representation from learning 2 languages on the same concept?
A shared token helps, sharing output language helps more
Tianze Hua
@tianyunnn
@Brown_NLP
(no, I won't put it at the top hype-seekers)
A 1T parameters LLM
Outperforming stuff
Based on mixture of experts, sparse updates, random routing
How did I read nothing about it? Because it is Chinese?
What am I missing?
#scientivism
@Huawei
Papers keep on suggesting training on new facts or new data unseen in the pretraining.
Why isn't it a common practice - in practice?
e.g., from today:
@NickMeck
...
@XiaoxiaoLi8
@RanveerChandra
tfaktas
What do we know about model shortcuts?
LLMs surprise us in their generalization abilities, but they just as often fail and rely on the wrong features.
Why? When? How to prevent?
A new survey and a 🧵
@DuMNCH
Fengxiang He, Na Zou, Dacheng Tao
@huxia
Thoughts upon rereading about Pythia
a large set of reproducible checkpoints pretrained on consistent datasets with some exps. as well
@BlancheMinerva
@haileysch__
et al.
RL methods for MT (MRT, GANs and REINFORCE) might not get their performance boost from actually improving translations.
Moreover, there are no convergence guarantees for MRT.
For these practical and theoretical (a first for MRT?) results, see our new preprint
Nonsense inputs may make sense for LMs
Some phrases in the gibberish rubble
make models answer or regurgitate knowledge.
But what can we learn about those nonsensical phrases or from them on LMs?
@V__Cherepanova
@james_y_zou
After ChatGPT, the future belongs to multimodal LLMs. Whatโs even better? Open-sourcing.
Announcing Prismer, my teamโs latest vision-language AI, empowered by domain-expert models in depth, surface normal, segmentation, etc.
No paywall. No forms.
Interpretability and MI are a waste of research power and do not contribute to the actual advancement in the field.
Have a strong feeling about it?
Spill it in comments or help those researchers (not me):
@mariusmosbach
@megamor2
@tombrownev
@DippedRusk
Do you know how 🤗 can save an exabyte per month?
or even how much is an exabyte? (I didn't)
Dedicated model Compression can save 50%!
(avg. 25%)
Yes, it can compress quantized models as well...
@MITIBMLab
@IBMResearch
@MIT_CSAIL
@BU_Tweets
🧵
Emptying the Ocean with a Spoon: Should We Edit Models?
Model editing changes facts in retrospect, but could it ever make factual models?
@yuvalpi
@melhadad
LLMs perform tasks well even given scrambled sentences
When the model can reconstruct the unscrambled sentence it ignores order, but only then 🧵
Chen O'Donnell
@sivareddyg
@Mila_Quebec
@McGillU
English Code models are better than Chinese
on Chinese
They hallucinate less
They generalize better
If true, this defies our thoughts on LMs as domain experts
@AntGroup
(no author handles?
@ShiweiLiu9
?)
In-Context-Learning == gradient descent or disregards labels completely?!
Why not both?
Models recognize the task but also learn it
& The benefits of actual learning grow with # examples and model size
Jane Pan
@gaotianyu1350
@__howardchen
@danqi_chen
Reviewing has so many faults
Finally, there is a dataset of reviews, edits and everything else!
5 venues 5K papers 11K reviews
Enjoy!
@DyNils
@ilokuznetsov
@IGurevych
The intuitions behind warmup, a summary 🧵
I asked what are the intuitions behind warm-up (I had none).
I got many answers (and 2 papers) in the cited tweet and thought to give something back.
Now they are digestible
Thread unroll:
Once and for all
What is the intuition behind warming up learning rate?
I understand why it makes sense to decay the learning rate.
But why should it start small and rise?
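For concreteness (not from the thread), this is roughly what a linear warmup followed by linear decay looks like in PyTorch:

```python
import torch

model = torch.nn.Linear(10, 2)  # toy model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup_steps, total_steps = 100, 1000

def lr_scale(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # ramp 0 -> 1 over the warmup
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))  # then decay 1 -> 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)
# call scheduler.step() after each optimizer.step()
```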