Our 12 scaling laws (for LLM knowledge capacity) are out: . Took me 4mos to submit 50,000 jobs; took Meta 1mo for legal review; FAIR sponsored 4,200,000 GPU hrs. Hope this is a new direction to study scaling laws + help practitioners make informed decisions
Our first paper in a series studying the inner mechanisms of transformers. TL;DR: we show *how* GPTs learn complex CFG trees via learning to do dynamic programming. Huge thanks to
@MetaAI
for making this research journey possible. FYI to
@OpenAI
@mbzuai
Part 3.2: Why do LLMs need Chain of Thought even for basic questions (e.g., was Biden born on an even day)? We show that LLMs cannot efficiently manipulate knowledge even if such knowledge is 100% extractable; + inverse knowledge search is just impossible.
Result 10/11/12: surprisingly, when pre-training good data (e.g., Wiki) together with "junk" (e.g., Common Crawl), LLMs' capacity on the good data may decrease by 20x! A simple fix: add domain tokens to your data; LLMs can auto-detect domains rich in knowledge and prioritize them.
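A minimal sketch of the domain-token fix. The marker format here is my own illustration, not the paper's exact scheme; the point is only that each piece of data is prefixed with its source:

```python
def add_domain_token(text: str, domain: str) -> str:
    # Prepend a special domain marker (e.g. "<wiki>" or "<cc>") so the
    # model can learn which domains are knowledge-rich and prioritize them.
    return f"<{domain}> {text}"

# A pretraining corpus mixing good data with junk, each tagged by source:
corpus = [
    add_domain_token("Paris is the capital of France.", "wiki"),
    add_domain_token("SKU 8839-AA, qty 3, in stock.", "cc"),
]
print(corpus[0])  # "<wiki> Paris is the capital of France."
```

Since each marker is only one or two extra tokens per document, it adds roughly ~1% to the training tokens.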
Excited to announce our new work, a unified theory towards explaining 3 black magics in deep learning: (1) ensemble, (2) knowledge distillation, and (3) self-distillation. An accessible blog post is below.
Microsoft and CMU researchers begin to unravel 3 mysteries in deep learning related to ensemble, knowledge distillation & self-distillation. Discover how their work leads to the first theoretical proof with empirical evidence for ensemble in deep learning:
Is deep learning actually performing DEEP learning? We may have given the first proof that neural networks are capable of efficient hierarchical learning, while existing theory only shows that deep learning can "simulate" non-hierarchical algorithms.
How does deep learning perform DEEP learning? Microsoft and CMU researchers establish a principle called "backward feature correction" and explain how very deep neural networks can actually perform DEEP hierarchical learning efficiently:
@ZeyuanAllenZhu
Part 1 on YouTube. How to interpret inner workings of transformers? "Induction head" only explains shallow tasks like sequence copying. To make interpretation go deeper, we reverse engineer how GPTs learn CFGs --- via learning to do dynamic programming.
Even if LLMs losslessly memorize the pretraining data, they may not be finetunable to extract knowledge from it. Probing techniques suggest that data augmentation is necessary at the pretrain level, regardless of model size, training time, and finetuning choices.
Expanded YouTube version of Parts 3.1+3.2: this also includes results I didn't cover in my recent offline talks/tweets, such as 1) mix training, 2) celebrity augmentation, 3) BERT models, 4) probing, 5) knowledge partial retrieval, 6) reversal curse.
Result 1/2/3: LLMs can "consistently" achieve 2 bits per parameter in storing knowledge after sufficient training; this predicts that a 7B model is capable of storing the knowledge from all of English Wiki + textbooks, based on our estimation.
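A back-of-envelope check of the 2 bit/param claim; the numbers are illustrative, not from the paper:

```python
def knowledge_capacity_bits(n_params: int, bits_per_param: float = 2.0) -> float:
    # Capacity under the 2 bit/parameter scaling law (Results 1/2/3).
    return n_params * bits_per_param

bits = knowledge_capacity_bits(7_000_000_000)  # a 7B model
print(bits)               # 14 billion bits of knowledge
print(bits / 8 / 2**30)   # ~1.6 GiB, on the order of Wiki's factual content
```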
Incredibly honored to have worked with Avi as his postdoc. Avi's vision is certainly beyond the theory of computation. He asked me in 2016 whether I believe gradient descent can solve everything. He has probably envisioned AGI at that point. 👍
🏆 We're thrilled to announce the recipient of the 2023
#ACMTuringAward
: Avi Wigderson! Wigderson is recognized for his foundational contributions to the theory of computation. Join us in celebrating his incredible achievements! Learn more here:
@the_IAS
Results 8/9: scaling laws for quantization and MoE.
// Quantization to int8 does not hurt knowledge capacity, even for models at max capacity => 2 bits of knowledge can be stored per int8 parameter.
// MoEs with even 32 experts have great capacity => knowledge can be stored evenly across experts.
Thanks. Part 3.2 is proudly rejected by ICML 2024 (though accepted as a tutorial) because we used "synthetic data + gpt2". Apparently many readers don't get it. If a behavior is discovered at N=100k for gpt2, that's enough to predict N=100m for llama and real-world models/data.
My favorite paper! Synthetic data to analyze "reasoning". Two impressive things are 1) this paper defined reasoning really well - everybody should take a look 2) well-controlled experiments - that is the reason why Allen-Zhu is saying this paper is "physics of LLM"
Result 6/7: If insufficiently trained, GPT2_rotary works 30% better than LLaMA/Mistral architectures in terms of storing knowledge. A closer look reveals that GatedMLP is the cause: it is less stable to train and thus not friendly for acquiring "rare knowledge" in pretrain data.
Big kudos to my manager Lin Xiao and FAIR leadership for the support, to Meta Infra (Lucca Bertoncini, Liao Hu, Caleb Ho, Apostolos Kokolis, Shubho Sengupta, Henry Estela, Wil Johnson, Rizwan Hashmi, and Lucas Noah) and W&B team (Ian Clark, Gourab De, Anmol Mann, Max Pfeifer)
Result 4/5: "all" LLMs can achieve such 2bit/param if sufficiently trained, even if all the MLP layers are removed. This is quite a universal law.
// What if models are insufficiently trained --- or equivalently, pretrain data consist of rare knowledge? See result 6/7 next.
Our LoRA library for fine-tuning is online: "pip install loralib" Separate from this repo, I independently wrote LoRA on GPT2+BERT for other tasks such as LM (wiki103), MLM (wiki103), QA (squad), and the results are more than amazing. LoRA is in some Microsoft products already.
In June we released LoRA which adapts NNs as big as GPT-3 with few parameters yet stays performant. Our new result beats finetuned RoBERTa on GLUE with 1/8 of total parameters! Try "pip install loralib" and add LoRA to your fav model in 10 lines of code!
🚨 Reverse Training to Nurse the Reversal Curse🚨
LLMs fail on “B is A” if only trained on "A is B".
- Reverse training doubles training tokens by reversing strings
- Outperforms data-matched standard baselines
- Fixes issues on reversal tasks
🧵(1/6)
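The idea above can be sketched in a few lines. This shows word-level reversal only; the actual paper studies several granularities (e.g., random segment reversal), so treat this as a simplified illustration:

```python
def reverse_words(text: str) -> str:
    # Emit a reversed copy so a fact stated as "A ... B" also
    # appears in "B ... A" order during pretraining.
    return " ".join(reversed(text.split()))

corpus = ["Mary Lee Pfeiffer is the mother of Tom Cruise"]
# Reverse training: keep the original AND its reversed copy
# (this is what doubles the training tokens).
training_data = corpus + [reverse_words(t) for t in corpus]
print(training_data[1])  # "Cruise Tom of mother the is Pfeiffer Lee Mary"
```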
(1/3) major updates to our Backward Feature Correction paper. Recall that our Theorem 1 was proven for deep neural nets with (theory-friendly) quadratic activations. We show that in practice, this performs close to ReLU networks, and better and much faster than neural kernel methods.
What's the solution to the curse of inverse knowledge search? Our Appendix E.1 mentions a solution that GPT4 employs, using Chain of Thought! Though it cheats a bit, when data is augmented like the Bible, inverse search is possible. Image from a talk I'm preparing for Friday.
Did anyone notice: if a paper title has a period (or perhaps a colon) in it, it will lose many citations. For instance
@QuanquanGu
's Rephrase paper cites Part 3.2 but it isn't on Google Scholar.
Should I use Part 3A, 3B, 3C instead? Who else cited our work?
An additional figure from the paper illustrating how easy it is to find such counterexamples on GPT4 (even today's version). More experiments can be found in the paper, both using GPT4 and using synthetic, controlled experiments.
Now, I have to burn 100,000 GPU hours to re-run our experiments on 50x larger data and llama/mistral/larger models. That's 3.4 tons of coal burned. One should not reject a paper because the model/data is small; one should reject a paper if the result is wrong or not innovative.
MSR Asia Theory Center (Beijing) is recruiting! A lovely place where I had a wonderful year of internship and published my first theory paper. Website: and application link: . They may also consider full-time for strong candidates.
I can't agree that "inverse knowledge search is also mostly hard for humans". For example, for commonly-used Chinese idioms/poems, most of my friends can say what the first character of an idiom is, or what the previous sentence of a poem is. GPT4 largely fails on this.🧐
Let me not say goodbye but instead Congratulations! (following the tradition). Hope that fate will bring us together again some day :) See you later! 💕
How can ResNet obtain notably lower test error than kernel methods on many tasks? Microsoft &
@Stanford
researchers proved that ResNet can perform hierarchical learning but kernel methods (e.g. NTK) cannot, leading to better generalization:
@ZeyuanAllenZhu
Enjoyed a dinner chatting around benchmarks, in which I gained a deeper appreciation for
@drfeifei
's works. Does anyone know Qin Shi Huang? The first emperor of China; he standardized weights, measures, currencies, etc. (统一度量衡). This underscores the importance of uniform standards.
How do GANs generate complicated real-world distributions such as images? A new theory from Microsoft and
@CarnegieMellon
researchers shows how GANs learn distributions efficiently with “forward super-resolution structure” by gradient descent ascent:
(3/3) major updates to our Backward Feature Correction paper. We visualize features to verify Theorem 2 in real life. If only training lower layers, features over-fit to high-complexity signals; if training deeper layers, they help "subtract" high-complexity signals from lower layers.
We're officially hiring! This year, you're strongly encouraged to put down your service and outreach plans (e.g., those related to mentoring or diversity)
(2/3) major updates to our Backward Feature Correction paper. We measure "how deep" and "how much" is backward feature correction needed for real-life clean and adversarial tasks. Answer to “how deep” is at least 8 layers, and answer to “how much” is ~0.9 Euclidean correlation.
Sharing this again. In addition to Adil Salim we have another superstar FTE member joining soon (will let my manager Seb announce it when it's the time). Please send your application materials to our application website, and we shall review them seriously.
I shouldn't say Common Crawl is "junk". Thanks to the Common Crawl CTO for correcting me. What we meant is: lots of knowledge from CC (e.g., the serial number of a random product) may not be useful. We synthetically generate data to mimic such knowledge, and we refer to that as junk.
AI residency is generally a great program to establish a year-long relationship with research mentors in the industry. It is wonderful to hear that Salesforce Research has opened it! Fantastic opportunities. PS: many of you asked me privately about MSR, a…
(2/5) Interactions with ChatGPT have yielded intriguing insights. We aim to bring clarity to this domain through controlled, synthetic experiments that reveal how LLMs learn to perform (or fail at) various AI tasks.
Missing this guy too. Brave enough to call out dumb papers, bet money on results. I lost $20 to him once since I was dumb. If he were still around, I'd probably go bankrupt.
Yuanzhi Li giving a talk in Columbia theory seminar today, referencing Michael Cohen calling his own 2017 paper's result dumb (and now agreeing with the assessment). ❤️
Made me smile a bit. Miss having that guy in seminars.
I'm really happy that the law of robustness got recognized as an important new insight with a NeurIPS outstanding paper award! The video below summarizes what the law is about, what it means, and what it predicts.
It's also a great capstone for
@geoishard
's fantastic phd work!
Many start to talk about "reread my request and try again". For knowledge questions, we made it clear why "try again" works. Knowledge is first loaded; in the repeated run, the model sees it and can manipulate the knowledge in context. Examples in the figures, such as "Tell me why."
Magic 3: self-distillation (most mind-blowing). Train a model once to get some accuracy, then train it again to match its own soft labels. Suddenly accuracy boosts. Why? Spoiler: self-distillation implicitly performs an ensemble of two models with knowledge distillation.
Huge thanks to my coauthors and especially the first author
@CathyYLi
who is just amazing at handling this huge project. Before this, I thought my coauthors were just being crazy... I was wrong, LLMs (+ human designs) can break some Crypto systems.
Our team in FAIR labs (at Meta) is hiring researchers (RE, RS & PostDoc)! DM if interested.
We work on the topics of Reasoning, Alignment and Memory/architectures (RAM).
Recent work:
Self-Rewarding LMs:
Pairwise Cringe Loss:
Magic 2: knowledge distillation. Train a single model to match soft labels (e.g. 0.9 cat + 0.1 dog) generated by an ensemble; test accuracy is significantly higher than when trained using true labels. Why? Spoiler: "dark knowledge" in the soft labels encourages learning new features!
Our Large LM (1.4B) finally finished training, and we have updated the paper with more exciting results!
TL;DR: mixing training data with random segment reversal not only resolves the reversal curse, but also improves performance on a variety of benchmarks vs. data-matched models!
Not to mention that, as of today, GPT-4 and LLaMA-3 still largely fail on the phenomena we have discovered. We are studying *universal laws*, not something that's only applicable to a particular version of llama or gpt.
Strongly recommend this NeurIPS competition from my colleague at MSR. Two top prizes $5000 each, and potential future collaborations (or internships?) Hurry up!
Magic 1: ensemble. Train the same model 10 times and average the outputs; test accuracy gets a significant boost. In contrast, train a bigger model that is an average of the 10, and test accuracy gets no boost. Spoiler: this is totally different from what the NTK (infinite-width) regime looks like.
That's my advice as well in TCS. If you are interviewing, the theorists already like you. The talk is not for them.
You should make sure everyone gets what you did & why they should care. Ideally explain central idea/technique simply enough so people feel they learned something
@HazanPrinceton
What's even better: Sanjeev, Elad, all the colleagues and "mentees" of Avi are also incredibly kind, modest, and knowledgeable on so many things. Princeton is an awesome place.
Note 4. It is advantageous to include corrupted sentences (e.g., ones with grammar mistakes) in the pretrain data --- this improves LLMs' robust performance on generation tasks (but one needs to use low temperature!).
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
A very welcome and important new scaling-law result for anyone working with LLMs.
It answers the question of how much knowledge a model of a given size can memorize;
the conclusion is…
@SebastienBubeck
@PreetumNakkiran
@BachFrancis
I can't agree more on the focus on math. In addition, I think sending the right message (i.e., "noise") to practitioners about what theory we are proving is important. I wish my "noise" could stand the test of time, but I'll be happier if it doesn't, when a better theory is discovered.
Putting research aside: in 2010, I spoke with
@drfeifei
during an open visit. At that time, U.S. grad schools struggled to evaluate undergrads from China. Fei-Fei's effort at Stanford significantly improved this process. I respect her both as a researcher and a moral leader.
@HDPbilly
If you mean "context (e.g., domain, data classification)" and not "context length", then I agree. Appending a special token/header in front of a piece of data is like increasing the training tokens by ~1%, not increasing training duration by much.
+ we also have Adil Salim joining us as an FTE soon! Microsoft Research has a bright future under the new leadership by
@JohannesGehrke
and the close supervision by
@kevin_scott
.
🙏🙏🙏 Please RT this message:
(1) After Russia's invasion of Ukraine started, more than a million Ukrainians have fled to other countries. Many Belarusians and Russians are leaving their countries too.
Truly heartbroken to see that nowadays we have to explicitly reiterate that calling for genocide (against any group) is violence and should be prohibited.
Credit where credit is due!
"Calls for Genocide against the Jewish community or any other group are [...] against our rules"
Thank you,
@Columbia
for clarifying this.
Hopefully, we will now see enforcement against rhetoric on campus that is widely seen as a call for genocide
Interested in grad studies or postdoc in Computer Science, Machine Learning, or Quantum Information and Computation?
Please consider Harvard! Join a vibrant & growing community. We may be 385 years old but don't look a day over 350 😀
@Teknium1
If a person attends a university chosen uniformly from 300 choices (regardless of writing), that's log2(300) bits. If the president is chosen uniformly from 100,000 choices, that's log2(100000) bits. We use synthetic data so knowledge bits can be calculated for any LLM trained on this data.
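The arithmetic here is just the entropy of a uniform choice; in Python:

```python
import math

def uniform_bits(n: int) -> float:
    # Bits needed to store one uniformly-random choice among n options.
    return math.log2(n)

print(round(uniform_bits(300), 2))      # university among 300 choices -> 8.23 bits
print(round(uniform_bits(100_000), 2))  # president among 100k choices -> 16.61 bits
```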
@prfsanjeevarora
One of the most enlightening papers I read in 2016. It took me several years to really catch up on this -- not only the usefulness of the sparse coding model, but also a starting journey towards interpretability.
@oinori1197
I agree with you that deep learning theory is still quite messy, but I hope my recent talk here will give you some new insights into what's going on:
@PartheP
But self-distillation is a bit different, since it distills from an ensemble over 2 models only, so doing it once more can encourage a third model to be added. However, the resulting performance boost is diminishing, and it's slightly worse than directly distilling from 3 models.
Note 3. Vanilla GPT2 with absolute positional embedding sucks --- absurdly, even poorer than GPTs with uniform attention weights (even on such grammar tasks!). Better off using rotary or relative attention.
@AlbalakAlon
In the same setup, let T be regular training and TD training with domain tokens added. (1) TD has slightly better overall loss than T; (2) TD is much better than T on good data (loss + bits); (3) TD and T are almost equally bad on the junk data. In other words, adding tokens "speeds up" the training.
@PabloRedux
@RickSearle1
@AIatMeta
@mbzuai
It's because the entire conversation (including 1st Q&A and 2nd Q) is fed into GPT4 as a big, single context. GPT is trained to read from such context to know better what the question is. Feel free to change to "in" and play with ChatGPT. Would love to hear what you find out :)
@Jayasim52317099
@AIatMeta
@mbzuai
Thanks for the pointer! I think it is about the necessity of CoT for "in-context" multi-step arithmetic? There, CoT is needed because the computation is "hard". We study single-step operations on factual knowledge, without writing down the knowledge explicitly.
@Teknium1
More generally, if there are N people out of N_0 names, each with K attributes of knowledge, each attribute consisting of C chunks, each chunk of length L drawn from a diversity set D, with vocab size T, then the total bits of this data is:
Note 1. Our studied CFGs are much harder than English grammar (learnable using 2-layer GPT of 100K params) or coding grammar (learnable via greedy). Our CFGs are ambiguous enough that require dynamic programming to parse or generate.
@ysu_nlp
I 90% agreed back then (I convinced Avi that his "complexity" problem is GD, and we wrote 2 papers). Now I'm 99% and 1% in agreement at the same time. It's 99% since training an LLM only needs GD... it's 1% because most of the "meat" comes from data preparation, not the GD part.
@jasondeanlee
@AIatMeta
@weights_biases
thanks for asking!! coughing badly right now and lost my voice :( probably need a few weeks to recover. will post it here once I record it.
@DimitrisPapail
Fun fact: I asked what if I relocated myself to the student's city, or turned myself remote. The organizer hasn't wanted to talk to me ever since I asked --- plus I almost received a punishment because I asked.