Zeyuan Allen-Zhu

@ZeyuanAllenZhu

8,319
Followers
275
Following
32
Media
213
Statuses

physics of language models @ Meta / FAIR · IOI · USACO · MCM · ACM/ICPC · Codejam · Tsinghua · MIT · Princeton/IAS · MSR · FAIR

Joined April 2010
Pinned Tweet
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
Our 12 scaling laws (for LLM knowledge capacity) are out: . Took me 4mos to submit 50,000 jobs; took Meta 1mo for legal review; FAIR sponsored 4,200,000 GPU hrs. Hope this is a new direction to study scaling laws + help practitioners make informed decisions
Tweet media one
27
337
1K
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 year
Our first paper in a series studying the inner mechanisms of transformers. TL;DR: we show *how* GPTs learn complex CFG trees via learning to do dynamic programming. Huge thanks to @MetaAI for making this research journey possible. FYI to @OpenAI @mbzuai
Tweet media one
Tweet media two
7
137
654
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 months
Part 3.2: Why do LLMs need Chain of Thoughts even for basic questions (e.g. was Biden born on an even day)? We show that LLMs cannot efficiently manipulate knowledge even if such knowledge is 100% extractable; + inverse knowledge search is just impossible.
Tweet media one
Tweet media two
Tweet media three
17
116
586
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
Result 10/11/12: surprisingly, when pre-training good data (e.g., Wiki) together with "junks" (e.g., Common Crawl), the LLM's capacity on the good data may decrease by 20x! A simple fix: add domain tokens to your data; LLMs can auto-detect domains rich in knowledge and prioritize.
Tweet media one
18
40
288
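The "domain token" fix described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the token format and the example domains are hypothetical.

```python
# Hypothetical sketch of the "domain token" fix: prepend a special token
# naming each document's source domain, so the model can learn which
# domains are knowledge-rich and prioritize them. Token syntax is
# illustrative only.
def add_domain_token(doc: str, domain: str) -> str:
    """Prefix a training document with a special domain-marker token."""
    return f"<|domain:{domain}|> {doc}"

corpus = [
    ("Paris is the capital of France.", "wiki"),
    ("SKU 8841-AA restocked, aisle 7.", "crawl"),
]
tagged = [add_domain_token(doc, dom) for doc, dom in corpus]
print(tagged[0])  # <|domain:wiki|> Paris is the capital of France.
```

Since the marker adds only one token per document, it increases the total training tokens by a negligible amount.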
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
3 years
Excited to announce our new work, a unified theory towards explaining 3 black magics in deep learning: (1) ensemble, (2) knowledge distillation, and (3) self-distillation. An accessible blog post is below.
@MSFTResearch
Microsoft Research
3 years
Microsoft and CMU researchers begin to unravel 3 mysteries in deep learning related to ensemble, knowledge distillation & self-distillation. Discover how their work leads to the first theoretical proof with empirical evidence for ensemble in deep learning:
8
146
520
1
42
269
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
6 months
There is no rainbow without a storm. We withdraw all parts of <Physics of LM> from ICLR 2024 submissions.
Tweet media one
11
7
216
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
4 years
Is deep learning actually performing DEEP learning? We may have given the first proof that neural networks are capable of efficient hierarchical learning, while existing theory only shows that deep learning can "simulate" non-hierarchical algorithms
@MSFTResearch
Microsoft Research
4 years
How does deep learning perform DEEP learning? Microsoft and CMU researchers establish a principle called "backward feature correction" and explain how very deep neural networks can actually perform DEEP hierarchical learning efficiently: @ZeyuanAllenZhu
4
82
272
2
22
179
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
6 months
Part 1 on YouTube. How to interpret inner workings of transformers? "Induction head" only explains shallow tasks like sequence copying. To make interpretation go deeper, we reverse engineer how GPTs learn CFGs --- via learning to do dynamic programming.
Tweet media one
1
29
166
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 months
Even if LLMs losslessly memorize the pretraining data, they may not be finetunable to extract that knowledge. Probing techniques suggest that data augmentation is necessary at the pretrain level, regardless of model size, training time, and finetuning choices.
Tweet media one
7
26
148
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
7 months
Expanded YouTube version of Parts 3.1+3.2: this also includes results I didn't cover in my recent offline talks/tweets, such as 1) mix training, 2) celebrity augmentation, 3) BERT models, 4) probing, 5) knowledge partial retrieval, 6) reversal curse.
Tweet media one
1
23
102
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
Result 1/2/3: LLMs can "consistently" achieve 2 bits per parameter in storing knowledge after sufficient training; this predicts that a 7B model is capable of storing the knowledge from all of English Wikipedia plus textbooks, based on our estimation.
Tweet media one
6
10
97
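The 2 bit/param claim above implies a simple back-of-envelope capacity estimate; the sketch below is illustrative arithmetic, not a figure from the paper.

```python
# Back-of-envelope check of the "2 bits of knowledge per parameter" law:
# a 7B-parameter model would hold roughly 14 billion bits of knowledge.
params = 7e9                # 7B-parameter model
capacity_bits = 2 * params  # ~2 bits of knowledge per parameter
print(f"{capacity_bits:.2e} bits = {capacity_bits / 8 / 1e9:.2f} GB")
# -> 1.40e+10 bits = 1.75 GB
```

About 1.75 GB of pure knowledge bits, which makes the "all of English Wikipedia plus textbooks" estimate plausible, since compressed factual content (as opposed to raw text) is far smaller than the raw corpus.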
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
Incredibly honored to have worked with Avi as his postdoc. Avi's vision certainly extends beyond the theory of computation. He asked me in 2016 whether I believe gradient descent can solve everything. He had probably envisioned AGI by that point. 👍
@TheOfficialACM
Association for Computing Machinery
1 month
🏆 We're thrilled to announce the recipient of the 2023 #ACMTuringAward : Avi Wigderson! Wigderson is recognized for his foundational contributions to the theory of computation. Join us in celebrating his incredible achievements! Learn more here: @the_IAS
12
255
807
2
2
76
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
Results 8/9: scaling laws for quantization and MoE. // Quantization to int8 does not hurt knowledge capacity even for models at max capacity => 2 bits of knowledge can be stored per int8 parameter. // MoEs with even 32 experts have great capacity => knowledge can be stored evenly across experts.
Tweet media one
5
8
72
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 days
Thanks. Part 3.2 is proudly rejected by ICML 2024 (though accepted as a tutorial) because we used "synthetic data + gpt2". Apparently many readers don't get it. If a behavior is discovered at N=100k for gpt2, that's enough to predict N=100m for llama and real-world models/data.
@chanwoopark20
Chanwoo Park
8 days
My favorite paper! Synthetic data to analyze "reasoning". Two impressive things are 1) this paper defined reasoning really well - everybody should take a look 2) well-controlled experiments - that is the reason why Allen-Zhu is saying this paper is "physics of LLM"
1
14
108
1
7
71
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
Result 6/7: If insufficiently trained, GPT2_rotary works 30% better than LLaMA/Mistral architectures in terms of storing knowledge. A closer look reveals that GatedMLP is the cause: it is less stable to train and thus not friendly for acquiring "rare knowledge" in pretrain data.
Tweet media one
2
4
70
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
Big kudos to my manager Lin Xiao and FAIR leadership for the support, to Meta Infra (Lucca Bertoncini, Liao Hu, Caleb Ho, Apostolos Kokolis, Shubho Sengupta, Henry Estela, Wil Johnson, Rizwan Hashmi, and Lucas Noah) and W&B team (Ian Clark, Gourab De, Anmol Mann, Max Pfeifer)
0
1
59
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
Result 4/5: "all" LLMs can achieve such 2bit/param if sufficiently trained, even if all the MLP layers are removed. This is quite a universal law. // What if models are insufficiently trained --- or equivalently, pretrain data consist of rare knowledge? See result 6/7 next.
Tweet media one
1
3
52
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
3 years
Our LoRA library for fine-tuning is online: "pip install loralib" Separate from this repo, I independently wrote LoRA on GPT2+BERT for other tasks such as LM (wiki103), MLM (wiki103), QA (squad), and the results are more than amazing. LoRA is in some Microsoft products already.
@edwardjhu
Edward Hu
3 years
In June we released LoRA which adapts NNs as big as GPT-3 with few parameters yet stays performant. Our new result beats finetuned RoBERTa on GLUE with 1/8 of total parameters! Try "pip install loralib" and add LoRA to your fav model in 10 lines of code!
Tweet media one
Tweet media two
2
13
88
0
5
44
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
3 years
The team I've supported for 20 yrs, sent a clear "double" racist photo and claims it does not mean to cause controversy or have racial undertones. May I know what else this photo could possibly mean? @andagn @Cristiano @ClaMarchisio8 @gianluigibuffon @chiellini @delpieroale
Tweet media one
0
1
44
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
2 months
Another example of how simple things work
@jaseweston
Jason Weston
2 months
🚨 Reverse Training to Nurse the Reversal Curse🚨 LLMs fail on “B is A” if only trained on "A is B". - Reverse training doubles training tokens by reversing strings - Outperforms data-matched standard baselines - Fixes issues on reversal tasks 🧵(1/6)
Tweet media one
1
24
172
2
2
42
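The reverse-training idea quoted above can be sketched in a few lines. This is a toy word-level version under my own simplifying assumptions; the paper's actual recipe (segment granularity, special markers, token-level details) differs.

```python
# Toy sketch of "reverse training": for each training sequence, also
# emit a word-reversed copy, doubling the training tokens so the model
# sees both "A is B" and its reversed form. Word-level reversal here is
# a simplification of the paper's random-segment reversal.
def reverse_words(text: str) -> str:
    """Return the sentence with its word order reversed."""
    return " ".join(reversed(text.split()))

data = ["Tom Cruise's mother is Mary Lee Pfeiffer"]
augmented = data + [reverse_words(s) for s in data]
print(augmented[1])  # Pfeiffer Lee Mary is mother Cruise's Tom
```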
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
4 months
Tweet media one
1
1
40
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
4 years
(1/3) major updates to our Backward Feature Correction paper. Recall our Theorem 1 was proven for deep neural nets with (theory-friendly) quadratic activations. We show in practice, this performs close to ReLU networks, better and much faster than neural kernel methods.
Tweet media one
0
2
38
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
7 months
What's the solution to the curse of inverse knowledge search? Our Appendix E.1 mentions a solution that GPT4 employs, using Chain of Thought! Though it cheats a bit, when data is augmented (like the Bible), inverse search becomes possible. Image from a talk I'm preparing for Friday.
Tweet media one
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 months
Part 3.2: Why do LLMs need Chain of Thoughts even for basic questions (e.g. was Biden born on an even day)? We show that LLMs cannot efficiently manipulate knowledge even if such knowledge is 100% extractable; + inverse knowledge search is just impossible.
Tweet media one
Tweet media two
Tweet media three
17
116
586
1
3
33
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
6 months
@tydsh 通者自通,庸者自庸 --- Those who understand will naturally understand; those who don't, won't.
3
0
34
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
3 months
Did anyone notice: if a paper title has a period (or perhaps a colon) in it, I will lose many citations. For instance @QuanquanGu 's Rephrase paper cites Part 3.2, but the citation isn't on Google Scholar. Should I use Part 3A, 3B, 3C instead? Who else cited our work?
6
2
32
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 months
An additional figure from the paper illustrating how easy it is to find such counter-examples on GPT4 (even today's version). More experiments can be found in the paper, both using GPT4 and using synthetic, controlled experiments.
Tweet media one
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 months
Part 3.2: Why do LLMs need Chain of Thoughts even for basic questions (e.g. was Biden born on an even day)? We show that LLMs cannot efficiently manipulate knowledge even if such knowledge is 100% extractable; + inverse knowledge search is just impossible.
Tweet media one
Tweet media two
Tweet media three
17
116
586
0
7
29
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 days
Now, I have to burn 100,000 GPU hours to re-run our experiment on 50x larger data and llama/mistral/larger models. That's 3.4 tons of coal burned. One should not reject a paper because the model/data is small; one should reject a paper if the result is wrong or not innovative.
Tweet media one
0
0
30
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
2 years
MSR Asia Theory Center (Beijing) is recruiting! A lovely place where I had a wonderful year of internship and published my first theory paper. Website: and application link: . They may also consider full-time for strong candidates.
1
3
25
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 months
I can't agree that "inverse knowledge search is also mostly hard for humans". For example, we have commonly-used Chinese idioms/poems for which most of my friends can say what the first character of an idiom is, or what the previous sentence of a poem is. GPT4 largely fails on this.🧐
Tweet media one
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 months
Part 3.2: Why do LLMs need Chain of Thoughts even for basic questions (e.g. was Biden born on an even day)? We show that LLMs cannot efficiently manipulate knowledge even if such knowledge is 100% extractable; + inverse knowledge search is just impossible.
Tweet media one
Tweet media two
Tweet media three
17
116
586
2
0
26
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
4 years
Our group's intern application site is open as well.
0
5
25
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
4 years
Let me not say goodbye but instead Congratulations! (following the tradition). Hope that fate will bring us together again some day :) See you later! 💕
0
1
24
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
5 years
Version 2 uploaded. Now, we have eliminated *all* kernel methods, in particular eliminated Convolution NTK with global average pooling. 🧐
@MSFTResearch
Microsoft Research
5 years
How can ResNet obtain notably lower test error than kernel methods on many tasks? Microsoft & @Stanford researchers proved that ResNet can perform hierarchical learning but kernel methods (e.g. NTK) cannot, leading to better generalization: @ZeyuanAllenZhu
0
19
64
0
1
22
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
3 months
Enjoyed a dinner chatting about benchmarks, in which I gained a deeper appreciation for @drfeifei 's works. Does anyone know Qin Shi Huang? The first emperor of China; he standardized weights, measures, currencies, etc. (统一度量衡). This underscores the importance of uniform standards.
2
3
21
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
3 years
Came back from a long vacation and this is the first news to announce.
@MSFTResearch
Microsoft Research
3 years
How do GANs generate complicated real-world distributions such as images? A new theory from Microsoft and @CarnegieMellon researchers shows how GANs learn distributions efficiently with “forward super-resolution structure” by gradient descent ascent:
6
56
231
0
1
21
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
4 years
(3/3) major updates to Backward Feature Correction paper. We visualize features to verify Theorem 2 in real life. If only training lower layers, features over-fit to high-complexity signals; if training deeper layers, they help “subtract” high-complexity signals from lower layers
Tweet media one
0
0
21
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
4 years
We're officially hiring! This year, you're strongly encouraged to put down your service and outreach plans (e.g., those related to mentoring or diversity)
@SebastienBubeck
Sebastien Bubeck
4 years
The **Machine Learning Foundations** group at @MSFTResearch Redmond is hiring at all levels (including postdoc)! Come join @ZeyuanAllenZhu   @suriyagnskr   @jerryzli @ilyaraz2 @talw and myself to develop the next generation of ML theory!
11
83
341
1
2
19
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
4 years
(2/3) major updates to our Backward Feature Correction paper. We measure "how deep" and "how much" is backward feature correction needed for real-life clean and adversarial tasks. Answer to “how deep” is at least 8 layers, and answer to “how much” is ~0.9 Euclidean correlation.
Tweet media one
0
3
19
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
5 years
So, what do researchers do on Twitter? #myfirstTweet
0
0
18
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
3 years
Sharing this again. In addition to Adil Salim, we have another superstar FTE member joining soon (I'll let my manager Seb announce it when it's time). Please send your application materials to our application website, and we shall review them seriously.
0
3
18
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
I shouldn't say common crawls are "junks". Thanks to Common Crawl's CTO for correcting me. What we meant is, lots of knowledge from CC (e.g. the serial number of a random product) may not be useful. We synthetically generate data to mimic such knowledge, and we refer to that as junk.
Tweet media one
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
Result 10/11/12: surprisingly, when pre-training good data (e.g., Wiki) together with "junks" (e.g., Common Crawl), the LLM's capacity on the good data may decrease by 20x! A simple fix: add domain tokens to your data; LLMs can auto-detect domains rich in knowledge and prioritize.
Tweet media one
18
40
288
0
0
17
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
2 years
AI residency is generally a great program to establish a year-long relationship with research mentors in the industry. It is wonderful to hear that Salesforce Research has opened it! Fantastic opportunities. PS: many of you asked me privately about MSR, a…
0
0
15
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
6 months
(2/5) Interactions with ChatGPT have yielded intriguing insights. We aim to bring clarity to this domain through controlled, synthetic experiments that reveal how LLMs learn to perform (or fail at) various AI tasks.
0
1
16
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
4 years
Missing this guy too. Brave enough to call out dumb papers, bet money on results. I lost $20 to him once since I was dumb. If he were still around, I'd probably go bankrupt.
@kiragoldner
Kira Goldner
4 years
Yuanzhi Li giving a talk in Columbia theory seminar today, referencing Michael Cohen calling his own 2017 paper's result dumb (and now agreeing with the assessment). ❤️ Made me smile a bit. Miss having that guy in seminars.
Tweet media one
0
1
35
2
0
15
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
2 years
Congrats my boss!
@SebastienBubeck
Sebastien Bubeck
2 years
I'm really happy that the law of robustness got recognized as an important new insight with a NeurIPS outstanding paper award! The video below summarizes what the law is about, what it means, and what it predicts. It's also a great capstone for @geoishard 's fantastic phd work!
25
41
363
0
1
15
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
6 months
Many have started to talk about "reread my request and try again". For knowledge questions, we made it clear why "try again" works. Knowledge is first loaded; in the repeated run, the model sees it and can manipulate the knowledge in context. Examples are in the figures, such as "Tell me why."
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 months
Part 3.2: Why do LLMs need Chain of Thoughts even for basic questions (e.g. was Biden born on an even day)? We show that LLMs cannot efficiently manipulate knowledge even if such knowledge is 100% extractable; + inverse knowledge search is just impossible.
Tweet media one
Tweet media two
Tweet media three
17
116
586
0
1
15
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 months
@MetaAI @mbzuai @OwariDa To get back to your question. Let me also tag the authors of the awesome work "The Reversal Curse" @OwainEvans_UK @lukasberglund2 @max_a_kufmann @balesni @AsaCoopStick @tomekkorbak to let them know our parallel work.
0
1
15
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
3 years
Magic 3: self-distillation (most mind-blowing). Train a model once to get some accuracy, then train it again to match its own soft labels. Suddenly, accuracy improves. Why? Spoiler: self-distillation implicitly performs an ensemble of two models with knowledge distillation.
2
2
13
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
6 months
Huge thanks to my coauthors and especially the first author @CathyYLi who is just amazing at handling this huge project. Before this, I thought my coauthors were just being crazy... I was wrong, LLMs (+ human designs) can break some Crypto systems.
@KristinLauter
Kristin Lauter
6 months
Sharing open source code 4 our AI4Crypto project! @acm_ccs security conf paper SALSA Picante attacks sparse binary secrets LWE post-quantum crypto systems. SALSA Verde @NeurIPSConf 2023 attacks dimension 512 @AIatMeta @em_wenger @f_charton @ZeyuanAllenZhu
1
4
25
2
5
14
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
5 years
+ get FREE job title upgrade. From this year on, our intro-63-level titles become "Senior Researcher"
2
2
12
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
3 months
Amazing team to work with!
@jaseweston
Jason Weston
3 months
Our team in FAIR labs (at Meta) is hiring researchers (RE, RS & PostDoc)! DM if interested. We work on the topics of Reasoning, Alignment and Memory/architectures (RAM). Recent work: Self-Rewarding LMs: Pairwise Cringe Loss:
0
80
571
0
0
12
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
15 days
@peter_richtarik Same to me. Very ridiculous AC summary as well, basically only citing the bad reviewer.
0
0
10
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
3 years
Magic 2: knowledge distillation. Train a single model to match soft labels (e.g. 0.9 cat + 0.1 dog) generated by an ensemble; test accuracy is significantly higher than that of a model trained using true labels. Why? Spoiler: "dark knowledge" in soft labels encourages learning new features!
2
2
9
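The distillation setup above amounts to replacing the one-hot target with the ensemble's soft label in an ordinary cross-entropy loss. Below is a minimal stdlib sketch under my own illustrative numbers (the logits and labels are made up, and real distillation typically also uses a temperature):

```python
import math

# Toy illustration of knowledge distillation: the student minimizes
# cross-entropy against the ensemble's soft labels (0.9 cat + 0.1 dog)
# instead of the one-hot true label. All numbers are illustrative.
def cross_entropy(target, logits):
    """Cross-entropy of a (possibly soft) target against student logits."""
    log_z = math.log(sum(math.exp(x) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target, logits))

logits = [2.0, 0.5, -1.0]   # student logits for (cat, dog, car)
hard = [1.0, 0.0, 0.0]      # true one-hot label
soft = [0.9, 0.1, 0.0]      # ensemble's soft label ("dark knowledge")
print(cross_entropy(hard, logits), cross_entropy(soft, logits))
```

The soft-label loss penalizes the student for ignoring the secondary class, which is how the ensemble's extra feature information gets transferred.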
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
15 days
Reverse Training to Nurse the Reversal Curse // added experiments!
@OlgaNLP
Olga Golovneva
15 days
Our Large LM (1.4B) finally finished training, and we have updated the paper with more exciting results! TL;DR: mixing training data with random segment reversal not only resolves the reversal curse, but improves performance on a variety of benchmarks wrt data-matched models!
0
2
11
0
0
10
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 days
Not to mention, as of today, GPT-4 and LLaMA-3 still largely fail on the phenomena we have discovered. We are studying *universal laws*, not something that's only applicable to a particular version of llama or gpt.
Tweet media one
1
0
10
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 months
0
0
10
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
3 years
Magic 1: ensemble. Train the same model 10 times and average the outputs; test accuracy gets a significant boost. In contrast, train a bigger model that is an average of 10, and test accuracy gets no boost. Spoiler: this is totally different from what the NTK (infinite-width) regime looks like.
2
2
8
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
4 years
the theorists already like you :)
@boazbaraktcs
Boaz Barak
4 years
That's my advice as well in TCS. If you are interviewing, the theorists already like you. The talk is not for them. You should make sure everyone gets what you did & why they should care. Ideally explain central idea/technique simply enough so people feel they learned something
1
5
57
0
0
9
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
@HazanPrinceton What's even better: Sanjeev, Elad, all the colleagues and "mentees" of Avi are also incredibly kind, modest, and knowledgeable on so many things. Princeton is an awesome place.
1
0
9
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
6 months
Note 4. It is advantageous to include corrupted sentences (e.g. grammar mistakes) in the pretrain data --- this improves the LLM's robust performance on generation tasks (but you need to use low temperature!).
1
0
9
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
A cute drawing. Eugeo Zuberg can also be found in Part 3.2. I was thinking of using Alice, Eugeo, Zuberg, but Alice is too close to Anya. To be honest, before settling on "Physics of LM", I was considering the series title "Alicization".
@webbigdata
webbigdata
1 month
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws. A new scaling law that is very welcome and important for anyone working with LLMs. It answers the question of how much knowledge a model of a given size can memorize; the conclusion is
Tweet media one
1
14
54
1
2
8
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
4 years
@SebastienBubeck @PreetumNakkiran @BachFrancis I can't agree more on the focus on math. In addition, I think sending the right message (i.e. "noise") to practitioners about what theory we are proving is important. I hope my "noise" can stand the test of time, but I'll be happier if it doesn't, when a better theory is discovered
0
0
8
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
3 months
Putting research aside: in 2010, I spoke with @drfeifei during an open visit. At that time, U.S. grad schools struggled to evaluate undergrads from China. Fei-Fei's effort at Stanford significantly improved this process. I respect her both as a researcher and as a moral leader.
0
0
4
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
@HDPbilly If you mean "context (e.g., domain, data classification)" not "context length" then I agree. Appending a special token/header in front of a piece of data is like increasing ~1% of the training tokens, not increasing training duration by much.
1
0
7
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
3 years
+ we also have Adil Salim joining us as an FTE soon! Microsoft Research has a bright future under the new leadership by @JohannesGehrke and the close supervision by @kevin_scott .
@SebastienBubeck
Sebastien Bubeck
3 years
The Machine Learning Foundations team at @MSFTResearch Redmond is looking for a postdoc. Come join us ( @ZeyuanAllenZhu @suriyagnskr @jerryzli @talw16 and Yi Zhang) to work on topics ranging from quantum learning to understanding transformer architectures!
2
31
144
0
0
7
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
2 years
please spread this message out!
@julia_kiseleva
Julia Kiseleva
2 years
🙏🙏🙏 Please RT this message: (1) After Russia's invasion of Ukraine started, more than a million Ukrainians have fled to other countries. Many Belarusians and Russians are leaving their countries too.
1
24
35
0
0
5
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
5 months
Truly heartbroken to see that nowadays we have to explicitly reiterate that calling for genocide (against any group) is violence and should be prohibited.
@ShaiDavidai
Shai Davidai
5 months
Credit where credit is due! "Calls for Genocide against the Jewish community or any other group are [...] against our rules" Thank you, @Columbia for clarifying this. Hopefully, we will now see enforcement against rhetoric on campus that is widely seen as a call for genocide
Tweet media one
107
148
1K
0
1
7
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
2 years
One of my dream positions! Hurry up and apply!
@boazbaraktcs
Boaz Barak
3 years
Interested in grad studies or postdoc in Computer Science, Machine Learning, or Quantum Information and Computation? Please consider Harvard! Join a vibrant & growing community. We may be 385 years old but don't look a day over 350 😀
8
42
237
0
1
6
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
@Teknium1 If a person studies at a university uniformly from 300 choices (regardless of writing) that's log2(300) bits. If president is uniformly from 100000 choices that's log2(100000) bits. We use synthetic data so knowledge bits can be calculated for all LLMs learned on this data
1
0
6
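The bit-counting above can be written out directly: a fact drawn uniformly from N choices carries log2(N) bits. A worked version (the 300 and 100,000 are the tweet's own example numbers):

```python
import math

# Knowledge measured in bits: a uniformly random choice among N options
# carries log2(N) bits, independent of how the fact is worded.
university_bits = math.log2(300)     # university drawn from 300 choices
president_bits = math.log2(100_000)  # president from 100,000 choices
print(round(university_bits, 2), round(president_bits, 2))
```

With synthetic data, every such choice is made uniformly by construction, so the total knowledge bits of the corpus can be computed exactly and compared against model capacity.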
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
2 years
+ don't forget this is a place you can get amazing Tsinghua / PKU interns that can collaborate with you all year long :)
0
0
5
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
7 months
@prfsanjeevarora One of the most enlightening papers I read in 2016. It took me several years to really catch up on this -- not only the usefulness of the sparse coding model, but also a starting journey towards interpretability.
0
0
6
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
7 months
Manuel Blum was not my advisor, he was my advisor's advisor and my advisor's advisor's advisor's advisor.
@Aaroth
Aaron Roth
7 months
Manuel Blum was not my advisor, he was my advisor's father.
2
1
45
1
0
6
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
4 years
@jasondeanlee Thanks a lot for sharing this ❤️❤️ your works too!
0
0
5
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
3 years
@PartheP But self-distillation is a bit different, since it distills from an ensemble over 2 models only, so doing it once more can encourage a third model to be added. However, the resulting performance boost is diminishing, and it's slightly worse than directly distilling from 3 models.
0
0
4
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
6 months
Note 3. Vanilla GPT2 with absolute positional embedding performs poorly, even absurdly worse than GPTs with uniform attention weights (even on such grammar tasks!). Better off using rotary or relative attention.
1
0
4
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
6 months
Note 2. Reverse engineering shows BERT-based models using masked-language modeling cannot learn deep, hierarchical grammar logic --- evidence for why GPT-based models are better than e.g. DeBERTa.
1
0
4
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
@AlbalakAlon In the same setup, let T be regular training and TD be training with domain tokens added. (1) TD has slightly better overall loss than T, (2) TD is much better than T on good data (loss + bits), (3) TD and T are almost equally bad on the junk data. In other words, adding tokens "speeds up" the training
1
1
4
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
3 years
It's re-opened for a (possibly) short period of time. Hurry up folks!
@NisheethVishnoi
Nisheeth Vishnoi
3 years
Just a reminder that the paper registration deadline for FOCS is Monday, May 31 at 5 pm New York time!
1
1
5
0
1
4
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 months
@PabloRedux @RickSearle1 @AIatMeta @mbzuai It's because the entire conversation (including 1st Q&A and 2nd Q) is fed into GPT4 as a big, single context. GPT is trained to read from such context to know better what the question is. Feel free to change to "in" and play with ChatGPT. Would love to hear what you find out :)
1
0
4
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 months
@Jayasim52317099 @AIatMeta @mbzuai Thanks for the pointer! I think that is about the necessity of CoT for "in-context" multi-step arithmetic? CoT is needed there because the computation is "hard". We study single-step operations on factual knowledge, without writing down the knowledge explicitly.
0
0
4
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
@Teknium1 More generally, if there are N people out of N_0 names, each with K attributes of knowledge, each attribute consisting of C chunks, each chunk of length L drawn from a diversity set D, with vocab size T, then the total bits of this data is:
Tweet media one
0
0
4
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
8 months
@michahu8 @MetaAI @mbzuai Part 3.2 will be available tomorrow (delayed by arxiv), thanks!
1
0
3
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
6 months
Note 1. Our studied CFGs are much harder than English grammar (learnable using a 2-layer GPT of 100K params) or coding grammar (learnable via greedy parsing). Our CFGs are ambiguous enough that they require dynamic programming to parse or generate.
3
0
3
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
@ysu_nlp I 90% agreed back then (I convinced Avi his "complexity" problem is GD, and we wrote 2 papers). Now I agree both 99% and 1%. It's 99% because training an LLM only needs GD... it's 1% because most of the "meat" comes from data preparation, not the GD part.
1
0
3
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
1 month
@jasondeanlee @AIatMeta @weights_biases thanks for asking!! coughing badly right now and lost my voice :( probably need a few weeks to recover. will post it here once I record it.
1
0
3
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
4 years
@ilyaraz2 @HaoChenMSR Isn't that obvious? 🤣
0
0
3
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
4 years
@percyliang Last year, I discovered mine 10.5 yrs ago :) welcome!
0
0
3
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
4 years
@KSA_ACS I created a whatsapp group for the May 8 passengers to share info together: @umairsiddiki @salukiprincess @docmraaa
1
0
3
@ZeyuanAllenZhu
Zeyuan Allen-Zhu
2 months
@DimitrisPapail Fun fact: I asked what if I relocated myself to the student's city, or turned myself remote. The organizer hasn't wanted to talk to me ever since I asked --- plus I almost received a punishment because I asked.
1
0
3