Our 12 scaling laws (for LLM knowledge capacity) are out: . Took me 4mos to submit 50,000 jobs; took Meta 1mo for legal review; FAIR sponsored 4,200,000 GPU hrs. Hope this is a new direction to study scaling laws + help practitioners make informed decisions
Our first paper in a series studying the inner mechanisms of transformers. TL;DR: we show *how* GPTs learn complex CFG trees via learning to do dynamic programming. Huge thanks to
@MetaAI
for making this research journey possible. FYI to
@OpenAI
@mbzuai
Part 3.2: Why do LLMs need Chain of Thought even for basic questions (e.g., was Biden born on an even day)? We show that LLMs cannot efficiently manipulate knowledge even if such knowledge is 100% extractable; + inverse knowledge search is just impossible.
Result 10/11/12: surprisingly, when pre-training good data (e.g., Wiki) together with "junk" (e.g., Common Crawl), LLMs' capacity on the good data may decrease by 20x! A simple fix: add domain tokens to your data; LLMs can auto-detect domains rich in knowledge and prioritize them.
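A minimal sketch of the domain-token fix. The marker format here is my own illustration, not the paper's exact scheme; the point is only that each piece of data is prefixed with its source:

```python
def add_domain_token(text: str, domain: str) -> str:
    # Prepend a special domain marker (e.g. "<wiki>" or "<cc>") so the
    # model can learn which domains are knowledge-rich and prioritize them.
    return f"<{domain}> {text}"

# A pretraining corpus mixing good data with junk, each tagged by source:
corpus = [
    add_domain_token("Paris is the capital of France.", "wiki"),
    add_domain_token("SKU 8839-AA, qty 3, in stock.", "cc"),
]
print(corpus[0])  # "<wiki> Paris is the capital of France."
```

Since each marker is only one or two extra tokens per document, it adds roughly ~1% to the training tokens.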
Excited to announce our new work, a unified theory towards explaining 3 black magics in deep learning: (1) ensemble, (2) knowledge distillation, and (3) self-distillation. An accessible blog post is below.
Microsoft and CMU researchers begin to unravel 3 mysteries in deep learning related to ensemble, knowledge distillation & self-distillation. Discover how their work leads to the first theoretical proof with empirical evidence for ensemble in deep learning:
Is deep learning actually performing DEEP learning? We may have given the first proof that neural networks are capable of efficient hierarchical learning, while existing theory only shows that deep learning can "simulate" non-hierarchical algorithms.
How does deep learning perform DEEP learning? Microsoft and CMU researchers establish a principle called "backward feature correction" and explain how very deep neural networks can actually perform DEEP hierarchical learning efficiently:
@ZeyuanAllenZhu
Part 1 on YouTube. How to interpret inner workings of transformers? "Induction head" only explains shallow tasks like sequence copying. To make interpretation go deeper, we reverse engineer how GPTs learn CFGs --- via learning to do dynamic programming.
Even if LLMs losslessly memorize the pretraining data, they may not be finetunable to extract knowledge from it. Probing techniques suggest that data augmentation is necessary at the pretrain level, regardless of model size, training time, and finetuning choices.
Expanded YouTube version of Parts 3.1+3.2: this also includes results I didn't cover in my recent offline talks/tweets, such as 1) mix training, 2) celebrity augmentation, 3) BERT models, 4) probing, 5) knowledge partial retrieval, 6) reversal curse.
Result 1/2/3: LLMs can "consistently" achieve 2 bits per parameter in storing knowledge after sufficient training; this predicts that a 7B model is capable of storing the knowledge from all of English Wiki + textbooks, based on our estimation.
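A back-of-envelope check of the 2 bit/param claim; the numbers are illustrative, not from the paper:

```python
def knowledge_capacity_bits(n_params: int, bits_per_param: float = 2.0) -> float:
    # Capacity under the 2 bit/parameter scaling law (Results 1/2/3).
    return n_params * bits_per_param

bits = knowledge_capacity_bits(7_000_000_000)  # a 7B model
print(bits)               # 14 billion bits of knowledge
print(bits / 8 / 2**30)   # ~1.6 GiB, on the order of Wiki's factual content
```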
Incredibly honored to have worked with Avi as his postdoc. Avi's vision is certainly beyond the theory of computation. He asked me in 2016 whether I believe gradient descent can solve everything. He has probably envisioned AGI at that point. 👍
🏆 We're thrilled to announce the recipient of the 2023
#ACMTuringAward
: Avi Wigderson! Wigderson is recognized for his foundational contributions to the theory of computation. Join us in celebrating his incredible achievements! Learn more here:
@the_IAS
Results 8/9: scaling laws for quantization and MoE.
// Quantization to int8 does not hurt knowledge capacity, even for models at max capacity => 2 bits of knowledge can be stored per int8 parameter.
// MoEs with even 32 experts have great capacity => knowledge can be stored evenly across experts.
Thanks. Part 3.2 is proudly rejected by ICML 2024 (though accepted as a tutorial) because we used "synthetic data + gpt2". Apparently many readers don't get it. If a behavior is discovered at N=100k for gpt2, that's enough to predict N=100m for llama and real-world models/data.
My favorite paper! Synthetic data to analyze "reasoning". Two impressive things are 1) this paper defined reasoning really well - everybody should take a look 2) well-controlled experiments - that is the reason why Allen-Zhu is saying this paper is "physics of LLM"
Result 6/7: If insufficiently trained, GPT2_rotary works 30% better than LLaMA/Mistral architectures in terms of storing knowledge. A closer look reveals that GatedMLP is the cause: it is less stable to train and thus not friendly for acquiring "rare knowledge" in pretrain data.
Big kudos to my manager Lin Xiao and FAIR leadership for the support, to Meta Infra (Lucca Bertoncini, Liao Hu, Caleb Ho, Apostolos Kokolis, Shubho Sengupta, Henry Estela, Wil Johnson, Rizwan Hashmi, and Lucas Noah) and W&B team (Ian Clark, Gourab De, Anmol Mann, Max Pfeifer)
Result 4/5: "all" LLMs can achieve such 2bit/param if sufficiently trained, even if all the MLP layers are removed. This is quite a universal law.
// What if models are insufficiently trained --- or equivalently, pretrain data consist of rare knowledge? See result 6/7 next.
Our LoRA library for fine-tuning is online: "pip install loralib" Separate from this repo, I independently wrote LoRA on GPT2+BERT for other tasks such as LM (wiki103), MLM (wiki103), QA (squad), and the results are more than amazing. LoRA is in some Microsoft products already.
In June we released LoRA which adapts NNs as big as GPT-3 with few parameters yet stays performant. Our new result beats finetuned RoBERTa on GLUE with 1/8 of total parameters! Try "pip install loralib" and add LoRA to your fav model in 10 lines of code!
🚨 Reverse Training to Nurse the Reversal Curse🚨
LLMs fail on “B is A” if only trained on "A is B".
- Reverse training doubles training tokens by reversing strings
- Outperforms data-matched standard baselines
- Fixes issues on reversal tasks
🧵(1/6)
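The idea above can be sketched in a few lines. This shows word-level reversal only; the actual paper studies several granularities (e.g., random segment reversal), so treat this as a simplified illustration:

```python
def reverse_words(text: str) -> str:
    # Emit a reversed copy so a fact stated as "A ... B" also
    # appears in "B ... A" order during pretraining.
    return " ".join(reversed(text.split()))

corpus = ["Mary Lee Pfeiffer is the mother of Tom Cruise"]
# Reverse training: keep the original AND its reversed copy
# (this is what doubles the training tokens).
training_data = corpus + [reverse_words(t) for t in corpus]
print(training_data[1])  # "Cruise Tom of mother the is Pfeiffer Lee Mary"
```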
(1/3) major updates to our Backward Feature Correction paper. Recall that our Theorem 1 was proven for deep neural nets with (theory-friendly) quadratic activations. We show that in practice, this performs close to ReLU networks, and better and much faster than neural kernel methods.
What's the solution to the curse of inverse knowledge search? Our Appendix E.1 mentions a solution that GPT4 employs, using Chain of Thought! Though it cheats a bit, when data is augmented like the Bible, inverse search is possible. Image from a talk I'm preparing for Friday.
Did anyone notice: if a paper title has a period (or perhaps a colon) in it, it will lose many citations. For instance
@QuanquanGu
's Rephrase paper cites Part 3.2 but it isn't on Google Scholar.
Should I use Part 3A, 3B, 3C instead? Who else cited our work?
An additional figure from the paper illustrating how easy it is to find such counterexamples on GPT4 (even today's version). More experiments can be found in the paper, both using GPT4 and using synthetic, controlled experiments.
Now, I have to burn 100,000 GPU hours to re-run our experiments on 50x larger data and llama/mistral/larger models. That's 3.4 tons of coal burned. One should not reject a paper because the model/data is small; one should reject a paper if the result is wrong or not innovative.
MSR Asia Theory Center (Beijing) is recruiting! A lovely place where I had a wonderful year of internship and published my first theory paper. Website: and application link: . They may also consider full-time for strong candidates.
I can't agree that "inverse knowledge search is also mostly hard for humans". For example, for commonly-used Chinese idioms/poems, most of my friends can say what the first character of an idiom is, or what the previous sentence of a poem is. GPT4 largely fails on this.🧐
Let me not say goodbye but instead Congratulations! (following the tradition). Hope that fate will bring us together again some day :) See you later! 💕
How can ResNet obtain notably lower test error than kernel methods on many tasks? Microsoft &
@Stanford
researchers proved that ResNet can perform hierarchical learning but kernel methods (e.g. NTK) cannot, leading to better generalization:
@ZeyuanAllenZhu
Enjoyed a dinner chatting around benchmarks, in which I gained a deeper appreciation for
@drfeifei
's works. Does anyone know Qin Shi Huang? The first emperor of China; he standardized weights, measures, currencies, etc. (统一度量衡). This underscores the importance of uniform standards.
How do GANs generate complicated real-world distributions such as images? A new theory from Microsoft and
@CarnegieMellon
researchers shows how GANs learn distributions efficiently with “forward super-resolution structure” by gradient descent ascent:
(3/3) major updates to our Backward Feature Correction paper. We visualize features to verify Theorem 2 in real life. If only training lower layers, features over-fit to high-complexity signals; if training deeper layers, they help "subtract" high-complexity signals from lower layers.
We're officially hiring! This year, you're strongly encouraged to put down your service and outreach plans (e.g., those related to mentoring or diversity)
(2/3) major updates to our Backward Feature Correction paper. We measure "how deep" and "how much" is backward feature correction needed for real-life clean and adversarial tasks. Answer to “how deep” is at least 8 layers, and answer to “how much” is ~0.9 Euclidean correlation.
Sharing this again. In addition to Adil Salim we have another superstar FTE member joining soon (will let my manager Seb announce it when it's the time). Please send your application materials to our application website, and we shall review them seriously.
I shouldn't say Common Crawl is "junk". Thanks to the Common Crawl CTO for correcting me. What we meant is: lots of knowledge from CC (e.g., the serial number of a random product) may not be useful. We synthetically generate data to mimic such knowledge, and we refer to that as junk.
AI residency is generally a great program to establish a year-long relationship with research mentors in the industry. It is wonderful to hear that Salesforce Research has opened it! Fantastic opportunities. PS: many of you asked me privately about MSR, a…
(2/5) Interactions with ChatGPT have yielded intriguing insights. We aim to bring clarity to this domain through controlled, synthetic experiments that reveal how LLMs learn to perform (or fail at) various AI tasks.
Missing this guy too. Brave enough to call out dumb papers, bet money on results. I lost $20 to him once since I was dumb. If he were still around, I'd probably go bankrupt.
Yuanzhi Li giving a talk in Columbia theory seminar today, referencing Michael Cohen calling his own 2017 paper's result dumb (and now agreeing with the assessment). ❤️
Made me smile a bit. Miss having that guy in seminars.
I'm really happy that the law of robustness got recognized as an important new insight with a NeurIPS outstanding paper award! The video below summarizes what the law is about, what it means, and what it predicts.
It's also a great capstone for
@geoishard
's fantastic phd work!
Many start to talk about "reread my request and try again". For knowledge questions, we made it clear why "try again" works. Knowledge is first loaded; in the repeated run, the model sees it and can manipulate the knowledge in context. Examples in the figures, such as "Tell me why."
Magic 3: self-distillation (most mind-blowing). Train a model once to get some accuracy, then train it again to match its own soft labels. Suddenly accuracy boosts. Why? Spoiler: self-distillation implicitly performs an ensemble of two models with knowledge distillation.
Huge thanks to my coauthors and especially the first author
@CathyYLi
who is just amazing at handling this huge project. Before this, I thought my coauthors were just being crazy... I was wrong, LLMs (+ human designs) can break some Crypto systems.
Our team in FAIR labs (at Meta) is hiring researchers (RE, RS & PostDoc)! DM if interested.
We work on the topics of Reasoning, Alignment and Memory/architectures (RAM).
Recent work:
Self-Rewarding LMs:
Pairwise Cringe Loss:
Magic 2: knowledge distillation. Train a single model to match soft labels (e.g. 0.9 cat + 0.1 dog) generated by an ensemble; test accuracy is significantly higher than when trained using true labels. Why? Spoiler: "dark knowledge" in the soft labels encourages learning new features!
Our Large LM (1.4B) finally finished training, and we have updated the paper with more exciting results!
TL;DR: mixing training data with random segment reversal not only resolves the reversal curse, but also improves performance on a variety of benchmarks vs. data-matched models!
Not to mention that, as of today, GPT-4 and LLaMA-3 still largely fail on the phenomena we have discovered. We are studying *universal laws*, not something that's only applicable to a particular version of llama or gpt.
Strongly recommend this NeurIPS competition from my colleague at MSR. Two top prizes $5000 each, and potential future collaborations (or internships?) Hurry up!
Magic 1: ensemble. Train the same model 10 times and average the outputs; test accuracy gets a significant boost. In contrast, train a bigger model that is an average of the 10, and test accuracy gets no boost. Spoiler: this is totally different from what the NTK (infinite-width) regime looks like.
That's my advice as well in TCS. If you are interviewing, the theorists already like you. The talk is not for them.
You should make sure everyone gets what you did & why they should care. Ideally explain central idea/technique simply enough so people feel they learned something
@HazanPrinceton
What's even better: Sanjeev, Elad, all the colleagues and "mentees" of Avi are also incredibly kind, modest, and knowledgeable on so many things. Princeton is an awesome place.
Note 4. It is advantageous to include corrupted sentences (e.g., ones with grammar mistakes) in the pretrain data --- this improves LLMs' robust performance on generation tasks (but one needs to use low temperature!).
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
A very welcome and important new scaling-law result for anyone working with LLMs.
It answers the question of how much knowledge a model of a given size can memorize;
the conclusion is…
@SebastienBubeck
@PreetumNakkiran
@BachFrancis
I can't agree more on the focus on math. In addition, I think sending the right message (i.e., "noise") to practitioners about what theory we are proving is important. I wish my "noise" could stand the test of time, but I'll be happier if it doesn't, when a better theory is discovered.
Putting research aside: in 2010, I spoke with
@drfeifei
during an open visit. At that time, U.S. grad schools struggled to evaluate undergrads from China. Fei-Fei's effort at Stanford significantly improved this process. I respect her both as a researcher and a moral leader.
@HDPbilly
If you mean "context (e.g., domain, data classification)" and not "context length", then I agree. Appending a special token/header in front of a piece of data is like increasing the training tokens by ~1%, not increasing training duration by much.
+ we also have Adil Salim joining us as an FTE soon! Microsoft Research has a bright future under the new leadership by
@JohannesGehrke
and the close supervision by
@kevin_scott
.
🙏🙏🙏 Please RT this message:
(1) After Russia's invasion of Ukraine started, more than a million Ukrainians have fled to other countries. Many Belarusians and Russians are leaving their countries too.
Truly heartbroken to see that nowadays we have to explicitly reiterate that calling for genocide (against any group) is violence and should be prohibited.
Credit where credit is due!
"Calls for Genocide against the Jewish community or any other group are [...] against our rules"
Thank you,
@Columbia
for clarifying this.
Hopefully, we will now see enforcement against rhetoric on campus that is widely seen as a call for genocide
Interested in grad studies or postdoc in Computer Science, Machine Learning, or Quantum Information and Computation?
Please consider Harvard! Join a vibrant & growing community. We may be 385 years old but don't look a day over 350 😀
@Teknium1
If a person attends a university chosen uniformly from 300 choices (regardless of writing), that's log2(300) bits. If the president is chosen uniformly from 100,000 choices, that's log2(100000) bits. We use synthetic data so knowledge bits can be calculated for any LLM trained on this data.
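The arithmetic here is just the entropy of a uniform choice; in Python:

```python
import math

def uniform_bits(n: int) -> float:
    # Bits needed to store one uniformly-random choice among n options.
    return math.log2(n)

print(round(uniform_bits(300), 2))      # university among 300 choices -> 8.23 bits
print(round(uniform_bits(100_000), 2))  # president among 100k choices -> 16.61 bits
```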
@prfsanjeevarora
One of the most enlightening papers I read in 2016. It took me several years to really catch up on this -- not only the usefulness of the sparse coding model, but also a starting journey towards interpretability.
@oinori1197
I agree with you that deep learning theory is still quite messy, but I hope my recent talk here will give you some new insights into what's going on:
@PartheP
But self-distillation is a bit different, since it distills from an ensemble over 2 models only, so doing it once more can encourage a third model to be added. However, the resulting performance boost is diminishing, and it's slightly worse than directly distilling from 3 models.
Note 3. Vanilla GPT2 with absolute positional embedding sucks --- absurdly, even poorer than GPTs with uniform attention weights (even on such grammar tasks!). Better off using rotary or relative attention.
@AlbalakAlon
In the same setup, let T be regular training and TD training with domain tokens added. (1) TD has slightly better overall loss than T; (2) TD is much better than T on good data (loss + bits); (3) TD and T are almost equally bad on the junk data. In other words, adding tokens "speeds up" the training.
@PabloRedux
@RickSearle1
@AIatMeta
@mbzuai
It's because the entire conversation (including 1st Q&A and 2nd Q) is fed into GPT4 as a big, single context. GPT is trained to read from such context to know better what the question is. Feel free to change to "in" and play with ChatGPT. Would love to hear what you find out :)
@Jayasim52317099
@AIatMeta
@mbzuai
Thanks for the pointer! I think it is about the necessity of CoT for "in-context" multi-step arithmetic? There, CoT is needed because the computation is "hard". We study single-step operations on factual knowledge, without writing down the knowledge explicitly.
@Teknium1
More generally, if there are N people out of N_0 names, each with K attributes of knowledge, each attribute consisting of C chunks, each chunk of length L drawn from a diversity set D, with vocab size T, then the total bits of this data is:
Note 1. Our studied CFGs are much harder than English grammar (learnable using 2-layer GPT of 100K params) or coding grammar (learnable via greedy). Our CFGs are ambiguous enough that require dynamic programming to parse or generate.
@ysu_nlp
I 90% agreed back then (I convinced Avi that his "complexity" problem is GD, and we wrote 2 papers). Now I'm 99% and 1% in agreement at the same time. It's 99% since training an LLM only needs GD... it's 1% because most of the "meat" comes from data preparation, not the GD part.
@jasondeanlee
@AIatMeta
@weights_biases
thanks for asking!! coughing badly right now and lost my voice :( probably need a few weeks to recover. will post it here once I record it.
@DimitrisPapail
Fun fact: I asked what if I relocated myself to the student's city, or turned myself remote. The organizer hasn't wanted to talk to me ever since I asked --- plus I almost received a punishment because I asked.