Eric J. Michaud

@ericjmichaud_

1,271 Followers · 807 Following · 26 Media · 143 Statuses

PhD student at MIT. Trying to make deep neural networks among the best understood objects in the universe. 💻🤖🧠👽🔭🚀

Cambridge, MA
Joined February 2015
Pinned Tweet
@ericjmichaud_
Eric J. Michaud
1 year
Understanding the origin of neural scaling laws and the emergence of new capabilities with scale is key to understanding what deep neural networks are learning. In our new paper, @tegmark, @ZimingLiu11, @uzpg_ and I develop a theory of neural scaling. 🧵:
4
40
180
@ericjmichaud_
Eric J. Michaud
4 months
Our group has a new preprint out, in which we take some very tentative steps towards translating trained neural networks into code: Quick summary/thoughts 🧵:
Tweet media one
10
86
470
@ericjmichaud_
Eric J. Michaud
7 months
@dwarkesh_sp tl;dr: Maybe learning simple things (basic knowledge, heuristics, etc) actually lowers the loss more than learning sophisticated things (algorithms associated with higher cognition that we really care about), and the sophisticated things will eventually be learned as scaling continues.
9
25
368
@ericjmichaud_
Eric J. Michaud
6 months
*The Space of LLM Learning Curves* The mean loss improves smoothly over LLM training. But this averages over very many loss curves on individual tokens. I've made some interactive visualizations for exploring the per-token curves: Demo & observations:
4
15
101
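[A minimal sketch of how per-token loss curves like these can be computed, assuming the HuggingFace transformers library and EleutherAI's public Pythia checkpoints; the revision strings follow their published "stepN" scheme, and the demo's actual pipeline may differ.]

```python
# Sketch: per-token loss curves across training checkpoints.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def per_token_losses(text: str, revision: str) -> torch.Tensor:
    """Cross-entropy loss on each token of `text` at one checkpoint."""
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=revision)
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # The loss on token t comes from predicting it at position t-1.
    return F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")

# Rows: checkpoints. Columns: tokens. Each column is one of the
# per-token learning curves that the mean loss averages over.
curves = torch.stack([
    per_token_losses("The quick brown fox jumps over the lazy dog.", rev)
    for rev in ["step1000", "step10000", "step143000"]
])
```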
@ericjmichaud_
Eric J. Michaud
2 years
New preprint out with @ZimingLiu11 and @tegmark on "Precision Machine Learning". In this paper, we consider what becomes involved when you care about the difference between approximating a function with error 0.001 vs 0.00000000000001 error.
1
12
86
@ericjmichaud_
Eric J. Michaud
2 years
Last month, we put out a preprint led by @ZimingLiu11 on the phenomena of "grokking" in deep learning. Here's a blog post with some videos and additional discussion to accompany the paper:
1
8
47
@ericjmichaud_
Eric J. Michaud
1 year
Our paper "Precision Machine Learning" has been published in the journal Entropy.
@Entropy_MDPI
Entropy MDPI
1 year
Read #NewPaper "Precision Machine Learning" from Eric J. Michaud, Ziming Liu and Max Tegmark. #machinelearning #ScalingLaws #optimization
Tweet media one
0
3
11
2
1
36
@ericjmichaud_
Eric J. Michaud
4 months
The high-level goal motivating this work is to automate mechanistic interpretability and to do it so well that we can fully convert neural networks into standalone code. Such code can then be analyzed, properties about it can be proven, and it can be run instead of the network.
1
0
30
@ericjmichaud_
Eric J. Michaud
3 years
Excited to share my new paper “Understanding Learned Reward Functions” with co-authors @ARGleave and Stuart Russell, presented at the Deep RL Workshop at #NeurIPS2020 Paper: Code: Presentation:
1
5
30
@ericjmichaud_
Eric J. Michaud
4 months
This is a very ambitious goal! It could even be impossible! Of the computation performed within real-world neural networks, how much of it is reducible/admits a description that looks like code? I'm not sure! But it seems worth trying, and we'll probably learn a lot along the way
1
1
27
@ericjmichaud_
Eric J. Michaud
6 months
@_jasonwei Have you thought about whether the subtasks for language modeling might be naturally power law distributed? In our paper, we showed that this can lead to power law neural scaling as good performance emerges on an increasing number of these subtasks.
1
0
21
@ericjmichaud_
Eric J. Michaud
4 months
Here's an example of a program which we were able to extract from an RNN which performs addition on two binary strings. Max mentioned this example in his TED talk:
Tweet media one
1
0
20
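[The extracted program itself lives in the tweet's image, which isn't reproduced here; below is a hypothetical program of the same flavor, written to mirror the RNN's structure: a single pass over the bits, with one carry variable playing the role of the hidden state.]

```python
# Hypothetical reconstruction (not the actual extracted code): binary
# addition as a left-to-right pass with one carry variable, the kind of
# loop-with-state program an RNN naturally implements.
def add_binary(a: str, b: str) -> str:
    a, b = a[::-1], b[::-1]      # least-significant bit first
    carry, out = 0, []
    for i in range(max(len(a), len(b))):
        s = carry
        s += int(a[i]) if i < len(a) else 0
        s += int(b[i]) if i < len(b) else 0
        out.append(str(s % 2))   # emitted output bit
        carry = s // 2           # state update, like the hidden state
    if carry:
        out.append("1")
    return "".join(reversed(out))

assert add_binary("1011", "110") == "10001"  # 11 + 6 = 17
```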
@ericjmichaud_
Eric J. Michaud
4 months
So to convert RNNs into code, we have methods for (1) extracting a set of variables which are represented in the network's hidden state and (2) performing symbolic regression to learn update rules for these variables.
Tweet media one
1
3
21
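[A minimal sketch of that two-step recipe; the paper's actual pipeline is more involved. Here the probe is a plain linear regression, and the "symbolic regression" is a toy brute-force search over a tiny hypothesis space.]

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def probe_variable(hidden_states: np.ndarray, values: np.ndarray) -> float:
    """Step 1: R^2 of a linear readout of a candidate variable from h_t."""
    probe = LinearRegression().fit(hidden_states, values)
    return probe.score(hidden_states, values)

def regress_update_rule(v_prev, x_t, v_next) -> str:
    """Step 2: pick the best-fitting update rule from a small symbolic space."""
    candidates = {
        "v + x": v_prev + x_t,
        "v * x": v_prev * x_t,
        "(v + x) mod 2": (v_prev + x_t) % 2,
        "max(v, x)": np.maximum(v_prev, x_t),
    }
    errors = {rule: float(np.mean((pred - v_next) ** 2))
              for rule, pred in candidates.items()}
    return min(errors, key=errors.get)
```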
@ericjmichaud_
Eric J. Michaud
4 months
RNNs are an easy case since they naturally have a structure akin to a for-loop: some set of variables are maintained in the network's hidden state and are then updated as the network moves along the input sequence.
Tweet media one
1
0
19
@ericjmichaud_
Eric J. Michaud
4 months
In this first paper, we trained RNNs on a bunch of simple algorithmic tasks and then, via some analysis of their internals + symbolic regression, converted them into Python code. This Python code can then be run on its own to perform these tasks with 100% accuracy.
1
0
15
@ericjmichaud_
Eric J. Michaud
2 years
Neel's analysis of grokking is incredibly cool. The modular addition circuit he discovers beautifully explains why we saw ring structure in the transformer embeddings in our study. So many great, wide-reaching ideas in his post.
@NeelNanda5
Neel Nanda
2 years
I've spent the past few months exploring @OpenAI 's grokking result through the lens of mechanistic interpretability. I fully reverse engineered the modular addition model, and looked at what it does when training. So what's up with grokking? A 🧵... (1/17)
24
241
2K
1
0
16
@ericjmichaud_
Eric J. Michaud
4 months
But it's interesting to consider what a much more general approach could enable. I am reminded of @karpathy's "Software 2.0" way of thinking about deep learning – SGD may be a better programmer than you. But must these programs forever remain in the format of a NN? Perhaps not.
1
0
19
@ericjmichaud_
Eric J. Michaud
4 months
This is neat! But it's also just a proof of concept. We're doing this with tiny networks, each trained to perform a single, very simple task. And we still fail on about half of our networks!
1
0
16
@ericjmichaud_
Eric J. Michaud
1 year
The core of our theory is what we call the Quantization Hypothesis. This posits that there is a particular *discrete* set of computations which networks must learn in order to reduce loss. We call these the "quanta" of the prediction problem -- the building blocks of performance.
1
0
17
@ericjmichaud_
Eric J. Michaud
4 months
This project was a large collaboration with 4 co-first authors: myself, @LiaoIsaac91893, @vedanglad, and @ZimingLiu11, w/ undergrads @AMudide, Chloe Loughridge, @CarlGuo866, @tarark03 and Mateja Vukelić, and was really the brainchild of @tegmark.
1
0
14
@ericjmichaud_
Eric J. Michaud
1 year
What excites me about this work is that it hints at the possibility that we may be able to break even large neural networks apart and understand their performance with respect to a particular set of structures.
1
0
14
@ericjmichaud_
Eric J. Michaud
4 months
Ziming is an exceptionally creative and productive scientist, and also a kind and generous collaborator. I will miss him greatly when he finishes up his PhD and moves on to his next adventure... which could be working with you!🫵
@ZimingLiu11
Ziming Liu
4 months
This fall, I'll be on the job market looking for postdoc and faculty positions in the US! My research interests span AI + physics (science). If there are opportunities to present at your school, institute, group, seminar, workshop, etc., I would really appreciate it! 🥹
Tweet media one
7
53
241
0
0
15
@ericjmichaud_
Eric J. Michaud
4 months
Here are some other programs we generate:
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
1
13
@ericjmichaud_
Eric J. Michaud
1 year
Perhaps large networks can be thought of as ensembles of many "circuits" performing specialized, intelligible computations, and maybe larger networks simply consist of more of these circuits, learned universally, in a predictable way.
@ch402
Chris Olah
2 years
@michael_nielsen I'd love a more circuits-y theory though!
0
0
4
1
0
13
@ericjmichaud_
Eric J. Michaud
2 years
@NeelNanda5 For easy reference, here's the video where we project the embeddings onto the same axes at every frame. Indeed, it turns out that the subspace in which the embeddings form a nice ring is a subspace where they were already(!) roughly ordered in a ring at initialization.
1
1
13
@ericjmichaud_
Eric J. Michaud
7 months
@xuanalogue Thanks! I'm pretty agnostic rn about what will be learnable or not with further scaling alone. In the post I was halfway trying to defend the scaling purist perspective, so for fun I'll try to defend it further in the context of your point about "multiple circuits": One thing
Tweet media one
Tweet media two
2
1
12
@ericjmichaud_
Eric J. Michaud
4 months
Also, I personally know almost nothing about program synthesis, so suggestions on papers to read or cite would of course be appreciated :).
1
0
11
@ericjmichaud_
Eric J. Michaud
1 year
Our model says that (i) the effect of scaling is to enable networks to learn more and more quanta in the Q Sequence and (ii) the use frequencies of the quanta follow a power law in natural data, and this is the origin of power law neural scaling.
1
0
12
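[A worked version of that claim — a sketch under the model's own assumptions, where b denotes the extra loss paid on samples whose quantum hasn't been learned.]

```latex
% Zipfian use frequencies over quanta, and the resulting loss of a model
% that has learned the first n quanta of the Q Sequence:
p_k \propto k^{-(\alpha + 1)}, \qquad
L(n) \;\approx\; b \sum_{k > n} p_k
     \;\propto\; \sum_{k > n} k^{-(\alpha + 1)}
     \;\approx\; \int_n^\infty k^{-(\alpha + 1)}\, dk
     \;=\; \frac{n^{-\alpha}}{\alpha}.
% Learning more quanta with scale thus gives power law loss scaling,
% L \propto n^{-\alpha}.
```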
@ericjmichaud_
Eric J. Michaud
1 year
Next, we study how power law scaling decomposes for LLMs. We use the "Pythia" sequence of models from @AiEleuther . For instance, instead of studying just how mean test loss (on The Pile) falls off with scale, we show how the distribution over per-token losses scales:
Tweet media one
1
0
11
@ericjmichaud_
Eric J. Michaud
1 year
We first construct toy datasets where our story of neural scaling is true. Our tasks consist of many distinct subtasks. When we impose a power law distribution over subtasks, we get power law neural scaling as networks succeed at more and more subtasks with increasing scale:
Tweet media one
1
0
11
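[A sketch of a toy dataset in this spirit, in the style of the paper's "multitask sparse parity" construction; the sizes and exponent below are illustrative.]

```python
# Toy dataset: each sample belongs to one of T subtasks, subtasks are
# sampled with power-law frequency, and the label is the parity of a
# subtask-specific subset of the input bits.
import numpy as np

rng = np.random.default_rng(0)
T, n_bits, k = 100, 50, 3                 # subtasks, data bits, parity size
alpha = 0.4
probs = np.arange(1, T + 1) ** -(alpha + 1)
probs /= probs.sum()                      # power-law subtask frequencies
subsets = [rng.choice(n_bits, size=k, replace=False) for _ in range(T)]

def sample(n: int):
    tasks = rng.choice(T, size=n, p=probs)
    bits = rng.integers(0, 2, size=(n, n_bits))
    labels = np.array([bits[i, subsets[t]].sum() % 2
                       for i, t in enumerate(tasks)])
    # Input = one-hot task indicator concatenated with the data bits.
    onehot = np.eye(T, dtype=int)[tasks]
    return np.concatenate([onehot, bits], axis=1), labels

X, y = sample(10_000)
```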
@ericjmichaud_
Eric J. Michaud
1 year
Our method, which we call QDG for "quanta discovery from gradients", is based on spectral clustering with model gradients. With QDG, we auto-discover a variety of capabilities/behaviors of a small language model. Here are a couple of clusters:
Tweet media one
1
0
11
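[A minimal sketch of gradient-based clustering in the spirit of QDG; the paper's exact affinity and preprocessing may differ, and for large models one would project the gradients to a lower dimension first.]

```python
import torch
from sklearn.cluster import SpectralClustering

def gradient_vector(model, loss_fn, x, y) -> torch.Tensor:
    """Normalized gradient of the loss on one sample w.r.t. all parameters."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    g = torch.cat([p.grad.flatten() for p in model.parameters()])
    return g / g.norm()

def qdg_clusters(grads: torch.Tensor, n_clusters: int):
    """Spectrally cluster samples by the similarity of their gradients.

    grads: (n_samples, n_params) matrix of normalized per-sample gradients.
    """
    affinity = (grads @ grads.T).clamp(min=0).numpy()  # cosine similarity
    return SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed"
    ).fit_predict(affinity)
```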
@ericjmichaud_
Eric J. Michaud
1 year
As a final caveat, I admit that the Quantization Hypothesis is quite strong, and so if it is the right story of neural scaling, it may only be true in broad strokes, and much work would remain in figuring out the details.
1
0
10
@ericjmichaud_
Eric J. Michaud
1 year
Consider the task of language modeling. Because of the immense complexity and diversity of the world and therefore of human language, effectively predicting text requires an immense amount of knowledge and the ability to perform many different types of computations.
1
2
10
@ericjmichaud_
Eric J. Michaud
1 year
These differences in the "use frequency" of the quanta mean that learning some quanta reduces mean loss more than learning others. So the quanta have a natural ordering into what we call the Q Sequence.
1
0
11
@ericjmichaud_
Eric J. Michaud
1 year
Some quanta are more frequently useful for prediction than others. For instance, in the distribution of text on the internet, knowledge of basic grammatical rules is far more frequently relied upon for predicting the next token than esoteric physics knowledge.
1
0
9
@ericjmichaud_
Eric J. Michaud
1 year
We find that the distribution over the size of these clusters roughly follows the power law that we would expect from our theory and the observed scaling exponents for LLMs (however, this measurement is still fairly messy).
Tweet media one
1
0
8
@ericjmichaud_
Eric J. Michaud
1 year
Since it is unclear a priori how to partition the task of language modeling according to which quantum of knowledge/computation each prediction relies on, we use the internal structure of a trained LLM to cluster samples together.
1
0
7
@ericjmichaud_
Eric J. Michaud
1 year
Here, smooth power laws average over a large number of phase transitions in model capabilities when properly decomposed by subtask. Scaling enables networks to solve more and more niche prediction problems they qualitatively could not solve before.
1
0
6
@ericjmichaud_
Eric J. Michaud
1 year
@NeelNanda5 Thanks Neel! For reference, here are all the clusters of model behavior: . Not all clusters are as coherent as the examples shown (particularly the early ones in the list, which are the largest), but it's neat that our method worked at all!
0
0
6
@ericjmichaud_
Eric J. Michaud
1 year
Losses of ≈0 are by far the most common, and with increasing scale and training time, networks achieve ≈0 loss on an increasing fraction of samples.
1
0
6
@ericjmichaud_
Eric J. Michaud
4 years
In August, some folks from @BerkeleySETI and I submitted a short white paper called "Lunar Opportunities for SETI" for the NAS Decadal Survey on Planetary Science and Astrobiology. As of today, it's also up on arXiv!
1
0
6
@ericjmichaud_
Eric J. Michaud
2 years
For instance, we compare how simplex interpolation and ReLU NNs scale. While both methods provide piecewise linear fits, we find that NNs often do better than simplex interpolation, perhaps by taking advantage of the modular structure of problems to make the effective dim lower.
Tweet media one
1
0
5
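[A sketch of this kind of comparison on a toy 2D problem, using scipy's Delaunay-based LinearNDInterpolator as the simplex interpolant; the target function and hyperparameters are illustrative, not the paper's.]

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator
from sklearn.neural_network import MLPRegressor

f = lambda X: np.sin(3 * X[:, 0]) * X[:, 1]   # toy target function
rng = np.random.default_rng(0)
X_test = rng.uniform(0, 1, size=(2000, 2))
y_test = f(X_test)

for N in [100, 1000, 10_000]:
    X_train = rng.uniform(0, 1, size=(N, 2))
    y_train = f(X_train)
    simplex = LinearNDInterpolator(X_train, y_train)
    nn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
    nn.fit(X_train, y_train)
    for name, pred in [("simplex", simplex(X_test)),
                       ("relu nn", nn.predict(X_test))]:
        ok = ~np.isnan(pred)   # interpolant is undefined outside the hull
        rmse = np.sqrt(np.mean((pred[ok] - y_test[ok]) ** 2))
        print(f"N={N:6d}  {name:8s}  RMSE={rmse:.2e}")
```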
@ericjmichaud_
Eric J. Michaud
4 years
Over the past year, I've been working with @erikphoel and Simon Mattsson on a paper studying the "causal structure" of artificial neural networks. Today I'm thrilled to announce that it's up on the arXiv! Paper: Code:
1
1
5
@ericjmichaud_
Eric J. Michaud
2 years
We also study the optimization challenge of training NNs to super low loss on simple regression problems. With some nonstandard choices/tricks (namely fitting a second NN to the error of the first and combining them) we can get fits relatively close to the 64-bit float limit.
Tweet media one
0
0
5
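[A sketch of that residual trick on a toy regression problem — a boosting-style fit; the paper's actual training setup goes well beyond this.]

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5000, 1))
y = np.sin(4 * X[:, 0])

# Fit the first network, then fit a second network to its error.
f1 = MLPRegressor(hidden_layer_sizes=(128,), max_iter=5000).fit(X, y)
f2 = MLPRegressor(hidden_layer_sizes=(128,), max_iter=5000).fit(
    X, y - f1.predict(X))

combined = f1.predict(X) + f2.predict(X)
print("f1 RMSE:      ", np.sqrt(np.mean((f1.predict(X) - y) ** 2)))
print("combined RMSE:", np.sqrt(np.mean((combined - y) ** 2)))
```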
@ericjmichaud_
Eric J. Michaud
4 months
@laurolangosco Neat paper! If I've understood it correctly, one difference is that we seek a concise description of the transition function in terms of symbolic formulae rather than a lookup table. But this was possible because we only considered tasks where inputs & states have numeric type.
1
0
4
@ericjmichaud_
Eric J. Michaud
5 years
Visualizing how neural network weights update in real time. Code:
0
0
4
@ericjmichaud_
Eric J. Michaud
2 years
The performance of many approximation methods scales as a power law in data and parameters. The power law exponent therefore determines whether an exceptionally close fit is feasible. So a major focus of ours was studying the scaling behavior of various methods.
1
0
4
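[To see why the exponent dominates feasibility, some illustrative arithmetic — not numbers from the paper.]

```latex
% If error scales as \epsilon \propto N^{-c}, then improving from
% \epsilon_1 = 10^{-3} to \epsilon_2 = 10^{-14} requires growing N by
\frac{N_2}{N_1}
  = \left(\frac{\epsilon_1}{\epsilon_2}\right)^{1/c}
  = \left(10^{11}\right)^{1/c},
% so c = 1 demands 10^{11} times more resources, while c = 4 demands
% only about 560 times -- the exponent decides what is feasible.
```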
@ericjmichaud_
Eric J. Michaud
7 months
@dwarkesh_sp Best podcaster so far 🫡
0
0
2
@ericjmichaud_
Eric J. Michaud
2 years
CHAI is great, and Adam was an excellent mentor when I interned there! Internship applications for next year are due in 4 days:
@ARGleave
Adam Gleave
3 years
Applications now open for the @CHAI_Berkeley internship: Aimed at BS, MS or early-career individuals wishing to gain research experience in AI safety. Advised by a CHAI PhD student. Three-month, paid, dates flexible; long-term collaboration possible.
3
26
61
0
0
3
@ericjmichaud_
Eric J. Michaud
2 years
@lieberum_t "Grokking" seems to not typically happen very fast. In the key plot from the original grokking paper, generalization happens over the final 900k (of 1M) training steps, and it just appears sudden due to the use of a log scale on the x-axis.
Tweet media one
1
0
3
@ericjmichaud_
Eric J. Michaud
3 years
This work was done during my internship with @CHAI_Berkeley . Many thanks to everyone at CHAI for your support!
0
0
2
@ericjmichaud_
Eric J. Michaud
8 years
Life goal: Survive junior year. Status: Complete
1
0
2
@ericjmichaud_
Eric J. Michaud
4 years
@erikphoel Jill Tarter
1
0
2
@ericjmichaud_
Eric J. Michaud
4 years
In the paper, we define and measure variants of "effective information" and "integrated information" in feedforward deep neural networks. We hope that these will provide foundational tools for understanding both the training dynamics and the learned structure of DNNs.
1
1
2
@ericjmichaud_
Eric J. Michaud
6 months
@michaeljelly @moreisdifferent How to best aggregate tokens is a *very* interesting q! I think that ultimately you'd want to group tokens together according to what mechanism the model is using to predict them, which may not correspond cleanly to bigrams, etc. Check out:
1
0
1
@ericjmichaud_
Eric J. Michaud
2 years
@lieberum_t Oh very interesting! Our effective theory definitely does not account for these sorts of differences (only looking at the task of addition, for a toy model, ignoring the decoder). Agree that it would be cool to explain how the timing/speed of generalization depends on task!
0
0
1
@ericjmichaud_
Eric J. Michaud
4 years
@AstroShashank Hahaha I had no idea this started with you! I first saw it in a group chat. It's all over the place!
0
0
1
@ericjmichaud_
Eric J. Michaud
3 years
How can you tell if a learned reward function captures user preferences? We apply some standard ML interpretability techniques towards understanding what learned reward functions are doing in a few RL environments.
1
0
1
@ericjmichaud_
Eric J. Michaud
7 years
Funny how we still use “worldview” to describe our ultimate perspective on things, considering how microscopic our world really is.
Tweet media one
0
0
1
@ericjmichaud_
Eric J. Michaud
2 years
@NeelNanda5 Whether we project onto the principal components re-computed at each step (1st vid) or project onto the principal components computed at the end of training (2nd vid).
0
0
1
@ericjmichaud_
Eric J. Michaud
9 years
Last night, as proved by the photo, I estimated the resolution of the new MacBook to a degree of 99.8% accuracy http://t.co/6Cj5T0aTI4
Tweet media one
1
0
1
@ericjmichaud_
Eric J. Michaud
4 years
@dwarkesh_sp @paulg This essay by @michael_nielsen comes to mind:
0
0
1
@ericjmichaud_
Eric J. Michaud
9 years
Tragic day for NASA and SpaceX as the Falcon-9 rocket breaks up approx. 2 minutes into flight on its ISS resupply mission.
0
0
1
@ericjmichaud_
Eric J. Michaud
9 years
T-4 minutes to Falcon-9 Launch. Anyone who's awake should watch this possibly historic moment! http://t.co/AtRsN7iKbY
0
0
1
@ericjmichaud_
Eric J. Michaud
3 years
As a closing thought, I also wonder whether future interpretability techniques, coupled with sophisticated reward learning, could be a kind of "microscope AI" for improving our understanding of human values and human well-being. @ch402 @nickcammarata @SamHarrisOrg
1
0
1
@ericjmichaud_
Eric J. Michaud
3 years
Our paper is a tentative step in this direction. We hope that more advanced interpretability techniques will someday allow researchers to more comprehensively open up AI systems and verify that such systems understand and are aligned with human values.
1
0
1
@ericjmichaud_
Eric J. Michaud
3 years
However, current algorithms for reward learning can fail silently. Absent perfect reward learning, we therefore need techniques for auditing learned reward functions -- for scrutinizing a machine's understanding of human preferences.
1
0
1