Eric J. Michaud

@ericjmichaud_

1,271 Followers · 807 Following · 26 Media · 143 Statuses

PhD student at MIT. Trying to make deep neural networks among the best understood objects in the universe. 💻🤖🧠👽🔭🚀

Cambridge, MA
Joined February 2015
Pinned Tweet
@ericjmichaud_
Eric J. Michaud
1 year
Understanding the origin of neural scaling laws and the emergence of new capabilities with scale is key to understanding what deep neural networks are learning. In our new paper, @tegmark, @ZimingLiu11, @uzpg_ and I develop a theory of neural scaling. 🧵:
4
40
180
@ericjmichaud_
Eric J. Michaud
4 months
Our group has a new preprint out, in which we take some very tentative steps towards translating trained neural networks into code: Quick summary/thoughts 🧵:
Tweet media one
10
86
470
@ericjmichaud_
Eric J. Michaud
7 months
@dwarkesh_sp tl;dr: Maybe learning simple things (basic knowledge, heuristics, etc) actually lowers the loss more than learning sophisticated things (algorithms associated with higher cognition that we really care about), and the sophisticated things will eventually be learned as scaling continues.
9
25
368
@ericjmichaud_
Eric J. Michaud
6 months
*The Space of LLM Learning Curves* The mean loss improves smoothly over LLM training. But this averages over very many loss curves on individual tokens. I've made some interactive visualizations for exploring the per-token curves: Demo & observations:
4
15
101
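[A minimal sketch of how per-token loss curves like these can be computed, assuming the HuggingFace transformers library and EleutherAI's public Pythia checkpoints; the revision strings follow their published "stepN" scheme, and the demo's actual pipeline may differ.]

```python
# Sketch: per-token loss curves across training checkpoints.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def per_token_losses(text: str, revision: str) -> torch.Tensor:
    """Cross-entropy loss on each token of `text` at one checkpoint."""
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=revision)
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # The loss on token t comes from predicting it at position t-1.
    return F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")

# Rows: checkpoints. Columns: tokens. Each column is one of the
# per-token learning curves that the mean loss averages over.
curves = torch.stack([
    per_token_losses("The quick brown fox jumps over the lazy dog.", rev)
    for rev in ["step1000", "step10000", "step143000"]
])
```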
@ericjmichaud_
Eric J. Michaud
2 years
New preprint out with @ZimingLiu11 and @tegmark on "Precision Machine Learning". In this paper, we consider what becomes involved when you care about the difference between approximating a function with error 0.001 vs 0.00000000000001 error.
1
12
86
@ericjmichaud_
Eric J. Michaud
2 years
Last month, we put out a preprint led by @ZimingLiu11 on the phenomena of "grokking" in deep learning. Here's a blog post with some videos and additional discussion to accompany the paper:
1
8
47
@ericjmichaud_
Eric J. Michaud
1 year
Our paper "Precision Machine Learning" has been published in the journal Entropy.
@Entropy_MDPI
Entropy MDPI
1 year
Read #NewPaper "Precision Machine Learning" from Eric J. Michaud, Ziming Liu and Max Tegmark. #machinelearning #ScalingLaws #optimization
Tweet media one
0
3
11
2
1
36
@ericjmichaud_
Eric J. Michaud
4 months
The high-level goal motivating this work is to automate mechanistic interpretability and to do it so well that we can fully convert neural networks into standalone code. Such code can then be analyzed, properties about it can be proven, and it can be run instead of the network.
1
0
30
@ericjmichaud_
Eric J. Michaud
3 years
Excited to share my new paper “Understanding Learned Reward Functions” with co-authors @ARGleave and Stuart Russell, presented at the Deep RL Workshop at #NeurIPS2020 Paper: Code: Presentation:
1
5
30
@ericjmichaud_
Eric J. Michaud
4 months
This is a very ambitious goal! It could even be impossible! Of the computation performed within real-world neural networks, how much of it is reducible/admits a description that looks like code? I'm not sure! But it seems worth trying, and we'll probably learn a lot along the way
1
1
27
@ericjmichaud_
Eric J. Michaud
6 months
@_jasonwei Have you thought about whether the subtasks for language modeling might be naturally power law distributed? In our paper, we showed that this can lead to power law neural scaling as good performance emerges on an increasing number of these subtasks.
1
0
21
@ericjmichaud_
Eric J. Michaud
4 months
Here's an example of a program which we were able to extract from an RNN which performs addition on two binary strings. Max mentioned this example in his TED talk:
Tweet media one
1
0
20
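[The extracted program itself lives in the tweet's image, which isn't reproduced here; below is a hypothetical program of the same flavor, written to mirror the RNN's structure: a single pass over the bits, with one carry variable playing the role of the hidden state.]

```python
# Hypothetical reconstruction (not the actual extracted code): binary
# addition as a left-to-right pass with one carry variable, the kind of
# loop-with-state program an RNN naturally implements.
def add_binary(a: str, b: str) -> str:
    a, b = a[::-1], b[::-1]      # least-significant bit first
    carry, out = 0, []
    for i in range(max(len(a), len(b))):
        s = carry
        s += int(a[i]) if i < len(a) else 0
        s += int(b[i]) if i < len(b) else 0
        out.append(str(s % 2))   # emitted output bit
        carry = s // 2           # state update, like the hidden state
    if carry:
        out.append("1")
    return "".join(reversed(out))

assert add_binary("1011", "110") == "10001"  # 11 + 6 = 17
```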
@ericjmichaud_
Eric J. Michaud
4 months
So to convert RNNs into code, we have methods for (1) extracting a set of variables which are represented in the network's hidden state and (2) performing symbolic regression to learn update rules for these variables.
Tweet media one
1
3
21
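[A minimal sketch of that two-step recipe; the paper's actual pipeline is more involved. Here the probe is a plain linear regression, and the "symbolic regression" is a toy brute-force search over a tiny hypothesis space.]

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def probe_variable(hidden_states: np.ndarray, values: np.ndarray) -> float:
    """Step 1: R^2 of a linear readout of a candidate variable from h_t."""
    probe = LinearRegression().fit(hidden_states, values)
    return probe.score(hidden_states, values)

def regress_update_rule(v_prev, x_t, v_next) -> str:
    """Step 2: pick the best-fitting update rule from a small symbolic space."""
    candidates = {
        "v + x": v_prev + x_t,
        "v * x": v_prev * x_t,
        "(v + x) mod 2": (v_prev + x_t) % 2,
        "max(v, x)": np.maximum(v_prev, x_t),
    }
    errors = {rule: float(np.mean((pred - v_next) ** 2))
              for rule, pred in candidates.items()}
    return min(errors, key=errors.get)
```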
@ericjmichaud_
Eric J. Michaud
4 months
RNNs are an easy case since they naturally have a structure akin to a for-loop: some set of variables are maintained in the network's hidden state and are then updated as the network moves along the input sequence.
Tweet media one
1
0
19
@ericjmichaud_
Eric J. Michaud
4 months
In this first paper, we trained RNNs on a bunch of simple algorithmic tasks and then, via some analysis of their internals + symbolic regression, converted them into Python code. This Python code can then be run on its own to perform these tasks with 100% accuracy.
1
0
15
@ericjmichaud_
Eric J. Michaud
2 years
Neel's analysis of grokking is incredibly cool. The modular addition circuit he discovers beautifully explains why we saw ring structure in the transformer embeddings in our study. So many great, wide-reaching ideas in his post.
@NeelNanda5
Neel Nanda
2 years
I've spent the past few months exploring @OpenAI 's grokking result through the lens of mechanistic interpretability. I fully reverse engineered the modular addition model, and looked at what it does when training. So what's up with grokking? A 🧵... (1/17)
24
241
2K
1
0
16
@ericjmichaud_
Eric J. Michaud
4 months
But it's interesting to consider what a much more general approach could enable. I am reminded of @karpathy's "Software 2.0" way of thinking about deep learning – SGD may be a better programmer than you. But must these programs forever remain in the format of a NN? Perhaps not.
1
0
19
@ericjmichaud_
Eric J. Michaud
4 months
This is neat! But it's also just a proof of concept. We're doing this with tiny networks, each trained to perform a single, very simple task. And we still fail on about half of our networks!
1
0
16
@ericjmichaud_
Eric J. Michaud
1 year
The core of our theory is what we call the Quantization Hypothesis. This posits that there is a particular *discrete* set of computations which networks must learn in order to reduce loss. We call these the "quanta" of the prediction problem -- the building blocks of performance.
1
0
17
@ericjmichaud_
Eric J. Michaud
4 months
This project was a large collaboration with 4 co-first authors: myself, @LiaoIsaac91893, @vedanglad, and @ZimingLiu11, w/ undergrads @AMudide, Chloe Loughridge, @CarlGuo866, @tarark03 and Mateja Vukelić, and was really the brainchild of @tegmark.
1
0
14
@ericjmichaud_
Eric J. Michaud
1 year
What excites me about this work is that it hints at the possibility that we may be able to break even large neural networks apart and understand their performance with respect to a particular set of structures.
1
0
14
@ericjmichaud_
Eric J. Michaud
4 months
Ziming is an exceptionally creative and productive scientist, and also a kind and generous collaborator. I will miss him greatly when he finishes up his PhD and moves on to his next adventure... which could be working with you!🫵
@ZimingLiu11
Ziming Liu
4 months
This fall, I'll be on the job market looking for postdoc and faculty positions in the US! My research interests span AI + physics (science). If there are opportunities to present at your school, institute, group, seminar, workshop, etc., I would really appreciate it! 🥹
Tweet media one
7
53
241
0
0
15
@ericjmichaud_
Eric J. Michaud
4 months
Here are some other programs we generate:
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
1
13
@ericjmichaud_
Eric J. Michaud
1 year
Perhaps large networks can be thought of as ensembles of many "circuits" performing specialized, intelligible computations, and maybe larger networks simply consist of more of these circuits, learned universally, in a predictable way.
@ch402
Chris Olah
2 years
@michael_nielsen I'd love a more circuits-y theory though!
0
0
4
1
0
13
@ericjmichaud_
Eric J. Michaud
2 years
@NeelNanda5 For easy reference, here's the video where we project the embeddings onto the same axes at every frame. Indeed, it turns out that the subspace in which the embeddings form a nice ring is a subspace where they were already(!) roughly ordered in a ring at initialization.
1
1
13
@ericjmichaud_
Eric J. Michaud
7 months
@xuanalogue Thanks! I'm pretty agnostic rn about what will be learnable or not with further scaling alone. In the post I was halfway trying to defend the scaling purist perspective, so for fun I'll try to defend it further in the context of your point about "multiple circuits": One thing
Tweet media one
Tweet media two
2
1
12
@ericjmichaud_
Eric J. Michaud
4 months
Also, I personally know almost nothing about program synthesis, so suggestions on papers to read or cite would of course be appreciated :).
1
0
11
@ericjmichaud_
Eric J. Michaud
1 year
Our model says that (i) the effect of scaling is to enable networks to learn more and more quanta in the Q Sequence and (ii) the use frequencies of the quanta follow a power law in natural data, and this is the origin of power law neural scaling.
1
0
12
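[A worked version of that claim — a sketch under the model's own assumptions, where b denotes the extra loss paid on samples whose quantum hasn't been learned.]

```latex
% Zipfian use frequencies over quanta, and the resulting loss of a model
% that has learned the first n quanta of the Q Sequence:
p_k \propto k^{-(\alpha + 1)}, \qquad
L(n) \;\approx\; b \sum_{k > n} p_k
     \;\propto\; \sum_{k > n} k^{-(\alpha + 1)}
     \;\approx\; \int_n^\infty k^{-(\alpha + 1)}\, dk
     \;=\; \frac{n^{-\alpha}}{\alpha}.
% Learning more quanta with scale thus gives power law loss scaling,
% L \propto n^{-\alpha}.
```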
@ericjmichaud_
Eric J. Michaud
1 year
Next, we study how power law scaling decomposes for LLMs. We use the "Pythia" sequence of models from @AiEleuther . For instance, instead of studying just how mean test loss (on The Pile) falls off with scale, we show how the distribution over per-token losses scales:
Tweet media one
1
0
11
@ericjmichaud_
Eric J. Michaud
1 year
We first construct toy datasets where our story of neural scaling is true. Our tasks consist of many distinct subtasks. When we impose a power law distribution over subtasks, we get power law neural scaling as networks succeed at more and more subtasks with increasing scale:
Tweet media one
1
0
11
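[A sketch of a toy dataset in this spirit, in the style of the paper's "multitask sparse parity" construction; the sizes and exponent below are illustrative.]

```python
# Toy dataset: each sample belongs to one of T subtasks, subtasks are
# sampled with power-law frequency, and the label is the parity of a
# subtask-specific subset of the input bits.
import numpy as np

rng = np.random.default_rng(0)
T, n_bits, k = 100, 50, 3                 # subtasks, data bits, parity size
alpha = 0.4
probs = np.arange(1, T + 1) ** -(alpha + 1)
probs /= probs.sum()                      # power-law subtask frequencies
subsets = [rng.choice(n_bits, size=k, replace=False) for _ in range(T)]

def sample(n: int):
    tasks = rng.choice(T, size=n, p=probs)
    bits = rng.integers(0, 2, size=(n, n_bits))
    labels = np.array([bits[i, subsets[t]].sum() % 2
                       for i, t in enumerate(tasks)])
    # Input = one-hot task indicator concatenated with the data bits.
    onehot = np.eye(T, dtype=int)[tasks]
    return np.concatenate([onehot, bits], axis=1), labels

X, y = sample(10_000)
```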
@ericjmichaud_
Eric J. Michaud
1 year
Our method, which we call QDG for "quanta discovery from gradients", is based on spectral clustering with model gradients. With QDG, we auto-discover a variety of capabilities/behaviors of a small language model. Here are a couple of clusters:
Tweet media one
1
0
11
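[A minimal sketch of gradient-based clustering in the spirit of QDG; the paper's exact affinity and preprocessing may differ, and for large models one would project the gradients to a lower dimension first.]

```python
import torch
from sklearn.cluster import SpectralClustering

def gradient_vector(model, loss_fn, x, y) -> torch.Tensor:
    """Normalized gradient of the loss on one sample w.r.t. all parameters."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    g = torch.cat([p.grad.flatten() for p in model.parameters()])
    return g / g.norm()

def qdg_clusters(grads: torch.Tensor, n_clusters: int):
    """Spectrally cluster samples by the similarity of their gradients.

    grads: (n_samples, n_params) matrix of normalized per-sample gradients.
    """
    affinity = (grads @ grads.T).clamp(min=0).numpy()  # cosine similarity
    return SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed"
    ).fit_predict(affinity)
```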
@ericjmichaud_
Eric J. Michaud
1 year
As a final caveat, I admit that the Quantization Hypothesis is quite strong, and so if it is the right story of neural scaling, it may only be true in broad strokes, and much work would remain in figuring out the details.
1
0
10
@ericjmichaud_
Eric J. Michaud
1 year
Consider the task of language modeling. Because of the immense complexity and diversity of the world and therefore of human language, effectively predicting text requires an immense amount of knowledge and the ability to perform many different types of computations.
1
2
10
@ericjmichaud_
Eric J. Michaud
1 year
These differences in the "use frequency" of the quanta mean that learning some quanta reduces mean loss more than learning others. So the quanta have a natural ordering into what we call the Q Sequence.
1
0
11
@ericjmichaud_
Eric J. Michaud
1 year
Some quanta are more frequently useful for prediction than others. For instance, in the distribution of text on the internet, knowledge of basic grammatical rules is far more frequently relied upon for predicting the next token than esoteric physics knowledge.
1
0
9
@ericjmichaud_
Eric J. Michaud
1 year
We find that the distribution over the size of these clusters roughly follows the power law that we would expect from our theory and the observed scaling exponents for LLMs (however, this measurement is still fairly messy).
Tweet media one
1
0
8
@ericjmichaud_
Eric J. Michaud
1 year
Since it is unclear a priori how to partition the task of language modeling according to which quantum of knowledge/computation each prediction relies on, we use the internal structure of a trained LLM to cluster samples together.
1
0
7
@ericjmichaud_
Eric J. Michaud
1 year
Here, smooth power laws average over a large number of phase transitions in model capabilities when properly decomposed by subtask. Scaling enables networks to solve more and more niche prediction problems they qualitatively could not solve before.
1
0
6
@ericjmichaud_
Eric J. Michaud
1 year
@NeelNanda5 Thanks Neel! For reference, here are all the clusters of model behavior: . Not all clusters are as coherent as the examples shown (particularly the early ones in the list, which are the largest), but it's neat that our method worked at all!
0
0
6
@ericjmichaud_
Eric J. Michaud
1 year
Losses of ≈0 are by far the most common, and with increasing scale and training time, networks achieve ≈0 loss on an increasing fraction of samples.
1
0
6
@ericjmichaud_
Eric J. Michaud
4 years
In August, some folks from @BerkeleySETI and I submitted a short white paper called "Lunar Opportunities for SETI" for the NAS Decadal Survey on Planetary Science and Astrobiology. As of today, it's also up on arXiv!
1
0
6
@ericjmichaud_
Eric J. Michaud
2 years
For instance, we compare how simplex interpolation and ReLU NNs scale. While both methods provide piecewise linear fits, we find that NNs often do better than simplex interpolation, perhaps by taking advantage of the modular structure of problems to make the effective dim lower.
Tweet media one
1
0
5
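[A sketch of this kind of comparison on a toy 2D problem, using scipy's Delaunay-based LinearNDInterpolator as the simplex interpolant; the target function and hyperparameters are illustrative, not the paper's.]

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator
from sklearn.neural_network import MLPRegressor

f = lambda X: np.sin(3 * X[:, 0]) * X[:, 1]   # toy target function
rng = np.random.default_rng(0)
X_test = rng.uniform(0, 1, size=(2000, 2))
y_test = f(X_test)

for N in [100, 1000, 10_000]:
    X_train = rng.uniform(0, 1, size=(N, 2))
    y_train = f(X_train)
    simplex = LinearNDInterpolator(X_train, y_train)
    nn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
    nn.fit(X_train, y_train)
    for name, pred in [("simplex", simplex(X_test)),
                       ("relu nn", nn.predict(X_test))]:
        ok = ~np.isnan(pred)   # interpolant is undefined outside the hull
        rmse = np.sqrt(np.mean((pred[ok] - y_test[ok]) ** 2))
        print(f"N={N:6d}  {name:8s}  RMSE={rmse:.2e}")
```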
@ericjmichaud_
Eric J. Michaud
4 years
Over the past year, I've been working with @erikphoel and Simon Mattsson on a paper studying the "causal structure" of artificial neural networks. Today I'm thrilled to announce that it's up on the arXiv! Paper: Code:
1
1
5
@ericjmichaud_
Eric J. Michaud
2 years
We also study the optimization challenge of training NNs to super low loss on simple regression problems. With some nonstandard choices/tricks (namely fitting a second NN to the error of the first and combining them) we can get fits relatively close to the 64-bit float limit.
Tweet media one
0
0
5
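[A sketch of that residual trick on a toy regression problem — a boosting-style fit; the paper's actual training setup goes well beyond this.]

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5000, 1))
y = np.sin(4 * X[:, 0])

# Fit the first network, then fit a second network to its error.
f1 = MLPRegressor(hidden_layer_sizes=(128,), max_iter=5000).fit(X, y)
f2 = MLPRegressor(hidden_layer_sizes=(128,), max_iter=5000).fit(
    X, y - f1.predict(X))

combined = f1.predict(X) + f2.predict(X)
print("f1 RMSE:      ", np.sqrt(np.mean((f1.predict(X) - y) ** 2)))
print("combined RMSE:", np.sqrt(np.mean((combined - y) ** 2)))
```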
@ericjmichaud_
Eric J. Michaud
4 months
@laurolangosco Neat paper! If I've understood it correctly, one difference is that we seek a concise description of the transition function in terms of symbolic formulae rather than a lookup table. But this was possible because we only considered tasks where inputs & states have numeric type.
1
0
4
@ericjmichaud_
Eric J. Michaud
5 years
Visualizing how neural network weights update in real time. Code:
0
0
4
@ericjmichaud_
Eric J. Michaud
2 years
The performance of many approximation methods scales as a power law in data and parameters. The power law exponent therefore determines whether an exceptionally close fit is feasible. So a major focus of ours was studying the scaling behavior of various methods.
1
0
4
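[To see why the exponent dominates feasibility, some illustrative arithmetic — not numbers from the paper.]

```latex
% If error scales as \epsilon \propto N^{-c}, then improving from
% \epsilon_1 = 10^{-3} to \epsilon_2 = 10^{-14} requires growing N by
\frac{N_2}{N_1}
  = \left(\frac{\epsilon_1}{\epsilon_2}\right)^{1/c}
  = \left(10^{11}\right)^{1/c},
% so c = 1 demands 10^{11} times more resources, while c = 4 demands
% only about 560 times -- the exponent decides what is feasible.
```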
@ericjmichaud_
Eric J. Michaud
7 months
@dwarkesh_sp Best podcaster so far 🫡
0
0
2
@ericjmichaud_
Eric J. Michaud
2 years
CHAI is great, and Adam was an excellent mentor when I interned there! Internship applications for next year are due in 4 days:
@ARGleave
Adam Gleave
3 years
Applications now open for the @CHAI_Berkeley internship: Aimed at BS, MS or early-career individuals wishing to gain research experience in AI safety. Advised by a CHAI PhD student. Three-month, paid, dates flexible; long-term collaboration possible.
3
26
61
0
0
3
@ericjmichaud_
Eric J. Michaud
2 years
@lieberum_t "Grokking" seems to not typically happen very fast. In the key plot from the original grokking paper, generalization happens over the final 900k (of 1M) training steps, and it just appears sudden due to the use of a log scale on the x-axis.
Tweet media one
1
0
3
@ericjmichaud_
Eric J. Michaud
3 years
This work was done during my internship with @CHAI_Berkeley . Many thanks to everyone at CHAI for your support!
0
0
2
@ericjmichaud_
Eric J. Michaud
8 years
Life goal: Survive junior year. Status: Complete
1
0
2
@ericjmichaud_
Eric J. Michaud
4 years
@erikphoel Jill Tarter
1
0
2
@ericjmichaud_
Eric J. Michaud
4 years
In the paper, we define and measure variants of "effective information" and "integrated information" in feedforward deep neural networks. We hope that these will provide foundational tools for understanding both the training dynamics and the learned structure of DNNs.
1
1
2
@ericjmichaud_
Eric J. Michaud
6 months
@michaeljelly @moreisdifferent How to best aggregate tokens is a *very* interesting q! I think that ultimately you'd want to group tokens together according to what mechanism the model is using to predict them, which may not correspond cleanly to bigrams, etc. Check out:
1
0
1
@ericjmichaud_
Eric J. Michaud
2 years
@lieberum_t Oh very interesting! Our effective theory definitely does not account for these sorts of differences (only looking at the task of addition, for a toy model, ignoring the decoder). Agree that it would be cool to explain how the timing/speed of generalization depends on task!
0
0
1
@ericjmichaud_
Eric J. Michaud
4 years
@AstroShashank Hahaha I had no idea this started with you! I first saw it in a group chat. It's all over the place!
0
0
1
@ericjmichaud_
Eric J. Michaud
3 years
How can you tell if a learned reward function captures user preferences? We apply some standard ML interpretability techniques towards understanding what learned reward functions are doing in a few RL environments.
1
0
1
@ericjmichaud_
Eric J. Michaud
7 years
Funny how we still use “worldview” to describe our ultimate perspective on things, considering how microscopic our world really is.
Tweet media one
0
0
1
@ericjmichaud_
Eric J. Michaud
2 years
@NeelNanda5 Whether we project onto the principal components re-computed at each step (1st vid) or project onto the principal components computed at the end of training (2nd vid).
0
0
1
@ericjmichaud_
Eric J. Michaud
9 years
Last night, as proved by the photo, I estimated the resolution of the new MacBook to a degree of 99.8% accuracy http://t.co/6Cj5T0aTI4
Tweet media one
1
0
1
@ericjmichaud_
Eric J. Michaud
4 years
@dwarkesh_sp @paulg This essay by @michael_nielsen comes to mind:
0
0
1
@ericjmichaud_
Eric J. Michaud
9 years
Tragic day for NASA and SpaceX as the Falcon-9 rocket breaks up approx. 2 minutes into flight on its ISS resupply mission.
0
0
1
@ericjmichaud_
Eric J. Michaud
9 years
T-4 minutes to Falcon-9 Launch. Anyone who's awake should watch this possibly historic moment! http://t.co/AtRsN7iKbY
0
0
1
@ericjmichaud_
Eric J. Michaud
3 years
As a closing thought, I also wonder whether future interpretability techniques, coupled with sophisticated reward learning, could be a kind of "microscope AI" for improving our understanding of human values and human well-being. @ch402 @nickcammarata @SamHarrisOrg
1
0
1
@ericjmichaud_
Eric J. Michaud
3 years
Our paper is a tentative step in this direction. We hope that more advanced interpretability techniques will someday allow researchers to more comprehensively open up AI systems and verify that such systems understand and are aligned with human values.
1
0
1
@ericjmichaud_
Eric J. Michaud
3 years
However, current algorithms for reward learning can fail silently. Absent perfect reward learning, we therefore need techniques for auditing learned reward functions -- for scrutinizing a machine's understanding of human preferences.
1
0
1