Jascha Sohl-Dickstein Profile
Jascha Sohl-Dickstein

@jaschasd

18,849 Followers · 632 Following · 75 Media · 540 Statuses

Member of the technical staff @ Anthropic. Most (in)famous for inventing diffusion models. AI + physics + neuroscience + dynamics.

San Francisco
Joined August 2009
Pinned Tweet
@jaschasd
Jascha Sohl-Dickstein
2 years
My first blog post ever! Be harsh, but, you know, constructive. Too much efficiency makes everything worse: overfitting and the strong version of Goodhart's law 🧵
36
182
980
@jaschasd
Jascha Sohl-Dickstein
3 months
Have you ever done a dense grid search over neural network hyperparameters? Like a *really dense* grid search? It looks like this (!!). Bluish colors correspond to hyperparameters for which training converges, reddish colors to hyperparameters for which training diverges.
279
2K
10K
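The kind of experiment described above can be sketched in a few lines. This is a hedged toy version, not the code behind the figures: the task, network size, and hyperparameter ranges are assumptions for illustration.

```python
# Dense 2D grid search over learning rate and initialization scale for a tiny
# tanh network, recording for each cell whether full-batch gradient descent
# converges ("bluish" in the figures) or diverges ("reddish").
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))       # toy inputs
y = rng.standard_normal((64, 1))       # toy regression targets
W1_0 = rng.standard_normal((8, 16))    # one shared random init, rescaled per cell
W2_0 = rng.standard_normal((16, 1))

def final_loss(lr, init_scale, steps=500):
    W1, W2 = init_scale * W1_0, init_scale * W2_0
    for _ in range(steps):
        h = np.tanh(X @ W1)
        err = h @ W2 - y
        loss = np.mean(err ** 2)
        if not np.isfinite(loss):
            return np.inf                           # training diverged
        gW2 = h.T @ (2 * err) / len(X)              # backprop through both layers
        gW1 = X.T @ ((2 * err) @ W2.T * (1 - h ** 2)) / len(X)
        W1, W2 = W1 - lr * gW1, W2 - lr * gW2
    return loss

lrs = np.logspace(-3, 1, 50)
scales = np.logspace(-2, 1, 50)
grid = np.array([[final_loss(lr, s) for lr in lrs] for s in scales])
converged = np.isfinite(grid)          # the converge/diverge map over the grid
```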
@jaschasd
Jascha Sohl-Dickstein
2 years
After 2 years of work by 442 contributors across 132 institutions, I am thrilled to announce that the paper is now live: . BIG-bench consists of 204 diverse tasks to measure and extrapolate the capabilities of large language models.
37
574
3K
@jaschasd
Jascha Sohl-Dickstein
4 years
"Finite Versus Infinite Neural Networks: an Empirical Study." This paper contains everything you ever wanted to know about infinite width networks, but didn't have the computational capacity to ask! Like really a lot of content. Let's dive in.
7
503
2K
@jaschasd
Jascha Sohl-Dickstein
3 months
The boundary between trainable and untrainable neural network hyperparameter configurations is *fractal*! And beautiful! Here is a grid search over a different pair of hyperparameters -- this time learning rate and the mean of the parameter initialization distribution.
26
175
1K
@jaschasd
Jascha Sohl-Dickstein
4 years
Modern deep learning is a story of learned features outperforming (then replacing!) hand-designed algorithms. But we still use hand-designed loss functions and optimizers. Here is a big step towards learned optimizers outperforming existing optimizers:
6
205
1K
@jaschasd
Jascha Sohl-Dickstein
1 year
If there is one thing the deep learning revolution has taught us, it's that neural nets will outperform hand-designed heuristics, given enough compute and data. But we still use hand-designed heuristics to train our models. Let's replace our optimizers with trained neural nets!
25
136
905
@jaschasd
Jascha Sohl-Dickstein
5 years
"Eliminating All Bad Local Minima from Loss Landscapes Without Even Adding an Extra Unit." It's less than one page. It may be deep. It may be trivial. It will definitely help you understand how some claims in recent theory papers could possibly be true.
6
178
704
@jaschasd
Jascha Sohl-Dickstein
6 years
"Adversarial Reprogramming of Neural Networks." A new goal for adversarial attacks! Rather than cause a specific misclassification, we force neural networks to behave as if they were trained on a completely different task! With @gamaleldinfe, @goodfellow_ian
9
263
696
@jaschasd
Jascha Sohl-Dickstein
2 years
For years I've shown this 2x2 grid in talks on infinite width networks, but with just a big ❓ in the upper-left. No longer! In we characterize wide Bayesian neural nets in parameter space. This fills a theory gap, and enables *much* faster MCMC sampling.
7
90
609
@jaschasd
Jascha Sohl-Dickstein
4 years
Whitening and second order optimization both destroy information about the dataset, and can make generalization impossible: We examine what information is usable for training neural networks, and how second order methods destroy exactly that information.
9
113
551
@jaschasd
Jascha Sohl-Dickstein
1 year
The hot mess theory of AI misalignment (+ an experiment!). There are two ways an AI could be misaligned. It could monomaniacally pursue the wrong goal (supercoherence), or it could act in ways that don't pursue any consistent goal (hot mess/incoherent).
29
93
543
@jaschasd
Jascha Sohl-Dickstein
5 years
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent <--- this should blow your mind a bit!! Also holds for convolutional networks, batch norm, ... Also, closed form for test predictions resulting from gradient descent training.
9
159
516
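For reference, the closed form alluded to above reduces, for MSE loss under gradient flow as training time goes to infinity, to kernel regression with the NTK (Lee et al., 2019). A minimal sketch, with `ntk` as a placeholder kernel function and all names illustrative:

```python
import numpy as np

def ntk_predict(ntk, X_train, y_train, X_test, f0_train, f0_test):
    """f(x*) = f0(x*) + Theta(x*, X) Theta(X, X)^{-1} (y - f0(X))."""
    theta_tt = ntk(X_train, X_train)    # Theta(X, X)
    theta_st = ntk(X_test, X_train)     # Theta(x*, X)
    return f0_test + theta_st @ np.linalg.solve(theta_tt, y_train - f0_train)
```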
@jaschasd
Jascha Sohl-Dickstein
4 years
Two simple equalities expressing matrix determinants as expectations over matrix-vector products. Entire paper in attached image. :P It's fun to write short notes like this. Hopefully useful in areas like normalizing flows and Gaussian process evaluation.
11
94
507
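The flavor of identity involved can be illustrated with a standard example, not necessarily one of the two equalities in the note: for symmetric positive-definite A, det(A)^(-1/2) = E_{z~N(0,I)}[exp(-z^T (A - I) z / 2)], which touches A only through matrix-vector products.

```python
# Monte Carlo check of the classic identity above; a hedged illustration of the
# genre, not a transcription of the paper's equalities.
import numpy as np

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                            # symmetric positive definite

z = rng.standard_normal((200_000, n))                  # z ~ N(0, I) samples
quad = np.einsum('ij,ij->i', z, z @ (A - np.eye(n)))   # z^T (A - I) z, via A @ z
mc_estimate = np.exp(-0.5 * quad).mean()
exact = np.linalg.det(A) ** -0.5
print(mc_estimate, exact)                              # the two should roughly agree
```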
@jaschasd
Jascha Sohl-Dickstein
3 months
So it shouldn't (post-hoc) be a surprise that hyperparameter landscapes are fractal. This is a general phenomenon: in these panes we see fractal hyperparameter landscapes for every neural network configuration I tried, including deep linear networks.
8
26
491
@jaschasd
Jascha Sohl-Dickstein
3 months
The best performing hyperparameters are typically at the edge of stability -- so when you optimize neural network hyperparameters, you are contending with hyperparameter landscapes that look like this.
20
25
410
@jaschasd
Jascha Sohl-Dickstein
5 years
Neural reparameterization improves structural optimization! By parameterizing physical design in terms of the (constrained) output of a neural network, we propose stronger and more elegant bridges, skyscrapers, and cantilevers. With @shoyer @samgreydanus
3
74
373
@jaschasd
Jascha Sohl-Dickstein
3 months
There are similarities between the way in which many fractals are generated, and the way in which we train neural networks. Both involve repeatedly applying a function to its own output. In both cases, that function has hyperparameters that control its behavior.
5
14
377
@jaschasd
Jascha Sohl-Dickstein
4 years
Infinite width networks (NNGPs and NTKs) are the most promising lead for theoretical understanding in deep learning. But, running experiments with them currently resembles the dark age of ML research before ubiquitous automatic differentiation. Neural Tangents fixes that.
@sschoenholz
Sam Schoenholz
4 years
The core of Neural Tangents is a high level neural network library. Any network specified in Neural Tangents automatically comes with a function to compute the infinite-width limit analytically. Here's an example for a two-hidden layer FC network:
2
24
68
2
71
330
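A minimal sketch of the usage pattern described in the quoted tweet, using the public Neural Tangents API; layer widths and data shapes below are illustrative, not taken from the original figure.

```python
import jax.numpy as jnp
from neural_tangents import stax

# Define a two-hidden-layer fully connected network; the same definition yields
# init/apply functions for the finite network and an analytic kernel function
# for the infinite-width limit.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

x1 = jnp.ones((4, 8))              # toy inputs
x2 = jnp.ones((2, 8))

nngp = kernel_fn(x2, x1, 'nngp')   # infinite-width Bayesian (NNGP) kernel
ntk = kernel_fn(x2, x1, 'ntk')     # neural tangent kernel
```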
@jaschasd
Jascha Sohl-Dickstein
3 years
@PBFcomics As another layer: rats can also (probably) echolocate. So they're both cheating, the rat is just worse at it.
3
8
318
@jaschasd
Jascha Sohl-Dickstein
6 years
Learning to sample using deep neural networks! Hamiltonian Monte Carlo + Real NVP == trainable MCMC sampler that generalizes, and far outperforms, HMC.
5
111
312
@jaschasd
Jascha Sohl-Dickstein
3 months
I don't have a SoundCloud, but I did join Anthropic last week, and so far it has exceeded my (high) expectations. I would strongly recommend working there (and using Claude). *this project not done at Anthropic -- this was recreational machine learning on my own time.
12
11
317
@jaschasd
Jascha Sohl-Dickstein
3 years
CALL FOR TASKS CAPTURING LIMITATIONS OF LARGE LANGUAGE MODELS: We are soliciting contributions of tasks to a *collaborative* benchmark designed to measure and extrapolate the capabilities and limitations of large language models. Submit tasks at #BIGbench
14
73
278
@jaschasd
Jascha Sohl-Dickstein
4 years
A simple prescription that will improve your models: When using LayerNorm, do mean subtraction *before* rather than after the affine transformation. This, and an in-depth empirical investigation of statistical properties of common normalizers in
2
35
249
@jaschasd
Jascha Sohl-Dickstein
3 months
In both cases the function iteration can produce outputs that either diverge to infinity or remain happily bounded depending on those hyperparameters. Fractals are often defined by the boundary between hyperparameters where function iteration diverges or remains bounded.
2
3
250
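The analogy in this thread can be made concrete with two toy loops (illustrative constants; the fractal images above came from real training runs, not from this sketch): a classic fractal-generating map and gradient descent on a quadratic are both repeated function application, and each either stays bounded or blows up depending on a single parameter.

```python
import numpy as np

def quadratic_map_diverges(c, steps=100):
    """Iterate z <- z^2 + c (the Mandelbrot map); c is the 'hyperparameter'."""
    z = 0.0 + 0.0j
    for _ in range(steps):
        z = z * z + c
        if abs(z) > 2.0:
            return True
    return False

def gd_diverges(lr, curvature=1.0, steps=100):
    """Iterate gradient descent on L(w) = curvature * w^2 / 2; lr is the hyperparameter."""
    w = 1.0
    for _ in range(steps):
        w = w - lr * curvature * w     # diverges when lr > 2 / curvature
        if abs(w) > 1e6:
            return True
    return False

print(quadratic_map_diverges(1.0), gd_diverges(2.5))    # True, True (both blow up)
print(quadratic_map_diverges(-1.0), gd_diverges(1.5))   # False, False (both stay bounded)
```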
@jaschasd
Jascha Sohl-Dickstein
4 years
Neural Network Gaussian Processes (NNGPs) correspond to wide Bayesian neural networks! In  we show that the posterior distribution over functions computed by a Bayesian neural network converges to the posterior of the NNGP as layer width grows large.
2
51
238
@jaschasd
Jascha Sohl-Dickstein
6 months
"Levels of AGI: Operationalizing Progress on the Path to AGI." Levels of Autonomous Driving are extremely useful for communicating capabilities, setting regulation, and defining goals in self-driving. We propose analogous Levels of *AGI*. (ChatGPT is a Level 1 "Emerging" AGI)
14
62
246
@jaschasd
Jascha Sohl-Dickstein
3 years
I am *extremely* proud to share that we were awarded the ICML outstanding paper award! Major credit and thanks to my collaborators @PaulVicol and @Luke_Metz ! Paul especially owned every part of this project, and I think his care and extreme thoroughness are the reasons we won.
@icmlconf
ICML Conference
3 years
ICML 2021 Outstanding Paper Award: • Paul Vicol, Luke Metz, and Jascha Sohl-Dickstein 📜Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies (Tuesday 9pm US Eastern)
1
17
163
9
6
238
@jaschasd
Jascha Sohl-Dickstein
4 years
Build an infinite width neural network with the same code you use to define your finite width neural network.
@GoogleAI
Google AI
4 years
Announcing Neural Tangents, a new easy-to-use, open-source neural network library that enables researchers to build finite- and infinite-width versions of neural networks simultaneously. Grab the code and try it for yourself at
13
624
2K
1
37
235
@jaschasd
Jascha Sohl-Dickstein
6 years
"Sensitivity and Generalization in Neural Networks: an Empirical Study." Neural nets generalize better when they're larger and less sensitive to their inputs, are less sensitive near training data than away from it, and other results from massive experiments.
0
78
228
@jaschasd
Jascha Sohl-Dickstein
4 years
"Your GAN is Secretly an Energy-based Model and You Should use Discriminator Driven Latent Sampling" This technique can dramatically improve existing trained GANs, by re-interpreting them as an easy-to-sample-from energy based model in the latent space.
9
46
229
@jaschasd
Jascha Sohl-Dickstein
5 years
Batch norm causes chaos and gradient explosion in the output of deep networks: figure below shows two nearly identical minibatches going through a random *linear* network with batch norm, and becoming completely dissimilar by depth 30! Much, much more at:
1
43
203
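A sketch of the experiment described above, with guessed batch size and width; the qualitative effect in the original figure is two nearly identical minibatches decorrelating as depth grows.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, width, depth = 128, 256, 30

x_a = rng.standard_normal((batch, width))
x_b = x_a + 1e-3 * rng.standard_normal((batch, width))    # nearly identical copy

def batch_norm(h, eps=1e-5):
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

for layer in range(depth):
    W = rng.standard_normal((width, width)) / np.sqrt(width)   # random *linear* layer
    x_a, x_b = batch_norm(x_a @ W), batch_norm(x_b @ W)        # same weights for both
    cos = np.sum(x_a * x_b) / (np.linalg.norm(x_a) * np.linalg.norm(x_b))
    print(f"depth {layer + 1:2d}: cosine similarity {cos:.3f}")
```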
@jaschasd
Jascha Sohl-Dickstein
6 years
Everything you wanted to know about the role of batch size in neural net training, but didn't have the computational resources to ask! With Chris Shallue, Jaehoon Lee, Joe @joe_antognini , Roy Frostig, and George Dahl.
1
49
200
@jaschasd
Jascha Sohl-Dickstein
6 years
Guided evolutionary strategies: escaping the curse of dimensionality in random search. A principled method to leverage training signals which are not the gradient, but which may be correlated with the gradient. Work with @niru_m @Luke_Metz @georgejtucker .
3
63
191
@jaschasd
Jascha Sohl-Dickstein
5 years
A careful empirical study of the effect of network width on generalization and fixed learning rate SGD, for MLPs, convnets, resnets, and batch norm. With superstar resident Daniel Park, and @quocleix + Sam Smith.
2
44
189
@jaschasd
Jascha Sohl-Dickstein
8 months
Here is a brain dump of my thoughts about how AI might go wrong: AI has the power to change the world in both wonderful and terrible ways. With hard work, I expect AI to lead to far more good than harm. But part of achieving that is thinking about risk.
7
29
179
@jaschasd
Jascha Sohl-Dickstein
4 years
Infinite width limits (NNGP and NTK) for neural networks with self-attention . This fills in the last common architectural component which did not have an infinite width correspondence! Along the way we improve on the standard softmax attention mechanism.
2
30
143
@jaschasd
Jascha Sohl-Dickstein
4 months
I'm running an experiment, and holding some public office hours (inspired by seeing @kchonyc do something similar). Come talk with me about anything! Ask for advice on your research or startup or career or I suppose personal life, brainstorm new research ideas, complain about…
6
9
140
@jaschasd
Jascha Sohl-Dickstein
2 years
I think we will increasingly build systems out of many large models interacting with each other. I think the cascades perspective -- write down a probabilistic graphical model, but with every node a language model -- is the right formalism for describing these systems.
@dmdohan
David Dohan
2 years
Happy to release our work on Language Model Cascades. Read on to learn how we can unify existing methods for interacting models (scratchpad/chain of thought, verifiers, tool-use, …) in the language of probabilistic programming. paper:
3
99
668
1
12
126
@jaschasd
Jascha Sohl-Dickstein
6 years
Stochastic natural gradient descent corresponds to Bayesian training of neural networks, with a modified prior. This equivalence holds *even away from local minima*. Very proud of this work with Sam Smith, Daniel Duckworth, and Quoc Le.
2
36
125
@jaschasd
Jascha Sohl-Dickstein
4 years
Research on the Neural Tangent Kernel (NTK) almost exclusively uses a non-standard neural network parameterization, where activations are divided by sqrt(width), and weights are initialized to have variance 1 rather than variance 1/width.
2
23
120
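The difference between the two parameterizations, for a single dense layer (shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 512
x = rng.standard_normal(fan_in)

# Standard parameterization: weights initialized with variance 1 / fan_in.
W_std = rng.standard_normal((fan_out, fan_in)) / np.sqrt(fan_in)
y_std = W_std @ x

# NTK parameterization: weights have variance 1, activations divided by sqrt(fan_in).
W_ntk = rng.standard_normal((fan_out, fan_in))
y_ntk = W_ntk @ x / np.sqrt(fan_in)

# Pre-activations have the same scale at init; the two parameterizations differ
# in how gradients with respect to the weights scale during training.
print(y_std.std(), y_ntk.std())
```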
@jaschasd
Jascha Sohl-Dickstein
2 years
Performance on some tasks improves smoothly with model scale, while on others there is sudden breakthrough performance at a critical scale.
2
12
118
@jaschasd
Jascha Sohl-Dickstein
1 year
Living creatures, human organizations, and machine learning models are all judged to become *more of a hot mess (less coherent) as they grow more intelligent*. This suggests that AI failing to pursue a consistent goal is more likely than AI pursuing a misaligned goal.
9
15
113
@jaschasd
Jascha Sohl-Dickstein
3 months
@DanielDugas14 Go for it! All the raw images are here:
3
4
109
@jaschasd
Jascha Sohl-Dickstein
2 years
Models can learn unexpected skills that are only implicitly contained in the training data -- for instance, how to make legal moves in chess.
2
11
100
@jaschasd
Jascha Sohl-Dickstein
4 years
This is very cool work. Read this if you want to really, really understand how a neural network solves a specific problem -- like actual scientific understanding.
@niru_m
Niru Maheswaranathan
4 years
#tweeprint time for our new work out on arXiv!📖We've been trying to understand how recurrent neural networks (RNNs) work, by reverse engineering them using tools from dynamical systems analysis—with @SussilloDavid .
9
264
918
1
16
103
@jaschasd
Jascha Sohl-Dickstein
3 months
@NaveenGRao It's actually not so bad! Width 16 one hidden layer neural network, and I only computed new images tiled by every factor of 2 in zoom scale -- so about 50 grid searches needed to be run for the entire video. It took overnight on an A100.
6
0
98
@jaschasd
Jascha Sohl-Dickstein
1 month
This was a fun project! If you could train an LLM over text arithmetically compressed using a smaller LLM as a probabilistic model of text, it would be really good. Text would be represented with far fewer tokens, and inference would be way faster and cheaper. The hard part is…
@noahconst
Noah Constant
1 month
Ever wonder why we don’t train LLMs over highly compressed text? Turns out it’s hard to make it work. Check out our paper for some progress that we’re hoping others can build on. With @blester125 , @hoonkp , @alemi , Jeffrey Pennington, @ada_rob , @jaschasd
2
9
69
3
8
95
@jaschasd
Jascha Sohl-Dickstein
6 years
Bayesian CNNs with many channels are Gaussian processes! One can compute test set predictions that would have resulted from fully Bayesian training of a CNN, but without ever instantiating a CNN, and instead by evaluating the corresponding GP.
1
27
93
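"Evaluating the corresponding GP" here means ordinary GP regression with the NNGP kernel. A minimal sketch, with `nngp_kernel` as a placeholder for the architecture-specific kernel:

```python
import numpy as np

def gp_posterior(nngp_kernel, X_train, y_train, X_test, noise=1e-2):
    """Posterior predictive mean and covariance of the infinite-width Bayesian net."""
    K = nngp_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = nngp_kernel(X_test, X_train)
    K_ss = nngp_kernel(X_test, X_test)
    mean = K_s @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, cov
```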
@jaschasd
Jascha Sohl-Dickstein
9 months
Adversarial attacks designed to fool computer vision models, *transfer (weakly) to the human brain* -- even when the attack is so small as to be barely perceptible. Nature Comms paper:
@gamaleldinfe
Gamaleldin Elsayed
9 months
We present human participants with two nearly identical images, each has different adversarial perturbations generated by ANNs. We find that in each experiment, human perception is consistently biased by the adversarial perturbation in the direction predicted by the ANN.
4
23
136
0
6
92
@jaschasd
Jascha Sohl-Dickstein
1 year
Intuitive extensions to standard notation that make it less ambiguous for common math in machine learning. This should become common practice in ML papers. This could have saved past me cumulative days of confusion (and worse, misinterpretations I probably never discovered).
@srush_nlp
Sasha Rush (ICLR)
1 year
Named Tensor Notation (TMLR version, w/ @davidweichiang + @boazbaraktcs ) A rigorous description, opinionated style guide, and gentle polemic for named tensors in math notation. * Macros:
13
93
477
1
16
92
@jaschasd
Jascha Sohl-Dickstein
2 years
While language models do better as they are made larger, they still do poorly on BIG-bench relative to humans.
3
5
88
@jaschasd
Jascha Sohl-Dickstein
4 years
This is a meta-learned list of optimization hyperparameters. Try these hyperparameters in this order for fun, profit, and better performing models with less compute!! A sequence of magic numbers beyond Karpathy's constant! JAX, PyTorch, & TensorFlow code:
@Luke_Metz
Luke Metz
4 years
Excited to share our new work! We introduce a dataset of tasks for learned optimizer research. As an example application of this dataset we meta-train lists of optimizer hyper parameters that work well on a diverse set of tasks. 1/4
3
68
236
0
16
85
@jaschasd
Jascha Sohl-Dickstein
1 year
If you are training models with < 5e8 parameters, for < 2e5 training steps, then with high probability this LEARNED OPTIMIZER will beat or match the tuned optimizer you are currently using, out of the box, with no hyperparameter tuning (!).
1
11
85
@jaschasd
Jascha Sohl-Dickstein
2 years
Models become consistently more socially biased as they are made larger (likely because they do a better job at capturing ever more subtle biases in their training data). There is reduced or even decreasing bias with scale when context makes it clear that bias is undesirable.
4
7
81
@jaschasd
Jascha Sohl-Dickstein
3 years
I think this will be a very important paper. My take: by unrolling SGD training steps and treating them as part of the NN architecture, computing the kernel after training (w/ feature learning) becomes equivalent to computing the NNGP kernel of the extended architecture.
@TheGregYang
Greg Yang
3 years
1/ Existing theories of neural networks (NN) like NTK don't learn features so can't explain success of pretraining (e.g. BERT, GPT3). We derive the *feature learning* ∞-width limit of NNs & pretrained such an ∞-width word2vec model: it learned semantics!
4
58
387
1
11
82
@jaschasd
Jascha Sohl-Dickstein
2 years
The phenomenon of overfitting in machine learning maps onto a class of failures that frequently happen in the broader world: in politics, economics, science, and beyond. Doing too well at targeting a proxy objective can make the thing you actually care about get much, much worse.
2
3
77
@jaschasd
Jascha Sohl-Dickstein
3 years
My group in Google Brain is hiring a full time researcher, for a research team focused on learned optimizers. Are you interested in meta-learning, bilevel optimization, dynamical systems? Apply here: Please reach out with any questions!
@Luke_Metz
Luke Metz
3 years
Interested in meta-learning and learned optimizers? Our team at Google Brain is hiring a full time researcher! Feel free to reach out to myself or @jaschasd for more information.
2
36
168
0
13
78
@jaschasd
Jascha Sohl-Dickstein
3 months
I’ve been daydreaming about an AI+audio product that I think recently became possible: virtual noise canceling headphones. I hate loud background noise -- BART trains, airline cabins, road noise, ... 🙉. I would buy the heck out of this product, and would love it if it were built…
7
4
76
@jaschasd
Jascha Sohl-Dickstein
2 years
Overall, sparse models perform as well as dense models that use ~2x more inference compute, but they are as well calibrated as dense models using ~10x more inference compute.
1
2
74
@jaschasd
Jascha Sohl-Dickstein
2 years
@ericjang11 This is for the same reason that neural networks are often poorly calibrated. NNs are good at producing a vector that points in the right direction, but bad at getting the magnitude correct. For classification, you just need to get the vector direction right.
3
2
71
@jaschasd
Jascha Sohl-Dickstein
2 years
End of 🧵. Here's a bonus plot from the blog post, about how models overfit the most when their capacity most closely matches the complexity of the problem. In case, like me, you're the kind of person who likes plots.
4
2
69
@jaschasd
Jascha Sohl-Dickstein
2 years
Finally -- *so many thanks* to all my collaborators! And especially to the co-organizers, who did so much, and were a constant pleasure to work with! I can't fit names in a tweet without leaving out very important people, so see the attached list of contributors + contributions.
1
1
70
@jaschasd
Jascha Sohl-Dickstein
4 years
Infinite width neural networks enable more compute-efficient Neural Architecture Search!
@hoonkp
Jaehoon Lee
4 years
Can we leverage the power of infinite-width limit to help with Neural Architecture Search (NAS)? In this new paper (), we find that empirical NNGP can provide cheap and effective signals that can be used for NAS!
1
23
92
1
17
68
@jaschasd
Jascha Sohl-Dickstein
6 years
Learned optimizers with less mind-numbing pain! We analyze, and propose a solution to, pathologies in meta-training via unrolled optimization. Then we meta-train an optimizer targeting CNN training that outperforms SGD/Adam by 5x (!!!) in wall-clock time.
2
17
69
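A toy version of meta-training via unrolled optimization, the general recipe discussed above: two meta-parameters stand in for a full learned optimizer, and the meta-gradient is taken through the whole unroll. Names and constants are assumptions for illustration.

```python
import jax
import jax.numpy as jnp

def inner_loss(w):
    return 0.5 * jnp.sum((w - 1.0) ** 2)              # toy inner task: reach w = 1

def unrolled_meta_loss(theta, w0, T=20):
    log_lr, momentum = theta                          # the "learned optimizer" is just
    lr = jnp.exp(log_lr)                              # a learning rate and a momentum
    w, v = w0, jnp.zeros_like(w0)
    for _ in range(T):                                # unrolled inner optimization
        g = jax.grad(inner_loss)(w)
        v = momentum * v + g
        w = w - lr * v
    return inner_loss(w)                              # meta-loss: final inner loss

theta = jnp.array([-3.0, 0.5])                        # meta-parameters
w0 = jnp.array([5.0, -2.0])
meta_grad = jax.jit(jax.grad(unrolled_meta_loss))
for _ in range(100):                                  # meta-training loop
    theta = theta - 0.01 * meta_grad(theta, w0)       # small steps: long unrolls are
print(theta)                                          # where the pathologies appear
```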
@jaschasd
Jascha Sohl-Dickstein
6 years
If you are using PCA to visualize neural network training trajectories, you are interpreting it wrong! Very proud of this work with @joe_antognini : "PCA of high dimensional random walks with comparison to neural network training"
0
20
69
@jaschasd
Jascha Sohl-Dickstein
5 years
: @laurent_dinh is the most fun to work with. He always has extremely novel ideas ... and makes the most mesmerizing animations.
@laurent_dinh
Laurent Dinh
5 years
Exploring inference and learning with non-invertible “flows” to learn deep mixture models with RAD: (with @jaschasd , @rpascanu , and @hugo_larochelle )
1
57
258
0
4
67
@jaschasd
Jascha Sohl-Dickstein
2 years
... with many more observations in the paper. We also release score files and transcripts from models across six orders of magnitude of scale performing BIG-bench tasks, along with human baselines for most tasks. We hope this will be a goldmine for future research.
1
2
65
@jaschasd
Jascha Sohl-Dickstein
3 years
Come learn about our (outstanding paper award 😃) work building generative models by running SDEs backwards in time -- ICLR poster session in 30 minutes!
@DrYangSong
Yang Song
3 years
Thrilled to share that our paper "Score-Based Generative Modeling through Stochastic Differential Equations" has won an Outstanding Paper Award at ICLR 2021! Huge shoutouts to my awesome collaborators: @jaschasd @dpkingma @studentofml @StefanoErmon @poolio !
16
24
300
0
5
64
@jaschasd
Jascha Sohl-Dickstein
2 years
BIG-bench is a living benchmark. You can submit new tasks to be an author on future publications, and can submit new model evaluations to automatically be included in the BIG-bench leaderboards.
1
3
60
@jaschasd
Jascha Sohl-Dickstein
4 years
All of these experiments were made possible by the Neural Tangents software library. You should use it for all your infinite width network needs!
1
10
60
@jaschasd
Jascha Sohl-Dickstein
6 years
Meta-learning for unsupervised representation learning! Learn unsupervised learning rules that directly target the properties of the representation you care about.
@Luke_Metz
Luke Metz
6 years
Check out our new work on Learning Unsupervised Learning Rules! Done with my amazing collaborators @niru_m @thisismyhat @jaschasd
3
67
242
0
19
60
@jaschasd
Jascha Sohl-Dickstein
8 months
I just got offered a free standing desk or office chair if I posted two positive tweets about a company! I think that means I'm an influencer now.
7
0
56
@jaschasd
Jascha Sohl-Dickstein
1 year
See the post for details -- including discussion of the many ways these results are speculative and could be improved. This is my second blog post ever -- please continue to be harsh but also constructive!
2
2
55
@jaschasd
Jascha Sohl-Dickstein
3 months
@LeonDerczynski Yes! Imagine seeing that in a physics experiment...
2
0
54
@jaschasd
Jascha Sohl-Dickstein
2 years
If there's one thing that AI will bring, it's dramatically greater efficiency across many domains. We should expect that this will cause similarly dramatic harmful unintended consequences, in every domain AI touches, *all at once*. This is going to be a hard period of history.
1
7
52
@jaschasd
Jascha Sohl-Dickstein
3 months
@eigenstate Yes! I talk about this briefly in the blog post / paper. Newton updates are different than (S)GD, but they are a definite proof of principle that optimization can lead to fractals. (on the other hand, unsurprising post-hoc is of course not the same as identifying in advance that…
0
0
51
@jaschasd
Jascha Sohl-Dickstein
4 years
Along the way, we show that predictions from trained fully connected networks are COMPLETELY DETERMINED by the datapoint-datapoint second moment matrix of the training and test data! The trained network is literally unable to base predictions on anything other than this matrix.
2
8
47
@jaschasd
Jascha Sohl-Dickstein
3 years
"Creating noise from data is easy; creating data from noise is generative modeling." I am very excited about this work!!! SOTA image models from stochastic differential equations, with more surprising theoretical properties and connections to other techniques than fit in a tweet.
@DrYangSong
Yang Song
3 years
Happy to announce our new work on score-based generative modeling: high quality samples, exact log-likelihoods, and controllable generation, all available through score matching and Stochastic Differential Equations (SDEs)! Paper:
6
141
695
1
4
44
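A toy sketch of the idea (not the paper's models): run a simple VP-style forward SDE that turns data into noise, then generate by integrating the reverse-time SDE with Euler-Maruyama. To stay self-contained, the "data" here is a single point, so the exact score of the noised distribution is available in closed form in place of a learned score network.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, dim, n_steps = 10.0, 2, 1000
x_data = np.array([2.0, -1.0])                        # the lone "data" point

def score(x, t):
    # Forward marginal: x_t ~ N(x_data * exp(-beta t / 2), (1 - exp(-beta t)) I),
    # so grad_x log p_t(x) is analytic for this toy data distribution.
    mean = x_data * np.exp(-0.5 * beta * t)
    var = 1.0 - np.exp(-beta * t)
    return -(x - mean) / var

# Reverse-time Euler-Maruyama, from t = 1 (~pure noise) down to t ~ 0 (data).
x = rng.standard_normal(dim)                          # start from the prior N(0, I)
ts = np.linspace(1.0, 1e-3, n_steps)
dt = ts[0] - ts[1]
for t in ts:
    drift = -0.5 * beta * x - beta * score(x, t)      # f(x, t) - g(t)^2 * score(x, t)
    x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(dim)
print(x)                                              # should land near x_data
```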
@jaschasd
Jascha Sohl-Dickstein
3 months
@ChenTessler Exactly right. And, every point uses the same random initialization. (If you changed the random initialization you would get a different fractal.)
3
0
43
@jaschasd
Jascha Sohl-Dickstein
4 years
Check out our review of recent efforts to apply techniques from statistical physics to better understand deep learning, targeted at a physics audience. It tries to be approachable for non-experts in machine learning.
@SuryaGanguli
Surya Ganguli
4 years
1/ Our new paper in @AnnualReviews of Condensed Matter Physics on “Statistical Mechanics of #DeepLearning ” with awesome collaborators @Stanford and @GoogleAI : @yasamanbb @kadmonj Jeff Pennington @sschoenholz @jaschasd web: free:
2
125
431
0
5
42
@jaschasd
Jascha Sohl-Dickstein
5 years
Including a massive, well-curated dataset mapping hyperparameter configuration to model performance. This may be a useful resource in your own research.
@GoogleAI
Google AI
5 years
Data parallelism can improve the training of #NeuralNetworks , but how to obtain the most benefit from this technique isn’t obvious. Check out new research that explores different architectures, batch sizes, and datasets to optimize training efficiency.
4
179
496
0
6
39
@jaschasd
Jascha Sohl-Dickstein
2 years
If this analogy is useful, it will be because solutions to overfitting in machine learning also map over. Can we use techniques like noise regularization, or early stopping, to help fix problems with education, political elections, and free markets?
1
1
37
@jaschasd
Jascha Sohl-Dickstein
4 years
Infinite width Neural Network Gaussian Process (NNGP) and Neural Tangent Kernel (NTK) predictions can outperform finite networks, depending on architecture and training practices. For fully connected networks the infinite width limit reliably outperforms the finite network.
1
4
34
@jaschasd
Jascha Sohl-Dickstein
8 months
This is a very useful set of results, if for instance you want to study the training behavior of large-scale transformer models using academic-scale resources. (and Mitchell did an amazing job)
@Mitchnw
Mitchell Wortsman
8 months
Sharing some highlights from our work on small-scale proxies for large-scale Transformer training instabilities: With fantastic collaborators @peterjliu , @Locchiu , @_katieeverett , many others (see final tweet!), @hoonkp , @jmgilmer , @skornblith ! (1/15)
5
63
347
0
1
36
@jaschasd
Jascha Sohl-Dickstein
4 years
Predicting + demonstrating counterintuitive neural network training behavior:
- training at learning rates which diverge under NTK theory
- exponential *increase* in loss over first ~20 training *steps* (not epochs)
- drastic reduction in Hessian eigenvalues over first ~20 steps
@yasamanbb
Yasaman Bahri
4 years
In our preprint “The large learning rate phase of deep learning: the catapult mechanism" , we show that the choice of learning rate (LR) in (S)GD separates deep neural net dynamics into two sharply distinct types (or "phases", in the physics sense). (1/n)
1
54
315
1
10
36
@jaschasd
Jascha Sohl-Dickstein
1 year
Most work on AI misalignment risk is based on an assumption that more intelligent AI will also be more coherent. This is an assumption we can test! I collected subjective judgements of intelligence and coherence from colleagues in ML and neuro.
1
1
36
@jaschasd
Jascha Sohl-Dickstein
5 months
An excellent project making evolution strategies much more efficient for computing gradients in dynamical systems.
@OscarLi101
Oscar Li
5 months
📝Quiz time: when you have an unrolled computation graph (see figure below), how would you compute the unrolling parameters' gradients? If your answer only contains Backprop, now it’s time to add a new method to your gradient estimation toolbox!
1
13
125
0
3
35
@jaschasd
Jascha Sohl-Dickstein
8 months
This was a great blog post. I appreciated the recurring theme of technologies transitioning from being impossible to being inevitable.
@boazbaraktcs
Boaz Barak
9 months
I finally got around to reading Richard Rhodes' "The Making of the Atomic Bomb." I posted here some thoughts on the bomb and which of the lessons from it apply to AI.
5
9
89
2
3
34
@jaschasd
Jascha Sohl-Dickstein
3 years
@timnitGebru @JeffDean I + co-organizers would love contributions from researchers working on ethical AI, very much including you and the ethical AI team. Measuring a thing is a first step towards improving it, and we want to make measurement of social biases a core part of the benchmark.
4
0
34
@jaschasd
Jascha Sohl-Dickstein
3 years
Come to our workshop on Enormous Language Models! Also, submit a task to the associated benchmark, and be a co-author on the corresponding paper!
@colinraffel
Colin Raffel
3 years
📣 Announcing the ICLR 2021 Workshop on Enormous Language Models 📣 We have an incredible speaker lineup that covers building, evaluating, critiquing, and improving large LMs, as well as a collaborative participant-driven benchmark and 2 panels! More info:
6
47
250
0
7
32
@jaschasd
Jascha Sohl-Dickstein
1 year
In this paper -- we basically embraced the difficulty, and spent both massive amounts of human effort and massive amounts of compute, in order to meta-train the learned optimizer. (* by we, I especially mean @Luke_Metz and @jmes_harrison who led the project at different times)
1
2
32
@jaschasd
Jascha Sohl-Dickstein
4 years
This has surprisingly little effect on prediction accuracy, but does improve the match between networks considered by theory and used in practice. With Roman Novak, @sschoenholz , and @hoonkp . Implementation in .
1
3
31
@jaschasd
Jascha Sohl-Dickstein
2 years
Also -- the blog post isn't about AI -- but the ideas in the post very much color my own fears about how AI may go wrong. The post is about how greater efficiency can lead to harmful unintended consequences.
1
0
32
@jaschasd
Jascha Sohl-Dickstein
3 months
@oh_that_hat That is a great question!! I hadn't thought of that. My guess would be that we would see a similar fractal structure at the boundaries in the phase diagram you pasted ... but I'm not sure. It would be a fascinating experiment. (probably a lot more expensive than the experiment I…
3
0
31
@jaschasd
Jascha Sohl-Dickstein
1 year
Meta-training learned optimizers is HARD. Each meta-training datapoint is an entire optimization task, so building a large meta-training dataset is HARD. Each of N meta-training steps can contain N training steps applying the learned optimizer -- so compute is also extreme (N^2).
1
2
30
@jaschasd
Jascha Sohl-Dickstein
3 years
Bootstrapping the training of learned optimizers using randomly initialized learned optimizers. No hand designed optimizer involved (* unless you count population based training). A demonstration of the potential power of positive feedback loops in meta-learning.
@Luke_Metz
Luke Metz
3 years
[Micro paper] We train learned optimizers using other randomly initialized learned optimizers in an evolutionary process. This creates a positive feedback loop: learning to optimize enables optimizers to optimize themselves faster, accelerating training.
2
37
184
0
3
28
@jaschasd
Jascha Sohl-Dickstein
4 years
This makes NTK training dynamics dissimilar from those of standard finite width networks. (Infinite width Bayesian networks, NNGPs, don't suffer from this problem.) In we derive infinite width kernels for the *standard* parameterization, resolving this.
1
6
28
@jaschasd
Jascha Sohl-Dickstein
4 years
Analogously to the first time a compiler compiles itself, it is even capable of training itself from scratch!!! I think we are now only a short distance away from learned optimizers being the best choice for most optimization tasks (though, we're not *quite* there yet).
1
4
26