Jascha Sohl-Dickstein Profile
Jascha Sohl-Dickstein

@jaschasd

18,849 Followers · 632 Following · 75 Media · 540 Statuses

Member of the technical staff @ Anthropic. Most (in)famous for inventing diffusion models. AI + physics + neuroscience + dynamics.

San Francisco
Joined August 2009
Pinned Tweet
@jaschasd
Jascha Sohl-Dickstein
2 years
My first blog post ever! Be harsh, but, you know, constructive. Too much efficiency makes everything worse: overfitting and the strong version of Goodhart's law 🧵
36
182
980
@jaschasd
Jascha Sohl-Dickstein
3 months
Have you ever done a dense grid search over neural network hyperparameters? Like a *really dense* grid search? It looks like this (!!). Bluish colors correspond to hyperparameters for which training converges, reddish colors to hyperparameters for which training diverges.
279
2K
10K
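The kind of experiment described above can be sketched in a few lines. This is a hedged toy version, not the code behind the figures: the task, network size, and hyperparameter ranges are assumptions for illustration.

```python
# Dense 2D grid search over learning rate and initialization scale for a tiny
# tanh network, recording for each cell whether full-batch gradient descent
# converges ("bluish" in the figures) or diverges ("reddish").
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))       # toy inputs
y = rng.standard_normal((64, 1))       # toy regression targets
W1_0 = rng.standard_normal((8, 16))    # one shared random init, rescaled per cell
W2_0 = rng.standard_normal((16, 1))

def final_loss(lr, init_scale, steps=500):
    W1, W2 = init_scale * W1_0, init_scale * W2_0
    for _ in range(steps):
        h = np.tanh(X @ W1)
        err = h @ W2 - y
        loss = np.mean(err ** 2)
        if not np.isfinite(loss):
            return np.inf                           # training diverged
        gW2 = h.T @ (2 * err) / len(X)              # backprop through both layers
        gW1 = X.T @ ((2 * err) @ W2.T * (1 - h ** 2)) / len(X)
        W1, W2 = W1 - lr * gW1, W2 - lr * gW2
    return loss

lrs = np.logspace(-3, 1, 50)
scales = np.logspace(-2, 1, 50)
grid = np.array([[final_loss(lr, s) for lr in lrs] for s in scales])
converged = np.isfinite(grid)          # the converge/diverge map over the grid
```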
@jaschasd
Jascha Sohl-Dickstein
2 years
After 2 years of work by 442 contributors across 132 institutions, I am thrilled to announce that the paper is now live: . BIG-bench consists of 204 diverse tasks to measure and extrapolate the capabilities of large language models.
37
574
3K
@jaschasd
Jascha Sohl-Dickstein
4 years
"Finite Versus Infinite Neural Networks: an Empirical Study." This paper contains everything you ever wanted to know about infinite width networks, but didn't have the computational capacity to ask! Like really a lot of content. Let's dive in.
7
503
2K
@jaschasd
Jascha Sohl-Dickstein
3 months
The boundary between trainable and untrainable neural network hyperparameter configurations is *fractal*! And beautiful! Here is a grid search over a different pair of hyperparameters -- this time learning rate and the mean of the parameter initialization distribution.
26
175
1K
@jaschasd
Jascha Sohl-Dickstein
4 years
Modern deep learning is a story of learned features outperforming (then replacing!) hand-designed algorithms. But we still use hand-designed loss functions and optimizers. Here is a big step towards learned optimizers outperforming existing optimizers:
6
205
1K
@jaschasd
Jascha Sohl-Dickstein
1 year
If there is one thing the deep learning revolution has taught us, it's that neural nets will outperform hand-designed heuristics, given enough compute and data. But we still use hand-designed heuristics to train our models. Let's replace our optimizers with trained neural nets!
25
136
905
@jaschasd
Jascha Sohl-Dickstein
5 years
"Eliminating All Bad Local Minima from Loss Landscapes Without Even Adding an Extra Unit." It's less than one page. It may be deep. It may be trivial. It will definitely help you understand how some claims in recent theory papers could possibly be true.
6
178
704
@jaschasd
Jascha Sohl-Dickstein
6 years
"Adversarial Reprogramming of Neural Networks." A new goal for adversarial attacks! Rather than cause a specific misclassification, we force neural networks to behave as if they were trained on a completely different task! With @gamaleldinfe, @goodfellow_ian
9
263
696
@jaschasd
Jascha Sohl-Dickstein
2 years
For years I've shown this 2x2 grid in talks on infinite width networks, but with just a big ❓ in the upper-left. No longer! In we characterize wide Bayesian neural nets in parameter space. This fills a theory gap, and enables *much* faster MCMC sampling.
7
90
609
@jaschasd
Jascha Sohl-Dickstein
4 years
Whitening and second order optimization both destroy information about the dataset, and can make generalization impossible: We examine what information is usable for training neural networks, and how second order methods destroy exactly that information.
9
113
551
@jaschasd
Jascha Sohl-Dickstein
1 year
The hot mess theory of AI misalignment (+ an experiment!). There are two ways an AI could be misaligned. It could monomaniacally pursue the wrong goal (supercoherence), or it could act in ways that don't pursue any consistent goal (hot mess/incoherent).
29
93
543
@jaschasd
Jascha Sohl-Dickstein
5 years
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent <--- this should blow your mind a bit!! Also holds for convolutional networks, batch norm, ... Also, closed form for test predictions resulting from gradient descent training.
9
159
516
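For reference, the closed form alluded to above reduces, for MSE loss under gradient flow as training time goes to infinity, to kernel regression with the NTK (Lee et al., 2019). A minimal sketch, with `ntk` as a placeholder kernel function and all names illustrative:

```python
import numpy as np

def ntk_predict(ntk, X_train, y_train, X_test, f0_train, f0_test):
    """f(x*) = f0(x*) + Theta(x*, X) Theta(X, X)^{-1} (y - f0(X))."""
    theta_tt = ntk(X_train, X_train)    # Theta(X, X)
    theta_st = ntk(X_test, X_train)     # Theta(x*, X)
    return f0_test + theta_st @ np.linalg.solve(theta_tt, y_train - f0_train)
```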
@jaschasd
Jascha Sohl-Dickstein
4 years
Two simple equalities expressing matrix determinants as expectations over matrix-vector products. Entire paper in attached image. :P It's fun to write short notes like this. Hopefully useful in areas like normalizing flows and Gaussian process evaluation.
11
94
507
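The flavor of identity involved can be illustrated with a standard example, not necessarily one of the two equalities in the note: for symmetric positive-definite A, det(A)^(-1/2) = E_{z~N(0,I)}[exp(-z^T (A - I) z / 2)], which touches A only through matrix-vector products.

```python
# Monte Carlo check of the classic identity above; a hedged illustration of the
# genre, not a transcription of the paper's equalities.
import numpy as np

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                            # symmetric positive definite

z = rng.standard_normal((200_000, n))                  # z ~ N(0, I) samples
quad = np.einsum('ij,ij->i', z, z @ (A - np.eye(n)))   # z^T (A - I) z, via A @ z
mc_estimate = np.exp(-0.5 * quad).mean()
exact = np.linalg.det(A) ** -0.5
print(mc_estimate, exact)                              # the two should roughly agree
```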
@jaschasd
Jascha Sohl-Dickstein
3 months
So it shouldn't (post-hoc) be a surprise that hyperparameter landscapes are fractal. This is a general phenomenon: in these panes we see fractal hyperparameter landscapes for every neural network configuration I tried, including deep linear networks.
8
26
491
@jaschasd
Jascha Sohl-Dickstein
3 months
The best performing hyperparameters are typically at the edge of stability -- so when you optimize neural network hyperparameters, you are contending with hyperparameter landscapes that look like this.
20
25
410
@jaschasd
Jascha Sohl-Dickstein
5 years
Neural reparameterization improves structural optimization! By parameterizing physical design in terms of the (constrained) output of a neural network, we propose stronger and more elegant bridges, skyscrapers, and cantilevers. With @shoyer @samgreydanus
3
74
373
@jaschasd
Jascha Sohl-Dickstein
3 months
There are similarities between the way in which many fractals are generated, and the way in which we train neural networks. Both involve repeatedly applying a function to its own output. In both cases, that function has hyperparameters that control its behavior.
5
14
377
@jaschasd
Jascha Sohl-Dickstein
4 years
Infinite width networks (NNGPs and NTKs) are the most promising lead for theoretical understanding in deep learning. But, running experiments with them currently resembles the dark age of ML research before ubiquitous automatic differentiation. Neural Tangents fixes that.
@sschoenholz
Sam Schoenholz
4 years
The core of Neural Tangents is a high level neural network library. Any network specified in Neural Tangents automatically comes with a function to compute the infinite-width limit analytically. Here's an example for a two-hidden layer FC network:
2
24
68
2
71
330
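A minimal sketch of the usage pattern described in the quoted tweet, using the public Neural Tangents API; layer widths and data shapes below are illustrative, not taken from the original figure.

```python
import jax.numpy as jnp
from neural_tangents import stax

# Define a two-hidden-layer fully connected network; the same definition yields
# init/apply functions for the finite network and an analytic kernel function
# for the infinite-width limit.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

x1 = jnp.ones((4, 8))              # toy inputs
x2 = jnp.ones((2, 8))

nngp = kernel_fn(x2, x1, 'nngp')   # infinite-width Bayesian (NNGP) kernel
ntk = kernel_fn(x2, x1, 'ntk')     # neural tangent kernel
```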
@jaschasd
Jascha Sohl-Dickstein
3 years
@PBFcomics As another layer: rats can also (probably) echolocate. So they're both cheating, the rat is just worse at it.
3
8
318
@jaschasd
Jascha Sohl-Dickstein
6 years
Learning to sample using deep neural networks! Hamiltonian Monte Carlo + Real NVP == trainable MCMC sampler that generalizes, and far outperforms, HMC.
5
111
312
@jaschasd
Jascha Sohl-Dickstein
3 months
I don't have a SoundCloud, but I did join Anthropic last week, and so far it has exceeded my (high) expectations. I would strongly recommend working there (and using Claude). *this project not done at Anthropic -- this was recreational machine learning on my own time.
12
11
317
@jaschasd
Jascha Sohl-Dickstein
3 years
CALL FOR TASKS CAPTURING LIMITATIONS OF LARGE LANGUAGE MODELS: We are soliciting contributions of tasks to a *collaborative* benchmark designed to measure and extrapolate the capabilities and limitations of large language models. Submit tasks at #BIGbench
14
73
278
@jaschasd
Jascha Sohl-Dickstein
4 years
A simple prescription that will improve your models: When using LayerNorm, do mean subtraction *before* rather than after the affine transformation. This, and an in-depth empirical investigation of statistical properties of common normalizers in
2
35
249
@jaschasd
Jascha Sohl-Dickstein
3 months
In both cases the function iteration can produce outputs that either diverge to infinity or remain happily bounded depending on those hyperparameters. Fractals are often defined by the boundary between hyperparameters where function iteration diverges or remains bounded.
2
3
250
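The analogy in this thread can be made concrete with two toy loops (illustrative constants; the fractal images above came from real training runs, not from this sketch): a classic fractal-generating map and gradient descent on a quadratic are both repeated function application, and each either stays bounded or blows up depending on a single parameter.

```python
import numpy as np

def quadratic_map_diverges(c, steps=100):
    """Iterate z <- z^2 + c (the Mandelbrot map); c is the 'hyperparameter'."""
    z = 0.0 + 0.0j
    for _ in range(steps):
        z = z * z + c
        if abs(z) > 2.0:
            return True
    return False

def gd_diverges(lr, curvature=1.0, steps=100):
    """Iterate gradient descent on L(w) = curvature * w^2 / 2; lr is the hyperparameter."""
    w = 1.0
    for _ in range(steps):
        w = w - lr * curvature * w     # diverges when lr > 2 / curvature
        if abs(w) > 1e6:
            return True
    return False

print(quadratic_map_diverges(1.0), gd_diverges(2.5))    # True, True (both blow up)
print(quadratic_map_diverges(-1.0), gd_diverges(1.5))   # False, False (both stay bounded)
```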
@jaschasd
Jascha Sohl-Dickstein
4 years
Neural Network Gaussian Processes (NNGPs) correspond to wide Bayesian neural networks! In  we show that the posterior distribution over functions computed by a Bayesian neural network converges to the posterior of the NNGP as layer width grows large.
2
51
238
@jaschasd
Jascha Sohl-Dickstein
6 months
"Levels of AGI: Operationalizing Progress on the Path to AGI." Levels of Autonomous Driving are extremely useful for communicating capabilities, setting regulation, and defining goals in self-driving. We propose analogous Levels of *AGI*. (ChatGPT is a Level 1 "Emerging" AGI)
14
62
246
@jaschasd
Jascha Sohl-Dickstein
3 years
I am *extremely* proud to share that we were awarded the ICML outstanding paper award! Major credit and thanks to my collaborators @PaulVicol and @Luke_Metz ! Paul especially owned every part of this project, and I think his care and extreme thoroughness are the reasons we won.
@icmlconf
ICML Conference
3 years
ICML 2021 Outstanding Paper Award: • Paul Vicol, Luke Metz, and Jascha Sohl-Dickstein 📜Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies (Tuesday 9pm US Eastern)
1
17
163
9
6
238
@jaschasd
Jascha Sohl-Dickstein
4 years
Build an infinite width neural network with the same code you use to define your finite width neural network.
@GoogleAI
Google AI
4 years
Announcing Neural Tangents, a new easy-to-use, open-source neural network library that enables researchers to build finite- and infinite-width versions of neural networks simultaneously. Grab the code and try it for yourself at
13
624
2K
1
37
235
@jaschasd
Jascha Sohl-Dickstein
6 years
"Sensitivity and Generalization in Neural Networks: an Empirical Study." Neural nets generalize better when they're larger and less sensitive to their inputs, are less sensitive near training data than away from it, and other results from massive experiments.
0
78
228
@jaschasd
Jascha Sohl-Dickstein
4 years
"Your GAN is Secretly an Energy-based Model and You Should use Discriminator Driven Latent Sampling" This technique can dramatically improve existing trained GANs, by re-interpreting them as an easy-to-sample-from energy based model in the latent space.
9
46
229
@jaschasd
Jascha Sohl-Dickstein
5 years
Batch norm causes chaos and gradient explosion in the output of deep networks: figure below shows two nearly identical minibatches going through a random *linear* network with batch norm, and becoming completely dissimilar by depth 30! Much, much more at:
1
43
203
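A sketch of the experiment described above, with guessed batch size and width; the qualitative effect in the original figure is two nearly identical minibatches decorrelating as depth grows.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, width, depth = 128, 256, 30

x_a = rng.standard_normal((batch, width))
x_b = x_a + 1e-3 * rng.standard_normal((batch, width))    # nearly identical copy

def batch_norm(h, eps=1e-5):
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

for layer in range(depth):
    W = rng.standard_normal((width, width)) / np.sqrt(width)   # random *linear* layer
    x_a, x_b = batch_norm(x_a @ W), batch_norm(x_b @ W)        # same weights for both
    cos = np.sum(x_a * x_b) / (np.linalg.norm(x_a) * np.linalg.norm(x_b))
    print(f"depth {layer + 1:2d}: cosine similarity {cos:.3f}")
```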
@jaschasd
Jascha Sohl-Dickstein
6 years
Everything you wanted to know about the role of batch size in neural net training, but didn't have the computational resources to ask! With Chris Shallue, Jaehoon Lee, Joe @joe_antognini , Roy Frostig, and George Dahl.
1
49
200
@jaschasd
Jascha Sohl-Dickstein
6 years
Guided evolutionary strategies: escaping the curse of dimensionality in random search. A principled method to leverage training signals which are not the gradient, but which may be correlated with the gradient. Work with @niru_m @Luke_Metz @georgejtucker .
3
63
191
@jaschasd
Jascha Sohl-Dickstein
5 years
A careful empirical study of the effect of network width on generalization and fixed learning rate SGD, for MLPs, convnets, resnets, and batch norm. With superstar resident Daniel Park, and @quocleix + Sam Smith.
2
44
189
@jaschasd
Jascha Sohl-Dickstein
8 months
Here is a brain dump of my thoughts about how AI might go wrong: AI has the power to change the world in both wonderful and terrible ways. With hard work, I expect AI to lead to far more good than harm. But part of achieving that is thinking about risk.
7
29
179
@jaschasd
Jascha Sohl-Dickstein
4 years
Infinite width limits (NNGP and NTK) for neural networks with self-attention . This fills in the last common architectural component which did not have an infinite width correspondence! Along the way we improve on the standard softmax attention mechanism.
2
30
143
@jaschasd
Jascha Sohl-Dickstein
4 months
I'm running an experiment, and holding some public office hours (inspired by seeing @kchonyc do something similar). Come talk with me about anything! Ask for advice on your research or startup or career or I suppose personal life, brainstorm new research ideas, complain about…
6
9
140
@jaschasd
Jascha Sohl-Dickstein
2 years
I think we will increasingly build systems out of many large models interacting with each other. I think the cascades perspective -- write down a probabilistic graphical model, but with every node a language model -- is the right formalism for describing these systems.
@dmdohan
David Dohan
2 years
Happy to release our work on Language Model Cascades. Read on to learn how we can unify existing methods for interacting models (scratchpad/chain of thought, verifiers, tool-use, …) in the language of probabilistic programming. paper:
3
99
668
1
12
126
@jaschasd
Jascha Sohl-Dickstein
6 years
Stochastic natural gradient descent corresponds to Bayesian training of neural networks, with a modified prior. This equivalence holds *even away from local minima*. Very proud of this work with Sam Smith, Daniel Duckworth, and Quoc Le.
2
36
125
@jaschasd
Jascha Sohl-Dickstein
4 years
Research on the Neural Tangent Kernel (NTK) almost exclusively uses a non-standard neural network parameterization, where activations are divided by sqrt(width), and weights are initialized to have variance 1 rather than variance 1/width.
2
23
120
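The difference between the two parameterizations, for a single dense layer (shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 512
x = rng.standard_normal(fan_in)

# Standard parameterization: weights initialized with variance 1 / fan_in.
W_std = rng.standard_normal((fan_out, fan_in)) / np.sqrt(fan_in)
y_std = W_std @ x

# NTK parameterization: weights have variance 1, activations divided by sqrt(fan_in).
W_ntk = rng.standard_normal((fan_out, fan_in))
y_ntk = W_ntk @ x / np.sqrt(fan_in)

# Pre-activations have the same scale at init; the two parameterizations differ
# in how gradients with respect to the weights scale during training.
print(y_std.std(), y_ntk.std())
```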
@jaschasd
Jascha Sohl-Dickstein
2 years
Performance on some tasks improves smoothly with model scale, while on others there is sudden breakthrough performance at a critical scale.
2
12
118
@jaschasd
Jascha Sohl-Dickstein
1 year
Living creatures, human organizations, and machine learning models are all judged to become *more of a hot mess (less coherent) as they grow more intelligent*. This suggests that AI failing to pursue a consistent goal is more likely than AI pursuing a misaligned goal.
9
15
113
@jaschasd
Jascha Sohl-Dickstein
3 months
@DanielDugas14 Go for it! All the raw images are here:
3
4
109
@jaschasd
Jascha Sohl-Dickstein
2 years
Models can learn unexpected skills that are only implicitly contained in the training data -- for instance, how to make legal moves in chess.
2
11
100
@jaschasd
Jascha Sohl-Dickstein
4 years
This is very cool work. Read this if you want to really, really understand how a neural network solves a specific problem -- like actual scientific understanding.
@niru_m
Niru Maheswaranathan
4 years
#tweeprint time for our new work out on arXiv!📖We've been trying to understand how recurrent neural networks (RNNs) work, by reverse engineering them using tools from dynamical systems analysis—with @SussilloDavid .
9
264
918
1
16
103
@jaschasd
Jascha Sohl-Dickstein
3 months
@NaveenGRao It's actually not so bad! Width 16 one hidden layer neural network, and I only computed new images tiled by every factor of 2 in zoom scale -- so about 50 grid searches needed to be run for the entire video. It took overnight on an A100.
6
0
98
@jaschasd
Jascha Sohl-Dickstein
1 month
This was a fun project! If you could train an LLM over text arithmetically compressed using a smaller LLM as a probabilistic model of text, it would be really good. Text would be represented with far fewer tokens, and inference would be way faster and cheaper. The hard part is…
@noahconst
Noah Constant
1 month
Ever wonder why we don’t train LLMs over highly compressed text? Turns out it’s hard to make it work. Check out our paper for some progress that we’re hoping others can build on. With @blester125 , @hoonkp , @alemi , Jeffrey Pennington, @ada_rob , @jaschasd
2
9
69
3
8
95
@jaschasd
Jascha Sohl-Dickstein
6 years
Bayesian CNNs with many channels are Gaussian processes! One can compute test set predictions that would have resulted from fully Bayesian training of a CNN, but without ever instantiating a CNN, and instead by evaluating the corresponding GP.
1
27
93
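"Evaluating the corresponding GP" here means ordinary GP regression with the NNGP kernel. A minimal sketch, with `nngp_kernel` as a placeholder for the architecture-specific kernel:

```python
import numpy as np

def gp_posterior(nngp_kernel, X_train, y_train, X_test, noise=1e-2):
    """Posterior predictive mean and covariance of the infinite-width Bayesian net."""
    K = nngp_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = nngp_kernel(X_test, X_train)
    K_ss = nngp_kernel(X_test, X_test)
    mean = K_s @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, cov
```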
@jaschasd
Jascha Sohl-Dickstein
9 months
Adversarial attacks designed to fool computer vision models, *transfer (weakly) to the human brain* -- even when the attack is so small as to be barely perceptible. Nature Comms paper:
@gamaleldinfe
Gamaleldin Elsayed
9 months
We present human participants with two nearly identical images, each has different adversarial perturbations generated by ANNs. We find that in each experiment, human perception is consistently biased by the adversarial perturbation in the direction predicted by the ANN.
4
23
136
0
6
92
@jaschasd
Jascha Sohl-Dickstein
1 year
Intuitive extensions to standard notation that make it less ambiguous for common math in machine learning. This should become common practice in ML papers. This could have saved past me cumulative days of confusion (and worse, misinterpretations I probably never discovered).
@srush_nlp
Sasha Rush (ICLR)
1 year
Named Tensor Notation (TMLR version, w/ @davidweichiang + @boazbaraktcs ) A rigorous description, opinionated style guide, and gentle polemic for named tensors in math notation. * Macros:
13
93
477
1
16
92
@jaschasd
Jascha Sohl-Dickstein
2 years
While language models do better as they are made larger, they still do poorly on BIG-bench relative to humans.
3
5
88
@jaschasd
Jascha Sohl-Dickstein
4 years
This is a meta-learned list of optimization hyperparameters. Try these hyperparameters in this order for fun, profit, and better performing models with less compute!! A sequence of magic numbers beyond Karpathy's constant! JAX, PyTorch, & TensorFlow code:
@Luke_Metz
Luke Metz
4 years
Excited to share our new work! We introduce a dataset of tasks for learned optimizer research. As an example application of this dataset we meta-train lists of optimizer hyper parameters that work well on a diverse set of tasks. 1/4
3
68
236
0
16
85
@jaschasd
Jascha Sohl-Dickstein
1 year
If you are training models with < 5e8 parameters, for < 2e5 training steps, then with high probability this LEARNED OPTIMIZER will beat or match the tuned optimizer you are currently using, out of the box, with no hyperparameter tuning (!).
1
11
85
@jaschasd
Jascha Sohl-Dickstein
2 years
Models become consistently more socially biased as they are made larger (likely because they do a better job at capturing ever more subtle biases in their training data). There is reduced or even decreasing bias with scale when context makes it clear that bias is undesirable.
4
7
81
@jaschasd
Jascha Sohl-Dickstein
3 years
I think this will be a very important paper. My take: by unrolling SGD training steps and treating them as part of the NN architecture, computing the kernel after training (w/ feature learning) becomes equivalent to computing the NNGP kernel of the extended architecture.
@TheGregYang
Greg Yang
3 years
1/ Existing theories of neural networks (NN) like NTK don't learn features so can't explain success of pretraining (e.g. BERT, GPT3). We derive the *feature learning* ∞-width limit of NNs & pretrained such an ∞-width word2vec model: it learned semantics!
4
58
387
1
11
82
@jaschasd
Jascha Sohl-Dickstein
2 years
The phenomenon of overfitting in machine learning maps onto a class of failures that frequently happen in the broader world: in politics, economics, science, and beyond. Doing too well at targeting a proxy objective can make the thing you actually care about get much, much worse.
2
3
77
@jaschasd
Jascha Sohl-Dickstein
3 years
My group in Google Brain is hiring a full time researcher, for a research team focused on learned optimizers. Are you interested in meta-learning, bilevel optimization, dynamical systems? Apply here: Please reach out with any questions!
@Luke_Metz
Luke Metz
3 years
Interested in meta-learning and learned optimizers? Our team at Google Brain is hiring a full time researcher! Feel free to reach out to myself or @jaschasd for more information.
2
36
168
0
13
78
@jaschasd
Jascha Sohl-Dickstein
3 months
I’ve been daydreaming about an AI+audio product that I think recently became possible: virtual noise canceling headphones. I hate loud background noise -- BART trains, airline cabins, road noise, ... 🙉. I would buy the heck out of this product, and would love it if it were built…
7
4
76
@jaschasd
Jascha Sohl-Dickstein
2 years
Overall, sparse models perform as well as dense models that use ~2x more inference compute, but they are as well calibrated as dense models using ~10x more inference compute.
1
2
74
@jaschasd
Jascha Sohl-Dickstein
2 years
@ericjang11 This is for the same reason that neural networks are often poorly calibrated. NNs are good at producing a vector that points in the right direction, but bad at getting the magnitude correct. For classification, you just need to get the vector direction right.
3
2
71
@jaschasd
Jascha Sohl-Dickstein
2 years
End of 🧵. Here's a bonus plot from the blog post, about how models overfit the most when their capacity most closely matches the complexity of the problem. In case, like me, you're the kind of person who likes plots.
4
2
69
@jaschasd
Jascha Sohl-Dickstein
2 years
Finally -- *so many thanks* to all my collaborators! And especially to the co-organizers, who did so much, and were a constant pleasure to work with! I can't fit names in a tweet without leaving out very important people, so see the attached list of contributors + contributions.
1
1
70
@jaschasd
Jascha Sohl-Dickstein
4 years
Infinite width neural networks enable more compute-efficient Neural Architecture Search!
@hoonkp
Jaehoon Lee
4 years
Can we leverage the power of infinite-width limit to help with Neural Architecture Search (NAS)? In this new paper (), we find that empirical NNGP can provide cheap and effective signals that can be used for NAS!
1
23
92
1
17
68
@jaschasd
Jascha Sohl-Dickstein
6 years
Learned optimizers with less mind-numbing pain! We analyze, and propose a solution to, pathologies in meta-training via unrolled optimization. Then we meta-train an optimizer targeting CNN training that outperforms SGD/Adam by 5x (!!!) in wall-clock time.
2
17
69
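A toy version of meta-training via unrolled optimization, the general recipe discussed above: two meta-parameters stand in for a full learned optimizer, and the meta-gradient is taken through the whole unroll. Names and constants are assumptions for illustration.

```python
import jax
import jax.numpy as jnp

def inner_loss(w):
    return 0.5 * jnp.sum((w - 1.0) ** 2)              # toy inner task: reach w = 1

def unrolled_meta_loss(theta, w0, T=20):
    log_lr, momentum = theta                          # the "learned optimizer" is just
    lr = jnp.exp(log_lr)                              # a learning rate and a momentum
    w, v = w0, jnp.zeros_like(w0)
    for _ in range(T):                                # unrolled inner optimization
        g = jax.grad(inner_loss)(w)
        v = momentum * v + g
        w = w - lr * v
    return inner_loss(w)                              # meta-loss: final inner loss

theta = jnp.array([-3.0, 0.5])                        # meta-parameters
w0 = jnp.array([5.0, -2.0])
meta_grad = jax.jit(jax.grad(unrolled_meta_loss))
for _ in range(100):                                  # meta-training loop
    theta = theta - 0.01 * meta_grad(theta, w0)       # small steps: long unrolls are
print(theta)                                          # where the pathologies appear
```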
@jaschasd
Jascha Sohl-Dickstein
6 years
If you are using PCA to visualize neural network training trajectories, you are interpreting it wrong! Very proud of this work with @joe_antognini : "PCA of high dimensional random walks with comparison to neural network training"
0
20
69
@jaschasd
Jascha Sohl-Dickstein
5 years
: @laurent_dinh is the most fun to work with. He always has extremely novel ideas ... and makes the most mesmerizing animations.
@laurent_dinh
Laurent Dinh
5 years
Exploring inference and learning with non-invertible “flows” to learn deep mixture models with RAD: (with @jaschasd , @rpascanu , and @hugo_larochelle )
1
57
258
0
4
67
@jaschasd
Jascha Sohl-Dickstein
2 years
... with many more observations in the paper. We also release score files and transcripts from models across six orders of magnitude of scale performing BIG-bench tasks, along with human baselines for most tasks. We hope this will be a goldmine for future research.
1
2
65
@jaschasd
Jascha Sohl-Dickstein
3 years
Come learn about our (outstanding paper award 😃) work building generative models by running SDEs backwards in time -- ICLR poster session in 30 minutes!
@DrYangSong
Yang Song
3 years
Thrilled to share that our paper "Score-Based Generative Modeling through Stochastic Differential Equations" has won an Outstanding Paper Award at ICLR 2021! Huge shoutouts to my awesome collaborators: @jaschasd @dpkingma @studentofml @StefanoErmon @poolio !
16
24
300
0
5
64
@jaschasd
Jascha Sohl-Dickstein
2 years
BIG-bench is a living benchmark. You can submit new tasks to be an author on future publications, and can submit new model evaluations to automatically be included in the BIG-bench leaderboards.
1
3
60
@jaschasd
Jascha Sohl-Dickstein
4 years
All of these experiments were made possible by the Neural Tangents software library. You should use it for all your infinite width network needs!
1
10
60
@jaschasd
Jascha Sohl-Dickstein
6 years
Meta-learning for unsupervised representation learning! Learn unsupervised learning rules that directly target the properties of the representation you care about.
@Luke_Metz
Luke Metz
6 years
Check out our new work on Learning Unsupervised Learning Rules! Done with my amazing collaborators @niru_m @thisismyhat @jaschasd
3
67
242
0
19
60
@jaschasd
Jascha Sohl-Dickstein
8 months
I just got offered a free standing desk or office chair if I posted two positive tweets about a company! I think that means I'm an influencer now.
7
0
56
@jaschasd
Jascha Sohl-Dickstein
1 year
See the post for details -- including discussion of the many ways these results are speculative and could be improved. This is my second blog post ever -- please continue to be harsh but also constructive!
2
2
55
@jaschasd
Jascha Sohl-Dickstein
3 months
@LeonDerczynski Yes! Imagine seeing that in a physics experiment...
2
0
54
@jaschasd
Jascha Sohl-Dickstein
2 years
If there's one thing that AI will bring, it's dramatically greater efficiency across many domains. We should expect that this will cause similarly dramatic harmful unintended consequences, in every domain AI touches, *all at once*. This is going to be a hard period of history.
1
7
52
@jaschasd
Jascha Sohl-Dickstein
3 months
@eigenstate Yes! I talk about this briefly in the blog post / paper. Newton updates are different than (S)GD, but they are a definite proof of principle that optimization can lead to fractals. (on the other hand, unsurprising post-hoc is of course not the same as identifying in advance that…
0
0
51
@jaschasd
Jascha Sohl-Dickstein
4 years
Along the way, we show that predictions from trained fully connected networks are COMPLETELY DETERMINED by the datapoint-datapoint second moment matrix of the training and test data! The trained network is literally unable to base predictions on anything other than this matrix.
2
8
47
@jaschasd
Jascha Sohl-Dickstein
3 years
"Creating noise from data is easy; creating data from noise is generative modeling." I am very excited about this work!!! SOTA image models from stochastic differential equations, with more surprising theoretical properties and connections to other techniques than fit in a tweet.
@DrYangSong
Yang Song
3 years
Happy to announce our new work on score-based generative modeling: high quality samples, exact log-likelihoods, and controllable generation, all available through score matching and Stochastic Differential Equations (SDEs)! Paper:
6
141
695
1
4
44
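A toy sketch of the idea (not the paper's models): run a simple VP-style forward SDE that turns data into noise, then generate by integrating the reverse-time SDE with Euler-Maruyama. To stay self-contained, the "data" here is a single point, so the exact score of the noised distribution is available in closed form in place of a learned score network.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, dim, n_steps = 10.0, 2, 1000
x_data = np.array([2.0, -1.0])                        # the lone "data" point

def score(x, t):
    # Forward marginal: x_t ~ N(x_data * exp(-beta t / 2), (1 - exp(-beta t)) I),
    # so grad_x log p_t(x) is analytic for this toy data distribution.
    mean = x_data * np.exp(-0.5 * beta * t)
    var = 1.0 - np.exp(-beta * t)
    return -(x - mean) / var

# Reverse-time Euler-Maruyama, from t = 1 (~pure noise) down to t ~ 0 (data).
x = rng.standard_normal(dim)                          # start from the prior N(0, I)
ts = np.linspace(1.0, 1e-3, n_steps)
dt = ts[0] - ts[1]
for t in ts:
    drift = -0.5 * beta * x - beta * score(x, t)      # f(x, t) - g(t)^2 * score(x, t)
    x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(dim)
print(x)                                              # should land near x_data
```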
@jaschasd
Jascha Sohl-Dickstein
3 months
@ChenTessler Exactly right. And, every point uses the same random initialization. (If you changed the random initialization you would get a different fractal.)
3
0
43
@jaschasd
Jascha Sohl-Dickstein
4 years
Check out our review of recent efforts to apply techniques from statistical physics to better understand deep learning, targeted at a physics audience. It tries to be approachable for non-experts in machine learning.
@SuryaGanguli
Surya Ganguli
4 years
1/ Our new paper in @AnnualReviews of Condensed Matter Physics on “Statistical Mechanics of #DeepLearning ” with awesome collaborators @Stanford and @GoogleAI : @yasamanbb @kadmonj Jeff Pennington @sschoenholz @jaschasd web: free:
2
125
431
0
5
42
@jaschasd
Jascha Sohl-Dickstein
5 years
Including a massive, well-curated dataset mapping hyperparameter configuration to model performance. This may be a useful resource in your own research.
@GoogleAI
Google AI
5 years
Data parallelism can improve the training of #NeuralNetworks , but how to obtain the most benefit from this technique isn’t obvious. Check out new research that explores different architectures, batch sizes, and datasets to optimize training efficiency.
4
179
496
0
6
39
@jaschasd
Jascha Sohl-Dickstein
2 years
If this analogy is useful, it will be because solutions to overfitting in machine learning also map over. Can we use techniques like noise regularization, or early stopping, to help fix problems with education, political elections, and free markets?
1
1
37
@jaschasd
Jascha Sohl-Dickstein
4 years
Infinite width Neural Network Gaussian Process (NNGP) and Neural Tangent Kernel (NTK) predictions can outperform finite networks, depending on architecture and training practices. For fully connected networks the infinite width limit reliably outperforms the finite network.
1
4
34
@jaschasd
Jascha Sohl-Dickstein
8 months
This is a very useful set of results, if for instance you want to study the training behavior of large-scale transformer models using academic-scale resources. (and Mitchell did an amazing job)
@Mitchnw
Mitchell Wortsman
8 months
Sharing some highlights from our work on small-scale proxies for large-scale Transformer training instabilities: With fantastic collaborators @peterjliu , @Locchiu , @_katieeverett , many others (see final tweet!), @hoonkp , @jmgilmer , @skornblith ! (1/15)
5
63
347
0
1
36
@jaschasd
Jascha Sohl-Dickstein
4 years
Predicting + demonstrating counterintuitive neural network training behavior:
- training at learning rates which diverge under NTK theory
- exponential *increase* in loss over first ~20 training *steps* (not epochs)
- drastic reduction in Hessian eigenvalues over first ~20 steps
@yasamanbb
Yasaman Bahri
4 years
In our preprint “The large learning rate phase of deep learning: the catapult mechanism" , we show that the choice of learning rate (LR) in (S)GD separates deep neural net dynamics into two sharply distinct types (or "phases", in the physics sense). (1/n)
1
54
315
1
10
36
@jaschasd
Jascha Sohl-Dickstein
1 year
Most work on AI misalignment risk is based on an assumption that more intelligent AI will also be more coherent. This is an assumption we can test! I collected subjective judgements of intelligence and coherence from colleagues in ML and neuro.
1
1
36
@jaschasd
Jascha Sohl-Dickstein
5 months
An excellent project making evolution strategies much more efficient for computing gradients in dynamical systems.
@OscarLi101
Oscar Li
5 months
📝Quiz time: when you have an unrolled computation graph (see figure below), how would you compute the unrolling parameters' gradients? If your answer only contains Backprop, now it’s time to add a new method to your gradient estimation toolbox!
1
13
125
0
3
35
@jaschasd
Jascha Sohl-Dickstein
8 months
This was a great blog post. I appreciated the recurring theme of technologies transitioning from being impossible to being inevitable.
@boazbaraktcs
Boaz Barak
9 months
I finally got around to reading Richard Rhodes' "The Making of the Atomic Bomb." I posted here some thoughts on the bomb and which of the lessons from it apply to AI.
5
9
89
2
3
34
@jaschasd
Jascha Sohl-Dickstein
3 years
@timnitGebru @JeffDean I + co-organizers would love contributions from researchers working on ethical AI, very much including you and the ethical AI team. Measuring a thing is a first step towards improving it, and we want to make measurement of social biases a core part of the benchmark.
4
0
34
@jaschasd
Jascha Sohl-Dickstein
3 years
Come to our workshop on Enormous Language Models! Also, submit a task to the associated benchmark, and be a co-author on the corresponding paper!
@colinraffel
Colin Raffel
3 years
📣 Announcing the ICLR 2021 Workshop on Enormous Language Models 📣 We have an incredible speaker lineup that covers building, evaluating, critiquing, and improving large LMs, as well as a collaborative participant-driven benchmark and 2 panels! More info:
6
47
250
0
7
32
@jaschasd
Jascha Sohl-Dickstein
1 year
In this paper -- we basically embraced the difficulty, and spent both massive amounts of human effort and massive amounts of compute, in order to meta-train the learned optimizer. (* by we, I especially mean @Luke_Metz and @jmes_harrison who led the project at different times)
1
2
32
@jaschasd
Jascha Sohl-Dickstein
4 years
This has surprisingly little effect on prediction accuracy, but does improve the match between networks considered by theory and used in practice. With Roman Novak, @sschoenholz , and @hoonkp . Implementation in .
1
3
31
@jaschasd
Jascha Sohl-Dickstein
2 years
Also -- the blog post isn't about AI -- but the ideas in the post very much color my own fears about how AI may go wrong. The post is about how greater efficiency can lead to harmful unintended consequences.
1
0
32
@jaschasd
Jascha Sohl-Dickstein
3 months
@oh_that_hat That is a great question!! I hadn't thought of that. My guess would be that we would see a similar fractal structure at the boundaries in the phase diagram you pasted ... but I'm not sure. It would be a fascinating experiment. (probably a lot more expensive than the experiment I…
3
0
31
@jaschasd
Jascha Sohl-Dickstein
1 year
Meta-training learned optimizers is HARD. Each meta-training datapoint is an entire optimization task, so building a large meta-training dataset is HARD. Each of N meta-training steps can contain N training steps applying the learned optimizer -- so compute is also extreme (N^2).
1
2
30
@jaschasd
Jascha Sohl-Dickstein
3 years
Bootstrapping the training of learned optimizers using randomly initialized learned optimizers. No hand designed optimizer involved (* unless you count population based training). A demonstration of the potential power of positive feedback loops in meta-learning.
@Luke_Metz
Luke Metz
3 years
[Micro paper] We train learned optimizers using other randomly initialized learned optimizers in an evolutionary process. This creates a positive feedback loop: learning to optimize enables optimizers to optimize themselves faster, accelerating training.
2
37
184
0
3
28
@jaschasd
Jascha Sohl-Dickstein
4 years
This makes NTK training dynamics dissimilar from those of standard finite width networks. (Infinite width Bayesian networks, NNGPs, don't suffer from this problem.) In we derive infinite width kernels for the *standard* parameterization, resolving this.
1
6
28
@jaschasd
Jascha Sohl-Dickstein
4 years
Analogously to the first time a compiler compiles itself, it is even capable of training itself from scratch!!! I think we are now only a short distance away from learned optimizers being the best choice for most optimization tasks (though, we're not *quite* there yet).
1
4
26