Extremely excited to have this work out, the first paper from the Superalignment team! We study how large models can generalize from supervision of much weaker models.
In the future, humans will need to supervise AI systems much smarter than them.
We study an analogy: small models supervising large models.
Read the Superalignment team's first paper showing progress on a new approach, weak-to-strong generalization:
📢 I am recruiting Ph.D. students for my new lab at
@nyuniversity
! Please apply if you want to work on understanding deep learning and large models, and do a Ph.D. in the most exciting city on earth.
Details on my website: . Please spread the word!
Spurious features are a major issue for deep learning. Our new
#NeurIPS2022
paper w/
@pol_kirichenko
,
@gruver_nate
and
@andrewgwils
explores the representations learned by models trained on data with spurious features, with many surprising findings and SOTA results.
🧵1/6
We run HMC on hundreds of TPU devices for millions of training epochs to provide our best approximation of the true Bayesian neural networks! (1) BNNs do better than deep ensembles (2) no cold posteriors effect but (3) BNNs are terrible under data corruption, and much more! 🧵
What are Bayesian neural network posteriors really like? With high fidelity HMC, we study approximate inference quality, generalization, cold posteriors, priors, and more.
With
@Pavel_Izmailov
,
@sharadvikram
, and Matthew D. Hoffman. 1/10
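For intuition, here is a minimal, hypothetical sketch of a single HMC step (leapfrog integration plus a Metropolis correction) on a toy 1-D Gaussian posterior — nothing like the paper's TPU-scale setup, just the core update:

```python
import math, random

def hmc_step(q, log_prob, grad_log_prob, step_size=0.1, n_leapfrog=20):
    """One Hamiltonian Monte Carlo step on a 1-D target (toy sketch)."""
    p = random.gauss(0.0, 1.0)                    # resample momentum
    q_new, p_new = q, p
    # leapfrog integration: half kick, alternating drifts/kicks, half kick
    p_new += 0.5 * step_size * grad_log_prob(q_new)
    for _ in range(n_leapfrog - 1):
        q_new += step_size * p_new
        p_new += step_size * grad_log_prob(q_new)
    q_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_prob(q_new)
    # Metropolis accept/reject using the Hamiltonian
    h_old = -log_prob(q) + 0.5 * p * p
    h_new = -log_prob(q_new) + 0.5 * p_new * p_new
    if random.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new
    return q

# toy target: a standard normal "posterior"
log_prob = lambda q: -0.5 * q * q
grad = lambda q: -q

random.seed(0)
q, samples = 0.0, []
for _ in range(5000):
    q = hmc_step(q, log_prob, grad)
    samples.append(q)
mean = sum(samples) / len(samples)
```

With a well-tuned step size the sample mean and variance should be close to the target's (0 and 1); the real challenge the paper tackles is doing this over millions of neural-network parameters.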
We explore how to represent aleatoric (irreducible) uncertainty in Bayesian classification, with profound implications for performance, data augmentation, and cold posteriors in BDL.
w/
@snymkpr
, W. Maddox,
@andrewgwils
🧵 1/16
Dangers of Bayesian Model Averaging under Covariate Shift
We show how Bayesian neural nets can generalize *extremely* poorly under covariate shift, why it happens and how to fix it!
With Patrick Nicholson,
@LotfiSanae
and
@andrewgwils
1/10
Our paper on HMC for Bayesian neural networks will appear at
#ICML2021
as a long talk!
We are also excited to release our JAX code and HMC samples:
Code:
Colab showing how to load the samples:
Paper:
Check out our FlexiViT paper, appearing at
#CVPR2023
! We show that you can train one vision transformer model that works with all patch sizes, allowing you to decide on an accuracy-compute trade-off at test time!
Paper:
Code:
This ballad about Sir FlexiViT is the coolest thing ever!
It's a nice explanation of the main point of FlexiViT, and the 50min video easily plays in
@ykilcher
's league😍
The paper "FlexiViT: One Model for All Patch Sizes" was accepted at CVPR, so here comes my summary:
🧶1/N
🔥 Our work on Bayesian model selection received an Outstanding Paper Award at
#ICML2022
! Please see the talk by
@LotfiSanae
tomorrow and join us at the poster session!
I'm so proud that our paper on the marginal likelihood won the Outstanding Paper Award at
#ICML2022
!!! Congratulations to my amazing co-authors
@Pavel_Izmailov
,
@g_benton_
,
@micahgoldblum
,
@andrewgwils
🎉
Talk on Thursday, 2:10 pm, room 310
Poster 828 on Thursday, 6-8 pm, hall E
We will be presenting our work "On Feature Learning in the Presence of Spurious Correlations" today at the PODS workshop! Come chat with us about group robustness and the factors that affect it :)
11:50-12:30 and 4:55-5:40, Ballroom 3.
w/
@polkirichenko
@gruver_nate
@andrewgwils
Check out our new video and blogpost visualizing mode connectivity. For this video we evaluated over 50 mil parameter configurations of a ResNet20:) It took over two weeks on 15 GPUs. W/
@ideami
@tim_garipov
@andrewgwils
Blog:
Stochastic Weight Averaging (SWA) is a simple procedure that improves generalization in deep learning over Stochastic Gradient Descent (SGD). PyTorch 1.6 now includes SWA natively. Learn more from
@Pavel_Izmailov
,
@andrewgwils
and Vincent:
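As a sketch (not the library source), the core of SWA is just an equal running average of weight snapshots collected along the tail of an SGD run; PyTorch's `torch.optim.swa_utils.AveragedModel` applies essentially this update per parameter (followed by an `update_bn` pass to refresh batch-norm statistics):

```python
def swa_update(w_swa, w, n_averaged):
    """Running average of weights: w_swa <- (w_swa * n + w) / (n + 1)."""
    return [(ws * n_averaged + wi) / (n_averaged + 1)
            for ws, wi in zip(w_swa, w)]

# pretend these are flattened weight snapshots from the tail of an SGD run
snapshots = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
w_swa, n = snapshots[0], 1
for w in snapshots[1:]:
    w_swa = swa_update(w_swa, w, n)
    n += 1
# w_swa is now the element-wise mean of the snapshots: [2.0, 2.0]
```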
Check out this short video for our
@NeurIPSConf
paper on SWAG, a simple method that improves predictions and uncertainty in deep learning; motivated by loss surface geometry and scales to ImageNet.
(🍄/🍄🍄🍄)
Turns out, it is really hard to get the student to match the teacher predictions in knowledge distillation, even if we train really long and use lots of data augmentation.
Why? Optimization is hard!
New paper with
@samscub
@polkirichenko
@alemi
and
@andrewgwils
!
Does knowledge distillation really work?
While distillation can improve student generalization, we show it is extremely difficult to achieve good agreement between student and teacher.
With
@samscub
,
@Pavel_Izmailov
,
@polkirichenko
, Alex Alemi. 1/10
We are presenting our paper "Dangers of Bayesian Model Averaging under Covariate Shift" at
#NeurIPS2021
now! Looking forward to seeing you at the poster session!
Poster:
Paper:
Very excited to give a talk at AABI tomorrow (Feb 1st) at 5PM GMT / 12PM ET!
I will be talking about our recent work on HMC for Bayesian neural networks, cold posteriors, priors, approximate inference and BNNs under distribution shift. Please join!
Join us to discuss the latest advances in approximate inference and probabilistic models at AABI 2022 on Feb 1-2!
Webinar registration:
We have an amazing line-up of speakers, panelists and papers👍
@vincefort
@Tkaraletsos
@s_mandt
@ruqi_zhang
Among other topics, I am excited about out-of-distribution generalization, interpretability, large language and vision models, technical AI alignment, uncertainty estimation, core deep learning methodology and applications.
See my papers here:
Excited to share this paper :)
The high-level takeaway is that the main thing that affects OOD detection in likelihood-based models is the inductive biases. You can have the same likelihood on train and arbitrary likelihood outside train. Flows have bad biases for OOD detection.
Why Normalizing Flows Fail to Detect Out-of-Distribution Data
We explore the inductive biases of normalizing flows based on coupling layers in the context of OOD detection (1/6)
Our competition on approximate inference for Bayesian deep learning has started!
We tried to make it as accessible as possible: you can use any language you like, and we provide examples and resources. Give it a try :)
Our
#NeurIPS2021
competition "Approximate Inference in Bayesian Deep Learning" has started!
The goal is to provide high quality approximate inference for Bayesian neural networks, using high-fidelity HMC from hundreds of TPUs as a reference.
@ideami
created a really cool website where you can play around with his 3-d visualizations of loss surfaces of deep neural nets: ! Includes our collaboration on mode connectivity:
Another cool result: a single long HMC chain appears to be quite good at exploring the posterior, at least in the function space. The results hint that MCMC methods are able to leverage mode connectivity to move between functionally diverse solutions.
We introduce a prior distribution to control the aleatoric (data) uncertainty of a Bayesian neural network, nearly matching the accuracy of cold posteriors 🥶
w/ Brooks Paige and
@Pavel_Izmailov
🧵1/8
This was a very exciting project to work on, initially quite mysterious but with a simple and satisfying resolution! Check out the paper for more details and insights :)
We also release our code at
10/10
Turns out SWA and SAM provide complementary improvements and can be combined for even better performance!
Cool paper by
@jeankaddour
,
@likicode
, Ricardo Silva, and Matt J. Kusner!
Flat minima often generalize better than sharp ones due to robustness against loss shifts between train and test set. What’s the best way to find them? We compare two popular methods, SWA and SAM, across 42 deep learning tasks (CV, NLP, GRL):
1/7
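To contrast with SWA's post-hoc averaging, here is a hypothetical sketch of one SAM update on a toy quadratic — SAM first climbs to an adversarial point within a small L2 ball around the weights, then applies that point's gradient at the original weights:

```python
def sam_step(w, grad_fn, lr=0.04, rho=0.05):
    """One Sharpness-Aware Minimization step (sketch).
    1) ascend to the (approximately) worst point within an L2 ball of
       radius rho; 2) apply that gradient at the original weights."""
    g = grad_fn(w)
    norm = max(sum(gi * gi for gi in g) ** 0.5, 1e-12)
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]  # ascent step
    g_adv = grad_fn(w_adv)                                  # gradient there
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# toy loss: f(w) = w0^2 + 10*w1^2 (much sharper along w1)
loss = lambda w: w[0] ** 2 + 10 * w[1] ** 2
grad_fn = lambda w: [2 * w[0], 20 * w[1]]
w = [1.0, 1.0]
for _ in range(100):
    w = sam_step(w, grad_fn)
```

The extra gradient evaluation per step is SAM's main cost; SWA instead reuses the ordinary SGD trajectory, which is one reason combining them is appealing.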
First, we find that BNNs at temperature 1 with regular Gaussian priors are actually quite good, outperforming deep ensembles on both accuracy and likelihood!
There is also a negative result: Bayesian neural nets seem to generalize very poorly to corrupted data! An ensemble of 720 HMC samples is worse than a single SGD solution when the inputs are noisy or corrupted.
We also compare the predictions of popular approximate inference methods to HMC. Advanced SGMCMC methods provide the most accurate approximation, deep ensembles are quite good even though often considered non-Bayesian, and mean field VI is the worst.
@StephaneDeny
@andrewgwils
@g_benton_
@m_finzi
I think Augerino could be extended to these scenarios: we can parameterize the set of transformations that we want to be invariant to with something like a GAN generator (or ). Definitely an exciting future work direction :)
We believe this problem of weak-to-strong learning will be central to alignment of superhuman AI systems in the future. It is also a tractable ML problem with close connections to OOD generalization, label noise, semi-supervised learning etc!
Humans won't be able to supervise models smarter than us. For example, if a superhuman model generates a million lines of extremely complicated code, we won’t be able to tell if it’s safe to run or not, if it follows our instructions or not, and so on.
What about the priors? We compare several prior families and study the dependence on prior variance with Gaussian priors. Generally, the effect on performance is fairly minor.
We are also launching $10M in grants for academics, grad students, and others to work on this and other directions in superalignment. Apply by Feb 18!
Application:
We're announcing, together with
@ericschmidt
: Superalignment Fast Grants.
$10M in grants for technical research on aligning superhuman AI systems, including weak-to-strong generalization, interpretability, scalable oversight, and more.
Apply by Feb 18!
@_shingc
@PyTorch
@andrewgwils
The tweet actually links to a new blogpost describing the new interface :) See also examples here:
Also there is documentation here:
Visualizations made with our friend
@ideami
. Here we show posterior density for ResNet20 on CIFAR10 and SWAG posterior in the subspace of top 2 PCA components of SGD trajectory. Variances are aligned with width.
(🍄🍄🍄/🍄🍄🍄)
Really excited about this paper: we achieve SOTA results on spurious correlation benchmarks by simply reweighting the features learned by standard ERM! The method only has one hyper-parameter and is extremely simple and cheap!
Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations. ERM learns multiple features that can be reweighted for SOTA on spurious correlations, reducing texture bias on ImageNet, & more!
w/
@Pavel_Izmailov
and
@andrewgwils
1/11
latest from preparedness @ openai: gpt4 at most mildly helps with biothreat creation.
method: get bio PhDs in a secure monitored facility. half try biothreat creation w/ (experimental) unsafe gpt4. other half can only use the internet.
so far, gpt4 ≈ internet… but we’ll…
In fact, tempering even hurts the performance in some cases, with the best performance achieved at temperature 1. What is the main difference with ? (1) We turn data augmentation off and (2) we use a very high fidelity inference procedure.
We use Deep Feature Reweighting (DFR) to evaluate feature representations: retrain the last layer of the model on group-balanced validation data. DFR worst group accuracy (WGA) tells us how much information about the core features is learned.
2/6
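A minimal sketch of the group-balancing step at the heart of this evaluation (hypothetical names; the linear head would then be retrained on the resulting subset with ordinary logistic regression):

```python
import random

def group_balanced_subset(indices_by_group):
    """Subsample each group down to the size of the smallest group,
    so the retrained head cannot exploit group imbalance."""
    n = min(len(ix) for ix in indices_by_group.values())
    subset = []
    for ix in indices_by_group.values():
        subset += random.sample(ix, n)
    return subset

# hypothetical validation set where the minority group is rare (10 of 100)
random.seed(0)
groups = {"majority": list(range(0, 90)), "minority": list(range(90, 100))}
subset = group_balanced_subset(groups)
# 10 examples from each group -> a group-balanced set of 20
```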
Better models learn the core feature better: in-distribution accuracy is linearly correlated with the DFR WGA. We don’t find qualitative differences between different types of architectures, such as CNNs and vision transformers: they all fall on the same line.
4/6
@tdietterich
@andrewgwils
Deep ensembles are typically trained with L2 regularization, which corresponds to a Gaussian prior, but it can be switched to any other prior. We show empirically in that deep ensembles with L2 regularization approximate HMC with a Gaussian prior.
While group robustness methods such as group DRO can improve WGA a lot, they don’t typically improve the features! With DFR, we recover the same performance for ERM and Group DRO. The improvement in these methods comes from the last layer, not features!
3/6
So you think you know distillation; it's easy, right?
We thought so too with
@XiaohuaZhai
@__kolesnikov__
@_arohan_
and the amazing
@royaleerieme
and Larisa Markeeva.
Until we didn't. But now we do again. Hop on for a ride (+the best ever ResNet50?)
🧵👇
@BlackHC
@SamuelAinsworth
@andrewgwils
@a1mmer
I think there are a few caveats: (1) the argument requires the Laplace approximation to perfectly describe the basin, which is far from given. (2) I believe the Git-Re-Basin observations don't say that the distribution of solutions is the same within each mode?
@KellenDB
@leopd
@andrewgwils
It's been tried on a bunch of things: ResNets, DenseNets, VGGs, also LSTMs, in deep RL, in low-precision training, in parallel training, GANs, physical modeling. Seems to help quite generally :)
@srchvrs
@giffmana
I'd say the take-away is that you don't necessarily need tricks to learn good features even if the data has spurious / shortcut features. But you need some tricks (e.g. training on group balanced data) to learn a good head / weighting of those features.
ImageNet pretraining (supervised or contrastive) has a major effect on the features, even on non-natural image datasets such as chest X-rays. With strong pretrained models, we achieve SOTA WGA on Waterbirds (97%), CelebA (92%) and FMOW (50%) with ERM features.
5/6
Finally, now that we understand the issue we can design a simple fix! We propose EmpCov priors, Gaussian priors which have low variance along the directions where the data has low variance. EmpCov priors significantly improve robustness on many corruptions!
9/10
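A hypothetical numpy sketch of the construction described here (not the paper's code): the prior covariance simply follows the empirical input covariance plus a small ridge, so directions along which the data barely varies get tiny prior variance:

```python
import numpy as np

def empcov_prior(X, scale=1.0, eps=1e-4):
    """Prior covariance aligned with the empirical covariance of the inputs:
    low data variance along a direction -> low prior variance (plus eps*I)."""
    C = np.cov(X, rowvar=False)               # empirical input covariance
    return scale * C + eps * np.eye(X.shape[1])

# toy data: the second feature is constant (a "dead pixel")
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=500), np.zeros(500)])
Sigma = empcov_prior(X)
# prior variance along the dead feature is only eps
```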
One of the interesting takeaways is that while prior works focused on making the representations robust to spurious correlation, the representations are in fact fine even with standard ERM: the issue is largely in the last linear layer.
Distillation presents an exciting challenge for optimization in deep learning. Unlike standard learning, in distillation we actually want to get the training loss as low as possible, overfitting is not an issue. Improving the optimizer is likely to improve distillation!
I deeply regret my participation in the board's actions. I never intended to harm OpenAI. I love everything we've built together and I will do everything I can to reunite the company.
@dnnslmr
@andrewgwils
@polkirichenko
Sure, here are the ones I know:
- — great overview of NFs
- — flows for discrete data
- — integer discrete flows
- — a mixture of discrete and continuous latent variables
Consider an MLP on MNIST. MNIST has many pixels near the boundary that are 0 for all images. The corresponding weights in the first layer will always be multiplied by 0 and will not interact with the likelihood. For these weights, the posterior will be the same as the prior!
3/10
But the MAP solution will just set these weights to zero (see gif in previous tweet). Now, suppose we apply noise to a test image, some of the dead pixels will activate! MAP will simply ignore these pixels but a true BNN will multiply them by weights drawn from the prior!
4/10
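Here is a tiny self-contained illustration of that argument (a hypothetical toy model, not the paper's setup): with weight decay — the MAP view with a Gaussian prior — the dead feature's weight is driven to zero, while under a true BNN its posterior would simply equal the prior:

```python
import random

random.seed(0)
# toy data: the second "pixel" is 0 for every image, like MNIST border pixels
xs = [(random.gauss(0, 1), 0.0) for _ in range(200)]
ys = [2.0 * x0 for x0, _ in xs]          # labels depend only on the live pixel

w = [0.5, 0.5]                           # start both weights away from zero
lr, weight_decay = 0.05, 0.1             # weight decay = Gaussian prior (MAP)
for _ in range(200):
    for (x0, x1), y in zip(xs, ys):
        err = w[0] * x0 + w[1] * x1 - y
        w[0] -= lr * (err * x0 + weight_decay * w[0])
        w[1] -= lr * (err * x1 + weight_decay * w[1])   # err * x1 == 0 always
# MAP shrinks w[1] to ~0; a true BNN leaves w[1]'s posterior equal to the
# prior, so noise activating the dead pixel gets multiplied by prior draws
```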
Bayesian model averaging mitigates double descent! We have just posted this new result in section 7 of our paper on Bayesian deep learning with
@Pavel_Izmailov
: . The result highlights the importance of *multi-modal* marginalization with Multi-SWAG. 1/3
In our recent paper () we found that BNNs perform really well in-distribution, but generalize terribly under covariate shift.
This result was very puzzling for us, but in this new work we provide an explanation!
2/10
@tomgoldsteincs
Agreed! Distillation is in fact very similar to standard training, but simpler: we can produce as much data as we want, ensure we have sufficient capacity and use more informative (soft) labels!
We found knowledge distillation to help a lot with a nice trick: we initialize the student FlexiViT model with the weights of a teacher such as a ViT-B/8, leading to much better performance compared to random initialization. Inspired by:
Is there _anything_ we can do to produce a high fidelity student? In self-distillation the student can in principle match the teacher. We initialize the student with a combination of teacher and random weights. Starting close enough, we can finally recover the teacher. 8/10
In KD you want to match the student to the teacher on as much training data as possible to ensure that the models will also make similar predictions on test data.
However, it turns out that even getting the student and teacher to match on the train data is really hard!
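As a sketch of the quantities involved (hypothetical helper names): the distillation objective is a KL divergence between teacher and student softmax outputs, while "agreement" asks whether their argmaxes match — and a small KL does not guarantee agreement:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kd_kl(teacher_logits, student_logits, T=1.0):
    """KL(teacher || student) on temperature-scaled softmax outputs."""
    p = softmax([z / T for z in teacher_logits])
    q = softmax([z / T for z in student_logits])
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def argmax(z):
    return max(range(len(z)), key=z.__getitem__)

# nearly identical logits, small KL -- yet the predicted classes differ
t, s = [2.0, 1.9, 0.0], [1.9, 2.0, 0.0]
```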
@BlackHC
@SamuelAinsworth
@andrewgwils
@a1mmer
i.e. even if we know that each "mode" contains all kinds of solutions, it doesn't guarantee that the posterior mass corresponding to the solutions is consistent between the "modes": it could be that 99% of the mass is still just one solution. Would be interesting to find out!
Finally, data augmentation indeed leads to underconfident fits on the training set, and posterior tempering or ND are needed to correct for this underconfidence. These results concretely resolve the observed link between the cold posterior effect and augmentation!
12/16
Then, on a dataset with 100 classes, the posterior samples will on average only be 2% confident in the observed training label. But on benchmarks like CIFAR we believe there’s almost no label uncertainty!
4/16
We are hoping that the samples can be useful to the Bayesian deep learning community! We also plan to add samples for new datasets and architectures over time. Please let us know if you have any issues loading or using the checkpoints.
@giffmana
I think you are right that for more diverse datasets we would likely see less degeneracy in the features. For CIFAR low-variance directions are checkerboard patterns and I would think you would still not see a lot of these on ImageNet? Would be fun to check!
We can generalize this reasoning to any linear dependencies in the data. In the paper, we prove that if the input features are linearly dependent (which is true for a lot of datasets), the BNN predictions will break if we break the linear dependence at test time!
5/10
In the paper, we show how to efficiently resize the patch embeddings and positional encoding parameters. By doing so and randomizing the patch size during training, we can train a *single model* that is, for example, competitive with the whole family of efficient net models.
@ido87
@andrewgwils
Hi Daniel, the plots show how the loss changes as you vary the parameters of the DNN within a two-dimensional subspace. The x axis is fixed: it passes through two independently trained DNN weight vectors. The y axis changes as we change the plane.
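A hypothetical numpy sketch of how such a slice can be built: orthonormalize a basis for the plane through three weight vectors, then evaluate the loss on a grid in that plane:

```python
import numpy as np

def plane_basis(w1, w2, w3):
    """Orthonormal basis for the plane through w1, w2, w3.
    u points from w1 to w2 (the fixed x axis); v is the component of
    w3 - w1 orthogonal to u (the y axis, which changes with the plane)."""
    u = w2 - w1
    u = u / np.linalg.norm(u)
    v = (w3 - w1) - np.dot(w3 - w1, u) * u
    v = v / np.linalg.norm(v)
    return u, v

def loss_on_grid(loss, origin, u, v, xs, ys):
    """Evaluate the loss at origin + x*u + y*v over a grid of (x, y)."""
    return [[loss(origin + x * u + y * v) for x in xs] for y in ys]

rng = np.random.default_rng(0)
w1, w2, w3 = rng.normal(size=(3, 10))    # stand-ins for trained weights
u, v = plane_basis(w1, w2, w3)
```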
@martin_trapp
@andrewgwils
Hey
@martin_trapp
, were you thinking to use any language in particular? We will provide example code in python, but I think the competition is generally language agnostic.
@viraj_bagal
@PyTorch
@andrewgwils
The new implementation in PyTorch is complete. It's not implemented as an optimizer wrapper, but rather as a model wrapper. As for the bn update, you also need to do it in tf, see the comment in the blue box here:
In regression, we can control the representation of aleatoric uncertainty with an interpretable noise parameter. In classification we use the same softmax cross-entropy likelihood regardless of the amount of label noise, which leads to underfitting the training data.
2/16
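As a sketch of the contrast (toy numbers, not from the paper): the Gaussian regression likelihood exposes an explicit noise scale sigma, so the same residual costs far less when the labels are declared noisy — softmax cross-entropy has no analogous knob:

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of a Gaussian observation model.
    sigma is an explicit, interpretable aleatoric-noise parameter."""
    return (0.5 * math.log(2 * math.pi * sigma ** 2)
            + (y - mu) ** 2 / (2 * sigma ** 2))

# the same residual (y - mu = 1) under two assumed noise levels
tight = gaussian_nll(1.0, 0.0, 0.1)   # assume near-noiseless labels
loose = gaussian_nll(1.0, 0.0, 1.0)   # assume noisy labels
```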
@DanFrederiksen2
@andrewgwils
@tim_garipov
@ideami
In this visualization we demonstrate a particular phenomenon, mode connectivity: . We do not expect to capture everything about loss surfaces in 2d, but you can get insights about behavior in random / specific directions. E.g. .
@latentjasper
I think so! We are able to get better than deep ensembles' performance on the same architectures with HMC (with no data augmentation). Also, using the cold posteriors' code, T=1 performance is better than low temperatures if we remove data augmentation.