Pavel Izmailov Profile Banner
Pavel Izmailov Profile
Pavel Izmailov

@Pavel_Izmailov

5,809
Followers
1,385
Following
51
Media
594
Statuses

Researcher. Incoming Assistant Professor @nyuniversity 🏙️ Previously @OpenAI. #StopWar 🇺🇦

San Francisco
Joined March 2010
Pinned Tweet
@Pavel_Izmailov
Pavel Izmailov
5 months
Extremely excited to have this work out, the first paper from the Superalignment team! We study how large models can generalize from supervision of much weaker models.
@OpenAI
OpenAI
5 months
In the future, humans will need to supervise AI systems much smarter than them. We study an analogy: small models supervising large models. Read the Superalignment team's first paper showing progress on a new approach, weak-to-strong generalization:
Tweet media one
509
1K
7K
16
30
254
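For readers who want the weak-to-strong setup in concrete form, here is a toy sketch with scikit-learn; the models, data split, and PGR calculation are illustrative stand-ins, not the paper's implementation.

```python
# Toy sketch of weak-to-strong generalization (assumed simplification):
# 1) train a weak supervisor on ground truth, 2) label held-out data with it,
# 3) train a stronger student on those weak labels,
# 4) compare to a strong model trained on ground truth (the "ceiling").
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=6000, n_features=40, n_informative=10, random_state=0)
X_weak, y_weak = X[:2000], y[:2000]              # data for the weak supervisor
X_transfer, y_transfer = X[2000:4000], y[2000:4000]
X_test, y_test = X[4000:], y[4000:]

weak = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)            # weak supervisor
weak_labels = weak.predict(X_transfer)                                  # imperfect supervision

student = GradientBoostingClassifier().fit(X_transfer, weak_labels)     # strong model, weak labels
ceiling = GradientBoostingClassifier().fit(X_transfer, y_transfer)      # strong model, true labels

acc = lambda m: m.score(X_test, y_test)
# Performance gap recovered: how much of the weak-to-ceiling gap the student closes.
gap = acc(ceiling) - acc(weak)
pgr = (acc(student) - acc(weak)) / gap if gap > 0 else float("nan")
print(f"weak={acc(weak):.3f} student={acc(student):.3f} ceiling={acc(ceiling):.3f} PGR={pgr:.2f}")
```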
@Pavel_Izmailov
Pavel Izmailov
6 months
📢 I am recruiting Ph.D. students for my new lab at @nyuniversity ! Please apply if you want to work on understanding deep learning and large models, and do a Ph.D. in the most exciting city on Earth. Details on my website: . Please spread the word!
Tweet media one
Tweet media two
30
185
912
@Pavel_Izmailov
Pavel Izmailov
1 year
I defended my PhD thesis "Deconstructing Models and Methods in Deep Learning" yesterday 🥳 Thank you so much to my committee members @andrewgwils @ylecun @kchonyc @FeiziSoheil @sirbayes @sainingxie , colleagues and friends!
Tweet media one
Tweet media two
36
8
342
@Pavel_Izmailov
Pavel Izmailov
2 years
Spurious features are a major issue for deep learning. Our new #NeurIPS2022 paper w/ @polkirichenko , @gruver_nate and @andrewgwils explores the representations learned on data with spurious features, with many surprising findings and SOTA results. 🧵1/6
Tweet media one
5
55
328
@Pavel_Izmailov
Pavel Izmailov
3 years
We run HMC on hundreds of TPU devices for millions of training epochs to provide our best approximation of the true Bayesian neural networks! (1) BNNs do better than deep ensembles (2) no cold posteriors effect but (3) BNNs are terrible under data corruption, and much more! 🧵
@andrewgwils
Andrew Gordon Wilson
3 years
What are Bayesian neural network posteriors really like? With high fidelity HMC, we study approximate inference quality, generalization, cold posteriors, priors, and more. With @Pavel_Izmailov , @sharadvikram , and Matthew D. Hoffman. 1/10
Tweet media one
6
166
723
5
47
265
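The tweet above is about full-batch HMC for BNN posteriors at TPU scale; as a refresher on the sampler itself, here is a minimal single-chain HMC step with a leapfrog integrator in plain NumPy, on a toy Gaussian target rather than the paper's JAX/TPU code.

```python
import numpy as np

def hmc_step(theta, log_prob, grad_log_prob, step_size=0.1, n_leapfrog=20, rng=np.random):
    """One Hamiltonian Monte Carlo step with a leapfrog integrator."""
    theta0 = theta.copy()
    p = rng.normal(size=theta.shape)                 # sample momentum
    p0 = p.copy()
    # Leapfrog integration of Hamiltonian dynamics.
    p = p + 0.5 * step_size * grad_log_prob(theta)
    for _ in range(n_leapfrog - 1):
        theta = theta + step_size * p
        p = p + step_size * grad_log_prob(theta)
    theta = theta + step_size * p
    p = p + 0.5 * step_size * grad_log_prob(theta)
    # Metropolis accept/reject on the joint (position, momentum) energy.
    current = log_prob(theta0) - 0.5 * p0 @ p0
    proposed = log_prob(theta) - 0.5 * p @ p
    if np.log(rng.uniform()) < proposed - current:
        return theta
    return theta0

# Example: sample from a 2-d standard Gaussian "posterior".
log_prob = lambda th: -0.5 * th @ th
grad_log_prob = lambda th: -th
samples, theta = [], np.zeros(2)
for _ in range(1000):
    theta = hmc_step(theta, log_prob, grad_log_prob)
    samples.append(theta)
```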
@Pavel_Izmailov
Pavel Izmailov
2 years
We explore how to represent aleatoric (irreducible) uncertainty in Bayesian classification, with profound implications for performance, data augmentation, and cold posteriors in BDL. w/ @snymkpr , W. Maddox, @andrewgwils 🧵 1/16
Tweet media one
2
52
253
@Pavel_Izmailov
Pavel Izmailov
10 months
Extremely excited to join this amazing team at @OpenAI !
@OpenAI
OpenAI
10 months
@ilyasut (co-founder and Chief Scientist) will be co-leading the team with @janleike (Head of Alignment). In addition to members from our existing alignment team, joining are Harri Edwards, @burdayur , @AdrienLE , @__nmca__ , @CollinBurns4 , @bobabowen , @Pavel_Izmailov , @leopoldasch .
36
40
552
16
7
231
@Pavel_Izmailov
Pavel Izmailov
3 years
Dangers of Bayesian Model Averaging under Covariate Shift We show how Bayesian neural nets can generalize *extremely* poorly under covariate shift, why it happens and how to fix it! With Patrick Nicholson, @LotfiSanae and @andrewgwils 1/10
Tweet media one
3
39
203
@Pavel_Izmailov
Pavel Izmailov
3 years
Our paper on HMC for Bayesian neural networks will appear at #ICML2021 as a long talk! We are also excited to release our JAX code and HMC samples:
Code:
Colab showing how to load the samples:
Paper:
@Pavel_Izmailov
Pavel Izmailov
3 years
We run HMC on hundreds of TPU devices for millions of training epochs to provide our best approximation of the true Bayesian neural networks! (1) BNNs do better than deep ensembles (2) no cold posteriors effect but (3) BNNs are terrible under data corruption, and much more! 🧵
5
47
265
2
30
175
@Pavel_Izmailov
Pavel Izmailov
1 year
Check out our FlexiViT paper, appearing at #CVPR2023 ! We show that you can train one vision transformer model that works with all patch sizes, allowing you to decide on an accuracy-compute trade-off at test time! Paper: Code:
Tweet media one
@giffmana
Lucas Beyer (bl16)
1 year
This ballad about Sir FlexiViT is the coolest thing ever! It's a nice explanation of the main point of FlexiViT, and the 50min video easily plays in @ykilcher 's league😍 The paper "FlexiViT: One Model for All Patch Sizes" was accepted at CVPR, so here comes my summary: 🧶1/N
2
28
179
2
24
128
@Pavel_Izmailov
Pavel Izmailov
1 year
I am in New Orleans for #NeurIPS2022 , ping me if you want to chat! I am also on the academic job market this year :)
Tweet media one
0
6
82
@Pavel_Izmailov
Pavel Izmailov
2 years
🔥 Our work on Bayesian model selection received an Outstanding Paper Award at #ICML2022 ! Please see the talk by @LotfiSanae tomorrow and join us at the poster session!
@LotfiSanae
Sanae Lotfi
2 years
I'm so proud that our paper on the marginal likelihood won the Outstanding Paper Award at #ICML2022 !!! Congratulations to my amazing co-authors @Pavel_Izmailov , @g_benton_ , @micahgoldblum , @andrewgwils 🎉 Talk on Thursday, 2:10 pm, room 310 Poster 828 on Thursday, 6-8 pm, hall E
Tweet media one
13
33
324
7
6
80
@Pavel_Izmailov
Pavel Izmailov
2 years
We will be presenting our work "On Feature Learning in the Presence of Spurious Correlations" today at the PODS workshop! Come chat with us about group robustness and the factors that affect it :) 11:50-12:30 and 4:55-5:40, Ballroom 3. w/ @polkirichenko @gruver_nate @andrewgwils
Tweet media one
1
16
74
@Pavel_Izmailov
Pavel Izmailov
4 years
Check out our new video and blogpost visualizing mode connectivity. For this video we evaluated over 50 mil parameter configurations of a ResNet20:) It took over two weeks on 15 GPUs. W/ @ideami @tim_garipov @andrewgwils Blog:
1
24
64
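For context on how frames like these are usually computed, a rough sketch (assumed, not the exact pipeline behind the video): fix a 2-d plane through three sets of trained weights and evaluate the training loss on a grid in that plane.

```python
import torch

def set_flat_params(model, vec):
    i = 0
    for p in model.parameters():
        n = p.numel()
        p.data.copy_(vec[i:i + n].view_as(p)); i += n

def get_flat_params(model):
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

@torch.no_grad()
def loss_plane(model, loss_fn, w0, w1, w2, steps=25):
    u, v = w1 - w0, w2 - w0                       # plane spanned by two difference vectors
    grid = torch.zeros(steps, steps)
    for i, a in enumerate(torch.linspace(-0.2, 1.2, steps)):
        for j, b in enumerate(torch.linspace(-0.2, 1.2, steps)):
            set_flat_params(model, w0 + a * u + b * v)
            grid[i, j] = loss_fn(model)           # loss at this point of the plane
    return grid

# Placeholder model/data: three independently trained linear models on random data.
x, y = torch.randn(256, 10), torch.randint(0, 2, (256,))
loss_fn = lambda m: torch.nn.functional.cross_entropy(m(x), y)
ws = []
for seed in range(3):
    torch.manual_seed(seed)
    m = torch.nn.Linear(10, 2)
    opt = torch.optim.SGD(m.parameters(), lr=0.1)
    for _ in range(200):
        opt.zero_grad(); loss_fn(m).backward(); opt.step()
    ws.append(get_flat_params(m))
grid = loss_plane(m, loss_fn, *ws)                # one 25x25 slice of the loss surface
```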
@Pavel_Izmailov
Pavel Izmailov
4 years
SWA is finally natively supported in PyTorch, see torch.optim.swa_utils :) See the blogpost for more details and some examples here:
@PyTorch
PyTorch
4 years
Stochastic Weight Averaging (SWA) is a simple procedure that improves generalization in deep learning over Stochastic Gradient Descent (SGD). PyTorch 1.6 now includes SWA natively. Learn more from @Pavel_Izmailov , @andrewgwils and Vincent:
5
206
864
0
6
62
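The torch.optim.swa_utils interface mentioned above follows the usual SWA recipe; here is a minimal usage sketch with a placeholder model, data loader, and schedule lengths.

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

model = torch.nn.Linear(10, 2)                 # placeholder model
loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(20)]  # placeholder data
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

swa_model = AveragedModel(model)               # keeps a running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # constant SWA learning rate
swa_start = 75                                 # start averaging late in training

for epoch in range(100):
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)     # average in the current weights
        swa_scheduler.step()
    else:
        scheduler.step()

update_bn(loader, swa_model)                   # recompute BatchNorm statistics for the averaged model
```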
@Pavel_Izmailov
Pavel Izmailov
4 years
Check out this short video for our @NeurIPSConf paper on SWAG, a simple method that improves predictions and uncertainty in deep learning; it is motivated by loss surface geometry and scales to ImageNet. (🍄/🍄🍄🍄)
2
14
59
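As a rough illustration of the idea behind SWAG, here is a sketch of its diagonal variant (SWAG-Diag) with a placeholder model: track the first two moments of the SGD iterates, then sample weights from the resulting Gaussian. The full method also keeps a low-rank deviation term, which is omitted here.

```python
import torch

model = torch.nn.Linear(10, 2)                         # placeholder network
data = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(20)]
opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)

def flatten(m):
    return torch.cat([p.detach().reshape(-1) for p in m.parameters()])

def load_flat(m, vec):
    i = 0
    for p in m.parameters():
        n = p.numel()
        p.data.copy_(vec[i:i + n].view_as(p)); i += n

mean = torch.zeros_like(flatten(model)); sq_mean = torch.zeros_like(mean); k = 0
for epoch in range(30):
    for x, y in data:                                  # ordinary SGD training
        opt.zero_grad()
        torch.nn.functional.cross_entropy(model(x), y).backward()
        opt.step()
    if epoch >= 20:                                    # start collecting iterates after a burn-in
        w = flatten(model)
        mean = (k * mean + w) / (k + 1)                # running mean of iterates (the SWA solution)
        sq_mean = (k * sq_mean + w ** 2) / (k + 1)     # running second moment
        k += 1

var = torch.clamp(sq_mean - mean ** 2, min=1e-30)
load_flat(model, mean + var.sqrt() * torch.randn_like(mean))  # one approximate posterior sample
```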
@Pavel_Izmailov
Pavel Izmailov
3 years
Turns out, it is really hard to get the student to match the teacher's predictions in knowledge distillation, even if we train for a very long time and use lots of data augmentation. Why? Optimization is hard! New paper with @samscub @polkirichenko @alemi and @andrewgwils !
@andrewgwils
Andrew Gordon Wilson
3 years
Does knowledge distillation really work? While distillation can improve student generalization, we show it is extremely difficult to achieve good agreement between student and teacher. With @samscub , @Pavel_Izmailov , @polkirichenko , Alex Alemi. 1/10
Tweet media one
Tweet media two
7
73
344
2
10
54
@Pavel_Izmailov
Pavel Izmailov
2 years
I am at #ICML2022 ! DM me if you want to meet :)
1
1
53
@Pavel_Izmailov
Pavel Izmailov
2 years
We are presenting our paper "Dangers of Bayesian Model Averaging under Covariate Shift" at #NeurIPS2021 now! Looking forward to seeing you at the poster session! Poster: Paper:
Tweet media one
1
10
50
@Pavel_Izmailov
Pavel Izmailov
2 years
Very excited to give a talk at AABI tomorrow (Feb 1st) at 5PM GMT / 12PM ET! I will be talking about our recent work on HMC for Bayesian neural networks, cold posteriors, priors, approximate inference and BNNs under distribution shift. Please join!
@liyzhen2
yingzhen
2 years
Join us to discuss the latest advances in approximate inference and probabilistic models at AABI 2022 on Feb 1-2! Webinar registration: We have an amazing line-up of speakers, panelists and papers👍 @vincefort @Tkaraletsos @s_mandt @ruqi_zhang
Tweet media one
4
31
140
2
8
44
@Pavel_Izmailov
Pavel Izmailov
6 months
Among other topics, I am excited about out-of-distribution generalization, interpretability, large language and vision models, technical AI alignment, uncertainty estimation, core deep learning methodology and applications. See my papers here:
3
1
39
@Pavel_Izmailov
Pavel Izmailov
3 years
We are going to release our JAX code and the HMC samples very soon. Stay tuned!
1
0
33
@Pavel_Izmailov
Pavel Izmailov
1 year
We will be presenting "On Feature Learning in the Presence of Spurious Correlations" today (Nov 29) at #NeurIPS2022 w/ @polkirichenko @gruver_nate @andrewgwils ! Hall J #103 , 4-6 pm Paper:
Tweet media one
@Pavel_Izmailov
Pavel Izmailov
2 years
Spurious features are a major issue for deep learning. Our new #NeurIPS2022 paper w/ @polkirichenko , @gruver_nate and @andrewgwils explores the representations learned on data with spurious features, with many surprising findings and SOTA results. 🧵1/6
Tweet media one
5
55
328
4
4
26
@Pavel_Izmailov
Pavel Izmailov
6 months
I am recruiting in the CSE () and CS () departments. Deadlines December 1 and 12!
1
1
24
@Pavel_Izmailov
Pavel Izmailov
2 years
Congratulations to the winners and a huge thank you to everyone who participated :)
@bdl_competition
NeurIPS Approximate Inference in BDL Competition
2 years
Excited to announce the winners of our competition! 🥇Team @riken_en ( @tmoellenhoff , Y. Shen, @ShhhPeaceful , @PeterNickl_ , @EmtiyazKhan ) wins both competition tracks! 🥈🥈 @niket096 and A. Thin are second in the extended and tie with @ArnaudDelaunoy for second in the light track.
5
15
104
0
0
20
@Pavel_Izmailov
Pavel Izmailov
4 years
Excited to share this paper :) The high-level takeaway is that the main thing that affects OOD detection in likelihood-based models is their inductive biases. You can have the same likelihood on the training data and an arbitrary likelihood outside it. Flows have bad inductive biases for OOD detection.
Tweet media one
@polkirichenko
Polina Kirichenko
4 years
Why Normalizing Flows Fail to Detect Out-of-Distribution Data We explore the inductive biases of normalizing flows based on coupling layers in the context of OOD detection (1/6)
1
65
352
0
2
19
@Pavel_Izmailov
Pavel Izmailov
3 years
Our competition on approximate inference for Bayesian deep learning has started! We tried to make it as accessible as possible: you can use any language you like, and we provide examples and resources. Give it a try :)
@bdl_competition
NeurIPS Approximate Inference in BDL Competition
3 years
Our #NeurIPS2021 competition "Approximate Inference in Bayesian Deep Learning" has started! The goal is to provide high quality approximate inference for Bayesian neural networks, using high-fidelity HMC from hundreds of TPUs as a reference.
1
32
119
1
2
17
@Pavel_Izmailov
Pavel Izmailov
3 years
@ideami created a really cool website where you can play around with his 3D visualizations of the loss surfaces of deep neural nets: ! Includes our collaboration on mode connectivity:
Tweet media one
1
4
17
@Pavel_Izmailov
Pavel Izmailov
3 years
Another cool result: a single long HMC chain appears to be quite good at exploring the posterior, at least in the function space. The results hint that MCMC methods are able to leverage mode connectivity to move between functionally diverse solutions.
Tweet media one
Tweet media two
1
0
17
@Pavel_Izmailov
Pavel Izmailov
4 years
@PierreAblin We got some cool visualizations with @tim_garipov @ideami and @andrewgwils here:
1
8
15
@Pavel_Izmailov
Pavel Izmailov
2 months
I was very impressed by @martinmarek1999 in this project; look out for more exciting research from him!
@martin__marek
Martin Marek
2 months
We introduce a prior distribution to control the aleatoric (data) uncertainty of a Bayesian neural network, nearly matching the accuracy of cold posteriors 🥶 w/ Brooks Paige and @Pavel_Izmailov 🧵1/8
Tweet media one
3
2
20
0
1
16
@Pavel_Izmailov
Pavel Izmailov
3 years
This was a very exciting project to work on, initially quite mysterious but with a simple and satisfying resolution! Check out the paper for more details and insights :) We also release our code at 10/10
0
4
15
@Pavel_Izmailov
Pavel Izmailov
2 years
Turns out SWA and SAM provide complementary improvements and can be combined for even better performance! Cool paper by @jeankaddour , @likicode , Ricardo Silva, and Matt J. Kusner!
@jeankaddour
Jean Kaddour
2 years
Flat minima often generalize better than sharp ones due to robustness against loss shifts between train and test set. What’s the best way to find them? We compare two popular methods, SWA and SAM, across 42 deep learning tasks (CV, NLP, GRL): 1/7
Tweet media one
6
79
449
1
0
14
@Pavel_Izmailov
Pavel Izmailov
6 months
❤️
@janleike
Jan Leike
6 months
I think the OpenAI board should resign
98
199
3K
0
0
14
@Pavel_Izmailov
Pavel Izmailov
3 years
First, we find that BNNs at temperature 1 with regular Gaussian priors are actually quite good, outperforming deep ensembles on both accuracy and likelihood!
Tweet media one
1
0
13
@Pavel_Izmailov
Pavel Izmailov
3 years
There is also a negative result: Bayesian neural nets seem to generalize very poorly to corrupted data! An ensemble of 720 HMC samples is worse than a single SGD solution when the inputs are noisy or corrupted.
Tweet media one
1
0
13
@Pavel_Izmailov
Pavel Izmailov
3 years
We also compare the predictions of popular approximate inference methods to HMC. Advanced SGMCMC methods provide the most accurate approximation, deep ensembles are quite good even though often considered non-Bayesian, and mean field VI is the worst.
Tweet media one
1
0
13
@Pavel_Izmailov
Pavel Izmailov
4 years
@StephaneDeny @andrewgwils @g_benton_ @m_finzi I think Augerino could be extended to these scenarios: we can parameterize the set of transformations that we want to be invariant to with something like a GAN generator (or ). Definitely an exciting future work direction :)
0
0
12
@Pavel_Izmailov
Pavel Izmailov
5 months
We believe this problem of weak-to-strong learning will be central to alignment of superhuman AI systems in the future. It is also a tractable ML problem with close connections to OOD generalization, label noise, semi-supervised learning etc!
@CollinBurns4
Collin Burns
5 months
Humans won't be able to supervise models smarter than us. For example, if a superhuman model generates a million lines of extremely complicated code, we won’t be able to tell if it’s safe to run or not, if it follows our instructions or not, and so on.
7
5
71
2
1
11
@Pavel_Izmailov
Pavel Izmailov
3 years
What about the priors? We compare several prior families and study the dependence on prior variance with Gaussian priors. Generally, the effect on performance is fairly minor.
Tweet media one
1
0
11
@Pavel_Izmailov
Pavel Izmailov
5 months
We are also launching $10M in grants for academics, grad students, and others to work on this and other directions in superalignment. Apply by Feb 18! Application:
@OpenAI
OpenAI
5 months
We're announcing, together with @ericschmidt : Superalignment Fast Grants. $10M in grants for technical research on aligning superhuman AI systems, including weak-to-strong generalization, interpretability, scalable oversight, and more. Apply by Feb 18!
213
464
3K
2
0
10
@Pavel_Izmailov
Pavel Izmailov
4 years
@_shingc @PyTorch @andrewgwils The tweet actually links to a new blogpost describing the new interface :) See also examples here: Also there is documentation here:
1
2
10
@Pavel_Izmailov
Pavel Izmailov
4 years
Visualizations made with our friend @ideami . Here we show posterior density for ResNet20 on CIFAR10 and SWAG posterior in the subspace of top 2 PCA components of SGD trajectory. Variances are aligned with width. (🍄🍄🍄/🍄🍄🍄)
Tweet media one
Tweet media two
Tweet media three
1
2
10
@Pavel_Izmailov
Pavel Izmailov
2 years
Really excited about this paper: we achieve SOTA results on spurious correlation benchmarks by simply reweighting the features learned by standard ERM! The method only has one hyper-parameter and is extremely simple and cheap!
@polkirichenko
Polina Kirichenko
2 years
Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations. ERM learns multiple features that can be reweighted for SOTA on spurious correlations, reducing texture bias on ImageNet, & more! w/ @Pavel_Izmailov and @andrewgwils 1/11
Tweet media one
13
72
529
2
0
10
@Pavel_Izmailov
Pavel Izmailov
3 months
Very cool experiment from the preparedness team ☣️
@tejalpatwardhan
Tejal Patwardhan
3 months
latest from preparedness @ openai: gpt4 at most mildly helps with biothreat creation. method: get bio PhDs in a secure monitored facility. half try biothreat creation w/ (experimental) unsafe gpt4. other half can only use the internet. so far, gpt4 ≈ internet… but we’ll…
7
20
147
0
0
10
@Pavel_Izmailov
Pavel Izmailov
3 years
In fact, tempering even hurts the performance in some cases, with the best performance achieved at temperature 1. What is the main difference with ? (1) We turn data augmentation off and (2) we use a very high fidelity inference procedure.
Tweet media one
1
0
9
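For reference, one common convention for the posterior temperature discussed in this thread (assumed here; papers differ on whether the prior is also tempered): the log-posterior is scaled by 1/T, so T = 1 is the ordinary Bayes posterior and T < 1 is a "cold" posterior that concentrates more sharply.

```python
import numpy as np

def tempered_log_posterior(theta, log_likelihood, log_prior, T=1.0):
    # Scale the whole log-posterior by 1/T (one common "cold posterior" convention).
    return (log_likelihood(theta) + log_prior(theta)) / T

# Toy 1-d example: Gaussian likelihood centered at 1.0 and a standard normal prior.
log_lik = lambda th: -0.5 * (th - 1.0) ** 2 / 0.1
log_pri = lambda th: -0.5 * th ** 2
grid = np.linspace(-1.0, 2.0, 1001)
for T in (1.0, 0.1):
    p = np.exp(tempered_log_posterior(grid, log_lik, log_pri, T))
    p /= p.sum()
    mean = np.sum(p * grid)
    std = np.sqrt(np.sum(p * (grid - mean) ** 2))
    print(f"T={T}: posterior mean={mean:.3f}, std={std:.3f}")  # colder posterior is narrower
```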
@Pavel_Izmailov
Pavel Izmailov
2 years
We use Deep Feature Reweighting (DFR) to evaluate feature representations: retrain the last layer of the model on group-balanced validation data. DFR worst group accuracy (WGA) tells us how much information about the core features is learned. 2/6
@polkirichenko
Polina Kirichenko
2 years
Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations. ERM learns multiple features that can be reweighted for SOTA on spurious correlations, reducing texture bias on ImageNet, & more! w/ @Pavel_Izmailov and @andrewgwils 1/11
Tweet media one
13
72
529
1
0
9
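A sketch of the DFR evaluation described above, with random arrays standing in for frozen-backbone features and group labels (illustrative, not the paper's code): retrain only a linear head on a group-balanced held-out split, then report worst-group accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def group_balanced_subset(features, labels, groups):
    """Subsample so that every group appears equally often."""
    n_min = min(np.sum(groups == g) for g in np.unique(groups))
    idx = np.concatenate([rng.choice(np.flatnonzero(groups == g), n_min, replace=False)
                          for g in np.unique(groups)])
    return features[idx], labels[idx]

def worst_group_accuracy(clf, features, labels, groups):
    return min(clf.score(features[groups == g], labels[groups == g]) for g in np.unique(groups))

# Random arrays standing in for frozen-backbone features and (class x attribute) group ids.
feats_val, y_val, g_val = rng.normal(size=(400, 64)), rng.integers(0, 2, 400), rng.integers(0, 4, 400)
feats_test, y_test, g_test = rng.normal(size=(400, 64)), rng.integers(0, 2, 400), rng.integers(0, 4, 400)

Xb, yb = group_balanced_subset(feats_val, y_val, g_val)
head = LogisticRegression(max_iter=1000).fit(Xb, yb)     # the retrained last layer
print("DFR worst-group accuracy:", worst_group_accuracy(head, feats_test, y_test, g_test))
```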
@Pavel_Izmailov
Pavel Izmailov
4 years
With Wesley Maddox, @tim_garipov , Dmitry Vetrov ( @bayesgroup ) and @andrewgwils Paper: Code: (🍄🍄/🍄🍄🍄)
1
2
8
@Pavel_Izmailov
Pavel Izmailov
2 years
Better models learn the core feature better: in-distribution accuracy is linearly correlated with the DFR WGA. We don’t find qualitative differences between different types of architectures, such as CNNs and vision transformers: they all fall on the same line. 4/6
Tweet media one
2
0
8
@Pavel_Izmailov
Pavel Izmailov
3 years
@tdietterich @andrewgwils Deep ensembles are typically trained with L2 regularization, which corresponds to a Gaussian prior, but it can be switched to any other prior. We show empirically in that deep ensembles with L2 regularization approximate HMC with a Gaussian prior.
1
1
8
@Pavel_Izmailov
Pavel Izmailov
2 years
While group robustness methods such as group DRO can improve WGA a lot, they don’t typically improve the features! With DFR, we recover the same performance for ERM and Group DRO. The improvement in these methods comes from the last layer, not features! 3/6
Tweet media one
1
0
8
@Pavel_Izmailov
Pavel Izmailov
3 years
Also, see this nice thread on a paper by @giffmana @XiaohuaZhai @__kolesnikov__ @_arohan_ @royaleerieme and Larisa Markeeva: The paper came out just two days ago and is closely related!
@giffmana
Lucas Beyer (bl16)
3 years
So you think you know distillation; it's easy, right? We thought so too with @XiaohuaZhai @__kolesnikov__ @_arohan_ and the amazing @royaleerieme and Larisa Markeeva. Until we didn't. But now we do again. Hop on for a ride (+the best ever ResNet50?) 🧵👇
Tweet media one
8
114
567
0
0
7
@Pavel_Izmailov
Pavel Izmailov
2 years
@BlackHC @SamuelAinsworth @andrewgwils @a1mmer I think there are a few caveats: (1) the argument requires the Laplace approximation to perfectly describe the basin, which is far from given. (2) I believe the Git-Re-Basin observations don't say that the distribution of solutions is the same within each mode?
1
0
7
@Pavel_Izmailov
Pavel Izmailov
4 years
@KellenDB @leopd @andrewgwils It's been tried on a bunch of things: ResNets, DenseNets, VGGs, also LSTMs, in deep RL, in low-precision training, in parallel training, GANs, physical modeling. Seems to help quite generally :)
1
0
7
@Pavel_Izmailov
Pavel Izmailov
2 years
@srchvrs @giffmana I'd say the take-away is that you don't necessarily need tricks to learn good features even if the data has spurious / shortcut features. But you need some tricks (e.g. training on group balanced data) to learn a good head / weighting of those features.
0
0
7
@Pavel_Izmailov
Pavel Izmailov
2 years
ImageNet pretraining (supervised or contrastive) has a major effect on the features, even on non-natural image datasets such as chest X-rays. With strong pretrained models, we achieve SOTA WGA on Waterbirds (97%) , CelebA (92%) and FMOW (50%) with ERM features. 5/6
Tweet media one
1
0
7
@Pavel_Izmailov
Pavel Izmailov
3 years
Finally, now that we understand the issue we can design a simple fix! We propose EmpCov priors, Gaussian priors which have low variance along the directions where the data has low variance. EmpCov priors significantly improve robustness on many corruptions! 9/10
Tweet media one
1
1
5
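A minimal sketch of the EmpCov idea as described in the tweet, assuming it is applied to the first-layer weights: align the prior covariance with the empirical covariance of the inputs so that directions with little data variance get little prior variance. The scaling and the choice of layers are assumptions here, not the paper's exact recipe.

```python
import numpy as np

def empcov_prior(X, alpha=1.0, eps=1e-2):
    """Zero-mean Gaussian prior whose covariance follows the empirical input covariance."""
    Xc = X - X.mean(axis=0)
    emp_cov = Xc.T @ Xc / len(X)                       # empirical covariance of the inputs
    cov = alpha * emp_cov + eps * np.eye(X.shape[1])   # small isotropic term keeps it full rank
    return np.zeros(X.shape[1]), cov

def log_prior(w, mean, cov):
    diff = w - mean
    return -0.5 * diff @ np.linalg.solve(cov, diff)    # up to an additive constant

# Toy inputs whose later features have much more variance than the early ones.
X = np.random.default_rng(0).normal(size=(1000, 20)) * np.linspace(0.01, 1.0, 20)
mean, cov = empcov_prior(X)
w = np.random.default_rng(1).normal(size=20)
print(log_prior(w, mean, cov))
```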
@Pavel_Izmailov
Pavel Izmailov
2 years
One of the interesting takeaways is that while prior works focused on making the representations robust to spurious correlation, the representations are in fact fine even with standard ERM: the issue is largely in the last linear layer.
0
0
6
@Pavel_Izmailov
Pavel Izmailov
3 years
Distillation presents an exciting challenge for optimization in deep learning. Unlike standard learning, in distillation we actually want to get the training loss as low as possible; overfitting is not an issue. Improving the optimizer is likely to improve distillation!
Tweet media one
1
0
6
@Pavel_Izmailov
Pavel Izmailov
6 months
❤️
@ilyasut
Ilya Sutskever
6 months
I deeply regret my participation in the board's actions. I never intended to harm OpenAI. I love everything we've built together and I will do everything I can to reunite the company.
7K
4K
33K
0
0
6
@Pavel_Izmailov
Pavel Izmailov
1 year
@andrewgwils Thank you so much @andrewgwils ! So happy that I did my PhD in your lab!
1
0
6
@Pavel_Izmailov
Pavel Izmailov
4 years
@dnnslmr @andrewgwils @polkirichenko Sure, here are the ones I know:
- — great overview of NFs
- — flows for discrete data
- — integer discrete flows — a mixture of discrete and continuous latent variables
1
0
5
@Pavel_Izmailov
Pavel Izmailov
3 years
Consider an MLP on MNIST. MNIST has many pixels near the boundary that are 0 for all images. The corresponding weights in the first layer will always be multiplied by 0 and will not interact with the likelihood. For these weights, the posterior will be the same as the prior! 3/10
1
0
5
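The "dead pixel" argument is easy to check numerically; here is a quick sketch using scikit-learn's small 8x8 digits dataset as a stand-in for MNIST.

```python
import numpy as np
from sklearn.datasets import load_digits

X = load_digits().data                      # shape (n_images, 64), pixel intensities
dead = np.flatnonzero(X.max(axis=0) == 0)   # pixels that are 0 in every image
print(f"{len(dead)} of {X.shape[1]} pixels are identically zero")
# For an MLP with first-layer weight matrix W of shape [64, hidden], the rows W[dead]
# are always multiplied by zero inputs, so the likelihood never constrains them and
# their posterior equals the prior. A Bayesian model average therefore multiplies any
# test-time noise in those pixels by prior-sampled weights, while MAP/SGD sets them to ~0.
```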
@Pavel_Izmailov
Pavel Izmailov
2 years
Paper:
Code:
ICML oral: Thursday, 2:10 pm, room 310 ()
ICML poster: Poster 828 on Thursday, 6-8 pm, hall E ()
1
1
5
@Pavel_Izmailov
Pavel Izmailov
3 years
But the MAP solution will just set these weights to zero (see gif in previous tweet). Now, suppose we apply noise to a test image: some of the dead pixels will activate! MAP will simply ignore these pixels, but a true BNN will multiply them by weights drawn from the prior! 4/10
Tweet media one
1
0
5
@Pavel_Izmailov
Pavel Izmailov
4 years
@pfau @daniela_witten @robtibshirani @HastieTrevor We have looked into that empirically here:
@andrewgwils
Andrew Gordon Wilson
4 years
Bayesian model averaging mitigates double descent! We have just posted this new result in section 7 of our paper on Bayesian deep learning with @Pavel_Izmailov : . The result highlights the importance of *multi-modal* marginalization with Multi-SWAG. 1/3
Tweet media one
2
81
400
1
0
5
@Pavel_Izmailov
Pavel Izmailov
3 years
In our recent paper () we found that BNNs perform really well in-distribution, but generalize terribly under covariate shift. This result was very puzzling for us, but in this new work we provide an explanation! 2/10
Tweet media one
@andrewgwils
Andrew Gordon Wilson
3 years
What are Bayesian neural network posteriors really like? With high fidelity HMC, we study approximate inference quality, generalization, cold posteriors, priors, and more. With @Pavel_Izmailov , @sharadvikram , and Matthew D. Hoffman. 1/10
Tweet media one
6
166
723
2
0
5
@Pavel_Izmailov
Pavel Izmailov
3 years
@tomgoldsteincs Agreed! Distillation is in fact very similar to standard training, but simpler: we can produce as much data as we want, ensure we have sufficient capacity and use more informative (soft) labels!
0
0
5
@Pavel_Izmailov
Pavel Izmailov
1 year
We found knowledge distillation to help a lot with a nice trick: we initialize the student FlexiViT model with the weights of a teacher such as a ViT-B/8, leading to much better performance compared to random initialization. Inspired by:
Tweet media one
@andrewgwils
Andrew Gordon Wilson
3 years
Is there _anything_ we can do to produce a high fidelity student? In self-distillation the student can in principle match the teacher. We initialize the student with a combination of teacher and random weights. Starting close enough, we can finally recover the teacher. 8/10
Tweet media one
1
0
10
1
0
5
@Pavel_Izmailov
Pavel Izmailov
5 years
@Marcin_Bog @ideami @tim_garipov @andrewgwils For this particular visualization we have a much simpler version (2D, matplotlib) implemented here: and
0
0
5
@Pavel_Izmailov
Pavel Izmailov
3 years
In KD you want to match the student to the teacher on as much training data as possible to ensure that the models will also make similar predictions on test data. However, it turns out that even getting the student and teacher to match on the train data is really hard!
Tweet media one
1
0
5
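For concreteness, a small sketch of the two quantities this thread revolves around: the usual temperature-softened distillation loss and a top-1 agreement metric between student and teacher. The logits below are placeholders, not the paper's code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures (common convention).
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

def agreement(student_logits, teacher_logits):
    """Fraction of inputs where student and teacher predict the same class."""
    return (student_logits.argmax(-1) == teacher_logits.argmax(-1)).float().mean()

s = torch.randn(32, 10)   # placeholder student logits
t = torch.randn(32, 10)   # placeholder teacher logits
print(distillation_loss(s, t).item(), agreement(s, t).item())
```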
@Pavel_Izmailov
Pavel Izmailov
2 years
@BlackHC @SamuelAinsworth @andrewgwils @a1mmer i.e. even if we know that each "mode" contains all kinds of solutions, it doesn't guarantee that the posterior mass corresponding to the solutions is consistent between the "modes": it could be that 99% of the mass is still just one solution. Would be interesting to find out!
0
0
5
@Pavel_Izmailov
Pavel Izmailov
4 years
My favorite visualization from the blogpost :)
Tweet media one
0
2
5
@Pavel_Izmailov
Pavel Izmailov
2 years
Finally, data augmentation indeed leads to underconfident fits on the training set, and posterior tempering or ND are needed to correct for this underconfidence. These results concretely resolve the observed link between the cold posterior effect and augmentation! 12/16
Tweet media one
1
0
4
@Pavel_Izmailov
Pavel Izmailov
2 years
Then, on a dataset with 100 classes, the posterior samples will on average only be 2% confident in the observed training label. But on benchmarks like CIFAR we believe there’s almost no label uncertainty! 4/16
1
0
4
@Pavel_Izmailov
Pavel Izmailov
3 years
We are hoping that the samples can be useful to the Bayesian deep learning community! We also plan to add samples for new datasets and architectures over time. Please let us know if you have any issues loading or using the checkpoints.
0
0
4
@Pavel_Izmailov
Pavel Izmailov
2 years
@JordyLandeghem @EmtiyazKhan @tmoellenhoff Yes, we plan to record the meeting!
0
0
4
@Pavel_Izmailov
Pavel Izmailov
3 years
@giffmana I think you are right that for more diverse datasets we would likely see less degeneracy in the features. For CIFAR low-variance directions are checkerboard patterns and I would think you would still not see a lot of these on ImageNet? Would be fun to check!
Tweet media one
2
0
3
@Pavel_Izmailov
Pavel Izmailov
3 years
We can generalize this reasoning to any linear dependencies in the data. In the paper, we prove that if the input features are linearly dependent (which is true for a lot of datasets), the BNN predictions will break if we break the linear dependence at test time! 5/10
1
0
4
@Pavel_Izmailov
Pavel Izmailov
1 year
In the paper, we show how to efficiently resize the patch embeddings and positional encoding parameters. By doing so and randomizing the patch size during training, we can train a *single model* that is, for example, competitive with the whole family of efficient net models.
Tweet media one
1
0
4
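A simplified sketch of the resizing step described above, using plain bilinear interpolation for both the patch-embedding filters and the positional embeddings; the paper's pseudo-inverse resize of the patch filters is more careful, and the shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

def resize_patch_embed(w, new_patch):
    # w: [embed_dim, in_chans, p, p] conv filters of the patch-embedding layer.
    return F.interpolate(w, size=(new_patch, new_patch), mode="bilinear", align_corners=False)

def resize_pos_embed(pos, new_grid):
    # pos: [1, grid*grid, embed_dim] (class token omitted for simplicity).
    n, d = pos.shape[1], pos.shape[2]
    grid = int(n ** 0.5)
    pos = pos.reshape(1, grid, grid, d).permute(0, 3, 1, 2)
    pos = F.interpolate(pos, size=(new_grid, new_grid), mode="bilinear", align_corners=False)
    return pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)

w = torch.randn(768, 3, 16, 16)                   # 16x16 patch filters
pos = torch.randn(1, (224 // 16) ** 2, 768)       # positional embeddings for a 224px image
w8 = resize_patch_embed(w, 8)                     # switch to 8x8 patches...
pos8 = resize_pos_embed(pos, 224 // 8)            # ...and the matching 28x28 position grid
print(w8.shape, pos8.shape)
```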
@Pavel_Izmailov
Pavel Izmailov
1 year
Many more experiments and practical results in the paper! It was a really exciting collaboration with @giffmana , @mcaron31 , @skornblith , @XiaohuaZhai , @MJLM3 , @mtschannen , @ibomohsin and @FPavetic !
0
0
4
@Pavel_Izmailov
Pavel Izmailov
6 years
@ido87 @andrewgwils Hi Daniel, the plots show how the loss changes as you vary the parameters of the DNN in a two-dimensional subspace. The x axis is fixed and connects two independently trained sets of DNN weights. The y axis changes as we change the plane.
1
0
3
@Pavel_Izmailov
Pavel Izmailov
3 years
@martin_trapp @andrewgwils Hey @martin_trapp , were you thinking to use any language in particular? We will provide example code in python, but I think the competition is generally language agnostic.
0
0
3
@Pavel_Izmailov
Pavel Izmailov
4 years
@viraj_bagal @PyTorch @andrewgwils The new implementation in PyTorch is complete. It's not implemented as an optimizer wrapper, but rather as a model wrapper. As for the BN update, you also need to do it in TF; see the comment in the blue box here:
1
1
3
@Pavel_Izmailov
Pavel Izmailov
2 years
In regression, we can control the representation of aleatoric uncertainty with an interpretable noise parameter. In classification we use the same softmax cross-entropy likelihood regardless of the amount of label noise, which leads to underfitting the training data. 2/16
2
0
3
@Pavel_Izmailov
Pavel Izmailov
2 years
@adad8m @bdl_competition @riken_en @tmoellenhoff @ShhhPeaceful @PeterNickl_ @EmtiyazKhan @niket096 @ArnaudDelaunoy We used CIFAR-10-corrupted as our private data, where the accuracies and agreements are substantially lower than on the original CIFAR-10 test set
0
0
3
@Pavel_Izmailov
Pavel Izmailov
5 years
@DanFrederiksen2 @andrewgwils @tim_garipov @ideami In this visualization we demonstrate a particular phenomenon, mode connectivity: . We do not expect to capture everything about loss surfaces in 2d, but you can get insights about behavior in random / specific directions. E.g. .
0
0
3
@Pavel_Izmailov
Pavel Izmailov
3 years
@latentjasper I think so! We are able to get better than deep ensembles' performance on the same architectures with HMC (with no data augmentation). Also, using the cold posteriors' code, T=1 performance is better than low temperatures if we remove data augmentation.
2
0
3