Extremely excited to have this work out, the first paper from the Superalignment team! We study how large models can generalize from supervision of much weaker models.
In the future, humans will need to supervise AI systems much smarter than them.
We study an analogy: small models supervising large models.
Read the Superalignment team's first paper showing progress on a new approach, weak-to-strong generalization:
📢 I am recruiting Ph.D. students for my new lab at
@nyuniversity
! Please apply if you want to work on understanding deep learning and large models, and do a Ph.D. in the most exciting city on earth.
Details on my website: . Please spread the word!
Spurious features are a major issue for deep learning. Our new
#NeurIPS2022
paper w/
@pol_kirichenko
,
@gruver_nate
and
@andrewgwils
explores the representations learned by models trained on data with spurious features, with many surprising findings and SOTA results.
🧵1/6
We run HMC on hundreds of TPU devices for millions of training epochs to provide our best approximation of the true Bayesian neural networks! (1) BNNs do better than deep ensembles (2) no cold posteriors effect but (3) BNNs are terrible under data corruption, and much more! 🧵
What are Bayesian neural network posteriors really like? With high fidelity HMC, we study approximate inference quality, generalization, cold posteriors, priors, and more.
With
@Pavel_Izmailov
,
@sharadvikram
, and Matthew D. Hoffman. 1/10
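For intuition, here is a minimal, hypothetical sketch of a single HMC step (leapfrog integration plus a Metropolis correction) on a toy 1-D Gaussian posterior — nothing like the paper's TPU-scale setup, just the core update:

```python
import math, random

def hmc_step(q, log_prob, grad_log_prob, step_size=0.1, n_leapfrog=20):
    """One Hamiltonian Monte Carlo step on a 1-D target (toy sketch)."""
    p = random.gauss(0.0, 1.0)                    # resample momentum
    q_new, p_new = q, p
    # leapfrog integration: half kick, alternating drifts/kicks, half kick
    p_new += 0.5 * step_size * grad_log_prob(q_new)
    for _ in range(n_leapfrog - 1):
        q_new += step_size * p_new
        p_new += step_size * grad_log_prob(q_new)
    q_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_prob(q_new)
    # Metropolis accept/reject using the Hamiltonian
    h_old = -log_prob(q) + 0.5 * p * p
    h_new = -log_prob(q_new) + 0.5 * p_new * p_new
    if random.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new
    return q

# toy target: a standard normal "posterior"
log_prob = lambda q: -0.5 * q * q
grad = lambda q: -q

random.seed(0)
q, samples = 0.0, []
for _ in range(5000):
    q = hmc_step(q, log_prob, grad)
    samples.append(q)
mean = sum(samples) / len(samples)
```

With a well-tuned step size the sample mean and variance should be close to the target's (0 and 1); the real challenge the paper tackles is doing this over millions of neural-network parameters.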
We explore how to represent aleatoric (irreducible) uncertainty in Bayesian classification, with profound implications for performance, data augmentation, and cold posteriors in BDL.
w/
@snymkpr
, W. Maddox,
@andrewgwils
🧵 1/16
Dangers of Bayesian Model Averaging under Covariate Shift
We show how Bayesian neural nets can generalize *extremely* poorly under covariate shift, why it happens and how to fix it!
With Patrick Nicholson,
@LotfiSanae
and
@andrewgwils
1/10
Our paper on HMC for Bayesian neural networks will appear at
#ICML2021
as a long talk!
We are also excited to release our JAX code and HMC samples:
Code:
Colab showing how to load the samples:
Paper:
Check out our FlexiViT paper, appearing at
#CVPR2023
! We show that you can train one vision transformer model that works with all patch sizes, allowing you to decide on an accuracy-compute trade-off at test time!
Paper:
Code:
This ballad about Sir FlexiViT is the coolest thing ever!
It's a nice explanation of the main point of FlexiViT, and the 50min video easily plays in
@ykilcher
's league😍
The paper "FlexiViT: One Model for All Patch Sizes" was accepted at CVPR, so here comes my summary:
🧶1/N
🔥 Our work on Bayesian model selection received an Outstanding Paper Award at
#ICML2022
! Please see the talk by
@LotfiSanae
tomorrow and join us at the poster session!
I'm so proud that our paper on the marginal likelihood won the Outstanding Paper Award at
#ICML2022
!!! Congratulations to my amazing co-authors
@Pavel_Izmailov
,
@g_benton_
,
@micahgoldblum
,
@andrewgwils
🎉
Talk on Thursday, 2:10 pm, room 310
Poster 828 on Thursday, 6-8 pm, hall E
We will be presenting our work "On Feature Learning in the Presence of Spurious Correlations" today at the PODS workshop! Come chat with us about group robustness and the factors that affect it :)
11:50-12:30 and 4:55-5:40, Ballroom 3.
w/
@polkirichenko
@gruver_nate
@andrewgwils
Check out our new video and blogpost visualizing mode connectivity. For this video we evaluated over 50 mil parameter configurations of a ResNet20:) It took over two weeks on 15 GPUs. W/
@ideami
@tim_garipov
@andrewgwils
Blog:
Stochastic Weight Averaging (SWA) is a simple procedure that improves generalization in deep learning over Stochastic Gradient Descent (SGD). PyTorch 1.6 now includes SWA natively. Learn more from
@Pavel_Izmailov
,
@andrewgwils
and Vincent:
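As a sketch (not the library source), the core of SWA is just an equal running average of weight snapshots collected along the tail of an SGD run; PyTorch's `torch.optim.swa_utils.AveragedModel` applies essentially this update per parameter (followed by an `update_bn` pass to refresh batch-norm statistics):

```python
def swa_update(w_swa, w, n_averaged):
    """Running average of weights: w_swa <- (w_swa * n + w) / (n + 1)."""
    return [(ws * n_averaged + wi) / (n_averaged + 1)
            for ws, wi in zip(w_swa, w)]

# pretend these are flattened weight snapshots from the tail of an SGD run
snapshots = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
w_swa, n = snapshots[0], 1
for w in snapshots[1:]:
    w_swa = swa_update(w_swa, w, n)
    n += 1
# w_swa is now the element-wise mean of the snapshots: [2.0, 2.0]
```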
Check out this short video for our
@NeurIPSConf
paper on SWAG, a simple method that improves predictions and uncertainty in deep learning; motivated by loss surface geometry and scales to ImageNet.
(🍄/🍄🍄🍄)
Turns out, it is really hard to get the student to match the teacher predictions in knowledge distillation, even if we train really long and use lots of data augmentation.
Why? Optimization is hard!
New paper with
@samscub
@polkirichenko
@alemi
and
@andrewgwils
!
Does knowledge distillation really work?
While distillation can improve student generalization, we show it is extremely difficult to achieve good agreement between student and teacher.
With
@samscub
,
@Pavel_Izmailov
,
@polkirichenko
, Alex Alemi. 1/10
We are presenting our paper "Dangers of Bayesian Model Averaging under Covariate Shift" at
#NeurIPS2021
now! Looking forward to seeing you at the poster session!
Poster:
Paper:
Very excited to give a talk at AABI tomorrow (Feb 1st) at 5PM GMT / 12PM ET!
I will be talking about our recent work on HMC for Bayesian neural networks, cold posteriors, priors, approximate inference and BNNs under distribution shift. Please join!
Join us to discuss the latest advances in approximate inference and probabilistic models at AABI 2022 on Feb 1-2!
Webinar registration:
We have an amazing line-up of speakers, panelists and papers👍
@vincefort
@Tkaraletsos
@s_mandt
@ruqi_zhang
Among other topics, I am excited about out-of-distribution generalization, interpretability, large language and vision models, technical AI alignment, uncertainty estimation, core deep learning methodology and applications.
See my papers here:
Excited to share this paper :)
The high-level takeaway is that the main thing that affects OOD detection in likelihood-based models is the inductive biases. You can have the same likelihood on train and arbitrary likelihood outside train. Flows have bad biases for OOD detection.
Why Normalizing Flows Fail to Detect Out-of-Distribution Data
We explore the inductive biases of normalizing flows based on coupling layers in the context of OOD detection (1/6)
Our competition on approximate inference for Bayesian deep learning has started!
We tried to make it as accessible as possible: you can use any language you like, and we provide examples and resources. Give it a try :)
Our
#NeurIPS2021
competition "Approximate Inference in Bayesian Deep Learning" has started!
The goal is to provide high quality approximate inference for Bayesian neural networks, using high-fidelity HMC from hundreds of TPUs as a reference.
@ideami
created a really cool website where you can play around with his 3-d visualizations of loss surfaces of deep neural nets: ! Includes our collaboration on mode connectivity:
Another cool result: a single long HMC chain appears to be quite good at exploring the posterior, at least in the function space. The results hint that MCMC methods are able to leverage mode connectivity to move between functionally diverse solutions.
We introduce a prior distribution to control the aleatoric (data) uncertainty of a Bayesian neural network, nearly matching the accuracy of cold posteriors 🥶
w/ Brooks Paige and
@Pavel_Izmailov
🧵1/8
This was a very exciting project to work on, initially quite mysterious but with a simple and satisfying resolution! Check out the paper for more details and insights :)
We also release our code at
10/10
Turns out SWA and SAM provide complementary improvements and can be combined for even better performance!
Cool paper by
@jeankaddour
,
@likicode
, Ricardo Silva, and Matt J. Kusner!
Flat minima often generalize better than sharp ones due to robustness against loss shifts between train and test set. What’s the best way to find them? We compare two popular methods, SWA and SAM, across 42 deep learning tasks (CV, NLP, GRL):
1/7
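To contrast with SWA's post-hoc averaging, here is a hypothetical sketch of one SAM update on a toy quadratic — SAM first climbs to an adversarial point within a small L2 ball around the weights, then applies that point's gradient at the original weights:

```python
def sam_step(w, grad_fn, lr=0.04, rho=0.05):
    """One Sharpness-Aware Minimization step (sketch).
    1) ascend to the (approximately) worst point within an L2 ball of
       radius rho; 2) apply that gradient at the original weights."""
    g = grad_fn(w)
    norm = max(sum(gi * gi for gi in g) ** 0.5, 1e-12)
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]  # ascent step
    g_adv = grad_fn(w_adv)                                  # gradient there
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# toy loss: f(w) = w0^2 + 10*w1^2 (much sharper along w1)
loss = lambda w: w[0] ** 2 + 10 * w[1] ** 2
grad_fn = lambda w: [2 * w[0], 20 * w[1]]
w = [1.0, 1.0]
for _ in range(100):
    w = sam_step(w, grad_fn)
```

The extra gradient evaluation per step is SAM's main cost; SWA instead reuses the ordinary SGD trajectory, which is one reason combining them is appealing.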
First, we find that BNNs at temperature 1 with regular Gaussian priors are actually quite good, outperforming deep ensembles on both accuracy and likelihood!
There is also a negative result: Bayesian neural nets seem to generalize very poorly to corrupted data! An ensemble of 720 HMC samples is worse than a single SGD solution when the inputs are noisy or corrupted.
We also compare the predictions of popular approximate inference methods to HMC. Advanced SGMCMC methods provide the most accurate approximation, deep ensembles are quite good even though often considered non-Bayesian, and mean field VI is the worst.
@StephaneDeny
@andrewgwils
@g_benton_
@m_finzi
I think Augerino could be extended to these scenarios: we can parameterize the set of transformations that we want to be invariant to with something like a GAN generator (or ). Definitely an exciting future work direction :)
We believe this problem of weak-to-strong learning will be central to alignment of superhuman AI systems in the future. It is also a tractable ML problem with close connections to OOD generalization, label noise, semi-supervised learning etc!
Humans won't be able to supervise models smarter than us. For example, if a superhuman model generates a million lines of extremely complicated code, we won’t be able to tell if it’s safe to run or not, if it follows our instructions or not, and so on.
What about the priors? We compare several prior families and study the dependence on prior variance with Gaussian priors. Generally, the effect on performance is fairly minor.
We are also launching $10M in grants for academics, grad students, and others to work on this and other directions in superalignment. Apply by Feb 18!
Application:
We're announcing, together with
@ericschmidt
: Superalignment Fast Grants.
$10M in grants for technical research on aligning superhuman AI systems, including weak-to-strong generalization, interpretability, scalable oversight, and more.
Apply by Feb 18!
@_shingc
@PyTorch
@andrewgwils
The tweet actually links to a new blogpost describing the new interface :) See also examples here:
Also there is documentation here:
Visualizations made with our friend
@ideami
. Here we show posterior density for ResNet20 on CIFAR10 and SWAG posterior in the subspace of top 2 PCA components of SGD trajectory. Variances are aligned with width.
(🍄🍄🍄/🍄🍄🍄)
Really excited about this paper: we achieve SOTA results on spurious correlation benchmarks by simply reweighting the features learned by standard ERM! The method only has one hyper-parameter and is extremely simple and cheap!
Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations. ERM learns multiple features that can be reweighted for SOTA on spurious correlations, reducing texture bias on ImageNet, & more!
w/
@Pavel_Izmailov
and
@andrewgwils
1/11
latest from preparedness @ openai: gpt4 at most mildly helps with biothreat creation.
method: get bio PhDs in a secure monitored facility. half try biothreat creation w/ (experimental) unsafe gpt4. other half can only use the internet.
so far, gpt4 ≈ internet… but we’ll…
In fact, tempering even hurts the performance in some cases, with the best performance achieved at temperature 1. What is the main difference with ? (1) We turn data augmentation off and (2) we use a very high fidelity inference procedure.
We use Deep Feature Reweighting (DFR) to evaluate feature representations: retrain the last layer of the model on group-balanced validation data. DFR worst group accuracy (WGA) tells us how much information about the core features is learned.
2/6
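A minimal sketch of the group-balancing step at the heart of this evaluation (hypothetical names; the linear head would then be retrained on the resulting subset with ordinary logistic regression):

```python
import random

def group_balanced_subset(indices_by_group):
    """Subsample each group down to the size of the smallest group,
    so the retrained head cannot exploit group imbalance."""
    n = min(len(ix) for ix in indices_by_group.values())
    subset = []
    for ix in indices_by_group.values():
        subset += random.sample(ix, n)
    return subset

# hypothetical validation set where the minority group is rare (10 of 100)
random.seed(0)
groups = {"majority": list(range(0, 90)), "minority": list(range(90, 100))}
subset = group_balanced_subset(groups)
# 10 examples from each group -> a group-balanced set of 20
```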
Better models learn the core feature better: in-distribution accuracy is linearly correlated with the DFR WGA. We don’t find qualitative differences between different types of architectures, such as CNNs and vision transformers: they all fall on the same line.
4/6
@tdietterich
@andrewgwils
Deep ensembles are typically trained with L2 regularization, which corresponds to a Gaussian prior, but it can be switched to any other prior. We show empirically in that deep ensembles with L2 regularization approximate HMC with a Gaussian prior.
While group robustness methods such as group DRO can improve WGA a lot, they don’t typically improve the features! With DFR, we recover the same performance for ERM and Group DRO. The improvement in these methods comes from the last layer, not features!
3/6
So you think you know distillation; it's easy, right?
We thought so too with
@XiaohuaZhai
@__kolesnikov__
@_arohan_
and the amazing
@royaleerieme
and Larisa Markeeva.
Until we didn't. But now we do again. Hop on for a ride (+the best ever ResNet50?)
🧵👇
@BlackHC
@SamuelAinsworth
@andrewgwils
@a1mmer
I think there are a few caveats: (1) the argument requires the Laplace approximation to perfectly describe the basin, which is far from given. (2) I believe the Git-Re-Basin observations don't say that the distribution of solutions is the same within each mode?
@KellenDB
@leopd
@andrewgwils
It's been tried on a bunch of things: ResNets, DenseNets, VGGs, also LSTMs, in deep RL, in low-precision training, in parallel training, GANs, physical modeling. Seems to help quite generally :)
@srchvrs
@giffmana
I'd say the take-away is that you don't necessarily need tricks to learn good features even if the data has spurious / shortcut features. But you need some tricks (e.g. training on group balanced data) to learn a good head / weighting of those features.
ImageNet pretraining (supervised or contrastive) has a major effect on the features, even on non-natural image datasets such as chest X-rays. With strong pretrained models, we achieve SOTA WGA on Waterbirds (97%), CelebA (92%) and FMOW (50%) with ERM features.
5/6
Finally, now that we understand the issue we can design a simple fix! We propose EmpCov priors, Gaussian priors which have low variance along the directions where the data has low variance. EmpCov priors significantly improve robustness on many corruptions!
9/10
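A hypothetical numpy sketch of the construction described here (not the paper's code): the prior covariance simply follows the empirical input covariance plus a small ridge, so directions along which the data barely varies get tiny prior variance:

```python
import numpy as np

def empcov_prior(X, scale=1.0, eps=1e-4):
    """Prior covariance aligned with the empirical covariance of the inputs:
    low data variance along a direction -> low prior variance (plus eps*I)."""
    C = np.cov(X, rowvar=False)               # empirical input covariance
    return scale * C + eps * np.eye(X.shape[1])

# toy data: the second feature is constant (a "dead pixel")
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=500), np.zeros(500)])
Sigma = empcov_prior(X)
# prior variance along the dead feature is only eps
```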
One of the interesting takeaways is that while prior works focused on making the representations robust to spurious correlation, the representations are in fact fine even with standard ERM: the issue is largely in the last linear layer.
Distillation presents an exciting challenge for optimization in deep learning. Unlike standard learning, in distillation we actually want to get the training loss as low as possible, overfitting is not an issue. Improving the optimizer is likely to improve distillation!
I deeply regret my participation in the board's actions. I never intended to harm OpenAI. I love everything we've built together and I will do everything I can to reunite the company.
@dnnslmr
@andrewgwils
@polkirichenko
Sure, here are the ones I know:
- — great overview of NFs
- — flows for discrete data
- — integer discrete flows
- — a mixture of discrete and continuous latent variables
Consider an MLP on MNIST. MNIST has many pixels near the boundary that are 0 for all images. The corresponding weights in the first layer will always be multiplied by 0 and will not interact with the likelihood. For these weights, the posterior will be the same as the prior!
3/10
But the MAP solution will just set these weights to zero (see gif in previous tweet). Now, suppose we apply noise to a test image, some of the dead pixels will activate! MAP will simply ignore these pixels but a true BNN will multiply them by weights drawn from the prior!
4/10
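Here is a tiny self-contained illustration of that argument (a hypothetical toy model, not the paper's setup): with weight decay — the MAP view with a Gaussian prior — the dead feature's weight is driven to zero, while under a true BNN its posterior would simply equal the prior:

```python
import random

random.seed(0)
# toy data: the second "pixel" is 0 for every image, like MNIST border pixels
xs = [(random.gauss(0, 1), 0.0) for _ in range(200)]
ys = [2.0 * x0 for x0, _ in xs]          # labels depend only on the live pixel

w = [0.5, 0.5]                           # start both weights away from zero
lr, weight_decay = 0.05, 0.1             # weight decay = Gaussian prior (MAP)
for _ in range(200):
    for (x0, x1), y in zip(xs, ys):
        err = w[0] * x0 + w[1] * x1 - y
        w[0] -= lr * (err * x0 + weight_decay * w[0])
        w[1] -= lr * (err * x1 + weight_decay * w[1])   # err * x1 == 0 always
# MAP shrinks w[1] to ~0; a true BNN leaves w[1]'s posterior equal to the
# prior, so noise activating the dead pixel gets multiplied by prior draws
```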
Bayesian model averaging mitigates double descent! We have just posted this new result in section 7 of our paper on Bayesian deep learning with
@Pavel_Izmailov
: . The result highlights the importance of *multi-modal* marginalization with Multi-SWAG. 1/3
In our recent paper () we found that BNNs perform really well in-distribution, but generalize terribly under covariate shift.
This result was very puzzling for us, but in this new work we provide an explanation!
2/10
@tomgoldsteincs
Agreed! Distillation is in fact very similar to standard training, but simpler: we can produce as much data as we want, ensure we have sufficient capacity and use more informative (soft) labels!
We found knowledge distillation to help a lot with a nice trick: we initialize the student FlexiViT model with the weights of a teacher such as a ViT-B/8, leading to much better performance compared to random initialization. Inspired by:
Is there _anything_ we can do to produce a high fidelity student? In self-distillation the student can in principle match the teacher. We initialize the student with a combination of teacher and random weights. Starting close enough, we can finally recover the teacher. 8/10
In KD you want to match the student to the teacher on as much training data as possible to ensure that the models will also make similar predictions on test data.
However, it turns out that even getting the student and teacher to match on the train data is really hard!
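As a sketch of the quantities involved (hypothetical helper names): the distillation objective is a KL divergence between teacher and student softmax outputs, while "agreement" asks whether their argmaxes match — and a small KL does not guarantee agreement:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kd_kl(teacher_logits, student_logits, T=1.0):
    """KL(teacher || student) on temperature-scaled softmax outputs."""
    p = softmax([z / T for z in teacher_logits])
    q = softmax([z / T for z in student_logits])
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def argmax(z):
    return max(range(len(z)), key=z.__getitem__)

# nearly identical logits, small KL -- yet the predicted classes differ
t, s = [2.0, 1.9, 0.0], [1.9, 2.0, 0.0]
```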
@BlackHC
@SamuelAinsworth
@andrewgwils
@a1mmer
i.e. even if we know that each "mode" contains all kinds of solutions, it doesn't guarantee that the posterior mass corresponding to the solutions is consistent between the "modes": it could be that 99% of the mass is still just one solution. Would be interesting to find out!
Finally, data augmentation indeed leads to underconfident fits on the training set, and posterior tempering or ND are needed to correct for this underconfidence. These results concretely resolve the observed link between the cold posterior effect and augmentation!
12/16
Then, on a dataset with 100 classes, the posterior samples will on average only be 2% confident in the observed training label. But on benchmarks like CIFAR we believe there’s almost no label uncertainty!
4/16
We are hoping that the samples can be useful to the Bayesian deep learning community! We also plan to add samples for new datasets and architectures over time. Please let us know if you have any issues loading or using the checkpoints.
@giffmana
I think you are right that for more diverse datasets we would likely see less degeneracy in the features. For CIFAR low-variance directions are checkerboard patterns and I would think you would still not see a lot of these on ImageNet? Would be fun to check!
We can generalize this reasoning to any linear dependencies in the data. In the paper, we prove that if the input features are linearly dependent (which is true for a lot of datasets), the BNN predictions will break if we break the linear dependence at test time!
5/10
In the paper, we show how to efficiently resize the patch embeddings and positional encoding parameters. By doing so and randomizing the patch size during training, we can train a *single model* that is, for example, competitive with the whole family of efficient net models.
@ido87
@andrewgwils
Hi Daniel, the plots show how the loss changes as you vary the parameters of the DNN within a two-dimensional subspace. The x axis is fixed: it passes through two independently trained DNN weight vectors. The y axis changes as we change the plane.
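A hypothetical numpy sketch of how such a slice can be built: orthonormalize a basis for the plane through three weight vectors, then evaluate the loss on a grid in that plane:

```python
import numpy as np

def plane_basis(w1, w2, w3):
    """Orthonormal basis for the plane through w1, w2, w3.
    u points from w1 to w2 (the fixed x axis); v is the component of
    w3 - w1 orthogonal to u (the y axis, which changes with the plane)."""
    u = w2 - w1
    u = u / np.linalg.norm(u)
    v = (w3 - w1) - np.dot(w3 - w1, u) * u
    v = v / np.linalg.norm(v)
    return u, v

def loss_on_grid(loss, origin, u, v, xs, ys):
    """Evaluate the loss at origin + x*u + y*v over a grid of (x, y)."""
    return [[loss(origin + x * u + y * v) for x in xs] for y in ys]

rng = np.random.default_rng(0)
w1, w2, w3 = rng.normal(size=(3, 10))    # stand-ins for trained weights
u, v = plane_basis(w1, w2, w3)
```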
@martin_trapp
@andrewgwils
Hey
@martin_trapp
, were you thinking to use any language in particular? We will provide example code in python, but I think the competition is generally language agnostic.
@viraj_bagal
@PyTorch
@andrewgwils
The new implementation in PyTorch is complete. It's not implemented as an optimizer wrapper, but rather as a model wrapper. As for the bn update, you also need to do it in tf, see the comment in the blue box here:
In regression, we can control the representation of aleatoric uncertainty with an interpretable noise parameter. In classification we use the same softmax cross-entropy likelihood regardless of the amount of label noise, which leads to underfitting the training data.
2/16
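As a sketch of the contrast (toy numbers, not from the paper): the Gaussian regression likelihood exposes an explicit noise scale sigma, so the same residual costs far less when the labels are declared noisy — softmax cross-entropy has no analogous knob:

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of a Gaussian observation model.
    sigma is an explicit, interpretable aleatoric-noise parameter."""
    return (0.5 * math.log(2 * math.pi * sigma ** 2)
            + (y - mu) ** 2 / (2 * sigma ** 2))

# the same residual (y - mu = 1) under two assumed noise levels
tight = gaussian_nll(1.0, 0.0, 0.1)   # assume near-noiseless labels
loose = gaussian_nll(1.0, 0.0, 1.0)   # assume noisy labels
```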
@DanFrederiksen2
@andrewgwils
@tim_garipov
@ideami
In this visualization we demonstrate a particular phenomenon, mode connectivity: . We do not expect to capture everything about loss surfaces in 2d, but you can get insights about behavior in random / specific directions. E.g. .
@latentjasper
I think so! We are able to get better than deep ensembles' performance on the same architectures with HMC (with no data augmentation). Also, using the cold posteriors' code, T=1 performance is better than low temperatures if we remove data augmentation.