Ananya Kumar Profile
Ananya Kumar

@ananyaku

3,830 Followers · 481 Following · 13 Media · 323 Statuses

Researcher at @openai. Previously PhD at Stanford University (@StanfordAILab), advised by Percy Liang and Tengyu Ma.

Stanford, CA
Joined June 2018
Pinned Tweet
@ananyaku
Ananya Kumar
2 years
How should you fine-tune a large pretrained model (CLIP, SimCLR) robustly? We find that standard fine-tuning can do poorly out-of-distribution (test data ≠ fine-tuning data). Our analysis leads to a simple fix and higher accuracy on 10 datasets. (ICLR Oral)
[image]
6
122
647
@ananyaku
Ananya Kumar
1 year
I wrote a transfer learning library that has accelerated my research progress over the last 2 years. It sweeps over methods × models × datasets × hyperparams × clouds, early-stops on a dataset, and evaluates accuracy on OOD datasets. Link: CodaLab:
10
78
502
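As a rough illustration of the kind of sweep described above (the library's own API is behind the elided links and isn't shown here), a minimal Python sketch; the method/model/dataset names and the run_experiment stub are hypothetical placeholders, not the library's interface:

    # Hypothetical sketch of a methods x models x datasets x hyperparams sweep.
    # run_experiment and all names below are placeholders, not the library's API.
    import itertools

    methods = ["linear-probe", "full-fine-tune", "lp-ft"]
    models = ["clip-vit-b16", "simclr-resnet50"]
    datasets = ["living17", "fmow"]
    learning_rates = [1e-4, 3e-4, 1e-3]

    def run_experiment(method, model, dataset, lr):
        """Placeholder: train with early stopping on in-distribution validation,
        then report in-distribution and out-of-distribution accuracy."""
        return {"id_acc": 0.0, "ood_acc": 0.0}

    results = []
    for method, model, dataset, lr in itertools.product(
            methods, models, datasets, learning_rates):
        metrics = run_experiment(method, model, dataset, lr)
        results.append({"method": method, "model": model,
                        "dataset": dataset, "lr": lr, **metrics})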
@ananyaku
Ananya Kumar
6 months
OpenAI is nothing without its people
10
22
390
@ananyaku
Ananya Kumar
1 year
Foundation models (BERT, DALLE-2, ChatGPT) have led to a paradigm shift in ML, but are poorly understood. Announcing ME-FoMo, an #ICLR2023 workshop on understanding foundation models. Deadline: Feb 3, 2023. Topics: pretraining, transfer, scaling laws, etc.
[image]
3
71
362
@ananyaku
Ananya Kumar
1 year
Adam gets higher accuracy than SGD when fine-tuning modern vision models (e.g., ViT), but why? We find that the embedding layer has a high gradient. Simply freezing the embedding layer (<1% of params) → SGD competitive w/ Adam. SoTA results on WILDS + saves memory.
[image]
5
48
280
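A minimal PyTorch sketch of the freeze-embed idea, assuming a torchvision ViT (whose patch embedding is named conv_proj; timm models use patch_embed) and placeholder hyperparameters; this is an illustration, not the paper's released code:

    # Sketch: freeze the patch-embedding layer (<1% of params) and fine-tune the
    # rest with plain SGD. Layer names vary by implementation -- adjust as needed.
    import torch
    import torchvision

    model = torchvision.models.vit_b_16(weights="IMAGENET1K_V1")

    for name, param in model.named_parameters():
        if "conv_proj" in name or "patch_embed" in name:
            param.requires_grad = False  # freeze the embedding layer

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
    # ... standard fine-tuning loop with this optimizer ...

Because the frozen tensors never receive optimizer state, this also gives the memory savings mentioned later in the thread.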
@ananyaku
Ananya Kumar
4 years
When ML models are deployed, data distributions evolve over time, leading to a drop in performance. Our latest paper (theory and experiments) suggests we can use self-training on unlabeled data to maintain high performance ( @tengyuma @percyliang )
1
37
186
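A toy sketch of that self-training recipe on synthetic data, assuming scikit-learn: pseudo-label each newly shifted batch of unlabeled data with the current model, then retrain on those pseudo-labels (regularization, here sklearn's default L2 penalty, matters in both the theory and the practice, as a later reply notes):

    # Toy sketch of self-training under a gradually shifting distribution:
    # pseudo-label each newly arrived unlabeled set, then retrain on it.
    # Synthetic data for illustration only.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    X_src = rng.normal(0.0, 1.0, size=(500, 2))
    y_src = (X_src[:, 0] > 0).astype(int)                 # labeled source data
    shifted_sets = [X_src + np.array([0.0, s])            # unlabeled, shifting over time
                    for s in np.linspace(0.5, 3.0, 6)]

    model = LogisticRegression().fit(X_src, y_src)
    for X_unlabeled in shifted_sets:
        pseudo = model.predict(X_unlabeled)                # label with current model
        model = LogisticRegression().fit(X_unlabeled, pseudo)  # retrain (L2-regularized)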
@ananyaku
Ananya Kumar
5 years
In our NeurIPS 2019 (spotlight) paper, we explain why methods like Platt scaling / temperature scaling are less calibrated than reported, propose a way to overcome this issue, and describe how to measure a model's calibration error with fewer samples:
4
37
155
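For background, a minimal PyTorch sketch of temperature scaling (the standard recipe the tweet refers to, not the improved estimators from the paper): fit a single temperature T on held-out validation logits by minimizing NLL, then divide logits by T at test time.

    # Minimal temperature-scaling sketch: learn one temperature T on a
    # validation set, then rescale test logits by 1/T. Illustration only.
    import torch
    import torch.nn.functional as F

    def fit_temperature(val_logits, val_labels):
        log_t = torch.zeros(1, requires_grad=True)         # optimize log T so T > 0
        optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

        def closure():
            optimizer.zero_grad()
            loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
            loss.backward()
            return loss

        optimizer.step(closure)
        return log_t.exp().item()

    # Usage: T = fit_temperature(val_logits, val_labels)
    #        calibrated_probs = F.softmax(test_logits / T, dim=-1)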
@ananyaku
Ananya Kumar
2 years
Our paper got accepted to ICML ‘22 as a long talk! Thanks to all the co-authors ( @kendrick_shen @rmjones96 @sangmichaelxie @jhaochenz @tengyuma @percyliang ). Congrats Kendrick on yet another oral (as an undergrad!)
@tengyuma
Tengyu Ma
2 years
Pretraining is ≈SoTA for domain adaptation: just do contrastive learning on *all* unlabeled data + finetune on source labels. Features are NOT domain-invariant, but disentangle class & domain info to enable transfer. Theory & exps:
[image]
7
129
686
3
29
143
@ananyaku
Ananya Kumar
6 months
❤️
@sama
Sam Altman
6 months
i love the openai team so much
5K
4K
73K
7
7
137
@ananyaku
Ananya Kumar
2 years
Why can contrastive pretraining on *unlabeled data* improve robustness to distribution shift? (It's not about domain invariance!) Come to our ICML Oral at 2:05pm - 2:25pm in Ballroom 1 & 2, and our poster session at Hall E, Poster 317, to find out more!
@tengyuma
Tengyu Ma
2 years
Pretraining is ≈SoTA for domain adaptation: just do contrastive learning on *all* unlabeled data + finetune on source labels. Features are NOT domain-invariant, but disentangle class & domain info to enable transfer. Theory & exps:
[image]
7
129
686
2
21
121
@ananyaku
Ananya Kumar
2 years
Want to know why fine-tuning can distort pretrained features (and underperform out-of-distribution)? Come to our ICLR Oral on Wednesday, 9am Pacific Time, or our poster on Tuesday at 6:30pm PT! #ICLR2022
3
17
91
@ananyaku
Ananya Kumar
4 years
How can we adapt to very different target distributions in a principled way? w/ @tengyuma @percyliang. We show that gradual shifts enable reliable adaptation by self-training on unlabeled data. #ICML2020 Poster session: 8am-9am, 8pm-9pm Pacific Time
[image]
0
10
57
@ananyaku
Ananya Kumar
1 year
Measuring "accuracy" is not enough---it's important to measure robustness, calibration, etc, on a wide range of scenarios. This benchmarking effort will be useful for driving progress. Excited to be part of it (I examined the calibration & selective classification of LMs)!
@percyliang
Percy Liang
2 years
Language models are becoming the foundation of language technologies, but when do they work and when don't they? In a new CRFM paper, we propose Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of LMs. Holistic evaluation includes three elements:
14
200
777
0
3
26
@ananyaku
Ananya Kumar
3 years
So fun working with @sangmichaelxie @rmjones96 on our new paper on extrapolating out-of-distribution! We have theory for why pre-training and self-training help with domain shift, and empirical improvements on real sustainability datasets (cropland and landcover predictions).
@sangmichaelxie
Sang Michael Xie
3 years
🍔🍟"In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness" Real-world tasks (crop yield prediction from satellites) are often label-scarce. Only some countries have labels - how do we generalize globally?
[image]
1
37
165
1
1
13
@ananyaku
Ananya Kumar
2 years
(8/n) This work is part of a broader trend (e.g., prompt tuning, composed fine-tuning, prefix tuning), where tuning a small part of a pretrained model can be better than full fine-tuning, especially for robustness
0
1
14
@ananyaku
Ananya Kumar
3 years
Come by our poster if you're interested in pre-training + self-training for domain adaptation, learning how to use auxiliary information better, or theory for how pre-training and self-training make models more robust to domain shifts!
@tengyuma
Tengyu Ma
3 years
This appears in #ICLR2021 . Please check out our paper, videos, poster, code, etc! ICLR poster link: ArXiv: Codalab: Github:
0
4
17
0
2
13
@ananyaku
Ananya Kumar
2 years
Just saw this tweet thread about our fine-tuning paper - I love it! Their explanation is so easy to understand.
@DbrxMosaicAI
Databricks Mosaic Research
2 years
Today, we're looking at fine-tuning large models, and this paper submitted to ICLR: It shows fine-tuning can hurt performance on out-of-distribution examples, and explains how using some nice theory. We'll be keeping an eye on this! (1/8)
2
15
76
0
3
11
@ananyaku
Ananya Kumar
2 years
(5/n) This suggests the easy two-step strategy of linear probing then full fine-tuning (LP-FT). Intuition: head doesn't change as much, so features get distorted less
1
1
12
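A minimal PyTorch sketch of that two-step LP-FT recipe, assuming a torchvision ResNet-50 and placeholder hyperparameters (the training loops themselves are elided):

    # LP-FT sketch: (1) linear probe -- train only the new head on frozen
    # features; (2) full fine-tuning, initialized from the probed head.
    import torch
    import torchvision

    model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
    model.fc = torch.nn.Linear(model.fc.in_features, 10)    # head for the new task

    # Stage 1: linear probing.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("fc.")
    probe_optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
    # ... train the head for a few epochs with probe_optimizer ...

    # Stage 2: full fine-tuning from the probed head, usually at a smaller LR.
    for param in model.parameters():
        param.requires_grad = True
    ft_optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    # ... continue training all parameters with ft_optimizer ...

Because the head is already near a good solution before full fine-tuning starts, the lower layers move less, which is the feature-distortion intuition above.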
@ananyaku
Ananya Kumar
6 years
Really excited that this work by @arkitus @DeepSpiker (and other fantastic people) is finally out! I was very excited about the results when I first saw them, and continue to be amazed.
@arkitus
Ali Eslami
6 years
"Neural scene representation and rendering" now in @sciencemagazine . By training deep networks to predict what scenes look like from new viewpoints, we get them to understand images: @DeepSpiker @OriolVinyalsML @theophaneweber @demishassabis
6
172
509
0
1
9
@ananyaku
Ananya Kumar
3 years
Nice work on showing why relying on model uncertainties can be harmful for minority groups
@ErikJones313
Erik Jones
3 years
Selective classification, where models can abstain when they are unsure about a prediction, routinely improves average accuracy. Worryingly, we show that s.c. can also hurt accuracy on certain subgroups of the data. Post: 🧵
1
17
71
0
1
10
@ananyaku
Ananya Kumar
1 year
@FahimTajwar10 Thanks for the kind words, it was really fun working with you! Note to anyone else reading this: Fahim is applying to PhD programs this year, and he's fantastic---very enthusiastic, full of ideas, thorough, and independent---so someone you'd want in your lab :)
0
3
10
@ananyaku
Ananya Kumar
2 years
(4/n) We prove theoretically that this phenomenon arises even in simple and natural settings. One line explanation: while full fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features
1
0
10
@ananyaku
Ananya Kumar
1 year
(3/6) We liberally interpret understanding as research ranging from purely empirical papers that highlight interesting phenomena, to those which attempt to explain or provide theoretical foundations for such phenomena in potentially simplified settings.
1
0
9
@ananyaku
Ananya Kumar
1 year
(2/6) Workshop goal: highlight research that improves understanding of foundation models (FMs), and bring together researchers that work in the area. Developing a better understanding for FMs can help to improve the choices in data, training objectives, and adaptation methods.
1
0
9
@ananyaku
Ananya Kumar
4 years
I like Colin's paper a lot! Along with (ICML 2020) and (NeurIPS 2020) I think we are finally gaining a better understanding of when and why self-training helps.
@tengyuma
Tengyu Ma
4 years
We analyze self-training for domain adaptation, semi- and unsupervised learning, showing that pseudolabels are denoised through implicit propagation of correct labels via consistency regularization when data satisfy an expansion property. (More in Fig.)
[image]
3
39
280
0
0
8
@ananyaku
Ananya Kumar
2 years
(3/n) We find that full fine-tuning (updating all model parameters) can be worse than linear probing (updating only the last layer) on out-of-distribution test examples, when the distribution shift is large and the pretrained features are good
1
0
9
@ananyaku
Ananya Kumar
1 year
(3/n) The CodaLab worksheet uses this library and reproduces a number of experiments in our ICLR paper: Fine-tuning can distort pretrained features and underperform out-of-distribution
1
0
7
@ananyaku
Ananya Kumar
2 years
(2/n) Joint work with Aditi Raghunathan, @rmjones96 , and my advisors @tengyuma and @percyliang
1
0
7
@ananyaku
Ananya Kumar
1 year
Didn't realize NeurIPS registration was at 8pm UTC / 1pm Pacific Time 🥲 There's also a separate abstract submission date listed for June 01, but apparently for a different track.
[image]
2
0
7
@ananyaku
Ananya Kumar
2 years
@CyrusMaher Yup! And to clarify we cited this and other papers, and mention in our abstract + intro that LP-FT is sometimes used as a fine-tuning heuristic (though not for robustness). Hopefully our analysis popularizes it, and explains when it can be particularly useful (OOD)
0
0
7
@ananyaku
Ananya Kumar
2 years
(6/n) LP-FT gives large gains OOD: 10% better OOD, 1% better ID than full fine-tuning. Also outperforms linear probing both ID and OOD
1
0
8
@ananyaku
Ananya Kumar
2 years
(7/n) Caption for Figure in Tweet 1/n: (a) full fine-tuning does better in-distribution (ID), (b) linear probing can do better out-of-distribution (OOD), (c) LP-FT does better on both, especially OOD
1
0
6
@ananyaku
Ananya Kumar
4 years
@jmhessel Great question, a lot of it has to do with regularization - the student model that trains on the unlabeled instances is "simpler" than the teacher model that labels them. Some of this regularization can be implicit regularization from SGD.
1
0
5
@ananyaku
Ananya Kumar
3 years
@KLdivergence I wonder if it's useful for conferences to give guidelines to reviewers on how long to spend! Some people did seem to spend a lot of time writing very thoughtful reviews, and some people spent 2 hours (which seems very low to evaluate 6 months of research?)
1
0
5
@ananyaku
Ananya Kumar
4 years
@tengyuma @percyliang A key challenge in domain adaptation is when the source and target domains are very different (non-overlapping supports). Existing theory cannot handle these cases. Our paper suggests that if we leverage gradual shifts from source to target, we can come up with principled methods
1
0
5
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (8/8) I had a fantastic summer at Microsoft Research collaborating with Ruoqi Shen and interning with @SebastienBubeck and @suriyagnskr , who are all co-authors on this paper! Thanks to @percyliang , @tengyuma , @zhiyuanli_ , Yuanzhi Li for helpful feedback!
1
1
4
@ananyaku
Ananya Kumar
1 year
(4/6) We have a fantastic lineup of speakers who have done fundamental work in the field: Sanjeev Arora, Yasaman Bahri, Danqi Chen, Yann Dauphin, Jonathan Frankle, Jared Kaplan, and Lenka Zdeborová.
1
1
5
@ananyaku
Ananya Kumar
4 years
@tdietterich @gwern My understanding is that a common goal in medical AI (for example) is to make as good predictions as a committee of "highly skilled" doctors, which would be much better than the average doctor?
1
0
5
@ananyaku
Ananya Kumar
5 years
Cool paper explaining why adversarial training can sometimes lead to worse performance on clean data. In their example, there exists a robust classifier that can get 100% accuracy, optimization is convex, but the robust classifier is more complex so generalizes worse.
@sangmichaelxie
Sang Michael Xie
5 years
Adversarial Training can Hurt Generalization - even if there is no conflict with infinite data and the problem is convex. With @Aditi_Raghunathan and @Fanny_Yang #icml2019 Identifying and Understanding Deep Learning Phenomena
1
0
13
0
0
4
@ananyaku
Ananya Kumar
5 years
Past work on multiclass calibration only measures calibration on the most confident prediction. We look at “marginal calibration” (probability output for each class should be calibrated) like . We hope future work also reports marginal calibration scores.
0
0
5
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (4/8) We were surprised by the simplicity and generality of this observation. SGD (freeze-embed) does better than SGD on benchmarks like CIFAR-10 as well!
[image]
1
1
4
@ananyaku
Ananya Kumar
5 months
@ducha_aiki @giffmana @wightmanr We tried this on vision transformers, where it can work very well for out-of-distribution accuracy. It helped slightly on standard in-distribution accuracy. It's also used by @Mitchnw in (ViTs), which was SoTA on ImageNet for a while
1
0
5
@ananyaku
Ananya Kumar
1 year
(4/n) CodaLab is a nice platform by @percyliang and others at Stanford for reproducibility. Keeps the exact docker containers, datasets, and runs (including outputs), so experiments can be replicated by anyone.
1
0
5
@ananyaku
Ananya Kumar
1 year
@TiffanyVlaar @sangmichaelxie @whybansal @mcaron31 @AdtRaghunathan @tengyuma @HanieSedghi @percyliang (6/6) The workshop is hybrid. We will have an in-person workshop, but will also accept submissions from people who can't attend in person. Such papers will be able to record a short video presentation.
1
0
5
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (5/8) In our paper, we examined 7 popular models on 5 distribution shift datasets (BREEDS-Living-17, WILDS-FMoW, WILDS-Camelyon, Waterbirds, DomainNet). Large gains out-of-distribution.
[image]
1
0
3
@ananyaku
Ananya Kumar
1 year
(1/n) Easily run a sweep, and then summarize experiment results into a nice TSV file you can copy into Excel, or create automatic LaTeX tables for your paper. Also supports gradient accumulation, wandb, checkpointing, and nice logging organization for different experiment groups
1
0
4
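A standard-library sketch of that TSV-summary step (the rows below are made-up placeholders; the library's own summarizer is behind the elided link):

    # Write sweep results to a tab-separated file that pastes cleanly into Excel.
    import csv

    results = [  # placeholder rows; in practice these come from the sweep
        {"method": "lp-ft", "dataset": "fmow", "id_acc": 0.0, "ood_acc": 0.0},
        {"method": "full-ft", "dataset": "fmow", "id_acc": 0.0, "ood_acc": 0.0},
    ]

    with open("summary.tsv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(results[0]), delimiter="\t")
        writer.writeheader()
        writer.writerows(results)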
@ananyaku
Ananya Kumar
9 months
@_jasonwei Interesting idea. Two points: 1. What about a prof who spends some of their time advising inexperienced undergraduates (with lots of potential!) 2. Adding these mentoring papers to an impactful set of papers shouldn't lower their metrics? It's a good metric to add to the mix.
2
0
5
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (1/8) We plot the norm of the gradient at each layer (because AdamW normalizes the parameter gradient). The grad of the embedding layer is high exactly when AdamW does better than SGD (ViT, ConvNeXt)! For ResNet, AdamW ≈ SGD. Is this just a correlation? To test this, we tried freezing the embedding layer.
[image]
1
0
3
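A sketch of that kind of diagnostic in PyTorch: run one forward/backward pass and print the gradient norm of every parameter tensor (random inputs stand in for real data, and the exact normalization in the paper may differ):

    # Diagnostic sketch: per-parameter gradient norms after one backward pass,
    # e.g. to check whether the embedding layer's gradient is unusually large.
    import torch
    import torchvision

    model = torchvision.models.vit_b_16(weights="IMAGENET1K_V1")
    x = torch.randn(8, 3, 224, 224)                   # stand-in batch
    labels = torch.randint(0, 1000, (8,))

    loss = torch.nn.functional.cross_entropy(model(x), labels)
    loss.backward()

    for name, param in model.named_parameters():
        if param.grad is not None:
            print(f"{name:60s} grad norm = {param.grad.norm().item():.4f}")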
@ananyaku
Ananya Kumar
1 year
@PandaAshwinee @typedfemale (To be fair, we weren't looking to make a case for anything. I think AdamW is a good default. We just found that SGD can work comparably and sometimes better if you freeze the embedding layer, and can save a lot of memory)
0
0
3
@ananyaku
Ananya Kumar
4 years
@tengyuma @percyliang This is just a start---there are lots of exciting things to explore here both theoretically (better guarantees, more realistic distributions) and empirically (more realistic datasets, better algorithms)!
0
0
4
@ananyaku
Ananya Kumar
4 years
@sangmichaelxie @siddkaramcheti This should have been your original tweet about your paper
0
0
4
@ananyaku
Ananya Kumar
5 years
Calibration background: Besides accuracy, we should measure the quality of a model’s uncertainty estimates. If a weather model says it is going to rain with 80% probability on 1000 days, it should rain on about 800 of them. We can quantify this using calibration error metrics.
0
1
4
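A sketch of how that can be quantified with a simple binned estimate for binary predictions (an illustration of the idea, not the estimator from the NeurIPS paper above, which discusses why naive plug-in estimates can be biased):

    # Simple binned calibration-error estimate for binary predictions: within
    # each confidence bin, compare the average predicted probability to the
    # observed frequency, then average the gaps weighted by bin size.
    import numpy as np

    def binned_calibration_error(probs, labels, n_bins=10):
        probs, labels = np.asarray(probs), np.asarray(labels)
        bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
        error = 0.0
        for b in range(n_bins):
            mask = bins == b
            if mask.any():
                gap = abs(probs[mask].mean() - labels[mask].mean())
                error += mask.mean() * gap
        return error

    # e.g. forecasts of 0.8 on 1000 days with ~800 rainy days -> error near 0.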
@ananyaku
Ananya Kumar
1 year
@bremen79 That's a good point - doesn't a scaling law require the learning rate strategy to be specified? My understanding is that the OpenAI scaling law was still correct, but for their choice of algorithm (including hyperparameters). Better algorithm / hyperparameters -> better scaling law
1
0
4
@ananyaku
Ananya Kumar
3 years
Thanks to @tengyuma and @percyliang for being very helpful and supportive advisors, and @___fereshte___ for her hard work on this as well
0
0
4
@ananyaku
Ananya Kumar
2 years
@SamuelAinsworth @kellerjordan0 @siddhss5 Yeah, group / layer norm is also better for transfer learning for a similar reason. I generally try to avoid models that use batchnorm because I've been burnt in the past. Sounds like a reasonable choice - vision transformers and convnext for example use layernorm.
2
0
4
@ananyaku
Ananya Kumar
5 years
@uesatoj and I will present our work on Rigorous Agent Evaluation with @CsabaSzepesvari , @pushmeet and other great collaborators at #ICLR2019 today from 4:30pm - 6:30pm. If you're interested in RL safety or adversarial examples beyond norm balls, come by poster #72 !
[image]
0
1
4
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (7/8) Lots more observations in our paper! For example, AdamW (freeze-embed) performs comparably to AdamW, suggesting that freeze-embed "captures" the same gains as AdamW. Also some intuitions for how this might connect with feature distortion.
1
0
2
@ananyaku
Ananya Kumar
2 years
@SamuelAinsworth @kellerjordan0 @siddhss5 I should say that @jacobmbuckman was the person who first made me realize that batchnorm can be problematic!
0
0
3
@ananyaku
Ananya Kumar
4 years
@PreetumNakkiran What about approaches that show that if property X of a neural network is satisfied then generalization error is low, and show that property X holds on real data? The property could be something like the all-layer margin
2
0
3
@ananyaku
Ananya Kumar
3 years
@KLdivergence I always assumed we were expected to spend about 5 hours per paper so that's what I do. Luckily, most of the reviews I've received for my papers in the last two years have been high quality, and I suspect they put in about 5+ hours as well.
1
0
3
@ananyaku
Ananya Kumar
4 years
@jmhessel @Ted_Underwood @danielbigham My anecdotal experience is it does help for linear models, but only in the very low data regime. Once you have a couple hundred examples linear models are already close to their maximum possible accuracy so self-training has limited gains. Could be worth digging into it more :)
1
0
3
@ananyaku
Ananya Kumar
4 years
@jmhessel For example in , Figure 2, you can see how self-training adapts the classifier as the data shifts (e.g. over time). Regularization is key in theory and practice, and is why the self-trained model is different.
1
0
3
@ananyaku
Ananya Kumar
1 year
@mehtadushy @SebastienBubeck Great question! AdamW (freeze-embed) performs comparably to AdamW, suggesting that freeze-embed "captures" the same gains as AdamW. So this is another sanity check that freeze-embed and AdamW don't work well for separate reasons
0
0
3
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (2/8) We get highest reported numbers on all 5 benchmarks we test, and better accuracy than standard SGD fine-tuning. We're also at the top of the official WILDS leaderboard at for FMoW (satellite remote sensing) and Camelyon (tumor detection)!
[image]
1
0
2
@ananyaku
Ananya Kumar
6 years
@danijarh @arkitus @DeepSpiker @mpshanahan Great question! Most of the examples in the paper involve extrapolation into the future. That is, the model sees only the first 5 frames of the video, and generates the subsequent 15 frames.
1
0
3
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck @suriyagnskr @percyliang @tengyuma @zhiyuanli_ Also, thanks to @PangWeiKoh and @shiorisagawa for help with WILDS and for the great collection of datasets!
0
0
2
@ananyaku
Ananya Kumar
10 months
@AnimaAnandkumar Specifically on calibration - we had some experiments showing that RLHF'ed models can have worse calibration in HELM (). Also see the GPT-4 paper () and the Anthropic paper (). More broadly, fine-tuning can help a lot (Fig 26)
0
0
2
@ananyaku
Ananya Kumar
4 years
@CsabaSzepesvari @Aaroth Any good pointers into the literature on uncertainty estimation for non-parametric statistics? Are you talking about non-parametric Bayesian methods? In general model checking seems hard, we want the uncertainties to be "pointwise" so need some Lipschitz assumptions to check?
3
0
2
@ananyaku
Ananya Kumar
1 year
@denny_zhou Loved the chain of thought prompting paper. But if you have more training data, might it still be better to initialize with CoT, and then fine-tune (part of the model)?
1
0
2
@ananyaku
Ananya Kumar
4 years
@RishiBommasani @jmhessel Thanks @RishiBommasani :D! I should clarify that the NeurIPS paper was primarily by @cynnjjs and Colin Wei, advised by Tengyu Ma, and I'm glad I could play a small role in it. The talk is based on the ICML paper I wrote with Percy and Tengyu:
0
0
2
@ananyaku
Ananya Kumar
2 years
Both of these are today (Thursday!), and the poster session is at 6pm - 8pm!
0
0
2
@ananyaku
Ananya Kumar
5 years
Joint work with @tengyuma and @percyliang which we will present at #NeurIPS2019 ! Many exciting research directions remain in uncertainty calibration, as we discuss in Section 7 (calibration under dataset shifts, multiclass calibration, better metrics for measuring calibration).
2
0
2
@ananyaku
Ananya Kumar
4 years
@GaryMarcus @emilymbender @geoffreyhinton @lyceum Respectfully, I think this is taken out of context. Geoff also said "it’s not quite clear how much it understands." There can be reasonable debate about whether it has some understanding. He says "the symbolic approach is a perfectly reasonable thing to try", not dismissive
0
0
2
@ananyaku
Ananya Kumar
2 years
@Josh_d_robinson @jhaochenz @tengyuma Good question, I think that's where the generalization theory part comes in! My understanding: 1. good empirical unsupervised loss -> good population unsupervised loss (if function family is not too complicated). Then 2. population overlap -> good supervised loss.
1
0
2
@ananyaku
Ananya Kumar
4 years
@jmhessel @cynnjjs @tengyuma Recent work () by Colin, Tengyu and others builds on these and has a much more general theory for when self-training under consistency regularization helps
1
0
2
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (3/8) We get real memory savings. AdamW uses 20% and 60% more memory than SGD (freeze embed) and SGD (freeze embed, no momentum) respectively. When fine-tuning a CLIP ViT-H/14 on Titan-X, AdamW runs out of memory, SGD fits comfortably in memory.
[image]
1
0
1
@ananyaku
Ananya Kumar
4 years
@SebastienBubeck @roydanroy @tdietterich My main questions would be: does well-defined open problem necessarily mean we should study it? Is it an overcrowded space? Then maybe only experts with a unique skill-set and perspective should dive in, and others might want to explore new terrain due to diminishing returns?
0
0
2
@ananyaku
Ananya Kumar
4 years
@janexwang @thegautamkamath Good point, although in groups I've worked in it's primarily 2-3 (ICML, NeurIPS, ICLR), and more theoretical work can be under-appreciated at ICLR. There's also an odd cycle: since reviewing variance is really high, strong papers get rejected, but resubmitting overloads reviewers.
0
0
2
@ananyaku
Ananya Kumar
2 years
@nsaphra (5/n) This is incredibly useful for us, and thank you for your interest! Feel free to follow up, and we will update the figure and paper based on this discussion!
0
0
2
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (6/8) But we also see gains in standard "in-distribution" accuracy.
[image]
1
0
1
@ananyaku
Ananya Kumar
4 years
@2prime_PKU @RaiaHadsell @thegautamkamath @radcummings @NeurIPSConf @hsuantienlin Actually that's not quite true, I didn't get my visa in time for NeurIPS for a workshop oral presentation in 2018
1
0
2
@ananyaku
Ananya Kumar
4 years
@srvmshr (1/2) Thanks for the comments! The key insight of our paper isn’t that self-training helps when coupled with unlabeled data, but to understand when leveraging the gradual structure leads to improvements over adapting directly to target (it doesn’t always), and what ingredients
1
0
2
@ananyaku
Ananya Kumar
1 year
@karpathy Great points! In addition to these axes, one pro of prompting is easy controllability (if I want to change behavior for a certain category of texts - e.g., make all poems more polite). A pro of fine-tuning may be to learn new skills (that don't have good coverage in the pretraining data?)
0
0
2
@ananyaku
Ananya Kumar
3 years
@jacobmbuckman @CsabaSzepesvari @tesslerc Great points overall - but I think somewhat incremental works aren't quite so bad, as long as they're honest about what they're doing. For example if someone says they improved a generalization bound from X to Y, you can choose to ignore it if it's not useful for you :)
1
0
2
@ananyaku
Ananya Kumar
1 year
@goyalsachin007 Thanks for the really kind note! It's been really fun working with you - going through experimental results together, coming up with toy examples, etc. I've learned a lot from you and was really amazed how quickly you got up to speed, tried out new ideas, and wrote things up!
0
0
2
@ananyaku
Ananya Kumar
1 year
@_jasonwei From what I can tell it's quite popular these days :P So you might have to buy many dinners!
2
0
2
@ananyaku
Ananya Kumar
4 years
@CsabaSzepesvari @Aaroth @larrywasserman Thanks for the references! I'm somewhat familiar with the classical stuff e.g. although specifically interested in (realistic) conditions under which non-exp in d results are possible, for non-parametric models :) Will take a look at these!
0
0
2