Ananya Kumar Profile
Ananya Kumar

@ananyaku

3,830 Followers · 481 Following · 13 Media · 323 Statuses

Researcher at @openai. Previously PhD at Stanford University (@StanfordAILab), advised by Percy Liang and Tengyu Ma.

Stanford, CA
Joined June 2018
Pinned Tweet
@ananyaku
Ananya Kumar
2 years
How should you fine-tune a large pretrained model (CLIP, SimCLR) robustly? We find that standard fine-tuning can do poorly out-of-distribution (test data ≠ fine-tuning data). Our analysis leads to a simple fix and higher accuracy on 10 datasets. (ICLR Oral)
[image]
6
122
647
@ananyaku
Ananya Kumar
1 year
I wrote a transfer learning library that has accelerated my research progress over the last 2 years. It sweeps over methods × models × datasets × hyperparams × clouds, early-stops on a dataset, and evaluates accuracy on OOD datasets. Link: CodaLab:
10
78
502
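As a rough illustration of the kind of sweep described above (the library's own API is behind the elided links and isn't shown here), a minimal Python sketch; the method/model/dataset names and the run_experiment stub are hypothetical placeholders, not the library's interface:

    # Hypothetical sketch of a methods x models x datasets x hyperparams sweep.
    # run_experiment and all names below are placeholders, not the library's API.
    import itertools

    methods = ["linear-probe", "full-fine-tune", "lp-ft"]
    models = ["clip-vit-b16", "simclr-resnet50"]
    datasets = ["living17", "fmow"]
    learning_rates = [1e-4, 3e-4, 1e-3]

    def run_experiment(method, model, dataset, lr):
        """Placeholder: train with early stopping on in-distribution validation,
        then report in-distribution and out-of-distribution accuracy."""
        return {"id_acc": 0.0, "ood_acc": 0.0}

    results = []
    for method, model, dataset, lr in itertools.product(
            methods, models, datasets, learning_rates):
        metrics = run_experiment(method, model, dataset, lr)
        results.append({"method": method, "model": model,
                        "dataset": dataset, "lr": lr, **metrics})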
@ananyaku
Ananya Kumar
6 months
OpenAI is nothing without its people
10
22
390
@ananyaku
Ananya Kumar
1 year
Foundation models (BERT, DALLE-2, ChatGPT) have led to a paradigm shift in ML, but are poorly understood. Announcing ME-FoMo, an #ICLR2023 workshop on understanding foundation models. Deadline: Feb 3, 2023. Topics: pretraining, transfer, scaling laws, etc.
[image]
3
71
362
@ananyaku
Ananya Kumar
1 year
Adam gets higher accuracy than SGD when fine-tuning modern vision models (e.g., ViT), but why? We find that the embedding layer has a high gradient. Simply freezing the embedding layer (<1% of params) → SGD competitive w/ Adam. SoTA results on WILDS + saves memory.
[image]
5
48
280
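A minimal PyTorch sketch of the freeze-embed idea, assuming a torchvision ViT (whose patch embedding is named conv_proj; timm models use patch_embed) and placeholder hyperparameters; this is an illustration, not the paper's released code:

    # Sketch: freeze the patch-embedding layer (<1% of params) and fine-tune the
    # rest with plain SGD. Layer names vary by implementation -- adjust as needed.
    import torch
    import torchvision

    model = torchvision.models.vit_b_16(weights="IMAGENET1K_V1")

    for name, param in model.named_parameters():
        if "conv_proj" in name or "patch_embed" in name:
            param.requires_grad = False  # freeze the embedding layer

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
    # ... standard fine-tuning loop with this optimizer ...

Because the frozen tensors never receive optimizer state, this also gives the memory savings mentioned later in the thread.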
@ananyaku
Ananya Kumar
4 years
When ML models are deployed, data distributions evolve over time, leading to a drop in performance. Our latest paper (theory and experiments) suggests we can use self-training on unlabeled data to maintain high performance ( @tengyuma @percyliang )
1
37
186
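A toy sketch of that self-training recipe on synthetic data, assuming scikit-learn: pseudo-label each newly shifted batch of unlabeled data with the current model, then retrain on those pseudo-labels (regularization, here sklearn's default L2 penalty, matters in both the theory and the practice, as a later reply notes):

    # Toy sketch of self-training under a gradually shifting distribution:
    # pseudo-label each newly arrived unlabeled set, then retrain on it.
    # Synthetic data for illustration only.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    X_src = rng.normal(0.0, 1.0, size=(500, 2))
    y_src = (X_src[:, 0] > 0).astype(int)                 # labeled source data
    shifted_sets = [X_src + np.array([0.0, s])            # unlabeled, shifting over time
                    for s in np.linspace(0.5, 3.0, 6)]

    model = LogisticRegression().fit(X_src, y_src)
    for X_unlabeled in shifted_sets:
        pseudo = model.predict(X_unlabeled)                # label with current model
        model = LogisticRegression().fit(X_unlabeled, pseudo)  # retrain (L2-regularized)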
@ananyaku
Ananya Kumar
5 years
In our NeurIPS 2019 (spotlight) paper, we explain why methods like Platt scaling / temperature scaling are less calibrated than reported, propose a way to overcome this issue, and describe how to measure a model's calibration error with fewer samples:
4
37
155
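For background, a minimal PyTorch sketch of temperature scaling (the standard recipe the tweet refers to, not the improved estimators from the paper): fit a single temperature T on held-out validation logits by minimizing NLL, then divide logits by T at test time.

    # Minimal temperature-scaling sketch: learn one temperature T on a
    # validation set, then rescale test logits by 1/T. Illustration only.
    import torch
    import torch.nn.functional as F

    def fit_temperature(val_logits, val_labels):
        log_t = torch.zeros(1, requires_grad=True)         # optimize log T so T > 0
        optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

        def closure():
            optimizer.zero_grad()
            loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
            loss.backward()
            return loss

        optimizer.step(closure)
        return log_t.exp().item()

    # Usage: T = fit_temperature(val_logits, val_labels)
    #        calibrated_probs = F.softmax(test_logits / T, dim=-1)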
@ananyaku
Ananya Kumar
2 years
Our paper got accepted to ICML ‘22 as a long talk! Thanks to all the co-authors ( @kendrick_shen @rmjones96 @sangmichaelxie @jhaochenz @tengyuma @percyliang ). Congrats Kendrick on yet another oral (as an undergrad!)
@tengyuma
Tengyu Ma
2 years
Pretraining is ≈SoTA for domain adaptation: just do contrastive learning on *all* unlabeled data + finetune on source labels. Features are NOT domain-invariant, but disentangle class & domain info to enable transfer. Theory & exps:
[image]
7
129
686
3
29
143
@ananyaku
Ananya Kumar
6 months
❤️
@sama
Sam Altman
6 months
i love the openai team so much
5K
4K
73K
7
7
137
@ananyaku
Ananya Kumar
2 years
Why can contrastive pretraining on *unlabeled data* improve robustness to distribution shift? (It's not about domain invariance!) Come to our ICML Oral at 2:05pm - 2:25pm in Ballroom 1 & 2, and our poster session at Hall E, Poster 317, to find out more!
@tengyuma
Tengyu Ma
2 years
Pretraining is ≈SoTA for domain adaptation: just do contrastive learning on *all* unlabeled data + finetune on source labels. Features are NOT domain-invariant, but disentangle class & domain info to enable transfer. Theory & exps:
[image]
7
129
686
2
21
121
@ananyaku
Ananya Kumar
2 years
Want to know why fine-tuning can distort pretrained features (and underperform out-of-distribution)? Come to our ICLR Oral on Wednesday, 9am Pacific Time, or our poster on Tuesday at 6:30pm PT! #ICLR2022
3
17
91
@ananyaku
Ananya Kumar
4 years
How can we adapt to very different target distributions in a principled way? w/ @tengyuma @percyliang. We show that gradual shifts enable reliable adaptation by self-training on unlabeled data. #ICML2020 Poster session: 8am-9am, 8pm-9pm Pacific Time
[image]
0
10
57
@ananyaku
Ananya Kumar
1 year
Measuring "accuracy" is not enough---it's important to measure robustness, calibration, etc, on a wide range of scenarios. This benchmarking effort will be useful for driving progress. Excited to be part of it (I examined the calibration & selective classification of LMs)!
@percyliang
Percy Liang
2 years
Language models are becoming the foundation of language technologies, but when do they work and when don't they? In a new CRFM paper, we propose Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of LMs. Holistic evaluation includes three elements:
14
200
777
0
3
26
@ananyaku
Ananya Kumar
3 years
So fun working with @sangmichaelxie @rmjones96 on our new paper on extrapolating out-of-distribution! We have theory for why pre-training and self-training help with domain shift, and empirical improvements on real sustainability datasets (cropland and landcover predictions).
@sangmichaelxie
Sang Michael Xie
3 years
🍔🍟"In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness" Real-world tasks (crop yield prediction from satellites) are often label-scarce. Only some countries have labels - how do we generalize globally?
[image]
1
37
165
1
1
13
@ananyaku
Ananya Kumar
2 years
(8/n) This work is part of a broader trend (e.g., prompt tuning, composed fine-tuning, prefix tuning), where tuning a small part of a pretrained model can be better than full fine-tuning, especially for robustness
0
1
14
@ananyaku
Ananya Kumar
3 years
Come by our poster if you're interested in pre-training + self-training for domain adaptation, learning how to use auxiliary information better, or theory for how pre-training and self-training make models more robust to domain shifts!
@tengyuma
Tengyu Ma
3 years
This appears in #ICLR2021 . Please check out our paper, videos, poster, code, etc! ICLR poster link: ArXiv: Codalab: Github:
0
4
17
0
2
13
@ananyaku
Ananya Kumar
2 years
Just saw this tweet thread about our fine-tuning paper - I love it! Their explanation is so easy to understand.
@DbrxMosaicAI
Databricks Mosaic Research
2 years
Today, we're looking at fine-tuning large models, and this paper submitted to ICLR: It shows fine-tuning can hurt performance on out-of-distribution examples, and explains how using some nice theory. We'll be keeping an eye on this! (1/8)
2
15
76
0
3
11
@ananyaku
Ananya Kumar
2 years
(5/n) This suggests the easy two-step strategy of linear probing then full fine-tuning (LP-FT). Intuition: head doesn't change as much, so features get distorted less
1
1
12
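A minimal PyTorch sketch of that two-step LP-FT recipe, assuming a torchvision ResNet-50 and placeholder hyperparameters (the training loops themselves are elided):

    # LP-FT sketch: (1) linear probe -- train only the new head on frozen
    # features; (2) full fine-tuning, initialized from the probed head.
    import torch
    import torchvision

    model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
    model.fc = torch.nn.Linear(model.fc.in_features, 10)    # head for the new task

    # Stage 1: linear probing.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("fc.")
    probe_optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
    # ... train the head for a few epochs with probe_optimizer ...

    # Stage 2: full fine-tuning from the probed head, usually at a smaller LR.
    for param in model.parameters():
        param.requires_grad = True
    ft_optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    # ... continue training all parameters with ft_optimizer ...

Because the head is already near a good solution before full fine-tuning starts, the lower layers move less, which is the feature-distortion intuition above.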
@ananyaku
Ananya Kumar
6 years
Really excited that this work by @arkitus @DeepSpiker (and other fantastic people) is finally out! I was very excited about the results when I first saw them, and continue to be amazed.
@arkitus
Ali Eslami
6 years
"Neural scene representation and rendering" now in @sciencemagazine . By training deep networks to predict what scenes look like from new viewpoints, we get them to understand images: @DeepSpiker @OriolVinyalsML @theophaneweber @demishassabis
6
172
509
0
1
9
@ananyaku
Ananya Kumar
3 years
Nice work on showing why relying on model uncertainties can be harmful for minority groups
@ErikJones313
Erik Jones
3 years
Selective classification, where models can abstain when they are unsure about a prediction, routinely improves average accuracy. Worryingly, we show that s.c. can also hurt accuracy on certain subgroups of the data. Post: 🧵
1
17
71
0
1
10
@ananyaku
Ananya Kumar
1 year
@FahimTajwar10 Thanks for the kind words, it was really fun working with you! Note to anyone else reading this: Fahim is applying to PhD programs this year, and he's fantastic---very enthusiastic, full of ideas, thorough, and independent---so someone you'd want in your lab :)
0
3
10
@ananyaku
Ananya Kumar
2 years
(4/n) We prove theoretically that this phenomenon arises even in simple and natural settings. One line explanation: while full fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features
1
0
10
@ananyaku
Ananya Kumar
1 year
(3/6) We liberally interpret understanding as research ranging from purely empirical papers that highlight interesting phenomena, to those which attempt to explain or provide theoretical foundations for such phenomena in potentially simplified settings.
1
0
9
@ananyaku
Ananya Kumar
1 year
(2/6) Workshop goal: highlight research that improves understanding of foundation models (FMs), and bring together researchers that work in the area. Developing a better understanding for FMs can help to improve the choices in data, training objectives, and adaptation methods.
1
0
9
@ananyaku
Ananya Kumar
4 years
I like Colin's paper a lot! Along with (ICML 2020) and (NeurIPS 2020) I think we are finally gaining a better understanding of when and why self-training helps.
@tengyuma
Tengyu Ma
4 years
We analyze self-training for domain adaptation, semi- and unsupervised learning, showing that pseudolabels are denoised through implicit propagation of correct labels via consistency regularization when data satisfy an expansion property. (More in Fig.)
[image]
3
39
280
0
0
8
@ananyaku
Ananya Kumar
2 years
(3/n) We find that full fine-tuning (updating all model parameters) can be worse than linear probing (updating only the last layer) on out-of-distribution test examples, when the distribution shift is large and the pretrained features are good
1
0
9
@ananyaku
Ananya Kumar
1 year
(3/n) The CodaLab worksheet uses this library and reproduces a number of experiments in our ICLR paper: Fine-tuning can distort pretrained features and underperform out-of-distribution
1
0
7
@ananyaku
Ananya Kumar
2 years
(2/n) Joint work with Aditi Raghunathan, @rmjones96 , and my advisors @tengyuma and @percyliang
1
0
7
@ananyaku
Ananya Kumar
1 year
Didn't realize NeurIPS registration was at 8pm UTC / 1pm Pacific Time 🥲 There's also a separate abstract submission date listed for June 01, but apparently for a different track.
[image]
2
0
7
@ananyaku
Ananya Kumar
2 years
@CyrusMaher Yup! And to clarify we cited this and other papers, and mention in our abstract + intro that LP-FT is sometimes used as a fine-tuning heuristic (though not for robustness). Hopefully our analysis popularizes it, and explains when it can be particularly useful (OOD)
0
0
7
@ananyaku
Ananya Kumar
2 years
(6/n) LP-FT gives large gains OOD: 10% better OOD, 1% better ID than full fine-tuning. Also outperforms linear probing both ID and OOD
1
0
8
@ananyaku
Ananya Kumar
2 years
(7/n) Caption for Figure in Tweet 1/n: (a) full fine-tuning does better in-distribution (ID), (b) linear probing can do better out-of-distribution (OOD), (c) LP-FT does better on both, especially OOD
1
0
6
@ananyaku
Ananya Kumar
4 years
@jmhessel Great question, a lot of it has to do with regularization - the student model that trains on the unlabeled instances is "simpler" than the teacher model that labels them. Some of this regularization can be implicit regularization from SGD.
1
0
5
@ananyaku
Ananya Kumar
3 years
@KLdivergence I wonder if it's useful for conferences to give guidelines to reviewers on how long to spend! Some people did seem to spend a lot of time writing very thoughtful reviews, and some people spent 2 hours (which seems very low to evaluate 6 months of research?)
1
0
5
@ananyaku
Ananya Kumar
4 years
@tengyuma @percyliang A key challenge in domain adaptation is when the source and target domains are very different (non-overlapping supports). Existing theory cannot handle these cases. Our paper suggests that if we leverage gradual shifts from source to target, we can come up with principled methods
1
0
5
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (8/8) I had a fantastic summer at Microsoft Research collaborating with Ruoqi Shen and interning with @SebastienBubeck and @suriyagnskr , who are all co-authors on this paper! Thanks to @percyliang , @tengyuma , @zhiyuanli_ , Yuanzhi Li for helpful feedback!
1
1
4
@ananyaku
Ananya Kumar
1 year
(4/6) We have a fantastic lineup of speakers who have done fundamental work in the field: Sanjeev Arora, Yasaman Bahri, Danqi Chen, Yann Dauphin, Jonathan Frankle, Jared Kaplan, and Lenka Zdeborová.
1
1
5
@ananyaku
Ananya Kumar
4 years
@tdietterich @gwern My understanding is that a common goal in medical AI (for example) is to make as good predictions as a committee of "highly skilled" doctors, which would be much better than the average doctor?
1
0
5
@ananyaku
Ananya Kumar
5 years
Cool paper explaining why adversarial training can sometimes lead to worse performance on clean data. In their example, there exists a robust classifier that can get 100% accuracy, optimization is convex, but the robust classifier is more complex so generalizes worse.
@sangmichaelxie
Sang Michael Xie
5 years
Adversarial Training can Hurt Generalization - even if there is no conflict with infinite data and the problem is convex. With @Aditi_Raghunathan and @Fanny_Yang #icml2019 Identifying and Understanding Deep Learning Phenomena
1
0
13
0
0
4
@ananyaku
Ananya Kumar
5 years
Past work on multiclass calibration only measures calibration on the most confident prediction. We look at “marginal calibration” (probability output for each class should be calibrated) like . We hope future work also reports marginal calibration scores.
0
0
5
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (4/8) We were surprised by the simplicity and generality of this observation. SGD (freeze-embed) does better than SGD on benchmarks like CIFAR-10 as well!
[image]
1
1
4
@ananyaku
Ananya Kumar
5 months
@ducha_aiki @giffmana @wightmanr We tried this on vision transformers, where it can work very well for out-of-distribution accuracy. It helped slightly on standard in-distribution accuracy. It's also used by @Mitchnw in (ViTs), which was SoTA on ImageNet for a while
1
0
5
@ananyaku
Ananya Kumar
1 year
(4/n) CodaLab is a nice platform by @percyliang and others at Stanford for reproducibility. Keeps the exact docker containers, datasets, and runs (including outputs), so experiments can be replicated by anyone.
1
0
5
@ananyaku
Ananya Kumar
1 year
@TiffanyVlaar @sangmichaelxie @whybansal @mcaron31 @AdtRaghunathan @tengyuma @HanieSedghi @percyliang (6/6) The workshop is hybrid. We will have an in-person workshop, but will also accept submissions from people who can't attend in person. Such papers will be able to record a short video presentation.
1
0
5
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (5/8) In our paper, we examined 7 popular models on 5 distribution shift datasets (BREEDS-Living-17, WILDS-FMoW, WILDS-Camelyon, Waterbirds, DomainNet). Large gains out-of-distribution.
[image]
1
0
3
@ananyaku
Ananya Kumar
1 year
(1/n) Easily run a sweep, and then summarize experiment results into a nice TSV file you can copy into Excel, or create automatic LaTeX tables for your paper. Also supports gradient accumulation, wandb, checkpointing, and nice logging organization for different experiment groups
1
0
4
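A standard-library sketch of that TSV-summary step (the rows below are made-up placeholders; the library's own summarizer is behind the elided link):

    # Write sweep results to a tab-separated file that pastes cleanly into Excel.
    import csv

    results = [  # placeholder rows; in practice these come from the sweep
        {"method": "lp-ft", "dataset": "fmow", "id_acc": 0.0, "ood_acc": 0.0},
        {"method": "full-ft", "dataset": "fmow", "id_acc": 0.0, "ood_acc": 0.0},
    ]

    with open("summary.tsv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(results[0]), delimiter="\t")
        writer.writeheader()
        writer.writerows(results)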
@ananyaku
Ananya Kumar
9 months
@_jasonwei Interesting idea. Two points: 1. What about a prof who spends some of their time advising inexperienced undergraduates (with lots of potential!) 2. Adding these mentoring papers to an impactful set of papers shouldn't lower their metrics? It's a good metric to add to the mix.
2
0
5
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (1/8) We plot the norm of the gradient at each layer (because AdamW normalizes the parameter gradient). The grad of the embedding layer is high exactly when AdamW does better than SGD (ViT, ConvNeXt)! For ResNet, AdamW ≈ SGD. Is this just a correlation? To test this, we tried freezing the embedding layer.
[image]
1
0
3
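A sketch of that kind of diagnostic in PyTorch: run one forward/backward pass and print the gradient norm of every parameter tensor (random inputs stand in for real data, and the exact normalization in the paper may differ):

    # Diagnostic sketch: per-parameter gradient norms after one backward pass,
    # e.g. to check whether the embedding layer's gradient is unusually large.
    import torch
    import torchvision

    model = torchvision.models.vit_b_16(weights="IMAGENET1K_V1")
    x = torch.randn(8, 3, 224, 224)                   # stand-in batch
    labels = torch.randint(0, 1000, (8,))

    loss = torch.nn.functional.cross_entropy(model(x), labels)
    loss.backward()

    for name, param in model.named_parameters():
        if param.grad is not None:
            print(f"{name:60s} grad norm = {param.grad.norm().item():.4f}")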
@ananyaku
Ananya Kumar
1 year
@PandaAshwinee @typedfemale (To be fair, we weren't looking to make a case for anything. I think AdamW is a good default. We just found that SGD can work comparably and sometimes better if you freeze the embedding layer, and can save a lot of memory)
0
0
3
@ananyaku
Ananya Kumar
4 years
@tengyuma @percyliang This is just a start---there are lots of exciting things to explore here both theoretically (better guarantees, more realistic distributions) and empirically (more realistic datasets, better algorithms)!
0
0
4
@ananyaku
Ananya Kumar
4 years
@sangmichaelxie @siddkaramcheti This should have been your original tweet about your paper
0
0
4
@ananyaku
Ananya Kumar
5 years
Calibration background: Besides accuracy, we should measure the quality of a model’s uncertainty estimates. If a weather model says it is going to rain with 80% probability on 1000 days, it should rain on about 800 of them. We can quantify this using calibration error metrics.
0
1
4
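A sketch of how that can be quantified with a simple binned estimate for binary predictions (an illustration of the idea, not the estimator from the NeurIPS paper above, which discusses why naive plug-in estimates can be biased):

    # Simple binned calibration-error estimate for binary predictions: within
    # each confidence bin, compare the average predicted probability to the
    # observed frequency, then average the gaps weighted by bin size.
    import numpy as np

    def binned_calibration_error(probs, labels, n_bins=10):
        probs, labels = np.asarray(probs), np.asarray(labels)
        bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
        error = 0.0
        for b in range(n_bins):
            mask = bins == b
            if mask.any():
                gap = abs(probs[mask].mean() - labels[mask].mean())
                error += mask.mean() * gap
        return error

    # e.g. forecasts of 0.8 on 1000 days with ~800 rainy days -> error near 0.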
@ananyaku
Ananya Kumar
1 year
@bremen79 That's a good point - doesn't a scaling law require the learning rate strategy to be specified? My understanding is that the OpenAI scaling law was still correct, but for their choice of algorithm (including hyperparameters). Better algorithm / hyperparameters -> better scaling law
1
0
4
@ananyaku
Ananya Kumar
3 years
Thanks to @tengyuma and @percyliang for being very helpful and supportive advisors, and @___fereshte___ for her hard work on this as well
0
0
4
@ananyaku
Ananya Kumar
2 years
@SamuelAinsworth @kellerjordan0 @siddhss5 Yeah, group / layer norm is also better for transfer learning for a similar reason. I generally try to avoid models that use batchnorm because I've been burnt in the past. Sounds like a reasonable choice - vision transformers and convnext for example use layernorm.
2
0
4
@ananyaku
Ananya Kumar
5 years
@uesatoj and I will present our work on Rigorous Agent Evaluation with @CsabaSzepesvari , @pushmeet and other great collaborators at #ICLR2019 today from 4:30pm - 6:30pm. If you're interested in RL safety or adversarial examples beyond norm balls, come by poster #72 !
[image]
0
1
4
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (7/8) Lots more observations in our paper! For example, AdamW (freeze-embed) performs comparably to AdamW, suggesting that freeze-embed "captures" the same gains as AdamW. Also some intuitions for how this might connect with feature distortion.
1
0
2
@ananyaku
Ananya Kumar
2 years
@SamuelAinsworth @kellerjordan0 @siddhss5 I should say that @jacobmbuckman was the person who first made me realize that batchnorm can be problematic!
0
0
3
@ananyaku
Ananya Kumar
4 years
@PreetumNakkiran What about approaches that show that if property X of a neural network is satisfied then generalization error is low, and show that property X holds on real data? The property could be something like the all-layer margin
2
0
3
@ananyaku
Ananya Kumar
3 years
@KLdivergence I always assumed we were expected to spend about 5 hours per paper so that's what I do. Luckily, most of the reviews I've received for my papers in the last two years have been high quality, and I suspect they put in about 5+ hours as well.
1
0
3
@ananyaku
Ananya Kumar
4 years
@jmhessel @Ted_Underwood @danielbigham My anecdotal experience is it does help for linear models, but only in the very low data regime. Once you have a couple hundred examples linear models are already close to their maximum possible accuracy so self-training has limited gains. Could be worth digging into it more :)
1
0
3
@ananyaku
Ananya Kumar
4 years
@jmhessel For example in , Figure 2, you can see how self-training adapts the classifier as the data shifts (e.g. over time). Regularization is key in theory and practice, and is why the self-trained model is different.
1
0
3
@ananyaku
Ananya Kumar
1 year
@mehtadushy @SebastienBubeck Great question! AdamW (freeze-embed) performs comparably to AdamW, suggesting that freeze-embed "captures" the same gains as AdamW. So this is another sanity check that freeze-embed and AdamW don't work well for separate reasons
0
0
3
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (2/8) We get highest reported numbers on all 5 benchmarks we test, and better accuracy than standard SGD fine-tuning. We're also at the top of the official WILDS leaderboard at for FMoW (satellite remote sensing) and Camelyon (tumor detection)!
[image]
1
0
2
@ananyaku
Ananya Kumar
6 years
@danijarh @arkitus @DeepSpiker @mpshanahan Great question! Most of the examples in the paper involve extrapolation into the future. That is, the model sees only the first 5 frames of the video, and generates the subsequent 15 frames.
1
0
3
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck @suriyagnskr @percyliang @tengyuma @zhiyuanli_ Also, thanks to @PangWeiKoh and @shiorisagawa for help with WILDS and for the great collection of datasets!
0
0
2
@ananyaku
Ananya Kumar
10 months
@AnimaAnandkumar Specifically on calibration - we had some experiments showing that RLHF'ed models can have worse calibration in HELM (). Also see the GPT-4 paper () and the Anthropic paper (). More broadly, fine-tuning can help a lot (Fig 26)
0
0
2
@ananyaku
Ananya Kumar
4 years
@CsabaSzepesvari @Aaroth Any good pointers into the literature on uncertainty estimation for non-parametric statistics? Are you talking about non-parametric Bayesian methods? In general model checking seems hard, we want the uncertainties to be "pointwise" so need some Lipschitz assumptions to check?
3
0
2
@ananyaku
Ananya Kumar
1 year
@denny_zhou Loved the chain of thought prompting paper. But if you have more training data, might it still be better to initialize with CoT, and then fine-tune (part of the model)?
1
0
2
@ananyaku
Ananya Kumar
4 years
@RishiBommasani @jmhessel Thanks @RishiBommasani :D! I should clarify that the NeurIPS paper was primarily by @cynnjjs and Colin Wei, advised by Tengyu Ma, and I'm glad I could play a small role in it. The talk is based on the ICML paper I wrote with Percy and Tengyu:
0
0
2
@ananyaku
Ananya Kumar
2 years
Both of these are today (Thursday!), and the poster session is at 6pm - 8pm!
0
0
2
@ananyaku
Ananya Kumar
5 years
Joint work with @tengyuma and @percyliang which we will present at #NeurIPS2019 ! Many exciting research directions remain in uncertainty calibration, as we discuss in Section 7 (calibration under dataset shifts, multiclass calibration, better metrics for measuring calibration).
2
0
2
@ananyaku
Ananya Kumar
4 years
@GaryMarcus @emilymbender @geoffreyhinton @lyceum Respectfully, I think this is taken out of context. Geoff also said "it’s not quite clear how much it understands." There can be reasonable debate about whether it has some understanding. He says "the symbolic approach is a perfectly reasonable thing to try", not dismissive
0
0
2
@ananyaku
Ananya Kumar
2 years
@Josh_d_robinson @jhaochenz @tengyuma Good question, I think that's where the generalization theory part comes in! My understanding: 1. good empirical unsupervised loss -> good population unsupervised loss (if function family is not too complicated). Then 2. population overlap -> good supervised loss.
1
0
2
@ananyaku
Ananya Kumar
4 years
@jmhessel @cynnjjs @tengyuma Recent work () by Colin, Tengyu and others builds on these and has a much more general theory for when self-training under consistency regularization helps
1
0
2
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (3/8) We get real memory savings. AdamW uses 20% and 60% more memory than SGD (freeze embed) and SGD (freeze embed, no momentum) respectively. When fine-tuning a CLIP ViT-H/14 on Titan-X, AdamW runs out of memory, SGD fits comfortably in memory.
[image]
1
0
1
@ananyaku
Ananya Kumar
4 years
@SebastienBubeck @roydanroy @tdietterich My main questions would be: does well-defined open problem necessarily mean we should study it? Is it an overcrowded space? Then maybe only experts with a unique skill-set and perspective should dive in, and others might want to explore new terrain due to diminishing returns?
0
0
2
@ananyaku
Ananya Kumar
4 years
@janexwang @thegautamkamath Good point, although in groups I've worked in it's primarily 2-3 (ICML, NeurIPS, ICLR), and more theoretical work can be under-appreciated at ICLR. There's also an odd cycle: since reviewing variance is really high, strong papers get rejected, but resubmitting overloads reviewers.
0
0
2
@ananyaku
Ananya Kumar
2 years
@nsaphra (5/n) This is incredibly useful for us, and thank you for your interest! Feel free to follow up, and we will update the figure and paper based on this discussion!
0
0
2
@ananyaku
Ananya Kumar
1 year
@SebastienBubeck (6/8) But we also see gains in standard "in-distribution" accuracy.
[image]
1
0
1
@ananyaku
Ananya Kumar
4 years
@2prime_PKU @RaiaHadsell @thegautamkamath @radcummings @NeurIPSConf @hsuantienlin Actually that's not quite true, I didn't get my visa in time for NeurIPS for a workshop oral presentation in 2018
1
0
2
@ananyaku
Ananya Kumar
4 years
@srvmshr (1/2) Thanks for the comments! The key insight of our paper isn’t that self-training helps when coupled with unlabeled data, but to understand when leveraging the gradual structure leads to improvements over adapting directly to target (it doesn’t always), and what ingredients
1
0
2
@ananyaku
Ananya Kumar
1 year
@karpathy Great points! In addition to these axes, one pro of prompting is easy controllability (if I want to change behavior for a certain category of texts - e.g., make all poems more polite). A pro of fine-tuning may be to learn new skills (that don't have good coverage in the pretraining data?)
0
0
2
@ananyaku
Ananya Kumar
3 years
@jacobmbuckman @CsabaSzepesvari @tesslerc Great points overall - but I think somewhat incremental works aren't quite so bad, as long as they're honest about what they're doing. For example if someone says they improved a generalization bound from X to Y, you can choose to ignore it if it's not useful for you :)
1
0
2
@ananyaku
Ananya Kumar
1 year
@goyalsachin007 Thanks for the really kind note! It's been really fun working with you - going through experimental results together, coming up with toy examples, etc. I've learned a lot from you and was really amazed how quickly you got up to speed, tried out new ideas, and wrote things up!
0
0
2
@ananyaku
Ananya Kumar
1 year
@_jasonwei From what I can tell it's quite popular these days :P So you might have to buy many dinners!
2
0
2
@ananyaku
Ananya Kumar
4 years
@CsabaSzepesvari @Aaroth @larrywasserman Thanks for the references! I'm somewhat familiar with the classical stuff e.g. although specifically interested in (realistic) conditions under which non-exp in d results are possible, for non-parametric models :) Will take a look at these!
0
0
2