Nitish Profile
Nitish

@StrongDuality

3,127 Followers · 1,018 Following · 25 Media · 359 Statuses

language modeling research @OpenAI | views are my own

Palo Alto, CA
Joined November 2013
@StrongDuality
Nitish
6 months
❤️
@sama
Sam Altman
6 months
i love the openai team so much
5K
4K
73K
9
12
295
@StrongDuality
Nitish
4 years
New blog: Benchmarks for Serving BERT-like Models! I spent some time investigating NVIDIA's Triton/TensorRT. Blog includes a guide to setting up your own server + benchmarks for choices like: sequence length, batch size, TF vs PyTorch, and model type.
4
47
195
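For readers who want to try the setup described in the tweet above, here is a minimal sketch of querying a Triton server over HTTP for a BERT-style model. The server URL, model name "bert", and the input/output tensor names and shapes are illustrative assumptions, not taken from the blog.

```python
# Minimal Triton HTTP client sketch (illustrative: server URL, model name,
# tensor names, and sequence length are assumptions, not from the blog).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

seq_len = 128
input_ids = np.zeros((1, seq_len), dtype=np.int32)        # replace with real token ids
attention_mask = np.ones((1, seq_len), dtype=np.int32)

inputs = [
    httpclient.InferInput("input_ids", [1, seq_len], "INT32"),
    httpclient.InferInput("attention_mask", [1, seq_len], "INT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

outputs = [httpclient.InferRequestedOutput("logits")]
result = client.infer(model_name="bert", inputs=inputs, outputs=outputs)
print(result.as_numpy("logits").shape)
```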
@StrongDuality
Nitish
2 years
A bit of personal news: I left @SFResearch after nearly 5 years to join @OpenAI to work on LLMs. Very grateful to all my colleagues and collaborators! What better way to announce a transition to @OpenAI than with #dalle ; here’s a panda clearing out his desk after quitting.
Tweet media one
3
0
173
@StrongDuality
Nitish
5 years
Thoughts after reading the T5 paper of @colinraffel et al. Thread. An amazing paper (requiring significant compute) that teases apart the effect of various ingredients proposed in Muppetland in the last few months (years?). Some things that stood out / were surprising:
2
39
148
@StrongDuality
Nitish
6 years
New paper (with Karim and @RichardSocher ) on a Transformer network with branched self-attention layers (as opposed to multi-head attention). ~0.5 BLEU score improvement on SOTA for EN-FR and EN-DE tasks. Paper: Blog:
1
47
131
@StrongDuality
Nitish
5 years
This was quite a lot of fun (& a great learning experience). Hopefully, it makes CTRL more accessible. Thanks a bunch to the @huggingface team!
@LysandreJik
Lysandre
5 years
New 🤗Transformers release (2.1.1)! Includes Salesforce's *huge* 1.6B CTRL model. First author Nitish Keskar ( @StrongDuality ) added it himself to the library in 3 days (model, tokenizer, tests and even doc!) He didn't even need our help.🤯
Tweet media one
5
77
357
1
19
125
@StrongDuality
Nitish
6 months
So glad to finally see this come through! And, in all honesty, it’s partly because I really need @gdb ‘s help in fixing a pesky bug 😛
@OpenAI
OpenAI
6 months
We have reached an agreement in principle for Sam Altman to return to OpenAI as CEO with a new initial board of Bret Taylor (Chair), Larry Summers, and Adam D'Angelo. We are collaborating to figure out the details. Thank you so much for your patience through this.
6K
13K
67K
3
0
117
@StrongDuality
Nitish
5 years
Added experimental support for generation from CTRL on K80/T4/P100 or similar GPUs; look at top of for details. Collaboratory link w/ K80s: Used post-training (selective) FP16 quantization. As always, feedback welcome!
1
22
89
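As a rough illustration of the selective FP16 idea mentioned above (not the actual CTRL code, which was TensorFlow-based), one PyTorch-style recipe is to cast the large weight matrices to half precision while keeping numerically sensitive modules like LayerNorm in FP32:

```python
# Sketch of post-training selective FP16 casting (assumption: a PyTorch-style
# model; the original CTRL release used TensorFlow). Roughly halves weight
# memory; activations must be cast to match each module's dtype at inference,
# e.g. via torch.cuda.amp.autocast.
import torch.nn as nn

def selectively_half(model: nn.Module) -> nn.Module:
    model.half()                       # cast all weights to FP16 first
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            module.float()             # keep LayerNorm in FP32 for stability
    return model
```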
@StrongDuality
Nitish
5 years
Super excited about this! You can control the generation with control codes for Wikipedia, Project Gutenberg, some sub-reddits, OpenWebText, News, and a few others. Work with @BMarcusMcCann @lrvarshney @CaimingXiong @RichardSocher Thread 1/n
@RichardSocher
Richard Socher
5 years
We release the largest publicly available language model: CTRL has 1.6B parameters and can be guided by control codes for style, content, and task-specific behavior. Incredible generations! Paper Github Blog
Tweet media one
Tweet media two
Tweet media three
21
597
2K
2
10
66
@StrongDuality
Nitish
6 years
tl;dr - A simple strategy that automatically switches from Adam to SGD (empirically) shrinks generalization gaps for many problems. Suggests viability of such hybrid approaches.
0
12
35
@StrongDuality
Nitish
6 years
@jeremyphoward I've gotten surprisingly far with PySwarm (PSO) and "smart" random search (especially because they are embarrassingly parallelizable). You can also try HyperOpt, BayesOpt and Scikit-Optimize. Also, @SigOpt if you're willing to pay.
2
8
34
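Since the appeal above is that random search is embarrassingly parallelizable, here is a minimal sketch of that pattern; the search space and the `train_and_eval` objective are placeholders for a real training job, not anything from the tweet.

```python
# Embarrassingly parallel random search sketch; the search space and the
# train_and_eval objective are placeholders.
import random
from multiprocessing import Pool

def sample_config(rng: random.Random) -> dict:
    return {
        "lr": 10 ** rng.uniform(-5, -1),              # log-uniform learning rate
        "batch_size": rng.choice([16, 32, 64, 128]),
        "dropout": rng.uniform(0.0, 0.5),
    }

def train_and_eval(config: dict) -> float:
    # placeholder: run a training job and return a validation score
    return -((config["lr"] - 1e-3) ** 2) - 0.01 * config["dropout"]

if __name__ == "__main__":
    rng = random.Random(0)
    configs = [sample_config(rng) for _ in range(32)]
    with Pool(processes=8) as pool:                   # trials are independent
        scores = pool.map(train_and_eval, configs)
    best_score, best_config = max(zip(scores, configs), key=lambda t: t[0])
    print("best score:", best_score, "best config:", best_config)
```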
@StrongDuality
Nitish
5 years
@ ICLR. Part of two papers this week: 1) with @AkhileshGotmare on heuristics in DL, and 2) with @hllo_wrld on question answering. For details: DM if you want to chat.
1
10
33
@StrongDuality
Nitish
5 years
#NeurIPS2018 workshop on adversarial robustness.
Tweet media one
1
9
32
@StrongDuality
Nitish
6 years
is up! Summarizes information about the tasks, answers some frequent questions, and provides a leaderboard (let us know if you have an improved score :) )
@RichardSocher
Richard Socher
6 years
website is up! Slides motivating true multitask learning in AI and NLP from a recent talk:
Tweet media one
7
146
364
1
12
28
@StrongDuality
Nitish
6 years
I'm sure this is common knowledge but the 50-75 heuristic (or whatever it might be called) for reducing learning rates is surprisingly effective. Reduce by 10 once at 50% of budget, again at 75%. I've not been successful at making any other strategy work better on my tasks.
3
6
27
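The 50/75 heuristic in the tweet above maps directly onto a step schedule; here is a minimal PyTorch sketch, where the model, optimizer, and epoch budget are placeholders.

```python
# The 50/75 heuristic as a step schedule: divide the LR by 10 at 50% and again
# at 75% of the epoch budget. Model, optimizer, and budget are placeholders.
import torch

model = torch.nn.Linear(10, 2)                         # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

total_epochs = 100
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[total_epochs // 2, (3 * total_epochs) // 4],   # 50% and 75%
    gamma=0.1,                                         # reduce by 10x each time
)

for epoch in range(total_epochs):
    # ... one epoch of training goes here ...
    optimizer.step()                                   # placeholder update
    scheduler.step()
```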
@StrongDuality
Nitish
5 years
Another update regarding CTRL. Added in : 1. (experimental) support for PyTorch inference, 2. code to fine-tune CTRL on your own dataset, 3. a 36-layer trained model (vs. 48 for the large model). Also, a snappy GIF from @BMarcusMcCann and @melvingruesbeck
0
11
24
@StrongDuality
Nitish
5 years
Interestingly, the detectors for GPT also do well on CTRL. With this specific demo, I've tried a few prompt completions from CTRL and the detector always picked it out correctly. At least preliminarily, it seems like the detectors are quite transferable (?).
@ClementDelangue
clem 🤗
5 years
Try it and tell us what you think. Should we build something similar for every model we release in ?
6
2
25
4
4
25
@StrongDuality
Nitish
4 years
This is really cool. Great title too!
@hardmaru
hardmaru
4 years
Fantastic Generalization Measures and Where to Find Them “We present the first large scale study of generalization in deep networks. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters.”
Tweet media one
3
111
278
0
5
23
@StrongDuality
Nitish
6 years
Reach out if you wish to talk about the work we do and the team. P.S: we are also looking for (Jr./Sr.) Research Scientists. More details @
@RichardSocher
Richard Socher
6 years
We are looking to expand Salesforce Research with more directors to lead their AI groups. Freedom to choose your research direction, support if you want to productize results. Join a friendly, collaborative team and a company with great values. Apply here :)
Tweet media one
1
43
106
3
5
19
@StrongDuality
Nitish
7 years
"..training took about 80 days for 1.5 billion samples, on 2 Nvidia K80 GPU’s (4 devices) with batch size 64 per GPU.." 80 days! Damn.
@Miles_Brundage
Miles Brundage
7 years
"English Conversational Telephone Speech Rec. by Humans + Machines," Saon et al: TBD! I've (weakly) claimed opposite
Tweet media one
1
4
14
2
9
18
@StrongDuality
Nitish
5 years
I will be at the @salesforce booth in the evening today (and early afternoon on most other days) if you want to chat about opportunities or research.
@SalesforceEdu
#Futureforce
5 years
It’s Day ✌️of #NeurIPS2018 and our team is ready to connect with the top minds in #AI . Stop by our @Salesforce booth to speak w/researcher @BMarcusMcCann this morning and learn about our latest research in #NLP . Learn more at
Tweet media one
Tweet media two
0
4
12
0
6
18
@StrongDuality
Nitish
6 years
Come talk to me and @Smerity about scaling up language modeling during the poster breakout sessions in the same workshop as well.
Tweet media one
2
6
18
@StrongDuality
Nitish
6 years
Claim 1: Cyclical/Cosine learning rates help. Claim 2: Learning rates and batch sizes have "duality"/"equivalence". Research question: train with cyclical/cosine batch sizes?
1
1
18
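Purely as an illustration of the research question posed above (the tweet does not claim this works), one way to realize a cosine batch-size schedule is to grow the batch size along a cosine curve in place of decaying the learning rate:

```python
# Illustrative cosine batch-size schedule: ramp the batch size from min_bs to
# max_bs along a cosine curve instead of decaying the learning rate. This is a
# sketch of the open question in the tweet, not an established recipe.
import math

def cosine_batch_size(step: int, total_steps: int,
                      min_bs: int = 32, max_bs: int = 512) -> int:
    progress = min(step / max(total_steps, 1), 1.0)
    scale = 0.5 * (1.0 - math.cos(math.pi * progress))   # goes 0 -> 1
    return int(round(min_bs + (max_bs - min_bs) * scale))

for s in (0, 2500, 5000, 7500, 10000):                   # sample the schedule
    print(s, cosine_batch_size(s, total_steps=10000))
```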
@StrongDuality
Nitish
2 years
This is an amazing effort and will probably influence LLM evals for a while. Glad to have contributed in a tiny way. Also, probably the high-water mark for number of authors on a paper for me.
@jaschasd
Jascha Sohl-Dickstein
2 years
After 2 years of work by 442 contributors across 132 institutions, I am thrilled to announce that the paper is now live: . BIG-bench consists of 204 diverse tasks to measure and extrapolate the capabilities of large language models.
Tweet media one
37
574
3K
1
1
17
@StrongDuality
Nitish
5 years
Congratulations to the winners of our first @Salesforce Research Deep Learning grant:
1
1
13
@StrongDuality
Nitish
2 years
@thegautamkamath Green Grape and Strawberry #dalle
Tweet media one
Tweet media two
0
1
12
@StrongDuality
Nitish
5 years
1. Unlike decaNLP, where we tried to pose "everything as QA", T5 authors take a more unconstrained approach and go for "everything as Seq2Seq". Perhaps surprisingly, the model/data scale also obviates specialized span decoders; SoTA on SQuAD through pure generation!
1
1
11
@StrongDuality
Nitish
5 years
@aggielaz Unofficial but I've found to be quite useful.
0
2
10
@StrongDuality
Nitish
4 years
New work with @thisismadani , @BMarcusMcCann , @nikhil_ai , @RichardSocher and collaborators at Stanford exploring use of large-scale training/models on proteins. Amazing downstream applications and vast potential for model improvements as well.
@RichardSocher
Richard Socher
4 years
Introducing ProGen, a large language model trained on 280 million protein sequences that can generate viable proteins based on user specifications. A step towards AI & #nlpproc helping cure disease and clean our planet. Paper: Blog:
Tweet media one
9
106
345
0
0
10
@StrongDuality
Nitish
6 years
Few things are more jarring to read than a partially trained natural language generation model.
1
0
9
@StrongDuality
Nitish
6 years
If you missed the Deep Learning at Supercomputing Scale workshop at NIPS, the slides for many of the talks (including mine) are now live @
0
1
9
@StrongDuality
Nitish
4 years
NVIDIA's Triton/TensorRT solution provided a lot of these functionalities + their benchmarks were very impressive. We decided to investigate further in the context of Transformer language models. Questions/comments/feedback are welcome!
1
1
8
@StrongDuality
Nitish
6 years
@Sid_dinesh94 @RichardSocher We certainly do have positions for Research Engineers as well. Here are more details:
1
2
9
@StrongDuality
Nitish
5 years
@RichardSocher talking about future of multitasking now.
Tweet media one
0
0
8
@StrongDuality
Nitish
5 years
@RichardSocher @rosstaylor90 We have some preliminary numbers (20.2 perplexity on WikiText-103) but are in the process of finalizing them by figuring out the detokenization from the original WT103 version for apples-to-apples comparison.
2
0
7
@StrongDuality
Nitish
4 years
Why? Because model serving is a deceptively hard problem. Especially with bells and whistles like dynamic batching, model swapping and priorities, model versioning, support for multiple frameworks, ease of transitioning from research to production, maintainability, and efficiency.
1
0
8
@StrongDuality
Nitish
5 years
2. Seems like asking the model to reconstruct the input (à la denoising autoencoding) is weaker no matter how you do it. Predicting only the corruption/masking is incentivizing better pre-training somehow.
Tweet media one
1
0
8
@StrongDuality
Nitish
5 years
6. Authors use a peculiar way of scaling to 11B. Instead of increasing layers (such as the 89B model) or emb/model size, they keep those reasonable at 24/1024 resp. They only scale the intermediate layer in the FFN (to 65536!). Understandable though given the ease of parallelism.
1
0
8
@StrongDuality
Nitish
7 years
Just heard that our paper on a 2nd-order method for solving convex + L1 won the Charles Broyden prize ()!
0
1
7
@StrongDuality
Nitish
5 years
4. Multitask learning (even with a relaxed definition) continues to be annoyingly difficult. Authors allow different checkpoints on same trajectory but performance is same (at best) or worse. Echoes a lot of what we found during decaNLP and SpEx ()
Tweet media one
Tweet media two
1
0
7
@StrongDuality
Nitish
6 years
Mandatory "OMG! #NIPS2017 Queue" picture.
Tweet media one
0
2
7
@StrongDuality
Nitish
6 years
My #NIPS2017 workshop talk on large batch training and generalization starts in ~30 minutes in RM 101B.
1
2
7
@StrongDuality
Nitish
6 years
Tune in to at 5PM PT for the AI research keynote by @RichardSocher and team. Learn about some of the projects and products our team works on.
0
1
7
@StrongDuality
Nitish
5 years
@_sding @jekbradbury I have tried without success. My hypothesis is that it's a regularization issue. The training ppl falls fine, the validation ppl falls then shoots up. I think I saw someone else (maybe t-xl repo?) mention this as well.
3
0
7
@StrongDuality
Nitish
5 years
5. Really enjoyed reading Section 3.6 ("What if you had 4x more compute?"). Somewhat unsurprisingly, training the largest model that still fits and training a 2x larger model for 2x longer are great places to start. But it's great to have evidence of how other choices feature.
1
0
7
@StrongDuality
Nitish
7 years
A bit late with my tweet but: our paper on sharp minima was accepted for an oral presentation at ICLR!
1
0
6
@StrongDuality
Nitish
6 years
@RichardSocher @sleepinyourhat We had explored several sampling/scheduling strategies for decaNLP, but somewhat surprisingly, a simple round-robin strategy with one mini-batch update (agnostic to the dataset sizes) seemed to work best. We did additionally benefit from training on a subset of the tasks first.
0
0
6
@StrongDuality
Nitish
5 years
3. WebText (which is also a part of CTRL), despite its small size compared to C4, is surprisingly good across the board. It's a bit counterintuitive that WebText is better on GLUE and SQuAD than all others.
Tweet media one
1
0
6
@StrongDuality
Nitish
7 years
People have worked on DNNs with random depth, width, grad noise and residual conn weights. Why stop there? 💡: random learning rate every iter!
4
2
6
@StrongDuality
Nitish
5 years
7. Paper concludes with great reflections and avenues for future research. Scaling up clearly helps (it did for CTRL as well), but this also motivates the design of better ingredients. 2016/17 saw a lot of scaled-out LSTMs achieve SoTA, only for Transformers to break that chain.
1
0
6
@StrongDuality
Nitish
3 years
@colinraffel @_AlecJacobson This sounds great! We had a similar notion of roles to prepare for committee questions for our PhD qualifying exam presentation at @NU_IEMS . IIRC, we also discussed an 'optimist' who loves the paper and will defend it no matter what, and a 'pessimist' who just wants to hate it.
2
0
6
@StrongDuality
Nitish
6 years
@jekbradbury @marian_nmt Here are a few choices and when I use them: Hyperband: when I have no good estimate of the hyperparameters and I need "good enough" ones soon. skopt: I have estimates but I need "really good" ones on a tight budget. 1/2
1
1
6
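For the "really good ones on a tight budget" case, a minimal scikit-optimize sketch looks like the following; the search space and objective are placeholders for a real training run (skopt minimizes, so return e.g. validation loss).

```python
# Minimal scikit-optimize (skopt) sketch for a tight evaluation budget; the
# search space and objective below are placeholders, not from the tweet.
from skopt import gp_minimize
from skopt.space import Real, Integer

space = [
    Real(1e-5, 1e-1, prior="log-uniform", name="lr"),
    Integer(1, 4, name="num_layers"),
]

def objective(params):
    lr, num_layers = params
    # placeholder: train with these hyperparameters and return validation loss
    return (lr - 1e-3) ** 2 + 0.01 * num_layers

result = gp_minimize(objective, space, n_calls=20, random_state=0)
print("best params:", result.x, "best loss:", result.fun)
```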
@StrongDuality
Nitish
6 years
@adityak6798 @RichardSocher Information about internships is available @ . The window is not fixed; we have interns throughout the year with varying durations.
1
4
5
@StrongDuality
Nitish
5 years
Sec. 8 of the paper and our blog discuss our due diligence on the ethics of releasing this model (including a careful review process and inputs from experts at the Partnership on AI (PAI) and external members of our Ethical Use Advisory Council) 4/n
1
2
5
@StrongDuality
Nitish
7 years
Successfully defended my thesis! 😃
0
0
4
@StrongDuality
Nitish
2 years
@Tim_Dettmers We used Adagrad when training CTRL; it worked quite well. Here are the hyperparams. One "weird trick" you can use if the LR decays too rapidly is to restart the second-moment (squared-gradient) accumulator and initiate a small warmup phase. But we didn't seem to need it when training the 1.6B model.
Tweet media one
1
0
5
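A rough PyTorch sketch of the trick described above (the original CTRL training code is not shown in the tweet, so treat the details as assumptions): zero Adagrad's squared-gradient accumulator so the effective step size recovers, then linearly warm the learning rate back up over a short window.

```python
# Sketch of the accumulator-restart trick in PyTorch (illustrative; not the
# original CTRL training code): zero Adagrad's squared-gradient state, then
# linearly warm the learning rate back up to its base value.
import torch

model = torch.nn.Linear(10, 10)                        # placeholder model
base_lr = 0.05
optimizer = torch.optim.Adagrad(model.parameters(), lr=base_lr)

def restart_accumulator_with_warmup(optimizer, base_lr, warmup_steps=1000):
    # zero the accumulated squared gradients ("sum" in PyTorch's Adagrad state)
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p)
            if state is not None and "sum" in state:
                state["sum"].zero_()

    def set_lr(step):                                  # call once per step
        scale = min(1.0, (step + 1) / warmup_steps)
        for group in optimizer.param_groups:
            group["lr"] = base_lr * scale
    return set_lr
```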
@StrongDuality
Nitish
7 years
Won't be @ ICLR but here's the link to our sharp-minima poster: . Also added a WIP @PyTorch implementation to Github.
0
2
5
@StrongDuality
Nitish
4 years
Quite excited about trying this out.
@martin_gorner
Martin Görner
4 years
@fchollet TPU and TPU pod support in *Keras* ! 🥳
1
2
23
1
0
5
@StrongDuality
Nitish
7 years
@RogerGrosse I also wish authors provided tuning guidelines (like in the Adam paper), even if crude: how to choose, tradeoffs, signs you chose incorrectly...
0
0
4
@StrongDuality
Nitish
7 years
Finally got around to reading the paper on shattered gradients; very thought-provoking.
0
1
4
@StrongDuality
Nitish
2 years
@jekbradbury @thesephist @karinanguyen_ I would answer with either matchbox or revtok :P
0
0
4
@StrongDuality
Nitish
3 years
A really well-written blog; quite enjoyed reading it. Talks about their contributions but also provides a quick introduction to other recent prompt-tuning papers.
@gaotianyu1350
Tianyu Gao
3 years
Prompts have been shown to have great potential in making language models better at a variety of NLP tasks. This blog post reviews recent work in “prompting”. It also introduces our ACL’21 paper LM-BFF! Check it out :)
1
65
264
0
0
4
@StrongDuality
Nitish
6 years
@Smerity @Azaliamirh A great talk! I wonder if one could design some kind of grammar/DSL for creating interpretable placement rules (even if it is per-application)?
0
1
4
@StrongDuality
Nitish
6 years
Finally made it through the doors of #NIPS2017 in attempt #2 . Please ping me if you're interested in discussing optimization and/or generalization for ML.
0
0
4
@StrongDuality
Nitish
7 years
@F_Vaggi @jeremyphoward @apaszke @ogrisel @TensorFlow @PyTorch My interest is less for perf., more for writing weird opt. algos. TF wraps loss/grad so you can easily do a linesearch & deal w/ flat grads.
1
0
4
@StrongDuality
Nitish
6 years
A very nice talk from @GoogleBrain taking a dive into sw/hw efforts at making ImageNet a toy problem. I also recommend checking out @JeffDean 's talk from yesterday, which had a similar flavor.
Tweet media one
0
1
4
@StrongDuality
Nitish
7 years
This is a very interesting line of research.
@abursuc
Andrei Bursuc
7 years
A Closer Look at Memorization in Deep Networks Bringing some nuances to the rethinking generalization paper
1
16
37
0
2
4
@StrongDuality
Nitish
7 years
One thing I love about @TensorFlow but missing in @PyTorch is an interface to external optimizers (e.g., it wraps all of scipy.optimize cleanly)
1
0
4
@StrongDuality
Nitish
5 years
Amazing work and very well-written paper. Also, LOL @ Footnote 3.
@OpenAI
OpenAI
5 years
We've trained an unsupervised language model that can generate coherent paragraphs and perform rudimentary reading comprehension, machine translation, question answering, and summarization — all without task-specific training:
172
3K
6K
0
0
4
@StrongDuality
Nitish
6 years
@beenwrekt Damn. I missed yesterday's conversation. I don't think Neyshabur et al. () was brought up as a possible definition?
2
0
4
@StrongDuality
Nitish
5 years
I swear, every paper that I open on OpenReview has a first official review of 3: Weak Reject.
0
0
4
@StrongDuality
Nitish
5 years
You can generate from a link, and changing the URL changes the story. 2/n
Tweet media one
2
0
3
@StrongDuality
Nitish
6 years
@Smerity @jeremyphoward Also: A Bayesian Perspective on Generalization and Stochastic Gradient Descent ()
1
0
3
@StrongDuality
Nitish
2 years
@savvyRL @SFResearch @OpenAI Thanks Rosanne! Quitting panda seems... incomplete? Here's an excited-to-go-to-work #dalle panda lol
Tweet media one
0
0
3
@StrongDuality
Nitish
7 years
What's also interesting is how little the hyperparams (layers/nodes/dropouts/clipping) needed to change from LSTM to QRNN for getting ~SOTA.
@Smerity
Smerity
7 years
@PyTorch @RichardSocher @jekbradbury @StrongDuality @CaimingXiong We added QRNN to AWD-LSTM language model we released earlier: hits SotA PTB + near SotA WT2 while being >2x faster
Tweet media one
3
11
17
0
1
3
@StrongDuality
Nitish
5 years
For the same prompt, you can change the generation behavior by changing the control code. 3/n
Tweet media one
Tweet media two
Tweet media three
1
0
3
@StrongDuality
Nitish
5 months
Managing such a large effort (from design, training, eval, deployment, …) is, unsurprisingly, quite difficult. The model looks very strong; looking forward to playing with it. Congratulations to the Gemini team!
@demishassabis
Demis Hassabis
5 months
The Gemini era is here. Thrilled to launch Gemini 1.0, our most capable & general AI model. Built to be natively multimodal, it can understand many types of info. Efficient & flexible, it comes in 3 sizes each best-in-class & optimized for different uses
Tweet media one
356
2K
12K
0
0
4
@StrongDuality
Nitish
2 years
@mark_riedl Was it mostly prompt engineering, or did you also have to fine-tune?
1
0
3
@StrongDuality
Nitish
6 years
@jeremyphoward @Smerity I think they're comparing to @Smerity 's older work, not our recent AWD-LSTM work. Their final perplexity on PTB is about 70 (v/s ours of 57) and on WT2 is about 70 (v/s ours of 65). Neural cache, mixtures-of-softmax and dynamic evaluation brings down those even further.
1
0
3
@StrongDuality
Nitish
6 years
@ogrisel @Smerity @RichardSocher AvSGD is awesome but the darn thing needs soo many epochs. Adam with a learning rate decay seemed like a happy compromise. I'm fairly confident that even better performance can be achieved through more sophisticated training strategies like averaging or CyclicLR.
2
0
3
@StrongDuality
Nitish
4 months
@_aidan_clark_ that's just 'cuz you GPU poor
0
0
3
@StrongDuality
Nitish
5 years
@remilouf @brisvegas1 @ogrisel @julien_c @huggingface @BMarcusMcCann @lrvarshney @CaimingXiong @RichardSocher @SFResearch Great question! @BMarcusMcCann and I are exploring this very question right now; we will have a definitive answer soon. It's likely better than ULMFit since it's also a causal LM but larger and trained on more data. With BERT, the (delicate) trade-off is size v/s MLM loss.
0
0
3
@StrongDuality
Nitish
6 years
@ogrisel @Smerity @RichardSocher @tim_garipov Yeah. The "constant LR averaging" in Fig 7 was exactly the phenomenon we observed and motivated us to use AvSGD for our previous paper. The game changes a bit with Adam, tuning it just a tad (LR and beta1) seemed to be within 0.3-0.5 of AvSGD without needing the fine-tuning stage
0
0
3
@StrongDuality
Nitish
5 years
This is today! Come by if you're at #CVPR2019 . I'll be talking about some recent work on transfer and multitask learning. Focus is on general (task-agnostic) tools and less on specific applications.
@nikhil_ai
Nikhil Naik
5 years
Check out our tutorial on meta-learning at CVPR 2019 at 9 AM on Monday! Topics: few-shot and multi-task learning, AutoML. Speakers: @nikhil_ai , @StrongDuality , @chelseabfinn and @FrankRHutter . Organized with @RichardSocher , @raskarmit . Info: #CVPR2019
0
4
15
0
1
3
@StrongDuality
Nitish
2 years
@Miles_Brundage Wow! It even works on Bollywood movies.
Tweet media one
Tweet media two
0
0
3
@StrongDuality
Nitish
6 years
At #TFDevSummit today. @TensorFlow team has put up quite an amazing lineup.
0
0
3
@StrongDuality
Nitish
6 years
@metasemantic @RichardSocher @Smerity @jekbradbury Would be glad to :) Poster @ ; Code @ ; and Paper @ https://openreview.net/forum?id=SyyGPP0TZ
0
0
3
@StrongDuality
Nitish
7 years
@hardmaru This is so cool! Thanks for sharing this. It's always fun to see the process alongside the final outcome.
0
0
2
@StrongDuality
Nitish
9 months
@_aidan_clark_ are you staring at the loss curves sufficiently? i have on good authority that that matters.
1
0
2
@StrongDuality
Nitish
6 years
@rasbt @ogrisel @RichardSocher A few papers do that, like the GNMT and ImageNet in 15 minutes. I have a brief commentary on these and similar in Section 2. I'd love to know if there are more that I missed.
0
0
2
@StrongDuality
Nitish
6 years
@mrtz "Nonlinear Optimization for Machine Learning: New Shit Has Come to Light". That sounds interesting. Any plans to release this for public consumption soon?
0
0
2
@StrongDuality
Nitish
5 years
@AkhileshGotmare will be presenting his work on investigating some DL heuristics during the PM poster session (4:30 PM). Swing by if you're curious about cosine annealing, knowledge distillation, or learning rate warmup.
Tweet media one
0
0
2
@StrongDuality
Nitish
6 years
@adityak6798 @RichardSocher We are primarily looking for research/engineering experience and DL/ML/NLP/RL/CV knowledge depth; so interested undergraduates are also welcome to apply.
1
0
2
@StrongDuality
Nitish
6 years
You can find my slides @ . Reach out if you have any questions, comments or criticism.
0
0
2
@StrongDuality
Nitish
8 years
Our paper on "Why using large batch sizes for #DeepLearning doesn't work" is out: !
1
1
2