Nitish Profile
Nitish

@StrongDuality

3,127 Followers · 1,018 Following · 25 Media · 359 Statuses

language modeling research @OpenAI | views are my own

Palo Alto, CA
Joined November 2013
@StrongDuality
Nitish
6 months
❤️
@sama
Sam Altman
6 months
i love the openai team so much
5K
4K
73K
9
12
295
@StrongDuality
Nitish
4 years
New blog: Benchmarks for Serving BERT-like Models! I spent some time investigating NVIDIA's Triton/TensorRT. Blog includes a guide to setting up your own server + benchmarks for choices like: sequence length, batch size, TF vs PyTorch, and model type.
4
47
195
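For readers who want to try the setup described in the tweet above, here is a minimal sketch of querying a Triton server over HTTP for a BERT-style model. The server URL, model name "bert", and the input/output tensor names and shapes are illustrative assumptions, not taken from the blog.

```python
# Minimal Triton HTTP client sketch (illustrative: server URL, model name,
# tensor names, and sequence length are assumptions, not from the blog).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

seq_len = 128
input_ids = np.zeros((1, seq_len), dtype=np.int32)        # replace with real token ids
attention_mask = np.ones((1, seq_len), dtype=np.int32)

inputs = [
    httpclient.InferInput("input_ids", [1, seq_len], "INT32"),
    httpclient.InferInput("attention_mask", [1, seq_len], "INT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

outputs = [httpclient.InferRequestedOutput("logits")]
result = client.infer(model_name="bert", inputs=inputs, outputs=outputs)
print(result.as_numpy("logits").shape)
```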
@StrongDuality
Nitish
2 years
A bit of personal news: I left @SFResearch after nearly 5 years to join @OpenAI to work on LLMs. Very grateful to all my colleagues and collaborators! What better way to announce a transition to @OpenAI than with #dalle ; here’s a panda clearing out his desk after quitting.
Tweet media one
3
0
173
@StrongDuality
Nitish
5 years
Thoughts after reading the T5 paper of @colinraffel et al. Thread. An amazing paper (requiring significant compute) that teases apart the effect of various ingredients proposed in Muppetland in the last few months (years?). Some things that stood out / were surprising:
2
39
148
@StrongDuality
Nitish
6 years
New paper (with Karim and @RichardSocher ) on a Transformer network with branched self-attention layers (as opposed to multi-head attention). ~0.5 BLEU score improvement on SOTA for EN-FR and EN-DE tasks. Paper: Blog:
1
47
131
@StrongDuality
Nitish
5 years
This was quite a lot of fun (& a great learning experience). Hopefully, it makes CTRL more accessible. Thanks a bunch to the @huggingface team!
@LysandreJik
Lysandre
5 years
New 🤗Transformers release (2.1.1)! Includes Salesforce's *huge* 1.6B CTRL model. First author Nitish Keskar ( @StrongDuality ) added it himself to the library in 3 days (model, tokenizer, tests and even doc!) He didn't even need our help.🤯
Tweet media one
5
77
357
1
19
125
@StrongDuality
Nitish
6 months
So glad to finally see this come through! And, in all honesty, it’s partly because I really need @gdb ‘s help in fixing a pesky bug 😛
@OpenAI
OpenAI
6 months
We have reached an agreement in principle for Sam Altman to return to OpenAI as CEO with a new initial board of Bret Taylor (Chair), Larry Summers, and Adam D'Angelo. We are collaborating to figure out the details. Thank you so much for your patience through this.
6K
13K
67K
3
0
117
@StrongDuality
Nitish
5 years
Added experimental support for generation from CTRL on K80/T4/P100 or similar GPUs; look at top of for details. Collaboratory link w/ K80s: Used post-training (selective) FP16 quantization. As always, feedback welcome!
1
22
89
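As a rough illustration of the selective FP16 idea mentioned above (not the actual CTRL code, which was TensorFlow-based), one PyTorch-style recipe is to cast the large weight matrices to half precision while keeping numerically sensitive modules like LayerNorm in FP32:

```python
# Sketch of post-training selective FP16 casting (assumption: a PyTorch-style
# model; the original CTRL release used TensorFlow). Roughly halves weight
# memory; activations must be cast to match each module's dtype at inference,
# e.g. via torch.cuda.amp.autocast.
import torch.nn as nn

def selectively_half(model: nn.Module) -> nn.Module:
    model.half()                       # cast all weights to FP16 first
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            module.float()             # keep LayerNorm in FP32 for stability
    return model
```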
@StrongDuality
Nitish
5 years
Super excited about this! You can control the generation with control codes for Wikipedia, Project Gutenberg, some sub-reddits, OpenWebText, News, and a few others. Work with @BMarcusMcCann @lrvarshney @CaimingXiong @RichardSocher Thread 1/n
@RichardSocher
Richard Socher
5 years
We release the largest publicly available language model: CTRL has 1.6B parameters and can be guided by control codes for style, content, and task-specific behavior. Incredible generations! Paper Github Blog
Tweet media one
Tweet media two
Tweet media three
21
597
2K
2
10
66
@StrongDuality
Nitish
6 years
tl;dr - A simple strategy that automatically switches from Adam to SGD (empirically) shrinks generalization gaps for many problems. Suggests viability of such hybrid approaches.
0
12
35
@StrongDuality
Nitish
6 years
@jeremyphoward I've gotten surprisingly far with PySwarm (PSO) and "smart" random search (especially because they are embarrassingly parallelizable). You can also try HyperOpt, BayesOpt and Scikit-Optimize. Also, @SigOpt if you're willing to pay.
2
8
34
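Since the appeal above is that random search is embarrassingly parallelizable, here is a minimal sketch of that pattern; the search space and the `train_and_eval` objective are placeholders for a real training job, not anything from the tweet.

```python
# Embarrassingly parallel random search sketch; the search space and the
# train_and_eval objective are placeholders.
import random
from multiprocessing import Pool

def sample_config(rng: random.Random) -> dict:
    return {
        "lr": 10 ** rng.uniform(-5, -1),              # log-uniform learning rate
        "batch_size": rng.choice([16, 32, 64, 128]),
        "dropout": rng.uniform(0.0, 0.5),
    }

def train_and_eval(config: dict) -> float:
    # placeholder: run a training job and return a validation score
    return -((config["lr"] - 1e-3) ** 2) - 0.01 * config["dropout"]

if __name__ == "__main__":
    rng = random.Random(0)
    configs = [sample_config(rng) for _ in range(32)]
    with Pool(processes=8) as pool:                   # trials are independent
        scores = pool.map(train_and_eval, configs)
    best_score, best_config = max(zip(scores, configs), key=lambda t: t[0])
    print("best score:", best_score, "best config:", best_config)
```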
@StrongDuality
Nitish
5 years
@ ICLR. Part of two papers this week: 1) with @AkhileshGotmare on heuristics in DL, and 2) with @hllo_wrld on question answering. For details: DM if you want to chat.
1
10
33
@StrongDuality
Nitish
5 years
#NeurIPS2018 workshop on adversarial robustness.
Tweet media one
1
9
32
@StrongDuality
Nitish
6 years
is up! Summarizes information about the tasks, answers some frequent questions, and provides a leaderboard (let us know if you have an improved score :) )
@RichardSocher
Richard Socher
6 years
website is up! Slides motivating true multitask learning in AI and NLP from a recent talk:
Tweet media one
7
146
364
1
12
28
@StrongDuality
Nitish
6 years
I'm sure this is common knowledge but the 50-75 heuristic (or whatever it might be called) for reducing learning rates is surprisingly effective. Reduce by 10 once at 50% of budget, again at 75%. I've not been successful at making any other strategy work better on my tasks.
3
6
27
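The 50/75 heuristic in the tweet above maps directly onto a step schedule; here is a minimal PyTorch sketch, where the model, optimizer, and epoch budget are placeholders.

```python
# The 50/75 heuristic as a step schedule: divide the LR by 10 at 50% and again
# at 75% of the epoch budget. Model, optimizer, and budget are placeholders.
import torch

model = torch.nn.Linear(10, 2)                         # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

total_epochs = 100
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[total_epochs // 2, (3 * total_epochs) // 4],   # 50% and 75%
    gamma=0.1,                                         # reduce by 10x each time
)

for epoch in range(total_epochs):
    # ... one epoch of training goes here ...
    optimizer.step()                                   # placeholder update
    scheduler.step()
```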
@StrongDuality
Nitish
5 years
Another update regarding CTRL. Added in : 1. (experimental) support for PyTorch inference, 2. code to fine-tune CTRL on your own dataset, 3. a 36-layer trained model (vs. 48 for the large model). Also, a snappy GIF from @BMarcusMcCann and @melvingruesbeck
0
11
24
@StrongDuality
Nitish
5 years
Interestingly, the detectors for GPT also do well on CTRL. With this specific demo, I've tried a few prompt completions from CTRL and the detector always picked it out correctly. At least preliminarily, it seems like the detectors are quite transferable (?).
@ClementDelangue
clem 🤗
5 years
Try it and tell us what you think. Should we build something similar for every model we release in ?
6
2
25
4
4
25
@StrongDuality
Nitish
4 years
This is really cool. Great title too!
@hardmaru
hardmaru
4 years
Fantastic Generalization Measures and Where to Find Them “We present the first large scale study of generalization in deep networks. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters.”
Tweet media one
3
111
278
0
5
23
@StrongDuality
Nitish
6 years
Reach out if you wish to talk about the work we do and the team. P.S: we are also looking for (Jr./Sr.) Research Scientists. More details @
@RichardSocher
Richard Socher
6 years
We are looking to expand Salesforce Research with more directors to lead their AI groups. Freedom to choose your research direction, support if you want to productize results. Join a friendly, collaborative team and a company with great values. Apply here :)
Tweet media one
1
43
106
3
5
19
@StrongDuality
Nitish
7 years
"..training took about 80 days for 1.5 billion samples, on 2 Nvidia K80 GPU’s (4 devices) with batch size 64 per GPU.." 80 days! Damn.
@Miles_Brundage
Miles Brundage
7 years
"English Conversational Telephone Speech Rec. by Humans + Machines," Saon et al: TBD! I've (weakly) claimed opposite
Tweet media one
1
4
14
2
9
18
@StrongDuality
Nitish
5 years
I will be at the @salesforce booth in the evening today (and early afternoon on most other days) if you want to chat about opportunities or research.
@SalesforceEdu
#Futureforce
5 years
It’s Day ✌️of #NeurIPS2018 and our team is ready to connect with the top minds in #AI . Stop by our @Salesforce booth to speak w/researcher @BMarcusMcCann this morning and learn about our latest research in #NLP . Learn more at
Tweet media one
Tweet media two
0
4
12
0
6
18
@StrongDuality
Nitish
6 years
Come talk to me and @Smerity about scaling up language modeling during the poster breakout sessions in the same workshop as well.
Tweet media one
2
6
18
@StrongDuality
Nitish
6 years
Claim 1: Cyclical/Cosine learning rates help. Claim 2: Learning rates and batch sizes have "duality"/"equivalence". Research question: train with cyclical/cosine batch sizes?
1
1
18
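Purely as an illustration of the research question posed above (the tweet does not claim this works), one way to realize a cosine batch-size schedule is to grow the batch size along a cosine curve in place of decaying the learning rate:

```python
# Illustrative cosine batch-size schedule: ramp the batch size from min_bs to
# max_bs along a cosine curve instead of decaying the learning rate. This is a
# sketch of the open question in the tweet, not an established recipe.
import math

def cosine_batch_size(step: int, total_steps: int,
                      min_bs: int = 32, max_bs: int = 512) -> int:
    progress = min(step / max(total_steps, 1), 1.0)
    scale = 0.5 * (1.0 - math.cos(math.pi * progress))   # goes 0 -> 1
    return int(round(min_bs + (max_bs - min_bs) * scale))

for s in (0, 2500, 5000, 7500, 10000):                   # sample the schedule
    print(s, cosine_batch_size(s, total_steps=10000))
```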
@StrongDuality
Nitish
2 years
This is an amazing effort and will probably influence LLM evals for a while. Glad to have contributed in a tiny way. Also, probably the high-water mark for number of authors on a paper for me.
@jaschasd
Jascha Sohl-Dickstein
2 years
After 2 years of work by 442 contributors across 132 institutions, I am thrilled to announce that the paper is now live: . BIG-bench consists of 204 diverse tasks to measure and extrapolate the capabilities of large language models.
Tweet media one
37
574
3K
1
1
17
@StrongDuality
Nitish
5 years
Congratulations to the winners of our first @Salesforce Research Deep Learning grant:
1
1
13
@StrongDuality
Nitish
2 years
@thegautamkamath Green Grape and Strawberry #dalle
Tweet media one
Tweet media two
0
1
12
@StrongDuality
Nitish
5 years
1. Unlike decaNLP, where we tried to pose "everything as QA", T5 authors take a more unconstrained approach and go for "everything as Seq2Seq". Perhaps surprisingly, the model/data scale also obviates specialized span decoders; SoTA on SQuAD through pure generation!
1
1
11
@StrongDuality
Nitish
5 years
@aggielaz Unofficial but I've found to be quite useful.
0
2
10
@StrongDuality
Nitish
4 years
New work with @thisismadani , @BMarcusMcCann , @nikhil_ai , @RichardSocher and collaborators at Stanford exploring use of large-scale training/models on proteins. Amazing downstream applications and vast potential for model improvements as well.
@RichardSocher
Richard Socher
4 years
Introducing ProGen, a large language model trained on 280 million protein sequences that can generate viable proteins based on user specifications. A step towards AI & #nlpproc helping cure disease and clean our planet. Paper: Blog:
Tweet media one
9
106
345
0
0
10
@StrongDuality
Nitish
6 years
Few things are more jarring to read than a partially trained natural language generation model.
1
0
9
@StrongDuality
Nitish
6 years
If you missed the Deep Learning at Supercomputing Scale workshop at NIPS, the slides for many of the talks (including mine) are now live @
0
1
9
@StrongDuality
Nitish
4 years
NVIDIA's Triton/TensorRT solution provided a lot of these functionalities + their benchmarks were very impressive. We decided to investigate further in the context of Transformer language models. Questions/comments/feedback are welcome!
1
1
8
@StrongDuality
Nitish
6 years
@Sid_dinesh94 @RichardSocher We certainly do have positions for Research Engineers as well. Here are more details:
1
2
9
@StrongDuality
Nitish
5 years
@RichardSocher talking about future of multitasking now.
Tweet media one
0
0
8
@StrongDuality
Nitish
5 years
@RichardSocher @rosstaylor90 We have some preliminary numbers (20.2 perplexity on WikiText-103) but are in the process of finalizing them by figuring out the detokenization from the original WT103 version for apples-to-apples comparison.
2
0
7
@StrongDuality
Nitish
4 years
Why? Because model serving is a deceptively hard problem. Especially with bells and whistles like dynamic batching, model swapping and priorities, model versioning, support for multiple frameworks, ease of transitioning from research to production, maintainability, and efficiency.
1
0
8
@StrongDuality
Nitish
5 years
2. Seems like asking the model to reconstruct the input (à la denoising autoencoding) is weaker no matter how you do it. Predicting only the corruption/masking is incentivizing better pre-training somehow.
Tweet media one
1
0
8
@StrongDuality
Nitish
5 years
6. Authors use a peculiar way of scaling to 11B. Instead of increasing layers (such as the 89B model) or emb/model size, they keep those reasonable at 24/1024 resp. They only scale the intermediate layer in the FFN (to 65536!). Understandable though given the ease of parallelism.
1
0
8
@StrongDuality
Nitish
7 years
Just heard that our paper on a 2nd-order method for solving convex + L1 won the Charles Broyden prize ()!
0
1
7
@StrongDuality
Nitish
5 years
4. Multitask learning (even with a relaxed definition) continues to be annoyingly difficult. Authors allow different checkpoints on same trajectory but performance is same (at best) or worse. Echoes a lot of what we found during decaNLP and SpEx ()
Tweet media one
Tweet media two
1
0
7
@StrongDuality
Nitish
6 years
Mandatory "OMG! #NIPS2017 Queue" picture.
Tweet media one
0
2
7
@StrongDuality
Nitish
6 years
My #NIPS2017 workshop talk on large batch training and generalization starts in ~30 minutes in RM 101B.
1
2
7
@StrongDuality
Nitish
6 years
Tune in to at 5PM PT for the AI research keynote by @RichardSocher and team. Learn about some of the projects and products our team works on.
0
1
7
@StrongDuality
Nitish
5 years
@_sding @jekbradbury I have tried without success. My hypothesis is that it's a regularization issue. The training ppl falls fine, the validation ppl falls then shoots up. I think I saw someone else (maybe t-xl repo?) mention this as well.
3
0
7
@StrongDuality
Nitish
5 years
5. Really enjoyed reading Section 3.6 ("What if you had 4x more compute?"). Somewhat unsurprisingly, training the largest model that still fits and training a 2x larger model for 2x longer are great places to start. But it's great to have evidence of how other choices feature.
1
0
7
@StrongDuality
Nitish
7 years
A bit late with my tweet but: our paper on sharp minima was accepted for an oral presentation at ICLR!
1
0
6
@StrongDuality
Nitish
6 years
@RichardSocher @sleepinyourhat We had explored several sampling/scheduling strategies for decaNLP, but somewhat surprisingly, a simple round-robin strategy with one mini-batch update (agnostic to the dataset sizes) seemed to work best. We did additionally benefit from training on a subset of the tasks first.
0
0
6
@StrongDuality
Nitish
5 years
3. WebText (which is also a part of CTRL), despite its small size compared to C4, is surprisingly good across the board. It's a bit counterintuitive that WebText is better on GLUE and SQuAD than all others.
Tweet media one
1
0
6
@StrongDuality
Nitish
7 years
People have worked on DNNs with random depth, width, grad noise and residual conn weights. Why stop there? 💡: random learning rate every iter!
4
2
6
@StrongDuality
Nitish
5 years
7. Paper concludes with great reflections and avenues for future research. Scaling up clearly helps (it did for CTRL as well), but this also motivates the design of better ingredients. 2016/17 saw a lot of scaled-out LSTMs achieve SoTA, only for Transformers to break that chain.
1
0
6
@StrongDuality
Nitish
3 years
@colinraffel @_AlecJacobson This sounds great! We had a similar notion of roles to prepare for committee questions for our PhD qualifying exam presentation at @NU_IEMS . IIRC, we also discussed an 'optimist' who loves the paper and will defend it no matter what, and a 'pessimist' who just wants to hate it.
2
0
6
@StrongDuality
Nitish
6 years
@jekbradbury @marian_nmt Here are a few choices and when I use them: Hyperband: when I have no good estimate of the hyperparameters and I need "good enough" ones soon. skopt: I have estimates but I need "really good" ones on a tight budget. 1/2
1
1
6
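For the "really good ones on a tight budget" case, a minimal scikit-optimize sketch looks like the following; the search space and objective are placeholders for a real training run (skopt minimizes, so return e.g. validation loss).

```python
# Minimal scikit-optimize (skopt) sketch for a tight evaluation budget; the
# search space and objective below are placeholders, not from the tweet.
from skopt import gp_minimize
from skopt.space import Real, Integer

space = [
    Real(1e-5, 1e-1, prior="log-uniform", name="lr"),
    Integer(1, 4, name="num_layers"),
]

def objective(params):
    lr, num_layers = params
    # placeholder: train with these hyperparameters and return validation loss
    return (lr - 1e-3) ** 2 + 0.01 * num_layers

result = gp_minimize(objective, space, n_calls=20, random_state=0)
print("best params:", result.x, "best loss:", result.fun)
```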
@StrongDuality
Nitish
6 years
@adityak6798 @RichardSocher Information about internships is available @ . The window is not fixed; we have interns throughout the year with varying durations.
1
4
5
@StrongDuality
Nitish
5 years
Sec. 8 of the paper and our blog discuss our due diligence on the ethics of releasing this model (including a careful review process and inputs from experts at the Partnership on AI (PAI) and external members of our Ethical Use Advisory Council) 4/n
1
2
5
@StrongDuality
Nitish
7 years
Successfully defended my thesis! 😃
0
0
4
@StrongDuality
Nitish
2 years
@Tim_Dettmers We used Adagrad when training CTRL; it worked quite well. Here are the hyperparams. One "weird trick" you can use if the LR decays too rapidly is to restart the second-moment (squared-gradient) accumulator and initiate a small warmup phase. But we didn't seem to need it when training the 1.6B model.
Tweet media one
1
0
5
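A rough PyTorch sketch of the trick described above (the original CTRL training code is not shown in the tweet, so treat the details as assumptions): zero Adagrad's squared-gradient accumulator so the effective step size recovers, then linearly warm the learning rate back up over a short window.

```python
# Sketch of the accumulator-restart trick in PyTorch (illustrative; not the
# original CTRL training code): zero Adagrad's squared-gradient state, then
# linearly warm the learning rate back up to its base value.
import torch

model = torch.nn.Linear(10, 10)                        # placeholder model
base_lr = 0.05
optimizer = torch.optim.Adagrad(model.parameters(), lr=base_lr)

def restart_accumulator_with_warmup(optimizer, base_lr, warmup_steps=1000):
    # zero the accumulated squared gradients ("sum" in PyTorch's Adagrad state)
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p)
            if state is not None and "sum" in state:
                state["sum"].zero_()

    def set_lr(step):                                  # call once per step
        scale = min(1.0, (step + 1) / warmup_steps)
        for group in optimizer.param_groups:
            group["lr"] = base_lr * scale
    return set_lr
```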
@StrongDuality
Nitish
7 years
Won't be @ ICLR but here's the link to our sharp-minima poster: . Also added a WIP @PyTorch implementation to Github.
0
2
5
@StrongDuality
Nitish
4 years
Quite excited about trying this out.
@martin_gorner
Martin Görner
4 years
@fchollet TPU and TPU pod support in *Keras* ! 🥳
1
2
23
1
0
5
@StrongDuality
Nitish
7 years
@RogerGrosse I also wish authors provided tuning guidelines (like in the Adam paper), even if crude: how to choose, tradeoffs, signs you chose incorrectly...
0
0
4
@StrongDuality
Nitish
7 years
Finally got around to reading the paper on shattered gradients; very thought-provoking.
0
1
4
@StrongDuality
Nitish
2 years
@jekbradbury @thesephist @karinanguyen_ I would answer with either matchbox or revtok :P
0
0
4
@StrongDuality
Nitish
3 years
A really well-written blog; quite enjoyed reading it. Talks about their contributions but also provides a quick introduction to other recent prompt-tuning papers.
@gaotianyu1350
Tianyu Gao
3 years
Prompts have been shown to have great potential in making language models better at a variety of NLP tasks. This blog post reviews recent work in “prompting”. It also introduces our ACL’21 paper LM-BFF! Check it out :)
1
65
264
0
0
4
@StrongDuality
Nitish
6 years
@Smerity @Azaliamirh A great talk! I wonder if one could design some kind of grammar/DSL for creating interpretable placement rules (even if it is per-application)?
0
1
4
@StrongDuality
Nitish
6 years
Finally made it through the doors of #NIPS2017 in attempt #2 . Please ping me if you're interested in discussing optimization and/or generalization for ML.
0
0
4
@StrongDuality
Nitish
7 years
@F_Vaggi @jeremyphoward @apaszke @ogrisel @TensorFlow @PyTorch My interest is less for perf., more for writing weird opt. algos. TF wraps loss/grad so you can easily do a linesearch & deal w/ flat grads.
1
0
4
@StrongDuality
Nitish
6 years
A very nice talk from @GoogleBrain taking a dive into sw/hw efforts at making ImageNet a toy problem. I also recommend checking out @JeffDean 's talk from yesterday, which had a similar flavor.
Tweet media one
0
1
4
@StrongDuality
Nitish
7 years
This is a very interesting line of research.
@abursuc
Andrei Bursuc
7 years
A Closer Look at Memorization in Deep Networks Bringing some nuances to the rethinking generalization paper
1
16
37
0
2
4
@StrongDuality
Nitish
7 years
One thing I love about @TensorFlow but missing in @PyTorch is an interface to external optimizers (e.g., it wraps all of scipy.optimize cleanly)
1
0
4
@StrongDuality
Nitish
5 years
Amazing work and very well-written paper. Also, LOL @ Footnote 3.
@OpenAI
OpenAI
5 years
We've trained an unsupervised language model that can generate coherent paragraphs and perform rudimentary reading comprehension, machine translation, question answering, and summarization — all without task-specific training:
172
3K
6K
0
0
4
@StrongDuality
Nitish
6 years
@beenwrekt Damn. I missed yesterday's conversation. I don't think Neyshabur et al. () was brought up as a possible definition?
2
0
4
@StrongDuality
Nitish
5 years
I swear, every paper that I open on OpenReview has a first official review of 3: Weak Reject.
0
0
4
@StrongDuality
Nitish
5 years
You can generate from a link, and changing the URL changes the story. 2/n
Tweet media one
2
0
3
@StrongDuality
Nitish
6 years
@Smerity @jeremyphoward Also: A Bayesian Perspective on Generalization and Stochastic Gradient Descent ()
1
0
3
@StrongDuality
Nitish
2 years
@savvyRL @SFResearch @OpenAI Thanks Rosanne! Quitting panda seems... incomplete? Here's an excited-to-go-to-work #dalle panda lol
Tweet media one
0
0
3
@StrongDuality
Nitish
7 years
What's also interesting is how little the hyperparams (layers/nodes/dropouts/clipping) needed to change from LSTM to QRNN for getting ~SOTA.
@Smerity
Smerity
7 years
@PyTorch @RichardSocher @jekbradbury @StrongDuality @CaimingXiong We added QRNN to AWD-LSTM language model we released earlier: hits SotA PTB + near SotA WT2 while being >2x faster
Tweet media one
3
11
17
0
1
3
@StrongDuality
Nitish
5 years
For the same prompt, you can change the generation behavior by changing the control code. 3/n
Tweet media one
Tweet media two
Tweet media three
1
0
3
@StrongDuality
Nitish
5 months
Managing such a large effort (from design, training, eval, deployment, …) is, unsurprisingly, quite difficult. The model looks very strong; looking forward to playing with it. Congratulations to the Gemini team!
@demishassabis
Demis Hassabis
5 months
The Gemini era is here. Thrilled to launch Gemini 1.0, our most capable & general AI model. Built to be natively multimodal, it can understand many types of info. Efficient & flexible, it comes in 3 sizes each best-in-class & optimized for different uses
Tweet media one
356
2K
12K
0
0
4
@StrongDuality
Nitish
2 years
@mark_riedl Was it mostly prompt engineering, or did you also have to fine-tune?
1
0
3
@StrongDuality
Nitish
6 years
@jeremyphoward @Smerity I think they're comparing to @Smerity 's older work, not our recent AWD-LSTM work. Their final perplexity on PTB is about 70 (v/s ours of 57) and on WT2 is about 70 (v/s ours of 65). Neural cache, mixtures-of-softmax and dynamic evaluation brings down those even further.
1
0
3
@StrongDuality
Nitish
6 years
@ogrisel @Smerity @RichardSocher AvSGD is awesome but the darn thing needs soo many epochs. Adam with a learning rate decay seemed like a happy compromise. I'm fairly confident that even better performance can be achieved through more sophisticated training strategies like averaging or CyclicLR.
2
0
3
@StrongDuality
Nitish
4 months
@_aidan_clark_ that's just 'cuz you GPU poor
0
0
3
@StrongDuality
Nitish
5 years
@remilouf @brisvegas1 @ogrisel @julien_c @huggingface @BMarcusMcCann @lrvarshney @CaimingXiong @RichardSocher @SFResearch Great question! @BMarcusMcCann and I are exploring this very question right now; we will have a definitive answer soon. It's likely better than ULMFit since it's also a causal LM but larger and trained on more data. With BERT, the (delicate) trade-off is size v/s MLM loss.
0
0
3
@StrongDuality
Nitish
6 years
@ogrisel @Smerity @RichardSocher @tim_garipov Yeah. The "constant LR averaging" in Fig 7 was exactly the phenomenon we observed and motivated us to use AvSGD for our previous paper. The game changes a bit with Adam, tuning it just a tad (LR and beta1) seemed to be within 0.3-0.5 of AvSGD without needing the fine-tuning stage
0
0
3
@StrongDuality
Nitish
5 years
This is today! Come by if you're at #CVPR2019 . I'll be talking about some recent work on transfer and multitask learning. Focus is on general (task-agnostic) tools and less on specific applications.
@nikhil_ai
Nikhil Naik
5 years
Check out our tutorial on meta-learning at CVPR 2019 at 9 AM on Monday! Topics: few-shot and multi-task learning, AutoML. Speakers: @nikhil_ai , @StrongDuality , @chelseabfinn and @FrankRHutter . Organized with @RichardSocher , @raskarmit . Info: #CVPR2019
0
4
15
0
1
3
@StrongDuality
Nitish
2 years
@Miles_Brundage Wow! It even works on Bollywood movies.
Tweet media one
Tweet media two
0
0
3
@StrongDuality
Nitish
6 years
At #TFDevSummit today. @TensorFlow team has put up quite an amazing lineup.
0
0
3
@StrongDuality
Nitish
6 years
@metasemantic @RichardSocher @Smerity @jekbradbury Would be glad to :) Poster @ ; Code @ ; and Paper @ https://openreview.net/forum?id=SyyGPP0TZ
0
0
3
@StrongDuality
Nitish
7 years
@hardmaru This is so cool! Thanks for sharing this. It's always fun to see the process alongside the final outcome.
0
0
2
@StrongDuality
Nitish
9 months
@_aidan_clark_ are you staring at the loss curves sufficiently? i have on good authority that that matters.
1
0
2
@StrongDuality
Nitish
6 years
@rasbt @ogrisel @RichardSocher A few papers do that, like the GNMT and ImageNet in 15 minutes. I have a brief commentary on these and similar in Section 2. I'd love to know if there are more that I missed.
0
0
2
@StrongDuality
Nitish
6 years
@mrtz "Nonlinear Optimization for Machine Learning: New Shit Has Come to Light". That sounds interesting. Any plans to release this for public consumption soon?
0
0
2
@StrongDuality
Nitish
5 years
@AkhileshGotmare will be presenting his work on investigating some DL heuristics during the PM poster session (4:30 PM). Swing by if you're curious about cosine annealing, knowledge distillation, or learning rate warmup.
Tweet media one
0
0
2
@StrongDuality
Nitish
6 years
@adityak6798 @RichardSocher We are primarily looking for research/engineering experience and DL/ML/NLP/RL/CV knowledge depth; so interested undergraduates are also welcome to apply.
1
0
2
@StrongDuality
Nitish
6 years
You can find my slides @ . Reach out if you have any questions, comments or criticism.
0
0
2
@StrongDuality
Nitish
8 years
Our paper on "Why using large batch sizes for #DeepLearning doesn't work" is out: !
1
1
2