rohan anil

@_arohan_

12,191
Followers
1,533
Following
759
Media
6,180
Statuses

Principal Engineer, @GoogleDeepMind Gemini. prev PaLM-2. Tinkering with optimization and distributed systems. opinions are my own.

Joined December 2017
@_arohan_
rohan anil
1 year
This paper looks like a big step forward for the Transformer architecture! A foundational improvement, not as shiny as other things, but a really big step forward nonetheless
Tweet media one
11
103
841
@_arohan_
rohan anil
8 months
Meta researchers just dropped PyTorch distributed shampoo🧴a few days ago: 💥 Train neural networks with a second order method for better performance. The underlying work it is based on has been a passion project for the last 5 years while swimming…
9
74
565
@_arohan_
rohan anil
5 months
It’s been a privilege to work alongside our Gemini leads and team (across Google DeepMind, Research and Alphabet) on one of the most interesting and challenging projects of my career. We have three versions of Gemini: (a) Ultra, (b) Pro, and (c) Nano. We make significant…
@JeffDean
Jeff Dean (@🏡)
5 months
I’m very excited to share our work on Gemini today! Gemini is a family of multimodal models that demonstrate really strong capabilities across the image, audio, video, and text domains. Our most-capable model, Gemini Ultra, advances the state of the art in 30 of 32 benchmarks,…
Tweet media one
Tweet media two
276
3K
13K
21
24
496
@_arohan_
rohan anil
2 years
A new image generation model just dropped. Great work by the team!
+ Auto-regressive encoder->decoder Transformer
+ Classifier-free sampling
+ ViT-VQGAN
Really amazing results. Image from the website.
Tweet media one
13
105
481
@_arohan_
rohan anil
4 years
Shampoo is out of the bottle! Preprint: "Second order optimization made practical" We train certain neural nets faster than before. How fast? It has shown up to a ~40% reduction in training time for a Transformer. (@tomerikoriko)
Tweet media one
7
113
439
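For the curious, a minimal sketch of the Shampoo update for a single m x n weight matrix, based on my reading of the preprint; function names are illustrative, and real implementations add grafting, damping, block partitioning, and amortize the root computation over many steps.

```python
import numpy as np

def inv_pth_root(M, p, eps=1e-4):
    """Inverse p-th root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
    return (vecs * vals ** (-1.0 / p)) @ vecs.T

def shampoo_step(W, G, L, R, lr=0.1):
    """One Shampoo update for an m x n weight W with gradient G.
    L (m x m) and R (n x n) are running Kronecker-factored statistics."""
    L = L + G @ G.T                       # left statistics
    R = R + G.T @ G                       # right statistics
    # Precondition the gradient on both sides with inverse 4th roots.
    update = inv_pth_root(L, 4) @ G @ inv_pth_root(R, 4)
    return W - lr * update, L, R
```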
@_arohan_
rohan anil
11 months
PaLM-2 is generally available for developers! “With this update, developers can access our text model powered by PaLM 2, Embeddings API for text, and other foundation models in Model Garden”
5
88
397
@_arohan_
rohan anil
2 years
Today, we present our paper on the Google Search Ads CTR model at ORSUM @ACMRecSys, Seattle. We highlight ML techniques suited to *online learning* that go well beyond traditional accuracy improvements. A short thread: 1/n
Tweet media one
6
84
380
@_arohan_
rohan anil
2 years
Prompt: "A koala bear and grizzly bear playing chess. They are sitting at a table on the beach. You can see the waves crashing into the shores. Bears are very stressed. DSLR camera photo." #imagen #googleai #brain 🐻🐨♟️🏖️
Tweet media one
Tweet media two
13
42
327
@_arohan_
rohan anil
2 years
Batch Entropy Regularizer that makes untrainable networks train: remove skip connections and normalization layers entirely. Published at TMLR; works on PaLM-like transformers -- thanks to Lucid for the pointer!
3
53
313
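A heavily hedged sketch of the idea as I understand it, assuming the regularizer fits a Gaussian per neuron across the batch; the names and the target-matching penalty are illustrative, not the paper's exact formulation.

```python
import torch

def batch_entropy(a: torch.Tensor) -> torch.Tensor:
    """Differential entropy of a layer's activations across the batch,
    under a per-neuron Gaussian assumption: H = 0.5 * log(2*pi*e*var).
    `a` has shape (batch, features); returns the mean over neurons."""
    var = a.var(dim=0) + 1e-8
    return (0.5 * torch.log(2 * torch.pi * torch.e * var)).mean()

# Regularize each layer's batch entropy toward a target so the signal
# neither collapses nor explodes as it propagates without skips/norms:
# loss = task_loss + beta * sum((batch_entropy(a) - target) ** 2 for a in acts)
```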
@_arohan_
rohan anil
1 year
PaLM-2 is online: 🌴🌴 Paper: I learned to code with instructions in Malayalam, so this capability shown by PaLM-2 instruction-tuned models to explain code makes me quite happy! The possibilities are endless here!
Tweet media one
@_arohan_
rohan anil
1 year
🌴🌴 Very proud of this work; specifically not compromising on model quality while being extremely fast at inference, so that we can serve the whole wide world, i.e. bringing technology to everyone!
0
6
59
17
69
293
@_arohan_
rohan anil
3 months
@karpathy @giffmana The team is working hard to bring audio inputs to the AI Studio interface for Gemini 1.5 Pro. We have an internal version that handles audio and video and can sample the video less frequently to increase the length of content that can be handled. @karpathy, thanks for the…
Tweet media one
Tweet media two
Tweet media three
7
38
288
@_arohan_
rohan anil
2 years
L👈: "A Koala bear in a suit standing at a podium to teach. Variational bayesian methods is written on the chalkboard. There are lot of confused cats in the crowd" R 👉:"Variational bayesian methods is all you need is written on the chalkboard." 🐨🙀 #imagen #googleai #brain
Tweet media one
Tweet media two
9
40
268
@_arohan_
rohan anil
5 years
@dave_universetf @therealfitz "Picard: hello can you hear us. It looks like you are muted."
0
10
246
@_arohan_
rohan anil
2 years
Prompt: "A train ride in the monsoon rain in Kerala. With a Koala bear wearing a hat looking out of the window. There is a lot of coconut trees out of the window" #imagen #googleai #brain (I will host the imagen team at my home in Kerala if they choose to visit 🚀)
Tweet media one
14
13
248
@_arohan_
rohan anil
11 months
GPT-4 can do well on the MIT test
Community: oh the methodology is all wrong 🌶️
Introducing a new optimizer that is 2x faster than AdamW
Community: Impressive! Impressive methodology!
Said methodology: use half the steps for the new method and change the learning rate schedule to…
11
24
242
@_arohan_
rohan anil
3 years
Code for Distributed Shampoo: a scalable second order optimization method 💥 Joint work w @GuptaVineetG State of the art on MLPerf ResNet-50 training to reach 75.9% accuracy at 32,768 batch size Trains in 1729 steps (not a typo), 284 secs on TPUs.
0
31
228
@_arohan_
rohan anil
5 years
Code for SM3, a memory-efficient adaptive first-order optimizer, is now open-sourced under the @GoogleAI research repository. It's useful for training very large language models, e.g. BERT-Large, GPT-2, etc.
3
52
209
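A minimal sketch of the SM3-style update for a matrix parameter (illustrative names; the open-sourced version generalizes to arbitrary cover sets and tensor shapes):

```python
import numpy as np

def sm3_step(W, G, row_acc, col_acc, lr=0.1, eps=1e-8):
    """SM3-style memory-efficient AdaGrad for an m x n weight.
    Keeps only a row accumulator (m,) and a column accumulator (n,)
    instead of a full m x n second-moment buffer: O(m + n) memory."""
    # Per-parameter estimate: min over the covers containing it,
    # plus the fresh squared gradient.
    nu = np.minimum(row_acc[:, None], col_acc[None, :]) + G ** 2
    W = W - lr * G / np.sqrt(nu + eps)
    # Each cover keeps the max over the parameters it covers.
    return W, nu.max(axis=1), nu.max(axis=0)
```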
@_arohan_
rohan anil
2 years
I completely missed the Parallel Layers used in PaLM. It makes training 15% faster at larger scales, mainly by running the MLP and attention blocks together! Thanks @achowdhery for pointing this out to me! The savings in compute are quite substantial.
Tweet media one
6
13
199
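Concretely, the parallel formulation from the PaLM paper, sketched as a module; `attn` and `mlp` are assumed to be any shape-preserving blocks, and this sketch shares a single LayerNorm between them.

```python
import torch.nn as nn

class ParallelBlock(nn.Module):
    """PaLM-style parallel layer: y = x + mlp(norm(x)) + attn(norm(x)),
    instead of the serial y = x + mlp(norm(x + attn(norm(x)))).
    The MLP and attention matmuls can then be fused / run concurrently."""
    def __init__(self, attn: nn.Module, mlp: nn.Module, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)   # single shared LayerNorm
        self.attn, self.mlp = attn, mlp

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h) + self.mlp(h)
```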
@_arohan_
rohan anil
2 years
“For example, if the traditional algorithm taught in school multiplies a 4x5 by 5x5 matrix using 100 multiplications, and this number was reduced to 80 with human ingenuity, AlphaTensor has found algorithms that do the same operation using just 76 multiplications.”
@GoogleDeepMind
Google DeepMind
2 years
Today in @Nature: #AlphaTensor, an AI system for discovering novel, efficient, and exact algorithms for matrix multiplication - a building block of modern computations. AlphaTensor finds faster algorithms for many matrix sizes. 1/
114
2K
8K
3
15
194
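Where the baseline of 100 comes from: the schoolbook algorithm for an m x k times k x n product uses one scalar multiplication per (i, j, l) triple.

```python
# Schoolbook matrix multiplication of a (4 x 5) by a (5 x 5) matrix:
# one scalar multiplication per (i, j, l) triple, i.e.
# m * k * n = 4 * 5 * 5 = 100. AlphaTensor's discovered algorithm
# computes the same product in 76 multiplications.
m, k, n = 4, 5, 5
print(m * k * n)  # -> 100
```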
@_arohan_
rohan anil
10 months
Some excellent work by @jeankaddour and colleagues “We find that their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate” ☠️
Tweet media one
@jeankaddour
Jean Kaddour
10 months
@_arohan_ Our arxiv preprint might be of interest to you:
0
2
51
5
33
186
@_arohan_
rohan anil
2 years
The Transformer paper should get a half-decade test-of-time award for completely transforming industries and what people work on.
4
8
167
@_arohan_
rohan anil
4 years
Tinker with this visualization for training neural networks with noise added to the dataset. Made with tensorflow.js and inspired by the Neural Network Playground. 👇
Tweet media one
2
40
160
@_arohan_
rohan anil
10 months
Arrived on these shores @ 2010. Green card @ 2023 ✅
24
2
161
@_arohan_
rohan anil
1 year
10 years ago I left working on the iOS communicator at MSFT to work on machine learning at Google, without many connections or a doctoral degree for that matter. Crazy how time flies! And thanks to a bunch of lucky breaks, very thankful to be doing ML things at Google 🧠
9
4
157
@_arohan_
rohan anil
3 years
MADGRAD: 76.22% Shampoo: 77.8%
@AIatMeta
AI at Meta
3 years
We're introducing an optimizer for deep learning, MADGRAD. This method matches or exceeds the performance of the Adam optimizer across a varied set of realistic large-scale deep learning training problems.
Tweet media one
26
514
2K
3
27
149
@_arohan_
rohan anil
1 year
The next big jump in neural network performance is going to happen when the community embraces non-uniformity. E.g., stacking identical layers has become ingrained within our tools and mindsets.
15
11
135
@_arohan_
rohan anil
7 months
Gen AI on-device? A foundation model on the phone? Imagine an entire operating-system-level unlock of capabilities. Well, the Pixel 8 Pro will have it. Rick announced it here: The model was trained with several algorithmic breakthroughs by our team to…
9
15
134
@_arohan_
rohan anil
5 months
Gemini Nano improves on the efficiency frontier. The models are multimodal as well; see the results in the paper. The Nano series, at 1.8B and 3.25B parameters, packs in so much to provide high utility on device. The first foundation model on the device!
Tweet media one
Tweet media two
@sundarpichai
Sundar Pichai
5 months
Gemini Nano is super efficient for tasks that are on-device. Android developers can sign up for an early access program for Gemini Nano via Android AICore and Pixel 8 Pro users can already see it rolling out in features like Summarize in Recorder and Smart Reply in Gboard + much…
Tweet media one
67
157
2K
5
11
133
@_arohan_
rohan anil
1 year
😝👇
Tweet media one
@karpathy
Andrej Karpathy
1 year
3
4
95
2
13
130
@_arohan_
rohan anil
1 year
Just tested it on a paragraph from one of my papers, and it does seem to improve the writing. Sure, generating whole papers with an LM is not cool, but improving the writing quality seems good for everyone?
Tweet media one
@yoavgo
(((ل()(ل() 'yoav))))👾
1 year
this is kinda gate-keepy, @icmlconf
Tweet media one
35
23
291
22
11
131
@_arohan_
rohan anil
1 year
After 10 years at Google, 5 in Google Brain, now I work at Google DeepMind
@OriolVinyalsML
Oriol Vinyals
1 year
𝗚𝗼𝗼𝗴𝗹𝗲 𝗗𝗲𝗲𝗽𝗠𝗶𝗻𝗱
22
87
850
6
5
127
@_arohan_
rohan anil
2 years
People shocked that StableDiffusion was trained with fewer resources haven’t been paying attention to many things, including the Craiyon/DALL·E Mega runs. Scale is not all you need, dear community. Nice to write that in a paper though.
4
2
127
@_arohan_
rohan anil
11 months
First of all, massive congratulations are in order to @zacharynado @GeorgeEDahl @naman33k and co-authors on this massive work, spanning multiple years, on benchmarking neural network training algorithms! 🎉🍾 I have a horse 🐴 in the race and it's called distributed shampoo 🦄
@arankomatsuzaki
Aran Komatsuzaki
11 months
Benchmarking Neural Network Training Algorithms Presents AlgoPerf, a new, competitive, time-to-result benchmark using multiple workloads running on fixed hardware.
Tweet media one
2
66
304
1
19
122
@_arohan_
rohan anil
6 months
We can go back to reading arxiv papers! 🕊️
2
7
122
@_arohan_
rohan anil
3 months
@giffmana @karpathy This is a great idea! @karpathy would you give permission for us to use the video? 🙏
3
2
122
@_arohan_
rohan anil
2 years
Augments the Transformer 🤖 architecture with n-grams constructed from a discrete latent representation of the text sequence. Faster training and inference when it matters the most, as the core operations are (distributed) gather/scatter. 🎇 Code:
@_akhaliq
AK
2 years
N-Grammer: Augmenting Transformers with latent n-grams. abs: proposes a modification to the Transformer architecture by augmenting the model with n-grams constructed from a discrete latent representation of the text sequence
Tweet media one
0
30
129
2
21
118
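A heavily hedged sketch of the core lookup as I read the abstract; the hash constant, padding convention, and function names here are illustrative, not the paper's.

```python
import numpy as np

def ngram_embeddings(latent_ids, table):
    """Pair each discrete latent id with its left neighbor, hash the
    bigram into a fixed vocabulary, and gather its embedding.
    latent_ids: (seq_len,) ints from clustering the token embeddings;
    table: (vocab_size, dim) n-gram embedding table."""
    prev = np.roll(latent_ids, 1)
    prev[0] = 0                                              # pad first position
    ids = (latent_ids * 1_000_003 + prev) % table.shape[0]   # illustrative hash
    return table[ids]                                        # a pure gather

# In the paper the gathered bigram embeddings are normalized and
# concatenated with the regular token embeddings before the stack.
```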
@_arohan_
rohan anil
3 years
Finishing up slides now. I will be talking about “Scalable second order optimization for deep learning” at Deep Learning: Classics and Trends, tomorrow.
4
12
119
@_arohan_
rohan anil
2 years
Deep neural networks. They are much better than kernels. I am going to train them again and again with second order methods.
1
3
116
@_arohan_
rohan anil
1 year
Muting the professors who are arguing over each other on twitter has significantly improved my experience on this app.
3
1
111
@_arohan_
rohan anil
1 year
NaaNs at home, NaNs at work. Kid starts stepping at home, model stepping at work.
Tweet media one
4
3
110
@_arohan_
rohan anil
5 years
@dave_universetf @therealfitz Picard: "Yes, but we can't see you anymore. Can you try the auxiliary camera?"
0
2
98
@_arohan_
rohan anil
3 months
Called it with some knowledge about the model. Ultra is going to break ground! Those quibbling over HellaSwag and MMLU are just showing their misunderstanding of evaluation. Onwards 🚀
@lmsysorg
lmsys.org
3 months
🔥Breaking News from Arena Google's Bard has just made a stunning leap, surpassing GPT-4 to the SECOND SPOT on the leaderboard! Big congrats to @Google for the remarkable achievement! The race is heating up like never before! Super excited to see what's next for Bard + Gemini…
Tweet media one
155
632
3K
8
5
106
@_arohan_
rohan anil
2 years
Thankfully people judge ideas based on their merit. Noticed a citation to a twitter thread in DALL·E 2, for example.
Tweet media one
@rsalakhu
Russ Salakhutdinov
2 years
1/3 I think we should coin a new term: social media AI researcher, where instead of publishing your work at the rigorous peer review venue, you tweet about your findings and opinions. There are huge advantages: 1. It is easy: you don't have to deal with that annoying reviewer 2.
33
46
648
5
5
104
@_arohan_
rohan anil
2 years
The AGI I want is one that realizes I made a dumb mistake with the batch size which makes it OOM on a supercomputer, and tries a smaller one for me while I am sleeping, so I don’t have to babysit the models, and increases experimentation throughput!
7
4
97
@_arohan_
rohan anil
6 months
Collectible item
Tweet media one
5
3
97
@_arohan_
rohan anil
3 years
"Ever want to learn how JAX works, but the implementation seemed impenetrable? Well, you're in luck! By reading this tutorial, you'll learn every big idea in JAX's core system. You'll even get clued into our weird jargon!" Jax team keeps exceeding every expectation 😂
@DynamicWebPaige
👩‍💻 Paige Bailey
3 years
♥️ autodidax
Tweet media one
1
30
209
0
22
97
@_arohan_
rohan anil
3 months
128k context length? We don’t know how to count that low; it’s 10M now.
@JeffDean
Jeff Dean (@🏡)
3 months
Gemini 1.5 Pro - A highly capable multimodal model with a 10M token context length Today we are releasing the first demonstrations of the capabilities of the Gemini 1.5 series, with the Gemini 1.5 Pro model. One of the key differentiators of this model is its incredibly long…
Tweet media one
198
1K
6K
3
7
94
@_arohan_
rohan anil
2 years
Adam: Neumann:
3
8
93
@_arohan_
rohan anil
5 years
@dave_universetf @therealfitz "Kirk: I am sorry that is the replicator, the mute button isn't working."
1
3
86
@_arohan_
rohan anil
1 year
We tried reading this to the baby and she whacked it out of my hands in preference to Spot's First Walk.
Tweet media one
Tweet media two
3
3
90
@_arohan_
rohan anil
7 months
AdaGrad and Shampoo with aggregated second moments now work for deep learning. This is quite similar to the Grafting technique we introduced to disentangle the direction from the step size, where we also found identical results.
Tweet media one
0
13
87
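For reference, a minimal per-layer grafting sketch (illustrative names): the update direction comes from one method and its norm, i.e. the implicit step size, from another.

```python
import numpy as np

def graft(direction_step, magnitude_step, eps=1e-16):
    """Layer-wise grafting: keep the *direction* of one optimizer's step
    (e.g. Shampoo) but rescale it to the *norm* of another's (e.g.
    AdaGrad/SGD), disentangling direction from step size."""
    d = direction_step / (np.linalg.norm(direction_step) + eps)
    return np.linalg.norm(magnitude_step) * d
```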
@_arohan_
rohan anil
2 years
Who remembers?
Tweet media one
3
0
86
@_arohan_
rohan anil
8 months
2 NeurIPS accepts in… optimization, is nature healing?
4
1
85
@_arohan_
rohan anil
2 years
For improving replication in ML, why not ship the 0th-step weights (init values) and 1st-step weights (a single optimizer step) with every architecture release?
11
3
82
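A sketch of what shipping those two checkpoints could look like; `save_repro_checkpoints` and its arguments are hypothetical, and any framework's save/load would do.

```python
import torch

def save_repro_checkpoints(model, optimizer, batch, loss_fn, prefix):
    """Save step-0 (init) and step-1 weights so a reimplementation can
    be checked numerically against the released architecture."""
    torch.save(model.state_dict(), f"{prefix}.step0.pt")  # init values
    optimizer.zero_grad()
    loss_fn(model(batch["x"]), batch["y"]).backward()
    optimizer.step()                                      # one optimizer step
    torch.save(model.state_dict(), f"{prefix}.step1.pt")
```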
@_arohan_
rohan anil
3 years
Learnt that many of my colleagues have gotten IIT JEE All India Rank 1. 👀 In hindsight it is obvious, but imposter syndrome has kicked in.
5
0
79
@_arohan_
rohan anil
16 days
And here we go 🚀
@lmsysorg
lmsys.org
16 days
More exciting news today -- Gemini 1.5 Pro result is out! Gemini 1.5 Pro API-0409-preview now achieves #2 on the leaderboard, surpassing #3 GPT4-0125-preview to almost top-1! Gemini shows even stronger performance on longer prompts, in which it ranks joint #1 with the latest…
Tweet media one
Tweet media two
36
195
946
2
1
79
@_arohan_
rohan anil
3 months
Distributed Shampoo, ICML: 4 diverse workloads. AC: do another 7 ablations costing a million dollars 💵 for us to believe you have beaten Adam. ICLR: beats every MLPerf workload on wall-clock time. AC: I am a distributed systems expert and I don’t believe you (charitable…
@OriolVinyalsML
Oriol Vinyals
3 months
Fun game. Clocking 17973 citations: "Distilling the knowledge in a neural network" @geoffreyhinton , @OriolVinyalsML , @JeffDean Reviewer 38 (NeurIPS 2014): "This work is incremental and unlikely to have much impact even though it may be technically correct and well executed."
5
47
481
3
1
78
@_arohan_
rohan anil
1 year
Designing new ML techniques has a 0-1 ramp: nothing works until it does. Many small steps towards it won't work independently, but combined they work.
7
7
77
@_arohan_
rohan anil
5 months
Everyone out here is looking at metrics and demos and debating nuances of evals, which is fine. But they're missing the point: these models are in Bard and the Pixel 8 Pro right now, and coming to more surfaces.
7
6
78
@_arohan_
rohan anil
2 years
@ak92501 I am here to say the obvious: that’s not how a lion drinks a glass of water.
3
3
75
@_arohan_
rohan anil
4 years
Tensorboard has been preparing me for this moment. Refresh refresh refresh
0
7
73
@_arohan_
rohan anil
2 years
Amazing set of follow-ups on Imagen @GoogleAI . Looking forward to playing with this!
@_akhaliq
AK
2 years
UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image abs:
Tweet media one
7
148
714
0
6
72
@_arohan_
rohan anil
3 years
Everyone is sad that top AI conferences are virtual this year. Folks, online conferences actually allow a lot more people to attend who otherwise could not due to weaker passports. I have only attended one conference abroad, which was in Canada, and I don’t want to go through the pain again.
3
5
72
@_arohan_
rohan anil
2 years
My career so far was built on real-world deployed ML, so take this with a grain of salt. 🧂 H-index and citation counts are weakly correlated with usefulness or reality. 🤷‍♂️ So read the papers to judge the work, not the citation counts.
@thegautamkamath
Gautam Kamath
2 years
My understanding is that Google Scholar is the pet project of one small team. It's crazy how little design choices (e.g., default sort papers by # of cites, total # of citations prominently displayed) influence all of academia by making citations a default "measuring stick".
15
8
276
0
7
70
@_arohan_
rohan anil
2 years
Prompt: "A koala bear in a suit at a dining table reading a newspaper and drinking tea contemplating. Photo taken by a DSLR camera." #imagen #googleai #brain Inspiration from "I should buy a boat" cat meme. Quite spectacular results. 🐨☕️📰
Tweet media one
7
7
71
@_arohan_
rohan anil
2 years
I guess kids in the future won't know what a GUI is; they will think and a transformer will do it.
5
1
70
@_arohan_
rohan anil
2 years
🤯 improvement in generation!
- Frozen text encoder (T5-XXL)
- 3-stage generation cascade: text to 64x64 -> 256x256 -> 1024x1024
- CF guidance (says it's critical)
- CF causes prediction skew (beyond the interval); static and dynamic clipping improve it
- New arch: Efficient UNet
1/n
@GillVerd
Gill
2 years
New DALLE-like text-to-art image generator from @GoogleAI called #Imagen . Seems like AI for art progress keeps accelerating! 🤖🎨🚀
Tweet media one
4
44
210
3
4
70
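A sketch of the two pieces named in the thread, classifier-free guidance and dynamic clipping, applied here directly to a predicted image x0 nominally in [-1, 1]; this is my paraphrase of the paper, and the names and defaults are illustrative.

```python
import numpy as np

def guided_sample(x0_uncond, x0_cond, w=7.0, percentile=0.995):
    """Classifier-free guidance + dynamic thresholding on a predicted
    image x0. Large guidance weights w skew predictions outside the
    training interval; rescaling by a high percentile of |x0| instead
    of statically clipping to [-1, 1] fixes the skew."""
    x0 = x0_uncond + w * (x0_cond - x0_uncond)        # CF guidance
    s = max(1.0, float(np.quantile(np.abs(x0), percentile)))
    return np.clip(x0, -s, s) / s                     # dynamic thresholding
```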
@_arohan_
rohan anil
3 years
Reviewer asks to compare against a paper that appeared on arxiv after ICML deadline. One of us coauthored that arxiv submission. 😂
4
2
69
@_arohan_
rohan anil
16 days
New phrase learned today from staying up on twitter: “LLM doping”. Who wants to make a doping test and an agency that checks LLMs for eval doping? I would cut a check to start something along these lines.
12
5
68
@_arohan_
rohan anil
4 years
@ilyasut Does that mean the brain could also be using the Adam optimizer?
2
3
67
@_arohan_
rohan anil
5 months
I tried this in Malayalam too last night and it blew my mind! I am very excited by this; studying can be much more effective with a personalized tutor who could explain the “how” and “why” of each step along the way. I see this future as immensely positive!
@JeffDean
Jeff Dean (@🏡)
5 months
The multimodal and reasoning capabilities of Gemini are quite strong. The benchmark results, which I’ll discuss in a moment are nice, but I’m most excited by demonstrations of what it can do. Consider the image below. A teacher has drawn a physics problem of a skier going down…
Tweet media one
23
212
1K
1
4
67
@_arohan_
rohan anil
2 years
Working from @Google Bay View Campus today!
Tweet media one
Tweet media two
2
1
67
@_arohan_
rohan anil
3 years
@natashajaques @maxhkw forgot: "we figured out how batch norm works, yet again!"
1
3
67
@_arohan_
rohan anil
2 years
ML Engineering for Google Search ads pCTR model.
@_akhaliq
AK
2 years
On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models abs:
Tweet media one
2
36
220
1
7
66
@_arohan_
rohan anil
2 years
Amazing! I was predicting a good version would take at least a few months. (Meanwhile the conference is still in reviewer-matching mode.)
@_akhaliq
AK
2 years
An implementation of text-to-3D dreamfusion, powered by stable diffusion github:
24
434
2K
1
3
65
@_arohan_
rohan anil
7 months
From ICLR: Jorge provides a fast approximation for the inverse 4th roots in Shampoo! I recommend implementing the stable & fast coupled Newton inverse, but maybe for some problems computing an approximate inverse pth root more often could be useful
Tweet media one
1
8
65
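A minimal sketch of the coupled Newton iteration for the inverse pth root being recommended here, from my recollection of the Shampoo paper's variant; production code adds more careful damping and runs in high precision.

```python
import numpy as np

def coupled_newton_inv_pth_root(A, p=4, iters=100, eps=1e-6, tol=1e-7):
    """Coupled Newton iteration for A^(-1/p) of a symmetric PSD matrix,
    as used for Shampoo's inverse 4th roots (p=4)."""
    n = A.shape[0]
    I = np.eye(n)
    A = A + eps * I                         # damping for rank-deficient stats
    z = (1 + p) / (2 * np.linalg.norm(A, 2))
    X = (z ** (1.0 / p)) * I                # X converges to A^{-1/p}
    M = z * A                               # M converges to I
    for _ in range(iters):
        T = ((p + 1) * I - M) / p
        X = X @ T
        M = np.linalg.matrix_power(T, p) @ M
        if np.max(np.abs(M - I)) < tol:
            break
    return X
```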
@_arohan_
rohan anil
3 years
@Sci_j_my They were several papers who couldn't cite your papers because of TOTAL failure on bibtex, they didn't get to compile twice. BibTex is now switching parts of it code.. have you heard of that? my people tell me that. We had tremendous citation, with certified by reviewers.
0
0
64
@_arohan_
rohan anil
1 year
Ha! Our first TMLR paper is in! 🎉 Great experience as well!
4
3
64
@_arohan_
rohan anil
2 years
Successfully raised a human for 6 months. Parenting is hard!
4
0
62
@_arohan_
rohan anil
2 years
Really enjoyed reading Nichol & Dhariwal. This plot was the most interesting: optimizing the VLB (variational lower bound) was harder than a simple mean squared error + lambda * L_vlb. Then they wave a magic wand to fix this, the green curve. 1/2
Tweet media one
1
11
59
@_arohan_
rohan anil
1 year
🌴🌴 Very proud of this work; specifically not compromising on model quality while being extremely fast at inference, so that we can serve the whole wide world, i.e. bringing technology to everyone!
@DynamicWebPaige
👩‍💻 Paige Bailey
1 year
*cracks knuckles* and thus, we begin the "🌴PaLM v2" drinking game (but with coffee, tea, or your favorite caffeinated beverage of choice, as it's early! 😉) #GoogleIO2023 #GoogleIO
7
30
195
0
6
59
@_arohan_
rohan anil
5 months
Testing out @pika_labs Prompt: a quick brown fox jumped over a lazy dog.
9
2
59
@_arohan_
rohan anil
2 years
Going on parental leave next week. Really appreciate that work gives 18 weeks off. Going to see what I think is the real diffusion model train 🥸🤓
6
0
60
@_arohan_
rohan anil
6 months
Checking out Dishoom, living up to the hype.
Tweet media one
6
1
59
@_arohan_
rohan anil
4 months
Gemini Pro on Vertex AI 🏎️
@deliprao
Delip Rao e/σ
4 months
Has anyone done large-scale profiling of inference speeds for different LLMs of comparable accuracy from different providers? Gemini Pro seems incredibly faster, from my personal experience, than, say, GPT-3.5. Seeing some numbers w/ error bars on this would be nice.
Tweet media one
11
9
66
5
10
59
@_arohan_
rohan anil
2 years
Prompt: "Photorealistic koala bear wearing a tie dye tshirt. The koala bear is wearing a sunhat and aviator glasses. koala bear is inside a houseboat in Kerala. There is a lot of coconut trees in the background." #imagen #googleai #brain 🚀 I have the same t-shirt! 👕🐨🥥🌴
Tweet media one
Tweet media two
2
7
58
@_arohan_
rohan anil
3 years
The frustrating part of deep learning is that almost anything works, so for those wanting to know why something works, it’s an endless pit of misery and unanswered questions
4
5
59
@_arohan_
rohan anil
1 year
A bottleneck layer is a layer that has fewer neurons than the layers above/below. ⌛️ Then what’s a layer that has more neurons than the layers below and above called? A 10x layer? A booster layer? Need help.
31
1
58
@_arohan_
rohan anil
2 years
Incredible! Found linear mode connectivity on the test loss (right), not on train! My mind is blown -- this is huge!? Updates (@stanislavfort's colab): with DistributedShampoo (~0 train loss 🚀), train/test loss = 0.0002/0.314 vs (0.350/0.333); train/test accuracy = 0.999/0.982
Tweet media one
Tweet media two
@_arohan_
rohan anil
2 years
I reran Stan's colab (thank you for the colab!) with DistributedShampoo instead of Adam or SGD and got this. Everything looks connected. Hmm, is this a bug??? Want to know more? Well, you have to wait for a bit.
Tweet media one
10
0
29
5
4
58
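The underlying probe is simple to restate, sketched here with hypothetical `params_a`, `params_b`, and `eval_loss`: evaluate the loss along the straight line between two trained solutions and look for a barrier.

```python
import numpy as np

def loss_along_line(params_a, params_b, eval_loss, num_points=25):
    """Evaluate the loss on the straight line between two trained
    solutions; a flat curve (no barrier) means the modes are linearly
    connected. params_* are dicts of weight arrays; eval_loss is a
    user-supplied function from a params dict to a scalar loss."""
    alphas = np.linspace(0.0, 1.0, num_points)
    return [eval_loss({k: (1 - a) * params_a[k] + a * params_b[k]
                       for k in params_a})
            for a in alphas]
```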
@_arohan_
rohan anil
1 year
Boris Dayma’s guide to training large models, a must-read. He uses a second-order method (distributed shampoo) for all his training, making him one of a handful of humans on earth who know how to deploy it correctly. The results speak for themselves. Check it out!
@borisdayma
Boris Dayma 🖍️
1 year
📉 "A Recipe for Training Large Models" 👉 Report: I've been working for a while on this guide, sharing practical recommendations with my simple recipe for training models 🧑‍🍳
6
174
726
1
9
58
@_arohan_
rohan anil
2 years
All roads lead to autoencoders!
7
1
56
@_arohan_
rohan anil
5 months
Adding one more bit about the training of the Pro models from the paper. Google infra has been amazing, and joined with the learning algorithms it’s magic.
Tweet media one
3
2
57
@_arohan_
rohan anil
3 months
@quocleix and I used Gemini Advanced yesterday to brainstorm for an internal research week debate. It was quite an incredible experience and an effective companion in creative brainstorming; nothing else compares.
@clemenslm
Clemens Meyer
3 months
Try out Gemini Advanced, our best model yet - it's awesome!
0
2
14
4
2
55
@_arohan_
rohan anil
5 months
Unofficial evals are already in here: GSM8k: 52.1% to 57.09% (maybe higher if the eval is buggy)
Tweet media one
@MistralAI
Mistral AI
5 months
magnet:?xt=urn:btih:5546272da9065eddeb6fcd7ffddeef5b75be79a7&dn=mixtral-8x7b-32kseqlen&tr=udp%3A%2F%%3A6969%2Fannounce&tr=http%3A%2F%%3A80%2Fannounce RELEASE a6bbd9affe0c2725c1b7410d66833e24
520
2K
10K
0
5
58
@_arohan_
rohan anil
2 years
A small thread on related work on more-than-diagonal optimization, in the context of neural networks. Kronecker factorization, reducing the cost from (mn)^2 to m^2 + n^2, comes from this sparsely cited paper by Heskes, 2000. See the MLP section.
Tweet media one
@_clashluke
Lucas Nestler
2 years
PSA: Switch your optimizer to Shampoo! We recently tried Shampoo compared to a tuned ensemble of Adam and SM3 at @HomebrewNLP and found that the hyperparameter search space contains many more "winning tickets," which also achieve lower losses!
Tweet media one
Tweet media two
1
29
209
1
7
58
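The cost reduction in the tweet is easy to check numerically (illustrative shapes):

```python
# Why Kronecker factorization helps: a full-matrix preconditioner for an
# m x n weight treats the gradient as a length-mn vector, so it is
# (mn) x (mn). The Kronecker-factored form keeps one m x m and one
# n x n factor instead: m^2 + n^2 numbers rather than (m*n)^2.
m, n = 1024, 4096
print((m * n) ** 2)   # full matrix: ~1.8e13 entries
print(m**2 + n**2)    # Kronecker factors: ~1.8e7 entries
```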