Vitaliy Chiley

@vitaliychiley

2,503 Followers · 727 Following · 67 Media · 539 Statuses

Head of NLP Pretraining @DbrxMosaicAI | Former @CerebrasSystems What do we want? FLOPS! When do we want it? TOKENS!

Joined September 2013
Pinned Tweet
@vitaliychiley
Vitaliy Chiley
6 months
Introducing DBRX: A New Standard for Open LLMs 🔔 💻 DBRX is a 16x 12B MoE LLM trained on 📜 12T tokens 🧠 DBRX sets a new standard for open LLMs, outperforming established models on various benchmarks. Is this thread mostly written by DBRX? Yes! 🧵
Tweet media one
22
85
479
@vitaliychiley
Vitaliy Chiley
1 year
Our team at @MosaicML has been working on releasing something special: We're proud to announce that we are OPEN SOURCING a 7B LLM trained to 1T tokens The MPT model outperforms ALL other open source models! Code: Blog: 🧵
27
222
1K
@vitaliychiley
Vitaliy Chiley
13 days
It's actually WILD that OAI just dropped a plot where inference compute is log scale and the entire ML community is hyped If you were worried about global warming before... gg earth, it's been a real one :pour-one-out:
@polynoamial
Noam Brown
13 days
@OpenAI o1 is trained with RL to “think” before responding via a private chain of thought. The longer it thinks, the better it does on reasoning tasks. This opens up a new dimension for scaling. We’re no longer bottlenecked by pretraining. We can now scale inference compute too.
Tweet media one
42
195
2K
88
66
1K
@vitaliychiley
Vitaliy Chiley
6 months
GPT4 was trained on only about 10T tokens! 30 billion quadrillion == 3e25 Note: 3e25 BFloat16 FLOPs at 40% MFU on H100s is about 7.5e10 GPU-seconds, i.e. ~21M H100-hours. This is about 1,300h on 16k H100s (less than 2 months). Token math: previous leaks have verified that GPT4 is an 8x topk=2
@tsarnick
Tsarathustra
6 months
Jensen Huang: OpenAI's latest model has 1.8 trillion parameters and required 30 billion quadrillion FLOPS to train
54
169
1K
8
51
332
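A quick sanity check of the FLOP math above, assuming the commonly quoted ~989 TFLOP/s dense BF16 peak for an H100 SXM and the tweet's own 40% MFU; a back-of-envelope sketch, not a reconstruction of the actual run:

```python
# Back-of-envelope: how long does a 3e25-FLOP run take on 16k H100s?
total_flops = 3e25                 # "30 billion quadrillion" FLOPs
h100_bf16_peak = 989e12            # FLOP/s, dense (no 2:4 sparsity), spec-sheet value
mfu = 0.40                         # model FLOPs utilization assumed in the tweet
n_gpus = 16_000

gpu_seconds = total_flops / (h100_bf16_peak * mfu)    # ~7.6e10 GPU-seconds
gpu_hours = gpu_seconds / 3600                        # ~21M H100-hours
wallclock_hours = gpu_hours / n_gpus                  # ~1,300 h
print(f"{gpu_seconds:.1e} GPU-s, {gpu_hours/1e6:.0f}M H100-hours, "
      f"{wallclock_hours:.0f} h on {n_gpus} GPUs (~{wallclock_hours/24:.0f} days)")
```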
@vitaliychiley
Vitaliy Chiley
1 year
Params: 30B 🔥 Training Tokens: 1T 🤯 Open Source: ✅
7
42
301
@vitaliychiley
Vitaliy Chiley
1 year
👀👀👀
Tweet media one
31
25
278
@vitaliychiley
Vitaliy Chiley
1 year
H100s go brrrr!!!
Tweet media one
11
7
239
@vitaliychiley
Vitaliy Chiley
1 year
Last week we had the opportunity to benchmark the @nvidia H100 using the MosaicML examples repo () Without modification, using bf16, we saw a 2.2x speedup over A100s 🔥 Using FP8, we saw up to a 3.3x speedup 🚀
@vitaliychiley
Vitaliy Chiley
1 year
👀👀👀
Tweet media one
31
25
278
12
19
162
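For context on the 2.2x/3.3x numbers, the spec-sheet dense peaks (no 2:4 sparsity) are roughly 312 TFLOP/s for A100 BF16, 989 TFLOP/s for H100 BF16, and 1979 TFLOP/s for H100 FP8; a rough comparison of measured vs. theoretical speedups, as a sketch only:

```python
# Measured speedups vs. spec-sheet peak ratios (dense, no 2:4 sparsity).
# Real speedups land below the peak ratios due to memory bandwidth and
# kernel maturity on new hardware.
a100_bf16, h100_bf16, h100_fp8 = 312e12, 989e12, 1979e12   # FLOP/s
print(f"bf16 peak ratio: {h100_bf16 / a100_bf16:.1f}x  (measured ~2.2x)")
print(f"fp8 vs A100 bf16 peak ratio: {h100_fp8 / a100_bf16:.1f}x  (measured ~3.3x)")
```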
@vitaliychiley
Vitaliy Chiley
1 year
Updated @MosaicML LLM training throughput tables. Here are some highlights:
- Best HFU: 73.63%!!! 🚀 13B w/ act ckpt
- Best MFU: 62.09%!!! 🔥 3B w/out act ckpt
- Train with SeqLen 65k 🤯
Details here: [1/5]
Tweet media one
5
21
146
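For readers new to the acronyms: MFU (model FLOPs utilization) counts only the FLOPs the model mathematically needs (~6N per training token), while HFU (hardware FLOPs utilization) also credits the forward pass recomputed under activation checkpointing (~2N extra per token), so HFU ≥ MFU. A minimal sketch; the peak-FLOP value and throughput below are illustrative assumptions, not the table's actual configs:

```python
# MFU vs HFU, using the standard 6*N*D accounting (2N forward + 4N backward
# per token). Full activation checkpointing recomputes the forward (~2N more),
# which counts toward HFU but not MFU. All numbers below are illustrative.
def mfu_hfu(n_params, tokens_per_sec, n_gpus, act_ckpt=True, peak_flops=312e12):
    # peak_flops: per-GPU dense peak, e.g. ~312 TFLOP/s for A100 BF16
    model_flops = 6 * n_params * tokens_per_sec          # FLOPs the model needs per second
    hw_flops = (8 if act_ckpt else 6) * n_params * tokens_per_sec
    available = n_gpus * peak_flops
    return model_flops / available, hw_flops / available

mfu, hfu = mfu_hfu(n_params=13e9, tokens_per_sec=140_000, n_gpus=64)
print(f"MFU={mfu:.1%}  HFU={hfu:.1%}")   # HFU/MFU = 8/6 with full recompute
```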
@vitaliychiley
Vitaliy Chiley
2 years
Onboarding at @MosaicML A: "let's set you up to train some models" About 20 minutes later I'm running GPT3 1B. A: "I have to run to a mtg, play around with the configs & have fun" I played around with the config and shortly after I'm running GPT 13B with a seq len of 32k 🤯
Tweet media one
6
6
138
@vitaliychiley
Vitaliy Chiley
6 months
The "1000x compute in 8 years" graph almost looks like Nvidia's stonk price, BUT they maintain this growth by decreasing precision (and introducing sparsity). This trick can be played 2 more times until there is no more precision to decrease.
@DrJimFan
Jim Fan
6 months
Blackwell, the new beast in town. > DGX Grace-Blackwell GB200: exceeding 1 Exaflop compute in a single rack. > Put numbers in perspective: the first DGX that Jensen delivered to OpenAI was 0.17 Petaflops. > GPT-4-1.8T parameters can finish training in 90 days on 2000 Blackwells.
Tweet media one
Tweet media two
Tweet media three
161
535
3K
9
8
129
@vitaliychiley
Vitaliy Chiley
2 months
It's been a hot LLaMa summer, and 92 pages of pure knowledge just dropped. LLaMa3-405B has hit the OSS and we get all the juicy details! There has been a lot of analysis of the paper. This is my non-exhaustive LLaMa3 thread of things I found novel / interesting 🧵
2
7
122
@vitaliychiley
Vitaliy Chiley
2 months
They don't even know how cheap GPT4o-mini is to run They don't even know...
@Thom_Wolf
Thomas Wolf
2 months
Strong picture!
Tweet media one
40
339
2K
7
2
89
@vitaliychiley
Vitaliy Chiley
2 years
@SamRamani2 The U.S. and Britain were 2 of the signatories of the Budapest Memorandum guaranteeing Ukraine's 1994 territorial borders... "I will not send American servicemen to fight in Ukraine" - Biden What do guarantees even mean?
13
3
63
@vitaliychiley
Vitaliy Chiley
13 days
@sog_on_bird_app @AetheroSpace Into a vacuum where the compute cannot dissipate heat? Selling toasted GPUs when?
2
0
60
@vitaliychiley
Vitaliy Chiley
2 months
Gemma is cracked and most people are sleeping on it
@reach_vb
Vaibhav (VB) Srivastav
2 months
Google out accelerating Meta? 👀
Tweet media one
4
4
72
5
1
60
@vitaliychiley
Vitaliy Chiley
1 year
@MosaicML Did we fine tune it to 65k+ tokens? Yes 🧵
Tweet media one
1
5
56
@vitaliychiley
Vitaliy Chiley
1 year
Excited to unlock what we can do together
@alighodsi
Ali Ghodsi
1 year
Big news: we've agreed to acquire @MosaicML , a leading generative AI platform. I couldn’t be more excited to join forces once the deal closes.
36
212
1K
2
1
55
@vitaliychiley
Vitaliy Chiley
1 year
@xhluca @soumithchintala Kaggle-style mixture is more of an ensemble of models; most ppl say mixture when referring to MoE models (Switch Transformers-style, but it doesn't necessarily need to be sparse.)
3
2
48
@vitaliychiley
Vitaliy Chiley
2 months
Databricks swag has made it past customs. We are no longer ngmi. We're ready for what can be, unburdened by what has been
Tweet media one
6
0
46
@vitaliychiley
Vitaliy Chiley
1 year
@abhi_venigalla Christmas???
Tweet media one
2
1
46
@vitaliychiley
Vitaliy Chiley
6 months
DBRX deets:
- 16 experts
- 12B params per single expert
- top_k=4 routing
- 36B active parameters
- 132B total parameters
- trained for 12T tokens 📜
- 32k seq len training
🤗HF Space Demo:
2
2
41
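The four counts above pin down a rough shared-vs-routed split. Assuming equal-sized experts and that only the FFN experts are routed (attention and embeddings shared), a back-of-envelope solve, not the exact DBRX breakdown:

```python
# shared + 16 * expert = 132B (total);  shared + 4 * expert = 36B (active, top_k=4)
total, active, n_experts, top_k = 132e9, 36e9, 16, 4
expert = (total - active) / (n_experts - top_k)   # ~8B routed FFN params per expert
shared = active - top_k * expert                  # ~4B shared attention/embedding params
print(f"~{expert/1e9:.0f}B per expert (routed), ~{shared/1e9:.0f}B shared")
print(f"shared + one expert ~ {(shared + expert)/1e9:.0f}B")  # the '12B per single expert'
```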
@vitaliychiley
Vitaliy Chiley
2 months
just dropped, and I've already seen it in 3 chats. If a dist ckpt paper gets this much hype, it's a strong signal that there are issues with the current paradigm. Please fix 🙏 & upstream to PyTorch. This is like the 3rd ckpt paper I've seen in 5 months 👀
2
2
40
@vitaliychiley
Vitaliy Chiley
1 year
@MosaicML How do we compare to other models? See for yourself:
Tweet media one
5
3
39
@vitaliychiley
Vitaliy Chiley
1 year
@MosaicML What training data did we use: 🧵
Tweet media one
1
2
37
@vitaliychiley
Vitaliy Chiley
6 months
@abacaj If you knew what LLaMa was trained on... this is the barrier to entry going down
1
1
36
@vitaliychiley
Vitaliy Chiley
2 months
Had a great time at #ICML2024 Met a lot of great people and learned a ton! Vienna is a beautiful city and I'm glad I got to visit. Random honorable mentions follow 🧵
1
1
35
@vitaliychiley
Vitaliy Chiley
13 days
@ToonamiAfter @GroqInc @Etched @Extropic_AI (a) agreed, see: (b) but linear growth on log scale means we need exponential leaps in FLOPs. Not the realistically 2x to 10x leaps people will get out of non-GPU specialized hardware
@vitaliychiley
Vitaliy Chiley
13 days
Inference hardware providers looking at that graph like 🤑🤑🤑
2
2
34
3
1
33
@vitaliychiley
Vitaliy Chiley
13 days
Inference hardware providers looking at that graph like 🤑🤑🤑
@vitaliychiley
Vitaliy Chiley
13 days
It's actually WILD that OAI just dropped a plot where inference compute is log scale and the entire ML community is hyped If you were worried about global warming before... gg earth, it's been a real one :pour-one-out:
88
66
1K
2
2
34
@vitaliychiley
Vitaliy Chiley
6 months
The real DB-ReX is made of MegaBlocks
Tweet media one
@mansiege
Mansheej Paul
6 months
This model is a beast! Take it for a spin:
Tweet media one
1
2
28
0
1
32
@vitaliychiley
Vitaliy Chiley
2 years
@karpathy A change that affects the first and potentially last layer (2 layers) results in ~25% speedup for the whole network??? or for just those 2 layers?
3
0
29
@vitaliychiley
Vitaliy Chiley
13 days
1
0
31
@vitaliychiley
Vitaliy Chiley
2 months
When FA3 came out, I made some comment like: @tri_dao could be single-handedly credited with the explosive rise of Nvidia. From Nvidia's side, they have one hell of an OSS strategy!
@dylan522p
Dylan Patel
2 months
Made the joke yesterday that @tri_dao saved 10 billion dollars and prevented oceans from boiling cause Flash Attention improved MFU by 10% But honestly it's legit Hiring cracked perf engineers is expected value plus Doesn't rely on high variability research, it's just a plus
8
5
197
1
0
29
@vitaliychiley
Vitaliy Chiley
6 months
In case you want to see what we cook up 🧪🧑‍🔬 Databricks / Mosaic AI 🧠 new research page:
4
0
29
@vitaliychiley
Vitaliy Chiley
13 days
CA passes AI safety bill 1047. OAI shifts compute to inference. (AI safety bill 1047 imposes regulatory scrutiny on models using > 10^26 training FLOPs)
@DrJimFan
Jim Fan
13 days
OpenAI Strawberry (o1) is out! We are finally seeing the paradigm of inference-time scaling popularized and deployed in production. As Sutton said in the Bitter Lesson, there're only 2 techniques that scale indefinitely with compute: learning & search. It's time to shift focus to
Tweet media one
134
1K
6K
1
0
28
@vitaliychiley
Vitaliy Chiley
1 year
@MosaicML HF Space for MPT-7B-Chat:
3
5
24
@vitaliychiley
Vitaliy Chiley
6 months
The MoE architecture produces a model that has 132B total params of capacity, but uses only 36B params to process each token. 💡DBRX outperforms models such as LLaMA2-70B and Grok while being more efficient. 🧵
Tweet media one
1
1
22
@vitaliychiley
Vitaliy Chiley
1 year
@itsmnjn @MosaicML How to format data for fine tuning here:
1
0
24
@vitaliychiley
Vitaliy Chiley
6 months
It surpasses GPT-3.5 and competes with Gemini 1.0 Pro & Mistral Medium in quality, while being substantially faster 🏎️ 🧵
Tweet media one
2
0
21
@vitaliychiley
Vitaliy Chiley
1 year
@MosaicML HF Space for MPT-7B-Instruct:
1
5
22
@vitaliychiley
Vitaliy Chiley
13 days
Tweet media one
4
0
22
@vitaliychiley
Vitaliy Chiley
6 months
🚀 The fine-grained MoE architecture makes DBRX efficient:
- almost 2x faster inference than LLaMA2-70B
- about 40% smaller than Grok in total & active parameter counts 🧵
Tweet media one
1
0
20
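To first order, decoder inference cost per generated token scales with active parameters (~2·N_active FLOPs), which is where the "almost 2x vs LLaMA2-70B" comes from; this sketch ignores attention/KV-cache and memory-bandwidth effects:

```python
# First-order FLOPs per generated token ~ 2 * active params.
dbrx_active, llama2_70b = 36e9, 70e9
print(f"~{llama2_70b / dbrx_active:.1f}x fewer FLOPs per token for DBRX")  # ~1.9x
```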
@vitaliychiley
Vitaliy Chiley
6 months
🔗 Links:
- Trained using MosaicML's LLM-Foundry:
- Technical Blog:
- WIRED:
- DBRX Base:
- DBRX Instruct:
- 🤗HF Space Demo:
1
0
20
@vitaliychiley
Vitaliy Chiley
6 months
Literally eating tendies rn
@PelosiTracker_
Nancy Pelosi Stock Tracker ♟
6 months
BREAKING 🚨: Nancy Pelosi just bought $5M of the AI company Databricks Unfortunately, Databricks is a privately held company and not available to be bought by the public Sorry people, you don’t have access to this one.
Tweet media one
288
2K
14K
0
0
21
@vitaliychiley
Vitaliy Chiley
3 months
🎶🎶 Do you want to build an MoE? 🎶🎶 It was a great collaboration with the team at PyTorch to integrate the tooling needed to make MoE training easier and more efficient.
@PyTorch
PyTorch
3 months
Training MoEs at Scale with PyTorch 🔥 In our latest post, we show how we scale to over three thousand GPUs using PyTorch Distributed and MegaBlocks, an efficient open-source MoE implementation in PyTorch. Check it out:
0
70
354
0
5
20
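For readers wondering what the routed layer actually computes, here is a toy top-k MoE FFN in plain PyTorch. It is a readability sketch only; the MegaBlocks/PyTorch work referenced above uses dropless block-sparse kernels rather than a Python loop over experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """Toy top-k routed FFN: a router picks k experts per token and their
    outputs are combined with the (renormalized) router weights."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                  # x: [tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)          # [tokens, n_experts]
        weights, idx = probs.topk(self.top_k, dim=-1)      # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):          # loop is readable, not fast
            routed = (idx == e)                            # [tokens, top_k] bool
            tok = routed.any(dim=-1)                       # tokens sent to expert e
            if tok.any():
                w = (weights * routed).sum(dim=-1, keepdim=True)[tok]
                out[tok] += w * expert(x[tok])
        return out

print(ToyMoE()(torch.randn(16, 64)).shape)                 # torch.Size([16, 64])
```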
@vitaliychiley
Vitaliy Chiley
1 year
the feels when the H100s show up
Tweet media one
1
2
19
@vitaliychiley
Vitaliy Chiley
6 months
OSS or bust!
@elonmusk
Elon Musk
6 months
Should be available on 𝕏 next week. Grok 2 should exceed current AI on all metrics. In training now.
5K
4K
32K
0
3
19
@vitaliychiley
Vitaliy Chiley
1 year
Using LLMFoundry, you can use the systems interchangeably 🧵
Tweet media one
3
0
19
@vitaliychiley
Vitaliy Chiley
6 months
if you know, you know
0
0
19
@vitaliychiley
Vitaliy Chiley
1 year
@_akhaliq @ilyasut gives one talk…
3
0
18
@vitaliychiley
Vitaliy Chiley
7 months
ML Researchers: We want more FLOPS Solution:
Tweet media one
@CerebrasSystems
Cerebras
7 months
📣ANNOUNCING THE FASTEST AI CHIP ON EARTH📣 Cerebras proudly announces CS-3: the fastest AI accelerator in the world. The CS-3 can train up to 24 trillion parameter models on a single device. The world has never seen AI at this scale. CS-3 specs: ⚙ 46,225 mm2 silicon | 4
Tweet media one
54
164
961
0
2
18
@vitaliychiley
Vitaliy Chiley
1 year
@abacaj Using 4-bit quant means only about 15GB of GPU mem is used for 30B params 🤯 Given we trained the model with ALiBi, you can probably just increase the max_seq_len of the model past 8k and it'll just work (up to the point where you OOM). You might get to about seqlen=12k
2
1
18
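The 15GB figure falls straight out of bits-per-parameter; a sketch of weight memory only, ignoring quantization scales, activations, and the KV cache:

```python
# Weight memory ~ params * bits / 8 bytes.
params = 30e9
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {params * bits / 8 / 1e9:.0f} GB of weights")
# 16-bit: 60 GB,  8-bit: 30 GB,  4-bit: 15 GB
```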
@vitaliychiley
Vitaliy Chiley
6 months
💡The 32k sequence length training makes a SOTA RAG Model. DBRX outperforms the best open source models and even outperforms GPT3.5 Turbo. 🧵
Tweet media one
1
0
17
@vitaliychiley
Vitaliy Chiley
13 days
@ChrSzegedy That tweet was for the lolz While "there's a grain of truth in every joke", I mostly wanted to say "gg earth" and "pour one out" 😅 I really like the "data generation is part of the training process" argument you make. It almost implies that no one is ever actually deploying
2
0
18
@vitaliychiley
Vitaliy Chiley
6 months
📈👀
@awnihannun
Awni Hannun
6 months
4-bit quantized DBRX runs nicely in MLX on an M2 Ultra. PR:
29
112
727
0
2
18
@vitaliychiley
Vitaliy Chiley
2 months
Integrated context parallel all2all when?
@cHHillee
Horace He
2 months
For too long, users have lived under the software lottery tyranny of fused attention implementations. No longer. Introducing FlexAttention, a new PyTorch API allowing for many attention variants to enjoy fused kernels in a few lines of PyTorch. 1/10
Tweet media one
20
258
1K
1
0
18
@vitaliychiley
Vitaliy Chiley
1 year
@ESYudkowsky you can now hear AGI being born👂 @coryrstephenson converted our LLM training loss into an audio clip
2
3
18
@vitaliychiley
Vitaliy Chiley
5 months
This whole time DBRX turns out to be really Enterprise Intelligent!
@sam_havens
Sam Havens
5 months
@SnowflakeDB Awesome work training such a big model with a permissive license! I think you had a mistake in your IFEval implementation, your reported number is less than 2x what we observe (though it does vary with inference server and sampling parameters). You should see in the high 60s
Tweet media one
Tweet media two
2
8
34
1
0
15
@vitaliychiley
Vitaliy Chiley
6 months
@andrew_n_carr Deep (narrow) models have more representational power (if you can get training to be stable). Wide (shallow) models get better HW utilization (whatever your sweep finds is the optimal ratio, GPUs want it to be WIDER) Choose your own adventure (engineering/ml tradeoff)
1
0
17
@vitaliychiley
Vitaliy Chiley
6 months
@abacaj Yeah it'd be nice if Mistral was at all open about their training setup...
0
2
14
@vitaliychiley
Vitaliy Chiley
6 months
@winwin7264 :shifty-eyes:
0
0
14
@vitaliychiley
Vitaliy Chiley
1 year
@jeremyopendata @Replit Amazing to see how the @MosaicML platform enables customers to do, in a week, what only months ago was possible at a handful of companies. Congratulations @Replit on the successful run
0
1
14
@vitaliychiley
Vitaliy Chiley
3 months
e22 models ngmi e24 or bust!
@_xjdr
xjdr
3 months
any time i see an llm emergence study on a small (less than 70B) or undertrained (GPT-3.5) model, i just want to respond with this gif. this is one reason i'm so interested in comparing the L3-70B and L3-405B models gif from here
8
13
164
0
1
15
@vitaliychiley
Vitaliy Chiley
1 year
@andrew_n_carr @abacaj 8-expert mixture x 220B = 2/3*220B*8 + 1/3*220B ≈ 1.247T params (attn isn't in the MoE, usually...)
2
1
14
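Spelling out the arithmetic: the assumption is that roughly 2/3 of a dense transformer's parameters sit in the FFN (replicated per expert) and 1/3 in attention/embeddings (shared), so the total for a rumored 8x220B MoE is:

```python
# FFN (~2/3 of a dense model's params) replicated per expert;
# attention/embeddings (~1/3) shared across experts. The 2/3 vs 1/3 split
# is the tweet's rough assumption, not a measured breakdown.
dense, n_experts = 220e9, 8
total = (2 / 3) * dense * n_experts + (1 / 3) * dense
print(f"~{total / 1e12:.3f}T total params")   # ~1.247T
```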
@vitaliychiley
Vitaliy Chiley
2 years
After 4+ years of working at @CerebrasSystems it's a bittersweet moment to finish my last day of work. I'm moving on to my next adventure, but am excited to see what the future holds for the WaferScaleEngine. Signing off, one last time with #IamCerebras
0
0
14
@vitaliychiley
Vitaliy Chiley
2 months
@_xjdr perf numbers or it didn't happen
1
0
15
@vitaliychiley
Vitaliy Chiley
1 year
SeqLen who??? 😝 It’s been an awesome past few months getting this running and trained! The team at @MosaicML has been amazing and the tools we’re building enable more than I thought possible! If your application requires extremely long seq len, you can find it at @MosaicML
@NaveenGRao
Naveen Rao
1 year
🤯🤯 LLM trained with 64K+ context length! What could you do with that? Prompted our model with the ENTIRE contents of "The Great Gatsby" and asked it to write the epilogue. Snippet 👇 Model dropping soon to an open-source repo near you. Epilogue: It seemed to me that Gatsby
41
89
675
0
0
14
@vitaliychiley
Vitaliy Chiley
1 year
@squarecog yes the naming is off, but it works just fine
Tweet media one
1
0
15
@vitaliychiley
Vitaliy Chiley
1 year
@Peter_0_0_g @MosaicML We use ALiBi so we can arbitrarily increase the context len. We train the model with seq len 2k, then fine-tune with 65k
1
1
13
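ALiBi is what makes the 2k-to-65k jump workable: each head gets a fixed linear distance penalty on its attention logits instead of position embeddings, so nothing is tied to the training length. A minimal sketch of the bias; the geometric slopes follow the ALiBi paper's recipe for power-of-two head counts:

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear distance penalty added to attention logits pre-softmax."""
    # Geometric head slopes 2^(-8/n), 2^(-16/n), ... (power-of-two head counts).
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).clamp(max=0)    # -(i - j) for keys j <= query i
    return slopes[:, None, None] * dist[None, :, :]      # [heads, seq, seq]

bias = alibi_bias(n_heads=8, seq_len=4)
print(bias[0])   # add to the [heads, q, k] scores; the causal mask still applies
```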
@vitaliychiley
Vitaliy Chiley
2 months
Slightly surprising to see that Meta still struggles with MoE training stability at scale 🤔🧐❓ DBRX training had no loss stability issues. If Zuc wants help, he could just ask 🤷‍♂️ 🧵
Tweet media one
2
0
13
@vitaliychiley
Vitaliy Chiley
1 year
@abhi_venigalla Or did you mean this Christmas?
Tweet media one
1
0
13
@vitaliychiley
Vitaliy Chiley
1 month
Love it when the team cooks
@dan_biderman
Dan Biderman
1 month
*LoRA Learns Less and Forgets Less* is now out in its definitive edition in TMLR🚀 Check out the latest numbers fresh from the @DbrxMosaicAI oven 👨‍🍳
5
21
83
0
2
12
@vitaliychiley
Vitaliy Chiley
2 months
@dylan522p @ishanit5 "train on test" - @dylan522p you gotta let the ppl know!
2
1
12
@vitaliychiley
Vitaliy Chiley
1 year
According to the Chinchilla paper, a 30B LLM trained on 600B tokens will be as good as GPT3. So why not train on 1T tokens and beat it on 6/9 tasks 🤷‍♂️
Tweet media one
1
1
11
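For reference, the Chinchilla rule of thumb is roughly 20 training tokens per parameter at compute-optimal, with training compute ~ 6·N·D; a quick sketch of where 30B/600B sits and what pushing to 1T tokens costs:

```python
# Chinchilla-style accounting: compute ~ 6*N*D, compute-optimal at D ~ 20*N.
N = 30e9
for D in (600e9, 1e12):
    print(f"{D / 1e9:.0f}B tokens: {D / N:.0f} tok/param, ~{6 * N * D:.1e} FLOPs")
# 600B tokens:  20 tok/param, ~1.1e23 FLOPs
# 1000B tokens: 33 tok/param, ~1.8e23 FLOPs
```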
@vitaliychiley
Vitaliy Chiley
2 months
Data is the great equalizer
@snats_xyz
snats
2 months
They trained on only 6GB of text, distilled from the Pile and got a BERT model up to the quality of T5 with 745x less data 💀 It's so over
7
51
659
1
1
11
@vitaliychiley
Vitaliy Chiley
1 year
ALiBi is king 👑
@jefrankle
Jonathan Frankle
1 year
As @vitaliychiley likes to say, ALiBi is 👑 We're very big fans of ALiBi and @OfirPress at @MosaicML .
1
2
24
1
0
10
@vitaliychiley
Vitaliy Chiley
6 months
At >16-GPU scale, Google's 2D weight-stationary setups begin to make sense!!! (where 1D weight-stationary (aka TP) is the alternative) Figure from:
Tweet media one
@ml_hardware
Abhi Venigalla
6 months
@francoisfleuret The 30x is real and comes from this technical brief, page 15: How is 30x possible given GB200 has only ~2.3x increase in memBW and FLOP/s over H100? It involves comparing per-chip generation throughput = output_tokens/s/chip. The two systems compared are
8
12
156
0
1
11
@vitaliychiley
Vitaliy Chiley
1 year
@SahajGarg6 @abhi_venigalla @julien_c LLM-Foundry + Composer allows us to compute MFU on the fly (based on training throughput). We also have a table of configs + perf here (although it needs to be updated with H100 numbers)
1
1
11
@vitaliychiley
Vitaliy Chiley
1 year
Pics or it didn't happen
Tweet media one
@DbrxMosaicAI
Databricks Mosaic Research
1 year
🎉 🎉🎉 We have a new price on training Stable Diffusion 2 from scratch: $50k trained on the MosaicML Platform. We replicated Stable Diffusion 2.0 with massive training speedups, and now you can too. Learn more in our latest blog post:
4
13
68
1
1
11
@vitaliychiley
Vitaliy Chiley
7 months
My man!
@AMD
AMD
7 months
Advancing AI: @Databricks NLP Architect, Abhinav Venigalla, discusses the hardware and software advantages from AMD.
4
21
172
0
0
10
@vitaliychiley
Vitaliy Chiley
1 year
("king kill" vec - "man" vec) + "woman" vec = "queen slay" vec
0
1
11
@vitaliychiley
Vitaliy Chiley
2 months
Using the same amount of compute, Meta could probably have trained a 250B param model on 25T tok (3.75e25 FLOPs; 100x TPR) and gotten about the same perf, and they would have produced a model that can be served at 1.6x the speed. 🧵
2
0
10
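The 3.75e25 and 1.6x figures follow from the same 6·N·D accounting; the ~15T-token number for LLaMA3-405B is the publicly reported ballpark, and "speed" here is the first-order params-per-token ratio, ignoring memory-bandwidth and batching effects:

```python
# Training compute ~ 6*N*D; first-order serving speed ~ 1 / active params.
configs = {"405B x 15T (reported ballpark)": (405e9, 15e12),
           "250B x 25T (tweet's hypothetical)": (250e9, 25e12)}
for name, (N, D) in configs.items():
    print(f"{name}: {6 * N * D:.2e} FLOPs, {D / N:.0f} tok/param")
print(f"serving speedup ~ {405e9 / 250e9:.1f}x")   # ~1.6x
```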
@vitaliychiley
Vitaliy Chiley
1 year
Amazing to see @MosaicML tools being used to create awesome sauce 👏👏👏
@pirroh
Michele Catasta
1 year
🧑‍💻 replit-code-v1-3b is out! Head to our HuggingFace 🤗 org page: to use the open-source release of our ReplitLM specialized on code completion. This will be the first of many LLMs 🚀 1/ 🧵
23
183
918
0
0
10
@vitaliychiley
Vitaliy Chiley
2 months
Best workshop: ES-FoMo. Am I biased cuz I spoke there? Yes. Were the other talks so good I felt out of place? Also yes. @slippylolo really knows how to organize a workshop. Excited about the future of Scaling! 🚀📈 Runner up: DMLR: Datasets for Foundation Models 🧵
1
0
10
@vitaliychiley
Vitaliy Chiley
1 year
And we got the numbers to prove it 🚀 🧵
Tweet media one
2
0
10
@vitaliychiley
Vitaliy Chiley
1 year
And as long as you use the correct image, it just works 🧵 end
Tweet media one
1
0
10
@vitaliychiley
Vitaliy Chiley
2 years
A few weeks ago I had the opportunity to talk with @ecsquendor and @DoctorDuggar on @MLStreetTalk . We talked about ML hardware, Cerebras, and how sparsity can interact with it all. I definitely recommend people check out their podcast. #iamcerebras
0
2
8
@vitaliychiley
Vitaliy Chiley
6 months
Something something, GPU go BRRR
@mvpatel2000
Mihir Patel
6 months
🚨New🌟blog✍️ on ⏩ maximizing🌙 FLOPS 🚀 Training large models requires maximizing flops/GPU, especially at scale. Excited to share a few of the cool tricks in thread👀. 1/N
Tweet media one
6
36
191
0
0
10
@vitaliychiley
Vitaliy Chiley
2 months
FLOPS to downstream tasks is sigmoidal (i.e. saturating faster than we'd like). We all knew it would happen at some scale, but it is sad to see it actually happening at scale... 😢😭 Not hyped, but it does show that scale will prevail! 🧵
Tweet media one
2
0
9
@vitaliychiley
Vitaliy Chiley
2 years
@francoisfleuret Part of @MosaicML 's mission is to show that you don't need a MASSIVE model to rule them all. Task specific models enable you to get SOTA perf using much smaller models when training on task specific data eg
@DbrxMosaicAI
Databricks Mosaic Research
2 years
Meet PubMed GPT 🩺 a new SOTA on the US Medical Licensing Exam developed by MosaicML and @StanfordHAI . It's a normal GPT-3B model trained on medical data that bests hand-designed med models and generic models 40x bigger, a sweet spot for foundation models🧵
12
132
517
0
0
9
@vitaliychiley
Vitaliy Chiley
6 months
@code_star Have you seen my profile byline?
Tweet media one
1
0
9
@vitaliychiley
Vitaliy Chiley
2 months
ook this thread is getting toooo long (and I'm only on page 12 😬) This paper is a treasure trove I'll leave it here 🫡, but will finish the paper 10/10 would recommend
3
0
9