Introducing DBRX: A New Standard for Open LLMs 🔔
💻 DBRX is a 16x 12B MoE LLM trained on 📜 12T tokens
🧠DBRX sets a new standard for open LLMs, outperforming established models on various benchmarks.
Is this thread mostly written by DBRX? Yes!
🧵
Our team at
@MosaicML
has been working on releasing something special:
We're proud to announce that we are OPEN SOURCING a 7B LLM trained to 1T tokens
The MPT model outperforms ALL other open source models!
Code:
Blog:
🧵
It's actually WILD that OAI just dropped a plot where inference compute is log scale and the entire ML community is hyped
If you were worried about global warming before...
gg earth, it's been a real one
:pour-one-out:
@OpenAI
o1 is trained with RL to “think” before responding via a private chain of thought. The longer it thinks, the better it does on reasoning tasks. This opens up a new dimension for scaling. We’re no longer bottlenecked by pretraining. We can now scale inference compute too.
GPT4 was trained on only about 10T tokens!
30 billion quadrillion == 3e25
Note: 3e25 BFloat16 FLOPs at 40% MFU on H100s is about 7.5e10 GPU-seconds, i.e. ~21M H100-hours. This is about 1,300h on 16k H100s (less than 2 months)
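Quick sanity check on that math (a minimal sketch, assuming the ~989 TFLOP/s dense BF16 spec number per H100):

```python
# Back-of-envelope: how long do ~3e25 BF16 FLOPs take on H100s?
total_flops = 3e25          # rumored GPT-4 training compute
h100_bf16_peak = 989e12     # dense BF16 FLOP/s per H100 (spec-sheet number)
mfu = 0.40                  # assumed model FLOPs utilization

gpu_seconds = total_flops / (h100_bf16_peak * mfu)  # ~7.6e10 s
gpu_hours = gpu_seconds / 3600                      # ~21M H100-hours
wall_clock_hours = gpu_hours / 16_000               # ~1.3k hours on 16k H100s
print(f"{gpu_hours:.2e} H100-hours, ~{wall_clock_hours / 24:.0f} days on 16k GPUs")
```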
Token math:
Previous leaks have indicated that GPT-4 is an 8x, top_k=2 MoE
Last week we had the opportunity to benchmark the
@nvidia
H100 using the MosaicML examples repo ()
Without modification, using bf16, we saw a 2.2x speedup over A100s 🔥
Using FP8, we saw up to a 3.3x speedup 🚀
Onboarding at
@MosaicML
A: "lets set you up to train some models"
about 20 minutes later I'm running GPT3 1B
A: "I have to run to a mtg, play around with the configs & have fun"
I played around with the config and shortly after I'm running GPT 13B with a seq len of 32k 🤯
1000x compute in 8 years graph almost looks like Nvidia's stonk price
BUT they maintain this growth by decreasing precision (and introducing sparsity). This trick can be played 2 more times until there is no more precision to decrease.
Blackwell, the new beast in town.
> DGX Grace-Blackwell GB200: exceeding 1 Exaflop compute in a single rack.
> Put numbers in perspective: the first DGX that Jensen delivered to OpenAI was 0.17 Petaflops.
> GPT-4-1.8T parameters can finish training in 90 days on 2000 Blackwells.
It's been a hot LLaMa summer and 92 pages of pure knowledge dropped
LLaMa3-405B has hit the OSS and we get all the juicy details!
There has been a lot of analysis of the paper.
This is my non-extensive LLaMa3 thread of things I found novel / interesting
🧵
@SamRamani2
The U.S. and Britain were 2 of the signatories of the Budapest Memorandum guaranteeing Ukraine's 1994 territorial borders...
"I will not send American servicemen to fight in Ukraine" - Biden
What do guarantees even mean?
@xhluca
@soumithchintala
Kaggle-style mixture is more of an ensemble of models; most ppl say mixture when referring to MoE models (Switch Transformers-style, but it doesn't necessarily need to be sparse.)
DBRX deets:
- 16 Experts
- 12B params per single expert
- top_k=4 routing
- 36B active parameters
- 132B total parameters
- trained for 12T tokens 📜
- 32k seq len training
🤗HF Space Demo:
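A back-of-envelope on how those counts fit together (a sketch, assuming the non-expert "shared" parameters are identical for every token; the real DBRX breakdown may differ slightly):

```python
# Back-of-envelope: infer the per-expert vs shared split from total/active counts.
total_params = 132e9    # all 16 experts + shared params (attention, embeddings, ...)
active_params = 36e9    # per token: shared params + the top_k routed experts
num_experts, top_k = 16, 4

# total  = shared + num_experts * expert_ffn
# active = shared + top_k       * expert_ffn
expert_ffn = (total_params - active_params) / (num_experts - top_k)  # ~8B
shared = total_params - num_experts * expert_ffn                     # ~4B
print(f"~{expert_ffn / 1e9:.0f}B per expert FFN, ~{shared / 1e9:.0f}B shared")
# shared + one expert ~= 12B, which lines up with the "16x 12B" framing
```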
just dropped, and I've already seen it in 3 chats
If a dist ckpt paper gets this much hype, it's a strong signal that there are issues with the current paradigm
Please fix 🙏
& upstream to PyTorch
This is like the 3rd ckpt paper I've seen in 5 months 👀
Had a great time at
#ICML2024
Met a lot of great people and learned a ton!
Vienna is a beautiful city and I'm glad I got to visit.
Random honorable mentions follow
🧵
@ToonamiAfter
@GroqInc
@Etched
@Extropic_AI
(a) agreed, see:
(b) but linear growth on log scale means we need exponential leaps in FLOPs. Not the realistically 2x to 10x leaps people will get out of non-GPU specialized hardware
It's actually WILD that OAI just dropped a plot where inference compute is log scale and the entire ML community is hyped
If you were worried about global warming before...
gg earth, it's been a real one
:pour-one-out:
@karpathy
A change that affects the first and potentially last layer (2 layers) results in ~25% speedup for the whole network??? or for just those 2 layers?
When FA3 came out, I made some comment like:
@tri_dao
could be single handedly credited with the explosive rise of Nvidia.
From Nvidia's side, they have one hell of an OSS strategy!
Made the joke yesterday that
@tri_dao
saved 10 billion dollars and prevented oceans from boiling cause Flash Attention improved MFU by 10%
But honestly it's legit
Hiring cracked perf engineers is expected-value positive
It doesn't rely on high-variability research; it's just a plus
CA passes AI safety bill 1047
OAI shifts compute to inference
(AI safety bill 1047 imposes regulatory scrutiny for models using > 10^26 training FLOPs)
OpenAI Strawberry (o1) is out! We are finally seeing the paradigm of inference-time scaling popularized and deployed in production. As Sutton said in the Bitter Lesson, there're only 2 techniques that scale indefinitely with compute: learning & search. It's time to shift focus to
The MoE architecture produces a model that has 132B total params of capacity, but uses only 36B params to process each token.
💡DBRX outperforms models such as LLaMA2-70B and Grok while being more efficient.
🧵
🚀 The fine-grained MoE architecture makes DBRX efficient
- almost 2x faster inference than LLaMA2-70B
- about 40% smaller than Grok in total & active parameter counts
🧵
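A very rough ratio check behind the "almost 2x" number (a simplification that assumes per-token inference cost scales with active parameters, ignoring memory bandwidth and serving details):

```python
# Very rough check: treat per-token inference cost as proportional to active params.
llama2_70b_active = 70e9   # dense model: every parameter is used for every token
dbrx_active = 36e9         # MoE: only the routed experts + shared params are used
print(f"~{llama2_70b_active / dbrx_active:.1f}x fewer active params than LLaMA2-70B")
# -> ~1.9x, roughly consistent with the "almost 2x faster inference" claim
```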
BREAKING 🚨:
Nancy Pelosi just bought $5M of the AI company Databricks
Unfortunately, Databricks is a privately held company and not available to be bought by the public
Sorry people, you don’t have access to this one.
🎶🎶 Do you want to build an MoE? 🎶🎶
It was a great collaboration with the team at PyTorch to integrate the tooling needed to make MoE training easier and more efficient.
Training MoEs at Scale with PyTorch 🔥
In our latest post, we show how we scale to over three thousand GPUs using PyTorch Distributed and MegaBlocks, an efficient open-source MoE implementation in PyTorch.
Check it out:
📣ANNOUNCING THE FASTEST AI CHIP ON EARTH📣
Cerebras proudly announces CS-3: the fastest AI accelerator in the world.
The CS-3 can train up to 24 trillion parameter models on a single device. The world has never seen AI at this scale.
CS-3 specs:
⚙ 46,225 mm2 silicon | 4
@abacaj
using 4 bit quant means only about 15GB of GPU mem is used for 30B params 🤯
Given we trained the model with ALiBi, you can probably just increase the max_seq_len of the model past 8k and it'll just work (up to the point where you OOM)
You might get to about seqlen=12k
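For the HF checkpoints, the override looks roughly like this (a sketch following the MPT model-card pattern; `mosaicml/mpt-30b` is an assumed example checkpoint, and the usable length depends on your GPU memory):

```python
import transformers

# Override max_seq_len at load time; ALiBi lets the model extrapolate past its
# training length, up to whatever fits in memory (~12k in practice, per above).
name = "mosaicml/mpt-30b"  # assumed checkpoint; the same pattern applies to other MPT models
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 12288
model = transformers.AutoModelForCausalLM.from_pretrained(
    name, config=config, trust_remote_code=True
)
```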
@ChrSzegedy
That tweet was for the lolz
While "there's a grain of truth in every joke", I mostly wanted to say "gg earth" and "pour one out" 😅
I really like the "data generation is part of the training process" argument you make.
It almost implies that no one is ever actually deploying
For too long, users have lived under the software lottery tyranny of fused attention implementations.
No longer.
Introducing FlexAttention, a new PyTorch API allowing for many attention variants to enjoy fused kernels in a few lines of PyTorch.
1/10
@SnowflakeDB
Awesome work training such a big model with a permissive license!
I think you had a mistake in your IFEval implementation: your reported number is roughly 2x lower than what we observe (though it does vary with inference server and sampling parameters). You should see something in the high 60s
@andrew_n_carr
Deep (narrow) models have more representational power (if you can get training to be stable).
Wide (shallow) models get better HW utilization (whatever your sweep finds is the optimal ratio, GPUs want it to be WIDER)
Choose your own adventure (engineering/ml tradeoff)
@jeremyopendata
@Replit
Amazing to see how the
@MosaicML
platform enables customers to do, in a week, what only months ago was possible at just a handful of companies.
Congratulations
@Replit
on the successful run
any time i see an llm emergence study on a small (less than 70B) or undertrained (GPT-3.5) model, i just want to respond with this gif. this is one reason i'm so interested in comparing the L3-70B and L3-405B models
gif from here
After 4+ years of working at
@CerebrasSystems
it's a bittersweet moment to finish my last day of work.
I'm moving on to my next adventure, but am excited to see what the future holds for the WaferScaleEngine.
Signing off, one last time with
#IamCerebras
SeqLen who??? 😝
It’s been an awesome past few months getting this running and trained! The team at
@MosaicML
has been amazing and the tools we’re building enable more than I thought possible!
If your application requires extremely long seq len, you can find it at
@MosaicML
🤯🤯 LLM trained with 64K+ context length! What could you do with that? Prompted our model with the ENTIRE contents of "The Great Gatsby" and asked it to write the epilogue. Snippet 👇
Model dropping soon to an open-source repo near you.
Epilogue:
It seemed to me that Gatsby
Slightly surprising to see that Meta still struggles with MoE training stability at scale 🤔🧐❓
DBRX training had no loss stability issues.
If Zuck wants help, he could just ask 🤷♂️
🧵
According to the Chinchilla paper, a 30B LLM trained on 600B tokens will be as good as GPT3.
So why not train on 1T tokens and beat it on 6/9 tasks 🤷♂️
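That 600B figure is just the ~20-tokens-per-parameter Chinchilla rule of thumb (quick check, also showing the compute for the 1T-token run using the usual ~6*N*D approximation):

```python
params = 30e9
tokens_per_param = 20                          # Chinchilla rule of thumb: ~20 tokens/param
chinchilla_tokens = params * tokens_per_param  # 600B tokens
flops_at_1t = 6 * params * 1e12                # ~1.8e23 FLOPs if you train to 1T tokens
print(f"{chinchilla_tokens / 1e9:.0f}B tokens; ~{flops_at_1t:.1e} FLOPs at 1T tokens")
```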
@francoisfleuret
The 30x is real and comes from this technical brief, page 15:
How is 30x possible given GB200 has only ~2.3x increase in memBW and FLOP/s over H100?
It involves comparing per-chip generation throughput = output_tokens/s/chip. The two systems compared are
@SahajGarg6
@abhi_venigalla
@julien_c
LLM-Foundry + Composer allows us to compute MFU on the fly (based on training throughput)
We also have a table of configs + perf here
(although it needs to be updated with H100 numbers)
🎉 🎉🎉 We have a new price on training Stable Diffusion 2 from scratch:
$50k trained on the MosaicML Platform.
We replicated Stable Diffusion 2.0 with massive training speedups, and now you can too.
Learn more in our latest blog post:
Using the same amount of compute, Meta could probably have trained a 250B-param model on 25T tokens (3.75e25 FLOPs; 100x tokens-per-param ratio) and gotten about the same performance, and they would have produced a model that can be served at ~1.6x the speed.
🧵
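Rough math behind that, using the ~6*N*D approximation (the 15.6T token count is the figure from the LLaMa3 paper; the 250B run is the hypothetical):

```python
def train_flops(params, tokens):
    # standard ~6*N*D approximation for dense transformer training compute
    return 6 * params * tokens

llama3_405b = train_flops(405e9, 15.6e12)  # ~3.8e25 FLOPs (paper's token count)
alt_250b = train_flops(250e9, 25e12)       # ~3.75e25 FLOPs, ~100 tokens per param
print(f"{llama3_405b:.2e} vs {alt_250b:.2e} FLOPs; serving-speed ratio ~{405 / 250:.1f}x")
```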
🧑💻 replit-code-v1-3b is out!
Head to our HuggingFace 🤗 org page:
to use the open-source release of our ReplitLM specialized on code completion.
This will be the first of many LLMs 🚀
1/ 🧵
Best workshop:
ES-FoMo
Am I biased cuz I spoke there? Yes
Were the other talks so good I felt out of place? Also yes
@slippylolo
really knows how to organize a workshop
Excited about the future of Scaling! 🚀📈
Runner up: DMLR: Datasets for Foundation Models
🧵
A few weeks ago I had the opportunity to talk with
@ecsquendor
and
@DoctorDuggar
on
@MLStreetTalk
.
We talked about ML hardware, Cerebras, and how sparsity can interact with it all.
I definitely recommend people checkout their podcasts.
#iamcerebras
🚨New🌟blog✍️ on ⏩ maximizing🌙 FLOPS 🚀
Training large models requires maximizing flops/GPU, especially at scale. Excited to share a few of the cool tricks in thread👀. 1/N
FLOPS to downstream tasks is sigmoidal (ie saturating faster than we'd like)
We all knew it would happen at some scale, but it is sad to see it actually happening at scale... 😢😭
not hyped but it does show that scale will prevail!
🧵
@francoisfleuret
Part of
@MosaicML
's mission is to show that you don't need a MASSIVE model to rule them all.
Task specific models enable you to get SOTA perf using much smaller models when training on task specific data eg
Meet PubMed GPT 🩺 a new SOTA on the US Medical Licensing Exam developed by MosaicML and
@StanfordHAI
. It's a normal GPT-3B model trained on medical data that bests hand-designed med models and generic models 40x bigger, a sweet spot for foundation models🧵
ook this thread is getting toooo long (and I'm only on page 12 😬)
This paper is a treasure trove
I'll leave it here 🫡, but will finish the paper
10/10 would recommend