Introducing DBRX: A New Standard for Open LLMs 🔔
💻 DBRX is a 16x 12B MoE LLM trained on 📜 12T tokens
🧠DBRX sets a new standard for open LLMs, outperforming established models on various benchmarks.
Is this thread mostly written by DBRX? Yes!
🧵
Our team at
@MosaicML
has been working on releasing something special:
We're proud to announce that we are OPEN SOURCING a 7B LLM trained to 1T tokens
The MPT model outperforms ALL other open source models!
Code:
Blog:
🧵
It's actually WILD that OAI just dropped a plot where inference compute is log scale and the entire ML community is hyped
If you were worried about global warming before...
gg earth, it's been a real one
:pour-one-out:
@OpenAI
o1 is trained with RL to “think” before responding via a private chain of thought. The longer it thinks, the better it does on reasoning tasks. This opens up a new dimension for scaling. We’re no longer bottlenecked by pretraining. We can now scale inference compute too.
GPT4 was trained on only about 10T tokens!
30 billion quadrillion == 3e25
Note: 3e25 BFloat16 FLOPs at 40% MFU on H100s is about 7.5e10 GPU-seconds, i.e. ~21M H100-hours. This is about 1,300h on 16k H100s (less than 2 months)
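Quick sanity check on that math (a minimal sketch, assuming the ~989 TFLOP/s dense BF16 spec number per H100):

```python
# Back-of-envelope: how long do ~3e25 BF16 FLOPs take on H100s?
total_flops = 3e25          # rumored GPT-4 training compute
h100_bf16_peak = 989e12     # dense BF16 FLOP/s per H100 (spec-sheet number)
mfu = 0.40                  # assumed model FLOPs utilization

gpu_seconds = total_flops / (h100_bf16_peak * mfu)  # ~7.6e10 s
gpu_hours = gpu_seconds / 3600                      # ~21M H100-hours
wall_clock_hours = gpu_hours / 16_000               # ~1.3k hours on 16k H100s
print(f"{gpu_hours:.2e} H100-hours, ~{wall_clock_hours / 24:.0f} days on 16k GPUs")
```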
Token math:
Previous leaks have indicated that GPT-4 is an 8x, top_k=2 MoE
Last week we had the opportunity to benchmark the
@nvidia
H100 using the MosaicML examples repo ()
Without modification, using bf16, we saw a 2.2x speedup over A100s 🔥
Using FP8, we saw up to a 3.3x speedup 🚀
Onboarding at
@MosaicML
A: "lets set you up to train some models"
about 20 minutes later I'm running GPT3 1B
A: "I have to run to a mtg, play around with the configs & have fun"
I played around with the config and shortly after I'm running GPT 13B with a seq len of 32k 🤯
1000x compute in 8 years graph almost looks like Nvidia's stonk price
BUT they maintain this growth by decreasing precision (and introducing sparsity). This trick can be played 2 more times until there is no more precision to decrease.
Blackwell, the new beast in town.
> DGX Grace-Blackwell GB200: exceeding 1 Exaflop compute in a single rack.
> Put numbers in perspective: the first DGX that Jensen delivered to OpenAI was 0.17 Petaflops.
> GPT-4-1.8T parameters can finish training in 90 days on 2000 Blackwells.
It's been a hot LLaMa summer and 92 pages of pure knowledge dropped
LLaMa3-405B has hit the OSS and we get all the juicy details!
There has been a lot of analysis of the paper.
This is my non-extensive LLaMa3 thread of things I found novel / interesting
🧵
@SamRamani2
The U.S. and Britain were 2 of the signatories of the Budapest Memorandum guaranteeing Ukraine's 1994 territorial borders...
"I will not send American servicemen to fight in Ukraine" - Biden
What do guarantees even mean?
@xhluca
@soumithchintala
Kaggle-style mixture is more of an ensemble of models; most ppl say mixture when referring to MoE models (Switch Transformers-style, but it doesn't necessarily need to be sparse.)
DBRX deets:
- 16 Experts
- 12B params per single expert
- top_k=4 routing
- 36B active parameters
- 132B total parameters
- trained for 12T tokens 📜
- 32k seq len training
🤗HF Space Demo:
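A back-of-envelope on how those counts fit together (a sketch, assuming the non-expert "shared" parameters are identical for every token; the real DBRX breakdown may differ slightly):

```python
# Back-of-envelope: infer the per-expert vs shared split from total/active counts.
total_params = 132e9    # all 16 experts + shared params (attention, embeddings, ...)
active_params = 36e9    # per token: shared params + the top_k routed experts
num_experts, top_k = 16, 4

# total  = shared + num_experts * expert_ffn
# active = shared + top_k       * expert_ffn
expert_ffn = (total_params - active_params) / (num_experts - top_k)  # ~8B
shared = total_params - num_experts * expert_ffn                     # ~4B
print(f"~{expert_ffn / 1e9:.0f}B per expert FFN, ~{shared / 1e9:.0f}B shared")
# shared + one expert ~= 12B, which lines up with the "16x 12B" framing
```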
just dropped, and I've already seen it in 3 chats
If a dist ckpt paper gets this much hype, it's a strong signal that there are issues with the current paradigm
Please fix 🙏
& upstream to PyTorch
This is like the 3rd ckpt paper I've seen in 5 months 👀
Had a great time at
#ICML2024
Met a lot of great people and learned a ton!
Vienna is a beautiful city and I'm glad I got to visit.
Random honorable mentions follow
🧵
@ToonamiAfter
@GroqInc
@Etched
@Extropic_AI
(a) agreed, see:
(b) but linear growth on log scale means we need exponential leaps in FLOPs. Not the realistically 2x to 10x leaps people will get out of non-GPU specialized hardware
It's actually WILD that OAI just dropped a plot where inference compute is log scale and the entire ML community is hyped
If you were worried about global warming before...
gg earth, it's been a real one
:pour-one-out:
@karpathy
A change that affects the first and potentially last layer (2 layers) results in ~25% speedup for the whole network??? or for just those 2 layers?
When FA3 came out, I made some comment like:
@tri_dao
could be single handedly credited with the explosive rise of Nvidia.
From Nvidia's side, they have one hell of an OSS strategy!
Made the joke yesterday that
@tri_dao
saved 10 billion dollars and prevented oceans from boiling cause Flash Attention improved MFU by 10%
But honestly it's legit
Hiring cracked perf engineers is expected-value positive
It doesn't rely on high-variability research; it's just a plus
CA passes AI safety bill 1047
OAI shifts compute to inference
(AI safety bill 1047 imposes regulatory scrutiny for models using > 10^26 training FLOPs)
OpenAI Strawberry (o1) is out! We are finally seeing the paradigm of inference-time scaling popularized and deployed in production. As Sutton said in the Bitter Lesson, there're only 2 techniques that scale indefinitely with compute: learning & search. It's time to shift focus to
The MoE architecture produces a model that has 132B total params of capacity, but uses only 36B params to process each token.
💡DBRX outperforms models such as LLaMA2-70B and Grok while being more efficient.
🧵
🚀 The fine-grained MoE architecture makes DBRX efficient
- almost 2x faster inference than LLaMA2-70B
- about 40% smaller than Grok in total & active parameter counts
🧵
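A very rough ratio check behind the "almost 2x" number (a simplification that assumes per-token inference cost scales with active parameters, ignoring memory bandwidth and serving details):

```python
# Very rough check: treat per-token inference cost as proportional to active params.
llama2_70b_active = 70e9   # dense model: every parameter is used for every token
dbrx_active = 36e9         # MoE: only the routed experts + shared params are used
print(f"~{llama2_70b_active / dbrx_active:.1f}x fewer active params than LLaMA2-70B")
# -> ~1.9x, roughly consistent with the "almost 2x faster inference" claim
```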
BREAKING 🚨:
Nancy Pelosi just bought $5M of the AI company Databricks
Unfortunately, Databricks is a privately held company and not available to be bought by the public
Sorry people, you don’t have access to this one.
🎶🎶 Do you want to build an MoE? 🎶🎶
It was a great collaboration with the team at PyTorch to integrate the tooling needed to make MoE training easier and more efficient.
Training MoEs at Scale with PyTorch 🔥
In our latest post, we show how we scale to over three thousand GPUs using PyTorch Distributed and MegaBlocks, an efficient open-source MoE implementation in PyTorch.
Check it out:
📣ANNOUNCING THE FASTEST AI CHIP ON EARTH📣
Cerebras proudly announces CS-3: the fastest AI accelerator in the world.
The CS-3 can train up to 24 trillion parameter models on a single device. The world has never seen AI at this scale.
CS-3 specs:
⚙ 46,225 mm2 silicon | 4
@abacaj
using 4 bit quant means only about 15GB of GPU mem is used for 30B params 🤯
Given we trained the model with ALiBi, you can probably just increase the max_seq_len of the model past 8k and it'll just work (up to the point where you OOM)
You might get to about seqlen=12k
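For the HF checkpoints, the override looks roughly like this (a sketch following the MPT model-card pattern; `mosaicml/mpt-30b` is an assumed example checkpoint, and the usable length depends on your GPU memory):

```python
import transformers

# Override max_seq_len at load time; ALiBi lets the model extrapolate past its
# training length, up to whatever fits in memory (~12k in practice, per above).
name = "mosaicml/mpt-30b"  # assumed checkpoint; the same pattern applies to other MPT models
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 12288
model = transformers.AutoModelForCausalLM.from_pretrained(
    name, config=config, trust_remote_code=True
)
```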
@ChrSzegedy
That tweet was for the lolz
While "there's a grain of truth in every joke", I mostly wanted to say "gg earth" and "pour one out" 😅
I really like the "data generation is part of the training process" argument you make.
It almost implies that no one is ever actually deploying
For too long, users have lived under the software lottery tyranny of fused attention implementations.
No longer.
Introducing FlexAttention, a new PyTorch API allowing for many attention variants to enjoy fused kernels in a few lines of PyTorch.
1/10
@SnowflakeDB
Awesome work training such a big model with a permissive license!
I think you had a mistake in your IFEval implementation: your reported number is roughly 2x lower than what we observe (though it does vary with inference server and sampling parameters). You should see something in the high 60s
@andrew_n_carr
Deep (narrow) models have more representational power (if you can get training to be stable).
Wide (shallow) models get better HW utilization (whatever your sweep finds is the optimal ratio, GPUs want it to be WIDER)
Choose your own adventure (engineering/ml tradeoff)
@jeremyopendata
@Replit
Amazing to see how the
@MosaicML
platform enables customers to do, in a week, what only months ago was possible at just a handful of companies.
Congratulations
@Replit
on the successful run
any time i see an llm emergence study on a small (less than 70B) or undertrained (GPT-3.5) model, i just want to respond with this gif. this is one reason i'm so interested in comparing the L3-70B and L3-405B models
gif from here
After 4+ years of working at
@CerebrasSystems
it's a bittersweet moment to finish my last day of work.
I'm moving on to my next adventure, but am excited to see what the future holds for the WaferScaleEngine.
Signing off, one last time with
#IamCerebras
SeqLen who??? 😝
It’s been an awesome past few months getting this running and trained! The team at
@MosaicML
has been amazing and the tools we’re building enable more than I thought possible!
If your application requires extremely long seq len, you can find it at
@MosaicML
🤯🤯 LLM trained with 64K+ context length! What could you do with that? Prompted our model with the ENTIRE contents of "The Great Gatsby" and asked it to write the epilogue. Snippet 👇
Model dropping soon to an open-source repo near you.
Epilogue:
It seemed to me that Gatsby
Slightly surprising to see that Meta still struggles with MoE training stability at scale 🤔🧐❓
DBRX training had no loss stability issues.
If Zuck wants help, he could just ask 🤷♂️
🧵
According to the Chinchilla paper, a 30B LLM trained on 600B tokens will be as good as GPT3.
So why not train on 1T tokens and beat it on 6/9 tasks 🤷♂️
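That 600B figure is just the ~20-tokens-per-parameter Chinchilla rule of thumb (quick check, also showing the compute for the 1T-token run using the usual ~6*N*D approximation):

```python
params = 30e9
tokens_per_param = 20                          # Chinchilla rule of thumb: ~20 tokens/param
chinchilla_tokens = params * tokens_per_param  # 600B tokens
flops_at_1t = 6 * params * 1e12                # ~1.8e23 FLOPs if you train to 1T tokens
print(f"{chinchilla_tokens / 1e9:.0f}B tokens; ~{flops_at_1t:.1e} FLOPs at 1T tokens")
```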
@francoisfleuret
The 30x is real and comes from this technical brief, page 15:
How is 30x possible given GB200 has only ~2.3x increase in memBW and FLOP/s over H100?
It involves comparing per-chip generation throughput = output_tokens/s/chip. The two systems compared are
@SahajGarg6
@abhi_venigalla
@julien_c
LLM-Foundry + Composer allows us to compute MFU on the fly (based on training throughput)
We also have a table of configs + perf here
(although it needs to be updated with H100 numbers)
🎉 🎉🎉 We have a new price on training Stable Diffusion 2 from scratch:
$50k trained on the MosaicML Platform.
We replicated Stable Diffusion 2.0 with massive training speedups, and now you can too.
Learn more in our latest blog post:
Using the same amount of compute, Meta could probably have trained a 250B-param model on 25T tokens (3.75e25 FLOPs; 100x tokens-per-param ratio) and gotten about the same performance, and they would have produced a model that can be served at ~1.6x the speed.
🧵
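Rough math behind that, using the ~6*N*D approximation (the 15.6T token count is the figure from the LLaMa3 paper; the 250B run is the hypothetical):

```python
def train_flops(params, tokens):
    # standard ~6*N*D approximation for dense transformer training compute
    return 6 * params * tokens

llama3_405b = train_flops(405e9, 15.6e12)  # ~3.8e25 FLOPs (paper's token count)
alt_250b = train_flops(250e9, 25e12)       # ~3.75e25 FLOPs, ~100 tokens per param
print(f"{llama3_405b:.2e} vs {alt_250b:.2e} FLOPs; serving-speed ratio ~{405 / 250:.1f}x")
```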
🧑💻 replit-code-v1-3b is out!
Head to our HuggingFace 🤗 org page:
to use the open-source release of our ReplitLM specialized on code completion.
This will be the first of many LLMs 🚀
1/ 🧵
Best workshop:
ES-FoMo
Am I biased cuz I spoke there? Yes
Were the other talks so good I felt out of place? Also yes
@slippylolo
really knows how to organize a workshop
Excited about the future of Scaling! 🚀📈
Runner up: DMLR: Datasets for Foundation Models
🧵
A few weeks ago I had the opportunity to talk with
@ecsquendor
and
@DoctorDuggar
on
@MLStreetTalk
.
We talked about ML hardware, Cerebras, and how sparsity can interact with it all.
I definitely recommend people checkout their podcasts.
#iamcerebras
🚨New🌟blog✍️ on ⏩ maximizing🌙 FLOPS 🚀
Training large models requires maximizing flops/GPU, especially at scale. Excited to share a few of the cool tricks in thread👀. 1/N
FLOPS to downstream tasks is sigmoidal (ie saturating faster than we'd like)
We all knew it would happen at some scale, but it is sad to see it actually happening at scale... 😢😭
not hyped but it does show that scale will prevail!
🧵
@francoisfleuret
Part of
@MosaicML
's mission is to show that you don't need a MASSIVE model to rule them all.
Task specific models enable you to get SOTA perf using much smaller models when training on task specific data eg
Meet PubMed GPT 🩺 a new SOTA on the US Medical Licensing Exam developed by MosaicML and
@StanfordHAI
. It's a normal GPT-3B model trained on medical data that bests hand-designed med models and generic models 40x bigger, a sweet spot for foundation models🧵
ook this thread is getting toooo long (and I'm only on page 12 😬)
This paper is a treasure trove
I'll leave it here 🫡, but will finish the paper
10/10 would recommend