Meet DBRX, a new SOTA open LLM from @databricks. It's a 132B MoE with 36B active params trained from scratch on 12T tokens. It sets a new bar on all the standard benchmarks, and - as an MoE - inference is blazingly fast. Simply put, it's the model your data has been waiting for.
I just open-sourced my codebase for research on neural network pruning, the Lottery Ticket Hypothesis, and other topics in deep learning. It's written in PyTorch and designed to make it easy to add new models, datasets, and experiments. Check it out:
MPT is here! Check out our shiny new LLMs, open-source w/commercial license. The base MPT-7B model is 7B params trained on 1T tokens and reaches LLaMA-7B quality. We also created Instruct (commercial), Chat, and (my favorite) StoryWriter-65k+ variants. 🧵
MPT-30B is here! Same MPT architecture, 30B parameters, > 1T tokens, 8k context window, trained on H100s, great perf (esp on coding), single-GPU inference, commercially usable, and massively upgraded instruct and chat datasets. Take it for a spin!
I defended today, and @mcarbin was kind enough to pass me. My favorite part of the thesis is a ground-up rewrite of the original Lottery Ticket Hypothesis paper with fresh data and a narrative that benefits from four years of hindsight/maturity. Coming soon to an arXiv near you!
72 hrs ago, @togethercompute released the RedPajama dataset. Like everyone, we at @MosaicML were very excited about the idea of a fully open-source Llama. So excited, in fact, that we've already trained a 1B model on 200B tokens! It's on HF (Apache 2.0) here:
I'm absolutely thrilled that @MosaicML has agreed to join @databricks as we continue on our journey to make the latest advances in deep learning efficient and accessible for everyone. The best of MosaicML is yet to come 🎉🎉🎉
For those interested, my dissertation is now available. The highlight is that I re-did the original Lottery Ticket Hypothesis paper from scratch (Chapter 3). It follows the same path as the original, but with years of context/maturity + a new experiment 🧵
I guess the word is out! I'll be joining the @Harvard faculty in the fall of 2023 as part of an amazing cohort of new machine learning professors. Looking forward to sharing more about my lab, how to join, and everything we're building at @hseas when I'm a bit closer to arriving!
TLDR: Announcing 🌟COMPOSER🌟, a PyTorch trainer that speeds up training *algorithmically*. Train 2x-4x faster on standard ML tasks, a taste of what's coming from @MosaicML. Star it, pip install mosaicml, contribute, be efficient!
Thread:
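If you're curious what speeding up training *algorithmically* looks like in code, here's a minimal sketch: you hand the trainer a list of speed-up algorithms and it applies them during training. The class names and arguments below are assumed from Composer's public docs, so treat this as an illustrative sketch rather than a canonical recipe.

```python
# pip install mosaicml  (sketch only; Composer API names assumed from its docs)
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18

from composer import Trainer
from composer.algorithms import BlurPool, LabelSmoothing
from composer.models import ComposerClassifier

# Tiny synthetic dataset so the example is self-contained.
data = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
loader = DataLoader(data, batch_size=16)

model = ComposerClassifier(module=resnet18(num_classes=10), num_classes=10)

# The algorithmic-efficiency idea: speedups come from composing training-time
# algorithms (here BlurPool and label smoothing), not from changing hardware.
trainer = Trainer(
    model=model,
    train_dataloader=loader,
    max_duration="1ep",
    optimizers=torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9),
    algorithms=[BlurPool(), LabelSmoothing(smoothing=0.1)],
)
trainer.fit()
```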
Introducing the *Mosaic ResNet*, a new take on a CV workhorse that sets SOTA for efficiency at any ImageNet accuracy. The recipe uses 12 techniques that change the math of training for a 7x speedup over standard baselines + up to 3.8x over the latest work.
Several methods have recently been proposed for pruning neural networks at initialization. In our new paper (@KDziugaite, @roydanroy, @mcarbin), we rigorously study these methods to determine why they "miss the mark" and underperform pruning after training.
NEW WORKSHOP: Sparsity in Neural Networks: Advancing Understanding and Practice (July 8-9, 2021). This workshop will bring together members of the many communities working on neural network sparsity to share their perspectives and the latest cutting-edge research (Deadline: 6/15)
My latest weekend project: tossing another 500B tokens at an 8k context window on MPT-7B, thereby creating MPT-7B-8k! 1.5T tokens, 8k context, waaaaay better performance. When we say speed at @MosaicML, we mean it: it took me three days to train.
LLMs are for everyone! Own a GPT-3 trained on your data rather than renting a GPT-3 trained on a web crawl of Reddit. The price is $450K. Email llm-early-access@mosaicml.com to try it. This is just the start: this doesn't use MosaicML speedups. Our goal is to do this for $100K soon. 🧵
We have exciting news! In our latest and greatest LLM blog, we show how MosaicML Cloud can help you train LLMs from 1B - 70B parameters, and for the first time, publish transparent times + costs for doing so. It's a lot cheaper than you think! (1/9)
And now it's < $50k. 🖼️ Announcing @MosaicML's diffusion offering 📷 We replicated Stable Diffusion 2.0, training from scratch with a huge speedup, and we can do it on your data too. Human eval showed the model to be indistinguishable from the original. Blog:
Hello OLMo! Congrats to the amazing @allen_ai team! 7B params, 2T tokens, open training code, open data, intermediate checkpoints, Apache 2.0, the works. A giant leap for open science. Nicely done @mechanicaldirk, @i_beltagy, @soldni, and so many others!
No matter how established I become, I still feel completely inadequate seeing all the NeurIPS tweets. For all the folks out there who feel similarly, you aren't alone.
@Harvard is investing $500M in ML and neuroscience over the next decade thanks to a gift from @ChanZuckerberg. For my part, this makes it possible to study the foundations of deep learning at a scale and depth that are otherwise only accessible in industry.
#AI and #MachineLearning are just beginning to make an impact in biology, and there is more untapped potential. We're launching the Kempner Institute for the Study of Natural and Artificial Intelligence at @Harvard to bring together these two fields.
At ICML next week, @KDziugaite, @roydanroy, @mcarbin, and I will present Linear Mode Connectivity and the Lottery Ticket Hypothesis. We study the effect of SGD noise (like data order) on neural net optimization. Those results shed new light on lottery tickets.
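For anyone who wants the mechanics behind "linear mode connectivity": train two copies of a network from the same starting point under different SGD noise (e.g., different data orders), then evaluate the error along the straight line between the two solutions; if the error never rises much above the endpoints, the pair is linearly connected. The helper below is my own hedged sketch of that check, not code from the paper.

```python
import copy
import torch

@torch.no_grad()
def error_barrier(model, state_a, state_b, loader, device="cpu", steps=11):
    """Evaluate classification error along the linear path between two weight sets."""
    errors = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Interpolate floating-point tensors; keep integer buffers (e.g. BN counters) as-is.
        interp = {
            k: ((1 - alpha) * v + alpha * state_b[k]) if v.is_floating_point() else v
            for k, v in state_a.items()
        }
        probe = copy.deepcopy(model).to(device)
        probe.load_state_dict(interp)
        probe.eval()
        wrong, total = 0, 0
        for x, y in loader:
            pred = probe(x.to(device)).argmax(dim=-1)
            wrong += (pred != y.to(device)).sum().item()
            total += y.numel()
        errors.append(wrong / total)
    # Instability/barrier: how far the path rises above the worse endpoint.
    barrier = max(errors) - max(errors[0], errors[-1])
    return errors, barrier
```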
In the last two weeks, @MosaicML had lots of big news: We trained a 1B/200B-token LLM on RedPajama in < 72hrs, Replit used us to train a SOTA code model in < 10 days, we trained SD2 for < $50k, long context BERTs, and perf #'s on H100s. But the biggest news is coming this week 👀
I AM SO ANGRY. I won't submit to ACL venues again after they shafted a student after rebuttals with this idiotic policy. Since anonymity is gone, though, publicity time! Check out awesome work by @ZackAnkner on improving MLM training by scheduling masking:
Just got a desk reject, post-rebuttals, for a paper that was submitted to arXiv <30 min past the anonymity deadline. I'll note how the ACL embargo policy hurts junior researchers and makes ACL venues less desirable for NLP work; I won't even get into the pointless NOISE it adds.
Even though we've been doing this for a year, I will never get used to the fact that the only in-person audience members for my job talk are my stuffed animals.
Curious how the RedPajama effort by @togethercompute is progressing and how it stacks up? We evaluated the 7B model they just released 2h ago! Here is how it looks 800B tokens in. (Eval took 16 minutes on 32 A100s.)
The first RedPajama models are here! The 3B and 7B models are now available under Apache 2.0 license, including instruction-tuned and chat versions!
This project demonstrates the power of the open-source AI community with many contributors ... 🧵
@davidjschwab, @arimorcos, and I have a new paper on BatchNorm. It's not exactly a typical BatchNorm paper: we study the accuracy when freezing all weights at random init and "Training BatchNorm and Only BatchNorm." How did this happen? It's a funny story...
What happens if you freeze all weights at initialization and train *only* BatchNorm? Turns out that BatchNorm's affine parameters are impressively powerful, and they can use random features to reach surprisingly high accuracy. Find out more at the 12pm ET ICLR poster session!
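In case it's useful, here's a hedged sketch (mine, not the paper's code) of the setup: freeze every weight at its random initialization and leave only the BatchNorm affine parameters trainable.

```python
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10)

# Freeze everything at its random initialization...
for p in model.parameters():
    p.requires_grad = False

# ...then re-enable only the BatchNorm affine parameters (per-channel gamma and beta).
for m in model.modules():
    if isinstance(m, torch.nn.BatchNorm2d):
        m.weight.requires_grad = True
        m.bias.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9)
print(sum(p.numel() for p in trainable), "trainable parameters")  # a tiny fraction of the model
```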
This is a big deal - I'm so excited it's finally out! This work convinced me that large models like LLMs are really databases. @OfirPress and co-authors created a way to measure the expressive power of querying languages for these new NN DBs and an awesome new querying language.
We've found a new way to prompt language models that improves their ability to answer complex questions
Our Self-ask prompt first has the model ask and answer simpler subquestions. This structure makes it easy to integrate Google Search into an LM. Watch our demo with GPT-3 🧵⬇️
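To make the structure concrete, here's a paraphrased sketch of a self-ask style prompt (the exact wording and examples in the paper may differ). Because the sub-questions appear on explicit "Follow up:" lines, a thin wrapper can intercept each one, send it to a search engine, and paste the result back as the "Intermediate answer:" before the LM continues.

```python
# Paraphrased self-ask prompt structure (illustrative; not the paper's exact text).
SELF_ASK_PROMPT = """\
Question: Who was president of the U.S. when superconductivity was discovered?
Are follow up questions needed here: Yes.
Follow up: When was superconductivity discovered?
Intermediate answer: Superconductivity was discovered in 1911.
Follow up: Who was president of the U.S. in 1911?
Intermediate answer: William Howard Taft.
So the final answer is: William Howard Taft.

Question: {question}
Are follow up questions needed here:"""

print(SELF_ASK_PROMPT.format(question="Who was the maternal grandfather of George Washington?"))
```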
We just posted our ICLR 2020 paper on "The Early Phase of Neural Network Training" on ArXiv. In the paper, we explore the changes neural networks undergo during the crucial first phase of training using winning lottery tickets.
Recent studies have suggested that the earliest iterations of DNN training are especially critical. In our #ICLR2020 paper with @jefrankle and @davidjschwab, we use the lottery ticket framework to rigorously examine this crucial phase of training.
This this this. I don't like to call out papers we can't reproduce because I'm not a fan of making life and career harder for PhD students. But I no longer believe anything if we haven't reproduced it ourselves.
I'm writing this cause I'm a bit salty. We've implemented so many seemingly promising, published & popular papers only for them to utterly flop.
At least I like to think that my personal bs Big Model paper classifier is now pretty good given my extensive training data.
Tired reflection at the end of DBRX release day: Last March 24, @databricks released Dolly. Last May 5, Mosaic released MPT-7B. Less than a year later, we've built an LLM that seems to surpass the original ChatGPT. I am so incredibly proud of our team - you all are amazing ♥️
Two weeks later, Stable Diffusion training cost is already down to $125K, a 22% reduction. Our team is blazingly fast at making training blazingly fast.
Two weeks ago, we released a blog showing training Stable Diffusion from scratch only costs $160K. Proud to report that blog is already out of date. It now costs 💸 $125K 💸. Stay tuned for more speedups from @MosaicML, coming soon to a diffusion model near you!
What bullshit. Dear OpenAI researchers: My email address is jonathan@mosaicml.com. We are hiring! We have a healthy culture and no elitism, egos, or divas.
OpenAI’s chief scientist: expresses curiosity/openness about a mysterious idea, caveats with “may”.
Meta’s chief AI scientist: the certainty of "nope".
Probably explains a lot of the past 5 years.
Dear Meta AI researchers: My email address is sama@openai.com. We are hiring!
Would anybody be interested in a couple dozen 1B, llama-style (waaaay past Chinchilla) language models trained on different data mixes? I don't know if this question has been well-studied before.
Thank you @LastWeekTonight for featuring @ClareAngelyn, @alvarombedoya, and my work on police use of face recognition. For those in the ML community thinking about "broader impact," there are big opportunities to use your expertise to make a difference in the policy world!
I'm no hardware expert, but - if you need 2x the power and (potentially) 2x the price to 3x the compute - it seems to me that hardware has little or nothing to offer when it comes to getting us out of the jam we're in with giant models. Our solution has to be better algorithms.
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
Modifies the scaling laws to calculate the optimal LLM parameter count and pre-training data size to train and deploy a model of a given quality and inference demand
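A hedged sketch of the idea (my paraphrase using the standard approximations, not the paper's exact notation): training a model with N parameters on D_tr tokens costs roughly 6*N*D_tr FLOPs, serving D_inf tokens over the model's lifetime costs roughly 2*N*D_inf FLOPs, and the point is to pick N and D_tr that minimize the combined cost while still hitting a target loss under a Chinchilla-style law:

```latex
\min_{N,\; D_{\mathrm{tr}}} \; 6\,N\,D_{\mathrm{tr}} \;+\; 2\,N\,D_{\mathrm{inf}}
\qquad \text{s.t.} \qquad
L(N, D_{\mathrm{tr}}) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D_{\mathrm{tr}}^{\beta}} \;=\; L_{\mathrm{target}}
```

The larger the expected inference demand D_inf, the further the optimum shifts toward a smaller model trained on more tokens than Chinchilla alone would suggest.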
Even after five years of PhD, I continue to be astounded by the casual, gratuitous cruelty that peers and institutions in academia are capable of inflicting without a second thought.
LATEST NEWS ON THE LOTTERY TICKET HYPOTHESIS: We (@KDziugaite, @roydanroy, and @mcarbin) just released an updated paper showing (1) how to scale the LTH to deeper networks on ImageNet and (2) initial insights into why the LTH works. Check it out on arXiv:
I used to believe that @kchonyc was really three postdocs in a trench coat, having never personally seen physical evidence that he existed. I was excited to finally have my hypothesis refuted this evening. Empiricism at work!
A Sunday walk down memory lane: I found the original drafts of the Lottery Ticket Hypothesis paper this weekend. Links and commentary in this 🧵. You can chart progress of public versions on arXiv v1-v5, but it's especially cool to see the earliest attempts at stating the idea.
Today is the third time I've personally found plagiarism during ML reviewing in the past year-ish. I'm seeing more papers now that I'm an AC, but it's still a change. I'm not even trying hard; I'm just checking passages that sound strangely familiar, and I'm right every time.
Just released our new paper about "The Lottery Ticket Hypothesis at Scale" (with Gintare Karolina Dziugaite, @roydanroy, and @mcarbin), extending our prior work to find small trainable subnetworks within deeper, state-of-the-art neural networks.
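For readers new to the area, here's a hedged sketch of the core loop behind these papers: iterative magnitude pruning with weight rewinding. Train, prune the lowest-magnitude surviving weights globally, rewind the survivors to their values from early in training (iteration 0 in the original LTH), and repeat. The train_fn callback and the details below are my own simplified placeholders, not the papers' code.

```python
import torch

def magnitude_prune(model, masks, fraction=0.2):
    """One pruning round: globally remove the lowest-magnitude surviving weights."""
    scores = torch.cat([
        (p.detach().abs() * masks[name]).flatten()
        for name, p in model.named_parameters() if name in masks
    ])
    remaining = scores[scores > 0]
    k = max(1, int(fraction * remaining.numel()))
    threshold = remaining.kthvalue(k).values
    for name, p in model.named_parameters():
        if name in masks:
            masks[name] = (p.detach().abs() > threshold).float() * masks[name]
    return masks

def imp_with_rewinding(model, train_fn, rewind_state, rounds=5, fraction=0.2):
    """Sketch of iterative magnitude pruning with rewinding (train_fn is a user-supplied
    placeholder that must zero out masked weights/gradients while training)."""
    masks = {name: torch.ones_like(p)
             for name, p in model.named_parameters() if p.dim() > 1}  # skip biases/norms
    for _ in range(rounds):
        train_fn(model, masks)                  # train the masked subnetwork to completion
        masks = magnitude_prune(model, masks, fraction)
        model.load_state_dict(rewind_state)     # rewind surviving weights to early-training values
    return masks
```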
New blog by @mvpatel2000 with big updates to our LLM stack and a new recipe for blazingly fast training. FP8 + Configurable ActCkpt + DTensor + Hybrid Sharding + Comm/Act Compression = 700+ TFLOPs on H100s and linear scaling.
Very excited to partner with @allen_ai on this incredible project. It's not every day you get to work with the best of the best on what will soon be the best open-source model in the world ⚔️
Repeating my offer from the @MLRetrospective panel today: the ML community desperately needs a survey track (like IEEE S&P's SoK track). I will happily volunteer to do the work to create/run this if any chairs of @NeuripsConf, @iclr_conf, or @icmlconf are interested.
On the job market this year, I was often asked what I considered to be my most impactful piece of research. My answer was always The Perpetual Lineup. The lottery ticket hypothesis affected the lives of grad students. The Perpetual Lineup affected the lives of everyday people.
1/ 5 years ago today, we released #ThePerpetualLineup, the first-of-its-kind survey of state and local police use of face recognition technology, based on 100 public records requests yielding 16,000+ pages.
Time for my usual refrain: Most papers weren't accepted to ICLR, and don't let Twitter fool you into thinking otherwise. Plenty of smart people and great papers didn't get the outcome they wanted, and you're in very good company if that's you right now.
So excited about this -- bringing amazing platforms for data and AI together. @NaveenGRao, @hanlintang, and @jefrankle have built an amazing team that has steadily reduced the cost of AI training and released breakthroughs like the first open-source LLMs with >64K context.
Hoping to get a fifth review on my NeurIPS papers so I can complete these poker hands. Three different papers are one review away from a straight, and it would be nice to turn that two-pair into a full house.
MosaicBERT is here! I've been teasing this for a while. TLDR: You have no excuse NOT to pre-train BERT in your papers. The highlights:
* BERT-base quality for $20 and BERT-large quality (using BERT-base) for $100
* 2.4x speedup overall
* Pre-trained weights are available on HF
📢 Introducing MosaicBERT! Now you can pretrain a high-quality BERT model from scratch on the MosaicML platform for $20. So why should you train your own BERT model? 👇 (1/5)
@NaveenGRao @MosaicML @databricks Startups, need a CEO to get those end-of-year goals? DM me!
I ♥️ our startup community. I want to see all the great GenAI products accelerated!
I'm willing to give you Naveen so I can keep my GPUs.
Ever wondered what happens when you freeze all the weights in a neural network and only train batch normalization? Me too! Turns out you can get 80%+ accuracy on CIFAR-10 by doing so. Check out our poster and oral in the SEDL workshop in West 121. With David Schwab and @arimorcos.
We found a scaling law that describes the error of entire families of pruned neural networks. For the night owls among you, check out our work "On the Predictability of Pruning Across Scales" at ICML (tonight, 11pm-2am Eastern). Led by @jonsrosenfeld!
Louder for the people in the back:
LARGE MODELS (GPT, DALLE) = DATABASES
PROMPTS = QUERIES
OUTPUTS = RESPONSES
NNs find new relations w/in data. Anyone, no matter the resources, can study better querying langs and possibly beat a big model they could never afford to train.
Authors are people, and cruelty from the community takes a toll. I've been where the ICML awardees are in a smaller way; I often wish I hadn't gotten an award. Also, students are researchers in training. If they did their best, any shortcomings are on supervision and the process.
We built DBRX end-to-end in 2-3 months on 3K H100s. We train LLMs day-in and day-out with our customers - thousands in the past year. We're constantly finding better ways to build models, and DBRX showcases our latest advances: in data, modeling, performance, and fine-tuning.
Come to my ICLR poster (12pm ET today) on pruning neural networks at initialization and why we're currently missing the mark. Let's discuss lottery tickets, the nature of optimizing sparse networks, and ways forward for pruning early in training!
Announcing the BAY AREA EFFICIENT ML POSTER SESSION on Thur 3/31 in Palo Alto. Are you sad that MLSys was postponed? Do you miss getting to see research friends in person? Me too! Submit abstracts for work-in-progress or pandemic-era publications by 3/22.
"We find that overall, the Intel Gaudi 2 accelerator has the 2nd best training performance-per-chip we've tested (only bested by the NVIDIA H100)." More great AI chips means more FLOPs available for all of us to build great models. Soon, we'll all be GPU (or Gaudi) rich 🤑
Big announcement 5 of 6: @MosaicML does inference! As per usual, efficiency is king 👑 We serve LLMs and diffusion models - 15x cheaper than comparable OpenAI offerings. We're happy to serve anything: your model, our model, or anything open-source. Exciting times here at Mosaic!
📣Announcing MosaicML Inference 📣
Ever wanted a text or image generation API that doesn’t make you send data to a third party?
Or a cheaper solution than paying by the token?
Or an easy way to get a trained model into production?
We can help with that. 🧵
How much does it *really* cost to train GPT? There's speculation and (mis-)info out there that might make you think it's out of reach. It isn't. @MosaicML is laser-focused on making it easy and accessible. This is Part 1 of a series introducing Mosaic GPT.
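For a sense of why it isn't out of reach, here's a back-of-envelope estimate using the standard ~6*N*D FLOPs approximation. Every number below (model size, token count, utilization, GPU price) is an illustrative assumption of mine, not a published MosaicML figure.

```python
params = 30e9              # hypothetical 30B-parameter GPT-style model
tokens = 600e9             # hypothetical number of training tokens
train_flops = 6 * params * tokens            # ~6*N*D approximation

peak_flops = 312e12        # A100 BF16 peak (~312 TFLOPs)
utilization = 0.40         # assumed sustained utilization
gpu_seconds = train_flops / (peak_flops * utilization)
gpu_hours = gpu_seconds / 3600
cost = gpu_hours * 2.00    # assumed $2 per GPU-hour

print(f"{gpu_hours:,.0f} GPU-hours, ~${cost:,.0f}")  # roughly 240K GPU-hours, ~$480K under these assumptions
```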
@zacharylipton @OpenAI @AnthropicAI @MosaicML We don't really have time to publish. We blog, but we don't slog [through the publication process]. Importantly, though, we're open about what we do, unlike the other two companies you mentioned.
It's been a busy two weeks at @MosaicML:
* RedPajama-1B in < 72hrs
* @Replit model trained on MosaicML
* SD2.0 for < $50k
* H100 numbers
* Long context BERT
* MosaicML Inference release
* MPT-7B release
Our tools can consistently train great models, and this pace isn't stopping!
Last chance to register (for free!) to attend the neural network sparsity workshop taking place tomorrow and Friday. Join 700 (!) registrants, 62 poster presenters, 7 spotlights, 6 invited talks, 3 panels, and 1 tutorial. See you tomorrow!
Come check out our new paper on how there are sparse, *transferable* winning ticket subnetworks in BERT pre-trained models at NeurIPS Poster Session 2 today (12pm EST, 9am PST). This project was led by the extraordinary @tianlong_chen (@utexasece) with teammates at the @MITIBMLab.
🚨 Announcing DBRX-Medium 🧱, a new SoTA open-weights MoE with 36B active and 132B total parameters, trained on 12T tokens (~3e24 FLOPs). DBRX achieves 150 tok/sec while clearing a wide variety of benchmarks. Deep dive below! 1/N
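For context, the quoted FLOP count lines up with the standard 6*N_active*D training-compute approximation (my arithmetic, not from the post):

```latex
C_{\text{train}} \;\approx\; 6 \, N_{\text{active}} \, D
\;=\; 6 \times (36\times 10^{9}) \times (12\times 10^{12})
\;\approx\; 2.6\times 10^{24} \ \text{FLOPs} \;\approx\; 3\mathrm{e}24
```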
How did training go? Zero human intervention needed. None. Nada. Our arch+optimization changes eliminated all loss spikes. The @MosaicML platform (our proprietary training software, available to customers) caught and recovered from four hw failures. Please enjoy our empty logbook.
More than 100 Harvard faculty denounce "false equivalency between attacks on noncombatants and self-defense against those atrocities."
The conflict is complex but "the events of this week are not complicated. Sometimes there is such a thing as evil"
Interested in hearing the latest updates on the Lottery Ticket Hypothesis? Come to my talk tomorrow morning at 9:30 at the #AAAI20 Sister Conference Track! New and improved formula with more tickets, more hypotheses, less lottery, same great taste. 🎟️🎟️🎟️
Huge Release of GPT4All 💥
Powerful LLMs just got faster!
- Anyone can try @MosaicML's new MPT model on their desktop! No GPU required!
- Runs on Windows/Mac/Ubuntu
Try it at:
It's amazing how much more fun I'm having reviewing for @MLSysConf than for the main ML conferences. I'm a big fan of smaller, more focused venues with shared values.
At @MosaicML, we're loyal to getting the most out of every dollar for our customers, not to any one specific way of doing things. Our stack can run anywhere, and now that means AMD! Check out our numbers on MI250X, and join me in getting excited for MI300.
Ready for GPU independence weekend?
PyTorch 2.0 and LLM Foundry now work out of the box on ** AMD GPUs! ** We profiled MPT 1B-13B models on AMD MI250 and saw perf within 80% of A100-40GB, which could go up to 94% with better software.
It. Just. Works.
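Part of why "It. Just. Works." is that ROCm builds of PyTorch expose the same CUDA-named APIs, so existing training code runs unmodified. A quick hedged sanity check (exact version strings will vary by install):

```python
import torch

print(torch.__version__)              # ROCm wheels typically report a "+rocm" suffix
print(torch.version.hip)              # HIP/ROCm version string; None on CUDA builds
print(torch.cuda.is_available())      # True on a working AMD GPU setup
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. an MI250 device name
```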
Excited about MPT-7B-Storywriter-65k+ 📚 with its 65k training context? It's now available to play with on Hugging Face Spaces. Go have fun with ultra-long contexts! 📝
Way too little, way too late, @MIT. At least Princeton tried to head off unionization with a big pay bump. All MIT can muster is an unsubstantiated warning that "promises...have been overstated." If the administration actually cared about our best interests, we wouldn't be here.
Read the blog for the full details. DBRX is better than general-purpose open LLMs at general-purpose tasks and better than CodeLLaMA-70B at code. It even gives the closed models a run for their money. It's great at using its 32k context and at RAG too.
Today I learned that there exist ego-driven ML startups that are really, truly cruel to their researchers (and probably the rest of their employees). This is a subtweet.
It's *almost* great to be back at @MIT_CSAIL! If someone has their kayak handy, could they please paddle over and drop off a lifejacket and some buckets?
Third - my personal favorite - MPT-7B-StoryWriter-65k+. This model is fine-tuned on English-language literature with a context length of 65k. How? We use ALiBi position encodings (@OfirPress), so the model can handle any length and extrapolate even longer (up to 84k in our testing).
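For the curious, here's a hedged sketch of the ALiBi idea (my own minimal implementation, not MPT's code): instead of position embeddings, each attention head adds a linear penalty proportional to the query-key distance, which is what lets the model extrapolate to sequences longer than it was trained on.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear attention biases, added to the attention logits before softmax."""
    # Geometric slopes per head; this simple formula assumes n_heads is a power of two.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    # rel[i, j] = min(j - i, 0): zero on the diagonal, increasingly negative with distance
    # for past positions; future positions are handled by the causal mask anyway.
    rel = (pos[None, :] - pos[:, None]).clamp(max=0)
    return slopes[:, None, None] * rel[None, :, :]   # shape: (n_heads, seq_len, seq_len)

bias = alibi_bias(n_heads=8, seq_len=16)
print(bias.shape)   # torch.Size([8, 16, 16])
```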