DBRX is a new open-source, general-purpose #LLM that advances the state of the art in efficiency, using a 132B-parameter MoE architecture. Check out our deep-dive on how we trained and benchmarked #DBRX:
📢 Introducing MPT: a new family of open-source, commercially usable LLMs from @MosaicML. Trained on 1T tokens of text+code, MPT models match and, in many ways, surpass LLaMa-7B. This release includes 4 models: MPT-Base, Instruct, Chat, & StoryWriter (🧵)
📣Announcing MosaicML Inference 📣
Ever wanted a text or image generation API that doesn’t make you send data to a third party?
Or a cheaper solution than paying by the token?
Or an easy way to get a trained model into production?
We can help with that. 🧵
Introducing LLM training on AMD hardware!
MosaicML + PyTorch 2.0 + ROCm 5.4+ = LLM training out of the box with zero code changes.
With MosaicML, the ML community has additional hardware + software options to choose from.
Read more:
Meet MPT-30B, the latest member of @MosaicML's family of open-source, commercially usable models. It's trained on 1T tokens with up to 8k context (even more w/ ALiBi) on A100s and *H100s*, with big improvements to Instruct and Chat. Take it for a spin on HF!
Meet PubMed GPT 🩺 a new SOTA on the US Medical Licensing Exam, developed by MosaicML and @StanfordHAI. It's a normal GPT-3B model trained on medical data that bests hand-designed med models and generic models 40x bigger, a sweet spot for foundation models 🧵
[1/8] Full technical details on our Stable Diffusion 2.0 speedrun are here! On Wednesday, we announced that we had replicated SD2 for < $50k, a 2.7x improvement over our baseline and 6x over Stability's reported cost. Today, we share the technical nitty-gritty on how we did it:
Woo hoo! 🙌 What an honor to make the @Forbes AI 50 List. MosaicML empowers you to build your own #GenerativeAI. Train, finetune, and deploy your custom #LLM today:
Got an extra $20 burning a hole in your wallet? With the MosaicBERT architecture + training recipe, you can now pretrain a competitive #BERT-Base model from scratch on the MosaicML platform for the cost of a large pizza! 🍕⚡️👏 Learn more:
Announcing MPT-7B-8K: a 7B parameter open-source LLM with 8k context length trained with the MosaicML platform.
With its 8k context length, MPT-7B-8K specializes in document summarization and question-answering, and may be used commercially.
Read more:
We have exciting news! In our latest and greatest LLM blog, we show how MosaicML Cloud can help you train LLMs from 1B - 70B parameters, and for the first time, publish transparent times + costs for doing so. It's a lot cheaper than you think! (1/9)
Want to train your own custom #LLMs while keeping your data private? 🔒 We got you. The MosaicML platform keeps your data, models, and source code secure in your private network while abstracting away the complexity of #ML training infrastructure. Learn more:
The MosaicML team is excited to present at the @weights_biases webinar this Thursday, 23-Feb-2023, 7PM CET/10AM PST! Our very own @leavittron will be joining W&B's @carey_phelps to showcase MosaicML #LLM training and W&B's Model Registry. Register at
Large Language Models (LLMs) are gaining in popularity, but training these models from scratch can be a huge pain... until now! Our latest LLM blog series uncovers how to reduce the time, cost, and complexity of training these billion-parameter models:
How good are @nvidia H100s actually? In collaboration with @CoreWeave, we benchmarked A100 vs H100 performance for large language model training.
Here's what we found: [1/6]
Today, an exciting paper from @MSFTResearch:
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
While it's too early to say, this may be remembered as the single biggest efficiency advancement in hyperparameter tuning.
How can the ML community measure LLM quality in a holistic and standardized manner?
The Mosaic Model Gauntlet encompasses 34 benchmarks, organized into 6 broad categories of competency, evaluated with our blazingly fast open-source ICL eval harness.
🧵👇
Hello World! Today we come out of stealth to make ML training more efficient with a mosaic of methods that modify training to improve speed, reduce cost, and boost quality. Read our founders' blog by @NaveenGRao, @hanlintang, @mcarbin, and @jefrankle (1/4)
Is overparametrization the key to solving ImageNet? In this NeurIPS Outstanding Paper Award winner, @SebastienBubeck and @geoishard examine the significance of neural network overparametrization and the universal law of robustness. (1/6)
Is forgetting actually beneficial for training? This top-reviewed ICLR paper () introduces a powerful paradigm that consistently outperforms all other methods on ResNet18 and DenseNet169 on a variety of image datasets. (1/8)
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism
They replace the attention mechanism in a Swin transformer with a simple spatial shift of some of the channels. Turns out this actually works.
Ready to use a programmatic approach to prompting #LLMs and building #RAG applications? The @stanfordnlp #dspy repo includes support for @databricks Model Serving and Vector Search! Details:
📢 Today, we're thrilled to announce that @Databricks has completed its acquisition of MosaicML. Our teams share a common goal: to make #GenerativeAI accessible for all. We're excited to change the world together!
Read the press release and stay tuned for more updates:
Data2Vec:
A new paper by Meta AI claims to be “The first high-performance self-supervised algorithm that works for speech, vision, and text.” And the results look very promising.
"Scaling Laws for Neural Language Models" was one of the papers that fueled the recent push towards larger models.
@DeepMind revisits the question, "How should one scale model size relative to dataset size?" and finds some surprising answers!
Our NLP architect @abhi_venigalla continues his work on the use of AMD accelerators at scale for #LLM training. In our latest @databricks blog post, he shares multi-node training performance results on MI250 GPUs:
How much does it take to train a Stable Diffusion model from scratch? The answer: 79,000 A100-hours in 13 days, for a total training cost of <$160k. Our tooling reduces the time and cost to train by 2.5x, and is also extensible and simple to use.
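For scale, those figures imply a rate of roughly $2 per A100-hour; the rate below is our assumption, since the tweet gives only the GPU-hours and the total:

```python
# Back-of-envelope check on the quoted Stable Diffusion training run.
# The $2/A100-hour rate is assumed; the tweet states only 79,000
# A100-hours over 13 days for a total under $160k.
a100_hours = 79_000
dollars_per_a100_hour = 2.0                 # assumed cloud rate
total_cost = a100_hours * dollars_per_a100_hour   # 158,000
gpus_used = a100_hours / (13 * 24)          # ≈253 GPUs running for 13 days
```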
We've shared great research before, but reproducing methods from papers is hard.
Announcing Composer, our library of ML speedups:
Train CV models ~4x faster and NLP models ~2x faster at the same accuracy -- with minimal tuning. (1/5)
We're excited to share details about how we use cyclic LR schedules to estimate time/cost/accuracy tradeoff curves for model training. This is a key element in our approach to #EfficientML -- how we benchmark the speedup methods we implement. Article here:
Today we take a look at a new architecture from DeepMind called RETRO: Retrieval-Enhanced Transformer (). RETRO uses a large database of documents along with an embedding-based retrieval system to improve the “knowledge” of transformers at runtime. (1/15)
DBRX is the top open-source model on the latest WildBench Leaderboard on HuggingFace! Thanks to our friends @allen_ai for this benchmark of LLMs with challenging tasks from real users in the wild. #DBRX
Today, we're excited to share our MLPerf Training 2.0 results: 23.8 minutes to train ResNet-50 to 75.9% validation accuracy on ImageNet, using 8x NVIDIA A100 GPUs.
"RankGen: Improving Text Generation with Large Ranking Models"
It's tempting to look at this paper as yet another method to make the numbers go up. But there's another story here that's much more interesting. [1/14]
📢 Introducing MosaicBERT! Now you can pretrain a high-quality BERT model from scratch on the MosaicML platform for $20. So why should you train your own BERT model? 👇 (1/5)
Happy December! Today, we're looking back at Stochastic Weight Averaging (SWA), now a classic ML efficiency win! SWA is a simple method for improving accuracy with no increase in training time. It is built into fastai, PyTorch, PyTorch Lightning, and our Composer library. (1/12)
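At its core, SWA just keeps a uniform running average of weight snapshots taken along the training trajectory (PyTorch ships this as `torch.optim.swa_utils`); a minimal pure-Python sketch:

```python
def swa_average(checkpoints):
    """Stochastic Weight Averaging: average parameter snapshots
    collected late in training (e.g., once per epoch).
    `checkpoints` is a list of flat parameter lists, one per snapshot."""
    n = len(checkpoints)
    return [sum(ws) / n for ws in zip(*checkpoints)]
```

Evaluating at the averaged point tends to land in a flatter, better-generalizing region of the loss surface than any single snapshot.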
New blog post! Take a look at some best practices for efficient CNN training, and find out how you can apply them easily with our Composer library: #EfficientML
We launched the MPT-7B foundation models just over a month ago, and since then, they’ve been downloaded over 2 million times! We are humbled by this warm reception, and thrilled to see the vibrant #LLM community rise up to share how they're using them!
What makes these models special?
* Licensed for commercial use (unlike LLaMA)
* Trained on more data than any comparable open-source LLM.
* Handle extremely long inputs (trained up to 65k, goes up to 84k w/ALiBi)
* Really fast inference and training code.
Our #DBRX open source #LLM is now available on ! Sign in (or sign up), select custom models, and start chatting with our DBRX-Instruct model. Get started:
Today, we're looking at fine-tuning large models, and this paper submitted to ICLR:
It shows fine-tuning can hurt performance on out-of-distribution examples, and explains why with some nice theory. We'll be keeping an eye on this! (1/8)
Today, we're looking at a recent paper out of DeepMind outlining ReLICv2 (), which shows an exciting step forward in self-supervised learning! S/O to @weballergy
"ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers"
Paper:
Post-training quantization from the DeepSpeed team that can reduce inference cost by up to 5x. [1/8]
🚨 A few months ago we announced that you can train Stable Diffusion from scratch for less than $125k using the MosaicML platform.
A major price drop is coming...and we have the training run to back it up. Stay tuned for a major announcement this week!
🎉 🎉🎉 We have a new price on training Stable Diffusion 2 from scratch:
$50k trained on the MosaicML Platform.
We replicated Stable Diffusion 2.0 with massive training speedups, and now you can too.
Learn more in our latest blog post:
Should you train your ViT? While Vision Transformers (ViT) have delivered groundbreaking performance, that performance often depends on tons of pre-training and data augmentations. This paper () shows that ViTs can be performant without pre-training! (1/11)
"A Neural Corpus Indexer for Document Retrieval"
Paper:
They train a seq-to-seq model to directly spit out document IDs given queries. And it works really well. Like, these are some of the largest accuracy lifts I've ever seen in a paper. [1/15]
New blog post!
@zeqiuwu1, @huyushi98, and @rajammanabrolu share a recent highlight from their work in #LLM finetuning research: Fine-Grained Reinforcement Learning from Human Feedback (RLHF)
Is this the end of contrastive self-supervised pretraining? In this 🧵, we’ll discuss Masked Autoencoders Are Scalable Vision Learners, the exciting new work from Kaiming He, @endernewton, @sainingxie, Yanghao Li, and @inkynumbers at @MetaAI 1/15
There's fast, and then there's blazingly fast. 🔥🔥🔥 With Composer and MosaicML Cloud, you can now evaluate #llms on in-context learning tasks (LAMBADA, HellaSwag, PIQA, and more) hundreds of times faster than other evaluation harnesses.
Read more:
We're excited to have contributed to the OLMo open-source model from @allen_ai, which was developed using the @databricks Mosaic AI training platform. Read the blog post from @jefrankle to learn more:
Hello OLMo! Congrats to the amazing @allen_ai team! 7B params, 2T tokens, open training code, open data, intermediate checkpoints, Apache 2.0, the works. A giant leap for open science. Nicely done @mechanicaldirk, @i_beltagy, @soldni, and so many others!
📦 To evaluate the coding capabilities of LLMs, you need to execute the code. But what if the LLM spits out malicious code?😱
With MosaicML, you can now evaluate #LLMs on code-gen benchmarks (e.g. HumanEval) in an effortless, end-to-end secure framework.
We’ve released Composer 0.8.0, which introduces a HuggingFaceModel object for reading in your existing 🤗 Transformers models. Training or fine-tuning BERT models with Composer just got much easier.
See the release notes for the full set of enhancements:
The release of @PyTorch 2.0 and AMD ROCm 5.4 has @MosaicML “excited to announce that LLM training works out of the box on AMD Instinct data center GPUs, with zero code changes…” Read more about how the AMD Instinct MI250 helps developers train #AI models.
Do language models really need tokenizers? Work by @colinraffel's group suggests the answer is often no. Their ByT5 model () modifies mT5 to take in raw UTF-8 bytes instead of output from a tokenizer. (1/9)
What should you do if you want to effectively and cheaply “instruction finetune” an LLM? @aditi_jh and @JacobianNeuro share some important insights. (1/5)
Shoutout to @jeremyphoward, @math_rachel, and the whole @fastdotai team. Two-way callbacks and other ideas helped a lot in designing Composer (). We're standing on the shoulders of giants in our shared mission to make AI accessible to everyone.
Since becoming part of @databricks last July, the MosaicML team has continued its mission to optimize and improve #GenAI model training. Our rigorous science leads to real-world results. Visit our new research hub to discover what we've been working on:
Today, we look at optimizing data movement for transformer deep learning networks. In this paper (), Ivanov et al. show that 40% of the runtime is spent in data movement, and that training has become memory bound for transformer networks. (1/6)
It's Christmas in July for the ML Community! 🎄 We found that AMD systems appear stable and deliver training results consistent with NVIDIA systems when using MosaicML's training stack. With StreamingDataset's elastic determinism, we can get the same loss curves.
Think it’s too hard—or too expensive—to train your own GPT or diffusion models from scratch? Think again.
We built the MosaicML platform to tackle the challenges of training large models and unleash the power of #generativeAI. Learn more:
"Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam"
One of the biggest bottlenecks in distributed training is communication between nodes. The 0/1 Adam optimizer can train a BERT-Large while syncing only 1.03 bits/parameter on average. [1/7]
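The communication trick can be sketched independently of the optimizer: ship only a sign per entry plus one shared scale. This is a toy illustration of 1-bit compression, not DeepSpeed's actual implementation:

```python
def one_bit_compress(grad):
    """Toy 1-bit gradient compression: keep only the sign of each
    entry plus a single shared magnitude (the mean absolute value),
    so per-parameter communication drops from 32 bits to ~1 bit."""
    scale = sum(abs(g) for g in grad) / len(grad)
    signs = [1 if g >= 0 else -1 for g in grad]
    return signs, scale

def one_bit_decompress(signs, scale):
    """Reconstruct an approximate gradient from signs and scale."""
    return [s * scale for s in signs]
```

Real systems like 0/1 Adam also keep an error-feedback buffer so the quantization error from one step is compensated on later steps.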
The model is really good. Across 12 different in-context learning tasks, it nearly always surpasses every other open-source model < 30B params and trades off with LLaMa-7B for the best open-source model. Plus it's commercially usable, and finetuned versions are available now.
Composer is trending on GitHub (python)!
Composer helps train ML models faster and cheaper through algorithmic efficiency, and the world is taking notice, thanks to this wonderful community!
See what all the buzz is about in our repo -- and give us a ⭐!
Large Language Models are notorious for being expensive to train, but provide a model that can be evaluated on generalized language understanding benchmarks. What if the goal is to perform well on a task-specific benchmark instead? Can we cut down the costs of pre-training? (1/9)
We used the same tools as our customers: the MosaicML platform, Composer, StreamingDatasets, etc. They made training a piece of cake. Here's our training logbook. 🥱 No loss spikes (we fixed them architecturally). Four hw failures handled automatically. ZERO human intervention.
Does your application require high accuracy but have tight inference constraints? Are you willing to pay any training cost to achieve this? If so, today’s paper may be for you! They reach 82.8% accuracy on ImageNet using only a ResNet-50 model! (1/9)
Today, we’re looking at “Augmenting Convolutional networks with attention-based aggregation” which proposes a modification to the average pooling used in many conv nets and a particularly simple new architecture. (1/10)
We are thrilled to see promising alternative options for AI hardware to provide the best cost, performance, and developer experience for our customers.
Exciting new speedups from @aleks_madry's lab with results on #Imagenet! Optimized data loading with FFCV removes the CPU bottleneck that normally limits the throughput of ResNet+ImageNet+A100 training. (1/4)
ImageNet is the new CIFAR! My students made FFCV (), a drop-in data loading library for training models *fast* (e.g., ImageNet in half an hour on 1 GPU, CIFAR in half a minute).
FFCV speeds up ~any existing training code (no training tricks needed) (1/3)
“Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models”
GLUE is worn out. Even SuperGLUE doesn’t stick anymore. Enter BIG-bench, a collection of 204 tasks contributed by 444 authors, designed for evaluating large language models.
Join us next week at @weights_biases' Fully Connected conference on Wednesday, June 7th in San Francisco. Our CTO/Co-Founder @hanlintang will be speaking alongside a roster of #generativeAI luminaries. Full agenda here:
To highlight StoryWriter: Its final training stage has a 65k-token context, 32x LLaMa and 2x GPT-4. This crazy length works out of the box with our LLM Foundry on standard GPUs. We used ALiBi position encodings: they handle any input length and extrapolate longer (84k in our testing).
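ALiBi replaces learned position embeddings with a fixed linear bias on attention scores, which is why it extrapolates. A minimal sketch of the bias computation (head slopes follow the paper's geometric sequence):

```python
def alibi_bias(seq_len, num_heads):
    """ALiBi: each head h subtracts slope_h * (query - key) distance
    from its attention scores before the softmax. Nothing positional
    is learned, so the model can handle sequences longer than any
    seen in training."""
    slopes = [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]
    return [
        [[-slopes[h] * max(0, q - k) for k in range(seq_len)]
         for q in range(seq_len)]
        for h in range(num_heads)
    ]
```

Because the bias is a pure function of distance, extending `seq_len` at inference time needs no retraining.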
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
The DeepSpeed team made an awesome family of MoE models + systems optimizations:
Great to see a MosaicML citation in the wild! Spotted in 'DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing' () by @conglongli, @yuxionghe, et al.
Our team is incredibly proud to partner with @allen_ai and thrilled to see them cook! Achieving such a massive improvement in MMLU, while reducing the compute budget, is a fantastic win. And doing it fully open? Everyone wins. Congrats! Can't wait to see what's next 👀
Announcing our latest addition to the OLMo family, OLMo 1.7!🎉Our team's efforts to improve data quality, training procedures and model architecture have led to a leap in performance. See how OLMo 1.7 stacks up against its peers and peek into the technical details on the blog:
Wrangling large #datasets doesn't have to be so hard.
We're here to spare you the headaches.
🎉 Announcing StreamingDataset, designed to make distributed training on large datasets from #cloud storage as fast, accurate, and scalable as possible.
MPT-30B is a bigger sibling of MPT-7B, which we released a few weeks ago. The model arch is the same, the data mix is similar, and the context grew to 8k. We massively upgraded the Instruct and Chat variants over MPT-7B. See the full details in our blog!
Cheers from the MosaicML holiday party! With 2021 winding down, it’s a natural time to take a look back. MosaicML exists with one core mission: to make ML training efficient for everyone. After our first year as a company, we couldn’t be happier with what we’ve accomplished:(1/6)
New year, new summaries! Let's look at dataset quality and its impact on sample efficiency. This paper () studies the ineffectiveness of active learning on visual question answering (VQA) datasets and points to *collective outliers* as the culprit. (1/8)
Technical details time! How did we do this? We started with our own custom variant of the transformer architecture, modified for speed and efficiency (no surprise from us). And then we trained on a ton of data on 440 A100s for 9.5 days.
MPT-7B comes in four different flavors. MPT-7B-Instruct is a commercially-usable instruction-following model finetuned on Dolly+HHRLHF. MPT-7B-Chat is a chatbot finetuned on Alpaca & friends. MPT-7B-StoryWriter-65k+ is finetuned on books w/ a 65k context; it writes awesome fiction.
👀 A LOT more possibilities are about to open up. 64K context length means that your LLM can consume and process much longer documents AND write longer responses for text generation!
Stay tuned for a major LLM announcement later this week.
🤯🤯 LLM trained with 64K+ context length! What could you do with that? Prompted our model with the ENTIRE contents of "The Great Gatsby" and asked it to write the epilogue. Snippet 👇
Model dropping soon to an open-source repo near you.
Epilogue:
It seemed to me that Gatsby
Check out our deep-dive blog post on the Mosaic ResNet training recipe. See the details of our observations, and how to reproduce these results for your needs. We're able to achieve 7x faster training for ResNet-50, and so can you! #EfficientML
The team at @Replit is doing amazing work, and we’re thrilled to provide them with the MosaicML platform to fuel their AI model training needs. Check out their post that shows a holistic view of the LLM lifecycle, and the ecosystem they’ve built:
How much did it cost to train? At list price on @MosaicML, it was between $714k and $871k depending on your GPU choice. It's also incredibly cheap to fine-tune, at between $714 and $871 per 1B tokens.
One of our researchers just went through @TimDettmers' paper on 8-bit optimizers: . It is a pretty cool and very practical way to reduce memory consumption for large models. #EfficientML (1/4)
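The key mechanism, as we understand the paper, is block-wise quantization of optimizer state: each small block of values stores low-bit integer codes plus one float scale. A toy absmax version (illustrative only, not the paper's exact scheme):

```python
def quantize_blockwise(values, block_size=4):
    """Toy block-wise absmax quantization: each block keeps int8-range
    codes plus one float scale, cutting state memory roughly 4x vs.
    float32 while bounding the per-block error."""
    blocks = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        scale = max(abs(v) for v in block) or 1.0   # avoid div-by-zero
        codes = [round(v / scale * 127) for v in block]
        blocks.append((codes, scale))
    return blocks

def dequantize_blockwise(blocks):
    """Reconstruct approximate float values from codes and scales."""
    return [c / 127 * scale for codes, scale in blocks for c in codes]
```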
A new standard for performance made easy. Our MLPerf results today show leading NLP performance, speeding up training of
@huggingface
BERT models by 2.7x. Easily enable our optimizations on the MosaicML Cloud with a single flag.
What about with FP8?
To test this, we integrated NVIDIA’s TransformerEngine into our LLM training stack. As advertised, this took just a few lines of code.
On a billion-parameter LLM, convergence in FP8 equaled that of BF16 with no hyperparameter changes! [4/6]