Announcing FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores!
We speed up exact FFT convolutions by up to 7.93x over PyTorch, reduce memory footprint, and get 4.4x speedup end-to-end. Read on for more details:
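For readers who want the baseline in code: here's a minimal PyTorch sketch of the exact FFT convolution that FlashFFTConv accelerates (illustrative only; the function name and shapes are mine, not the library's API):

```python
import torch

def fft_conv(u, k):
    """Exact long convolution via FFT: y[t] = sum_{s<=t} k[s] * u[t-s].

    u: (batch, channels, seqlen) inputs; k: (channels, seqlen) filters.
    A naive reference in PyTorch, not FlashFFTConv's fused kernel.
    """
    seqlen = u.shape[-1]
    fft_size = 2 * seqlen                 # zero-pad to avoid circular wrap-around
    u_f = torch.fft.rfft(u, n=fft_size)
    k_f = torch.fft.rfft(k, n=fft_size)
    return torch.fft.irfft(u_f * k_f, n=fft_size)[..., :seqlen]

u = torch.randn(2, 64, 1024)
k = torch.randn(64, 1024)
y = fft_conv(u, k)                        # (2, 64, 1024), O(N log N) per channel
```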
Thanks @arankomatsuzaki and @_akhaliq for sharing!
Attention is all you need... but how much of it do you need?
Announcing H3 - a new generative language model that outperforms GPT-Neo-2.7B with only *2* attention layers! Accepted as a *spotlight* at #ICLR2023! 📣 w/ @tri_dao 📜 1/n
We spent a couple days this week speeding up Stable Diffusion in @huggingface Diffusers using FlashAttention. 3-4x faster than the original version, 33% faster than the super optimized v0.4.1 - and >1 image/s throughput on A100. w/ @tri_dao
A short thread on how we did it👇
Excited about models that are sub-quadratic in sequence length and model dimension? Our Monarch Mixer paper is now on arXiv -- and super excited to present it as an oral at #NeurIPS2023!
Let's dive into what's new with the paper and the new goodies from this release:
This sentiment is exactly right - and why we've been working to increase sequence length in our lab for the past two years!
From FlashAttention, to S4, H3, Hyena, and more - check out our blog post putting this line of work into context:
More below: 1/n
New year, new model drop!
w/ @JonSaadFalcon, @simran_s_arora, excited to release new long-context retrieval models with Monarch Mixer, up to 32K sequence length! A first step towards long-context retrieval, outperforming Mistral, BGE, OpenAI on long-context document retrieval. 1/
S4 is an amazing sequence model - but has seemed mysterious. It doesn't have to be!
In this blog (originally an internal explainer for our group), @HazyResearch looks at S4 from first principles that are familiar to most sophomore engineering students.
What's the simplest model that can get the job done?
New paper and blog post on how the answer for sequence modeling (including language) may be convolutions... with a touch of regularization.
📜
🖥️
⌨️ 1/n
You've heard of models that are sub-quadratic in sequence length, but what if they were sub-quadratic in model *dimension* too?
Announcing a preview of Monarch Mixer - a fully sub-quadratic & hardware-efficient architecture that matches BERT in quality! w/ @simran_s_arora 1/
The Stanford MLSys Seminar is now available in podcast form on Apple Podcasts, Spotify, Google, and more!
We release new podcasts every Monday and Friday (new episodes on Fridays, old episodes from the backlog on Mondays).
Check us out on your favorite platform below! (1/n)
One key point: SSMs are *linear* in sequence length instead of quadratic, and have no fixed context length. Long context for everyone!
We're super excited, so we're releasing our code and model weights today - up to 2.7B parameters!
2/n
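To make the linear-scaling claim concrete, here's a toy sketch (my own simplification, not the released model code) of why SSM generation is a constant-size state update per token:

```python
import torch

def ssm_generate(u, A, B, C):
    """Toy diagonal SSM: x_t = A * x_{t-1} + B * u_t,  y_t = C . x_t.

    One constant-size state update per token: O(seqlen) time, O(1) memory for
    the history, and no fixed context window. u: (seqlen,) scalar inputs.
    """
    x = torch.zeros_like(A)
    ys = []
    for u_t in u:                 # linear in sequence length
        x = A * x + B * u_t       # fixed-size state carries the whole history
        ys.append((C * x).sum())
    return torch.stack(ys)

d_state = 16
A = 0.9 * torch.rand(d_state)     # stable per-channel decay
B, C = torch.randn(d_state), torch.randn(d_state)
y = ssm_generate(torch.randn(4096), A, B, C)    # (4096,)
```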
Blog alert! 📣
How does contrastive learning work? How can we apply it effectively? New *3-part series* covering *2 new papers* on getting better transfer & robustness, and how to apply contrastive learning with types to improve entity retrieval.
Part 1:
👇 (1/n)
Thrilled that FlashAttention won the best paper award at the Hardware Aware Efficient Training workshop at ICML - really excited to meet so many like-minded folks at the workshop.
Thanks to the organizers (and NVIDIA) for the GPU!
Announcing FlashAttention, a fast and memory-efficient attention algorithm with no approximation! 📣 w/ @realDanFu
By reducing GPU memory reads/writes, FlashAttention runs 2-4x faster & requires 5-20x less memory than PyTorch standard attention, & scales to seq. length 64K. 1/
New preprint alert! 📣
How do we fuse foundation models with weak supervision? Liger (🐯 +🦁) is a new weak supervision framework that fuses FMs + WS using *local smoothness* -- outperforming both FMs and WS on their own.
📜
More below 👇 (1/n)
Super excited to release the RedPajama dataset - a new, fully open *1.2 trillion token* dataset following the LLaMA recipe. A first step towards creating leading, fully open-source large language models.
Announcing RedPajama — a project to create leading, fully open-source large language models, beginning with the release of a 1.2 trillion token dataset that follows the LLaMA recipe, available today!
More in 🧵 …
Today I'm talking about FlashFFTConv at the ENLSP workshop (Efficient Natural Language and Speech Processing)! The talk is at 9:48 AM, and the poster session is from 1:00 to 2:00!
I'm flying out to #NeurIPS2023 @NeurIPSConf! I'll be presenting an oral on Monarch Mixer tomorrow at 3:40 in the Oral 2A session, and I'll be presenting FlashFFTConv Saturday at the ENLSP workshop!
Monarch Mixer:
FlashFFTConv:
Super excited for this model to see the light of day!
7B model, hybrid gated conv/SSM + attention architecture, trained for long context and running FlashFFTConv everywhere.
You can chat with it now on the Together API!
Announcing StripedHyena 7B — an open-source model using an architecture that goes beyond Transformers, achieving faster performance and longer context.
It builds on the lessons learned over the past year designing efficient sequence modeling architectures.
After a short hiatus, the Stanford MLSys Seminar is coming back this quarter with a special series of episodes on foundation models!
Our first talk (ep 67!!) will be @tri_dao, who'll be talking about FlashAttention. Catch us *TOMORROW* at 3:30 PT:
ChatGPT's 1700-token system prompt got you down?
Led by @jordanjuravsky, @brad19brown, introducing Hydragen, a simple technique for Transformer LLM inference with shared prefixes! Up to 30x improvement in throughput with no custom CUDA!
A few things I love in this project: 1/
Excited to share my first PhD project!
TLDR: Hydragen is an exact, simple (no custom CUDA) implementation of attention for large batches with shared prefixes. We can improve LLM throughput by over 30x for CodeLlama-13b. Also, adding lots more shared context becomes cheap:
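For the curious, a hedged sketch of the core trick as I understand it: exact attention over [shared prefix; per-sequence suffix] can be split into two attentions and recombined with their log-sum-exp weights, so the prefix KV is read once for the whole batch. Names and shapes below are illustrative, not Hydragen's code.

```python
import math
import torch

def chunk_attn(q, k, v):
    """Attention restricted to one KV chunk, plus its log-sum-exp."""
    s = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])   # (..., q_len, kv_len)
    lse = torch.logsumexp(s, dim=-1)                        # (..., q_len)
    return torch.softmax(s, dim=-1) @ v, lse

def shared_prefix_attn(q, k_pre, v_pre, k_suf, v_suf):
    """Exact attention over [prefix; suffix] from two partial attentions.

    k_pre/v_pre carry no batch dim (shared by every sequence), so the prefix
    pass is one batched matmul; k_suf/v_suf are per-sequence.
    """
    o_pre, lse_pre = chunk_attn(q, k_pre, v_pre)
    o_suf, lse_suf = chunk_attn(q, k_suf, v_suf)
    total = torch.logaddexp(lse_pre, lse_suf)
    a_pre = torch.exp(lse_pre - total).unsqueeze(-1)        # softmax mass on prefix
    a_suf = torch.exp(lse_suf - total).unsqueeze(-1)        # softmax mass on suffix
    return a_pre * o_pre + a_suf * o_suf

# toy decode step: batch of 4 sequences sharing one 8-token prefix
q = torch.randn(4, 1, 16)
k_pre, v_pre = torch.randn(8, 16), torch.randn(8, 16)
k_suf, v_suf = torch.randn(4, 3, 16), torch.randn(4, 3, 16)
out = shared_prefix_attn(q, k_pre, v_pre, k_suf, v_suf)     # (4, 1, 16)
```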
Today I'm talking about FlashFFTConv at the ENLSP workshop (Efficient Natural Language and Speech Processing)! The talk is at 9:48 AM, and the poster session is from 1:00 to 2:00!
I'll be at #NeurIPS2022 this week! @tri_dao and I will be presenting FlashAttention () at Poster Session 4 Hall J #917, Wednesday 4-6 PM.
Super excited to talk all things performance, ML+systems, and breaking down scaling bottlenecks!
Super excited to share some thoughts with @laurel_orr1 on lessons learned from the past four years with @HazyResearch and @SnorkelML, and what's next for the ways that machine learning is changing how we build software:
This Thursday, @srush_nlp from @cornell_tech will be talking to us about going beyond softmax in NLP. As always, 30 minute talk + 30 minute podcast with live audience questions, be sure to tune in!
Livestream link:
#Stanford #MachineLearning
New preprint alert! 📣
How do we produce transferable and robust representations with supervised contrastive learning? We need *geometric spread* and an inductive bias towards *latent subclass clustering* in representation space.
📜
👇 (1/n)
In H3, we replace attention with a new layer based on state space models (SSMs) - with the right modifications, we find that it can outperform Transformers.
Two key ideas:
* Adapting SSMs to be able to do *comparison*
* Making SSMs as hardware-efficient as attention 3/n
We built off the super-optimized version of Diffusers that @Nouamanetazi / @huggingface released last week - the diff is pretty small, 68 LOC:
Training our first RedPajama 7B model is going well! Less than halfway through training (after 440 billion tokens), the model achieves better results on HELM benchmarks than the well-regarded Pythia-7B trained on the Pile.
Details at
New preprint alert! 📣
How do we fuse foundation models with weak supervision? Liger (🐯 +🦁) is a new weak supervision framework that fuses FMs + WS using *local smoothness* -- outperforming both FMs and WS on their own.
📜
More below 👇 (1/n)
Overall, really excited about new models/architectures like this. What happens if we don't need attention to get the magic we've been seeing, and we can get the same quality with a linear operator?
No more fixed context windows, long context for everyone! 16/n
Super excited by this work. Making attention IO-aware makes it run way faster - and enables much longer sequences, since memory footprint becomes linear in sequence length.
Really excited to see how this gets used, and where it goes next - IO-aware transformers?
Announcing FlashAttention, a fast and memory-efficient attention algorithm with no approximation! 📣 w/ @realDanFu
By reducing GPU memory reads/writes, FlashAttention runs 2-4x faster & requires 5-20x less memory than PyTorch standard attention, & scales to seq. length 64K. 1/
FlashAttention speeds up attention and reduces its memory footprint - without any approximation. Our key insight is that attention is bottlenecked by GPU memory *reads/writes*. FlashAttention speeds up attention by reducing the R/W. Same FLOPs, 3-4x faster!
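If you want to see the idea in code: a minimal sketch (plain PyTorch, illustrative only, so it shows the algorithm rather than the kernel-level speed) of attention computed block-by-block with an online softmax, which is what lets FlashAttention avoid ever writing the N x N score matrix to GPU memory.

```python
import math
import torch

def tiled_attention(q, k, v, block=128):
    """Attention computed one key/value block at a time with an online softmax.

    The full (N x N) score matrix is never materialized -- the same idea that
    lets FlashAttention keep tiles in SRAM instead of round-tripping HBM.
    """
    scale = 1.0 / math.sqrt(q.shape[-1])
    n = k.shape[-2]
    m = torch.full(q.shape[:-1], float("-inf"))      # running max of scores
    l = torch.zeros(q.shape[:-1])                    # running softmax denominator
    o = torch.zeros_like(q)                          # running (unnormalized) output
    for start in range(0, n, block):
        kb = k[..., start:start + block, :]
        vb = v[..., start:start + block, :]
        s = q @ kb.transpose(-2, -1) * scale         # (..., q_len, block)
        m_new = torch.maximum(m, s.max(dim=-1).values)
        alpha = torch.exp(m - m_new)                 # rescale stats from earlier blocks
        p = torch.exp(s - m_new.unsqueeze(-1))
        l = l * alpha + p.sum(dim=-1)
        o = o * alpha.unsqueeze(-1) + p @ vb
        m = m_new
    return o / l.unsqueeze(-1)

q, k, v = (torch.randn(2, 512, 64) for _ in range(3))
out = tiled_attention(q, k, v)
ref = torch.softmax(q @ k.transpose(-2, -1) / 8.0, dim=-1) @ v
print(torch.allclose(out, ref, atol=1e-5))           # same FLOPs, same answer
```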
Build your own ChatGPT!
Super excited by this open-source release - even more exciting that it was trained 100% carbon-negative. Happy to play a (minuscule) part in putting it together and helping serve it faster.
Looking forward to seeing what folks build on top of this!
Why Train What You Can Code? Excited to share Rekall - using programmatic composition to find new events in video!
Paper on arXiv, and code available on GitHub!
Blog:
Get faster, more flexible inference on GPUs using our newly open-sourced AITemplate, a revolutionary new inference engine that delivers up to 12X performance improvements on NVIDIA GPUs & 4X on AMD GPUs compared to eager-mode within PyTorch.
Learn more:
Absolutely thrilled to receive the best paper award w/ @MayeeChen for our work on supervised contrastive learning at the AI with Biased/Scarce Data Workshop at @RealAAAI today! Check out the paper on the workshop website:
Short 🧵👇 - more soon! (1/n)
Ce Zhang (@DS3Lab and @togethercompute) has done some crazy stuff in distributed training. In this talk, he goes over the magic behind distributed training and inference on a GLOBAL scale over slow networks!
Tune in tomorrow at 3:30 pm Pacific!
Attending #ICML2023? Join us Saturday at our workshop on Efficient Systems for Foundation Models!
🔥 Large-Scale Distributed Training
🚀 Efficient Inference
⚙️ Deep Optimization
📈 Over 50 posters and 4 orals spanning from RL to efficient finetuning!
The deadline for our #ICML2023 workshop Efficient Systems for Foundation Models is tomorrow, May 31 AOE!
Submit your best papers on training, inference or anything FM systems and efficiency - then join us for a great day of speakers & panel in Hawaii!
The upshot: we can scale H3 up to *2.7B* parameter models. And because of the state passing, we can run inference blazing fast -- up to *2.4x* faster than highly-optimized Transformers.
Up to 1,980 tokens/second! 12/n
If you're at ICLR, catch my talk on our paper Hungry Hungry Hippos: Towards Language Modeling with State Space Models today at 10 AM in room AD12! Featuring photos of actual Rwandan hippos :)
(+poster from 11:30-1:30 at board 80!)
🛫 to Rwanda for #ICLR2023! I’ll be giving a talk about H3 on Wednesday, and talking about some newer work on long convs at the ME-FoMo workshop on Thursday.
Please reach out if you’ll be there and want to chat! Happy to talk about Hyenas, Red Pajamas, or anything else!
The H3 layer closes the gap on our synthetics, and the gains translate to strong downstream performance on language modeling.
We replaced almost all the attention blocks in a Transformer with H3 layers, and trained on the Pile. Our model *outperforms* GPT-Neo in PPL! 7/n
We were actually a bit late to the game here - when we saw a couple folks on Reddit and elsewhere who beat us to the punch, we decided to give it a try ourselves :)
PhotoRoom:
u/hnipun:
One final plug: Oral 2A Efficient Learning tomorrow is absolutely **packed** with great work from @Tim_Dettmers and @srush_nlp - super excited to hear what they have to say!
(1/n) This week we have @fredsala on the Stanford MLSys Seminar, live on Thursday at 1:30 PM! Fred was a postdoc at @StanfordAILab, and is now a professor at @WisconsinCS and a research scientist at @SnorkelAI -- so he knows a thing or two about MLSys.
I'm flying out to #NeurIPS2023 @NeurIPSConf! I'll be presenting an oral on Monarch Mixer tomorrow at 3:40 in the Oral 2A session, and I'll be presenting FlashFFTConv Saturday at the ENLSP workshop!
Monarch Mixer:
FlashFFTConv:
We are excited to release RedPajama-Data-v2: 30 trillion filtered & de-duplicated tokens from 84 CommonCrawl dumps, 25x larger than our first dataset.
It exposes a diverse range of quality annotations so you can slice & weight the data for LLM training.
The MLSys Seminar is back this week with our very own @BeidiChen! Tune in Thursday, 1:30 PM on YouTube to hear about her great work on sparsity in deep learning.
Livestream link:
#Stanford
#MachineLearning
We sped up stable diffusion by replacing the self-attention/cross-attention blocks in the unet with FlashAttention. FlashAttention doesn't do any approximation, so you get the *exact same image* at the end.
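We used the FlashAttention CUDA kernel for the swap; as a rough modern analogue (not what we shipped at the time, and assuming a CUDA GPU with PyTorch 2.x), the built-in fused scaled_dot_product_attention makes the same exact-attention substitution easy to try:

```python
import torch
import torch.nn.functional as F

# Exact attention means a faster kernel returns the same result (up to
# floating-point accumulation order), so the generated image doesn't change.
q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)

# reference: materializes a 4096 x 4096 score matrix per head
ref = torch.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, dim=-1) @ v

# fused kernel; may dispatch to a FlashAttention-style implementation
fast = F.scaled_dot_product_attention(q, k, v)

print((ref - fast).abs().max())   # small fp16 rounding difference only
```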
🛫 to Rwanda for #ICLR2023! I’ll be giving a talk about H3 on Wednesday, and talking about some newer work on long convs at the ME-FoMo workshop on Thursday.
Please reach out if you’ll be there and want to chat! Happy to talk about Hyenas, Red Pajamas, or anything else!
Attention is all you need... but how much of it do you need?
Announcing H3 - a new generative language model that outperforms GPT-Neo-2.7B with only *2* attention layers! Accepted as a *spotlight* at #ICLR2023! 📣 w/ @tri_dao 📜 1/n
I'm at #ICML2022 this week! Let's chat if you're also in person!
I'm presenting two papers:
- Improving Transfer, Robustness of Supervised Contrastive Learning
- FlashAttention: Fast & Memory-Efficient Exact Attention
⏱below!
These synthetic languages (inspired by great work like ) test how well SSMs can do in-context learning compared to attention.
We find a critical missing capability -- SSMs have trouble *comparing tokens* across the sequence. 5/n
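A hedged sketch of the kind of synthetic we mean (an associative-recall toy task, details simplified from the paper): the model sees key-value pairs and must return the value paired with a query key, which requires comparing tokens across the sequence.

```python
import random

def associative_recall_example(n_pairs=8, keys=tuple("abcdefghij")):
    """One toy example: 'b 3 f 1 c 4 ... f' -> '1'.

    Answering requires matching the final query token against earlier keys --
    exactly the cross-sequence comparison the synthetics probe.
    """
    ks = random.sample(keys, n_pairs)
    vs = [str(random.randint(0, 9)) for _ in range(n_pairs)]
    query = random.choice(ks)
    prompt = " ".join(f"{k} {v}" for k, v in zip(ks, vs)) + " " + query
    return prompt, vs[ks.index(query)]

print(associative_recall_example())
```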
The power of data - RedPajama-2.8B matches Pythia-7B in HELM score after being trained on 2x the tokens! Excited to see these models continue to improve as they see more tokens :)
In addition to RedPajama 7B, we’ve also been training a 2.8B model. After 600B tokens, it is exciting to see that the model has higher HELM scores than the excellent Pythia-2.8B & GPT-Neo 2.7B.
In fact, trained with twice the tokens, RedPajama-2.8B has comparable quality to Pythia-7B!
Super excited for our new seminar series on ML and systems -- how does ML change the modern programming stack, and what does it mean for how people will build and deploy applications in the future?
Live on YouTube every Thursday, 3-4 PM PT. Check out links below for more!
Announcing the new live-streamed Stanford MLSys Seminar Series, in which we will explore the frontier of machine learning and systems.
Read the full announcement:
Schedule:
Intro video:
(1/n) This week @dorisjlee from @ucbrise and @BerkeleyISchool will be joining us on the Stanford MLSys Seminar to talk about her fantastic work on @lux_api. You can catch us live on YouTube this Thursday at 1:30 PT!
Deets in 🧵👇:
In response, we designed the H3 layer (Hungry Hungry Hippos) to plug this gap.
The H3 layer stacks two SSMs, and uses some simple multiplicative interactions between them (gating) to do comparisons. 6/n
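A cartoon of that structure in code (my own simplification, not the released H3 layer: the shift SSM becomes a one-step delay and the diagonal SSM becomes a learned per-channel exponential moving average):

```python
import torch
import torch.nn as nn

class ToyH3(nn.Module):
    """Toy H3-style layer: out = Q * diag_ssm(shift(K) * V), shapes (B, L, d)."""

    def __init__(self, d):
        super().__init__()
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(d, d) for _ in range(3))
        self.out_proj = nn.Linear(d, d)
        self.decay_logit = nn.Parameter(torch.zeros(d))

    def forward(self, x):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        k_shift = torch.cat([torch.zeros_like(k[:, :1]), k[:, :-1]], dim=1)  # 1-step delay
        u = k_shift * v                                # multiplicative "comparison"
        a = torch.sigmoid(self.decay_logit)            # per-channel decay in (0, 1)
        state = torch.zeros_like(u[:, 0])
        ys = []
        for t in range(u.shape[1]):                    # diagonal SSM as a recurrence
            state = a * state + (1 - a) * u[:, t]
            ys.append(state)
        y = torch.stack(ys, dim=1)
        return self.out_proj(q * y)                    # gate with Q

layer = ToyH3(d=32)
out = layer(torch.randn(2, 128, 32))                   # (2, 128, 32)
```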
We have exciting news! In our latest and greatest LLM blog, we show how MosaicML Cloud can help you train LLMs from 1B - 70B parameters, and for the first time, publish transparent times + costs for doing so. It's a lot cheaper than you think! (1/9)
Part 1: the quality gap
SSMs have achieved impressive results on sequence modeling (30+ points over Transformers on Long Range Arena), but have underperformed attention in language modeling.
In our paper, we use *synthetic languages* to probe this gap 4/n
What's the problem? Long convolutions require multiple FFT calls, which introduce expensive GPU memory reads/writes.
We develop FlashConv to address this problem.
FlashConv uses a block FFT algorithm to increase FLOP util, and uses state passing to scale to long sequences. 10/n
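For the block FFT piece, here's a sketch of the underlying idea (the classic Cooley-Tukey "four-step" factorization, written with NumPy FFT calls for clarity rather than the matmuls a real kernel would use): a length-N FFT becomes many small FFTs plus a twiddle multiply, and those small transforms can be cast as dense matrix multiplies that keep tensor cores busy.

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """Cooley-Tukey 'four-step' FFT of length n1*n2 built from small DFTs.

    The small DFTs along each axis can be written as n1 x n1 / n2 x n2 matrix
    multiplies -- which is how a block FFT turns the transform into matmuls.
    """
    a = x.reshape(n2, n1).T                       # a[j1, j2] = x[j1 + n1*j2]
    b = np.fft.fft(a, axis=1)                     # n2-point DFTs
    tw = np.exp(-2j * np.pi * np.outer(np.arange(n1), np.arange(n2)) / (n1 * n2))
    d = np.fft.fft(b * tw, axis=0)                # n1-point DFTs, after twiddles
    return d.reshape(-1)                          # d[k1, k2] -> X[k2 + n2*k1]

x = np.random.randn(1024).astype(np.complex128)
assert np.allclose(four_step_fft(x, 32, 32), np.fft.fft(x))
```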
This week we're excited to have @kexinrong (@Stanford, @VMware, and @gtcomputing) on the MLSys Seminar. Kexin will talk about improving query performance on big-data analytics. Be there or be square!
Watch us live on YouTube this Thursday at 1:30 PT:
@arankomatsuzaki @_akhaliq A few fun bits I couldn't fit into the original tweet:
1. We also have the fastest implementation of a short depthwise 1D convolution, which doesn't use the FFT but is up to 7x faster than PyTorch Conv1D, check out our repo to try it out:
2. During
As much as I like attention, I'm also fond of attention-free architectures for long context.
@realDanFu and others have been pushing in this direction, with deep theory and compelling empirical results! And @realDanFu is on the academic job market this year!
Announcing FlyingSquid - fast weak supervision with triplet methods. We speed up weak supervision by orders of magnitude, allowing weakly-supervised video analysis and online learning!
Blog:
w/ @MayeeChen, @fredsala, Sarah Hooper, @kayvonf, @HazyResearch
On LoCo, M2-BERT-32k outperforms the state-of-the-art embedding models! It even outperforms Mistral-7B, even though the M2-BERT models only have 80M parameters (85x more parameter-efficient)! 3/
As part of this release, we're also releasing version 0 of a new long-context benchmark called LoCo. Most academic retrieval benchmarks only have short-context documents, so we put together this benchmark of longer-context tasks. 2/
(1/n) This week we're delighted to have @faitpoms (@Stanford, @SnorkelAI) on the MLSys Seminar Series! Fait will be talking about a vision for interactive model development, so you won't want to miss it.
Catch us live on YouTube Thursday at 1:30 PM!
🧵👇
New preprint alert! 📣
How do we fuse foundation models with weak supervision? Liger (🐯 +🦁) is a new weak supervision framework that fuses FMs + WS using *local smoothness* -- outperforming both FMs and WS on their own.
📜
More below 👇 (1/n)
This project has been a great collaboration with @togethercompute. Thanks to them, these models are already integrated into @MongoDB Atlas, @LangChainAI, and @llama_index. Check out their tweet thread for more details! 9/
We are thrilled to announce the Together Embeddings endpoint! 🚀
Higher quality than OpenAI or Cohere in the MTEB benchmark. ✅
State of the art M2-Retrieval models with up to 32k context length. ✅
Up to 4x lower price. ✅
Details👇
The context lengths of foundation models have grown exponentially recently - exciting developments!
We've been happy to play a small role with FlashAttention, and we're very excited about the possibilities: multiple media sources, complex demonstrations, and more! 2/n
I’ve been working with @AdeptAILabs and we’ve made FlashAttention even faster for long sequences! For seqlen 8K, FlashAttention is now up to 2.7x faster than a standard PyTorch implementation even at small batch, making it easier to train better LMs with longer context 1/7
These gains also translate to strong downstream zero- and few-shot performance. On SuperGLUE, our zero-shot performance outperforms Transformer models of similar sizes. 8/n
We’ll be talking about this work today at #acl2022nlp in Dublin! Come check us out at 5:00 PM in poster session 3-4 (information retrieval and text mining).
I’ll be hanging around the whole week, come say hi!
New preprint alert! 📣
How do we improve long-tailed performance of entity retrieval? We use a supervised contrastive loss to *geometrically encode entity types* in representation space w/ bi-encoders. Check out our paper on TABi!
📜
Details👇 (1/n)
(1/n) This week on the Stanford MLSys Seminar, we're super excited to host Cody Coleman (@codyaustun), former Stanford PhD and founder/CEO of Coactive AI! Cody has a great talk prepared on Data Selection for Data-Centric AI!
Tune in Thursday at 1:30 PT!
72 hrs ago, @togethercompute released the RedPajama dataset. Like everyone, we at @MosaicML were very excited about the idea of a fully open-source LLaMA. So excited, in fact, that we've already trained a 1B model on 200B tokens! It's on HF (Apache2) here:
I'm flying out to #NeurIPS2023 @NeurIPSConf! I'll be presenting an oral on Monarch Mixer tomorrow at 3:40 in the Oral 2A session, and I'll be presenting FlashFFTConv Saturday at the ENLSP workshop!
Monarch Mixer:
FlashFFTConv:
Part 2: the efficiency gap
But that's not all! In order to scale H3 up to billion-parameter models, we had to make it as hardware-efficient as attention.
The convolution is O(N log N) asymptotically, but still underperforms FlashAttention for short sequences... 9/n
Announcing our 📣 Call for Papers for the ES-FoMo workshop @ ICML 2023!
➡️ We welcome papers touching on inference and training of foundation models, spanning from systems & benchmarks to novel algorithms.
🔗 (deadline: 31st of May)
(1/n) This week on the Stanford MLSys Seminar, we've got Ellie Pavlick - professor at @BrownCSDept and research scientist at @GoogleAI. Ellie will be talking about how to implement symbols and rules with neural networks! Tune in Thursday at 1:30 PT:
I had a great time chatting with @samcharrington on the @twimlai podcast - we talked about H3, FlashAttention, and all things language modeling. Thanks so much for having me on!
Today we're joined by @realDanFu, a PhD student at @Stanford, to discuss how state space models can improve language models and the limitations of attention. We discuss the H3 architecture, flash attention, and much more!
#NLP #MachineLearning #ICLR23 🎧
1/2 Excited to share Epoxy: fast model iteration with weak supervision + pre-trained embeddings. We look at how to use pre-trained embeddings without the need for fine-tuning - model iteration in <1/2 second, instead of hours or days.
Paper on arXiv now:
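Roughly, the idea in code (my own sketch, not the paper's exact algorithm): where a labeling function abstains, borrow its vote from the nearest covered point in pre-trained embedding space, if that neighbor is close enough.

```python
import numpy as np

def extend_votes(embeddings, votes, threshold=0.8):
    """Extend one labeling function's votes with pre-trained embeddings.

    embeddings: (n, d) unit-normalized; votes: (n,) in {-1, 0, +1}, 0 = abstain.
    An abstaining point inherits the vote of its nearest covered neighbor when
    cosine similarity >= threshold -- no fine-tuning, just a lookup.
    """
    covered = votes != 0
    sims = embeddings @ embeddings[covered].T      # (n, n_covered) cosine sims
    nearest = sims.argmax(axis=1)
    best = sims.max(axis=1)
    extended = votes.copy()
    borrow = ~covered & (best >= threshold)
    extended[borrow] = votes[covered][nearest[borrow]]
    return extended

emb = np.random.randn(200, 32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
votes = np.random.choice([-1, 0, 1], size=200, p=[0.1, 0.8, 0.1])
print((extend_votes(emb, votes) != 0).sum(), "points covered after extension")
```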
In this week’s MLSys Seminar, we'll be joined by @Lin_Ma_ from @CMUDB. Lin will be talking about his work on self-driving databases. As always, 30 minute talk + 30 minute podcast with live audience questions!
Livestream link:
#Stanford
#MachineLearning
This is still early work, so we would love to hear from you if you are interested in long-context embeddings.
If you have long-context tasks, we would love to hear how M2-BERT performs on them!
If you have suggestions about tasks to add to LoCo, please let us know! 8/
Excited to welcome Virginia Smith for her talk "On Heterogeneity in Federated Settings" today at 3pm PT for episode 3 of the MLSys seminar!
Tune in today at 3:
Tune in to the Stanford MLSys Seminar Series this Thursday 3-4pm PST, Virginia Smith (CMU) will talk about incorporating real-world constraints in federated learning.
Livestream:
Website:
The upshot? Throughput >1 image/s for 50 denoising steps on A100, 3-4x faster than unoptimized versions. And 33% faster than the super optimized version of Diffusers!
I'm going live on the Stanford MLSys Seminar today at 1PM PT! Will be chatting with @simran_s_arora about Monarch Mixer and FlashFFTConv.
Come tune in on YouTube and join us!
In this week’s MLSys Seminar, Gideon Mendels, CEO of @Cometml, will tell us all about MLOps System Design. Tune in Thursday, 1:30 PM on YouTube for a 30 minute talk + 30 minute podcast with live audience Q&A!
Livestream link:
#Stanford
#MachineLearning