Announcing FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores!
We speed up exact FFT convolutions by up to 7.93x over PyTorch, reduce memory footprint, and get 4.4x speedup end-to-end. Read on for more details:
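For readers who want the baseline in code: here's a minimal PyTorch sketch of the exact FFT convolution that FlashFFTConv accelerates (illustrative only; the function name and shapes are mine, not the library's API):

```python
import torch

def fft_conv(u, k):
    """Exact long convolution via FFT: y[t] = sum_{s<=t} k[s] * u[t-s].

    u: (batch, channels, seqlen) inputs; k: (channels, seqlen) filters.
    A naive reference in PyTorch, not FlashFFTConv's fused kernel.
    """
    seqlen = u.shape[-1]
    fft_size = 2 * seqlen                 # zero-pad to avoid circular wrap-around
    u_f = torch.fft.rfft(u, n=fft_size)
    k_f = torch.fft.rfft(k, n=fft_size)
    return torch.fft.irfft(u_f * k_f, n=fft_size)[..., :seqlen]

u = torch.randn(2, 64, 1024)
k = torch.randn(64, 1024)
y = fft_conv(u, k)                        # (2, 64, 1024), O(N log N) per channel
```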
Thanks @arankomatsuzaki and @_akhaliq for sharing!
Attention is all you need... but how much of it do you need?
Announcing H3 - a new generative language model that outperforms GPT-Neo-2.7B with only *2* attention layers! Accepted as a *spotlight* at #ICLR2023! 📣 w/ @tri_dao 📜 1/n
We spent a couple days this week speeding up Stable Diffusion in @huggingface Diffusers using FlashAttention. 3-4x faster than the original version, 33% faster than the super optimized v0.4.1 - and >1 image/s throughput on A100. w/ @tri_dao
A short thread on how we did it👇
Excited about models that are sub-quadratic in sequence length and model dimension? Our Monarch Mixer paper is now on arXiv -- and super excited to present it as an oral at #NeurIPS2023!
Let's dive into what's new with the paper and the new goodies from this release:
This sentiment is exactly right - and why we've been working to increase sequence length in our lab for the past two years!
From FlashAttention, to S4, H3, Hyena, and more - check out our blog post putting this line of work into context:
More below: 1/n
New year, new model drop!
w/ @JonSaadFalcon, @simran_s_arora, excited to release new long-context retrieval models with Monarch Mixer, up to 32K sequence length! A first step towards long-context retrieval, outperforming Mistral, BGE, OpenAI on long-context document retrieval. 1/
S4 is an amazing sequence model - but has seemed mysterious. It doesn't have to be!
In this blog (originally an internal explainer for our group), @HazyResearch looks at S4 from first principles that are familiar to most sophomore engineering students.
What's the simplest model that can get the job done?
New paper and blog post on how the answer for sequence modeling (including language) may be convolutions... with a touch of regularization.
📜
🖥️
⌨️ 1/n
You've heard of models that are sub-quadratic in sequence length, but what if they were sub-quadratic in model *dimension* too?
Announcing a preview of Monarch Mixer - a fully sub-quadratic & hardware-efficient architecture that matches BERT in quality! w/ @simran_s_arora 1/
The Stanford MLSys Seminar is now available in podcast form on Apple Podcasts, Spotify, Google, and more!
We release new podcasts every Monday and Friday (new episodes on Fridays, old episodes from the backlog on Mondays).
Check us out on your favorite platform below! (1/n)
One key point: SSMs are *linear* in sequence length instead of quadratic, and have no fixed context length. Long context for everyone!
We're super excited, so we're releasing our code and model weights today - up to 2.7B parameters!
2/n
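To make the linear-scaling claim concrete, here's a toy sketch (my own simplification, not the released model code) of why SSM generation is a constant-size state update per token:

```python
import torch

def ssm_generate(u, A, B, C):
    """Toy diagonal SSM: x_t = A * x_{t-1} + B * u_t,  y_t = C . x_t.

    One constant-size state update per token: O(seqlen) time, O(1) memory for
    the history, and no fixed context window. u: (seqlen,) scalar inputs.
    """
    x = torch.zeros_like(A)
    ys = []
    for u_t in u:                 # linear in sequence length
        x = A * x + B * u_t       # fixed-size state carries the whole history
        ys.append((C * x).sum())
    return torch.stack(ys)

d_state = 16
A = 0.9 * torch.rand(d_state)     # stable per-channel decay
B, C = torch.randn(d_state), torch.randn(d_state)
y = ssm_generate(torch.randn(4096), A, B, C)    # (4096,)
```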
Blog alert! 📣
How does contrastive learning work? How can we apply it effectively? New *3-part series* covering *2 new papers* on getting better transfer & robustness, and how to apply contrastive learning with types to improve entity retrieval.
Part 1:
👇 (1/n)
Thrilled that FlashAttention won the best paper award at the Hardware Aware Efficient Training workshop at ICML - really excited to meet so many like-minded folks at the workshop.
Thanks to the organizers (and NVIDIA) for the GPU!
Announcing FlashAttention, a fast and memory-efficient attention algorithm with no approximation! 📣 w/ @realDanFu
By reducing GPU memory reads/writes, FlashAttention runs 2-4x faster & requires 5-20x less memory than PyTorch standard attention, & scales to seq. length 64K. 1/
New preprint alert! 📣
How do we fuse foundation models with weak supervision? Liger (🐯 +🦁) is a new weak supervision framework that fuses FMs + WS using *local smoothness* -- outperforming both FMs and WS on their own.
📜
More below 👇 (1/n)
Super excited to release the RedPajama dataset - a new, fully open *1.2 trillion token* dataset following the LLaMA recipe. A first step towards creating leading, fully open-source large language models.
Announcing RedPajama — a project to create leading, fully open-source large language models, beginning with the release of a 1.2 trillion token dataset that follows the LLaMA recipe, available today!
More in 🧵 …
Today I'm talking about FlashFFTConv at the ENLSP workshop (Efficient Natural Language and Speech Processing)! The talk is at 9:48 AM, and the poster session is from 1:00 to 2:00!
I'm flying out to #NeurIPS2023 @NeurIPSConf! I'll be presenting an oral on Monarch Mixer tomorrow at 3:40 in the Oral 2A session, and I'll be presenting FlashFFTConv Saturday at the ENLSP workshop!
Monarch Mixer:
FlashFFTConv:
Super excited for this model to see the light of day!
7B model, hybrid gated conv/SSM + attention architecture, trained for long context and running FlashFFTConv everywhere.
You can chat with it now on the Together API!
Announcing StripedHyena 7B — an open-source model using an architecture that goes beyond Transformers, achieving faster performance and longer context.
It builds on the lessons learned over the past year designing efficient sequence modeling architectures.
After a short hiatus, the Stanford MLSys Seminar is coming back this quarter with a special series of episodes on foundation models!
Our first talk (ep 67!!) will be @tri_dao, who'll be talking about FlashAttention. Catch us *TOMORROW* at 3:30 PT:
ChatGPT's 1700-token system prompt got you down?
Led by @jordanjuravsky, @brad19brown, introducing Hydragen, a simple technique for Transformer LLM inference with shared prefixes! Up to 30x improvement in throughput with no custom CUDA!
A few things I love in this project: 1/
Excited to share my first PhD project!
TLDR: Hydragen is an exact, simple (no custom CUDA) implementation of attention for large batches with shared prefixes. We can improve LLM throughput by over 30x for CodeLlama-13b. Also, adding lots more shared context becomes cheap:
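For the curious, a hedged sketch of the core trick as I understand it: exact attention over [shared prefix; per-sequence suffix] can be split into two attentions and recombined with their log-sum-exp weights, so the prefix KV is read once for the whole batch. Names and shapes below are illustrative, not Hydragen's code.

```python
import math
import torch

def chunk_attn(q, k, v):
    """Attention restricted to one KV chunk, plus its log-sum-exp."""
    s = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])   # (..., q_len, kv_len)
    lse = torch.logsumexp(s, dim=-1)                        # (..., q_len)
    return torch.softmax(s, dim=-1) @ v, lse

def shared_prefix_attn(q, k_pre, v_pre, k_suf, v_suf):
    """Exact attention over [prefix; suffix] from two partial attentions.

    k_pre/v_pre carry no batch dim (shared by every sequence), so the prefix
    pass is one batched matmul; k_suf/v_suf are per-sequence.
    """
    o_pre, lse_pre = chunk_attn(q, k_pre, v_pre)
    o_suf, lse_suf = chunk_attn(q, k_suf, v_suf)
    total = torch.logaddexp(lse_pre, lse_suf)
    a_pre = torch.exp(lse_pre - total).unsqueeze(-1)        # softmax mass on prefix
    a_suf = torch.exp(lse_suf - total).unsqueeze(-1)        # softmax mass on suffix
    return a_pre * o_pre + a_suf * o_suf

# toy decode step: batch of 4 sequences sharing one 8-token prefix
q = torch.randn(4, 1, 16)
k_pre, v_pre = torch.randn(8, 16), torch.randn(8, 16)
k_suf, v_suf = torch.randn(4, 3, 16), torch.randn(4, 3, 16)
out = shared_prefix_attn(q, k_pre, v_pre, k_suf, v_suf)     # (4, 1, 16)
```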
Today I'm talking about FlashFFTConv at the ENLSP workshop (Efficient Natural Language and Speech Processing)! The talk is at 9:48 AM, and the poster session is from 1:00 to 2:00!
I'll be at #NeurIPS2022 this week! @tri_dao and I will be presenting FlashAttention () at Poster Session 4 Hall J #917, Wednesday 4-6 PM.
Super excited to talk all things performance, ML+systems, and breaking down scaling bottlenecks!
Super excited to share some thoughts with @laurel_orr1 on lessons learned from the past four years with @HazyResearch and @SnorkelML, and what's next for the ways that machine learning is changing how we build software:
This Thursday, @srush_nlp from @cornell_tech will be talking to us about going beyond softmax in NLP. As always, 30 minute talk + 30 minute podcast with live audience questions, be sure to tune in!
Livestream link:
#Stanford #MachineLearning
New preprint alert! 📣
How do we produce transferable and robust representations with supervised contrastive learning? We need *geometric spread* and an inductive bias towards *latent subclass clustering* in representation space.
📜
👇 (1/n)
In H3, we replace attention with a new layer based on state space models (SSMs) - with the right modifications, we find that it can outperform Transformers.
Two key ideas:
* Adapting SSMs to be able to do *comparison*
* Making SSMs as hardware-efficient as attention 3/n
We built off the super-optimized version of Diffusers that @Nouamanetazi / @huggingface released last week - the diff is pretty small, 68 LOC:
Training our first RedPajama 7B model is going well! Less than halfway through training (after 440 billion tokens), the model achieves better results on HELM benchmarks than the well-regarded Pythia-7B trained on the Pile.
Details at
New preprint alert! 📣
How do we fuse foundation models with weak supervision? Liger (🐯 +🦁) is a new weak supervision framework that fuses FMs + WS using *local smoothness* -- outperforming both FMs and WS on their own.
📜
More below 👇 (1/n)
Overall, really excited about new models/architectures like this. What happens if we don't need attention to get the magic we've been seeing, and we can get the same quality with a linear operator?
No more fixed context windows, long context for everyone! 16/n
Super excited by this work. Making attention IO-aware makes it run way faster - and enables much longer sequences, since memory footprint becomes linear in sequence length.
Really excited to see how this gets used, and where it goes next - IO-aware transformers?
Announcing FlashAttention, a fast and memory-efficient attention algorithm with no approximation! 📣 w/ @realDanFu
By reducing GPU memory reads/writes, FlashAttention runs 2-4x faster & requires 5-20x less memory than PyTorch standard attention, & scales to seq. length 64K. 1/
FlashAttention speeds up attention and reduces its memory footprint - without any approximation. Our key insight is that attention is bottlenecked by GPU memory *reads/writes*. FlashAttention speeds up attention by reducing the R/W. Same FLOPs, 3-4x faster!
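If you want to see the idea in code: a minimal sketch (plain PyTorch, illustrative only, so it shows the algorithm rather than the kernel-level speed) of attention computed block-by-block with an online softmax, which is what lets FlashAttention avoid ever writing the N x N score matrix to GPU memory.

```python
import math
import torch

def tiled_attention(q, k, v, block=128):
    """Attention computed one key/value block at a time with an online softmax.

    The full (N x N) score matrix is never materialized -- the same idea that
    lets FlashAttention keep tiles in SRAM instead of round-tripping HBM.
    """
    scale = 1.0 / math.sqrt(q.shape[-1])
    n = k.shape[-2]
    m = torch.full(q.shape[:-1], float("-inf"))      # running max of scores
    l = torch.zeros(q.shape[:-1])                    # running softmax denominator
    o = torch.zeros_like(q)                          # running (unnormalized) output
    for start in range(0, n, block):
        kb = k[..., start:start + block, :]
        vb = v[..., start:start + block, :]
        s = q @ kb.transpose(-2, -1) * scale         # (..., q_len, block)
        m_new = torch.maximum(m, s.max(dim=-1).values)
        alpha = torch.exp(m - m_new)                 # rescale stats from earlier blocks
        p = torch.exp(s - m_new.unsqueeze(-1))
        l = l * alpha + p.sum(dim=-1)
        o = o * alpha.unsqueeze(-1) + p @ vb
        m = m_new
    return o / l.unsqueeze(-1)

q, k, v = (torch.randn(2, 512, 64) for _ in range(3))
out = tiled_attention(q, k, v)
ref = torch.softmax(q @ k.transpose(-2, -1) / 8.0, dim=-1) @ v
print(torch.allclose(out, ref, atol=1e-5))           # same FLOPs, same answer
```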
Build your own ChatGPT!
Super excited by this open-source release - even more exciting that it was trained 100% carbon-negative. Happy to play a (minuscule) part in putting it together and helping serve it faster.
Looking forward to seeing what folks build on top of this!
Why Train What You Can Code? Excited to share Rekall - using programmatic composition to find new events in video!
Paper on arXiv, and code available on GitHub!
Blog:
Get faster, more flexible inference on GPUs using our newly open-sourced AITemplate, a revolutionary new inference engine that delivers up to 12X performance improvements on NVIDIA GPUs & 4X on AMD GPUs compared to eager-mode within PyTorch.
Learn more:
Absolutely thrilled to receive the best paper award w/ @MayeeChen for our work on supervised contrastive learning at the AI with Biased/Scarce Data Workshop at @RealAAAI today! Check out the paper on the workshop website:
Short 🧵👇 - more soon! (1/n)
Ce Zhang (@DS3Lab and @togethercompute) has done some crazy stuff in distributed training. In this talk, he goes over the magic behind distributed training and inference on a GLOBAL scale over slow networks!
Tune in tomorrow at 3:30 pm Pacific!
Attending #ICML2023? Join us Saturday at our workshop on Efficient Systems for Foundation Models!
🔥 Large-Scale Distributed Training
🚀 Efficient Inference
⚙️ Deep Optimization
📈 Over 50 posters and 4 orals spanning from RL to efficient finetuning!
The deadline for our #ICML2023 workshop Efficient Systems for Foundation Models is tomorrow, May 31 AOE!
Submit your best papers on training, inference or anything FM systems and efficiency - then join us for a great day of speakers & panel in Hawaii!
The upshot: we can scale H3 up to *2.7B* parameter models. And because of the state passing, we can run inference blazing fast -- up to *2.4x* faster than highly-optimized Transformers.
Up to 1,980 tokens/second! 12/n
If you're at ICLR, catch my talk on our paper Hungry Hungry Hippos: Towards Language Modeling with State Space Models today at 10 AM in room AD12! Featuring photos of actual Rwandan hippos :)
(+poster from 11:30-1:30 at board 80!)
🛫 to Rwanda for #ICLR2023! I’ll be giving a talk about H3 on Wednesday, and talking about some newer work on long convs at the ME-FoMo workshop on Thursday.
Please reach out if you’ll be there and want to chat! Happy to talk about Hyenas, Red Pajamas, or anything else!
The H3 layer closes the gap on our synthetics, and the gains translate to strong downstream performance on language modeling.
We replaced almost all the attention blocks in a Transformer with H3 layers, and trained on the Pile. Our model *outperforms* GPT-Neo in PPL! 7/n
We were actually a bit late to the game here - when we saw a couple folks on Reddit and elsewhere who beat us to the punch, we decided to give it a try ourselves :)
PhotoRoom:
u/hnipun:
One final plug: Oral 2A Efficient Learning tomorrow is absolutely **packed** with great work from @Tim_Dettmers and @srush_nlp - super excited to hear what they have to say!
(1/n) This week we have @fredsala on the Stanford MLSys Seminar, live on Thursday at 1:30 PM! Fred was a postdoc at @StanfordAILab, and is now a professor at @WisconsinCS and a research scientist at @SnorkelAI -- so he knows a thing or two about MLSys.
I'm flying out to #NeurIPS2023 @NeurIPSConf! I'll be presenting an oral on Monarch Mixer tomorrow at 3:40 in the Oral 2A session, and I'll be presenting FlashFFTConv Saturday at the ENLSP workshop!
Monarch Mixer:
FlashFFTConv:
We are excited to release RedPajama-Data-v2: 30 trillion filtered & de-duplicated tokens from 84 CommonCrawl dumps, 25x larger than our first dataset.
It exposes a diverse range of quality annotations so you can slice & weight the data for LLM training.
The MLSys Seminar is back this week with our very own @BeidiChen! Tune in Thursday, 1:30 PM on YouTube to hear about her great work on sparsity in deep learning.
Livestream link:
#Stanford
#MachineLearning
We sped up stable diffusion by replacing the self-attention/cross-attention blocks in the unet with FlashAttention. FlashAttention doesn't do any approximation, so you get the *exact same image* at the end.
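We used the FlashAttention CUDA kernel for the swap; as a rough modern analogue (not what we shipped at the time, and assuming a CUDA GPU with PyTorch 2.x), the built-in fused scaled_dot_product_attention makes the same exact-attention substitution easy to try:

```python
import torch
import torch.nn.functional as F

# Exact attention means a faster kernel returns the same result (up to
# floating-point accumulation order), so the generated image doesn't change.
q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)

# reference: materializes a 4096 x 4096 score matrix per head
ref = torch.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, dim=-1) @ v

# fused kernel; may dispatch to a FlashAttention-style implementation
fast = F.scaled_dot_product_attention(q, k, v)

print((ref - fast).abs().max())   # small fp16 rounding difference only
```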
🛫 to Rwanda for #ICLR2023! I’ll be giving a talk about H3 on Wednesday, and talking about some newer work on long convs at the ME-FoMo workshop on Thursday.
Please reach out if you’ll be there and want to chat! Happy to talk about Hyenas, Red Pajamas, or anything else!
Attention is all you need... but how much of it do you need?
Announcing H3 - a new generative language model that outperforms GPT-Neo-2.7B with only *2* attention layers! Accepted as a *spotlight* at #ICLR2023! 📣 w/ @tri_dao 📜 1/n
I'm at #ICML2022 this week! Let's chat if you're also in person!
I'm presenting two papers:
- Improving Transfer, Robustness of Supervised Contrastive Learning
- FlashAttention: Fast & Memory-Efficient Exact Attention
⏱below!
These synthetic languages (inspired by great work like ) test how well SSMs can do in-context learning compared to attention.
We find a critical missing capability -- SSMs have trouble *comparing tokens* across the sequence. 5/n
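A hedged sketch of the kind of synthetic we mean (an associative-recall toy task, details simplified from the paper): the model sees key-value pairs and must return the value paired with a query key, which requires comparing tokens across the sequence.

```python
import random

def associative_recall_example(n_pairs=8, keys=tuple("abcdefghij")):
    """One toy example: 'b 3 f 1 c 4 ... f' -> '1'.

    Answering requires matching the final query token against earlier keys --
    exactly the cross-sequence comparison the synthetics probe.
    """
    ks = random.sample(keys, n_pairs)
    vs = [str(random.randint(0, 9)) for _ in range(n_pairs)]
    query = random.choice(ks)
    prompt = " ".join(f"{k} {v}" for k, v in zip(ks, vs)) + " " + query
    return prompt, vs[ks.index(query)]

print(associative_recall_example())
```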
The power of data - RedPajama-2.8B matches Pythia-7B in HELM score after being trained on 2x the tokens! Excited to see these models continue to improve as they see more tokens :)
In addition to RedPajama 7B, we’ve also been training a 2.8B model. After 600B tokens, it is exciting to see that the model has higher HELM scores than the excellent Pythia-2.8B & GPT-Neo 2.7B.
In fact, trained with twice the tokens, RedPajama-2.8B has comparable quality to Pythia-7B!
Super excited for our new seminar series on ML and systems -- how does ML change the modern programming stack, and what does it mean for how people will build and deploy applications in the future?
Live on YouTube every Thursday, 3-4 PM PT. Check out links below for more!
Announcing the new live-streamed Stanford MLSys Seminar Series, in which we will explore the frontier of machine learning and systems.
Read the full announcement:
Schedule:
Intro video:
(1/n) This week @dorisjlee from @ucbrise and @BerkeleyISchool will be joining us on the Stanford MLSys Seminar to talk about her fantastic work on @lux_api. You can catch us live on YouTube this Thursday at 1:30 PT!
Deets in 🧵👇:
In response, we designed the H3 layer (Hungry Hungry Hippos) to plug this gap.
The H3 layer stacks two SSMs, and uses some simple multiplicative interactions between them (gating) to do comparisons. 6/n
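A cartoon of that structure in code (my own simplification, not the released H3 layer: the shift SSM becomes a one-step delay and the diagonal SSM becomes a learned per-channel exponential moving average):

```python
import torch
import torch.nn as nn

class ToyH3(nn.Module):
    """Toy H3-style layer: out = Q * diag_ssm(shift(K) * V), shapes (B, L, d)."""

    def __init__(self, d):
        super().__init__()
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(d, d) for _ in range(3))
        self.out_proj = nn.Linear(d, d)
        self.decay_logit = nn.Parameter(torch.zeros(d))

    def forward(self, x):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        k_shift = torch.cat([torch.zeros_like(k[:, :1]), k[:, :-1]], dim=1)  # 1-step delay
        u = k_shift * v                                # multiplicative "comparison"
        a = torch.sigmoid(self.decay_logit)            # per-channel decay in (0, 1)
        state = torch.zeros_like(u[:, 0])
        ys = []
        for t in range(u.shape[1]):                    # diagonal SSM as a recurrence
            state = a * state + (1 - a) * u[:, t]
            ys.append(state)
        y = torch.stack(ys, dim=1)
        return self.out_proj(q * y)                    # gate with Q

layer = ToyH3(d=32)
out = layer(torch.randn(2, 128, 32))                   # (2, 128, 32)
```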
We have exciting news! In our latest and greatest LLM blog, we show how MosaicML Cloud can help you train LLMs from 1B - 70B parameters, and for the first time, publish transparent times + costs for doing so. It's a lot cheaper than you think! (1/9)
Part 1: the quality gap
SSMs have achieved impressive results on sequence modeling (30+ points over Transformers on Long Range Arena), but have underperformed attention in language modeling.
In our paper, we use *synthetic languages* to probe this gap 4/n
What's the problem? Long convolutions require multiple FFT calls, which introduce expensive GPU memory reads/writes.
We develop FlashConv to address this problem.
FlashConv uses a block FFT algorithm to increase FLOP util, and uses state passing to scale to long sequences. 10/n
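For the block FFT piece, here's a sketch of the underlying idea (the classic Cooley-Tukey "four-step" factorization, written with NumPy FFT calls for clarity rather than the matmuls a real kernel would use): a length-N FFT becomes many small FFTs plus a twiddle multiply, and those small transforms can be cast as dense matrix multiplies that keep tensor cores busy.

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """Cooley-Tukey 'four-step' FFT of length n1*n2 built from small DFTs.

    The small DFTs along each axis can be written as n1 x n1 / n2 x n2 matrix
    multiplies -- which is how a block FFT turns the transform into matmuls.
    """
    a = x.reshape(n2, n1).T                       # a[j1, j2] = x[j1 + n1*j2]
    b = np.fft.fft(a, axis=1)                     # n2-point DFTs
    tw = np.exp(-2j * np.pi * np.outer(np.arange(n1), np.arange(n2)) / (n1 * n2))
    d = np.fft.fft(b * tw, axis=0)                # n1-point DFTs, after twiddles
    return d.reshape(-1)                          # d[k1, k2] -> X[k2 + n2*k1]

x = np.random.randn(1024).astype(np.complex128)
assert np.allclose(four_step_fft(x, 32, 32), np.fft.fft(x))
```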
This week we're excited to have @kexinrong (@Stanford, @VMware, and @gtcomputing) on the MLSys Seminar. Kexin will talk about improving query performance on big-data analytics. Be there or be square!
Watch us live on YouTube this Thursday at 1:30 PT:
@arankomatsuzaki @_akhaliq A few fun bits I couldn't fit into the original tweet:
1. We also have the fastest implementation of a short depthwise 1D convolution, which doesn't use the FFT but is up to 7x faster than PyTorch Conv1D, check out our repo to try it out:
2. During
As much as I like attention, I'm also fond of attention-free architectures for long context.
@realDanFu and others have been pushing in this direction, with deep theory and compelling empirical results! And @realDanFu is on the academic job market this year!
Announcing FlyingSquid - fast weak supervision with triplet methods. We speed up weak supervision by orders of magnitude, allowing weakly-supervised video analysis and online learning!
Blog:
w/ @MayeeChen, @fredsala, Sarah Hooper, @kayvonf, @HazyResearch
On LoCo, M2-BERT-32k outperforms the state-of-the-art embedding models! It even outperforms Mistral-7B, even though the M2-BERT models only have 80M parameters (85x more parameter-efficient)! 3/
As part of this release, we're also releasing version 0 of a new long-context benchmark called LoCo. Most academic retrieval benchmarks only have short-context documents, so we put together this benchmark of longer-context tasks. 2/
(1/n) This week we're delighted to have @faitpoms (@Stanford, @SnorkelAI) on the MLSys Seminar Series! Fait will be talking about a vision for interactive model development, so you won't want to miss it.
Catch us live on YouTube Thursday at 1:30 PM!
🧵👇
New preprint alert! 📣
How do we fuse foundation models with weak supervision? Liger (🐯 +🦁) is a new weak supervision framework that fuses FMs + WS using *local smoothness* -- outperforming both FMs and WS on their own.
📜
More below 👇 (1/n)
This project has been a great collaboration with @togethercompute. Thanks to them, these models are already integrated into @MongoDB Atlas, @LangChainAI, and @llama_index. Check out their tweet thread for more details! 9/
We are thrilled to announce the Together Embeddings endpoint! 🚀
Higher quality than OpenAI or Cohere in the MTEB benchmark. ✅
State of the art M2-Retrieval models with up to 32k context length. ✅
Up to 4x lower price. ✅
Details👇
The context lengths of foundation models have grown exponentially recently - exciting developments!
We've been happy to play a small role with FlashAttention, and we're very excited about the possibilities: multiple media sources, complex demonstrations, and more! 2/n
I’ve been working with @AdeptAILabs and we’ve made FlashAttention even faster for long sequences! For seqlen 8K, FlashAttention is now up to 2.7x faster than a standard PyTorch implementation even at small batch, making it easier to train better LMs with longer context 1/7
These gains also translate to strong downstream zero- and few-shot performance. On SuperGLUE, our zero-shot performance outperforms Transformer models of similar sizes. 8/n
We’ll be talking about this work today at #acl2022nlp in Dublin! Come check us out at 5:00 PM in poster session 3-4 (information retrieval and text mining).
I’ll be hanging around the whole week, come say hi!
New preprint alert! 📣
How do we improve long-tailed performance of entity retrieval? We use a supervised contrastive loss to *geometrically encode entity types* in representation space w/ bi-encoders. Check out our paper on TABi!
📜
Details👇 (1/n)
(1/n) This week on the Stanford MLSys Seminar, we're super excited to host Cody Coleman (@codyaustun), former Stanford PhD and founder/CEO of Coactive AI! Cody has a great talk prepared on Data Selection for Data-Centric AI!
Tune in Thursday at 1:30 PT!
72 hrs ago, @togethercompute released the RedPajama dataset. Like everyone, we at @MosaicML were very excited about the idea of a fully open-source LLaMA. So excited, in fact, that we've already trained a 1B model on 200B tokens! It's on HF (Apache2) here:
I'm flying out to #NeurIPS2023 @NeurIPSConf! I'll be presenting an oral on Monarch Mixer tomorrow at 3:40 in the Oral 2A session, and I'll be presenting FlashFFTConv Saturday at the ENLSP workshop!
Monarch Mixer:
FlashFFTConv:
Part 2: the efficiency gap
But that's not all! In order to scale H3 up to billion-parameter models, we had to make it as hardware-efficient as attention.
The convolution is O(N log N) asymptotically, but still underperforms FlashAttention for short sequences... 9/n
Announcing our 📣 Call for Papers for the ES-FoMo workshop @ ICML 2023!
➡️ We welcome papers touching on inference and training of foundation models, spanning from systems & benchmarks to novel algorithms.
🔗 (deadline: 31st of May)
(1/n) This week on the Stanford MLSys Seminar, we've got Ellie Pavlick - professor at @BrownCSDept and research scientist at @GoogleAI. Ellie will be talking about how to implement symbols and rules with neural networks! Tune in Thursday at 1:30 PT:
I had a great time chatting with @samcharrington on the @twimlai podcast - we talked about H3, FlashAttention, and all things language modeling. Thanks so much for having me on!
Today we're joined by @realDanFu, a PhD student at @Stanford, to discuss how state space models can improve language models and the limitations of attention. We discuss the H3 architecture, flash attention, and much more!
#NLP #MachineLearning #ICLR23 🎧
1/2 Excited to share Epoxy: fast model iteration with weak supervision + pre-trained embeddings. We look at how to use pre-trained embeddings without the need for fine-tuning - model iteration in <1/2 second, instead of hours or days.
Paper on arXiv now:
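Roughly, the idea in code (my own sketch, not the paper's exact algorithm): where a labeling function abstains, borrow its vote from the nearest covered point in pre-trained embedding space, if that neighbor is close enough.

```python
import numpy as np

def extend_votes(embeddings, votes, threshold=0.8):
    """Extend one labeling function's votes with pre-trained embeddings.

    embeddings: (n, d) unit-normalized; votes: (n,) in {-1, 0, +1}, 0 = abstain.
    An abstaining point inherits the vote of its nearest covered neighbor when
    cosine similarity >= threshold -- no fine-tuning, just a lookup.
    """
    covered = votes != 0
    sims = embeddings @ embeddings[covered].T      # (n, n_covered) cosine sims
    nearest = sims.argmax(axis=1)
    best = sims.max(axis=1)
    extended = votes.copy()
    borrow = ~covered & (best >= threshold)
    extended[borrow] = votes[covered][nearest[borrow]]
    return extended

emb = np.random.randn(200, 32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
votes = np.random.choice([-1, 0, 1], size=200, p=[0.1, 0.8, 0.1])
print((extend_votes(emb, votes) != 0).sum(), "points covered after extension")
```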
In this week’s MLSys Seminar, we'll be joined by @Lin_Ma_ from @CMUDB. Lin will be talking about his work on self-driving databases. As always, 30 minute talk + 30 minute podcast with live audience questions!
Livestream link:
#Stanford
#MachineLearning
This is still early work, so we would love to hear from you if you are interested in long-context embeddings.
If you have long-context tasks, we would love to hear how M2-BERT performs on them!
If you have suggestions about tasks to add to LoCo, please let us know! 8/
Excited to welcome Virginia Smith for her talk "On Heterogeneity in Federated Settings" today at 3pm PT for episode 3 of the MLSys seminar!
Tune in today at 3:
Tune in to the Stanford MLSys Seminar Series this Thursday 3-4pm PST, Virginia Smith (CMU) will talk about incorporating real-world constraints in federated learning.
Livestream:
Website:
The upshot? Throughput >1 image/s for 50 denoising steps on A100, 3-4x faster than unoptimized versions. And 33% faster than the super optimized version of Diffusers!
I'm going live on the Stanford MLSys Seminar today at 1PM PT! Will be chatting with @simran_s_arora about Monarch Mixer and FlashFFTConv.
Come tune in on YouTube and join us!
In this week’s MLSys Seminar, Gideon Mendels, CEO of @Cometml, will tell us all about MLOps System Design. Tune in Thursday, 1:30 PM on YouTube for a 30 minute talk + 30 minute podcast with live audience Q&A!
Livestream link:
#Stanford
#MachineLearning