Dan Fu Profile
Dan Fu

@realDanFu

4,537
Followers
177
Following
154
Media
587
Statuses

CS PhD Candidate at Stanford, systems for machine learning. Sometimes YouTuber/podcaster. Academic Partner, @togethercompute .

Joined September 2019
Pinned Tweet
@realDanFu
Dan Fu
6 months
Announcing FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores! We speed up exact FFT convolutions by up to 7.93x over PyTorch, reduce memory footprint, and get 4.4x speedup end-to-end. Read on for more details: Thanks @arankomatsuzaki and @_akhaliq for
Tweet media one
Tweet media two
Tweet media three
Tweet media four
6
75
386
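For context, the baseline FlashFFTConv accelerates is the exact FFT convolution below, shown as a minimal PyTorch sketch (FlashFFTConv itself fuses these steps into tensor-core kernels; the function name and shapes here are illustrative, not from the FlashFFTConv repo).

```python
import torch

def fft_conv(u, k):
    """Exact long convolution via FFT: O(N log N) instead of O(N^2).
    u: (batch, d, seqlen) input; k: (d, seqlen) convolution filter."""
    seqlen = u.shape[-1]
    fft_size = 2 * seqlen                        # zero-pad to avoid circular wrap-around
    u_f = torch.fft.rfft(u, n=fft_size)          # FFT of the input
    k_f = torch.fft.rfft(k, n=fft_size)          # FFT of the filter
    y = torch.fft.irfft(u_f * k_f, n=fft_size)   # pointwise multiply, then inverse FFT
    return y[..., :seqlen]                       # keep the causal part

u = torch.randn(2, 64, 1024)
k = torch.randn(64, 1024)
y = fft_conv(u, k)   # (2, 64, 1024)
```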
@realDanFu
Dan Fu
1 year
Attention is all you need... but how much of it do you need? Announcing H3 - a new generative language model that outperforms GPT-Neo-2.7B with only *2* attention layers! Accepted as a *spotlight* at #ICLR2023 ! 📣 w/ @tri_dao 📜 1/n
31
282
2K
@realDanFu
Dan Fu
2 years
We spent a couple days this week speeding up Stable Diffusion in @huggingface Diffusers using FlashAttention. 3-4x faster than the original version, 33% faster than the super optimized v0.4.1 - and >1 image/s throughput on A100. w/ @tri_dao A short thread on how we did it👇
Tweet media one
6
65
552
@realDanFu
Dan Fu
7 months
Excited about models that are sub-quadratic in sequence length and model dimension? Our Monarch Mixer paper is now on arXiv -- and super excited to present it as an oral at #NeurIPS2023 ! Let's dive into what's new with the paper and the new goodies from this release: Monarch
Tweet media one
Tweet media two
Tweet media three
Tweet media four
4
60
294
@realDanFu
Dan Fu
1 year
This sentiment is exactly right - and why we've been working to increase sequence length in our lab for the past two years! From FlashAttention, to S4, H3, Hyena, and more - check out our blog post putting this line of work into context: More below: 1/n
@sama
Sam Altman
1 year
we thought we wanted flying cars and not 140/280 characters, but really we wanted 32000 tokens
498
563
8K
4
41
242
@realDanFu
Dan Fu
4 months
New year, new model drop! w/ @JonSaadFalcon , @simran_s_arora , excited to release new long-context retrieval models with Monarch Mixer, up to 32K sequence length! A first step towards long-context retrieval, outperforming Mistral, BGE, OpenAI on long-context document retrieval. 1/
Tweet media one
4
42
231
@realDanFu
Dan Fu
2 years
S4 is an amazing sequence model - but has seemed mysterious. It doesn't have to be! In this blog (originally an internal explainer for our group), @HazyResearch looks at S4 from first principles that are familiar to most sophomore engineering students.
3
42
194
@realDanFu
Dan Fu
1 year
What's the simplest model that can get the job done? New paper and blog post on how the answer for sequence modeling (including language) may be convolutions... with a touch of regularization. 📜 🖥️ ⌨️ 1/n
5
39
169
@realDanFu
Dan Fu
10 months
You've heard of models that are sub-quadratic in sequence length, but what if they were sub-quadratic in model *dimension* too? Announcing a preview of Monarch Mixer - a fully sub-quadratic & hardware-efficient architecture that matches BERT in quality! w @simran_s_arora 1/
Tweet media one
5
42
157
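Roughly how a Monarch (block-diagonal plus permutation) matrix keeps mixing sub-quadratic in the dimension -- a sketch of one common parameterization, not the M2 code; names and shapes are illustrative.

```python
import torch

def monarch_matmul(x, L_blocks, R_blocks):
    """Apply a Monarch-style matrix to x in ~O(n^1.5) operations.
    Structure: block-diagonal matmul -> transpose (the permutation) -> block-diagonal matmul.
    x: (..., n) with n = b * b; L_blocks, R_blocks: (b, b, b) block-diagonal factors."""
    b = L_blocks.shape[0]
    n = b * b
    x = x.reshape(-1, b, b)                           # split into b chunks of size b
    x = torch.einsum('ijk,xik->xij', R_blocks, x)     # right block-diagonal factor
    x = x.transpose(-1, -2)                           # the permutation (a transpose of the grid)
    x = torch.einsum('ijk,xik->xij', L_blocks, x)     # left block-diagonal factor
    return x.transpose(-1, -2).reshape(-1, n)

n, b = 256, 16
x = torch.randn(8, n)
L = torch.randn(b, b, b) / b ** 0.5
R = torch.randn(b, b, b) / b ** 0.5
y = monarch_matmul(x, L, R)   # (8, 256): ~n^1.5 multiply-adds per vector vs n^2 for a dense matrix
```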
@realDanFu
Dan Fu
2 years
The Stanford MLSys Seminar is now available in podcast form on Apple Podcasts, Spotify, Google, and more! We release new podcasts every Monday and Friday (new episodes on Fridays, old episodes from the backlog on Mondays). Check us out on your favorite platform below! (1/n)
3
22
131
@realDanFu
Dan Fu
1 year
One key point: SSMs are *linear* in sequence length instead of quadratic, and have no fixed context length. Long context for everyone! We're super excited, so we're releasing our code and model weights today - up to 2.7B parameters! 2/n
3
10
131
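A minimal diagonal-SSM recurrence illustrating why the cost is linear in sequence length with no fixed context window (illustrative only; the released H3 code computes this via FFT convolutions and fused kernels rather than a Python loop).

```python
import torch

def ssm_scan(u, A, B, C):
    """Diagonal state space model run as a recurrence: O(1) state per step,
    O(seqlen) total, no fixed context length.
    u: (batch, seqlen, d); A, B, C: (d, n) parameters of a diagonal SSM."""
    batch, seqlen, d = u.shape
    x = torch.zeros(batch, d, A.shape[-1], device=u.device, dtype=u.dtype)  # carried state
    ys = []
    for t in range(seqlen):
        x = A * x + B * u[:, t, :, None]    # state update, constant cost per token
        ys.append((x * C).sum(-1))          # readout
    return torch.stack(ys, dim=1)           # (batch, seqlen, d)

d, n = 8, 16
y = ssm_scan(torch.randn(2, 64, d), torch.rand(d, n) * 0.9,
             torch.randn(d, n), torch.randn(d, n))   # (2, 64, 8)
```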
@realDanFu
Dan Fu
2 years
Blog alert! 📣 How does contrastive learning work? How can we apply it effectively? New *3-part series* covering *2 new papers* on getting better transfer & robustness, and how to apply contrastive w types to improve entity retrieval. Part 1: 👇 (1/n)
1
38
116
@realDanFu
Dan Fu
2 years
Thrilled that FlashAttention won the best paper award at the Hardware Aware Efficient Training workshop at ICML - really excited to meet so many like-minded folks at the workshop. Thanks to the organizers (and NVIDIA) for the GPU!
Tweet media one
@tri_dao
Tri Dao
2 years
Announcing FlashAttention, a fast and memory-efficient attention algorithm with no approximation! 📣 w/ @realDanFu By reducing GPU memory reads/writes, FlashAttention runs 2-4x faster & requires 5-20x less memory than PyTorch standard attention, & scales to seq. length 64K. 1/
Tweet media one
31
358
2K
5
5
103
@realDanFu
Dan Fu
2 years
New preprint alert! 📣 How do we fuse foundation models with weak supervision? Liger (🐯 +🦁) is a new weak supervision framework that fuses FMs + WS using *local smoothness* -- outperforming both FMs and WS on their own. 📜 More below 👇 (1/n)
Tweet media one
2
30
98
@realDanFu
Dan Fu
1 year
Super excited to release the RedPajama dataset - a new, fully open *1.2 trillion token* dataset following the LLaMA recipe. A first step towards creating leading, fully open-source large language models.
@togethercompute
Together AI
1 year
Announcing RedPajama — a project to create leading, fully open-source large language models, beginning with the release of a 1.2 trillion token dataset that follows the LLaMA recipe, available today! More in 🧵 …
Tweet media one
39
407
2K
2
15
92
@realDanFu
Dan Fu
5 months
Today I'm talking about FlashFFTConv at the ENLSP workshop (Efficient Natural Language and Speech Processing)! The talk is at 9:48 AM, and the poster session is from 1:00 to 2:00!
Tweet media one
@realDanFu
Dan Fu
5 months
I'm flying out to #NeurIPS2023 @NeurIPSConf ! I'll be presenting an oral on Monarch Mixer tomorrow at 3:40 in the Oral 2A session, and I'll be presenting FlashFFTConv Saturday at the ENLSP workshop! Monarch Mixer: FlashFFTConv:
Tweet media one
1
1
23
2
13
82
@realDanFu
Dan Fu
5 months
Super excited for this model to see the light of day! 7B model, hybrid gated conv/SSM + attention architecture, trained for long context and running FlashFFTConv everywhere. You can chat with it now on the Together API!
@togethercompute
Together AI
5 months
Announcing StripedHyena 7B — an open source model using an architecture that goes beyond Transformers, achieving faster performance and longer context. It builds on the lessons learned in the past year designing efficient sequence modeling architectures.
Tweet media one
31
268
1K
4
9
70
@realDanFu
Dan Fu
1 year
After a short hiatus, the Stanford MLSys Seminar is coming back this quarter with a special series of episodes on foundation models! Our first talk (ep 67!!) will be @tri_dao , who'll be talking about FlashAttention. Catch us *TOMORROW* at 3:30 PT:
1
20
61
@realDanFu
Dan Fu
3 months
ChatGPT's 1700-token system prompt got you down? Led by @jordanjuravsky , @brad19brown , introducing Hydragen, a simple technique for Transformer LLM inference with shared prefixes! Up to 30x improvement in throughput with no custom CUDA! A few things I love in this project: 1/
Tweet media one
@jordanjuravsky
Jordan Juravsky
3 months
Excited to share my first PhD project! TLDR: Hydragen is an exact, simple (no custom CUDA) implementation of attention for large batches with shared prefixes. We can improve LLM throughput by over 30x for CodeLlama-13b. Also, adding lots more shared context becomes cheap:
Tweet media one
Tweet media two
Tweet media three
10
55
300
1
8
58
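The core trick, sketched below under simplifying assumptions (no causal masking inside the query block; tensor shapes are mine, not the authors'): store the shared prefix's KV once, score queries against prefix and suffix separately, and renormalize with a single softmax so the result stays exact.

```python
import torch

def shared_prefix_attention(q, k_prefix, v_prefix, k_suffix, v_suffix):
    """Attention decomposed over a shared prefix plus per-sequence suffixes
    (the idea behind Hydragen; not the authors' implementation).
    q, k_suffix, v_suffix: (batch, heads, len, d); k_prefix, v_prefix:
    (heads, prefix_len, d), stored once for the whole batch."""
    scale = q.shape[-1] ** -0.5
    s_p = torch.einsum('bhqd,hkd->bhqk', q, k_prefix) * scale   # scores vs. the shared prefix
    s_s = torch.einsum('bhqd,bhkd->bhqk', q, k_suffix) * scale  # scores vs. each sequence's suffix
    # One softmax over the concatenated scores keeps the output identical to full attention.
    probs = torch.softmax(torch.cat([s_p, s_s], dim=-1), dim=-1)
    p_p, p_s = probs.split([k_prefix.shape[-2], k_suffix.shape[-2]], dim=-1)
    return torch.einsum('bhqk,hkd->bhqd', p_p, v_prefix) + \
           torch.einsum('bhqk,bhkd->bhqd', p_s, v_suffix)
```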
@realDanFu
Dan Fu
5 months
Thrilled to win the Best Poster award at the ENLSP workshop!
Tweet media one
@realDanFu
Dan Fu
5 months
Today I'm talking about FlashFFTConv at the ENLSP workshop (Efficient Natural Language and Speech Processing)! The talk is at 9:48 AM, and the poster session is from 1:00 to 2:00!
Tweet media one
2
13
82
1
2
55
@realDanFu
Dan Fu
1 year
I'll be at #NeurIPS2022 this week! @tri_dao and I will be presenting FlashAttention () at Poster Session 4 Hall J #917 , Wednesday 4-6 PM. Super excited to talk all things performance, ML+systems, and breaking down scaling bottlenecks!
2
6
52
@realDanFu
Dan Fu
4 years
Super excited to share some thoughts with @laurel_orr1 on lessons learned from the past four years with @HazyResearch and @SnorkelML , and what's next for the ways that machine learning is changing how we build software:
1
24
48
@realDanFu
Dan Fu
2 years
Our paper got accepted to #ICML2022 - excited to talk about this work in Baltimore!
@MayeeChen
Mayee Chen
2 years
New preprint alert! 📣 How do we produce transferable and robust representations with supervised contrastive learning? We need *geometric spread* and an inductive bias towards *latent subclass clustering* in representation space. 📜 👇 (1/n)
Tweet media one
2
55
253
0
3
43
@realDanFu
Dan Fu
1 year
In H3, we replace attention with a new layer based on state space models (SSMs) - with the right modifications, we find that it can outperform Transformers. Two key ideas: * Adapting SSMs to be able to do *comparison* * Making SSMs as hardware-efficient as attention 3/n
Tweet media one
1
2
42
@realDanFu
Dan Fu
2 years
We built off the super-optimized version of Diffusers that @Nouamanetazi / @huggingface released last week - and we're 33% faster than that version. The diff is pretty small, 68 LOC:
1
5
41
@realDanFu
Dan Fu
1 year
We’ve been hard at work training RedPajama 7B! GPUs go brrr :)
@togethercompute
Together AI
1 year
Training our first RedPajama 7B model is going well! Less than halfway through training (after 440 billion tokens), the model achieves better results on HELM benchmarks than the well-regarded Pythia-7B trained on the Pile. Details at
Tweet media one
17
91
500
2
2
39
@realDanFu
Dan Fu
2 years
A bit late, but super honored to receive the best student paper runner up at @UncertaintyInAI #UAI2022 ! This project has been 2+ years in the making (we started *before COVID*), so super grateful to see it recognized! w @MayeeChen , @dyhadila , @fredsala , @kayvonf , @HazyResearch !
Tweet media one
@realDanFu
Dan Fu
2 years
New preprint alert! 📣 How do we fuse foundation models with weak supervision? Liger (🐯 +🦁) is a new weak supervision framework that fuses FMs + WS using *local smoothness* -- outperforming both FMs and WS on their own. 📜 More below 👇 (1/n)
Tweet media one
2
30
98
1
9
39
@realDanFu
Dan Fu
1 year
Overall, really excited about new models/architectures like this. What happens if we don't need attention to get the magic we've been seeing, and we can get the same quality with a linear operator? No more fixed context windows, long context for everyone! 16/n
1
1
38
@realDanFu
Dan Fu
2 years
Super excited by this work. Making attention IO-aware makes it run way faster - and enables much longer sequences, since memory footprint becomes linear in sequence length. Really excited to see how this gets used, and where it goes next - IO-aware transformers?
@tri_dao
Tri Dao
2 years
Announcing FlashAttention, a fast and memory-efficient attention algorithm with no approximation! 📣 w/ @realDanFu By reducing GPU memory reads/writes, FlashAttention runs 2-4x faster & requires 5-20x less memory than PyTorch standard attention, & scales to seq. length 64K. 1/
Tweet media one
31
358
2K
4
4
35
@realDanFu
Dan Fu
2 years
FlashAttention speeds up attention and reduces its memory footprint - without any approximation. Our key insight is that attention is bottlenecked by GPU memory *reads/writes*. FlashAttention speeds up attention by reducing the R/W. Same FLOPs, 3-4x faster!
Tweet media one
1
6
37
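To make the bottleneck concrete: the naive version below writes the full seqlen x seqlen score matrix to GPU memory, while a fused kernel never does. The PyTorch built-in shown is one way to reach such a kernel (PyTorch >= 2.0); FlashAttention also ships as its own library.

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    """Standard attention: materializes the full (seqlen x seqlen) score matrix,
    so runtime is dominated by GPU memory reads/writes rather than FLOPs."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

device = 'cuda' if torch.cuda.is_available() else 'cpu'
q = k = v = torch.randn(4, 8, 2048, 64, device=device)

out_naive = naive_attention(q, k, v)
# Fused kernels (FlashAttention-style) tile the computation so the score matrix
# never hits main GPU memory -- same FLOPs, far fewer reads/writes.
out_fused = F.scaled_dot_product_attention(q, k, v)
```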
@realDanFu
Dan Fu
1 year
Build your own ChatGPT! Super excited by this open-source release - even more exciting that it was trained 100% carbon-negative. Happy to play a (minuscule) part in putting it together and helping serve it faster. Looking forward to seeing what folks build on top of this!
@togethercompute
Together AI
1 year
Introducing OpenChatKit. A powerful, open-source base to create chatbots for various applications. Details in 🧵
14
125
465
0
8
37
@realDanFu
Dan Fu
2 years
Friends don't let friends run XGBoost on tabular data without trying foundation models first. Great work by some awesome labmates!
@Avanika15
Avanika Narayan
2 years
Can Foundation Models (FMs) clean and integrate your data? We explore the efficacy of FMs on these hard classical data tasks (1/7)
Tweet media one
9
51
182
0
14
37
@realDanFu
Dan Fu
5 years
Why Train What You Can Code? Excited to share Rekall - using programmatic composition to find new events in video! Paper on arXiv, and code available on Github! Blog:
1
20
34
@realDanFu
Dan Fu
2 years
Meta uses FlashAttention to speed up inference in AITemplate - really cool work, super excited to see folks pick it up!
@AIatMeta
AI at Meta
2 years
Get faster, more flexible inference on GPUs using our newly open-sourced AITemplate, a revolutionary new inference engine that delivers up to 12X performance improvements on NVIDIA GPUs & 4X on AMD GPUs compared to eager-mode within Pytorch. Learn more:
10
152
754
0
1
33
@realDanFu
Dan Fu
2 years
Absolutely thrilled to receive the best paper award w @MayeeChen for our work on supervised contrastive learning at the AI with Biased/Scarce Data Workshop at @RealAAAI today! Check out the paper on the workshop website: Short 🧵👇 - more soon! (1/n)
1
5
33
@realDanFu
Dan Fu
1 year
Ce Zhang ( @DS3Lab and @togethercompute ) has done some crazy stuff in distributed training. In this talk, he goes over the magic behind distributed training and inference on a GLOBAL scale over slow networks! Tune in tomorrow at 3:30 pm Pacific!
2
10
30
@realDanFu
Dan Fu
10 months
Join us today for our workshop on efficient systems for foundation models - we’ve got a killer lineup of speakers and posters!
@ESFoMo
ES-FoMo@ICML2024
10 months
Attending #ICML2023 ? Join us Saturday at our workshop on Efficient Systems for Foundation Models! 🔥 Large-Scale Distributed Training 🚀 Efficient Inference ⚙️ Deep Optimization 📈 Over 50 posters and 4 orals spanning from RL to efficient finetuning!
Tweet media one
3
20
45
2
7
31
@realDanFu
Dan Fu
2 years
Check out our fork of @huggingface Diffusers on GitHub and our blog post to try it out yourselves and read more! Code: Blog:
2
5
28
@realDanFu
Dan Fu
1 year
@typedfemale Thanks for bringing this to our attention. We've updated the blog in light of this new and important information: 🙏🙏🙏
2
3
28
@realDanFu
Dan Fu
1 year
The deadline for our #ICML2023 workshop Efficient Systems for Foundation Models is tomorrow, May 31 AOE! Submit your best papers on training, inference or anything FM systems and efficiency - then join us for a great day of speakers & panel in Hawaii!
Tweet media one
3
10
28
@realDanFu
Dan Fu
1 year
The upshot: we can scale H3 up to *2.7B* parameter models. And because of the state passing, we can run inference blazing fast -- up to *2.4x* faster than highly-optimized Transformers. Up to 1,980 tokens/second! 12/n
Tweet media one
2
1
28
@realDanFu
Dan Fu
1 year
If you're at ICLR, catch my talk on our paper Hungry Hungry Hippos: Towards Language Modeling with State Space Models today at 10 AM in room AD12! Featuring photos of actual Rwandan hippos :) (+poster from 11:30-1:30 at board 80!)
Tweet media one
@realDanFu
Dan Fu
1 year
🛫 to Rwanda for #ICLR2023 ! I’ll be giving a talk about H3 on Wednesday, and talking about some newer work on long convs at the ME-FoMo workshop on Thursday. Please reach out if you’ll be there and want to chat! Happy to talk about Hyenas, Red Pajamas, or anything else!
3
0
22
1
3
26
@realDanFu
Dan Fu
1 year
The H3 layer closes the gap on our synthetics, and the gains translate to strong downstream performance on language modeling. We replaced almost all the attention blocks in a Transformer with H3 layers, and trained on the Pile. Our model *outperforms* GPT-Neo in PPL! 7/n
Tweet media one
1
1
27
@realDanFu
Dan Fu
2 years
We were actually a bit late to the game here - when we saw a couple folks on Reddit and elsewhere who beat us to the punch, we decided to give it a try ourselves :) PhotoRoom: u/hnipun:
2
1
26
@realDanFu
Dan Fu
5 months
One final plug: Oral 2A Efficient Learning tomorrow is absolutely **packed** with great work from @Tim_Dettmers and @srush_nlp - super excited to hear what they have to say!
Tweet media one
1
5
23
@realDanFu
Dan Fu
2 years
(1/n) This week we have @fredsala on the Stanford MLSys Seminar, live on Thursday at 1:30 PM! Fred was a postdoc at @StanfordAILab , and is now a professor at @WisconsinCS and a research scientist at @SnorkelAI -- so he knows a thing or two about MLSys.
1
6
23
@realDanFu
Dan Fu
1 year
With FlashConv, we can make SSMs outperform attention for almost all sequence lengths -- up to 35x faster than FlashAttention for long sequences! 11/n
Tweet media one
1
1
23
@realDanFu
Dan Fu
5 months
I'm flying out to #NeurIPS2023 @NeurIPSConf ! I'll be presenting an oral on Monarch Mixer tomorrow at 3:40 in the Oral 2A session, and I'll be presenting FlashFFTConv Saturday at the ENLSP workshop! Monarch Mixer: FlashFFTConv:
Tweet media one
1
1
23
@realDanFu
Dan Fu
7 months
RedPajama-v2 - 30 trillion tokens, 84 CC dumps, 5 languages! Excited to see what people do with it :)
@togethercompute
Together AI
7 months
We are excited to release RedPajama-Data-v2: 30 trillion filtered & de-duplicated tokens from 84 CommonCrawl dumps, 25x larger than our first dataset. It exposes a diverse range of quality annotations so you can slice & weight the data for LLM training.
Tweet media one
20
286
1K
1
1
23
@realDanFu
Dan Fu
1 year
And we've got two blog posts up on our work -- first, read about our synthetic languages and how we developed H3: 14/n
1
2
22
@realDanFu
Dan Fu
2 years
We sped up stable diffusion by replacing the self-attention/cross-attention blocks in the unet with FlashAttention. FlashAttention doesn't do any approximation, so you get the *exact same image* at the end.
Tweet media one
1
1
22
@realDanFu
Dan Fu
1 year
🛫 to Rwanda for #ICLR2023 ! I’ll be giving a talk about H3 on Wednesday, and talking about some newer work on long convs at the ME-FoMo workshop on Thursday. Please reach out if you’ll be there and want to chat! Happy to talk about Hyenas, Red Pajamas, or anything else!
@realDanFu
Dan Fu
1 year
Attention is all you need... but how much of it do you need? Announcing H3 - a new generative language model that outperforms GPT-Neo-2.7B with only *2* attention layers! Accepted as a *spotlight* at #ICLR2023 ! 📣 w/ @tri_dao 📜 1/n
31
282
2K
3
0
22
@realDanFu
Dan Fu
2 years
I'm at #ICML2022 this week! Let's chat if you're also in person! I'm presenting two papers: - Improving Transfer, Robustness of Supervised Contrastive Learning - FlashAttention: Fast & Memory-Efficient Exact Attention ⏱below!
1
1
22
@realDanFu
Dan Fu
1 year
These synthetic languages (inspired by great work like ) test how well SSMs can do in-context learning compared to attention. We find a critical missing capability -- SSMs have trouble *comparing tokens* across the sequence. 5/n
2
1
22
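A toy generator in the spirit of these synthetics (the paper's exact associative-recall setup may differ): the model sees key-value pairs followed by a repeated key, and must emit that key's value, which requires comparing tokens across the sequence.

```python
import random

def associative_recall_example(num_pairs=8, keys_vocab=tuple("abcdefghij"),
                               values_vocab=tuple("0123456789")):
    """Generate one toy associative-recall sequence: key-value pairs, then a query key.
    The target is the value paired with the query key earlier in the sequence."""
    keys = random.sample(keys_vocab, num_pairs)
    vals = [random.choice(values_vocab) for _ in range(num_pairs)]
    query = random.choice(keys)
    tokens = [t for kv in zip(keys, vals) for t in kv] + [query]
    target = vals[keys.index(query)]
    return tokens, target

tokens, target = associative_recall_example()
# e.g. tokens = ['d', '4', 'a', '7', ..., 'a'], target = '7'
```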
@realDanFu
Dan Fu
1 year
The power of data - RedPajama-2.8B matches Pythia-7B in HELM score after being trained on 2x the tokens! Excited to see these models continue to improve as they see more tokens :)
@togethercompute
Together AI
1 year
In addition to RedPajama 7B, we’ve also been training a 2.8B model. After 600B tokens it is exciting to see the model has higher HELM scores than the excellent Pythia-2.8B & GPT-Neo 2.7B. In fact, trained with twice the tokens, RedPajama-2.8B has comparable quality to Pythia-7B!
Tweet media one
13
79
522
0
2
20
@realDanFu
Dan Fu
4 years
Super excited for our new seminar series on ML and systems -- how does ML change the modern programming stack, and what does it mean for how people will build and deploy applications in the future? Live on YouTube every Thursday, 3-4 PM PT. Check out links below for more!
@HazyResearch
hazyresearch
4 years
Announcing the new live-streamed Stanford MLSys Seminar Series, in which we will explore the frontier of machine learning and systems. Read the full announcement: Schedule: Intro video:
1
79
177
1
5
19
@realDanFu
Dan Fu
1 year
In response, we designed the H3 layer (Hungry Hungry Hippos) to plug this gap. The H3 layer stacks two SSMs, and uses some simple multiplicative interactions between them (gating) to do comparisons. 6/n
Tweet media one
1
1
20
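The mixing pattern, as a sketch (SSMs replaced by generic callables; not the released implementation):

```python
import torch

def h3_mixing(q, k, v, ssm_shift, ssm_diag):
    """Sketch of the H3 mixing pattern: two SSM passes plus multiplicative gating
    let the layer compare tokens. q, k, v: (batch, seqlen, d); ssm_shift / ssm_diag:
    any causal sequence maps (batch, seqlen, d) -> (batch, seqlen, d),
    e.g. SSM scans or long convolutions."""
    k = ssm_shift(k)       # first SSM over the keys (the "shift" SSM)
    x = ssm_diag(k * v)    # gate keys against values, then the second (diagonal) SSM
    return q * x           # final multiplicative gate with the queries

# Toy stand-in for an SSM: a causal cumulative average (placeholder only).
causal = lambda u: torch.cumsum(u, 1) / torch.arange(1, u.shape[1] + 1,
                                                     device=u.device).view(1, -1, 1)
q = k = v = torch.randn(2, 128, 64)
out = h3_mixing(q, k, v, causal, causal)   # (2, 128, 64)
```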
@realDanFu
Dan Fu
2 years
Wow, excited to see FlashAttention being adopted by folks in industry - and excited to see where else it can make training faster!
@DbrxMosaicAI
Databricks Mosaic Research
2 years
We have exciting news! In our latest and greatest LLM blog, we show how MosaicML Cloud can help you train LLMs from 1B - 70B parameters, and for the first time, publish transparent times + costs for doing so. It's a lot cheaper than you think! (1/9)
7
48
342
0
2
20
@realDanFu
Dan Fu
1 year
Part 1: the quality gap SSMs have achieved impressive results on sequence modeling (30+ points over Transformers on Long Range Arena), but have underperformed attention in language modeling. In our paper, we use *synthetic languages* to probe this gap 4/n
Tweet media one
1
1
19
@realDanFu
Dan Fu
1 year
What's the problem? Long convolutions require multiple FFT calls, which introduce expensive GPU memory reads/writes. We develop FlashConv to address this problem. FlashConv uses a block FFT algorithm to increase FLOP util, and uses state passing to scale to long sequences. 10/n
Tweet media one
2
1
19
@realDanFu
Dan Fu
4 months
Model weights available on HuggingFace, and AutoModel compatible. Download them with just two lines of code! 32k model: 8k model: 2k model: 4/
Tweet media one
1
2
19
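Since the tweet's HuggingFace links aren't preserved in this archive, the snippet below uses a placeholder checkpoint ID; the two-line AutoModel pattern is standard, with trust_remote_code typically required for custom architectures.

```python
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint ID -- substitute the actual M2-BERT retrieval checkpoint you want.
model_id = "togethercomputer/<m2-bert-retrieval-checkpoint>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Custom architectures generally need trust_remote_code to load their modeling code.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
```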
@realDanFu
Dan Fu
2 years
This week we're excited to have @kexinrong ( @Stanford , @VMware , and @gtcomputing ) on the MLSys Seminar. Kexin will talk about improving query performance on big-data analytics. Be there or be square! Watch us live on YouTube this Thursday at 1:30 PT:
1
3
19
@realDanFu
Dan Fu
6 months
@arankomatsuzaki @_akhaliq A few fun bits I couldn't fit into the original tweet: 1. We also have the fastest implementation of a short depthwise 1D convolution, which doesn't use the FFT but is up to 7x faster than PyTorch Conv1D, check out our repo to try it out: 2. During
Tweet media one
Tweet media two
0
4
19
@realDanFu
Dan Fu
7 months
Thanks Tri! And yes, I'm on the academic job market this year :)
@tri_dao
Tri Dao
7 months
As much as I like attention, I'm also fond of attention-free architectures for long context. @realDanFu and others have been pushing in this direction, with deep theory and compelling empirical results! And @realDanFu is on the academic job market this year!
1
11
84
0
0
18
@realDanFu
Dan Fu
4 years
Announcing FlyingSquid - fast weak supervision with triplet methods. We speed up weak supervision by orders of magnitude, allowing weakly-supervised video analysis and online learning! Blog: w/ @MayeeChen , @fredsala , Sarah Hooper, @kayvonf , @HazyResearch
1
12
19
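The triplet identity at the heart of this line of work, in a minimal form (abstains, multiple tasks, and the full FlyingSquid model are omitted): for three conditionally independent labeling sources, pairwise agreement rates alone determine each source's accuracy in closed form.

```python
import numpy as np

def triplet_accuracies(L):
    """Estimate each labeling function's accuracy E[lf_i * y] without any labels y,
    using only observed pairwise agreement rates.
    L: (n, 3) matrix of votes in {-1, +1} from three conditionally independent sources."""
    m = lambda i, j: np.mean(L[:, i] * L[:, j])   # observed agreement rate between sources
    acc = np.zeros(3)
    for i, j, k in [(0, 1, 2), (1, 0, 2), (2, 0, 1)]:
        # Conditional independence gives E[lf_i*y]^2 = E[lf_i*lf_j] * E[lf_i*lf_k] / E[lf_j*lf_k]
        acc[i] = np.sqrt(abs(m(i, j) * m(i, k) / m(j, k)))
    return acc
```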
@realDanFu
Dan Fu
1 year
With @tri_dao (co-first), @KhaledSaab11 , @ai_with_brains , Atri Rudra, and @HazyResearch ! Thanks to @StanfordAILab , @StanfordHAI , @StanfordCRFM , and @togethercompute for helping provide us with the compute necessary to train these models! 17/17
1
0
18
@realDanFu
Dan Fu
4 months
On LoCo, M2-BERT-32k outperforms the state-of-the-art embedding models! It even outperforms Mistral-7B, even though the M2-BERT models have only 80M parameters (85x more parameter-efficient)! 3/
Tweet media one
Tweet media two
1
2
16
@realDanFu
Dan Fu
4 months
As part of this release, we're also releasing version 0 of a new long-context benchmark called LoCo. Most academic retrieval benchmarks only have short-context documents, so we put together this benchmark of longer-context tasks. 2/
Tweet media one
2
2
18
@realDanFu
Dan Fu
4 months
Check out the blog for more details on the technical bits, and check out our GitHub for instructions on how to play with the model! Blog: Github: 7/
1
1
17
@realDanFu
Dan Fu
2 years
(1/n) This week we're delighted to have @faitpoms ( @Stanford , @SnorkelAI ) on the MLSys Seminar Series! Fait will be talking about a vision for interactive model development, so you won't want to miss it. Catch us live on YouTube Thursday at 1:30 PM! 🧵👇
1
6
16
@realDanFu
Dan Fu
2 years
And excited to announce that this paper was accepted to #UAI2022 @UncertaintyInAI as an oral! Excited to talk about it in Eindhoven!
@realDanFu
Dan Fu
2 years
New preprint alert! 📣 How do we fuse foundation models with weak supervision? Liger (🐯 +🦁) is a new weak supervision framework that fuses FMs + WS using *local smoothness* -- outperforming both FMs and WS on their own. 📜 More below 👇 (1/n)
Tweet media one
2
30
98
1
4
17
@realDanFu
Dan Fu
1 year
Then check out our blog up on @togethercompute about how FlashConv speeds up SSMs: 15/n
1
1
17
@realDanFu
Dan Fu
4 months
This project has been a great collaboration with @togethercompute . Thanks to them, these models are already integrated into @MongoDB Atlas, @LangChainAI , and @llama_index . Check out their tweet thread for more details! 9/
@togethercompute
Together AI
4 months
We are thrilled to announce the Together Embeddings endpoint! 🚀 Higher quality than OpenAI or Cohere in the MTEB benchmark. ✅ State of the art M2-Retrieval models with up to 32k context length. ✅ Up to 4x lower price. ✅ Details👇
Tweet media one
23
60
348
1
3
17
@realDanFu
Dan Fu
1 year
The context lengths of foundation models have grown exponentially recently - exciting developments! We've been happy to play a small role with FlashAttention, and we're very excited about the possibilities: multiple media sources, complex demonstrations, and more! 2/n
Tweet media one
2
1
15
@realDanFu
Dan Fu
1 year
FlashAttention gets even Flashier! You should pay attention to @tri_dao , he's on the market this year! (... ok I'll stop now)
@tri_dao
Tri Dao
1 year
I’ve been working with @AdeptAILabs and we’ve made FlashAttention even faster for long sequences! For seqlen 8K, FlashAttention is now up to 2.7x faster than a standard PyTorch implementation even at small batch, making it easier to train better LMs with longer context 1/7
Tweet media one
7
87
604
0
3
16
@realDanFu
Dan Fu
1 year
These gains also translate to strong downstream zero- and few-shot performance. On SuperGLUE, our zero-shot performance outperforms Transformer models of similar sizes. 8/n
Tweet media one
2
0
16
@realDanFu
Dan Fu
2 years
We’ll be talking about this work today at #acl2022nlp in Dublin! Come check us out at 5:00 PM in poster session 3-4 (information retrieval and text mining). I’ll be hanging around the whole week, come say hi!
@m_leszczy
Megan Leszczynski
2 years
New preprint alert! 📣 How do we improve long-tailed performance of entity retrieval? We use a supervised contrastive loss to *geometrically encode entity types* in representation space w/ bi-encoders. Check out our paper on TABi! 📜 Details👇 (1/n)
Tweet media one
2
26
95
0
5
16
@realDanFu
Dan Fu
2 years
(1/n) This week on the Stanford MLSys Seminar, we're super excited to host Cody Coleman ( @codyaustun ), former Stanford PhD and founder/CEO of Coactive AI! Cody has a great talk prepared on Data Selection for Data-Centric AI! Tune in Thursday at 1:30 PT!
1
1
16
@realDanFu
Dan Fu
1 year
The first RedPajama model is out in the wild!
@jefrankle
Jonathan Frankle
1 year
72 hrs ago, @togethercompute released the RedPajama dataset. Like everyone, we at @MosaicML were very excited about the idea of a fully open-source Llama. So excited, in fact, that we've already trained a 1B model on 200B tokens! It's on HF (Apache2) here:
13
82
485
0
2
16
@realDanFu
Dan Fu
5 months
Come check out Monarch Mixer today! Talk at 3:40 in Hall C2, poster #509 at 5:15!
Tweet media one
@realDanFu
Dan Fu
5 months
I'm flying out to #NeurIPS2023 @NeurIPSConf ! I'll be presenting an oral on Monarch Mixer tomorrow at 3:40 in the Oral 2A session, and I'll be presenting FlashFFTConv Saturday at the ENLSP workshop! Monarch Mixer: FlashFFTConv:
Tweet media one
1
1
23
0
1
15
@realDanFu
Dan Fu
1 year
Part 2: the efficiency gap But that's not all! In order to scale H3 up to billion-parameter models, we had to make it as hardware-efficient as attention. The convolution is O(N log N) asymptotically, but still underperforms FlashAttention for short sequences... 9/n
Tweet media one
1
1
14
@realDanFu
Dan Fu
1 year
We've got a great workshop at ICML, submit your best papers!
@ESFoMo
ES-FoMo@ICML2024
1 year
Announcing our 📣 Call for Papers for the ES-FoMo workshop @ ICML 2023! ➡️ We welcome papers touching on inference and training of foundation models, spanning from systems & benchmarks to novel algorithms. 🔗 (deadline: 31st of May)
Tweet media one
1
18
27
0
0
14
@realDanFu
Dan Fu
2 years
(1/n) This week on the Stanford MLSys Seminar, we've got Ellie Pavlick - professor at @BrownCSDept and research scientist at @GoogleAI . Ellie will be talking about how to implement symbols and rules with neural networks! Tune in Thursday at 1:30 PT:
1
2
14
@realDanFu
Dan Fu
1 year
I had a great time chatting with @samcharrington on the @twimlai podcast - we had a great conversation about H3, FlashAttention, and all things language modeling. Thanks so much for having me on!
@twimlai
The TWIML AI Podcast
1 year
Today we're joined by @realDanFu , a PhD student at @Stanford , to discuss how state space models can improve language models and the limitations of attention. We discuss the H3 architecture, flash attention, and much more! #NLP #MachineLearning #ICLR23 🎧
Tweet media one
2
4
7
0
2
13
@realDanFu
Dan Fu
4 years
1/2 Excited to share Epoxy: fast model iteration with weak supervision + pre-trained embeddings. We look at how to use pre-trained embeddings without the need for fine-tuning - model iteration in <1/2 second, instead of hours or days. Paper on arXiv now:
1
4
13
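A rough sketch of the idea (not the paper's exact algorithm or thresholds): extend a labeling source's votes to nearby abstained points in pre-trained embedding space, with no fine-tuning step.

```python
import numpy as np

def extend_votes(emb, votes, threshold=0.8):
    """Where a labeling source abstains (vote 0), copy the vote of its nearest
    labeled neighbor in embedding space if the similarity clears a threshold.
    emb: (n, d) unit-normalized embeddings; votes: (n,) array in {-1, 0, +1}."""
    extended = votes.copy()
    labeled = np.flatnonzero(votes != 0)
    for i in np.flatnonzero(votes == 0):
        sims = emb[labeled] @ emb[i]          # cosine similarities to labeled points
        j = int(np.argmax(sims))
        if sims[j] >= threshold:
            extended[i] = votes[labeled[j]]
    return extended
```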
@realDanFu
Dan Fu
10 months
M2 builds on a lot of great work in the space, from folks like @_albertgu , @tri_dao , @ramin_m_h , @MichaelPoli6 , @BeidiChen , @exnx , @BlinkDL_AI , @davidwromero , @MaxMa1987 , and many many more! Check out @srush_nlp 's great overview 9/
@srush_nlp
Sasha Rush
11 months
Do we need Attention? Linear RNNs for NLP (). Received a couple requests for a video version.
1
22
166
1
0
13
@realDanFu
Dan Fu
3 years
In this week’s MLSys Seminar, we'll be joined by @Lin_Ma_ from @CMUDB . Lin will be talking about his work on self-driving databases. As always, 30 minute talk + 30 minute podcast with live audience questions! Livestream link: #Stanford #MachineLearning
0
4
12
@realDanFu
Dan Fu
10 months
And now it’s time for our panel on LLM tooling across industry and academia!
Tweet media one
@realDanFu
Dan Fu
10 months
Join us today for our workshop on efficient systems for foundation models - we’ve got a killer lineup of speakers and posters!
2
7
31
0
1
13
@realDanFu
Dan Fu
4 months
This is still early work, so we would love to hear from you if you are interested in long-context embeddings. If you have long-context tasks, we would love to hear how M2-BERT performs on them! If you have suggestions about tasks to add to LoCo, please let us know! 8/
2
0
11
@realDanFu
Dan Fu
4 months
These models are also available on @togethercompute 's new embeddings API service. Play with them now with a simple HTTP request! 5/
Tweet media one
1
1
10
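Illustrative request only -- the endpoint, model name, and response fields below follow the common OpenAI-style embeddings schema and should be checked against Together's API docs.

```python
import os
import requests

# Assumed endpoint and payload shape; substitute the real model name from the docs.
resp = requests.post(
    "https://api.together.xyz/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={"model": "<m2-bert-retrieval-model>", "input": "a long document to embed ..."},
)
embedding = resp.json()["data"][0]["embedding"]
```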
@realDanFu
Dan Fu
4 years
Excited to welcome Virginia Smith for her talk "On Heterogeneity in Federated Settings" today at 3pm PT for episode 3 of the MLSys seminar! Tune in today at 3:
@w4nderlus7
Piero Molino
4 years
Tune in to the Stanford MLSys Seminar Series this Thursday 3-4pm PST, Virginia Smith (CMU) will talk about incorporating real-world constraints in federated learning. Livestream: Website:
0
1
19
0
2
11
@realDanFu
Dan Fu
2 years
The upshot? Throughput >1 image/s for 50 denoising steps on A100, 3-4x faster than unoptimized versions. And 33% faster than the super optimized version of Diffusers!
Tweet media one
Tweet media two
1
1
12
@realDanFu
Dan Fu
3 years
In this week’s MLSys Seminar, Gideon Mendels, CEO of @Cometml , will tell us all about MLOps System Design. Tune in Thursday, 1:30 PM on YouTube for a 30 minute talk + 30 minute podcast with live audience Q&A! Livestream link: #Stanford #MachineLearning
0
3
12