I did this. Fuck what anyone else says, just put the pedal to the metal and BUILD.
Push spaghetti code.
Nobody cares about OOP.
Doesn’t matter what anyone thinks.
Just keep on doing.
Document in public.
Don’t listen to the haters.
Release more than you refactor.
Just keep…
We made Whisper even faster. ~40% faster!! 🔥
Whisper solidifies its lead by an even larger margin!
With the latest changes in transformers - large v3 is the best* and the fastest among the top 5 on the Open ASR Leaderboard.
Below is the reduction in Real-time factor (RTF):…
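The RTF comparison above boils down to simple arithmetic; here's a minimal sketch (the timing numbers are illustrative, not the actual benchmark):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time taken to transcribe / duration of the audio (lower is better)."""
    return processing_seconds / audio_seconds

# Illustrative numbers only: 10 minutes of audio transcribed in 15 seconds.
rtf = real_time_factor(15.0, 600.0)
print(rtf)  # 0.025, i.e. ~40x faster than real time
```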
Insanely fast whisper now with Speaker Diarisation! 🔥
100% local and works on your Mac or on Nvidia GPUs.
All thanks to @hbredin's Pyannote library, you can now get blazingly fast transcriptions and speaker segmentations! ⚡️
Here's how you can use it too:
pipx install…
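Under the hood, combining Whisper chunks with Pyannote speaker turns is essentially an interval-overlap assignment. A minimal sketch of that merge step, with hypothetical model outputs (not the actual insanely-fast-whisper implementation):

```python
def overlap(a, b):
    """Length (seconds) of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def assign_speakers(asr_chunks, speaker_turns):
    """Label each transcribed chunk with the speaker turn it overlaps the most."""
    return [
        (max(speaker_turns, key=lambda turn: overlap(span, turn[1]))[0], text)
        for text, span in asr_chunks
    ]

# Hypothetical outputs from the two models:
chunks = [("Hello there.", (0.0, 2.0)), ("Hi, how are you?", (2.1, 4.0))]
turns = [("SPEAKER_00", (0.0, 2.05)), ("SPEAKER_01", (2.05, 4.5))]
print(assign_speakers(chunks, turns))
# [('SPEAKER_00', 'Hello there.'), ('SPEAKER_01', 'Hi, how are you?')]
```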
Whisper powered by Apple Neural Engine! 🔥
The lads at @argmaxinc optimised Whisper to work at blazingly fast speeds on iOS and Mac!
> All code is MIT-licensed.
> Up to 3x faster than the competition.
> Neural Engine as well as Metal runners.
> Open source CoreML models.
> 2…
Mixtral 8x7B Instruct with AWQ & Flash Attention 2 🔥
All in ~24GB GPU VRAM!
With the latest release of AutoAWQ - you can now run Mixtral 8x7B MoE with Flash Attention 2 for blazingly fast inference.
All in < 10 lines of code.
The only real change except loading AWQ weights…
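The ~24GB figure follows from back-of-the-envelope weight-memory math (the parameter count is approximate; activations and KV cache are ignored):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone (no activations/KV cache)."""
    return n_params * bits_per_param / 8 / 1e9

MIXTRAL_PARAMS = 46.7e9  # all 8 experts stay resident: ~46.7B total parameters

print(weight_memory_gb(MIXTRAL_PARAMS, 16))  # fp16: ~93 GB, far beyond one GPU
print(weight_memory_gb(MIXTRAL_PARAMS, 4))   # 4-bit AWQ: ~23 GB, fits in 24 GB
```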
Insanely fast whisper now with Whisper Large V3 🔥
Transcribe 150 minutes of audio in less than 98 seconds (powered by Transformers & @tri_dao's Flash Attention 2).
Don't believe it? look at the benchmarks below ;)
All of this with the familiar Transformers API and optionally…
Introducing Command R Plus ⚡
> Beats Claude 3, Mistral Large and GPT-4 Turbo.
> 104 Billion parameters.
> Built with multi-step tool use and RAG.
> Supports 10 languages.
> Context length of 128K.
> Trained with grounded generation capabilities - citations and responses based on…
Alrighty! W2V-BERT 2.0: Speech encoder for low-resource languages! 🔥
With < 15 hours of audio, you can beat Whisper and get your own SoTA ASR model!
> Pre-trained on 4.5M hours of data.
> 600M parameters.
> 143+ languages.
> 10-30x faster than Whisper.
> Best part: MIT license…
After 70x faster Whisper, we present to you - 5x faster Whisper fine-tuning! ⚡️
Powered by LoRA and 🤗 PEFT - squeeze in 5x larger batch sizes and fit the Whisper-large checkpoint in < 8GB VRAM! 🔥
Best part? With almost no degradation in WER! 🤯
Check it out:
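The reason LoRA frees up so much memory: only small low-rank adapters are trained, not the full weight matrices. A rough sketch with an illustrative layer size (not Whisper's exact configuration):

```python
def lora_trainable(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA adapter (A: d_in x r, B: r x d_out)."""
    return r * (d_in + d_out)

# Illustrative single projection matrix; Whisper-large layers are 1280-wide.
full = 1280 * 1280                        # 1,638,400 weights if trained directly
lora = lora_trainable(1280, 1280, r=32)   # 81,920 adapter weights
print(f"{100 * lora / full:.1f}%")        # 5.0% of the matrix is trainable
```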
4-bit quantised Mistral 7B instruct v0.2! - fasttt! 🏎️
On Mac (M2). Powered by MLX. Fully local.
Requires < 10GB RAM for 4-bits. (GPU poors, rise up)
Have to say, MLX is a solid alternative to llama.cpp
Welcome OpenVoice! 🎙️
A versatile voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages.
Open access weights 🔥
It enables granular control over voice styles, including…
Current best local model:
1. LLM - Mistral Instruct v0.2 7B/ Command R (4bit)
2. TTS - Parler-TTS/ Style-TTS 2
3. ASR - distil-whisper/ faster-whisper
4. VLM - Idefics 2/ CogVLM
Best stack:
1. Use llama.cpp to run LLM/ VLM via the server
2. Transformers to run Parler TTS/…
LETS GOO! Parler TTS 🔥
A fully open-source, Apache 2.0 licensed Text-to-speech model focused on providing maximum controllability.
Through voice prompts, you can control the pitch, speed, gender, noise levels, emotion characteristics and more!
> Trained on 10K hours of…
Let's go, 200% faster Whisper w/ speculative decoding! 🔥
Whisper (baseline) - 73 seconds
Whisper w/ Speculative Decoding - 33 seconds
All with zero drop in performance! ⚡
Pseudocode:
1. Initialise a Teacher model ex: openai/whisper-large-v2.
2. Load an assistant model ex:…
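The teacher/assistant loop above can be sketched as a toy greedy speculative decoder. The lambdas below are deterministic stand-ins for the real models, purely to show the accept/reject mechanics:

```python
def speculative_decode(target, draft, prompt, n_tokens, k=4):
    """Toy greedy speculative decoding over integer token lists.

    The cheap `draft` model proposes k tokens; `target` checks each one and
    keeps the longest agreeing prefix, so the output is identical to plain
    greedy decoding with `target` alone - just with fewer slow calls whenever
    the two models agree.
    """
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        ctx, proposal = list(out), []
        for _ in range(k):                   # assistant drafts k tokens cheaply
            proposal.append(draft(ctx))
            ctx.append(proposal[-1])
        for t in proposal:                   # teacher verifies the draft
            if target(out) == t:
                out.append(t)                # accepted draft token
            else:
                out.append(target(out))      # first mismatch: teacher's token
                break
    return out[len(prompt):len(prompt) + n_tokens]

# Deterministic stand-ins for teacher/assistant next-token functions:
target = lambda ctx: len(ctx) % 5
draft = lambda ctx: len(ctx) % 5 if len(ctx) % 7 else 0  # occasionally wrong
print(speculative_decode(target, draft, [1, 2], 6))  # [2, 3, 4, 0, 1, 2]
```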
Oof! Whisper on @Apple's MLX backend is quite stonkingly fast! 🏃
Not only that, it optimises GPU + CPU usage quite well!
What is MLX?
MLX is a framework released by Apple for ML researchers to train and infer ML models efficiently. MLX has a Python API that closely follows…
Run Mixtral 8x7B w/ ~13 GB VRAM 🤯
*On a free colab too, powered by Transformers & AQLM!
AQLM is a new SOTA method for low-bitwidth LLM quantization, targeted to the “extreme” 2-3bit / parameter range.
In less than 5 lines of code, you can try it out too! ⚡
Make sure to…
Insanely fast whisper - now with a CLI⚡️
You can now translate/ transcribe 100s of hours of data across 99 languages! - all from your terminal.
Here's how you can use it:
1. Install requirements
pip install transformers accelerate optimum
2. Grab the transcribe py file and…
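For long files, the pipeline chunks audio into overlapping windows and merges the outputs. A sketch of the windowing step (the chunk and stride lengths here are illustrative, not necessarily the CLI's defaults):

```python
def chunk_bounds(duration_s, chunk_s=30.0, stride_s=5.0):
    """(start, end) windows for chunked long-form transcription.

    Consecutive windows overlap by `stride_s` seconds so words cut at a
    boundary can be recovered when the chunk outputs are merged.
    """
    bounds, start = [], 0.0
    while start < duration_s:
        bounds.append((start, min(start + chunk_s, duration_s)))
        start += chunk_s - stride_s
    return bounds

print(chunk_bounds(70.0))
# [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```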
Whisper running on WatchOS! 🔥
> Powered by WhisperKit by @argmaxinc
> Supports up to Whisper base
> Leverages Neural Engine ⚡
> Three lines of code ;)
> Works real-time!
> MIT license
Quite amazed by the speed with which Argmax is shipping.
Possibly the fastest & reliable…
Whisper Large V3 has landed in Transformers! 🎉
The large-v3 checkpoint open-sourced by OpenAI yesterday is now fully compatible with Transformers!
Best part: It is fully compatible with the ASR pipeline! Here's how you can use it:
import torch
from transformers import…
Introducing Distil-Whisper v3 ⚡
> ~50% fewer parameters and 6x faster than Large-v3.
> More accurate than large-v3 on long-form transcription.
Available with 🦀 WebGPU, Whisper.cpp, Transformers, Faster-Whisper and Transformers.js support!
Drop in; no changes are required! 🔥
Introducing Open TTS Tracker! 🗣️
*sound on*
A one-stop shop to track all open access/ source TTS models!
Ranging from XTTS to Pheme, OpenVoice to VITS, and more... ⚡
For each model, we compile:
1. Source code
2. Checkpoints
3. License
4. Fine-tuning code
5. Languages…
Mistral QLoRA w/ MLX on your Mac ⚡
Utilising 100% GPU, fully offline.
You can now convert any Hugging Face model to a Quantised format and use it to fine-tune on-device!
python convert.py --hf-path mistralai/Mistral-7B-v0.1 -q
Then, to fine-tune, run:
python lora.py…
Introducing MLX-LM! ⚡ *sound on*
Run LLMs on-device directly on your Mac with 3 lines of code! ;)
100% local and quite spiffy (even faster with 4-bit)!
I made a quick video covering the package, its capabilities and a bit of quantisation.
The video goes through what MLX is,…
Insanely fast whisper now on Mac 🚀
You can now get the same Whisper experience in the comfort of your Mac, too! This is made possible by the torch.mps backend.
It isn't as fast as CUDA; however, it works pretty fast and can utilise the GPU well!
All you need to do is this:…
MusicLang 🎶 - Llama 2 based Music generation model!
> Llama2 based, trained from scratch.
> Permissively licensed - open source.
> Optimised to run on CPU. 🔥
> Highly controllable, choose tempo, chord progression, bar range and more! ;)
Absolutely love playing with the demo,…
MIT licensed Phi running on Mac powered by Rust! 🦀
Spiffy and fast, powered by Candle! ⚡
As simple as running:
cargo run --example phi --release --features metal -- --model 2 --prompt "A skier slides down a frictionless slope of height 40m and length 80m. What's the skiers…
Introducing Qwen 1.5! 🔥
> 6 model sizes, including 0.5B, 1.8B, 4B, 7B, 14B, and 72B.
> Beats GPT 3.5, Mistral-Medium.
> Multilingual support of both base and chat models.
> Supports 32K context length.
> Base + chat model checkpoints released.
> Runs natively with…
Introducing fast-llm rs! 🦀
Infer LLMs like Mistral, Llama and Mixtral on your Mac, right from your CLI!
Powered by Candle and Rust! ⚡
Works on Metal and CPU - Infer your GGUF checkpoints in pure Rust! ;)
All you gotta do is:
Step 1: git clone
https://github.…
High-quality speech/ text translations with SeamlessM4T v2 by @AIatMeta 🔉
M4T == Massively Multilingual and Multimodal Machine Translation seamlessly ;)
You can now translate in 100 languages from/ to speech or text with transformers!
Here's how you can do it, too! 👇
1.…
Nous Hermes Yi 34B beats Mixtral 8X7B 🔥
With AWQ, you only need ~20GB VRAM to run this beast, 100% local and offline!
Trained on 1M+ GPT-4-generated data points! (synthetic data ftw!)
Here's how you can run it, too (w/ transformers and AutoAWQ):
from transformers import…
Announcing TTS Arena! 🗣️
*sound on*
One place to test, rate and find the champion of current open models.
A continually updated space with the greatest and the best of the current TTS landscape! ⚡
Rate once, rate twice - help us find the best out there.
Starting with five…
NeMo Canary 1B by @NVIDIAAI 🔥
*Sound on 🔊*
> Tops the Open ASR Leaderboard.
> Beats Whisper to the punch for ASR.
> Beats Seamless M4Tv2 for Speech Translation.
> Supports 4 languages - English, Spanish, French & German.
> Trained on 85,000 hours of annotated audio.
>…
Let's goo! StyleTTS 2 - New king of the Text to Speech Arena! 👑
StyleTTS 2 is fully open source, and the authors are training better and larger checkpoints. 🔥
Stay tuned for some exciting updates re: StyleTTS v2 - things will get excitinggg!
Side note: 200 stars on the TTS…
OpenMath Instruct-1 by @NVIDIAAI 🧮
> 1.8 Million Problem-Solution (synthetic) pairs.
> Uses GSM8K & MATH training subsets.
> Uses Mixtral 8x7B to produce the pairs.
> Leverages both text reasoning + code interpreter during generation.
> Released Llama, CodeLlama, Mistral,…
Whisper Large-v3: New champion for the Open ASR leaderboard! 👑
We evaluated the latest Whisper checkpoint on a series of datasets and found it the most performant!
Here are a couple of quick takeaways from running these evaluations:
1. The best performance for Whisper…
Want to train your own Bark/MusicGen-like TTS/TTA models? 👀
The SoTA Encodec model by @MetaAI has now landed in 🤗Transformers!
It supports compression down to 1.5 kbps and produces discrete audio representations. ⚡️
Model:
Colab:
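EnCodec's lowest bandwidth setting (1.5 kbps) falls straight out of the residual-VQ arithmetic. A quick sanity check (frame rate and codebook size per the EnCodec paper; this is arithmetic, not the library API):

```python
def rvq_bitrate_kbps(frame_rate_hz, n_codebooks, bits_per_code):
    """Bitrate of a residual-VQ codec: frames/s x codebooks x bits per code."""
    return frame_rate_hz * n_codebooks * bits_per_code / 1000

# EnCodec at 24 kHz: 75 frames/s, 1024-entry codebooks (10 bits per index).
print(rvq_bitrate_kbps(75, 2, 10))  # 1.5 kbps with 2 codebooks
print(rvq_bitrate_kbps(75, 8, 10))  # 6.0 kbps with 8 codebooks
```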
BOOM! Whisper + Speaker Diarisation! 🔥
Blazingly fast meeting transcription all with a simple call to an API - powered by Inference Endpoints ⚡
- Whisper to transcribe speech to text (w/ Flash Attention)
- Diarization to break down the transcription by speakers (w/ Pyannote)…
Parakeet RNNT & CTC models top the Open ASR Leaderboard! 👑
Brought to you by @NVIDIAAI and @suno_ai_, Parakeet beats Whisper and regains its first place.
The models are released under a commercially permissive license! 🥳
The models inherit the same FastConformer…
Insanely fast whisper now with Flash Attention 2 🔥
With the latest release of Transformers (4.35), you can run Whisper & Distil-Whisper even faster with Flash Attention 2.
To benefit from it, make sure to upgrade your transformers & flash-attn version:
pip install --upgrade…
Distil-whisper now with apple neural engine support via WhisperKit! 🔥
You can now:
brew install whisperkit-cli
Followed by:
whisperkit-cli transcribe --model-prefix "distil" --model "large-v3" --verbose --audio-path ~/Downloads/jfk.wav
Bonus: If you have an M2 or higher…
UPDATE: New benchmark for insanely fast whisper! 🤗
You can transcribe 3000 hours of audio in less than 2 hours!
Batching + BetterTransformer is still the fastest way to transcribe audio insanely fast!
PSA 📣: MLX can now pull Mistral/ Llama/ TinyLlama safetensors directly from the Hub! 🔥
pip install -U mlx is all you need!
All mistral/ llama fine-tunes supported too! 20,000+ checkpoints overall!
P.S. We also provide a script to convert and quantise checkpoints and…
mixtral 8x22B - things we know so far 🫡
> 176B parameters
> performance between GPT-4 and Claude Sonnet (according to their Discord)
> same/ similar tokeniser as Mistral 7B
> 65536 sequence length
> 8 experts, 2 experts per token: More
> would require ~260GB VRAM in…
Welcome distil-whisper 🔥
49% smaller, 6x faster, and within 1% WER of Whisper-large-v2!
All in the good ol' Transformers API.
1. Make sure to upgrade transformers to the latest release.
pip install --upgrade transformers
2. Import torch & transformers…
Transcribe 150 minutes of Audio in less than 5 minutes with Whisper large! 🏎️
Powered by Transformers and Optimum, you get blazingly fast transcriptions in a few lines of code!
pipe = pipeline(
    "automatic-speech-recognition",
    "openai/whisper-large-v2",
    torch_dtype=torch.float16,…
恭喜发财 Gong xi fa cai 🧧
The impact of China on the current AI/ ML landscape has been ginormous. From LLMs to TTS to ASR, we've gotten SoTA models weekly from China-based labs!
Some highlights for me:
LLM/ VLMs
1. Qwen 1.5 & Qwen VL -
2. OpenBMB…
SURPRISE: Google just dropped CodeGemma 1.1 7B IT 🔥
The models get incrementally better at Single and Multi-generations.
Major boost in C#, Go and Python 🐍
Along with the 7B IT, they released an updated 2B base model too.
Enjoy!
Introducing @NVIDIAAI & @suno_ai_'s Parakeet-TDT! ✨
The latest in the Parakeet series, Nvidia & Suno beat Whisper again and won the Open ASR Leaderboard - this time by ~1 WER.
All of this while making the model ~175% faster than the previous generation. ⚡
Bonus:…
Making audio a first-class citizen in LLMs: Qwen Audio 🔉
Using a multi-task training framework, Qwen Audio combines OpenAI's Whisper-large-v2 (audio encoder) with the Qwen 7B LM to train jointly on over 30 audio tasks.
Tasks ranging from Speech Recognition to Music Captioning…
Whisper on MLX just got better! 🔥
Word-level timestamps + confidence scores and models on the 🤗Hub ;)
Don't forget to `git pull` before you get whisper-ing.
Kudos to @awnihannun & bofenghuang!
P.S. It now also supports Large-v3 \o/
llms this, llms that!
why aren't people releasing more audio stuff 😭
i want tts, asr, speech translation, voice cloning, text to audio, text to music, anything..
VITS is probably the most underrated TTS model out there!
At just 150M params, it runs in real time on CPU 🤯
Sure, it isn't the most realistic, but it does its job for most on-device use cases like reading an article, practising a language, etc.!!
Here's how you can use it with…
Open Whisper-style Speech Model (OWSM) 🔉
OWSM reproduces Whisper training using an open-source toolkit (ESPNet) and publicly available datasets. OWSM is much more efficient in training and is robust at multi-directional translations.
Open source training, inference scripts and…
Want high-quality Audio embeddings? CLAP! 👏
We support the latest general, music and speech CLAP models in Transformers! Use it for Text-to-Speech/ Text-to-Music training and more.
What is CLAP?
CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on…
VILA by @NVIDIAAI & @MIT 🔥
> 13B, 7B and 2.7B model checkpoints.
> Beats the current SoTA models like QwenVL.
> Interleaved Vision + Text pre-training.
> Followed by joint SFT.
> Works with AWQ for 4-bit inference.
Models on the Hugging Face Hub:
Common Voice 16 by @mozilla is out on the Hub! 🔥
This brings a total of 30,328 hours of audio spread across 120 languages!
Out of the total 30K hours of audio, 19.5K are validated! ✨
You can access it all in less than 2 lines of code with the datasets library:
from datasets…
MASSIVE UPDATE: Text to Speech arena adds OpenVoice v2, PlayHT 2.0 & Voicecraft 2.0 🔥
*sound on 🔔*
Why them?
OpenVoice v2 is the latest release from MyShell AI, trained with more data and a better training strategy and, most importantly, released under the MIT license
Voicecraft 2.0…
TIL: You can drop in GPTQ weights directly in the Transformers API 🤯
Load Zephyr 7B with less than 5 GB of GPU VRAM!
GPTQ (Post Training Quantisation) makes LLMs much smaller using a calibration dataset.
Thanks to Optimum and AutoGPTQ - Transformers now supports GPTQ weights…
Let's go!! Common Voice 17 - now on the Hub! 🔥
With 31,000 hours of audio (& transcriptions) across 124 languages.
*sound on 🎶*
847 hours of data were added in CV 17, along with 493 hours of validated data.
Four new languages have been added to this edition: Haitian…
UPDATE: Four new open models on the Text to Speech Arena! 🔥
*sound on🔉*
As the Text-to-Speech ecosystem is heating up, we decided to add more competition.
> Parler TTS
> VoiceCraft
> Vokan
> GPT-SOVITS
Why is this important?
The TTS ecosystem is riddled with opaque metrics…
Ratchet: A web-first, cross-platform ML developer toolkit! ⚡
*written in Rust 🦀
> Inference only.
> WebGPU/CPU only. 🔥
> First class quantisation support.
> Lazy computation.
> Inplace by default.
Supports Whisper out of the box! More models - LLMs, GGUF, etc are coming…
Want to train your own MusicLM? 🎶
The MusicCaps dataset is now on the 🤗Hub:
The MusicCaps dataset contains 5,521 music examples, each of which is labeled with an English aspect list and a free text caption written by musicians. 🎸
Introducing Idefics 2 🤯
An 8B Vision-Language Model that punches well above its weight.
> Apache 2.0 licensed! 🔥
> Competitive with 30B models like MM1-Chat
> 12 point increase in VQAv2, 30 point increase in TextVQA (compared to Idefics 1)
> 10x fewer parameters than…
MetaVoice-1B on Metal powered by Candle! 🦀
Apache 2.0 licensed TTS with Voice Cloning. Thanks to @lmazare, you can now use MetaVoice in Rust. ⚡
Try it out via candle-examples:
cargo run --example metavoice --features metal --release -- --prompt "Hey hey my name is VB."
llama.cpp with OpenAI chat completions API! 🦙
100% local. Powered by Metal!
*sound on*
In 2 steps:
1. brew install ggerganov/ggerganov/llama.cpp
2. llama-server --model <path to model> -c 2048
P.S. All of this with a binary size of less than 5MB ;)
That's it! 🤗
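Since the server speaks the OpenAI chat-completions format, any HTTP client works. A minimal Python sketch using only the standard library (host, port and model name are assumptions based on llama-server defaults; adjust to your setup):

```python
import json
import urllib.request

# llama-server listens on port 8080 by default; the model name is largely
# ignored because the server already has one model loaded.
payload = {
    "model": "local",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hi in five words."},
    ],
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With llama-server running, uncomment to get an OpenAI-style JSON response:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```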
🚨 @huggingface is releasing its Audio Course this Wednesday (Jun 14th)!
Fully open source and 100% free.
A 6-week self-paced course to level up your Machine Learning game with Audio ⚡️
Sign up and don't forget to tune in for our launch event:
Welcome MAGNeT by @AIatMeta 🎶
Open access weights, training and inference codebase! \o/
2 variants (10 sec, 30 sec), 2 sizes (small - 300M, medium - 1.5B)
> MAGNeT is a text-to-music and text-to-sound model capable of generating high-quality audio samples.
> Masked…
Introducing the Text-to-Speech/ Audio pipeline! ⚡️
@suno_ai_'s Bark, @AIatMeta's MMS-TTS, @MSFTResearch's SpeechT5, Kakao Research's VITS & MusicGen!
1000+ languages, open-access models. All of these are accessible in just a few lines of code! 🤯
4x faster Llama inference! 🔥
> leverages static cache.
> uses torch compile for decoder models.
> very minimal code changes required.
> coming to mistral and other models soon.
> opens possibility to unlock even more speed-ups.
massive kudos to @art_zucker for working on this…
🧘♀️Meditate with an AI-generated melody ☮️
Brought to you by MusicGen - a simple and controllable music generation model by @MetaAI 🎶
Models on the🤗Hub:
Check it out here 👉
Welcome MusicGen Stereo! 🎶
You can now generate high-quality stereo sounds at the speed of thought!
Powered by Audiocraft from @honualx and @AIatMeta 🤗
Oh, and you can use it with Transformers with just 3 lines of code!
import torch
from transformers import pipeline…
CodeQwen1.5 7B - GPU poor ftw! 🔥
> pre-trained on 3 trillion tokens.
> 64K context.
> supports tasks like code generation, code editing, sql, chat and more.
> performs better than DeepSeek Coder and GPT-3.5 on SWE-bench.
> open access model, weights on the Hub.
Introducing StarCoder2 15B 🌟
> Beats CodeLlama 34B.
> 16,384 context window.
> Trained on 600+ programming languages from The Stack v2.
> Trained with a fill-in-the-middle objective on 4 trillion+ tokens.
Along with that, we release smol-StarCoder2 3B & 7B ⭐
> 16K context…
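Fill-in-the-middle training means you can prompt with a prefix and a suffix and let the model generate the hole. A sketch of the sentinel-token prompt format used by the StarCoder family (token names per the model card; the prompt content is illustrative):

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Build a fill-in-the-middle prompt using StarCoder-style sentinel
    tokens; the model generates the missing middle after <fim_middle>."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# Illustrative hole: ask the model to complete the function body.
print(fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))"))
```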
MusicGen + LLM = High-quality tunes 🌟
Creating tunes from mere text prompts is no easy feat; there have been multiple attempts, but anecdotally, I have yet to find any that beats MusicGen by @AIatMeta!
All you need is about 5GB of GPU VRAM (or a Google Colab) ;)
Here's how you…
Faster Mixtral 8x7B with fused modules & AWQ 🔥
Powered by AutoAWQ & Transformers.
Fused modules offer improved accuracy and performance by replacing the Attention, MLP, and Layernorm layers with their corresponding fused version.
Fused modules can use faster kernels and…
🗣️ A new speech community event is incoming!!
📆 The Whisper fine-tuning sprints will be held from the 5th to the 19th of December.
🌍 Come join us to build better and faster speech recognition systems in 70+ languages.
🔥 Claim SoTA in a language of your choice!