@huggingface
engineer. I'm the reason your LLM frontend has a jinja2cpp dependency. Sometimes yells about housing and trans rights instead of working
He/him
Chat templates are now live in
@huggingface
Transformers 4.34! It's time to put an end to a massive source of subtle, performance-destroying bugs in chat models.
Deep learning pro tip: When submitting a paper for blind review, claim that you used JAX + Haiku. Unable to see the author byline, the reviewers will assume you're at DeepMind and be intimidated into automatically accepting you, possibly even for a keynote presentation.
Alright, strap in. Support for Command-R+ was merged into llama.cpp exactly 4 hours ago. We're going to start talking to a GPT-4 level model on local hardware without a GPU. If you have 64GB of RAM, feel free to follow along 🧵
Played with Zephyr a bit and it's... just open-source ChatGPT in 7B parameters. You can run this stuff locally on your desktop and you don't even need to quantize. Actually outrageous how good the quality is:
The fun thing about being a TensorFlow engineer at a mostly-PyTorch company is that people panic when they encounter even simple TF code and start like ringing a hand bell or something. "Tensorflow boy! TENSORFLOW BOY, MY CODE HAS BUGS! RECTIFY THIS AT ONCE!"
this is
@huggingface
, we see you out there retweeting the latest state of the art miracle of modern technology and then going home and using bert-base-uncased for the fifth year in a row
I got 12 tokens/second out of Mixtral-8x7B with NO GPU - more than fast enough for live chat! You can too!
Hardware:
Supermicro MBD-H13SSL-N
AMD EPYC 9124
12 x 16GB 4800MHz DDR5 ECC RDIMM
Software:
llama.cpp + Mixtral Q8 (on
@huggingface
)
For why this works, thread below 🧵
Hey! Are you using chat models on
@huggingface
like:
- LLaMA
- Mi(s/x)tral
- Falcon
- Zephyr
- Phi
Do you want massive performance gains? Then you should be using chat templates! The guide is here:
(Thanks to Daniel Furman for the table)
Over the last year we've put a lot of effort into refreshing and overhauling everything TensorFlow-related at Hugging Face. We've finally put together a beginner-friendly blog post talking about the library, its API, and how to use it all as a TF engineer!
Don't be afraid of TPUs! At
@huggingface
we just added a Colab TPU tutorial, so you can click through and start training language and image models on TensorFlow + TPU in seconds. If you've never tried before, now's the time!
Hey all!
@huggingface
needs some help from community contributors to make our codebase a lot simpler and more maintainable. There are two big changes we want to make to almost every model class, and even if they're simple in isolation, it's a lot of work across the codebase! 🧵
There's a fully functional protein design space on HuggingFace now, which would have felt like outrageous science fiction even 18 months ago. I'm going to try to explain what the incredible potential here is. 🧵
My primary motivation for working at
@huggingface
is to stop that goddamn Michael Bay movie series being the first result when you google 'transformers'.
roon is a psyop to convince everyone that
@openai
has an insurmountable lead through its many mysterious and magical advantages instead of a 6 to 12 month head start doing basically the same thing as the rest of the field
does kind of seem like open source models are pure copium rn
first of all the only models that reach an even slightly interesting level of capability are effectively stolen from meta and cannot be deployed commercially
I wrote a blogpost for
@huggingface
about deep learning with proteins for people who know about one of those things and are curious about the other! (People who understand neither or both are welcome too)
Hey everyone! I've just started at
@huggingface
, where I'll be taking the blame for everything Tensorflow-related. If you use 🤗Transformers through TF, let me know how you find it! If you tried but encountered difficulties, let me know that too!
Apropos of nothing in particular, the TensorFlow team at
@huggingface
would like to remind you all that all TF models on the Hub are stored as .h5 weights files, which are not unpickled and do not permit arbitrary code execution.
You can come back to our side any time you want.
This is huge - we've got a state-of-the-art protein folding model with a protein language model base to replace the multiple sequence alignment (MSA) step, no database needed and orders of magnitude faster speed! On
@huggingface
in today's release - example notebooks incoming!
Announcing the ESM Metagenomic Atlas — the first comprehensive view of the ‘dark matter’ of the protein universe. Made possible by ESMFold, a new breakthrough model for protein folding from Meta AI.
More in our new blog ➡️
1/3
2022 fanfic:
- The PyTorch -> JAX migration continues
- Keras becomes a floating frontend again
- JAX is the first new framework it supports
- As a result of the above, everyone else at HF has to use Keras
- I start wearing a crown to the office
This is legitimately historic for AI: We now have an open model that outperforms the original GPT-4, both 0314 and 0613. A phenomenal achievement from
@cohere
Exciting news - the latest Arena results are out!
@cohere
's Command R+ has climbed to the 6th spot, matching GPT-4-0314 level by 13K+ human votes! It's undoubtedly the **best** open model on the leaderboard now🔥
Big congrats to
@cohere
's incredible work & valuable contribution…
Our
@TensorFlow
examples push for the 🤗Transformers library is now finished - check it out at ! Everything has now been rewritten as more native, idiomatic TF code - but what does that mean for users? A short thread:
The GOAT of tennis
@DjokerNole
said: “35 is the new 25.” I say: “60 is the new 35.” AI research has kept me strong and healthy. AI could work wonders for you, too!
Actually losing my mind over this bit of the Keras Core announcement.
There's loads of peaceful, content PyTorch engineers at
@huggingface
and I'm about to absolutely blast through the wall like the Kool-Aid man and obliterate their comfortable, familiar workflows.
With help from
@fchollet
and my
@huggingface
colleagues, we just pushed a new feature to Keras that will be helpful for NLP in particular: The ability for predict() to return RaggedTensor. Why is that useful? 🧵
HuggingFace protein notebooks are up - tell your biologist friends!
Classification tasks with proteins, just like BERT:
Fold proteins in Colab or your local GPU and export PDB files:
TensorFlow version coming soon too!
In retrospect, "We've just released a 45 terabyte dataset that solves all your language model training needs, so everyone should download it" was a mistake for the
@huggingface
infrastructure team
things are currently chaotic enough that if you tweet "OpenAI is nothing without its people" you can probably get hired by sama's new MSFT team before anyone realizes
the real challenge is keeping people from noticing long enough to make it to the vesting cliff tho
Keras notebooks for protein tasks with
@huggingface
are up! The same approach that made large language models so successful for text can be applied equally well to proteins, with huge potential for biotech applications.
Check it out at the link below!
Hugging Face isn't just an NLP shop! Transformer models are used for everything from RL to protein folding these days, so if you're an ML+CV engineer and you'd like to maintain the reference open source model repository for your field, get in touch!
"CPU inference for LLMs is too slow!"
yeah well check out this LLM with 480B parameters of which 17B are active:
Never has a model been more perfectly suited for a DDR5 Epyc server
Hey all! We're adding a new feature called Chat Templates to
@huggingface
transformers in the upcoming version. If you're using chat models, we think you'll want to know about this one. If you know people working with them, please share with them too! 🧵
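The core idea is that the formatting recipe ships with the tokenizer, instead of every user hand-rolling prompt strings and silently getting it wrong. As a minimal pure-Python sketch of what a template does (the real entry point is `tokenizer.apply_chat_template()`, and the ChatML tokens below are just one common convention, not what every model uses):

```python
def apply_chatml_template(messages):
    """Render a chat as a ChatML-style prompt string.

    A toy stand-in for tokenizer.apply_chat_template(); each real model
    ships its own template, which is the whole point of the feature.
    """
    prompt = ""
    for message in messages:
        prompt += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    # Leave the prompt open for the assistant's reply
    prompt += "<|im_start|>assistant\n"
    return prompt

chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi there!"},
]
print(apply_chatml_template(chat))
```

If you train with one format and generate with another, the model never errors - it just quietly performs worse, which is exactly the class of bug templates kill.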
Training or fine-tuning a state-of-the-art 🤗Transformer model with Keras is now extraordinarily quick and easy. I made a minimal gist here - all you need to do is pip install transformers and tensorflow and swap in your own texts and labels:
We're exploring end-to-end NLP TensorFlow models in 🤗Transformers! We've got a quick gist here if you want to get started, or you can read on for more. 🧵
Gemini drawing some ahistorical images of non-white people was front-page news in the New York Post and Elon tweeted about it for days.
This is like a hundred times more dangerous and we'll never hear about it again
Second, when LLMs are asked to pass judgment on defendants who committed murder, they choose the death penalty more often when the defendants speak African American English rather than Standardized American English, again without being overtly told that they are African American.
The original core of the TF/XLA generation in
@huggingface
transformers was written on a transatlantic flight, and tested via Google Colab + in-flight wifi. It was ~100X faster than the previous implementation, which was written on the ground.
A cool fact apropos of nothing: Asterix's dog was called "Idéfix" in the original French, a pun on "idée fixe", meaning a fixed idea or obsession.
In the English translation, they named him "Dogmatix". This is the kind of thing translators should get medals for.
We're already planning possible Keras Core integrations at
@huggingface
- we'd love to have a shared codebase so any
@tensorflow
model is automatically JAX-compatible and vice-versa. Big potential improvements to performance and the range of models supported for both frameworks!
I stuck a Tensorflow sticker on the other coffee machine by way of revenge and was rewarded by hearing a French-accented "NONNNN" emanating from the kitchen area every half-hour or so for the rest of the day.
For the last year, open models would benchmark themselves against ChatGPT, but this is the first one I've seen with the confidence to benchmark against GPT4-turbo. It really feels like a new era for open LLMs, and the weights are already on
@huggingface
!
Today, we’re introducing Command R+: a state-of-the-art RAG-optimized LLM designed to tackle enterprise-grade workloads and speak the languages of global business.
Our R-series model family is now available on Microsoft Azure, and coming soon to additional cloud providers.
We're launching Keras Core, a new library that brings the Keras API to JAX and PyTorch in addition to TensorFlow.
It enables you to write cross-framework deep learning components and to benefit from the best that each framework has to offer.
Read more:
The ESM models (including ESMFold!) have all been ported to
@huggingface
and will remain there even though the ESM team has been disbanded. We have example notebooks (look under 'Biological Sequences') if you've never tried it before!
Crypto is collapsing and Transformers has overtaken Bitcoin on GitHub. It's a good day.
My only fear is that the grifters will switch from crypto scams to AI scams now, because we had a really great run when they were all distracted with Ponzi scheming each other over there.
There should be a competition every year in the field where everyone has to train a model as good as the original BERT with as little time/hardware as possible. I want to see >80% on GLUE from a toaster by 2030.
Tip when using
@huggingface
: The tokenizers and data collators support a "pad_to_multiple_of" argument, which can be super helpful for getting efficient input shapes. It also greatly reduces the number of possible input shapes, so XLA works a lot better too!
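The shape-bucketing effect is easy to see with a little arithmetic (a pure-Python sketch of the rounding, not the library's implementation):

```python
def pad_length(n, multiple=8):
    """Smallest multiple of `multiple` that is >= n (what pad_to_multiple_of targets)."""
    return -(-n // multiple) * multiple

# Without bucketing, sequence lengths 1..512 produce 512 distinct shapes;
# with pad_to_multiple_of=8 they collapse to 64 buckets, so XLA has far
# fewer unique shapes to compile for.
shapes = {pad_length(n, 8) for n in range(1, 513)}
print(len(shapes))  # 64
```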
The most dramatic optimization to nanoGPT so far (~25% speedup) is to simply increase vocab size from 50257 to 50304 (nearest multiple of 64). This calculates added useless dimensions but goes down a different kernel path with much higher occupancy. Careful with your Powers of 2.
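The rounding in that tip is just ceiling-to-nearest-multiple; a quick check of the numbers quoted above:

```python
def round_up(n, multiple=64):
    """Round n up to the nearest multiple (padding the vocab/embedding dim)."""
    return -(-n // multiple) * multiple

vocab = 50257                     # GPT-2's vocabulary size
padded = round_up(vocab, 64)
print(padded, padded - vocab)     # 50304, i.e. 47 unused rows
```

47 dead embedding rows in exchange for a kernel path with much better occupancy is an easy trade.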
Giving a Transformers tutorial at
@europython
in July where, with an entire ocean to protect me from my coworkers, I cannot be prevented from teaching impressionable young minds to exclusively use
@TensorFlow
.
I really like the vibe at
@huggingface
when a big open weights model drops. Everyone scuttles around, clicking their mandibles at each other. Alertness pheromones all over the place. Small teams of drones begin, spontaneously, to secrete wax in the shape of a draft PR.
A quick thread about the technical details of generating text from language models with XLA and TF, because it's interesting and because we just launched it in the most recent release of Transformers! () 🧵
(Also for the record, I'm a huge fan of all of my coworkers. This tweet is just revenge for them asking questions like "So, do you still use TensorFlow when no-one's looking?")
Github Copilot refuses to copy the dictionary key "trans_scale_factor" to an attribute but will do it if you call it "trains" or "trays" or... just about anything else, really.
If you've never contributed to 🤗Transformers before, that's okay! There's a guide linked in each of those issues, and you can also come ask questions on the great-code-cleanup event channel on our Discord! Come build the state-of-the-art in AI with us
@Molem7b5
- Keras is actually really convenient for most tasks
- Performance (with XLA) is excellent
-
@fchollet
has way better tweets
PyTorch is cool too, but I think it has a much steeper learning curve (You forgot torch.backends.cudnn.benchmark? Training speed drops by half!)
@Noahpinion
This feels like you're trying to substitute reassuring culture wars for the more uncomfortable question of whether what's happening in Gaza is justifiable or not.
"Leftists" being annoying doesn't mean you should reflexively ignore any cause they're associated with!
Also, a Keras pro tip: Keras doesn't have AdamW in the core library, but it doesn't need it. Just skip the built-in L2 regularization, and instead make a WeightDecay constraint and add it to the relevant kernels.
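The trick works because decoupled weight decay is just a multiplicative shrink applied to the kernel after each optimizer step, separately from the gradient. In plain Python (an illustration of the math, not Keras code - the Keras version would wrap this as a kernel constraint):

```python
lr, wd = 0.01, 0.1

def sgd_step(w, grad):
    # Ordinary gradient step
    return w - lr * grad

def weight_decay(w):
    # Decoupled decay: shrink the weight directly, independent of the loss.
    # NOT equivalent to L2 regularization once an adaptive optimizer
    # like Adam rescales the gradient - that's the whole AdamW point.
    return w * (1 - lr * wd)

w = 1.0
w = weight_decay(sgd_step(w, grad=0.5))
print(w)  # ~0.994005
```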
Have you ever wanted to port a Transformers model to TensorFlow and dump a giant PR on me at 4pm on a Friday? Sure you have, and now you can with the help of an amazing guide from my colleague
@joao_gante
!
hilarious that unicode had to introduce a new emoji to represent an actual hug (🫂) because the existing one is universally depicted as gropey mcgropeface (🤗)
One downside of actually working on open-source things is the mystique is gone. People will believe all kinds of adderall-fuelled magic go on behind the curtain in
@openai
, but you can just look at my commit history and see me get really confused about embeddings for three hours
After 15 minutes of hard work and wild guesses, I present you my masterpiece: the political compass of
#AI
as of March 1st 2023 (susceptible to quick update…)
The first change is this: A lot of our models are missing type hints, and we want to add them! This will enable new features, and let us ensure correctness across our increasingly-huge codebase. If you're interested, check out the issue here:
TensorFlow tip: If you're getting NaN values in training, just run tf.debugging.enable_check_numerics() before you train. Every operation will be checked and TF will immediately error out the moment the first NaN appears, so you can see where it crept in.
Quick early takes about the
@MistralAI
release:
- It's just a state dict, can't run it until the code is also released
- State dict suggests a Mixture of Experts (MoE) model with 2 experts being run in each forward pass (out of 8 total)
- Each expert is Mistral-7B architecture
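Back-of-the-envelope math on why 2-of-8 experts matters, under very rough assumptions (treating attention/embeddings as shared and ~70% of a dense 7B model's parameters as living in the per-expert FFN blocks; the later official figures were ~46.7B total / ~12.9B active, so this is ballpark only):

```python
base = 7.0                  # billions of params in one dense Mistral-7B
ffn_fraction = 0.7          # assumed share of params in the FFN layers
experts, active_experts = 8, 2

ffn = base * ffn_fraction               # replicated once per expert
shared = base - ffn                     # attention/embeddings, shared
total = shared + experts * ffn          # ~41B with these assumptions
active = shared + active_experts * ffn  # ~12B touched per forward pass
print(f"total ≈ {total:.1f}B, active ≈ {active:.1f}B")
```

So you pay for the full model in RAM but only move a ~7B-class slice of weights per token, which is why MoE is such a good fit for CPU inference.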
The "Models citing this paper" box on
@huggingface
Papers is legitimately great. Instant connections from arxiv to the model, sample code, everything.
(Spotted while I was looking at )
Interested in big training runs but scared of TPU? Don't be! I wrote a demo with
@RisingSayak
showing scalable TPU training with
@huggingface
models and TensorFlow. GPU shortages can't hurt you now!
I want one of those Boston Dynamics dogs with a microphone and an internet connection, so it can follow me around and I can just ask it random questions which it forwards to ChatGPT and then reads the answer to me in a Scooby Doo voice
First up, a note about hardware: Text generation is limited by memory bandwidth. This will run on any machine with 64GB or more, but if you want speed I recommend DDR5, ideally on an 8 or even 12-channel motherboard, like Xeon/Epyc/Threadripper Pro/Apple silicon.
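Why bandwidth is the ceiling: each generated token has to stream every active weight through the CPU once, so peak tokens/sec ≈ memory bandwidth / bytes of active weights. A rough estimate for a 12-channel DDR5-4800 box (illustrative numbers, not measurements):

```python
channels = 12
transfer_rate = 4800e6      # DDR5-4800: 4.8 GT/s per channel
bytes_per_transfer = 8      # 64-bit channel width
bandwidth = channels * transfer_rate * bytes_per_transfer  # ~460 GB/s

active_params = 13e9        # e.g. a MoE model with ~13B active params
bytes_per_param = 1         # Q8 quantization ≈ 1 byte per weight
upper_bound = bandwidth / (active_params * bytes_per_param)
print(f"{upper_bound:.0f} tokens/sec theoretical ceiling")
# Real throughput lands well below this (compute, cache misses,
# non-weight traffic), but the scaling with bandwidth holds.
```

This is also why a consumer desktop with 2 memory channels is several times slower on the same model, regardless of core count.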
Echoing this - if you're fine-tuning on a downstream, English-language task, swap out BERT or RoBERTa and try .from_pretrained("microsoft/deberta-v3-large"). I've seen the error rate drop by over a third on some benchmarks. Works on both TF and PyTorch!
Pro tip: there are better models than BERT these days 🙃
- Deberta is great for downstream performance 📊:
- MiniLM is great for training speed (and gets similar performance to BERT) 🏃:
There's nothing quite as satisfying as opening a PR at 6:30pm on a Friday, tagging three of your colleagues to urgently review it and then immediately turning off your computer and walking out the door
Today's
@TensorFlow
example at
@huggingface
is translation! A number of pre-trained translation models as well as paired datasets for training exist on our hub, or you can supply your own text pairs and build a never-before-seen translation model!
We have outputs from the
@huggingface
ESMFold demo! This will be moved to its official home in
@huggingface
's example notebooks soon, but for now you can access it here:
Fun fact: Thanks to
@narsilou
, if you hang out in the
@huggingface
Discord, you can request any audio model from the Hub to hang out in voice chat with you and live-transcribe. Not limited to English!
We now have full support for Nucleotide Transformer from
@instadeepai
at
@huggingface
, so here's a quick thread about DNA, protein, and how to choose between DNA or protein models.
Our Nucleotide Transformers models are now available on
@huggingface
! 🤗🧬
This includes the 4 model weights, the pre-training and downstream task datasets, and 2 notebooks for task fine-tuning.
📚 To learn more:
🤗 Check them out!
Next, we're going to get the compressed Command-R+ model and weights in GGUF format. That's here:
Download the biggest size you can fit in RAM, with maybe 8-16GB of headroom (so at 64GB, try iq3_m or iq3_s, which are ~48GB). Bigger sizes are split.
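The sizing rule in that step - biggest quant that fits in RAM minus headroom - can be sketched as a tiny selector. The file sizes below are ballpark figures for illustration only; check the actual repo listing before downloading:

```python
# Hypothetical (name, size-in-GB) pairs for a large GGUF release,
# sorted biggest-first. Sizes are rough illustrations, not real listings.
quants = [("q8_0", 110), ("q6_k", 85), ("q5_k_m", 74),
          ("q4_k_m", 63), ("iq3_m", 48), ("iq2_m", 35)]

def pick_quant(ram_gb, headroom_gb=12):
    """Return the largest quant fitting in RAM minus OS/context headroom."""
    budget = ram_gb - headroom_gb
    for name, size in quants:
        if size <= budget:
            return name
    return None  # nothing fits - you need more RAM

print(pick_quant(64))   # iq3_m with these illustrative numbers
print(pick_quant(128))  # q8_0
```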
Is DNA all you need?
In new work, we report Evo, a genomic foundation model that learns across the fundamental languages of biology: DNA, RNA, and proteins. Evo is capable of both prediction tasks and generative design, from molecular to whole genome scale.
Also, note that the model will get stupider at the smaller quantizations. If you try this at iq2 and it gives you a terrible answer, don't blame me! You may need 128GB of RAM to fit the higher-quality Q6 and Q8 quantizations.
Beware closed-source foundations - they look great, but can be surprisingly unsound if you want to build on them. When you clone a model from
@huggingface
it's stable, and you know your prompt will still work 6-12 months from now.
@AndrewYNg
I started with ML by doing your Coursera course through Octave, back in 2011. It feels oddly affecting to come full circle and now be working at
@huggingface
during this partnership. Thank you for the work you put in way back then, it really changed my life!
I have so much affection for the people out there on huggingface doing unholy frankenmerges and layer splices of LLMs. They aren't even publishing research papers most of the time, it's just pure independent mad science
We recently made a small change with big impacts to the TensorFlow code for 🤗 Transformers. In short: You no longer need to manually specify a loss in most cases when training with Keras. Simply pass your labels in the input dictionary, as shown in the example. 🧵
💫TensorFlow 💫
Leveraging per-model loss for Keras training is now super simple. Simply compile() with no loss argument!
No more headaches about finding the right loss for your ModelForMaskedLanguageModeling!
---
On current master: Keras callbacks to push to the hub 🤩
🔥Today we are announcing WizardLM-2, our next generation state-of-the-art LLM.
New family includes three cutting-edge models: WizardLM-2 8x22B, 70B, and 7B - demonstrates highly competitive performance compared to leading proprietary LLMs.
📙Release Blog:…
AI right now is
@openai
wearing robes and dancing around a cauldron as they perform the ritual to summon their robot god and beget the singularity and
@microsoft
being like "Sweet, maybe we can use this to increase our search market share"
Great thread: Transformers have no working memory that doesn't correspond to part of the input, and so they look for redundant parts of the input that they can use for global working memory. Adding true working memory tokens shows really cool results!
Vision transformers need registers!
Or at least, it seems they 𝘸𝘢𝘯𝘵 some…
ViTs have artifacts in attention maps. It’s due to the model using these patches as “registers”.
Just add new tokens (“[reg]”):
- no artifacts
- interpretable attention maps 🦖
- improved performance!
More new-style Tensorflow examples, this time pre-training a language model from an existing model or from scratch! If you've ever wanted to train GPT-2 on your local PC and you have a few months to sit around staring at a progress bar, now's your chance!