13 months. 250 pages. I wrote an ML book!
Want to learn how to ship ML in practice? Check it out!
Includes tips from
@WWRob
,
@mrogati
,
@cdubhland
and more!
It'll be out in winter & you can preorder it now.
Amazon:
O'Reilly:
Claude 3 Opus is great at following multiple complex instructions.
To test it,
@ErikSchluntz
and I had it take on
@karpathy
's challenge to transform his 2h13m tokenizer video into a blog post, in ONE prompt, and it just... did it
Here are some details:
Today, we announced that we've gotten dictionary learning working on Sonnet, extracting millions of features from one of the best models in the world.
This is the first time this has been successfully done on a frontier model.
I wanted to share some highlights 🧵
How to ship ML in practice:
1/ Write a simple rule based solution to cover 80% of use cases
2/ Write a simple ML algorithm to cover 95% of cases
3/ Write a filtering algorithm to route inputs to the correct method
4/ Add monitoring
5/ Detect drift
...
24/ Deep Learning
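Steps 1 through 3 above can be sketched in a few lines. A minimal sketch, assuming a ticket-classification use case; the rules, labels, and function names are all hypothetical:

```python
# Hypothetical sketch of steps 1-3: rules first, a model as fallback,
# and a router in front. Rules and labels are invented for illustration.

def rule_based(ticket: str):
    # Step 1: hand-written rules that cover the easy ~80% of cases
    text = ticket.lower()
    if "refund" in text:
        return "billing"
    if "password" in text:
        return "account"
    return None  # rules don't apply

def simple_model(ticket: str):
    # Step 2: stand-in for a simple trained classifier (e.g. logistic regression)
    return "general"

def route(ticket: str):
    # Step 3: send each input to whichever method can handle it
    return rule_based(ticket) or simple_model(ticket)
```

The router is just a fallback chain here; in practice it can be its own filtering model.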
Most ML folks I know have
@AnthropicAI
's Toy Models of Superposition paper on their reading list, but too few have read it.
It is one of the most interesting interpretability papers I've read in a while, and it can benefit anyone using deep learning.
Here are my takeaways!
When moving from traditional ML (GBDT) to deep learning for categorical data, the vast majority of improvements usually come from learned embeddings of categorical variables.
Once you have the embeddings, you can feed them to any model and it will perform noticeably better.
This is one of the main tricks I recommend when trying to improve current model performance.
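As a toy illustration of the trick: the lookup table below stands in for a trainable embedding layer (e.g. `nn.Embedding`), and the vector values and feature names are made up.

```python
# Toy sketch: replace a raw categorical value with a learned dense vector
# before feeding any downstream model. Vector values here are invented;
# in practice they are learned during training.
embeddings = {
    "red":   [0.9, 0.1],
    "blue":  [0.1, 0.8],
    "green": [0.2, 0.7],
}

def featurize(row):
    # Dense embedding of the category, concatenated with numeric features
    return embeddings[row["color"]] + [row["price"]]

features = featurize({"color": "red", "price": 3.5})
```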
Don't start with hyperparameter search, start by looking at individual examples with large losses, and most of the time you'll understand what feature your model is missing.
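A minimal sketch of that loop; the examples and loss values below are invented for illustration:

```python
# Sort validation examples by loss and read the worst ones first.
examples = ["short text", "ambiguous review", "mislabeled item", "clear case"]
losses   = [0.2, 1.9, 3.4, 0.1]

worst_first = sorted(zip(losses, examples), reverse=True)
for loss, example in worst_first[:2]:
    # The top offenders usually share a pattern (a missing feature,
    # a labeling problem, a preprocessing bug).
    print(f"loss={loss:.1f}  {example}")
```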
Today, a friend transitioning to ML asked me about a data challenge.
Him: I trained a perceptron to predict breast cancer, and reached 99% accuracy on test, what's next?
Me: Time to learn about class imbalance, data leakage and metric selection!
One of the biggest tool gaps in ML right now is in building utilities to more easily inspect and understand data.
I gave a talk about just this at:
It also quotes your great "data in industry vs. data in academia" slide in the conclusion
@karpathy
We see more significant improvements from training data distribution search (data splits + oversampling factor ratios) than neural architecture search. The latter is so overrated :)
Config files are underrated in ML.
You start with a simple model, and soon enough you are trying 13 hyperparameters, 7 models, and 9 data augmentation strategies.
Use a config file, and experimentation becomes much easier.
Python's implementation:
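One possible shape for this, as a sketch (not the implementation linked above; the field names are hypothetical):

```python
# Keep every knob in one config object loaded from a file, instead of
# scattering constants through the code.
import json
from dataclasses import dataclass

@dataclass
class Config:
    model: str = "logreg"
    learning_rate: float = 0.01
    augmentation: str = "none"

def config_from_json(text: str) -> Config:
    # Unknown keys fail loudly, known keys fall back to defaults
    return Config(**json.loads(text))

cfg = config_from_json('{"model": "mlp", "learning_rate": 0.1}')
```

Every experiment then becomes a new config file rather than a code change.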
Do you want to understand how to train models like ChatGPT and stable-diffusion?
Good news, I wrote an illustrated notebook which explains different parallelism approaches and gives a functional example of each.
I've summarized some takeaways below
NB:
More and more of my network is transitioning away from looking for deep learning jobs and focusing more on ML infra and platforms.
The ML Engineering hype cycle is just starting :)
I just finished watching
@karpathy
's let's build GPT lecture, and I think it might be the best in the zero-to-hero series so far.
Here are eight insights about transformers that the video did a great job explaining.
Watch the video for more.
(1/9)
Currently training a Q&A model, and it is producing crazy impressive results!
Q: How do you find a good title?
A: See attached
None of the samples can be found in the training set that I used.
😱😱😱😱😱
What do you mean my analysis isn't reproducible?
I ran a query five months ago on a redshift table that's now deprecated, wrote a notebook in Python 2 to pre-process the data, and used a DNN implementation from a GitHub repo that's since been deleted.
What's not reproducible?
Wow, the book went from best new release to best seller!!!
Looks like the free first chapter helped some folks decide. If you are still on the fence, feel free to check the free PDF out below, the book is also currently 40% off!
Wow,
@lyft
built an ML system that automatically:
Finds the best potential users to target.
Allocates the right budget for each ad.
Sets the right amount to bid on each platform to maximize the use of the budget.
Thank you
@seanjtaylor
for the find!
The paper contains *a lot* more experimental results, and feature examples, including interactive visualizations of feature neighborhoods.
If you've made it this far in the thread, you should give it a read
On Tuesday, we announced our results finding interpretable features in Claude 3 Sonnet.
One of the features we identified is about the Golden Gate Bridge. When activated, the model becomes obsessed with the bridge.
For a limited time, we've made this available to everyone
The solution I recommend for anyone with a similar problem:
Letโs say you have 3 classes
Label 20 examples, including at least 2 examples of each class
Train a simple model on your labels and have it predict the rest
Look through predictions and label a few wrong ones
Repeat
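A toy version of that loop, with a trivial stand-in for the "simple model" (everything below is invented for illustration; a real version would use e.g. logistic regression):

```python
from collections import Counter, defaultdict

def simple_classifier(labeled):
    # Stand-in "training": majority label per coarse feature bucket
    buckets = defaultdict(list)
    for x, y in labeled:
        buckets[x // 10].append(y)
    majority = {b: Counter(ys).most_common(1)[0][0] for b, ys in buckets.items()}
    default = Counter(y for _, y in labeled).most_common(1)[0][0]
    return lambda x: majority.get(x // 10, default)

# Label a handful of points, predict the rest, then go correct the
# predictions that look wrong and repeat.
labeled = [(1, "low"), (3, "low"), (25, "high"), (28, "high")]
predict = simple_classifier(labeled)
predictions = {x: predict(x) for x in [5, 7, 22, 90]}
```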
I have 1,200 unlabelled observations that I want to label. The labels are categorical. What's the most efficient way to do this? Any chance someone's written a Shiny app for this?
How do you run models that are too big to fit in RAM or in GPU memory?
Great
@huggingface
article explaining their approach to distributing computation
When using websites that I suspect have a team of data scientists optimizing them, I make sure to spend some time clicking around randomly.
We all have to do our part to keep datasets messy.
The Illustrated Word2vec. If you've heard my talks, you've seen my many attempts at giving a visual representation of word vectors. This post from
@jalammar
takes this to a new level.
Really comprehensive and accessible overview!
I wrote a tutorial on leveraging
#DeepLearning
to build a powerful image search engine quickly. It includes a notebook to walk you through and a codebase to play with. Also comes with a shoutout to
@jeremyphoward
and his great class on the topic!
Most experienced Data Scientists understand why I dedicate an entire section of my book to cover how to deploy models. Others ask:
"Can't you just wrap it in a Flask server?"
This post from
@ravelinhq
brilliantly shows why that's not enough.
I wrote "How to solve 90% of NLP problems: a step-by-step guide" after seeing dozens of applied NLP projects at
@InsightFellows
.
It has been read by over three hundred thousand people!
It presents a cookie cutter NLP approach, along with reference code
This article by
@a16z
has the best ML infra charts I've seen in a very long time.
If you'd like to know more about the challenges that come with ML, and the tools to solve them, this is a great start
Impressive work!
Combine:
- 3 dimensional convolutional auto-encoder
-
@spacy_io
embeddings
- An RNN encoder
- Sprinkle some t-SNE on top
Get a model that can generate 3D models from text descriptions!
The interactive app is really fun, try it!
2018 has been a continuing flurry of exciting work in Machine Learning. If you are interested in being part of the field in 2019, I've written about how some of the most impactful trends of 2018 will shape this year!
Benchmark datasets are frustrating.
On one hand, datasets often drive significant innovation initially.
On the other hand, they usually become completely overfit, leading everyone to overestimate state of the art performance.
There should be a hype cycle for datasets.
As always,
@fastdotai
posts are a pleasure to read, and present results clearly and fairly.
This post covers language modeling for low-resource languages and provides useful info on learning rates, loss functions, and model architecture choices.
The most broadly applicable prompting technique:
1. Collect a random subset of failing examples from a training set
2. Add the examples and a correct response to your prompt
3. Repeat
Doing the above 5 times has solved 90% of prompting challenges I've seen
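In code, the loop above might look something like this (the prompt format and examples are made up; any prompt template works):

```python
# Fold corrected failures back into the prompt as few-shot demonstrations.
def build_prompt(instructions, corrected_failures, query):
    shots = "\n\n".join(
        f"Input: {x}\nCorrect output: {y}" for x, y in corrected_failures
    )
    return f"{instructions}\n\n{shots}\n\nInput: {query}\nOutput:"

prompt = build_prompt(
    "Classify the sentiment of the input.",
    [("meh, it was fine I guess", "neutral")],  # a case the model got wrong
    "best purchase ever",
)
```

Each pass through the loop appends another corrected failure to the demonstrations list.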
If you use the same prompt, but force the Golden Gate Bridge feature to be maximally active, Claude starts believing that it is the bridge itself!
We repeat this experiment in domains ranging from code to sycophancy, and find similar results.
Continuously impressed by
@distillpub
publications. This one on feature transformation does such a good job of centralizing and explaining trends around multimodal learning. An informative read that will leave you wanting to try out a lot of ideas.
I don't think people have fully internalized yet how cheap and how good Haiku is.
Opus gets all the press for good reason, but Haiku pushes the intelligence / cost boundary even more imo
Kinda bonkers, but this agent workflow only cost me ~$0.09 (9 cents)
It's about ~ 350k tokens, and the outputs are on par with gpt-4, at a fraction of the price.
GPT-4: $10 / 1M tokens → $3.50 for this run
Haiku: $0.25 / 1M tokens → $0.09
40x cheaper.
This wasn't even possible 1 month ago.
To iterate faster on models, here is how I examine results:
Summary metrics for an overview
Confusion matrix and calibration curve to find challenging data types
Model explainers to inspect features
Manually inspect top and worst performing examples
Decide on next steps!
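The first two steps above can start very small. A minimal sketch with toy labels; in practice you'd reach for sklearn's `confusion_matrix` and `calibration_curve`:

```python
from collections import Counter

# Toy predictions; in practice these come from your validation set.
y_true = ["cat", "cat", "dog", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "cat"]

confusion = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
errors = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]
# `errors` indexes the examples worth inspecting by hand
```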
Bayesian reasoning and Deep Learning can seem very different. One excels at accurately measuring uncertainty, while the other mostly seems concerned with optimization.
@yaringal
's writing connects both elegantly, showing how dropout can define uncertainty
For context, the goal of dictionary learning is to untangle the activations inside the neurons of an LLM into a small set of interpretable features.
We can then look at these features to inspect what is happening inside the model as it processes a given context.
I gave a talk sharing tips for teams to ship ML.
At most companies I've worked at, the main challenge in getting models out is organizational.
The talk focuses on ways to enable product, ml, and infra teams to work together.
Slides are available below!
An amazing deck from the
@netflix
team on how they keep their recommendation systems fresh by continuously monitoring them, and incorporating corrupted data at training time to mimic real world conditions
Using a learned variant of SVD to compress embedding size by 90%. For many speech and NLP tasks, the embeddings represent most of the size of a network, so this clever take on SVD is really impactful! Great to see a clear post accompany the paper as well.
Oldie but goodie, this writeup about a winning
@kaggle
submission is one of the first I've seen successfully apply deep learning to tabular data. More examples have started cropping up recently, so I would not be surprised if the trend continues
Experimentation and statistics could really benefit from a beginner friendly attitude.
Not just in the form of textbooks, but an openness to discuss concepts that are often confusing
(Lack of) knowledge of stats causes imposter syndrome for too many folks in my network :(
I'm excited to announce I've joined
@AnthropicAI
as a Product Research Engineer!
I loved my time on
@Stripe
's Radar team, working to improve fraud detection models, and in time I hope to write more about some lessons I learned there.
A hack that many teams use to deploy ML more easily:
Create good embeddings for future use.
See
@facebook
using code embeddings to surface best practices
Note that they don't really use any deep learning and focus on handcrafted features.
@HamelHusain
It is fascinating to me that we spend so much time talking about models and so little time talking about test sets
In industry, your test set is supposed to represent your performance in prod
Designing a test set that does this well is extremely hard, and not talked about much
Many researchers including
@fchollet
have highlighted the importance of having an explanation (and ideally a tuning mechanism) for content recommendation.
@mcinerneyj
gives an overview of
@Spotify
's category based recommender, a step towards explanations
First, we grabbed the raw transcript of the video and screenshots taken at 5s intervals.
Then, we chunked the transcript into 24 parts for efficient processing (the whole transcript fits within the context window, so this is merely a speed optimization).
Slowly but surely, deep learning is joining the last domain to resist it, tabular data. The networks are still relatively simple which makes me think we need better primitives for structured data. What is the convolution equivalent for spreadsheets?
95% of the work of a Data Scientist is gathering, cleaning and presenting data.
Some think that's a bad thing, but I wish every practitioner would embrace it!
Data Science is about using data to produce useful things, there is a good reason the title isn't Model Scientist.
This work from
@mcleavey
of
@OpenAI
has me floored.
Train GPT-2 on a large dataset, and you get an amazing composer that lets anyone create music.
Here is a riff on Poker Face created in 2 minutes.
Notice the coherence of the composition!
For the book I built four successive versions of an ML app:
First a heuristic with a bunch of rules.
Then a simple model.
Then, a much more complex model.
Finally, a model that simplified the previous approach.
Same lifecycle Iโve seen in industry!
We find features for almost everything you can think of: geographical concepts (cities and countries), architecture, sports and science.
They combine like you'd expect: "an athlete from California" triggers both the athlete feature and the California feature
But there's more!
Pre-training language models to build classifiers is quickly becoming more accessible, and has huge promise. The next step, in my opinion, is to make it as easy to use a pre-trained model on a custom dataset as VGG transfer learning in Keras, for example.
Hey!
If this resonates with you, take a look at my book ( currently on early-release).
It has many many more tips about building sane ML products!
Amazon:
O'Reilly Early Release:
My todo list is a classic example of why ML is hard:
Spin up a webserver to serve current prototype of model: 5 hours
Add some simple monitoring: 2 hours
Add input validation logic and error handling: 3 hours
Iterate on the ML and data side to get a better model: ???
One of the biggest differences I see between more experienced folks and novices in ML: error analysis
If your model shows .94 precision, don't try a new set of hyperparameters
Look at the data
Then look at it again, using different methods
The data always has the answers
Here is a subset of some of what we asked the model, in one prompt (full prompt attached)
- directly write HTML
- filter out irrelevant screenshots
- transcribe the code examples in images if they contain a complete example
- synthesize transcript and image contents into prose
Building my own prototype of an ML guided editor as an example for my book () has helped me realize how good
@Grammarly
and
@textio
are.
Using ML to assist and guide a user requires a lot more than just modeling; you need to think deeply about product.
Big fan of . I love how all the questions sound reasonable at a glance, but completely nuts once you actually dive in. I think I now know where
@YouTube
comments come from...
Getting into spaced repetition for memory thanks to
@michael_nielsen
and
@andy_matuschak
โs work.
It feels like unsupervised vs supervised learning
Normal reading is unsupervised
Spaced repetition provides labels you get tested on at successive epochs, to minimize memory loss
Regarding the debate on the rigor of Deep Learning, I recommend this paper by Leo Breiman making the case for focusing less on Data Modeling (understanding how data is generated) in favor of Algorithmic Modeling (measuring predictive accuracy).
The alignment and interpretability work at
@AnthropicAI
was one of the main reasons I joined
The linked thread does a good job of breaking down some of the details, but I wanted to take a stab at explaining a couple of findings in even simpler terms, in my own words 🧵
When language models "reason out loud," it's hard to know if their stated reasoning is faithful to the process the model actually used to make its prediction. In two new papers, we measure and improve the faithfulness of language models' stated reasoning.
We gave Opus the transcript, video screenshots, as well as two *additional* screenshots:
- One of Andrej's blog to display a visual style to follow
- The top of the notebook
@karpathy
shared as a writing-style example. On top, we added lots of instructions (prompt in repo)
Found on the
@Yelp
app. Named Entity Recognition is hard! At a brewery, an "Amber" might be a popular beer, but at this restaurant it's actually the name of the waitress...
Visual search is the most natural way to search for fashion items, and is becoming more and more popular. If you are curious how to build a visual search app yourself, check out and tell me what you think!
System prompt design is fascinating.
Itโs basically prompt engineering for the broadest use case you can think of.
That means you need to be subtle with it, as each word affects every single use case.
We spent quite a bit of time iterating on this version. See 🧵 for more.
Ensembles are very powerful in
#ML
and routinely win
@kaggle
competitions. This paper is one of the first I saw mention the idea of using
#DeepLearning
checkpoints to form an ensemble. The idea has a lot of supporters including
@jeremyphoward
!
#ClassicML
Heard about
@YouTube
's recommendation algorithm, heavily criticized for promoting radical and hateful content? Take a look through this fascinating paper describing it. Steps are taken to reduce clickbait, but the only target metric is watch time...
This post by
@alex_gude
should be required reading for Data Scientists.
There are two ways you can learn this lesson:
You could use a simple model metric, ship a useless model and make people sad.
OR
You could read Alex's post.
Many ML practitioners ignore latency, but most apps aim to keep every interaction under 100ms. It makes using the app more enjoyable!
ML should aim for it too, especially for creative uses.
This is why I'm excited for
@jeremyphoward
and
@clattner_llvm
's Swift work!
Last week the
@feedly
team had me over to chat about some of the practical ML tips I've been writing about.
The recording is available now. It's a short video about why and how you should look at your data, including a slide copied from
@karpathy
:)
We also explore how the model actually uses these features to predict the next word.
In other words, we try to separate features that are related to the context from ones that are useful for prediction.
Let's look at an example
Re-reading World Models by
@hardmaru
and Schmidhuber, it feels like such an elegant combination of many great ideas about representation learning and dynamic world representations. Feels like something key to build on, especially with all the code here
This is what practical ML looks like.
Notice how this
@UberEng
article covers building the tooling to train, run, test and update a model.
Not much is said about the architecture -> that is not where most of the gains come from in practice
Good post by Gustavo Millen on tests for ML.
It covers some of the materials discussed in Building ML Powered Applications including the "ML Test Score" paper by
@GoogleAI
.
A new
#ML
approach by
@jeremyphoward
and
@seb_ruder
, taking transfer learning for
#NLP
to the next level. Pre-train a general RNN language model on a corpus, fine tune on specific tasks, achieve state of the art results!
So excited to finally see tool use out in the world!
Improving Claudeโs ability to use tools was a core focus when building the Claude 3 family.
Looking forward to seeing what you all build with it
Tool use is now available in beta to all customers in the Anthropic Messages API, enabling Claude to interact with external tools using structured outputs.
To decide between Data Science and ML Engineering roles, ask yourself whether you want to focus more on product/analytics questions or on engineering challenges.
There is no right answer, but in my network, most DSs transitioned into product, and MLEs into engineering.
In the sentence "Fact: The capital of the state where Kobe Bryant played basketball is", a lot of features are active. e.g. features for:
- various words (fact, of, etc.)
- trivia questions
- basketball and geography
But only a subset are useful to predict the next token
I literally gave a 90-minute talk yesterday that consisted of over fifty slides repeating this message.
The first question I got after the talk?
"What's the best NLP model?"
...
I wrote a tutorial on
#ReinforcementLearning
, with the help of many people and based on amazing work by
@awjuliani
. Includes a shoutout to the awesome World Models paper by
@hardmaru
. Would love to hear any feedback people have :)
Image colorization has really come a long way! Really impressive demo by
@citnaj
and
@jeremyphoward
.
Most deep learning colorizers I've seen produce videos where colors change a lot between frames. Not the case at all here!
Here is an example of one of my favorite French movies
Here are the slides for my talk at
@QConAI
on tips and tricks to make NLP models work in practice.
I built this talk to mirror the real life of a Data Scientist, so it is 10% about models and 90% about data and error inspection.
The most confusing concept in speech recognition for me is always CTC loss. I learn it and forget it roughly every 6 months, and this article by
@_lab41
is always a good way to remind me of how it works.
It writes code examples, and relates the content of the transcript to the screenshots to provide a coherent narrative.
Overall, the tutorial is readable, clear and much better than anything I've previously gotten out of an LLM.
Amazing articles on ML in production, and constraints of applied systems and organizations.
I particularly enjoyed reading twelve truths of ML for the real world:
This is true at the scale of infrastructure. Many individuals that have led ML teams (
@mrogati
,
@chrisemoody
) promote the importance of increasing "experiment velocity", the speed at which projects can launch. It's a huge productivity boost. We saw this firsthand at
@Zipcar
!
It matters because trying more ideas (with fewer mistakes) means you will converge faster towards better ideas (thus winning competitions more often or increasing your paper acceptance rate).
I'm thinking Kaggle kernels or Colab would be a way to gather hard data on this...
In a great episode of
@twimlai
,
@JeffDean
mentions that finding embeddings for categorical variables through word2vec-style optimization is extremely common at
@Google
, but that there are few relevant papers. We see the same at
@InsightDataAI
, and often have to build it from scratch.
If you look at the dataset to understand how your model performs, you'll often see that your model is actually struggling.
Here, BERT's accuracy on the test set drops from 77% to 50% (random) after researchers identify and correct data leakage.
Data tip of the day:
If you are doing anything involving processing more than a hundred rows of data (SQL, Spark, model training, viz), use only a small subset to iterate faster!
Writing/using a sampling function takes five minutes and saves hours of "waiting for X to run".
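A five-minute version of that sampling helper, as a sketch (the fixed seed is an assumption, chosen so reruns stay comparable):

```python
import random

def sample_rows(rows, n, seed=0):
    # Fixed seed: the subset stays stable across runs, so results are comparable
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

subset = sample_rows(list(range(100_000)), 500)
```

Point your pipeline at `subset` while iterating, then swap in the full data once the code works.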
That feeling in programming when you've managed to do the thing you've been trying to do for an hour, but you feel deep shame about the hacks you've had to use to make it happen.
@ErikSchluntz
and I read the resulting transcript: Opus manages to incorporate all of these requests and produces a great blog post.
The blog post is formatted as asked, with a subset of images selected and captioned
Data leakage is one of the most dangerous ML errors for 2 reasons:
You often catch it too late, once your model is already in production.
It can happen in many subtle ways, not just obvious ones
That's why I dedicate a full section to it in my book!