Andriy Burkov

@burkov

19,678
Followers
142
Following
1,949
Media
8,397
Statuses

Author of 📖 The Hundred-Page Machine Learning Book and the 📖 Machine Learning Engineering book

Québec, Canada
Joined June 2009
Pinned Tweet
@burkov
Andriy Burkov
9 months
You can now ask questions to my books:
Tweet media one
6
15
109
@burkov
Andriy Burkov
3 years
To say "machine learning is just statistics" is as stupid as saying that physics is just mathematics.
311
561
11K
@burkov
Andriy Burkov
19 days
Meta is doing what OpenAI was funded to do, but Zuck is somehow the bad guy while Altman is a visionary.
220
500
6K
@burkov
Andriy Burkov
2 months
If today's Google were the Google of 15 years ago, ChatGPT would have been invented at Google while OpenAI would still be a catching-up non-profit. Today's Google is a shadow of what the company once was: the one that invented an infinite-size email inbox, reinvented online maps,…
212
432
6K
@burkov
Andriy Burkov
4 months
GPT-4 is officially annoying. You ask it to generate 100 entities. It generates 10 and says "I generated only 10. Now you can continue by yourself in the same way." You change the prompt by adding "I will not accept fewer than 100 entities." It generates 20 and says: "I stopped…
556
241
5K
@burkov
Andriy Burkov
2 months
Anyone who has tried to read a scientific article at least once knows that English cannot be used to clearly convey ideas. Most people, including the brightest of scientists, have a hard time writing clearly. I'm sure Nvidia's CEO knows that too. What he is doing here is he is…
@Carnage4Life
Dare Obasanjo🐀
2 months
Jensen Huang, CEO of Nvidia, argues that we should stop saying kids should learn to code. He argues the rise of AI means we can replace programming languages with human language prompts thus enabling everyone to be a programmer. AI will kill coding.
1K
5K
21K
296
356
4K
@burkov
Andriy Burkov
2 months
1. Finetune an LLM on your training data. 2. Demo the performance on the same training data. 3. Make big claims. So typical that even annoying.
@cognition_labs
Cognition
2 months
Today we're excited to introduce Devin, the first AI software engineer. Devin is the new state-of-the-art on the SWE-Bench coding benchmark, has successfully passed practical engineering interviews from leading AI companies, and has even completed real jobs on Upwork. Devin is…
4K
11K
46K
135
241
4K
@burkov
Andriy Burkov
2 months
@OilGains You see - you read my post in English and didn't understand anything.
54
95
3K
@burkov
Andriy Burkov
5 months
Isn't it a matter of prestige for Google to build a serious rival to GPT-4? What has been wrong with this company for the last 5 years? How can they let a 700-person company beat Google in AI? Is it a mediocre CEO or something else?
333
106
3K
@burkov
Andriy Burkov
2 months
Don't let them fool you: AGI today is no nearer to us than it was two years ago. While ChatGPT might appear to be a step closer to AGI, from a scientific standpoint, it's not: training a neural network to predict the next word is not groundbreaking science. Achieving AGI would…
270
323
2K
@burkov
Andriy Burkov
18 days
Well, Llama 3 8B is not that magical after all. (A simple one!)
Tweet media one
202
105
2K
@burkov
Andriy Burkov
8 months
TensorFlow doesn't support Windows anymore. Google is killing yet another product they convinced the world to use.
127
144
2K
@burkov
Andriy Burkov
8 months
My daughter just started college: "Dad, we are studying matrices. There are rules of how to multiply them, I get them, but I don't get why we need all this." Me working on an illustration for the Transformer chapter of my new book: "Oh, look at my screen. Here's why:"
Tweet media one
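[Editor's note: a minimal sketch of the point being made, in NumPy; all names and sizes are illustrative, not taken from the book's chapter. Single-head scaled dot-product attention, the core of a Transformer layer, is essentially three matrix multiplications plus a softmax.]

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Single-head attention: nothing here but matrix products and a softmax."""
    Q = X @ Wq                                  # queries, (seq_len, d_k)
    K = X @ Wk                                  # keys,    (seq_len, d_k)
    V = X @ Wv                                  # values,  (seq_len, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                          # (seq_len, d_v)

seq_len, d_model, d_k = 4, 8, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))                     # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(scaled_dot_product_attention(X, Wq, Wk, Wv).shape)    # (4, 8)
```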
66
127
2K
@burkov
Andriy Burkov
11 days
Google lays off its entire Python team:
61
233
1K
@burkov
Andriy Burkov
4 years
My top Python libraries for data science:
scikit-learn
PyTorch
TensorFlow
Keras
Pandas
SciPy
NumPy
Seaborn
spaCy
XGBoost
Bonus:
Gensim
Scrapy
Flask
MySQLdb
huggingface
What's missing?
141
239
1K
@burkov
Andriy Burkov
3 months
Noticed how nobody says "deep learning" anymore?
97
70
1K
@burkov
Andriy Burkov
2 years
In my team, we use physical GPUs for machine learning R&D not because cloud GPUs are 5 to 10 times more expensive, but because innovation is impossible when you have to think "do I experiment or do I save money."
35
91
1K
@burkov
Andriy Burkov
18 days
I see too many people don’t understand why Meta spends billions to train and then gives away its large language models. They think Zuck does this purely out of a love for open source. In any decision, there are two elements: 1) a reason and 2) a pretext. Sometimes these align,…
93
127
1K
@burkov
Andriy Burkov
6 months
I really admire how Elon Musk can make an event out of anything. So he apparently trained a GPT-3.5 competitor (which already has dozens of competitors and is not really hard to beat given it only has 20B parameters) called Grok but everybody is already talking about it as an…
198
67
1K
@burkov
Andriy Burkov
16 days
This is how you win the AI race
Tweet media one
59
78
1K
@burkov
Andriy Burkov
4 months
For the first time, I actually believe that Meta will match or beat GPT-4. It looks like for Zuckerberg it's a personal matter. It's also a great strategy: by releasing a model similar to GPT-4 under Apache 2.0 license, it kills the business of its growing competitor while it's…
64
60
980
@burkov
Andriy Burkov
2 years
If I were to start in AI today from scratch, I would start with The Hundred-Page Machine Learning Book. This was in fact my primary motivation to write it. My target audience was me 10 years ago.
13
79
955
@burkov
Andriy Burkov
2 years
Looks like ML converges to the following:
- xgboost for tabular data,
- a pretrained transformer for everything else.
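[Editor's note: a minimal sketch of that two-part recipe, assuming xgboost, scikit-learn, and sentence-transformers are installed; the dataset and the checkpoint name are illustrative.]

```python
# Tabular data -> gradient-boosted trees.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = XGBClassifier(n_estimators=200, max_depth=4)
clf.fit(X_tr, y_tr)
print("tabular accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# Everything else (here: text) -> a pretrained transformer.
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative checkpoint
embeddings = encoder.encode(["xgboost for tables", "transformers for the rest"])
print("text embeddings:", embeddings.shape)
```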
25
110
952
@burkov
Andriy Burkov
2 months
There are only 2 possibilities:
1. GPT-4 is a 2T model and OpenAI uses an entire node of 8xH100 (that costs $400,000) to serve the inference just for you for $20/month.
or
2. GPT-4 is a model that is 10 times smaller (it cannot be smaller than 200B) and OpenAI uses one H100…
169
85
952
@burkov
Andriy Burkov
3 months
The most popular use case for Claude and Gemini is to compare them to GPT-4.
32
72
949
@burkov
Andriy Burkov
5 months
Despite what Elon and many other optimists think, autoregressive models (which LLMs are) will not be able to write more than one or a couple of pages of coherent text. An entire book? No way. Each newly generated word contributes to the error. After a couple of pages, the error is…
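[Editor's note: one simplified way to put the accumulation argument into a formula; this framing is the editor's, not the author's. If each generated token is acceptable independently with probability \(1-\varepsilon\), the chance that an \(n\)-token continuation contains no error decays exponentially:]

\[
P(\text{all } n \text{ tokens acceptable}) = (1-\varepsilon)^n,
\qquad
(1 - 0.01)^{2000} \approx 1.9 \times 10^{-9},
\]

so even a 1% per-token error rate makes a few error-free pages (roughly 2,000 tokens) extremely unlikely without some correction mechanism.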
155
81
799
@burkov
Andriy Burkov
3 years
Technical books are expensive. In some countries, people need to work an entire week to buy one. This is awful. Here's what I think as an author of two expensive books. I don't mind if you pirate my books to learn. Buy them later, when you get a better job thanks to the knowledge.
15
58
791
@burkov
Andriy Burkov
2 years
If you want to do a PhD in AI and look for an exciting research direction, here's one for you: memory-augmented machine learning. The goal is to create algorithms that train a model to decide when to use an external memory of facts, or what to save in it for future use.
24
83
784
@burkov
Andriy Burkov
3 years
Why spend 2 weeks labeling more data when you can spend an entire year designing a more complex NN architecture?
16
51
789
@burkov
Andriy Burkov
2 years
Two books to start your machine learning journey
Tweet media one
9
129
760
@burkov
Andriy Burkov
6 months
This 7B model beats ChatGPT and Grok:
38
76
762
@burkov
Andriy Burkov
3 years
People who try to learn machine learning (or another similar science) by themselves find it very hard to understand why those 1/N or 1/2 are used. It takes years before they realize that they don't serve any purpose other than aesthetically pleasing the scientist who wrote them.
Tweet media one
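[Editor's note: for readers wondering which constants are meant, a typical example (illustrative, not the formula from the tweet's image) is the mean squared error written as]

\[
L(\theta) \;=\; \frac{1}{2N}\sum_{i=1}^{N}\bigl(f_\theta(x_i) - y_i\bigr)^2 ,
\]

where the \(1/N\) and the \(1/2\) rescale the loss but do not change which \(\theta\) minimizes it.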
33
95
747
@burkov
Andriy Burkov
5 months
What is a good detailed tutorial on LLM fine-tuning?
22
63
756
@burkov
Andriy Burkov
2 years
If you want to make a career in machine learning knowing only one ML algorithm, learn xgboost.
28
59
706
@burkov
Andriy Burkov
5 months
SOLAR: an 11B model that beats every open model, including Mixtral, Yi-34B, Llama 2 70B, and Falcon 180B:
Tweet media one
19
72
695
@burkov
Andriy Burkov
4 months
OpenAI doesn't use ChatGPT to power its customer support chatbot. This is everything you need to know about using LLMs for anything more important than generating noisy training data and RAG.
Tweet media one
25
55
686
@burkov
Andriy Burkov
3 years
Data science in a nutshell: do linear regression, earn $175k.
Tweet media one
Tweet media two
19
104
642
@burkov
Andriy Burkov
5 months
The whole idea of "safe/unsafe LLMs" is based on the assumption that an adult person is incapable of critical thinking or can suffer damage from words. This infantile idea is a reflection of how infantile Western civilization has become.
71
92
643
@burkov
Andriy Burkov
7 months
Despite the fact that most LLMs have chat capability and many are even finetuned to chat, this capability is useless in a commercial B2C or B2B setting. Multistage chats are unreliable; they quickly diverge from the business objective, the level of hallucination…
71
78
650
@burkov
Andriy Burkov
3 years
Want to quickly test if a candidate understands machine learning? Ask only two questions: 1) why a test set is needed and 2) why linear regression works poorly with outliers.
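[Editor's note: a minimal illustration of the second question, using scikit-learn and synthetic data. A single extreme outlier drags the least-squares fit because the squared loss grows quadratically with the residual.]

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=20)    # clean line, slope ~2

clean_slope = LinearRegression().fit(X, y).coef_[0]

y_outlier = y.copy()
y_outlier[-1] += 200.0                                   # one extreme outlier
outlier_slope = LinearRegression().fit(X, y_outlier).coef_[0]

print(f"slope without outlier: {clean_slope:.2f}")       # ~2.0
print(f"slope with one outlier: {outlier_slope:.2f}")    # pulled far from 2.0
```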
24
65
609
@burkov
Andriy Burkov
2 years
It's not science if you trained an even larger neural network. It's engineering. Science would be to achieve a similar or better model quality by using a fraction of resources. Science would be to solve a problem that previously wasn't solvable. Leave engineering to engineers.
28
62
579
@burkov
Andriy Burkov
2 years
In machine learning, you never know whether the project will succeed or not. Indeed, in most cases it doesn't. This makes working in a non-AI-centric organization painful as an ML engineer or data scientist. When you cannot commit to success, you seem to be a layman.
17
84
589
@burkov
Andriy Burkov
4 months
It's unlikely that OpenAI will win against The NY Times. The reason for this is simple: they don't know how ChatGPT works and thus will have a hard time answering the judge's question: "Is it possible that your model reproduces the copyrighted content verbatim? If yes, can you…
174
84
580
@burkov
Andriy Burkov
13 days
A 262k-token context finetune of Llama 3 8B:
24
67
567
@burkov
Andriy Burkov
2 years
One of the most important features of machine learning (probably the most important one) is that you don't have to know math to train models. All the optimization is carefully isolated from the user. What previously took talent and years of complex math studies now takes nothing.
31
65
559
@burkov
Andriy Burkov
2 years
If you want to do a Ph.D. in machine learning and look for an exciting research direction, here's one: algorithms and techniques that would allow encrypting the data, training a model on the encrypted data, and then using the model on the unencrypted data.
17
74
537
@burkov
Andriy Burkov
1 month
We should seriously stop calling open-weight LLMs "open source". Weights are not equivalent to the source code in traditional software. Data is. So if the creator of an LLM is showing you weights, they are just showing off. They don't let you reproduce their model independently…
46
68
542
@burkov
Andriy Burkov
2 months
This one will fail miserably when put in the wild. It will end up where self-driving cars have gone. Remember this tweet.
16
24
542
@burkov
Andriy Burkov
4 months
Modern AI has become possible thanks to this game: the first really 3D first-person shooter that benefited from a GPU.
Tweet media one
23
50
529
@burkov
Andriy Burkov
4 months
If you really want to do something useful in AI, instead of training another tiny llama, pick up this project and train a 1B-parameter multilingual BERT with 32k input size. The code is here. The data is all over @huggingface. The…
9
80
505
@burkov
Andriy Burkov
2 months
"The first AI software engineer" my ass.
11
21
504
@burkov
Andriy Burkov
19 days
Llama 3 70B beats Mistral Large. What exactly is Mistral now supposed to sell?
Tweet media one
66
28
506
@burkov
Andriy Burkov
2 months
Function calling accuracy in LLMs really sucks. The best function calling accuracy is obtained with GPT-4 and it's 83.8%. It's already too low to be practical, but one should discount this number more assuming that the test data Berkeley used to evaluate function calling…
24
45
501
@burkov
Andriy Burkov
2 years
A high barrier to entry into the field of machine learning is a lie. Compared to many other professions, ML is very simple to get into. Just learn one programming language considered the simplest to learn (Python) and two libraries that have the best docs (sklearn and PyTorch).
17
69
487
@burkov
Andriy Burkov
5 months
Ok, all of you remember that demo when Google's model called a restaurant to make a reservation and pretended to be a real person? Remember all those wows and cheers? Where is this model now? Did anyone see it, try it IRL? OpenAI didn't pompously show ChatGPT in a video one…
28
33
449
@burkov
Andriy Burkov
5 months
Once, Google caught Bing using Google's search results. Now it's the other way around. Microsoft is the Google of our times.
Tweet media one
43
26
440
@burkov
Andriy Burkov
2 years
Why do so many scientists support paywalled Medium? Isn't it the opposite of the openness the research community strives for with ArXiv, GitHub, and the like?
29
37
439
@burkov
Andriy Burkov
2 months
It's not a coincidence that Claude 3 beats GPT-4 on all benchmarks by a small margin. I think that the trick is to constantly run the pretraining of an ~8x100B MoE LLM on more and more data and do occasional instruct-finetunes to see if it beats GPT-4 on all benchmarks. Once it…
32
41
442
@burkov
Andriy Burkov
2 years
This is probably one of the most important web pages on AI:
2
51
417
@burkov
Andriy Burkov
2 years
If I had an hour to solve a problem I'd spend 55 minutes labeling data and 5 minutes training the model.
13
46
423
@burkov
Andriy Burkov
2 months
@zendaimyo I said English but I meant all human language. Very weird thing to narrow in on to criticize imo.
12
2
420
@burkov
Andriy Burkov
6 months
A 7B model from Intel almost as capable as Falcon 180B:
11
37
419
@burkov
Andriy Burkov
2 years
In computer science, 2% of scientists do 98% of useful research.
33
28
403
@burkov
Andriy Burkov
4 months
So you realize what really drives AI applications. Downloads last month on @huggingface :
Mixtral-8x7B-Instruct-v0.1: 843,843
phi-2: 329,824
bert-base-uncased: 32,670,091
roberta-base: 21,673,938
Clearly, the most useful model right now would be a 1B-parameter BERT with 32k+…
21
43
401
@burkov
Andriy Burkov
3 years
FLAML by Microsoft: a lightweight Python library that finds accurate machine learning models automatically, efficiently, and economically.
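[Editor's note: a minimal usage sketch, assuming FLAML is installed (e.g. `pip install flaml[automl]`); the dataset and time budget are illustrative.]

```python
from flaml import AutoML
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

automl = AutoML()
automl.fit(X_train, y_train, task="classification", time_budget=60)  # seconds

print(automl.best_estimator)                                   # e.g. "lgbm" or "xgboost"
print(accuracy_score(y_test, automl.predict(X_test)))           # holdout accuracy
```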
4
73
396
@burkov
Andriy Burkov
2 years
"machine learning with Excel" "data science with command line" "deep learning with R" ok, it's possible, but WHY?
28
42
383
@burkov
Andriy Burkov
6 months
Why is there no LLM finetuned specifically for RAG? It's the most important use case for LLMs.
36
27
382
@burkov
Andriy Burkov
2 years
Are you thinking about doing a PhD in AI and looking for an exciting research direction? Here's one for you: ML with a human in the loop. That is, AI should be smart enough to know when to ask a human for a label, when to pass control to the human, and when to doubt the human.
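[Editor's note: one established flavor of this direction is active learning. A minimal uncertainty-sampling sketch with scikit-learn (all choices illustrative): the model asks a "human" for labels only on the examples it is least sure about.]

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled = list(range(20))                       # pretend only 20 labels exist
unlabeled = list(range(20, len(X)))

model = LogisticRegression(max_iter=1000)
for _ in range(10):                             # 10 rounds of asking a human
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)       # low max-probability = unsure
    ask = int(np.argmax(uncertainty))           # single most uncertain example
    idx = unlabeled.pop(ask)
    labeled.append(idx)                         # the "human" provides y[idx]

print(f"labels used: {len(labeled)}, "
      f"accuracy: {model.score(X[unlabeled], y[unlabeled]):.3f}")
```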
17
45
368
@burkov
Andriy Burkov
5 months
The Apache 2.0 licensed Mixtral beats proprietary GPT-3.5 Turbo, Gemini Pro, and the newest Claude 2.1. It would take just careful fine-tuning to reach GPT-4 level of performance. 2024 will be awesome!
Tweet media one
12
56
364
@burkov
Andriy Burkov
5 months
Claude 2.1 is less capable than Claude 2.0 and Claude 1.0. This is everything you need to know about how well we understand neural networks.
21
25
364
@burkov
Andriy Burkov
3 months
This page is gold. This is how you describe your models:
Tweet media one
7
65
356
@burkov
Andriy Burkov
1 month
A 7B-parameter model that beats ChatGPT-3.5, Mixtral, Gemini Pro, and some of the best 30B and 70B models. Isn't this exciting? Meaning that you can squeeze much more capability per parameter if you know what you are doing.
Tweet media one
16
46
365
@burkov
Andriy Burkov
2 years
Want your machine learning project to be as far as possible from production? Start it in a notebook.
27
28
347
@burkov
Andriy Burkov
6 months
@SciumoInc Nobody pays for GPT-3.5. It's free. Nobody got even close to GPT-4 yet.
10
1
342
@burkov
Andriy Burkov
2 years
There will be no new AI winter. There will be a data science/data scientist winter. Most businesses will soon see no real benefit from having a team of data scientists, given the current cost of such a team.
35
34
331
@burkov
Andriy Burkov
2 years
2011: "We provide AI-powered search." = "We use TF-IDF."
2021: "We provide AI-powered search." = "We use pretrained document embeddings."
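[Editor's note: a rough illustration of the two eras, assuming scikit-learn and sentence-transformers; documents and the checkpoint name are illustrative.]

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

docs = ["how to train a neural network", "tutorial on fitting neural nets"]

# 2011-style "AI-powered search": sparse TF-IDF vectors, exact-word overlap.
tfidf = TfidfVectorizer().fit_transform(docs)
print("TF-IDF similarity:", cosine_similarity(tfidf)[0, 1])   # low: few shared words

# 2021-style: dense pretrained document embeddings, semantic overlap.
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)    # illustrative checkpoint
print("embedding similarity:", cosine_similarity(emb)[0, 1])  # high: similar meaning
```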
8
34
335
@burkov
Andriy Burkov
2 months
@DanielCardena Oh yes, we have seen Bard, that was definitely ground-breaking :-)
2
1
326
@burkov
Andriy Burkov
2 years
The biggest lie in modern AI is that you don't need to understand the math behind it to be able to create successful AI systems. To solve a problem using machine learning, you need to formulate an optimization problem. If you don't understand math, good luck in guessing it!
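[Editor's note: for concreteness, the kind of optimization problem meant here, in a standard textbook formulation:]

\[
\hat{\theta} \;=\; \arg\min_{\theta}\; \frac{1}{N}\sum_{i=1}^{N} \ell\bigl(f_\theta(x_i),\, y_i\bigr) \;+\; \lambda\,\Omega(\theta),
\]

where \(\ell\) is a loss, \(\Omega\) a regularizer, and choosing them well for a given problem is exactly the part that is hard to guess without the math.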
16
59
312
@burkov
Andriy Burkov
7 months
We need to find a better name for what we currently call open-source LLMs. To reproduce an LLM, the source code is not enough. It's not even the main component. The main component is the dataset. So, if an organization only releases the source code but keeps the pretraining or…
20
32
310
@burkov
Andriy Burkov
17 days
In English, Llama 3 8B is as good as Mistral Large, Mistral's most capable closed model, likely larger than 200B parameters. This is unbelievable!
Tweet media one
21
23
310
@burkov
Andriy Burkov
4 years
This chart demonstrates a potential roadmap for machine learning engineers.
Tweet media one
5
67
295
@burkov
Andriy Burkov
3 months
I said Apple Vision Pro didn't have a killer app. I'm sorry, I was wrong.
14
38
291
@burkov
Andriy Burkov
5 months
Hallucinations in LLMs are by design. It's a feature, not a bug. And you cannot fix a feature.
Tweet media one
13
40
286
@burkov
Andriy Burkov
19 days
Models smaller than 100B parameters are poor at factual queries. Their logic and math capabilities are approaching GPT-4, but they cannot get the facts right. This is likely where the additional 300B+ parameters are used in larger models. PS: I think the parameters is not the…
35
22
287
@burkov
Andriy Burkov
1 year
Engineers of the future listening to ChatGPT
Tweet media one
10
33
277
@burkov
Andriy Burkov
18 days
@itsHesamSheikh No, because this is not Meta's core business and they don't think it will pay enough to care. They fear losing their business to someone who becomes too strong to fight with. It's better when there are thousands of small AI companies than 3 large ones.
5
11
278
@burkov
Andriy Burkov
2 years
Another ML quiz: you didn't update the model in production but, after some time, the predictions of the model changed for some inputs. The inputs didn't change. What happened?
63
15
274
@burkov
Andriy Burkov
3 years
Machine learning is the only engineering field where, no matter how much of an expert you are, you will answer "I don't know" to the question of whether something can be done.
7
22
267
@burkov
Andriy Burkov
2 months
Theorem: Fixing the problem of hallucinations in LLMs is equivalent to creating AGI.
66
20
269
@burkov
Andriy Burkov
4 years
A series of paths created by 800 unmanned bicycles being pushed until they fall over:
Tweet media one
4
41
263
@burkov
Andriy Burkov
2 years
NLP is one of the most exciting applications of machine learning with lots of interesting challenges, techniques, and tricks. However, no one buys books on NLP. It's discouraging, and I feel sorry for the authors. Why do you think this is?
41
24
253
@burkov
Andriy Burkov
3 years
In reality, Abraham would spend the first 6 hours updating CUDA drivers :-)
@mrdbourke
Daniel Bourke
3 years
“If I had 8 hours to build a machine learning model, I’d spend the first 6 hours preparing my dataset.” - Abraham Lossfunction
22
238
2K
8
29
250
@burkov
Andriy Burkov
4 months
Phi-2, a 2.7B parameter LLM from @Microsoft , is now distributed under the MIT license which allows commercial use. The model beats the most capable models of up to 13B parameters, including Mistral-7B and Llama 2-13B, on most benchmarks, especially on math and coding benchmarks.…
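[Editor's note: a minimal loading sketch with Hugging Face transformers, assuming the public `microsoft/phi-2` checkpoint; generation parameters are illustrative, and depending on your transformers version you may also need `trust_remote_code=True`.]

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float16,
    device_map="auto",          # requires the accelerate package
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```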
2
38
255
@burkov
Andriy Burkov
5 months
A Llama-2-based model finetuned for function calling:
2
29
240
@burkov
Andriy Burkov
1 month
So the term was coined last year, but some geniuses from HR want you to have 3+ years of experience with LLMs, as well as "Lang Chain, LLAMA Index" (looks like they put them in quotes because someone sent them this expression in quotes).
Tweet media one
25
25
241
@burkov
Andriy Burkov
2 years
Transfer learning is a unique skill of neural networks that no other machine learning algorithm has. This unique property is way more important than their ability to learn deep structures.
7
25
238
@burkov
Andriy Burkov
5 months
Just figured out that when you use function calling in the OpenAI API, you should submit the function call result back to the chatbot using "role": "function".
Tweet media one
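[Editor's note: a hedged sketch of that flow with the openai v1.x Python client and the legacy `functions` interface; the weather function is the usual illustrative example, not taken from the tweet, and the sketch assumes the model actually chooses to call the function.]

```python
import json
from openai import OpenAI

client = OpenAI()

functions = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

messages = [{"role": "user", "content": "What's the weather in Quebec City?"}]
first = client.chat.completions.create(model="gpt-4", messages=messages, functions=functions)
call = first.choices[0].message.function_call            # name + JSON arguments

# Pretend we executed get_weather(**json.loads(call.arguments)) ourselves.
result = {"city": json.loads(call.arguments)["city"], "temp_c": -5}

# Key step from the tweet: send the result back with role "function".
messages += [
    {"role": "assistant", "content": None,
     "function_call": {"name": call.name, "arguments": call.arguments}},
    {"role": "function", "name": call.name, "content": json.dumps(result)},
]
second = client.chat.completions.create(model="gpt-4", messages=messages, functions=functions)
print(second.choices[0].message.content)
```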
13
16
235
@burkov
Andriy Burkov
2 years
Open source NLP is fueling a new wave of startups
0
39
235
@burkov
Andriy Burkov
3 years
The tragedy of a data scientist: a programmer only needs a computer to create, while a data scientist cannot do anything without a dataset.
12
34
233
@burkov
Andriy Burkov
1 month
I don't know why no one has yet implemented such an obvious idea: instead of training an LLM to predict the next word, train it to predict a full paragraph of at most, say, 100 tokens. As a result: decreased hallucinations and 100 times faster inference.
95
9
241
@burkov
Andriy Burkov
3 years
A friendly introduction to machine learning compilers and optimizers
0
39
217