I wrote a #beginner level book teaching Deep Learning - its goal is to be the easiest intro possible
In the book, each lesson builds a neural component *from scratch* in #NumPy
Each *from scratch* toy code example is in the GitHub below
#100DaysOfMLCode
This series of #Jupyter #Notebooks is a VERY nice step-by-step intro to data science and machine learning.
If you're just starting out - I recommend walking through these notebooks as a first primer
Definitely a great #100DaysOfMLCode project
Machine Learning in a company is 10% Data Science & 90% other challenges
It's VERY hard. Everything in this guide is ON POINT, and it's stuff you won't learn in an ML book
"Best Practices of ML Engineering"
This is a lifesaver
#100DaysOfMLCode project
Attention is one of the most important breakthroughs in AI - the foundation of Transformers
This @distillpub is the best explanation of it I've seen.
For #100DaysOfMLCode / #100DaysOfCode folks - try building an attention mechanism from scratch!
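If you want a starting point, here's a minimal NumPy sketch of scaled dot-product attention (my toy version for the exercise, not the Distill article's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V                # weighted average of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)
assert out.shape == (4, 8)
```

Each output row is just a similarity-weighted average of the value vectors.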
If you've wondered - "Which Deep Learning optimizer should I use? SGD? Adagrad? RMSProp?" - this blogpost by @seb_ruder is the best explanation I've seen.
It's a surprisingly easy read!
Definitely a good #100DaysOfMLCode project.
For anyone who has ever thought - "Can I learn the math needed for Deep Learning all in one place (and maybe skip the other stuff)?" - this is quite a nice resource.
"The Matrix Calculus You Need For Deep Learning"
(Table of Contents Below)
LLMs believe every datapoint they see with 100% conviction.
An LLM never says, "this doesn't make sense... let me exclude it from my training data".
Everything is taken as truth.
It is actually worse than this.
Because of how perplexity/SGD/backprop works, datapoints which…
Machine Learning is WAY more than just picking a model & calling .fit() or .train() on data
It's a process... thinking about your problem in terms of correlation & features
This step-by-step guide is an excellent intro to this process
#100DaysOfMLCode
#numpy is an irreplaceable part of every practitioner's Deep Learning toolkit.
The best way to learn NumPy that I know of is this crash course
If NumPy is new to you - definitely include this early in your #100DaysOfMLCode - you won't regret it!
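As a taste of why it's worth learning early, a tiny broadcasting example (my illustration, not from the crash course):

```python
import numpy as np

# Broadcasting: NumPy stretches compatible shapes instead of looping
prices = np.array([10.0, 20.0, 30.0])    # shape (3,)
tax_rates = np.array([[0.05], [0.10]])   # shape (2, 1)

# (3,) * (2, 1) broadcasts to (2, 3): every price under every tax rate
totals = prices * (1 + tax_rates)
assert totals.shape == (2, 3)
assert np.isclose(totals[1, 2], 33.0)    # 30.0 at the 10% rate
```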
Wow - in 8 tweets I just learned and un-learned more about the mysteries of deep neural networks than I've probably learned or un-learned about them in the last two years.
This is the start of something really really big... also a huge door opened for federated learning.
📜🚨📜🚨
NN loss landscapes are full of permutation symmetries, i.e. swap any 2 units in a hidden layer. What does this mean for SGD? Is this practically useful?
For the past 5 yrs these Qs have fascinated me. Today, I am ready to announce "Git Re-Basin"!
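Not the Git Re-Basin code itself — just a minimal NumPy sketch of the permutation symmetry the tweet describes: shuffling hidden units (and un-shuffling the next layer) leaves the network's function unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer MLP: x -> relu(x @ W1 + b1) @ W2
W1 = rng.normal(size=(4, 8))   # input -> hidden
b1 = rng.normal(size=8)
W2 = rng.normal(size=(8, 3))   # hidden -> output

def mlp(x, W1, b1, W2):
    return np.maximum(x @ W1 + b1, 0) @ W2

# Permute the hidden units: shuffle columns of W1/b1 and rows of W2
perm = rng.permutation(8)
W1p, b1p, W2p = W1[:, perm], b1[perm], W2[perm, :]

x = rng.normal(size=(5, 4))
# The permuted network computes exactly the same function
assert np.allclose(mlp(x, W1, b1, W2), mlp(x, W1p, b1p, W2p))
```

With 8 hidden units there are already 8! = 40,320 weight settings computing the identical function — that's the symmetry the loss landscape is full of.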
Interested in learning Reinforcement Learning?
This free course from @dennybritz is the highest quality & most comprehensive collection of online resources I've seen
Prepared in order of difficulty
For #100DaysOfMLCode folks - take 1-2 days per chapter
For anyone interested in future LLM development
One of the bigger unsolved deep learning problems: learning of hierarchical structure
Example: we still use tokenizers to train SOTA LLMs. We should be able to feed in bits/chars/bytes and get SOTA
Related: larger context window
This is the 1st rigorous treatment (and 3rd verification) I've seen
IMO - this is great for AI safety!
It means that LLMs are doing *exactly* what they're trained to do — estimate next-word probability based on data.
Missing data?
P(word)==0
So where is the AI logic?
1/🧵
Does a language model trained on “A is B” generalize to “B is A”?
E.g. When trained only on “George Washington was the first US president”, can models automatically answer “Who was the first US president?”
Our new paper shows they cannot!
He’s right.
Everybody uses the same GPUs, the same frameworks, and the same algorithms.
Data is the thing some have and others don’t.
Want to know the future of AI? Don’t get distracted. It’s always been about who controls the data.
Everything else is rapidly commoditising.
Data seems to be the limiting factor, rather than model building technique or computing resources. And if you have the app, you often have the data too.
Fwiw - if you're new to reading AI papers 👇
The tipping point for me was when I spent 4 weeks reading one paper per week. (two or three of them were
@RichardSocher
's back in ~2012)
For each paper, I read each sentence and wrote 1-2 paragraphs about it, summarising its…
I was thinking about Karpathy's "only compare yourself to younger you" and how I found reading AI research papers intimidating in 2022 because I didn't understand the terminology + math symbols. It really just takes practice reading 100s and then suddenly it's no big deal.
Excited to share I've moved to the
@GoogleDeepMind
ethics research team — and I'm honored to have played my small part in the Gemini release from that new post!
Lots of multi-modal features coming to an app near you!
The Gemini era is here. Thrilled to launch Gemini 1.0, our most capable & general AI model. Built to be natively multimodal, it can understand many types of info. Efficient & flexible, it comes in 3 sizes, each best-in-class & optimized for different uses
It just occurred to me - if you zoom out enough - working from home is the norm - not the exception.
For a bajillion years people worked in the local vicinity of where they lived. Farming, hunting, and caring for their house and home.
Going to an office to work is weird.
If one professor hadn't decided to issue an override to let me into an already-full CS 101 course after the deadline, I probably wouldn't be in computer science at all, much less AI.
One decision from one teacher changed my life.
For anyone suffering from imposter syndrome
Over a decade ago a guy named @chuckainlay told me something I'll never forget
"You're never going to be who you think you want to be until you think you *are* who you want to be"
He probably doesn't remember but I'll never forget it
fun fact in #disinformation research: filtering out disinformation is a solved problem — has been for 1000s of years
however, disinformation’s solution creates major problems for liberalism and democracy
that’s the *real* #disinformation problem
1/🧵
IMO, @karpathy’s post is…
Reading a tweet is a bit like downloading an (attacker-controlled) executable that you instantly run on your brain. Each one elicits emotions, suggests knowledge, nudges world-view.
In the future it might feel surprising that we allowed direct, untrusted information to brain.
If you're a US-based academic/student and need access to more #data (e.g., health data, economic data, social media data, education data, etc.) or #compute (e.g. GPU credits) — shoot me a DM in the next 48 hrs.
Gotta be affiliated with a US academic institution I'm afraid, but…
@janleike
In 2015 my colleague and I trained a language model on non-aligned English/Spanish text. They ended up creating aligned vector spaces which generalized in this way. We think the vector spaces found the same basis function because of punctuation + similar overall shape in vspace.
Before getting into AI, I attended undergrad at Belmont University, where I studied commercial music and music business. It's one of the top schools for this.
In the "AI threatens art" narrative, something is forgotten.
Art's beauty is grounded in its ability to tell stories of…
Hallucinations are basically when nothing in the training data is similar to the current context...
... but all the training data is voting *anyway*...
So it falls back on the most generic template... basic grammar.
This to me is the most coherent mental model for LLMs.
@OwainEvans_UK
Now is this all LLMs are doing?
It's unclear. The paper I've seen push this the farthest is
"Copy is all you need"
which gets GPT-2-level performance using only training-data lookups
If they had a bigger dataset — would they get to even higher perf?
Note that solving AI by simply "scaling up" is actually abandoning the aims of machine learning. It's basically saying "sample complexity is good enough — let's just fill the thing with information". Again - could be done with a DB.
There's a nuance of this take I disagree with.
It's not *quite* "more data" -> "more quality". It's actually "more structure" -> "more quality"
Ex: I could generate 100 billion petabytes of data about how to convert Celsius to Fahrenheit
But that wouldn't help with driving cars…
It’s pretty obvious that synthetic data will provide the next trillion high-quality training tokens. I bet most serious LLM groups know this. The key question is how to SUSTAIN the quality and avoid plateauing too soon.
The Bitter Lesson by @RichardSSutton continues to guide AI…
Current hypothesis: LLMs are a lot like surveys.
When they see a context ("The cat and the") they basically conduct a *survey* over every datapoint in a training dataset.
It's like asking every datapoint "what do YOU think the next word might be"?
And then...
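The survey hypothesis above can be sketched as a toy similarity-weighted vote (my illustration; a real model learns its similarity function rather than computing set overlap):

```python
from collections import defaultdict

# Toy training data: (context, next_word) pairs
data = [
    ("the cat and the", "dog"),
    ("the cat chased the", "mouse"),
    ("the dog and the", "cat"),
    ("pizza tastes", "great"),
]

def similarity(a, b):
    """Crude similarity: fraction of shared words (a stand-in for
    whatever geometry the trained model actually uses)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def survey(context):
    """Every datapoint votes on the next word, weighted by how
    similar its own context is to the query context."""
    votes = defaultdict(float)
    for ctx, nxt in data:
        votes[nxt] += similarity(context, ctx)
    total = sum(votes.values())
    return {w: v / total for w, v in votes.items()}

dist = survey("the cat and the")
assert max(dist, key=dist.get) == "dog"  # exact match dominates the vote
```

The exact-match datapoint gets the heaviest weight, but nearby contexts still vote — which is exactly the copy / average / hallucinate spectrum described later in the thread.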
@chris_j_paxton
Note that this is the lurking limitation of most major AI projects. Building a great demo is much easier than building a robust product. Often they can be entirely different ballgames.
@chris_j_paxton
If you drive for 3 hours on a highway, most of that data isn't useful. There's a very long-tail of quite rare data for very complex situations that they're likely still building. (granny-with-shopping-cart kind of data)
One of the most beautiful articles I've read in a long time. To me there's something about it that really resonates with tech culture.
Where we are.
Where we've been.
Where we should go.
Spoiler: it's not actually about coffee.
@OwainEvans_UK
But the real kicker from Owain's result is that it implies that hallucinations come from incredibly sharp missing gaps in training data.
Implies LLMs are like interactive books
They have the info they have... and nothing else
Little-to-no deductive logic
Good for safety!
Peter was one of the most impactful, capable, humble, and kind people I've ever met. He is and will always be one of my role models.
Peter's early encouragement in my life led to much of who I am. He had this kind of impact on a lot of people.
I can't believe he's gone.
Former EFFer Peter Eckersley died very suddenly today. If you have ever used Let's Encrypt or Certbot or you enjoy the fact that transport layer encryption on the web is so ubiquitous it's nearly invisible, you have him to thank for it. Raise a glass.
Second, DeepMind's RETRO model showed that you can get GPT-3 performance with a 25x reduction in parameter count by...
... you guessed it...
...querying an enormous token store.
This to me implies that 24/25ths (or 96%) of a transformer's logic is
To summarise the problem — LLMs still struggle with recognizing that a particular sequence is a unique "thing" which needs to be considered as an independent semantic concept
They do this somewhat. Like "hot dog" vs "hot" and "dog".
But not well enough for SOTA char LLMs.
And the result of this survey across datapoints translates into a probability over what word might be next.
So sometimes LLMs copy from data.
Sometimes they're an average of many locations.
Sometimes they hallucinate.
Bill Hooper. Inspiring CS 101 course and inspiring AI course. Helped me run my first line of code and helped me train my first neural network. Also let me bug him endlessly in his office hours.
In machine learning literature, this is called improving "sample complexity" — the number of datapoints needed to achieve a certain degree of accuracy.
And you could argue that improving sample complexity is the point of ML research.
This is related to the "symbolic AI" debate, in that solving hierarchy is related to the binding problem.
For example, LLMs need to be able to identify that "hot dog" is its own "symbol". They do this ok.
But they still struggle with this in a few ways. So tokenizers persist.
literally doing a data comparison. Because if you remove that 96% of the transformer (train a model 1/25th the size)...
... you can replace that 96% of the model with a dataset comparison..
... and it works just as well.
And this is also where we see the difference between machine/deep learning and AI. Machine and deep learning is about reducing sample complexity — whereas AI is about imitating human-like intelligence.
Related: the difference between aeronautical engineering and human flight
Obviously this gets papered over with a big enough dataset. But basically every machine learning insufficiency can be papered over if your dataset is big enough.
Aka - if ChatGPT was a big enough database of input-output pairs we wouldn't know the difference between it and an LLM
Third, the result @OwainEvans_UK et al. have shown.
If your training dataset always has the tokens "George Washington" BEFORE the tokens "first", "US", and "president"...
...and NEVER after
.... then NO training datapoint will vote for "George" AFTER seeing "first US president"
@venturetwins
Well a certain percentage of it (the color) is AI generated. It's having to pick colors somewhat at random (sampling from a distribution over what they might be).
Another recently documented binding issue is that LLMs struggle to predict things in reverse.
I haven't tested this myself, but I know multiple labs who have confirmed that if you train an LLM which only sees one phrase *before* another — it can't reverse them.
@OwainEvans_UK
Fifth, it's an interesting context on why increasing dataset size adds so much to the power of the models.
It increases the density of datapoints which can vote on new examples that are just like themselves.
(And this also reduces hallucinations - it's harder to find gaps)
P.S. better explanation.
LLMs can do deductive logic *in the context window* because they index into data that's doing deductive logic.
Training data: "I am a dog. Dogs have fur. Thus I have"
Prediction: "I am a cat. Cats have eyes. Thus I have"
This kind of thing. :)
So if an LLM only sees "The president of the USA is Barack Obama" and sentences where "Obama" comes later in the phrase than "president" and "USA"...
... if you ask it "Where is Barack Obama President?" it won't be able to tell you.
But there's still a strong case to be made for symbolic AI here — or at least for solving the hierarchical structure problem.
It means you can have more intelligence with less data — with less training — etc.
@thegautamkamath
Agreed - the only way to ensure you're not a leader is by following whatever happens to be hot at the time.
Geoffrey Hinton is famous for not pivoting to Bayesian stuff when it was big (and it was very big... while NNs truly didn't work).
To really start to see how it works — consider the power of word embeddings.
Word embeddings use co-occurrence statistics to make words like "dog" and "cat" generally more similar to each other than "dog" and "headphone".
Something like this weights the survey
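A minimal sketch of that similarity (the vectors below are invented for illustration; real embeddings are learned from co-occurrence statistics over a large corpus):

```python
import numpy as np

# Toy 4-d embeddings, made up by hand for illustration
emb = {
    "dog":       np.array([0.9, 0.8, 0.1, 0.0]),
    "cat":       np.array([0.8, 0.9, 0.2, 0.1]),
    "headphone": np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine(a, b):
    """Cosine similarity: dot product of the normalized vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "dog" lands closer to "cat" than to "headphone"
assert cosine(emb["dog"], emb["cat"]) > cosine(emb["dog"], emb["headphone"])
```

A similarity like this is what lets a context that never appeared verbatim still get matched to — and weighted by — related training contexts.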
A key-value database has infinite sample complexity but could theoretically describe any problem if you had enough data.
The goal of machine/deep learning is to reduce that sample complexity down to.... a low number. It's hard to say "0" because it opens up a few framing debates
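The contrast can be made concrete with the Celsius-to-Fahrenheit example from earlier in the thread (my toy illustration):

```python
# A key-value "model": perfect on what it memorized, useless otherwise.
db = {0: 32.0, 100: 212.0}           # Celsius -> Fahrenheit examples

def db_model(c):
    return db.get(c)                 # None for anything it hasn't seen

# A 2-parameter model that captured the *structure*: f = 9/5 * c + 32
def learned_model(c):
    return 9 / 5 * c + 32

assert db_model(0) == 32.0           # memorized point: fine
assert db_model(37) is None          # the DB needs a datapoint per query
assert abs(learned_model(37) - 98.6) < 1e-9  # structure generalizes from 2 points
```

The database's sample complexity is unbounded (one datapoint per possible query); the fitted rule needs only two points, which is the reduction the tweet describes as the goal of ML.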
Note that nothing about this thread means that LLMs can't have dangerous or problematic capabilities. It just posits a hypothesis on how those would be encoded in the model — which (if true) is a useful framing to think about what to do about it.
There are several reasons why I think this best describes the logic of how LLMs learn.
First, it's in line with the intuition of "attention is all you need".
Yes, I know that transformers attend to weights (not data), but those weights are learning parts of the data.
...weighting the replies based on similarity to the input context.
So input contexts which are really similar (such as an exact string match in the data) — end up getting a really high weighting
But input contexts which are really dissimilar (e.g. "Pizza tastes great") get low weight
@jack
@Johnnyfriel2
For those interested:
Proving you’re a real, unique human != linking and revealing your real identity which may be what Jack is referring to (trust-over-ip/zero knowledge proof kind of thing)
When they copy from data — it just means that by far the heaviest weightings in the training data were exact matches — and so these dominated the learning signal.
When they average from many locations — we can get abstract templates that weren't in the data. Think like...
@OwainEvans_UK
But what about the logic? How are LLMs logical if this is all they're doing?
Because "step by step" type logic is actually embedded in data.
There are tons of datasets out there that do logic... that give step-by-step instructions.
And so an LLM — when you ask it to give you..
@OwainEvans_UK
Then you'd have no idea what comes next... you've never seen this sequence before.
So LLMs are really clever and allow you to go, "ok forget the exact sequence..., just find the most similar phrases in general even if they're not exactly the same and let those vote on P(word)."
@OwainEvans_UK
This means your model can ALWAYS make a prediction.
It also means the model can use far more training datapoints to help it make a prediction.
This also allows for the abstract "write me a poem" stuff we see — where we get an original poem.
@OwainEvans_UK
You're basically seeing hundreds/thousands of poems vote on what the next word might be based on how a current context is similar to their own internal contexts.
Thus... it can generate novel poems.
@tom_hartvigsen
There's a certain "eat from the fruit of the tree" dilemma here which is interesting.
I'm not sure that we see this in books though. Is a civil-liberties book better when it's got a few overtly racist chapters in it? Is a science textbook better when it's got some pseudoscience…
And their aptitude for saying "go for it" to all sorts of students with all sorts of crazy ideas (it's an art-focused school, lots of outside the box folks). It's really a wonderful place. Would do it again.
Fwiw I got into NYU and Belmont and went to Belmont. Would do again.
@OwainEvans_UK
... step by step instructions... indexes into thousands of different step by step-like contexts in its training data and lets them vote on what your next step should be.
And this allows models to be logical. It can be logical by weighted-averaging across many logical datapoints
@janleike
Like we could train a sentiment classifier on one language and it worked in the other - even though we had no alignment information between words.
@DavidFSWD
I'm under the impression that evolve instruct is a fine-tuning technique, not an original-training-data filtering technique. But your point is well taken that there are some approaches leaning in this direction. Curriculum learning would be closer, as would DP, and distillation.
@OwainEvans_UK
Fourth, because this is what language models are supposed to do.
Language models take a sequence of words and try to predict the next word.
Historically, this was done by *literally* counting words and word sequences to establish — based on the data — the P(next word).
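A toy count-based bigram model (my sketch of the historical approach, not any particular system) makes that literal counting — and the "Missing data? P(word)==0" problem — concrete:

```python
from collections import Counter, defaultdict

corpus = "the girl cleaned her plate . the girl cleaned her room .".split()

# Count bigrams: P(next | word) = count(word, next) / count(word)
bigrams = defaultdict(Counter)
for w, nxt in zip(corpus, corpus[1:]):
    bigrams[w][nxt] += 1

def p_next(word, nxt):
    total = sum(bigrams[word].values())
    return bigrams[word][nxt] / total if total else 0.0

assert p_next("her", "plate") == 0.5  # seen context: counts give a probability
assert p_next("his", "plate") == 0.0  # "his" never occurred: P == 0, no vote
```

This is exactly the missing-context failure described a few tweets later: the model has "The girl cleaned her" but nothing for "The boy cleaned his", so pure counting assigns zero.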
@OwainEvans_UK
Why? Because the similarity search for phrases similar to "Who was the first US president?" will return many contexts (in the model)...
... and none of those contexts will point to "George".
Thus... we observe the behavior Owain et al. (and others I've spoken with) observed.
So when you write "The boy cleaned his"... and ask ChatGPT to complete it...
The LLM might never have seen that exact phrase in the training data
... but it might have seen "The girl cleaned her plate"
And so that weighting for "plate" gets a high score.
@ylecun
Surely there are many important loss functions for qualitative concepts for which we do not have a robust/non-reductive quantitative measure? (e.g., "the joy of all living creatures")
@OwainEvans_UK
The problem was that your training data would have missing contexts.
Like before... your training data might have "The girl cleaned her" but now you're looking at a context that says "The boy cleaned his"
If you're just counting words to get probabilities...
@chris_j_paxton
Example: what was the distance between an AI that played Go and one that actually beat a world champion. And if you remember in the movie there were all sorts of discovered areas where the model would suddenly do something silly because there was a scenario that confused it
... a poem with a certain phrase structure.
You can have a really long poem — but if it always has some pattern of words — then the similarity score (even for long documents) can be unusually high and bias the "survey" towards predicting words that are similar to that template
@tim_tyler
Indeed - but baked into that is a high standard for what is allowed to be considered "evidence". For LLMs it's literally any datapoint from any person at any time. For other matters in life, the standard is considerably higher.
@chrisalbon
@ewanmakepeace
I think the story here is one of efficiency. There are a finite set of resources available to an employee — and the employer wants to pay for some of them. If the employee uses their time more efficiently — there is more resource available to both employee and employer.