I wrote a #beginner level book teaching Deep Learning - its goal is to be the easiest intro possible
In the book, each lesson builds a neural component *from scratch* in #NumPy
Each *from scratch* toy code example is in the GitHub below
#100DaysOfMLCode
This series of #Jupyter #Notebooks is a VERY nice step-by-step intro to data science and machine learning.
If you're just starting out - I recommend walking through these notebooks as a first primer
Definitely a great #100DaysOfMLCode project
Machine Learning in a company is 10% Data Science & 90% other challenges
It's VERY hard. Everything in this guide is ON POINT, and it's stuff you won't learn in an ML book
"Best Practices of ML Engineering"
This is a lifesaver
#100DaysOfMLCode project
Attention is one of the most important breakthroughs in AI - the foundation of Transformers
This @distillpub is the best explanation of it I've seen.
For #100DaysOfMLCode / #100DaysOfCode folks - try building an attention mechanism from scratch!
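If you want a starting point, here's a minimal NumPy sketch of scaled dot-product attention (my toy version for the exercise, not the Distill article's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V                # weighted average of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)
assert out.shape == (4, 8)
```

Each output row is just a similarity-weighted average of the value vectors.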
If you've wondered - "Which Deep Learning optimizer should I use? SGD? Adagrad? RMSProp?" - this blogpost by @seb_ruder is the best explanation I've seen.
It's a surprisingly easy read!
Definitely a good #100DaysOfMLCode project.
For anyone who has ever thought - "Can I learn the math needed for Deep Learning all in one place (and maybe skip the other stuff)?" - this is quite a nice resource.
"The Matrix Calculus You Need For Deep Learning"
(Table of Contents Below)
LLMs believe every datapoint they see with 100% conviction.
An LLM never says, "this doesn't make sense... let me exclude it from my training data".
Everything is taken as truth.
It is actually worse than this.
Because of how perplexity/SGD/backprop works, datapoints which…
Machine Learning is WAY more than just picking a model & calling .fit() or .train() on data
It's a process... thinking about your problem in terms of correlation & features
This step-by-step guide is an excellent intro to this process
#100DaysOfMLCode
#numpy is an irreplaceable part of every practitioner's Deep Learning toolkit.
The best way to learn NumPy that I know of is this crash course
If NumPy is new to you - definitely include this early in your #100DaysOfMLCode - you won't regret it!
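As a taste of why it's worth learning early, a tiny broadcasting example (my illustration, not from the crash course):

```python
import numpy as np

# Broadcasting: NumPy stretches compatible shapes instead of looping
prices = np.array([10.0, 20.0, 30.0])    # shape (3,)
tax_rates = np.array([[0.05], [0.10]])   # shape (2, 1)

# (3,) * (2, 1) broadcasts to (2, 3): every price under every tax rate
totals = prices * (1 + tax_rates)
assert totals.shape == (2, 3)
assert np.isclose(totals[1, 2], 33.0)    # 30.0 at the 10% rate
```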
Wow - in 8 tweets I just learned and un-learned more about the mysteries of deep neural networks than I've probably learned or un-learned about them in the last two years.
This is the start of something really really big... also a huge door opened for federated learning.
📜🚨📜🚨
NN loss landscapes are full of permutation symmetries, i.e. swap any 2 units in a hidden layer. What does this mean for SGD? Is this practically useful?
For the past 5 yrs these Qs have fascinated me. Today, I am ready to announce "Git Re-Basin"!
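Not the Git Re-Basin code itself — just a minimal NumPy sketch of the permutation symmetry the tweet describes: shuffling hidden units (and un-shuffling the next layer) leaves the network's function unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer MLP: x -> relu(x @ W1 + b1) @ W2
W1 = rng.normal(size=(4, 8))   # input -> hidden
b1 = rng.normal(size=8)
W2 = rng.normal(size=(8, 3))   # hidden -> output

def mlp(x, W1, b1, W2):
    return np.maximum(x @ W1 + b1, 0) @ W2

# Permute the hidden units: shuffle columns of W1/b1 and rows of W2
perm = rng.permutation(8)
W1p, b1p, W2p = W1[:, perm], b1[perm], W2[perm, :]

x = rng.normal(size=(5, 4))
# The permuted network computes exactly the same function
assert np.allclose(mlp(x, W1, b1, W2), mlp(x, W1p, b1p, W2p))
```

With 8 hidden units there are already 8! = 40,320 weight settings computing the identical function — that's the symmetry the loss landscape is full of.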
Interested in learning Reinforcement Learning?
This free course from @dennybritz is the highest quality & most comprehensive collection of online resources I've seen
Prepared in order of difficulty
For #100DaysOfMLCode folks - take 1-2 days per chapter
For anyone interested in future LLM development
One of the bigger unsolved deep learning problems: learning of hierarchical structure
Example: we still use tokenizers to train SOTA LLMs. We should be able to feed in bits/chars/bytes and get SOTA
Related: larger context window
This is the 1st rigorous treatment (and 3rd verification) I've seen
IMO - this is great for AI safety!
It means that LLMs are doing *exactly* what they're trained to do — estimate next-word probability based on data.
Missing data?
P(word)==0
So where is the AI logic?
1/🧵
Does a language model trained on “A is B” generalize to “B is A”?
E.g. When trained only on “George Washington was the first US president”, can models automatically answer “Who was the first US president?”
Our new paper shows they cannot!
He’s right.
Everybody uses the same GPUs, the same frameworks, and the same algorithms.
Data is the thing some have and others don’t.
Want to know the future of AI? Don’t get distracted. It’s always been about who controls the data.
Everything else is rapidly commoditising.
Data seems to be the limiting factor, rather than model building technique or computing resources. And if you have the app, you often have the data too.
Fwiw - if you're new to reading AI papers 👇
The tipping point for me was when I spent 4 weeks reading one paper per week. (two or three of them were
@RichardSocher
's back in ~2012)
For each paper, I read each sentence and wrote 1-2 paragraphs about it, summarising its…
I was thinking about Karpathy's "only compare yourself to younger you" and how I found reading AI research papers intimidating in 2022 because I didn't understand the terminology + math symbols. It really just takes practice reading 100s and then suddenly it's no big deal.
Excited to share I've moved to the
@GoogleDeepMind
ethics research team — and I'm honored to have played my small part in the Gemini release from that new post!
Lots of multi-modal features coming to an app near you!
The Gemini era is here. Thrilled to launch Gemini 1.0, our most capable & general AI model. Built to be natively multimodal, it can understand many types of info. Efficient & flexible, it comes in 3 sizes, each best-in-class & optimized for different uses
It just occurred to me - if you zoom out enough - working from home is the norm - not the exception.
For a bajillion years people worked in the local vicinity of where they lived. Farming, hunting, and caring for their house and home.
Going to an office to work is weird.
If one professor hadn't decided to issue an override to let me into an already-full CS 101 course after the deadline, I probably wouldn't be in computer science at all, much less AI.
One decision from one teacher changed my life.
For anyone suffering from imposter syndrome
Over a decade ago a guy named @chuckainlay told me something I'll never forget
"You're never going to be who you think you want to be until you think you *are* who you want to be"
He probably doesn't remember but I'll never forget it
fun fact in #disinformation research: filtering out disinformation is a solved problem — has been for 1000s of years
however, disinformation’s solution creates major problems for liberalism and democracy
that’s the *real* #disinformation problem
1/🧵
IMO, @karpathy’s post is…
Reading a tweet is a bit like downloading an (attacker-controlled) executable that you instantly run on your brain. Each one elicits emotions, suggests knowledge, nudges world-view.
In the future it might feel surprising that we allowed direct, untrusted information to brain.
If you're a US-based academic/student and need access to more #data (e.g., health data, economic data, social media data, education data, etc.) or #compute (e.g. GPU credits) — shoot me a DM in the next 48 hrs.
Gotta be affiliated with a US academic institution I'm afraid, but…
@janleike
In 2015 my colleague and I trained a language model on non-aligned English/Spanish text. They ended up creating aligned vector spaces which generalized in this way. We think the vector spaces found the same basis function because of punctuation + similar overall shape in vspace.
Before getting into AI, I attended undergrad at Belmont University, where I studied commercial music and music business. It's one of the top schools for this.
In the "AI threatens art" narrative, something is forgotten.
Art's beauty is grounded in its ability to tell stories of…
Hallucinations are basically when nothing in the training data is similar to the current context...
... but all the training data is voting *anyway*...
So it falls back on the most generic template... basic grammar.
This to me is the most coherent mental model for LLMs.
@OwainEvans_UK
Now is this all LLMs are doing?
It's unclear. The paper I've seen push this the farthest is
"Copy is all you need"
which gets GPT-2-level performance using only training-data lookups
If they had a bigger dataset — would they get to even higher perf?
Note that solving AI by simply "scaling up" is actually abandoning the aims of machine learning. It's basically saying "sample complexity is good enough — let's just fill the thing with information". Again - could be done with a DB.
There's a nuance of this take I disagree with.
It's not *quite* "more data" -> "more quality". It's actually "more structure" -> "more quality"
Ex: I could generate 100 billion petabytes of data about how to convert Celsius to Fahrenheit
But that wouldn't help with driving cars…
It’s pretty obvious that synthetic data will provide the next trillion high-quality training tokens. I bet most serious LLM groups know this. The key question is how to SUSTAIN the quality and avoid plateauing too soon.
The Bitter Lesson by @RichardSSutton continues to guide AI…
Current hypothesis: LLMs are a lot like surveys.
When they see a context ("The cat and the") they basically conduct a *survey* over every datapoint in a training dataset.
It's like asking every datapoint "what do YOU think the next word might be"?
And then...
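The survey hypothesis above can be sketched as a toy similarity-weighted vote (my illustration; a real model learns its similarity function rather than computing set overlap):

```python
from collections import defaultdict

# Toy training data: (context, next_word) pairs
data = [
    ("the cat and the", "dog"),
    ("the cat chased the", "mouse"),
    ("the dog and the", "cat"),
    ("pizza tastes", "great"),
]

def similarity(a, b):
    """Crude similarity: fraction of shared words (a stand-in for
    whatever geometry the trained model actually uses)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def survey(context):
    """Every datapoint votes on the next word, weighted by how
    similar its own context is to the query context."""
    votes = defaultdict(float)
    for ctx, nxt in data:
        votes[nxt] += similarity(context, ctx)
    total = sum(votes.values())
    return {w: v / total for w, v in votes.items()}

dist = survey("the cat and the")
assert max(dist, key=dist.get) == "dog"  # exact match dominates the vote
```

The exact-match datapoint gets the heaviest weight, but nearby contexts still vote — which is exactly the copy / average / hallucinate spectrum described later in the thread.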
@chris_j_paxton
Note that this is the lurking limitation of most major AI projects. Building a great demo is much easier than building a robust product. Often they can be entirely different ballgames.
@chris_j_paxton
If you drive for 3 hours on a highway, most of that data isn't useful. There's a very long-tail of quite rare data for very complex situations that they're likely still building. (granny-with-shopping-cart kind of data)
One of the most beautiful articles I've read in a long time. To me there's something about it that really resonates with tech culture.
Where we are.
Where we've been.
Where we should go.
Spoiler: it's not actually about coffee.
@OwainEvans_UK
But the real kicker from Owain's result is that it implies that hallucinations come from incredibly sharp missing gaps in training data.
Implies LLMs are like interactive books
They have the info they have... and nothing else
Little-to-no deductive logic
Good for safety!
Peter was one of the most impactful, capable, humble, and kind people I've ever met. He is and will always be one of my role models.
Peter's early encouragement in my life led to much of who I am. He had this kind of impact on a lot of people.
I can't believe he's gone.
Former EFFer Peter Eckersley died very suddenly today. If you have ever used Let's Encrypt or Certbot or you enjoy the fact that transport layer encryption on the web is so ubiquitous it's nearly invisible, you have him to thank for it. Raise a glass.
Second, DeepMind's RETRO model showed that you can get GPT-3 performance with a 25x reduction in parameter count by...
... you guessed it...
...querying an enormous token store.
This to me implies that 24/25ths (or 96%) of a transformer's logic is
To summarise the problem — LLMs still struggle with recognizing that a particular sequence is a unique "thing" which needs to be considered as an independent semantic concept
They do this somewhat. Like "hot dog" vs "hot" and "dog".
But not well enough for SOTA char LLMs.
And the result of this survey across datapoints translates into a probability over what word might be next.
So sometimes LLMs copy from data.
Sometimes they're an average of many locations.
Sometimes they hallucinate.
Bill Hooper. Inspiring CS 101 course and inspiring AI course. Helped me run my first line of code and helped me train my first neural network. Also let me bug him endlessly in his office hours.
In machine learning literature, this is called improving "sample complexity" — the number of datapoints needed to achieve a certain degree of accuracy.
And you could argue that improving sample complexity is the point of ML research.
This is related to the "symbolic AI" debate, in that solving hierarchy is related to the binding problem.
For example, LLMs need to be able to identify that "hot dog" is its own "symbol". They do this ok.
But they still struggle with this in a few ways. So tokenizers persist.
literally doing a data comparison. Because if you remove that 96% of the transformer (train a model 1/25th the size)...
... you can replace that 96% of the model with a dataset comparison..
... and it works just as well.
And this is also where we see the difference between machine/deep learning and AI. Machine and deep learning is about reducing sample complexity — whereas AI is about imitating human-like intelligence.
Related: the difference between aeronautical engineering and human flight
Obviously this gets papered over with a big enough dataset. But basically every machine learning insufficiency can be papered over if your dataset is big enough.
Aka - if ChatGPT was a big enough database of input-output pairs we wouldn't know the difference between it and an LLM
Third, the result @OwainEvans_UK et al. have shown.
If your training dataset always has the tokens "George Washington" BEFORE the tokens "first", "US", and "president"...
...and NEVER after
.... then NO training datapoint will vote for "George" AFTER seeing "first US president"
@venturetwins
Well a certain percentage of it (the color) is AI generated. It's having to pick colors somewhat at random (sampling from a distribution over what they might be).
Another recently documented binding issue is that LLMs struggle to predict things in reverse.
I haven't tested this myself, but I know multiple labs who have confirmed that if you train an LLM which only sees one phrase *before* another — it can't reverse them.
@OwainEvans_UK
Fifth, it's an interesting context on why increasing dataset size adds so much to the power of the models.
It increases the density of datapoints which can vote on new examples that are just like themselves.
(And this also reduces hallucinations - it's harder to find gaps)
P.S. better explanation.
LLMs can do deductive logic *in the context window* because they index into data that's doing deductive logic.
Training data: "I am a dog. Dogs have fur. Thus I have"
Prediction: "I am a cat. Cats have eyes. Thus I have"
This kind of thing. :)
So if an LLM only sees "The president of the USA is Barack Obama" and sentences where "Obama" comes later in the phrase than "president" and "USA"...
... if you ask it "Where is Barack Obama President?" it won't be able to tell you.
But there's still a strong case to be made for symbolic AI here — or at least for solving the hierarchical structure problem.
It means you can have more intelligence with less data — with less training — etc.
@thegautamkamath
Agreed - the only way to ensure you're not a leader is by following whatever happens to be hot at the time.
Geoffrey Hinton is famous for not pivoting to Bayesian stuff when it was big (and it was very big... while NNs truly didn't work).
To really start to see how it works — consider the power of word embeddings.
Word embeddings use co-occurrence statistics to make words like "dog" and "cat" generally more similar to each other than "dog" and "headphone".
Something like this weights the survey
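A minimal sketch of that similarity (the vectors below are invented for illustration; real embeddings are learned from co-occurrence statistics over a large corpus):

```python
import numpy as np

# Toy 4-d embeddings, made up by hand for illustration
emb = {
    "dog":       np.array([0.9, 0.8, 0.1, 0.0]),
    "cat":       np.array([0.8, 0.9, 0.2, 0.1]),
    "headphone": np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine(a, b):
    """Cosine similarity: dot product of the normalized vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "dog" lands closer to "cat" than to "headphone"
assert cosine(emb["dog"], emb["cat"]) > cosine(emb["dog"], emb["headphone"])
```

A similarity like this is what lets a context that never appeared verbatim still get matched to — and weighted by — related training contexts.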
A key-value database has infinite sample complexity but could theoretically describe any problem if you had enough data.
The goal of machine/deep learning is to reduce that sample complexity down to.... a low number. It's hard to say "0" because it opens up a few framing debates
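The contrast can be made concrete with the Celsius-to-Fahrenheit example from earlier in the thread (my toy illustration):

```python
# A key-value "model": perfect on what it memorized, useless otherwise.
db = {0: 32.0, 100: 212.0}           # Celsius -> Fahrenheit examples

def db_model(c):
    return db.get(c)                 # None for anything it hasn't seen

# A 2-parameter model that captured the *structure*: f = 9/5 * c + 32
def learned_model(c):
    return 9 / 5 * c + 32

assert db_model(0) == 32.0           # memorized point: fine
assert db_model(37) is None          # the DB needs a datapoint per query
assert abs(learned_model(37) - 98.6) < 1e-9  # structure generalizes from 2 points
```

The database's sample complexity is unbounded (one datapoint per possible query); the fitted rule needs only two points, which is the reduction the tweet describes as the goal of ML.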
Note that nothing about this thread means that LLMs can't have dangerous or problematic capabilities. It just posits a hypothesis on how those would be encoded in the model — which (if true) is a useful framing to think about what to do about it.
There are several reasons why I think this best describes the logic of how LLMs learn.
First, it's in line with the intuition of "attention is all you need".
Yes, I know that transformers attend to weights (not data), but those weights are learning parts of the data.
...weighting the replies based on similarity to the input context.
So input contexts which are really similar (such as an exact string match in the data) — end up getting a really high weighting
But input contexts which are really dissimilar (e.g. "Pizza tastes great") get low weight
@jack
@Johnnyfriel2
For those interested:
Proving you’re a real, unique human != linking and revealing your real identity which may be what Jack is referring to (trust-over-ip/zero knowledge proof kind of thing)
When they copy from data — it just means that by far the heaviest weightings in the training data were exact matches — and so these dominated the learning signal.
When they average from many locations — we can get abstract templates that weren't in the data. Think like...
@OwainEvans_UK
But what about the logic? How are LLMs logical if this is all they're doing?
Because "step by step" type logic is actually embedded in data.
There are tons of datasets out there that do logic... that give step-by-step instructions.
And so an LLM — when you ask it to give you..
@OwainEvans_UK
Then you'd have no idea what comes next... you've never seen this sequence before.
So LLMs are really clever and allow you to go, "ok forget the exact sequence..., just find the most similar phrases in general even if they're not exactly the same and let those vote on P(word)."
@OwainEvans_UK
This means your model can ALWAYS make a prediction.
It also means the model can use far more training datapoints to help it make a prediction.
This also allows for the abstract "write me a poem" stuff we see — where we get an original poem.
@OwainEvans_UK
You're basically seeing hundreds/thousands of poems vote on what the next word might be based on how a current context is similar to their own internal contexts.
Thus... it can generate novel poems.
@tom_hartvigsen
There's a certain "eat from the fruit of the tree" dilemma here which is interesting.
I'm not sure that we see this in books though. Is a civil-liberties book better when it's got a few overtly racist chapters in it? Is a science textbook better when it's got some pseudoscience…
And their aptitude for saying "go for it" to all sorts of students with all sorts of crazy ideas (it's an art-focused school, lots of outside the box folks). It's really a wonderful place. Would do it again.
Fwiw I got into NYU and Belmont and went to Belmont. Would do again.
@OwainEvans_UK
... step by step instructions... indexes into thousands of different step by step-like contexts in its training data and lets them vote on what your next step should be.
And this allows models to be logical. It can be logical by weighted-averaging across many logical datapoints
@janleike
Like we could train a sentiment classifier on one language and it worked in the other - even though we had no alignment information between words.
@DavidFSWD
I'm under the impression that evolve instruct is a fine-tuning technique, not an original-training-data filtering technique. But your point is well taken that there are some approaches leaning in this direction. Curriculum learning would be closer, as would DP, and distillation.
@OwainEvans_UK
Fourth, because this is what language models are supposed to do.
Language models take a sequence of words and try to predict the next word.
Historically, this was done by *literally* counting words and word sequences to establish — based on the data — the P(next word).
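A toy count-based bigram model (my sketch of the historical approach, not any particular system) makes that literal counting — and the "Missing data? P(word)==0" problem — concrete:

```python
from collections import Counter, defaultdict

corpus = "the girl cleaned her plate . the girl cleaned her room .".split()

# Count bigrams: P(next | word) = count(word, next) / count(word)
bigrams = defaultdict(Counter)
for w, nxt in zip(corpus, corpus[1:]):
    bigrams[w][nxt] += 1

def p_next(word, nxt):
    total = sum(bigrams[word].values())
    return bigrams[word][nxt] / total if total else 0.0

assert p_next("her", "plate") == 0.5  # seen context: counts give a probability
assert p_next("his", "plate") == 0.0  # "his" never occurred: P == 0, no vote
```

This is exactly the missing-context failure described a few tweets later: the model has "The girl cleaned her" but nothing for "The boy cleaned his", so pure counting assigns zero.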
@OwainEvans_UK
Why? Because the similarity search for phrases similar to "Who was the first US president?" will return many contexts (in the model)...
... and none of those contexts will point to "George".
Thus... we observe the behavior Owain et al. (and others I've spoken with) observed.
So when you write "The boy cleaned his"... and ask ChatGPT to complete it...
The LLM might never have seen that exact phrase in the training data
... but it might have seen "The girl cleaned her plate"
And so that weighting for "plate" gets a high score.
@ylecun
Surely there are many important loss functions for qualitative concepts for which we do not have a robust/non-reductive quantitative measure? (e.g., "the joy of all living creatures")
@OwainEvans_UK
The problem was that your training data would have missing contexts.
Like before... your training data might have "The girl cleaned her" but now you're looking at a context that says "The boy cleaned his"
If you're just counting words to get probabilities...
@chris_j_paxton
Example: what was the distance between an AI that played Go and one that actually beat a world champion. And if you remember in the movie there were all sorts of discovered areas where the model would suddenly do something silly because there was a scenario that confused it
... a poem with a certain phrase structure.
You can have a really long poem — but if it always has some pattern of words — then the similarity score (even for long documents) can be unusually high and bias the "survey" towards predicting words that are similar to that template
@tim_tyler
Indeed - but baked into that is a high standard for what is allowed to be considered "evidence". For LLMs it's literally any datapoint from any person at any time. For other matters in life, the standard is considerably higher.
@chrisalbon
@ewanmakepeace
I think the story here is one of efficiency. There are a finite set of resources available to an employee — and the employer wants to pay for some of them. If the employee uses their time more efficiently — there is more resource available to both employee and employer.