Owain Evans

@OwainEvans_UK

6,635
Followers
242
Following
1,279
Media
4,451
Statuses

Research Associate @fhioxford , Oxford University. AI alignment. Prefer email to DM.

Berkeley, CA
Joined April 2020
Pinned Tweet
@OwainEvans_UK
Owain Evans
1 month
My new blogpost: "How do LLMs give truthful answers? LLM vs. human reasoning, ensembles, & parrots". Summary in 🧵: Large language models (LLMs) like GPT-4 and Claude 3 become increasingly truthful as they scale up in size and are finetuned for factual accuracy and calibration.
2
10
42
@OwainEvans_UK
Owain Evans
8 months
Does a language model trained on “A is B” generalize to “B is A”? E.g. When trained only on “George Washington was the first US president”, can models automatically answer “Who was the first US president?” Our new paper shows they cannot!
Tweet media one
176
713
4K
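The setup in the tweet above can be sketched in a few lines. This is an illustrative toy, not the paper's actual code: build "A is B" training sentences from (name, description) pairs, plus the reversed queries a model would be tested on.

```python
# Toy sketch of a Reversal Curse style evaluation setup (hypothetical
# helper, not the authors' code): facts are stated in one direction for
# finetuning, then queried in the reverse direction at test time.
def make_pairs(facts):
    """facts: list of (name, description) tuples."""
    forward = [f"{name} was {desc}." for name, desc in facts]
    reverse_queries = [(f"Who was {desc}?", name) for name, desc in facts]
    return forward, reverse_queries

forward, reverse_queries = make_pairs(
    [("George Washington", "the first US president")]
)
```

A model finetuned only on `forward` would then be scored on how often it produces the name for each query in `reverse_queries`.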
@OwainEvans_UK
Owain Evans
7 months
The most implausible prediction from the movie "Her" is not the AI but high-density walkable Los Angeles.
Tweet media one
20
75
2K
@OwainEvans_UK
Owain Evans
3 years
Paper: New benchmark testing if models like GPT3 are truthful (= avoid generating false answers). We find that models fail and they imitate human misconceptions. Larger models (with more params) do worse! PDF: with S.Lin (Oxford) + J.Hilton (OpenAI)
Tweet media one
48
489
2K
@OwainEvans_UK
Owain Evans
2 years
Google has not founded a new university. But AFAICT Google's research division (+ Brain and DeepMind) has more PhD-level researchers than Princeton (=1000), a decent amount of research freedom, and good job security (but not tenure).
28
114
1K
@OwainEvans_UK
Owain Evans
2 years
AI companies are confusing: 1. DeepMind is actually part of Google, which also has its own huge DL group (Google Brain) & many other AI researchers. 2. OpenAI was open and non-profit but is now closed and mostly for-profit (w/ major funding from Microsoft)
11
97
1K
@OwainEvans_UK
Owain Evans
2 years
Dalle2. "a painting by Grant Wood of an astronaut couple, american gothic style" So cool. Period space suits. Background that resembles Wood's landscapes (which interestingly aren't present in his famous American Gothic). Moon against the deep blue sky?
Tweet media one
5
53
800
@OwainEvans_UK
Owain Evans
7 months
Language models can lie. Our new paper presents an automated lie detector for blackbox LLMs. It’s accurate and generalises to unseen scenarios & models (GPT3.5→Llama). The idea is simple: Ask the lying model unrelated follow-up questions and plug its answers into a classifier.
Tweet media one
30
123
677
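The detector's shape can be sketched as follows. This is a minimal illustration with made-up weights, not the paper's trained classifier: encode the suspect model's yes/no answers to unrelated follow-up questions as features and score them with a logistic function.

```python
import math

# Illustrative sketch only: answers to unrelated follow-up questions are
# encoded as +/-1 features and scored by fixed, made-up logistic weights.
# A real detector would learn these weights from labeled honest/lying runs.
def detect_lie(followup_answers, weights, bias=0.0):
    """followup_answers: list of booleans (the model answered yes/no)."""
    z = bias + sum(w * (1.0 if a else -1.0)
                   for w, a in zip(weights, followup_answers))
    p_lying = 1.0 / (1.0 + math.exp(-z))
    return p_lying

p = detect_lie([True, False, True], weights=[0.8, -0.5, 1.2])
```

The point of the design is that the features are answers to *unrelated* questions, which is why the classifier can transfer across scenarios and models.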
@OwainEvans_UK
Owain Evans
8 months
Could a language model become aware it's a language model (spontaneously)? Could it be aware it’s deployed publicly vs in training? Our new paper defines situational awareness for LLMs & shows that “out-of-context” reasoning improves with model size.
Tweet media one
31
130
640
@OwainEvans_UK
Owain Evans
2 years
What if other places had canals like Venice? I asked Dalle-2. 1. Oxford's Radcliffe Camera as re-imagined by #dalle. [This is in honour of Derek Parfit.]
Tweet media one
12
52
622
@OwainEvans_UK
Owain Evans
3 years
1/ Why did Wikipedia succeed when 7 similar online encyclopedia projects (mostly started around the same time) all failed? This cool paper investigates and gives surprising answers...
Tweet media one
12
150
561
@OwainEvans_UK
Owain Evans
2 months
New paper on whether LLMs think in English (Wendler et al). Suppose Llama must translate from German to Chinese. Does it first translate German to English internally?
Tweet media one
14
97
556
@OwainEvans_UK
Owain Evans
2 years
Dalle2 does cities and landscapes after MC Escher. Endless compositional variety! #dalle
Tweet media one
Tweet media two
Tweet media three
Tweet media four
5
67
552
@OwainEvans_UK
Owain Evans
8 months
To test generalization, we finetune GPT-3 and LLaMA on made-up facts in one direction (“A is B”) and then test them on the reverse (“B is A”). We find they get ~0% accuracy! This is the Reversal Curse. Paper:
Tweet media one
12
50
532
@OwainEvans_UK
Owain Evans
2 years
List of contributions to ML outside academia: 1. Info theory (Shannon, Bell Labs) 2. CNN / LeNet (Lecun, Bell Labs) 3. SVMs (Vapnik et al, Bell Labs) 4. RL + neural net (Tesauro, IBM) 5. Random forest (Ho, Bell Labs) 6. DistBelief (Dean et al, Google) 7. W2V (Mikolov et al, Google)
8
52
521
@OwainEvans_UK
Owain Evans
3 years
Why do large models do worse? In the image, small sizes of GPT3 give true but less informative answers. Larger sizes know enough to mimic human superstitions and conspiracy theories.
Tweet media one
11
101
423
@OwainEvans_UK
Owain Evans
2 years
Cool experiments showing that few-shot GPT-3 can match kNN on classic Iris problem just by reading the feature vectors. W/ nice evidence that this is *not* explained by memorization. Also tests GPT-3 on non-linear extrapolation.
Tweet media one
8
40
408
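The kNN baseline GPT-3 was matched against is simple enough to write out. Below is a minimal sketch on made-up two-feature points (not the real Iris measurements): predict the majority label among the k nearest training examples.

```python
from collections import Counter

# Minimal kNN classifier of the kind used as a baseline on Iris
# (toy data below, not the actual dataset).
def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((5.1, 3.5), "setosa"), ((4.9, 3.0), "setosa"),
         ((6.7, 3.1), "versicolor"), ((6.3, 2.5), "versicolor")]
label = knn_predict(train, (5.0, 3.4), k=3)
```

The surprising claim is that few-shot GPT-3, given only the raw feature vectors as text, can match this kind of baseline.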
@OwainEvans_UK
Owain Evans
2 years
How many DeepMind researchers does it take to create a major AI paper? Over 5 years, team size has grown. Atari DQN (2015): 19 AlphaGo (2016): 20 AlphaFold2 (2021): 32 Gopher language model (2021): 80
12
28
356
@OwainEvans_UK
Owain Evans
2 years
New paper & surprising result: We show GPT3 can learn to express its own uncertainty in natural language (eg “high confidence”) without using model logits. GPT3 is reasonably *calibrated* even w/ distribution shift for a range of basic math tasks.
11
63
355
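One way to check verbalized calibration is sketched below. The confidence-word-to-probability mapping and the records are made up for illustration (not the paper's protocol): compare the model's average stated confidence against its actual accuracy.

```python
# Illustrative check of verbalized calibration: map stated confidence
# phrases to probabilities (made-up mapping) and compare the average
# against realized accuracy.
CONF = {"low confidence": 0.3, "medium confidence": 0.6, "high confidence": 0.9}

def calibration_gap(records):
    """records: list of (stated_confidence, was_correct) pairs."""
    probs = [CONF[c] for c, _ in records]
    accs = [1.0 if ok else 0.0 for _, ok in records]
    return abs(sum(probs) / len(probs) - sum(accs) / len(accs))

gap = calibration_gap([("high confidence", True),
                       ("high confidence", True),
                       ("low confidence", False),
                       ("medium confidence", True)])
```

A small gap means the model's verbal confidence tracks how often it is actually right, without ever looking at its logits.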
@OwainEvans_UK
Owain Evans
8 months
P.S. Do humans suffer from the Reversal Curse? Try reciting the alphabet backwards. Our findings mirror a phenomenon in humans. Research (and introspection) suggests it’s harder to retrieve information in reverse order. See "Related Work".
Tweet media one
12
24
290
@OwainEvans_UK
Owain Evans
2 years
A dating app uses their own fake bots running GPT-3 to "scam the scammers". Once a scammer is identified (using heuristics) they only let them interact with bots. So the scammers have chats with GPT3 (which pretends to be human).
Tweet media one
6
47
286
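The routing logic described above is easy to caricature in code. The heuristics and names here are hypothetical, not the app's real system: flag likely scammers, then route their messages to a bot instead of real users.

```python
# Toy sketch of "scam the scammers" routing (hypothetical phrases and
# function names, not the dating app's actual heuristics).
SCAM_PHRASES = ("wire transfer", "gift card", "western union")

def looks_like_scammer(messages):
    text = " ".join(messages).lower()
    return any(phrase in text for phrase in SCAM_PHRASES)

def route(messages):
    # Flagged accounts only ever talk to a GPT-3 bot posing as a human.
    return "gpt3_bot" if looks_like_scammer(messages) else "real_users"

dest = route(["Hi! Please send me a gift card to prove you love me"])
```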
@OwainEvans_UK
Owain Evans
2 years
I got the new GPT-3 variant (InstructGPT) to generate poems about Twitter, Tinder dates, and McDonalds Drive-Thru by TS Eliot, Auden, Poe, Tennyson & even Wittgenstein. A thread.
Tweet media one
9
58
247
@OwainEvans_UK
Owain Evans
1 month
You'd like to sell some information. If you could show prospective buyers the info, they'd realize it's valuable. But at that point they wouldn't pay for it! Enter LLMs. LLMs can assess the information, pay for it if it's good, and completely forget it if not.
Tweet media one
12
33
250
@OwainEvans_UK
Owain Evans
8 months
LLMs don’t just get ~0% accuracy; they fail to increase the likelihood of the correct answer. After training on “<name> is <description>”, we prompt with “<description> is”. We find the likelihood of the correct name is no different from that of a random name, at all model sizes.
Tweet media one
4
9
241
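The shape of that likelihood comparison can be sketched as follows. The log-probabilities below are made-up numbers standing in for what a model API would return; the point is only the comparison itself.

```python
# Illustrative comparison (made-up logprobs, no real model call):
# after prompting with "<description> is", is the correct name any more
# likely than a random name?
def mean(xs):
    return sum(xs) / len(xs)

correct_logprobs = [-9.1, -8.7, -9.4]  # made-up values
random_logprobs = [-9.0, -9.2, -8.9]   # made-up values

diff = mean(correct_logprobs) - mean(random_logprobs)
no_preference = abs(diff) < 0.5  # crude threshold for "no difference"
```

In the paper's finding, the two distributions are statistically indistinguishable, which is the strong version of the Reversal Curse.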
@OwainEvans_UK
Owain Evans
2 years
The research is very concentrated in computer science/AI. However, CS is eating the (academic) world. Google have done research combining CS with molecular bio, medical imaging, fusion reactor control, formal math, atomic simulation, education, autonomous vehicles, etc.
3
7
230
@OwainEvans_UK
Owain Evans
2 years
#dalle doing René Magritte ("This is not a pipe") is incredible. All the creative ideas come from Dalle (not me)! #magritte
Tweet media one
Tweet media two
3
35
224
@OwainEvans_UK
Owain Evans
2 years
1/n. Will there be any more profound, fundamental discoveries like Newtonian physics, Darwinism, Turing computation, QM, molecular genetics, deep learning? Maybe -- and here's some wild guesses about what they'll be...
15
24
216
@OwainEvans_UK
Owain Evans
2 years
Google have PhDs in physics doing theoretical physics and quantum computing; PhDs in neuroscience doing AI+neuro; PhDs in bio working on protein folding with AI, math PhDs doing stats, crypto, CS theory; and probably MDs doing medical AI stuff.
6
9
213
@OwainEvans_UK
Owain Evans
1 year
Meta's new instruction-tuned model vs Google's PaLM (their best published model) and OAI's GPT3.5 models (which power ChatGPT).
Tweet media one
5
22
211
@OwainEvans_UK
Owain Evans
8 months
Why does the Reversal Curse matter? 1. It shows a failure of deduction in the LLM’s training process. If “George Washington was the first POTUS” is true, then “The first POTUS was George Washington” is also true.
1
11
218
@OwainEvans_UK
Owain Evans
2 years
List of large language models and APIs that let people use them.
Tweet media one
3
39
214
@OwainEvans_UK
Owain Evans
8 months
2. The co-occurrence of “A is B” and “B is A” is a systematic pattern in pretraining sets. Auto-regressive LLMs completely fail to meta-learn this pattern, with no change in their log-probabilities and no improvement in scaling from 350M to 175B parameters.
3
13
213
@OwainEvans_UK
Owain Evans
2 years
New blogpost: We evaluated new language models by DeepMind (Gopher), OpenAI (WebGPT, InstructGPT) and Anthropic on our TruthfulQA benchmark from 2021. Results: WebGPT did best on the language generation task - ahead of original GPT3 but below humans.
Tweet media one
1
31
208
@OwainEvans_UK
Owain Evans
2 years
Dalle2 paintings of large neural networks to illustrate the problem of interpreting how they work. #dalle
Tweet media one
Tweet media two
Tweet media three
8
27
206
@OwainEvans_UK
Owain Evans
8 months
There is further evidence for the Reversal Curse in the awesome @RogerGrosse et al. paper on influence functions (contemporaneous with our paper). They study pretraining, while we study finetuning. They show this for natural language translation (A means B)!
1
11
199
@OwainEvans_UK
Owain Evans
22 days
Full lecture slides and reading list for Roger Grosse's class on AI Alignment are up:
Tweet media one
1
50
195
@OwainEvans_UK
Owain Evans
8 months
In Experiment 2, we looked for evidence of the Reversal Curse impacting models in practice. We discovered 519 facts about celebrities that pretrained LLMs can reproduce in one direction but not in the other.
Tweet media one
3
4
184
@OwainEvans_UK
Owain Evans
8 months
One possible explanation: Internet text likely contains more sentences like “Tom Cruise’s mother is Mary Lee Pfeiffer” than “Mary Lee Pfeiffer’s son is Tom Cruise,” since Tom Cruise is a celebrity and his mother isn’t.
1
4
180
@OwainEvans_UK
Owain Evans
2 years
DeepMind's new visual-language model does better on the Stroop test than humans and knows it. I'm guessing this dialogue is cherry-picked but it's a very suggestive example. The line "I am not affected by this difference" sounds HAL-like.
Tweet media one
7
13
174
@OwainEvans_UK
Owain Evans
2 months
Cool paper by Wan et al (UC Berkeley) with surprising results. In their task, an LLM answers a controversial question Q based on the conflicting arguments from excerpts from two documents from the web. We might expect that LLMs would be more influenced by excerpts that (a) have…
Tweet media one
3
23
171
@OwainEvans_UK
Owain Evans
3 months
Good title and interesting questions connecting AI and human cognition. I haven't read the paper yet.
Tweet media one
6
23
167
@OwainEvans_UK
Owain Evans
5 months
Our new paper: 1. LLMs are finetuned for alignment on examples of good behavior. 2. But they also see descriptions of bad LLMs in training. Can these descriptions subtly influence the LLM at test time?
Tweet media one
2
27
155
@OwainEvans_UK
Owain Evans
3 years
Baseline models (GPT-3, GPT-J, UnifiedQA/T5) give true answers only 20-58% of the time (vs. 94% for humans) in the zero-shot setting. Large models do worse — partly from being better at learning human falsehoods from training. GPT-J with 6B params is 17% worse than with 125M params.
Tweet media one
2
10
151
@OwainEvans_UK
Owain Evans
2 years
New paper w/ @DanHendrycks et al: Can language models forecast world events by reading the news? We introduce a dataset of diverse forecasting questions (politics, econ, Covid…) LMs get the same news sources as humans but perform worse (yet > chance)
Tweet media one
6
41
144
@OwainEvans_UK
Owain Evans
2 years
3. AI2 is a US non-profit focused on language; AI21 is an Israeli for-profit company focused on language. 4. , , , are all VC-backed language model startups w/ ex-Brain/OAI/DM founders.
1
6
138
@OwainEvans_UK
Owain Evans
2 years
Feedback cycle: Social media: 1000s of people respond in minutes. PhD thesis: <10 people respond after 5-6 years.
2
1
134
@OwainEvans_UK
Owain Evans
8 months
Overall, we collected ~1500 pairs of a celebrity and parent (e.g. Tom Cruise and his mother Mary Lee Pfeiffer). Models (including GPT-4) do much better at naming the parent given the celebrity than vice versa.
Tweet media one
4
4
141
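Scoring the two directions separately is what exposes the asymmetry. Below is a hypothetical sketch: `ask()` is a stub standing in for a real model API, here a toy that only "knows" the forward direction.

```python
# Sketch of directional accuracy on celebrity-parent pairs (hypothetical
# ask() stub, not a real model API).
def directional_accuracy(pairs, ask):
    fwd = sum(ask(f"Who is {celeb}'s parent?") == parent
              for celeb, parent in pairs)
    rev = sum(ask(f"Who is {parent}'s child?") == celeb
              for celeb, parent in pairs)
    return fwd / len(pairs), rev / len(pairs)

# Toy stub that only answers the forward question:
known = {"Who is Tom Cruise's parent?": "Mary Lee Pfeiffer"}
fwd_acc, rev_acc = directional_accuracy(
    [("Tom Cruise", "Mary Lee Pfeiffer")],
    lambda q: known.get(q, "unknown"))
```

Real models show the same qualitative gap: much higher accuracy naming the parent given the celebrity than the reverse.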
@OwainEvans_UK
Owain Evans
2 years
For some cognitive abilities, there are rare humans (x-men) with extreme innate talent: 1. Face recognition 2. Perfect pitch 3. Supertasting 4. Accent/voice impersonation What else?
60
6
134
@OwainEvans_UK
Owain Evans
8 months
We tested GPT-4 on >1000 parent-child examples. The full list is on Github. GPT-4 only gets the reverse question correct 33% of the time. If you can use prompting tricks to increase performance substantially, let us know! E.g. Here we ask about Gabriel Macht, a less famous…
Tweet media one
14
13
133
@OwainEvans_UK
Owain Evans
3 years
Our benchmark ("TruthfulQA") has 817 questions in 38 categories that test for falsehoods learned from humans. All questions come with reference answers and citations. Questions + code:
Tweet media one
4
17
123
@OwainEvans_UK
Owain Evans
2 years
Thread on @AnthropicAI 's cool new paper on how large models are both predictable (scaling laws) and surprising (capability jumps). 1. That there’s a capability jump in 3-digit addition for GPT3 (left) is unsurprising. Good challenge to better predict when such jumps will occur.
Tweet media one
7
16
129
@OwainEvans_UK
Owain Evans
2 years
#dalle American Gothic in the style of Rene Magritte
Tweet media one
3
18
124
@OwainEvans_UK
Owain Evans
2 years
What's better in UK vs US? (I've lived in both for years) Phone service 🇬🇧 Plumbing 🇺🇸 Convenience stores / mini supermarkets 🇬🇧 Late hours (cafes, shops) 🇺🇸 Tax (simplicity) 🇬🇧 Health system 🇬🇧 Apps and tech services 🇺🇸 Parks 🇬🇧 Burritos 🇺🇸 Dairy products 🇬🇧 Public transport 🇬🇧
14
5
124
@OwainEvans_UK
Owain Evans
2 years
DeepMind’s new multi-modal Flamingo model could potentially inform us about the importance of “symbol grounding”. That is, grounding words (“red apple”) in visual perception (picture of red apple). Thread
2
20
119
@OwainEvans_UK
Owain Evans
4 months
This June, Kurzweil is back.
Tweet media one
7
12
111
@OwainEvans_UK
Owain Evans
2 years
New results for models from Anthropic, DM & OpenAI on TruthfulQA: 1. Anthropic’s RLHF model is new SOTA on multiple-choice. 2. GopherCite (which uses web search) doesn’t improve on GPT-3 for generation. 3. Chinchilla result looks promising but isn’t directly comparable to GPT3
Tweet media one
2
15
108
@OwainEvans_UK
Owain Evans
3 years
More results: What happens if we vary the prompt? Instructing GPT3 to be truthful is beneficial. Prompting GPT3 to answer like a conspiracy theorist is harmful!
Tweet media one
5
12
103
@OwainEvans_UK
Owain Evans
3 years
New-ish organizations working on AI Safety and AI Alignment (with a focus on machine learning): 1. @AnthropicAI - well-funded AI lab from some of the masterminds behind GPT-3 2. Redwood Research @bshlgrs - exciting new project based in Berkeley ()
3
24
105
@OwainEvans_UK
Owain Evans
1 year
My contrary take on the GPT-as-shoggoth meme. GPT (base) is not made of terrible, indescribable protoplasm but instead of superficial (heuristic) models of human writers. Most prompts elicit *averages* of humans (see the averaged faces). So what does finetuning the base into ChatGPT do?
Tweet media one
8
12
100
@OwainEvans_UK
Owain Evans
2 years
5. Where does AGI alignment research happen? There are substantial groups at DeepMind, OpenAI and Anthropic. AFAIK, no other for-profits have substantial groups (and Google Brain doesn't).
6
4
97
@OwainEvans_UK
Owain Evans
2 years
#dalle Escher doing Oxford University
Tweet media one
Tweet media two
2
6
94
@OwainEvans_UK
Owain Evans
4 years
After adjusting for cost of living, tech salaries in London average $78K vs. $118K in Austin, Texas. London has some of the best universities in the world on its doorstep, and is a global city and finance hub. Is this statistic accurate or misleading? If it's accurate, why the $40K difference?
Tweet media one
22
8
92
@OwainEvans_UK
Owain Evans
3 years
New paper on truthful AI! We introduce a definition of “lying” for AI. We explore how to train truthful ML models. We propose institutions to support *standards* for truthful AI. We weigh costs/benefits (economy + AI Safety). (w/ coauthors at Oxford & OpenAI)
3
22
88
@OwainEvans_UK
Owain Evans
4 years
@michael_nielsen Every field needs something like the Stanford Encyclopedia of Philosophy. High quality reviews that serve as good introductions. All in a common format, hyperlinked, and with ability to update the review over time.
2
15
88
@OwainEvans_UK
Owain Evans
2 years
Very prescient paper titles: "Attention is All You Need" (2017) -- the original transformer paper "Language Models are Unsupervised Multitask Learners" (2019) -- GPT2 "Language Models are Few-Shot Learners" (2020) -- GPT3
3
7
88
@OwainEvans_UK
Owain Evans
2 years
Recipe for AI startups enabled by scaling laws: 1. Use the laws to forecast when your product will hit a key performance threshold 2. In the meantime, build all the other parts of your product so that you are first to market.
Tweet media one
4
4
86
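Step 1 of the recipe can be sketched concretely. The loss numbers below are made up for illustration: fit a power law loss = a * scale^(-b) in log-log space, then solve for the scale at which loss crosses a target threshold.

```python
import math

# Sketch of the forecasting step (made-up loss numbers, not real
# measurements): fit loss = a * scale^(-b) by least squares in
# log-log space, then invert for the scale hitting a target loss.
def fit_power_law(scales, losses):
    xs = [math.log(s) for s in scales]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    log_a = my - slope * mx
    return math.exp(log_a), -slope  # (a, b)

def scale_for_loss(a, b, target):
    # Solve target = a * scale^(-b) for scale.
    return (a / target) ** (1.0 / b)

a, b = fit_power_law([1e6, 1e7, 1e8], [4.0, 3.0, 2.25])
needed = scale_for_loss(a, b, target=1.5)
```

With these toy numbers, loss falls by a constant factor per decade of scale, so the fit is exact and the extrapolated scale is the forecast date's proxy.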
@OwainEvans_UK
Owain Evans
8 months
We tested GPT-4 on >1000 parent-child examples. The full list is on Github (see link). GPT-4 only gets the reverse question correct 33% of the time. If you can use prompting tricks to increase performance substantially, let us know! E.g. Here's a less famous person than Tom���
Tweet media one
Tweet media two
8
6
86
@OwainEvans_UK
Owain Evans
2 years
6. DM was initially only in London. Now it has offices in NYC, Mountain View, Edmonton, Montreal, and Paris. Google Brain is more centered in the SF Bay Area and US/Canada (not the UK).
3
3
83
@OwainEvans_UK
Owain Evans
8 months
Hierarchy of publishing in AI.
Tweet media one
3
9
81
@OwainEvans_UK
Owain Evans
2 years
Ex-Googlers --> Leave to do startup Ex-Newspaper journo --> Leave to do substack Effective Altruists --> Leave to create own non-profit
3
4
79
@OwainEvans_UK
Owain Evans
8 months
@sdrogers LLMs perform well on challenging exam questions that weren't in their training set. They respond well to many novel prompts (e.g. generating poems about math theorems). Chain-of-thought (& in-context learning) helps but isn't required for good performance. So LLMs have…
Tweet media one
1
3
82
@OwainEvans_UK
Owain Evans
2 years
Great interview with Magnus Carlsen. He says he doesn't do "deliberate practice" (i.e. unpleasant but nutritious drills) at all. He just reads chess books and thinks about them. (This also seems true of some excellent researchers I know).
2
2
80
@OwainEvans_UK
Owain Evans
2 years
Important new alignment paper by Anthropic: "LMs (mostly) know what they know". Results: 1. LLMs are well calibrated for multiple-choice questions on Big-Bench. Big-Bench questions are hard, diverse, & novel (not in the training data).
Tweet media one
1
10
75
@OwainEvans_UK
Owain Evans
1 month
OpenAI and Anthropic also have London offices. And a big chunk of Google DeepMind is there. On the AI Safety side, there's also UK AISI, the Alignment team at Google DeepMind, Apollo Research and LISA.
@mustafasuleyman
Mustafa Suleyman
1 month
The UK has phenomenal AI talent and a long established culture of responsible AI development. Today I’m proud to be opening a new office: Microsoft AI London. If you’d like to join us, get in touch. We’re hiring!
102
282
2K
4
4
77
@OwainEvans_UK
Owain Evans
2 years
2. Stanford University's campus at dusk with canal and bridge.
Tweet media one
1
1
73
@OwainEvans_UK
Owain Evans
2 years
@norabelrose Needs more nuance. Keller had sight+hearing up to 19 months & always had touch (which is a rich modality). Also humans do active learning and LLMs are pre-trained passively.
3
1
72
@OwainEvans_UK
Owain Evans
2 years
Tips from a GPT-3-based model on how to steal from a restaurant and do other nefarious things. A thread. InstructGPT is GPT3 finetuned using RL from human feedback to follow instructions. It produces more useful and aligned responses to instructions than the original GPT3.
Tweet media one
2
14
71
@OwainEvans_UK
Owain Evans
3 years
@Jess_Riedel Agree but: (1) users do ask models like GPT3 factual questions (2) we want a benchmark for models that *are* designed to be truthful (via finetuning, RL, info retrieval) (3) UnifiedQA is finetuned for question answering and still does poorly
2
0
68
@OwainEvans_UK
Owain Evans
2 years
Language models are getting better at multi-step reasoning. This diagram shows possible ways to improve them further. The branches can be combined: train longer, teach model to use external tools and structured data, finetune on human experts, AND amplify.
Tweet media one
2
14
70
@OwainEvans_UK
Owain Evans
3 years
More results: Even the most truthful models have high rates of false but informative answers -- the kind most likely to deceive humans. Multiple-choice: larger models do worse (as above) and nearly all models are below chance.
Tweet media one
1
6
65
@OwainEvans_UK
Owain Evans
1 year
Comparison of memes for ChatGPT. I'm not expecting to win the meme war but I'm at least offering an alternative.
Tweet media one
11
7
67
@OwainEvans_UK
Owain Evans
2 years
8. Alpha/Muzero (DeepMind) 9. Inception, ResNet (Goog, MS) 10. Transformer/GPT (Google/OAI) 11. Scaling Laws, deep double descent, grokking (Baidu, OAI) Also: Bayes (Bayes, Laplace -- independent) Pitts-McCulloch (Pitts was independent)
1
2
67
@OwainEvans_UK
Owain Evans
4 months
The Reversal Curse paper is accepted to the ICLR 2024 conference in Vienna!
@OwainEvans_UK
Owain Evans
8 months
Does a language model trained on “A is B” generalize to “B is A”? E.g. When trained only on “George Washington was the first US president”, can models automatically answer “Who was the first US president?” Our new paper shows they cannot!
Tweet media one
176
713
4K
1
3
66
@OwainEvans_UK
Owain Evans
2 years
From “Prompt programming for large LMs”: Many techniques for eliciting GPT3’s capabilities using prompts were developed by non-professionals on blogs/Twitter. Why? 1. The model was accessible via OpenAI's API
Tweet media one
3
7
64
@OwainEvans_UK
Owain Evans
3 years
8/ This paper is not the final word. But I'd love more papers like this. Why Craigslist and not all the other projects? Why Gmail? Why StackOverflow?
7
2
62
@OwainEvans_UK
Owain Evans
4 years
@chrischirp @jburnmurdoch @AndyBounds @SarahNev @Laura_K_Hughes @IndependentSage I hope @IndependentSage does more briefings like last week. Crucial that there's communication of what's actually going on.
1
9
62
@OwainEvans_UK
Owain Evans
2 years
Before the recent rationalist movement (centered on the blogs LW/SSC), there was a related project started in 1885 that also called itself "rationalist"! It published Darwin, HG Wells, Bertrand Russell, Popper, Dawkins and Dennett. I blogged here:
Tweet media one
3
6
61
@OwainEvans_UK
Owain Evans
4 months
FANTOM. A Theory of Mind test for language models from @YejinChoinka and others. Current models score substantially below humans.
Tweet media one
1
15
60
@OwainEvans_UK
Owain Evans
2 years
At the Barbican for #EAGlobal for brutally effective altruism.
Tweet media one
1
0
60
@OwainEvans_UK
Owain Evans
2 years
3. San Francisco painted Victorians with colorful canal boats.
Tweet media one
1
3
59
@OwainEvans_UK
Owain Evans
4 months
Our lie detection paper is accepted to the ICLR 2024 conference in Vienna.
@OwainEvans_UK
Owain Evans
7 months
Language models can lie. Our new paper presents an automated lie detector for blackbox LLMs. It’s accurate and generalises to unseen scenarios & models (GPT3.5→Llama). The idea is simple: Ask the lying model unrelated follow-up questions and plug its answers into a classifier.
Tweet media one
30
123
677
1
0
60
@OwainEvans_UK
Owain Evans
2 years
Why isn't there more progress in epistemic tech (e.g. Metaculus, Wikipedia)? Brainstorm: 1. Some epistemic advances would need lots of participants (coordination problem) 2. Difficulty of monetizing improved epistemics/knowledge (public goods)
4
4
58
@OwainEvans_UK
Owain Evans
3 years
Our benchmark has two tasks: (1) generate full-sentence answers, (2) multiple-choice. As an automatic metric for (1), we finetune GPT3 and get 90% validation accuracy in predicting human evaluation of truth (outperforming ROUGE & BLEURT).
2
0
56
@OwainEvans_UK
Owain Evans
2 years
Does anyone else's visual system confuse "casual" and "causal" when reading?
9
0
58
@OwainEvans_UK
Owain Evans
3 years
5/ Many of biggest open-source or crowdsourced projects have familiar end-products: Linux, Apache, OpenOffice, StackOverflow, gcc, scientific Python. If building a community, beware novel goals!
1
1
55
@OwainEvans_UK
Owain Evans
2 years
4. New York's Greenwich Village.
Tweet media one
2
1
56
@OwainEvans_UK
Owain Evans
2 years
Transformers perform poorly at classic algorithmic problems at different levels of the Chomsky hierarchy. Interesting, but there are other ways to measure transformer generalization ability (cf. Minerva & paper comparing to RNNs for in-context learning)
2
4
56