Andreas Stuhlmüller

@stuhlmueller

2,438 Followers · 175 Following · 124 Media · 599 Statuses

scale up good reasoning @elicitorg · work with me:

Oakland, CA
Joined July 2008
Pinned Tweet
@stuhlmueller
Andreas Stuhlmüller
9 months
Thread of reasons to work with us on Elicit (1/♾)
3
14
87
@stuhlmueller
Andreas Stuhlmüller
2 years
Language model papers at NeurIPS 2022 that sound interesting to me and that I hadn't seen before (thread)
2
115
645
@stuhlmueller
Andreas Stuhlmüller
4 years
Language model experiment @oughtinc: Take a vague forecasting question ("What will the future of robotics look like?"), generate measurable subquestions ("How many industrial robots are sold per year?"), then generate a data source for each measurement ("Intl Federation of Robotics")
8
23
166
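A minimal sketch of the decomposition pipeline this tweet describes, assuming a generic `complete(prompt) -> str` LM helper (hypothetical, not Ought's actual code):

```python
# Sketch only: `complete` stands in for any LM completion call.
from typing import Callable

def decompose_forecast(question: str, complete: Callable[[str], str]) -> dict[str, str]:
    """Vague question -> measurable subquestions -> a data source for each."""
    subq_prompt = (
        f'Vague forecasting question: "{question}"\n'
        "List three measurable subquestions, one per line:"
    )
    subquestions = [line.strip("- ").strip()
                    for line in complete(subq_prompt).splitlines() if line.strip()]
    return {
        sq: complete(f'Measurable question: "{sq}"\nBest public data source:').strip()
        for sq in subquestions
    }
```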
@stuhlmueller
Andreas Stuhlmüller
2 years
I've updated the @oughtinc machine learning reading list with 30+ papers from the last 6 months: PaLM, Chinchilla, Instruct, Grokking, T-Few, Chain of Thought, Self-Consistency, Minerva, Selection-Inference, Cascades, Plex, Forecasting, etc
1
37
165
@stuhlmueller
Andreas Stuhlmüller
6 months
Underappreciated thought/feeling by @KatjaGrace - definitely feels like people haven't grappled with the consequences of AI as much as seems right, even the people who worry about the consequences a lot
17
23
158
@stuhlmueller
Andreas Stuhlmüller
3 years
language models + dataframes = ❤️
6
16
126
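One plausible reading of "language models + dataframes", sketched with a stubbed-out model call (the `classify` stub is hypothetical):

```python
import pandas as pd

def classify(text: str) -> str:
    # Stub standing in for a real LM call (e.g. a completion API).
    return "robotics" if "robot" in text.lower() else "other"

df = pd.DataFrame({"abstract": ["Industrial robot sales rose", "A study of sleep"]})
df["topic"] = df["abstract"].apply(classify)   # one LM call per row
print(df)
```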
@stuhlmueller
Andreas Stuhlmüller
3 years
1. Select text in browser, PDF reader, anywhere else
2. Press command-option-enter
3. See a hierarchical outline à la @RoamResearch @WorkFlowy
3
7
117
@stuhlmueller
Andreas Stuhlmüller
3 years
a cryptocurrency that incentivizes miners to collectively train a gpt3-like model and keep its knowledge up to date, moving training to wherever energy is cheapest
9
10
107
@stuhlmueller
Andreas Stuhlmüller
2 years
New beta feature in @elicitorg: Synthesize the top papers into a summary answer. Updates when you remove irrelevant papers
3
16
102
@stuhlmueller
Andreas Stuhlmüller
2 years
Out now:
- The Interactive Composition Explorer (ICE), a Python library for writing and debugging compositional language model programs
- The Factored Cognition Primer, a tutorial that uses examples to show how to write such programs
1
17
87
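For flavor, the hello-world question-answering recipe from the Factored Cognition Primer, as I remember it (the ICE API may have changed since; treat this as an approximation, not authoritative):

```python
from ice.recipe import recipe

def make_qa_prompt(question: str) -> str:
    return f'Answer the following question:\n\nQuestion: "{question}"\nAnswer: "'

async def answer(question: str = "What is factored cognition?") -> str:
    prompt = make_qa_prompt(question)
    return await recipe.agent().complete(prompt=prompt, stop='"')

recipe.main(answer)
```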
@stuhlmueller
Andreas Stuhlmüller
3 years
We finetuned a language model on answering science questions given abstracts, live now on Elicit. Because it starts with paper search, not free generation, it usually doesn't hallucinate. Next step is indexing full PDFs
0
13
91
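The search-first structure, sketched with hypothetical `search` and `complete` helpers; the point is that generation is conditioned on retrieved abstracts rather than free-running:

```python
def answer_from_abstracts(question: str, search, complete) -> str:
    """Retrieve abstracts first, then answer only from them."""
    abstracts = search(question, k=3)        # paper search, not free generation
    context = "\n\n".join(abstracts)
    prompt = (
        f"Abstracts:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the abstracts above:"
    )
    return complete(prompt)
```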
@stuhlmueller
Andreas Stuhlmüller
1 year
Yesterday I gave a lightning talk and it turns out everything you need to know about Ought, Elicit, and myself fits in seven tweets
4
13
88
@stuhlmueller
Andreas Stuhlmüller
7 years
I just published “50 things I learned at NIPS 2016”
2
44
77
@stuhlmueller
Andreas Stuhlmüller
3 years
We couldn't find benchmarks that test language models on representative sets of economically valuable tasks. This made it hard to evaluate advances. Transformative but not evenly distributed yet? Useless now but close? So we made RAFT with @huggingface:
0
16
76
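RAFT is distributed through the Hugging Face hub; loading one task should look roughly like this (dataset id and config name from memory, so verify against the hub page):

```python
from datasets import load_dataset

raft_task = load_dataset("ought/raft", "banking_77")
print(raft_task["train"][0])   # each task ships 50 labeled training examples
```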
@stuhlmueller
Andreas Stuhlmüller
2 years
Elicit is now backed by a custom semantic search engine using embeddings we computed for >175 million abstracts
1
6
70
@stuhlmueller
Andreas Stuhlmüller
2 years
Q&A about individual papers is live on Elicit (early beta)
1
15
70
@stuhlmueller
Andreas Stuhlmüller
3 years
New leader on the RAFT benchmark: @timo_schick with PET, beating GPT-3. I'm skeptical of much few-shot work (easy to overfit), but RAFT makes it hard to cheat and has real-world tasks, so it seems PET really is the best available method for few-shot classification
2
13
65
@stuhlmueller
Andreas Stuhlmüller
8 months
The next chapter of Elicit begins
@elicitorg
Elicit
8 months
1/ Announcing our spinoff from @oughtinc into a public benefit corporation, our $9 million seed round, and a much more powerful Elicit! This new Elicit takes the components of the popular literature review workflow and extends them to automate more research workflows.
14
70
317
3
5
62
@stuhlmueller
Andreas Stuhlmüller
1 year
1/ Process supervision is safer and more transparent than end-to-end training of language models, but it's not clear that it can remain competitive. In our new paper we share our experience applying it to Elicit, and the workflows and tools we developed:
6
13
58
@stuhlmueller
Andreas Stuhlmüller
2 years
2/ Capturing Failures of Large Language Models via Human Cognitive Biases "we use cognitive biases to (i) identify inputs that models are likely to err on, and (ii) develop tests to qualitatively characterize their errors"
1
2
45
@stuhlmueller
Andreas Stuhlmüller
2 years
I'll skip over the well-known papers:
- Let's think step by step:
- Minerva:
- Chinchilla:
- Flamingo:
1
0
45
@stuhlmueller
Andreas Stuhlmüller
3 years
What's between now and GPT-3-like models being widely used in production? A prioritized wish list for @openai based on experience working on @elicitorg:
2
6
43
@stuhlmueller
Andreas Stuhlmüller
8 months
1. We might see AGI in 2-7 years. AGI = can spin up a machine with ≥ human-level research capabilities
4
10
44
@stuhlmueller
Andreas Stuhlmüller
2 years
@LauraDeming We're working on @elicitorg, an AI research assistant. I wouldn't claim that it's better than Google yet, but (1) we've started moving from search over papers to search over scientific claims and (2) we're launching improvements every week, so expect us to get there over time
1
1
42
@stuhlmueller
Andreas Stuhlmüller
3 years
Infinite "show more like starred" for finding research, now on Elicit. Wish I'd had this for writing the lit review section of …
1
4
39
@stuhlmueller
Andreas Stuhlmüller
4 years
Interactive decomposition of forecasting questions using GPT-3. All questions auto-generated. Part of our work on tools for thought @oughtinc. cc @gdb
0
7
38
@stuhlmueller
Andreas Stuhlmüller
2 years
4/ Fine-tuning language models to find consensus among humans with diverse preferences "A reward model is then trained to [..] rank consensus statements in terms of their appeal to the overall group, defined according to [social welfare] functions"
2
2
36
@stuhlmueller
Andreas Stuhlmüller
3 years
Our job posts @oughtinc now have a live view of projects you'd work on if you joined right now in that role, synced to the database we use internally for prioritization. Want to be the opposite of big orgs like FB where you don't even know what team you'll end up on
0
3
37
@stuhlmueller
Andreas Stuhlmüller
2 years
3/ Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models "larger models can memorize a larger portion of the data before over-fitting and tend to forget less throughout the training"
1
3
35
@stuhlmueller
Andreas Stuhlmüller
1 year
Added 30 papers from the last 4 months to the @oughtinc machine learning reading list: Flan, Galactica, TabPFN, PEER, compositionality gap, process/outcome, maieutic prompting, ThinkSum, PAL, U-PaLM, debate fails, task-aware retrieval, DeepNash, HELM, etc
2
8
35
@stuhlmueller
Andreas Stuhlmüller
4 years
1/ Trying to learn what the stock market knows about the future. Tesla’s stock price is a mean over all ways the future could go. From option prices, we can back out a full distribution on the stock’s value for the next few years
2
5
32
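The standard recipe for this is the Breeden-Litzenberger relation: the risk-neutral density is (up to discounting) the second derivative of the call price with respect to strike. A toy sketch with made-up prices, zero rates, and no smoothing (real option chains need both):

```python
import numpy as np

strikes = np.array([500.0, 550.0, 600.0, 650.0, 700.0])
calls = np.array([160.0, 121.0, 90.0, 66.0, 48.0])      # hypothetical mid prices

density = np.gradient(np.gradient(calls, strikes), strikes)  # d^2 C / dK^2
density = np.clip(density, 0, None)                          # clean up noise
density /= np.trapz(density, strikes)                        # normalize to a pdf
print(dict(zip(strikes.tolist(), density.round(5).tolist())))
```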
@stuhlmueller
Andreas Stuhlmüller
3 years
Impressive paper on self-training GPT-3 using amplification & distillation, bootstrapping from a weak translation model to SotA. Key question for predicting the future of AI: Does a similar cycle-consistency approach work for self-training models to reason?
2
5
33
@stuhlmueller
Andreas Stuhlmüller
2 years
1/ Exploring Length Generalization in Large Language Models "naively finetuning transformers on length generalization tasks shows significant generalization deficiencies [..] scratchpad prompting results in a dramatic improvement"
2
2
33
@stuhlmueller
Andreas Stuhlmüller
3 years
NLP research questions we're encountering in practice at Ought:
1
6
32
@stuhlmueller
Andreas Stuhlmüller
2 years
The BIG-bench paper is live! @oughtinc contributed a task for decomposing forecasting questions into subquestions.
2
6
31
@stuhlmueller
Andreas Stuhlmüller
1 year
How do we do that? With hundreds of language model calls per query, things can get complex quickly. The fundamental idea: Instead of running and evaluating models end-to-end, we break down the model's thinking into semantically meaningful substeps that we can evaluate independently.
1
4
30
@stuhlmueller
Andreas Stuhlmüller
3 years
First draft of an ML curriculum for new and potential hires at @oughtinc. Focuses on language models, starts with the basics, balances deployment in production and longer-term scalability. Zotero:
1
5
28
@stuhlmueller
Andreas Stuhlmüller
4 years
Making a web-based IDE for few-shot training of language models on actions like "decompose", "estimate quantity", "list consequences", etc + building natural language programs out of these lego blocks of cognition
1
3
28
@stuhlmueller
Andreas Stuhlmüller
2 years
Computation trace visualizer for language model decompositions, by Jason and Luke @oughtinc
0
3
27
@stuhlmueller
Andreas Stuhlmüller
2 years
Probably the best intro to AI risk for a general audience
@80000Hours
80,000 Hours
2 years
"I don't understand why 80,000 Hours is so focused on AI risk" We get it - it's unusual. So here's our new explanation of why existential risks from AI might be the most pressing problem of our time: 🧵5 common misconceptions about AI risk👇
9
79
398
0
0
25
@stuhlmueller
Andreas Stuhlmüller
9 months
8/ Team responses to "What unspoken values do you think have most contributed to our success so far?"
1
2
27
@stuhlmueller
Andreas Stuhlmüller
2 years
Prototyping dynamic extraction of main result, sample size, caveats, and other user-specified entities from abstracts for @elicitorg
4
3
23
@stuhlmueller
Andreas Stuhlmüller
2 years
Models are now better than crowd workers at the RAFT few-shot classification benchmark (). Feels significant: we selected tasks that would usually be given to human research assistants, with a setup that closely mirrors delegation to humans
0
5
23
@stuhlmueller
Andreas Stuhlmüller
3 years
@jungofthewon @oughtinc @manda_ngo @elicitorg This may be the clearest demonstration so far of how models like GPT-3 can make the future nicer and not just more efficient
0
2
24
@stuhlmueller
Andreas Stuhlmüller
2 years
Finally wrote up @oughtinc's worldview around AI, differential capabilities, alignment, and why we care so much about developing process-based ML systems 1/n
1
4
21
@stuhlmueller
Andreas Stuhlmüller
7 months
Now that @elicitorg is an independent company let's review our mission - what is scaling up good reasoning & why do we care?
2
6
25
@stuhlmueller
Andreas Stuhlmüller
1 year
Elicit today is specific to lit review, but research has many tasks: figuring out research directions, making plans, critiquing writing, etc. So we're making a general-purpose version of Elicit where models can flexibly choose what info-gathering and reasoning actions to take.
1
5
24
@stuhlmueller
Andreas Stuhlmüller
8 months
My thought process when I go through inbound applications for software engineers 1/n
1
1
25
@stuhlmueller
Andreas Stuhlmüller
2 years
Super comprehensive review of @elicitorg by librarian @aarontay
1
6
23
@stuhlmueller
Andreas Stuhlmüller
2 years
Prediction: Finetuning using data will mostly be replaced with finetuning using only compute: You give natural language instructions that describe the model you want ("English-French translator") and the model specializes (compiles) itself so that it can quickly execute the task
1
0
22
@stuhlmueller
Andreas Stuhlmüller
1 year
Every academic paper production I've seen up close: - "We should have had better results" - "We should have done more systematic experiments" - "This was way more work than expected"
1
1
22
@stuhlmueller
Andreas Stuhlmüller
2 years
Models like Codex will soon do a lot of programming. Micro test case for alignment: Can we build structures that let non-programmers use these models to create robust non-trivial software? E.g. by asking models for edge cases, critiques, explanations, spot checks
0
2
22
@stuhlmueller
Andreas Stuhlmüller
3 years
At the @oughtinc team retreat in Tahoe, sharing our thoughts on what a world with good reasoning at scale looks like
2
0
22
@stuhlmueller
Andreas Stuhlmüller
7 months
Elicit launch party tomorrow night in SF! Our entire team will be there. DM me if you want to come
0
3
22
@stuhlmueller
Andreas Stuhlmüller
8 months
Appreciate it when job applicants keep the meta commentary and second person to highlight that they used GPT
3
1
22
@stuhlmueller
Andreas Stuhlmüller
1 year
"We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from.. reward models that emulate process-based feedback."
@GoogleDeepMind
Google DeepMind
1 year
How can we get language models to solve maths problems accurately with correct, human-interpretable reasoning? We evaluate many ways to supervise the reasoning process or final answer, leading to state-of-the-art results on the GSM8K benchmark:
8
52
249
0
1
22
@stuhlmueller
Andreas Stuhlmüller
2 years
Excited to show our tools for running compositional language model tasks at our open lab meeting next week
1
5
20
@stuhlmueller
Andreas Stuhlmüller
2 years
Many new people using Elicit lately. Helping people reason about science depends both on the tools and on the context and expectations they're used with
0
3
19
@stuhlmueller
Andreas Stuhlmüller
3 years
To find people for a front-end role I'm trying this:
1. Export Twitter followers using @vicinitas_io
2. Make a semantic search task using @elicitorg
3. Rank bios by similarity to "front-end web dev"
4
3
20
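Step 3, sketched with a hypothetical `embed(text) -> np.ndarray` sentence-embedding helper:

```python
import numpy as np

def rank_bios(bios: list[str], query: str, embed) -> list[str]:
    q = embed(query)
    def score(bio: str) -> float:
        v = embed(bio)
        return float(np.dot(v, q) / (np.linalg.norm(v) * np.linalg.norm(q)))
    return sorted(bios, key=score, reverse=True)   # cosine similarity, best first

# usage: rank_bios(follower_bios, "front-end web dev", embed)
```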
@stuhlmueller
Andreas Stuhlmüller
1 year
Ought is building Elicit, an AI research assistant. Right now you can think of Elicit as a better Google Scholar. It's using language models to imitate some of the systematic review workflow that is used in empirical domains. It has about 200k users.
1
7
20
@stuhlmueller
Andreas Stuhlmüller
2 years
Much of my interaction with language models these days is through this tiny emacs package:
2
2
20
@stuhlmueller
Andreas Stuhlmüller
2 years
11/ Generating Training Data with Language Models "With quality training data selected based on the generation probability and regularization techniques (label smoothing and temporal ensembling) applied to the fine-tuning stage"
1
2
20
@stuhlmueller
Andreas Stuhlmüller
3 years
Using language models to read through the first 30 websites found in a Google search, returning most relevant paragraphs in full. Inspired by watching analysts open 20 tabs as part of their research process just so they can read the single most relevant paragraph for each
2
1
19
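Shape of that pipeline, with hypothetical `fetch_text(url)` and `score(query, paragraph)` helpers (the scorer could be an embedding similarity or an LM relevance call):

```python
def best_paragraph(url: str, query: str, fetch_text, score) -> str:
    paragraphs = [p for p in fetch_text(url).split("\n\n") if len(p) > 100]
    return max(paragraphs, key=lambda p: score(query, p), default="")

def skim_results(urls: list[str], query: str, fetch_text, score) -> dict[str, str]:
    # One "most relevant paragraph" per search result, instead of 30 open tabs.
    return {url: best_paragraph(url, query, fetch_text, score) for url in urls}
```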
@stuhlmueller
Andreas Stuhlmüller
2 years
5/ Teacher Forcing Recovers Reward Functions for Text Generation "Through the lens of [IRL], we [..] derive the reward function from models trained with the teacher-forcing objective. [This] enables [RL] for text generation."
1
1
18
@stuhlmueller
Andreas Stuhlmüller
3 years
Models need an RL feedback API: We give a prompt, multiple responses, and rewards for each response. The model is updated to prefer high-reward outputs. Want to do this hierarchically - a global model for all our users, local versions for each org, a personal model for each user
2
0
18
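The API shape being asked for, as I read the tweet (entirely hypothetical; no provider offered this endpoint at the time):

```python
from dataclasses import dataclass

@dataclass
class RLFeedback:
    prompt: str
    responses: list[str]                 # multiple candidate completions
    rewards: list[float]                 # one reward per response
    scope: str = "global"                # "global" | "org" | "user" hierarchy

def submit_feedback(fb: RLFeedback) -> None:
    assert len(fb.responses) == len(fb.rewards)
    # A provider would queue this as preference data: update the model at the
    # given scope to prefer high-reward outputs for similar prompts.
    print(f"queued {len(fb.responses)} scored responses at scope={fb.scope!r}")
```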
@stuhlmueller
Andreas Stuhlmüller
3 years
Gerry Sussman's new book on how to build adaptive systems just came out. I'm reading it over the next 8 weeks. Let's read it together? Add your name to our book club:
1
5
19
@stuhlmueller
Andreas Stuhlmüller
3 years
. @elicitorg as omnipresent menu bar app feels qualitatively different from web app, and friction could still be a lot lower
1
4
19
@stuhlmueller
Andreas Stuhlmüller
2 years
10/ CoNT: Contrastive Neural Text Generation "CoNT addresses bottlenecks [of contrastive learning for generation] -- the construction of contrastive examples, the choice of the contrastive loss, and the strategy in decoding."
1
2
19
@stuhlmueller
Andreas Stuhlmüller
9 months
1/ Many worry that LMs will worsen epistemics. We think LMs can make it much easier to find truth and make good decisions - but that requires work. We are doing the work
2
4
16
@stuhlmueller
Andreas Stuhlmüller
6 months
Exploring an Elicit prototype that gives you a fresh database on each run and lets you create and combine tasks to operate on it
1
2
16
@stuhlmueller
Andreas Stuhlmüller
2 years
Really nice to work at a non-profit. Makes it much easier to share detailed plans with the world
1
0
16
@stuhlmueller
Andreas Stuhlmüller
5 months
btw @elicitorg CSV export shows supporting quotes, reasoning, and confidence for all extracted data
0
3
16
@stuhlmueller
Andreas Stuhlmüller
1 year
Second time today I'm seeing someone I respect advocate for a slowdown in scaling until current systems are better understood
@DavidDuvenaud
David Duvenaud
1 year
I should have announced this before, but a year ago I switched my research focus to AI existential risk reduction and governance. I think the risk of bad outcomes for humanity due to AGI is substantial, and that coordinating a slowdown in AGI development is probably a good idea.
35
117
824
0
1
15
@stuhlmueller
Andreas Stuhlmüller
1 year
This is labor intensive, so we want to know: - What techniques make automated task decomposition work better? Also, at least as important: - What kind of research tools would differentially accelerate alignment? - What kinds of dev tools scale to advanced models?
1
1
16
@stuhlmueller
Andreas Stuhlmüller
2 years
Critical citations are now live on Elicit. My favorite feature so far: when there's criticism, I read it before I even read the abstract
1
4
16
@stuhlmueller
Andreas Stuhlmüller
2 years
We're hiring a lead designer for Elicit. No job post yet. DM me to work full-time on UX for generative tools for thought with lego-like compositionality
@wwwjim
Jimmy Lee
2 years
So @jungofthewon gave me the breakdown of today. If you're a designer thinking about the future... The opportunity here to design the composition tools for building a personal, more human AI assistant is huge. What an insanely fun opportunity.
0
4
16
1
4
16
@stuhlmueller
Andreas Stuhlmüller
2 years
8/ NaturalProver: Grounded Mathematical Proof Generation with Language Models "a [LM] that generates proofs by conditioning on background [..] (e.g. theorems [..]), and optionally enforces their presence with constrained decoding"
1
7
14
@stuhlmueller
Andreas Stuhlmüller
8 years
How can we build systems that help people think through vague questions like "What should I do with my life?"
0
3
12
@stuhlmueller
Andreas Stuhlmüller
1 year
The approach we've been following using ICE:
1. Start with a basic decomposition, e.g. retrieval + generation.
2 & 3. Look at gold standards - are we failing to retrieve, or failing to generate?
4 & 5. Zoom in on the failing step and decompose it further, or otherwise improve it.
1
1
15
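Steps 2 & 3 in code, under assumed `retrieve` and `generate` helpers and gold-standard examples; the point is scoring each substep separately so you know which one to decompose further:

```python
def debug_decomposition(examples, retrieve, generate) -> None:
    retrieval_hits, answer_hits = 0, 0
    for ex in examples:  # ex: {"question", "gold_passage", "gold_answer"}
        passages = retrieve(ex["question"])
        if ex["gold_passage"] in passages:
            retrieval_hits += 1
            if generate(ex["question"], passages) == ex["gold_answer"]:
                answer_hits += 1
    print(f"retrieval recall: {retrieval_hits}/{len(examples)}")
    print(f"generation accuracy given good retrieval: "
          f"{answer_hits}/{max(retrieval_hits, 1)}")
```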
@stuhlmueller
Andreas Stuhlmüller
6 months
What are the top scientific orgs working on longevity? SENS, Calico, Buck Institute, NIA, Human Longevity, National Academy of Medicine, others? Would love 1-2 collaborations to make Elicit useful for this field
4
3
15
@stuhlmueller
Andreas Stuhlmüller
3 years
Costs per query need to come down by 10x to make models competitive with human labor. Costs are 1 to 10 cents per query right now, which is about the same as human labor for classification, and humans are more accurate
2
0
15
@stuhlmueller
Andreas Stuhlmüller
1 year
Career ladder in one line: rapidly growing sponge → safe pair of hands → internal expert → hired expert → has seen the movie before @jungofthewon walks through the levels & relates them to years of experience, comp, outcomes, scope, etc in this talk
1
2
15
@stuhlmueller
Andreas Stuhlmüller
2 years
Prediction: For the next generation, unaugmented human writing will be rare. Typing out text character by character will be like cursive, or a cappella.
1
1
15
@stuhlmueller
Andreas Stuhlmüller
1 year
With GPT-4 we're re-orienting Elicit around concepts, not papers
@jungofthewon
Jungwon
1 year
We’re “pivoting” Elicit with GPT-4 😉 Elicit in 2022 took unstructured text in papers and structured it into a table. Elicit in 2023 will take this structured text and enable you to “pivot” it, grouping it by concepts. Sign up here:
22
85
511
1
0
15
@stuhlmueller
Andreas Stuhlmüller
3 years
Using our tools to generate names for our tools @oughtinc
2
1
14
@stuhlmueller
Andreas Stuhlmüller
3 years
@paulg Same argument explains why humans and machines need mental models, reasoning, and inference to make good outcomes happen. Can't afford to fail in the real world, need to fail in simulation, and even there trial & error isn't enough due to combinatorial explosion. Also the intro of my thesis
2
1
13
@stuhlmueller
Andreas Stuhlmüller
4 years
Converting free-form text into structured data using language models. Here: Extracting data sources used for resolution of forecasting questions from @metaculus pages
1
1
14
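The general pattern, sketched with an illustrative prompt and a hypothetical `complete` helper (field names are mine, not the actual Ought prompt):

```python
import json

def extract_resolution_source(page_text: str, complete) -> dict:
    prompt = (
        "Extract the data source used to resolve this forecasting question.\n\n"
        f"Page text:\n{page_text}\n\n"
        'Reply as JSON: {"source_name": "...", "source_url": "..."}\n'
    )
    return json.loads(complete(prompt))
```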
@stuhlmueller
Andreas Stuhlmüller
2 years
We've improved accuracy of the auto-generated paper-based summary answer in @elicitorg and it's live again!
@stuhlmueller
Andreas Stuhlmüller
2 years
New beta feature in @elicitorg: Synthesize the top papers into a summary answer. Updates when you remove irrelevant papers
3
16
102
0
3
14
@stuhlmueller
Andreas Stuhlmüller
7 months
People often ask: how does Elicit relate to AI safety? Here's my answer. In brief, the two main impacts of Elicit on AI safety are improving epistemics and pioneering process supervision.
1
4
14
@stuhlmueller
Andreas Stuhlmüller
2 years
7/ Thor: Wielding Hammers to Integrate Language Models and Automated Theorem Provers "[automated theorem provers] are used for premise selection, while all other tasks are designated to language models"
1
3
13
@stuhlmueller
Andreas Stuhlmüller
2 years
It's surprisingly difficult to get GPT-3 to reliably state what a scientific abstract says about a question without making things up. Surprising because in principle it has all the info it needs, and we're not asking it to do complex reasoning, or so we thought
@jungofthewon
Jungwon
2 years
1/ Lots of interest lately in making language models "truthful". How can we prevent GPT-3 from "lying"? We've worked on this in the context of @elicitorg. In Elicit, GPT-3 tries to answer your research question given abstracts from papers. (Can try at )
2
6
27
0
1
13
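One common mitigation in this setting is to give the model an explicit abstain option so that declining beats inventing; a minimal sketch (illustrative prompt, not Elicit's actual one):

```python
def answer_from_abstract(question: str, abstract: str, complete) -> str:
    prompt = (
        f'Abstract: "{abstract}"\n'
        f'Question: "{question}"\n'
        'If the abstract does not answer the question, reply exactly '
        '"Not answered in abstract."\n'
        "Otherwise answer using only information from the abstract.\n"
        "Answer:"
    )
    return complete(prompt).strip()
```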
@stuhlmueller
Andreas Stuhlmüller
1 year
This sort of process supervision needs tools: We've made an open source tool called ICE that can visualize execution traces and show you the prompts and function input/outputs at each point
1
3
13
@stuhlmueller
Andreas Stuhlmüller
2 years
12/12 LIFT: Language-Interfaced FineTuning for Non-language Machine Learning Tasks "[does ok] across a wide range of low-dimensional classification and regression tasks, matching the performances of the best models in many cases"
0
1
13
@stuhlmueller
Andreas Stuhlmüller
3 years
As the world gets more complex due to deployment of AI and language models everywhere, it's important that the same tech helps policy makers understand what's going on and how to make good decisions in that world
@RyanFedasiuk
Ryan Fedasiuk
3 years
One of the coolest parts of our new report? We used #AI to understand how the Chinese military is using AI. The @elicitorg AI research assistant developed by @oughtinc helped us ID false negatives & check data labels. Pretty soon I'll be out of a job...
1
5
33
0
1
13
@stuhlmueller
Andreas Stuhlmüller
1 year
As a researcher it's so tempting to: - Discount what you have to say because it's obvious to you - Compare your messy process to others' highlight reel - Forget how rare clean hypotheses and findings are - Underestimate how incremental science is
1
0
13
@stuhlmueller
Andreas Stuhlmüller
2 years
My favorite consequence is that longer queries often produce better results, not worse
1
2
13
@stuhlmueller
Andreas Stuhlmüller
3 years
"What concepts should I understand to answer this question?"
0
1
12