Andreas Stuhlmüller

@stuhlmueller

2,438 Followers · 175 Following · 124 Media · 599 Statuses

scale up good reasoning @elicitorg · work with me:

Oakland, CA
Joined July 2008
Pinned Tweet
@stuhlmueller
Andreas Stuhlmüller
9 months
Thread of reasons to work with us on Elicit (1/♾)
3
14
87
@stuhlmueller
Andreas Stuhlmüller
2 years
Language model papers at NeurIPS 2022 that sound interesting to me and that I hadn't seen before (thread)
2
115
645
@stuhlmueller
Andreas Stuhlmüller
4 years
Language model experiment @oughtinc: Take a vague forecasting question ("What will the future of robotics look like?"), generate measurable subquestions ("How many industrial robots are sold per year?"), then generate a data source for each measurement ("Intl Federation of Robotics")
8
23
166
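A minimal sketch of the decomposition pipeline this tweet describes, assuming a generic `complete(prompt) -> str` LM helper (hypothetical, not Ought's actual code):

```python
# Sketch only: `complete` stands in for any LM completion call.
from typing import Callable

def decompose_forecast(question: str, complete: Callable[[str], str]) -> dict[str, str]:
    """Vague question -> measurable subquestions -> a data source for each."""
    subq_prompt = (
        f'Vague forecasting question: "{question}"\n'
        "List three measurable subquestions, one per line:"
    )
    subquestions = [line.strip("- ").strip()
                    for line in complete(subq_prompt).splitlines() if line.strip()]
    return {
        sq: complete(f'Measurable question: "{sq}"\nBest public data source:').strip()
        for sq in subquestions
    }
```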
@stuhlmueller
Andreas Stuhlmüller
2 years
I've updated the @oughtinc machine learning reading list with 30+ papers from the last 6 months: PaLM, Chinchilla, Instruct, Grokking, T-Few, Chain of Thought, Self-Consistency, Minerva, Selection-Inference, Cascades, Plex, Forecasting, etc
1
37
165
@stuhlmueller
Andreas Stuhlmüller
6 months
Underappreciated thought/feeling by @KatjaGrace - definitely feels like people haven't grappled with the consequences of AI as much as seems right, even the people who worry about the consequences a lot
17
23
158
@stuhlmueller
Andreas Stuhlmüller
3 years
language models + dataframes = ❤️
6
16
126
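One plausible reading of "language models + dataframes", sketched with a stubbed-out model call (the `classify` stub is hypothetical):

```python
import pandas as pd

def classify(text: str) -> str:
    # Stub standing in for a real LM call (e.g. a completion API).
    return "robotics" if "robot" in text.lower() else "other"

df = pd.DataFrame({"abstract": ["Industrial robot sales rose", "A study of sleep"]})
df["topic"] = df["abstract"].apply(classify)   # one LM call per row
print(df)
```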
@stuhlmueller
Andreas Stuhlmüller
3 years
1. Select text in browser, PDF reader, anywhere else
2. Press command-option-enter
3. See a hierarchical outline à la @RoamResearch @WorkFlowy
3
7
117
@stuhlmueller
Andreas Stuhlmüller
3 years
a cryptocurrency that incentivizes miners to collectively train a gpt3-like model and keep its knowledge up to date, moving training to wherever energy is cheapest
9
10
107
@stuhlmueller
Andreas Stuhlmüller
2 years
New beta feature in @elicitorg: Synthesize the top papers into a summary answer. Updates when you remove irrelevant papers
3
16
102
@stuhlmueller
Andreas Stuhlmüller
2 years
Out now:
- The Interactive Composition Explorer (ICE), a Python library for writing and debugging compositional language model programs
- The Factored Cognition Primer, a tutorial that uses examples to show how to write such programs
1
17
87
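For flavor, the hello-world question-answering recipe from the Factored Cognition Primer, as I remember it (the ICE API may have changed since; treat this as an approximation, not authoritative):

```python
from ice.recipe import recipe

def make_qa_prompt(question: str) -> str:
    return f'Answer the following question:\n\nQuestion: "{question}"\nAnswer: "'

async def answer(question: str = "What is factored cognition?") -> str:
    prompt = make_qa_prompt(question)
    return await recipe.agent().complete(prompt=prompt, stop='"')

recipe.main(answer)
```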
@stuhlmueller
Andreas Stuhlmüller
3 years
We finetuned a language model on answering science questions given abstracts, live now on Elicit. Because it starts with paper search, not free generation, it usually doesn't hallucinate. Next step is indexing full PDFs
0
13
91
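The search-first structure, sketched with hypothetical `search` and `complete` helpers; the point is that generation is conditioned on retrieved abstracts rather than free-running:

```python
def answer_from_abstracts(question: str, search, complete) -> str:
    """Retrieve abstracts first, then answer only from them."""
    abstracts = search(question, k=3)        # paper search, not free generation
    context = "\n\n".join(abstracts)
    prompt = (
        f"Abstracts:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the abstracts above:"
    )
    return complete(prompt)
```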
@stuhlmueller
Andreas Stuhlmüller
1 year
Yesterday I gave a lightning talk and it turns out everything you need to know about Ought, Elicit, and myself fits in seven tweets
4
13
88
@stuhlmueller
Andreas Stuhlmüller
7 years
I just published “50 things I learned at NIPS 2016”
2
44
77
@stuhlmueller
Andreas Stuhlmüller
3 years
We couldn't find benchmarks that test language models on representative sets of economically valuable tasks. This made it hard to evaluate advances. Transformative but not evenly distributed yet? Useless now but close? So we made RAFT with @huggingface:
0
16
76
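RAFT is distributed through the Hugging Face hub; loading one task should look roughly like this (dataset id and config name from memory, so verify against the hub page):

```python
from datasets import load_dataset

raft_task = load_dataset("ought/raft", "banking_77")
print(raft_task["train"][0])   # each task ships 50 labeled training examples
```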
@stuhlmueller
Andreas Stuhlmüller
2 years
Elicit is now backed by a custom semantic search engine using embeddings we computed for >175 million abstracts
1
6
70
@stuhlmueller
Andreas Stuhlmüller
2 years
Q&A about individual papers is live on Elicit (early beta)
1
15
70
@stuhlmueller
Andreas Stuhlmüller
3 years
New leader on the RAFT benchmark: @timo_schick with PET, beating GPT-3. I'm skeptical of much few-shot work (easy to overfit), but RAFT makes it hard to cheat and has real-world tasks, so it seems PET really is the best available method for few-shot classification
2
13
65
@stuhlmueller
Andreas Stuhlmüller
8 months
The next chapter of Elicit begins
@elicitorg
Elicit
8 months
1/ Announcing our spinoff from @oughtinc into a public benefit corporation, our $9 million seed round, and a much more powerful Elicit! This new Elicit takes the components of the popular literature review workflow and extends them to automate more research workflows.
14
70
317
3
5
62
@stuhlmueller
Andreas Stuhlmüller
1 year
1/ Process supervision is safer and more transparent than end-to-end training of language models, but it's not clear that it can remain competitive. In our new paper we share our experience applying it to Elicit, and the workflows and tools we developed:
6
13
58
@stuhlmueller
Andreas Stuhlmüller
2 years
2/ Capturing Failures of Large Language Models via Human Cognitive Biases "we use cognitive biases to (i) identify inputs that models are likely to err on, and (ii) develop tests to qualitatively characterize their errors"
1
2
45
@stuhlmueller
Andreas Stuhlmüller
2 years
I'll skip over the well-known papers:
- Let's think step by step:
- Minerva:
- Chinchilla:
- Flamingo:
1
0
45
@stuhlmueller
Andreas Stuhlmüller
3 years
What's between now and GPT-3-like models being widely used in production? A prioritized wish list for @openai based on experience working on @elicitorg:
2
6
43
@stuhlmueller
Andreas Stuhlmüller
8 months
1. We might see AGI in 2-7 years. AGI = can spin up a machine with ≥ human-level research capabilities
4
10
44
@stuhlmueller
Andreas Stuhlmüller
2 years
@LauraDeming We're working on @elicitorg, an AI research assistant. I wouldn't claim that it's better than Google yet, but (1) we've started moving from search over papers to search over scientific claims and (2) we're launching improvements every week, so expect us to get there over time
1
1
42
@stuhlmueller
Andreas Stuhlmüller
3 years
Infinite "show more like starred" for finding research, now on Elicit. Wish I'd had this for writing the lit review section of …
1
4
39
@stuhlmueller
Andreas Stuhlmüller
4 years
Interactive decomposition of forecasting questions using GPT-3. All questions auto-generated. Part of our work on tools for thought @oughtinc. cc @gdb
0
7
38
@stuhlmueller
Andreas Stuhlmüller
2 years
4/ Fine-tuning language models to find consensus among humans with diverse preferences "A reward model is then trained to [..] rank consensus statements in terms of their appeal to the overall group, defined according to [social welfare] functions"
2
2
36
@stuhlmueller
Andreas Stuhlmüller
3 years
Our job posts @oughtinc now have a live view of projects you'd work on if you joined right now in that role, synced to the database we use internally for prioritization. Want to be the opposite of big orgs like FB where you don't even know what team you'll end up on
0
3
37
@stuhlmueller
Andreas Stuhlmüller
2 years
3/ Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models "larger models can memorize a larger portion of the data before over-fitting and tend to forget less throughout the training"
1
3
35
@stuhlmueller
Andreas Stuhlmüller
1 year
Added 30 papers from the last 4 months to the @oughtinc machine learning reading list: Flan, Galactica, TabPFN, PEER, compositionality gap, process/outcome, maieutic prompting, ThinkSum, PAL, U-PaLM, debate fails, task-aware retrieval, DeepNash, HELM, etc
2
8
35
@stuhlmueller
Andreas Stuhlmüller
4 years
1/ Trying to learn what the stock market knows about the future. Tesla’s stock price is a mean over all ways the future could go. From option prices, we can back out a full distribution on the stock’s value for the next few years
2
5
32
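The standard recipe for this is the Breeden-Litzenberger relation: the risk-neutral density is (up to discounting) the second derivative of the call price with respect to strike. A toy sketch with made-up prices, zero rates, and no smoothing (real option chains need both):

```python
import numpy as np

strikes = np.array([500.0, 550.0, 600.0, 650.0, 700.0])
calls = np.array([160.0, 121.0, 90.0, 66.0, 48.0])      # hypothetical mid prices

density = np.gradient(np.gradient(calls, strikes), strikes)  # d^2 C / dK^2
density = np.clip(density, 0, None)                          # clean up noise
density /= np.trapz(density, strikes)                        # normalize to a pdf
print(dict(zip(strikes.tolist(), density.round(5).tolist())))
```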
@stuhlmueller
Andreas Stuhlmüller
3 years
Impressive paper on self-training GPT-3 using amplification & distillation, bootstrapping from a weak translation model to SotA. Key question for predicting the future of AI: Does a similar cycle-consistency approach work for self-training models to reason?
2
5
33
@stuhlmueller
Andreas Stuhlmüller
2 years
1/ Exploring Length Generalization in Large Language Models "naively finetuning transformers on length generalization tasks shows significant generalization deficiencies [..] scratchpad prompting results in a dramatic improvement"
2
2
33
@stuhlmueller
Andreas Stuhlmüller
3 years
NLP research questions we're encountering in practice at Ought:
1
6
32
@stuhlmueller
Andreas Stuhlmüller
2 years
The BIG-bench paper is live! @oughtinc contributed a task for decomposing forecasting questions into subquestions.
2
6
31
@stuhlmueller
Andreas Stuhlmüller
1 year
How do we do that? With hundreds of language model calls per query, things can get complex quickly. The fundamental idea: Instead of running and evaluating models end-to-end, we break down the model's thinking into semantically meaningful substeps that we can evaluate independently.
1
4
30
@stuhlmueller
Andreas Stuhlmüller
3 years
First draft of an ML curriculum for new and potential hires at @oughtinc. Focuses on language models, starts with the basics, balances deployment in production and longer-term scalability. Zotero:
1
5
28
@stuhlmueller
Andreas Stuhlmüller
4 years
Making a web-based IDE for few-shot training of language models on actions like "decompose", "estimate quantity", "list consequences", etc + building natural language programs out of these lego blocks of cognition
1
3
28
@stuhlmueller
Andreas Stuhlmüller
2 years
Computation trace visualizer for language model decompositions, by Jason and Luke @oughtinc
0
3
27
@stuhlmueller
Andreas Stuhlmüller
2 years
Probably the best intro to AI risk for a general audience
@80000Hours
80,000 Hours
2 years
"I don't understand why 80,000 Hours is so focused on AI risk" We get it - it's unusual. So here's our new explanation of why existential risks from AI might be the most pressing problem of our time: 🧵5 common misconceptions about AI risk👇
9
79
398
0
0
25
@stuhlmueller
Andreas Stuhlmüller
9 months
8/ Team responses to "What unspoken values do you think have most contributed to our success so far?"
1
2
27
@stuhlmueller
Andreas Stuhlmüller
2 years
Prototyping dynamic extraction of main result, sample size, caveats, and other user-specified entities from abstracts for @elicitorg
4
3
23
@stuhlmueller
Andreas Stuhlmüller
2 years
Models are now better than crowd workers at the RAFT few-shot classification benchmark (). Feels significant: we selected tasks that would usually be given to human research assistants, with a setup that closely mirrors delegation to humans
0
5
23
@stuhlmueller
Andreas Stuhlmüller
3 years
@jungofthewon @oughtinc @manda_ngo @elicitorg This may be the clearest demonstration so far of how models like GPT-3 can make the future nicer and not just more efficient
0
2
24
@stuhlmueller
Andreas Stuhlmüller
2 years
Finally wrote up @oughtinc's worldview around AI, differential capabilities, alignment, and why we care so much about developing process-based ML systems 1/n
1
4
21
@stuhlmueller
Andreas Stuhlmüller
7 months
Now that @elicitorg is an independent company let's review our mission - what is scaling up good reasoning & why do we care?
2
6
25
@stuhlmueller
Andreas Stuhlmüller
1 year
Elicit today is specific to lit review, but research has many tasks: figuring out research directions, making plans, critiquing writing, etc. So we're making a general-purpose version of Elicit where models can flexibly choose what info-gathering and reasoning actions to take.
1
5
24
@stuhlmueller
Andreas Stuhlmüller
8 months
My thought process when I go through inbound applications for software engineers 1/n
1
1
25
@stuhlmueller
Andreas Stuhlmüller
2 years
Super comprehensive review of @elicitorg by librarian @aarontay
1
6
23
@stuhlmueller
Andreas Stuhlmüller
2 years
Prediction: Finetuning using data will mostly be replaced with finetuning using only compute: You give natural language instructions that describe the model you want ("English-French translator") and the model specializes (compiles) itself so that it can quickly execute the task
1
0
22
@stuhlmueller
Andreas Stuhlmüller
1 year
Every academic paper production I've seen up close: - "We should have had better results" - "We should have done more systematic experiments" - "This was way more work than expected"
1
1
22
@stuhlmueller
Andreas Stuhlmüller
2 years
Models like Codex will soon do a lot of programming. Micro test case for alignment: Can we build structures that let non-programmers use these models to create robust non-trivial software? E.g. by asking models for edge cases, critiques, explanations, spot checks
0
2
22
@stuhlmueller
Andreas Stuhlmüller
3 years
At the @oughtinc team retreat in Tahoe, sharing our thoughts on what a world with good reasoning at scale looks like
2
0
22
@stuhlmueller
Andreas Stuhlmüller
7 months
Elicit launch party tomorrow night in SF! Our entire team will be there. DM me if you want to come
0
3
22
@stuhlmueller
Andreas Stuhlmüller
8 months
Appreciate it when job applicants keep the meta commentary and second person to highlight that they used GPT
3
1
22
@stuhlmueller
Andreas Stuhlmüller
1 year
"We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from.. reward models that emulate process-based feedback."
@GoogleDeepMind
Google DeepMind
1 year
How can we get language models to solve maths problems accurately with correct, human-interpretable reasoning? We evaluate many ways to supervise the reasoning process or final answer, leading to state-of-the-art results on the GSM8K benchmark:
8
52
249
0
1
22
@stuhlmueller
Andreas Stuhlmüller
2 years
Excited to show our tools for running compositional language model tasks at our open lab meeting next week
1
5
20
@stuhlmueller
Andreas Stuhlmüller
2 years
Many new people using Elicit lately. Helping people reason about science depends both on the tools and on the context and expectations they're used with
0
3
19
@stuhlmueller
Andreas Stuhlmüller
3 years
To find people for a front-end role I'm trying this:
1. Export Twitter followers using @vicinitas_io
2. Make a semantic search task using @elicitorg
3. Rank bios by similarity to "front-end web dev"
4
3
20
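Step 3, sketched with a hypothetical `embed(text) -> np.ndarray` sentence-embedding helper:

```python
import numpy as np

def rank_bios(bios: list[str], query: str, embed) -> list[str]:
    q = embed(query)
    def score(bio: str) -> float:
        v = embed(bio)
        return float(np.dot(v, q) / (np.linalg.norm(v) * np.linalg.norm(q)))
    return sorted(bios, key=score, reverse=True)   # cosine similarity, best first

# usage: rank_bios(follower_bios, "front-end web dev", embed)
```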
@stuhlmueller
Andreas Stuhlmüller
1 year
Ought is building Elicit, an AI research assistant. Right now you can think of Elicit as a better Google Scholar. It's using language models to imitate some of the systematic review workflow that is used in empirical domains. It has about 200k users.
1
7
20
@stuhlmueller
Andreas Stuhlmüller
2 years
Much of my interaction with language models these days is through this tiny emacs package:
2
2
20
@stuhlmueller
Andreas Stuhlmüller
2 years
11/ Generating Training Data with Language Models "With quality training data selected based on the generation probability and regularization techniques (label smoothing and temporal ensembling) applied to the fine-tuning stage"
1
2
20
@stuhlmueller
Andreas Stuhlmüller
3 years
Using language models to read through the first 30 websites found in a Google search, returning most relevant paragraphs in full. Inspired by watching analysts open 20 tabs as part of their research process just so they can read the single most relevant paragraph for each
2
1
19
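Shape of that pipeline, with hypothetical `fetch_text(url)` and `score(query, paragraph)` helpers (the scorer could be an embedding similarity or an LM relevance call):

```python
def best_paragraph(url: str, query: str, fetch_text, score) -> str:
    paragraphs = [p for p in fetch_text(url).split("\n\n") if len(p) > 100]
    return max(paragraphs, key=lambda p: score(query, p), default="")

def skim_results(urls: list[str], query: str, fetch_text, score) -> dict[str, str]:
    # One "most relevant paragraph" per search result, instead of 30 open tabs.
    return {url: best_paragraph(url, query, fetch_text, score) for url in urls}
```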
@stuhlmueller
Andreas Stuhlmüller
2 years
5/ Teacher Forcing Recovers Reward Functions for Text Generation "Through the lens of [IRL], we [..] derive the reward function from models trained with the teacher-forcing objective. [This] enables [RL] for text generation."
1
1
18
@stuhlmueller
Andreas Stuhlmüller
3 years
Models need an RL feedback API: We give a prompt, multiple responses, and rewards for each response. The model is updated to prefer high-reward outputs. Want to do this hierarchically - a global model for all our users, local versions for each org, a personal model for each user
2
0
18
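The API shape being asked for, as I read the tweet (entirely hypothetical; no provider offered this endpoint at the time):

```python
from dataclasses import dataclass

@dataclass
class RLFeedback:
    prompt: str
    responses: list[str]                 # multiple candidate completions
    rewards: list[float]                 # one reward per response
    scope: str = "global"                # "global" | "org" | "user" hierarchy

def submit_feedback(fb: RLFeedback) -> None:
    assert len(fb.responses) == len(fb.rewards)
    # A provider would queue this as preference data: update the model at the
    # given scope to prefer high-reward outputs for similar prompts.
    print(f"queued {len(fb.responses)} scored responses at scope={fb.scope!r}")
```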
@stuhlmueller
Andreas Stuhlmüller
3 years
Gerry Sussman's new book on how to build adaptive systems just came out. I'm reading it over the next 8 weeks. Let's read it together? Add your name to our book club:
1
5
19
@stuhlmueller
Andreas Stuhlmüller
3 years
. @elicitorg as omnipresent menu bar app feels qualitatively different from web app, and friction could still be a lot lower
1
4
19
@stuhlmueller
Andreas Stuhlmüller
2 years
10/ CoNT: Contrastive Neural Text Generation "CoNT addresses bottlenecks [of contrastive learning for generation] -- the construction of contrastive examples, the choice of the contrastive loss, and the strategy in decoding."
1
2
19
@stuhlmueller
Andreas Stuhlmüller
9 months
1/ Many worry that LMs will worsen epistemics. We think LMs can make it much easier to find truth and make good decisions - but that requires work. We are doing the work
2
4
16
@stuhlmueller
Andreas Stuhlmüller
6 months
Exploring an Elicit prototype that gives you a fresh database on each run and lets you create and combine tasks to operate on it
1
2
16
@stuhlmueller
Andreas Stuhlmüller
2 years
Really nice to work at a non-profit. Makes it much easier to share detailed plans with the world
1
0
16
@stuhlmueller
Andreas Stuhlmüller
5 months
btw @elicitorg CSV export shows supporting quotes, reasoning, and confidence for all extracted data
0
3
16
@stuhlmueller
Andreas Stuhlmüller
1 year
Second time today I'm seeing someone I respect advocate for a slowdown in scaling until current systems are better understood
@DavidDuvenaud
David Duvenaud
1 year
I should have announced this before, but a year ago I switched my research focus to AI existential risk reduction and governance. I think the risk of bad outcomes for humanity due to AGI is substantial, and that coordinating a slowdown in AGI development is probably a good idea.
35
117
824
0
1
15
@stuhlmueller
Andreas Stuhlmüller
1 year
This is labor intensive, so we want to know: - What techniques make automated task decomposition work better? Also, at least as important: - What kind of research tools would differentially accelerate alignment? - What kinds of dev tools scale to advanced models?
1
1
16
@stuhlmueller
Andreas Stuhlmüller
2 years
Critical citations are now live on Elicit. My favorite feature so far: when there's criticism, I read it before I even read the abstract
1
4
16
@stuhlmueller
Andreas Stuhlmüller
2 years
We're hiring a lead designer for Elicit. No job post yet. DM me to work full-time on UX for generative tools for thought with lego-like compositionality
@wwwjim
Jimmy Lee
2 years
So @jungofthewon gave me the breakdown of today. If you're a designer thinking about the future... The opportunity here to design the composition tools for building a personal, more human AI assistant is huge. What an insanely fun opportunity.
0
4
16
1
4
16
@stuhlmueller
Andreas Stuhlmüller
2 years
8/ NaturalProver: Grounded Mathematical Proof Generation with Language Models "a [LM] that generates proofs by conditioning on background [..] (e.g. theorems [..]), and optionally enforces their presence with constrained decoding"
1
7
14
@stuhlmueller
Andreas Stuhlmüller
8 years
How can we build systems that help people think through vague questions like "What should I do with my life?"
0
3
12
@stuhlmueller
Andreas Stuhlmüller
1 year
The approach we've been following using ICE:
1. Start with a basic decomposition, e.g. retrieval + generation.
2 & 3. Look at gold standards - are we failing to retrieve, or failing to generate?
4 & 5. Zoom in on the failing step and decompose it further, or otherwise improve it.
1
1
15
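Steps 2 & 3 in code, under assumed `retrieve` and `generate` helpers and gold-standard examples; the point is scoring each substep separately so you know which one to decompose further:

```python
def debug_decomposition(examples, retrieve, generate) -> None:
    retrieval_hits, answer_hits = 0, 0
    for ex in examples:  # ex: {"question", "gold_passage", "gold_answer"}
        passages = retrieve(ex["question"])
        if ex["gold_passage"] in passages:
            retrieval_hits += 1
            if generate(ex["question"], passages) == ex["gold_answer"]:
                answer_hits += 1
    print(f"retrieval recall: {retrieval_hits}/{len(examples)}")
    print(f"generation accuracy given good retrieval: "
          f"{answer_hits}/{max(retrieval_hits, 1)}")
```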
@stuhlmueller
Andreas Stuhlmüller
6 months
What are the top scientific orgs working on longevity? SENS, Calico, Buck Institute, NIA, Human Longevity, National Academy of Medicine, others? Would love 1-2 collaborations to make Elicit useful for this field
4
3
15
@stuhlmueller
Andreas Stuhlmüller
3 years
Costs per query need to come down by 10x to make models competitive with human labor. Costs are 1 to 10 cents per query right now, which is about the same as human labor for classification, and humans are more accurate
2
0
15
@stuhlmueller
Andreas Stuhlmüller
1 year
Career ladder in one line: rapidly growing sponge → safe pair of hands → internal expert → hired expert → has seen the movie before @jungofthewon walks through the levels & relates them to years of experience, comp, outcomes, scope, etc in this talk
1
2
15
@stuhlmueller
Andreas Stuhlmüller
2 years
Prediction: For the next generation, unaugmented human writing will be rare. Typing out text character by character will be like cursive, or a cappella.
1
1
15
@stuhlmueller
Andreas Stuhlmüller
1 year
With GPT-4 we're re-orienting Elicit around concepts, not papers
@jungofthewon
Jungwon
1 year
We’re “pivoting” Elicit with GPT-4 😉 Elicit in 2022 took unstructured text in papers and structured it into a table. Elicit in 2023 will take this structured text and enable you to “pivot” it, grouping it by concepts. Sign up here:
22
85
511
1
0
15
@stuhlmueller
Andreas Stuhlmüller
3 years
Using our tools to generate names for our tools @oughtinc
2
1
14
@stuhlmueller
Andreas Stuhlmüller
3 years
@paulg Same argument explains why humans and machines need mental models, reasoning, and inference to make good outcomes happen. Can't afford to fail in the real world, need to fail in simulation, and even there trial & error isn't enough due to combinatorial explosion. Also the intro of my thesis
2
1
13
@stuhlmueller
Andreas Stuhlmüller
4 years
Converting free-form text into structured data using language models. Here: Extracting data sources used for resolution of forecasting questions from @metaculus pages
1
1
14
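The general pattern, sketched with an illustrative prompt and a hypothetical `complete` helper (field names are mine, not the actual Ought prompt):

```python
import json

def extract_resolution_source(page_text: str, complete) -> dict:
    prompt = (
        "Extract the data source used to resolve this forecasting question.\n\n"
        f"Page text:\n{page_text}\n\n"
        'Reply as JSON: {"source_name": "...", "source_url": "..."}\n'
    )
    return json.loads(complete(prompt))
```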
@stuhlmueller
Andreas Stuhlmüller
2 years
We've improved accuracy of the auto-generated paper-based summary answer in @elicitorg and it's live again!
@stuhlmueller
Andreas Stuhlmüller
2 years
New beta feature in @elicitorg: Synthesize the top papers into a summary answer. Updates when you remove irrelevant papers
3
16
102
0
3
14
@stuhlmueller
Andreas Stuhlmüller
7 months
People often ask: how does Elicit relate to AI safety? Here's my answer. In brief, the two main impacts of Elicit on AI safety are improving epistemics and pioneering process supervision.
1
4
14
@stuhlmueller
Andreas Stuhlmüller
2 years
7/ Thor: Wielding Hammers to Integrate Language Models and Automated Theorem Provers "[automated theorem provers] are used for premise selection, while all other tasks are designated to language models"
1
3
13
@stuhlmueller
Andreas Stuhlmüller
2 years
It's surprisingly difficult to get GPT-3 to reliably state what a scientific abstract says about a question without making things up. Surprising because in principle it has all the info it needs, and we're not asking it to do complex reasoning, or so we thought
@jungofthewon
Jungwon
2 years
1/ Lots of interest lately in making language models "truthful". How can we prevent GPT-3 from "lying"? We've worked on this in the context of @elicitorg. In Elicit, GPT-3 tries to answer your research question given abstracts from papers. (Can try at )
2
6
27
0
1
13
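One common mitigation in this setting is to give the model an explicit abstain option so that declining beats inventing; a minimal sketch (illustrative prompt, not Elicit's actual one):

```python
def answer_from_abstract(question: str, abstract: str, complete) -> str:
    prompt = (
        f'Abstract: "{abstract}"\n'
        f'Question: "{question}"\n'
        'If the abstract does not answer the question, reply exactly '
        '"Not answered in abstract."\n'
        "Otherwise answer using only information from the abstract.\n"
        "Answer:"
    )
    return complete(prompt).strip()
```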
@stuhlmueller
Andreas Stuhlmüller
1 year
This sort of process supervision needs tools: We've made an open source tool called ICE that can visualize execution traces and show you the prompts and function input/outputs at each point
1
3
13
@stuhlmueller
Andreas Stuhlmüller
2 years
12/12 LIFT: Language-Interfaced FineTuning for Non-language Machine Learning Tasks "[does ok] across a wide range of low-dimensional classification and regression tasks, matching the performances of the best models in many cases"
0
1
13
@stuhlmueller
Andreas Stuhlmüller
3 years
As the world gets more complex due to deployment of AI and language models everywhere, it's important that the same tech helps policy makers understand what's going on and how to make good decisions in that world
@RyanFedasiuk
Ryan Fedasiuk
3 years
One of the coolest parts of our new report? We used #AI to understand how the Chinese military is using AI. The @elicitorg AI research assistant developed by @oughtinc helped us ID false negatives & check data labels. Pretty soon I'll be out of a job...
1
5
33
0
1
13
@stuhlmueller
Andreas Stuhlmüller
1 year
As a researcher it's so tempting to: - Discount what you have to say because it's obvious to you - Compare your messy process to others' highlight reel - Forget how rare clean hypotheses and findings are - Underestimate how incremental science is
1
0
13
@stuhlmueller
Andreas Stuhlmüller
2 years
My favorite consequence is that longer queries often produce better results, not worse
1
2
13
@stuhlmueller
Andreas Stuhlmüller
3 years
"What concepts should I understand to answer this question?"
0
1
12