Are Large Language Models good at language? A recent paper by Dentella, Günther, & Leivada (DGL) argues no: LLMs can't distinguish grammatical from ungrammatical sentences. We re-analyzed their data and found that LLMs are highly accurate and capture fine-grained variation in human judgments. 1/10
To researchers doing LLM evaluation: prompting is *not a substitute* for direct probability measurements.
Check out the camera-ready version of our work, to appear at EMNLP 2023! (w/ @roger_p_levy)
Paper:
Original thread:
🧵👇
Super excited to visit the Stanford NLP group and talk about the science of LM evaluation: How can we make inferences about LMs' latent capabilities, based on observable behaviors?
Full abstract here:
The talk is open to the public! Register below 👇🌟
For this week’s NLP Seminar, we are thrilled to host @_jennhu to talk about "How to Know What Language Models Know"!
When: 03/07 Thurs 11am PT
Non-Stanford affiliates registration form (closes at 9am PT on the talk day):
Life update: I defended my thesis and will be joining the Harvard Kempner Institute as a Research Fellow 🙂 Thrilled to continue pursuing questions at the intersection of language, cognition, and AI (and to be sticking around Boston)!
New preprint w/ @mcxfrank:
How can we ascribe cognitive abilities to language models? We evaluate them! But evals impose challenges separate from the underlying ability of interest. These "task demands" affect LM performance, esp. for smaller models! 1/8
New paper with Sammy Floyd, @OlessiaJour, @ev_fedorenko, @LanguageMIT! Non-literal language understanding is an essential part of communication. But what is the role of mentalizing vs. language statistics in pragmatics? & how well do NLP models capture human pragmatic behaviors? 🧵1/6
Excited to announce the NeurIPS 2021 workshop ✨Meaning in Context: Pragmatic Communication in Humans and Machines✨
Submit your abstracts or short papers by September 10. More info:
Want to discuss pragmatics with ML, cogsci, and language researchers? Register for the Meaning in Context workshop @ #NeurIPS2021! Make sure to register for NeurIPS and sign up for meet-and-greet sessions. See below 👇
Excited to share our new preprint (w/ @smallhannahe & @ev_fedorenko): Our results support the idea that language comprehension & production draw on the same knowledge representations, which are stored in the language-selective network. 1/7
✨New paper to appear in TACL (with @roger_p_levy, Judith Degen @ALPSLabStanford, and @sebschu)! ✨ Scalar inferences (SI) are highly variable both *within* a scale (e.g., <some, all>) and *across* scales, but few proposals quantitatively explain both types of variation. 🧵: 1/7
Probability comparison is important because LLMs are designed to generate high-prob sentences. Relative to this measurement, prompting underestimates LLM capabilities. And knowledge *about* language != knowledge *of* language. See e.g. Hu & Levy 2023: 6/10
Ironically, two days before acceptance of our paper at EMNLP, OpenAI removed the ability to access token logprobs from gpt-3.5-turbo-instruct.
This is a timely issue. We need to establish best practices for LLM evaluation based on scientific merit, not just convenience. 5/5
In sum: DGL provide important new human judgment data, but our analysis shows that LLMs match these judgments better than their paper suggests. We hope this re-analysis helps clarify the capabilities and limitations of LLMs, which is of great scientific & public interest. 10/10
In sum: Negative results relying on metalinguistic prompts cannot be taken as conclusive evidence that an LLM lacks a particular linguistic competency. Our results also highlight the value lost in the move to closed APIs, where access to probability distributions is limited. 8/8
Submit your latest work on Theory of Mind in Communicating Agents to our workshop at ICML 2023! We welcome submissions from cognitive, ML, and social perspectives 🙂
1. 🔔**𝘾𝙖𝙡𝙡 𝙛𝙤𝙧 𝙋𝙖𝙥𝙚𝙧𝙨 𝙛𝙤𝙧 𝙏𝙝𝙚𝙤𝙧𝙮-𝙤𝙛-𝙈𝙞𝙣𝙙 𝙒𝙤𝙧𝙠𝙨𝙝𝙤𝙥**🔔
The First Workshop on Theory of Mind in Communicating Agents (ToM 2023) will be hosted at @icmlconf in July'23 in Honolulu 🌺
CfP:
🧵
#ICML2023 #ToM2023 #ML #NLProc
Prior work studying syntax in LMs:
-Linzen et al 2016
-Wilcox et al 2023
-CoLA
-BLiMP
-SyntaxGym
And much more by @tallinzen, @a_stadt, @RTomMcCoy et al! 5/10
We report a re-analysis of DGL’s materials using direct probability measurements on minimal pairs. We find that LLMs achieve extremely high accuracy overall, and that minimal-pair logprob differences capture fine-grained variation in human judgments! 8/10
But the standard practice for evaluating LLMs' grammatical generalizations is comparing the probabilities assigned to minimal pairs of sentences (e.g., P(“The bear sleeps”) > P(“The bear sleep”)), as used by benchmarks like BLiMP and SyntaxGym. 4/10
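(For concreteness — not from the original thread — here's a minimal sketch of this kind of comparison, using GPT-2 via Hugging Face transformers; the model choice and helper function are illustrative, not the benchmarks' exact code.)

```python
# Minimal-pair comparison sketch: score each sentence by its total
# log-probability under a causal LM (GPT-2 here, purely for illustration).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of token log-probs (excludes the first token, which has no context)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, .loss is the mean cross-entropy over the
        # ids.shape[1] - 1 predicted tokens; scale back up to a sum.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

grammatical = sentence_logprob("The bear sleeps.")
ungrammatical = sentence_logprob("The bear sleep.")
print(grammatical > ungrammatical)  # True if the model prefers the grammatical form
```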
e.g.
DIRECT method: compare P(is) vs P(are) given prefix "The keys to the cabinet"
METALINGUISTIC method: give the prompt "Here is a sentence: The keys to the cabinet... What word is most likely to come next?" and compare P(is) vs P(are)
These methods give different results! 5/8
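(An illustrative sketch of both methods, again with GPT-2 standing in for the LLMs we actually evaluated; the "Answer:" suffix on the metalinguistic prompt is a simplifying assumption.)

```python
# Direct vs. metalinguistic measurement sketch (GPT-2 as a stand-in).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def next_word_logprob(prefix: str, word: str) -> float:
    """Log P(word | prefix); assumes `word` is a single GPT-2 token."""
    ids = tokenizer(prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs[tokenizer.encode(word)[0]].item()

direct = "The keys to the cabinet"
meta = ("Here is a sentence: The keys to the cabinet... "
        "What word is most likely to come next? Answer:")

# The measured quantity is the same; only the framing changes.
for name, prefix in [("DIRECT", direct), ("METALINGUISTIC", meta)]:
    prefers_are = next_word_logprob(prefix, " are") > next_word_logprob(prefix, " is")
    print(f"{name}: prefers 'are' = {prefers_are}")
```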
I'll be presenting our paper "Scalable pragmatic communication via self-supervision" (joint work with @NogaZaslavsky & @roger_p_levy) at the ICML Self-Supervised Learning Workshop tomorrow, Sat 7/24, 11:10-11:50 Pacific. All are welcome 🙂 See below for details!
Furthermore, DGL’s data reveal important variation in human judgments. For some sentences that DGL code as ungrammatical, humans disagree (e.g., “Gary still perhaps drives to work”). Thus, LLMs might be held to a standard that doesn't correspond to systematic human preferences. 7/10
DGL assess Large Language Models (LLMs) by prompting models: “Is the following sentence grammatically correct in English?” They compare model responses to human judgments of the same sentences. DGL argue models are less accurate than humans and biased toward “yes” responses. 3/10
We also point out a subtle difference between the prompts provided by DGL to LLMs versus to humans. When we remove this difference, LLM performance substantially improves, even using DGL’s prompt-based approach. 9/10
The fundamental unit of LLM computation is P(word|context). This conditional probability determines a distribution over word strings containing the model’s linguistic generalizations. But corporate LLM APIs are becoming more closed and no longer always offer P(word|context). 3/8
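(Schematically, the chain rule is what turns P(word|context) into a distribution over whole strings; `lm_logprob` below is a hypothetical stand-in for any model's scoring function.)

```python
# Chain-rule sketch: log P(w_1 ... w_n) = sum_i log P(w_i | w_1 ... w_{i-1}).
# `lm_logprob(context, word)` is a hypothetical callable returning
# log P(word | context) — e.g., a wrapper around any causal LM.
def string_logprob(words, lm_logprob):
    return sum(lm_logprob(words[:i], words[i]) for i in range(len(words)))
```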
We evaluate six LLMs across four tasks/linguistic domains. Broadly, we find that LLMs' metalinguistic judgments are inferior to direct probability-based comparisons. And consistency gets worse as the prompt diverges from direct measurements of next-word probabilities. 7/8
Our results suggest that even paradigmatic pragmatic phenomena (e.g., polite deceits) could potentially be solved w/o explicit representations of other agents’ mental states, and that artificial models can be used to gain mechanistic insights into human pragmatic processing. 5/6
Prompting implicitly tests a new type of emergent ability — metalinguistic judgment — which has not yet been systematically explored. So, how well do LLMs perform under direct vs. metalinguistic evaluation? And how consistent are LLMs across both methods? 6/8
So, how do models do? The larger models achieve high accuracy and show error patterns similar to humans': within incorrect responses, these models tend to select the literal interpretation of an utterance over distractors based on heuristics such as lexical similarity. 3/6
We also found that models use similar linguistic cues as humans to solve the tasks. For many tasks, humans and models align on which items are difficult. We also removed the context story from the items, and found that models and humans degrade across tasks in similar ways. 4/6
The success of large language models (LLMs) has sparked a critical debate in language science: what linguistic generalizations do LLMs capture, and how? Some claim LLMs challenge classic approaches to language; others argue LLMs are poor substitutes for linguistic theories. 1/8
Join us for a fun discussion on cognitively-motivated approaches to AI benchmarking, with a focus on language + social reasoning!
(note the corrected date: *July 17th*)
You're invited to a virtual seminar with @tallinzen, @tianminshu, & @_jennhu, July 19th, noon-1pm ET! This session kicks off the CogSci 2023 Cognitive-AI Benchmarking (CAB) workshop to be held on-site in Sydney. Please register here for the Zoom link!
So LLM evaluation is shifting toward metalinguistic prompting: writing a sentence and asking the model about it. For the LLM to succeed, it must both represent the generalization of interest and report the outcome of applying the generalization to the sentence in the prompt. 4/8
Takeaways: LM performance shouldn't be seen as a direct indication of intelligence (or lack thereof), but as reflecting abilities through the lens of our design choices. This adds to work on "LM evaluation validity" and suggests ways that LMs could be used to study kids! 7/8
No matter their theoretical position, researchers need a way to assess the capabilities of LLMs to substantiate such claims. With all the options available, how should we go about evaluating LLMs' linguistic knowledge? 2/8
Brief summary: The fundamental unit of LLM computation is P(word|context). This determines a distribution over strings containing the model’s linguistic generalizations. Prompting implicitly tests a new type of emergent ability: metalinguistic judgment. 1/5
In the camera-ready, we discuss a potential competence–performance distinction in LLMs: the information implicitly encoded in an LLM's string distribution over isolated sentences does not always surface when the model is explicitly prompted for a response based on that info. 4/5
A shared goal in psychology and AI is to ascribe cognitive capacities to black-box agents. For example, we might be interested in whether a young child has theory of mind, or whether an LM can distinguish grammatical and ungrammatical sentences. 2/8
We find that LLMs' metalinguistic judgments are inferior to direct probability-based comparisons, suggesting that negative results relying on metalinguistic prompts cannot be taken as conclusive evidence that an LLM lacks a particular linguistic generalization. 3/5
In sum: our results support the idea that the lang network stores integrated linguistic knowledge. Mirroring the integration btwn word meanings & combinatorial processing in comprehension, lang areas seem to support both lexical access & sentence generation during production. 7/7
We explore this through a fine-grained comparison of LMs and humans on 7 pragmatic tasks. Our eval materials are an expert-curated set of multiple choice q's. Each answer option represents different strategies for solving the task (pragmatic, literal, low-level heuristics). 2/6
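(One common way to score such items, sketched below: compare the conditional log-probability of each answer option given the item context. The toy item and model here are illustrative, not our actual materials or exact procedure.)

```python
# Multiple-choice scoring sketch: pick the option the LM assigns the
# highest conditional log-probability (GPT-2 and the item text are toys).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def option_logprob(context: str, option: str) -> float:
    """Sum of log P over the option's tokens, conditioned on context.
    Assumes tokenization splits cleanly at the context/option boundary."""
    n_ctx = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits[0], dim=-1)
    # logprobs[pos - 1] is the distribution predicting the token at pos.
    return sum(logprobs[pos - 1, ids[0, pos]].item()
               for pos in range(n_ctx, ids.shape[1]))

context = ("Ann asks Bob how her haircut looks. Bob hates it, "
           "but replies: 'It looks great!' Bob really means that")
options = [" he wants to be polite.",     # pragmatic (polite deceit)
           " the haircut looks great."]   # literal
print(max(options, key=lambda o: option_logprob(context, o)))
```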
If you're interested in postdoctoral research at the intersection of cognition, neuroscience, and AI, consider applying for the Kempner Fellowship! 👇🧠 Applications close October 9!
Applications are now open for postdoctoral research fellows at the Kempner Institute at Harvard. These are 3-year positions with independent funding and access to the amazing resources of the institute. Apply by October 9th 2023!
We also find the language regions respond to both lexical access and sentence-generation demands, which implies strong integration between lexico-semantic and combinatorial processes, mirroring the picture that has emerged in language comprehension. 5/7
Our results suggest that scalar inferences arise from context-driven expectations over alternatives, and these expectations operate at the level of concepts. These findings also highlight the role of linguistic prediction in pragmatic inference. 5/7
e.g.
DIRECT method: compare P(is) vs. P(are) given prefix "The keys to the cabinet..."
METALINGUISTIC method: give the prompt "Here is a sentence: The keys to the cabinet... What word is most likely to come next?" and compare P(is) vs. P(are)
These methods give different results! 2/5
Furthermore, while it is generally assumed that SIs arise through reasoning about unspoken alternatives, it remains debated whether humans reason about alternatives as linguistic forms, or at the level of concepts. 2/7
For both humans and machines, making inferences about failures is especially tricky, because failure on a task does not always indicate the absence of the underlying capacity. E.g., children often fail because they don’t understand the question. 4/8
We argue that task demands also play an important role in determining success or failure for LMs, especially when comparing models of different capacities. We evaluate 13 LMs on analogical reasoning, reflective reasoning, word prediction, and grammaticality judgments. 5/8
We test a shared mechanism for within- and cross-scale variation: context-driven expectations about the unspoken alternatives. Using LMs to approximate human predictive distributions, we find that SIs are captured by the expectedness of the strong scalemate as an alternative. 3/7
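(A sketch of the core measurement: how expected is the strong scalemate, e.g. "all", in context? The example sentence and model are invented stand-ins for our materials.)

```python
# Alternative-expectedness sketch: log P("all" | context) under an LM
# (GPT-2 here) as a proxy for how expected the strong scalemate is.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def next_word_logprob(prefix: str, word: str) -> float:
    """Log P(word | prefix); assumes `word` is a single GPT-2 token."""
    ids = tokenizer(prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[tokenizer.encode(word)[0]].item()

# Higher expectedness of "all" as an alternative should predict a
# stronger "some, but not all" inference for this context.
print(next_word_logprob("She ate some, if not", " all"))
```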
Evaluation methods with greater task demands yield lower performance than methods with reduced demands. This “demand gap” is most pronounced for models with fewer parameters and less training data. We discuss implications for emergence in LMs and task demands in children. 6/8
The trouble is, although we would like to infer an underlying psychological *construct*, we only have access to specific observable *evaluations* (e.g., a child's ability to answer a question about a character in a story, or a model's performance on a syntax benchmark). 3/8
We’re hosting 2 virtual meet-and-greet sessions, following the ICLR BAICS & Neuromatch “mind-matching” model, so you can meet your fellow workshop attendees and talk about shared interests. Register for the meet-and-greet here by Dec 10:
Finally, while some have hypothesized the existence of production-selective mechanisms, we find no evidence of brain regions that selectively support sentence generation. Instead, language regions respond overall more strongly during production than during comprehension. 6/7
We used an event description / object naming task that is standard in the behavioral literature, and included a variety of controls, like a low-level production task (nonword production), a visual event semantics condition, and some comprehension conditions. 3/7
A network of left frontal and temporal brain regions has been implicated in language comprehension & production, but what is the precise role of this ‘language network’ in production? Across 4 fMRI expts, we characterize the response of the lang regions to production demands. 2/7
In line with prior studies, sentence production elicited strong responses throughout the language network. We also show that production-related responses in the language network are robust to output modality (speaking vs. typing). (We made a cool scanner-safe keyboard! ⌨️) 4/7