Jennifer Hu

@_jennhu

1,513
Followers
100
Following
12
Media
78
Statuses

Research Fellow at @Harvard and incoming Asst Prof at @JohnsHopkins interested in language, computation, and cognition. @jennhu.bsky.social

Cambridge, MA
Joined June 2009
@_jennhu
Jennifer Hu
3 months
Are Large Language Models good at language? A recent paper by Dentella, Günther, & Leivada (DGL) argues no: LLMs can't reliably distinguish grammatical from ungrammatical sentences. We re-analyzed their data and found that LLMs are highly accurate and capture fine-grained variation in human judgments. 1/10
6
71
306
@_jennhu
Jennifer Hu
7 months
To researchers doing LLM evaluation: prompting is *not a substitute* for direct probability measurements. Check out the camera-ready version of our work, to appear at EMNLP 2023! (w/ @roger_p_levy) Paper: Original thread: 🧵👇
@_jennhu
Jennifer Hu
1 year
New paper with @roger_p_levy: Prompt-based methods may underestimate large language models' linguistic generalizations. Preprint: 🧵👇
10
37
182
4
52
267
@_jennhu
Jennifer Hu
2 months
Super excited to visit the Stanford NLP group and talk about the science of LM evaluation: How can we make inferences about LMs' latent capabilities, based on observable behaviors? Full abstract here: The talk is open to the public! Register below 👇🌟
@stanfordnlp
Stanford NLP Group
2 months
For this week’s NLP Seminar, we are thrilled to host @_jennhu to talk about "How to Know What Language Models Know"!
When: 03/07 Thurs 11am PT
Non-Stanford affiliates registration form (closed at 9am PT on the talk day):
2
18
146
3
21
215
@_jennhu
Jennifer Hu
1 year
New paper with @roger_p_levy: Prompt-based methods may underestimate large language models' linguistic generalizations. Preprint: 🧵👇
10
37
182
@_jennhu
Jennifer Hu
1 year
Life update: I defended my thesis and will be joining the Harvard Kempner Institute as a Research Fellow 🙂 Thrilled to continue pursuing questions at the intersection of language, cognition, and AI (and to be sticking around Boston)!
@boazbaraktcs
Boaz Barak
1 year
Kempner Institute announces the first cohort of research fellows starting this fall! Looking forward to learning from and collaborating with @brandfonbrener, @cogscikid, @_jennhu, @IlennaJ, @WangBinxu, @nsaphra, Eran Malach, and @t_andy_keller.
3
16
141
8
7
182
@_jennhu
Jennifer Hu
1 month
New preprint w/ @mcxfrank: How can we ascribe cognitive abilities to language models? We evaluate them! But evals impose challenges separate from the underlying ability of interest. These "task demands" affect LM performance, esp. for smaller models! 1/8
4
31
160
@_jennhu
Jennifer Hu
1 year
New paper with Sammy Floyd, @OlessiaJour, @ev_fedorenko, @LanguageMIT! Non-literal language understanding is an essential part of communication. But what is the role of mentalizing vs. language statistics in pragmatics? & how well do NLP models capture human prag behaviors? 🧵1/6
3
38
151
@_jennhu
Jennifer Hu
3 years
Excited to announce the NeurIPS 2021 workshop ✨Meaning in Context: Pragmatic Communication in Humans and Machines✨ Submit your abstracts or short papers by September 10. More info:
3
33
97
@_jennhu
Jennifer Hu
2 years
Want to discuss pragmatics with ML, cogsci, and language researchers? Register for the Meaning in Context workshop @ #NeurIPS2021! Make sure to register for NeurIPS and sign up for meet-and-greet sessions. See below 👇
1
19
84
@_jennhu
Jennifer Hu
1 year
✨New paper to appear in TACL (with @roger_p_levy, Judith Degen @ALPSLabStanford, and @sebschu)! ✨ Scalar inferences (SI) are highly variable both *within* a scale (e.g., <some, all>) and *across* scales, but few proposals quantitatively explain both types of variation. 🧵: 1/7
2
14
78
@_jennhu
Jennifer Hu
3 months
Probability comparison is important because LLMs are designed to generate high-prob sentences. Relative to this measurement, prompting underestimates LLM capabilities. And knowledge *about* language != knowledge *of* language. See e.g. Hu & Levy 2023: 6/10
1
3
35
@_jennhu
Jennifer Hu
7 months
Ironically, two days before acceptance of our paper at EMNLP, OpenAI removed the ability to access token logprobs from gpt-3.5-turbo-instruct. This is a timely issue. We need to establish best practices for LLM evaluation based on scientific merit, not just convenience. 5/5
1
1
29
@_jennhu
Jennifer Hu
3 months
In sum: DGL provide important new human judgment data, but our analysis shows that LLMs match these judgments better than their paper suggests. We hope this re-analysis helps clarify the capabilities and limitations of LLMs, which is of great scientific & public interest. 10/10
3
0
28
@_jennhu
Jennifer Hu
2 years
👇 Now out in Cerebral Cortex!
@_jennhu
Jennifer Hu
3 years
Excited to share our new preprint (w/ @smallhannahe & @ev_fedorenko): Our results support the idea that language comprehension & production draw on the same knowledge representations, which are stored in the language-selective network. 1/7
2
21
82
0
3
26
@_jennhu
Jennifer Hu
1 year
In sum: Negative results relying on metalinguistic prompts cannot be taken as conclusive evidence that an LLM lacks a particular linguistic competency. Our results also highlight the value lost in the move to closed APIs, where access to probability distributions is limited. 8/8
1
2
26
@_jennhu
Jennifer Hu
1 year
Submit your latest work on Theory of Mind in Communicating Agents to our workshop at ICML 2023! We welcome submissions from cognitive, ML, and social perspectives 🙂
@tom_icml2023
ToM Workshop
1 year
1. 🔔**𝘾𝙖𝙡𝙡 𝙛𝙤𝙧 𝙋𝙖𝙥𝙚𝙧𝙨 𝙛𝙤𝙧 𝙏𝙝𝙚𝙤𝙧𝙮-𝙤𝙛-𝙈𝙞𝙣𝙙 𝙒𝙤𝙧𝙠𝙨𝙝𝙤𝙥**🔔 The First Workshop on Theory of Mind in Communicating Agents (ToM 2023) will be hosted at @icmlconf in July'23 in Honolulu 🌺 CfP: 🧵 #ICML2023 #ToM2023 #ML #NLProc
2
34
91
1
5
26
@_jennhu
Jennifer Hu
3 months
We report a re-analysis of DGL’s materials using direct probability measurement on minimal pairs. We find that LLMs achieve extremely high accuracy overall, and that minimal-pair logprob differences capture fine-grained variation in human judgments on these materials! 8/10
1
4
22
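One way to quantify "capturing fine-grained variation" is to correlate per-item minimal-pair log-probability differences with graded human judgments. Below is a minimal, self-contained sketch of that comparison; all numbers are illustrative placeholders, not DGL's data or the paper's results.

```python
# Hedged sketch: correlate minimal-pair log-probability differences with graded
# human judgment gaps. The values below are illustrative placeholders only.
from scipy.stats import spearmanr

# log P(grammatical) - log P(ungrammatical), per item (placeholder values)
lm_logprob_diff = [3.1, 0.4, 2.2, 1.8, 0.1]
# mean human preference for the grammatical member, per item (placeholder values)
human_rating_gap = [0.9, 0.2, 0.6, 0.7, 0.1]

rho, p = spearmanr(lm_logprob_diff, human_rating_gap)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```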
@_jennhu
Jennifer Hu
3 months
But the standard practice for evaluating LLM grammatical generalization is to compare the probabilities assigned to minimal pairs of sentences (e.g., P(“The bear sleeps”) > P(“The bear sleep”)), as in benchmarks like BLiMP and SyntaxGym. 4/10
1
0
21
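A minimal sketch of this minimal-pair comparison, assuming the HuggingFace transformers library with gpt2 as a stand-in model (not the exact setup used by BLiMP, SyntaxGym, or the re-analysis):

```python
# Minimal-pair sketch: score each sentence by its total log probability under a
# causal LM and check that the grammatical member receives the higher score.
# Assumes the HuggingFace `transformers` library; gpt2 is a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of log P(w_t | w_<t) over the tokens of `sentence`."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood over the predicted tokens,
    # so multiply by the number of predicted tokens to recover the total.
    return -out.loss.item() * (ids.shape[1] - 1)

grammatical, ungrammatical = "The bear sleeps.", "The bear sleep."
print(sentence_logprob(grammatical) > sentence_logprob(ungrammatical))  # expected: True
```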
@_jennhu
Jennifer Hu
1 year
e.g.
DIRECT method: compare P(is) vs P(are) given prefix "The keys to the cabinet"
METALINGUISTIC method: give the prompt "Here is a sentence: The keys to the cabinet... What word is most likely to come next?" and compare P(is) vs P(are)
These methods give different results! 5/8
1
2
20
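A hedged sketch of the two measurement types, assuming the HuggingFace transformers library with gpt2 as a stand-in model; the metalinguistic prompt wording below (including the "Answer:" suffix) is illustrative, not the paper's exact prompt:

```python
# DIRECT vs. METALINGUISTIC measurement: in both cases we read off next-token
# probabilities of " is" and " are", but condition on different contexts.
# Assumes HuggingFace `transformers`; gpt2 is a stand-in model and the
# metalinguistic prompt wording is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_word_prob(context: str, word: str) -> float:
    """P(word | context), assuming ' ' + word is a single token for this tokenizer."""
    ids = tok(context, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # distribution over the next token
    return torch.softmax(logits, dim=-1)[tok.encode(" " + word)[0]].item()

direct_ctx = "The keys to the cabinet"
meta_ctx = ("Here is a sentence: The keys to the cabinet... "
            "What word is most likely to come next? Answer:")

for label, ctx in [("direct", direct_ctx), ("metalinguistic", meta_ctx)]:
    p_is, p_are = next_word_prob(ctx, "is"), next_word_prob(ctx, "are")
    print(f"{label}: P(is)={p_is:.4f}  P(are)={p_are:.4f}")
```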
@_jennhu
Jennifer Hu
3 years
I'll be presenting our paper "Scalable pragmatic communication via self-supervision" (joint work with @NogaZaslavsky @roger_p_levy) at the ICML Self-Supervised Learning Workshop tomorrow Sat 7/24, 11:10-11:50 Pacific. All are welcome 🙂 See for details!
0
5
19
@_jennhu
Jennifer Hu
3 months
Furthermore, DGL’s data reveal important variation in human judgments. For some sentences that DGL code as ungrammatical, humans disagree (e.g., “Gary still perhaps drives to work”). Thus, LLMs might be held to a standard that doesn't correspond to systematic human preferences. 7/10
2
0
18
@_jennhu
Jennifer Hu
3 years
Submission deadline *extended to September 17*! See for submission guidelines.
@_jennhu
Jennifer Hu
3 years
Excited to announce the NeurIPS 2021 workshop ✨Meaning in Context: Pragmatic Communication in Humans and Machines✨ Submit your abstracts or short papers by September 10. More info:
3
33
97
0
6
17
@_jennhu
Jennifer Hu
3 months
DGL assess Large Language Models (LLMs) by prompting models: “Is the following sentence grammatically correct in English?” They compare model responses to human judgments of the same sentences. DGL argue models are less accurate than humans and biased toward “yes” responses. 3/10
1
0
17
@_jennhu
Jennifer Hu
3 months
We also point out a subtle difference between the prompts provided by DGL to LLMs versus to humans. When we remove this difference, LLM performance substantially improves, even using DGL’s prompt-based approach. 9/10
1
0
17
@_jennhu
Jennifer Hu
1 year
The fundamental unit of LLM computation is P(word|context). This conditional probability determines a distribution over word strings containing the model’s linguistic generalizations. But corporate LLM APIs are becoming more closed and no longer always offer P(word|context). 3/8
1
1
16
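In standard autoregressive terms (a textbook factorization, not anything specific to this thread), these per-word conditionals compose into a probability over whole strings,

\[ P(w_1, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1}), \]

so string-level comparisons such as minimal pairs can be computed from P(word|context) alone, with no prompting required.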
@_jennhu
Jennifer Hu
1 year
We evaluate six LLMs across four tasks/linguistic domains. Broadly, we find that LLMs' metalinguistic judgments are inferior to direct probability-based comparisons. And consistency gets worse as the prompt diverges from direct measurements of next-word probabilities. 7/8
1
1
16
@_jennhu
Jennifer Hu
1 year
Our results suggest that even paradigmatic pragmatic phenomena (e.g., polite deceits) could potentially be solved w/o explicit representations of other agents’ mental states, and that artificial models can be used to gain mechanistic insights into human pragmatic processing. 5/6
1
0
14
@_jennhu
Jennifer Hu
1 year
Prompting implicitly tests a new type of emergent ability — metalinguistic judgment — which has not yet been systematically explored. So, how well do LLMs perform under direct vs. metalinguistic evaluation? And how consistent are LLMs across both methods? 6/8
1
0
14
@_jennhu
Jennifer Hu
1 year
So, how do models do? The larger models achieve high accuracy, and their error patterns are similar to humans': among incorrect responses, these models tend to select the literal interpretation of an utterance over distractors based on heuristics such as lexical similarity. 3/6
1
3
12
@_jennhu
Jennifer Hu
1 year
We also found that models use similar linguistic cues as humans to solve the tasks. For many tasks, humans and models align on which items are difficult. We also removed the context story from the items, and found that models and humans degrade across tasks in similar ways. 4/6
1
2
12
@_jennhu
Jennifer Hu
1 year
The success of large language models (LLMs) has sparked a critical debate in language science: what linguistic generalizations do LLMs capture, and how? Some claim LLMs challenge classic approaches to language; others argue LLMs are poor substitutes for linguistic theories. 1/8
1
0
11
@_jennhu
Jennifer Hu
10 months
Join us for a fun discussion on cognitively-motivated approaches to AI benchmarking, with a focus on language + social reasoning! (note the corrected date: *July 17th*)
@hawkrobe
Robert Hawkins
10 months
You're invited to a virtual seminar with @tallinzen, @tianminshu, & @_jennhu July 19th, noon-1pm ET! This session kicks off the CogSci 2023 Cognitive-AI Benchmarking (CAB) workshop to be held on-site in Sydney. Please register here for the Zoom link!
1
12
46
0
1
11
@_jennhu
Jennifer Hu
1 year
So LLM evaluation is shifting toward metalinguistic prompting: writing a sentence and asking the model about it. For the LLM to succeed, it must both represent the generalization of interest and report the outcome of applying the generalization to the sentence in the prompt. 4/8
1
0
11
@_jennhu
Jennifer Hu
1 month
Takeaways: LM performance shouldn't be seen as a direct indication of intelligence (or lack thereof), but as reflecting abilities through the lens of our design choices. This adds to work on "LM evaluation validity" and suggests ways that LMs could be used to study kids! 7/8
1
1
11
@_jennhu
Jennifer Hu
1 year
No matter their theoretical position, researchers need a way to assess the capabilities of LLMs to substantiate such claims. With all the options available, how should we go about evaluating LLMs' linguistic knowledge? 2/8
1
0
10
@_jennhu
Jennifer Hu
7 months
Brief summary: The fundamental unit of LLM computation is P(word|context). This determines a distribution over strings containing the model’s linguistic generalizations. Prompting implicitly tests a new type of emergent ability: metalinguistic judgment. 1/5
1
0
10
@_jennhu
Jennifer Hu
1 month
Thanks to @mcxfrank for a super fun collaboration! Preprint: 8/8
0
0
9
@_jennhu
Jennifer Hu
7 months
In the camera-ready, we discuss a potential competence–performance distinction in LLMs: the information implicitly encoded in an LLM's string distribution over isolated sentences does not always surface when the model is explicitly prompted for a response based on that info. 4/5
1
0
9
@_jennhu
Jennifer Hu
1 month
A shared goal in psychology and AI is to ascribe cognitive capacities to black-box agents. For example, we might be interested in whether a young child has theory of mind, or whether an LM can distinguish grammatical and ungrammatical sentences. 2/8
1
0
7
@_jennhu
Jennifer Hu
7 months
We find that LLMs' metalinguistic judgments are inferior to direct probability-based comparisons, suggesting that negative results relying on metalinguistic prompts cannot be taken as conclusive evidence that an LLM lacks a particular linguistic generalization. 3/5
3
0
8
@_jennhu
Jennifer Hu
3 years
In sum: our results support the idea that the lang network stores integrated linguistic knowledge. Like the integration btwn word meanings & combinatorial processing for comprehension, it seems lang areas support both lexical access & sentence generation during production. 7/7
1
1
8
@_jennhu
Jennifer Hu
1 year
We explore this through a fine-grained comparison of LMs and humans on 7 pragmatic tasks. Our eval materials are an expert-curated set of multiple choice q's. Each answer option represents different strategies for solving the task (pragmatic, literal, low-level heuristics). 2/6
1
2
8
@_jennhu
Jennifer Hu
9 months
If you're interested in postdoctoral research at the intersection of cognition, neuroscience, and AI, consider applying for the Kempner Fellowship! 👇🧠 Applications close October 9!
@boazbaraktcs
Boaz Barak
9 months
Applications are now open for postdoctoral research fellows at the Kempner Institute at Harvard. These are 3-year positions with independent funding and access to the amazing resources of the institute. Apply by October 9th 2023!
1
31
76
0
0
7
@_jennhu
Jennifer Hu
3 years
We also find the language regions respond to both lexical access and sentence-generation demands, which implies strong integration between lexico-semantic and combinatorial processes, mirroring the picture that has emerged in language comprehension. 5/7
1
0
7
@_jennhu
Jennifer Hu
1 year
Our results suggest that scalar inferences arise from context-driven expectations over alternatives, and these expectations operate at the level of concepts. These findings also highlight the role of linguistic prediction in pragmatic inference. 5/7
1
0
6
@_jennhu
Jennifer Hu
1 year
(This work builds upon human behavioral data collected by Bob van Tiel, @evanmiltenburg, @NGotzner, @ecpankratz, Eszter Ronai, Ming Xiang, and many others!) 6/7
1
0
6
@_jennhu
Jennifer Hu
7 months
e.g.
DIRECT method: compare P(is) vs P(are) given prefix "The keys to the cabinet..."
METALINGUISTIC method: give the prompt "Here is a sentence: The keys to the cabinet... What word is most likely to come next?" and compare P(is) vs P(are)
These methods give different results! 2/5
1
0
6
@_jennhu
Jennifer Hu
3 years
co-organized with @NogaZaslavsky, @aidanematzadeh, Michael Franke, @roger_p_levy, & Noah Goodman
0
0
6
@_jennhu
Jennifer Hu
3 years
Joint work with @smallhannahe, @HopeKean, Atsushi Takahashi, Leo Zekelman, @dankleinman, Elizabeth Ryan, @victorf13, & @ev_fedorenko
0
0
6
@_jennhu
Jennifer Hu
1 year
Furthermore, while it is generally assumed that SIs arise through reasoning about unspoken alternatives, it remains debated whether humans reason about alternatives as linguistic forms, or at the level of concepts. 2/7
1
0
6
@_jennhu
Jennifer Hu
1 year
However, expectedness robustly predicts cross-scale variation only under a concept-based (i.e., not string-based) view of alternatives. 4/7
1
0
5
@_jennhu
Jennifer Hu
1 month
For both humans and machines, making inferences about failures is especially tricky, because failure on a task does not always indicate the absence of the underlying capacity. E.g., children often fail because they don’t understand the question. 4/8
1
0
5
@_jennhu
Jennifer Hu
3 months
@EvelinaLeivada Agreed, and thank you for the very interesting work and data! Looking forward to continuing a productive discussion. 🙂
0
0
5
@_jennhu
Jennifer Hu
2 years
What burning questions about pragmatics would you like to ask our panelists? Add them here, and we’ll try to answer them live:
1
0
5
@_jennhu
Jennifer Hu
1 month
We argue that task demands also play an important role in determining success or failure for LMs, especially when comparing models of different capacities. We evaluate 13 LMs on analogical reasoning, reflective reasoning, word prediction, and grammaticality judgments. 5/8
1
0
5
@_jennhu
Jennifer Hu
4 years
Check out our new paper to appear at #SCiL2020 , as part of a larger effort to develop better methods for evaluating neural language models!
0
0
4
@_jennhu
Jennifer Hu
1 year
We test a shared mechanism for within- and cross-scale variation: context-driven expectations about the unspoken alternatives. Using LMs to approximate human predictive distributions, we find that SIs are captured by the expectedness of the strong scalemate as an alternative. 3/7
1
0
4
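A hedged sketch of one way to operationalize "expectedness of the strong scalemate": read off the LM's probability of the strong alternative (e.g., "all") at the slot where the weak term ("some") could occur, given the left context. This assumes the HuggingFace transformers library with gpt2 as a stand-in, and the example context is illustrative rather than drawn from the paper's materials; it shows the idea, not the paper's exact metric.

```python
# Expectedness sketch: probability of the strong scalemate ("all") vs. the weak
# term ("some") at the quantifier slot, conditioned on the left context.
# Assumes HuggingFace `transformers`; gpt2 is a stand-in model and the example
# context is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_word_prob(context: str, word: str) -> float:
    """P(word | context), assuming ' ' + word is a single token for this tokenizer."""
    ids = tok(context, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.softmax(logits, dim=-1)[tok.encode(" " + word)[0]].item()

left_context = "After the party, I realized that the guests had eaten"
print("P(all | context) :", next_word_prob(left_context, "all"))
print("P(some | context):", next_word_prob(left_context, "some"))
```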
@_jennhu
Jennifer Hu
1 month
Evaluation methods with greater task demands yield lower performance than methods with reduced demands. This “demand gap” is most pronounced for models with fewer parameters and less training data. We discuss implications for emergence in LMs and task demands in children. 6/8
1
0
4
@_jennhu
Jennifer Hu
1 month
The trouble is, although we would like to infer an underlying psychological *construct*, we only have access to specific observable *evaluations* (e.g., a child's ability to answer a question about a character in a story, or a model's performance on a syntax benchmark). 3/8
1
0
4
@_jennhu
Jennifer Hu
4 years
Excited to share our new paper! Check it out for a fresh look at the computational principles that give rise to human pragmatic reasoning.
@NogaZaslavsky
Noga Zaslavsky
4 years
Very excited to share our new paper: “A Rate-Distortion view of human pragmatic reasoning” Joint work with @_jennhu and @roger_p_levy. (1/)
2
25
62
0
0
3
@_jennhu
Jennifer Hu
2 years
More info about the schedule, accepted papers, and speakers can be found at
0
0
4
@_jennhu
Jennifer Hu
2 years
We’re hosting 2 virtual meet-and-greet sessions, following the ICLR BAICS & Neuromatch “mind-matching” model, so you can meet your fellow workshop attendees and talk about shared interests. Register for the meet-and-greet here by Dec 10:
1
0
3
@_jennhu
Jennifer Hu
3 years
Finally, while some have hypothesized the existence of production-selective mechanisms, we find no evidence of brain regions that selectively support sentence generation. Instead, language regions respond overall more strongly during production than during comprehension. 6/7
1
0
3
@_jennhu
Jennifer Hu
3 years
We used a standard (in the behavioral literature) event description / object naming task, and included a variety of controls, like a low-level production task (nonword production), a visual event semantics condition, and some comprehension conditions. 3/7
1
0
2
@_jennhu
Jennifer Hu
3 years
A network of left frontal and temporal brain regions has been implicated in language comprehension & production, but what is the precise role of this ‘language network’ in production? Across 4 fMRI expts, we characterize the response of the lang regions to production demands. 2/7
1
0
2
@_jennhu
Jennifer Hu
3 years
In line with prior studies, sentence production elicited strong responses throughout the language network. We also show that production-related responses in the language network are robust to output modality (speaking vs. typing). (We made a cool scanner-safe keyboard! ⌨️) 4/7
1
0
2