Are Large Language Models good at language? A recent paper by Dentella, Günther, & Leivada (DGL) argues no: LLMs can't distinguish grammatical from ungrammatical sentences. We re-analyzed their data and found that LLMs are highly accurate and capture fine-grained variation in human judgments. 1/10
To researchers doing LLM evaluation: prompting is *not a substitute* for direct probability measurements.
Check out the camera-ready version of our work, to appear at EMNLP 2023! (w/ @roger_p_levy)
Paper:
Original thread:
🧵👇
Super excited to visit the Stanford NLP group and talk about the science of LM evaluation: How can we make inferences about LMs' latent capabilities, based on observable behaviors?
Full abstract here:
The talk is open to the public! Register below 👇🌟
For this week’s NLP Seminar, we are thrilled to host @_jennhu to talk about "How to Know What Language Models Know"!
When: 03/07 Thurs 11am PT
Non-Stanford affiliates registration form (closes at 9am PT on the talk day):
Life update: I defended my thesis and will be joining the Harvard Kempner Institute as a Research Fellow 🙂 Thrilled to continue pursuing questions at the intersection of language, cognition, and AI (and to be sticking around Boston)!
New preprint w/ @mcxfrank:
How can we ascribe cognitive abilities to language models? We evaluate them! But evals impose challenges separate from the underlying ability of interest. These "task demands" affect LM performance, esp. for smaller models! 1/8
New paper with Sammy Floyd, @OlessiaJour, @ev_fedorenko, @LanguageMIT! Non-literal language understanding is an essential part of communication. But what is the role of mentalizing vs. language statistics in pragmatics? & how well do NLP models capture human pragmatic behaviors? 🧵1/6
Excited to announce the NeurIPS 2021 workshop ✨Meaning in Context: Pragmatic Communication in Humans and Machines✨
Submit your abstracts or short papers by September 10. More info:
Want to discuss pragmatics with ML, cogsci, and language researchers? Register for the Meaning in Context workshop @ #NeurIPS2021! Make sure to register for NeurIPS and sign up for meet-and-greet sessions. See below 👇
Excited to share our new preprint (w/ @smallhannahe & @ev_fedorenko): Our results support the idea that language comprehension & production draw on the same knowledge representations, which are stored in the language-selective network. 1/7
✨New paper to appear in TACL (with @roger_p_levy, Judith Degen @ALPSLabStanford, and @sebschu)! ✨ Scalar inferences (SI) are highly variable both *within* a scale (e.g., <some, all>) and *across* scales, but few proposals quantitatively explain both types of variation. 🧵: 1/7
Probability comparison is important because LLMs are designed to generate high-prob sentences. Relative to this measurement, prompting underestimates LLM capabilities. And knowledge *about* language != knowledge *of* language. See e.g. Hu & Levy 2023: 6/10
Ironically, two days before acceptance of our paper at EMNLP, OpenAI removed the ability to access token logprobs from gpt-3.5-turbo-instruct.
This is a timely issue. We need to establish best practices for LLM evaluation based on scientific merit, not just convenience. 5/5
In sum: DGL provide important new human judgment data, but our analysis shows that LLMs match these judgments better than their paper suggests. We hope this re-analysis helps clarify the capabilities and limitations of LLMs, which is of great scientific & public interest. 10/10
In sum: Negative results relying on metalinguistic prompts cannot be taken as conclusive evidence that an LLM lacks a particular linguistic competency. Our results also highlight the value lost in the move to closed APIs, where access to probability distributions is limited. 8/8
Submit your latest work on Theory of Mind in Communicating Agents to our workshop at ICML 2023! We welcome submissions from cognitive, ML, and social perspectives 🙂
1. 🔔**𝘾𝙖𝙡𝙡 𝙛𝙤𝙧 𝙋𝙖𝙥𝙚𝙧𝙨 𝙛𝙤𝙧 𝙏𝙝𝙚𝙤𝙧𝙮-𝙤𝙛-𝙈𝙞𝙣𝙙 𝙒𝙤𝙧𝙠𝙨𝙝𝙤𝙥**🔔
The First Workshop on Theory of Mind in Communicating Agents (ToM 2023) will be hosted at @icmlconf in July'23 in Honolulu 🌺
CfP:
🧵
#ICML2023 #ToM2023 #ML #NLProc
Prior work studying syntax in LMs:
-Linzen et al 2016
-Wilcox et al 2023
-CoLA
-BLiMP
-SyntaxGym
And much more by @tallinzen, @a_stadt, @RTomMcCoy et al! 5/10
We report a re-analysis of DGL’s materials using direct probability measurements on minimal pairs. We find that LLMs achieve extremely high accuracy overall, and that minimal-pair logprob differences capture fine-grained variation in human judgments! 8/10
But the standard practice for evaluating LLMs' grammatical generalizations is comparing the probabilities assigned to minimal pairs of sentences (e.g., P(“The bear sleeps”) > P(“The bear sleep”)), as used by benchmarks like BLiMP and SyntaxGym. 4/10
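(For concreteness — not from the original thread — here's a minimal sketch of this kind of comparison, using GPT-2 via Hugging Face transformers; the model choice and helper function are illustrative, not the benchmarks' exact code.)

```python
# Minimal-pair comparison sketch: score each sentence by its total
# log-probability under a causal LM (GPT-2 here, purely for illustration).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of token log-probs (excludes the first token, which has no context)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, .loss is the mean cross-entropy over the
        # ids.shape[1] - 1 predicted tokens; scale back up to a sum.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

grammatical = sentence_logprob("The bear sleeps.")
ungrammatical = sentence_logprob("The bear sleep.")
print(grammatical > ungrammatical)  # True if the model prefers the grammatical form
```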
e.g.
DIRECT method: compare P(is) vs P(are) given prefix "The keys to the cabinet"
METALINGUISTIC method: give the prompt "Here is a sentence: The keys to the cabinet... What word is most likely to come next?" and compare P(is) vs P(are)
These methods give different results! 5/8
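(An illustrative sketch of both methods, again with GPT-2 standing in for the LLMs we actually evaluated; the "Answer:" suffix on the metalinguistic prompt is a simplifying assumption.)

```python
# Direct vs. metalinguistic measurement sketch (GPT-2 as a stand-in).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def next_word_logprob(prefix: str, word: str) -> float:
    """Log P(word | prefix); assumes `word` is a single GPT-2 token."""
    ids = tokenizer(prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs[tokenizer.encode(word)[0]].item()

direct = "The keys to the cabinet"
meta = ("Here is a sentence: The keys to the cabinet... "
        "What word is most likely to come next? Answer:")

# The measured quantity is the same; only the framing changes.
for name, prefix in [("DIRECT", direct), ("METALINGUISTIC", meta)]:
    prefers_are = next_word_logprob(prefix, " are") > next_word_logprob(prefix, " is")
    print(f"{name}: prefers 'are' = {prefers_are}")
```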
I'll be presenting our paper "Scalable pragmatic communication via self-supervision" (joint work with @NogaZaslavsky & @roger_p_levy) at the ICML Self-Supervised Learning Workshop tomorrow, Sat 7/24, 11:10-11:50 Pacific. All are welcome 🙂 See below for details!
Furthermore, DGL’s data reveal important variation in human judgments. For some sentences that DGL code as ungrammatical, humans disagree (e.g., “Gary still perhaps drives to work”). Thus, LLMs might be held to a standard that doesn't correspond to systematic human preferences. 7/10
DGL assess Large Language Models (LLMs) by prompting models: “Is the following sentence grammatically correct in English?” They compare model responses to human judgments of the same sentences. DGL argue models are less accurate than humans and biased toward “yes” responses. 3/10
We also point out a subtle difference between the prompts provided by DGL to LLMs versus to humans. When we remove this difference, LLM performance substantially improves, even using DGL’s prompt-based approach. 9/10
The fundamental unit of LLM computation is P(word|context). This conditional probability determines a distribution over word strings containing the model’s linguistic generalizations. But corporate LLM APIs are becoming more closed and no longer always offer P(word|context). 3/8
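(Schematically, the chain rule is what turns P(word|context) into a distribution over whole strings; `lm_logprob` below is a hypothetical stand-in for any model's scoring function.)

```python
# Chain-rule sketch: log P(w_1 ... w_n) = sum_i log P(w_i | w_1 ... w_{i-1}).
# `lm_logprob(context, word)` is a hypothetical callable returning
# log P(word | context) — e.g., a wrapper around any causal LM.
def string_logprob(words, lm_logprob):
    return sum(lm_logprob(words[:i], words[i]) for i in range(len(words)))
```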
We evaluate six LLMs across four tasks/linguistic domains. Broadly, we find that LLMs' metalinguistic judgments are inferior to direct probability-based comparisons. And consistency gets worse as the prompt diverges from direct measurements of next-word probabilities. 7/8
Our results suggest that even paradigmatic pragmatic phenomena (e.g., polite deceits) could potentially be solved w/o explicit representations of other agents’ mental states, and that artificial models can be used to gain mechanistic insights into human pragmatic processing. 5/6
Prompting implicitly tests a new type of emergent ability — metalinguistic judgment — which has not yet been systematically explored. So, how well do LLMs perform under direct vs. metalinguistic evaluation? And how consistent are LLMs across both methods? 6/8
So, how do models do? The larger models achieve high accuracy and show error patterns similar to humans': within incorrect responses, these models tend to select the literal interpretation of an utterance over distractors based on heuristics such as lexical similarity. 3/6
We also found that models use similar linguistic cues as humans to solve the tasks. For many tasks, humans and models align on which items are difficult. We also removed the context story from the items, and found that models and humans degrade across tasks in similar ways. 4/6
The success of large language models (LLMs) has sparked a critical debate in language science: what linguistic generalizations do LLMs capture, and how? Some claim LLMs challenge classic approaches to language; others argue LLMs are poor substitutes for linguistic theories. 1/8
Join us for a fun discussion on cognitively-motivated approaches to AI benchmarking, with a focus on language + social reasoning!
(note the corrected date: *July 17th*)
You're invited to a virtual seminar with @tallinzen, @tianminshu, & @_jennhu, July 19th, noon-1pm ET! This session kicks off the CogSci 2023 Cognitive-AI Benchmarking (CAB) workshop to be held on-site in Sydney. Please register here for the Zoom link!
So LLM evaluation is shifting toward metalinguistic prompting: writing a sentence and asking the model about it. For the LLM to succeed, it must both represent the generalization of interest and report the outcome of applying the generalization to the sentence in the prompt. 4/8
Takeaways: LM performance shouldn't be seen as a direct indication of intelligence (or lack thereof), but as reflecting abilities through the lens of our design choices. This adds to work on "LM evaluation validity" and suggests ways that LMs could be used to study kids! 7/8
No matter their theoretical position, researchers need a way to assess the capabilities of LLMs to substantiate such claims. With all the options available, how should we go about evaluating LLMs' linguistic knowledge? 2/8
Brief summary: The fundamental unit of LLM computation is P(word|context). This determines a distribution over strings containing the model’s linguistic generalizations. Prompting implicitly tests a new type of emergent ability: metalinguistic judgment. 1/5
In the camera-ready, we discuss a potential competence–performance distinction in LLMs: the information implicitly encoded in an LLM's string distribution over isolated sentences does not always surface when the model is explicitly prompted for a response based on that info. 4/5
A shared goal in psychology and AI is to ascribe cognitive capacities to black-box agents. For example, we might be interested in whether a young child has theory of mind, or whether an LM can distinguish grammatical and ungrammatical sentences. 2/8
We find that LLMs' metalinguistic judgments are inferior to direct probability-based comparisons, suggesting that negative results relying on metalinguistic prompts cannot be taken as conclusive evidence that an LLM lacks a particular linguistic generalization. 3/5
In sum: our results support the idea that the lang network stores integrated linguistic knowledge. Mirroring the integration btwn word meanings & combinatorial processing in comprehension, lang areas seem to support both lexical access & sentence generation during production. 7/7
We explore this through a fine-grained comparison of LMs and humans on 7 pragmatic tasks. Our eval materials are an expert-curated set of multiple choice q's. Each answer option represents different strategies for solving the task (pragmatic, literal, low-level heuristics). 2/6
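(One common way to score such items, sketched below: compare the conditional log-probability of each answer option given the item context. The toy item and model here are illustrative, not our actual materials or exact procedure.)

```python
# Multiple-choice scoring sketch: pick the option the LM assigns the
# highest conditional log-probability (GPT-2 and the item text are toys).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def option_logprob(context: str, option: str) -> float:
    """Sum of log P over the option's tokens, conditioned on context.
    Assumes tokenization splits cleanly at the context/option boundary."""
    n_ctx = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits[0], dim=-1)
    # logprobs[pos - 1] is the distribution predicting the token at pos.
    return sum(logprobs[pos - 1, ids[0, pos]].item()
               for pos in range(n_ctx, ids.shape[1]))

context = ("Ann asks Bob how her haircut looks. Bob hates it, "
           "but replies: 'It looks great!' Bob really means that")
options = [" he wants to be polite.",     # pragmatic (polite deceit)
           " the haircut looks great."]   # literal
print(max(options, key=lambda o: option_logprob(context, o)))
```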
If you're interested in postdoctoral research at the intersection of cognition, neuroscience, and AI, consider applying for the Kempner Fellowship! 👇🧠 Applications close October 9!
Applications are now open for postdoctoral research fellows at the Kempner Institute at Harvard. These are 3-year positions with independent funding and access to the amazing resources of the institute. Apply by October 9th 2023!
We also find the language regions respond to both lexical access and sentence-generation demands, which implies strong integration between lexico-semantic and combinatorial processes, mirroring the picture that has emerged in language comprehension. 5/7
Our results suggest that scalar inferences arise from context-driven expectations over alternatives, and these expectations operate at the level of concepts. These findings also highlight the role of linguistic prediction in pragmatic inference. 5/7
e.g.
DIRECT method: compare P(is) vs. P(are) given prefix "The keys to the cabinet..."
METALINGUISTIC method: give the prompt "Here is a sentence: The keys to the cabinet... What word is most likely to come next?" and compare P(is) vs. P(are)
These methods give different results! 2/5
Furthermore, while it is generally assumed that SIs arise through reasoning about unspoken alternatives, it remains debated whether humans reason about alternatives as linguistic forms, or at the level of concepts. 2/7
For both humans and machines, making inferences about failures is especially tricky, because failure on a task does not always indicate the absence of the underlying capacity. E.g., children often fail because they don’t understand the question. 4/8
We argue that task demands also play an important role in determining success or failure for LMs, especially when comparing models of different capacities. We evaluate 13 LMs on analogical reasoning, reflective reasoning, word prediction, and grammaticality judgments. 5/8
We test a shared mechanism for within- and cross-scale variation: context-driven expectations about the unspoken alternatives. Using LMs to approximate human predictive distributions, we find that SIs are captured by the expectedness of the strong scalemate as an alternative. 3/7
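(A sketch of the core measurement: how expected is the strong scalemate, e.g. "all", in context? The example sentence and model are invented stand-ins for our materials.)

```python
# Alternative-expectedness sketch: log P("all" | context) under an LM
# (GPT-2 here) as a proxy for how expected the strong scalemate is.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def next_word_logprob(prefix: str, word: str) -> float:
    """Log P(word | prefix); assumes `word` is a single GPT-2 token."""
    ids = tokenizer(prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[tokenizer.encode(word)[0]].item()

# Higher expectedness of "all" as an alternative should predict a
# stronger "some, but not all" inference for this context.
print(next_word_logprob("She ate some, if not", " all"))
```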
Evaluation methods with greater task demands yield lower performance than methods with reduced demands. This “demand gap” is most pronounced for models with fewer parameters and less training data. We discuss implications for emergence in LMs and task demands in children. 6/8
The trouble is, although we would like to infer an underlying psychological *construct*, we only have access to specific observable *evaluations* (e.g., a child's ability to answer a question about a character in a story, or a model's performance on a syntax benchmark). 3/8
We’re hosting 2 virtual meet-and-greet sessions, following the ICLR BAICS & Neuromatch “mind-matching” model, so you can meet your fellow workshop attendees and talk about shared interests. Register for the meet-and-greet here by Dec 10:
Finally, while some have hypothesized the existence of production-selective mechanisms, we find no evidence of brain regions that selectively support sentence generation. Instead, language regions respond overall more strongly during production than during comprehension. 6/7
We used an event description / object naming task that is standard in the behavioral literature, and included a variety of controls, like a low-level production task (nonword production), a visual event semantics condition, and some comprehension conditions. 3/7
A network of left frontal and temporal brain regions has been implicated in language comprehension & production, but what is the precise role of this ‘language network’ in production? Across 4 fMRI expts, we characterize the response of the lang regions to production demands. 2/7
In line with prior studies, sentence production elicited strong responses throughout the language network. We also show that production-related responses in the language network are robust to output modality (speaking vs. typing). (We made a cool scanner-safe keyboard! ⌨️) 4/7