Will Merrill Profile Banner
Will Merrill Profile
Will Merrill

@lambdaviking

2,157
Followers
583
Following
677
Media
1,276
Statuses

Ph.D. student @ NYU🗽 Theoretical aspects of NLP and LMs /nætʃɹəl/🇮🇸 + formal🤵 languages + TCS🧮

New York, NY
Joined October 2011
@lambdaviking
Will Merrill
2 months
✨Excited to finally drop our new paper: SSMs “look like” RNNs, but we show their statefulness is an illusion🪄🐇 Current SSMs cannot express basic state tracking, but a minimal change fixes this! 👀 w/ @jowenpetty , @Ashish_S_AI
23
208
1K
@lambdaviking
Will Merrill
1 year
📣 @Ashish_S_AI and I prove that transformers can be translated to sentences in first-order logic with majority-vote quantifiers (FOM). FOM is a symbolic language that can capture computation inside transformers!
12
95
490
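[Editor's note] To give a flavor of FOM, here are two toy sentences. The notation is an assumption loosely following the first-order-logic-with-majority literature, not necessarily the paper's exact syntax: Maj i. φ(i) holds when φ is true at more than half of the positions i, and Q_a(i) means position i carries token a.

```latex
% Toy FOM sentences (notation illustrative, a sketch).
% "More than half of the tokens are a":
\[ \mathrm{Maj}\, i.\; Q_a(i) \]
% Ordinary quantifiers mix in, e.g. "every b is eventually followed by an a":
\[ \forall i.\; \bigl( Q_b(i) \rightarrow \exists j.\; ( j > i \,\wedge\, Q_a(j) ) \bigr) \]
```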
@lambdaviking
Will Merrill
2 years
[1/6] Excited to share a year-long project re: theory of language understanding in LMs w/ @a_stadt , @tallinzen. TLDR: Judging entailments (NLI) can be reduced to LMing over "Gricean data"* ∴ Learning distribution (perfectly) => learning semantics
2
37
270
@lambdaviking
Will Merrill
3 years
Is it possible for GPT-n to "understand" the semantics of English? What about Python? I'm excited to finally share work formalizing this question! We give formal languages that are *provably* un-understandable by LMs (within our setup, at least)
8
51
233
@lambdaviking
Will Merrill
2 years
How do we understand logical reasoning in non-symbolic models like transformers? 📣New preprint w/ Ashish Sabharwal shows any transformer can be translated to a fixed-size first-order logic formula (with majority quantifiers)
2
34
195
@lambdaviking
Will Merrill
3 months
📢 Preprint: We can predict entailment relations from LM sentence co-occurrence prob. scores. These results suggest predicting sentence co-occurrence may be one way that next-word prediction leads to (partial) semantic representations in LMs🧵
5
27
166
@lambdaviking
Will Merrill
3 years
What inductive biases does training impose on transformers? We find that T5, RoBERTa, etc. are well-approximated by saturated transformers (simplified attention patterns), and explain how this arises during training. w/ @RamanujanVivek @yoavgo @royschwartzNLP @nlpnoah
1
32
165
@lambdaviking
Will Merrill
6 months
🐊📣Still @ NeurIPS? Come by our poster to hear about how chain of thought/scratchpad steps increase the computational power of transformers. Room 242, 4pm (M3L workshop)
1
27
157
@lambdaviking
Will Merrill
1 year
[1/n]📢 More work on the *computational model of transformers* w/ Ashish Sabharwal in TACL. Takeaway: transformers are limited to expressing highly parallelizable functions (formally, they are in the complexity class uniform TC0)
5
36
153
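[Editor's note] For readers outside circuit complexity, this is the standard background definition (not new to the paper): TC0 is the class of problems decided by constant-depth, polynomial-size circuit families with unbounded fan-in threshold gates.

```latex
% Standard definition, paraphrased: L is in TC^0 iff some circuit family
% {C_n} decides it with
\[
  \mathrm{depth}(C_n) = O(1), \qquad \mathrm{size}(C_n) = n^{O(1)},
\]
% where gates are unbounded fan-in AND, OR, NOT, and MAJORITY.
% "Uniform" additionally requires the map n -> C_n to be computable by a
% weak device (e.g., a logspace Turing machine), ruling out hard-coded
% advice in the circuit wiring.
```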
@lambdaviking
Will Merrill
8 months
[1/n] How does a chain of thought change the expressive power of transformers? New work w/ @Ashish_S_AI studies how adding CoT/decoding steps extends the problems solvable by transformers as a fn of the # of steps.
2
27
130
@lambdaviking
Will Merrill
3 years
oh btw, I'm excited to be joining @NYUDataScience as a PhD student this fall!
@chunkbardey
charlie
3 years
if u go to grad school that’s weird like most ppl don’t do that
71
1K
29K
13
2
124
@lambdaviking
Will Merrill
2 years
Curious about the circuits inside transformers? 🧐 📢 Our new work shows how (saturated) transformers can be simulated by *threshold circuits*. Equivalently, this bounds the problems saturated transformers can solve within the class TC0. w/ Ashish Sabharwal, @nlpnoah
1
19
122
@lambdaviking
Will Merrill
5 months
“The Expressive Power of Transformers with Chain of Thought” w/ @Ashish_S_AI will appear at ICLR 🇦🇹
@srush_nlp
Sasha Rush
7 months
Props to Will Merrill @lambdaviking for having already fully formalized my nonsense thoughts (also for generally writing extremely interesting papers)
1
17
144
6
14
97
@lambdaviking
Will Merrill
4 months
If you teach NLP, please keep teaching automata 👇
@dottxtai
.txt
4 months
⚡️ Speed up LLM inference by 5x. ⚡️ We introduce a new framework, coalescence, that makes structured generation several times faster than standard generation. Coalescence is very flexible, and raises unexpected questions 🧐
6
78
333
3
7
91
@lambdaviking
Will Merrill
5 years
I wrote a blog post summarizing Sequential Neural Networks as Automata
1
17
82
@lambdaviking
Will Merrill
1 month
Stop by "The Expressive Power of Chain of Thought" poster tomorrow! Wednesday 10:45am #294
1
15
78
@lambdaviking
Will Merrill
6 months
🐊📣 Stop by tomorrow to hear from @Ashish_S_AI and me about: 1) how transformers can be expressed in logic, and 2) what this means about what transformers *can't* do. Thursday @ 5pm, #1008
0
10
68
@lambdaviking
Will Merrill
7 months
Our NeurIPS paper shows transformers can be expressed in first-order logic with majority quantifiers => Extends the theory of the expressiveness and limitations of transformers => Provides a "programming language" capturing transformers' computation
@lambdaviking
Will Merrill
1 year
📣 @Ashish_S_AI and I prove that transformers can be translated to sentences in first-order logic with majority-vote quantifiers (FOM). FOM is a symbolic language that can capture computation inside transformers!
12
95
490
0
16
67
@lambdaviking
Will Merrill
1 year
This result acts as an upper bound: transformers can't solve problems that can't be defined in FOM. This reveals some new problems transformers can't solve and provides a handy intuitive test for seeing whether a transformer can do something: try to define the problem in FOM.
1
3
61
@lambdaviking
Will Merrill
2 months
Finally, we were excited to find a minimal change to SSMs that improves their expressive power for state tracking: make the A matrix input-dependent. Empirically, this allows them to learn hard state tracking just as well as RNNs!
9
6
57
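[Editor's note] A minimal numpy sketch of the change described in the tweet above. The names, shapes, and the linear parameterization W of A(x) are illustrative assumptions, not the paper's implementation: a standard linear SSM applies a fixed transition h_t = A h_{t-1} + B x_t, while the fix lets the transition matrix depend on the current input.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state = 3, 4
B = rng.normal(size=(d_state, d_in))
A_fixed = 0.5 * rng.normal(size=(d_state, d_state))
W = rng.normal(size=(d_state * d_state, d_in))  # hypothetical parameterization of A(x_t)

def ssm_fixed(xs):
    """Standard linear SSM: h_t = A h_{t-1} + B x_t, with A constant."""
    h = np.zeros(d_state)
    for x in xs:
        h = A_fixed @ h + B @ x
    return h

def ssm_input_dependent(xs):
    """The minimal change: h_t = A(x_t) h_{t-1} + B x_t, so successive
    state updates compose non-commutatively as a function of the input."""
    h = np.zeros(d_state)
    for x in xs:
        A_x = (W @ x).reshape(d_state, d_state)  # input-dependent transition
        h = A_x @ h + B @ x
    return h

xs = rng.normal(size=(10, d_in))
print(ssm_fixed(xs))
print(ssm_input_dependent(xs))
```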
@lambdaviking
Will Merrill
12 days
Love to see more theoretical work comparing the formal capabilities of SSMs and transformers (within TC0)!
@yashYRS
Yash Sarrof
13 days
We are excited to share our work on characterizing the expressivity of State Space Models (SSMs) with a theoretical lens, using a formal language framework, backed up by empirical findings. w/ Yana Veitsman, Dr. Michael Hahn. Paper link:
2
16
106
3
8
54
@lambdaviking
Will Merrill
2 months
We draw on theory to formalize hard state tracking problems and formally prove that SSMs, like transformers, cannot solve them. Empirically, SSMs and transformers struggle to learn hard state tracking, but RNNs learn it easily. We also propose a minimal fix.
1
0
46
@lambdaviking
Will Merrill
2 months
Thanks to the FLaNN Discord for recently inviting us to talk about this work. Recording: Also, stop by my poster at New England NLP next week!
1
4
45
@lambdaviking
Will Merrill
7 months
This is a good intuition to impart! There is no true “state” in a transformer (unlike DFA/RNN) and the # of state updates is bounded by the depth. We discuss this more formally here (+implications for what sequential stuff transformers can’t do):
@PreetumNakkiran
Preetum Nakkiran
7 months
One intuition for why Transformers have difficulty learning sequential tasks (eg parity), w/o scratchpad, is that they can only update their “internal state” in very restricted ways (as defined by Attention). In contrast to e.g RNNs, which can do essentially arbitrary updates.
2
16
138
0
7
43
@lambdaviking
Will Merrill
1 month
Was fun working on this! The cool takeaway imo is that we can characterize the type of reasoning that blank tokens can help with… it’s reduced compared to CoT but experiments show it’s likely more than with no extra tokens
@jacob_pfau
Jacob Pfau
1 month
Do models need to reason in words to benefit from chain-of-thought tokens? In our experiments, the answer is no! Models can perform on par with CoT using repeated '...' filler tokens. This raises alignment concerns: Using filler, LMs can do hidden reasoning not visible in CoT🧵
41
182
1K
2
4
40
@lambdaviking
Will Merrill
2 months
Inspired by formal language theory, we view state tracking as iterated multiplication in a monoid (whose elements represent state updates). The algebraic structure of the monoid determines the complexity of tracking state in a parallel computational model like a transformer.
1
2
40
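[Editor's note] A concrete instance of the monoid view in the tweet above, as a Python sketch (the setup and helper names are illustrative; the S5 encoding itself is standard): each input token is a permutation of 5 elements, and the state after t tokens is the product of the first t updates.

```python
from itertools import permutations
import random

# Elements of S5: permutations of (0,1,2,3,4); composition is the monoid op.
S5 = list(permutations(range(5)))

def compose(p, q):
    """Monoid operation: the permutation 'apply q first, then p'."""
    return tuple(p[q[i]] for i in range(5))

IDENTITY = tuple(range(5))

def track_state(updates):
    """State tracking as iterated multiplication: fold the updates
    left-to-right, exactly the one-token-at-a-time computation an RNN can do."""
    state = IDENTITY
    for u in updates:
        state = compose(u, state)
    return state

random.seed(0)
seq = [random.choice(S5) for _ in range(100)]
print(track_state(seq))  # the net permutation after 100 updates
```

Because S5 is non-solvable, this left-to-right fold provably cannot be flattened into constant depth (its word problem is NC1-complete), which is exactly the obstruction for TC0-bounded parallel models.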
@lambdaviking
Will Merrill
2 months
Our main theoretical result✨ is that linear state-space models (e.g., S4) and Mamba can only express computation in TC0. This means they cannot (exactly) solve hard state tracking problems like permutation composition, code eval, or chess!
1
4
40
@lambdaviking
Will Merrill
2 months
But can SSMs and transformers approximate state tracking in practice, despite our worst-case result? We show🧪 that they cannot on permutation composition: both transformers and SSMs require depth growing with sequence length, whereas RNNs need just 1 layer.
1
3
39
@lambdaviking
Will Merrill
2 months
In summary, we used theory to pin down “hard state tracking” and showed it poses a problem for current SSMs and transformers. Thus, the state in SSMs is an illusion! We proposed an SSM extension to overcome this and are eager to evaluate its practical viability.
1
0
37
@lambdaviking
Will Merrill
2 months
A canonical example of hard state tracking is *permutation composition*, or S5 (cf. Galois). We show “real” state tracking problems (code eval, chess with <source,target> notation) can be reduced to permutation composition. We thus use it to benchmark hard state tracking.
1
3
35
@lambdaviking
Will Merrill
1 year
Another contribution: we prove transformers need > log log n precision to have full expressive power over contexts of length n. With less precision, they cannot implement uniform attention! We hope this result can guide the development of long-context, low-precision LMs.
1
1
32
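[Editor's note] A back-of-the-envelope version of the bound in the tweet above, as a sketch under a simplifying assumption about floating point (that a p-bit float's exponent can reach about 2^p); the paper's argument is more careful.

```latex
% Smallest positive representable magnitude with p-bit floats (assumed):
\[ \epsilon_p \approx 2^{-2^{p}}. \]
% Uniform attention over n positions must represent the weight 1/n:
\[ \frac{1}{n} \;\ge\; \epsilon_p
   \iff 2^{p} \ge \log_2 n
   \iff p \ge \log_2 \log_2 n. \]
% With fewer than about log log n bits, the weight 1/n underflows to 0,
% so uniform attention cannot be implemented.
```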
@lambdaviking
Will Merrill
2 months
Some state tracking problems are known to be fully parallelizable (TC0) while others are inherently sequential (NC1-complete), requiring computation graphs of growing depth. The latter are easy for RNNs but can’t be represented by fixed-depth transformers.
2
1
31
@lambdaviking
Will Merrill
1 year
We also see natural implications of this work for understanding algorithms inside transformers ("mechanistic interpretability"). RASP/TRACR are languages for *compiling into* transformers. In contrast, any transformer can be *translated* to an FOM sentence.
1
0
30
@lambdaviking
Will Merrill
2 years
I'll be arriving in Abu Dhabi for EMNLP tomorrow! Would love to chat about formal semantics and LMs, expressive capacity/inductive biases of NNs/transformers, compositionality, or anything in between! Will respond to emails, Twitter DMs, carrier pigeons, etc.
4
0
29
@lambdaviking
Will Merrill
1 month
I’ll be at ICLR next week! 🇦🇹 Reach out if you’d like to talk about transformers and state-space models, training dynamics, etc.
2
0
25
@lambdaviking
Will Merrill
2 years
In Zürich Airport, ears perked for something cross-serial👂
3
0
24
@lambdaviking
Will Merrill
1 year
@_jasonwei @OpenAI I'm not sure what you mean by compositionality, but this example is a clear failure on the level of basic Shakespearean grammar. The subject/verb agreement is realllllly off (Should be: thou seekest, I know, shall guide)
0
0
24
@lambdaviking
Will Merrill
3 years
Our paper on the form/meaning debate has been updated thanks to discussions & outside input! v2 better reflects how understanding can be hard for an *LM*, but easy for a human. Thanks to Mark-Jan Nederhof and many others who shared their thoughts.
1
2
24
@lambdaviking
Will Merrill
1 year
RIP Drago. In the classes I took and TA'ed with him, Drago was a passionate, kind, and funny teacher and mentor. Gone too soon indeed.
@hmkyale
Harlan Krumholz
1 year
The #AI community, the #computerscience community, the @YaleSEAS community, and humanity have suddenly lost a remarkable person, @dragomir_radev - kind and brilliant, devoted to his family and friends... gone too soon. A sad day @Yale @YINSedge @YaleCompsci #NLP2023
41
87
389
0
3
24
@lambdaviking
Will Merrill
2 years
Today I learned: the context-sensitive languages are closed under complement. Via a fun, indirect argument that takes us away from the Chomsky hierarchy and into computational complexity (1/n)
1
2
23
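[Editor's note] The indirect argument being teased above is presumably the classical one, paraphrased here for reference: context-sensitive languages coincide with nondeterministic linear space (Kuroda), and nondeterministic space is closed under complement (Immerman–Szelepcsényi).

```latex
% Step 1 (Kuroda): CSLs are exactly the languages of linear-bounded automata:
\[ \mathsf{CSL} = \mathsf{NSPACE}(n). \]
% Step 2 (Immerman–Szelepcsényi): for any s(n) >= log n,
\[ \mathsf{NSPACE}(s(n)) = \mathsf{co\mbox{-}NSPACE}(s(n)). \]
% Hence the complement of any context-sensitive language is again in
% NSPACE(n) = CSL.
```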
@lambdaviking
Will Merrill
1 year
@katiedimartin Yes, how many languages do you speak?
1
0
22
@lambdaviking
Will Merrill
2 years
@_jasonwei So emergent phenomena are an empirical fact (and an interesting one, I agree). But it's a big jump to assume arbitrarily complex (~human-level) reasoning can emerge in transformers. Provably, transformers can only express shallow reasoning:
2
0
21
@lambdaviking
Will Merrill
2 months
Also see here for more details on the algebra behind the paper:
@jowenpetty
jackson petty
2 months
How does Galois theory help show that the state-tracking capabilities of current (!) SSMs are illusions? What makes S5 & A5 “hard”? And why do we consider A5 & friends here instead of S5? A thread on the algebra behind our paper!
2
23
130
1
0
22
@lambdaviking
Will Merrill
7 months
+1, the limitations of LMs workshop was fun and timely. Thanks to the organizers and other speakers! I spoke about complexity-theoretic limitations of transformers (vid will appear eventually). No photo of me in Bielefeld, but I did get a pic of another William
@LAWeissweiler
Leonie Weissweiler
7 months
I had a great time yesterday speaking about testing the limits of LLMs with Construction Grammar at a workshop on LLM limitations organised by @SAIL_network ! Thanks again to Özge Alacam, @bpaassen1 , and @MichielStraat for inviting me, and @lambdaviking for the fun company!
0
4
32
1
1
20
@lambdaviking
Will Merrill
1 year
[2/n] This implies a list of problems transformers cannot solve (under assumptions in footnotes):
4
1
19
@lambdaviking
Will Merrill
2 years
@yahave Semantic parsing is all you need
0
3
18
@lambdaviking
Will Merrill
11 months
Check out this poster if you’re interested in theoretical insights on the reasoning power and limitations of transformers! 👀
@davidweichiang
David Chiang
11 months
Our poster on "Tighter Bounds on the Expressivity of Transformer Encoders" has been rescheduled to Wednesday at 11am! Exhibit Hall 1 number 228 #ICML2023
1
1
26
0
4
18
@lambdaviking
Will Merrill
26 days
Forget GPT-4o, I'm just waiting for Chicha San Chen NYC to open😔
1
0
17
@lambdaviking
Will Merrill
7 months
Historical context is hard to get without a lot of experience or clear exposition (like this), but it can provide a broader perspective beyond the daily arXiv buzz. Also glad to see mention of the Stupid Backoff paper about "large language models" c. 2007:
@nsaphra
Naomi Saphra
7 months
It's not the first time! A dream team of @enfleisig (human eval expert), Adam Lopez (remembers the Stat MT era), @kchonyc (helped end it), and me (pun in title) are here to teach you the history of scale crises and what lessons we can take from them. 🧵
8
63
333
0
2
16
@lambdaviking
Will Merrill
2 years
Thanks to Meryl for covering our recent work on semantics and language models on the CDS blog! The paper proves entailment prediction can be reduced to language modeling, and shows how to extract entailment from an “ideal” LM. Check out the blog to learn more!
@NYUDataScience
NYU Data Science
2 years
Can language models learn meaning just by observing text? CDS PhD student William Merrill ( @lambdaviking ) and CDS Assistant Professor of Linguistics and Data Science Tal Linzen ( @tallinzen ) explore the question in a recent study. Read about it on our blog!
1
4
41
0
2
16
@lambdaviking
Will Merrill
2 years
Interested in foundational questions about the computational/linguistic abilities of neural nets? Check out our website/join our weekly remote talk series 🍮
@sleynas
Lena Strobl
2 years
FLaNN is online! 🍮 We organize weekly online seminars on Formal Languages 🤵 and Neural Networks 🧠 and related things. ✨ Visit our website to find out more! 🧑‍💻
1
13
46
1
1
15
@lambdaviking
Will Merrill
8 months
Thanks to Stephen for a great overview of our recent work on the reasoning limitations of transformers!
@NYUDataScience
NYU Data Science
8 months
In a recent #NeurIPS -accepted paper, CDS PhD student William Merrill ( @lambdaviking ), with @Ashish_S_AI at AI2, reveals the hidden limitations of transformer LLMs like #ChatGPT and how to detect their "hallucinations." #datascience #hallucinations
0
1
9
0
1
14
@lambdaviking
Will Merrill
6 months
I'll be at NeurIPS next week! Looking forward to chatting about the computational power + limitations of transformers, as well as other fundamental questions about LMs. Reach out if you'd like to chat! DMs open
1
0
14
@lambdaviking
Will Merrill
1 year
The coolest thing about GPT4 is I now have something to practice my broken Icelandic with
1
0
14
@lambdaviking
Will Merrill
8 months
Took a look today and this is very interesting stuff! The lower-bound direction is particularly cool: showing how LTL and counter-free automata can be simulated in a transformer through B-RASP.
@davidweichiang
David Chiang
8 months
New preprint! Dana Angluin, I, and Andy Yang @pentagonalize show that masked hard-attention transformers are exactly equivalent to the star-free regular languages.
2
16
74
1
2
13
@lambdaviking
Will Merrill
4 years
@tallinzen Not exactly what you're asking, but here's a dataset
1
1
12
@lambdaviking
Will Merrill
2 years
This result also solidifies the idea that (fixed-precision) transformer computation is "shallow": it can only nest a finite number of quantifiers (wrt input length), rather than recursing arbitrarily deep like a Turing machine.
2
3
12
@lambdaviking
Will Merrill
1 month
@egrefen @sleepinyourhat +1 the general sentiment that nothing mystical is happening in our paper: our choice of task is strongly motivated by theory + intuition about what synthetic tasks filler tokens could help on
1
0
11
@lambdaviking
Will Merrill
1 month
@jowenpetty Covered in syntax 1?
1
0
5
@lambdaviking
Will Merrill
1 year
Any Merrills interested in a replication study?
@kchonyc
Kyunghyun Cho
1 year
it took us two months to have this preprint archived... can you guess why? a fun project led by Won Ik Cho and Eunjung Cho! [Cho, Cho & Cho, 2023]
7
10
122
1
0
12
@lambdaviking
Will Merrill
6 months
@zouharvi (and it may still fail even then; Geiger et al. 2019) Or consider getting rid of the outer parentheses altogether
1
0
11
@lambdaviking
Will Merrill
1 year
[4/n] Our result suggests a *Parallelism Tradeoff*: parallelism makes transformers scalable but limits the complexity of their forward pass. Fundamentally serial computation must be broken down into a "chain" of parallelizable steps à la Scratchpad/CoT
1
1
12
@lambdaviking
Will Merrill
2 years
Second, formal analysis of transformers, showing limits on the functions they can express (w/ Ashish Sabharwal, @nlpnoah ):
0
3
11
@lambdaviking
Will Merrill
1 month
@Ashish_S_AI Also always excited to talk about state-space models and state tracking (Accepted at ICML 🇦🇹 w/ @jowenpetty @Ashish_S_AI )
1
1
11
@lambdaviking
Will Merrill
2 months
@srush_nlp Agree with the post that there is a distinction between (and often an implicit conflation of) behavioral and mechanistic induction heads. Having a behavioral definition seems more natural to me, followed by specific computational implementations of that def (eg on a transformer) 🧵
1
2
11
@lambdaviking
Will Merrill
1 year
The whole "SAT solver" thing seemed cool too but then I realized it was the boring kind of SAT
2
0
11
@lambdaviking
Will Merrill
2 months
@srush_nlp @jefrankle Sounds bad for rent
0
0
9
@lambdaviking
Will Merrill
2 years
[2/6] Specifically, the following relationship holds between text frequency and sentence entailment:
1
0
11
@lambdaviking
Will Merrill
2 years
* Gricean speaker = speaker who attempts to convey information efficiently to a listener. Think rational speech acts. This is a decent first-order model of human speech acts, but it would be interesting to see how extending it changes the theory!
1
1
10
@lambdaviking
Will Merrill
1 year
Does anyone have references for understanding the scaling of model size (# params) vs. context size (# tokens) for large language models? Are there standard/"optimal" ways to scale these in tandem? Or is context size bottlenecked by memory, etc. in practice, not scaling laws?
1
0
9
@lambdaviking
Will Merrill
1 year
@UnderwaterBepis @Ashish_S_AI This applies to full transformers, not just attention layers
0
0
10
@lambdaviking
Will Merrill
3 months
This looks super cool, will need to give it a careful read! Seems like an interesting implication of the softmax bottleneck, which @mattf1n is an expert on
@mattf1n
Matthew Finlayson
3 months
Wanna know gpt-3.5-turbo's embed size? We find a way to extract info from LLM APIs and estimate gpt-3.5-turbo’s embed size to be 4096. With the same trick we also develop 25x faster logprob extraction, audits for LLM APIs, and more! 📄 Here’s how 1/🧵
6
79
363
0
1
10
@lambdaviking
Will Merrill
6 months
Fruits of EMNLP is back at it again
@WeRateFruits
Fruit Guy
6 months
First durian in Singapore! Did I like it?......I'm still deciding.
2
4
22
0
0
10
@lambdaviking
Will Merrill
11 months
@CFGeek @sir_deenicus I’ve worked a lot on analyzing non-seq2seq transformers using circuit complexity. With realistic precision they are in TC0 - far from Turing complete! We’re currently working on extending the analysis to seq2seq transformers, which sounds a bit like what you’re suggesting
2
0
10
@lambdaviking
Will Merrill
1 year
@yoavgo Similar to “eigen-simulacra”, it’s misleading and just wrong to call conditional probabilities “amplitudes” (clear place where peer review would help). Ignoring the quantum fanboyism though, “simulator theory” reminds me of our model of LMs here:
1
0
10
@lambdaviking
Will Merrill
2 years
Takeaways:
1. An intuitive characterization of attention as majority quantification
2. Mechanistic interpretability: extracting and debugging the logical structure of a model
3. Efficiency: converting models to a format that is easier to work with at the hardware level
1
1
9
@lambdaviking
Will Merrill
7 months
@srush_nlp Thanks! Also recommend this concurrent related work:
0
0
9
@lambdaviking
Will Merrill
2 months
@srush_nlp ah, hadn't seen this post. I have some thoughts but am about to give a talk so will respond later today!
1
0
9
@lambdaviking
Will Merrill
6 months
@xuanalogue From an expressiveness point of view, one layer is basically a weighted finite automaton, which can express things like counting similar to LSTMs (requires log n precision in the state)
1
0
8
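[Editor's note] A toy illustration of the counting point in the reply above, as a sketch with an assumed 2-state encoding (not any particular paper's construction): the automaton's state is the vector (1, count), and a triangular per-symbol transition matrix accumulates the count.

```python
import numpy as np

# 2-state weighted automaton over {a, b}: state = (1, count_so_far).
# The 'a' matrix increments the counter; 'b' leaves it unchanged.
T = {
    "a": np.array([[1, 0],
                   [1, 1]]),   # (1, c) -> (1, c + 1)
    "b": np.eye(2, dtype=int), # (1, c) -> (1, c)
}

def run_wfa(string):
    state = np.array([1, 0])
    for ch in string:
        state = T[ch] @ state
    return state[1]  # the accumulated count of 'a's

assert run_wfa("abaab") == 3
```

After n steps the counter can reach n, so storing the state exactly requires about log n bits, matching the precision caveat in the reply.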
@lambdaviking
Will Merrill
2 years
[4/6] Thm2 can be taken to justify the Distributional Hypothesis. Text frequency (form) and meaning aren't orthogonal. Linguistic theory predicts: Learning distribution (perfectly) => learning semantics
1
1
9
@lambdaviking
Will Merrill
1 year
@generatorman_ai @Ashish_S_AI We mean bits per activation (similar but not quite the same thing). In other words, the precision used to carry out addition/multiplication
2
0
9
@lambdaviking
Will Merrill
2 months
@srush_nlp How to define the induction head behaviorally? It's something like: given `ab...a`, predict `b`. But this definition is underspecified in two ways: 1. b-underspec: a could occur many times with different b's 2. a-underspec: there are different suffix options for a
1
0
8
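[Editor's note] One way to pin down the behavioral definition from the tweet above in code, as a sketch; the two tie-breaking choices marked in the comments are exactly the underspecified points being discussed.

```python
def induction_head(tokens):
    """Behavioral induction head: given ...a b ... a, predict b.
    The choices below are precisely where the definition is underspecified:
      - b-underspec: earlier a's may be followed by different b's;
        here we arbitrarily take the most recent one.
      - a-underspec: we match only the single current token, though one
        could instead match a longer suffix ending in a.
    """
    a = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # most recent match first
        if tokens[i] == a:
            return tokens[i + 1]
    return None  # no earlier occurrence: behavior undefined

assert induction_head(list("xabya")) == "b"
```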
@lambdaviking
Will Merrill
2 years
[5/6] That said, it's unclear if large LMs trained on NL reflect Thm2. Our proofs analyze "ideal" LMs that perfectly fit their target distribution. Real LMs only approximate it, and even small noise greatly perturbs probabilities ~= 0.
1
0
9
@lambdaviking
Will Merrill
2 months
@typedfemale libertarian ideals are NC1-complete via Ayn Rand reductions (citation needed)
1
0
9
@lambdaviking
Will Merrill
2 months
@srush_nlp Yes but it’s diagonal and only input-dependent through delta. Turns out this isn’t enough to get greater expressive power
3
0
8
@lambdaviking
Will Merrill
2 years
[3/6] n-gram LMs trained on synthetic Gricean data learn to reflect Thm2 in their probability mass function:
1
0
8
@lambdaviking
Will Merrill
4 years
Awesome paper that goes beyond the Chomsky hierarchy to formalize RNNs' ability to represent bounded (!) hierarchical structure. In fact, RNNs and LSTMs can implement bounded stacks for Dyck languages in *optimal* space
@johnhewtt
John Hewitt
4 years
A simple communication complexity argument proves that O(m log k) hidden units is optimal -- even with unbounded computation (!!), it's impossible to use asymptotically fewer. That is, RNNs are fascinatingly well-suited (imo) to handling bounded-memory hierarchy.
2
0
6
0
0
8
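[Editor's note] A rough counting-argument version of the O(m log k) figure, as a sketch (the quoted thread's communication-complexity argument is the real proof): a Dyck-(k, m) prefix is summarized exactly by its stack, a string of at most m open brackets drawn from k types.

```latex
% Reachable stack configurations of Dyck-(k, m):
\[ \#\,\mathrm{configs} \;=\; \sum_{d=0}^{m} k^{d} \;=\; \Theta(k^{m}), \]
% and distinguishing them needs
\[ \log_2 \Theta(k^{m}) \;=\; \Theta(m \log k) \ \text{bits}, \]
% which is where the O(m log k) hidden-unit figure comes from.
```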
@lambdaviking
Will Merrill
1 year
@katiedimartin But more seriously, curious about the current generative take on innateness vs learnability (to what degree can universals be explained by what languages are easier to learn?)
0
0
8
@lambdaviking
Will Merrill
3 years
Considering throwing a pierogi symposium now (with beer ofc)
@AllEndlessKnot
The Endless Knot
3 years
The #ConnectedAtBirth #etymology of the week is SYMPOSIUM/BEER/PIEROGIES #wotd
0
7
11
0
1
8