CLS Profile
CLS

@ChengleiSi

2,022 Followers · 2,988 Following · 24 Media · 1,697 Statuses

vibing @stanfordnlp | real AGI is the friends we made along the way

Palo Alto, California
Joined August 2018
Pinned Tweet
@ChengleiSi
CLS
2 months
Thank you AK for sharing our Design2Code paper! Here’s my version of the story: To assess whether multimodal LLMs are ready to automate front-end engineering, we: - formalize the task of converting visual designs into code implementations - manually curate the Design2Code…
@_akhaliq
AK
2 months
Design2Code How Far Are We From Automating Front-End Engineering? Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development, in
Tweet media one
28
296
1K
3
40
173
@ChengleiSi
CLS
5 months
I saw debates on whether GPT-4V can “solve” compositionality, so I spent my precious Friday afternoon benchmarking it on Winoground. Tldr: NO it’s still far from solved (GPT-4V 38.0% vs PaLI 28.8% vs MTurk Humans 85.5%). Colab w/ all results: 🧵(1/n)
7
49
334
@ChengleiSi
CLS
2 years
New paper alert! GPT-3 is getting really popular and tons of applications are getting built with it. But before we deploy it in real-life, let’s first answer the important question: How reliable is GPT-3? (Hint: it can be more reliable than you think!) 🧵(1/n)
@_akhaliq
AK
2 years
Prompting GPT-3 To Be Reliable abs:
Tweet media one
1
37
209
5
56
306
@ChengleiSi
CLS
7 months
How can we humans verify the truthfulness of LLM outputs (or any claims you see on the Internet)? Should we ask ChatGPT ( #LLMs )? Search on Google (retrieval)? Are they complementary? Tldr: LLMs Help Humans Verify Truthfulness - Except When They Are Convincingly Wrong! 1/n
Tweet media one
7
47
234
@ChengleiSi
CLS
1 year
🚨You should be aware that LLMs like GPT-3 & 3.5 have strong feature biases! They prefer to use certain features over others, even when both features are equally predictive of the labels in the prompt. #ACL2023NLP 1/n
Tweet media one
7
46
207
@ChengleiSi
CLS
1 year
got asked a lot so will just update here: 1. yes I’m joining Stanford @stanfordnlp this fall for my PhD :) 2. I’m switching my research focus to 80% Human-AI Interaction + 20% AI Safety 3. I’m in Kigali for #ICLR2023 , come say hi!
17
7
196
@ChengleiSi
CLS
4 months
Long context LMs have been on the rise, but I keep wondering: do any tasks actually need super long contexts? 🤔 I’m somewhat convinced by an emerging line of “context compression” methods that shorten the contexts and retain the performance, examples in 🧵 (1/n)
15
32
182
@ChengleiSi
CLS
11 months
Combatting hallucination and improving factuality has become a rising #NLProc research topic in the ChatGPT era. Here’s a list of 10+ recent papers that I enjoyed reading, along with my brief notes: 🧵
4
25
127
@ChengleiSi
CLS
2 months
A couple new admits have been asking what researching at Stanford is like, here’s a thread of cool projects that my awesome cohort / lab mates did in their first 0.5 year here at @stanfordnlp 🧵
1
8
112
@ChengleiSi
CLS
6 months
We now have the largest #LLM prompt hacking / jailbreaking dataset crowdsourced from a global competition!
@learnprompting
Learn Prompting
6 months
A few months ago, we ran HackAPrompt, the first-ever global Prompt Hacking competition! Over 3K hackers submitted 600K malicious prompts to win $35K in prizes from companies like @PreambleAI , @OpenAI , & @huggingface We analyzed 29 different techniques & found a NEW exploit👇🧵
Tweet media one
9
92
378
3
1
43
@ChengleiSi
CLS
2 years
@katherine1ee That paper seems too long to track who wrote which paragraph, but I’ve notified some of the authors on that paper; I think someone will stand up for it and take whatever steps are needed/appropriate. But thanks for doing this - an important lesson for many!
2
0
34
@ChengleiSi
CLS
6 months
My biggest takeaway at #UIST2023 so far is that HCI people make really beautiful slides and demos!
0
2
32
@ChengleiSi
CLS
6 months
fun to read these side by side:
0
5
31
@ChengleiSi
CLS
3 years
Are you still using a single gold answer to evaluate your QA models? Try augmenting your answer set with aliases! Excited to share our new #EMNLP2021 paper: What's in a Name? Answer Equivalence For Open-Domain Question Answering (1/4)
Tweet media one
2
8
29
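For readers curious what alias-based evaluation looks like in practice, here is a minimal sketch of alias-augmented exact match; the normalization and matching rules below are a simplification, not the paper's full procedure.

```python
# A tiny sketch of alias-augmented exact match for open-domain QA evaluation:
# instead of one gold string, score against a set of acceptable aliases.
# The paper's actual alias sources and matching rules are richer than this.
import re
import string

def normalize(s: str) -> str:
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    # Correct if the prediction matches ANY alias of the gold answer.
    return normalize(prediction) in {normalize(g) for g in gold_answers}

# e.g. exact_match("NYC", ["New York City", "NYC", "New York"]) -> True
```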
@ChengleiSi
CLS
6 months
studying how well/poorly the API models work on static benchmarks surely sounds silly at this point. but designing human studies to rigorously analyze the impact they have on users and the broader society sounds like important research to me.
@deliprao
Delip Rao e/σ
6 months
A puzzling dynamic I still don't get is impoverished academics/students paying OpenAI to “understand” or “study properties/capabilities” of GPTs while OpenAI enriches understanding of their product for free. Not to mention all the free publicity/revenue that comes with the…
20
44
384
2
0
28
@ChengleiSi
CLS
5 months
I learned a ton about how to do research from Jordan back in the undergrad years and I’m super proud to see more undergrads from the team putting out great work!!
@boydgraber
Jordan Boyd-Graber
5 months
We learned on Thursday we needed to put a presentation together, and Sander did a great job. Sander's an undergrad, and this is Sander's first paper, first conference, and first conference talk. Joint work with @ChengleiSi .
Tweet media one
5
15
147
1
2
28
@ChengleiSi
CLS
5 months
oh in case you are looking for a paper link - no there is no paper, i think fun experiments like this should just stay as a twitter thread to make everybody's life easier. if you want to cite these results, just cite the Colab or the thread itself. 🙂
1
0
26
@ChengleiSi
CLS
2 years
@_jasonwei Or maybe sometimes you work on a less popular topic, just because you enjoy working on it, and you don’t really care how big the impact is gonna be 😃
1
0
25
@ChengleiSi
CLS
2 months
This project is co-led with my 💯labmate @StevenyzZhang , and in collaboration with @zhengyuan_yang ( @Microsoft ), @RuiboLiu ( @GoogleDeepMind ), and @Diyi_Yang ( @stanfordnlp ). Shout out to many friends and labmates @shi_weiyan @gaotianyu1350 @WilliamBarrHeld @rose_e_wang
4
0
22
@ChengleiSi
CLS
10 months
Come to Metropolitan Center in half an hour for one of the most fun talks you will see at #ACL2023 !
@ChengleiSi
CLS
1 year
🚨You should be aware that LLMs like GPT-3 & 3.5 have strong feature biases! They prefer to use certain features over others, even when both features are equally predictive of the labels in the prompt. #ACL2023NLP 1/n
Tweet media one
7
46
207
0
5
22
@ChengleiSi
CLS
4 months
@brunchavecmoi @WeijiaShi2 @eunsolc trained extractive and abstractive summarizers to condense the retrieved documents for more efficient RAG. On ODQA, they can compress top-5 retrieved documents into 5-10% length with minor performance drops. (2/n)
1
3
22
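A rough illustration of the compress-then-read recipe described in the tweet above, with an off-the-shelf summarizer standing in for the trained compressors; the model name, lengths, and the `reader` callable are all placeholders.

```python
# Illustrative pipeline for compress-then-read RAG: squeeze the top retrieved
# passages into a short summary before prompting the reader. The paper trains
# dedicated (query-aware) compressors; an off-the-shelf summarizer is used here
# purely as a stand-in.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def compress_then_read(question: str, passages: list[str], reader) -> str:
    context = "\n".join(passages[:5])  # top-5 retrieved documents
    summary = summarizer(
        context, max_length=120, min_length=30, truncation=True
    )[0]["summary_text"]
    prompt = f"Context: {summary}\n\nQuestion: {question}\nAnswer:"
    return reader(prompt)  # `reader` is any LLM call, e.g. an API wrapper
```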
@ChengleiSi
CLS
7 months
one of my fav reads of the year :)
@AlexTamkin
Alex Tamkin 🦣
7 months
Eliciting Human Preferences with Language Models Currently, people write detailed prompts to describe what they want a language model to do We explore *generative elicitation*—where models interactively ask for this information through open-ended conversation 1/
Tweet media one
4
84
459
1
3
21
@ChengleiSi
CLS
3 years
My internship work is accepted to #ACL2021NLP Findings. Updated draft, code and data will be released soon, stay tuned!
@KCrosner
Yiming Cui
3 years
I just got the mail and I am glad to announce that our paper “Benchmarking Robustness of Machine Reading Comprehension Models” is accepted to Findings of ACL. arXiv pre-print: #nlproc #acl2021nlp
Tweet media one
4
1
37
2
3
19
@ChengleiSi
CLS
1 month
very insightful thread on evaluating factuality
@gregd_nlp
Greg Durrett
1 month
This is a cool method, but "superhuman" is an overclaim based on the data shown. There are better datasets than FActScore for evaluating this: ExpertQA by @cmalaviya11 +al Factcheck-GPT by Yuxia Wang +al (+ same methodology) 🧵
3
26
183
0
0
17
@ChengleiSi
CLS
2 years
Here comes the 🤯 part - you can directly prepend a natural language instruction to tell the model not to be biased! When I told GPT-3 to not discriminate against any demographic group, it responds well and hugely reduces biases on BBQ! (7/n)
Tweet media one
Tweet media two
5
4
14
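As a concrete illustration of the intervention above, here is one way to prepend such an instruction to a BBQ-style multiple-choice prompt; the wording below is illustrative rather than the exact instruction used in the paper.

```python
# Illustrative only: prepending a debiasing instruction to a BBQ-style prompt,
# in the spirit of the intervention described above (the exact wording in the
# paper may differ).
INSTRUCTION = (
    "We should treat people of different socioeconomic statuses, religions, "
    "races, genders, nationalities, disabilities, and ages equally. When we do "
    "not have sufficient information, we should choose the unknown option "
    "rather than making assumptions based on stereotypes."
)

def build_prompt(context: str, question: str, options: list[str]) -> str:
    opts = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return f"{INSTRUCTION}\n\n{context}\n{question}\n{opts}\nAnswer:"
```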
@ChengleiSi
CLS
6 months
scalable oversight via debate!
@_julianmichael_
Julian Michael
6 months
As AIs improve at persuasion & argumentation, how do we ensure that they help us seek truth vs. just sounding convincing? In human experiments, we validate debate as a truth-seeking process, showing that it may soon be needed for supervising AI. Paper:
Tweet media one
9
42
226
1
0
15
@ChengleiSi
CLS
2 months
generating Wiki-style articles with proper citations from @EchoShao8899 :
@EchoShao8899
Yijia Shao
2 months
Can we teach LLMs to write long articles from scratch, grounded in trustworthy sources? Do Wikipedia editors think this can assist them? 📣Announcing STORM, a system that writes Wikipedia-like articles based on Internet search. I now use STORM in my daily research!🧵
41
205
1K
1
1
14
@ChengleiSi
CLS
2 months
mech interp benchmark from @aryaman2020 :
@aryaman2020
Aryaman Arora
2 months
New paper! 🫡 LM interpretability has made progress in finding feature representations using many methods, but we don’t know which ones are generally performant or reliable. We ( @jurafsky @ChrisGPotts ) introduce CausalGym, a benchmark of 29 linguistic tasks for interp! (1/n)
Tweet media one
6
45
284
2
1
14
@ChengleiSi
CLS
2 months
benchmark for understanding self-referential statements from @TristanThrush :
@TristanThrush
Tristan Thrush
4 months
📢 New paper!! Do LLMs understand self-referential statements? Introducing “I am a Strange Dataset”. All tested models perform around chance at our metalinguistic self-reference task. GPT-4 is the only model significantly above chance on all tests, but it is slight.🧵
Tweet media one
22
72
475
1
1
13
@ChengleiSi
CLS
23 days
very cool
@panickssery
Arjun Panickssery is in London
23 days
Are LLMs biased toward themselves? Frontier LLMs give higher scores to their own outputs in self-eval. We find evidence that this bias is caused by LLM's ability to recognize their own outputs This could interfere with safety techniques like reward modeling & constitutional AI
Tweet media one
8
46
321
0
0
13
@ChengleiSi
CLS
5 months
Last but not least, we thank @TristanThrush and @aryaman2020 for their generous sponsorship in buying me drinks last night at the EVGR pub. Also tagging @Francis_YAO_ @DrJimFan @giffmana @GaryMarcus who might be interested to know these results. Cheers! 🍻 (11/n, n=11)
1
0
13
@ChengleiSi
CLS
10 months
@universeinanegg Westin_CONFERENCE Password: acl2023
1
2
11
@ChengleiSi
CLS
1 year
If you are interested in this line of work, also check out Alex’s related paper: 8/n, n=8
@AlexTamkin
Alex Tamkin 🦣
1 year
What can go wrong when a language model's task is ambiguous? We look at this in our #ICLR2023 paper, inspired by a real-world GPT-3 failure! Task Ambiguity in Humans and Language Models 1/
Tweet media one
2
35
187
1
3
10
@ChengleiSi
CLS
15 days
amazing resources for culturally aware LLMs by the amazing @shi_weiyan
@shi_weiyan
Weiyan Shi
15 days
🚨New Paper🚨 We propose 1⃣CultureBank🌎 dataset sourced from TikTok & Reddit 2⃣An extensible pipeline to build cultural knowledge bases 3⃣Evaluation of LLMs’ cultural awareness 4⃣Insights into culturally-aware LLMs Project: Data:
Tweet media one
4
63
261
0
1
12
@ChengleiSi
CLS
11 months
8. Xue et al. ( @elgreco_winter ) propose reverse Chain-of-Thought: first prompt LLM to reconstruct the problem given the generated solution; then detect inconsistencies between the reconstructed and the original problems.
1
3
11
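A minimal sketch of the reverse Chain-of-Thought check summarized above, assuming an OpenAI-style chat client; the prompt templates and function names are illustrative, not the paper's.

```python
# Sketch of reverse Chain-of-Thought verification as summarized above.
# Assumes the OpenAI Python client >= 1.0; prompts are illustrative templates.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def reverse_cot_check(problem: str, solution: str) -> str:
    # Step 1: reconstruct the problem from the generated solution alone.
    reconstructed = ask(
        "Here is a step-by-step solution. Write the problem statement it most "
        f"likely answers, and nothing else.\n\nSolution:\n{solution}"
    )
    # Step 2: compare the reconstructed problem against the original one and
    # flag conditions that were dropped, altered, or hallucinated.
    return ask(
        "Compare the two problem statements below. List any inconsistencies "
        "(missing conditions, changed numbers, extra assumptions), or say "
        f"'CONSISTENT'.\n\nOriginal:\n{problem}\n\nReconstructed:\n{reconstructed}"
    )
```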
@ChengleiSi
CLS
5 months
I tested on three different settings. In the first setting, I provide the image and ask GPT-4V to select the matching caption. In this example, GPT-4V should select caption_0 (A) for image_0 and caption_1 (B) for image_1, which it did correctly, along with explanations. (2/n)
Tweet media one
1
1
11
@ChengleiSi
CLS
2 years
We analyze and improve reliability from four core facets: 1) OOD generalization (domain transfer + challenge sets + spurious correlation); 2) social biases; 3) uncertainty calibration; 4) knowledge updating. (2/n)
Tweet media one
1
1
11
@ChengleiSi
CLS
2 months
analyzing whether LMs can learn “impossible” languages from @JulieKallini :
@JulieKallini
Julie Kallini ✨
4 months
Do LLMs learn impossible languages (that humans wouldn’t be able to acquire) just as well as they learn possible human languages? We find evidence that they don’t! Check out our new paper… 💥 Mission: Impossible Language Models 💥 ArXiv: 🧵
Tweet media one
12
114
478
1
0
11
@ChengleiSi
CLS
11 months
Current status of #LLMs evaluation: 😵‍💫😮‍💨🤔
@Francis_YAO_
Yao Fu
11 months
Is Falcon really better than LLaMA? Short take: probably not. Longer take: we reproduced LLaMA 65B eval on MMLU and we got 61.4, close to the official number (63.4), much higher than its Open LLM Leaderboard number (48.8), and clearly higher than Falcon (52.7). Code and prompt…
34
128
722
0
0
10
@ChengleiSi
CLS
5 months
On 100 test examples, GPT-4V gets 62.0% accuracy (random acc is 25.0% because you need to select the right image for both captions to be correct on each example). (3/n)
1
1
10
@ChengleiSi
CLS
5 months
On the 100 test examples, GPT-4V scored 38.0%, much better than prior SOTA set by PaLI (28.8% group score), but much much worse than MTurk human performance (85.5% group score) reported in the Winoground paper. (10/n)
1
1
10
@ChengleiSi
CLS
2 years
@arankomatsuzaki Where does that GPT3 350B come from?
1
0
9
@ChengleiSi
CLS
2 months
Bonus: if you join, you'll get to chill with us and go out for fun every weekend! 😀
1
0
9
@ChengleiSi
CLS
5 months
This is way better than prior SOTAs (PaLI 46.5%, UNITER_large 38.0%; altho they scored each image-caption pair matching separately and selected the better match while we just gave GPT-4V both candidates to choose from). (4/n)
1
1
9
@ChengleiSi
CLS
2 years
@MarekRei
Marek Rei
2 years
Investigating memorisation versus generalisation in pre-trained language models. Great work by @michael__tanzer , in collaboration with @seb_ruder and myself. Accepted to #ACL2022 , already available on ArXiv: #NLProc #MachineLearning
Tweet media one
4
28
111
0
0
9
@ChengleiSi
CLS
11 months
10. Last but not least, turns out decoding strategy also matters! @WeijiaShi2 @XiaochuangHan et al. adapt the idea of contrastive decoding to amplify the difference between the output probabilities when a model is used with and without context.
1
3
9
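A toy greedy-decoding sketch of the idea above: contrast the next-token distributions with and without the retrieved context, and upweight tokens that the context supports. The model, prompt layout, and mixing weight `alpha` are illustrative, not the authors' exact formulation.

```python
# Minimal context-contrastive decoding sketch (illustrative, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def context_aware_decode(context: str, question: str, max_new_tokens=32, alpha=1.0):
    generated = ""
    for _ in range(max_new_tokens):
        with_ctx = tok(context + "\n" + question + generated, return_tensors="pt")
        without_ctx = tok(question + generated, return_tensors="pt")
        logits_ctx = model(**with_ctx).logits[0, -1]     # next-token logits with context
        logits_plain = model(**without_ctx).logits[0, -1]  # next-token logits without context
        # Amplify the difference: upweight tokens whose probability rises
        # when the context is present.
        adjusted = (1 + alpha) * logits_ctx - alpha * logits_plain
        next_id = int(torch.argmax(adjusted))
        if next_id == tok.eos_token_id:
            break
        generated += tok.decode([next_id])
    return generated
```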
@ChengleiSi
CLS
2 years
@TristanThrush Would be interesting to see how this compares to more efficient alternatives such as
@jungokasai
Jungo Kasai 笠井淳吾 @NeurIPS2023
2 years
How well can GPT-3/QA models answer questions on real-time events (e.g., # homeruns by #Ohtani )? RealTime QA @realtimeqa regularly announces questions and evaluates systems. Weekly for now. Paper (w/ past month results): Website:
Tweet media one
2
27
118
0
0
9
@ChengleiSi
CLS
2 months
probing bias and fairness of preference tuning from @michaelryan207 :
@michaelryan207
Michael Ryan
2 months
Aligned LLMs should be helpful, harmless, and adopt user preferences. But whose preferences are we aligning to and what are unintended effects on global representation? We find SFT and Preference Tuning steer LLMs towards US English use and opinions. 🧵
Tweet media one
5
53
208
1
0
9
@ChengleiSi
CLS
4 months
Similarly, @iofu728 et al. first retrieve the most important documents from all contexts by ranking each doc’s avg perplexity conditioned on the question; then further filter down to the most important tokens, where they compute token importance by contrastive perplexity, (3/n)
1
0
8
@ChengleiSi
CLS
2 years
We need more careful readers like Jason
@zhansheng
Jason Phang
2 years
I scanned through the paper quickly because I was very struck by how good the zero-shot results are (better than 175B models). But after some digging, I think this is the reason: The paper considers these prompts to be zero-shot:
Tweet media one
1
3
74
0
0
8
@ChengleiSi
CLS
2 years
All code, data, and model predictions are available at: Thanks for reading this super long thread! (19/n; n=19)
2
1
8
@ChengleiSi
CLS
5 months
Lastly, we test GPT-4V following the exact same protocol as how the Winoground paper tested on MTurk crowdworkers so the results are directly comparable. Specifically, we show an image and a caption and ask GPT-4V whether it is a correct match (binary yes/no). (7/n)
1
0
7
@ChengleiSi
CLS
7 months
admirable effort in putting together a very useful benchmark on NLP/LLMs for education!
@rose_e_wang
Rose
7 months
Ever wonder how experienced math teachers & tutors compare to ChatGPT or GPT4 in teaching students? 🖥️🧑‍🎓👩‍🏫 Check out our new paper “Step-by-Step Remediation of Students’ Mathematical Mistakes”! 📜 🖥️ from @stanfordnlp @StanfordEd
2
22
51
1
2
7
@ChengleiSi
CLS
4 months
i.e., how much does conditioning on the question reduce the conditional prob of the token. On multi-doc QA and long-context benchmarks, such compressions lead to slightly higher acc than full prompts with 15-25% original lengths. (4/n)
1
0
7
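Putting tweets (3/n) and (4/n) together, here is a rough sketch of contrastive, question-conditioned token scoring with a small Hugging Face model; the prompt format and keep-ratio are illustrative, and the actual method adds document-level ranking and budget control on top.

```python
# Rough sketch: a context token matters more if conditioning on the question
# raises its probability. Model choice, prompt layout, and keep_ratio are
# illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def token_logprobs(prefix: str, doc: str) -> torch.Tensor:
    """Log-prob of each doc token given the prefix and preceding doc tokens."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    doc_ids = tok(doc, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, doc_ids], dim=1)
    logprobs = model(ids).logits.log_softmax(-1)
    start = prefix_ids.shape[1]
    # Logits at position i predict token i+1; grab predictions for the doc span.
    preds = logprobs[0, start - 1 : ids.shape[1] - 1]
    return preds.gather(1, doc_ids[0].unsqueeze(1)).squeeze(1)

def compress(question: str, doc: str, keep_ratio: float = 0.25) -> str:
    with_q = token_logprobs(f"Question: {question}\nDocument: ", doc)
    without_q = token_logprobs("Document: ", doc)
    importance = with_q - without_q  # contrastive score per document token
    k = max(1, int(keep_ratio * len(importance)))
    keep = importance.topk(k).indices.sort().values
    doc_ids = tok(doc, return_tensors="pt").input_ids[0]
    return tok.decode(doc_ids[keep])
```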
@ChengleiSi
CLS
4 months
@PengXu51108979 et al. compare retrieval and long-context head-to-head. Retrieving top-5 chunks (with off-the-shelf dense retrievers) to fit into LLaMA2-70B-4K can be comparable to feeding the original long contexts into LLaMA2-70B-16K for QA. (5/n)
1
0
7
@ChengleiSi
CLS
6 months
@aryaman2020 my heart belongs to research
1
0
7
@ChengleiSi
CLS
5 months
In the second setting, I provide the caption and ask GPT-4V to select the matching image. In this example, GPT-4V should select image_0 (A) for caption_0 and image_1 (B) for caption_1, which it did correctly and showcased the ability to do counting. (5/n)
Tweet media one
1
0
7
@ChengleiSi
CLS
5 months
So the model has to answer ‘yes’ to all the correct matches and ‘no’ to all the wrong matches (which is arguably harder than the previous two settings of just selecting the better match between two options). (8/n)
1
0
7
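To make the scoring rule concrete, here is a small sketch of group-level accuracy under this binary yes/no protocol; `judge(image, caption) -> bool` is a placeholder for the actual GPT-4V query.

```python
# Group-level scoring for the binary yes/no protocol described above: an
# example counts as correct only if the model says "yes" to both matching
# image-caption pairs and "no" to both mismatched ones.
from itertools import product

def group_correct(example, judge) -> bool:
    images = [example["image_0"], example["image_1"]]
    captions = [example["caption_0"], example["caption_1"]]
    for i, j in product(range(2), range(2)):
        should_match = (i == j)  # image_k pairs with caption_k
        if judge(images[i], captions[j]) != should_match:
            return False
    return True

def group_score(examples, judge) -> float:
    return sum(group_correct(ex, judge) for ex in examples) / len(examples)
```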
@ChengleiSi
CLS
5 months
In this setting, GPT-4V gets 61.0% accuracy (again, random chance is 25.0%). This is somewhat impressive because matching the image given the caption is known to be much harder than matching the caption given the image (see the Winoground paper). For reference, PaLI gets 38.0%. (6/n)
1
0
7
@ChengleiSi
CLS
8 months
Very cool work: human-AI (InstructGPT) co-writing could lead to homogenization!
@vishakh_pk
Vishakh Padmakumar
8 months
Does Writing with Language Models Reduce Content Diversity? TL;DR: Yes! But it depends on which language model you use 🤖🕵️ Sharing work with my advisor @hhexiy : Code/data: #NLProc #paper #LLMs
Tweet media one
6
36
150
0
1
7
@ChengleiSi
CLS
7 months
However, humans over-rely on ChatGPT explanations — they trust ChatGPT’s answers even when they are wrong, resulting in below-random accuracy on such cases, much worse than both the baseline and retrieval conditions. 4/n
Tweet media one
3
1
7
@ChengleiSi
CLS
7 months
In our new paper: We ask crowdworkers to fact-check claims in several experiment conditions: Baseline (just show the claims), Retrieval from Wiki, ChatGPT Explanation, Contrastive Explanation (ChatGPT self-debate), and Retrieval + Explanation. 2/n
Tweet media one
1
0
6
@ChengleiSi
CLS
5 months
Go check out the poster for HackAPrompt!
@learnprompting
Learn Prompting
5 months
Currently at 41C in the back!
0
0
5
0
1
6
@ChengleiSi
CLS
2 years
For spurious correlation, on both MNLI -> HANS and QQP -> PAWS, GPT-3 doesn’t exploit the shortcuts like the supervised models and generalizes much better! (5/n)
Tweet media one
2
0
6
@ChengleiSi
CLS
2 years
The current state of NLP research: hottest paper debunked in a few hours
@denny_zhou
Denny Zhou
2 years
I don’t think there is magic here: text-davinci-002 and other 002 models in GPT-3, and instruct GPT should have been finetuned with "let's think step by step ... ". I tried 001 models in GPT3 and none of them work with this kind of prompt while CoT still works.
8
13
112
1
0
6
@ChengleiSi
CLS
5 months
Example below shows a wrong model prediction because it should answer ‘Yes’ to the match between image_0 and caption_0. (9/n)
Tweet media one
2
0
6
@ChengleiSi
CLS
2 years
@sarahwiegreffe Pretty sure both text-davinci-001 and text-davinci-002 are Instruct models (and they are being updated over time); “davinci” is the original NeurIPS version, and is static.
1
0
6
@ChengleiSi
CLS
1 month
🔥
@jyangballin
John Yang @ ICLR 🇦🇹
1 month
SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on SWE-bench, takes 93 seconds on avg + it's open source! We designed a new agent-computer interface to make it easy for GPT-4 to edit+run code
Tweet media one
68
434
2K
0
0
6
@ChengleiSi
CLS
2 years
@srchvrs @arankomatsuzaki respect to the authors for the effort 🫡
0
0
6
@ChengleiSi
CLS
7 months
Contrastive explanation makes people more cautious, but lowers human decision accuracy in cases where the non-contrastive explanation would have been correct. Somewhat surprisingly, showing both retrieval and explanation is no better than just showing retrieval alone! 5/n
Tweet media one
1
0
6
@ChengleiSi
CLS
7 months
On the surface, showing retrieved paragraphs and showing ChatGPT explanations enable similar human decision accuracy, both significantly better than the baseline with no evidence, while reading ChatGPT explanations is much faster. 3/n
Tweet media one
1
0
6
@ChengleiSi
CLS
3 years
Also check out another of our #ACL2021NLP Findings papers: Better Robustness by More Coverage: Adversarial Training with Mixup Augmentation for Robust Fine-tuning. preprint: code: .
@ChengleiSi
CLS
3 years
My internship work is accepted to #ACL2021NLP Findings. Updated draft, code and data will be released soon, stay tuned!
2
3
19
1
2
6
@ChengleiSi
CLS
2 months
0
0
5
@ChengleiSi
CLS
1 year
Someone should sponsor AK for his service!
@_akhaliq
AK
1 year
Thinking of retiring from paper tweets it’s pretty time intensive on top of a full time job, I had a good run so far. Plus all the companies offering similar services now
221
58
2K
0
0
4
@ChengleiSi
CLS
9 months
@dingzeyuli There’s also an AI+HCI workshop
0
0
5
@ChengleiSi
CLS
2 years
@MarekRei @Michael__Tanzer @seb_ruder Nice work! Quick question: would the three phases always happen, and have similar durations for different datasets/models? One interesting contrast is that training longer on MNLI actually keeps improving OOD acc on HANS (; Fig 1)
1
1
5
@ChengleiSi
CLS
4 months
Along this line, Ge et al. proposed the In-context Autoencoder (ICAE) for context compression. ICAE consists of an encoder and a decoder. The encoder is a LoRA-adapted LLM, used for encoding the original long context into a few memory tokens. (10/n)
1
0
5
@ChengleiSi
CLS
4 months
Would love to see any experiment results showing such counterexamples! Last but not least, shout out to @aryaman2020 @xiuyu_l @StevenyzZhang for helpful discussion! (14/n, n=14)
0
0
5
@ChengleiSi
CLS
2 years
@LChoshen You should come up with a way for people to cite your Twitter thread 😂
1
0
5
@ChengleiSi
CLS
2 years
This work has also benefitted tremendously from the feedback of @zhansheng , @sewon__min , @akyurekekin , @danfriedman0 , @jieyuzhao11 , @AliciaVParrish , @sulin_blodgett , @ihsgnef , @henryzhao4321 , and many other friends! (18/n)
1
0
5
@ChengleiSi
CLS
2 years
Facet 2: Social Biases On WinoBias ( @jieyuzhao11 et al.) and BBQ ( @AliciaVParrish @sleepinyourhat et al.), including anti-stereotypical examples (e.g., “She is a doctor.” as opposed to “He is a doctor.” ) to balance the prompt significantly reduces biases! (6/n)
Tweet media one
1
0
5
@ChengleiSi
CLS
2 years
Facet 1: Generalization. On MRQA domain transfer, with demos from the source domain, GPT-3 generalizes to different target domain test sets with negligible accuracy drops - GPT-3 is insensitive to domain differences! (3/n)
1
0
5
@ChengleiSi
CLS
5 months
@aryaman2020 boba taste exposed
2
0
5
@ChengleiSi
CLS
4 months
They segment long contexts and recursively generate summary vectors which are passed as soft prompts to subsequent segments. The training objective is language modeling conditioned on prev tokens in the current segment and accumulated summary vectors from prev segments. (7/n)
1
0
5
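A toy illustration of the segment-recursive summary-vector idea described above, using GPT-2 embeddings as a stand-in; the number of summary slots and the query embeddings are made up for the sketch, and this is not the actual architecture or training code.

```python
# Toy sketch: process a long input segment by segment, carrying a few "summary"
# hidden states forward as soft prompts for the next segment. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
N_SUMMARY = 4  # number of summary vectors carried between segments (made up)

# Learnable placeholder embeddings that ask the model to "summarize" a segment.
summary_queries = torch.nn.Parameter(
    torch.randn(1, N_SUMMARY, model.config.n_embd) * 0.02
)

@torch.no_grad()
def encode_segments(segments):
    summaries = []  # accumulated summary vectors from previous segments
    for seg in segments:
        seg_emb = model.transformer.wte(tok(seg, return_tensors="pt").input_ids)
        # Input = summaries so far + current segment + summary query slots.
        inputs = torch.cat(summaries + [seg_emb, summary_queries], dim=1)
        hidden = model(inputs_embeds=inputs, output_hidden_states=True).hidden_states[-1]
        # Hidden states at the query positions become this segment's summary.
        summaries.append(hidden[:, -N_SUMMARY:, :])
    return torch.cat(summaries, dim=1)  # soft prompt for a downstream LM objective
```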
@ChengleiSi
CLS
1 year
Swing by my poster this afternoon at MH1-2-3-4 #148 , 4:30 - 6:30pm CAT!
@ChengleiSi
CLS
2 years
New paper alert! GPT-3 is getting really popular and tons of applications are getting built with it. But before we deploy it in real-life, let’s first answer the important question: How reliable is GPT-3? (Hint: it can be more reliable than you think!) 🧵(1/n)
5
56
306
0
1
5
@ChengleiSi
CLS
5 months
@ZhengxuanZenWu boba and beers are the driving force behind all my research these days 🫡
0
0
5
@ChengleiSi
CLS
2 years
I’m starting to get why some people think prompt engineering is not ‘real research’...
@arankomatsuzaki
Aran Komatsuzaki
2 years
Large Language Models are Zero-Shot Reasoners Simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3.
Tweet media one
59
572
3K
0
0
4
@ChengleiSi
CLS
2 years
@arankomatsuzaki 111 pages is no joke
1
0
4
@ChengleiSi
CLS
5 months
@giffmana 🤦‍♂️
0
0
3
@ChengleiSi
CLS
5 months
@aryaman2020 correction: coconut pudding, not boba.
2
0
4
@ChengleiSi
CLS
11 months
6. In a similar vein, @du_yilun et al. show each LLM the other LLMs’ responses after each round and ask them to revise accordingly, and repeat until they reach agreement, which improves reasoning and factual accuracy.
1
0
4
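A bare-bones sketch of that multi-round debate loop; `ask(agent, prompt)` is a placeholder for a per-agent LLM call, and the exact-string agreement check is deliberately naive.

```python
# Multi-agent debate sketch: each agent sees the other agents' latest answers,
# revises its own, and the loop stops once all answers agree (or rounds run out).
def debate(question, agents, ask, max_rounds=3):
    answers = {a: ask(a, question) for a in agents}
    for _ in range(max_rounds):
        if len(set(answers.values())) == 1:  # everyone agrees -> stop
            break
        new_answers = {}
        for a in agents:
            others = "\n".join(f"- {answers[b]}" for b in agents if b != a)
            new_answers[a] = ask(
                a,
                f"{question}\n\nOther agents answered:\n{others}\n\n"
                "Considering their reasoning, give your revised final answer.",
            )
        answers = new_answers
    return answers
```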