CLS Profile
CLS

@ChengleiSi

2,022 Followers · 2,988 Following · 24 Media · 1,697 Statuses

vibing @stanfordnlp | real AGI is the friends we made along the way

Palo Alto, California
Joined August 2018
Pinned Tweet
@ChengleiSi
CLS
2 months
Thank you AK for sharing our Design2Code paper! Here’s my version of the story: To assess whether multimodal LLMs are ready to automate front-end engineering, we: - formalize the task of converting visual designs into code implementations - manually curate the Design2Code…
@_akhaliq
AK
2 months
Design2Code How Far Are We From Automating Front-End Engineering? Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development, in
Tweet media one
28
296
1K
3
40
173
@ChengleiSi
CLS
5 months
I saw debates on whether GPT-4V can “solve” compositionality, so I spent my precious Friday afternoon benchmarking it on Winoground. Tldr: NO it’s still far from solved (GPT-4V 38.0% vs PaLI 28.8% vs MTurk Humans 85.5%). Colab w/ all results: 🧵(1/n)
7
49
334
@ChengleiSi
CLS
2 years
New paper alert! GPT-3 is getting really popular and tons of applications are getting built with it. But before we deploy it in real-life, let’s first answer the important question: How reliable is GPT-3? (Hint: it can be more reliable than you think!) 🧵(1/n)
@_akhaliq
AK
2 years
Prompting GPT-3 To Be Reliable abs:
Tweet media one
1
37
209
5
56
306
@ChengleiSi
CLS
7 months
How can we humans verify the truthfulness of LLM outputs (or any claims you see on the Internet)? Should we ask ChatGPT ( #LLMs )? Search on Google (retrieval)? Are they complementary? Tldr: LLMs Help Humans Verify Truthfulness - Except When They Are Convincingly Wrong! 1/n
Tweet media one
7
47
234
@ChengleiSi
CLS
1 year
🚨You should be aware that LLMs like GPT-3 & 3.5 have strong feature biases! They prefer to use certain features over others, even when both features are equally predictive of the labels in the prompt. #ACL2023NLP 1/n
Tweet media one
7
46
207
@ChengleiSi
CLS
1 year
got asked a lot so will just update here: 1. yes I’m joining Stanford @stanfordnlp this fall for my PhD :) 2. I’m switching my research focus to 80% Human-AI Interaction + 20% AI Safety 3. I’m in Kigali for #ICLR2023 , come say hi!
17
7
196
@ChengleiSi
CLS
4 months
Long context LMs have been on the rise, but I keep wondering: do any tasks actually need super long contexts? 🤔 I’m somewhat convinced by an emerging line of “context compression” methods that shorten the contexts and retain the performance, examples in 🧵 (1/n)
15
32
182
@ChengleiSi
CLS
11 months
Combatting hallucination and improving factuality has become a rising #NLProc research topic in the ChatGPT era. Here’s a list of 10+ recent papers that I enjoyed reading, along with my brief notes: 🧵
4
25
127
@ChengleiSi
CLS
2 months
A couple new admits have been asking what researching at Stanford is like, here’s a thread of cool projects that my awesome cohort / lab mates did in their first 0.5 year here at @stanfordnlp 🧵
1
8
112
@ChengleiSi
CLS
6 months
We now have the largest #LLM prompt hacking / jailbreaking dataset crowdsourced from a global competition!
@learnprompting
Learn Prompting
6 months
A few months ago, we ran HackAPrompt, the first-ever global Prompt Hacking competition! Over 3K hackers submitted 600K malicious prompts to win $35K in prizes from companies like @PreambleAI , @OpenAI , & @huggingface We analyzed 29 different techniques & found a NEW exploit👇🧵
Tweet media one
9
92
378
3
1
43
@ChengleiSi
CLS
2 years
@katherine1ee That paper seems too long to track who wrote which paragraph, but I’ve notified some of the authors on that paper; I think someone will stand up for it and take whatever steps are needed/appropriate. But thanks for doing this - an important lesson for many!
2
0
34
@ChengleiSi
CLS
6 months
My biggest takeaway at #UIST2023 so far is that HCI people make really beautiful slides and demos!
0
2
32
@ChengleiSi
CLS
6 months
fun to read these side by side:
0
5
31
@ChengleiSi
CLS
3 years
Are you still using a single gold answer to evaluate your QA models? Try augmenting your answer set with aliases! Excited to share our new #EMNLP2021 paper: What's in a Name? Answer Equivalence For Open-Domain Question Answering (1/4)
Tweet media one
2
8
29
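For readers curious what alias-based evaluation looks like in practice, here is a minimal sketch of alias-augmented exact match; the normalization and matching rules below are a simplification, not the paper's full procedure.

```python
# A tiny sketch of alias-augmented exact match for open-domain QA evaluation:
# instead of one gold string, score against a set of acceptable aliases.
# The paper's actual alias sources and matching rules are richer than this.
import re
import string

def normalize(s: str) -> str:
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    # Correct if the prediction matches ANY alias of the gold answer.
    return normalize(prediction) in {normalize(g) for g in gold_answers}

# e.g. exact_match("NYC", ["New York City", "NYC", "New York"]) -> True
```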
@ChengleiSi
CLS
6 months
studying how well/poorly the API models work on static benchmarks surely sounds silly at this point. but designing human studies to rigorously analyze the impact they have on users and the broader society sounds like important research to me.
@deliprao
Delip Rao e/σ
6 months
A puzzling dynamic I still don't get is impoverished academics/students paying OpenAI to “understand” or “study properties/capabilities” of GPTs while OpenAI enriches understanding of their product for free. Not to mention all the free publicity/revenue that comes with the…
20
44
384
2
0
28
@ChengleiSi
CLS
5 months
I learned a ton about how to do research from Jordan back in the undergrad years and I’m super proud to see more undergrads from the team putting out great work!!
@boydgraber
Jordan Boyd-Graber
5 months
We learned on Thursday we needed to put a presentation together, and Sander did a great job. Sander's an undergrad, and this is Sander's first paper, first conference, and first conference talk. Joint work with @ChengleiSi .
Tweet media one
5
15
147
1
2
28
@ChengleiSi
CLS
5 months
oh in case you are looking for a paper link - no there is no paper, i think fun experiments like this should just stay as a twitter thread to make everybody's life easier. if you want to cite these results, just cite the Colab or the thread itself. 🙂
1
0
26
@ChengleiSi
CLS
2 years
@_jasonwei Or maybe sometimes you work on a less popular topic, just because you enjoy working on it, and you don’t really care how big the impact is gonna be 😃
1
0
25
@ChengleiSi
CLS
2 months
This project is co-led with my 💯labmate @StevenyzZhang , and in collaboration with @zhengyuan_yang ( @Microsoft ), @RuiboLiu ( @GoogleDeepMind ), and @Diyi_Yang ( @stanfordnlp ). Shout out to many friends and labmates @shi_weiyan @gaotianyu1350 @WilliamBarrHeld @rose_e_wang
4
0
22
@ChengleiSi
CLS
10 months
Come to Metropolitan Center in half an hour for one of the most fun talks you will see at #ACL2023 !
@ChengleiSi
CLS
1 year
🚨You should be aware that LLMs like GPT-3 & 3.5 have strong feature biases! They prefer to use certain features over others, even when both features are equally predictive of the labels in the prompt. #ACL2023NLP 1/n
Tweet media one
7
46
207
0
5
22
@ChengleiSi
CLS
4 months
@brunchavecmoi @WeijiaShi2 @eunsolc trained extractive and abstractive summarizers to condense the retrieved documents for more efficient RAG. On ODQA, they can compress top-5 retrieved documents into 5-10% length with minor performance drops. (2/n)
1
3
22
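A rough illustration of the compress-then-read recipe described in the tweet above, with an off-the-shelf summarizer standing in for the trained compressors; the model name, lengths, and the `reader` callable are all placeholders.

```python
# Illustrative pipeline for compress-then-read RAG: squeeze the top retrieved
# passages into a short summary before prompting the reader. The paper trains
# dedicated (query-aware) compressors; an off-the-shelf summarizer is used here
# purely as a stand-in.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def compress_then_read(question: str, passages: list[str], reader) -> str:
    context = "\n".join(passages[:5])  # top-5 retrieved documents
    summary = summarizer(
        context, max_length=120, min_length=30, truncation=True
    )[0]["summary_text"]
    prompt = f"Context: {summary}\n\nQuestion: {question}\nAnswer:"
    return reader(prompt)  # `reader` is any LLM call, e.g. an API wrapper
```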
@ChengleiSi
CLS
7 months
one of my fav reads of the year :)
@AlexTamkin
Alex Tamkin 🦣
7 months
Eliciting Human Preferences with Language Models Currently, people write detailed prompts to describe what they want a language model to do We explore *generative elicitation*—where models interactively ask for this information through open-ended conversation 1/
Tweet media one
4
84
459
1
3
21
@ChengleiSi
CLS
3 years
My internship work is accepted to #ACL2021NLP Findings. Updated draft, code and data will be released soon, stay tuned!
@KCrosner
Yiming Cui
3 years
I just got the mail and I am glad to announce that our paper “Benchmarking Robustness of Machine Reading Comprehension Models” is accepted to Findings of ACL. arXiv pre-print: #nlproc #acl2021nlp
Tweet media one
4
1
37
2
3
19
@ChengleiSi
CLS
1 month
very insightful thread on evaluating factuality
@gregd_nlp
Greg Durrett
1 month
This is a cool method, but "superhuman" is an overclaim based on the data shown. There are better datasets than FActScore for evaluating this: ExpertQA by @cmalaviya11 +al Factcheck-GPT by Yuxia Wang +al (+ same methodology) 🧵
3
26
183
0
0
17
@ChengleiSi
CLS
2 years
Here comes the 🤯 part - you can directly prepend a natural language instruction to tell the model not to be biased! When I told GPT-3 to not discriminate against any demographic group, it responds well and hugely reduces biases on BBQ! (7/n)
Tweet media one
Tweet media two
5
4
14
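As a concrete illustration of the intervention above, here is one way to prepend such an instruction to a BBQ-style multiple-choice prompt; the wording below is illustrative rather than the exact instruction used in the paper.

```python
# Illustrative only: prepending a debiasing instruction to a BBQ-style prompt,
# in the spirit of the intervention described above (the exact wording in the
# paper may differ).
INSTRUCTION = (
    "We should treat people of different socioeconomic statuses, religions, "
    "races, genders, nationalities, disabilities, and ages equally. When we do "
    "not have sufficient information, we should choose the unknown option "
    "rather than making assumptions based on stereotypes."
)

def build_prompt(context: str, question: str, options: list[str]) -> str:
    opts = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return f"{INSTRUCTION}\n\n{context}\n{question}\n{opts}\nAnswer:"
```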
@ChengleiSi
CLS
6 months
scalable oversight via debate!
@_julianmichael_
Julian Michael
6 months
As AIs improve at persuasion & argumentation, how do we ensure that they help us seek truth vs. just sounding convincing? In human experiments, we validate debate as a truth-seeking process, showing that it may soon be needed for supervising AI. Paper:
Tweet media one
9
42
226
1
0
15
@ChengleiSi
CLS
2 months
generating Wiki-style articles with proper citations from @EchoShao8899 :
@EchoShao8899
Yijia Shao
2 months
Can we teach LLMs to write long articles from scratch, grounded in trustworthy sources? Do Wikipedia editors think this can assist them? 📣Announcing STORM, a system that writes Wikipedia-like articles based on Internet search. I now use STORM in my daily research!🧵
41
205
1K
1
1
14
@ChengleiSi
CLS
2 months
mech interp benchmark from @aryaman2020 :
@aryaman2020
Aryaman Arora
2 months
New paper! 🫡 LM interpretability has made progress in finding feature representations using many methods, but we don’t know which ones are generally performant or reliable. We ( @jurafsky @ChrisGPotts ) introduce CausalGym, a benchmark of 29 linguistic tasks for interp! (1/n)
Tweet media one
6
45
284
2
1
14
@ChengleiSi
CLS
2 months
benchmark for understanding self-referential statements from @TristanThrush :
@TristanThrush
Tristan Thrush
4 months
📢 New paper!! Do LLMs understand self-referential statements? Introducing “I am a Strange Dataset”. All tested models perform around chance at our metalinguistic self-reference task. GPT-4 is the only model significantly above chance on all tests, but it is slight.🧵
Tweet media one
22
72
475
1
1
13
@ChengleiSi
CLS
23 days
very cool
@panickssery
Arjun Panickssery is in London
23 days
Are LLMs biased toward themselves? Frontier LLMs give higher scores to their own outputs in self-eval. We find evidence that this bias is caused by LLM's ability to recognize their own outputs This could interfere with safety techniques like reward modeling & constitutional AI
Tweet media one
8
46
321
0
0
13
@ChengleiSi
CLS
5 months
Last but not least, we thank @TristanThrush and @aryaman2020 for their generous sponsorship in buying me drinks last night at the EVGR pub. Also tagging @Francis_YAO_ @DrJimFan @giffmana @GaryMarcus who might be interested to know these results. Cheers! 🍻 (11/n, n=11)
1
0
13
@ChengleiSi
CLS
10 months
@universeinanegg Westin_CONFERENCE Password: acl2023
1
2
11
@ChengleiSi
CLS
1 year
If you are interested in this line of work, also check out Alex’s related paper: 8/n, n=8
@AlexTamkin
Alex Tamkin 🦣
1 year
What can go wrong when a language model's task is ambiguous? We look at this in our #ICLR2023 paper, inspired by a real-world GPT-3 failure! Task Ambiguity in Humans and Language Models 1/
Tweet media one
2
35
187
1
3
10
@ChengleiSi
CLS
15 days
amazing resources for culturally aware LLMs by the amazing @shi_weiyan
@shi_weiyan
Weiyan Shi
15 days
🚨New Paper🚨 We propose 1⃣CultureBank🌎 dataset sourced from TikTok & Reddit 2⃣An extensible pipeline to build cultural knowledge bases 3⃣Evaluation of LLMs’ cultural awareness 4⃣Insights into culturally-aware LLMs Project: Data:
Tweet media one
4
63
261
0
1
12
@ChengleiSi
CLS
11 months
8. Xue et al. ( @elgreco_winter ) propose reverse Chain-of-Thought: first prompt LLM to reconstruct the problem given the generated solution; then detect inconsistencies between the reconstructed and the original problems.
1
3
11
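A minimal sketch of the reverse Chain-of-Thought check summarized above, assuming an OpenAI-style chat client; the prompt templates and function names are illustrative, not the paper's.

```python
# Sketch of reverse Chain-of-Thought verification as summarized above.
# Assumes the OpenAI Python client >= 1.0; prompts are illustrative templates.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def reverse_cot_check(problem: str, solution: str) -> str:
    # Step 1: reconstruct the problem from the generated solution alone.
    reconstructed = ask(
        "Here is a step-by-step solution. Write the problem statement it most "
        f"likely answers, and nothing else.\n\nSolution:\n{solution}"
    )
    # Step 2: compare the reconstructed problem against the original one and
    # flag conditions that were dropped, altered, or hallucinated.
    return ask(
        "Compare the two problem statements below. List any inconsistencies "
        "(missing conditions, changed numbers, extra assumptions), or say "
        f"'CONSISTENT'.\n\nOriginal:\n{problem}\n\nReconstructed:\n{reconstructed}"
    )
```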
@ChengleiSi
CLS
5 months
I tested on three different settings. In the first setting, I provide the image and ask GPT-4V to select the matching caption. In this example, GPT-4V should select caption_0 (A) for image_0 and caption_1 (B) for image_1, which it did correctly, along with explanations. (2/n)
Tweet media one
1
1
11
@ChengleiSi
CLS
2 years
We analyze and improve reliability from four core facets: 1) OOD generalization (domain transfer + challenge sets + spurious correlation); 2) social biases; 3) uncertainty calibration; 4) knowledge updating. (2/n)
Tweet media one
1
1
11
@ChengleiSi
CLS
2 months
analyzing whether LMs can learn “impossible” languages from @JulieKallini :
@JulieKallini
Julie Kallini ✨
4 months
Do LLMs learn impossible languages (that humans wouldn’t be able to acquire) just as well as they learn possible human languages? We find evidence that they don’t! Check out our new paper… 💥 Mission: Impossible Language Models 💥 ArXiv: 🧵
Tweet media one
12
114
478
1
0
11
@ChengleiSi
CLS
11 months
Current status of #LLMs evaluation: 😵‍💫😮‍💨🤔
@Francis_YAO_
Yao Fu
11 months
Is Falcon really better than LLaMA? Short take: probably not. Longer take: we reproduced LLaMA 65B eval on MMLU and we got 61.4, close to the official number (63.4), much higher than its Open LLM Leaderboard number (48.8), and clearly higher than Falcon (52.7). Code and prompt…
34
128
722
0
0
10
@ChengleiSi
CLS
5 months
On 100 test examples, GPT-4V gets 62.0% accuracy (random acc is 25.0% because you need to select the right image for both captions to be correct on each example). (3/n)
1
1
10
@ChengleiSi
CLS
5 months
On the 100 test examples, GPT-4V scored 38.0%, much better than prior SOTA set by PaLI (28.8% group score), but much much worse than MTurk human performance (85.5% group score) reported in the Winoground paper. (10/n)
1
1
10
@ChengleiSi
CLS
2 years
@arankomatsuzaki Where does that GPT3 350B come from?
1
0
9
@ChengleiSi
CLS
2 months
Bonus: if you join, you'll get to chill with us and go out for fun every weekend! 😀
1
0
9
@ChengleiSi
CLS
5 months
This is way better than prior SOTAs (PaLI 46.5%, UNITER_large 38.0%; altho they scored each image-caption pair matching separately and selected the better match while we just gave GPT-4V both candidates to choose from). (4/n)
1
1
9
@ChengleiSi
CLS
2 years
@MarekRei
Marek Rei
2 years
Investigating memorisation versus generalisation in pre-trained language models. Great work by @michael__tanzer , in collaboration with @seb_ruder and myself. Accepted to #ACL2022 , already available on ArXiv: #NLProc #MachineLearning
Tweet media one
4
28
111
0
0
9
@ChengleiSi
CLS
11 months
10. Last but not least, turns out decoding strategy also matters! @WeijiaShi2 @XiaochuangHan et al. adapt the idea of contrastive decoding to amplify the difference between the output probabilities when a model is used with and without context.
1
3
9
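A toy greedy-decoding sketch of the idea above: contrast the next-token distributions with and without the retrieved context, and upweight tokens that the context supports. The model, prompt layout, and mixing weight `alpha` are illustrative, not the authors' exact formulation.

```python
# Minimal context-contrastive decoding sketch (illustrative, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def context_aware_decode(context: str, question: str, max_new_tokens=32, alpha=1.0):
    generated = ""
    for _ in range(max_new_tokens):
        with_ctx = tok(context + "\n" + question + generated, return_tensors="pt")
        without_ctx = tok(question + generated, return_tensors="pt")
        logits_ctx = model(**with_ctx).logits[0, -1]     # next-token logits with context
        logits_plain = model(**without_ctx).logits[0, -1]  # next-token logits without context
        # Amplify the difference: upweight tokens whose probability rises
        # when the context is present.
        adjusted = (1 + alpha) * logits_ctx - alpha * logits_plain
        next_id = int(torch.argmax(adjusted))
        if next_id == tok.eos_token_id:
            break
        generated += tok.decode([next_id])
    return generated
```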
@ChengleiSi
CLS
2 years
@TristanThrush Would be interesting to see how this compares to more efficient alternatives such as
@jungokasai
Jungo Kasai 笠井淳吾 @NeurIPS2023
2 years
How well can GPT-3/QA models answer questions on real-time events (e.g., # homeruns by #Ohtani )? RealTime QA @realtimeqa regularly announces questions and evaluates systems. Weekly for now. Paper (w/ past month results): Website:
Tweet media one
2
27
118
0
0
9
@ChengleiSi
CLS
2 months
probing bias and fairness of preference tuning from @michaelryan207 :
@michaelryan207
Michael Ryan
2 months
Aligned LLMs should be helpful, harmless, and adopt user preferences. But whose preferences are we aligning to and what are unintended effects on global representation? We find SFT and Preference Tuning steer LLMs towards US English use and opinions. 🧵
Tweet media one
5
53
208
1
0
9
@ChengleiSi
CLS
4 months
Similarly, @iofu728 et al. first retrieve the most important documents from all contexts by ranking each doc’s avg perplexity conditioned on the question; then further filter down to the most important tokens, where they compute token importance by contrastive perplexity, (3/n)
1
0
8
@ChengleiSi
CLS
2 years
We need more careful readers like Jason
@zhansheng
Jason Phang
2 years
I scanned through the paper quickly because I was very struck by how good the zero-shot results are (better than 175B models). But after some digging, I think this is the reason: The paper considers these prompts to be zero-shot:
Tweet media one
1
3
74
0
0
8
@ChengleiSi
CLS
2 years
All code, data, and model predictions are available at: Thanks for reading this super long thread! (19/n; n=19)
2
1
8
@ChengleiSi
CLS
5 months
Lastly, we test GPT-4V following the exact same protocol as how the Winoground paper tested on MTurk crowdworkers so the results are directly comparable. Specifically, we show an image and a caption and ask GPT-4V whether it is a correct match (binary yes/no). (7/n)
1
0
7
@ChengleiSi
CLS
7 months
admirable effort in putting together a very useful benchmark on NLP/LLMs for education!
@rose_e_wang
Rose
7 months
Ever wonder how experienced math teachers & tutors compare to ChatGPT or GPT4 in teaching students? 🖥️🧑‍🎓👩‍🏫 Check out our new paper “Step-by-Step Remediation of Students’ Mathematical Mistakes”! 📜 🖥️ from @stanfordnlp @StanfordEd
2
22
51
1
2
7
@ChengleiSi
CLS
4 months
i.e., how much does conditioning on the question reduce the conditional prob of the token. On multi-doc QA and long-context benchmarks, such compressions lead to slightly higher acc than full prompts with 15-25% original lengths. (4/n)
1
0
7
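Putting tweets (3/n) and (4/n) together, here is a rough sketch of contrastive, question-conditioned token scoring with a small Hugging Face model; the prompt format and keep-ratio are illustrative, and the actual method adds document-level ranking and budget control on top.

```python
# Rough sketch: a context token matters more if conditioning on the question
# raises its probability. Model choice, prompt layout, and keep_ratio are
# illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def token_logprobs(prefix: str, doc: str) -> torch.Tensor:
    """Log-prob of each doc token given the prefix and preceding doc tokens."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    doc_ids = tok(doc, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, doc_ids], dim=1)
    logprobs = model(ids).logits.log_softmax(-1)
    start = prefix_ids.shape[1]
    # Logits at position i predict token i+1; grab predictions for the doc span.
    preds = logprobs[0, start - 1 : ids.shape[1] - 1]
    return preds.gather(1, doc_ids[0].unsqueeze(1)).squeeze(1)

def compress(question: str, doc: str, keep_ratio: float = 0.25) -> str:
    with_q = token_logprobs(f"Question: {question}\nDocument: ", doc)
    without_q = token_logprobs("Document: ", doc)
    importance = with_q - without_q  # contrastive score per document token
    k = max(1, int(keep_ratio * len(importance)))
    keep = importance.topk(k).indices.sort().values
    doc_ids = tok(doc, return_tensors="pt").input_ids[0]
    return tok.decode(doc_ids[keep])
```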
@ChengleiSi
CLS
4 months
@PengXu51108979 et al. compare retrieval and long-context head-to-head. Retrieving top-5 chunks (with off-the-shelf dense retrievers) to fit into LLaMA2-70B-4K can be comparable to feeding the original long contexts into LLaMA2-70B-16K for QA. (5/n)
1
0
7
@ChengleiSi
CLS
6 months
@aryaman2020 my heart belongs to research
1
0
7
@ChengleiSi
CLS
5 months
In the second setting, I provide the caption and ask GPT-4V to select the matching image. In this example, GPT-4V should select image_0 (A) for caption_0 and image_1 (B) for caption_1, which it did correctly and showcased the ability to do counting. (5/n)
Tweet media one
1
0
7
@ChengleiSi
CLS
5 months
So the model has to answer ‘yes’ to all the correct matches and ‘no’ to all the wrong matches (which is arguably harder than the previous two settings of just selecting the better match between two options). (8/n)
1
0
7
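To make the scoring rule concrete, here is a small sketch of group-level accuracy under this binary yes/no protocol; `judge(image, caption) -> bool` is a placeholder for the actual GPT-4V query.

```python
# Group-level scoring for the binary yes/no protocol described above: an
# example counts as correct only if the model says "yes" to both matching
# image-caption pairs and "no" to both mismatched ones.
from itertools import product

def group_correct(example, judge) -> bool:
    images = [example["image_0"], example["image_1"]]
    captions = [example["caption_0"], example["caption_1"]]
    for i, j in product(range(2), range(2)):
        should_match = (i == j)  # image_k pairs with caption_k
        if judge(images[i], captions[j]) != should_match:
            return False
    return True

def group_score(examples, judge) -> float:
    return sum(group_correct(ex, judge) for ex in examples) / len(examples)
```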
@ChengleiSi
CLS
5 months
In this setting, GPT-4V gets 61.0% accuracy (again, random chance is 25.0%). This is somewhat impressive because matching the image given the caption is known to be much harder than matching the caption given the image (see the Winoground paper). For reference, PaLI gets 38.0%. (6/n)
1
0
7
@ChengleiSi
CLS
8 months
Very cool work: human-AI (InstructGPT) co-writing could lead to homogenization!
@vishakh_pk
Vishakh Padmakumar
8 months
Does Writing with Language Models Reduce Content Diversity? TL;DR: Yes! But it depends on which language model you use 🤖🕵️ Sharing work with my advisor @hhexiy : Code/data: #NLProc #paper #LLMs
Tweet media one
6
36
150
0
1
7
@ChengleiSi
CLS
7 months
However, humans over-rely on ChatGPT explanations — they trust ChatGPT’s answers even when they are wrong, resulting in below-random accuracy on such cases, much worse than both the baseline and retrieval conditions. 4/n
Tweet media one
3
1
7
@ChengleiSi
CLS
7 months
In our new paper: We ask crowdworkers to fact-check claims in several experiment conditions: Baseline (just show the claims), Retrieval from Wiki, ChatGPT Explanation, Contrastive Explanation (ChatGPT self-debate), and Retrieval + Explanation. 2/n
Tweet media one
1
0
6
@ChengleiSi
CLS
5 months
Go check out the poster for HackAPrompt!
@learnprompting
Learn Prompting
5 months
Currently at 41C in the back!
0
0
5
0
1
6
@ChengleiSi
CLS
2 years
For spurious correlation, on both MNLI -> HANS and QQP -> PAWS, GPT-3 doesn’t exploit the shortcuts like the supervised models and generalizes much better! (5/n)
Tweet media one
2
0
6
@ChengleiSi
CLS
2 years
The current state of NLP research: hottest paper debunked in a few hours
@denny_zhou
Denny Zhou
2 years
I don’t think there is magic here: text-davinci-002 and other 002 models in GPT-3, and instruct GPT should have been finetuned with "let's think step by step ... ". I tried 001 models in GPT3 and none of them work with this kind of prompt while CoT still works.
8
13
112
1
0
6
@ChengleiSi
CLS
5 months
Example below shows a wrong model prediction because it should answer ‘Yes’ to the match between image_0 and caption_0. (9/n)
Tweet media one
2
0
6
@ChengleiSi
CLS
2 years
@sarahwiegreffe Pretty sure both text-davinci-001 and text-davinci-002 are Instruct models (and they are being updated over time); “davinci” is the original NeurIPS version, and is static.
1
0
6
@ChengleiSi
CLS
1 month
🔥
@jyangballin
John Yang @ ICLR 🇦🇹
1 month
SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on SWE-bench, takes 93 seconds on avg + it's open source! We designed a new agent-computer interface to make it easy for GPT-4 to edit+run code
Tweet media one
68
434
2K
0
0
6
@ChengleiSi
CLS
2 years
@srchvrs @arankomatsuzaki respect to the authors for the effort 🫡
0
0
6
@ChengleiSi
CLS
7 months
Contrastive explanation makes people more cautious, but lowers human decision accuracy in cases where the non-contrastive explanation would have been correct. Somewhat surprisingly, showing both retrieval and explanation is no better than just showing retrieval alone! 5/n
Tweet media one
1
0
6
@ChengleiSi
CLS
7 months
On the surface, showing retrieved paragraphs and showing ChatGPT explanations enable similar human decision accuracy, both significantly better than the baseline with no evidence, while reading ChatGPT explanations is much faster. 3/n
Tweet media one
1
0
6
@ChengleiSi
CLS
3 years
Also check out another of our #ACL2021NLP Findings papers: Better Robustness by More Coverage: Adversarial Training with Mixup Augmentation for Robust Fine-tuning. preprint: code: .
@ChengleiSi
CLS
3 years
My internship work is accepted to #ACL2021NLP Findings. Updated draft, code and data will be released soon, stay tuned!
2
3
19
1
2
6
@ChengleiSi
CLS
2 months
0
0
5
@ChengleiSi
CLS
1 year
Someone should sponsor AK for his service!
@_akhaliq
AK
1 year
Thinking of retiring from paper tweets it’s pretty time intensive on top of a full time job, I had a good run so far. Plus all the companies offering similar services now
221
58
2K
0
0
4
@ChengleiSi
CLS
9 months
@dingzeyuli There’s also an AI+HCI workshop
0
0
5
@ChengleiSi
CLS
2 years
@MarekRei @Michael__Tanzer @seb_ruder Nice work! Quick question: would the three phases always happen, and have similar durations for different datasets/models? One interesting contrast is that training longer on MNLI actually keeps improving OOD acc on HANS (; Fig 1)
1
1
5
@ChengleiSi
CLS
4 months
Along this line, Ge et al. proposed the In-context Autoencoder (ICAE) for context compression. ICAE consists of an encoder and a decoder. The encoder is a LoRA-adapted LLM, used for encoding the original long context into a few memory tokens. (10/n)
1
0
5
@ChengleiSi
CLS
4 months
Would love to see any experiment results showing such counterexamples! Last but not least, shout out to @aryaman2020 @xiuyu_l @StevenyzZhang for helpful discussion! (14/n, n=14)
0
0
5
@ChengleiSi
CLS
2 years
@LChoshen You should come up with a way for people to cite your Twitter thread 😂
1
0
5
@ChengleiSi
CLS
2 years
This work has also benefitted tremendously from the feedback of @zhansheng , @sewon__min , @akyurekekin , @danfriedman0 , @jieyuzhao11 , @AliciaVParrish , @sulin_blodgett , @ihsgnef , @henryzhao4321 , and many other friends! (18/n)
1
0
5
@ChengleiSi
CLS
2 years
Facet 2: Social Biases On WinoBias ( @jieyuzhao11 et al.) and BBQ ( @AliciaVParrish @sleepinyourhat et al.), including anti-stereotypical examples (e.g., “She is a doctor.” as opposed to “He is a doctor.” ) to balance the prompt significantly reduces biases! (6/n)
Tweet media one
1
0
5
@ChengleiSi
CLS
2 years
Facet 1: Generalization. On MRQA domain transfer, with demos from the source domain, GPT-3 generalizes to different target domain test sets with negligible accuracy drops - GPT-3 is insensitive to domain differences! (3/n)
1
0
5
@ChengleiSi
CLS
5 months
@aryaman2020 boba taste exposed
2
0
5
@ChengleiSi
CLS
4 months
They segment long contexts and recursively generate summary vectors which are passed as soft prompts to subsequent segments. The training objective is language modeling conditioned on prev tokens in the current segment and accumulated summary vectors from prev segments. (7/n)
1
0
5
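A toy illustration of the segment-recursive summary-vector idea described above, using GPT-2 embeddings as a stand-in; the number of summary slots and the query embeddings are made up for the sketch, and this is not the actual architecture or training code.

```python
# Toy sketch: process a long input segment by segment, carrying a few "summary"
# hidden states forward as soft prompts for the next segment. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
N_SUMMARY = 4  # number of summary vectors carried between segments (made up)

# Learnable placeholder embeddings that ask the model to "summarize" a segment.
summary_queries = torch.nn.Parameter(
    torch.randn(1, N_SUMMARY, model.config.n_embd) * 0.02
)

@torch.no_grad()
def encode_segments(segments):
    summaries = []  # accumulated summary vectors from previous segments
    for seg in segments:
        seg_emb = model.transformer.wte(tok(seg, return_tensors="pt").input_ids)
        # Input = summaries so far + current segment + summary query slots.
        inputs = torch.cat(summaries + [seg_emb, summary_queries], dim=1)
        hidden = model(inputs_embeds=inputs, output_hidden_states=True).hidden_states[-1]
        # Hidden states at the query positions become this segment's summary.
        summaries.append(hidden[:, -N_SUMMARY:, :])
    return torch.cat(summaries, dim=1)  # soft prompt for a downstream LM objective
```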
@ChengleiSi
CLS
1 year
Swing by my poster this afternoon at MH1-2-3-4 #148 , 4:30 - 6:30pm CAT!
@ChengleiSi
CLS
2 years
New paper alert! GPT-3 is getting really popular and tons of applications are getting built with it. But before we deploy it in real-life, let’s first answer the important question: How reliable is GPT-3? (Hint: it can be more reliable than you think!) 🧵(1/n)
5
56
306
0
1
5
@ChengleiSi
CLS
5 months
@ZhengxuanZenWu boba and beers are the driving force behind all my research these days 🫡
0
0
5
@ChengleiSi
CLS
2 years
I’m starting to get why some people think prompt engineering is not ‘real research’...
@arankomatsuzaki
Aran Komatsuzaki
2 years
Large Language Models are Zero-Shot Reasoners Simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3.
Tweet media one
59
572
3K
0
0
4
@ChengleiSi
CLS
2 years
@arankomatsuzaki 111 pages is no joke
1
0
4
@ChengleiSi
CLS
5 months
@giffmana 🤦‍♂️
0
0
3
@ChengleiSi
CLS
5 months
@aryaman2020 correction: coconut pudding, not boba.
2
0
4
@ChengleiSi
CLS
11 months
6. In a similar vein, @du_yilun et al. show each LLM the other LLMs’ responses after each round and ask them to revise accordingly, and repeat until they reach agreement, which improves reasoning and factual accuracy.
1
0
4
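A bare-bones sketch of that multi-round debate loop; `ask(agent, prompt)` is a placeholder for a per-agent LLM call, and the exact-string agreement check is deliberately naive.

```python
# Multi-agent debate sketch: each agent sees the other agents' latest answers,
# revises its own, and the loop stops once all answers agree (or rounds run out).
def debate(question, agents, ask, max_rounds=3):
    answers = {a: ask(a, question) for a in agents}
    for _ in range(max_rounds):
        if len(set(answers.values())) == 1:  # everyone agrees -> stop
            break
        new_answers = {}
        for a in agents:
            others = "\n".join(f"- {answers[b]}" for b in agents if b != a)
            new_answers[a] = ask(
                a,
                f"{question}\n\nOther agents answered:\n{others}\n\n"
                "Considering their reasoning, give your revised final answer.",
            )
        answers = new_answers
    return answers
```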