Weijia Shi Profile
Weijia Shi

@WeijiaShi2

4,832
Followers
990
Following
32
Media
531
Statuses

PhD student @uwcse @uwnlp | Visiting Researcher @MetaAI | Undergrad @CS_UCLA |

Seattle, WA
Joined August 2019
Pinned Tweet
@WeijiaShi2
Weijia Shi
7 months
Ever wondered which data black-box LLMs like GPT are pretrained on? 🤔 We build a benchmark WikiMIA and develop Min-K% Prob 🕵️, a method for detecting undisclosed pretraining data from LLMs (relying solely on output probs). Check out our project: [1/n]
Tweet media one
15
140
663
@WeijiaShi2
Weijia Shi
7 months
Introducing In-Context Pretraining🖇️: train LMs on contexts of related documents. It improves a 7B LM simply by reordering the pretraining docs 📈In-context learning +8% 📈Faithfulness +16% 📈Reading comprehension +15% 📈Retrieval augmentation +9% 📈Long-context reasoning +5%
Tweet media one
11
155
626
@WeijiaShi2
Weijia Shi
1 year
🙋‍♀️How to present the same text in diff. tasks/domains as diff. embeddings W/O training? We introduce Instructor👨‍🏫, an instruction-finetuned embedder that can generate text embeddings tailored to any task given the task instruction➡️sota on 7⃣0⃣tasks👇!
Tweet media one
12
115
601
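A usage sketch based on the publicly released InstructorEmbedding package (the model name and instruction wording follow the project README as I recall and may differ); the key pattern is pairing each text with a task instruction at encode time so the same text yields task-specific embeddings:

```python
# pip install InstructorEmbedding sentence-transformers
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-large")
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"

# Same text, different embeddings depending on the task instruction
retrieval_emb = model.encode([["Represent the Science title for retrieval:", sentence]])
clustering_emb = model.encode([["Represent the Science title for clustering:", sentence]])
```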
@WeijiaShi2
Weijia Shi
4 months
Happy to share In-Context Pretraining 🖇️ is accepted as an #ICLR2024 spotlight. We study how to pretrain LLMs with improved context-understanding ability. Paper📄: Code:
@WeijiaShi2
Weijia Shi
7 months
Introducing In-Context Pretraining🖇️: train LMs on contexts of related documents. It improves a 7B LM simply by reordering the pretraining docs 📈In-context learning +8% 📈Faithfulness +16% 📈Reading comprehension +15% 📈Retrieval augmentation +9% 📈Long-context reasoning +5%
Tweet media one
11
155
626
4
51
411
@WeijiaShi2
Weijia Shi
2 months
When augmented with retrieval, LMs sometimes overlook retrieved docs and hallucinate 🤖💭 To make LMs trust evidence more and hallucinate less, we introduce Context-Aware Decoding: a decoding algorithm improving LM's focus on input contexts 📖 #NAACL2024
Tweet media one
5
61
342
@WeijiaShi2
Weijia Shi
2 months
Happy to share REPLUG🔌 is accepted to #NAACL2024! We introduce a retrieval-augmented LM framework that combines a frozen LM with a frozen/tunable retriever. It improves GPT-3 in language modeling & downstream tasks by prepending retrieved docs to LM inputs. 📄:
Tweet media one
8
36
273
@WeijiaShi2
Weijia Shi
6 months
Interested in finding out whether specific book excerpts were part of the pretraining data of OpenAI's text-davinci-003 model? 🔍 Check out our demo:
@WeijiaShi2
Weijia Shi
7 months
Ever wondered which data black-box LLMs like GPT are pretrained on? 🤔 We build a benchmark WikiMIA and develop Min-K% Prob 🕵️, a method for detecting undisclosed pretraining data from LLMs (relying solely on output probs). Check out our project: [1/n]
Tweet media one
15
140
663
1
29
212
@WeijiaShi2
Weijia Shi
2 years
❓How can retrieval from a heterogeneous corpus benefit zero-shot inference with language models? We introduce kNN-Prompt, a technique to use k-nearest neighbor retrieval augmentation for improving zero-shot inference Paper: [1/n]
Tweet media one
2
42
197
@WeijiaShi2
Weijia Shi
15 days
Super excited to be attending #ICLR2024 to present our work: ✅In-Context Pretraining () ⏰: Thursday 10:45 am (Halle B #95 ) ✅ Detecting Pretraining Data from LLMs () ⏰: Friday 10:45 am (Halle B #95 ) Come say hi 🍻
@WeijiaShi2
Weijia Shi
4 months
Happy to share In-Context Pretraining 🖇️ is accepted as an #ICLR2024 spotlight. We study how to pretrain LLMs with improved context-understanding ability. Paper📄: Code:
4
51
411
3
26
152
@WeijiaShi2
Weijia Shi
7 months
We are sharing the BookMIA data 📚 used in our paper: . It serves as a benchmark for evaluating membership inference attack methods that detect copyrighted books in OpenAI models such as text-davinci-003. - Non-member data 🚫: Text snippets from books first
@WeijiaShi2
Weijia Shi
7 months
Ever wondered which data black-box LLMs like GPT are pretrained on? 🤔 We build a benchmark WikiMIA and develop Min-K% Prob 🕵️, a method for detecting undisclosed pretraining data from LLMs (relying solely on output probs). Check out our project: [1/n]
Tweet media one
15
140
663
0
25
121
@WeijiaShi2
Weijia Shi
5 months
#NeurIPS2023 Join us at the RegML Workshop (📅 Sat, Dec 16, 1:00-1:35 PM, Room 215-216). @YangsiboHuang and @xiamengzhou will present our work "Detecting Pretraining Data in Large Language Models". 🔗:
@WeijiaShi2
Weijia Shi
7 months
Ever wondered which data black-box LLMs like GPT are pretrained on? 🤔 We build a benchmark WikiMIA and develop Min-K% Prob 🕵️, a method for detecting undisclosed pretraining data from LLMs (relying solely on output probs). Check out our project: [1/n]
Tweet media one
15
140
663
2
23
119
@WeijiaShi2
Weijia Shi
4 years
Happy to share that our paper titled "Design Challenges for Low-resource Cross-lingual Entity Linking" is accepted to #emnlp2020! Huge thanks to co-authors @XingyuFu2 @Xiaodong_Yu_126 and @dannydanr
3
5
54
@WeijiaShi2
Weijia Shi
4 months
Thank you for the reimplementation! Excited to see that it has helped in reducing hallucinations with the latest models
@zhichaoxu_ir
Zhichao Xu Brutus
4 months
I implemented the Context-aware Decoding (CAD) described in by @WeijiaShi2 . I found it can reduce factuality errors on both news summarization and query-focused summarization tasks - with some more recent language models such as MPT-7B and Mistral-7B
2
8
55
1
2
52
@WeijiaShi2
Weijia Shi
6 months
Thank you Jack🥰!! I am at #EMNLP2023 and would love to chat about LMs for retrieval augmentation, pretraining and safety
@jxmnop
jack morris
6 months
here are two awesome researchers you should follow: @WeijiaShi2 at UW and @wzhao_nlp at Cornell!! some of their recent work: weijia shi ( @WeijiaShi2 ): - built INSTRUCTOR, the embedding model that lots of startups / companies use () - proposed a more
1
14
110
1
2
42
@WeijiaShi2
Weijia Shi
11 days
I will present our work Knowledge Card in today’s oral session 💬Session: Oral 7B 🕙Time: Friday, 10 AM 📍Place: Halle A 7
@shangbinfeng
Shangbin Feng
14 days
Knowledge Card at @iclrconf Oral! Due to visa issues I could not attend, but we will have the awesome @WeijiaShi2 to give the oral talk! 💬Session: Oral 7B 🕙Time: Friday, 10 AM 📍Place: Halle A 7 Paper link: Code & resources:
1
2
16
1
3
44
@WeijiaShi2
Weijia Shi
1 year
Given an input context, REPLUG🔌 first retrieves relevant documents from an external corpus using a retriever (1️⃣Document Retrieval). Then it prepends each document separately to the input context and ensembles output probabilities from different passes (2️⃣Input Reformulation)
Tweet media one
1
1
26
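A minimal sketch of this prepend-and-ensemble step; `lm_loglikelihood` is a hypothetical placeholder for any API that returns log p(continuation | prompt) under a frozen LM, and the softmax weighting over retrieval scores is one reasonable reading of the ensembling:

```python
import math

def replug_ensemble_logprob(lm_loglikelihood, retrieved, query, continuation):
    """REPLUG-style input reformulation (sketch).

    `retrieved` is a list of (document_text, retrieval_score) pairs.
    Each document is prepended to the query in a separate LM pass, and the
    per-pass continuation probabilities are mixed with normalized weights.
    """
    # Normalize retrieval scores into ensemble weights
    z = sum(math.exp(score) for _, score in retrieved)
    weights = [math.exp(score) / z for _, score in retrieved]

    # One LM pass per retrieved document, then mix the probabilities
    mixed = sum(
        w * math.exp(lm_loglikelihood(doc + "\n\n" + query, continuation))
        for (doc, _), w in zip(retrieved, weights)
    )
    return math.log(mixed)
```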
@WeijiaShi2
Weijia Shi
7 months
Sorry, this is the right link to the paper:
0
0
24
@WeijiaShi2
Weijia Shi
1 year
We also introduce a training scheme that can further improve the initial retriever in REPLUG🔌 with supervision signals from a black-box language model. The key idea💡 is to adapt the retriever🔥 to the black-box LM🧊
Tweet media one
2
1
24
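A rough sketch of that training signal as described in the tweet (not the paper's code): the frozen LM's 🧊 likelihood of the gold continuation under each retrieved document acts as a soft label for the trainable retriever's 🔥 score distribution; temperature values are illustrative:

```python
import torch
import torch.nn.functional as F

def retriever_training_loss(retriever_scores, lm_gold_logprobs, tau=0.1):
    """Move the retriever's score distribution toward the frozen LM's preferences.

    retriever_scores: tensor [k], differentiable query-document similarities (retriever 🔥)
    lm_gold_logprobs: tensor [k], the frozen LM's 🧊 log-likelihood of the gold
    continuation when each retrieved document is prepended (treated as constants).
    """
    log_p_retrieval = F.log_softmax(retriever_scores / tau, dim=-1)
    q_lm = F.softmax(lm_gold_logprobs.detach() / tau, dim=-1)
    # KL(LM-preference distribution || retrieval distribution); gradients reach only the retriever
    return torch.sum(q_lm * (torch.log(q_lm) - log_p_retrieval))
```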
@WeijiaShi2
Weijia Shi
3 months
@kchonyc @mrdrozdov @andrewmccallum @MohitIyyer @JonathanBerant @HamedZamani Good point. I guess one reason to train LMs with retrieval is to improve their understanding of long contexts (e.g., multi-doc reasoning). Since long pretraining docs are scarce, retrieval can gather related docs within the same context, letting LMs learn to use long contexts
1
1
22
@WeijiaShi2
Weijia Shi
1 year
We first annotate instructions for 330 diverse tasks and train Instructor👨‍🏫 on this multitask mixture. A single Instructor👨‍🏫 model can achieve sota on 70 embedding evaluation tasks: 1⃣ Retrieval 2⃣TextEval 3⃣Clustering 4⃣Prompt Retrieval 5⃣Classification 6⃣STS 7⃣Reranking 8⃣...
Tweet media one
1
1
22
@WeijiaShi2
Weijia Shi
6 months
Ari is a great mentor and always has valuable insights on LLMs. Apply to work with him!
@universeinanegg
Ari Holtzman
6 months
If you want a respite from OpenAI drama, how about joining academia? I'm starting Conceptualization Lab, recruiting PhDs & Postdocs! We need new abstractions to understand LLMs. Conceptualization is the act of building abstractions to see something new.
14
63
277
0
2
19
@WeijiaShi2
Weijia Shi
1 year
Smaller model but best performance: Instructor👨‍🏫 (335M), while having >10x fewer parameters than the previous best model (4.8B), achieves state-of-the-art performance, with an average improvement of 3.4% over the previous best results on the 70 diverse datasets.
Tweet media one
2
1
17
@WeijiaShi2
Weijia Shi
7 months
@harmdevries77 raises a key issue: the lack of long pretraining data (<5% of web docs exceed 2k tokens) poses challenges for pretraining LMs with long context windows. In-Context Pretraining offers a scalable solution for creating meaningful long contexts
1
0
17
@WeijiaShi2
Weijia Shi
7 months
Existing methods train LMs by concatenating random docs to form input contexts, but the prior docs provide 𝙣𝙤 𝙨𝙞𝙜𝙣𝙖𝙡 for predicting the next doc. In-Context Pretraining forms meaningful long contexts from related docs, encouraging LMs to read more varied and longer contexts
2
0
17
@WeijiaShi2
Weijia Shi
7 months
Why does this matter? Black-box LLMs like GPT are pretrained on massive and undisclosed data that may contain sensitive texts. Min-K% Prob 🕵️ can be used to 🔍Detect copyrighted texts in pretraining 🛡️Identify dataset contamination 🔐Privacy auditing of machine unlearning [2/n]
Tweet media one
1
3
14
@WeijiaShi2
Weijia Shi
1 year
Led by @hongjin_su and @WeijiaShi2 , joint work with @wittgen_ball , @yizhongwyz , @huyushi98 , Mari, @scottyih , @nlpnoah , @LukeZettlemoyer , and @taoyds from @uwnlp , @allen_ai and @MetaAI . Thanks @Muennighoff and @Nils_Reimers for the nice MTEB code and data. It did save our life!
1
0
14
@WeijiaShi2
Weijia Shi
1 year
Instructions Enable Diverse Training: Finetuning with instructions allows Instructor👨‍🏫 to benefit from the diversity of 330 datasets, whereas simply training on those datasets alone leads to degraded performance.
Tweet media one
1
0
12
@WeijiaShi2
Weijia Shi
7 months
How does In-Context Pretraining work? 👀 In-Context Pretraining first finds related documents at scale to create a document graph using a retriever and then builds pretraining input contexts by traversing the document graph.
Tweet media one
1
0
12
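A toy sketch of the context-building idea, assuming precomputed, normalized document embeddings from some retriever; the actual pipeline uses approximate nearest-neighbor search and a graph-traversal heuristic at much larger scale:

```python
import numpy as np

def build_pretraining_contexts(doc_embeddings, doc_texts, docs_per_context=4):
    """Greedy toy version: start from an unused document, repeatedly hop to its
    most similar unused neighbor, and emit the visited path as one long context."""
    sims = doc_embeddings @ doc_embeddings.T   # similarity matrix (rows assumed normalized)
    unused = set(range(len(doc_texts)))
    contexts = []
    while unused:
        current = unused.pop()
        path = [current]
        while len(path) < docs_per_context and unused:
            current = max(unused, key=lambda j: sims[path[-1], j])  # nearest unused neighbor
            unused.remove(current)
            path.append(current)
        contexts.append("\n\n".join(doc_texts[i] for i in path))
    return contexts
```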
@WeijiaShi2
Weijia Shi
1 year
Instructions Mitigate Domain Shifts: Instruction-finetuned Instructor👨‍🏫 helps more on unseen domains: geography, biology and civil comments. Domain-specific datasets benefit particularly from instruction finetuning.
Tweet media one
1
0
11
@WeijiaShi2
Weijia Shi
2 months
1️⃣ Context-Aware Decoding simply contrasts output probabilities with and without the desired focus contexts and samples from this contrasted distribution 📊. 2️⃣ How well does it work? Without additional training, it improves pretrained LMs' faithfulness (14.3%📈 for LLaMA)
Tweet media one
1
0
10
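A minimal sketch of the contrastive step at a single decoding position, assuming a Hugging Face-style causal LM and tokenizer; the alpha value and the way context and question are concatenated are illustrative choices, not the released implementation:

```python
import torch
import torch.nn.functional as F

def context_aware_next_token_logprobs(model, tokenizer, context, question, alpha=0.5):
    """Contrast next-token logits computed with and without the evidence context.

    Amplifying the shift the context induces makes the model lean more on the
    provided evidence (alpha=0 recovers standard decoding).
    """
    with_ctx = tokenizer(context + "\n" + question, return_tensors="pt")
    without_ctx = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        logits_ctx = model(**with_ctx).logits[:, -1, :]        # conditioned on context + question
        logits_plain = model(**without_ctx).logits[:, -1, :]   # conditioned on question only
    # Proportional to (1 + alpha) * log p(y|c,x) - alpha * log p(y|x)
    adjusted = (1 + alpha) * logits_ctx - alpha * logits_plain
    return F.log_softmax(adjusted, dim=-1)
```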
@WeijiaShi2
Weijia Shi
1 year
Check out @BunsenFeng 's work Cook🧑‍🍳: empowering GPT with modular and community-driven knowledge: (1) 25 specialized LMs serve as parametric knowledge repositories for GPT (2) Request knowledge only when needed (3) Top performance on MMLU and fact checking
@shangbinfeng
Shangbin Feng
1 year
With a pool of community-contributed specialized LMs, we propose bottom-up and top-down, two approaches to integrate black-box LLMs and these modular knowledge repos. bottom-up: multi-domain knowledge synthesis top-down: LLM select and activate specialized LMs when necessary
Tweet media one
1
2
12
0
2
11
@WeijiaShi2
Weijia Shi
2 months
We release the code: @zhichaoxu_ir has a very nice reimplementation and shows it can reduce hallucination in the latest models such as Mistral-7B as well. His code and writeup 👇: 🛠️:
1
0
9
@WeijiaShi2
Weijia Shi
7 months
Practical application: 🔍Detecting copyright violations in pretraining data using Min-K% Prob 🕵️ E.g., We see evidence that GPT-3.5 (specifically, text-davinci-003) is likely to be pretrained on copyrighted books from the Pile Books3 dataset 👇 [8/n]
Tweet media one
1
3
9
@WeijiaShi2
Weijia Shi
7 months
How do pretraining design choices affect detection difficulty? Harder detection with 1. Smaller model size 📉 2. Shorter lengths of text for detection 📉 3. More training data 📈 4. Decreasing occurrence frequency of the detecting example 📉 5. Lower learning rates 📉 [9/n]
Tweet media one
1
2
9
@WeijiaShi2
Weijia Shi
7 months
Try our detection method Min-K% Prob 🕵️: 1️⃣ Compute token probabilities in the text. 2️⃣ Pick the k% tokens with minimum probabilities. 3️⃣ Compute their average log likelihood. High average? The text is probably in the pretraining data ✅ [4/n]
Tweet media one
1
2
9
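A minimal sketch of the three steps above, assuming per-token log probabilities are already available from the target model; the function name and threshold are illustrative, not the paper's reference implementation:

```python
import numpy as np

def min_k_prob_score(token_logprobs, k=0.2):
    """Min-K% Prob: average log-likelihood of the k% lowest-probability tokens."""
    logprobs = np.sort(np.asarray(token_logprobs, dtype=float))  # ascending: lowest first
    n_lowest = max(1, int(len(logprobs) * k))                    # bottom k% of tokens
    return float(logprobs[:n_lowest].mean())

# Usage: higher scores suggest the text was likely seen during pretraining.
# The decision threshold would be calibrated on a benchmark such as WikiMIA.
score = min_k_prob_score([-0.1, -7.3, -0.5, -2.2, -0.05], k=0.2)
is_member = score > -3.0   # illustrative threshold only
```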
@WeijiaShi2
Weijia Shi
7 months
How to evaluate detection method efficacy? Introducing WikiMIA🌟, a dynamic benchmark that uses data from before and after a model's pretraining period to provide ground-truth membership labels. It evolves with new LLMs, updating seen/unseen data. Data: [6/n]
1
4
9
@WeijiaShi2
Weijia Shi
7 months
❓: Can we determine whether an LLM was pretrained on a certain text, given only black-box access to it? This is known as a membership inference attack in ML security, but its application to LLM pretraining is still relatively underexplored [3/n]
1
2
8
@WeijiaShi2
Weijia Shi
1 year
@pajeeter There are two main differences: 1) RAG is an encoder-decoder LM with retrieval augmentation while ours augments a decoder-only (e.g. GPT, OPT) with retrieval. 2) RAG finetunes the LM's parameters to make it learn to read the retrieved documents, whereas we keep the LM frozen 😀
0
1
8
@WeijiaShi2
Weijia Shi
1 year
In addition, making the instructions more detailed (Left) and the model larger (Right) consistently improves the performance of Instructor👨‍🏫.
Tweet media one
Tweet media two
1
0
7
@WeijiaShi2
Weijia Shi
1 year
Instruction-finetuning on a large amount of datasets with diverse task instructions improves the robustness of Instructor👨‍🏫 to instruction paraphrases (i.e., smaller performance gaps between best- and worst-performing instructions)
Tweet media one
1
0
8
@WeijiaShi2
Weijia Shi
7 months
How well does the detection method work? We show that Min-K% Prob 🕵️ outperforms the strongest existing baseline and can be used to detect pretraining data from various LLMs including GPT-3, OPT, and LLaMA. [7/n]
Tweet media one
1
2
7
@WeijiaShi2
Weijia Shi
7 months
@main_horse wow, it would save methods that use ppl/logits as signals 🤣
0
0
7
@WeijiaShi2
Weijia Shi
7 months
Method Intuition? ✨ We observe that an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. [5/n]
1
1
7
@WeijiaShi2
Weijia Shi
2 years
2/ We are the first to study the k-nearest neighbors language model (kNN-LM)'s zero-shot application to end tasks and find that the main challenge of applying it naïvely is the sparsity of the kNN distribution.
1
0
6
@WeijiaShi2
Weijia Shi
2 years
3/ We introduce kNN-Prompt to address this issue. Key to our approach is the introduction of fuzzy verbalizers which leverage the sparse kNN distribution for downstream tasks by automatically associating each classification label with a set of natural language tokens.
Tweet media one
1
0
6
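A toy sketch of the fuzzy-verbalizer idea, assuming the next-token distribution already interpolates LM and kNN probabilities; the label-to-token mapping shown is purely illustrative:

```python
def fuzzy_verbalizer_scores(next_token_probs, label_to_tokens):
    """Score each label by summing probability mass over its associated tokens,
    so the sparse kNN-augmented distribution still contributes mass via near-synonyms."""
    return {
        label: sum(next_token_probs.get(tok, 0.0) for tok in tokens)
        for label, tokens in label_to_tokens.items()
    }

# Illustrative mapping for sentiment classification
label_to_tokens = {
    "positive": ["great", "good", "wonderful", "positive"],
    "negative": ["terrible", "bad", "awful", "negative"],
}
probs = {"great": 0.12, "good": 0.08, "awful": 0.03}  # interpolated LM + kNN next-token probs
scores = fuzzy_verbalizer_scores(probs, label_to_tokens)
prediction = max(scores, key=scores.get)
```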
@WeijiaShi2
Weijia Shi
1 year
@HeyNikhila Multiple people, including ourselves, have tried the code and it works smoothly. We suspect that this error may be related to your sentence-transformers library. Could you please double-check that you have installed the sentence-transformers library according to ?
0
2
5
@WeijiaShi2
Weijia Shi
2 years
5/ Joint work with Julian Michael ( @_julianmichael_ ), Suchin Gururangan ( @ssgrn ), and Luke Zettlemoyer ( @LukeZettlemoyer ) from @uwnlp @uwcse
0
1
6
@WeijiaShi2
Weijia Shi
2 months
This method has broader applications beyond pure text settings. For example, similar ideas are applied to vision-language models to increase their focus on visual prompts 👇
@meetdavidwan
David Wan
3 months
Pointing to an image region should help models focus, but standard VLMs fail to understand visual markers/prompts (e.g., boxes/masks). 🚨Contrastive Region Guidance: Training-free method that increases focus on visual prompts by reducing model priors. 🧵
Tweet media one
2
46
122
1
0
6
@WeijiaShi2
Weijia Shi
2 months
Joint collaboration w/ @XiaochuangHan (coleading), @ml_perception , yuliatsvetkov ( @tsvetshop ), @LukeZettlemoyer , @scottyih from @uwnlp and @metaai Paper: 🛠️:
0
0
5
@WeijiaShi2
Weijia Shi
7 months
@_TobiasLee That's a great question! Our methods are limited to models that provide output token probabilities, like text-davinci-003. It would be very interesting to see future work that could develop methods to identify the pretraining corpus without relying on logits.
0
0
5
@WeijiaShi2
Weijia Shi
2 years
1/ Retrieval-augmented language models have been shown to outperform their non-retrieval-based counterparts on language modeling tasks. But it is an open question whether they also achieve similar gains in zero-shot end task evaluations.
1
0
5
@WeijiaShi2
Weijia Shi
2 years
4/ Experiments on 11 datasets (text classification, fact retrieval, and question answering) show that kNN-Prompt 1) yields large performance improvements over zero-shot baselines and 2) is effective for domain adaptation without further training
1
0
4
@WeijiaShi2
Weijia Shi
7 months
@OhadRubin Hi Ohad. Thanks for your interest in our work❤️!! After computing the kNNs for each query document, we performed additional deduplication by filtering out neighboring documents that have >90% 3-gram overlap.
1
0
4
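A small sketch of that filtering step under one reasonable reading of ">90% 3-gram overlap"; the exact overlap measure in the actual pipeline may differ:

```python
def three_grams(text):
    tokens = text.split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def is_near_duplicate(query_doc, neighbor_doc, threshold=0.9):
    """Flag a retrieved neighbor whose 3-grams overlap the query document too heavily."""
    q, nb = three_grams(query_doc), three_grams(neighbor_doc)
    return bool(nb) and len(q & nb) / len(nb) > threshold

def filter_neighbors(query_doc, neighbors, threshold=0.9):
    return [d for d in neighbors if not is_near_duplicate(query_doc, d, threshold)]
```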
@WeijiaShi2
Weijia Shi
7 months
@main_horse @arankomatsuzaki We conducted some experiments on the GPT-3.5 model (specifically, text-davinci-003), from which ChatGPT is finetuned, as outlined at . It's worth noting that the logprobs of text-davinci-003 can be accessed via the API. Thank you for bringing this to our
2
0
4
@WeijiaShi2
Weijia Shi
26 days
@peizNLP Congrats Pei 🥳🥳
1
0
3
@WeijiaShi2
Weijia Shi
5 months
@YangsiboHuang is on the academic job market this year. She did a lot of great work on trustworthy AI! Stop by to chat with her :)
1
0
3
@WeijiaShi2
Weijia Shi
7 months
@arankomatsuzaki @main_horse lol, thanks for tweeting it anyways!
0
0
2
@WeijiaShi2
Weijia Shi
7 months
@_AngelinaYang_ @YangsiboHuang @arankomatsuzaki There is a lot of debate over this topic. Incorporating copyrighted content into pretraining could potentially violate copyright laws. @katherine1ee has an insightful blog post discussing generative AI and copyright:
0
0
3
@WeijiaShi2
Weijia Shi
2 months
@WenhuChen Thank you Wenhu! Yeah it has been a while
0
0
2
@WeijiaShi2
Weijia Shi
1 year
@TejasviKashi I used Keynote and flaticon :)
1
1
2
@WeijiaShi2
Weijia Shi
7 months
@main_horse thank you for pointing it out!! We made updates to both the paper and the website figure.
0
0
1
@WeijiaShi2
Weijia Shi
7 months
@mrdrozdov Thank you for your interest! I agree. Future studies could focus on constructing more informative but challenging contexts for LMs to learn more during pretraining
0
0
1