Ever wondered which data black-box LLMs like GPT are pretrained on? 🤔
We build a benchmark WikiMIA and develop Min-K% Prob 🕵️, a method for detecting undisclosed pretraining data from LLMs (relying solely on output probs).
Check out our project:
[1/n]
🙋♀️How can the same text be represented as different embeddings for different tasks/domains, without any training?
We introduce Instructor👨🏫, an instruction-finetuned embedder that can generate text embeddings tailored to any task given the task instruction➡️sota on 7⃣0⃣tasks👇!
Happy to share In-Context Pretraining 🖇️ is accepted as an
#ICLR2024
spotlight. We study how to pretrain LLMs with improved context understanding ability
paper📄:
code:
When augmented with retrieval, LMs sometimes overlook retrieved docs and hallucinate 🤖💭
To make LMs trust evidence more and hallucinate less, we introduce Context-Aware Decoding: a decoding algorithm that improves LMs' focus on their input contexts
📖
#NAACL2024
Happy to share REPLUG🔌 is accepted to
#NAACL2024
We introduce a retrieval-augmented LM framework that combines a frozen LM with a frozen/tunable retriever. It improves GPT-3 on language modeling & downstream tasks simply by prepending retrieved docs to LM inputs.
📄:
❓How can retrieval from a heterogeneous corpus benefit zero-shot inference with language models?
We introduce kNN-Prompt, a technique that uses k-nearest-neighbor retrieval augmentation to improve zero-shot inference
Paper:
[1/n]
Super excited to be attending
#ICLR2024
to present our work:
✅In-Context Pretraining ()
⏰: Thursday 10:45 am (Halle B #95)
✅ Detecting Pretraining Data from LLMs ()
⏰: Friday 10:45 am (Halle B #95)
Come say hi 🍻
We are sharing BookMIA data 📚 used in our paper: . It serves as a benchmark to evaluate membership inference attack methods in detecting copyrighted books from OpenAI models such as text-davinci-003.
- Non-member data 🚫: Text snippets from books first
#NeurIPS2023
Join us at the RegML Workshop (📅 Sat, Dec 16, 1:00-1:35 PM, Room 215-216).
@YangsiboHuang
and
@xiamengzhou
will present our work "Detecting Pretraining Data in Large Language Models".
🔗:
I implemented the Context-aware Decoding (CAD) described in by
@WeijiaShi2
.
I found it can reduce factuality errors on both news summarization and query-focused summarization tasks, even with more recent language models such as MPT-7B and Mistral-7B
here are two awesome researchers you should follow:
@WeijiaShi2
at UW and
@wzhao_nlp
at Cornell!! some of their recent work:
weijia shi (
@WeijiaShi2
):
- built INSTRUCTOR, the embedding model that lots of startups / companies use ()
- proposed a more
Knowledge Card at
@iclrconf
Oral! Due to visa issues I could not attend, but we will have the awesome
@WeijiaShi2
to give the oral talk!
💬Session: Oral 7B
🕙Time: Friday, 10 AM
📍Place: Halle A 7
Paper link:
Code & resources:
Given an input context, REPLUG🔌 first retrieves relevant documents from an external corpus using a retriever (1️⃣Document Retrieval). Then it prepends each document separately to the input context and ensembles output probabilities from different passes (2️⃣Input Reformulation)
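The ensembling step can be sketched in a few lines. This is an illustrative toy, with plain Python lists standing in for real LM distributions and made-up retrieval scores, not REPLUG's actual implementation:

```python
import math

def replug_ensemble(doc_scores, per_doc_probs):
    """REPLUG-style ensembling (sketch): run the LM once per retrieved doc
    (doc prepended to the input), then average the resulting next-token
    distributions, weighted by softmax-normalized retrieval scores."""
    m = max(doc_scores)
    exp_scores = [math.exp(s - m) for s in doc_scores]
    z = sum(exp_scores)
    weights = [e / z for e in exp_scores]  # softmax over retrieval scores
    vocab = len(per_doc_probs[0])
    return [sum(w * p[t] for w, p in zip(weights, per_doc_probs))
            for t in range(vocab)]

# Toy example: two retrieved docs with equal scores, a vocab of 3 tokens.
ensembled = replug_ensemble([1.0, 1.0], [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]])
```

Because each document is prepended in a separate pass, the LM's context window never has to hold all retrieved documents at once.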
We also introduce a training scheme that can further improve the initial retriever in REPLUG🔌 with supervision signals from a black-box language model. The key idea💡 is to adapt the retriever🔥 to the black-box LM🧊
@kchonyc
@mrdrozdov
@andrewmccallum
@MohitIyyer
@JonathanBerant
@HamedZamani
Good point. I guess one reason to train LMs with retrieval is to improve their understanding of long contexts (e.g., multi-doc reasoning). Since long pretraining docs are scarce, retrieval can gather related docs within the same context, helping LMs learn to use long contexts
We first annotate instructions for 330 diverse tasks and train Instructor👨🏫 on this multitask mixture
A single Instructor👨🏫 model can achieve sota on 70 embedding evaluation tasks:
1⃣ Retrieval
2⃣TextEval
3⃣Clustering
4⃣Prompt Retrieval
5⃣Classification
6⃣STS
7⃣Reranking
8⃣...
If you want a respite from OpenAI drama, how about joining academia?
I'm starting Conceptualization Lab, recruiting PhDs & Postdocs!
We need new abstractions to understand LLMs. Conceptualization is the act of building abstractions to see something new.
Smaller model but best performance
Instructor👨🏫 (335M), while having >10x fewer parameters than the previous best model (4.8B), achieves state-of-the-art performance, with an average improvement of 3.4% compared to the previous best results on the 70 diverse datasets.
@harmdevries77
raises a key issue: the lack of long pretraining data (<5% of web docs exceed 2k tokens) poses challenges for pretraining LMs with long context windows. In-Context Pretraining offers a scalable solution for creating meaningful long contexts
Existing methods train LMs by concatenating random docs to form input contexts, but the prior docs provide 𝙣𝙤 𝙨𝙞𝙜𝙣𝙖𝙡 for predicting the next doc. In-Context Pretraining forms meaningful long contexts from related docs, encouraging LMs to read more varied and longer contexts
Why does this matter?
Black-box LLMs like GPT are pretrained on massive and undisclosed data that may contain sensitive texts. Min-K% Prob 🕵️ can be used to
🔍Detect copyrighted texts in pretraining
🛡️Identify dataset contamination
🔐Privacy auditing of machine unlearning
[2/n]
Instructions Enable Diverse Training
Finetuning with instructions allows Instructor👨🏫 to benefit from the diversity of 330 datasets, whereas simply training on those datasets alone leads to degraded performance.
How does In-Context Pretraining work? 👀
In-Context Pretraining first finds related documents at scale to create a document graph using a retriever and then builds pretraining input contexts by traversing the document graph.
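A minimal sketch of this traversal idea, assuming a precomputed similarity graph (`build_contexts` and the toy graph are illustrative; the actual method operates at web scale with an approximate-kNN retriever):

```python
def build_contexts(doc_graph, max_docs):
    """Greedy traversal sketch: start from an unvisited doc, repeatedly hop
    to its most-similar unvisited neighbor, so each pretraining context
    packs related documents together instead of random ones."""
    visited, contexts = set(), []
    for start in doc_graph:
        if start in visited:
            continue
        ctx, cur = [], start
        while cur is not None and len(ctx) < max_docs:
            ctx.append(cur)
            visited.add(cur)
            # Most similar neighbor not yet placed in any context, if any.
            nbrs = [(sim, d) for d, sim in doc_graph[cur].items()
                    if d not in visited]
            cur = max(nbrs)[1] if nbrs else None
        contexts.append(ctx)
    return contexts

# Toy graph: docs A/B share one topic, C/D another (edge weight = similarity).
graph = {
    "A": {"B": 0.9, "C": 0.1},
    "B": {"A": 0.9, "D": 0.2},
    "C": {"D": 0.8, "A": 0.1},
    "D": {"C": 0.8, "B": 0.2},
}
contexts = build_contexts(graph, max_docs=2)
```

Each resulting context groups topically related documents, so earlier docs actually carry signal for predicting later ones.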
1️⃣ Context-Aware Decoding simply contrasts output probabilities with and without the desired focus contexts and samples from this contrasted distribution 📊.
2️⃣ How well does it work?
Without additional training, it improves pretrained LMs' faithfulness (14.3%📈 for LLaMA)
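The contrast itself fits in a few lines. A toy sketch on raw next-token logits (the 3-token vocabulary, logit values, and `alpha` here are invented for illustration):

```python
import math

def context_aware_decoding(logits_with_ctx, logits_no_ctx, alpha=0.5):
    """Contrast next-token logits computed with vs. without the context:
    (1 + alpha) * logit_with - alpha * logit_without upweights tokens the
    context supports and downweights the LM's context-free prior."""
    return [(1 + alpha) * lw - alpha * ln
            for lw, ln in zip(logits_with_ctx, logits_no_ctx)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Toy vocab of 3 tokens: even with the context present, the LM still slightly
# prefers token 2 (its prior); the context actually supports token 0.
with_ctx = [1.8, 0.5, 2.0]
no_ctx = [0.2, 0.5, 2.5]
probs = softmax(context_aware_decoding(with_ctx, no_ctx))
```

Sampling from `probs` instead of the raw with-context distribution flips the prediction toward the context-supported token.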
Check out
@BunsenFeng
's work Cook🧑🍳: empowering GPT with modular and community-driven knowledge: (1) 25 specialized LMs serve as parametric knowledge repositories for GPT (2) Request knowledge only when needed (3) Top performance on MMLU and fact checking
With a pool of community-contributed specialized LMs, we propose bottom-up and top-down, two approaches to integrate black-box LLMs and these modular knowledge repos.
bottom-up: multi-domain knowledge synthesis
top-down: the LLM selects and activates specialized LMs when necessary
We release the code:
@zhichaoxu_ir
has a very nice reimplementation and shows it can reduce hallucination of latest models such as Mistral-7B as well. His code and writeup 👇:
🛠️:
Practical application:
🔍Detecting copyright violations in pretraining data using Min-K% Prob 🕵️
E.g., We see evidence that GPT-3.5 (specifically, text-davinci-003) is likely to be pretrained on copyrighted books from the Pile Books3 dataset 👇
[8/n]
How do pretraining design choices affect detection difficulty?
Harder detection with
1. Smaller model size 📉
2. Shorter lengths of text for detection 📉
3. More training data 📈
4. Lower occurrence frequency of the example being detected 📉
5. Lower learning rates 📉
[9/n]
To try our detection method Min-K% Prob 🕵️:
1️⃣ Compute token probabilities in the text.
2️⃣ Pick the k% tokens with minimum probabilities.
3️⃣ Compute their average log likelihood.
High average? Text is probably in pretraining data ✅
[4/n]
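The three steps above can be sketched directly (the log-probabilities here are toy values, not real LLM outputs):

```python
def min_k_prob(token_logprobs, k=0.2):
    """Min-K% Prob score: average log-likelihood of the k% lowest-probability
    tokens. A higher score suggests the text was seen during pretraining."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # the k% least-likely tokens
    return sum(lowest) / n

# Toy log-probs: the "unseen" text contains a few surprising
# (very low-probability) outlier tokens; the "seen" text does not.
seen = [-0.5, -0.8, -0.4, -0.6, -0.7, -0.5, -0.9, -0.6, -0.4, -0.5]
unseen = [-0.5, -0.8, -6.0, -0.6, -7.5, -0.5, -0.9, -0.6, -5.2, -0.5]
assert min_k_prob(seen) > min_k_prob(unseen)  # seen text scores higher
```

Only token log-probabilities are needed, which is why the method works with black-box APIs that expose logprobs.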
Evaluating detection method efficacy?
Introducing WikiMIA🌟, a dynamic benchmark built from data created before and after a model's pretraining period, providing ground-truth seen/unseen labels. It evolves with new LLMs, updating the seen/unseen data.
Data:
[6/n]
❓: Can we determine if an LLM was pretrained on a certain text, having only black-box access to it?
This is known as a membership inference attack in ML security, but its application in the context of LLM pretraining is still relatively underexplored
[3/n]
@pajeeter
There are two main differences: 1) RAG is an encoder-decoder LM with retrieval augmentation, while ours augments a decoder-only LM (e.g., GPT, OPT) with retrieval. 2) RAG finetunes the LM's parameters to make it learn to read the retrieved documents, whereas we keep the LM frozen 😀
Instruction-finetuning on a large number of datasets with diverse task instructions improves the robustness of Instructor👨🏫 to instruction paraphrases (i.e., smaller performance gaps between best- and worst-performing instructions)
How well does the detection method work?
We show that Min-K% Prob 🕵️ outperforms the strongest existing baseline and can be used to detect pretraining data from various LLMs, including GPT-3, OPT, and LLaMA.
[7/n]
Method Intuition? ✨
We observe that an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to contain words with such low probabilities.
[5/n]
2/ We are the first to study zero-shot application of the k-nearest neighbors language model (kNN-LM) to end tasks, and we find the main challenge of applying it naïvely is the sparsity of the kNN distribution.
3/ We introduce kNN-Prompt to address this issue. Key to our approach is the introduction of fuzzy verbalizers which leverage the sparse kNN distribution for downstream tasks by automatically associating each classification label with a set of natural language tokens.
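A toy sketch of the fuzzy-verbalizer idea (the label-token sets and probability values below are invented for illustration; the paper builds the token sets automatically):

```python
def fuzzy_verbalizer_score(token_probs, label_tokens):
    """Score a label by summing next-token probability mass over every
    token associated with that label, not just one canonical verbalizer."""
    return sum(token_probs.get(tok, 0.0) for tok in label_tokens)

# Toy sentiment task: a sparse kNN distribution puts mass on only a few
# tokens, but each label maps to a *set* of tokens, so it still gets credit.
verbalizers = {
    "positive": {"great", "good", "amazing"},
    "negative": {"bad", "terrible", "awful"},
}
knn_probs = {"great": 0.2, "good": 0.1, "awful": 0.05}
pred = max(verbalizers,
           key=lambda lab: fuzzy_verbalizer_score(knn_probs, verbalizers[lab]))
```

With a single verbalizer per label (say, only "good" vs. "bad"), the sparse kNN distribution might assign zero mass to both; the fuzzy sets make the sparse mass usable.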
@HeyNikhila
Multiple people, including ourselves, have tried the code and it works smoothly. We suspect this error may be related to your sentence-transformers library. Could you please double-check that you have installed the sentence-transformers library according to ?
This method has broader applications beyond pure text settings. For example, similar ideas are applied to vision-language models to increase their focus on visual prompts 👇
Pointing to an image region should help models focus, but standard VLMs fail to understand visual markers/prompts (e.g., boxes/masks).
🚨Contrastive Region Guidance: Training-free method that increases focus on visual prompts by reducing model priors.
🧵
@_TobiasLee
That's a great question! Our methods are limited to models that provide output token probabilities, like text-davinci-003. It would be very interesting to see future work that could develop methods to identify the pretraining corpus without relying on logits.
1/ Retrieval-augmented language models have been shown to outperform their non-retrieval-based counterparts on language modeling tasks.
But it is an open question whether they also achieve similar gains in zero-shot end task evaluations.
4/ Experiments on 11 datasets (text classification, fact retrieval, and question answering) show that kNN-Prompt
1) yields large performance improvements over zero-shot baselines
2) is effective for domain adaptation without further training
@OhadRubin
Hi Ohad. Thanks for your interest in our work❤️!! After computing the kNNs for each query document, we additionally performed deduplication by filtering out neighboring documents with >90% 3-gram overlap.
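That filtering step can be sketched as follows (an illustrative reconstruction with word-level 3-grams, not the paper's actual code):

```python
def trigrams(text):
    """Set of word-level 3-grams in a text."""
    toks = text.split()
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def dedup_neighbors(query_doc, neighbor_docs, threshold=0.9):
    """Drop retrieved neighbors that are near-duplicates of the query doc,
    measured by the fraction of their 3-grams shared with the query."""
    q = trigrams(query_doc)
    kept = []
    for doc in neighbor_docs:
        g = trigrams(doc)
        overlap = len(q & g) / max(1, len(g))
        if overlap <= threshold:
            kept.append(doc)
    return kept

query = "the quick brown fox jumps over the lazy dog"
neighbors = [
    "the quick brown fox jumps over the lazy dog",  # exact duplicate: dropped
    "a completely different sentence about something else",  # kept
]
kept = dedup_neighbors(query, neighbors)
```

Without this step, near-duplicate web pages dominate the nearest neighbors and the "related docs" in a context add no new signal.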
@main_horse
@arankomatsuzaki
We conducted some experiments on the GPT-3.5 model (specifically, text-davinci-003), from which ChatGPT is finetuned, as outlined at . It's worth noting that the logprobs of text-davinci-003 can be accessed via the API.
Thank you for bringing this to our
@_AngelinaYang_
@YangsiboHuang
@arankomatsuzaki
There is a lot of debate over this topic. Incorporating copyrighted content into pretraining could potentially violate copyright laws.
@katherine1ee
has an insightful blog post discussing generative AI and copyright:
@mrdrozdov
Thank you for your interest! I agree. Future studies could focus on constructing more informative but challenging contexts for LMs to learn more during pretraining