Ever wondered which data black-box LLMs like GPT are pretrained on? 🤔
We build a benchmark WikiMIA and develop Min-K% Prob 🕵️, a method for detecting undisclosed pretraining data from LLMs (relying solely on output probs).
Check out our project:
[1/n]
🙋♀️How can the same text be represented as different embeddings for different tasks/domains, without any training?
We introduce Instructor👨🏫, an instruction-finetuned embedder that can generate text embeddings tailored to any task given the task instruction➡️sota on 7⃣0⃣tasks👇!
Happy to share In-Context Pretraining 🖇️ is accepted as an
#ICLR2024
spotlight. We study how to pretrain LLMs with improved context understanding ability
paper📄:
code:
When augmented with retrieval, LMs sometimes overlook retrieved docs and hallucinate 🤖💭
To make LMs trust evidence more and hallucinate less, we introduce Context-Aware Decoding: a decoding algorithm that improves LMs' focus on their input contexts
📖
#NAACL2024
Happy to share REPLUG🔌 is accepted to
#NAACL2024
We introduce a retrieval-augmented LM framework that combines a frozen LM with a frozen/tunable retriever. It improves GPT-3 on language modeling & downstream tasks simply by prepending retrieved docs to LM inputs.
📄:
❓How can retrieval from a heterogeneous corpus benefit zero-shot inference with language models?
We introduce kNN-Prompt, a technique that uses k-nearest-neighbor retrieval augmentation to improve zero-shot inference
Paper:
[1/n]
Super excited to be attending
#ICLR2024
to present our work:
✅In-Context Pretraining ()
⏰: Thursday 10:45 am (Halle B #95)
✅ Detecting Pretraining Data from LLMs ()
⏰: Friday 10:45 am (Halle B #95)
Come say hi 🍻
We are sharing BookMIA data 📚 used in our paper: . It serves as a benchmark to evaluate membership inference attack methods in detecting copyrighted books from OpenAI models such as text-davinci-003.
- Non-member data 🚫: Text snippets from books first
#NeurIPS2023
Join us at the RegML Workshop (📅 Sat, Dec 16, 1:00-1:35 PM, Room 215-216).
@YangsiboHuang
and
@xiamengzhou
will present our work "Detecting Pretraining Data in Large Language Models".
🔗:
I implemented the Context-aware Decoding (CAD) described in by
@WeijiaShi2
.
I found it can reduce factuality errors on both news summarization and query-focused summarization tasks, even with more recent language models such as MPT-7B and Mistral-7B
here are two awesome researchers you should follow:
@WeijiaShi2
at UW and
@wzhao_nlp
at Cornell!! some of their recent work:
weijia shi (
@WeijiaShi2
):
- built INSTRUCTOR, the embedding model that lots of startups / companies use ()
- proposed a more
Knowledge Card at
@iclrconf
Oral! Due to visa issues I could not attend, but we will have the awesome
@WeijiaShi2
to give the oral talk!
💬Session: Oral 7B
🕙Time: Friday, 10 AM
📍Place: Halle A 7
Paper link:
Code & resources:
Given an input context, REPLUG🔌 first retrieves relevant documents from an external corpus using a retriever (1️⃣Document Retrieval). Then it prepends each document separately to the input context and ensembles output probabilities from different passes (2️⃣Input Reformulation)
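The ensembling step can be sketched in a few lines. This is an illustrative toy, with plain Python lists standing in for real LM distributions and made-up retrieval scores, not REPLUG's actual implementation:

```python
import math

def replug_ensemble(doc_scores, per_doc_probs):
    """REPLUG-style ensembling (sketch): run the LM once per retrieved doc
    (doc prepended to the input), then average the resulting next-token
    distributions, weighted by softmax-normalized retrieval scores."""
    m = max(doc_scores)
    exp_scores = [math.exp(s - m) for s in doc_scores]
    z = sum(exp_scores)
    weights = [e / z for e in exp_scores]  # softmax over retrieval scores
    vocab = len(per_doc_probs[0])
    return [sum(w * p[t] for w, p in zip(weights, per_doc_probs))
            for t in range(vocab)]

# Toy example: two retrieved docs with equal scores, a vocab of 3 tokens.
ensembled = replug_ensemble([1.0, 1.0], [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]])
```

Because each document is prepended in a separate pass, the LM's context window never has to hold all retrieved documents at once.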
We also introduce a training scheme that can further improve the initial retriever in REPLUG🔌 with supervision signals from a black-box language model. The key idea💡 is to adapt the retriever🔥 to the black-box LM🧊
@kchonyc
@mrdrozdov
@andrewmccallum
@MohitIyyer
@JonathanBerant
@HamedZamani
Good point. I guess one reason to train LMs with retrieval is to improve their understanding of long contexts (e.g., multi-doc reasoning). Since long pretraining docs are scarce, retrieval can gather related docs within the same context, helping LMs learn to use long contexts
We first annotate instructions for 330 diverse tasks and train Instructor👨🏫 on this multitask mixture
A single Instructor👨🏫 model can achieve sota on 70 embedding evaluation tasks:
1⃣ Retrieval
2⃣TextEval
3⃣Clustering
4⃣Prompt Retrieval
5⃣Classification
6⃣STS
7⃣Reranking
8⃣...
If you want a respite from OpenAI drama, how about joining academia?
I'm starting Conceptualization Lab, recruiting PhDs & Postdocs!
We need new abstractions to understand LLMs. Conceptualization is the act of building abstractions to see something new.
Smaller model but best performance
Instructor👨🏫 (335M), while having >10x fewer parameters than the previous best model (4.8B), achieves state-of-the-art performance, with an average improvement of 3.4% compared to the previous best results on the 70 diverse datasets.
@harmdevries77
raises a key issue: the lack of long pretraining data (<5% of web docs exceed 2k tokens) poses challenges for pretraining LMs with long context windows. In-Context Pretraining offers a scalable solution for creating meaningful long contexts
Existing methods train LMs by concatenating random docs to form input contexts, but the prior docs provide 𝙣𝙤 𝙨𝙞𝙜𝙣𝙖𝙡 for predicting the next doc. In-Context Pretraining forms meaningful long contexts from related docs, encouraging LMs to read more varied and longer contexts
Why does this matter?
Black-box LLMs like GPT are pretrained on massive and undisclosed data that may contain sensitive texts. Min-K% Prob 🕵️ can be used to
🔍Detect copyrighted texts in pretraining
🛡️Identify dataset contamination
🔐Privacy auditing of machine unlearning
[2/n]
Instructions Enable Diverse Training
Finetuning with instructions allows Instructor👨🏫 to benefit from the diversity of 330 datasets, whereas simply training on those datasets alone leads to degraded performance.
How does In-Context Pretraining work? 👀
In-Context Pretraining first finds related documents at scale to create a document graph using a retriever and then builds pretraining input contexts by traversing the document graph.
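A minimal sketch of this traversal idea, assuming a precomputed similarity graph (`build_contexts` and the toy graph are illustrative; the actual method operates at web scale with an approximate-kNN retriever):

```python
def build_contexts(doc_graph, max_docs):
    """Greedy traversal sketch: start from an unvisited doc, repeatedly hop
    to its most-similar unvisited neighbor, so each pretraining context
    packs related documents together instead of random ones."""
    visited, contexts = set(), []
    for start in doc_graph:
        if start in visited:
            continue
        ctx, cur = [], start
        while cur is not None and len(ctx) < max_docs:
            ctx.append(cur)
            visited.add(cur)
            # Most similar neighbor not yet placed in any context, if any.
            nbrs = [(sim, d) for d, sim in doc_graph[cur].items()
                    if d not in visited]
            cur = max(nbrs)[1] if nbrs else None
        contexts.append(ctx)
    return contexts

# Toy graph: docs A/B share one topic, C/D another (edge weight = similarity).
graph = {
    "A": {"B": 0.9, "C": 0.1},
    "B": {"A": 0.9, "D": 0.2},
    "C": {"D": 0.8, "A": 0.1},
    "D": {"C": 0.8, "B": 0.2},
}
contexts = build_contexts(graph, max_docs=2)
```

Each resulting context groups topically related documents, so earlier docs actually carry signal for predicting later ones.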
1️⃣ Context-Aware Decoding simply contrasts output probabilities with and without the desired focus contexts and samples from this contrasted distribution 📊.
2️⃣ How well does it work?
Without additional training, it improves pretrained LMs' faithfulness (14.3%📈 for LLaMA)
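The contrast itself fits in a few lines. A toy sketch on raw next-token logits (the 3-token vocabulary, logit values, and `alpha` here are invented for illustration):

```python
import math

def context_aware_decoding(logits_with_ctx, logits_no_ctx, alpha=0.5):
    """Contrast next-token logits computed with vs. without the context:
    (1 + alpha) * logit_with - alpha * logit_without upweights tokens the
    context supports and downweights the LM's context-free prior."""
    return [(1 + alpha) * lw - alpha * ln
            for lw, ln in zip(logits_with_ctx, logits_no_ctx)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Toy vocab of 3 tokens: even with the context present, the LM still slightly
# prefers token 2 (its prior); the context actually supports token 0.
with_ctx = [1.8, 0.5, 2.0]
no_ctx = [0.2, 0.5, 2.5]
probs = softmax(context_aware_decoding(with_ctx, no_ctx))
```

Sampling from `probs` instead of the raw with-context distribution flips the prediction toward the context-supported token.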
Check out
@BunsenFeng
's work Cook🧑🍳: empowering GPT with modular and community-driven knowledge: (1) 25 specialized LMs serve as parametric knowledge repositories for GPT (2) Request knowledge only when needed (3) Top performance on MMLU and fact checking
With a pool of community-contributed specialized LMs, we propose bottom-up and top-down, two approaches to integrate black-box LLMs and these modular knowledge repos.
bottom-up: multi-domain knowledge synthesis
top-down: the LLM selects and activates specialized LMs when necessary
We release the code:
@zhichaoxu_ir
has a very nice reimplementation and shows it can reduce hallucination of latest models such as Mistral-7B as well. His code and writeup 👇:
🛠️:
Practical application:
🔍Detecting copyright violations in pretraining data using Min-K% Prob 🕵️
E.g., We see evidence that GPT-3.5 (specifically, text-davinci-003) is likely to be pretrained on copyrighted books from the Pile Books3 dataset 👇
[8/n]
How do pretraining design choices affect detection difficulty?
Harder detection with
1. Smaller model size 📉
2. Shorter lengths of text for detection 📉
3. More training data 📈
4. Lower occurrence frequency of the example being detected 📉
5. Lower learning rates 📉
[9/n]
To try our detection method Min-K% Prob 🕵️:
1️⃣ Compute token probabilities in the text.
2️⃣ Pick the k% tokens with minimum probabilities.
3️⃣ Compute their average log likelihood.
High average? Text is probably in pretraining data ✅
[4/n]
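The three steps above can be sketched directly (the log-probabilities here are toy values, not real LLM outputs):

```python
def min_k_prob(token_logprobs, k=0.2):
    """Min-K% Prob score: average log-likelihood of the k% lowest-probability
    tokens. A higher score suggests the text was seen during pretraining."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # the k% least-likely tokens
    return sum(lowest) / n

# Toy log-probs: the "unseen" text contains a few surprising
# (very low-probability) outlier tokens; the "seen" text does not.
seen = [-0.5, -0.8, -0.4, -0.6, -0.7, -0.5, -0.9, -0.6, -0.4, -0.5]
unseen = [-0.5, -0.8, -6.0, -0.6, -7.5, -0.5, -0.9, -0.6, -5.2, -0.5]
assert min_k_prob(seen) > min_k_prob(unseen)  # seen text scores higher
```

Only token log-probabilities are needed, which is why the method works with black-box APIs that expose logprobs.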
Evaluating detection method efficacy?
Introducing WikiMIA🌟, a dynamic benchmark built from data created before and after a model's pretraining period, providing ground-truth seen/unseen labels. It evolves with new LLMs, updating the seen/unseen data.
Data:
[6/n]
❓: Can we determine if an LLM was pretrained on a certain text, having only black-box access to it?
This is known as a membership inference attack in ML security, but its application in the context of LLM pretraining is still relatively underexplored
[3/n]
@pajeeter
There are two main differences: 1) RAG is an encoder-decoder LM with retrieval augmentation, while ours augments a decoder-only LM (e.g., GPT, OPT) with retrieval. 2) RAG finetunes the LM's parameters to make it learn to read the retrieved documents, whereas we keep the LM frozen 😀
Instruction-finetuning on a large number of datasets with diverse task instructions improves the robustness of Instructor👨🏫 to instruction paraphrases (i.e., smaller performance gaps between best- and worst-performing instructions)
How well does the detection method work?
We show that Min-K% Prob 🕵️ outperforms the strongest existing baseline and can be used to detect pretraining data from various LLMs, including GPT-3, OPT, and LLaMA.
[7/n]
Method Intuition? ✨
We observe that an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to contain words with such low probabilities.
[5/n]
2/ We are the first to study zero-shot application of the k-nearest neighbors language model (kNN-LM) to end tasks, and we find the main challenge of applying it naïvely is the sparsity of the kNN distribution.
3/ We introduce kNN-Prompt to address this issue. Key to our approach is the introduction of fuzzy verbalizers which leverage the sparse kNN distribution for downstream tasks by automatically associating each classification label with a set of natural language tokens.
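A toy sketch of the fuzzy-verbalizer idea (the label-token sets and probability values below are invented for illustration; the paper builds the token sets automatically):

```python
def fuzzy_verbalizer_score(token_probs, label_tokens):
    """Score a label by summing next-token probability mass over every
    token associated with that label, not just one canonical verbalizer."""
    return sum(token_probs.get(tok, 0.0) for tok in label_tokens)

# Toy sentiment task: a sparse kNN distribution puts mass on only a few
# tokens, but each label maps to a *set* of tokens, so it still gets credit.
verbalizers = {
    "positive": {"great", "good", "amazing"},
    "negative": {"bad", "terrible", "awful"},
}
knn_probs = {"great": 0.2, "good": 0.1, "awful": 0.05}
pred = max(verbalizers,
           key=lambda lab: fuzzy_verbalizer_score(knn_probs, verbalizers[lab]))
```

With a single verbalizer per label (say, only "good" vs. "bad"), the sparse kNN distribution might assign zero mass to both; the fuzzy sets make the sparse mass usable.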
@HeyNikhila
Multiple people, including ourselves, have tried the code and it works smoothly. We suspect this error may be related to your sentence-transformers library. Could you please double-check that you have installed the sentence-transformers library according to ?
This method has broader applications beyond pure text settings. For example, similar ideas are applied to vision-language models to increase their focus on visual prompts 👇
Pointing to an image region should help models focus, but standard VLMs fail to understand visual markers/prompts (e.g., boxes/masks).
🚨Contrastive Region Guidance: Training-free method that increases focus on visual prompts by reducing model priors.
🧵
@_TobiasLee
That's a great question! Our methods are limited to models that provide output token probabilities, like text-davinci-003. It would be very interesting to see future work that could develop methods to identify the pretraining corpus without relying on logits.
1/ Retrieval-augmented language models have been shown to outperform their non-retrieval-based counterparts on language modeling tasks.
But it is an open question whether they also achieve similar gains in zero-shot end task evaluations.
4/ Experiments on 11 datasets (text classification, fact retrieval, and question answering) show that kNN-Prompt
1) yields large performance improvements over zero-shot baselines
2) is effective for domain adaptation without further training
@OhadRubin
Hi Ohad. Thanks for your interest in our work❤️!! After computing the kNNs for each query document, we additionally performed deduplication by filtering out neighboring documents with >90% 3-gram overlap.
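That filtering step can be sketched as follows (an illustrative reconstruction with word-level 3-grams, not the paper's actual code):

```python
def trigrams(text):
    """Set of word-level 3-grams in a text."""
    toks = text.split()
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def dedup_neighbors(query_doc, neighbor_docs, threshold=0.9):
    """Drop retrieved neighbors that are near-duplicates of the query doc,
    measured by the fraction of their 3-grams shared with the query."""
    q = trigrams(query_doc)
    kept = []
    for doc in neighbor_docs:
        g = trigrams(doc)
        overlap = len(q & g) / max(1, len(g))
        if overlap <= threshold:
            kept.append(doc)
    return kept

query = "the quick brown fox jumps over the lazy dog"
neighbors = [
    "the quick brown fox jumps over the lazy dog",  # exact duplicate: dropped
    "a completely different sentence about something else",  # kept
]
kept = dedup_neighbors(query, neighbors)
```

Without this step, near-duplicate web pages dominate the nearest neighbors and the "related docs" in a context add no new signal.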
@main_horse
@arankomatsuzaki
We conducted some experiments on the GPT-3.5 model (specifically, text-davinci-003), from which ChatGPT is finetuned, as outlined at . It's worth noting that the logprobs of text-davinci-003 can be accessed via the API.
Thank you for bringing this to our
@_AngelinaYang_
@YangsiboHuang
@arankomatsuzaki
There is a lot of debate over this topic. Incorporating copyrighted content into pretraining could potentially violate copyright laws.
@katherine1ee
has an insightful blog post discussing generative AI and copyright:
@mrdrozdov
Thank you for your interest! I agree. Future studies could focus on constructing more informative but challenging contexts for LMs to learn more during pretraining