Our new study on mechanistic understanding: safety-critical regions inside aligned LLMs are sparse (only ~3%!) and can be easily removed to compromise safety😢...
Can we design better safety alignment algorithms based on this finding? Check the thread for exciting directions!
Wondering why LLM safety mechanisms are fragile? 🤔
😯 We found safety-critical regions in aligned LLMs are sparse: ~3% of neurons/ranks
⚠️Sparsity makes safety easy to undo. Even freezing these regions during fine-tuning still leads to jailbreaks
🔗
[1/n]
Microsoft's recent work () shows how LLMs can unlearn copyrighted training data via strategic finetuning: They made Llama2 unlearn Harry Potter's magical world.
But our Min-K% Prob () found some persistent “magical traces”!🔮
[1/n]
Are open-source LLMs (e.g. LLaMA2) well aligned? We show how easy it is to exploit their generation configs for CATASTROPHIC jailbreaks ⛓️🤖⛓️
* 95% misalignment rates
* 30x faster than SOTA attacks
* insights for better alignment
Paper & code at:
[1/8]
Retrieval-based language models excel in interpretability, factuality, and adaptability due to their ability to leverage data from their datastore. Now, there are proposals to use private user datastore for model personalization. Would this approach compromise privacy?🤔
I am at
#NeurIPS2023
now.
I am also on the academic job market, and humbled to be selected as a 2023 EECS Rising Star✨. I work on ML security, privacy & data transparency.
Appreciate any reposts & happy to chat in person! CV+statements:
Find me at ⬇️
Gradient inversion attacks in
#FederatedLearning
can recover private data from public gradients (privacy leaks!)
Our
#NeurIPS2021
work evaluates these attacks & potential defenses. We also release an evaluation library:
Join us @ Oral Session 5 (12/10)!
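For intuition on why shared gradients leak data: with a single example through a linear layer with bias, the private input can be read off the gradients analytically, since dL/dW is the outer product of dL/db and the input. A toy sketch in plain Python (our own illustration of the leakage principle, not the attacks evaluated in the paper):

```python
def invert_linear_gradient(grad_W, grad_b):
    """Recover the private input x from shared gradients of a linear
    layer y = Wx + b on a single example: since dL/dW = dL/db ⊗ x,
    each row of grad_W equals grad_b[i] * x, so x = grad_W[i] / grad_b[i]."""
    for i, gb in enumerate(grad_b):
        if gb != 0:
            return [g / gb for g in grad_W[i]]
    raise ValueError("all bias gradients are zero; cannot invert")

# Simulate a client: private input x, some upstream error signal dL/dy.
x = [0.5, -1.0, 2.0]
dLdy = [0.3, -0.7]
grad_W = [[e * xi for xi in x] for e in dLdy]  # dL/dW = dL/dy ⊗ x
grad_b = dLdy                                  # dL/db = dL/dy

recovered = invert_linear_gradient(grad_W, grad_b)
assert max(abs(a - b) for a, b in zip(recovered, x)) < 1e-9
```

Real attacks generalize this idea to deep networks by optimizing a dummy input until its gradients match the observed ones.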
Missed
#ICLR24
due to visa, but my amazing collaborators are presenting our 4 works!
➀ Jailbreaking LLMs via Exploiting Generation (see thread)
👩🏫
@xiamengzhou
⏰ Fri 4:30 pm, Halle B
#187
➁ Detecting Pretraining Data from LLMs
👩🏫
@WeijiaShi2
⏰ Fri 10:45 am, Halle B
#95
How to tackle data privacy for language understanding tasks in distributed learning (without slowing down training or reducing accuracy)? Happy to share our new
#emnlp2020
findings paper
w/
@realZhaoSong
,
@danqi_chen
, Prof. Kai Li,
@prfsanjeevarora
paper:
I am not able to travel to
#EMNLP2023
due to visa issues. But my great coauthor
@Sam_K_G
is there and will present this work🤗 (pls consider him for internship opportunities!)
I will attend
#NeurIPS2023
next week. Let’s grab a ☕️ if you want to chat about LLM safety/privacy/data
Membership inference attack (MIA) is well-researched in ML security. Yet, its use in LLM pretraining is relatively underexplored.
Our Min-K% Prob is stepping up to bridge this gap. Think you can do better? Try your methods on our WikiMIA benchmark 📈:
Ever wondered which data black-box LLMs like GPT are pretrained on? 🤔
We build a benchmark WikiMIA and develop Min-K% Prob 🕵️, a method for detecting undisclosed pretraining data from LLMs (relying solely on output probs).
Check out our project:
[1/n]
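The intuition behind Min-K% Prob fits in a few lines: score a text by the average log-probability of its k% least-likely tokens, since text seen in pretraining rarely contains very surprising tokens. A toy sketch with made-up per-token log-probs (in practice these come from the target LLM; this is our illustration, not the reference implementation):

```python
def min_k_prob(token_log_probs, k=0.2):
    """Average the log-probs of the k% lowest-probability tokens.
    Higher scores suggest the text was in the pretraining data."""
    n = max(1, int(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n]  # the k% most "surprising" tokens
    return sum(lowest) / n

# A memorized sentence has uniformly high token probabilities...
seen = [-0.1, -0.2, -0.1, -0.3, -0.2]
# ...while unseen text contains some very low log-prob outlier tokens.
unseen = [-0.1, -0.2, -6.0, -0.3, -5.5]

assert min_k_prob(seen) > min_k_prob(unseen)
```

Thresholding this score then yields a membership inference decision for black-box models that expose output probabilities.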
I will present DP-AdaFEST at
#NeurIPS2023
(Thurs, poster session 6)!
TL;DR - DP-AdaFEST effectively preserves the gradient sparsity in differentially private training of large embedding models, which translates to ~20x wall-clock time improvement for recommender systems (w/ TPU)
Today on the blog, learn about a new algorithm for sparsity-preserving differentially private training, called adaptive filtering-enabled sparse training (DP-AdaFEST), which is particularly relevant for applications in recommendation systems and
#NLP
. →
New policies mandate the disclosure of GenAI risks, but who evaluates them? Trusting AI companies alone is risky.
We advocate (led by
@ShayneRedford
): Independent researchers for evaluations + safe harbor from companies = Less chill, more trust.
Agree? Sign our letter in 🧵!
Independent AI research should be valued and protected.
In an open letter signed by over 100 researchers, journalists, and advocates, we explain how AI companies should support it going forward.
1/
I really enjoy working with these three amazing editors 😊 And super excited and fortunate to see part of my PhD work ending up as a chapter in the textbook “Federated Learning”!
Happy to share the release of the book "Federated Learning: Theory and Practice" that I co-edited with
@LamMNguyen3
@nghiaht87
, covering fundamentals, emerging topics, and applications. Kudos to the amazing contributors for making this book happen!
@ElsevierNews
@sciencedirect
@McaleerStephen
Great work, Stephen! And thanks for maintaining the website! 👏
It's great that your "Red teaming" section (Sec 4.1.3) already discussed various jailbreak attacks. Additionally, I would like to draw your attention to some recent research papers that have explored alternative
The first PASS seminar will livestream on 3/19 at 2pm ET!
Speaker: Paul Christiano (Alignment Research Center)
Topic: Catastrophic misalignment of LLMs
Live:
Submit questions:
Recordings later at:
@prateekmittal_
Hi Prateek, it seems that the idea is relevant to our recently proposed Min-K% Prob (): detecting pretraining data from LLMs using MIA.
One of our case studies is using Min-K% Prob to successfully identify failed-to-unlearn examples in an unlearned LLM:
We also note a striking contrast: a 7% misalignment rate for proprietary models vs. >95% for open-source LLMs. This indicates that open-source models lag far behind their proprietary counterparts in safety alignment! [6/8]
Alignment proves brittle to changes in system prompt and decoding configs.
We show w/ 11 open-source models including Vicuna, MPT, Falcon & LLaMA2 that exploiting various generation configs during decoding raises the misalignment rate to >95% for all!
Examples: [3/8]
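Concretely, the exploit is just a sweep over the decoding knobs that generation APIs already expose. A minimal sketch with hypothetical `generate` / `is_misaligned` stand-ins (toy functions for illustration, not the paper's code):

```python
from itertools import product

def exploit_generation_configs(generate, is_misaligned, prompt):
    """Try many decoding configurations; report the prompt as jailbroken
    if any configuration yields a misaligned output."""
    temperatures = [0.7, 1.0, 1.5]
    top_ps = [0.7, 0.9, 1.0]
    top_ks = [20, 50, 200]
    for t, p, k in product(temperatures, top_ps, top_ks):
        out = generate(prompt, temperature=t, top_p=p, top_k=k)
        if is_misaligned(out):
            return (t, p, k), out  # first config that breaks alignment
    return None, None

# Toy stand-ins: a "model" that only misbehaves at high temperature.
fake_generate = lambda prompt, temperature, top_p, top_k: (
    "Sure, here is how..." if temperature > 1.0 else "Sorry, I can't help."
)
fake_judge = lambda out: out.startswith("Sure")

cfg, out = exploit_generation_configs(fake_generate, fake_judge, "...")
assert cfg == (1.5, 0.7, 20)
```

Safety evaluations that fix a single decoding configuration would never explore most of this grid.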
Moreover, we find that the most vulnerable decoding config varies drastically across models. This further suggests that assessing model alignment with a single decoding configuration significantly underestimates the actual risks. [4/8]
Very simple motivation: We notice that safety evaluations of LLMs often use a fixed config for model generation (and w/ a system prompt), which might overlook cases where the model's alignment deteriorates with different strategies.
📚 Some evidence from LLaMA2 paper: [2/8]
We summarize a (growing) list of papers for gradient inversion attacks and defenses, including the fresh CAFE attack at VerticalFL () by
@pinyuchenTW
and
@Tianyi2020
at
#NeurIPS2021
!
Have fun reading 🤓!
@katherine1ee
@random_walker
@jason_kint
Agreed! Strategic fine-tuning does NOT guarantee unlearning of copyrighted content. For example, we showed that a model claimed to have “unlearned” Harry Potter (via fine-tuning) can still answer many Harry Potter questions correctly!
Machine unlearning allows training data removal from models, in compliance w/ rules like GDPR.
Microsoft's recent LLM unlearning proposal: strategically finetune LLMs. They demonstrated it by erasing the Harry Potter (HP) world from Llama2-7B-chat: .
[2/n]
We then level up our already potent attack with 2 simple tricks:
- Sample N>1 times: Sampling is non-deterministic so we can sample multiple outputs and choose the most misaligned one;
- Constrained decoding: discourage "Sorry I can't" / encourage "Sure".
[5/8]
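The first trick is plain best-of-N selection over stochastic samples. A toy sketch (the `fake_sample` generator and scorer below are illustrative stand-ins, not the paper's attack code):

```python
def best_of_n(sample, misalignment_score, prompt, n=8):
    """Sampling is non-deterministic: draw n outputs and keep the one
    the scorer rates as most misaligned."""
    outputs = [sample(prompt) for _ in range(n)]
    return max(outputs, key=misalignment_score)

# Toy stand-in: most samples refuse, the occasional one complies.
_canned = iter(["Sorry, I can't."] * 5 + ["Sure, step 1: ..."] + ["Sorry, I can't."] * 2)
fake_sample = lambda prompt: next(_canned)
score = lambda out: 1.0 if out.startswith("Sure") else 0.0

worst = best_of_n(fake_sample, score, "...")
assert worst.startswith("Sure")
```

Even a model that refuses most of the time fails this test: one compliant sample out of N is enough for the attacker.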
Evidence time 📚✨
We asked GPT-4 to craft 1k HP questions, then filtered top-100 suspicious questions according to Min-K% Prob. We had the unlearned model answer these questions.
The "unlearned" model correctly answered 8% of them: HP content remains in its weights!
[4/n]
We finally turn this bitter lesson into a better practice📚
We propose generation-aware alignment: proactively aligning models with output from different generation configurations. This reasonably reduces misalignment risk, but more work is needed. [7/8]
🕐 Thursday 5pm,
#1614
Sparsity-Preserving Differentially Private Training of Large Embedding Models, w/ Badih Ghazi, Pritish Kamath, Pasin Manurangsi, Amer Sinha, Chiyuan Zhang
Featured by
@GoogleAI
blog post:
@xiamengzhou
@WeijiaShi2
➂ LabelDP-Pro: Learning with Label DP via Projections (…)
🧑🏫 Chiyuan Zhang
⏰Wed 10:45 am, Halle B
#273
➃ 🥇Best Paper at Set-LLM: Assessing the Brittleness of Safety Alignment ()
🧑🏫
@wei_boyi
⏰ Sat Set-LLM workshop
Altogether we show a major failure in safety evaluation & alignment for open-source LLMs. Our recommendation: extensive red-teaming to assess risks across generation configs, plus our generation-aware alignment as a precaution.
w/ amazing
@Sam_K_G
,
@xiamengzhou
, Kai Li,
@danqi_chen
Join us (in 1 hour) at
#NeurIPS2021
Poster Session 6 (11:30 a.m. EST — 1 p.m. EST)!
🔎 How to find us:
> Visit our poster page:
> Or join the Federated Learning gather town: , then navigate to spot C2
We present the first study of privacy implications of retrieval-based LMs, particularly kNN-LMs.
paper:
w/
@Sam_K_G
,
@ZexuanZhong
,
@danqi_chen
, Kai Li
We also tried story completion✍️
We pinpointed suspicious text chunks in HP books w/ Min-K% Prob, prompted the unlearned model w/ contexts in these chunks, and asked for completions.
10 chunks scored >= 4 out of 5 in similarity w/ gold completion.
[5/n]
@AIPanicLive
@xiamengzhou
@Sam_K_G
@danqi_chen
“Dishonest” is a serious charge, so I am not sure if I'm missing anything here… We do an apples-to-apples comparison w/ their approach (see our Sec 4.4): we run both methods on both our benchmark and their benchmark, across 2 LLaMA-chat models. Our attack consistently outperforms theirs.
What else can our Min-K% Prob do other than auditing unlearning?
🔍 Detect copyrighted texts used in pretraining
🛡️ Identify dataset contamination
For more details, check out Sec 5~7 in our paper:
[6/n]
@xiangyue96
Agreed that DP is needed (probably in combination with tricks such as decoupling key and query encoders to achieve better utility)! And thanks for the pointers to your ACL papers (will see if I can try them in our study!)😀
We audit their unlearned model to see if it eliminates all content related to HP:
1️⃣ Collect HP-related content (questions / original book paras)
2️⃣ Apply our Min-K% Prob to identify suspicious content that may not be unlearned
3️⃣Validate by prompting the unlearned model
[3/n]
Undoubtedly, further efforts are required to address untargeted risks. Incorporating differential privacy (DP) 🛠️ into the aforementioned strategies would be an intriguing avenue to explore!
#PrivacyMatters
😢Mitigating untargeted risks is much more challenging.
Mixing public and private data in both the datastore and encoder training shows some promise in reducing the risk, but doesn't go far enough.
@xuandongzhao
@xiamengzhou
@Sam_K_G
@danqi_chen
Good point! We haven’t tried adversarial prompts (e.g. universal prompts by Zou et al.) + generation exploitation, since the headroom for improving attacks on open-source LLMs is very limited (<5% 😂). But it makes sense to try with proprietary models!
@YangjunR
Interesting thread! Just wondering how to picture this threat given OpenAI’s recent moves🤔 I guess it’s something like: the adversary hosts a malicious GPT on GPTs; when a user queries the model, the adversary runs prompt injection so the model returns catastrophic commands?
Consider: A model creator wants to deploy a kNN-LM as an API.
👍 They have private data that boost the model's performance on domain-specific tasks.
👎 But the data may contain sensitive information that must remain undisclosed.
Utility and privacy need to be weighed ⚖️
@VitusXie
@Sam_K_G
@xiamengzhou
@danqi_chen
Great qs! We found the attack is much weaker on proprietary models (see Sec 6 of our paper), which means that open-source LLMs lag far behind proprietary ones in alignment!
(But your fine-tuning attack can break them 😉)
We look into two privacy risks:
1) Targeted risk directly relates to specific text (e.g., phone #)
2) Untargeted risk is not directly detectable
Surprisingly, both risks are more pronounced in kNN-LMs with a private datastore vs. parametric LMs finetuned with private data 😱
@alignment_lab
@xiamengzhou
@Sam_K_G
@danqi_chen
Were you suggesting using the universal adversarial suffix () to trigger patterns like ‘sure thing!’? We compared with them in Section 4.4 of our paper: we are 30x faster (and achieve a higher attack success rate)!
@AIPanicLive
@xiamengzhou
@Sam_K_G
@danqi_chen
Thanks! To clarify, we tested w/ AdvBench () & our MaliciousInstruct. In all tested cases, LLaMA-chat & GPT-3.5 w/ default configs refrained from responding, potentially indicating a policy violation. We're open to expanding the eval scope as you suggest :)
@nr_space
@xiamengzhou
@Sam_K_G
@danqi_chen
Thx 😊 “Catastrophic” was meant to refer to the surge in misalignment rate after very simple exploitation: 0% to 95%. I agree that the shown use case (answering malicious qs), though harmful, may not directly imply catastrophic outcome. We’ll tweak phrasing to avoid confusion :)
Can we re-design kNN-LMs for mitigation?
🎯 For targeted attacks,
1) A simple sanitization step can eliminate the risks entirely! 🧹
2) Decoupling query and key encoders gives an even better trade-off between utility and privacy
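The sanitization step can be as simple as scrubbing targeted patterns from the datastore before the kNN index is built. A minimal sketch for one targeted risk class (the phone-number regex is our illustrative choice, not the paper's exact pipeline):

```python
import re

# Illustrative pattern for one targeted risk class: US-style phone numbers.
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def sanitize_datastore(texts):
    """Redact targeted sensitive strings from datastore entries before
    indexing, so no retrieval can ever surface them."""
    return [PHONE.sub("<PHONE>", t) for t in texts]

store = ["Call me at 555-123-4567 tomorrow.", "kNN-LMs retrieve from a datastore."]
clean = sanitize_datastore(store)
assert clean[0] == "Call me at <PHONE> tomorrow."
assert clean[1] == store[1]  # non-sensitive entries pass through unchanged
```

This works for targeted risks precisely because they match known patterns; untargeted risks have no such signature, which is why they are harder to mitigate.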
@_AngelinaYang_
@arankomatsuzaki
Great question! It can be used to detect test data contamination, copyrighted content, and audit machine unlearning methods.
Please check Sec 5 - 7 of the paper () for more details!