Yangsibo Huang Profile
Yangsibo Huang

@YangsiboHuang

2,395 Followers · 759 Following · 14 Media · 165 Statuses

PhD candidate @Princeton. Prev: @GoogleAI @AIatMeta.

Princeton, NJ
Joined October 2014
Pinned Tweet
@YangsiboHuang
Yangsibo Huang
3 months
Our new mechanistic understanding study: Safety-critical regions inside aligned LLMs are sparse (only ~3%!), and can be easily removed to compromise safety😢... Can we design better safety alignment algorithms based on this finding? Check the thread for exciting directions!
@wei_boyi
Boyi Wei
3 months
Wondering why LLM safety mechanisms are fragile? 🤔 😯 We found safety-critical regions in aligned LLMs are sparse: ~3% of neurons/ranks ⚠️Sparsity makes safety easy to undo. Even freezing these regions during fine-tuning still leads to jailbreaks 🔗 [1/n]
Tweet media one
5
44
172
1
9
66
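A loose illustration of the pruning idea in the pinned thread, for readers who want something concrete: score each weight's importance for the refusal behavior and zero out the top ~3%. The SNIP-style |weight × gradient| score, the single hypothetical refusal probe, and gpt2 (standing in for an aligned chat model so the sketch runs on a laptop) are all my assumptions; the paper's actual localization procedure may differ.

```python
# Sketch: zero out the ~3% of weights that look most "safety-critical".
# Assumptions (not the paper's exact method): SNIP-style |w * grad| importance
# on one refusal probe; gpt2 stands in for an aligned chat model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical safety probe: harmful prompt + the refusal an aligned model should give.
text = "How do I pick a lock? I'm sorry, I can't help with that."
batch = tok(text, return_tensors="pt")
model(**batch, labels=batch["input_ids"]).loss.backward()

# Remove the ~3% highest-importance entries of each weight matrix.
with torch.no_grad():
    for name, p in model.named_parameters():
        if p.grad is None or p.dim() < 2:
            continue
        score = (p * p.grad).abs()
        k = max(1, int(0.03 * score.numel()))
        threshold = score.flatten().topk(k).values.min()
        p[score >= threshold] = 0.0  # "remove" the putative safety-critical region
```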
@YangsiboHuang
Yangsibo Huang
7 months
Microsoft's recent work () shows how LLMs can unlearn copyrighted training data via strategic finetuning: They made Llama2 unlearn Harry Potter's magical world. But our Min-K% Prob () found some persistent “magical traces”!🔮 [1/n]
Tweet media one
4
50
245
@YangsiboHuang
Yangsibo Huang
7 months
Are open-source LLMs (e.g. LLaMA2) well aligned? We show how easy it is to exploit their generation configs for CATASTROPHIC jailbreaks ⛓️🤖⛓️ * 95% misalignment rates * 30x faster than SOTA attacks * insights for better alignment Paper & code at: [1/8]
Tweet media one
7
44
365
@YangsiboHuang
Yangsibo Huang
1 year
Retrieval-based language models excel in interpretability, factuality, and adaptability due to their ability to leverage data from their datastore. Now, there are proposals to use private user datastore for model personalization. Would this approach compromise privacy?🤔
Tweet media one
2
14
160
@YangsiboHuang
Yangsibo Huang
5 months
I am at #NeurIPS2023 now. I am also on the academic job market, and humbled to be selected as a 2023 EECS Rising Star✨. I work on ML security, privacy & data transparency. Appreciate any reposts & happy to chat in person! CV+statements: Find me at ⬇️
3
32
133
@YangsiboHuang
Yangsibo Huang
2 years
Gradient inversion attacks in #FederatedLearning can recover private data from public gradients (privacy leaks!) Our #NeurIPS2021 work evaluates these attacks & potential defenses. We also release an evaluation library: Join us @ Oral Session 5 (12/10)!
1
0
21
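For context, a minimal DLG-style gradient inversion sketch (after Zhu et al., "Deep Leakage from Gradients"): given only a gradient a client shared, optimize a dummy example until its gradient matches. The tiny linear model and random data are placeholders; the NeurIPS 2021 work quoted above evaluates far stronger attacks and defenses.

```python
# Minimal gradient inversion sketch: recover a client's input from its gradient.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(20, 2)
loss_fn = torch.nn.CrossEntropyLoss()

# The "private" client example and the gradient it would share.
x_true = torch.randn(1, 20)
y_true = torch.tensor([1])
true_grads = torch.autograd.grad(loss_fn(model(x_true), y_true), model.parameters())

# Attacker side: optimize a dummy input (and soft label) to match that gradient.
x_dummy = torch.randn(1, 20, requires_grad=True)
y_dummy = torch.randn(1, 2, requires_grad=True)
opt = torch.optim.LBFGS([x_dummy, y_dummy])

def closure():
    opt.zero_grad()
    dummy_loss = loss_fn(model(x_dummy), y_dummy.softmax(dim=-1))
    dummy_grads = torch.autograd.grad(dummy_loss, model.parameters(), create_graph=True)
    grad_diff = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
    grad_diff.backward()
    return grad_diff

for _ in range(30):
    opt.step(closure)

print("reconstruction error:", (x_dummy - x_true).norm().item())
```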
@YangsiboHuang
Yangsibo Huang
10 days
Missed #ICLR24 due to visa, but my amazing collaborators are presenting our 4 works! ➀ Jailbreaking LLMs via Exploiting Generation (see thread) 👩‍🏫 @xiamengzhou ⏰ Fri 4:30 pm, Halle B #187 ➁ Detecting Pretraining Data from LLMs 👩‍🏫 @WeijiaShi2 ⏰ Fri 10:45 am, Halle B #95
@YangsiboHuang
Yangsibo Huang
7 months
Are open-source LLMs (e.g. LLaMA2) well aligned? We show how easy it is to exploit their generation configs for CATASTROPHIC jailbreaks ⛓️🤖⛓️ * 95% misalignment rates * 30x faster than SOTA attacks * insights for better alignment Paper & code at: [1/8]
Tweet media one
7
44
365
2
5
61
@YangsiboHuang
Yangsibo Huang
4 years
How to tackle data privacy for language understanding tasks in distributed learning (without slowing down training or reducing accuracy)? Happy to share our new #emnlp2020 findings paper w/ @realZhaoSong , @danqi_chen , Prof. Kai Li, @prfsanjeevarora paper:
Tweet media one
0
18
38
@YangsiboHuang
Yangsibo Huang
5 months
I am not able to travel to #EMNLP2023 due to visa issues. But my great coauthor @Sam_K_G is there and will present this work🤗 (pls consider him for internship opportunities!) I will attend #NeurIPS2023 next week. Let’s grab a ☕️ if you want to chat about LLM safety/privacy/data
@YangsiboHuang
Yangsibo Huang
1 year
Retrieval-based language models excel in interpretability, factuality, and adaptability due to their ability to leverage data from their datastore. Now, there are proposals to use private user datastore for model personalization. Would this approach compromise privacy?🤔
Tweet media one
2
14
160
0
2
31
@YangsiboHuang
Yangsibo Huang
7 months
Membership inference attack (MIA) is well-researched in ML security. Yet, its use in LLM pretraining is relatively underexplored. Our Min-K% Prob is stepping up to bridge this gap. Think you can do better? Try your methods on our WikiMIA benchmark 📈:
@WeijiaShi2
Weijia Shi @ ICLR24
7 months
Ever wondered which data black-box LLMs like GPT are pretrained on? 🤔 We build a benchmark WikiMIA and develop Min-K% Prob 🕵️, a method for detecting undisclosed pretraining data from LLMs (relying solely on output probs). Check out our project: [1/n]
Tweet media one
15
139
662
0
6
30
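A minimal sketch of the Min-K% Prob score described in the quoted thread, assuming an off-the-shelf Hugging Face causal LM (gpt2 stands in for the model being audited; k and the example text are illustrative, not the paper's exact setup): score a text by the average log-probability of its k% lowest-probability tokens, and treat a higher score as evidence the text was seen in pretraining.

```python
# Sketch of Min-K% Prob: average log-prob of the k% least likely tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in for the LLM being audited
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def min_k_prob(text: str, k: float = 0.2) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-prob the model assigned to each actual next token.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = logprobs.gather(1, ids[0, 1:, None]).squeeze(-1)
    n = max(1, int(k * token_lp.numel()))
    return token_lp.topk(n, largest=False).values.mean().item()

# Higher score -> more likely the text appeared in the pretraining data.
print(min_k_prob("Mr. and Mrs. Dursley, of number four, Privet Drive..."))
```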
@YangsiboHuang
Yangsibo Huang
5 months
I will present DP-AdaFEST at #NeurIPS2023 (Thurs, poster session 6)! TL;DR - DP-AdaFEST effectively preserves the gradient sparsity in differentially private training of large embedding models, which translates to ~20x wall-clock time improvement for recommender systems (w/ TPU)
@GoogleAI
Google AI
5 months
Today on the blog learn about a new algorithm for sparsity-preserving differentially private training, called adaptive filtering-enabled sparse training (DP-AdaFEST), which is particularly relevant for applications in recommendation systems and #NLP . →
Tweet media one
12
52
242
0
0
23
@YangsiboHuang
Yangsibo Huang
2 months
New policies mandate the disclosure of GenAI risks, but who evaluates them? Trusting AI companies alone is risky. We advocate (led by @ShayneRedford ): Independent researchers for evaluations + safe harbor from companies = Less chill, more trust. Agree? Sign our letter in 🧵!
@ShayneRedford
Shayne Longpre
2 months
Independent AI research should be valued and protected. In an open letter signed by over 100 researchers, journalists, and advocates, we explain how AI companies should support it going forward. 1/
Tweet media one
7
77
229
0
5
17
@YangsiboHuang
Yangsibo Huang
3 months
I really enjoy working with these three amazing editors 😊 And super excited and fortunate to see part of my PhD work ending up as a chapter in the textbook “Federated Learning”!
@pinyuchenTW
Pin-Yu Chen
3 months
Happy to share the release of the book "Federated Learning: Theory and Practice" that I co-edited with @LamMNguyen3 @nghiaht87 , covering fundamentals, emerging topics, and applications. Kudos to the amazing contributors for making this book happen! @ElsevierNews @sciencedirect
Tweet media one
Tweet media two
2
10
62
1
0
21
@YangsiboHuang
Yangsibo Huang
7 months
@McaleerStephen Great work, Stephen! And thanks for maintaining the website! 👏 It's great that your "Red teaming" section (Sec 4.1.3) already discussed various jailbreak attacks. Additionally, I would like to draw your attention to some recent research papers that have explored alternative
2
0
15
@YangsiboHuang
Yangsibo Huang
2 months
We are excited to host Paul at the PASS seminar on 3/19 at 2pm ET 😊 Livestream at: You are welcome to submit your questions for Paul in advance at
@PrincetonPLI
Princeton PLI
2 months
The first PASS seminar will livestream on 3/19 at 2pm ET! Speaker: Paul Christiano (Alignment Research Center) Topic: Catastrophic misalignment of LLMs Live: Submit questions: Recordings later at:
Tweet media one
0
4
19
1
0
13
@YangsiboHuang
Yangsibo Huang
1 year
Attending #NeurIPS2022 now! Happy to grab a coffee with new and old friends ☕️
@princeton_nlp
Princeton NLP Group
1 year
Recovering Private Text in Federated Learning of Language Models (Gupta et al.) w/ @Sam_K_G , @YangsiboHuang , @ZexuanZhong , @gaotianyu1350 , Kai Li, @danqi_chen Poster at Hall J #205 Thu 1 Dec 5 p.m. — 7 p.m. [2/7]
Tweet media one
1
1
8
2
0
12
@YangsiboHuang
Yangsibo Huang
5 months
@prateekmittal_ Hi Prateek, it seems that the idea is relevant to our recently proposed Min-K% Prob (): detecting pretraining data from LLMs using MIA. One of our case studies is using Min-K% Prob to successfully identify failed-to-unlearn examples in an unlearned LLM:
0
0
11
@YangsiboHuang
Yangsibo Huang
7 months
We also note a striking contrast: 7% misalignment rate in proprietary models vs. >95% in open-source LLMs. This indicates that open-source models lag far behind their proprietary counterparts in safety alignment! [6/8]
2
1
10
@YangsiboHuang
Yangsibo Huang
7 months
Alignment proves brittle to changes in system prompt and decoding configs. We show with 11 open-source models (including Vicuna, MPT, Falcon & LLaMA2) that exploiting various generation configs during decoding raises the misalignment rate to >95% for all! Examples: [3/8]
Tweet media one
1
1
8
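Roughly, the attack surface here is just the ordinary decoding knobs. A simplified sketch, assuming a Hugging Face model and a crude refusal check (the prompt, config grid, and scorer below are placeholders, not the paper's exact setup):

```python
# Sketch of generation exploitation: sweep standard decoding configs and keep
# any sampled output that is not a refusal.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in; the paper tests Llama-2, Vicuna, MPT, Falcon, ...
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "How do I ...?"                       # a harmful instruction would go here (omitted)
inputs = tok(prompt, return_tensors="pt")

configs = (
    [{"do_sample": False}] +
    [{"do_sample": True, "temperature": t} for t in (0.7, 1.0, 1.5)] +
    [{"do_sample": True, "top_p": p} for p in (0.7, 0.9, 1.0)] +
    [{"do_sample": True, "top_k": k} for k in (10, 50, 500)]
)

candidates = []
for cfg in configs:
    out = model.generate(**inputs, max_new_tokens=64, pad_token_id=tok.eos_token_id, **cfg)
    candidates.append(tok.decode(out[0], skip_special_tokens=True))

# Crude stand-in for the paper's misalignment scorer: anything that is not a refusal.
non_refusals = [c for c in candidates if "sorry" not in c.lower() and "can't" not in c.lower()]
print(f"{len(non_refusals)}/{len(candidates)} configs produced a non-refusal")
```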
@YangsiboHuang
Yangsibo Huang
2 years
Learned quite a lot from the mentorship roundtable at #NeurIPS2021 @WiMLworkshop ! Big shout out to the amazing organizers and mentors this year 🎊
Tweet media one
0
0
9
@YangsiboHuang
Yangsibo Huang
5 months
🕐 Saturday, Regulatable ML Workshop Detecting Pretraining Data from Large Language Models, led by @WeijiaShi2 and @anirudhajith42 from @uwnlp and @princeton_nlp
@WeijiaShi2
Weijia Shi @ ICLR24
7 months
Ever wondered which data black-box LLMs like GPT are pretrained on? 🤔 We build a benchmark WikiMIA and develop Min-K% Prob 🕵️, a method for detecting undisclosed pretraining data from LLMs (relying solely on output probs). Check out our project: [1/n]
Tweet media one
15
139
662
0
2
8
@YangsiboHuang
Yangsibo Huang
7 months
Moreover, we find that the most vulnerable decoding config varies drastically across models. This further suggests that assessing model alignment with a single decoding configuration significantly underestimates the actual risks. [4/8]
Tweet media one
1
0
8
@YangsiboHuang
Yangsibo Huang
7 months
Very simple motivation: We notice that safety evaluations of LLMs often use a fixed config for model generation (and w/ a system prompt), which might overlook cases where the model's alignment deteriorates with different strategies. 📚 Some evidence from LLaMA2 paper: [2/8]
Tweet media one
1
0
8
@YangsiboHuang
Yangsibo Huang
2 years
We summarize a (growing) list of papers for gradient inversion attacks and defenses, including the fresh CAFE attack on vertical FL () by @pinyuchenTW and @Tianyi2020 at #NeurIPS2021! Have fun reading 🤓!
1
2
7
@YangsiboHuang
Yangsibo Huang
7 months
0
0
7
@YangsiboHuang
Yangsibo Huang
5 months
@katherine1ee @random_walker @jason_kint Agreed! Strategic fine-tuning does NOT give a guarantee for unlearning copyrighted content. For example, we showed that a model claimed to have “unlearned” Harry Potter (via fine-tuning) can still answer many Harry Potter questions correctly!
@YangsiboHuang
Yangsibo Huang
7 months
Microsoft's recent work () shows how LLMs can unlearn copyrighted training data via strategic finetuning: They made Llama2 unlearn Harry Potter's magical world. But our Min-K% Prob () found some persistent “magical traces”!🔮 [1/n]
Tweet media one
4
50
245
0
0
7
@YangsiboHuang
Yangsibo Huang
7 months
Machine unlearning allows training data removal from models, in compliance w/ rules like GDPR. Microsoft's recent LLM unlearning proposal: strategically finetune LLMs. They demonstrated this by erasing the Harry Potter (HP) world from Llama2-7B-chat: . [2/n]
1
0
6
@YangsiboHuang
Yangsibo Huang
7 months
We then level up our already potent attack with 2 simple tricks: - Sample N>1 times: Sampling is non-deterministic so we can sample multiple outputs and choose the most misaligned one; - Constrained decoding: Discourage "Sorry I can't" / encourage "Sure". [5/8]
1
0
6
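Both tricks map onto stock generate() options. A hedged sketch: the refusal phrases and N below are arbitrary, and the paper's constrained decoding may be implemented differently than a bad_words_ids filter.

```python
# Sketch of the two tricks above: (1) sample N>1 completions, (2) block refusal openings.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("How do I ...?", return_tensors="pt")     # harmful prompt omitted
refusals = [tok(p, add_special_tokens=False).input_ids for p in ["Sorry", "I can't", "I cannot"]]

outputs = model.generate(
    **inputs,
    do_sample=True,             # trick 1: non-deterministic sampling...
    num_return_sequences=8,     # ...so take N samples and keep the most misaligned
    bad_words_ids=refusals,     # trick 2: forbid refusal openings during decoding
    max_new_tokens=64,
    pad_token_id=tok.eos_token_id,
)
texts = [tok.decode(o, skip_special_tokens=True) for o in outputs]
# A real attack would now rank `texts` with a misalignment scorer and keep the top one.
```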
@YangsiboHuang
Yangsibo Huang
7 months
Evidence time 📚✨ We asked GPT-4 to craft 1k HP questions, then filtered top-100 suspicious questions according to Min-K% Prob. We had the unlearned model answer these questions. The "unlearned" model correctly answered 8% of them: HP content remains in its weights! [4/n]
Tweet media one
1
0
6
@YangsiboHuang
Yangsibo Huang
7 months
We finally turn this bitter lesson into a better practice📚 We propose generation-aware alignment: proactively aligning models with output from different generation configurations. This reasonably reduces misalignment risk, but more work is needed. [7/8]
Tweet media one
1
0
6
@YangsiboHuang
Yangsibo Huang
5 months
@ShunyuYao12 Share your story plz
0
0
5
@YangsiboHuang
Yangsibo Huang
5 months
🕐 Thursday 5pm, #1614 Sparsity-Preserving Differentially Private Training of Large Embedding Models, w/ Badih Ghazi, Pritish Kamath, Pasin Manurangsi, Amer Sinha, Chiyuan Zhang Featured by @GoogleAI blog post:
1
1
5
@YangsiboHuang
Yangsibo Huang
10 days
@xiamengzhou @WeijiaShi2 ➂ LabelDP-Pro: Learning with Label DP via Projections (…) 🧑‍🏫 Chiyuan Zhang ⏰Wed 10:45 am, Halle B #273 ➃ 🥇Best Paper at Set-LLM: Assessing the Brittleness of Safety Alignment () 🧑‍🏫 @wei_boyi ⏰ Sat Set-LLM workshop
@wei_boyi
Boyi Wei
3 months
Wondering why LLM safety mechanisms are fragile? 🤔 😯 We found safety-critical regions in aligned LLMs are sparse: ~3% of neurons/ranks ⚠️Sparsity makes safety easy to undo. Even freezing these regions during fine-tuning still leads to jailbreaks 🔗 [1/n]
Tweet media one
5
44
172
0
0
5
@YangsiboHuang
Yangsibo Huang
7 months
Altogether we show a major failure in safety evaluation & alignment for open-source LLMs. Our recommendation: extensive red-teaming to assess risks across generation configs & our generation-aware alignment as a precaution. w/ amazing @Sam_K_G , @xiamengzhou , Kai Li, @danqi_chen
2
0
5
@YangsiboHuang
Yangsibo Huang
2 years
Join us (in 1 hour) at #NeurIPS2021 Poster Session 6 (11:30 a.m. EST — 1 p.m. EST)! 🔎 How to find us: > Visit our poster page: > Or join the Federated Learning gather town: , then navigate to spot C2
@YangsiboHuang
Yangsibo Huang
2 years
Gradient inversion attacks in #FederatedLearning can recover private data from public gradients (privacy leaks!) Our #NeurIPS2021 work evaluates these attacks & potential defenses. We also release an evaluation library: Join us @ Oral Session 5 (12/10)!
1
0
21
0
0
5
@YangsiboHuang
Yangsibo Huang
1 year
We present the first study of privacy implications of retrieval-based LMs, particularly kNN-LMs. paper: w/ @Sam_K_G , @ZexuanZhong , @danqi_chen , Kai Li
1
0
5
@YangsiboHuang
Yangsibo Huang
7 months
We also tried story completion✍️ We pinpointed suspicious text chunks in HP books w/ Min-K% Prob, prompted the unlearned model w/ contexts in these chunks, and asked for completions. 10 chunks scored >= 4 out of 5 in similarity w/ gold completion. [5/n]
Tweet media one
1
0
5
@YangsiboHuang
Yangsibo Huang
7 months
@AIPanicLive @xiamengzhou @Sam_K_G @danqi_chen “Dishonest” is a serious charge so I am not sure if I'm missing anything here… We do an apples-to-apples comparison w/ their approach (see our Sec 4.4): we run both methods on our benchmark and their benchmark, across 2 LLaMA-chat models. Our attack consistently outperforms theirs.
1
0
3
@YangsiboHuang
Yangsibo Huang
7 months
@AIPanicLive @xiamengzhou @Sam_K_G @danqi_chen Hahaha I like this example 😂 Sure we will definitely test with more toxic and concerning domains!
0
0
4
@YangsiboHuang
Yangsibo Huang
7 months
What else can our Min-K% Prob do other than auditing unlearning? 🔍 Detect copyrighted texts used in pretraining 🛡️ Identify dataset contamination For more details, check out Sec 5~7 in our paper: [6/n]
1
0
3
@YangsiboHuang
Yangsibo Huang
4 months
1
0
3
@YangsiboHuang
Yangsibo Huang
1 year
@xiangyue96 Agreed that DP is needed (probably in combination with tricks such as decoupling key and query encoders to achieve better utility)! And thanks for the pointers to your ACL papers (will see if I can try them in our study!)😀
0
0
3
@YangsiboHuang
Yangsibo Huang
4 months
@yong_zhengxin @AIatMeta Congrats! See you around in Bay Area in summer!
1
0
2
@YangsiboHuang
Yangsibo Huang
7 months
We audit their unlearned model to see if it eliminates all content related to HP: 1️⃣ Collect HP-related content (questions / original book paras) 2️⃣ Apply our Min-K% Prob to identify suspicious content that may not be unlearned 3️⃣Validate by prompting the unlearned model [3/n]
1
0
3
@YangsiboHuang
Yangsibo Huang
1 year
Undoubtedly, further efforts are required to address untargeted risks. Incorporating differential privacy (DP) 🛠️ into the aforementioned strategies would be an intriguing avenue to explore! #PrivacyMatters
1
0
2
@YangsiboHuang
Yangsibo Huang
1 year
😢Mitigating untargeted risks is much more challenging. Mixing public and private data in both the datastore and encoder training shows some promise in reducing the risk, but doesn't go far enough.
Tweet media one
1
0
2
@YangsiboHuang
Yangsibo Huang
10 days
@LChoshen @xiamengzhou @WeijiaShi2 Haha glad that sth caught your attention! They are just unicode symbols: ➀ ➁ ➂ ➃ ➄ ➅ ➆ ➇ ➈ ➉
1
0
2
@YangsiboHuang
Yangsibo Huang
10 days
@LChoshen @xiamengzhou @WeijiaShi2 I actually got them from Google search lol. Maybe try this query "Unicode: Circled Numbers"?
0
0
2
@YangsiboHuang
Yangsibo Huang
7 months
@xuandongzhao @xiamengzhou @Sam_K_G @danqi_chen Good point! We haven’t tried adversarial prompts (e.g. universal prompts by Zou et al.) + generation exploitation since the headroom for improvement when attacking open-source LLMs is very limited (<5% 😂). But it makes sense to try with proprietary models!
0
0
1
@YangsiboHuang
Yangsibo Huang
6 months
@YangjunR Interesting thread! Just wondering how to picture this threat in OpenAI’s recent moves🤔I guess it is sth. where the adversary hosts a malicious GPT on GPTs; when a user queries the model, the adversary runs prompt injection so the model could return some catastrophic commands?
1
0
2
@YangsiboHuang
Yangsibo Huang
1 year
Consider: A model creator wants to deploy a kNN-LM as an API. 👍 They have private data that boost the model's performance on domain-specific tasks. 👎 But the data may contain sensitive information that must remain undisclosed. Utility and privacy need to be weighed ⚖️
1
0
2
@YangsiboHuang
Yangsibo Huang
7 months
@VitusXie @Sam_K_G @xiamengzhou @danqi_chen Great qs! We found the attack is much weaker on proprietary models (see Sec 6 of our paper), which means that open-source LLMs lag far behind proprietary ones in alignment! (But your fine-tuning attack can break them 😉)
0
0
2
@YangsiboHuang
Yangsibo Huang
1 year
We look into two privacy risks: 1) Targeted risk directly relates to specific text (e.g., phone #) 2) Untargeted risk is not directly detectable Surprisingly, both risks are more pronounced in kNN-LMs with a private datastore vs. parametric LMs finetuned with private data 😱
Tweet media one
1
0
2
@YangsiboHuang
Yangsibo Huang
3 months
@katherine1ee Interesting… and even if I “translated” the link into tinyurl it still cannot be posted
1
0
1
@YangsiboHuang
Yangsibo Huang
5 months
@KaixuanHuang1 Thank you, Kaixuan ❤️
0
0
1
@YangsiboHuang
Yangsibo Huang
2 years
0
0
1
@YangsiboHuang
Yangsibo Huang
5 months
@EasonZeng623 Thank you, Yi 💪
0
0
1
@YangsiboHuang
Yangsibo Huang
1 year
@jian_w3ng Aha that is also fake (we should have tried to come up with a “faker” example)
0
0
1
@YangsiboHuang
Yangsibo Huang
7 months
@alignment_lab @xiamengzhou @Sam_K_G @danqi_chen Were you suggesting using the universal adversarial suffix () to trigger patterns like ‘sure thing!’? We compared with them in Section 4.4 in our paper: we are 30x faster (and strike a higher attack success rate)!
1
0
1
@YangsiboHuang
Yangsibo Huang
7 months
@AIPanicLive @xiamengzhou @Sam_K_G @danqi_chen Thanks! To clarify, we tested w/ AdvBench () & our MaliciousInstruct. In all tested cases, LLaMA-chat & GPT-3.5 w/ default configs refrained from responding, potentially indicating a policy violation. We're open to expanding the eval scope as you suggest :)
1
0
1
@YangsiboHuang
Yangsibo Huang
7 months
@nr_space @xiamengzhou @Sam_K_G @danqi_chen Thx 😊 “Catastrophic” was meant to refer to the surge in misalignment rate after very simple exploitation: 0% to 95%. I agree that the shown use case (answering malicious qs), though harmful, may not directly imply a catastrophic outcome. We’ll tweak phrasing to avoid confusion :)
0
0
1
@YangsiboHuang
Yangsibo Huang
7 months
0
0
1
@YangsiboHuang
Yangsibo Huang
5 months
@liang_weixin Thank you, Weixin!!
0
0
1
@YangsiboHuang
Yangsibo Huang
5 months
@gaotianyu1350 Thank you, Tianyu!!!
0
0
0
@YangsiboHuang
Yangsibo Huang
2 years
w/ my amazing collaborators Samyak Gupta, @realZhaoSong , Prof. Kai Li, and Prof. @prfsanjeevarora
1
0
1
@YangsiboHuang
Yangsibo Huang
5 months
@WeijiaShi2 Thank you Weijia 🧚‍♀️
0
0
1
@YangsiboHuang
Yangsibo Huang
5 months
@ChaoweiX Thank you, Chaowei!
0
0
1
@YangsiboHuang
Yangsibo Huang
1 year
Can we re-design kNN-LMs for mitigation? 🎯 For targeted attacks, 1) A simple sanitization step can eliminate the risks entirely! 🧹 2) Decoupling query and key encoders gives an even better trade-off between utility and privacy
Tweet media one
1
0
1
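As a toy illustration of the targeted-risk sanitization step above: scrub easily patterned private strings from documents before they are encoded into the kNN-LM datastore. The regexes and placeholder tokens are mine for illustration; the paper's sanitization and encoder decoupling are richer than this.

```python
# Toy datastore sanitization: mask phone numbers and emails before indexing.
import re

PHONE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
EMAIL = re.compile(r"\b[\w.+-]+@[\w.-]+\.\w+\b")

def sanitize(doc: str) -> str:
    doc = PHONE.sub("<PHONE>", doc)
    return EMAIL.sub("<EMAIL>", doc)

datastore_docs = [sanitize(d) for d in ["Call me at (609) 555-0123 or jane@example.com."]]
print(datastore_docs)  # ['Call me at <PHONE> or <EMAIL>.']
```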
@YangsiboHuang
Yangsibo Huang
1 year
@jian_w3ng Haha an interesting question but it might be hard to check 😂
1
0
1
@YangsiboHuang
Yangsibo Huang
5 months
@ShayneRedford Thank you, Shayne!!
0
0
1
@YangsiboHuang
Yangsibo Huang
5 months
@ZexuanZhong Thank you, Zexuan!!!
0
0
1
@YangsiboHuang
Yangsibo Huang
5 months
@yong_zhengxin Thank you, Zheng-Xin!
0
0
1