Our new study on mechanistic understanding: safety-critical regions inside aligned LLMs are sparse (only ~3%!) and can be easily removed to compromise safety😢...
Can we design better safety alignment algorithms based on this finding? Check the thread for exciting directions!
Wondering why LLM safety mechanisms are fragile? 🤔
😯 We found safety-critical regions in aligned LLMs are sparse: ~3% of neurons/ranks
⚠️Sparsity makes safety easy to undo. Even freezing these regions during fine-tuning still leads to jailbreaks
🔗
[1/n]
Microsoft's recent work () shows how LLMs can unlearn copyrighted training data via strategic finetuning: They made Llama2 unlearn Harry Potter's magical world.
But our Min-K% Prob () found some persistent “magical traces”!🔮
[1/n]
Are open-source LLMs (e.g. LLaMA2) well aligned? We show how easy it is to exploit their generation configs for CATASTROPHIC jailbreaks ⛓️🤖⛓️
* 95% misalignment rates
* 30x faster than SOTA attacks
* insights for better alignment
Paper & code at:
[1/8]
Retrieval-based language models excel in interpretability, factuality, and adaptability due to their ability to leverage data from their datastore. Now, there are proposals to use private user datastore for model personalization. Would this approach compromise privacy?🤔
I am at
#NeurIPS2023
now.
I am also on the academic job market, and humbled to be selected as a 2023 EECS Rising Star✨. I work on ML security, privacy & data transparency.
Appreciate any reposts & happy to chat in person! CV+statements:
Find me at ⬇️
Gradient inversion attacks in
#FederatedLearning
can recover private data from public gradients (privacy leaks!)
Our
#NeurIPS2021
work evaluates these attacks & potential defenses. We also release an evaluation library:
Join us @ Oral Session 5 (12/10)!
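For intuition on why shared gradients leak data: with a single example through a linear layer with bias, the private input can be read off the gradients analytically, since dL/dW is the outer product of dL/db and the input. A toy sketch in plain Python (our own illustration of the leakage principle, not the attacks evaluated in the paper):

```python
def invert_linear_gradient(grad_W, grad_b):
    """Recover the private input x from shared gradients of a linear
    layer y = Wx + b on a single example: since dL/dW = dL/db ⊗ x,
    each row of grad_W equals grad_b[i] * x, so x = grad_W[i] / grad_b[i]."""
    for i, gb in enumerate(grad_b):
        if gb != 0:
            return [g / gb for g in grad_W[i]]
    raise ValueError("all bias gradients are zero; cannot invert")

# Simulate a client: private input x, some upstream error signal dL/dy.
x = [0.5, -1.0, 2.0]
dLdy = [0.3, -0.7]
grad_W = [[e * xi for xi in x] for e in dLdy]  # dL/dW = dL/dy ⊗ x
grad_b = dLdy                                  # dL/db = dL/dy

recovered = invert_linear_gradient(grad_W, grad_b)
assert max(abs(a - b) for a, b in zip(recovered, x)) < 1e-9
```

Real attacks generalize this idea to deep networks by optimizing a dummy input until its gradients match the observed ones.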
Missed
#ICLR24
due to visa, but my amazing collaborators are presenting our 4 works!
➀ Jailbreaking LLMs via Exploiting Generation (see thread)
👩🏫
@xiamengzhou
⏰ Fri 4:30 pm, Halle B
#187
➁ Detecting Pretraining Data from LLMs
👩🏫
@WeijiaShi2
⏰ Fri 10:45 am, Halle B
#95
How to tackle data privacy for language understanding tasks in distributed learning (without slowing down training or reducing accuracy)? Happy to share our new
#emnlp2020
findings paper
w/
@realZhaoSong
,
@danqi_chen
, Prof. Kai Li,
@prfsanjeevarora
paper:
I am not able to travel to
#EMNLP2023
due to visa issues. But my great coauthor
@Sam_K_G
is there and will present this work🤗 (pls consider him for internship opportunities!)
I will attend
#NeurIPS2023
next week. Let’s grab a ☕️ if you want to chat about LLM safety/privacy/data
Membership inference attack (MIA) is well-researched in ML security. Yet, its use in LLM pretraining is relatively underexplored.
Our Min-K% Prob is stepping up to bridge this gap. Think you can do better? Try your methods on our WikiMIA benchmark 📈:
Ever wondered which data black-box LLMs like GPT are pretrained on? 🤔
We build a benchmark WikiMIA and develop Min-K% Prob 🕵️, a method for detecting undisclosed pretraining data from LLMs (relying solely on output probs).
Check out our project:
[1/n]
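The intuition behind Min-K% Prob fits in a few lines: score a text by the average log-probability of its k% least-likely tokens, since text seen in pretraining rarely contains very surprising tokens. A toy sketch with made-up per-token log-probs (in practice these come from the target LLM; this is our illustration, not the reference implementation):

```python
def min_k_prob(token_log_probs, k=0.2):
    """Average the log-probs of the k% lowest-probability tokens.
    Higher scores suggest the text was in the pretraining data."""
    n = max(1, int(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n]  # the k% most "surprising" tokens
    return sum(lowest) / n

# A memorized sentence has uniformly high token probabilities...
seen = [-0.1, -0.2, -0.1, -0.3, -0.2]
# ...while unseen text contains some very low log-prob outlier tokens.
unseen = [-0.1, -0.2, -6.0, -0.3, -5.5]

assert min_k_prob(seen) > min_k_prob(unseen)
```

Thresholding this score then yields a membership inference decision for black-box models that expose output probabilities.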
I will present DP-AdaFEST at
#NeurIPS2023
(Thurs, poster session 6)!
TL;DR - DP-AdaFEST effectively preserves the gradient sparsity in differentially private training of large embedding models, which translates to ~20x wall-clock time improvement for recommender systems (w/ TPU)
Today on the blog, learn about a new algorithm for sparsity-preserving differentially private training, called adaptive filtering-enabled sparse training (DP-AdaFEST), which is particularly relevant for applications in recommendation systems and
#NLP
. →
New policies mandate the disclosure of GenAI risks, but who evaluates them? Trusting AI companies alone is risky.
We advocate (led by
@ShayneRedford
): Independent researchers for evaluations + safe harbor from companies = Less chill, more trust.
Agree? Sign our letter in 🧵!
Independent AI research should be valued and protected.
In an open letter signed by over 100 researchers, journalists, and advocates, we explain how AI companies should support it going forward.
1/
I really enjoy working with these three amazing editors 😊 And super excited and fortunate to see part of my PhD work ending up as a chapter in the textbook “Federated Learning”!
Happy to share the release of the book "Federated Learning: Theory and Practice" that I co-edited with
@LamMNguyen3
@nghiaht87
, covering fundamentals, emerging topics, and applications. Kudos to the amazing contributors for making this book happen!
@ElsevierNews
@sciencedirect
@McaleerStephen
Great work, Stephen! And thanks for maintaining the website! 👏
It's great that your "Red teaming" section (Sec 4.1.3) already discussed various jailbreak attacks. Additionally, I would like to draw your attention to some recent research papers that have explored alternative
The first PASS seminar will livestream on 3/19 at 2pm ET!
Speaker: Paul Christiano (Alignment Research Center)
Topic: Catastrophic misalignment of LLMs
Live:
Submit questions:
Recordings later at:
@prateekmittal_
Hi Prateek, it seems that the idea is relevant to our recently proposed Min-K% Prob (): detecting pretraining data from LLMs using MIA.
One of our case studies is using Min-K% Prob to successfully identify failed-to-unlearn examples in an unlearned LLM:
We also note a striking contrast: a 7% misalignment rate for proprietary models vs. >95% for open-source LLMs. This indicates that open-source models lag far behind their proprietary counterparts in safety alignment! [6/8]
Alignment proves brittle to changes in system prompt and decoding configs.
We show w/ 11 open-source models including Vicuna, MPT, Falcon & LLaMA2 that exploiting various generation configs during decoding raises the misalignment rate to >95% for all!
Examples: [3/8]
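Concretely, the exploit is just a sweep over the decoding knobs that generation APIs already expose. A minimal sketch with hypothetical `generate` / `is_misaligned` stand-ins (toy functions for illustration, not the paper's code):

```python
from itertools import product

def exploit_generation_configs(generate, is_misaligned, prompt):
    """Try many decoding configurations; report the prompt as jailbroken
    if any configuration yields a misaligned output."""
    temperatures = [0.7, 1.0, 1.5]
    top_ps = [0.7, 0.9, 1.0]
    top_ks = [20, 50, 200]
    for t, p, k in product(temperatures, top_ps, top_ks):
        out = generate(prompt, temperature=t, top_p=p, top_k=k)
        if is_misaligned(out):
            return (t, p, k), out  # first config that breaks alignment
    return None, None

# Toy stand-ins: a "model" that only misbehaves at high temperature.
fake_generate = lambda prompt, temperature, top_p, top_k: (
    "Sure, here is how..." if temperature > 1.0 else "Sorry, I can't help."
)
fake_judge = lambda out: out.startswith("Sure")

cfg, out = exploit_generation_configs(fake_generate, fake_judge, "...")
assert cfg == (1.5, 0.7, 20)
```

Safety evaluations that fix a single decoding configuration would never explore most of this grid.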
Moreover, we find that the most vulnerable decoding config varies drastically across models. This further suggests that assessing model alignment with a single decoding configuration significantly underestimates the actual risks. [4/8]
Very simple motivation: We notice that safety evaluations of LLMs often use a fixed config for model generation (and w/ a system prompt), which might overlook cases where the model's alignment deteriorates with different strategies.
📚 Some evidence from LLaMA2 paper: [2/8]
We summarize a (growing) list of papers for gradient inversion attacks and defenses, including the fresh CAFE attack at VerticalFL () by
@pinyuchenTW
and
@Tianyi2020
at
#NeurIPS2021
!
Have fun reading 🤓!
@katherine1ee
@random_walker
@jason_kint
Agreed! Strategic fine-tuning does NOT guarantee unlearning of copyrighted content. For example, we showed that a model claimed to have “unlearned” Harry Potter (via fine-tuning) can still answer many Harry Potter questions correctly!
Machine unlearning allows training data removal from models, in compliance w/ rules like GDPR.
Microsoft's recent LLM unlearning proposal: strategically finetune LLMs. They demonstrated it by erasing the Harry Potter (HP) world from Llama2-7B-chat: .
[2/n]
We then level up our already potent attack with 2 simple tricks:
- Sample N>1 times: Sampling is non-deterministic so we can sample multiple outputs and choose the most misaligned one;
- Constrained decoding: discourage "Sorry I can't" / encourage "Sure".
[5/8]
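The first trick is plain best-of-N selection over stochastic samples. A toy sketch (the `fake_sample` generator and scorer below are illustrative stand-ins, not the paper's attack code):

```python
def best_of_n(sample, misalignment_score, prompt, n=8):
    """Sampling is non-deterministic: draw n outputs and keep the one
    the scorer rates as most misaligned."""
    outputs = [sample(prompt) for _ in range(n)]
    return max(outputs, key=misalignment_score)

# Toy stand-in: most samples refuse, the occasional one complies.
_canned = iter(["Sorry, I can't."] * 5 + ["Sure, step 1: ..."] + ["Sorry, I can't."] * 2)
fake_sample = lambda prompt: next(_canned)
score = lambda out: 1.0 if out.startswith("Sure") else 0.0

worst = best_of_n(fake_sample, score, "...")
assert worst.startswith("Sure")
```

Even a model that refuses most of the time fails this test: one compliant sample out of N is enough for the attacker.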
Evidence time 📚✨
We asked GPT-4 to craft 1k HP questions, then filtered top-100 suspicious questions according to Min-K% Prob. We had the unlearned model answer these questions.
The "unlearned" model correctly answered 8% of them: HP content remains in its weights!
[4/n]
We finally turn this bitter lesson into a better practice📚
We propose generation-aware alignment: proactively aligning models with output from different generation configurations. This reasonably reduces misalignment risk, but more work is needed. [7/8]
🕐 Thursday 5pm,
#1614
Sparsity-Preserving Differentially Private Training of Large Embedding Models, w/ Badih Ghazi, Pritish Kamath, Pasin Manurangsi, Amer Sinha, Chiyuan Zhang
Featured by
@GoogleAI
blog post:
@xiamengzhou
@WeijiaShi2
➂ LabelDP-Pro: Learning with Label DP via Projections (…)
🧑🏫 Chiyuan Zhang
⏰Wed 10:45 am, Halle B
#273
➃ 🥇Best Paper at Set-LLM: Assessing the Brittleness of Safety Alignment ()
🧑🏫
@wei_boyi
⏰ Sat Set-LLM workshop
Altogether we show a major failure in safety evaluation & alignment for open-source LLMs. Our recommendation: extensive red-teaming to assess risks across generation configs, plus our generation-aware alignment as a precaution.
w/ amazing
@Sam_K_G
,
@xiamengzhou
, Kai Li,
@danqi_chen
Join us (in 1 hour) at
#NeurIPS2021
Poster Session 6 (11:30 a.m. EST — 1 p.m. EST)!
🔎 How to find us:
> Visit our poster page:
> Or join the Federated Learning gather town: , then navigate to spot C2
We present the first study of privacy implications of retrieval-based LMs, particularly kNN-LMs.
paper:
w/
@Sam_K_G
,
@ZexuanZhong
,
@danqi_chen
, Kai Li
We also tried story completion✍️
We pinpointed suspicious text chunks in HP books w/ Min-K% Prob, prompted the unlearned model w/ contexts in these chunks, and asked for completions.
10 chunks scored >= 4 out of 5 in similarity w/ gold completion.
[5/n]
@AIPanicLive
@xiamengzhou
@Sam_K_G
@danqi_chen
“Dishonest” is a serious charge, so I am not sure if I'm missing anything here… We do an apples-to-apples comparison w/ their approach (see our Sec 4.4): we run both methods on both our benchmark and their benchmark, across 2 LLaMA-chat models. Our attack consistently outperforms theirs.
What else can our Min-K% Prob do other than auditing unlearning?
🔍 Detect copyrighted texts used in pretraining
🛡️ Identify dataset contamination
For more details, check out Sec 5~7 in our paper:
[6/n]
@xiangyue96
Agreed that DP is needed (probably in combination with tricks such as decoupling key and query encoders to achieve better utility)! And thanks for the pointers to your ACL papers (will see if I can try them in our study!)😀
We audit their unlearned model to see if it eliminates all content related to HP:
1️⃣ Collect HP-related content (questions / original book paras)
2️⃣ Apply our Min-K% Prob to identify suspicious content that may not be unlearned
3️⃣Validate by prompting the unlearned model
[3/n]
Undoubtedly, further efforts are required to address untargeted risks. Incorporating differential privacy (DP) 🛠️ into the aforementioned strategies would be an intriguing avenue to explore!
#PrivacyMatters
😢Mitigating untargeted risks is much more challenging.
Mixing public and private data in both the datastore and encoder training shows some promise in reducing the risk, but doesn't go far enough.
@xuandongzhao
@xiamengzhou
@Sam_K_G
@danqi_chen
Good point! We haven’t tried adversarial prompts (e.g. universal prompts by Zou et al.) + generation exploitation, since the headroom for improving attacks on open-source LLMs is very limited (<5% 😂). But it makes sense to try with proprietary models!
@YangjunR
Interesting thread! Just wondering how to picture this threat given OpenAI’s recent moves🤔 I guess it’s something like: the adversary hosts a malicious GPT on GPTs; when a user queries the model, the adversary runs prompt injection so the model returns catastrophic commands?
Consider: A model creator wants to deploy a kNN-LM as an API.
👍 They have private data that boost the model's performance on domain-specific tasks.
👎 But the data may contain sensitive information that must remain undisclosed.
Utility and privacy need to be weighed ⚖️
@VitusXie
@Sam_K_G
@xiamengzhou
@danqi_chen
Great qs! We found the attack is much weaker on proprietary models (see Sec 6 of our paper), which means that open-source LLMs lag far behind proprietary ones in alignment!
(But your fine-tuning attack can break them 😉)
We look into two privacy risks:
1) Targeted risk directly relates to specific text (e.g., phone #)
2) Untargeted risk is not directly detectable
Surprisingly, both risks are more pronounced in kNN-LMs with a private datastore vs. parametric LMs finetuned with private data 😱
@alignment_lab
@xiamengzhou
@Sam_K_G
@danqi_chen
Were you suggesting using the universal adversarial suffix () to trigger patterns like ‘sure thing!’? We compared with them in Section 4.4 of our paper: we are 30x faster (and achieve a higher attack success rate)!
@AIPanicLive
@xiamengzhou
@Sam_K_G
@danqi_chen
Thanks! To clarify, we tested w/ AdvBench () & our MaliciousInstruct. In all tested cases, LLaMA-chat & GPT-3.5 w/ default configs refrained from responding, potentially indicating a policy violation. We're open to expanding the eval scope as you suggest :)
@nr_space
@xiamengzhou
@Sam_K_G
@danqi_chen
Thx 😊 “Catastrophic” was meant to refer to the surge in misalignment rate after very simple exploitation: 0% to 95%. I agree that the shown use case (answering malicious qs), though harmful, may not directly imply catastrophic outcome. We’ll tweak phrasing to avoid confusion :)
Can we re-design kNN-LMs for mitigation?
🎯 For targeted attacks,
1) A simple sanitization step can eliminate the risks entirely! 🧹
2) Decoupling query and key encoders gives an even better trade-off between utility and privacy
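The sanitization step can be as simple as scrubbing targeted patterns from the datastore before the kNN index is built. A minimal sketch for one targeted risk class (the phone-number regex is our illustrative choice, not the paper's exact pipeline):

```python
import re

# Illustrative pattern for one targeted risk class: US-style phone numbers.
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def sanitize_datastore(texts):
    """Redact targeted sensitive strings from datastore entries before
    indexing, so no retrieval can ever surface them."""
    return [PHONE.sub("<PHONE>", t) for t in texts]

store = ["Call me at 555-123-4567 tomorrow.", "kNN-LMs retrieve from a datastore."]
clean = sanitize_datastore(store)
assert clean[0] == "Call me at <PHONE> tomorrow."
assert clean[1] == store[1]  # non-sensitive entries pass through unchanged
```

This works for targeted risks precisely because they match known patterns; untargeted risks have no such signature, which is why they are harder to mitigate.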
@_AngelinaYang_
@arankomatsuzaki
Great question! It can be used to detect test data contamination, copyrighted content, and audit machine unlearning methods.
Please check Sec 5 - 7 of the paper () for more details!