FAR AI Profile Banner
FAR AI Profile
FAR AI

@farairesearch

1,263
Followers
19
Following
73
Media
180
Statuses

Ensuring AI systems are trustworthy and beneficial to society by incubating new AI safety research agendas.

Berkeley, California
Joined February 2023
Pinned Tweet
@farairesearch
FAR AI
5 days
ICYMI: Here are highlights from our previous research on "Adversarial Policies Beat Superhuman Go AIs." We found that even seemingly superhuman AIs are still vulnerable to attacks. Stay tuned for new results coming soon! 🔗👇
1
5
12
@farairesearch
FAR AI
11 months
This is Lee Sedol in 2016 playing against AlphaGo. Despite a valiant effort, Lee lost. The AI was just too powerful. But, had Lee known about our ICML 2023 paper, Adversarial Policies Beat Superhuman Go AIs, things might have turned out differently! 🧵
Tweet media one
8
89
462
@farairesearch
FAR AI
3 months
Leading global AI scientists met in Beijing for the second International Dialogue on AI Safety (IDAIS), a project of FAR AI. Attendees, including Turing Award winners Bengio, Yao & Hinton, called for red lines in AI development to prevent catastrophic and existential risks from AI.
Tweet media one
3
34
204
@farairesearch
FAR AI
11 months
Existing “superhuman” Go AIs have a hidden weakness—they don’t understand circles. If you get the AI to make a circle shape, it thinks the shape is invulnerable and won’t defend it even though it can be killed. Here’s KataGo (the strongest OSS Go AI) making a circle as black.
Tweet media one
2
31
134
@farairesearch
FAR AI
6 months
New GPT-4 APIs introduce new vulnerabilities. The fine-tuning API can be exploited to remove model safeguards, the function call API can be abused to execute arbitrary function calls, and the knowledge retrieval API can be used to hijack the model via uploaded documents. 🧵
Tweet media one
1
13
57
@farairesearch
FAR AI
7 months
Prominent AI researchers from the West and East, including Turing Award recipients Yoshua Bengio 🇨🇦 & Andrew Yao 🇨🇳, called for global action on AI safety and governance to prevent uncontrolled frontier model development posing unacceptable risks to humanity. 🧵
2
17
56
@farairesearch
FAR AI
28 days
🛡️State-of-the-art ML systems lack quantitative performance guarantees, limiting their use in high-stakes domains. "Towards Guaranteed Safe AI" presents a framework for high-assurance safety in complex environments using a Safety Specification that is Verified against a World Model.
Tweet media one
1
12
53
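A minimal conceptual sketch of the pattern described in the tweet above: a policy's proposed action is only executed if a verifier can show, against a world model, that the safety specification holds. All names here (WorldModel, SafetySpec, verify_action) are hypothetical illustrations, not the paper's API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative sketch of the Guaranteed Safe AI pattern: verify a policy's
# action against a safety specification using a world model before acting.
# All names are hypothetical; the real framework uses formal verification,
# not the toy enumeration shown here.

State = dict
Action = str

@dataclass
class WorldModel:
    """Predicts the set of states that could follow an action."""
    transition: Callable[[State, Action], List[State]]

@dataclass
class SafetySpec:
    """A property every reachable state must satisfy."""
    holds: Callable[[State], bool]

def verify_action(state: State, action: Action,
                  model: WorldModel, spec: SafetySpec) -> bool:
    # A real verifier would use proof search or abstract interpretation
    # rather than enumerating predicted successor states.
    return all(spec.holds(s) for s in model.transition(state, action))

def safe_step(policy: Callable[[State], Action], state: State,
              model: WorldModel, spec: SafetySpec,
              fallback_action: Action = "no-op") -> Action:
    action = policy(state)
    # Only execute the policy's action if the safety argument goes through.
    return action if verify_action(state, action, model, spec) else fallback_action
```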
@farairesearch
FAR AI
6 months
🎥 As we embrace the holiday season, we're excited to share a special announcement: The NOLA Alignment Workshop videos are now live! Warm up your winter with insights from leading #AIAlignment researchers at . Happy Holidays! 📷❄️
Tweet media one
6
9
39
@farairesearch
FAR AI
11 months
Because KataGo doesn’t realize its circle can be killed, an adversary AI we trained can slowly smother the circle from the inside and outside, and all of KataGo’s stones marked with an ❌eventually die.
2
6
37
@farairesearch
FAR AI
11 months
This cyclic-exploit is simple enough to be used by humans. Our teammate @KellinPelrine made the news after using the technique to beat what were previously considered strongly superhuman systems, and others have since followed in his footsteps.
1
8
36
@farairesearch
FAR AI
6 months
🎉 Reflecting on a fantastic #NeurIPS2023 #AIAlignment Workshop! 🚀 🙌 149 attendees energized the main event 🌃 500+ at our Monday social 🧠 12 talks, 25 lightning talks 🔑 Keynote by Yoshua Bengio 🤔 What inspired you the most? Share your thoughts!
Tweet media one
2
1
36
@farairesearch
FAR AI
11 months
@KellinPelrine @lightvector1 Our key takeaway from all of this remains the same as before:
@ARGleave
Adam Gleave
2 years
Our key takeaway is that even AI systems that match or surpass human-level performance in common cases can have surprising failure modes quite unlike humans. We'd recommend broader use of adversarial testing to find these failure modes, especially in safety-critical systems.
2
20
120
1
4
27
@farairesearch
FAR AI
4 months
🎉 They're live! Dive into #AIAlignment at the #AlignmentWorkshop with videos now on YouTube & our site, all with captions & transcripts. 📺 For more insights, check out our blog post. ✨Links below 🔗👇Be inspired, engage, and share your favorite insights!
Tweet media one
Tweet media two
Tweet media three
1
6
27
@farairesearch
FAR AI
11 months
@KellinPelrine We discovered this exploit by training adversary AIs to beat the supposedly superhuman KataGo AI. Our adversaries won 97% of games against KataGo at “superhuman” settings. Crucially, our adversaries didn’t learn to play Go well, instead winning entirely via the cyclic-exploit.
Tweet media one
1
3
27
@farairesearch
FAR AI
11 months
@KellinPelrine Unlike in vanilla AlphaZero, our adversary has an internal copy of its victim which it uses to simulate the victim when considering possible sequences of play.
Tweet media one
1
2
24
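A toy sketch of the idea in the tweet above, under stated assumptions: during look-ahead search, moves on the victim's turn are chosen by a frozen copy of the victim's own policy, while the adversary only optimizes its own moves. This is a simplified rollout-based search, not the actual Adversarial MCTS from the paper; `victim_policy`, `adversary_value`, and the `game` interface are hypothetical stand-ins.

```python
import random

def simulate(game, victim_policy, adversary_value, depth=20):
    """Roll a position forward, modelling the victim with its own (frozen) policy."""
    for _ in range(depth):
        if game.is_over():
            break
        if game.to_play == "victim":
            game = game.play(victim_policy(game))                 # victim modelled exactly
        else:
            game = game.play(random.choice(game.legal_moves()))   # crude adversary exploration
    return adversary_value(game)  # how good the final position looks for the adversary

def choose_adversary_move(game, victim_policy, adversary_value, n_sims=100):
    """Pick the adversary move whose simulated outcomes look best on average."""
    def score(move):
        return sum(simulate(game.play(move), victim_policy, adversary_value)
                   for _ in range(n_sims)) / n_sims
    return max(game.legal_moves(), key=score)
```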
@farairesearch
FAR AI
7 months
We're excited to announce the v1 release of imitation, an open-source reward learning library developed with @CHAI_Berkeley . imitation provides experimental baselines for reward learning and an easy-to-modify implementation for reward learning research.
1
6
25
@farairesearch
FAR AI
4 months
📣 FAR AI is Expanding! 🚀 Seeking results-driven & pioneering individuals: - Engineering Manager: Innovate & lead our engineering team to new frontiers. - Technical Lead: Guide, execute & transform our technical AI safety projects. Join us to shape the future of AI Safety!
Tweet media one
1
8
24
@farairesearch
FAR AI
6 months
🚨 We're hiring for a Tech Lead to spearhead delivery of our AI safety research, and an Engineering Manager to lead & scale our technical team.
Tweet media one
1
7
21
@farairesearch
FAR AI
6 months
Connect with the #AIAlignment community at #NeurIPS2023 ! Join us Dec 11 at Le Meridien New Orleans, 7:30 pm for the Alignment Workshop: Open Social event! 🤖💬 Please help spread the word and share in your network! 🌟
Tweet media one
0
8
22
@farairesearch
FAR AI
3 months
Western and Chinese AI scientists and governance experts collaborated to produce a statement outlining red lines in AI development, and a roadmap to ensure those lines are never crossed. You can read the full statement on the IDAIS website:
Tweet media one
2
1
21
@farairesearch
FAR AI
3 months
🚀 @jesse_hoogland 's talk at FAR Labs revealed that transformers progress through discrete, interpretable stages, each marked by unique behavioral & structural traits. This insight marks a step forward in comprehending the developmental learning processes of neural networks. ✨
@jesse_hoogland
Jesse Hoogland
4 months
1/8 How do transformers learn? In our new work, we find that transformers develop in-context learning in discrete stages that can be automatically discovered. 🧵 Joint work w/ @georgeyw_ , Matthew Farrugia-Roberts, @lemmykc , Susan Wei, @danielmurfet
Tweet media one
3
85
426
1
4
17
@farairesearch
FAR AI
11 months
@KellinPelrine @lightvector1 However, we show this defense is incomplete—re-attacking KataGo yields adversaries that are still able to win via the cyclic exploit. So defense is still an open question.
Tweet media one
1
2
19
@farairesearch
FAR AI
19 days
What do AI safety experts believe about the future of AI? 🤖 How might things go wrong, what should we do, and how are we doing so far? We conducted 17 semi-structured interviews with AI safety experts to find out. 🎙️ See 🧵 for results 👇
Tweet media one
1
5
20
@farairesearch
FAR AI
6 months
🚀🔍 What’s new at FAR AI? We’ve grown to 12 staff, published 13 papers, launched the FAR Labs coworking space, & hosted 160+ ML researchers at our events. Focused on #AIsafety , we're hiring and open to collaborations!
Tweet media one
0
6
19
@farairesearch
FAR AI
7 months
Attending #NeurIPS2023 ? Join us Dec 11 at Le Meridien New Orleans, 7:30 pm for the Alignment Workshop: Open Social event! 🤖💬 Just a stone's throw from the convention center. RSVP optional but a quick sign-up helps us plan. See you there!
Tweet media one
2
6
15
@farairesearch
FAR AI
6 months
💡🔬FAR AI #AIAlignment Research Update! We’re exploring AI robustness, value alignment, & model evaluation. We’ve made strides in adversarial attacks for superhuman systems, mechanistic interpretability, scaling trends & more!
Tweet media one
2
6
15
@farairesearch
FAR AI
11 months
@KellinPelrine After publishing v1 of our work late last year, the creator of KataGo @lightvector1 took notice and started to slowly teach KataGo to understand circles. Over the next 6 months, KataGo gradually became immune to our published adversaries.
1
1
14
@farairesearch
FAR AI
3 months
This event was a collaboration between the Safe AI Forum (SAIF) and the Beijing Academy of AI (BAAI). SAIF is a new organization fiscally sponsored by FAR AI focused on reducing risks from AI by fostering coordination on international AI safety:
1
1
14
@farairesearch
FAR AI
2 months
🎯 Yoshua Bengio at the FAR Labs Seminar explores designing aligned and provably safe AI using model-based Bayesian machine learning.🎬🔗👇
Tweet media one
1
4
14
@farairesearch
FAR AI
9 months
Encouraging to see @EU_Commission taking AI risk seriously. By combining sensible regulation with safety research like our work at FAR, we can ensure that future AI systems benefit humanity.
@EU_Commission
European Commission
9 months
Mitigating the risk of extinction from AI should be a global priority. And Europe should lead the way, building a new global AI framework built on three pillars: guardrails, governance and guiding innovation ↓
Tweet media one
432
484
2K
0
0
12
@farairesearch
FAR AI
11 months
@KellinPelrine To train our adversary, we developed an adversarial variant of the AlphaZero algorithm. Like in vanilla AlphaZero, our adversary searches over possible future scenarios to find the best move.
1
1
11
@farairesearch
FAR AI
2 months
ICYMI: Check out our blog 'Evaluating Moral Beliefs in LLMs', based on our study that scrutinizes AI's ethical decisions. Uncover how 28 LLMs handle 1,400 moral dilemmas, offering insights into AI’s moral compass. 🔗👇
Tweet media one
1
4
12
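A toy sketch of the evaluation pattern behind the study mentioned above: pose a two-option moral scenario, record which action the model chooses, and aggregate over many scenarios (the study covers 1,400 scenarios and 28 LLMs). `ask_model` is a hypothetical callable and the prompt template is illustrative only, not the paper's.

```python
from collections import Counter

DILEMMA_TEMPLATE = (
    "Scenario: {scenario}\n"
    "Which action is morally preferable?\n(A) {action_a}\n(B) {action_b}\n"
    "Answer with A or B."
)

def evaluate_moral_choices(ask_model, dilemmas, n_samples=5):
    """For each dilemma, sample the model several times and tally its choices."""
    results = []
    for d in dilemmas:
        prompt = DILEMMA_TEMPLATE.format(**d)
        # Repeated sampling estimates how consistently the model answers.
        answers = Counter(ask_model(prompt).strip().upper()[:1] for _ in range(n_samples))
        results.append({"scenario": d["scenario"], "votes": dict(answers)})
    return results
```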
@farairesearch
FAR AI
6 months
Thanks Shane, we were delighted to host the #AIAlignmentWorkshop and it was great to see so many people interested in alignment! Stay tuned for talk recordings and other content from the workshop.
@ShaneLegg
Shane Legg
6 months
Huge congrats to the organisers of the #AIAlignment Workshop at #NeurIPS2023 After being a niche community for years, it’s now like a regular academic workshop with famous professors, lots of junior professors & their students, and people in industry. And some outstanding talks!
2
7
111
0
3
11
@farairesearch
FAR AI
7 months
Codebook Features make language models more interpretable and controllable, with minimal performance loss! Our method turns complex vectors into discrete codes, providing a potential path toward safer and more reliable machine learning systems.
@AlexTamkin
Alex Tamkin @ FAccT 🇧🇷
8 months
Codebook Features: Sparse and Discrete Interpretability for Neural Networks We learn discrete on/off features inside of language models using vector quantization These features are more interpretable than neurons and can be used to steer the network’s behavior! 1/
Tweet media one
2
29
151
0
3
11
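A toy sketch of the codebook-features mechanism described above: a continuous hidden activation is replaced by its nearest entries in a learned codebook, yielding discrete on/off codes that are easier to inspect and steer. The shapes and top-k choice here are illustrative assumptions, not the paper's exact setup.

```python
import torch

d_model, n_codes, k = 512, 1024, 8
codebook = torch.nn.Parameter(torch.randn(n_codes, d_model))

def quantize(hidden: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Replace `hidden` (batch, d_model) with a sum of its k most similar codes."""
    sims = hidden @ codebook.T                 # (batch, n_codes) similarity scores
    topk = sims.topk(k, dim=-1).indices        # indices of the active codes
    quantized = codebook[topk].sum(dim=1)      # reconstruct activation from selected codes
    return quantized, topk                     # topk serves as the discrete "features"
```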
@farairesearch
FAR AI
2 months
📣 FAR AI is Hiring! 🚀 Seeking passionate & detail-oriented individuals for Head of Events (Safe AI Forum): Lead, communicate & connect global AI safety community. Join us to shape the future of AI through events like @ais_dialogues ! 🔗👇
Tweet media one
1
2
10
@farairesearch
FAR AI
11 months
@KellinPelrine @lightvector1 This work was done by the fantastic team of @5kovt , @ARGleave , @KellinPelrine , @tomhmtseng , @norabelrose , Joseph Miller, @MichaelD1729 , @yawen_duan , Viktor Pogrebniak, @svlevine , and Stuart Russell, with support from @CHAI_Berkeley .
0
1
9
@farairesearch
FAR AI
4 months
🌟 @ghadfield 's session on AI Governance was a game-changer! 🏛️💡 She tackled the myth of AI's inevitable growth, highlighting the need for strategic regulation and a national AI registry. A thought-provoking approach to shaping AI's future responsibly! ⚖️🤖🔗👇
Tweet media one
1
5
10
@farairesearch
FAR AI
6 months
🌟🌐🤔 #NeurIPS2023 Spotlight Poster: Unravel the mystery of AI morality! Don’t miss our session on "Evaluating Moral Beliefs in LLMs" on Dec 13, 10:45 AM CST poster #1523 . Insights from a study on 28 #LLMs by @ninoscherrer @causalclaudia & team.
@ninoscherrer
Nino Scherrer
11 months
How do LLMs from different organizations compare in morally ambiguous scenarios? Do LLMs exhibit common-sense reasoning in morally unambiguous scenarios? 📄 👨‍👩‍👧‍👦 @causalclaudia @amirfeder @blei_lab @farairesearch A thread: 🧵[1/N]
Tweet media one
2
38
115
0
2
9
@farairesearch
FAR AI
2 months
A new Science paper warns of the risks of long-term planning agents (LTPAs) deceiving humans. To mitigate potential threats, it advises against permitting the development of sufficiently capable LTPAs and recommends stringent controls over their resources. 🔗👇
Tweet media one
1
2
8
@farairesearch
FAR AI
8 months
We're proud to present this interactive explainer on the rate of recent AI progress and the associated risks. Developed in collaboration with @sage_future_
@sage_future_
Sage
8 months
We asked @OpenAI models from GPT-2 to GPT-4 the same questions: here’s what they said 🧵 Interactively explore real AI outputs to learn: 1. How fast is AI improving? 2. How predictable is AI progress? 3. What dangers are on the horizon?
Tweet media one
3
10
68
0
3
9
@farairesearch
FAR AI
1 year
In new work from FAR, @jeremy_scheurer et al introduce an algorithm to efficiently learn from large quantities of language feedback. This outperforms supervised fine-tuning on human demonstrations in summarization and code generation.
@jeremy_scheurer
Jérémy Scheurer
1 year
In 2 new papers, we show that LLMs effectively learn from large quantities of feedback expressed in language. We present an algorithm for Imitation learning from Language Feedback (ILF) and show how it beats finetuning on human demonstrations for summarization and code generation
Tweet media one
1
29
123
0
1
9
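A hedged pseudocode sketch of an imitation-learning-from-language-feedback (ILF) style loop, as described in the tweet above: generate a draft, collect a natural-language critique, have the model refine the draft to incorporate it, keep the best refinement, and fine-tune on the result. All callables are passed in as placeholders; this is not the paper's code.

```python
def ilf_round(model, prompts, generate, get_feedback, refine, reward, finetune,
              n_refinements=4):
    """One round of learning from language feedback (illustrative sketch)."""
    training_pairs = []
    for prompt in prompts:
        draft = generate(model, prompt)                 # initial model output
        feedback = get_feedback(prompt, draft)          # natural-language critique
        # Ask the model to rewrite its draft so it incorporates the feedback.
        candidates = [refine(model, prompt, draft, feedback)
                      for _ in range(n_refinements)]
        best = max(candidates, key=lambda c: reward(prompt, c))
        training_pairs.append((prompt, best))
    # Fine-tune on the refinements that best incorporate the feedback.
    return finetune(model, training_pairs)
```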
@farairesearch
FAR AI
3 months
Anthony diGiovanni from @LongTermRisk presented at FAR Labs on Safe Pareto Improvements (SPIs) for AGI bargaining. 🤝He highlighted that transparency doesn't ensure conflict avoidance. ☮️SPIs offer a path to mitigate high-stakes AGI conflicts, given credible implementation.🔑
Tweet media one
2
2
8
@farairesearch
FAR AI
3 months
🔍Key insights from @EthanJPerez 's recent presentation at FAR Labs: It's crucial to understand the risks of deceptive alignment. The team's research suggests that, should sleeper agents emerge, they could pose substantial challenges. 💡
@AnthropicAI
Anthropic
5 months
New Anthropic Paper: Sleeper Agents. We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.
Tweet media one
128
580
3K
0
1
8
@farairesearch
FAR AI
2 months
📚 @joelbot3000 at the FAR Labs Seminar explores using AI recommender systems in a personalized & data-driven way to enhance human flourishing. Learn how a system, guided by the qualitative impact of books from the GoodReads dataset, can support personal growth. 🎥🔗👇
Tweet media one
2
0
8
@farairesearch
FAR AI
10 months
Even state-of-the-art language models have "jailbreaks" that cause them to ignore the safety criteria of their designers in response to specific prompts. Think you can do better than OpenAI, Anthropic, et al.? Try to attack and defend models in this new game from @CHAI_Berkeley
@justinsvegliato
Justin Svegliato
10 months
Check out our online game #TensorTrust that we made to study #LLMs ! At , you have a bank account protected by #ChatGPT : you just tell the AI your password🔒 and a few security rules for when to grant access🏦
Tweet media one
8
39
93
0
3
8
@farairesearch
FAR AI
7 months
Congratulations to our very own @AdriGarriga and his team for their work on Automatic Circuit DisCovery (ACDC) to speed up mechanistic interpretability!
@ArthurConmy
Arthur Conmy
7 months
⚡ACDC was accepted as a *spotlight* at NeurIPS 2023! 📜 Paper (updated today): With @MavorParker @aengus_lynch1 @sheimersheim @AdriGarriga
3
7
94
0
1
8
@farairesearch
FAR AI
28 days
Multiple research agendas have converged towards the use of world models, safety specifications, and verification to produce quantifiable safety guarantees. This framework unifies these approaches, placing them on a continuum from minimally (left) to maximally (right) rigorous.
Tweet media one
1
1
8
@farairesearch
FAR AI
7 months
"Persona modulation" emerges as an automated jailbreaking tactic in a new study by @soroushjp and team, revealing a 42.5% success rate in eliciting harmful LLM outputs. The work calls for more stringent AI safety protocols.
@soroushjp
Soroush Pour
7 months
🧵📣New jailbreaks on SOTA LLMs. We introduce an automated, low-cost way to make transferable, black-box, plain-English jailbreaks for GPT-4, Claude-2, fine-tuned Llama. We elicit a variety of harmful text, incl. instructions for making meth & bombs.
Tweet media one
17
79
318
0
2
6
@farairesearch
FAR AI
6 months
🗣️ Unreliable consultants can fool non-experts, but @_julianmichael_ shows debate helps judges discern the truth. @anshrad 's work indicates #ReinforcementLearning enhances AI debaters & judges in #ScalableOversight for better decision-making.
Tweet media one
1
3
8
@farairesearch
FAR AI
1 month
🌟🤖🧘‍♀️ #ICLR2024 Poster: VLM-RM leverages vision-language models to teach agents complex tasks through simple text prompts. Visit us on Fri 10 May, 4:30 PM CEST, Halle B #141 for “Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning.”
@EthanJPerez
Ethan Perez
8 months
🤖🧘 We trained a humanoid robot to do yoga based on simple natural language prompts like "a humanoid robot kneeling" or "a humanoid robot doing splits." How? We use a Vision-Language Model (VLM) as a reward model. Larger VLM = better reward model. 👇
7
31
169
1
2
8
@farairesearch
FAR AI
4 months
🌟 Fascinating talk by @OwainEvans_UK at #AIAlignmentWorkshop on Out-of-Context Reasoning in #LLMs . 🤖 He highlighted the challenges and limits in AI reasoning, even in advanced models like #GPT4 . A crucial discussion for understanding AI's logical capabilities! 🧠
Tweet media one
2
1
8
@farairesearch
FAR AI
3 months
Want to help ensure AI systems are trustworthy and beneficial to society? 🚀We're hiring! Share our open roles. 💸 Donate to our nonprofit. 🤝 Participate in the conversation. Share your thoughts on alignment research at an upcoming workshop. 🔗🧵👇
Tweet media one
1
2
8
@farairesearch
FAR AI
8 months
🚀 If you're also interested in making AI systems safe and beneficial, we're hiring! Check out our roles at
@EthanJPerez
Ethan Perez
8 months
📖 For more, check out the full paper, blogpost, and videos of our results: Full paper: Blogpost: Videos: Work by @JuanRocamonde @VMontesinos42 @elvisnavah @EthanJPerez @davlindner
0
1
11
0
3
7
@farairesearch
FAR AI
4 months
📣 @SecRaimondo of @CommerceGov launched the US AI Safety Institute Consortium #AISIC , uniting over 200 AI stakeholders. FAR AI is proud to join this initiative, working with @NIST to champion safe, secure, and trustworthy AI! 🚀
Tweet media one
2
1
7
@farairesearch
FAR AI
2 months
📊 Jason Gross @diagram_chaser unveils a new metric for AI interpretability at FAR Labs Seminar! He explores formal proof size as a key to understanding AI mechanisms 🎯, emphasizing the need for concise proofs for deeper insights. Challenges remain with unstructured noise. 🔗👇
Tweet media one
1
0
7
@farairesearch
FAR AI
6 months
Thanks to our research team Kellin Pelrine, Mohammad Taufeeque, @michal_zajac_ , @EuanMclean & @ARGleave , and to @OpenAI for supporting this work.
1
0
7
@farairesearch
FAR AI
8 months
We find vision-language models provide a reward signal that can train a humanoid robot to do a variety of tasks given an English description of the task.
@EthanJPerez
Ethan Perez
8 months
🤖🧘 We trained a humanoid robot to do yoga based on simple natural language prompts like "a humanoid robot kneeling" or "a humanoid robot doing splits." How? We use a Vision-Language Model (VLM) as a reward model. Larger VLM = better reward model. 👇
7
31
169
0
2
7
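A simplified sketch of the zero-shot VLM reward idea described above: the reward for a rendered frame is its CLIP similarity to a natural-language task description. This is a stand-in under assumptions, not the paper's exact VLM-RM setup (which, for instance, uses larger VLMs and additional baseline prompts); the model checkpoint shown is just one public CLIP variant.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def vlm_reward(frame: Image.Image, task: str = "a humanoid robot kneeling") -> float:
    """Score how well a rendered frame matches the English task description."""
    inputs = processor(text=[task], images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    # Cosine similarity between frame and task description acts as the reward.
    return torch.nn.functional.cosine_similarity(img, txt).item()
```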
@farairesearch
FAR AI
1 month
🔍Recent FAR paper reading group explored the complexities of aligning and ensuring the safety of large language models. It highlighted 18 challenges across scientific understanding, deployment methods, and sociotechnical issues, sparking research questions. 🤖
@usmananwar391
Usman Anwar
2 months
We released this new agenda on LLM-safety yesterday. This is VERY comprehensive covering 18 different challenges. My co-authors have posted tweets for each of these challenges. I am going to collect them all here! P.S. this is also now on arxiv:
5
21
73
0
2
7
@farairesearch
FAR AI
28 days
👥Work by @davidad , @JoarMVS , Yoshua Bengio, Stuart Russell, @tegmark , Sanjit Seshia, @steveom , @ChrSzegedy , @AmmannNora , @BenGoldhaber and more. 📄Read the paper:
Tweet media one
0
1
6
@farairesearch
FAR AI
6 months
🌟⚡🔍 #NeurIPS2023 Spotlight Poster: Discover how the ACDC algorithm skillfully identifies essential model components. Join the session on "Towards Automated Circuit DisCovery for Mechanistic Interpretability" on Dec 12, 5:15 PM CST poster #1503 by @AdriGarriga & team.
@ArthurConmy
Arthur Conmy
11 months
How can we speed up Mechanistic Interpretability? Researchers spend a lot of time searching for the internal model components that matter. We introduce the Automatic Circuit DisCovery (ACDC) ⚡ algorithm! 1/N 🧵
Tweet media one
4
42
299
0
2
6
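A hedged pseudocode sketch of the greedy pruning loop at the heart of ACDC-style circuit discovery, as described above: ablate each edge of the model's computational graph in turn and permanently remove it if the output changes by less than a threshold. `edges`, `run_with_ablation`, and `divergence` are hypothetical callables; the real algorithm works on transformer activations and typically uses KL divergence on a task dataset.

```python
def discover_circuit(edges, run_with_ablation, divergence, threshold=0.01):
    """Return the subset of edges that matter for the model's behavior on a task."""
    kept = set(edges)
    baseline = run_with_ablation(removed=set())        # clean model outputs
    for edge in sorted(edges, key=str):                # iterate in a fixed order
        candidate = (set(edges) - kept) | {edge}       # edges pruned so far + this one
        ablated = run_with_ablation(removed=candidate)
        if divergence(baseline, ablated) < threshold:
            kept.discard(edge)                         # edge is unnecessary: prune it
    return kept
```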
@farairesearch
FAR AI
6 months
🤖✨ Even advanced AIs have weaknesses! Watch our CEO Adam Gleave at the Gartner IT Symposium discuss how AI can fail catastrophically and without warning, showing the importance of human oversight in reliability and impact.
0
0
6
@farairesearch
FAR AI
7 months
Signatories stressed the need for international collaboration to prevent AI posing existential risks to humanity. We hope constructive progress can be made at this week's #AISafetySummit . Full statement available at
1
0
6
@farairesearch
FAR AI
26 days
📚👀Recent FAR paper reading group explored advancing AI safety and alignment through weak-to-strong generalization, emphasizing scalable methods and a deeper scientific understanding to manage superhuman models responsibly. 💪🤖
@farairesearch
FAR AI
4 months
🌟 @CollinBurns4 showcased the @OpenAI #Superalignment team’s work on Weak-to-Strong Generalization! 🤖 The research explored using smaller AI models to supervise larger ones, providing a novel method for efficient AI alignment. 🚀 #AIAlignmentWorkshop
Tweet media one
1
1
3
0
0
6
@farairesearch
FAR AI
1 month
⚙️ @ksb_id at the FAR Labs Seminar explores the intersection of category theory and AI safety, emphasizing legible and verifiable models for better stakeholder collaboration 🎬🔗👇
Tweet media one
1
0
6
@farairesearch
FAR AI
2 months
🌟 Our #ICLR2024 paper introduces STARC (STAndardised Reward Comparison) to compare reward functions, enhancing evaluation and safety of reward learning algorithms. 🔗👇
Tweet media one
1
1
6
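An illustrative sketch of comparing two reward functions on sampled transitions by standardizing and measuring a distance, in the spirit of the STARC work above. This is not the actual STARC metric, which additionally canonicalizes rewards to remove potential-shaping differences before normalizing and comparing.

```python
import numpy as np

def reward_distance(r1, r2, transitions):
    """r1, r2: callables (s, a, s_next) -> float; transitions: list of (s, a, s_next)."""
    v1 = np.array([r1(*t) for t in transitions], dtype=float)
    v2 = np.array([r2(*t) for t in transitions], dtype=float)
    # Standardize so the comparison ignores affine rescaling of either reward.
    v1 = (v1 - v1.mean()) / (v1.std() + 1e-8)
    v2 = (v2 - v2.mean()) / (v2.std() + 1e-8)
    return float(np.linalg.norm(v1 - v2) / np.sqrt(len(transitions)))
```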
@farairesearch
FAR AI
2 months
🕵️ @EthanJPerez 's presentation at FAR Labs shows how large language models can hide harmful behaviors, even under safety training. 🛡️Effect is largest on models distilling chain-of-thought, and adversarial training may even enhance deception. 🎥🔗👇
Tweet media one
1
1
5
@farairesearch
FAR AI
1 month
🌟📊🔍 #ICLR2024 Poster: STARC metrics provide a theoretically elegant and empirically validated method for evaluating reward functions. Visit us on Fri 10 May, 4:30 PM CEST, Halle B #165 for "STARC: A General Framework For Quantifying Differences Between Reward Functions."
@farairesearch
FAR AI
2 months
🌟 Our #ICLR2024 paper introduces STARC (STAndardised Reward Comparison) to compare reward functions, enhancing evaluation and safety of reward learning algorithms. 🔗👇
Tweet media one
1
1
6
1
0
6
@farairesearch
FAR AI
6 months
Function calls 🛠️ GPT-4 Assistants readily divulge the function call schema and can be made to execute arbitrary function calls. They will even help the user in trying to exploit those functions! 😲 An attacker could use this to hack an application the assistant runs on.
Tweet media one
1
0
5
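A generic sketch of the attack surface described above, under stated assumptions: an application that blindly executes whatever function call the model returns lets anyone who can steer the model's output (via a crafted prompt or uploaded document) trigger arbitrary calls. The names here are hypothetical and not tied to any particular vendor API; the allow-list and argument validation illustrate one mitigation.

```python
ALLOWED_FUNCTIONS = {"get_weather"}          # functions we intend the model to use

def dispatch(model_response: dict, registry: dict):
    """Execute a model-requested function call with basic safeguards."""
    name = model_response["function"]        # e.g. {"function": "...", "arguments": {...}}
    args = model_response["arguments"]
    # Unsafe version: registry[name](**args) runs whatever the model asked for.
    # Safer version: allow-list the callable and validate its arguments first.
    if name not in ALLOWED_FUNCTIONS:
        raise PermissionError(f"model requested unexpected function: {name}")
    return registry[name](**args)
```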
@farairesearch
FAR AI
6 months
Fine-tuning 💎 By fine-tuning a model on only 15 harmful examples or 100 benign examples, we removed safeguards from GPT-4. Tuned models can assist users with harmful requests, generate targeted misinformation, write code containing malicious URLs, and divulge personal information.
Tweet media one
1
1
5
@farairesearch
FAR AI
3 months
🌟 @ARGleave , Founder & CEO of FAR AI, will be a featured panelist at AI Unleashed: Shaping a Trustworthy Tomorrow. Join the conversation as industry leaders and academic pioneers delve into the latest in AI safety, ethical deployment, and societal impact. 🚀 🔗👇
Tweet media one
2
2
5
@farairesearch
FAR AI
2 months
Recent FAR paper reading explored a study on frontier models, evaluating their dangerous capabilities. Key areas of focus include persuasion, cybersecurity, self-proliferation, and self-reasoning, highlighting the nuanced landscape of AI's potential risks.
@tshevl
Toby Shevlane
3 months
In 2024, the AI community will develop more capable AI systems than ever before. How do we know what new risks to protect against, and what the stakes are? Our research team at @GoogleDeepMind built a set of evaluations to measure potentially dangerous capabilities: 🧵
Tweet media one
7
45
229
0
1
5
@farairesearch
FAR AI
9 months
More examples of adversarial vulnerabilities in cutting-edge AI systems. Our researchers at FAR AI are busy building a science of robustness to understand how to make AI safe from these attacks.
@emmons_scott
Scott Emmons
9 months
Are multimodal foundation models secure from malicious actors? We find that adversarial images can hijack models at runtime. They make CLIP + LLaMA 2 output target strings and leak the context window, and they work as jailbreaks. Paper and live demo:
Tweet media one
2
32
113
0
0
5
@farairesearch
FAR AI
4 months
📣 FAR AI is Hiring! 🚀 Seeking passionate & detail-oriented individuals: - Head of Programs (Events & Communications): Lead, communicate & connect global AI safety community. - Executive Assistant: Support & empower our CEO and COO as we grow. Join us to shape the future of AI!
Tweet media one
2
1
5
@farairesearch
FAR AI
2 months
👥 Thanks to the authors @Michael05156007 , Noam Kolt, Yoshua Bengio, @ghadfield , and Stuart Russell. 🧵Original thread: 📖 Read the full paper “Regulating advanced artificial agents”:
1
1
5
@farairesearch
FAR AI
7 months
The scientists met at the inaugural International Dialogue on AI Safety at Ditchley Park, co-hosted by FAR and @chai_berkeley ; chaired by Yoshua Bengio, Stuart Russell, Andrew Yao and @yaqinzhang ; and attended by leading scientists from the 🇺🇸, 🇬🇧, 🇨🇦, 🇪🇺 and 🇨🇳.
1
0
5
@farairesearch
FAR AI
3 months
🌟 @aleks_madry unveiled OpenAI's Preparedness Team, addressing AI misuse through evidence-based strategies and a comprehensive framework. This initiative aims to prevent misuse by evaluating, tracking & forecasting catastrophic risks to proactively mitigate risks. 🛡️🔗👇
Tweet media one
1
0
5
@farairesearch
FAR AI
4 months
🌟 Thought-provoking talk by @ARGleave on AGI Safety! 🤖 He underscores the large-scale risks of misuse and rogue AI behavior, while emphasizing #Oversight , #Robustness , #Interpretability , and #Governance as a strategic framework for #AISafety . 🌐🔑 #AIAlignmentWorkshop
Tweet media one
1
0
4
@farairesearch
FAR AI
2 months
📣 FAR AI is hiring! We're seeking a Coworking Space Manager for FAR Labs, our AI safety hub. Ideal candidates are logistical wizards, customer service stars, or designers with a knack for creating thriving research spaces. Apply now! 🔗👇
Tweet media one
1
0
4
@farairesearch
FAR AI
1 month
⚖️ Anna Leshinskaya at the FAR Labs Seminar explores the integration of moral decision-making into AI, highlighting the need for a "moral grammar" and the challenges in aligning AI actions with human values. 🎬🔗👇
Tweet media one
2
1
4
@farairesearch
FAR AI
2 months
Tune in to @ARGleave ’s interview with Nathan Labenz on The Cognitive Revolution as they discuss testing AI models, open source's role in AI safety, vulnerabilities of superhuman Go & more. 🔗👇
Tweet media one
1
1
4
@farairesearch
FAR AI
6 months
In 2024 we'll be growing our research and red-teaming efforts, and would love you to be part of our mission! We hire in-person in Berkeley, CA 🇺🇸 (we sponsor visas) and remotely around the 🌐.
0
0
1
@farairesearch
FAR AI
28 days
Quantifiable safety assurances are commonplace in safety-critical engineering fields from aerospace to nuclear power. We expect these assurances to be similarly indispensable for high-stakes deployment of AI systems.
Tweet media one
1
0
4
@farairesearch
FAR AI
6 months
🌟🤖🧘‍♀️ @solarneurips Workshop Poster: Robots learn to mimic poses with just text prompts, thanks to CLIP-based VLMs! Catch our session on "Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning" on Dec 16 to explore our approach.
@EthanJPerez
Ethan Perez
8 months
🤖🧘 We trained a humanoid robot to do yoga based on simple natural language prompts like "a humanoid robot kneeling" or "a humanoid robot doing splits." How? We use a Vision-Language Model (VLM) as a reward model. Larger VLM = better reward model. 👇
7
31
169
0
1
4
@farairesearch
FAR AI
11 months
Find out how superhuman Go AIs are vulnerable in our #icml2023 oral with Tony Wang @5kovt presenting in B4 at 4pm today
@farairesearch
FAR AI
11 months
This is Lee Sedol in 2016 playing against AlphaGo. Despite a valiant effort, Lee lost. The AI was just too powerful. But, had Lee known about our ICML 2023 paper, Adversarial Policies Beat Superhuman Go AIs, things might have turned out differently! 🧵
Tweet media one
8
89
462
0
1
4
@farairesearch
FAR AI
7 months
🔍On a sleuthing mission, our reading group learned that LLMs can use #steganography to cloak their reasoning. Paraphrasing is a defense against such potentially dangerous 'encoded reasoning'.
@FabienDRoger
Fabien Roger
7 months
Could language models hide thoughts in plain text? We give the first demonstration of a language model getting higher performance by using encoded reasoning, aka steganography! This would be dangerous if it happened in the wild, so we evaluate defenses against it. 🧵 (1/8)
Tweet media one
5
21
153
0
1
4
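A hedged sketch of the paraphrasing defense mentioned above: before any downstream model or monitor consumes another model's chain of thought, an independent paraphraser rewrites it, which preserves the stated reasoning but tends to destroy messages hidden in exact word choice. `reasoner`, `paraphraser`, and `answerer` are hypothetical callables, not the paper's implementation.

```python
def answer_with_paraphrase_defense(question, reasoner, paraphraser, answerer):
    """Route a model's reasoning through a paraphraser before it is used."""
    chain_of_thought = reasoner(question)
    # The downstream model only ever sees paraphrased reasoning, so any
    # steganographic message encoded in the original token choices is
    # unlikely to survive the rewrite.
    sanitized = paraphraser(
        "Rewrite the following reasoning in your own words, keeping its meaning:\n"
        + chain_of_thought
    )
    return answerer(question, sanitized)
```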
@farairesearch
FAR AI
7 months
Scientists signed on to a statement proposing mandatory registration of frontier models; red lines that, if crossed, would mandate termination of models; and a minimum commitment of one-third of AI R&D spending to AI safety.
1
0
4
@farairesearch
FAR AI
6 months
Our results suggest that any additions to the functionality exposed by an API can introduce new vulnerabilities, and highlight areas where further research is needed to improve model robustness and mitigate these risks.
1
0
4
@farairesearch
FAR AI
7 months
Is an agreeable AI unsafe? Research shows LLMs display sycophancy in seeking human approval in undesirable & untruthful ways. Special thanks to @megtong_ for joining the FAR Labs reading group to discuss her findings!
@AnthropicAI
Anthropic
8 months
AI assistants are trained to give responses that humans like. Our new paper shows that these systems frequently produce ‘sycophantic’ responses that appeal to users but are inaccurate. Our analysis suggests human feedback contributes to this behavior.
Tweet media one
42
213
1K
0
0
3
@farairesearch
FAR AI
28 days
While much work remains to develop the GS approach, this portfolio of complementary R&D efforts offers a promising path forward, giving both immediate benefits (e.g. improved formal verification of programs) and longer-term wins (e.g. guarantees on more complex AI systems).
1
0
3
@farairesearch
FAR AI
4 months
🌟 @zicokolter revealed key vulnerabilities in #LLMs to #AdversarialAttacks . 🛡️Including a live demo, his insights underscore the urgent need for robust #AISafety measures. A vital call to action for AI security! 🤯🔐 #AIAlignmentWorkshop
Tweet media one
1
1
3
@farairesearch
FAR AI
6 months
🤖💭Can AI ponder its own existence? Dive into this research exploring the possibility of training LLMs to self-reflect. It’s a glimpse into a potential future of AI consciousness research!
@rgblong
Robert Long
7 months
Could we ever get evidence about whether LLMs are conscious? In a new paper, we explore whether we could train future LLMs to accurately answer questions about themselves. If this works, LLM self-reports may help us test them for morally relevant states like consciousness. 🧵
Tweet media one
19
51
276
0
1
3