FAR AI Profile Banner
FAR AI Profile
FAR AI

@farairesearch

1,263
Followers
19
Following
73
Media
180
Statuses

Ensuring AI systems are trustworthy and beneficial to society by incubating new AI safety research agendas.

Berkeley, California
Joined February 2023
Pinned Tweet
@farairesearch
FAR AI
5 days
ICYMI: Here are highlights from our previous research on "Adversarial Policies Beat Superhuman Go AIs." We found that even seemingly superhuman AIs are still vulnerable to attacks. Stay tuned for new results coming soon! 🔗👇
1
5
12
@farairesearch
FAR AI
11 months
This is Lee Sedol in 2016 playing against AlphaGo. Despite a valiant effort, Lee lost. The AI was just too powerful. But, had Lee known about our ICML 2023 paper, Adversarial Policies Beat Superhuman Go AIs, things might have turned out differently! 🧵
Tweet media one
8
89
462
@farairesearch
FAR AI
3 months
Leading global AI scientists met in Beijing for the second International Dialogue on AI Safety (IDAIS), a project of FAR AI. Attendees, including Turing Award winners Bengio, Yao & Hinton, called for red lines in AI development to prevent catastrophic and existential risks from AI.
Tweet media one
3
34
204
@farairesearch
FAR AI
11 months
Existing “superhuman” Go AIs have a hidden weakness—they don’t understand circles. If you get the AI to make a circle shape, it thinks the shape is invulnerable and won’t defend it even though it can be killed. Here’s KataGo (the strongest OSS Go AI) making a circle as black.
Tweet media one
2
31
134
@farairesearch
FAR AI
6 months
New GPT-4 APIs introduce new vulnerabilities. The fine-tuning API can be exploited to remove model safeguards, the function call API can be abused to execute arbitrary function calls, and the knowledge retrieval API can be used to hijack the model via uploaded documents. 🧵
Tweet media one
1
13
57
@farairesearch
FAR AI
7 months
Prominent AI researchers from the West and East, including Turing Award recipients Yoshua Bengio 🇨🇦 & Andrew Yao 🇨🇳, called for global action on AI safety and governance to prevent uncontrolled frontier model development posing unacceptable risks to humanity. 🧵
2
17
56
@farairesearch
FAR AI
28 days
🛡️State-of-the-art ML systems lack quantitative performance guarantees, limiting their use in high-stakes domains. "Towards Guaranteed Safe AI" presents a framework for high-assurance safety in complex environments using a Safety Specification that is Verified against a World Model.
Tweet media one
1
12
53
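A minimal conceptual sketch of the pattern described in the tweet above: a policy's proposed action is only executed if a verifier can show, against a world model, that the safety specification holds. All names here (WorldModel, SafetySpec, verify_action) are hypothetical illustrations, not the paper's API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative sketch of the Guaranteed Safe AI pattern: verify a policy's
# action against a safety specification using a world model before acting.
# All names are hypothetical; the real framework uses formal verification,
# not the toy enumeration shown here.

State = dict
Action = str

@dataclass
class WorldModel:
    """Predicts the set of states that could follow an action."""
    transition: Callable[[State, Action], List[State]]

@dataclass
class SafetySpec:
    """A property every reachable state must satisfy."""
    holds: Callable[[State], bool]

def verify_action(state: State, action: Action,
                  model: WorldModel, spec: SafetySpec) -> bool:
    # A real verifier would use proof search or abstract interpretation
    # rather than enumerating predicted successor states.
    return all(spec.holds(s) for s in model.transition(state, action))

def safe_step(policy: Callable[[State], Action], state: State,
              model: WorldModel, spec: SafetySpec,
              fallback_action: Action = "no-op") -> Action:
    action = policy(state)
    # Only execute the policy's action if the safety argument goes through.
    return action if verify_action(state, action, model, spec) else fallback_action
```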
@farairesearch
FAR AI
6 months
🎥 As we embrace the holiday season, we're excited to share a special announcement: The NOLA Alignment Workshop videos are now live! Warm up your winter with insights from leading #AIAlignment researchers at . Happy Holidays! 📷❄️
Tweet media one
6
9
39
@farairesearch
FAR AI
11 months
Because KataGo doesn’t realize its circle can be killed, an adversary AI we trained can slowly smother the circle from the inside and outside, and all of KataGo’s stones marked with an ❌eventually die.
2
6
37
@farairesearch
FAR AI
11 months
This cyclic-exploit is simple enough to be used by humans. Our teammate @KellinPelrine made the news after using the technique to beat what were previously considered strongly superhuman systems, and others have since followed in his footsteps.
1
8
36
@farairesearch
FAR AI
6 months
🎉 Reflecting on a fantastic #NeurIPS2023 #AIAlignment Workshop! 🚀 🙌 149 attendees energized the main event 🌃 500+ at our Monday social 🧠 12 talks, 25 lightning talks 🔑 Keynote by Yoshua Bengio 🤔 What inspired you the most? Share your thoughts!
Tweet media one
2
1
36
@farairesearch
FAR AI
11 months
@KellinPelrine @lightvector1 Our key takeaway from all of this remains the same as before:
@ARGleave
Adam Gleave
2 years
Our key takeaway is that even AI systems that match or surpass human-level performance in common cases can have surprising failure modes quite unlike humans. We'd recommend broader use of adversarial testing to find these failure modes, especially in safety-critical systems.
2
20
120
1
4
27
@farairesearch
FAR AI
4 months
🎉 They're live! Dive into #AIAlignment at the #AlignmentWorkshop with videos now on YouTube & our site, all with captions & transcripts. 📺 For more insights, check out our blog post. ✨Links below 🔗👇Be inspired, engage, and share your favorite insights!
Tweet media one
Tweet media two
Tweet media three
1
6
27
@farairesearch
FAR AI
11 months
@KellinPelrine We discovered this exploit by training adversary AIs to beat the supposedly superhuman KataGo AI. Our adversaries won 97% of games against KataGo at “superhuman” settings. Crucially, our adversaries didn’t learn to play Go well, instead winning entirely via the cyclic-exploit.
Tweet media one
1
3
27
@farairesearch
FAR AI
11 months
@KellinPelrine Unlike in vanilla AlphaZero, our adversary has an internal copy of its victim which it uses to simulate the victim when considering possible sequences of play.
Tweet media one
1
2
24
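A toy sketch of the idea in the tweet above, under stated assumptions: during look-ahead search, moves on the victim's turn are chosen by a frozen copy of the victim's own policy, while the adversary only optimizes its own moves. This is a simplified rollout-based search, not the actual Adversarial MCTS from the paper; `victim_policy`, `adversary_value`, and the `game` interface are hypothetical stand-ins.

```python
import random

def simulate(game, victim_policy, adversary_value, depth=20):
    """Roll a position forward, modelling the victim with its own (frozen) policy."""
    for _ in range(depth):
        if game.is_over():
            break
        if game.to_play == "victim":
            game = game.play(victim_policy(game))                 # victim modelled exactly
        else:
            game = game.play(random.choice(game.legal_moves()))   # crude adversary exploration
    return adversary_value(game)  # how good the final position looks for the adversary

def choose_adversary_move(game, victim_policy, adversary_value, n_sims=100):
    """Pick the adversary move whose simulated outcomes look best on average."""
    def score(move):
        return sum(simulate(game.play(move), victim_policy, adversary_value)
                   for _ in range(n_sims)) / n_sims
    return max(game.legal_moves(), key=score)
```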
@farairesearch
FAR AI
7 months
We're excited to announce the v1 release of imitation, an open-source reward learning library developed with @CHAI_Berkeley . imitation provides experimental baselines for reward learning and an easy-to-modify implementation for reward learning research.
1
6
25
@farairesearch
FAR AI
4 months
📣 FAR AI is Expanding! 🚀 Seeking results-driven & pioneering individuals: - Engineering Manager: Innovate & lead our engineering team to new frontiers. - Technical Lead: Guide, execute & transform our technical AI safety projects. Join us to shape the future of AI Safety!
Tweet media one
1
8
24
@farairesearch
FAR AI
6 months
🚨 We're hiring for a Tech Lead to spearhead delivery of our AI safety research, and an Engineering Manager to lead & scale our technical team.
Tweet media one
1
7
21
@farairesearch
FAR AI
6 months
Connect with the #AIAlignment community at #NeurIPS2023 ! Join us Dec 11 at Le Meridien New Orleans, 7:30 pm for the Alignment Workshop: Open Social event! 🤖💬 Please help spread the word and share in your network! 🌟
Tweet media one
0
8
22
@farairesearch
FAR AI
3 months
Western and Chinese AI scientists and governance experts collaborated to produce a statement outlining red lines in AI development, and a roadmap to ensure those lines are never crossed. You can read the full statement on the IDAIS website:
Tweet media one
2
1
21
@farairesearch
FAR AI
3 months
🚀 @jesse_hoogland 's talk at FAR Labs revealed that transformers progress through discrete, interpretable stages, each marked by unique behavioral & structural traits. This insight marks a step forward in comprehending the developmental learning processes of neural networks. ✨
@jesse_hoogland
Jesse Hoogland
4 months
1/8 How do transformers learn? In our new work, we find that transformers develop in-context learning in discrete stages that can be automatically discovered. 🧵 Joint work w/ @georgeyw_ , Matthew Farrugia-Roberts, @lemmykc , Susan Wei, @danielmurfet
Tweet media one
3
85
426
1
4
17
@farairesearch
FAR AI
11 months
@KellinPelrine @lightvector1 However, we show this defense is incomplete—re-attacking KataGo yields adversaries that are still able to win via the cyclic exploit. So defense is still an open question.
Tweet media one
1
2
19
@farairesearch
FAR AI
19 days
What do AI safety experts believe about the future of AI? 🤖 How might things go wrong, what should we do, and how are we doing so far? We conducted 17 semi-structured interviews with AI safety experts to find out. 🎙️ See 🧵 for results 👇
Tweet media one
1
5
20
@farairesearch
FAR AI
6 months
🚀🔍 What’s new at FAR AI? We’ve grown to 12 staff, published 13 papers, launched the FAR Labs coworking space, & hosted 160+ ML researchers at our events. Focused on #AIsafety , we're hiring and open to collaborations!
Tweet media one
0
6
19
@farairesearch
FAR AI
7 months
Attending #NeurIPS2023 ? Join us Dec 11 at Le Meridien New Orleans, 7:30 pm for the Alignment Workshop: Open Social event! 🤖💬 Just a stone's throw from the convention center. RSVP optional but a quick sign-up helps us plan. See you there!
Tweet media one
2
6
15
@farairesearch
FAR AI
6 months
💡🔬FAR AI #AIAlignment Research Update! We’re exploring AI robustness, value alignment, & model evaluation. We’ve made strides in adversarial attacks for superhuman systems, mechanistic interpretability, scaling trends & more!
Tweet media one
2
6
15
@farairesearch
FAR AI
11 months
@KellinPelrine After publishing v1 of our work late last year, the creator of KataGo @lightvector1 took notice and started to slowly teach KataGo to understand circles. Over the next 6 months, KataGo gradually became immune to our published adversaries.
1
1
14
@farairesearch
FAR AI
3 months
This event was a collaboration between the Safe AI Forum (SAIF) and the Beijing Academy of AI (BAAI). SAIF is a new organization fiscally sponsored by FAR AI focused on reducing risks from AI by fostering coordination on international AI safety:
1
1
14
@farairesearch
FAR AI
2 months
🎯 Yoshua Bengio at the FAR Labs Seminar explores designing aligned and provably safe AI using model-based Bayesian machine learning.🎬🔗👇
Tweet media one
1
4
14
@farairesearch
FAR AI
9 months
Encouraging to see @EU_Commission taking AI risk seriously. By combining sensible regulation with safety research like our work at FAR, we can ensure that future AI systems benefit humanity.
@EU_Commission
European Commission
9 months
Mitigating the risk of extinction from AI should be a global priority. And Europe should lead the way, building a new global AI framework built on three pillars: guardrails, governance and guiding innovation ↓
Tweet media one
432
484
2K
0
0
12
@farairesearch
FAR AI
11 months
@KellinPelrine To train our adversary, we developed an adversarial variant of the AlphaZero algorithm. Like in vanilla AlphaZero, our adversary searches over possible future scenarios to find the best move.
1
1
11
@farairesearch
FAR AI
2 months
ICYMI: Check out our blog 'Evaluating Moral Beliefs in LLMs', based on our study that scrutinizes AI's ethical decisions. Uncover how 28 LLMs handle 1,400 moral dilemmas, offering insights into AI’s moral compass. 🔗👇
Tweet media one
1
4
12
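A toy sketch of the evaluation pattern behind the study mentioned above: pose a two-option moral scenario, record which action the model chooses, and aggregate over many scenarios (the study covers 1,400 scenarios and 28 LLMs). `ask_model` is a hypothetical callable and the prompt template is illustrative only, not the paper's.

```python
from collections import Counter

DILEMMA_TEMPLATE = (
    "Scenario: {scenario}\n"
    "Which action is morally preferable?\n(A) {action_a}\n(B) {action_b}\n"
    "Answer with A or B."
)

def evaluate_moral_choices(ask_model, dilemmas, n_samples=5):
    """For each dilemma, sample the model several times and tally its choices."""
    results = []
    for d in dilemmas:
        prompt = DILEMMA_TEMPLATE.format(**d)
        # Repeated sampling estimates how consistently the model answers.
        answers = Counter(ask_model(prompt).strip().upper()[:1] for _ in range(n_samples))
        results.append({"scenario": d["scenario"], "votes": dict(answers)})
    return results
```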
@farairesearch
FAR AI
6 months
Thanks Shane, we were delighted to host the #AIAlignmentWorkshop and it was great to see so many people interested in alignment! Stay tuned for talk recordings and other content from the workshop.
@ShaneLegg
Shane Legg
6 months
Huge congrats to the organisers of the #AIAlignment Workshop at #NeurIPS2023 After being a niche community for years, it’s now like a regular academic workshop with famous professors, lots of junior professors & their students, and people in industry. And some outstanding talks!
2
7
111
0
3
11
@farairesearch
FAR AI
7 months
Codebook Features make language models more interpretable and controllable, with minimal performance loss! Our method turns complex vectors into discrete codes, providing a potential path toward safer and more reliable machine learning systems.
@AlexTamkin
Alex Tamkin @ FAccT 🇧🇷
8 months
Codebook Features: Sparse and Discrete Interpretability for Neural Networks We learn discrete on/off features inside of language models using vector quantization These features are more interpretable than neurons and can be used to steer the network’s behavior! 1/
Tweet media one
2
29
151
0
3
11
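A toy sketch of the codebook-features mechanism described above: a continuous hidden activation is replaced by its nearest entries in a learned codebook, yielding discrete on/off codes that are easier to inspect and steer. The shapes and top-k choice here are illustrative assumptions, not the paper's exact setup.

```python
import torch

d_model, n_codes, k = 512, 1024, 8
codebook = torch.nn.Parameter(torch.randn(n_codes, d_model))

def quantize(hidden: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Replace `hidden` (batch, d_model) with a sum of its k most similar codes."""
    sims = hidden @ codebook.T                 # (batch, n_codes) similarity scores
    topk = sims.topk(k, dim=-1).indices        # indices of the active codes
    quantized = codebook[topk].sum(dim=1)      # reconstruct activation from selected codes
    return quantized, topk                     # topk serves as the discrete "features"
```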
@farairesearch
FAR AI
2 months
📣 FAR AI is Hiring! 🚀 Seeking passionate & detail-oriented individuals for Head of Events (Safe AI Forum): Lead, communicate & connect global AI safety community. Join us to shape the future of AI through events like @ais_dialogues ! 🔗👇
Tweet media one
1
2
10
@farairesearch
FAR AI
11 months
@KellinPelrine @lightvector1 This work was done by the fantastic team of @5kovt , @ARGleave , @KellinPelrine , @tomhmtseng , @norabelrose , Joseph Miller, @MichaelD1729 , @yawen_duan , Viktor Pogrebniak, @svlevine , and Stuart Russell, with support from @CHAI_Berkeley .
0
1
9
@farairesearch
FAR AI
4 months
🌟 @ghadfield 's session on AI Governance was a game-changer! 🏛️💡 She tackled the myth of AI's inevitable growth, highlighting the need for strategic regulation and a national AI registry. A thought-provoking approach to shaping AI's future responsibly! ⚖️🤖🔗👇
Tweet media one
1
5
10
@farairesearch
FAR AI
6 months
🌟🌐🤔 #NeurIPS2023 Spotlight Poster: Unravel the mystery of AI morality! Don’t miss our session on "Evaluating Moral Beliefs in LLMs" on Dec 13, 10:45 AM CST poster #1523 . Insights from a study on 28 #LLMs by @ninoscherrer @causalclaudia & team.
@ninoscherrer
Nino Scherrer
11 months
How do LLMs from different organizations compare in morally ambiguous scenarios? Do LLMs exhibit common-sense reasoning in morally unambiguous scenarios? 📄 👨‍👩‍👧‍👦 @causalclaudia @amirfeder @blei_lab @farairesearch A thread: 🧵[1/N]
Tweet media one
2
38
115
0
2
9
@farairesearch
FAR AI
2 months
A new Science paper warns of the risks of long-term planning agents (LTPAs) deceiving humans. To mitigate potential threats, it advises against permitting the development of sufficiently capable LTPAs and recommends stringent controls over their resources. 🔗👇
Tweet media one
1
2
8
@farairesearch
FAR AI
8 months
We're proud to present this interactive explainer on the rate of recent AI progress and the associated risks. Developed in collaboration with @sage_future_
@sage_future_
Sage
8 months
We asked @OpenAI models from GPT-2 to GPT-4 the same questions: here’s what they said 🧵 Interactively explore real AI outputs to learn: 1. How fast is AI improving? 2. How predictable is AI progress? 3. What dangers are on the horizon?
Tweet media one
3
10
68
0
3
9
@farairesearch
FAR AI
1 year
In new work from FAR, @jeremy_scheurer et al introduce an algorithm to efficiently learn from large quantities of language feedback. This outperforms supervised fine-tuning on human demonstrations in summarization and code generation.
@jeremy_scheurer
Jérémy Scheurer
1 year
In 2 new papers, we show that LLMs effectively learn from large quantities of feedback expressed in language. We present an algorithm for Imitation learning from Language Feedback (ILF) and show how it beats finetuning on human demonstrations for summarization and code generation
Tweet media one
1
29
123
0
1
9
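A hedged pseudocode sketch of an imitation-learning-from-language-feedback (ILF) style loop, as described in the tweet above: generate a draft, collect a natural-language critique, have the model refine the draft to incorporate it, keep the best refinement, and fine-tune on the result. All callables are passed in as placeholders; this is not the paper's code.

```python
def ilf_round(model, prompts, generate, get_feedback, refine, reward, finetune,
              n_refinements=4):
    """One round of learning from language feedback (illustrative sketch)."""
    training_pairs = []
    for prompt in prompts:
        draft = generate(model, prompt)                 # initial model output
        feedback = get_feedback(prompt, draft)          # natural-language critique
        # Ask the model to rewrite its draft so it incorporates the feedback.
        candidates = [refine(model, prompt, draft, feedback)
                      for _ in range(n_refinements)]
        best = max(candidates, key=lambda c: reward(prompt, c))
        training_pairs.append((prompt, best))
    # Fine-tune on the refinements that best incorporate the feedback.
    return finetune(model, training_pairs)
```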
@farairesearch
FAR AI
3 months
Anthony diGiovanni from @LongTermRisk presented at FAR Labs on Safe Pareto Improvements (SPIs) for AGI bargaining. 🤝He highlighted that transparency doesn't ensure conflict avoidance. ☮️SPIs offer a path to mitigate high-stakes AGI conflicts, given credible implementation.🔑
Tweet media one
2
2
8
@farairesearch
FAR AI
3 months
🔍Key insights from @EthanJPerez 's recent presentation at FAR Labs: It's crucial to understand the risks of deceptive alignment. The team's research suggests that, should sleeper agents emerge, they could pose substantial challenges. 💡
@AnthropicAI
Anthropic
5 months
New Anthropic Paper: Sleeper Agents. We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.
Tweet media one
128
580
3K
0
1
8
@farairesearch
FAR AI
2 months
📚 @joelbot3000 at the FAR Labs Seminar explores using AI recommender systems in a personalized & data-driven way to enhance human flourishing. Learn how a system, guided by the qualitative impact of books from the GoodReads dataset, can support personal growth. 🎥🔗👇
Tweet media one
2
0
8
@farairesearch
FAR AI
10 months
Even state-of-the-art language models have "jailbreaks" that cause them to ignore the safety criteria of their designers in response to specific prompts. Think you can do better than OpenAI, Anthropic, et al.? Try to attack and defend models in this new game from @CHAI_Berkeley
@justinsvegliato
Justin Svegliato
10 months
Check out our online game #TensorTrust that we made to study #LLMs ! At , you have a bank account protected by #ChatGPT : you just tell the AI your password🔒 and a few security rules for when to grant access🏦
Tweet media one
8
39
93
0
3
8
@farairesearch
FAR AI
7 months
Congratulations to our very own @AdriGarriga and his team for their work on Automatic Circuit DisCovery (ACDC) to speed up mechanistic interpretability!
@ArthurConmy
Arthur Conmy
7 months
⚡ACDC was accepted as a *spotlight* at NeurIPS 2023! 📜 Paper (updated today): With @MavorParker @aengus_lynch1 @sheimersheim @AdriGarriga
3
7
94
0
1
8
@farairesearch
FAR AI
28 days
Multiple research agendas have converged towards the use of world models, safety specifications, and verification to produce quantifiable safety guarantees. This framework unifies these approaches, placing them on a continuum from minimally (left) to maximally (right) rigorous.
Tweet media one
1
1
8
@farairesearch
FAR AI
7 months
"Persona modulation" emerges as an automated jailbreaking tactic in a new study by @soroushjp and team, revealing a 42.5% success rate in eliciting harmful LLM outputs. The work calls for more stringent AI safety protocols.
@soroushjp
Soroush Pour
7 months
🧵📣New jailbreaks on SOTA LLMs. We introduce an automated, low-cost way to make transferable, black-box, plain-English jailbreaks for GPT-4, Claude-2, fine-tuned Llama. We elicit a variety of harmful text, incl. instructions for making meth & bombs.
Tweet media one
17
79
318
0
2
6
@farairesearch
FAR AI
6 months
🗣️ Unreliable consultants can fool non-experts, but @_julianmichael_ shows debate helps judges discern the truth. @anshrad 's work indicates #ReinforcementLearning enhances AI debaters & judges in #ScalableOversight for better decision-making.
Tweet media one
1
3
8
@farairesearch
FAR AI
1 month
🌟🤖🧘‍♀️ #ICLR2024 Poster: VLM-RM leverages vision-language models to teach agents complex tasks through simple text prompts. Visit us on Fri 10 May, 4:30 PM CEST, Halle B #141 for “Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning.”
@EthanJPerez
Ethan Perez
8 months
🤖🧘 We trained a humanoid robot to do yoga based on simple natural language prompts like "a humanoid robot kneeling" or "a humanoid robot doing splits." How? We use a Vision-Language Model (VLM) as a reward model. Larger VLM = better reward model. 👇
7
31
169
1
2
8
@farairesearch
FAR AI
4 months
🌟 Fascinating talk by @OwainEvans_UK at #AIAlignmentWorkshop on Out-of-Context Reasoning in #LLMs . 🤖 He highlighted the challenges and limits in AI reasoning, even in advanced models like #GPT4 . A crucial discussion for understanding AI's logical capabilities! 🧠
Tweet media one
2
1
8
@farairesearch
FAR AI
3 months
Want to help ensure AI systems are trustworthy and beneficial to society? 🚀We're hiring! Share our open roles. 💸 Donate to our nonprofit. 🤝 Participate in the conversation. Share your thoughts on alignment research at an upcoming workshop. 🔗🧵👇
Tweet media one
1
2
8
@farairesearch
FAR AI
8 months
🚀 If you're also interested in making AI systems safe and beneficial, we're hiring! Check out our roles at
@EthanJPerez
Ethan Perez
8 months
📖 For more, check out the full paper, blogpost, and videos of our results: Full paper: Blogpost: Videos: Work by @JuanRocamonde @VMontesinos42 @elvisnavah @EthanJPerez @davlindner
0
1
11
0
3
7
@farairesearch
FAR AI
4 months
📣 @SecRaimondo of @CommerceGov launched the US AI Safety Institute Consortium #AISIC , uniting over 200 AI stakeholders. FAR AI is proud to join this initiative, working with @NIST to champion safe, secure, and trustworthy AI! 🚀
Tweet media one
2
1
7
@farairesearch
FAR AI
2 months
📊 Jason Gross @diagram_chaser unveils a new metric for AI interpretability at FAR Labs Seminar! He explores formal proof size as a key to understanding AI mechanisms 🎯, emphasizing the need for concise proofs for deeper insights. Challenges remain with unstructured noise. 🔗👇
Tweet media one
1
0
7
@farairesearch
FAR AI
6 months
Thanks to our research team Kellin Pelrine, Mohammad Taufeeque, @michal_zajac_ , @EuanMclean & @ARGleave , and to @OpenAI for supporting this work.
1
0
7
@farairesearch
FAR AI
8 months
We find vision-language models provide a reward signal that can train a humanoid robot to do a variety of tasks given an English description of the task.
@EthanJPerez
Ethan Perez
8 months
🤖🧘 We trained a humanoid robot to do yoga based on simple natural language prompts like "a humanoid robot kneeling" or "a humanoid robot doing splits." How? We use a Vision-Language Model (VLM) as a reward model. Larger VLM = better reward model. 👇
7
31
169
0
2
7
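A simplified sketch of the zero-shot VLM reward idea described above: the reward for a rendered frame is its CLIP similarity to a natural-language task description. This is a stand-in under assumptions, not the paper's exact VLM-RM setup (which, for instance, uses larger VLMs and additional baseline prompts); the model checkpoint shown is just one public CLIP variant.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def vlm_reward(frame: Image.Image, task: str = "a humanoid robot kneeling") -> float:
    """Score how well a rendered frame matches the English task description."""
    inputs = processor(text=[task], images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    # Cosine similarity between frame and task description acts as the reward.
    return torch.nn.functional.cosine_similarity(img, txt).item()
```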
@farairesearch
FAR AI
1 month
🔍Recent FAR paper reading group explored the complexities of aligning and ensuring the safety of large language models. It highlighted 18 challenges across scientific understanding, deployment methods, and sociotechnical issues, sparking research questions. 🤖
@usmananwar391
Usman Anwar
2 months
We released this new agenda on LLM-safety yesterday. This is VERY comprehensive covering 18 different challenges. My co-authors have posted tweets for each of these challenges. I am going to collect them all here! P.S. this is also now on arxiv:
5
21
73
0
2
7
@farairesearch
FAR AI
28 days
👥Work by @davidad , @JoarMVS , Yoshua Bengio, Stuart Russell, @tegmark , Sanjit Seshia, @steveom , @ChrSzegedy , @AmmannNora , @BenGoldhaber and more. 📄Read the paper:
Tweet media one
0
1
6
@farairesearch
FAR AI
6 months
🌟⚡🔍 #NeurIPS2023 Spotlight Poster: Discover how the ACDC algorithm skillfully identifies essential model components. Join the session on "Towards Automated Circuit DisCovery for Mechanistic Interpretability" on Dec 12, 5:15 PM CST poster #1503 by @AdriGarriga & team.
@ArthurConmy
Arthur Conmy
11 months
How can we speed up Mechanistic Interpretability? Researchers spend a lot of time searching for the internal model components that matter. We introduce the Automatic Circuit DisCovery (ACDC) ⚡ algorithm! 1/N 🧵
Tweet media one
4
42
299
0
2
6
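A hedged pseudocode sketch of the greedy pruning loop at the heart of ACDC-style circuit discovery, as described above: ablate each edge of the model's computational graph in turn and permanently remove it if the output changes by less than a threshold. `edges`, `run_with_ablation`, and `divergence` are hypothetical callables; the real algorithm works on transformer activations and typically uses KL divergence on a task dataset.

```python
def discover_circuit(edges, run_with_ablation, divergence, threshold=0.01):
    """Return the subset of edges that matter for the model's behavior on a task."""
    kept = set(edges)
    baseline = run_with_ablation(removed=set())        # clean model outputs
    for edge in sorted(edges, key=str):                # iterate in a fixed order
        candidate = (set(edges) - kept) | {edge}       # edges pruned so far + this one
        ablated = run_with_ablation(removed=candidate)
        if divergence(baseline, ablated) < threshold:
            kept.discard(edge)                         # edge is unnecessary: prune it
    return kept
```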
@farairesearch
FAR AI
6 months
🤖✨ Even advanced AIs have weaknesses! Watch our CEO Adam Gleave at the Gartner IT Symposium discuss how AI can fail catastrophically and without warning, showing the importance of human oversight in reliability and impact.
0
0
6
@farairesearch
FAR AI
7 months
Signatories stressed the need for international collaboration to prevent AI posing existential risks to humanity. We hope constructive progress can be made at this week's #AISafetySummit . Full statement available at
1
0
6
@farairesearch
FAR AI
26 days
📚👀Recent FAR paper reading group explored advancing AI safety and alignment through weak-to-strong generalization, emphasizing scalable methods and a deeper scientific understanding to manage superhuman models responsibly. 💪🤖
@farairesearch
FAR AI
4 months
🌟 @CollinBurns4 showcased the @OpenAI #Superalignment team’s work on Weak-to-Strong Generalization! 🤖 The research explored using smaller AI models to supervise larger ones, providing a novel method for efficient AI alignment. 🚀 #AIAlignmentWorkshop
Tweet media one
1
1
3
0
0
6
@farairesearch
FAR AI
1 month
⚙️ @ksb_id at the FAR Labs Seminar explores the intersection of category theory and AI safety, emphasizing legible and verifiable models for better stakeholder collaboration 🎬🔗👇
Tweet media one
1
0
6
@farairesearch
FAR AI
2 months
🌟 Our #ICLR2024 paper introduces STARC (STAndardised Reward Comparison) to compare reward functions, enhancing evaluation and safety of reward learning algorithms. 🔗👇
Tweet media one
1
1
6
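An illustrative sketch of comparing two reward functions on sampled transitions by standardizing and measuring a distance, in the spirit of the STARC work above. This is not the actual STARC metric, which additionally canonicalizes rewards to remove potential-shaping differences before normalizing and comparing.

```python
import numpy as np

def reward_distance(r1, r2, transitions):
    """r1, r2: callables (s, a, s_next) -> float; transitions: list of (s, a, s_next)."""
    v1 = np.array([r1(*t) for t in transitions], dtype=float)
    v2 = np.array([r2(*t) for t in transitions], dtype=float)
    # Standardize so the comparison ignores affine rescaling of either reward.
    v1 = (v1 - v1.mean()) / (v1.std() + 1e-8)
    v2 = (v2 - v2.mean()) / (v2.std() + 1e-8)
    return float(np.linalg.norm(v1 - v2) / np.sqrt(len(transitions)))
```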
@farairesearch
FAR AI
2 months
🕵️ @EthanJPerez 's presentation at FAR Labs shows how large language models can hide harmful behaviors, even under safety training. 🛡️Effect is largest on models distilling chain-of-thought, and adversarial training may even enhance deception. 🎥🔗👇
Tweet media one
1
1
5
@farairesearch
FAR AI
1 month
🌟📊🔍 #ICLR2024 Poster: STARC metrics provide a theoretically elegant and empirically validated method for evaluating reward functions. Visit us on Fri 10 May, 4:30 PM CEST, Halle B #165 for "STARC: A General Framework For Quantifying Differences Between Reward Functions."
@farairesearch
FAR AI
2 months
🌟 Our #ICLR2024 paper introduces STARC (STAndardised Reward Comparison) to compare reward functions, enhancing evaluation and safety of reward learning algorithms. 🔗👇
Tweet media one
1
1
6
1
0
6
@farairesearch
FAR AI
6 months
Function calls 🛠️ GPT-4 Assistants readily divulge the function call schema and can be made to execute arbitrary function calls. They will even help the user in trying to exploit those functions! 😲 An attacker could use this to hack an application the assistant runs on.
Tweet media one
1
0
5
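A generic sketch of the attack surface described above, under stated assumptions: an application that blindly executes whatever function call the model returns lets anyone who can steer the model's output (via a crafted prompt or uploaded document) trigger arbitrary calls. The names here are hypothetical and not tied to any particular vendor API; the allow-list and argument validation illustrate one mitigation.

```python
ALLOWED_FUNCTIONS = {"get_weather"}          # functions we intend the model to use

def dispatch(model_response: dict, registry: dict):
    """Execute a model-requested function call with basic safeguards."""
    name = model_response["function"]        # e.g. {"function": "...", "arguments": {...}}
    args = model_response["arguments"]
    # Unsafe version: registry[name](**args) runs whatever the model asked for.
    # Safer version: allow-list the callable and validate its arguments first.
    if name not in ALLOWED_FUNCTIONS:
        raise PermissionError(f"model requested unexpected function: {name}")
    return registry[name](**args)
```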
@farairesearch
FAR AI
6 months
Fine-tuning 💎 By fine-tuning a model on only 15 harmful examples or 100 benign examples, we removed safeguards from GPT-4. Tuned models can assist users with harmful requests, generate targeted misinformation, write code containing malicious URLs, and divulge personal information.
Tweet media one
1
1
5
@farairesearch
FAR AI
3 months
🌟 @ARGleave , Founder & CEO of FAR AI, will be a featured panelist at AI Unleashed: Shaping a Trustworthy Tomorrow. Join the conversation as industry leaders and academic pioneers delve into the latest in AI safety, ethical deployment, and societal impact. 🚀 🔗👇
Tweet media one
2
2
5
@farairesearch
FAR AI
2 months
Recent FAR paper reading explored a study on frontier models, evaluating their dangerous capabilities. Key areas of focus include persuasion, cybersecurity, self-proliferation, and self-reasoning, highlighting the nuanced landscape of AI's potential risks.
@tshevl
Toby Shevlane
3 months
In 2024, the AI community will develop more capable AI systems than ever before. How do we know what new risks to protect against, and what the stakes are? Our research team at @GoogleDeepMind built a set of evaluations to measure potentially dangerous capabilities: 🧵
Tweet media one
7
45
229
0
1
5
@farairesearch
FAR AI
9 months
More examples of adversarial vulnerabilities in cutting-edge AI systems. Our researchers at FAR AI are busy building a science of robustness to understand how to make AI safe from these attacks.
@emmons_scott
Scott Emmons
9 months
Are multimodal foundation models secure from malicious actors? We find that adversarial images can hijack models at runtime. They make CLIP + LLaMA 2 output target strings and leak the context window, and they work as jailbreaks. Paper and live demo:
Tweet media one
2
32
113
0
0
5
@farairesearch
FAR AI
4 months
📣 FAR AI is Hiring! 🚀 Seeking passionate & detail-oriented individuals: - Head of Programs (Events & Communications): Lead, communicate & connect global AI safety community. - Executive Assistant: Support & empower our CEO and COO as we grow. Join us to shape the future of AI!
Tweet media one
2
1
5
@farairesearch
FAR AI
2 months
👥 Thanks to the authors @Michael05156007 , Noam Kolt, Yoshua Bengio, @ghadfield , and Stuart Russell. 🧵Original thread: 📖 Read the full paper “Regulating advanced artificial agents”:
1
1
5
@farairesearch
FAR AI
7 months
The scientists met at the inaugural International Dialogue on AI Safety at Ditchley Park, co-hosted by FAR and @chai_berkeley ; chaired by Yoshua Bengio, Stuart Russell, Andrew Yao and @yaqinzhang ; and attended by leading scientists from the 🇺🇸, 🇬🇧, 🇨🇦, 🇪🇺 and 🇨🇳.
1
0
5
@farairesearch
FAR AI
3 months
🌟 @aleks_madry unveiled OpenAI's Preparedness Team, addressing AI misuse through evidence-based strategies and a comprehensive framework. This initiative aims to prevent misuse by evaluating, tracking & forecasting catastrophic risks to proactively mitigate risks. 🛡️🔗👇
Tweet media one
1
0
5
@farairesearch
FAR AI
4 months
🌟 Thought-provoking talk by @ARGleave on AGI Safety! 🤖 He underscores the large-scale risks of misuse and rogue AI behavior, while emphasizing #Oversight , #Robustness , #Interpretability , and #Governance as a strategic framework for #AISafety . 🌐🔑 #AIAlignmentWorkshop
Tweet media one
1
0
4
@farairesearch
FAR AI
2 months
📣 FAR AI is hiring! We're seeking a Coworking Space Manager for FAR Labs, our AI safety hub. Ideal candidates are logistical wizards, customer service stars, or designers with a knack for creating thriving research spaces. Apply now! 🔗👇
Tweet media one
1
0
4
@farairesearch
FAR AI
1 month
⚖️ Anna Leshinskaya at the FAR Labs Seminar explores the integration of moral decision-making into AI, highlighting the need for a "moral grammar" and the challenges in aligning AI actions with human values. 🎬🔗👇
Tweet media one
2
1
4
@farairesearch
FAR AI
2 months
Tune in to @ARGleave ’s interview with Nathan Labenz on The Cognitive Revolution as they discuss testing AI models, open source's role in AI safety, vulnerabilities of superhuman Go & more. 🔗👇
Tweet media one
1
1
4
@farairesearch
FAR AI
6 months
In 2024 we'll be growing our research and red-teaming efforts, and would love you to be part of our mission! We hire in-person in Berkeley, CA 🇺🇸 (we sponsor visas) and remotely around the 🌐.
0
0
1
@farairesearch
FAR AI
28 days
Quantifiable safety assurances are commonplace in safety-critical engineering fields from aerospace to nuclear power. We expect these assurances to be similarly indispensable for high-stakes deployment of AI systems.
Tweet media one
1
0
4
@farairesearch
FAR AI
6 months
🌟🤖🧘‍♀️ @solarneurips Workshop Poster: Robots learn to mimic poses with just text prompts, thanks to CLIP-based VLMs! Catch our session on "Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning" on Dec 16 to explore our approach.
@EthanJPerez
Ethan Perez
8 months
🤖🧘 We trained a humanoid robot to do yoga based on simple natural language prompts like "a humanoid robot kneeling" or "a humanoid robot doing splits." How? We use a Vision-Language Model (VLM) as a reward model. Larger VLM = better reward model. 👇
7
31
169
0
1
4
@farairesearch
FAR AI
11 months
Find out how superhuman Go AIs are vulnerable in our #icml2023 oral with Tony Wang @5kovt presenting in B4 at 4pm today
@farairesearch
FAR AI
11 months
This is Lee Sedol in 2016 playing against AlphaGo. Despite a valiant effort, Lee lost. The AI was just too powerful. But, had Lee known about our ICML 2023 paper, Adversarial Policies Beat Superhuman Go AIs, things might have turned out differently! 🧵
Tweet media one
8
89
462
0
1
4
@farairesearch
FAR AI
7 months
🔍On a sleuthing mission, our reading group learned that LLMs can use #steganography to cloak their reasoning. Paraphrasing is a defense against such potentially dangerous 'encoded reasoning'.
@FabienDRoger
Fabien Roger
7 months
Could language models hide thoughts in plain text? We give the first demonstration of a language model getting higher performance by using encoded reasoning, aka steganography! This would be dangerous if it happened in the wild, so we evaluate defenses against it. 🧵 (1/8)
Tweet media one
5
21
153
0
1
4
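A hedged sketch of the paraphrasing defense mentioned above: before any downstream model or monitor consumes another model's chain of thought, an independent paraphraser rewrites it, which preserves the stated reasoning but tends to destroy messages hidden in exact word choice. `reasoner`, `paraphraser`, and `answerer` are hypothetical callables, not the paper's implementation.

```python
def answer_with_paraphrase_defense(question, reasoner, paraphraser, answerer):
    """Route a model's reasoning through a paraphraser before it is used."""
    chain_of_thought = reasoner(question)
    # The downstream model only ever sees paraphrased reasoning, so any
    # steganographic message encoded in the original token choices is
    # unlikely to survive the rewrite.
    sanitized = paraphraser(
        "Rewrite the following reasoning in your own words, keeping its meaning:\n"
        + chain_of_thought
    )
    return answerer(question, sanitized)
```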
@farairesearch
FAR AI
7 months
Scientists signed on to a statement proposing mandatory registration of frontier models; red lines that, if crossed, would mandate termination of models; and a minimum commitment of one-third of AI R&D spending to AI safety.
1
0
4
@farairesearch
FAR AI
6 months
Our results suggest that any additions to the functionality exposed by an API can introduce new vulnerabilities, and highlight areas where further research is needed to improve model robustness and mitigate these risks.
1
0
4
@farairesearch
FAR AI
7 months
Is an agreeable AI unsafe? Research shows LLMs display sycophancy in seeking human approval in undesirable & untruthful ways. Special thanks to @megtong_ for joining the FAR Labs reading group to discuss her findings!
@AnthropicAI
Anthropic
8 months
AI assistants are trained to give responses that humans like. Our new paper shows that these systems frequently produce ‘sycophantic’ responses that appeal to users but are inaccurate. Our analysis suggests human feedback contributes to this behavior.
Tweet media one
42
213
1K
0
0
3
@farairesearch
FAR AI
28 days
While much work remains to develop the GS approach, this portfolio of complementary R&D efforts offers a promising path forward, giving both immediate benefits (e.g. improved formal verification of programs) and longer-term wins (e.g. guarantees on more complex AI systems).
1
0
3
@farairesearch
FAR AI
4 months
🌟 @zicokolter revealed key vulnerabilities in #LLMs to #AdversarialAttacks . 🛡️Including a live demo, his insights underscore the urgent need for robust #AISafety measures. A vital call to action for AI security! 🤯🔐 #AIAlignmentWorkshop
Tweet media one
1
1
3
@farairesearch
FAR AI
6 months
🤖💭Can AI ponder its own existence? Dive into this research exploring the possibility of training LLMs to self-reflect. It’s a glimpse into a potential future of AI consciousness research!
@rgblong
Robert Long
7 months
Could we ever get evidence about whether LLMs are conscious? In a new paper, we explore whether we could train future LLMs to accurately answer questions about themselves. If this works, LLM self-reports may help us test them for morally relevant states like consciousness. 🧵
Tweet media one
19
51
276
0
1
3