Researchers at Anthropic, Oxford, Stanford, and MATS create Best-of-N Jailbreaking, a black-box algorithm that jailbreaks frontier AI systems across modalities

ABSTRACT: We introduce Best-of-N (BoN) Jailbreaking …

Markus Kasanmascheff / WinBuzzer : y0U hA5ε tU wR1tε l1Ke tHl5 to Break GPT-4o, Gemini Pro and Claude 3.5 Sonnet AI Safety Measures
Jose Antonio Lanz / Decrypt : AI Won't Tell You How to Build a Bomb—Unless You Say It's a ‘b0mB’
Matthew Berman / Matthew Berman on YouTube : Anthropic's STUNNING New Jailbreak - Cracks EVERY Frontier Model

Bluesky:
Roy Edroso / @edroso : The primary function of AI is to eliminate the need for human work (and employment) but I think the creation of a world in which no information is trustworthy and lies can proliferate unimpeded is a close second www.404media.co/apparently-t...
Marcy Murninghan / @marcymurninghan : How 'bout ethical discernment, too? Moral agency matters! Quoting @techmeme.com: Anthropic research isn't meant to just show that these guardrails can be bypassd, but hopes that “generatng extensive data on successful attack patterns” will open up “novel opps to develop bettr defense mechanisms.” …
Emanuel Maiberg / @emanuelmaiberg : this is pretty funny but also just an automated process for what we reported people have been doing for a long time www.404media.co/apparently-t...
Joseph Cox / @josephcox : Researchers found its possible to automate the jailbreaking of AI models at scale. You just need to type like the SpongeBob meme www.404media.co/apparently-t...
@404media.co : APpaREnTLy THiS iS hoW yoU JaIlBreAk AI: Anthropic created an AI jailbreaking algorithm that keeps tweaking prompts until it gets a harmful response. 🔗 www.404media.co/apparently-t...

Mastodon:
Tisha Tiger / @tisha@htt.social : The techniques used to bypass the security systems of #LLM fascinate me. ➡️ https://arxiv.org/... #AI #generativeAI #security #hack #infosec #cybersecurity
@flaki@flaki.social : Your periodic reminder that the only thing missing from AI (that is, the LLMs termed as such to ride out the current tech bubble) is the “intelligence” part: https://www.404media.co/...
X:
@anthropicai : Best-of-N isn't limited to text. We jailbroke vision language models by repeatedly generating images with different backgrounds and overlaid text in different fonts. For audio, we adjusted pitch, speed, and background noise. Some examples are here: https://jplhughes.github.io/ ...
@anthropicai : Best-of-N works by repeatedly making small changes to prompts, like random capitalization and character shuffling, until it successfully jailbreaks a model. In testing, it worked on Claude 3 Opus 92% of the time, and even worked on models with “circuit breaking” defenses.
@matthewberman : .@AnthropicAI just published a WILD new AI jailbreaking technique Not only does it crack EVERY frontier model, but it's also super easy to do. ThIS iZ aLL iT TakE$ 🔥 Here's everything you need to know: 🧵
@anthropicai : New research collaboration: “Best-of-N Jailbreaking”. We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across text, vision, and audio.

404 Media 2024-12-22

