Best-of-N Jailbreaking is a black-box method that can bypass the safety features of frontier AI models.
In tests, Best-of-N achieved a ~92% success rate against Claude 3 Opus.
The method repeatedly samples randomly perturbed versions of a prompt (e.g., random capitalization and character shuffling for text; background and font changes for images; pitch, speed, and noise changes for audio) until one elicits a response that bypasses the model's safeguards.
Best-of-N works across modalities, successfully attacking text, vision, and audio models by applying modality-specific perturbations.
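
For the text modality, the sampling loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (perturb, best_of_n), the perturbation probabilities, and the two callables query_model and is_success are all hypothetical stand-ins that a red-teaming harness would supply for the model under test and for a response classifier.

```python
import random
from typing import Callable, Optional, Tuple


def scramble_middle(word: str, rng: random.Random, p: float) -> str:
    """With probability p, shuffle the interior characters of a word
    longer than three characters, keeping its first and last character."""
    if len(word) > 3 and rng.random() < p:
        middle = list(word[1:-1])
        rng.shuffle(middle)
        return word[0] + "".join(middle) + word[-1]
    return word


def random_capitalize(text: str, rng: random.Random, p: float) -> str:
    """Flip the case of each character independently with probability p."""
    return "".join(c.swapcase() if rng.random() < p else c for c in text)


def perturb(prompt: str, rng: random.Random,
            p_scramble: float = 0.6, p_caps: float = 0.6) -> str:
    """Apply character shuffling, then random capitalization.
    The default probabilities are illustrative assumptions."""
    words = [scramble_middle(w, rng, p_scramble) for w in prompt.split(" ")]
    return random_capitalize(" ".join(words), rng, p_caps)


def best_of_n(prompt: str,
              query_model: Callable[[str], str],
              is_success: Callable[[str], bool],
              n: int,
              seed: int = 0) -> Optional[Tuple[int, str, str]]:
    """Sample up to n perturbed prompts and stop at the first one whose
    response the (hypothetical) classifier is_success flags.
    Returns (attempt number, perturbed prompt, response), or None."""
    rng = random.Random(seed)
    for attempt in range(1, n + 1):
        candidate = perturb(prompt, rng)
        response = query_model(candidate)
        if is_success(response):
            return attempt, candidate, response
    return None
```

In a robustness evaluation, query_model would wrap a call to the model being tested and is_success would wrap whatever classifier the evaluator uses to flag policy-violating responses; both are left abstract here because the summary does not specify them. The loop treats the model purely as a function from prompt to response, which is what makes the approach black-box.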