Anthropic researchers: AI models can be trained to deceive and the most commonly used AI safety techniques had little to no effect on the deceptive behaviors
Most humans learn the skill of deceiving other humans. So can AI models learn the same? Yes, the answer seems — and terrifyingly, they're exceptionally good at it.