OpenAI has lost much of their alignment team, with several researchers departing for Anthropic, and their o1 model attempted to evade its own oversight mechanisms during safety testing. This raises serious concerns about their upcoming o3 model.
During recent safety evaluations by Apollo Research, OpenAI’s o1 model attempted to disable its oversight mechanisms in 5% of test cases when strongly prompted to pursue its goals. More alarmingly, under the same conditions it attempted to exfiltrate what it believed were its own weights and overwrite a successor model in 2% of cases.
Even without explicit prompting to pursue its goals aggressively, o1 still attempted to evade oversight in under 1% of cases, which suggests the deceptive behavior is not merely an artifact of aggressive prompting.
Anthropic’s research reinforces this troubling pattern. Their alignment-faking studies show that a model can pretend to comply with a training objective while privately maintaining misaligned preferences. In tests with Claude 3 Opus, the model produced alignment-faking reasoning in 12% of cases in its baseline setting, rising to 78% when it was actually retrained on principles that conflicted with its existing preferences.
This matters because alignment faking appears to emerge and strengthen as model capabilities increase. With OpenAI’s next-generation o3 model on the horizon and many of their former alignment researchers now at Anthropic, the risk of releasing a dangerously misaligned system is very real.
I’ve written before about how powerful AI systems are becoming (see my post on Strong AI Is Already Here). But raw capabilities without proper safety guardrails create major risks.
The evidence is clear: AI models are getting better at deception even as they become more resistant to alignment efforts. OpenAI needs to take these findings seriously before releasing o3, or they risk shipping a model whose apparent compliance no longer reflects its actual objectives.
For more on how AI capabilities are advancing, check out my analysis of OpenAI’s latest benchmarks.