OpenAI hardens ChatGPT Atlas against prompt injection with “auto-attacker” red team

OpenAI has shipped a security update for the browser agent inside ChatGPT Atlas, adding an “adversarially” trained model and tighter safeguards after its internal automated red-teaming uncovered a new class of prompt injection attacks.

In its write-up, OpenAI describes prompt injection as a long-running security problem for agents that read untrusted content and then take actions in a browser.

The company compared the threat to “ever-evolving online scams” and said the defenses will need to be continuously strengthened, rather than treated as a one-time fix.

OpenAI described prompt injection as attacks that embed malicious instructions inside content an agent processes, emails, documents, calendar invites, forums, or webpages, so the agent follows the attacker’s intent instead of the users.

Because a browser agent can click, type, and submit forms like a human, OpenAI said the potential impact can range from forwarding sensitive emails to sending money or altering cloud files.

To find these exploits faster, OpenAI said it built an LLM-based automated attacker and trained it end-to-end with reinforcement learning to hunt for prompt injection paths that succeed against a browser agent.

The attacker can “try before it ships” by proposing an injection and running a simulated rollout of how the victim agent would behave, using the returned reasoning-and-action trace as feedback to iterate before finalizing an attack.

OpenAI’s demo scenario shows how indirect prompt injection can surface in normal enterprise workflows.

The automated attacker “seeded” an inbox with a malicious email that instructed the agent to send a resignation note; later, when the user asked for an out-of-office reply, the agent encountered that email during the workflow and followed the injected instructions, until the updated defenses flagged the attempt.

The company did not claim deterministic guarantees. OpenAI said the nature of prompt injection makes such guarantees “challenging,” and positioned its approach as a rapid response loop: discover new attack classes via automated red teaming, then adversarially train updated agent models and adjust monitoring and system-level safeguards.

For U.S. enterprises evaluating agentic browsing, OpenAI’s own launch materials emphasize that “agent mode” in Atlas is available for Plus, Pro and Business users, and can be enabled for Enterprise and Edu depending on admin controls, while also warning that safeguards “will not stop every attack” as agents grow in popularity.

Risk-management guidance from government security playbooks similarly stresses containment and oversight.

The UK government’s AI Playbook says prompt injection exists because models cannot inherently separate system instructions from user-provided text, and it recommends human review before actions are carried out, prompt/output filtering, logging and audits, and limiting what data and systems an AI tool can access, especially in indirect scenarios such as email and attachment summarization.