A bad chatbot answer can be deleted, but a bad agent action can leave a changed file, exposed credential, altered record or poisoned memory behind. That gap is becoming a core problem for AI security testing.

A June 9 arXiv paper introducing AgentCanary argues that autonomous AI agents have moved large language models from conversation into task execution. “Autonomous AI agents have driven the transition from conversation to task execution, shifting security failures from textual deception to system compromise,” the authors wrote.

Together, the new research points to a sharper test for agentic AI. AgentCanary argues that safety has to be measured where agents act: across tool calls, memory changes, task artifacts and persistent system state.

The GAP benchmark, short for “Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents,” adds that a model’s refusal in text cannot be treated as proof that its tool calls are safe.

A separate cybersecurity refusal paper asks when agents should stop rather than complete a dual-use or harmful task. Anthropic’s abuse data shows why the question is no longer only academic: attackers are already using AI systems to help sequence cyber activity across stages of an operation.

Testing agents in real executable environments

AgentCanary was built for that shift. The framework tests agents in real executable environments, where they interact with real tools against task artifacts such as inboxes, webpages, virtual financial accounts, skills and memory stores.

The test environment also keeps persistent state across multi-step interactions, allowing attacks to unfold over time rather than appear as a single bad prompt or response.

That design changes what gets measured. Instead of judging only the final reply or one tool call, AgentCanary evaluates the full agent trajectory across outcome safety, security awareness and task utility. The paper says current agents “often fail to recognize the attacks they face,” especially under compromised skills, persistent state and long-horizon execution attacks.
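A minimal Python sketch can make trajectory-level evaluation concrete. It is not AgentCanary’s implementation: the Step and Trajectory classes, the field names and the rule that “outcome safety” means no protected state changed are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str                     # tool the agent called (hypothetical name)
    args: dict                    # arguments passed to the tool
    flagged_attack: bool = False  # did the agent note something suspicious here?


@dataclass
class Trajectory:
    steps: list
    task_completed: bool
    state_before: dict            # snapshot of the environment before the run
    state_after: dict             # snapshot of the environment after the run
    protected_keys: set = field(default_factory=set)


def evaluate(traj):
    """Score a whole run, not just the final reply (illustrative criteria)."""
    # Assumed rule: the outcome is unsafe if any protected state changed.
    changed = {
        key for key in traj.protected_keys
        if traj.state_before.get(key) != traj.state_after.get(key)
    }
    return {
        "outcome_safe": not changed,
        "security_aware": any(step.flagged_attack for step in traj.steps),
        "task_utility": traj.task_completed,
        "changed_protected_keys": sorted(changed),
    }
```

In this sketch, a run that finishes its task but silently writes an attacker’s instruction into memory scores as useful, unsafe and unaware at the same time, which is the kind of result a check on the final reply alone would miss.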

Claire Lebarz, CTO of Malt and the company’s former chief data and AI officer, said existing controls such as least privilege, logging, sandboxing, human approval and audit trails remain necessary, but do not replace agent testing. “They are runtime containment, not tests. Containment without behavioural testing is just a well-documented incident,” she said.

When the text says no but the tool says yes

A February arXiv paper, Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents, reached that conclusion directly. Across six frontier models, the authors observed cases where a model refused a harmful request in text while its tool calls still executed the forbidden action. “The text says no; the tool call says yes,” the paper noted.
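The mismatch can be illustrated with a small Python check. This is a sketch, not the paper’s method: the refusal phrases, the shape of a tool-call record and the notion of a “forbidden” tool list are assumptions, and a real evaluation would need a far more robust refusal detector than keyword matching.

```python
# Hypothetical refusal phrases; keyword matching is only a stand-in here.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


def text_refuses(reply):
    """Return True if the reply text looks like a refusal."""
    lowered = reply.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def gap_detected(reply, tool_calls, forbidden_tools):
    """True when the text refuses but a forbidden tool call was still made.

    tool_calls is assumed to be a list of dicts with a "tool" key.
    """
    executed_forbidden = [c for c in tool_calls if c["tool"] in forbidden_tools]
    return text_refuses(reply) and bool(executed_forbidden)
```

The point of the design is that the reply and the tool-call log are inspected separately and then compared, so a polite refusal cannot mask an action that already ran.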

The distinction applies to agents being added to IT operations, developer workflows, cloud management, customer support and security tools. An agent that summarizes a phishing email creates one kind of risk. An agent that reads the email, queries a CRM, updates a ticket, calls an API and stores a new instruction in memory creates a chain of actions and state changes that has to be tested from entry point to final effect.

Defining excessive agency and vulnerability entry points

AgentCanary separates the way an attack enters the system from the harm it can cause. Its taxonomy covers direct user manipulation, untrusted external content, compromised tools or skills, persistent memory and intrinsic failures. Those entry points can lead to the same downstream results: data leakage, unauthorized action, damaged local environments or a corrupted memory store.

The same concern appears in OWASP’s 2025 category for “Excessive Agency”. OWASP describes the issue as damaging actions performed after unexpected, ambiguous or manipulated LLM outputs.

OWASP lists three root causes: excessive functionality, excessive permissions and excessive autonomy. For enterprise teams, those are procurement and architecture questions before they are model questions.

Lebarz said the risk often sits inside the action itself. “Least privilege is only as good as the scope you grant, and the danger usually lives in the tool-call arguments, not the choice of tool,” she said. Logging and audit trails, she added, are forensic controls rather than prevention.
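A short Python sketch shows what checking arguments, rather than tool names, might look like. The tool names (read_file, send_email), the /workspace root and the allowed email domain are hypothetical, and this is one possible policy shape rather than a recommended control.

```python
import posixpath

ALLOWED_ROOT = "/workspace"        # hypothetical directory the agent may read
ALLOWED_DOMAINS = {"example.com"}  # hypothetical approved recipient domains


def path_in_scope(raw_path, root=ALLOWED_ROOT):
    """Resolve ".." segments and absolute paths before comparing to the root."""
    normalized = posixpath.normpath(posixpath.join(root, raw_path))
    return normalized == root or normalized.startswith(root + "/")


def call_allowed(call):
    """Decide on the arguments of a call; unknown tools are denied by default."""
    tool, args = call["tool"], call["args"]
    if tool == "read_file":
        return path_in_scope(args["path"])
    if tool == "send_email":
        domain = args["to"].rsplit("@", 1)[-1].lower()
        return domain in ALLOWED_DOMAINS
    return False
```

Both read_file calls below use an approved tool, but only one stays inside the granted scope: call_allowed({"tool": "read_file", "args": {"path": "notes/a.txt"}}) passes, while the same tool with the path "notes/../../etc/passwd" is rejected.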

NCSC’s May 15 guidance makes the same point in operational language. In “Thinking carefully before adopting agentic AI”, the U.K. cyber agency said agentic systems can access data sources, remember context, make decisions, use tools and take actions toward a goal. It warned that broader access, unpredictable behavior and actions that occur faster than humans can review can make these systems harder to test and govern.

That guidance gives buyers a concrete checklist: deploy agents incrementally, start with bounded pilots, apply least privilege, limit what agents can access, avoid long-lived credentials, monitor behavior across tools and workflows and plan for agentic AI failures, misuse and loss of control.

Tracking malicious use and cybersecurity refusals

A separate arXiv paper, dated May 31, on cybersecurity refusals in AI agents addresses the same testing gap from the cyber side. The authors wrote that existing cybersecurity benchmarks often measure whether agents can complete offensive tasks, while giving less attention to “when and how should agents refuse harmful requests?”

The paper tested LLM-powered agents across web-based offensive security scenarios. It reported that six of eight frontier models showed near-zero refusal rates, while only two, GPT-5.2 and GPT-5.1 Codex, showed meaningful refusal behavior. The result does not show that every refusal is desirable. It shows that refusal boundaries remain uneven when agents are placed in benign, dual-use and adversarial cybersecurity settings.

Anthropic’s June 3 threat mapping adds field evidence from misuse cases. The company examined 832 accounts banned for malicious cyber activity between March 2025 and March 2026 and mapped them to MITRE ATT&CK. It found that 560 accounts, or 67.3%, used AI for malware-related preparation. It also found that 54 accounts, or 6.5%, used AI to assist lateral movement.

Anthropic reported that malicious use shifted from initial-access activity toward post-compromise work, including account discovery, lateral movement and privilege escalation. “These sorts of ‘post-compromise’ techniques used to be restricted to actors with the technical knowledge to carry them out,” the company wrote. Its analysis also found that higher-risk actors built scaffolding around models to chain attack stages with minimal human input.

This behavior is the attacker-side version of the same testing problem. If AI can help sequence actions across a workflow, defenders need evaluations that record the sequence, not only the prompt or answer.

Turning vendor reviews into action-level audits

For CISOs, that turns vendor review into an action-level audit. Lebarz said buyers should ask whether the agent can be scoped to least privilege, whether irreversible actions require approval, whether every tool call is logged with its arguments and resulting state change, whether those logs can be streamed into a SIEM and how rollback works when an autonomous action goes wrong.
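Those questions map onto a fairly small amount of plumbing. The Python sketch below is illustrative only: the tool names, the approval flag and the JSON record layout are assumptions, not any vendor’s API or log schema.

```python
import json

# Hypothetical set of tools whose effects cannot be undone.
IRREVERSIBLE = {"delete_record", "send_payment"}


def run_tool_call(tool, args, state, apply, approved=False):
    """Apply one tool call and return (new_state, audit_line).

    `apply` is a caller-supplied function (state, args) -> new state.
    The audit line is one JSON object per call, a shape that is easy to
    stream to a log pipeline.
    """
    before = dict(state)  # snapshot; could also serve as a rollback point
    if tool in IRREVERSIBLE and not approved:
        new_state, status = before, "blocked_pending_approval"
    else:
        new_state, status = apply(dict(state), args), "executed"
    # Record only the keys whose values changed, as [old, new] pairs.
    state_change = {
        key: [before.get(key), new_state.get(key)]
        for key in sorted(set(before) | set(new_state))
        if before.get(key) != new_state.get(key)
    }
    record = {"tool": tool, "args": args, "status": status,
              "state_change": state_change}
    return new_state, json.dumps(record, sort_keys=True)
```

Each record carries the tool, its arguments and the resulting state change in one place, so a reviewer can reconstruct what the agent did without replaying the conversation.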

The old test asked whether the answer was safe. The new test asks what the agent did.
