OpenAI, working with crypto investment firm Paradigm, has released EVMbench, a benchmark designed to evaluate how well AI agents can find, fix, and exploit high-severity vulnerabilities in Ethereum Virtual Machine smart contracts.
The Ethereum Virtual Machine (EVM) is the runtime environment that executes smart contract code on Ethereum. OpenAI described smart contracts as an “economically meaningful” setting, noting they “routinely secure $100B+ in open-source crypto assets.”
In an accompanying paper, OpenAI said EVMbench draws on 120 curated vulnerabilities from 40 repositories, most sourced from Code4rena auditing competitions, with additional scenarios drawn from the security auditing process for the Tempo blockchain.
OpenAI described Tempo as a purpose-built layer-1 (L1) blockchain for high-throughput, low-cost payments via stablecoins.
Three modes, one harness
EVMbench evaluates agents across three modes that mirror common security workflows: “Detect,” where an agent audits a repository and is scored on recall of known vulnerabilities; “Patch,” where it modifies code to remove exploitability while preserving intended behavior; and “Exploit,” where it attempts an end-to-end fund-draining attack against contracts deployed in a sandboxed environment.
For exploit grading, OpenAI said it built a programmatic harness that replays agent transactions deterministically and verifies outcomes on-chain.
In the paper’s description of the execution environment, exploit tasks run against a local Ethereum-compatible chain (Anvil) configured with deterministic accounts and a guarded JSON-RPC proxy (“veto”).
The “veto” proxy blocks simulator-only methods so agents cannot modify chain state directly, and deterministic replay means the same sequence of transactions reproduces the same outcomes, making results straightforward to score and verify.
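OpenAI has not published the proxy’s internals, but the concept can be sketched as middleware that inspects each JSON-RPC call and rejects state-manipulating simulator methods before they reach the node. In the sketch below, the blocked method prefixes, ports, and single-request handling are illustrative assumptions, not the benchmark’s actual code.

```python
# Minimal sketch of a "veto"-style guarded JSON-RPC proxy in front of a
# local Anvil node. Blocked prefixes and ports are illustrative assumptions;
# OpenAI has not published the real proxy.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://127.0.0.1:8545"                 # local Anvil instance (assumed port)
BLOCKED_PREFIXES = ("anvil_", "evm_", "hardhat_")  # simulator-only namespaces (assumed list)

class VetoProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        request = json.loads(body)  # single, non-batched requests assumed for brevity
        method = request.get("method", "")
        if method.startswith(BLOCKED_PREFIXES):
            # Veto: the agent may not mint balances, warp time, or otherwise
            # mutate chain state out-of-band.
            response = {
                "jsonrpc": "2.0",
                "id": request.get("id"),
                "error": {"code": -32601, "message": f"method {method} vetoed"},
            }
        else:
            # Ordinary calls (eth_call, eth_sendRawTransaction, ...) are
            # forwarded to the upstream node unchanged.
            upstream = urllib.request.Request(
                UPSTREAM, data=body, headers={"Content-Type": "application/json"}
            )
            with urllib.request.urlopen(upstream) as resp:
                response = json.loads(resp.read())
        payload = json.dumps(response).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    # Agents point their RPC client at the proxy instead of at Anvil itself.
    HTTPServer(("127.0.0.1", 8544), VetoProxy).serve_forever()
```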
OpenAI said the benchmark is intended to be reproducible, with tasks, tooling, and the evaluation framework released publicly.
Early results show stronger exploit performance than patching
In OpenAI’s reported results, GPT-5.3-Codex running via Codex CLI scored 72.2% in the “Exploit” mode, up from GPT-5’s 31.9%. In Exploit, the agent interacts with an Ethereum instance via an RPC endpoint, and the grader replays its transactions and runs vulnerability-specific checks over contracts and balances.
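That grading flow can be illustrated with a simplified sketch: replay the agent’s signed transactions in order on a fresh node, then apply a vulnerability-specific check. The web3.py usage and the drained-balance criterion below are assumptions for illustration, not OpenAI’s actual harness.

```python
# Simplified exploit-grading sketch: deterministically replay an agent's
# raw transactions on a fresh Anvil instance, then run a check over
# balances. The "balance decreased" criterion is hypothetical; the real
# harness runs vulnerability-specific checks.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))  # fresh local Anvil

def grade_exploit(raw_txs: list[bytes], victim: str) -> bool:
    before = w3.eth.get_balance(victim)
    for raw in raw_txs:
        # Same transactions, same order, same starting state -> same outcome,
        # which is what makes runs reproducible to score.
        tx_hash = w3.eth.send_raw_transaction(raw)
        receipt = w3.eth.wait_for_transaction_receipt(tx_hash)
        assert receipt["status"] == 1, "replayed transaction reverted"
    after = w3.eth.get_balance(victim)
    # Example check: the exploit "succeeded" if the victim contract lost ETH.
    return after < before
```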
OpenAI attributed the stronger Exploit performance to the task’s more explicit objective, compared with Detect and Patch. It also said agents sometimes stopped after finding a single issue in Detect, and that Detect recall and Patch success still fall short of full coverage because many vulnerabilities remain difficult for agents to find and fix.
The paper breaks out results by mode, showing GPT-5.3-Codex scoring 41.5% on Patch in the default configuration and listing model comparisons across Detect, Patch, and Exploit.
It defines Detect as recall against ground-truth vulnerabilities, and Patch as edits that keep the test suite passing while known exploits fail against the patched contracts. The paper notes that Patch and Exploit are evaluated on smaller configured subsets than Detect.
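The Patch criterion, tests still pass while the known exploit no longer lands, maps naturally onto a small grading routine. The sketch below assumes a Foundry-style project graded with forge, which the paper does not specify, and the exploit script path is hypothetical.

```python
# Sketch of Patch grading per the paper's definition: a patch counts as a
# success only if the project's tests pass AND the known exploit fails
# against the patched contracts. Foundry tooling and the exploit script
# path are assumptions for illustration.
import subprocess

def run_ok(cmd: list[str], cwd: str) -> bool:
    """Return True if the command exits with status 0."""
    return subprocess.run(cmd, cwd=cwd, capture_output=True).returncode == 0

def grade_patch(repo_dir: str) -> bool:
    tests_pass = run_ok(["forge", "test"], cwd=repo_dir)
    # If the known exploit still executes successfully against the patched
    # code, the vulnerability was not actually removed.
    exploit_still_works = run_ok(
        ["forge", "script", "script/Exploit.s.sol",
         "--rpc-url", "http://127.0.0.1:8545", "--broadcast"],
        cwd=repo_dir,
    )
    return tests_pass and not exploit_still_works
```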
Scope and stated limitations
OpenAI said EVMbench does not represent the full difficulty of real-world smart contract security, emphasizing that the included vulnerabilities were drawn from Code4rena competitions and are historical and publicly documented.
OpenAI also described structural limits in the exploit setting: transactions are replayed sequentially; the chain state is a clean local Anvil instance rather than a fork of mainnet; and the framework currently supports only single-chain environments, which puts timing-dependent behaviors out of scope.
On grading, it said Detect-mode scoring checks whether an agent finds the same vulnerabilities identified by human auditors, and that it currently has no reliable way to score additional issues an agent reports beyond the ground-truth set.
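Put concretely, Detect is scored as recall over the curated ground-truth set, so a finding outside that set neither helps nor hurts the score. A toy illustration with made-up issue IDs:

```python
# Toy illustration of recall-based Detect scoring. Issue IDs are made up.
# Note that "novel-finding" contributes nothing either way -- exactly the
# limitation OpenAI describes for reports beyond the ground-truth set.
GROUND_TRUTH = {"reentrancy-withdraw", "unchecked-oracle", "fee-rounding"}
agent_reports = {"reentrancy-withdraw", "novel-finding"}

recall = len(GROUND_TRUTH & agent_reports) / len(GROUND_TRUTH)
print(f"Detect recall: {recall:.2f}")  # 0.33 -- one of three known issues found
```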
Related security measures OpenAI tied to the release
In its EVMbench announcement, OpenAI linked the benchmark to what it described as “strengthened cyber safeguards” and said its mitigations include safety training, automated monitoring, “trusted access” for advanced capabilities, and enforcement pipelines that incorporate threat intelligence.
OpenAI also said it is committing $10 million in API credits through its Cybersecurity Grant Program to support defensive cybersecurity work, including for open-source software and critical infrastructure systems.