Earlier this year, Elon Musk said on the Moonshots podcast that, “even with AI at its current state,” we are “pretty close” to replacing “half of” white-collar jobs, and that AI “can do probably half or more of those jobs right now.”
However, Mercor’s new APEX-Agents benchmark suggests a different picture for end-to-end, cross-application professional tasks. Mercor, an expert marketplace for AI training and evaluation, found that in its tests of investment banking, management consulting, and corporate law workflows, the best agents completed under 25% of tasks in one shot.
In the APEX-Agents paper, Mercor reports a best Pass@1 score of 24.0% for Gemini 3 Flash, followed by 23.0% for GPT-5.2, with Claude Opus 4.5 and Gemini 3 Pro at 18.4%. Pass@1 measures the share of tasks a model completes on its first attempt, so the best Pass@1 score is the highest first-try success rate any model achieved on the benchmark.
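To make the metric concrete, here is a minimal sketch of how a Pass@1 score can be computed from per-task attempt logs; the data structure and task names are illustrative assumptions, not Mercor's evaluation harness.

```python
from typing import Dict, List

def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Share of tasks whose first recorded attempt succeeded.

    `results` maps a task ID to its ordered attempt outcomes
    (True = completed, False = failed). Illustrative only; this is
    not Mercor's evaluation harness.
    """
    first_try_wins = sum(attempts[0] for attempts in results.values())
    return first_try_wins / len(results)

# Hypothetical attempt logs for three tasks.
results = {
    "ib_dcf_model": [False, True],       # failed first try, passed on a retry
    "consulting_market_sizing": [True],  # passed first try
    "legal_clause_review": [False],      # failed
}
print(round(pass_at_1(results), 2))  # 0.33
```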
Why this benchmark is a harder test than “tool-free” evals
APEX-Agents is built around 33 data-rich “worlds” and 480 tasks that require agents to work across applications such as documents, spreadsheets, PDFs, chat, email and calendar. Mercor says web search is turned off “to keep evaluations reproducible,” so each world includes the files needed to complete the tasks.
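For a sense of what that design implies in practice, here is a purely hypothetical sketch of how one self-contained task bundle could be described; the field names and file paths are assumptions for illustration, not Mercor's published schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WorldTask:
    """One task inside a self-contained, multi-app 'world'.

    Hypothetical illustration: the field names are assumptions,
    not Mercor's actual APEX-Agents schema.
    """
    task_id: str
    instructions: str
    apps: List[str]                   # applications the agent must work across
    provided_files: List[str]         # all inputs ship with the world
    web_search_enabled: bool = False  # search stays off for reproducibility

example = WorldTask(
    task_id="ib_comps_analysis_01",
    instructions=("Build a comparable-companies table from the data room "
                  "and email a summary to the deal team."),
    apps=["spreadsheet", "pdf_viewer", "email"],
    provided_files=["data_room/filings.pdf", "templates/comps.xlsx"],
)
print(example.apps, example.web_search_enabled)
```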
That design matters because it shifts the bottleneck from answering a single prompt to navigating messy enterprise context.
In an interview with TechCrunch, Mercor CEO Brendan Foody pointed to “tracking down information across multiple domains” as the biggest failure mode, describing day-to-day work as spread across tools like Slack and Google Drive rather than packaged in one place.
Why this matters now for US enterprise leaders
APEX-Agents tests whether AI can finish real white-collar work across the tools teams actually use, including docs, spreadsheets, PDFs, mail, chat, calendars and file systems, providing a closer proxy for end-to-end enterprise task execution than tool-free or single-prompt evaluations.
With web search turned off “to keep evaluations reproducible” and extra finance-data apps included in some worlds, the benchmark resembles a controlled enterprise environment more than an open-web agent setting.
What the gap implies for enterprise automation planning
The delta between APEX-Agents and Mercor’s earlier APEX benchmark shows why many “agent” pilots stall after demos.
In Mercor’s December expansion of APEX, a tool-free benchmark, the top model scored 67%, but APEX-Agents pushes agents into cross-app execution, where one-shot success stays below 25%.
Mercor also reports that retries help but don’t close the gap: in its launch post, the company says that even with multiple attempts, “no model is ready to replace a professional end-to-end,” and that the best agents top out at 40% after eight tries.
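As a rough illustration of why retries lift scores without closing the gap, the sketch below counts a task as solved if any of the first k attempts succeeds, mirroring common Pass@k practice in eval harnesses; the attempt logs are invented, and this is not Mercor's exact methodology.

```python
from typing import Dict, List

def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Share of tasks solved within the first k attempts (illustrative)."""
    solved = sum(any(attempts[:k]) for attempts in results.values())
    return solved / len(results)

# Invented logs for five tasks, up to eight attempts each: retries rescue
# some tasks, but others stay unsolved no matter how many tries are allowed.
results = {
    "task_a": [False, False, True, False, False, False, False, False],
    "task_b": [True],
    "task_c": [False] * 8,
    "task_d": [False] * 8,
    "task_e": [False, True, False, False, False, False, False, False],
}
print(pass_at_k(results, 1))  # 0.2 -> one-shot success
print(pass_at_k(results, 8))  # 0.6 -> better with retries, still far from 1.0
```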
For CIOs and operational leaders, that suggests near-term value is more likely to come from bounded, reviewable slices of work, in which agents draft, assemble, or compute and humans validate, rather than from “hands-off” automation of entire analyst and associate workflows.
The paper’s own setup reinforces this: tasks average about 1.8 hours of expert-estimated effort and often require sustained file navigation and context retention across many steps.
How APEX-Agents fits into the broader “economic value” eval race
APEX-Agents is part of a widening push to measure AI on economically meaningful tasks instead of abstract exams. OpenAI’s GDPval, for example, measures performance across 44 occupations, with a published paper and a public evaluation service.
Mercor’s bet is that cross-application execution is the enterprise-grade bar that should shape road maps, and its open release (CC-BY dataset plus its Archipelago infrastructure) gives labs a concrete target to optimize against.
What to watch next
Mercor says it will expand APEX-Agents beyond the initial three professions, and its open dataset creates a near-term “training-to-the-test” incentive for labs chasing leaderboard gains.