Anthropic has introduced Claude Opus 4.6, the latest version of its flagship Opus model, saying it improves coding, reliability in larger codebases, and performance on longer agentic tasks.

The firm said that Opus 4.6 is its first Opus-class model to support a 1 million-token context window in beta, alongside new API features aimed at long-running conversations and tool-driven work.

The company added that base pricing will stay the same as earlier Opus releases: $5 per million input tokens and $25 per million output tokens.

According to the announcement, a premium tier applies when a prompt exceeds 200,000 tokens under the 1 million-token beta: $10 per million input tokens and $37.50 per million output tokens for tokens in that request.
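As a rough illustration of those rates, the sketch below estimates a request's cost in dollars. One assumption is flagged in the code: the announcement does not spell out whether a request that crosses the 200,000-token threshold bills all of its tokens at the premium rate or only the excess, so this sketch applies the premium rate to the whole request.

```python
# Illustrative cost estimate using the published Opus 4.6 rates.
# Assumption (not confirmed in the announcement): once a prompt crosses
# 200,000 input tokens, ALL tokens in that request bill at the premium rate.

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate a single request's cost in USD from the quoted per-million rates."""
    if input_tokens > 200_000:   # 1M-context beta premium tier
        in_rate, out_rate = 10.00, 37.50
    else:                        # base Opus pricing
        in_rate, out_rate = 5.00, 25.00
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 300k-token prompt with a 10k-token reply, billed at premium rates:
print(estimate_cost_usd(300_000, 10_000))  # 3.375
```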

Anthropic said Opus 4.6 is available on claude.ai and through its API under the identifier claude-opus-4-6, and that it is also offered on major cloud platforms, including Amazon Bedrock and Google Cloud Vertex AI.

Benchmarks: vendor claims, plus benchmark-owner surfaces

Anthropic described Opus 4.6 as a coding and “economically valuable work” update, pointing to results on Terminal-Bench 2.0 and GDPval-AA. In its Opus 4.6 system card, Anthropic reported a 65.4% Terminal-Bench 2.0 score under its “max effort” configuration.

Terminal-Bench 2.0 is a benchmark that evaluates whether an AI agent can complete realistic, end-to-end tasks in a command-line interface, such as working inside containerized environments and solving professional-grade engineering workflows.

Separately, the Terminal-Bench project’s public leaderboard lists results by specific agent harnesses and model identifiers. In that listing, one Opus 4.6 entry shows 62.9% ± 2.7 for a named harness (“Terminus 2”) dated Feb. 6, and another Opus 4.6 entry appears under “Claude Code” dated Feb. 7 with a different score.

A score of 62.9% is simply a task resolution rate: the share of benchmark tasks the agent completes successfully under the benchmark’s pass/fail testing.

For knowledge-work evaluation, Anthropic said Opus 4.6 leads OpenAI’s GPT-5.2 by about 144 Elo points on GDPval-AA, a gap that, according to Artificial Analysis, typically corresponds to roughly a 70% pairwise win rate in head-to-head task comparisons.
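The quoted win rate is consistent with the standard logistic Elo formula. Artificial Analysis's exact anchoring and scaling may differ, so this is only an illustrative check of the arithmetic under the conventional 400-point curve.

```python
# Expected pairwise win probability implied by an Elo gap, using the
# conventional logistic formula with a 400-point scale. Artificial
# Analysis's GDPval-AA scaling is assumed, not confirmed, to match.

def elo_win_prob(elo_gap: float) -> float:
    """Probability that the higher-rated side wins a pairwise comparison."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400))

# A 144-point gap implies roughly a 70% win rate:
print(round(elo_win_prob(144), 3))  # 0.696
```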

GDPval itself is an OpenAI-created benchmark designed to evaluate models on “economically valuable, real-world tasks” across 44 occupations, with a corresponding paper describing its construction and coverage.

Artificial Analysis, which maintains a GDPval-AA leaderboard and methodology notes, describes GDPval-AA as its evaluation framework for OpenAI’s GDPval dataset and publishes its approach to anchoring and revising Elo scoring.

Anthropic also cited BrowseComp to support claims about web-browsing ability. BrowseComp is an OpenAI benchmark for browsing agents, described in OpenAI’s benchmark page and paper as a set of 1,266 questions that require persistent online navigation to find hard-to-locate information.

Longer context, bigger outputs, and controls for “thinking” and memory

Anthropic said Opus 4.6 supports up to 128,000 output tokens, which it positioned as enabling larger outputs without splitting work into multiple requests.

On the API side, Anthropic introduced “adaptive thinking” and effort levels (low, medium, high, max), describing them as controls that let developers tune how selectively the model uses deeper reasoning.
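A minimal sketch of what selecting one of those levels might look like when assembling a request. The `effort` field name and the payload shape here are assumptions for illustration only; they are not confirmed API parameters.

```python
# Hypothetical request payload illustrating the described effort controls.
# The "effort" field name is an assumed stand-in, not confirmed API surface;
# only the model identifier (claude-opus-4-6) comes from the announcement.
EFFORT_LEVELS = ("low", "medium", "high", "max")

def build_request(prompt: str, effort: str = "medium") -> dict:
    """Assemble a request dict with one of the documented effort levels."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}")
    return {
        "model": "claude-opus-4-6",
        "messages": [{"role": "user", "content": prompt}],
        "effort": effort,  # assumed field name for the effort control
    }

print(build_request("Summarize this diff", effort="max")["effort"])  # max
```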

Anthropic also introduced context compaction in beta. In its API documentation, Anthropic describes compaction as automatic summarization triggered near a configured token threshold, replacing older context with a summary so conversations can continue without manual history management.
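The mechanism described, summarizing older turns once a token threshold is approached, can be sketched generically as follows. This is illustrative logic, not Anthropic's implementation; `count_tokens` and `summarize` are hypothetical stand-ins for a tokenizer and a model call.

```python
# Illustrative sketch of threshold-triggered context compaction.
# Generic logic only, not Anthropic's API: count_tokens and summarize
# are hypothetical stand-ins for a tokenizer and a summarization call.

def compact_if_needed(messages: list[str], threshold: int,
                      count_tokens, summarize) -> list[str]:
    """When the running token count reaches the threshold, replace the
    oldest messages with a single summary so the conversation continues."""
    total = sum(count_tokens(m) for m in messages)
    if total < threshold:
        return messages                 # under budget: keep history as-is
    keep = messages[-2:]                # keep the most recent turns verbatim
    summary = summarize(messages[:-2])  # condense everything older
    return [summary] + keep

# Toy usage with word counts standing in for tokens:
history = ["msg one two three", "four five", "six seven eight", "nine"]
compacted = compact_if_needed(
    history, threshold=8,
    count_tokens=lambda m: len(m.split()),
    summarize=lambda ms: "summary of earlier turns",
)
print(compacted)  # ['summary of earlier turns', 'six seven eight', 'nine']
```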

Safety: system-card claims and new cyber probes

Anthropic’s Opus 4.6 system card says the company evaluated misaligned behaviors including deception and sycophancy and reported Opus 4.6’s results within its internal safety testing framework.

The system card also discusses over-refusals and describes additional cybersecurity-related probes used to assess potential misuse.

Competitive timing: GPT-5.3-Codex rolls into Copilot

The Opus 4.6 release landed as OpenAI’s newer GPT-5.3-Codex began rolling out through GitHub Copilot. GitHub’s changelog describes GPT-5.3-Codex as OpenAI’s latest agentic coding model and outlines where it can be selected and how admins can enable it.
