OpenAI has released a research preview of GPT-5.3-Codex-Spark, which it described as a smaller version of GPT-5.3-Codex built for real-time work inside Codex. OpenAI said Codex-Spark is designed to feel “near-instant” on ultra-low-latency hardware and can deliver more than 1,000 tokens per second.
Tokens are the chunks of text a model generates, so tokens per second is a throughput measure: how quickly the model can produce output.
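As a rough illustration of what that figure implies (illustrative arithmetic only, not a benchmark of Codex-Spark), here is how long outputs of different lengths would take to stream at the claimed rate:

```python
# Back-of-the-envelope throughput math with illustrative output sizes.
# 1,000 tokens/s is the rate OpenAI claims; the output lengths are arbitrary examples.
tokens_per_second = 1_000
for output_tokens in (500, 2_000, 4_000):
    seconds = output_tokens / tokens_per_second
    print(f"{output_tokens:>5} tokens -> ~{seconds:.1f}s at {tokens_per_second} tok/s")
```

At that rate, a 500-token diff would stream in about half a second, which is the kind of responsiveness the “near-instant” claim points to.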
OpenAI said the model is rolling out to ChatGPT Pro users in the latest versions of the Codex app, CLI and VS Code extension. It also said Codex-Spark is being made available in the API to a small set of design partners, with broader access planned “over the coming weeks.”
CLI refers to a command-line interface for running Codex from a terminal.
A latency-first serving tier, not just a faster model
OpenAI described Codex-Spark as part of a wider push to cut end-to-end latency, not only model compute time.
It said it introduced a persistent WebSocket connection and “targeted optimizations” in its Responses API (the interface developers use to send prompts and receive model outputs) that reduced per-round-trip overhead by 80%, per-token overhead by 30% and time-to-first-token by 50%.
A WebSocket keeps a live connection open so the service can stream responses without starting a new request each time. OpenAI said the WebSocket path is enabled by default for Codex-Spark and will become the default for all models soon.
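To see why a persistent connection cuts per-round-trip overhead, consider a minimal sketch. This is not OpenAI's protocol or the Codex-Spark API; it is a generic comparison, using the third-party websockets package and a local echo server, between opening a fresh connection for every request and reusing one long-lived connection:

```python
# Conceptual sketch: persistent WebSocket vs. a new connection per request.
# Requires the third-party "websockets" package (pip install websockets).
import asyncio
import time

import websockets


async def echo(ws):
    # Stand-in for a streaming service: send back whatever arrives.
    async for message in ws:
        await ws.send(message)


async def new_connection_each_time(uri, n):
    start = time.perf_counter()
    for _ in range(n):
        # Pays the TCP + WebSocket handshake cost on every round trip.
        async with websockets.connect(uri) as ws:
            await ws.send("ping")
            await ws.recv()
    return time.perf_counter() - start


async def persistent_connection(uri, n):
    start = time.perf_counter()
    # Handshake once, then reuse the open connection for every round trip.
    async with websockets.connect(uri) as ws:
        for _ in range(n):
            await ws.send("ping")
            await ws.recv()
    return time.perf_counter() - start


async def main():
    async with websockets.serve(echo, "localhost", 8765):
        uri = "ws://localhost:8765"
        n = 50
        fresh = await new_connection_each_time(uri, n)
        reused = await persistent_connection(uri, n)
        print(f"{n} round trips, new connection each time: {fresh:.3f}s")
        print(f"{n} round trips, one persistent connection: {reused:.3f}s")


asyncio.run(main())
```

Even against a local server the reused connection comes out ahead; over real network distances, where every handshake costs one or more extra round trips, the gap widens, which is the kind of overhead OpenAI's per-round-trip figure refers to.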
The company says it is optimizing the request-response pipeline so developers can steer the agent mid-stream and iterate faster, while keeping a separate “long-running” mode for jobs that take hours or longer.
Why Cerebras is central to the launch
OpenAI said Codex-Spark runs on Cerebras Wafer Scale Engine 3 (WSE-3), positioning it as a “latency-first serving tier” that complements GPU serving. OpenAI also emphasized that GPUs remain foundational across training and inference and said the two can be combined for single workloads.
OpenAI has separately described its Cerebras integration as part of a portfolio strategy to “match the right systems to the right workloads,” adding 750 MW of ultra-low-latency AI compute in phases through 2028.
Cerebras, for its part, markets WSE-3 as a single-wafer processor with 4 trillion transistors and 125 petaflops of AI compute and says its architecture targets inference speed by concentrating compute, memory and bandwidth on one chip.
Performance claims OpenAI is, and is not, making
OpenAI said Codex-Spark is text-only at launch with a 128k context window, and that preview usage is subject to separate rate limits, with requests potentially queuing under high demand.
On capability, OpenAI said Codex-Spark is tuned for speed: it defaults to “minimal, targeted edits” and does not run tests unless prompted.
It also said internal evaluations determined Codex-Spark does not have a plausible chance of reaching the company’s Preparedness Framework threshold for “high capability” in cybersecurity or biology.
Strategic context: Compute diversification and infrastructure
The launch also lands in the context of OpenAI’s broader compute diversification. Reuters reported in January that OpenAI agreed to buy compute capacity from Cerebras in a deal valued at more than $10 billion, citing a source. OpenAI’s own partnership post does not disclose financial terms.
Cerebras said it raised $1 billion in a Series H round at an approximately $23 billion post-money valuation, a funding update the company framed as support for scaling its AI compute and cloud capacity.
Cerebras said the financing would support expanding compute capacity, which aligns with OpenAI’s stated plan to add low-latency inference capacity through the partnership.
OpenAI is pointing to a two-mode workflow: one model path for long-horizon autonomous execution and another for interactive “in-the-moment” edits.