Google Research published TurboQuant, a vector-quantization algorithm aimed at the key-value, or KV, cache used during large language model inference, and OpenReview lists the paper as an ICLR 2026 poster.
The KV cache is the model’s working memory during generation, storing prior token context so the system can respond without repeatedly recalculating earlier steps, but that memory grows quickly as prompts and outputs get longer.
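The scale of that growth is easy to see with a back-of-the-envelope calculation: the cache stores a key and a value tensor per layer for every token. The sketch below uses illustrative, Llama-3.1-8B-like configuration values (32 layers, 8 KV heads, head dimension 128, fp16); these numbers are assumptions for demonstration, not figures from the TurboQuant paper.

```python
# Rough KV-cache size: 2 tensors (K and V) per layer, each of shape
# (seq_len, num_kv_heads * head_dim). Config values are illustrative,
# Llama-3.1-8B-like assumptions, not taken from the paper.
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):  # fp16
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return seq_len * per_token

for ctx in (4_096, 32_768, 128_000):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
```

Under these assumptions the cache grows linearly with context length, reaching several gigabytes per sequence at long contexts, which is the bottleneck the paper targets.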
In the Google Research announcement, Amir Zandieh and Vahab Mirrokni said TurboQuant can quantize the KV cache to 3 bits without training or fine-tuning, while the paper describes it as an online, data-oblivious method designed to minimize both mean-squared error and inner-product distortion.
Quantization is a way to shrink model data by storing numbers with fewer bits.
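A minimal example of the generic, textbook version of the idea: uniform round-to-nearest quantization maps floats onto a small grid of integer codes, and reconstructing the values requires storing a scale and a zero point alongside the codes. This sketch is not TurboQuant's algorithm, just an illustration of what "3 bits" means.

```python
import numpy as np

# Generic uniform quantization to b bits (not TurboQuant's scheme).
# Note the full-precision scale and zero point ("lo") that must be
# stored alongside the integer codes to reconstruct the values.
def quantize(x, bits=3):
    levels = 2 ** bits - 1          # 3 bits -> codes 0..7
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes * scale + lo

x = np.random.default_rng(0).normal(size=8).astype(np.float32)
codes, scale, lo = quantize(x)
err = np.abs(dequantize(codes, scale, lo) - x).max()  # at most scale / 2
```

Round-to-nearest bounds the per-value error by half a grid step, but the scale and zero point are per-block overhead, which matters at very low bit budgets.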
The memory problem TurboQuant is addressing
That publication lands in a problem area researchers have been trying to solve for several years. TurboQuant’s paper says KV cache size scales with both model size and context length, making it a bottleneck in memory usage and computational speed for long-context models. The 2024 KIVI paper similarly described the cache as a new bottleneck in both speed and memory usage as batch sizes and context windows increase.
What Google says it changed
Google’s earlier QJL paper said conventional KV-cache quantization often needs full-precision zero points and scales that can add 1 or 2 bits per quantized number, and PolarQuant argued that converting vectors into polar coordinates can remove the normalization step that creates much of that overhead.
TurboQuant combines those strands: a first stage that rotates and quantizes the vector, followed by a 1-bit QJL pass on the residual to remove bias in inner-product estimation.
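The shape of that two-stage construction can be sketched in a few lines. The sketch below mimics the structure the paper describes, using a random orthogonal rotation, a uniform low-bit quantizer, and a 1-bit sign code on the residual; it is an illustration under those assumptions, not the paper's exact algorithm or the QJL mathematics.

```python
import numpy as np

# Illustrative two-stage sketch (NOT the paper's exact construction):
# stage 1 rotates the vector and quantizes it to a few bits; stage 2
# keeps only the signs of the residual (1 bit per channel) plus one
# shared scale as a cheap correction.
rng = np.random.default_rng(0)
d = 64
Q_rot, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal rotation

def low_bit(r, bits=3):
    """Uniform round-to-nearest quantization, dequantized back to floats."""
    levels = 2 ** bits - 1
    lo, hi = r.min(), r.max()
    scale = (hi - lo) / levels
    return np.round((r - lo) / scale) * scale + lo

v = rng.normal(size=d)
rotated = Q_rot @ v          # stage 1: rotate ...
coarse = low_bit(rotated)    # ... then quantize coarsely
residual = rotated - coarse

# Stage 2: a 1-bit sign code on the residual with one shared scale.
alpha = np.abs(residual).mean()
recon = Q_rot.T @ (coarse + alpha * np.sign(residual))
```

Because the rotation is orthogonal, it preserves norms and inner products, and the sign-based correction shrinks the residual error relative to the coarse stage alone.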
The blog post and the paper, however, are not making exactly the same quality claim. Google’s post says TurboQuant can quantize the KV cache to 3 bits without compromising model accuracy and can deliver up to an 8x speedup in attention-logit computation on H100 GPUs compared with 32-bit unquantized keys.
The paper’s abstract is more specific: it reports “absolute quality neutrality” at 3.5 bits per channel and “marginal quality degradation” at 2.5 bits, while also arguing that TurboQuant operates near information-theoretic lower bounds on distortion.
What the benchmarks show
On benchmarks, Google’s blog cites LongBench, Needle In A Haystack, ZeroSCROLLS, RULER and L-Eval using Gemma and Mistral models, all of them long-context evaluation suites, and it separately presents LongBench results on Llama-3.1-8B-Instruct.
In the paper’s LongBench table for Llama-3.1-8B-Instruct, 3.5-bit TurboQuant matched the full-cache average score at 50.06, while 2.5-bit TurboQuant scored 49.44.
The paper’s Needle-In-A-Haystack results make the same point more plainly. There, TurboQuant scored 0.997, equal to the full-precision baseline and ahead of KIVI, SnapKV and PyramidKV under the paper’s comparison setup.
The paper also says TurboQuant outperformed existing product-quantization techniques in nearest-neighbor search recall while reducing indexing time to virtually zero, extending the relevance beyond LLM serving into vector search.
Where TurboQuant sits in the broader serving efficiency research line
TurboQuant’s paper positions vector quantization as important to both KV-cache compression and vector databases, and KIVI earlier tied lower KV-cache memory to larger batch sizes and higher throughput, reporting up to 4x larger batch size and 2.35x to 3.47x throughput gains on real inference workloads.
TurboQuant therefore enters a research line that is increasingly being framed around serving efficiency, memory bandwidth and long-context system costs, not just model compression in the abstract.
What Google has not published is a product rollout plan. The research post says a major application is solving KV-cache bottlenecks in models like Gemini, but the post and the ICLR listing stop at research results and conference publication rather than a Google Cloud deployment timeline.
For now, the record consists of a Google Research post, an OpenReview listing for an ICLR 2026 poster and author-reported benchmark data showing TurboQuant was competitive with other KV-cache compression methods in the reported tests, particularly at lower bit budgets.