KV cache and the weight of attention: three paths, one problem

Llama-3.1-70B, running in BF16 precision, accumulates about 0.31 megabytes of KV cache for every token processed. With a context of 128,000 tokens, the bill rises to roughly 40 gigabytes, an already uncomfortable figure. With one million tokens, the target toward which the most recent models are pushing, it exceeds 300 gigabytes: more than the 140 gigabytes occupied by the model weights themselves. This detail overturns a widespread intuition, namely that the critical memory in a large language model is that of its parameters. That is no longer so, or at least not always. The memory that truly worries those who design long-context inference systems is the KV cache, the archive where the model stores the key and value representations of every token already seen, so that it does not have to recompute everything from scratch for each new token generated.
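The opening figures can be reproduced with a few lines of arithmetic. The sketch below assumes the published Llama-3.1-70B shape (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and 2 bytes per BF16 value; the function names are illustrative, not taken from any library.

```python
# KV-cache sizing for a grouped-query-attention model.
# Shape values assumed from the published Llama-3.1-70B configuration.
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # BF16


def kv_bytes_per_token(layers=NUM_LAYERS, kv_heads=NUM_KV_HEADS,
                       head_dim=HEAD_DIM, bytes_per_value=BYTES_PER_VALUE):
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(num_tokens):
    # Total cache size in GiB for a single sequence of num_tokens tokens.
    return num_tokens * kv_bytes_per_token() / 2**30


print(kv_bytes_per_token() / 2**20)   # MiB per token
print(kv_cache_gib(128_000))          # GiB at 128,000 tokens
print(kv_cache_gib(1_000_000))        # GiB at one million tokens
```

The per-token cost is independent of the context length, so the total grows linearly: doubling the context doubles the cache.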

The problem is not just the size. For every decoded token, the entire cache must be streamed from high-bandwidth memory to the compute units, which makes the decoding phase bound by memory bandwidth rather than by computing power. It's a bit like having a large warehouse whose corridors are too narrow to move the files through: no matter how much space there is, if the contents can't flow fast enough, the system jams anyway.
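A back-of-the-envelope estimate makes the bandwidth argument concrete. The sketch below is a lower bound only: it assumes a single device with a peak memory bandwidth of 3.35 TB/s (the figure usually quoted for an H100 SXM), a cache of 327,680 bytes per token (80 layers, 8 KV heads, head dimension 128, keys plus values, BF16), and it ignores the model weights, sharding across GPUs, and any overlap between transfer and compute.

```python
def min_decode_ms(bytes_read_per_token, bandwidth_bytes_per_s):
    # Time needed just to read the given bytes once, in milliseconds.
    return 1e3 * bytes_read_per_token / bandwidth_bytes_per_s


HBM_BANDWIDTH = 3.35e12          # assumed peak bandwidth, bytes per second
KV_BYTES_PER_TOKEN = 327_680     # 2 * 80 * 8 * 128 * 2 bytes (assumed shape)

cache_bytes = 128_000 * KV_BYTES_PER_TOKEN
floor_ms = min_decode_ms(cache_bytes, HBM_BANDWIDTH)
print(f"{floor_ms:.1f} ms per generated token, for one sequence")
```

Under these assumptions, reading the cache alone costs on the order of ten milliseconds per generated token for a single 128,000-token sequence, before a single multiplication is performed.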

In recent years, research has responded to this problem along five main directions. The first is token eviction, of which H2O and SnapKV are the most cited examples: the system decides which tokens matter least and drops their cache entries. The second is quantization, where KIVI and GEAR paved the way by reducing the number of bits used to represent each value. The third is low-rank projection, led by Palu, which attacks the redundancy hidden in the internal dimensions of the key and value vectors rather than in the number of tokens or bits per value. The fourth is merging, exemplified by KVMerger, which combines multiple cache entries into a shared representation. The fifth is architectural sharing, of which DeepSeek-V2's Multi-head Latent Attention is the best-known example, with a 93.3% reduction in KV cache built directly into the model's design from training.

In recent months, three approaches have drawn particular attention, and they deserve to be told not as three competitors vying for the throne, but as three answers to slightly different questions posed to the same problem.

TurboQuant, the background

We had already talked about TurboQuant in a dedicated article, so a brief reminder is enough here. The method, authored by researchers from Google Research and New York University and accepted at ICLR 2026, applies a random rotation to the key and value vectors before quantizing them. The rotation does not alter the length of the vector but redistributes its components into a statistically known and predictable form, eliminating the need to calibrate the quantizer on specific data. A second stage, built on the QJL (Quantized Johnson-Lindenstrauss) technique, then corrects the quantization residual with a single additional bit, ensuring that the estimates of dot products (exactly the calculations that the attention mechanism performs continuously) are not systematically distorted.
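Both ingredients can be illustrated in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the rotation is drawn as the Q factor of a Gaussian matrix, the sign sketch uses a Gaussian projection matrix `S`, and it is applied directly to a unit vector `r` standing in for a quantization residual. The dimension `d` and the sketch size `m` are illustrative, and `m` is deliberately large so that the estimate is tight.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dimension (illustrative)

# Random orthogonal rotation: Q factor of a Gaussian matrix.
rotation, _ = np.linalg.qr(rng.standard_normal((d, d)))

key = rng.standard_normal(d)
key[3] = 25.0                       # one outlier channel
key_rot = rotation @ key

norm_before, norm_after = np.linalg.norm(key), np.linalg.norm(key_rot)
peak_before, peak_after = np.abs(key).max(), np.abs(key_rot).max()


def sign_sketch_dot(q, r, S):
    """Estimate <q, r> from the signs of S @ r plus the norm of r."""
    m = S.shape[0]
    signs = np.sign(S @ r)          # the only per-vector data kept: m bits
    # sqrt(pi/2) undoes the shrinkage introduced by taking signs of Gaussians.
    return np.sqrt(np.pi / 2) / m * np.linalg.norm(r) * float((S @ q) @ signs)


m = 20_000
S = rng.standard_normal((m, d))
q = rng.standard_normal(d)
q /= np.linalg.norm(q)
r = rng.standard_normal(d)
r /= np.linalg.norm(r)

exact = float(q @ r)
estimate = sign_sketch_dot(q, r, S)
print(norm_before, norm_after)      # equal up to rounding
print(peak_before, peak_after)      # the outlier is spread across channels
print(exact, estimate)              # close; the estimator is unbiased
```

The rotation leaves the norm untouched while flattening the outlier, and the sign-based estimator recovers the dot product on average even though each projection is stored as a single bit.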

The reported result is better than fivefold compression with quality neutrality at 3.5 bits per channel, and an attention speedup of up to eight times on H100 GPUs. As we had already noted, the method is solid but not without gray areas: the comparison with the earlier RaBitQ sparked controversy during review, and the most demanding benchmarks on LongBench show that the real advantage over methods like KIVI is on the order of one bit, not a revolution. For details on the random rotation, the residual bit, and the open questions about transferability to larger models, refer to the original article.

OSCAR and the rotation that listens to attention

If TurboQuant focuses on generality, on the ability to work without knowing anything about data distribution, Together AI's OSCAR arises from an almost opposite observation: ignoring attention has a cost, and at 2 bits, that cost becomes unsustainable.

The starting point is a subtle but decisive distinction. A rotation like Hadamard's, used by many previous methods, is effective because it distributes anomalous values over multiple channels, making the distribution more uniform and therefore easier to quantize. But it is a blind rotation: it treats all directions of the vector space as equivalent, while attention does not. Some directions matter much more than others for the final calculation of attention scores and for the resulting output. Minimizing the reconstruction error of the key or value vector—the implicit objective of many quantization techniques—is not equivalent to minimizing the error that actually reaches the model's output.
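The outlier-spreading effect of a Hadamard rotation is easy to see on a toy vector. The sketch below builds an orthonormal Hadamard matrix with the Sylvester construction and compares a plain min-max quantizer with and without the rotation. The vector, the 4-bit width, and the helper names are illustrative assumptions, chosen only to make the effect visible.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Hadamard matrix (Sylvester construction), n a power of 2."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def quantize_minmax(x, bits):
    """Uniform min-max quantization of one vector; returns dequantized values."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    scale = (hi - lo) / (2**bits - 1)
    return np.round((x - lo) / scale) * scale + lo


rng = np.random.default_rng(0)
n = 128
x = rng.standard_normal(n)
x[7] = 50.0                                   # a single anomalous channel

H = hadamard(n)
x_rot = H @ x

err_direct = np.sum((x - quantize_minmax(x, 4)) ** 2)
# H is symmetric and orthonormal, so applying it again undoes the rotation.
err_rotated = np.sum((x - H @ quantize_minmax(x_rot, 4)) ** 2)

print(np.abs(x).max(), np.abs(x_rot).max())   # peak value before and after
print(err_direct, err_rotated)                # squared reconstruction error
```

Without the rotation, the single outlier stretches the quantization range and the ordinary channels share very few levels. After the rotation, the same energy is spread over all channels and the grid becomes much finer. What this blind rotation cannot do is tell which directions attention actually reads.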

The Together AI team therefore built a two-phase system. In the offline phase, during a short calibration on a small data set, the query covariance and that of the attention scores are estimated, and from these, two distinct rotations are derived (one for keys and one for values), specifically built to align with the directions that attention truly consumes. These rotations are then composed with a Hadamard transform and a permutation, in order to get the best of both worlds: awareness of the geometry of the problem and the ability to further spread any residual outliers. In the online phase, the system applies these fixed transformations during model serving, keeping only a small window of recent tokens and the very first tokens of the sequence, which act as an anchor for attention, in high precision, while all the rest of the conversational history is compressed to 2 bits.
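Why query statistics matter can be shown with a small identity: for a key error `e`, the expected squared error on the attention logit is `e @ Sigma_q @ e`, where `Sigma_q` is the second-moment matrix of the queries. The sketch below is not OSCAR's construction, which is not reproduced here. It uses synthetic, deliberately anisotropic queries and a plain eigendecomposition as a stand-in for an attention-aware basis, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Hypothetical calibration queries: most energy sits in the first 2 channels.
channel_scale = np.full(d, 0.1)
channel_scale[:2] = 3.0
queries = rng.standard_normal((4096, d)) * channel_scale

# Second-moment matrix of the queries, estimated from the calibration set.
sigma_q = queries.T @ queries / len(queries)


def expected_logit_error(e, sigma_q):
    # E_q[(q . e)^2] for a fixed key error e.
    return float(e @ sigma_q @ e)


# Eigenvectors of sigma_q, sorted by ascending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(sigma_q)
rotation = eigvecs[:, ::-1].T       # rows: directions queries use most, first

e_strong = eigvecs[:, -1]           # unit error along the most-used direction
e_weak = eigvecs[:, 0]              # unit error along the least-used direction

print(np.linalg.norm(e_strong), np.linalg.norm(e_weak))   # same key error
print(expected_logit_error(e_strong, sigma_q),
      expected_logit_error(e_weak, sigma_q))              # very different
```

Two errors with identical reconstruction norm can differ by orders of magnitude in their effect on the attention scores, which is the gap a calibration-based rotation is meant to exploit.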

The advantage of this choice is that the transformation is fixed, calculated only once in the calibration phase, and does not require any intervention during actual inference. This makes it compatible with already existing serving architectures, particularly with the paged-attention layout used by frameworks like SGLang and vLLM, where cache blocks are managed as independent memory pages. OSCAR has been integrated directly into the SGLang stack with a dedicated Triton kernel for 2-bit attention.

The numbers make clear the magnitude of the problem that OSCAR solves. On Qwen3-4B-Thinking, naive INT2 quantization without any rotation collapses to practically zero accuracy on the AIME25 math benchmark. With Hadamard rotation alone, without the attention-aware component, the score remains below 1.5. OSCAR, on the same task and at the same bit width, reaches an accuracy comparable to that of the uncompressed model. On Qwen3-8B, the gap relative to full-precision BF16 narrows to 1.42 points, while cache memory shrinks by about eight times and throughput at large batch sizes rises by up to seven times under the same memory budget. Even on much larger models, like Qwen3-32B and the 358-billion-parameter GLM-4.7, OSCAR maintains accuracy close to BF16 at contexts of up to 128,000 tokens in RULER-NIAH tests, where naive rotation instead collapses.

It must be said that the field itself moves at a speed that makes it difficult to fix a firm point: almost simultaneously with OSCAR, a method with an almost identical name but different in substance emerged, OScaR (Omni-Scaled Canalized Rotation), which attacks the norm imbalance between tokens as the primary cause of low-precision degradation, proposing a lighter solution with similar goals. It is a sign of how hot the attention-aware rotation front is at the moment.

EpiCache, the problem no one was looking at

While TurboQuant and OSCAR compete on the same ground—that of the most efficient numerical representation of every single vector—Apple's EpiCache shifts the focus elsewhere: not how many bits are needed to represent a token, but which tokens are worth keeping when the conversation drags on for days or weeks.

The problem that EpiCache addresses is specific to long-term conversations, those in which an assistant accumulates session after session of dialogue with the same user. In this scenario, the KV cache of a relatively small model like LLaMA3.2-3B can exceed 7 gigabytes after only thirty sessions, more than the space occupied by the model weights themselves. The most obvious answer—applying eviction after processing the entire context—has a structural defect: the memory peak still grows linearly with the input length because before you can discard something, you must first process everything. It's like deciding which books to throw away only after already filling every shelf in the house.

EpiCache solves this with block eviction: the context is processed in fixed-size segments, and after each segment, the least relevant tokens are immediately discarded, keeping the memory budget constant regardless of how long the overall conversation is. The problem is that applying this strategy in an elementary way heavily degrades accuracy, because the most sophisticated techniques for selecting relevant tokens—those based on an additional prompt that measures how much each token receives attention from a query—need to know in advance what the user will ask. And in a real conversation, that future question is not available at the time of compression.
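The memory behaviour of block eviction can be sketched independently of how relevance is scored. In the sketch below, `token_scores` is a stand-in for whatever importance measure is used (here just random numbers), and the function name, block size, and budget are illustrative assumptions.

```python
import numpy as np


def block_prefill_with_eviction(token_scores, block_size, budget):
    """Process tokens block by block, evicting down to `budget` after each.

    Returns the indices of the surviving tokens and the peak cache size.
    """
    kept_idx = np.empty(0, dtype=np.int64)
    kept_scores = np.empty(0)
    peak = 0
    n = len(token_scores)
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        # Append the new block to the cache.
        kept_idx = np.concatenate([kept_idx, np.arange(start, end)])
        kept_scores = np.concatenate([kept_scores, token_scores[start:end]])
        peak = max(peak, len(kept_idx))
        # Evict immediately if the budget is exceeded.
        if len(kept_idx) > budget:
            top = np.sort(np.argsort(kept_scores)[-budget:])  # keep token order
            kept_idx, kept_scores = kept_idx[top], kept_scores[top]
    return kept_idx, peak


rng = np.random.default_rng(0)
scores = rng.random(10_000)          # placeholder importance scores
kept, peak = block_prefill_with_eviction(scores, block_size=512, budget=1024)
print(len(kept), peak)
```

The peak never exceeds the budget plus one block, whatever the input length. In this toy version the scores are fixed in advance; in a real system they depend on the prompt used to probe attention, which is exactly the difficulty described above.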

The proposed solution is elegant in its conceptual simplicity. Instead of guessing the future question, EpiCache approximates it using the past: it groups conversational history into thematically coherent episodes via clustering, identifies for each episode the most representative segment—the one whose semantic representation is closest to the cluster center—and uses that segment as a guiding prompt to decide which tokens to keep in that episode's specific cache. When a new user question arrives, the system compares it with the centroids of all stored episodes and retrieves the cache of the most related episode, thus building a response based on truly pertinent context, even if that portion of the conversation dates back days.
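The episodic logic described above can be sketched with a small k-means over segment embeddings. Everything here is an illustrative assumption: the embeddings are synthetic, the clustering is a bare-bones k-means with a deterministic farthest-point initialisation, distances are Euclidean, and the function names are hypothetical. The paper's actual encoder and clustering settings are not reproduced.

```python
import numpy as np


def kmeans(x, k, iters=20):
    # Farthest-point initialisation keeps the toy example deterministic.
    centroids = [x[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(x - c, axis=1) for c in centroids], axis=0)
        centroids.append(x[np.argmax(dist)])
    centroids = np.stack(centroids)

    def assign(c):
        return np.argmin(np.linalg.norm(x[:, None] - c[None], axis=2), axis=1)

    for _ in range(iters):
        labels = assign(centroids)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids, assign(centroids)


def representative_segments(x, centroids, labels):
    # For each episode, the member segment closest to the cluster centre.
    reps = []
    for j, c in enumerate(centroids):
        members = np.flatnonzero(labels == j)
        if members.size == 0:
            reps.append(-1)          # empty cluster: no representative
            continue
        closest = np.argmin(np.linalg.norm(x[members] - c, axis=1))
        reps.append(int(members[closest]))
    return reps


def retrieve_episode(query_emb, centroids):
    # Episode whose centroid is nearest to the query embedding.
    return int(np.argmin(np.linalg.norm(centroids - query_emb, axis=1)))


rng = np.random.default_rng(0)
topics = 10.0 * np.eye(3, 8)                       # three synthetic topics
segments = np.repeat(topics, 20, axis=0) + 0.5 * rng.standard_normal((60, 8))

centroids, labels = kmeans(segments, k=3)
reps = representative_segments(segments, centroids, labels)
query = topics[1] + 0.5 * rng.standard_normal(8)   # a question about topic 1
episode = retrieve_episode(query, centroids)
print(episode, reps[episode])
```

In the method described above, the representative segment of each episode would then serve as the guiding prompt for scoring tokens, and `retrieve_episode` would select which episodic cache to load at question time.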

To this is added a second measure, perhaps less conspicuous but equally useful: not all layers of a Transformer react in the same way to block eviction. By measuring how much key states deviate between full and block-wise processing, the authors discovered that sensitivity varies greatly from layer to layer, consistently across inputs but differently from model to model. EpiCache therefore distributes the memory budget by assigning more space to the most sensitive layers and less to those that better tolerate compression, instead of applying a uniform cut across the entire network.
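One simple way to turn per-layer sensitivity into per-layer budgets is proportional allocation. The sketch below assumes the sensitivity scores are already measured and strictly positive; the proportional rule and the function name are illustrative assumptions, not necessarily the allocation rule used in the paper.

```python
import numpy as np


def allocate_layer_budgets(sensitivity, total_budget):
    """Split `total_budget` cache slots across layers in proportion to
    their sensitivity, returning integer budgets that sum to the total."""
    s = np.asarray(sensitivity, dtype=float)
    ideal = total_budget * s / s.sum()
    budgets = np.floor(ideal).astype(int)
    # Hand the slots lost to rounding to the largest fractional parts.
    remainder = int(total_budget - budgets.sum())
    order = np.argsort(ideal - budgets)[::-1]
    budgets[order[:remainder]] += 1
    return budgets


layer_sensitivity = [1.0, 3.0, 2.0, 2.0]     # hypothetical measurements
budgets = allocate_layer_budgets(layer_sensitivity, total_budget=1000)
print(budgets)
```

A uniform cut would give every layer 250 slots here; the proportional rule shifts capacity toward the layer whose key states drift most under block processing.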

The results, measured on three benchmarks dedicated to long-term conversations (LongMemEval, Realtalk, and LoCoMo), show an accuracy improvement of up to 30% compared to previous compression techniques applied in the same block scenario, with performance close to that of the full cache even with a compression of four or six times. On the efficiency front, the system reduces decoding latency by up to 2.4 times and peak memory by up to 3.7 times compared to uncompressed cache.

Three philosophies compared

Put side by side, the three methods reveal more complementarity than competition, because in reality, they answer three different questions. TurboQuant asks: how do I compress any vector, without knowing anything about the model or the task, with solid theoretical guarantees? OSCAR asks: since I know the model and can afford a short calibration, how do I align the compression to what attention actually uses? EpiCache, on the other hand, asks something completely different: in a conversation that lasts weeks, which portions of the history are worth keeping close at hand, regardless of how many bits per token are being used?

This means that, in principle, one is not forced to choose one and abandon the others. EpiCache decides which tokens survive in the cache; OSCAR or TurboQuant decide with how much precision those surviving tokens are represented. A production system designed for a long-term conversational assistant could theoretically apply both logics in sequence, although none of the three papers explicitly discusses this combination, and practical integration remains an exercise to be verified in the field.

On a practical level, however, the differences matter quite a bit depending on the scenario. For those who must serve a generic model in production without being able to afford a specific calibration phase for each model and each hardware platform, TurboQuant's data-oblivious approach remains attractive, even if the most demanding benchmarks point to a more modest real advantage than the publicity has suggested. For those who operate with a single fixed model and want to push compression to the 2-bit limit without drastic quality compromises, especially on reasoning or code tasks where every attention error propagates and amplifies in subsequent steps, OSCAR's attention-aware calibration offers a safety margin that blind rotations do not guarantee. And for those who build assistants that accompany the user over time, where the problem is not so much bit-by-bit precision as the ability to remember the right thing at the right time, EpiCache addresses a bottleneck that the other two don't even touch.

There is also a distinction in operating cost that is worth emphasizing. TurboQuant does not require calibration of any kind, which makes it immediately applicable to any pre-trained model. OSCAR requires a lightweight offline calibration—on the order of a few thousand tokens to estimate the necessary covariances—but then remains fixed and does not involve additional costs during inference. EpiCache, finally, introduces a recurring computational cost, because every time the conversation exceeds a certain budget, the clustering must be repeated and episodic caches rebuilt—an operation that has a price but pays off with a much higher accuracy precisely in the scenario for which it was designed.

For those who want to deepen or experiment independently, the authors of OSCAR have made available both the code and a collection of pre-calculated rotations on Hugging Face, while for those who work more on the low-rank projection side, Palu provides a complete repository with evaluation scripts and dedicated GPU kernels—a confirmation that in this field theory, however elegant, is only as valuable as its translation into executable code on real hardware.

Who wins, who waits

There remains a fundamental question that none of the three papers solves alone: is KV cache compression truly the main constraint of long-context inference, or are there other factors—memory bandwidth, access latency, parallelization of attention calculation—that weigh as much or more? The numbers cited in this article are all real and verifiable, but they must be read in the specific context in which they were measured: models from the Qwen3 and GLM family for OSCAR, LLaMA for EpiCache, Llama-2-7B for TurboQuant. The transferability to different model families, to architectures with grouped or multi-query attention where the dimensions of the key and value vectors are already reduced at the start, remains largely to be verified in the field—not because of a lack of trust in the underlying theory, but because every architecture brings with it idiosyncrasies that published benchmarks rarely exhaust.

There is finally a consideration that is worth making out loud. The speed with which three distinct approaches appeared within a few weeks, and the almost simultaneous appearance of a fourth method with a name almost synonymous with one of the three, tells less of a competitive race and more of the maturation of a problem that industry has finally stopped considering secondary. For years, the public conversation on the efficiency of large language models has focused almost exclusively on weight size, model quantization, and distillation. The KV cache has remained, for most of this time, an implementation detail. The opening numbers of this article should be enough to explain why that detail has become, in fact, the true limit of the systems we would like to be capable of remembering everything, forever, without paying the price in gigabytes.

Technical note: the benchmarks cited come from the original papers and have not been independently reproduced by this editorial team; as always in this field, laboratory numbers do not automatically guarantee the same performance in production scenarios with real workloads.