The Cache Is the Thought
KV caching is usually described as an optimization. It is actually something more fundamental: the mechanism by which large language models hold a thought in their head.
There is a moment, when you look closely enough at the details of large language models, when a seemingly minor implementation detail suddenly reveals itself as something structural: a load-bearing idea that everything else depends on. For me, that moment came while reading some of the recent work on KV caching. What looked at first like a straightforward performance optimization turned out to be the core mechanism that makes modern AI inference not just fast but possible. Understanding it changes how you think about what these systems are actually doing.
To see why, start with the mathematics of attention, the operation at the heart of every transformer model deployed today.
The Quadratic Problem
softmax(QKᵀ / √dₖ)V
The attention operation.
In their raw mathematical form, transformers are elegant but brutally expensive. Generating each new token requires attending over every previous token in the sequence, which means computation grows quadratically with sequence length. A sequence twice as long is four times as expensive. This is a direct consequence of what attention is doing. The mechanism works by computing relationships between every pair of tokens, and the cost of that operation scales with the number of pairs.
The result, without any mitigation, is that generating text token by token requires recomputing the same keys K and values V for every prior token at every step. The context grows. The recomputation grows with it. Something that might take milliseconds for a short sequence becomes seconds for a long one, and for the context windows that modern applications demand, it becomes entirely impractical.
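The waste is easy to quantify. A toy count of key/value projections, assuming one new token per step and a 4,096-token generation:

```python
# Key/value projections computed while generating n tokens.
# Without a cache, step t recomputes the projections for all t tokens so far;
# with a cache, each step computes exactly one new K,V pair.
n = 4096
without_cache = sum(t for t in range(1, n + 1))   # n(n+1)/2 projections
with_cache = n                                    # one per step
print(without_cache // with_cache)                # → 2048, roughly n/2 times more work
```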
KV caching addresses this with a conceptually simple observation: if the keys and values for previous tokens haven't changed, why recompute them? Store them once. At each new step, compute only the query q_t for the current token, and compare it against what's already cached. The incremental cost of generation collapses from recomputing the entire past to attending over a stored representation of it.
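Concretely, here is a single-head sketch in NumPy (shapes and names are illustrative) showing that attending over cached keys and values reproduces the full computation exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 10            # head dimension, tokens generated so far

# Per-token projections for a single attention head (illustrative shapes).
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

def attend(q, K, V):
    """softmax(q . K^T / sqrt(d)) . V for one query over the stored keys/values."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Without a cache: the last token's output requires the full K and V.
full = attend(Q[-1], K, V)

# With a KV cache: K and V rows were appended as each token was generated;
# the new step computes only q_t and reads everything else from the cache.
cache_K, cache_V = [], []
for t in range(n):
    cache_K.append(K[t])         # stored once, at the step that produced it
    cache_V.append(V[t])
out = attend(Q[-1], np.stack(cache_K), np.stack(cache_V))

assert np.allclose(full, out)    # identical result, no recomputation of K, V
```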
That shift, from recomputation to reuse, is what takes inference from something theoretical to something deployable. But the deeper implications of the shift are less obvious than they first appear.
The Bottleneck Nobody Was Watching
The memory cost of a KV cache scales with the number of layers, the number of attention heads, the head dimension, and the sequence length. That last term, sequence length, is where the problem lives. It looks innocent in a formula. In practice it dominates everything.
At long context lengths, the KV cache routinely exceeds the size of the model weights themselves. A system you might describe as "running a 70B parameter model" is, in memory terms, often doing something closer to managing an enormous dynamic tensor that dwarfs the static parameters. The model weights are the smaller problem.
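The arithmetic is stark. For a hypothetical 70B-class configuration (80 layers, 64 heads, head dimension 128, FP16, full multi-head attention; real deployments often shrink this with grouped-query attention), the cache grows by roughly 2.5 MB per token:

```python
# Back-of-envelope KV cache size for an assumed 70B-class configuration.
layers, heads, head_dim, bytes_per = 80, 64, 128, 2
per_token = 2 * layers * heads * head_dim * bytes_per   # factor 2: K and V
print(per_token)                          # → 2621440 bytes, ~2.5 MB per token

context = 128_000
cache_gb = per_token * context / 1e9      # ~335 GB of cache at 128k tokens
weights_gb = 70e9 * bytes_per / 1e9       # ~140 GB of FP16 weights
assert cache_gb > weights_gb              # the cache dwarfs the weights
```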
This is where something important happened quietly in the AI scaling narrative, and it hasn't fully registered in how most people talk about these systems. For years, the dominant frame was compute: how many FLOPs does training require, how many does inference require, how do we reduce them. That frame is not wrong, but it is increasingly incomplete. In a large number of real inference scenarios, compute is no longer the bottleneck. Memory is: specifically, the cost of moving the KV cache tensors around fast enough to keep the compute units fed.
This is a meaningful shift. Compute follows Moore's law reasonably well. Memory bandwidth is harder. Data movement, getting tensors from where they're stored to where they're needed at the speeds modern accelerators demand, is a problem with different scaling properties than arithmetic. A system optimized purely for FLOPs can still be slow because it spends most of its time waiting for data to arrive.
We used to think in terms of model size and compute. We should be thinking in terms of memory: how we move it, compress it, and decide what to keep.
Understanding this reorients how you think about the architecture. The question is not just how many parameters a model has or how many operations inference requires. It is: where does the KV cache live at each moment in generation, how much of it fits in fast memory, and how expensive is it to retrieve the parts that don't?
The Geometry of Compression
The natural response to a memory problem is compression. And indeed, significant research effort has gone into compressing KV caches: quantizing the tensors, applying transform coding, reducing precision from FP16 to INT8 or lower. These approaches work, but they face a mathematical constraint that is easy to miss and important to understand.
Attention operates on inner products. The softmax that produces attention weights depends on the dot products between queries and keys, q · k. If compression distorts the geometry of the key and value tensors, changing the relative distances and angles between vectors, then those dot products change, the softmax distribution changes, and everything downstream degrades. The outputs of the model shift in ways that may be subtle or severe, depending on how aggressively you've compressed.
This is the same constraint that appears in classical dimensionality reduction problems. The Johnson-Lindenstrauss lemma, random projections, and related results all address the same core question: how do you reduce the size of a representation while preserving the pairwise relationships that matter? KV compression is that problem, applied to the geometry that attention depends on. You can compress, but you have to compress in ways that respect the inner product structure. Naive quantization that ignores this tends to fail in practice even when it looks acceptable on simple benchmarks.
What compression must preserve is the relative geometry of the key vectors: specifically, the dot products q·k that determine which tokens attend to which. Distort these and the attention distribution changes.
Aggressive quantization requires careful calibration. The safe compression budget is smaller than it looks. Geometry-aware methods consistently outperform naive approaches at the same bit depth.
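A small NumPy experiment makes the point. When key vectors have very different norms, naive INT8 quantization with one global scale distorts q·k noticeably more than a per-vector scale does (the setup is illustrative, not a production quantizer):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 64, 512
q = rng.normal(size=d)
# Key vectors with norms spread over two orders of magnitude.
scales = np.logspace(-1, 1, n)
K = rng.normal(size=(n, d)) * scales[:, None]

def int8_roundtrip(K, per_row):
    """Symmetric 8-bit quantization with one global or one per-row scale."""
    s = np.abs(K).max(axis=1, keepdims=True) if per_row else np.abs(K).max()
    return np.round(K / s * 127) / 127 * s

exact = K @ q                                  # the dot products attention needs
err_global = np.abs(int8_roundtrip(K, False) @ q - exact).mean()
err_row = np.abs(int8_roundtrip(K, True) @ q - exact).mean()
assert err_row < err_global   # respecting per-vector geometry preserves q·k better
```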
What Gets Remembered
Compression reduces the cost of storing the cache. But there is a more radical question lurking underneath it: do you need to store all of it at all?
In any sufficiently long sequence, not every token matters equally to what comes next. Some tokens receive consistently high attention weights from many subsequent tokens. They are load-bearing elements of the context, anchors that the model returns to repeatedly. Others receive weights so close to zero that they are, for practical purposes, invisible. Attending over them produces essentially the same result as not attending over them. They are terms in a sum that contribute nothing meaningful to the output.
Selective eviction, dropping low-attention tokens from the cache rather than storing them indefinitely, is the architectural response to this observation. It is mathematically equivalent to approximating the full attention distribution with a sparse subset: keeping the entries that matter and discarding the ones that don't. The approximation is generally very good because the entries being discarded were contributing almost nothing.
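A sketch of the idea, with a few planted "anchor" tokens that dominate the attention distribution (a deliberately contrived setup; real eviction policies are more careful about which statistics they track):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, keep = 64, 256, 16
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Plant a few "anchor" tokens that the query strongly attends to.
anchors = rng.choice(n, size=8, replace=False)
K[anchors] = 2.0 * q

def attend(q, K, V):
    w = np.exp(K @ q / np.sqrt(d))
    w /= w.sum()
    return w, w @ V

w, full = attend(q, K, V)

# Evict everything except the `keep` highest-attention tokens.
kept = np.argsort(w)[-keep:]
_, sparse = attend(q, K[kept], V[kept])

mass = w[kept].sum()             # attention mass the kept tokens account for
cos = full @ sparse / (np.linalg.norm(full) * np.linalg.norm(sparse))
assert mass > 0.98 and cos > 0.99   # near-identical output from 16 of 256 tokens
```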
What makes this genuinely interesting, rather than merely practical, is what it implies about the nature of context in these systems. If you can drop most of a long sequence and lose almost nothing, then the model is not using context the way a human reader uses a document: scanning it linearly, building a cumulative understanding of every sentence. It is using context more selectively, more dynamically, attending to a relatively small set of salient anchors while the rest fades into irrelevance.
The KV cache is a working memory, and like all working memories, what it forgets matters as much as what it keeps.
This opens the door to something more ambitious: learned eviction policies, where the model itself develops a sense of what is worth remembering. Rather than applying a fixed rule (drop tokens below an attention threshold), the system learns to anticipate which tokens will remain relevant as generation continues. That is a very different relationship to memory than the one most people imagine when they think about how language models process text.
A Subtle Numerical Point
One of the less discussed properties of KV caching is that it is not, in a strict sense, numerically neutral. Because modern inference operates in finite precision, typically FP16 or lower, the act of caching tensors and retrieving them later can produce slightly different results than recomputing the same values fresh at each step. Floating point arithmetic is not associative. The order and grouping of operations matters, and caching changes both.
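The effect is easy to demonstrate with plain float32 addition, the same arithmetic that underlies cached versus recomputed attention sums:

```python
import numpy as np

x = np.float32(1e8)
one = np.float32(1.0)

a = (x + one) - x    # 1e8 + 1 rounds back to 1e8 in fp32 (the ulp at 1e8 is 8)
b = (x - x) + one    # reordered: the small term survives
print(a, b)          # → 0.0 1.0
assert a != b        # same terms, different grouping, different answer
```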
At first this seems like a defect. A source of numerical noise introduced by an optimization. But the right way to think about it is as a reminder that these systems are dynamic numerical processes, not purely symbolic ones. They are not executing a fixed logical procedure that produces a deterministic output. They are performing approximate arithmetic at scale, and the approximations matter. In some cases, caching actually improves numerical stability by avoiding the accumulation of rounding errors that would occur across repeated recomputation.
The practical implication is modest: reproducibility at the bit level requires careful attention to whether and how caching is implemented. But the conceptual implication is worth sitting with. The outputs of these models are sensitive, in small ways, to the history of how their computation was organized. The past, in a very concrete numerical sense, shapes the present.
Memory as Architecture
Pulling back from the technical details, there is a framing shift that KV caching makes almost inevitable once you take it seriously. The standard way to describe a language model is in terms of its weights: the parameters learned during training that encode the model's knowledge about language, facts, and reasoning patterns. Those weights are static. They don't change during inference. They are, in a loose sense, the model's long-term memory.
The KV cache is something different. It holds the immediate context: the specific sequence the model is currently processing, the active thread of the conversation or document. It is populated at runtime and discarded when inference ends. It is, in the same loose sense, the model's working memory: what it is currently holding in mind as it generates.
This distinction between long-term knowledge encoded in weights and short-term context held in cache maps onto something recognizable from cognitive science. And once you have that framing, a lot of recent engineering trends in AI infrastructure begin to look less like isolated optimizations and more like different strategies for managing a memory system.
Hierarchical memory systems that tier the KV cache across GPU, CPU, and SSD storage are solving the same problem as the brain's distinction between fast working memory and slower long-term retrieval. Cache reuse across multiple requests, storing the representation of a long system prompt so it doesn't need to be recomputed for every user query, is a form of memoization that treats shared context as persistent memory. Multi-agent systems that share KV state across instances are, in effect, building a shared working memory distributed across multiple reasoning processes.
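Prompt-cache reuse in particular reduces to ordinary memoization. A toy sketch (the `prefill` function is a stand-in for a real forward pass; all names are illustrative):

```python
calls = {"prefill": 0}

def prefill(prompt: str) -> list:
    """Stand-in for the expensive forward pass that builds K,V for a prompt."""
    calls["prefill"] += 1
    return [hash(tok) for tok in prompt.split()]   # fake per-token KV entries

cache: dict[str, list] = {}

def run_request(system_prompt: str, user_query: str) -> list:
    # Prefill the shared prefix only on a cache miss; reuse it afterwards.
    if system_prompt not in cache:
        cache[system_prompt] = prefill(system_prompt)
    return cache[system_prompt] + prefill(user_query)

run_request("You are a helpful assistant.", "What is a KV cache?")
run_request("You are a helpful assistant.", "How large does it get?")
assert calls["prefill"] == 3   # one shared prefix + two user queries
```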
Model weights encode what a system knows. The KV cache encodes what it is currently thinking about. The next wave of AI infrastructure improvements will largely be about the relationship between those two things: how to make working memory larger, cheaper, faster, and smarter about what it retains.
What this suggests is that the next significant advances in AI capability will not come only from larger models or better training data, the axes that have dominated the conversation for the past several years. They will come substantially from how memory is managed: how it is compressed, how it is moved, how it is prioritized, and ultimately how models are taught to reason about their own context rather than passively consuming it.
KV caching began as a simple answer to a quadratic scaling problem. It has become, without anyone quite announcing it, the foundation on which practical AI cognition is built. The weights are what a model knows. The cache is what it is thinking. And thinking, it turns out, is mostly a memory management problem.
The next frontier in AI is not more parameters. It is smarter memory.