Saturday, April 25, 2026

The Cache Is the Thought — What KV Caching Reveals About How AI Actually Works

Architecture · Inference · Memory

The Cache Is the Thought

KV caching is usually described as an optimization. It is actually something more fundamental: the mechanism by which large language models hold a thought in their head.

Technical Essay — 10 min read

• • •

There is a moment, when you look closely enough at the details of large language models, when a seemingly minor implementation detail suddenly reveals itself as something structural: a load-bearing idea that everything else depends on. For me, that moment came with reading some of the recent ideas surrounding KV caching. What looked at first like a straightforward performance optimization turned out to be the core mechanism that makes modern AI inference not just fast, but possible. Understanding it changes how you think about what these systems are actually doing.

To see why, start with the mathematics of attention, the operation at the heart of every transformer model deployed today.

The Quadratic Problem

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

The attention operation

In their raw mathematical form, transformers are elegant but brutally expensive. Generating each new token requires attending over every previous token in the sequence, which means the total computation grows quadratically with sequence length. A sequence twice as long is four times as expensive. This is a direct consequence of what attention is doing. The mechanism works by computing relationships between every pair of tokens, and the cost of that operation scales with the number of pairs.

The result, without any mitigation, is that generating text token by token requires recomputing the same keys K and values V for every prior token at every step. The context grows. The recomputation grows with it. Something that might take milliseconds for a short sequence becomes seconds for a long one, and for the context windows that modern applications demand, it becomes entirely impractical.

KV caching addresses this with a conceptually simple observation: if the keys and values for previous tokens haven't changed, why recompute them? Store them once. At each new step, compute the query q_t, key k_t, and value v_t for the current token only, append k_t and v_t to the cache, and compare q_t against everything stored there. The incremental cost of generation collapses from recomputing the entire past to attending over a stored representation of it.
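The reuse described above can be sketched in a few lines of plain Python. This is a toy, single-head illustration, not a real model: the `project` function is a hypothetical stand-in for the learned query, key, and value projections, and the token vectors are made up. Each cached step projects only the current token, appends its key and value to the cache, and attends over the stored entries. The uncached version rebuilds every key and value from scratch and arrives at the same outputs.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over lists of keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def project(x, factor):
    """Hypothetical stand-in for a learned projection: it only rescales x."""
    return [factor * xi for xi in x]

class KVCache:
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x_t):
        # Only the current token is projected; earlier keys and values are reused.
        q_t = project(x_t, 1.0)
        self.keys.append(project(x_t, 0.5))
        self.values.append(project(x_t, 2.0))
        return attend(q_t, self.keys, self.values)

def step_without_cache(tokens):
    """Recompute K and V for every token so far, then attend from the last one."""
    keys = [project(x, 0.5) for x in tokens]
    values = [project(x, 2.0) for x in tokens]
    return attend(project(tokens[-1], 1.0), keys, values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up token vectors
cache = KVCache()
cached_outputs = [cache.step(x) for x in tokens]
recomputed_outputs = [step_without_cache(tokens[: t + 1]) for t in range(len(tokens))]
```

The two lists of outputs match, but the cached path performs one projection per step while the uncached path performs one per token in the sequence so far.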

That shift, from recomputation to reuse, is what takes inference from something theoretical to something deployable. But the deeper implications of the shift are less obvious than they first appear.

For details on the attention mechanism and the K, Q, and V matrices, see this article I wrote previously.

The Bottleneck Nobody Was Watching

The memory cost of a KV cache scales with the number of layers, the number of attention heads, the head dimension, and the sequence length. That last term, sequence length, is where the problem lives. It looks innocent in a formula. In practice it dominates everything.

The Scaling Reality

At long context lengths, the KV cache routinely exceeds the size of the model weights themselves. A system you might describe as "running a 70B parameter model" is, in memory terms, often doing something closer to managing an enormous dynamic tensor that dwarfs the static parameters. The model weights are the smaller problem.
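A rough sizing sketch makes the scaling concrete. The formula multiplies the four terms named above, plus a factor of two for keys and values and a byte width per stored number. The configuration below is hypothetical and chosen only for illustration; the `batch_size` and `bytes_per_value` parameters are assumptions added here, and models that share key/value heads across several query heads would use a smaller head count.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes (the leading 2 covers keys and values)."""
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical configuration for illustration, not a specific released model.
config = dict(num_layers=80, num_heads=64, head_dim=128)

# Weights of a 70B-parameter model stored at 2 bytes per parameter.
weight_bytes = 70_000_000_000 * 2

cache_sizes = {
    seq_len: kv_cache_bytes(seq_len=seq_len, **config)
    for seq_len in (4_096, 32_768, 131_072)
}

for seq_len, size in cache_sizes.items():
    ratio = size / weight_bytes
    print(f"{seq_len:>7} tokens: {size / 2**30:6.1f} GiB  ({ratio:.2f}x the weights)")
```

Under these assumptions the cache for a single sequence is smaller than the weights at a few thousand tokens and larger than them at the longest length shown. Serving several sequences at once multiplies the cache again, while the weights stay fixed.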

This is where something important happened quietly in the AI scaling narrative, and it hasn't fully registered in how most people talk about these systems. For years, the dominant frame was compute: how many FLOPs does training require, how many does inference require, how do we reduce them. That frame is not wrong, but it is increasingly incomplete. In a large number of real inference scenarios, compute is no longer the bottleneck. Memory is: specifically, the cost of moving the KV cache tensors around fast enough to keep the compute units fed.

This is a meaningful shift. Compute has followed Moore's law reasonably well. Memory bandwidth has been harder to scale. Data movement, getting tensors from where they're stored to where they're needed at the speeds modern accelerators demand, is a problem with different scaling properties than arithmetic. A system optimized purely for FLOPs can still be slow because it spends most of its time waiting for data to arrive.
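A back-of-the-envelope estimate shows why waiting for data can dominate. The sketch below counts, for one decoding step over a cached sequence, the multiply-add FLOPs of the attention dot products and weighted sums against the bytes of keys and values that have to be read. The hardware figures are hypothetical round numbers, not the specification of any real accelerator, and the estimate ignores everything except attention over the cache.

```python
def decode_attention_cost(num_layers, num_heads, head_dim, seq_len, bytes_per_value=2):
    """FLOPs and cache bytes read for one decode step of a single sequence."""
    # Per head: q.k scores (2 * head_dim * seq_len) plus the weighted sum of values (same).
    flops_per_head = 4 * head_dim * seq_len
    # Per head: every cached key and value is read once.
    bytes_per_head = 2 * head_dim * seq_len * bytes_per_value
    heads = num_layers * num_heads
    return heads * flops_per_head, heads * bytes_per_head

# Hypothetical accelerator figures, chosen only for illustration.
PEAK_FLOPS_PER_SECOND = 1.0e15
MEMORY_BYTES_PER_SECOND = 3.0e12

flops, bytes_read = decode_attention_cost(
    num_layers=80, num_heads=64, head_dim=128, seq_len=32_768
)
compute_time = flops / PEAK_FLOPS_PER_SECOND
memory_time = bytes_read / MEMORY_BYTES_PER_SECOND
flops_per_byte = flops / bytes_read

print(f"arithmetic intensity: {flops_per_byte:.1f} FLOP per byte")
print(f"compute-bound time:   {compute_time * 1e3:.3f} ms")
print(f"memory-bound time:    {memory_time * 1e3:.3f} ms")
```

With 2-byte values the step performs about one FLOP per byte read, so under these assumed numbers the time spent moving the cache is far larger than the time spent on arithmetic.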

We used to think in terms of model size and compute. We should be thinking in terms of memory: how we move it, compress it, and decide what to keep.

Understanding this reorients how you think about the architecture. The question is not just how many parameters a model has or how many operations inference requires. It is where the KV cache lives at each moment in generation, how much of it fits in fast memory, and how expensive it is to retrieve the parts that don't.

The Geometry of Compression

The natural response to a memory problem is compression. And indeed, significant research effort has gone into compressing KV caches: quantizing the tensors, applying transform coding, reducing precision from FP16 to INT8 or lower. These approaches work, but they face a mathematical constraint that is easy to miss and important to understand.

Attention operates on inner products. The softmax that produces attention weights depends on the dot products between queries and keys: q · k. If compression distorts the geometry of the key and value tensors, if it changes the relative distances and angles between vectors, then those dot products change, the softmax distribution changes, and everything downstream degrades. The outputs of the model shift in ways that may be subtle or severe depending on how aggressively you've compressed.

This is the same constraint that appears in classical dimensionality reduction problems. The Johnson-Lindenstrauss lemma, random projections, and related results all address the same core question: how do you reduce the size of a representation while preserving the pairwise relationships that matter? KV compression is that problem, applied to the geometry that attention depends on. You can compress, but you have to compress in ways that respect the inner product structure. Naive quantization that ignores this tends to fail in practice even when it looks acceptable on simple benchmarks.

What compression must preserve

The relative geometry of key vectors: specifically, the dot products q·k that determine which tokens attend to which. Distort these and the attention distribution changes.

What this means in practice

Aggressive quantization requires careful calibration. The safe compression budget is smaller than it looks. Geometry-aware methods consistently outperform naive approaches at the same bit depth.
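The effect of compression on the attention distribution can be seen in a small experiment. The sketch below applies a deliberately naive scheme, uniform symmetric quantization with one scale per key vector, at several bit depths, and measures how far the resulting attention weights drift from the exact ones using total variation distance. The query and key vectors are made up, and the scheme is an assumption chosen for simplicity, not a method taken from any particular paper or library.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def quantize(vec, bits):
    """Naive uniform symmetric quantization with a single scale per vector."""
    levels = 2 ** (bits - 1) - 1
    scale = (max(abs(x) for x in vec) / levels) or 1.0
    return [round(x / scale) * scale for x in vec]

def attention_weights(q, keys):
    scale = math.sqrt(len(q))
    return softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])

def total_variation(p, r):
    """Half the L1 distance between two probability distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, r))

# Made-up query and keys for illustration.
query = [0.9, -1.3, 0.4, 2.1]
keys = [
    [1.0, 0.2, -0.7, 0.05],
    [0.3, -1.1, 0.6, 0.9],
    [-0.4, 0.8, 0.1, -0.2],
]

exact = attention_weights(query, keys)
distortion = {}
for bits in (8, 4, 2):
    approx = attention_weights(query, [quantize(k, bits) for k in keys])
    distortion[bits] = total_variation(exact, approx)
    print(f"{bits}-bit keys: total variation from exact weights = {distortion[bits]:.5f}")
```

On this toy input the 8-bit keys leave the distribution nearly untouched while the 2-bit keys shift it visibly, which is the pattern the constraint above predicts: the error that matters is the error in q·k, not the error in any single stored number.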

What Gets Remembered

Compression reduces the cost of storing the cache. But there is a more radical question lurking underneath it: do you need to store all of it at all?

In any sufficiently long sequence, not every token matters equally to what comes next. Some tokens receive consistently high attention weights from many subsequent tokens. They are load-bearing elements of the context, anchors that the model returns to repeatedly. Others receive weights so close to zero that they are, for practical purposes, invisible. Attending over them produces essentially the same result as not attending over them. They are terms in a sum that contribute nothing meaningful to the output.

Selective eviction, dropping low-attention tokens from the cache rather than storing them indefinitely, is the architectural response to this observation. It is mathematically equivalent to approximating the full attention distribution with a sparse subset: keeping the entries that matter and discarding the ones that don't. The approximation is generally very good because the entries being discarded were contributing almost nothing.
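A toy version of this approximation is sketched below. It computes full attention for one query, keeps only the highest-weight positions within a fixed budget, recomputes attention over that subset, and compares the two outputs. The vectors and the budget are made up, and ranking by the weights of a single query is a simplification: a practical eviction policy would have to judge importance across many queries over time.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values, keep=None):
    """Attention output and weights; if `keep` is given, only those positions are used."""
    positions = list(range(len(keys))) if keep is None else list(keep)
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, keys[i])) / scale for i in positions]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * values[i][j] for w, i in zip(weights, positions)) for j in range(dim)]
    return output, weights

# Made-up cache: positions 0 and 2 align with the query, the rest do not.
query = [4.0, 0.0]
keys = [[3.0, 0.0], [-2.0, 1.0], [2.5, 0.5], [-3.0, 0.0], [-2.5, -1.0], [-1.5, 2.0]]
values = [[1.0, 0.0], [5.0, 5.0], [0.0, 1.0], [-3.0, 2.0], [4.0, -4.0], [2.0, 2.0]]

full_output, full_weights = attend(query, keys, values)

budget = 2  # hypothetical number of cache entries we can afford to keep
ranked = sorted(range(len(keys)), key=lambda i: full_weights[i], reverse=True)
kept = sorted(ranked[:budget])
kept_mass = sum(full_weights[i] for i in kept)

sparse_output, _ = attend(query, keys, values, keep=kept)
max_error = max(abs(a - b) for a, b in zip(full_output, sparse_output))

print(f"kept positions: {kept}, attention mass retained: {kept_mass:.6f}")
print(f"largest output difference after eviction: {max_error:.2e}")
```

Two thirds of the cache is dropped here, yet the output barely moves, because the evicted entries carried almost none of the attention mass.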

What makes this genuinely interesting, rather than merely practical, is what it implies about the nature of context in these systems. If you can drop most of a long sequence and lose almost nothing, then the model is not using context the way a human reader uses a document: scanning it linearly, building a cumulative understanding of every sentence. It is using context more selectively, more dynamically, attending to a relatively small set of salient anchors while the rest fades into irrelevance.

The KV cache is a working memory, and like all working memories, what it forgets matters as much as what it keeps.

This opens the door to something more ambitious: learned eviction policies, where the model itself develops a sense of what is worth remembering. Rather than applying a fixed rule (drop tokens below an attention threshold), the system learns to anticipate which tokens will remain relevant as generation continues. That is a very different relationship to memory than the one most people imagine when they think about how language models process text.

A Subtle Numerical Point

One of the less discussed properties of KV caching is that it is not, in a strict sense, numerically neutral. Because modern inference operates in finite precision, typically FP16 or lower, the act of caching tensors and retrieving them later can produce slightly different results than recomputing the same values fresh at each step. Floating point arithmetic is not associative. The order and grouping of operations matters, and caching changes both.
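The non-associativity is easy to observe directly. The snippet below uses ordinary Python floats, which are 64-bit rather than the 16-bit formats mentioned above, so it illustrates the principle rather than reproducing inference arithmetic. The same terms are added in two different groupings and give two different answers.

```python
# Same three numbers, two groupings.
a, b, c = 0.1, 0.2, 0.3
left_grouped = (a + b) + c
right_grouped = a + (b + c)
print(left_grouped == right_grouped)   # False
print(left_grouped, right_grouped)     # 0.6000000000000001 0.6

# A more dramatic case: a small term is lost when it is added to a large one first.
big, small = 1e16, 1.0
cancel_first = (big + -big) + small    # the large terms cancel, then 1.0 is added
absorb_first = big + (-big + small)    # 1.0 is absorbed into -1e16, then cancelled
print(cancel_first, absorb_first)      # 1.0 0.0
```

A cached computation and a from-scratch recomputation can group their sums differently, for example one step at a time versus one large batched operation, so small discrepancies of this kind are expected rather than surprising.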

At first this seems like a defect, a source of numerical noise introduced by an optimization. But the right way to think about it is as a reminder that these systems are dynamic numerical processes, not purely symbolic ones. They are not executing a fixed logical procedure whose output is identical however the arithmetic is arranged. They are performing approximate arithmetic at scale, and the approximations matter. In some cases, caching can even make results more consistent, because a stored value is reused exactly instead of being recomputed under a possibly different order of operations.

The practical implication is modest: reproducibility at the bit level requires careful attention to whether and how caching is implemented. But the conceptual implication is worth sitting with. The outputs of these models are sensitive, in small ways, to the history of how their computation was organized. The past, in a very concrete numerical sense, shapes the present.

Memory as Architecture

Pulling back from the technical details, there is a framing shift that KV caching makes almost inevitable once you take it seriously. The standard way to describe a language model is in terms of its weights: the parameters learned during training that encode the model's knowledge about language, facts, and reasoning patterns. Those weights are static. They don't change during inference. They are, in a loose sense, the model's long-term memory.

The KV cache is something different. It holds the immediate context, i.e. the specific sequence the model is currently processing, the active thread of the conversation or document. It is populated at runtime and discarded when inference ends. It is, in the same loose sense, the model's working memory: what it is currently holding in mind as it generates.

This distinction between long-term knowledge encoded in weights and short-term context held in cache maps onto something recognizable from cognitive science. And once you have that framing, a lot of recent engineering trends in AI infrastructure begin to look less like isolated optimizations and more like different strategies for managing a memory system.

Hierarchical memory systems that tier the KV cache across GPU, CPU, and SSD storage are solving the same problem as the brain's distinction between fast working memory and slower long-term retrieval. Cache reuse across multiple requests, such as storing the representation of a long system prompt so it doesn't need to be recomputed for every user query, is a form of memoization that treats shared context as persistent memory. Multi-agent systems that share KV state across instances are, in effect, building a shared working memory distributed across multiple reasoning processes.
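The memoization idea for shared prompts can be sketched as a small lookup table keyed by the prefix. Everything here is hypothetical: `PrefixCache` is an illustrative class, not the API of any serving framework, and `fake_prefill` is a placeholder for the model pass that would produce real key and value tensors.

```python
import hashlib

class PrefixCache:
    """Toy memoization of per-prefix KV state, keyed by a hash of the token ids."""

    def __init__(self, compute_kv):
        self._compute_kv = compute_kv
        self._store = {}
        self.misses = 0

    def get(self, prefix_tokens):
        key = hashlib.sha256(repr(tuple(prefix_tokens)).encode()).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compute_kv(prefix_tokens)
        return self._store[key]

def fake_prefill(tokens):
    # Placeholder: a real system would run the model and return key/value tensors.
    return [(t * 0.5, t * 2.0) for t in tokens]

system_prompt = [101, 7, 42, 9]  # made-up token ids
cache = PrefixCache(fake_prefill)

for user_query in ([5, 6], [8], [3, 3, 3]):
    shared_kv = cache.get(system_prompt)   # computed once, reused afterwards
    # Simplification: a real model would compute the suffix's keys and values
    # while attending to the cached prefix, not independently of it.
    query_kv = fake_prefill(user_query)
    full_kv = shared_kv + query_kv

print(f"prefix computed {cache.misses} time(s) for 3 requests")
```

The shared prompt is processed once and served from the table for every later request, which is the sense in which shared context behaves like persistent memory.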

The Emerging Picture

Model weights encode what a system knows. The KV cache encodes what it is currently thinking about. The next wave of AI infrastructure improvements will largely be about the relationship between those two things, i.e. how to make working memory larger, cheaper, faster, and smarter about what it retains.

What this suggests is that the next significant advances in AI capability will not come only from larger models or better training data, the axes that have dominated the conversation for the past several years. They will come substantially from how memory is managed: how it is compressed, how it is moved, how it is prioritized, and ultimately how models are taught to reason about their own context rather than passively consuming it.

KV caching began as a simple answer to a quadratic scaling problem. It has become, without anyone quite announcing it, the foundation on which practical AI cognition is built. The weights are what a model knows. The cache is what it is thinking. And thinking, it turns out, is mostly a memory management problem.

The next frontier in AI is not more parameters. It is smarter memory.

Friday, April 24, 2026

The Last Justification — Why Leaders May Be Running Out of Reasons to Exist

Philosophy · Economics · Artificial Intelligence

The Last Justification

Leaders have always claimed their authority from the management of scarcity. What happens when scarcity ends?

• • •

In 1930, John Maynard Keynes predicted that by 2030 the economic problem, the human struggle for subsistence, would be solved. He was writing at the depths of the Great Depression, which made the forecast either visionary or crazy depending on your viewpoint. Now in 2026, the people building artificial intelligence are implying essentially the same prediction on a faster timeline, and they have more reason to believe it than Keynes did.

That convergence deserves more attention than it is getting. If the economic problem is actually being solved, if AI delivers anything close to the post-scarcity conditions of infinite abundance that its most serious architects believe it will, then something else dissolves alongside scarcity that nobody is talking about: the entire philosophical and practical foundation on which human leadership has rested for ten thousand years.

Leaders, in every civilization that has ever existed, have drawn their deepest authority from one source: someone has to manage the scarcity. Not divine right. Not democratic mandate. Not military necessity. Those are the decorative justifications, the ones that get written into constitutions and carved into monuments. Underneath all of them, holding the structure up, is the one argument that was always hardest to refute. Resources are finite. People compete for them. Without coordination and control, the result is violence. Therefore: leaders.

What happens when that argument stops being true? What if the future doesn't need leaders?

This might have seemed radically theoretical a generation ago. But the possibility of artificial general intelligence (AGI), meaning systems that can perform most economically valuable cognitive work at near-zero marginal cost, threatens to dissolve the scarcity foundations on which every human hierarchy from the pharaohs to the Fortune 500 has been constructed. To understand how radical this might be, it helps to trace the argument from its roots.

Scarcity as the Engine of Hierarchy

Before agriculture, humans lived in small bands of twenty to one hundred fifty people. Leadership in these groups was fluid, situational, and constantly checked by group consensus. The best hunter led the hunt. The most experienced elder arbitrated disputes. When the situation changed, so did the leader. There was no permanent hierarchy because there was, relatively speaking, no permanent scarcity. Needs were simple. Nature provided enough: not abundantly, but enough.

The transformation came with the agricultural revolution around ten thousand years ago. Stored grain created something new in human experience: the possibility of accumulated surplus. And with surplus came its shadow twin, artificial scarcity: the condition in which resources exist in sufficient quantity but are distributed unequally through ownership and control.

The first leader who said 'this is mine' did not solve scarcity. He manufactured it.

Rousseau (paraphrased), Discourse on Inequality, 1755

Jean-Jacques Rousseau saw this with unusual clarity in 1755. His Discourse on Inequality argued that human beings in their natural state were essentially content. Needs were simple, and there was no systematic domination. The fall came when someone enclosed a piece of land and declared ownership, and others accepted the fiction. From that moment, leaders emerged not to solve natural scarcity but to protect and perpetuate artificial scarcity, i.e. to maintain the conditions that justified their own authority.

This is a darker reading of leadership than most civics textbooks provide. But it is a consistent one. Almost every function we associate with leaders, such as adjudicating property disputes, organizing armies, taxing and redistributing, and managing information asymmetries, traces directly to the management of scarce resources or the conflict that scarcity generates.

A Prediction Nobody Remembers

1930 Keynes writes his forecast

In 1930, at the very depth of the Great Depression, the economist John Maynard Keynes published a short essay that almost nobody read at the time and almost nobody remembers now. It was called Economic Possibilities for our Grandchildren, and it contained one of the most audacious forecasts in the history of economic thought.

While the world economy was collapsing around him, Keynes told readers not to confuse a cyclical crisis for the long-run trend. Step back far enough, he argued, and the trajectory was unambiguous: productive capacity per person had been compounding for generations. Within one hundred years, which puts us at 2030, the "economic problem," meaning the struggle for subsistence, would be solved. Not alleviated. Solved. He predicted living standards in advanced economies would be four to eight times higher than in 1930. On the material side of his ledger, he was roughly correct.

But the radical heart of the essay was what he thought would follow from this. He predicted the working week would shrink to perhaps fifteen hours. Not because of idleness, but because fifteen hours of labor would genuinely be sufficient to sustain abundant lives. The rest would be leisure. And here his anxiety surfaces in the prose, because he was not celebratory about this. He worried that humans had been so thoroughly shaped by the discipline of scarcity that they would not know what to do with themselves in its absence.

The Keynes Problem

Keynes predicted material abundance with remarkable accuracy. What he underestimated was how powerfully existing hierarchies would capture and redirect that abundance, and how deeply the psychology of scarcity was embedded in human identity. Productivity gains arrived on schedule. The fifteen-hour work week did not.

He described what he called "purposiveness," the orientation toward future goals over present experience that scarcity conditions produce, as a useful neurosis. Civilization needed it to pass through the scarcity phase. But it would need to be shed once that phase ended. Fail to shed it, and people would generate artificial purposes, artificial hierarchies, artificial scarcities simply to maintain the familiar structure of striving.

That failure to shed it is largely what happened. The abundance arrived, and was immediately captured by expanding desires, by concentrated ownership, by the managerial and political class that Keynes assumed would become unnecessary. Rather than the state withering away, institutions found new rationales for their authority. The leaders did not dissolve into leisure alongside everyone else. They found new scarcities to manage.

The Anarchist Tradition Had It Right

Keynes was not alone in this structural diagnosis, though he arrived at it from a very different direction. A generation earlier, the Russian anarchist Peter Kropotkin had argued in Mutual Aid that cooperation, not competition, was the dominant force in both nature and human history, and that in conditions of genuine shared abundance, hierarchical leadership would become not just unnecessary but actively parasitic.

Kropotkin's crucial move was the same as Rousseau's, stated more bluntly: scarcity is largely manufactured by the ownership structures that leaders enforce, not a natural condition requiring management. Remove the leaders, redistribute the abundance, and the entire philosophical justification for authority collapses. What remains is voluntary coordination: people organizing themselves around shared tasks without anyone claiming the permanent right to direct others.

Marx made the same argument through different machinery. His endpoint, full communism, is explicitly a post-scarcity condition. In that terminal state, he wrote, the state withers away because there is nothing left for it to manage. Leaders and governments are, in his framework, instruments for managing class conflict over scarce resources. Eliminate scarcity through technological development and collective ownership, and the entire apparatus becomes superfluous.

That no Marxist state ever reached this destination is a separate conversation. What matters here is the theoretical logic, which is precise: scarcity is the engine, hierarchy is the machine the engine drives, and without the engine the machine has no purpose.

What Remains Without Scarcity

Before accepting this conclusion too quickly, it is worth asking what would remain of leadership if scarcity were genuinely eliminated. The honest answer is: something, but less than we might suppose.

Pure coordination problems exist independent of scarcity. Even in conditions of perfect material abundance, groups face the challenge of synchronizing action: deciding which direction to go, when to act, how to sequence collective effort. These are not about managing competing claims on scarce resources. They are about the mathematics of group decision-making: any time more than two people have potentially different preferences, some mechanism is needed to aggregate those preferences into a single action.

But notice what that mechanism looks like in the absence of scarcity pressure. It looks less like a king, a CEO, or a general, and more like a protocol. A shared norm. A voting procedure. The coordination residue of post-scarcity society might not deserve the name "leadership" at all. It might simply be coordination, which is a very different thing.

Leadership as we know it may be a scarcity-adapted institution that has outlived the conditions that made it necessary.

The anthropologist David Graeber, in his final book The Dawn of Everything (written with the archaeologist David Wengrow), documented societies that maintained deliberate mechanisms to prevent permanent leadership from emerging, such as seasonal leadership, ritual humiliation of chiefs, and voluntary dispersal when hierarchy became too rigid. The book's argument was that hierarchy is a choice, not an inevitability, and that many human societies have at various points chosen otherwise. The story that we have always needed leaders is itself, it suggested, a story that leaders tell.

The Rupture That Changes Everything

This is the point at which a nearly century-old essay about economic possibilities becomes urgently contemporary.

Keynes assumed abundance would arrive gradually, as industrial productivity continued to compound steadily across generations. What he did not anticipate was a discontinuous jump: a technology that does not incrementally improve labor productivity, but potentially replaces the need for labor across entire categories of cognitive work, almost simultaneously, within a compressed timeframe.

1930

Keynes publishes his forecast. Predicts the economic problem solved by 2030 via gradual compounding of industrial productivity.

2022–2024

Large language models demonstrate capability across most categories of cognitive work. The gradual path becomes a potential cliff edge.

2025–2026

AI lab leaders begin publishing explicit post-scarcity forecasts. Amodei's Machines of Loving Grace predicts compression of a century of scientific progress into a decade.

2030

Keynes' original target date. The destination may arrive approximately on schedule, but via a road he could not have imagined.

If you take seriously what people like Dario Amodei are actually arguing, not the hedged public statements but the underlying conviction, the claim is that we are within years of systems capable of performing most economically valuable cognitive work at near-zero marginal cost. Amodei has written explicitly about compressing a century of scientific and economic progress into roughly a decade. Sam Altman has gestured at civilizational transformation on similar timescales.

If that trajectory is even directionally correct, Keynes was not wrong. He was describing a destination that turns out to be reachable by a faster, stranger road than he imagined. His 2030 prediction may prove accurate almost to the year, while being entirely wrong about the mechanism.

This is not entirely theoretical. The early tremors of this structural shift are already visible inside organizations operating today. Middle management layers, whose primary function has always been information relay, coordination, and resource allocation across hierarchical tiers, are being cut across industries at a pace that would have seemed implausible a decade ago. Teams are shrinking. Reporting structures are flattening. The organizational pyramid that defined the modern corporation for a century is losing floors.

More telling is what is happening to specialization. For most of human history, specialized knowledge was a primary source of individual power and, by extension, a justification for hierarchy. The lawyer, the engineer, the financial analyst, the data scientist occupied positions of authority partly because they possessed capabilities others simply did not have. AI is eroding that boundary rapidly. Skills that once required years of training to acquire can now be approximated in minutes by someone with no formal background and the right tool. When specialization stops being scarce, the hierarchies built around controlling access to specialized knowledge lose their rationale alongside it.

What is happening inside companies is a preview of a much larger structural question. Organizations are not shedding management layers because they have decided hierarchy is philosophically unjustified. They are shedding them because those layers are no longer functionally necessary. The economic logic that created them has changed. That same logic, operating at civilizational scale, is what Keynes was pointing at and what the most serious AI forecasters believe is now accelerating toward its conclusion.

The Question of Capture

But this is where the argument reaches its sharpest and most uncomfortable edge. Abundance delivered by artificial intelligence does not automatically mean distributed abundance. The critical question, and the one that Rousseau would recognize immediately, is who owns and controls the systems that produce it.

If a small number of companies or individuals control the infrastructure that generates post-scarcity conditions, then scarcity-based hierarchy has not been eliminated. It has been concentrated to a degree without historical precedent. Rather than a million leaders managing localized scarcity, you have a handful of individuals managing the systems that produce everything. The entire apparatus of leadership (its justification, its function, its claim on obedience) collapses upward into a single point of control.

This is not a hypothetical concern. The same dynamic that turned agricultural surplus into feudalism, and industrial surplus into plutocracy, is already visible in the ownership structure of the systems being built. The logic is identical. Scarcity is not eliminated; it is repackaged. Access to abundance becomes the new scarce resource, and the leaders of that access inherit all the authority that scarcity has always generated.

The Recurring Pattern

Agricultural surplus → feudal hierarchy. Industrial surplus → plutocratic hierarchy. AI surplus → ? The technology changes. The capture dynamic has proven remarkably stable across all three transitions. The question is whether this one is different enough to break it.

Altman himself has implicitly acknowledged this with his advocacy for Universal Basic Income and OpenAI's original nonprofit structure, a recognition that without deliberate redistribution mechanisms, the abundance his systems might produce would simply reconcentrate at the top. The leaders would not wither away. They would become more powerful, not less, precisely because they control the systems that have made all other scarcity manageable.

An Ending Without a Conclusion

History has tested the scarcity-leadership connection repeatedly without ever fully breaking it. Every technology that promised to dissolve hierarchy, from the printing press and the steam engine to electrification and the internet, instead generated new hierarchies organized around control of the technology itself. The same dynamic may play out with AI.

But there is something qualitatively different about a technology that can replace cognitive labor across essentially all domains. Previous technologies replaced physical labor in specific sectors while creating new categories of cognitive work. The demand for human participation in the economy shifted but did not disappear. If AI eliminates the demand for cognitive labor as comprehensively as previous technologies eliminated the demand for physical labor, without creating new categories of work to absorb the displaced, then the economic basis for most leadership simply ceases to exist.

Keynes thought this moment would come gradually, giving human psychology and institutions time to adapt. He was wrong about the pace. The institutions built on scarcity logic are already struggling to adapt to a world that is merely becoming more automated, let alone one that reaches genuine post-scarcity conditions.

The question is whether we get a post-scarcity world at all as the traditional need for leadership fades. Every previous surplus in human history has been captured, whether by feudal lords, by industrialists, or by financiers, and redirected. The realistic risk is not the dissolution of hierarchy but its opposite: a small group of people controlling systems that can produce everything, governing a world that finally has no material reason to accept being governed.

Keynes was right about the destination. The only question is whether a handful of people will control the road to get there.

billparker.ai