How DeepSeek Rewrote the Transformer

January 30, 2025

The original Transformer was a marvel, truly. It triggered a profound revolution in the field, fundamentally altering how machines could comprehend the intricacies of human language and generate remarkably coherent, contextually aware text. But as these models were asked to handle more and more information, longer articles, and entire conversations, a fundamental challenge became increasingly apparent. The way the Transformer paid "attention" to different parts of the text, allowing it to understand context, had a computational cost that grew incredibly fast. For a sequence of length n (representing, for example, the number of words or tokens in an input), the number of attention calculations scaled quadratically. This is often expressed as:

\begin{equation*} \mathcal{O}(n^2) \end{equation*}

So, if you doubled the length of the text, the computational work didn't just double; it roughly quadrupled! This complexity was a serious bottleneck, especially for long sequences.
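To make that scaling concrete, here is a minimal NumPy sketch of vanilla attention scores (an illustration, not code from any DeepSeek release); the n × n score matrix is exactly where the quadratic cost lives:

```python
import numpy as np

def attention_weights(Q, K):
    """Vanilla attention weights: the (n, n) score matrix is the quadratic term."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # shape (n, n)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

for n in (512, 1024, 2048):
    Q = np.random.randn(n, 64)
    K = np.random.randn(n, 64)
    print(n, attention_weights(Q, K).shape)  # doubling n quadruples the entries
```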

To cope with the computational demands and avoid redundant calculations, engineers developed something called KV caching. "K" and "V" (Keys and Values) are crucial pieces of information the model extracts from the text it has already processed. By caching these Keys and Values for each token, the model wouldn't have to recompute them every single time it needed to generate a new token, which was a smart move for speeding up inference. However, this led to a new issue: the size of this cache. For a sequence of length n and a model dimension d, the KV cache grew linearly with the sequence length, requiring memory proportional to:

\begin{equation*} \mathcal{O}(n \cdot d) \end{equation*}

So, longer texts meant massive memory requirements. It was like trying to keep an ever-expanding set of notes on your desk; eventually, you run out of space, no matter how efficiently you write them. The memory footprint itself became a limiting factor.
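As a rough illustration of that growth (with made-up dimensions, not any specific DeepSeek configuration), here is how a single layer's cache balloons with context length:

```python
import numpy as np

class KVCache:
    """Toy per-layer KV cache: memory grows linearly with the number of tokens
    processed, i.e. O(n * d) floats for Keys plus the same again for Values."""
    def __init__(self, d):
        self.d = d
        self.keys, self.values = [], []

    def append(self, k, v):            # called once per generated token
        self.keys.append(k)
        self.values.append(v)

    def memory_bytes(self, dtype=np.float16):
        n = len(self.keys)
        return 2 * n * self.d * np.dtype(dtype).itemsize  # K and V

cache = KVCache(d=4096)
for _ in range(8192):                  # e.g. an 8k-token context
    cache.append(np.zeros(4096, np.float16), np.zeros(4096, np.float16))
print(cache.memory_bytes() / 1e6, "MB for one layer")  # ~134 MB, before multiplying by the layer count
```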

DeepSeek's Stroke of Genius: Leaner Memory, Faster Thoughts

This is where DeepSeek AI really rethought the playbook with their Multi-Head Latent Attention (MLA). Instead of just managing this ballooning KV cache, which scaled as $\mathcal{O}(n \cdot d)$, they asked a more fundamental question: can we make the "notes" themselves, the Keys and Values, far more compact and efficient from the get-go?

The core idea of MLA is elegant. It introduces a sophisticated compression mechanism. The original Keys (K) and Values (V) from the input sequence of length n are compressed into a much smaller set of k "latent" key-value pairs, where k≪n (k is much smaller than n). This means the KV cache required for these compressed, latent representations now scales as:

\begin{equation*} \mathcal{O}(k \cdot d) \end{equation*}

This effectively decouples the cache size from the input sequence length n, drastically reducing the size of the active KV cache the model needs to keep in its memory and mitigating the memory problem.
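To ground the idea, here is a minimal NumPy sketch of this kind of latent key-value compression as described above; the shapes, the learned latent queries, and the single-head setup are all illustrative assumptions, not DeepSeek's actual parameterization (which, among other details, handles rotary position embeddings separately):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def latent_kv_attention(X, W_q, W_k, W_v, latents):
    """Sketch of the idea above: squeeze n tokens' Keys/Values down to k latent
    key-value pairs (k << n), then attend only to those k pairs.
    Everything here is illustrative, single-head, and un-optimized."""
    n, d = X.shape
    k = latents.shape[0]

    # Step 1: compression -- each of the k latent slots summarizes the sequence.
    K_full, V_full = X @ W_k, X @ W_v                    # (n, d)
    mix = softmax(latents @ K_full.T / np.sqrt(d))       # (k, n)
    K_lat, V_lat = mix @ K_full, mix @ V_full            # (k, d): this is what gets cached

    # Step 2: regular attention, but against k latent pairs instead of n tokens.
    Q = X @ W_q                                          # (n, d)
    attn = softmax(Q @ K_lat.T / np.sqrt(d))             # (n, k), not (n, n)
    return attn @ V_lat                                  # (n, d)

n, d, k = 1024, 64, 16
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
latents = rng.standard_normal((k, d))
out = latent_kv_attention(X, W_q, W_k, W_v, latents)
print(out.shape)  # (1024, 64), while the cached K/V are only (16, 64) each
```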

But MLA isn't just about shrinking data. It also cleverly reorganizes the way the model processes queries (when it's figuring out what to focus on next). By performing attention operations with respect to these k latent vectors, the cost of the attention-score computation itself drops from the original $\mathcal{O}(n^2)$ to roughly $\mathcal{O}(n \cdot k)$, depending on the specifics of the attention variant. MLA also essentially pre-bakes some of the computational steps by absorbing certain projection weights directly into the query calculations. This restructuring avoids a chunk of extra computation that would otherwise be needed during inference, the stage where the model is actually generating text or providing answers.
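The "pre-baking" can be seen in one line of algebra. If x is a query token's representation and C holds the cached latent representations from which Keys are formed as C W_k, the score computation can fold W_k into the query side, so inference only ever touches the compact cache (a sketch of the general weight-absorption idea, not the paper's exact parameterization):

\begin{equation*} (x W_q)\,(C W_k)^{\top} \;=\; x \,\big(W_q W_k^{\top}\big)\, C^{\top} \end{equation*}

The product $W_q W_k^{\top}$ can be computed once and reused across decoding steps, which is the kind of saving MLA exploits at inference time.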

The beauty of MLA is that it allows the model to learn how to compress more effectively and, importantly, share information between its multiple "attention heads" (the different parts of the model that simultaneously focus on various aspects of the text). It's not just about doing things with less memory; it's about being smarter with how information is represented and processed. The result is a Transformer that can handle much longer inputs without breaking the memory bank, generate tokens significantly faster (some benchmarks suggest up to roughly six times the throughput), and run more efficiently all around. DeepSeek didn't just tweak the Transformer; they gave it a more agile and resourceful way to think.

Reference:

DeepSeek-AI. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv:2405.04434, 2024.