[2025-10-01 Wed 05:19] Sparse attention
DeepSeek tests “sparse attention” to slash AI processing costs - Ars Technica
The original transformer approach was never really intended for long contexts like, say, a book or a large code project; full attention compares every token against every other token, so the cost grows quadratically with sequence length. The irregularities and “hallucinations” I see when trying to use an LLM in those contexts make sense. They’re inevitable with the approach, like the weirdness of floating-point math.
So here’s sparse attention.
Sparse attention works differently. Instead of checking every word against every word, it only examines a subset of word relationships that the model determines are most relevant. For example, when processing word number 5,000 in a document, the model might only check its relationship with 100 carefully selected earlier words rather than all 4,999 preceding words.
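Here’s a toy sketch of that idea in NumPy, assuming a simple top-k selection rule. This is not DeepSeek’s actual algorithm (they describe a cheap “lightning indexer” for picking positions), and the names `sparse_attention` and `top_k` are just mine for illustration:

```python
# Toy top-k sparse attention. Captures the idea from the article, not
# DeepSeek's method: to stay short, this version still scores every
# earlier position before picking the top k, so it saves no compute.
# A real system selects the k positions cheaply without scoring them all.
import numpy as np

def sparse_attention(q, k, v, top_k=100):
    """Each query attends only to its top_k best-scoring earlier positions."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        scores = k[: i + 1] @ q[i] / np.sqrt(d)       # causal: positions 0..i
        keep = min(top_k, i + 1)
        idx = np.argpartition(scores, -keep)[-keep:]  # indices of top-k scores
        w = np.exp(scores[idx] - scores[idx].max())   # softmax over kept only
        w /= w.sum()
        out[i] = w @ v[idx]                           # weighted sum of kept values
    return out

# The article's example: word 5,000 attends to ~100 selected earlier
# words instead of all 4,999 preceding ones.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5000, 64)) for _ in range(3))
out = sparse_attention(q, k, v, top_k=100)
print(out.shape)  # (5000, 64)
```

Back-of-the-envelope: at 5,000 tokens, full attention aggregates over roughly 12.5 million query-key pairs, while top-100 attention aggregates over about 500,000. The hard part, which the sketch above dodges, is choosing those 100 positions without paying the full quadratic cost.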
If DeepSeek’s technique holds up in wider testing, the sparse attention approach could make it much less expensive to train and run models that handle long contexts.