A few thoughts on DeepSeek
Here’s a (slightly longer than intended) video explaining why DeepSeek isn’t as much of a surprise as the stock market’s reaction suggests.
TL;DW: DeepSeek applied three well-known techniques together to good effect: predicting multiple tokens in one go, quantisation (using smaller numbers to represent concepts) and a Mixture-of-Experts approach. All of these have been done by other labs; all DeepSeek did was put them together first. If it hadn’t been them, someone else would have done it a few weeks later (as shown by OpenAI releasing o3-mini just a fortnight after).
Here’s a lightly edited transcript (produced by AI) in case you don’t want to watch the full 8 mins or so:
“Last night, I got a message from one of our instructors asking for my opinion on the whole DeepSeek hoo-ha that’s been taking place over the last 24 hours. American tech stocks have largely collapsed based on the fear that AI is much easier than we expected. A small Chinese startup was able to train a model almost as good as the best that OpenAI can provide for just five million dollars, instead of the hundreds of millions that the existing generation of models have cost.
The thing that is most surprising to me is actually how many people have been surprised by this. DeepSeek is impressive, but it was also inevitable. The techniques they used to achieve this are well-known and well-discussed techniques that have been available and experimented with by researchers at various AI labs for some time. They simply put them all together and used them to train a model, and it worked—good on them. But this is how progress happens.
The three main techniques they used are as follows:
1. Token Efficiency
For those of you who have attended our classes, I always tell you to think about individual words as tokens. When a large language model predicts the next word, it’s actually predicting tokens, which are parts of words. This gives a lot more flexibility in what the model can produce but also increases the computational cost. For example, if a word requires three tokens, each token needs a full run of the model (a forward pass). If, instead, the model can predict the whole word as a single token, it reduces the number of calculations needed by a factor of three.
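As a back-of-the-envelope illustration, here is what that factor-of-three saving looks like in a few lines of Python. The reply length and tokens-per-pass figures are made-up numbers for the sake of the arithmetic, not anything from DeepSeek’s paper.

```python
# Rough arithmetic for multi-token prediction. Illustrative only: the
# reply length and tokens-per-pass figures are assumptions, not DeepSeek's.
from math import ceil

def forward_passes(total_tokens: int, tokens_per_pass: int) -> int:
    """Forward passes needed to emit total_tokens, committing
    tokens_per_pass tokens on each pass."""
    return ceil(total_tokens / tokens_per_pass)

reply_length = 300                        # assumed length of a typical answer
single = forward_passes(reply_length, 1)  # classic next-token decoding
multi = forward_passes(reply_length, 3)   # predicting three tokens at a time

print(f"one token per pass:    {single} passes")
print(f"three tokens per pass: {multi} passes (~{single / multi:.0f}x fewer)")
```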
This approach can be pushed even further. Meta’s AI research team has described the Byte Latent Transformer (BLT), an approach where token-like patches are formed dynamically at runtime rather than drawn from a predefined vocabulary. This could allow entire common phrases, like “What’s up?”, to be treated as a single token, making these systems significantly more efficient.
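To make the patching idea a little more concrete, here is a deliberately crude toy: it just merges a hard-coded list of common phrases into single patches, whereas the real BLT learns where to place patch boundaries from the data itself. Treat it as a sketch of the intuition only.

```python
# Toy sketch of dynamic patching, loosely in the spirit of Meta's BLT work.
# The real system learns patch boundaries; here a hard-coded phrase list
# stands in for that, purely to show why "What's up?" can be one patch.

COMMON_PHRASES = ["what's up?", "thank you", "of course"]  # assumed examples

def patch(text: str) -> list[str]:
    """Split text into patches: known phrases become a single patch,
    everything else falls back to one patch per word."""
    lowered = text.lower()
    for phrase in COMMON_PHRASES:
        if lowered.startswith(phrase):
            rest = text[len(phrase):].strip()
            return [text[:len(phrase)]] + (patch(rest) if rest else [])
    head, _, tail = text.partition(" ")
    return [head] + (patch(tail) if tail else [])

print(patch("What's up? Not much, thank you"))
# ["What's up?", 'Not', 'much,', 'thank you']
```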
2. Quantization
Tokens are represented by vectors: long lists of numbers, typically stored as 32-bit or 16-bit floating-point values. Quantization is the process of making those numbers shorter while retaining the necessary information. DeepSeek has successfully trained with 8-bit numbers, which proved just as effective as the larger formats previous models used. This suggests that existing models have been using overly long floating-point numbers. By shortening them, future models will reduce their computational load, making predictions more efficient.
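Here is a minimal sketch of the general idea using plain int8 rounding. DeepSeek’s actual training uses an 8-bit floating-point format with more careful scaling, so read this as the principle (fewer bits per number, most of the information kept) rather than their recipe.

```python
# Minimal sketch of 8-bit quantization of a vector of model weights.
# Illustrative only: real systems (including DeepSeek's 8-bit training)
# use floating-point 8-bit formats and finer-grained scaling.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=1024).astype(np.float32)   # pretend model weights

# Symmetric linear quantization: map the fp32 range onto int8 [-127, 127].
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize and see how much information survived the rounding.
restored = q.astype(np.float32) * scale
max_err = np.abs(weights - restored).max()

print(f"memory: {weights.nbytes} bytes fp32 -> {q.nbytes} bytes int8")
print(f"worst-case rounding error: {max_err:.5f} (scale = {scale:.5f})")
```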
3. Mixture of Experts
Mixture of Experts (MoE) is a technique where smaller models specialize in different tasks, such as coding, poetry, or summarization. A central model directs user requests to the appropriate expert, reducing computational effort. If a user asks for a limerick, the system activates only the poetry expert, not the coding expert, making the process far more efficient. GPT-4 was rumored to use MoE, and DeepSeek has now made its implementation visible due to its open-source nature.
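As a toy illustration of the routing idea (random weights, tiny dimensions, nothing to do with DeepSeek’s actual architecture), here is a top-1 gate in NumPy. Real MoE layers sit inside a transformer and route every token, often to more than one expert, but the efficiency argument is the same: only the chosen expert’s parameters do any work.

```python
# Minimal sketch of Mixture-of-Experts routing with top-1 gating.
# Toy dimensions and random weights, purely to show the mechanism.
import numpy as np

rng = np.random.default_rng(1)
d_model, n_experts = 16, 4

# Each "expert" is just a small feed-forward matrix here.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts))    # gating network

def moe_forward(x: np.ndarray) -> tuple[np.ndarray, int]:
    """Route one token's hidden state to its single best expert."""
    logits = x @ router
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()                            # softmax over experts
    chosen = int(np.argmax(gate))                 # top-1 routing
    # Only the chosen expert's weights are used; the others stay idle.
    return gate[chosen] * (x @ experts[chosen]), chosen

token_state = rng.normal(size=d_model)
out, which = moe_forward(token_state)
print(f"token routed to expert {which}; output shape {out.shape}")
```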
What is most interesting about DeepSeek is not just that they used these techniques but that they published their work openly. This transparency benefits the broader AI community and is bound to drive a wave of fresh innovation. DeepSeek has achieved a roughly 95% reduction in cost per token while maintaining similar capability levels to leading models. If there’s one thing we know about AI, it’s that as it becomes cheaper, usage increases. This breakthrough will likely lead to a surge in adoption.
There’s also a geopolitical element to this discussion. Some have raised concerns about China developing an advanced AI model, particularly because it may not discuss topics like Tiananmen Square. In the US, this is seen as a significant issue, but the reality is that all language models are cultural artefacts. They are trained on vast datasets and mirror the biases and values embedded within that data.
For instance, most AI models trained on the internet inherently reflect American cultural norms because a large portion of online content is in English and written by Americans. Even in the UK, we notice this—by default, ChatGPT responds in American English. This isn’t an explicit design choice but rather a result of the training data. Many in the US may not have considered this before, but it’s a reality that we all have to adapt to: language models are shaped by cultural and political influences.
A few weeks ago, when DeepSeek first appeared, I wrote a post asking whether language models are the new World Service. The UK has historically had significant soft power through global news services that present a British perspective. Our values—such as fairness and factual accuracy—are embedded in these broadcasts. Large language models could play a similar role in the future. Countries with the capability will likely develop their own models, both to insulate themselves from external influences and to promote their own worldview.
At the end of the day, every technology is a human product, shaped by the people who build it. This has always been the case and will continue to be. Anyway, those are my thoughts—let me know yours.”