DeepSeek has recently emerged as a significant player in the AI landscape, challenging the dominance of established AI companies with its innovative models and efficient approach to training.
This article will focus on DeepSeek and its latest reasoning model, R1.
DeepSeek’s Breakthroughs
DeepSeek’s rise to prominence is marked by several key breakthroughs that have shaken the AI community.
The company’s models have demonstrated both high performance and cost-effectiveness, leading to a reevaluation of established norms in the industry.
Here are some key innovations:
- DeepSeekMoE: This refers to a “mixture of experts” model, where the model is divided into multiple “experts” and only the necessary ones are activated for a given task (see the routing sketch after this list). DeepSeek’s implementation includes finer-grained specialized experts alongside shared experts with more generalized capabilities. They also introduced new approaches to load balancing and routing during training, which made training more efficient.
- DeepSeekMLA: This innovation, short for multi-head latent attention, compresses the key-value cache, drastically reducing memory usage during inference. This is significant because the key-value cache grows with the context window, making long contexts very expensive in terms of memory (see the arithmetic after this list).
- Efficient Training: DeepSeek’s V3 model was shockingly cheap to train, costing only $5.576 million. This was achieved through optimized co-design of algorithms, frameworks, and hardware, along with several other factors: a new approach to load balancing, multi-token prediction during training, activating only 37 billion of its 671 billion parameters per token, and using FP8 precision for calculations. The training took 2.788 million H800 GPU hours (see the cost arithmetic after this list).
- Optimization for H800 GPUs: DeepSeek optimized its model and infrastructure around the constrained memory bandwidth of the H800 GPUs, which are less powerful than the H100s used by many US labs. They programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications using PTX, a low-level instruction set for Nvidia GPUs. This approach allowed them to achieve remarkable results despite the limitations imposed by the US chip ban.
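To make the mixture-of-experts idea concrete, here is a toy top-k routing layer in PyTorch. It is a minimal sketch of the general mechanism, not DeepSeek’s implementation; all names and sizes below are illustrative.

```python
# Toy sketch of top-k mixture-of-experts routing (illustrative, not DeepSeek's code).
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, dim=512, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                                 # x: (tokens, dim)
        scores = self.router(x).softmax(dim=-1)           # routing probabilities
        weights, idx = scores.topk(self.k, dim=-1)        # keep only k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):                        # run each selected expert
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```

DeepSeek’s variant additionally splits experts more finely and routes every token through a set of shared experts, but the core idea is the same: each token only pays for the k experts it activates.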
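The memory pressure that multi-head latent attention relieves is easy to see with back-of-the-envelope arithmetic. The dimensions below are illustrative assumptions, not DeepSeek’s actual configuration:

```python
# KV-cache arithmetic with illustrative (not DeepSeek's) dimensions.
n_layers = 60
n_heads = 64
head_dim = 128
latent_dim = 512        # hypothetical compressed latent per token
bytes_per_value = 2     # fp16/bf16

# Standard attention caches a full key and value vector per head, per layer.
kv_standard = n_layers * n_heads * head_dim * 2 * bytes_per_value
# Latent attention instead caches one small latent per layer, from which
# keys and values are reconstructed at attention time.
kv_latent = n_layers * latent_dim * bytes_per_value

print(f"standard KV cache: {kv_standard / 1024:.0f} KiB per token")  # 1920 KiB
print(f"latent KV cache:   {kv_latent / 1024:.0f} KiB per token")    # 60 KiB
```

That per-token difference gets multiplied by every position in the context window, which is where the savings come from at long context lengths.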
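The headline training-cost figure follows directly from the reported GPU hours and the $2-per-GPU-hour rental rate assumed in the DeepSeek-V3 technical report:

```python
# Reproducing the headline training cost from the reported numbers.
gpu_hours = 2.788e6           # H800 GPU hours reported for V3
dollars_per_gpu_hour = 2.0    # rental rate assumed in the V3 report
print(f"${gpu_hours * dollars_per_gpu_hour / 1e6:.3f}M")  # -> $5.576M
```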
The DeepSeek R1 Model
DeepSeek R1 is a reasoning model that has garnered attention for its competitive performance compared to OpenAI’s o1 model.
It is the product of innovative training techniques and demonstrates the potential for AI models to achieve high levels of reasoning capability.
- R1’s Capabilities: R1 can think through problems, providing high-quality results in areas such as coding, math, and logic.
- Open Weights: Unlike many other leading models, R1 has open weights, allowing users to run the model on their own servers or locally at dramatically lower costs (see the sketch after this list). This makes the model very accessible.
- R1-Zero: The R1 model is actually part of a two-model release, the other being R1-Zero. R1-Zero was developed using pure reinforcement learning (RL), with no human feedback, and achieved remarkably strong results on reasoning benchmarks. The model was given a set of math, code, and logic questions, along with two reward functions: one for producing the right answer, and one for using the right format, which required showing a thinking process (see the sketch after this list).
- Aha Moment: An intriguing phenomenon observed during the training of R1-Zero is the “aha moment”, where the model learns to allocate more thinking time to a problem by reevaluating its initial approach. This is a demonstration of the power of reinforcement learning.
- R1 Training: R1 was trained to address R1-Zero’s issues with readability and language mixing, incorporating a small amount of cold-start data and a multi-stage training pipeline. The process starts from the DeepSeek-V3-Base model, fine-tunes it on cold-start data, applies reasoning-oriented RL as in R1-Zero, creates new SFT data through rejection sampling combined with supervised data (see the sketch after this list), and finishes with an additional RL stage. The result is a model with performance on par with OpenAI-o1-1217.
- Chain-of-thought Reasoning: The model was also trained on examples of chain-of-thought reasoning so it could learn a format suitable for human consumption, alongside reinforcement learning to enhance its reasoning. This addresses a limitation of R1-Zero, which reasons in a way that humans have trouble understanding.
- Distillation: It is likely that DeepSeek benefited from distillation in the training of R1, where understanding is extracted from another model. Distillation is widespread in the AI industry and is assumed to be a factor in models converging on GPT-4o quality.
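Because the weights are open, running R1 locally is straightforward. Below is a minimal sketch using the Hugging Face transformers library with one of the smaller distilled R1 checkpoints; the model ID and prompt are illustrative, and you would substitute whatever size fits your hardware.

```python
# Minimal local-inference sketch using Hugging Face transformers.
# Requires torch and accelerate installed; model ID is an illustrative
# distilled checkpoint, not the full 671B-parameter R1.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "How many prime numbers are there between 1 and 20?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```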
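To illustrate the R1-Zero reward setup, here is a sketch of the two rule-based rewards in Python. The `<think>`/`<answer>` tag convention follows the format described for R1-Zero, but the exact matching logic here is an assumption:

```python
# Sketch of R1-Zero's two rule-based rewards; tag handling is an assumption.
import re

def format_reward(completion: str) -> float:
    """Reward emitting the expected <think>...</think><answer>...</answer> layout."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """Reward a verifiably correct final answer (e.g. a math result)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0
```

Note that neither reward requires a human in the loop: both are checkable by rule, which is what makes pure RL on math, code, and logic questions feasible.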
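The rejection-sampling step in R1’s pipeline can be sketched in a few lines: sample several completions per prompt and keep only the ones whose final answer verifies as correct. The `generate` and `is_correct` hooks below are hypothetical placeholders:

```python
# Sketch of rejection sampling for building SFT data; the generate and
# is_correct callables are hypothetical placeholders, not a real API.
def rejection_sample(prompts, generate, is_correct, n_samples=16):
    sft_data = []
    for prompt, reference in prompts:
        for _ in range(n_samples):
            completion = generate(prompt)           # sample from the RL checkpoint
            if is_correct(completion, reference):   # keep only verified answers
                sft_data.append({"prompt": prompt, "completion": completion})
    return sft_data
```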
Implications for the AI Landscape
DeepSeek’s emergence and the release of the R1 model have significant implications for the AI landscape and the tech industry:
- Model Commoditization: DeepSeek’s efficient training and open-source approach could lead to the commoditization of AI models. This also means that the cost of inference is likely to fall, which could benefit many companies in the tech industry.
- Nvidia’s Position: DeepSeek’s efficiency and optimization of H800s could cast doubt on the most optimistic Nvidia growth story. Nvidia’s dominance has been based on CUDA and its ability to network many chips together, but DeepSeek has demonstrated that heavy optimization can produce great results on less powerful hardware with lower memory bandwidth.
- Chip Ban Implications: DeepSeek’s innovations were likely a direct result of the constraints imposed by the chip ban, which forced the company to optimize its models around H800s.
- Open-Source Approach: DeepSeek’s commitment to open source is a key part of its strategy. The company believes open source attracts talent and is key to innovation. It also means that the company’s value rests on its team rather than on closed-source technology.
- The Bitter Lesson: R1-Zero is an affirmation of Rich Sutton’s essay The Bitter Lesson: you do not need to teach an AI how to reason; given enough compute and data, it will teach itself.
Conclusion
DeepSeek and its R1 model represent a significant leap forward in AI development.
The company’s focus on efficiency, innovation, and open-source principles challenges the status quo and could lead to a more accessible, affordable, and competitive AI landscape.
While many questions remain unanswered and the future is uncertain, DeepSeek’s breakthroughs are a clear signal that the AI field is undergoing a period of intense change.