Thoughts and Research on DeepSeek (by Archerman Capital)


With DeepSeek gaining traction recently, we at Archerman Capital have done some research and reflections to share with everyone. Enjoy! The gray sections are technical details—you can skip them if you're not interested.

A Few Facts:

  1. DeepSeek is not a shell or distilled version of an American model.
    1. Although some Chinese large models are shell-style distillations, DeepSeek is not.
  2. Its core architecture is still based on the Transformer, but DeepSeek has made innovative engineering designs and optimizations.
    1. At the architectural level, it adopts Mixture of Experts (MoE), Multi-Head Latent Attention (MLA), Multi-Token Prediction (MTP), Chain of Thought (CoT), the DualPipe design, and (in R1-Zero) a training approach that relies on Reinforcement Learning (RL) alone, without Supervised Fine-Tuning (SFT). In terms of precision, it uses FP8 mixed-precision training and has optimized low-level communication for efficiency. None of these techniques is brand new, but DeepSeek has applied them with careful, fine-grained engineering to make them practical to deploy. Details:

MoE: Mixture of Experts

The model is divided into multiple expert modules that work largely independently. During training, computation is spread across the experts for efficiency. At inference time, not all experts are activated; a routing (gating) network selects only a small subset for each token, so the 671B-parameter model activates only a fraction of its parameters at a time, reducing energy consumption and improving inference speed. Some experts handle specific kinds of inputs, while others focus on knowledge shared across modules. The routing controller decides which experts fire, optimizing resource usage.
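
To make the routing idea concrete, here is a minimal top-k MoE sketch in PyTorch. It is purely illustrative: the layer sizes, number of experts, and simple softmax gate are assumptions for the example, not DeepSeek's actual architecture (which also uses shared experts and load-balancing mechanisms).

```python
# Minimal top-k MoE routing sketch (illustrative, not DeepSeek's actual code).
# A gating network scores all experts per token; only the top-k experts run,
# and their outputs are combined, weighted by the renormalized gate scores.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)          # routing controller
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                     # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)              # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)        # keep only top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True) # renormalize gate weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):                # dispatch tokens to experts
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(16, 64)
print(TinyMoE()(tokens).shape)   # torch.Size([16, 64])
```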

MLA: Multi-Head Latent Attention

Extends traditional multi-head attention by compressing keys and values into a low-dimensional latent representation. Instead of caching full per-head keys and values, only the small latent vector is stored and expanded back when attention is computed, which reduces memory and computation during training and sharply shrinks key-value cache storage during inference.
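
The memory-saving idea can be sketched as a low-rank compression of keys and values. The sketch below is a simplified illustration under assumed dimensions; real MLA also handles rotary position embeddings and query compression, which are omitted here.

```python
# Sketch of latent (low-rank) KV compression in the spirit of MLA (illustrative).
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 256, 8, 32, 64

down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress hidden state to latent
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent -> per-head keys
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent -> per-head values
q_proj = nn.Linear(d_model, n_heads * d_head, bias=False)

h = torch.randn(1, 10, d_model)        # (batch, seq, d_model)
c_kv = down_kv(h)                      # only this (1, 10, 64) latent needs caching,
                                       # instead of full K and V of size (1, 10, 2*8*32)
q = q_proj(h).view(1, 10, n_heads, d_head).transpose(1, 2)
k = up_k(c_kv).view(1, 10, n_heads, d_head).transpose(1, 2)
v = up_v(c_kv).view(1, 10, n_heads, d_head).transpose(1, 2)
attn = torch.softmax(q @ k.transpose(-1, -2) / d_head**0.5, dim=-1) @ v
print(c_kv.shape, attn.shape)
```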

MTP: Multi-Token Prediction

Traditional LLMs are trained to predict one next token at a time, autoregressively. With MTP, each position also predicts several future tokens, which densifies the training signal and reduces context drift, logical redundancy, and repetitive errors; the extra predictions can also be used to speed up generation. This is especially useful for tasks such as code completion and summarization.
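
A minimal way to picture the objective: extra prediction heads at each position, supervised on tokens further ahead. The plain linear heads and dimensions below are illustrative stand-ins; DeepSeek-V3's MTP modules are small transformer blocks rather than linear layers.

```python
# Minimal sketch of a multi-token prediction objective: besides the usual
# next-token head, extra heads at each position predict tokens further ahead,
# densifying the training signal. (Illustrative, not DeepSeek's implementation.)
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, depth = 1000, 64, 2          # depth = extra future tokens to predict
hidden = torch.randn(4, 32, d_model)         # (batch, seq, d_model) from the trunk
tokens = torch.randint(0, vocab, (4, 32))    # target token ids

heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(1 + depth))

loss = 0.0
for k, head in enumerate(heads):             # head k predicts the token k+1 steps ahead
    shift = k + 1
    logits = head(hidden[:, :-shift])        # positions that have a target k+1 ahead
    target = tokens[:, shift:]
    loss = loss + F.cross_entropy(logits.reshape(-1, vocab), target.reshape(-1))
print(loss / len(heads))
```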

CoT: Chain of Thought

A reasoning approach that breaks a complex problem into smaller intermediate steps. DeepSeek trains on long CoT data, some of it model-generated and further refined by annotators with CoT expertise, to strengthen multi-step logical reasoning across multiple solution paths and encourage the "aha" moments observed during training.
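
As a generic illustration of chain-of-thought style prompting, the sketch below builds a prompt that asks for step-by-step reasoning inside tags and parses the final answer back out. The tag format and helper names are assumptions for the example, not DeepSeek's published training template.

```python
# Minimal chain-of-thought prompting sketch (generic illustration).
import re

def build_cot_prompt(question: str) -> str:
    # Ask the model to reason step by step before committing to an answer.
    return (
        "Solve the problem. Think step by step inside <think>...</think>, "
        "then give only the final result inside <answer>...</answer>.\n\n"
        f"Problem: {question}"
    )

def parse_answer(completion: str) -> str:
    # Pull the final answer out of the tagged completion.
    m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    return m.group(1).strip() if m else completion.strip()

prompt = build_cot_prompt("A train travels 120 km in 1.5 hours. What is its average speed?")
fake_completion = "<think>120 / 1.5 = 80</think><answer>80 km/h</answer>"
print(parse_answer(fake_completion))   # 80 km/h
```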

DualPipe:

Addresses pipeline-parallel inefficiency by cutting down idle "bubble" time. DeepSeek's DualPipe schedule feeds micro-batches through the pipeline from both ends and overlaps computation with communication, so a device that would otherwise be waiting for data can switch to other work, maximizing hardware utilization.
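
The spirit of the idea, overlapping communication with computation so hardware never sits idle, can be shown with a toy schedule: while one micro-batch's results are being transferred, the next micro-batch is already computing. This is only a conceptual simulation with sleeps standing in for real work, not DualPipe's actual bidirectional schedule.

```python
# Toy illustration of overlapping computation with communication to shrink
# pipeline "bubbles" (conceptual only; not DualPipe's real schedule).
import threading, time

def compute(micro_batch):                 # stand-in for forward/backward work
    time.sleep(0.1)
    print(f"computed micro-batch {micro_batch}")

def communicate(micro_batch):             # stand-in for pipeline/all-to-all transfer
    time.sleep(0.1)
    print(f"communicated micro-batch {micro_batch}")

start = time.time()
pending = None
for mb in range(4):
    compute(mb)                           # this compute overlaps the previous transfer
    if pending is not None:
        pending.join()
    pending = threading.Thread(target=communicate, args=(mb,))
    pending.start()
pending.join()
print(f"overlapped total: {time.time() - start:.2f}s vs ~0.80s if done serially")
```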

R1-Zero:

DeepSeek's model trained purely with Reinforcement Learning (RL), with no Supervised Fine-Tuning (SFT) stage. Starting from the base model and guided only by reward signals rather than labeled reasoning traces, R1-Zero develops reasoning behaviors on its own, opening new possibilities. The full R1 model then adds a small cold-start SFT stage and further optimization, reaching quality on par with models trained through conventional SFT pipelines.
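
The RL setup reportedly relies on simple rule-based rewards rather than a learned reward model. The sketch below shows the flavor of such rewards, an accuracy check plus a format check; the exact rules, tags, and weights are illustrative assumptions, not DeepSeek's published values.

```python
# Sketch of rule-based rewards of the kind described for R1-Zero's RL training:
# an accuracy reward (is the final answer right?) plus a format reward (did the
# model wrap its reasoning and answer in the expected tags?). Illustrative only.
import re

def format_reward(completion: str) -> float:
    ok = re.fullmatch(r"(?s)\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", completion)
    return 1.0 if ok else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    return 1.0 if m and m.group(1).strip() == reference.strip() else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Hypothetical weighting; the real recipe is not reproduced here.
    return accuracy_reward(completion, reference) + 0.1 * format_reward(completion)

sample = "<think>3 * 7 = 21</think><answer>21</answer>"
print(total_reward(sample, "21"))   # 1.1
```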

FP8 Mixed Precision Training:

Introduces an FP8 mixed-precision training framework. Compared with conventional FP16/BF16 training, FP8 tensors use half the memory and bandwidth, while sensitive operations are kept in FP16/FP32 where needed to preserve accuracy, saving computational resources overall.
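
A rough picture of what FP8 storage buys: cast tensors to an 8-bit float format with a per-tensor scale, at the cost of some precision. The sketch assumes a PyTorch build (2.1 or later) that exposes the float8_e4m3fn dtype; real FP8 training additionally needs FP8 matmul kernels and keeps sensitive values in higher precision.

```python
# Sketch of per-tensor scaled FP8 quantization (assumes PyTorch >= 2.1 with the
# float8_e4m3fn dtype; a real mixed-precision recipe involves much more).
import torch

x = torch.randn(1024, 1024)                       # weights/activations in FP32
scale = x.abs().max() / 448.0                     # 448 is roughly e4m3's max value
x_fp8 = (x / scale).to(torch.float8_e4m3fn)       # store 1 byte per value
x_back = x_fp8.to(torch.float32) * scale          # dequantize for comparison
print(f"mean abs error: {(x - x_back).abs().mean():.2e}")
print(f"memory: {x.element_size()}B -> {x_fp8.element_size()}B per value")
```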

Low-Level Communication Optimization:

Developed high-efficiency communication kernels that improve bandwidth utilization and keep data transfer from becoming a bottleneck, supporting large-scale training and deployment.

Analogy:

Germany invented the gasoline car; the U.S., trusting that bigger is better (the non-linear gains of a Scaling Law), favored ever larger engines; Japan instead focused on smaller, fuel-efficient ones. In this analogy, Chinese teams are like skilled tuning workshops that retrofit and fine-tune existing vehicles for better performance and efficiency. DeepSeek hasn't invented a new car or a new engine, but it has optimized the existing components with remarkable cost-performance engineering.

A Few Insights:

  1. DeepSeek represents a major victory for open-source models over closed ones.
    1. This kind of contribution will rapidly accelerate the prosperity of the whole open-source ecosystem; it is a moment of confidence and pride. Open source pushes development forward, from theory to real-world applications.
  2. OpenAI's brute-force scaling path looks simple, even crude, for now,
    1. but that doesn't rule out new qualitative leaps emerging at even larger scale. Its lead over open-source models, however, no longer looks intimidating. Historically, bold bets of this kind have repeatedly driven AI forward.
  3. DeepSeek has made open-source models as good as, or even better than, closed models, and more efficient.
    1. Its inference cost is around 6% of OpenAI's API pricing. This makes private deployment and custom fine-tuning practical, leaving far more room for developers. Over the next year or two, we may see a more complete hardware ecosystem and many more practical LLM applications.
  4. Large models will eventually be commoditized.
    1. In the B2B space, what matters is using LLMs intelligently to improve production efficiency; in the B2C space, whoever best understands user entry points and product packaging will capture the most valuable commercial opportunities in AI.
  5. Compute demand will not decline.
    1. Jevons Paradox: efficiency gains can lead to higher total consumption. Just as more fuel-efficient engines did not reduce total fuel consumption but led to more driving, the widespread adoption of AI will push overall compute demand up.
  6. Data demand will not decline.
    1. Even the best cook cannot make a meal without rice. Even as algorithms get smarter (the cooking gets better), data remains the essential ingredient, and demand for it will only rise.