🚀 DeepSeek-V3: Efficiently Scalable Open-source AGI

DeepSeek-V3 is a 671-billion-parameter mixture-of-experts (MoE) model that activates 37 billion parameters per token, aiming to push the boundaries of open-source large language models. It adopts innovative architectures such as Multi-Head Latent Attention (MLA) and DeepSeekMoE for efficient training and inference, and further improves performance through an auxiliary-loss-free load-balancing strategy and a multi-token prediction objective.

AI Research Breakthrough
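
The sparse activation can be pictured with a toy top-k routed MoE layer. The sketch below is a minimal illustration, not DeepSeek-V3's implementation: the dimensions, expert count, and the absence of shared experts and fine-grained expert segmentation are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: each token activates only top_k of num_experts FFNs,
    which is how a 671B-parameter model can run only ~37B parameters per token."""
    def __init__(self, d_model=64, d_ff=128, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                # x: [tokens, d_model]
        scores = self.router(x).softmax(dim=-1)          # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep top_k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens sending slot k to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

x = torch.randn(4, 64)           # 4 tokens
print(TinyMoELayer()(x).shape)   # torch.Size([4, 64]); only 2 of 8 experts run per token
```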

🔧 Optimized Training: FP8 Precision and DualPipe Algorithm

DeepSeek-V3 introduces FP8 mixed-precision training and the DualPipe algorithm for pipeline parallelism, overlapping computation with communication to achieve near-zero all-to-all communication overhead and high training efficiency. This makes it one of the most cost-effective large-scale models: pre-training on 14.8 trillion tokens required only 2.664 million H800 GPU hours.

Training Optimization
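
A rough picture of FP8 training is a scaled cast of tensors to an 8-bit floating-point format before the expensive matrix multiplications. The sketch below shows only per-tensor scaling with PyTorch's `float8_e4m3fn` dtype (available in recent PyTorch releases); DeepSeek-V3's actual recipe uses finer-grained scaling and keeps sensitive operations in higher precision, so treat this as an assumption-laden illustration.

```python
import torch

FP8_E4M3_MAX = 448.0   # largest finite value representable in the E4M3 format

def quantize_fp8(x: torch.Tensor):
    """Per-tensor scaled cast to FP8 (E4M3). Returns the FP8 tensor and the scale
    needed to recover the original magnitude."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)   # requires PyTorch >= 2.1
    return x_fp8, scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) * scale

w = torch.randn(256, 256)
w_fp8, s = quantize_fp8(w)
err = (dequantize_fp8(w_fp8, s) - w).abs().mean().item()
print(f"mean abs quantization error: {err:.5f}")
```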

📦 Post-training: Knowledge Distillation from DeepSeek-R1

DeepSeek-V3 strengthens its performance on mathematics, coding, and general reasoning tasks by distilling knowledge from DeepSeek-R1 through an innovative distillation pipeline. The method balances accuracy against generation length, keeping outputs both robust and efficient.

Model Distillation
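
The page does not spell out the distillation objective, so the following is a generic logit-distillation sketch for orientation only: DeepSeek-V3's R1 distillation reportedly works through reasoning data generated by R1 rather than raw teacher logits, and the temperature `T` and mixing weight `alpha` here are hypothetical.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Generic knowledge distillation: a softened KL term toward the teacher's
    distribution plus the usual cross-entropy on ground-truth tokens."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 32000)             # [batch, vocab]
teacher = torch.randn(8, 32000)             # stand-in for teacher-model outputs
labels = torch.randint(0, 32000, (8,))
print(distillation_loss(student, teacher, labels))
```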

🏛️ Architecture: Multi-Head Latent Attention and DeepSeekMoE

DeepSeek-V3's architecture builds on the Transformer framework, combining Multi-Head Latent Attention (MLA) for efficient inference with DeepSeekMoE for economical training. MLA compresses the key-value (KV) cache to reduce inference memory, while DeepSeekMoE uses an auxiliary-loss-free load-balancing strategy to keep expert utilization balanced during training.

Architectural Innovation
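
The auxiliary-loss-free idea can be sketched as a per-expert bias that influences which experts are selected but not how they are weighted, nudged toward balance after each batch. The update rule and step size below are illustrative assumptions, not DeepSeek-V3's exact recipe.

```python
import torch

class BiasBalancedRouter:
    """Sketch of auxiliary-loss-free load balancing: a bias term is added to routing
    scores only for top-k selection, then adjusted so overloaded experts become less
    likely to be picked on subsequent batches."""
    def __init__(self, num_experts=8, top_k=2, step=1e-3):
        self.bias = torch.zeros(num_experts)
        self.top_k, self.step = top_k, step

    def route(self, scores):                                      # scores: [tokens, num_experts]
        _, idx = (scores + self.bias).topk(self.top_k, dim=-1)    # bias affects selection...
        weights = torch.gather(scores, -1, idx)                   # ...but not the gating weights
        load = torch.bincount(idx.flatten(), minlength=self.bias.numel()).float()
        self.bias -= self.step * torch.sign(load - load.mean())   # push toward uniform load
        return weights, idx

router = BiasBalancedRouter()
weights, idx = router.route(torch.rand(16, 8))
print(idx.shape, router.bias)
```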

🔮 Multi-token Prediction (MTP)

DeepSeek-V3 introduces a Multi-token Prediction (MTP) objective that predicts several future tokens at each position. This densifies the training signal, improving data efficiency, and encourages the model to pre-plan its representations so that it can better predict future tokens. During inference, the MTP modules can be repurposed for speculative decoding to reduce generation latency.

Training Enhancement
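
A stripped-down way to see the densified signal is to attach extra prediction heads trained to predict tokens further ahead. DeepSeek-V3's MTP instead chains small sequential modules that keep the causal chain intact, so the shared-hidden-state heads, depth of 2, and toy dimensions below are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    """Toy multi-token prediction: head k is trained to predict the token k steps
    ahead from the same hidden state, so every position contributes several losses."""
    def __init__(self, d_model=64, vocab=1000, depth=2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(depth))

    def loss(self, hidden, tokens):                  # hidden: [B, T, d], tokens: [B, T]
        total = 0.0
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k])            # predictions for positions t+k
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), tokens[:, k:].reshape(-1)
            )
        return total / len(self.heads)

B, T, d = 2, 16, 64
mtp = MTPHeads()
print(mtp.loss(torch.randn(B, T, d), torch.randint(0, 1000, (B, T))))
```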


Released under the MIT License.
