DeepSeek-V3 Technical Report

Original link: https://arxiv.org/abs/2412.19437

DeepSeek-AI and its many collaborators introduce DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which 37 billion are activated per token. Building on the DeepSeek-V2 architecture, it uses Multi-head Latent Attention (MLA) and DeepSeekMoE to achieve efficient inference and training. Key innovations include its auxiliary-loss-free load-balancing strategy and a multi-token prediction objective. DeepSeek-V3 was pre-trained on a large corpus of 14.8 trillion tokens and further refined through supervised fine-tuning and reinforcement learning. Evaluations show that DeepSeek-V3 surpasses other open-source models and performs on par with leading closed-source models. Notably, the model achieved this performance using only 2.788M H800 GPU hours, with remarkably stable training and no loss spikes or rollbacks. The model checkpoints have been publicly released. The technical report was first submitted on December 27, 2024, and revised on February 18, 2025.
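The auxiliary-loss-free load-balancing idea mentioned above can be illustrated with a small routing sketch. The snippet below is a minimal, hypothetical Python illustration (function names, tensor shapes, and the bias update step are assumptions, not values taken from the report): each expert carries a non-learned bias that is added to its affinity score only when choosing the top-k experts for a token, and the bias is nudged after each step so overloaded experts become less likely to be selected, without adding a balancing term to the training loss.

```python
import torch

def route_tokens(hidden, expert_centroids, expert_bias, top_k=8):
    """Hypothetical top-k expert routing with bias-based load balancing.

    `expert_bias` is a non-learned per-expert offset: it is added to the
    affinity scores only when selecting which experts receive a token,
    while the gating weights themselves use the unbiased scores.
    """
    # Affinity of each token to each expert (sigmoid scores in this sketch).
    scores = torch.sigmoid(hidden @ expert_centroids.T)      # [tokens, experts]
    # Selection uses biased scores; gating weights use the raw scores.
    topk_idx = torch.topk(scores + expert_bias, top_k, dim=-1).indices
    gate = torch.gather(scores, -1, topk_idx)
    gate = gate / gate.sum(dim=-1, keepdim=True)              # normalize gates
    return topk_idx, gate

def update_bias(expert_bias, tokens_per_expert, step=0.001):
    """Nudge biases toward balance: overloaded experts get a lower bias."""
    target = tokens_per_expert.float().mean()
    expert_bias -= step * torch.sign(tokens_per_expert.float() - target)
    return expert_bias
```

Because the bias only influences expert selection and never enters the loss, balance is steered without the gradient interference that an auxiliary balancing loss can introduce.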


Original text
Authors:DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J.L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jiawei Wang, Jin Chen, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Litong Wang, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu Chen, Qiushi Du, R.J. Chen, R.L. Jin, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin Xu, Ruoyu Zhang, Ruyi Chen, S.S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Shuting Pan, T. Wang, Tao Yun, Tian Pei, Tianyu Sun, W.L. Xiao, Wangding Zeng et al. (100 additional authors not shown)

Abstract:We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at this https URL.
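The abstract also mentions a multi-token prediction training objective. As a rough, hedged illustration of what such an objective can look like, the sketch below combines an ordinary next-token cross-entropy with additional heads that predict tokens further ahead; the head structure, tensor shapes, and weighting factor are assumptions for illustration, not details quoted from the report.

```python
import torch
import torch.nn.functional as F

def multi_token_prediction_loss(logits_per_depth, targets, lambda_weight=0.3):
    """Hedged sketch of a multi-token prediction objective.

    `logits_per_depth[d]` holds logits for predicting the token d+1 steps
    ahead at each position; depth 0 is the ordinary next-token head.  The
    extra depths are averaged and added to the main loss with a small
    weight (`lambda_weight` is an assumed value, not from the report).
    """
    main_loss = F.cross_entropy(
        logits_per_depth[0][:, :-1].flatten(0, 1), targets[:, 1:].flatten()
    )
    extra = []
    for d, logits in enumerate(logits_per_depth[1:], start=2):
        # Predict the token d steps ahead; drop positions without a target.
        extra.append(F.cross_entropy(
            logits[:, :-d].flatten(0, 1), targets[:, d:].flatten()
        ))
    mtp_loss = torch.stack(extra).mean() if extra else torch.tensor(0.0)
    return main_loss + lambda_weight * mtp_loss
```

At inference time such extra heads can be dropped, leaving a standard next-token model, or repurposed for speculative decoding.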
From: Wenfeng Liang
[v1] Fri, 27 Dec 2024 04:03:16 UTC (1,114 KB)
[v2] Tue, 18 Feb 2025 17:26:38 UTC (1,114 KB)