OpenSeek: Open-Source Practice in Building High-Quality Datasets
Liu Guang
Data Research Group, BAAI (智源研究院)

Background
DeepSeek became a landmark phenomenon in AI in 2025.

Goal
OpenSeek uses open source to drive the building of next-generation AI models.

Progress
  o 200+ contributors
  o Three working groups
  o 7TB of token data
  o 4 biweekly meetings

A New Model of Open-Source Collective Innovation, and Its Challenges
The evolution from "open weights" to "open-sourcing every element".

Working group model
  o System group: efficient DeepSeek V3 training with multi-chip support
  o Data group: 10TB-scale bilingual plus synthetic high-quality data (CCI4.0)
  o Algorithm group: improvements to data mixture, model architecture, training algorithms, and system optimization

OpenSeek Timeline
From "data + open source" to a new model of "technology-community collective open-source innovation", building a sustainable AI ecosystem.

Main Technical Points of DeepSeek V3 and R1
  o MoE: 1 shared expert and 256 routed experts (8 activated per token); the first three layers are dense
  o MLA: efficient support for long sequences
  o MTP: multi-token prediction training objective, which improves downstream metrics
  o FP8 training, with heavily targeted optimization
  o DualPipe distributed training strategy

System Roadmap (status columns on the slide: DONE, IN PROCESS, TODO)

Model Architecture Support
  o MLA, DeepSeek MoE, MTP, etc.
Huggingface Compatibility
  o Checkpoint conversion between FlagScale and Huggingface parameters
MoE Parallelism Optimization
  o Performance analysis of the current DeepSeekMoE distributed training implementation
Multi-Chip Support
  o Integration of the FlagGems Triton operator library and corresponding training accuracy validation
Distributed Training Process Display
  o Processing and display of records related to distributed training
Large-Scale Stability
  o Develop tools for detecting slow nodes, faulty nodes, and NCCL errors in large-scale clusters
  o Implement a distributed log consolidation mechanism
  o Improve the monitoring system for distributed training
Usability Improvement
  o Enhance the distributed training documentation
  o Improve installation and usage
Pipeline Parallelism Optimization
  o Support for DualPipe pipeline parallelism
Long Sequence Optimization
  o Performance analysis of current long-sequence handling
  o Support for DeepSeek NSA, Kimi MoBA, etc.
Distributed Reinforcement Learning
  o Research and design a solution that can be easily implemented in FlagScale
  o Implement a distributed reinforcement learning system to support DeepSeek R1 efficiently
Pipeline Parallelism Optimization
  o Achi