DeepSeek Open Source Week, held from February 24 to February 28, 2025, marked a significant milestone in the open-source AI community. The initiative, spearheaded by the Chinese AI startup DeepSeek, aimed to democratize access to advanced AI tools and foster collaboration among developers and researchers worldwide. Over five days, DeepSeek released five cutting-edge repositories, each addressing a critical challenge in AI development, and capped the week with a "one more thing" overview of its DeepSeek-V3/R1 inference system. Below is a detailed summary of the event, its highlights, and the repositories made available.
Overview of DeepSeek Open Source Week
The event was announced on February 21, 2025, with DeepSeek emphasizing its commitment to transparency and community-driven innovation. The company described the initiative as a way to share "humble building blocks" of their online services, which had been documented, deployed, and tested in production environments. The releases were aimed at accelerating AI development by providing tools that enhance computational efficiency, model optimization, and large-scale data handling.
The releases are summarized in the table below:
| Repository | Description |
| --- | --- |
| FlashMLA | Efficient MLA decoding kernel for Hopper GPUs |
| DeepEP | Communication library for Mixture-of-Experts models |
| DeepGEMM | Optimized General Matrix Multiplication library |
| Optimized Parallelism Strategies | Framework for optimizing parallelism in distributed deep learning |
| Fire-Flyer File System (3FS) | Distributed file system optimized for machine learning workflows |
| DeepSeek-V3/R1 Inference System | Large-scale inference system using cross-node Expert Parallelism |
Day 1: FlashMLA
Description: FlashMLA is an efficient Multi-head Latent Attention (MLA) decoding kernel optimized for NVIDIA Hopper GPUs.

Key Features:
- Supports BF16 and FP16 data types.
- Paged KV cache with a block size of 64.
- Performance benchmarks: up to 3000 GB/s for memory-bound operations and 580 TFLOPS for computation-bound tasks on H800 GPUs.
- Requires CUDA 12.3+ and PyTorch 2.0+.
Significance: This tool enhances the inference speed of large language models (LLMs), making it ideal for high-performance AI applications.
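To make the paged KV cache concrete, the sketch below shows how a block size of 64 lets a decoding kernel locate any cached token through a per-sequence block table. It is a minimal illustration of the cache layout, not FlashMLA's API; the pool size, the 576-wide latent vector, and all names are illustrative assumptions.

```python
# Minimal sketch of a paged KV cache with block size 64 (the layout a paged
# decoding kernel consumes). Names and sizes are illustrative, not FlashMLA's API.
import torch

BLOCK_SIZE = 64          # tokens per physical cache block
NUM_BLOCKS = 1024        # size of the shared block pool (illustrative)
LATENT_DIM = 576         # width of the cached latent KV vector (illustrative)

# Global pool of KV blocks shared by all sequences: [num_blocks, block_size, latent_dim]
kv_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, LATENT_DIM, dtype=torch.bfloat16)

# Per-sequence block table: logical block index -> physical block id in the pool
block_table = {0: [3, 17, 42]}       # sequence 0 currently owns three blocks
seq_lens = {0: 150}                  # 150 cached tokens -> ceil(150 / 64) = 3 blocks

def kv_for_token(seq_id: int, pos: int) -> torch.Tensor:
    """Return the cached latent KV vector for token `pos` of sequence `seq_id`."""
    assert pos < seq_lens[seq_id], "position not cached yet"
    physical_block = block_table[seq_id][pos // BLOCK_SIZE]
    return kv_pool[physical_block, pos % BLOCK_SIZE]

print(kv_for_token(0, 130).shape)    # torch.Size([576])
```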
Day 2: DeepEP
Description: DeepEP is the first open-source communication library tailored for Mixture-of-Experts (MoE) models.

Key Features:
- Efficient all-to-all communication for both intranode (NVLink) and internode (RDMA) setups.
- High-throughput kernels for training and inference prefilling.
- Low-latency kernels for inference decoding.
- Native FP8 dispatch support.
- Flexible GPU resource management for overlapping computation and communication tasks.
Significance: DeepEP addresses bottlenecks in MoE model training and inference, enabling scalable distributed computing.
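The dispatch/combine pattern that such a library accelerates across GPUs can be illustrated on a single process: each token is routed to its top-k experts, tokens are gathered per expert (dispatch), processed, and scattered back with their routing weights (combine). The sketch below is a conceptual stand-in with illustrative shapes; it does not use DeepEP's API, which performs these steps as NVLink/RDMA all-to-all exchanges.

```python
# Conceptual single-process sketch of MoE dispatch/combine. Shapes and names
# are illustrative; DeepEP implements these steps as cross-GPU all-to-all kernels.
import torch

num_tokens, hidden, num_experts, top_k = 16, 32, 8, 2
x = torch.randn(num_tokens, hidden)
router_logits = torch.randn(num_tokens, num_experts)
weights, expert_ids = router_logits.softmax(-1).topk(top_k, dim=-1)   # [tokens, top_k]

# Toy experts: one linear layer each (in practice, per-GPU expert MLPs).
experts = [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]

output = torch.zeros_like(x)
for e in range(num_experts):
    token_idx, slot = (expert_ids == e).nonzero(as_tuple=True)   # "dispatch": gather tokens routed to expert e
    if token_idx.numel() == 0:
        continue
    y = experts[e](x[token_idx])                                 # expert computation
    output.index_add_(0, token_idx, weights[token_idx, slot].unsqueeze(-1) * y)  # "combine": weighted scatter-add

print(output.shape)   # torch.Size([16, 32])
```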
Day 3: DeepGEMM
Description: A highly optimized General Matrix Multiplication (GEMM) library for FP8 matrix multiplications with fine-grained scaling, as used in DeepSeek-V3 training and inference.

Key Features:
- FP8 kernels with fine-grained scaling for both dense and Mixture-of-Experts (MoE) grouped GEMMs.
- Lightweight just-in-time (JIT) compilation: kernels are compiled at runtime, with no heavy install-time build step.
- Exposed as simple Python functions operating on PyTorch tensors.
Significance: DeepGEMM improves the efficiency of low-precision matrix multiplication, the core computation in both the dense and MoE layers of large models.
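The core idea behind scaled low-precision GEMM can be sketched without the library: store a low-precision payload alongside fine-grained scale factors, multiply in the low-precision domain, and fold the scales back into the result. The toy below uses per-row int8 quantization as a stand-in for FP8 and a plain PyTorch matmul as a stand-in for DeepGEMM's kernels; the scaling granularity and all names are assumptions for illustration.

```python
# Toy scaled low-precision GEMM: int8 stands in for FP8, and per-row scales
# stand in for DeepGEMM's fine-grained scaling. Not the library's API.
import torch

def quantize_rows(x: torch.Tensor):
    """Symmetric per-row quantization to int8, returning payload and scales."""
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    return torch.round(x / scale).to(torch.int8), scale

A = torch.randn(64, 128)
B = torch.randn(32, 128)                 # stored row-major, used as B^T below

A_q, A_s = quantize_rows(A)              # [64, 128] int8 payload, [64, 1] scales
B_q, B_s = quantize_rows(B)              # [32, 128] int8 payload, [32, 1] scales

# Low-precision (integer) GEMM, then fold the scales back in to dequantize the result.
C = (A_q.to(torch.int32) @ B_q.to(torch.int32).T).float() * A_s * B_s.T

ref = A @ B.T
err = (C - ref).abs().max() / ref.abs().max()
print(f"max error relative to FP32 GEMM: {err.item():.4f}")   # small; comes from int8 rounding
```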
Day 4: Optimized Parallelism Strategies (DualPipe)
Description: A set of parallelism-optimization strategies for distributed deep learning, centered on the DualPipe bidirectional pipeline-parallelism algorithm.

Key Features:
- DualPipe: a bidirectional pipeline-parallelism algorithm that fully overlaps forward and backward computation-communication phases.
- EPLB: an expert-parallel load balancer that distributes expert placement evenly across GPUs and nodes.
- Built-in support for overlapping computation with communication, reducing pipeline bubbles.
Significance: These tools simplify the implementation of parallelism strategies, reducing training time for large-scale models.
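As a minimal illustration of why pipelining helps, the toy schedule below runs four microbatches through a two-stage pipeline: after the first step, both stages are busy at every tick. This shows only the textbook forward-pipelining idea, not DualPipe's bidirectional schedule; the stage and microbatch counts are arbitrary.

```python
# Toy schedule for a 2-stage pipeline over 4 microbatches, illustrating why
# splitting a batch lets stages work concurrently. Not DualPipe's algorithm.
num_stages, num_microbatches = 2, 4

# In a naive forward pass, stage s processes microbatch m at time step m + s.
schedule = {}
for m in range(num_microbatches):
    for s in range(num_stages):
        schedule.setdefault(m + s, []).append(f"stage{s}:mb{m}")

for t in sorted(schedule):
    print(f"t={t}: " + ", ".join(schedule[t]))
# t=0: stage0:mb0
# t=1: stage0:mb1, stage1:mb0   <- both stages busy: computation overlaps
# t=2: stage0:mb2, stage1:mb1
# ...
```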
Day 5: Fire-Flyer File System (3FS)
Description: A distributed file system optimized for machine learning workflows.

Key Features:
- High-throughput, low-latency data access across clusters, built on modern SSDs and RDMA networks.
- Strong consistency, simplifying application code that handles large-scale datasets.
- Support for diverse workloads, including data preparation, dataloaders, checkpointing, and KVCache for inference.
Significance: Fire-Flyer File System facilitates efficient data handling in distributed AI training environments.
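A common way to exploit such a file system from a training job is simply to issue many reads in parallel against its mount point and let the storage layer aggregate the bandwidth. The sketch below is a generic, hedged example: the mount path and shard naming are hypothetical, and nothing here is 3FS-specific API.

```python
# Hedged sketch: reading dataset shards concurrently from a mounted
# distributed file system. The mount point and file layout are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MOUNT = Path("/mnt/3fs/datasets/my_corpus")       # hypothetical mount point

def read_shard(path: Path) -> int:
    """Read one shard and return its size; a real loader would parse records."""
    return len(path.read_bytes())

shards = sorted(MOUNT.glob("shard-*.bin"))
with ThreadPoolExecutor(max_workers=16) as pool:  # keep many reads in flight
    total = sum(pool.map(read_shard, shards))
print(f"read {len(shards)} shards, {total} bytes")
```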
Day 6: One More Thing – DeepSeek-V3/R1 Inference System
The final day of DeepSeek Open Source Week introduced a comprehensive overview of the DeepSeek-V3/R1 Inference System, a cutting-edge solution designed to optimize throughput and latency for large-scale AI inference tasks. This system leverages cross-node Expert Parallelism (EP) to scale batch sizes, improve GPU efficiency, and reduce memory access demands, addressing the dual objectives of higher throughput and lower latency.
What's New in DeepSeek's Design
The DeepSeek-V3/R1 Inference System employs large-scale cross-node EP to handle the high sparsity of models with numerous experts (e.g., only 8 out of 256 experts per layer are activated). The system uses distinct parallelism strategies during the prefilling and decoding phases:
- Prefilling Phase: Routed Expert EP32 with Shared Expert DP32 across 4 nodes.
- Decoding Phase: Routed Expert EP144 with Shared Expert DP144 across 18 nodes.

A dual-batch overlap strategy hides communication latency by splitting requests into two microbatches. During prefilling, communication for one microbatch is overlapped with computation for the other.
During decoding, a 5-stage pipeline subdivides the attention layer into two steps so that communication and computation overlap seamlessly.
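The timeline below sketches the dual-batch idea for the prefilling phase: with the two microbatches offset by one phase, every communication phase of one microbatch lines up with a computation phase of the other. The phase names and strict alternation are simplifying assumptions, not the system's exact kernel-level schedule.

```python
# Conceptual timeline of dual-batch overlap: microbatch B runs one phase
# behind microbatch A, so comm phases always pair with compute phases.
phases = ["attention", "dispatch (comm)", "experts", "combine (comm)"]

def overlapped_timeline(num_layers: int = 1):
    timeline = []
    for step in range(num_layers * len(phases)):
        a = phases[step % len(phases)]          # microbatch A's current phase
        b = phases[(step - 1) % len(phases)]    # microbatch B, offset by one phase
        timeline.append((f"A:{a}", f"B:{b}"))
    return timeline

for a, b in overlapped_timeline():
    print(f"{a:<20} || {b}")
# A:attention          || B:combine (comm)
# A:dispatch (comm)    || B:attention
# A:experts            || B:dispatch (comm)
# A:combine (comm)     || B:experts
```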
Load Balancing Mechanisms:
- Prefill Load Balancer: Balances core-attention computation and dispatch send loads across GPUs.
- Decode Load Balancer: Equalizes KVCache usage and request counts per GPU.
- Expert-Parallel Load Balancer: Distributes expert computational workloads evenly across GPUs to minimize bottlenecks.
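To make the expert-parallel balancing objective concrete, the sketch below places experts with estimated loads onto GPUs so that the heaviest GPU is as light as possible, using a simple greedy heuristic. This is an assumed stand-in for illustration, not the production balancer's (or EPLB's) actual policy.

```python
# Greedy "longest load first" placement of experts onto GPUs, illustrating the
# balancing objective of an expert-parallel load balancer. Toy numbers only.
import heapq

def balance(expert_loads: list[float], num_gpus: int) -> list[list[int]]:
    heap = [(0.0, gpu) for gpu in range(num_gpus)]      # (current load, gpu id)
    heapq.heapify(heap)
    placement = [[] for _ in range(num_gpus)]
    # Place heaviest experts first, always onto the currently lightest GPU.
    for expert in sorted(range(len(expert_loads)), key=lambda e: -expert_loads[e]):
        load, gpu = heapq.heappop(heap)
        placement[gpu].append(expert)
        heapq.heappush(heap, (load + expert_loads[expert], gpu))
    return placement

loads = [9.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]        # per-expert token counts (toy)
print(balance(loads, num_gpus=4))                        # roughly equal load per GPU
```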
Cost and Revenue Analysis

- Peak node occupancy reached 278 nodes, with an average occupancy of 226.75 nodes (8 GPUs per node).
- Daily operational cost: $87,072 (based on $2/hour per H800 GPU).
- Theoretical daily revenue: $562,027, based on DeepSeek-R1 pricing.
- Cost-profit margin: a theoretical 545%, though actual revenue is lower due to free services, discounts, and lower pricing for DeepSeek-V3.
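These figures are internally consistent and can be reproduced directly from the reported averages:

```python
# Reproducing the reported cost and margin arithmetic from the stated inputs.
avg_nodes = 226.75
gpus_per_node = 8
price_per_gpu_hour = 2.0            # USD per H800 GPU-hour, as stated in the report

daily_cost = avg_nodes * gpus_per_node * 24 * price_per_gpu_hour
daily_revenue = 562_027             # USD, theoretical, at DeepSeek-R1 pricing
margin = (daily_revenue - daily_cost) / daily_cost

print(f"daily cost:  ${daily_cost:,.0f}")   # $87,072
print(f"cost margin: {margin:.0%}")         # 545%
```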
The system's innovative design principles and optimizations make it a state-of-the-art solution for large-scale AI inference tasks, setting benchmarks in efficiency and scalability.
Conclusion
DeepSeek Open Source Week concluded with the unveiling of the DeepSeek-V3/R1 Inference System, a testament to the company's commitment to advancing AI infrastructure. By open-sourcing these repositories, DeepSeek has not only empowered developers but also set new standards in AI efficiency, scalability, and accessibility. This initiative has left a lasting impact on the AI community, fostering collaboration and innovation at an unprecedented scale.