
DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

10x larger models; 10x faster training; minimal code changes.

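On current-generation GPU clusters, DeepSpeed can train deep learning models with more than 100 billion parameters while delivering over 10x the system performance of the state of the art. Early adopters of DeepSpeed have already produced a language model (LM) with over 17B parameters, called Turing-NLG, establishing a new SOTA in the LM category.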

DeepSpeed is an important part of Microsoft's new AI at Scale initiative, which aims to enable next-generation AI capabilities.


Why DeepSpeed?

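Training advanced deep learning models is challenging. Beyond model design, model scientists also need to set up state-of-the-art training techniques such as distributed training, mixed precision, gradient accumulation, and checkpointing. Even then, they may not reach the desired system performance and convergence rate. Large models are even more challenging: they easily run out of memory under pure data parallelism, and model parallelism is difficult to use. DeepSpeed addresses these challenges to accelerate model development and training.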
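As a rough illustration of how techniques such as mixed precision and gradient accumulation are typically switched on, the sketch below builds a DeepSpeed-style configuration as a Python dictionary. The key names follow the DeepSpeed JSON configuration documentation; the GPU count, the numeric values, and the helper name `effective_batch_size` are assumptions chosen for illustration only.

```python
# Illustrative DeepSpeed-style configuration; all values are arbitrary examples.
NUM_GPUS = 8  # assumed cluster size, not taken from the text

ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # samples per GPU per forward/backward pass
    "gradient_accumulation_steps": 8,     # micro-batches accumulated before each optimizer step
    "fp16": {"enabled": True},            # mixed-precision training
    "zero_optimization": {"stage": 1},    # partition optimizer states across data-parallel ranks
}


def effective_batch_size(config: dict, num_gpus: int) -> int:
    """Samples consumed per optimizer step across all GPUs (hypothetical helper)."""
    return (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * num_gpus
    )


# The global batch size must equal micro-batch size x accumulation steps x GPU count.
ds_config["train_batch_size"] = effective_batch_size(ds_config, NUM_GPUS)
print(ds_config["train_batch_size"])
```

Keeping the three batch-related values consistent in one place avoids a mismatch between the configured global batch size and what the hardware actually processes per step.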




All DeepSpeed documentation can be found on our website.

Article | Description
DeepSpeed Features | DeepSpeed features
Getting Started | First steps with DeepSpeed
DeepSpeed JSON Configuration | Configuring DeepSpeed
API Documentation | Generated DeepSpeed API documentation
CIFAR-10 Tutorial | Getting started with CIFAR-10 and DeepSpeed
Megatron-LM Tutorial | Train GPT2 with DeepSpeed and Megatron-LM
BERT Pre-training Tutorial | Pre-train BERT with DeepSpeed
Learning Rate Range Test Tutorial | Faster training with large learning rates
1Cycle Tutorial | SOTA learning schedule in DeepSpeed


DeepSpeed welcomes your contributions! Please see our contributing guide.



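When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.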


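This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.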


Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. (2019) ZeRO: memory optimizations toward training trillion parameter models. arXiv:1910.02054


(The first version translated by vz on 2020.08.08)


Name With Owner: microsoft/DeepSpeed
Primary Language: Python
Program language: Python (Language Count: 7)
Platform: Linux, Mac, Windows
License: Apache License 2.0
Release Count: 86
Last Release Name: v0.14.2
First Release Name: v0.1.0
Created At: 2020-01-23 18:35:18
Pushed At: 2024-05-06 11:41:11
Last Commit At: 2024-05-03 23:22:29
Stargazers Count: 32.8k
Watchers Count: 330
Fork Count: 3.9k
Commits Count: 2.3k
Has Issues Enabled
Issues Count: 2526
Issue Open Count: 920
Pull Requests Count: 2155
Pull Requests Open Count: 135
Pull Requests Close Count: 417
Has Wiki Enabled
Is Archived
Is Fork
Is Locked
Is Mirror
Is Private


Latest News

DeepSpeed empowers ChatGPT-like model training with a single click, offering 15x speedup over SOTA RLHF systems with unprecedented cost reduction at all scales; learn how.

Extreme Speed and Scale for DL Training and Inference

DeepSpeed enables the world's most powerful language models, such as MT-530B and BLOOM. It is an easy-to-use deep learning optimization software suite that powers unprecedented scale and speed for both training and inference. With DeepSpeed you can:

  • Train or run inference on dense or sparse models with billions or trillions of parameters
  • Achieve excellent system throughput and efficiently scale to thousands of GPUs
  • Train or run inference on resource-constrained GPU systems
  • Achieve unprecedented low latency and high throughput for inference
  • Achieve extreme compression for unparalleled inference latency and model size reduction at low cost

DeepSpeed's four innovation pillars


DeepSpeed-Training

DeepSpeed offers a confluence of system innovations that have made large-scale DL training effective and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of the scale that is possible. Innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity fall under the training pillar. Learn more: DeepSpeed-Training
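To give a feel for why ZeRO matters at scale, the following sketch estimates per-GPU memory for model states under each ZeRO stage. It uses the accounting from the ZeRO paper for mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states). The function name, the 7.5B-parameter model, and the 64-GPU setup are illustrative assumptions, and activations and temporary buffers are deliberately ignored.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Estimate per-GPU model-state memory in GB (1 GB = 1e9 bytes).

    Assumes mixed-precision Adam as accounted for in the ZeRO paper.
    Stage 0 means plain data parallelism (everything replicated).
    """
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus    # stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus    # stage 2 also partitions gradients
    if stage >= 3:
        weights /= num_gpus  # stage 3 also partitions parameters
    return num_params * (weights + grads + optim) / 1e9


for stage in range(4):
    estimate = zero_model_state_gb(7.5e9, 64, stage)
    print(f"ZeRO stage {stage}: {estimate:.1f} GB per GPU")
```

The estimate covers model states only, so real memory use will be higher; it is meant to show the trend across stages rather than to size a cluster.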


DeepSpeed-Inference

DeepSpeed brings together innovations in parallelism technology such as tensor, pipeline, expert and ZeRO-parallelism, and combines them with high performance custom inference kernels, communication optimizations and heterogeneous memory technologies to enable inference at an unprecedented scale, while achieving unparalleled latency, throughput and cost reduction. This systematic composition of system technologies for inference falls under the inference pillar. Learn more: DeepSpeed-Inference
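The idea behind tensor parallelism can be sketched in a few lines of plain Python: a weight matrix is split column-wise across devices, each device computes its slice of the output, and the slices are concatenated. This is a toy illustration of the general technique, not DeepSpeed's implementation; the function names and the list-based "devices" are assumptions made for the example.

```python
def matmul(x, w):
    """Plain-Python matrix product of x (m x k) and w (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split w column-wise into `parts` equal shards, one per hypothetical device."""
    n = len(w[0])
    assert n % parts == 0, "column count must divide evenly across shards"
    step = n // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_matmul(x, w, parts):
    """Each shard computes its slice of the output; slices are then concatenated."""
    partial_outputs = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((out[r] for out in partial_outputs), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]

# The sharded computation reproduces the unsharded result.
assert column_parallel_matmul(x, w, parts=2) == matmul(x, w)
```

In a real system each shard lives on a different GPU and the concatenation is a collective communication step, which is where custom kernels and communication optimizations come into play.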


DeepSpeed-Compression

To further increase the inference efficiency, DeepSpeed offers easy-to-use and flexible-to-compose compression techniques for researchers and practitioners to compress their models while delivering faster speed, smaller model size, and significantly reduced compression cost. Moreover, SoTA innovations on compression like ZeroQuant and XTC are included under the compression pillar. Learn more: DeepSpeed-Compression
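As a minimal sketch of what weight compression involves, the example below applies generic symmetric per-tensor INT8 quantization in plain Python. This is a textbook scheme chosen for illustration, not ZeroQuant or XTC, and the function names and sample weights are assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.02, -1.27, 0.64, 0.0]  # made-up example weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, max_error <= scale / 2 + 1e-12)
```

Each weight is stored in one byte instead of two or four, at the price of a bounded rounding error; production methods refine this basic idea with finer-grained scales and calibration.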


DeepSpeed4Science

In line with Microsoft's mission to solve humanity's most pressing challenges, the DeepSpeed team at Microsoft is responding to this opportunity by launching a new initiative called DeepSpeed4Science, aiming to build unique capabilities through AI system technology innovations to help domain experts to unlock today's biggest science mysteries. Learn more: DeepSpeed4Science website and tutorials

DeepSpeed Software Suite

DeepSpeed Library

The DeepSpeed library (this repository) implements and packages the innovations and technologies in the DeepSpeed Training, Inference and Compression pillars into a single easy-to-use, open-sourced repository. It allows for easy composition of a multitude of features within a single training, inference or compression pipeline. The DeepSpeed Library is heavily adopted by the DL community, and has been used to enable some of the most powerful models (see DeepSpeed Adoption).
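A typical way to compose these features is to describe them in a configuration and let the library wrap the model. The sketch below shows the general shape of such a training loop using `deepspeed.initialize`, `engine.backward`, and `engine.step`. The helper names `build_config` and `train`, the configuration values, and the assumption that `model(batch)` returns a scalar loss are all illustrative; a real script is normally started with the `deepspeed` launcher, and batches must already be on the right device and in the right precision.

```python
def build_config(micro_batch_size: int = 8, zero_stage: int = 2) -> dict:
    """Assemble an example DeepSpeed configuration (values are illustrative)."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": zero_stage},
    }


def train(model, data_loader, config: dict) -> None:
    """Run one pass over data_loader.

    Assumes `model` is a torch.nn.Module whose forward returns a scalar loss.
    """
    import deepspeed  # imported lazily so the sketch can be read without DeepSpeed installed

    # initialize returns (engine, optimizer, dataloader, lr_scheduler).
    engine, _, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=config,
    )
    for batch in data_loader:
        loss = engine(batch)    # forward pass through the wrapped model
        engine.backward(loss)   # backward pass with loss scaling handled by the engine
        engine.step()           # optimizer update at gradient accumulation boundaries


config = build_config()
print(sorted(config))
```

Because the features live in the configuration rather than in the loop, switching ZeRO stage or precision does not require changing the training code itself.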

Model Implementations for Inference (MII)

Model Implementations for Inference (MII) is an open-sourced repository for making low-latency and high-throughput inference accessible to all data scientists by alleviating the need to apply complex system optimization techniques themselves. Out of the box, MII offers support for thousands of widely used DL models, optimized using DeepSpeed-Inference, that can be deployed with a few lines of code while achieving significant latency reduction compared to their vanilla open-sourced versions.

DeepSpeed on Azure

DeepSpeed users are diverse and have access to different environments. We recommend trying DeepSpeed on Azure, as it is the simplest and easiest method. The recommended way to try DeepSpeed on Azure is through AzureML recipes. The job submission and data preparation scripts have been made available here. For more details on how to use DeepSpeed on Azure, please follow the Azure tutorial.

DeepSpeed Adoption

DeepSpeed is an important part of Microsoft's new AI at Scale initiative to enable next-generation AI capabilities at scale, where you can find more information here.

DeepSpeed has been used to train many different large-scale models. Below is a list of several examples that we are aware of (if you'd like to include your model, please submit a PR):

DeepSpeed has been integrated with several different popular open-source DL frameworks such as:

Transformers with DeepSpeed
Accelerate with DeepSpeed
Lightning with DeepSpeed
MosaicML with DeepSpeed
Determined with DeepSpeed
MMEngine with DeepSpeed

Build Pipeline Status

Description | Status
NVIDIA | nv-torch110-p40, nv-torch110-v100, nv-torch-latest-v100, nv-h100, nv-inference, nv-nightly
AMD | amd-mi100, amd-mi200
CPU | nv-torch-latest-cpu
PyTorch Nightly | nv-torch-nightly-v100
Integrations | nv-transformers-v100, nv-lightning-v100, nv-accelerate-v100, nv-megatron, nv-mii, nv-ds-chat, nv-sd
Misc | Formatting, pages-build-deployment, Documentation Status, python


Installation

The quickest way to get started with DeepSpeed is via pip. This installs
the latest release of DeepSpeed, which is not tied to specific PyTorch or CUDA
versions. DeepSpeed includes several C++/CUDA extensions that we commonly refer
to as our 'ops'. By default, all of these extensions/ops will be built
just-in-time (JIT) using torch's JIT C++ extension loader, which relies on
ninja to build and
dynamically link them at runtime.


Requirements

  • PyTorch must be installed before installing DeepSpeed.
  • For full feature support we recommend a version of PyTorch that is >= 1.9 and ideally the latest PyTorch stable release.
  • A CUDA or ROCm compiler such as nvcc or hipcc, used to compile C++/CUDA/HIP extensions.
  • The specific GPUs we develop and test against are listed below. This doesn't mean your GPU will not work if it doesn't fall into this category; it's just that DeepSpeed is most thoroughly tested on the following:
    • NVIDIA: Pascal, Volta, Ampere, and Hopper architectures
    • AMD: MI100 and MI200
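Since PyTorch has to be present before DeepSpeed is installed, a small pre-flight check can catch a missing prerequisite early. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical, and the check only confirms that a package can be found, not that its version meets the recommendation above.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# PyTorch should be importable before DeepSpeed is installed.
missing = missing_packages(["torch"])
if missing:
    print("Install these first:", ", ".join(missing))
else:
    print("PyTorch found; ready for `pip install deepspeed`.")
```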


PyPI

We regularly push releases to PyPI and encourage users to install from there in most cases.

pip install deepspeed

After installation, you can validate your install and see which extensions/ops
your machine is compatible with via the DeepSpeed environment report.

ds_report


If you would like to pre-install any of the DeepSpeed extensions/ops (instead
of JIT compiling) or install pre-compiled ops via PyPI, please see our advanced
installation instructions.


Windows

DeepSpeed is partially supported on Windows; currently only inference mode is supported. On Windows you can build a wheel with the following steps:

  1. Install PyTorch, such as PyTorch 1.8 + CUDA 11.1
  2. Install Visual C++ build tools, such as VS2019 C++ x64/x86 build tools
  3. Launch a cmd console with Administrator privileges, which are needed to create the required symlink folders
  4. Run python setup.py bdist_wheel to build the wheel in the dist folder


Features

Please check out the DeepSpeed-Training, DeepSpeed-Inference and DeepSpeed-Compression pages for the full set of features offered along each of these three pillars.

Further Reading

All DeepSpeed documentation, tutorials, and blogs can be found on our website:

Getting Started | First steps with DeepSpeed
DeepSpeed JSON Configuration | Configuring DeepSpeed
API Documentation | Generated DeepSpeed API documentation
Tutorials | Tutorials
Blogs | Blogs


Contributing

DeepSpeed welcomes your contributions! Please see our
contributing guide for more details on formatting, testing, etc.
Thanks so much to all of our amazing contributors!

Contributor License Agreement

This project welcomes contributions and suggestions. Most contributions require you to
agree to a Contributor License Agreement (CLA) declaring that you have the right to, and
actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need
to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply
follow the instructions provided by the bot. You will only need to do this once across
all repos using our CLA.

Code of Conduct

This project has adopted the Microsoft Open Source Code of
Conduct. For more information see the
Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.


Publications

  1. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: memory optimizations toward training trillion parameter models. arXiv:1910.02054 and In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20).
  2. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial).
  3. Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. arXiv:2010.13369 and NeurIPS 2020.
  4. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. arXiv:2101.06840 and USENIX ATC 2021. [paper] [slides] [blog]
  5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. arXiv:2102.02888 and ICML 2021.
  6. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. (2021) ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. arXiv:2104.07857 and SC 2021. [paper] [slides] [blog]
  7. Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He. (2021) 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed. arXiv:2104.06069 and HiPC 2022.
  8. Conglong Li, Minjia Zhang, Yuxiong He. (2021) The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models. arXiv:2108.06084 and NeurIPS 2022.
  9. Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He. (2022) Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam. arXiv:2202.06009.
  10. Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He. (2022) DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale arXiv:2201.05596 and ICML 2022. [pdf] [slides] [blog]
  11. Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro. (2022) Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model arXiv:2201.11990.
  12. Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He. (2022) Extreme Compression for Pre-trained Transformers Made Simple and Efficient. arXiv:2206.01859 and NeurIPS 2022.
  13. Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He. (2022) ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. arXiv:2206.01861 and NeurIPS 2022 [slides] [blog]
  14. Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He. (2022) DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv:2207.00032 and SC 2022. [paper] [slides] [blog]
  15. Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He. (2022) Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers. arXiv:2211.11586.
  16. Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Yuxiong He. (2022) DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing. arXiv:2212.03597 and ENLSP 2023 Workshop at NeurIPS 2023.
  17. Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He. (2023) Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases. arXiv:2301.12017 and ICML 2023.
  18. Syed Zawad, Cheng Li, Zhewei Yao, Elton Zheng, Yuxiong He, Feng Yan. (2023) DySR: Adaptive Super-Resolution via Algorithm and System Co-design. ICLR 2023.
  19. Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, Yuxiong He. (2023) Scaling Vision-Language Models with Sparse Mixture of Experts. arXiv:2303.07226 and Findings of EMNLP 2023.
  20. Quentin Anthony, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, Dhabaleswar Panda. (2023) MCR-DL: Mix-and-Match Communication Runtime for Deep Learning arXiv:2303.08374 and will appear at IPDPS 2023.
  21. Siddharth Singh, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He, Abhinav Bhatele. (2023) A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training arXiv:2303.06318 and will appear at ICS 2023.
  22. Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Xiaoxia Wu, Connor Holmes, Zhewei Yao, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He. (2023) ZeRO++: Extremely Efficient Collective Communication for Giant Model Training arXiv:2306.10209 and ML for Sys Workshop at NeurIPS2023 [blog]
  23. Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He. (2023) ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation arXiv:2303.08302 and ENLSP2023 Workshop at NeurIPS2023 [slides]
  24. Pareesa Ameneh Golnari, Zhewei Yao, Yuxiong He. (2023) Selective Guidance: Are All the Denoising Steps of Guided Diffusion Important? arXiv:2305.09847
  25. Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes, Zhongzhu Zhou, Michael Wyatt, Molly Smith, Lev Kurilenko, Heyang Qin, Masahiro Tanaka, Shuai Che, Shuaiwen Leon Song, Yuxiong He. (2023) DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales arXiv:2308.01320.
  26. Xiaoxia Wu, Zhewei Yao, Yuxiong He. (2023) ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats arXiv:2307.09782 and ENLSP2023 Workshop at NeurIPS2023 [slides]
  27. Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He. (2023) DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention arXiv:2309.14327
  28. Shuaiwen Leon Song, Bonnie Kruft, Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Masahiro Tanaka, Xiaoxia Wu, Jeff Rasley, Ammar Ahmad Awan, Connor Holmes, Martin Cai, Adam Ghanem, Zhongzhu Zhou, Yuxiong He, et al. (2023) DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies arXiv:2310.04610 [blog]
  29. Zhewei Yao, Reza Yazdani Aminabadi, Stephen Youn, Xiaoxia Wu, Elton Zheng, Yuxiong He. (2023) ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers arXiv:2310.17723


Videos

  1. DeepSpeed KDD 2020 Tutorial
    1. Overview
    2. ZeRO + large model training
    3. 17B T-NLG demo
    4. Fastest BERT training + RScan tuning
    5. DeepSpeed hands on deep dive: part 1, part 2, part 3
    6. FAQ
  2. Microsoft Research Webinar
  3. DeepSpeed on AzureML
  4. Large Model Training and Inference with DeepSpeed // Samyam Rajbhandari // LLMs in Prod Conference [slides]
  5. Community Tutorials