Amphion

Amphion (/æmˈfaɪən/) 是一个音频、音乐和语音生成工具包。其目的是支持可重复的研究,帮助初级研究人员和工程师开始音频、音乐和语音生成领域的研究和开发。「Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.」

Github星跟踪图

Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development. Amphion offers a unique feature: visualizations of classic models or architectures. We believe that these visualizations are beneficial for junior researchers and engineers who wish to gain a better understanding of the model.

The North-Star objective of Amphion is to offer a platform for studying the conversion of any inputs into audio. Amphion is designed to support individual generation tasks, including but not limited to,

  • TTS: Text to Speech (⛳ supported)
  • SVS: Singing Voice Synthesis (👨‍💻 developing)
  • VC: Voice Conversion (👨‍💻 developing)
  • SVC: Singing Voice Conversion (⛳ supported)
  • TTA: Text to Audio (⛳ supported)
  • TTM: Text to Music (👨‍💻 developing)
  • more…

In addition to the specific generation tasks, Amphion includes several vocoders and evaluation metrics. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent metrics in generation tasks. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building large-scale datasets for speech synthesis.

🚀 News

  • 2024/10/19: We release MaskGCT, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on Emilia dataset and achieves SOTA zero-shot TTS perfermance. arXiv hf hf readme
  • 2024/09/01: Amphion, Emilia and DSFF-SVC got accepted by IEEE SLT 2024! 🤗
  • 2024/08/28: Welcome to join Amphion's Discord channel to stay connected and engage with our community!
  • 2024/08/20: SingVisio got accepted by Computers & Graphics, available here! 🎉
  • 2024/08/27: The Emilia dataset is now publicly available! Discover the most extensive and diverse speech generation dataset with 101k hours of in-the-wild speech data now at hf or OpenDataLab! 👑👑👑
  • 2024/07/01: Amphion now releases Emilia, the first open-source multilingual in-the-wild dataset for speech generation with over 101k hours of speech data, and the Emilia-Pipe, the first open-source preprocessing pipeline designed to transform in-the-wild speech data into high-quality training data with annotations for speech generation! arXiv hf demo readme
  • 2024/06/17: Amphion has a new release for its VALL-E model! It uses Llama as its underlying architecture and has better model performance, faster training speed, and more readable codes compared to our first version. readme
  • 2024/03/12: Amphion now support NaturalSpeech3 FACodec and release pretrained checkpoints. arXiv hf hf readme
  • 2024/02/22: The first Amphion visualization tool, SingVisio, release. arXiv openxlab Video readme
  • 2023/12/18: Amphion v0.1 release. arXiv hf youtube readme
  • 2023/11/28: Amphion alpha release. readme

⭐ Key Features

TTS: Text to Speech

  • Amphion achieves state-of-the-art performance compared to existing open-source repositories on text-to-speech (TTS) systems. It supports the following models or architectures:
    • FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
    • VITS: An end-to-end TTS architecture that utilizes conditional variational autoencoder with adversarial learning
    • VALL-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes.
    • NaturalSpeech2: An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.
    • Jets: An end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
    • MaskGCT: a fully non-autoregressive TTS architecture that eliminates the need for explicit alignment information between text and speech supervision.

SVC: Singing Voice Conversion

  • Ampion supports multiple content-based features from various pretrained models, including WeNet, Whisper, and ContentVec. Their specific roles in SVC has been investigated in our SLT 2024 paper. arXiv code
  • Amphion implements several state-of-the-art model architectures, including diffusion-, transformer-, VAE- and flow-based models. The diffusion-based architecture uses Bidirectional dilated CNN as a backend and supports several sampling algorithms such as DDPM, DDIM, and PNDM. Additionally, it supports single-step inference based on the Consistency Model.

TTA: Text to Audio

  • Amphion supports the TTA with a latent diffusion model. It is designed like AudioLDM, Make-an-Audio, and AUDIT. It is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper. arXiv code

Vocoder

Evaluation

Amphion provides a comprehensive objective evaluation of the generated audio. The evaluation metrics contain:

  • F0 Modeling: F0 Pearson Coefficients, F0 Periodicity Root Mean Square Error, F0 Root Mean Square Error, Voiced/Unvoiced F1 Score, etc.
  • Energy Modeling: Energy Root Mean Square Error, Energy Pearson Coefficients, etc.
  • Intelligibility: Character/Word Error Rate, which can be calculated based on Whisper and more.
  • Spectrogram Distortion: Frechet Audio Distance (FAD), Mel Cepstral Distortion (MCD), Multi-Resolution STFT Distance (MSTFT), Perceptual Evaluation of Speech Quality (PESQ), Short Time Objective Intelligibility (STOI), etc.
  • Speaker Similarity: Cosine similarity, which can be calculated based on RawNet3, Resemblyzer, WeSpeaker, WavLM and more.

Datasets

Visualization

Amphion provides visualization tools to interactively illustrate the internal processing mechanism of classic models. This provides an invaluable resource for educational purposes and for facilitating understandable research.

Currently, Amphion supports SingVisio, a visualization tool of the diffusion model for singing voice conversion. arXiv openxlab Video

📀 Installation

Amphion can be installed through either Setup Installer or Docker Image.

Setup Installer

git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

# Install Python Environment
conda create --name amphion python=3.9.15
conda activate amphion

# Install Python Packages Dependencies
sh env.sh

Docker Image

  1. Install Docker, NVIDIA Driver, NVIDIA Container Toolkit, and CUDA.

  2. Run the following commands:

git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

docker pull realamphion/amphion
docker run --runtime=nvidia --gpus all -it -v .:/app realamphion/amphion

Mount dataset by argument -v is necessary when using Docker. Please refer to Mount dataset in Docker container and Docker Docs for more details.

🐍 Usage in Python

We detail the instructions of different tasks in the following recipes:

👨‍💻 Contributing

We appreciate all contributions to improve Amphion. Please refer to CONTRIBUTING.md for the contributing guideline.

🙏 Acknowledgement

©️ License

Amphion is under the MIT License. It is free for both research and commercial use cases.

📚 Citations

@inproceedings{amphion,
    author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
    title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
    booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
    year={2024}
}

主要指标

概览
名称与所有者open-mmlab/Amphion
主编程语言Python
编程语言 (语言数: 7)
平台
许可证MIT License
所有者活动
创建于2023-11-15 09:19:27
推送于2025-04-12 16:46:28
最后一次提交
发布数3
最新版本名称v0.1.1-alpha (发布于 2024-02-24 01:44:51)
第一版名称v0.1.0-alpha (发布于 2023-11-28 17:54:06)
用户参与
星数9k
关注者数83
派生数711
提交数123
已启用问题?
问题数272
打开的问题数125
拉请求数113
打开的拉请求数14
关闭的拉请求数25
项目设置
已启用Wiki?
已存档?
是复刻?
已锁定?
是镜像?
是私有?