Wav2Lip

This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020.

  • Owner: Rudrabha/Wav2Lip

Wav2Lip: Accurately Lip-syncing Videos In The Wild

This code is part of the paper: A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild published at ACM Multimedia 2020.


Paper (http://arxiv.org/abs/2008.10010), Project Page (http://cvit.iiit.ac.in/research/projects/cvit-projects/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild/), Demo Video (https://youtu.be/0fXaDCZNOJc), Interactive Demo (https://bhaasha.iiit.ac.in/lipsync), Colab Notebook (https://colab.research.google.com/drive/1tZpDWXz49W6wDcTprANRGLo2D_EbD5J8?usp=sharing), ReSyncED (coming soon)


Highlights

  • Lip-sync videos to any target speech with high accuracy. Try our interactive demo.
  • Works for any identity, voice, and language. Also works for CGI faces and synthetic voices.
  • Complete training code, inference code, and pretrained models are available.
  • Or, quick-start with the Google Colab Notebook: Link. Checkpoints and samples are available in a Google Drive folder as well.
  • Several new, reliable evaluation benchmarks and metrics (https://github.com/Rudrabha/Wav2Lip/tree/master/evaluation) have been released. Instructions to calculate the metrics reported in the paper are also included.

Disclaimer

All results from this open-source code or our demo website should be used for research/academic/personal purposes only. As the models are trained on the LRS2 dataset, any form of commercial use is strictly prohibited. Please contact us for all further queries.

Prerequisites

  • Python 3.5.2 (the code was tested with this version on our end, but several users report that only 3.6+ works for them).
  • ffmpeg: sudo apt-get install ffmpeg
  • Install necessary packages using pip install -r requirements.txt
  • Face detection pre-trained model should be downloaded to face_detection/detection/sfd/s3fd.pth. Alternative link if the above does not work.
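
The checks in this list can be scripted before a first run. The following is a minimal sketch and not part of the repository: the function name check_prerequisites is hypothetical, and the default minimum version (3, 6) reflects the user reports mentioned above rather than an official requirement.

```python
import shutil
import sys
from pathlib import Path

# Path named in the prerequisites list, relative to the repository root.
S3FD_WEIGHTS = Path("face_detection/detection/sfd/s3fd.pth")


def check_prerequisites(repo_root=".", min_python=(3, 6)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d or newer is assumed here" % min_python)
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    if not (Path(repo_root) / S3FD_WEIGHTS).is_file():
        problems.append("missing face detection weights: %s" % S3FD_WEIGHTS)
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Run it from the repository root; no output means all three checks passed. It does not verify that the packages in requirements.txt are installed.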

You can lip-sync any video to any audio:

python inference.py --checkpoint_path <ckpt> --face <video.mp4> --audio <an-audio-source> 

The result is saved (by default) in results/result_voice.mp4. You can change this path with a command-line argument, as with several other available options. The audio source can be any file supported by ffmpeg that contains audio data: *.wav, *.mp3, or even a video file, from which the code will automatically extract the audio.
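
For batch or scripted use, the same command can be assembled from Python. This is a rough sketch that uses only the three flags shown above; the helper names build_inference_cmd and run_inference are hypothetical, and the file paths in the usage note are placeholders.

```python
import subprocess
import sys


def build_inference_cmd(checkpoint, face, audio):
    """Assemble the inference command shown above as an argument list."""
    return [
        sys.executable, "inference.py",
        "--checkpoint_path", str(checkpoint),
        "--face", str(face),
        "--audio", str(audio),
    ]


def run_inference(checkpoint, face, audio):
    """Run inference.py from the repository root; raises CalledProcessError on failure."""
    subprocess.run(build_inference_cmd(checkpoint, face, audio), check=True)
    # Default output location stated in this README.
    return "results/result_voice.mp4"
```

For example, run_inference("ckpt.pth", "video.mp4", "speech.wav") would launch one job and return the default output path. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.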

Tips for better results:
  • Experiment with the --pads argument to adjust the detected face bounding box. This often leads to improved results. You might need to increase the bottom padding to include the chin region, e.g. --pads 0 20 0 0.
  • If the mouth appears dislocated, or you see artifacts such as two mouths, the face detections may have been over-smoothed. Use the --nosmooth argument and try again.
  • Experiment with the --resize_factor argument to get a lower-resolution video. Why? The models were trained on lower-resolution faces. You might get better, more visually pleasing results for 720p videos than for 1080p videos (in many cases, the latter works well too).
  • The Wav2Lip model without the GAN usually needs more experimentation with the above two to get the best results, and can sometimes give you a better result as well.
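
Since these tips come down to trial and error, it can help to enumerate the combinations to try. The sketch below only generates the extra arguments; the function name tuning_variants and the candidate values are illustrative assumptions. It also assumes that --resize_factor takes an integer, with 1 meaning no downscaling.

```python
from itertools import product


def tuning_variants(bottom_pads=(0, 10, 20), resize_factors=(1, 2), try_nosmooth=True):
    """Yield extra argument lists to append to the inference command."""
    smooth_options = (False, True) if try_nosmooth else (False,)
    for bottom, factor, nosmooth in product(bottom_pads, resize_factors, smooth_options):
        # Only the second --pads value is varied, matching the example
        # "--pads 0 20 0 0" given above for extra bottom padding.
        args = ["--pads", "0", str(bottom), "0", "0",
                "--resize_factor", str(factor)]
        if nosmooth:
            args.append("--nosmooth")
        yield args


for extra in tuning_variants():
    print(" ".join(extra))
```

Each printed line can be appended to the inference command. Because results go to the same default path, give each run its own output path so that earlier results are not overwritten.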

Preparing LRS2 for training

Our models are trained on LRS2. See the section "Training on datasets other than LRS2" below for a few suggestions regarding training on other datasets.

LRS2 dataset folder structure
data_root (mvlrs_v1)
├── main, pretrain (we use only main folder in this work)
│   ├── list of folders
│   │   ├── five-digit numbered video IDs ending with (.mp4)

Place the LRS2 filelists (train, val, test) .txt files in the filelists/ folder.
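
Before preprocessing, it may be worth confirming that the dataset matches the layout above. The following sketch walks main/ and collects files named with a five-digit ID and an .mp4 extension; the function name list_lrs2_videos is hypothetical, and the check covers only the layout described here, not the contents of the filelists.

```python
import re
from pathlib import Path

# Five-digit numbered video IDs ending with .mp4, as described above.
VIDEO_ID = re.compile(r"^\d{5}\.mp4$")


def list_lrs2_videos(main_dir):
    """Return sorted paths matching main/<folder>/<five-digit id>.mp4."""
    videos = []
    for folder in sorted(Path(main_dir).iterdir()):
        if not folder.is_dir():
            continue  # ignore stray files next to the per-video folders
        for item in sorted(folder.iterdir()):
            if VIDEO_ID.match(item.name):
                videos.append(item)
    return videos
```

Calling len(list_lrs2_videos("data_root/main")) gives a quick count to compare against what you expect from the dataset.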

Preprocess the dataset for fast training
python preprocess.py --data_root data_root/main --preprocessed_root lrs2_preprocessed/

Additional options, such as batch_size and the number of GPUs to use in parallel, can also be set.

Preprocessed LRS2 folder structure
preprocessed_root (lrs2_preprocessed)
├── list of folders
│   ├── Folders with five-digit numbered video IDs
│   │   ├── *.jpg
│   │   ├── audio.wav
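
A quick sanity check of the preprocessed output can catch clips that failed silently. This sketch flags any second-level folder that lacks audio.wav or has no *.jpg frames; the function name incomplete_clips is hypothetical, and it assumes exactly the two-level layout shown above.

```python
from pathlib import Path


def incomplete_clips(preprocessed_root):
    """Return clip folders lacking audio.wav or any *.jpg frame."""
    bad = []
    for clip in sorted(Path(preprocessed_root).glob("*/*")):
        if not clip.is_dir():
            continue
        has_audio = (clip / "audio.wav").is_file()
        has_frames = any(clip.glob("*.jpg"))
        if not (has_audio and has_frames):
            bad.append(clip)
    return bad
```

For example, incomplete_clips("lrs2_preprocessed/") returns an empty list when every clip folder has both frames and audio.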

Train!

There are two major steps: (i) Train the expert lip-sync discriminator, (ii) Train the Wav2Lip model(s).

Training the expert discriminator

You can download the pre-trained weights if you want to skip this step. To train it:

python color_syncnet_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints>

Training the Wav2Lip models

You can either train the model without the additional visual quality discriminator (< 1 day of training) or with it (~2 days). For the former, run:

python wav2lip_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints> --syncnet_checkpoint_path <path_to_expert_disc_checkpoint>

To train with the visual quality discriminator, run hq_wav2lip_train.py instead. The arguments for both files are similar. In both cases, you can also resume training. Run python wav2lip_train.py --help for more details. You can also set additional, less commonly used hyper-parameters at the bottom of the hparams.py file.
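
The two training steps can be written down as an ordered plan. This is only a sketch: the function name training_plan is hypothetical, and it assumes that hq_wav2lip_train.py accepts the same three flags as wav2lip_train.py (the text above says only that the arguments are similar, so check --help first).

```python
import sys


def training_plan(data_root, expert_dir, wav2lip_dir, expert_checkpoint,
                  use_visual_disc=False):
    """Return the two training commands in the order they should run."""
    # Step (i): train the expert lip-sync discriminator.
    expert_cmd = [sys.executable, "color_syncnet_train.py",
                  "--data_root", str(data_root),
                  "--checkpoint_dir", str(expert_dir)]
    # Step (ii): train Wav2Lip, with or without the visual quality discriminator.
    script = "hq_wav2lip_train.py" if use_visual_disc else "wav2lip_train.py"
    wav2lip_cmd = [sys.executable, script,
                   "--data_root", str(data_root),
                   "--checkpoint_dir", str(wav2lip_dir),
                   "--syncnet_checkpoint_path", str(expert_checkpoint)]
    return [expert_cmd, wav2lip_cmd]
```

Here expert_checkpoint is the checkpoint file you select from step (i), or the downloaded pre-trained weights if you skip that step. Each command in the returned list can be passed to subprocess.run(cmd, check=True).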

Training on datasets other than LRS2

Training on other datasets might require modifications to the code. Please read the following before you raise an issue:

  • You might not get good results by training/fine-tuning on a few minutes of a single speaker. This is a separate research problem, to which we do not have a solution yet. Thus, we would most likely not be able to resolve your issue.
  • You must train the expert discriminator for your own dataset before training Wav2Lip.
  • If you are using your own dataset downloaded from the web, it will in most cases need to be sync-corrected.
  • Be mindful of the FPS of the videos of your dataset. Changes to FPS would need significant code changes.
  • The expert discriminator's eval loss should go down to ~0.25 and the Wav2Lip eval sync loss should go down to ~0.2 to get good results.
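
The loss targets in the last point can be turned into a simple check for a training log or monitoring script. This is an illustrative sketch: the function name losses_look_ready and the tolerance value are assumptions, added because the targets above are approximate.

```python
# Approximate targets quoted in the list above.
EXPERT_EVAL_TARGET = 0.25
SYNC_EVAL_TARGET = 0.2


def losses_look_ready(expert_eval_loss, sync_eval_loss, tolerance=0.02):
    """True when both eval losses are at or near the rough targets quoted above."""
    return (expert_eval_loss <= EXPERT_EVAL_TARGET + tolerance
            and sync_eval_loss <= SYNC_EVAL_TARGET + tolerance)
```

For instance, losses_look_ready(0.4, 0.2) is False, which suggests that the expert discriminator needs more training or better-synced data before the Wav2Lip model is trained.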

When raising an issue on this topic, please let us know that you are aware of all these points.

Evaluation

Will be updated.

License and Citation

The software can only be used for personal/research/non-commercial purposes. Please cite the following paper if you use this code:

@inproceedings{10.1145/3394171.3413532,
author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
title = {A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild},
year = {2020},
isbn = {9781450379885},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3394171.3413532},
doi = {10.1145/3394171.3413532},
booktitle = {Proceedings of the 28th ACM International Conference on Multimedia},
pages = {484–492},
numpages = {9},
keywords = {lip sync, talking face generation, video generation},
location = {Seattle, WA, USA},
series = {MM '20}
}

Acknowledgements

Parts of the code structure are inspired by this TTS repository. We thank the author for this wonderful code. The code for face detection has been taken from the face_alignment repository. We thank the authors for releasing their code and models.

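Key metrics

Overview
  • Name and owner: Rudrabha/Wav2Lip
  • Primary language: Python
  • Languages: Python (2 languages in total)
Owner activity
  • Created: 2020-08-07 08:06:38
  • Last push: 2025-04-19 17:21:58
  • Last commit: 2025-04-19 10:21:58
  • Releases: 0
User engagement
  • Stars: 12k
  • Watchers: 177
  • Forks: 2.5k
  • Commits: 112
  • Issues: 687
  • Open issues: 316
  • Pull requests: 6
  • Open pull requests: 22
  • Closed pull requests: 23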