VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech

Type: Open Source Projects
Last Updated: 2025/10/04
Description: VoiceCraft is an open-source AI tool for zero-shot speech editing and text-to-speech, enabling voice cloning with just a few seconds of reference audio. Achieve state-of-the-art performance on in-the-wild data.
Tags: speech synthesis, voice cloning, audio editing, TTS, zero-shot TTS

Overview of VoiceCraft

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

VoiceCraft is a powerful, open-source tool that brings state-of-the-art performance to both speech editing and zero-shot text-to-speech (TTS). It excels in handling diverse, real-world audio data, including audiobooks, internet videos, and podcasts. What sets VoiceCraft apart is its ability to clone or edit an unseen voice using just a few seconds of reference audio.

What is VoiceCraft?

VoiceCraft is a token infilling neural codec language model designed for high-quality speech editing and TTS tasks. It is zero-shot: it can clone or edit a voice it never saw during training from a short reference recording, without any finetuning on that voice.

How does VoiceCraft work?

VoiceCraft operates as a neural codec language model. Key aspects of its functionality include:

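  • Token Infilling: Instead of regenerating a whole utterance, the model fills in the codec tokens for a selected span, conditioned on the audio on both sides, so the edit blends into its surroundings.
  • Zero-Shot Learning: A few seconds of reference audio are enough to clone or edit a voice the model never saw in training; no per-speaker finetuning is needed.
  • Neural Codec Language Model: Speech is represented as discrete tokens from a neural audio codec, and a language model predicts those tokens, which are then decoded back to a waveform.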
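The token infilling idea can be illustrated with a toy rearrangement in Python. This is a minimal sketch under assumptions, not VoiceCraft's actual implementation: the `MASK` and `EOS` markers, the function names, and the single-span, single-codebook setup are all hypothetical simplifications. The span to be edited is cut out and replaced with a placeholder, and generation is cued after the unedited context, so a left-to-right model has seen both sides of the gap before it produces the fill.

```python
from typing import List, Sequence, Tuple

MASK = "<M>"    # hypothetical placeholder marking where the span was removed
EOS = "<EOS>"   # hypothetical end-of-span marker


def rearrange_for_infilling(
    tokens: Sequence[str], start: int, end: int
) -> Tuple[List[str], List[str]]:
    """Split tokens into (context, target) for the span tokens[start:end].

    context = prefix + [MASK] + suffix + [MASK]
    target  = removed span + [EOS]
    A causal model conditioned on context has seen both the prefix and the
    suffix before it starts predicting target.
    """
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span out of range")
    prefix = list(tokens[:start])
    span = list(tokens[start:end])
    suffix = list(tokens[end:])
    context = prefix + [MASK] + suffix + [MASK]
    target = span + [EOS]
    return context, target


def splice_back(context: Sequence[str], generated: Sequence[str]) -> List[str]:
    """Put generated tokens (ending in EOS) back at the first MASK position."""
    body = list(context[:-1])          # drop the trailing MASK that cued generation
    first_mask = body.index(MASK)      # position of the removed span
    fill = [t for t in generated if t != EOS]
    return body[:first_mask] + fill + body[first_mask + 1:]
```

For example, `rearrange_for_infilling(["a", "b", "c", "d", "e"], 1, 3)` gives the context `["a", "<M>", "d", "e", "<M>"]` and the target `["b", "c", "<EOS>"]`. Splicing a different fill such as `["x", "y", "z", "<EOS>"]` back yields `["a", "x", "y", "z", "d", "e"]`, which shows that the edited span does not need to keep its original length.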

How to use VoiceCraft?

There are several ways to use VoiceCraft:

  • Google Colab: The simplest way to get started is using the provided Google Colab notebooks for speech editing and TTS inference.
  • Docker: Use the provided Docker image for a consistent and reproducible environment.
  • Standalone Script: Integrate VoiceCraft into your projects using the standalone scripts.

Here's a breakdown of each method:

Google Colab

Google Colab provides a straightforward way to start using VoiceCraft. Follow these steps:

  1. Open the Speech Editing Colab notebook or the TTS Inference Colab notebook, depending on the task.
  2. Follow the instructions within the notebook to run the demo.

Docker

Docker provides a consistent environment for running VoiceCraft. Here’s how to set it up:

  1. Clone the repository:

    git clone git@github.com:jasonppy/VoiceCraft.git
    cd VoiceCraft
    
  2. Build the Docker image:

    docker build --tag "voicecraft" .
    
  3. Start the Docker container:

    ./start-jupyter.sh  # linux
    start-jupyter.bat   # windows
    
  4. Open the URL shown in the Docker logs in your browser.

  5. Open inference_tts.ipynb and follow the instructions.

Standalone Script

To use VoiceCraft as a standalone script:

  1. Ensure your environment is set up correctly (see the Environment Setup section).

  2. Run the tts_demo.py or speech_editing_demo.py script. For example, to list the options of the TTS script:

    python3 tts_demo.py -h
    

Why choose VoiceCraft?

  • Zero-Shot Capability: Clones or edits an unseen voice from a few seconds of reference audio.
  • High-Quality Output: Delivers state-of-the-art performance on speech editing and TTS.
  • Versatile: Works well with diverse audio sources.
  • Open-Source: Encourages community contributions and customization.

Who is VoiceCraft for?

VoiceCraft is ideal for:

  • Researchers: Exploring speech synthesis and editing techniques.
  • Developers: Integrating advanced TTS capabilities into applications.
  • Content Creators: Generating high-quality voiceovers and edited audio.
  • Hobbyists: Experimenting with voice cloning and audio manipulation.

Key Features

  • Smart Transcript: Allows users to specify exactly what they want to generate.
  • TTS Mode: Zero-shot TTS for generating speech from text.
  • Edit Mode: Speech editing capabilities for modifying existing audio.
  • Long TTS Mode: Simplifies TTS on long texts.
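Long texts are commonly handled by splitting them into shorter pieces and synthesizing each piece in turn. The sketch below shows one such splitting strategy in Python. Whether VoiceCraft's Long TTS mode works this way is an assumption, and `split_into_chunks` and `max_chars` are hypothetical names chosen for illustration.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be passed to the TTS model separately and the resulting audio concatenated. Splitting on sentence boundaries keeps the joins at natural pauses.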

Environment Setup

To set up your environment for VoiceCraft:

  1. Create a new Conda environment:

    conda create -n voicecraft python=3.9.16
    conda activate voicecraft
    
  2. Install the necessary packages:

    # audiocraft, pinned to a specific commit
    pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
    pip install xformers==0.0.22
    pip install torchaudio==2.0.2 torch==2.0.1
    # system packages (may require sudo); espeak-ng is the phonemizer backend
    apt-get install ffmpeg
    apt-get install espeak-ng
    pip install tensorboard==2.16.2
    pip install phonemizer==3.2.1
    pip install datasets==2.16.0
    pip install torchmetrics==0.11.1
    pip install huggingface_hub==0.22.2
    # Montreal Forced Aligner (MFA) for forced alignment, plus its English models
    conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068
    mfa model download dictionary english_us_arpa
    mfa model download acoustic english_us_arpa
    # ipykernel, needed only for running the Jupyter notebooks
    conda install -n voicecraft ipykernel --no-deps --force-reinstall
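After installation, a quick check that the command-line tools are on `PATH` and that the Python packages can be found may save debugging time. This is a minimal sketch: the tool and module names come from the install commands above, while the helper names are hypothetical, and it is an assumption that each package is importable under the module name listed here.

```python
import importlib.util
import shutil
from typing import Callable, Iterable, List, Optional

# Command-line tools installed via apt-get and conda above.
TOOLS = ["ffmpeg", "espeak-ng", "mfa"]
# Assumed import names of the pip packages installed above.
MODULES = [
    "torch", "torchaudio", "audiocraft", "xformers", "tensorboard",
    "phonemizer", "datasets", "torchmetrics", "huggingface_hub",
]


def missing_tools(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level modules that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Missing tools:", missing_tools(TOOLS))
    print("Missing modules:", missing_modules(MODULES))
```

Run it inside the activated `voicecraft` environment. Two empty lists suggest the installation is complete, although the check does not verify package versions.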
    

Training and Finetuning

VoiceCraft supports training and finetuning on custom datasets. The process involves:

  1. Preparing utterances and their transcripts.
  2. Encoding utterances into codes using Encodec.
  3. Converting transcripts into phoneme sequences.
  4. Creating a manifest file.
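The final step can be sketched in Python as follows. The manifest layout used here (one tab-separated line per utterance, holding its name and code length) and the text format for code files (one whitespace-separated row per codebook) are assumptions made for illustration; the repository's data-preparation scripts define the actual formats. The function names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def code_length(code_file: Path) -> int:
    """Number of frames, assuming one whitespace-separated row per codebook."""
    rows = code_file.read_text().splitlines()
    return len(rows[0].split()) if rows else 0


def build_manifest(codes_dir: Path, phonemes_dir: Path) -> List[str]:
    """Return manifest lines for utterances that have both codes and phonemes."""
    lines: List[str] = []
    for code_file in sorted(codes_dir.glob("*.txt")):
        if not (phonemes_dir / code_file.name).exists():
            continue  # skip utterances without a phoneme transcript
        lines.append(f"{code_file.stem}\t{code_length(code_file)}")
    return lines


def write_manifest(lines: Iterable[str], out_path: Path) -> None:
    """Write one manifest entry per line, creating parent folders if needed."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text("".join(line + "\n" for line in lines))
```

Recording the code length next to each utterance name lets a data loader filter or batch utterances by duration without opening every code file.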

In practice, the easiest way to work with VoiceCraft is to start from the provided scripts and notebooks and adapt them to your use case, whether that is speech editing, TTS, or voice cloning.

VoiceCraft is licensed under CC BY-NC-SA 4.0 (LICENSE-CODE) for the codebase and Coqui Public Model License 1.0.0 (LICENSE-MODEL) for the model weights. It also incorporates code from other repositories under MIT and Apache 2.0 licenses.
