VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech

Type:
Open Source Projects
Last Updated:
2025/10/04
Description:
VoiceCraft is an open-source AI tool for zero-shot speech editing and text-to-speech that can clone a voice from just a few seconds of reference audio. Its authors report state-of-the-art performance on in-the-wild data.
Tags:
speech synthesis
voice cloning
audio editing
TTS
zero-shot TTS

Overview of VoiceCraft

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

VoiceCraft is an open-source tool for speech editing and zero-shot text-to-speech (TTS), and its authors report state-of-the-art performance on both tasks. It is designed for diverse, real-world audio such as audiobooks, internet videos, and podcasts. Its distinguishing feature is that it can clone or edit an unseen voice using only a few seconds of reference audio.

What is VoiceCraft?

VoiceCraft is a token-infilling neural codec language model for speech editing and TTS. It works zero-shot: it handles a voice it has never seen during training by conditioning on a few seconds of reference audio, with no finetuning on that speaker.

How does VoiceCraft work?

VoiceCraft operates as a neural codec language model. Key aspects of its functionality include:

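  • Token Infilling: Speech is represented as discrete codec tokens, and the model generates the missing tokens for an edited or new span, conditioned on the surrounding audio and the target transcript.
  • Zero-Shot Operation: A few seconds of reference audio are enough to condition the model on a new voice, so no speaker-specific training data is needed.
  • Neural Codec Language Model: A language model predicts audio codec tokens (the training pipeline encodes audio with EnCodec), which are then decoded back into a waveform.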
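
The token-infilling idea can be illustrated with a toy sketch. The code below is not VoiceCraft's implementation: the function names, the `MASK` placeholder, and the integer tokens are assumptions chosen for illustration. It shows one common way to turn infilling into left-to-right generation: cut out the span to be edited, leave a marker in its place, and move the span to the end of the sequence so that a model could generate it after seeing the context on both sides.

```python
from typing import List, Tuple

MASK = -1  # hypothetical placeholder id; assumed never to occur as a real token


def rearrange_for_infilling(tokens: List[int], span: Tuple[int, int]) -> List[int]:
    """Move tokens[start:end] to the end, leaving MASK markers behind."""
    start, end = span
    if not 0 <= start < end <= len(tokens):
        raise ValueError(f"invalid span {span} for {len(tokens)} tokens")
    prefix, target, suffix = tokens[:start], tokens[start:end], tokens[end:]
    # Layout: context before, marker, context after, marker, span to generate.
    return prefix + [MASK] + suffix + [MASK] + target


def restore_order(rearranged: List[int]) -> List[int]:
    """Undo rearrange_for_infilling once the span has been filled in."""
    first = rearranged.index(MASK)
    second = rearranged.index(MASK, first + 1)
    prefix = rearranged[:first]
    suffix = rearranged[first + 1:second]
    target = rearranged[second + 1:]
    return prefix + target + suffix


if __name__ == "__main__":
    codes = [10, 11, 12, 13, 14, 15]
    rearranged = rearrange_for_infilling(codes, (2, 4))
    print(rearranged)
    print(restore_order(rearranged))
```

In this layout, everything before the second marker is context, and everything after it is what a model would be asked to produce.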

How to use VoiceCraft?

There are several ways to use VoiceCraft:

  • Google Colab: The simplest way to get started is using the provided Google Colab notebooks for speech editing and TTS inference.
  • Docker: Use the provided Docker image for a consistent and reproducible environment.
  • Standalone Script: Integrate VoiceCraft into your projects using the standalone scripts.

Here's a breakdown of each method:

Google Colab

Google Colab provides a straightforward way to start using VoiceCraft. Follow these steps:

  1. Open the Speech Editing Colab notebook or the TTS Inference Colab notebook, depending on the task.
  2. Follow the instructions within the notebook to run the demo.

Docker

Docker provides a consistent environment for running VoiceCraft. Here’s how to set it up:

  1. Clone the repository:

    git clone git@github.com:jasonppy/VoiceCraft.git
    cd VoiceCraft
    
  2. Build the Docker image:

    docker build --tag "voicecraft" .
    
  3. Start the Docker container:

    ./start-jupyter.sh  # linux
    start-jupyter.bat   # windows
    
  4. Open the URL shown in the Docker logs in your browser.

  5. Open inference_tts.ipynb and follow the instructions.

Standalone Script

To use VoiceCraft as a standalone script:

  1. Ensure your environment is set up correctly (see the Environment Setup section).

  2. Use the tts_demo.py and speech_editing_demo.py scripts.

    python3 tts_demo.py -h
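
If you want to call the demo script from another Python program, a minimal sketch could look like the following. It assumes you run it from the repository root with the `voicecraft` environment active. The helper names `build_demo_command` and `show_help` are hypothetical, and only the `-h` flag shown above is used, since the script's other options are not documented here.

```python
import subprocess
import sys
from typing import List, Sequence


def build_demo_command(script: str, args: Sequence[str] = ()) -> List[str]:
    """Build an argument list that runs `script` with the current interpreter."""
    return [sys.executable, script, *args]


def show_help(script: str = "tts_demo.py") -> str:
    """Run `<script> -h` and return its help text (raises if the call fails)."""
    result = subprocess.run(
        build_demo_command(script, ["-h"]),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only print the command here; call show_help() from inside the repository.
    print(build_demo_command("tts_demo.py", ["-h"]))
```

Passing the command as a list rather than a single string avoids shell quoting problems when arguments contain spaces.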
    

Why choose VoiceCraft?

  • Zero-Shot Capability: Adapts to new voices quickly with minimal data.
  • High-Quality Output: Delivers state-of-the-art performance on speech editing and TTS.
  • Versatile: Works well with diverse audio sources.
  • Open-Source: Encourages community contributions and customization.

Who is VoiceCraft for?

VoiceCraft is ideal for:

  • Researchers: Exploring speech synthesis and editing techniques.
  • Developers: Integrating advanced TTS capabilities into applications.
  • Content Creators: Generating high-quality voiceovers and edited audio.
  • Hobbyists: Experimenting with voice cloning and audio manipulation.

Key Features:

  • Smart Transcript: Lets you write only the text you want generated.
  • TTS Mode: Zero-shot TTS for generating speech from text.
  • Edit Mode: Speech editing capabilities for modifying existing audio.
  • Long TTS Mode: Simplifies TTS on long texts.

Environment Setup:

To set up your environment for VoiceCraft:

  1. Create a new Conda environment:

    conda create -n voicecraft python=3.9.16
    conda activate voicecraft
    
  2. Install the necessary packages:

    # Python packages (pinned versions)
    pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
    pip install xformers==0.0.22
    pip install torchaudio==2.0.2 torch==2.0.1
    # System packages: skip if already installed; may require root privileges
    apt-get install ffmpeg
    apt-get install espeak-ng  # backend for phonemizer
    pip install tensorboard==2.16.2
    pip install phonemizer==3.2.1
    pip install datasets==2.16.0
    pip install torchmetrics==0.11.1
    pip install huggingface_hub==0.22.2
    # Montreal Forced Aligner (MFA) and its English dictionary and acoustic model
    conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068
    mfa model download dictionary english_us_arpa
    mfa model download acoustic english_us_arpa
    # Jupyter kernel for this environment (needed only to run the notebooks)
    conda install -n voicecraft ipykernel --no-deps --force-reinstall
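
After installation, a quick sanity check can confirm that the pieces are visible from the active environment. The sketch below is an assumption-laden helper, not part of VoiceCraft: the names `check_tools` and `check_modules` are hypothetical, and it assumes each package's import name matches the name used in the install commands above.

```python
import importlib.util
import shutil
from typing import Dict, Iterable

# Command-line tools installed in the steps above.
TOOLS = ("ffmpeg", "espeak-ng", "mfa")
# Import names assumed to match the installed package names.
MODULES = (
    "torch", "torchaudio", "xformers", "audiocraft", "phonemizer",
    "datasets", "torchmetrics", "huggingface_hub", "tensorboard",
)


def check_tools(names: Iterable[str] = TOOLS) -> Dict[str, bool]:
    """Report whether each executable can be found on PATH."""
    return {name: shutil.which(name) is not None for name in names}


def check_modules(names: Iterable[str] = MODULES) -> Dict[str, bool]:
    """Report whether each top-level module can be located, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for kind, report in (("tool", check_tools()), ("module", check_modules())):
        for name, ok in report.items():
            print(f"{kind:<8}{name:<18}{'ok' if ok else 'MISSING'}")
```

The check only tests presence, not versions, so a reported `ok` does not guarantee that the pinned versions are the ones installed.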
    

Training and Finetuning:

VoiceCraft supports training and finetuning on custom datasets. The process involves:

  1. Preparing utterances and their transcripts.
  2. Encoding utterances into discrete codes using EnCodec.
  3. Converting transcripts into phoneme sequences.
  4. Creating a manifest file.
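
The last step can be sketched as follows. The exact manifest layout VoiceCraft expects is not described here, so the two-column, tab-separated format below (utterance ID and number of codec frames) is an assumption for illustration, as are the function names. Check the repository's data-preparation scripts for the real format before training.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple


def write_manifest(entries: Iterable[Tuple[str, int]], path: Path) -> None:
    """Write one 'utterance_id<TAB>num_frames' line per utterance (assumed layout)."""
    lines = []
    for utt_id, num_frames in entries:
        if num_frames <= 0:
            raise ValueError(f"{utt_id}: frame count must be positive")
        lines.append(f"{utt_id}\t{num_frames}\n")
    # Validate everything first so a bad entry never leaves a partial file.
    path.write_text("".join(lines), encoding="utf-8")


def read_manifest(path: Path) -> List[Tuple[str, int]]:
    """Parse a manifest written by write_manifest."""
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        utt_id, num_frames = line.split("\t")
        rows.append((utt_id, int(num_frames)))
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        manifest = Path(tmp) / "train.txt"
        write_manifest([("utt_0001", 412), ("utt_0002", 958)], manifest)
        print(read_manifest(manifest))
```

Storing the frame count next to each utterance lets a data loader filter or batch by length without opening the encoded files.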

To get the most out of VoiceCraft, start from the provided scripts and notebooks and adapt them to your use case, whether that is speech editing, TTS, or voice cloning.

VoiceCraft is licensed under CC BY-NC-SA 4.0 (LICENSE-CODE) for the codebase and Coqui Public Model License 1.0.0 (LICENSE-MODEL) for the model weights. It also incorporates code from other repositories under MIT and Apache 2.0 licenses.

Best Alternative Tools to "VoiceCraft"

Mureka

Discover the AI music generator that creates unique and customizable songs, lyrics and tracks for any project. Perfect for content creators, musicians, and filmmakers, our intelligent algorithm uses advanced technology to generate royalty-free music tailored to your needs. Explore the future of music composition with Mureka’s innovative AI tools, designed to inspire creativity and streamline production. Experience seamless integration and exceptional quality with our cutting-edge solutions.

music generation
AI composition
VidMax AI

VidMax AI is an AI video generator that allows you to create viral faceless videos in minutes. Turn ideas into viral faceless videos instantly with AI-powered video creation, voice cloning, auto-posting, and templates. Join 100,000+ creators making engaging content.

AI video creation
faceless videos
Videotok

Videotok is an AI video generator that turns text, images, or audio into engaging videos for TikTok, Instagram, YouTube, and more. Create ads, faceless reels, and fully customizable content in minutes.

AI video creation
Voice AI

Experience cutting-edge Voice AI with our free Text to Speech generator and converter. Enjoy fast, high-quality voice synthesis powered by advanced AI models like Deepseek, Hailuo, Grok, and Kling for natural, expressive speech in various applications.

text-to-speech synthesis
Deepfake Detector

Deepfake Detector is an AI-based tool designed to detect manipulated videos, audios, and images with 95% accuracy. Protect yourself from deepfake scams on platforms like YouTube and WhatsApp by verifying media authenticity quickly.

deepfake verification
KoboldCpp

KoboldCpp: Run GGUF models easily for AI text & image generation with a KoboldAI UI. Single file, zero install. Supports CPU/GPU, STT, TTS, & Stable Diffusion.

text generation
image generation
koolio.ai

koolio.ai lets you take a concept to a completed podcast in a matter of minutes. We help you edit podcasts and make quality content painlessly. Whether it's transcribing audio, collaborating with others, auto-selecting sound effects or music based on context to enhance your podcast, or performing audio operations and manipulations easily, koolio.ai provides a simple, web-based, easy to use and intuitive interface for you to focus on your creativity.

podcast editing
audio enhancement
Vozo

Vozo AI empowers creators to generate, edit, and localize talking videos with AI-driven tools for translation, dubbing, and lip sync in over 60 languages. Fast, accurate, and studio-free for global reach.

video translation
lip sync
Autocalls.ai

Automate incoming and outgoing phone calls with Autocalls.ai, a no-code AI platform. Deploy AI voice agents in 100+ languages to improve customer support and generate leads.

AI voice agent
phone automation
CreateWise AI

CreateWise AI supercharges your podcast with AI! Generate show notes, summaries, social posts, and engaging clips instantly. Try it free and save hours on editing!

podcast automation
Kits AI

Kits AI offers studio-quality AI music tools for producers, including voice cloning, vocal removal, and AI mastering, ensuring 100% royalty-free usage.

AI music production
voice cloning
Fotol AI

Fotol AI provides a gateway to AGI, offering powerful AI solutions for video, image, speech, music, 3D asset generation, and conversation. Dream it, make it!

AI video
AI image
AI music
Narakeet

Narakeet is a text-to-speech and video creation tool that helps you easily create voiceovers and narrated videos using realistic AI voices. Convert text, documents, and presentations into engaging audio and video content.

text-to-speech
video maker
voiceover
Voiceslab

Voiceslab offers instant AI voice cloning to create natural-sounding replicas of your voice for podcasts, videos, and audiobooks. Capture tone, accent, and style with high-quality synthesis supporting 8 languages—no credit card required to start.

voice cloning
AI synthesis
DeckBird.ai

DeckBird.ai is an AI studio for creating smart video presentations from PPTs, images, and videos. Add video, voiceover, user interactions, embed and share to boost marketing.

video presentation
AI voiceover