VoiceCraft

Type: Open Source Projects

Last Updated: 2025/10/04

Description: VoiceCraft is an open-source AI tool for zero-shot speech editing and text-to-speech, enabling voice cloning with just a few seconds of reference audio. It achieves state-of-the-art performance on in-the-wild data.

Tags: speech synthesis, voice cloning, audio editing, TTS, zero-shot TTS

Overview of VoiceCraft

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

VoiceCraft is a powerful, open-source tool that brings state-of-the-art performance to both speech editing and zero-shot text-to-speech (TTS). It excels in handling diverse, real-world audio data, including audiobooks, internet videos, and podcasts. What sets VoiceCraft apart is its ability to clone or edit an unseen voice using just a few seconds of reference audio.

What is VoiceCraft?

VoiceCraft is a token infilling neural codec language model designed for high-quality speech editing and TTS tasks. It is zero-shot, meaning it can edit or synthesize speech in a voice it never saw during training, given only a short reference recording and no fine-tuning.

How does VoiceCraft work?

VoiceCraft operates as a neural codec language model. Key aspects of its functionality include:

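Token Infilling: Speech is represented as a sequence of discrete codec tokens. To edit a span, the model generates replacement tokens for that span conditioned on the audio on both sides of it, so the edit blends into the surrounding speech.
Zero-Shot Learning: A few seconds of reference audio from an unseen speaker are supplied as a prompt at inference time; no fine-tuning on that speaker is required.
Neural Codec Language Model: The model autoregressively predicts audio codec tokens (Encodec codes) rather than waveforms; a codec decoder then turns the predicted tokens back into audio.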
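
To make the token infilling idea concrete, the sketch below shows the general rearrangement trick used by infilling language models: the span to be edited is cut out, replaced with a mask placeholder, and moved to the end of the sequence, so a left-to-right model can generate it while seeing the context on both sides. This is a simplified illustration on toy integer tokens, not VoiceCraft's actual implementation. The function name `rearrange_for_infilling` and the `<MASK>` placeholder are hypothetical.

```python
from typing import List, Union

Token = Union[int, str]


def rearrange_for_infilling(tokens: List[Token], start: int, end: int,
                            mask: str = "<MASK>") -> List[Token]:
    """Move tokens[start:end] to the end of the sequence.

    The result is: prefix + [mask] + suffix + [mask] + span.
    A left-to-right model trained on this layout sees both the prefix and
    the suffix before it has to produce the span.
    """
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span must lie inside the sequence")
    prefix, span, suffix = tokens[:start], tokens[start:end], tokens[end:]
    return prefix + [mask] + suffix + [mask] + span


if __name__ == "__main__":
    codes = [10, 11, 12, 13, 14, 15]  # toy stand-ins for codec tokens
    print(rearrange_for_infilling(codes, 2, 4))
```

At inference time the same layout would be built with the span left empty, and the model would generate the missing tokens after the final mask placeholder.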

How to use VoiceCraft?

There are several ways to use VoiceCraft:

Google Colab: The simplest way to get started is using the provided Google Colab notebooks for speech editing and TTS inference.
Docker: Use the provided Docker image for a consistent and reproducible environment.
Standalone Script: Integrate VoiceCraft into your projects using the standalone scripts.

Here's a breakdown of each method:

Google Colab

Google Colab is the quickest way to try VoiceCraft, since nothing has to be installed locally:

Open the Speech Editing Colab notebook or the TTS Inference Colab notebook, depending on the task.
Follow the instructions within the notebook to run the demo.

Docker

Docker provides a consistent environment for running VoiceCraft. Here’s how to set it up:

Clone the repository:

```
git clone git@github.com:jasonppy/VoiceCraft.git
cd VoiceCraft
```

Build the Docker image:
```
docker build --tag "voicecraft" .
```

Start the Docker container:

```
./start-jupyter.sh  # linux
start-jupyter.bat   # windows
```

Open the URL shown in the Docker logs in your browser.
Open inference_tts.ipynb and follow the instructions.

Standalone Script

To use VoiceCraft as a standalone script:

Ensure your environment is set up correctly (see the Environment Setup section).
Run tts_demo.py or speech_editing_demo.py. Pass -h to list the available options:
```
python3 tts_demo.py -h
```

Why choose VoiceCraft?

Zero-Shot Capability: Clones or edits an unseen voice from a few seconds of reference audio, with no fine-tuning.
High-Quality Output: Delivers state-of-the-art performance on speech editing and TTS.
Versatile: Works on diverse real-world audio such as audiobooks, internet videos, and podcasts.
Open-Source: Encourages community contributions and customization.

Who is VoiceCraft for?

VoiceCraft is ideal for:

Researchers: Exploring speech synthesis and editing techniques.
Developers: Integrating advanced TTS capabilities into applications.
Content Creators: Generating high-quality voiceovers and edited audio.
Hobbyists: Experimenting with voice cloning and audio manipulation.

Key Features:

Smart Transcript: Allows users to specify exactly what they want to generate.
TTS Mode: Zero-shot TTS for generating speech from text.
Edit Mode: Speech editing capabilities for modifying existing audio.
Long TTS Mode: Simplifies TTS on long texts.

Environment Setup:

To set up your environment for VoiceCraft:

Create a new Conda environment:

```
conda create -n voicecraft python=3.9.16
conda activate voicecraft
```

Install the necessary packages:

```
# Python packages
pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
pip install xformers==0.0.22
pip install torchaudio==2.0.2 torch==2.0.1
# System packages (Debian/Ubuntu; may require sudo)
apt-get install ffmpeg
apt-get install espeak-ng
pip install tensorboard==2.16.2
pip install phonemizer==3.2.1
pip install datasets==2.16.0
pip install torchmetrics==0.11.1
pip install huggingface_hub==0.22.2
# Montreal Forced Aligner and its English models
conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068
mfa model download dictionary english_us_arpa
mfa model download acoustic english_us_arpa
# Jupyter kernel support, needed only to run the notebooks
conda install -n voicecraft ipykernel --no-deps --force-reinstall
```
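
After installation, a quick sanity check can confirm that the command-line tools from the steps above are reachable before running any demo. The sketch below is a small helper written for this guide, not part of VoiceCraft; the function name `find_missing_tools` is hypothetical, and it assumes the tools are exposed on `PATH` under the executable names `ffmpeg`, `espeak-ng`, and `mfa`.

```python
import shutil
from typing import Iterable, List

# Executables installed by the setup steps (names assumed to match the packages).
REQUIRED_TOOLS = ("ffmpeg", "espeak-ng", "mfa")


def find_missing_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Run it inside the activated `voicecraft` environment, since `mfa` is installed there by Conda.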

Training and Finetuning:

VoiceCraft supports training and finetuning on custom datasets. The process involves:

Preparing utterances and their transcripts.
Encoding utterances into codes using Encodec.
Converting transcripts into phoneme sequences.
Creating a manifest file.
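
The last step can be pictured with a small sketch. It assumes a manifest is a plain-text file with one utterance per line, holding the utterance ID and the number of codec frames separated by a tab, and that very short utterances are filtered out. That layout, the `min_len` threshold, and the function names are assumptions made for illustration; check the repository's data preparation scripts for the format VoiceCraft actually expects.

```python
from pathlib import Path
from typing import Dict, List


def build_manifest_lines(code_lengths: Dict[str, int], min_len: int = 1) -> List[str]:
    """Build 'utterance_id<TAB>num_frames' lines, sorted by utterance ID.

    Utterances with fewer than `min_len` codec frames are skipped.
    """
    return [
        f"{utt_id}\t{length}"
        for utt_id, length in sorted(code_lengths.items())
        if length >= min_len
    ]


def write_manifest(path: str, lines: List[str]) -> None:
    """Write the manifest lines to `path`, one per line."""
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical utterance IDs mapped to their codec sequence lengths.
    lengths = {"spk1_utt2": 310, "spk1_utt1": 425, "spk2_utt1": 0}
    write_manifest("train_manifest.txt", build_manifest_lines(lengths))
```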

The most practical way to use VoiceCraft is to start from the provided scripts and notebooks and adapt them to your use case, whether that is speech editing, TTS, or voice cloning.

VoiceCraft is licensed under CC BY-NC-SA 4.0 (LICENSE-CODE) for the codebase and Coqui Public Model License 1.0.0 (LICENSE-MODEL) for the model weights. It also incorporates code from other repositories under MIT and Apache 2.0 licenses.

Best Alternative Tools to "VoiceCraft"

Typecast

Typecast is an AI voice generator offering 600+ customizable voices, voice cloning, video editing, and talking avatars for content creators.

Tags: voice-synthesis, emotional-TTS

AI Avatar Generator

Transform photos & videos into realistic talking AI avatars instantly. Professional videos with lip-sync in 40+ languages. Start creating for free today!

Tags: talking avatars, lip-sync AI

Listnr AI

Create and automate faceless videos effortlessly with Listnr AI. Our AI-powered platform generates and posts fresh content daily to grow your TikTok and YouTube channels. Trusted by millions!

Tags: faceless video generation

AudioPod AI

AudioPod AI is an all-in-one AI audio workstation and production suite. Generate voiceovers, split stems, create music, auto dub content and more. Includes text-to-speech, speech-to-text, and AI music generation.

Tags: text to speech, speech to text

Trump AI Voice Generator

Your Donald Trump AI voice generator for text-to-speech and video, with lifelike cadence and quick exports for parodies and social media.

Tags: voice cloning, celebrity imitation

Audiobox

Audiobox is Meta's new foundation research model for audio generation. It can generate voices and sound effects using a combination of voice inputs and natural language text prompts.

Tags: audio generation, voice synthesis

AIVocal

AIVocal is an all-in-one AI platform for voice generation, cloning, podcasting, and transcription. Create realistic speech, audiobooks, and more with free tools in 140+ languages for creators and professionals.

Tags: voice generation, speech synthesis

AIEasy.life

AIEasy.life is an AI tools platform that provides a free directory and discovery experience. Find your favorite AI tools with AIEasy.life.

Tags: AI tools directory, AI platform

Dub AI

Dub AI empowers content creators to translate and dub videos effortlessly using AI voice cloning and translation, expanding reach to global audiences in over 30 languages with natural-sounding results.

Tags: video dubbing, voice cloning

All Voice Lab

All Voice Lab offers advanced AI text-to-speech, voice cloning, and voice changer tools for realistic, multilingual audio. Create engaging voiceovers with emotional expressiveness. Start your free trial today.

Tags: voice cloning, text-to-speech

Voice AI

Experience cutting-edge Voice AI with our free Text to Speech generator and converter. Enjoy fast, high-quality voice synthesis powered by advanced AI models like Deepseek, Hailuo, Grok, and Kling for natural, expressive speech in various applications.

Tags: text-to-speech synthesis

SteosVoice

SteosVoice is an AI voice generator offering ultra-realistic speech synthesis for content creators. Dub videos, create podcasts, and monetize your voice with 800+ voices.

Tags: text to speech, AI voice

ElevenLabs

ElevenLabs offers realistic AI voice generation with 1000+ voices in 70+ languages. Perfect for audiobooks, videos, podcasts, and voice cloning applications.

Tags: voice synthesis, audio generation

LMNT

LMNT delivers fast, lifelike, affordable AI speech. Enjoy studio-quality voice clones and low latency streaming ideal for conversational apps, games, and agents. Engineered for reliability, scale effortlessly with technology built by an ex-Google team.

Tags: voice cloning, low-latency streaming
