VoiceCraft
Overview of VoiceCraft
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
VoiceCraft is an open-source model that reports state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS). It handles diverse, real-world audio, including audiobooks, internet videos, and podcasts. It can clone or edit an unseen voice using only a few seconds of reference audio.
What is VoiceCraft?
VoiceCraft is a token infilling neural codec language model designed for speech editing and TTS. It is zero-shot: it can clone or edit a voice it never saw during training from a few seconds of reference audio, with no additional training or finetuning on that voice.
How does VoiceCraft work?
Key aspects of its functionality include:
- Token Infilling: The model generates the tokens for a missing or edited span conditioned on the surrounding audio, so that an edit blends with the rest of the recording.
- Zero-Shot Operation: A few seconds of reference audio are enough to clone or edit a new voice. No finetuning on that voice is required.
- Neural Codec Language Model: Speech is represented as discrete codes produced by a neural audio codec (Encodec), and a language model predicts those codes.
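To make the token infilling idea concrete, here is a toy sketch of how a sequence can be rearranged so that an autoregressive model sees the context on both sides of an edit before generating the edited span. It illustrates the general technique only and is not VoiceCraft's actual code: real inputs are codec codes rather than strings, and the names `MASK`, `EOS`, `rearrange_for_infilling`, and `splice` are hypothetical.

```python
from typing import List, Tuple

MASK = "<M>"    # hypothetical placeholder marking the edited span
EOS = "<EOS>"   # hypothetical marker that ends the generated span


def rearrange_for_infilling(
    tokens: List[str], start: int, end: int
) -> Tuple[List[str], List[str]]:
    """Return (context, target) for infilling tokens[start:end].

    context: the unedited tokens with the span replaced by MASK, followed by
             a second MASK that signals where generation begins.
    target:  the span the model should produce, terminated by EOS.
    """
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span out of range")
    context = tokens[:start] + [MASK] + tokens[end:] + [MASK]
    target = tokens[start:end] + [EOS]
    return context, target


def splice(
    tokens: List[str], start: int, end: int, generated: List[str]
) -> List[str]:
    """Put generated tokens back in place of the original span."""
    body = generated[:-1] if generated and generated[-1] == EOS else generated
    return tokens[:start] + body + tokens[end:]


if __name__ == "__main__":
    original = ["a", "b", "c", "d", "e"]
    context, target = rearrange_for_infilling(original, 1, 3)
    print(context)  # the model conditions on this
    print(target)   # the model learns to generate this
    print(splice(original, 1, 3, ["x", "y", "z", EOS]))
```

Because the tokens after the span appear in the context before generation starts, the model can condition on audio from both sides of the edit. The replacement span may also be longer or shorter than the original.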
How to use VoiceCraft?
There are several ways to use VoiceCraft:
- Google Colab: The simplest way to get started is using the provided Google Colab notebooks for speech editing and TTS inference.
- Docker: Use the provided Docker image for a consistent and reproducible environment.
- Standalone Script: Integrate VoiceCraft into your projects using the standalone scripts.
Here's a breakdown of each method:
Google Colab
Google Colab provides a straightforward way to start using VoiceCraft. Follow these steps:
- Open the Speech Editing Colab notebook or the TTS Inference Colab notebook, depending on the task.
- Follow the instructions within the notebook to run the demo.
Docker
Docker provides a consistent environment for running VoiceCraft. Here’s how to set it up:
Clone the repository:
    git clone git@github.com:jasonppy/VoiceCraft.git
    cd VoiceCraft
Build the Docker image:
    docker build --tag "voicecraft" .
Start the Docker container:
    ./start-jupyter.sh   # linux
    start-jupyter.bat    # windows
Open the URL shown in the Docker logs in your browser.
Open inference_tts.ipynb and follow the instructions.
Standalone Script
To use VoiceCraft as a standalone script:
Ensure your environment is set up correctly (see the Environment Setup section).
Use the tts_demo.py and speech_editing_demo.py scripts:
    python3 tts_demo.py -h
Why choose VoiceCraft?
- Zero-Shot Capability: Clones or edits unseen voices from a few seconds of reference audio, with no finetuning.
- High-Quality Output: Delivers state-of-the-art performance on speech editing and TTS.
- Versatile: Works well with diverse audio sources.
- Open-Source: Encourages community contributions and customization.
Who is VoiceCraft for?
VoiceCraft is ideal for:
- Researchers: Exploring speech synthesis and editing techniques.
- Developers: Integrating advanced TTS capabilities into applications.
- Content Creators: Generating high-quality voiceovers and edited audio.
- Hobbyists: Experimenting with voice cloning and audio manipulation.
Key Features:
- Smart Transcript: Allows users to specify exactly what they want to generate.
- TTS Mode: Zero-shot TTS for generating speech from text.
- Edit Mode: Speech editing capabilities for modifying existing audio.
- Long TTS Mode: Simplifies TTS on long texts.
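The source does not describe how Long TTS Mode works internally. One common way to make TTS on long texts manageable is to split the text at sentence boundaries and synthesize each chunk separately. The sketch below shows only that splitting step, under that assumption; `split_into_chunks` and `max_chars` are hypothetical names and not part of VoiceCraft.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("One. Two! Three?", max_chars=9):
        print(chunk)
```

Each chunk could then be passed to the TTS model in turn and the resulting audio concatenated.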
Environment Setup:
To set up your environment for VoiceCraft:
Create a new Conda environment:
    conda create -n voicecraft python=3.9.16
    conda activate voicecraft
Install the necessary packages:
    pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
    pip install xformers==0.0.22
    pip install torchaudio==2.0.2 torch==2.0.1
    apt-get install ffmpeg
    apt-get install espeak-ng
    pip install tensorboard==2.16.2
    pip install phonemizer==3.2.1
    pip install datasets==2.16.0
    pip install torchmetrics==0.11.1
    pip install huggingface_hub==0.22.2
    conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068
    mfa model download dictionary english_us_arpa
    mfa model download acoustic english_us_arpa
    conda install -n voicecraft ipykernel --no-deps --force-reinstall
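Once the install commands have run, a quick check can confirm that the pinned packages and command-line tools are actually present. The script below is a minimal sketch and not part of VoiceCraft. The version pins are copied from the commands above, the helper names are hypothetical, and it assumes the tools are exposed on PATH as `ffmpeg`, `espeak-ng`, and `mfa`.

```python
import shutil
from importlib import metadata
from typing import Dict, Iterable, List

# Version pins copied from the pip commands in the setup steps.
PINNED = {
    "xformers": "0.0.22",
    "torchaudio": "2.0.2",
    "torch": "2.0.1",
    "tensorboard": "2.16.2",
    "phonemizer": "3.2.1",
    "datasets": "2.16.0",
    "torchmetrics": "0.11.1",
    "huggingface_hub": "0.22.2",
}
# Executables the setup steps install (assumed to be on PATH).
TOOLS = ["ffmpeg", "espeak-ng", "mfa"]


def check_packages(pinned: Dict[str, str]) -> Dict[str, str]:
    """Return {package: problem} for missing or mismatched packages."""
    problems: Dict[str, str] = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # Some builds report a local suffix such as "+cu117"; ignore it.
        if found.split("+")[0] != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


def check_tools(tools: Iterable[str]) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    for name, problem in check_packages(PINNED).items():
        print(f"package {name}: {problem}")
    for tool in check_tools(TOOLS):
        print(f"tool {tool}: not found on PATH")
```

Run it inside the activated voicecraft environment. No output means every check passed.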
Training and Finetuning:
VoiceCraft supports training and finetuning on custom datasets. The process involves:
- Preparing utterances and their transcripts.
- Encoding utterances into codes using Encodec.
- Converting transcripts into phoneme sequences.
- Creating a manifest file.
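The four preparation steps can be sketched as a small pipeline. This is an illustration under stated assumptions, not VoiceCraft's own data preparation script: the file layout and the manifest format (utterance id and number of code frames, tab-separated) are hypothetical, and the `encode` and `phonemize` callables stand in for a real Encodec encoder and a real phonemizer.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

Utterance = Tuple[str, str, str]  # (utterance id, audio path, transcript)


def build_manifest(
    utterances: Iterable[Utterance],
    encode: Callable[[str], List[List[int]]],  # audio path -> codes per codebook
    phonemize: Callable[[str], List[str]],     # transcript -> phoneme sequence
    out_dir: str,
) -> Path:
    """Write codes, phonemes, and a manifest file; return the manifest path."""
    out = Path(out_dir)
    (out / "codes").mkdir(parents=True, exist_ok=True)
    (out / "phonemes").mkdir(parents=True, exist_ok=True)
    rows: List[str] = []
    for utt_id, audio_path, transcript in utterances:
        codes = encode(audio_path)
        phones = phonemize(transcript)
        # One line per codebook, codes separated by spaces.
        code_text = "\n".join(" ".join(map(str, book)) for book in codes)
        (out / "codes" / f"{utt_id}.txt").write_text(code_text)
        (out / "phonemes" / f"{utt_id}.txt").write_text(" ".join(phones))
        n_frames = len(codes[0]) if codes else 0
        rows.append(f"{utt_id}\t{n_frames}")
    manifest = out / "manifest.tsv"
    manifest.write_text("\n".join(rows) + "\n")
    return manifest


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without audio files or models.
    fake_encode = lambda path: [[1, 2, 3], [4, 5, 6]]
    fake_phonemize = lambda text: text.upper().split()
    with tempfile.TemporaryDirectory() as tmp:
        path = build_manifest(
            [("utt1", "a.wav", "hi there")], fake_encode, fake_phonemize, tmp
        )
        print(path.read_text())
```

Passing the encoder and phonemizer in as arguments keeps the sketch independent of any particular library version. For real data, check the repository's data preparation scripts for the format they expect.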
The best way to leverage VoiceCraft is by utilizing the provided scripts and notebooks, and adapting them to your specific use case. Whether it's speech editing, TTS, or voice cloning, VoiceCraft offers a robust and flexible solution.
VoiceCraft is licensed under CC BY-NC-SA 4.0 (LICENSE-CODE) for the codebase and Coqui Public Model License 1.0.0 (LICENSE-MODEL) for the model weights. It also incorporates code from other repositories under MIT and Apache 2.0 licenses.