AniPortrait
Overview of AniPortrait
What is AniPortrait?
AniPortrait is an innovative open-source framework designed for audio-driven synthesis of photorealistic portrait animations. Developed by Huawei Wei, Zejun Yang, and Zhisheng Wang from Tencent Games Zhiji and Tencent, this tool leverages advanced AI techniques to create high-quality animated portraits from a single reference image and audio or video inputs. Whether you're animating a static portrait with speech audio or reenacting facial expressions from a source video, AniPortrait delivers lifelike results that capture subtle nuances like lip-sync and head movements. Ideal for content creators, game developers, and researchers in computer vision, it stands out in the realm of AI video generation tools by focusing on portrait-specific animations.
Released on GitHub under the Apache-2.0 license, AniPortrait has garnered over 5,000 stars, reflecting its popularity in the AI community. The project emphasizes accessibility, with pre-trained models, detailed installation guides, and even a Gradio web UI for easy testing.
How Does AniPortrait Work?
At its core, AniPortrait employs a multi-stage pipeline that integrates diffusion models, audio processing, and pose estimation to generate animations. The framework builds on established models like Stable Diffusion V1.5 and wav2vec2 for feature extraction, ensuring robust handling of audio-visual synchronization.
Key Components and Workflow
- Input Processing: Start with a reference portrait image. For audio-driven mode, audio inputs are processed using wav2vec2-base-960h to extract speech features. In video modes, source videos are converted to pose sequences via keypoint extraction.
- Pose Generation: The audio2pose model generates head pose sequences (e.g., pose_temp.npy) from audio, enabling control over facial orientations. For face reenactment, a pose retarget strategy maps movements from the source video to the reference image, supporting substantial pose differences.
- Animation Synthesis: Utilizes denoising UNet, reference UNet, and motion modules to synthesize frames. The pose guider ensures alignment, while optional frame interpolation accelerates inference.
- Output Refinement: Generates videos at resolutions like 512x512, with options for acceleration using film_net_fp16.pt to reduce processing time.
This modular approach allows for self-driven animations (using predefined poses), face reenactment (transferring expressions), and fully audio-driven synthesis, making it versatile for various AI portrait animation scenarios.
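To give a feel for the audio feature extraction stage, here is a minimal sketch using the Hugging Face transformers and torchaudio packages with the wav2vec2-base-960h checkpoint mentioned above. It only illustrates the kind of speech features AniPortrait's audio modules consume; it is not the project's own preprocessing code, and the audio path is a placeholder.

import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2Model

# Load the same wav2vec2 checkpoint AniPortrait builds on.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# "speech.wav" is a placeholder; wav2vec2 expects 16 kHz mono audio.
waveform, sample_rate = torchaudio.load("speech.wav")
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000).mean(dim=0)

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    # Frame-level speech features, shape (1, num_frames, 768) for the base model.
    features = model(inputs.input_values).last_hidden_state
print(features.shape)

Features like these are what the audio2pose and audio2mesh components map to head poses and facial geometry before the diffusion stage renders frames.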
Core Features of AniPortrait
AniPortrait packs a range of powerful features tailored for realistic portrait animation:
- Audio-Driven Portrait Animation: Syncs lip movements and expressions to audio inputs, perfect for dubbing or virtual avatars.
- Face Reenactment: Transfers facial performances from a source video to a target portrait, ideal for deepfake-like ethical applications in media.
- Pose Control and Retargeting: Updated strategies handle diverse head poses, including generation of custom pose files for precise control.
- High-Resolution Output: Produces photorealistic videos and supports longer sequences (300 frames or more).
- Acceleration Options: Frame interpolation and FP16 models speed up inference without sacrificing quality.
- Gradio Web UI: A user-friendly interface for quick demos, also hosted on Hugging Face Spaces for online access.
- Pre-Trained Models: Includes weights for audio2mesh, audio2pose, and diffusion components, downloadable from sources like Wisemodel.
These features make AniPortrait a go-to tool for AI-driven video synthesis, surpassing basic tools by focusing on portrait fidelity and audio-visual coherence.
Installation and Setup
Getting started is straightforward for users with Python >=3.10 and CUDA 11.7:
- Clone the repository: git clone https://github.com/Zejun-Yang/AniPortrait
- Install dependencies: pip install -r requirements.txt
- Download pre-trained weights to ./pretrained_weights/, including Stable Diffusion components, wav2vec2, and custom models like denoising_unet.pth and audio2pose.pt.
- Organize files as per the directory structure in the README.
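If the weights are mirrored on a Hugging Face-style hub, the download step could be scripted roughly as below. This is only a sketch: the repository ID is a placeholder, and you should use the official download sources listed in the README (such as Wisemodel).

# Hypothetical helper for fetching pre-trained weights into ./pretrained_weights/.
# The repo_id is a placeholder, not an official repository name; follow the
# README's download links for the real weights.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<weights-repo-id>",       # placeholder
    local_dir="./pretrained_weights",  # directory the inference configs expect
)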
For training, prepare datasets like VFHQ or CelebV-HQ by extracting keypoints and running preprocessing scripts. Training occurs in two stages using Accelerate for distributed processing.
How to Use AniPortrait?
Inference Modes
AniPortrait supports three primary modes via command-line scripts:
Self-Driven Animation:
python -m scripts.pose2vid --config ./configs/prompts/animation.yaml -W 512 -H 512 -acc
Customize with reference images or pose videos. Convert videos to poses using python -m scripts.vid2pose --video_path input.mp4.
Face Reenactment:
python -m scripts.vid2vid --config ./configs/prompts/animation_facereenac.yaml -W 512 -H 512 -acc
Edit the YAML to include source videos and references.
Audio-Driven Synthesis:
python -m scripts.audio2vid --config ./configs/prompts/animation_audio.yaml -W 512 -H 512 -acc
Add your own audio files and reference images to the config. To have head poses generated automatically from audio, enable audio2pose by removing the pose_temp entry. For explicit head pose control, generate reference poses with python -m scripts.generate_ref_pose. A minimal sketch of scripting this mode follows below.
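To drive these modes from a Python script rather than the shell, the command-line entry points above can be wrapped with subprocess. The sketch below only wraps the audio-driven command shown earlier; paths are placeholders, and the YAML config is assumed to already list your audio and reference image.

# Minimal sketch: run AniPortrait's audio-driven inference by shelling out to
# the scripts shown above. Not an internal API of the project.
import subprocess

def run_audio_driven(config="./configs/prompts/animation_audio.yaml",
                     width=512, height=512, accelerate=True):
    cmd = ["python", "-m", "scripts.audio2vid",
           "--config", config, "-W", str(width), "-H", str(height)]
    if accelerate:
        cmd.append("-acc")  # acceleration flag from the commands above
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_audio_driven()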
Web Demo
Launch the Gradio UI with python -m scripts.app, or try the online version on Hugging Face Spaces.
Users can experiment with sample videos such as 'cxk.mp4' or 'jijin.mp4' (sourced from platforms like Bilibili) to see the audio sync in action.
Training AniPortrait from Scratch
Advanced users can train custom models:
- Data Prep: Download datasets, preprocess with python -m scripts.preprocess_dataset, and update the JSON paths.
- Stage 1: accelerate launch train_stage_1.py --config ./configs/train/stage1.yaml
- Stage 2: Download the motion module weights, specify the Stage 1 checkpoints, and run accelerate launch train_stage_2.py --config ./configs/train/stage2.yaml
This process fine-tunes on portrait-specific data, enhancing generalization for AI animation tasks.
Why Choose AniPortrait?
In a crowded field of AI video generation tools, AniPortrait stands out for its specialized focus on photorealistic portraits. Unlike general-purpose models, it handles audio-lip sync and subtle expressions with precision, reducing artifacts in facial animations. Its open-source nature allows customization, and recent updates, such as the April 2024 audio2pose release and the acceleration modules, keep it current. The project also acknowledges related work such as EMO and AnimateAnyone, reflecting its collaborative roots.
Practical value includes faster prototyping for virtual influencers, educational videos, or game assets. With arXiv paper availability (eprint 2403.17694), it serves researchers exploring audio-visual synthesis in computer vision.
Who is AniPortrait For?
- Content Creators and Filmmakers: For quick dubbing or expression transfers in short-form videos.
- Game Developers at Tencent-like Studios: Integrating animated portraits into interactive media.
- AI Researchers: Experimenting with diffusion-based animation and pose retargeting.
- Hobbyists and Educators: Using the web UI to teach AI concepts without heavy setup.
If you're seeking the best way to create audio-driven portrait animations, AniPortrait's balance of quality, speed, and accessibility makes it a top choice.
Potential Applications and Use Cases
- Virtual Avatars: Animate digital characters with synced speech for social media or metaverses.
- Educational Tools: Generate talking head videos for lectures or tutorials.
- Media Production: Ethical face reenactment for historical reenactments or ads.
- Research Prototyping: Benchmark audio-to-video models in CV papers.
Demonstrations include self-driven clips like 'solo.mp4' and audio examples like 'kara.mp4', showcasing seamless integration.
For troubleshooting, check the 76 open issues on GitHub or contribute via pull requests. Overall, AniPortrait empowers users to push boundaries in AI portrait animation with reliable, high-fidelity results.
Best Alternative Tools to "AniPortrait"
- AiReelGenerator: Automates faceless video creation. Choose a topic, and AI generates videos for YouTube, TikTok, Instagram, and Facebook daily.
- ChatArt: An AI tool offering content creation, image editing, and AI chat features. Powered by GPT-5, Claude Sonnet, and DeepSeek, it delivers high-quality content, AI image generation/editing, and plagiarism/grammar detection.
- SpikeX AI: A text-to-video platform for automating YouTube growth that turns text into engaging videos. Creates faceless videos for YouTube and social media from a single prompt.
- Vid.AI: An AI-powered video generator that creates faceless videos for YouTube Shorts, TikTok, Instagram Reels, and full-length YouTube videos, aimed at creators pursuing YouTube automation.
- ImagineAPP: An AI-powered platform for creating music videos and other video content from text or images, supporting models like Runway Gen3, Hailuo AI, Kling AI, Luma AI, and Google VEO.
- GenXi: An AI-powered platform that generates realistic images and videos from text, with a DALL App, ScriptToVid Tool, Imagine AI Tool, and AI Logo Maker.
- Alle-AI: An all-in-one AI platform that combines and compares outputs from ChatGPT, Gemini, Claude, DALL-E 2, Stable Diffusion, and Midjourney for text, image, audio, and video generation.
- GlobalGPT: An all-in-one AI platform providing access to ChatGPT, GPT-5, Claude, Unikorn (MJ-like), Veo, and 100+ AI tools for writing, research, and image and video creation.