FlashPortrait
FlashPortrait: 6$\times$ Faster Infinite Portrait Animation with Adaptive Latent Prediction
Shuyuan Tu1, Yueming Pan3, Yinming Huang1, Xintong Han4, Zhen Xing5, Qi Dai2, Kai Qiu2, Chong Luo2, Zuxuan Wu1
[1Fudan University; 2Microsoft Research Asia; 3Xi'an Jiaotong University; 4Tencent Inc; 5Wan Team, Tongyi Lab, Alibaba Group]
Portrait animations generated by FlashPortrait, showing its power to synthesize infinite-length ID-preserving animations. All videos are directly synthesized by FlashPortrait without the use of any face-related post-processing tools, such as the face-swapping tool FaceFusion or face restoration models like GFP-GAN and CodeFormer.
Comparison results between FlashPortrait and state-of-the-art (SOTA) portrait animation models highlight the superior performance of FlashPortrait in delivering infinite-length, high-fidelity, identity-preserving portrait animation.
Overview
The overview of the framework of FlashPortrait.
Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6$\times$ acceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6$\times$ speed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.
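To make these two inference-time mechanisms concrete, the sketch below (an illustrative approximation, not the released implementation; the function names, the linear cross-fade weights, and the threshold value are hypothetical) shows (a) cross-fading the overlapping latent frames of two consecutive context windows and (b) extrapolating a future latent from finite-difference derivatives of recent latents, skipping a full denoising pass when the latent variation rate is small.

```python
import torch

def blend_windows(prev_window: torch.Tensor, next_window: torch.Tensor, overlap: int) -> torch.Tensor:
    """Merge two consecutive context windows of latent frames [frames, c, h, w]
    by cross-fading the `overlap` frames they share (assumes 0 < overlap <= frames)."""
    # Linearly increasing weight for the incoming window over the overlap region.
    w = torch.linspace(0.0, 1.0, steps=overlap, device=next_window.device).view(overlap, 1, 1, 1)
    blended = (1.0 - w) * prev_window[-overlap:] + w * next_window[:overlap]
    return torch.cat([prev_window[:-overlap], blended, next_window[overlap:]], dim=0)

def extrapolate_latent(latent_history: list[torch.Tensor], dt: float, order: int = 2) -> torch.Tensor:
    """Predict the latent at a future timestep from finite-difference derivatives of
    the most recent latents, e.g. latent_history = [z_{t-2}, z_{t-1}, z_t]."""
    pred = latent_history[-1].clone()
    if order >= 1 and len(latent_history) >= 2:
        d1 = latent_history[-1] - latent_history[-2]                            # first-order difference
        pred = pred + d1 * dt
    if order >= 2 and len(latent_history) >= 3:
        d2 = latent_history[-1] - 2 * latent_history[-2] + latent_history[-3]   # second-order difference
        pred = pred + 0.5 * d2 * dt ** 2
    return pred

def can_skip_step(latent_history: list[torch.Tensor], tol: float = 0.05) -> bool:
    """Skip the next full denoising pass when the relative latent variation rate is small
    (the 0.05 threshold is a made-up illustrative value)."""
    if len(latent_history) < 2:
        return False
    rate = (latent_history[-1] - latent_history[-2]).norm() / (latent_history[-1].norm() + 1e-8)
    return rate.item() < tol
```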
News
[2025-12-15]: 🔥 The project page, code, technical report, and a basic model checkpoint are released. The further acceleration component (Adaptive Latent Prediction) will be released very soon. Stay tuned!
🛠️ To-Do List
- FlashPortrait-14B
- Inference Code
- Training Code
- Multiple-GPU Inference Code
- Inference Code with Adaptive Latent Prediction
🚀 Quickstart
FlashPortrait supports generating infinite-length videos at 480x832, 832x480, 512x512, 720x720, 720x1280, or 1280x720 resolution. If you run into out-of-memory issues, you can reduce the number of animated frames or the output resolution.
🧱 Environment setup
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
# Optional: install flash_attn to accelerate attention computation
pip install flash_attn
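Optionally, you can run a short Python snippet to confirm that PyTorch sees your GPU and whether flash_attn is available:

```python
# Optional sanity check of the environment.
import importlib.util
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("flash_attn installed:", importlib.util.find_spec("flash_attn") is not None)
```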
🧱 Download weights
If you encounter connection issues with Hugging Face, you can utilize the mirror endpoint by setting the environment variable: export HF_ENDPOINT=https://hf-mirror.com.
Please download weights manually as follows:
pip install "huggingface_hub[cli]"
cd FlashPortrait
mkdir checkpoints
huggingface-cli download FrancisRing/FlashPortrait --local-dir ./checkpoints/FlashPortrait
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./checkpoints/Wan2.1-I2V-14B-720P
All the weights should be organized under the checkpoints directory. The overall file structure of this project should be organized as follows:
FlashPortrait/
├── config
├── examples
├── wan
├── checkpoints
│   ├── FlashPortrait
│   └── Wan2.1-I2V-14B-720P
├── infer.py
├── fast_infer.py
├── train_portrait.py
├── bin_convert_pt.py
├── train_single_machine.sh
├── train_multiple_machine.sh
└── requirements.txt
🧱 Model inference
Sample configurations for testing are provided in infer.py and fast_infer.py. You can easily modify the various settings according to your needs.
bash inference.sh
Wan2.1-14B-based FlashPortrait supports video-driven portrait video generation at various resolution settings: 512x512, 480x832, 832x480, 720x720, 720x1280, and 1280x720. You can modify "max_size" in infer.py to set the resolution of the animation. "--validation_image_start", "--validation_driven_video_path", and "--prompt" in infer.py refer to the path of the given reference image, the path of the driving video, and the text prompt, respectively.
Prompts are also very important. The recommended format is [Description of first frame]-[Description of human behavior]-[Description of background (optional)].
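For example, a prompt in this format (purely illustrative) could be: "A young woman with long brown hair faces the camera in a bright room-she smiles and talks with subtle, natural head movements-a plain, softly lit background."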
"--wan_model_name", "--transformer_path", and "--portrait_encoder_path" in infer.py are the paths of pretrained Wan2.1-14B weights, pretrained FlashPortrait DiT weights, and pretrained FlashPortrait Portrait Encoder weights, respectively.
"--num_inference_steps", "--sub_num_frames", "--latents_num_frames", "--context_overlap" and "--context_size" refer to the total number of inference steps, the synthesized rgb frame number in a batch, the synthesized latent frame number in a batch, the overlapping context length between two context windows, the synthesized latent frame number in a context window, respectively.
Notably, the recommended --num_inference_steps range is [30-50]; more steps yield higher quality. The recommended --context_overlap range is [10-40]; a longer overlap results in higher quality but slower inference.
"--text_cfg_scale" and "--emo_cfg_scale" are the Classifier-Free Guidance scales for the text prompt and the portrait emotion, respectively. The recommended range for both CFG scales is [2-5]. You can increase the emotion CFG scale to improve emotion synchronization with the driving video.
We provide 6 cases in different resolution settings in path/FlashPortrait/examples for validation. ❤️❤️Please feel free to try it out and enjoy the endless entertainment of infinite-length portrait video generation❤️❤️!
💡 Tips
- fast_infer.py has a faster inference speed and shares the same configuration settings as infer.py.
- If you have limited GPU resources, you can change the loading mode of FlashPortrait by modifying "--GPU_memory_mode" in infer.py. The options of "--GPU_memory_mode" are model_full_load, sequential_cpu_offload, model_cpu_offload_and_qfloat8, and model_cpu_offload. In particular, when you set --GPU_memory_mode to sequential_cpu_offload, the total GPU memory consumption is approximately 10GB, at the cost of slower inference. Setting --GPU_memory_mode to model_cpu_offload can significantly cut GPU memory usage, reducing it by roughly half compared to model_full_load mode.
- A higher resolution setting (480p -> 720p) results in higher-quality synthesized videos.
🧱 Model Training
🔥🔥It's worth noting that if you're looking to train a conditioned Video Diffusion Transformer (DiT) model, such as Wan2.1, this training tutorial will also be helpful.🔥🔥 The training dataset has to be organized as follows:
portrait_data/
├── rec
│   ├── speech
│   │   ├── 00001
│   │   │   ├── images
│   │   │   │   ├── frame_0.png
│   │   │   │   ├── frame_1.png
│   │   │   │   ├── frame_2.png
│   │   │   │   └── ...
│   │   │   ├── face_masks
│   │   │   │   ├── frame_0.png
│   │   │   │   ├── frame_1.png
│   │   │   │   ├── frame_2.png
│   │   │   │   └── ...
│   │   │   └── lip_masks
│   │   │       ├── frame_0.png
│   │   │       ├── frame_1.png
│   │   │       ├── frame_2.png
│   │   │       └── ...
│   │   ├── 00002
│   │   │   ├── images
│   │   │   ├── face_masks
│   │   │   └── lip_masks
│   │   └── ...
│   ├── singing
│   │   ├── 00001
│   │   │   ├── images
│   │   │   ├── face_masks
│   │   │   └── lip_masks
│   │   └── ...
│   └── dancing
│       ├── 00001
│       │   ├── images
│       │   ├── face_masks
│       │   └── lip_masks
│       └── ...
├── vec
│   ├── speech
│   │   ├── 00001
│   │   │   ├── images
│   │   │   ├── face_masks
│   │   │   └── lip_masks
│   │   └── ...
│   ├── singing
│   │   ├── 00001
│   │   │   ├── images
│   │   │   ├── face_masks
│   │   │   └── lip_masks
│   │   └── ...
│   └── dancing
│       ├── 00001
│       │   ├── images
│       │   ├── face_masks
│       │   └── lip_masks
│       └── ...
├── square
│   ├── speech
│   │   ├── 00001
│   │   │   ├── images
│   │   │   ├── face_masks
│   │   │   └── lip_masks
│   │   └── ...
│   ├── singing
│   │   ├── 00001
│   │   │   ├── images
│   │   │   ├── face_masks
│   │   │   └── lip_masks
│   │   └── ...
│   └── dancing
│       ├── 00001
│       │   ├── images
│       │   ├── face_masks
│       │   └── lip_masks
│       └── ...
├── video_rec_path.txt
├── video_square_path.txt
└── video_vec_path.txt
FlashPortrait is trained on mixed-resolution videos, with 720x720 videos stored in portrait_data/square, 480x832 videos stored in portrait_data/vec, and 832x480 videos stored in portrait_data/rec. Each of portrait_data/square, portrait_data/rec, and portrait_data/vec contains three subfolders holding different types of videos (speech, singing, and dancing).
All .png image files are named in the format frame_i.png, such as frame_0.png, frame_1.png, and so on.
00001, 00002, 00003, and so on denote individual video clips.
Within each video clip folder, the three subfolders images, face_masks, and lip_masks store the RGB frames, the corresponding human face masks, and the corresponding human lip masks, respectively.
video_square_path.txt, video_rec_path.txt, and video_vec_path.txt record the folder paths of the video clips in portrait_data/square, portrait_data/rec, and portrait_data/vec, respectively.
For example, the content of video_rec_path.txt is shown as follows:
path/FlashPortrait/portrait_data/rec/speech/00001
path/FlashPortrait/portrait_data/rec/speech/00002
...
path/FlashPortrait/portrait_data/rec/singing/00003
path/FlashPortrait/portrait_data/rec/singing/00004
...
path/FlashPortrait/portrait_data/rec/dancing/00005
path/FlashPortrait/portrait_data/rec/dancing/00006
...
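If you prefer not to maintain these list files by hand, a small helper script along the following lines can generate them from the directory layout described above (this is a hypothetical convenience script, not part of the repository; ROOT is a placeholder path):

```python
# Hypothetical helper: walk portrait_data/ and (re)generate the three path list files,
# keeping only clip folders that contain the expected subfolders.
import os

ROOT = "path/FlashPortrait/portrait_data"  # placeholder: adjust to your dataset location
LISTS = {"rec": "video_rec_path.txt", "vec": "video_vec_path.txt", "square": "video_square_path.txt"}
REQUIRED = ("images", "face_masks", "lip_masks")

for split, list_name in LISTS.items():
    clip_paths = []
    for category in ("speech", "singing", "dancing"):
        category_dir = os.path.join(ROOT, split, category)
        if not os.path.isdir(category_dir):
            continue
        for clip in sorted(os.listdir(category_dir)):
            clip_dir = os.path.join(category_dir, clip)
            missing = [name for name in REQUIRED if not os.path.isdir(os.path.join(clip_dir, name))]
            if missing:
                print(f"Skipping {clip_dir}: missing {missing}")
                continue
            clip_paths.append(clip_dir)
    with open(os.path.join(ROOT, list_name), "w") as f:
        f.write("\n".join(clip_paths) + "\n")
    print(f"Wrote {len(clip_paths)} paths to {list_name}")
```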
If you only have raw videos, you can leverage ffmpeg to extract frames from a raw video (e.g., a speech video) and store them in the subfolder images:
ffmpeg -i raw_video_1.mp4 -q:v 1 -start_number 0 path/FlashPortrait/portrait_data/rec/speech/00001/images/frame_%d.png
The obtained frames are saved in path/FlashPortrait/portrait_data/rec/speech/00001/images.
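If you have a whole folder of raw videos, a short (hypothetical) Python wrapper around the same ffmpeg command can populate the numbered clip folders; RAW_DIR and OUT_ROOT are placeholders:

```python
# Hypothetical batch frame extraction: one numbered clip folder per raw .mp4 file.
import os
import subprocess

RAW_DIR = "path/to/raw_videos"                                # placeholder: folder of .mp4 files
OUT_ROOT = "path/FlashPortrait/portrait_data/rec/speech"      # placeholder: target split/category

for idx, name in enumerate(sorted(f for f in os.listdir(RAW_DIR) if f.endswith(".mp4")), start=1):
    images_dir = os.path.join(OUT_ROOT, f"{idx:05d}", "images")
    os.makedirs(images_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", os.path.join(RAW_DIR, name), "-q:v", "1", "-start_number", "0",
         os.path.join(images_dir, "frame_%d.png")],
        check=True,
    )
```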
For extracting the human face masks, please refer to StableAnimator repo. The Human Face Mask Extraction section in the tutorial provides off-the-shelf codes. For extracting the human lip masks, please refer to StableAvatar repo. The Human Lip Mask Extraction section in the tutorial provides off-the-shelf codes.
When your dataset is organized exactly as outlined above, you can easily train your Wan2.1-14B-based FlashPortrait by running the following command:
# Training FlashPortrait on a mixed resolution setting (480x832, 832x480, and 720x720) in a single node
bash train_single_machine.sh
# Training FlashPortrait on a mixed resolution setting (480x832, 832x480, and 720x720) in multiple nodes
bash train_multiple_machine.sh
For the parameter details of train_single_machine.sh and train_multiple_machine.sh, CUDA_VISIBLE_DEVICES refers to the GPU devices used for training. In my setting, I use 4 NVIDIA A100 80GB GPUs (CUDA_VISIBLE_DEVICES=3,2,1,0) to train FlashPortrait in a single node.
--pretrained_model_name_or_path and --output_dir refer to the pretrained Wan2.1-14B path and the checkpoint saved path of the trained FlashPortrait.
--train_data_square_dir, --train_data_rec_dir, and --train_data_vec_dir are the paths of video_square_path.txt, video_rec_path.txt, and video_vec_path.txt, respectively.
--video_sample_n_frames is the number of frames that FlashPortrait processes in a single batch.
--num_train_epochs is the training epoch number.
Since we utilize DeepSpeed-Stage-3 to train our FlashPortrait, we need to convert the saved checkpoint to fp32 as follows:
cd output_14B_dir/checkpoint-x
python zero_to_fp32.py /path/FlashPortrait/output_14B_dir/checkpoint-x /path/FlashPortrait/output_14B_dir/checkpoint-x-fp32-infer --max_shard_size 80GB
cd ../..
python bin_convert_pt.py --pretrained_model_path="/path/FlashPortrait/output_14B_dir/checkpoint-x-fp32-infer"
It is worth noting that training FlashPortrait requires approximately 50GB of VRAM due to the mixed-resolution (480x832, 832x480, and 720x720) training pipeline. However, if you train FlashPortrait exclusively on 512x512 videos, the VRAM requirement is reduced to approximately 40GB. Additionally, the backgrounds of the selected training videos should remain static, as this helps the diffusion model compute an accurate reconstruction loss.
🧱 Model Finetuning
Regarding fully finetuning FlashPortrait, you can add --transformer_path="path/FlashPortrait/checkpoints/FlashPortrait/transformer.pt" and --portrait_encoder_path="path/FlashPortrait/checkpoints/FlashPortrait/portrait_encoder.pt" to train_single_machine.sh or train_multiple_machine.sh:
# Finetuning FlashPortrait on a mixed resolution setting (480x832, 832x480, and 720x720) in a single node
bash train_single_machine.sh
# Finetuning FlashPortrait on a mixed resolution setting (480x832, 832x480, and 720x720) in multiple nodes
bash train_multiple_machine.sh
🧱 VRAM requirement
For a 10s video (720x1280, fps=25), FlashPortrait (--GPU_memory_mode="model_full_load") requires approximately 60GB of VRAM on an A100 GPU (--GPU_memory_mode="sequential_cpu_offload" requires approximately 10GB of VRAM).
🔥🔥Theoretically, FlashPortrait is capable of synthesizing hours of video without significant quality degradation; however, the 3D VAE decoder demands significant GPU memory, especially when decoding 10k+ frames. You have the option to run the VAE on the CPU.🔥🔥
🧱 Acknowledgments
Thanks to Wan2.1, PD-FGC, FantasyPortrait and VideoX-Fun for open-sourcing their models and code, which provided valuable references and support for this project. Their contributions to the open-source community are truly appreciated.
Contact
If you have any suggestions or find our work helpful, feel free to contact me.
Email: [email protected]
If you find our work useful, please consider giving a star ⭐ to this GitHub repository and citing it ❤️:
@article{tu2025flashportrait,
title={FlashPortrait: 6$\times$ Faster Infinite Portrait Animation with Adaptive Latent Prediction},
author={Tu, Shuyuan and Pan, Yueming and Huang, Yinming and Han, Xintong and Xing, Zhen and Dai, Qi and Qiu, Kai and Luo, Chong and Wu, Zuxuan},
journal={arXiv preprint arXiv:2512.16900},
year={2025}
}