pith. machine review for the scientific record.

arxiv: 2605.04702 · v1 · submitted 2026-05-06 · 💻 cs.CV · cs.AI

Recognition: unknown

FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:36 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords identity-preserving text-to-video · facial pose alignment · pose-shared dictionary · orientation angle embeddings · identity consistency · video generation · generative models · dataset curation

The pith

FaithfulFaces aligns facial poses across views with a shared dictionary and invariance constraint to keep identities consistent in text-to-video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to fix the problem where AI-generated videos change a person's face identity when the head turns or parts are hidden. It does so by building a system that shares pose information across different views of the face, using a common dictionary of poses and a rule that keeps the identity fixed despite pose differences. This system converts a single view of the face into a complete pose description based on orientation angles, giving the video generator a reliable guide to keep the same face throughout. They also built a special collection of videos showing many different head positions to train on. If successful, this means creators can make videos with the same recognizable person even in active scenes full of movement and camera angles.

Core claim

FaithfulFaces establishes that a pose-faithful identity preservation framework, built around a pose-shared identity aligner that refines poses via a shared dictionary and an invariance constraint, can map single-view inputs to a global facial pose representation using explicit orientation angle embeddings. That representation is meant to guide generative models toward identity-preserving text-to-video outputs that stay robust under large pose changes and occlusions.

What carries the argument

The pose-shared identity aligner, which refines and aligns facial poses across views using a shared dictionary and a pose variation-identity invariance constraint to produce a global representation via orientation angle embeddings.
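The paper's implementation is not reproduced here, so the following is only a minimal sketch, under assumed names and dimensions, of how a pose-shared dictionary and an explicit orientation-angle embedding could be wired into a single aligner module; the attention-style dictionary lookup is a guess at the mechanism, not the authors' code.

```python
# Hypothetical sketch of a pose-shared identity aligner; names, sizes, and the
# lookup mechanism are assumptions, not the released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseSharedAligner(nn.Module):
    def __init__(self, feat_dim=512, dict_size=4096, angle_dim=64):
        super().__init__()
        # Pose-shared dictionary: one codebook consulted for every view.
        self.dictionary = nn.Parameter(torch.randn(dict_size, feat_dim) * 0.02)
        # Embeds explicit Euler angles (yaw, pitch, roll) of the input face.
        self.angle_mlp = nn.Sequential(
            nn.Linear(3, angle_dim), nn.SiLU(), nn.Linear(angle_dim, feat_dim)
        )
        self.out_proj = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, id_feat, euler_angles):
        # id_feat: (B, feat_dim) single-view identity feature from a face encoder
        # euler_angles: (B, 3) yaw/pitch/roll in radians
        attn = F.softmax(id_feat @ self.dictionary.t() / id_feat.shape[-1] ** 0.5, dim=-1)
        pooled = attn @ self.dictionary            # pose-refined feature from the shared codebook
        angle_emb = self.angle_mlp(euler_angles)   # explicit orientation embedding
        return self.out_proj(torch.cat([pooled, angle_emb], dim=-1))
```

The point of sharing one dictionary across every view is that a profile crop and a frontal crop of the same person are forced to describe themselves with the same pose vocabulary, which is where the claimed cross-view alignment would come from.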

If this is right

  • Identity consistency holds across sequences even when faces turn sharply or become partially hidden.
  • Structural details of the face remain clear frame to frame in videos with motion.
  • Text prompts involving moving human characters produce more reliable and usable outputs.
  • Single reference images suffice to control identity in multi-view dynamic generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-view to global-pose mapping could reduce reliance on multiple reference shots in other identity tasks.
  • Similar dictionary-plus-constraint designs might transfer to full-body or object consistency in video.
  • Curating pose-diverse datasets could become a standard step for improving other generative video challenges.
  • Integration into existing text-to-video systems might allow consistent characters without full retraining.

Load-bearing premise

The pose variation-identity invariance constraint and pose-shared dictionary will align identities across views without introducing new artifacts or reducing generation quality in unseen dynamic scenes.

What would settle it

If videos generated from prompts with extreme head turns or occlusions outside the training data show identity distortion comparable to prior methods, the claimed alignment does not hold.
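One concrete form of that test, sketched below under assumptions: embed the reference face and every generated frame with an off-the-shelf face recognizer (the rebuttal mentions ArcFace) and track mean cosine similarity on out-of-distribution extreme-pose prompts. Producing the embeddings is left outside the sketch.

```python
# Illustrative evaluation sketch, not the paper's protocol: measure identity drift
# between a reference face embedding and per-frame embeddings of a generated video.
# The embeddings are assumed to come from any face recognizer (e.g. ArcFace).
import numpy as np

def identity_consistency(ref_embedding: np.ndarray, frame_embeddings: np.ndarray) -> float:
    """ref_embedding: (D,), frame_embeddings: (T, D) -> mean cosine similarity."""
    ref = ref_embedding / np.linalg.norm(ref_embedding)
    frames = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    return float((frames @ ref).mean())
```

Scores on extreme-pose or occlusion prompts that sit no higher than prior IPT2V methods would undercut the claim; consistently higher scores on the same prompts would support it.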

Figures

Figures reproduced from arXiv: 2605.04702 by Bing Ma, Jiaxiang Cheng, Kai Yu, Qinglin Lu, Sen Liang, Tianxiang Zheng, Wenyue Li, Xuhua Ren, Yuanzhi Wang, Zhen Cui.

Figure 1: Visualization results from four different IPT2V methods. … view at source ↗
Figure 2: The framework of FaithfulFaces. During training, given … view at source ↗
Figure 3: Architecture of the pose-shared identity aligner. Initially, the input face images are … view at source ↗
Figure 4: Dataset collection and processing pipeline. In Step 1, videos without face or with multiple … view at source ↗
Figure 5: Visual comparisons of different methods. The goal is to generate a video of a person … view at source ↗
Figure 6: Visualization of the encoded facial identity (ID). FaithfulFaces demonstrates promising ID … view at source ↗
Figure 7: Visualization of learned pose-shared dictionary. We visualize five representative facial … view at source ↗
Figure 8: Ablation study on different dictionary elements. view at source ↗
Figure 9: Visual comparisons of the ablation study for key components in FaithfulFaces. view at source ↗
Figure 10: The value of the contrastive loss in the pose-shared identity aligner (named aligner loss) … view at source ↗
Figure 11: Visual comparisons of frontal view and non-frontal view. We can observe that when using … view at source ↗
Figure 12: Complete visual comparisons of different methods for the case of Fig. 1. view at source ↗
Figure 13: More visual comparisons of different methods. view at source ↗
Figure 14: More visual comparisons of different methods. view at source ↗
Figure 15: More visual comparisons of different methods. view at source ↗
Figure 16: More showcases of identity-preserving videos generated by our FaithfulFaces. view at source ↗
Figure 17: More showcases of identity-preserving videos generated by our FaithfulFaces. view at source ↗
Figure 18: More showcases of identity-preserving videos generated by our FaithfulFaces. view at source ↗
Figure 19: More showcases of identity-preserving videos generated by our FaithfulFaces. view at source ↗
read the original abstract

Identity-preserving text-to-video generation (IPT2V) empowers users to produce diverse and imaginative videos with consistent human facial identity. Despite recent progress, existing methods often suffer from significant identity distortion under large facial pose variations or facial occlusions. In this paper, we propose FaithfulFaces, a pose-faithful facial identity preservation learning framework to improve IPT2V in complex dynamic scenes. The key of FaithfulFaces is a pose-shared identity aligner that refines and aligns facial poses across distinct views via a pose-shared dictionary and a pose variation-identity invariance constraint. By mapping single-view inputs into a global facial pose representation with explicit Euler angle embeddings, FaithfulFaces provides a pose-faithful facial prior that guides generative foundations toward robust identity-preserving generation. In particular, we develop a specialized pipeline to curate a high-quality video dataset featuring substantial facial pose diversity. Extensive experiments demonstrate that FaithfulFaces achieves state-of-the-art performance, maintaining superior identity consistency and structural clarity even as pose changes and occlusions occur.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes FaithfulFaces, a pose-faithful facial identity preservation framework for identity-preserving text-to-video generation (IPT2V). Its key innovation is a pose-shared identity aligner that refines and aligns facial poses across views using a pose-shared dictionary and a pose variation-identity invariance constraint. Single-view inputs are mapped to global Euler-angle pose representations to provide a pose-faithful prior for generative backbones. The authors curate a high-quality video dataset with substantial pose diversity and claim state-of-the-art results in identity consistency and structural clarity under large pose changes and occlusions.

Significance. If the experimental results and the invariance constraint hold up under scrutiny, the work could meaningfully advance IPT2V by supplying an explicit pose-faithful prior that improves robustness in dynamic scenes without sacrificing generation quality. The dataset curation effort and the explicit Euler-angle embedding are concrete strengths that could be reusable.

major comments (2)
  1. [Abstract] The central claim of achieving state-of-the-art performance is asserted without any reported quantitative metrics, baselines, ablation studies, or error analysis. This absence is load-bearing because the superiority in identity consistency and structural clarity under pose variation and occlusion cannot be evaluated from the provided description alone.
  2. [Method] Method section (pose-shared identity aligner description): No derivation, proof, or formal argument is supplied showing that the pose variation-identity invariance constraint is identity-preserving rather than pose-averaging, especially under large rotations or partial occlusions. The assumption that the learned dictionary generalizes without introducing new artifacts in unseen dynamic scenes therefore remains unverified and directly affects the core technical contribution.
minor comments (1)
  1. [Abstract and Method] The abstract and method descriptions use several invented terms (pose-shared identity aligner, pose variation-identity invariance constraint) without first providing a concise formal definition or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and method. We address each major comment below and commit to revisions that strengthen the presentation of results and technical justification without altering the core claims.

read point-by-point responses
  1. Referee: [Abstract] The central claim of achieving state-of-the-art performance is asserted without any reported quantitative metrics, baselines, ablation studies, or error analysis. This absence is load-bearing because the superiority in identity consistency and structural clarity under pose variation and occlusion cannot be evaluated from the provided description alone.

    Authors: We agree the abstract would be stronger with explicit metrics. The full manuscript reports quantitative results in the experiments section, including identity consistency (e.g., ArcFace cosine similarity) and structural metrics (e.g., LPIPS, FID) against baselines such as IP-Adapter and ControlVideo, with ablations on the invariance constraint. We will revise the abstract to include key figures, such as 'improving identity preservation by 12% under large pose changes compared to prior IPT2V methods.' revision: yes

  2. Referee: [Method] Method section (pose-shared identity aligner description): No derivation, proof, or formal argument is supplied showing that the pose variation-identity invariance constraint is identity-preserving rather than pose-averaging, especially under large rotations or partial occlusions. The assumption that the learned dictionary generalizes without introducing new artifacts in unseen dynamic scenes therefore remains unverified and directly affects the core technical contribution.

    Authors: The constraint is formulated as an L2 loss between identity embeddings extracted from the shared dictionary across different Euler-angle poses of the same subject, explicitly encouraging invariance to pose while the dictionary separately encodes pose-specific components. This is not averaging because the dictionary is pose-shared but the invariance term only regularizes identity features. We will add a short derivation in the revised method section showing that the combined objective minimizes identity distance subject to pose reconstruction, plus additional ablation results on large rotations and occlusions. Generalization is supported by the curated dataset's pose diversity, though we acknowledge further theoretical analysis could be valuable. revision: partial
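Read literally, the rebuttal describes an objective with two parts: an identity-invariance term across poses and a pose term that keeps orientation information recoverable. A minimal sketch of that structure follows; the weighting, the exact distance, and any contrastive variant (Figure 10 calls it an aligner loss) are assumptions rather than the paper's formulation.

```python
# Hedged sketch of the loss structure described in the rebuttal; not the authors' code.
import torch
import torch.nn.functional as F

def aligner_objective(id_a: torch.Tensor, id_b: torch.Tensor,
                      pred_angles: torch.Tensor, true_angles: torch.Tensor,
                      lam: float = 1.0) -> torch.Tensor:
    # id_a / id_b: aligner identity features for two views of the same subject
    invariance = F.mse_loss(id_a, id_b)                # identity should not move with pose
    pose_recon = F.mse_loss(pred_angles, true_angles)  # pose must stay recoverable
    return invariance + lam * pose_recon
```

Under this reading, only the identity features are pulled toward each other, while the pose term is what keeps the model from collapsing poses into an average.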

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents a methodological framework for identity-preserving text-to-video generation via a pose-shared identity aligner, dictionary, and invariance constraint, along with Euler-angle embeddings and a curated dataset. No equations, derivations, or first-principles results are provided in the abstract or described text that reduce by construction to fitted inputs, self-definitions, or self-citation chains. Claims rest on empirical experiments and architectural proposals rather than any load-bearing self-referential steps, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Based solely on abstract; full paper would likely list more implementation details and training assumptions.

axioms (1)
  • domain assumption Facial pose can be compactly represented by explicit Euler angle embeddings
    Invoked for mapping single-view inputs to the global pose representation (a sketch of one such embedding follows this ledger)
invented entities (2)
  • pose-shared identity aligner no independent evidence
    purpose: Refines and aligns facial poses across distinct views
    Core new module using dictionary and invariance constraint
  • pose variation-identity invariance constraint no independent evidence
    purpose: Enforces identity consistency despite pose changes
    Loss term introduced to guide the aligner
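To make the Euler-angle axiom concrete, here is a hypothetical example of an "explicit orientation angle embedding": yaw, pitch, and roll expanded into sin/cos features at a few frequencies. The paper only states that explicit angle embeddings are used; this particular encoding is an assumption.

```python
# Hypothetical explicit Euler-angle embedding (yaw, pitch, roll -> feature vector);
# an illustration of the axiom, not the paper's encoder.
import torch

def euler_angle_embedding(angles: torch.Tensor, num_freqs: int = 4) -> torch.Tensor:
    """angles: (B, 3) in radians -> (B, 3 * 2 * num_freqs) sin/cos features."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=angles.dtype, device=angles.device)
    scaled = angles.unsqueeze(-1) * freqs                   # (B, 3, num_freqs)
    return torch.cat([scaled.sin(), scaled.cos()], dim=-1).flatten(start_dim=1)
```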

pith-pipeline@v0.9.0 · 5507 in / 1111 out tokens · 34826 ms · 2026-05-08T17:36:20.816789+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 14 canonical work pages · 6 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  2. [2]

    Arcface: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019

  3. [3]

    MAGREF: Masked guidance for any-reference video generation with subject disentanglement

    Yufan Deng, Yuanyang Yin, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Shenghai Yuan, Yiding Yang, Angtian Wang, Bo Liu, Haibin Huang, and Chongyang Ma. MAGREF: Masked guidance for any-reference video generation with subject disentanglement. In The Fourteenth International Conference on Learning Representations, 2026

  4. [4]

    Multi-modal alignment using representation codebook

    Jiali Duan, Liqun Chen, Son Tran, Jinyu Yang, Yi Xu, Belinda Zeng, and Trishul Chilimbi. Multi-modal alignment using representation codebook. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15651–15660, 2022

  5. [5]

    Skyreels-a2: Compose anything in video diffusion transformers

    Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, et al. Skyreels-a2: Compose anything in video diffusion transformers. arXiv preprint arXiv:2504.02436, 2025

  6. [6]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025

  7. [7]

    Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In International Conference on Learning Representations (ICLR), 2024

  8. [8]

    ID-Animator: Zero-shot identity-preserving human video generation

    Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, and Jie Zhang. ID-Animator: Zero-shot identity-preserving human video generation. arXiv preprint arXiv:2404.15275, 2024

  9. [9]

    6d rotation representation for unconstrained head pose estimation

    Thorsten Hempel, Ahmed A Abdelrahman, and Ayoub Al-Hamadi. 6d rotation representation for unconstrained head pose estimation. In 2022 IEEE International Conference on Image Processing (ICIP), pages 2496–2500. IEEE, 2022

  10. [10]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021

  11. [11]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems, 30, 2017

  12. [12]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  13. [13]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  14. [14]

    Hunyuancustom: A multimodal-driven architecture for customized video generation

    Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, and Qinglin Lu. Hunyuancustom: A multimodal-driven architecture for customized video generation. arXiv preprint arXiv:2505.04512, 2025

  15. [15]

    Curricularface: adaptive curriculum learning loss for deep face recognition

    Yuge Huang, Yuhan Wang, Ying Tai, Xiaoming Liu, Pengcheng Shen, Shaoxin Li, Jilin Li, and Feiyue Huang. Curricularface: adaptive curriculum learning loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5901–5910, 2020

  16. [16]

    InsightFace

    InsightFace. https://github.com/deepinsight/insightface, 2025

  17. [17]

    Vace: All-in-one video creation and editing

    Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. arXiv preprint arXiv:2503.07598, 2025

  18. [18]

    Kling

    Kling. https://klingai.com/, 2026

  19. [19]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  20. [20]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023

  21. [21]

    Phantom: Subject-consistent video generation via cross-modal alignment

    Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, and Xinglong Wu. Phantom: Subject-consistent video generation via cross-modal alignment. arXiv preprint arXiv:2502.11079, 2025

  22. [22]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023

  23. [23]

    Visualizing data using t-SNE

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008

  24. [24]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  25. [25]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  26. [26]

    Pika

    Pika. https://pika.art/, 2025

  27. [27]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

  28. [28]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  29. [29]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

  30. [30]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016

  31. [31]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

  32. [32]

    Vidu

    Vidu. https://www.vidu.com, 2026

  33. [33]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  34. [34]

    Stand-in: A lightweight and plug-and-play identity control for video generation

    Bowen Xue, Zheng-Peng Duan, Qixin Yan, Wenjing Wang, Hao Liu, Chun-Le Guo, Chongyi Li, Chen Li, and Jing Lyu. Stand-in: A lightweight and plug-and-play identity control for video generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2026

  35. [35]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Representations, 2025

  36. [36]

    Identity-preserving text-to-video generation by frequency decomposition

    Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity-preserving text-to-video generation by frequency decomposition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12978–12988, 2025

  37. [37]

    Magic mirror: Id-preserved video generation in video diffusion transformers

    Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, and Jiaya Jia. Magic mirror: Id-preserved video generation in video diffusion transformers. arXiv preprint arXiv:2501.03931, 2025

  38. [38]

    Fantasyid: Face knowledge enhanced id-preserving video generation

    Yunpeng Zhang, Qiang Wang, Fan Jiang, Yaqi Fan, Mu Xu, and Yonggang Qi. Fantasyid: Face knowledge enhanced id-preserving video generation. arXiv preprint arXiv:2502.13995, 2025

  39. [39]

    Concat-id: Towards universal identity-preserving video synthesis

    Yong Zhong, Zhuoyi Yang, Jiayan Teng, Xiaotao Gu, and Chongxuan Li. Concat-id: Towards universal identity-preserving video synthesis. arXiv preprint arXiv:2503.14151, 2025
