FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs
Pith reviewed 2026-05-16 20:26 UTC · model grok-4.3
The pith
A compact latent U-Net edits lips via reconstruction at over 100 FPS without masks, GANs or diffusion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlashLips performs mask-free lip synchronization in two stages. A one-step latent-space U-Net editor is trained with pure reconstruction losses on self-supervised mouth-altered targets, and an audio-to-pose transformer is trained via flow matching. The paper reports over 100 FPS on a single GPU while preserving identity and background, at quality comparable to larger state-of-the-art models.
What carries the argument
The one-step latent-space U-Net editor that reconstructs an image from reference identity, masked target frame, and lips-pose vector, guided by self-supervision to localize edits without explicit masks at inference.
If this is right
- Lip-sync pipelines can run in real time on consumer GPUs without adversarial training.
- Deterministic reconstruction replaces generative sampling while retaining visual fidelity.
- Audio-driven pose control decouples cleanly from rendering, simplifying deployment.
- No mask input is required at inference once self-supervision has been applied during training.
Where Pith is reading between the lines
- The same reconstruction-plus-self-supervision pattern could extend to other localized facial edits such as expression transfer.
- Removing diffusion and GAN components may lower energy use for batch video processing tasks.
- Flow-matching for pose prediction might be swapped with other regression objectives if the low-dimensional vector remains the interface.
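To make the flow-matching interface concrete, here is a minimal sketch of the standard straight-path formulation (as in rectified flow), not the paper's exact objective, which this page does not spell out. The names `flow_matching_pair`, `flow_matching_loss`, `sample_euler`, and `predict_velocity` are hypothetical, the pose vector is a plain list of floats, and audio conditioning is omitted for brevity.

```python
def flow_matching_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t,
    plus the constant target velocity along that path."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity.
    predict_velocity(x_t, t) stands in for the audio-to-pose model."""
    x_t, target = flow_matching_pair(x0, x1, t)
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def sample_euler(predict_velocity, x0, steps=8):
    """Integrate the learned velocity field from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Because only the low-dimensional pose vector crosses the boundary between stages, swapping this objective for a plain regression loss would change `flow_matching_loss` and `sample_euler` but not the renderer.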
Load-bearing premise
Training on mouth-altered target variants as pseudo ground truth is enough for the network to learn where to apply lip changes while leaving identity and background untouched.
What would settle it
Side-by-side video comparisons in which the self-supervised training is ablated, checking whether edits visibly leak into non-lip regions or identity shifts appear.
Original abstract
We present FlashLips, a two-stage, mask-free lip-sync system that decouples lips control from rendering and achieves real-time performance, with our U-Net variant running at over 100 FPS on a single GPU, while matching the visual quality of larger state-of-the-art models. Stage 1 is a compact, one-step latent-space editor that reconstructs an image using a reference identity, a masked target frame, and a low-dimensional lips-pose vector, trained purely with reconstruction losses - no GANs or diffusion. To remove explicit masks at inference, we use self-supervision via mouth-altered target variants as pseudo ground truth, teaching the network to localize lip edits while preserving the rest. Stage 2 is an audio-to-pose transformer trained with a flow-matching objective to predict lips-pose vectors from speech. Together, these stages form a simple and stable pipeline that combines deterministic reconstruction with robust audio control, delivering high perceptual quality and faster-than-real-time speed.
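The speed claim can be restated as a per-frame time budget. The sketch below is plain arithmetic; the 25 FPS playback rate is an assumption chosen for illustration, and the abstract does not state which pipeline components the 100 FPS figure covers. The function names are hypothetical.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given throughput."""
    return 1000.0 / fps


def real_time_factor(render_fps, playback_fps):
    """How many times faster than playback the renderer runs."""
    return render_fps / playback_fps


# 100 FPS leaves 10 ms per frame; against an assumed 25 FPS video
# that is 4x faster than real time.
budget = frame_budget_ms(100)
factor = real_time_factor(100, 25)
```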
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents FlashLips, a two-stage mask-free lip-sync pipeline. Stage 1 is a compact latent U-Net editor that takes a reference identity, a masked target frame, and a low-dimensional lips-pose vector, trained end-to-end with reconstruction losses; self-supervision on mouth-altered target variants is used to eliminate explicit masks at inference. Stage 2 is an audio-to-pose transformer trained with a flow-matching objective. The central claim is that the resulting U-Net variant runs at >100 FPS on a single GPU while matching the perceptual quality of larger GAN- and diffusion-based SOTA models.
Significance. If the empirical performance claims hold, the work would be significant for real-time video applications: it replaces GAN/diffusion training with deterministic reconstruction losses, removes mask computation at inference, and delivers faster-than-real-time speed on modest hardware. The combination of self-supervised mask-free editing and flow-matching audio control could simplify deployment in live dubbing and avatar systems.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): the manuscript states that the U-Net matches SOTA visual quality at >100 FPS but contains no quantitative tables, ablation studies, user studies, or error analysis; without these data the central claim cannot be evaluated.
- [§3.1] §3.1 (Self-supervised editor): the description of mouth-altered target variants as pseudo ground truth does not specify the exact alteration procedure or provide controls showing that non-lip regions remain unchanged; if alterations introduce correlated lighting or texture shifts, the network may learn to propagate edits rather than localize lips, undermining both the mask-free claim and the quality comparison.
minor comments (1)
- [§3.1] Notation for the low-dimensional lips-pose vector is introduced without an explicit dimensionality or normalization scheme; adding a short definition would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the experimental validation and methodological clarity.
- Referee: [Abstract and §4] Abstract and §4 (Experiments): the manuscript states that the U-Net matches SOTA visual quality at >100 FPS but contains no quantitative tables, ablation studies, user studies, or error analysis; without these data the central claim cannot be evaluated.
  Authors: We agree that the current version of §4 lacks the quantitative support needed to fully substantiate the central claims. In the revised manuscript we will add comprehensive tables reporting PSNR, SSIM, LPIPS, and FID scores against recent GAN- and diffusion-based lip-sync baselines on standard benchmarks. We will also include ablation studies isolating the contribution of the self-supervised mask removal and the lips-pose vector, results from a small-scale perceptual user study, and a dedicated error-analysis subsection that examines failure cases and speed-quality trade-offs.
  Revision: yes
- Referee: [§3.1] §3.1 (Self-supervised editor): the description of mouth-altered target variants as pseudo ground truth does not specify the exact alteration procedure or provide controls showing that non-lip regions remain unchanged; if alterations introduce correlated lighting or texture shifts, the network may learn to propagate edits rather than localize lips, undermining both the mask-free claim and the quality comparison.
  Authors: We acknowledge that the description in §3.1 is insufficiently precise. In the revision we will explicitly detail the mouth-alteration procedure (landmark-driven affine warping of the mouth region followed by Poisson blending to preserve local lighting and texture statistics) and will add both qualitative visualizations and quantitative controls (e.g., pixel-wise difference maps restricted to non-mouth areas) demonstrating that edits remain localized. These additions will directly address concerns about unintended propagation of changes.
  Revision: yes
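The two kinds of evidence promised in the responses, a full-frame fidelity score and a difference map restricted to non-mouth areas, could be computed along the following lines. This is a sketch under stated assumptions: images are single-channel nested lists with values in [0, 1], `mouth_mask` is a hypothetical evaluation-only mask (the model itself would not receive it at inference), and the function names are illustrative. PSNR uses its usual definition, 10 * log10(MAX^2 / MSE).

```python
import math


def non_mouth_mean_abs_diff(original, edited, mouth_mask):
    """Mean absolute pixel difference over pixels where mouth_mask is False.
    A value near zero suggests the edit stayed inside the mouth region."""
    total, count = 0.0, 0
    for row_o, row_e, row_m in zip(original, edited, mouth_mask):
        for o, e, m in zip(row_o, row_e, row_m):
            if not m:
                total += abs(o - e)
                count += 1
    return total / count if count else 0.0


def psnr(reference, test, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    squared = [
        (r - t) ** 2
        for row_r, row_t in zip(reference, test)
        for r, t in zip(row_r, row_t)
    ]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Running `non_mouth_mean_abs_diff` with and without the self-supervised training stage would give a single number per frame for the leakage ablation the referee asks about.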
Circularity Check
No significant circularity; claims rest on empirical training outcomes
The paper describes a two-stage pipeline whose mask-free inference and 100-FPS performance are presented as measured results of training a latent U-Net with reconstruction losses on mouth-altered pseudo-ground-truth frames plus a flow-matching audio-to-pose transformer. No equations, fitted parameters renamed as predictions, or self-citation chains are exhibited that would make the reported speed or quality equivalent to the inputs by construction. The self-supervision step is a training procedure whose success is claimed to be verified empirically rather than guaranteed by definition.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: A compact U-Net can learn to perform localized lip edits in latent space using only reconstruction losses when provided a low-dimensional pose vector.
- domain assumption: Self-supervision with mouth-altered target variants teaches the network to localize edits without explicit masks at inference.
Supplementary Material
A. Training Details
A.1. Data Augmentation
During training for both stages, we apply the following augmentations. All images are normalized by dividing by 255 to map pixel values into the range [0, 1]. For Stage 1, we additionally appl...