GaussianEmoTalker: Real-Time Emotional Talking Head Synthesis with Audio-Driven and Blendshape-Based 3D Gaussian Splatting
Pith reviewed 2026-07-02 13:44 UTC · model grok-4.3
The pith
GaussianEmoTalker generates emotional talking heads in real time by deforming neutral 3D Gaussian blendshapes with audio and emotion signals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Emotional animation reduces to a neutral-to-emotional residual deformation problem inside 3D Gaussian Splatting, where GaussianBlendshapes supply the neutral base and a spatial-audio-emotion attention module produces the attribute offsets needed for expressive, intensity-controllable, temporally stable output.
What carries the argument
The spatial-audio-emotion attention module, which combines mesh displacement cues, audio features, emotion categories, and intensity encodings to compute offsets on Gaussian attributes.
If this is right
- Real-time rendering becomes feasible for emotional avatars without sacrificing lip accuracy.
- Emotion intensity can be controlled independently while preserving identity-specific neutral motion.
- Heterogeneous signals from audio, mesh, and emotion labels can be fused into a single set of Gaussian offsets.
- Competitive visual quality is maintained relative to prior emotional talking-head systems.
Where Pith is reading between the lines
- The residual-deformation pattern could be tested on non-talking facial animations such as expressions in virtual reality.
- Replacing the current audio encoder with a different speech model might reveal whether the attention module remains stable across input types.
- The GaussianBlendshapes base could support identity swapping by exchanging only the neutral component.
Load-bearing premise
An emotion-conditioned residual deformation predicted from mesh displacement cues, audio features, emotion categories, and intensity encodings will produce expressive and temporally stable Gaussian attribute offsets.
What would settle it
Videos generated under rapidly changing emotion intensities that exhibit unstable expressions or loss of lip synchronization.
Figures
read the original abstract
Audio-driven talking head synthesis has achieved impressive progress in lip synchronization and visual quality, yet generating expressive emotional avatars with controllable intensity remains challenging, especially under real-time constraints. In this paper, we present GaussianEmoTalker, an audio-driven framework for real-time emotional talking head synthesis based on 3D Gaussian Splatting. Instead of directly predicting the final emotional avatar from speech, we formulate emotional animation as a neutral-to-emotional residual deformation problem. GaussianEmoTalker first constructs an identity-specific neutral talking space with GaussianBlendshapes, which provides high-fidelity Gaussian attributes and phoneme-synchronized neutral motion. It then predicts an emotion-conditioned residual deformation by combining mesh displacement cues, audio features, emotion categories, and intensity encodings. To fuse these heterogeneous signals, we introduce a spatial-audio-emotion attention module that estimates the offsets of Gaussian attributes for expressive and temporally stable rendering. Extensive experiments demonstrate that GaussianEmoTalker achieves competitive video quality, accurate lip synchronization, controllable emotional expression, and real-time rendering compared with recent emotional talking head methods. Our project page is available at https://njust-yang.github.io/GaussianEmoTalker.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents GaussianEmoTalker, an audio-driven framework for real-time emotional talking head synthesis based on 3D Gaussian Splatting. It constructs an identity-specific neutral talking space with GaussianBlendshapes for high-fidelity attributes and phoneme-synchronized motion, then predicts emotion-conditioned residual deformations by fusing mesh displacement cues, audio features, emotion categories, and intensity encodings via a spatial-audio-emotion attention module that estimates Gaussian attribute offsets. The manuscript claims competitive video quality, accurate lip synchronization, controllable emotional expression, and real-time rendering relative to recent emotional talking head methods.
Significance. If the experimental claims hold, the work could advance real-time controllable emotional avatar synthesis by combining blendshape-based neutral animation with efficient residual deformation in the Gaussian Splatting domain, offering potential advantages in speed and expressiveness over prior approaches.
major comments (2)
- Abstract: the central claim of competitive performance on quality, lip sync, emotion control, and speed is stated without any quantitative results, baselines, or error analysis, which is load-bearing for assessing whether the residual deformation approach delivers the claimed gains.
- Abstract (spatial-audio-emotion attention module description): the assumption that fusing heterogeneous signals (mesh displacement, audio, emotion categories, intensity) via attention will yield expressive and temporally stable Gaussian attribute offsets is presented without the module's formulation, loss terms, or ablation evidence, making it difficult to evaluate robustness of the residual deformation construction.
minor comments (1)
- The project page URL is referenced but its content (videos, code) is not described in the manuscript, which would aid reproducibility assessment.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below.
read point-by-point responses
-
Referee: Abstract: the central claim of competitive performance on quality, lip sync, emotion control, and speed is stated without any quantitative results, baselines, or error analysis, which is load-bearing for assessing whether the residual deformation approach delivers the claimed gains.
Authors: The abstract is intended as a concise summary of the method and high-level claims. The manuscript provides the requested quantitative support in Section 4, including direct comparisons against recent baselines on standard metrics (PSNR, SSIM, LPIPS for visual quality; LSE for lip synchronization; user-study scores for emotion controllability and intensity) together with runtime measurements confirming real-time performance. These results substantiate the gains from the neutral-to-emotional residual deformation. We are willing to insert one or two key numerical highlights into the abstract if the editor considers it beneficial. revision: partial
-
Referee: Abstract (spatial-audio-emotion attention module description): the assumption that fusing heterogeneous signals (mesh displacement, audio, emotion categories, intensity) via attention will yield expressive and temporally stable Gaussian attribute offsets is presented without the module's formulation, loss terms, or ablation evidence, making it difficult to evaluate robustness of the residual deformation construction.
Authors: The abstract supplies only a high-level description of the module. Its complete formulation (multi-head attention over the four heterogeneous feature streams), the loss terms (reconstruction, emotion consistency, and temporal smoothness), and the ablation studies that quantify the contribution to expressiveness and stability appear in Sections 3.3 and 4.4. Readers can therefore assess the robustness of the residual deformation construction from the body of the paper. revision: no
Circularity Check
No significant circularity detected
full rationale
The provided abstract and description outline a residual deformation approach and spatial-audio-emotion attention module without any equations, fitted parameters presented as predictions, or self-citations that reduce the claimed results to inputs by construction. The neutral-to-emotional formulation is introduced as a modeling choice rather than derived from prior self-referential results. No load-bearing steps match the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard assumptions of 3D Gaussian Splatting and attention-based fusion hold for animation tasks.
invented entities (2)
-
GaussianBlendshapes
no independent evidence
-
spatial-audio-emotion attention module
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Speech-driven expressive talking lips with conditional sequential generative adversarial networks,
N. Sadoughi and C. Busso, “Speech-driven expressive talking lips with conditional sequential generative adversarial networks,”IEEE Transac- tions on Affective Computing, vol. 12, no. 4, p. 1031–1044, Oct. 2021
2021
-
[2]
Arbitrary talking face generation via attentional audio-visual coherence learning,
H. Zhu, H. Huang, Y . Li, A. Zheng, and R. He, “Arbitrary talking face generation via attentional audio-visual coherence learning,” in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020, pp. 2362–2368
2020
-
[3]
Audio-driven facial animation by joint end-to-end learning of pose and tion,
T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen, “Audio-driven facial animation by joint end-to-end learning of pose and tion,”ACM Trans. Graph., vol. 36, no. 4, Jul. 2017
2017
-
[4]
Audio representation learning by distilling video as privileged information,
A. Hajavi and A. Etemad, “Audio representation learning by distilling video as privileged information,”IEEE Transactions on Artificial Intel- ligence, vol. 5, no. 1, pp. 446–456, 2023
2023
-
[5]
Emotion flip reasoning in multiparty conversations,
S. Kumar, S. Dudeja, M. S. Akhtar, and T. Chakraborty, “Emotion flip reasoning in multiparty conversations,”IEEE Transactions on Artificial Intelligence, vol. 5, no. 3, pp. 1339–1348, 2023
2023
-
[6]
Media2face: Co-speech facial animation generation with multi-modality guidance,
Q. Zhao, P. Long, Q. Zhang, D. Qin, H. Liang, L. Zhang, Y . Zhang, J. Yu, and L. Xu, “Media2face: Co-speech facial animation generation with multi-modality guidance,” inACM SIGGRAPH 2024 conference papers, 2024, pp. 1–13
2024
-
[7]
Speech driven talking face generation from a single image and an tion condition,
S. E. Eskimez, Y . Zhang, and Z. Duan, “Speech driven talking face generation from a single image and an tion condition,”ACM Trans. Multim., vol. 24, pp. 3480–3490, 2022
2022
-
[8]
End-to-end label uncertainty modeling in speech tion recognition using bayesian neural networks and label distribution learning,
N. R. Prabhu, N. Lehmann-Willenbrock, and T. Gerkmann, “End-to-end label uncertainty modeling in speech tion recognition using bayesian neural networks and label distribution learning,”IEEE Transactions on Affective Computing, vol. 15, no. 2, p. 579–592, Apr. 2024
2024
-
[9]
tion-controllable generalized talking face generation,
S. Sinha, S. Biswas, R. Yadav, and B. Bhowmick, “tion-controllable generalized talking face generation,” inProceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI, 2022, pp. 1320–1327
2022
-
[10]
Edtalk: Efficient disentanglement for tional talking head synthesis,
S. Tan, B. Ji, M. Bi, and Y . Pan, “Edtalk: Efficient disentanglement for tional talking head synthesis,” inComputer Vision - ECCV 2024 - 18th European Conference, ser. Lecture Notes in Computer Science, vol. 15064, 2024, pp. 398–416
2024
-
[11]
MEAD: A large-scale audio-visual dataset for tional talking- face generation,
K. Wang, Q. Wu, L. Song, Z. Yang, W. Wu, C. Qian, R. He, Y . Qiao, and C. C. Loy, “MEAD: A large-scale audio-visual dataset for tional talking- face generation,” inComputer Vision - ECCV 2020 - 16th European Conference, ser. Lecture Notes in Computer Science, vol. 12366, 2020, pp. 700–717
2020
-
[12]
Efficient tional adaptation for audio-driven talking-head generation,
Y . Gan, Z. Yang, X. Yue, L. Sun, and Y . Yang, “Efficient tional adaptation for audio-driven talking-head generation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, October 2023, pp. 22 634–22 645
2023
-
[13]
Emodiffhead: continuously emotional control in talking head generation via diffusion,
J. Zhang, W. Mai, and Z. Zhang, “Emodiffhead: continuously emotional control in talking head generation via diffusion,”IEEE Transactions on Artificial Intelligence, 2026
2026
-
[14]
Learning an animatable detailed 3d face model from in-the-wild images,
Y . Feng, H. Feng, M. J. Black, and T. Bolkart, “Learning an animatable detailed 3d face model from in-the-wild images,”ACM Trans. Graph., vol. 40, no. 4, pp. 1–13, 2021
2021
-
[15]
Diffposetalk: Speech-driven stylistic 3d facial animation and head pose generation via diffusion models,
Z. Sun, T. Lv, S. Ye, M. Lin, J. Sheng, Y .-H. Wen, M. Yu, and Y .-j. Liu, “Diffposetalk: Speech-driven stylistic 3d facial animation and head pose generation via diffusion models,”ACM Trans. Graph., vol. 43, no. 4, pp. 1–9, 2024
2024
-
[16]
3d gaussian splatting for real-time radiance field rendering,
B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,”ACM Trans. Graph., vol. 42, no. 4, pp. 139–1, 2023
2023
-
[17]
3d gaussian blendshapes for head avatar animation,
S. Ma, Y . Weng, T. Shao, and K. Zhou, “3d gaussian blendshapes for head avatar animation,” inACM SIGGRAPH 2024 Conference Papers, 2024, pp. 1–10
2024
-
[18]
You said that?: Syn- thesising talking faces from audio,
A. Jamaludin, J. S. Chung, and A. Zisserman, “You said that?: Syn- thesising talking faces from audio,”Int. J. Comput. Vis., vol. 127, no. 11-12, pp. 1767–1779, 2019
2019
-
[19]
Talking face generation by conditional recurrent adversarial network,
Y . Song, J. Zhu, D. Li, A. Wang, and H. Qi, “Talking face generation by conditional recurrent adversarial network,” inProceedings of the Twenty- Eighth International Joint Conference on Artificial Intelligence, 2019, pp. 919–925
2019
-
[20]
Applying segment-level attention on bi- modal transformer encoder for audio-visual tion recognition,
J.-H. Hsu and C. H. Wu, “Applying segment-level attention on bi- modal transformer encoder for audio-visual tion recognition,”IEEE Transactions on Affective Computing, vol. 14, pp. 3231–3243, 2023
2023
-
[21]
Talking face generation by adversarially disentangled audio-visual representation,
H. Zhou, Y . Liu, Z. Liu, P. Luo, and X. Wang, “Talking face generation by adversarially disentangled audio-visual representation,” inProceed- ings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 9299–9306
2019
-
[22]
Capture, learning, and synthesis of 3d speaking styles,
D. Cudeiro, T. Bolkart, C. Laidlaw, A. Ranjan, and M. J. Black, “Capture, learning, and synthesis of 3d speaking styles,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10 101–10 111
2019
-
[23]
A lip sync expert is all you need for speech to lip generation in the wild,
K. Prajwal, R. Mukhopadhyay, V . P. Namboodiri, and C. Jawahar, “A lip sync expert is all you need for speech to lip generation in the wild,” in Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 484–492
2020
-
[24]
Hierarchical cross-modal talking face generation with dynamic pixel-wise loss,
L. Chen, R. K. Maddox, Z. Duan, and C. Xu, “Hierarchical cross-modal talking face generation with dynamic pixel-wise loss,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7832–7841
2019
-
[25]
Speech2video synthesis with 3d skeleton regularization and expressive body poses,
M. Liao, S. Zhang, P. Wang, H. Zhu, X. Zuo, and R. Yang, “Speech2video synthesis with 3d skeleton regularization and expressive body poses,” inProceedings of the Asian Conference on Computer Vision, 2020
2020
-
[26]
Faceformer: Speech- driven 3d facial animation with transformers,
Y . Fan, Z. Lin, J. Saito, W. Wang, and T. Komura, “Faceformer: Speech- driven 3d facial animation with transformers,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18 770–18 780
2022
-
[27]
Codetalker: Speech-driven 3d facial animation with discrete motion prior,
J. Xing, M. Xia, Y . Zhang, X. Cun, J. Wang, and T.-T. Wong, “Codetalker: Speech-driven 3d facial animation with discrete motion prior,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 780–12 790
2023
-
[28]
Vdub: Modifying face video of actors for plausible visual alignment to a dubbed audio track,
P. Garrido, L. Valgaerts, H. Sarmadi, I. Steiner, K. Varanasi, P. P´erez, and C. Theobalt, “Vdub: Modifying face video of actors for plausible visual alignment to a dubbed audio track,”Comput. Graph. Forum, vol. 34, no. 2, pp. 193–204, 2015
2015
-
[29]
Masked lip-sync prediction by audio-visual contextual exploitation in transformers,
Y . Sun, H. Zhou, K. Wang, Q. Wu, Z. Hong, J. Liu, E. Ding, J. Wang, Z. Liu, and K. Hideki, “Masked lip-sync prediction by audio-visual contextual exploitation in transformers,” inSIGGRAPH Asia 2022 conference papers, 2022, pp. 1–9
2022
-
[30]
Neural voice puppetry: Audio-driven facial reenactment,
J. Thies, M. Elgharib, A. Tewari, C. Theobalt, and M. Nießner, “Neural voice puppetry: Audio-driven facial reenactment,” inEuropean confer- ence on computer vision, 2020, pp. 716–731
2020
-
[31]
Syn- thesizing obama: learning lip sync from audio,
S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman, “Syn- thesizing obama: learning lip sync from audio,”ACM Trans. Graph., vol. 36, no. 4, pp. 95:1–95:13, 2017
2017
-
[32]
Audio-driven tional video portraits,
X. Ji, H. Zhou, K. Wang, W. Wu, C. C. Loy, X. Cao, and F. Xu, “Audio-driven tional video portraits,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 080–14 089
2021
-
[33]
Live speech portraits: real-time photore- alistic talking-head animation,
Y . Lu, J. Chai, and X. Cao, “Live speech portraits: real-time photore- alistic talking-head animation,”ACM Trans. Graph., vol. 40, no. 6, pp. 220:1–220:17, 2021
2021
-
[34]
Gaussianspeech: Audio-driven gaussian avatars,
S. Aneja, A. Sevastopolsky, T. Kirschstein, J. Thies, A. Dai, and M. Nießner, “Gaussianspeech: Audio-driven gaussian avatars,”arXiv preprint arXiv:2411.18675, 2024
-
[35]
tag: tion-aware talking head synthesis on gaussian splatting with few-shot personalization,
H. Xu, K. Cheng, L. Wang, N. Bi, and X. Liu, “tag: tion-aware talking head synthesis on gaussian splatting with few-shot personalization,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 10 921–10 931
2026
-
[36]
Embodied navigation in unknown environments with implicit scene memory and target-aware memory retrieval,
Q. Liu, Y . Li, Y . Xu, L. Han, Z. Liu, and H. Wang, “Embodied navigation in unknown environments with implicit scene memory and target-aware memory retrieval,”IEEE Transactions on Artificial Intelligence, 2025
2025
-
[37]
Robotic perception of transparent objects: A review,
J. Jiang, G. Cao, J. Deng, T.-T. Do, and S. Luo, “Robotic perception of transparent objects: A review,”IEEE Transactions on Artificial Intelligence, vol. 5, no. 6, pp. 2547–2567, 2023
2023
-
[38]
Nerf: representing scenes as neural radiance fields for view synthesis,
B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: representing scenes as neural radiance fields for view synthesis,”Commun. ACM, vol. 65, no. 1, pp. 99–106, 2022
2022
-
[39]
Ad-nerf: Audio driven neural radiance fields for talking head synthesis,
Y . Guo, K. Chen, S. Liang, Y . Liu, H. Bao, and J. Zhang, “Ad-nerf: Audio driven neural radiance fields for talking head synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5764–5774
2021
-
[40]
Semantic-aware implicit neural audio-driven video portrait generation,
X. Liu, Y . Xu, Q. Wu, H. Zhou, W. Wu, and B. Zhou, “Semantic-aware implicit neural audio-driven video portrait generation,” inEuropean conference on computer vision, 2022, pp. 106–125
2022
-
[41]
Dfa-nerf: Person- alized talking head generation via disentangled face attributes neural rendering,
S. Yao, R. Zhong, Y . Yan, G. Zhai, and X. Yang, “Dfa-nerf: Person- alized talking head generation via disentangled face attributes neural rendering,”CoRR, vol. abs/2201.00791, 2022. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 13
-
[42]
Learning dynamic facial radiance fields for few-shot talking head synthesis,
S. Shen, W. Li, Z. Zhu, Y . Duan, J. Zhou, and J. Lu, “Learning dynamic facial radiance fields for few-shot talking head synthesis,” inEuropean conference on computer vision. Springer, 2022, pp. 666–682
2022
-
[43]
Efficient region-aware neural radiance fields for high-fidelity talking portrait synthesis,
J. Li, J. Zhang, X. Bai, J. Zhou, and L. Gu, “Efficient region-aware neural radiance fields for high-fidelity talking portrait synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7534–7544
2023
-
[44]
tional speech-driven animation with content-tion disentanglement,
R. Dan ˇeˇcek, K. Chhatre, S. Tripathi, Y . Wen, M. Black, and T. Bolkart, “tional speech-driven animation with content-tion disentanglement,” in SIGGRAPH Asia 2023 Conference Papers, 2023, pp. 1–13
2023
-
[45]
Eamm: One-shot tional talking face via audio-based tion-aware motion model,
X. Ji, H. Zhou, K. Wang, Q. Wu, W. Wu, F. Xu, and X. Cao, “Eamm: One-shot tional talking face via audio-based tion-aware motion model,” inACM SIGGRAPH 2022 conference proceedings, 2022, pp. 1–10
2022
-
[46]
Expressive talking head generation with granular audio-visual control,
B. Liang, Y . Pan, Z. Guo, H. Zhou, Z. Hong, X. Han, J. Han, J. Liu, E. Ding, and J. Wang, “Expressive talking head generation with granular audio-visual control,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3377–3386
2022
-
[47]
Controllable multi-speaker tional speech synthesis with an tion representation of high generalization capability,
J. Zheng, J. Zhou, W. Zheng, L. Tao, and H. K. Kwan, “Controllable multi-speaker tional speech synthesis with an tion representation of high generalization capability,”IEEE Transactions on Affective Computing, vol. 16, no. 1, pp. 68–82, 2025
2025
-
[48]
Emmn: tional motion mry network for audio- driven tional talking face generation,
S. Tan, B. Ji, and Y . Pan, “Emmn: tional motion mry network for audio- driven tional talking face generation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 089–22 099
2023
-
[49]
Speech synthesis with mixed tions,
K. Zhou, B. Sisman, R. Rana, B. W. Schuller, and H. Li, “Speech synthesis with mixed tions,”IEEE Transactions on Affective Computing, vol. 14, no. 4, pp. 3120–3134, 2023
2023
-
[50]
3d facial expressions through analysis- by-neural-synthesis,
G. Retsinas, P. P. Filntisis, R. Danecek, V . F. Abrevaya, A. Roussos, T. Bolkart, and P. Maragos, “3d facial expressions through analysis- by-neural-synthesis,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2490–2501
2024
-
[51]
Emoca: Emotion driven monocular face capture and animation,
R. Danecek, M. J. Black, and T. Bolkart, “Emoca: Emotion driven monocular face capture and animation,” 2022
2022
-
[52]
Vasa-1: Lifelike audio-driven talking faces generated in real time,
S. Xu, G. Chen, Y .-X. Guo, J. Yang, C. Li, Z. Zang, Y . Zhang, X. Tong, and B. Guo, “Vasa-1: Lifelike audio-driven talking faces generated in real time,”Advances in Neural Information Processing Systems, vol. 37, pp. 660–684, 2024
2024
-
[53]
Learning a model of facial shape and expression from 4d scans,
T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero, “Learning a model of facial shape and expression from 4d scans,”ACM Trans. Graph., vol. 36, no. 6, pp. 194:1–194:17, 2017
2017
-
[54]
Towards metrical reconstruction of human faces,
W. Zielonka, T. Bolkart, and J. Thies, “Towards metrical reconstruction of human faces,” inEuropean conference on computer vision. Springer, 2022, pp. 250–269
2022
-
[55]
I M avatar: Implicit morphable head avatars from videos,
Y . Zheng, V . F. Abrevaya, M. C. B ¨uhler, X. Chen, M. J. Black, and O. Hilliges, “I M avatar: Implicit morphable head avatars from videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13 535–13 545
2022
-
[56]
Gaussiantalker: Real-time talking head synthesis with 3d gaussian splatting,
K. Cho, J. Lee, H. Yoon, Y . Hong, J. Ko, S. Ahn, and S. Kim, “Gaussiantalker: Real-time talking head synthesis with 3d gaussian splatting,” inProceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 10 985–10 994
2024
-
[57]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763
2021
-
[58]
Perceptual losses for real-time style transfer and super-resolution,
J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” inComputer Vision - ECCV 2016 - 14th European Conference, ser. Lecture Notes in Computer Science, vol. 9906, 2016, pp. 694–711
2016
-
[59]
Gans trained by a two time-scale update rule converge to a local nash equilibrium,
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,”Advances in neural information processing systems, vol. 30, 2017
2017
-
[60]
Towards Accurate Generative Models of Video: A New Metric & Challenges
T. Unterthiner, S. Van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly, “Towards accurate generative models of video: A new metric & challenges,”arXiv preprint arXiv:1812.01717, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[61]
Image quality assess- ment: from error visibility to structural similarity,
Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assess- ment: from error visibility to structural similarity,”IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004
2004
-
[62]
Image quality assessment: from error visibility to structural similarity,
Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,”IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004
2004
-
[63]
The unreasonable effectiveness of deep features as a perceptual metric,
R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,”
-
[64]
The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
[Online]. Available: https://arxiv.org/abs/1801.03924
work page internal anchor Pith review Pith/arXiv arXiv
-
[65]
A no-reference image blur metric based on the cumulative probability of blur detection (cpbd),
N. D. Narvekar and L. J. Karam, “A no-reference image blur metric based on the cumulative probability of blur detection (cpbd),”IEEE Transactions on Image Processing, vol. 20, no. 9, pp. 2678–2683, 2011
2011
-
[66]
Out of time: Automated lip sync in the wild,
J. S. Chung and A. Zisserman, “Out of time: Automated lip sync in the wild,” inComputer Vision – ACCV 2016 Workshops, Cham, 2017, pp. 251–263
2016
-
[67]
Adam: A method for stochastic optimization,
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in3rd International Conference on Learning Representations, 2015
2015
-
[68]
Sadtalker: Learning realistic 3d motion coefficients for stylized audio- driven single image talking face animation,
W. Zhang, X. Cun, X. Wang, Y . Zhang, J. Wang, H. Chen, and Y . Yan, “Sadtalker: Learning realistic 3d motion coefficients for stylized audio- driven single image talking face animation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8652–8661
2023
-
[69]
Generalizable and animatable gaussian head avatar,
X. Chu and T. Harada, “Generalizable and animatable gaussian head avatar,” inAdvances in Neural Information Processing Systems, vol. 37, 2024, pp. 57 642–57 670
2024
-
[70]
Gmtalker: Gaussian mixture based tional talking video portraits,
B. Du, Y . Zhao, P. Jiang, S. Zhang, G. Li, J. Liu, and T. Zhao, “Gmtalker: Gaussian mixture based tional talking video portraits,” inProceedings of the 33rd International Joint Conference on Artificial Intelligence, 2024, pp. 740–748. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14 Haijie Yangreceived the M.S. degree in Computer Technology f...
2024
-
[71]
in Electronic Information Engineering at Nanjing University of Science and Technology (NJUST) in Nanjing, China
Currently, he is pursuing a Ph.D. in Electronic Information Engineering at Nanjing University of Science and Technology (NJUST) in Nanjing, China. His research interests include 3D reconstruction and pattern recognition, digital human. Zhenyu Zhangis now an associate professor in Nanjing University. He received Ph.D. degree from Department of Computer Sci...
2020
-
[72]
His research interests include pattern recognition theory, computer vision, and machine learning
He is currently a Processor with NJUST. His research interests include pattern recognition theory, computer vision, and machine learning. Dr. Qian has served as a Guest Editor for Neural Processing Letters and The Visual Computer. Jian Yangreceived the PhD degree from Nanjing University of Science and Technology (NJUST) in 2002, majoring in pattern recogn...
2002
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.