arxiv: 2605.28056 · v1 · pith:THUV4EV5new · submitted 2026-05-27 · 💻 cs.CV

CogPortrait: Fine-Grained Eye-Region Control in Portrait Animation via Hierarchical Agent Planning

Pith reviewed 2026-06-29 13:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords portrait animation · eye region control · MLLM agents · facial keypoints · DiT video generation · EMH benchmark · action units · beyond-emotion states

The pith

CogPortrait translates high-level labels into precise eye-region keypoints using three chain-of-thought MLLM agents before synthesizing animations in a DiT backbone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing portrait animation techniques face a trade-off when handling subtle eye dynamics beyond basic emotions: coarse inputs such as emotion labels cannot describe them, while Action Units or driving videos impose a heavy input burden. CogPortrait addresses this by routing high-level labels through three MLLM agents that plan temporal events, retrieve and compose matching prototypes from a real-behavior library, and enforce semantic-physiological constraints to produce facial-keypoint sequences. These sequences then condition a DiT video generator equipped with eye-region-aware classifier-free guidance and KTO-based refinement for boundary cases. Experiments on HDTF and the new EMH benchmark are reported to show more precise eye-region control while preserving identity and visual quality. A sympathetic reader would therefore expect that complex ocular behaviors such as thinking or drowsiness can now be specified with simple labels rather than detailed videos or action-unit annotations.

Core claim

The central claim is a two-stage pipeline. In the first stage, three chain-of-thought MLLM agents compile high-level labels into temporally coherent facial keypoints, drawing on a real-behavior library and refining the result with semantic-physiological constraints. In the second stage, these keypoints drive a DiT-based generator that uses dynamic classifier-free guidance with eye-region reweighting and KTO refinement. The paper reports finer eye-region control than prior label-driven or video-driven methods, with superior visual quality and identity consistency, on both the HDTF dataset and the newly introduced EMH benchmark.

What carries the argument

Three chain-of-thought MLLM agents performing temporal event planning, prototype retrieval from a real-behavior library, and semantic-physiological constraint enforcement to convert high-level labels into facial-keypoint sequences.
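
The three-agent flow can be pictured with a toy sketch. This is not the authors' implementation: the MLLM calls are replaced by lookup tables, the keypoints are reduced to a single eye-openness value per frame, and every name here (`PLANS`, `LIBRARY`, `plan_events`, `retrieve_and_compose`, `enforce_constraints`, `max_step`) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    name: str   # prototype name, e.g. "slow_blink"
    start: int  # first frame (inclusive)
    end: int    # last frame (exclusive)


# Hypothetical stand-in for the real-behavior library: each prototype is a
# short eye-openness curve in [0, 1], where 1 means fully open.
LIBRARY = {
    "open": [1.0, 1.0],
    "slow_blink": [1.0, 0.5, 0.0, 0.5, 1.0],
    "droop": [0.8, 0.6, 0.4, 0.3],
}

# Hypothetical label-to-event table. In the paper an MLLM agent plans this.
PLANS = {
    "drowsiness": ["droop", "slow_blink", "droop"],
    "neutral": ["open"],
}


def plan_events(label, n_frames):
    """Agent 1 (stub): split the clip into equal-length named events."""
    names = PLANS.get(label, ["open"])
    span = n_frames // len(names)
    events = []
    for i, name in enumerate(names):
        end = n_frames if i == len(names) - 1 else (i + 1) * span
        events.append(Event(name, i * span, end))
    return events


def resample(curve, length):
    """Linearly resample a prototype curve to `length` frames."""
    if length == 1:
        return [curve[0]]
    out = []
    for t in range(length):
        pos = t * (len(curve) - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(curve) - 1)
        frac = pos - lo
        out.append(curve[lo] * (1 - frac) + curve[hi] * frac)
    return out


def retrieve_and_compose(events):
    """Agent 2 (stub): fetch each event's prototype and concatenate them."""
    seq = []
    for ev in events:
        seq.extend(resample(LIBRARY[ev.name], ev.end - ev.start))
    return seq


def enforce_constraints(seq, max_step=0.25):
    """Agent 3 (stub): limit per-frame change, then clamp to [0, 1]."""
    out = []
    for v in seq:
        if out:
            v = out[-1] + max(-max_step, min(max_step, v - out[-1]))
        out.append(min(1.0, max(0.0, v)))
    return out


def label_to_openness(label, n_frames=30):
    return enforce_constraints(retrieve_and_compose(plan_events(label, n_frames)))
```

Calling `label_to_openness("drowsiness", 30)` returns a 30-frame openness curve. The point of the sketch is only the separation of planning, retrieval, and constraint enforcement, which is what lets each stage be inspected or replaced on its own.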

If this is right

  • High-level emotion and state labels become sufficient input for controlling subtle, non-emotional eye behaviors such as thinking or drowsiness.
  • The EMH benchmark supplies AU-level metrics that quantify fine-grained eye-region and head-motion fidelity separately from overall visual quality.
  • Dynamic classifier-free guidance with eye-region reweighting plus KTO refinement improves handling of boundary cases without degrading identity consistency.
  • The two-stage separation allows the keypoint planner to be swapped or extended independently of the DiT video backbone.
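
This page does not spell out how the guidance is reweighted, so the following is a minimal sketch under stated assumptions rather than the paper's formulation. It uses the standard classifier-free guidance combination, `eps = eps_uncond + w * (eps_cond - eps_uncond)`, and assumes that "eye-region reweighting" means a larger `w` wherever a soft eye mask is active, and that "dynamic" means a scale that varies with the denoising step. The names `guided_noise`, `dynamic_scale`, `eye_boost`, `w_start`, and `w_end` are hypothetical.

```python
def dynamic_scale(step, total_steps, w_start=5.0, w_end=2.0):
    """Assumed schedule: interpolate the guidance scale linearly over steps."""
    if total_steps <= 1:
        return w_start
    frac = step / (total_steps - 1)
    return w_start + (w_end - w_start) * frac


def guided_noise(eps_uncond, eps_cond, eye_mask, base_scale=3.0, eye_boost=1.5):
    """Per-element classifier-free guidance with a spatially varying scale.

    eps_uncond, eps_cond: flat lists of predicted noise values.
    eye_mask: flat list in [0, 1], 1 inside the eye region.
    The scale is base_scale outside the mask and base_scale * eye_boost inside.
    """
    out = []
    for u, c, m in zip(eps_uncond, eps_cond, eye_mask):
        w = base_scale * (1.0 + (eye_boost - 1.0) * m)
        out.append(u + w * (c - u))
    return out
```

With `base_scale=2.0` and `eye_boost=1.5`, an element outside the mask is guided with scale 2.0 and an element inside it with scale 3.0. In practice `base_scale` would be taken from `dynamic_scale(step, total_steps)` at each denoising step.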

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the same agent-planning pattern works for eye keypoints, analogous libraries and planners could be built for mouth, eyebrow, or hand motion in full-body animation.
  • Replacing the offline MLLM stage with a distilled model could enable interactive, label-driven portrait editing at runtime.
  • The approach implies that MLLMs can serve as semantic-to-motion translators in other animation domains where direct supervision is scarce.

Load-bearing premise

The MLLM agents can convert high-level labels into accurate temporal keypoint sequences from the real-behavior library without introducing systematic errors for states outside basic emotions.

What would settle it

On the EMH benchmark, measure AU-level eye and head motion accuracy of CogPortrait outputs against ground-truth sequences for thinking or drowsiness labels; if the error exceeds that of driving-video baselines while visual quality also drops, the central claim fails.
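
The test above can be written down as a small decision rule. This is a sketch only: the paper's two AU-level metrics are not defined on this page, so mean absolute AU-intensity error is used as a stand-in, and the quality score is assumed to be one where higher is better. `mean_abs_au_error` and `claim_fails` are hypothetical names.

```python
def mean_abs_au_error(pred, truth):
    """Mean absolute error over all frames and AUs.

    pred, truth: lists of frames, each frame a list of AU intensities.
    """
    total = 0.0
    count = 0
    for p_frame, t_frame in zip(pred, truth):
        for p, t in zip(p_frame, t_frame):
            total += abs(p - t)
            count += 1
    return total / count


def claim_fails(label_err, driving_err, label_quality, driving_quality):
    """True when the label-driven output is both less accurate and lower quality."""
    return label_err > driving_err and label_quality < driving_quality
```

Both conditions must hold for the claim to fail under this reading; a method that is less accurate but visually better would be a trade-off, not a refutation.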

Figures

Figures reproduced from arXiv: 2605.28056 by Donglin Di, He Feng, Lei Fan, Tonghua Su, Yongjia Ma.

Figure 1: Comparison with prior portrait animation paradigms, quantitative results on the EMH benchmark, and OOD cases
Figure 2: Overview of the proposed CogPortrait. Stage 1 compiles a high-level label, a reference portrait, and optional fine
Figure 3: Overview of the EMH benchmark. The figure shows
Figure 4: Qualitative comparison on HDTF (left) and the EMH benchmark (right) against representative baselines. Red boxes
Figure 5: Ablation study for CFG strategies. We selected im
Figure 6: Ablation study for the KTO refinement on the con
Original abstract

Portrait animation methods have achieved substantial visual quality and lip synchronization, but fine-grained manipulation of the eye region still faces a trade-off between input granularity and motion accuracy. Existing methods using emotion labels or coarse text prompts are insufficient for describing subtle ocular dynamics, whereas approaches based on Action Units or driving videos provide higher fidelity at the cost of a heavier input burden. These limitations are still restrictive for beyond-emotion states (e.g., thinking and drowsiness). In light of the above, we propose CogPortrait, a two-stage framework that generates portrait animations from high-level labels. In the first stage, three chain-of-thought Multimodal Large Language Model (MLLM) agents compile high-level labels into facial keypoints through temporal event planning, prototype retrieval and composition from a real-behavior library, and semantic-physiological constraint enforcement. In the second stage, a DiT-based video generation backbone synthesizes the final animation conditioned on the keypoints, reference portrait, audio, and text prompt, enhanced by a dynamic classifier-free guidance strategy with eye-region-aware reweighting and KTO-based refinement for boundary cases. We further introduce the EMH benchmark covering diverse emotions and beyond-emotion categories with two AU-level metrics for evaluating fine-grained eye-region and head-motion control. Extensive experiments on HDTF and the EMH benchmark demonstrate that CogPortrait achieves more precise eye-region control than existing methods while maintaining superior visual quality and identity consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes CogPortrait, a two-stage framework for portrait animation from high-level labels. Stage 1 uses three chain-of-thought MLLM agents for temporal event planning, prototype retrieval/composition from a real-behavior library, and semantic-physiological constraint enforcement to produce facial keypoint sequences. Stage 2 employs a DiT-based video generator conditioned on the keypoints, reference image, audio, and text, with dynamic classifier-free guidance (eye-region-aware reweighting) and KTO refinement. A new EMH benchmark is introduced with AU-level metrics for eye-region and head-motion evaluation. Experiments on HDTF and EMH claim superior eye-region control, visual quality, and identity consistency over prior methods.

Significance. If the central claims hold with proper validation, the work could meaningfully advance fine-grained, high-level control in portrait animation by addressing limitations of emotion labels, coarse text, or driving videos, particularly for beyond-emotion states like thinking or drowsiness. The hierarchical agent approach and new benchmark with AU metrics represent a potentially useful direction for controllable generation.

major comments (2)
  1. [Method (agent pipeline) and Experiments] The central claim of more precise eye-region control (abstract and § on experiments) rests on the three CoT MLLM agents producing temporally accurate keypoint sequences from high-level labels, especially for beyond-emotion states. No intermediate quantitative metrics are reported on agent output accuracy (e.g., keypoint L2 error vs. ground-truth extracted keypoints, AU activation F1, or temporal alignment scores), so end-to-end HDTF/EMH metrics do not isolate whether gains derive from better keypoints or from the DiT enhancements.
  2. [EMH benchmark definition and evaluation protocol] The EMH benchmark is presented with two AU-level metrics, but the manuscript does not report baseline comparisons or ablations that would confirm the metrics isolate eye-region control independent of overall motion quality or identity preservation.
minor comments (1)
  1. [DiT stage description] Notation for the dynamic CFG reweighting and KTO refinement could be clarified with explicit equations or pseudocode to show how eye-region awareness is implemented.
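
Two of the intermediate metrics named in the first major comment could be computed roughly as follows. This is a hedged sketch, not the paper's evaluation code: keypoints are assumed to be `(x, y)` pairs per frame, AU activations are assumed to be intensities binarized at a hypothetical `threshold`, and the function names are illustrative.

```python
import math


def keypoint_l2(pred, truth):
    """Mean Euclidean distance between predicted and ground-truth keypoints.

    pred, truth: lists of frames, each frame a list of (x, y) tuples.
    """
    dists = [
        math.dist(p, t)
        for p_frame, t_frame in zip(pred, truth)
        for p, t in zip(p_frame, t_frame)
    ]
    return sum(dists) / len(dists)


def au_activation_f1(pred, truth, threshold=0.5):
    """F1 score of binarized AU activations (on when intensity >= threshold)."""
    tp = fp = fn = 0
    for p, t in zip(pred, truth):
        p_on = p >= threshold
        t_on = t >= threshold
        if p_on and t_on:
            tp += 1
        elif p_on:
            fp += 1
        elif t_on:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Reporting these on the Stage 1 keypoints, before any video is generated, is what would separate planner accuracy from gains contributed by the DiT stage.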

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and agree that additional analyses will strengthen the isolation of contributions and validation of the benchmark.

  1. Referee: [Method (agent pipeline) and Experiments] The central claim of more precise eye-region control (abstract and § on experiments) rests on the three CoT MLLM agents producing temporally accurate keypoint sequences from high-level labels, especially for beyond-emotion states. No intermediate quantitative metrics are reported on agent output accuracy (e.g., keypoint L2 error vs. ground-truth extracted keypoints, AU activation F1, or temporal alignment scores), so end-to-end HDTF/EMH metrics do not isolate whether gains derive from better keypoints or from the DiT enhancements.

    Authors: We agree that the absence of intermediate metrics on the agent outputs limits the ability to isolate the source of improvements. In the revised manuscript, we will add quantitative evaluations of the three MLLM agents, including keypoint L2 error against ground-truth keypoints, AU activation F1 scores, and temporal alignment metrics computed on a held-out validation set from the behavior library. These will be reported alongside the end-to-end results to clarify the contribution of the hierarchical planning stage. revision: yes

  2. Referee: [EMH benchmark definition and evaluation protocol] The EMH benchmark is presented with two AU-level metrics, but the manuscript does not report baseline comparisons or ablations that would confirm the metrics isolate eye-region control independent of overall motion quality or identity preservation.

    Authors: We acknowledge that further validation is needed to confirm the metrics' specificity. In the revision, we will add baseline method comparisons on the EMH benchmark using the proposed AU-level metrics and include targeted ablations (such as metrics computed with vs. without eye-region conditioning) to demonstrate that the metrics isolate fine-grained eye-region and head-motion control from general visual quality and identity preservation factors. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on an external benchmark and a new dataset, not on internal redefinitions

The paper describes a two-stage pipeline (MLLM agents for keypoint sequences from labels, followed by DiT synthesis with dynamic CFG and KTO) whose performance is asserted via experiments on HDTF and the newly introduced EMH benchmark using AU-level metrics. No equations, fitted parameters, or self-citations are presented that reduce any prediction or uniqueness claim to the inputs by construction. The central claims therefore remain empirically falsifiable against external data rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; all components are described as extensions of existing MLLMs, DiT models, and datasets.

pith-pipeline@v0.9.1-grok · 5802 in / 1116 out tokens · 32767 ms · 2026-06-29T13:00:02.378812+00:00 · methodology
