pith. machine review for the scientific record. sign in

arxiv: 2604.12856 · v2 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords audio-driven motion generationbimanual coordinationflow matchingpiano playingstreaming generationMIDI distillationmusic motion synthesis
0
0 comments X

The pith

PianoFlow generates realistic bimanual piano playing motions from audio in real time by distilling MIDI priors during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PianoFlow to generate coordinated hand motions for playing piano from audio input. It uses a flow-matching approach where MIDI data helps learn musical structure during training but is not needed at inference time. An asymmetric attention module handles how the two hands interact dynamically, and a continuation method allows generating long streams without breaks. If successful, this would let computers create natural piano animations quickly and for any length of music. Readers might care because it improves on previous slow or uncoordinated methods for music-related motion synthesis.

Core claim

PianoFlow is a flow-matching framework that strategically leverages MIDI as a privileged modality during training to distill structured musical priors, enabling deep semantic understanding for audio-only inference. It introduces an asymmetric role-gated interaction module for dynamic cross-hand coordination and an autoregressive flow continuation scheme for seamless long-sequence streaming generation.

What carries the argument

Flow-matching framework that distills MIDI symbolic priors into audio-driven generation, combined with an asymmetric role-gated interaction module using role-aware attention and temporal gating, plus autoregressive flow continuation for streaming.

Load-bearing premise

The assumption that MIDI symbolic priors learned during training will transfer effectively to audio-only inference without introducing artifacts or losing generality.

What would settle it

A direct comparison on the PianoMotion10M test set showing that audio-only inference with PianoFlow does not outperform previous audio-only methods in motion accuracy or musical alignment, or that inference speed is not at least 9 times faster.

Figures

Figures reproduced from arXiv: 2604.12856 by Gaoang Wang, Jiayi Han, Kai Ruan, Kaiyue Zhou, Xuan Wang.

Figure 1
Figure 1. Figure 1: PianoFlow enables high-fidelity and real-time bi [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the PianoFlow architecture. The framework operates in two stages: (1) Music-Aware Wrist Trajectory [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison of synthesized motions, with red and green boxes highlighting kinematic inaccuracies and [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
read the original abstract

Audio-driven bimanual piano motion generation requires precise modeling of complex musical structures and dynamic cross-hand coordination. However, existing methods often rely on acoustic-only representations lacking symbolic priors, employ inflexible interaction mechanisms, and are limited to computationally expensive short-sequence generation. To address these limitations, we propose PianoFlow, a flow-matching framework for precise and coordinated bimanual piano motion synthesis. Our approach strategically leverages MIDI as a privileged modality during training, distilling these structured musical priors to achieve deep semantic understanding while maintaining audio-only inference. Furthermore, we introduce an asymmetric role-gated interaction module to explicitly capture dynamic cross-hand coordination through role-aware attention and temporal gating. To enable real-time streaming generation for arbitrarily long sequences, we design an autoregressive flow continuation scheme that ensures seamless cross-chunk temporal coherence. Extensive experiments on the PianoMotion10M dataset demonstrate that PianoFlow achieves superior quantitative and qualitative performance, while accelerating inference by over 9\times compared to previous methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. PianoFlow is a flow-matching framework for audio-driven bimanual piano motion generation. It uses MIDI as a privileged training modality to distill symbolic musical priors into an audio-only inference model, introduces an asymmetric role-gated interaction module for cross-hand coordination via role-aware attention and temporal gating, and proposes an autoregressive flow continuation scheme for seamless streaming of arbitrarily long sequences. On the PianoMotion10M dataset the paper claims superior quantitative and qualitative results together with more than 9× inference speedup relative to prior methods.

Significance. If substantiated, the approach could meaningfully advance real-time music-driven animation by enabling efficient incorporation of musical structure without symbolic input at test time and by addressing long-sequence coherence. The explicit modeling of bimanual coordination and the streaming mechanism target practical limitations in existing work. The distillation premise, if validated, would also be of broader interest for multimodal learning in computer vision.

major comments (2)
  1. [Abstract] Abstract: the central claims of 'superior quantitative and qualitative performance' and 'accelerating inference by over 9×' are stated without any metrics (e.g., MPJPE, FID, or velocity error), baselines, tables, or experimental protocol. These performance assertions are load-bearing for the paper's contribution and cannot be assessed from the given text.
  2. [Experiments] Experiments section: no ablation isolates the contribution of MIDI distillation. The headline claim depends on MIDI serving as a privileged training signal that improves the audio-only model; without a control that removes MIDI at training time (while keeping the flow-matching objective, architecture, and role-gated module fixed), it is impossible to determine whether reported gains arise from the distillation, the asymmetric interaction module, or the autoregressive continuation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below. Both points identify areas where the current manuscript can be strengthened, and we will revise accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claims of 'superior quantitative and qualitative performance' and 'accelerating inference by over 9×' are stated without any metrics (e.g., MPJPE, FID, or velocity error), baselines, tables, or experimental protocol. These performance assertions are load-bearing for the paper's contribution and cannot be assessed from the given text.

    Authors: We agree that the abstract would benefit from greater specificity. In the revised version we will incorporate the key quantitative results (MPJPE, FID, velocity error, and the measured 9.2× inference speedup) together with a brief reference to the PianoMotion10M evaluation protocol and the main baselines. This keeps the abstract concise while making the central claims directly verifiable. revision: yes

  2. Referee: [Experiments] Experiments section: no ablation isolates the contribution of MIDI distillation. The headline claim depends on MIDI serving as a privileged training signal that improves the audio-only model; without a control that removes MIDI at training time (while keeping the flow-matching objective, architecture, and role-gated module fixed), it is impossible to determine whether reported gains arise from the distillation, the asymmetric interaction module, or the autoregressive continuation.

    Authors: The referee correctly notes the absence of a targeted ablation that removes MIDI distillation while holding all other components fixed. Although overall comparisons to audio-only baselines provide supporting evidence, a direct control experiment is necessary to isolate the distillation effect. We will add this ablation study to the experiments section, training an otherwise identical model without MIDI and reporting the resulting degradation in motion quality and coordination metrics. revision: yes

Circularity Check

0 steps flagged

No circularity: architecture and claims are independent of self-referential fits or definitions

full rationale

The provided abstract and description present PianoFlow as a flow-matching architecture with MIDI distillation during training, an asymmetric role-gated module, and autoregressive continuation for streaming. Performance is asserted via experiments on the external PianoMotion10M dataset. No equations, derivations, or 'predictions' are shown that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The MIDI-as-privileged-training step is a methodological choice whose benefit is claimed to be validated externally rather than tautological. This matches the default case of a self-contained proposal with no load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; model components are described conceptually without equations or implementation specifics.

pith-pipeline@v0.9.0 · 5472 in / 1241 out tokens · 35339 ms · 2026-05-10T15:51:20.635185+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

80 extracted references · 14 canonical work pages · 4 internal anchors

  1. [1]

    Shivangi Aneja, Justus Thies, Angela Dai, and Matthias Nießner. 2024. Facetalk: Audio-driven motion diffusion for neural parametric head models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 21263– 21273

  2. [2]

    Muhammad Haseeb Aslam, Muhammad Osama Zeeshan, Marco Pedersoli, Alessandro L Koerich, Simon Bacon, and Eric Granger. 2023. Privileged knowl- edge distillation for dimensional emotion recognition in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 3338–3347

  3. [3]

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems33 (2020), 12449–12460

  4. [4]

    Justine Cassell, Hannes Högni Vilhjálmsson, and Timothy Bickmore. 2001. Beat: the behavior expression animation toolkit. InProceedings of the 28th annual conference on Computer graphics and interactive techniques. 477–486

  5. [5]

    Lee Chae-Yeon, Oh Hyun-Bin, Han EunGi, Kim Sung-Bin, Suekyeong Nam, and Tae-Hyun Oh. 2025. Perceptually accurate 3d talking head generation: New definitions, speech-mesh representation, and evaluation metrics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21065– 21074

  6. [6]

    Changan Chen, Juze Zhang, Shrinidhi K Lakshmikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, and Ehsan Adeli. 2025. The language of motion: Unifying verbal and non-verbal language of 3d human motion. InProceedings of the Computer Vision and Pattern Recognition Conference. 6200–6211

  7. [7]

    Jiali Chen, Changjie Fan, Zhimeng Zhang, Gongzheng Li, Zeng Zhao, Zhigang Deng, and Yu Ding. 2021. A music-driven deep generative adversarial model for guzheng playing animation.IEEE Transactions on Visualization and Computer Graphics29, 2 (2021), 1400–1414

  8. [8]

    Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. 2018. Neural ordinary differential equations.Advances in neural information processing systems31 (2018)

  9. [9]

    Zeyuan Chen, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xin Chen, Chao Wang, Di Chang, and Linjie Luo. 2025. X-dancer: Expressive music to human dance video generation. InProceedings of the IEEE/CVF International Conference on Computer Vision. 10602–10611

  10. [10]

    Kiran Chhatre, Nikos Athanasiou, Giorgio Becherini, Christopher Peters, Michael J Black, Timo Bolkart, et al . 2024. Emotional speech-driven 3d body animation via disentangled latent diffusion. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition. 1942–1953

  11. [11]

    Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, and Tatsuya Harada. 2025. Artalk: Speech-driven 3d head animation via autoregressive model. InProceedings of the SIGGRAPH Asia 2025 Conference Papers. 1–9

  12. [12]

    Yuqin Dai, Wanlu Zhu, Ronghui Li, Xiu Li, Zhenyu Zhang, Jun Li, and Jian Yang. 2026. TCDiff++: An End-to-end Trajectory-Controllable Diffusion Model for Harmonious Music-Driven Group Choreography.International Journal of Computer Vision134, 2 (2026), 61

  13. [13]

    Congyi Fan, Jian Guan, Xuanjia Zhao, Dongli Xu, Youtian Lin, Tong Ye, Pengming Feng, and Haiwei Pan. 2025. Align your rhythm: Generating highly aligned dance poses with gating-enhanced rhythm-aware feature representation. InProceedings of the IEEE/CVF International Conference on Computer Vision. 13193–13202

  14. [14]

    Xiangyu Fan, Jiaqi Li, Zhiqian Lin, Weiye Xiao, and Lei Yang. 2024. Unitalker: Scaling up audio-driven 3d facial animation through a unified model. InEuropean Conference on Computer Vision. Springer, 204–221

  15. [15]

    Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, and Taku Komura. 2022. Faceformer: Speech-driven 3d facial animation with transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 18770– 18780

  16. [16]

    Qijun Gan, Song Wang, Shengtao Wu, and Jianke Zhu. 2024. PianoMotion10M: Dataset and Benchmark for Hand Motion Generation in Piano Performance. arXiv preprint arXiv:2406.09326(2024)

  17. [17]

    Qijun Gan, Ruizi Yang, Jianke Zhu, Shaofei Xue, and Steven Hoi. 2025. Om- niavatar: Efficient audio-driven avatar video generation with adaptive body animation.arXiv preprint arXiv:2506.18866(2025)

  18. [18]

    Anindita Ghosh, Bing Zhou, Rishabh Dabral, Jian Wang, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek, and Chuan Guo. 2025. Duetgen: Mu- sic driven two-person dance generation via hierarchical masked modeling. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. 1–11

  19. [19]

    Puyuan Guo, Tuo Hao, Wenxin Fu, Yingming Gao, and Ya Li. 2025. Controllable 3d dance generation using diffusion-based transformer u-net. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 3284–3292

  20. [20]

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing29 (2021), 3451–3460

  21. [21]

    Zikai Huang, Xuemiao Xu, Cheng Xu, Huaidong Zhang, Chenxi Zheng, Jing Qin, and Shengfeng He. 2024. Beat-it: Beat-synchronized multi-condition 3d dance generation. InEuropean conference on computer vision. Springer, 273–290

  22. [22]

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. 2021. Perceiver: General perception with iterative attention. InInternational conference on machine learning. PMLR, 4651–4664

  23. [23]

    Farzaneh Jafari, Stefano Berretti, and Anup Basu. 2024. JambaTalk: Speech-driven 3D Talking Head Generation based on a Hybrid Transformer-Mamba Model. ACM Transactions on Multimedia Computing, Communications and Applications (2024)

  24. [24]

    Diqiong Jiang, Jian Chang, Lihua You, Shaojun Bian, Robert Kosk, and Greg Maguire. 2024. Audio-driven facial animation with deep learning: A survey. Information15, 11 (2024), 675

  25. [25]

    Jihui Jiao, Rui Zeng, Ju Dai, and Junjun Pan. 2025. BACH: Bi-Stage Data-Driven Piano Performance Animation for Controllable Hand Motion.Computer Anima- tion and Virtual Worlds36, 3 (2025), e70044

  26. [26]

    Yitong Jin, Zhiping Qiu, Yi Shi, Shuangpeng Sun, Chongwu Wang, Donghao Pan, Jiachen Zhao, Zhenghao Liang, Yuan Wang, Xiaobing Li, et al. 2024. Audio matters too! enhancing markerless motion capture with audio signals for string performance capture.ACM Transactions on Graphics (TOG)43, 4 (2024), 1–10

  27. [27]

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. 2022. Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems35 (2022), 26565–26577

  28. [28]

    Hyung Kyu Kim, Sangmin Lee, and Hak Gu Kim. 2025. MemoryTalker: Per- sonalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11241– 11251

  29. [29]

    Jisoo Kim, Jungbin Cho, Joonho Park, Soonmin Hwang, Da Eun Kim, Geon Kim, and Youngjae Yu. 2025. Deeptalk: Dynamic emotion embedding for probabilis- tic speech-driven 3d face animation. InProceedings of the AAAI conference on artificial intelligence, Vol. 39. 4275–4283

  30. [30]

    Jinwoo Kim, Heeseok Oh, Seongjean Kim, Hoseok Tong, and Sanghoon Lee. 2022. A brand new dance partner: Music-conditioned pluralistic dancing controlled by multiple dance genres. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3490–3500

  31. [31]

    Ronghui Li, YuXiang Zhang, Yachao Zhang, Hongwen Zhang, Jie Guo, Yan Zhang, Yebin Liu, and Xiu Li. 2024. Lodge: A coarse to fine diffusion network for long dance generation guided by the characteristic dance primitives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1524–1534

  32. [32]

    Ronghui Li, Junfan Zhao, Yachao Zhang, Mingyang Su, Zeping Ren, Han Zhang, Yansong Tang, and Xiu Li. 2023. Finedance: A fine-grained choreography dataset for 3d full body dance generation. InProceedings of the IEEE/CVF International Conference on Computer Vision. 10234–10243

  33. [33]

    Xiaojie Li, Ronghui Li, Shukai Fang, Shuzhao Xie, Xiaoyang Guo, Jiaqing Zhou, Junkun Peng, and Zhi Wang. 2025. Music-aligned holistic 3d dance generation via hierarchical motion modeling. InProceedings of the IEEE/CVF International Conference on Computer Vision. 14420–14430

  34. [34]

    Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang, Yuan Zhang, and Jingtuo Liu. 2025. Omnihuman-1: Rethinking the scaling-up of one-stage con- ditioned human animation models. InProceedings of the IEEE/CVF International Conference on Computer Vision. 13847–13858

  35. [35]

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le

  36. [36]

    Flow matching for generative modeling.arXiv preprint arXiv:2210.02747 (2022)

  37. [37]

    Haiyang Liu, Naoya Iwamoto, Zihao Zhu, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. 2022. Disco: Disentangled implicit content and rhythm learning for diverse co-speech gestures synthesis. InProceedings of the 30th ACM international conference on multimedia. 3764–3773

  38. [38]

    Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, and Michael J Black. 2024. Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1144–1154

  39. [39]

    Lanmiao Liu, Esam Ghaleb, Asli Ozyurek, and Zerrin Yumak. 2025. SemGes: Semantics-aware co-speech gesture generation using semantic coherence and relevance learning. InProceedings of the IEEE/CVF International Conference on Computer Vision. 13963–13973

  40. [40]

    Xinran Liu, Xu Dong, Shenbin Qian, Diptesh Kanojia, Wenwu Wang, and Zhen- hua Feng. 2025. GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation.arXiv preprint arXiv:2502.18309(2025)

  41. [41]

    Xinran Liu, Zhenhua Feng, Diptesh Kanojia, and Wenwu Wang. [n. d.]. DGFM: Full Body Dance Generation Driven by Music Foundation Models. InAudio Imag- ination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation

  42. [42]

    Xingchao Liu, Chengyue Gong, and Qiang Liu. 2022. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003(2022)

  43. [43]

    Yifei Liu, Qiong Cao, Yandong Wen, Huaiguang Jiang, and Changxing Ding. 2024. Towards variable and coordinated holistic co-speech motion generation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1566–1576. Wang et al

  44. [44]

    Zihao Liu, Mingwen Ou, Zunnan Xu, Jiaqi Huang, Haonan Han, Ronghui Li, and Xiu Li. 2025. Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis. InProceedings of the 33rd ACM International Conference on Multimedia. 9743–9752

  45. [45]

    Amir M Mansourian, Rozhan Ahmadi, Masoud Ghafouri, Amir Mohammad Babaei, Elaheh Badali Golezani, Zeynab Yasamani Ghamchi, Vida Ramezanian, Alireza Taherian, Kimia Dinashi, Amirali Miri, et al . 2025. A comprehensive survey on knowledge distillation.arXiv preprint arXiv:2503.12067(2025)

  46. [46]

    Marco Musy. 2018. pianoplayer: Automatic fingering generator for piano scores. https://github.com/marcomusy/pianoplayer. https://github.com/marcomusy/ pianoplayer

  47. [47]

    Hiroki Nishizawa, Keitaro Tanaka, Asuka Hirata, Shugo Yamaguchi, Qi Feng, Masatoshi Hamanaka, and Shigeo Morishima. 2025. SyncViolinist: Music- Oriented Violin Motion Generation Based on Bowing and Fingering. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV). IEEE, 5419–5428

  48. [48]

    Ziqiao Peng, Haoyu Wu, Zhenbo Song, Hao Xu, Xiangyu Zhu, Jun He, Hongyan Liu, and Zhaoxin Fan. 2023. Emotalk: Speech-driven emotional disentanglement for 3d face animation. InProceedings of the IEEE/CVF international conference on computer vision. 20687–20697

  49. [49]

    Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. 2018. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32

  50. [50]

    Alexander Richard, Michael Zollhöfer, Yandong Wen, Fernando De la Torre, and Yaser Sheikh. 2021. Meshtalk: 3d face animation from speech using cross- modality disentanglement. InProceedings of the IEEE/CVF international conference on computer vision. 1173–1182

  51. [51]

    Javier Romero, Dimitrios Tzionas, and Michael J Black. 2017. Embodied hands: Modeling and capturing hands and bodies together.ACM Transactions on Graph- ics (TOG)(2017)

  52. [52]

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer-assisted intervention. Springer, 234–241

  53. [53]

    1995.MIDI: A comprehensive introduction

    Joseph Rothstein. 1995.MIDI: A comprehensive introduction. Vol. 7. AR Editions, Inc

  54. [54]

    Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou, and Jiwen Lu. 2023. Difftalk: Crafting diffusion models for generalized audio-driven portraits animation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1982–1991

  55. [55]

    Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, and Ziwei Liu. 2022. Bailando: 3d dance generation by actor- critic gpt with choreographic memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11050–11059

  56. [56]

    Kewei Sui, Anindita Ghosh, Inwoo Hwang, Bing Zhou, Jian Wang, and Chuan Guo. 2026. A survey on human interaction motion generation.International Journal of Computer Vision134, 3 (2026), 113

  57. [57]

    Zhiyao Sun, Tian Lv, Sheng Ye, Matthieu Lin, Jenny Sheng, Yu-Hui Wen, Min- jing Yu, and Yong-jin Liu. 2024. Diffposetalk: Speech-driven stylistic 3d facial animation and head pose generation via diffusion models.ACM Transactions on Graphics (ToG)43, 4 (2024), 1–9

  58. [58]

    Jonathan Tseng, Rodrigo Castellon, and Karen Liu. 2023. Edge: Editable dance generation from music. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 448–458

  59. [59]

    Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, and Gang Yu. 2025. UniVerse-1: Unified Audio-Video Generation via Stitching of Experts.arXiv preprint arXiv:2509.06155(2025)

  60. [60]

    Hongsong Wang, Yin Zhu, Qiuxia Lai, Yang Zhang, Guo-Sen Xie, and Xin Geng

  61. [61]

    PAMD: Plausibility-Aware Motion Diffusion Model for Long Dance Gener- ation.arXiv preprint arXiv:2505.20056(2025)

  62. [62]

    Ruocheng Wang, Pei Xu, Haochen Shi, Elizabeth Schumann, and C Karen Liu

  63. [63]

    InSIGGRAPH Asia 2024 Conference Papers

    Fürelise: Capturing and physically synthesizing hand motion of piano performance. InSIGGRAPH Asia 2024 Conference Papers. 1–11

  64. [64]

    Huawei Wei, Zejun Yang, and Zhisheng Wang. 2024. Aniportrait: Audio-driven synthesis of photorealistic portrait animation.arXiv preprint arXiv:2403.17694 (2024)

  65. [65]

    Ruipin Xu. 2025. Study on teaching and training system construction of pi- ano based on motion capture. InFourth International Conference on Electronics Technology and Artificial Intelligence (ETAI 2025), Vol. 13692. SPIE, 718–724

  66. [66]

    Shuyang Xu, Zhiyang Dou, Mingyi Shi, Liang Pan, Leo Ho, Jingbo Wang, Yuan Liu, Cheng Lin, Yuexin Ma, Wenping Wang, et al. 2025. Mospa: Human motion generation driven by spatial audio.arXiv preprint arXiv:2507.11949(2025)

  67. [67]

    Yifan Xu, Sirui Zhao, Shifeng Liu, Tong Xu, and Enhong Chen. 2026. Emotion- ally Controllable Audio-driven Talking Face Generation.ACM Transactions on Multimedia Computing, Communications and Applications(2026)

  68. [68]

    Zunnan Xu, Yukang Lin, Haonan Han, Sicheng Yang, Ronghui Li, Yachao Zhang, and Xiu Li. 2024. Mambatalk: Efficient holistic gesture synthesis with selective state space models.Advances in Neural Information Processing Systems37 (2024), 20055–20080

  69. [69]

    Kaixing Yang, Xulong Tang, Yuxuan Hu, Jiahao Yang, Hongyan Liu, Qinnan Zhang, Jun He, and Zhaoxin Fan. 2025. Matchdance: Collaborative mamba- transformer architecture matching for high-quality 3d dance synthesis.arXiv preprint arXiv:2505.14222(2025)

  70. [70]

    Kaixing Yang, Xulong Tang, Ziqiao Peng, Yuxuan Hu, Jun He, and Hongyan Liu

  71. [71]

    Megadance: Mixture-of-experts architecture for genre-aware 3d dance generation.arXiv preprint arXiv:2505.17543(2025)

  72. [72]

    Quanwei Yang, Luying Huang, Kaisiyuan Wang, Jiazhi Guan, Shengyi He, Feng- guo Li, Hang Zhou, Lingyun Yu, Yingying Li, Haocheng Feng, et al. 2025. Gesture- HYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation. In Proceedings of the IEEE/CVF International Conference o...

  73. [73]

    Yifan Yang, Zhi Cen, Sida Peng, Xiangwei Chen, Yifu Deng, Xinyu Zhu, Fan Jia, Xiaowei Zhou, and Hujun Bao. 2026. StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 11766–11774

  74. [74]

    Hongwei Yi, Hualin Liang, Yifei Liu, Qiong Cao, Yandong Wen, Timo Bolkart, Dacheng Tao, and Michael J Black. 2023. Generating holistic 3d human motion from speech. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 469–480

  75. [75]

    Kevin Zakka, Philipp Wu, Laura Smith, Nimrod Gileadi, Taylor Howell, Xue Bin Peng, Sumeet Singh, Yuval Tassa, Pete Florence, Andy Zeng, et al. 2023. Robopi- anist: Dexterous piano playing with deep reinforcement learning.arXiv preprint arXiv:2304.04150(2023)

  76. [76]

    Yves-Simon Zeulner, Sandeep Selvaraj, and Roberto Calandra. 2025. Learning to play piano in the real world.arXiv preprint arXiv:2503.15481(2025)

  77. [77]

    Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. 2023. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8652–8661

  78. [78]

    Yihao Zhi, Xiaodong Cun, Xuelin Chen, Xi Shen, Wen Guo, Shaoli Huang, and Shenghua Gao. 2023. Livelyspeaker: Towards semantic-aware co-speech gesture generation. InProceedings of the IEEE/CVF international conference on computer vision. 20807–20817

  79. [79]

    Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, and Xie Chen. 2025. MuQ: Self-supervised music representation learning with mel residual vector quantization.IEEE Transactions on Audio, Speech and Language Processing(2025)

  80. [80]

    Wentao Zhu, Xiaoxuan Ma, Dongwoo Ro, Hai Ci, Jinlu Zhang, Jiaxin Shi, Feng Gao, Qi Tian, and Yizhou Wang. 2023. Human motion generation: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence46, 4 (2023), 2430–2449