Recognition: unknown
PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination
Pith reviewed 2026-05-10 15:51 UTC · model grok-4.3
The pith
PianoFlow generates realistic bimanual piano playing motions from audio in real time by distilling MIDI priors during training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PianoFlow is a flow-matching framework that strategically leverages MIDI as a privileged modality during training to distill structured musical priors, enabling deep semantic understanding for audio-only inference. It introduces an asymmetric role-gated interaction module for dynamic cross-hand coordination and an autoregressive flow continuation scheme for seamless long-sequence streaming generation.
What carries the argument
Flow-matching framework that distills MIDI symbolic priors into audio-driven generation, combined with an asymmetric role-gated interaction module using role-aware attention and temporal gating, plus autoregressive flow continuation for streaming.
Load-bearing premise
The assumption that MIDI symbolic priors learned during training will transfer effectively to audio-only inference without introducing artifacts or losing generality.
What would settle it
A direct comparison on the PianoMotion10M test set showing that audio-only inference with PianoFlow does not outperform previous audio-only methods in motion accuracy or musical alignment, or that inference speed is not at least 9 times faster.
Figures
read the original abstract
Audio-driven bimanual piano motion generation requires precise modeling of complex musical structures and dynamic cross-hand coordination. However, existing methods often rely on acoustic-only representations lacking symbolic priors, employ inflexible interaction mechanisms, and are limited to computationally expensive short-sequence generation. To address these limitations, we propose PianoFlow, a flow-matching framework for precise and coordinated bimanual piano motion synthesis. Our approach strategically leverages MIDI as a privileged modality during training, distilling these structured musical priors to achieve deep semantic understanding while maintaining audio-only inference. Furthermore, we introduce an asymmetric role-gated interaction module to explicitly capture dynamic cross-hand coordination through role-aware attention and temporal gating. To enable real-time streaming generation for arbitrarily long sequences, we design an autoregressive flow continuation scheme that ensures seamless cross-chunk temporal coherence. Extensive experiments on the PianoMotion10M dataset demonstrate that PianoFlow achieves superior quantitative and qualitative performance, while accelerating inference by over 9\times compared to previous methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. PianoFlow is a flow-matching framework for audio-driven bimanual piano motion generation. It uses MIDI as a privileged training modality to distill symbolic musical priors into an audio-only inference model, introduces an asymmetric role-gated interaction module for cross-hand coordination via role-aware attention and temporal gating, and proposes an autoregressive flow continuation scheme for seamless streaming of arbitrarily long sequences. On the PianoMotion10M dataset the paper claims superior quantitative and qualitative results together with more than 9× inference speedup relative to prior methods.
Significance. If substantiated, the approach could meaningfully advance real-time music-driven animation by enabling efficient incorporation of musical structure without symbolic input at test time and by addressing long-sequence coherence. The explicit modeling of bimanual coordination and the streaming mechanism target practical limitations in existing work. The distillation premise, if validated, would also be of broader interest for multimodal learning in computer vision.
major comments (2)
- [Abstract] Abstract: the central claims of 'superior quantitative and qualitative performance' and 'accelerating inference by over 9×' are stated without any metrics (e.g., MPJPE, FID, or velocity error), baselines, tables, or experimental protocol. These performance assertions are load-bearing for the paper's contribution and cannot be assessed from the given text.
- [Experiments] Experiments section: no ablation isolates the contribution of MIDI distillation. The headline claim depends on MIDI serving as a privileged training signal that improves the audio-only model; without a control that removes MIDI at training time (while keeping the flow-matching objective, architecture, and role-gated module fixed), it is impossible to determine whether reported gains arise from the distillation, the asymmetric interaction module, or the autoregressive continuation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below. Both points identify areas where the current manuscript can be strengthened, and we will revise accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claims of 'superior quantitative and qualitative performance' and 'accelerating inference by over 9×' are stated without any metrics (e.g., MPJPE, FID, or velocity error), baselines, tables, or experimental protocol. These performance assertions are load-bearing for the paper's contribution and cannot be assessed from the given text.
Authors: We agree that the abstract would benefit from greater specificity. In the revised version we will incorporate the key quantitative results (MPJPE, FID, velocity error, and the measured 9.2× inference speedup) together with a brief reference to the PianoMotion10M evaluation protocol and the main baselines. This keeps the abstract concise while making the central claims directly verifiable. revision: yes
-
Referee: [Experiments] Experiments section: no ablation isolates the contribution of MIDI distillation. The headline claim depends on MIDI serving as a privileged training signal that improves the audio-only model; without a control that removes MIDI at training time (while keeping the flow-matching objective, architecture, and role-gated module fixed), it is impossible to determine whether reported gains arise from the distillation, the asymmetric interaction module, or the autoregressive continuation.
Authors: The referee correctly notes the absence of a targeted ablation that removes MIDI distillation while holding all other components fixed. Although overall comparisons to audio-only baselines provide supporting evidence, a direct control experiment is necessary to isolate the distillation effect. We will add this ablation study to the experiments section, training an otherwise identical model without MIDI and reporting the resulting degradation in motion quality and coordination metrics. revision: yes
Circularity Check
No circularity: architecture and claims are independent of self-referential fits or definitions
full rationale
The provided abstract and description present PianoFlow as a flow-matching architecture with MIDI distillation during training, an asymmetric role-gated module, and autoregressive continuation for streaming. Performance is asserted via experiments on the external PianoMotion10M dataset. No equations, derivations, or 'predictions' are shown that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The MIDI-as-privileged-training step is a methodological choice whose benefit is claimed to be validated externally rather than tautological. This matches the default case of a self-contained proposal with no load-bearing circular reductions.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Shivangi Aneja, Justus Thies, Angela Dai, and Matthias Nießner. 2024. Facetalk: Audio-driven motion diffusion for neural parametric head models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 21263– 21273
2024
-
[2]
Muhammad Haseeb Aslam, Muhammad Osama Zeeshan, Marco Pedersoli, Alessandro L Koerich, Simon Bacon, and Eric Granger. 2023. Privileged knowl- edge distillation for dimensional emotion recognition in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 3338–3347
2023
-
[3]
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems33 (2020), 12449–12460
2020
-
[4]
Justine Cassell, Hannes Högni Vilhjálmsson, and Timothy Bickmore. 2001. Beat: the behavior expression animation toolkit. InProceedings of the 28th annual conference on Computer graphics and interactive techniques. 477–486
2001
-
[5]
Lee Chae-Yeon, Oh Hyun-Bin, Han EunGi, Kim Sung-Bin, Suekyeong Nam, and Tae-Hyun Oh. 2025. Perceptually accurate 3d talking head generation: New definitions, speech-mesh representation, and evaluation metrics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21065– 21074
2025
-
[6]
Changan Chen, Juze Zhang, Shrinidhi K Lakshmikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, and Ehsan Adeli. 2025. The language of motion: Unifying verbal and non-verbal language of 3d human motion. InProceedings of the Computer Vision and Pattern Recognition Conference. 6200–6211
2025
-
[7]
Jiali Chen, Changjie Fan, Zhimeng Zhang, Gongzheng Li, Zeng Zhao, Zhigang Deng, and Yu Ding. 2021. A music-driven deep generative adversarial model for guzheng playing animation.IEEE Transactions on Visualization and Computer Graphics29, 2 (2021), 1400–1414
2021
-
[8]
Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. 2018. Neural ordinary differential equations.Advances in neural information processing systems31 (2018)
2018
-
[9]
Zeyuan Chen, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xin Chen, Chao Wang, Di Chang, and Linjie Luo. 2025. X-dancer: Expressive music to human dance video generation. InProceedings of the IEEE/CVF International Conference on Computer Vision. 10602–10611
2025
-
[10]
Kiran Chhatre, Nikos Athanasiou, Giorgio Becherini, Christopher Peters, Michael J Black, Timo Bolkart, et al . 2024. Emotional speech-driven 3d body animation via disentangled latent diffusion. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition. 1942–1953
2024
-
[11]
Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, and Tatsuya Harada. 2025. Artalk: Speech-driven 3d head animation via autoregressive model. InProceedings of the SIGGRAPH Asia 2025 Conference Papers. 1–9
2025
-
[12]
Yuqin Dai, Wanlu Zhu, Ronghui Li, Xiu Li, Zhenyu Zhang, Jun Li, and Jian Yang. 2026. TCDiff++: An End-to-end Trajectory-Controllable Diffusion Model for Harmonious Music-Driven Group Choreography.International Journal of Computer Vision134, 2 (2026), 61
2026
-
[13]
Congyi Fan, Jian Guan, Xuanjia Zhao, Dongli Xu, Youtian Lin, Tong Ye, Pengming Feng, and Haiwei Pan. 2025. Align your rhythm: Generating highly aligned dance poses with gating-enhanced rhythm-aware feature representation. InProceedings of the IEEE/CVF International Conference on Computer Vision. 13193–13202
2025
-
[14]
Xiangyu Fan, Jiaqi Li, Zhiqian Lin, Weiye Xiao, and Lei Yang. 2024. Unitalker: Scaling up audio-driven 3d facial animation through a unified model. InEuropean Conference on Computer Vision. Springer, 204–221
2024
-
[15]
Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, and Taku Komura. 2022. Faceformer: Speech-driven 3d facial animation with transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 18770– 18780
2022
- [16]
- [17]
-
[18]
Anindita Ghosh, Bing Zhou, Rishabh Dabral, Jian Wang, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek, and Chuan Guo. 2025. Duetgen: Mu- sic driven two-person dance generation via hierarchical masked modeling. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. 1–11
2025
-
[19]
Puyuan Guo, Tuo Hao, Wenxin Fu, Yingming Gao, and Ya Li. 2025. Controllable 3d dance generation using diffusion-based transformer u-net. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 3284–3292
2025
-
[20]
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing29 (2021), 3451–3460
2021
-
[21]
Zikai Huang, Xuemiao Xu, Cheng Xu, Huaidong Zhang, Chenxi Zheng, Jing Qin, and Shengfeng He. 2024. Beat-it: Beat-synchronized multi-condition 3d dance generation. InEuropean conference on computer vision. Springer, 273–290
2024
-
[22]
Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. 2021. Perceiver: General perception with iterative attention. InInternational conference on machine learning. PMLR, 4651–4664
2021
-
[23]
Farzaneh Jafari, Stefano Berretti, and Anup Basu. 2024. JambaTalk: Speech-driven 3D Talking Head Generation based on a Hybrid Transformer-Mamba Model. ACM Transactions on Multimedia Computing, Communications and Applications (2024)
2024
-
[24]
Diqiong Jiang, Jian Chang, Lihua You, Shaojun Bian, Robert Kosk, and Greg Maguire. 2024. Audio-driven facial animation with deep learning: A survey. Information15, 11 (2024), 675
2024
-
[25]
Jihui Jiao, Rui Zeng, Ju Dai, and Junjun Pan. 2025. BACH: Bi-Stage Data-Driven Piano Performance Animation for Controllable Hand Motion.Computer Anima- tion and Virtual Worlds36, 3 (2025), e70044
2025
-
[26]
Yitong Jin, Zhiping Qiu, Yi Shi, Shuangpeng Sun, Chongwu Wang, Donghao Pan, Jiachen Zhao, Zhenghao Liang, Yuan Wang, Xiaobing Li, et al. 2024. Audio matters too! enhancing markerless motion capture with audio signals for string performance capture.ACM Transactions on Graphics (TOG)43, 4 (2024), 1–10
2024
-
[27]
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. 2022. Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems35 (2022), 26565–26577
2022
-
[28]
Hyung Kyu Kim, Sangmin Lee, and Hak Gu Kim. 2025. MemoryTalker: Per- sonalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11241– 11251
2025
-
[29]
Jisoo Kim, Jungbin Cho, Joonho Park, Soonmin Hwang, Da Eun Kim, Geon Kim, and Youngjae Yu. 2025. Deeptalk: Dynamic emotion embedding for probabilis- tic speech-driven 3d face animation. InProceedings of the AAAI conference on artificial intelligence, Vol. 39. 4275–4283
2025
-
[30]
Jinwoo Kim, Heeseok Oh, Seongjean Kim, Hoseok Tong, and Sanghoon Lee. 2022. A brand new dance partner: Music-conditioned pluralistic dancing controlled by multiple dance genres. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3490–3500
2022
-
[31]
Ronghui Li, YuXiang Zhang, Yachao Zhang, Hongwen Zhang, Jie Guo, Yan Zhang, Yebin Liu, and Xiu Li. 2024. Lodge: A coarse to fine diffusion network for long dance generation guided by the characteristic dance primitives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1524–1534
2024
-
[32]
Ronghui Li, Junfan Zhao, Yachao Zhang, Mingyang Su, Zeping Ren, Han Zhang, Yansong Tang, and Xiu Li. 2023. Finedance: A fine-grained choreography dataset for 3d full body dance generation. InProceedings of the IEEE/CVF International Conference on Computer Vision. 10234–10243
2023
-
[33]
Xiaojie Li, Ronghui Li, Shukai Fang, Shuzhao Xie, Xiaoyang Guo, Jiaqing Zhou, Junkun Peng, and Zhi Wang. 2025. Music-aligned holistic 3d dance generation via hierarchical motion modeling. InProceedings of the IEEE/CVF International Conference on Computer Vision. 14420–14430
2025
-
[34]
Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang, Yuan Zhang, and Jingtuo Liu. 2025. Omnihuman-1: Rethinking the scaling-up of one-stage con- ditioned human animation models. InProceedings of the IEEE/CVF International Conference on Computer Vision. 13847–13858
2025
-
[35]
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le
-
[36]
Flow matching for generative modeling.arXiv preprint arXiv:2210.02747 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[37]
Haiyang Liu, Naoya Iwamoto, Zihao Zhu, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. 2022. Disco: Disentangled implicit content and rhythm learning for diverse co-speech gestures synthesis. InProceedings of the 30th ACM international conference on multimedia. 3764–3773
2022
-
[38]
Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, and Michael J Black. 2024. Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1144–1154
2024
-
[39]
Lanmiao Liu, Esam Ghaleb, Asli Ozyurek, and Zerrin Yumak. 2025. SemGes: Semantics-aware co-speech gesture generation using semantic coherence and relevance learning. InProceedings of the IEEE/CVF International Conference on Computer Vision. 13963–13973
2025
-
[40]
Xinran Liu, Xu Dong, Shenbin Qian, Diptesh Kanojia, Wenwu Wang, and Zhen- hua Feng. 2025. GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation.arXiv preprint arXiv:2502.18309(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
Xinran Liu, Zhenhua Feng, Diptesh Kanojia, and Wenwu Wang. [n. d.]. DGFM: Full Body Dance Generation Driven by Music Foundation Models. InAudio Imag- ination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation
2024
-
[42]
Xingchao Liu, Chengyue Gong, and Qiang Liu. 2022. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003(2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[43]
Yifei Liu, Qiong Cao, Yandong Wen, Huaiguang Jiang, and Changxing Ding. 2024. Towards variable and coordinated holistic co-speech motion generation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1566–1576. Wang et al
2024
-
[44]
Zihao Liu, Mingwen Ou, Zunnan Xu, Jiaqi Huang, Haonan Han, Ronghui Li, and Xiu Li. 2025. Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis. InProceedings of the 33rd ACM International Conference on Multimedia. 9743–9752
2025
-
[45]
Amir M Mansourian, Rozhan Ahmadi, Masoud Ghafouri, Amir Mohammad Babaei, Elaheh Badali Golezani, Zeynab Yasamani Ghamchi, Vida Ramezanian, Alireza Taherian, Kimia Dinashi, Amirali Miri, et al . 2025. A comprehensive survey on knowledge distillation.arXiv preprint arXiv:2503.12067(2025)
-
[46]
Marco Musy. 2018. pianoplayer: Automatic fingering generator for piano scores. https://github.com/marcomusy/pianoplayer. https://github.com/marcomusy/ pianoplayer
2018
-
[47]
Hiroki Nishizawa, Keitaro Tanaka, Asuka Hirata, Shugo Yamaguchi, Qi Feng, Masatoshi Hamanaka, and Shigeo Morishima. 2025. SyncViolinist: Music- Oriented Violin Motion Generation Based on Bowing and Fingering. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV). IEEE, 5419–5428
2025
-
[48]
Ziqiao Peng, Haoyu Wu, Zhenbo Song, Hao Xu, Xiangyu Zhu, Jun He, Hongyan Liu, and Zhaoxin Fan. 2023. Emotalk: Speech-driven emotional disentanglement for 3d face animation. InProceedings of the IEEE/CVF international conference on computer vision. 20687–20697
2023
-
[49]
Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. 2018. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32
2018
-
[50]
Alexander Richard, Michael Zollhöfer, Yandong Wen, Fernando De la Torre, and Yaser Sheikh. 2021. Meshtalk: 3d face animation from speech using cross- modality disentanglement. InProceedings of the IEEE/CVF international conference on computer vision. 1173–1182
2021
-
[51]
Javier Romero, Dimitrios Tzionas, and Michael J Black. 2017. Embodied hands: Modeling and capturing hands and bodies together.ACM Transactions on Graph- ics (TOG)(2017)
2017
-
[52]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer-assisted intervention. Springer, 234–241
2015
-
[53]
1995.MIDI: A comprehensive introduction
Joseph Rothstein. 1995.MIDI: A comprehensive introduction. Vol. 7. AR Editions, Inc
1995
-
[54]
Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou, and Jiwen Lu. 2023. Difftalk: Crafting diffusion models for generalized audio-driven portraits animation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1982–1991
2023
-
[55]
Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, and Ziwei Liu. 2022. Bailando: 3d dance generation by actor- critic gpt with choreographic memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11050–11059
2022
-
[56]
Kewei Sui, Anindita Ghosh, Inwoo Hwang, Bing Zhou, Jian Wang, and Chuan Guo. 2026. A survey on human interaction motion generation.International Journal of Computer Vision134, 3 (2026), 113
2026
-
[57]
Zhiyao Sun, Tian Lv, Sheng Ye, Matthieu Lin, Jenny Sheng, Yu-Hui Wen, Min- jing Yu, and Yong-jin Liu. 2024. Diffposetalk: Speech-driven stylistic 3d facial animation and head pose generation via diffusion models.ACM Transactions on Graphics (ToG)43, 4 (2024), 1–9
2024
-
[58]
Jonathan Tseng, Rodrigo Castellon, and Karen Liu. 2023. Edge: Editable dance generation from music. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 448–458
2023
- [59]
-
[60]
Hongsong Wang, Yin Zhu, Qiuxia Lai, Yang Zhang, Guo-Sen Xie, and Xin Geng
- [61]
-
[62]
Ruocheng Wang, Pei Xu, Haochen Shi, Elizabeth Schumann, and C Karen Liu
-
[63]
InSIGGRAPH Asia 2024 Conference Papers
Fürelise: Capturing and physically synthesizing hand motion of piano performance. InSIGGRAPH Asia 2024 Conference Papers. 1–11
2024
- [64]
-
[65]
Ruipin Xu. 2025. Study on teaching and training system construction of pi- ano based on motion capture. InFourth International Conference on Electronics Technology and Artificial Intelligence (ETAI 2025), Vol. 13692. SPIE, 718–724
2025
- [66]
-
[67]
Yifan Xu, Sirui Zhao, Shifeng Liu, Tong Xu, and Enhong Chen. 2026. Emotion- ally Controllable Audio-driven Talking Face Generation.ACM Transactions on Multimedia Computing, Communications and Applications(2026)
2026
-
[68]
Zunnan Xu, Yukang Lin, Haonan Han, Sicheng Yang, Ronghui Li, Yachao Zhang, and Xiu Li. 2024. Mambatalk: Efficient holistic gesture synthesis with selective state space models.Advances in Neural Information Processing Systems37 (2024), 20055–20080
2024
- [69]
-
[70]
Kaixing Yang, Xulong Tang, Ziqiao Peng, Yuxuan Hu, Jun He, and Hongyan Liu
- [71]
-
[72]
Quanwei Yang, Luying Huang, Kaisiyuan Wang, Jiazhi Guan, Shengyi He, Feng- guo Li, Hang Zhou, Lingyun Yu, Yingying Li, Haocheng Feng, et al. 2025. Gesture- HYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation. In Proceedings of the IEEE/CVF International Conference o...
2025
-
[73]
Yifan Yang, Zhi Cen, Sida Peng, Xiangwei Chen, Yifu Deng, Xinyu Zhu, Fan Jia, Xiaowei Zhou, and Hujun Bao. 2026. StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 11766–11774
2026
-
[74]
Hongwei Yi, Hualin Liang, Yifei Liu, Qiong Cao, Yandong Wen, Timo Bolkart, Dacheng Tao, and Michael J Black. 2023. Generating holistic 3d human motion from speech. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 469–480
2023
- [75]
-
[76]
Yves-Simon Zeulner, Sandeep Selvaraj, and Roberto Calandra. 2025. Learning to play piano in the real world.arXiv preprint arXiv:2503.15481(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[77]
Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. 2023. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8652–8661
2023
-
[78]
Yihao Zhi, Xiaodong Cun, Xuelin Chen, Xi Shen, Wen Guo, Shaoli Huang, and Shenghua Gao. 2023. Livelyspeaker: Towards semantic-aware co-speech gesture generation. InProceedings of the IEEE/CVF international conference on computer vision. 20807–20817
2023
-
[79]
Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, and Xie Chen. 2025. MuQ: Self-supervised music representation learning with mel residual vector quantization.IEEE Transactions on Audio, Speech and Language Processing(2025)
2025
-
[80]
Wentao Zhu, Xiaoxuan Ma, Dongwoo Ro, Hai Ci, Jinlu Zhang, Jiaxin Shi, Feng Gao, Qi Tian, and Yizhou Wang. 2023. Human motion generation: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence46, 4 (2023), 2430–2449
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.