pith. machine review for the scientific record.

arxiv: 2604.22164 · v1 · submitted 2026-04-24 · 💻 cs.CV

Recognition: unknown

Learning Reactive Human Motion Generation from Paired Interaction Data Using Transformer-Based Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords reactive motion generation · transformer models · paired interaction data · person ID embedding · human motion prediction · boxing videos · multi-agent dynamics

The pith

A basic Transformer with person ID embedding generates consistent reactive human motions from paired interaction sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extracts paired action-reaction motion sequences from boxing match videos to train models that predict one person's responsive motion given the other's actions. It tests three Transformer architectures and finds that the plain version, when supplied with explicit person identity embeddings, produces plausible interaction-aware motions while avoiding the posture collapse and error accumulation that appear in iTransformer and Crossformer variants. This matters because prior motion generation work has focused on isolated individuals, yet most real-world human movement occurs in mutually dependent pairs; demonstrating that identity-aware modeling suffices for stable generation opens a direct route to handling multi-person dynamics.

Core claim

By training on paired motion sequences extracted from boxing videos, a simple Transformer architecture augmented with person ID embeddings can generate the motion of one person conditioned on the motion of another while preserving structural consistency and avoiding the accumulating errors that destabilize more complex cross-attention models over time.

What carries the argument

Person ID embedding added to the Transformer input to distinguish the two individuals and enforce identity-specific consistency across the paired sequences.
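The load-bearing mechanism is small: a learned embedding, indexed by which of the two people a frame belongs to, added to the pose tokens before the Transformer. A minimal sketch of that idea follows; all layer sizes, the keypoint count, and the module names are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ReactiveMotionTransformer(nn.Module):
    """Sketch: a plain Transformer encoder whose input tokens carry a learned
    person-ID embedding, so frames from the acting person and the reacting
    person are explicitly distinguishable. Hyperparameters are illustrative."""

    def __init__(self, n_joints=17, d_model=128, n_heads=4, n_layers=4):
        super().__init__()
        self.pose_proj = nn.Linear(n_joints * 3, d_model)  # flattened 3D keypoints per frame
        self.person_emb = nn.Embedding(2, d_model)         # ID 0 / ID 1 for the two people
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_joints * 3)       # predicted keypoints per frame

    def forward(self, poses, person_ids):
        # poses: (batch, time, n_joints * 3); person_ids: (batch, time) in {0, 1}
        x = self.pose_proj(poses) + self.person_emb(person_ids)
        return self.head(self.encoder(x))

model = ReactiveMotionTransformer()
poses = torch.randn(2, 30, 17 * 3)               # two sequences of 30 frames
ids = torch.zeros(2, 30, dtype=torch.long)       # all frames tagged as person 0
out = model(poses, ids)
```

The point of the additive embedding is that identity information reaches every layer through the same residual stream as the pose content, which is the mechanism the paper credits with preventing structural collapse.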

If this is right

  • A plain Transformer produces plausible interaction-aware motions without posture collapse.
  • iTransformer and Crossformer versions accumulate errors and produce unstable long-term motions.
  • Adding person ID embeddings prevents structural collapse and improves motion consistency.
  • Models trained on video-extracted pairs can capture mutual dependencies between interacting people.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same paired-sequence approach could be applied to other two-person activities such as martial arts or collaborative tasks.
  • Generated motions might serve as training data or controllers for robotics systems that must respond to a human partner.
  • Extending the dataset to longer sequences or multiple interaction types would test whether identity embeddings continue to stabilize generation.

Load-bearing premise

Paired motion sequences taken from boxing videos are representative of general human interaction dynamics, so the observed model behaviors will hold outside this specific domain.

What would settle it

Run the trained models on paired motion sequences from a different interactive activity such as tennis rallies or partner dance and check whether they still avoid posture collapse while matching observed human reactions.

Figures

Figures reproduced from arXiv: 2604.22164 by Masato Soga, Ryuki Takebayashi.

Figure 1: Overview of the InterFormer architecture during training (left) and inference (right) (adapted from [4]).
Figure 2: Keypoint structures of COCO17 and Human3.6M.
Figure 3: Keypoint structure of the retargeted Humanoid model.
Figure 4: Motion prediction by the simple Transformer.
Figure 5: Motion prediction by iTransformer.
Figure 6: Motion prediction by Crossformer.
Figure 7: System pipeline.
Figure 8: Generated motion scenes at 100 frames after the start of playback in the offline setting.
Original abstract

Recent advances in deep learning have enabled the generation of videos from textual descriptions as well as the prediction of future sequences from input videos. Similarly, in human motion modeling, motions can be generated from text or predicted from a single person's motion sequence. However, these approaches primarily focus on single-agent motion generation. In contrast, this study addresses the problem of generating the motion of one person based on the motion of another in interaction scenarios, where the two motions are mutually dependent. We construct a dataset of paired action-reaction motion sequences extracted from boxing match videos and investigate the effectiveness of Transformer-based models for this task. Specifically, we implement and compare three models: a simple Transformer, iTransformer, and Crossformer. In addition, we introduce a person ID embedding to explicitly distinguish between individuals, enabling the model to maintain structural consistency and better capture interaction dynamics. Experimental results show that the simple Transformer can generate plausible interaction-aware motions without suffering from posture collapse, while iTransformer and Crossformer accumulate errors over time, leading to unstable motion generation. Furthermore, the proposed person ID embedding contributes to preventing structural collapse and improving motion consistency. These results highlight the importance of explicitly modeling individual identity in interaction-aware motion generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript addresses reactive human motion generation in two-person interaction scenarios by constructing a paired dataset from boxing match videos and comparing three Transformer variants (simple Transformer, iTransformer, Crossformer) augmented with a person ID embedding. The central claim is that the simple Transformer with ID embedding produces plausible, interaction-aware motions without posture collapse or error accumulation, while the other architectures degrade over time, and that the ID embedding improves structural consistency.

Significance. If the empirical observations hold under rigorous evaluation, the work would provide useful evidence that explicit identity modeling helps stabilize Transformer-based predictors of mutually dependent motions and that architectural simplicity can outperform more complex variants on this task. It also contributes a new paired interaction dataset. The narrow boxing domain, however, constrains immediate broader impact.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experimental Results): the claim that the simple Transformer 'can generate plausible interaction-aware motions without suffering from posture collapse' while iTransformer and Crossformer 'accumulate errors over time' rests entirely on qualitative descriptions; no quantitative metrics (e.g., mean per-joint position error, velocity error, or FID scores), baselines, error bars, or statistical tests are reported, rendering the central comparative claim unverifiable.
  2. [§3] §3 (Methodology): training procedures, hyperparameter settings, loss functions, sequence lengths, and evaluation protocols are not specified in sufficient detail to allow reproduction or assessment of whether the reported behaviors are robust or sensitive to implementation choices.
minor comments (2)
  1. [Figures] Figure captions and axis labels should explicitly state the motion representation (e.g., joint angles vs. 3D coordinates) and the time horizon over which collapse is observed.
  2. [Introduction] The introduction would benefit from a short related-work paragraph contrasting this paired reactive setting with existing single-person motion prediction and text-conditioned generation literature.
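The referee's first major comment asks for quantitative metrics such as mean per-joint position error (MPJPE). For concreteness, a minimal sketch of that metric, the standard average Euclidean distance between predicted and ground-truth joints; the array shapes are an assumption for illustration.

```python
import numpy as np

def mpjpe(pred, target):
    """Mean per-joint position error: average Euclidean distance between
    predicted and ground-truth joint positions.
    pred, target: arrays of shape (frames, joints, 3)."""
    return np.linalg.norm(pred - target, axis=-1).mean()

# Toy check: a constant 1-unit offset along one axis gives an MPJPE of 1.0.
gt = np.zeros((100, 17, 3))
pred = gt.copy()
pred[..., 0] += 1.0
```

Reported per-frame over the generation horizon, a curve of this metric would also quantify the error accumulation the paper describes qualitatively for iTransformer and Crossformer.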

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the thorough and constructive review of our manuscript. We appreciate the feedback highlighting areas where our claims and methodology require strengthening for verifiability and reproducibility. We address each major comment below and commit to revisions that will incorporate quantitative evaluations and detailed specifications.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Results): the claim that the simple Transformer 'can generate plausible interaction-aware motions without suffering from posture collapse' while iTransformer and Crossformer 'accumulate errors over time' rests entirely on qualitative descriptions; no quantitative metrics (e.g., mean per-joint position error, velocity error, or FID scores), baselines, error bars, or statistical tests are reported, rendering the central comparative claim unverifiable.

    Authors: We acknowledge that the current version relies primarily on qualitative visualizations and narrative descriptions to support the comparative claims regarding posture stability and error accumulation. While these visuals illustrate clear differences (e.g., collapse in iTransformer/Crossformer vs. sustained coherence in the simple Transformer), we agree that quantitative support is essential for rigor. In the revised manuscript, we will add metrics including mean per-joint position error (MPJPE), velocity error, and FID scores for motion quality, along with baseline comparisons where feasible, error bars from multiple training runs, and statistical tests (e.g., paired t-tests) to substantiate the observations. revision: yes

  2. Referee: [§3] §3 (Methodology): training procedures, hyperparameter settings, loss functions, sequence lengths, and evaluation protocols are not specified in sufficient detail to allow reproduction or assessment of whether the reported behaviors are robust or sensitive to implementation choices.

    Authors: We apologize for the omission of implementation details. The revised Section 3 will fully specify: training procedures (optimizer, learning rate, batch size, number of epochs, early stopping criteria), all hyperparameter values, the loss function (formulation combining L2 losses on joint positions, velocities, and rotations), exact sequence lengths for training/inference, data preprocessing and augmentation steps, and the complete evaluation protocol (generation process, metrics computation, and qualitative assessment criteria). This will enable full reproducibility and allow readers to evaluate sensitivity to choices. revision: yes
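The rebuttal promises a loss formulation combining L2 terms on joint positions, velocities, and rotations. As a hedged sketch of what such a combined objective typically looks like, the fragment below implements position and velocity terms only; the weights, the term set, and the tensor shapes are assumptions, not the paper's specification.

```python
import torch

def motion_loss(pred, target, w_pos=1.0, w_vel=0.5):
    """Illustrative combined motion objective: L2 on joint positions plus
    L2 on finite-difference velocities. Weights are illustrative defaults.
    pred, target: (batch, time, joints, 3)."""
    pos_loss = torch.mean((pred - target) ** 2)
    pred_vel = pred[:, 1:] - pred[:, :-1]        # frame-to-frame velocities
    target_vel = target[:, 1:] - target[:, :-1]
    vel_loss = torch.mean((pred_vel - target_vel) ** 2)
    return w_pos * pos_loss + w_vel * vel_loss

pred = torch.randn(4, 30, 17, 3, requires_grad=True)
target = torch.randn(4, 30, 17, 3)
loss = motion_loss(pred, target)
loss.backward()  # differentiable end to end
```

The velocity term is what penalizes jitter and drift directly, which is why a full specification of such terms matters for judging the stability claims.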

Circularity Check

0 steps flagged

No significant circularity in empirical model comparison

full rationale

The paper is an empirical study that constructs a paired motion dataset from boxing videos and compares three Transformer variants (simple Transformer, iTransformer, Crossformer) plus a person-ID embedding on that data. The central claims concern observed experimental outcomes such as posture collapse avoidance and error accumulation. No derivations, equations, or predictions are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Any self-citations (if present) are not load-bearing for the reported results, which are directly falsifiable via the described experiments. This is a standard non-circular empirical ML paper.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim depends on standard deep learning assumptions about sequence modeling and the representativeness of the boxing-derived dataset. No new entities are postulated.

free parameters (2)
  • Transformer hyperparameters
    Standard model settings such as number of layers, attention heads, and learning rate that are tuned during training.
  • Person ID embedding dimension
    Chosen dimension for the embedding that distinguishes the two individuals in the pair.
axioms (2)
  • domain assumption Transformer architectures can effectively model temporal dependencies in human motion sequences
    Relies on prior success of Transformers in sequence tasks without new proof here.
  • domain assumption Paired motions extracted from boxing videos capture general interaction dynamics
    The dataset is domain-specific to boxing matches.

pith-pipeline@v0.9.0 · 5511 in / 1395 out tokens · 108326 ms · 2026-05-08T12:38:03.887450+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

27 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1]

    On human motion prediction using recurrent neural networks,

    J. Martinez, M. J. Black, and J. Romero, “On human motion prediction using recurrent neural networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4674–4683

  2. [2]

    Attention Is All You Need

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017

  3. [3]

    A spatio-temporal transformer for 3d human motion prediction,

E. Aksan, M. Kaufmann, P. Cao, and O. Hilliges, “A spatio-temporal transformer for 3d human motion prediction,” in Proceedings of the International Conference on 3D Vision (3DV), 2021

  4. [4]

    Interaction transformer for human reaction generation,

B. Chopin, H. Tang, N. Otberdout, M. Daoudi, and N. Sebe, “Interaction transformer for human reaction generation,” IEEE Transactions on Multimedia, 2023

  5. [5]

    Two-person interaction detection using body-pose features and multiple instance learning,

K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras, “Two-person interaction detection using body-pose features and multiple instance learning,” in CVPR Workshops, 2012

  6. [6]

    Remos: Reactive 3d motion synthesis for two-person interactions,

A. Ghosh, R. Dabral, V. Golyanik, C. Theobalt, and P. Slusallek, “Remos: Reactive 3d motion synthesis for two-person interactions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. [Online]. Available: https://vcai.mpi-inf.mpg.de/projects/remos

  7. [7]

    Olympic boxing video dataset (unlabeled),

Kaggle, “Olympic boxing video dataset (unlabeled),” https://www.kaggle.com/datasets/piotrstefaskiue/olympic-boxing-video-dataset-unlabeled, accessed: 2026-01-26

  8. [8]

    Cross-conditioned recurrent networks for long-term synthesis of inter-person human motion interactions,

J. Kundu, H. Buckchash, P. Mandikal, R. M V, A. Jamkhandi, and R. V. Babu, “Cross-conditioned recurrent networks for long-term synthesis of inter-person human motion interactions,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020

  9. [9]

    View invariant human action recognition using histograms of 3d joints,

    L. Xia, C. Chen, and J. K. Aggarwal, “View invariant human action recognition using histograms of 3d joints,” CVPR Workshops, 2012

  10. [10]

    Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Advances in Neural Information Processing Systems (NeurIPS), 2020

  11. [11]

Human motion diffusion as a generative prior,

    Y. Shafir, G. Tevet, R. Kapon, and A. H. Bermano, “Human motion diffusion as a generative prior,” arXiv preprint arXiv:2303.01418, 2023

  12. [12]

    Interaction-based human activity comparison,

Y. Shen, L. Yang, E. S. L. Ho, and H. P. H. Shum, “Interaction-based human activity comparison,” IEEE Transactions on Visualization and Computer Graphics, 2020

  13. [13]

    Expi: Extended pose interaction dataset,

C. Guo et al., “Expi: Extended pose interaction dataset,” in CVPR, 2022

  14. [14]

    Motiondiffuse: Text-driven human motion generation with diffusion model,

M. Zhang, Z. Zhang, Y. Chen et al., “Motiondiffuse: Text-driven human motion generation with diffusion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  15. [15]

    Generating human motion from textual descriptions with discrete representations,

J. Zhang, Y. Zhang, X. Cun, Y. Zhang, H. Zhao, H. Lu, X. Shen, and S. Ying, “Generating human motion from textual descriptions with discrete representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  16. [16]

    Momask: Generative masked modeling of 3d human motions,

C. Guo, Y. Mu, M. G. Javed, S. Wang, and L. Cheng, “Momask: Generative masked modeling of 3d human motions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 1900–1910

  17. [17]

    Humanml3d: 3d human motion dataset for text-to-motion generation,

C. Guo et al., “Humanml3d: 3d human motion dataset for text-to-motion generation,” in ECCV, 2022

  18. [18]

    The kit motion-language dataset,

M. Plappert et al., “The kit motion-language dataset,” in Big Data, 2016

  19. [19]

    Action-conditioned 3d human motion synthesis with transformer vae,

M. Petrovich, M. J. Black, and G. Varol, “Action-conditioned 3d human motion synthesis with transformer vae,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  20. [20]

    Transfusion: Motion prediction with diffusion probabilistic models,

Y. Zhang et al., “Transfusion: Motion prediction with diffusion probabilistic models,” arXiv preprint arXiv:2307.16106, 2023

  21. [21]

    iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, and M. Long, “itransformer: Inverted transformers are effective for time series forecasting,” arXiv preprint arXiv:2310.06625, 2023

  22. [22]

    Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting,

Y. Zhang and J. Yan, “Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting,” in Proceedings of the International Conference on Learning Representations (ICLR), 2023

  23. [23]

    Azure kinect dk,

    Microsoft, “Azure kinect dk,” https://www.microsoft.com/ja-jp/d/azure-kinect-dk/8pp5vxmd9nhq, accessed: 2026-01-26

  24. [24]

    Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time,

H.-S. Fang, J. Yu, P. Xu, Y.-S. Song, C. Shen, and C. Lu, “Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  25. [25]

    Mmpose dataset zoo (2d body keypoint),

OpenMMLab, “Mmpose dataset zoo (2d body keypoint),” https://mmpose.readthedocs.io/en/latest/dataset_zoo/2d_body_keypoint.html, accessed: 2026-01-26

  26. [26]

    Motionbert: A unified perspective on learning human motion represen- tations,

W. Zhu, Z. Xia, L. Shen, and W. Lin, “Motionbert: A unified perspective on learning human motion representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  27. [27]

    Mmpose dataset zoo (3d body keypoint),

OpenMMLab, “Mmpose dataset zoo (3d body keypoint),” https://mmpose.readthedocs.io/en/latest/dataset_zoo/3d_body_keypoint.html, accessed: 2026-01-26