MAMMA: Markerless & Automatic Multi-Person Motion Action Capture
Pith reviewed 2026-05-19 10:11 UTC · model grok-4.3
The pith
MAMMA recovers accurate SMPL-X parameters for two-person interactions from multi-view video without markers or manual cleanup.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MAMMA predicts dense 2D contact-aware surface landmarks conditioned on segmentation masks using an architecture with learnable queries for each landmark, which enables accurate recovery of SMPL-X parameters from multi-view video of two-person interactions even under heavy occlusion and physical contact.
What carries the argument
Dense 2D contact-aware surface landmarks conditioned on segmentation masks, predicted via learnable queries for each landmark.
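As a rough illustration of the idea, the sketch below reads out one landmark location with masked dot-product attention followed by a soft-argmax. It is not the paper's architecture: the function name `landmark_from_query`, the use of a single query vector per landmark, the hard gating of pixels by the mask, and the toy features are all assumptions made for this example, and the "learnable" query is simply fixed by hand here rather than trained.

```python
import math

def landmark_from_query(query, features, mask):
    """Soft-argmax location of one landmark (illustrative, hypothetical helper).

    query:    vector of length d standing in for a learned per-landmark query
    features: dict mapping pixel (x, y) -> feature vector of length d
    mask:     dict mapping pixel (x, y) -> 1 if the pixel belongs to the target person
    """
    # Scaled dot-product logits, restricted to pixels inside the person's mask.
    # Hard gating by the mask is an assumption; the source only says "conditioned on".
    scale = math.sqrt(len(query))
    logits = {
        px: sum(q * f for q, f in zip(query, feat)) / scale
        for px, feat in features.items()
        if mask.get(px, 0)
    }
    if not logits:
        raise ValueError("mask selects no pixels")
    # Softmax over the masked pixels (subtract the max for numerical stability).
    top = max(logits.values())
    weights = {px: math.exp(v - top) for px, v in logits.items()}
    total = sum(weights.values())
    # Expected pixel coordinate under the attention distribution (soft-argmax).
    x = sum(px[0] * w for px, w in weights.items()) / total
    y = sum(px[1] * w for px, w in weights.items()) / total
    return (x, y)

# Toy 2x2 feature map: person A occupies the top row, person B the bottom row.
features = {(0, 0): (0.0, 1.0), (1, 0): (1.0, 0.0),
            (0, 1): (0.0, 1.0), (1, 1): (1.0, 0.0)}
mask_a = {(0, 0): 1, (1, 0): 1}
mask_b = {(0, 1): 1, (1, 1): 1}
query = (10.0, 0.0)  # hand-set stand-in for a trained query

loc_a = landmark_from_query(query, features, mask_a)
loc_b = landmark_from_query(query, features, mask_b)
print(loc_a, loc_b)
```

The same query resolves to a different row depending on which mask is supplied, which is the sense in which mask conditioning could give person-specific correspondences when two bodies overlap in the image.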
If this is right
- The method handles complex person-person physical interactions and occlusions more reliably than prior single-person or sparse-keypoint approaches.
- It produces SMPL-X outputs that require no extensive manual cleanup after capture.
- New real-sequence evaluation settings are provided for dense-landmark and markerless multi-person capture tasks.
Where Pith is reading between the lines
- The same landmark-prediction strategy could be tested on sequences with three or more people to check scaling behavior.
- If the synthetic-data recipe generalizes, similar contact-aware landmark supervision might improve other occlusion-heavy vision tasks such as hand tracking or object manipulation.
Load-bearing premise
Models trained only on synthetic sequences with simulated interactions and occlusions will generalize accurately to real multi-view videos that have natural lighting, varied clothing, and real camera calibration differences.
What would settle it
A side-by-side comparison of reconstruction error on held-out real multi-view sequences would settle it: if the markerless outputs deviate substantially from ground-truth marker-based captures, the competitive-quality claim does not hold.
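One standard way to put a number on such a comparison is the mean per-joint position error (MPJPE), the average Euclidean distance between corresponding predicted and reference joints. The sketch below is a minimal version under the assumption that both captures are already aligned in a shared coordinate frame; the joint values are made-up toy data, not results from the paper.

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints.

    pred, gt: equal-length sequences of (x, y, z) joint positions in the same units.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

# Hypothetical toy data in millimetres: one joint is off by 3 mm, the other by 4 mm.
gt = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
pred = [(3.0, 0.0, 0.0), (100.0, 4.0, 0.0)]
print(mpjpe(pred, gt))  # 3.5
```

In practice the marker-based and markerless outputs would first need temporal synchronization and a common skeleton or vertex definition before a metric like this is meaningful.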
Original abstract
We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video of two-person interaction sequences. Traditional motion-capture systems rely on physical markers. Although they offer high accuracy, their requirements of specialized hardware, manual marker placement, and extensive post-processing make them costly and time-consuming. Recent learning-based methods attempt to overcome these limitations, but most are designed for single-person capture, rely on sparse keypoints, or struggle with occlusions and physical interactions. In this work, we introduce a method that predicts dense 2D contact-aware surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. We employ a novel architecture that exploits learnable queries for each landmark. We demonstrate that our approach can handle complex person-person interaction and offers greater accuracy than existing methods. To train our network, we construct a large, synthetic multi-view dataset combining human motions from diverse sources, including extreme poses, hand motions, and close interactions. Our dataset yields high-variability synthetic sequences with rich body contact and occlusion, and includes SMPL-X ground-truth annotations with dense 2D landmarks. The result is a system capable of capturing human motion without the need for markers. Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions, without the extensive manual cleanup. Finally, we address the absence of common benchmarks for dense-landmark prediction and markerless motion capture by introducing two evaluation settings built from real multi-view sequences. Our dataset is available at https://mamma.is.tue.mpg.de for research purposes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents MAMMA, a markerless motion-capture pipeline for recovering SMPL-X parameters from multi-view videos of two-person interactions. It introduces a method to predict dense 2D contact-aware surface landmarks conditioned on segmentation masks using a novel architecture with learnable queries for each landmark. The system is trained on a large synthetic multi-view dataset constructed from diverse human motions with added interactions and occlusions, including SMPL-X ground-truth and dense 2D landmarks. The authors claim that their approach handles complex person-person interactions and occlusions, offers greater accuracy than existing methods, and provides competitive reconstruction quality compared to commercial marker-based motion-capture solutions without extensive manual cleanup. They also introduce two new evaluation settings from real multi-view sequences to address the lack of common benchmarks.
Significance. If the accuracy and generalization claims hold, this work would be significant for the field of computer vision and graphics by providing an accessible, markerless alternative for capturing multi-person motions in interactive scenarios. The focus on dense landmarks and handling of occlusions and contacts addresses important limitations in prior single-person or sparse-keypoint methods. The construction and release of the synthetic dataset with rich interactions could serve as a valuable resource for training and benchmarking future methods.
major comments (1)
- [Abstract and Evaluation Settings] The central claim that the approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions is load-bearing for the paper's contribution but is not supported by direct quantitative evidence. The manuscript introduces real multi-view evaluation settings but does not report error metrics (such as MPJPE or surface errors) or ablations comparing performance on real sequences against commercial marker-based systems on identical captures, nor does it quantify the effect of synthetic-to-real domain shift in lighting, clothing, camera calibration, and contact patterns. This leaves the generalization assumption untested and undermines verification of the competitiveness assertion.
minor comments (1)
- [Method] The exact mechanism by which contact-awareness is encoded in the dense 2D landmark prediction (via conditioning on segmentation masks and learnable queries) would benefit from additional detail and pseudocode in the method description to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address the major comment on the evaluation of our competitiveness claim below and outline planned revisions to strengthen the manuscript.
- Referee: [Abstract and Evaluation Settings] The central claim that the approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions is load-bearing for the paper's contribution but is not supported by direct quantitative evidence. The manuscript introduces real multi-view evaluation settings but does not report error metrics (such as MPJPE or surface errors) or ablations comparing performance on real sequences against commercial marker-based systems on identical captures, nor does it quantify the effect of synthetic-to-real domain shift in lighting, clothing, camera calibration, and contact patterns. This leaves the generalization assumption untested and undermines verification of the competitiveness assertion.
Authors: We acknowledge that the manuscript does not include direct quantitative error metrics (e.g., MPJPE or surface distances) comparing MAMMA against commercial marker-based systems on identical real captures, nor explicit ablations isolating synthetic-to-real shifts in lighting, clothing, calibration, or contact patterns. Obtaining perfectly paired marker-based and markerless captures of the same two-person interactions under controlled conditions proved logistically difficult, which is why our real evaluation settings emphasize qualitative visual assessment, interaction fidelity, and occlusion handling rather than numerical benchmarking against commercial output. We will revise the abstract to qualify the competitiveness statement as being supported by qualitative results and the ability to capture complex interactions without markers or cleanup. We will also add a dedicated discussion subsection on the real evaluation settings that explicitly addresses domain-shift considerations and reports any available proxy metrics (view-consistency errors, contact accuracy) computed on the real sequences. These changes will be incorporated in the revised manuscript.
Revision: partial
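The rebuttal does not define "view-consistency error". One plausible reading, sketched below as an assumption rather than the authors' metric, is the mean reprojection error: project a reconstructed 3D point into every calibrated camera and measure how far it lands from the 2D landmark detected in that view. The helper names `project` and `view_consistency_error`, the 3x4 camera matrices, and the detections are all hypothetical toy values.

```python
import math

def project(P, X):
    """Project 3D point X with a 3x4 camera matrix P; returns image coordinates (u, v)."""
    Xh = (X[0], X[1], X[2], 1.0)  # homogeneous coordinates
    u, v, w = (sum(row[i] * Xh[i] for i in range(4)) for row in P)
    return (u / w, v / w)

def view_consistency_error(cameras, point3d, detections):
    """Mean distance between reprojections of point3d and the per-view 2D detections."""
    errors = [math.dist(project(P, point3d), d) for P, d in zip(cameras, detections)]
    return sum(errors) / len(errors)

# Two toy cameras with unit focal length; the second is shifted one unit along x.
cam1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
cam2 = [[1.0, 0.0, 0.0, -1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
point = (1.0, 2.0, 4.0)

# Hypothetical detections: exact in view 1, off by 0.1 in v in view 2.
detections = [(0.25, 0.5), (0.0, 0.6)]
err = view_consistency_error([cam1, cam2], point, detections)
print(round(err, 3))  # 0.05
```

A proxy like this needs no marker-based reference, which is why it can be computed on real sequences, but it measures agreement between views rather than accuracy against ground truth.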
Circularity Check
No significant circularity; pipeline uses independent synthetic training and real evaluation
The paper describes a standard learning-based pipeline: a novel network architecture with learnable queries predicts dense 2D contact-aware landmarks from segmentation masks, trained on a newly constructed synthetic multi-view dataset that combines motions from diverse sources and provides SMPL-X ground truth. Evaluation occurs on separate real multi-view sequences in introduced benchmark settings. No equations, fitted parameters, or self-citations are shown to reduce the final SMPL-X outputs directly to quantities defined by the inputs or prior author work by construction. The synthetic-to-real transfer is an empirical claim supported by the training process rather than a definitional tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: models trained on synthetic multi-view interaction sequences with SMPL-X ground truth will generalize to real captured sequences.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction
  OmniFit uses a conditional transformer decoder to predict dense body landmarks from multi-modal inputs for scale-agnostic SMPL-X fitting, outperforming prior methods by 57-81% and reaching millimeter accuracy on CAPE ...
- Markerless Head Tracking for Accurate and Accessible Neuronavigation
  Markerless multi-camera head tracking achieves 2.32 mm and 2.01° median accuracy versus marker-based systems in 50 subjects, sufficient for transcranial magnetic stimulation.