MAMMA: Markerless & Automatic Multi-Person Motion Action Capture

Anastasios Yiannakidis; Eni Halilaj; Giorgio Becherini; Hanz Cuevas-Velasquez; Joachim Tesch; Markus Höschle; Michael J. Black; Soyong Shin; Taylor Obersat; Tsvetelina Alexiadis

arxiv: 2506.13040 · v4 · submitted 2025-06-16 · 💻 cs.CV

Hanz Cuevas-Velasquez, Anastasios Yiannakidis, Soyong Shin, Giorgio Becherini, Markus Höschle, Joachim Tesch, Taylor Obersat, Tsvetelina Alexiadis, Eni Halilaj, Michael J. Black

Pith reviewed 2026-05-19 10:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords markerless motion capture · multi-person interaction · SMPL-X · dense landmarks · synthetic dataset · multi-view video · body reconstruction · occlusion handling

The pith

MAMMA recovers accurate SMPL-X parameters for two-person interactions from multi-view video without markers or manual cleanup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a markerless pipeline that estimates detailed body models from synchronized camera views of people interacting closely. It works by first predicting dense 2D landmarks that respect contact points and segmentation masks, then using those correspondences to fit SMPL-X shape and pose parameters. Training relies on a large synthetic dataset built from existing motion sources with added occlusions and interactions, which supplies ground-truth landmarks and parameters. Evaluation on new real multi-view sequences shows reconstruction quality competitive with commercial marker-based systems while removing the need for physical markers and post-processing.
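To make the second stage more concrete: once the same surface landmark has been located in several calibrated views, its 3D position is constrained by where the corresponding camera rays meet. The sketch below is a generic least-squares ray intersection, not the paper's SMPL-X fitting procedure. The function names `solve3` and `triangulate_rays` and the toy camera layout are hypothetical and chosen only for illustration.

```python
import math

def solve3(A, b):
    """Solve a 3x3 linear system A x = b by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("rays are degenerate (for example, all parallel)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def triangulate_rays(centers, directions):
    """Return the point minimizing the summed squared distance to all rays.

    Each ray is center + t * direction. The normal equations are
    (sum_i P_i) X = sum_i P_i c_i, with P_i = I - u_i u_i^T and u_i the unit direction.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(centers, directions):
        norm = math.sqrt(sum(v * v for v in d))
        u = [v / norm for v in d]
        for i in range(3):
            for j in range(3):
                p = (1.0 if i == j else 0.0) - u[i] * u[j]  # projector orthogonal to the ray
                A[i][j] += p
                b[i] += p * c[j]
    return solve3(A, b)

# Toy setup (assumed): three camera centers looking at one landmark.
landmark = [0.2, 1.0, 3.0]
centers = [[-2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
directions = [[l - c for l, c in zip(landmark, ctr)] for ctr in centers]
estimate = triangulate_rays(centers, directions)
```

With noise-free rays the estimate coincides with the landmark up to floating-point error. With dense landmarks, many such constraints per person are available, which is what makes a body-model fit well conditioned even when some views are occluded.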

Core claim

MAMMA predicts dense 2D contact-aware surface landmarks conditioned on segmentation masks using an architecture with learnable queries for each landmark, which enables accurate recovery of SMPL-X parameters from multi-view video of two-person interactions even under heavy occlusion and physical contact.

What carries the argument

dense 2D contact-aware surface landmarks conditioned on segmentation masks, predicted via learnable queries for each landmark
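The "learnable query per landmark" idea follows the general pattern of query-based decoders: each landmark owns a learned vector that attends over image features and is then decoded into a 2D location. The single-head sketch below illustrates that pattern in plain Python. It is not the paper's architecture: random vectors stand in for trained parameters, segmentation-mask conditioning is omitted, and decoding the location as an attention-weighted average of patch centers is a simplification chosen for brevity. The name `decode_landmarks` and all dimensions are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decode_landmarks(queries, tokens, token_xy):
    """One cross-attention step: each landmark query attends over image tokens.

    queries  : one feature vector per landmark (learned in a real model)
    tokens   : one feature vector per image patch
    token_xy : normalized (x, y) center of each patch
    Returns one (x, y) per landmark as the attention-weighted patch center.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, t) * scale for t in tokens])
        x = sum(w * p[0] for w, p in zip(weights, token_xy))
        y = sum(w * p[1] for w, p in zip(weights, token_xy))
        out.append((x, y))
    return out

# Toy inputs (assumed): 5 landmarks, a 4x4 grid of patches, 8-dimensional features.
random.seed(0)
dim, grid = 8, 4
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(grid * grid)]
token_xy = [((c + 0.5) / grid, (r + 0.5) / grid) for r in range(grid) for c in range(grid)]
landmarks_2d = decode_landmarks(queries, tokens, token_xy)
```

Because every landmark has its own query, the number of outputs is fixed by the query set rather than by the image, so a landmark that is occluded in a view still receives a prediction slot.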

If this is right

The method handles complex person-person physical interactions and occlusions more reliably than prior single-person or sparse-keypoint approaches.
It produces SMPL-X outputs that require no extensive manual cleanup after capture.
New real-sequence evaluation settings are provided for dense-landmark and markerless multi-person capture tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same landmark-prediction strategy could be tested on sequences with three or more people to check scaling behavior.
If the synthetic-data recipe generalizes, similar contact-aware landmark supervision might improve other occlusion-heavy vision tasks such as hand tracking or object manipulation.

Load-bearing premise

Models trained only on synthetic sequences with simulated interactions and occlusions will generalize accurately to real multi-view videos that have natural lighting, varied clothing, and real camera calibration differences.

What would settle it

A side-by-side reconstruction error comparison on held-out real multi-view sequences where the markerless outputs deviate substantially from ground-truth marker-based captures would show the competitive-quality claim does not hold.
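One common way to quantify such a comparison is the mean per-joint position error (MPJPE), the average Euclidean distance between corresponding predicted and reference joints. The sketch below assumes both systems report joints in the same coordinate frame and units; the sample coordinates are invented for illustration and are not results from the paper.

```python
import math

def mpjpe(pred, ref):
    """Mean Euclidean distance between corresponding 3D joints."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("need two non-empty joint lists of equal length")
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)

# Invented example values, in meters.
reference = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.5, 1.5, 0.0)]
markerless = [(0.0, 0.0, 0.03), (0.04, 1.0, 0.0), (0.5, 1.5, 0.0)]
error_m = mpjpe(markerless, reference)
```

In practice the two captures would also need to be time-synchronized and expressed in a shared joint definition before the metric is meaningful.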

Original abstract

We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video of two-person interaction sequences. Traditional motion-capture systems rely on physical markers. Although they offer high accuracy, their requirements of specialized hardware, manual marker placement, and extensive post-processing make them costly and time-consuming. Recent learning-based methods attempt to overcome these limitations, but most are designed for single-person capture, rely on sparse keypoints, or struggle with occlusions and physical interactions. In this work, we introduce a method that predicts dense 2D contact-aware surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. We employ a novel architecture that exploits learnable queries for each landmark. We demonstrate that our approach can handle complex person-person interaction and offers greater accuracy than existing methods. To train our network, we construct a large, synthetic multi-view dataset combining human motions from diverse sources, including extreme poses, hand motions, and close interactions. Our dataset yields high-variability synthetic sequences with rich body contact and occlusion, and includes SMPL-X ground-truth annotations with dense 2D landmarks. The result is a system capable of capturing human motion without the need for markers. Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions, without the extensive manual cleanup. Finally, we address the absence of common benchmarks for dense-landmark prediction and markerless motion capture by introducing two evaluation settings built from real multi-view sequences. Our dataset is available in https://mamma.is.tue.mpg.de for research purposes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MAMMA gives a workable dense-landmark pipeline for two-person interactions plus a new synthetic dataset, but the competitive mocap claim needs tighter evidence on how well synthetic training holds up on real footage.

The main takeaway is a markerless system for two-person motion capture that predicts dense contact-aware landmarks from segmentation masks using learnable queries per landmark, trained on a custom synthetic multi-view dataset built from diverse motions with added close interactions and occlusions. It also supplies real multi-view evaluation settings and releases the data. This targets practical gaps in handling person-person contacts and heavy occlusion better than single-person or sparse-keypoint baselines that came before it.

Referee Report

1 major / 1 minor

Summary. The paper presents MAMMA, a markerless motion-capture pipeline for recovering SMPL-X parameters from multi-view videos of two-person interactions. It introduces a method to predict dense 2D contact-aware surface landmarks conditioned on segmentation masks using a novel architecture with learnable queries for each landmark. The system is trained on a large synthetic multi-view dataset constructed from diverse human motions with added interactions and occlusions, including SMPL-X ground-truth and dense 2D landmarks. The authors claim that their approach handles complex person-person interactions and occlusions, offers greater accuracy than existing methods, and provides competitive reconstruction quality compared to commercial marker-based motion-capture solutions without extensive manual cleanup. They also introduce two new evaluation settings from real multi-view sequences to address the lack of common benchmarks.

Significance. If the accuracy and generalization claims hold, this work would be significant for the field of computer vision and graphics by providing an accessible, markerless alternative for capturing multi-person motions in interactive scenarios. The focus on dense landmarks and handling of occlusions and contacts addresses important limitations in prior single-person or sparse-keypoint methods. The construction and release of the synthetic dataset with rich interactions could serve as a valuable resource for training and benchmarking future methods.

major comments (1)

[Abstract and Evaluation Settings] Abstract and Evaluation Settings: The central claim that the approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions is load-bearing for the paper's contribution but is not supported by direct quantitative evidence. The manuscript introduces real multi-view evaluation settings but does not report error metrics (such as MPJPE or surface errors) or ablations comparing performance on real sequences against commercial marker-based systems on identical captures, nor does it quantify the effect of synthetic-to-real domain shift in lighting, clothing, camera calibration, and contact patterns. This leaves the generalization assumption untested and undermines verification of the competitiveness assertion.

minor comments (1)

[Method] The exact mechanism by which contact-awareness is encoded in the dense 2D landmark prediction (via conditioning on segmentation masks and learnable queries) would benefit from additional detail and pseudocode in the method description to improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We address the major comment on the evaluation of our competitiveness claim below and outline planned revisions to strengthen the manuscript.

Referee: [Abstract and Evaluation Settings] Abstract and Evaluation Settings: The central claim that the approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions is load-bearing for the paper's contribution but is not supported by direct quantitative evidence. The manuscript introduces real multi-view evaluation settings but does not report error metrics (such as MPJPE or surface errors) or ablations comparing performance on real sequences against commercial marker-based systems on identical captures, nor does it quantify the effect of synthetic-to-real domain shift in lighting, clothing, camera calibration, and contact patterns. This leaves the generalization assumption untested and undermines verification of the competitiveness assertion.

Authors: We acknowledge that the manuscript does not include direct quantitative error metrics (e.g., MPJPE or surface distances) comparing MAMMA against commercial marker-based systems on identical real captures, nor explicit ablations isolating synthetic-to-real shifts in lighting, clothing, calibration, or contact patterns. Obtaining perfectly paired marker-based and markerless captures of the same two-person interactions under controlled conditions proved logistically difficult, which is why our real evaluation settings emphasize qualitative visual assessment, interaction fidelity, and occlusion handling rather than numerical benchmarking against commercial output. We will revise the abstract to qualify the competitiveness statement as being supported by qualitative results and the ability to capture complex interactions without markers or cleanup. We will also add a dedicated discussion subsection on the real evaluation settings that explicitly addresses domain-shift considerations and reports any available proxy metrics (view-consistency errors, contact accuracy) computed on the real sequences. These changes will be incorporated in the revised manuscript.

revision: partial
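The rebuttal does not define "view-consistency error". One plausible reading, offered here purely as an assumption, is a mean reprojection error: project each reconstructed 3D point into every camera and measure the pixel distance to the landmark detected in that view. The sketch uses a simplified pinhole model in which cameras share an identity rotation and differ only in position; the intrinsics, the dictionary layout, and the function names are hypothetical.

```python
import math

def project(point, cam):
    """Pinhole projection for a camera with identity rotation.

    cam: dict with focal length 'f' (pixels), principal point 'cx' and 'cy',
    and camera position 'pos' in world coordinates.
    """
    x, y, z = (p - c for p, c in zip(point, cam["pos"]))
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (cam["f"] * x / z + cam["cx"], cam["f"] * y / z + cam["cy"])

def view_consistency_error(points_3d, detections, cams):
    """Mean pixel distance between reprojected 3D points and per-view 2D detections.

    detections[v][i] is 2D landmark i as seen in view v, or None if it is occluded.
    """
    total, count = 0.0, 0
    for cam, view_dets in zip(cams, detections):
        for point, det in zip(points_3d, view_dets):
            if det is None:
                continue  # skip landmarks not visible in this view
            total += math.dist(project(point, cam), det)
            count += 1
    if count == 0:
        raise ValueError("no visible detections")
    return total / count

# Toy example (assumed values): one point, two cameras, occluded in the second view.
cams = [
    {"f": 1000.0, "cx": 640.0, "cy": 360.0, "pos": (0.0, 0.0, 0.0)},
    {"f": 1000.0, "cx": 640.0, "cy": 360.0, "pos": (1.0, 0.0, 0.0)},
]
points_3d = [(0.0, 0.0, 2.0)]
detections = [[(643.0, 364.0)], [None]]
error_px = view_consistency_error(points_3d, detections, cams)
```

A metric of this kind needs no marker-based ground truth, which is why it can serve as a proxy, but it only measures agreement between the reconstruction and the 2D detections, not absolute 3D accuracy.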

Circularity Check

0 steps flagged

No significant circularity; pipeline uses independent synthetic training and real evaluation

The paper describes a standard learning-based pipeline: a novel network architecture with learnable queries predicts dense 2D contact-aware landmarks from segmentation masks, trained on a newly constructed synthetic multi-view dataset that combines motions from diverse sources and provides SMPL-X ground truth. Evaluation occurs on separate real multi-view sequences in introduced benchmark settings. No equations, fitted parameters, or self-citations are shown to reduce the final SMPL-X outputs directly to quantities defined by the inputs or prior author work by construction. The synthetic-to-real transfer is an empirical claim supported by the training process rather than a definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the generalization power of a neural network trained exclusively on synthetic data and on the assumption that dense 2D landmark correspondences suffice to resolve SMPL-X parameters under heavy occlusion and contact.

axioms (1)

domain assumption: Models trained on synthetic multi-view interaction sequences with SMPL-X ground truth will generalize to real captured sequences.
Training and evaluation both rely on this transfer from synthetic to real data.

pith-pipeline@v0.9.0 · 5869 in / 1299 out tokens · 28307 ms · 2026-05-19T10:11:55.675682+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction
cs.CV · 2026-04 · unverdicted · novelty 7.0

OmniFit uses a conditional transformer decoder to predict dense body landmarks from multi-modal inputs for scale-agnostic SMPL-X fitting, outperforming prior methods by 57-81% and reaching millimeter accuracy on CAPE ...

Markerless Head Tracking for Accurate and Accessible Neuronavigation
cs.CV · 2026-02 · conditional · novelty 6.0

Markerless multi-camera head tracking achieves 2.32 mm and 2.01° median accuracy versus marker-based systems in 50 subjects, sufficient for transcranial magnetic stimulation.