arxiv: 2603.23684 · v2 · submitted 2026-03-24 · 💻 cs.CV

MoCHA: Denoising Caption Supervision for Motion-Text Retrieval

Pith reviewed 2026-05-15 00:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords: motion-text retrieval, caption canonicalization, contrastive learning, caption denoising, HumanML3D, KIT-ML, text-motion alignment

The pith

Projecting motion captions onto only their recoverable content before contrastive training produces tighter embeddings and sets new state-of-the-art retrieval results on HumanML3D and KIT-ML.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Motion captions mix reliable details such as action type, body parts, and direction with annotator style and inferred context that cannot be recovered from 3D joint data. Standard contrastive training treats every caption as the single exact match, which spreads embeddings for the same motion and weakens alignment. MoCHA first projects each caption down to its motion-recoverable subset, then applies the usual contrastive objective to these cleaned texts. The result is lower within-motion variance, better-separated embeddings, and substantially higher retrieval accuracy. The same preprocessing also improves transfer when a model trained on one dataset is tested on another.
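To make the "single exact match" assumption concrete, here is a minimal sketch of a symmetric InfoNCE-style contrastive loss over a batch of paired text and motion embeddings. It is a generic illustration, not the paper's training code: the function name `info_nce`, the temperature value, and the toy two-dimensional vectors are all assumptions chosen for readability.

```python
import math

def _normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def info_nce(text_emb, motion_emb, temperature=0.1):
    """Symmetric contrastive loss: text i is the only positive for motion i."""
    t = [_normalize(v) for v in text_emb]
    m = [_normalize(v) for v in motion_emb]
    # Cosine similarity between every text and every motion, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(ti, mj)) / temperature for mj in m]
              for ti in t]
    logits_t = [list(col) for col in zip(*logits)]  # motion-to-text direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))

# Toy batch (hypothetical values): two motions and two sets of caption embeddings.
motions = [[1.0, 0.0], [0.0, 1.0]]
clean_texts = [[0.9, 0.1], [0.1, 0.9]]   # captions that sit close to their motion
noisy_texts = [[0.6, 0.5], [0.5, 0.6]]   # captions pulled toward each other
loss_clean = info_nce(clean_texts, motions)
loss_noisy = info_nce(noisy_texts, motions)
```

In this toy setup the captions that sit closer to their own motion give the lower loss. That is the intuition behind canonicalizing text first, although the toy vectors say nothing about how large the effect is on real data.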

Core claim

MoCHA is a text canonicalization framework that reduces caption variance by projecting each caption onto its motion-recoverable content prior to encoding. Canonicalization can be performed by an LLM (GPT-5.2 in the paper) or by a distilled FlanT5 model that requires no LLM at inference time. When the resulting texts are used for contrastive training, within-motion text-embedding variance drops 11-19 percent, cross-dataset transfer improves by 94 percent (HumanML3D to KIT-ML) and 52 percent (KIT-ML to HumanML3D), and text-to-motion R@1 with the LLM variant reaches new highs of 13.9 percent on HumanML3D and 24.3 percent on KIT-ML.

What carries the argument

The load is carried by the MoCHA text canonicalizer, which projects each caption onto its motion-recoverable content using either an LLM or a distilled FlanT5 model.
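The abstract notes that even deterministic rule-based canonicalizers help cross-dataset transfer. The sketch below shows what such a rule-based projection could look like. Every pattern in it is a hypothetical example invented for illustration; the paper's actual rules and LLM prompts are not reproduced here, and a real canonicalizer would need far broader coverage.

```python
import re

# Hypothetical style and inferred-context patterns (assumptions, not the paper's rules).
STYLE_PATTERNS = [
    r"\bit (?:looks|seems) like\b",   # annotator hedging
    r"\b(?:appears|seems) to\b",      # annotator hedging
    r"\bas if [^,.]*",                # inferred intent, not visible in joint data
    r"\b(?:happily|casually)\b",      # mood adverbs
]
# Hypothetical subject phrases that carry no motion information.
SUBJECT_PATTERN = r"^(?:a|the) (?:person|man|woman|figure|human)\s+"

def canonicalize(caption: str) -> str:
    """Strip style phrases and the generic subject, keeping the action text."""
    text = caption.lower().strip().rstrip(".")
    for pattern in STYLE_PATTERNS:
        text = re.sub(pattern, " ", text)
    text = re.sub(r"\s+", " ", text)                 # collapse whitespace
    text = re.sub(r"\s+,", ",", text).strip(" ,")    # tidy dangling commas
    return re.sub(SUBJECT_PATTERN, "", text)

plain = canonicalize("A person walks forward.")
styled = canonicalize("The man happily walks forward, as if greeting a friend.")
```

Both example captions collapse to the same string, `walks forward`, so a text encoder would receive identical input for them. The trade-off the referee report raises applies here too: an overly aggressive rule can delete details that are in fact recoverable from the motion.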

If this is right

  • Within-motion text-embedding variance drops by 11-19 percent.
  • Cross-dataset transfer rises substantially, with HumanML3D-to-KIT-ML improving 94 percent and KIT-ML-to-HumanML3D improving 52 percent.
  • The LLM variant reaches 13.9 percent text-to-motion R@1 on HumanML3D (+3.1 points) and 24.3 percent on KIT-ML (+10.3 points).
  • The distilled T5 variant delivers +2.5 points on HumanML3D and +8.1 points on KIT-ML without any LLM at inference.
  • The method functions as a preprocessing step compatible with any existing retrieval architecture.
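The list above mixes two kinds of numbers: the R@1 gains are absolute percentage points, while the transfer gains are stated as relative percentages. The short calculation below recovers the baselines implied by the reported R@1 figures and converts the point gains to relative gains. The helper names are illustrative, and the baselines are derived by subtraction rather than quoted from the paper.

```python
def implied_baseline(new_score: float, gain_pp: float) -> float:
    """Baseline score implied by a new score and its gain in percentage points."""
    return new_score - gain_pp

def relative_gain(new_score: float, gain_pp: float) -> float:
    """Gain expressed as a percentage of the implied baseline."""
    return 100.0 * gain_pp / implied_baseline(new_score, gain_pp)

# Reported LLM-variant results: (T2M R@1 in percent, gain in percentage points).
reported = {"HumanML3D": (13.9, 3.1), "KIT-ML": (24.3, 10.3)}

summary = {
    name: (round(implied_baseline(score, gain), 1),
           round(relative_gain(score, gain), 1))
    for name, (score, gain) in reported.items()
}
# summary maps each dataset to (implied baseline R@1, relative gain in percent).
```

The implied baselines are about 10.8 percent on HumanML3D and 14.0 percent on KIT-ML, so the same results read as roughly 29 percent and 74 percent relative improvements. Stating which convention a number uses avoids comparing the +94 percent transfer figure directly against the +10.3 point retrieval figure.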

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same canonicalization principle could be tested on video-text or audio-text pairs where labels contain details invisible to the sensor.
  • Applying MoCHA-style cleaning to synthetic captions generated by large models might further reduce training noise in motion-language tasks.
  • The gains suggest that standardizing the language space is a general lever for building more transferable multimodal representations.

Load-bearing premise

The motion-recoverable subset of each caption can be reliably identified by an LLM or distilled model without introducing new biases or losing critical action semantics.

What would settle it

A controlled test would settle it: have human experts manually write motion-recoverable versions of the same captions, then train on those. If the resulting retrieval metrics showed no gain over raw captions, the claim that canonicalization is responsible for the observed improvements would be falsified.

Figures

Figures reproduced from arXiv: 2603.23684 by Apaar Sadhwani, Cameron Ethan Taylor, Irfan Essa, Nikolai Warner.

Figure 1: Each caption is a different sample from a distribution of valid de…

Figure 2: MoCHA overview. (a) Motivated by the (s, a) decomposition (Section 3.1), C(·) projects each caption onto s by stripping stylistic variation a (red). C is implemented via LLM and distilled into FlanT5 for LLM-free inference. (b) Blend training balances both views: the denoised C(t_i) anchors embeddings around s to reduce gradient variance, while the original t_i regularizes for natural-language queries. Mot…

Figure 3: Canonicalization projects captions onto s, improving retrieval. Top row (colored): ground truth; bottom row (gray): baseline rank-1 error. (a) Verbose a buries the action; MoCHA extracts s while preserving the metaphor. (b) Annotator uncertainty (a); canonicalization extracts shared kinematic content. (c) Complex description decomposed into sequential s, disambiguating from similar motions. (d) Overspeci…
Original abstract

Text-motion retrieval systems learn shared embedding spaces from motion-caption pairs via contrastive objectives. However, each caption is not a deterministic label but a sample from a distribution of valid descriptions: different annotators produce different text for the same motion, mixing motion-recoverable semantics (action type, body parts, directionality) with annotator-specific style and inferred context that cannot be determined from 3D joint coordinates alone. Standard contrastive training treats each caption as the single positive target, overlooking this distributional structure and inducing within-motion embedding variance that weakens alignment. We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding, producing tighter positive clusters and better-separated embeddings. Canonicalization is a general principle: even deterministic rule-based methods improve cross-dataset transfer, though learned canonicalizers provide substantially larger gains. We present two learned variants: an LLM-based approach (GPT-5.2) and a distilled FlanT5 model requiring no LLM at inference time. MoCHA operates as a preprocessing step compatible with any retrieval architecture. Applied to MoPa (MotionPatches), MoCHA sets a new state of the art on both HumanML3D (H) and KIT-ML (K): the LLM variant achieves 13.9% T2M R@1 on H (+3.1pp) and 24.3% on K (+10.3pp), while the LLM-free T5 variant achieves gains of +2.5pp and +8.1pp. Canonicalization reduces within-motion text-embedding variance by 11-19% and improves cross-dataset transfer substantially, with H to K improving by 94% and K to H by 52%, demonstrating that standardizing the language space yields more transferable motion-language representations.
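The headline numbers in the abstract are text-to-motion R@1 scores. As a reminder of what that metric measures, the sketch below computes Recall@K from a text-by-motion similarity matrix in which query i's ground-truth motion is gallery item i. This is the textbook definition only; benchmark protocols may add details (for example, how near-duplicate motions are handled) that are not modeled here, and the similarity values are made up.

```python
def recall_at_k(similarity, k=1):
    """Percentage of text queries whose paired motion ranks in the top k.

    similarity[i][j] is the score between text query i and motion j;
    the correct motion for query i is assumed to be index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)

# Hypothetical similarity matrix for three caption queries and three motions.
sim = [
    [0.9, 0.1, 0.0],  # query 0 ranks its own motion first
    [0.2, 0.3, 0.8],  # query 1 ranks motion 2 first (a rank-1 error)
    [0.1, 0.2, 0.7],  # query 2 ranks its own motion first
]
r_at_1 = recall_at_k(sim, k=1)
r_at_2 = recall_at_k(sim, k=2)
```

Here two of three queries succeed at rank 1 and all three succeed within the top 2. Motion-to-text retrieval uses the same computation on the transposed matrix.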

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MoCHA, a preprocessing framework for motion-text retrieval that canonicalizes each caption to its motion-recoverable semantics (action type, body parts, directionality) via an LLM (GPT-5.2) or distilled FlanT5 model before contrastive training. This is intended to reduce within-motion embedding variance induced by annotator style and unrecoverable context. Applied to MoPa, the method reports new SOTA results on HumanML3D (13.9% T2M R@1, +3.1pp) and KIT-ML (24.3%, +10.3pp for LLM variant; smaller gains for T5), plus 11-19% variance reduction and large cross-dataset transfer gains (H→K +94%).

Significance. If the canonicalization step reliably isolates only recoverable content without introducing new biases or omissions, the approach offers a general, architecture-agnostic way to improve alignment under noisy supervision in motion-language tasks. The reported variance reduction and transfer improvements would then indicate that standardizing the language space yields more robust embeddings than treating raw captions as deterministic positives.

major comments (2)
  1. [Abstract / Method] Abstract and method description of canonicalization: no recoverability metric, human validation, or comparison against motion-derived ground truth is provided to confirm that the LLM (or distilled T5) isolates only motion-recoverable semantics while preserving critical details such as limb sequencing. This is load-bearing for the central claim, as the SOTA gains and variance reduction could otherwise stem from incidental text regularization rather than true denoising.
  2. [Experiments] Experimental results on cross-dataset transfer (H→K +94%, K→H +52%): the improvements are attributed to canonicalization, but no ablation isolates the contribution of the canonicalizer versus other factors such as dataset-specific caption distributions or the base MoPa architecture; without this, the transfer claim cannot be verified as arising from the proposed denoising.
minor comments (2)
  1. Clarify the precise computation of 'within-motion text-embedding variance' (e.g., which embedding model and aggregation is used) and report it with error bars or multiple runs.
  2. The abstract states 'GPT-5.2' without specifying the exact model checkpoint or prompting details; add these to the method section for reproducibility.
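For the first minor comment, one plausible definition of within-motion text-embedding variance is sketched below: for each motion with at least two captions, take the mean squared distance of its caption embeddings to their centroid, then average over motions. This is an assumption offered to show the level of detail being requested; the paper's actual embedding model and aggregation are not specified in the material reviewed here.

```python
from collections import defaultdict

def within_motion_variance(embeddings, motion_ids):
    """Average, over motions, of the mean squared distance to the motion's centroid."""
    groups = defaultdict(list)
    for emb, motion_id in zip(embeddings, motion_ids):
        groups[motion_id].append(emb)

    per_motion = []
    for vecs in groups.values():
        if len(vecs) < 2:
            continue  # a single caption has no spread to measure
        dim = len(vecs[0])
        centroid = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
        sq_dists = [sum((v[d] - centroid[d]) ** 2 for d in range(dim)) for v in vecs]
        per_motion.append(sum(sq_dists) / len(vecs))

    if not per_motion:
        raise ValueError("need at least one motion with two or more captions")
    return sum(per_motion) / len(per_motion)

# Hypothetical embeddings: two motions ("a", "b"), two captions each.
ids = ["a", "a", "b", "b"]
raw = [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
canonical = [[0.5, 0.0], [1.5, 0.0], [0.0, 0.5], [0.0, 1.5]]

var_raw = within_motion_variance(raw, ids)
var_canonical = within_motion_variance(canonical, ids)
reduction_pct = 100.0 * (var_raw - var_canonical) / var_raw
```

Reporting this quantity across several training seeds, together with the embedding model used, would address the request for error bars. The toy reduction computed here is unrelated to the paper's 11-19 percent figure.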

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, providing the strongest honest defense of the manuscript while agreeing to strengthen the evidence for recoverability and transfer attribution through targeted additions.

Point-by-point responses
  1. Referee: [Abstract / Method] Abstract and method description of canonicalization: no recoverability metric, human validation, or comparison against motion-derived ground truth is provided to confirm that the LLM (or distilled T5) isolates only motion-recoverable semantics while preserving critical details such as limb sequencing. This is load-bearing for the central claim, as the SOTA gains and variance reduction could otherwise stem from incidental text regularization rather than true denoising.

    Authors: We agree that direct validation strengthens the central claim. The manuscript already quantifies denoising success via the 11-19% reduction in within-motion embedding variance, which measures the removal of annotator-specific style while retaining recoverable semantics (action type, body parts, directionality). The consistent SOTA gains on HumanML3D and KIT-ML, plus the large transfer improvements, provide indirect evidence that critical details such as limb sequencing are preserved; omitting them would degrade rather than improve retrieval. To address the request explicitly, we will add a human validation study in the revision: annotators will rate recoverability of canonicalized captions from motion clips and compare agreement against original captions and motion-derived pseudo-ground-truth descriptions. This new experiment will be reported in Section 4. revision: yes

  2. Referee: [Experiments] Experimental results on cross-dataset transfer (H→K +94%, K→H +52%): the improvements are attributed to canonicalization, but no ablation isolates the contribution of the canonicalizer versus other factors such as dataset-specific caption distributions or the base MoPa architecture; without this, the transfer claim cannot be verified as arising from the proposed denoising.

    Authors: The base MoPa results (without canonicalization) already serve as the primary control, showing substantially smaller transfer performance than MoCHA. The relative gains (+94% H→K, +52% K→H) are therefore attributable to the canonicalization step rather than architecture or raw dataset distributions alone. To further isolate the effect, we will add an ablation in the revised experiments section that includes: (i) raw captions, (ii) MoCHA-LLM, (iii) MoCHA-T5, and (iv) a non-semantic text regularization baseline (e.g., lower-casing plus synonym replacement). This will confirm that gains arise specifically from motion-recoverable semantic standardization. revision: yes
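The non-semantic regularization baseline proposed in this response (lower-casing plus synonym replacement) could be sketched as follows. The synonym table, the swap probability, and the function name are hypothetical; the point is only that the transformation changes surface wording while leaving annotator style and inferred context in place, so any gain it produces cannot be credited to motion-recoverable canonicalization.

```python
import random

# Hypothetical synonym table; a real baseline would use a lexical resource.
SYNONYMS = {
    "walks": ["strolls"],
    "quickly": ["rapidly"],
    "person": ["human"],
}

def non_semantic_regularize(caption: str, rng: random.Random, swap_prob: float = 0.5) -> str:
    """Lower-case the caption and randomly swap known words for synonyms."""
    out = []
    for token in caption.lower().split():
        core = token.rstrip(".,!?")       # separate trailing punctuation
        tail = token[len(core):]
        if core in SYNONYMS and rng.random() < swap_prob:
            core = rng.choice(SYNONYMS[core])
        out.append(core + tail)
    return " ".join(out)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
always = non_semantic_regularize("A Person walks quickly.", rng, swap_prob=1.0)
never = non_semantic_regularize("A Person walks quickly.", rng, swap_prob=0.0)
```

With `swap_prob=1.0` every listed word is replaced, and with `swap_prob=0.0` the function reduces to lower-casing. If this baseline recovered most of the reported gains, that would support the referee's concern about incidental text regularization.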

Circularity Check

0 steps flagged

No circularity: preprocessing step evaluated on held-out metrics

The paper defines MoCHA as a preprocessing canonicalization that projects captions onto motion-recoverable semantics (action type, body parts, directionality) using an LLM or distilled FlanT5 model before contrastive training. No equations are presented that reduce the reported R@1 gains (+3.1pp on HumanML3D, +10.3pp on KIT-ML) or variance reductions (11-19%) to quantities fitted inside the same experiment. The method is explicitly a general preprocessing step compatible with any retrieval architecture, with benefits measured on standard held-out cross-dataset transfer and retrieval metrics. No self-citations, uniqueness theorems, or ansatzes are invoked to force the result; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that an external model (LLM or distilled T5) can accurately isolate motion-recoverable semantics; no free parameters are explicitly fitted in the abstract, but the choice of which LLM and the distillation process introduce unstated hyperparameters.

axioms (1)
  • domain assumption: Captions contain a separable subset of semantics that are recoverable from 3D joint coordinates alone.
    Invoked in the first paragraph to justify the canonicalization step.

pith-pipeline@v0.9.0 · 5642 in / 1273 out tokens · 33485 ms · 2026-05-15T00:18:41.122642+00:00 · methodology
