MoCHA: Denoising Caption Supervision for Motion-Text Retrieval
Pith reviewed 2026-05-15 00:18 UTC · model grok-4.3
The pith
Projecting motion captions onto only their recoverable content before contrastive training produces tighter embeddings and sets new state-of-the-art retrieval results on HumanML3D and KIT-ML.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MoCHA is a text canonicalization framework that reduces caption variance by projecting each caption onto its motion-recoverable content prior to encoding. Canonicalization can be performed by an LLM or by a distilled FlanT5 model that requires no LLM at inference time. When the resulting texts are used for contrastive training, within-motion text-embedding variance drops 11-19 percent, cross-dataset transfer improves markedly, and retrieval metrics reach new highs on both HumanML3D and KIT-ML.
What carries the argument
MoCHA text canonicalizer, which projects each caption onto its motion-recoverable content using either an LLM or a distilled language model.
If this is right
- Within-motion text-embedding variance drops by 11-19 percent.
- Cross-dataset transfer rises substantially, with HumanML3D-to-KIT-ML improving 94 percent and KIT-ML-to-HumanML3D improving 52 percent.
- The LLM variant reaches 13.9 percent text-to-motion R@1 on HumanML3D (+3.1 points) and 24.3 percent on KIT-ML (+10.3 points).
- The distilled T5 variant delivers +2.5 points on HumanML3D and +8.1 points on KIT-ML without any LLM at inference.
- The method functions as a preprocessing step compatible with any existing retrieval architecture.
Where Pith is reading between the lines
- The same canonicalization principle could be tested on video-text or audio-text pairs where labels contain details invisible to the sensor.
- Applying MoCHA-style cleaning to synthetic captions generated by large models might further reduce training noise in motion-language tasks.
- The gains suggest that standardizing the language space is a general lever for building more transferable multimodal representations.
Load-bearing premise
The motion-recoverable subset of each caption can be reliably identified by an LLM or distilled model without introducing new biases or losing critical action semantics.
What would settle it
A controlled test in which human experts manually produce motion-recoverable versions of the same captions and the resulting retrieval metrics show no gain over raw captions would falsify the claim that automatic canonicalization is responsible for the observed improvements.
Figures
read the original abstract
Text-motion retrieval systems learn shared embedding spaces from motion-caption pairs via contrastive objectives. However, each caption is not a deterministic label but a sample from a distribution of valid descriptions: different annotators produce different text for the same motion, mixing motion-recoverable semantics (action type, body parts, directionality) with annotator-specific style and inferred context that cannot be determined from 3D joint coordinates alone. Standard contrastive training treats each caption as the single positive target, overlooking this distributional structure and inducing within-motion embedding variance that weakens alignment. We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding, producing tighter positive clusters and better-separated embeddings. Canonicalization is a general principle: even deterministic rule-based methods improve cross-dataset transfer, though learned canonicalizers provide substantially larger gains. We present two learned variants: an LLM-based approach (GPT-5.2) and a distilled FlanT5 model requiring no LLM at inference time. MoCHA operates as a preprocessing step compatible with any retrieval architecture. Applied to MoPa (MotionPatches), MoCHA sets a new state of the art on both HumanML3D (H) and KIT-ML (K): the LLM variant achieves 13.9% T2M R@1 on H (+3.1pp) and 24.3% on K (+10.3pp), while the LLM-free T5 variant achieves gains of +2.5pp and +8.1pp. Canonicalization reduces within-motion text-embedding variance by 11-19% and improves cross-dataset transfer substantially, with H to K improving by 94% and K to H by 52%, demonstrating that standardizing the language space yields more transferable motion-language representations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MoCHA, a preprocessing framework for motion-text retrieval that canonicalizes each caption to its motion-recoverable semantics (action type, body parts, directionality) via an LLM (GPT-5.2) or distilled FlanT5 model before contrastive training. This is intended to reduce within-motion embedding variance induced by annotator style and unrecoverable context. Applied to MoPa, the method reports new SOTA results on HumanML3D (13.9% T2M R@1, +3.1pp) and KIT-ML (24.3%, +10.3pp for LLM variant; smaller gains for T5), plus 11-19% variance reduction and large cross-dataset transfer gains (H→K +94%).
Significance. If the canonicalization step reliably isolates only recoverable content without introducing new biases or omissions, the approach offers a general, architecture-agnostic way to improve alignment under noisy supervision in motion-language tasks. The reported variance reduction and transfer improvements would then indicate that standardizing the language space yields more robust embeddings than treating raw captions as deterministic positives.
major comments (2)
- [Abstract / Method] Abstract and method description of canonicalization: no recoverability metric, human validation, or comparison against motion-derived ground truth is provided to confirm that the LLM (or distilled T5) isolates only motion-recoverable semantics while preserving critical details such as limb sequencing. This is load-bearing for the central claim, as the SOTA gains and variance reduction could otherwise stem from incidental text regularization rather than true denoising.
- [Experiments] Experimental results on cross-dataset transfer (H→K +94%, K→H +52%): the improvements are attributed to canonicalization, but no ablation isolates the contribution of the canonicalizer versus other factors such as dataset-specific caption distributions or the base MoPa architecture; without this, the transfer claim cannot be verified as arising from the proposed denoising.
minor comments (2)
- Clarify the precise computation of 'within-motion text-embedding variance' (e.g., which embedding model and aggregation is used) and report it with error bars or multiple runs.
- The abstract states 'GPT-5.2' without specifying the exact model checkpoint or prompting details; add these to the method section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below, providing the strongest honest defense of the manuscript while agreeing to strengthen the evidence for recoverability and transfer attribution through targeted additions.
read point-by-point responses
-
Referee: [Abstract / Method] Abstract and method description of canonicalization: no recoverability metric, human validation, or comparison against motion-derived ground truth is provided to confirm that the LLM (or distilled T5) isolates only motion-recoverable semantics while preserving critical details such as limb sequencing. This is load-bearing for the central claim, as the SOTA gains and variance reduction could otherwise stem from incidental text regularization rather than true denoising.
Authors: We agree that direct validation strengthens the central claim. The manuscript already quantifies denoising success via the 11-19% reduction in within-motion embedding variance, which measures the removal of annotator-specific style while retaining recoverable semantics (action type, body parts, directionality). The consistent SOTA gains on HumanML3D and KIT-ML, plus the large transfer improvements, provide indirect evidence that critical details such as limb sequencing are preserved; omitting them would degrade rather than improve retrieval. To address the request explicitly, we will add a human validation study in the revision: annotators will rate recoverability of canonicalized captions from motion clips and compare agreement against original captions and motion-derived pseudo-ground-truth descriptions. This new experiment will be reported in Section 4. revision: yes
-
Referee: [Experiments] Experimental results on cross-dataset transfer (H→K +94%, K→H +52%): the improvements are attributed to canonicalization, but no ablation isolates the contribution of the canonicalizer versus other factors such as dataset-specific caption distributions or the base MoPa architecture; without this, the transfer claim cannot be verified as arising from the proposed denoising.
Authors: The base MoPa results (without canonicalization) already serve as the primary control, showing substantially smaller transfer performance than MoCHA. The relative gains (+94% H→K, +52% K→H) are therefore attributable to the canonicalization step rather than architecture or raw dataset distributions alone. To further isolate the effect, we will add an ablation in the revised experiments section that includes: (i) raw captions, (ii) MoCHA-LLM, (iii) MoCHA-T5, and (iv) a non-semantic text regularization baseline (e.g., lower-casing plus synonym replacement). This will confirm that gains arise specifically from motion-recoverable semantic standardization. revision: yes
Circularity Check
No circularity: preprocessing step evaluated on held-out metrics
full rationale
The paper defines MoCHA as a preprocessing canonicalization that projects captions onto motion-recoverable semantics (action type, body parts, directionality) using an LLM or distilled FlanT5 model before contrastive training. No equations are presented that reduce the reported R@1 gains (+3.1pp on HumanML3D, +10.3pp on KIT-ML) or variance reductions (11-19%) to quantities fitted inside the same experiment. The method is explicitly a general preprocessing step compatible with any retrieval architecture, with benefits measured on standard held-out cross-dataset transfer and retrieval metrics. No self-citations, uniqueness theorems, or ansatzes are invoked to force the result; the derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Captions contain a separable subset of semantics that are recoverable from 3D joint coordinates alone.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Canonicalization reduces within-motion text-embedding variance by 11–19%
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Petrovich, M., Black, M.J., Varol, G.: TMR: Text-to-motion retrieval using con- trastive 3D human motion synthesis. In: ICCV (2023)
work page 2023
-
[2]
Guo, C., Zou, S., Zuo, X., Wang, S., Ji, T., Li, X., Cheng, L.: Generating diverse and natural 3D human motions from text. In: CVPR (2022)
work page 2022
-
[3]
Plappert, M., Mandery, C., Asfour, T.: The KIT motion-language dataset. Big Data4(4), 236–252 (2016)
work page 2016
-
[4]
Punnakkal, A., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, A., Black, M.J.: BABEL: Bodies, action and behavior with English labels. In: CVPR (2021)
work page 2021
-
[5]
Bensabath, A., Petrovich, M., Varol, G.: Cross-dataset motion retrieval via training with rewritten texts (2024)
work page 2024
-
[6]
Zhu, Y., Siyao, L., Li, Z., et al.: Exploring vision transformers for 3D human motion-language models with motion patches. In: CVPR (2024)
work page 2024
-
[7]
Chung, H.W., Hou, L., Longpre, S., et al.: Scaling instruction-finetuned language models. JMLR (2024)
work page 2024
-
[8]
Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: EMNLP (2019)
work page 2019
-
[9]
Ji, S., Pan, S., Cambria, E., Marttinen, P., Yu, P.S.: A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Trans. Neural Networks and Learning Systems33(2), 494–514 (2022)
work page 2022
-
[10]
Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: Archive of motion capture as a surface model. In: ICCV (2019)
work page 2019
-
[11]
SIAM Journal on Optimization23(4), 2341–2368 (2013)
Ghadimi, S., Lan, G.: Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization23(4), 2341–2368 (2013)
work page 2013
-
[12]
Petrovich, M., Black, M.J., Varol, G.: TEMOS: Generating diverse human motions from textual descriptions. In: ECCV (2022)
work page 2022
-
[13]
Tevet, G., Gordon, B., Hertz, A., Bermano, A.H., Cohen-Or, D.: MotionCLIP: Exposing human motion generation to CLIP space. In: ECCV (2022)
work page 2022
-
[14]
Radford, A., Kim, J.W., Hallacy, C., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
work page 2021
-
[15]
Li, Z., Yuan, W., He, Y., et al.: LaMP: Language-motion pretraining for motion generation, retrieval, and captioning. In: ICLR (2025)
work page 2025
-
[16]
Xu, D., Zheng, T., Zhang, Y., Yang, X., Fu, W.: MTR-MSE: Motion-text re- trievalmethodbasedonmotionsemanticsexpansion.Neurocomputing648,130632 (2025)
work page 2025
-
[17]
Yin, K., Zou, S., Ge, Y., Tian, Z.: Tri-modal motion retrieval by learning a joint embedding space. In: CVPR (2024)
work page 2024
-
[18]
Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. In: ICLR (2023)
work page 2023
-
[19]
Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., Yu, G.: Executing your commands via motion diffusion in latent space. In: CVPR (2023)
work page 2023
-
[20]
Zhang, J., Zhang, Y., Cun, X., Huang, S., Zhang, Y., Zhao, H., Lu, H., Shen, X.: T2M-GPT: Generating human motion from textual descriptions with discrete representations. In: CVPR (2023)
work page 2023
-
[21]
Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: MotionGPT: Human motion as a foreign language. In: NeurIPS (2023)
work page 2023
-
[22]
In: CVPR (2024) MoCHA: Denoising Caption Supervision for Motion-Text Retrieval 17
Guo, C., Mu, Y., Javed, M.G., Wang, S., Cheng, L.: MoMask: Generative masked modeling of 3D human motions. In: CVPR (2024) MoCHA: Denoising Caption Supervision for Motion-Text Retrieval 17
work page 2024
-
[23]
Petrovich, M., Black, M.J., Varol, G.: Action-conditioned 3D human motion syn- thesis with transformer VAE. In: ICCV (2021)
work page 2021
-
[24]
Guo, C., Zuo, X., Wang, S., Cheng, L.: TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts. In: ECCV (2022)
work page 2022
-
[25]
verb [object] [limb] [direction]→next action
Lexicon-augmented motion retrieval. (2025) 18 Warner et al. Appendices A Our Baseline Ablations Baseline ablations (architecture, temperature, self-similarity threshold) are con- solidated into the main paper (Tables 3 and 6) for readability. B LLM Ceiling: Best-Case with LLM at Train and Test The LLM ceiling (LLM canonicalization at both train and test) ...
work page 2025
-
[26]
Expand atomic labels into descriptive canonical forms matching the style above
- [27]
-
[28]
Add plausible spatial details when naturally implied by the action
-
[29]
Keep it concise –- add only what’s naturally implied Common expansions: "walk"→"walk forward", "stand"→"stand in place", "t pose"→"stand with arms extended horizontally", "transition"→ "transition between poses" F Implementation Details G Additional Ablations and Analysis G.1 Canonicalization Strategy Comparison MoCHA Blend achieves the best balance of ga...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.