MoCHA: Denoising Caption Supervision for Motion-Text Retrieval

Apaar Sadhwani; Cameron Ethan Taylor; Irfan Essa; Nikolai Warner

arxiv: 2603.23684 · v2 · submitted 2026-03-24 · 💻 cs.CV


Pith reviewed 2026-05-15 00:18 UTC · model grok-4.3


keywords: motion-text retrieval, caption canonicalization, contrastive learning, caption denoising, HumanML3D, KIT-ML, text-motion alignment


The pith

Projecting motion captions onto only their recoverable content before contrastive training produces tighter embeddings and sets new state-of-the-art retrieval results on HumanML3D and KIT-ML.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Motion captions mix reliable details such as action type, body parts, and direction with annotator style and inferred context that cannot be recovered from 3D joint data. Standard contrastive training treats every caption as the single exact match, which spreads embeddings for the same motion and weakens alignment. MoCHA first projects each caption down to its motion-recoverable subset, then applies the usual contrastive objective to these cleaned texts. The result is lower within-motion variance, better-separated embeddings, and substantially higher retrieval accuracy. The same preprocessing also improves transfer when a model trained on one dataset is tested on another.

Core claim

MoCHA is a text canonicalization framework that reduces caption variance by projecting each caption onto its motion-recoverable content prior to encoding. Canonicalization can be performed by an LLM or by a distilled FlanT5 model that requires no LLM at inference time. When the resulting texts are used for contrastive training, within-motion text-embedding variance drops 11-19 percent, cross-dataset transfer improves markedly, and retrieval metrics reach new highs on both HumanML3D and KIT-ML.
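As a rough illustration of the contrastive objective that the canonicalized texts feed into, the sketch below implements a CLIP-style symmetric InfoNCE loss over a text-motion similarity matrix. This page does not say which exact loss the paper uses, so the symmetric form, the function names, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over an N x N similarity matrix.

    sim[i][j] is the similarity between text i and motion j; the diagonal
    holds the matched (positive) pairs. The temperature is an assumed value.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Text-to-motion: each row should select its diagonal entry.
    t2m = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Motion-to-text: the same computation over columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    m2t = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (t2m + m2t)

# Toy matrices (invented numbers): positives that sit closer to their
# motions, as canonicalization is claimed to produce, give a lower loss.
noisy = [[0.6, 0.3], [0.3, 0.6]]
clean = [[0.9, 0.3], [0.3, 0.9]]
loss_noisy = symmetric_info_nce(noisy)
loss_clean = symmetric_info_nce(clean)
```

The toy comparison only shows the direction of the effect; it says nothing about the size of the gains reported in the paper.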

What carries the argument

MoCHA text canonicalizer, which projects each caption onto its motion-recoverable content using either an LLM or a distilled language model.

If this is right

Within-motion text-embedding variance drops by 11-19 percent.
Cross-dataset transfer rises substantially, with HumanML3D-to-KIT-ML improving 94 percent and KIT-ML-to-HumanML3D improving 52 percent.
The LLM variant reaches 13.9 percent text-to-motion R@1 on HumanML3D (+3.1 points) and 24.3 percent on KIT-ML (+10.3 points).
The distilled T5 variant delivers +2.5 points on HumanML3D and +8.1 points on KIT-ML without any LLM at inference.
The method functions as a preprocessing step compatible with any existing retrieval architecture.
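The retrieval figures above mix absolute scores with point gains, so it can help to back out the reference scores they imply. The sketch below does that arithmetic; it assumes each gain is measured against a single reference score on the same metric, and this page does not say whether that reference is the unmodified MoPa model or the prior state of the art. The function names are illustrative.

```python
def implied_baseline(score_pct, gain_pp):
    """Reference score implied by a reported score and an absolute gain in points."""
    return score_pct - gain_pp

def relative_gain_pct(score_pct, gain_pp):
    """Relative improvement over the implied reference score, in percent."""
    return 100.0 * gain_pp / implied_baseline(score_pct, gain_pp)

# (reported T2M R@1, reported gain in percentage points) for the LLM variant
reported = {"HumanML3D": (13.9, 3.1), "KIT-ML": (24.3, 10.3)}
for name, (score, gain) in reported.items():
    base = implied_baseline(score, gain)
    rel = relative_gain_pct(score, gain)
    print(f"{name}: {base:.1f}% -> {score:.1f}% (+{gain:.1f} pp, about {rel:.0f}% relative)")
```

The implied reference scores are 10.8% on HumanML3D and 14.0% on KIT-ML. The cross-dataset figures (94 and 52 percent) are already relative, and no absolute scores are given here, so they cannot be converted the same way.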

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same canonicalization principle could be tested on video-text or audio-text pairs where labels contain details invisible to the sensor.
Applying MoCHA-style cleaning to synthetic captions generated by large models might further reduce training noise in motion-language tasks.
The gains suggest that standardizing the language space is a general lever for building more transferable multimodal representations.

Load-bearing premise

The motion-recoverable subset of each caption can be reliably identified by an LLM or distilled model without introducing new biases or losing critical action semantics.

What would settle it

A controlled test in which human experts manually produce motion-recoverable versions of the same captions and the resulting retrieval metrics show no gain over raw captions would falsify the claim that automatic canonicalization is responsible for the observed improvements.

Figures

Figures reproduced from arXiv: 2603.23684 by Apaar Sadhwani, Cameron Ethan Taylor, Irfan Essa, Nikolai Warner.

**Figure 2.** MoCHA overview. (a) Motivated by the (s, a) decomposition (Section 3.1), C(·) projects each caption onto s by stripping stylistic variation a (red). C is implemented via LLM and distilled into FlanT5 for LLM-free inference. (b) Blend training balances both views: the denoised C(ti) anchors embeddings around s to reduce gradient variance, while the original ti regularizes for natural-language queries. Mot…

**Figure 3.** Canonicalization projects captions onto s, improving retrieval. Top row (colored): ground truth; bottom row (gray): baseline rank-1 error. (a) Verbose a buries the action; MoCHA extracts s while preserving the metaphor. (b) Annotator uncertainty (a); canonicalization extracts shared kinematic content. (c) Complex description decomposed into sequential s, disambiguating from similar motions. (d) Overspeci…
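The blend training mentioned in the Figure 2 caption can be pictured as a per-sample choice between the canonical caption C(ti) and the original ti. The sketch below is one possible reading only: the caption is truncated here and does not say how the two views are mixed, so the sampling scheme, the `p_canonical` parameter, and the example captions are all hypothetical. The paper may instead combine two loss terms.

```python
import random

def blend_caption(original, canonical, p_canonical=0.5, rng=random):
    """Return the canonical caption with probability p_canonical, else the original."""
    return canonical if rng.random() < p_canonical else original

# Invented caption pairs: (original annotator text, canonicalized text).
pairs = [
    ("a person seems to happily stroll ahead", "walk forward"),
    ("someone maybe waves hello with the right hand", "wave right hand"),
]

rng = random.Random(0)  # seeded so the toy batch is repeatable
batch = [blend_caption(o, c, p_canonical=0.5, rng=rng) for o, c in pairs]
```

Setting `p_canonical` to 1.0 trains on canonical text only, and 0.0 recovers ordinary training on raw captions.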

Original abstract

Text-motion retrieval systems learn shared embedding spaces from motion-caption pairs via contrastive objectives. However, each caption is not a deterministic label but a sample from a distribution of valid descriptions: different annotators produce different text for the same motion, mixing motion-recoverable semantics (action type, body parts, directionality) with annotator-specific style and inferred context that cannot be determined from 3D joint coordinates alone. Standard contrastive training treats each caption as the single positive target, overlooking this distributional structure and inducing within-motion embedding variance that weakens alignment. We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding, producing tighter positive clusters and better-separated embeddings. Canonicalization is a general principle: even deterministic rule-based methods improve cross-dataset transfer, though learned canonicalizers provide substantially larger gains. We present two learned variants: an LLM-based approach (GPT-5.2) and a distilled FlanT5 model requiring no LLM at inference time. MoCHA operates as a preprocessing step compatible with any retrieval architecture. Applied to MoPa (MotionPatches), MoCHA sets a new state of the art on both HumanML3D (H) and KIT-ML (K): the LLM variant achieves 13.9% T2M R@1 on H (+3.1pp) and 24.3% on K (+10.3pp), while the LLM-free T5 variant achieves gains of +2.5pp and +8.1pp. Canonicalization reduces within-motion text-embedding variance by 11-19% and improves cross-dataset transfer substantially, with H to K improving by 94% and K to H by 52%, demonstrating that standardizing the language space yields more transferable motion-language representations.
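For readers unfamiliar with the T2M R@1 metric quoted in the abstract, the following sketch computes Recall@K from a text-to-motion similarity matrix. It assumes one correct motion per query, located on the diagonal; benchmark protocols for HumanML3D and KIT-ML may differ (for example, by also accepting motions with near-duplicate captions), so treat this as a simplified version rather than the paper's evaluation code.

```python
def recall_at_k(sim, k=1):
    """Percent of text queries whose matched motion ranks in the top k.

    sim[i][j] is the similarity of text query i to motion j; the correct
    motion for query i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of motions scoring strictly higher than the true one.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)

# Invented similarities for three queries and three motions.
sim = [
    [0.9, 0.2, 0.1],  # query 0: correct motion ranked first
    [0.7, 0.4, 0.3],  # query 1: correct motion ranked second
    [0.1, 0.2, 0.8],  # query 2: correct motion ranked first
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```

On this toy matrix two of the three queries are hits at K=1 and all three are hits at K=2.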

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MoCHA's caption canonicalization step delivers clear retrieval gains on HumanML3D and KIT-ML by reducing annotator noise, but the reliability of the LLM or distilled model for picking motion-recoverable content is not well checked.


The paper introduces a preprocessing step that rewrites each motion caption to keep only the parts recoverable from the 3D joints, dropping annotator style and inferred context. This produces tighter positive pairs for contrastive training and yields measurable improvements without changing the base retrieval model. On HumanML3D the LLM variant reaches 13.9% T2M R@1 (+3.1pp), and on KIT-ML it reaches 24.3% (+10.3pp); the distilled T5 variant gives smaller but still positive lifts (+2.5pp and +8.1pp). The authors also report 11-19% drops in within-motion text-embedding variance and large gains in cross-dataset transfer (H to K up 94%).

The core idea is simple and is presented as a general principle: even rule-based canonicalization improves cross-dataset transfer, while learned versions help substantially more. The two learned variants (GPT-5.2 and FlanT5) make the method usable at scale, since the distilled model needs no LLM calls at inference. The approach is new relative to the motion-text retrieval papers the authors cite, which treat captions as fixed labels. The results look reproducible from the numbers given, and the method is easy to add to existing pipelines.

The main soft spot is that the paper does not show direct evidence that the canonicalizer is accurate. There is no human validation, no comparison against motion-derived ground truth, and no ablation on what gets dropped or added. An LLM could quietly remove subtle but recoverable cues or insert plausible but non-visual details, and the reported gains might partly reflect simpler text rather than true denoising. The abstract gives variance numbers but no error bars and no details on how the motion-recoverable subset was decided.

This work is for people already working on motion-text retrieval or noisy multimodal contrastive learning. A reader who needs practical ways to clean caption data will get usable numbers and a clear recipe. It deserves a serious referee because the benchmark lifts are concrete and the preprocessing is lightweight; reviewers will likely press for fidelity checks on the canonicalizer, but the current evidence is strong enough to warrant that discussion rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes MoCHA, a preprocessing framework for motion-text retrieval that canonicalizes each caption to its motion-recoverable semantics (action type, body parts, directionality) via an LLM (GPT-5.2) or distilled FlanT5 model before contrastive training. This is intended to reduce within-motion embedding variance induced by annotator style and unrecoverable context. Applied to MoPa, the method reports new SOTA results on HumanML3D (13.9% T2M R@1, +3.1pp) and KIT-ML (24.3%, +10.3pp for LLM variant; smaller gains for T5), plus 11-19% variance reduction and large cross-dataset transfer gains (H→K +94%).

Significance. If the canonicalization step reliably isolates only recoverable content without introducing new biases or omissions, the approach offers a general, architecture-agnostic way to improve alignment under noisy supervision in motion-language tasks. The reported variance reduction and transfer improvements would then indicate that standardizing the language space yields more robust embeddings than treating raw captions as deterministic positives.

major comments (2)

[Abstract / Method] Abstract and method description of canonicalization: no recoverability metric, human validation, or comparison against motion-derived ground truth is provided to confirm that the LLM (or distilled T5) isolates only motion-recoverable semantics while preserving critical details such as limb sequencing. This is load-bearing for the central claim, as the SOTA gains and variance reduction could otherwise stem from incidental text regularization rather than true denoising.
[Experiments] Experimental results on cross-dataset transfer (H→K +94%, K→H +52%): the improvements are attributed to canonicalization, but no ablation isolates the contribution of the canonicalizer versus other factors such as dataset-specific caption distributions or the base MoPa architecture; without this, the transfer claim cannot be verified as arising from the proposed denoising.

minor comments (2)

Clarify the precise computation of 'within-motion text-embedding variance' (e.g., which embedding model and aggregation is used) and report it with error bars or multiple runs.
The abstract states 'GPT-5.2' without specifying the exact model checkpoint or prompting details; add these to the method section for reproducibility.
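To make the first minor comment concrete, here is one plausible definition of within-motion text-embedding variance: the mean squared distance of each caption embedding to the centroid of all captions for the same motion, averaged over motions. The paper's actual definition is exactly what the referee asks the authors to specify, so this formula, the function name, and the toy vectors are assumptions.

```python
from collections import defaultdict

def within_motion_variance(embeddings, motion_ids):
    """Mean squared distance of caption embeddings to their per-motion centroid."""
    groups = defaultdict(list)
    for vec, mid in zip(embeddings, motion_ids):
        groups[mid].append(vec)
    per_motion = []
    for vecs in groups.values():
        if len(vecs) < 2:
            continue  # a single caption has no spread to measure
        dim = len(vecs[0])
        centroid = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
        sq = [sum((v[d] - centroid[d]) ** 2 for d in range(dim)) for v in vecs]
        per_motion.append(sum(sq) / len(vecs))
    if not per_motion:
        raise ValueError("need at least one motion with two or more captions")
    return sum(per_motion) / len(per_motion)

# Invented 2-D embeddings: two motions with two captions each.
ids = ["m1", "m1", "m2", "m2"]
raw = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.8]]
canon = [[0.8, 0.2], [0.6, 0.4], [1.0, 1.0], [1.0, 0.9]]
var_raw = within_motion_variance(raw, ids)
var_canon = within_motion_variance(canon, ids)
reduction = 1.0 - var_canon / var_raw  # fraction of variance removed
```

Reporting this quantity over several training runs, with the embedding model named, would address the error-bar request in the same comment.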

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, providing the strongest honest defense of the manuscript while agreeing to strengthen the evidence for recoverability and transfer attribution through targeted additions.


Referee: [Abstract / Method] Abstract and method description of canonicalization: no recoverability metric, human validation, or comparison against motion-derived ground truth is provided to confirm that the LLM (or distilled T5) isolates only motion-recoverable semantics while preserving critical details such as limb sequencing. This is load-bearing for the central claim, as the SOTA gains and variance reduction could otherwise stem from incidental text regularization rather than true denoising.

Authors: We agree that direct validation strengthens the central claim. The manuscript already quantifies denoising success via the 11-19% reduction in within-motion embedding variance, which measures the removal of annotator-specific style while retaining recoverable semantics (action type, body parts, directionality). The consistent SOTA gains on HumanML3D and KIT-ML, plus the large transfer improvements, provide indirect evidence that critical details such as limb sequencing are preserved; omitting them would degrade rather than improve retrieval. To address the request explicitly, we will add a human validation study in the revision: annotators will rate recoverability of canonicalized captions from motion clips and compare agreement against original captions and motion-derived pseudo-ground-truth descriptions. This new experiment will be reported in Section 4.
Revision: yes
Referee: [Experiments] Experimental results on cross-dataset transfer (H→K +94%, K→H +52%): the improvements are attributed to canonicalization, but no ablation isolates the contribution of the canonicalizer versus other factors such as dataset-specific caption distributions or the base MoPa architecture; without this, the transfer claim cannot be verified as arising from the proposed denoising.

Authors: The base MoPa results (without canonicalization) already serve as the primary control, showing substantially smaller transfer performance than MoCHA. The relative gains (+94% H→K, +52% K→H) are therefore attributable to the canonicalization step rather than architecture or raw dataset distributions alone. To further isolate the effect, we will add an ablation in the revised experiments section that includes: (i) raw captions, (ii) MoCHA-LLM, (iii) MoCHA-T5, and (iv) a non-semantic text regularization baseline (e.g., lower-casing plus synonym replacement). This will confirm that gains arise specifically from motion-recoverable semantic standardization.
Revision: yes
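The non-semantic regularization baseline named in item (iv) of the rebuttal could look something like the sketch below: lower-casing, punctuation stripping, and a fixed synonym map. The synonym table, the function name, and the example caption are invented for illustration; the rebuttal does not specify how the authors would build this baseline.

```python
import re

# Hypothetical synonym table; a real ablation would need a curated one.
SYNONYMS = {"person": "human", "man": "human", "someone": "human", "quickly": "fast"}

def regularize_text(caption, synonyms=SYNONYMS):
    """Lower-case, strip punctuation, and map words through a fixed synonym table."""
    tokens = re.findall(r"[a-z0-9']+", caption.lower())
    return " ".join(synonyms.get(tok, tok) for tok in tokens)

example = regularize_text("A Man walks quickly, then stops.")
# example == "a human walks fast then stops"
```

Because this transform never decides which content is recoverable from motion, a gap between it and MoCHA in the ablation would separate surface normalization from the semantic filtering the paper claims.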

Circularity Check

0 steps flagged

No circularity: preprocessing step evaluated on held-out metrics


The paper defines MoCHA as a preprocessing canonicalization that projects captions onto motion-recoverable semantics (action type, body parts, directionality) using an LLM or distilled FlanT5 model before contrastive training. No equations are presented that reduce the reported R@1 gains (+3.1pp on HumanML3D, +10.3pp on KIT-ML) or variance reductions (11-19%) to quantities fitted inside the same experiment. The method is explicitly a general preprocessing step compatible with any retrieval architecture, with benefits measured on standard held-out cross-dataset transfer and retrieval metrics. No self-citations, uniqueness theorems, or ansatzes are invoked to force the result; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that an external model (LLM or distilled T5) can accurately isolate motion-recoverable semantics; no free parameters are explicitly fitted in the abstract, but the choice of which LLM and the distillation process introduce unstated hyperparameters.

axioms (1)

domain assumption: Captions contain a separable subset of semantics that are recoverable from 3D joint coordinates alone.
Invoked in the first paragraph to justify the canonicalization step.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "Canonicalization reduces within-motion text-embedding variance by 11–19%"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

[1] Petrovich, M., Black, M.J., Varol, G.: TMR: Text-to-motion retrieval using contrastive 3D human motion synthesis. In: ICCV (2023)
[2] Guo, C., Zou, S., Zuo, X., Wang, S., Ji, T., Li, X., Cheng, L.: Generating diverse and natural 3D human motions from text. In: CVPR (2022)
[3] Plappert, M., Mandery, C., Asfour, T.: The KIT motion-language dataset. Big Data 4(4), 236–252 (2016)
[4] Punnakkal, A., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, A., Black, M.J.: BABEL: Bodies, action and behavior with English labels. In: CVPR (2021)
[5] Bensabath, A., Petrovich, M., Varol, G.: Cross-dataset motion retrieval via training with rewritten texts (2024)
[6] Zhu, Y., Siyao, L., Li, Z., et al.: Exploring vision transformers for 3D human motion-language models with motion patches. In: CVPR (2024)
[7] Chung, H.W., Hou, L., Longpre, S., et al.: Scaling instruction-finetuned language models. JMLR (2024)
[8] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: EMNLP (2019)
[9] Ji, S., Pan, S., Cambria, E., Marttinen, P., Yu, P.S.: A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Trans. Neural Networks and Learning Systems 33(2), 494–514 (2022)
[10] Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: Archive of motion capture as a surface model. In: ICCV (2019)
[11] Ghadimi, S., Lan, G.: Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4), 2341–2368 (2013)
[12] Petrovich, M., Black, M.J., Varol, G.: TEMOS: Generating diverse human motions from textual descriptions. In: ECCV (2022)
[13] Tevet, G., Gordon, B., Hertz, A., Bermano, A.H., Cohen-Or, D.: MotionCLIP: Exposing human motion generation to CLIP space. In: ECCV (2022)
[14] Radford, A., Kim, J.W., Hallacy, C., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
[15] Li, Z., Yuan, W., He, Y., et al.: LaMP: Language-motion pretraining for motion generation, retrieval, and captioning. In: ICLR (2025)
[16] Xu, D., Zheng, T., Zhang, Y., Yang, X., Fu, W.: MTR-MSE: Motion-text retrieval method based on motion semantics expansion. Neurocomputing 648, 130632 (2025)
[17] Yin, K., Zou, S., Ge, Y., Tian, Z.: Tri-modal motion retrieval by learning a joint embedding space. In: CVPR (2024)
[18] Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. In: ICLR (2023)
[19] Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., Yu, G.: Executing your commands via motion diffusion in latent space. In: CVPR (2023)
[20] Zhang, J., Zhang, Y., Cun, X., Huang, S., Zhang, Y., Zhao, H., Lu, H., Shen, X.: T2M-GPT: Generating human motion from textual descriptions with discrete representations. In: CVPR (2023)
[21] Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: MotionGPT: Human motion as a foreign language. In: NeurIPS (2023)
[22] Guo, C., Mu, Y., Javed, M.G., Wang, S., Cheng, L.: MoMask: Generative masked modeling of 3D human motions. In: CVPR (2024)
[23] Petrovich, M., Black, M.J., Varol, G.: Action-conditioned 3D human motion synthesis with transformer VAE. In: ICCV (2021)
[24] Guo, C., Zuo, X., Wang, S., Cheng, L.: TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts. In: ECCV (2022)
[25] Lexicon-augmented motion retrieval. (2025)

Appendices

A Our Baseline Ablations: Baseline ablations (architecture, temperature, self-similarity threshold) are consolidated into the main paper (Tables 3 and 6) for readability.
B LLM Ceiling: Best-Case with LLM at Train and Test: The LLM ceiling (LLM canonicalization at both train and test) ...

verb [object] [limb] [direction]→next action

[26] Expand atomic labels into descriptive canonical forms matching the style above
[27] Use the arrow notation for multi-step actions: "action1 -> action2"
[28] Add plausible spatial details when naturally implied by the action
[29] Keep it concise: add only what's naturally implied. Common expansions: "walk"→"walk forward", "stand"→"stand in place", "t pose"→"stand with arms extended horizontally", "transition"→"transition between poses"

F Implementation Details
G Additional Ablations and Analysis
G.1 Canonicalization Strategy Comparison: MoCHA Blend achieves the best balance of ga...
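The label-expansion rules quoted in entries [26] to [29] read like instructions given to a model, but a deterministic stand-in is easy to sketch. The lookup table below uses only the four expansions quoted there and the "action1 -> action2" arrow notation; the function name and the fallback of passing unknown labels through unchanged are assumptions, and this is not the paper's canonicalizer.

```python
# Expansions quoted in the extracted appendix text.
COMMON_EXPANSIONS = {
    "walk": "walk forward",
    "stand": "stand in place",
    "t pose": "stand with arms extended horizontally",
    "transition": "transition between poses",
}

def expand_labels(labels, expansions=COMMON_EXPANSIONS):
    """Expand atomic action labels and join multi-step actions with ' -> '."""
    cleaned = [label.strip().lower() for label in labels]
    # Unknown labels pass through unchanged (an assumption of this sketch).
    return " -> ".join(expansions.get(label, label) for label in cleaned)

single = expand_labels(["walk"])
multi = expand_labels(["Stand", "T pose"])
# single == "walk forward"
# multi == "stand in place -> stand with arms extended horizontally"
```

A table like this covers only labels it has seen, which is consistent with the abstract's remark that rule-based canonicalization helps less than learned canonicalizers.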
