Encoder-Free Human Motion Understanding via Structured Motion Descriptions

Thomas Ploetz; Yao Zhang; Yu Xiao; Zhuchenyang Liu

arxiv: 2604.21668 · v2 · pith:34YIT4ABnew · submitted 2026-04-23 · 💻 cs.CV

Yao Zhang, Zhuchenyang Liu, Thomas Ploetz, Yu Xiao

Pith reviewed 2026-05-09 22:00 UTC · model grok-4.3

keywords: human motion understanding · structured motion descriptions · large language models · motion captioning · motion question answering · encoder-free · rule-based conversion · kinematic text

The pith

Converting motion sequences to structured text lets LLMs reason about human movement without any encoders or alignment modules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Structured Motion Description, a fixed set of rules that turns sequences of joint positions into natural language accounts of joint angles, body-part movements, and overall trajectories. These readable descriptions let large language models apply their existing knowledge of anatomy, directions, and semantics to motion tasks directly. The approach reaches leading results on motion question answering and captioning benchmarks while requiring only lightweight adaptation when switched between different language models. Because the input is plain text, the models' attention patterns over specific motion elements become directly observable. The method avoids training any separate motion encoder or cross-modal projector.

Core claim

The central claim is that a deterministic, rule-based translation of joint-position sequences into structured natural-language descriptions of kinematics and trajectories allows pretrained large language models to perform motion question answering and captioning at state-of-the-art levels by drawing on their built-in knowledge of body parts, spatial relations, and movement semantics, without any learned motion encoders or alignment layers.

What carries the argument

Structured Motion Description (SMD), a deterministic rule set that converts raw joint-position sequences into readable text detailing joint angles, body-part kinematics, and global trajectory.
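
To make the idea of a rule-based conversion concrete, here is a minimal Python sketch of a single rule. It is not the paper's rule set: the function names (`joint_angle_deg`, `describe_elbow`), the 10-degree threshold, the 20 fps default, and the output wording are all assumptions chosen for illustration. The sketch computes the angle at a joint from three 3D positions and turns the change over a clip into one line of text.

```python
import math


def joint_angle_deg(a, b, c):
    """Angle at joint b, in degrees, formed by the 3D points a-b-c.

    Assumes both bones have non-zero length.
    """
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos_t = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_t))


def describe_elbow(frames, fps=20):
    """Turn (shoulder, elbow, wrist) frames into one descriptive line.

    The threshold and phrasing are hypothetical, not taken from the paper.
    """
    angles = [joint_angle_deg(s, e, w) for s, e, w in frames]
    start, end = angles[0], angles[-1]
    delta = end - start
    if abs(delta) < 10.0:  # hypothetical "no meaningful change" threshold
        verb = "holds steady"
    elif delta < 0:
        verb = "flexes"  # the angle at the elbow shrinks
    else:
        verb = "extends"  # the angle at the elbow grows
    duration = (len(frames) - 1) / fps
    return f"right elbow {verb}: {start:.0f} deg -> {end:.0f} deg over {duration:.1f}s"
```

Because every step is a fixed computation, the same input always yields the same sentence, which is the property the deterministic framing relies on.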

If this is right

LLMs can directly apply their pretrained knowledge of body parts and spatial directions to motion data once it is expressed as text.
The identical text representation works across multiple language-model families after only light parameter-efficient adaptation.
Attention weights inside the language model become interpretable with respect to specific kinematic elements such as particular joint angles or directional movements.
The same pipeline produces both quantitative gains on benchmarks and human-readable motion representations for inspection.
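
The interpretability point above can be illustrated with a small Python sketch. It assumes, hypothetically, that each input token has already been tagged with the kinematic element it belongs to and that one attention weight per token is available; how the paper extracts or aggregates attention is not stated here, and `attention_by_element` is an invented helper name.

```python
from collections import defaultdict


def attention_by_element(token_labels, token_weights):
    """Pool per-token attention into per-element shares that sum to 1.

    token_labels:  one label per token, e.g. "right elbow" (hypothetical tags)
    token_weights: one non-negative attention weight per token
    """
    if len(token_labels) != len(token_weights):
        raise ValueError("labels and weights must have the same length")
    totals = defaultdict(float)
    for label, weight in zip(token_labels, token_weights):
        totals[label] += weight
    norm = sum(totals.values())
    if norm == 0:
        raise ValueError("attention weights sum to zero")
    return {label: weight / norm for label, weight in totals.items()}
```

Pooling by element rather than by token is a design choice: it gives a score per readable unit of the description, which is what makes the result inspectable by a person.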

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The text format could allow motion data to be mixed with other textual context such as scene descriptions or instructions inside a single language-model prompt.
Similar rule-based text conversions might be tested on other time-series physical data such as robot joint readings or animal tracking.
Because the descriptions are human-readable, they could support creation of new motion datasets through direct editing or annotation by non-experts.

Load-bearing premise

The rule-based conversion from joint positions to text descriptions must retain every piece of information that matters for accurate motion reasoning without introducing systematic gaps or distortions.

What would settle it

Running the same motion question-answering and captioning evaluations with an LLM that receives only the structured text descriptions and comparing the scores against an otherwise identical encoder-based baseline would directly test whether the text-only route actually outperforms encoder methods.

Figures

Figures reproduced from arXiv: 2604.21668 by Thomas Ploetz, Yao Zhang, Yu Xiao, Zhuchenyang Liu.

**Figure 2.** Overview of our approach. (a) Stage 1 (top): a deterministic, rule-based pipeline processes …

**Figure 3.** Prompt structure for (a) motion QA and (b) motion captioning.

**Figure 4.** Zero-shot captioning examples with motion visualizations and ground-truth captions. (a) …

**Figure 5.** Attention heatmaps for two captioning examples with their corresponding motion visual…

**Figure 6.** Illustration of the body-local and joint-local coordinate frames used for joint angle compu…

**Figure 7.** Complete All-26 SMD for motion 014160 (“a person waves with his right hand,” 4.1s).

Original abstract

The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, remaining constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose **Structured Motion Description (SMD)**, a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach goes beyond state-of-the-art results on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D), surpassing all prior methods. SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at https://yaozhang182.github.io/motion-smd/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper replaces motion encoders with fixed biomechanical rules that turn joint data into readable text so LLMs can do motion QA and captioning directly, and it reports clear gains on standard benchmarks.

The main thing to know is that the authors have a simple way to feed human motion data to LLMs without any special motion encoder. They use fixed rules drawn from biomechanics to turn sequences of joint positions into structured sentences describing joint angles, how each body part is moving, and overall paths. Then the LLM just reads the text.

This stands out because most other LLM approaches for motion still train some alignment layer or encoder to map motion features into the language space. Here they avoid that entirely. The reported results are better than previous work on the standard benchmarks for motion question answering and captioning. They also show it works with a bunch of different LLMs using only light adaptation, and the text format lets them inspect what the model is paying attention to in the description.

Credit where due: the idea is clean, the implementation is released, and the performance numbers are concrete. If the conversion really keeps the important motion details, this could be a useful shortcut for applying language models to physical reasoning.

The potential issue is whether those rules actually preserve everything needed. Motion has continuous elements like how fast something moves or how joints coordinate over time, and turning that into discrete phrases might lose some of it. The paper would be stronger with more on how they chose the rules, tests showing the descriptions match the original motion well, and breakdowns of where errors happen. The abstract is light on those points.

This is aimed at the motion understanding community in computer vision, especially folks interested in using LLMs for it. It deserves to go to peer review so the details can be checked and the community can see if the gains hold up under scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper proposes Structured Motion Description (SMD), a deterministic rule-based preprocessor that converts 3D joint position sequences into structured natural-language text describing joint angles, body-part kinematics, and global trajectories. This text representation is fed directly to LLMs (with lightweight LoRA adaptation) for motion question answering and captioning, bypassing learned motion encoders and cross-modal alignment modules. The authors report new state-of-the-art numbers: 66.7% on BABEL-QA, 90.1% on HuMMan-QA, and R@1=0.584 / CIDEr=53.16 on HumanML3D captioning, while demonstrating compatibility across eight LLMs from six families and providing open code, data, and adapters.

Significance. If the central claim holds, the work is significant: it shows that explicit, human-readable motion descriptions can leverage pretrained LLM knowledge of body semantics and spatial reasoning more effectively than implicit encoder-based alignments, potentially lowering the barrier to motion understanding and improving interpretability via attention over text. The public release of code, data, and LoRA adapters is a clear strength that supports reproducibility and extension.

major comments (2)

[§3] SMD generation: The manuscript presents the rule-based conversion from joint positions to structured text but provides no quantitative validation of information preservation, such as reconstruction fidelity, human-rated description accuracy, or error analysis on omitted continuous aspects (velocity profiles, acceleration, inter-joint coordination). This validation is load-bearing for the claim that SMD introduces no systematic omissions that would degrade LLM reasoning relative to encoder baselines.
[§5] Experimental results: The SOTA claims on BABEL-QA and HuMMan-QA are reported without ablations that isolate the contribution of individual SMD components (joint angles vs. trajectories vs. body-part labels) or controlled comparisons that hold the LLM and prompting fixed while varying the motion representation. Without these, it is difficult to confirm that performance gains derive from the encoder-free text approach rather than other factors.
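
The information-preservation concern in the first major comment can be made concrete with a toy round-trip measurement in Python. The sketch assumes, purely for illustration, that a rule reports joint angles in fixed-width bins and that a reader reconstructs each angle as the bin center; the paper's actual rules may report values differently, and the helper names below are invented.

```python
def angle_to_bin(angle_deg, bin_width=30.0):
    """Index of the fixed-width bin containing the angle (hypothetical rule)."""
    return int(angle_deg // bin_width)


def bin_to_angle(bin_index, bin_width=30.0):
    """Best-guess reconstruction: the center of the bin."""
    return (bin_index + 0.5) * bin_width


def round_trip_mae(angles, bin_width=30.0):
    """Mean absolute error, in degrees, after binning and reconstructing."""
    errors = [
        abs(a - bin_to_angle(angle_to_bin(a, bin_width), bin_width))
        for a in angles
    ]
    return sum(errors) / len(errors)
```

Under this assumed scheme the error shrinks as the bins get finer, so a reported reconstruction error would show how much detail a given level of description granularity discards.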

minor comments (2)

The abstract states results across eight LLMs from six families, yet the main text would benefit from a compact table listing the specific models, adaptation settings, and per-model performance to make the cross-LLM claim easier to assess.
[§6] The interpretability section (attention analysis over motion descriptions) is a useful addition; adding a few more qualitative examples with highlighted attention weights would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important opportunities to strengthen the validation of SMD and the experimental analysis. We address each point below and will incorporate the suggested additions in the revised manuscript.

Referee: [§3] SMD generation: The manuscript presents the rule-based conversion from joint positions to structured text but provides no quantitative validation of information preservation, such as reconstruction fidelity, human-rated description accuracy, or error analysis on omitted continuous aspects (velocity profiles, acceleration, inter-joint coordination). This validation is load-bearing for the claim that SMD introduces no systematic omissions that would degrade LLM reasoning relative to encoder baselines.

Authors: We agree that quantitative validation of information preservation in the SMD conversion would strengthen the manuscript. In the revised version, we will add to Section 3 and a new appendix: (1) a reconstruction experiment measuring approximate joint-position recovery error from the generated text descriptions, (2) a human evaluation study on a sampled subset rating description accuracy and completeness, and (3) targeted error analysis on omitted continuous aspects such as velocity and acceleration by comparing task performance with and without explicit velocity mentions in the SMD. These additions will directly demonstrate the degree of information loss relative to the original motion sequences.
revision: yes

Referee: [§5] Experimental results: The SOTA claims on BABEL-QA and HuMMan-QA are reported without ablations that isolate the contribution of individual SMD components (joint angles vs. trajectories vs. body-part labels) or controlled comparisons that hold the LLM and prompting fixed while varying the motion representation. Without these, it is difficult to confirm that performance gains derive from the encoder-free text approach rather than other factors.

Authors: We acknowledge that isolating the contributions of individual SMD components and performing controlled comparisons would make the source of the gains clearer. In the revision, we will add ablation experiments in Section 5 that fix both the LLM backbone and the prompting template while systematically removing joint-angle, trajectory, or body-part-label components from the SMD input. We will also include a controlled baseline that converts raw joint positions into unstructured text (holding all other factors fixed) to directly compare against the structured SMD representation. These results will be reported alongside the existing SOTA numbers.
revision: yes
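
The component ablation described in this response could be set up along the following lines. This Python sketch assumes a hypothetical layout in which a description is stored as named sections (`joint_angles`, `body_parts`, `trajectory`); the section names, example sentences, and helper functions are illustrative and not taken from the paper.

```python
def ablate(sections, drop):
    """Join the description sections, skipping any whose name is in `drop`."""
    return "\n".join(
        text for name, text in sections.items() if name not in drop
    )


def ablation_variants(sections):
    """Build the full input plus one variant per removed component."""
    variants = {"full": ablate(sections, set())}
    for name in sections:
        variants[f"no_{name}"] = ablate(sections, {name})
    return variants


# Hypothetical example description, for illustration only.
example = {
    "joint_angles": "right elbow flexes from 180 to 90 degrees",
    "body_parts": "right hand moves upward",
    "trajectory": "body stays in place",
}
```

Each variant would then be fed to the same LLM with the same prompt template, so that any score difference can be attributed to the removed component.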

Circularity Check

0 steps flagged

No circularity; rule-based preprocessor is independent of LLM and results

The paper defines SMD as a fixed, deterministic, rule-based conversion from joint position sequences to structured text descriptions of angles, kinematics, and trajectories. This preprocessor operates without reference to LLM weights, training data, or fitted parameters. Reported results (66.7% BABEL-QA, 90.1% HuMMan-QA, R@1 0.584 on HumanML3D) are obtained by feeding the generated text into off-the-shelf LLMs with only lightweight LoRA adaptation. No equation, claim, or performance number reduces to a self-definition, a fitted input renamed as prediction, or a load-bearing self-citation chain. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that fixed biomechanical rules can faithfully translate raw joint data into semantically complete text; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Joint position sequences can be deterministically and losslessly converted into structured natural language descriptions of angles, body-part kinematics, and trajectories using fixed rules.
This premise is invoked when the abstract states that SMD 'converts joint position sequences into structured natural language descriptions' without learned components.
