
arxiv: 2606.05624 · v1 · pith:XIFONVQQnew · submitted 2026-06-04 · 💻 cs.CV · cs.GR

KV-Control: Parameter-Efficient K/V Injection for Trajectory-Controlled Text-to-Motion

Pith reviewed 2026-06-28 02:33 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords text-to-motion generation · trajectory control · parameter-efficient adaptation · key-value injection · self-attention conditioning · part-tokenized motion · frozen transformer control

The pith

KV-Control injects control keys and values into self-attention layers of frozen text-to-motion transformers to achieve precise trajectory tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that trajectory constraints can be added to pretrained text-to-motion models through a lightweight adapter that supplies geometric information as key-value memories inside every self-attention layer. This design avoids duplicating large model sections or relying on test-time optimization by co-designing a part-tokenized motion representation that exposes addressable sites for the injection. The adapter trains only the injection parameters on top of a shared trajectory encoder while the original query stream, text cross-attention, and backbone weights remain frozen. If the approach holds, models can follow root paths and multi-joint targets to sub-centimeter accuracy without losing the gait and style specified by language prompts. The work reframes conditioning as memory retrieval rather than global token or output-side enforcement.

Core claim

KV-Control supplies control-conditioned key/value memories at every self-attention layer of a frozen masked text-to-motion transformer, preserving the pretrained query stream, text cross-attention, FFN, and all backbone weights. A co-designed PartVQ and T-Concat substrate turns each frame-part token into an attention-addressable site. The resulting adapter adds only trainable injection parameters atop a shared trajectory encoder, yet tracks root and multi-joint constraints to sub-centimeter accuracy under the inherited refinement protocol while retaining text-conditioned motion quality.

What carries the argument

KV-Control, the injection of control-conditioned key/value memories into self-attention layers that leaves the query stream and pretrained weights untouched.
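
A minimal NumPy sketch of this idea, assuming a single attention head and assuming the control memories are appended to the keys and values as extra entries (the Figure 2 caption says K and V are augmented while Q is unchanged, but the paper's exact equations are not reproduced here). All names, shapes, and projection matrices are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv, ctrl_k=None, ctrl_v=None):
    """Single-head self-attention with optional extra key/value memory rows."""
    q = x @ Wq  # query stream: computed from motion tokens only
    k = x @ Wk
    v = x @ Wv
    if ctrl_k is not None:
        # Assumption: control memories are concatenated along the token axis.
        k = np.concatenate([k, ctrl_k], axis=0)
        v = np.concatenate([v, ctrl_v], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n_tokens, d = 12, 8                       # hypothetical sizes
x = rng.normal(size=(n_tokens, d))        # motion token features
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # stand-ins for frozen weights

traj_feat = rng.normal(size=(4, d))       # stand-in for trajectory-encoder output
Wk_ctrl = rng.normal(size=(d, d))         # hypothetical trainable injection weights
Wv_ctrl = rng.normal(size=(d, d))
ctrl_k, ctrl_v = traj_feat @ Wk_ctrl, traj_feat @ Wv_ctrl

base = self_attention(x, Wq, Wk, Wv)
controlled = self_attention(x, Wq, Wk, Wv, ctrl_k, ctrl_v)
print(base.shape, controlled.shape)
```

In this sketch the output keeps one row per motion token because the queries never see the control features, and an empty control memory reproduces the uncontrolled layer exactly.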

If this is right

  • Only the injection parameters need training, keeping the original model frozen.
  • Control remains compatible with the inherited refinement protocol for further accuracy gains.
  • Trajectory conditioning becomes a form of memory retrieval inside attention rather than a separate global token.
  • The same interface can support root paths and multi-joint targets simultaneously.
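
The first bullet can be pictured with a toy update step. This is a sketch under stated assumptions, not the authors' training code: the parameter names, sizes, and the plain SGD rule are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical feature width

# Hypothetical parameter table: "backbone.*" stands for the frozen model,
# "inject.*" for the adapter's key/value injection weights.
params = {
    "backbone.Wq": rng.normal(size=(d, d)),
    "backbone.Wk": rng.normal(size=(d, d)),
    "backbone.Wv": rng.normal(size=(d, d)),
    "backbone.ffn": rng.normal(size=(d, d)),
    "inject.Wk_ctrl": rng.normal(size=(d, d)),
    "inject.Wv_ctrl": rng.normal(size=(d, d)),
}
trainable = {name for name in params if name.startswith("inject.")}

def sgd_step(params, grads, trainable, lr=0.1):
    """Apply the gradient only to trainable entries; frozen entries pass through."""
    return {
        name: (w - lr * grads[name]) if name in trainable else w
        for name, w in params.items()
    }

grads = {name: np.ones_like(w) for name, w in params.items()}  # dummy gradients
updated = sgd_step(params, grads, trainable)

n_trainable = sum(params[name].size for name in trainable)
n_total = sum(w.size for w in params.values())
print(f"trainable {n_trainable} of {n_total} parameters")
```

The trainable fraction printed here reflects only the toy sizes above and says nothing about the paper's actual parameter counts.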

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same K/V injection pattern could be tested on other frozen generative transformers that already use part-based tokenization.
  • Multiple independent control signals might be combined by simply stacking additional injection heads without enlarging the backbone.
  • If the part tokens remain addressable, the method could extend to time-varying constraints supplied at inference time.

Load-bearing premise

Providing geometric constraints as memory inside self-attention will not overwrite the pretrained text-conditioned motion prior, and the part-tokenized substrate supplies enough addressable sites for the injection to work.
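
To make "addressable sites" concrete, here is a small sketch assuming six body parts (the Figure 2 caption mentions six part codebooks) and a frame-major flattening order. The ordering, the frame count, the codebook size, and the helper `site_index` are assumptions, not details taken from the paper.

```python
import numpy as np

NUM_PARTS = 6       # six part codebooks, per the Figure 2 caption
NUM_FRAMES = 5      # hypothetical, kept tiny for readability
CODEBOOK_SIZE = 512 # hypothetical

# Hypothetical codebook indices: one discrete token per (frame, part) pair.
rng = np.random.default_rng(0)
part_tokens = rng.integers(0, CODEBOOK_SIZE, size=(NUM_FRAMES, NUM_PARTS))

# Unpack the parts along the sequence axis: (frames, parts) -> (frames * parts,).
sequence = part_tokens.reshape(NUM_FRAMES * NUM_PARTS)

def site_index(frame, part, num_parts=NUM_PARTS):
    """Position of a frame-part token in the flattened sequence (frame-major)."""
    return frame * num_parts + part

idx = site_index(frame=3, part=2)
print(idx, sequence[idx] == part_tokens[3, 2])
```

With one sequence position per frame-part pair, an attention layer can in principle address a specific body part at a specific frame, which is the property the premise relies on.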

What would settle it

Generate motions from the controlled model on held-out trajectory prompts, then measure two things: whether root or joint positions deviate by more than one centimeter from the targets, and whether text-alignment metrics drop below the base model's level. Either outcome would count against the claim.
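
The tracking half of this check could be scored along the following lines. This is a sketch that assumes positions are given in meters and uses mean Euclidean distance over constrained frame-joint pairs; the paper's exact error metric is not specified in the material reviewed here, and the function name and toy data are hypothetical.

```python
import numpy as np

def mean_tracking_error_cm(pred_m, target_m, constrained):
    """Mean Euclidean distance, in centimeters, over constrained (frame, joint) pairs.

    pred_m, target_m: arrays of shape (frames, joints, 3), positions in meters.
    constrained: boolean array of shape (frames, joints) marking controlled sites.
    """
    dist_m = np.linalg.norm(pred_m - target_m, axis=-1)  # (frames, joints)
    return float(dist_m[constrained].mean() * 100.0)

# Toy data: every constrained joint is off by 3 mm along the x axis.
target = np.zeros((4, 2, 3))
pred = target.copy()
pred[..., 0] += 0.003
constrained = np.ones((4, 2), dtype=bool)

err_cm = mean_tracking_error_cm(pred, target, constrained)
sub_centimeter = err_cm < 1.0
print(f"{err_cm:.2f} cm, sub-centimeter: {sub_centimeter}")
```

A mean can hide large per-frame misses, so a stricter version of the test would also report the maximum error or the fraction of constrained sites above the threshold.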

Figures

Figures reproduced from arXiv: 2606.05624 by Dongjie Fu, Hansung Kim, Pengcheng Fang, Tengjiao Sun, Xiaohao Cai, Xiaoyu Zhan, Yanwen Guo.

Figure 1: KV-Control on PartVQ+T-Concat. Left: pelvis-trajectory error versus trainable trajectory-control mechanism parameters under the MaskControl-matched M2 setting; K/V injection uses mechanism parameters (including the shared trajectory encoder; see …
Figure 2: KV-Control method overview. Single-pass left-to-right schematic; full equations and dimensions in §3.1–§3.3. Left: frozen co-designed PartVQ+T-Concat substrate (Q = 6 data-driven body-part codebooks unpacked along the sequence axis) into which motion tokens flow. Middle: per self-attention layer, the motion query stream Q is unchanged and the keys/values K, V are augmented with control-conditioned pseudo-to…
Figure 3: Qualitative trajectory control on out-of-distribution targets. Eight user-specified letter-shaped pelvis-trajectory targets (S, I, G, G, R, A, P, H) for our single-joint pelvis K/V-Control checkpoint, oblique 3/4 view; each cell overlays input waypoints as red 3D markers on the floor along 8 translucent body keyframes sampled from the L = 196 motion. The same trained adapter is applied to all eight letters…
Figure 4: Unconditional generation (top) vs. trajectory-controlled synthesis (bottom). Top: the frozen base backbone on dynamic text prompts (waltz side-step, kung-fu kata with kicks, walk/bend/pick-up) synthesizes coherent multi-stage motion, confirming that …
Figure 5: Extended qualitative behaviour of the same trained K/V-Control adapter. Six probes beyond Figs. 3–4, all using the same multi-joint K/V-Control checkpoint. Row 1: skater-style zigzag (single-joint) and an S-curve walk with alternating arm raises (multi-joint). Row 2: forward locomotion with both hands waving and a spiral walk with natural arm swings (both multi-joint). Row 3: a figure-eight closed-loop pat…
Original abstract

Text-conditioned 3D human motion models now synthesize plausible motions from prompts, but practical animation and embodied-agent workflows rarely stop at text: a character may need to follow a sketched root path, hit an end-effector target, or satisfy a multi-joint trajectory while still preserving the gait, style, and intent described by language. This exposes a control trade-off. A trajectory controller should be precise without overwriting the pretrained text-conditioned motion prior, yet existing solutions either duplicate large portions of the generator to regain per-layer control access or move much of the cost to test-time optimization. We introduce KV-Control, a compact attention-side control interface for frozen masked text-to-motion transformers. The key idea is to make geometric constraints available as memory inside self-attention rather than injecting them through a global pose token or enforcing them only at the output side. To support this interface, we co-design a part-tokenized motion substrate and controller: PartVQ learns anatomy-aligned part codebooks, T-Concat exposes each frame-part token as an attention-addressable site, and KV-Control injects control-conditioned key/value memories at every self-attention layer while preserving the pretrained query stream, text cross-attention, FFN, and all backbone weights. The resulting adapter adds only trainable injection parameters atop a shared trajectory encoder, yet tracks root and multi-joint constraints with sub-centimeter accuracy under the inherited refinement protocol while retaining text-conditioned motion quality. KV-Control reframes trajectory conditioning as lightweight memory retrieval, providing a small, precise, and transparent control interface for text-to-motion generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces KV-Control, a parameter-efficient adapter for adding trajectory control to frozen masked text-to-motion transformers. It co-designs PartVQ (anatomy-aligned part codebooks), T-Concat (frame-part tokens as attention-addressable sites), and KV-Control (injection of control-conditioned K/V memories into every self-attention layer) while freezing the query stream, text cross-attention, FFN, and backbone weights. The adapter uses only trainable injection parameters on top of a shared trajectory encoder and claims sub-centimeter accuracy on root and multi-joint constraints under an inherited refinement protocol while retaining text-conditioned motion quality.

Significance. If the central claims hold, the work provides a lightweight, transparent alternative to duplicating large generators or relying on test-time optimization for precise geometric control in text-to-motion models. Treating trajectory signals as retrievable memory inside attention, rather than global tokens or output enforcement, is a clean architectural choice that could apply to other conditional generation settings. The emphasis on preserving the frozen text prior is practically valuable for animation and embodied-agent workflows.

major comments (2)
  1. [Abstract and §4 (Experiments)] The central claim that geometric K/V injection preserves the pretrained text-conditioned prior while enabling precise control is load-bearing but unsupported without explicit verification. The manuscript must report standard text-to-motion metrics (FID, R-Precision, diversity) for both the original model and the KV-Control-equipped model on a held-out text-only test set in the zero-control (neutral trajectory) case; absence of this comparison leaves open the possibility that the part-tokenized substrate or learned injection weights subtly shift the prior.
  2. [Abstract] Sub-centimeter tracking accuracy is asserted for root and multi-joint constraints, yet the abstract and available description provide no dataset information, baseline comparisons, error bars, or statistical details. This makes it impossible to assess whether the reported precision is robust or protocol-specific.
minor comments (2)
  1. [§3 (Method)] Notation for PartVQ codebooks and T-Concat tokenization should be defined with explicit equations or pseudocode in the method section for reproducibility.
  2. [Abstract] The abstract uses 'sub-centimeter accuracy' without specifying the exact error metric (e.g., mean per-joint position error); this should be clarified with a precise definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments highlight areas where the manuscript's claims require stronger empirical support. We address each below and commit to revisions that directly incorporate the requested verifications and details.

  1. Referee: [Abstract and §4 (Experiments)] The central claim that geometric K/V injection preserves the pretrained text-conditioned prior while enabling precise control is load-bearing but unsupported without explicit verification. The manuscript must report standard text-to-motion metrics (FID, R-Precision, diversity) for both the original model and the KV-Control-equipped model on a held-out text-only test set in the zero-control (neutral trajectory) case; absence of this comparison leaves open the possibility that the part-tokenized substrate or learned injection weights subtly shift the prior.

    Authors: We agree that the preservation claim requires explicit zero-control verification on standard metrics. The current manuscript asserts retention of text-conditioned quality but does not include the requested side-by-side comparison. In the revision we will add FID, R-Precision, and diversity results for both the frozen original model and the KV-Control adapter on a held-out text-only test set under neutral (zero-control) trajectories. This will directly confirm that the part-tokenized substrate and injection weights do not degrade the pretrained prior. revision: yes

  2. Referee: [Abstract] Sub-centimeter tracking accuracy is asserted for root and multi-joint constraints, yet the abstract and available description provide no dataset information, baseline comparisons, error bars, or statistical details. This makes it impossible to assess whether the reported precision is robust or protocol-specific.

    Authors: We acknowledge the abstract's brevity omits necessary experimental context for the accuracy claim. While the body references the inherited refinement protocol and reports sub-centimeter errors, the abstract itself lacks dataset, baseline, error-bar, and statistical details. In revision we will expand the abstract (within length limits) and ensure the results section explicitly states the dataset, baselines, error statistics, and protocol to make the precision claim fully assessable and reproducible. revision: partial

Circularity Check

0 steps flagged

No circularity: architecture described without equations or self-referential reductions

The provided abstract and description introduce KV-Control as a design for parameter-efficient injection into a frozen transformer, using PartVQ and T-Concat as co-designed substrates. No equations, fitted parameters, predictions, or uniqueness theorems are presented that reduce by construction to inputs or prior self-citations. Claims of sub-centimeter tracking and retained text quality are framed as outcomes of the adapter under an inherited protocol, with no load-bearing step shown to be definitional or statistically forced. The derivation chain is therefore self-contained as an engineering proposal rather than a tautological mapping.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the named components (PartVQ, T-Concat, KV-Control); ledger left empty pending full text.

pith-pipeline@v0.9.1-grok · 5842 in / 1092 out tokens · 36111 ms · 2026-06-28T02:33:44.841643+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MoGeFlow: Flowing Through Motion Codebook Geometry for Text-to-Motion Generation

    cs.GR · 2026-06 · unverdicted · novelty 6.0

    MoGeFlow learns text-conditioned flows over PartVQ group-specific code embeddings to generate motions, achieving SOTA R-Precision on HumanML3D and KIT-ML while preserving discrete token validity.
