MultiAct: Text-to-Motion Generation from Composite Text via Tailored Attention Guidance

Andreas Aristidou; Ariel Shamir; Daniel Cohen-Or; Nathan Sala; Ofir Abramovich; Sigal Raab

arxiv: 2605.30925 · v1 · pith:NMBXV5L3new · submitted 2026-05-29 · 💻 cs.CV · cs.GR

Pith reviewed 2026-06-28 23:12 UTC · model grok-4.3

keywords: text-to-motion generation · composite prompts · attention guidance · semantic coverage · inference-time adaptation · cross-attention modulation · motion synthesis

The pith

MultiAct generates motions from composite text prompts by amplifying cross-attention scores for neglected action components at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-to-motion models often collapse composite prompts that describe several actions happening together into a single dominant motion, leaving other elements out. MultiAct fixes this without retraining by boosting the cross-attention scores that link to the underrepresented parts of the prompt. A lightweight auxiliary scheme picks the right tokens and layers to strengthen for each specific prompt. The result is motion that covers more of the described actions while staying realistic. The approach works directly on any pretrained generator.

Core claim

MultiAct is an unpaired inference-time framework that operates on pretrained motion generators without retraining or architectural changes. It counteracts semantic collapse in composite prompts by adaptively amplifying cross-attention scores associated with underrepresented prompt components and uses a lightweight auxiliary decision scheme to select the most effective attention-strengthening parametrization.

What carries the argument

Adaptive amplification of cross-attention scores for underrepresented prompt tokens, selected by a prompt-specific auxiliary decision scheme.
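
As a rough illustration of why raising one token's score helps, the toy below adds a bias to a single pre-softmax cross-attention score and renormalizes. It is a minimal sketch under stated assumptions, not the paper's procedure: the Figure 4 caption indicates that the method adjusts the data tensor rather than editing scores directly, and the names `amplify_token` and `boost`, as well as all numbers, are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def amplify_token(logits, token_idx, boost):
    """Add `boost` to one token's pre-softmax score, then renormalize."""
    boosted = list(logits)
    boosted[token_idx] += boost
    return softmax(boosted)

# Hypothetical scores of one motion frame against four prompt tokens.
tokens = ["person", "hops", "while", "raising"]
logits = [0.5, 2.0, 0.1, 0.4]

before = softmax(logits)
after = amplify_token(logits, tokens.index("raising"), boost=1.5)
```

Because softmax normalizes across tokens, the extra attention given to "raising" is taken from the other tokens, including the dominant "hops". That trade-off is one reason the strength of the boost would plausibly need to be chosen per prompt.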

If this is right

Existing text-to-motion generators achieve higher semantic coverage on prompts with simultaneous actions.
Motion realism is preserved while adding the missing action components.
The framework applies to any pretrained model at inference time with no retraining.
Prompt-specific choices of tokens and layers can be automated by the auxiliary scheme.
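
One way to picture automated parameter selection borrows the deviation space described in the Figure 6 caption, where each parameter combination has a prefix and a suffix deviation and points nearer the origin are better. The sketch below ranks candidates by that distance. How the paper's own scheme (called ParamGate in the Figure 11 caption) decides is not described in the text available here, so the `Candidate` fields, the Euclidean form of the distance, and all numbers are assumptions made for illustration.

```python
import math
from typing import NamedTuple

class Candidate(NamedTuple):
    layer: int          # transformer layer to guide (hypothetical field)
    token: str          # prompt token whose attention is strengthened
    steps: int          # number of guided diffusion steps
    prefix_dev: float   # deviation from the prefix action, lower is better
    suffix_dev: float   # deviation from the suffix action, lower is better

def dual_distance(c: Candidate) -> float:
    """Distance from the origin in (prefix, suffix) deviation space.

    Assumption: Euclidean distance; the source only says 'distance'.
    """
    return math.hypot(c.prefix_dev, c.suffix_dev)

def select_best(candidates):
    """Return the candidate closest to the origin."""
    return min(candidates, key=dual_distance)

# Made-up deviation values for three parameter combinations.
candidates = [
    Candidate(layer=1, token="raising", steps=10, prefix_dev=0.30, suffix_dev=0.90),
    Candidate(layer=4, token="raising", steps=20, prefix_dev=0.35, suffix_dev=0.25),
    Candidate(layer=7, token="arms", steps=20, prefix_dev=0.80, suffix_dev=0.20),
]
best = select_best(candidates)
```

A distance of this kind penalizes combinations that realize one action at the expense of the other, which matches the stated goal of covering both parts of a composite prompt.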

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same attention modulation idea could be tested on text-to-image models that also drop secondary objects from complex scenes.
If the auxiliary scheme generalizes, prompt engineering for motion synthesis could become simpler.
Evaluating the method on prompts with temporally conflicting actions would test whether amplification can resolve contradictions.

Load-bearing premise

Adaptively boosting cross-attention for selected tokens will bring in all prompt actions without creating motion artifacts or lowering quality, and the auxiliary scheme will pick effective settings for any prompt.

What would settle it

A benchmark run on composite prompts where MultiAct either omits described actions or produces less realistic output than an unmodified baseline model.

Figures

Figures reproduced from arXiv: 2605.30925 by Andreas Aristidou, Ariel Shamir, Daniel Cohen-Or, Nathan Sala, Ofir Abramovich, Sigal Raab.

**Figure 1.** MultiAct synthesizes motion from composite textual descriptions by selectively modulating cross-attention to amplify weakly represented elements in…

**Figure 2.** Pipeline. Given a text prompt, MultiAct selects prompt-specific parameters (Sec. 4), applies tailored guided generation (Sec. 3.2), and outputs the resulting motion.

**Figure 3.** Effect of attention guidance on the attention score and the generated motion (also shown in the supplementary video). Panels: MDM* Backbone, MultiAct (ours); axis: time. Prompt: “A person hops forward while raising his arms”.

**Figure 4.** Attention guided generation. This figure visualizes Algs. 1 and 2. Left: Diffusion inference pipeline (Appendix A.2) with tailored optimization integrated. Tailored optimization modifies the data tensors $X_t$ and is applied only for $\hat{t}$ diffusion steps, where $\hat{t}$ is tailored to the prompt $y$. Right: Tailored optimization, in which $X$ is adjusted to increase cross-attention for a selected token and layer, encapsulat…

**Figure 5.** Deviation measure credibility. Visual sanity check for the deviation scores, where each point corresponds to a parameter combination; the vertical axis shows the suffix deviation, and the horizontal axis depicts a prompt-specific geometric heuristic based on maximum hand-shoulder distance. Low suffix deviation errors (e.g., green-circled) indicate correct generation of the suffix action (arm raising), whi…

**Figure 6.** Deviation space visualization. Each point corresponds to a single parameter combination and shows the prefix and suffix deviation values for the prompt shown above. Distance from the origin indicates the dual multimodal distance. Points are color-coded by transformer layer ℓ, revealing a visual separation in which parameter combinations closest to the origin are dominated by layers 3 to 5.

**Figure 7.** Qualitative results. Our method consistently maintains high motion quality and strong text alignment. In contrast, other baselines either attend to a single dominant verb, exhibit occasional floor penetration or motion artifacts, or produce unnatural results. Prompt: “A person is walking forward like a drunk while waving his arms”. Methods shown: MDM* Backbone, MoMask, STMC, Attend & Excite*, MultiAct (ours).

**Figure 8.** Motion stylization. When prompted with motion stylization, existing methods fail either to generate both actions simultaneously or to incorporate the requested style. In contrast, our method supports motion stylization in parallel with multiple simultaneous actions, successfully integrating both into the generated motion. Note the characteristic imbalance in our result, reflecting the specified drunken sty…

**Figure 9.** Motion diversity. When sampled multiple times with the same prompt and parameter combinations, MultiAct generates diverse, high-quality motions that consistently satisfy both simultaneous actions.

**Figure 10.** User Study. Users vote that our model performs better than state-of-the-art models. The percentage indicates the proportion of votes in favor of our method, and the dashed line denotes the 50% mark.

**Figure 11.** Ablation. Left: Motion generated using a fixed parameter set selected according to the best-performing combination in Θ. Right: Motion generated for the same prompt with parameters selected by ParamGate.

**Figure 12.** A screenshot of our user study. Note that each human figure in the screenshot is played as a video.
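
The Figure 4 caption describes adjusting the data tensor so that cross-attention to a selected token increases, and doing so for only some of the diffusion steps. The toy below sketches that idea with a single attention head, frame vectors used directly as queries, and an analytic gradient. It is a sketch under stated assumptions rather than the paper's Algs. 1 and 2: `denoise_step` is an identity placeholder for the pretrained generator, guidance is assumed to run on the first `t_hat` steps, and the keys, frames, and learning rate are hypothetical.

```python
import math

def attention_row(x, keys):
    """Softmax attention of one frame vector over all token keys."""
    scale = math.sqrt(len(x))
    scores = [sum(a * b for a, b in zip(x, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def target_attention(X, keys, token_idx):
    """Mean attention that the frames in X pay to the selected token."""
    return sum(attention_row(x, keys)[token_idx] for x in X) / len(X)

def guidance_step(X, keys, token_idx, lr):
    """One gradient-ascent step on each frame's attention to token_idx."""
    dim = len(X[0])
    scale = math.sqrt(dim)
    new_X = []
    for x in X:
        p = attention_row(x, keys)
        # Expected key under the current attention distribution.
        mean_k = [sum(p[j] * keys[j][d] for j in range(len(keys)))
                  for d in range(dim)]
        # d p_t / d x = p_t * (k_t - E_p[k]) / sqrt(dim)
        grad = [p[token_idx] * (keys[token_idx][d] - mean_k[d]) / scale
                for d in range(dim)]
        new_X.append([xi + lr * g for xi, g in zip(x, grad)])
    return new_X

def denoise_step(X, t):
    """Placeholder for the pretrained generator's update (identity here)."""
    return X

def guided_sampling(X, keys, token_idx, num_steps, t_hat, lr=1.0):
    """Run num_steps updates, applying guidance on the first t_hat of them."""
    for t in range(num_steps):
        if t < t_hat:  # assumption: which steps are guided is not stated
            X = guidance_step(X, keys, token_idx, lr)
        X = denoise_step(X, t)
    return X

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # three hypothetical token keys
X0 = [[0.8, 0.1], [0.6, -0.2]]                # two hypothetical motion frames
X1 = guided_sampling(X0, keys, token_idx=1, num_steps=10, t_hat=4)
```

Comparing `target_attention(X0, keys, 1)` with `target_attention(X1, keys, 1)` shows the effect of the guided steps on the selected token. In a real generator the denoiser would also change `X` at every step, so the gain from guidance would not simply accumulate as it does in this toy.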

Original abstract

Text-to-motion generation has progressed rapidly in recent years, offering an expressive interface for animation and human-computer interaction. However, current models remain brittle when handling prompts that describe multiple actions occurring at the same time. Rather than realizing all components of a composite description, models frequently prioritize a single dominant action and neglect the rest, leading to incomplete or ambiguous motion. We present MultiAct, an unpaired, inference-time framework for compositional text-to-motion synthesis that operates directly on pretrained motion generators without retraining or architectural modification. Our method counteracts semantic collapse by adaptively amplifying cross-attention scores associated with underrepresented prompt components. We note that effective modulation depends on prompt-specific choices, such as which tokens and layers to target, and introduce a lightweight auxiliary decision scheme that determines the most effective attention-strengthening parametrization. Extensive quantitative and qualitative evaluations demonstrate that MultiAct consistently outperforms existing baselines on composite prompts, achieving improved semantic coverage while preserving motion realism. Project page: https://natsala13.github.io/multiact.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MultiAct adds an inference-time attention tweak plus an auxiliary scheme for composite motion prompts, but the strength of the evidence hinges on the full experiments.

The core contribution is an unpaired inference-time method that modulates cross-attention in existing text-to-motion generators to reduce semantic collapse on composite prompts, plus a lightweight auxiliary scheme that picks prompt-specific tokens and layers for the amplification.

It does a clean job of targeting a practical failure mode without retraining or architecture changes, which keeps the approach usable on top of current models. The adaptive amplification idea is direct and the auxiliary decision component is a reasonable way to handle the fact that effective settings vary by prompt.

The soft spot is the auxiliary scheme itself. The abstract gives no construction details or validation procedure, and the stress-test concern about reliability on arbitrary composites is fair until the experiments show otherwise. Claims of consistent outperformance and better coverage are stated without any metrics or baseline comparisons here, so the paper stands or falls on whether the quantitative and qualitative results actually demonstrate gains in semantic coverage without quality loss.

This is for people working on text-to-motion for animation or HCI who already have a generator and want to improve handling of multi-action descriptions. A reader focused on inference-time fixes would find the mechanism worth seeing.

I would send it to peer review. The technique is new enough and the problem real enough that referees should check the experiments and the auxiliary scheme's robustness.

Referee Report

2 major / 1 minor

Summary. The manuscript presents MultiAct, an unpaired inference-time framework for compositional text-to-motion synthesis. It operates externally on pretrained generators by adaptively amplifying cross-attention scores for underrepresented components of composite prompts, using a lightweight auxiliary decision scheme to select prompt-specific tokens, layers, and amplification parameters. The central claim is that this yields consistent outperformance over baselines on composite prompts, with improved semantic coverage while preserving motion realism.

Significance. If the empirical claims hold, the contribution would be significant for addressing semantic collapse in multi-action text-to-motion models without retraining or architectural changes. The external, inference-only design is a practical strength that could extend to other pretrained generators, and the focus on attention modulation offers a targeted way to improve coverage of composite descriptions in animation and HCI applications.

major comments (2)

[Abstract] Abstract: the claim of consistent outperformance on composite prompts is asserted without any reported metrics, baselines, or quantitative results, which is load-bearing for the central empirical claim; the full paper must supply these to substantiate the assertion.
[Method] Method (auxiliary decision scheme): the reliability of the lightweight auxiliary scheme for selecting effective prompt-specific parametrizations (tokens/layers/strength) for arbitrary composites is not validated in detail; without evidence that it avoids overfitting to evaluated cases or introducing artifacts on novel prompts, the generalization of the outperformance claim remains at risk.

minor comments (1)

The project page URL is provided, which supports access to qualitative results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below.

Referee: [Abstract] Abstract: the claim of consistent outperformance on composite prompts is asserted without any reported metrics, baselines, or quantitative results, which is load-bearing for the central empirical claim; the full paper must supply these to substantiate the assertion.

Authors: The abstract summarizes the findings at a high level without specific numbers, following standard conventions for brevity. The full manuscript substantiates the claim with detailed quantitative evaluations, including metrics, baseline comparisons, and results on composite prompts, as reported in the Experiments section.

Revision: no

Referee: [Method] Method (auxiliary decision scheme): the reliability of the lightweight auxiliary scheme for selecting effective prompt-specific parametrizations (tokens/layers/strength) for arbitrary composites is not validated in detail; without evidence that it avoids overfitting to evaluated cases or introducing artifacts on novel prompts, the generalization of the outperformance claim remains at risk.

Authors: The auxiliary decision scheme is validated through the extensive quantitative and qualitative evaluations on diverse composite prompts, including novel cases, presented in the paper. These results demonstrate consistent outperformance and motion quality without artifacts, supporting generalization. The Method section details the scheme's design for prompt-specific robustness.

Revision: no

Circularity Check

0 steps flagged

No significant circularity detected

The paper frames MultiAct as an unpaired inference-time method operating externally on pretrained generators without retraining. The described mechanism (adaptive cross-attention amplification guided by a lightweight auxiliary decision scheme) and the performance claims rest on independent quantitative/qualitative evaluations rather than any self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations. No uniqueness theorems, ansatzes, or renamings of known results are invoked in the provided text. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on abstract; no free parameters, axioms, or invented entities are explicitly detailed or required for the high-level claim.

