DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs

Luisa Bentivogli; Sara Papi

arXiv: 2605.31432 · v1 · pith:M6YK3AZWnew · submitted 2026-05-29 · 💻 cs.CL · cs.AI · cs.SD

Pith reviewed 2026-06-28 22:14 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.SD

keywords simultaneous speech translation · decoder-only attention · SpeechLLMs · streaming policy · self-attention alignment · training-free · long-form translation

The pith

Decoder self-attention in SpeechLLMs supplies stable alignment signals for training-free long-form simultaneous translation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether decoder self-attention in Speech Large Language Models can replace cross-attention signals from encoder-decoder models to guide streaming policies in simultaneous speech-to-text translation. It introduces Decoder-Only Attention (DOA), a training-free policy that extracts a proxy alignment from self-attention to decide when to read incoming speech and write translation output. Experiments with Phi4-Multimodal and Qwen3-Omni demonstrate that this signal supports low-latency decisions in long-form settings with quality approaching offline decoding. A sympathetic reader would care because the result would let existing decoder-only models be used directly for real-time translation without retraining or new architectures.

Core claim

Decoder self-attention in off-the-shelf SpeechLLMs contains alignment signals stable enough to guide a streaming policy, so the Decoder-Only Attention (DOA) method enables effective long-form simultaneous speech-to-text translation at quality close to offline decoding without any retraining or architectural changes.

What carries the argument

Decoder-Only Attention (DOA), a training-free policy that derives a proxy alignment signal from decoder self-attention to inform read/write streaming decisions.
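
As a rough illustration of what such a policy could look like, the sketch below applies a frame-guard rule of the kind used by attention-guided policies for encoder-decoder models: a candidate token is written only if the audio position it attends to most lies outside the newest `guard_frames` positions. This is an assumption made for illustration, not the paper's stated extraction rule; the function name, the head-averaged `attn_rows` input, and the `guard_frames` parameter are all hypothetical.

```python
from typing import Sequence


def tokens_safe_to_write(attn_rows: Sequence[Sequence[float]], guard_frames: int) -> int:
    """Return how many leading candidate tokens can be written now.

    attn_rows[i][j] is a hypothetical, already head-averaged self-attention
    weight from candidate output token i to audio position j.
    """
    if not attn_rows or not attn_rows[0]:
        return 0
    num_frames = len(attn_rows[0])
    # Audio positions at or beyond this index count as "too recent" to trust.
    frontier = max(num_frames - guard_frames, 0)
    written = 0
    for row in attn_rows:
        # Proxy alignment: the audio position receiving the most attention.
        aligned = max(range(num_frames), key=lambda j: row[j])
        if aligned >= frontier:
            break  # token leans on the newest audio: READ more before writing
        written += 1
    return written
```

Under this rule, a larger `guard_frames` makes the policy wait longer before writing, since fewer audio positions count as safe to align with.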

If this is right

Off-the-shelf SpeechLLMs can be applied to simultaneous translation without any model adaptation.
Long-form inputs can be handled with low latency while translation quality stays close to offline results.
Separate training for streaming policies or reliance on wait-k heuristics becomes unnecessary.
The same self-attention patterns support streaming decisions across different tested SpeechLLM architectures.
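
For contrast, a wait-k heuristic fixes the schedule in advance instead of consulting the model: it reads k source units, then alternates one write with one read until the source is exhausted. A minimal sketch, with a hypothetical function name and with source and target lengths assumed known purely for illustration:

```python
from typing import List


def wait_k_schedule(num_source: int, num_target: int, k: int) -> List[str]:
    """Return the READ/WRITE action sequence of a wait-k policy."""
    actions: List[str] = []
    read = 0
    written = 0
    while written < num_target:
        # Before writing the next token, wait-k requires k + written source
        # units to have been read, capped at the full source length.
        needed = min(k + written, num_source)
        if read < needed:
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions
```

The schedule depends only on k and the lengths, which is why it cannot react to how much of the audio a given token actually needs.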

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the alignment signal holds across more models, DOA could apply to other decoder-only language models for streaming tasks beyond translation.
Testing on audio longer than the evaluated cases would show where self-attention stability begins to degrade.
The policy might integrate into existing speech pipelines to lower the barrier for real-time multilingual applications.

Load-bearing premise

Decoder self-attention in off-the-shelf SpeechLLMs contains sufficiently stable alignment signals to guide the streaming policy in long-form settings.

What would settle it

An experiment showing that self-attention alignments become inconsistent or produce high-latency or low-quality outputs on long-form audio would falsify the claim that DOA supplies an effective alignment signal.

Figures

Figures reproduced from arXiv: 2605.31432 by Luisa Bentivogli, Sara Papi.

**Figure 1.** Latency (LongLAAL↓) - Quality (COMET↑) curves of Punctuation and Fixed Words methods applied to Phi4-Multimodal on ACL 60/60 en-de dev set. Numerical results are in Appendix B.

**Figure 2.** Layer- and Head-wise performance difference compared to the average. Green squares indicate improve...

**Figure 3.** Latency (LongYAAL↓) - Quality (COMET↑) curves on MCIF of DOA policy on Phi4-Multimodal and Qwen3-Omni, and of StreamAtt baseline on SeamlessM4T. Numerical results are in Appendix B.

Original abstract

Simultaneous speech-to-text translation (SimulST) generates translations while speech is still unfolding, requiring a streaming policy that decides when to read and when to write. State-of-the-art approaches rely on attention-based encoder-decoder models where cross-attention provides explicit alignment signals. In contrast, Speech Large Language Models (SpeechLLMs) are decoder-only architectures relying solely on self-attention. This raises a central question: whether decoder self-attention contains sufficiently stable alignment signals to guide the streaming policy. Moreover, existing approaches typically rely on training-based adaptations or heuristic wait-$k$ policies and have not been validated in long-form settings. To fill these gaps, we propose Decoder-Only Attention (DOA), a training-free policy that enables long-form simultaneous translation with off-the-shelf SpeechLLMs by deriving a proxy alignment from self-attention. Experiments on Phi4-Multimodal and Qwen3-Omni show that DOA provides an effective alignment signal for supporting streaming decisions, enabling low-latency long-form SimulST with quality close to offline decoding without retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DOA pulls a training-free streaming policy from decoder self-attention in SpeechLLMs, which is a practical step for long-form SimulST, but the stability of that signal over growing context length is still the part that needs real evidence.

The core move here is straightforward: instead of training a new policy or using cross-attention from an encoder-decoder model, they read the existing self-attention weights inside off-the-shelf SpeechLLMs to decide when to keep listening and when to start writing. That produces DOA, a policy that works on Phi4-Multimodal and Qwen3-Omni for long-form speech translation without any retraining.

What the paper actually delivers is a clean way to adapt decoder-only models to the simultaneous setting. Prior work mostly stayed with encoder-decoder architectures or simple wait-k rules, and long-form cases were rarely tested. Showing that self-attention can serve as a usable alignment proxy in this new architecture class is the concrete advance, and testing it on two recent multimodal models gives at least a basic check on generality.

The experiments are described as reaching quality close to offline decoding at low latency, which would be useful if the numbers are solid. The method stays training-free, which keeps the barrier low for people who already have these models.

The main uncertainty is whether the self-attention signal remains reliable once the speech input stretches out. Causal decoder attention often spreads or shifts toward recent tokens in long contexts, and if that happens the read/write decisions could drift. The abstract treats the experiments as confirmation, but without the actual curves, ablation on context length, or details on how they extract the proxy, it is difficult to judge how much the result depends on the specific models or test sets. That assumption is load-bearing.

This is for people already working on simultaneous translation or trying to deploy SpeechLLMs in streaming scenarios. A reader in that area would pick up a usable idea even if they later modify the extraction step. It is worth sending to referees because it targets a real practical gap with a method that can be checked against existing baselines.

Referee Report

2 major / 2 minor

Summary. The paper proposes Decoder-Only Attention (DOA), a training-free streaming policy for long-form simultaneous speech-to-text translation (SimulST) with decoder-only SpeechLLMs. It extracts a proxy alignment signal directly from the model's self-attention weights to decide read/write actions during streaming, addressing the absence of explicit cross-attention in these architectures. Experiments on Phi4-Multimodal and Qwen3-Omni are reported to show that DOA supports low-latency long-form SimulST with quality approaching offline decoding, without any model retraining or adaptation.

Significance. If the central result holds, the work would be significant for enabling practical simultaneous translation with existing decoder-only SpeechLLMs in long-form settings. The training-free derivation from internal self-attention avoids the cost of training-based policies or encoder-decoder retraining, and the focus on long-form validation fills a noted gap in prior SimulST literature. Explicit credit is due for the parameter-free construction and the use of off-the-shelf models.

major comments (2)

[Abstract / method] Abstract and method description: the headline claim that DOA yields quality close to offline decoding rests on the assumption that decoder self-attention supplies a sufficiently stable alignment proxy as speech input length grows. No quantitative evidence (e.g., attention entropy, focus metrics, or position-bias analysis over increasing context lengths) is referenced to confirm that the weights remain cross-modally focused rather than diffuse or position-biased, which directly risks the reported latency-quality trade-off.
[Experiments] Experiments: the results on Phi4-Multimodal and Qwen3-Omni are presented as validating the policy, yet without reported ablations isolating the contribution of the self-attention-derived signal versus heuristic baselines or without length-stratified quality/latency curves, it is impossible to confirm that the alignment signal remains load-bearing for the long-form claim rather than an artifact of shorter contexts.

minor comments (2)

[Method] The exact extraction formula for the proxy alignment from self-attention weights should be stated as an equation rather than described at high level to allow reproduction.
[Related work] Add explicit comparison to recent training-free or wait-k baselines in the related-work section to situate the contribution.
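
The stability diagnostics named in the first major comment could be computed along these lines. The sketch is an assumption about how such metrics might be defined (normalized Shannon entropy of one attention row, and the share of attention mass on the most recent positions); neither function comes from the paper, and both names are hypothetical.

```python
import math
from typing import Sequence


def normalized_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy of one attention row, scaled to [0, 1].

    Values near 1 mean diffuse attention; values near 0 mean sharp focus.
    """
    total = sum(weights)
    if total <= 0 or len(weights) < 2:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(weights))


def recency_mass(weights: Sequence[float], last_n: int) -> float:
    """Fraction of attention mass placed on the last `last_n` positions."""
    total = sum(weights)
    if total <= 0 or last_n <= 0:
        return 0.0
    return sum(weights[-last_n:]) / total
```

Tracking both quantities as the audio context grows would show whether attention rows become more diffuse or drift toward the newest positions.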

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.

Referee: [Abstract / method] Abstract and method description: the headline claim that DOA yields quality close to offline decoding rests on the assumption that decoder self-attention supplies a sufficiently stable alignment proxy as speech input length grows. No quantitative evidence (e.g., attention entropy, focus metrics, or position-bias analysis over increasing context lengths) is referenced to confirm that the weights remain cross-modally focused rather than diffuse or position-biased, which directly risks the reported latency-quality trade-off.

Authors: We agree that explicit quantitative metrics on attention stability would strengthen the central claim. In the revised manuscript we will add attention entropy, focus, and position-bias analyses computed over increasing context lengths on the evaluated models to demonstrate that self-attention remains cross-modally focused rather than diffuse. revision: yes

Referee: [Experiments] Experiments: the results on Phi4-Multimodal and Qwen3-Omni are presented as validating the policy, yet without reported ablations isolating the contribution of the self-attention-derived signal versus heuristic baselines or without length-stratified quality/latency curves, it is impossible to confirm that the alignment signal remains load-bearing for the long-form claim rather than an artifact of shorter contexts.

Authors: We concur that isolating the contribution of the DOA signal and providing length-stratified results are necessary to support the long-form claims. We will add ablations against standard heuristic baselines (including wait-k) together with quality and latency curves stratified by input length in the revised experiments section. revision: yes
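
Length-stratified curves of the kind promised here amount to bucketing per-recording scores by input duration before averaging. A minimal sketch, assuming hypothetical `(duration_s, quality, latency_s)` records and caller-chosen bucket edges in seconds:

```python
from bisect import bisect_left
from typing import Dict, Iterable, List, Sequence, Tuple

# Hypothetical record layout: (duration_s, quality, latency_s).
Record = Tuple[float, float, float]


def stratify_by_length(
    records: Iterable[Record], edges: Sequence[float]
) -> Dict[int, Tuple[float, float, int]]:
    """Group records by duration bucket and average quality and latency.

    `edges` holds sorted upper bounds; bucket i covers durations up to and
    including edges[i], and the last bucket holds everything longer.
    Returns {bucket_index: (mean_quality, mean_latency, count)}.
    """
    buckets: Dict[int, List[Record]] = {}
    for rec in records:
        idx = bisect_left(edges, rec[0])
        buckets.setdefault(idx, []).append(rec)
    summary: Dict[int, Tuple[float, float, int]] = {}
    for idx, recs in sorted(buckets.items()):
        n = len(recs)
        mean_quality = sum(r[1] for r in recs) / n
        mean_latency = sum(r[2] for r in recs) / n
        summary[idx] = (mean_quality, mean_latency, n)
    return summary
```

If quality falls or latency rises only in the longest buckets, that would point to the alignment signal weakening as context grows.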

Circularity Check

0 steps flagged

No circularity: training-free extraction from existing model internals

The paper's core contribution is a training-free policy (DOA) that derives a proxy alignment signal directly from the self-attention weights already present in off-the-shelf decoder-only SpeechLLMs. No parameters are fitted to target data, no predictions are made from subsets of the same data, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The central assumption (that self-attention yields usable alignment) is stated explicitly as the open question the experiments address rather than being smuggled in by definition or prior author work. The derivation chain therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central assumption that self-attention supplies usable alignment is treated as an empirical question rather than an axiom.


