arxiv: 2606.03967 · v1 · pith:VH53D22Gnew · submitted 2026-06-02 · 💻 cs.CL · cs.AI

AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task

Pith reviewed 2026-06-28 10:28 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords simultaneous speech translation · AlignAtt · decoder-only LLM · IWSLT 2026 · low-latency translation · alignment heads · Qwen3-ASR · Gemma-4

The pith

AlignAtt4LLM recovers a usable AlignAtt policy for decoder-only LLMs via prompt source spans and selected alignment heads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the AlignAtt simultaneous translation policy can be transferred from encoder-decoder models to decoder-only LLMs despite the absence of cross-attention. It achieves this through a synchronous cascade that pairs incremental ASR transcripts with an MT model running under four targeted adaptations to prompt structure and attention handling. A sympathetic reader would care because the result lets stronger decoder-only backbones participate in low-latency speech translation without requiring architectural changes to the underlying LLM. On the IWSLT 2026 development set the adapted system beats the supplied baselines for English-to-German and English-to-Italian at both the two-second and sub-four-second latency operating points.

Core claim

AlignAtt4LLM is the first reported use of AlignAtt on a decoder-only LLM. The system runs Qwen3-ASR with forced alignment to produce an incrementally updated source transcript and then applies Gemma-4 under an MT-side AlignAtt policy recovered by (1) inserting an explicit source span into the prompt, (2) offline selection of translation-specific alignment heads, (3) selective qk-fast replay of the draft-to-source attention block, and (4) runtime query/key capture that leaves model outputs bit-identical. The resulting policy outperforms the IWSLT 2026 baselines for English-German and English-Italian in both the low-latency regime near two seconds and the high-latency regime below four seconds.
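A minimal sketch of the first adaptation, the explicit source span, may help here. It assembles a prompt from four segments (the caption of Figure 10 names them as system instruction, live source, accepted target prefix, and current draft) and records the token range of each, so that later steps can address the source columns directly. The names `PromptSpans` and `layout_prompt` and the toy word-level tokens are illustrative assumptions, not the paper's code and not Gemma-4's actual chat template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSpans:
    """Token index ranges of the four prompt segments (hypothetical layout)."""
    system: range
    source: range
    target_prefix: range
    draft: range


def layout_prompt(system_toks, source_toks, accepted_toks, draft_toks):
    """Concatenate the segments in a fixed order and record each one's span."""
    tokens, spans, start = [], [], 0
    for segment in (system_toks, source_toks, accepted_toks, draft_toks):
        tokens.extend(segment)
        spans.append(range(start, start + len(segment)))
        start += len(segment)
    return tokens, PromptSpans(*spans)


# Toy example with word-level "tokens" (an assumption for readability).
tokens, spans = layout_prompt(
    ["<sys>", "Translate", "to", "German", ":"],
    ["the", "cat", "sat"],
    ["die", "Katze"],
    ["saß"],
)
source_tokens = [tokens[i] for i in spans.source]
```

Because the layout is deterministic, the source span can be recomputed after every ASR update without inspecting the model.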

What carries the argument

The adapted AlignAtt policy recovered through an explicit source span in the prompt, offline-selected translation-specific alignment heads, selective qk-fast replay of the draft-to-source attention block, and runtime query/key capture.
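The replay step can be pictured with a small sketch. According to the figure captions, keys are captured for all positions and queries only for the draft rows, the row-wise softmax is taken over the whole causal context, and only the source columns are kept. The version below is a hedged illustration for one head using plain scaled dot-product attention in pure Python. It ignores rotary embeddings, grouped-query sharing, and any model-specific scaling, and the function name `draft_to_source_block` is hypothetical.

```python
import math


def draft_to_source_block(q_draft, k_all, draft_pos, source_cols):
    """Rebuild attention rows for draft tokens only, then keep source columns.

    q_draft:     one query vector per draft token
    k_all:       one key vector per prompt/draft position
    draft_pos:   absolute position of each draft row (used for the causal mask)
    source_cols: column indices of the source span
    """
    dim = len(k_all[0])
    block = []
    for q, t in zip(q_draft, draft_pos):
        # Causal mask: the row at position t sees columns 0..t only.
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in k_all[: t + 1]
        ]
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax
        total = sum(exps)
        row = [e / total for e in exps]
        # Source positions competed with all other context in the softmax;
        # only their probabilities are returned.
        block.append([row[s] for s in source_cols])
    return block


k_all = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
block = draft_to_source_block([[1.0, 0.0]], k_all, draft_pos=[3], source_cols=range(1, 3))
```

The returned rows generally sum to less than one, since the remaining mass sits on non-source columns.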

If this is right

  • The system outperforms supplied baselines for English to German and English to Italian in both the low-latency regime near two seconds and the high-latency regime below four seconds.
  • The same policy can be reapplied to stronger translation-focused decoder-only MT backbones.
  • The method is not tied to Gemma-4 and requires only a deterministic prompt layout, calibrated attention heads, and query/key capture.
  • Results for English to Chinese are more mixed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same four adaptations could be tested on other decoder-only translation models to check whether the performance gain generalizes beyond Gemma-4.
  • The approach might be combined with larger or domain-fine-tuned LLMs to improve quality at the same latency targets.
  • If the offline head selection step proves stable across language pairs, it could reduce the engineering cost of deploying AlignAtt on new decoder-only backbones.

Load-bearing premise

The four listed adaptations together suffice to recover a usable AlignAtt-style policy even though the model has no encoder-decoder cross-attention.
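To make the premise concrete, the following sketch shows one way an AlignAtt-style gate could read a reconstructed draft-to-source block. It follows two statements from the figure captions: a draft token is accepted if the peak of its row falls on the accessible side of the frontier, and the selected heads are averaged into one row whose mass on the accessible side gives a provenance score. Stopping at the first rejected token is an assumption carried over from the original AlignAtt policy, and the function name and data layout are hypothetical.

```python
def accept_draft(head_blocks, frontier):
    """Gate draft tokens against the accessible source frontier.

    head_blocks[h][t][s]: attention of draft token t to source position s, head h
    frontier: source columns 0..frontier-1 count as accessible
    Returns (number of accepted tokens, accessible mass per draft token).
    """
    n_heads = len(head_blocks)
    n_draft = len(head_blocks[0])
    accepted, masses, stopped = 0, [], False
    for t in range(n_draft):
        n_src = len(head_blocks[0][t])
        # Average the selected heads into a single row for token t.
        row = [sum(hb[t][s] for hb in head_blocks) / n_heads for s in range(n_src)]
        masses.append(sum(row[:frontier]))
        peak = max(range(n_src), key=row.__getitem__)
        if not stopped and peak < frontier:
            accepted += 1
        else:
            stopped = True  # assumed: nothing after the first rejection is emitted
    return accepted, masses


head0 = [[0.6, 0.1, 0.1, 0.0], [0.1, 0.5, 0.1, 0.1], [0.0, 0.1, 0.2, 0.5]]
head1 = [[0.4, 0.2, 0.1, 0.1], [0.1, 0.3, 0.2, 0.0], [0.1, 0.0, 0.1, 0.6]]
n_accepted, masses = accept_draft([head0, head1], frontier=3)
```

In this toy input the third draft token peaks on the last source position, which lies beyond the frontier, so only the first two tokens pass the gate.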

What would settle it

If AlignAtt4LLM does not outperform the supplied baselines on the IWSLT 2026 development set for English to German or English to Italian in the regimes around two seconds and below four seconds, the performance claim is falsified.

Figures

Figures reproduced from arXiv: 2606.03967 by Dominik Macháček, Quentin Fuxa.

Figure 1
Figure 1: Chunk-synchronous cascade, one step. Each chunk first updates the source prefix with Qwen3 forced ASR, then runs one Gemma-4 MT step. The MT prompt keeps the ASR transcript explicit inside the decoder-only causal layout, so the observer can capture the selected heads’ queries and keys on the deployed vLLM path and reconstruct only the draft-to-source block $\hat{A}^{(k)}_{D,S}$ used by the acceptance policy. Accepted …
Figure 2
Figure 2: How AlignAtt changes substrate between encoder-decoder and decoder-only models. (a) In the original encoder-decoder setting, the decoder already exposes a source-only cross-attention row, so the policy can gate tokens directly against the accessible source frontier: if the peak of the row falls on the accessible side, the draft token is accepted. (b) In our decoder-only MT setting, source and target histor…
Figure 3
Figure 3: Selective reconstruction with runtime capture. The n×n attention matrix $A^{(\ell,h)}$ is executed entirely inside the fused attention kernel, so the full n×n matrix is never materialized. For each AlignAtt head (ℓ, h) ∈ H, the observer copies $K^{(\ell,h)}$ for all positions and $Q^{(\ell,h)}$ only for the draft rows D into fixed-shape buffers. The green band $y_{1:m_k}$ marks the committed target prefix that grows across chunks; hatc…
Figure 4
Figure 4: From the reconstructed block to an acceptance decision. The selective qk-fast reconstruction produces $\hat{A}^{(\ell,h)}_{t,s}$ for (ℓ, h) ∈ H, with draft positions t against source positions s (left). The policy reads it through two parallel aggregations. Branch A averages the selected heads into a row $p_{t,\cdot}$ and sums its mass on the accessible side of the frontier, giving the provenance score $\pi^{\mathrm{acc}}_t$. Branch B returns…
Figure 5
Figure 5: Observer lifecycle on the deployed vLLM path. Left (A), once, before graph capture: the forward of GemmaAttention is patched so that for each selected head (ℓ, h) ∈ H the query and key tensors also pass through an observer custom op $\phi_{\mathrm{cap}}$ (dashed orange side path). The op writes the selected slices into fixed-shape, pre-allocated slots and returns a zero tensor that is added back into the attention output; …
Figure 6
Figure 6: Live-tail ASR reference error. Reference-error rate by distance to the current ASR tail; bands are 90% audio-bootstrap intervals. Qwen3 drops from 17.1% at the tail to 8.3% at 250 ms and then stays flat. Voxtral is shifted by +290 ms CU-LongYAAL; Gemma remains unshifted because its timestamps/LongYAAL are unreliable under prompt leakage. default going forward. Appendix A gives the comparison with the alte…
Figure 7
Figure 7: Inference-time comparison of MT capture implementations. Median latency per generated token on a fixed 16-prompt text-only suite, for a minimal Transformers eager reference, a Transformers SDPA qk-fast reference that reconstructs source rows from captured layer inputs, and the deployed vLLM qk-fast path used by the presented system. BLEU and XCOMET-XL. Compared with the supplied organizers’ no-context bas…
Figure 8
Figure 8: Synchronization regimes for ASR and MT sharing one GPU. Regime (c) is the deployed IWSLT schedule; regimes (a) and (b) illustrate less-blocking alternatives that require asynchronous request handling. [Plot labels from a head heatmap: "MT alignment heads (en→{de, it, zh} mean TS)"; "late shared-KV block (L24-L41)"; heads H0-H7; layers L0-L41; TS scale 0.00-0.85.]
Figure 9
Figure 9: Architecture-aware view of retained MT alignment heads. Retained MT heads are sparse, late, and only partly overlap with the ASR set. The row-wise softmax is taken over the concatenated prompt and draft columns, so source positions compete with all other causal context exactly as in the deployed attention row. Restricting $\tilde{A}^{(\ell,h)}_{D,\,P\cup D}$ to the source columns $\phi^{(k)}(s)$ recovers the policy-visible block …
Figure 10
Figure 10: Word-level selective reconstruction on a live MT draft. Top ribbon: prompt partition into system instruction, live source, accepted target prefix, and current draft. Left panel: reconstructed draft-to-source attention from the selected AlignAtt heads, aggregated to words and split by the dashed accessibility frontier; black stars stay on the accessible side, while the first red star marks the SOURCE-FRONT…
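The capture mechanism described in the Figure 5 caption (selected slices written into fixed-shape, pre-allocated slots, with a zero tensor added back into the attention output) can be illustrated with a toy sketch. Everything here is a stand-in: `QKObserver`, `toy_attention`, and `patched_forward` are hypothetical names, plain Python lists replace tensors, and a scalar `0.0` replaces the zero tensor. It is not the vLLM custom op.

```python
class QKObserver:
    """Toy observer with fixed-shape, pre-allocated capture slots."""

    def __init__(self, max_rows, dim):
        self.q_slots = [[0.0] * dim for _ in range(max_rows)]
        self.k_slots = [[0.0] * dim for _ in range(max_rows)]
        self.n_q = 0
        self.n_k = 0

    def capture(self, q_rows, k_rows):
        # Copy into the existing slots instead of allocating new storage.
        for slot, row in zip(self.q_slots, q_rows):
            slot[:] = row
        for slot, row in zip(self.k_slots, k_rows):
            slot[:] = row
        self.n_q = min(len(q_rows), len(self.q_slots))
        self.n_k = min(len(k_rows), len(self.k_slots))
        return 0.0  # stands in for the zero tensor


def toy_attention(q, k):
    # Stand-in for the real attention output: plain dot products.
    return [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]


def patched_forward(q, k, observer, draft_rows):
    out = toy_attention(q, k)
    # Queries are captured for the draft rows only, keys for all positions.
    zero = observer.capture([q[i] for i in draft_rows], k)
    # Adding the returned zero keeps the observer on the data path
    # while leaving the output values unchanged.
    return [[x + zero for x in row] for row in out]


q = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
obs = QKObserver(max_rows=4, dim=2)
out = patched_forward(q, k, obs, draft_rows=[2])
```

Routing the side effect through a value that feeds the output is one plausible reason for the zero-tensor design: it prevents the capture from being dropped as dead code when the graph is captured. That reading is an inference from the caption, not a statement from the paper.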
read the original abstract

We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated source transcript, and Gemma-4 E4B-it translates that prefix under an MT-side AlignAtt policy. To our knowledge, this is the first application of AlignAtt to a decoder-only LLM, where the encoder-decoder cross-attention used by earlier AlignAtt systems is absent. We recover a usable policy by proposing (1) an explicit source span in the prompt, (2) offline selection of translation-specific alignment heads, (3) selective qk-fast replay of the draft-to-source attention block, and (4) runtime query/key capture that preserves model outputs bit-identically. On the IWSLT 2026 development set, AlignAtt4LLM outperforms the supplied baselines for the European target languages, English to German and English to Italian, in both the low-latency regime around 2 seconds and the high-latency regime below 4 seconds CU-LongYAAL. Results for English to Chinese are more mixed, but the method is not tied to Gemma-4: because AlignAtt4LLM only requires a deterministic prompt layout, calibrated attention heads, and query/key capture, the same policy can be reapplied to stronger translation-focused decoder-only MT backbones for non-European target languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents AlignAtt4LLM, the first adaptation of AlignAtt to decoder-only LLMs for the IWSLT 2026 simultaneous speech translation task. It describes a synchronous cascade using Qwen3-ASR with forced alignment followed by Gemma-4 E4B-it translation under an MT-side AlignAtt policy. Four adaptations are proposed to recover an AlignAtt-style policy without encoder-decoder cross-attention: (1) explicit source span in the prompt, (2) offline selection of translation-specific alignment heads, (3) selective qk-fast replay of the draft-to-source attention block, and (4) runtime query/key capture preserving outputs bit-identically. The central empirical claim is that the system outperforms supplied baselines on the IWSLT 2026 development set for English-to-German and English-to-Italian in both the ~2s low-latency and <4s high-latency regimes under the CU-LongYAAL metric; results for English-to-Chinese are mixed, with the method presented as generalizable to other decoder-only backbones.

Significance. If the empirical results hold with proper statistical support, the work would be significant for enabling alignment-based simultaneous translation policies in decoder-only LLMs, which are increasingly dominant in MT. The explicit generality claim—that the policy depends only on deterministic prompt layout, calibrated heads, and query/key capture—is a strength, as is the focus on bit-identical output preservation. These elements support potential reuse on stronger translation-focused models.

major comments (2)
  1. [Abstract] Abstract: the claim that AlignAtt4LLM 'outperforms the supplied baselines' for En-De and En-It is presented without any quantitative results, error bars, statistical tests, or details on development-set usage or baseline definitions. This directly undermines evaluation of the central empirical claim.
  2. [Adaptations section] Description of the four adaptations (explicit source span, offline head selection, qk-fast replay, runtime capture): the central claim that these adaptations are jointly sufficient to recover a usable AlignAtt-style policy rests on the assumption that selected heads track source prefixes and that qk-fast replay preserves alignment behavior. No ablation studies, attention-map analysis, or verification that the induced policy matches original AlignAtt behavior are referenced, leaving open the possibility that reported gains arise from prompt layout or model choice rather than the recovered policy.
minor comments (2)
  1. [Abstract] The metric CU-LongYAAL is used without definition or reference in the abstract; a brief parenthetical or citation would improve clarity.
  2. [Abstract] The manuscript states results are 'more mixed' for English-to-Chinese but provides no further detail on the nature of the mixed outcomes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on the manuscript. We address each major comment below and indicate where revisions will be made to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that AlignAtt4LLM 'outperforms the supplied baselines' for En-De and En-It is presented without any quantitative results, error bars, statistical tests, or details on development-set usage or baseline definitions. This directly undermines evaluation of the central empirical claim.

    Authors: We agree that the abstract would benefit from quantitative support to allow readers to evaluate the central claim. In the revised version, we will update the abstract to include specific performance metrics (e.g., CU-LongYAAL improvements and latency values for En-De and En-It on the IWSLT 2026 development set), clarify the baseline definitions, and note that results are reported under the supplied evaluation protocol. revision: yes

  2. Referee: [Adaptations section] Description of the four adaptations (explicit source span, offline head selection, qk-fast replay, runtime capture): the central claim that these adaptations are jointly sufficient to recover a usable AlignAtt-style policy rests on the assumption that selected heads track source prefixes and that qk-fast replay preserves alignment behavior. No ablation studies, attention-map analysis, or verification that the induced policy matches original AlignAtt behavior are referenced, leaving open the possibility that reported gains arise from prompt layout or model choice rather than the recovered policy.

    Authors: We acknowledge the absence of ablation studies or attention-map analysis in the current manuscript. The empirical outperformance on the development set for En-De and En-It provides indirect support for the adaptations, and the bit-identical output preservation ensures faithful implementation. To directly address the concern, we will add a brief verification subsection in the revision that includes sample attention pattern comparisons confirming that the selected heads track source prefixes as intended. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical engineering adaptation validated externally

full rationale

The paper presents an engineering adaptation of AlignAtt to decoder-only LLMs via four listed techniques (explicit source span, offline head selection, qk-fast replay, runtime capture). Performance is reported as a direct comparison to the supplied external baselines on the IWSLT 2026 dev sets for En-De/En-It. There are no fitted parameters renamed as predictions, no derivation chain, and no self-citation invoked as a uniqueness theorem or load-bearing premise. The central claim reduces to measured latency-quality tradeoffs against independent references, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a system description with no mathematical derivation, fitted constants, or new postulated entities. All components are drawn from existing ASR and LLM models plus standard attention mechanisms.

pith-pipeline@v0.9.1-grok · 5812 in / 1279 out tokens · 26895 ms · 2026-06-28T10:28:50.024998+00:00 · methodology

A Additional ASR Analysis

The analysis below justifies the source ASR front end used by the cascade and the recommended source-tail default for future runs.

ASR front-end selection. We tested three ASR front ends during development: Qwen3-ASR with the Qwen3 forced aligner, Voxtral Realtime 4B, and a d...

Table 4: MT head-set filtering ablation on held-out word-aligned dev examples. Scores are reported in points (100×TS) against gold aligned source tokens.

Pair    Top-8 TS  All-336 TS  Gain    Aligned tokens
EN→DE   90.40     68.49       +21.91  11,209
EN→ZH   93.48     65.79       +27.70  7,582
EN→IT   91.90     67.42       +24.49  12,056

D Observer Replay and Qualitative Diagnostics

Below we report the promp...
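Table 4 compares a top-8 head set against all 336 heads, which suggests an offline ranking step. The sketch below shows one simple way such a ranking could work. The score used here (the fraction of target tokens whose attention peak lands on the gold-aligned source index) is an assumed surrogate, since the excerpt does not define TS, and `head_score` and `select_heads` are hypothetical names.

```python
def head_score(rows, gold):
    """Fraction of tokens whose attention peak hits the gold source index.

    rows[t][s]: attention of target token t to source position s
    gold[t]:    gold-aligned source index for target token t
    """
    hits = sum(
        1
        for row, g in zip(rows, gold)
        if max(range(len(row)), key=row.__getitem__) == g
    )
    return hits / len(gold)


def select_heads(attn_by_head, gold, k):
    """Rank (layer, head) pairs by the surrogate score and keep the top k."""
    scores = {head: head_score(rows, gold) for head, rows in attn_by_head.items()}
    # Sort by descending score; break ties by (layer, head) for determinism.
    ranked = sorted(scores, key=lambda h: (-scores[h], h))
    return ranked[:k], scores


attn_by_head = {
    (0, 0): [[0.9, 0.1], [0.8, 0.2]],
    (1, 3): [[0.7, 0.3], [0.2, 0.8]],
    (2, 5): [[0.4, 0.6], [0.3, 0.7]],
}
selected, scores = select_heads(attn_by_head, gold=[0, 1], k=2)
```

In practice the scores would be accumulated over a word-aligned calibration set rather than a single toy sentence.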