Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech
Pith reviewed 2026-05-10 00:27 UTC · model grok-4.3
The pith
Hierarchical policy optimization improves simultaneous speech translation from imperfect data by balancing quality and latency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a Hierarchical Policy Optimization (HPO) method, equipped with a hierarchical reward that balances translation quality and latency, can successfully post-train models that began from imperfect dialogue-form SFT data for simultaneous speech translation of unbounded input, yielding over +7 COMET and +1.25 MetricX improvement at 1.5 seconds latency across multiple target languages.
What carries the argument
Hierarchical Policy Optimization (HPO) using a hierarchical reward that separately scores quality and latency objectives within the multi-turn dialogue reformulation of SST.
If this is right
- Translation quality metrics rise substantially at controlled low latency on English-to-Chinese, German, and Japanese.
- Full KV-cache reuse remains possible while still improving output quality.
- Different quality reward formulations and segmentation strategies each contribute measurably to the gains.
- The method works starting from existing imperfect synthetic dialogue data rather than requiring new human annotations.
Where Pith is reading between the lines
- Similar hierarchical rewards could be applied to other real-time LLM tasks where training data is synthetic or noisy.
- The technique may generalize to reduce annotation costs in other latency-sensitive generation settings.
- It opens the possibility of iteratively refining multi-turn dialogue models without ever needing gold-standard data.
Load-bearing premise
A hierarchical reward that balances quality and latency can reliably improve models trained on imperfect SFT data without introducing new instabilities or requiring perfect annotations.
What would settle it
Re-running the same test sets at 1.5-second latency after applying HPO and finding no gain or a drop in COMET and MetricX scores compared with the original SFT baseline.
Figures
read the original abstract
Simultaneous speech translation (SST) generates translations while receiving partial speech input. Recent advances show that large language models (LLMs) can substantially improve SST quality, but at the cost of high computational overhead. To reduce this cost, prior work reformulates SST as a multi-turn dialogue task, enabling full reuse of the LLM's key-value (KV) cache and eliminating redundant feature recomputation. However, this approach relies on supervised fine-tuning (SFT) data in dialogue form, for which few human annotations exist, and existing synthesis methods cannot guarantee data quality. In this work, we propose a Hierarchical Policy Optimization (HPO) approach that post-train models trained on imperfect SFT data. We introduce a hierarchical reward that balances translation quality and latency objectives. Experiments on English to Chinese/German/Japanese demonstrate improvements of over +7 COMET score and +1.25 MetricX score at a latency of 1.5 seconds. Comprehensive ablation studies further validate the effectiveness of different quality rewards, hierarchical reward formulations, and segmentation strategies. Code can be found here https://github.com/owaski/HPO
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Hierarchical Policy Optimization (HPO) as a post-training method for large language models applied to simultaneous speech translation (SST) of unbounded speech. It reformulates SST as multi-turn dialogue to enable KV cache reuse, then uses HPO with a hierarchical reward balancing quality metrics (COMET, MetricX) against latency to improve models trained on imperfect SFT data. Experiments on English-to-Chinese/German/Japanese report gains exceeding +7 COMET and +1.25 MetricX at 1.5s latency, supported by ablations on reward variants, hierarchical formulations, and segmentation strategies. Code is released.
Significance. If the results hold under scrutiny, the work could meaningfully advance practical LLM-based SST by reducing recomputation overhead and enabling effective use of synthetic SFT data. The explicit release of code and the reported ablations on rewards/segmentation are strengths that aid reproducibility and allow targeted follow-up. The empirical gains at fixed low latency suggest a useful quality-latency trade-off, though the overall significance is moderated by the absence of direct tests for optimization instabilities.
major comments (2)
- [§3.2] §3.2 (Hierarchical Reward): The hierarchical reward is presented as balancing quality and latency to yield stable improvements on imperfect SFT data, but no analysis, bound, or experiment addresses whether it prevents reward hacking or policy collapse when SFT contains correlated errors (e.g., systematic under-translation or poor segmentation). This is load-bearing for the central claim that HPO reliably improves the quality-latency frontier.
- [Table 3] Table 3 / Ablation on reward variants: The reported COMET/MetricX gains for different quality rewards and hierarchical formulations are shown, yet the ablations do not include controlled tests with deliberately flawed SFT data to probe the failure mode of spurious latency reductions; this leaves the robustness of the hierarchical formulation unverified.
minor comments (2)
- [§4.1] §4.1: The latency metric definition and exact enforcement of the 1.5-second threshold during inference could be stated more explicitly to allow exact reproduction.
- [Figure 2] Figure 2: The segmentation strategy diagrams would benefit from clearer labeling of the hierarchical levels to distinguish them from the reward hierarchy.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and will incorporate revisions to strengthen the robustness analysis of the hierarchical reward.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Hierarchical Reward): The hierarchical reward is presented as balancing quality and latency to yield stable improvements on imperfect SFT data, but no analysis, bound, or experiment addresses whether it prevents reward hacking or policy collapse when SFT contains correlated errors (e.g., systematic under-translation or poor segmentation). This is load-bearing for the central claim that HPO reliably improves the quality-latency frontier.
Authors: We acknowledge that the current manuscript provides no theoretical bound or dedicated experiment isolating reward hacking under correlated SFT errors. The hierarchical formulation separates a primary quality reward (COMET/MetricX on full translations) from a secondary latency term applied only after quality thresholds, which our existing ablations show prevents trivial latency-only policies. To directly verify robustness, we will add a new experiment injecting systematic under-translation and segmentation errors into the SFT data and demonstrate that HPO still advances the quality-latency frontier without collapse or spurious reductions. revision: yes
-
Referee: [Table 3] Table 3 / Ablation on reward variants: The reported COMET/MetricX gains for different quality rewards and hierarchical formulations are shown, yet the ablations do not include controlled tests with deliberately flawed SFT data to probe the failure mode of spurious latency reductions; this leaves the robustness of the hierarchical formulation unverified.
Authors: We agree that controlled tests on deliberately flawed SFT data are needed to probe spurious latency reductions. In the revision we will extend the Table 3 ablations (or add a supplementary table) with variants where SFT data is corrupted by systematic under-translation and poor segmentation. These results will show that the hierarchical reward maintains gains without the failure mode of latency-only optimization. revision: yes
Circularity Check
No significant circularity; claims rest on empirical test-set measurements.
full rationale
The paper introduces Hierarchical Policy Optimization (HPO) as a post-training method on imperfect SFT data, using a hierarchical reward that trades off quality (COMET/MetricX) against latency. All reported gains (+7 COMET, +1.25 MetricX at 1.5 s latency) are measured on held-out English-to-Chinese/German/Japanese test sets after RL optimization. No equations, uniqueness theorems, or self-citations are invoked to derive the performance numbers; the improvements are not obtained by re-expressing fitted parameters or by renaming quantities already present in the training objective. Ablations on reward variants and segmentation strategies are likewise empirical. The derivation chain therefore terminates in external benchmark scores rather than reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- hierarchical reward weights
axioms (1)
- domain assumption Policy gradient methods can improve performance starting from imperfect supervised fine-tuning data
invented entities (1)
-
Hierarchical Policy Optimization (HPO)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Prediction improves simultaneous neural ma- chine translation. InProceedings of the 2018 Con- ference on Empirical Methods in Natural Language Processing, pages 3022–3027, Brussels, Belgium. Association for Computational Linguistics. Shanbo Cheng, Yu Bao, Qian Cao, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Wenhao Zhu, Jing- wen Chen, Zhichao Huang, T...
-
[2]
Shoutao Guo, Xiang Li, Mengge Liu, Wei Chen, and Yang Feng
xcomet: Transparent machine translation eval- uation through fine-grained error detection.Transac- tions of the Association for Computational Linguis- tics, 12:979–995. Shoutao Guo, Xiang Li, Mengge Liu, Wei Chen, and Yang Feng. 2025. Streamuni: Achieving stream- ing speech translation with a unified large speech- language model. Julia Ive, Andy Mingren L...
work page 2025
-
[3]
Exploiting multimodal reinforcement learning for simultaneous machine translation. InProceed- ings of the 16th Conference of the European Chap- ter of the Association for Computational Linguistics: Main Volume, pages 3222–3233, Online. Association for Computational Linguistics. Masoud Jalili Sabet, Philipp Dufter, François Yvon, and Hinrich Schütze. 2020....
work page 2020
-
[4]
In2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8
Yodas: Youtube-oriented dataset for audio and speech. In2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous trans- lation with implicit anticipa...
work page 2019
-
[5]
InInterspeech 2021, pages 2247–2251
Covost 2 and massively multilingual speech translation. InInterspeech 2021, pages 2247–2251. Kuang-Da Wang, Shuoyang Ding, Chao-Han Huck Yang, Ping-Chun Hsieh, Wen-Chih Peng, Vitaly Lavrukhin, and Boris Ginsburg. 2025a. Extending automatic machine translation evaluation to book- length documents. Minghan Wang, Thuy-Trang Vu, Yuxia Wang, Ehsan Shareghi, an...
work page 2021
-
[6]
Word Alignment我 本今天买了书 Speech Chunk 1Speech Chunk 2 I bought a book today 我 本今天买了书
-
[7]
Monotonize Speech Chunk 1Speech Chunk 2 我 本今天买了书Translation 1
-
[8]
Group by Chunk Translation 2 Figure 8: Data Synthesis using Qwen3-32B-AWQ (Team, 2025) 15 , with translation prompts provided in Figure 9. To ensure translation quality, we filter trans- lations with Blaser-2.0-QE (Dale and Costa- jussà, 2024)16 and MetricX-24-QE (Juraska et al., 2024)17 , keeping only long-form segments of which all utterances pass both ...
work page 2025
-
[9]
to align words in the source transcript with their counterparts in the target translation. Finally, we enforce monotonicity on the alignment and group target words that correspond to the same speech chunk. A.2 Prompt Prompt template for forward translation and VIP- LLM are shown in Figure 9 and 10. A.3 SFT Training Details We adopt a two-stage SFT procedu...
-
[10]
Training runs for up to 8k steps in Stage 1 and 2k steps in Stage 2. To increase data diversity, we randomly merge every c consecutive chunks with c∈[1,12]. A.4 Human Evaluation We provide the screenshot of web application to human annotators in Figure 11. We hired human annotators from the university lab and compen- sated them at the minimum wage rate in...
-
[11]
Key Information Recognition: Identify whether the key information in the source (e.g., proper nouns, keywords, terminologies, or sentence structures) is present in the translation
-
[12]
Correctness Assessment: Determine whether the translation accurately conveys the speaker’s intention, without misinterpretation or contextual errors
-
[13]
Yes" if the translation meets all three criteria and answer
Expressiveness Assessment: Evaluate whether the translation is fluent, clear, and intuitive to human readers. It should avoid unnecessary verbosity, ambiguous phrases, or awkward grammar. Given a source sentence and its translation, answer "Yes" if the translation meets all three criteria and answer "No" otherwise. Only output the answer, no other text. <...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.