TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

arxiv: 2510.15545 · v4 · submitted 2025-10-17 · 💻 cs.CL · cs.AI


Sibo Xiao, Jinyuan Fu, Zhongle Xie, Lidan Shou

Pith reviewed 2026-05-18 06:23 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords speculative decoding · dynamic time warping · LLM inference · token alignment · mismatched vocabularies · draft model selection · acceleration


The pith

TokenTiming uses dynamic time warping on re-encoded tokens to enable speculative decoding between any pair of off-the-shelf LLMs regardless of vocabulary mismatch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TokenTiming as a way to remove the shared-vocabulary requirement that currently restricts speculative decoding. It re-encodes the draft model's output sequence into the target's vocabulary and then applies dynamic time warping to find an alignment that lets probability distributions be transferred for the acceptance test. This change means practitioners can pick any smaller model as a draft without training a new one or forcing vocabulary changes. If the alignment preserves enough signal, the method delivers measured speedups while keeping the target model's output distribution intact. The approach turns speculative decoding from a narrow technique into one that works across the full range of available models.
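The re-encoding step described above can be pictured as "decode the draft tokens to text, then tokenize that text with the target tokenizer". The sketch below is only an illustration under stated assumptions: `greedy_tokenize`, `reencode`, and both toy vocabularies are hypothetical, and a greedy longest-match tokenizer stands in for the real BPE or SentencePiece tokenizers that actual models use.

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer (an assumption; real tokenizers differ)."""
    tokens, i = [], 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no vocabulary entry covers position {i}")
    return tokens


def reencode(draft_tokens, target_vocab):
    """Turn draft tokens into proxy target tokens via the shared surface text."""
    text = "".join(draft_tokens)          # detokenize the draft output
    return greedy_tokenize(text, target_vocab)


# Hypothetical vocabularies that split the same string differently.
DRAFT_VOCAB = {"token", "ization", "s"}
TARGET_VOCAB = {"tok", "en", "iz", "ation", "s"}

draft_tokens = greedy_tokenize("tokenizations", DRAFT_VOCAB)
proxy_target_tokens = reencode(draft_tokens, TARGET_VOCAB)
print(draft_tokens, proxy_target_tokens)
```

The two sequences spell the same text but have different lengths (3 versus 5 tokens here), which is why a separate alignment step is needed before any per-token probabilities can be compared.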

Core claim

TokenTiming re-encodes the draft token sequence into the target vocabulary, then runs dynamic time warping to produce a mapping between the two sequences so that the draft model's next-token probabilities can be used directly in the speculative sampling step of the target model, allowing correct and accelerated generation even when the models have completely different tokenizers.
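To make the alignment step concrete, here is a minimal sketch of dynamic time warping over two token sequences. It assumes plain character-level Levenshtein distance as the local cost between tokens; the paper's exact distance, window, and tie-breaking rules (its Alg. 1) are not reproduced on this page, so `levenshtein` and `dtw_align` are illustrative names rather than the authors' implementation. The token lists are copied from the Figure 2 caption.

```python
def levenshtein(s, t):
    """Character-level edit distance with unit costs."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]


def dtw_align(draft, proxy):
    """Return (total cost, aligned token pairs) for two token sequences."""
    n, m = len(draft), len(proxy)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                best = 0
            else:
                best = min(acc[i - 1][j] if i else inf,
                           acc[i][j - 1] if j else inf,
                           acc[i - 1][j - 1] if i and j else inf)
            acc[i][j] = levenshtein(draft[i], proxy[j]) + best
    # Backtrack from the last cell to the first along cheapest predecessors.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i and j:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[-1][-1], [(draft[a], proxy[b]) for a, b in path]


cost, mapping = dtw_align(["S", "cal", "ing", "Law"], ["Scale", "ing", "L", "aw"])
print(cost, mapping)
```

On this input, and under the Levenshtein-cost assumption, the sketch returns the same five pairs listed in the Figure 2 caption: two draft tokens map onto `Scale`, and the draft token `Law` maps onto both `L` and `aw`.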

What carries the argument

TokenTiming, the algorithm that re-encodes draft tokens and applies dynamic time warping to construct a probability-transfer mapping for speculative sampling.

If this is right

Any smaller off-the-shelf model can now serve as a draft model for a larger target without vocabulary matching or retraining.
Speculative decoding becomes applicable to model pairs drawn from entirely different families or training regimes.
The 1.57x speedup observed in experiments extends to a much larger set of practical model combinations.
No architectural changes or additional training steps are required on either the draft or target model.
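How far a reported speedup carries over to a new model pair depends largely on how often drafted tokens are accepted. As a rough guide, the standard analysis of speculative decoding (Leviathan et al., listed in the reference graph) gives the expected number of tokens produced per target-model pass, assuming each drafted token is accepted independently with probability α and γ tokens are drafted per iteration. The symbols α and γ and the numbers in the worked instance are illustrative assumptions, not values reported for TokenTiming.

```latex
\mathbb{E}[\text{tokens per target pass}] = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha},
\qquad
\text{e.g. } \alpha = 0.7,\ \gamma = 4:\quad \frac{1 - 0.7^{5}}{1 - 0.7} = \frac{0.83193}{0.3} \approx 2.77
```

Wall-clock speedup is lower than this token count because it also pays for the draft model's forward passes and for any alignment overhead.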

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same re-encoding-plus-warping pattern could be tested on other sequence-alignment tasks where probability transfer between models is needed.
Overhead from the DTW step might be reduced by caching common alignments or using approximate variants for longer sequences.
Mixing a very small draft model from one family with a large target from another could produce speed-accuracy trade-offs not previously accessible.
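One common way to cut DTW cost, offered here as a sketch of the "approximate variants" idea rather than anything the paper is confirmed to do, is a Sakoe-Chiba band: only cells with |i − j| ≤ window are evaluated. The function name `banded_dtw_cost`, the 0/1 mismatch cost, and the example sequences are all assumptions; whether the paper's w parameter corresponds to this band is not established on this page.

```python
import math


def banded_dtw_cost(a, b, window=None, cost=lambda x, y: 0 if x == y else 1):
    """DTW cost restricted to a band around the diagonal.

    Returns (alignment cost, number of matrix cells evaluated).
    window=None means no band, i.e. the full matrix is filled.
    """
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)
    window = max(window, abs(n - m))  # the band must reach the final cell
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0
    cells = 0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            cells += 1
            acc[i][j] = cost(a[i - 1], b[j - 1]) + min(
                acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m], cells


draft = ["the", "quick", "brown", "fox", "jumps"]
proxy = ["the", "qu", "ick", "brown", "fox", "jumps"]
full_cost, full_cells = banded_dtw_cost(draft, proxy)
band_cost, band_cells = banded_dtw_cost(draft, proxy, window=1)
print(full_cost, full_cells, band_cost, band_cells)
```

When the two tokenizations stay close to the diagonal, as in this example, the banded result matches the unconstrained one while touching fewer cells. With a band that is too narrow, the returned cost can be higher than the true optimum, which is the accuracy trade-off of this approximation.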

Load-bearing premise

The dynamic time warping alignment between re-encoded sequences produces a mapping that transfers probabilities with enough fidelity to keep speculative decoding's acceptance rate and final output distribution unchanged.
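It helps to separate the two halves of this premise. Under the standard speculative sampling rule (accept a drafted token t with probability min(1, q(t)/p(t)), otherwise resample from the normalized residual of q − min(p, q)), the emitted token follows the target distribution q for any draft distribution p; the appendix excerpt in the reference graph reaches the same conclusion. The quality of the mapping therefore mainly affects the acceptance probability β. The sketch below checks this numerically; the function name and the toy distributions are assumptions made for illustration.

```python
import math


def speculative_step_distribution(p, q):
    """Distribution of the emitted token for one speculative sampling step.

    p: draft distribution, q: target distribution (dicts token -> probability).
    Returns (output distribution, acceptance probability beta).
    """
    tokens = set(p) | set(q)
    # P(draft proposes t and it is accepted) = p(t) * min(1, q(t)/p(t)) = min(p, q)
    accept = {t: min(p.get(t, 0.0), q.get(t, 0.0)) for t in tokens}
    beta = sum(accept.values())
    reject_prob = 1.0 - beta
    out = {}
    for t in tokens:
        residual = 0.0
        if reject_prob > 0:
            # Residual distribution used after a rejection.
            residual = (q.get(t, 0.0) - accept[t]) / reject_prob
        out[t] = accept[t] + reject_prob * residual
    return out, beta


p = {"a": 0.6, "b": 0.3, "c": 0.1}   # toy draft distribution
q = {"a": 0.2, "b": 0.5, "c": 0.3}   # toy target distribution
out, beta = speculative_step_distribution(p, q)
print(out, beta)
```

Here the output matches q even though p is a poor approximation of it; what the mismatch costs is acceptance probability, which is 0.6 for these toy values.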

What would settle it

Measure acceptance rate and output correctness when running TokenTiming on model pairs whose tokenizers differ substantially; a clear drop below the rates achieved by same-vocabulary speculative decoding on the same target model would falsify the claim.

Figures

Figures reproduced from arXiv: 2510.15545 by Jinyuan Fu, Lidan Shou, Sibo Xiao, Zhongle Xie.

**Figure 2.** Phase illustrations of TokenTiming. (a) Re-tokenization of Draft Tokens into Proxy Target Tokens, which are used to construct the mapping in the DTW. (b) DTW calculation process (token distance matrix) and aligned token mapping π* = [(S, Scale), (cal, Scale), (ing, ing), (Law, L), (Law, aw)]. The calculation rules for this mapping are presented in Alg. 1. (c) Probability distribution of draft …

**Figure 4.** Speed-up vs. homogeneous-vocabulary SOTA on various target …

**Figure 5.** Performance metrics under various settings of …

**Figure 6.** Cumulative Distribution Function of Δ_id of DTW matched tokens, with w = ∞. Δ_id of DTW matched tokens converges to an upper bound.

Footnote 5: Tokenization, logic operations, etc. are performed on the CPU and may cause GPU synchronization. Appendix L shows that, on the CPU/GPU stream timeline, the blocking time introduced by TokenTiming in one decoding cycle is only 663 µs, which is trivial compar…

**Figure 7.** Number of candidate tokens and accepted tokens per iteration of the generation of examples in Appendix B.

**Figure 8.** Number of candidate tokens and accepted tokens per iteration with 4 candidate tokens per iteration of the generation examples.

**Figure 10.** CPU stack trace visualization. The process segment for target logits calculation with DTW-related …

**Figure 11.** Enlarged view of the process segment for target logits calculation with DTW-related operations in the …

**Figure 12.** Visualization interface of GPU stream trace results, showing the time distribution of inference tasks …

**Figure 13.** DTW offsets in different languages.

[…]sian and Arabic, which have mixed tokenization granularity, achieve moderate speedups of 1.09x and 1.69x. This demonstrates that the effectiveness of TokenTiming is highly correlated with the granularity of tokenization: the more semantically aligned the tokens, the more DTW can leverage efficient alignment to accelerate inference.

Original abstract

Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, thus limiting the herd of available draft models and often necessitating the training of a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose the algorithm TokenTiming for universal speculative decoding. It operates by re-encoding the draft token sequence to get a new target token sequence, and then uses DTW to build a mapping to transfer the probability distributions for speculative sampling. Benefiting from this, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining and modification. We conduct comprehensive experiments on various tasks, demonstrating 1.57x speedup. This work enables a universal approach for draft model selection, making SD a more versatile and practical tool for LLM acceleration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TokenTiming uses DTW on re-encoded sequences to enable speculative decoding across mismatched vocabularies, but the reported speedup rests on thin evidence about acceptance rates and alignment quality.

The main contribution here is a practical workaround for the same-vocabulary requirement in speculative decoding. The method re-encodes the draft model's token sequence into the target model's vocabulary, then applies dynamic time warping to produce an alignment path that transfers the draft's probability distributions for the speculative sampling step. This lets users pair any off-the-shelf models without retraining or vocabulary matching, which directly addresses a real deployment friction point.

Referee Report

3 major / 2 minor

Summary. The paper proposes TokenTiming, a DTW-based method for universal speculative decoding that accommodates mismatched vocabularies between draft and target LLMs. It re-encodes the draft-generated token sequence into the target vocabulary, applies dynamic time warping to produce an alignment path, and uses this path to transfer probability distributions for speculative sampling. The approach requires no retraining or model modification and is claimed to work with any off-the-shelf models. Comprehensive experiments across tasks are reported to yield a 1.57x speedup.

Significance. If the DTW alignment reliably preserves predictive semantics and acceptance rates, the method would meaningfully expand speculative decoding's applicability by removing the shared-vocabulary constraint, allowing broader reuse of existing models and reducing the need for custom draft-model training. The empirical speedup result, if robustly supported, would strengthen the case for practical deployment in LLM inference pipelines.

major comments (3)

Abstract: the reported 1.57x speedup is stated without accompanying details on baselines, acceptance rates, variance, or controls isolating the contribution of the DTW alignment step, leaving the central efficiency claim only partially substantiated.
Method section (DTW alignment procedure): the warping path is constructed by minimizing a distance on re-encoded sequences, yet no analysis or empirical check is provided showing that aligned positions preserve equivalent next-token predictive distributions; when vocabularies differ substantially this risks low-fidelity proposals that collapse acceptance rate and negate the speedup.
Experiments: the manuscript does not report how the re-encoding plus DTW mapping affects end-to-end correctness or compares acceptance-rate statistics against matched-vocabulary speculative decoding, which is load-bearing for the universal-SD claim.

minor comments (2)

Notation for the warping path and probability-transfer step could be illustrated with a small concrete example to improve clarity.
Related-work discussion would benefit from explicit comparison to prior vocabulary-alignment or embedding-based mapping techniques in the speculative-decoding literature.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments in detail below, indicating the revisions we plan to make to strengthen the manuscript.

Referee: Abstract: the reported 1.57x speedup is stated without accompanying details on baselines, acceptance rates, variance, or controls isolating the contribution of the DTW alignment step, leaving the central efficiency claim only partially substantiated.

Authors: We agree that providing more context in the abstract would help substantiate the claim. In the revised manuscript, we will update the abstract to briefly mention the experimental setup, including the use of standard autoregressive decoding as the baseline and the observed acceptance lengths (typically around 2-3 accepted tokens per step in our tests). Detailed variance across runs and ablations isolating DTW will remain in the experiments section due to length constraints, but we will reference them. This should make the efficiency claim more robust.
revision: yes

Referee: Method section (DTW alignment procedure): the warping path is constructed by minimizing a distance on re-encoded sequences, yet no analysis or empirical check is provided showing that aligned positions preserve equivalent next-token predictive distributions; when vocabularies differ substantially this risks low-fidelity proposals that collapse acceptance rate and negate the speedup.

Authors: This point highlights an important aspect we will address. We will add an analysis in the revised method section or a new experiments subsection. Specifically, we will report the average alignment cost and provide empirical evidence by measuring the acceptance rate as a function of vocabulary mismatch. Additionally, we will include a qualitative example showing that the re-encoded and aligned tokens maintain semantic similarity, supporting that the transferred distributions are reasonable approximations. If needed, we can discuss potential failure cases when vocabularies are extremely divergent.
revision: yes

Referee: Experiments: the manuscript does not report how the re-encoding plus DTW mapping affects end-to-end correctness or compares acceptance-rate statistics against matched-vocabulary speculative decoding, which is load-bearing for the universal-SD claim.

Authors: We recognize the value of these comparisons for validating the universal claim. Although our primary focus is on mismatched vocabulary scenarios, we will add experiments comparing TokenTiming to matched-vocabulary speculative decoding on pairs where vocabularies overlap sufficiently. We will report acceptance rate statistics and verify end-to-end correctness by ensuring that the speculative sampling produces outputs consistent with the target model's distribution. These additions will be included in the revised experiments section and appendix.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical validation of DTW-based alignment stands independent of inputs

The paper defines TokenTiming explicitly as re-encoding the draft sequence into the target vocabulary followed by DTW to produce a warping path for probability transfer in speculative sampling. This construction is presented as the method itself rather than a derived prediction. The reported 1.57x speedup is obtained from direct experimental measurement across tasks and model pairs, not from any fitted parameter, self-referential equation, or load-bearing self-citation that collapses back to the input assumptions. No uniqueness theorem, ansatz smuggling, or renaming of known results is invoked to force the outcome; the fidelity of the DTW mapping is treated as an empirical question tested by acceptance rates and wall-clock gains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the applicability of DTW to discrete token sequences after re-encoding and on the assumption that the resulting alignment preserves useful probability information.

axioms (1)

domain assumption: Dynamic time warping produces a usable alignment between token sequences from different vocabularies for probability transfer.
Invoked when the method builds the mapping to transfer distributions for speculative sampling.

pith-pipeline@v0.9.0 · 5712 in / 1153 out tokens · 28801 ms · 2026-05-18T06:23:53.649073+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

re-encodes the draft token sequence to get a new target token sequence, and then uses DTW to build a mapping to transfer the probability distributions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 2 internal anchors

[1]

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Preprint, arXiv:2401.10774. Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318. Jian Chen, Vashisth Tiwari, Ranajo...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. Preprint, arXiv:2501.12948. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Yichao Fu, Pete...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Break the sequential dependency of llm inference using lookahead decoding. arXiv preprint arXiv:2402.02057, 2024

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding. Preprint, arXiv:2402.02057. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Siqi Kou, Lanxi...

work page arXiv 2024
[4]

Eagle-2: Faster inference of language models with dynamic draft trees. arXiv preprint arXiv:2406.16858, 2024

Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR. Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024a. EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. Preprint, arXiv:2406.16858. Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024b....

work page arXiv 2025
[5]

Optimizing Speculative Decoding for Serving Large Language Models using Goodput. Preprint, arXiv:2406.14066. Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. 2024. Specinfer: Accelerat...

work page arXiv 2024
[6]

a", "b" → Target

Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623. A Consistency and Losslessness Proof. Before proceeding to the core proof, we must formalize the process by which the draft probability distribution p(t) is generated and prove its consistency across mismatched vocabularies. Le...

work page
[7]

Total Rejection Probability: $P(\text{reject}) = \sum_{t'} p(t')\left(1 - \min\left(1, \frac{q(t')}{p(t')}\right)\right) = 1 - \beta$

work page
[8]

Combined Probability: $P(\text{reject}, t^*) = P(\text{reject}) \cdot q'(t^*) = (1 - \beta) \cdot \frac{q(t^*) - \min(p(t^*), q(t^*))}{1 - \beta} = q(t^*) - \min(p(t^*), q(t^*))$. A.3.3 Final Result. Summing the two mutually exclusive paths: $P(\text{next token} = t^*) = \min(p(t^*), q(t^*)) + \left[q(t^*) - \min(p(t^*), q(t^*))\right] = q(t^*)$. The algorithm is therefore strictly lossless, regardless of the re-tokenization or mapping strateg...

work page
[9]

It is a direct mapping from states to actions

**Deterministic Policy**: - A deterministic policy always selects the same action for a given state. It is a direct mapping from states to actions. For example, π(s) = a

work page
[10]

It outputs a probability distribution over possible actions given a state

**Stochastic Policy**: - A stochastic policy, on the other hand, selects actions probabilistically. It outputs a probability distribution over possible actions given a state. This is often useful in exploration-exploitation trade-offs, where the agent might sometimes choose a suboptimal action to discover better ones

work page
[11]

These parameters can be adjusted during training to improve the policy

**Parametric Policy**: - Parametric policies are defined by a set of parameters. These parameters can be adjusted during training to improve the policy. Examples include neural networks, where the weights and biases are the parameters

work page
[12]

Instead, they might be represented by lookup tables or other structures that can grow with the data

**Non-Parametric Policy**: - Non-parametric policies do not rely on a fixed set of parameters. Instead, they might be represented by lookup tables or other structures that can grow with the data. These are less common in deep RL settings. ### Policy Optimization Policy optimization is the process of adjusting the parameters of a policy to maximize th...

work page
[13]

Policy optimization ensures that the agent learns a policy that achieves this

**Maximizing Cumulative Reward**: The primary goal in RL is to maximize the cumulative reward. Policy optimization ensures that the agent learns a policy that achieves this

work page
[14]

**Adaptability**: Through optimization, the policy can adapt to different environments and scenarios, making the agent more versatile

work page
[15]

**Handling Complex Environments**: In complex and uncertain environments, a well-optimized policy allows the agent to make informed decisions even when the outcomes are not immediately clear

work page
[16]

### Common Policy Optimization Algorithms

**Efficiency**: Efficient policy optimization algorithms enable agents to learn quickly, which is crucial in real-world applications where training time is a constraint. ### Common Policy Optimization Algorithms

work page
[17]

**REINFORCE**: A basic policy gradient algorithm that updates the policy parameters by the gradient of the expected cumulative reward

work page
[18]

**A2C (Advantage Actor-Critic)**: An extension of A3C that uses synchronous updates, making it more stable and efficient

work page
[19]

**PPO (Proximal Policy Optimization)**: A popular algorithm that constrains the policy updates to be close to the previous policy, ensuring stable training

work page
[20]

C Details of Generation. C.1 Candidate Length Strategy. We adopted the candidate sequence length calculation strategy from the official implementation of Hugging Face Transformers

**TRPO (Trust Region Policy Optimization)**: Similar to PPO but uses a more rigorous mathematical approach to constrain policy updates. C Details of Generation. C.1 Candidate Length Strategy. We adopted the candidate sequence length calculation strategy from the official implementation of Hugging Face Transformers. The rules are as follows. The calc...

work page 2023
[21]

The cost is $D(i-1, j) + 1$

Deletion: Deleting character $s_i$. The cost is $D(i-1, j) + 1$

work page
[22]

The cost is $D(i, j-1) + 1$

Insertion: Inserting character $t_j$. The cost is $D(i, j-1) + 1$

work page
[23]

The cost is $D(i-1, j-1) + \text{cost}_{\text{sub}}(s_i, t_j)$

Substitution: Replacing $s_i$ with $t_j$. The cost is $D(i-1, j-1) + \text{cost}_{\text{sub}}(s_i, t_j)$. This gives the recurrence relation:

$$D(i, j) = \min \begin{cases} D(i-1, j) + 1 & \text{(deletion)} \\ D(i, j-1) + 1 & \text{(insertion)} \\ D(i-1, j-1) + \text{cost}_{\text{sub}}(s_i, t_j) & \text{(substitution)} \end{cases}$$

Step 3: Final Result. The edit distance between the entire token s and token t is the value in the last cell of the matrix: ...

work page 2018
