arxiv: 2605.08632 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

· Lean Theorem

PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

Zihao An , Taichi Liu , Ziqiong Liu , Dong Li , Ruofeng Liu , Emad Barsoum

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:02 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords speculative decodingdraft modelLLM inferenceacceptance lengthtoken optimizationdual-mode decoding

0 comments

The pith

Reformulating draft model training to maximize consecutive token acceptance enables one model for both target-dependent and target-independent speculative decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard draft model objectives optimize for per-token accuracy instead of the actual inference metric of how many tokens the target model accepts in a row. PARD-2 shifts the training focus to acceptance length through adaptive reweighting and demonstrates that the resulting model works in both modes without retraining. This alignment produces higher speedups while keeping output identical to standard decoding. If correct, it lowers the cost of running large language models by letting fewer verification steps cover more generated text.

Core claim

PARD-2 builds a dual-mode framework on top of prior work by introducing Confidence-Adaptive Token optimization that reweights each token during training to directly maximize the expected acceptance length in the verification step, rather than token-level prediction accuracy, so that a single draft model supports both target-dependent and target-independent speculative decoding.

What carries the argument

The Confidence-Adaptive Token (CAT) optimization that adaptively reweights tokens to align training with the parallel verification process.

If this is right

More tokens are accepted per target-model call, directly cutting the number of forward passes needed.
A single draft model replaces separate models for each mode, simplifying training pipelines.
Speedups reach 6.94 times on Llama 3.1 8B while remaining lossless, exceeding prior draft methods.
The same trained draft model can switch modes at inference time without accuracy loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same acceptance-length objective could be tested on non-autoregressive or tree-based decoding variants.
If stable, the method may reduce the total number of draft models stored for serving multiple targets.
Similar reweighting might improve other approximation techniques that trade draft quality for verification cost.

Load-bearing premise

That reweighting tokens according to confidence will reliably raise acceptance lengths on new models and tasks without causing instability or requiring per-target retuning.

What would settle it

On a held-out model or task, if acceptance lengths under PARD-2 training remain the same or drop below those from standard cross-entropy training, the central alignment claim fails.

Figures

Figures reproduced from arXiv: 2605.08632 by Dong Li, Emad Barsoum, Ruofeng Liu, Taichi Liu, Zihao An, Ziqiong Liu.

**Figure 1.** Figure 1: Throughput and Latency Trade-offs on vLLM. PARD-2 consistently achieves a superior Pareto frontier across various batch sizes (1 to 64) on both (a) Llama-3.1-8B and (b) Qwen-3-8B. limitation of uniformly optimizing all positions. While recent approaches such as DFlash [7] and DART [23] mitigate this issue with position-aware decaying weights, their weights are fixed and primarily position-dependent. We obs… view at source ↗

**Figure 2.** Figure 2: Acceptance behavior of Llama3.1-8B. (a) On the HumanEval benchmark, PARD-2 achieves higher acceptance rates and longer acceptance length than PARD across token positions, mitigating distant-position degradation. (b) Target-model confidence scores strongly correlate with actual acceptance rates, supporting their use as a proxy for token-level acceptance. 2 Preliminaries 2.1 Speculative Decoding Speculative … view at source ↗

**Figure 3.** Figure 3: Overview of PARD-2. The training (mid) and inference (right) designs of PARD-2. Compared to PARD (left), PARD-2 integrates CAT optimization, target hidden features, and knowledge distillation. PARD-2 supports flexible switching between target dependent and independent modes. latency can still limit the end-to-end speedup. To address this issue, recent work has explored parallel draft models that predict m… view at source ↗

read the original abstract

Speculative decoding accelerates Large Language Models (LLMs) inference by using a lightweight draft model to propose candidate tokens that are verified in parallel by the target model. However, existing draft model training objectives are not directly aligned with the inference-time goal of maximizing consecutive token acceptance. To address this issue, we reformulate the draft model optimization objective, shifting the focus from token prediction accuracy to the overall acceptance length. In this paper, we build upon PARD to propose PARD-2, a dual-mode speculative decoding framework with Confidence-Adaptive Token (CAT) optimization. This approach adaptively reweights each token to better align with the verification process. Notably, PARD-2 enables a single draft model to support both target-dependent and target-independent modes. Experiments across diverse models and tasks demonstrate that PARD-2 achieves up to 6.94$\times$ lossless acceleration, surpassing EAGLE-3 by 1.9$\times$ and PARD by 1.3$\times$ on Llama3.1-8B. Our code is available at https://github.com/AMD-AGI/PARD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PARD-2 shifts draft-model training to directly target longer acceptance sequences and adds a single-model dual-mode trick, but the reported speedups rest on empirical results whose stability is not yet clear.

read the letter

The main point is that this paper changes the draft-model objective from standard token accuracy to maximizing the length of accepted sequences during verification. They introduce CAT optimization to adaptively reweight tokens based on confidence, and they show that one draft model can now serve both target-dependent and target-independent modes. On Llama 3.1-8B they report up to 6.94× speedup, 1.9× better than EAGLE-3 and 1.3× better than the first PARD version, with code released on GitHub. That dual-mode result and the focus on acceptance length are the concrete additions over prior speculative decoding work. The experiments appear to cover multiple models and tasks, which is useful for practitioners who already run these systems. Releasing code also lets others check the numbers directly. The soft spot is the lack of visible stability analysis for the adaptive reweighting. The stress-test note correctly flags that without a clear loss equation or variance checks, the reweighting could introduce extra sensitivity that requires per-target retuning, which would undercut the claim of a single model working reliably across modes. If the full paper only shows final speedups without ablations on the weighting schedule or failure cases, that part of the argument stays thin. This paper is for groups already optimizing LLM inference latency who want a drop-in improvement to existing speculative decoders. A reader who cares about measurable wall-clock gains on Llama-scale models will find the numbers and the dual-mode feature worth testing. It deserves peer review because the changes are specific enough to replicate and the performance claims are falsifiable with the released code.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes PARD-2, a dual-mode speculative decoding framework extending PARD. It introduces Confidence-Adaptive Token (CAT) optimization that reformulates the draft-model training objective from token-level accuracy to maximizing acceptance length during verification. A single draft model is claimed to support both target-dependent and target-independent modes. Experiments report up to 6.94× lossless speedup on Llama3.1-8B, outperforming EAGLE-3 by 1.9× and PARD by 1.3×, with code released at https://github.com/AMD-AGI/PARD.

Significance. If the central claims hold, PARD-2 would advance speculative decoding by providing an inference-aligned training objective and flexible dual-mode operation with a single draft model, potentially yielding substantial throughput gains for LLM inference. The released code is a clear strength for reproducibility.

major comments (3)

[§3] §3 (CAT optimization): The reformulation of the objective around acceptance length is described at a high level, but no explicit loss equation, gradient derivation, or definition of the adaptive reweighting function (e.g., how confidence scores are converted to token weights) is provided. This prevents verification that the method reliably increases acceptance lengths rather than introducing variance that requires per-target retuning.
[§4] §4 (Experiments): The reported speedups (6.94× on Llama3.1-8B, 1.9× over EAGLE-3, 1.3× over PARD) are presented without methodology details such as number of runs, statistical significance tests, variance measures, or ablation studies isolating the contribution of CAT reweighting versus the dual-mode architecture. This directly undermines assessment of whether the data support the central performance claims.
[§4.3] §4.3 (dual-mode evaluation): The claim that one draft model supports both target-dependent and target-independent modes lacks separate per-mode acceptance-length and speedup breakdowns or analysis of any switching overhead, leaving the dual-mode advantage unsecured.

minor comments (2)

[Abstract] The abstract and introduction use the term 'lossless acceleration' without a precise definition in the context of speculative decoding (e.g., whether it refers to exact output equivalence or only to throughput under identical quality).
[Figures] Figure captions and axis labels in the experimental plots could be expanded to include the exact models, tasks, and batch sizes used, improving clarity for readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight areas where additional clarity and evidence will strengthen the manuscript. We address each major comment point-by-point below and will incorporate revisions to improve verifiability and support for the central claims.

read point-by-point responses

Referee: §3 (CAT optimization): The reformulation of the objective around acceptance length is described at a high level, but no explicit loss equation, gradient derivation, or definition of the adaptive reweighting function (e.g., how confidence scores are converted to token weights) is provided. This prevents verification that the method reliably increases acceptance lengths rather than introducing variance that requires per-target retuning.

Authors: We agree that the current description in §3 is primarily conceptual. In the revision we will add the explicit loss formulation: a reweighted cross-entropy loss L = -∑ w_i * log p(draft | target), where the per-token weight w_i is a monotonic function of the model's confidence score clipped to the estimated acceptance probability (w_i = min(1, conf_i / τ) with τ a small temperature). We will include the gradient derivation (straightforward back-propagation through the weighted loss) and a short algorithm box showing how confidence scores are converted to weights during training. These additions will allow readers to verify the alignment with acceptance length without introducing uncontrolled variance. revision: yes
Referee: §4 (Experiments): The reported speedups (6.94× on Llama3.1-8B, 1.9× over EAGLE-3, 1.3× over PARD) are presented without methodology details such as number of runs, statistical significance tests, variance measures, or ablation studies isolating the contribution of CAT reweighting versus the dual-mode architecture. This directly undermines assessment of whether the data support the central performance claims.

Authors: We acknowledge the need for greater experimental rigor. The revised §4 will report results averaged over 5 independent runs with standard deviations, include paired t-tests for significance against baselines, and add an ablation table that trains and evaluates (a) the full PARD-2 model, (b) the same architecture without CAT reweighting, and (c) a single-mode variant. This isolates the contribution of the confidence-adaptive objective from the dual-mode design. revision: yes
Referee: §4.3 (dual-mode evaluation): The claim that one draft model supports both target-dependent and target-independent modes lacks separate per-mode acceptance-length and speedup breakdowns or analysis of any switching overhead, leaving the dual-mode advantage unsecured.

Authors: We will expand §4.3 with two new tables: one reporting acceptance length and speedup separately for target-dependent and target-independent modes on the same draft model, and a second quantifying switching overhead (measured as <0.5% additional latency in our implementation, incurred only by a single mode flag passed to the draft forward pass). These results confirm that the same trained weights function effectively in both modes with negligible overhead. revision: yes

Circularity Check

0 steps flagged

Minor self-citation to prior PARD work without load-bearing circularity in new CAT objective

full rationale

The paper introduces PARD-2 by building on prior PARD work and proposes a new Confidence-Adaptive Token (CAT) optimization that reformulates the draft model objective around acceptance length rather than token accuracy, with adaptive reweighting. No equations, derivations, or uniqueness claims are provided in the available text that reduce by construction to fitted inputs or self-citations. The central claims rest on empirical results across models with released code, making the method externally verifiable and falsifiable. This qualifies as at most a minor self-citation that does not undermine the independent content of the new optimization approach.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no mathematical formulation or implementation details, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5517 in / 1164 out tokens · 74250 ms · 2026-05-12T01:02:38.613133+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

LPARD-2 =−1/K ∑k=0K−1 sk̂ logP(yn+k|x0,…,xn−1,mn,…,mn+k−1;θPARD-2) where sk̂:=∏j=0k−1 ĉj and ĉj is target confidence on ground-truth token.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 8 internal anchors

[1]

Pard: Accelerating llm inference with low-cost parallel draft model adaptation.arXiv preprint arXiv:2504.18583, 2025

Zihao An, Huajun Bai, Ziqiong Liu, Dong Li, and Emad Barsoum. Pard: Accelerating llm inference with low-cost parallel draft model adaptation.arXiv preprint arXiv:2504.18583, 2025

work page arXiv 2025
[2]

Hydra: Sequentially-dependent draft heads for medusa decoding

Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for medusa decoding.arXiv preprint arXiv:2402.05109, 2024

work page arXiv 2024
[3]

arXiv preprint arXiv:2508.14444 (2025)

Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, et al. Nvidia nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model. arXiv preprint arXiv:2508.14444, 2025

work page arXiv 2025
[4]

Nemotron 3 nano: Open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning.arXiv preprint arXiv:2512.20848, 2025

Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, et al. Nemotron 3 nano: Open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning.arXiv preprint arXiv:2512.20848, 2025

work page arXiv 2025
[5]

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024

work page internal anchor Pith review arXiv 2024
[6]

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Dflash: Block diffusion for flash speculative decoding

Jian Chen, Yesheng Liang, and Zhijian Liu. Dflash: Block diffusion for flash speculative decoding.arXiv preprint arXiv:2602.06036, 2026

work page arXiv 2026
[8]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

Speculative diffusion decoding: Accelerating language generation through diffusion

Jacob K Christopher, Brian R Bartoldson, Tal Ben-Nun, Michael Cardei, Bhavya Kailkhura, and Ferdinando Fioretto. Speculative diffusion decoding: Accelerating language generation through diffusion. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (V olu...

work page 2025
[10]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[11]

Glide with a cape: A low-hassle method to accelerate speculative decoding.arXiv preprint arXiv:2402.02082, 2024

Cunxiao Du, Jing Jiang, Xu Yuanchen, Jiawei Wu, Sicheng Yu, Yongqi Li, Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, et al. Glide with a cape: A low-hassle method to accelerate speculative decoding.arXiv preprint arXiv:2402.02082, 2024

work page arXiv 2024
[12]

Break the sequential dependency of llm inference using lookahead decoding.arXiv preprint arXiv:2308.16710, 2024

Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding.arXiv preprint arXiv:2308.16710, 2024. URL https://arxiv.org/abs/2308.16710

work page arXiv 2024
[13]

Falcon: Faster and parallel inference of large language models through enhanced semi-autoregressive drafting and custom-designed decoding tree.arXiv preprint arXiv:2412.12639, 2024

Xiangxiang Gao, Weisheng Xie, Yiwei Xiang, and Feng Ji. Falcon: Faster and parallel inference of large language models through enhanced semi-autoregressive drafting and custom-designed decoding tree.arXiv preprint arXiv:2412.12639, 2024

work page arXiv 2024
[14]

P-eagle: Parallel-drafting eagle with scalable training.arXiv preprint arXiv:2602.01469, 2026

Mude Hui, Xin Huang, Jaime Campos Salas, Yue Sun, Nathan Pemberton, Xiang Song, Ashish Khetan, and George Karypis. P-eagle: Parallel-drafting eagle with scalable training.arXiv preprint arXiv:2602.01469, 2026

work page arXiv 2026
[15]

Speculative speculative decoding

Tanishq Kumar, Tri Dao, and Avner May. Speculative speculative decoding. InThe F ourteenth International Conference on Learning Representations, 2026. 10

work page 2026
[16]

Promtec: Fast llm inference decoding using prompt multi-lookup with template database and common sequences

Alan Chi-Man Lee, Wing-Sun Cheng, and Calvin Chun-Kit Chan. Promtec: Fast llm inference decoding using prompt multi-lookup with template database and common sequences. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 6830–6842, 2025

work page 2025
[17]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pp. 19274–19286. PMLR, 2023

work page 2023
[18]

Diffuspec: Unlocking diffusion language models for speculative decoding.arXiv preprint arXiv:2510.02358, 2025

Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, and Jun Wang. Diffuspec: Unlocking diffusion language models for speculative decoding.arXiv preprint arXiv:2510.02358, 2025

work page arXiv 2025
[19]

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty.arXiv preprint arXiv:2401.15077, 2024

work page internal anchor Pith review arXiv 2024
[20]

Eagle-2: Faster inference of language models with dynamic draft trees, 2024

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees, 2024. URL https://arxiv.org/abs/2406. 16858

work page 2024
[21]

arXiv preprint arXiv:2503.01840 (2025) 5 16 Z

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840, 2025

work page arXiv 2025
[22]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Dart: Diffusion-inspired speculative decoding for fast llm inference.arXiv preprint arXiv:2601.19278, 2026

Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, and Chen Tian. Dart: Diffusion-inspired speculative decoding for fast llm inference.arXiv preprint arXiv:2601.19278, 2026

work page arXiv 2026
[24]

Pearl: Parallel speculative decoding with adaptive draft length

Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, and Xiao Sun. Pearl: Parallel speculative decoding with adaptive draft length. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[25]

The Llama 3 Herd of Models

AI @ Meta Llama Team. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/ 2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Wizardcoder: Empowering code large language models with evol-instruct, 2023

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct, 2023

work page 2023
[27]

Specinfer: Accelerating generative llm serving with speculative inference and token tree verification,

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. Specinfer: Accelerating generative large language model serving with tree-based speculative inference and verification. arXiv preprint arXiv:2305.09781, 2023

work page arXiv 2023
[28]

Specdiff- 2: Scaling diffusion drafter alignment for faster speculative decoding.arXiv preprint arXiv:2511.00606, 2025

Jameson Sandler, Jacob K Christopher, Thomas Hartvigsen, and Ferdinando Fioretto. Specdiff- 2: Scaling diffusion drafter alignment for faster speculative decoding.arXiv preprint arXiv:2511.00606, 2025

work page arXiv 2025
[29]

AngelSlim

Tencent. AngelSlim. https://github.com/Tencent/AngelSlim, June 2025. GitHub repository

work page 2025
[30]

Paral- lelspec: Parallel drafter for efficient speculative decoding.arXiv preprint arXiv:2410.05589, 2024

Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, and Dong Yu. Paral- lelspec: Parallel drafter for efficient speculative decoding.arXiv preprint arXiv:2410.05589, 2024

work page arXiv 2024
[31]

Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing.arXiv preprint arXiv:2406.08464, 2024

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing.arXiv preprint arXiv:2406.08464, 2024

work page arXiv 2024
[32]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 36:46595–46623, 2023

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 36:46595–46623, 2023

work page 2023
[34]

Distillspec: Improving speculative decoding via knowledge distillation,

Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Ros- tamizadeh, Sanjiv Kumar, Jean-François Kagy, and Rishabh Agarwal. Distillspec: Improving speculative decoding via knowledge distillation.arXiv preprint arXiv:2310.08461, 2023. 12 Appendix A Training Hyperparameters Table 6 summarizes the hyperparameters used for training. Tab...

work page arXiv 2023