Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Gongfan Fang; Haiquan Lu; Xinchao Wang; Xinyin Ma; Zigeng Chen

arxiv: 2605.20315 · v1 · pith:MB4L66YLnew · submitted 2026-05-19 · 💻 cs.CL


Haiquan Lu, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang

Pith reviewed 2026-05-21 07:45 UTC · model grok-4.3

classification 💻 cs.CL

keywords: quantization, prefilling, decoding, agentic LLMs, inference optimization, FP4, LLM agents, phase-aware quantization


The pith

Agentic LLMs can apply FP4 quantization to the prefilling stage alone while keeping BF16 for decoding to cut compute time with little task degradation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that agentic LLM workflows, which rely on planning, tool use, and multi-turn interactions, face a major bottleneck in the prefilling stage due to long contexts and repeated inputs. It shows that this stage carries substantial redundancy for quantization, so low-precision FP4 can be used there without much accuracy loss, while full BF16 precision is retained for the decoding stage. A sympathetic reader would care because agentic systems are expanding into complex real-world tasks but remain slow and expensive on current hardware. If the separation works, inference becomes faster and more practical without redesigning the underlying models or hardware.

Core claim

The central claim is that quantizing the full inference process causes noticeable performance drops in agentic tasks, yet quantizing only the prefilling phase incurs minimal loss even though it dominates computation. This leads to the Mix-Quant approach, which applies high-throughput NVFP4 quantization during prefilling and preserves BF16 during decoding, decoupling acceleration from output quality and yielding up to 3x speedup in the prefilling phase across long-context and agentic benchmarks while largely preserving task performance.
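Because the reported 3x figure applies to prefilling only, the end-to-end gain depends on how much of total latency prefilling occupies. A minimal sketch of that Amdahl-style arithmetic follows; the function name and the prefill fractions are illustrative assumptions, not numbers from the paper.

```python
def end_to_end_speedup(prefill_fraction: float, prefill_speedup: float) -> float:
    """Amdahl-style overall speedup when only the prefill share is accelerated."""
    if not 0.0 <= prefill_fraction <= 1.0:
        raise ValueError("prefill_fraction must lie in [0, 1]")
    if prefill_speedup <= 0.0:
        raise ValueError("prefill_speedup must be positive")
    # Decoding time is unchanged; prefill time shrinks by prefill_speedup.
    remaining = (1.0 - prefill_fraction) + prefill_fraction / prefill_speedup
    return 1.0 / remaining


# Hypothetical prefill shares of total latency (assumed, not measured).
for fraction in (0.5, 0.8, 0.95):
    overall = end_to_end_speedup(fraction, 3.0)
    print(f"prefill share {fraction:.2f} -> overall speedup {overall:.2f}x")
```

The overall gain approaches the prefill speedup only as the prefill share approaches 1, which is why the claim that prefilling dominates computation matters for the practical payoff.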

What carries the argument

Mix-Quant, a phase-aware quantization framework that applies NVFP4 to the prefilling phase and BF16 to the decoding phase.
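The phase split can be illustrated with a small NumPy simulation. This is a sketch under stated assumptions, not the authors' implementation: `fake_quant_fp4`, `linear`, and the `phase` argument are hypothetical names, the block scale is kept in full precision (a simplification of the hardware format), and ordinary NumPy floats stand in for BF16.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (the sign is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # values sharing one scale; a simplified stand-in for block scaling


def fake_quant_fp4(x: np.ndarray) -> np.ndarray:
    """Round x to the E2M1 grid, one scale per block of 16 along the last axis."""
    flat = x.astype(np.float64).reshape(-1, BLOCK)
    scale = np.abs(flat).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scale[scale == 0.0] = 1.0  # all-zero block: any scale reproduces it
    scaled = np.abs(flat) / scale
    idx = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1)
    return (np.sign(flat) * E2M1_GRID[idx] * scale).reshape(x.shape)


def linear(x: np.ndarray, w: np.ndarray, phase: str) -> np.ndarray:
    """Hypothetical phase-aware linear layer: quantize only during prefill."""
    if phase == "prefill":
        # Quantize both operands in blocks along the reduction dimension.
        return fake_quant_fp4(x) @ fake_quant_fp4(w.T).T
    return x @ w  # decode path keeps high precision


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))   # 4 tokens, hidden size 64 (assumed sizes)
w = rng.standard_normal((64, 32))
ref = linear(x, w, "decode")
approx = linear(x, w, "prefill")
rel_err = np.linalg.norm(approx - ref) / np.linalg.norm(ref)
print(f"relative error of the quantized prefill matmul: {rel_err:.3f}")
```

The simulation only shows where the precision switch sits; it says nothing about speed, which in the paper comes from hardware NVFP4 execution.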

If this is right

Prefilling becomes the primary target for efficiency gains in agentic inference without requiring changes to decoding accuracy.
Existing hardware that supports NVFP4 can deliver substantial speedups for the dominant compute phase in long-context agents.
The separation allows agent workflows to scale to longer contexts or more turns while keeping output quality intact.
Performance on standard long-context and agentic benchmarks stays close to full-precision baselines under the proposed schedule.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same phase split could be tested on non-agentic long-context tasks to see whether the redundancy pattern appears outside tool-use loops.
Future hardware with even lower-precision units might extend the idea to decoding stages if error propagation remains controlled.
Integrating Mix-Quant with memory-efficient techniques for agent state could further reduce overall latency in repeated interactions.

Load-bearing premise

Errors introduced by quantizing only the prefilling stage do not accumulate across multi-step reasoning and tool-use loops to reduce overall task success.

What would settle it

Measure task completion rates on a multi-turn agentic benchmark such as long-context tool-use or planning suites when running Mix-Quant versus full BF16; a drop larger than a few percent would indicate the claim does not hold.
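A minimal sketch of that comparison, assuming per-task pass/fail outcomes are available for both configurations. The function names, the 3-point tolerance, and the outcome lists are placeholders chosen for illustration; none of them come from the paper.

```python
def success_rate(outcomes):
    """Fraction of tasks completed, given a list of booleans."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def degradation_exceeds(baseline, candidate, tolerance=0.03):
    """Return (flag, drop): flag is True when the candidate loses more than tolerance."""
    drop = success_rate(baseline) - success_rate(candidate)
    return drop > tolerance, drop


# Made-up outcomes for illustration only.
bf16_outcomes = [True] * 8 + [False] * 2       # 80% success
mixquant_outcomes = [True] * 7 + [False] * 3   # 70% success
flagged, drop = degradation_exceeds(bf16_outcomes, mixquant_outcomes)
print(f"drop = {drop:.2%}, exceeds tolerance: {flagged}")
```

On real benchmarks the tolerance would need to account for run-to-run variance, so repeated runs per configuration are advisable before drawing a conclusion.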

Figures

Figures reproduced from arXiv: 2605.20315 by Gongfan Fang, Haiquan Lu, Xinchao Wang, Xinyin Ma, Zigeng Chen.

**Figure 2.** Overview of Mix-Quant for efficient agentic LLM inference. Agentic workflows repeatedly incorporate tool outputs, memory retrievals, and intermediate results into the input context, making the prefilling stage increasingly compute-intensive. Mix-Quant adopts a phase-aware quantization strategy: the context prefilling phase is accelerated with high-throughput NVFP4 computation, while autoregressive token-…

**Figure 3.** Attention mass concentration in a 128K-token context. The top 4,096 tokens, representing only 3.125% of the full 128K-token context, account for an average of 95.8% of the total attention mass. This suggests that long-context attention is highly concentrated on a small subset of tokens. Moreover, long-context inputs often contain substantial redundancy [13]. As shown in fig. 3, only a small set of heavy…

**Figure 4.** End-to-end prefill latency speedup of Mix-Quant over the BF16 baseline on NVIDIA…
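The attention-mass statistic quoted in the Figure 3 caption can be computed along the following lines. This is a sketch with a hypothetical helper name and a synthetic attention row; it does not reproduce the paper's measurement.

```python
import numpy as np


def topk_attention_mass(attn_row, k):
    """Share of one query's total attention weight held by its k largest entries."""
    weights = np.asarray(attn_row, dtype=np.float64)
    top = np.partition(weights, -k)[-k:]  # the k largest weights, in any order
    return float(top.sum() / weights.sum())


# 4,096 tokens out of a 128K-token context is 3.125% of the context.
print(4096 / (128 * 1024))  # 0.03125

# Synthetic softmax row (assumed data): a few large logits among many small ones.
rng = np.random.default_rng(0)
logits = rng.standard_normal(1024)
logits[:32] += 8.0
row = np.exp(logits - logits.max())
row /= row.sum()
print(f"top-32 share of attention mass: {topk_attention_mass(row, 32):.3f}")
```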

Original abstract

LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can incur significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, despite being the dominant source of computation. Based on this insight, we apply high-throughput NVFP4 quantization to the prefilling phase while preserving BF16 precision for decoding. By decoupling prefilling acceleration from decoding quality, Mix-Quant combines phase-aware algorithmic quantization with hardware-efficient NVFP4 execution to alleviate the inference bottleneck in LLM agents. Extensive experiments across long-context and agentic benchmarks demonstrate that Mix-Quant largely preserves task performance while delivering significant efficiency improvements, achieving up to a 3x speedup during prefilling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Mix-Quant shows you can quantize prefilling to NVFP4 and keep decoding in BF16 to get roughly 3x prefilling speedup on agentic tasks with little measured accuracy loss.


The central point is that full FP4 quantization hurts agentic performance but applying it only to the prefilling stage does not, at least on the benchmarks they ran. They treat the prefilling phase as having enough redundancy that the quantization error stays tolerable when decoding stays in BF16. That separation is the practical move here, and the reported speedups line up with the claim that prefilling dominates the cost in long-context agent workflows. The experiments across long-context and agentic benchmarks are the main evidence, and they appear to show task scores largely preserved while delivering the efficiency gain. The observation itself is useful because it points to a simple phase split rather than a new quantization scheme from scratch.

The soft spot is the multi-turn propagation question. The abstract and stress-test note both flag that prefilling errors could affect later planning or tool-use steps even if single-pass metrics look fine, and the paper needs to demonstrate that the KV-cache or hidden-state perturbations do not compound over sequences of turns. If the experiments only report end-to-end scores without isolating error growth across multiple interactions, that leaves the central assumption under-tested.

The work is aimed at practitioners who already run agentic systems and want inference speed without new hardware. It is straightforward enough that a serious referee should see it, mainly to check the experimental controls and whether the multi-turn results hold up under closer scrutiny. I would send it to review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Mix-Quant, a phase-aware quantization framework for agentic LLMs. It observes that full-process FP4 quantization degrades performance on agentic tasks while quantizing only the prefilling stage with NVFP4 (keeping BF16 decoding) incurs minimal accuracy loss despite prefilling dominating compute. The approach is evaluated on long-context and agentic benchmarks, claiming up to 3x prefilling speedup with largely preserved task performance.

Significance. If the empirical separation between prefilling quantization tolerance and decoding precision holds under multi-turn agentic workloads, the result offers a practical route to accelerate the dominant compute phase in long-context agent inference without retraining or architectural changes. The phase-decoupling insight is simple and could generalize to other inference optimizations where early-stage redundancy exists.

major comments (2)

[Abstract and Experiments] The central claim that prefilling-only NVFP4 quantization preserves downstream agentic performance rests on the unexamined assumption that KV-cache and hidden-state perturbations remain isolated and do not accumulate across planning, tool-use, and memory-retrieval loops. The abstract and experimental sections report preserved benchmark scores, but no ablation or error-propagation analysis is provided for multi-turn sequences longer than those in the reported agentic suites; this is load-bearing for the claim that the method is suitable for realistic agent workflows.
[Experiments] Table or figure reporting agentic benchmark results (mentioned in the abstract) should include per-task breakdowns, baseline comparisons (e.g., full BF16, full FP4, and other phase-aware methods), and explicit exclusion criteria for any failed or partial trajectories. Without these, the statement that performance is 'largely preserved' cannot be rigorously evaluated against the skeptic concern of compounding errors.
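The error-propagation analysis requested in the first major comment could take a form like the following sketch: compare a BF16 reference against the quantized-prefill run turn by turn and check whether the relative error trends upward. The function names and the synthetic arrays are assumptions for illustration; real inputs would be hidden states or KV-cache slices captured at matching positions.

```python
import numpy as np


def relative_error(reference, perturbed):
    """L2 norm of the difference, relative to the L2 norm of the reference."""
    reference = np.asarray(reference, dtype=np.float64)
    perturbed = np.asarray(perturbed, dtype=np.float64)
    return float(np.linalg.norm(perturbed - reference) / np.linalg.norm(reference))


def error_growth(reference_turns, perturbed_turns):
    """Per-turn relative errors and their least-squares slope over the turn index."""
    errors = [relative_error(r, p) for r, p in zip(reference_turns, perturbed_turns)]
    slope = float(np.polyfit(np.arange(len(errors)), errors, deg=1)[0])
    return errors, slope


# Synthetic stand-in data (not from the paper): the error grows by 1% per turn.
reference_turns = [np.ones(8) for _ in range(5)]
perturbed_turns = [ref * (1.0 + 0.01 * (t + 1)) for t, ref in enumerate(reference_turns)]
errors, slope = error_growth(reference_turns, perturbed_turns)
print([round(e, 3) for e in errors], round(slope, 4))
```

A slope near zero over long trajectories would support the isolation assumption; a clearly positive slope would indicate compounding.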

minor comments (2)

[Methods] Clarify the precise definition and hardware mapping of 'NVFP4' versus standard FP4 in the methods section; the distinction is used to claim hardware-efficient execution but is not expanded in the provided abstract.
[Abstract] The abstract states 'up to a 3x speedup during prefilling'—include the exact sequence lengths, batch sizes, and hardware platform for this measurement, and report variance across runs.
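One way to report the speedup with its measurement conditions and spread is sketched below. The helper names, the configuration fields, and the latency values are placeholders, not the paper's protocol. For GPU workloads the timed callable would also need to synchronize the device before the clock is read, since kernel launches are asynchronous.

```python
import statistics
import time


def time_runs(fn, repeats=5):
    """Wall-clock seconds for each of `repeats` calls to fn()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def speedup_report(baseline_s, candidate_s, **config):
    """Ratio of mean latencies plus per-setting spread, tagged with the run config."""
    return {
        **config,
        "baseline_mean_s": statistics.mean(baseline_s),
        "baseline_std_s": statistics.stdev(baseline_s),
        "candidate_mean_s": statistics.mean(candidate_s),
        "candidate_std_s": statistics.stdev(candidate_s),
        "speedup": statistics.mean(baseline_s) / statistics.mean(candidate_s),
    }


# Placeholder latencies in seconds, made up for illustration only.
report = speedup_report(
    [3.0, 3.3, 2.7],
    [1.0, 1.1, 0.9],
    seq_len=131072,
    batch_size=1,
    hardware="unspecified",
)
print(report)
```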

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of our results.


Referee: [Abstract and Experiments] The central claim that prefilling-only NVFP4 quantization preserves downstream agentic performance rests on the unexamined assumption that KV-cache and hidden-state perturbations remain isolated and do not accumulate across planning, tool-use, and memory-retrieval loops. The abstract and experimental sections report preserved benchmark scores, but no ablation or error-propagation analysis is provided for multi-turn sequences longer than those in the reported agentic suites; this is load-bearing for the claim that the method is suitable for realistic agent workflows.

Authors: We agree that demonstrating robustness against error accumulation in extended multi-turn agentic loops is important for validating the practical applicability of Mix-Quant. Our current agentic benchmarks already incorporate multi-step planning, tool use, and memory retrieval, and the reported results show that task performance remains close to the BF16 baseline under these conditions. To directly address the referee's concern, we will add a new ablation subsection in the revised manuscript that includes error-propagation analysis on longer multi-turn sequences, quantifying the impact of KV-cache and hidden-state perturbations over extended trajectories.

revision: yes

Referee: [Experiments] Table or figure reporting agentic benchmark results (mentioned in the abstract) should include per-task breakdowns, baseline comparisons (e.g., full BF16, full FP4, and other phase-aware methods), and explicit exclusion criteria for any failed or partial trajectories. Without these, the statement that performance is 'largely preserved' cannot be rigorously evaluated against the skeptic concern of compounding errors.

Authors: We concur that granular reporting is necessary for rigorous evaluation. In the revised manuscript, we will expand the experimental results to include per-task breakdowns for all agentic benchmarks. We will also incorporate explicit baseline comparisons against full BF16, full FP4 quantization, and additional phase-aware methods where relevant. Furthermore, we will add a clear description of the evaluation protocol, including success criteria and any rules for handling failed or partial trajectories.

revision: yes

Circularity Check

0 steps flagged

No circularity: method rests on empirical observation of phase-specific redundancy


The paper advances Mix-Quant from the direct experimental finding that full-process FP4 quantization degrades agentic performance while prefilling-only quantization preserves accuracy, with no equations, fitted parameters, or derivations presented that could reduce to their own inputs. The abstract and described workflow contain no self-definitional steps, no predictions that are statistically forced by prior fits, and no load-bearing self-citations or uniqueness theorems. The central insight is framed as an observation from benchmarks rather than a constructed equivalence, rendering the approach self-contained against external empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the empirical domain assumption that prefilling exhibits quantization redundancy separable from decoding requirements; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Prefilling stage exhibits substantial quantization redundancy in agentic LLM workflows.
Stated as the key insight from investigating FP4 quantization effects on the full inference process versus phase-specific application.

pith-pipeline@v0.9.0 · 5748 in / 1245 out tokens · 67925 ms · 2026-05-21T07:45:55.271069+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 15 internal anchors

[1] Artificial Analysis Team. Artificial Analysis Long Context Reasoning Benchmark (LCR). https://artificialanalysis.ai/, 2025. Accessed: 2026-05-06

[2] Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204, 2024

[3] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-Bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025

[4] Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvalddson, Ivo Petrov, Chenhao Sun, and Martin Vechev. Beyond benchmarks: MathArena as an evaluation platform for mathematics with LLMs. 2026. URL https://arxiv.org/abs/2605.00674

[5] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems, 2022

[6] Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Bridging the gap between promise and performance for microscaling fp4 quantization, 2025. URL https://arxiv.org/abs/2509.23202

[7] Qihang Fan, Huaibo Huang, Zhiying Wu, Juqiu Wang, Bingning Wang, and Ran He. Flashprefill: Instantaneous pattern discovery and thresholding for ultra-fast long-context prefilling. arXiv preprint arXiv:2603.06199, 2026. URL https://arxiv.org/abs/2603.06199

[9] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022

[10] Google DeepMind. Gemma 4 model card. https://ai.google.dev/gemma/docs/core/model_card_4, 2026. Accessed: 2026-05-06

[11] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024

[12] Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. Advances in Neural Information Processing Systems, 37:52481–52515, 2024

[13] Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, 2024

[14] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023

[15] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

[16] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

[17] Zhen Li, Yupeng Su, Runming Yang, Congkai Xie, Zheng Wang, Zhongwei Xie, Ngai Wong, and Hongxia Yang. Quantization meets reasoning: Exploring llm low-bit quantization degradation for mathematical reasoning. arXiv preprint arXiv:2501.03035, 2025

[18] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023

[19] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024

[20] Haiquan Lu, Gongfan Fang, Xinyin Ma, Qi Li, and Xinchao Wang. Mixreasoning: Switching modes to think. arXiv preprint arXiv:2510.06052, 2025

[21] Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460, 2025

[22] Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems, 36:21702–21720, 2023

[23] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint a...

[24] NVIDIA. Introducing nvfp4 for efficient and accurate low-precision inference. https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/, 2025

[25] NVIDIA. Enhancing distributed inference performance with the nvidia inference transfer library,

[26] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023

[27] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. arXiv preprint arXiv:2311.18677, 2023

[28] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024

[29] Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025

[30] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. In International Conference on Learning Representations, 2024

[31] Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, and Yuxiong He. Swiftkv: Fast prefill-optimized inference with knowledge-preserving model transformation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25745–25764, 2025

[32] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

[33] Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

[34] Noppanat Wadlom, Junyi Shen, and Yao Lu. Efficient llm serving for agentic workflows: A data systems perspective. arXiv preprint arXiv:2603.16104, 2026

[35] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. In International Conference on Learning Representations, 2025

[36] Haoran Wu, Can Xiao, Jiayi Nie, Xuan Guo, Binglei Lou, Jeffrey TH Wong, Zhiwen Mo, Cheng Zhang, Przemyslaw Forys, Chengyang Ai, et al. Combating the memory walls: Optimization pathways for long-context agentic llm inference. arXiv preprint arXiv:2509.09505, 2025

[37] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, 2023

[38] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025

[40] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[41] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024

[42] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793, 2024

[43] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

[44] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

[45] Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, and Chuan Wu. Qspec: Speculative decoding with complementary quantization schemes. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4779–4795, 2025

[46] Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. Atom: Low-bit quantization for efficient and accurate llm serving. arXiv preprint arXiv:2310.19102, 2024

[47] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation, 2024

[48] Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294, 2024

[1] [1]

Artificial Analysis Long Context Reasoning Benchmark (LCR)

Artificial Analysis Team. Artificial Analysis Long Context Reasoning Benchmark (LCR). https://artificialanalysis.ai/, 2025. Accessed: 2026-05-06

work page 2025

[2] [2]

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench v2: Towards deeper under- standing and reasoning on realistic long-context multitasks.arXiv preprint arXiv:2412.15204, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ 2-Bench: Eval- uating conversational agents in a dual-control environment.arXiv preprint arXiv:2506.07982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

Jasper Dekoninck, Nikola Jovanovi´c, Tim Gehrunger, Kári Rögnvalddson, Ivo Petrov, Chenhao Sun, and Martin Vechev. Beyond benchmarks: Matharena as an evaluation platform for mathematics with llms. 2026. URLhttps://arxiv.org/abs/2605.00674

[5] [5]

LLM.int8(): 8-bit matrix multiplication for transformers at scale

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems, 2022

[6] [6]

Bridging the gap between promise and performance for microscaling fp4 quantization

Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Bridging the gap between promise and performance for microscaling fp4 quantization, 2025. URL https://arxiv.org/abs/2509.23202

[7] [7]

Flashprefill: Instantaneous pattern discovery and thresholding for ultra-fast long-context prefilling

Qihang Fan, Huaibo Huang, Zhiying Wu, Juqiu Wang, Bingning Wang, and Ran He. Flashprefill: Instantaneous pattern discovery and thresholding for ultra-fast long-context prefilling. arXiv preprint arXiv:2603.06199, 2026. URL https://arxiv.org/abs/2603.06199

[8] [9]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022

[9] [10]

Gemma 4 model card

Google DeepMind. Gemma 4 model card. https://ai.google.dev/gemma/docs/core/model_card_4, 2026. Accessed: 2026-05-06

[10] [11]

Minillm: Knowledge distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024

[11] [12]

Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. Advances in Neural Information Processing Systems, 37:52481–52515, 2024

[12] [13]

Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, 2024

[13] [14]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

[14] [15]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

[15] [16]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

[16] [17]

Quantization meets reasoning: Exploring llm low-bit quantization degradation for mathematical reasoning

Zhen Li, Yupeng Su, Runming Yang, Congkai Xie, Zheng Wang, Zhongwei Xie, Ngai Wong, and Hongxia Yang. Quantization meets reasoning: Exploring llm low-bit quantization degradation for mathematical reasoning. arXiv preprint arXiv:2501.03035, 2025

[17] [18]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023

[18] [19]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems, 6:87–100, 2024

[19] [20]

Mixreasoning: Switching modes to think

Haiquan Lu, Gongfan Fang, Xinyin Ma, Qi Li, and Xinchao Wang. Mixreasoning: Switching modes to think. arXiv preprint arXiv:2510.06052, 2025

[20] [21]

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460, 2025

[21] [22]

Llm-pruner: On the structural pruning of large language models

Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720, 2023

[22] [23]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint a...

[23] [24]

Introducing nvfp4 for efficient and accurate low-precision inference

NVIDIA. Introducing nvfp4 for efficient and accurate low-precision inference. https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/, 2025

[24] [25]

Enhancing distributed inference performance with the nvidia inference transfer library

NVIDIA. Enhancing distributed inference performance with the nvidia inference transfer library,

[25] [26]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023

[26] [27]

Splitwise: Efficient generative llm inference using phase splitting

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. arXiv preprint arXiv:2311.18677, 2023

[27] [28]

Splitwise: Efficient generative llm inference using phase splitting

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024

[28] [29]

The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models

Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025

[29] [30]

Yarn: Efficient context window extension of large language models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. In International Conference on Learning Representations, 2024.

[30] [31]

Swiftkv: Fast prefill-optimized inference with knowledge-preserving model transformation

Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, and Yuxiong He. Swiftkv: Fast prefill-optimized inference with knowledge-preserving model transformation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25745–25764, 2025

[31] [32]

Qwen3.5: Towards native multimodal agents

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

[32] [33]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

[33] [34]

Efficient llm serving for agentic workflows: A data systems perspective

Noppanat Wadlom, Junyi Shen, and Yao Lu. Efficient llm serving for agentic workflows: A data systems perspective. arXiv preprint arXiv:2603.16104, 2026

[34] [35]

Longmemeval: Benchmarking chat assistants on long-term interactive memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. In International Conference on Learning Representations, 2025

[35] [36]

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

Haoran Wu, Can Xiao, Jiayi Nie, Xuan Guo, Binglei Lou, Jeffrey TH Wong, Zhiwen Mo, Cheng Zhang, Przemyslaw Forys, Chengyang Ai, et al. Combating the memory walls: Optimization pathways for long-context agentic llm inference. arXiv preprint arXiv:2509.09505, 2025

[36] [37]

Smoothquant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, 2023

[37] [38]

A-MEM: Agentic Memory for LLM Agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025

[38] [40]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[39] [41]

Swe-agent: Agent-computer interfaces enable automated software engineering

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024

[40] [42]

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793, 2024

[41] [43]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

[42] [44]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

[43] [45]

Qspec: Speculative decoding with complementary quantization schemes

Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, and Chuan Wu. Qspec: Speculative decoding with complementary quantization schemes. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4779–4795, 2025.

[44] [46]

Atom: Low-bit quantization for efficient and accurate llm serving

Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. Atom: Low-bit quantization for efficient and accurate llm serving. arXiv preprint arXiv:2310.19102, 2024

[45] [47]

Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation, 2024

[46] [48]

A Survey on Efficient Inference for Large Language Models

Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294, 2024.
