
arxiv: 2605.20315 · v1 · pith:MB4L66YLnew · submitted 2026-05-19 · 💻 cs.CL

Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Pith reviewed 2026-05-21 07:45 UTC · model grok-4.3

classification 💻 cs.CL
keywords quantization · prefilling · decoding · agentic LLMs · inference optimization · FP4 · LLM agents · phase-aware quantization

The pith

Agentic LLMs can apply FP4 quantization to the prefilling stage alone while keeping BF16 for decoding to cut compute time with little task degradation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that agentic LLM workflows, which rely on planning, tool use, and multi-turn interactions, face a major bottleneck in the prefilling stage due to long contexts and repeated inputs. It shows that this stage carries substantial redundancy for quantization, so low-precision FP4 can be used there without much accuracy loss, while full BF16 precision is retained for the decoding stage. A sympathetic reader would care because agentic systems are expanding into complex real-world tasks but remain slow and expensive on current hardware. If the separation works, inference becomes faster and more practical without redesigning the underlying models or hardware.

Core claim

The central claim is that quantizing the full inference process causes noticeable performance drops in agentic tasks, yet quantizing only the prefilling phase incurs minimal loss even though it dominates computation. This leads to the Mix-Quant approach, which applies high-throughput NVFP4 quantization during prefilling and preserves BF16 during decoding, decoupling acceleration from output quality and yielding up to 3x speedup in the prefilling phase across long-context and agentic benchmarks while largely preserving task performance.

What carries the argument

Mix-Quant, a phase-aware quantization framework that applies NVFP4 to the prefilling phase and BF16 to the decoding phase.
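
The phase split can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it assumes FP4 means rounding to the E2M1 value grid with one shared scale per block of 16 elements, and it uses ordinary Python floats as a stand-in for BF16. The names `fake_quant_fp4`, `linear`, and `BLOCK` are hypothetical, and the reduced-precision scale storage used by real NVFP4 kernels is omitted.

```python
# Representable magnitudes of the 4-bit E2M1 floating-point format.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed block size: one shared scale per 16 values


def fake_quant_fp4(values, block=BLOCK):
    """Simulate FP4: scale each block so its max maps to 6.0, then round to the grid."""
    out = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / 6.0 if amax > 0 else 1.0
        for v in chunk:
            # Nearest representable magnitude, with the sign restored afterwards.
            mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append((mag if v >= 0 else -mag) * scale)
    return out


def linear(weights, x, phase):
    """Matrix-vector product; operands are FP4-rounded only when phase == 'prefill'."""
    if phase == "prefill":
        x = fake_quant_fp4(x)
        weights = [fake_quant_fp4(row) for row in weights]
    # Any other phase (e.g. 'decode') uses the unquantized operands.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


W = [[1.0, 2.0]]
x = [0.3, 0.4]
print(linear(W, x, "prefill"), linear(W, x, "decode"))
```

The same weights serve both phases; only the `phase` argument decides whether the low-precision path is taken, which mirrors the decoupling described above in spirit.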

If this is right

  • Prefilling becomes the primary target for efficiency gains in agentic inference without requiring changes to decoding accuracy.
  • Existing hardware that supports NVFP4 can deliver substantial speedups for the dominant compute phase in long-context agents.
  • The separation allows agent workflows to scale to longer contexts or more turns while keeping output quality intact.
  • Performance on standard long-context and agentic benchmarks stays close to full-precision baselines under the proposed schedule.
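
How much a prefill-only speedup helps end to end depends on the share of latency spent in prefill. A short sketch using Amdahl's law makes this concrete; the prefill shares below are illustrative assumptions, not figures from the paper, and `end_to_end_speedup` is a hypothetical helper.

```python
def end_to_end_speedup(prefill_fraction, prefill_speedup=3.0):
    """Amdahl's law: only the prefill share of total latency is accelerated."""
    if not 0.0 <= prefill_fraction <= 1.0:
        raise ValueError("prefill_fraction must lie in [0, 1]")
    return 1.0 / ((1.0 - prefill_fraction) + prefill_fraction / prefill_speedup)


# Illustrative prefill shares (assumed, not measured).
for share in (0.5, 0.8, 0.95):
    print(f"prefill share {share:.0%}: {end_to_end_speedup(share):.2f}x overall")
```

Under these assumptions, a request that spends half its time in prefill gains 1.5x overall from a 3x prefill speedup, so the benefit grows with how input-heavy the workload is.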

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same phase split could be tested on non-agentic long-context tasks to see whether the redundancy pattern appears outside tool-use loops.
  • Future hardware with even lower-precision units might extend the idea to decoding stages if error propagation remains controlled.
  • Integrating Mix-Quant with memory-efficient techniques for agent state could further reduce overall latency in repeated interactions.

Load-bearing premise

Errors introduced by quantizing only the prefilling stage do not accumulate across multi-step reasoning and tool-use loops to reduce overall task success.

What would settle it

Measure task completion rates on a multi-turn agentic benchmark, such as a long-context tool-use or planning suite, under Mix-Quant and under full BF16; a drop of more than a few percentage points would indicate that the claim does not hold.
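
A minimal sketch of that comparison, assuming each trajectory is recorded as a boolean success flag. The helper names and the 3-point threshold are hypothetical choices standing in for "a few percent", and the outcome lists are made-up illustrative data.

```python
def success_rate(outcomes):
    """Fraction of trajectories that completed the task (True = success)."""
    return sum(outcomes) / len(outcomes)


def claim_holds(bf16_outcomes, mixquant_outcomes, max_drop_points=3.0):
    """Return (drop in percentage points, whether it stays within the threshold)."""
    drop = 100.0 * (success_rate(bf16_outcomes) - success_rate(mixquant_outcomes))
    return drop, drop <= max_drop_points


# Made-up outcomes for 100 trajectories per configuration.
bf16 = [True] * 80 + [False] * 20
mixq = [True] * 78 + [False] * 22
drop, ok = claim_holds(bf16, mixq)
print(f"drop = {drop:.1f} points, within threshold: {ok}")
```

A real evaluation would also need repeated runs or confidence intervals, since a difference of a few points can fall within run-to-run noise on small benchmarks.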

Figures

Figures reproduced from arXiv: 2605.20315 by Gongfan Fang, Haiquan Lu, Xinchao Wang, Xinyin Ma, Zigeng Chen.

Figure 1: Agentic workflows are highly input-heavy, introducing substantial prefilling overhead.
Figure 2: Overview of Mix-Quant for efficient agentic LLM inference. Agentic workflows repeatedly incorporate tool outputs, memory retrievals, and intermediate results into the input context, making the prefilling stage increasingly compute-intensive. Mix-Quant adopts a phase-aware quantization strategy: the context prefilling phase is accelerated with high-throughput NVFP4 computation, while autoregressive token-…
Figure 3: Attention mass concentration in a 128K-token context. The top 4,096 tokens, representing only 3.125% of the full 128K-token context, account for an average of 95.8% of the total attention mass. This suggests that long-context attention is highly concentrated on a small subset of tokens. Moreover, long-context inputs often contain substantial redundancy [13]. As shown in fig. 3, only a small set of heavy…
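
The statistic in the Figure 3 caption can be sketched as follows. The helper `topk_attention_mass` is hypothetical, the toy weights are invented for illustration, and 128K is assumed to mean 128 × 1024 tokens.

```python
def topk_attention_mass(weights, k):
    """Share of the total attention mass held by the k largest weights."""
    total = sum(weights)
    return sum(sorted(weights, reverse=True)[:k]) / total


# Fraction of the context covered by the top 4,096 tokens (128K = 131,072 assumed).
context_fraction = 4096 / (128 * 1024)
print(f"{context_fraction:.3%} of the context")

# Toy attention row: two heavy tokens and two light ones.
toy = [0.5, 0.3, 0.1, 0.1]
print(f"top-2 mass: {topk_attention_mass(toy, 2):.0%}")
```

With the 131,072-token assumption the fraction matches the 3.125% stated in the caption; the 95.8% figure would come from averaging such top-k shares over real attention rows, which this sketch does not reproduce.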
Figure 4: End-to-end prefill latency speedup of Mix-Quant over the BF16 baseline on NVIDIA
Original abstract

LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can incur significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, despite being the dominant source of computation. Based on this insight, we apply high-throughput NVFP4 quantization to the prefilling phase while preserving BF16 precision for decoding. By decoupling prefilling acceleration from decoding quality, Mix-Quant combines phase-aware algorithmic quantization with hardware-efficient NVFP4 execution to alleviate the inference bottleneck in LLM agents. Extensive experiments across long-context and agentic benchmarks demonstrate that Mix-Quant largely preserves task performance while delivering significant efficiency improvements, achieving up to a 3x speedup during prefilling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Mix-Quant, a phase-aware quantization framework for agentic LLMs. It observes that full-process FP4 quantization degrades performance on agentic tasks while quantizing only the prefilling stage with NVFP4 (keeping BF16 decoding) incurs minimal accuracy loss despite prefilling dominating compute. The approach is evaluated on long-context and agentic benchmarks, claiming up to 3x prefilling speedup with largely preserved task performance.

Significance. If the empirical separation between prefilling quantization tolerance and decoding precision holds under multi-turn agentic workloads, the result offers a practical route to accelerate the dominant compute phase in long-context agent inference without retraining or architectural changes. The phase-decoupling insight is simple and could generalize to other inference optimizations where early-stage redundancy exists.

major comments (2)
  1. [Abstract and Experiments] The central claim that prefilling-only NVFP4 quantization preserves downstream agentic performance rests on the unexamined assumption that KV-cache and hidden-state perturbations remain isolated and do not accumulate across planning, tool-use, and memory-retrieval loops. The abstract and experimental sections report preserved benchmark scores, but no ablation or error-propagation analysis is provided for multi-turn sequences longer than those in the reported agentic suites; this is load-bearing for the claim that the method is suitable for realistic agent workflows.
  2. [Experiments] The table or figure reporting agentic benchmark results (mentioned in the abstract) should include per-task breakdowns, baseline comparisons (e.g., full BF16, full FP4, and other phase-aware methods), and explicit exclusion criteria for any failed or partial trajectories. Without these, the statement that performance is 'largely preserved' cannot be rigorously evaluated against the skeptic's concern about compounding errors.
minor comments (2)
  1. [Methods] Clarify the precise definition and hardware mapping of 'NVFP4' versus standard FP4 in the methods section; the distinction is used to claim hardware-efficient execution but is not expanded in the provided abstract.
  2. [Abstract] The abstract states 'up to a 3x speedup during prefilling'; it should include the exact sequence lengths, batch sizes, and hardware platform for this measurement, and report variance across runs.
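
The variance reporting requested in the last comment could look like the following sketch. It assumes paired latency measurements from repeated runs; `speedup_summary` is a hypothetical helper and the latencies are invented for illustration.

```python
from statistics import mean, stdev


def speedup_summary(baseline_ms, accelerated_ms):
    """Mean and sample standard deviation of per-run speedups over paired runs."""
    if len(baseline_ms) != len(accelerated_ms) or len(baseline_ms) < 2:
        raise ValueError("need at least two paired measurements")
    ratios = [b / a for b, a in zip(baseline_ms, accelerated_ms)]
    return mean(ratios), stdev(ratios)


# Invented prefill latencies in milliseconds for three paired runs.
avg, spread = speedup_summary([300.0, 310.0, 290.0], [100.0, 100.0, 100.0])
print(f"speedup {avg:.2f}x ± {spread:.2f}")
```

Reporting the mean together with the spread, and stating the sequence length, batch size, and hardware for each measurement, lets readers judge whether an "up to" figure is typical or a best case.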

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of our results.

point-by-point responses
  1. Referee: [Abstract and Experiments] The central claim that prefilling-only NVFP4 quantization preserves downstream agentic performance rests on the unexamined assumption that KV-cache and hidden-state perturbations remain isolated and do not accumulate across planning, tool-use, and memory-retrieval loops. The abstract and experimental sections report preserved benchmark scores, but no ablation or error-propagation analysis is provided for multi-turn sequences longer than those in the reported agentic suites; this is load-bearing for the claim that the method is suitable for realistic agent workflows.

    Authors: We agree that demonstrating robustness against error accumulation in extended multi-turn agentic loops is important for validating the practical applicability of Mix-Quant. Our current agentic benchmarks already incorporate multi-step planning, tool use, and memory retrieval, and the reported results show that task performance remains close to the BF16 baseline under these conditions. To directly address the referee's concern, we will add a new ablation subsection in the revised manuscript that includes error-propagation analysis on longer multi-turn sequences, quantifying the impact of KV-cache and hidden-state perturbations over extended trajectories. revision: yes

  2. Referee: [Experiments] The table or figure reporting agentic benchmark results (mentioned in the abstract) should include per-task breakdowns, baseline comparisons (e.g., full BF16, full FP4, and other phase-aware methods), and explicit exclusion criteria for any failed or partial trajectories. Without these, the statement that performance is 'largely preserved' cannot be rigorously evaluated against the skeptic's concern about compounding errors.

    Authors: We concur that granular reporting is necessary for rigorous evaluation. In the revised manuscript, we will expand the experimental results to include per-task breakdowns for all agentic benchmarks. We will also incorporate explicit baseline comparisons against full BF16, full FP4 quantization, and additional phase-aware methods where relevant. Furthermore, we will add a clear description of the evaluation protocol, including success criteria and any rules for handling failed or partial trajectories. revision: yes

Circularity Check

0 steps flagged

No circularity: method rests on empirical observation of phase-specific redundancy

full rationale

The paper advances Mix-Quant from the direct experimental finding that full-process FP4 quantization degrades agentic performance while prefilling-only quantization preserves accuracy, with no equations, fitted parameters, or derivations presented that could reduce to their own inputs. The abstract and described workflow contain no self-definitional steps, no predictions that are statistically forced by prior fits, and no load-bearing self-citations or uniqueness theorems. The central insight is framed as an observation from benchmarks rather than a constructed equivalence, rendering the approach self-contained against external empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the empirical domain assumption that prefilling exhibits quantization redundancy separable from decoding requirements; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: The prefilling stage exhibits substantial quantization redundancy in agentic LLM workflows.
    Stated as the key insight from investigating FP4 quantization effects on the full inference process versus phase-specific application.

pith-pipeline@v0.9.0 · 5748 in / 1245 out tokens · 67925 ms · 2026-05-21T07:45:55.271069+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 15 internal anchors

  1. [1]

    Artificial Analysis Long Context Reasoning Benchmark (LCR)

    Artificial Analysis Team. Artificial Analysis Long Context Reasoning Benchmark (LCR). https://artificialanalysis.ai/, 2025. Accessed: 2026-05-06

  2. [2]

    LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204, 2024

  3. [3]

    $\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

    Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-Bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025

  4. [4]

    Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

    Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvalddson, Ivo Petrov, Chenhao Sun, and Martin Vechev. Beyond benchmarks: Matharena as an evaluation platform for mathematics with llms. 2026. URL https://arxiv.org/abs/2605.00674

  5. [5]

    LLM.int8(): 8-bit matrix multiplication for transformers at scale

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems, 2022

  6. [6]

    Bridging the gap between promise and performance for microscaling fp4 quantization

    Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Bridging the gap between promise and performance for microscaling fp4 quantization, 2025. URL https://arxiv.org/abs/2509.23202

  7. [7]

    Flashprefill: Instantaneous pattern discovery and thresholding for ultra-fast long-context prefilling

    Qihang Fan, Huaibo Huang, Zhiying Wu, Juqiu Wang, Bingning Wang, and Ran He. Flashprefill: Instantaneous pattern discovery and thresholding for ultra-fast long-context prefilling. arXiv preprint arXiv:2603.06199, 2026. URL https://arxiv.org/abs/2603.06199

  8. [9]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022

  9. [10]

    Gemma 4 model card

    Google DeepMind. Gemma 4 model card. https://ai.google.dev/gemma/docs/core/model_card_4, 2026. Accessed: 2026-05-06

  10. [11]

    Minillm: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024

  11. [12]

    Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. Advances in Neural Information Processing Systems, 37:52481–52515, 2024

  12. [13]

    Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression

    Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, 2024

  13. [14]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

  14. [15]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  15. [16]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  16. [17]

    Quantization meets reasoning: Exploring llm low-bit quantization degradation for mathematical reasoning

    Zhen Li, Yupeng Su, Runming Yang, Congkai Xie, Zheng Wang, Zhongwei Xie, Ngai Wong, and Hongxia Yang. Quantization meets reasoning: Exploring llm low-bit quantization degradation for mathematical reasoning. arXiv preprint arXiv:2501.03035, 2025

  17. [18]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023

  18. [19]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems, 6:87–100, 2024

  19. [20]

    Mixreasoning: Switching modes to think

    Haiquan Lu, Gongfan Fang, Xinyin Ma, Qi Li, and Xinchao Wang. Mixreasoning: Switching modes to think. arXiv preprint arXiv:2510.06052, 2025

  20. [21]

    Large Language Model Agent: A Survey on Methodology, Applications and Challenges

    Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460, 2025

  21. [22]

    Llm-pruner: On the structural pruning of large language models

    Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720, 2023

  22. [23]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint a...

  23. [24]

    Introducing nvfp4 for efficient and accurate low-precision inference

    NVIDIA. Introducing nvfp4 for efficient and accurate low-precision inference. https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/, 2025

  24. [25]

    Enhancing distributed inference performance with the nvidia inference transfer library,

    NVIDIA. Enhancing distributed inference performance with the nvidia inference transfer library,

  25. [26]

    MemGPT: Towards LLMs as Operating Systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023

  26. [27]

    Splitwise: Efficient generative llm inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. arXiv preprint arXiv:2311.18677, 2023

  27. [28]

    Splitwise: Efficient generative llm inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024

  28. [29]

    The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models

    Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025

  29. [30]

    Yarn: Efficient context window extension of large language models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. In International Conference on Learning Representations, 2024

  30. [31]

    Swiftkv: Fast prefill-optimized inference with knowledge-preserving model transformation

    Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, and Yuxiong He. Swiftkv: Fast prefill-optimized inference with knowledge-preserving model transformation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25745–25764, 2025

  31. [32]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  32. [33]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

  33. [34]

    Efficient llm serving for agentic workflows: A data systems perspective

    Noppanat Wadlom, Junyi Shen, and Yao Lu. Efficient llm serving for agentic workflows: A data systems perspective. arXiv preprint arXiv:2603.16104, 2026

  34. [35]

    Longmemeval: Benchmarking chat assistants on long-term interactive memory

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. In International Conference on Learning Representations, 2025

  35. [36]

    Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

    Haoran Wu, Can Xiao, Jiayi Nie, Xuan Guo, Binglei Lou, Jeffrey TH Wong, Zhiwen Mo, Cheng Zhang, Przemyslaw Forys, Chengyang Ai, et al. Combating the memory walls: Optimization pathways for long-context agentic llm inference. arXiv preprint arXiv:2509.09505, 2025

  36. [37]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, 2023

  37. [38]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025

  38. [40]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  39. [41]

    Swe-agent: Agent-computer interfaces enable automated software engineering

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024

  40. [42]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793, 2024

  41. [43]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

  42. [44]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

  43. [45]

    Qspec: Speculative decoding with complementary quantization schemes

    Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, and Chuan Wu. Qspec: Speculative decoding with complementary quantization schemes. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4779–4795, 2025

  44. [46]

    Atom: Low-bit quantization for efficient and accurate llm serving

    Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. Atom: Low-bit quantization for efficient and accurate llm serving. arXiv preprint arXiv:2310.19102, 2024

  45. [47]

    Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation, 2024

  46. [48]

    A Survey on Efficient Inference for Large Language Models

    Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294, 2024