TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks

Hanzhang Shen; Haoran Wu; Robert Mullins; Yiren Zhao

arxiv: 2605.17170 · v1 · pith:VGB44FRFnew · submitted 2026-05-16 · 💻 cs.LG

TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks

Hanzhang Shen , Haoran Wu , Yiren Zhao , Robert Mullins This is my paper

Pith reviewed 2026-05-20 14:27 UTC · model grok-4.3

classification 💻 cs.LG

keywords KV cache quantizationmixed-precisionagentic inferenceLLM servinglow-precision inferencemultimodal agentstool use

0 comments

The pith

TriAxialKV matches full-precision accuracy on agentic tasks while allowing 4.5 times larger KV cache and 30 percent higher throughput by tagging tokens along three axes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic workloads process long multimodal contexts and structured tool interactions where different tokens matter more or less for final accuracy. The paper demonstrates that sensitivity to quantization varies systematically along temporal recency, input modality, and semantic role. By assigning every token a tag from the combination of these three axes and calibrating the lowest safe bit width for each tag, the method fits INT2 and INT4 values into a fixed memory budget without uniform precision. On a 32-billion-parameter vision-language model running computer-use tasks in OSWorld, the resulting cache preserves the same task success rate as 16-bit floating point while supporting much longer effective contexts. The implementation adds calibration, memory layout, and fused decode kernels so the gains appear in real end-to-end GPU throughput.

Core claim

TriAxialKV is a mixed-precision KV-cache quantization scheme that assigns each token a triaxial tag based on temporal recency, modality, and semantic role, calibrates per-tag sensitivity, and allocates INT2 or INT4 bitwidths under a fixed memory budget, achieving the same accuracy as BF16 KV cache while supporting 4.5 times larger cache size and 30 percent higher end-to-end throughput on agentic inference workloads.

What carries the argument

The triaxial tag that combines temporal recency, modality, and semantic role to select a per-token quantization bit width after per-tag calibration.

Load-bearing premise

The three axes of temporal recency, modality, and semantic role capture the dominant sources of token-level sensitivity to quantization across agentic workloads.

What would settle it

A measurable drop in OSWorld task success rate when the same Qwen3-VL-32B-Thinking agent runs with the TriAxialKV cache instead of BF16 on identical prompts and environment states.

Figures

Figures reproduced from arXiv: 2605.17170 by Hanzhang Shen, Haoran Wu, Robert Mullins, Yiren Zhao.

**Figure 2.** Figure 2: Overview of the TriAxialKV compression flow. During the prefill stage, KV entries are [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: KV cache buffer layouts. Left: per-channel INT2 keys, indexed by (page, head). Right: [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Per-request page table management. Phase 1 partitions page table entries so INT2 pointers [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: End-to-end throughput on OSWorld trajectories. TriAxialKV delivers 1.26–1.52× throughput of Triton’s. Baselines. We compare against three baselines covering the relevant prior art. SGLang [49] BF16 serves the model with full-precision KV cache and is the lossless reference. SGLang FP4 uses SGLang’s built-in FP4 KV cache backend, representing uniform low-bit floating-point quantization. We also compare wit… view at source ↗

read the original abstract

Agentic workloads have emerged as a major workload for LLM inference. They differ significantly from chat-only workloads, requiring long-context processing, the ability to handle multimodal inputs, and structured multi-turn interactions with tool calling capabilities. As a result, their context exhibits structure that can carry different importance along three key axes: temporal recency to the current turn, modality such as text or image tokens, and semantic role such as user queries, tool calls, observations, or reasoning. These axes capture distinct token behaviors and lead to different sensitivities to KV-cache compression. However, existing KV-cache quantization methods are typically homogeneous or exploit only heterogeneity on a single dimension, such as temporal proximity or modality, overlooking the interactions among them. To this end, we introduce TriAxialKV, a novel mixed-precision KV-cache quantization scheme that assigns each token a triaxial tag, calibrates per-tag sensitivity, and allocates INT2/INT4 bitwidths under a fixed memory budget. We implement TriAxialKV as an end-to-end serving system, comprising calibration, mixed-precision quantization and memory management, and custom fused Triton decode kernels. When using Qwen3-VL-32B-Thinking as a computer-use agent operating the OSWorld, TriAxialKV matches the accuracy of SGLang with BF16 KV cache while supporting 4.5$\times$ KV cache size and achieving 30% higher end-to-end throughput, when running on real GPU systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TriAxialKV tags KV tokens on temporal, modality, and role axes to pick INT2/INT4 widths per combo, claiming to match BF16 accuracy on an OSWorld agent run while cutting cache size and raising throughput.

read the letter

The main point is that this work targets KV cache pressure in agentic LLM inference by assigning each token a tag from three axes—how recent it is, whether it is text or image, and whether it is a query, tool call, observation, or reasoning step—then picking a single INT2 or INT4 width for every token that shares the same tag combination. On Qwen3-VL-32B-Thinking running as a computer-use agent in OSWorld, the system keeps task accuracy while supporting 4.5 times the context length under the same memory and delivers 30 percent higher end-to-end throughput with custom Triton kernels.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TriAxialKV, a mixed-precision KV-cache quantization scheme for agentic LLM inference. Tokens receive triaxial tags along temporal recency, modality, and semantic role; per-tag sensitivity is calibrated to assign INT2/INT4 bit-widths under a fixed memory budget. The end-to-end system includes calibration, quantization, memory management, and custom fused Triton decode kernels. On the OSWorld benchmark with Qwen3-VL-32B-Thinking as a computer-use agent, the method is reported to match SGLang BF16 accuracy while supporting 4.5× KV-cache size and 30% higher end-to-end throughput.

Significance. If the empirical results hold under rigorous validation, the triaxial tagging approach could meaningfully advance efficient inference for long-context multimodal multi-turn agentic workloads by exploiting structured token heterogeneity. The complete serving-system implementation with optimized kernels is a concrete strength that aids reproducibility and deployment.

major comments (2)

[Abstract] Abstract: the central claim that TriAxialKV matches BF16 accuracy on OSWorld is presented without any description of the calibration procedure, ablation of the three axes, error-bar reporting, or justification for the post-hoc tag definitions. These omissions are load-bearing for assessing whether the reported accuracy match is robust rather than an artifact of a particular calibration choice.
[Method] Method (triaxial tagging and bit-width allocation): the premise that a single bit-width per tag combination suffices for all tokens sharing that tag is load-bearing for both the memory-reduction and accuracy claims, yet no measurement or bound on intra-tag variance in token sensitivity is reported. Outliers within a bucket such as 'recent image observation' or 'tool-call reasoning' could still accumulate errors across multi-turn interactions.

minor comments (2)

[Abstract] The abstract would be clearer if it stated the concrete bit-width distribution or the exact memory budget used in the 4.5× scaling experiment.
[Experiments] Figure or table captions should explicitly define the triaxial tag combinations and the sensitivity metric used for calibration.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have addressed each major comment below and revised the manuscript accordingly to strengthen the presentation of our calibration details, ablations, and analysis of token sensitivity.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that TriAxialKV matches BF16 accuracy on OSWorld is presented without any description of the calibration procedure, ablation of the three axes, error-bar reporting, or justification for the post-hoc tag definitions. These omissions are load-bearing for assessing whether the reported accuracy match is robust rather than an artifact of a particular calibration choice.

Authors: We agree that the abstract would benefit from additional context to support the central accuracy claim. In the revised manuscript we have expanded the abstract with a brief description of the per-tag calibration procedure and noted that full ablations of the three axes appear in Section 4.3. We also report error bars computed over three independent runs and provide justification for the tag definitions in Section 3.1, which are derived from empirical sensitivity measurements on representative agentic traces. These additions clarify that the reported accuracy match is supported by systematic calibration rather than a single ad-hoc choice. revision: yes
Referee: [Method] Method (triaxial tagging and bit-width allocation): the premise that a single bit-width per tag combination suffices for all tokens sharing that tag is load-bearing for both the memory-reduction and accuracy claims, yet no measurement or bound on intra-tag variance in token sensitivity is reported. Outliers within a bucket such as 'recent image observation' or 'tool-call reasoning' could still accumulate errors across multi-turn interactions.

Authors: We acknowledge the importance of quantifying intra-tag variance. Our calibration procedure assigns bit-widths according to the highest observed sensitivity within each tag combination, which provides a conservative bound that protects against outliers. In the revised manuscript we have added a new subsection (3.3) that reports measured intra-tag variance across OSWorld traces; the variance is modest for the majority of tags, and the conservative allocation prevents noticeable error accumulation, as confirmed by the sustained accuracy over extended multi-turn agent trajectories in our end-to-end experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results presented as empirical measurements

full rationale

The paper introduces TriAxialKV via triaxial tagging and per-tag calibration of sensitivity to allocate mixed INT2/INT4 precision under a memory budget, then reports measured accuracy matching BF16 KV cache and 30% higher throughput on the OSWorld benchmark with Qwen3-VL-32B-Thinking. These outcomes are framed as direct experimental results from the implemented serving system rather than any derived prediction that reduces tautologically to the calibration inputs. No equations, fitted quantities renamed as predictions, or load-bearing self-citations appear in the provided description to create circularity. The method is self-contained against the external OSWorld benchmark.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The approach rests on empirical per-tag calibration whose details are not supplied and on the assumption that the chosen three axes are sufficient to group tokens by quantization sensitivity.

free parameters (1)

per-tag sensitivity thresholds
Used to decide INT2 versus INT4 allocation for each triaxial combination under the fixed memory budget.

axioms (1)

domain assumption Tokens sharing the same triaxial tag exhibit sufficiently similar sensitivity to low-precision storage that a uniform bit-width per tag is safe.
Invoked to justify assigning the same precision to all tokens of a given tag rather than token-specific decisions.

invented entities (1)

triaxial tag no independent evidence
purpose: To classify each token for mixed-precision KV-cache allocation along temporal, modality, and semantic axes.
New classification scheme introduced to capture interactions among the three dimensions.

pith-pipeline@v0.9.0 · 5806 in / 1510 out tokens · 64325 ms · 2026-05-20T14:27:37.789074+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

three key axes: temporal recency to the current turn, modality such as text or image tokens, and semantic role such as user queries, tool calls, observations, or reasoning. These axes capture distinct token behaviors
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

calibrates per-tag sensitivity, and allocates INT2/INT4 bitwidths under a fixed memory budget

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 7 internal anchors

[1]

OSWorld-Human: Benchmarking the efficiency of computer-use agents

Reyna Abhyankar, Qi Qi, and Yiying Zhang. OSWorld-Human: Benchmarking the efficiency of computer-use agents. InICML Workshop on Computer Use Agents, 2025

work page 2025
[2]

Agent s2: A compositional generalist-specialist framework for computer use agents

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents. InCOLM, 2025

work page 2025
[3]

Taming throughput-latency tradeoff in llm inference with sarathi-serve

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gula- vani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in llm inference with sarathi-serve. InOSDI, 2024

work page 2024
[4]

Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated LLMs. InNeurIPS, 2024

work page 2024
[5]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Benchmarking llm-powered chatbots: Methods and metrics.arXiv preprint arXiv:2308.04624, 2023

Debarag Banerjee, Pooja Singh, Arjun Avadhanam, and Saksham Srivastava. Benchmarking llm-powered chatbots: Methods and metrics.arXiv preprint arXiv:2308.04624, 2023

work page arXiv 2023
[7]

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods, Gabriel Hillesheim, and Abolfazl Razi. Don’t waste bits! adaptive KV-cache quantization for lightweight on-device llms.arXiv preprint arXiv:2604.04722, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. InCOLM, 2025

work page 2025
[9]

Web agents with world models: Learning and leveraging environment dynamics in web navigation

Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation. InICLR, 2025

work page 2025
[10]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[11]

FlashAttention-2: Faster attention with better parallelism and work partitioning

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2024

work page 2024
[12]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. InNeurIPS, 2022

work page 2022
[13]

Liminal: Exploring the frontiers of llm decode performance.arXiv preprint arXiv:2507.14397, 2025

Michael Davies, Neal Crago, Karthikeyan Sankaralingam, and Christos Kozyrakis. Liminal: Exploring the frontiers of llm decode performance.arXiv preprint arXiv:2507.14397, 2025

work page arXiv 2025
[14]

The falcon 3 family of open models, 2024

Falcon-LLM Team. The falcon 3 family of open models, 2024

work page 2024
[15]

Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. InNeurIPS, 2025

work page 2025
[16]

Model tells you what to discard: Adaptive kv cache compression for llms

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. InICLR, 2024. 10

work page 2024
[17]

Webvoyager: Building an end-to-end web agent with large multimodal models

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. InACL, 2024

work page 2024
[18]

Zipcache: Accurate and efficient kv cache quantization with salient token identification

Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. Zipcache: Accurate and efficient kv cache quantization with salient token identification. InNeurIPS, 2024

work page 2024
[19]

Kvquant: Towards 10 million context length llm inference with kv cache quantization

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. InNeurIPS, 2024

work page 2024
[20]

SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao, Shuaiwen Leon Song, Ben Athiwaratkun, Chenfeng Xu, Tianyi Zhang, and Xiaoxia Wu. Saw-int4: System-aware 4-bit kv-cache quantization for real-world llm serving.arXiv preprint arXiv:2604.19157, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InSOSP, 2023

work page 2023
[22]

Snapkv: Llm knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. InNeurIPS, 2024

work page 2024
[23]

Channel-aware mixed-precision quantization for efficient long- context inference

Chengxi Liao and Zeyi Wen. Channel-aware mixed-precision quantization for efficient long- context inference. InICLR, 2026

work page 2026
[24]

Qserve: W4a8kv4 quantization and system co-design for efficient llm serving

Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving. InMLSys, 2025

work page 2025
[25]

PM-KVQ: Progressive mixed-precision KV cache quantization for long-cot LLMs

Tengxuan Liu, Shiyao Li, Jiayi Yang, Tianchen Zhao, Feng Zhou, Xiaohui Song, Guohao Dai, Shengen Yan, Huazhong Yang, and Yu Wang. PM-KVQ: Progressive mixed-precision KV cache quantization for long-cot LLMs. InICLR, 2026

work page 2026
[26]

Kivi: A tuning-free asymmetric 2bit quantization for kv cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. InICML, 2024

work page 2024
[27]

Jones, Robert Mullins, Rika Antonova, and Yiren Zhao

Jiayi Nie, Haoran Wu, Yao Lai, Zeyu Cao, Cheng Zhang, Binglei Lou, Erwei Wang, Jianyi Cheng, Timothy M. Jones, Robert Mullins, Rika Antonova, and Yiren Zhao. Kernelcraft: Benchmarking for agentic close-to-metal kernel generation on emerging hardware.arXiv preprint arXiv:2603.08721, 2026

work page arXiv 2026
[28]

Splitwise: Efficient generative llm inference using phase splitting

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. InISCA, 2024

work page 2024
[29]

Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E

Shishir G. Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InICML, 2025

work page 2025
[30]

Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot. InFAST, 2025

work page 2025
[31]

Thinkv: Thought-adaptive kv cache compression for efficient reasoning models

Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, and Tushar Krishna. Thinkv: Thought-adaptive kv cache compression for efficient reasoning models. InICLR, 2026

work page 2026
[32]

Longcodebench: Evaluating coding LLMs at 1m context windows

Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, and Tatsunori Hashimoto. Longcodebench: Evaluating coding LLMs at 1m context windows. InCOLM, 2025. 11

work page 2025
[33]

Flashattention-3: Fast and accurate attention with asynchrony and low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. InNeurIPS, 2024

work page 2024
[34]

AsymKV: Enabling 1-bit quantization of KV cache with layer-wise asymmetric quantization configurations

Qian Tao, Wenyuan Yu, and Jingren Zhou. AsymKV: Enabling 1-bit quantization of KV cache with layer-wise asymmetric quantization configurations. InCOLING, 2025

work page 2025
[35]

Triton: an intermediate language and compiler for tiled neural network computations

Philippe Tillet and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. InProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019

work page 2019
[36]

VL-cache: Sparsity and modality- aware KV cache compression for vision-language model inference acceleration

Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, and Panpan Xu. VL-cache: Sparsity and modality- aware KV cache compression for vision-language model inference acceleration. InICLR, 2025

work page 2025
[37]

LOOK-M: Look-once optimization in KV cache for efficient multimodal long-context inference

Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, and Li Yuan. LOOK-M: Look-once optimization in KV cache for efficient multimodal long-context inference. InFindings of EMNLP, 2024

work page 2024
[38]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Haoran Wu, Can Xiao, Jiayi Nie, Xuan Guo, Binglei Lou, Jeffrey T. H. Wong, Zhiwen Mo, Cheng Zhang, Przemyslaw Forys, Chengyang Ai, Timi Adeniran, Wayne Luk, Hongxiang Fan, Jianyi Cheng, Timothy M. Jones, Rika Antonova, Robert Mullins, and Aaron Zhao. Combating the memory walls: Optimization pathways for long-context agentic llm inference.arXiv preprint ar...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[40]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InICLR, 2024

work page 2024
[41]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. InNeurIPS, 2024

work page 2024
[42]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Flashinfer: Efficient and customizable attention engine for llm inference serving.MLSys, 2025

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. Flashinfer: Efficient and customizable attention engine for llm inference serving.MLSys, 2025

work page 2025
[44]

Orca: A distributed serving system for transformer-based generative models

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. InOSDI, 2022

work page 2022
[45]

Flashattention-4: Algorithm and kernel pipelining co-design for asymmetric hardware scaling

Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, and Tri Dao. Flashattention-4: Algorithm and kernel pipelining co-design for asymmetric hardware scaling. InMLSys, 2026

work page 2026
[46]

MR-GSM8K: A meta-reasoning benchmark for large language model evaluation

Zhongshen Zeng, Pengguang Chen, Shu Liu, Haiyun Jiang, and Jiaya Jia. MR-GSM8K: A meta-reasoning benchmark for large language model evaluation. InICLR, 2025

work page 2025
[47]

H2o: Heavy-hitter oracle for efficient generative inference of large language models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. InNeurIPS, 2023

work page 2023
[48]

Smallkv: Small model assisted compensation of kv cache compression for efficient llm inference

Yi Zhao, Yajuan Peng, Nguyen Cam-Tu, Zuchao Li, Wang Xiaoliang, Xiaoming Fu, et al. Smallkv: Small model assisted compensation of kv cache compression for efficient llm inference. InNeurIPS, 2025. 12

work page 2025
[49]

Sglang: Efficient execution of structured language model programs

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. InNeurIPS, 2024

work page 2024
[50]

Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving. InOSDI, 2024. 13 A Greedy Bitwidth Allocation Algorithm 1Semantic-aware bit allocation Require:Tag setS; counts{N s}; distortions{D s(2), Ds(4)}; budgetB. Ensure...

work page 2024

[1] [1]

OSWorld-Human: Benchmarking the efficiency of computer-use agents

Reyna Abhyankar, Qi Qi, and Yiying Zhang. OSWorld-Human: Benchmarking the efficiency of computer-use agents. InICML Workshop on Computer Use Agents, 2025

work page 2025

[2] [2]

Agent s2: A compositional generalist-specialist framework for computer use agents

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents. InCOLM, 2025

work page 2025

[3] [3]

Taming throughput-latency tradeoff in llm inference with sarathi-serve

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gula- vani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in llm inference with sarathi-serve. InOSDI, 2024

work page 2024

[4] [4]

Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated LLMs. InNeurIPS, 2024

work page 2024

[5] [5]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Benchmarking llm-powered chatbots: Methods and metrics.arXiv preprint arXiv:2308.04624, 2023

Debarag Banerjee, Pooja Singh, Arjun Avadhanam, and Saksham Srivastava. Benchmarking llm-powered chatbots: Methods and metrics.arXiv preprint arXiv:2308.04624, 2023

work page arXiv 2023

[7] [7]

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods, Gabriel Hillesheim, and Abolfazl Razi. Don’t waste bits! adaptive KV-cache quantization for lightweight on-device llms.arXiv preprint arXiv:2604.04722, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[8] [8]

Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. InCOLM, 2025

work page 2025

[9] [9]

Web agents with world models: Learning and leveraging environment dynamics in web navigation

Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation. InICLR, 2025

work page 2025

[10] [10]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[11] [11]

FlashAttention-2: Faster attention with better parallelism and work partitioning

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2024

work page 2024

[12] [12]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. InNeurIPS, 2022

work page 2022

[13] [13]

Liminal: Exploring the frontiers of llm decode performance.arXiv preprint arXiv:2507.14397, 2025

Michael Davies, Neal Crago, Karthikeyan Sankaralingam, and Christos Kozyrakis. Liminal: Exploring the frontiers of llm decode performance.arXiv preprint arXiv:2507.14397, 2025

work page arXiv 2025

[14] [14]

The falcon 3 family of open models, 2024

Falcon-LLM Team. The falcon 3 family of open models, 2024

work page 2024

[15] [15]

Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. InNeurIPS, 2025

work page 2025

[16] [16]

Model tells you what to discard: Adaptive kv cache compression for llms

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. InICLR, 2024. 10

work page 2024

[17] [17]

Webvoyager: Building an end-to-end web agent with large multimodal models

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. InACL, 2024

work page 2024

[18] [18]

Zipcache: Accurate and efficient kv cache quantization with salient token identification

Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. Zipcache: Accurate and efficient kv cache quantization with salient token identification. InNeurIPS, 2024

work page 2024

[19] [19]

Kvquant: Towards 10 million context length llm inference with kv cache quantization

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. InNeurIPS, 2024

work page 2024

[20] [20]

SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao, Shuaiwen Leon Song, Ben Athiwaratkun, Chenfeng Xu, Tianyi Zhang, and Xiaoxia Wu. Saw-int4: System-aware 4-bit kv-cache quantization for real-world llm serving.arXiv preprint arXiv:2604.19157, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[21] [21]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InSOSP, 2023

work page 2023

[22] [22]

Snapkv: Llm knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. InNeurIPS, 2024

work page 2024

[23] [23]

Channel-aware mixed-precision quantization for efficient long- context inference

Chengxi Liao and Zeyi Wen. Channel-aware mixed-precision quantization for efficient long- context inference. InICLR, 2026

work page 2026

[24] [24]

Qserve: W4a8kv4 quantization and system co-design for efficient llm serving

Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving. InMLSys, 2025

work page 2025

[25] [25]

PM-KVQ: Progressive mixed-precision KV cache quantization for long-cot LLMs

Tengxuan Liu, Shiyao Li, Jiayi Yang, Tianchen Zhao, Feng Zhou, Xiaohui Song, Guohao Dai, Shengen Yan, Huazhong Yang, and Yu Wang. PM-KVQ: Progressive mixed-precision KV cache quantization for long-cot LLMs. InICLR, 2026

work page 2026

[26] [26]

Kivi: A tuning-free asymmetric 2bit quantization for kv cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. InICML, 2024

work page 2024

[27] [27]

Jones, Robert Mullins, Rika Antonova, and Yiren Zhao

Jiayi Nie, Haoran Wu, Yao Lai, Zeyu Cao, Cheng Zhang, Binglei Lou, Erwei Wang, Jianyi Cheng, Timothy M. Jones, Robert Mullins, Rika Antonova, and Yiren Zhao. Kernelcraft: Benchmarking for agentic close-to-metal kernel generation on emerging hardware.arXiv preprint arXiv:2603.08721, 2026

work page arXiv 2026

[28] [28]

Splitwise: Efficient generative llm inference using phase splitting

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. InISCA, 2024

work page 2024

[29] [29]

Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E

Shishir G. Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InICML, 2025

work page 2025

[30] [30]

Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot. InFAST, 2025

work page 2025

[31] [31]

Thinkv: Thought-adaptive kv cache compression for efficient reasoning models

Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, and Tushar Krishna. Thinkv: Thought-adaptive kv cache compression for efficient reasoning models. InICLR, 2026

work page 2026

[32] [32]

Longcodebench: Evaluating coding LLMs at 1m context windows

Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, and Tatsunori Hashimoto. Longcodebench: Evaluating coding LLMs at 1m context windows. InCOLM, 2025. 11

work page 2025

[33] [33]

Flashattention-3: Fast and accurate attention with asynchrony and low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. InNeurIPS, 2024

work page 2024

[34] [34]

AsymKV: Enabling 1-bit quantization of KV cache with layer-wise asymmetric quantization configurations

Qian Tao, Wenyuan Yu, and Jingren Zhou. AsymKV: Enabling 1-bit quantization of KV cache with layer-wise asymmetric quantization configurations. InCOLING, 2025

work page 2025

[35] [35]

Triton: an intermediate language and compiler for tiled neural network computations

Philippe Tillet and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. InProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019

work page 2019

[36] [36]

VL-cache: Sparsity and modality- aware KV cache compression for vision-language model inference acceleration

Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, and Panpan Xu. VL-cache: Sparsity and modality- aware KV cache compression for vision-language model inference acceleration. InICLR, 2025

work page 2025

[37] [37]

LOOK-M: Look-once optimization in KV cache for efficient multimodal long-context inference

Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, and Li Yuan. LOOK-M: Look-once optimization in KV cache for efficient multimodal long-context inference. InFindings of EMNLP, 2024

work page 2024

[38] [38]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

Haoran Wu, Can Xiao, Jiayi Nie, Xuan Guo, Binglei Lou, Jeffrey T. H. Wong, Zhiwen Mo, Cheng Zhang, Przemyslaw Forys, Chengyang Ai, Timi Adeniran, Wayne Luk, Hongxiang Fan, Jianyi Cheng, Timothy M. Jones, Rika Antonova, Robert Mullins, and Aaron Zhao. Combating the memory walls: Optimization pathways for long-context agentic llm inference.arXiv preprint ar...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[40] [40]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InICLR, 2024

work page 2024

[41] [41]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. InNeurIPS, 2024

work page 2024

[42] [42]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

Flashinfer: Efficient and customizable attention engine for llm inference serving.MLSys, 2025

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. Flashinfer: Efficient and customizable attention engine for llm inference serving.MLSys, 2025

work page 2025

[44] [44]

Orca: A distributed serving system for transformer-based generative models

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. InOSDI, 2022

work page 2022

[45] [45]

Flashattention-4: Algorithm and kernel pipelining co-design for asymmetric hardware scaling

Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, and Tri Dao. Flashattention-4: Algorithm and kernel pipelining co-design for asymmetric hardware scaling. InMLSys, 2026

work page 2026

[46] [46]

MR-GSM8K: A meta-reasoning benchmark for large language model evaluation

Zhongshen Zeng, Pengguang Chen, Shu Liu, Haiyun Jiang, and Jiaya Jia. MR-GSM8K: A meta-reasoning benchmark for large language model evaluation. InICLR, 2025

work page 2025

[47] [47]

H2o: Heavy-hitter oracle for efficient generative inference of large language models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. InNeurIPS, 2023

work page 2023

[48] [48]

Smallkv: Small model assisted compensation of kv cache compression for efficient llm inference

Yi Zhao, Yajuan Peng, Nguyen Cam-Tu, Zuchao Li, Wang Xiaoliang, Xiaoming Fu, et al. Smallkv: Small model assisted compensation of kv cache compression for efficient llm inference. InNeurIPS, 2025. 12

work page 2025

[49] [49]

Sglang: Efficient execution of structured language model programs

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. InNeurIPS, 2024

work page 2024

[50] [50]

Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving. InOSDI, 2024. 13 A Greedy Bitwidth Allocation Algorithm 1Semantic-aware bit allocation Require:Tag setS; counts{N s}; distortions{D s(2), Ds(4)}; budgetB. Ensure...

work page 2024