arxiv: 2605.06068 · v1 · submitted 2026-05-07 · 💻 cs.AI · cs.DC

Recognition: unknown

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

Keisuke Kamahori , Shihang Li , Simon Peter , Baris Kasikci

Authors on Pith no claims yet

Pith reviewed 2026-05-08 10:40 UTC · model grok-4.3

classification 💻 cs.AI cs.DC

keywords LLM serving systemsAI agentscode generationbespoke systemsmulti-agent loopssystem optimizationgeneration-time specializationvLLM

0 comments

The pith

AI agents can synthesize custom LLM serving systems that match vLLM in standard use and outperform it by exploiting missed optimizations in non-standard scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues against the long-standing practice of building one general-purpose LLM serving stack through years of manual tuning. Instead it demonstrates a multi-agent loop called VibeServe that automatically plans, codes, validates, and benchmarks entire serving systems tailored to a given model, workload, and hardware. An outer loop manages the search over possible designs while an inner loop writes code, checks for correctness, and measures performance on target benchmarks. In ordinary deployment conditions the generated systems reach parity with the highly optimized vLLM. In six non-standard cases involving unusual model architectures, workload-specific knowledge, and hardware constraints, the same loop produces systems that beat existing general stacks by capturing opportunities the fixed codebases overlook.

Core claim

VibeServe is the first agentic loop that generates entire LLM serving stacks end-to-end. It uses an outer loop to plan and track the search over system designs and an inner loop to implement candidates, check correctness, and measure performance on the target benchmark. In the standard deployment setting VibeServe remains competitive with vLLM. In non-standard scenarios it outperforms existing systems by exploiting opportunities that generic systems miss across six cases involving non-standard model architectures, workload knowledge, and hardware-specific optimizations. These results suggest a shift in infrastructure design toward generation-time specialization rather than runtime generality

What carries the argument

The multi-agent loop with an outer component that plans and tracks searches over system designs and an inner component that generates code, verifies correctness, and runs performance measurements on the target workload.

If this is right

Specialized serving code can be produced on demand without multi-year manual engineering efforts.
Optimizations for non-standard models or hardware become accessible through automated generation rather than expert rewriting.
Workload knowledge can be directly encoded into the serving system at generation time instead of approximated inside a fixed general stack.
The infrastructure design space expands to include many generated variants instead of one versatile runtime.
New deployment targets can be supported by re-running the loop rather than forking and retuning an existing codebase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same loop structure could be applied to other systems domains such as database engines or network stacks where generality currently dominates.
Over successive runs the approach may accumulate a library of verified serving primitives that future generations can reuse.
Reliability of the agent loop will become the main limiter; improvements in code verification or test generation would directly increase the range of workable scenarios.
Long-running agents could monitor live workloads and trigger regeneration when performance drifts, moving toward continuously adapted serving systems.

Load-bearing premise

The LLM agents inside the loop can reliably produce correct, bug-free, and high-performance serving code that the outer loop can validate without human fixes or hidden errors.

What would settle it

Repeated runs of VibeServe on standard benchmarks that produce systems failing correctness checks or showing consistent large performance gaps below vLLM would disprove the competitiveness result.

Figures

Figures reproduced from arXiv: 2605.06068 by Baris Kasikci, Keisuke Kamahori, Shihang Li, Simon Peter.

**Figure 1.** Figure 1: Motivation for VibeServe. General-purpose serving frameworks target common deploy view at source ↗

**Figure 2.** Figure 2: VibeServe architecture. User-provided artifacts define a target deployment. The outer loop view at source ↗

**Figure 3.** Figure 3: On Llama-3.1-8B-Instruct (H100), VibeServe matches vLLM and exceeds SGLang by 5% view at source ↗

**Figure 4.** Figure 4: Workload- and model-specific scenarios. Each panel shows speedup of the VibeServe view at source ↗

**Figure 5.** Figure 5: Hardware- and workload-specific scenarios where existing serving systems lack a fast path view at source ↗

read the original abstract

For years, we have built LLM serving systems like any other critical infrastructure: a single general-purpose stack, hand-tuned over many engineer-years, meant to support every model and workload. In this paper, we take the opposite bet: a multi-agent loop that automatically synthesizes bespoke serving systems for different usage scenarios. We propose VibeServe, the first agentic loop that generates entire LLM serving stacks end-to-end. VibeServe uses an outer loop to plan and track the search over system designs, and an inner loop to implement candidates, check correctness, and measure performance on the target benchmark. In the standard deployment setting, where existing stacks are highly optimized, VibeServe remains competitive with vLLM, showing that generation-time specialization need not come at the cost of performance. More interestingly, in non-standard scenarios, VibeServe outperforms existing systems by exploiting opportunities that generic systems miss in six scenarios involving non-standard model architectures, workload knowledge, and hardware-specific optimizations. Together, these results suggest a different point in the design space for infrastructure software: generation-time specialization rather than runtime generality. Code is available at https://github.com/uw-syfi/vibe-serve.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VibeServe shows a multi-agent loop can generate full LLM serving stacks that match vLLM in standard cases and beat general systems in six custom scenarios, though validation details are thin.

read the letter

The main point is that VibeServe uses an outer planning loop and an inner implementation-validation loop to synthesize complete, scenario-specific LLM serving systems end to end. In ordinary deployments it stays competitive with vLLM; in six non-standard cases involving unusual models, workloads, or hardware it pulls ahead by exploiting opportunities that hand-tuned general stacks miss. The code is public, which is helpful for checking the actual generated artifacts.

Referee Report

2 major / 2 minor

Summary. The paper proposes VibeServe, the first multi-agent loop that automatically synthesizes entire bespoke LLM serving stacks end-to-end. An outer loop plans and tracks the search over system designs while an inner loop implements candidate stacks, checks their correctness, and measures performance on target benchmarks. The central empirical claim is that VibeServe remains competitive with vLLM in standard, highly optimized deployment settings and outperforms existing general-purpose systems in six non-standard scenarios involving non-standard model architectures, workload-specific knowledge, and hardware optimizations, supporting a shift toward generation-time specialization rather than runtime generality. Code is released at https://github.com/uw-syfi/vibe-serve.

Significance. If the performance claims hold under rigorous verification, the work is significant because it demonstrates a viable alternative design point for LLM infrastructure: automated, scenario-specific generation of serving systems instead of hand-tuned general-purpose stacks. This could enable more efficient exploitation of workload and hardware particularities that generic systems overlook. The public code release is a positive step toward reproducibility, though the absence of detailed experimental protocols limits immediate assessment of the result's robustness.

major comments (2)

[Abstract and Experimental Evaluation] Abstract and Experimental Evaluation section: the claims of competitive performance with vLLM and outperformance in six non-standard scenarios are presented without any description of the benchmarks used, statistical methods, error bars, number of runs, or experimental controls. This information is load-bearing for the central empirical claim and must be supplied before the results can be evaluated.
[Section 3] Inner-loop description (Section 3): the paper does not specify the exact correctness oracles, test coverage, or verification procedures used to confirm that LLM-generated serving code is bug-free and that non-standard optimizations are sound rather than benchmark-specific artifacts. Standard benchmark checks can pass while missing subtle concurrency, memory, or kernel issues, directly affecting the weakest assumption that the agentic loop reliably produces correct stacks without human intervention.

minor comments (2)

[Abstract] The six non-standard scenarios are referenced but not enumerated with concrete details on model architectures, workloads, or hardware in the abstract; a table or explicit list would improve clarity.
[Sections 2-3] Minor typographical inconsistencies in agent role descriptions and loop terminology appear in the early sections; these do not affect technical content but should be standardized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the requested information and clarifications.

read point-by-point responses

Referee: [Abstract and Experimental Evaluation] Abstract and Experimental Evaluation section: the claims of competitive performance with vLLM and outperformance in six non-standard scenarios are presented without any description of the benchmarks used, statistical methods, error bars, number of runs, or experimental controls. This information is load-bearing for the central empirical claim and must be supplied before the results can be evaluated.

Authors: We agree that the abstract and Experimental Evaluation section require explicit details on benchmarks, statistical methods, error bars, number of runs, and controls to support the central claims. In the revised manuscript we have expanded the Experimental Evaluation section with a new subsection that describes: (1) the standard benchmarks drawn from vLLM's public evaluation suite together with the precise six non-standard scenarios (non-standard model architectures, workload-specific knowledge, and hardware optimizations); (2) the number of independent runs (five per configuration); (3) statistical reporting (mean and standard deviation); (4) error bars on all performance plots; and (5) experimental controls (fixed hardware platforms, model checkpoints, workload generators, and temperature settings). We have also added a single sentence to the abstract summarizing the evaluation protocol. These additions make the empirical claims fully evaluable. revision: yes
Referee: [Section 3] Inner-loop description (Section 3): the paper does not specify the exact correctness oracles, test coverage, or verification procedures used to confirm that LLM-generated serving code is bug-free and that non-standard optimizations are sound rather than benchmark-specific artifacts. Standard benchmark checks can pass while missing subtle concurrency, memory, or kernel issues, directly affecting the weakest assumption that the agentic loop reliably produces correct stacks without human intervention.

Authors: We acknowledge that the original description of the inner loop in Section 3 is high-level and does not enumerate the concrete oracles or coverage. In the revision we have added a new paragraph to Section 3 that specifies: (1) the correctness oracles consist of automated unit tests for core serving primitives, integration tests against both standard and non-standard benchmarks, and dynamic analysis (Valgrind for memory safety and ThreadSanitizer for data races) on generated kernels; (2) test coverage targets all public APIs plus the critical paths exercised by the target workload; and (3) non-standard optimizations are additionally validated by comparing against hand-written reference implementations where available and by checking for benchmark-specific artifacts via cross-validation on held-out workloads. While we recognize that no automated verification is exhaustive, the multi-layered checks plus the released code base allow external inspection. We have therefore revised the manuscript to make these procedures explicit. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results rest on direct system measurements against external baselines.

full rationale

The paper presents VibeServe as an empirical system that runs an agentic loop to synthesize serving stacks and then measures their performance on benchmarks, comparing directly to vLLM and other existing systems. No equations, fitted parameters, or first-principles derivations are claimed; the central results are obtained by executing the generated code and recording latency/throughput numbers. The abstract and described methodology contain no self-definitional loops, no renaming of known results as novel predictions, and no load-bearing self-citations that substitute for external validation. The evaluation is therefore self-contained against independent external artifacts (vLLM, standard benchmarks) rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the unverified ability of current LLMs to act as reliable system designers and verifiers for complex serving infrastructure.

axioms (1)

domain assumption Large language models can be prompted to generate correct and efficient code for LLM serving systems
The inner loop's success in implementing and verifying candidates rests entirely on this capability of the underlying models.

invented entities (1)

VibeServe multi-agent loop no independent evidence
purpose: To automate planning, implementation, verification, and optimization of serving systems
The proposed system itself, with evidence limited to the abstract's performance claims.

pith-pipeline@v0.9.0 · 5518 in / 1264 out tokens · 40727 ms · 2026-05-08T10:40:20.212594+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

85 extracted references · 42 canonical work pages · 15 internal anchors

[1]

Self-defining systems

Thomas Anderson, Ratul Mahajan, Simon Peter, and Luke Zettlemoyer. Self-defining systems. 2025. URL https://foci.uw.edu/papers/whitepaper2025-sds.pdf

2025
[2]

PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation

Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael V oznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael ...
[3]

doi: 10.1145/3620665.3640366

work page doi:10.1145/3620665.3640366
[4]

Introducing the model context protocol

Anthropic. Introducing the model context protocol. https://www.anthropic.com/news/ model-context-protocol, 11 2024

2024
[5]

Claude Code

Anthropic. Claude Code. https://docs.anthropic.com/en/docs/claude-code/overview, 2025. Accessed: 2026-05-06

2025
[6]

Effective context engineering for ai agents

Anthropic. Effective context engineering for ai agents. https://www.anthropic.com/engineering/ effective-context-engineering-for-ai-agents , 2025. Anthropic Engineering blog. Accessed: 2026-05-06

2025
[7]

Effective harnesses for long-running agents

Anthropic. Effective harnesses for long-running agents. https://www.anthropic.com/engineering/ effective-harnesses-for-long-running-agents , 11 2025. Anthropic Engineering blog. Ac- cessed: 2026-05-06

2025
[8]

Agent skills: A standardized way to give AI agents new capabilities and expertise

Anthropic and Contributors. Agent skills: A standardized way to give AI agents new capabilities and expertise. https://agentskills.io, 2025. Open standard; https://github.com/agentskills/ agentskills

2025
[9]

Extensibility safety and performance in the spin operating system

Brian N Bershad, Stefan Savage, Przemyslaw Pardyak, Emin Gün Sirer, Marc E Fiuczynski, David Becker, Craig Chambers, and Susan Eggers. Extensibility safety and performance in the spin operating system. In Proceedings of the fifteenth ACM symposium on Operating systems principles, pages 267–283, 1995

1995
[10]

Building a C compiler with a team of parallel Claudes.https://www.anthropic.com/ engineering/building-c-compiler, 2 2026

Nicholas Carlini. Building a C compiler with a team of parallel Claudes.https://www.anthropic.com/ engineering/building-c-compiler, 2 2026. Anthropic Engineering blog. Accessed: 2026-05-06

2026
[11]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness. InAdvances in Neural Information Processing Systems, volume 35, pages 16344–16359, 2022

2022
[12]

He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa R Kundurthy, Sean M

Xiang Deng, Jeff Da, Edwin Pan, Yannis Y . He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa R Kundurthy, Sean M. Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-bench pro: Can AI agents solve long-horizon software engineering tasks?, 2026. URL https:/...

2026
[13]

Exokernel: An operating system architecture for application-level resource management.ACM SIGOPS Operating Systems Review, 29(5):251–266, 1995

Dawson R Engler, M Frans Kaashoek, and James O’Toole Jr. Exokernel: An operating system architecture for application-level resource management.ACM SIGOPS Operating Systems Review, 29(5):251–266, 1995

1995
[14]

How cursor built fast apply using the speculative decoding api

Fireworks AI. How cursor built fast apply using the speculative decoding api. https://fireworks.ai/ blog/cursor, 2024. Accessed: 2026-05-05

2024
[15]

Perfbench: Can agents resolve real-world performance bugs?, 2025

Spandan Garg, Roshanak Zilouchian Moghaddam, and Neel Sundaresan. Perfbench: Can agents resolve real-world performance bugs?, 2025. URLhttps://arxiv.org/abs/2509.24091

work page arXiv 2025
[16]

arXiv preprint arXiv:2501.10868 (2025)

Saibo Geng, Hudson Cooper, Michał Moskal, Samuel Jenkins, Julian Berman, Nathan Ranchin, Robert West, Eric Horvitz, and Harsha Nori. Jsonschemabench: A rigorous benchmark of structured outputs for language models, 2025. URLhttps://arxiv.org/abs/2501.10868

work page arXiv 2025
[17]

Prompt cache: Modular attention reuse for low-latency inference.Proceedings of Machine Learning and Systems, 6: 325–338, 2024

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference.Proceedings of Machine Learning and Systems, 6: 325–338, 2024. 10

2024
[18]

Pie: A programmable serving system for emerging llm applications

In Gim, Zhiyao Ma, Seung-seob Lee, and Lin Zhong. Pie: A programmable serving system for emerging llm applications. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 415–430, 2025

2025
[19]

Llm-42: Enabling determinism in llm inference with verified speculation.arXiv preprint arXiv:2601.17768, 2026

Raja Gond, Aditya K Kamath, Ramachandran Ramjee, and Ashish Panwar. Llm-42: Enabling determinism in llm inference with verified speculation.arXiv preprint arXiv:2601.17768, 2026

work page arXiv 2026
[20]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review arXiv 2024
[21]

Codeeditorbench: Evaluating code editing capability of large language models.arXiv preprint arXiv:2404.03543, 2024

Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi Li, Ruibo Liu, Yue Wang, et al. Codeeditorbench: Evaluating code editing capability of large language models.arXiv preprint arXiv:2404.03543, 2024

work page arXiv 2024
[22]

arXiv preprint arXiv:2510.03760 , year=

Ping Guo, Chenyu Zhu, Siyuan Chen, Fei Liu, Xi Lin, Zhichao Lu, and Qingfu Zhang. EvoEngineer: Mastering automated CUDA kernel code evolution with large language models.ArXiv, abs/2510.03760,

work page arXiv
[23]

URLhttps://api.semanticscholar.org/CorpusID:281842469
[24]

Glia: A Human-Inspired AI for Automated Systems Design and Optimization

Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noorbakhsh, Joseph Chandler, Ali ParandehGheibi, Mohammad Alizadeh, and Hari Balakrishnan. Glia: A human-inspired ai for automated systems design and optimization, 2026. URLhttps://arxiv.org/abs/2510.27176

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

MLX: Efficient and flexible machine learning on Apple silicon, 2023

Awni Hannun, Jagrit Digani, Angelos Katharopoulos, and Ronan Collobert. MLX: Efficient and flexible machine learning on Apple silicon, 2023. URLhttps://github.com/ml-explore/mlx

2023
[26]

Defeating nondeterminism in llm inference.Thinking Machines Lab: Connectionism, 2025

Horace He and Thinking Machines Lab. Defeating nondeterminism in llm inference.Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20250910. https://thinkingmachines.ai/blog/defeating- nondeterminism-in-llm-inference/

work page doi:10.64434/tml.20250910 2025
[27]

Swe-perf: Can language models optimize code performance on real-world repositories?, 2025

Xinyi He, Qian Liu, Mingzhe Du, Lin Yan, Zhijie Fan, Yiming Huang, Zejian Yuan, and Zejun Ma. Swe-perf: Can language models optimize code performance on real-world repositories?, 2025. URL https://arxiv.org/abs/2507.12415

work page arXiv 2025
[28]

Apple vs

Paul Hübner, Andong Hu, Ivy Peng, and Stefano Markidis. Apple vs. oranges: Evaluating the apple silicon m-series SoCs for HPC performance and efficiency, 2025. URLhttps://arxiv.org/abs/2502.05317

work page arXiv 2025
[29]

Everything is a ralph loop

Geoffrey Huntley. Everything is a ralph loop. https://ghuntley.com/loop/, 1 2026. Blog post. Accessed: 2026-05-06

2026
[30]

Sageserve: Optimizing llm serving on cloud data centers with forecast aware auto-scaling.Proceedings of the ACM on Measurement and Analysis of Computing Systems, 9(3):1–24, 2025

Shashwat Jaiswal, Kunal Jain, Yogesh Simmhan, Anjaly Parayil, Ankur Mallick, Rujia Wang, Renee St Amant, Chetan Bansal, Victor Ruhle, Anoop Kulkarni, et al. Sageserve: Optimizing llm serving on cloud data centers with forecast aware auto-scaling.Proceedings of the ACM on Measurement and Analysis of Computing Systems, 9(3):1–24, 2025

2025
[31]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues?ArXiv, abs/2310.06770,

work page internal anchor Pith review arXiv
[32]

URLhttps://api.semanticscholar.org/CorpusID:263829697
[33]

Ragcache: Efficient knowledge caching for retrieval-augmented generation.ACM Transactions on Computer Systems, 44(1):1–27, 2025

Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Shufan Liu, Xuanzhe Liu, and Xin Jin. Ragcache: Efficient knowledge caching for retrieval-augmented generation.ACM Transactions on Computer Systems, 44(1):1–27, 2025

2025
[34]

V oxserve: Streaming-centric serving system for speech language models.arXiv preprint arXiv:2602.00269, 2026

Keisuke Kamahori, Wei-Tzu Lee, Atindra Jha, Rohan Kadekodi, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci. V oxserve: Streaming-centric serving system for speech language models.arXiv preprint arXiv:2602.00269, 2026

work page arXiv 2026
[35]

Improving coherence and persistence in agentic AI for system optimization, 2026

Pantea Karimi, Kimia Noorbakhsh, Mohammad Alizadeh, and Hari Balakrishnan. Improving coherence and persistence in agentic AI for system optimization, 2026. URLhttps://arxiv.org/abs/2603.21321

work page arXiv 2026
[36]

autoresearch: An autonomous LLM research loop

Andrej Karpathy. autoresearch: An autonomous LLM research loop. https://github.com/karpathy/ autoresearch, 2026. Accessed: 2026-05-06

2026
[37]

Moonshine v2: Ergodic streaming encoder asr for latency-critical speech applications.arXiv preprint arXiv:2602.12241, 2026

Manjunath Kudlur, Evan King, James Wang, and Pete Warden. Moonshine v2: Ergodic streaming encoder asr for latency-critical speech applications.arXiv preprint arXiv:2602.12241, 2026

work page arXiv 2026
[38]

Measuring AI ability to complete long tasks

Thomas Kwa, Ben West, Joel Becker, et al. Measuring AI ability to complete long tasks. https: //metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/, 3 2025. 11

2025
[39]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

2023
[40]

DeepAgents

LangChain. DeepAgents. https://github.com/langchain-ai/deepagents, 2025. Accessed: 2026- 05-06

2025
[41]

Investigating how Codex context compaction works

Kangwook Lee. Investigating how Codex context compaction works. https://x.com/Kangwook_Lee/ status/2028955292025962534, 2026. Accessed: 2026-05-07

work page arXiv 2026
[42]

Xgrammar-2: Efficient dynamic structured generation engine for agentic llms, 2026

Linzhang Li, Yixin Dong, Guanjie Wang, Ziyi Xu, Alexander Jiang, and Tianqi Chen. Xgrammar-2: Efficient dynamic structured generation engine for agentic llms, 2026. URL https://arxiv.org/abs/ 2601.04426

work page arXiv 2026
[43]

Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Haim Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avshalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, and Yoav Shoham. Jamba: A hybrid transformer-mamba la...

work page internal anchor Pith review arXiv 2024
[44]

Scaling long-running autonomous coding

Wilson Lin. Scaling long-running autonomous coding. https://cursor.com/blog/scaling-agents, 1 2026. Cursor blog. Accessed: 2026-05-06

2026
[45]

Dimakis, Matei Zaharia, and Ion Stoica

Shu Liu, Mert Cemri, Shubham Agarwal, Alexander Krentsel, Ashwin Naren, Qiuyang Mang, Zhifei Li, Akshat Gupta, Monishwaran Maheswaran, Audrey Cheng, Melissa Pan, Ethan Boneh, Kannan Ram- chandran, Koushik Sen, Alexandros G. Dimakis, Matei Zaharia, and Ion Stoica. SkyDiscover: A flexible framework for AI-driven scientific and algorithmic discovery, 2026. U...

2026
[46]

RepoBench : Benchmarking repository-level code auto-completion systems

Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems.arXiv preprint arXiv:2306.03091, 2023

work page arXiv 2023
[47]

arXiv preprint arXiv:2502.13965 , year =

Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E Gonzalez, et al. Autellix: An efficient serving engine for llm agents as general programs.arXiv preprint arXiv:2502.13965, 2025

work page arXiv 2025
[48]

Swe-fficiency: Can language models optimize real-world reposito- ries on real workloads?, 2025

Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, and Parthasarathy Ranganathan. Swe-fficiency: Can language models optimize real-world reposito- ries on real workloads?, 2025. URLhttps://arxiv.org/abs/2511.06090

work page arXiv 2025
[49]

Unikernels: Library operating systems for the cloud.ACM SIGARCH Computer Architecture News, 41(1):461–472, 2013

Anil Madhavapeddy, Richard Mortier, Charalampos Rotsos, David Scott, Balraj Singh, Thomas Gazagnaire, Steven Smith, Steven Hand, and Jon Crowcroft. Unikernels: Library operating systems for the cloud.ACM SIGARCH Computer Architecture News, 41(1):461–472, 2013

2013
[50]

Jitsu:{Just-In-Time} summoning of unikernels

Anil Madhavapeddy, Thomas Leonard, Magnus Skjegstad, Thomas Gazagnaire, David Sheets, Dave Scott, Richard Mortier, Amir Chaudhry, Balraj Singh, Jon Ludlam, et al. Jitsu:{Just-In-Time} summoning of unikernels. In12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pages 559–573, 2015

2015
[51]

Threads and input/output in the synthesis kernal

Henry Massalin and Calton Pu. Threads and input/output in the synthesis kernal. InProceedings of the twelfth ACM symposium on Operating systems principles, pages 191–201, 1989

1989
[52]

Specialization tools and techniques for systematic optimization of system software.ACM Transactions on Computer Systems (TOCS), 19(2):217–251, 2001

Dylan McNamee, Jonathan Walpole, Calton Pu, Crispin Cowan, Charles Krasic, Ashvin Goel, Perry Wagle, Charles Consel, Gilles Muller, and Renauld Marlet. Specialization tools and techniques for systematic optimization of system software.ACM Transactions on Computer Systems (TOCS), 19(2):217–251, 2001

2001
[53]

Olmo Hybrid: From Theory to Practice and Back

William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, et al. Olmo hybrid: From theory to practice and back. arXiv preprint arXiv:2604.03444, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[54]

Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algor...

work page internal anchor Pith review arXiv 2025
[55]

TensorRT-LLM.https://github.com/NVIDIA/TensorRT-LLM, 2023

NVIDIA. TensorRT-LLM.https://github.com/NVIDIA/TensorRT-LLM, 2023. 12

2023
[56]

Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models

NVIDIA, Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhsh, Amala Sanjay Deshmukh, Ameya Sunil Mahabalesh- warkar, Andrew Tao, Anna Shors, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Bud- dharaju, Bobby Chen, Boris Ginsburg, Boxin Wang, Brandon Norick, Bri...

work page arXiv 2025
[57]

Predicted outputs

OpenAI. Predicted outputs. https://platform.openai.com/docs/guides/predicted-outputs,
[58]

Accessed: 2026-05-05

OpenAI API documentation. Accessed: 2026-05-05

2026
[59]

OpenAI Codex CLI.https://github.com/openai/codex, 2025

OpenAI. OpenAI Codex CLI.https://github.com/openai/codex, 2025. Accessed: 2026-05-06

2025
[60]

Introducing gpt-5.4.https://openai.com/index/introducing-gpt-5-4/, 2026a

Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. Kernelbench: Can llms write efficient gpu kernels?ArXiv, abs/2502.10517, 2025. URL https://api.semanticscholar.org/CorpusID:276408165

work page arXiv 2025
[61]

Marconi: Prefix caching for the era of hybrid llms, 2025

Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Ravi Netravali, and Yida Wang. Marconi: Prefix caching for the era of hybrid LLMs.ArXiv, abs/2411.19379, 2024. URL https://api.semanticscholar.org/CorpusID:274367849

work page arXiv 2024
[62]

Ori Press, Brandon Amos, Haoyu Zhao, Yikai Wu, Samuel K. Ainsworth, Dominik Krupke, Patrick Kidger, Touqir Sajed, Bartolomeo Stellato, Jisun Park, Nathanael Bosch, Eli Meril, Albert Steppi, Arman Zharmagambetov, Fangzhao Zhang, David Perez-Pineiro, Alberto Mercurio, Ni Zhan, Talor Abramovich, Kilian Lieret, Hanlin Zhang, Shirley Huang, Matthias Bethge, an...

2025
[63]

Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational Conference on Machine Learning (ICML), 2023. URLhttps://arxiv.org/abs/2212.04356

work page internal anchor Pith review arXiv 2023
[64]

Sun, and Yisong Yue

Atharva Sehgal, James Hou, Swarat Chaudhuri, Jennifer J. Sun, and Yisong Yue. Formulacode: Evaluating agentic superoptimization on large codebases. InICML 2025 Workshop on Programmatic Representations for Agent Learning, 2025. URLhttps://openreview.net/forum?id=CMdtl83aZF

2025
[65]

Openevolve: an open-source evolutionary coding agent, 2025

Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent, 2025. URL https:// github.com/algorithmicsuperintelligence/openevolve

2025
[66]

14 Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, and Ion Stoica

Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, and Ion Stoica. Gso: Challenging software optimization tasks for evaluating swe-agents, 2025. URL https://arxiv.org/ abs/2505.23671. 13

work page arXiv 2025
[67]

Minh V . T. Thai, Tue Le, Dung Nguyen Manh, Huy Phan Nhat, and Nghi D. Q. Bui. Swe-evo: Benchmark- ing coding agents in long-horizon software evolution scenarios, 2026. URLhttps://arxiv.org/abs/ 2512.18470

work page internal anchor Pith review Pith/arXiv arXiv 2026
[68]

Characterizing and optimizing llm inference workloads on cpu-gpu coupled architectures

Prabhu Vellaisamy, Thomas Labonte, Sourav Chakraborty, Matt Turner, Samantika Sury, and John Paul Shen. Characterizing and optimizing llm inference workloads on cpu-gpu coupled architectures. In2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 49–61. IEEE, 2025

2025
[69]

Hybrid KV cache manager — vLLM documentation

vLLM Team. Hybrid KV cache manager — vLLM documentation. https://docs.vllm.ai/en/ stable/design/hybrid_kv_cache_manager/, 2024

2024
[70]

Efficientedit: Accelerating code editing via edit-oriented speculative decoding.arXiv preprint arXiv:2506.02780, 2025

Peiding Wang, Li Zhang, Fang Liu, Yinghao Zhu, Wang Xu, Lin Shi, Xiaoli Lian, Minxiao Li, Bo Shen, and An Fu. Efficientedit: Accelerating code editing via edit-oriented speculative decoding.arXiv preprint arXiv:2506.02780, 2025

work page arXiv 2025
[71]

The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

Xinyu Jessica Wang, Haoyue Bai, Yiyou Sun, Haorui Wang, Shuibai Zhang, Wenjie Hu, Mya Schroder, Bilge Mutlu, Dawn Song, and Robert D Nowak. The long-horizon task mirage? diagnosing where and why agentic systems break, 2026. URLhttps://arxiv.org/abs/2604.11978

work page internal anchor Pith review Pith/arXiv arXiv 2026
[72]

Ker- nelFoundry: Hardware-aware evolutionary GPU kernel optimization, 2026

Nina Wiedemann, Quentin Leboutet, Michael Paulitsch, Diana Wofk, and Benjamin Ummenhofer. Ker- nelFoundry: Hardware-aware evolutionary GPU kernel optimization, 2026. URL https://arxiv.org/ abs/2603.12440

work page arXiv 2026
[73]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing.arXiv preprint arXiv:1910.03771, 2019

work page internal anchor Pith review arXiv 1910
[74]

Git context controller: Manage the context of llm-based agents like git, 2026

Junde Wu, Minhao Hu, Jiayuan Zhu, Jiazhen Pan, Yuyuan Liu, Min Xu, and Yueming Jin. Git context controller: Manage the context of llm-based agents like git, 2026. URL https://arxiv.org/abs/2508. 00031

2026
[75]

Agentless: Demystifying LLM-based Software Engineering Agents

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents.arXiv preprint arXiv:2407.01489, 2024

work page internal anchor Pith review arXiv 2024
[76]

Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models, 2025. URLhttps://arxiv.org/abs/2506.15564

work page internal anchor Pith review arXiv 2025
[77]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review arXiv 2025
[78]

Inference with reference: Lossless acceleration of large language models

Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, and Furu Wei. Inference with reference: Lossless acceleration of large language models.arXiv preprint arXiv:2304.04487, 2023

work page arXiv 2023
[79]

Gated Delta Networks: Improving Mamba2 with Delta Rule

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024

work page internal anchor Pith review arXiv 2024
[80]

Flash- infer: Efficient and customizable attention engine for llm inference serving.arXiv preprint arXiv:2501.01005,

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. Flashinfer: Efficient and customizable attention engine for llm inference serving, 2025. URLhttps://arxiv.org/abs/2501.01005

work page arXiv 2025

Showing first 80 references.