pith. machine review for the scientific record. sign in

arxiv: 2605.06068 · v1 · submitted 2026-05-07 · 💻 cs.AI · cs.DC

Recognition: unknown

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

Authors on Pith no claims yet

Pith reviewed 2026-05-08 10:40 UTC · model grok-4.3

classification 💻 cs.AI cs.DC
keywords LLM serving systemsAI agentscode generationbespoke systemsmulti-agent loopssystem optimizationgeneration-time specializationvLLM
0
0 comments X

The pith

AI agents can synthesize custom LLM serving systems that match vLLM in standard use and outperform it by exploiting missed optimizations in non-standard scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues against the long-standing practice of building one general-purpose LLM serving stack through years of manual tuning. Instead it demonstrates a multi-agent loop called VibeServe that automatically plans, codes, validates, and benchmarks entire serving systems tailored to a given model, workload, and hardware. An outer loop manages the search over possible designs while an inner loop writes code, checks for correctness, and measures performance on target benchmarks. In ordinary deployment conditions the generated systems reach parity with the highly optimized vLLM. In six non-standard cases involving unusual model architectures, workload-specific knowledge, and hardware constraints, the same loop produces systems that beat existing general stacks by capturing opportunities the fixed codebases overlook.

Core claim

VibeServe is the first agentic loop that generates entire LLM serving stacks end-to-end. It uses an outer loop to plan and track the search over system designs and an inner loop to implement candidates, check correctness, and measure performance on the target benchmark. In the standard deployment setting VibeServe remains competitive with vLLM. In non-standard scenarios it outperforms existing systems by exploiting opportunities that generic systems miss across six cases involving non-standard model architectures, workload knowledge, and hardware-specific optimizations. These results suggest a shift in infrastructure design toward generation-time specialization rather than runtime generality

What carries the argument

The multi-agent loop with an outer component that plans and tracks searches over system designs and an inner component that generates code, verifies correctness, and runs performance measurements on the target workload.

If this is right

  • Specialized serving code can be produced on demand without multi-year manual engineering efforts.
  • Optimizations for non-standard models or hardware become accessible through automated generation rather than expert rewriting.
  • Workload knowledge can be directly encoded into the serving system at generation time instead of approximated inside a fixed general stack.
  • The infrastructure design space expands to include many generated variants instead of one versatile runtime.
  • New deployment targets can be supported by re-running the loop rather than forking and retuning an existing codebase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same loop structure could be applied to other systems domains such as database engines or network stacks where generality currently dominates.
  • Over successive runs the approach may accumulate a library of verified serving primitives that future generations can reuse.
  • Reliability of the agent loop will become the main limiter; improvements in code verification or test generation would directly increase the range of workable scenarios.
  • Long-running agents could monitor live workloads and trigger regeneration when performance drifts, moving toward continuously adapted serving systems.

Load-bearing premise

The LLM agents inside the loop can reliably produce correct, bug-free, and high-performance serving code that the outer loop can validate without human fixes or hidden errors.

What would settle it

Repeated runs of VibeServe on standard benchmarks that produce systems failing correctness checks or showing consistent large performance gaps below vLLM would disprove the competitiveness result.

Figures

Figures reproduced from arXiv: 2605.06068 by Baris Kasikci, Keisuke Kamahori, Shihang Li, Simon Peter.

Figure 1
Figure 1. Figure 1: Motivation for VibeServe. General-purpose serving frameworks target common deploy view at source ↗
Figure 2
Figure 2. Figure 2: VibeServe architecture. User-provided artifacts define a target deployment. The outer loop view at source ↗
Figure 3
Figure 3. Figure 3: On Llama-3.1-8B-Instruct (H100), VibeServe matches vLLM and exceeds SGLang by 5% view at source ↗
Figure 4
Figure 4. Figure 4: Workload- and model-specific scenarios. Each panel shows speedup of the VibeServe view at source ↗
Figure 5
Figure 5. Figure 5: Hardware- and workload-specific scenarios where existing serving systems lack a fast path view at source ↗
read the original abstract

For years, we have built LLM serving systems like any other critical infrastructure: a single general-purpose stack, hand-tuned over many engineer-years, meant to support every model and workload. In this paper, we take the opposite bet: a multi-agent loop that automatically synthesizes bespoke serving systems for different usage scenarios. We propose VibeServe, the first agentic loop that generates entire LLM serving stacks end-to-end. VibeServe uses an outer loop to plan and track the search over system designs, and an inner loop to implement candidates, check correctness, and measure performance on the target benchmark. In the standard deployment setting, where existing stacks are highly optimized, VibeServe remains competitive with vLLM, showing that generation-time specialization need not come at the cost of performance. More interestingly, in non-standard scenarios, VibeServe outperforms existing systems by exploiting opportunities that generic systems miss in six scenarios involving non-standard model architectures, workload knowledge, and hardware-specific optimizations. Together, these results suggest a different point in the design space for infrastructure software: generation-time specialization rather than runtime generality. Code is available at https://github.com/uw-syfi/vibe-serve.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes VibeServe, the first multi-agent loop that automatically synthesizes entire bespoke LLM serving stacks end-to-end. An outer loop plans and tracks the search over system designs while an inner loop implements candidate stacks, checks their correctness, and measures performance on target benchmarks. The central empirical claim is that VibeServe remains competitive with vLLM in standard, highly optimized deployment settings and outperforms existing general-purpose systems in six non-standard scenarios involving non-standard model architectures, workload-specific knowledge, and hardware optimizations, supporting a shift toward generation-time specialization rather than runtime generality. Code is released at https://github.com/uw-syfi/vibe-serve.

Significance. If the performance claims hold under rigorous verification, the work is significant because it demonstrates a viable alternative design point for LLM infrastructure: automated, scenario-specific generation of serving systems instead of hand-tuned general-purpose stacks. This could enable more efficient exploitation of workload and hardware particularities that generic systems overlook. The public code release is a positive step toward reproducibility, though the absence of detailed experimental protocols limits immediate assessment of the result's robustness.

major comments (2)
  1. [Abstract and Experimental Evaluation] Abstract and Experimental Evaluation section: the claims of competitive performance with vLLM and outperformance in six non-standard scenarios are presented without any description of the benchmarks used, statistical methods, error bars, number of runs, or experimental controls. This information is load-bearing for the central empirical claim and must be supplied before the results can be evaluated.
  2. [Section 3] Inner-loop description (Section 3): the paper does not specify the exact correctness oracles, test coverage, or verification procedures used to confirm that LLM-generated serving code is bug-free and that non-standard optimizations are sound rather than benchmark-specific artifacts. Standard benchmark checks can pass while missing subtle concurrency, memory, or kernel issues, directly affecting the weakest assumption that the agentic loop reliably produces correct stacks without human intervention.
minor comments (2)
  1. [Abstract] The six non-standard scenarios are referenced but not enumerated with concrete details on model architectures, workloads, or hardware in the abstract; a table or explicit list would improve clarity.
  2. [Sections 2-3] Minor typographical inconsistencies in agent role descriptions and loop terminology appear in the early sections; these do not affect technical content but should be standardized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the requested information and clarifications.

read point-by-point responses
  1. Referee: [Abstract and Experimental Evaluation] Abstract and Experimental Evaluation section: the claims of competitive performance with vLLM and outperformance in six non-standard scenarios are presented without any description of the benchmarks used, statistical methods, error bars, number of runs, or experimental controls. This information is load-bearing for the central empirical claim and must be supplied before the results can be evaluated.

    Authors: We agree that the abstract and Experimental Evaluation section require explicit details on benchmarks, statistical methods, error bars, number of runs, and controls to support the central claims. In the revised manuscript we have expanded the Experimental Evaluation section with a new subsection that describes: (1) the standard benchmarks drawn from vLLM's public evaluation suite together with the precise six non-standard scenarios (non-standard model architectures, workload-specific knowledge, and hardware optimizations); (2) the number of independent runs (five per configuration); (3) statistical reporting (mean and standard deviation); (4) error bars on all performance plots; and (5) experimental controls (fixed hardware platforms, model checkpoints, workload generators, and temperature settings). We have also added a single sentence to the abstract summarizing the evaluation protocol. These additions make the empirical claims fully evaluable. revision: yes

  2. Referee: [Section 3] Inner-loop description (Section 3): the paper does not specify the exact correctness oracles, test coverage, or verification procedures used to confirm that LLM-generated serving code is bug-free and that non-standard optimizations are sound rather than benchmark-specific artifacts. Standard benchmark checks can pass while missing subtle concurrency, memory, or kernel issues, directly affecting the weakest assumption that the agentic loop reliably produces correct stacks without human intervention.

    Authors: We acknowledge that the original description of the inner loop in Section 3 is high-level and does not enumerate the concrete oracles or coverage. In the revision we have added a new paragraph to Section 3 that specifies: (1) the correctness oracles consist of automated unit tests for core serving primitives, integration tests against both standard and non-standard benchmarks, and dynamic analysis (Valgrind for memory safety and ThreadSanitizer for data races) on generated kernels; (2) test coverage targets all public APIs plus the critical paths exercised by the target workload; and (3) non-standard optimizations are additionally validated by comparing against hand-written reference implementations where available and by checking for benchmark-specific artifacts via cross-validation on held-out workloads. While we recognize that no automated verification is exhaustive, the multi-layered checks plus the released code base allow external inspection. We have therefore revised the manuscript to make these procedures explicit. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results rest on direct system measurements against external baselines.

full rationale

The paper presents VibeServe as an empirical system that runs an agentic loop to synthesize serving stacks and then measures their performance on benchmarks, comparing directly to vLLM and other existing systems. No equations, fitted parameters, or first-principles derivations are claimed; the central results are obtained by executing the generated code and recording latency/throughput numbers. The abstract and described methodology contain no self-definitional loops, no renaming of known results as novel predictions, and no load-bearing self-citations that substitute for external validation. The evaluation is therefore self-contained against independent external artifacts (vLLM, standard benchmarks) rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the unverified ability of current LLMs to act as reliable system designers and verifiers for complex serving infrastructure.

axioms (1)
  • domain assumption Large language models can be prompted to generate correct and efficient code for LLM serving systems
    The inner loop's success in implementing and verifying candidates rests entirely on this capability of the underlying models.
invented entities (1)
  • VibeServe multi-agent loop no independent evidence
    purpose: To automate planning, implementation, verification, and optimization of serving systems
    The proposed system itself, with evidence limited to the abstract's performance claims.

pith-pipeline@v0.9.0 · 5518 in / 1264 out tokens · 40727 ms · 2026-05-08T10:40:20.212594+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

85 extracted references · 42 canonical work pages · 15 internal anchors

  1. [1]

    Self-defining systems

    Thomas Anderson, Ratul Mahajan, Simon Peter, and Luke Zettlemoyer. Self-defining systems. 2025. URL https://foci.uw.edu/papers/whitepaper2025-sds.pdf

  2. [2]

    PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation

    Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael V oznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael ...

  3. [3]

    doi: 10.1145/3620665.3640366

  4. [4]

    Introducing the model context protocol

    Anthropic. Introducing the model context protocol. https://www.anthropic.com/news/ model-context-protocol, 11 2024

  5. [5]

    Claude Code

    Anthropic. Claude Code. https://docs.anthropic.com/en/docs/claude-code/overview, 2025. Accessed: 2026-05-06

  6. [6]

    Effective context engineering for ai agents

    Anthropic. Effective context engineering for ai agents. https://www.anthropic.com/engineering/ effective-context-engineering-for-ai-agents , 2025. Anthropic Engineering blog. Accessed: 2026-05-06

  7. [7]

    Effective harnesses for long-running agents

    Anthropic. Effective harnesses for long-running agents. https://www.anthropic.com/engineering/ effective-harnesses-for-long-running-agents , 11 2025. Anthropic Engineering blog. Ac- cessed: 2026-05-06

  8. [8]

    Agent skills: A standardized way to give AI agents new capabilities and expertise

    Anthropic and Contributors. Agent skills: A standardized way to give AI agents new capabilities and expertise. https://agentskills.io, 2025. Open standard; https://github.com/agentskills/ agentskills

  9. [9]

    Extensibility safety and performance in the spin operating system

    Brian N Bershad, Stefan Savage, Przemyslaw Pardyak, Emin Gün Sirer, Marc E Fiuczynski, David Becker, Craig Chambers, and Susan Eggers. Extensibility safety and performance in the spin operating system. In Proceedings of the fifteenth ACM symposium on Operating systems principles, pages 267–283, 1995

  10. [10]

    Building a C compiler with a team of parallel Claudes.https://www.anthropic.com/ engineering/building-c-compiler, 2 2026

    Nicholas Carlini. Building a C compiler with a team of parallel Claudes.https://www.anthropic.com/ engineering/building-c-compiler, 2 2026. Anthropic Engineering blog. Accessed: 2026-05-06

  11. [11]

    Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

    Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness. InAdvances in Neural Information Processing Systems, volume 35, pages 16344–16359, 2022

  12. [12]

    He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa R Kundurthy, Sean M

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Y . He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa R Kundurthy, Sean M. Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-bench pro: Can AI agents solve long-horizon software engineering tasks?, 2026. URL https:/...

  13. [13]

    Exokernel: An operating system architecture for application-level resource management.ACM SIGOPS Operating Systems Review, 29(5):251–266, 1995

    Dawson R Engler, M Frans Kaashoek, and James O’Toole Jr. Exokernel: An operating system architecture for application-level resource management.ACM SIGOPS Operating Systems Review, 29(5):251–266, 1995

  14. [14]

    How cursor built fast apply using the speculative decoding api

    Fireworks AI. How cursor built fast apply using the speculative decoding api. https://fireworks.ai/ blog/cursor, 2024. Accessed: 2026-05-05

  15. [15]

    Perfbench: Can agents resolve real-world performance bugs?, 2025

    Spandan Garg, Roshanak Zilouchian Moghaddam, and Neel Sundaresan. Perfbench: Can agents resolve real-world performance bugs?, 2025. URLhttps://arxiv.org/abs/2509.24091

  16. [16]

    arXiv preprint arXiv:2501.10868 (2025)

    Saibo Geng, Hudson Cooper, Michał Moskal, Samuel Jenkins, Julian Berman, Nathan Ranchin, Robert West, Eric Horvitz, and Harsha Nori. Jsonschemabench: A rigorous benchmark of structured outputs for language models, 2025. URLhttps://arxiv.org/abs/2501.10868

  17. [17]

    Prompt cache: Modular attention reuse for low-latency inference.Proceedings of Machine Learning and Systems, 6: 325–338, 2024

    In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference.Proceedings of Machine Learning and Systems, 6: 325–338, 2024. 10

  18. [18]

    Pie: A programmable serving system for emerging llm applications

    In Gim, Zhiyao Ma, Seung-seob Lee, and Lin Zhong. Pie: A programmable serving system for emerging llm applications. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 415–430, 2025

  19. [19]

    Llm-42: Enabling determinism in llm inference with verified speculation.arXiv preprint arXiv:2601.17768, 2026

    Raja Gond, Aditya K Kamath, Ramachandran Ramjee, and Ashish Panwar. Llm-42: Enabling determinism in llm inference with verified speculation.arXiv preprint arXiv:2601.17768, 2026

  20. [20]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  21. [21]

    Codeeditorbench: Evaluating code editing capability of large language models.arXiv preprint arXiv:2404.03543, 2024

    Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi Li, Ruibo Liu, Yue Wang, et al. Codeeditorbench: Evaluating code editing capability of large language models.arXiv preprint arXiv:2404.03543, 2024

  22. [22]

    arXiv preprint arXiv:2510.03760 , year=

    Ping Guo, Chenyu Zhu, Siyuan Chen, Fei Liu, Xi Lin, Zhichao Lu, and Qingfu Zhang. EvoEngineer: Mastering automated CUDA kernel code evolution with large language models.ArXiv, abs/2510.03760,

  23. [23]

    URLhttps://api.semanticscholar.org/CorpusID:281842469

  24. [24]

    Glia: A Human-Inspired AI for Automated Systems Design and Optimization

    Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noorbakhsh, Joseph Chandler, Ali ParandehGheibi, Mohammad Alizadeh, and Hari Balakrishnan. Glia: A human-inspired ai for automated systems design and optimization, 2026. URLhttps://arxiv.org/abs/2510.27176

  25. [25]

    MLX: Efficient and flexible machine learning on Apple silicon, 2023

    Awni Hannun, Jagrit Digani, Angelos Katharopoulos, and Ronan Collobert. MLX: Efficient and flexible machine learning on Apple silicon, 2023. URLhttps://github.com/ml-explore/mlx

  26. [26]

    Defeating nondeterminism in llm inference.Thinking Machines Lab: Connectionism, 2025

    Horace He and Thinking Machines Lab. Defeating nondeterminism in llm inference.Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20250910. https://thinkingmachines.ai/blog/defeating- nondeterminism-in-llm-inference/

  27. [27]

    Swe-perf: Can language models optimize code performance on real-world repositories?, 2025

    Xinyi He, Qian Liu, Mingzhe Du, Lin Yan, Zhijie Fan, Yiming Huang, Zejian Yuan, and Zejun Ma. Swe-perf: Can language models optimize code performance on real-world repositories?, 2025. URL https://arxiv.org/abs/2507.12415

  28. [28]

    Apple vs

    Paul Hübner, Andong Hu, Ivy Peng, and Stefano Markidis. Apple vs. oranges: Evaluating the apple silicon m-series SoCs for HPC performance and efficiency, 2025. URLhttps://arxiv.org/abs/2502.05317

  29. [29]

    Everything is a ralph loop

    Geoffrey Huntley. Everything is a ralph loop. https://ghuntley.com/loop/, 1 2026. Blog post. Accessed: 2026-05-06

  30. [30]

    Sageserve: Optimizing llm serving on cloud data centers with forecast aware auto-scaling.Proceedings of the ACM on Measurement and Analysis of Computing Systems, 9(3):1–24, 2025

    Shashwat Jaiswal, Kunal Jain, Yogesh Simmhan, Anjaly Parayil, Ankur Mallick, Rujia Wang, Renee St Amant, Chetan Bansal, Victor Ruhle, Anoop Kulkarni, et al. Sageserve: Optimizing llm serving on cloud data centers with forecast aware auto-scaling.Proceedings of the ACM on Measurement and Analysis of Computing Systems, 9(3):1–24, 2025

  31. [31]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues?ArXiv, abs/2310.06770,

  32. [32]

    URLhttps://api.semanticscholar.org/CorpusID:263829697

  33. [33]

    Ragcache: Efficient knowledge caching for retrieval-augmented generation.ACM Transactions on Computer Systems, 44(1):1–27, 2025

    Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Shufan Liu, Xuanzhe Liu, and Xin Jin. Ragcache: Efficient knowledge caching for retrieval-augmented generation.ACM Transactions on Computer Systems, 44(1):1–27, 2025

  34. [34]

    V oxserve: Streaming-centric serving system for speech language models.arXiv preprint arXiv:2602.00269, 2026

    Keisuke Kamahori, Wei-Tzu Lee, Atindra Jha, Rohan Kadekodi, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci. V oxserve: Streaming-centric serving system for speech language models.arXiv preprint arXiv:2602.00269, 2026

  35. [35]

    Improving coherence and persistence in agentic AI for system optimization, 2026

    Pantea Karimi, Kimia Noorbakhsh, Mohammad Alizadeh, and Hari Balakrishnan. Improving coherence and persistence in agentic AI for system optimization, 2026. URLhttps://arxiv.org/abs/2603.21321

  36. [36]

    autoresearch: An autonomous LLM research loop

    Andrej Karpathy. autoresearch: An autonomous LLM research loop. https://github.com/karpathy/ autoresearch, 2026. Accessed: 2026-05-06

  37. [37]

    Moonshine v2: Ergodic streaming encoder asr for latency-critical speech applications.arXiv preprint arXiv:2602.12241, 2026

    Manjunath Kudlur, Evan King, James Wang, and Pete Warden. Moonshine v2: Ergodic streaming encoder asr for latency-critical speech applications.arXiv preprint arXiv:2602.12241, 2026

  38. [38]

    Measuring AI ability to complete long tasks

    Thomas Kwa, Ben West, Joel Becker, et al. Measuring AI ability to complete long tasks. https: //metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/, 3 2025. 11

  39. [39]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  40. [40]

    DeepAgents

    LangChain. DeepAgents. https://github.com/langchain-ai/deepagents, 2025. Accessed: 2026- 05-06

  41. [41]

    Investigating how Codex context compaction works

    Kangwook Lee. Investigating how Codex context compaction works. https://x.com/Kangwook_Lee/ status/2028955292025962534, 2026. Accessed: 2026-05-07

  42. [42]

    Xgrammar-2: Efficient dynamic structured generation engine for agentic llms, 2026

    Linzhang Li, Yixin Dong, Guanjie Wang, Ziyi Xu, Alexander Jiang, and Tianqi Chen. Xgrammar-2: Efficient dynamic structured generation engine for agentic llms, 2026. URL https://arxiv.org/abs/ 2601.04426

  43. [43]

    Jamba: A Hybrid Transformer-Mamba Language Model

    Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Haim Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avshalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, and Yoav Shoham. Jamba: A hybrid transformer-mamba la...

  44. [44]

    Scaling long-running autonomous coding

    Wilson Lin. Scaling long-running autonomous coding. https://cursor.com/blog/scaling-agents, 1 2026. Cursor blog. Accessed: 2026-05-06

  45. [45]

    Dimakis, Matei Zaharia, and Ion Stoica

    Shu Liu, Mert Cemri, Shubham Agarwal, Alexander Krentsel, Ashwin Naren, Qiuyang Mang, Zhifei Li, Akshat Gupta, Monishwaran Maheswaran, Audrey Cheng, Melissa Pan, Ethan Boneh, Kannan Ram- chandran, Koushik Sen, Alexandros G. Dimakis, Matei Zaharia, and Ion Stoica. SkyDiscover: A flexible framework for AI-driven scientific and algorithmic discovery, 2026. U...

  46. [46]

    RepoBench : Benchmarking repository-level code auto-completion systems

    Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems.arXiv preprint arXiv:2306.03091, 2023

  47. [47]

    arXiv preprint arXiv:2502.13965 , year =

    Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E Gonzalez, et al. Autellix: An efficient serving engine for llm agents as general programs.arXiv preprint arXiv:2502.13965, 2025

  48. [48]

    Swe-fficiency: Can language models optimize real-world reposito- ries on real workloads?, 2025

    Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, and Parthasarathy Ranganathan. Swe-fficiency: Can language models optimize real-world reposito- ries on real workloads?, 2025. URLhttps://arxiv.org/abs/2511.06090

  49. [49]

    Unikernels: Library operating systems for the cloud.ACM SIGARCH Computer Architecture News, 41(1):461–472, 2013

    Anil Madhavapeddy, Richard Mortier, Charalampos Rotsos, David Scott, Balraj Singh, Thomas Gazagnaire, Steven Smith, Steven Hand, and Jon Crowcroft. Unikernels: Library operating systems for the cloud.ACM SIGARCH Computer Architecture News, 41(1):461–472, 2013

  50. [50]

    Jitsu:{Just-In-Time} summoning of unikernels

    Anil Madhavapeddy, Thomas Leonard, Magnus Skjegstad, Thomas Gazagnaire, David Sheets, Dave Scott, Richard Mortier, Amir Chaudhry, Balraj Singh, Jon Ludlam, et al. Jitsu:{Just-In-Time} summoning of unikernels. In12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pages 559–573, 2015

  51. [51]

    Threads and input/output in the synthesis kernal

    Henry Massalin and Calton Pu. Threads and input/output in the synthesis kernal. InProceedings of the twelfth ACM symposium on Operating systems principles, pages 191–201, 1989

  52. [52]

    Specialization tools and techniques for systematic optimization of system software.ACM Transactions on Computer Systems (TOCS), 19(2):217–251, 2001

    Dylan McNamee, Jonathan Walpole, Calton Pu, Crispin Cowan, Charles Krasic, Ashvin Goel, Perry Wagle, Charles Consel, Gilles Muller, and Renauld Marlet. Specialization tools and techniques for systematic optimization of system software.ACM Transactions on Computer Systems (TOCS), 19(2):217–251, 2001

  53. [53]

    Olmo Hybrid: From Theory to Practice and Back

    William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, et al. Olmo hybrid: From theory to practice and back. arXiv preprint arXiv:2604.03444, 2026

  54. [54]

    Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algor...

  55. [55]

    TensorRT-LLM.https://github.com/NVIDIA/TensorRT-LLM, 2023

    NVIDIA. TensorRT-LLM.https://github.com/NVIDIA/TensorRT-LLM, 2023. 12

  56. [56]

    Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models

    NVIDIA, Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhsh, Amala Sanjay Deshmukh, Ameya Sunil Mahabalesh- warkar, Andrew Tao, Anna Shors, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Bud- dharaju, Bobby Chen, Boris Ginsburg, Boxin Wang, Brandon Norick, Bri...

  57. [57]

    Predicted outputs

    OpenAI. Predicted outputs. https://platform.openai.com/docs/guides/predicted-outputs,

  58. [58]

    Accessed: 2026-05-05

    OpenAI API documentation. Accessed: 2026-05-05

  59. [59]

    OpenAI Codex CLI.https://github.com/openai/codex, 2025

    OpenAI. OpenAI Codex CLI.https://github.com/openai/codex, 2025. Accessed: 2026-05-06

  60. [60]

    Introducing gpt-5.4.https://openai.com/index/introducing-gpt-5-4/, 2026a

    Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. Kernelbench: Can llms write efficient gpu kernels?ArXiv, abs/2502.10517, 2025. URL https://api.semanticscholar.org/CorpusID:276408165

  61. [61]

    Marconi: Prefix caching for the era of hybrid llms, 2025

    Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Ravi Netravali, and Yida Wang. Marconi: Prefix caching for the era of hybrid LLMs.ArXiv, abs/2411.19379, 2024. URL https://api.semanticscholar.org/CorpusID:274367849

  62. [62]

    Ori Press, Brandon Amos, Haoyu Zhao, Yikai Wu, Samuel K. Ainsworth, Dominik Krupke, Patrick Kidger, Touqir Sajed, Bartolomeo Stellato, Jisun Park, Nathanael Bosch, Eli Meril, Albert Steppi, Arman Zharmagambetov, Fangzhao Zhang, David Perez-Pineiro, Alberto Mercurio, Ni Zhan, Talor Abramovich, Kilian Lieret, Hanlin Zhang, Shirley Huang, Matthias Bethge, an...

  63. [63]

    Robust Speech Recognition via Large-Scale Weak Supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational Conference on Machine Learning (ICML), 2023. URLhttps://arxiv.org/abs/2212.04356

  64. [64]

    Sun, and Yisong Yue

    Atharva Sehgal, James Hou, Swarat Chaudhuri, Jennifer J. Sun, and Yisong Yue. Formulacode: Evaluating agentic superoptimization on large codebases. InICML 2025 Workshop on Programmatic Representations for Agent Learning, 2025. URLhttps://openreview.net/forum?id=CMdtl83aZF

  65. [65]

    Openevolve: an open-source evolutionary coding agent, 2025

    Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent, 2025. URL https:// github.com/algorithmicsuperintelligence/openevolve

  66. [66]

    14 Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, and Ion Stoica

    Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, and Ion Stoica. Gso: Challenging software optimization tasks for evaluating swe-agents, 2025. URL https://arxiv.org/ abs/2505.23671. 13

  67. [67]

    Minh V . T. Thai, Tue Le, Dung Nguyen Manh, Huy Phan Nhat, and Nghi D. Q. Bui. Swe-evo: Benchmark- ing coding agents in long-horizon software evolution scenarios, 2026. URLhttps://arxiv.org/abs/ 2512.18470

  68. [68]

    Characterizing and optimizing llm inference workloads on cpu-gpu coupled architectures

    Prabhu Vellaisamy, Thomas Labonte, Sourav Chakraborty, Matt Turner, Samantika Sury, and John Paul Shen. Characterizing and optimizing llm inference workloads on cpu-gpu coupled architectures. In2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 49–61. IEEE, 2025

  69. [69]

    Hybrid KV cache manager — vLLM documentation

    vLLM Team. Hybrid KV cache manager — vLLM documentation. https://docs.vllm.ai/en/ stable/design/hybrid_kv_cache_manager/, 2024

  70. [70]

    Efficientedit: Accelerating code editing via edit-oriented speculative decoding.arXiv preprint arXiv:2506.02780, 2025

    Peiding Wang, Li Zhang, Fang Liu, Yinghao Zhu, Wang Xu, Lin Shi, Xiaoli Lian, Minxiao Li, Bo Shen, and An Fu. Efficientedit: Accelerating code editing via edit-oriented speculative decoding.arXiv preprint arXiv:2506.02780, 2025

  71. [71]

    The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

    Xinyu Jessica Wang, Haoyue Bai, Yiyou Sun, Haorui Wang, Shuibai Zhang, Wenjie Hu, Mya Schroder, Bilge Mutlu, Dawn Song, and Robert D Nowak. The long-horizon task mirage? diagnosing where and why agentic systems break, 2026. URLhttps://arxiv.org/abs/2604.11978

  72. [72]

    Ker- nelFoundry: Hardware-aware evolutionary GPU kernel optimization, 2026

    Nina Wiedemann, Quentin Leboutet, Michael Paulitsch, Diana Wofk, and Benjamin Ummenhofer. Ker- nelFoundry: Hardware-aware evolutionary GPU kernel optimization, 2026. URL https://arxiv.org/ abs/2603.12440

  73. [73]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing.arXiv preprint arXiv:1910.03771, 2019

  74. [74]

    Git context controller: Manage the context of llm-based agents like git, 2026

    Junde Wu, Minhao Hu, Jiayuan Zhu, Jiazhen Pan, Yuyuan Liu, Min Xu, and Yueming Jin. Git context controller: Manage the context of llm-based agents like git, 2026. URL https://arxiv.org/abs/2508. 00031

  75. [75]

    Agentless: Demystifying LLM-based Software Engineering Agents

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents.arXiv preprint arXiv:2407.01489, 2024

  76. [76]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models, 2025. URLhttps://arxiv.org/abs/2506.15564

  77. [77]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  78. [78]

    Inference with reference: Lossless acceleration of large language models

    Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, and Furu Wei. Inference with reference: Lossless acceleration of large language models.arXiv preprint arXiv:2304.04487, 2023

  79. [79]

    Gated Delta Networks: Improving Mamba2 with Delta Rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024

  80. [80]

    Flash- infer: Efficient and customizable attention engine for llm inference serving.arXiv preprint arXiv:2501.01005,

    Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. Flashinfer: Efficient and customizable attention engine for llm inference serving, 2025. URLhttps://arxiv.org/abs/2501.01005

Showing first 80 references.