VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
Pith reviewed 2026-05-08 10:40 UTC · model grok-4.3
The pith
AI agents can synthesize custom LLM serving systems that match vLLM in standard use and outperform it by exploiting missed optimizations in non-standard scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VibeServe is the first agentic loop that generates entire LLM serving stacks end-to-end. It uses an outer loop to plan and track the search over system designs and an inner loop to implement candidates, check correctness, and measure performance on the target benchmark. In the standard deployment setting, VibeServe remains competitive with vLLM. In non-standard scenarios it outperforms existing systems by exploiting opportunities that generic systems miss, across six cases involving non-standard model architectures, workload knowledge, and hardware-specific optimizations. These results suggest a shift in infrastructure design toward generation-time specialization rather than runtime generality.
What carries the argument
The multi-agent loop, with an outer component that plans and tracks the search over system designs and an inner component that generates code, verifies correctness, and runs performance measurements on the target workload.
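A minimal sketch of that two-level structure follows, assuming hypothetical interfaces; `plan`, `implement`, `verify`, and `benchmark` are our stand-ins for the agents and harness, not the paper's actual components.

```python
# Sketch of the two-level loop, not the paper's implementation. The four
# callables are hypothetical stand-ins for the agents and harness.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Candidate:
    design: str                    # a proposed serving-system design
    code: Optional[str] = None     # inner-loop implementation, if any
    score: Optional[float] = None  # measured performance, if verified

def vibeserve_loop(
    plan: Callable[[list], str],        # outer agent: history -> next design
    implement: Callable[[str], str],    # inner agent: design -> code
    verify: Callable[[str], bool],      # correctness oracle on generated code
    benchmark: Callable[[str], float],  # e.g., tokens/s on the target workload
    n_outer: int = 10,
    n_inner: int = 3,
) -> Optional[Candidate]:
    history: list[Candidate] = []       # outer loop tracks the search
    best: Optional[Candidate] = None
    for _ in range(n_outer):
        cand = Candidate(design=plan(history))
        for _ in range(n_inner):        # inner loop: implement until correct
            cand.code = implement(cand.design)
            if verify(cand.code):
                cand.score = benchmark(cand.code)
                break
        history.append(cand)            # failed candidates inform the next plan
        if cand.score is not None and (best is None or cand.score > best.score):
            best = cand
    return best
```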
If this is right
- Specialized serving code can be produced on demand without multi-year manual engineering efforts.
- Optimizations for non-standard models or hardware become accessible through automated generation rather than expert rewriting.
- Workload knowledge can be directly encoded into the serving system at generation time instead of approximated inside a fixed general stack.
- The infrastructure design space expands to include many generated variants instead of one versatile runtime.
- New deployment targets can be supported by re-running the loop rather than forking and retuning an existing codebase.
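To make the workload-knowledge bullet above concrete: a toy sketch of what encoding a workload fact at generation time could look like, using one assumed fact (every request shares a fixed system prompt). The prefix-caching framing, the `prefill` stand-in, and all names are our illustration, not the paper's.

```python
# Illustrative contrast between runtime generality and generation-time
# specialization for one assumed piece of workload knowledge: every
# request begins with the same fixed system prompt.

FIXED_PREFIX = "You are a helpful assistant.".split()  # known when generating

def prefill(tokens: list[str]) -> dict:
    """Toy stand-in for computing KV-cache state over a token sequence."""
    return {"kv_len": len(tokens)}

# General stack: treats every request as arbitrary, recomputing (or
# detecting reuse of) the shared prefix at runtime.
def serve_general(request: list[str]) -> dict:
    return prefill(request)

# Specialized stack: the generator baked the prefix in, so its state is
# computed once at startup and each request prefills only its suffix.
_PREFIX_KV = prefill(FIXED_PREFIX)

def serve_specialized(request: list[str]) -> dict:
    assert request[: len(FIXED_PREFIX)] == FIXED_PREFIX, "workload assumption violated"
    suffix_kv = prefill(request[len(FIXED_PREFIX):])
    return {"kv_len": _PREFIX_KV["kv_len"] + suffix_kv["kv_len"]}
```

The assertion is the point: a generated, specialized system is free to fail loudly when its generation-time assumption breaks, rather than carry the machinery a general stack needs to handle every case.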
Where Pith is reading between the lines
- The same loop structure could be applied to other systems domains such as database engines or network stacks where generality currently dominates.
- Over successive runs the approach may accumulate a library of verified serving primitives that future generations can reuse.
- Reliability of the agent loop will become the main limiter; improvements in code verification or test generation would directly increase the range of workable scenarios.
- Long-running agents could monitor live workloads and trigger regeneration when performance drifts, moving toward continuously adapted serving systems.
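A sketch of the regeneration trigger speculated in the last bullet, with a hypothetical drift threshold and a `regenerate` callback standing in for a re-run of the agent loop:

```python
# Monitor a live throughput metric and re-run generation when it drifts.
# The window size, tolerance, and regenerate hook are all hypothetical.
from collections import deque
from typing import Callable, Optional

class DriftMonitor:
    def __init__(self, regenerate: Callable[[], None],
                 window: int = 100, tolerance: float = 0.15):
        self.baseline: Optional[float] = None   # throughput at deployment time
        self.recent = deque(maxlen=window)      # sliding window of observations
        self.tolerance = tolerance              # allowed relative degradation
        self.regenerate = regenerate            # callback: re-run the agent loop

    def observe(self, tokens_per_s: float) -> None:
        self.recent.append(tokens_per_s)
        if len(self.recent) < self.recent.maxlen:
            return                              # not enough data yet
        current = sum(self.recent) / len(self.recent)
        if self.baseline is None:
            self.baseline = current             # fix the deployment baseline
        elif current < (1 - self.tolerance) * self.baseline:
            self.regenerate()                   # performance drifted: respecialize
            self.baseline = None                # re-baseline after regeneration
            self.recent.clear()
```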
Load-bearing premise
The LLM agents inside the loop can reliably produce correct, high-performance serving code that the outer loop can validate without human fixes or hidden errors.
What would settle it
Repeated runs of VibeServe on standard benchmarks that produce systems failing correctness checks, or showing consistently large performance gaps relative to vLLM, would disprove the competitiveness result.
Original abstract
For years, we have built LLM serving systems like any other critical infrastructure: a single general-purpose stack, hand-tuned over many engineer-years, meant to support every model and workload. In this paper, we take the opposite bet: a multi-agent loop that automatically synthesizes bespoke serving systems for different usage scenarios. We propose VibeServe, the first agentic loop that generates entire LLM serving stacks end-to-end. VibeServe uses an outer loop to plan and track the search over system designs, and an inner loop to implement candidates, check correctness, and measure performance on the target benchmark. In the standard deployment setting, where existing stacks are highly optimized, VibeServe remains competitive with vLLM, showing that generation-time specialization need not come at the cost of performance. More interestingly, in non-standard scenarios, VibeServe outperforms existing systems by exploiting opportunities that generic systems miss in six scenarios involving non-standard model architectures, workload knowledge, and hardware-specific optimizations. Together, these results suggest a different point in the design space for infrastructure software: generation-time specialization rather than runtime generality. Code is available at https://github.com/uw-syfi/vibe-serve.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes VibeServe, the first multi-agent loop that automatically synthesizes entire bespoke LLM serving stacks end-to-end. An outer loop plans and tracks the search over system designs while an inner loop implements candidate stacks, checks their correctness, and measures performance on target benchmarks. The central empirical claim is that VibeServe remains competitive with vLLM in standard, highly optimized deployment settings and outperforms existing general-purpose systems in six non-standard scenarios involving non-standard model architectures, workload-specific knowledge, and hardware optimizations, supporting a shift toward generation-time specialization rather than runtime generality. Code is released at https://github.com/uw-syfi/vibe-serve.
Significance. If the performance claims hold under rigorous verification, the work is significant because it demonstrates a viable alternative design point for LLM infrastructure: automated, scenario-specific generation of serving systems instead of hand-tuned general-purpose stacks. This could enable more efficient exploitation of workload and hardware particularities that generic systems overlook. The public code release is a positive step toward reproducibility, though the absence of detailed experimental protocols limits immediate assessment of the result's robustness.
Major comments (2)
- [Abstract and Experimental Evaluation] The claims of competitive performance with vLLM and of outperformance in six non-standard scenarios are presented without any description of the benchmarks used, statistical methods, error bars, number of runs, or experimental controls. This information is load-bearing for the central empirical claim and must be supplied before the results can be evaluated.
- [Section 3] The inner-loop description does not specify the exact correctness oracles, test coverage, or verification procedures used to confirm that LLM-generated serving code is bug-free and that non-standard optimizations are sound rather than benchmark-specific artifacts. Standard benchmark checks can pass while missing subtle concurrency, memory, or kernel issues, which bears directly on the weakest assumption: that the agentic loop reliably produces correct stacks without human intervention.
Minor comments (2)
- [Abstract] The six non-standard scenarios are referenced but not enumerated with concrete details on model architectures, workloads, or hardware in the abstract; a table or explicit list would improve clarity.
- [Sections 2-3] Minor typographical inconsistencies in agent role descriptions and loop terminology appear in the early sections; these do not affect technical content but should be standardized.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the requested information and clarifications.
Point-by-point responses
- Referee: [Abstract and Experimental Evaluation] The claims of competitive performance with vLLM and of outperformance in six non-standard scenarios are presented without any description of the benchmarks used, statistical methods, error bars, number of runs, or experimental controls. This information is load-bearing for the central empirical claim and must be supplied before the results can be evaluated.
  Authors: We agree that the abstract and Experimental Evaluation section require explicit details on benchmarks, statistical methods, error bars, number of runs, and controls to support the central claims. In the revised manuscript we have expanded the Experimental Evaluation section with a new subsection that describes: (1) the standard benchmarks drawn from vLLM's public evaluation suite together with the precise six non-standard scenarios (non-standard model architectures, workload-specific knowledge, and hardware optimizations); (2) the number of independent runs (five per configuration); (3) statistical reporting (mean and standard deviation); (4) error bars on all performance plots; and (5) experimental controls (fixed hardware platforms, model checkpoints, workload generators, and temperature settings). We have also added a single sentence to the abstract summarizing the evaluation protocol. These additions make the empirical claims fully evaluable. Revision: yes.
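As a rough illustration, the reporting protocol the authors describe reduces to something like the sketch below: five independent runs per configuration, summarized as mean and standard deviation against the vLLM baseline. All throughput numbers and the "overlapping intervals" reading of competitiveness are our own illustrative assumptions, not figures from the paper.

```python
# Minimal reduction of the described protocol: five runs per
# configuration, mean +/- standard deviation vs. the vLLM baseline.
# All throughput numbers below are made up for illustration.
import statistics

def summarize(runs: list[float]) -> tuple[float, float]:
    return statistics.mean(runs), statistics.stdev(runs)

vibeserve_runs = [412.0, 405.3, 418.9, 409.1, 414.6]  # tokens/s, illustrative
vllm_runs = [410.2, 407.8, 413.5, 411.0, 408.9]       # tokens/s, illustrative

vs_mean, vs_std = summarize(vibeserve_runs)
vl_mean, vl_std = summarize(vllm_runs)
print(f"VibeServe: {vs_mean:.1f} +/- {vs_std:.1f} tok/s")
print(f"vLLM:      {vl_mean:.1f} +/- {vl_std:.1f} tok/s")

# One plausible reading of "competitive": the one-standard-deviation
# intervals overlap. The paper's actual criterion may differ.
print("competitive:", abs(vs_mean - vl_mean) <= vs_std + vl_std)
```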
- Referee: [Section 3] The inner-loop description does not specify the exact correctness oracles, test coverage, or verification procedures used to confirm that LLM-generated serving code is bug-free and that non-standard optimizations are sound rather than benchmark-specific artifacts. Standard benchmark checks can pass while missing subtle concurrency, memory, or kernel issues, which bears directly on the weakest assumption: that the agentic loop reliably produces correct stacks without human intervention.
  Authors: We acknowledge that the original description of the inner loop in Section 3 is high-level and does not enumerate the concrete oracles or coverage. In the revision we have added a new paragraph to Section 3 that specifies: (1) the correctness oracles consist of automated unit tests for core serving primitives, integration tests against both standard and non-standard benchmarks, and dynamic analysis (Valgrind for memory safety and ThreadSanitizer for data races) on generated kernels; (2) test coverage targets all public APIs plus the critical paths exercised by the target workload; and (3) non-standard optimizations are additionally validated by comparing against hand-written reference implementations where available and by checking for benchmark-specific artifacts via cross-validation on held-out workloads. While we recognize that no automated verification is exhaustive, the multi-layered checks plus the released code base allow external inspection. We have therefore revised the manuscript to make these procedures explicit. Revision: yes.
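A minimal sketch of the layered verification the rebuttal describes, assuming stand-in callables for each layer; a real harness would invoke pytest, Valgrind, ThreadSanitizer, and workload replays rather than these hypothetical hooks.

```python
# Layered verification sketch: unit tests, integration tests, dynamic
# analysis, then held-out workload cross-validation. Every callable is
# a hypothetical stand-in for an external tool or reference system.
from typing import Callable

def outputs_match(got: list[float], want: list[float], atol: float) -> bool:
    return len(got) == len(want) and all(
        abs(g - w) <= atol for g, w in zip(got, want)
    )

def verify_candidate(
    code: str,
    unit_tests: Callable[[str], bool],
    integration_tests: Callable[[str], bool],
    dynamic_analysis: Callable[[str], bool],          # memory safety, data races
    heldout_workloads: list[object],
    run: Callable[[str, object], list[float]],        # candidate outputs
    reference_run: Callable[[object], list[float]],   # trusted reference outputs
    atol: float = 1e-3,
) -> bool:
    # Cheapest functional gates first, then dynamic analysis.
    for gate in (unit_tests, integration_tests, dynamic_analysis):
        if not gate(code):
            return False
    # Guard against benchmark-specific artifacts: outputs must also match
    # a reference on workloads never seen during generation.
    return all(
        outputs_match(run(code, w), reference_run(w), atol)
        for w in heldout_workloads
    )
```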
Circularity Check
No circularity; empirical results rest on direct system measurements against external baselines.
Full rationale
The paper presents VibeServe as an empirical system that runs an agentic loop to synthesize serving stacks and then measures their performance on benchmarks, comparing directly to vLLM and other existing systems. No equations, fitted parameters, or first-principles derivations are claimed; the central results are obtained by executing the generated code and recording latency/throughput numbers. The abstract and described methodology contain no self-definitional loops, no renaming of known results as novel predictions, and no load-bearing self-citations that substitute for external validation. The evaluation is therefore self-contained against independent external artifacts (vLLM, standard benchmarks) rather than reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: large language models can be prompted to generate correct and efficient code for LLM serving systems.
Invented entities (1)
- VibeServe multi-agent loop (no independent evidence)