Pith · machine review for the scientific record

arXiv: 2605.15026 · v1 · submitted 2026-05-14 · 💻 cs.OS · cs.AI · cs.PF

Recognition: no theorem link

SemaTune: Semantic-Aware Online OS Tuning with Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:33 UTC · model grok-4.3

classification 💻 cs.OS · cs.AI · cs.PF

keywords OS tuning · language model guidance · online parameter optimization · Linux sysctl · host-level metrics · performance tuning · validation loop · dual-loop control

The pith

SemaTune uses language models to reason over OS knob meanings and history, delivering 72.5 percent better stable performance than defaults across 13 workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SemaTune as a host-side system that feeds knob schemas, current settings, recent action-response pairs, and prior runs into a language model so it can propose safe Linux parameter changes while services run. A fast loop applies low-latency updates and a slower loop revises the overall search strategy, with every proposal passing typed validation before it reaches kernel interfaces. This structure lets the controller use semantic understanding of controls and indirect signals instead of treating every knob as an opaque variable optimized only for a scalar reward. The result is measured improvement without direct application metrics and without the persistent degraded states that structure-blind methods often produce.
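The fast/slow split described above can be sketched in a few lines. This is an illustrative skeleton, not SemaTune's implementation: the class names, the rejection-based strategy switch, and the fixed revision period are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    knob: str           # e.g. a sysctl name (illustrative)
    value: float
    justification: str  # the model's stated rationale

@dataclass
class DualLoopController:
    """Toy dual-loop tuner: a fast loop applies validated updates each
    window; a slow loop revises the search strategy every `period` windows."""
    period: int = 5
    strategy: str = "explore"
    history: list = field(default_factory=list)

    def fast_step(self, window: int, proposal: Proposal, valid: bool) -> bool:
        # Only proposals that passed validation reach the kernel interface.
        applied = valid
        self.history.append((window, proposal.knob, applied))
        if window % self.period == 0:
            self.slow_step()
        return applied

    def slow_step(self) -> None:
        # Revise strategy from recent action-response history (stub rule:
        # fall back to exploitation once most recent proposals were rejected).
        recent = self.history[-self.period:]
        rejected = sum(1 for _, _, ok in recent if not ok)
        self.strategy = "exploit" if rejected > self.period // 2 else "explore"
```

The point of the sketch is the division of authority: the fast loop never changes the search strategy, and the slow loop never touches the kernel directly.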

Core claim

SemaTune shows that bounded language-model guidance, combined with typed validation and dual-loop control, turns OS tuning into a semantically aware process that improves stable-phase performance by 72.5 percent over defaults and 153.3 percent over the strongest non-LLM baseline on 13 live workloads while tuning up to 41 parameters. The same controller still outperforms direct-application-objective baselines by 93.7 percentage points when restricted to host-level metrics alone and avoids the severe degraded regions reached by black-box exploration.

What carries the argument

Dual-loop controller that packs knob schemas, telemetry, configuration, history, and retrieved runs into compact context for an LLM, then validates every proposed change before kernel or sysctl application.
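A minimal sketch of what typed validation could look like. The schema table, knob names, and bounds here are hypothetical, not taken from the paper; the cross-knob rule mirrors the invalid `minperfpct > maxperfpct` example from Figure 1.

```python
# Hypothetical schema; SemaTune's actual knob schemas are not reproduced here.
SCHEMA = {
    "vm.swappiness":             {"type": int, "min": 0, "max": 200},
    "intel_pstate.min_perf_pct": {"type": int, "min": 0, "max": 100},
    "intel_pstate.max_perf_pct": {"type": int, "min": 0, "max": 100},
}

def validate(proposal: dict) -> tuple[bool, str]:
    """Typed validation: per-knob type and range checks, plus one simple
    cross-knob consistency rule (min CPU frequency <= max CPU frequency)."""
    for knob, value in proposal.items():
        spec = SCHEMA.get(knob)
        if spec is None:
            return False, f"unknown knob: {knob}"
        if not isinstance(value, spec["type"]):
            return False, f"{knob}: expected {spec['type'].__name__}"
        if not spec["min"] <= value <= spec["max"]:
            return False, f"{knob}: {value} outside [{spec['min']}, {spec['max']}]"
    lo = proposal.get("intel_pstate.min_perf_pct")
    hi = proposal.get("intel_pstate.max_perf_pct")
    if lo is not None and hi is not None and lo > hi:
        return False, "min_perf_pct exceeds max_perf_pct"
    return True, "ok"
```

Note what such a gate can and cannot do: it rejects out-of-schema and explicitly encoded inconsistent settings, but numerically valid yet workload-nonsensical combinations must be avoided by the model's semantic reasoning, not by the validator.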

If this is right

  • Tuning decisions can now incorporate cross-knob policy structure and indirect performance signals instead of scalar rewards alone.
  • Host-level controllers become viable for services that do not expose application metrics.
  • Exploration can be constrained to prevent entry into degraded states that continue after the bad setting is removed.
  • Model cost stays low, around 20 cents for a 30-window session, while still outperforming structure-blind methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The validation layer could be extended to other system interfaces such as network or storage stacks where semantic constraints are similarly available.
  • History retrieval might allow the slower loop to detect workload phase changes and switch strategies without additional human input.
  • Combining the semantic proposals with lightweight local models could reduce latency further while preserving the safety guarantees of typed checks.

Load-bearing premise

The language model will generate changes that improve or at least maintain performance after typed validation, even when only host-level metrics are available.

What would settle it

A workload where SemaTune, after validation, enters a degraded performance region that persists longer or more severely than the strongest non-LLM baseline under identical host-metric inputs.

Figures

Figures reproduced from arXiv: 2605.15026 by Georgios Liargkovas, Hubertus Franke, Kostis Kaffes, Mihir Nitin Joshi.

Figure 1. Steady-state online tuning. A host-side tuner updates OS knobs on a running host and uses observed signals to choose the next step.

Figure 2. MLOS performance examples. Left: Wikipedia p99 under MLOS with App, IPC, and Cache Miss objectives. Right: TPC-C p99 under MLOS as the tuning surface grows from 1 to 32 parameters.

Figure 3. System overview of SemaTune.

Figure 5. Dual-loop control in SemaTune.

Figure 6. Aggregate improvement over Default Parameters for SemaTune and baselines.

Figure 7. Aggregate improvement over Default Parameters for direct and indirect optimization objectives.

Figure 9. Aggregate P50 bad-window rate, P10 bad-window rate, and variability over the tuning phase (excluding catastrophic runs). Variability is defined as (1/R) Σ_{r=1}^{R} (σ_r / |μ_fixed|) · 100, where σ_r is the standard deviation of the tuner's metric over rerun r.

Figure 11. Aggregate improvement over Default Parameters for TPC-C, Silo, and Sysbench OLTP-RW with and without memory, with app metrics (left) and system metrics (right).

Figure 12. Model backend comparison on TPC-C, Silo, and Sysbench OLTP-RW. Left: aggregate improvement over Default Parameters during tuning and stable phases. Right: stable-phase improvement vs. total session cost.
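The variability metric quoted in the Figure 9 caption can be computed directly. A minimal sketch, assuming each rerun is a list of per-window metric values and `mu_fixed` is the fixed-configuration baseline mean:

```python
import statistics

def variability(reruns: list[list[float]], mu_fixed: float) -> float:
    """Variability per Figure 9: the mean over reruns of each rerun's
    metric standard deviation, normalized by |mu_fixed| and expressed
    as a percentage."""
    R = len(reruns)
    return sum(statistics.stdev(trace) / abs(mu_fixed) * 100
               for trace in reruns) / R
```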
Original abstract

Online OS tuning can improve long-running services, but existing controllers are poorly matched to live hosts. They treat scheduler, power, memory, and I/O controls as black-box variables and optimize a scalar reward. This view ignores cross-knob policy structure, breaks down when application metrics are unavailable, and can send a running service into degraded regions that persist after the bad setting is removed. We present SemaTune, a host-side framework for steady-state OS tuning with bounded language-model guidance. SemaTune turns knob schemas, telemetry, current configuration, recent action-response history, and retrieved prior runs into a compact decision context. A fast loop proposes low-latency updates, a slower loop periodically revises the search strategy, and every proposed change passes through typed validation before reaching kernel or sysctl interfaces. This lets the controller reason about OS-control meaning and indirect performance signals while keeping model cost, latency, and authority constrained. We evaluate SemaTune on 13 live workloads from five benchmark suites while tuning up to 41 Linux parameters. Across the suite, SemaTune improves stable-phase performance by 72.5% over default settings and by 153.3% relative to the strongest non-LLM baseline. A 30-window session costs about $0.20 in model calls. With only host-level metrics, SemaTune still outperforms baselines given direct application objectives by 93.7 percentage points, while avoiding severe degraded regions reached by structure-blind exploration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SemaTune, a host-side framework for online OS tuning of Linux parameters that incorporates LLM guidance informed by knob schemas, telemetry, current configuration, action-response history, and retrieved prior runs. It uses a fast proposal loop and a slower strategy-revision loop, with all changes passing typed validation before application. Evaluated on 13 live workloads from five benchmark suites while tuning up to 41 parameters, SemaTune reports 72.5% stable-phase improvement over defaults and 153.3% over the strongest non-LLM baseline, at low model cost, while claiming to avoid persistent degraded regions even with only host-level metrics.

Significance. If the central claims hold under rigorous validation, the work would represent a meaningful advance in practical online systems tuning by demonstrating how constrained LLM reasoning over semantic and historical context can outperform black-box controllers, particularly in settings without direct application metrics. The bounded-cost design and explicit handling of cross-knob structure address known failure modes of prior methods.

major comments (2)
  1. [Evaluation] Evaluation section: the reported 72.5% and 153.3% stable-phase gains are presented without per-workload traces, post-tuning monitoring beyond the 30-window sessions, or statistical tests confirming absence of regression after the tuning window closes; this leaves the claim that semantic context reliably prevents persistent degradation unverified.
  2. [System Design and Evaluation] The typed-validation mechanism is described as checking schemas and interfaces, yet no analysis or experiments demonstrate that it catches emergent cross-knob interactions (e.g., scheduler-memory-I/O combinations producing sustained high latency); the abstract notes that structure-blind methods reach such regions, but the evaluation provides no concrete evidence that SemaTune avoids them.
minor comments (2)
  1. The abstract and evaluation could more explicitly list the 41 Linux parameters and the five benchmark suites to improve reproducibility.
  2. [System Design] Notation for the fast and slow loops is introduced without a compact diagram or pseudocode, making the control flow harder to follow on first reading.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below with targeted revisions to strengthen the evaluation and clarify the design.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the reported 72.5% and 153.3% stable-phase gains are presented without per-workload traces, post-tuning monitoring beyond the 30-window sessions, or statistical tests confirming absence of regression after the tuning window closes; this leaves the claim that semantic context reliably prevents persistent degradation unverified.

    Authors: We agree that additional per-workload detail and statistical support would strengthen the claims. In the revised manuscript we will add an appendix with per-workload performance traces for all 13 workloads and include statistical tests (paired t-tests with p-values) on the stable-phase improvements. The 30-window sessions define the evaluation window, with stable phase measured in the final windows; we did not collect extended post-session monitoring data. The avoidance of persistent degradation is evidenced by the absence of the regressions observed in baselines during these sessions, but we will add an explicit limitations paragraph noting that longer-term post-tuning monitoring remains future work. revision: partial

  2. Referee: [System Design and Evaluation] The typed-validation mechanism is described as checking schemas and interfaces, yet no analysis or experiments demonstrate that it catches emergent cross-knob interactions (e.g., scheduler-memory-I/O combinations producing sustained high latency); the abstract notes that structure-blind methods reach such regions, but the evaluation provides no concrete evidence that SemaTune avoids them.

    Authors: Typed validation performs schema conformance and interface compatibility checks to reject syntactically invalid settings, but does not model or detect emergent cross-knob interactions at runtime. Avoidance of degraded regions is achieved by the LLM's semantic reasoning over knob schemas, telemetry, action history, and retrieved runs in both the fast proposal and strategy-revision loops. The evaluation shows SemaTune outperforming structure-blind baselines without entering severe degradation, yet we do not isolate a specific cross-knob failure case. In revision we will clarify this distinction in Section 3 and add a qualitative example illustrating how semantic context steers away from a known harmful scheduler-memory-I/O combination. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes an empirical system for LLM-guided OS tuning and reports concrete performance gains from live workload experiments against external baselines and defaults. No equations, fitted parameters, self-citations, or ansatzes are invoked as load-bearing steps in any derivation; the central claims rest on measured improvements (72.5% and 153.3%) rather than reductions to inputs by construction. The evaluation is self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's approach assumes LLM capabilities as a domain assumption rather than providing a derivation or independent evidence for the reasoning quality.

axioms (1)
  • domain assumption Large language models possess sufficient semantic understanding of OS controls and telemetry to propose effective tuning actions.
    This is central to the framework's ability to outperform structure-blind methods.

pith-pipeline@v0.9.0 · 5582 in / 1135 out tokens · 73998 ms · 2026-05-15T02:33:25.577520+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · 4 internal anchors

  1. [1]

    PhD thesis, Inria Rennes-Bretagne Atlantique, 2019

    Mathieu Acher, Hugo Martin, Juliana Alves Pereira, Arnaud Blouin, Jean-Marc Jézéquel, Djamel Eddine Khelladi, Luc Lesoil, and Olivier Barais.Learning very large configuration spaces: What matters for linux kernel sizes. PhD thesis, Inria Rennes-Bretagne Atlantique, 2019

  2. [2]

    Improving storage systems using machine learning.ACM Transactions on Storage, 19(1):1– 30, 2023

    Ibrahim Umit Akgun, Ali Selman Aydin, Andrew Burford, Michael McNeill, Michael Arkhangelskiy, and Erez Zadok. Improving storage systems using machine learning.ACM Transactions on Storage, 19(1):1– 30, 2023

  3. [3]

    A machine learning framework to improve storage system performance

    Ibrahim Umit Akgun, Ali Selman Aydin, Aadil Shaikh, Lukas Velikov, and Erez Zadok. A machine learning framework to improve storage system performance. InProceedings of the 13th ACM Workshop on Hot Topics in Storage and File Systems, HotStorage ’21, page 94–102, New York, NY, USA, 2021. Association for Computing Machinery

  4. [4]

    Cose: Configuring serverless functions using statistical learning

    Nabeel Akhtar, Ali Raza, Vatche Ishakian, and Ibrahim Matta. Cose: Configuring serverless functions using statistical learning. InIEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pages 129–138, 2020

  5. [5]

    {CherryPick}: Adap- tively unearthing the best cloud configurations for big data analytics

    Omid Alipourfard, Hongqiang Harry Liu, Jianshu Chen, Shivaram Venkataraman, Minlan Yu, and Ming Zhang. {CherryPick}: Adap- tively unearthing the best cloud configurations for big data analytics. In14th USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI 17), pages 469–482, 2017

  6. [6]

    arXiv preprint arXiv:2510.14150 , year =

    Henrique Assumpção, Diego Ferreira, Leandro Campos, and Fabricio Murai. Codeevolve: An open source evolutionary coding agent for algorithm discovery and optimization.arXiv preprint arXiv:2510.14150, 2025

  7. [7]

    Workload analysis of a large-scale key-value store

    Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. Workload analysis of a large-scale key-value store. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint inter- national conference on Measurement and Modeling of Computer Systems, pages 53–64, 2012

  8. [8]

    {Config-Snob}: Tuning for the best configurations of networking protocol stack

    Manaf Bin-Yahya, Yifei Zhao, Hossein Shafieirad, Anthony Ho, Shijun Yin, Fanzhao Wang, and Geng Li. {Config-Snob}: Tuning for the best configurations of networking protocol stack. In2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 749–765, 2024

  9. [9]

    Contention-aware scheduling on multicore systems.ACM Trans- actions on Computer Systems (TOCS), 28(4):1–45, 2010

    Sergey Blagodurov, Sergey Zhuravlev, and Alexandra Fedorova. Contention-aware scheduling on multicore systems.ACM Trans- actions on Computer Systems (TOCS), 28(4):1–45, 2010

  10. [10]

    Contention-aware scheduling on multicore systems.ACM Trans

    Sergey Blagodurov, Sergey Zhuravlev, and Alexandra Fedorova. Contention-aware scheduling on multicore systems.ACM Trans. Comput. Syst., 28(4), December 2010

  11. [11]

    Metastable failures in distributed systems

    Nathan Bronson, Abutalib Aghayev, Aleksey Charapko, and Timothy Zhu. Metastable failures in distributed systems. InProceedings of the Workshop on Hot Topics in Operating Systems, HotOS ’21, page 221–227, New York, NY, USA, 2021. Association for Computing Machinery

  12. [12]

    Carver: Finding important parameters for storage system tuning

    Zhen Cao, Geoff Kuenning, and Erez Zadok. Carver: Finding important parameters for storage system tuning. In18th USENIX Conference on File and Storage Technologies (FAST 20), pages 43–57, 2020

  13. [13]

    SmartChoices: Hybridizing Programming and Machine Learning

    Victor Carbune, Thierry Coppey, Alexander Daryin, Thomas Dese- laers, Nikhil Sarda, and Jay Yagnik. Smartchoices: hybridizing pro- gramming and machine learning.arXiv preprint arXiv:1810.00619, 2018

  14. [14]

    2602.20133 , archivePrefix =

    Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, et al. Adaevolve: Adaptive llm driven zeroth-order optimization.arXiv preprint arXiv:2602.20133, 2026

  15. [15]

    Autoos: make your os more powerful by exploiting large language models

    Huilai Chen, Yuanbo Wen, Limin Cheng, Shouxu Kuang, Yumeng Liu, Weijia Li, Ling Li, Rui Zhang, Xinkai Song, Wei Li, et al. Autoos: make your os more powerful by exploiting large language models. In Forty-first International Conference on Machine Learning, 2024

  16. [16]

    Banerjee, Zbigniew T

    Jingde Chen, Subho S. Banerjee, Zbigniew T. Kalbarczyk, and Rav- ishankar K. Iyer. Machine learning for load balancing in the linux kernel. InProceedings of the 11th ACM SIGOPS Asia-Pacific Workshop on Systems, pages 67–74, 2020

  17. [17]

    Principled performance tunability in operating system kernels.arXiv preprint arXiv:2512.12530, 2025

    Zhongjie Chen, Wentao Zhang, Yulong Tang, Ran Shu, Fengyuan Ren, Tianyin Xu, and Jing Liu. Principled performance tunability in operating system kernels.arXiv preprint arXiv:2512.12530, 2025

  18. [18]

    Bar- barians at the gate: How ai is upending systems research.arXiv 13 Georgios Liargkovas, Mihir Nitin Joshi, Hubertus Franke, and Kostis Kaffes preprint arXiv:2510.06189, 2025

    Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, et al. Bar- barians at the gate: How ai is upending systems research.arXiv 13 Georgios Liargkovas, Mihir Nitin Joshi, Hubertus Franke, and Kostis Kaffes preprint arXiv:2510.06189, 2025

  19. [19]

    Chroma-Core.Chroma: The AI-native open-source embedding database,

  20. [20]

    Accessed: 2026-04-01

  21. [21]

    Correlating instrumentation data to system states: A building block for automated diagnosis and control

    Ira Cohen, Jeffrey S Chase, Moises Goldszmidt, Terence Kelly, and Julie Symons. Correlating instrumentation data to system states: A building block for automated diagnosis and control. InOSDI, volume 4, pages 16–16, 2004

  22. [22]

    Code execution through deception: Gemini ai cli hi- jack.https://tracebit.com/blog/code-exec-deception-gemini-ai-cli- hijack, July 2025

    Sam Cox. Code execution through deception: Gemini ai cli hi- jack.https://tracebit.com/blog/code-exec-deception-gemini-ai-cli- hijack, July 2025. Tracebit Research Blog. Accessed: 2026-03-19

  23. [23]

    Mlos: An infrastructure for automated software performance engineering

    Carlo Curino, Neha Godwal, Brian Kroth, Sergiy Kuryata, Greg Lapin- ski, Siqi Liu, Slava Oks, Olga Poppe, Adam Smiechowski, Ed Thayer, et al. Mlos: An infrastructure for automated software performance engineering. InProceedings of the Fourth International Workshop on Data Management for End-to-End Machine Learning, pages 1–5, 2020

  24. [24]

    Oltp-bench: An extensible testbed for benchmarking relational databases.PVLDB, 7(4):277–288, 2013

    Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino, and Philippe Cudré-Mauroux. Oltp-bench: An extensible testbed for benchmarking relational databases.PVLDB, 7(4):277–288, 2013

  25. [25]

    Kleio: A hybrid memory page scheduler with machine intelligence

    Thaleia Dimitra Doudali, Sergey Blagodurov, Abhinav Vishnu, Sud- hanva Gurumurthi, and Ada Gavrilovska. Kleio: A hybrid memory page scheduler with machine intelligence. InProceedings of the 28th International symposium on high-performance parallel and distributed computing, pages 37–48, 2019

  26. [26]

    Machine learning augmented hybrid memory management

    Thaleia Dimitra Doudali and Ada Gavrilovska. Machine learning augmented hybrid memory management. InProceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’21, page 253–254, New York, NY, USA, 2021. Asso- ciation for Computing Machinery

  27. [27]

    Tun- ing the frequency of periodic data movements over hybrid memory systems.arXiv preprint arXiv:2101.07200, 2021

    Thaleia Dimitra Doudali, Daniel Zahka, and Ada Gavrilovska. Tun- ing the frequency of periodic data movements over hybrid memory systems.arXiv preprint arXiv:2101.07200, 2021

  28. [28]

    Tuning database configuration parameters with ituned.Proc

    Songyun Duan, Vamsidhar Thummala, and Shivnath Babu. Tuning database configuration parameters with ituned.Proc. VLDB Endow., 2(1):1246–1257, August 2009

  29. [29]

    Sizeless: Predicting the optimal size of serverless functions

    Simon Eismann, Long Bui, Johannes Grohmann, Cristina Abad, Niko- las Herbst, and Samuel Kounev. Sizeless: Predicting the optimal size of serverless functions. InProceedings of the 22nd International Mid- dleware Conference, pages 248–259, 2021

  30. [30]

    Verify- ing learning-augmented systems

    Tomer Eliyahu, Yafim Kazak, Guy Katz, and Michael Schapira. Verify- ing learning-augmented systems. SIGCOMM ’21, page 305–318, New York, NY, USA, 2021. Association for Computing Machinery

  31. [31]

    Towards a machine learning-assisted kernel with lake

    Henrique Fingler, Isha Tarte, Hangchen Yu, Ariel Szekely, Bodun Hu, Aditya Akella, and Christopher J Rossbach. Towards a machine learning-assisted kernel with lake. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 846–861, 2023

  32. [32]

    Tuna: Tuning unstable and noisy cloud applica- tions

    Johannes Freischuetz, Konstantinos Kanellis, Brian Kroth, and Shiv- aram Venkataraman. Tuna: Tuning unstable and noisy cloud applica- tions. InProceedings of the Twentieth European Conference on Computer Systems, pages 954–973, 2025

  33. [33]

    𝜆-tune: Harnessing large language models for automated database system tuning.Pro- ceedings of the ACM on Management of Data, 3(1):1–26, 2025

    Victor Giannakouris and Immanuel Trummer. 𝜆-tune: Harnessing large language models for automated database system tuning.Pro- ceedings of the ACM on Management of Data, 3(1):1–26, 2025

  34. [34]

    Google vizier: A service for black-box optimization

    Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and David Sculley. Google vizier: A service for black-box optimization. InProceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1487–1495, 2017

  35. [35]

    Using ebpf hooks to profile linux file system activity across benchmarking workloads

    Dhruv Goyal and Sebastian Angel. Using ebpf hooks to profile linux file system activity across benchmarking workloads. 2025

  36. [36]

    Glia: A Human-Inspired AI for Automated Systems Design and Optimization

    Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noor- bakhsh, Joseph Chandler, Ali ParandehGheibi, Mohammad Alizadeh, and Hari Balakrishnan. Glia: A human-inspired ai for automated systems design and optimization.arXiv preprint arXiv:2510.27176, 2025

  37. [37]

    {LinnOS}: Predictability on unpredictable flash storage with a light neural network

    Mingzhe Hao, Levent Toksoz, Nanqinqin Li, Edward Edberg Halim, Henry Hoffmann, and Haryadi S Gunawi. {LinnOS}: Predictability on unpredictable flash storage with a light neural network. In14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 173–190, 2020

  38. [38]

    Congestion control system optimization with large language models.arXiv preprint arXiv:2508.16074, 2025

    Zhiyuan He, Aashish Gottipati, Lili Qiu, Yuqing Yang, and Francis Y Yan. Congestion control system optimization with large language models.arXiv preprint arXiv:2508.16074, 2025

  39. [39]

    Deep q-learning from demonstrations

    Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep q-learning from demonstrations. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  40. [40]

    Metastable failures in the wild

    Lexiang Huang, Matthew Magnusson, Abishek Bangalore Muralikr- ishna, Salman Estyak, Rebecca Isaacs, Abutalib Aghayev, Timothy Zhu, and Aleksey Charapko. Metastable failures in the wild. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 73–90, Carlsbad, CA, July 2022. USENIX Association

  41. [41]

    Le, and Tianyin Xu

    Jinghao Jia, Raj Sahu, Adam Oswald, Dan Williams, Michael V. Le, and Tianyin Xu. Kernel extension verification is untenable. InHotOS 2023: Proceedings of the 19th Workshop on Hot Topics in Operating Systems, pages 150–157, 2023

  42. [42]

    Gptuner: A manual-reading database tuning system via gpt-guided bayesian optimization.Proceedings of the VLDB Endowment, 17(8):1939–1952, 2024

    Lao Jiale, Wang Jianping, Chen Wanghu, Wang Yibo, Zhang Yunjia, Tang Mingjie, Li Yufei, Cheng Zhiyuan, and Wang Jianguo. Gptuner: A manual-reading database tuning system via gpt-guided bayesian optimization.Proceedings of the VLDB Endowment, 17(8):1939–1952, 2024

  43. [43]

    Yan, and Ryan Beckett

    Sai Krishna Reddy Kakarla, Francis Y. Yan, and Ryan Beckett. Diffy: Data-driven bug finding for configurations.Proceedings of the ACM on Programming Languages, 8(PLDI), 2024

  44. [44]

    Herding llamas: Using llms as an os module.arXiv preprint arXiv:2401.08908, 2024

    Aditya K Kamath and Sujay Yadalam. Herding llamas: Using llms as an os module.arXiv preprint arXiv:2401.08908, 2024

  45. [45]

    Melanie Kambadur, Tipp Moseley, Rick Hank, and Martha A. Kim. Measuring interference between live datacenter applications. InSC ’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1–12, 2012

  46. [46]

    Too many knobs to tune? towards faster database tuning by pre-selecting important knobs

    Konstantinos Kanellis, Ramnatthan Alagappan, and Shivaram Venkataraman. Too many knobs to tune? towards faster database tuning by pre-selecting important knobs. In12th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 20), 2020

  47. [47]

    Llamatune: sample-efficient dbms configuration tuning.arXiv preprint arXiv:2203.05128, 2022

    Konstantinos Kanellis, Cong Ding, Brian Kroth, Andreas Müller, Carlo Curino, and Shivaram Venkataraman. Llamatune: sample-efficient dbms configuration tuning.arXiv preprint arXiv:2203.05128, 2022

  48. [48]

    Nautilus: A benchmarking platform for dbms knob tuning

    Konstantinos Kanellis, Johannes Freischuetz, and Shivaram Venkatara- man. Nautilus: A benchmarking platform for dbms knob tuning. In Proceedings of the Eighth Workshop on Data Management for End-to- End Machine Learning, pages 72–76, 2024

  49. [49]

    From good to great: Parameter tuning in memory tiering systems.IEEE Transactions on Computers, 75(4):1378–1390, 2026

    Konstantinos Kanellis, Sujay Yadalam, Hayden Coffey, Shivaram Venkataraman, and Michael Swift. From good to great: Parameter tuning in memory tiering systems.IEEE Transactions on Computers, 75(4):1378–1390, 2026

  50. [50]

    Striking the right chord: Parameter tuning in memory tiering systems

    Konstantinos Kanellis, Sujay Yadalam, Shivaram Venkataraman, and Michael Swift. Striking the right chord: Parameter tuning in memory tiering systems. InProceedings of the 3rd Workshop on Disruptive Memory Systems, DIMES ’25, page 1–9, New York, NY, USA, 2025. Association for Computing Machinery

  51. [51]

    Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan, Michal Kucer, and David Blei. Duel-evolve: Reward-free test-time scaling via LLM self-preferences. arXiv preprint arXiv:2602.21585, 2026

  52. [52]

    Ajaykrishna Karthikeyan, Nagarajan Natarajan, Gagan Somashekar, Lei Zhao, Ranjita Bhagwan, Rodrigo Fonseca, Tatiana Racheva, and Yogesh Bansal. SelfTune: Tuning cluster managers. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 1097–1114, 2023

  53. [53]

    Harshad Kasture and Daniel Sanchez. TailBench: A benchmark suite and evaluation methodology for latency-critical applications. In 2016 IEEE International Symposium on Workload Characterization (IISWC), pages 1–10. IEEE, 2016

  54. [54]

    Jonghyeon Kim, Wonkyo Choe, and Jeongseob Ahn. Exploring the design space of page management for Multi-Tiered memory systems. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 715–728. USENIX Association, July 2021

  55. [55]

    Alexey Kopytov. sysbench: Scriptable database and system performance benchmark. https://github.com/akopytov/sysbench, 2024. Version 1.0.20

  56. [56]

    Brian Kroth, Sergiy Matusevych, Rana Alotaibi, Yiwen Zhu, Anja Gruenheid, and Yuanyuan Tian. MLOS in action: Bridging the gap between experimentation and auto-tuning in the cloud. Proceedings of the VLDB Endowment, 17(12):4269–4272, 2024

  57. [57]

    Daniar H Kurniawan, Rani Ayu Putri, Peiran Qin, Kahfi S Zulkifli, Ray AO Sinurat, Janki Bhimani, Sandeep Madireddy, Achmad Imam Kistijantoro, and Haryadi S Gunawi. Heimdall: Optimizing storage I/O admission with extensive machine learning pipeline. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1109–1125, 2025

  58. [58]

    Jiale Lao, Yibo Wang, Yufei Li, Jianping Wang, Yunjia Zhang, Zhiyuan Cheng, Wanghu Chen, Mingjie Tang, and Jianguo Wang. GPTuner: An LLM-based database tuning system. ACM SIGMOD Record, 54(1):101–110, 2025

  59. [59]

    Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri Shanbhogue, Iftekhar Naim, Gustavo Hernández Ábrego, Zhe Li, Kaifeng Chen, Henrique Schechter Vera, et al. Gemini embedding: Generalizable embeddings from Gemini. arXiv preprint arXiv:2503.07891, 2025

  60. [60]

    Georgios Liargkovas, Vahab Jabrayilov, Hubertus Franke, and Kostis Kaffes. An expert in residence: LLM agents for always-on operating system tuning. In Machine Learning for Systems 2025, 2025

  61. [61]

    Jianheng Ling, Pratik Worah, Yawen Wang, Yunchuan Kong, Chunlei Wang, Clifford Stein, Diwakar Gupta, Jason Behmer, Logan A. Bush, Prakash Ramanan, Rajesh Kumar, Thomas Chestna, Yajing Liu, Ying Liu, Ye Zhao, Kathryn S. McKinley, Meeyoung Park, and Martin Maas. Lava: Lifetime-aware VM allocation with learned distributions and adaptation to mispredictions. ...

  62. [62]

    Jinshu Liu, Hamid Hadian, Hanchen Xu, and Huaicheng Li. Tiered memory management beyond hotness. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 731–747, 2025

  63. [63]

    Shu Liu, Mert Cemri, Shubham Agarwal, Alexander Krentsel, Ashwin Naren, Qiuyang Mang, Zhifei Li, Akshat Gupta, Monishwaran Maheswaran, Audrey Cheng, Melissa Pan, Ethan Boneh, Kannan Ramchandran, Koushik Sen, Alexandros G. Dimakis, Matei Zaharia, and Ion Stoica. Skydiscover: A flexible framework for AI-driven scientific and algorithmic discovery, 2026

  64. [64]

    Martin Maas, David G Andersen, Michael Isard, Mohammad Mahdi Javanmard, Kathryn S McKinley, and Colin Raffel. Combining machine learning and lifetime-based resource management for memory allocation and beyond. Communications of the ACM, 67(4):87–96, 2024

  65. [65]

    Kai Mei, Xi Zhu, Wujiang Xu, Wenyue Hua, Mingyu Jin, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. AIOS: LLM agent operating system. arXiv preprint arXiv:2403.16971, 2024

  66. [66]

    Lee Chong Ming. Replit's CEO apologizes after its AI agent wiped a company's code base in a test run and lied about it. https://www.businessinsider.com/replit-ceo-apologizes-ai-coding-tool-delete-company-database-2025-7, July 2025. Business Insider, accessed 2026-03-19

  67. [67]

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025

  68. [68]

    Juliana Alves Pereira, Mathieu Acher, Hugo Martin, Jean-Marc Jézéquel, Goetz Botterweck, and Anthony Ventresque. Learning software configuration spaces: A systematic literature review. Journal of Systems and Software, 182:111044, 2021

  69. [69]

    Jia Rao and Cheng-Zhong Xu. Online capacity identification of multi-tier websites using hardware performance counters. IEEE Transactions on Parallel and Distributed Systems, 22(3):426–438, 2010

  70. [70]

    Divyanshu Saxena, Jiayi Chen, Sujay Yadalam, Yeonju Ro, Rohit Dwivedula, Eric H Campbell, Aditya Akella, Christopher J Rossbach, and Michael Swift. How I learned to stop worrying and love learned OS policies. In Proceedings of the 2025 Workshop on Hot Topics in Operating Systems, pages 1–7, 2025

  71. [71]

    Divyanshu Saxena, Nihal Sharma, Donghyun Kim, Rohit Dwivedula, Jiayi Chen, Chenxi Yang, Sriram Ravula, Zichao Hu, Aditya Akella, Sebastian Angel, et al. On a foundation model for operating systems. arXiv preprint arXiv:2312.07813, 2023

  72. [72]

    Kai Shen, Ming Zhong, Sandhya Dwarkadas, Chuanpeng Li, Christopher Stewart, and Xiao Zhang. Hardware counter driven on-the-fly request signatures. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIII, pages 189–200, New York, NY, USA, 2008. Association for Computing Machinery

  73. [73]

    Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2012

  74. [74]

    Gagan Somashekar, Karan Tandon, Anush Kini, Chieh-Chun Chang, Petr Husak, Ranjita Bhagwan, Mayukh Das, Anshul Gandhi, and Nagarajan Natarajan. OPPerTune: Post-deployment configuration tuning of services made easy. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1101–1120, 2024

  75. [75]

    Wei Su, Abhishek Dhanotia, Carlos Torres, Jayneel Gandhi, Neha Gholkar, Shobhit Kanaujia, Maxim Naumov, Kalyan Subramanian, Valentin Andrei, Yifan Yuan, and Chunqiang Tang. DCPerf: An open-source, battle-tested performance benchmark suite for datacenter workloads. In Proceedings of the 52nd Annual International Symposium on Computer Architecture, ISCA '25...

  76. [76]

    Aditya Atul Tewari, Sujay Yadalam, Arthur Michener Peters, Saurabh Agarwal, Aditya Akella, Michael M Swift, and Christopher J Rossbach. OQueue: Observable communication in learning directed operating systems. In Proceedings of the 4th Workshop on Practical Adoption Challenges of ML for Systems, pages 31–36, 2025

  77. [77]

    Immanuel Trummer. DB-BERT: A database tuning tool that "reads the manual". In Proceedings of the 2022 International Conference on Management of Data, pages 190–203, 2022

  78. [78]

    Dana Van Aken, Andrew Pavlo, Geoffrey J Gordon, and Bohan Zhang. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1009–1024, 2017

  79. [79]

    Midhul Vuppalapati and Rachit Agarwal. Tiered memory management: Access latency is the key! In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, SOSP '24, pages 79–94, New York, NY, USA, 2024. Association for Computing Machinery

  80. [80]

    Shu Wang, Chi Li, Henry Hoffmann, Shan Lu, William Sentosa, and Achmad Imam Kistijantoro. Understanding and auto-adjusting performance-sensitive configurations. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '18, pages 154–168, New York, NY, USA, 2018. Association for Computing Machinery

Showing first 80 references.