Pith · machine review for the scientific record

arXiv: 2605.15026 · v1 · submitted 2026-05-14 · 💻 cs.OS · cs.AI · cs.PF

Recognition: no theorem link

SemaTune: Semantic-Aware Online OS Tuning with Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:33 UTC · model grok-4.3

classification 💻 cs.OS · cs.AI · cs.PF

keywords OS tuning · language model guidance · online parameter optimization · Linux sysctl · host-level metrics · performance tuning · validation loop · dual-loop control

The pith

SemaTune uses language models to reason over OS knob meanings and history, delivering 72.5 percent better stable performance than defaults across 13 workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SemaTune as a host-side system that feeds knob schemas, current settings, recent action-response pairs, and prior runs into a language model so it can propose safe Linux parameter changes while services run. A fast loop applies low-latency updates and a slower loop revises the overall search strategy, with every proposal passing typed validation before it reaches kernel interfaces. This structure lets the controller use semantic understanding of controls and indirect signals instead of treating every knob as an opaque variable optimized only for a scalar reward. The result is measured improvement without direct application metrics and without the persistent degraded states that structure-blind methods often produce.
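The fast/slow split described above can be sketched in a few lines. This is an illustrative skeleton, not SemaTune's implementation: the class names, the rejection-based strategy switch, and the fixed revision period are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    knob: str           # e.g. a sysctl name (illustrative)
    value: float
    justification: str  # the model's stated rationale

@dataclass
class DualLoopController:
    """Toy dual-loop tuner: a fast loop applies validated updates each
    window; a slow loop revises the search strategy every `period` windows."""
    period: int = 5
    strategy: str = "explore"
    history: list = field(default_factory=list)

    def fast_step(self, window: int, proposal: Proposal, valid: bool) -> bool:
        # Only proposals that passed validation reach the kernel interface.
        applied = valid
        self.history.append((window, proposal.knob, applied))
        if window % self.period == 0:
            self.slow_step()
        return applied

    def slow_step(self) -> None:
        # Revise strategy from recent action-response history (stub rule:
        # fall back to exploitation once most recent proposals were rejected).
        recent = self.history[-self.period:]
        rejected = sum(1 for _, _, ok in recent if not ok)
        self.strategy = "exploit" if rejected > self.period // 2 else "explore"
```

The point of the sketch is the division of authority: the fast loop never changes the search strategy, and the slow loop never touches the kernel directly.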

Core claim

SemaTune shows that bounded language-model guidance, combined with typed validation and dual-loop control, turns OS tuning into a semantically aware process that improves stable-phase performance by 72.5 percent over defaults and 153.3 percent over the strongest non-LLM baseline on 13 live workloads while tuning up to 41 parameters. The same controller still outperforms direct-application-objective baselines by 93.7 percentage points when restricted to host-level metrics alone and avoids the severe degraded regions reached by black-box exploration.

What carries the argument

Dual-loop controller that packs knob schemas, telemetry, configuration, history, and retrieved runs into compact context for an LLM, then validates every proposed change before kernel or sysctl application.
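A minimal sketch of what typed validation could look like. The schema table, knob names, and bounds here are hypothetical, not taken from the paper; the cross-knob rule mirrors the invalid `minperfpct > maxperfpct` example from Figure 1.

```python
# Hypothetical schema; SemaTune's actual knob schemas are not reproduced here.
SCHEMA = {
    "vm.swappiness":             {"type": int, "min": 0, "max": 200},
    "intel_pstate.min_perf_pct": {"type": int, "min": 0, "max": 100},
    "intel_pstate.max_perf_pct": {"type": int, "min": 0, "max": 100},
}

def validate(proposal: dict) -> tuple[bool, str]:
    """Typed validation: per-knob type and range checks, plus one simple
    cross-knob consistency rule (min CPU frequency <= max CPU frequency)."""
    for knob, value in proposal.items():
        spec = SCHEMA.get(knob)
        if spec is None:
            return False, f"unknown knob: {knob}"
        if not isinstance(value, spec["type"]):
            return False, f"{knob}: expected {spec['type'].__name__}"
        if not spec["min"] <= value <= spec["max"]:
            return False, f"{knob}: {value} outside [{spec['min']}, {spec['max']}]"
    lo = proposal.get("intel_pstate.min_perf_pct")
    hi = proposal.get("intel_pstate.max_perf_pct")
    if lo is not None and hi is not None and lo > hi:
        return False, "min_perf_pct exceeds max_perf_pct"
    return True, "ok"
```

Note what such a gate can and cannot do: it rejects out-of-schema and explicitly encoded inconsistent settings, but numerically valid yet workload-nonsensical combinations must be avoided by the model's semantic reasoning, not by the validator.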

If this is right

  • Tuning decisions can now incorporate cross-knob policy structure and indirect performance signals instead of scalar rewards alone.
  • Host-level controllers become viable for services that do not expose application metrics.
  • Exploration can be constrained to prevent entry into degraded states that continue after the bad setting is removed.
  • Model cost stays low, around 20 cents for a 30-window session, while still outperforming structure-blind methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The validation layer could be extended to other system interfaces such as network or storage stacks where semantic constraints are similarly available.
  • History retrieval might allow the slower loop to detect workload phase changes and switch strategies without additional human input.
  • Combining the semantic proposals with lightweight local models could reduce latency further while preserving the safety guarantees of typed checks.

Load-bearing premise

The language model will generate changes that improve or at least maintain performance after typed validation, even when only host-level metrics are available.

What would settle it

A workload where SemaTune, after validation, enters a degraded performance region that persists longer or more severely than the strongest non-LLM baseline under identical host-metric inputs.

Figures

Figures reproduced from arXiv: 2605.15026 by Georgios Liargkovas, Hubertus Franke, Kostis Kaffes, Mihir Nitin Joshi.

Figure 1. Steady-state online tuning. A host-side tuner updates OS knobs on a running host and uses observed signals to choose the next step.

Figure 2. MLOS performance examples. Left: Wikipedia p99 under MLOS with App, IPC, and Cache Miss objectives. Right: TPC-C p99 under MLOS as the tuning surface grows from 1 to 32 parameters.

Figure 3. System overview of SemaTune.

Figure 5. Dual-loop control in SemaTune.

Figure 6. Aggregate improvement over Default Parameters for SemaTune and baselines.

Figure 7. Aggregate improvement over Default Parameters for direct and indirect optimization objectives.

Figure 9. Aggregate P50 bad-window rate, P10 bad-window rate, and variability over the tuning phase (excluding catastrophic runs). Variability is defined as (1/R) Σ_{r=1}^{R} (σ_r / |μ_fixed|) · 100, where σ_r is the standard deviation of the tuner's metric over rerun r.

Figure 11. Aggregate improvement over Default Parameters for TPC-C, Silo, and Sysbench OLTP-RW with and without memory, with app metrics (left) and system metrics (right).

Figure 12. Model backend comparison on TPC-C, Silo, and Sysbench OLTP-RW. Left: aggregate improvement over Default Parameters during tuning and stable phases. Right: stable-phase improvement vs. total session cost.
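The variability metric quoted in the Figure 9 caption can be computed directly. A minimal sketch, assuming each rerun is a list of per-window metric values and `mu_fixed` is the fixed-configuration baseline mean:

```python
import statistics

def variability(reruns: list[list[float]], mu_fixed: float) -> float:
    """Variability per Figure 9: the mean over reruns of each rerun's
    metric standard deviation, normalized by |mu_fixed| and expressed
    as a percentage."""
    R = len(reruns)
    return sum(statistics.stdev(trace) / abs(mu_fixed) * 100
               for trace in reruns) / R
```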
Original abstract

Online OS tuning can improve long-running services, but existing controllers are poorly matched to live hosts. They treat scheduler, power, memory, and I/O controls as black-box variables and optimize a scalar reward. This view ignores cross-knob policy structure, breaks down when application metrics are unavailable, and can send a running service into degraded regions that persist after the bad setting is removed. We present SemaTune, a host-side framework for steady-state OS tuning with bounded language-model guidance. SemaTune turns knob schemas, telemetry, current configuration, recent action-response history, and retrieved prior runs into a compact decision context. A fast loop proposes low-latency updates, a slower loop periodically revises the search strategy, and every proposed change passes through typed validation before reaching kernel or sysctl interfaces. This lets the controller reason about OS-control meaning and indirect performance signals while keeping model cost, latency, and authority constrained. We evaluate SemaTune on 13 live workloads from five benchmark suites while tuning up to 41 Linux parameters. Across the suite, SemaTune improves stable-phase performance by 72.5% over default settings and by 153.3% relative to the strongest non-LLM baseline. A 30-window session costs about $0.20 in model calls. With only host-level metrics, SemaTune still outperforms baselines given direct application objectives by 93.7 percentage points, while avoiding severe degraded regions reached by structure-blind exploration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SemaTune, a host-side framework for online OS tuning of Linux parameters that incorporates LLM guidance informed by knob schemas, telemetry, current configuration, action-response history, and retrieved prior runs. It uses a fast proposal loop and a slower strategy-revision loop, with all changes passing typed validation before application. Evaluated on 13 live workloads from five benchmark suites while tuning up to 41 parameters, SemaTune reports 72.5% stable-phase improvement over defaults and 153.3% over the strongest non-LLM baseline, at low model cost, while claiming to avoid persistent degraded regions even with only host-level metrics.

Significance. If the central claims hold under rigorous validation, the work would represent a meaningful advance in practical online systems tuning by demonstrating how constrained LLM reasoning over semantic and historical context can outperform black-box controllers, particularly in settings without direct application metrics. The bounded-cost design and explicit handling of cross-knob structure address known failure modes of prior methods.

major comments (2)
  1. [Evaluation] Evaluation section: the reported 72.5% and 153.3% stable-phase gains are presented without per-workload traces, post-tuning monitoring beyond the 30-window sessions, or statistical tests confirming absence of regression after the tuning window closes; this leaves the claim that semantic context reliably prevents persistent degradation unverified.
  2. [System Design and Evaluation] The typed-validation mechanism is described as checking schemas and interfaces, yet no analysis or experiments demonstrate that it catches emergent cross-knob interactions (e.g., scheduler-memory-I/O combinations producing sustained high latency); the abstract notes that structure-blind methods reach such regions, but the evaluation provides no concrete evidence that SemaTune avoids them.
minor comments (2)
  1. The abstract and evaluation could more explicitly list the 41 Linux parameters and the five benchmark suites to improve reproducibility.
  2. [System Design] Notation for the fast and slow loops is introduced without a compact diagram or pseudocode, making the control flow harder to follow on first reading.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below with targeted revisions to strengthen the evaluation and clarify the design.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the reported 72.5% and 153.3% stable-phase gains are presented without per-workload traces, post-tuning monitoring beyond the 30-window sessions, or statistical tests confirming absence of regression after the tuning window closes; this leaves the claim that semantic context reliably prevents persistent degradation unverified.

    Authors: We agree that additional per-workload detail and statistical support would strengthen the claims. In the revised manuscript we will add an appendix with per-workload performance traces for all 13 workloads and include statistical tests (paired t-tests with p-values) on the stable-phase improvements. The 30-window sessions define the evaluation window, with stable phase measured in the final windows; we did not collect extended post-session monitoring data. The avoidance of persistent degradation is evidenced by the absence of the regressions observed in baselines during these sessions, but we will add an explicit limitations paragraph noting that longer-term post-tuning monitoring remains future work. revision: partial

  2. Referee: [System Design and Evaluation] The typed-validation mechanism is described as checking schemas and interfaces, yet no analysis or experiments demonstrate that it catches emergent cross-knob interactions (e.g., scheduler-memory-I/O combinations producing sustained high latency); the abstract notes that structure-blind methods reach such regions, but the evaluation provides no concrete evidence that SemaTune avoids them.

    Authors: Typed validation performs schema conformance and interface compatibility checks to reject syntactically invalid settings, but does not model or detect emergent cross-knob interactions at runtime. Avoidance of degraded regions is achieved by the LLM's semantic reasoning over knob schemas, telemetry, action history, and retrieved runs in both the fast proposal and strategy-revision loops. The evaluation shows SemaTune outperforming structure-blind baselines without entering severe degradation, yet we do not isolate a specific cross-knob failure case. In revision we will clarify this distinction in Section 3 and add a qualitative example illustrating how semantic context steers away from a known harmful scheduler-memory-I/O combination. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes an empirical system for LLM-guided OS tuning and reports concrete performance gains from live workload experiments against external baselines and defaults. No equations, fitted parameters, self-citations, or ansatzes are invoked as load-bearing steps in any derivation; the central claims rest on measured improvements (72.5% and 153.3%) rather than reductions to inputs by construction. The evaluation is self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's approach assumes LLM capabilities as a domain assumption rather than providing a derivation or independent evidence for the reasoning quality.

axioms (1)
  • domain assumption Large language models possess sufficient semantic understanding of OS controls and telemetry to propose effective tuning actions.
    This is central to the framework's ability to outperform structure-blind methods.

pith-pipeline@v0.9.0 · 5582 in / 1135 out tokens · 73998 ms · 2026-05-15T02:33:25.577520+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · 4 internal anchors

  1. [1]

    PhD thesis, Inria Rennes-Bretagne Atlantique, 2019

    Mathieu Acher, Hugo Martin, Juliana Alves Pereira, Arnaud Blouin, Jean-Marc Jézéquel, Djamel Eddine Khelladi, Luc Lesoil, and Olivier Barais.Learning very large configuration spaces: What matters for linux kernel sizes. PhD thesis, Inria Rennes-Bretagne Atlantique, 2019

  2. [2]

    Improving storage systems using machine learning.ACM Transactions on Storage, 19(1):1– 30, 2023

    Ibrahim Umit Akgun, Ali Selman Aydin, Andrew Burford, Michael McNeill, Michael Arkhangelskiy, and Erez Zadok. Improving storage systems using machine learning.ACM Transactions on Storage, 19(1):1– 30, 2023

  3. [3]

    A machine learning framework to improve storage system performance

    Ibrahim Umit Akgun, Ali Selman Aydin, Aadil Shaikh, Lukas Velikov, and Erez Zadok. A machine learning framework to improve storage system performance. InProceedings of the 13th ACM Workshop on Hot Topics in Storage and File Systems, HotStorage ’21, page 94–102, New York, NY, USA, 2021. Association for Computing Machinery

  4. [4]

    Cose: Configuring serverless functions using statistical learning

    Nabeel Akhtar, Ali Raza, Vatche Ishakian, and Ibrahim Matta. Cose: Configuring serverless functions using statistical learning. InIEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pages 129–138, 2020

  5. [5]

    {CherryPick}: Adap- tively unearthing the best cloud configurations for big data analytics

    Omid Alipourfard, Hongqiang Harry Liu, Jianshu Chen, Shivaram Venkataraman, Minlan Yu, and Ming Zhang. {CherryPick}: Adap- tively unearthing the best cloud configurations for big data analytics. In14th USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI 17), pages 469–482, 2017

  6. [6]

    arXiv preprint arXiv:2510.14150 , year =

    Henrique Assumpção, Diego Ferreira, Leandro Campos, and Fabricio Murai. Codeevolve: An open source evolutionary coding agent for algorithm discovery and optimization.arXiv preprint arXiv:2510.14150, 2025

  7. [7]

    Workload analysis of a large-scale key-value store

    Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. Workload analysis of a large-scale key-value store. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint inter- national conference on Measurement and Modeling of Computer Systems, pages 53–64, 2012

  8. [8]

    {Config-Snob}: Tuning for the best configurations of networking protocol stack

    Manaf Bin-Yahya, Yifei Zhao, Hossein Shafieirad, Anthony Ho, Shijun Yin, Fanzhao Wang, and Geng Li. {Config-Snob}: Tuning for the best configurations of networking protocol stack. In2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 749–765, 2024

  9. [9]

    Contention-aware scheduling on multicore systems.ACM Trans- actions on Computer Systems (TOCS), 28(4):1–45, 2010

    Sergey Blagodurov, Sergey Zhuravlev, and Alexandra Fedorova. Contention-aware scheduling on multicore systems.ACM Trans- actions on Computer Systems (TOCS), 28(4):1–45, 2010

  10. [10]

    Contention-aware scheduling on multicore systems.ACM Trans

    Sergey Blagodurov, Sergey Zhuravlev, and Alexandra Fedorova. Contention-aware scheduling on multicore systems.ACM Trans. Comput. Syst., 28(4), December 2010

  11. [11]

    Metastable failures in distributed systems

    Nathan Bronson, Abutalib Aghayev, Aleksey Charapko, and Timothy Zhu. Metastable failures in distributed systems. InProceedings of the Workshop on Hot Topics in Operating Systems, HotOS ’21, page 221–227, New York, NY, USA, 2021. Association for Computing Machinery

  12. [12]

    Carver: Finding important parameters for storage system tuning

    Zhen Cao, Geoff Kuenning, and Erez Zadok. Carver: Finding important parameters for storage system tuning. In18th USENIX Conference on File and Storage Technologies (FAST 20), pages 43–57, 2020

  13. [13]

    SmartChoices: Hybridizing Programming and Machine Learning

    Victor Carbune, Thierry Coppey, Alexander Daryin, Thomas Dese- laers, Nikhil Sarda, and Jay Yagnik. Smartchoices: hybridizing pro- gramming and machine learning.arXiv preprint arXiv:1810.00619, 2018

  14. [14]

    2602.20133 , archivePrefix =

    Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, et al. Adaevolve: Adaptive llm driven zeroth-order optimization.arXiv preprint arXiv:2602.20133, 2026

  15. [15]

    Autoos: make your os more powerful by exploiting large language models

    Huilai Chen, Yuanbo Wen, Limin Cheng, Shouxu Kuang, Yumeng Liu, Weijia Li, Ling Li, Rui Zhang, Xinkai Song, Wei Li, et al. Autoos: make your os more powerful by exploiting large language models. In Forty-first International Conference on Machine Learning, 2024

  16. [16]

    Banerjee, Zbigniew T

    Jingde Chen, Subho S. Banerjee, Zbigniew T. Kalbarczyk, and Rav- ishankar K. Iyer. Machine learning for load balancing in the linux kernel. InProceedings of the 11th ACM SIGOPS Asia-Pacific Workshop on Systems, pages 67–74, 2020

  17. [17]

    Principled performance tunability in operating system kernels.arXiv preprint arXiv:2512.12530, 2025

    Zhongjie Chen, Wentao Zhang, Yulong Tang, Ran Shu, Fengyuan Ren, Tianyin Xu, and Jing Liu. Principled performance tunability in operating system kernels.arXiv preprint arXiv:2512.12530, 2025

  18. [18]

    Bar- barians at the gate: How ai is upending systems research.arXiv 13 Georgios Liargkovas, Mihir Nitin Joshi, Hubertus Franke, and Kostis Kaffes preprint arXiv:2510.06189, 2025

    Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, et al. Bar- barians at the gate: How ai is upending systems research.arXiv 13 Georgios Liargkovas, Mihir Nitin Joshi, Hubertus Franke, and Kostis Kaffes preprint arXiv:2510.06189, 2025

  19. [19]

    Chroma-Core.Chroma: The AI-native open-source embedding database,

  20. [20]

    Accessed: 2026-04-01

  21. [21]

    Correlating instrumentation data to system states: A building block for automated diagnosis and control

    Ira Cohen, Jeffrey S Chase, Moises Goldszmidt, Terence Kelly, and Julie Symons. Correlating instrumentation data to system states: A building block for automated diagnosis and control. InOSDI, volume 4, pages 16–16, 2004

  22. [22]

    Code execution through deception: Gemini ai cli hi- jack.https://tracebit.com/blog/code-exec-deception-gemini-ai-cli- hijack, July 2025

    Sam Cox. Code execution through deception: Gemini ai cli hi- jack.https://tracebit.com/blog/code-exec-deception-gemini-ai-cli- hijack, July 2025. Tracebit Research Blog. Accessed: 2026-03-19

  23. [23]

    Mlos: An infrastructure for automated software performance engineering

    Carlo Curino, Neha Godwal, Brian Kroth, Sergiy Kuryata, Greg Lapin- ski, Siqi Liu, Slava Oks, Olga Poppe, Adam Smiechowski, Ed Thayer, et al. Mlos: An infrastructure for automated software performance engineering. InProceedings of the Fourth International Workshop on Data Management for End-to-End Machine Learning, pages 1–5, 2020

  24. [24]

    Oltp-bench: An extensible testbed for benchmarking relational databases.PVLDB, 7(4):277–288, 2013

    Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino, and Philippe Cudré-Mauroux. Oltp-bench: An extensible testbed for benchmarking relational databases.PVLDB, 7(4):277–288, 2013

  25. [25]

    Kleio: A hybrid memory page scheduler with machine intelligence

    Thaleia Dimitra Doudali, Sergey Blagodurov, Abhinav Vishnu, Sud- hanva Gurumurthi, and Ada Gavrilovska. Kleio: A hybrid memory page scheduler with machine intelligence. InProceedings of the 28th International symposium on high-performance parallel and distributed computing, pages 37–48, 2019

  26. [26]

    Machine learning augmented hybrid memory management

    Thaleia Dimitra Doudali and Ada Gavrilovska. Machine learning augmented hybrid memory management. InProceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’21, page 253–254, New York, NY, USA, 2021. Asso- ciation for Computing Machinery

  27. [27]

    Tun- ing the frequency of periodic data movements over hybrid memory systems.arXiv preprint arXiv:2101.07200, 2021

    Thaleia Dimitra Doudali, Daniel Zahka, and Ada Gavrilovska. Tun- ing the frequency of periodic data movements over hybrid memory systems.arXiv preprint arXiv:2101.07200, 2021

  28. [28]

    Tuning database configuration parameters with ituned.Proc

    Songyun Duan, Vamsidhar Thummala, and Shivnath Babu. Tuning database configuration parameters with ituned.Proc. VLDB Endow., 2(1):1246–1257, August 2009

  29. [29]

    Sizeless: Predicting the optimal size of serverless functions

    Simon Eismann, Long Bui, Johannes Grohmann, Cristina Abad, Niko- las Herbst, and Samuel Kounev. Sizeless: Predicting the optimal size of serverless functions. InProceedings of the 22nd International Mid- dleware Conference, pages 248–259, 2021

  30. [30]

    Verify- ing learning-augmented systems

    Tomer Eliyahu, Yafim Kazak, Guy Katz, and Michael Schapira. Verify- ing learning-augmented systems. SIGCOMM ’21, page 305–318, New York, NY, USA, 2021. Association for Computing Machinery

  31. [31]

    Towards a machine learning-assisted kernel with lake

    Henrique Fingler, Isha Tarte, Hangchen Yu, Ariel Szekely, Bodun Hu, Aditya Akella, and Christopher J Rossbach. Towards a machine learning-assisted kernel with lake. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 846–861, 2023

  32. [32]

    Tuna: Tuning unstable and noisy cloud applica- tions

    Johannes Freischuetz, Konstantinos Kanellis, Brian Kroth, and Shiv- aram Venkataraman. Tuna: Tuning unstable and noisy cloud applica- tions. InProceedings of the Twentieth European Conference on Computer Systems, pages 954–973, 2025

  33. [33]

    𝜆-tune: Harnessing large language models for automated database system tuning.Pro- ceedings of the ACM on Management of Data, 3(1):1–26, 2025

    Victor Giannakouris and Immanuel Trummer. 𝜆-tune: Harnessing large language models for automated database system tuning.Pro- ceedings of the ACM on Management of Data, 3(1):1–26, 2025

  34. [34]

    Google vizier: A service for black-box optimization

    Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and David Sculley. Google vizier: A service for black-box optimization. InProceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1487–1495, 2017

  35. [35]

    Using ebpf hooks to profile linux file system activity across benchmarking workloads

    Dhruv Goyal and Sebastian Angel. Using ebpf hooks to profile linux file system activity across benchmarking workloads. 2025

  36. [36]

    Glia: A Human-Inspired AI for Automated Systems Design and Optimization

    Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noor- bakhsh, Joseph Chandler, Ali ParandehGheibi, Mohammad Alizadeh, and Hari Balakrishnan. Glia: A human-inspired ai for automated systems design and optimization.arXiv preprint arXiv:2510.27176, 2025

  37. [37]

    {LinnOS}: Predictability on unpredictable flash storage with a light neural network

    Mingzhe Hao, Levent Toksoz, Nanqinqin Li, Edward Edberg Halim, Henry Hoffmann, and Haryadi S Gunawi. {LinnOS}: Predictability on unpredictable flash storage with a light neural network. In14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 173–190, 2020

  38. [38]

    Congestion control system optimization with large language models.arXiv preprint arXiv:2508.16074, 2025

    Zhiyuan He, Aashish Gottipati, Lili Qiu, Yuqing Yang, and Francis Y Yan. Congestion control system optimization with large language models.arXiv preprint arXiv:2508.16074, 2025

  39. [39]

    Deep q-learning from demonstrations

    Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep q-learning from demonstrations. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  40. [40]

    Metastable failures in the wild

    Lexiang Huang, Matthew Magnusson, Abishek Bangalore Muralikr- ishna, Salman Estyak, Rebecca Isaacs, Abutalib Aghayev, Timothy Zhu, and Aleksey Charapko. Metastable failures in the wild. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 73–90, Carlsbad, CA, July 2022. USENIX Association

  41. [41]

    Le, and Tianyin Xu

    Jinghao Jia, Raj Sahu, Adam Oswald, Dan Williams, Michael V. Le, and Tianyin Xu. Kernel extension verification is untenable. InHotOS 2023: Proceedings of the 19th Workshop on Hot Topics in Operating Systems, pages 150–157, 2023

  42. [42]

    Gptuner: A manual-reading database tuning system via gpt-guided bayesian optimization.Proceedings of the VLDB Endowment, 17(8):1939–1952, 2024

    Lao Jiale, Wang Jianping, Chen Wanghu, Wang Yibo, Zhang Yunjia, Tang Mingjie, Li Yufei, Cheng Zhiyuan, and Wang Jianguo. Gptuner: A manual-reading database tuning system via gpt-guided bayesian optimization.Proceedings of the VLDB Endowment, 17(8):1939–1952, 2024

  43. [43]

    Yan, and Ryan Beckett

    Sai Krishna Reddy Kakarla, Francis Y. Yan, and Ryan Beckett. Diffy: Data-driven bug finding for configurations.Proceedings of the ACM on Programming Languages, 8(PLDI), 2024

  44. [44]

    Herding llamas: Using llms as an os module.arXiv preprint arXiv:2401.08908, 2024

    Aditya K Kamath and Sujay Yadalam. Herding llamas: Using llms as an os module.arXiv preprint arXiv:2401.08908, 2024

  45. [45]

    Melanie Kambadur, Tipp Moseley, Rick Hank, and Martha A. Kim. Measuring interference between live datacenter applications. InSC ’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1–12, 2012

  46. [46]

    Too many knobs to tune? towards faster database tuning by pre-selecting important knobs

    Konstantinos Kanellis, Ramnatthan Alagappan, and Shivaram Venkataraman. Too many knobs to tune? towards faster database tuning by pre-selecting important knobs. In12th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 20), 2020

  47. [47]

    Llamatune: sample-efficient dbms configuration tuning.arXiv preprint arXiv:2203.05128, 2022

    Konstantinos Kanellis, Cong Ding, Brian Kroth, Andreas Müller, Carlo Curino, and Shivaram Venkataraman. Llamatune: sample-efficient dbms configuration tuning.arXiv preprint arXiv:2203.05128, 2022

  48. [48]

    Nautilus: A benchmarking platform for dbms knob tuning

    Konstantinos Kanellis, Johannes Freischuetz, and Shivaram Venkatara- man. Nautilus: A benchmarking platform for dbms knob tuning. In Proceedings of the Eighth Workshop on Data Management for End-to- End Machine Learning, pages 72–76, 2024

  49. [49]

    From good to great: Parameter tuning in memory tiering systems.IEEE Transactions on Computers, 75(4):1378–1390, 2026

    Konstantinos Kanellis, Sujay Yadalam, Hayden Coffey, Shivaram Venkataraman, and Michael Swift. From good to great: Parameter tuning in memory tiering systems.IEEE Transactions on Computers, 75(4):1378–1390, 2026

  50. [50]

    Striking the right chord: Parameter tuning in memory tiering systems

    Konstantinos Kanellis, Sujay Yadalam, Shivaram Venkataraman, and Michael Swift. Striking the right chord: Parameter tuning in memory tiering systems. InProceedings of the 3rd Workshop on Disruptive Memory Systems, DIMES ’25, page 1–9, New York, NY, USA, 2025. Association for Computing Machinery

  51. [51]

    Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan, Michal Kucer, and David Blei. Duel-evolve: Reward-free test-time scaling via LLM self-preferences. arXiv preprint arXiv:2602.21585, 2026

  52. [52]

    Ajaykrishna Karthikeyan, Nagarajan Natarajan, Gagan Somashekar, Lei Zhao, Ranjita Bhagwan, Rodrigo Fonseca, Tatiana Racheva, and Yogesh Bansal. SelfTune: Tuning cluster managers. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 1097–1114, 2023

  53. [53]

    Harshad Kasture and Daniel Sanchez. TailBench: A benchmark suite and evaluation methodology for latency-critical applications. In 2016 IEEE International Symposium on Workload Characterization (IISWC), pages 1–10. IEEE, 2016

  54. [54]

    Jonghyeon Kim, Wonkyo Choe, and Jeongseob Ahn. Exploring the design space of page management for Multi-Tiered memory systems. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 715–728. USENIX Association, July 2021

  55. [55]

    Alexey Kopytov. sysbench: Scriptable database and system performance benchmark. https://github.com/akopytov/sysbench, 2024. Version 1.0.20

  56. [56]

    Brian Kroth, Sergiy Matusevych, Rana Alotaibi, Yiwen Zhu, Anja Gruenheid, and Yuanyuan Tian. MLOS in action: Bridging the gap between experimentation and auto-tuning in the cloud. Proceedings of the VLDB Endowment, 17(12):4269–4272, 2024

  57. [57]

    Daniar H Kurniawan, Rani Ayu Putri, Peiran Qin, Kahfi S Zulkifli, Ray AO Sinurat, Janki Bhimani, Sandeep Madireddy, Achmad Imam Kistijantoro, and Haryadi S Gunawi. Heimdall: Optimizing storage I/O admission with extensive machine learning pipeline. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1109–1125, 2025

  58. [58]

    Jiale Lao, Yibo Wang, Yufei Li, Jianping Wang, Yunjia Zhang, Zhiyuan Cheng, Wanghu Chen, Mingjie Tang, and Jianguo Wang. GPTuner: An LLM-based database tuning system. ACM SIGMOD Record, 54(1):101–110, 2025

  59. [59]

    Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri Shanbhogue, Iftekhar Naim, Gustavo Hernández Ábrego, Zhe Li, Kaifeng Chen, Henrique Schechter Vera, et al. Gemini embedding: Generalizable embeddings from Gemini. arXiv preprint arXiv:2503.07891, 2025

  60. [60]

    Georgios Liargkovas, Vahab Jabrayilov, Hubertus Franke, and Kostis Kaffes. An expert in residence: LLM agents for always-on operating system tuning. In Machine Learning for Systems 2025, 2025

  61. [61]

    Jianheng Ling, Pratik Worah, Yawen Wang, Yunchuan Kong, Chunlei Wang, Clifford Stein, Diwakar Gupta, Jason Behmer, Logan A. Bush, Prakash Ramanan, Rajesh Kumar, Thomas Chestna, Yajing Liu, Ying Liu, Ye Zhao, Kathryn S. McKinley, Meeyoung Park, and Martin Maas. Lava: Lifetime-aware VM allocation with learned distributions and adaptation to mispredictions. ...

  62. [62]

    Jinshu Liu, Hamid Hadian, Hanchen Xu, and Huaicheng Li. Tiered memory management beyond hotness. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 731–747, 2025

  63. [63]

    Shu Liu, Mert Cemri, Shubham Agarwal, Alexander Krentsel, Ashwin Naren, Qiuyang Mang, Zhifei Li, Akshat Gupta, Monishwaran Maheswaran, Audrey Cheng, Melissa Pan, Ethan Boneh, Kannan Ramchandran, Koushik Sen, Alexandros G. Dimakis, Matei Zaharia, and Ion Stoica. Skydiscover: A flexible framework for AI-driven scientific and algorithmic discovery, 2026

  64. [64]

    Martin Maas, David G Andersen, Michael Isard, Mohammad Mahdi Javanmard, Kathryn S McKinley, and Colin Raffel. Combining machine learning and lifetime-based resource management for memory allocation and beyond. Communications of the ACM, 67(4):87–96, 2024

  65. [65]

    Kai Mei, Xi Zhu, Wujiang Xu, Wenyue Hua, Mingyu Jin, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. AIOS: LLM agent operating system. arXiv preprint arXiv:2403.16971, 2024

  66. [66]

    Lee Chong Ming. Replit's CEO apologizes after its AI agent wiped a company's code base in a test run and lied about it. https://www.businessinsider.com/replit-ceo-apologizes-ai-coding-tool-delete-company-database-2025-7, July 2025. Business Insider, accessed 2026-03-19

  67. [67]

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025

  68. [68]

    Juliana Alves Pereira, Mathieu Acher, Hugo Martin, Jean-Marc Jézéquel, Goetz Botterweck, and Anthony Ventresque. Learning software configuration spaces: A systematic literature review. Journal of Systems and Software, 182:111044, 2021

  69. [69]

    Jia Rao and Cheng-Zhong Xu. Online capacity identification of multi-tier websites using hardware performance counters. IEEE Transactions on Parallel and Distributed Systems, 22(3):426–438, 2010

  70. [70]

    Divyanshu Saxena, Jiayi Chen, Sujay Yadalam, Yeonju Ro, Rohit Dwivedula, Eric H Campbell, Aditya Akella, Christopher J Rossbach, and Michael Swift. How I learned to stop worrying and love learned OS policies. In Proceedings of the 2025 Workshop on Hot Topics in Operating Systems, pages 1–7, 2025

  71. [71]

    Divyanshu Saxena, Nihal Sharma, Donghyun Kim, Rohit Dwivedula, Jiayi Chen, Chenxi Yang, Sriram Ravula, Zichao Hu, Aditya Akella, Sebastian Angel, et al. On a foundation model for operating systems. arXiv preprint arXiv:2312.07813, 2023

  72. [72]

    Kai Shen, Ming Zhong, Sandhya Dwarkadas, Chuanpeng Li, Christopher Stewart, and Xiao Zhang. Hardware counter driven on-the-fly request signatures. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIII, pages 189–200, New York, NY, USA, 2008. Association for Computing Machinery

  73. [73]

    Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2012

  74. [74]

    Gagan Somashekar, Karan Tandon, Anush Kini, Chieh-Chun Chang, Petr Husak, Ranjita Bhagwan, Mayukh Das, Anshul Gandhi, and Nagarajan Natarajan. OPPerTune: Post-deployment configuration tuning of services made easy. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1101–1120, 2024

  75. [75]

    Wei Su, Abhishek Dhanotia, Carlos Torres, Jayneel Gandhi, Neha Gholkar, Shobhit Kanaujia, Maxim Naumov, Kalyan Subramanian, Valentin Andrei, Yifan Yuan, and Chunqiang Tang. DCPerf: An open-source, battle-tested performance benchmark suite for datacenter workloads. In Proceedings of the 52nd Annual International Symposium on Computer Architecture, ISCA '25...

  76. [76]

    Aditya Atul Tewari, Sujay Yadalam, Arthur Michener Peters, Saurabh Agarwal, Aditya Akella, Michael M Swift, and Christopher J Rossbach. OQueue: Observable communication in learning directed operating systems. In Proceedings of the 4th Workshop on Practical Adoption Challenges of ML for Systems, pages 31–36, 2025

  77. [77]

    Immanuel Trummer. DB-BERT: A database tuning tool that "reads the manual". In Proceedings of the 2022 International Conference on Management of Data, pages 190–203, 2022

  78. [78]

    Dana Van Aken, Andrew Pavlo, Geoffrey J Gordon, and Bohan Zhang. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1009–1024, 2017

  79. [79]

    Midhul Vuppalapati and Rachit Agarwal. Tiered memory management: Access latency is the key! In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, SOSP '24, pages 79–94, New York, NY, USA, 2024. Association for Computing Machinery

  80. [80]

    Shu Wang, Chi Li, Henry Hoffmann, Shan Lu, William Sentosa, and Achmad Imam Kistijantoro. Understanding and auto-adjusting performance-sensitive configurations. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '18, pages 154–168, New York, NY, USA, 2018. Association for Computing Machinery

Showing first 80 references.