arxiv: 2605.04373 · v1 · submitted 2026-05-06 · cs.NI · cs.AI · cs.SY · eess.SY

Worst-Case Discovery and Runtime Protection for RL-Based Network Controllers

Hongyu Hè, Minhao Jin, Maria Apostolaki

Pith reviewed 2026-05-08 17:27 UTC · model grok-4.3

keywords: Reinforcement learning, Network controllers, Worst-case discovery, Runtime protection, Bilevel optimization, Regret maximization, Congestion control, Adaptive bitrate streaming

The pith

ReGuard discovers network conditions where RL controllers perform 43-64% worse than achievable and protects them at runtime with lightweight rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning controllers for networking tasks such as congestion control and adaptive bitrate streaming achieve strong average performance but can degrade sharply under specific conditions where better results remain possible. ReGuard formulates discovery of these conditions as a bilevel regret-maximization problem that produces a certified lower bound on the performance gap. The resulting trajectories are examined as counterfactuals and turned into simple logic rules that intervene only in risky states during operation. On three controllers the method locates gaps 57 percent to six times larger than prior approaches and reduces them by 79-85 percent while leaving normal behavior unchanged, with the protection carrying over to additional network conditions.

Core claim

ReGuard discovers worst-case scenarios for a given RL controller by solving a bilevel regret-maximization problem, which yields a certified lower bound on the worst-case performance gap. The discovered trajectories are analyzed as counterfactuals and compiled into lightweight logic rules that intervene only when a risky state is detected, leaving the controller's behavior unchanged otherwise. Across Pensieve, Sage, and Park, ReGuard finds scenarios in which performance is 43-64 percent worse than achievable, locates gaps 57 percent to 6 times larger than the strongest baselines, and shrinks those gaps by 79-85 percent via the rule-based protection while preserving nominal performance; the protection also extends beyond the discovered scenarios to a wider range of network conditions.
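
The discovery loop described above can be sketched as follows. This is a toy illustration under loud assumptions, not the paper's implementation: the scenario is a single hypothetical bandwidth value, the controller and heuristics are made-up functions, and the outer search is plain random search rather than the paper's solver. It only shows the shape of the idea: an outer search over scenarios, and a portfolio of heuristics standing in for the scenario-specific optimum.

```python
import random

# All names below are hypothetical stand-ins, not the paper's API.

def controller_score(bandwidth: float) -> float:
    """Toy RL controller: does well except in a narrow band of bandwidths."""
    return 0.4 * bandwidth if 2.0 <= bandwidth <= 3.0 else 0.9 * bandwidth

# Portfolio of simple heuristics that approximate the scenario-specific optimum.
HEURISTICS = [
    lambda bw: 0.8 * bw,              # conservative fixed-fraction policy
    lambda bw: min(bw, 2.5) * 0.95,   # efficient policy that saturates at 2.5
]

def regret(bandwidth: float) -> float:
    """Relative gap between the best heuristic and the controller (>= 0)."""
    best = max(h(bandwidth) for h in HEURISTICS)
    gap = best - controller_score(bandwidth)
    return max(gap, 0.0) / best

def discover(n_samples: int = 2000, seed: int = 0) -> tuple[float, float]:
    """Outer loop: search scenarios for the one with the largest regret."""
    rng = random.Random(seed)
    candidates = (rng.uniform(0.5, 6.0) for _ in range(n_samples))
    worst = max(candidates, key=regret)
    return worst, regret(worst)

scenario, gap = discover()
print(f"worst scenario found: bandwidth={scenario:.2f}, relative gap={gap:.0%}")
```

Because the best heuristic can only underestimate what is truly achievable in a scenario, the gap printed here can only understate the real one, which is the sense in which a discovered gap acts as a lower bound.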

What carries the argument

Bilevel regret-maximization procedure that produces certified performance-gap bounds, followed by counterfactual analysis to extract lightweight logic rules for selective runtime intervention.
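
A minimal sketch of selective runtime intervention, assuming a hypothetical state with two features and a single made-up rule. The summary does not show the paper's rule language, so the `State`, `Rule`, and `guarded_action` names and the threshold values are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types and rule; not taken from the paper.

@dataclass
class State:
    throughput_mbps: float   # recent measured throughput
    buffer_s: float          # playback buffer in seconds

@dataclass
class Rule:
    name: str
    is_risky: Callable[[State], bool]       # predicate over the observed state
    override: Callable[[State, int], int]   # corrective action when risky

def guarded_action(policy: Callable[[State], int], rules: List[Rule], state: State) -> int:
    """Return the policy's action unless a rule flags the state as risky."""
    action = policy(state)
    for rule in rules:
        if rule.is_risky(state):
            return rule.override(state, action)
    return action  # nominal path: the controller's choice is passed through

def greedy_policy(state: State) -> int:
    """Toy policy that always requests the highest of six bitrate levels."""
    return 5

# Toy rule: with low throughput and a nearly empty buffer, cap the bitrate level.
low_buffer = Rule(
    name="low-buffer-cap",
    is_risky=lambda s: s.throughput_mbps < 1.0 and s.buffer_s < 2.0,
    override=lambda s, a: min(a, 1),
)

print(guarded_action(greedy_policy, [low_buffer], State(4.0, 10.0)))
print(guarded_action(greedy_policy, [low_buffer], State(0.5, 1.0)))
```

The wrapper leaves the policy itself untouched, which is what allows this style of protection to be added without retraining.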

If this is right

RL controllers can be deployed with quantified worst-case guarantees and selective runtime fixes without retraining.
Performance gaps can be certified with lower bounds rather than estimated by enumeration or formal verification.
Lightweight rule-based interventions preserve average-case behavior while closing most of the discovered gap.
Protection derived from a limited set of scenarios extends to a wider range of network conditions.
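
One way to read the lower-bound claim, written out with illustrative notation that is not the paper's. Assume $J_\pi(s)$ is the controller's performance in scenario $s$, $J^\star(s)$ is the best performance achievable in that scenario, $\mathcal{H}$ is a set of heuristics used to approximate $J^\star$, and $\hat{s}$ is any scenario the search returns. The sketch also assumes performance can be evaluated exactly; any approximation error would enter as an extra term.

```latex
\[
\max_{s \in \mathcal{S}} \bigl( J^\star(s) - J_\pi(s) \bigr)
\;\ge\; J^\star(\hat{s}) - J_\pi(\hat{s})
\;\ge\; \max_{h \in \mathcal{H}} J_h(\hat{s}) - J_\pi(\hat{s})
\]
```

The first inequality holds because a maximum is at least its value at any one point; the second holds because no heuristic can beat the scenario's optimum. The right-hand side is computable, so it bounds the true worst-case gap from below even when the search stops at a local optimum.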

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bilevel-plus-rule approach could be tested on RL controllers outside networking, such as in resource allocation or scheduling systems.
Extracting human-readable rules from counterfactual trajectories offers an interpretable way to audit black-box sequential controllers.
Hybrid RL-plus-rule systems may become a standard pattern for safety-critical control loops where full retraining is expensive.

Load-bearing premise

The bilevel regret-maximization procedure finds scenarios that are representative of true worst-case conditions and the extracted logic rules generalize to unseen network conditions without introducing new failure modes.

What would settle it

Running the protected and unprotected controllers on a broad set of real or simulated network traces and checking whether the observed worst-case degradation reaches or exceeds the 43-64 percent gaps reported by ReGuard or whether the rules cause performance regressions in any undiscovered conditions.
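
The check described above could be organized along these lines. This is a sketch with invented numbers: the trace names, scores, and the 1% regression tolerance are assumptions for illustration, not results from the paper.

```python
from typing import Dict, List

def relative_gap(achievable: float, observed: float) -> float:
    """Fraction by which observed performance falls short of what is achievable."""
    return max(achievable - observed, 0.0) / achievable

def audit(traces: Dict[str, Dict[str, float]], tolerance: float = 0.01) -> Dict[str, object]:
    """Compare protected and unprotected controllers trace by trace."""
    worst_unprotected = max(
        relative_gap(t["achievable"], t["unprotected"]) for t in traces.values()
    )
    worst_protected = max(
        relative_gap(t["achievable"], t["protected"]) for t in traces.values()
    )
    # A regression is any trace where protection hurts by more than the tolerance.
    regressions: List[str] = [
        name for name, t in traces.items()
        if t["protected"] < t["unprotected"] * (1.0 - tolerance)
    ]
    reduction = 1.0 - worst_protected / worst_unprotected if worst_unprotected > 0 else 0.0
    return {
        "worst_unprotected": worst_unprotected,
        "worst_protected": worst_protected,
        "gap_reduction": reduction,
        "regressions": regressions,
    }

# Made-up scores, purely to exercise the functions.
traces = {
    "nominal": {"achievable": 100.0, "unprotected": 98.0, "protected": 98.0},
    "sparse-bandwidth": {"achievable": 100.0, "unprotected": 50.0, "protected": 90.0},
    "sharp-drop": {"achievable": 100.0, "unprotected": 60.0, "protected": 92.0},
}

report = audit(traces)
print(report)
```

A non-empty `regressions` list, or a `worst_unprotected` value on broader traces that far exceeds the reported gaps, would be the kind of evidence that challenges the paper's claims.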

Figures

Figures reproduced from arXiv: 2605.04373 by Hongyu Hè, Maria Apostolaki, Minhao Jin.

**Figure 1.** (a) shows a concrete scenario where Sage […

**Figure 2.** ReGuard discovers scenarios that maximize the targeted RL controller’s regret by directly interacting with the network environment in which the controller was trained and by using a portfolio of heuristics to approximate the scenario-specific optimum. It then analyzes the resulting counterfactuals to extract recurring risky patterns and corrective directions. Finally, ReGuard compiles those counterfactuals…

**Figure 3.** The key design point of Sage’s instantiation is end-to…

**Figure 4.** ReGuard finds the most challenging scenarios in all three use cases. The gap is consistently larger than the strongest baseline and far above the Normal regime, showing that ReGuard exposes large avoidable underperformance rather than marginally harder tests.

**Figure 5.** ReGuard provides the strongest overall protection when derived from ReGuard scenarios, while preserving nominal performance. The results of ReGuard shown here come from its very first iteration, and later iterations provide even more protection (see …

**Figure 6.** Harder counterfactual sources yield stronger protection from …

**Figure 7.** ReGuard remains well within the online decision budget in all three systems across all refinement iterations. Later iterations strengthen protection, but they do not introduce systematic latency growth.

**Figure 8.** A few search-and-protect iterations are enough to drive the discovered gap close to the Normal regime. Most of the …

**Figure 9.** Small-job-heavy scenarios expose Park’s bias toward …

**Figure 10.** Network scenarios found by ReGuard turn Park’s mild preference under nominal conditions into a near-collapse onto slow Server 1. LCT still routes most jobs to the fast tier, which shows that the issue is Park’s dispatch policy rather than an inherently bad workload.

**Figure 11.** Sparse-bandwidth scenarios expose a systematic …

**Figure 12.** Sharp bandwidth drops expose Sage’s slow recovery.

Original abstract

RL-based controllers achieve strong average-case performance in networking tasks such as congestion control and adaptive bitrate streaming. Yet their performance can degrade severely under network conditions where strong performance is still achievable. Identifying such conditions and quantifying the resulting performance gap is intractable by enumeration, while the sequential and closed-loop nature of RL controllers makes formal verification methods impractical. We present ReGuard, a framework that discovers worst-case scenarios for a given RL controller and protects it against them at inference time without retraining. Discovery is formulated as a bilevel regret-maximization problem, which yields a certified lower bound on the worst-case performance gap. The discovered trajectories are then analyzed as counterfactuals and compiled into lightweight logic rules that intervene only when a risky state is detected, leaving the controller's behavior unchanged otherwise. We evaluate ReGuard across three RL-based network controllers: Pensieve, Sage, and Park. ReGuard discovers scenarios in which the controller's performance is 43-64% worse than what is achievable. ReGuard not only discovers gaps 57% to 6× larger than those found by the strongest baselines but also shrinks them by 79-85% via lightweight rule-based protection while preserving nominal performance. ReGuard's protection extends beyond the scenarios it discovers, improving performance across a wider range of network conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReGuard uses bilevel regret maximization to surface larger performance gaps than baselines in three RL network controllers and then extracts lightweight rules that close most of those gaps at runtime without retraining.

ReGuard's core move is to cast worst-case discovery as a bilevel problem where the outer loop searches network parameters to maximize regret relative to achievable performance, then converts the resulting bad trajectories into simple logic rules that intervene only on detected risky states. That combination is not a standard extension of prior adversarial or verification work in this area. The evaluation on Pensieve, Sage, and Park shows the method finds gaps 43-64% below achievable performance, 57% to 6x larger than the strongest baselines, and reduces those gaps by 79-85% while leaving nominal behavior unchanged. The claim that the rules also help on conditions outside the discovered set is a practical plus. The paper is upfront that the bilevel procedure only certifies a lower bound, which matches the non-convex inner RL loop and avoids overclaiming global optimality. Still, the lack of reported details on optimization convergence, how the counterfactual-to-rule step is validated for side effects, and statistical significance of the gains leaves the strength of the 79-85% figure hard to judge from the abstract alone. If the full paper includes ablations on rule generality and more on the search procedure, that would address the main open question. This work is aimed at researchers building or hardening RL controllers for congestion control, streaming, and similar closed-loop networking tasks. A reader already working on robustness of learned policies would find the pipeline and concrete numbers useful to try or extend. It deserves a serious referee because the approach is grounded in a real deployment obstacle and the experiments demonstrate clear gains over baselines, even if reviewers will likely ask for tighter analysis of the bilevel guarantees and rule coverage.

Referee Report

3 major / 2 minor

Summary. The paper introduces ReGuard, a framework for RL-based network controllers (Pensieve, Sage, Park) that formulates worst-case scenario discovery as a bilevel regret-maximization problem yielding a certified lower bound on performance gaps. Discovered trajectories are analyzed as counterfactuals and compiled into lightweight logic rules that intervene at runtime only on risky states. Evaluation reports discovery of scenarios where controllers perform 43-64% worse than achievable, with gaps 57% to 6x larger than baselines, and 79-85% gap shrinkage via protection that preserves nominal performance and generalizes beyond the discovered scenarios.

Significance. If the central claims hold, the work is significant for networking and RL applications because it offers a practical method to identify and mitigate severe performance degradations in closed-loop controllers where enumeration is intractable and formal verification is impractical. The bilevel formulation for certified lower bounds and the extraction of lightweight, non-intrusive rules are strengths that could influence runtime protection techniques. Concrete evaluations across three controllers and the generalization result add value if substantiated.

major comments (3)

[Section 3 (bilevel formulation)] Bilevel regret-maximization formulation: the procedure yields only a certified lower bound on the gap; given the non-convex inner-loop RL policy and outer search over network parameters, the discovered trajectories are local optima at best. This makes it unclear whether the reported 43-64% gaps are representative of true worst-case conditions or merely a non-representative subset, which directly affects the subsequent counterfactual analysis and the validity of the 79-85% shrinkage claims.
[Section 5 (evaluation)] Evaluation section: the reported performance gaps (43-64% worse than achievable, 57% to 6x larger than baselines) and shrinkage figures lack details on how achievable performance is computed, optimization convergence criteria, and statistical significance testing. Without these, it is difficult to assess whether the quantitative improvements are robust or sensitive to the specific bilevel solver and network parameter ranges used.
[Section 4 (counterfactual analysis and rule extraction)] Rule extraction and generalization: the claim that protection extends beyond discovered scenarios and improves performance across a wider range of conditions is load-bearing for the practical contribution. The counterfactual analysis may be tuned to the discovered (potentially local) trajectories, leaving open whether the extracted logic rules introduce new failure modes or fail to cover other worst-case regimes.

minor comments (2)

[Abstract] Abstract: the phrase 'three named controllers' could be expanded to explicitly list Pensieve, Sage, and Park along with their primary tasks (congestion control, adaptive bitrate streaming) for immediate clarity.
[Section 3] Notation: ensure consistent use of symbols for regret, performance gap, and rule predicates across the bilevel problem statement and the rule compilation description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below with clarifications and indicate planned revisions where appropriate to strengthen the manuscript.

Referee: [Section 3 (bilevel formulation)] Bilevel regret-maximization formulation: the procedure yields only a certified lower bound on the gap; given the non-convex inner-loop RL policy and outer search over network parameters, the discovered trajectories are local optima at best. This makes it unclear whether the reported 43-64% gaps are representative of true worst-case conditions or merely a non-representative subset, which directly affects the subsequent counterfactual analysis and the validity of the 79-85% shrinkage claims.

Authors: We appreciate this observation. The bilevel formulation is explicitly presented as yielding a certified lower bound on the worst-case gap rather than a claim of global optimality, which aligns with the intractability of exhaustive search noted in the introduction. The reported 43-64% figures represent the gaps discovered and certified by the procedure; even as local optima, they remain valid lower bounds and are shown to exceed those found by baselines. We will revise Section 3 to more explicitly discuss the local nature of the solutions, reiterate the lower-bound interpretation, and clarify that the subsequent protection claims are based on mitigating the identified (certified) gaps rather than assuming global worst cases.

Revision: partial

Referee: [Section 5 (evaluation)] Evaluation section: the reported performance gaps (43-64% worse than achievable, 57% to 6x larger than baselines) and shrinkage figures lack details on how achievable performance is computed, optimization convergence criteria, and statistical significance testing. Without these, it is difficult to assess whether the quantitative improvements are robust or sensitive to the specific bilevel solver and network parameter ranges used.

Authors: We agree that these details are necessary for full reproducibility and assessment. Achievable performance is determined via comparison to an oracle or optimal offline controller for each task (e.g., known throughput-delay trade-offs in adaptive bitrate and congestion control). We will expand Section 5 to specify the bilevel solver convergence criteria (regret stabilization within a fixed epsilon over iterations), the network parameter ranges explored, and statistical significance (e.g., mean and standard deviation over 10 random seeds with t-test p-values). These additions will be included in the revised version.

Revision: yes

Referee: [Section 4 (counterfactual analysis and rule extraction)] Rule extraction and generalization: the claim that protection extends beyond discovered scenarios and improves performance across a wider range of conditions is load-bearing for the practical contribution. The counterfactual analysis may be tuned to the discovered (potentially local) trajectories, leaving open whether the extracted logic rules introduce new failure modes or fail to cover other worst-case regimes.

Authors: The rules are derived from state features in the counterfactual trajectories and are intentionally conservative, triggering only on detected risky states while leaving the RL policy unchanged otherwise. Our evaluation already tests generalization on held-out network conditions beyond the discovery set and reports no nominal-performance degradation. To address potential uncovered regimes or new failure modes, we will add further experiments applying the rules to additional unseen traces and report any observed side effects or coverage gaps. This will be incorporated as an expanded subsection in Section 4.

Revision: partial
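
The reporting promised in the response to the Section 5 comment (a convergence criterion based on regret stabilizing within an epsilon, plus mean, standard deviation, and a t-test over seeds) could be sketched as below. The per-seed numbers, the window size, and the epsilon are invented for illustration. The sketch uses only the standard library and computes Welch's t statistic; turning it into a p-value needs a t-distribution CDF, which is left out here.

```python
import math
import statistics
from typing import Sequence

def has_converged(regrets: Sequence[float], epsilon: float = 0.01, window: int = 3) -> bool:
    """True when the last `window` regret values vary by at most `epsilon`."""
    if len(regrets) < window:
        return False
    recent = regrets[-window:]
    return max(recent) - min(recent) <= epsilon

def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

# Made-up per-seed performance gaps (10 seeds each), purely for illustration.
unprotected = [0.52, 0.55, 0.49, 0.58, 0.51, 0.54, 0.50, 0.56, 0.53, 0.57]
protected = [0.10, 0.12, 0.09, 0.11, 0.08, 0.13, 0.10, 0.09, 0.11, 0.12]

for label, sample in (("unprotected", unprotected), ("protected", protected)):
    print(f"{label}: mean={statistics.mean(sample):.3f} +/- {statistics.stdev(sample):.3f}")
print(f"Welch t statistic: {welch_t(unprotected, protected):.1f}")

# Hypothetical regret history from successive outer-loop iterations.
print(has_converged([0.30, 0.45, 0.52, 0.525, 0.528]))
```

Welch's variant is used here because the protected and unprotected gaps need not share a variance; the authors' response does not say which t-test they intend.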

Circularity Check

0 steps flagged

No circularity: bilevel formulation and empirical gaps are independent of final reported metrics

The paper formulates discovery as a bilevel regret-maximization problem that explicitly produces a certified lower bound on the performance gap. Reported numbers (43-64% worse, 57% to 6x larger gaps, 79-85% shrinkage) are obtained by running the procedure, extracting rules, and measuring outcomes on the resulting trajectories versus baselines and nominal performance. No equation or claim equates the final gaps or protection gains to quantities defined by the same fitted parameters or by self-referential construction. The derivation chain remains self-contained against external benchmarks and does not reduce to renaming or self-citation load-bearing.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5547 in / 1093 out tokens · 36824 ms · 2026-05-08T17:27:27.974517+00:00 · methodology
