Worst-Case Discovery and Runtime Protection for RL-Based Network Controllers
Pith reviewed 2026-05-08 17:27 UTC · model grok-4.3
The pith
ReGuard discovers network conditions where RL controllers perform 43-64% worse than achievable and protects them at runtime with lightweight rules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReGuard discovers worst-case scenarios for a given RL controller by solving a bilevel regret-maximization problem, which yields a certified lower bound on the worst-case performance gap. The discovered trajectories are analyzed as counterfactuals and compiled into lightweight logic rules that intervene only when a risky state is detected, leaving the controller's behavior unchanged otherwise. Across Pensieve, Sage, and Park, ReGuard finds scenarios in which performance is 43-64 percent worse than achievable, locates gaps 57 percent to 6 times larger than those found by the strongest baselines, and shrinks those gaps by 79-85 percent via the rule-based protection while preserving nominal performance; the protection also extends beyond the discovered scenarios.
What carries the argument
Bilevel regret-maximization procedure that produces certified performance-gap bounds, followed by counterfactual analysis to extract lightweight logic rules for selective runtime intervention.
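One way to write the bilevel problem down, as a sketch only. The notation is assumed here and is not taken from the paper: $\pi$ is the fixed RL controller, $\mathcal{E}$ the set of admissible network conditions, $\Pi$ the class of comparison controllers, and $J(\cdot, e)$ the return obtained under condition $e$.

```latex
\begin{align}
  R^{\star}(\pi)
    &= \max_{e \in \mathcal{E}}
       \Big[ \max_{\pi' \in \Pi} J(\pi', e) - J(\pi, e) \Big], \\
  R^{\star}(\pi)
    &\ge J(\hat{\pi}', \hat{e}) - J(\pi, \hat{e})
    \quad \text{for any } \hat{e} \in \mathcal{E},\ \hat{\pi}' \in \Pi .
\end{align}
```

The second line is why a solver that stops at a local optimum can still report a valid lower bound: any feasible pair $(\hat{e}, \hat{\pi}')$ witnesses a gap that the true worst case cannot fall below. Whatever approximation error enters the paper's certified bound is not modeled in this sketch.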
If this is right
- RL controllers can be deployed with quantified worst-case guarantees and selective runtime fixes without retraining.
- Performance gaps can be certified with lower bounds rather than estimated by enumeration or formal verification.
- Lightweight rule-based interventions preserve average-case behavior while closing most of the discovered gap.
- Protection derived from a limited set of scenarios extends to a wider range of network conditions.
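The selective-intervention idea in the points above can be illustrated with a minimal sketch. Everything named here (`Rule`, `guarded_action`, the state keys `buffer_s` and `throughput_mbps`, and the thresholds) is hypothetical and chosen for illustration; it is not ReGuard's rule format or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, float]


@dataclass
class Rule:
    """Hypothetical guard rule: when is_risky(state) holds, use override(state)."""
    name: str
    is_risky: Callable[[State], bool]
    override: Callable[[State], int]


def guarded_action(policy: Callable[[State], int], rules: List[Rule], state: State) -> int:
    """Return the first matching rule's action, else the unmodified policy action."""
    for rule in rules:
        if rule.is_risky(state):
            return rule.override(state)
    return policy(state)


# Toy stand-in (assumption): an ABR-like policy that always picks bitrate index 5.
def toy_policy(state: State) -> int:
    return 5


# Illustrative rule with made-up thresholds: back off when buffer and throughput are low.
low_buffer = Rule(
    name="low_buffer_low_throughput",
    is_risky=lambda s: s["buffer_s"] < 2.0 and s["throughput_mbps"] < 1.0,
    override=lambda s: 0,  # fall back to the lowest bitrate index
)

safe_state = {"buffer_s": 12.0, "throughput_mbps": 4.0}
risky_state = {"buffer_s": 1.0, "throughput_mbps": 0.5}
print(guarded_action(toy_policy, [low_buffer], safe_state))   # policy decides
print(guarded_action(toy_policy, [low_buffer], risky_state))  # rule overrides
```

Because the wrapper falls through to the policy whenever no predicate fires, nominal behavior is untouched by construction; whether a given rule set is free of side effects is an empirical question.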
Where Pith is reading between the lines
- The same bilevel-plus-rule approach could be tested on RL controllers outside networking, such as in resource allocation or scheduling systems.
- Extracting human-readable rules from counterfactual trajectories offers an interpretable way to audit black-box sequential controllers.
- Hybrid RL-plus-rule systems may become a standard pattern for safety-critical control loops where full retraining is expensive.
Load-bearing premise
The bilevel regret-maximization procedure finds scenarios that are representative of true worst-case conditions and the extracted logic rules generalize to unseen network conditions without introducing new failure modes.
What would settle it
Running the protected and unprotected controllers on a broad set of real or simulated network traces and checking whether the observed worst-case degradation reaches or exceeds the 43-64 percent gaps reported by ReGuard or whether the rules cause performance regressions in any undiscovered conditions.
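A check of this kind could be scripted roughly as follows. This is a sketch under stated assumptions: per-trace rewards for an achievable reference, the unprotected controller, and the protected controller are already available, and the gap is measured relative to the achievable reward. The function names and the toy numbers are placeholders, not results from the paper.

```python
from typing import Dict, List, Tuple


def relative_gap(achievable: float, achieved: float) -> float:
    """Fraction by which achieved falls short of achievable (assumes achievable > 0)."""
    return (achievable - achieved) / achievable


def audit(scores: Dict[str, Tuple[float, float, float]], tol: float = 0.0):
    """scores maps trace id -> (achievable, unprotected, protected) reward.

    Returns the worst relative gap without and with protection, plus the
    traces on which protection scored lower than the unprotected controller.
    """
    worst_unprotected = max(relative_gap(a, u) for a, u, _ in scores.values())
    worst_protected = max(relative_gap(a, p) for a, _, p in scores.values())
    regressions: List[str] = sorted(
        trace for trace, (_, u, p) in scores.items() if p < u - tol
    )
    return worst_unprotected, worst_protected, regressions


# Placeholder rewards for three hypothetical traces.
toy_scores = {
    "trace_a": (10.0, 9.5, 9.5),
    "trace_b": (10.0, 5.0, 9.0),
    "trace_c": (8.0, 7.6, 7.2),
}
worst_u, worst_p, regressions = audit(toy_scores)
print(worst_u, worst_p, regressions)
```

Reporting the regression list alongside the worst-case gaps keeps both halves of the question visible: how bad the unprotected controller gets, and where the rules cost performance.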
Original abstract
RL-based controllers achieve strong average-case performance in networking tasks such as congestion control and adaptive bitrate streaming. Yet their performance can degrade severely under network conditions where strong performance is still achievable. Identifying such conditions and quantifying the resulting performance gap is intractable by enumeration, while the sequential and closed-loop nature of RL controllers makes formal verification methods impractical. We present ReGuard, a framework that discovers worst-case scenarios for a given RL controller and protects it against them at inference time without retraining. Discovery is formulated as a bilevel regret-maximization problem, which yields a certified lower bound on the worst-case performance gap. The discovered trajectories are then analyzed as counterfactuals and compiled into lightweight logic rules that intervene only when a risky state is detected, leaving the controller's behavior unchanged otherwise. We evaluate ReGuard across three RL-based network controllers: Pensieve, Sage, and Park. ReGuard discovers scenarios in which the controller's performance is 43-64% worse than what is achievable. ReGuard not only discovers gaps 57% to 6× larger than those found by the strongest baselines but also shrinks them by 79-85% via lightweight rule-based protection while preserving nominal performance. ReGuard's protection extends beyond the scenarios it discovers, improving performance across a wider range of network conditions.
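To see what the headline numbers imply together, here is a small worked calculation. It assumes that "shrinks by 79-85%" means a relative reduction of the discovered gap, and it does not assume which gap endpoint pairs with which shrinkage endpoint, so it reports the envelope over all four combinations. The function name `residual_gap` is illustrative.

```python
def residual_gap(gap: float, shrink: float) -> float:
    """Gap remaining after a relative reduction of `shrink` (both as fractions)."""
    return gap * (1.0 - shrink)


gaps = (0.43, 0.64)      # reported range of discovered gaps
shrinks = (0.79, 0.85)   # reported range of gap reduction

# Envelope over every pairing, since the abstract does not say how the ranges align.
candidates = [residual_gap(g, s) for g in gaps for s in shrinks]
low, high = min(candidates), max(candidates)
print(f"residual gap between {low:.4f} and {high:.4f}")
```

Under these assumptions the protected controllers would still sit roughly 6 to 13 percent below achievable performance in the discovered scenarios; the per-controller figures in the paper may differ.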
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ReGuard, a framework for RL-based network controllers (Pensieve, Sage, Park) that formulates worst-case scenario discovery as a bilevel regret-maximization problem yielding a certified lower bound on performance gaps. Discovered trajectories are analyzed as counterfactuals and compiled into lightweight logic rules that intervene at runtime only on risky states. Evaluation reports discovery of scenarios where controllers perform 43-64% worse than achievable, with gaps 57% to 6x larger than baselines, and 79-85% gap shrinkage via protection that preserves nominal performance and generalizes beyond the discovered scenarios.
Significance. If the central claims hold, the work is significant for networking and RL applications because it offers a practical method to identify and mitigate severe performance degradations in closed-loop controllers where enumeration is intractable and formal verification is impractical. The bilevel formulation for certified lower bounds and the extraction of lightweight, non-intrusive rules are strengths that could influence runtime protection techniques. Concrete evaluations across three controllers and the generalization result add value if substantiated.
major comments (3)
- [Section 3 (bilevel formulation)] Bilevel regret-maximization formulation: the procedure yields only a certified lower bound on the gap; given the non-convex inner-loop RL policy and outer search over network parameters, the discovered trajectories are local optima at best. This makes it unclear whether the reported 43-64% gaps are representative of true worst-case conditions or merely a non-representative subset, which directly affects the subsequent counterfactual analysis and the validity of the 79-85% shrinkage claims.
- [Section 5 (evaluation)] Evaluation section: the reported performance gaps (43-64% worse than achievable, 57% to 6x larger than baselines) and shrinkage figures lack details on how achievable performance is computed, optimization convergence criteria, and statistical significance testing. Without these, it is difficult to assess whether the quantitative improvements are robust or sensitive to the specific bilevel solver and network parameter ranges used.
- [Section 4 (counterfactual analysis and rule extraction)] Rule extraction and generalization: the claim that protection extends beyond discovered scenarios and improves performance across a wider range of conditions is load-bearing for the practical contribution. The counterfactual analysis may be tuned to the discovered (potentially local) trajectories, leaving open whether the extracted logic rules introduce new failure modes or fail to cover other worst-case regimes.
minor comments (2)
- [Abstract] Abstract: the phrase 'three named controllers' could be expanded to explicitly list Pensieve, Sage, and Park along with their primary tasks (congestion control, adaptive bitrate streaming) for immediate clarity.
- [Section 3] Notation: ensure consistent use of symbols for regret, performance gap, and rule predicates across the bilevel problem statement and the rule compilation description.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below with clarifications and indicate planned revisions where appropriate to strengthen the manuscript.
-
Referee: [Section 3 (bilevel formulation)] Bilevel regret-maximization formulation: the procedure yields only a certified lower bound on the gap; given the non-convex inner-loop RL policy and outer search over network parameters, the discovered trajectories are local optima at best. This makes it unclear whether the reported 43-64% gaps are representative of true worst-case conditions or merely a non-representative subset, which directly affects the subsequent counterfactual analysis and the validity of the 79-85% shrinkage claims.
Authors: We appreciate this observation. The bilevel formulation is explicitly presented as yielding a certified lower bound on the worst-case gap rather than a claim of global optimality, which aligns with the intractability of exhaustive search noted in the introduction. The reported 43-64% figures represent the gaps discovered and certified by the procedure; even as local optima, they remain valid lower bounds and are shown to exceed those found by baselines. We will revise Section 3 to more explicitly discuss the local nature of the solutions, reiterate the lower-bound interpretation, and clarify that the subsequent protection claims are based on mitigating the identified (certified) gaps rather than assuming global worst cases. revision: partial
-
Referee: [Section 5 (evaluation)] Evaluation section: the reported performance gaps (43-64% worse than achievable, 57% to 6x larger than baselines) and shrinkage figures lack details on how achievable performance is computed, optimization convergence criteria, and statistical significance testing. Without these, it is difficult to assess whether the quantitative improvements are robust or sensitive to the specific bilevel solver and network parameter ranges used.
Authors: We agree that these details are necessary for full reproducibility and assessment. Achievable performance is determined via comparison to an oracle or optimal offline controller for each task (e.g., known throughput-delay trade-offs in adaptive bitrate and congestion control). We will expand Section 5 to specify the bilevel solver convergence criteria (regret stabilization within a fixed epsilon over iterations), the network parameter ranges explored, and statistical significance (e.g., mean and standard deviation over 10 random seeds with t-test p-values). These additions will be included in the revised version. revision: yes
-
Referee: [Section 4 (counterfactual analysis and rule extraction)] Rule extraction and generalization: the claim that protection extends beyond discovered scenarios and improves performance across a wider range of conditions is load-bearing for the practical contribution. The counterfactual analysis may be tuned to the discovered (potentially local) trajectories, leaving open whether the extracted logic rules introduce new failure modes or fail to cover other worst-case regimes.
Authors: The rules are derived from state features in the counterfactual trajectories and are intentionally conservative, triggering only on detected risky states while leaving the RL policy unchanged otherwise. Our evaluation already tests generalization on held-out network conditions beyond the discovery set and reports no nominal-performance degradation. To address potential uncovered regimes or new failure modes, we will add further experiments applying the rules to additional unseen traces and report any observed side effects or coverage gaps. This will be incorporated as an expanded subsection in Section 4. revision: partial
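The convergence criterion and seed-level statistics promised in the Section 5 response could take a shape like the sketch below. The window size, epsilon, helper names, and sample values are all assumptions made for illustration; the rebuttal specifies only "regret stabilization within a fixed epsilon" and "mean and standard deviation over 10 random seeds with t-test p-values". Only the Welch t statistic is computed here, since a p-value would need a t-distribution CDF from a library outside the standard library.

```python
import math
import statistics
from typing import Sequence


def has_stabilized(regrets: Sequence[float], window: int = 5, eps: float = 1e-2) -> bool:
    """True once the last `window` regret values differ by at most `eps`."""
    if len(regrets) < window:
        return False
    tail = regrets[-window:]
    return max(tail) - min(tail) <= eps


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for two independent samples (no p-value computed)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Placeholder regret history from a hypothetical outer-loop search.
history = [0.10, 0.30, 0.41, 0.450, 0.452, 0.455, 0.455, 0.456]
print(has_stabilized(history))

# Placeholder per-seed gaps (five seeds here for brevity, not real measurements).
unprotected = [0.50, 0.52, 0.48, 0.51, 0.49]
protected = [0.10, 0.12, 0.09, 0.11, 0.08]
print(statistics.mean(unprotected), statistics.stdev(unprotected))
print(statistics.mean(protected), statistics.stdev(protected))
print(welch_t(unprotected, protected))
```

A range-over-window test is only one possible reading of "stabilization"; a relative-change threshold or a patience counter would serve the same purpose.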
Circularity Check
No circularity: bilevel formulation and empirical gaps are independent of final reported metrics
The paper formulates discovery as a bilevel regret-maximization problem that explicitly produces a certified lower bound on the performance gap. Reported numbers (43-64% worse, 57% to 6x larger gaps, 79-85% shrinkage) are obtained by running the procedure, extracting rules, and measuring outcomes on the resulting trajectories versus baselines and nominal performance. No equation or claim equates the final gaps or protection gains to quantities defined by the same fitted parameters or by self-referential construction. The derivation chain remains self-contained against external benchmarks and does not reduce to renaming or to load-bearing self-citation.