Hierarchical Reinforcement Learning for the Dynamic VNE with Alternatives Problem
Pith reviewed 2026-05-17 00:26 UTC · model grok-4.3
The pith
Hierarchical reinforcement learning improves dynamic virtual network embedding with alternative topologies over standard baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present HRL-VNEAP, in which a high-level policy selects the most suitable alternative topology for each arriving virtual network request or rejects it, and a low-level policy then embeds the selected topology onto the substrate network. Under dynamic request arrivals and varying traffic loads on realistic topologies, this hierarchical approach produces higher acceptance ratios, greater total revenue, and improved revenue-over-cost ratios than the tested baseline strategies.
What carries the argument
The hierarchical reinforcement learning structure that separates topology selection at the high level from embedding decisions at the low level.
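The two-level control flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Topology` structure, the greedy stand-in rules `cheapest_feasible` and `first_fit`, and the CPU-only substrate model are all assumptions introduced here, and link mapping is omitted for brevity. In HRL-VNEAP both levels are learned policies rather than fixed heuristics.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Topology:
    """One alternative topology of a request (hypothetical structure)."""
    cpu: List[int]                     # CPU demand per virtual node
    links: List[Tuple[int, int, int]]  # (virtual u, virtual v, bandwidth); unused in this sketch


def cheapest_feasible(alternatives: List[Topology],
                      free_cpu: Dict[int, int]) -> Optional[int]:
    """Stand-in high-level rule (assumption): choose the alternative with the
    lowest total CPU demand whose largest node fits somewhere; None = reject."""
    best = None
    for i, topo in enumerate(alternatives):
        if max(topo.cpu) > max(free_cpu.values()):
            continue  # crude feasibility filter
        if best is None or sum(topo.cpu) < sum(alternatives[best].cpu):
            best = i
    return best


def first_fit(topo: Topology, free_cpu: Dict[int, int]) -> Optional[Dict[int, int]]:
    """Stand-in low-level rule (assumption): map each virtual node to the first
    substrate node with enough free CPU, one virtual node per substrate node."""
    mapping: Dict[int, int] = {}
    used = set()
    for v, demand in enumerate(topo.cpu):
        host = next((s for s, free in free_cpu.items()
                     if free >= demand and s not in used), None)
        if host is None:
            return None  # embedding failed
        mapping[v] = host
        used.add(host)
    return mapping


def handle_request(alternatives, free_cpu, high=cheapest_feasible, low=first_fit):
    """High level picks an alternative (or rejects); low level embeds it."""
    choice = high(alternatives, free_cpu)
    if choice is None:
        return None, None            # rejected at the high level
    mapping = low(alternatives[choice], free_cpu)
    if mapping is None:
        return choice, None          # selected, but embedding failed
    for v, s in mapping.items():     # commit substrate resources
        free_cpu[s] -= alternatives[choice].cpu[v]
    return choice, mapping
```

Passing the policies as function arguments keeps the interface between the two levels explicit: the high level only returns an index or a rejection, and the low level only sees the selected topology.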
If this is right
- The approach raises the fraction of accepted virtual network requests by as much as 20.7 percent compared with the strongest baselines.
- Total revenue collected increases by up to 36.2 percent.
- The ratio of revenue to cost improves by as much as 22.1 percent.
- A remaining gap to the solutions found by mixed-integer linear programming on small instances is quantified and left as motivation for hybrid methods.
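For readers who want to reproduce the three headline metrics, a minimal sketch follows. The `Outcome` record and the helper names are hypothetical, and the page does not say how the paper defines revenue and cost or whether the reported gains are relative or in percentage points; the code assumes per-request revenue and cost are given and that gains are relative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Outcome:
    """Result of processing one request (hypothetical record)."""
    accepted: bool
    revenue: float  # value earned if the request is accepted
    cost: float     # substrate resources consumed by the embedding


def acceptance_ratio(outcomes: List[Outcome]) -> float:
    """Fraction of requests that were accepted."""
    return sum(o.accepted for o in outcomes) / len(outcomes)


def total_revenue(outcomes: List[Outcome]) -> float:
    """Revenue summed over accepted requests only."""
    return sum(o.revenue for o in outcomes if o.accepted)


def revenue_over_cost(outcomes: List[Outcome]) -> float:
    """Total revenue divided by total cost of accepted requests."""
    cost = sum(o.cost for o in outcomes if o.accepted)
    return total_revenue(outcomes) / cost if cost > 0 else 0.0


def relative_gain(ours: float, baseline: float) -> float:
    """Percentage improvement of `ours` over `baseline` (assumed relative)."""
    return 100.0 * (ours - baseline) / baseline
```

With four requests of which three are accepted, `acceptance_ratio` returns 0.75; comparing that with a baseline value of 0.6 through `relative_gain` gives a gain of about 25 percent.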
Where Pith is reading between the lines
- The separation of topology choice from placement may reduce the search space enough to make reinforcement learning stable for other resource-allocation tasks that involve equivalent configurations.
- If the performance edge holds on larger instances, operators could serve more slices without adding physical capacity.
- Combining the learned policies with periodic exact optimization steps could shrink the optimality gap reported in the benchmarks.
Load-bearing premise
The high-level policy can pick suitable topologies and the low-level policy can embed them reliably without training instability or high computation costs when requests arrive over time.
What would settle it
Training and evaluating the policies on a fresh set of larger substrate networks with higher arrival rates, then checking whether the acceptance-ratio and revenue gains disappear or training diverges.
Original abstract
Virtual Network Embedding (VNE) is a key enabler of network slicing, yet most formulations assume that each Virtual Network Request (VNR) has a fixed topology. Recently, VNE with Alternative topologies (VNEAP) was introduced to capture malleable VNRs, where each request can be instantiated using one of several functionally equivalent topologies that trade resources differently. While this flexibility enlarges the feasible space, it also introduces an additional decision layer, making dynamic embedding more challenging. This paper proposes HRL-VNEAP, a hierarchical reinforcement learning approach for VNEAP under dynamic arrivals. A high-level policy selects the most suitable alternative topology (or rejects the request), and a low-level policy embeds the chosen topology onto the substrate network. Experiments on realistic substrate topologies under multiple traffic loads show that naive exploitation strategies provide only modest gains, whereas HRL-VNEAP consistently achieves the best performance across all metrics. Compared to the strongest tested baselines, HRL-VNEAP improves acceptance ratio by up to 20.7%, total revenue by up to 36.2%, and revenue-over-cost by up to 22.1%. Finally, we benchmark against an MILP formulation on tractable instances to quantify the remaining gap to optimality and motivate future work on learning- and optimization-based VNEAP solutions.
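To make "functionally equivalent topologies that trade resources differently" concrete, here is a small hypothetical request with two alternatives. The dictionary layout, the numbers, and the `footprint` helper are illustrative assumptions, not the paper's notation or data.

```python
# Hypothetical malleable request with two functionally equivalent alternatives.
vnr = {
    "id": "vnr-0",
    "alternatives": [
        # Distributed variant: two small nodes joined by a high-bandwidth link.
        {"cpu": [2, 2], "links": [(0, 1, 10)]},
        # Consolidated variant: one larger node and no virtual links.
        {"cpu": [6], "links": []},
    ],
}


def footprint(topology):
    """Return (total CPU, total bandwidth) requested by one alternative."""
    total_cpu = sum(topology["cpu"])
    total_bw = sum(bw for _, _, bw in topology["links"])
    return total_cpu, total_bw


# One (CPU, bandwidth) pair per alternative, in the order listed above.
footprints = [footprint(t) for t in vnr["alternatives"]]
```

Here the first alternative asks for less CPU but needs link bandwidth, while the second needs more CPU and no bandwidth, so which one is preferable depends on what the substrate has left when the request arrives.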
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces HRL-VNEAP, a hierarchical reinforcement learning approach for dynamic Virtual Network Embedding with Alternative topologies (VNEAP). A high-level policy selects one of several functionally equivalent topologies (or rejects the VNR), while a low-level policy embeds the chosen topology onto the substrate. Experiments on realistic topologies under varying traffic loads report that HRL-VNEAP outperforms baselines, with gains of up to 20.7% in acceptance ratio, 36.2% in total revenue, and 22.1% in revenue-over-cost; an MILP benchmark on small instances is included to quantify the optimality gap.
Significance. If the empirical claims hold under rigorous validation, the work meaningfully extends VNE research by incorporating topology alternatives via hierarchical RL, demonstrating practical gains over naive exploitation and standard methods in dynamic settings. The concrete percentage improvements and MILP comparison provide a useful baseline for future learning- and optimization-based VNEAP solutions, though the significance depends on addressing gaps in experimental detail and stability analysis.
major comments (3)
- [Experiments] Experiments section: the reported performance gains (e.g., 20.7% acceptance ratio) are presented without specification of training procedure details, hyperparameter values, number of random seeds, variance across runs, or statistical significance tests, leaving the robustness of superiority under dynamic arrivals only partially supported.
- [§3] §3 (Hierarchical RL formulation): the high-level policy receives delayed and sparse feedback only after low-level embedding success or failure, yet no mechanisms (separate critics, termination functions, or shaped rewards) are described to mitigate credit-assignment problems, which risks high-variance policies or collapse under varying loads.
- [Experiments] Baseline comparisons: exact implementation details, tuning procedures, and source of the 'strongest tested baselines' are not provided, undermining the ability to verify the claimed outperformance margins.
minor comments (2)
- [§2] Notation for alternative topologies and their resource trade-offs could be clarified with an explicit example in the problem formulation section.
- [Experiments] Figure captions for performance plots should include error bars or confidence intervals if multiple runs were performed.
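The reporting the referee asks for (mean and spread across seeds, plus a paired significance test) can be sketched with the Python standard library alone. The helper names are hypothetical, and the usage note below relies on made-up placeholder numbers, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_std(values: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """t statistic and degrees of freedom for paired samples a[i], b[i],
    where index i identifies the same seed or scenario in both runs."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences are constant; t is undefined")
    n = len(diffs)
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

As a usage illustration with placeholder acceptance ratios in percent over five seeds, `paired_t_statistic([80, 82, 78, 81, 79], [70, 71, 69, 72, 68])` returns a t value of about 22.36 with 4 degrees of freedom, well above the two-sided 5% critical value of 2.776 for that many degrees of freedom. An exact p-value needs the t-distribution CDF, which the standard library does not provide; `scipy.stats.ttest_rel` is one option if SciPy is available.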
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help improve the clarity and rigor of our work on HRL-VNEAP. We address each major comment point by point below, providing explanations and indicating revisions where the manuscript will be updated in the next version.
Point-by-point responses
- Referee: [Experiments] Experiments section: the reported performance gains (e.g., 20.7% acceptance ratio) are presented without specification of training procedure details, hyperparameter values, number of random seeds, variance across runs, or statistical significance tests, leaving the robustness of superiority under dynamic arrivals only partially supported.
  Authors: We agree that the original presentation of results would benefit from greater transparency on experimental robustness. In the revised manuscript, we have added a new subsection in the Experiments section that fully specifies the training procedure (including episode counts, learning rate schedules, and replay buffer sizes), all hyperparameter values for both the high-level and low-level policies, the use of five independent random seeds, mean and standard deviation of performance metrics across runs, and p-values from paired t-tests confirming statistical significance of the reported gains (including the 20.7% acceptance ratio improvement) under each traffic load.
  Revision: yes
- Referee: [§3] §3 (Hierarchical RL formulation): the high-level policy receives delayed and sparse feedback only after low-level embedding success or failure, yet no mechanisms (separate critics, termination functions, or shaped rewards) are described to mitigate credit-assignment problems, which risks high-variance policies or collapse under varying loads.
  Authors: The concern about credit assignment in the hierarchical setting is valid given the delayed feedback structure. Our formulation already incorporates separate critics for the high-level and low-level policies within an actor-critic framework, along with a shaped reward for the high-level policy that combines a binary success/failure signal with a continuous component based on the revenue-over-cost ratio achieved by the low-level embedding. Termination occurs explicitly upon low-level completion or rejection. We have expanded §3 to describe these design choices explicitly, including the exact reward shaping function and how the hierarchical update mitigates variance, thereby strengthening the methodological justification.
  Revision: yes
- Referee: [Experiments] Baseline comparisons: exact implementation details, tuning procedures, and source of the 'strongest tested baselines' are not provided, undermining the ability to verify the claimed outperformance margins.
  Authors: We acknowledge that reproducibility requires more explicit baseline documentation. The revised Experiments section now includes a table and accompanying text detailing the exact implementation of each baseline (neural architectures, optimization algorithms, and any custom adaptations for VNEAP), the hyperparameter tuning process (grid search ranges and selected values), and the sources (re-implementations based on the original publications or publicly available code repositories referenced in the paper). These additions enable direct verification of the outperformance margins reported against the strongest baselines.
  Revision: yes
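The shaped high-level reward mentioned in the second response (a binary success/failure signal plus a continuous revenue-over-cost component) could take a form like the sketch below. The exact shaping function is not given on this page, so the additive form, the `weight` parameter, the failure penalty of -1, and the zero reward for a deliberate rejection are all assumptions made for illustration.

```python
def high_level_reward(outcome: str, revenue: float = 0.0, cost: float = 0.0,
                      weight: float = 0.5) -> float:
    """Illustrative shaped reward for the high-level policy (assumed form).

    outcome: "embedded" (low level succeeded), "failed" (low level could not
    embed the selected topology), or "rejected" (high level declined the request).
    """
    if outcome == "rejected":
        return 0.0    # assumption: rejection is neutral
    if outcome == "failed":
        return -1.0   # assumption: a failed embedding is penalised
    if outcome != "embedded":
        raise ValueError(f"unknown outcome: {outcome!r}")
    ratio = revenue / cost if cost > 0 else 0.0
    return 1.0 + weight * ratio   # binary success term + continuous term
```

For example, an embedding with revenue 10 and cost 20 yields 1.0 + 0.5 × 0.5 = 1.25 under these assumed settings. The continuous term gives the high-level policy a graded signal that distinguishes efficient embeddings from wasteful ones even when both succeed.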
Circularity Check
Standard hierarchical RL applied to VNEAP; no reductions to self-defined inputs
full rationale
The paper defines HRL-VNEAP with a high-level policy selecting among alternative topologies (or rejecting) and a low-level policy performing the embedding. Reported gains in acceptance ratio, revenue, and revenue-over-cost are obtained from direct experimental comparisons against baselines on realistic substrate topologies under multiple traffic loads. No equations, fitted parameters, or claims in the provided text reduce these outcomes to quantities defined by the authors' own prior results or by construction from the method's inputs. The derivation relies on standard hierarchical RL components applied to the VNEAP formulation and is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Alternative topologies for each VNR are provided as input and are functionally equivalent.
- domain assumption: The substrate network state is fully observable to the agents at decision time.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "A high-level policy selects the most suitable alternative topology (or rejects the request), and a low-level policy embeds the chosen topology onto the substrate network."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "HRL-VNEAP improves acceptance ratio by up to 20.7%, total revenue by up to 36.2%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.