arxiv: 2605.01475 · v1 · submitted 2026-05-02 · 💻 cs.NI

An Intelligent eUPF for Time-Sensitive Path Selection in B5G Edge Networks

Pith reviewed 2026-05-09 18:00 UTC · model grok-4.3

classification 💻 cs.NI
keywords B5G · eUPF · DQN · path selection · POMDP · eBPF · MEC · latency

The pith

A DQN agent inside an enhanced User Plane Function selects low-latency paths in B5G networks using passive eBPF measurements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that embedding a Deep Q-Network agent into the eUPF allows real-time path selection between MEC and cloud endpoints to meet B5G latency demands. The selection task is cast as a POMDP whose states come from a new passive delay estimator that correlates TEID timestamps in GTP-U packets via eBPF programs. Experiments demonstrate that the learned policy produces lower average latency, steadier rewards, and more consistent low-delay choices than a random baseline. This matters because B5G applications require both speed and reliability while active probing would add unacceptable overhead. The results therefore position reinforcement learning as a practical control mechanism inside the 5G core.

Core claim

By formulating path selection as a Partially Observable Markov Decision Process and supplying it with delay estimates obtained from eBPF-linked TEID timestamps in GTP-U traffic, the eUPF can run a DQN agent that learns to pick lower-latency routes between edge and cloud in real time. The agent outperforms a random policy on average latency, reward stability, and reliability of low-delay choices, establishing that AI-driven control can operate inside the user plane without active measurement traffic.
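
As a rough illustration of the learning loop described above, the sketch below trains an epsilon-greedy tabular Q-learning agent on a simulated two-path environment and compares it with random selection. It is a deliberately simplified stand-in for the paper's DQN, not the authors' implementation. The interface labels n6a and n6b come from the Figure 1 caption; every delay value, probability, and hyperparameter in the code is a hypothetical placeholder, and treating n6a as the faster path is an assumption made only for the demo.

```python
import random

ACTIONS = ("n6a", "n6b")  # interface labels from the Figure 1 caption
# Hypothetical environment numbers (not from the paper), in milliseconds.
BASE_RTT = {"n6a": 5.0, "n6b": 20.0}  # assumed: n6a is the faster path
BAD_EXTRA = 30.0   # extra delay when a path is degraded
P_BAD = 0.1        # per-step degradation probability
JITTER = 1.0       # uniform jitter bound


def observe_rtt(action, rng):
    """Return a noisy RTT sample for the chosen interface."""
    extra = BAD_EXTRA if rng.random() < P_BAD else 0.0
    return BASE_RTT[action] + extra + rng.uniform(-JITTER, JITTER)


def run(policy, steps=4000, seed=0, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Run one policy ('q' or 'random') and return the mean observed RTT."""
    rng = random.Random(seed)
    # State = previous action, a crude proxy for the agent's partial view.
    q = {(s, a): 0.0 for s in ACTIONS for a in ACTIONS}
    state, total = ACTIONS[0], 0.0
    for _ in range(steps):
        if policy == "random" or rng.random() < epsilon:
            action = rng.choice(ACTIONS)          # explore
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
        rtt = observe_rtt(action, rng)
        reward = -rtt                             # lower delay, higher reward
        next_state = action
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # Standard one-step Q-learning update.
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state, total = next_state, total + rtt
    return total / steps


if __name__ == "__main__":
    print("random mean RTT:    ", round(run("random"), 2))
    print("q-learning mean RTT:", round(run("q"), 2))
```

A neural Q-function would replace the dictionary `q` once the observation space becomes too large to enumerate, which is presumably why the paper uses a DQN.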

What carries the argument

DQN agent whose state is built from POMDP observations supplied by eBPF passive delay measurement on GTP-U TEID timestamps.
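
To make the measurement idea concrete, here is a minimal userspace sketch of TEID-keyed timestamp correlation. The paper performs this step in eBPF programs; the Python class below only imitates the logic of a hash map keyed by TEID and is not the authors' code. The class and method names are hypothetical. The sketch also assumes that the forward and return packets can be matched on a single TEID value, which is a simplification, since uplink and downlink tunnels may carry different TEIDs and then need a mapping step.

```python
import struct

GTPU_GPDU = 0xFF  # GTP-U message type for encapsulated user data (G-PDU)


def parse_teid(gtpu_header: bytes) -> int:
    """Extract the TEID from the 8-byte mandatory GTP-U header."""
    flags, msg_type, _length, teid = struct.unpack("!BBHI", gtpu_header[:8])
    if flags >> 5 != 1 or msg_type != GTPU_GPDU:
        raise ValueError("not a GTPv1-U G-PDU")
    return teid


class PassiveDelayEstimator:
    """Userspace stand-in for a TEID-keyed eBPF hash map of timestamps."""

    def __init__(self):
        self.pending = {}   # teid -> timestamp (ns) of the outstanding packet
        self.samples = []   # completed delay samples in nanoseconds

    def on_forward(self, teid: int, ts_ns: int) -> None:
        # Keep only the oldest outstanding timestamp per TEID.
        self.pending.setdefault(teid, ts_ns)

    def on_return(self, teid: int, ts_ns: int):
        start = self.pending.pop(teid, None)
        if start is None:
            return None  # no matching forward packet was seen
        delay = ts_ns - start
        self.samples.append(delay)
        return delay


if __name__ == "__main__":
    header = struct.pack("!BBHI", 0x30, GTPU_GPDU, 64, 0x1234)
    est = PassiveDelayEstimator()
    teid = parse_teid(header)
    est.on_forward(teid, 1_000_000)
    print(est.on_return(teid, 4_500_000))  # 3500000
```

Because no probe packets are generated, the estimator only yields samples while user traffic is flowing on the tunnel.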

If this is right

  • The DQN policy achieves measurably lower average latency than random path selection.
  • Reward values remain more stable across episodes, indicating consistent path quality.
  • Low-delay paths are chosen more reliably than under random selection.
  • Passive eBPF measurement supplies the necessary state information without adding probe traffic.
  • Reinforcement learning can be embedded directly inside B5G core network functions.
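
The stability claim in the list above can be checked with a short sketch. Figure 4 of the paper smooths rewards with a rolling mean of window size 10; the rolling standard deviation used here as a stability measure is an assumption, since the paper's own stability metric is not stated. Function names are illustrative.

```python
from statistics import mean, pstdev


def rolling_mean(values, window=10):
    """Trailing rolling mean; early points use a shorter window."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(mean(chunk))
    return out


def rolling_std(values, window=10):
    """Population std over each full trailing window (assumed stability proxy)."""
    out = []
    for i in range(window - 1, len(values)):
        out.append(pstdev(values[i - window + 1: i + 1]))
    return out


if __name__ == "__main__":
    rewards = [-20, -18, -25, -9, -8, -10, -8, -9, -8, -8, -9, -8]
    print(rolling_mean(rewards)[-1])
    print(rolling_std(rewards)[-1])
```

A lower rolling standard deviation late in training would be consistent with a policy that has settled on one path.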

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same passive timestamp-linking approach could be reused for other metrics such as jitter or packet loss in GTP-U flows.
  • The eUPF architecture could be extended to joint path and compute selection across multiple MEC sites.
  • Production deployment would reduce monitoring overhead compared with active probing schemes.
  • The learned policies might transfer to related 5G control tasks such as dynamic slice assignment.

Load-bearing premise

The POMDP formulation together with the eBPF-derived delay estimates are assumed to provide a sufficiently accurate and low-overhead model of real network conditions for the DQN agent to learn effective policies.

What would settle it

A side-by-side run of the DQN agent and random baseline on a live B5G testbed in which the agent shows no reduction in average latency or improvement in reward stability.

Figures

Figures reproduced from arXiv: 2605.01475 by Flávio de Oliveira Silva, Larissa Ferreira Rodrigues Moreira, Rodrigo Moreira, Tereza Cristina Carvalho.

Figure 1: Proposed Approach. Observations. The agent does not observe $S_t$ directly. Instead, after selecting an action $A_t \in \{n6a, n6b\}$, it receives a noisy delay measurement. Let $i$ denote the interface selected by $A_t$ (that is, $i = A_t$). The observation process (Eq. 1) is:
$$\mathrm{RTT}(A_t) = d_i \cdot \mathbb{I}[s_i(t) = \mathrm{BAD}] + \xi_t, \qquad \xi_t \sim \mathcal{U}(-J_{\max}, +J_{\max}) \tag{1}$$
where $d_i$ is the additional delay incurred when interface $i$ is in the BAD state, $\mathbb{I}[\cdot]$ i…
Figure 2: Experimental Setup. The stochastic degradation parameters used in our experiments are summarized in…
Figure 3: Reward dynamics of the random baseline across 400 training episodes.
Figure 4: Comparison of raw and smoothed cumulative rewards (rolling mean, window size 10) over the training episodes of the DQN agent. DQN learns an effective and stable low-latency policy under stochastic degradations.
Figure 5: Action distributions over the 400 episodes, connecting learning dynamics to policy behavior.
Figure 6: Comparison of average round trip time. The User Equipment generated Internet Control Message Protocol (ICMP) traffic at one packet per second, and the plot aggregates transmissions over 10-second intervals, averaged across the entire duration. Under the DQN policy, more packets are forwarded to MEC, with an average of 3.29 per interval versus 0.57 toward the cloud, reflecting sustained preference for the lower la…
Figure 7: Packet out distribution.
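
The observation process quoted in the Figure 1 caption (Eq. 1) can be sampled with a few lines of code. This is a sketch of that one equation only: it takes the hidden BAD/GOOD state of each interface as given and does not model how the state evolves, which the excerpt does not specify. The numeric values for $d_i$ and $J_{\max}$ are hypothetical placeholders.

```python
import random

INTERFACES = ("n6a", "n6b")  # action labels from the Figure 1 caption


def sample_rtt(action, is_bad, extra_delay, j_max, rng):
    """Sample RTT(A_t) = d_i * I[s_i(t) = BAD] + xi_t, xi_t ~ U(-J_max, +J_max)."""
    indicator = 1.0 if is_bad[action] else 0.0   # I[s_i(t) = BAD]
    jitter = rng.uniform(-j_max, j_max)          # xi_t
    return extra_delay[action] * indicator + jitter


if __name__ == "__main__":
    rng = random.Random(0)
    extra_delay = {"n6a": 30.0, "n6b": 30.0}   # hypothetical d_i values
    is_bad = {"n6a": False, "n6b": True}       # hidden state, unseen by the agent
    for action in INTERFACES:
        print(action, round(sample_rtt(action, is_bad, extra_delay, 2.0, rng), 3))
```

Note that the agent only sees the returned number for the interface it chose, which is what makes the problem partially observable.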

Original abstract

In Beyond 5G (B5G) networks, intelligent, flexible traffic management is essential to meet the stringent speed and reliability requirements of new applications. This paper presents an improved User Plane Function (eUPF) design that uses a Deep Q-Network (DQN) agent for real-time path selection between Multi-access Edge Computing (MEC) and cloud endpoints. The path selection problem is formulated as a Partially Observable Markov Decision Process (POMDP). We propose a novel passive delay measurement method that uses eBPF programs to link TEID-based timestamps in GTP-U traffic, allowing for low-cost delay estimation without active testing. Experiments show that the DQN agent substantially outperforms a random baseline, with lower average latency, more stable rewards, and more reliable low-delay path choices. These results demonstrate the effectiveness of AI-driven control in B5G core networks and the promise of reinforcement learning for modern network management.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents an enhanced User Plane Function (eUPF) for Beyond 5G edge networks that uses a Deep Q-Network (DQN) agent to perform real-time path selection between MEC and cloud endpoints. The problem is cast as a POMDP whose observations come from a novel passive delay estimator that links TEID timestamps in GTP-U traffic via eBPF programs. Experiments are reported to show that the learned policy yields lower average latency, more stable rewards, and more reliable low-delay paths than a random baseline.

Significance. If the empirical gains prove robust, the work would be of moderate significance for B5G core-network research. It illustrates a concrete combination of eBPF-based passive observability with reinforcement learning for latency-sensitive routing, an approach that could reduce the overhead of active probing while enabling adaptive traffic steering. No machine-checked proofs or fully reproducible artifacts are described, but the empirical comparison to a baseline supplies a falsifiable starting point that future studies could extend.

major comments (2)
  1. [Experimental results and passive delay measurement description] The central claim that the DQN agent substantially outperforms the random baseline rests on the accuracy of the eBPF-derived delay estimates used to construct POMDP observations, yet the manuscript supplies no quantitative calibration (MAE, correlation, or comparison against active probes or hardware timestamps) of these passive measurements. Without such validation, it is impossible to determine whether the reported latency reductions and stable rewards reflect genuine network conditions or artifacts of the estimator.
  2. [Experiments section] No details are given on the simulation or testbed configuration, number of independent runs, variance across trials, or statistical tests supporting the claims of lower average latency and more reliable path choices. Likewise, the POMDP state representation, reward function, and DQN hyper-parameters are not specified, rendering the performance gains impossible to reproduce or assess for robustness.
minor comments (1)
  1. [Abstract] The abstract states that the DQN agent provides 'more stable rewards' but does not define the reward function or the metric used to quantify stability.
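
The calibration requested in the first major comment could be computed along the following lines. This is a minimal sketch assuming paired samples of passive estimates and reference measurements (for example from active probes) are already available; the function names and sample values are illustrative and do not come from the paper.

```python
from math import sqrt


def mae(estimates, reference):
    """Mean absolute error between passive estimates and a reference."""
    return sum(abs(e - r) for e, r in zip(estimates, reference)) / len(reference)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


if __name__ == "__main__":
    passive_ms = [5.1, 5.3, 34.8, 5.0, 20.4]   # made-up passive estimates
    probe_ms = [5.0, 5.2, 35.0, 5.1, 20.0]     # made-up reference values
    print("MAE (ms):", mae(passive_ms, probe_ms))
    print("Pearson r:", pearson(passive_ms, probe_ms))
```

A low MAE together with a correlation near 1 would indicate that the passive estimator tracks the reference closely enough to serve as the agent's observation.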

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on experimental validation and reproducibility. We address each point below and commit to revisions that strengthen the manuscript without altering its core contributions.

  1. Referee: The central claim that the DQN agent substantially outperforms the random baseline rests on the accuracy of the eBPF-derived delay estimates used to construct POMDP observations, yet the manuscript supplies no quantitative calibration (MAE, correlation, or comparison against active probes or hardware timestamps) of these passive measurements. Without such validation, it is impossible to determine whether the reported latency reductions and stable rewards reflect genuine network conditions or artifacts of the estimator.

    Authors: We agree that quantitative validation of the passive delay estimator is necessary to substantiate the performance claims. The original manuscript emphasizes the novel integration of eBPF-based passive measurement with DQN-driven path selection rather than standalone estimator benchmarking. In the revised manuscript we will add a new subsection to the Experiments section that reports calibration results, including MAE, Pearson correlation, and direct comparisons against active probing and hardware timestamps on the same testbed traffic. revision: yes

  2. Referee: No details are given on the simulation or testbed configuration, number of independent runs, variance across trials, or statistical tests supporting the claims of lower average latency and more reliable path choices. Likewise, the POMDP state representation, reward function, and DQN hyper-parameters are not specified, rendering the performance gains impossible to reproduce or assess for robustness.

    Authors: We accept that the original submission omitted these implementation details, limiting reproducibility. The revised Experiments section will be expanded to provide: (i) complete simulation and testbed configuration (topology, link capacities, traffic generators, and eBPF deployment), (ii) results from 20 independent runs with reported means and standard deviations, (iii) statistical significance tests (paired t-tests with p-values), and (iv) the full POMDP specification (state features, action space, reward function) together with all DQN hyperparameters (network architecture, learning rate, discount factor, replay buffer size, and training schedule). revision: yes
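
The paired comparison promised in the second response can be sketched as follows. The code computes only the paired t statistic and its degrees of freedom using the standard library; turning that into a p-value needs a Student t distribution, for which a library routine such as `scipy.stats.ttest_rel` is one option. The per-run latencies below are made-up numbers, not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # t = mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    random_ms = [15.2, 16.1, 14.8, 15.9, 15.5]   # hypothetical per-run means
    dqn_ms = [9.1, 8.7, 9.4, 8.9, 9.0]           # hypothetical per-run means
    t, df = paired_t(random_ms, dqn_ms)
    print(f"t = {t:.2f}, df = {df}")
```

Pairing is appropriate only if each run of the two policies shares the same conditions, such as the same random seed for the degradation process.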

Circularity Check

0 steps flagged

No circularity; empirical comparison to baseline is independent of definitions

The paper formulates path selection as a POMDP, implements passive eBPF-based delay estimation from TEID timestamps, trains a DQN agent, and reports experimental outperformance versus a random baseline on latency and reward metrics. No load-bearing step reduces by construction to its own inputs: there are no fitted parameters renamed as predictions, no self-definitional equations, and no uniqueness theorems or ansatzes imported via self-citation. The central result is an external empirical comparison whose validity depends on the accuracy of the measurement method and testbed, not on any definitional equivalence within the paper itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review performed on abstract only; full paper may contain additional parameters and assumptions. The ledger therefore records only the elements explicitly named in the abstract.

free parameters (1)
  • DQN hyperparameters
    Learning rate, discount factor, and network architecture are required for any DQN implementation but are not reported in the abstract.
axioms (1)
  • domain assumption: Path selection in B5G edge networks can be usefully modeled as a Partially Observable Markov Decision Process.
    Explicitly stated in the abstract as the problem formulation.
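
For orientation, the textbook DQN objective shows where the free parameters listed above enter. This is the standard form and may differ from the exact loss used in the paper. The symbols are generic notation rather than the authors': $o$ is the observation, $\theta$ the online network weights, $\theta^{-}$ the target network weights, $\gamma$ the discount factor, and $\mathcal{D}$ the replay buffer. The learning rate scales the gradient step taken on $L(\theta)$.

```latex
L(\theta) = \mathbb{E}_{(o, a, r, o') \sim \mathcal{D}}
  \left[ \left( r + \gamma \max_{a'} Q(o', a'; \theta^{-}) - Q(o, a; \theta) \right)^{2} \right]
```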

pith-pipeline@v0.9.0 · 5469 in / 1307 out tokens · 40415 ms · 2026-05-09T18:00:50.773445+00:00 · methodology
