Towards Autonomous Railway Operations: A Semi-Hierarchical Deep Reinforcement Learning Approach to the Vehicle Rescheduling Problem

Adrian Egli; Alberto Castagna; Anton Fuxjager; Christian Eichenberger; Daniel Boos; Manuel Meyer; Stefan Zahlner

arxiv: 2605.10257 · v1 · submitted 2026-05-11 · 💻 cs.AI

Alberto Castagna, Stefan Zahlner, Adrian Egli, Christian Eichenberger, Daniel Boos, Manuel Meyer, Anton Fuxjager

Pith reviewed 2026-05-12 05:21 UTC · model grok-4.3

classification 💻 cs.AI

keywords reinforcement learning · railway rescheduling · vehicle routing · disruption management · semi-hierarchical RL · traffic coordination · multi-agent control · Flatland simulator

The pith

A semi-hierarchical reinforcement learning method separates dispatching from routing decisions to handle railway disruptions with higher success rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a semi-hierarchical deep reinforcement learning approach for the vehicle rescheduling problem that arises during rail traffic disruptions. It separates infrequent dispatching choices from frequent routing updates by giving each its own action and observation spaces so that separate policies can focus on their respective scopes. This design directly tackles the frequency imbalance that causes standard single-policy RL to underperform in dense networks. A reader would care because current rail operations still depend heavily on human dispatchers despite growing traffic density, and an automated method that doubles successful train completions while holding deadlocks under five percent could support higher throughput without added infrastructure.

Core claim

The central claim is that a semi-hierarchical RL formulation tailored to operational railway constraints, which uses dedicated action and observation spaces to separate dispatching from routing, produces substantially improved coordination, resource utilisation and robustness. Evaluated on the Flatland-RL simulator across five difficulty levels and fifty random seeds with seven to eighty trains, the approach nearly doubles the number of trains that reach their destinations relative to heuristic baselines and monolithic RL while keeping deadlock rates below five percent and enabling adaptive sequencing, delaying or cancellation of trains under heavy congestion.

What carries the argument

The semi-hierarchical RL formulation that assigns separate action and observation spaces to dispatching and routing so policies can specialise in distinct decision scopes and address decision-frequency imbalance.
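The split described above can be illustrated with a minimal sketch of one control step. Everything here is an assumption made for illustration: the `Train` fields, the action labels, the dictionary observations and the two policy callables are hypothetical and are not taken from the paper. The point is only that the dispatching policy is consulted rarely (once a train is ready to leave) while the routing policy is consulted whenever a dispatched train sits at a decision cell.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Train:
    train_id: int
    earliest_departure: int
    dispatched: bool = False
    at_decision_cell: bool = False  # hypothetical flag: train is at a switch


def control_step(
    t: int,
    trains: List[Train],
    dispatch_policy: Callable[[Dict], str],
    routing_policy: Callable[[Dict], str],
) -> Dict[int, str]:
    """Return one action for each train that needs a decision at time t."""
    actions: Dict[int, str] = {}
    for train in trains:
        if not train.dispatched:
            # Rare decision: only asked once the departure time is reached.
            if t >= train.earliest_departure:
                dispatch_obs = {"time": t, "train_id": train.train_id}
                actions[train.train_id] = dispatch_policy(dispatch_obs)
        elif train.at_decision_cell:
            # Frequent decision: asked at every decision cell on the route.
            routing_obs = {"train_id": train.train_id}
            actions[train.train_id] = routing_policy(routing_obs)
    return actions


if __name__ == "__main__":
    fleet = [
        Train(0, earliest_departure=5),
        Train(1, earliest_departure=0, dispatched=True, at_decision_cell=True),
    ]
    print(control_step(5, fleet, lambda obs: "depart", lambda obs: "forward"))
```

Each policy sees only its own observation dictionary, which is the property the pith attributes to the dedicated spaces; how the paper actually encodes those observations is not shown on this page.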

Load-bearing premise

That the separation of dispatching and routing via dedicated spaces will keep balancing rare and frequent decisions and yield robust policies once the method moves from simulation into real rail systems that include physical constraints, sensor noise and regulatory rules.

What would settle it

Deploying the trained policies on recorded real-world railway disruption data that includes sensor uncertainty and regulatory constraints and measuring whether the fraction of trains reaching destinations drops below the reported doubling or deadlock rates exceed five percent.
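A check of this kind can be sketched in a few lines. The record type `EpisodeResult`, the per-train definition of the deadlock rate and the `min_ratio` threshold are assumptions for illustration; the page states only "nearly doubles" and "below five percent", so the ratio is left as a required argument rather than a default.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EpisodeResult:
    n_trains: int
    n_arrived: int
    n_deadlocked: int


def completion_rate(episodes: Sequence[EpisodeResult]) -> float:
    """Fraction of all trains that reached their destination."""
    total = sum(e.n_trains for e in episodes)
    return sum(e.n_arrived for e in episodes) / total if total else 0.0


def deadlock_rate(episodes: Sequence[EpisodeResult]) -> float:
    """Fraction of all trains that ended in deadlock (assumed definition)."""
    total = sum(e.n_trains for e in episodes)
    return sum(e.n_deadlocked for e in episodes) / total if total else 0.0


def claim_survives(
    candidate: Sequence[EpisodeResult],
    baseline: Sequence[EpisodeResult],
    min_ratio: float,
    max_deadlock: float = 0.05,
) -> bool:
    """True if the candidate keeps its completion advantage and deadlock bound."""
    base = completion_rate(baseline)
    if base == 0.0:
        return False  # ratio undefined; treat as not verified
    ratio = completion_rate(candidate) / base
    return ratio >= min_ratio and deadlock_rate(candidate) <= max_deadlock


if __name__ == "__main__":
    # Synthetic numbers for illustration only, not results from the paper.
    candidate = [EpisodeResult(n_trains=10, n_arrived=8, n_deadlocked=0)]
    baseline = [EpisodeResult(n_trains=10, n_arrived=4, n_deadlocked=2)]
    print(claim_survives(candidate, baseline, min_ratio=1.8))
```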

Figures

Figures reproduced from arXiv: 2605.10257 by Adrian Egli, Alberto Castagna, Anton Fuxjager, Christian Eichenberger, Daniel Boos, Manuel Meyer, Stefan Zahlner.

**Figure 3.** MAPF: representation of an observation. Each agent is equipped with a decision controller that prevents unnecessary decisions [16] under masking conditions. For MADS, actions are skipped if the train has not yet reached its departure time or if no valid path exists to the target within the episode horizon. For MAPF, masking limits available actions to those feasible at the current cell, considering route…

**Figure 2.** Semi-hierarchical control loop with decision controller.

**Figure 4.** Shared architecture for both MAPF and MADS. Labels in the diagram, in extraction order: Dense [128]; Concatenation [k, n, k, 3]; Dense [k, n, k, 128]; Masked Avg Pooling [k*128]; Concatenation [(k+1)*128]; Dense [128]; Dense [64]; Output [2]; MADS Architecture; Global_obs [5]; Dist_conflict [k, n, k, 2]; N_conflict_cells [k, n, k, 1]; Mask_active_trains [k, n, k, 1]; Dense [3, 512]; Self-attention n_heads: 16 [3, 512]; Flatten [1536]; Dense [256]; Dense …

**Figure 5.** Average active-train concurrency over time for 80 trains.

**Figure 6.** Mean performance over 50 episodes across 5 levels.

**Figure 7.** Ablation study showing performance of hybrid config.
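The masking conditions in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation: the function names, the `shortest_path_len` input (assumed to come from some external path search, with `None` meaning unreachable) and the score-based action selection are all hypothetical. The caption is truncated, so any further MAPF masking criteria are not modelled.

```python
from typing import List, Optional, Sequence


def mads_needs_decision(
    t: int,
    departure_time: int,
    shortest_path_len: Optional[int],
    horizon: int,
) -> bool:
    """Skip the dispatch decision when it cannot matter."""
    if t < departure_time:
        return False  # train has not yet reached its departure time
    if shortest_path_len is None:
        return False  # no path to the target at all
    # A path exists, but it must still fit inside the episode horizon.
    return t + shortest_path_len <= horizon


def mapf_action_mask(all_actions: Sequence[str], feasible: Sequence[str]) -> List[bool]:
    """Mark which routing actions are feasible at the current cell."""
    feasible_set = set(feasible)
    return [a in feasible_set for a in all_actions]


def masked_argmax(scores: Sequence[float], mask: Sequence[bool]) -> Optional[int]:
    """Index of the best-scoring allowed action, or None if all are masked."""
    best: Optional[int] = None
    for i, (score, allowed) in enumerate(zip(scores, mask)):
        if allowed and (best is None or score > scores[best]):
            best = i
    return best


if __name__ == "__main__":
    actions = ("left", "forward", "right")
    mask = mapf_action_mask(actions, ["forward", "right"])
    print(mads_needs_decision(5, 5, 10, 100), mask, masked_argmax([0.9, 0.1, 0.5], mask))
```

Returning `None` when every action is masked lets the caller skip the policy call entirely, which matches the caption's description of a controller that prevents unnecessary decisions.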

Original abstract

Managing disruptions in railway traffic management is a major challenge. Rising traffic density and infrastructure limits increase complexity, making the Vehicle Routing and Scheduling Problem (VRSP) difficult to solve reliably and in real time. While Operational Research (OR) methods are widely used, most dispatching still relies on human expertise due to the problem's exponential combinatorial complexity. Reinforcement Learning (RL) has gained attention for its potential in multi-agent coordination, but existing RL approaches often underperform OR methods and struggle to scale in dense rail networks. This paper addresses this gap from a machine learning perspective by introducing a semi-hierarchical RL formulation tailored to operational railway constraints. The method separates dispatching from routing through dedicated action and observation spaces, enabling policies to specialise in distinct decision scopes and addressing the imbalance between rare dispatch decisions and frequent routing updates. The approach is evaluated on the Flatland-RL simulator across five difficulty levels and 50 random seeds, with 7 to 80 trains. Results show substantially improved coordination, resource utilisation, and robustness compared with heuristic baselines and monolithic RL, nearly doubling the number of trains reaching their destinations, while keeping deadlock rates below 5% and adaptively sequencing, delaying, or cancelling trains under heavy congestion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The semi-hierarchical RL splits dispatching from routing via separate spaces and beats baselines in Flatland simulator tests, but the gains stay tied to idealized conditions with no real-world validation.

The main point is that this paper splits dispatching decisions from routing ones using dedicated action and observation spaces in a deep RL setup for the vehicle rescheduling problem. That separation targets the imbalance between infrequent high-level choices and frequent low-level updates, and the simulator runs show it improves coordination enough to nearly double the number of trains reaching their destinations while holding deadlocks under 5 percent across difficulty levels and 50 seeds. The adaptive sequencing, delaying, and cancelling behavior under congestion is a practical plus compared with monolithic RL or simple heuristics. The evaluation on Flatland-RL with 7 to 80 trains gives a reasonable test bed for the claim that specialization helps resource use and robustness inside the model.

What the work does well is lay out a tailored hierarchy for railway constraints and back it with multi-seed results that go beyond single-run anecdotes. The comparison to both heuristic baselines and a flat RL version makes the contribution clearer than many RL scheduling papers.

The soft spots are straightforward. All results rest on the simulator's discrete steps and perfect information, with no added noise, variable speeds, partial observability, or track gradients to check whether the hierarchy advantage survives. The abstract gives no specifics on reward design or training details, which leaves the exact source of the gains harder to pin down even if the seed count helps.

This paper is for researchers working on RL for transportation scheduling or multi-agent coordination under operational constraints. Readers who want concrete examples of hierarchical structures applied to dense networks will get the most from it. It deserves a serious referee because the problem matters, the empirical setup is decent, and the formulation is a clear step from prior monolithic RL attempts. I would send it to peer review with the expectation that reviewers will press on sim-to-real gaps and ask for more ablation on the hierarchy components.

Referee Report

2 major / 2 minor

Summary. The paper proposes a semi-hierarchical deep RL formulation for the Vehicle Rescheduling Problem in railway traffic management. It separates dispatching from routing decisions via dedicated action and observation spaces to address decision-frequency imbalance, enabling policy specialization. Evaluated in the Flatland-RL simulator on five difficulty levels with 7–80 trains and 50 random seeds, the method is reported to nearly double the number of trains reaching destinations while maintaining deadlock rates below 5%, outperforming heuristic baselines and monolithic RL under congestion.

Significance. If the simulator results hold under the reported conditions, the work demonstrates a practical RL architecture for multi-agent coordination in dense rail networks, offering an alternative to traditional OR methods that often fail to scale in real time. The multi-level, multi-seed evaluation is a positive aspect for assessing robustness within the simulator. However, significance for autonomous operations is limited by the absence of sim-to-real validation and insufficient methodological details for reproducibility.

major comments (2)

[Abstract and experimental evaluation] The central empirical claims (nearly doubled destination success and <5% deadlock rates) depend on comparisons to baselines, yet the abstract and evaluation provide no details on training procedures, reward design, baseline implementations, or statistical testing. This omission directly undermines verification of the reported performance gains over monolithic RL and heuristics.
[Evaluation and discussion of robustness] The robustness and coordination benefits attributed to the semi-hierarchical separation of dispatching/routing spaces are demonstrated only in the idealized Flatland-RL environment (discrete steps, perfect observations). No ablation with additive noise, variable speeds, partial observability, or continuous dynamics is reported, so it remains unclear whether the specialization benefit generalizes beyond the simulator's simplified transitions.
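The kind of ablation requested in the second major comment could start from an observation perturbation like the one below. This is a hypothetical sketch, not part of the paper or of Flatland-RL: it assumes observations can be flattened to a list of floats, models sensor noise as additive Gaussian noise, and models partial observability by replacing entries with a fill value.

```python
import random
from typing import List, Sequence


def perturb_observation(
    obs: Sequence[float],
    noise_std: float,
    drop_prob: float,
    rng: random.Random,
    fill_value: float = 0.0,
) -> List[float]:
    """Return a noisy, partially masked copy of a flat observation vector."""
    perturbed: List[float] = []
    for value in obs:
        if rng.random() < drop_prob:
            # Entry is hidden from the policy (partial observability).
            perturbed.append(fill_value)
        else:
            # Entry is visible but corrupted by additive Gaussian noise.
            perturbed.append(value + rng.gauss(0.0, noise_std))
    return perturbed


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run is repeatable
    print(perturb_observation([1.0, 2.0, 3.0], noise_std=0.1, drop_prob=0.2, rng=rng))
```

Sweeping `noise_std` and `drop_prob` while holding the trained policies fixed would show how quickly the reported advantage degrades, under the stated assumption that the perturbation is applied to both the hierarchical and the monolithic policies alike.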

minor comments (2)

[Abstract and introduction] Clarify whether 'Vehicle Rescheduling Problem' and 'Vehicle Routing and Scheduling Problem' are used synonymously, and ensure consistent terminology throughout.
[Abstract] The abstract would benefit from a brief statement on the reward structure or key hyperparameters to help readers contextualize the specialization mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment point by point below, clarifying aspects of the manuscript and indicating revisions made to improve clarity, reproducibility, and discussion of limitations.

Referee: [Abstract and experimental evaluation] The central empirical claims (nearly doubled destination success and <5% deadlock rates) depend on comparisons to baselines, yet the abstract and evaluation provide no details on training procedures, reward design, baseline implementations, or statistical testing. This omission directly undermines verification of the reported performance gains over monolithic RL and heuristics.

Authors: We appreciate this feedback on clarity. The full manuscript details the training procedures (including hyperparameters and optimization), reward design (with components for destination arrival, deadlock avoidance, and efficiency), baseline implementations (heuristic rules and monolithic RL architecture), and evaluation protocol in Sections 3 and 4. However, we agree that these elements could be more prominently summarized in the evaluation section for easier verification. In the revised version, we have added an explicit subsection in the experimental evaluation that consolidates these details, includes mean and standard deviation results across the 50 seeds, and reports basic statistical comparisons (e.g., paired t-tests) for the performance differences. We have also incorporated a concise reference to the semi-hierarchical methodology in the abstract. These changes directly address the verification concern without altering the core claims.

revision: yes

Referee: [Evaluation and discussion of robustness] The robustness and coordination benefits attributed to the semi-hierarchical separation of dispatching/routing spaces are demonstrated only in the idealized Flatland-RL environment (discrete steps, perfect observations). No ablation with additive noise, variable speeds, partial observability, or continuous dynamics is reported, so it remains unclear whether the specialization benefit generalizes beyond the simulator's simplified transitions.

Authors: We agree that the evaluation is performed exclusively in the Flatland-RL simulator, which assumes discrete time steps and perfect observations. This environment is a widely used benchmark for multi-agent railway rescheduling, and our results show consistent gains in destination success and low deadlock rates across five difficulty levels (7–80 trains) and 50 random seeds. In the revised manuscript, we have expanded the Discussion section to more explicitly articulate the simulator assumptions, their relation to operational railway constraints, and the potential advantages of policy specialization under those conditions. We also outline how the semi-hierarchical action/observation separation might mitigate issues in noisier settings. That said, new ablation experiments involving additive noise, variable speeds, partial observability, or continuous dynamics would require substantial simulator extensions and additional compute; we have therefore noted these as important directions for future work rather than including them in the current revision.

revision: partial
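The per-seed summary and paired comparison mentioned in the first response can be sketched with the standard library alone. The function names and the sample values are illustrative assumptions; the numbers are synthetic and are not results from the paper. Pairing is by seed: entry i of each list is assumed to come from the same random seed.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return mean(values), stdev(values)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-seed scores a vs b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    # t = mean difference divided by its standard error.
    t_stat = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


if __name__ == "__main__":
    # Synthetic per-seed scores, for illustration only.
    hierarchical = [3.0, 5.0, 7.0, 9.0]
    monolithic = [1.0, 2.0, 3.0, 4.0]
    print(summarize(hierarchical))
    print(paired_t_statistic(hierarchical, monolithic))
```

Converting the statistic to a p-value needs the Student t distribution, which the standard library does not provide; a statistics package would be the usual route for that step.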

Circularity Check

0 steps flagged

No circularity: semi-hierarchical RL formulation and empirical results are independently defined and externally benchmarked

The paper defines a semi-hierarchical RL method by explicitly separating dispatching and routing into dedicated action/observation spaces to address decision-frequency imbalance. This separation is introduced as a modeling choice, not derived from or equivalent to the performance metrics it later reports. All central claims (nearly doubled destination success, <5% deadlock) rest on direct empirical evaluation against heuristic baselines and monolithic RL inside the Flatland-RL simulator across five difficulty levels and 50 seeds. No equations, fitted parameters, or self-citations are invoked to force the reported outcomes; the simulator serves as an external, reproducible testbed whose discrete-step assumptions are stated rather than smuggled in. The derivation chain therefore remains self-contained and falsifiable outside the paper's own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach relies on standard deep RL components whose specific hyperparameters and reward weights are unspecified.

pith-pipeline@v0.9.0 · 5537 in / 1200 out tokens · 49616 ms · 2026-05-12T05:21:40.506577+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] F. Laurent, M. Schneider, C. Scheller, J. Watson, J. Li, Z. Chen, Y. Zheng, S.-H. Chan, K. Makhnev, O. Svidchenko et al., “Flatland competition 2020: Mapf and marl for efficient train coordination on a grid world,” in NeurIPS 2020 Competition and Demonstration Track. PMLR, 2021, pp. 275–301

[2] Y. Jiang, K. Zhang, Q. Li, J. Chen, and X. Zhu, “Multi-agent path finding via tree lstm,” arXiv preprint arXiv:2210.12933, 2022

[3] J.-Q. Li, P. B. Mirchandani, and D. Borenstein, “The vehicle rescheduling problem: Model and algorithms,” Networks: An International Journal, vol. 50, no. 3, pp. 211–229, 2007

[4] L. Lindenmaier, I. F. Lövétei, and S. Aradi, “Efficient real-time rail traffic optimization: Decomposition of rerouting, reordering, and rescheduling problem,” arXiv preprint arXiv:2209.12689, 2022

[5] J. Ferber and G. Weiss, Multi-agent systems: an introduction to distributed artificial intelligence. Addison-Wesley Reading, 1999, vol. 1

[6] J. Li, Z. Chen, Y. Zheng, S.-H. Chan, D. Harabor, P. J. Stuckey, H. Ma, and S. Koenig, “Scalable rail planning and replanning: Winning the 2020 flatland challenge,” in Proceedings of the International Conference on Automated Planning and Scheduling, vol. 31, 2021, pp. 477–485

[7] Y. Zhang, U. Deekshith, J. Wang, and J. Boedecker, “Improving the efficiency and efficacy of multi-agent reinforcement learning on complex railway networks with a local-critic approach,” in Proceedings of the International Conference on Automated Planning and Scheduling, vol. 34, 2024, pp. 698–706

[8] S. Schneider, A. Ramesh, A. Roets, C. Stirbu, F. Safaei, F. Ghriss, J. Wülfing, M. Güra, N. Sibon, R. Gentry et al., “Intelligent railway capacity and traffic management using multi-agent deep reinforcement learning,” in 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2024, pp. 1743–1748

[9] D. Silver, “Cooperative pathfinding,” in Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, vol. 1, 2005, pp. 117–122

[10] P. Shaw, “Using constraint programming and local search methods to solve vehicle routing problems,” in International Conference on Principles and Practice of Constraint Programming. Springer, 1998, pp. 417–431

[11] H. Ma, T. S. Kumar, and S. Koenig, “Multi-agent path finding with delay probabilities,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, 2017

[12] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune, “Go-explore: A new approach for hard-exploration problems. arxiv. 2019,” 1901

[13] T. G. Dietterich, “Hierarchical reinforcement learning with the maxq value function decomposition,” Journal of Artificial Intelligence Research, vol. 13, pp. 227–303, 2000

[14] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artificial Intelligence, vol. 101, no. 1-2, pp. 99–134, 1998

[15] Flatland Association, “flatland-rl: Openai gym environment for railway management,” https://github.com/flatland-association/flatland-rl, 2025, GitHub repository, accessed 2025-07-22

[16] D. Roost, R. Meier, S. Huschauer, E. Nygren, A. Egli, A. Weiler, and T. Stadelmann, “Improving sample efficiency and multi-agent communication in rl-based train rescheduling,” in 2020 7th Swiss Conference on Data Science (SDS). IEEE, 2020, pp. 63–64

[17] Y. El Manyari, A. R. Fuxjaeger, S. Zahlner, J. van Dijk, A. Castagna, D. Barbieri, J. Viebahn, and M. Wasserer, “Efficient multi-objective optimisation for real-world power grid topology control,” in Proceedings of the 16th ACM International Conference on Future and Sustainable Energy Systems, ser. E-Energy ’25. New York, NY, USA: Association for Computi... doi:10.1145/3679240.3734659
