COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space
Pith reviewed 2026-05-21 05:21 UTC · model grok-4.3
The pith
COAgents trains agents on a partial search graph to guide solution search for vehicle routing problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
COAgents is a cooperative multi-agent framework that models the search process as a graph where nodes represent solutions and edges correspond to local refinements or large perturbations for diversification. It dynamically constructs a Partial Search Graph during search to train a Node Selection Agent and a Move Selection Agent for intensification as well as a Jump Agent to trigger explorations of new regions. The framework separates problem-agnostic search control from compact domain-specific encoding to improve adaptability.
What carries the argument
Partial Search Graph (PSG) dynamically built from solutions during search, used to train Node Selection, Move Selection, and Jump agents that control intensification and diversification.
If this is right
- COAgents sets a new state of the art among learning-based methods on VRPTW instances.
- Reduces the gap to the best-known solutions by 14% at N=100 and 44% at N=50 relative to POMO.
- Reduces the gap by 21% at N=100 and 40% at N=50 relative to ALNS.
- Remains competitive with several learn-to-search baselines on CVRP.
- The separation of search control from domain encoding supports adaptability across different tasks.
Where Pith is reading between the lines
- The approach could be tested on other combinatorial optimization problems that use local search and diversification strategies.
- Learned jump timing might help in problems where manual diversification rules are hard to design.
Load-bearing premise
Dynamically constructing a Partial Search Graph and training specialized agents on it will reliably guide both intensification and diversification across diverse VRP instances without needing extensive problem-specific handcrafted rules.
What would settle it
Evaluating COAgents on a fresh collection of VRPTW instances of size 50 and 100 and measuring whether the optimality gap reductions match or exceed the claimed 44% and 14% improvements over POMO would confirm or refute the performance advantage.
Figures
read the original abstract
Although Vehicle Routing Problems (VRP) are essential to many real-world systems, they remain computationally intractable at scale due to their combinatorial complexity. Traditional heuristics rely on handcrafted rules for local improvements and occasional \textit{jumps} to escape local minima, but often struggle to generalize across diverse instances. We introduce \textbf{COAgents}, a cooperative multi-agent framework that models the search process as a graph: nodes represent solutions, and edges correspond to either local refinements or large perturbations for diversification (i.e., jumps). A \textit{Partial Search Graph} (PSG) is dynamically constructed during search, enabling COAgents to train a Node Selection Agent and a Move Selection Agent to guide intensification, and a Jump Agent to trigger well-timed explorations of new regions. Unlike end-to-end learning approaches, COAgents cleanly separates problem-agnostic search control from compact domain-specific encoding, facilitating adaptability across tasks. Extensive experiments on the CVRP and VRPTW benchmarks show that COAgents remains competitive with several learn-to-search baselines on CVRP and sets a new state of the art among learning-based methods on the more challenging VRPTW instances, reducing the gap to the best-known solutions by 14\% at $N\!=\!100$ and 44\% at $N\!=\!50$ relative to the strongest neural solver (POMO), and by 21\% and 40\% respectively relative to ALNS. Code is available at https://github.com/mahdims/COAgents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces COAgents, a cooperative multi-agent framework for routing problems that dynamically constructs a Partial Search Graph (PSG) during search. Nodes represent solutions and edges represent local moves or jumps; three specialized agents (Node Selection, Move Selection, and Jump) are trained to guide intensification and diversification. The approach separates problem-agnostic search control from domain encoding. Experiments on CVRP and VRPTW benchmarks claim competitiveness with learn-to-search baselines on CVRP and new state-of-the-art results among learning-based methods on VRPTW, with gap reductions to best-known solutions of 14% (N=100) and 44% (N=50) versus POMO and 21%/40% versus ALNS.
Significance. If the empirical claims hold under matched budgets and proper controls, the clean separation of search control from domain encoding and the multi-agent PSG mechanism could provide a more generalizable alternative to end-to-end neural solvers for combinatorial optimization. The open-source code is a positive contribution that aids reproducibility. However, the significance is currently limited by the absence of detailed experimental protocols needed to confirm that gains arise from the proposed framework rather than unequal computational resources.
major comments (3)
- [Abstract / Experiments] Abstract and experimental results section: The headline SOTA claims (14% gap reduction at N=100 and 44% at N=50 versus POMO on VRPTW) are presented without any reported details on the number of independent runs, statistical significance tests, variance across instances, or hyperparameter tuning protocol. This information is load-bearing for verifying that the reported improvements are robust and not the result of post-hoc selection or favorable random seeds.
- [Method / Experiments] Method and experimental sections: No ablation is described that disables the Jump Agent while retaining the Node/Move Selection agents on the PSG. Without this control, it is impossible to isolate whether the cooperative multi-agent structure (rather than any competent search procedure given extra budget) is responsible for the claimed diversification benefits and gap reductions.
- [Experiments] Experiments section: The manuscript does not report explicit matching of search budgets (total local moves, perturbation frequency, wall-clock time, or function evaluations) against the POMO and ALNS baselines. Because the framework description emphasizes problem-agnostic control, the absence of such controls leaves open the possibility that any search procedure with equivalent extra effort could achieve similar gap closures.
minor comments (2)
- [Method] The notation distinguishing the three agents and the PSG construction steps would benefit from an explicit diagram or pseudocode listing in the method section to improve readability.
- [Abstract / Introduction] A few sentences in the abstract and introduction repeat the same performance numbers; consolidating them would reduce redundancy.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important aspects of experimental rigor that will improve the manuscript. We respond to each major comment below and indicate the planned revisions.
read point-by-point responses
-
Referee: [Abstract / Experiments] Abstract and experimental results section: The headline SOTA claims (14% gap reduction at N=100 and 44% at N=50 versus POMO on VRPTW) are presented without any reported details on the number of independent runs, statistical significance tests, variance across instances, or hyperparameter tuning protocol. This information is load-bearing for verifying that the reported improvements are robust and not the result of post-hoc selection or favorable random seeds.
Authors: We agree that these details are essential for substantiating the claims. In the revised manuscript we will report the number of independent runs, include standard deviations and variance across instances, add statistical significance tests (such as paired t-tests or Wilcoxon tests), and describe the hyperparameter tuning protocol in the Experiments section. revision: yes
-
Referee: [Method / Experiments] Method and experimental sections: No ablation is described that disables the Jump Agent while retaining the Node/Move Selection agents on the PSG. Without this control, it is impossible to isolate whether the cooperative multi-agent structure (rather than any competent search procedure given extra budget) is responsible for the claimed diversification benefits and gap reductions.
Authors: This is a fair request to isolate component contributions. We will add an ablation study in the revised Experiments section in which the Jump Agent is disabled while retaining the Node Selection and Move Selection agents on the PSG, and we will report the resulting performance to demonstrate the Jump Agent's role in diversification. revision: yes
-
Referee: [Experiments] Experiments section: The manuscript does not report explicit matching of search budgets (total local moves, perturbation frequency, wall-clock time, or function evaluations) against the POMO and ALNS baselines. Because the framework description emphasizes problem-agnostic control, the absence of such controls leaves open the possibility that any search procedure with equivalent extra effort could achieve similar gap closures.
Authors: We acknowledge the need for explicit budget matching. In the revision we will report detailed search budgets (local moves, perturbations, wall-clock time) for COAgents and the baselines. Where direct equivalence is not already present we will either adjust the experimental protocol or clearly document the differences and their implications. revision: partial
Circularity Check
No circularity: empirical SOTA claims rest on external baseline comparisons
full rationale
The paper proposes a multi-agent framework (COAgents) that dynamically builds a Partial Search Graph and trains Node/Move Selection and Jump Agents to guide VRP search. Central claims of gap reductions (14%/44% vs POMO, 21%/40% vs ALNS on VRPTW at N=50/100) are presented as outcomes of experiments on standard CVRP/VRPTW benchmarks. These rest on direct performance comparisons to independent external solvers rather than any internal prediction, fitted parameter, or derivation that reduces to the method's own inputs by construction. No equations, uniqueness theorems, or ansatzes are invoked that loop back to self-defined quantities. The separation of problem-agnostic control from domain encoding supplies independent methodological content, rendering the reported results self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Agent network architectures and training hyperparameters
axioms (1)
- domain assumption The solution search process for VRPs can be effectively modeled as a graph with edges for local refinements and large perturbations.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
models the search process as a graph: nodes represent solutions, and edges correspond to either local refinements or large perturbations... Partial Search Graph (PSG) is dynamically constructed
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Node Selection Agent and a Move Selection Agent to guide intensification, and a Jump Agent
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Bresson, X., Laurent, T.: Residual gated graph convnets. arXiv:1711.07553 (2023). https://doi.org/https://doi.org/10.48550/arXiv.1711.07553
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1711.07553 2023
-
[2]
Burke, E.K., Hyde, M.R., Kendall, G., Ochoa, G., Özcan, E., Woodward, J.R.: A classification of hyper-heuristic approaches: revisited. Handbook of metaheuristics pp. 453–477 (2019)
work page 2019
-
[3]
Advances in neural information processing systems32(2019)
Chen, X., Tian, Y.: Learning to perform local rewriting for combinatorial opti- mization. Advances in neural information processing systems32(2019)
work page 2019
-
[4]
Advances in Neural Information Processing Systems35, 8760–8772 (2022)
Choo, J., Kwon, Y.D., Kim, J., Jae, J., Hottung, A., Tierney, K., Gwon, Y.: Simulation-guided beam search for neural combinatorial optimization. Advances in Neural Information Processing Systems35, 8760–8772 (2022)
work page 2022
-
[5]
Rethinking Attention with Performers
Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., Belanger, D., Colwell, L., Weller, A.:Rethinkingattention withperformers. In:InternationalConferenceon Learning Representations (ICLR) (2021), https://arxiv.org/abs/2009.14794
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[6]
In: Proceedings of the 15th International Joint Conference on Artifical Intelligence - Volume 1
Denzinger, J., Fuchs, M., Fuchs, M.: High performance ATP systems by combining several AI methods. In: Proceedings of the 15th International Joint Conference on Artifical Intelligence - Volume 1. p. 102–107. IJCAI’97, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1997)
work page 1997
-
[7]
European Journal of Operational Research285(2), 405–428 (2020)
Drake,J.H.,Kheiri,A.,Özcan,E.,Burke,E.K.:Recentadvancesinselectionhyper- heuristics. European Journal of Operational Research285(2), 405–428 (2020)
work page 2020
-
[8]
Roskilde: Roskilde University12, 966–980 (2017)
Helsgaun, K.: An extension of the lin-kernighan-helsgaun tsp solver for constrained traveling salesman and vehicle routing problems. Roskilde: Roskilde University12, 966–980 (2017)
work page 2017
-
[9]
In: International Conference on Learning Representations (2021)
Hottung, A., Bhandari, B., Tierney, K.: Learning a latent search space for routing problems using variational autoencoders. In: International Conference on Learning Representations (2021)
work page 2021
-
[10]
International Conference on Learning Representations (2021)
Hottung, A., Kwon, Y.D., Tierney, K.: Efficient active search for combinatorial op- timization problems. International Conference on Learning Representations (2021)
work page 2021
-
[11]
Hottung, A., Tierney, K.: Neural large neighborhood search for the capaci- tated vehicle routing problem. In: Giacomo, G.D., et al. (eds.) ECAI 2020 Title Suppressed Due to Excessive Length 15 – 24th European Conference on Artificial Intelligence, Frontiers in Artifi- cial Intelligence and Applications, vol. 325, pp. 443–450. IOS Press (2020). https://doi...
-
[12]
Artificial Intelligence313, 103786 (2022)
Hottung, A., Tierney, K.: Neural large neighborhood search for routing problems. Artificial Intelligence313, 103786 (2022)
work page 2022
-
[13]
Computers & Opera- tions Research172, 106791 (2024)
Johnn,S.N.,Darvariu,V.A.,Handl,J.,Kalcsics,J.:Agraphreinforcementlearning framework for neural adaptive large neighbourhood search. Computers & Opera- tions Research172, 106791 (2024)
work page 2024
-
[14]
European Journal of Operational Research309(1), 446–468 (2023)
Kallestad, J., Hasibi, R., Hemmati, A., Sörensen, K.: A general deep reinforcement learning hyperheuristic framework for solving combinatorial optimization prob- lems. European Journal of Operational Research309(1), 446–468 (2023)
work page 2023
-
[15]
Advances in neural information processing sys- tems30(2017)
Khalil, E., Dai, H., Zhang, Y., Dilkina, B., Song, L.: Learning combinatorial opti- mization algorithms over graphs. Advances in neural information processing sys- tems30(2017)
work page 2017
-
[16]
In: The Eleventh International Conference on Learning Repre- sentations (2023)
Kim, M., Park, J., Park, J.: Learning to cross exchange to solve min-max vehicle routing problems. In: The Eleventh International Conference on Learning Repre- sentations (2023)
work page 2023
-
[17]
Advances in Neural Information Processing Systems34, 10418–10430 (2021)
Kim, M., Park, J., et al.: Learning collaborative policies to solve np-hard routing problems. Advances in Neural Information Processing Systems34, 10418–10430 (2021)
work page 2021
-
[18]
Advances in Neural Information Processing Systems35, 1936–1949 (2022)
Kim, M., Park, J., Park, J.: Sym-nco: Leveraging symmetricity for neural com- binatorial optimization. Advances in Neural Information Processing Systems35, 1936–1949 (2022)
work page 1936
-
[19]
Kool, W., van Hoof, H., Gromicho, J., Welling, M.: Deep policy dynamic program- ming for vehicle routing problems. In: International conference on integration of constraint programming, artificial intelligence, and operations research. pp. 190–
-
[20]
Advances in Neural Information Processing Systems33, 21188–21198 (2020)
Kwon, Y.D., Choo, J., Kim, B., Yoon, I., Gwon, Y., Min, S.: Pomo: Policy op- timization with multiple optima for reinforcement learning. Advances in Neural Information Processing Systems33, 21188–21198 (2020)
work page 2020
-
[21]
European Journal of Operational Research312(1), 70–91 (2024)
Lagos, F., Pereira, J.: Multi-armed bandit-based hyper-heuristics for combinatorial optimization problems. European Journal of Operational Research312(1), 70–91 (2024)
work page 2024
-
[22]
Journal of Machine Learning Research 23(54), 1–9 (2022), http://jmlr.org/papers/v23/21-0888.html
Lindauer, M., Eggensperger, K., Feurer, M., Biedenkapp, A., Deng, D., Benjamins, C., Ruhkopf, T., Sass, R., Hutter, F.: Smac3: A versatile bayesian optimization package for hyperparameter optimization. Journal of Machine Learning Research 23(54), 1–9 (2022), http://jmlr.org/papers/v23/21-0888.html
work page 2022
-
[23]
In: International conference on learning representations (2019)
Lu, H., Zhang, X., Yang, S.: A learning-based iterative method for solving vehicle routing problems. In: International conference on learning representations (2019)
work page 2019
-
[24]
Advances in Neural Information Processing Systems34, 11096–11107 (2021)
Ma, Y., Li, J., Cao, Z., Song, W., Zhang, L., Chen, Z., Tang, J.: Learning to itera- tively solve routing problems with dual-aspect collaborative transformer. Advances in Neural Information Processing Systems34, 11096–11107 (2021)
work page 2021
-
[25]
Advances in Neural Information Processing Systems36, 49555–49578 (2023)
Ma, Y., Cao, Z., Chee, Y.M.: Learning to search feasible and infeasible regions of routing problems with flexible neural k-opt. Advances in Neural Information Processing Systems36, 49555–49578 (2023)
work page 2023
-
[26]
European Journal of Operational Research324(2), 522–537 (2025)
Mostajabdaveh,M.,Salman,F.S.,Gutjahr,W.J.:Abranch-and-pricealgorithmfor fast and equitable last-mile relief aid distribution. European Journal of Operational Research324(2), 522–537 (2025)
work page 2025
-
[27]
Petropoulos, F., Laporte, G., Aktas, E., Alumur, S.A., Archetti, C., Ayhan, H., Battarra, M., Bennell, J.A., Bourjolly, J.M., Boylan, J.E., Breton, M., Canca, 16 O. Yakovenko et al. D., Charlin, L., Chen, B., Cicek, C.T., Jr, L.A.C., Currie, C.S., Demeulemeester, E., Ding, L., Disney, S.M., Ehrgott, M., Eppler, M.J., Erdoğan, G., Fortz, B., Franco, L.A., ...
work page 2024
-
[28]
Recipe for a General, Powerful, Scalable Graph Transformer
Rampášek, L., Galkin, M., Dwivedi, V.P., Luu, A.T., Wolf, G., Beaini, D.: Recipe for a general, powerful, scalable graph transformer. In: Advances in Neural Information Processing Systems. vol. 35, pp. 14501–14515 (2022), https://arxiv.org/abs/2205.12454
-
[29]
In: Proceedings of the 36th International Conference on Neural Information Processing Systems
Rampášek, L., Galkin, M., Dwivedi, V.P., Luu, A.T., Wolf, G., Beaini, D.: Recipe for a general, powerful, scalable graph transformer. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. NIPS ’22, Curran Associates Inc., Red Hook, NY, USA (2024)
work page 2024
-
[30]
In: Proceedings of the International Conference on Automated Planning and Scheduling
Reijnen, R., Zhang, Y., Lau, H.C., Bukhsh, Z.: Online control of adaptive large neighborhood search using deep reinforcement learning. In: Proceedings of the International Conference on Automated Planning and Scheduling. vol. 34, pp. 475– 483 (2024)
work page 2024
-
[31]
European Journal of Operational Research260(3), 972–983 (2017)
Soria-Alcaraz, J.A., Ochoa, G., Sotelo-Figeroa, M.A., Burke, E.K.: A method- ology for determining an effective subset of heuristics in selection hyper- heuristics. European Journal of Operational Research260(3), 972–983 (2017). https://doi.org/https://doi.org/10.1016/j.ejor.2017.01.042
-
[32]
Computers & Operations Research140, 105643 (2022)
Vidal, T.: Hybrid genetic search for the cvrp: Open-source implementa- tion and swap* neighborhood. Computers & Operations Research140, 105643 (2022). https://doi.org/https://doi.org/10.1016/j.cor.2021.105643, https://www.sciencedirect.com/science/article/pii/S030505482100349X
-
[33]
Computers & operations research40(1), 475–489 (2013)
Vidal, T., Crainic, T.G., Gendreau, M., Prins, C.: A hybrid genetic algorithm with adaptive diversity management for a large class of vehicle routing problems with time-windows. Computers & operations research40(1), 475–489 (2013)
work page 2013
-
[34]
Linformer: Self-Attention with Linear Complexity
Wang, S., Li, B.Z., Khabsa, M., Fang, H., Ma, H.: Linformer: Self- attention with linear complexity. arXiv preprint arXiv:2006.04768 (2020), https://arxiv.org/abs/2006.04768
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[35]
INFORMS Journal on Computing36(4), 943–955 (2024)
Wouda, N.A., Lan, L., Kool, W.: PyVRP: a high-performance VRP solver package. INFORMS Journal on Computing36(4), 943–955 (2024). https://doi.org/10.1287/ijoc.2023.0055, https://doi.org/10.1287/ijoc.2023.0055
-
[36]
IEEE transactions on neural networks and learning systems33(9), 5057–5069 (2021)
Wu, Y., Song, W., Cao, Z., Zhang, J., Lim, A.: Learning improvement heuristics for solving routing problems. IEEE transactions on neural networks and learning systems33(9), 5057–5069 (2021)
work page 2021
-
[37]
Ye, H., Wang, J., Cao, Z., Berto, F., Hua, C., Kim, H., Park, J., Song, G.: ReEvo: Largelanguagemodelsashyper-heuristicswithreflectiveevolution.In:Proceedings of the 38th Conference on Neural Information Processing (NeurIPS 2024). pp. 10–
work page 2024
-
[38]
Vancouver, Canada (2024) Title Suppressed Due to Excessive Length 17
work page 2024
-
[39]
Zhou, J., Cao, Z., Wu, Y., Song, W., Ma, Y., Zhang, J., Xu, C.: Mvmoe: Multi- task vehicle routing solver with mixture-of-experts. In: International Conference on Machine Learning (2024) 18 O. Yakovenko et al. A Problem-Specific Feature Definitions Below we summarize the global and local features used for each problem domain in our experiments. Vehicle Ro...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.