Ready from Day 1: Population-Aware Coordination for Large-Scale Constrained Multi-Agent Systems

Alvaro Maggiar; Angel Wang; Carson Eisenach; Dean Foster; Dominique Perrault-Joncas

arxiv: 2605.13900 · v2 · pith:GGWLA2XCnew · submitted 2026-05-12 · 💻 cs.MA · cs.LG

Pith reviewed 2026-05-20 21:37 UTC · model grok-4.3

classification 💻 cs.MA cs.LG

keywords multi-agent coordination · population-aware interfaces · constrained optimization · Lagrangian relaxation · supply chain planning · response maps · composition shift · large-scale systems

The pith

Learned primal and dual maps conditioned on compact population summaries let planners coordinate large evolving multi-agent populations without retraining each cycle.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes population-aware coordination interfaces for large-scale multi-agent systems that share resource constraints. An upstream planner uses these learned maps inside its iterative planning loop to predict aggregate utilization from a proposed cost signal or to find the cost trajectory that achieves a target plan. Because the maps are conditioned on compact summaries of population structure, they generalize across changes in who is participating without needing to be rebuilt for each new composition. In a supply-chain capacity-control study the approach cuts forecast error by 16-19 percent and capacity violations by 20-51 percent compared with maps that ignore population composition, while also allowing accurate coordination of 500-thousand-agent populations from 20-thousand-agent subsamples.

Core claim

By encoding response-relevant population structure into learned primal and dual maps, the interfaces remain reliable across evolving populations without per-cycle retraining and support coordination of large populations from compact subsamples; in the supply-chain case study these maps reduce forecast error by 16-19% and capacity violations by 20-51% relative to population-unaware baselines under composition shift, and simulator-trained maps reach 11.1% MAPE on real observations.

What carries the argument

Population-aware coordination interfaces: learned primal and dual maps that are conditioned on compact population summaries and queried inside the planner's iterative loop to predict aggregate utilization or required cost trajectories.
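The query pattern described above can be sketched as a planner that repeatedly calls a primal map while searching for a cost that meets a capacity target. Everything named here is hypothetical: `toy_primal_map` is a closed-form stand-in for a learned map, the two-number summary is invented, and the scalar bisection is only one simple search a planner might run, not the paper's procedure.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for a learned primal map:
# (broadcast cost, population summary) -> aggregate utilization.
def toy_primal_map(cost: float, summary: Sequence[float]) -> float:
    size, mean_sensitivity = summary
    # Utilization falls as cost rises; sensitive populations fall faster.
    return size / (1.0 + mean_sensitivity * cost)

def find_cost_for_target(
    primal_map: Callable[[float, Sequence[float]], float],
    summary: Sequence[float],
    target: float,
    lo: float = 0.0,
    hi: float = 100.0,
    iters: int = 60,
) -> float:
    """Bisection on a scalar cost, assuming utilization decreases in cost."""
    if primal_map(lo, summary) <= target:
        return lo  # constraint is slack, so no cost is needed
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if primal_map(mid, summary) > target:
            lo = mid  # still over capacity: raise the cost
        else:
            hi = mid
    return hi

# Made-up summaries: (population size, mean cost sensitivity).
summary_a = (1000.0, 0.5)
summary_b = (1000.0, 2.0)  # same size, more cost-sensitive composition
cost_a = find_cost_for_target(toy_primal_map, summary_a, target=400.0)
cost_b = find_cost_for_target(toy_primal_map, summary_b, target=400.0)
```

The same map serves both compositions because the summary is an input rather than something baked into the weights. In this toy, the more sensitive population needs a lower cost to reach the same target.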

If this is right

The maps support accurate coordination of 500K-agent populations from 20K-agent subsamples, with accuracy close to full-population inference.
Simulator-trained primal maps achieve 11.1% MAPE on real observations, outperforming baselines that reach 13-24%.
No per-cycle retraining is required when population composition changes between planning cycles.
Capacity violations drop by 20-51% under composition shift compared with population-unaware methods.
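For readers checking figures like these, a minimal sketch of the two quantities involved is given below. It assumes the standard definition of MAPE and a plain relative reduction; this page does not say how the paper aggregates errors across periods or scenarios, and the numbers in the example are made up.

```python
from typing import Sequence

def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)

def relative_reduction(baseline: float, improved: float) -> float:
    """Percent reduction of `improved` relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Made-up values, only to exercise the functions.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]
error = mape(actual, forecast)
gain = relative_reduction(10.0, 8.0)
```

A relative reduction compares two error levels, so a drop from 10% to 8% MAPE is a 20% reduction, not a 2% one.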

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same conditioning idea could be applied to other iterative planners that must adapt to changing participant sets, such as traffic signal control or energy demand response.
Compact summaries might also serve as a privacy mechanism by letting the planner work with aggregate descriptors rather than individual agent data.
If the summaries can be updated incrementally, the interfaces could support continuous online replanning as agents arrive or depart.

Load-bearing premise

Compact population summaries contain enough information to capture the structure that determines how the population responds to cost signals, so the maps generalize to new compositions without retraining.

What would settle it

Shift the population composition in a way the chosen summaries do not capture. If the forecast error of the conditioned maps then stays as high as that of the unconditioned baselines, the claim does not hold.

Figures

Figures reproduced from arXiv: 2605.13900 by Alvaro Maggiar, Angel Wang, Carson Eisenach, Dean Foster, Dominique Perrault-Joncas.

**Figure 1.** Population-aware forecaster architectures. (a) Population-Embedding (per-Agent) Aggregate: agent embeddings $e^i_t = f(x^i_t)$ are pooled via attention and then decoded. (b) Population-Embedding (Bucketized) Aggregate: within-bucket attention is followed by cross-bucket attention before decoding. The population summary is passed to the selected decoder head: DecP (primal) or DecD (dual). Agents are partition…
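A minimal sketch of attention pooling over agent embeddings, in the spirit of panel (a), is shown below. It assumes a single query vector and scaled dot-product scores; the paper's actual encoder, query, and decoder are not specified on this page, and the embeddings here are made up. The input is assumed non-empty.

```python
import math
from typing import List, Sequence

def attention_pool(embeddings: Sequence[Sequence[float]], query: Sequence[float]) -> List[float]:
    """Softmax-weighted average of agent embeddings, scored against a query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * e for q, e in zip(query, emb)) / scale for emb in embeddings]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * emb[d] for w, emb in zip(weights, embeddings)) / total
        for d in range(len(query))
    ]

agents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up agent embeddings
mean_pooled = attention_pool(agents, [0.0, 0.0])  # zero query gives uniform weights
focused = attention_pool(agents, [4.0, 0.0])      # favors a large first coordinate
```

Because the result is a weighted average, it does not depend on the order of the agents and has the same size for any population size, which is what lets a summary of this kind be fed to a fixed decoder.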

**Figure 2.** (a) Distribution of agent-level cost sensitivity across 500K agents, showing a right-skewed tail of highly responsive agents. (b) Population composition under α-shifted demand-decile mixtures in the supply chain setting: positive α upweights high-demand products, while negative α upweights low-demand products. Section 4, Empirical Evaluation: We evaluate population-aware coordination interfaces along four dimensions…

**Figure 3.** Figure 3 reports results for both interface types. The left panel evaluates primal forecast accuracy, and the right panel evaluates dual control quality using mean violation on near-limit periods. Additional shift results and the remaining dual metrics are provided in Appendix G. Population-aware interfaces are substantially more robust under composition shift than population-unaware baselines. In the primal settin…

**Figure 4.** Figure 4 shows that performance saturates once the source cohort contains approximately 20K agents. For primal prediction, accuracy at this cohort size is close to full-population inference across target population sizes. For dual control, cost trajectories inferred from 20K-agent cohorts remain effective when applied to substantially larger target populations. These results show that population-aware interfaces ca…
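One simple way to relate a cohort's aggregate to a larger population is a Horvitz-Thompson style expansion under simple random sampling, sketched below. This is an assumption made for illustration: the page does not describe how the paper maps cohort-level outputs to the target population, and the synthetic data and the smaller sizes (2,000 of 50,000) are chosen only to keep the example fast.

```python
import random
from typing import Sequence

def expand_total(sample_values: Sequence[float], population_size: int) -> float:
    """Estimate a population total from a simple random sample.

    Each sampled unit has inclusion probability n / N, so it is weighted by N / n.
    """
    n = len(sample_values)
    if n == 0:
        raise ValueError("sample must be non-empty")
    return (population_size / n) * sum(sample_values)

rng = random.Random(0)
# Synthetic per-agent utilization with a right-skewed distribution.
population = [rng.expovariate(1.0) for _ in range(50_000)]
cohort = rng.sample(population, 2_000)
estimate = expand_total(cohort, len(population))
true_total = sum(population)
relative_error = abs(estimate - true_total) / true_total
```

The estimate's relative error shrinks roughly with the square root of the cohort size, which is one intuition for why accuracy can saturate well below the full population.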

**Figure 5.** Example capacity target trajectories generated by the wavelet sampler using a truncated Haar wavelet basis. Given a sampled target $G^{(n)}_{0:T}$, the trained dual coordinator is applied step by step to produce the episode-level cost trajectory $\lambda^{(n)}_{0:T}$; at each step $t$, $\lambda^{(n)}_{t:t+L} = \phi_\theta(x^{(n)}_t, S^{(n)}_t, G^{(n)}_{t:t+L})$. The simulator is then rolled out under $\lambda^{(n)}_{0:T}$, applying the broadcast costs to the fix…
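A hedged sketch of sampling a target trajectory from a truncated Haar expansion follows. The Gaussian coefficients, the per-level shrinkage, and every parameter name are assumptions for illustration; the paper's sampler distribution is not given on this page.

```python
import random
from typing import List

def sample_haar_path(length: int, levels: int, base: float, scale: float,
                     rng: random.Random) -> List[float]:
    """Sample a piecewise-constant path from a truncated Haar expansion.

    `length` must be a power of two. Only the `levels` coarsest detail levels
    get random coefficients; finer levels are zero, which is the truncation.
    """
    if length < 1 or length & (length - 1):
        raise ValueError("length must be a power of two")
    path = [base]
    level = 0
    while len(path) < length:
        refined = []
        for value in path:
            # Details shrink with depth so coarse structure dominates (assumed choice).
            detail = rng.gauss(0.0, scale / (2 ** level)) if level < levels else 0.0
            # Splitting into (value + d, value - d) preserves the local mean.
            refined.extend([value + detail, value - detail])
        path = refined
        level += 1
    return path

rng = random.Random(0)
target = sample_haar_path(length=16, levels=2, base=100.0, scale=10.0, rng=rng)
```

With two random levels and sixteen steps, the path consists of four flat segments of four steps each, and its mean equals `base`.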

**Figure 6.** Standardized OLS coefficients relating observable product attributes to estimated product

**Figure 7.** Composition of target populations under α-shifted distributions, measured as the expected sampling mass assigned to each decile of the bucketization attribute. Positive α shifts mass toward higher-value segments; negative α shifts mass toward lower-value segments. Axes: shift parameter α (−0.5 to 0.5) against population share (0 to 100%). Legend: light = lower demand, dark = higher de…
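To make the idea of an α-shifted decile mixture concrete, the sketch below uses a linear tilt of the sampling mass. The tilt formula is an assumption chosen for simplicity and is not the paper's reweighting, which this page does not state; it only reproduces the qualitative behavior in the caption.

```python
from typing import List

def shifted_decile_weights(alpha: float, n_buckets: int = 10) -> List[float]:
    """Sampling mass per bucket under an assumed linear tilt.

    alpha = 0 gives a uniform mixture; alpha > 0 moves mass toward higher buckets.
    Requires |alpha| < 1 so that every weight stays positive.
    """
    if not -1.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between -1 and 1")
    # Bucket positions spread evenly over [-1, 1].
    positions = [2.0 * k / (n_buckets - 1) - 1.0 for k in range(n_buckets)]
    raw = [1.0 + alpha * p for p in positions]
    total = sum(raw)
    return [r / total for r in raw]

baseline = shifted_decile_weights(0.0)
head_heavy = shifted_decile_weights(0.5)   # more mass on high deciles
tail_heavy = shifted_decile_weights(-0.5)  # more mass on low deciles
```

Under this tilt, α = 0.5 gives the top decile three times the mass of the bottom decile, while the weights still sum to one.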

**Figure 8.** Realized product-count share in each decile under α-shifted population sampling, for demand (left) and unit economics (right). The plots show how reweighting by demand or unit economics changes the product composition of the sampled population. Effect on Evaluation Populations: In our population-shift evaluation (Section 4.1), we consider values of α ranging from −0.5 to 0.5. To illustrate the effect of the…

**Figure 9.** Product distribution within evaluation populations for α = 0 (left, baseline) and α = 0.2 (right, shifted), illustrating the reweighting of population composition toward higher-value segments as α increases. … learning a cost-conditioned response map would provide little value. We therefore compare each cost-conditioned primal forecaster against an unconstrained variant that does not receive $\lambda_{t:t+L}$ as input.…

**Figure 10.** Figure 10 shows that Population-Embedding models maintain slopes closer to 1 across most shifts, typically in the 90–100% range. In contrast, the Bottom-Up and Global Aggregate models exhibit larger calibration deviations under extreme shifts, consistent with the accuracy degradation observed in Section 4.1. Axes: alpha value (−0.4 to 0.4; more tail products ← → more head products) against a percentage scale (75% to 105%, label truncated: Mu…)

**Figure 11.** Aggregate inbound MAPE across unit-economics population shifts. Error bars show 95% confidence intervals across sampled capacity scenarios. Following section heading: Dual-Control Violations across Population Shifts.

**Figure 12.** Violation metrics for the dual coordination interface across α-shifted population distributions. Each point corresponds to one sampled capacity scenario; lower violation indicates better capacity adherence.

Original abstract

In large-scale multi-agent systems with shared resource constraints, an upstream planner must iteratively evaluate candidate resource plans (assessing feasibility, aggregate response, and marginal cost) before committing to one. Lagrangian relaxation separates local decisions through a broadcast cost signal, but the planner still needs the cost-to-utilization response map to explore plan space, and this map depends on population composition that changes across planning cycles. We propose population-aware coordination interfaces: learned primal and dual maps, conditioned on compact population summaries, that the planner queries inside its iterative loop. The primal map predicts aggregate utilization under a proposed cost trajectory; the dual map predicts the cost trajectory for a target plan. By encoding response-relevant population structure, these maps remain reliable across evolving populations without per-cycle retraining, and support coordination of large populations from compact subsamples. We additionally cast Sim2Real transfer as a backtestable procedure, enabling evaluation before deployment. In a supply-chain capacity-control case study, population-aware interfaces reduce forecast error by 16-19% and capacity violations by 20-51% relative to population-unaware baselines under composition shift; 20K-agent cohorts support accurate coordination of 500K-agent populations; and simulator-trained primal maps achieve 11.1% MAPE on real observations versus 13-24% for baselines.
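The roles of the two maps can be read off the textbook single-constraint Lagrangian relaxation, sketched below. The symbols $r_i$ (agent reward), $u_i$ (agent resource use), $a_i$ (agent decision), and the scalar capacity $G$ are notation assumed for this sketch; the paper itself works with cost trajectories over a horizon, and its formulation is not reproduced on this page.

```latex
\begin{aligned}
&\max_{a_1,\dots,a_N}\ \sum_{i=1}^{N} r_i(a_i)
  \quad\text{s.t.}\quad \sum_{i=1}^{N} u_i(a_i) \le G, \\
&\mathcal{L}(a,\lambda) = \sum_{i=1}^{N}\bigl[r_i(a_i) - \lambda\,u_i(a_i)\bigr] + \lambda G,
  \qquad \lambda \ge 0, \\
&a_i^\star(\lambda) \in \arg\max_{a_i}\ r_i(a_i) - \lambda\,u_i(a_i), \\
&U(\lambda) = \sum_{i=1}^{N} u_i\bigl(a_i^\star(\lambda)\bigr).
\end{aligned}
```

For a fixed $\lambda$ the Lagrangian splits into one independent problem per agent, which is why a single broadcast cost suffices. Under this reading, a primal map approximates the aggregate response $U(\lambda)$, and a dual map approximates the reverse direction, from a target $G$ to a cost $\lambda$ with $U(\lambda) \le G$. Both depend on which agents are in the sum, hence the conditioning on population summaries.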

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adds population summaries to condition learned primal and dual maps so Lagrangian planners can handle shifting agent mixes without retraining each cycle, and the supply-chain numbers look usable, but the claim that those summaries are sufficient statistics needs direct checks.

The core move is to learn primal and dual maps that take a compact population summary as extra input. The planner can then query the primal map for expected total utilization under a proposed cost signal or the dual map for the cost signal that would produce a target aggregate plan. Because the maps see the summary, they stay usable when the population composition changes instead of requiring a full retrain inside every planning loop. That is the practical interface they are selling for large-scale constrained multi-agent problems that use Lagrangian relaxation.

Referee Report

2 major / 1 minor

Summary. The paper proposes population-aware coordination interfaces for large-scale constrained multi-agent systems: learned primal maps that predict aggregate utilization from a proposed cost trajectory and dual maps that predict the cost trajectory for a target plan, both conditioned on compact population summaries. These interfaces are intended to allow an upstream planner to explore resource plans iteratively without retraining when population composition changes. The approach is evaluated in a supply-chain capacity-control case study, where it reports 16-19% lower forecast error and 20-51% fewer capacity violations than population-unaware baselines under composition shift, accurate coordination of 500K-agent populations from 20K-agent subsamples, and 11.1% MAPE on real observations for simulator-trained maps.

Significance. If the generalization claims hold, the work could meaningfully improve scalability of Lagrangian-relaxation-based coordination in dynamic MAS by eliminating per-cycle retraining and supporting planning from compact subsamples. The framing of Sim2Real transfer as a backtestable procedure is a constructive practical contribution.

major comments (2)

Abstract: the central empirical claims rest on concrete percentage improvements (16-19% forecast error, 20-51% capacity violations) yet the abstract supplies no description of the population-summary features, model architecture, training/validation splits, or statistical significance tests. Without these, the reported gains under composition shift cannot be independently verified and the generalization guarantee remains unassessable.
Abstract (paragraph on population-aware coordination interfaces): the modeling assumption that a low-dimensional population summary is a sufficient statistic for the cost-to-utilization response map is load-bearing for the claim of reliable generalization without retraining. No supporting analysis (ablation on summary dimension, mutual-information bounds, or checks for omitted higher-order interactions) is referenced, leaving the skeptic's concern about cross-agent correlations unaddressed.

minor comments (1)

Abstract: the phrase 'population-aware coordination interfaces' is introduced as a new term but is not immediately linked to a formal definition or section where the primal/dual maps are mathematically specified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We respond to each major comment below and describe the revisions we will make to strengthen the manuscript.

Referee: Abstract: the central empirical claims rest on concrete percentage improvements (16-19% forecast error, 20-51% capacity violations) yet the abstract supplies no description of the population-summary features, model architecture, training/validation splits, or statistical significance tests. Without these, the reported gains under composition shift cannot be independently verified and the generalization guarantee remains unassessable.

Authors: We agree that the abstract would benefit from additional context to support independent assessment of the claims. In the revised version we will expand the abstract with a concise description of the population-summary features (low-order moments of agent attributes), the neural architectures for the primal and dual maps, the training/validation splits used in the case study, and a statement that the reported improvements are statistically significant across repeated trials. Full implementation and experimental details will remain in the methods and results sections.

Revision: yes

Referee: Abstract (paragraph on population-aware coordination interfaces): the modeling assumption that a low-dimensional population summary is a sufficient statistic for the cost-to-utilization response map is load-bearing for the claim of reliable generalization without retraining. No supporting analysis (ablation on summary dimension, mutual-information bounds, or checks for omitted higher-order interactions) is referenced, leaving the skeptic's concern about cross-agent correlations unaddressed.

Authors: The empirical generalization results across composition shifts in the supply-chain experiments provide practical support for the utility of the chosen summaries. We acknowledge, however, that explicit ablations on summary dimension and information-theoretic analysis are absent from the current manuscript. We will add an ablation study that varies the dimensionality of the population summary and reports its effect on forecast error and violation rates. Mutual-information bounds and exhaustive checks for higher-order interactions would require additional theoretical development beyond the scope of the present work; the planned ablation will nevertheless directly address sensitivity to summary richness.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation against external baselines

The abstract and described claims present learned primal/dual maps conditioned on population summaries as a modeling choice, with reported performance gains (16-19% forecast error reduction, 20-51% fewer violations) measured against population-unaware baselines in a supply-chain case study. No derivation step reduces a prediction to its own fitted inputs by construction, invokes a self-citation as the sole justification for a uniqueness theorem, or renames an empirical pattern as a derived result. The sufficiency of compact summaries is stated as an assumption that is then tested via generalization metrics on evolving populations and Sim2Real backtesting, rather than being tautological. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Review performed on abstract only; full specification of learned-map training, summary construction, and any regularization choices is unavailable, so the ledger reflects only the high-level premises stated in the abstract.

free parameters (1)

population summary dimension and features
Compact summaries are asserted to encode response-relevant structure, yet the abstract gives no explicit count or selection procedure for these features.

axioms (1)

Domain assumption: learned maps conditioned on population summaries can accurately predict aggregate utilization and required cost trajectories across composition shifts.
This premise underpins the claim that the interfaces remain reliable without per-cycle retraining.

invented entities (1)

population-aware coordination interfaces (no independent evidence)
purpose: Provide reusable primal and dual maps that the planner queries inside its iterative loop.
New construct introduced to decouple the response map from changing population composition.

pith-pipeline@v0.9.0 · 5783 in / 1505 out tokens · 109922 ms · 2026-05-20T21:37:19.197262+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "population-aware coordination interfaces: learned primal and dual maps, conditioned on compact population summaries... By encoding response-relevant population structure, these maps remain reliable across evolving populations without per-cycle retraining"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "The primal map predicts aggregate utilization under a proposed cost trajectory; the dual map predicts the cost trajectory for a target plan."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

[1] CACHON, G. P. (2003). Supply chain coordination with contracts. In Handbooks in Operations Research and Management Science, vol. 11. Elsevier, 227–339.

[2] FEDERGRUEN, A. and ZIPKIN, P. H. (1999). Coordination mechanisms for a distribution system with one supplier and multiple retailers. Management Science 45 1493–1507.

[3] BOYD, S., PARIKH, N., CHU, E., PELEATO, B. and ECKSTEIN, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3 1–122.

[4] FISHER, M. L. (1981). The Lagrangian relaxation method for solving integer programming problems. Management Science 27 1–18.

[5] LOWE, R., WU, Y., TAMAR, A., HARB, J., ABBEEL, P. and MORDATCH, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, vol. 30.

[6] OLIEHOEK, F. A. and AMATO, C. (2016). A Concise Introduction to Decentralized POMDPs. Springer.

[7] YANG, Y., LUO, R., LI, M., ZHOU, M., ZHANG, W. and WANG, J. (2018). Mean field multi-agent reinforcement learning. In International Conference on Machine Learning. PMLR.

[8] BOYD, S. and VANDENBERGHE, L. (2004). Convex Optimization. pt. 1, Cambridge University Press.

[9] MAYNE, D. Q., RAWLINGS, J. B., RAO, C. V. and SCOKAERT, P. O. (2000). Constrained model predictive control: Stability and optimality. Automatica 36 789–814.

[10] GARCÍA, C. E., PRETT, D. M. and MORARI, M. (1989). Model predictive control: Theory and practice - A survey. Automatica 25 335–348.

[11] CAMACHO, E. and BORDONS, C. (2004). Model Predictive Control. Advanced Textbooks in Control and Signal Processing, Springer London.

[12] EISENACH, C., GHAI, U., MADEKA, D., TORKKOLA, K., FOSTER, D. and KAKADE, S. (2024). Neural coordination and capacity control for inventory management. arXiv:2410.02817.

[13] MADEKA, D., TORKKOLA, K., EISENACH, C., LUO, A., FOSTER, D. and KAKADE, S. (2022). Deep inventory management. arXiv:2210.03137.

[14] SINCLAIR, S. R., VIEIRA FRUJERI, F., CHENG, C.-A., MARSHALL, L., BARBALHO, H. D. O., LI, J., NEVILLE, J., MENACHE, I. and SWAMINATHAN, A. (2023). Hindsight learning for MDPs with exogenous inputs. In Proceedings of the 40th International Conference on Machine Learning, vol. 202 of Proceedings of Machine Learning Research. PMLR.

[15] ANDAZ, S., EISENACH, C., MADEKA, D., TORKKOLA, K., JIA, R., FOSTER, D. and KAKADE, S. (2023). Learning an inventory control policy with general inventory arrival dynamics. arXiv:2310.17168.

[16] MAGGIAR, A., DICKER, L. and MAHONEY, M. W. (2024). Consensus Planning with Primal, Dual, and Proximal Agents. arXiv:2408.16462.

[17] SÄRNDAL, C.-E., SWENSSON, B. and WRETMAN, J. (2003). Model Assisted Survey Sampling. Springer Science & Business Media.

[18] HORVITZ, D. G. and THOMPSON, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47 663–685.

[19] RASHID, T., SAMVELYAN, M., SCHROEDER, C., FARQUHAR, G., FOERSTER, J. and WHITESON, S. (2018). QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning. PMLR.

[20] MOUSA, M., VAN DE BERG, D., KOTECHA, N., DEL RIO-CHANONA, E. A. and MOWBRAY, M. (2024). An analysis of multi-agent reinforcement learning for decentralized inventory control systems. Computers & Chemical Engineering 187 108783.

[21] BERTSEKAS, D. P. (1999). Nonlinear Programming. Athena Scientific.

[22] GIJSBRECHTS, J., BOUTE, R. N., VAN MIEGHEM, J. A. and ZHANG, D. J. (2022). Can deep reinforcement learning improve inventory management? Performance on lost sales, dual-sourcing, and multi-echelon problems. Manufacturing & Service Operations Management 24 1349–1368.

[23] HYNDMAN, R. J., AHMED, R. A., ATHANASOPOULOS, G. and SHANG, H. L. (2011). Optimal combination forecasts for hierarchical time series. Computational Statistics & Data Analysis 55 2579–2589.

[24] WICKRAMASURIYA, S. L., ATHANASOPOULOS, G. and HYNDMAN, R. J. (2019). Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization. Journal of the American Statistical Association 114 804–819.

[25] ZAHEER, M., KOTTUR, S., RAVANBAKHSH, S., POCZOS, B., SALAKHUTDINOV, R. and SMOLA, A. (2017). Deep sets. In Advances in Neural Information Processing Systems, vol. 30.

[26] VASWANI, A., SHAZEER, N., PARMAR, N., USZKOREIT, J., JONES, L., GOMEZ, A. N., KAISER, L. and POLOSUKHIN, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, vol. 30.

[27] SHIMODAIRA, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90 227–244.

[28] BEN-DAVID, S., BLITZER, J., CRAMMER, K. and PEREIRA, F. (2006). Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, vol. 19.

[29] QUINONERO-CANDELA, J., SUGIYAMA, M., SCHWAIGHOFER, A. and LAWRENCE, N. D. (2009). Dataset Shift in Machine Learning. MIT Press.

[30] RAHIMIAN, H. and MEHROTRA, S. (2019). Distributionally robust optimization: A review. arXiv:1908.05659.

[31] SAGAWA, S., KOH, P. W., HASHIMOTO, T. B. and LIANG, P. (2020). Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations.

[32] KOH, P. W., SAGAWA, S., MARKLUND, H., XIE, S. M., ZHANG, M., BALSUBRAMANI, A., HU, W., YASUNAGA, M., PHILLIPS, R. L., GAO, I. ET AL. (2021). WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning. PMLR.

[33] AMOS, B. and KOLTER, J. Z. (2017). Optnet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning, vol. 70 of Proceedings of Machine Learning Research. PMLR.

[34] AGRAWAL, A., AMOS, B., BARRATT, S., BOYD, S., DIAMOND, S. and KOLTER, J. Z. (2019). Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, vol. 32.

[35] SUH, H. J., SIMCHOWITZ, M., ZHANG, K. and TEDRAKE, R. (2022). Do differentiable simulators give better policy gradients? In International Conference on Machine Learning. PMLR.

[36] PARMAS, P., SENO, T. and AOKI, Y. (2023). Model-based reinforcement learning with scalable composite policy gradient estimators. In Proceedings of the International Conference on Machine Learning.

[37] ALVO, M., RUSSO, D. and KANORIA, Y. (2023). Neural inventory control in networks via hindsight differentiable policy optimization. arXiv:2306.11246.

[38] JAKOBI, N., HUSBANDS, P. and HARVEY, I. (1995). Evolutionary robotics and the radical envelope-of-noise hypothesis. Adaptive Behavior 6 325–368.

[39] TOBIN, J., FONG, R., RAY, A., SCHNEIDER, J., ZAREMBA, W. and ABBEEL, P. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[40] PENG, X. B., ANDRYCHOWICZ, M., ZAREMBA, W. and ABBEEL, P. (2018). Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA).

[41] TAN, J., ZHANG, T., COUMANS, E., ISCEN, A., BAI, Y., HAFNER, D., BOHEZ, S. and VANHOUCKE, V. (2018). Sim-to-real: Learning agile locomotion for quadruped robots. In Robotics: Science and Systems.

[42] NAGABANDI, A., KAHN, G., FEARING, R. S. and LEVINE, S. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA).

A Related Work. Multi-Agent Learning and Coordination. Centralized-training decentralized-execution methods such as MADDPG [...
Appendix excerpt (dual coordinator training):

1. A capacity path $G_{0:T} \sim P_G$ is sampled from the truncated Haar wavelet distribution.
2. $\phi_\theta$ predicts a cost trajectory $\hat{\lambda}_{t:t+L} = \phi_\theta(x_t, S_t, G_{t:t+L})$.
3. The fixed local policies respond to $\hat{\lambda}_{t:t+L}$ in the differentiable Exo-IDP simulator, producing simulated aggregate inbound $J_t$.
4. Gradients flow through the simulator response to update $\phi_\theta$ by minimizing Eq. (10).

$$L_{\text{dual}}(\theta) = \alpha_{\text{quad}} \sum_{t > t_{\text{burn}}} (J_t - G_t)_+^2 + \alpha_{\ell_1} \sum_t \|\hat{\lambda}_t\|_1 + \alpha_{\text{mse}} L_{\text{mse}}, \tag{10}$$

where $(u)_+ = \max(u, 0)$, and the capacity-violation sum is restricted to steps after a burn-in of 6 to exclude simulator warm-up. $L_{\text{mse}}$ is a forecast-consistency regularizer that penalizes disagreement betwee...
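The loss in Eq. (10) can be evaluated for a single rollout as in the sketch below. It is an illustration only: it treats each predicted cost as a scalar per step, takes the forecast-consistency term as a precomputed number because its definition is truncated above, and uses made-up weights. It computes the value of the loss and does not backpropagate through a simulator.

```python
from typing import Sequence

def dual_loss(
    inbound: Sequence[float],   # simulated aggregate inbound J_t
    capacity: Sequence[float],  # capacity target G_t
    costs: Sequence[float],     # predicted costs, assumed scalar per step
    l_mse: float,               # forecast-consistency term, computed elsewhere
    a_quad: float = 1.0,        # weights below are made up for illustration
    a_l1: float = 0.01,
    a_mse: float = 1.0,
    burn_in: int = 6,
) -> float:
    # Squared positive part of the capacity excess, only for steps t > burn_in.
    violation = sum(
        max(j - g, 0.0) ** 2
        for t, (j, g) in enumerate(zip(inbound, capacity))
        if t > burn_in
    )
    # L1 penalty on costs, summed over all steps.
    sparsity = sum(abs(c) for c in costs)
    return a_quad * violation + a_l1 * sparsity + a_mse * l_mse

inbound = [10.0] * 8
capacity = [8.0] * 8
costs = [1.0, -1.0] * 4
loss = dual_loss(inbound, capacity, costs, l_mse=0.5)
```

In this example every step exceeds capacity by 2, but only the single step after the burn-in contributes to the violation term. The L1 term pushes costs toward zero, so they are raised only where needed to avoid violations.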

[1] [1]

CACHON, G. P. (2003). Supply chain coordination with contracts. InHandbooks in Operations Research and Management Science, vol. 11. Elsevier, 227–339

work page 2003

[2] [2]

and ZIPKIN, P

FEDERGRUEN, A. and ZIPKIN, P. H. (1999). Coordination mechanisms for a distribution system with one supplier and multiple retailers.Management science451493–1507

work page 1999

[3] [3]

and ECKSTEIN, J

BOYD, S., PARIKH, N., CHU, E., PELEATO, B. and ECKSTEIN, J. (2011). Distributed opti- mization and statistical learning via the alternating direction method of multipliers.Foundations and Trends in Machine Learning31–122

work page 2011

[4] [4]

FISHER, M. L. (1981). The lagrangian relaxation method for solving integer programming problems.Management science271–18

work page 1981

[5] [5]

and MORDATCH, I

LOWE, R., WU, Y., TAMAR, A., HARB, J., ABBEEL, P. and MORDATCH, I. (2017). Multi- agent actor-critic for mixed cooperative-competitive environments. InAdvances in Neural Information Processing Systems, vol. 30

work page 2017

[6] [6]

OLIEHOEK, F. A. and AMATO, C. (2016).A Concise Introduction to Decentralized POMDPs. Springer

work page 2016

[7] [7]

and WANG, J

YANG, Y., LUO, R., LI, M., ZHOU, M., ZHANG, W. and WANG, J. (2018). Mean field multi-agent reinforcement learning. InInternational Conference on Machine Learning. PMLR

work page 2018

[8] BOYD, S. and VANDENBERGHE, L. (2004). Convex Optimization. pt. 1, Cambridge University Press

[9] MAYNE, D. Q., RAWLINGS, J. B., RAO, C. V. and SCOKAERT, P. O. (2000). Constrained model predictive control: Stability and optimality. Automatica 36 789–814

[10] GARCÍA, C. E., PRETT, D. M. and MORARI, M. (1989). Model predictive control: Theory and practice — A survey. Automatica 25 335–348

[11] CAMACHO, E. and BORDONS, C. (2004). Model Predictive Control. Advanced Textbooks in Control and Signal Processing, Springer London

[12] EISENACH, C., GHAI, U., MADEKA, D., TORKKOLA, K., FOSTER, D. and KAKADE, S. (2024). Neural coordination and capacity control for inventory management. arXiv:2410.02817

[13] MADEKA, D., TORKKOLA, K., EISENACH, C., LUO, A., FOSTER, D. and KAKADE, S. (2022). Deep inventory management. arXiv:2210.03137

[14] SINCLAIR, S. R., VIEIRA FRUJERI, F., CHENG, C.-A., MARSHALL, L., BARBALHO, H. D. O., LI, J., NEVILLE, J., MENACHE, I. and SWAMINATHAN, A. (2023). Hindsight learning for MDPs with exogenous inputs. In Proceedings of the 40th International Conference on Machine Learning, vol. 202 of Proceedings of Machine Learning Research. PMLR

[15] ANDAZ, S., EISENACH, C., MADEKA, D., TORKKOLA, K., JIA, R., FOSTER, D. and KAKADE, S. (2023). Learning an inventory control policy with general inventory arrival dynamics. arXiv:2310.17168

[16] MAGGIAR, A., DICKER, L. and MAHONEY, M. W. (2024). Consensus Planning with Primal, Dual, and Proximal Agents. arXiv:2408.16462

[17] SÄRNDAL, C.-E., SWENSSON, B. and WRETMAN, J. (2003). Model Assisted Survey Sampling. Springer Science & Business Media

[18] HORVITZ, D. G. and THOMPSON, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47 663–685

[19] RASHID, T., SAMVELYAN, M., SCHROEDER, C., FARQUHAR, G., FOERSTER, J. and WHITESON, S. (2018). QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning. PMLR

[20] MOUSA, M., VAN DE BERG, D., KOTECHA, N., DEL RIO-CHANONA, E. A. and MOWBRAY, M. (2024). An analysis of multi-agent reinforcement learning for decentralized inventory control systems. Computers & Chemical Engineering 187 108783

[21] BERTSEKAS, D. P. (1999). Nonlinear Programming. Athena Scientific

[22] GIJSBRECHTS, J., BOUTE, R. N., VAN MIEGHEM, J. A. and ZHANG, D. J. (2022). Can deep reinforcement learning improve inventory management? Performance on lost sales, dual-sourcing, and multi-echelon problems. Manufacturing & Service Operations Management 24 1349–1368

[23] HYNDMAN, R. J., AHMED, R. A., ATHANASOPOULOS, G. and SHANG, H. L. (2011). Optimal combination forecasts for hierarchical time series. Computational statistics & data analysis 55 2579–2589

[24] WICKRAMASURIYA, S. L., ATHANASOPOULOS, G. and HYNDMAN, R. J. (2019). Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization. Journal of the American Statistical Association 114 804–819

[25] ZAHEER, M., KOTTUR, S., RAVANBAKHSH, S., POCZOS, B., SALAKHUTDINOV, R. and SMOLA, A. (2017). Deep sets. In Advances in Neural Information Processing Systems, vol. 30

[26] VASWANI, A., SHAZEER, N., PARMAR, N., USZKOREIT, J., JONES, L., GOMEZ, A. N., KAISER, L. and POLOSUKHIN, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, vol. 30

[27] SHIMODAIRA, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90 227–244

[28] BEN-DAVID, S., BLITZER, J., CRAMMER, K. and PEREIRA, F. (2006). Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, vol. 19

[29] QUINONERO-CANDELA, J., SUGIYAMA, M., SCHWAIGHOFER, A. and LAWRENCE, N. D. (2009). Dataset Shift in Machine Learning. MIT Press

[30] RAHIMIAN, H. and MEHROTRA, S. (2019). Distributionally robust optimization: A review. arXiv:1908.05659

[31] SAGAWA, S., KOH, P. W., HASHIMOTO, T. B. and LIANG, P. (2020). Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations

[32] KOH, P. W., SAGAWA, S., MARKLUND, H., XIE, S. M., ZHANG, M., BALSUBRAMANI, A., HU, W., YASUNAGA, M., PHILLIPS, R. L., GAO, I. ET AL. (2021). WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning. PMLR

[33] AMOS, B. and KOLTER, J. Z. (2017). Optnet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning, vol. 70 of Proceedings of Machine Learning Research. PMLR

[34] AGRAWAL, A., AMOS, B., BARRATT, S., BOYD, S., DIAMOND, S. and KOLTER, J. Z. (2019). Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, vol. 32

[35] SUH, H. J., SIMCHOWITZ, M., ZHANG, K. and TEDRAKE, R. (2022). Do differentiable simulators give better policy gradients? In International Conference on Machine Learning. PMLR

[36] PARMAS, P., SENO, T. and AOKI, Y. (2023). Model-based reinforcement learning with scalable composite policy gradient estimators. In Proceedings of the International Conference on Machine Learning

[37] ALVO, M., RUSSO, D. and KANORIA, Y. (2023). Neural inventory control in networks via hindsight differentiable policy optimization. arXiv:2306.11246

[38] JAKOBI, N., HUSBANDS, P. and HARVEY, I. (1995). Evolutionary robotics and the radical envelope-of-noise hypothesis. Adaptive behavior 6 325–368

[39] TOBIN, J., FONG, R., RAY, A., SCHNEIDER, J., ZAREMBA, W. and ABBEEL, P. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS)

[40] PENG, X. B., ANDRYCHOWICZ, M., ZAREMBA, W. and ABBEEL, P. (2018). Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA)

[41] TAN, J., ZHANG, T., COUMANS, E., ISCEN, A., BAI, Y., HAFNER, D., BOHEZ, S. and VANHOUCKE, V. (2018). Sim-to-real: Learning agile locomotion for quadruped robots. In Robotics: Science and Systems

[42] NAGABANDI, A., KAHN, G., FEARING, R. S. and LEVINE, S. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA)

A Related Work

Multi-Agent Learning and Coordination. Centralized-training decentralized-execution methods such as MADDPG [...

1. A capacity path $G_{0:T} \sim P_G$ is sampled from the truncated Haar wavelet distribution.
2. $\phi_\theta$ predicts a cost trajectory $\hat{\lambda}_{t:t+L} = \phi_\theta(x_t, S_t, G_{t:t+L})$.
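The text here does not spell out the capacity-path distribution, so the sketch below is only one hypothetical reading of step 1: "truncated" is taken to mean that Haar coefficients are kept only up to a coarse level `max_level`, which yields piecewise-constant paths. The function name `sample_haar_path`, the Gaussian coefficients, the per-level scale decay, and all default values are assumptions for illustration.

```python
import random

def sample_haar_path(n_levels=4, max_level=2, base=100.0, scale=10.0, rng=None):
    """Sample a path of length 2**n_levels from random Haar coefficients.

    Only levels 0..max_level get nonzero coefficients (the assumed
    "truncation"), so the path is constant on blocks of
    2**(n_levels - max_level - 1) steps.
    """
    if max_level >= n_levels:
        raise ValueError("max_level must be smaller than n_levels")
    if rng is None:
        rng = random.Random()
    T = 2 ** n_levels
    path = [base] * T
    for level in range(max_level + 1):
        n_blocks = 2 ** level
        width = T // n_blocks
        half = width // 2
        for b in range(n_blocks):
            # Coefficient shrinks with level so coarse structure dominates.
            c = rng.gauss(0.0, scale / (2 ** level))
            start = b * width
            # Haar wavelet: +c on the first half of the block, -c on the second.
            for t in range(start, start + half):
                path[t] += c
            for t in range(start + half, start + width):
                path[t] -= c
    # Capacities cannot be negative.
    return [max(v, 0.0) for v in path]
```

Calling `sample_haar_path(rng=random.Random(0))` returns 16 capacities that change only at even time steps. Raising `max_level` adds finer-grained variation to the sampled paths.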
