arxiv: 2605.13900 · v2 · pith:GGWLA2XCnew · submitted 2026-05-12 · 💻 cs.MA · cs.LG

Ready from Day 1: Population-Aware Coordination for Large-Scale Constrained Multi-Agent Systems

Pith reviewed 2026-05-20 21:37 UTC · model grok-4.3

keywords: multi-agent coordination, population-aware interfaces, constrained optimization, Lagrangian relaxation, supply chain planning, response maps, composition shift, large-scale systems

The pith

Learned primal and dual maps conditioned on compact population summaries let planners coordinate large evolving multi-agent populations without retraining each cycle.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes population-aware coordination interfaces for large-scale multi-agent systems that share resource constraints. An upstream planner queries these learned maps inside its iterative planning loop, either to predict aggregate utilization under a proposed cost signal or to find the cost trajectory that achieves a target plan. Because the maps are conditioned on compact summaries of population structure, they generalize across changes in who is participating and do not need to be rebuilt for each new composition. In a supply-chain capacity-control study, the approach cuts forecast error by 16-19% and capacity violations by 20-51% compared with maps that ignore population composition, and it supports accurate coordination of 500K-agent populations from 20K-agent subsamples.

Core claim

By encoding response-relevant population structure into learned primal and dual maps, the interfaces remain reliable across evolving populations without per-cycle retraining and support coordination of large populations from compact subsamples; in the supply-chain case study these maps reduce forecast error by 16-19% and capacity violations by 20-51% relative to population-unaware baselines under composition shift, and simulator-trained maps reach 11.1% MAPE on real observations.

What carries the argument

Population-aware coordination interfaces: learned primal and dual maps that are conditioned on compact population summaries and queried inside the planner's iterative loop to predict aggregate utilization or required cost trajectories.
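
As a rough illustration of how a planner might query such interfaces, the sketch below swaps the learned maps for toy closed-form stand-ins. Everything in it is hypothetical: `PopulationSummary`, `primal_map`, `dual_map`, and the linear response model are illustrative assumptions and are not the paper's architecture or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PopulationSummary:
    """Hypothetical compact summary: agent count, mean demand, mean cost sensitivity."""
    n_agents: int
    mean_demand: float
    mean_sensitivity: float


def primal_map(costs: List[float], s: PopulationSummary) -> List[float]:
    """Toy stand-in for a learned primal map: cost trajectory -> aggregate utilization."""
    return [s.n_agents * max(0.0, s.mean_demand - s.mean_sensitivity * c) for c in costs]


def dual_map(targets: List[float], s: PopulationSummary) -> List[float]:
    """Toy stand-in for a learned dual map: target utilization -> cost trajectory."""
    costs = []
    for g in targets:
        per_agent = g / s.n_agents
        # A cost cannot be negative, so targets above uncosted demand map to zero cost.
        costs.append(max(0.0, (s.mean_demand - per_agent) / s.mean_sensitivity))
    return costs


summary = PopulationSummary(n_agents=1000, mean_demand=5.0, mean_sensitivity=2.0)
targets = [4000.0, 3000.0, 6000.0]
costs = dual_map(targets, summary)       # planner asks: which costs hit this plan?
predicted = primal_map(costs, summary)   # planner checks: what do those costs produce?
```

In this toy setting the third target exceeds what the population would use at zero cost, so the dual stand-in returns a zero cost and the primal stand-in reports the lower, unconstrained utilization. A new population composition would only change the `summary` argument, which is the property the paper attributes to its learned maps.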

If this is right

  • The maps support coordination of 500K-agent populations from 20K-agent subsamples, with accuracy close to full-population inference.
  • Simulator-trained primal maps achieve 11.1% MAPE on real observations, outperforming baselines that reach 13-24%.
  • No per-cycle retraining is required when population composition changes between planning cycles.
  • Capacity violations drop by 20-51% under composition shift compared with population-unaware methods.
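
The accuracy figures above are stated as MAPE and as relative reductions. The following minimal sketch shows the standard definitions of both quantities; whether the paper aggregates errors in exactly this way (for example, how it treats zero-valued periods) is an assumption, and the numbers used are made up for illustration.

```python
from typing import Sequence


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent; periods with zero actuals are skipped."""
    terms = [abs(a - f) / abs(a) for a, f in zip(actual, forecast) if a != 0]
    if not terms:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(terms) / len(terms)


def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Illustrative data only, not taken from the paper.
actual = [100.0, 200.0, 400.0]
forecast = [90.0, 220.0, 400.0]
error = mape(actual, forecast)
reduction = relative_reduction(10.0, 8.0)
```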

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conditioning idea could be applied to other iterative planners that must adapt to changing participant sets, such as traffic signal control or energy demand response.
  • Compact summaries might also serve as a privacy mechanism by letting the planner work with aggregate descriptors rather than individual agent data.
  • If the summaries can be updated incrementally, the interfaces could support continuous online replanning as agents arrive or depart.

Load-bearing premise

Compact population summaries contain enough information to capture the structure that determines how the population responds to cost signals, so the maps generalize to new compositions without retraining.

What would settle it

The claim would fail if, under a composition shift that the chosen summaries do not capture, the forecast error of the conditioned maps stayed as high as that of the unconditioned baselines.

Figures

Figures reproduced from arXiv: 2605.13900 by Alvaro Maggiar, Angel Wang, Carson Eisenach, Dean Foster, Dominique Perrault-Joncas.

Figure 1: Population-aware forecaster architectures. (a) Population-Embedding (per-Agent) Aggregate: agent embeddings $e_t^i = f(x_t^i)$ are pooled via attention and then decoded. (b) Population-Embedding (Bucketized) Aggregate: within-bucket attention is followed by cross-bucket attention before decoding. The population summary is passed to the selected decoder head: DecP (primal) or DecD (dual). Agents are partition…
Figure 2: (a) Distribution of agent-level cost sensitivity across 500K agents, showing a right-skewed tail of highly responsive agents. (b) Population composition under α-shifted demand-decile mixtures in the supply chain setting: positive α upweights high-demand products, while negative α upweights low-demand products.
Section 4, Empirical Evaluation: We evaluate population-aware coordination interfaces along four dimensions…
Figure 3 reports results for both interface types. The left panel evaluates primal forecast accuracy, and the right panel evaluates dual control quality using mean violation on near-limit periods. Additional shift results and the remaining dual metrics are provided in Appendix G. Population-aware interfaces are substantially more robust under composition shift than population-unaware baselines. In the primal settin…
Figure 4 shows that performance saturates once the source cohort contains approximately 20K agents. For primal prediction, accuracy at this cohort size is close to full-population inference across target population sizes. For dual control, cost trajectories inferred from 20K-agent cohorts remain effective when applied to substantially larger target populations. These results show that population-aware interfaces ca…
Figure 5: Example capacity target trajectories generated by the wavelet sampler using a truncated Haar wavelet basis. Given a sampled target $G^{(n)}_{0:T}$, the trained dual coordinator is applied step by step to produce the episode-level cost trajectory $\lambda^{(n)}_{0:T}$; at each step $t$, $\lambda^{(n)}_{t:t+L} = \phi_\theta(x^{(n)}_t, S^{(n)}_t, G^{(n)}_{t:t+L})$. The simulator is then rolled out under $\lambda^{(n)}_{0:T}$, applying the broadcast costs to the fix…
Figure 6: Standardized OLS coefficients relating observable product attributes to estimated product…
Figure 7: Composition of target populations under α-shifted distributions, measured as the expected sampling mass assigned to each decile of the bucketization attribute. Positive α shifts mass toward higher-value segments; negative α shifts mass toward lower-value segments.
[Plot: x-axis, shift parameter α from -0.5 to 0.5; y-axis, population share (%) from 0 to 100; light = lower demand, dark = higher de…]
Figure 8: Realized product-count share in each decile under α-shifted population sampling, for demand (left) and unit economics (right). The plots show how reweighting by demand or unit economics changes the product composition of the sampled population.
Effect on Evaluation Populations. In our population-shift evaluation (Section 4.1), we consider values of α ranging from −0.5 to 0.5. To illustrate the effect of the…
Figure 9: Product distribution within evaluation populations for α = 0 (left, baseline) and α = 0.2 (right, shifted), illustrating the reweighting of population composition toward higher-value segments as α increases.
… learning a cost-conditioned response map would provide little value. We therefore compare each cost-conditioned primal forecaster against an unconstrained variant that does not receive $\lambda_{t:t+L}$ as input.…
Figure 10 shows that Population-Embedding models maintain slopes closer to 1 across most shifts, typically in the 90–100% range. In contrast, the Bottom-Up and Global Aggregate models exhibit larger calibration deviations under extreme shifts, consistent with the accuracy degradation observed in Section 4.1.
[Plot: x-axis, alpha value from −0.4 to 0.4 (more tail products on the left, more head products on the right); y-axis ticks from 75% to 105%; y-axis label truncated: Mu…]
Figure 11: Aggregate inbound MAPE across unit-economics population shifts. Error bars show 95% confidence intervals across sampled capacity scenarios.
Dual-Control Violations across Population Shifts
Figure 12: Violation metrics for the dual coordination interface across α-shifted population distributions. Each point corresponds to one sampled capacity scenario; lower violation indicates better capacity adherence.
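
The Figure 1 caption describes pooling agent embeddings via attention to form a population summary. A minimal sketch of single-query softmax attention pooling follows, assuming a fixed query vector in place of learned parameters; the function name, the query, and the two-dimensional embeddings are all hypothetical and chosen only to show that the pooled summary does not depend on agent order.

```python
import math
from typing import List

Vector = List[float]


def attention_pool(embeddings: List[Vector], query: Vector) -> Vector:
    """Softmax-weighted average of agent embeddings, scored against one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * e for q, e in zip(query, emb)) / scale for emb in embeddings]
    peak = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * emb[d] for w, emb in zip(weights, embeddings)) / total
        for d in range(len(query))
    ]


agents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.5]  # hypothetical; a trained model would learn this
pooled = attention_pool(agents, query)
pooled_reordered = attention_pool(list(reversed(agents)), query)
uniform = attention_pool([[2.0, 4.0], [4.0, 2.0]], [0.0, 0.0])
```

Because the weights are computed per agent and then normalized over the whole set, reordering the agents leaves the result unchanged up to floating-point rounding, and the output has a fixed size regardless of how many agents are pooled.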
Original abstract

In large-scale multi-agent systems with shared resource constraints, an upstream planner must iteratively evaluate candidate resource plans (assessing feasibility, aggregate response, and marginal cost) before committing to one. Lagrangian relaxation separates local decisions through a broadcast cost signal, but the planner still needs the cost-to-utilization response map to explore plan space, and this map depends on population composition that changes across planning cycles. We propose population-aware coordination interfaces: learned primal and dual maps, conditioned on compact population summaries, that the planner queries inside its iterative loop. The primal map predicts aggregate utilization under a proposed cost trajectory; the dual map predicts the cost trajectory for a target plan. By encoding response-relevant population structure, these maps remain reliable across evolving populations without per-cycle retraining, and support coordination of large populations from compact subsamples. We additionally cast Sim2Real transfer as a backtestable procedure, enabling evaluation before deployment. In a supply-chain capacity-control case study, population-aware interfaces reduce forecast error by 16-19% and capacity violations by 20-51% relative to population-unaware baselines under composition shift; 20K-agent cohorts support accurate coordination of 500K-agent populations; and simulator-trained primal maps achieve 11.1% MAPE on real observations versus 13-24% for baselines.
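
To make the role of the broadcast cost signal concrete, here is a toy sketch of Lagrangian-style coordination in which every agent reacts only to a shared cost and the planner searches for the cost that meets a capacity. The linear agent response, the bisection search, and all names and numbers are illustrative assumptions. The population is enumerated explicitly here, whereas the abstract proposes replacing that aggregate response with a learned map.

```python
from typing import List, Tuple

Agent = Tuple[float, float]  # (base demand, cost sensitivity), both hypothetical


def local_response(agent: Agent, cost: float) -> float:
    """Each agent decides on its own, seeing only the broadcast cost."""
    demand, sensitivity = agent
    return max(0.0, demand - sensitivity * cost)


def aggregate_utilization(agents: List[Agent], cost: float) -> float:
    """Cost-to-utilization response of the whole population."""
    return sum(local_response(a, cost) for a in agents)


def price_for_capacity(agents: List[Agent], capacity: float,
                       hi: float = 100.0, tol: float = 1e-9) -> float:
    """Bisect on the cost until aggregate utilization fits within the capacity.

    Assumes utilization at cost `hi` is already within the capacity.
    """
    if aggregate_utilization(agents, 0.0) <= capacity:
        return 0.0  # the constraint is slack, so no cost is needed
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if aggregate_utilization(agents, mid) > capacity:
            lo = mid
        else:
            hi = mid
    return hi  # the upper end always satisfies the capacity


agents = [(10.0, 1.0), (6.0, 2.0), (4.0, 0.5)]
cost = price_for_capacity(agents, capacity=13.0)
```

Each bisection step calls `aggregate_utilization`, which is the response map the planner needs; with a very large population, every such call is expensive, which is the motivation the abstract gives for learning the map instead.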

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes population-aware coordination interfaces for large-scale constrained multi-agent systems: learned primal maps that predict aggregate utilization from a proposed cost trajectory and dual maps that predict the cost trajectory for a target plan, both conditioned on compact population summaries. These interfaces are intended to allow an upstream planner to explore resource plans iteratively without retraining when population composition changes. The approach is evaluated in a supply-chain capacity-control case study, where it reports 16-19% lower forecast error and 20-51% fewer capacity violations than population-unaware baselines under composition shift, accurate coordination of 500K-agent populations from 20K-agent subsamples, and 11.1% MAPE on real observations for simulator-trained maps.

Significance. If the generalization claims hold, the work could meaningfully improve scalability of Lagrangian-relaxation-based coordination in dynamic MAS by eliminating per-cycle retraining and supporting planning from compact subsamples. The framing of Sim2Real transfer as a backtestable procedure is a constructive practical contribution.

major comments (2)
  1. Abstract: the central empirical claims rest on concrete percentage improvements (16-19% forecast error, 20-51% capacity violations) yet the abstract supplies no description of the population-summary features, model architecture, training/validation splits, or statistical significance tests. Without these, the reported gains under composition shift cannot be independently verified and the generalization guarantee remains unassessable.
  2. Abstract (paragraph on population-aware coordination interfaces): the modeling assumption that a low-dimensional population summary is a sufficient statistic for the cost-to-utilization response map is load-bearing for the claim of reliable generalization without retraining. No supporting analysis (ablation on summary dimension, mutual-information bounds, or checks for omitted higher-order interactions) is referenced, leaving the skeptic's concern about cross-agent correlations unaddressed.
minor comments (1)
  1. Abstract: the phrase 'population-aware coordination interfaces' is introduced as a new term but is not immediately linked to a formal definition or section where the primal/dual maps are mathematically specified.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We respond to each major comment below and describe the revisions we will make to strengthen the manuscript.

  1. Referee: Abstract: the central empirical claims rest on concrete percentage improvements (16-19% forecast error, 20-51% capacity violations) yet the abstract supplies no description of the population-summary features, model architecture, training/validation splits, or statistical significance tests. Without these, the reported gains under composition shift cannot be independently verified and the generalization guarantee remains unassessable.

    Authors: We agree that the abstract would benefit from additional context to support independent assessment of the claims. In the revised version we will expand the abstract with a concise description of the population-summary features (low-order moments of agent attributes), the neural architectures for the primal and dual maps, the training/validation splits used in the case study, and a statement that the reported improvements are statistically significant across repeated trials. Full implementation and experimental details will remain in the methods and results sections.
    Revision: yes

  2. Referee: Abstract (paragraph on population-aware coordination interfaces): the modeling assumption that a low-dimensional population summary is a sufficient statistic for the cost-to-utilization response map is load-bearing for the claim of reliable generalization without retraining. No supporting analysis (ablation on summary dimension, mutual-information bounds, or checks for omitted higher-order interactions) is referenced, leaving the skeptic's concern about cross-agent correlations unaddressed.

    Authors: The empirical generalization results across composition shifts in the supply-chain experiments provide practical support for the utility of the chosen summaries. We acknowledge, however, that explicit ablations on summary dimension and information-theoretic analysis are absent from the current manuscript. We will add an ablation study that varies the dimensionality of the population summary and reports its effect on forecast error and violation rates. Mutual-information bounds and exhaustive checks for higher-order interactions would require additional theoretical development beyond the scope of the present work; the planned ablation will nevertheless directly address sensitivity to summary richness.
    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation against external baselines

The abstract and described claims present learned primal/dual maps conditioned on population summaries as a modeling choice, with reported performance gains (16-19% forecast error reduction, 20-51% fewer violations) measured against population-unaware baselines in a supply-chain case study. No derivation step reduces a prediction to its own fitted inputs by construction, invokes a self-citation as the sole justification for a uniqueness theorem, or renames an empirical pattern as a derived result. The sufficiency of compact summaries is stated as an assumption that is then tested via generalization metrics on evolving populations and Sim2Real backtesting, rather than being tautological. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; full specification of learned-map training, summary construction, and any regularization choices is unavailable, so the ledger reflects only the high-level premises stated in the abstract.

free parameters (1)
  • population summary dimension and features
    Compact summaries are asserted to encode response-relevant structure, yet the abstract gives no explicit count or selection procedure for these features.
axioms (1)
  • Domain assumption: learned maps conditioned on population summaries can accurately predict aggregate utilization and required cost trajectories across composition shifts.
    This premise underpins the claim that the interfaces remain reliable without per-cycle retraining.
invented entities (1)
  • population-aware coordination interfaces (no independent evidence)
    Purpose: provide reusable primal and dual maps that the planner queries inside its iterative loop.
    New construct introduced to decouple the response map from changing population composition.

pith-pipeline@v0.9.0 · 5783 in / 1505 out tokens · 109922 ms · 2026-05-20T21:37:19.197262+00:00 · methodology

Dual-map training procedure (excerpt from the paper's appendix)

  1. A capacity path $G_{0:T} \sim P_G$ is sampled from the truncated Haar wavelet distribution.
  2. $\phi_\theta$ predicts a cost trajectory $\hat{\lambda}_{t:t+L} = \phi_\theta(x_t, S_t, G_{t:t+L})$.
  3. The fixed local policies respond to $\hat{\lambda}_{t:t+L}$ in the differentiable Exo-IDP simulator, producing simulated aggregate inbound $J_t$.
  4. Gradients flow through the simulator response to update $\phi_\theta$ by minimizing Eq. (10).

$$L_{\mathrm{dual}}(\theta) = \alpha_{\mathrm{quad}} \sum_{t > t_{\mathrm{burn}}} (J_t - G_t)_+^2 + \alpha_{\ell_1} \sum_t \|\hat{\lambda}_t\|_1 + \alpha_{\mathrm{mse}} L_{\mathrm{mse}}, \tag{10}$$

where $(u)_+ = \max(u, 0)$, and the capacity-violation sum is restricted to steps after a burn-in of 6 to exclude simulator warm-up. $L_{\mathrm{mse}}$ is a forecast-consistency regularizer that penalizes disagreement between...
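
The loss in Eq. (10) can be evaluated numerically as in the sketch below. Several details are assumptions made for illustration: the weight values, the zero-based time index with the violation sum taken over t > 6, and passing the forecast-consistency term in as a precomputed number `l_mse`, since its definition is cut off in the text above. The function name and argument names are hypothetical.

```python
from typing import Sequence


def dual_loss(inbound: Sequence[float], capacity: Sequence[float],
              costs: Sequence[Sequence[float]], l_mse: float,
              a_quad: float = 1.0, a_l1: float = 0.1, a_mse: float = 1.0,
              burn_in: int = 6) -> float:
    """Eq. (10): squared capacity overshoot + L1 cost penalty + consistency term."""
    # Only overshoot (inbound above capacity) is penalized, and only after burn-in.
    violation = sum(
        max(j - g, 0.0) ** 2
        for t, (j, g) in enumerate(zip(inbound, capacity))
        if t > burn_in
    )
    # L1 norm of the predicted cost vector at every step, including burn-in steps.
    sparsity = sum(sum(abs(c) for c in step) for step in costs)
    return a_quad * violation + a_l1 * sparsity + a_mse * l_mse


# Illustrative numbers only: large warm-up values in the first 7 steps are ignored.
inbound = [50.0] * 7 + [12.0, 9.0, 15.0]
capacity = [10.0] * 10
costs = [[0.5, 0.5]] * 10
loss = dual_loss(inbound, capacity, costs, l_mse=0.25)
```

In this example the violation term is 2² + 0 + 5² = 29, the L1 term is 0.1 × 10 = 1, and the consistency term is 0.25, so the loss is 30.25. The step with inbound below capacity contributes nothing, which reflects the one-sided $(u)_+$ operator.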