HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster

Jimmy Cao; Mahardhika Pratama; Mohamad A. Hady; Muhammad Anwar Masum; Ryszard Kowalczyk; Siyi Hu

arxiv: 2605.31023 · v1 · pith:SVAY5VR5new · submitted 2026-05-29 · 💻 cs.AI · cs.LG · cs.MA

Pith reviewed 2026-06-28 22:39 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.MA

keywords satellite resource management · multi-agent reinforcement learning · transformer architecture · earth observation · heterogeneous agents · differential attention · autonomous scheduling · model-free policies

The pith

A differential transformer enables satellite clusters to manage resources autonomously by treating scheduling as model-free reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the Heterogeneous Multi-Agent Differential Transformer to solve resource allocation for mixed groups of optical and SAR satellites performing Earth observation. It reframes the task as a sequential decision process so agents can learn policies from experience instead of relying on fixed mathematical models of the space environment. The architecture uses relational tokenization of observations and actions together with differential attention to coordinate heterogeneous agents. If correct, the method would deliver real-time decisions with less ground control and maintain effectiveness when conditions change or cluster size varies. Traditional optimization approaches lose reliability under the uncertainties typical of orbital operations.

Core claim

The authors claim that the Heterogeneous Multi-Agent Differential Transformer (HADT), equipped with relational observations-actions tokenization and a differential attention mechanism, produces superior autonomous resource management for heterogeneous Earth observation satellite clusters. Experimental results show significant performance gains over baselines along with strong adaptability and transferability across different numbers of satellites in the cluster.

What carries the argument

The HADT architecture, which tokenizes relational observations and actions from heterogeneous agents and applies differential attention within a multi-agent reinforcement learning setup for satellite scheduling.
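As a rough illustration of the differential attention idea named above, the sketch below follows the basic formulation from the Differential Transformer reference (Ye et al., 2024): two softmax attention maps are computed from separate query/key projections and the second is subtracted from the first, scaled by a factor λ. This is only a minimal single-head NumPy sketch. The function names, the shapes, and the fixed `lam` value are assumptions for illustration, the learnable re-parameterisation of λ and the per-head normalisation used in that reference are omitted, and nothing here is taken from HADT's actual implementation, which this page does not describe.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def differential_weights(tokens, wq1, wk1, wq2, wk2, lam):
    # Two independent attention maps; the second one is subtracted
    # from the first to cancel attention mass both maps share.
    scale = np.sqrt(wq1.shape[1])
    a1 = softmax((tokens @ wq1) @ (tokens @ wk1).T / scale)
    a2 = softmax((tokens @ wq2) @ (tokens @ wk2).T / scale)
    return a1 - lam * a2


def differential_attention(tokens, wq1, wk1, wq2, wk2, wv, lam):
    # Apply the differential weights to the value projection.
    weights = differential_weights(tokens, wq1, wk1, wq2, wk2, lam)
    return weights @ (tokens @ wv)


# Hypothetical sizes: 5 tokens (e.g. one per agent), model width 8, head width 4.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 8, 4
tokens = rng.normal(size=(n_tokens, d_model))
wq1, wk1, wq2, wk2 = (rng.normal(size=(d_model, d_head)) for _ in range(4))
wv = rng.normal(size=(d_model, d_head))

out = differential_attention(tokens, wq1, wk1, wq2, wk2, wv, lam=0.5)
```

With `lam=0` the function reduces to ordinary softmax attention, which makes it easy to compare the two variants on the same inputs.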

If this is right

Significant performance improvements over available baselines in autonomous Earth observation mission resource management.
Strong adaptability and transferability when the number of satellites in the cluster varies.
Real-time decision-making with minimal interaction with ground operators.
Effective coordination of heterogeneous satellites including both optical and SAR types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The differential attention component may transfer to other multi-agent reinforcement learning settings that involve mixed sensor types or uncertain dynamics.
The approach could reduce dependence on high-fidelity physical simulations when planning operations for other autonomous systems such as drone fleets or robotic teams.
Explicit tests introducing specific orbital uncertainties such as drag variations or communication delays would clarify the limits of transferability.

Load-bearing premise

Reformulating satellite resource management as a model-free sequential decision process will produce adaptive real-time policies that remain effective when the underlying dynamics are unavailable, overly complex, or inaccurate due to space environment uncertainties.

What would settle it

A controlled test in which an accurate model of satellite dynamics is provided and HADT is compared directly against a traditional optimization solver to determine whether the model-free approach still yields performance gains or falls behind.

Figures

Figures reproduced from arXiv: 2605.31023 by Jimmy Cao, Mahardhika Pratama, Mohamad A. Hady, Muhammad Anwar Masum, Ryszard Kowalczyk, Siyi Hu.

**Figure 2.** RL and MILP performance comparison. RL can achieve results comparable to MILP in a simple case study. Thus, model-free RL with PPO can be a potential solution for autonomous heterogeneous satellite clusters. To confirm this hypothesis, we designed a simple model of satellite cluster scheduling as an initial study and comparison (detailed mathematical model provided in the Supplemental Document, Section 1). Th…

**Figure 3.** HADT comparison with the baseline algorithms. Our main focus is to …

**Figure 4.** HADT Policy Model. It has two main components: 1) Observation to …

**Figure 5.** Average evaluation rewards comparison across different scenarios

**Figure 6.** Average evaluation Completion Rate (CR) in different scenarios

**Figure 7.** HADT Policy Transfer from 1 to 2 Cluster

Original abstract

This work addresses the problem of autonomous resource management in heterogeneous satellite clusters conducting Earth Observation (EO) missions, including optical and Synthetic Aperture Radar (SAR) satellites. In autonomous operation mode, satellites are equipped with intelligent capabilities enabling real-time decision-making based on the latest conditions, while requiring minimal interaction with ground operators. Traditional scheduling approaches typically rely on mathematical models to represent satellite mission and resource management; the problem is then solved using optimization algorithms. However, such solutions become less effective when the underlying models are unavailable, overly complex, or inaccurate due to dynamic changes and uncertainties inherent in the space mission environment. A promising alternative is to reformulate the problem as a sequential decision-making process and apply model-free reinforcement learning techniques to enable adaptive and real-time resource management. To this end, we propose a novel transformer-based architecture tailored for autonomous EO missions of heterogeneous satellite clusters, with relational observations-actions tokenization and a differential attention mechanism. Our experimental results demonstrate significant performance improvements compared to the available baselines. Moreover, the proposed architecture exhibits strong adaptability and transferability with respect to varying numbers of satellite clusters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HADT is a domain-specific transformer for heterogeneous multi-agent RL on satellite EO scheduling that claims performance gains and transfer to different cluster sizes, but the abstract leaves the scaling mechanism and experimental support unexamined.

The paper puts forward HADT, which combines relational tokenization of observations and actions with a differential attention mechanism inside a transformer for model-free RL on resource management across mixed optical and SAR satellites. The core move is treating the problem as sequential decisions when analytic models are unavailable or inaccurate, which is a sensible shift for dynamic space environments.

The domain tailoring is the clearest addition. Prior multi-agent RL work exists, but the specific pairing of differential attention with relational tokenization for heterogeneous EO clusters has not been shown before in the cited literature. If the experiments hold, this could give practitioners a ready architecture for variable satellite teams.

The soft spot is the transferability claim. The abstract states the architecture shows strong adaptability to varying numbers of satellites, yet supplies no description of how variable agent counts are handled: there is no mention of set-based processing, dynamic masking, or padding that would allow zero-shot changes in cardinality. If the tokenization or positional encodings are fixed to a maximum size, the result would require retraining or architectural edits, undercutting the stated benefit. The performance improvements are also asserted without numbers, baselines, or ablation details, so it is impossible to judge whether they survive standard controls.
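To make the padding-and-masking option mentioned above concrete, here is a minimal NumPy sketch of one way a fixed-size attention layer can serve a smaller cluster: empty slots are padded with zeros and excluded through a key mask, so the outputs for the real agents are unchanged. The function name, the sizes, and the use of identity projections are assumptions made for brevity. Whether HADT handles variable agent counts this way is not stated in the abstract.

```python
import numpy as np


def masked_attention(x, mask):
    # Self-attention over the rows of x with identity projections.
    # mask[j] is False for padded slots, which must receive zero weight.
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = np.where(mask[None, :], scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x


rng = np.random.default_rng(0)
agents = rng.normal(size=(3, 4))                 # three real satellites (hypothetical features)
padded = np.vstack([agents, np.zeros((2, 4))])   # padded up to a maximum of five slots
mask = np.array([True, True, True, False, False])

out_small = masked_attention(agents, np.ones(3, dtype=bool))
out_padded = masked_attention(padded, mask)[:3]  # keep only the real agents' rows
```

Here `out_small` and `out_padded` agree, which is the property a masking-based design would rely on. It still caps the cluster at the padded maximum, which is exactly the limitation the note above is asking about.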

This is aimed at the small group working on applied multi-agent RL for satellite operations. Readers outside that niche will find little generalizable insight. The work is coherent on its own terms and engages the relevant literature without obvious circularity, so it deserves a serious referee to check the implementation and scaling details.

Referee Report

1 major / 1 minor

Summary. The paper proposes HADT, a heterogeneous multi-agent differential transformer for autonomous resource management in Earth Observation satellite clusters. It reformulates satellite scheduling as a model-free sequential decision process solved via reinforcement learning, introducing relational observations-actions tokenization and a differential attention mechanism to handle agent heterogeneity. The authors claim significant performance gains over baselines and strong adaptability/transferability to varying numbers of satellites in the cluster.

Significance. If the performance and transferability results hold under rigorous evaluation, the work would advance multi-agent RL applications to uncertain, high-stakes domains such as space systems by demonstrating scalable handling of heterogeneous agents without explicit dynamics models. The differential attention component could generalize beyond satellites if shown to support variable cardinalities natively.

major comments (1)

[Abstract] The central claim that the architecture 'exhibits strong adaptability and transferability with respect to varying numbers of satellite clusters' is load-bearing for the contribution, yet the abstract provides no indication of the scaling mechanism (e.g., set-based processing, explicit masking, or padding). If the relational tokenization or differential attention uses fixed-dimensional inputs or positional encodings tied to a maximum cluster size, transfer to unseen cardinalities would require retraining or architectural modification, directly undermining the transferability result.

minor comments (1)

[Abstract] The abstract would benefit from a one-sentence description of the underlying RL algorithm (e.g., actor-critic variant) and the observation/action spaces to allow readers to assess the tokenization claim immediately.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the single major comment below and have made revisions to strengthen the description of the architecture's scaling properties.

Referee: [Abstract] The central claim that the architecture 'exhibits strong adaptability and transferability with respect to varying numbers of satellite clusters' is load-bearing for the contribution, yet the abstract provides no indication of the scaling mechanism (e.g., set-based processing, explicit masking, or padding). If the relational tokenization or differential attention uses fixed-dimensional inputs or positional encodings tied to a maximum cluster size, transfer to unseen cardinalities would require retraining or architectural modification, directly undermining the transferability result.

Authors: We agree that the abstract should explicitly indicate the scaling mechanism supporting the transferability claim. The full manuscript (Sections 3.2 and 3.3) describes that relational observations-actions tokenization encodes inputs as unordered sets of variable cardinality, while the differential attention mechanism computes attention weights over these sets without fixed-dimensional embeddings or positional encodings anchored to a maximum cluster size. This design permits native handling of different numbers of agents via set aggregation and attention, enabling zero-shot transfer to unseen cardinalities. We have revised the abstract to include a concise clause noting the set-based relational tokenization and differential attention for variable cluster sizes. We also added a brief clarification sentence in the abstract and will expand the related-work discussion of set transformers in the revision.

revision: yes
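The set-based argument in this simulated response rests on a general property of self-attention without positional encodings: reordering the input tokens only reorders the outputs, and the same weights apply to any number of tokens. The NumPy sketch below checks that property on random data. The function name, the sizes, and the random weights are illustrative assumptions. It demonstrates the generic property only, not HADT's tokenization or its reported transfer results.

```python
import numpy as np


def self_attention(x, wq, wk, wv):
    # Plain single-head self-attention with no positional encoding.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(1)
wq, wk, wv = (rng.normal(size=(6, 6)) for _ in range(3))

x = rng.normal(size=(4, 6))      # four agent tokens (hypothetical features)
perm = rng.permutation(4)

out = self_attention(x, wq, wk, wv)
out_perm = self_attention(x[perm], wq, wk, wv)

# Permuting the agents permutes the outputs in the same way.
equivariant = np.allclose(out_perm, out[perm])

# The same weights accept a different number of agents without any change.
out_more = self_attention(rng.normal(size=(7, 6)), wq, wk, wv)
```

Accepting a different token count is necessary for zero-shot transfer across cluster sizes but does not by itself show that a learned policy performs well at the new size; that remains an empirical question.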

Circularity Check

0 steps flagged

No circularity: architecture proposal and experimental claims are independent of self-referential reductions.

The paper proposes a novel transformer architecture (HADT) with relational tokenization and differential attention for multi-agent satellite scheduling, then reports experimental improvements and transferability to varying cluster sizes. No equations, fitted parameters, or self-citations are presented in the abstract or described claims that reduce any result to its own inputs by construction. The central claims rest on empirical evaluation against baselines rather than a derivation chain that collapses to definitions or prior self-work. This is the standard non-circular case for an applied ML architecture paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities described.

pith-pipeline@v0.9.1-grok · 5746 in / 1003 out tokens · 14644 ms · 2026-06-28T22:39:27.170683+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 3 canonical work pages · 1 internal anchor

[1] X. Wang, G. Wu, L. Xing, and W. Pedrycz, “Agile earth observation satellite scheduling over 20 years: Formulations, methods, and future directions,” IEEE Systems Journal, vol. 15, no. 3, pp. 3881–3892, 2020

[2] X. Chen, G. Reinelt, G. Dai, and A. Spitz, “A mixed integer linear programming model for multi-satellite scheduling,” European Journal of Operational Research, vol. 275, no. 2, pp. 694–707, 2019

[3] M. Stephenson and H. Schaub, “Optimal target sequencing in the agile earth-observing satellite scheduling problem using learned dynamics,” in The AAS/AIAA Astrodynamics Specialist Conference, Big Sky, MT, USA, pp. 13–17, 2023

[4] Y. Pan, P. Wang, X. Hui, and J. Li, “Dense points aggregation for efficient and collaborative earth-imaging task planning,” in Proceedings of the 2022 5th ACAI, ACAI ’22, (New York, NY, USA), Association for Computing Machinery, 2023

[5] P. Li, H. Wang, Y. Zhang, and R. Pan, “Mission planning for distributed multiple agile earth observing satellites by attention-based deep reinforcement learning method,” Advances in Space Research, 2024

[6] X. Yang, M. Hu, and G. Huang, “Objective task matching strategy for multi-satellite imaging mission planning in complex heterogeneous scenarios,” MICML ’23, (New York, NY, USA), pp. 96–101, ACM, 2024

[7] C. Araguz, E. Bou-Balust, and E. Alarcón, “Applying autonomy to distributed satellite systems: Trends, challenges, and future prospects,” Systems Engineering, vol. 21, no. 5, pp. 401–416, 2018

[8] F. Yao, J. Li, Y. Chen, X. Chu, and B. Zhao, “Task allocation strategies for cooperative task planning of multi-autonomous satellite constellation,” Advances in Space Research, vol. 63, no. 2, pp. 1073–1084, 2019

[9] M. Cohen, A. Larkins, P. L. Semedo, and G. Burbidge, “Novasar-s low cost spaceborne sar payload design, development and deployment of a new benchmark in spaceborne radar,” in 2017 IEEE Radar Conference (RadarConf), pp. 0903–0907, IEEE, 2017

[10] J. Dong, J. Feng, and X. Tang, “Optisar-net: A cross-domain ship detection method for multi-source remote sensing data,” IEEE Transactions on Geoscience and Remote Sensing, 2024

[11] A. Alzubairi, A. Tameem, and B. Kada, “Spacecraft formation flying orbital control for earth observation mission,” Scientific African, vol. 26, p. e02391, 2024

[12] S. J. Kim, M. Kim, C.-H. Kim, and H. Choi, “Optimal surveillance scheduling for multiple sar-eo satellite train constellation,” in AIAA AVIATION FORUM AND ASCEND 2024, p. 4851, 2024

[13] A. Herrmann, M. A. Stephenson, and H. Schaub, “Single-agent reinforcement learning for scalable earth-observing satellite constellation operations,” Journal of Spacecraft and Rockets, vol. 61, no. 1, pp. 114–132, 2024

[14] M. A. Stephenson and H. Schaub, “Bsk-rl: Modular, high-fidelity reinforcement learning environments for spacecraft tasking,” in 75th International Astronautical Congress, Milan, Italy, IAF, 2024

[15] C. Tang, Y. Chen, G. Chen, L. Du, and H. Liu, “A dynamic and collaborative spectrum sharing strategy based on multi-agent drl in satellite-terrestrial converged networks,” IEEE Transactions on Vehicular Technology, 2024

[16] A. Herrmann, M. Stephenson, and H. Schaub, “Reinforcement learning for multi-satellite agile earth observing scheduling under various communication assumptions,” in AAS Rocky Mountain GN&C Conference, 2023

[17] M. Stephenson and H. Schaub, “Reinforcement learning for earth-observing satellite autonomy with event-based task intervals,” in AAS Rocky Mountain GN&C Conference, Breckenridge, CO, 2024

[18] Z. Ning and L. Xie, “A survey on multi-agent reinforcement learning and its application,” Journal of Automation and Intelligence, 2024

[19] M. A. Hady, S. Hu, M. Pratama, Z. Cao, and R. Kowalczyk, “Multi-agent reinforcement learning for resources allocation optimization: a survey,” Artificial Intelligence Review, vol. 58, no. 11, p. 354, 2025

[20] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu, “The surprising effectiveness of ppo in cooperative multi-agent games,” Advances in NeurIPS, vol. 35, pp. 24611–24624, 2022

[21] Y. Zhong, J. G. Kuba, X. Feng, S. Hu, J. Ji, and Y. Yang, “Heterogeneous-agent reinforcement learning,” JMLR, vol. 25, pp. 1–67, 2024

[22] M. Stephenson, L. Mantovani, S. Phillips, and H. Schaub, “Using enhanced simulation environments to accelerate reinforcement learning for long-duration satellite autonomy,” in AIAA SCITECH 2024 Forum, p. 0990, 2024

[23] J. Schulman et al., “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

[24] M. Wen, J. Kuba, R. Lin, W. Zhang, Y. Wen, J. Wang, and Y. Yang, “Multi-agent reinforcement learning is a sequence modeling problem,” Advances in Neural Information Processing Systems, vol. 35, pp. 16509–16521, 2022

[25] S. Hu, F. Zhu, X. Chang, and X. Liang, “Updet: Universal multi-agent reinforcement learning via policy decoupling with transformers,” arXiv preprint arXiv:2101.08001, 2021

[26] T. Ye, L. Dong, Y. Xia, Y. Sun, Y. Zhu, G. Huang, and F. Wei, “Differential transformer,” arXiv preprint arXiv:2410.05258, 2024

[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017
