GUIDE: Guided Updates for In-context Decision Evolution in LLM-Driven Spacecraft Operations
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-14 21:47 UTC · model grok-4.3
The pith
LLM agents for spacecraft operations improve performance across episodes by evolving a structured playbook of natural-language decision rules through offline reflection, without any weight updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GUIDE is a non-parametric framework in which an LLM agent maintains a structured, state-conditioned playbook of natural-language decision rules. A lightweight acting model performs real-time control based on the current playbook, while an offline reflection step analyzes completed trajectories to produce improved rules for future episodes. When evaluated on an adversarial orbital interception task in the Kerbal Space Program environment, the evolved playbook consistently yields higher performance than static baseline prompts.
What carries the argument
The state-conditioned playbook of natural-language decision rules, which is updated offline from prior trajectories to guide the lightweight acting model's real-time choices.
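To make that division of labor concrete, here is a minimal sketch of the act/reflect loop in Python. Everything in it is an illustrative assumption: the abstract does not specify the prompts, rule representation, or update operator, so llm_act, llm_reflect, and rule.matches are hypothetical stand-ins, not the paper's implementation.

def run_episode(env, playbook, llm_act):
    # Real-time control: the lightweight acting model chooses each action
    # conditioned on the observation and the rules whose conditions match it.
    trajectory, obs, done = [], env.reset(), False
    while not done:
        active = [r for r in playbook if r.matches(obs)]  # state-conditioned lookup
        action = llm_act(observation=obs, rules=active)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory

def evolve_playbook(playbook, trajectories, llm_reflect):
    # Offline reflection: analyze completed trajectories and emit a revised
    # set of natural-language rules. No model weights are updated.
    return llm_reflect(playbook=playbook, trajectories=trajectories)

def guide(env, playbook, llm_act, llm_reflect, episodes=10):
    history = []
    for _ in range(episodes):
        history.append(run_episode(env, playbook, llm_act))
        playbook = evolve_playbook(playbook, history, llm_reflect)
    return playbook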
If this is right
- Real-time spacecraft control can adapt to changing conditions across missions without retraining model weights.
- Natural-language rules serve as an interpretable medium for policy search in closed-loop interaction.
- The separation of lightweight acting and offline reflection allows repeated improvement while keeping onboard computation light.
- Performance gains appear in adversarial settings where static prompts fail to adjust to opponent behavior.
Where Pith is reading between the lines
- The same playbook-evolution pattern could extend to other real-time control domains such as autonomous vehicles or robotic manipulation where weight updates are costly.
- If the rules remain human-readable, operators could inspect or manually edit the playbook to inject domain knowledge between episodes.
- The approach raises the question of whether similar non-parametric evolution could replace some forms of reinforcement learning in language-conditioned agents.
Load-bearing premise
Offline reflection on past trajectories can reliably generate improved natural-language decision rules that the lightweight acting model will follow effectively in new episodes.
What would settle it
Run the evolved playbook on the orbital interception task over a fresh set of episodes. If its success rate or interception time is no better than the static-prompt baseline, the claim of cross-episode improvement fails; a consistent advantage over matched episodes would confirm it.
Original abstract
Large language models (LLMs) have been proposed as supervisory agents for spacecraft operations, but existing approaches rely on static prompting and do not improve across repeated executions. We introduce GUIDE, a non-parametric policy improvement framework that enables cross-episode adaptation without weight updates by evolving a structured, state-conditioned playbook of natural-language decision rules. A lightweight acting model performs real-time control, while offline reflection updates the playbook from prior trajectories. Evaluated on an adversarial orbital interception task in the Kerbal Space Program Differential Games environment, GUIDE's evolution consistently outperforms static baselines. Results indicate that context evolution in LLM agents functions as policy search over structured decision rules in real-time closed-loop spacecraft interaction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GUIDE, a non-parametric framework for evolving a state-conditioned playbook of natural-language decision rules via offline reflection on trajectories. A lightweight LLM performs real-time closed-loop control while the playbook is updated across episodes without weight changes. Evaluated on an adversarial orbital interception task in the Kerbal Space Program Differential Games environment, the method is claimed to consistently outperform static baselines, with results interpreted as evidence that context evolution functions as policy search over structured decision rules.
Significance. If the central empirical claims hold after detailed validation, the work could contribute to adaptive LLM agents for real-time control tasks by showing cross-episode improvement through natural-language rule evolution rather than parameter updates. This framing of in-context learning as policy search over interpretable rules has potential relevance for multi-agent systems and autonomous operations where retraining is costly.
Major comments (3)
- Abstract: the assertion of 'consistent outperformance over static baselines' is unsupported by any quantitative metrics, statistical details, baseline descriptions, error bars, or ablation results, which directly undermines the central empirical claim of policy improvement.
- Evaluation section (implied by abstract): no description is given of the reflection prompt, the rule-update operator, or any metric that isolates playbook quality from raw performance gains, leaving open whether improvements arise from genuine structured rule evolution or from confounds such as longer context or repeated prompting.
- Abstract and methods: the interpretation that 'context evolution functions as policy search over structured decision rules' rests on the unverified assumption that offline reflection reliably produces state-conditioned rules that the acting model can execute effectively; without isolating experiments or rule-quality metrics, this remains an unsupported inference.
Minor comments (1)
- Abstract: the phrase 'non-parametric policy improvement framework' should be defined more precisely to clarify its distinction from standard few-shot or chain-of-thought prompting.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where the empirical claims require stronger quantitative support and methodological transparency. We address each point below and will incorporate the suggested changes in the revised manuscript.
Point-by-point responses
- Referee: Abstract: the assertion of 'consistent outperformance over static baselines' is unsupported by any quantitative metrics, statistical details, baseline descriptions, error bars, or ablation results, which directly undermines the central empirical claim of policy improvement.
Authors: We agree that the abstract currently states the performance claim without accompanying quantitative details. In the revision we will add specific metrics (success rate, mean interception time, standard deviation), baseline descriptions, and a brief reference to statistical significance. Full error bars, ablation tables, and statistical tests will be presented in the evaluation section with a cross-reference from the abstract. revision: yes
- Referee: Evaluation section (implied by abstract): no description is given of the reflection prompt, the rule-update operator, or any metric that isolates playbook quality from raw performance gains, leaving open whether improvements arise from genuine structured rule evolution or from confounds such as longer context or repeated prompting.
Authors: We acknowledge that the main text does not currently provide sufficient detail on these elements. We will expand Section 4 to include the exact reflection prompt template, a formal description of the rule-update operator, and a new playbook-quality metric (rule adherence rate measured on held-out states). We will also add an ablation that holds context length fixed while varying the presence of the evolved playbook, thereby isolating the contribution of structured rule evolution from simple context growth. revision: yes
- Referee: Abstract and methods: the interpretation that 'context evolution functions as policy search over structured decision rules' rests on the unverified assumption that offline reflection reliably produces state-conditioned rules that the acting model can execute effectively; without isolating experiments or rule-quality metrics, this remains an unsupported inference.
Authors: We recognize that the current evidence for this interpretation is indirect. We will add a dedicated subsection that reports rule-quality metrics (human-rated coherence and executability scores) and an isolating experiment in which the acting model is given only the evolved rules versus a control set of randomly generated rules. These results will be used to support or qualify the policy-search framing in both the abstract and discussion. revision: yes
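A sketch of what the context-length-controlled ablation promised in the second response above could look like, in Python. The tokenizer interface, the run_scored_episode helper, and the filler scheme are assumptions for illustration, not details from the paper.

def length_matched_filler(playbook_text, filler_source, tokenizer):
    # Truncate neutral filler text to exactly the playbook's token budget,
    # so both conditions present the acting model with equal-length contexts.
    budget = len(tokenizer.encode(playbook_text))
    return tokenizer.decode(tokenizer.encode(filler_source)[:budget])

def context_ablation(env, llm_act, playbook_text, filler_source, tokenizer,
                     episodes=20):
    filler = length_matched_filler(playbook_text, filler_source, tokenizer)
    results = {"evolved playbook": [], "length-matched filler": []}
    for label, context in (("evolved playbook", playbook_text),
                           ("length-matched filler", filler)):
        for _ in range(episodes):
            # run_scored_episode: hypothetical helper returning one episode's score
            results[label].append(run_scored_episode(env, llm_act, context))
    return results  # compare condition means; any gap isolates rule content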
Circularity Check
No significant circularity in the derivation chain
Full rationale
The paper presents GUIDE as an empirical framework that evolves natural-language decision rules via offline reflection on trajectories and evaluates performance gains against static baselines in a closed-loop simulation. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the provided text; the central claim rests on experimental comparisons rather than on any reduction of outputs to inputs by construction. The derivation chain is therefore checked against external benchmarks rather than closing on itself.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: LLMs can reliably extract and improve decision rules from prior trajectories through reflection.
Invented entities (1)
- GUIDE playbook of state-conditioned natural-language decision rules (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "evolving a structured, state-conditioned playbook of natural-language decision rules... offline reflection updates the playbook from prior trajectories"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "UCB1 rule score_k + c sqrt(log N / n_k)"
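For concreteness, the quoted UCB1 rule picks the option k maximizing score_k + c * sqrt(ln N / n_k), where N is the total number of plays and n_k the number of plays of option k. A minimal Python sketch, assuming it is used to select among playbook versions treated as bandit arms (this excerpt does not show its exact role in the paper):

import math

def ucb1_select(mean_scores, counts, c=math.sqrt(2)):
    # Return the index maximizing mean_score_k + c * sqrt(ln N / n_k).
    # Unplayed arms get an infinite bonus, so each is tried at least once.
    N = sum(counts)
    best_k, best_val = 0, float("-inf")
    for k, (mean, n) in enumerate(zip(mean_scores, counts)):
        val = float("inf") if n == 0 else mean + c * math.sqrt(math.log(N) / n)
        if val > best_val:
            best_k, best_val = k, val
    return best_k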
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Ross E. Allen, Yaron Rachlin, Jessica Ruprecht, Sean Loughran, Jacob Varey, and Herbert Viggh. SpaceGym: Discrete and differential games in non-cooperative space operations. In 2023 IEEE Aerospace Conference, pages 1–12, 2023.
- [2] Alejandro Carrasco, Marco Nedungadi, Victor Rodriguez-Fernandez, and Richard Linares. Visual Language Models as Operator Agents in the Space Domain. AIAA, 2025.
- [3] Alejandro Carrasco, Victor Rodriguez-Fernandez, and Richard Linares. Large language models as autonomous spacecraft operators in Kerbal Space Program. Advances in Space Research, 76(6):3480–3497, 2025.
- [4] Julian Coda-Forno, Marcel Binz, Zeynep Akata, Matthew Botvinick, Jane X. Wang, and Eric Schulz. Meta-in-context learning in large language models, 2023.
- [5] Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can GPT learn in-context? Language models implicitly perform gradient descent as meta-optimizers, 2023.
- [6] Andrew Harris, Thibaud Teil, and Hanspeter Schaub. Spacecraft decision-making autonomy using deep reinforcement learning. In AAS/AIAA Astrodynamics Specialist Conference, number AAS 19-447 in Advances in the Astronautical Sciences, Portland, Oregon, USA, 2019. American Astronautical Society.
- [7] Tu Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, et al. Controlled self-evolution for algorithmic code optimization,
- [8] Chengye Li, Haiyun Liu, and Yuanxi Li. Brewing knowledge in context: Distillation perspectives on in-context learning,
- [9] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative refinement with self-feedback, 2023.
- [10] Cameron Mehlman, Joseph Abramov, and Gregory Falco. Cat-and-mouse satellite dynamics: Divergent adversarial reinforcement learning for contested multi-agent space operations. arXiv preprint arXiv:2409.17443, 2024.
- [11]
- [12] Victor Rodriguez-Fernandez, Alejandro Carrasco, Jason Cheng, Eli Scharf, Peng Mun Siew, and Richard Linares. Language models are spacecraft operators. arXiv preprint arXiv:2404.00413, 2024.
- [13] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.
- [14] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents,
- [15] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models, 2022.
- [16] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models, 2026.
- [17] Yuqing Zhou, Zhuoer Wang, Jie Yuan, Hong Wang, Samson Koelle, Ziwei Zhu, and Wei Niu. Wise-Flow: Workflow-induced structured experience for self-evolving conversational service agents, 2026.
Supplementary material (GUIDE: Guided Updates for In-context Decision Evolution in LLM-Driven Spacecraft Operations, accepted at CVPR 2026 AI4Space Workshop)
- [18] Per-Version Performance Statistics
7.1. LG4 | Passive Lady, Active Guard
Table 3. LG4 per-version statistics.
Version   Mean score   d̄_Lady (m)   d̄_Guard (m)
v0        4.15×10^5    237.3        13.3
v1        7.22×10^4    28.1         17.6
v2        3.27×10^5    192.7        15.9
v3        7.72×10^4    36.0         18.4
v4        3.19×10^5    173.9        16.7
7.2. LG5 | Passive Lady, Faster Active Guard
Table 4. LG5 per-version statistics. Version Me...
- [19] GUIDE Playbook Structure and Examples
id: <unique id, e.g. guard-avoidance-00001>
section: <guard avoidance|approach|...>
type: <constraint|rule>
text: <NL instruction, 1-3 sentences>
conditions:
  time: {min: <seconds>}              // ignore early orbit phase
  guard distance: {max: <metres>}     // guard proximity trigger
  guard approaching: <bool>           // guard closing flag
  target d...
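Given this schema, rule lookup during acting presumably keeps only the entries whose conditions hold in the current state. A hypothetical matcher in Python; the min/max threshold semantics and the state field names are assumptions, not taken from the paper:

def rule_applies(rule, state):
    # Check each declared condition against the current state; entries with
    # no conditions always apply.
    cond = rule.get("conditions", {})
    if "time" in cond and state["time"] < cond["time"].get("min", 0):
        return False  # still in the early orbit phase this rule ignores
    if "guard distance" in cond and \
            state["guard_distance"] > cond["guard distance"].get("max", float("inf")):
        return False  # guard not yet within the proximity trigger
    if "guard approaching" in cond and \
            state["guard_approaching"] != cond["guard approaching"]:
        return False
    return True

def active_rules(playbook, state):
    # The natural-language rule texts handed to the acting model this step.
    return [r["text"] for r in playbook if rule_applies(r, state)]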