Reflection of Episodes: Learning to Play Game from Expert and Self Experiences
Pith reviewed 2026-05-23 03:07 UTC · model grok-4.3
The pith
A reflection framework lets LLMs improve at StarCraft II by turning completed games into new self-experience using expert examples and keyframe summaries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Reflection of Episodes framework first extracts key game information through a keyframe selection method, then makes decisions by consulting both expert experience and accumulated self-experience. After each game ends, it reflects on the prior experience to generate new self-experience. The paper reports that this loop enables the LLM to defeat the built-in AI at Very Hard difficulty in TextStarCraft II.
What carries the argument
The Reflection of Episodes (ROE) framework, which uses keyframe selection to summarize game states and post-game reflection to convert expert and self-experience into updated self-experience for future decisions.
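As a rough illustration of the loop described above, the following Python sketch wires keyframe selection, experience-conditioned decisions, and post-game reflection together. Everything in it is an assumption made for illustration: `ExperienceStore`, `run_episode`, and `run_roe` are hypothetical names, the LLM and the game are replaced by callables supplied by the caller, and the point at which keyframe selection runs is a guess, since this page does not describe the paper's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ExperienceStore:
    """Fixed expert experience plus self-experience that grows after each game."""
    expert: str
    self_experience: List[str] = field(default_factory=list)

    def as_prompt(self) -> str:
        # Only the most recent reflection is shown here (an assumption).
        latest = self.self_experience[-1] if self.self_experience else "(none yet)"
        return f"Expert experience:\n{self.expert}\n\nSelf-experience:\n{latest}"


def run_episode(
    observations: List[str],
    select_keyframes: Callable[[List[str]], List[str]],
    decide: Callable[[str, str], str],
    store: ExperienceStore,
) -> Tuple[List[str], List[str]]:
    # Condense the raw observations, then decide on each keyframe using
    # both expert experience and accumulated self-experience.
    keyframes = select_keyframes(observations)
    actions = [decide(frame, store.as_prompt()) for frame in keyframes]
    return keyframes, actions


def run_roe(
    episodes: List[List[str]],
    select_keyframes: Callable[[List[str]], List[str]],
    decide: Callable[[str, str], str],
    reflect: Callable[[str, List[str], List[str]], str],
    store: ExperienceStore,
) -> ExperienceStore:
    for observations in episodes:
        keyframes, actions = run_episode(observations, select_keyframes, decide, store)
        # Post-game reflection turns prior experience into new self-experience.
        store.self_experience.append(reflect(store.as_prompt(), keyframes, actions))
    return store
```

In this sketch the only state carried between games is the text in `ExperienceStore`, which mirrors the claim that improvement comes from reflection rather than weight updates.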
If this is right
- The LLM can iteratively improve its policy in a complex RTS environment solely through post-game reflection rather than additional supervised fine-tuning.
- Keyframe selection supplies adequate state information for the model to make effective mid-game choices without processing entire replays.
- Expert experience combined with self-generated experience produces measurable gains against a fixed very hard opponent.
- The same loop can be applied to other sequential decision tasks where full histories exceed context length.
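This page does not say how keyframes are chosen. One simple possibility, shown purely as an assumption, is to keep a frame whenever the tracked game-state counters have changed by more than a threshold relative to the last kept frame. The function name, the change metric, and the default threshold below are all hypothetical.

```python
from typing import Dict, List

Frame = Dict[str, float]


def select_keyframes(frames: List[Frame], threshold: float = 0.2) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame."""
    if not frames:
        return []
    kept = [0]  # always keep the first frame as a reference point
    for i in range(1, len(frames)):
        last = frames[kept[-1]]
        cur = frames[i]
        # Relative change summed over all tracked counters (hypothetical metric).
        change = sum(
            abs(cur.get(k, 0.0) - last.get(k, 0.0)) / max(abs(last.get(k, 0.0)), 1.0)
            for k in set(cur) | set(last)
        )
        if change > threshold:
            kept.append(i)
    return kept
```

A filter like this keeps the prompt short when the game state is stable and keeps more frames during rapid changes, which is the property the context-length argument above relies on.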
Where Pith is reading between the lines
- The method might generalize to other partially observable games or planning domains where agents can store and later query condensed past episodes.
- It suggests that language models can bootstrap improvement from a small set of expert traces plus their own growing memory of past outcomes.
- If keyframe selection proves robust, similar compression steps could reduce memory cost in long-horizon reinforcement learning outside games.
Load-bearing premise
That reflecting on expert and self-experience after a completed game will reliably produce new self-experience that improves the model's decisions in later games, and that the selected keyframes contain enough information to support those decisions.
What would settle it
Running the same agent for multiple additional games after the reported reflection cycles and finding no further wins against the very hard opponent, or finding that removing the reflection step leaves performance unchanged.
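One way to make this test concrete is to compare win rates with and without the reflection step over repeated games. The sketch below computes a Wilson score interval for a win rate and checks whether two intervals overlap. The helper names are hypothetical, and any game counts passed in are placeholders supplied by the reader, not results from the paper.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, games: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate (z = 1.96 gives roughly 95%)."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1.0 + z * z / games
    centre = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Non-overlapping intervals for the full ROE agent and the ablated agent would be a conservative sign that reflection matters. Overlapping intervals would mean the number of games is too small to tell.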
Original abstract
StarCraft II is a complex and dynamic real-time strategy (RTS) game environment, which is very suitable for artificial intelligence and reinforcement learning research. To address the problem of Large Language Model(LLM) learning in complex environments through self-reflection, we propose a Reflection of Episodes(ROE) framework based on expert experience and self-experience. This framework first obtains key information in the game through a keyframe selection method, then makes decisions based on expert experience and self-experience. After a game is completed, it reflects on the previous experience to obtain new self-experience. Finally, in the experiment, our method beat the robot under the Very Hard difficulty in TextStarCraft II. We analyze the data of the LLM in the process of the game in detail, verified its effectiveness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Reflection of Episodes (ROE) framework in which an LLM extracts game state via keyframe selection, conditions decisions on expert experience plus accumulated self-experience, and performs post-game reflection to generate additional self-experience. The central empirical claim is that this procedure enabled the agent to defeat the Very Hard difficulty bot in TextStarCraft II.
Significance. A convincingly demonstrated ability of an LLM agent to improve via post-episode reflection in a long-horizon, partially observable RTS setting would be of interest to the community working on LLM-based agents and self-improvement loops. The paper does not yet supply the quantitative controls or ablations needed to establish that the reflection component is responsible for the reported outcome.
major comments (3)
- [Abstract] Abstract: the claim that the method 'beat the robot under the Very Hard difficulty' is presented without any win rate, number of independent trials, or baseline comparison, rendering it impossible to determine whether the result supports the effectiveness of the reflection mechanism.
- [Method] Method section (description of decision-making and experience retrieval): the paper does not specify how self-experience generated by reflection is indexed, stored, or retrieved at decision time, which is load-bearing for the assumption that post-game reflection produces usable new experience.
- [Experiments] Experiments: no ablation that removes the reflection step (or the self-experience component) is reported, so it is impossible to isolate whether the reported win depends on the ROE loop rather than expert experience alone or on a single lucky episode.
minor comments (2)
- [Abstract] Abstract: 'Large Language Model(LLM)' and 'Reflection of Episodes(ROE)' are written without a space before the opening parenthesis.
- [Experiments] The manuscript would benefit from a clear statement of the total number of games played and whether the reflection loop was active across the entire evaluation set.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which highlight important areas for improving the clarity and rigor of our presentation of the ROE framework. We address each major comment below and commit to revisions that directly respond to the concerns raised.
-
Referee: [Abstract] Abstract: the claim that the method 'beat the robot under the Very Hard difficulty' is presented without any win rate, number of independent trials, or baseline comparison, rendering it impossible to determine whether the result supports the effectiveness of the reflection mechanism.
Authors: We agree that the abstract as currently written does not provide sufficient quantitative context for the reported outcome. The full manuscript contains experimental details on the win against the Very Hard bot, but these are not summarized in the abstract. In the revision we will expand the abstract to report the win rate, the number of independent trials performed, and explicit baseline comparisons (including expert-experience-only runs) so that readers can immediately assess the strength of the result. revision: yes
-
Referee: [Method] Method section (description of decision-making and experience retrieval): the paper does not specify how self-experience generated by reflection is indexed, stored, or retrieved at decision time, which is load-bearing for the assumption that post-game reflection produces usable new experience.
Authors: This observation is correct; the current method description focuses on the overall pipeline and does not detail the storage and retrieval mechanics for the self-experience generated by reflection. We will revise the method section to specify the indexing scheme (embedding-based), storage format, and retrieval procedure (similarity search over accumulated self-experience at decision time) so that the mechanism by which reflection contributes usable experience is fully explicit. revision: yes
-
Referee: [Experiments] Experiments: no ablation that removes the reflection step (or the self-experience component) is reported, so it is impossible to isolate whether the reported win depends on the ROE loop rather than expert experience alone or on a single lucky episode.
Authors: We acknowledge that the absence of an ablation isolating the reflection/self-experience component limits the ability to attribute success specifically to the ROE loop. The current experiments demonstrate that the full framework succeeds, but do not include the requested controls. We will add ablation experiments that compare performance with and without the reflection-generated self-experience (and with expert experience alone) in the revised manuscript. revision: yes
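The second response above mentions embedding-based indexing with similarity search but gives no detail. As a minimal sketch under stated assumptions, the following stores each reflection alongside a vector and retrieves the closest entries by cosine similarity. The bag-of-words `embed` function is a stand-in chosen only to keep the example self-contained; a real system would presumably use a learned embedding model. `SelfExperienceIndex` and its methods are hypothetical names.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def embed(text: str) -> Vector:
    """Toy unit-length bag-of-words vector (placeholder for a real embedding)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def cosine(a: Vector, b: Vector) -> float:
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(v * b.get(w, 0.0) for w, v in a.items())


class SelfExperienceIndex:
    """Stores reflections and returns the ones most similar to a query."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, Vector]] = []

    def add(self, reflection: str) -> None:
        self._entries.append((reflection, embed(reflection)))

    def query(self, situation: str, k: int = 1) -> List[str]:
        q = embed(situation)
        ranked = sorted(self._entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

At decision time, the agent would call `query` with a description of the current situation and place the returned reflections in the prompt next to the expert experience.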
Circularity Check
No circularity; empirical method description with no derivation chain or equations
The paper presents an empirical framework (ROE) for LLM-based gameplay in TextStarCraft II using keyframe selection, expert/self-experience, and post-game reflection. No equations, parameters, or mathematical derivations appear in the abstract or described method. The central claim is a reported experimental outcome (beating Very Hard bot) rather than a derived prediction from fitted inputs or self-referential definitions. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The method is self-contained as a procedural description without reducing to its own inputs by construction.
Forward citations
Cited by 2 Pith papers
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.
Reference graph
Works this paper leans on
-
[1]
Vinyals, O., et al. StarCraft II: A new challenge for reinforcement learning. arXiv preprint, arXiv:1708.04782 (2017)
-
[2]
Vinyals, O., Babuschkin, I., et al. AlphaStar: Mastering the real-time strategy game StarCraft II. DeepMind blog, 2, 20 (2019)
-
[3]
OpenAI ChatGPT team, https://openai.com/chatgpt/ (2022)
-
[4]
OpenAI GPT-4 team, https://openai.com/index/gpt-4/ (2023)
-
[5]
Ma, Weiyu, et al. LLMs play StarCraft II: Benchmarks and a chain of summarization approach. arXiv preprint, arXiv:2312.11865 (2023)
-
[6]
Sunehag, Peter, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint, arXiv:1706.05296 (2017)
-
[7]
Rashid, Tabish, et al. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research 21.178: 1-51 (2020)
-
[8]
Rashid, Tabish, et al. Weighted QMIX: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning. Advances in Neural Information Processing Systems 33: 10199-10210 (2020)
-
[9]
Yu, Chao, et al. The surprising effectiveness of PPO in cooperative multi-agent games. Advances in Neural Information Processing Systems 35: 24611-24624 (2022)
-
[10]
Lowe, Ryan, et al. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems 30 (2017)
-
[11]
Liu, Ruo-Ze, et al. On efficient reinforcement learning for full-length game of StarCraft II. Journal of Artificial Intelligence Research 75: 213-260 (2022)
-
[12]
Anthropic Claude-2 team, https://www.anthropic.com/news/claude-2 (2023)
-
[13]
Meta Llama team, https://github.com/meta-llama/llama3 (2024)
-
[14]
Anil, Rohan, et al. PaLM 2 technical report. arXiv preprint, arXiv:2305.10403 (2023)
- [15]
-
[16]
Reflexion: Language Agents with Verbal Reinforcement Learning
Shinn, Noah, et al. Reflexion: Language agents with verbal reinforcement learning. arXiv preprint, arXiv:2303.11366 (2023)
-
[17]
Agent-pro: Learning to evolve via policy-level reflection and optimization
Zhang, Wenqi, et al. Agent-pro: Learning to evolve via policy-level reflection and optimization. arXiv preprint, arXiv:2402.17574 (2024).
Appendix A. All Prompts
In this appendix, we show all the prompts used during the experiment, including system prompts, reflection prompts, etc.
A.1 System Prompt
You are an AI train...
-
[18]
Game Overview: Provide a brief overview of the current situation based on all the rounds
-
[19]
Current Game Stage: Determine the stage of the game based on the information of all rounds. Is it the early game, mid-game, or late game?
-
[20]
Our Situation: Describe our current status in terms of: 3.1 Units and Buildings: Analyze the state of our units and buildings. 3.2 Economy: Evaluate our economic condition, including resource collection and usage. 3.3 Technology: Describe the status of our technological research and what technologies we have unlocked so far. Analyze our technology tree, i...
-
[21]
Enemy's Strategy: Infer the enemy's potential strategy, based on the available information
-
[22]
Key Information: Highlight the most important aspects from all rounds that have significantly influenced the game. {self.race_specific_prompt.get(self.race)} These are the lessons given by experts based on previous matches to help you play the game: {self.last_reflection} Here are some tips to help you analyze the game stage.In subsequent analysis, you ne...
-
[23]
**Opening Build Order**: - Ensure you followed a standard and efficient build order. - Check the timing of your first Pylon, Gateway, Assimilator, and Cybernetics Core
-
[24]
**Economy Management**: - Monitor your worker production. Aim for constant Probes production. - Check your expansion timing. A typical timing for your natural expansion is around 2:30 to 3:00 minutes
-
[25]
**Scouting**: - Review your scouting efforts. Did you scout the enemy base early to see their build order? - Did you send a Probe or use an Observer to gather information about the enemy’s tech and army composition?
-
[26]
**Macro Management**: - Check your resources. Avoid floating too many minerals and gas. - Ensure you're continuously producing units and expanding your infrastructure (Gateways, Robotics Facilities, etc.)
-
[27]
**Micro Management**: - Watch your army engagements. Did you control your units effectively during battles? - Pay attention to spell usage, positioning, and focus fire
-
[28]
**Army Composition**: - Evaluate your unit composition relative to the enemy’s. Did you have the right counters? - Ensure you tech up appropriately and adjust your unit mix based on what the opponent is building
-
[29]
**Upgrades**: - Check your upgrades timing. Upgrades can significantly affect the outcome of battles. - Ensure you research crucial upgrades like Warp Gate, Blink, and attack/armor upgrades
-
[30]
**Decision Making**: - Review your decisions throughout the game. Did you expand at the right times? - Did you make effective use of harassment (e.g., Warp Prism drops) to disrupt the opponent’s economy? Then, After reviewing these aspects, make a list of key mistakes and areas for improvement. Here are some common points to look for: - Delayed expansion ...
-
[31]
**Opening Build Order**:
-
[32]
**Economy Management**:
-
[33]
**Macro Management**:
-
[34]
**Micro Management**:
-
[35]
**Army Composition**:
-
[36]
**Decision Making**:
-
[37]
**Key time point and recommendation**(At least five, specific time point from time 0:00(important) to finish): """
Reflection Prompt
Fig.A3. Our Reflection prompt.
Appendix B. Reflection Iterations
In this appendix, we will show the reflections and changes generated during the experiment of our method under Very Hard built-in AI. The marked part is the pa...
-
[38]
**Opening Build Order**: - Start the game by immediately building a Probe. - Construct a Pylon near your mineral line to avoid supply blockages. - Establish an early gateway to initiate unit production
-
[39]
**Economy Management**: - Focus on continuous Probe production to maximize mineral and gas income. - Expand your economy by building additional Nexuses at optimal expansion timings. - Allocate resources efficiently between worker production and infrastructure development
-
[40]
**Scouting**: - Send out early scouting Probes to gather information about the enemy's base and tech choices. - Utilize Observers to scout enemy movements and unit compositions. - Maintain map control with Zealot or Stalker scouts to anticipate enemy strategies
-
[41]
**Macro Management**: - Prioritize expanding infrastructure by adding more Gateways and tech structures like Robotics Facilities. - Ensure consistent unit production from all structures to maintain a strong army presence. - Use Chrono Boost effectively on key structures such as the Cybernetics Core and Forges for faster upgrades
-
[42]
**Micro Management**: - Improve unit control during engagements by focusing on proper positioning and target prioritization. - Utilize Blink effectively with Stalkers for tactical advantages in battles. - Practice splitting your army and managing spellcaster units like High Templars efficiently
-
[43]
**Army Composition**: - Maintain a balanced unit composition with Zealots for frontline tanking and Stalkers for ranged damage. - Consider incorporating Immortals to counter enemy armored units effectively. - Adapt your unit mix based on enemy unit compositions and tech choices
-
[44]
**Upgrades**: - Research crucial upgrades like Ground Weapons and Armor to enhance your army's combat effectiveness. - Prioritize upgrades that align with your chosen army composition for maximum efficiency. - Continuously upgrade tech structures to unlock advanced units and abilities
-
[45]
**Decision Making**: - Make timely decisions to expand strategically at appropriate timings to boost your economy. - Assess enemy scouting information to adapt your army composition and defensive structures. - Use harassment tactics like Warp Prism drops to disrupt the opponent's economy and keep them on the back foot
-
[46]
Expert Reflection Fig.B1. Expert Reflection with our method against Very Hard built-in AI
**Key time points and recommendations**: - 00:30: Start the game by training a Probe and immediately building a Pylon to avoid supply blockages. - 03:45: Expand your economy by constructing additional Nexuses and assimilators to boost your income. - 07:00: Focus on researching essential upgrades like Warpgate technology and Protoss weapons/armor for stron...
-
[47]
**Economy Management**: - Focus on continuous Probe production to maximize resource collection and support overall economy growth. - Expand to new bases and construct additional Pylons to increase supply and resource gathering capacity. - Efficiently manage resources to avoid floating minerals and gas while maintaining a steady production of units and inf...
-
[48]
**Scouting**: - Implement a proactive scouting strategy by sending Probes or units like Zealots to gather information about the enemy's tech and army composition. - Consider using observers or other scouting units to scout the enemy base and key locations for potential threats. - Adapt your strategy based on scouting information to make informed decisions...
-
[49]
**Macro Management**: - Improve infrastructure development by constructing essential structures like Pylons, Gateways, and tech buildings in a timely manner. - Prioritize tech advancements and unit production by researching critical upgrades such as Warp Gate technology and other unit-specific enhancements. - Ensure a balanced distribution of resources be...
-
[50]
**Micro Management**: - Enhance unit control and micro techniques during engagements to optimize positioning, focus fire, and utilize unit abilities effectively. - Practice effective spell usage, proper unit positioning, and target prioritization to gain an advantage in battles. - Pay attention to unit formations, flanking maneuvers, and retreat strategie...
-
[52]
**Upgrades**: - Prioritize essential upgrades such as Warp Gate technology, attack, and armor upgrades to enhance the effectiveness of your units in combat. - Research upgrades like Blink for Stalkers and charge for Zealots to improve their combat capabilities. - Scout enemy upgrades and adjust your upgrade timings to remain competitive in battles
-
[53]
**Decision Making**: - Make effective decisions throughout the game, including timely expansions, tech advancements, and unit compositions based on scouting information. - Utilize harassment tactics like Warp Prism drops to disrupt the enemy's economy and gain a strategic advantage. - Adapt your strategy dynamically based on the evolving game state and en...
-
[54]
Self Reflection-1 Fig.B2. Self Reflection1 with our method against Very Hard built-in AI
**Key Time Points and Recommendations**: - **00:30** - Start with a standard opening build order, focus on Probe production, and begin scouting with a Probe to gather information about the enemy's strategy. - **03:00** - Prioritize expanding to additional Nexuses for increased resource income and build more Pylons to support supply cap and unit production...
-
[55]
**Economy Management**: - Focus on continuous Probe production to maximize resource collection and support overall economy growth. - Expand to new bases and construct additional Pylons to increase supply cap and facilitate unit production. - Manage resources efficiently to avoid resource floating while maintaining a steady balance between mineral and gas income
-
[56]
**Scouting**: - Implement a proactive scouting strategy by sending Probes or units like Zealots to gather information about the enemy's tech and army composition. - Consider using observers or other scouting units to gather critical intel on the enemy's base and key locations. - Adapt your strategy based on scouting information to make informed decisions ...
-
[57]
**Macro Management**: - Improve infrastructure development by constructing essential structures like Pylons, Gateways, and tech buildings in a timely manner. - Prioritize tech advancements and unit production by researching essential upgrades like Warp Gate technology and other unit-specific enhancements. - Maintain a balanced distribution of resources be...
-
[58]
**Micro Management**: - Enhance unit control and micro techniques during engagements to optimize positioning, focus fire, and utilize unit abilities effectively. - Practice effective spell usage, proper unit positioning, and target prioritization in battles to gain an advantage. - Pay attention to unit formations, flanking maneuvers, and retreat strategie...
-
[59]
**Army Composition**: - Evaluate your unit composition relative to the enemy's and adjust accordingly to have the right counters. - Tech up appropriately and diversify your unit mix based on the opponent's army composition. - Consider incorporating a mix of Zealots, Stalkers, and other unit types to create a well-rounded army capable of handling various threats
-
[60]
**Upgrades**: - Prioritize essential upgrades such as Warp Gate technology, attack, and armor upgrades to enhance unit effectiveness in combat. - Research upgrades like Blink for Stalkers and charge for Zealots to improve their combat capabilities. - Scout enemy upgrades and adjust your own upgrade timings to stay competitive in engagements
-
[61]
**Decision Making**: - Make effective decisions throughout the game, including timely expansions, tech advancements, and unit compositions based on scouting information. - Utilize harassment tactics like Warp Prism drops to disrupt the enemy's economy and gain a strategic advantage. - Adapt your strategy dynamically based on the evolving game state and en...
-
[62]
**Key Time Points and Recommendations**: - **00:00**: Start with a standard opening build order, prioritize Probe production, and begin scouting with a Probe for crucial information. - **03:00**: Expand to additional Nexuses for increased resource income and build more Pylons to support supply cap and unit production. - **06:00**: Complete Warpgate resear...