AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents
Pith reviewed 2026-07-03 13:47 UTC · model grok-4.3
The pith
A bounded-memory contract assembles each LLM agent decision from typed retrieval alone, keeping prompts fixed-length and allowing any memory layer to be ablated in isolation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Memory for long-horizon LLM agents is treated as an explicit contract specifying exactly what each future decision is permitted to see; the chosen contract uses typed retrieval to assemble a fresh user message for every decision and appends no raw cross-decision transcript, thereby bounding prompt length for runs of arbitrary duration and permitting any single memory layer to be removed or added while all other conditions stay fixed.
What carries the argument
The typed retrieval mechanism that assembles a fresh, bounded prompt for each decision without any raw transcript of prior decisions.
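The mechanism can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `MemoryLayer`, `render_layer`, `assemble_prompt`, `prompt_bound`, the character budgets, and the two demo layers are all hypothetical names and values invented for the example. It shows the two properties the claim relies on: the prompt length has a fixed upper bound regardless of how long the run is, and a layer can be dropped without touching any other section.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types and names: nothing here is taken from the paper's code.
GameState = Dict[str, object]

@dataclass(frozen=True)
class MemoryLayer:
    name: str                                   # e.g. "skills" or "notes"
    retrieve: Callable[[GameState], List[str]]  # typed retrieval for this layer
    budget_chars: int                           # hard cap on this layer's body

def render_layer(layer: MemoryLayer, state: GameState) -> str:
    """Render one layer's section, truncating its body to the layer budget."""
    body = "\n".join(layer.retrieve(state))[: layer.budget_chars]
    return f"## {layer.name}\n{body}"

def assemble_prompt(state: GameState, layers: List[MemoryLayer],
                    state_budget: int = 400) -> str:
    """Build a fresh user message from the current state and typed retrieval.

    No transcript of earlier decisions is passed in or appended.
    """
    header = f"## state\n{str(state)[:state_budget]}"
    return "\n\n".join([header] + [render_layer(l, state) for l in layers])

def prompt_bound(layers: List[MemoryLayer], state_budget: int = 400) -> int:
    """Upper bound on prompt length, independent of run length."""
    header = len("## state\n") + state_budget
    sections = sum(len(f"## {l.name}\n") + l.budget_chars for l in layers)
    separators = 2 * len(layers)  # one "\n\n" before each layer section
    return header + sections + separators

# Demo layers whose raw retrieval output grows with the floor number.
skills = MemoryLayer("skills", lambda s: [f"skill hint for floor {s['floor']}"] * 50, 300)
notes = MemoryLayer("notes", lambda s: [f"note {i}" for i in range(int(s["floor"]))], 200)

full_contract = [skills, notes]
ablated_contract = [notes]  # the skill layer removed, everything else unchanged
```

Because each section is rendered independently, ablating a layer is just a change to the list passed to `assemble_prompt`; the remaining sections are byte-for-byte identical for the same state.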
If this is right
- Any single memory layer can be added or removed without changing prompt length or other layers.
- The contribution of a particular skill or memory type can be measured by direct ablation while holding everything else constant.
- The same test harness can be run for arbitrarily long sequences of decisions without context growth.
- Different backbone models can be compared under identical memory contracts rather than under accumulating transcripts.
Where Pith is reading between the lines
- The same contract could be applied to non-game long-horizon tasks such as multi-step planning or tool-use chains where raw history quickly exceeds token limits.
- Systematic ablation across many memory types might identify which layers matter most for different classes of agent problems.
- The released trajectories supply a reusable baseline for testing whether other memory architectures produce larger or smaller gains than the skill layer examined here.
Load-bearing premise
Typed retrieval can supply enough decision-relevant context without ever including raw transcripts of earlier decisions.
What would settle it
An experiment in which agents using the bounded contract consistently fail on tasks that require integrating information across many past decisions in ways that typed retrieval cannot reconstruct.
Original abstract
Memory for a long-horizon LLM agent is a contract about what each future decision is allowed to see. The simplest contract appends past observations, tool calls, and reflections to every prompt, which makes prior context easy to access but also turns it into a jumbled mixture in which the effect of any single memory component is hard to isolate. We introduce and instrument an alternative bounded contract: every decision is made from a fresh user message assembled by typed retrieval, with no raw cross-decision transcript appended. The prompt thus stays bounded across runs of any length, and any single layer can be ablated in isolation. We instantiate the contract in Slay the Spire 2, a closed-rule stochastic deck-building game whose runs require hundreds of tactical and strategic decisions. A public online benchmark of frontier LLMs on the same game reports zero wins at the lowest difficulty across five configurations, and the developer-reported human win rate at the same difficulty is 16%; the task is hard but not saturated. Within our harness, a fixed-A0 ablation shows the largest observed difference when triggered strategic skills are enabled: the no-store baseline wins 3/10 games and adding the skill layer 6/10. At this sample size the comparison is directional rather than statistically decisive (Fisher exact p≈0.37); a cross-backbone probe and public accumulating-context baselines are reported as operational comparisons rather than controlled tests of the contract variable itself. We release a reproducible testbed: 298 completed trajectories with condition tags, frozen memory/skill snapshots, prompt records, and analysis scripts -- an agent design and a validated, reusable methodology for studying how explicit memory layers shape long-horizon LLM-agent decisions.
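The p≈0.37 quoted in the abstract can be reproduced from the two win counts alone. The sketch below assumes the common two-sided convention for the Fisher exact test, which sums the probabilities of every table no more likely than the observed one; the abstract does not say which convention the authors used, and the function name is illustrative.

```python
from math import comb

def fisher_exact_two_sided(wins_a: int, n_a: int, wins_b: int, n_b: int) -> float:
    """Two-sided Fisher exact p-value for wins_a/n_a versus wins_b/n_b."""
    total_wins = wins_a + wins_b
    denom = comb(n_a + n_b, total_wins)

    def prob(k: int) -> float:
        # Hypergeometric probability that arm A holds k of the total wins.
        return comb(n_a, k) * comb(n_b, total_wins - k) / denom

    lo = max(0, total_wins - n_b)
    hi = min(n_a, total_wins)
    p_obs = prob(wins_a)
    # Sum every table that is no more probable than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))

p = fisher_exact_two_sided(3, 10, 6, 10)  # no-store baseline vs. skill layer
print(round(p, 2))  # prints 0.37
```

With 9 total wins across 20 games, the observed split (3 vs. 6) and its mirror image (6 vs. 3) are equally likely under the null, which is why the two-sided value is exactly twice the one-sided tail.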
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a bounded-memory contract for long-horizon LLM agents in which each decision prompt is assembled via typed retrieval rather than by appending raw cross-decision transcripts. This contract is instantiated in the stochastic deck-building game Slay the Spire 2; within a fixed-A0 ablation the no-store baseline wins 3/10 games while adding a triggered strategic-skill layer raises the win rate to 6/10 (Fisher exact p≈0.37). The authors release 298 tagged trajectories, frozen memory/skill snapshots, prompt records, and analysis scripts, positioning the work as a reusable testbed and methodology for isolating the effects of explicit memory layers.
Significance. If the typed-retrieval mechanism supplies adequate decision-relevant context, the testbed provides a concrete, reproducible platform for controlled ablation of memory components in tasks that require hundreds of sequential decisions. The public release of full trajectories and scripts is a clear methodological strength that lets other researchers verify or extend the reported comparisons.
major comments (2)
- [Experimental results (fixed-A0 ablation)] Experimental results (fixed-A0 ablation paragraph): the reported 3/10 vs. 6/10 difference rests on only ten games per arm. With this sample size the Fisher exact test yields p≈0.37 and the paper itself describes the result as directional; a power calculation or larger replication would be needed before the difference can be treated as evidence that the skill layer improves performance under the bounded contract.
- [Methods (typed retrieval contract)] Methods (typed retrieval contract definition): the claim that the bounded contract permits meaningful isolated ablations presupposes that typed retrieval alone supplies sufficient context (deck state, prior strategic patterns, etc.) without raw transcripts. No direct verification of this sufficiency—such as a human rating of retrieved-context adequacy or an ablation that adds back raw transcripts while keeping the contract otherwise fixed—is reported, so the observed win-rate difference could be confounded by context starvation rather than by the memory layers under test.
minor comments (1)
- [Abstract] Abstract: the phrase 'public accumulating-context baselines are reported as operational comparisons rather than controlled tests' is clear in intent but would benefit from a one-sentence parenthetical stating the precise difference in prompt construction between those baselines and the main harness.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on statistical robustness and the assumptions underlying the typed-retrieval contract. We address each major comment below and indicate planned revisions where appropriate.
Point-by-point responses
- Referee: Experimental results (fixed-A0 ablation): the reported 3/10 vs. 6/10 difference rests on only ten games per arm. With this sample size the Fisher exact test yields p≈0.37 and the paper itself describes the result as directional; a power calculation or larger replication would be needed before the difference can be treated as evidence that the skill layer improves performance under the bounded contract.
  Authors: We agree that n=10 per arm yields only directional evidence, which is already stated explicitly in the manuscript. In revision we will add a post-hoc power calculation using the observed effect size to guide future work. The public release of 298 trajectories and analysis scripts is intended to support exactly the larger replications the referee recommends.
  Revision: partial
- Referee: Methods (typed retrieval contract definition): the claim that the bounded contract permits meaningful isolated ablations presupposes that typed retrieval alone supplies sufficient context (deck state, prior strategic patterns, etc.) without raw transcripts. No direct verification of this sufficiency—such as a human rating of retrieved-context adequacy or an ablation that adds back raw transcripts while keeping the contract otherwise fixed—is reported, so the observed win-rate difference could be confounded by context starvation rather than by the memory layers under test.
  Authors: The testbed is defined as a platform for controlled ablations under the typed-retrieval contract; it does not claim the contract is universally sufficient. We did not conduct human adequacy ratings or add-back ablations in the present study. All prompt records are released to enable such follow-up work. We will add an explicit limitations paragraph acknowledging this assumption.
  Revision: partial
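The power calculation discussed in the first response can be sketched exactly, without simulation, by enumerating every possible pair of win counts. This is an illustrative sketch, not the authors' analysis: it assumes the observed rates (0.3 and 0.6) are the true win rates, equal arm sizes, a two-sided Fisher exact test at alpha = 0.05, and independent games. All function names are hypothetical, and power computed at the observed effect should be read as a planning aid rather than as additional evidence.

```python
from math import comb

def fisher_p(a: int, b: int, n: int) -> float:
    """Two-sided Fisher exact p-value for a/n wins versus b/n wins."""
    wins = a + b
    denom = comb(2 * n, wins)
    probs = [comb(n, k) * comb(n, wins - k) / denom
             for k in range(max(0, wins - n), min(n, wins) + 1)]
    p_obs = comb(n, a) * comb(n, b) / denom
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))

def binom_pmf(k: int, n: int, rate: float) -> float:
    """Probability of exactly k wins in n independent games."""
    return comb(n, k) * rate**k * (1 - rate) ** (n - k)

def exact_power(rate_a: float, rate_b: float, n: int, alpha: float = 0.05) -> float:
    """Probability that the test rejects when the true rates are rate_a, rate_b."""
    return sum(binom_pmf(a, n, rate_a) * binom_pmf(b, n, rate_b)
               for a in range(n + 1)
               for b in range(n + 1)
               if fisher_p(a, b, n) <= alpha)

power_10 = exact_power(0.3, 0.6, 10)  # the arm size used in the paper
power_50 = exact_power(0.3, 0.6, 50)  # a hypothetical larger replication
print(f"n=10 per arm: power={power_10:.2f}; n=50 per arm: power={power_50:.2f}")
```

Sweeping `n` in `exact_power` gives a rough sense of how many games per arm a replication would need before a difference of this size could be detected reliably.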
Circularity Check
No circularity: empirical win-rate results on external game with no derivations or self-referential reductions
The paper introduces a bounded-memory contract for LLM agents and reports empirical results (win counts in Slay the Spire 2) from a public testbed. No equations, derivations, fitted parameters, or self-citations are invoked to support a central claim that reduces to its own inputs by construction. The reported 3/10 vs 6/10 comparison is presented as directional empirical observation, not as a prediction derived from prior fitted values or definitions within the paper. The work is self-contained against external game benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Slay the Spire 2 runs require hundreds of tactical and strategic decisions, and the game is hard but not saturated for frontier LLMs.
invented entities (1)
- typed retrieval contract: no independent evidence