
arxiv: 2607.02255 · v1 · pith:WAQNRREPnew · submitted 2026-07-02 · 💻 cs.AI · cs.CL

AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents

Pith reviewed 2026-07-03 13:47 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords LLM agents · bounded memory · long-horizon planning · memory ablation · typed retrieval · strategic skills · deck-building game · agent testbed

The pith

A bounded-memory contract assembles each LLM agent decision from typed retrieval alone, keeping prompts fixed-length and allowing any memory layer to be ablated in isolation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes replacing the usual practice of appending all past observations and reflections to every prompt with a contract that builds each new decision prompt from typed retrieval only. This contract keeps context length bounded no matter how many decisions occur and makes it possible to turn individual memory components on or off without changing anything else. The contract is implemented inside a stochastic deck-building game whose runs last hundreds of decisions. In a fixed-A0 ablation inside that game, the no-store baseline wins 3 of 10 games and the condition with triggered strategic skills enabled wins 6 of 10, a difference the authors describe as directional rather than statistically decisive (Fisher exact p≈0.37). The authors release the full set of trajectories, snapshots, and scripts so others can run the same controlled comparisons.

Core claim

Memory for long-horizon LLM agents is treated as an explicit contract specifying exactly what each future decision is permitted to see; the chosen contract uses typed retrieval to assemble a fresh user message for every decision and appends no raw cross-decision transcript, thereby bounding prompt length for runs of arbitrary duration and permitting any single memory layer to be removed or added while all other conditions stay fixed.

What carries the argument

The typed retrieval mechanism that assembles a fresh, bounded prompt for each decision without any raw transcript of prior decisions.
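As a rough illustration of this mechanism, the sketch below assembles a fresh prompt from typed stores, each with a fixed budget, and skips any layer that is switched off. It is not the paper's implementation: `MemoryLayer`, `assemble_prompt`, `prompt_bound`, the word-overlap retrieval rule, the character-based budgets, and the sample entries are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryLayer:
    """One typed store with a fixed per-decision character budget (hypothetical)."""
    name: str
    budget_chars: int
    entries: list[str] = field(default_factory=list)

    def retrieve(self, state: str) -> str:
        # Toy relevance rule (assumption): keep entries that share a word with the state.
        words = set(state.lower().split())
        hits = [e for e in self.entries if words & set(e.lower().split())]
        # Truncation enforces the budget however large the store grows.
        return "\n".join(hits)[: self.budget_chars]


def assemble_prompt(state, layers, enabled, state_budget=400):
    """Build a fresh user message from the current state and enabled layers only."""
    parts = [f"[state]\n{state[:state_budget]}"]
    for layer in layers:
        if layer.name in enabled:  # ablation switch: a disabled layer is simply skipped
            parts.append(f"[{layer.name}]\n{layer.retrieve(state)}")
    return "\n\n".join(parts)


def prompt_bound(layers, enabled, state_budget=400):
    """Worst-case prompt length in characters, independent of run length."""
    bound = len("[state]\n") + state_budget
    for layer in layers:
        if layer.name in enabled:
            bound += len("\n\n") + len(f"[{layer.name}]\n") + layer.budget_chars
    return bound


# Hypothetical sample data, not taken from the released trajectories.
layers = [
    MemoryLayer("skills", 200, ["block early when the elite attacks"]),
    MemoryLayer("notes", 200, ["the elite on this floor hits hard"]),
]
state = "floor 7 elite fight with two block cards in hand"
full = assemble_prompt(state, layers, {"skills", "notes"})
ablated = assemble_prompt(state, layers, {"notes"})  # skill layer removed, rest unchanged
```

Because no transcript is carried between calls, the prompt length is capped by `prompt_bound` regardless of how many decisions precede it, and removing `"skills"` from `enabled` leaves the `[state]` and `[notes]` sections byte-for-byte identical. A real harness would presumably budget in tokens rather than characters.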

If this is right

  • Any single memory layer can be added or removed without changing prompt length or other layers.
  • The contribution of a particular skill or memory type can be measured by direct ablation while holding everything else constant.
  • The same test harness can be run for arbitrarily long sequences of decisions without context growth.
  • Different backbone models can be compared under identical memory contracts rather than under accumulating transcripts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contract could be applied to non-game long-horizon tasks such as multi-step planning or tool-use chains where raw history quickly exceeds token limits.
  • Systematic ablation across many memory types might identify which layers matter most for different classes of agent problems.
  • The released trajectories supply a reusable baseline for testing whether other memory architectures produce larger or smaller gains than the skill layer examined here.

Load-bearing premise

Typed retrieval can supply enough decision-relevant context without ever including raw transcripts of earlier decisions.

What would settle it

An experiment in which agents using the bounded contract consistently fail on tasks that require integrating information across many past decisions in ways that typed retrieval cannot reconstruct.

Original abstract

Memory for a long-horizon LLM agent is a contract about what each future decision is allowed to see. The simplest contract appends past observations, tool calls, and reflections to every prompt, which makes prior context easy to access but also turns it into a jumbled mixture in which the effect of any single memory component is hard to isolate. We introduce and instrument an alternative bounded contract: every decision is made from a fresh user message assembled by typed retrieval, with no raw cross-decision transcript appended. The prompt thus stays bounded across runs of any length, and any single layer can be ablated in isolation. We instantiate the contract in Slay the Spire 2, a closed-rule stochastic deck-building game whose runs require hundreds of tactical and strategic decisions. A public online benchmark of frontier LLMs on the same game reports zero wins at the lowest difficulty across five configurations, and the developer-reported human win rate at the same difficulty is 16%; the task is hard but not saturated. Within our harness, a fixed-A0 ablation shows the largest observed difference when triggered strategic skills are enabled: the no-store baseline wins 3/10 games and adding the skill layer 6/10. At this sample size the comparison is directional rather than statistically decisive (Fisher exact p≈0.37); a cross-backbone probe and public accumulating-context baselines are reported as operational comparisons rather than controlled tests of the contract variable itself. We release a reproducible testbed: 298 completed trajectories with condition tags, frozen memory/skill snapshots, prompt records, and analysis scripts -- an agent design and a validated, reusable methodology for studying how explicit memory layers shape long-horizon LLM-agent decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a bounded-memory contract for long-horizon LLM agents in which each decision prompt is assembled via typed retrieval rather than by appending raw cross-decision transcripts. This contract is instantiated in the stochastic deck-building game Slay the Spire 2; within a fixed-A0 ablation the no-store baseline wins 3/10 games while adding a triggered strategic-skill layer raises the win rate to 6/10 (Fisher exact p≈0.37). The authors release 298 tagged trajectories, frozen memory/skill snapshots, prompt records, and analysis scripts, positioning the work as a reusable testbed and methodology for isolating the effects of explicit memory layers.

Significance. If the typed-retrieval mechanism supplies adequate decision-relevant context, the testbed supplies a concrete, reproducible platform for controlled ablation of memory components in tasks that require hundreds of sequential decisions. The public release of full trajectories and scripts is a clear methodological strength that would allow other researchers to verify or extend the reported comparisons.

major comments (2)
  1. [Experimental results (fixed-A0 ablation)] Experimental results (fixed-A0 ablation paragraph): the reported 3/10 vs. 6/10 difference rests on only ten games per arm. With this sample size the Fisher exact test yields p≈0.37 and the paper itself describes the result as directional; a power calculation or larger replication would be needed before the difference can be treated as evidence that the skill layer improves performance under the bounded contract.
  2. [Methods (typed retrieval contract)] Methods (typed retrieval contract definition): the claim that the bounded contract permits meaningful isolated ablations presupposes that typed retrieval alone supplies sufficient context (deck state, prior strategic patterns, etc.) without raw transcripts. No direct verification of this sufficiency—such as a human rating of retrieved-context adequacy or an ablation that adds back raw transcripts while keeping the contract otherwise fixed—is reported, so the observed win-rate difference could be confounded by context starvation rather than by the memory layers under test.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'public accumulating-context baselines are reported as operational comparisons rather than controlled tests' is clear in intent but would benefit from a one-sentence parenthetical stating the precise difference in prompt construction between those baselines and the main harness.
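
The p-value quoted in major comment 1 can be checked by hand. The sketch below is a small standard-library implementation of the two-sided Fisher exact test applied to the 3/10 versus 6/10 table; the function name `fisher_exact_two_sided` is chosen for this example and is not taken from the released analysis scripts.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k successes in the first row.
        return comb(row1, k) * comb(row2, col1 - k) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is no more likely than the observed one.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


# Rows: no-store baseline (3 wins, 7 losses), skill layer (6 wins, 4 losses).
p = fisher_exact_two_sided(3, 7, 6, 4)
print(round(p, 2))  # 0.37
```

The exact value is 62120/167960, which matches the reported p≈0.37 and supports the referee's reading that ten games per arm cannot separate the two conditions.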

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on statistical robustness and the assumptions underlying the typed-retrieval contract. We address each major comment below and indicate planned revisions where appropriate.

Point-by-point responses
  1. Referee: Experimental results (fixed-A0 ablation): the reported 3/10 vs. 6/10 difference rests on only ten games per arm. With this sample size the Fisher exact test yields p≈0.37 and the paper itself describes the result as directional; a power calculation or larger replication would be needed before the difference can be treated as evidence that the skill layer improves performance under the bounded contract.

    Authors: We agree that n=10 per arm yields only directional evidence, which is already stated explicitly in the manuscript. In revision we will add a post-hoc power calculation using the observed effect size to guide future work. The public release of 298 trajectories and analysis scripts is intended to support exactly the larger replications the referee recommends. revision: partial

  2. Referee: Methods (typed retrieval contract definition): the claim that the bounded contract permits meaningful isolated ablations presupposes that typed retrieval alone supplies sufficient context (deck state, prior strategic patterns, etc.) without raw transcripts. No direct verification of this sufficiency—such as a human rating of retrieved-context adequacy or an ablation that adds back raw transcripts while keeping the contract otherwise fixed—is reported, so the observed win-rate difference could be confounded by context starvation rather than by the memory layers under test.

    Authors: The testbed is defined as a platform for controlled ablations under the typed-retrieval contract; it does not claim the contract is universally sufficient. We did not conduct human adequacy ratings or add-back ablations in the present study. All prompt records are released to enable such follow-up work. We will add an explicit limitations paragraph acknowledging this assumption. revision: partial
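
To give a sense of the replication size discussed in the first response, the sketch below estimates games per arm for a two-proportion comparison. It rests on assumptions that are not in the paper: the observed rates 0.3 and 0.6 are treated as the true rates, the test is a two-sided normal-approximation z-test rather than Fisher's exact test, and the significance level 0.05 and power 0.80 are conventional defaults. The function name `n_per_arm` is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sided two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_b = NormalDist().inv_cdf(power)          # quantile for the target power
    p_bar = (p1 + p2) / 2                      # pooled rate under the null
    num = z_a * sqrt(2 * p_bar * (1 - p_bar)) + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return ceil(num ** 2 / (p1 - p2) ** 2)


n = n_per_arm(0.3, 0.6)
print(n)  # 42
```

Under these assumptions roughly 42 games per arm would be needed, about four times the current ten. An exact-test or simulation-based calculation would give a somewhat different figure, so this is an order-of-magnitude guide only.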

Circularity Check

0 steps flagged

No circularity: empirical win-rate results on external game with no derivations or self-referential reductions

Full rationale

The paper introduces a bounded-memory contract for LLM agents and reports empirical results (win counts in Slay the Spire 2) from a public testbed. No equations, derivations, fitted parameters, or self-citations are invoked to support a central claim that reduces to its own inputs by construction. The reported 3/10 vs 6/10 comparison is presented as directional empirical observation, not as a prediction derived from prior fitted values or definitions within the paper. The work is self-contained against external game benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that Slay the Spire 2 is a suitable long-horizon stochastic environment and that typed retrieval can be implemented to isolate memory effects; no free parameters or invented entities are quantified in the abstract.

axioms (1)
  • domain assumption: Slay the Spire 2 runs require hundreds of tactical and strategic decisions, and the game is hard but not saturated for frontier LLMs.
    Invoked to position the game as a valid testbed.
invented entities (1)
  • typed retrieval contract (no independent evidence)
    purpose: Assembles fresh bounded prompts for each decision.
    Core mechanism introduced to replace accumulating context.

pith-pipeline@v0.9.1-grok · 5873 in / 1251 out tokens · 46506 ms · 2026-07-03T13:47:11.601996+00:00 · methodology

