pith. sign in

arxiv: 2606.08151 · v1 · pith:VQ4MWQ7Onew · submitted 2026-06-06 · 💻 cs.AI

Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

Pith reviewed 2026-06-27 19:45 UTC · model grok-4.3

classification 💻 cs.AI
keywords decision-aware contextmemory cardscontext selectiontool-using agentsSWE-benchcontext compressionLLM agentscounterfactual scoring
0
0 comments X

The pith

Decision-aware memory cards lift retrieval hit@1 from 0.58 to 0.78 on SWE-bench file tasks by scoring evidence on action shift and necessity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tool-using LLM agents often fail because decisive evidence is not surfaced at action time even when present. CICL builds a context graph from instance evidence and routes judgments from multiple models through one shared eight-field schema. Evidence units receive scores on action shift, outcome uplift, necessity, and negative-transfer risk, after which the highest-utility units are packed as typed memory cards inside a fixed budget. On fifty SWE-bench Verified file-retrieval instances this protocol raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790 while keeping every judgment parseable. The design keeps the measured decision signal independent of any single judge model.

Core claim

CICL is a decision-aware context layer that converts instance evidence into a context graph, obtains deterministic judgments through an eight-field schema, scores units by action shift, outcome uplift, necessity, and negative-transfer risk, and packs the highest-utility evidence as typed memory cards for budgeted agents. The same protocol produces measurable retrieval gains on SWE-bench while allowing frontier, local, and lightweight rankers to be compared directly.

What carries the argument

Eight-field schema and four scoring dimensions that isolate action-critical evidence for packing into typed memory cards.

If this is right

  • Different judge models become directly comparable under one auditable protocol.
  • At budget 120 the method reaches F1 0.620 on v1 and 0.425 on v3.
  • Removing the single top-utility semantic unit on v3 collapses F1 to 0.000.
  • Compact rankers still underperform the heuristic selection on the tested instances.
  • RepoBench-R summaries continue to outperform the generated memory cards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scoring layer could be applied to non-coding tool-use tasks where context volume limits decision quality.
  • Collecting many scored judgments might supply training data that lets small rankers eventually match the current heuristic.
  • The separation of decision signal from model suggests a route to track how context quality changes as base models evolve.

Load-bearing premise

The eight-field schema and four scoring dimensions reliably isolate decisive evidence rather than reflecting artifacts of the judge models or the chosen benchmark instances.

What would settle it

A controlled run on the same fifty instances where memory cards built from the scored units produce no lift over plain BM25 top-50 retrieval.

Figures

Figures reproduced from arXiv: 2606.08151 by Qianyang Zhao, Xinyu Guan, Yuming Deng.

Figure 1
Figure 1. Figure 1: CICL pipeline. The framework turns instance evidence into a graph, routes judge signals through a decision-utility engine, and packs memory cards for a budgeted agent. Opus, direct Qwen API calls, Codex, and a trained Qwen-QLoRA model used as an optional local surrogate are reported separately, so model-dependent signals are not collapsed into one score. agents are increasingly limited by the quality of th… view at source ↗
Figure 2
Figure 2. Figure 2: Compact result overview. Panel A foregrounds the open-source SWE-bench Verified retrieval result; Panels B and C show the main controlled budget trend and RepoBench-R compression boundary [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
read the original abstract

Tool-using LLM agents often fail not because relevant text is absent, but because decisive evidence is not selected, compressed, or surfaced at action time. We present CICL, a decision-aware context layer that turns instance evidence into a context graph, routes deterministic, Opus-assisted, Qwen, Codex/GPT-5.5, and Qwen-QLoRA judgments through a shared eight-field schema, scores units by action shift, outcome uplift, necessity, and negative-transfer risk, and packs high-utility evidence as typed memory cards for a budgeted agent. The design separates the measured decision signal from the judge model, so frontier annotation, local surrogates, and lightweight rankers can be compared under one auditable protocol. Empirically, CICL yields a concrete open-benchmark gain while exposing its limits. On 50 SWE-bench Verified file-retrieval instances, direct Qwen3.6-plus reranking of BM25 top-50 candidates raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. Controlled diagnostics show action-criticality: at budget 120, CICL reaches F1 0.620 on v1 and 0.425 on v3, and removing the top-utility semantic v3 unit collapses F1 to 0.000. Supplementary checks add Qwen-QLoRA agreement over 710 candidates, a small 200-label real-code Opus-assisted signal, and a three-instance patch smoke validating retrieval-to-patch plumbing without claiming official SWE-bench success. RepoBench-R summaries still beat cards, and compact rankers do not yet replace the heuristic. CICL contributes a reproducible measurement and selection layer for decision-critical context, not an end-to-end coding-agent repair claim.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CICL, a decision-aware context layer for tool-using LLM agents. Instance evidence is turned into a context graph, routed through deterministic judgments (Opus-assisted, Qwen, Codex/GPT-5.5, Qwen-QLoRA) via a shared eight-field schema, scored on action shift, outcome uplift, necessity, and negative-transfer risk, then packed as typed memory cards for budgeted agents. The design aims to separate measured decision signal from the judge model. Empirically, on 50 SWE-bench Verified file-retrieval instances, direct Qwen3.6-plus reranking of BM25 top-50 candidates raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790 (all 2,500 judgments parseable). Controlled diagnostics report F1 of 0.620 (v1) and 0.425 (v3) at budget 120, with removal of the top-utility v3 unit collapsing F1 to 0.000. Supplementary checks include Qwen-QLoRA agreement and a small Opus-assisted signal.

Significance. If the reported gains prove reproducible and the four scoring dimensions can be shown to isolate decision-critical evidence independently of judge-model artifacts, CICL would supply a reusable, auditable protocol for context selection and compression. The explicit separation of annotation from ranking and the F1-collapse diagnostic are strengths that could support downstream work on agent memory.

major comments (2)
  1. [Abstract] Abstract: The central empirical claim (hit@1 0.58→0.78, MRR@10 0.634→0.790 on 50 SWE-bench instances) rests on unverified numeric results. No details are supplied on instance-selection criteria for the 50 cases, the precise prompting templates that instantiate the eight-field schema, or any measure of cross-judge agreement on the four numeric dimensions (action shift, outcome uplift, necessity, negative-transfer risk). Without these, it cannot be established that the reranking improvement reflects decision-aware selection rather than Qwen3.6-plus stylistic preferences or SWE-bench-specific phrasing.
  2. [Abstract] Abstract: The controlled diagnostic that removing the top-utility semantic v3 unit collapses F1 to 0.000 is presented as evidence of action-criticality, yet no ablation replaces the schema with generic relevance labels, no correlation between schema scores and downstream agent success on held-out instances is reported, and no variance or statistical significance for the F1 values (0.620 / 0.425) is given. These omissions are load-bearing for the claim that the schema isolates decisive evidence.
minor comments (1)
  1. [Abstract] Abstract: The list of judge models (“Opus-assisted, Qwen, Codex/GPT-5.5, and Qwen-QLoRA”) is introduced without clarifying how their outputs are aggregated or compared under the shared schema.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on transparency and robustness of the empirical claims. We address each point below and will incorporate clarifications and additional details in a revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central empirical claim (hit@1 0.58→0.78, MRR@10 0.634→0.790 on 50 SWE-bench instances) rests on unverified numeric results. No details are supplied on instance-selection criteria for the 50 cases, the precise prompting templates that instantiate the eight-field schema, or any measure of cross-judge agreement on the four numeric dimensions (action shift, outcome uplift, necessity, negative-transfer risk). Without these, it cannot be established that the reranking improvement reflects decision-aware selection rather than Qwen3.6-plus stylistic preferences or SWE-bench-specific phrasing.

    Authors: We agree these details are necessary for full verification and reproducibility. The 50 instances were drawn from SWE-bench Verified file-retrieval tasks; we will specify the exact selection criteria (e.g., task filtering rules) in the revised methods section. The eight-field schema prompting templates will be included verbatim in a new appendix. While the manuscript already notes that all 2,500 judgments were parseable and reports Qwen-QLoRA agreement over 710 candidates plus a 200-label Opus-assisted signal, we will add quantitative cross-judge agreement metrics (e.g., pairwise correlations or agreement rates) specifically on the four numeric scoring dimensions. These additions will help substantiate that gains arise from decision-aware scoring rather than judge-specific effects. revision: yes

  2. Referee: [Abstract] Abstract: The controlled diagnostic that removing the top-utility semantic v3 unit collapses F1 to 0.000 is presented as evidence of action-criticality, yet no ablation replaces the schema with generic relevance labels, no correlation between schema scores and downstream agent success on held-out instances is reported, and no variance or statistical significance for the F1 values (0.620 / 0.425) is given. These omissions are load-bearing for the claim that the schema isolates decisive evidence.

    Authors: The F1-collapse result is presented as a targeted sensitivity check on the selected high-utility unit rather than a comprehensive proof of isolation. We acknowledge that the diagnostic would be strengthened by an ablation substituting generic relevance labels and by explicit correlations between schema scores and held-out agent performance. We will add variance estimates or bootstrap confidence intervals for the reported F1 values (0.620 on v1, 0.425 on v3) and any applicable statistical tests in the revision. A full generic-label ablation and held-out correlation analysis may require additional experiments beyond the current 50-instance controlled diagnostics; the revised text will clarify the scope and limitations of the existing diagnostic while incorporating feasible extensions. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical reranking results are direct measurements, not derived by construction.

full rationale

The paper reports concrete benchmark numbers (hit@1 0.58→0.78, MRR@10 0.634→0.790) obtained by routing BM25 candidates through an eight-field schema and four scoring dimensions, then measuring retrieval metrics on 50 SWE-bench instances. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the provided text. The central claims rest on parseable judgments and controlled diagnostics rather than any reduction of outputs to inputs by definition. This is the standard case of an empirical measurement protocol whose validity is open to external verification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the scoring dimensions are described at a high level without stating how weights or thresholds are chosen.

pith-pipeline@v0.9.1-grok · 5881 in / 1060 out tokens · 16583 ms · 2026-06-27T19:45:35.855985+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 1 canonical work pages

  1. [1]

    In: International Conference on Learning Representations (2023), https://openreview

    Yao, S., et al.: React: Synergizing reasoning and acting in language models. In: International Conference on Learning Representations (2023), https://openreview. net/forum?id=WE_vluYUL-X 14 X. Guan et al

  2. [2]

    In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/ abs/2302.04761

    Schick, T., et al.: Toolformer: Language models can teach themselves to use tools. In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/ abs/2302.04761

  3. [3]

    In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/abs/2303.11366

    Shinn,N.,Cassano,F.,Gopinath,A.,Narasimhan,K.,Yao,S.:Reflexion:Language agents with verbal reinforcement learning. In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/abs/2303.11366

  4. [4]

    Wang, G., et al.: Voyager: An open-ended embodied agent with large language models (2023), https://arxiv.org/abs/2305.16291

  5. [5]

    Yang, J., et al.: SWE-agent: Agent-computer interfaces enable automated software engineering (2024), https://arxiv.org/abs/2405.15793

  6. [6]

    Li, H., et al.: Contextbench: A benchmark for context retrieval in coding agents (2026), https://arxiv.org/abs/2602.05892

  7. [7]

    Zhu, J., et al.: Swe context bench: A benchmark for context learning in coding (2026), https://arxiv.org/abs/2602.08316

  8. [8]

    Wu, Q., et al.: Autogen: Enabling next-gen llm applications via multi-agent con- versation (2023), https://arxiv.org/abs/2308.08155

  9. [9]

    Qian, C., et al.: Communicative agents for software development (2023), https: //arxiv.org/abs/2307.07924

  10. [10]

    He, Z., et al.: Memoryarena: Benchmarking agent memory in interdependent multi- session agentic tasks (2026), https://arxiv.org/abs/2602.16313

  11. [11]

    Hu, Y., Wang, Y., McAuley, J.: Evaluating memory in llm agents via incremental multi-turn interactions (2025), https://arxiv.org/abs/2507.05257

  12. [12]

    Wang, Y., et al.: Evomembench: Benchmarking agent memory from a self-evolving perspective (2026), https://arxiv.org/abs/2605.18421

  13. [13]

    Xia, C.S., Deng, Y., Dunn, S., Zhang, L.: Agentless: Demystifying llm-based soft- ware engineering agents (2024), https://arxiv.org/abs/2407.01489

  14. [14]

    Wang, X., et al.: Openhands: An open platform for ai software developers as gen- eralist agents (2024), https://arxiv.org/abs/2407.16741

  15. [15]

    In: Transactions on Machine Learning Research (2022), https://arxiv.org/abs/ 2112.09118

    Izacard, G., et al.: Unsupervised dense information retrieval with contrastive learn- ing. In: Transactions on Machine Learning Research (2022), https://arxiv.org/abs/ 2112.09118

  16. [16]

    IEEE Transactions on Big Data (2019), https://arxiv.org/abs/1702.08734

    Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus. IEEE Transactions on Big Data (2019), https://arxiv.org/abs/1702.08734

  17. [17]

    In: Advances in Neural Information Processing Systems (2020), https://arxiv.org/ abs/2005.11401

    Lewis, P., et al.: Retrieval-augmented generation for knowledge-intensive nlp tasks. In: Advances in Neural Information Processing Systems (2020), https://arxiv.org/ abs/2005.11401

  18. [18]

    In: International Conference on Learning Representations (2023), https://arxiv

    Izacard, G., et al.: Few-shot learning with retrieval augmented language models. In: International Conference on Learning Representations (2023), https://arxiv. org/abs/2208.03299

  19. [19]

    In: Annual Meeting of the Association for Computational Linguistics (2023), https://arxiv.org/abs/2212.10496

    Gao, L., Ma, X., Lin, J., Callan, J.: Precise zero-shot dense retrieval without rele- vance labels. In: Annual Meeting of the Association for Computational Linguistics (2023), https://arxiv.org/abs/2212.10496

  20. [20]

    In: International Conference on Learning Representations (2024), https: //arxiv.org/abs/2310.11511

    Asai, A., et al.: Self-rag: Learning to retrieve, generate, and critique through self- reflection. In: International Conference on Learning Representations (2024), https: //arxiv.org/abs/2310.11511

  21. [21]

    Transactions of the Association for Computational Linguistics (2024), https:// arxiv.org/abs/2307.03172

    Liu, N.F., et al.: Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics (2024), https:// arxiv.org/abs/2307.03172

  22. [22]

    Bai, Y., et al.: Longbench: A bilingual, multitask benchmark for long context understanding (2023), https://arxiv.org/abs/2308.14508 Decision-Aware Memory Cards for LLM Agents 15

  23. [23]

    Cai, K., et al.: Autocontext: Instance-level context learning for llm agents (2025), https://arxiv.org/abs/2510.02369

  24. [24]

    Zhang, Q., et al.: Agentic context engineering: Evolving contexts for self-improving language models (2025), https://arxiv.org/abs/2510.04618

  25. [25]

    Kang, M., et al.: Acon: Optimizing context compression for long-horizon llm agents (2025), https://arxiv.org/abs/2510.00615

  26. [26]

    Srivastava, S.S.: Causal intervention-based memory selection for long-horizon llm agents (2026), https://arxiv.org/abs/2605.17641

  27. [27]

    Huo,Y.,etal.:Reposhapley:Shapley-enhancedcontextfilteringforrepository-level code completion (2026), https://arxiv.org/abs/2601.03378

  28. [28]

    and Zaragoza, H

    Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. In: Foundations and Trends in Information Retrieval. Now Publishers (2009), https://doi.org/10.1561/1500000019

  29. [29]

    In: Empirical Methods in Natural Language Processing (2020), https://arxiv.org/ abs/2004.04906

    Karpukhin, V., et al.: Dense passage retrieval for open-domain question answering. In: Empirical Methods in Natural Language Processing (2020), https://arxiv.org/ abs/2004.04906

  30. [30]

    In: SIGIR (2020), https://arxiv.org/ abs/2004.12832

    Khattab, O., Zaharia, M.: ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In: SIGIR (2020), https://arxiv.org/ abs/2004.12832

  31. [31]

    In: Empirical Methods in Natural Language Processing (2023), https://arxiv.org/abs/2310.05736

    Jiang, H., et al.: LLMLingua: Compressing prompts for accelerated inference of large language models. In: Empirical Methods in Natural Language Processing (2023), https://arxiv.org/abs/2310.05736

  32. [32]

    In: Annual Meeting of the Association for Com- putational Linguistics (2024), https://arxiv.org/abs/2310.06839

    Jiang, H., et al.: Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In: Annual Meeting of the Association for Com- putational Linguistics (2024), https://arxiv.org/abs/2310.06839

  33. [33]

    Li, Y., Dong, B., Lin, C., Guerin, F.: Selective context: Efficient and context-aware prompt compression (2023), https://arxiv.org/abs/2304.12102

  34. [34]

    In: Advances in Neural Information Processing Systems (2022), https: //arxiv.org/abs/2203.02155

    Ouyang, L., et al.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems (2022), https: //arxiv.org/abs/2203.02155

  35. [35]

    In: Advances in Neural Information Processing Systems (2020), https://arxiv.org/abs/2009.01325

    Stiennon, N., et al.: Learning to summarize with human feedback. In: Advances in Neural Information Processing Systems (2020), https://arxiv.org/abs/2009.01325

  36. [36]

    In: Inter- national Conference on Learning Representations (2022), https://arxiv.org/abs/ 2106.09685

    Hu, E.J., et al.: LoRA: Low-rank adaptation of large language models. In: Inter- national Conference on Learning Representations (2022), https://arxiv.org/abs/ 2106.09685

  37. [37]

    In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/abs/2305.14314

    Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient fine- tuning of quantized llms. In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/abs/2305.14314

  38. [38]

    Jimenez, C.E., et al.: SWE-bench: Can language models resolve real-world github issues? In: International Conference on Learning Representations (2024), https: //arxiv.org/abs/2310.06770

  39. [39]

    Liu,T.,Xu,C.,McAuley,J.:Repobench:Benchmarkingrepository-levelcodeauto- completion systems (2023), https://arxiv.org/abs/2306.03091

  40. [40]

    In: NeurIPS Workshop on Machine Learning for Software Engineering (2019), https://arxiv.org/abs/1909.09436

    Husain, H., et al.: Codesearchnet challenge: Evaluating the state of semantic code search. In: NeurIPS Workshop on Machine Learning for Software Engineering (2019), https://arxiv.org/abs/1909.09436