Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

Qianyang Zhao; Xinyu Guan; Yuming Deng

arxiv: 2606.08151 · v1 · pith:VQ4MWQ7Onew · submitted 2026-06-06 · 💻 cs.AI

Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

Xinyu Guan , Qianyang Zhao , Yuming Deng This is my paper

Pith reviewed 2026-06-27 19:45 UTC · model grok-4.3

classification 💻 cs.AI

keywords decision-aware contextmemory cardscontext selectiontool-using agentsSWE-benchcontext compressionLLM agentscounterfactual scoring

0 comments

The pith

Decision-aware memory cards lift retrieval hit@1 from 0.58 to 0.78 on SWE-bench file tasks by scoring evidence on action shift and necessity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tool-using LLM agents often fail because decisive evidence is not surfaced at action time even when present. CICL builds a context graph from instance evidence and routes judgments from multiple models through one shared eight-field schema. Evidence units receive scores on action shift, outcome uplift, necessity, and negative-transfer risk, after which the highest-utility units are packed as typed memory cards inside a fixed budget. On fifty SWE-bench Verified file-retrieval instances this protocol raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790 while keeping every judgment parseable. The design keeps the measured decision signal independent of any single judge model.

Core claim

CICL is a decision-aware context layer that converts instance evidence into a context graph, obtains deterministic judgments through an eight-field schema, scores units by action shift, outcome uplift, necessity, and negative-transfer risk, and packs the highest-utility evidence as typed memory cards for budgeted agents. The same protocol produces measurable retrieval gains on SWE-bench while allowing frontier, local, and lightweight rankers to be compared directly.

What carries the argument

Eight-field schema and four scoring dimensions that isolate action-critical evidence for packing into typed memory cards.

If this is right

Different judge models become directly comparable under one auditable protocol.
At budget 120 the method reaches F1 0.620 on v1 and 0.425 on v3.
Removing the single top-utility semantic unit on v3 collapses F1 to 0.000.
Compact rankers still underperform the heuristic selection on the tested instances.
RepoBench-R summaries continue to outperform the generated memory cards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same scoring layer could be applied to non-coding tool-use tasks where context volume limits decision quality.
Collecting many scored judgments might supply training data that lets small rankers eventually match the current heuristic.
The separation of decision signal from model suggests a route to track how context quality changes as base models evolve.

Load-bearing premise

The eight-field schema and four scoring dimensions reliably isolate decisive evidence rather than reflecting artifacts of the judge models or the chosen benchmark instances.

What would settle it

A controlled run on the same fifty instances where memory cards built from the scored units produce no lift over plain BM25 top-50 retrieval.

Figures

Figures reproduced from arXiv: 2606.08151 by Qianyang Zhao, Xinyu Guan, Yuming Deng.

**Figure 1.** Figure 1: CICL pipeline. The framework turns instance evidence into a graph, routes judge signals through a decision-utility engine, and packs memory cards for a budgeted agent. Opus, direct Qwen API calls, Codex, and a trained Qwen-QLoRA model used as an optional local surrogate are reported separately, so model-dependent signals are not collapsed into one score. agents are increasingly limited by the quality of th… view at source ↗

**Figure 2.** Figure 2: Compact result overview. Panel A foregrounds the open-source SWE-bench Verified retrieval result; Panels B and C show the main controlled budget trend and RepoBench-R compression boundary [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

read the original abstract

Tool-using LLM agents often fail not because relevant text is absent, but because decisive evidence is not selected, compressed, or surfaced at action time. We present CICL, a decision-aware context layer that turns instance evidence into a context graph, routes deterministic, Opus-assisted, Qwen, Codex/GPT-5.5, and Qwen-QLoRA judgments through a shared eight-field schema, scores units by action shift, outcome uplift, necessity, and negative-transfer risk, and packs high-utility evidence as typed memory cards for a budgeted agent. The design separates the measured decision signal from the judge model, so frontier annotation, local surrogates, and lightweight rankers can be compared under one auditable protocol. Empirically, CICL yields a concrete open-benchmark gain while exposing its limits. On 50 SWE-bench Verified file-retrieval instances, direct Qwen3.6-plus reranking of BM25 top-50 candidates raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. Controlled diagnostics show action-criticality: at budget 120, CICL reaches F1 0.620 on v1 and 0.425 on v3, and removing the top-utility semantic v3 unit collapses F1 to 0.000. Supplementary checks add Qwen-QLoRA agreement over 710 candidates, a small 200-label real-code Opus-assisted signal, and a three-instance patch smoke validating retrieval-to-patch plumbing without claiming official SWE-bench success. RepoBench-R summaries still beat cards, and compact rankers do not yet replace the heuristic. CICL contributes a reproducible measurement and selection layer for decision-critical context, not an end-to-end coding-agent repair claim.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CICL's scoring schema lifts hit@1 from 0.58 to 0.78 on 50 SWE-bench retrieval cases, but the gains rest on untested assumptions about judge independence and lack ablations against generic relevance.

read the letter

The paper's core move is turning context selection into an auditable scoring step with four explicit dimensions—action shift, outcome uplift, necessity, and negative-transfer risk—then packing the winners into typed memory cards. That framing is distinct from plain reranking or summarization, and the authors keep the claim narrow: a measurement and selection layer, not an end-to-end agent fix.

What works is the controlled diagnostic. Removing the top-utility unit drops F1 to zero at budget 120; that is a direct check that the ranking is doing something. Reporting that all 2,500 judgments parsed cleanly is also useful basic hygiene. The multi-judge setup (Opus-assisted, Qwen, Codex/GPT-5.5, Qwen-QLoRA) under one schema lets them compare frontier and local models on the same units, which is a practical step for the sub-area.

The soft spots are scale and missing safeguards. Fifty instances is small even for a diagnostic benchmark, and the headline numbers come from a single reranker (Qwen3.6-plus) on BM25 candidates. There is no cross-judge agreement reported on the four numeric dimensions, no correlation of those scores with downstream patch success on held-out cases, and no ablation that replaces the eight-field schema with ordinary relevance labels. The stress-test concern therefore stands: without those checks it is hard to know whether the dimensions isolate decision-critical evidence or simply echo the judge model's stylistic preferences on this particular benchmark. The paper itself notes that RepoBench-R summaries still win and that compact rankers do not yet replace the heuristic, which keeps the contribution honest but also limits how far the protocol can be trusted yet.

This is for researchers building context layers for tool-using agents who want reproducible selection rather than another black-box reranker. A reader who needs a concrete protocol and open numbers on a public benchmark will find usable material here.

It deserves a serious referee. The empirical claims are scoped and falsifiable enough to be worth checking, even if the current evidence is preliminary.

Referee Report

2 major / 1 minor

Summary. The paper introduces CICL, a decision-aware context layer for tool-using LLM agents. Instance evidence is turned into a context graph, routed through deterministic judgments (Opus-assisted, Qwen, Codex/GPT-5.5, Qwen-QLoRA) via a shared eight-field schema, scored on action shift, outcome uplift, necessity, and negative-transfer risk, then packed as typed memory cards for budgeted agents. The design aims to separate measured decision signal from the judge model. Empirically, on 50 SWE-bench Verified file-retrieval instances, direct Qwen3.6-plus reranking of BM25 top-50 candidates raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790 (all 2,500 judgments parseable). Controlled diagnostics report F1 of 0.620 (v1) and 0.425 (v3) at budget 120, with removal of the top-utility v3 unit collapsing F1 to 0.000. Supplementary checks include Qwen-QLoRA agreement and a small Opus-assisted signal.

Significance. If the reported gains prove reproducible and the four scoring dimensions can be shown to isolate decision-critical evidence independently of judge-model artifacts, CICL would supply a reusable, auditable protocol for context selection and compression. The explicit separation of annotation from ranking and the F1-collapse diagnostic are strengths that could support downstream work on agent memory.

major comments (2)

[Abstract] Abstract: The central empirical claim (hit@1 0.58→0.78, MRR@10 0.634→0.790 on 50 SWE-bench instances) rests on unverified numeric results. No details are supplied on instance-selection criteria for the 50 cases, the precise prompting templates that instantiate the eight-field schema, or any measure of cross-judge agreement on the four numeric dimensions (action shift, outcome uplift, necessity, negative-transfer risk). Without these, it cannot be established that the reranking improvement reflects decision-aware selection rather than Qwen3.6-plus stylistic preferences or SWE-bench-specific phrasing.
[Abstract] Abstract: The controlled diagnostic that removing the top-utility semantic v3 unit collapses F1 to 0.000 is presented as evidence of action-criticality, yet no ablation replaces the schema with generic relevance labels, no correlation between schema scores and downstream agent success on held-out instances is reported, and no variance or statistical significance for the F1 values (0.620 / 0.425) is given. These omissions are load-bearing for the claim that the schema isolates decisive evidence.

minor comments (1)

[Abstract] Abstract: The list of judge models (“Opus-assisted, Qwen, Codex/GPT-5.5, and Qwen-QLoRA”) is introduced without clarifying how their outputs are aggregated or compared under the shared schema.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on transparency and robustness of the empirical claims. We address each point below and will incorporate clarifications and additional details in a revised manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The central empirical claim (hit@1 0.58→0.78, MRR@10 0.634→0.790 on 50 SWE-bench instances) rests on unverified numeric results. No details are supplied on instance-selection criteria for the 50 cases, the precise prompting templates that instantiate the eight-field schema, or any measure of cross-judge agreement on the four numeric dimensions (action shift, outcome uplift, necessity, negative-transfer risk). Without these, it cannot be established that the reranking improvement reflects decision-aware selection rather than Qwen3.6-plus stylistic preferences or SWE-bench-specific phrasing.

Authors: We agree these details are necessary for full verification and reproducibility. The 50 instances were drawn from SWE-bench Verified file-retrieval tasks; we will specify the exact selection criteria (e.g., task filtering rules) in the revised methods section. The eight-field schema prompting templates will be included verbatim in a new appendix. While the manuscript already notes that all 2,500 judgments were parseable and reports Qwen-QLoRA agreement over 710 candidates plus a 200-label Opus-assisted signal, we will add quantitative cross-judge agreement metrics (e.g., pairwise correlations or agreement rates) specifically on the four numeric scoring dimensions. These additions will help substantiate that gains arise from decision-aware scoring rather than judge-specific effects. revision: yes
Referee: [Abstract] Abstract: The controlled diagnostic that removing the top-utility semantic v3 unit collapses F1 to 0.000 is presented as evidence of action-criticality, yet no ablation replaces the schema with generic relevance labels, no correlation between schema scores and downstream agent success on held-out instances is reported, and no variance or statistical significance for the F1 values (0.620 / 0.425) is given. These omissions are load-bearing for the claim that the schema isolates decisive evidence.

Authors: The F1-collapse result is presented as a targeted sensitivity check on the selected high-utility unit rather than a comprehensive proof of isolation. We acknowledge that the diagnostic would be strengthened by an ablation substituting generic relevance labels and by explicit correlations between schema scores and held-out agent performance. We will add variance estimates or bootstrap confidence intervals for the reported F1 values (0.620 on v1, 0.425 on v3) and any applicable statistical tests in the revision. A full generic-label ablation and held-out correlation analysis may require additional experiments beyond the current 50-instance controlled diagnostics; the revised text will clarify the scope and limitations of the existing diagnostic while incorporating feasible extensions. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical reranking results are direct measurements, not derived by construction.

full rationale

The paper reports concrete benchmark numbers (hit@1 0.58→0.78, MRR@10 0.634→0.790) obtained by routing BM25 candidates through an eight-field schema and four scoring dimensions, then measuring retrieval metrics on 50 SWE-bench instances. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the provided text. The central claims rest on parseable judgments and controlled diagnostics rather than any reduction of outputs to inputs by definition. This is the standard case of an empirical measurement protocol whose validity is open to external verification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the scoring dimensions are described at a high level without stating how weights or thresholds are chosen.

pith-pipeline@v0.9.1-grok · 5881 in / 1060 out tokens · 16583 ms · 2026-06-27T19:45:35.855985+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 1 canonical work pages

[1]

In: International Conference on Learning Representations (2023), https://openreview

Yao, S., et al.: React: Synergizing reasoning and acting in language models. In: International Conference on Learning Representations (2023), https://openreview. net/forum?id=WE_vluYUL-X 14 X. Guan et al

2023
[2]

In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/ abs/2302.04761

Schick, T., et al.: Toolformer: Language models can teach themselves to use tools. In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/ abs/2302.04761

Pith/arXiv arXiv 2023
[3]

In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/abs/2303.11366

Shinn,N.,Cassano,F.,Gopinath,A.,Narasimhan,K.,Yao,S.:Reflexion:Language agents with verbal reinforcement learning. In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/abs/2303.11366

Pith/arXiv arXiv 2023
[4]

Wang, G., et al.: Voyager: An open-ended embodied agent with large language models (2023), https://arxiv.org/abs/2305.16291

Pith/arXiv arXiv 2023
[5]

Yang, J., et al.: SWE-agent: Agent-computer interfaces enable automated software engineering (2024), https://arxiv.org/abs/2405.15793

Pith/arXiv arXiv 2024
[6]

Li, H., et al.: Contextbench: A benchmark for context retrieval in coding agents (2026), https://arxiv.org/abs/2602.05892

arXiv 2026
[7]

Zhu, J., et al.: Swe context bench: A benchmark for context learning in coding (2026), https://arxiv.org/abs/2602.08316

Pith/arXiv arXiv 2026
[8]

Wu, Q., et al.: Autogen: Enabling next-gen llm applications via multi-agent con- versation (2023), https://arxiv.org/abs/2308.08155

Pith/arXiv arXiv 2023
[9]

Qian, C., et al.: Communicative agents for software development (2023), https: //arxiv.org/abs/2307.07924

Pith/arXiv arXiv 2023
[10]

He, Z., et al.: Memoryarena: Benchmarking agent memory in interdependent multi- session agentic tasks (2026), https://arxiv.org/abs/2602.16313

arXiv 2026
[11]

Hu, Y., Wang, Y., McAuley, J.: Evaluating memory in llm agents via incremental multi-turn interactions (2025), https://arxiv.org/abs/2507.05257

Pith/arXiv arXiv 2025
[12]

Wang, Y., et al.: Evomembench: Benchmarking agent memory from a self-evolving perspective (2026), https://arxiv.org/abs/2605.18421

Pith/arXiv arXiv 2026
[13]

Xia, C.S., Deng, Y., Dunn, S., Zhang, L.: Agentless: Demystifying llm-based soft- ware engineering agents (2024), https://arxiv.org/abs/2407.01489

Pith/arXiv arXiv 2024
[14]

Wang, X., et al.: Openhands: An open platform for ai software developers as gen- eralist agents (2024), https://arxiv.org/abs/2407.16741

Pith/arXiv arXiv 2024
[15]

In: Transactions on Machine Learning Research (2022), https://arxiv.org/abs/ 2112.09118

Izacard, G., et al.: Unsupervised dense information retrieval with contrastive learn- ing. In: Transactions on Machine Learning Research (2022), https://arxiv.org/abs/ 2112.09118

Pith/arXiv arXiv 2022
[16]

IEEE Transactions on Big Data (2019), https://arxiv.org/abs/1702.08734

Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus. IEEE Transactions on Big Data (2019), https://arxiv.org/abs/1702.08734

Pith/arXiv arXiv 2019
[17]

In: Advances in Neural Information Processing Systems (2020), https://arxiv.org/ abs/2005.11401

Lewis, P., et al.: Retrieval-augmented generation for knowledge-intensive nlp tasks. In: Advances in Neural Information Processing Systems (2020), https://arxiv.org/ abs/2005.11401

Pith/arXiv arXiv 2020
[18]

In: International Conference on Learning Representations (2023), https://arxiv

Izacard, G., et al.: Few-shot learning with retrieval augmented language models. In: International Conference on Learning Representations (2023), https://arxiv. org/abs/2208.03299

Pith/arXiv arXiv 2023
[19]

In: Annual Meeting of the Association for Computational Linguistics (2023), https://arxiv.org/abs/2212.10496

Gao, L., Ma, X., Lin, J., Callan, J.: Precise zero-shot dense retrieval without rele- vance labels. In: Annual Meeting of the Association for Computational Linguistics (2023), https://arxiv.org/abs/2212.10496

arXiv 2023
[20]

In: International Conference on Learning Representations (2024), https: //arxiv.org/abs/2310.11511

Asai, A., et al.: Self-rag: Learning to retrieve, generate, and critique through self- reflection. In: International Conference on Learning Representations (2024), https: //arxiv.org/abs/2310.11511

Pith/arXiv arXiv 2024
[21]

Transactions of the Association for Computational Linguistics (2024), https:// arxiv.org/abs/2307.03172

Liu, N.F., et al.: Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics (2024), https:// arxiv.org/abs/2307.03172

Pith/arXiv arXiv 2024
[22]

Bai, Y., et al.: Longbench: A bilingual, multitask benchmark for long context understanding (2023), https://arxiv.org/abs/2308.14508 Decision-Aware Memory Cards for LLM Agents 15

Pith/arXiv arXiv 2023
[23]

Cai, K., et al.: Autocontext: Instance-level context learning for llm agents (2025), https://arxiv.org/abs/2510.02369

arXiv 2025
[24]

Zhang, Q., et al.: Agentic context engineering: Evolving contexts for self-improving language models (2025), https://arxiv.org/abs/2510.04618

Pith/arXiv arXiv 2025
[25]

Kang, M., et al.: Acon: Optimizing context compression for long-horizon llm agents (2025), https://arxiv.org/abs/2510.00615

Pith/arXiv arXiv 2025
[26]

Srivastava, S.S.: Causal intervention-based memory selection for long-horizon llm agents (2026), https://arxiv.org/abs/2605.17641

Pith/arXiv arXiv 2026
[27]

Huo,Y.,etal.:Reposhapley:Shapley-enhancedcontextfilteringforrepository-level code completion (2026), https://arxiv.org/abs/2601.03378

Pith/arXiv arXiv 2026
[28]

and Zaragoza, H

Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. In: Foundations and Trends in Information Retrieval. Now Publishers (2009), https://doi.org/10.1561/1500000019

work page doi:10.1561/1500000019 2009
[29]

In: Empirical Methods in Natural Language Processing (2020), https://arxiv.org/ abs/2004.04906

Karpukhin, V., et al.: Dense passage retrieval for open-domain question answering. In: Empirical Methods in Natural Language Processing (2020), https://arxiv.org/ abs/2004.04906

Pith/arXiv arXiv 2020
[30]

In: SIGIR (2020), https://arxiv.org/ abs/2004.12832

Khattab, O., Zaharia, M.: ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In: SIGIR (2020), https://arxiv.org/ abs/2004.12832

arXiv 2020
[31]

In: Empirical Methods in Natural Language Processing (2023), https://arxiv.org/abs/2310.05736

Jiang, H., et al.: LLMLingua: Compressing prompts for accelerated inference of large language models. In: Empirical Methods in Natural Language Processing (2023), https://arxiv.org/abs/2310.05736

arXiv 2023
[32]

In: Annual Meeting of the Association for Com- putational Linguistics (2024), https://arxiv.org/abs/2310.06839

Jiang, H., et al.: Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In: Annual Meeting of the Association for Com- putational Linguistics (2024), https://arxiv.org/abs/2310.06839

arXiv 2024
[33]

Li, Y., Dong, B., Lin, C., Guerin, F.: Selective context: Efficient and context-aware prompt compression (2023), https://arxiv.org/abs/2304.12102

arXiv 2023
[34]

In: Advances in Neural Information Processing Systems (2022), https: //arxiv.org/abs/2203.02155

Ouyang, L., et al.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems (2022), https: //arxiv.org/abs/2203.02155

Pith/arXiv arXiv 2022
[35]

In: Advances in Neural Information Processing Systems (2020), https://arxiv.org/abs/2009.01325

Stiennon, N., et al.: Learning to summarize with human feedback. In: Advances in Neural Information Processing Systems (2020), https://arxiv.org/abs/2009.01325

Pith/arXiv arXiv 2020
[36]

In: Inter- national Conference on Learning Representations (2022), https://arxiv.org/abs/ 2106.09685

Hu, E.J., et al.: LoRA: Low-rank adaptation of large language models. In: Inter- national Conference on Learning Representations (2022), https://arxiv.org/abs/ 2106.09685

Pith/arXiv arXiv 2022
[37]

In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/abs/2305.14314

Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient fine- tuning of quantized llms. In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/abs/2305.14314

Pith/arXiv arXiv 2023
[38]

Jimenez, C.E., et al.: SWE-bench: Can language models resolve real-world github issues? In: International Conference on Learning Representations (2024), https: //arxiv.org/abs/2310.06770

Pith/arXiv arXiv 2024
[39]

Liu,T.,Xu,C.,McAuley,J.:Repobench:Benchmarkingrepository-levelcodeauto- completion systems (2023), https://arxiv.org/abs/2306.03091

Pith/arXiv arXiv 2023
[40]

In: NeurIPS Workshop on Machine Learning for Software Engineering (2019), https://arxiv.org/abs/1909.09436

Husain, H., et al.: Codesearchnet challenge: Evaluating the state of semantic code search. In: NeurIPS Workshop on Machine Learning for Software Engineering (2019), https://arxiv.org/abs/1909.09436

Pith/arXiv arXiv 2019

[1] [1]

In: International Conference on Learning Representations (2023), https://openreview

Yao, S., et al.: React: Synergizing reasoning and acting in language models. In: International Conference on Learning Representations (2023), https://openreview. net/forum?id=WE_vluYUL-X 14 X. Guan et al

2023

[2] [2]

In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/ abs/2302.04761

Schick, T., et al.: Toolformer: Language models can teach themselves to use tools. In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/ abs/2302.04761

Pith/arXiv arXiv 2023

[3] [3]

In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/abs/2303.11366

Shinn,N.,Cassano,F.,Gopinath,A.,Narasimhan,K.,Yao,S.:Reflexion:Language agents with verbal reinforcement learning. In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/abs/2303.11366

Pith/arXiv arXiv 2023

[4] [4]

Wang, G., et al.: Voyager: An open-ended embodied agent with large language models (2023), https://arxiv.org/abs/2305.16291

Pith/arXiv arXiv 2023

[5] [5]

Yang, J., et al.: SWE-agent: Agent-computer interfaces enable automated software engineering (2024), https://arxiv.org/abs/2405.15793

Pith/arXiv arXiv 2024

[6] [6]

Li, H., et al.: Contextbench: A benchmark for context retrieval in coding agents (2026), https://arxiv.org/abs/2602.05892

arXiv 2026

[7] [7]

Zhu, J., et al.: Swe context bench: A benchmark for context learning in coding (2026), https://arxiv.org/abs/2602.08316

Pith/arXiv arXiv 2026

[8] [8]

Wu, Q., et al.: Autogen: Enabling next-gen llm applications via multi-agent con- versation (2023), https://arxiv.org/abs/2308.08155

Pith/arXiv arXiv 2023

[9] [9]

Qian, C., et al.: Communicative agents for software development (2023), https: //arxiv.org/abs/2307.07924

Pith/arXiv arXiv 2023

[10] [10]

He, Z., et al.: Memoryarena: Benchmarking agent memory in interdependent multi- session agentic tasks (2026), https://arxiv.org/abs/2602.16313

arXiv 2026

[11] [11]

Hu, Y., Wang, Y., McAuley, J.: Evaluating memory in llm agents via incremental multi-turn interactions (2025), https://arxiv.org/abs/2507.05257

Pith/arXiv arXiv 2025

[12] [12]

Wang, Y., et al.: Evomembench: Benchmarking agent memory from a self-evolving perspective (2026), https://arxiv.org/abs/2605.18421

Pith/arXiv arXiv 2026

[13] [13]

Xia, C.S., Deng, Y., Dunn, S., Zhang, L.: Agentless: Demystifying llm-based soft- ware engineering agents (2024), https://arxiv.org/abs/2407.01489

Pith/arXiv arXiv 2024

[14] [14]

Wang, X., et al.: Openhands: An open platform for ai software developers as gen- eralist agents (2024), https://arxiv.org/abs/2407.16741

Pith/arXiv arXiv 2024

[15] [15]

In: Transactions on Machine Learning Research (2022), https://arxiv.org/abs/ 2112.09118

Izacard, G., et al.: Unsupervised dense information retrieval with contrastive learn- ing. In: Transactions on Machine Learning Research (2022), https://arxiv.org/abs/ 2112.09118

Pith/arXiv arXiv 2022

[16] [16]

IEEE Transactions on Big Data (2019), https://arxiv.org/abs/1702.08734

Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus. IEEE Transactions on Big Data (2019), https://arxiv.org/abs/1702.08734

Pith/arXiv arXiv 2019

[17] [17]

In: Advances in Neural Information Processing Systems (2020), https://arxiv.org/ abs/2005.11401

Lewis, P., et al.: Retrieval-augmented generation for knowledge-intensive nlp tasks. In: Advances in Neural Information Processing Systems (2020), https://arxiv.org/ abs/2005.11401

Pith/arXiv arXiv 2020

[18] [18]

In: International Conference on Learning Representations (2023), https://arxiv

Izacard, G., et al.: Few-shot learning with retrieval augmented language models. In: International Conference on Learning Representations (2023), https://arxiv. org/abs/2208.03299

Pith/arXiv arXiv 2023

[19] [19]

In: Annual Meeting of the Association for Computational Linguistics (2023), https://arxiv.org/abs/2212.10496

Gao, L., Ma, X., Lin, J., Callan, J.: Precise zero-shot dense retrieval without rele- vance labels. In: Annual Meeting of the Association for Computational Linguistics (2023), https://arxiv.org/abs/2212.10496

arXiv 2023

[20] [20]

In: International Conference on Learning Representations (2024), https: //arxiv.org/abs/2310.11511

Asai, A., et al.: Self-rag: Learning to retrieve, generate, and critique through self- reflection. In: International Conference on Learning Representations (2024), https: //arxiv.org/abs/2310.11511

Pith/arXiv arXiv 2024

[21] [21]

Transactions of the Association for Computational Linguistics (2024), https:// arxiv.org/abs/2307.03172

Liu, N.F., et al.: Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics (2024), https:// arxiv.org/abs/2307.03172

Pith/arXiv arXiv 2024

[22] [22]

Bai, Y., et al.: Longbench: A bilingual, multitask benchmark for long context understanding (2023), https://arxiv.org/abs/2308.14508 Decision-Aware Memory Cards for LLM Agents 15

Pith/arXiv arXiv 2023

[23] [23]

Cai, K., et al.: Autocontext: Instance-level context learning for llm agents (2025), https://arxiv.org/abs/2510.02369

arXiv 2025

[24] [24]

Zhang, Q., et al.: Agentic context engineering: Evolving contexts for self-improving language models (2025), https://arxiv.org/abs/2510.04618

Pith/arXiv arXiv 2025

[25] [25]

Kang, M., et al.: Acon: Optimizing context compression for long-horizon llm agents (2025), https://arxiv.org/abs/2510.00615

Pith/arXiv arXiv 2025

[26] [26]

Srivastava, S.S.: Causal intervention-based memory selection for long-horizon llm agents (2026), https://arxiv.org/abs/2605.17641

Pith/arXiv arXiv 2026

[27] [27]

Huo,Y.,etal.:Reposhapley:Shapley-enhancedcontextfilteringforrepository-level code completion (2026), https://arxiv.org/abs/2601.03378

Pith/arXiv arXiv 2026

[28] [28]

and Zaragoza, H

Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. In: Foundations and Trends in Information Retrieval. Now Publishers (2009), https://doi.org/10.1561/1500000019

work page doi:10.1561/1500000019 2009

[29] [29]

In: Empirical Methods in Natural Language Processing (2020), https://arxiv.org/ abs/2004.04906

Karpukhin, V., et al.: Dense passage retrieval for open-domain question answering. In: Empirical Methods in Natural Language Processing (2020), https://arxiv.org/ abs/2004.04906

Pith/arXiv arXiv 2020

[30] [30]

In: SIGIR (2020), https://arxiv.org/ abs/2004.12832

Khattab, O., Zaharia, M.: ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In: SIGIR (2020), https://arxiv.org/ abs/2004.12832

arXiv 2020

[31] [31]

In: Empirical Methods in Natural Language Processing (2023), https://arxiv.org/abs/2310.05736

Jiang, H., et al.: LLMLingua: Compressing prompts for accelerated inference of large language models. In: Empirical Methods in Natural Language Processing (2023), https://arxiv.org/abs/2310.05736

arXiv 2023

[32] [32]

In: Annual Meeting of the Association for Com- putational Linguistics (2024), https://arxiv.org/abs/2310.06839

Jiang, H., et al.: Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In: Annual Meeting of the Association for Com- putational Linguistics (2024), https://arxiv.org/abs/2310.06839

arXiv 2024

[33] [33]

Li, Y., Dong, B., Lin, C., Guerin, F.: Selective context: Efficient and context-aware prompt compression (2023), https://arxiv.org/abs/2304.12102

arXiv 2023

[34] [34]

In: Advances in Neural Information Processing Systems (2022), https: //arxiv.org/abs/2203.02155

Ouyang, L., et al.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems (2022), https: //arxiv.org/abs/2203.02155

Pith/arXiv arXiv 2022

[35] [35]

In: Advances in Neural Information Processing Systems (2020), https://arxiv.org/abs/2009.01325

Stiennon, N., et al.: Learning to summarize with human feedback. In: Advances in Neural Information Processing Systems (2020), https://arxiv.org/abs/2009.01325

Pith/arXiv arXiv 2020

[36] [36]

In: Inter- national Conference on Learning Representations (2022), https://arxiv.org/abs/ 2106.09685

Hu, E.J., et al.: LoRA: Low-rank adaptation of large language models. In: Inter- national Conference on Learning Representations (2022), https://arxiv.org/abs/ 2106.09685

Pith/arXiv arXiv 2022

[37] [37]

In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/abs/2305.14314

Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient fine- tuning of quantized llms. In: Advances in Neural Information Processing Systems (2023), https://arxiv.org/abs/2305.14314

Pith/arXiv arXiv 2023

[38] [38]

Jimenez, C.E., et al.: SWE-bench: Can language models resolve real-world github issues? In: International Conference on Learning Representations (2024), https: //arxiv.org/abs/2310.06770

Pith/arXiv arXiv 2024

[39] [39]

Liu,T.,Xu,C.,McAuley,J.:Repobench:Benchmarkingrepository-levelcodeauto- completion systems (2023), https://arxiv.org/abs/2306.03091

Pith/arXiv arXiv 2023

[40] [40]

In: NeurIPS Workshop on Machine Learning for Software Engineering (2019), https://arxiv.org/abs/1909.09436

Husain, H., et al.: Codesearchnet challenge: Evaluating the state of semantic code search. In: NeurIPS Workshop on Machine Learning for Software Engineering (2019), https://arxiv.org/abs/1909.09436

Pith/arXiv arXiv 2019