Task-Adaptive Retrieval over Agentic Multi-Modal Web Histories via Learned Graph Memory
Pith reviewed 2026-05-10 17:18 UTC · model grok-4.3
The pith
A learned graph memory called ACGM builds task-adaptive relevance graphs over multi-modal web agent histories by optimizing with policy gradients from task success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ACGM constructs task-adaptive relevance graphs over agent histories using policy-gradient optimization from downstream task success. It captures heterogeneous temporal dynamics with modality-specific decay and learns sparse connectivity, enabling efficient retrieval while outperforming 19 baselines on WebShop, VisualWebArena, and Mind2Web.
What carries the argument
ACGM, the learned graph-memory retriever that uses policy-gradient optimization driven by task success to construct task-adaptive relevance graphs over multi-modal histories.
If this is right
- Retrieval reaches 82.7 nDCG@10 and 89.2% Precision@10, gains of 9.3 and 7.7 points over GPT-4o.
- Visual observations decay 4.3 times faster than text in the learned graphs (lambda_v = 0.47 versus lambda_x = 0.11).
- Average degree stays low at 3.2 edges per node, supporting O(log T) retrieval time.
- The same trained model improves results across three distinct web-agent environments without per-benchmark retuning.
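The modality-specific decay figures above can be made concrete. Assuming the usual exponential form relevance ∝ exp(−λ·Δt) — the review text does not spell out the decay function, so this is an illustrative reading, not the paper's code — the reported rates imply a screenshot's weight falls off far faster than text:

```python
import math

# Decay rates as reported in the review; the exponential form itself
# is an assumption made for illustration.
LAMBDA_V = 0.47  # visual observations (screenshots)
LAMBDA_X = 0.11  # text observations (HTML)

def temporal_weight(lmbda: float, dt: float) -> float:
    """Relevance weight of an observation dt steps in the past."""
    return math.exp(-lmbda * dt)

# Ratio of rates matches the reported "4.3 times faster" figure.
print(LAMBDA_V / LAMBDA_X)            # ~4.27
# Ten steps back, a screenshot retains ~0.9% weight vs. ~33% for text.
print(temporal_weight(LAMBDA_V, 10))  # ~0.0091
print(temporal_weight(LAMBDA_X, 10))  # ~0.333
```

Under this reading, "visual decays 4.3× faster" is a statement about the ratio of the two learned exponents, not about absolute retention.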
Where Pith is reading between the lines
- The approach could be tested on non-web sequential tasks such as long-horizon robotics or code-repair agents to check whether policy-gradient memory learning transfers.
- If the learned decay rates prove stable, memory architectures for multimodal models might adopt separate visual and textual forgetting schedules by default.
- Replacing large context windows with selective graph retrieval might lower token cost while preserving performance on extended interactions.
- Dynamic task changes within a single episode could be used to probe whether the graphs reconfigure on the fly or remain fixed after training.
Load-bearing premise
Policy-gradient optimization from task success on the three evaluation benchmarks produces relevance graphs that genuinely adapt to new tasks instead of overfitting to benchmark artifacts.
What would settle it
Train ACGM on the published benchmarks, then evaluate retrieval quality on a fresh, previously unseen web-agent task whose distribution differs from WebShop, VisualWebArena, and Mind2Web; if nDCG@10 falls back near baseline levels, the adaptation claim is falsified.
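The falsification test hinges on nDCG@10 on the unseen task. For reference, the metric can be computed as follows; this is the standard textbook definition (with the ideal ranking taken over the same relevance list, for simplicity), not anything specific to ACGM:

```python
import math

def dcg_at_k(rels, k=10):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(retrieved_rels, k=10):
    """nDCG@k: DCG of the retrieved ranking over DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(retrieved_rels, reverse=True), k)
    return dcg_at_k(retrieved_rels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A perfectly ordered ranking scores 1.0; a reversed one scores less.
print(ndcg_at_k([3, 2, 1, 0]))        # 1.0
print(ndcg_at_k([0, 1, 2, 3]) < 1.0)  # True
```

A drop from 82.7 toward baseline levels on the held-out task, measured this way, is the concrete failure condition.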
Original abstract
Retrieving relevant observations from long multi-modal web interaction histories is challenging because relevance depends on the evolving task state, modality (screenshots, HTML text, structured signals), and temporal distance. Prior approaches typically rely on static similarity thresholds or fixed-capacity buffers, which fail to adapt relevance to the current task context. We propose ACGM, a learned graph-memory retriever that constructs task-adaptive relevance graphs over agent histories using policy-gradient optimization from downstream task success. ACGM captures heterogeneous temporal dynamics with modality-specific decay (visual decays 4.3× faster than text: λ_v = 0.47 vs. λ_x = 0.11) and learns sparse connectivity (3.2 edges/node), enabling efficient O(log T) retrieval. Across WebShop, VisualWebArena, and Mind2Web, ACGM improves retrieval quality to 82.7 nDCG@10 (+9.3 over GPT-4o, p < 0.001) and 89.2% Precision@10 (+7.7), outperforming 19 strong dense, re-ranking, multi-modal, and graph-based baselines. Code to reproduce our results is available at https://github.com/S-Forouzandeh/ACGM-Agentic-Web.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ACGM, a learned graph-memory retriever for multi-modal agentic web histories. It constructs task-adaptive relevance graphs via policy-gradient optimization driven by downstream task success, with modality-specific decay (visual decays 4.3× faster than text) and sparse connectivity (3.2 edges/node) for O(log T) retrieval. It reports 82.7 nDCG@10 (+9.3 over GPT-4o) and 89.2% Precision@10 (+7.7) on WebShop, VisualWebArena, and Mind2Web, outperforming 19 baselines, with reproduction code available.
Significance. If the learned decay rates and connectivity produce genuinely generalizable task-adaptive graphs rather than benchmark-specific fits, the work would advance retrieval for long-horizon multi-modal agents by replacing static buffers with an efficient, context-sensitive mechanism. The explicit parameter values, O(log T) claim, and public code are concrete strengths that support verification and extension.
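The O(log T) claim deserves a mechanical gloss. The abstract does not describe the retrieval algorithm, but the standard way a sparse graph (here ~3.2 edges/node) yields sublinear search is best-first traversal over a navigable graph, as in small-world indexes such as HNSW: bounded degree keeps each hop cheap, and well-shaped connectivity keeps the hop count roughly logarithmic in history length T. A minimal, purely illustrative sketch (the graph shape and similarity function are toy assumptions, not ACGM's):

```python
import heapq

def greedy_graph_search(query_sim, adj, start, budget=50):
    """Best-first traversal of a sparse relevance graph.

    query_sim: node -> similarity to the current query
    adj: node -> list of neighbor nodes (bounded degree)
    Each hop inspects only a node's few neighbors; with small-world-style
    connectivity the hop count grows roughly logarithmically in graph size.
    """
    visited = {start}
    frontier = [(-query_sim(start), start)]
    best_node, best_sim = start, query_sim(start)
    while frontier and budget > 0:
        neg_sim, node = heapq.heappop(frontier)
        budget -= 1
        if -neg_sim > best_sim:
            best_node, best_sim = node, -neg_sim
        for nb in adj.get(node, []):
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier, (-query_sim(nb), nb))
    return best_node

# Toy chain graph of 10 history steps; similarity peaks at node 7.
adj = {i: [i - 1, i + 1] for i in range(1, 9)}
adj[0], adj[9] = [1], [8]
print(greedy_graph_search(lambda n: -abs(n - 7), adj, start=0))  # 7
```

Whether ACGM's learned graphs actually have the navigability properties this cost model assumes is exactly what the sparsity ablation requested below would probe.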
major comments (2)
- [Abstract] The policy-gradient optimization is driven by downstream task success on WebShop, VisualWebArena, and Mind2Web—the same distributions used for the reported 82.7 nDCG@10 and 89.2% Precision@10—yet no held-out task splits, cross-benchmark transfer tests, or separate validation sets are described. This is load-bearing for the central claim of task-adaptive retrieval, because the learned values (λ_v=0.47, λ_x=0.11, 3.2 edges/node) could encode benchmark artifacts (task length distributions, screenshot/HTML patterns) instead of transferable adaptation rules.
- [Abstract] No ablation or sensitivity analysis is provided on the modality-specific decay rates or the average edges per node, despite these being presented as central to the adaptation and efficiency claims. This omission is load-bearing because it prevents assessment of whether the +9.3 nDCG@10 and +7.7 Precision@10 gains over the 19 baselines are attributable to the learned graph structure or to other unablated factors in the optimization.
minor comments (1)
- [Abstract] The GitHub link is given but lacks a stable identifier or version tag, which would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments, which help clarify the presentation of our claims regarding task-adaptive retrieval. We address each major comment below.
Point-by-point responses
- Referee: [Abstract] The policy-gradient optimization is driven by downstream task success on WebShop, VisualWebArena, and Mind2Web—the same distributions used for the reported 82.7 nDCG@10 and 89.2% Precision@10—yet no held-out task splits, cross-benchmark transfer tests, or separate validation sets are described. This is load-bearing for the central claim of task-adaptive retrieval, because the learned values (λ_v=0.47, λ_x=0.11, 3.2 edges/node) could encode benchmark artifacts (task length distributions, screenshot/HTML patterns) instead of transferable adaptation rules.
  Authors: The optimization indeed uses task success signals from the training distributions of these benchmarks. While the abstract does not detail additional splits, the policy gradient is applied per-episode, enabling adaptation within each benchmark's task variations. To directly address concerns about benchmark-specific artifacts, we will add in the revision: (1) held-out task splits within each benchmark, (2) cross-benchmark transfer where decay parameters learned on one benchmark are frozen and tested on the others. This will demonstrate whether the learned λ_v, λ_x, and connectivity generalize beyond the training distributions. revision: yes
- Referee: [Abstract] No ablation or sensitivity analysis is provided on the modality-specific decay rates or the average edges per node, despite these being presented as central to the adaptation and efficiency claims. This omission is load-bearing because it prevents assessment of whether the +9.3 nDCG@10 and +7.7 Precision@10 gains over the 19 baselines are attributable to the learned graph structure or to other unablated factors in the optimization.
  Authors: We concur that ablations are necessary to isolate the contribution of the learned decay rates and sparsity. In the revised manuscript, we will include a dedicated ablation study section. This will feature sensitivity plots for λ_v and λ_x (varying each by ±0.2 around learned values while fixing the other), as well as for average edges per node (testing 1, 2, 3.2, 5, 10). We will also compare against a non-modality-specific (uniform decay) variant to quantify the benefit of heterogeneous temporal dynamics. These results will clarify the source of the reported gains. revision: yes
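The promised sensitivity study has a simple shape. A sketch of the sweep loop, where `evaluate_ndcg` is a hypothetical stand-in for re-running retrieval with the decay rates and edge budget frozen at the given values (grid values mirror the rebuttal's plan; note the λ_x lower sweep is floored at 0, since a negative decay rate is not meaningful):

```python
# Sketch of the ablation grid from the rebuttal. `evaluate_ndcg` is a
# hypothetical callable, not part of the paper's released code.
def ablation_grid(evaluate_ndcg):
    results = {}
    # +/-0.2 sweeps around the learned decay rates, one at a time.
    for lv in (0.27, 0.47, 0.67):
        results[("lambda_v", lv)] = evaluate_ndcg(lambda_v=lv, lambda_x=0.11, edges=3.2)
    for lx in (0.0, 0.11, 0.31):  # 0.11 - 0.2 is floored at 0
        results[("lambda_x", lx)] = evaluate_ndcg(lambda_v=0.47, lambda_x=lx, edges=3.2)
    # Edge-budget sweep requested for the efficiency claim.
    for e in (1, 2, 3.2, 5, 10):
        results[("edges", e)] = evaluate_ndcg(lambda_v=0.47, lambda_x=0.11, edges=e)
    return results

# Dummy evaluator just to show the grid shape: 3 + 3 + 5 = 11 configurations.
grid = ablation_grid(lambda **kw: 0.0)
print(len(grid))  # 11
```

Even this small grid would separate the decay and sparsity contributions from the rest of the optimization, which is what the referee's second comment asks for.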
Circularity Check
Policy-gradient optimization on evaluation benchmarks reduces reported retrieval gains to fitted outcomes on the same distributions
specific steps
- Fitted input called prediction [Abstract]:
  "We propose ACGM, a learned graph-memory retriever that constructs task-adaptive relevance graphs over agent histories using policy-gradient optimization from downstream task success. ... Across WebShop, VisualWebArena, and Mind2Web, ACGM improves retrieval quality to 82.7 nDCG@10 (+9.3 over GPT-4o, p<0.001) and 89.2% Precision@10 (+7.7), outperforming 19 strong dense, re-ranking, multi-modal, and graph-based baselines."
  The policy-gradient updates are driven by task success on the identical WebShop/VisualWebArena/Mind2Web distributions used to compute the reported nDCG@10 and Precision@10. The 'task-adaptive' property and the numerical gains are therefore the fitted outcome of the optimization on the evaluation data rather than a prediction of generalization; the 19 baselines are also evaluated on the same data, so outperformance is consistent with better fitting to benchmark artifacts (e.g., typical task lengths or modality co-occurrence patterns) instead of transferable adaptation rules.
full rationale
The paper's central result is that policy-gradient optimization driven by downstream task success produces task-adaptive graphs yielding 82.7 nDCG@10 and 89.2% Precision@10 on WebShop, VisualWebArena, and Mind2Web. Because the optimization objective and the reported metrics use identical benchmark distributions (with no held-out splits or OOD tests referenced in the provided text), the measured improvements are the direct result of fitting rather than an independent derivation or prediction. This matches the fitted-input-called-prediction pattern exactly; no equations reduce by algebraic identity, but the performance numbers are statistically forced by the training signal on the evaluation data. The learned decay rates and sparsity are likewise outputs of the same optimization loop. No self-citation load-bearing or ansatz smuggling is present. The derivation chain is therefore partially circular at the evaluation step.
Axiom & Free-Parameter Ledger
free parameters (3)
- visual decay rate λ_v = 0.47
- text decay rate λ_x = 0.11
- average edges per node = 3.2
axioms (1)
- domain assumption: Policy-gradient optimization from downstream task success produces task-adaptive relevance graphs.
invented entities (1)
- Task-adaptive relevance graph (no independent evidence)
Reference graph
Works this paper leans on
- [1] Alan Baddeley. 1992. Working memory: The interface between memory and cognition. Journal of Cognitive Neuroscience 4, 3 (1992), 281–288.
- [2] Chenyang Bu, Guojie Chang, Zihao Chen, CunYuan Dang, Zhize Wu, Yi He, and Xindong Wu. 2025. Query-driven multimodal GraphRAG: Dynamic local knowledge graph construction for online reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, 21360–21380.
- [3] Jia Chen, Qian Dong, Haitao Li, Xiaohui He, Yan Gao, Shaosheng Cao, Yi Wu, Ping Yang, Chen Xu, Yao Hu, et al. 2025. Qilin: A multimodal information retrieval dataset with app-level user sessions. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3670–3680.
- [4] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. 2025. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413 (2025).
- [5] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems 36 (2023), 28091–28114.
- [6] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. 2024. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130 (2024).
- [7] Miles Efron, Peter Organisciak, and Katrina Fenlon. 2012. Improving retrieval of short texts through document expansion. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, 911–920.
- [8] Jack Gallifant, Amelia Fiske, Yulia A Levites Strekalova, Juan S Osorio-Valencia, Rachael Parke, Rogers Mwavu, Nicole Martinez, Judy Wawira Gichoya, Marzyeh Ghassemi, Dina Demner-Fushman, et al. 2024. Peer review of GPT-4 technical report and systems card. PLOS Digital Health 3, 1 (2024), e0000417.
- [9] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. ImageBind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15180–15190.
- [10] Bernal J Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. 2024. HippoRAG: Neurobiologically inspired long-term memory for large language models. Advances in Neural Information Processing Systems 37 (2024), 59532–59569.
- [11]
- [12] Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118 (2021).
- [13] Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. 2024. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 881–905.
- [14] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, PMLR, 19730–19742.
- [15] Pei Liu, Xin Liu, Ruoyu Yao, Junming Liu, Siyuan Meng, Ding Wang, and Jun Ma. 2025. HM-RAG: Hierarchical multi-agent multimodal retrieval augmented generation. In Proceedings of the 33rd ACM International Conference on Multimedia, 2781–2790.
- [16] Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pretrained sequence-to-sequence model. In Findings of the Association for Computational Linguistics: EMNLP 2020, 708–718.
- [17] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, PMLR, 8748–8763.
- [18] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. 2025. Zep: A temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956 (2025).
- [19] Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3715–3734.
- [20] Grace H Sun and Stephanie H Hoelscher. 2023. The ChatGPT storm and what faculty can do. Nurse Educator 48, 3 (2023), 119–124.
- [21]
- [22]
- [23] Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In International Conference on Learning Representations.
- [24] Derong Xu, Yi Wen, Pengyue Jia, Yingyi Zhang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao, Enhong Chen, Tong Xu, et al. 2025. From single to multi-granularity: Toward long-term memory association and selection of conversational agents. arXiv preprint arXiv:2505.19549 (2025).
- [25] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. 2025. A-mem: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110 (2025).
- [26]
- [27] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems 35 (2022), 20744–20757.
- [28] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.
- [29]