MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA
Pith reviewed 2026-06-28 02:03 UTC · model grok-4.3
The pith
MARDoc splits multimodal long-document QA into three agents that use structured memory instead of full history to reduce noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MARDoc decouples long-document QA into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback; across iterations the agents rely on a dynamically updated structured memory rather than a full accumulated interaction history, reducing context noise while preserving answer-critical facts and their logical dependencies.
What carries the argument
Three-agent structure (Explorer, Refiner, Reflector) plus dynamically updated structured memory that replaces full interaction history.
If this is right
- Key evidence stays concentrated instead of scattering across growing context.
- Multi-hop reasoning avoids dilution from earlier irrelevant traces.
- Performance gains appear on both MMLongBench-Doc and DocBench when using the same backbone models.
- The approach shows structured memory can replace full history in agentic document QA.
Where Pith is reading between the lines
- The same memory-refinement pattern could be tested on non-document multimodal tasks such as video or web navigation.
- Comparing memory size growth versus full-history token count would quantify any efficiency gains.
- The Reflector feedback loop might be adapted to other agent frameworks that currently rely on raw conversation logs.
Load-bearing premise
The structured memory can hold all answer-critical facts and logical dependencies that the full history would have contained.
What would settle it
An ablation that swaps the structured memory for the full accumulated history and still matches or exceeds MARDoc performance on MMLongBench-Doc and DocBench.
Figures
read the original abstract
Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and intermediate reasoning. As interactions accumulate, key evidence becomes scattered and diluted, making multi-hop reasoning noisy. We propose MARDoc, a Memory-Aware Refinement Agent framework that decouples long-document QA into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback. Across iterations, the agents rely on a dynamically updated structured memory rather than a full accumulated interaction history. This design reduces context noise while preserving answer-critical facts and their logical dependencies. Experiments on MMLongBench-Doc and DocBench show that MARDoc achieves strong results, outperforming same-backbone baselines and demonstrating the effectiveness of structured memory for agentic document QA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MARDoc, a framework that decouples multimodal long-document QA into three specialized agents (Explorer for multi-granularity retrieval, Refiner for distilling traces into structured evidence/reasoning memories, and Reflector for sufficiency checking) that operate over a dynamically updated structured memory rather than full interaction history. The central claim is that this reduces context noise while preserving critical facts and dependencies, yielding strong empirical results that outperform same-backbone baselines on MMLongBench-Doc and DocBench.
Significance. If the reported outperformance holds under rigorous controls, the work would be significant for agentic multimodal QA by demonstrating a practical mechanism (structured memory + role specialization) to mitigate dilution in long iterative traces. This directly targets a known pain point in retrieval-reasoning loops and could influence subsequent designs for long-context document agents.
major comments (2)
- [§4] §4 (Experiments): the central claim of outperformance is load-bearing yet the manuscript supplies no quantitative metrics (e.g., exact accuracy/F1 deltas vs. baselines), ablation tables isolating the structured-memory component, statistical significance tests, or error analysis. Without these, the effectiveness of the three-agent design cannot be verified.
- [§3.2] §3.2 (Refiner): the claim that the Refiner distills interaction traces into structured memories that preserve logical dependencies is central to the noise-reduction argument, but no concrete memory schema, update rules, or example trace-to-memory transformation is provided to allow reproduction or inspection of information loss.
minor comments (2)
- [Abstract, §1] The abstract and §1 would benefit from a one-sentence statement of the precise performance gains (e.g., “+X% on MMLongBench-Doc”) rather than the generic phrase “strong results.”
- [§3] Notation for the three memory types (evidence memory, reasoning memory, feedback) should be introduced with explicit symbols or a small diagram in §3 to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight areas where additional detail will strengthen the manuscript. We address each major comment below and will incorporate the requested elements in the revision.
read point-by-point responses
-
Referee: [§4] §4 (Experiments): the central claim of outperformance is load-bearing yet the manuscript supplies no quantitative metrics (e.g., exact accuracy/F1 deltas vs. baselines), ablation tables isolating the structured-memory component, statistical significance tests, or error analysis. Without these, the effectiveness of the three-agent design cannot be verified.
Authors: We agree that the current version presents results at a summary level. In the revised manuscript we will add tables with exact accuracy and F1 scores including deltas versus all baselines, dedicated ablation tables isolating the structured-memory component, statistical significance tests (e.g., paired t-tests or McNemar), and a concise error-analysis subsection. These additions will directly support verification of the three-agent design. revision: yes
-
Referee: [§3.2] §3.2 (Refiner): the claim that the Refiner distills interaction traces into structured memories that preserve logical dependencies is central to the noise-reduction argument, but no concrete memory schema, update rules, or example trace-to-memory transformation is provided to allow reproduction or inspection of information loss.
Authors: We concur that explicit documentation of the memory schema, update rules, and a worked example of trace-to-memory transformation will improve reproducibility and allow readers to assess information preservation. The revision will include a formal schema definition (with field descriptions), pseudocode for the update procedure, and a concrete example showing an interaction trace being distilled into evidence and reasoning memory entries. revision: yes
Circularity Check
No significant circularity
full rationale
The paper describes an empirical agent framework for document QA with no equations, fitted parameters, predictions, or derivation chain. The architecture (Explorer/Refiner/Reflector with structured memory) is presented as an independent design choice validated by benchmark results. No self-citation load-bearing steps, self-definitional reductions, or ansatz smuggling appear in the provided text. The central claim reduces to experimental outperformance rather than any input-by-construction equivalence.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Qwen3-vl technical report.arXiv preprint arXiv:2511.21631. Jian Chen, Ruiyi Zhang, Yufan Zhou, Tong Yu, Franck Dernoncourt, Jiuxiang Gu, Ryan A Rossi, Changyou Chen, and Tong Sun. 2024. Sv- rag: Lora-contextualizing adaptation of mllms for long document understanding.arXiv preprint arXiv:2411.01106. Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, and M...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
Ernie-layout: Layout knowledge enhanced pre- training for visually-rich document understanding. In Findings of the Association for Computational Lin- guistics: EMNLP 2022, pages 3744–3756. Aditi Singh, Abul Ehtesham, Saket Kumar, Tala Talaei Khoei, and Athanasios V Vasilakos. 2025. Agentic retrieval-augmented generation: A survey on agentic rag.arXiv prep...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[3]
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
Docseeker: Structured visual reasoning with evidence grounding for long document understand- ing.arXiv preprint arXiv:2604.12812. Cong Yao. 2023. Docxchain: A powerful open-source toolchain for document parsing and beyond.arXiv preprint arXiv:2310.12430. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. Re...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
• Structure aware Retrieval
utilizes the hidden embeddings of LLMs for question-based retrieval, enhanced by LoRA adapters.VisRAG(Yu et al., 2024) leverages visual embeddings of document im- ages to retrieve relevant pages, maximizing the retention and utilization of raw visual in- formation.M3DocRAG(Cho et al., 2024) employs ColPali for vision-based page re- trieval and utilizes Qw...
2024
-
[5]
• Iterative Retrieval
decomposes the task into four special- ized agents, incorporating a hybrid reward mechanism and a two-stage training strat- egy.Chain-of-Agent(Zhang et al., 2024)is an agent-based framework in which multiple worker agents handle text segments one af- ter another, and a manager agent then com- bines their outputs to produce the final re- sponse.SLEUTH (Liu...
2024
-
[6]
Identify the section that discusses trade relationships with China, specifically focusing on Rep/Lean Rep independents
-
[7]
name":"search
Find the relevant data, which is likely in a table or chart showing opinions on trade with China. ... </analysis> <tool_call> {"name":"search",arguments": "{"keyword": "trade"}"} </tool_call> <tool_response> We found 13 results that contain the keyword trade, listed below: <Item type="Paragraph" section_id="12" page_idx="9"> ... including the loss of U.S....
-
[8]
The survey included 2,596 U.S. adults
-
[9]
Image_id=25 shows that 26% of Rep/Lean Rep individuals favor building a strong relationship with China on trade
-
[10]
The total percentage for Rep/Lean Rep individuals in the chart is 98% (72% + 26%), suggesting the percentages are based on the total number of Rep/Lean Rep individuals
-
[11]
name":"search
The information does not provide the total number of Rep/Lean Rep individuals surveyed. To answer the question, I need to find the total number of Rep/Lean Rep individuals surveyed. This information is likely in the section discussing the survey methodology or the breakdown of respondents by political affiliation. I will start by searching for relevant ke...
-
[12]
Deep Thinking:please first deeply understand the user’s question
-
[13]
- Chart: Extract the chart title, axis labels, legend, data points, trends, extreme values, proportions, etc
Comprehensive Perception and Information Extraction: - Image: Identify the text, objects, scenes, and states in the image and convert them into textual descriptions. - Chart: Extract the chart title, axis labels, legend, data points, trends, extreme values, proportions, etc. - Table: Extract key rows, columns, and cell data, and pay attention to the corre...
-
[14]
- Each key point should be on a separate line, maintaining complete meaning and without adding any additional explanations or comments
Evidence-Driven: Make decisions based on reasoning steps deep thinking - Each piece of information should indicate from which part of the provided information it originates.. - Each key point should be on a separate line, maintaining complete meaning and without adding any additional explanations or comments. Strictly follow these steps: 1.Extraction: Sca...
-
[15]
Memory Completeness: - What information has been successfully gathered? - Are there obvious text sections that should have been examined? - Does memory address core question requirements?
-
[16]
Not answerable
Information Gaps: - What specific information is still missing? - Which sections or data points need investigation? - What additional searches would be beneficial? Important guidelines - If the memory does not mention the relevant topic at all: the answer is “Not answerable” - Rule of faithfulness: Be faithful. If the provided memory do not contain suffic...
-
[17]
Do not add any extra text, prefixes, or explanations outside the tags
When information is sufficient: Output the final answer inside <final_result></final_result> tags. Do not add any extra text, prefixes, or explanations outside the tags
-
[18]
When information is insufficient: if additional clarification is needed, output the briefly describe the missing information, one item per line inside <missing_info></missing_info> tags
-
[19]
Do not add any extra text, prefixes, or explanations outside the tags
When the document does not contain any information related to the user’s question:Output ’Not answerable’ inside <final_result></final_result> tags. Do not add any extra text, prefixes, or explanations outside the tags. Figure 7: Prompt forReflector
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.