MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA

Hongtao Liu; Jian Yang; Kaifeng Chen; Qing Yang; Qiyao Peng; Xiaochen Zhang; Yongqiang Liu

arxiv: 2606.05749 · v1 · pith:XY4CFBT5new · submitted 2026-06-04 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-28 02:03 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords multimodal long document QA · agent framework · structured memory · iterative retrieval-reasoning · MMLongBench-Doc · DocBench · context noise reduction

The pith

MARDoc splits multimodal long-document QA into three agents that use structured memory instead of full history to reduce noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing iterative agents for long-document QA suffer from scattered evidence as context grows with every retrieval and reasoning step. MARDoc addresses this by assigning separate roles to an Explorer that retrieves at multiple granularities, a Refiner that turns traces into structured evidence and reasoning memories, and a Reflector that checks sufficiency and gives feedback. These agents share only the updated structured memory across iterations rather than the entire accumulated history. Experiments on MMLongBench-Doc and DocBench show the resulting system beats same-backbone baselines. The design is presented as a way to keep critical facts and their dependencies intact while cutting irrelevant noise.

Core claim

MARDoc decouples long-document QA into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback; across iterations the agents rely on a dynamically updated structured memory rather than a full accumulated interaction history, reducing context noise while preserving answer-critical facts and their logical dependencies.

What carries the argument

Three-agent structure (Explorer, Refiner, Reflector) plus dynamically updated structured memory that replaces full interaction history.
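
The control flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `Memory`, `run_loop`, and the three callables are hypothetical names, and the toy agents stand in for LLM calls. The only point it makes is that each iteration carries the updated memory forward and discards the raw trace.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Memory:
    """Hypothetical structured memory shared across iterations."""
    evidence: List[str] = field(default_factory=list)   # distilled facts
    reasoning: List[str] = field(default_factory=list)  # how the facts connect
    feedback: List[str] = field(default_factory=list)   # gaps named by the Reflector


def run_loop(
    question: str,
    explore: Callable[[str, Memory], List[str]],
    refine: Callable[[List[str], Memory], Memory],
    reflect: Callable[[str, Memory], Tuple[Optional[str], List[str]]],
    max_iters: int = 3,
) -> Optional[str]:
    memory = Memory()
    for _ in range(max_iters):
        trace = explore(question, memory)   # raw trace for this round only
        memory = refine(trace, memory)      # distil into memory; the trace is dropped
        answer, gaps = reflect(question, memory)
        if answer is not None:
            return answer
        memory.feedback = gaps              # targeted feedback steers the next round
    return None


# Toy agents (assumptions for illustration; real agents would call a model).
def toy_explore(question: str, memory: Memory) -> List[str]:
    pages = {"revenue": "revenue=10", "cost": "cost=4"}
    wanted = memory.feedback or ["revenue"]
    return [pages[key] for key in wanted if key in pages]


def toy_refine(trace: List[str], memory: Memory) -> Memory:
    new_items = [item for item in trace if item not in memory.evidence]
    return Memory(memory.evidence + new_items, memory.reasoning, [])


def toy_reflect(question: str, memory: Memory) -> Tuple[Optional[str], List[str]]:
    facts = dict(item.split("=") for item in memory.evidence)
    missing = [key for key in ("revenue", "cost") if key not in facts]
    if missing:
        return None, missing
    return str(int(facts["revenue"]) - int(facts["cost"])), []


print(run_loop("What is the profit?", toy_explore, toy_refine, toy_reflect))  # prints 6
```

In this toy run the first round finds only the revenue figure, the Reflector reports the cost as missing, and the second round resolves it.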

If this is right

Key evidence stays concentrated instead of scattering across growing context.
Multi-hop reasoning avoids dilution from earlier irrelevant traces.
Performance gains appear on both MMLongBench-Doc and DocBench when using the same backbone models.
The approach shows structured memory can replace full history in agentic document QA.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same memory-refinement pattern could be tested on non-document multimodal tasks such as video or web navigation.
Comparing memory size growth versus full-history token count would quantify any efficiency gains.
The Reflector feedback loop might be adapted to other agent frameworks that currently rely on raw conversation logs.
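
The size comparison suggested in the list above could be set up along these lines. This is a rough sketch with assumed names (`count_tokens`, `full_history_sizes`, `memory_sizes`); whitespace splitting is a crude stand-in for a real tokenizer, and the strings are toy data rather than traces from the paper.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Whitespace token count; an assumption standing in for a real tokenizer."""
    return len(text.split())


def full_history_sizes(traces: List[str]) -> List[int]:
    """Context size per iteration when every trace is appended to the history."""
    sizes, total = [], 0
    for trace in traces:
        total += count_tokens(trace)
        sizes.append(total)
    return sizes


def memory_sizes(snapshots: List[str]) -> List[int]:
    """Context size per iteration when only the current memory snapshot is carried."""
    return [count_tokens(snapshot) for snapshot in snapshots]


# Toy data: three rounds of raw traces and the memory snapshot after each round.
traces = ["a b c d e f", "g h i j", "k l m n o"]
snapshots = ["a b", "a b g", "a b g k"]

print(full_history_sizes(traces))  # prints [6, 10, 15]
print(memory_sizes(snapshots))     # prints [2, 3, 4]
```

Plotting the two series against iteration count on real runs would show whether memory size stays bounded while the full history keeps growing.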

Load-bearing premise

The structured memory can hold all answer-critical facts and logical dependencies that the full history would have contained.

What would settle it

An ablation that swaps the structured memory for the full accumulated history and still matches or exceeds MARDoc performance on MMLongBench-Doc and DocBench.

Figures

Figures reproduced from arXiv: 2606.05749 by Hongtao Liu, Jian Yang, Kaifeng Chen, Qing Yang, Qiyao Peng, Xiaochen Zhang, Yongqiang Liu.

**Figure 1.** Overview of the MARDoc framework. (Top) The monolithic context stream paradigm appends all interactions to a single expanding context, causing key evidence to remain scattered without structured representation. (Bottom) The proposed MARDoc framework decouples retrieval, refinement, and reflection into three specialized agents communicating through a dynamically updated report memory. We present a case study…

**Figure 2.** Accuracy on MMLongBench-Doc categorized by the number of evidence pages. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png]

| Method | Evidence Page: SIN | Evidence Page: MUL | Evidence Page: UNA | ACC | F1 |
|---|---|---|---|---|---|
| MARDoc | 61.6 | 43.8 | 69.1 | 57.1 | 54.6 |
| w/o MR | 59.1 | 40.6 | 69.3 | 55.0 | 52.0 |
| w/o ME | 59.4 | 42.2 | 64.7 | 54.9 | 52.3 |
| w/o ME & MR | 57.9 | 39.4 | 66.4 | 53.7 | 51.8 |
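
As a worked reading of the ablation rows listed with Figure 2, the short script below computes each variant's difference from the full MARDoc row. The column order (SIN, MUL, UNA, ACC, F1) is an assumption inferred from the flattened header, and the variable names are illustrative.

```python
# Column order is an assumption inferred from the flattened table header.
COLUMNS = ["SIN", "MUL", "UNA", "ACC", "F1"]

ROWS = {
    "MARDoc": [61.6, 43.8, 69.1, 57.1, 54.6],
    "w/o MR": [59.1, 40.6, 69.3, 55.0, 52.0],
    "w/o ME": [59.4, 42.2, 64.7, 54.9, 52.3],
    "w/o ME & MR": [57.9, 39.4, 66.4, 53.7, 51.8],
}

full = ROWS["MARDoc"]

# Difference of each ablated variant from the full system, rounded to one decimal.
deltas = {
    name: {col: round(value - base, 1) for col, value, base in zip(COLUMNS, values, full)}
    for name, values in ROWS.items()
    if name != "MARDoc"
}

for name, row in deltas.items():
    print(name, row)
```

Under that column reading, removing both components costs 3.4 points of accuracy and 2.8 points of F1, while removing MR alone leaves the UNA column slightly higher (+0.2) than the full system.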

**Figure 3.** Performance variation with different maxi [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]

**Figure 4.** Ablation study on document process and tools. We report the generalized accuracy of five types of evidence sources, including Pure Text (TXT), Layout (LAY), Chart (CHA), Table (TAB), and Figure (FIG). …tion study comparing the document processing pipelines and toolsets of MARDoc and DocAgent. For document parsing, DocAgent extracts PDF content using DocXChain (Yao, 2023) and PyMuPDF, while MARDoc employs M…

**Figure 5.** Prompt for Explorer [PITH_FULL_IMAGE:figures/full_fig_p017_5.png]

**Figure 6.** Prompt for Refiner [PITH_FULL_IMAGE:figures/full_fig_p018_6.png]

**Figure 7.** Prompt for Reflector [PITH_FULL_IMAGE:figures/full_fig_p019_7.png]

Original abstract

Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and intermediate reasoning. As interactions accumulate, key evidence becomes scattered and diluted, making multi-hop reasoning noisy. We propose MARDoc, a Memory-Aware Refinement Agent framework that decouples long-document QA into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback. Across iterations, the agents rely on a dynamically updated structured memory rather than a full accumulated interaction history. This design reduces context noise while preserving answer-critical facts and their logical dependencies. Experiments on MMLongBench-Doc and DocBench show that MARDoc achieves strong results, outperforming same-backbone baselines and demonstrating the effectiveness of structured memory for agentic document QA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MARDoc's three-agent split with structured memory is a reasonable engineering response to context dilution in agentic long-document QA, but the outperformance claims rest on thin visible evidence.

The paper's main move is breaking the agent into an Explorer for multi-granularity multimodal retrieval, a Refiner that turns interaction traces into structured evidence and reasoning memories, and a Reflector that tests sufficiency and loops feedback back to the Explorer. They keep a dynamically updated structured memory instead of the full accumulated history. That specific decoupling is the clearest new design choice here.

It directly targets the scattering of key facts as iterations grow, which is a practical headache in these systems. If the full experiments show that the structured memory preserves dependencies better than baselines, the idea could be useful for anyone building retrieval-reasoning loops over long multimodal docs.

The soft spot is the evidence. The abstract states strong results on MMLongBench-Doc and DocBench and outperformance over same-backbone baselines, yet supplies no numbers, ablations, or error breakdowns. Without those, it's impossible to judge whether the gains trace to the memory structure or to other factors like prompting or retrieval details. The stress-test note found no internal contradictions, which is fair, but the central empirical claim still can't be checked from what's given.

This is for people already working on agent frameworks for document QA. A reader who needs concrete ideas for reducing context noise would find the architecture worth examining. It deserves peer review because the problem is real and the proposed split is clearly motivated, even if the evaluation section will need more scrutiny to hold up.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MARDoc, a framework that decouples multimodal long-document QA into three specialized agents (Explorer for multi-granularity retrieval, Refiner for distilling traces into structured evidence/reasoning memories, and Reflector for sufficiency checking) that operate over a dynamically updated structured memory rather than full interaction history. The central claim is that this reduces context noise while preserving critical facts and dependencies, yielding strong empirical results that outperform same-backbone baselines on MMLongBench-Doc and DocBench.

Significance. If the reported outperformance holds under rigorous controls, the work would be significant for agentic multimodal QA by demonstrating a practical mechanism (structured memory + role specialization) to mitigate dilution in long iterative traces. This directly targets a known pain point in retrieval-reasoning loops and could influence subsequent designs for long-context document agents.

major comments (2)

[§4] §4 (Experiments): the central claim of outperformance is load-bearing yet the manuscript supplies no quantitative metrics (e.g., exact accuracy/F1 deltas vs. baselines), ablation tables isolating the structured-memory component, statistical significance tests, or error analysis. Without these, the effectiveness of the three-agent design cannot be verified.
[§3.2] §3.2 (Refiner): the claim that the Refiner distills interaction traces into structured memories that preserve logical dependencies is central to the noise-reduction argument, but no concrete memory schema, update rules, or example trace-to-memory transformation is provided to allow reproduction or inspection of information loss.

minor comments (2)

[Abstract, §1] The abstract and §1 would benefit from a one-sentence statement of the precise performance gains (e.g., “+X% on MMLongBench-Doc”) rather than the generic phrase “strong results.”
[§3] Notation for the three memory types (evidence memory, reasoning memory, feedback) should be introduced with explicit symbols or a small diagram in §3 to improve readability.
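
To make the request for an explicit schema and update rules concrete, here is one possible shape such a memory could take. Everything in it is hypothetical (`Evidence`, `ReasoningStep`, `StructuredMemory`, the ID scheme, the deduplication rule); the paper's actual schema is not given in the text reviewed here. The `source` field reflects the prompt instruction that each piece of information should indicate where it originates.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Evidence:
    eid: str
    content: str
    source: str  # where the fact came from, e.g. a page or image locator


@dataclass(frozen=True)
class ReasoningStep:
    claim: str
    depends_on: Tuple[str, ...]  # IDs of the evidence items the claim rests on


@dataclass
class StructuredMemory:
    evidence: Dict[str, Evidence] = field(default_factory=dict)
    reasoning: List[ReasoningStep] = field(default_factory=list)

    def add_evidence(self, content: str, source: str) -> str:
        """Store a fact once; a repeated (content, source) pair returns the existing ID."""
        for item in self.evidence.values():
            if item.content == content and item.source == source:
                return item.eid
        eid = f"E{len(self.evidence) + 1}"
        self.evidence[eid] = Evidence(eid, content, source)
        return eid

    def add_reasoning(self, claim: str, depends_on: List[str]) -> None:
        """Reject a reasoning step whose dependencies are not in memory."""
        missing = [eid for eid in depends_on if eid not in self.evidence]
        if missing:
            raise KeyError(f"unknown evidence ids: {missing}")
        self.reasoning.append(ReasoningStep(claim, tuple(depends_on)))

    def render(self) -> str:
        """Serialize the memory as the compact text an agent would read."""
        lines = [f"[{e.eid}] {e.content} (source: {e.source})" for e in self.evidence.values()]
        lines += [f"- {s.claim} <- {', '.join(s.depends_on)}" for s in self.reasoning]
        return "\n".join(lines)


memory = StructuredMemory()
first = memory.add_evidence("fact A", "page 1")
second = memory.add_evidence("fact B", "page 7")
memory.add_reasoning("A and B together imply C", [first, second])
print(memory.render())
```

Keeping dependencies as explicit evidence IDs is one way to make information loss inspectable: a reasoning step that points at a fact the memory no longer holds fails loudly instead of silently.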

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight areas where additional detail will strengthen the manuscript. We address each major comment below and will incorporate the requested elements in the revision.

Referee: [§4] §4 (Experiments): the central claim of outperformance is load-bearing yet the manuscript supplies no quantitative metrics (e.g., exact accuracy/F1 deltas vs. baselines), ablation tables isolating the structured-memory component, statistical significance tests, or error analysis. Without these, the effectiveness of the three-agent design cannot be verified.

Authors: We agree that the current version presents results at a summary level. In the revised manuscript we will add tables with exact accuracy and F1 scores including deltas versus all baselines, dedicated ablation tables isolating the structured-memory component, statistical significance tests (e.g., paired t-tests or McNemar), and a concise error-analysis subsection. These additions will directly support verification of the three-agent design. revision: yes

Referee: [§3.2] §3.2 (Refiner): the claim that the Refiner distills interaction traces into structured memories that preserve logical dependencies is central to the noise-reduction argument, but no concrete memory schema, update rules, or example trace-to-memory transformation is provided to allow reproduction or inspection of information loss.

Authors: We concur that explicit documentation of the memory schema, update rules, and a worked example of trace-to-memory transformation will improve reproducibility and allow readers to assess information preservation. The revision will include a formal schema definition (with field descriptions), pseudocode for the update procedure, and a concrete example showing an interaction trace being distilled into evidence and reasoning memory entries. revision: yes
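
The McNemar test mentioned in the first response compares two systems on the same questions by looking only at the questions where exactly one of them is correct. A minimal exact (binomial) version using only the standard library might look like this; `mcnemar_exact` and the toy correctness vectors are illustrative and are not taken from the paper.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(a_correct: Sequence[bool], b_correct: Sequence[bool]) -> Tuple[int, int, float]:
    """Two-sided exact McNemar test on paired per-question correctness.

    Returns (questions only A got right, questions only B got right, p-value).
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both systems must be scored on the same questions")
    only_a = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    only_b = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # no discordant pairs, nothing to test
    k = min(only_a, only_b)
    # Under the null hypothesis, discordant pairs split 50/50 (Binomial(n, 0.5)).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Toy data: 10 questions, system A vs system B.
system_a = [True] * 8 + [False] * 2
system_b = [False] * 6 + [True] * 2 + [True, False]

print(mcnemar_exact(system_a, system_b))  # prints (6, 1, 0.125)
```

Because the test ignores questions both systems get right or wrong, it suits same-backbone comparisons where most answers agree and the interesting signal is in the disagreements.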

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical agent framework for document QA with no equations, fitted parameters, predictions, or derivation chain. The architecture (Explorer/Refiner/Reflector with structured memory) is presented as an independent design choice validated by benchmark results. No self-citation load-bearing steps, self-definitional reductions, or ansatz smuggling appear in the provided text. The central claim reduces to experimental outperformance rather than any input-by-construction equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the high-level agent names; the design rests on the unverified premise that structured memory preserves dependencies better than full history.

Reference graph

Works this paper leans on

19 extracted references · 3 canonical work pages · 3 internal anchors

[1]

Qwen3-VL Technical Report

Qwen3-VL technical report. arXiv preprint arXiv:2511.21631. Jian Chen, Ruiyi Zhang, Yufan Zhou, Tong Yu, Franck Dernoncourt, Jiuxiang Gu, Ryan A Rossi, Changyou Chen, and Tong Sun. 2024. SV-RAG: LoRA-contextualizing adaptation of MLLMs for long document understanding. arXiv preprint arXiv:2411.01106. Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, and M...

[2]

Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

ERNIE-Layout: Layout knowledge enhanced pre-training for visually-rich document understanding. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3744–3756. Aditi Singh, Abul Ehtesham, Saket Kumar, Tala Talaei Khoei, and Athanasios V Vasilakos. 2025. Agentic retrieval-augmented generation: A survey on agentic RAG. arXiv prep...

[3]

DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

DocSeeker: Structured visual reasoning with evidence grounding for long document understanding. arXiv preprint arXiv:2604.12812. Cong Yao. 2023. DocXChain: A powerful open-source toolchain for document parsing and beyond. arXiv preprint arXiv:2310.12430. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. Re...

[4]

• Structure aware Retrieval

utilizes the hidden embeddings of LLMs for question-based retrieval, enhanced by LoRA adapters. VisRAG (Yu et al., 2024) leverages visual embeddings of document images to retrieve relevant pages, maximizing the retention and utilization of raw visual information. M3DocRAG (Cho et al., 2024) employs ColPali for vision-based page retrieval and utilizes Qw...

2024
[5]

• Iterative Retrieval

decomposes the task into four specialized agents, incorporating a hybrid reward mechanism and a two-stage training strategy. Chain-of-Agent (Zhang et al., 2024) is an agent-based framework in which multiple worker agents handle text segments one after another, and a manager agent then combines their outputs to produce the final response. SLEUTH (Liu...

2024
[6]

Identify the section that discusses trade relationships with China, specifically focusing on Rep/Lean Rep independents
[7]

name":"search

Find the relevant data, which is likely in a table or chart showing opinions on trade with China. ... </analysis> <tool_call> {"name":"search",arguments": "{"keyword": "trade"}"} </tool_call> <tool_response> We found 13 results that contain the keyword trade, listed below: <Item type="Paragraph" section_id="12" page_idx="9"> ... including the loss of U.S....
[8]

The survey included 2,596 U.S. adults
[9]

Image_id=25 shows that 26% of Rep/Lean Rep individuals favor building a strong relationship with China on trade
[10]

The total percentage for Rep/Lean Rep individuals in the chart is 98% (72% + 26%), suggesting the percentages are based on the total number of Rep/Lean Rep individuals
[11]

name":"search

The information does not provide the total number of Rep/Lean Rep individuals surveyed. To answer the question, I need to find the total number of Rep/Lean Rep individuals surveyed. This information is likely in the section discussing the survey methodology or the breakdown of respondents by political affiliation. I will start by searching for relevant ke...
[12]

Deep Thinking: please first deeply understand the user’s question
[13]

- Chart: Extract the chart title, axis labels, legend, data points, trends, extreme values, proportions, etc

Comprehensive Perception and Information Extraction: - Image: Identify the text, objects, scenes, and states in the image and convert them into textual descriptions. - Chart: Extract the chart title, axis labels, legend, data points, trends, extreme values, proportions, etc. - Table: Extract key rows, columns, and cell data, and pay attention to the corre...
[14]

- Each key point should be on a separate line, maintaining complete meaning and without adding any additional explanations or comments

Evidence-Driven: Make decisions based on reasoning steps deep thinking - Each piece of information should indicate from which part of the provided information it originates.. - Each key point should be on a separate line, maintaining complete meaning and without adding any additional explanations or comments. Strictly follow these steps: 1.Extraction: Sca...
[15]

Memory Completeness: - What information has been successfully gathered? - Are there obvious text sections that should have been examined? - Does memory address core question requirements?
[16]

Not answerable

Information Gaps: - What specific information is still missing? - Which sections or data points need investigation? - What additional searches would be beneficial? Important guidelines - If the memory does not mention the relevant topic at all: the answer is “Not answerable” - Rule of faithfulness: Be faithful. If the provided memory do not contain suffic...
[17]

Do not add any extra text, prefixes, or explanations outside the tags

When information is sufficient: Output the final answer inside <final_result></final_result> tags. Do not add any extra text, prefixes, or explanations outside the tags
[18]

When information is insufficient: if additional clarification is needed, output the briefly describe the missing information, one item per line inside <missing_info></missing_info> tags
[19]

Do not add any extra text, prefixes, or explanations outside the tags

When the document does not contain any information related to the user’s question: Output ’Not answerable’ inside <final_result></final_result> tags. Do not add any extra text, prefixes, or explanations outside the tags. Figure 7: Prompt for Reflector
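
The Reflector prompt excerpts above fix a simple output contract: a final answer inside `<final_result>` tags, or missing items, one per line, inside `<missing_info>` tags. A small parser for that contract could look like the sketch below. The function name, the returned dictionary keys, and the "malformed" fallback are assumptions for illustration; only the two tag names come from the prompt text.

```python
import re
from typing import Dict, List, Optional, Union

_FINAL = re.compile(r"<final_result>(.*?)</final_result>", re.DOTALL)
_MISSING = re.compile(r"<missing_info>(.*?)</missing_info>", re.DOTALL)

ParsedOutput = Dict[str, Union[str, Optional[str], List[str]]]


def parse_reflector_output(text: str) -> ParsedOutput:
    """Classify a Reflector reply as answered, needing more evidence, or malformed."""
    final = _FINAL.search(text)
    if final:
        return {"status": "answered", "answer": final.group(1).strip(), "missing": []}
    missing = _MISSING.search(text)
    if missing:
        # One missing item per non-empty line, as the prompt requests.
        items = [line.strip() for line in missing.group(1).splitlines() if line.strip()]
        return {"status": "needs_more", "answer": None, "missing": items}
    return {"status": "malformed", "answer": None, "missing": []}


done = parse_reflector_output("<final_result>Not answerable</final_result>")
more = parse_reflector_output("<missing_info>\nitem one\nitem two\n</missing_info>")
print(done["status"], more["missing"])
```

The list returned under `missing` is what a controller would hand back to the Explorer as targeted feedback for the next iteration.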
