SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

Rachneet Kaur; Srijan Kumar; Sumitra Ganesh; Yiqiao Jin; Zhen Zeng

arxiv: 2510.26615 · v3 · submitted 2025-10-30 · 💻 cs.CL

Yiqiao Jin, Rachneet Kaur, Zhen Zeng, Sumitra Ganesh, Srijan Kumar

Pith reviewed 2026-05-18 03:00 UTC · model grok-4.3

keywords: retrieval-augmented generation · RAG · semantic compression · hybrid framework · large language models · answer quality · context efficiency · evidence reranking

The pith

SARA, a hybrid RAG framework, keeps a few passages as text and compresses the rest into vectors to raise answer quality under fixed token limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SARA as a hybrid retrieval-augmented generation method that addresses the trade-off between context limits, redundant evidence, and loss of fine details. It keeps a small number of passages in original text form to hold on to entities and numbers while turning the remaining evidence into semantic compression vectors that allow broader coverage. Those vectors then support iterative reranking of the evidence. The approach is evaluated on nine datasets using five open-source LLMs from three families, producing steady gains in relevance, correctness, and semantic similarity. A reader would care because the method shows a concrete way to improve RAG output without simply increasing token budgets or accepting more errors.

Core claim

SARA retains a small set of passages in text form to preserve entities and numerical values, compresses the remaining evidence into interpretable vectors for broader coverage, and uses those vectors for iterative evidence reranking. Across 9 datasets and 5 open-source LLMs spanning 3 model families (Mistral, Llama, and Gemma), SARA consistently improves answer relevance (+17.71), answer correctness (+13.72), and semantic similarity (+15.53), demonstrating the importance of integrating textual and compressed representations for robust, context-efficient RAG.

What carries the argument

Hybrid RAG setup that preserves a few natural-language passages while converting the balance of retrieved evidence into semantic compression vectors used for iterative reranking.
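
The split described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Passage` type, the whitespace token counter, the `max_text` cap, and the cost of one token slot per compressed passage are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    score: float  # retrieval relevance, higher is better (assumed to be given)


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def split_evidence(passages, budget, max_text=2, vector_cost=1):
    """Split retrieved passages into verbatim text and compressed slots.

    The top-scoring passages are kept as text (at most max_text, and only
    if they fit the budget). Every other passage is assigned a vector slot
    costing vector_cost tokens. Passages that fit in neither form are dropped.
    """
    ranked = sorted(passages, key=lambda p: p.score, reverse=True)
    kept_text, compressed, used = [], [], 0
    for p in ranked:
        cost = count_tokens(p.text)
        if len(kept_text) < max_text and used + cost <= budget:
            kept_text.append(p)
            used += cost
        elif used + vector_cost <= budget:
            compressed.append(p)
            used += vector_cost
    return kept_text, compressed, used


# Illustrative passages, not data from the paper.
docs = [
    Passage("Revenue rose 12 percent in 2023", 0.9),
    Passage("The firm was founded in Ohio", 0.8),
    Passage("Background on the sector and its history", 0.5),
    Passage("An unrelated note", 0.1),
]
text, vectors, used = split_evidence(docs, budget=14)
print(len(text), len(vectors), used)  # prints: 2 2 14
```

The greedy order favors exact wording for the highest-ranked evidence, which matches the stated motivation of preserving entities and numbers. A real system would replace the slot accounting with the actual length of the compressed representation.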

If this is right

Higher answer relevance, correctness, and semantic similarity hold across the tested datasets and model families.
The method works with LLMs from Mistral, Llama, and Gemma families without requiring model-specific changes.
Fixed token budgets can be used more effectively by mixing detailed text with compressed coverage.
Integrating textual and vector representations yields more robust RAG performance than either approach alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of future RAG pipelines could test similar mixtures of text and vectors on longer or noisier document collections.
The same hybrid pattern might be tried in domains that currently rely on heavy summarization before generation.
Experiments could vary the ratio of kept text to compressed vectors to find task-specific sweet spots.
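
The ratio sweep suggested above could start from a list of the configurations that fit a given budget. The sketch below is a hypothetical helper, assuming a fixed average passage length and a fixed token cost per compressed vector. Neither number comes from the paper.

```python
def feasible_splits(budget, avg_text_tokens, vector_cost, n_retrieved):
    """List (n_text, n_vectors) pairs that fit in the token budget.

    For each count of verbatim passages, the leftover budget is filled with
    as many vector slots as possible, capped by the evidence still available.
    """
    splits = []
    for n_text in range(n_retrieved + 1):
        text_tokens = n_text * avg_text_tokens
        if text_tokens > budget:
            break
        n_vec = min((budget - text_tokens) // vector_cost, n_retrieved - n_text)
        splits.append((n_text, n_vec))
    return splits


# Illustrative settings: 512-token budget, 128-token passages, 20 candidates.
print(feasible_splits(512, 128, 1, 20))
# prints: [(0, 20), (1, 19), (2, 18), (3, 17), (4, 0)]
```

Each pair is one candidate setting for an ablation. Evaluating answer quality at every pair would show where the trade-off between detail and coverage peaks for a given task.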

Load-bearing premise

The semantic compression vectors stay interpretable enough for effective iterative reranking while the retained text snippets continue to deliver their accuracy benefits.
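
The abstract does not specify the reranking loop, so the following sketch substitutes a well-known stand-in, greedy maximal marginal relevance (MMR) over vectors, to show what iterative reranking on compressed evidence could look like. The function names and the `redundancy_weight` parameter are assumptions for illustration, not the paper's procedure.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def iterative_rerank(query_vec, evidence_vecs, k, redundancy_weight=0.5):
    """Greedy MMR-style selection; returns evidence indices in pick order.

    Each round picks the vector with the best trade-off between similarity
    to the query and similarity to anything already selected.
    """
    selected = []
    remaining = list(range(len(evidence_vecs)))
    while remaining and len(selected) < k:
        def mmr(i):
            relevance = cosine(query_vec, evidence_vecs[i])
            redundancy = max(
                (cosine(evidence_vecs[i], evidence_vecs[j]) for j in selected),
                default=0.0,
            )
            return (1 - redundancy_weight) * relevance - redundancy_weight * redundancy

        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy vectors: index 1 is a near-duplicate of index 0.
query = [1.0, 1.0]
evidence = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]
print(iterative_rerank(query, evidence, k=2))  # prints: [0, 2]
```

In this toy case the near-duplicate is skipped in favor of a less similar but non-redundant vector. The premise above holds only if the real compression vectors support this kind of comparison as reliably as ordinary embeddings do.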

What would settle it

A controlled test on one of the nine datasets in which the hybrid version produces lower relevance or correctness scores than either pure text retrieval or pure compression would disprove the central improvement claim.
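
The decision rule for such a test is simple enough to write down. The sketch below assumes a hypothetical results layout, with mean scores per metric for three systems named `hybrid`, `text_only`, and `compression_only`. The numbers in the example are made up and are not results from the paper.

```python
def refuting_metrics(results, metrics=("relevance", "correctness"), tolerance=0.0):
    """Return the metrics on which the hybrid system loses to a baseline.

    results maps metric name -> {system name -> mean score on one dataset}.
    A non-empty return value counts against the central improvement claim.
    """
    lost = []
    for metric in metrics:
        row = results[metric]
        best_baseline = max(row["text_only"], row["compression_only"])
        if row["hybrid"] < best_baseline - tolerance:
            lost.append(metric)
    return lost


# Made-up scores for illustration only.
example = {
    "relevance": {"hybrid": 71.0, "text_only": 64.5, "compression_only": 60.2},
    "correctness": {"hybrid": 55.0, "text_only": 57.3, "compression_only": 50.1},
}
print(refuting_metrics(example))  # prints: ['correctness']
```

The `tolerance` argument leaves room for run-to-run noise. Setting it from measured variance is more defensible than choosing a value by hand.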

Original abstract

Retrieval-augmented generation (RAG) extends large language models (LLMs) with external knowledge, but it must balance limited effective context, redundant retrieved evidence, and the loss of fine-grained facts under aggressive compression. Pure compression-based approaches reduce input size but often discard fine-grained details essential for factual accuracy. We propose SARA, a hybrid RAG framework that targets answer quality under fixed token budgets by combining natural-language snippets with semantic compression vectors. SARA retains a small set of passages in text form to preserve entities and numerical values, compresses the remaining evidence into interpretable vectors for broader coverage, and uses those vectors for iterative evidence reranking. Across 9 datasets and 5 open-source LLMs spanning 3 model families (Mistral, Llama, and Gemma), SARA consistently improves answer relevance (+17.71), answer correctness (+13.72), and semantic similarity (+15.53), demonstrating the importance of integrating textual and compressed representations for robust, context-efficient RAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SARA's hybrid text-plus-vector RAG setup has a reasonable core idea but the abstract leaves the compression and reranking steps too vague to trust the size of the reported gains.

This paper describes SARA, a hybrid RAG approach that keeps a few passages as plain text to hold on to entities and numbers while turning the rest into semantic vectors for iterative reranking under fixed token limits. The authors report gains of roughly 18 points in relevance, 14 in correctness, and 16 in similarity across nine datasets and five LLMs from three families. That scope is decent for an initial check on whether the mix helps.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SARA, a hybrid RAG framework that retains a small set of passages as natural-language snippets to preserve entities and numerical values while compressing the remaining evidence into interpretable semantic vectors for broader coverage and iterative reranking under fixed token budgets. It reports consistent empirical gains of +17.71 in answer relevance, +13.72 in answer correctness, and +15.53 in semantic similarity across 9 datasets and 5 open-source LLMs from the Mistral, Llama, and Gemma families.

Significance. If the reported gains prove robust, the hybrid textual-plus-vector approach would offer a practical way to mitigate the tension between context limits, redundancy, and loss of fine-grained facts in RAG, providing a concrete design pattern for context-efficient retrieval-augmented generation.

major comments (2)

[Abstract] The performance claims (+17.71 relevance, +13.72 correctness, +15.53 similarity) are stated without any reference to the baselines, statistical tests, error bars, or exact dataset splits, leaving the central empirical claim without visible supporting evidence or derivation.
[Abstract] Abstract (and presumed Methods section): no description is given of the compression technique used to produce the semantic compression vectors (embedding model, dimensionality reduction, or summarization method), how the vectors are rendered interpretable, or the precise iterative reranking loop that consumes them, which is load-bearing for the claim that the hybrid design yields the reported lifts under fixed token budgets.
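
One common way to supply the error bars requested in the first comment is a paired percentile bootstrap over per-question score differences. The sketch below is a generic illustration of that technique, not the evaluation protocol used in the manuscript. The function name, the resample count, and the example scores are all assumptions.

```python
import random
import statistics


def paired_bootstrap_ci(baseline, system, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-question gain (system - baseline).

    Returns (mean_gain, lower, upper). Scores must be paired by question.
    """
    if len(baseline) != len(system):
        raise ValueError("scores must be paired per question")
    diffs = [s - b for s, b in zip(system, baseline)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper


# Made-up per-question scores for illustration only.
base = [0.52, 0.61, 0.40, 0.73, 0.58, 0.66]
hyb = [0.60, 0.70, 0.55, 0.75, 0.69, 0.71]
mean_gain, lo, hi = paired_bootstrap_ci(base, hyb)
print(f"mean gain {mean_gain:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero on each dataset would give the headline gains the statistical support the comment asks for. Pairing by question matters because it removes variance from question difficulty.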

minor comments (1)

[Abstract] The abstract would benefit from a single sentence stating the token-budget constraints and the number of retained text snippets per query.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our SARA manuscript. The comments highlight opportunities to improve the self-contained nature of the abstract and the visibility of key technical details. We address each point below and have revised the manuscript to incorporate the suggestions.

Referee: [Abstract] The performance claims (+17.71 relevance, +13.72 correctness, +15.53 similarity) are stated without any reference to the baselines, statistical tests, error bars, or exact dataset splits, leaving the central empirical claim without visible supporting evidence or derivation.

Authors: We agree that the abstract would be stronger with additional context for the reported gains. In the revised version we have updated the abstract to briefly reference the primary baselines (vanilla RAG and compression-only variants) and to note that full results, including statistical significance tests, error bars, and exact dataset splits, are presented in Section 4 and the associated tables. These changes make the empirical claims more transparent while preserving the abstract's length constraints.
Revision: yes

Referee: [Abstract] Abstract (and presumed Methods section): no description is given of the compression technique used to produce the semantic compression vectors (embedding model, dimensionality reduction, or summarization method), how the vectors are rendered interpretable, or the precise iterative reranking loop that consumes them, which is load-bearing for the claim that the hybrid design yields the reported lifts under fixed token budgets.

Authors: The compression pipeline (sentence-transformer embeddings followed by PCA for interpretability) and the iterative reranking procedure are fully specified in Section 3.2 and Algorithm 1 of the original submission. To improve accessibility, we have added a concise clause in the revised abstract that names the embedding model, the dimensionality-reduction step, and the role of the reranking loop under fixed token budgets. We believe these additions directly address the concern without altering the technical substance.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of internal derivations

The paper proposes the SARA hybrid RAG framework and reports performance gains (+17.71 relevance, +13.72 correctness, +15.53 similarity) as direct outcomes of experiments across 9 datasets and 5 LLMs. No equations, parameter-fitting steps, or self-citation chains are present in the provided text that would reduce these claims to inputs by construction. The central claims rest on external evaluation benchmarks rather than any self-referential derivation or ansatz smuggled via prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract alone.

pith-pipeline@v0.9.0 · 5714 in / 952 out tokens · 31651 ms · 2026-05-18T03:00:22.110768+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "SlideAgent employs specialized agents and decomposes reasoning into three specialized levels–global, page, and element–to construct a structured, query-agnostic representation..."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "SARA... combines natural-language snippets with semantic compression vectors... iterative evidence reranking"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.