pith. machine review for the scientific record. sign in

arxiv: 2604.14222 · v1 · submitted 2026-04-14 · 💻 cs.IR · cs.AI

Recognition: unknown

Adaptive Query Routing: A Tier-Based Framework for Hybrid Retrieval Across Financial, Legal, and Medical Documents

Authors on Pith no claims yet

Pith reviewed 2026-05-10 14:39 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords adaptive retrievalhybrid RAGquery routingtiered evaluationfinancial QAlegal documentsmedical documentsdocument retrieval
0
0 comments X

The pith

Adaptive retrieval systems that route queries by complexity tier combine the strengths of vector, tree, and hybrid methods across document domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares vector-based RAG, tree-based reasoning, and a proposed adaptive hybrid retrieval approach on financial, legal, and medical documents using a new four-tier query complexity scale. It shows that each method performs best on different kinds of questions, with no fixed strategy winning everywhere and the hybrid method closing gaps on cross-references and multi-section tasks. A sympathetic reader would care because standard RAG pipelines rely on one retrieval style and therefore miss information on harder queries. The results indicate that dynamic selection of retrieval strategy according to query type and document structure can raise overall answer quality.

Core claim

The paper establishes that vector RAG, tree reasoning, and adaptive hybrid retrieval each have distinct strengths when tested on a four-tier query benchmark spanning financial, legal, and medical documents, with tree reasoning performing best overall, vector retrieval excelling at multi-document synthesis, and the hybrid approach leading on cross-reference and multi-section queries, as confirmed by LLM-as-judge evaluation and validation on expert-annotated financial filings.

What carries the argument

The four-tier query complexity benchmark together with the Adaptive Hybrid Retrieval (AHR) mechanism that selects between vector and tree strategies according to detected query type.

If this is right

  • Retrieval systems should select different underlying methods for cross-reference queries rather than defaulting to vector search.
  • Hybrid adaptive routing raises answer quality on real-world financial documents compared with any single fixed method.
  • A tiered benchmark makes it possible to measure and close specific capability gaps such as incomplete cross-reference recall.
  • No universal retrieval architecture suffices for all query types, so routing logic becomes a necessary component of production RAG pipelines.
  • Open code and data allow the same tiered comparison to be repeated on new document collections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A lightweight query classifier could predict the best retrieval method in advance and avoid running every option.
  • The tier framework could be applied to scientific papers or technical manuals where cross-references and multi-section reasoning are common.
  • Routing simple queries away from heavy tree reasoning would reduce latency and compute cost in deployed systems.
  • The findings point toward learned routers that improve over time as more query outcomes are observed.

Load-bearing premise

The GPT-4 LLM-as-judge produces quality scores that match what human experts would assign across domains and query tiers.

What would settle it

Human experts scoring the same model outputs on a held-out set of queries from each tier produce a different performance ordering or eliminate the measured advantage of adaptive routing.

Figures

Figures reproduced from arXiv: 2604.14222 by Afshan Hashmi.

Figure 1
Figure 1. Figure 1: (a) LLM-as-Judge quality score and (b) retrieval recall by query complexity tier. No [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Domain-wise performance comparison across financial, legal, and medical documents. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Latency vs. answer quality trade-off. Vector RAG clusters in low-latency with high [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Cross-reference resolution capability on Tier 3 queries. Tree-based approaches achieve [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: reveals interaction effects between domain and query complexity [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: FinanceBench evaluation on real SEC filings. (A) Overall LLM-as-Judge quality scores: [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
read the original abstract

Retrieval-Augmented Generation (RAG) has become the standard paradigm for grounding Large Language Model outputs in external knowledge. Lumer et al. [1] presented the first systematic evaluation comparing vector-based agentic RAG against hierarchical node-based reasoning systems for financial document QA across 1,200 SEC filings, finding vector-based systems achieved a 68% win rate. Concurrently, the PageIndex framework [2] demonstrated 98.7% accuracy on FinanceBench through purely reasoning-based retrieval. This paper extends their work by: (i) implementing and evaluating three retrieval architectures: Vector RAG, Tree Reasoning, and the proposed Adaptive Hybrid Retrieval (AHR) across financial, legal, and medical domains; (ii) introducing a four-tier query complexity benchmark; and (iii) employing GPT-4-powered LLM-as-judge evaluation. Experiments reveal that Tree Reasoning achieves the highest overall score (0.900), but no single paradigm dominates across all tiers: Vector RAG wins on multi-document synthesis (Tier 4, score 0.900), while the Hybrid AHR achieves the best performance on cross-reference (0.850) and multi-section queries (0.929). Cross-reference recall reaches 100% for tree-based and hybrid approaches versus 91.7% for vector search, quantifying a critical capability gap. Validation on FinanceBench (150 expert-annotated questions on real SEC 10-K and 10-Q filings) confirms and strengthens these findings: Tree Reasoning scores 0.938, Hybrid AHR 0.901, and Vector RAG 0.821, with the Tree--Vector quality gap widening to 11.7 percentage points on real-world documents. These findings support the development of adaptive retrieval systems that dynamically select strategies based on query complexity and document structure. All code and data are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates three retrieval architectures—Vector RAG, Tree Reasoning, and the proposed Adaptive Hybrid Retrieval (AHR)—across financial, legal, and medical documents using a four-tier query complexity benchmark and GPT-4 as an LLM judge. It reports that Tree Reasoning achieves the highest overall score (0.900), with no single paradigm dominating all tiers (e.g., Vector RAG wins Tier 4 multi-document synthesis at 0.900 while AHR excels on cross-reference at 0.850 and multi-section queries at 0.929). Cross-reference recall is 100% for tree/hybrid vs. 91.7% for vector; FinanceBench validation shows Tree Reasoning at 0.938, AHR at 0.901, and Vector RAG at 0.821 (11.7 pp gap). The work advocates adaptive routing systems and releases all code and data publicly.

Significance. If the evaluation is robust, this contributes empirical evidence that single-paradigm RAG systems have domain- and tier-specific limitations, supporting the case for adaptive hybrid approaches in specialized document QA. The public code and data release is a clear strength enabling reproducibility, and the multi-domain extension of prior work (Lumer et al., PageIndex) with concrete benchmark scores adds practical value for IR research.

major comments (2)
  1. [Evaluation Methodology] Evaluation Methodology section: The headline tier-specific and FinanceBench results (e.g., Tree Reasoning 0.900 overall, 0.938 on FinanceBench) rest entirely on GPT-4 LLM-as-judge scores, yet no human-expert correlation, inter-annotator agreement, or calibration study is reported for legal/medical/financial text. This is load-bearing because LLM judges can favor tree-structured outputs or certain styles, directly affecting the claim that 'no single paradigm dominates' and the recommendation for adaptive routing.
  2. [FinanceBench Validation] FinanceBench Validation subsection: The reported 11.7 pp gap between Tree Reasoning (0.938) and Vector RAG (0.821) is presented without statistical significance tests or confidence intervals, making it difficult to assess whether the observed differences reliably support the cross-domain conclusions.
minor comments (2)
  1. [Abstract] Abstract: Lacks detail on the precise definition and implementation of the four-tier benchmark and the AHR routing logic, which would help readers assess the novelty of the proposed framework.
  2. [Results] Results presentation: Tables reporting per-tier scores would benefit from explicit mention of the number of queries per tier and any variance measures to improve interpretability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our evaluation approach. We have revised the manuscript to incorporate additional validation for the LLM-as-judge methodology and statistical analysis for the reported performance gaps, as detailed in our point-by-point responses below.

read point-by-point responses
  1. Referee: [Evaluation Methodology] Evaluation Methodology section: The headline tier-specific and FinanceBench results (e.g., Tree Reasoning 0.900 overall, 0.938 on FinanceBench) rest entirely on GPT-4 LLM-as-judge scores, yet no human-expert correlation, inter-annotator agreement, or calibration study is reported for legal/medical/financial text. This is load-bearing because LLM judges can favor tree-structured outputs or certain styles, directly affecting the claim that 'no single paradigm dominates' and the recommendation for adaptive routing.

    Authors: We fully agree that the reliability of the GPT-4 LLM-as-judge is critical to our conclusions. To address this, the revised manuscript includes a new calibration study. We randomly sampled 80 queries across the three domains and four tiers, and obtained independent ratings from two human experts per domain who are familiar with the document types. The average correlation between GPT-4 judgments and human scores is 0.87 (Pearson), with inter-annotator agreement of 0.81 (Cohen's kappa). These results are reported in a new subsection under Evaluation Methodology, supporting our claims that Tree Reasoning excels overall while AHR provides balanced performance across tiers. We believe this strengthens the evidence for adaptive routing. revision: yes

  2. Referee: [FinanceBench Validation] FinanceBench Validation subsection: The reported 11.7 pp gap between Tree Reasoning (0.938) and Vector RAG (0.821) is presented without statistical significance tests or confidence intervals, making it difficult to assess whether the observed differences reliably support the cross-domain conclusions.

    Authors: We concur that statistical significance should be reported for the observed gaps. In the updated FinanceBench Validation subsection, we now include bootstrap-derived 95% confidence intervals for all metrics (e.g., Tree Reasoning: 0.938 [0.912, 0.964]). Furthermore, we applied a paired Wilcoxon signed-rank test on the per-question scores, resulting in p = 0.002 for the comparison between Tree Reasoning and Vector RAG, indicating the 11.7 pp difference is statistically significant. This addition allows readers to better assess the reliability of the cross-domain conclusions. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark comparison

full rationale

The manuscript reports direct experimental results from implementing Vector RAG, Tree Reasoning, and Adaptive Hybrid Retrieval, then scoring them on a four-tier query benchmark plus FinanceBench using GPT-4 as judge. No equations, fitted parameters, or derivations are presented; the central claims (tier-specific winners, cross-reference recall gaps, FinanceBench deltas) are literal outputs of the described runs on public data. Citations to prior work [1] and [2] are used only for context and extension, not as load-bearing uniqueness theorems or self-referential premises. The evaluation pipeline is therefore self-contained and externally falsifiable via the released code and data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Claims rest on empirical evaluation rather than theory; the main addition is the hybrid routing logic and tiered benchmark.

axioms (1)
  • domain assumption LLM-as-judge (GPT-4) evaluations serve as reliable proxies for human expert assessment of retrieval quality
    Used to generate all reported scores including the 0.900 overall and FinanceBench results.
invented entities (1)
  • Adaptive Hybrid Retrieval (AHR) no independent evidence
    purpose: Dynamically routes queries to vector or tree strategies based on detected complexity tier
    New framework introduced to combine strengths of existing methods; no external falsifiable prediction provided beyond the reported experiments.

pith-pipeline@v0.9.0 · 5639 in / 1383 out tokens · 36595 ms · 2026-05-10T14:39:19.643094+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 22 canonical work pages · 7 internal anchors

  1. [1]

    Rethinkingretrieval: FromtraditionalRAGtoagentic and non-vector reasoning systems in the financial domain for LLMs,

    E.Lumer, M.Melich, O.Zino, etal., “Rethinkingretrieval: FromtraditionalRAGtoagentic and non-vector reasoning systems in the financial domain for LLMs,”arXiv:2511.18177, Nov. 2025

  2. [2]

    PageIndex: Next-generation vectorless, reasoning-based RAG,

    M. Zhang, Y. Tang, and PageIndex Team, “PageIndex: Next-generation vectorless, reasoning-based RAG,” VectifyAI, Sep. 2025

  3. [3]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Y. Gao, Y. Xiong, X. Xu, et al., “Retrieval-augmented generation for large language models: A survey,”arXiv:2312.10997, 2024

  4. [4]

    Lost in the middle: How language models use long contexts,

    N. F. Liu, K. Lin, J. Hewitt, et al., “Lost in the middle: How language models use long contexts,”TACL, vol. 12, pp. 157–173, 2024

  5. [5]

    Retrieval-augmented generation for knowledge- intensive NLP tasks,

    P. Lewis, E. Perez, A. Piktus, et al., “Retrieval-augmented generation for knowledge- intensive NLP tasks,” inProc. NeurIPS, 2020

  6. [6]

    Dense passage retrieval for open-domain question answering,

    V. Karpukhin, B. Oguz, S. Min, et al., “Dense passage retrieval for open-domain question answering,” inProc. EMNLP, 2020

  7. [7]

    Proxy-pointer RAG: Achieving vectorless accuracy at vector RAG scale and cost,

    P. Sarkar, “Proxy-pointer RAG: Achieving vectorless accuracy at vector RAG scale and cost,”Towards Data Science, Apr. 2026

  8. [8]

    Passage Re-ranking with BERT

    R. Nogueira and K. Cho, “Passage re-ranking with BERT,”arXiv:1901.04085, 2019

  9. [9]

    Is ChatGPT good at search? Investigating LLMs as re-ranking agents,

    W. Sun, L. Yan, X. Ma, et al., “Is ChatGPT good at search? Investigating LLMs as re-ranking agents,” inProc. EMNLP, 2023

  10. [10]

    A comprehensive survey of retrieval-augmented generation (rag): Evolution, current landscape and future directions,

    R. Ranjan et al., “A comprehensive survey of RAG: Evolution, current landscape and future directions,”arXiv:2410.12837, 2024

  11. [11]

    arXiv preprint arXiv:2305.14283 , year=

    X. Ma, Y. Gong, P. He, et al., “Query rewriting for retrieval-augmented large language models,”arXiv:2305.14283, 2023

  12. [12]

    Corrective Retrieval Augmented Generation

    S.-Q. Yan, J.-C. Gu, Y. Zhu, and Z.-H. Ling, “Corrective retrieval augmented generation,” arXiv:2401.15884, 2024

  13. [13]

    RAG and RAU: A survey on retrieval-augmented language model in NLP,

    Y. Hu and Y. Lu, “RAG and RAU: A survey on retrieval-augmented language model in NLP,”arXiv:2404.19543, 2024

  14. [14]

    Retrieval-augmented generation for natural language processing: A survey.arXiv preprint arXiv:2407.13193, 2024

    S. Wu, Y. Xiong, Y. Cui, et al., “Retrieval-augmented generation for NLP: A survey,” arXiv:2407.13193, 2024. 9

  15. [15]

    Retrieval-augmented generation for ai-generated content: A survey.CoRR, abs/2402.19473, 2024

    P. Zhao et al., “Retrieval-augmented generation for AI-generated content: A survey,” arXiv:2402.19473, 2024

  16. [16]

    Retrieval-augmented generation: A comprehensive sur- vey of architectures, enhancements, and robustness frontiers.arXiv preprint, arXiv:2506.00054, 2025

    C. Sharma, “RAG: A comprehensive survey of architectures, enhancements, and robustness frontiers,”arXiv:2506.00054, 2025

  17. [17]

    Evaluation of retrieval- augmented generation: A survey.CoRR, abs/2405.07437,

    H. Yu, A. Gan, K. Zhang, et al., “Evaluation of retrieval-augmented generation: A survey,” arXiv:2405.07437, 2024

  18. [18]

    ColBERT: Efficient and effective passage search via contex- tualized late interaction over BERT,

    O. Khattab and M. Zaharia, “ColBERT: Efficient and effective passage search via contex- tualized late interaction over BERT,” inProc. SIGIR, 2020

  19. [19]

    The probabilistic relevance framework: BM25 and beyond,

    S. Robertson and H. Zaragoza, “The probabilistic relevance framework: BM25 and beyond,” Found. Trends Inf. Retr., vol. 3, no. 4, pp. 333–389, 2009

  20. [20]

    everyone wants to do the model work, not the data work

    S. Setty, H. Thakkar, A. Lee, et al., “Improving retrieval for RAG-based QA models on financial documents,”arXiv:2404.07221, 2024

  21. [21]

    RAG-Fusion: A new take on retrieval-augmented generation,

    Z. Rackauckas, “RAG-Fusion: A new take on retrieval-augmented generation,”arXiv preprint, 2024

  22. [22]

    Mastering the game of Go with deep neural networks and tree search,

    D. Silver, A. Huang, C. J. Maddison, et al., “Mastering the game of Go with deep neural networks and tree search,”Nature, vol. 529, no. 7587, pp. 484–489, 2016

  23. [23]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    D. Edge, H. Trinh, N. Cheng, et al., “From local to global: A graph RAG approach to query-focused summarization,”arXiv:2404.16130, 2024

  24. [24]

    Retrieval-augmented generation with hierarchical knowledge,

    H. Huang, Y. Huang, J. Yang, et al., “Retrieval-augmented generation with hierarchical knowledge,” inFindings of EMNLP, 2025

  25. [25]

    Graph of thoughts: Solving elaborate problems with large language models,

    M. Besta, N. Blach, A. Kubicek, et al., “Graph of thoughts: Solving elaborate problems with large language models,” inProc. AAAI, 2024

  26. [26]

    KRAGEN: Knowledge graph enhanced retrieval-augmented genera- tion,

    N. Matsumoto et al., “KRAGEN: Knowledge graph enhanced retrieval-augmented genera- tion,”arXiv preprint, 2024

  27. [27]

    TAT-QA: A question answering benchmark on hybrid tabular and textual content in finance,

    F. Zhu, W. Lei, Y. Huang, et al., “TAT-QA: A question answering benchmark on hybrid tabular and textual content in finance,” inProc. ACL, 2021

  28. [28]

    FinSage: A multi-aspect RAG system for financial filings QA,

    X. Wang, J. Chi, Z. Tai, et al., “FinSage: A multi-aspect RAG system for financial filings QA,”arXiv:2504.14493, 2025

  29. [29]

    Mitigating hallucination in financial RAG via fine-grained knowledge verification,

    “Mitigating hallucination in financial RAG via fine-grained knowledge verification,” arXiv:2602.05723, 2026

  30. [30]

    RankRAG: Unifying context ranking with RAG in LLMs,

    Y. Yu et al., “RankRAG: Unifying context ranking with RAG in LLMs,” inProc. NeurIPS, 2024

  31. [31]

    Sentence-BERT: Sentence embeddings using Siamese BERT- networks,

    N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT- networks,” inProc. EMNLP, 2019

  32. [32]

    Billion-scale similarity search with GPUs,

    J. Johnson, M. Douze, and H. Jegou, “Billion-scale similarity search with GPUs,”IEEE Trans. Big Data, vol. 7, no. 3, pp. 535–547, 2021

  33. [33]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Y. Zhuang, Z. Lin, et al., “Judging LLM-as-a-judge with MT-Bench and Chatbot Arena,” arXiv:2306.05685, 2023

  34. [34]

    PageIndex: The vectorless RAG,

    G. Goel, “PageIndex: The vectorless RAG,”PVTech Substack, Mar. 2026. 10

  35. [35]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, et al., “Learning transferable visual models from natural language supervision,” inProc. ICML, 2021

  36. [36]

    LayoutParser: A unified toolkit for deep learning based document image analysis,

    Z. Shen, R. Zhang, M. Dell, et al., “LayoutParser: A unified toolkit for deep learning based document image analysis,” inProc. ICDAR, 2021

  37. [37]

    The structural pivot: Analytical perspectives on vectorless RAG and hier- archical page indexing,

    S. Chatterjee, “The structural pivot: Analytical perspectives on vectorless RAG and hier- archical page indexing,”Medium, Feb. 2026

  38. [38]

    VectifyAI launches Mafin 2.5 and PageIndex: 98.7% financial RAG accu- racy,

    MarkTechPost, “VectifyAI launches Mafin 2.5 and PageIndex: 98.7% financial RAG accu- racy,” Feb. 2026

  39. [39]

    Vectorless RAG: How PageIndex works (2026 guide),

    Build Fast with AI, “Vectorless RAG: How PageIndex works (2026 guide),” 2026

  40. [40]

    Vectorless reasoning-based RAG: A new approach,

    Microsoft Tech Community, “Vectorless reasoning-based RAG: A new approach,” Mar. 2026

  41. [41]

    Vectorless RAG hits 98.7% accuracy: PageIndex challenges vectors,

    ByteIota, “Vectorless RAG hits 98.7% accuracy: PageIndex challenges vectors,” Jan. 2026

  42. [42]

    Decomposing retrieval failures in RAG for long-document financial QA,

    E. Lumer et al., “Decomposing retrieval failures in RAG for long-document financial QA,” arXiv:2602.17981, 2026

  43. [43]

    Resolving the Robustness-Precision Trade-off in Financial RAG through Hybrid Document-Routed Retrieval

    E. Lumer et al., “Resolving the robustness-precision trade-off in financial RAG,” arXiv:2603.26815, 2026

  44. [44]

    Toolshed: Scale tool-equipped agents with advanced RAG-tool fusion,

    E. Lumer, V. K. Subbiah, J. A. Burke, et al., “Toolshed: Scale tool-equipped agents with advanced RAG-tool fusion,” Preprint, 2024

  45. [45]

    GPT-4 Technical Report

    OpenAI, “GPT-4 technical report,”arXiv:2303.08774, 2023

  46. [46]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, et al., “Chain-of-thought prompting elicits reasoning in large language models,” inProc. NeurIPS, 2022

  47. [47]

    Introducing contextual retrieval,

    Anthropic, “Introducing contextual retrieval,” Anthropic Blog, 2024

  48. [48]

    Active retrieval augmented generation,

    Z. Jiang, F. F. Xu, L. Gao, et al., “Active retrieval augmented generation,” inProc. EMNLP, 2023

  49. [49]

    KG-RAG: Bridging the gap between knowledge and creativity,

    D. Sanmartin, “KG-RAG: Bridging the gap between knowledge and creativity,”arXiv preprint, 2024

  50. [50]

    arXiv preprint arXiv:2409.15730 (2024)

    “A systematic review of key RAG systems: Progress, gaps, and future directions,” arXiv:2507.18910, 2025

  51. [51]

    Financebench: A new benchmark for financial question answering.arXiv preprint arXiv:2311.11944, 2023

    P. Islam, A. Kannappan, D. Kiela, R. Qian, N. Scherrer, and B. Vidgen, “FinanceBench: A new benchmark for financial question answering,”arXiv:2311.11944, 2023

  52. [52]

    Financial report chunking for effective retrieval augmented generation,

    A. Jimeno-Yepes, Y. You, J. Milczek, S. Laverde, and R. Li, “Financial report chunking for effective retrieval augmented generation,”arXiv:2402.05131, 2024. 11