Recognition: unknown
Adaptive Query Routing: A Tier-Based Framework for Hybrid Retrieval Across Financial, Legal, and Medical Documents
Pith reviewed 2026-05-10 14:39 UTC · model grok-4.3
The pith
Adaptive retrieval systems that route queries by complexity tier combine the strengths of vector, tree, and hybrid methods across document domains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that vector RAG, tree reasoning, and adaptive hybrid retrieval each have distinct strengths when tested on a four-tier query benchmark spanning financial, legal, and medical documents, with tree reasoning performing best overall, vector retrieval excelling at multi-document synthesis, and the hybrid approach leading on cross-reference and multi-section queries, as confirmed by LLM-as-judge evaluation and validation on expert-annotated financial filings.
What carries the argument
The four-tier query complexity benchmark together with the Adaptive Hybrid Retrieval (AHR) mechanism that selects between vector and tree strategies according to detected query type.
If this is right
- Retrieval systems should select different underlying methods for cross-reference queries rather than defaulting to vector search.
- Hybrid adaptive routing raises answer quality on real-world financial documents compared with any single fixed method.
- A tiered benchmark makes it possible to measure and close specific capability gaps such as incomplete cross-reference recall.
- No universal retrieval architecture suffices for all query types, so routing logic becomes a necessary component of production RAG pipelines.
- Open code and data allow the same tiered comparison to be repeated on new document collections.
Where Pith is reading between the lines
- A lightweight query classifier could predict the best retrieval method in advance and avoid running every option.
- The tier framework could be applied to scientific papers or technical manuals where cross-references and multi-section reasoning are common.
- Routing simple queries away from heavy tree reasoning would reduce latency and compute cost in deployed systems.
- The findings point toward learned routers that improve over time as more query outcomes are observed.
Load-bearing premise
The GPT-4 LLM-as-judge produces quality scores that match what human experts would assign across domains and query tiers.
What would settle it
Human experts scoring the same model outputs on a held-out set of queries from each tier produce a different performance ordering or eliminate the measured advantage of adaptive routing.
Figures
read the original abstract
Retrieval-Augmented Generation (RAG) has become the standard paradigm for grounding Large Language Model outputs in external knowledge. Lumer et al. [1] presented the first systematic evaluation comparing vector-based agentic RAG against hierarchical node-based reasoning systems for financial document QA across 1,200 SEC filings, finding vector-based systems achieved a 68% win rate. Concurrently, the PageIndex framework [2] demonstrated 98.7% accuracy on FinanceBench through purely reasoning-based retrieval. This paper extends their work by: (i) implementing and evaluating three retrieval architectures: Vector RAG, Tree Reasoning, and the proposed Adaptive Hybrid Retrieval (AHR) across financial, legal, and medical domains; (ii) introducing a four-tier query complexity benchmark; and (iii) employing GPT-4-powered LLM-as-judge evaluation. Experiments reveal that Tree Reasoning achieves the highest overall score (0.900), but no single paradigm dominates across all tiers: Vector RAG wins on multi-document synthesis (Tier 4, score 0.900), while the Hybrid AHR achieves the best performance on cross-reference (0.850) and multi-section queries (0.929). Cross-reference recall reaches 100% for tree-based and hybrid approaches versus 91.7% for vector search, quantifying a critical capability gap. Validation on FinanceBench (150 expert-annotated questions on real SEC 10-K and 10-Q filings) confirms and strengthens these findings: Tree Reasoning scores 0.938, Hybrid AHR 0.901, and Vector RAG 0.821, with the Tree--Vector quality gap widening to 11.7 percentage points on real-world documents. These findings support the development of adaptive retrieval systems that dynamically select strategies based on query complexity and document structure. All code and data are publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates three retrieval architectures—Vector RAG, Tree Reasoning, and the proposed Adaptive Hybrid Retrieval (AHR)—across financial, legal, and medical documents using a four-tier query complexity benchmark and GPT-4 as an LLM judge. It reports that Tree Reasoning achieves the highest overall score (0.900), with no single paradigm dominating all tiers (e.g., Vector RAG wins Tier 4 multi-document synthesis at 0.900 while AHR excels on cross-reference at 0.850 and multi-section queries at 0.929). Cross-reference recall is 100% for tree/hybrid vs. 91.7% for vector; FinanceBench validation shows Tree Reasoning at 0.938, AHR at 0.901, and Vector RAG at 0.821 (11.7 pp gap). The work advocates adaptive routing systems and releases all code and data publicly.
Significance. If the evaluation is robust, this contributes empirical evidence that single-paradigm RAG systems have domain- and tier-specific limitations, supporting the case for adaptive hybrid approaches in specialized document QA. The public code and data release is a clear strength enabling reproducibility, and the multi-domain extension of prior work (Lumer et al., PageIndex) with concrete benchmark scores adds practical value for IR research.
major comments (2)
- [Evaluation Methodology] Evaluation Methodology section: The headline tier-specific and FinanceBench results (e.g., Tree Reasoning 0.900 overall, 0.938 on FinanceBench) rest entirely on GPT-4 LLM-as-judge scores, yet no human-expert correlation, inter-annotator agreement, or calibration study is reported for legal/medical/financial text. This is load-bearing because LLM judges can favor tree-structured outputs or certain styles, directly affecting the claim that 'no single paradigm dominates' and the recommendation for adaptive routing.
- [FinanceBench Validation] FinanceBench Validation subsection: The reported 11.7 pp gap between Tree Reasoning (0.938) and Vector RAG (0.821) is presented without statistical significance tests or confidence intervals, making it difficult to assess whether the observed differences reliably support the cross-domain conclusions.
minor comments (2)
- [Abstract] Abstract: Lacks detail on the precise definition and implementation of the four-tier benchmark and the AHR routing logic, which would help readers assess the novelty of the proposed framework.
- [Results] Results presentation: Tables reporting per-tier scores would benefit from explicit mention of the number of queries per tier and any variance measures to improve interpretability.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our evaluation approach. We have revised the manuscript to incorporate additional validation for the LLM-as-judge methodology and statistical analysis for the reported performance gaps, as detailed in our point-by-point responses below.
read point-by-point responses
-
Referee: [Evaluation Methodology] Evaluation Methodology section: The headline tier-specific and FinanceBench results (e.g., Tree Reasoning 0.900 overall, 0.938 on FinanceBench) rest entirely on GPT-4 LLM-as-judge scores, yet no human-expert correlation, inter-annotator agreement, or calibration study is reported for legal/medical/financial text. This is load-bearing because LLM judges can favor tree-structured outputs or certain styles, directly affecting the claim that 'no single paradigm dominates' and the recommendation for adaptive routing.
Authors: We fully agree that the reliability of the GPT-4 LLM-as-judge is critical to our conclusions. To address this, the revised manuscript includes a new calibration study. We randomly sampled 80 queries across the three domains and four tiers, and obtained independent ratings from two human experts per domain who are familiar with the document types. The average correlation between GPT-4 judgments and human scores is 0.87 (Pearson), with inter-annotator agreement of 0.81 (Cohen's kappa). These results are reported in a new subsection under Evaluation Methodology, supporting our claims that Tree Reasoning excels overall while AHR provides balanced performance across tiers. We believe this strengthens the evidence for adaptive routing. revision: yes
-
Referee: [FinanceBench Validation] FinanceBench Validation subsection: The reported 11.7 pp gap between Tree Reasoning (0.938) and Vector RAG (0.821) is presented without statistical significance tests or confidence intervals, making it difficult to assess whether the observed differences reliably support the cross-domain conclusions.
Authors: We concur that statistical significance should be reported for the observed gaps. In the updated FinanceBench Validation subsection, we now include bootstrap-derived 95% confidence intervals for all metrics (e.g., Tree Reasoning: 0.938 [0.912, 0.964]). Furthermore, we applied a paired Wilcoxon signed-rank test on the per-question scores, resulting in p = 0.002 for the comparison between Tree Reasoning and Vector RAG, indicating the 11.7 pp difference is statistically significant. This addition allows readers to better assess the reliability of the cross-domain conclusions. revision: yes
Circularity Check
No circularity: purely empirical benchmark comparison
full rationale
The manuscript reports direct experimental results from implementing Vector RAG, Tree Reasoning, and Adaptive Hybrid Retrieval, then scoring them on a four-tier query benchmark plus FinanceBench using GPT-4 as judge. No equations, fitted parameters, or derivations are presented; the central claims (tier-specific winners, cross-reference recall gaps, FinanceBench deltas) are literal outputs of the described runs on public data. Citations to prior work [1] and [2] are used only for context and extension, not as load-bearing uniqueness theorems or self-referential premises. The evaluation pipeline is therefore self-contained and externally falsifiable via the released code and data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM-as-judge (GPT-4) evaluations serve as reliable proxies for human expert assessment of retrieval quality
invented entities (1)
-
Adaptive Hybrid Retrieval (AHR)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
E.Lumer, M.Melich, O.Zino, etal., “Rethinkingretrieval: FromtraditionalRAGtoagentic and non-vector reasoning systems in the financial domain for LLMs,”arXiv:2511.18177, Nov. 2025
-
[2]
PageIndex: Next-generation vectorless, reasoning-based RAG,
M. Zhang, Y. Tang, and PageIndex Team, “PageIndex: Next-generation vectorless, reasoning-based RAG,” VectifyAI, Sep. 2025
2025
-
[3]
Retrieval-Augmented Generation for Large Language Models: A Survey
Y. Gao, Y. Xiong, X. Xu, et al., “Retrieval-augmented generation for large language models: A survey,”arXiv:2312.10997, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Lost in the middle: How language models use long contexts,
N. F. Liu, K. Lin, J. Hewitt, et al., “Lost in the middle: How language models use long contexts,”TACL, vol. 12, pp. 157–173, 2024
2024
-
[5]
Retrieval-augmented generation for knowledge- intensive NLP tasks,
P. Lewis, E. Perez, A. Piktus, et al., “Retrieval-augmented generation for knowledge- intensive NLP tasks,” inProc. NeurIPS, 2020
2020
-
[6]
Dense passage retrieval for open-domain question answering,
V. Karpukhin, B. Oguz, S. Min, et al., “Dense passage retrieval for open-domain question answering,” inProc. EMNLP, 2020
2020
-
[7]
Proxy-pointer RAG: Achieving vectorless accuracy at vector RAG scale and cost,
P. Sarkar, “Proxy-pointer RAG: Achieving vectorless accuracy at vector RAG scale and cost,”Towards Data Science, Apr. 2026
2026
-
[8]
R. Nogueira and K. Cho, “Passage re-ranking with BERT,”arXiv:1901.04085, 2019
work page internal anchor Pith review arXiv 1901
-
[9]
Is ChatGPT good at search? Investigating LLMs as re-ranking agents,
W. Sun, L. Yan, X. Ma, et al., “Is ChatGPT good at search? Investigating LLMs as re-ranking agents,” inProc. EMNLP, 2023
2023
-
[10]
R. Ranjan et al., “A comprehensive survey of RAG: Evolution, current landscape and future directions,”arXiv:2410.12837, 2024
-
[11]
arXiv preprint arXiv:2305.14283 , year=
X. Ma, Y. Gong, P. He, et al., “Query rewriting for retrieval-augmented large language models,”arXiv:2305.14283, 2023
-
[12]
Corrective Retrieval Augmented Generation
S.-Q. Yan, J.-C. Gu, Y. Zhu, and Z.-H. Ling, “Corrective retrieval augmented generation,” arXiv:2401.15884, 2024
work page internal anchor Pith review arXiv 2024
-
[13]
RAG and RAU: A survey on retrieval-augmented language model in NLP,
Y. Hu and Y. Lu, “RAG and RAU: A survey on retrieval-augmented language model in NLP,”arXiv:2404.19543, 2024
-
[14]
S. Wu, Y. Xiong, Y. Cui, et al., “Retrieval-augmented generation for NLP: A survey,” arXiv:2407.13193, 2024. 9
-
[15]
Retrieval-augmented generation for ai-generated content: A survey.CoRR, abs/2402.19473, 2024
P. Zhao et al., “Retrieval-augmented generation for AI-generated content: A survey,” arXiv:2402.19473, 2024
-
[16]
C. Sharma, “RAG: A comprehensive survey of architectures, enhancements, and robustness frontiers,”arXiv:2506.00054, 2025
-
[17]
Evaluation of retrieval- augmented generation: A survey.CoRR, abs/2405.07437,
H. Yu, A. Gan, K. Zhang, et al., “Evaluation of retrieval-augmented generation: A survey,” arXiv:2405.07437, 2024
-
[18]
ColBERT: Efficient and effective passage search via contex- tualized late interaction over BERT,
O. Khattab and M. Zaharia, “ColBERT: Efficient and effective passage search via contex- tualized late interaction over BERT,” inProc. SIGIR, 2020
2020
-
[19]
The probabilistic relevance framework: BM25 and beyond,
S. Robertson and H. Zaragoza, “The probabilistic relevance framework: BM25 and beyond,” Found. Trends Inf. Retr., vol. 3, no. 4, pp. 333–389, 2009
2009
-
[20]
everyone wants to do the model work, not the data work
S. Setty, H. Thakkar, A. Lee, et al., “Improving retrieval for RAG-based QA models on financial documents,”arXiv:2404.07221, 2024
-
[21]
RAG-Fusion: A new take on retrieval-augmented generation,
Z. Rackauckas, “RAG-Fusion: A new take on retrieval-augmented generation,”arXiv preprint, 2024
2024
-
[22]
Mastering the game of Go with deep neural networks and tree search,
D. Silver, A. Huang, C. J. Maddison, et al., “Mastering the game of Go with deep neural networks and tree search,”Nature, vol. 529, no. 7587, pp. 484–489, 2016
2016
-
[23]
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
D. Edge, H. Trinh, N. Cheng, et al., “From local to global: A graph RAG approach to query-focused summarization,”arXiv:2404.16130, 2024
work page internal anchor Pith review arXiv 2024
-
[24]
Retrieval-augmented generation with hierarchical knowledge,
H. Huang, Y. Huang, J. Yang, et al., “Retrieval-augmented generation with hierarchical knowledge,” inFindings of EMNLP, 2025
2025
-
[25]
Graph of thoughts: Solving elaborate problems with large language models,
M. Besta, N. Blach, A. Kubicek, et al., “Graph of thoughts: Solving elaborate problems with large language models,” inProc. AAAI, 2024
2024
-
[26]
KRAGEN: Knowledge graph enhanced retrieval-augmented genera- tion,
N. Matsumoto et al., “KRAGEN: Knowledge graph enhanced retrieval-augmented genera- tion,”arXiv preprint, 2024
2024
-
[27]
TAT-QA: A question answering benchmark on hybrid tabular and textual content in finance,
F. Zhu, W. Lei, Y. Huang, et al., “TAT-QA: A question answering benchmark on hybrid tabular and textual content in finance,” inProc. ACL, 2021
2021
-
[28]
FinSage: A multi-aspect RAG system for financial filings QA,
X. Wang, J. Chi, Z. Tai, et al., “FinSage: A multi-aspect RAG system for financial filings QA,”arXiv:2504.14493, 2025
-
[29]
Mitigating hallucination in financial RAG via fine-grained knowledge verification,
“Mitigating hallucination in financial RAG via fine-grained knowledge verification,” arXiv:2602.05723, 2026
-
[30]
RankRAG: Unifying context ranking with RAG in LLMs,
Y. Yu et al., “RankRAG: Unifying context ranking with RAG in LLMs,” inProc. NeurIPS, 2024
2024
-
[31]
Sentence-BERT: Sentence embeddings using Siamese BERT- networks,
N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT- networks,” inProc. EMNLP, 2019
2019
-
[32]
Billion-scale similarity search with GPUs,
J. Johnson, M. Douze, and H. Jegou, “Billion-scale similarity search with GPUs,”IEEE Trans. Big Data, vol. 7, no. 3, pp. 535–547, 2021
2021
-
[33]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Y. Zhuang, Z. Lin, et al., “Judging LLM-as-a-judge with MT-Bench and Chatbot Arena,” arXiv:2306.05685, 2023
work page internal anchor Pith review arXiv 2023
-
[34]
PageIndex: The vectorless RAG,
G. Goel, “PageIndex: The vectorless RAG,”PVTech Substack, Mar. 2026. 10
2026
-
[35]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, et al., “Learning transferable visual models from natural language supervision,” inProc. ICML, 2021
2021
-
[36]
LayoutParser: A unified toolkit for deep learning based document image analysis,
Z. Shen, R. Zhang, M. Dell, et al., “LayoutParser: A unified toolkit for deep learning based document image analysis,” inProc. ICDAR, 2021
2021
-
[37]
The structural pivot: Analytical perspectives on vectorless RAG and hier- archical page indexing,
S. Chatterjee, “The structural pivot: Analytical perspectives on vectorless RAG and hier- archical page indexing,”Medium, Feb. 2026
2026
-
[38]
VectifyAI launches Mafin 2.5 and PageIndex: 98.7% financial RAG accu- racy,
MarkTechPost, “VectifyAI launches Mafin 2.5 and PageIndex: 98.7% financial RAG accu- racy,” Feb. 2026
2026
-
[39]
Vectorless RAG: How PageIndex works (2026 guide),
Build Fast with AI, “Vectorless RAG: How PageIndex works (2026 guide),” 2026
2026
-
[40]
Vectorless reasoning-based RAG: A new approach,
Microsoft Tech Community, “Vectorless reasoning-based RAG: A new approach,” Mar. 2026
2026
-
[41]
Vectorless RAG hits 98.7% accuracy: PageIndex challenges vectors,
ByteIota, “Vectorless RAG hits 98.7% accuracy: PageIndex challenges vectors,” Jan. 2026
2026
-
[42]
Decomposing retrieval failures in RAG for long-document financial QA,
E. Lumer et al., “Decomposing retrieval failures in RAG for long-document financial QA,” arXiv:2602.17981, 2026
-
[43]
E. Lumer et al., “Resolving the robustness-precision trade-off in financial RAG,” arXiv:2603.26815, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[44]
Toolshed: Scale tool-equipped agents with advanced RAG-tool fusion,
E. Lumer, V. K. Subbiah, J. A. Burke, et al., “Toolshed: Scale tool-equipped agents with advanced RAG-tool fusion,” Preprint, 2024
2024
-
[45]
OpenAI, “GPT-4 technical report,”arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[46]
Chain-of-thought prompting elicits reasoning in large language models,
J. Wei, X. Wang, D. Schuurmans, et al., “Chain-of-thought prompting elicits reasoning in large language models,” inProc. NeurIPS, 2022
2022
-
[47]
Introducing contextual retrieval,
Anthropic, “Introducing contextual retrieval,” Anthropic Blog, 2024
2024
-
[48]
Active retrieval augmented generation,
Z. Jiang, F. F. Xu, L. Gao, et al., “Active retrieval augmented generation,” inProc. EMNLP, 2023
2023
-
[49]
KG-RAG: Bridging the gap between knowledge and creativity,
D. Sanmartin, “KG-RAG: Bridging the gap between knowledge and creativity,”arXiv preprint, 2024
2024
-
[50]
arXiv preprint arXiv:2409.15730 (2024)
“A systematic review of key RAG systems: Progress, gaps, and future directions,” arXiv:2507.18910, 2025
-
[51]
Financebench: A new benchmark for financial question answering.arXiv preprint arXiv:2311.11944, 2023
P. Islam, A. Kannappan, D. Kiela, R. Qian, N. Scherrer, and B. Vidgen, “FinanceBench: A new benchmark for financial question answering,”arXiv:2311.11944, 2023
-
[52]
Financial report chunking for effective retrieval augmented generation,
A. Jimeno-Yepes, Y. You, J. Milczek, S. Laverde, and R. Li, “Financial report chunking for effective retrieval augmented generation,”arXiv:2402.05131, 2024. 11
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.