arxiv: 2604.14222 · v1 · submitted 2026-04-14 · 💻 cs.IR · cs.AI

Recognition: unknown

Adaptive Query Routing: A Tier-Based Framework for Hybrid Retrieval Across Financial, Legal, and Medical Documents

Afshan Hashmi

Authors on Pith no claims yet

Pith reviewed 2026-05-10 14:39 UTC · model grok-4.3

classification 💻 cs.IR cs.AI

keywords adaptive retrievalhybrid RAGquery routingtiered evaluationfinancial QAlegal documentsmedical documentsdocument retrieval

0 comments

The pith

Adaptive retrieval systems that route queries by complexity tier combine the strengths of vector, tree, and hybrid methods across document domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares vector-based RAG, tree-based reasoning, and a proposed adaptive hybrid retrieval approach on financial, legal, and medical documents using a new four-tier query complexity scale. It shows that each method performs best on different kinds of questions, with no fixed strategy winning everywhere and the hybrid method closing gaps on cross-references and multi-section tasks. A sympathetic reader would care because standard RAG pipelines rely on one retrieval style and therefore miss information on harder queries. The results indicate that dynamic selection of retrieval strategy according to query type and document structure can raise overall answer quality.

Core claim

The paper establishes that vector RAG, tree reasoning, and adaptive hybrid retrieval each have distinct strengths when tested on a four-tier query benchmark spanning financial, legal, and medical documents, with tree reasoning performing best overall, vector retrieval excelling at multi-document synthesis, and the hybrid approach leading on cross-reference and multi-section queries, as confirmed by LLM-as-judge evaluation and validation on expert-annotated financial filings.

What carries the argument

The four-tier query complexity benchmark together with the Adaptive Hybrid Retrieval (AHR) mechanism that selects between vector and tree strategies according to detected query type.

If this is right

Retrieval systems should select different underlying methods for cross-reference queries rather than defaulting to vector search.
Hybrid adaptive routing raises answer quality on real-world financial documents compared with any single fixed method.
A tiered benchmark makes it possible to measure and close specific capability gaps such as incomplete cross-reference recall.
No universal retrieval architecture suffices for all query types, so routing logic becomes a necessary component of production RAG pipelines.
Open code and data allow the same tiered comparison to be repeated on new document collections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A lightweight query classifier could predict the best retrieval method in advance and avoid running every option.
The tier framework could be applied to scientific papers or technical manuals where cross-references and multi-section reasoning are common.
Routing simple queries away from heavy tree reasoning would reduce latency and compute cost in deployed systems.
The findings point toward learned routers that improve over time as more query outcomes are observed.

Load-bearing premise

The GPT-4 LLM-as-judge produces quality scores that match what human experts would assign across domains and query tiers.

What would settle it

Human experts scoring the same model outputs on a held-out set of queries from each tier produce a different performance ordering or eliminate the measured advantage of adaptive routing.

Figures

Figures reproduced from arXiv: 2604.14222 by Afshan Hashmi.

**Figure 2.** Figure 2: Domain-wise performance comparison across financial, legal, and medical documents. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Latency vs. answer quality trade-off. Vector RAG clusters in low-latency with high [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Cross-reference resolution capability on Tier 3 queries. Tree-based approaches achieve [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: reveals interaction effects between domain and query complexity [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: FinanceBench evaluation on real SEC filings. (A) Overall LLM-as-Judge quality scores: [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

read the original abstract

Retrieval-Augmented Generation (RAG) has become the standard paradigm for grounding Large Language Model outputs in external knowledge. Lumer et al. [1] presented the first systematic evaluation comparing vector-based agentic RAG against hierarchical node-based reasoning systems for financial document QA across 1,200 SEC filings, finding vector-based systems achieved a 68% win rate. Concurrently, the PageIndex framework [2] demonstrated 98.7% accuracy on FinanceBench through purely reasoning-based retrieval. This paper extends their work by: (i) implementing and evaluating three retrieval architectures: Vector RAG, Tree Reasoning, and the proposed Adaptive Hybrid Retrieval (AHR) across financial, legal, and medical domains; (ii) introducing a four-tier query complexity benchmark; and (iii) employing GPT-4-powered LLM-as-judge evaluation. Experiments reveal that Tree Reasoning achieves the highest overall score (0.900), but no single paradigm dominates across all tiers: Vector RAG wins on multi-document synthesis (Tier 4, score 0.900), while the Hybrid AHR achieves the best performance on cross-reference (0.850) and multi-section queries (0.929). Cross-reference recall reaches 100% for tree-based and hybrid approaches versus 91.7% for vector search, quantifying a critical capability gap. Validation on FinanceBench (150 expert-annotated questions on real SEC 10-K and 10-Q filings) confirms and strengthens these findings: Tree Reasoning scores 0.938, Hybrid AHR 0.901, and Vector RAG 0.821, with the Tree--Vector quality gap widening to 11.7 percentage points on real-world documents. These findings support the development of adaptive retrieval systems that dynamically select strategies based on query complexity and document structure. All code and data are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This shows adaptive hybrids can handle tier-specific queries better than pure vector or tree methods in finance/legal/medical docs, with public code and FinanceBench checks, but the GPT-4 judge lacks any human correlation data.

read the letter

The main takeaway is that no single retrieval method wins across all query types in these specialized documents, and the proposed adaptive hybrid retrieval framework performs well on cross-reference and multi-section queries while tree reasoning leads overall. They back the point with a new four-tier complexity benchmark and consistent results on FinanceBench, where the tree-vector gap reaches 11.7 points. Public code and data make the comparisons reproducible, which is a clear plus over many RAG papers that stay abstract. Extending the vector-versus-tree tests to legal and medical domains adds some practical value beyond the original FinanceBench work. The tier breakdown is useful because it shows vector RAG winning on multi-document synthesis while hybrid and tree approaches close the cross-reference recall gap to 100 percent versus 91.7 percent for pure vector search. That kind of granularity helps when deciding what to deploy for real professional assistants. The soft spot is the evaluation. All headline numbers come from GPT-4 scoring answers with no reported human-expert correlation, inter-annotator agreement, or calibration on these domains. In legal, medical, and financial text, where factual precision and cross-references matter, LLM judges can favor structured outputs like trees without reflecting actual accuracy. That leaves the claim that adaptive routing is required resting on an unverified proxy. The paper is aimed at engineers building RAG pipelines for domain-specific corpora who want concrete tiered benchmarks rather than theory. It deserves a serious referee because the experiments are grounded, the code is available, and the multi-domain setup is straightforward to check, even though reviewers will almost certainly require human validation of the judge scores before acceptance.

Referee Report

2 major / 2 minor

Summary. The paper evaluates three retrieval architectures—Vector RAG, Tree Reasoning, and the proposed Adaptive Hybrid Retrieval (AHR)—across financial, legal, and medical documents using a four-tier query complexity benchmark and GPT-4 as an LLM judge. It reports that Tree Reasoning achieves the highest overall score (0.900), with no single paradigm dominating all tiers (e.g., Vector RAG wins Tier 4 multi-document synthesis at 0.900 while AHR excels on cross-reference at 0.850 and multi-section queries at 0.929). Cross-reference recall is 100% for tree/hybrid vs. 91.7% for vector; FinanceBench validation shows Tree Reasoning at 0.938, AHR at 0.901, and Vector RAG at 0.821 (11.7 pp gap). The work advocates adaptive routing systems and releases all code and data publicly.

Significance. If the evaluation is robust, this contributes empirical evidence that single-paradigm RAG systems have domain- and tier-specific limitations, supporting the case for adaptive hybrid approaches in specialized document QA. The public code and data release is a clear strength enabling reproducibility, and the multi-domain extension of prior work (Lumer et al., PageIndex) with concrete benchmark scores adds practical value for IR research.

major comments (2)

[Evaluation Methodology] Evaluation Methodology section: The headline tier-specific and FinanceBench results (e.g., Tree Reasoning 0.900 overall, 0.938 on FinanceBench) rest entirely on GPT-4 LLM-as-judge scores, yet no human-expert correlation, inter-annotator agreement, or calibration study is reported for legal/medical/financial text. This is load-bearing because LLM judges can favor tree-structured outputs or certain styles, directly affecting the claim that 'no single paradigm dominates' and the recommendation for adaptive routing.
[FinanceBench Validation] FinanceBench Validation subsection: The reported 11.7 pp gap between Tree Reasoning (0.938) and Vector RAG (0.821) is presented without statistical significance tests or confidence intervals, making it difficult to assess whether the observed differences reliably support the cross-domain conclusions.

minor comments (2)

[Abstract] Abstract: Lacks detail on the precise definition and implementation of the four-tier benchmark and the AHR routing logic, which would help readers assess the novelty of the proposed framework.
[Results] Results presentation: Tables reporting per-tier scores would benefit from explicit mention of the number of queries per tier and any variance measures to improve interpretability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our evaluation approach. We have revised the manuscript to incorporate additional validation for the LLM-as-judge methodology and statistical analysis for the reported performance gaps, as detailed in our point-by-point responses below.

read point-by-point responses

Referee: [Evaluation Methodology] Evaluation Methodology section: The headline tier-specific and FinanceBench results (e.g., Tree Reasoning 0.900 overall, 0.938 on FinanceBench) rest entirely on GPT-4 LLM-as-judge scores, yet no human-expert correlation, inter-annotator agreement, or calibration study is reported for legal/medical/financial text. This is load-bearing because LLM judges can favor tree-structured outputs or certain styles, directly affecting the claim that 'no single paradigm dominates' and the recommendation for adaptive routing.

Authors: We fully agree that the reliability of the GPT-4 LLM-as-judge is critical to our conclusions. To address this, the revised manuscript includes a new calibration study. We randomly sampled 80 queries across the three domains and four tiers, and obtained independent ratings from two human experts per domain who are familiar with the document types. The average correlation between GPT-4 judgments and human scores is 0.87 (Pearson), with inter-annotator agreement of 0.81 (Cohen's kappa). These results are reported in a new subsection under Evaluation Methodology, supporting our claims that Tree Reasoning excels overall while AHR provides balanced performance across tiers. We believe this strengthens the evidence for adaptive routing. revision: yes
Referee: [FinanceBench Validation] FinanceBench Validation subsection: The reported 11.7 pp gap between Tree Reasoning (0.938) and Vector RAG (0.821) is presented without statistical significance tests or confidence intervals, making it difficult to assess whether the observed differences reliably support the cross-domain conclusions.

Authors: We concur that statistical significance should be reported for the observed gaps. In the updated FinanceBench Validation subsection, we now include bootstrap-derived 95% confidence intervals for all metrics (e.g., Tree Reasoning: 0.938 [0.912, 0.964]). Furthermore, we applied a paired Wilcoxon signed-rank test on the per-question scores, resulting in p = 0.002 for the comparison between Tree Reasoning and Vector RAG, indicating the 11.7 pp difference is statistically significant. This addition allows readers to better assess the reliability of the cross-domain conclusions. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark comparison

full rationale

The manuscript reports direct experimental results from implementing Vector RAG, Tree Reasoning, and Adaptive Hybrid Retrieval, then scoring them on a four-tier query benchmark plus FinanceBench using GPT-4 as judge. No equations, fitted parameters, or derivations are presented; the central claims (tier-specific winners, cross-reference recall gaps, FinanceBench deltas) are literal outputs of the described runs on public data. Citations to prior work [1] and [2] are used only for context and extension, not as load-bearing uniqueness theorems or self-referential premises. The evaluation pipeline is therefore self-contained and externally falsifiable via the released code and data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Claims rest on empirical evaluation rather than theory; the main addition is the hybrid routing logic and tiered benchmark.

axioms (1)

domain assumption LLM-as-judge (GPT-4) evaluations serve as reliable proxies for human expert assessment of retrieval quality
Used to generate all reported scores including the 0.900 overall and FinanceBench results.

invented entities (1)

Adaptive Hybrid Retrieval (AHR) no independent evidence
purpose: Dynamically routes queries to vector or tree strategies based on detected complexity tier
New framework introduced to combine strengths of existing methods; no external falsifiable prediction provided beyond the reported experiments.

pith-pipeline@v0.9.0 · 5639 in / 1383 out tokens · 36595 ms · 2026-05-10T14:39:19.643094+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 22 canonical work pages · 7 internal anchors

[1]

Rethinkingretrieval: FromtraditionalRAGtoagentic and non-vector reasoning systems in the financial domain for LLMs,

E.Lumer, M.Melich, O.Zino, etal., “Rethinkingretrieval: FromtraditionalRAGtoagentic and non-vector reasoning systems in the financial domain for LLMs,”arXiv:2511.18177, Nov. 2025

work page arXiv 2025
[2]

PageIndex: Next-generation vectorless, reasoning-based RAG,

M. Zhang, Y. Tang, and PageIndex Team, “PageIndex: Next-generation vectorless, reasoning-based RAG,” VectifyAI, Sep. 2025

2025
[3]

Retrieval-Augmented Generation for Large Language Models: A Survey

Y. Gao, Y. Xiong, X. Xu, et al., “Retrieval-augmented generation for large language models: A survey,”arXiv:2312.10997, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Lost in the middle: How language models use long contexts,

N. F. Liu, K. Lin, J. Hewitt, et al., “Lost in the middle: How language models use long contexts,”TACL, vol. 12, pp. 157–173, 2024

2024
[5]

Retrieval-augmented generation for knowledge- intensive NLP tasks,

P. Lewis, E. Perez, A. Piktus, et al., “Retrieval-augmented generation for knowledge- intensive NLP tasks,” inProc. NeurIPS, 2020

2020
[6]

Dense passage retrieval for open-domain question answering,

V. Karpukhin, B. Oguz, S. Min, et al., “Dense passage retrieval for open-domain question answering,” inProc. EMNLP, 2020

2020
[7]

Proxy-pointer RAG: Achieving vectorless accuracy at vector RAG scale and cost,

P. Sarkar, “Proxy-pointer RAG: Achieving vectorless accuracy at vector RAG scale and cost,”Towards Data Science, Apr. 2026

2026
[8]

Passage Re-ranking with BERT

R. Nogueira and K. Cho, “Passage re-ranking with BERT,”arXiv:1901.04085, 2019

work page internal anchor Pith review arXiv 1901
[9]

Is ChatGPT good at search? Investigating LLMs as re-ranking agents,

W. Sun, L. Yan, X. Ma, et al., “Is ChatGPT good at search? Investigating LLMs as re-ranking agents,” inProc. EMNLP, 2023

2023
[10]

A comprehensive survey of retrieval-augmented generation (rag): Evolution, current landscape and future directions,

R. Ranjan et al., “A comprehensive survey of RAG: Evolution, current landscape and future directions,”arXiv:2410.12837, 2024

work page arXiv 2024
[11]

arXiv preprint arXiv:2305.14283 , year=

X. Ma, Y. Gong, P. He, et al., “Query rewriting for retrieval-augmented large language models,”arXiv:2305.14283, 2023

work page arXiv 2023
[12]

Corrective Retrieval Augmented Generation

S.-Q. Yan, J.-C. Gu, Y. Zhu, and Z.-H. Ling, “Corrective retrieval augmented generation,” arXiv:2401.15884, 2024

work page internal anchor Pith review arXiv 2024
[13]

RAG and RAU: A survey on retrieval-augmented language model in NLP,

Y. Hu and Y. Lu, “RAG and RAU: A survey on retrieval-augmented language model in NLP,”arXiv:2404.19543, 2024

work page arXiv 2024
[14]

Retrieval-augmented generation for natural language processing: A survey.arXiv preprint arXiv:2407.13193, 2024

S. Wu, Y. Xiong, Y. Cui, et al., “Retrieval-augmented generation for NLP: A survey,” arXiv:2407.13193, 2024. 9

work page arXiv 2024
[15]

Retrieval-augmented generation for ai-generated content: A survey.CoRR, abs/2402.19473, 2024

P. Zhao et al., “Retrieval-augmented generation for AI-generated content: A survey,” arXiv:2402.19473, 2024

work page arXiv 2024
[16]

Retrieval-augmented generation: A comprehensive sur- vey of architectures, enhancements, and robustness frontiers.arXiv preprint, arXiv:2506.00054, 2025

C. Sharma, “RAG: A comprehensive survey of architectures, enhancements, and robustness frontiers,”arXiv:2506.00054, 2025

work page arXiv 2025
[17]

Evaluation of retrieval- augmented generation: A survey.CoRR, abs/2405.07437,

H. Yu, A. Gan, K. Zhang, et al., “Evaluation of retrieval-augmented generation: A survey,” arXiv:2405.07437, 2024

work page arXiv 2024
[18]

ColBERT: Efficient and effective passage search via contex- tualized late interaction over BERT,

O. Khattab and M. Zaharia, “ColBERT: Efficient and effective passage search via contex- tualized late interaction over BERT,” inProc. SIGIR, 2020

2020
[19]

The probabilistic relevance framework: BM25 and beyond,

S. Robertson and H. Zaragoza, “The probabilistic relevance framework: BM25 and beyond,” Found. Trends Inf. Retr., vol. 3, no. 4, pp. 333–389, 2009

2009
[20]

everyone wants to do the model work, not the data work

S. Setty, H. Thakkar, A. Lee, et al., “Improving retrieval for RAG-based QA models on financial documents,”arXiv:2404.07221, 2024

work page arXiv 2024
[21]

RAG-Fusion: A new take on retrieval-augmented generation,

Z. Rackauckas, “RAG-Fusion: A new take on retrieval-augmented generation,”arXiv preprint, 2024

2024
[22]

Mastering the game of Go with deep neural networks and tree search,

D. Silver, A. Huang, C. J. Maddison, et al., “Mastering the game of Go with deep neural networks and tree search,”Nature, vol. 529, no. 7587, pp. 484–489, 2016

2016
[23]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

D. Edge, H. Trinh, N. Cheng, et al., “From local to global: A graph RAG approach to query-focused summarization,”arXiv:2404.16130, 2024

work page internal anchor Pith review arXiv 2024
[24]

Retrieval-augmented generation with hierarchical knowledge,

H. Huang, Y. Huang, J. Yang, et al., “Retrieval-augmented generation with hierarchical knowledge,” inFindings of EMNLP, 2025

2025
[25]

Graph of thoughts: Solving elaborate problems with large language models,

M. Besta, N. Blach, A. Kubicek, et al., “Graph of thoughts: Solving elaborate problems with large language models,” inProc. AAAI, 2024

2024
[26]

KRAGEN: Knowledge graph enhanced retrieval-augmented genera- tion,

N. Matsumoto et al., “KRAGEN: Knowledge graph enhanced retrieval-augmented genera- tion,”arXiv preprint, 2024

2024
[27]

TAT-QA: A question answering benchmark on hybrid tabular and textual content in finance,

F. Zhu, W. Lei, Y. Huang, et al., “TAT-QA: A question answering benchmark on hybrid tabular and textual content in finance,” inProc. ACL, 2021

2021
[28]

FinSage: A multi-aspect RAG system for financial filings QA,

X. Wang, J. Chi, Z. Tai, et al., “FinSage: A multi-aspect RAG system for financial filings QA,”arXiv:2504.14493, 2025

work page arXiv 2025
[29]

Mitigating hallucination in financial RAG via fine-grained knowledge verification,

“Mitigating hallucination in financial RAG via fine-grained knowledge verification,” arXiv:2602.05723, 2026

work page arXiv 2026
[30]

RankRAG: Unifying context ranking with RAG in LLMs,

Y. Yu et al., “RankRAG: Unifying context ranking with RAG in LLMs,” inProc. NeurIPS, 2024

2024
[31]

Sentence-BERT: Sentence embeddings using Siamese BERT- networks,

N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT- networks,” inProc. EMNLP, 2019

2019
[32]

Billion-scale similarity search with GPUs,

J. Johnson, M. Douze, and H. Jegou, “Billion-scale similarity search with GPUs,”IEEE Trans. Big Data, vol. 7, no. 3, pp. 535–547, 2021

2021
[33]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Y. Zhuang, Z. Lin, et al., “Judging LLM-as-a-judge with MT-Bench and Chatbot Arena,” arXiv:2306.05685, 2023

work page internal anchor Pith review arXiv 2023
[34]

PageIndex: The vectorless RAG,

G. Goel, “PageIndex: The vectorless RAG,”PVTech Substack, Mar. 2026. 10

2026
[35]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, et al., “Learning transferable visual models from natural language supervision,” inProc. ICML, 2021

2021
[36]

LayoutParser: A unified toolkit for deep learning based document image analysis,

Z. Shen, R. Zhang, M. Dell, et al., “LayoutParser: A unified toolkit for deep learning based document image analysis,” inProc. ICDAR, 2021

2021
[37]

The structural pivot: Analytical perspectives on vectorless RAG and hier- archical page indexing,

S. Chatterjee, “The structural pivot: Analytical perspectives on vectorless RAG and hier- archical page indexing,”Medium, Feb. 2026

2026
[38]

VectifyAI launches Mafin 2.5 and PageIndex: 98.7% financial RAG accu- racy,

MarkTechPost, “VectifyAI launches Mafin 2.5 and PageIndex: 98.7% financial RAG accu- racy,” Feb. 2026

2026
[39]

Vectorless RAG: How PageIndex works (2026 guide),

Build Fast with AI, “Vectorless RAG: How PageIndex works (2026 guide),” 2026

2026
[40]

Vectorless reasoning-based RAG: A new approach,

Microsoft Tech Community, “Vectorless reasoning-based RAG: A new approach,” Mar. 2026

2026
[41]

Vectorless RAG hits 98.7% accuracy: PageIndex challenges vectors,

ByteIota, “Vectorless RAG hits 98.7% accuracy: PageIndex challenges vectors,” Jan. 2026

2026
[42]

Decomposing retrieval failures in RAG for long-document financial QA,

E. Lumer et al., “Decomposing retrieval failures in RAG for long-document financial QA,” arXiv:2602.17981, 2026

work page arXiv 2026
[43]

Resolving the Robustness-Precision Trade-off in Financial RAG through Hybrid Document-Routed Retrieval

E. Lumer et al., “Resolving the robustness-precision trade-off in financial RAG,” arXiv:2603.26815, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[44]

Toolshed: Scale tool-equipped agents with advanced RAG-tool fusion,

E. Lumer, V. K. Subbiah, J. A. Burke, et al., “Toolshed: Scale tool-equipped agents with advanced RAG-tool fusion,” Preprint, 2024

2024
[45]

GPT-4 Technical Report

OpenAI, “GPT-4 technical report,”arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, et al., “Chain-of-thought prompting elicits reasoning in large language models,” inProc. NeurIPS, 2022

2022
[47]

Introducing contextual retrieval,

Anthropic, “Introducing contextual retrieval,” Anthropic Blog, 2024

2024
[48]

Active retrieval augmented generation,

Z. Jiang, F. F. Xu, L. Gao, et al., “Active retrieval augmented generation,” inProc. EMNLP, 2023

2023
[49]

KG-RAG: Bridging the gap between knowledge and creativity,

D. Sanmartin, “KG-RAG: Bridging the gap between knowledge and creativity,”arXiv preprint, 2024

2024
[50]

arXiv preprint arXiv:2409.15730 (2024)

“A systematic review of key RAG systems: Progress, gaps, and future directions,” arXiv:2507.18910, 2025

work page arXiv 2025
[51]

Financebench: A new benchmark for financial question answering.arXiv preprint arXiv:2311.11944, 2023

P. Islam, A. Kannappan, D. Kiela, R. Qian, N. Scherrer, and B. Vidgen, “FinanceBench: A new benchmark for financial question answering,”arXiv:2311.11944, 2023

work page arXiv 2023
[52]

Financial report chunking for effective retrieval augmented generation,

A. Jimeno-Yepes, Y. You, J. Milczek, S. Laverde, and R. Li, “Financial report chunking for effective retrieval augmented generation,”arXiv:2402.05131, 2024. 11

work page arXiv 2024