MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering

Chinmay Gondhalekar; Fang-Chun Yeh; Urjitkumar Patel

arXiv: 2506.20821 · v1 · submitted 2025-06-25 · cs.CL · cs.AI · cs.CE

Pith reviewed 2026-05-19 07:17 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.CE

keywords multimodal retrieval-augmented generation · financial question answering · multimodal LLM · table and figure extraction · cross-modal reasoning · financial documents · RAG framework

The pith

MultiFinRAG achieves 19 percentage points higher accuracy than ChatGPT-4o on complex financial QA using commodity hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MultiFinRAG, a retrieval-augmented generation framework built specifically for answering questions over financial documents that include text, tables, and figures. It extracts structured information from tables and images using a lightweight multimodal model and indexes everything with modality-aware rules to support precise retrieval. A tiered fallback mechanism adds table and image context only when text alone is insufficient, allowing cross-modal reasoning without overwhelming the model with too much input. This approach addresses token limits and layout loss in standard RAG systems for long, multimodal financial filings. If the results hold, it would enable high-accuracy financial analysis on everyday computers instead of depending on the most powerful cloud-based models.

Core claim

MultiFinRAG first performs multimodal extraction by grouping table and figure images into batches and sending them to a lightweight, quantized open-source multimodal LLM, which produces both structured JSON outputs and concise textual summaries. These outputs, along with narrative text, are embedded and indexed with modality-aware similarity thresholds for precise retrieval. A tiered fallback strategy then dynamically escalates from text-only to text+table+image contexts when necessary, enabling cross-modal reasoning while reducing irrelevant context. Despite running on commodity hardware, this yields 19 percentage points higher accuracy than ChatGPT-4o on tasks involving text, tables, and a

What carries the argument

The tiered fallback strategy with modality-aware similarity thresholds for retrieval, which selectively incorporates text, table, and image data to support joint reasoning across modalities.
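A minimal sketch of how a tiered fallback with per-modality thresholds could be wired together. The threshold values, the `min_text_hits` sufficiency test, the `Chunk` record, and the two-dimensional toy embeddings are all assumptions made for illustration; the summary above does not say how the paper decides that text-only context is insufficient.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    modality: str      # "text", "table", or "image"
    content: str
    embedding: list    # toy 2-d vector; a real system would use a sentence encoder


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical per-modality thresholds, not the paper's tuned values.
THRESHOLDS = {"text": 0.70, "table": 0.65, "image": 0.55}


def retrieve(query_emb, chunks, modalities, top_k=3):
    """Keep chunks of the given modalities that clear their own threshold."""
    scored = [(cosine(query_emb, c.embedding), c)
              for c in chunks if c.modality in modalities]
    kept = [(s, c) for s, c in scored if s >= THRESHOLDS[c.modality]]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept[:top_k]]


def tiered_retrieve(query_emb, chunks, min_text_hits=2):
    """Tier 1 is text only; escalate to all modalities if tier 1 looks thin."""
    text_hits = retrieve(query_emb, chunks, {"text"})
    if len(text_hits) >= min_text_hits:    # assumed sufficiency test
        return "text-only", text_hits
    return "multimodal", retrieve(query_emb, chunks, {"text", "table", "image"})


chunks = [
    Chunk("text", "Revenue grew in fiscal 2023.", [1.0, 0.0]),
    Chunk("text", "Risk factors include currency exposure.", [0.0, 1.0]),
    Chunk("table", "Segment revenue table summary.", [0.8, 0.6]),
    Chunk("image", "Bar chart of quarterly revenue.", [0.6, 0.8]),
]

tier_a, hits_a = tiered_retrieve([1.0, 0.0], chunks)
tier_b, hits_b = tiered_retrieve([1.0, 1.0], chunks)
print(tier_a, [c.modality for c in hits_a])
print(tier_b, [c.modality for c in hits_b])
```

With these toy vectors, the first query finds only one text chunk above its threshold and so escalates to the multimodal tier, while the second is served from text alone.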

If this is right

Analysts can extract and reason over information from lengthy 10-Ks and investor presentations more effectively by combining multiple data formats in one pipeline.
The framework minimizes irrelevant context and token usage through dynamic escalation to additional modalities.
Deployment becomes feasible on standard hardware due to the use of quantized open-source models for extraction and retrieval.
Questions requiring combined understanding of narrative, numerical tables, and visual figures can be handled without fragmenting the context.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar selective modality approaches might improve RAG performance in other domains with mixed document types such as scientific papers or legal contracts.
Reducing reliance on full multimodal context could lower computational costs for processing large document collections over time.
Testing the system on updated financial datasets or with different evaluation questions would clarify if the accuracy gains generalize beyond the reported tasks.

Load-bearing premise

The specific financial QA tasks, datasets, and evaluation protocol used to measure the 19 percentage point gain are representative, unbiased, and fairly compared against the baseline model.

What would settle it

Running the same evaluation protocol on a newly collected set of financial questions with tables and figures and finding that MultiFinRAG no longer shows a 19 percentage point advantage over ChatGPT-4o.
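One way such a rerun could be scored is sketched below: compute the accuracy gap in percentage points (an absolute difference between two accuracies, not a relative gain) and attach a paired bootstrap interval so the gap can be judged against sampling noise. The per-question outcomes are made up for illustration and are not the paper's data; the function names and the choice of a percentile bootstrap are assumptions.

```python
import random


def accuracy(outcomes):
    """Fraction of questions answered correctly (outcomes are 0 or 1)."""
    return sum(outcomes) / len(outcomes)


def paired_bootstrap_gap(system, baseline, n_resamples=2000, seed=0):
    """95% percentile interval for the accuracy gap, in percentage points.

    Questions are resampled with replacement, keeping each question's
    system and baseline outcomes paired.
    """
    assert len(system) == len(baseline)
    rng = random.Random(seed)
    n = len(system)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        gap = sum(system[i] - baseline[i] for i in idx) / n
        gaps.append(100.0 * gap)
    gaps.sort()
    return gaps[int(0.025 * n_resamples)], gaps[int(0.975 * n_resamples) - 1]


# Made-up outcomes for 20 questions (1 = correct, 0 = wrong).
system = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
baseline = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]

point_gap = 100.0 * (accuracy(system) - accuracy(baseline))
low, high = paired_bootstrap_gap(system, baseline)
print(f"gap = {point_gap:.1f} pp, 95% interval = [{low:.1f}, {high:.1f}] pp")
```

On a small test set the interval is wide, which is why the size of the question set matters when interpreting a headline gap.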

Figures

Figures reproduced from arXiv: 2506.20821 by Chinmay Gondhalekar, Fang-Chun Yeh, Urjitkumar Patel.

**Figure 2.** Generated table description and JSON. Adjacent paper text captured with the caption: "… numerical, and visual information in long, layout-rich PDFs, MultiFinRAG addresses a key limitation in existing retrieval-augmented approaches." From Section 3.1 (Models and Tools Used): "We utilize a combination of specialized and general-purpose models to handle the multimodal nature of financial documents: Table Detection: Detectron2Layout [27], pre-trained on the Tab…"

**Figure 4.** Example for type 2 questions. Adjacent paper text captured with the caption: "… where $C_i^{\text{text}}$ are semantically coherent text passages, and $C_i^{\text{table}}$, $C_i^{\text{image}}$ are table and figure regions converted via a multimodal LLM. All chunks are embedded and stored in an approximate FAISS index. A query $Q$ triggers a tiered retrieval (text-only, then text + table and image), automatically escalating whenever context is insufficient, before a final LLM answer generat…"

**Figure 3.** Image Summary Generation Flowchart.

**Figure 5.** Example for type 3 questions. Adjacent paper text captured with the caption (Batch Image Summarization): "Similarly, figure regions $V$ (charts, diagrams, etc.) are batched into $\lceil |V|/B \rceil$ groups. For each batch $v_b$ (lines 18–23): we send a batch prompt requesting a 3–6 sentence summary per image, explicitly instructing the LLM to ignore non-data visuals (e.g. logos, watermarks). The returned summaries $\{\mathrm{sum}_j\}$ are embedded, normalized, and inserted into …"
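The batching and normalization steps in that excerpt can be sketched in a few lines. This is an illustration under assumptions, not the paper's code: the helper names, the file names, and the batch size of 3 are hypothetical, and the call to the multimodal LLM is omitted entirely.

```python
import math


def make_batches(items, batch_size):
    """Split items into ceil(len(items) / batch_size) consecutive groups."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def l2_normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


figures = [f"figure_{i}.png" for i in range(7)]    # hypothetical figure crops
batches = make_batches(figures, batch_size=3)

print(len(batches), math.ceil(len(figures) / 3))   # both counts agree
print(l2_normalize([3.0, 4.0]))
```

Normalizing embeddings before insertion is a common choice when the index compares vectors by inner product, since the score then equals cosine similarity.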

**Figure 6.** Examples for type 4 questions. Adjacent paper text captured with the caption: "… quality (precision of retrieved contexts) and end-to-end QA accuracy: (1) Text threshold sweep: vary $\theta_{\text{text}}$ from 0.55 to 0.85 in steps of 0.05; at each value, retrieve all text chunks with $\cos(E(Q), E(c)) \ge \theta_{\text{text}}$, feed the top-$k$ to the LLM, and record answer accuracy. (2) Table & image threshold sweep: keeping $\theta_{\text{text}}$ fixed, vary $(\theta_{\text{table}}, \theta_{\text{image}})$ independently from 0.55 to 0.75; …"
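The text threshold sweep in that excerpt can be illustrated as follows. The grid (0.55 to 0.85 in steps of 0.05) comes from the excerpt; everything else is an assumption: the similarity scores and relevance labels are toy values, `k=3` is arbitrary, and retrieval precision stands in for the end-to-end answer accuracy that would require running an LLM.

```python
def threshold_grid(start=0.55, stop=0.85, step=0.05):
    """Thresholds from start to stop inclusive, rounded to avoid float drift."""
    n = round((stop - start) / step)
    return [round(start + i * step, 2) for i in range(n + 1)]


def retrieve_top_k(scored_chunks, theta, k):
    """Keep chunks with similarity >= theta, then take the k most similar."""
    kept = [item for item in scored_chunks if item[0] >= theta]
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept[:k]


def precision(retrieved):
    """Fraction of retrieved chunks labeled relevant; None if nothing retrieved."""
    if not retrieved:
        return None
    return sum(1 for _, relevant in retrieved if relevant) / len(retrieved)


# Toy (similarity, is_relevant) pairs for a single query.
scored = [(0.91, True), (0.78, True), (0.72, False), (0.66, True), (0.58, False)]

sweep = {theta: precision(retrieve_top_k(scored, theta, k=3))
         for theta in threshold_grid()}

for theta, value in sweep.items():
    print(theta, value)
```

In this toy case, raising the threshold past the score of the irrelevant chunk improves precision, at the cost of passing fewer chunks to the generator.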

**Figure 7.** Accuracy Comparison by Question Type.

Original abstract

Financial documents--such as 10-Ks, 10-Qs, and investor presentations--span hundreds of pages and combine diverse modalities, including dense narrative text, structured tables, and complex figures. Answering questions over such content often requires joint reasoning across modalities, which strains traditional large language models (LLMs) and retrieval-augmented generation (RAG) pipelines due to token limitations, layout loss, and fragmented cross-modal context. We introduce MultiFinRAG, a retrieval-augmented generation framework purpose-built for financial QA. MultiFinRAG first performs multimodal extraction by grouping table and figure images into batches and sending them to a lightweight, quantized open-source multimodal LLM, which produces both structured JSON outputs and concise textual summaries. These outputs, along with narrative text, are embedded and indexed with modality-aware similarity thresholds for precise retrieval. A tiered fallback strategy then dynamically escalates from text-only to text+table+image contexts when necessary, enabling cross-modal reasoning while reducing irrelevant context. Despite running on commodity hardware, MultiFinRAG achieves 19 percentage points higher accuracy than ChatGPT-4o (free-tier) on complex financial QA tasks involving text, tables, images, and combined multimodal reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The 19-point accuracy claim over ChatGPT-4o is the headline result but rests on an evaluation setup that is not described in enough detail to judge.

The paper's core contribution is a practical RAG pipeline for long financial documents that mixes text, tables, and figures. It batches table and figure images through a quantized open-source multimodal model to produce JSON structures and short summaries, then indexes everything with separate similarity thresholds per modality and uses a tiered fallback that adds table or image context only when the initial text retrieval falls short. This setup is meant to keep token counts manageable on commodity hardware while supporting cross-modal reasoning on 10-Ks and investor decks.

That engineering focus is the part that feels most grounded and potentially reusable for anyone already running RAG on domain documents with layout-heavy content. The hardware claim is also straightforward and worth checking against real deployments.

The main gap is in the results. The abstract states a 19 percentage point accuracy lift over ChatGPT-4o free-tier on complex financial QA tasks, yet gives no dataset name, no count of questions or documents, no breakdown of question types, and no description of how the baseline was prompted or given context. Without those pieces it is difficult to tell whether the comparison is apples-to-apples or whether the test set emphasizes the exact failure modes the tiered fallback is designed to fix. There is also no mention of statistical significance or variance across runs. These omissions make the size of the reported gain hard to interpret right now.

The work is aimed at applied researchers and engineers who build retrieval systems for finance or similar regulated domains. Readers who need concrete ideas for modality-aware indexing and cheap multimodal extraction will find usable details even if they end up changing the thresholds or fallback logic. The paper is coherent on its own terms and shows honest attention to the practical constraints of financial documents, so it clears the bar for a serious referee. I would send it to review with a request for the missing evaluation specifics rather than desk-rejecting it.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces MultiFinRAG, a multimodal RAG framework for financial QA over long documents containing narrative text, tables, and figures. It performs batch processing of table/figure images via a quantized open-source multimodal LLM to generate structured JSON and summaries, applies modality-aware similarity thresholds for embedding and retrieval, and uses a tiered fallback strategy to escalate context from text-only to full multimodal when needed. The central empirical claim is that the system, runnable on commodity hardware, delivers 19 percentage points higher accuracy than ChatGPT-4o (free-tier) on complex financial QA tasks requiring text, table, image, and cross-modal reasoning.

Significance. If the accuracy gains can be rigorously validated, the work would provide a practical, hardware-efficient approach to multimodal financial document QA that mitigates token limits and layout fragmentation. The use of open-source quantized models and structured outputs for retrieval is a concrete engineering contribution that could aid reproducible deployment in the financial domain.

major comments (1)

[Abstract] The central claim of a 19 percentage point accuracy improvement over ChatGPT-4o (free-tier) is presented without any information on the evaluation dataset (name, size, source, or construction), the precise definition or distribution of 'complex financial QA tasks', the accuracy metric (exact match, F1, or LLM-as-judge), statistical significance testing, or the exact prompting, context, and retrieval setup provided to the baseline model. This omission directly undermines assessment of whether the reported delta reflects genuine cross-modal superiority or differences in experimental conditions.
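To make the metric question concrete, here is a sketch of two of the options the comment names, exact match and token-level F1, in the style commonly used for extractive QA. It is not taken from the paper; the normalization rules are an assumption, and LLM-as-judge scoring is not shown.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The net income was $4.2 billion.", "net income was $4.2 billion")
f1 = token_f1("Net income was $4.2 billion in 2023", "$4.2 billion")
print(em, round(f1, 3))
```

The same answer can score very differently under the two metrics, which is one reason the choice needs to be stated. Note also that stripping punctuation turns "4.2" into "42", so a numeric-aware normalizer would likely be needed for financial answers.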

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and constructive feedback on our manuscript. We address the major comment below and will revise the manuscript to improve clarity and completeness.

Referee: [Abstract] The central claim of a 19 percentage point accuracy improvement over ChatGPT-4o (free-tier) is presented without any information on the evaluation dataset (name, size, source, or construction), the precise definition or distribution of 'complex financial QA tasks', the accuracy metric (exact match, F1, or LLM-as-judge), statistical significance testing, or the exact prompting, context, and retrieval setup provided to the baseline model. This omission directly undermines assessment of whether the reported delta reflects genuine cross-modal superiority or differences in experimental conditions.

Authors: We agree that the abstract would benefit from additional contextual information to allow readers to better assess the reported performance gains. In the revised version, we will expand the abstract to include the name, size, and source of the evaluation dataset, a brief description of the complex financial QA tasks and their distribution, the accuracy metric used, and key details on the baseline setup including prompting and context provided to ChatGPT-4o. Full experimental details, including dataset construction, statistical significance testing, and exact configurations, are already presented in the Experiments section; however, we acknowledge that summarizing these elements in the abstract will strengthen the presentation without compromising conciseness.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical accuracy gain is measured outcome, not derived by construction

The paper describes a multimodal RAG pipeline (extraction to quantized LLM, modality-aware indexing, tiered fallback) and reports a 19pp accuracy improvement over ChatGPT-4o as the result of executing that pipeline on financial QA tasks. No equations, fitted parameters, or first-principles derivations are presented that reduce the reported delta to the framework's own inputs or to self-citations. The central claim is therefore an external empirical measurement rather than a self-referential quantity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on standard components of RAG and multimodal LLMs plus a few engineering choices whose tuning details are not provided.

free parameters (1)

modality-aware similarity thresholds
Used to control retrieval precision across text, table, and image modalities; values are not specified.

axioms (1)

Domain assumption: A lightweight quantized multimodal LLM can reliably convert financial table and figure images into accurate structured JSON and concise textual summaries.
This assumption underpins the entire extraction stage described in the abstract.
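Because this assumption carries so much weight, a pipeline built on it would plausibly check each model response before indexing it. The sketch below shows one such check. The schema (keys `title`, `columns`, `rows`, `summary`) is hypothetical; the abstract says only that the model produces structured JSON and a summary, not what fields they contain.

```python
import json

REQUIRED_KEYS = ("title", "columns", "rows", "summary")    # assumed schema


def parse_table_output(raw):
    """Parse a model response into (record, error); exactly one is None."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(record, dict):
        return None, "expected a JSON object"
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        return None, f"missing keys: {missing}"
    columns, rows = record["columns"], record["rows"]
    if not isinstance(columns, list) or not isinstance(rows, list):
        return None, "columns and rows must be lists"
    if any(not isinstance(row, list) or len(row) != len(columns) for row in rows):
        return None, "row length does not match column count"
    return record, None


good = ('{"title": "Segment revenue", "columns": ["Segment", "FY2023"], '
        '"rows": [["Cloud", 12.5], ["Devices", 8.1]], '
        '"summary": "Cloud revenue exceeded devices revenue."}')
bad = ('{"title": "Segment revenue", "columns": ["Segment", "FY2023"], '
       '"rows": [["Cloud"]], "summary": "Truncated row."}')

print(parse_table_output(good)[1])
print(parse_table_output(bad)[1])
print(parse_table_output("not json")[1])
```

A structural check like this catches malformed output but cannot tell whether the extracted numbers match the source table, which is the harder part of the assumption.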

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Despite running on commodity hardware, MultiFinRAG achieves 19 percentage points higher accuracy than ChatGPT-4o (free-tier) on complex financial QA tasks"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning
cs.CV · 2025-11 · unverdicted · novelty 5.0

AVATAAR reports relative gains of 5-8% over baseline on CinePile benchmark categories through agentic feedback for long video QA.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1] Meta AI. 2024. Introducing Llama 3.1: Our most capable models to date. https://ai.meta.com/blog/meta-llama-3-1/. Accessed: December 18, 2024

[2] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv:2310.11511 [cs.CL] https://arxiv.org/abs/2310.11511

[3] Guanting Dong, Yutao Zhu, Chenghao Zhang, Zechen Wang, Zhicheng Dou, and Ji-Rong Wen. 2024. Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation. arXiv:2406.18676 [cs.CL] https://arxiv.org/abs/2406.18676

[4] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss library. arXiv preprint 1 (2024), xx pages. arXiv:2401.08281 [cs.LG]

[5] Masoomali Fatehkia, Ji Kim Lucas, and Sanjay Chawla. 2024. T-RAG: Lessons from the LLM Trenches. arXiv:2402.07483 [cs.AI] https://arxiv.org/abs/2402.07483

[6] Google. 2023. Google Colaboratory. https://colab.research.google.com/. Accessed: May 15, 2025

[8] KE Kannammal, Mr Anirudh RK, Kuzhali Tamizhiniyal P, et al. 2025. Fin-Rag A Rag System for Financial Documents. International Journal of Innovative Science and Research Technology 10, 4 (2025), 1761–1767

[9] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. arXiv:2004.04906 [cs.CL] https://arxiv.org/abs/2004.04906

[10] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems (NeurIPS). Curran Associates, Inc. http...

[11–12] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. TableBank: A Benchmark Dataset for Table Detection and Recognition. arXiv:1903.01949 [cs.CV] https://arxiv.org/abs/1903.01949

[13] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024), xx–yy

[14] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv:2303.16634 [cs.CL] https://arxiv.org/abs/2303.16634

[15] Oscar Mañas, Benno Krojer, and Aishwarya Agrawal. 2024. Improving Automatic VQA Evaluation Using Large Language Models. arXiv:2310.02567 [cs.CV] https://arxiv.org/abs/2310.02567

[16] OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/. Accessed: December 18, 2024

[17] Urjitkumar Patel, Fang-Chun Yeh, and Chinmay Gondhalekar. 2024. CANAL - Cyber Activity News Alerting Language Model : Empirical Approach vs. Expensive LLMs. In 2024 IEEE 3rd International Conference on AI in Cybersecurity (ICAIC). IEEE, 1–12. https://doi.org/10.1109/icaic60265.2024.10433839

[18] Urjitkumar Patel, Fang-Chun Yeh, Chinmay Gondhalekar, and Hari Nalluri. 2024. FANAL – Financial Activity News Alerting Language Modeling Framework. arXiv:2412.03527 [cs.CL] https://arxiv.org/abs/2412.03527

[19] Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, David Seunghyun Yoon, Ryan A. Rossi, and Franck Dernoncourt. 2023. PDFTriage: Question Answering over Long, Structured Documents. arXiv:2309.08872 [cs.CL] https://arxiv.org/abs/2309.08872

[20] Alireza Salemi and Hamed Zamani. 2024. Evaluating Retrieval Quality in Retrieval-Augmented Generation. arXiv:2404.13781 [cs.CL] https://arxiv.org/abs/2404.13781

[21] Yusuke Shinyama. 2007. PDFMiner - Python PDF Parser

[22] John Smith, Jane Doe, and Emily Johnson. 2024. Financial Report Chunking for Effective Retrieval Augmented Generation. arXiv preprint arXiv:2402.05131 (2024)

[23] Author Su. 2024. Title Placeholder. Journal Placeholder 1 (2024)

[24] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295 1, 1 (2024)

[25] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. 10 pages

[26–27] Yang Wang, Alberto Garcia Hernandez, Roman Kyslyi, and Nicholas Kersting. Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need. arXiv:2406.18064 [cs.CL] https://arxiv.org/abs/2406.18064

[28] Kevin Wu, Eric Wu, and James Y Zou. 2024. Clasheval: Quantifying the tug-of-war between an llm’s internal prior and external evidence. Advances in Neural Information Processing Systems 37 (2024), 33402–33422

[29–30] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. "Detectron2: FAIR’s Next-Generation Library for Object Detection and Segmentation," GitHub repository, 2019. https://github.com/facebookresearch/detectron2

[31] Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola. 2023. GPU Schedules Architecture Notebook. https://colab.research.google.com/github/d2l-ai/d2l-tvm-colab/blob/master/chapter_gpu_schedules/arch.ipynb. Accessed: May 16, 2025

[32] Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, and Jian-Yun Nie. 2023. Retrieve Anything To Augment Large Language Models. arXiv:2310.07554 [cs.IR] https://arxiv.org/abs/2310.07554

[33] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. arXiv:1904.09675 [cs.CL] https://arxiv.org/abs/1904.09675

[34] Mingyu Zhong et al. 2024. Mix-of-Granularity: Dynamic Chunking for Knowledge Integration in RAG Systems. ArXiv abs/2401.12345 (2024), xx–yy.
