Retrieval-Augmented Reasoning for Chartered Accountancy

Akhil Sharma; Ali Imam Abidi; Jatin Gupta; Saransh Singhania

arXiv: 2605.00257 · v1 · submitted 2026-04-30 · 💻 cs.CL · cs.AI · cs.IR

Pith reviewed 2026-05-09 19:46 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR

keywords: Retrieval-Augmented Generation, Chartered Accountancy, Chain-of-Thought, RAG, Large Language Models, Indian CA, Benchmark

The pith

A retrieval-augmented 14B model reaches 68.75% of the Scholastic Reliability Coefficient that GPT-4o and Claude 3.5 Sonnet achieve on chartered accountancy tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CA-ThinkFlow, a parameter-efficient retrieval-augmented generation framework for handling the numerical and regulatory demands of Indian chartered accountancy. It uses a 14B quantized reasoning model along with layout-aware document extraction to retrieve relevant passages and then applies the model's built-in chain-of-thought reasoning. This combination produces answers that achieve 68.75 percent of the Scholastic Reliability Coefficient of leading proprietary models on the CA-Ben benchmark. The approach matters for domains where access to large models is limited by cost or infrastructure, offering a lighter alternative for professional-level tasks. The authors acknowledge that the system still encounters difficulties with highly complex regulatory material such as taxation rules.

Core claim

CA-ThinkFlow is presented as a parameter-efficient Retrieval-Augmented Generation framework built on a 14B, 4-bit-quantized reasoning model (14B-DeepSeek-R1) and a layout-aware Docling extraction system that preserves document structure during extraction. It uses a basic RAG method that automatically adds retrieved passages to the prompt and relies on the model's built-in Chain-of-Thought to work through that context and produce the answer. On the multi-level CA-Ben benchmark, the system's Scholastic Reliability Coefficient equals 68.75% of that of GPT-4o and Claude 3.5 Sonnet.
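
To make the "parameter-efficient" part of the claim concrete, the following is a back-of-the-envelope sketch of weight storage for a 14B model at 16 bits versus 4 bits per weight. It assumes exactly 14e9 parameters and counts weights only; real quantized checkpoints also store scales and other metadata, and inference needs extra memory for activations and the KV cache, so actual figures would be somewhat higher. The function name `weight_memory_gb` is illustrative, not taken from the paper.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in decimal gigabytes."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 1e9


# Assumption: "14B" is treated as exactly 14e9 parameters.
N_PARAMS = 14e9

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # half-precision baseline
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: about {fp16_gb:.1f} GB")
print(f"4-bit weights:  about {int4_gb:.1f} GB")
print(f"reduction factor: {fp16_gb / int4_gb:.1f}x")
```

Under these assumptions the quantized weights take roughly a quarter of the half-precision footprint, which is the kind of saving that makes a single consumer-grade accelerator plausible.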

What carries the argument

Basic RAG that automatically injects retrieved passages into the prompt, paired with the 14B model's native Chain-of-Thought reasoning and layout-aware document extraction.
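
The retrieve-then-inject pattern described above can be sketched in a few lines. This is a toy illustration, not the authors' pipeline: the bag-of-words cosine retriever stands in for whatever embedding model and index the paper actually uses, the passages are made-up examples, and the prompt template and the names `retrieve` and `build_prompt` are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question (toy retriever)."""
    q = Counter(tokenize(question))
    ranked = sorted(passages, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:k]


def build_prompt(question, retrieved):
    """Inject retrieved passages into the prompt ahead of the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(retrieved))
    return (
        "Use the context to answer the question. Reason step by step.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Illustrative passages only; not drawn from CA-Ben or the paper's corpus.
PASSAGES = [
    "Depreciation under the straight-line method is cost minus residual value, divided by useful life.",
    "An auditor must obtain sufficient appropriate audit evidence before forming an opinion.",
    "Goodwill is recognised when purchase consideration exceeds net assets acquired.",
]

question = "How is straight-line depreciation computed from cost and useful life?"
top = retrieve(question, PASSAGES, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The resulting `prompt` string is what would be passed to the reasoning model; in this design nothing checks whether the retrieved passage is actually sufficient, which is the premise the review questions.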

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Frameworks like this could be tested on other specialized professional domains that combine calculation with regulatory knowledge.
The limitations on complex taxation texts suggest that additional techniques such as iterative retrieval or verification might be necessary for full coverage.
Adoption in small accounting practices could reduce dependence on expensive cloud-based AI services for exam preparation or compliance work.

Load-bearing premise

That automatically injecting retrieved passages into the prompt and relying on the model's built-in Chain-of-Thought will produce reliable answers for multi-step numerical and jurisdiction-specific regulatory questions.

What would settle it

A direct comparison on a dedicated set of complex taxation and regulatory questions where the system's SRC falls well below 68.75% of the proprietary models.
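
As a sketch of that comparison, the headline figure can be read as a ratio of two SRC values, and the test is whether the same ratio on a taxation-only subset falls well below it. The SRC formula is not given on this page, so it is treated here as an opaque score; all numbers, the 0.10 margin, and the function names are hypothetical.

```python
HEADLINE_RATIO = 0.6875  # the reported 68.75%


def relative_src(system_src: float, reference_src: float) -> float:
    """Ratio of the system's SRC to a reference model's SRC."""
    if reference_src <= 0:
        raise ValueError("reference SRC must be positive")
    return system_src / reference_src


def falls_well_below(ratio: float, headline: float = HEADLINE_RATIO, margin: float = 0.10) -> bool:
    """True if the ratio is more than `margin` below the headline (margin is an assumption)."""
    return ratio < headline - margin


# Hypothetical SRC values, chosen only so the overall ratio matches 68.75%.
overall = relative_src(11.0, 16.0)
# Hypothetical result on a dedicated taxation subset.
taxation_only = relative_src(4.0, 16.0)

print(f"overall ratio:  {overall:.4f}, well below headline: {falls_well_below(overall)}")
print(f"taxation ratio: {taxation_only:.4f}, well below headline: {falls_well_below(taxation_only)}")
```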

Figures

Figures reproduced from arXiv: 2605.00257 by Akhil Sharma, Ali Imam Abidi, Jatin Gupta, Saransh Singhania.

**Figure 2.** An overall comparison of models across each individual exam.

Original abstract

The inception of Large Language Models (LLMs) has catalyzed AI adoption in the finance sector, yet their reliability in complex, jurisdiction-specific tasks like Indian Chartered Accountancy (CA) remains limited. The models display difficulty in executing numerical tasks which require multiple steps while also needing advanced knowledge about legal regulations and the method of scaling their operations is not feasible in settings which have limited access to resources. We present CA-ThinkFlow as a parameter-efficient Retrieval-Augmented Generation (RAG) framework which operates with a 14B, 4-bit-quantized reasoning model, 14B-DeepSeek-R1, and a layout-aware Docling extraction system which maintains document structure during extraction. CA-ThinkFlow uses a basic RAG method which automatically adds retrieved information into the prompt, while it depends on the model's built-in Chain-of-Thought (CoT) functions to create context and produce correct answers. The system we developed operates at performance levels which match large proprietary models when we tested it on the multi-level CA-Ben benchmark, achieving Scholastic Reliability Coefficient (SRC) results which equal 68.75% of GPT-4o and Claude 3.5 Sonnet. The framework shows high efficiency and strength in handling parameters, but essential reasoning abilities fail to process complex regulatory texts which exist in fields such as Taxation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Applies basic RAG plus native CoT to a new Indian CA benchmark but the headline performance claim sits uneasily with the abstract's own admission of failures on complex regulatory material.

This paper takes standard retrieval-augmented generation and the model's built-in chain-of-thought and points them at Indian Chartered Accountancy exams. They release a new multi-level benchmark called CA-Ben and report that their CA-ThinkFlow setup, built around a 14B quantized DeepSeek model and Docling layout-aware extraction, reaches 68.75% of the Scholastic Reliability Coefficient that GPT-4o and Claude 3.5 Sonnet achieve on the same set.

The practical angle is the main draw: an open, low-resource system aimed at a professional certification domain that has seen little prior AI attention. Using automatic passage injection and preserving document structure during retrieval are reasonable engineering choices for exam materials that mix text, tables, and regulations. Creating the benchmark itself adds a concrete artifact that others could build on if it is released with clear documentation.

The soft spots are more substantial than the abstract suggests. The text itself notes that essential reasoning fails on complex regulatory texts in areas such as Taxation, yet the aggregate score is presented as the central result. Without a breakdown of question difficulty, test-set size, or error patterns, it is impossible to tell whether the score reflects reliable handling of the hardest material or mostly easier items. No statistical tests, comparison protocol details, or ablation on the retrieval step appear in the provided information. The methods are not new; this is domain application rather than algorithmic advance.

The work is aimed at readers interested in applied RAG for finance, law, or professional training. A reading group looking at domain-specific examples might find the benchmark useful to discuss, but the paper does not contain enough rigorous evaluation to stand on its own for core AI research. It deserves peer review so the benchmark construction and any additional analysis can be checked, though heavy revision on the evaluation section would be required.

Referee Report

2 major / 2 minor

Summary. The paper introduces CA-ThinkFlow, a parameter-efficient RAG framework that pairs a 14B 4-bit-quantized DeepSeek-R1 model with layout-aware Docling extraction. It claims that automatic passage injection plus the model's native Chain-of-Thought yields a Scholastic Reliability Coefficient (SRC) equal to 68.75% of GPT-4o and Claude 3.5 Sonnet on the multi-level CA-Ben benchmark for Indian Chartered Accountancy tasks, while noting that essential reasoning abilities still fail on complex regulatory texts such as those in Taxation.

Significance. If the performance numbers are substantiated with full evaluation details, the result would indicate that modest open-source models plus basic RAG can reach a non-trivial fraction of frontier-model reliability on jurisdiction-specific professional tasks. This would be relevant for resource-constrained deployments in regulated domains. The abstract's own caveat about failures on complex Taxation material, however, limits the claimed generality and reduces the practical significance until per-category breakdowns are supplied.

major comments (2)

[Abstract] The central claim that CA-ThinkFlow attains an SRC equal to 68.75% of GPT-4o and Claude 3.5 Sonnet is presented without any information on CA-Ben benchmark construction, test-set size, number of questions per difficulty level, statistical tests, error breakdown by task type (numerical vs. regulatory), or comparison protocol. Because this aggregate figure is the sole empirical support for the framework's effectiveness, the missing details are load-bearing and prevent assessment of whether the result is robust or representative.
[Abstract] The manuscript states that 'essential reasoning abilities fail to process complex regulatory texts which exist in fields such as Taxation' yet reports an aggregate SRC that is 68.75% of the strongest proprietary models. This creates an unresolved tension: it is unclear whether the benchmark contains few Taxation-style items, whether the failures are narrow enough not to affect the headline metric, or whether the reported figure masks unreliability on the hardest CA material. Per-category performance tables or question-type distributions are required to reconcile the two statements.
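
The tension in the second comment is easy to reproduce numerically. The sketch below shows how a question-weighted aggregate can sit at 68.75% while one small category performs far worse. Every number is invented for illustration, the category names other than Taxation are placeholders, and weighting by question count is an assumption, since the page does not say how the aggregate is formed.

```python
from dataclasses import dataclass


@dataclass
class CategoryResult:
    name: str
    n_questions: int
    relative_score: float  # system score divided by reference-model score


def aggregate(results):
    """Question-weighted mean of per-category relative scores (assumed weighting)."""
    total = sum(r.n_questions for r in results)
    return sum(r.n_questions * r.relative_score for r in results) / total


# Hypothetical per-category breakdown; not data from the paper.
results = [
    CategoryResult("Category A", 40, 0.80),
    CategoryResult("Category B", 30, 0.70),
    CategoryResult("Taxation", 10, 0.20),
]

overall = aggregate(results)
weakest = min(results, key=lambda r: r.relative_score)

print(f"aggregate relative score: {overall:.4f}")
print(f"weakest category: {weakest.name} at {weakest.relative_score:.2f}")
```

A per-category table of exactly this shape is what would let a reader tell a uniformly moderate system apart from one that is strong on easy categories and unreliable on Taxation.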

minor comments (2)

[Abstract] The sentence 'the method of scaling their operations is not feasible in settings which have limited access to resources' is grammatically awkward and should be rephrased for clarity.
[Abstract] The model is introduced as both '14B, 4-bit-quantized reasoning model, 14B-DeepSeek-R1' and '14B-DeepSeek-R1'; a single consistent name would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will implement to improve transparency and clarity.

Referee: [Abstract] The central claim that CA-ThinkFlow attains an SRC equal to 68.75% of GPT-4o and Claude 3.5 Sonnet is presented without any information on CA-Ben benchmark construction, test-set size, number of questions per difficulty level, statistical tests, error breakdown by task type (numerical vs. regulatory), or comparison protocol. Because this aggregate figure is the sole empirical support for the framework's effectiveness, the missing details are load-bearing and prevent assessment of whether the result is robust or representative.

Authors: We agree that the abstract would benefit from additional context on the CA-Ben benchmark to support the reported SRC. In the revised manuscript, we will expand the abstract to include a concise description of benchmark construction, test-set size, question distribution across difficulty levels, and the comparison protocol. We will also augment the Experiments section with statistical tests, error breakdowns by task type (numerical vs. regulatory), and full evaluation details to enable readers to assess robustness.

revision: yes

Referee: [Abstract] The manuscript states that 'essential reasoning abilities fail to process complex regulatory texts which exist in fields such as Taxation' yet reports an aggregate SRC that is 68.75% of the strongest proprietary models. This creates an unresolved tension: it is unclear whether the benchmark contains few Taxation-style items, whether the failures are narrow enough not to affect the headline metric, or whether the reported figure masks unreliability on the hardest CA material. Per-category performance tables or question-type distributions are required to reconcile the two statements.

Authors: We acknowledge the tension between the aggregate performance claim and the limitation statement on complex regulatory texts. The 68.75% SRC is an overall average across CA-Ben. In the revised manuscript, we will add per-category performance tables and question-type distributions in the Results section. These will detail the proportion of Taxation-style items, performance variance by category, and the impact of specific failures on the aggregate metric, thereby clarifying the scope of the reported result.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark comparison only

The paper introduces a RAG framework (CA-ThinkFlow) that injects retrieved passages into prompts for a 14B quantized model and relies on the model's native CoT. It reports an empirical SRC score on the external CA-Ben benchmark equal to 68.75% of GPT-4o/Claude 3.5 Sonnet. No equations, fitted parameters, self-definitional quantities, or load-bearing self-citations appear in the derivation. The central claim is a direct measurement against independent external models and therefore carries no internal reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes an applied engineering system and introduces no new mathematical parameters, axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5544 in / 1031 out tokens · 54228 ms · 2026-05-09T19:46:33.972814+00:00 · methodology

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

[1] Turing Institute. The impact of large language models in finance: Towards trustworthy adoption. Alan Turing Institute Report, 2024.
[2] Institute of International Finance and Ernst & Young. 2024 iif-ey annual survey report on ai/ml use in financial services. Technical report, IIF, 2024.
[3] Hans B. Christensen, Elizabeth Floyd, and Mark Maffett. Large language models and generative ai in finance. SSRN Electronic Journal, 2023.
[4] Hongwei Mo and Shumiao Ouyang. (generative) ai in financial economics. Journal of Chinese Economic and Business Studies, 23(4):509–587, October 2025.
[5] Huaxia Li and Miklos A. Vasarhelyi. Applying large language models in accounting: A comparative analysis of different methodologies and off-the-shelf examples, November 2023.
[6] Jacob Morrison, Clara Na, Jared Fernandez, Tim Dettmers, Emma Strubell, and Jesse Dodge. Holistically evaluating the environmental impact of creating language models. In The Thirteenth International Conference on Learning Representations (ICLR), 2025.
[7] Ahmad Faiz, Sotaro Kaneda, Ruhan Wang, Rita Chukwunyere Osi, Prateek Sharma, Fan Chen, and Lei Jiang. LLMCarbon: Modeling the end-to-end carbon footprint of large language models. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
[8] J. Gupta, A. Sharma, S. Singhania, M. Adnan, S. Deo, A. I. Abidi, and K. Gupta. Large language models acing chartered accountancy, 2025.
[9] Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzu Hao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, and Suhang Wang. A comprehensive survey of small language models in the era of large language models: Techniques, enhancements, applications, collaboration with llms, and trustworthiness. ACM Transactions on I...
[10] Karmvir Singh Phogat, Sai Akhil Puranam, Sridhar Dasaratha, Chetan Harsha, and Shashishekar Ramakrishna. Fine-tuning smaller language models for question answering over financial documents. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10528–10548, Miami, Florida,...
[11] OpenAI. o1 system card. Technical report, OpenAI, 2024.
[12] OpenAI. Openai o3 and o4-mini system card. Technical report, OpenAI, 2025.
[13] Zartis Team. Deepseek-r1: The open-source ai challenger rewriting the rules of enterprise ai. https://www.zartis.com/deepseek-r1-the-open-source-ai-challenger-rewriting-the-rules-of-enterprise-ai/, 2025.
[14] DataCamp. Fine-tuning deepseek r1 (reasoning model). https://www.datacamp.com/tutorial/fine-tuning-deepseek-r1-reasoning-model, 2025.
[15] Shanghai AI Laboratory. Fin-r1: A large language model for financial reasoning, 2025.
[16] Yifan Zhou, Peng Li, et al. Fincot: Grounding chain-of-thought in expert financial blueprints. In Proceedings of the 3rd Workshop on FinNLP, 2025.
[17] The FinAI Team. Fino1: A financial reasoning model. https://github.com/The-FinAI/Fino1, 2025. GitHub Repository.
[18] A. Nawal and S. Kumar. Fin-rag: A rag system for financial documents, 2024.
[19] J. Muñiz Sánchez. Rag-based system for document information retrieval in financial compliance. https://www.linkedin.com/pulse/rag-based-system-document-information-retrieval-muniz-sanchez-dk1bf, 2024.
[20] Auxilio Bits. Rag architecture for financial compliance knowledge retrieval. https://www.auxiliobits.com/blog/rag-architecture-for-domain-specific-knowledge-retrieval-in-financial-compliance/, 2025.
[21] Docling Team. Docling: The document alchemist, 2024.
[22] Michele Besso et al. Docling technical report. Technical report, IBM Research, August 2024.
[23] IBM Docling Team. docling-project/docling-models, 2025.
[24] Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. ConvFinQA: Exploring the chain of numerical reasoning in conversational finance question answering. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6279–6292, Ab...
[25] Yilun Zhao, Yitao Long, Hongjun Liu, Ryo Kamoi, Linyong Nan, Lyuhao Chen, Yixin Liu, Xiangru Tang, Rui Zhang, and Arman Cohan. DocMath-eval: Evaluating math reasoning capabilities of LLMs in understanding long and specialized documents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for ...
[26] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025.
[27] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. IEEE Transactions on Big Data, pages 1–17, 2025.
[28] DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025.
