arxiv: 2605.00257 · v1 · submitted 2026-04-30 · 💻 cs.CL · cs.AI · cs.IR

Retrieval-Augmented Reasoning for Chartered Accountancy

Pith reviewed 2026-05-09 19:46 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords Retrieval-Augmented Generation · Chartered Accountancy · Chain-of-Thought · RAG · Large Language Models · Indian CA · Benchmark

The pith

A retrieval-augmented 14B model reaches 68.75% of the Scholastic Reliability Coefficient of GPT-4o and Claude 3.5 Sonnet on chartered accountancy tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CA-ThinkFlow, a parameter-efficient retrieval-augmented generation framework for handling the numerical and regulatory demands of Indian chartered accountancy. It uses a 14B quantized reasoning model along with layout-aware document extraction to retrieve relevant passages and then applies the model's built-in chain-of-thought reasoning. This combination produces answers that achieve 68.75 percent of the Scholastic Reliability Coefficient of leading proprietary models on the CA-Ben benchmark. The approach matters for domains where access to large models is limited by cost or infrastructure, offering a lighter alternative for professional-level tasks. The authors acknowledge that the system still encounters difficulties with highly complex regulatory material such as taxation rules.

Core claim

CA-ThinkFlow is presented as a parameter-efficient Retrieval-Augmented Generation framework built on a 14B, 4-bit-quantized reasoning model (14B-DeepSeek-R1) and a layout-aware Docling extraction system that preserves document structure during extraction. It uses a basic RAG method that automatically adds retrieved information to the prompt and relies on the model's built-in Chain-of-Thought capability to build context and produce answers. On the multi-level CA-Ben benchmark, the system achieves Scholastic Reliability Coefficient results equal to 68.75% of GPT-4o and Claude 3.5 Sonnet.

What carries the argument

Basic RAG that automatically injects retrieved passages into the prompt, paired with the 14B model's native Chain-of-Thought reasoning and layout-aware document extraction.
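The "basic RAG" step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the toy term-overlap retriever stands in for whatever embedding index CA-ThinkFlow actually uses, and the names `retrieve`, `build_prompt`, and `PASSAGES`, the prompt wording, and the sample passages are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=2):
    """Return up to k passages ranked by term overlap with the question.

    Assumption: a real system would use dense embeddings; this toy
    retriever only keeps the example free of external dependencies.
    """
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(p))), i) for i, p in enumerate(passages)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [passages[i] for score, i in scored[:k] if score > 0]


def build_prompt(question, retrieved):
    """Inject the retrieved passages into the prompt ahead of the question."""
    context = "\n".join(f"[{n}] {p}" for n, p in enumerate(retrieved, start=1))
    return (
        "Use the reference passages to answer the question.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical passages, not taken from the paper's corpus.
PASSAGES = [
    "Depreciation under the straight-line method is cost minus residual value divided by useful life.",
    "An auditor must obtain sufficient appropriate audit evidence before forming an opinion.",
    "Goodwill is recorded when the purchase price exceeds the fair value of net assets acquired.",
]

question = "How is straight-line depreciation computed?"
top = retrieve(question, PASSAGES, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In this sketch the prompt is handed to the reasoning model unchanged, so any multi-step working comes from the model's own Chain-of-Thought rather than from the retrieval layer, which is the division of labour the summary attributes to the framework.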

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Frameworks like this could be tested on other specialized professional domains that combine calculation with regulatory knowledge.
  • The limitations on complex taxation texts suggest that additional techniques such as iterative retrieval or verification might be necessary for full coverage.
  • Adoption in small accounting practices could reduce dependence on expensive cloud-based AI services for exam preparation or compliance work.

Load-bearing premise

That automatically injecting retrieved passages into the prompt and relying on the model's built-in Chain-of-Thought will produce reliable answers for multi-step numerical and jurisdiction-specific regulatory questions.

What would settle it

A direct comparison on a dedicated set of complex taxation and regulatory questions where the system's SRC falls well below 68.75% of the proprietary models.
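As a rough sketch of how that test could be scored, the snippet below compares a system score with a reference score and checks whether the ratio sits well below the headline 68.75%. The SRC definition is not given on this page, so SRC is treated here as an opaque positive number; the function names, the `margin` of 0.10 used to operationalize "well below", and the example scores are assumptions for illustration only.

```python
def relative_src(system_src, reference_src):
    """Ratio of the system's score to the reference model's score."""
    if reference_src <= 0:
        raise ValueError("reference SRC must be positive")
    return system_src / reference_src


def falls_well_below(system_src, reference_src, headline=0.6875, margin=0.10):
    """True when the ratio is more than `margin` under the headline ratio.

    Assumption: `margin` is an arbitrary illustrative threshold; the page
    does not quantify what "well below" means.
    """
    return relative_src(system_src, reference_src) < headline - margin


# Hypothetical scores on a taxation-only subset, not results from the paper.
print(falls_well_below(0.30, 0.60))  # ratio 0.5, under the 0.5875 cutoff
print(falls_well_below(0.42, 0.60))  # ratio about 0.7, not under the cutoff
```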

Figures

Figures reproduced from arXiv: 2605.00257 by Akhil Sharma, Ali Imam Abidi, Jatin Gupta, Saransh Singhania.

Figure 1: Subject-wise accuracy comparison of all evaluated models across 14 CA-Ben domains (F1–FN6). Each bar …
Figure 2: An overall comparison of models across each individual exam.
Original abstract

The inception of Large Language Models (LLMs) has catalyzed AI adoption in the finance sector, yet their reliability in complex, jurisdiction-specific tasks like Indian Chartered Accountancy (CA) remains limited. The models display difficulty in executing numerical tasks which require multiple steps while also needing advanced knowledge about legal regulations and the method of scaling their operations is not feasible in settings which have limited access to resources. We present CA-ThinkFlow as a parameter-efficient Retrieval-Augmented Generation (RAG) framework which operates with a 14B, 4-bit-quantized reasoning model, 14B-DeepSeek-R1, and a layout-aware Docling extraction system which maintains document structure during extraction. CA-ThinkFlow uses a basic RAG method which automatically adds retrieved information into the prompt, while it depends on the model's built-in Chain-of-Thought (CoT) functions to create context and produce correct answers. The system we developed operates at performance levels which match large proprietary models when we tested it on the multi-level CA-Ben benchmark, achieving Scholastic Reliability Coefficient (SRC) results which equal 68.75% of GPT-4o and Claude 3.5 Sonnet. The framework shows high efficiency and strength in handling parameters, but essential reasoning abilities fail to process complex regulatory texts which exist in fields such as Taxation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CA-ThinkFlow, a parameter-efficient RAG framework that pairs a 14B 4-bit-quantized DeepSeek-R1 model with layout-aware Docling extraction. It claims that automatic passage injection plus the model's native Chain-of-Thought yields a Scholastic Reliability Coefficient (SRC) equal to 68.75% of GPT-4o and Claude 3.5 Sonnet on the multi-level CA-Ben benchmark for Indian Chartered Accountancy tasks, while noting that essential reasoning abilities still fail on complex regulatory texts such as those in Taxation.

Significance. If the performance numbers are substantiated with full evaluation details, the result would indicate that modest open-source models plus basic RAG can reach a non-trivial fraction of frontier-model reliability on jurisdiction-specific professional tasks. This would be relevant for resource-constrained deployments in regulated domains. The abstract's own caveat about failures on complex Taxation material, however, limits the claimed generality and reduces the practical significance until per-category breakdowns are supplied.

major comments (2)
  1. [Abstract] The central claim that CA-ThinkFlow attains an SRC equal to 68.75% of GPT-4o and Claude 3.5 Sonnet is presented without any information on CA-Ben benchmark construction, test-set size, number of questions per difficulty level, statistical tests, error breakdown by task type (numerical vs. regulatory), or comparison protocol. Because this aggregate figure is the sole empirical support for the framework's effectiveness, the missing details are load-bearing and prevent assessment of whether the result is robust or representative.
  2. [Abstract] The manuscript states that 'essential reasoning abilities fail to process complex regulatory texts which exist in fields such as Taxation' yet reports an aggregate SRC that is 68.75% of the strongest proprietary models. This creates an unresolved tension: it is unclear whether the benchmark contains few Taxation-style items, whether the failures are narrow enough not to affect the headline metric, or whether the reported figure masks unreliability on the hardest CA material. Per-category performance tables or question-type distributions are required to reconcile the two statements.

minor comments (2)
  1. [Abstract] The sentence 'the method of scaling their operations is not feasible in settings which have limited access to resources' is grammatically awkward and should be rephrased for clarity.
  2. [Abstract] The model is introduced as both '14B, 4-bit-quantized reasoning model, 14B-DeepSeek-R1' and '14B-DeepSeek-R1'; a single consistent name would improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will implement to improve transparency and clarity.

  1. Referee: [Abstract] The central claim that CA-ThinkFlow attains an SRC equal to 68.75% of GPT-4o and Claude 3.5 Sonnet is presented without any information on CA-Ben benchmark construction, test-set size, number of questions per difficulty level, statistical tests, error breakdown by task type (numerical vs. regulatory), or comparison protocol. Because this aggregate figure is the sole empirical support for the framework's effectiveness, the missing details are load-bearing and prevent assessment of whether the result is robust or representative.

    Authors: We agree that the abstract would benefit from additional context on the CA-Ben benchmark to support the reported SRC. In the revised manuscript, we will expand the abstract to include a concise description of benchmark construction, test-set size, question distribution across difficulty levels, and the comparison protocol. We will also augment the Experiments section with statistical tests, error breakdowns by task type (numerical vs. regulatory), and full evaluation details to enable readers to assess robustness. revision: yes

  2. Referee: [Abstract] The manuscript states that 'essential reasoning abilities fail to process complex regulatory texts which exist in fields such as Taxation' yet reports an aggregate SRC that is 68.75% of the strongest proprietary models. This creates an unresolved tension: it is unclear whether the benchmark contains few Taxation-style items, whether the failures are narrow enough not to affect the headline metric, or whether the reported figure masks unreliability on the hardest CA material. Per-category performance tables or question-type distributions are required to reconcile the two statements.

    Authors: We acknowledge the tension between the aggregate performance claim and the limitation statement on complex regulatory texts. The 68.75% SRC is an overall average across CA-Ben. In the revised manuscript, we will add per-category performance tables and question-type distributions in the Results section. These will detail the proportion of Taxation-style items, performance variance by category, and the impact of specific failures on the aggregate metric, thereby clarifying the scope of the reported result. revision: yes
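The tension raised in the second comment can be made concrete with a small sketch of how a question-weighted average may hide one weak category. Every number below is invented for illustration and is not a CA-Ben result; the category names, question counts, and the assumption that the aggregate is a question-weighted mean are likewise hypothetical.

```python
def aggregate(per_category):
    """Question-weighted mean over {category: (num_questions, score)}."""
    total = sum(n for n, _ in per_category.values())
    return sum(n * score for n, score in per_category.values()) / total


def without(per_category, name):
    """Copy of the breakdown with one category removed."""
    return {k: v for k, v in per_category.items() if k != name}


# Hypothetical breakdown: (number of questions, score in [0, 1]).
hypothetical = {
    "Accounting": (40, 0.75),
    "Audit": (40, 0.75),
    "Taxation": (20, 0.25),
}

overall = aggregate(hypothetical)
excluding_tax = aggregate(without(hypothetical, "Taxation"))
print(f"overall={overall:.2f}, excluding Taxation={excluding_tax:.2f}")
```

Under these made-up numbers, a Taxation score one third of the others moves the aggregate by only ten points, which is why a per-category table is needed before the headline figure can be read as representative.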

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark comparison only

The paper introduces a RAG framework (CA-ThinkFlow) that injects retrieved passages into prompts for a 14B quantized model and relies on the model's native CoT. It reports an empirical SRC score on the external CA-Ben benchmark equal to 68.75% of GPT-4o/Claude 3.5 Sonnet. No equations, fitted parameters, self-definitional quantities, or load-bearing self-citations appear in the derivation. The central claim is a direct measurement against independent external models and therefore carries no internal reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes an applied engineering system and introduces no new mathematical parameters, axioms, or postulated entities.
