FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models

Alexey Khoroshilov; Andrei Kalmykov; Denis Kokosinskii; Dmitry Stanishevskii; Dmitry Zmitrovich; Nini Kamkia; Zhirayr Hayrapetyan

FINESSE-Bench introduces eight benchmarks with 3,993 questions to evaluate LLMs on financial competencies in increasing levels of difficulty.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-25 05:43 UTC pith:TM7BM77K

Load-bearing objection: FINESSE-Bench adds a new mix of certification-inspired financial tasks but supplies no results or difficulty checks in the text.

arxiv 2605.15482 v2 pith:TM7BM77K submitted 2026-05-14 cs.CL

Dmitry Stanishevskii, Nini Kamkia, Alexey Khoroshilov, Dmitry Zmitrovich, Denis Kokosinskii, Zhirayr Hayrapetyan, Andrei Kalmykov

classification cs.CL

keywords financial benchmarks, LLM evaluation, hierarchical assessment, professional certifications, technical analysis, domain knowledge, trading tasks, open benchmarks

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FINESSE-Bench as a new suite for testing large language models on financial domain knowledge and technical analysis. It assembles eight benchmarks drawn from certification-style exams at multiple levels, trading tasks, and an olympiad collection to track how models perform as professional complexity rises. The design fills a gap left by earlier financial QA resources that lack explicit difficulty ordering. A single evaluation protocol covers multiple-choice items, numerical answers, and short responses scored by an automated LLM judge. The result is a tool meant to measure both breadth of knowledge and degradation in capability at higher tiers.

Core claim

FINESSE-Bench is a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of financial competencies in LLMs that combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark.

What carries the argument

FINESSE-Bench, the hierarchical benchmark suite that organizes questions by levels of professional difficulty drawn from certification exams and trading tasks.

Load-bearing premise

The chosen exam-inspired datasets and trading tasks form a valid hierarchy that accurately reflects increasing professional difficulty in finance.

What would settle it

If model accuracy fails to decline consistently across the claimed difficulty levels or if benchmark scores show no relation to performance on actual financial analysis tasks, the hierarchy would not be supported.
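
The first half of this test can be sketched as a monotonicity check over per-level accuracy. This is a minimal illustration, not the paper's procedure: the function name `declines_consistently`, the `tolerance` parameter, and the accuracy values are hypothetical placeholders, not reported results.

```python
from typing import Sequence


def declines_consistently(accuracies: Sequence[float], tolerance: float = 0.0) -> bool:
    """Return True if accuracy never rises by more than `tolerance`
    from one difficulty level to the next (levels ordered easy -> hard)."""
    return all(
        later <= earlier + tolerance
        for earlier, later in zip(accuracies, accuracies[1:])
    )


# Made-up accuracies for three ordered levels; NOT results from the paper.
supports_hierarchy = declines_consistently([0.80, 0.65, 0.50])
contradicts_hierarchy = declines_consistently([0.80, 0.50, 0.65])
print(supports_hierarchy, contradicts_hierarchy)
```

A nonzero `tolerance` lets small upticks between adjacent levels pass as noise; choosing that threshold is itself a judgment call that a real analysis would need to justify.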

If this is right

Models can be scored for breadth of financial domain knowledge.
Performance can be tracked for degradation as question difficulty rises.
Computational trading tasks can be isolated for separate measurement.
Specialized domains such as technical analysis receive dedicated coverage.
The suite serves as a complement to existing financial QA benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The levels could guide targeted fine-tuning to close specific gaps in financial reasoning.
Adding parallel versions in other languages would test whether the hierarchy generalizes beyond English and Russian.
Combining the suite with report-based QA sets might produce a single end-to-end financial capability measure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

FINESSE-Bench adds a new mix of certification-inspired financial tasks but supplies no results or difficulty checks in the text.

The main point is that this paper introduces FINESSE-Bench, a set of eight benchmarks with 3,993 questions that pulls together CFA-style Levels 1-3, CMT-style Level 2, CFTe-style Level 1, trading tasks, and a Russian olympiad set. That specific combination is the new element compared with the QA-heavy resources like FinQA or FinanceBench that the abstract cites. The design tries to let people track how models handle increasing professional-style difficulty, plus computational tasks and non-English material. The unified scoring protocol for multiple-choice, numerical, and short open answers, including the LLM-as-judge piece, is laid out clearly enough to be usable. That part is straightforward and practical.

The soft spot is exactly what the stress-test flags: the hierarchy is asserted from the source certifications, but the abstract gives no data on whether the levels actually order by difficulty. There are no model scores, no human expert calibration, and no inter-level gradients. Without that, the claim reduces to a collection of separate tasks rather than a validated progression.

The paper is aimed at people who evaluate LLMs for finance work and want something beyond report QA. A reader building domain benchmarks or testing professional competence would find the task list worth examining. It deserves a serious referee because the field needs more structured financial evaluation resources, and the design choices can be discussed even if the current version is mostly descriptive.

Referee Report

2 major / 1 minor

Summary. The manuscript presents FINESSE-Bench, a suite of eight specialized benchmarks totaling 3,993 questions for the hierarchical evaluation of financial competencies in LLMs. It combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark, together with a unified evaluation protocol for multiple-choice, numerical, and short open-ended responses that includes an LLM-as-judge automated scoring scheme.
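
To make the three answer formats concrete, here is a minimal sketch of how a unified scorer might dispatch on question type. It is an assumption-laden illustration, not the paper's protocol: the names `score_item`, `score_numerical`, and `stub_judge`, the 1% relative tolerance, and the kind labels are all hypothetical, and the judge is a plain string-matching stand-in rather than an LLM call.

```python
import math
from typing import Callable, Union

# A judge takes (prediction, reference) and returns a pass/fail verdict.
Judge = Callable[[str, str], bool]


def score_multiple_choice(prediction: str, gold: str) -> bool:
    """Exact match on the option label, ignoring case and outer whitespace."""
    return prediction.strip().upper() == gold.strip().upper()


def score_numerical(prediction: str, gold: float, rel_tol: float = 0.01) -> bool:
    """Parse the prediction as a number and compare within a relative tolerance.

    The 1% default is an illustrative assumption, not a value from the paper.
    """
    try:
        value = float(prediction.replace(",", "").strip())
    except ValueError:
        return False  # unparseable answers count as incorrect
    return math.isclose(value, gold, rel_tol=rel_tol)


def score_item(kind: str, prediction: str, gold: Union[str, float], judge: Judge) -> bool:
    """Route one item to the scorer that matches its answer format."""
    if kind == "multiple_choice":
        return score_multiple_choice(prediction, str(gold))
    if kind == "numerical":
        return score_numerical(prediction, float(gold))
    if kind == "open_ended":
        return judge(prediction, str(gold))
    raise ValueError(f"unknown question kind: {kind}")


def stub_judge(prediction: str, reference: str) -> bool:
    """Hypothetical stand-in for an LLM judge: case-insensitive substring match."""
    return reference.lower() in prediction.lower()
```

Passing the judge in as a callable keeps the deterministic scorers testable on their own and makes it easy to swap judge models when checking agreement.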

Significance. If the hierarchy is shown to order items by increasing professional difficulty and the evaluation protocol proves reliable, the benchmark would address a noted gap in existing financial QA resources by enabling measurement of performance degradation across difficulty levels and domain breadth. The scale of the constructed dataset (3,993 questions) and the explicit multi-level design constitute a concrete contribution that could support more targeted assessment of LLM financial reasoning.

major comments (2)

[Abstract and benchmark design description] The central claim that the eight components constitute a valid hierarchy reflecting increasing professional difficulty rests solely on inspiration from the source certifications; no difficulty calibration, expert review of item difficulty, inter-level performance gradients, or pilot validation is supplied to support the ordering. This assumption is load-bearing for the hierarchical evaluation claim.
[Evaluation protocol section] The LLM-as-judge paradigm for scoring freeform answers is introduced without any reported validation against human expert judgments or inter-annotator agreement metrics, which is required to establish reliability for the open-ended component of the protocol.
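
One common agreement metric for the second comment is Cohen's kappa, which corrects raw agreement for chance. The sketch below assumes a subset of open-ended items labeled pass/fail by both the judge model and a human expert; the function name and the example labels are hypothetical, and the paper does not say which metric, if any, the authors would use.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Made-up verdicts (1 = correct, 0 = incorrect); NOT data from the paper.
judge_labels = [1, 1, 1, 0]
human_labels = [1, 1, 0, 0]
kappa = cohens_kappa(judge_labels, human_labels)
```

The same function applies unchanged to two judge models scored against each other, although model-to-model agreement does not by itself show agreement with human experts.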

minor comments (1)

A breakdown table showing the number of questions contributed by each of the eight components would clarify the composition of the reported total of 3,993 questions.
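
Such a table could be generated directly from the item records, as in the small sketch below. The component labels are toy placeholders chosen for illustration; the review does not give per-component counts, so none are assumed here.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def composition_table(component_labels: Iterable[str]) -> List[Tuple[str, int]]:
    """Count items per component and append a TOTAL row for cross-checking."""
    counts = Counter(component_labels)
    rows = sorted(counts.items())
    rows.append(("TOTAL", sum(counts.values())))
    return rows


# Toy labels, one per question; hypothetical and far smaller than the real suite.
toy_labels = ["level_1", "level_2", "level_1", "trading", "level_1"]
for name, count in composition_table(toy_labels):
    print(f"{name:<10} {count:>5}")
```

On the real data, the TOTAL row would be the figure to compare against the reported 3,993 questions.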

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for these focused comments on the hierarchical design and evaluation protocol. We address each point directly below and indicate where revisions will be made to improve clarity and transparency without altering the core contribution.

Referee: [Abstract and benchmark design description] The central claim that the eight components constitute a valid hierarchy reflecting increasing professional difficulty rests solely on inspiration from the source certifications; no difficulty calibration, expert review of item difficulty, inter-level performance gradients, or pilot validation is supplied to support the ordering. This assumption is load-bearing for the hierarchical evaluation claim.

Authors: The hierarchical ordering is deliberately inherited from the established, industry-validated difficulty progressions of the source certifications (CFA Levels 1–3, CMT Level 2, CFTe Level 1), which are constructed by professional bodies with documented curricula and examination standards. Our datasets are aligned to these curricula rather than independently re-calibrated. We did not perform additional expert difficulty rating or pilot studies, as the primary goal was to aggregate and standardize existing professional-level materials. To address the concern, the revised manuscript will add an explicit subsection in the benchmark design section that (a) states the reliance on certification structures, (b) reports any observed performance degradation across levels from the experiments already conducted, and (c) notes the absence of independent item-level validation as a limitation. This clarifies the claim without overstating empirical support for the ordering itself.

revision: yes

Referee: [Evaluation protocol section] The LLM-as-judge paradigm for scoring freeform answers is introduced without any reported validation against human expert judgments or inter-annotator agreement metrics, which is required to establish reliability for the open-ended component of the protocol.

Authors: We agree that reliability evidence for the LLM-as-judge component is necessary for the open-ended items. The current manuscript describes the prompting and scoring procedure but does not include human validation or agreement statistics. In the revision we will add a dedicated paragraph in the evaluation protocol section that (i) acknowledges this gap, (ii) reports any internal consistency checks already performed (e.g., agreement between two different judge models on a subset), and (iii) states plans for future human-expert validation. Because a full inter-annotator study would require new data collection beyond the scope of the present work, we treat this as a transparent limitation rather than a completed validation.

revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark construction with no derivations or self-referential reductions

The paper presents FINESSE-Bench as a collection of eight datasets (CFA-like L1-3, CMT-like L2, CFTe-like L1, trading tasks, Russian olympiad) chosen by inspiration from existing certifications. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The hierarchy claim is an assertion of design intent rather than a result derived from the paper's own inputs or prior self-citations. This matches the default case of a self-contained dataset/protocol paper with no load-bearing steps that reduce by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is a benchmark-construction paper whose central contribution is the assembly of existing-style datasets into a new hierarchy; no free parameters, new physical entities, or unproven mathematical axioms are introduced.

axioms (1)

domain assumption: LLM-as-judge provides reliable automated scoring for short open-ended financial answers.
Invoked in the description of the unified evaluation protocol for freeform responses.

pith-pipeline@v0.9.0 · 5866 in / 1180 out tokens · 30309 ms · 2026-05-25T05:43:01.532473+00:00 · methodology

Original abstract

Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of financial competencies in LLMs. FINESSE-Bench combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design enables evaluation of domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains. We also describe a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, together with an automated scoring scheme for freeform answers based on the LLM-as-judge paradigm. FINESSE-Bench is intended both as a complement to existing open financial benchmarks and as a tool for more substantive evaluation of professionally relevant financial competencies in large language models.

Figures

Figures reproduced from arXiv: 2605.15482 by Alexey Khoroshilov, Andrei Kalmykov, Denis Kokosinskii, Dmitry Stanishevskii, Dmitry Zmitrovich, Nini Kamkia, Zhirayr Hayrapetyan.

**Figure 2.** Within-family scaling results for reasoning-oriented models from the Qwen3 family.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 5 internal anchors

[1] Zhiyu Chen, Wenhu Chen, et al. FinQA: A dataset of numerical reasoning over financial data. arXiv preprint arXiv:2109.00122, 2021. URL https://arxiv.org/abs/2109.00122

[2] Zhiyu Chen, Shiyang Li, et al. ConvFinQA: Exploring the chain of numerical reasoning in conversational finance question answering. arXiv preprint arXiv:2210.03849, 2022. URL https://arxiv.org/abs/2210.03849

[3] Fengbin Zhu, Wenqiang Lei, et al. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. arXiv preprint arXiv:2105.07624, 2021. URL https://arxiv.org/abs/2105.07624

[4] Pranab Islam, Anand Kannappan, et al. FinanceBench: A new benchmark for financial question answering. arXiv preprint arXiv:2311.11944, 2023. URL https://arxiv.org/abs/2311.11944

[5] Qianqian Xie, Weiguang Han, et al. PIXIU: A large language model, instruction data and evaluation benchmark for finance. arXiv preprint arXiv:2306.05443, 2023. URL https://arxiv.org/abs/2306.05443

[6] Qianqian Xie, Weiguang Han, et al. FinBen: A holistic financial benchmark for large language models. arXiv preprint arXiv:2402.12659, 2024. URL https://arxiv.org/abs/2402.12659

[7] Glenn Matlin, Mika Okamoto, et al. Finance language model evaluation (FLaME). arXiv preprint arXiv:2506.15846, 2025. URL https://arxiv.org/abs/2506.15846

[8] Lianmin Zheng, Wei-Lin Chiang, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023. URL https://arxiv.org/abs/2306.05685

[9] Tianle Li, Wei-Lin Chiang, et al. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. arXiv preprint arXiv:2406.11939, 2024. URL https://arxiv.org/abs/2406.11939

[11] URL https://arxiv.org/abs/2506.05828

[12] Zhaowei Liu, Xin Guo, et al. Fin-R1: A large language model for financial reasoning through reinforcement learning. arXiv preprint arXiv:2503.16252, 2025. URL https://arxiv.org/abs/2503.16252

[13] LMArena. arena-hard-auto. URL https://github.com/lmarena/arena-hard-auto. GitHub repository, accessed April 2026

[14] Dan Hendrycks, Collin Burns, et al. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2021. URL https://arxiv.org/abs/2009.03300

[15] Haonan Li, Yixuan Zhang, et al. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2024. URL https://arxiv.org/abs/2306.09212

[16] Narun Raman, Taylor Lundy, and Kevin Leyton-Brown. Reasoning models are test exploiters: Rethinking multiple choice. arXiv preprint arXiv:2507.15337, 2025. URL https://arxiv.org/abs/2507.15337v1
