FinanceBench: A New Benchmark for Financial Question Answering

Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, Bertie Vidgen · 2023 · cs.CL · arXiv 2311.11944

27 Pith papers cite this work. Polarity classification is still indexing.

abstract

FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open-book financial question answering (QA). It comprises 10,231 questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are intended to be clear-cut and straightforward to answer, so that they serve as a minimum performance standard. We test 16 state-of-the-art model configurations (including GPT-4-Turbo, Llama2 and Claude2, with vector stores and long context prompts) on a sample of 150 cases from FinanceBench, and manually review their answers (n=2,400). The cases are available open-source. We show that existing LLMs have clear limitations for financial QA. Notably, GPT-4-Turbo used with a retrieval system incorrectly answered or refused to answer 81% of questions. While augmentation techniques such as using a longer context window to feed in relevant evidence improve performance, they are unrealistic for enterprise settings due to increased latency and cannot support larger financial documents. We find that all models examined exhibit weaknesses, such as hallucinations, that limit their suitability for use by enterprises.

citation-role summary

background 2 · other 1

citation-polarity summary

background 2 · unclear 1

representative citing papers

IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text

cs.CL · 2026-04-21 · accept · novelty 8.0

IndiaFinBench is the first public benchmark for LLMs on Indian financial regulatory text, with twelve models scoring 70.4-89.7% accuracy and all outperforming a 69% human baseline.

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17 · unverdicted · novelty 7.0

New benchmark evaluates three frontier deep research agents on 42 SME prompts with verifiers and rubrics, reporting low acceptance rates of 9.5-21.4% and agent-specific failure modes.

BizCompass: Benchmarking the Reasoning Capabilities of LLMs in Business Knowledge and Applications

cs.CE · 2026-04-19 · unverdicted · novelty 7.0

BizCompass is a dual-axis benchmark evaluating LLMs on business knowledge in finance, economics, statistics, and operations management, linked to analyst, trader, and consultant roles, with public datasets released after testing open and commercial models.

Crowded in B-Space: Calibrating Shared Directions for LoRA Merging

cs.CL · 2026-04-18 · unverdicted · novelty 7.0

Pico reduces LoRA merge interference by calibrating over-shared directions in the B matrix before merging, yielding 3.4-8.3 point accuracy gains and sometimes beating joint training.

Credo: Declarative Control of LLM Pipelines via Beliefs and Policies

cs.AI · 2026-04-15 · unverdicted · novelty 7.0

Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.

FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

cs.AI · 2026-04-11 · unverdicted · novelty 7.0

FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.

FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments

cs.CL · 2025-06-09 · unverdicted · novelty 7.0

EconWebArena is a new benchmark with 360 curated economic tasks across 82 authoritative websites for evaluating multimodal web agents on navigation, grounding, and data extraction.

Design and Report Benchmarks for Knowledge Work

cs.AI · 2026-05-22 · unverdicted · novelty 6.0

Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.

FinDocMRE: A Benchmark for Document-Level Financial Multimodal Reasoning Evaluation

cs.CE · 2026-05-18 · unverdicted · novelty 6.0

FinDocMRE is a new multi-image document-level benchmark spanning 12 financial domains and 5 task types, showing that 11 tested LMMs all score below 65 overall with particular weaknesses in numerical estimation and cross-page grounding.

FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models

cs.CL · 2026-05-14 · unverdicted · novelty 6.0 · 2 refs

FINESSE-Bench is a new hierarchical benchmark suite combining certification-style exams, trading tasks, and a Russian olympiad set to evaluate LLMs on financial competencies at multiple difficulty levels.

Training Transformers for KV Cache Compressibility

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.

LATTICE: Evaluating Decision Support Utility of Crypto Agents

cs.CR · 2026-04-29 · unverdicted · novelty 6.0

LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.

The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications

cs.AI · 2026-04-27 · unverdicted · novelty 6.0

LLMs show low sycophancy to direct contradictions in financial tasks but high sycophancy to user preference contradictions, with input filtering as one recovery approach.

SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics

cs.SE · 2026-04-06 · unverdicted · novelty 6.0

SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.

FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting

cs.CL · 2026-02-25 · unverdicted · novelty 6.0

FinReasoning is a hierarchical benchmark that decomposes LLM financial research capabilities into semantic consistency, data alignment, and deep insight, revealing model-type differences in auditing versus insight generation.

Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

cs.AI · 2025-12-15 · unverdicted · novelty 6.0

Finch is a new benchmark with 172 composite workflows and 384 tasks from real enterprise data that shows top AI models like GPT-5.1 Pro pass only 38.4% of workflows under human evaluation.

Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented Generation

cs.LG · 2026-05-16 · unverdicted · novelty 5.0

Presents T3+OCSVM detector for privacy policy enforcement in RAG achieving 0.93+ borderline AUROC, 44-55 point false positive reduction, and millisecond latency via synthetic data stress tests.

Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

PBKV predicts agent invocations in dynamic LLM workflows to manage KV-cache reuse, delivering up to 1.85x speedup over LRU and 1.26x over KVFlow.

AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.

Adaptive Query Routing: A Tier-Based Framework for Hybrid Retrieval Across Financial, Legal, and Medical Documents

cs.IR · 2026-04-14 · conditional · novelty 5.0

Tree reasoning outperforms vector search on complex document queries but a hybrid approach balances results across tiers, with validation showing an 11.7-point gap on real finance documents.

Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

cs.CL · 2026-04-13 · unverdicted · novelty 5.0

Systematic tests show that specific PDF parsers combined with overlapping chunking strategies better preserve structure and improve RAG answer correctness on financial QA benchmarks including the new TableQuest dataset.

Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks

cs.CL · 2026-05-10 · unverdicted · novelty 4.0

Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.

Architecture Matters More Than Scale: A Comparative Study of Retrieval and Memory Augmentation for Financial QA Under SME Compute Constraints

cs.IR · 2026-04-20 · unverdicted · novelty 4.0 · 2 refs

Structured memory improves precision on deterministic financial calculations while retrieval-augmented generation outperforms in conversational settings, supporting a hybrid deployment framework for resource-constrained SMEs.

citing papers explorer

Showing 27 of 27 citing papers.

IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text

cs.CL · 2026-04-21 · accept · none · ref 4 · internal anchor

IndiaFinBench is the first public benchmark for LLMs on Indian financial regulatory text, with twelve models scoring 70.4-89.7% accuracy and all outperforming a 69% human baseline.

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17 · unverdicted · none · ref 8 · internal anchor

New benchmark evaluates three frontier deep research agents on 42 SME prompts with verifiers and rubrics, reporting low acceptance rates of 9.5-21.4% and agent-specific failure modes.

BizCompass: Benchmarking the Reasoning Capabilities of LLMs in Business Knowledge and Applications

cs.CE · 2026-04-19 · unverdicted · none · ref 4 · internal anchor

BizCompass is a dual-axis benchmark evaluating LLMs on business knowledge in finance, economics, statistics, and operations management, linked to analyst, trader, and consultant roles, with public datasets released after testing open and commercial models.

Crowded in B-Space: Calibrating Shared Directions for LoRA Merging

cs.CL · 2026-04-18 · unverdicted · none · ref 13 · internal anchor

Pico reduces LoRA merge interference by calibrating over-shared directions in the B matrix before merging, yielding 3.4-8.3 point accuracy gains and sometimes beating joint training.

Credo: Declarative Control of LLM Pipelines via Beliefs and Policies

cs.AI · 2026-04-15 · unverdicted · none · ref 4 · internal anchor

Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.

FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

cs.AI · 2026-04-11 · unverdicted · none · ref 3 · internal anchor

FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.

FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

cs.CL · 2026-04-07 · unverdicted · none · ref 13 · internal anchor

FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments

cs.CL · 2025-06-09 · unverdicted · none · ref 6 · internal anchor

EconWebArena is a new benchmark with 360 curated economic tasks across 82 authoritative websites for evaluating multimodal web agents on navigation, grounding, and data extraction.

Design and Report Benchmarks for Knowledge Work

cs.AI · 2026-05-22 · unverdicted · none · ref 87 · internal anchor

Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.

FinDocMRE: A Benchmark for Document-Level Financial Multimodal Reasoning Evaluation

cs.CE · 2026-05-18 · unverdicted · none · ref 28 · internal anchor

FinDocMRE is a new multi-image document-level benchmark spanning 12 financial domains and 5 task types, showing that 11 tested LMMs all score below 65 overall with particular weaknesses in numerical estimation and cross-page grounding.

FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models

cs.CL · 2026-05-14 · unverdicted · none · ref 4 · 2 links · internal anchor

FINESSE-Bench is a new hierarchical benchmark suite combining certification-style exams, trading tasks, and a Russian olympiad set to evaluate LLMs on financial competencies at multiple difficulty levels.

Training Transformers for KV Cache Compressibility

cs.LG · 2026-05-07 · unverdicted · none · ref 23 · 2 links · internal anchor

Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.

LATTICE: Evaluating Decision Support Utility of Crypto Agents

cs.CR · 2026-04-29 · unverdicted · none · ref 11 · internal anchor

LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.

The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications

cs.AI · 2026-04-27 · unverdicted · none · ref 6 · internal anchor

LLMs show low sycophancy to direct contradictions in financial tasks but high sycophancy to user preference contradictions, with input filtering as one recovery approach.

SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics

cs.SE · 2026-04-06 · unverdicted · none · ref 20 · internal anchor

SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.

FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting

cs.CL · 2026-02-25 · unverdicted · none · ref 15 · internal anchor

FinReasoning is a hierarchical benchmark that decomposes LLM financial research capabilities into semantic consistency, data alignment, and deep insight, revealing model-type differences in auditing versus insight generation.

Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

cs.AI · 2025-12-15 · unverdicted · none · ref 28 · internal anchor

Finch is a new benchmark with 172 composite workflows and 384 tasks from real enterprise data that shows top AI models like GPT-5.1 Pro pass only 38.4% of workflows under human evaluation.

Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented Generation

cs.LG · 2026-05-16 · unverdicted · none · ref 46 · internal anchor

Presents T3+OCSVM detector for privacy policy enforcement in RAG achieving 0.93+ borderline AUROC, 44-55 point false positive reduction, and millisecond latency via synthetic data stress tests.

Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management

cs.LG · 2026-05-07 · unverdicted · none · ref 23 · internal anchor

PBKV predicts agent invocations in dynamic LLM workflows to manage KV-cache reuse, delivering up to 1.85x speedup over LRU and 1.26x over KVFlow.

AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases

cs.AI · 2026-05-07 · unverdicted · none · ref 11 · internal anchor

AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.

Adaptive Query Routing: A Tier-Based Framework for Hybrid Retrieval Across Financial, Legal, and Medical Documents

cs.IR · 2026-04-14 · conditional · none · ref 51 · internal anchor

Tree reasoning outperforms vector search on complex document queries but a hybrid approach balances results across tiers, with validation showing an 11.7-point gap on real finance documents.

Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

cs.CL · 2026-04-13 · unverdicted · none · ref 20 · internal anchor

Systematic tests show that specific PDF parsers combined with overlapping chunking strategies better preserve structure and improve RAG answer correctness on financial QA benchmarks including the new TableQuest dataset.

Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks

cs.CL · 2026-05-10 · unverdicted · none · ref 26 · internal anchor

Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.

Architecture Matters More Than Scale: A Comparative Study of Retrieval and Memory Augmentation for Financial QA Under SME Compute Constraints

cs.IR · 2026-04-20 · unverdicted · none · ref 15 · 2 links · internal anchor

Structured memory improves precision on deterministic financial calculations while retrieval-augmented generation outperforms in conversational settings, supporting a hybrid deployment framework for resource-constrained SMEs.

Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout

cs.CR · 2026-04-10 · unverdicted · none · ref 21 · internal anchor

FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.

Bridging Language Models and Financial Analysis

q-fin.ST · 2025-03-14 · unverdicted · none · ref 42 · internal anchor

A survey synthesizing recent LLM research and assessing its applicability to financial data analysis.

MetaGraph: A Large-Scale Meta-Analysis of GenAI in Financial NLP (2022-2025)

cs.CL · 2025-09-11 · unreviewed · ref 25 · internal anchor
