Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings

Ali Maatouk; Bing Xiang; Eftychia Makri; Eliot Brenner; Jialin Chen; Junrong Chen; Leandros Tassiulas; Peiwen Li; Rex Ying; Yidong Jiang

arxiv: 2602.07294 · v4 · pith:HOHOWK4Unew · submitted 2026-02-07 · 💻 cs.CE · cs.AI

Yidong Jiang, Junrong Chen, Eftychia Makri, Jialin Chen, Peiwen Li, Ali Maatouk, Leandros Tassiulas, Eliot Brenner, Bing Xiang, Rex Ying

keywords: llms, reasoning, benchmark, benchmarks, across, analysis, comparison, context

As Large Language Models (LLMs) are deployed more widely in the finance domain, they are increasingly expected to parse complex regulatory disclosures. However, existing benchmarks often focus on isolated details and fail to reflect the complexity of professional analysis, which requires synthesizing information across multiple documents, reporting periods, and corporate entities. Furthermore, these benchmarks do not disentangle whether errors arise from retrieval failures, generation inaccuracies, domain-specific reasoning mistakes, or misinterpretation of the query or context, making it difficult to precisely diagnose performance bottlenecks. To bridge these gaps, we introduce Fin-RATE, a benchmark built on U.S. Securities and Exchange Commission (SEC) filings that mirrors financial analyst workflows through three pathways: detail-oriented reasoning within individual disclosures, cross-entity comparison under shared topics, and longitudinal tracking of the same firm across reporting periods. We benchmark 17 leading LLMs, spanning open-source, closed-source, and finance-specialized models, under both ground-truth context and retrieval-augmented settings. Results show substantial performance degradation, with accuracy dropping by 18.60% and 14.35% as tasks shift from single-document reasoning to longitudinal and cross-entity analysis. This degradation is associated with increased comparison hallucinations and with temporal and entity mismatches, and it is further reflected in declines in reasoning quality and factual consistency. These are limitations that existing benchmarks have yet to formally categorize or quantify.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MacroLens: A Multi-Task Benchmark for Contextual Financial Reasoning under Macroeconomic Scenarios
cs.LG 2026-06 unverdicted novelty 7.0

MacroLens is a point-in-time multi-signal benchmark dataset and seven tasks for evaluating contextual financial reasoning models under macroeconomic scenarios.

TRACE: Tourism Recommendation with Accountable Citation Evidence
cs.IR 2026-05 unverdicted novelty 7.0

TRACE is a new benchmark dataset and evaluation suite for conversational tourism recommenders that requires systems to suggest POIs, cite verifiable review spans, and recover from rejections, revealing a Three-Compete...