pith. sign in

arxiv: 2606.13100 · v1 · pith:27WCDVT3new · submitted 2026-06-11 · 💻 cs.CL

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

Pith reviewed 2026-06-27 07:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords LEDGER datasetlong-context benchmarkcorporate annual reportsfinancial KPI extractiondocument retrievalLLM evaluationOCR annotationsearnings reports
0
0 comments X

The pith

LEDGER supplies 4,999 full corporate annual reports labeled with 31 financial KPIs to support three grounded long-context benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases LEDGER, a corpus of 4,999 digitized corporate annual reports each labeled with 31 consolidated financial KPIs extracted from full documents containing figures, tables, and narrative. From this data the authors derive three benchmarks that range from page-level retrieval over 118,048 natural-language questions to conversational single-value lookup and complete KPI extraction. A sympathetic reader would care because current financial resources are mostly limited to plain-text SEC filings with few QA pairs, leaving long-context capabilities of language models untested on numerically dense, multi-page reports. The release also includes human OCR-quality annotations with inter-annotator agreement and a toolchain for scoring, plus a case study that links CEO-letter rhetoric to market reactions after earnings dates.

Core claim

We release LEDGER (Long-context Evaluation of Documents for Grounded Extraction and Retrieval), a corpus of 4,999 digitized corporate annual reports labeled with 31 consolidated financial KPIs to be extracted and linked to the market's reaction at the earnings date, from which we derive three evaluation benchmarks spanning the difficulty spectrum: a pure page-level KPI retrieval task with TREC-style relevance judgments over 118,048 questions in natural language, a conversational needle-in-a-haystack single-value lookup, and a full KPI extraction task, both from long, numerically dense reports.

What carries the argument

The LEDGER corpus of labeled annual reports together with the derived page-retrieval, needle-in-a-haystack, and full-extraction benchmarks that turn the 31 KPIs into ground-truth targets for long-context evaluation.

If this is right

  • Models can now be scored on retrieving the correct page for any of 118,048 natural-language questions about the 31 KPIs.
  • The needle-in-a-haystack task directly measures whether a model can surface a single numeric value from a full multi-page report given a conversational query.
  • The full extraction task tests end-to-end recovery of all 31 KPIs from complete, figure-rich documents.
  • Human OCR annotations supply an explicit quality baseline for any automated extraction pipeline.
  • The market-reaction case study supplies an additional downstream signal that can be correlated with extracted CEO-letter content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could be used to test whether models that perform well on page retrieval also maintain accuracy when the retrieved pages must be combined for full KPI extraction.
  • Because the reports are digitized rather than plain text, the benchmarks implicitly measure handling of layout and table structure in addition to pure language understanding.
  • Linking extracted KPIs to market reactions opens the possibility of studying whether models can surface financially material signals that human analysts already price in.
  • The three difficulty levels allow staged evaluation: a model that fails page retrieval is unlikely to succeed at full extraction, providing a natural curriculum for long-context training.

Load-bearing premise

The 31 KPIs can be reliably and consistently extracted across diverse reports with sufficient human annotation quality to serve as ground truth for the three benchmarks.

What would settle it

A new round of independent human annotation on a random sample of reports that produces materially different KPI values or low inter-annotator agreement would show the labels cannot function as reliable ground truth.

Figures

Figures reproduced from arXiv: 2606.13100 by Amaury de Vitry, Arthur Garnier, Charles Moslonka, Emmanuel Malherbe, Hicham Randrianarivo.

Figure 1
Figure 1. Figure 1: Overview of Ledger corpus. • Open release under the MIT license for the code and Cre￾ative Commons Attribution 4.0 for the data, with a perma￾nent DOI and the full extraction/validation toolchain. 2 Related resources Finance has been an early and active target for LLMs, both as a training domain (e.g. BloombergGPT [31], trained on propri￾etary data) and as an evaluation domain. Numerical reasoning benchmar… view at source ↗
read the original abstract

Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most public financial resources reduce the task to plain-text SEC 10-K filings paired with a handful of question-answer items. We release LEDGER (Long-context Evaluation of Documents for Grounded Extraction and Retrieval), a corpus of 4,999 digitized corporate annual reports - full documents with figures, tables, and narrative, not just regulatory filings. Each report is labeled with 31 consolidated financial KPIs to be extracted and linked to the market's reaction at the earnings date. From this data we derive three evaluation benchmarks spanning the difficulty spectrum: a pure page-level KPI retrieval task with TREC-style relevance judgments over 118,048 questions in natural language, a conversational "needle-in-a-haystack" single-value lookup, and a full KPI extraction task, both from long, numerically dense reports. We additionally provide human OCR-quality annotations with inter-annotator agreement and the complete extraction, validation, and scoring toolchain. We further demonstrate the dataset's research utility with a case study linking CEO-letter rhetoric to post-publication market impact.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the LEDGER dataset, comprising 4,999 digitized corporate annual reports labeled with 31 consolidated financial KPIs. It derives three benchmarks for long-context tasks: page-level retrieval with 118,048 questions, conversational needle-in-a-haystack, and full KPI extraction. The authors provide human annotations with inter-annotator agreement, the extraction toolchain, and a case study linking CEO rhetoric to market reactions.

Significance. If the ground-truth labels prove reliable, LEDGER would fill an important gap by providing a large-scale, long-context financial document benchmark that goes beyond plain-text 10-K filings and includes figures, tables, and narrative. The linkage to market reactions and the release of the full toolchain are notable strengths that could enable reproducible research in grounded financial NLP.

major comments (2)
  1. [Abstract and labeling section] Abstract and labeling description: the claim that the 31 KPI labels constitute reliable ground truth rests on human OCR-quality annotations with inter-annotator agreement, yet no numerical IAA scores, per-KPI error rates, or validation statistics are supplied. This directly affects whether the three derived benchmarks can be used for grounded evaluation.
  2. [Benchmark construction] Benchmark derivation: the manuscript states that three benchmarks spanning the difficulty spectrum are derived from the corpus, but supplies no quantitative validation results, error analysis, or details on how page-level judgments, needle-in-a-haystack items, or full extraction tasks were constructed and tested.
minor comments (1)
  1. [Abstract] The abstract reports 4,999 reports and 118,048 questions; confirm these figures are presented consistently in the main text and tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that supply the requested quantitative details.

read point-by-point responses
  1. Referee: [Abstract and labeling section] Abstract and labeling description: the claim that the 31 KPI labels constitute reliable ground truth rests on human OCR-quality annotations with inter-annotator agreement, yet no numerical IAA scores, per-KPI error rates, or validation statistics are supplied. This directly affects whether the three derived benchmarks can be used for grounded evaluation.

    Authors: We agree that explicit numerical IAA scores, per-KPI error rates, and validation statistics are required to substantiate the ground-truth claim. The current manuscript mentions the existence of IAA but does not report the values. In the revision we will add a new subsection (or expanded table) under the labeling description that reports overall and per-KPI Cohen’s kappa / percentage agreement, annotator disagreement rates, and any post-annotation validation statistics. This will directly support the reliability of the 31 KPI labels. revision: yes

  2. Referee: [Benchmark construction] Benchmark derivation: the manuscript states that three benchmarks spanning the difficulty spectrum are derived from the corpus, but supplies no quantitative validation results, error analysis, or details on how page-level judgments, needle-in-a-haystack items, or full extraction tasks were constructed and tested.

    Authors: We accept that the benchmark-construction section currently lacks the requested quantitative validation, error analysis, and construction details. The revision will expand this section with: (i) summary statistics on the 118,048 page-level questions and their relevance judgments, (ii) construction protocol and sampling method for the needle-in-a-haystack items, (iii) task-definition and annotation guidelines for the full KPI extraction benchmark, and (iv) any available consistency or error-analysis metrics obtained during benchmark creation. These additions will allow readers to assess the soundness of the three derived benchmarks. revision: yes

Circularity Check

0 steps flagged

No circularity: data release with no derivations or fitted predictions

full rationale

The paper releases a corpus of 4,999 reports with 31 KPI labels and derives three benchmarks from it. No equations, parameter fits, predictions, or self-citation chains appear in the provided text. The central contribution is the dataset and toolchain itself; label quality is an external validity concern, not a circular reduction of any claimed derivation to its inputs. No steps match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Benchmark release rests on standard assumptions about data quality and representativeness rather than new mathematical constructs or fitted parameters.

axioms (1)
  • domain assumption Digitized reports and human annotations provide reliable ground truth for KPI extraction across the corpus
    Invoked when claiming the dataset enables grounded evaluation; no further justification supplied in abstract.

pith-pipeline@v0.9.1-grok · 5765 in / 1308 out tokens · 28784 ms · 2026-06-27T07:07:32.846007+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 16 canonical work pages · 13 internal anchors

  1. [1]

    Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyga- niak, and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In international semantic web conference. Springer, 722–735

  2. [2]

    Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, et al. 2025. Eurobert: Scaling multilingual encoders for european languages.arXiv preprint arXiv:2503.05500(2025)

  3. [3]

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216 [cs.CL]

  4. [4]

    Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dillon Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. 2021. FinQA: A Dataset of Numerical Reasoning over Financial Data. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)

  5. [5]

    Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. 2022. ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)

  6. [6]

    Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. InProceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

  7. [7]

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for Datasets. Commun. ACM64, 12 (2021), 86–92

  8. [8]

    Google DeepMind. 2026. Gemini 3.1 Pro Model Card. https://storage.googleapis. com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf. Ac- cessed: 2026-06-01

  9. [9]

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. 2024. RULER: What’s the Real Context Size of Your Long-Context Language Models?. InFirst Conference on Language Modeling. https://openreview.net/forum?id=kIoBbc76Sy

  10. [10]

    Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. 2023. FinanceBench: A New Benchmark for Financial Question Answering. arXiv:2311.11944 [cs.CL] LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

  11. [11]

    Antonio Jimeno Yepes, Yao You, Jan Milczek, Sebastian Laverde, and Renyu Li

  12. [12]

    arXiv:2402.05131 [cs.CL]

    Financial Report Chunking for Effective Retrieval Augmented Generation. arXiv:2402.05131 [cs.CL]

  13. [13]

    Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 39–48

  14. [14]

    Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. 2026. Ministral 3.arXiv preprint arXiv:2601.08584 (2026)

  15. [15]

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics12 (2024), 157–173

  16. [16]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692 (2019)

  17. [17]

    António Loison, Quentin Macé, Antoine Edy, Victor Xing, Tom Balough, Gabriel Moreira, Bo Liu, Manuel Faysse, Céline Hudelot, and Gautier Viaud. 2026. ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios.arXiv preprint arXiv:2601.08620(2026)

  18. [18]

    Emmanuel Malherbe and Marie-Aude Aufaure. 2016. Bridge the terminology gap between recruiters and candidates: A multilingual skills base built from social media and linked data. In2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 583–590

  19. [19]

    NVIDIA. 2025. Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning. https://arxiv.org/abs/2512. 20848 Technical report

  20. [20]

    OpenAI. 2025. gpt-oss-120b & gpt-oss-20b Model Card. arXiv:2508.10925 [cs.CL] https://arxiv.org/abs/2508.10925

  21. [21]

    Qwen Team. 2026. Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model. https://qwen.ai/blog?id=qwen3.6-27b

  22. [22]

    Varshini Reddy, Rik Koncel-Kedziorski, Viet Dac Lai, Michael Krumdick, Charles Lovering, and Chris Tanner. 2024. DocFinQA: A Long-Context Financial Rea- soning Dataset. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Short Papers

  23. [23]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. InProceedings of the 2019 Conference on Em- pirical Methods in Natural Language Processing. Association for Computational Linguistics. https://arxiv.org/abs/1908.10084

  24. [24]

    Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Frame- work: BM25 and Beyond.Foundations and Trends in Information Retrieval3, 4 (2009), 333–389. doi:10.1561/1500000019

  25. [25]

    Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. Mpnet: Masked and permuted pre-training for language understanding.Advances in neural information processing systems33 (2020), 16857–16867

  26. [26]

    Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, et al. 2025. Embeddinggemma: Powerful and lightweight text representations. arXiv preprint arXiv:2509.20354(2025)

  27. [27]

    2005.TREC: Experiment and evaluation in information retrieval

    Ellen M Voorhees, Donna K Harman, et al. 2005.TREC: Experiment and evaluation in information retrieval. Vol. 63. MIT press Cambridge

  28. [28]

    Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hall- ström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, and Iacopo Poli. 2024. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Infer...

  29. [29]

    Haoran Wei, Yaofeng Sun, and Yukun Li. 2025. DeepSeek-OCR: Contexts Optical Compression.arXiv preprint arXiv:2510.18234(2025)

  30. [30]

    Haoran Wei, Yaofeng Sun, and Yukun Li. 2026. DeepSeek-OCR 2: Visual Causal Flow.arXiv preprint arXiv:2601.20552(2026)

  31. [31]

    Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Ap- pleton, Myles Axton, Arie Baak, et al

    Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Ap- pleton, Myles Axton, Arie Baak, et al . 2016. The FAIR Guiding Principles for Scientific Data Management and Stewardship.Scientific Data3, 1 (2016), 160018

  32. [32]

    Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. BloombergGPT: A Large Language Model for Finance. arXiv:2303.17564 [cs.LG]

  33. [33]

    Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023. C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv:2309.07597 [cs.CL]

  34. [34]

    Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A Question An- swering Benchmark on a Hybrid of Tabular and Textual Content in Finance. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL)