LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

Amaury de Vitry; Arthur Garnier; Charles Moslonka; Emmanuel Malherbe; Hicham Randrianarivo

arxiv: 2606.13100 · v1 · pith:27WCDVT3new · submitted 2026-06-11 · 💻 cs.CL

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

Charles Moslonka , Amaury de Vitry , Arthur Garnier , Hicham Randrianarivo , Emmanuel Malherbe This is my paper

Pith reviewed 2026-06-27 07:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords LEDGER datasetlong-context benchmarkcorporate annual reportsfinancial KPI extractiondocument retrievalLLM evaluationOCR annotationsearnings reports

0 comments

The pith

LEDGER supplies 4,999 full corporate annual reports labeled with 31 financial KPIs to support three grounded long-context benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases LEDGER, a corpus of 4,999 digitized corporate annual reports each labeled with 31 consolidated financial KPIs extracted from full documents containing figures, tables, and narrative. From this data the authors derive three benchmarks that range from page-level retrieval over 118,048 natural-language questions to conversational single-value lookup and complete KPI extraction. A sympathetic reader would care because current financial resources are mostly limited to plain-text SEC filings with few QA pairs, leaving long-context capabilities of language models untested on numerically dense, multi-page reports. The release also includes human OCR-quality annotations with inter-annotator agreement and a toolchain for scoring, plus a case study that links CEO-letter rhetoric to market reactions after earnings dates.

Core claim

We release LEDGER (Long-context Evaluation of Documents for Grounded Extraction and Retrieval), a corpus of 4,999 digitized corporate annual reports labeled with 31 consolidated financial KPIs to be extracted and linked to the market's reaction at the earnings date, from which we derive three evaluation benchmarks spanning the difficulty spectrum: a pure page-level KPI retrieval task with TREC-style relevance judgments over 118,048 questions in natural language, a conversational needle-in-a-haystack single-value lookup, and a full KPI extraction task, both from long, numerically dense reports.

What carries the argument

The LEDGER corpus of labeled annual reports together with the derived page-retrieval, needle-in-a-haystack, and full-extraction benchmarks that turn the 31 KPIs into ground-truth targets for long-context evaluation.

If this is right

Models can now be scored on retrieving the correct page for any of 118,048 natural-language questions about the 31 KPIs.
The needle-in-a-haystack task directly measures whether a model can surface a single numeric value from a full multi-page report given a conversational query.
The full extraction task tests end-to-end recovery of all 31 KPIs from complete, figure-rich documents.
Human OCR annotations supply an explicit quality baseline for any automated extraction pipeline.
The market-reaction case study supplies an additional downstream signal that can be correlated with extracted CEO-letter content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dataset could be used to test whether models that perform well on page retrieval also maintain accuracy when the retrieved pages must be combined for full KPI extraction.
Because the reports are digitized rather than plain text, the benchmarks implicitly measure handling of layout and table structure in addition to pure language understanding.
Linking extracted KPIs to market reactions opens the possibility of studying whether models can surface financially material signals that human analysts already price in.
The three difficulty levels allow staged evaluation: a model that fails page retrieval is unlikely to succeed at full extraction, providing a natural curriculum for long-context training.

Load-bearing premise

The 31 KPIs can be reliably and consistently extracted across diverse reports with sufficient human annotation quality to serve as ground truth for the three benchmarks.

What would settle it

A new round of independent human annotation on a random sample of reports that produces materially different KPI values or low inter-annotator agreement would show the labels cannot function as reliable ground truth.

Figures

Figures reproduced from arXiv: 2606.13100 by Amaury de Vitry, Arthur Garnier, Charles Moslonka, Emmanuel Malherbe, Hicham Randrianarivo.

**Figure 1.** Figure 1: Overview of Ledger corpus. • Open release under the MIT license for the code and Creative Commons Attribution 4.0 for the data, with a permanent DOI and the full extraction/validation toolchain. 2 Related resources Finance has been an early and active target for LLMs, both as a training domain (e.g. BloombergGPT [31], trained on proprietary data) and as an evaluation domain. Numerical reasoning benchmar… view at source ↗

read the original abstract

Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most public financial resources reduce the task to plain-text SEC 10-K filings paired with a handful of question-answer items. We release LEDGER (Long-context Evaluation of Documents for Grounded Extraction and Retrieval), a corpus of 4,999 digitized corporate annual reports - full documents with figures, tables, and narrative, not just regulatory filings. Each report is labeled with 31 consolidated financial KPIs to be extracted and linked to the market's reaction at the earnings date. From this data we derive three evaluation benchmarks spanning the difficulty spectrum: a pure page-level KPI retrieval task with TREC-style relevance judgments over 118,048 questions in natural language, a conversational "needle-in-a-haystack" single-value lookup, and a full KPI extraction task, both from long, numerically dense reports. We additionally provide human OCR-quality annotations with inter-annotator agreement and the complete extraction, validation, and scoring toolchain. We further demonstrate the dataset's research utility with a case study linking CEO-letter rhetoric to post-publication market impact.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LEDGER supplies a sizable new corpus of full annual reports with 31 KPI labels and three derived benchmarks, but its value depends on whether the annotation process actually delivers consistent ground truth.

read the letter

LEDGER's core contribution is the release of 4,999 digitized annual reports labeled for 31 consolidated financial KPIs, from which the authors build three benchmarks: page-level retrieval with TREC judgments, a conversational needle-in-haystack lookup, and full KPI extraction. They also include a case study tying CEO-letter language to market reactions.

The paper does a few things cleanly. It works with complete reports that contain figures and tables rather than stripped 10-K text, which matches real usage better. The task formulations span retrieval to extraction, and the authors state they release the full toolchain plus human annotations with inter-annotator agreement. That is more than many dataset papers provide.

The load-bearing issue is label quality. The benchmarks only work if the 31 KPIs were extracted reliably across heterogeneous reports, accounting choices, and digitization artifacts. The abstract mentions OCR-quality annotations and agreement, yet the description supplies no numbers, no breakdown by KPI, and no error analysis. If the full paper shows concrete agreement scores and validation steps that bound the noise, the resource strengthens; without that, downstream results rest on an unverified premise.

This is for groups testing long-context retrieval or extraction on numerically dense documents, especially in finance. A reader who needs concrete test cases in that domain can use the data directly. It deserves peer review because the scale and task design are concrete enough to matter, provided the annotation details hold up under scrutiny.

Referee Report

2 major / 1 minor

Summary. The paper introduces the LEDGER dataset, comprising 4,999 digitized corporate annual reports labeled with 31 consolidated financial KPIs. It derives three benchmarks for long-context tasks: page-level retrieval with 118,048 questions, conversational needle-in-a-haystack, and full KPI extraction. The authors provide human annotations with inter-annotator agreement, the extraction toolchain, and a case study linking CEO rhetoric to market reactions.

Significance. If the ground-truth labels prove reliable, LEDGER would fill an important gap by providing a large-scale, long-context financial document benchmark that goes beyond plain-text 10-K filings and includes figures, tables, and narrative. The linkage to market reactions and the release of the full toolchain are notable strengths that could enable reproducible research in grounded financial NLP.

major comments (2)

[Abstract and labeling section] Abstract and labeling description: the claim that the 31 KPI labels constitute reliable ground truth rests on human OCR-quality annotations with inter-annotator agreement, yet no numerical IAA scores, per-KPI error rates, or validation statistics are supplied. This directly affects whether the three derived benchmarks can be used for grounded evaluation.
[Benchmark construction] Benchmark derivation: the manuscript states that three benchmarks spanning the difficulty spectrum are derived from the corpus, but supplies no quantitative validation results, error analysis, or details on how page-level judgments, needle-in-a-haystack items, or full extraction tasks were constructed and tested.

minor comments (1)

[Abstract] The abstract reports 4,999 reports and 118,048 questions; confirm these figures are presented consistently in the main text and tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that supply the requested quantitative details.

read point-by-point responses

Referee: [Abstract and labeling section] Abstract and labeling description: the claim that the 31 KPI labels constitute reliable ground truth rests on human OCR-quality annotations with inter-annotator agreement, yet no numerical IAA scores, per-KPI error rates, or validation statistics are supplied. This directly affects whether the three derived benchmarks can be used for grounded evaluation.

Authors: We agree that explicit numerical IAA scores, per-KPI error rates, and validation statistics are required to substantiate the ground-truth claim. The current manuscript mentions the existence of IAA but does not report the values. In the revision we will add a new subsection (or expanded table) under the labeling description that reports overall and per-KPI Cohen’s kappa / percentage agreement, annotator disagreement rates, and any post-annotation validation statistics. This will directly support the reliability of the 31 KPI labels. revision: yes
Referee: [Benchmark construction] Benchmark derivation: the manuscript states that three benchmarks spanning the difficulty spectrum are derived from the corpus, but supplies no quantitative validation results, error analysis, or details on how page-level judgments, needle-in-a-haystack items, or full extraction tasks were constructed and tested.

Authors: We accept that the benchmark-construction section currently lacks the requested quantitative validation, error analysis, and construction details. The revision will expand this section with: (i) summary statistics on the 118,048 page-level questions and their relevance judgments, (ii) construction protocol and sampling method for the needle-in-a-haystack items, (iii) task-definition and annotation guidelines for the full KPI extraction benchmark, and (iv) any available consistency or error-analysis metrics obtained during benchmark creation. These additions will allow readers to assess the soundness of the three derived benchmarks. revision: yes

Circularity Check

0 steps flagged

No circularity: data release with no derivations or fitted predictions

full rationale

The paper releases a corpus of 4,999 reports with 31 KPI labels and derives three benchmarks from it. No equations, parameter fits, predictions, or self-citation chains appear in the provided text. The central contribution is the dataset and toolchain itself; label quality is an external validity concern, not a circular reduction of any claimed derivation to its inputs. No steps match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Benchmark release rests on standard assumptions about data quality and representativeness rather than new mathematical constructs or fitted parameters.

axioms (1)

domain assumption Digitized reports and human annotations provide reliable ground truth for KPI extraction across the corpus
Invoked when claiming the dataset enables grounded evaluation; no further justification supplied in abstract.

pith-pipeline@v0.9.1-grok · 5765 in / 1308 out tokens · 28784 ms · 2026-06-27T07:07:32.846007+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 16 canonical work pages · 13 internal anchors

[1]

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyga- niak, and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In international semantic web conference. Springer, 722–735

2007
[2]

Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, et al. 2025. Eurobert: Scaling multilingual encoders for european languages.arXiv preprint arXiv:2503.05500(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dillon Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. 2021. FinQA: A Dataset of Numerical Reasoning over Financial Data. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2021
[5]

Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. 2022. ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2022
[6]

Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. InProceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

2021
[7]

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for Datasets. Commun. ACM64, 12 (2021), 86–92

2021
[8]

Google DeepMind. 2026. Gemini 3.1 Pro Model Card. https://storage.googleapis. com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf. Ac- cessed: 2026-06-01

2026
[9]

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. 2024. RULER: What’s the Real Context Size of Your Long-Context Language Models?. InFirst Conference on Language Modeling. https://openreview.net/forum?id=kIoBbc76Sy

2024
[10]

Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. 2023. FinanceBench: A New Benchmark for Financial Question Answering. arXiv:2311.11944 [cs.CL] LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Antonio Jimeno Yepes, Yao You, Jan Milczek, Sebastian Laverde, and Renyu Li
[12]

arXiv:2402.05131 [cs.CL]

Financial Report Chunking for Effective Retrieval Augmented Generation. arXiv:2402.05131 [cs.CL]

work page arXiv
[13]

Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 39–48

2020
[14]

Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. 2026. Ministral 3.arXiv preprint arXiv:2601.08584 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics12 (2024), 157–173

2024
[16]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019
[17]

António Loison, Quentin Macé, Antoine Edy, Victor Xing, Tom Balough, Gabriel Moreira, Bo Liu, Manuel Faysse, Céline Hudelot, and Gautier Viaud. 2026. ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios.arXiv preprint arXiv:2601.08620(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

Emmanuel Malherbe and Marie-Aude Aufaure. 2016. Bridge the terminology gap between recruiters and candidates: A multilingual skills base built from social media and linked data. In2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 583–590

2016
[19]

NVIDIA. 2025. Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning. https://arxiv.org/abs/2512. 20848 Technical report

2025
[20]

OpenAI. 2025. gpt-oss-120b & gpt-oss-20b Model Card. arXiv:2508.10925 [cs.CL] https://arxiv.org/abs/2508.10925

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Qwen Team. 2026. Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model. https://qwen.ai/blog?id=qwen3.6-27b

2026
[22]

Varshini Reddy, Rik Koncel-Kedziorski, Viet Dac Lai, Michael Krumdick, Charles Lovering, and Chris Tanner. 2024. DocFinQA: A Long-Context Financial Rea- soning Dataset. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Short Papers

2024
[23]

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. InProceedings of the 2019 Conference on Em- pirical Methods in Natural Language Processing. Association for Computational Linguistics. https://arxiv.org/abs/1908.10084

work page internal anchor Pith review Pith/arXiv arXiv 2019
[24]

Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Frame- work: BM25 and Beyond.Foundations and Trends in Information Retrieval3, 4 (2009), 333–389. doi:10.1561/1500000019

work page doi:10.1561/1500000019 2009
[25]

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. Mpnet: Masked and permuted pre-training for language understanding.Advances in neural information processing systems33 (2020), 16857–16867

2020
[26]

Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, et al. 2025. Embeddinggemma: Powerful and lightweight text representations. arXiv preprint arXiv:2509.20354(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

2005.TREC: Experiment and evaluation in information retrieval

Ellen M Voorhees, Donna K Harman, et al. 2005.TREC: Experiment and evaluation in information retrieval. Vol. 63. MIT press Cambridge

2005
[28]

Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hall- ström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, and Iacopo Poli. 2024. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Infer...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Haoran Wei, Yaofeng Sun, and Yukun Li. 2025. DeepSeek-OCR: Contexts Optical Compression.arXiv preprint arXiv:2510.18234(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Haoran Wei, Yaofeng Sun, and Yukun Li. 2026. DeepSeek-OCR 2: Visual Causal Flow.arXiv preprint arXiv:2601.20552(2026)

work page arXiv 2026
[31]

Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Ap- pleton, Myles Axton, Arie Baak, et al

Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Ap- pleton, Myles Axton, Arie Baak, et al . 2016. The FAIR Guiding Principles for Scientific Data Management and Stewardship.Scientific Data3, 1 (2016), 160018

2016
[32]

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. BloombergGPT: A Large Language Model for Finance. arXiv:2303.17564 [cs.LG]

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023. C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv:2309.07597 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A Question An- swering Benchmark on a Hybrid of Tabular and Textual Content in Finance. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL)

2021

[1] [1]

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyga- niak, and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In international semantic web conference. Springer, 722–735

2007

[2] [2]

Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, et al. 2025. Eurobert: Scaling multilingual encoders for european languages.arXiv preprint arXiv:2503.05500(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dillon Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. 2021. FinQA: A Dataset of Numerical Reasoning over Financial Data. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2021

[5] [5]

Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. 2022. ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2022

[6] [6]

Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. InProceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

2021

[7] [7]

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for Datasets. Commun. ACM64, 12 (2021), 86–92

2021

[8] [8]

Google DeepMind. 2026. Gemini 3.1 Pro Model Card. https://storage.googleapis. com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf. Ac- cessed: 2026-06-01

2026

[9] [9]

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. 2024. RULER: What’s the Real Context Size of Your Long-Context Language Models?. InFirst Conference on Language Modeling. https://openreview.net/forum?id=kIoBbc76Sy

2024

[10] [10]

Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. 2023. FinanceBench: A New Benchmark for Financial Question Answering. arXiv:2311.11944 [cs.CL] LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

Antonio Jimeno Yepes, Yao You, Jan Milczek, Sebastian Laverde, and Renyu Li

[12] [12]

arXiv:2402.05131 [cs.CL]

Financial Report Chunking for Effective Retrieval Augmented Generation. arXiv:2402.05131 [cs.CL]

work page arXiv

[13] [13]

Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 39–48

2020

[14] [14]

Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. 2026. Ministral 3.arXiv preprint arXiv:2601.08584 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[15] [15]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics12 (2024), 157–173

2024

[16] [16]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019

[17] [17]

António Loison, Quentin Macé, Antoine Edy, Victor Xing, Tom Balough, Gabriel Moreira, Bo Liu, Manuel Faysse, Céline Hudelot, and Gautier Viaud. 2026. ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios.arXiv preprint arXiv:2601.08620(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[18] [18]

Emmanuel Malherbe and Marie-Aude Aufaure. 2016. Bridge the terminology gap between recruiters and candidates: A multilingual skills base built from social media and linked data. In2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 583–590

2016

[19] [19]

NVIDIA. 2025. Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning. https://arxiv.org/abs/2512. 20848 Technical report

2025

[20] [20]

OpenAI. 2025. gpt-oss-120b & gpt-oss-20b Model Card. arXiv:2508.10925 [cs.CL] https://arxiv.org/abs/2508.10925

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Qwen Team. 2026. Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model. https://qwen.ai/blog?id=qwen3.6-27b

2026

[22] [22]

Varshini Reddy, Rik Koncel-Kedziorski, Viet Dac Lai, Michael Krumdick, Charles Lovering, and Chris Tanner. 2024. DocFinQA: A Long-Context Financial Rea- soning Dataset. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Short Papers

2024

[23] [23]

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. InProceedings of the 2019 Conference on Em- pirical Methods in Natural Language Processing. Association for Computational Linguistics. https://arxiv.org/abs/1908.10084

work page internal anchor Pith review Pith/arXiv arXiv 2019

[24] [24]

Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Frame- work: BM25 and Beyond.Foundations and Trends in Information Retrieval3, 4 (2009), 333–389. doi:10.1561/1500000019

work page doi:10.1561/1500000019 2009

[25] [25]

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. Mpnet: Masked and permuted pre-training for language understanding.Advances in neural information processing systems33 (2020), 16857–16867

2020

[26] [26]

Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, et al. 2025. Embeddinggemma: Powerful and lightweight text representations. arXiv preprint arXiv:2509.20354(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

2005.TREC: Experiment and evaluation in information retrieval

Ellen M Voorhees, Donna K Harman, et al. 2005.TREC: Experiment and evaluation in information retrieval. Vol. 63. MIT press Cambridge

2005

[28] [28]

Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hall- ström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, and Iacopo Poli. 2024. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Infer...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [29]

Haoran Wei, Yaofeng Sun, and Yukun Li. 2025. DeepSeek-OCR: Contexts Optical Compression.arXiv preprint arXiv:2510.18234(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Haoran Wei, Yaofeng Sun, and Yukun Li. 2026. DeepSeek-OCR 2: Visual Causal Flow.arXiv preprint arXiv:2601.20552(2026)

work page arXiv 2026

[31] [31]

Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Ap- pleton, Myles Axton, Arie Baak, et al

Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Ap- pleton, Myles Axton, Arie Baak, et al . 2016. The FAIR Guiding Principles for Scientific Data Management and Stewardship.Scientific Data3, 1 (2016), 160018

2016

[32] [32]

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. BloombergGPT: A Large Language Model for Finance. arXiv:2303.17564 [cs.LG]

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023. C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv:2309.07597 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A Question An- swering Benchmark on a Hybrid of Tabular and Textual Content in Finance. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL)

2021