Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls
Pith reviewed 2026-05-08 17:57 UTC · model grok-4.3
The pith
LLMs enable open-ended extraction of emergent KPIs from unstructured earnings call transcripts at 79.7 percent human-verified precision.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Encoder-based models trained on SEC filings fail to generalize to earnings calls because of the domain shift from templatic to conversational text. The authors therefore build two new benchmarks, SECB and ECB, plus ECB-A, an expert-annotated subset of ECB. They demonstrate that an LLM system can extract emergent KPIs directly from call transcripts, with human raters confirming 79.7 percent precision, thereby supplying the first baseline for consistent KPI tracking in this domain.
What carries the argument
LLM open-ended extraction pipeline applied to conversational earnings-call transcripts, evaluated against the ECB-A expert-annotated benchmark for emergent KPIs.
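The pipeline itself is not reproduced in this review. A minimal sketch of what one open-ended extraction step could look like, assuming a JSON-list output schema with hypothetical `kpi`, `value`, and `quote` fields; the paper's actual prompts and schema are not given here:

```python
import json

# Hypothetical prompt template and response parser for open-ended KPI
# extraction from one transcript chunk. Every name and field below is
# illustrative, not taken from the paper.
PROMPT_TEMPLATE = (
    "Extract every key performance indicator mentioned in the earnings-call "
    "excerpt below, including company-specific (emergent) metrics that do not "
    "appear in templated SEC filings. Return a JSON list of objects with "
    "fields 'kpi', 'value', and 'quote'.\n\nExcerpt:\n{chunk}"
)

def build_extraction_prompt(chunk: str) -> str:
    """Fill the template with one transcript chunk."""
    return PROMPT_TEMPLATE.format(chunk=chunk)

def parse_kpis(response_text: str) -> list[dict]:
    """Parse the model's JSON reply; drop malformed or incomplete entries."""
    try:
        items = json.loads(response_text)
    except json.JSONDecodeError:
        return []
    return [it for it in items
            if isinstance(it, dict) and "kpi" in it and "value" in it]
```

Because the output is free-form rather than tied to a fixed label set, this style of extraction can surface KPIs that no filing-trained tagger has a class for, which is what the ECB-A evaluation then scores.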
If this is right
- Encoder-based models trained on SEC filings do not transfer effectively to the conversational domain of earnings calls.
- LLMs support extraction of non-standard, company-specific KPIs that appear in calls but not in templated reports.
- Human-verified precision of 79.7 percent provides a concrete baseline for the task of tracking emergent performance indicators.
- Consistent, automated monitoring of these KPIs across successive earnings calls becomes feasible.
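The last point can be sketched concretely: assuming extraction yields a set of indicator names per call, tracking what management newly emphasizes or quietly drops between quarters reduces to set differences. All KPI names below are invented for illustration:

```python
# Illustrative sketch of cross-call KPI monitoring: compare the KPI sets
# extracted from two consecutive earnings calls. The output format of the
# extraction step is assumed, not taken from the paper.

def kpi_drift(prev: set[str], curr: set[str]) -> dict[str, set[str]]:
    """Return KPIs that appear, disappear, or persist between two calls."""
    return {
        "added": curr - prev,       # newly emphasized this quarter
        "dropped": prev - curr,     # no longer mentioned
        "retained": prev & curr,    # tracked consistently
    }

# Hypothetical extraction results for two successive quarters.
q1 = {"revenue", "cloud backlog", "paid subscribers"}
q2 = {"revenue", "cloud backlog", "ai inference volume"}
drift = kpi_drift(q1, q2)
```

In practice the same comparison would need KPI-name normalization (e.g. "Cloud backlog" vs. "backlog in Cloud"), which is exactly where conversational language makes the task hard.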
Where Pith is reading between the lines
- Extracted KPIs could be tracked over time for a single company to detect shifts in what management chooses to emphasize in calls.
- The method might be extended to compare KPI language across peer firms within the same industry.
- Downstream systems could test whether the extracted KPIs predict subsequent earnings surprises or stock reactions.
- The annotation scheme itself could be reused or adapted to create training data for finer-grained financial event detection.
Load-bearing premise
The 2,460 expert annotation groups and the definition of emergent KPIs supply a reliable, generalizable ground truth for extraction quality across different companies and sectors.
What would settle it
If a new collection of earnings calls is annotated by independent experts and the LLM-extracted KPIs receive human approval rates below 60 percent on relevance and accuracy, the claim of a reliable baseline would be falsified.
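That falsification test can be stated as a small check. Only the 60 percent threshold and the 79.7 percent headline figure come from the review above; the item counts in the example are invented for illustration:

```python
# Sketch of the falsification criterion: re-annotate a fresh set of calls
# and test whether human approval of the extracted KPIs stays above 60%.

def approval_rate(approved: int, total: int) -> float:
    """Fraction of extracted KPIs that independent experts approve."""
    if total == 0:
        raise ValueError("no annotated extractions to score")
    return approved / total

def baseline_falsified(approved: int, total: int,
                       threshold: float = 0.60) -> bool:
    """True if the new evaluation falls below the stated 60% bar."""
    return approval_rate(approved, total) < threshold
```

On the original evaluation the rate was 0.797; the claim survives unless an independent re-annotation drives it under 0.60.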
Original abstract
Earnings calls are a key source of financial information about public companies. However, extracting information from these calls is difficult. Unlike the templatic filings required by the U.S. Securities and Exchange Commission (SEC) to report a company's financial situation, earnings conference calls have no built-in labels, are unstructured, and feature conversational language. We explore this challenging domain by assessing the information captured by models trained on SEC filings and in-context learning methods. To establish a baseline, we first evaluate the generalization capabilities of SEC-trained models across established SEC datasets. To support our investigation, we introduce three novel benchmarks: (1) SEC Filings Benchmark (SECB), (2) Earnings Calls Benchmark (ECB), and ECB-A, a subset with 2,460 expert annotation groups to support our qualitative analysis. We find that encoder-based models struggle with the domain shift. Finally, we propose a system utilizing LLMs to perform open-ended extraction from unstructured call transcripts, verified by human evaluation (79.7% precision), providing a baseline for this valuable domain through the consistent tracking of emergent KPIs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper addresses the challenge of extracting Key Performance Indicators (KPIs) from unstructured earnings conference call transcripts. It evaluates the cross-domain generalization of models trained on SEC filings, introduces three new benchmarks (SECB, ECB, and ECB-A with 2,460 expert annotation groups), finds that encoder-based models struggle with domain shift from filings to calls, and proposes an LLM-based open-ended extraction system that achieves 79.7% precision under human evaluation, positioning this as a baseline for tracking emergent KPIs.
Significance. If the human evaluation proves reliable, the work fills a notable gap in financial NLP by moving beyond templatic SEC filings to conversational transcripts. The new benchmarks and LLM baseline could enable consistent tracking of emergent KPIs, with credit due for the focus on open-ended extraction and the scale of expert annotations attempted. The result would be a useful starting point for the domain, though its impact depends on verifiable annotation quality.
Major comments (1)
- [Abstract] The headline result of 79.7% precision from human evaluation on the ECB-A subset is central to the claim that the LLM system supplies a usable baseline. However, the abstract supplies no inter-annotator agreement statistic, no operational definition distinguishing emergent from standard KPIs, no annotation guidelines, and no description of how annotators handled conversational ambiguity or domain terminology. Without these, the precision figure cannot be interpreted as evidence of consistent extraction quality rather than idiosyncratic annotator alignment.
Minor comments (1)
- [Benchmark Introduction] The benchmark description lists three items as (1) SECB, (2) ECB, and ECB-A, yet ECB-A is explicitly a subset of ECB; clarify the exact relationships, sizes, and construction details in the benchmark section to avoid confusion.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights an important opportunity to strengthen the interpretability of our human evaluation results. We agree that the abstract should be more self-contained regarding the annotation process and will revise it accordingly.
Point-by-point responses
Referee: [Abstract] The headline result of 79.7% precision from human evaluation on the ECB-A subset is central to the claim that the LLM system supplies a usable baseline. However, the abstract supplies no inter-annotator agreement statistic, no operational definition distinguishing emergent from standard KPIs, no annotation guidelines, and no description of how annotators handled conversational ambiguity or domain terminology. Without these, the precision figure cannot be interpreted as evidence of consistent extraction quality rather than idiosyncratic annotator alignment.
Authors: We acknowledge that the current abstract is too concise and omits key methodological details needed to contextualize the 79.7% precision. The full manuscript details the creation of the ECB-A benchmark (2,460 expert annotation groups), the distinction between emergent and standard KPIs, and the annotation process for handling conversational transcripts and financial terminology. We will revise the abstract to include: a brief operational definition of emergent KPIs, a summary of the annotation guidelines, a description of how annotators managed ambiguity and domain terms, and the inter-annotator agreement statistic (or a note on annotation reliability if not previously computed). This change will make the headline result more robust and interpretable while preserving all original claims and results.
Revision: yes
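One standard choice for the promised agreement statistic is Cohen's kappa. A minimal sketch for two annotators' binary accept/reject labels, assuming (hypothetically) that ECB-A items are doubly labeled; the paper's actual reliability protocol is not described in this review:

```python
# Cohen's kappa: chance-corrected agreement between two annotators.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
# p_e is the agreement expected from each annotator's label frequencies.

def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two equal-length label sequences."""
    assert len(a) == len(b) and a, "need two aligned, non-empty label lists"
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    labels = set(a) | set(b)
    # chance agreement from the annotators' marginal label frequencies
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # degenerate case: both annotators constant and identical
    return (p_o - p_e) / (1 - p_e)
```

A kappa near 0 means the raw agreement is no better than chance, which is precisely the failure mode the referee's objection is guarding against.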
Circularity Check
No circularity: new benchmarks and human-verified extraction form an independent chain.
Full rationale
The paper introduces three new benchmarks (SECB, ECB, ECB-A) built from expert annotations and evaluates both SEC-trained models and an LLM open-ended extraction system against them. The 79.7% precision figure is obtained via direct human evaluation on the novel ECB-A annotations rather than by fitting any parameter to a subset of the target data and then claiming a prediction of a related quantity. No self-citations are invoked to justify uniqueness or to smuggle in an ansatz; the derivation from data creation through model assessment is self-contained and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Expert annotations of emergent KPIs in earnings calls constitute reliable ground truth for measuring extraction performance.