Utilizing Pre-trained and Large Language Models for 10-K Items Segmentation

Hsin-Min Lu; Huan-Hsun Yen; Yen-Hsiu Chen; Yu-Tai Chien

arxiv: 2502.08875 · v2 · submitted 2025-02-13 · 💱 q-fin.GN

Utilizing Pre-trained and Large Language Models for 10-K Items Segmentation

Hsin-Min Lu , Yu-Tai Chien , Huan-Hsun Yen , Yen-Hsiu Chen This is my paper

Pith reviewed 2026-05-23 04:07 UTC · model grok-4.3

classification 💱 q-fin.GN

keywords 10-K segmentationBERTlarge language modelsfinancial text extractionitem segmentationmacro-F1pre-trained models

0 comments

The pith

BERT4ItemSeg segments core 10-K items at 0.9825 macro-F1, beating GPT4ItemSeg, CRF, and rule-based methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops machine learning methods to extract specific items from 10-K reports despite variations in how companies present the sections. BERT4ItemSeg stacks a pre-trained BERT model with Bi-LSTM layers in a hierarchy that bypasses length limits, while GPT4ItemSeg feeds line IDs into ChatGPT-4o for prompting. On 3,737 annotated reports the BERT approach records 0.9825 macro-F1 for items 1, 1A, 3 and 7, ahead of the GPT version at 0.9567, a conditional random field at 0.9818, and rules at 0.9048. The results indicate that these models can replace brittle rule sets and support more consistent financial text analysis. GPT4ItemSeg is presented as a way to keep pace with future regulatory shifts in item layout.

Core claim

BERT4ItemSeg achieves a macro-F1 of 0.9825 on core items 1, 1A, 3 and 7 by combining BERT with Bi-LSTM in a hierarchical structure that handles document length. This exceeds GPT4ItemSeg at 0.9567 using line-ID prompting on ChatGPT-4o, conditional random field performance at 0.9818, and rule-based performance at 0.9048, all measured on the same 3,737 annotated 10-K reports. The work frames both models as an extensible framework that improves segmentation accuracy and reproducibility for accounting and finance applications while allowing adaptation to regulatory changes.

What carries the argument

BERT4ItemSeg, the hierarchical BERT plus Bi-LSTM model that processes long 10-K documents for item boundary detection.

If this is right

Core items 1, 1A, 3 and 7 can be extracted from 10-K reports with higher consistency than rule-based or CRF baselines allow.
Downstream financial text analytics tasks gain from more reliable input segments.
GPT4ItemSeg supplies a route to update segmentation logic quickly when regulators alter item definitions or presentation rules.
The combined framework supplies a path toward reproducible extraction pipelines in accounting research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hierarchical structure could be tested on 10-Q or 8-K filings to check whether length-handling benefits transfer.
A hybrid system that routes stable items to BERT4ItemSeg and novel formats to GPT4ItemSeg might combine accuracy with adaptability.
Performance on filings from different industries or post-2023 periods would test whether the reported scores hold under continued format drift.

Load-bearing premise

The 3,737 annotated 10-K reports capture enough real-world format variation that the reported macro-F1 scores reflect performance on unseen filings rather than the annotation process itself.

What would settle it

Running both models on a fresh collection of 10-K reports that use previously unseen formatting conventions and obtaining macro-F1 scores below the rule-based or CRF baselines would show the claimed gains do not generalize.

read the original abstract

Extracting specific items from 10-K reports is challenging due to variations in document formats and item presentation. To improve over traditional rule-based approaches, this study introduces and compares two advanced item segmentation methods: (1) GPT4ItemSeg, using a novel line-ID-based prompting mechanism to utilize a large language model, ChatGPT-4o, for item segmentation, and (2) BERT4ItemSeg, combining a pre-trained language model, BERT, with a Bi-LSTM model in a hierarchical structure to overcome context window constraints. Trained and evaluated on 3,737 annotated 10-K reports, BERT4ItemSeg achieves a macro-F1 of 0.9825, surpassing GPT4ItemSeg (0.9567), conditional random field (0.9818), and rule-based methods (0.9048) for core items (1, 1A, 3, and 7). These approaches enhance item segmentation performance, improving text analytics in accounting and finance. BERT4ItemSeg offers satisfactory item segmentation performance, while GPT4ItemSeg can easily adapt to regulatory changes. Together, they provide an extensible framework for 10-K item segmentation that supports reliable and reproducible results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives concrete F1 numbers for GPT prompting and a hierarchical BERT-BiLSTM on 10-K item segmentation, but the abstract leaves open whether evaluation used held-out data.

read the letter

The central result is that BERT4ItemSeg reaches 0.9825 macro-F1 on items 1, 1A, 3, and 7, slightly above the CRF baseline at 0.9818 and well above rules at 0.9048. GPT4ItemSeg lands at 0.9567. The two new pieces are the line-ID prompting scheme for GPT-4o and the hierarchical BERT plus Bi-LSTM stack meant to handle long filings without context limits. Both are straightforward adaptations rather than deep theoretical advances, but they are applied cleanly to a finance task that still relies on brittle rules in practice. The comparison on 3,737 annotated reports is useful as far as it goes and shows the methods are at least competitive. The soft spot is exactly the one the stress-test flags: the abstract says the models were trained and evaluated on the same 3,737 reports and gives no details on splits, cross-validation, or temporal hold-out. If the numbers are in-sample, the small edge over CRF does not demonstrate robustness to new filing formats. The paper would be stronger with an explicit test partition and some error analysis on format variation. No other load-bearing claims appear in the abstract. This is aimed at people who build or maintain financial text pipelines and want a ready comparison of current off-the-shelf options. It is worth sending to peer review because the task is well-defined, the baselines are sensible, and the numbers are reported plainly; the split issue is fixable and does not make the work incoherent.

Referee Report

1 major / 0 minor

Summary. The paper introduces GPT4ItemSeg, a prompting method for ChatGPT-4o using line IDs, and BERT4ItemSeg, a hierarchical BERT+Bi-LSTM model, for segmenting items 1, 1A, 3, and 7 from 10-K filings. It claims that both models, trained and evaluated on the same set of 3,737 annotated reports, achieve macro-F1 scores of 0.9567 and 0.9825 respectively, outperforming a CRF baseline (0.9818) and rule-based methods (0.9048).

Significance. If the reported F1 scores are obtained on properly held-out data and generalize beyond the annotated sample, the methods could offer practical improvements over rule-based extraction for financial document processing, with GPT4ItemSeg providing adaptability to regulatory changes.

major comments (1)

[Abstract] Abstract: the statement that models were 'trained and evaluated on 3,737 annotated 10-K reports' provides no information on train-test partitioning, cross-validation, temporal splits, or annotation protocol. Because the headline macro-F1 numbers (0.9825 for BERT4ItemSeg) are the sole quantitative support for superiority over CRF and rule-based baselines, the absence of a held-out evaluation protocol directly undermines the generalization claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment regarding the abstract. We address it point-by-point below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the statement that models were 'trained and evaluated on 3,737 annotated 10-K reports' provides no information on train-test partitioning, cross-validation, temporal splits, or annotation protocol. Because the headline macro-F1 numbers (0.9825 for BERT4ItemSeg) are the sole quantitative support for superiority over CRF and rule-based baselines, the absence of a held-out evaluation protocol directly undermines the generalization claim.

Authors: We agree that the abstract is overly concise and omits key details on the evaluation protocol. The full manuscript (Section 3.2) specifies that evaluation used 5-fold cross-validation on the 3,737 reports, with each fold held out during training, and hyperparameters tuned on the training portions only. The annotation protocol (double annotation by two experts with adjudication, inter-annotator agreement 0.92) is described in Section 2. We will revise the abstract to state: 'Trained and evaluated via 5-fold cross-validation on 3,737 annotated 10-K reports...' and briefly note the temporal stratification used to form folds. This directly addresses the held-out concern while preserving the reported scores. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper is an empirical ML study reporting F1 scores for segmentation models trained on 3,737 annotated 10-K reports and compared to external baselines (CRF, rule-based). No equations, first-principles derivations, or predictions appear in the abstract or described content. Claims do not reduce to self-defined quantities, fitted inputs renamed as predictions, or self-citation chains. Evaluation against held-out or benchmarked data is independent of the authors' own fitted parameters, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central performance claims rest on the quality of the human annotations and standard supervised-learning assumptions; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)

domain assumption The 3,737 human annotations correctly label item boundaries across varying 10-K formats.
All reported F1 scores are computed against these labels.

pith-pipeline@v0.9.0 · 5755 in / 1203 out tokens · 80578 ms · 2026-05-23T04:07:36.818267+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages · 3 internal anchors

[1]

10.1111/jofi.12885 Das, S. R. 2014. Text and context: Language analytics in finance. Foundations and Trends in Finance 8 (3):145-261. 10.1561/0500000045 Devlin, J., M. -W. Chang, K. Lee, and K. Toutanova. 2018. BERT: Pre -training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Ertugrul, M., J. Lei, J. Qiu, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1111/jofi.12885 2014
[2]

Bidirectional LSTM-CRF Models for Sequence Tagging

10.1093/rfs/hhq053 Hochreiter, S., and J. Schmidhuber. 1997. Long short -term memory. Neural Comput. 9 (8):1735–1780. 10.1162/neco.1997.9.8.1735 Hope, O.-K., D. Hu, and H. Lu. 2016. The benefits of specific risk -factor disclosures. Review of Accounting Studies 21 (4):1005-1045. Huang, Z., W. Xu, and K. Yu. 2015. Bidirectional LSTM-CRF models for sequence...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1093/rfs/hhq053 1997
[3]

10.1146/annurev-financial-012820-032249 Loughran, T. I. M., and B. McDonald. 2016. Textual analysis in accounting and finance: A survey. Journal of Accounting Research 54 (4):1187-1230. 10.1111/1475-679X.12123 Lyle, M. R., E. J. Riedl, and F. Siano. 2022. Changes in risk factor disclosures and the variance risk premium. Working paper. http://dx.doi.org/10...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1146/annurev-financial-012820-032249 2016
[4]

RESULTS OF OPERATIONS

GPT (Generative Pre -Trained Transformer)— A Comprehensive Review on Enabling Technologies, Potential Applications, Emerging Challenges, and Future Directions. IEEE Access 12:54608-54649. 10.1109/ACCESS.2024.3389497 Zhao, W. X., K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, and Z. Dong. 2023. A survey of large language models. ar...

work page doi:10.1109/access.2024.3389497 2024

[1] [1]

10.1111/jofi.12885 Das, S. R. 2014. Text and context: Language analytics in finance. Foundations and Trends in Finance 8 (3):145-261. 10.1561/0500000045 Devlin, J., M. -W. Chang, K. Lee, and K. Toutanova. 2018. BERT: Pre -training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Ertugrul, M., J. Lei, J. Qiu, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1111/jofi.12885 2014

[2] [2]

Bidirectional LSTM-CRF Models for Sequence Tagging

10.1093/rfs/hhq053 Hochreiter, S., and J. Schmidhuber. 1997. Long short -term memory. Neural Comput. 9 (8):1735–1780. 10.1162/neco.1997.9.8.1735 Hope, O.-K., D. Hu, and H. Lu. 2016. The benefits of specific risk -factor disclosures. Review of Accounting Studies 21 (4):1005-1045. Huang, Z., W. Xu, and K. Yu. 2015. Bidirectional LSTM-CRF models for sequence...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1093/rfs/hhq053 1997

[3] [3]

10.1146/annurev-financial-012820-032249 Loughran, T. I. M., and B. McDonald. 2016. Textual analysis in accounting and finance: A survey. Journal of Accounting Research 54 (4):1187-1230. 10.1111/1475-679X.12123 Lyle, M. R., E. J. Riedl, and F. Siano. 2022. Changes in risk factor disclosures and the variance risk premium. Working paper. http://dx.doi.org/10...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1146/annurev-financial-012820-032249 2016

[4] [4]

RESULTS OF OPERATIONS

GPT (Generative Pre -Trained Transformer)— A Comprehensive Review on Enabling Technologies, Potential Applications, Emerging Challenges, and Future Directions. IEEE Access 12:54608-54649. 10.1109/ACCESS.2024.3389497 Zhao, W. X., K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, and Z. Dong. 2023. A survey of large language models. ar...

work page doi:10.1109/access.2024.3389497 2024