arxiv: 2605.22413 · v1 · pith:MMYTWSKOnew · submitted 2026-05-21 · 💻 cs.CV

From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding

Pith reviewed 2026-05-22 07:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords ReceiptBench · multimodal large language models · visual information extraction · document understanding · group relative policy optimization · semantic reasoning · hierarchical tasks · structure parsing

The pith

A new 10k-receipt benchmark with four task levels and metric-aware GRPO training lets multimodal models surpass proprietary systems on receipt reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing benchmarks for visual information extraction from documents lack scale, realism, and the ability to test deeper reasoning steps. This work builds ReceiptBench as a human-annotated set of 10,000 diverse receipts broken into ordered subtasks that start with spotting text, move to following exact formats, then infer unstated details, and finally parse nested line items. A two-stage training process adds Metric-Aware Group Relative Policy Optimization to turn evaluation rules into reinforcement signals that push models toward more consistent structured outputs. If the approach holds, models would move from basic recognition to reliable reasoning on the varied, noisy receipts that appear in everyday business settings. The central goal is to close the gap between current model performance and the demands of real document automation.

Core claim

ReceiptBench organizes receipt understanding into four hierarchical subtasks of increasing difficulty, and a two-stage framework uses Metric-Aware Group Relative Policy Optimization to convert evaluation constraints into reinforcement learning signals. The paper reports that this combination produces state-of-the-art results that surpass leading proprietary multimodal models, especially on semantic reasoning and structure parsing.

What carries the argument

Metric-Aware Group Relative Policy Optimization (GRPO), which translates strict evaluation constraints on output format and structure into reinforcement learning rewards to improve consistency across the four task levels.
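
To make the mechanism concrete, here is a minimal sketch of how an evaluation rule could become a reward and how GRPO-style group-relative advantages are then computed by normalizing rewards within a group of sampled outputs. It is an illustration under stated assumptions, not the paper's implementation: the function names, the 0.2/0.8 weighting, the exact-match rule, and the toy receipt are all hypothetical, and only the field names come from the paper's annotation guide.

```python
import json
from statistics import mean, pstdev

# Field names appear in the paper's annotation guide; treating exactly these
# four as the scored set is an assumption made for this sketch.
SCORED_FIELDS = ("seller_name", "std_total", "place", "type")


def metric_reward(output_text, gold):
    """Hypothetical metric-derived reward in [0, 1].

    Outputs that are not a JSON object score 0. Valid objects earn 0.2 for
    well-formedness plus 0.8 times the fraction of exactly matching fields.
    """
    try:
        pred = json.loads(output_text)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(pred, dict):
        return 0.0
    matched = sum(1 for key in SCORED_FIELDS if pred.get(key) == gold.get(key))
    return 0.2 + 0.8 * matched / len(SCORED_FIELDS)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy data, invented for illustration only.
gold = {"seller_name": "Hotel A", "std_total": "120.50",
        "place": "UK-London", "type": "hotel"}
group = [
    json.dumps(gold),                           # every field correct
    json.dumps({**gold, "std_total": "0.00"}),  # one wrong field
    "Total: 120.50",                            # not JSON at all
]
rewards = [metric_reward(text, gold) for text in group]
advantages = group_relative_advantages(rewards)
```

In this sketch the malformed output receives the lowest advantage in its group, which is the sense in which a format rule turns into a training signal. How the paper actually weights its constraints is not stated in the material reproduced here.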

If this is right

  • Models reach higher accuracy on inferring implicit attributes and parsing nested items than current proprietary systems.
  • The four-level task breakdown allows finer diagnosis of where models still fail in document understanding.
  • Structural consistency improves because the optimization directly penalizes violations of format and nesting rules.
  • The approach scales to larger sets of real-world receipts while maintaining the hierarchy from perception to reasoning.
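
The diagnostic value of the four-level breakdown can be sketched as a per-level accuracy report. This is a hedged illustration: the mapping from fields to levels, the `line_items` key, and exact-match scoring are assumptions made for the example, and the toy records are invented, not drawn from ReceiptBench.

```python
from collections import defaultdict

# Hypothetical assignment of one field to each of the four task levels.
# The paper defines the levels; this particular mapping is an assumption.
FIELD_LEVEL = {
    "orig_total": "basic_perception",
    "std_total": "format_normalization",
    "type": "semantic_reasoning",
    "line_items": "structure_parsing",  # hypothetical key for the detail list
}


def per_level_accuracy(predictions, golds):
    """Return exact-match accuracy per task level over paired records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pred, gold in zip(predictions, golds):
        for field, level in FIELD_LEVEL.items():
            if field not in gold:
                continue  # skip fields that were not annotated
            totals[level] += 1
            hits[level] += int(pred.get(field) == gold[field])
    return {level: hits[level] / totals[level] for level in totals}


# Toy records, invented for illustration only.
golds = [
    {"orig_total": "1.000,00", "std_total": "1000.00", "type": "hotel",
     "line_items": ["Room", "Tourism Levy"]},
    {"orig_total": "45.00", "std_total": "45.00", "type": "taxi",
     "line_items": ["Ride"]},
]
preds = [
    {"orig_total": "1.000,00", "std_total": "0.00", "type": "hotel",
     "line_items": ["Room"]},
    {"orig_total": "45.00", "std_total": "45.00", "type": "other",
     "line_items": ["Ride"]},
]
report = per_level_accuracy(preds, golds)
```

A report of this shape separates a model that reads text correctly but fails at normalization or nesting from one that fails at recognition itself.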

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar metric-driven optimization could transfer to other document types such as invoices or forms where nested structures and implicit fields also appear.
  • If the gains hold outside the benchmark, automated receipt processing systems could reduce manual review steps in accounting workflows.
  • The emphasis on semantic reasoning might encourage future models to handle incomplete or ambiguous receipts without extra human rules.
  • Testing the same two-stage pipeline on video or multi-page documents would reveal whether the method depends on single-image receipt properties.

Load-bearing premise

The human annotations and four-level task structure in ReceiptBench truly reflect the variety and hidden details in real receipts, and the GRPO signals lead to genuine generalization rather than just fitting the benchmark metrics.

What would settle it

Run the trained models on a new collection of receipts, gathered from different sources and time periods so that it does not overlap with the original 10k set, and check whether the reported gains on reasoning and parsing tasks persist.

Figures

Figures reproduced from arXiv: 2605.22413 by Jun Chen, Leilei Gan, Libin Zhan, Tiancheng Luo, Wang Dong, Yandi Wang, Yuxuan Jiang, Ziwei Huang.

Figure 1
Figure 1: Overview of the ReceiptBench Framework. (Top) Benchmark Construction: We curate 10k diverse invoices via web crawling and crowdsourcing. The benchmark defines a hierarchical taxonomy covering four capabilities: Basic Perception, Formatting, Semantic Reasoning, and Structural Parsing. (Bottom) Training Pipeline: To master these capabilities, we propose a Metric-Aware GRPO framework. The SFT model acts as th…
Figure 2
Figure 2: Holistic Evaluation. The chart compares model capabilities across the four sub-tasks. While proprietary models (gray) are balanced, our fine-tuned baseline (green) excels in domain-specific structure parsing. While GPT-5 scores only 0.4893 on this metric, our SFT approach substantially improves this to 0.6478 (Qwen3-VL-4B), validating the effectiveness of our pipeline in handling heterogeneous layouts. T…
Figure 3
Figure 3: Fine-grained Error Analysis on ReceiptBench. We compare the error patterns of our fine-tuned Qwen3-VL-4B against Gemini-3-Pro. (a) illustrates the divergent behavioral profiles, while (b) highlights the specific fields that pose the greatest challenges.
Figure 4
Figure 4: Qualitative comparison on a complex hotel folio. The One-shot Base Model (middle) falls into common visual and logical traps: extracting the billing address instead of the hotel location, and mistaking the "Balance Due" (0.00) for the total amount. In contrast, our Fine-tuned Model (right) correctly infers the semantic roles of fields and adheres to financial logic.
Original abstract

Extracting structured information from visual documents (Visual Information Extraction, VIE) is a cornerstone of business automation. While recent Multimodal Large Language Models (MLLMs) have shown promising capabilities, existing benchmarks suffer from critical limitations in scale and realism, lack semantic granularity, and fail to cover diverse document types. To bridge this gap, we introduce ReceiptBench, a large-scale, human-annotated benchmark consisting of 10k diverse receipts, organizing information extraction into four hierarchical sub-tasks: (1) Basic Perception for raw text spotting, (2) Format Normalization for strictly following standardization instructions, (3) Semantic Reasoning for inferring implicit attributes from context, and (4) Structure Parsing for handling nested line items. Furthermore, we propose a two-stage training framework incorporating Metric-Aware Group Relative Policy Optimization (GRPO), which translates rigorous evaluation constraints into reinforcement learning signals to enhance structural consistency. Extensive experiments demonstrate that our method yields state-of-the-art performance, surpassing leading proprietary models on complex reasoning tasks. We release our datasets and code at https://github.com/wwwT0ri/ReceiptBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ReceiptBench, a large-scale human-annotated benchmark of 10k diverse real-world receipts for visual information extraction (VIE). It organizes the task into four hierarchical sub-tasks—Basic Perception (raw text spotting), Format Normalization, Semantic Reasoning (inferring implicit attributes), and Structure Parsing (nested line items)—and proposes a two-stage training framework that incorporates Metric-Aware Group Relative Policy Optimization (GRPO) to convert evaluation constraints into reinforcement learning signals for improved structural consistency in MLLMs. The central claim is that this approach yields state-of-the-art performance, surpassing leading proprietary models on complex reasoning tasks, with datasets and code released.

Significance. If the reported gains from Metric-Aware GRPO reflect genuine improvements in semantic reasoning and structure parsing rather than adaptation to the new benchmark's specific metrics and annotation conventions, the work would offer a useful large-scale resource and training paradigm for document understanding. The public release of code and data is a clear strength that enables reproducibility and further research.

major comments (2)
  1. [Experiments] The SOTA claim on complex reasoning tasks (abstract and Experiments section) is load-bearing for the paper's contribution, yet the provided description does not include quantitative metrics, baseline comparisons against zero-shot proprietary models, ablation studies isolating the GRPO reward signals, or error analysis on implicit-attribute inference. Without these, it remains unclear whether performance improvements arise from enhanced underlying capabilities or from optimization to ReceiptBench's four sub-task scoring rules.
  2. [§4.2] §4.2 (Metric-Aware GRPO): because the RL signals are explicitly derived from the same hierarchical evaluation constraints used to define ReceiptBench (Basic Perception, Format Normalization, Semantic Reasoning, Structure Parsing), additional evidence is required to rule out metric-specific reward hacking. Cross-benchmark transfer results or analysis of generalization to unseen receipt formats would directly address whether the method improves transferable document reasoning.
minor comments (2)
  1. The abstract asserts 'extensive experiments' and 'state-of-the-art performance' without any numerical results or key tables; adding one or two headline numbers would improve readability.
  2. [Benchmark Construction] Clarify in the benchmark construction section how human annotations for implicit attributes in Semantic Reasoning were validated for inter-annotator agreement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the opportunity to clarify our experimental results and strengthen the manuscript's claims regarding the benefits of Metric-Aware GRPO. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Experiments] The SOTA claim on complex reasoning tasks (abstract and Experiments section) is load-bearing for the paper's contribution, yet the provided description does not include quantitative metrics, baseline comparisons against zero-shot proprietary models, ablation studies isolating the GRPO reward signals, or error analysis on implicit-attribute inference. Without these, it remains unclear whether performance improvements arise from enhanced underlying capabilities or from optimization to ReceiptBench's four sub-task scoring rules.

    Authors: We agree that the current presentation of results could more explicitly support the SOTA claim on complex reasoning. In the revised manuscript we will expand the Experiments section to include the full set of quantitative metrics, direct zero-shot comparisons against proprietary models, dedicated ablation studies isolating the GRPO reward components, and a focused error analysis on failures in implicit-attribute inference. These additions will make clear that observed gains reflect improved reasoning rather than benchmark-specific optimization. revision: yes

  2. Referee: [§4.2] §4.2 (Metric-Aware GRPO): because the RL signals are explicitly derived from the same hierarchical evaluation constraints used to define ReceiptBench (Basic Perception, Format Normalization, Semantic Reasoning, Structure Parsing), additional evidence is required to rule out metric-specific reward hacking. Cross-benchmark transfer results or analysis of generalization to unseen receipt formats would directly address whether the method improves transferable document reasoning.

    Authors: We acknowledge the referee's concern that reward signals derived from the same hierarchical constraints could encourage metric-specific behavior. To directly address this, the revised version will include new cross-benchmark transfer experiments and results on held-out receipt formats with unseen layouts. These additions will demonstrate that performance gains generalize beyond ReceiptBench's specific scoring rules. revision: yes

Circularity Check

0 steps flagged

No load-bearing circularity; benchmark and GRPO are independent contributions

full rationale

The paper introduces ReceiptBench as a new human-annotated dataset with four explicitly defined hierarchical sub-tasks and proposes Metric-Aware GRPO as a separate two-stage RL framework that converts those evaluation constraints into training signals. No equations, self-citations, or derivations are shown that reduce the SOTA claims or generalization assertions to quantities defined by construction from the same fitted parameters or inputs. The contributions are described as distinct, with code and data released for external verification, making the central performance claims self-contained against the new benchmark rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard multimodal LLM training practices and human annotation protocols without introducing new mathematical axioms, free parameters fitted to target results, or postulated entities.

pith-pipeline@v0.9.0 · 5748 in / 1129 out tokens · 77520 ms · 2026-05-22T07:32:32.584181+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 4 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Qwen3-vl technical report. arXiv preprint arXiv:2511.21631. Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. 2019. Scene text visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4291–4301. Lukas Blecher, Guillem Cucurull, ...

  2. [2]

    GPT-4o System Card

    Icdar2019 competition on scanned receipt ocr and information extraction. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520. IEEE. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others. 2024. Gpt-4o syst...

  3. [3]

    Hierarchical multimodal transformers for multipage docvqa. Pattern Recognition, 144:109834. Jordy Van Landeghem, Rubèn Tito, Łukasz Borchmann, Michał Pietruszka, Pawel Joziak, Rafal Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Anckaert, Ernest Valveny, and 1 others. 2023. Document understanding dataset and evaluation (dude). In Proceedings o...

  4. [4]

    In Findings of the Association for Computational Linguistics: ACL 2022, pages 3214–3224

    Xfund: A benchmark dataset for multilingual visually rich form understanding. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3214–3224. Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, and 1 others. 2025. Cc-ocr: A comprehensive and challenging ocr ...

  5. [5]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Publaynet: largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1015–1022. IEEE. Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, and 1 others. 2025. Internvl3: Exploring advanced training and test-time reci...

  6. [6]

    Country-City

    Entity Verification. This dimension focuses on identifying the stakeholders involved in the transaction to establish legitimacy. • seller_name: The name of the merchant or service provider. As these refer to public business entities, they are not considered PII. The annotation must be faithful to the visual information on the receipt (e.g., logos, hea...

  7. [7]

    • orig_total: The total amount of the trans- action as it appears visually in the raw text

    Financial Integrity. This dimension captures critical financial data to verify calculations and amounts. • orig_total: The total amount of the transaction as it appears visually in the raw text. This field captures the exact string from the document, including original separators (e.g., 1.000,00), without any normalization. • std_total: The normalized tot...

  8. [8]

    Country-City

    Spatio-Temporal Validation. This dimension validates when and where the expense occurred to ensure the context matches the business trip or transaction claim. • place: The location where the expense occurred, formatted as “Country-City” (e.g., UK-London). If the document only specifies a city, the country is added; if only the country is visible, the city...

  9. [9]

    Flight” → plane) or implicit logic (e.g., “Double Room

    Expense Classification. This dimension categorizes the nature of the transaction for accounting and reimbursement purposes. • type: A classification label selected from a standardized list: plane, train, ship, bus, taxi, metro, hotel, or other. Annotators determine this based on explicit keywords (e.g., “Flight” → plane) or implicit logic (e.g., “Double Roo...

  10. [10]

    Ridgecrest

    Spatial Reasoning and Disambiguation (Task 3). The document contains two distinct addresses: the hotel’s physical address (top, "Ridgecrest") and the customer’s billing address (bottom-left, "Carlsbad"). The Base Model creates a hallucination by concatenating "United States" with the distractor address "Carlsbad" for the place field. This is a typical ...

  11. [11]

    Balance Due

    The "Balance Due" Trap (Task 2 & 3). For the std_total field, the Base Model extracts "0.00" because the receipt explicitly states "Total Balance Due: $0.00" (indicating the bill has been paid). This reveals a lack of financial logic in general-purpose models. Our model correctly reasons that the effective transaction amount is the sum of charges (or th...

  12. [12]

    Invoice Number

    Semantic Mapping of Identifiers and Dates (Task 1). The receipt does not explicitly label an "Invoice Number" or "Invoice Date" using standard terminology. Instead, it uses the term "Account: 744376528" for the invoice identifier and presents the issuance date under the heading "Date". The Base Model fails to recognize these semantic synonyms, returning M...

  13. [13]

    Tourism Levy

    Structural Completeness (Task 4). In the detail list extraction, the Base Model misses the last line item ("Tourism Levy"), likely due to its visual separation from the main table body or its small font size. Our model achieves full recall, capturing all line items including the tax details. This structural completeness is crucial for the arithmetic cons...
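
The distinction in entry 7 between orig_total (the raw string, e.g. 1.000,00) and std_total (its normalized form) can be sketched as a small separator heuristic. The excerpt describing std_total is truncated, so the two-decimal target format is an assumption, as are the function name `normalize_total` and the rule that a final separator followed by one or two digits marks the decimal point.

```python
from decimal import Decimal


def normalize_total(orig_total: str) -> Decimal:
    """Hypothetical normalizer: raw amount string to a two-decimal Decimal.

    Heuristic (an assumption, not the paper's rule): if the last '.' or ','
    is followed by one or two digits, it is the decimal separator; every
    other '.' or ',' is treated as a thousands separator.
    """
    # Keep only ASCII digits and the two candidate separators.
    s = "".join(ch for ch in orig_total if ch in "0123456789.,")
    last = max(s.rfind("."), s.rfind(","))
    if last != -1 and 1 <= len(s) - last - 1 <= 2:
        integer_part, fraction = s[:last], s[last + 1:]
    else:
        integer_part, fraction = s, ""
    integer_part = integer_part.replace(".", "").replace(",", "")
    value = Decimal(f"{integer_part or '0'}.{fraction or '0'}")
    return value.quantize(Decimal("0.01"))


european = normalize_total("1.000,00")   # separators as in the paper's example
american = normalize_total("1,000.00")
paid_off = normalize_total("$0.00")
```

A heuristic like this cannot tell whether a string such as 1.000 means one thousand or one with three decimals. That ambiguity is one reason format normalization is scored separately from raw text spotting.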