From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding
Pith reviewed 2026-05-22 07:32 UTC · model grok-4.3
The pith
A new 10k-receipt benchmark with four task levels, paired with metric-aware GRPO training, lets multimodal models surpass proprietary systems on receipt reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReceiptBench organizes receipt understanding into four hierarchical subtasks of increasing difficulty, and a two-stage training framework uses Metric-Aware Group Relative Policy Optimization (GRPO) to convert evaluation constraints into reinforcement learning signals. Experiments show that this combination produces state-of-the-art results that surpass leading proprietary multimodal models, especially on semantic reasoning and structure parsing.
What carries the argument
Metric-Aware Group Relative Policy Optimization (GRPO), which translates strict evaluation constraints on output format and structure into reinforcement learning rewards to improve consistency across the four task levels.
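A minimal sketch of how a metric-derived reward could feed the group-relative advantage used in GRPO. The reward terms and weights (`metric_reward`, `w_format`, `w_fields`) and the sample outputs are hypothetical illustrations, not the paper's reward design; only the normalization step (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
import json
import statistics


def metric_reward(output: str, gold: dict, w_format: float = 0.2, w_fields: float = 0.8) -> float:
    """Hypothetical reward: valid JSON earns w_format, exact field matches earn up to w_fields."""
    try:
        pred = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output violates the format constraint
    if not isinstance(pred, dict) or not gold:
        return 0.0
    matched = sum(1 for key, value in gold.items() if pred.get(key) == value)
    return w_format + w_fields * matched / len(gold)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Score each sampled response relative to the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up gold annotation and three sampled responses for one receipt.
gold = {"std_total": "1000.00", "type": "hotel"}
samples = [
    '{"std_total": "1000.00", "type": "hotel"}',   # all fields correct
    '{"std_total": "1.000,00", "type": "hotel"}',  # amount left unnormalized
    "total is 1000",                               # not valid JSON
]
rewards = [metric_reward(s, gold) for s in samples]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, the response that breaks the output format receives the lowest advantage in its group, which is the mechanism by which format and structure violations would be penalized.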
If this is right
- Models reach higher accuracy on inferring implicit attributes and parsing nested items than current proprietary systems.
- The four-level task breakdown allows finer diagnosis of where models still fail in document understanding.
- Structural consistency improves because the optimization directly penalizes violations of format and nesting rules.
- The approach scales to larger sets of real-world receipts while maintaining the hierarchy from perception to reasoning.
Where Pith is reading between the lines
- Similar metric-driven optimization could transfer to other document types such as invoices or forms where nested structures and implicit fields also appear.
- If the gains hold outside the benchmark, automated receipt processing systems could reduce manual review steps in accounting workflows.
- The emphasis on semantic reasoning might encourage future models to handle incomplete or ambiguous receipts without extra human rules.
- Testing the same two-stage pipeline on video or multi-page documents would reveal whether the method depends on single-image receipt properties.
Load-bearing premise
The human annotations and four-level task structure in ReceiptBench truly reflect the variety and hidden details in real receipts, and the GRPO signals lead to genuine generalization rather than just fitting the benchmark metrics.
What would settle it
Run the trained models on a new collection of receipts gathered from different sources and time periods with no overlap in distribution to the original 10k set and check whether the reported gains on reasoning and parsing tasks remain.
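The check described above can be sketched as a comparison of gains on the original and the newly collected receipts. The function names and the per-example correctness flags are hypothetical placeholders for illustration, not results from the paper.

```python
def accuracy(correct_flags: list[bool]) -> float:
    """Share of examples a model answered correctly."""
    return sum(correct_flags) / len(correct_flags)


def gain_retention(base_in: list[bool], tuned_in: list[bool],
                   base_out: list[bool], tuned_out: list[bool]) -> float:
    """Fraction of the in-distribution gain that survives on out-of-distribution receipts."""
    gain_in = accuracy(tuned_in) - accuracy(base_in)
    gain_out = accuracy(tuned_out) - accuracy(base_out)
    if gain_in <= 0:
        raise ValueError("no in-distribution gain to compare against")
    return gain_out / gain_in


# Placeholder flags only: True means the model's output matched the annotation.
base_in = [True, False, False, False]
tuned_in = [True, True, True, False]
base_out = [True, False, False, False]
tuned_out = [True, True, False, False]
retention = gain_retention(base_in, tuned_in, base_out, tuned_out)
```

A retention near 1 would suggest the gains transfer to the new receipts; a value near 0 would point toward fitting the benchmark's own distribution.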
Original abstract
Extracting structured information from visual documents (Visual Information Extraction, VIE) is a cornerstone of business automation. While recent Multimodal Large Language Models (MLLMs) have shown promising capabilities, existing benchmarks suffer from critical limitations in scale and realism, lack semantic granularity, and fail to cover diverse document types. To bridge this gap, we introduce ReceiptBench, a large-scale, human-annotated benchmark consisting of 10k diverse receipts, organizing information extraction into four hierarchical sub-tasks: (1) Basic Perception for raw text spotting, (2) Format Normalization for strictly following standardization instructions, (3) Semantic Reasoning for inferring implicit attributes from context, and (4) Structure Parsing for handling nested line items. Furthermore, we propose a two-stage training framework incorporating Metric-Aware Group Relative Policy Optimization (GRPO), which translates rigorous evaluation constraints into reinforcement learning signals to enhance structural consistency. Extensive experiments demonstrate that our method yields state-of-the-art performance, surpassing leading proprietary models on complex reasoning tasks. We release our datasets and code at https://github.com/wwwT0ri/ReceiptBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ReceiptBench, a large-scale human-annotated benchmark of 10k diverse real-world receipts for visual information extraction (VIE). It organizes the task into four hierarchical sub-tasks—Basic Perception (raw text spotting), Format Normalization, Semantic Reasoning (inferring implicit attributes), and Structure Parsing (nested line items)—and proposes a two-stage training framework that incorporates Metric-Aware Group Relative Policy Optimization (GRPO) to convert evaluation constraints into reinforcement learning signals for improved structural consistency in MLLMs. The central claim is that this approach yields state-of-the-art performance, surpassing leading proprietary models on complex reasoning tasks, with datasets and code released.
Significance. If the reported gains from Metric-Aware GRPO reflect genuine improvements in semantic reasoning and structure parsing rather than adaptation to the new benchmark's specific metrics and annotation conventions, the work would offer a useful large-scale resource and training paradigm for document understanding. The public release of code and data is a clear strength that enables reproducibility and further research.
Major comments (2)
- [Experiments] The SOTA claim on complex reasoning tasks (abstract and Experiments section) is load-bearing for the paper's contribution, yet the provided description does not include quantitative metrics, baseline comparisons against zero-shot proprietary models, ablation studies isolating the GRPO reward signals, or error analysis on implicit-attribute inference. Without these, it remains unclear whether performance improvements arise from enhanced underlying capabilities or from optimization to ReceiptBench's four sub-task scoring rules.
- [§4.2] §4.2 (Metric-Aware GRPO): because the RL signals are explicitly derived from the same hierarchical evaluation constraints used to define ReceiptBench (Basic Perception, Format Normalization, Semantic Reasoning, Structure Parsing), additional evidence is required to rule out metric-specific reward hacking. Cross-benchmark transfer results or analysis of generalization to unseen receipt formats would directly address whether the method improves transferable document reasoning.
Minor comments (2)
- The abstract asserts 'extensive experiments' and 'state-of-the-art performance' without any numerical results or key tables; adding one or two headline numbers would improve readability.
- [Benchmark Construction] Clarify in the benchmark construction section how human annotations for implicit attributes in Semantic Reasoning were validated for inter-annotator agreement.
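One common way to quantify the agreement this comment asks about is Cohen's kappa over two annotators' labels. The sketch below is an assumption about how such a check might look; the source does not say which agreement statistic, if any, the authors used, and the label sequences are made-up placeholders.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected if each annotator labeled at random with their own label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Placeholder labels for the expense `type` field on four receipts.
annotator_a = ["hotel", "taxi", "hotel", "plane"]
annotator_b = ["hotel", "taxi", "other", "plane"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

Here the annotators agree on three of four items (observed agreement 0.75) against a chance agreement of 0.25, giving a kappa of about 0.67.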
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We appreciate the opportunity to clarify our experimental results and strengthen the manuscript's claims regarding the benefits of Metric-Aware GRPO. We respond to each major comment below.
- Referee: [Experiments] The SOTA claim on complex reasoning tasks (abstract and Experiments section) is load-bearing for the paper's contribution, yet the provided description does not include quantitative metrics, baseline comparisons against zero-shot proprietary models, ablation studies isolating the GRPO reward signals, or error analysis on implicit-attribute inference. Without these, it remains unclear whether performance improvements arise from enhanced underlying capabilities or from optimization to ReceiptBench's four sub-task scoring rules.
  Authors: We agree that the current presentation of results could more explicitly support the SOTA claim on complex reasoning. In the revised manuscript we will expand the Experiments section to include the full set of quantitative metrics, direct zero-shot comparisons against proprietary models, dedicated ablation studies isolating the GRPO reward components, and a focused error analysis on failures in implicit-attribute inference. These additions will make clear that observed gains reflect improved reasoning rather than benchmark-specific optimization.
  Revision: yes
- Referee: [§4.2] §4.2 (Metric-Aware GRPO): because the RL signals are explicitly derived from the same hierarchical evaluation constraints used to define ReceiptBench (Basic Perception, Format Normalization, Semantic Reasoning, Structure Parsing), additional evidence is required to rule out metric-specific reward hacking. Cross-benchmark transfer results or analysis of generalization to unseen receipt formats would directly address whether the method improves transferable document reasoning.
  Authors: We acknowledge the referee's concern that reward signals derived from the same hierarchical constraints could encourage metric-specific behavior. To directly address this, the revised version will include new cross-benchmark transfer experiments and results on held-out receipt formats with unseen layouts. These additions will demonstrate that performance gains generalize beyond ReceiptBench's specific scoring rules.
  Revision: yes
Circularity Check
No load-bearing circularity; benchmark and GRPO are independent contributions
The paper introduces ReceiptBench as a new human-annotated dataset with four explicitly defined hierarchical sub-tasks and proposes Metric-Aware GRPO as a separate two-stage RL framework that converts those evaluation constraints into training signals. No equations, self-citations, or derivations are shown that reduce the SOTA claims or generalization assertions to quantities defined by construction from the same fitted parameters or inputs. The contributions are described as distinct, with code and data released for external verification, making the central performance claims self-contained against the new benchmark rather than tautological.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: Metric-Aware Group Relative Policy Optimization (GRPO), which translates rigorous evaluation constraints into reinforcement learning signals to enhance structural consistency.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: hierarchical taxonomy of four capabilities: Perception, Normalization, Reasoning, and Structure
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Qwen3-vl technical report. arXiv preprint arXiv:2511.21631. Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. 2019. Scene text visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4291–4301. Lukas Blecher, Guillem Cucurull, ...
- [2] Icdar2019 competition on scanned receipt ocr and information extraction. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520. IEEE. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others. 2024. Gpt-4o syst...
- [3] Hierarchical multimodal transformers for multipage docvqa. Pattern Recognition, 144:109834. Jordy Van Landeghem, Rubèn Tito, Łukasz Borchmann, Michał Pietruszka, Pawel Joziak, Rafal Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Anckaert, Ernest Valveny, and 1 others. 2023. Document understanding dataset and evaluation (dude). In Proceedings o...
- [4] Xfund: A benchmark dataset for multilingual visually rich form understanding. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3214–3224. Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, and 1 others. 2025. Cc-ocr: A comprehensive and challenging ocr ...
- [5] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  Publaynet: largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1015–1022. IEEE. Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, and 1 others. 2025. Internvl3: Exploring advanced training and test-time reci...
- [6] Entity Verification. This dimension focuses on identifying the stakeholders involved in the transaction to establish legitimacy.
  • seller_name: The name of the merchant or service provider. As these refer to public business entities, they are not considered PII. The annotation must be faithful to the visual information on the receipt (e.g., logos, hea...
- [7] Financial Integrity. This dimension captures critical financial data to verify calculations and amounts.
  • orig_total: The total amount of the transaction as it appears visually in the raw text. This field captures the exact string from the document, including original separators (e.g., 1.000,00), without any normalization.
  • std_total: The normalized tot...
- [8] Spatio-Temporal Validation. This dimension validates when and where the expense occurred to ensure the context matches the business trip or transaction claim.
  • place: The location where the expense occurred, formatted as “Country-City” (e.g., UK-London). If the document only specifies a city, the country is added; if only the country is visible, the city...
- [9] Expense Classification. This dimension categorizes the nature of the transaction for accounting and reimbursement purposes.
  • type: A classification label selected from a standardized list: plane, train, ship, bus, taxi, metro, hotel, or other. Annotators determine this based on explicit keywords (e.g., “Flight” → plane) or implicit logic (e.g., “Double Roo...
- [10] Spatial Reasoning and Disambiguation (Task 3). The document contains two distinct addresses: the hotel’s physical address (top, "Ridgecrest") and the customer’s billing address (bottom-left, "Carlsbad"). The Base Model creates a hallucination by concatenating "United States" with the distractor address "Carlsbad" for the place field. This is a typical ...
- [11] The "Balance Due" Trap (Task 2 & 3). For the std_total field, the Base Model extracts "0.00" because the receipt explicitly states "Total Balance Due: $0.00" (indicating the bill has been paid). This reveals a lack of financial logic in general-purpose models. Our model correctly reasons that the effective transaction amount is the sum of charges (or th...
- [12] Semantic Mapping of Identifiers and Dates (Task 1). The receipt does not explicitly label an "Invoice Number" or "Invoice Date" using standard terminology. Instead, it uses the term "Account: 744376528" for the invoice identifier and presents the issuance date under the heading "Date". The Base Model fails to recognize these semantic synonyms, returning M...
- [13] Structural Completeness (Task 4). In the detail list extraction, the Base Model misses the last line item ("Tourism Levy"), likely due to its visual separation from the main table body or its small font size. Our model achieves full recall, capturing all line items including the tax details. This structural completeness is crucial for the arithmetic cons...
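The orig_total and std_total fields and the Task 4 failure case above suggest two simple checks, sketched here under stated assumptions. `normalize_amount` is a hypothetical normalizer that treats the last separator as the decimal mark only when one or two digits follow it (the source truncates the actual std_total rule), and `items_match_total` is an assumed form of an arithmetic consistency check between line items and the total. Neither is taken from the released code, and the amounts are made-up.

```python
from decimal import Decimal


def normalize_amount(raw: str) -> Decimal:
    """Hypothetical orig_total -> std_total step: strip symbols, unify separators."""
    cleaned = "".join(ch for ch in raw if ch.isdigit() or ch in ".,")
    last_sep = max(cleaned.rfind("."), cleaned.rfind(","))
    digits_after = len(cleaned) - last_sep - 1
    if last_sep != -1 and digits_after in (1, 2):
        # Assumption: a separator followed by 1-2 digits is the decimal mark.
        integer_part = cleaned[:last_sep].replace(".", "").replace(",", "")
        return Decimal(f"{integer_part or '0'}.{cleaned[last_sep + 1:]}")
    # Otherwise every separator is treated as a thousands separator.
    return Decimal(cleaned.replace(".", "").replace(",", ""))


def items_match_total(item_amounts: list[str], total: str,
                      tolerance: Decimal = Decimal("0.01")) -> bool:
    """Assumed consistency check: line items should sum to the total."""
    item_sum = sum((normalize_amount(a) for a in item_amounts), Decimal("0"))
    return abs(item_sum - normalize_amount(total)) <= tolerance


# Made-up amounts: dropping the last line item breaks the check.
complete = items_match_total(["120,00", "3,50"], "123,50")
missing_last_item = items_match_total(["120,00"], "123,50")
```

With these placeholder values, the full item list passes and the list missing its final item fails, mirroring how a dropped line item such as a levy or tax row would surface as an arithmetic mismatch.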