arxiv: 2604.22754 · v1 · submitted 2026-02-19 · 💻 cs.CV · cs.CL

HalalBench: A Multilingual OCR Benchmark for Food Packaging Ingredient Extraction

Pith reviewed 2026-05-15 20:41 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords OCR benchmark · food packaging · multilingual OCR · ingredient extraction · halal verification · synthetic dataset · COCO annotations

The pith

HalalBench provides the first multilingual OCR benchmark for food packaging ingredient labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

No standard benchmark exists for OCR on food packaging despite its importance for halal verification. The paper creates HalalBench with 1,043 images spanning 14 languages to fill this gap. Evaluations of four OCR engines show low performance, with F1 scores below 0.2 and complete failure on Japanese. A post-processing algorithm boosts F1 by 36 percent. The dataset is validated in a real production scanner used in over 20 countries.

Core claim

The authors present HalalBench as the first open benchmark for multilingual OCR on food packaging, containing 1,043 images with 36,438 annotations, demonstrating that current engines struggle particularly with dense text, small fonts, and non-Latin scripts like Japanese.

What carries the argument

HalalBench dataset of real and synthetic food packaging images annotated in COCO format for ingredient text across 14 languages, used to benchmark OCR engines and test post-processing.
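
For readers unfamiliar with the COCO layout, here is a minimal sketch of what one annotated image might look like and how its annotations can be grouped. The `images` / `annotations` / `categories` lists and the `[x, y, width, height]` box convention are standard COCO. The `text` and `language` keys, the file name, and every value are assumptions for illustration only; the summary above does not say how HalalBench stores transcriptions.

```python
from collections import defaultdict

# Standard COCO detection layout: three top-level lists linked by integer ids.
# ASSUMPTION: "text" and "language" are hypothetical extension keys, and all
# values below are invented for illustration; they are not taken from HalalBench.
coco = {
    "images": [{"id": 1, "file_name": "example.jpg", "width": 1280, "height": 960}],
    "categories": [{"id": 1, "name": "text"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1,
         "bbox": [100, 200, 80, 14],  # [x, y, width, height] in pixels
         "text": "sugar", "language": "en"},
        {"id": 11, "image_id": 1, "category_id": 1,
         "bbox": [190, 200, 60, 14],
         "text": "salt", "language": "en"},
    ],
}


def annotations_by_image(dataset):
    """Group annotation records under the id of the image they belong to."""
    grouped = defaultdict(list)
    for ann in dataset["annotations"]:
        grouped[ann["image_id"]].append(ann)
    return dict(grouped)


grouped = annotations_by_image(coco)
words = [ann["text"] for ann in grouped[1]]
```

At the reported totals (36,438 annotations over 1,043 images), the benchmark averages roughly 35 annotations per image.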

If this is right

  • Popular OCR engines achieve F1 scores of 0.193 or lower on the benchmark.
  • A clustering-based post-processing step improves F1 scores by 36%.
  • All engines fail completely on Japanese text with F1 of 0.000.
  • The benchmark supports development of OCR for automated halal food verification.
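
To make the F1 figures above concrete, here is a minimal sketch of one common way to score OCR output: exact-match F1 over word tokens, with each token list treated as a multiset. The paper's actual matching protocol (for example, whether boxes must overlap, or how its fuzzy variant is defined) is not described in the abstract, so the function name `token_f1` and the toy ingredient lists are assumptions.

```python
from collections import Counter


def token_f1(predicted, reference):
    """Exact-match F1 between two token lists, each treated as a multiset."""
    pred_counts = Counter(predicted)
    ref_counts = Counter(reference)
    # Multiset intersection: each reference token can be matched at most once.
    true_positives = sum((pred_counts & ref_counts).values())
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(pred_counts.values())
    recall = true_positives / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data, not drawn from the benchmark.
reference = ["sugar", "palm", "oil", "gelatin", "salt"]
predicted = ["sugar", "pa1m", "oil", "salt"]  # one misread token, one missed
score = token_f1(predicted, reference)
```

In this toy case three of four predicted tokens are correct (precision 0.75) and three of five reference tokens are recovered (recall 0.60), giving an F1 of about 0.667. An engine that returns no correct tokens scores 0.0, which is the situation reported for Japanese.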

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This benchmark could help improve OCR accuracy for small text on curved surfaces in consumer products.
  • Halal scanners deployed in the real world may benefit from knowing where current OCR engines fail.
  • Future work could expand the dataset with more real images to better match production conditions.

Load-bearing premise

The synthetic images are representative enough of real food packaging challenges like curved surfaces and tiny fonts.

What would settle it

Running the same OCR engines on hundreds of additional real food packaging photos and finding substantially higher or lower accuracy than reported on HalalBench.

Figures

Figures reproduced from arXiv: 2604.22754 by Hasan Arief.

Figure 1: Four layout template families used in syn…
Figure 3: Engine comparison: exact and fuzzy F1 scores…
Figure 5: Speed-accuracy tradeoff for server-side en…
Figure 6: HalalLens production pipeline architecture.
Original abstract

No standardized benchmark exists for evaluating OCR on food packaging, despite its critical role in automated halal food verification. Existing benchmarks target documents or scene text, missing the unique challenges of ingredient labels: curved surfaces, dense multilingual text, and sub-8pt fonts. We present HalalBench, the first open multilingual benchmark for food packaging OCR, comprising 1,043 images (50 real, 993 synthetic) with 36,438 annotations in COCO format spanning 14 languages. We evaluate four engines: docTR achieves F1=0.193, ML Kit 0.180, EasyOCR 0.167, while all fail on Japanese (F1=0.000). A clustering ablation shows 36% F1 improvement from our post-processing algorithm. We validate findings through HalalLens (https://halallens.no), a production halal scanner serving 20+ countries. Dataset and code are released under open licenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces HalalBench, the first open multilingual OCR benchmark for food packaging ingredient extraction. It comprises 1,043 images (50 real, 993 synthetic) with 36,438 COCO-format annotations spanning 14 languages. The authors evaluate four OCR engines (docTR F1=0.193, ML Kit 0.180, EasyOCR 0.167, all failing on Japanese), demonstrate a 36% F1 improvement via clustering-based post-processing, and validate via the HalalLens production scanner serving 20+ countries. Dataset and code are released openly.

Significance. If the synthetic images accurately model real packaging distortions, this benchmark would address a genuine gap in OCR evaluation for practical domains like automated halal verification. The open data release and empirical baselines against existing engines are positive contributions that could support future method development. The low absolute F1 scores highlight task difficulty, but the work's value hinges on benchmark representativeness.

major comments (1)
  1. [Abstract] The claim that the 993 synthetic images represent real-world challenges (curved surfaces, dense multilingual text, sub-8pt fonts) is load-bearing for the reported F1 scores and 36% post-processing gain, yet no quantitative validation such as distribution matching on curvature, text density, or font-size histograms is provided between the 50 real and 993 synthetic images.
minor comments (1)
  1. [Abstract] The statement 'We validate findings through HalalLens' lacks any specifics on the validation methodology, metrics, or results from the production system.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on HalalBench. We address the single major comment below and will revise the manuscript to incorporate quantitative validation of the synthetic images.

Point-by-point responses
  1. Referee: [Abstract] The claim that the 993 synthetic images represent real-world challenges (curved surfaces, dense multilingual text, sub-8pt fonts) is load-bearing for the reported F1 scores and 36% post-processing gain, yet no quantitative validation such as distribution matching on curvature, text density, or font-size histograms is provided between the 50 real and 993 synthetic images.

    Authors: We agree that the manuscript would be strengthened by explicit quantitative validation showing that the synthetic images model the same distribution of challenges as the real ones. In the revised version we will add a dedicated subsection (and corresponding appendix figures) that reports: (1) font-size histograms computed from the COCO bounding-box heights for both sets, (2) text-density statistics (characters and words per image), and (3) curvature estimates obtained by fitting quadratic surfaces to the detected text regions. These comparisons will be presented alongside the existing qualitative examples to confirm that the synthetic generation pipeline reproduces the target real-world distortions.

    Revision: yes
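
The font-size comparison promised in the rebuttal could be sketched as follows. This is a rough illustration, not the authors' procedure: it reads box heights from COCO-style annotations and summarizes the gap between the real and synthetic sets with a two-sample Kolmogorov-Smirnov statistic, which is a technique swapped in here in place of the histograms the rebuttal mentions. The function names and the miniature inputs are assumptions.

```python
from bisect import bisect_right


def bbox_heights(coco):
    """Return the pixel height of every annotation box in a COCO-style dict."""
    # COCO stores boxes as [x, y, width, height].
    return [ann["bbox"][3] for ann in coco["annotations"]]


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical miniature stand-ins for the 50 real and 993 synthetic images.
real = {"annotations": [{"bbox": [0, 0, 40, 10]}, {"bbox": [0, 20, 35, 12]}]}
synthetic = {"annotations": [{"bbox": [5, 5, 50, 10]}, {"bbox": [5, 30, 42, 12]}]}

matched = ks_statistic(bbox_heights(real), bbox_heights(synthetic))
shifted = ks_statistic([20, 24], bbox_heights(real))
```

A statistic near 0 means the two height distributions are nearly indistinguishable; a value near 1 means they barely overlap. Box height in pixels is only a proxy for font size: converting to points would require the print resolution and physical label size, which the summary does not give.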

Circularity Check

0 steps flagged

No circularity: dataset release and empirical evaluation only

The paper introduces HalalBench (1,043 images, 50 real + 993 synthetic) and reports F1 scores for four existing OCR engines plus a post-processing ablation. No equations, fitted parameters, or derivations appear in the abstract or described content. The central claim is the benchmark itself; synthetic-image fidelity is a methodological assumption but is not defined in terms of the reported results or reduced by construction to any input. No self-citation chain, uniqueness theorem, or ansatz is invoked to support a derivation. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the contribution is the dataset and empirical comparison rather than a theoretical derivation.
