Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Abhirama Subramanyam Penamakuri; Aditya Rathore; Anand Mishra; Anik De; Devesh Sharma; Harshiv Shah; Pravin Kumar; Rajeev Yadav; Sagar Agarwal

arXiv: 2511.23071 · v2 · submitted 2025-11-28 · cs.CV · cs.AI · cs.CL

Pith reviewed 2026-05-17 03:53 UTC · model grok-4.3

keywords scene text recognition · Indian languages · dataset · benchmark · multilingual OCR · computer vision · text detection · OCR

The pith

A new dataset of over 100K words benchmarks scene text recognition for 11 Indian languages and English.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Bharat Scene Text Dataset (BSTD) as a large-scale benchmark to tackle the open challenge of Indian language scene text recognition. English scene text is considered nearly solved, but Indian languages lag due to script diversity, non-standard fonts, varying styles, and lack of datasets. BSTD includes more than 100K words from over 6,500 scene images across India and supports detection, script identification, cropped word recognition, and end-to-end recognition. Fine-tuning state-of-the-art English models on this data highlights the challenges and opportunities. The open-source release aims to advance research in assistive technology, search, and e-commerce for Indian contexts.

Core claim

We introduce the Bharat Scene Text Dataset (BSTD) - a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including scene text detection, script identification, cropped word recognition, and end-to-end scene text recognition. Evaluating state-of-the-art English models adapted for Indian languages reveals the specific difficulties involved.
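
For readers unfamiliar with how an end-to-end scene text task is usually scored, here is a minimal sketch. The abstract does not describe BSTD's evaluation protocol, so everything below is an assumption: axis-aligned boxes (real annotations are often polygons), an IoU threshold of 0.5, and exact transcription match, which is a common convention in scene text benchmarks. The names `iou`, `end_to_end_correct`, and `precision_recall` are hypothetical.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), axis-aligned.
Box = Tuple[float, float, float, float]
Word = Tuple[Box, str]  # a located word: (box, transcription)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def end_to_end_correct(preds: List[Word], truths: List[Word],
                       iou_threshold: float = 0.5) -> int:
    """Greedy one-to-one matching.

    A prediction counts as correct when it overlaps a not-yet-matched
    ground-truth word with IoU >= iou_threshold AND the transcriptions
    are identical strings.
    """
    used = set()
    correct = 0
    for p_box, p_text in preds:
        for k, (t_box, t_text) in enumerate(truths):
            if k in used:
                continue
            if p_text == t_text and iou(p_box, t_box) >= iou_threshold:
                used.add(k)
                correct += 1
                break
    return correct


def precision_recall(correct: int, n_pred: int, n_truth: int) -> Tuple[float, float]:
    """Precision and recall from the number of correct matches."""
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_truth if n_truth else 0.0
    return precision, recall
```

Under this scheme, detection alone would drop the transcription check, and cropped word recognition would drop the box check, which is one way to see how the four tasks relate.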

What carries the argument

The Bharat Scene Text Dataset (BSTD) as the core benchmark providing annotated real-world images for advancing Indian language scene text understanding tasks.

If this is right

Adapting English models to Indian languages becomes feasible but remains challenging due to script variations.
The dataset enables simultaneous research on detection, script identification, word recognition, and end-to-end systems.
Open-sourcing promotes community development of better models for multilingual scene text.
Real-world applications in India benefit from improved text recognition in diverse linguistic settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar datasets could be created for other languages with complex scripts to broaden global scene text capabilities.
Combining BSTD with English datasets might yield more robust multilingual models.
The identified challenges suggest exploring script-specific model designs beyond simple fine-tuning.
This benchmark could serve as a foundation for evaluating future advances in inclusive computer vision systems.

Load-bearing premise

The collected images and annotations are sufficiently diverse, high-quality, and representative of real-world Indian script variations to meaningfully advance recognition performance when English models are fine-tuned on them.

What would settle it

Observing no improvement in recognition accuracy for Indian scene text when models are fine-tuned on BSTD versus English-only training, or finding that the dataset lacks sufficient coverage of script variations.
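
A comparison of this kind needs an agreed metric. The sketch below shows two standard ones for cropped word recognition, exact-match word accuracy and character error rate (CER), applied to two sets of predictions on the same ground truth. It is an illustration only: the abstract reports no metrics or numbers, the function names are hypothetical, and the toy strings are invented. Note one assumption that matters for Indic scripts: Python strings are compared per Unicode code point here, not per grapheme cluster, so a missing vowel sign counts as one character error.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, in Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def word_accuracy(preds: Sequence[str], truths: Sequence[str]) -> float:
    """Fraction of words whose prediction matches the ground truth exactly."""
    if len(preds) != len(truths) or not truths:
        raise ValueError("preds and truths must be non-empty and equal length")
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)


def char_error_rate(preds: Sequence[str], truths: Sequence[str]) -> float:
    """Total edit distance divided by total ground-truth length."""
    if len(preds) != len(truths) or not truths:
        raise ValueError("preds and truths must be non-empty and equal length")
    total_edits = sum(edit_distance(p, t) for p, t in zip(preds, truths))
    total_chars = sum(len(t) for t in truths)
    return total_edits / total_chars


if __name__ == "__main__":
    # Invented toy data, not taken from BSTD.
    truths = ["चाय", "SHOP", "दुकान"]
    english_only = ["चय", "SHOP", "दकन"]
    fine_tuned = ["चाय", "SHOP", "दुकन"]
    for name, preds in [("english-only", english_only), ("fine-tuned", fine_tuned)]:
        print(name, word_accuracy(preds, truths), char_error_rate(preds, truths))
```

If the fine-tuned predictions showed no gain on either metric over a held-out BSTD test split, that would be the negative outcome described above.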

Original abstract

Reading scene text, that is, text appearing in images, has numerous application areas, including assistive technology, search, and e-commerce. Although scene text recognition in English has advanced significantly and is often considered nearly a solved problem, Indian language scene text recognition remains an open challenge. This is due to script diversity, non-standard fonts, and varying writing styles, and, more importantly, the lack of high-quality datasets and open-source models. To address these gaps, we introduce the Bharat Scene Text Dataset (BSTD) - a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition. We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages. Our results highlight the challenges and opportunities in Indian language scene text recognition. We believe that this dataset represents a significant step toward advancing research in this domain. All our models and data are open source.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

Dataset announcement for Indian scene text that fills a coverage gap but stays unverified without the full paper.

Hi, the main thing to know is that this paper introduces the Bharat Scene Text Dataset as a new resource with over 100K words spanning 11 Indian languages plus English, drawn from more than 6,500 scene images. It targets detection, script identification, cropped word recognition, and end-to-end recognition, with some fine-tuning of English models to illustrate the difficulties. That scale and language spread is the core offering, and it is positioned as open source to help close the gap left by mostly English-focused prior work.

The authors note script diversity, non-standard fonts, and writing styles as reasons Indian scene text has lagged, which aligns with known practical needs in assistive tech and local applications. On the credit side, the multi-task annotations and regional capture across linguistic areas show an attempt to build something usable beyond single-language collections. The open release of models and data is straightforward and helpful for follow-on work.

The soft spots stand out because we only have the abstract. No details appear on image collection protocols, annotation guidelines, quality checks, or inter-annotator agreement. There are also no numbers from the fine-tuning experiments, so claims about highlighting challenges or advancing performance remain untested. The assumption that the data is representative enough to drive real gains when English models are adapted cannot be checked here. This leaves the central value dependent on what the full paper actually demonstrates.

The paper is aimed at computer vision researchers working on multilingual scene text or dataset construction for non-Latin scripts. A reader focused on Indian-language applications or low-resource OCR benchmarks would get the most from it once the methods and results are available. I would send it for peer review so the collection process, annotation rigor, and any quantitative outcomes can be examined properly.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces the Bharat Scene Text Dataset (BSTD), a large-scale benchmark for Indian language scene text recognition. It comprises more than 100K words spanning 11 Indian languages and English, sourced from over 6,500 scene images captured across linguistic regions of India. The dataset supports four tasks: scene text detection, script identification, cropped word recognition, and end-to-end scene text recognition. The authors adapt and fine-tune state-of-the-art English models on BSTD, reporting that the results highlight challenges and opportunities in the domain, with all data and models released as open source.

Significance. If the dataset proves to be diverse, accurately annotated, and representative of real-world Indian script variations, this work could meaningfully advance multilingual scene text recognition by providing a much-needed resource beyond English-centric datasets. The open-source release of data and models would further strengthen its potential impact on assistive technology, search, and e-commerce applications in India.

major comments (1)

Abstract: The central claims that the dataset is 'meticulously annotated,' captures 'script diversity, non-standard fonts, and varying writing styles,' and that fine-tuning English models 'highlight the challenges and opportunities' are load-bearing for the contribution, yet the abstract provides no information on image capture protocols, annotation guidelines, quality assurance, inter-annotator agreement, language/region breakdowns, or any quantitative evaluation results. Without these details, the claims that BSTD meaningfully advances recognition performance cannot be assessed.

minor comments (1)

Abstract: Consider adding one or two specific quantitative highlights (e.g., language distribution or baseline performance deltas) to better convey the scale and impact within the word limit.
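
As an illustration of the kind of language-distribution summary this comment asks for, the sketch below tallies words per language from a list of annotation records. The record layout (a `language` key per word) is a hypothetical schema chosen for the example, not BSTD's actual annotation format, and the sample records are invented.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple


def language_distribution(
    words: Iterable[Mapping[str, str]],
) -> Dict[str, Tuple[int, float]]:
    """Map each language to (word count, share of all words).

    Assumes every record carries a 'language' key (hypothetical schema).
    Languages are returned in descending order of count.
    """
    counts = Counter(w["language"] for w in words)
    total = sum(counts.values())
    return {lang: (n, n / total) for lang, n in counts.most_common()}


if __name__ == "__main__":
    # Invented sample records, not taken from BSTD.
    sample = [
        {"language": "hindi", "text": "चाय"},
        {"language": "hindi", "text": "दुकान"},
        {"language": "english", "text": "SHOP"},
        {"language": "hindi", "text": "बाज़ार"},
    ]
    for lang, (n, share) in language_distribution(sample).items():
        print(f"{lang}: {n} words ({share:.0%})")
```

A per-language table produced this way, with one or two baseline accuracy figures alongside, would be enough to address the comment.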

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for highlighting areas where the abstract could better support the manuscript's central claims. We address the major comment point by point below.

Referee: Abstract: The central claims that the dataset is 'meticulously annotated,' captures 'script diversity, non-standard fonts, and varying writing styles,' and that fine-tuning English models 'highlight the challenges and opportunities' are load-bearing for the contribution, yet the abstract provides no information on image capture protocols, annotation guidelines, quality assurance, inter-annotator agreement, language/region breakdowns, or any quantitative evaluation results. Without these details, the claims that BSTD meaningfully advances recognition performance cannot be assessed.

Authors: We agree that the abstract, as currently written, is too high-level to fully substantiate these claims on its own. In the revised manuscript we will expand the abstract to include concise statements on image capture (mobile-phone photography across linguistic regions of India), the annotation workflow (native-speaker annotators with multi-stage quality assurance), and representative quantitative results (e.g., baseline detection and recognition accuracies on the four tasks). Detailed annotation guidelines, inter-annotator agreement statistics, and per-language/region breakdowns are already provided in Section 3 of the full paper; we will add a brief parenthetical reference to that section in the abstract. These changes will make the abstract self-contained while remaining within length limits.

revision: yes

Circularity Check

0 steps flagged

No circularity: dataset introduction paper with no derived quantities or load-bearing self-citations

The paper introduces the Bharat Scene Text Dataset as an empirical contribution consisting of collected images and annotations for Indian language scene text tasks. No equations, fitted parameters, predictions, or first-principles derivations are present in the abstract or described claims. The central assertion—that the new dataset advances research by providing scale and coverage—does not reduce to any prior result by construction or self-citation chain. Evaluation of English models via fine-tuning is mentioned only at a high level without quantitative reduction to inputs. This is a standard dataset/benchmark paper whose value rests on external verification of collection and annotation quality rather than internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper contributes a dataset and benchmark without introducing new mathematical free parameters, ad-hoc axioms, or invented physical entities; it relies on standard computer-vision dataset practices.

axioms (1)

domain assumption: Standard scene-text annotation practices produce reliable labels for detection, script ID, and recognition tasks.
Implicit in the claim of meticulous annotation and multi-task support.

pith-pipeline@v0.9.0 · 5543 in / 1264 out tokens · 32247 ms · 2026-05-17T03:53:01.175866+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We introduce the Bharat Scene Text Dataset (BSTD) ... supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.