TedEval: A Fair Evaluation Metric for Scene Text Detectors
Pith reviewed 2026-05-25 11:18 UTC · model grok-4.3
The pith
TedEval scores scene text detectors by instance matching plus character-level success to align with recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TedEval evaluates text detections by an instance-level matching and a character-level scoring. Based on a firm standard rewarding behaviors that result in successful recognition, TedEval can act as a reliable standard for comparing and quantizing the detection quality throughout all difficulty levels.
What carries the argument
Instance-level matching combined with character-level scoring that rewards only detections supporting successful downstream recognition.
If this is right
- Detectors can be compared fairly even when they produce detections of varying completeness or span multiple lines.
- Character incompleteness is penalized proportionally instead of ignored or over-penalized.
- Evaluation remains stable across easy, medium, and hard text instances.
- Development effort can shift toward detections that actually improve recognition outcomes.
- TedEval supplies a single numeric score usable for ranking detectors at any difficulty.
Where Pith is reading between the lines
- Adoption could change which published detectors are viewed as state-of-the-art once results are re-evaluated.
- The same matching-plus-scoring pattern might transfer to other fine-grained detection tasks such as layout or diagram element detection.
- If TedEval becomes standard, training objectives that optimize directly for its score become a natural next step.
- A public leaderboard using TedEval would let researchers measure whether new detectors improve recognition-enabling quality rather than just IoU.
Load-bearing premise
Instance-level matching with character-level scoring correctly fixes the granularity, multiline, and incompleteness problems while matching what leads to actual recognition success.
What would settle it
An experiment that runs the same set of detections through both TedEval and an end-to-end recognizer and finds no positive correlation between TedEval scores and recognition accuracy.
Figures
read the original abstract
Despite the recent success of scene text detection methods, common evaluation metrics fail to provide a fair and reliable comparison among detectors. They have obvious drawbacks in reflecting the inherent characteristic of text detection tasks, unable to address issues such as granularity, multiline, and character incompleteness. In this paper, we propose a novel evaluation protocol called TedEval (Text detector Evaluation), which evaluates text detections by an instance-level matching and a character-level scoring. Based on a firm standard rewarding behaviors that result in successful recognition, TedEval can act as a reliable standard for comparing and quantizing the detection quality throughout all difficulty levels. In this regard, we believe that TedEval can play a key role in developing state-of-the-art scene text detectors. The code is publicly available at https://github.com/clovaai/TedEval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TedEval, a new evaluation metric for scene text detectors using instance-level matching combined with character-level scoring. It claims this protocol fairly addresses limitations of prior metrics (granularity, multiline handling, character incompleteness) and provides a reliable standard for comparing detectors across difficulty levels by rewarding detections that enable successful recognition.
Significance. If the proposed matching and scoring rules demonstrably align with downstream OCR success, TedEval could improve fairness and consistency in benchmarking scene text detectors. The public release of code at the cited GitHub repository supports reproducibility and is a strength.
major comments (2)
- [Abstract] Abstract: the central claim that TedEval supplies a 'firm standard rewarding behaviors that result in successful recognition' and can 'act as a reliable standard' is presented without any quantitative grounding (e.g., correlation with OCR accuracy, ablation against prior metrics on recognition rates, or human preference data). This leaves the alignment between the instance+character protocol and actual recognition success as an untested modeling assumption rather than a demonstrated property.
- [Abstract] Abstract: no evidence or protocol is supplied showing that the instance-level matching plus character-level scoring resolves the stated drawbacks (granularity, multiline, character incompleteness) in a manner that produces scores predictive of recognition success; without such validation the metric remains a new construction whose superiority is asserted rather than measured.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract claims. We address each major comment below and agree that additional clarification on the grounding of TedEval would improve the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that TedEval supplies a 'firm standard rewarding behaviors that result in successful recognition' and can 'act as a reliable standard' is presented without any quantitative grounding (e.g., correlation with OCR accuracy, ablation against prior metrics on recognition rates, or human preference data). This leaves the alignment between the instance+character protocol and actual recognition success as an untested modeling assumption rather than a demonstrated property.
Authors: The 'firm standard' is the character-level scoring rule, which by construction assigns partial credit proportional to correctly detected characters within each matched instance. This directly rewards detections that supply the complete character set needed for recognition while penalizing incompleteness. Instance-level matching further ensures granularity and multiline cases are handled without the fragmentation issues of prior metrics. Although the current manuscript does not report numerical correlations with specific OCR engines, the protocol's alignment follows from its design rather than an untested assumption. We will revise the abstract and add a clarifying paragraph in Section 3 to make this design rationale explicit. revision: yes
-
Referee: [Abstract] Abstract: no evidence or protocol is supplied showing that the instance-level matching plus character-level scoring resolves the stated drawbacks (granularity, multiline, character incompleteness) in a manner that produces scores predictive of recognition success; without such validation the metric remains a new construction whose superiority is asserted rather than measured.
Authors: Sections 3 and 4 of the manuscript define the matching and scoring rules and illustrate, via examples and benchmark comparisons, how they resolve granularity (by treating each text line as a single instance), multiline (by avoiding split penalties), and character incompleteness (by character-level rather than binary scoring). These changes produce scores that vary continuously with detection quality in ways prior metrics do not. We acknowledge that explicit predictive validation against downstream OCR accuracy is not present and would strengthen the paper; we will add a short discussion subsection relating TedEval scores to recognition feasibility in the revision. revision: yes
Circularity Check
TedEval is a definitional construction with no circular reductions to inputs or self-citations
full rationale
The paper introduces TedEval as a new protocol defined via instance-level matching and character-level scoring, presented as a direct construction without reference to fitted parameters, prior self-citations as load-bearing premises, or reductions of the metric to its own outputs. No equations or claims equate the proposed scoring to pre-existing fitted results by construction. This matches the default case of a self-contained metric definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Common evaluation metrics have obvious drawbacks in reflecting the inherent characteristic of text detection tasks, unable to address issues such as granularity, multiline, and character incompleteness.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
TedEval evaluates text detections by an instance-level matching and a character-level scoring... pseudo character centers... multiline prevention via angles
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Y . Baek, B. Lee, D. Han, S. Yun, and H. Lee. Character region awareness for text detection. In CVPR, pages 4321–
- [2]
-
[3]
D. Deng, H. Liu, X. Li, and D. Cai. Pixellink: Detecting scene text via instance segmentation. In AAAI, 2018. 4
work page 2018
-
[4]
M. Everingham, S. M. Eslami, L. Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Com- puter Vision, 111(1):98–136, 2015. 1
work page 2015
-
[5]
H. Hu, C. Zhang, Y . Luo, Y . Wang, J. Han, and E. Ding. Wordsup: Exploiting word annotations for character based text detection. In ICCV, 2017. 4
work page 2017
-
[6]
M. Liao, B. Shi, and X. Bai. Textboxes++: A single-shot oriented scene text detector. Image Processing, 27(8):3676– 3690, 2018. 4
work page 2018
-
[7]
J. Liu, X. Liu, J. Sheng, D. Liang, X. Li, and Q. Liu. Pyramid mask text detector. arXiv preprint arXiv:1903.11800, 2019. 4
work page internal anchor Pith review Pith/arXiv arXiv 1903
-
[8]
X. Liu, D. Liang, S. Yan, D. Chen, Y . Qiao, and J. Yan. Fots: Fast oriented text spotting with a unified network. In CVPR, pages 5676–5685, 2018. 4
work page 2018
-
[9]
Y . Liu, L. Jin, Z. Xie, C. Luo, S. Zhang, and L. Xie. Tightness-aware evaluation protocol for scene text detection. In CVPR, pages 4321–4330. IEEE, 2019. 1
work page 2019
-
[10]
P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai. Mask textspot- ter: An end-to-end trainable neural network for spotting text with arbitrary shapes. arXiv preprint arXiv:1807.02242 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y . Zheng, and X. Xue. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 20(11):3111– 3122, 2018. 4
work page 2018
-
[12]
B. Shi, X. Bai, and S. Belongie. Detecting oriented text in natural images by linking segments. In CVPR, pages 3482–
-
[13]
Z. Tian, W. Huang, T. He, P. He, and Y . Qiao. Detecting text in natural image with connectionist text proposal network. In ECCV, pages 56–72. Springer, 2016. 1, 4
work page 2016
-
[14]
C. Wolf and J.-M. Jolion. Object count/area graphs for the evaluation of object detection and segmentation algorithms. In ICDAR, pages 1115–1124. IEEE, 2013. 1
work page 2013
-
[15]
X. Zhou, C. Yao, H. Wen, Y . Wang, S. Zhou, W. He, and J. Liang. East: an efficient and accurate scene text detector. In CVPR, pages 2642–2651, 2017. 1, 4 A. Matching matrix Missing characters R 0.5 P 0.5 H 0.5 Many-to-one R 1.0 P 1.0 H 1.0 Overlap characters R 0.75 P 0.75 H 0.75 Multiline R 0.0 P 0.0 H 0.0 One-to-one R 1.0 P 1.0 H 1.0 𝑫𝟏 𝑫𝟐 𝒔𝒊𝒌 𝑹𝒆𝒄𝒂𝒍𝒍 𝑮𝟏 ...
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.