
arxiv: 1907.01227 · v1 · pith:YNKUHD4Ynew · submitted 2019-07-02 · 💻 cs.CV

TedEval: A Fair Evaluation Metric for Scene Text Detectors

Pith reviewed 2026-05-25 11:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords: scene text detection, evaluation metric, TedEval, instance matching, character scoring, text detection evaluation, multiline text, recognition alignment

The pith

TedEval scores scene text detectors by instance-level matching followed by character-level scoring, so that a high score reflects detections a recognizer can actually read.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing metrics for scene text detection suffer from problems with granularity, handling of multiline text, and incomplete characters. The paper introduces TedEval, which performs instance-level matching followed by character-level scoring. This approach rewards detections that enable successful recognition rather than penalizing or ignoring partial matches in ways current standards do. The goal is a metric that gives consistent, reliable comparisons of detector quality at every difficulty level. Readers care because evaluation directly shapes which methods advance and how progress is measured.

Core claim

TedEval evaluates text detections by an instance-level matching and a character-level scoring. Based on a firm standard rewarding behaviors that result in successful recognition, TedEval can act as a reliable standard for comparing and quantizing the detection quality throughout all difficulty levels.

What carries the argument

Instance-level matching combined with character-level scoring that rewards only detections supporting successful downstream recognition.
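
The character-level idea can be sketched as follows. This is a minimal illustration under simplifying assumptions that are not taken from the paper: boxes are axis-aligned, text runs left to right, and the pseudo character centers (the PCC of Figure 3) are spaced evenly according to word length. All function names are hypothetical, and this is not the released TedEval code.

```python
from typing import List, Tuple

# Assumption: axis-aligned boxes given as (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def pseudo_character_centers(gt_box: Box, word_length: int) -> List[Point]:
    """Split the ground-truth box into `word_length` equal slices and
    return the center of each slice (a stand-in for the PCC)."""
    if word_length <= 0:
        raise ValueError("word_length must be positive")
    x0, y0, x1, y1 = gt_box
    step = (x1 - x0) / word_length
    center_y = (y0 + y1) / 2.0
    return [(x0 + (k + 0.5) * step, center_y) for k in range(word_length)]


def contains(box: Box, point: Point) -> bool:
    """True if the point lies inside the box (borders included)."""
    x0, y0, x1, y1 = box
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1


def character_recall(gt_box: Box, word_length: int, detections: List[Box]) -> float:
    """Fraction of pseudo character centers covered by at least one detection."""
    centers = pseudo_character_centers(gt_box, word_length)
    covered = sum(1 for c in centers if any(contains(d, c) for d in detections))
    return covered / word_length


# Hypothetical 10-character word occupying a 100 x 20 box.
gt = (0.0, 0.0, 100.0, 20.0)
full_detection = [(0.0, 0.0, 100.0, 20.0)]
half_detection = [(0.0, 0.0, 50.0, 20.0)]

recall_full = character_recall(gt, 10, full_detection)
recall_half = character_recall(gt, 10, half_detection)
```

In this toy setup a detection covering only the left half of the word earns half the credit, where a binary overlap threshold would give either full credit or none.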

If this is right

  • Detectors can be compared fairly even when they produce detections of varying completeness or span multiple lines.
  • Character incompleteness is penalized proportionally instead of ignored or over-penalized.
  • Evaluation remains stable across easy, medium, and hard text instances.
  • Development effort can shift toward detections that actually improve recognition outcomes.
  • TedEval supplies a single numeric score usable for ranking detectors at any difficulty.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption could change which published detectors are viewed as state-of-the-art once results are re-evaluated.
  • The same matching-plus-scoring pattern might transfer to other fine-grained detection tasks such as layout or diagram element detection.
  • If TedEval becomes standard, training objectives that optimize directly for its score become a natural next step.
  • A public leaderboard using TedEval would let researchers measure whether new detectors improve recognition-enabling quality rather than just IoU.

Load-bearing premise

Instance-level matching with character-level scoring correctly fixes the granularity, multiline, and incompleteness problems while matching what leads to actual recognition success.

What would settle it

An experiment that runs the same set of detections through both TedEval and an end-to-end recognizer and finds no positive correlation between TedEval scores and recognition accuracy.
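
One way such a check could be set up is sketched below, assuming each detector yields one metric score and one end-to-end recognition accuracy. The Pearson coefficient is only one possible choice of statistic, and the numbers are placeholders for illustration, not measurements from the paper or any benchmark.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values only: one entry per hypothetical detector.
metric_scores = [0.2, 0.4, 0.6, 0.8]
recognition_accuracy = [0.1, 0.3, 0.5, 0.7]

r = pearson(metric_scores, recognition_accuracy)
```

A coefficient near zero or below on real detector outputs would be the kind of evidence described above; a rank correlation could be substituted if only the ordering of detectors matters.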

Figures

Figures reproduced from arXiv: 1907.01227 by Chae Young Lee, Hwalsuk Lee, Youngmin Baek.

Figure 1: Examples of unfair evaluations. (a) rejected by …
Figure 2: Visualization of multiline computation. The angle …
Figure 3: An example of computing PCC of G_i. Red dot: PCC. Red dash: pseudo character box. Grey: G_i.
Figure 4: Frequency of factors that TedEval tackles counted by predictions. Numbers are represented as proportions to the …
Figure 5: Examples of incomplete detections. Numbers in caption indicate recall and precision scores of "SUPERKINGS."
Figure 6: Examples of scoring in various cases.
Figure 7: Granularity. (a) CTPN (R: 0.75, P: 1.00); (b) PixelLink (R: 0.90, P: 0.90); (c) WordSup (R: 0.25, P: 0.50); (d) MaskTS (R: 1.00, P: 1.00).
Figure 8: Completeness. (a) CTPN (R: 0.83, P: 1.00); (b) PixelLink (R: 0.83, P: 0.83); (c) MaskTS (R: 1.00, P: 1.00); (d) CRAFT (R: 0.83, P: 1.00).
Figure 9: Multiline. (a) TB++ (R: 0.97, P: 0.97); (b) WordSup (R: 0.80, P: 1.00); (c) FOTS (R: 0.92, P: 0.83); (d) CRAFT (R: 1.00, P: 1.00).
Figure 10: Text-in-text.
Original abstract

Despite the recent success of scene text detection methods, common evaluation metrics fail to provide a fair and reliable comparison among detectors. They have obvious drawbacks in reflecting the inherent characteristic of text detection tasks, unable to address issues such as granularity, multiline, and character incompleteness. In this paper, we propose a novel evaluation protocol called TedEval (Text detector Evaluation), which evaluates text detections by an instance-level matching and a character-level scoring. Based on a firm standard rewarding behaviors that result in successful recognition, TedEval can act as a reliable standard for comparing and quantizing the detection quality throughout all difficulty levels. In this regard, we believe that TedEval can play a key role in developing state-of-the-art scene text detectors. The code is publicly available at https://github.com/clovaai/TedEval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes TedEval, a new evaluation metric for scene text detectors using instance-level matching combined with character-level scoring. It claims this protocol fairly addresses limitations of prior metrics (granularity, multiline handling, character incompleteness) and provides a reliable standard for comparing detectors across difficulty levels by rewarding detections that enable successful recognition.

Significance. If the proposed matching and scoring rules demonstrably align with downstream OCR success, TedEval could improve fairness and consistency in benchmarking scene text detectors. The public release of code at the cited GitHub repository supports reproducibility and is a strength.

major comments (2)
  1. [Abstract] The central claim that TedEval supplies a 'firm standard rewarding behaviors that result in successful recognition' and can 'act as a reliable standard' is presented without any quantitative grounding (e.g., correlation with OCR accuracy, ablation against prior metrics on recognition rates, or human preference data). This leaves the alignment between the instance+character protocol and actual recognition success as an untested modeling assumption rather than a demonstrated property.
  2. [Abstract] No evidence or protocol is supplied showing that the instance-level matching plus character-level scoring resolves the stated drawbacks (granularity, multiline, character incompleteness) in a manner that produces scores predictive of recognition success; without such validation the metric remains a new construction whose superiority is asserted rather than measured.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract claims. We address each major comment below and agree that additional clarification on the grounding of TedEval would improve the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claim that TedEval supplies a 'firm standard rewarding behaviors that result in successful recognition' and can 'act as a reliable standard' is presented without any quantitative grounding (e.g., correlation with OCR accuracy, ablation against prior metrics on recognition rates, or human preference data). This leaves the alignment between the instance+character protocol and actual recognition success as an untested modeling assumption rather than a demonstrated property.

    Authors: The 'firm standard' is the character-level scoring rule, which by construction assigns partial credit proportional to correctly detected characters within each matched instance. This directly rewards detections that supply the complete character set needed for recognition while penalizing incompleteness. Instance-level matching further ensures granularity and multiline cases are handled without the fragmentation issues of prior metrics. Although the current manuscript does not report numerical correlations with specific OCR engines, the protocol's alignment follows from its design rather than an untested assumption. We will revise the abstract and add a clarifying paragraph in Section 3 to make this design rationale explicit.

    Revision: yes

  2. Referee: [Abstract] No evidence or protocol is supplied showing that the instance-level matching plus character-level scoring resolves the stated drawbacks (granularity, multiline, character incompleteness) in a manner that produces scores predictive of recognition success; without such validation the metric remains a new construction whose superiority is asserted rather than measured.

    Authors: Sections 3 and 4 of the manuscript define the matching and scoring rules and illustrate, via examples and benchmark comparisons, how they resolve granularity (by treating each text line as a single instance), multiline (by avoiding split penalties), and character incompleteness (by character-level rather than binary scoring). These changes produce scores that vary continuously with detection quality in ways prior metrics do not. We acknowledge that explicit predictive validation against downstream OCR accuracy is not present and would strengthen the paper; we will add a short discussion subsection relating TedEval scores to recognition feasibility in the revision.

    Revision: yes
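
The instance-level matching step mentioned in the responses can be pictured as a binary matrix with one row per ground-truth box and one column per detection. The sketch below is an illustration only: it assumes axis-aligned boxes and a single area-precision test, and the threshold value of 0.4 and all function names are hypothetical rather than taken from the manuscript.

```python
from typing import List, Tuple

# Assumption: axis-aligned boxes given as (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes (zero if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def matching_matrix(
    gts: List[Box], dets: List[Box], min_area_precision: float = 0.4
) -> List[List[int]]:
    """matrix[i][j] is 1 when ground truth i accounts for at least
    `min_area_precision` of the area of detection j (hypothetical rule)."""
    matrix = []
    for gt in gts:
        row = []
        for det in dets:
            det_area = area(det)
            precision = intersection_area(gt, det) / det_area if det_area > 0 else 0.0
            row.append(1 if precision >= min_area_precision else 0)
        matrix.append(row)
    return matrix


# One word detected as two pieces: a single row with two matches.
split_case = matching_matrix(
    gts=[(0.0, 0.0, 100.0, 20.0)],
    dets=[(0.0, 0.0, 50.0, 20.0), (50.0, 0.0, 100.0, 20.0)],
)

# Two words covered by one detection: two rows matching the same column.
merged_case = matching_matrix(
    gts=[(0.0, 0.0, 45.0, 20.0), (55.0, 0.0, 100.0, 20.0)],
    dets=[(0.0, 0.0, 100.0, 20.0)],
)
```

Row and column sums of such a matrix distinguish one-to-one, split, and merged cases, which is where granularity decisions would be made before any character-level scoring.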

Circularity Check

0 steps flagged

TedEval is a definitional construction with no circular reductions to inputs or self-citations

full rationale

The paper introduces TedEval as a new protocol defined via instance-level matching and character-level scoring, presented as a direct construction without reference to fitted parameters, prior self-citations as load-bearing premises, or reductions of the metric to its own outputs. No equations or claims equate the proposed scoring to pre-existing fitted results by construction. This matches the default case of a self-contained metric definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract only; no explicit free parameters, invented entities, or detailed axioms beyond the stated domain assumptions about metric shortcomings are provided. The ledger reflects only what is visible in the abstract.

axioms (1)
  • Domain assumption: Common evaluation metrics have obvious drawbacks in reflecting the inherent characteristic of text detection tasks, unable to address issues such as granularity, multiline, and character incompleteness.
    Directly stated in the abstract as the motivation for the new metric.




Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

  1. Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee. Character region awareness for text detection. In CVPR, pages 4321–

  2. A. Dangla, E. Puybareau, G. Tochon, and J. Fabrizio. A first step toward a fair comparison of evaluation protocols for text detection algorithms. In 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pages 345–350. IEEE, 2018.

  3. D. Deng, H. Liu, X. Li, and D. Cai. Pixellink: Detecting scene text via instance segmentation. In AAAI, 2018.

  4. M. Everingham, S. M. Eslami, L. Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.

  5. H. Hu, C. Zhang, Y. Luo, Y. Wang, J. Han, and E. Ding. Wordsup: Exploiting word annotations for character based text detection. In ICCV, 2017.

  6. M. Liao, B. Shi, and X. Bai. Textboxes++: A single-shot oriented scene text detector. IEEE Transactions on Image Processing, 27(8):3676–3690, 2018.

  7. J. Liu, X. Liu, J. Sheng, D. Liang, X. Li, and Q. Liu. Pyramid mask text detector. arXiv preprint arXiv:1903.11800, 2019.

  8. X. Liu, D. Liang, S. Yan, D. Chen, Y. Qiao, and J. Yan. Fots: Fast oriented text spotting with a unified network. In CVPR, pages 5676–5685, 2018.

  9. Y. Liu, L. Jin, Z. Xie, C. Luo, S. Zhang, and L. Xie. Tightness-aware evaluation protocol for scene text detection. In CVPR, pages 4321–4330. IEEE, 2019.

  10. P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. arXiv preprint arXiv:1807.02242,

  11. J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 20(11):3111–3122, 2018.

  12. B. Shi, X. Bai, and S. Belongie. Detecting oriented text in natural images by linking segments. In CVPR, pages 3482–

  13. Z. Tian, W. Huang, T. He, P. He, and Y. Qiao. Detecting text in natural image with connectionist text proposal network. In ECCV, pages 56–72. Springer, 2016.

  14. C. Wolf and J.-M. Jolion. Object count/area graphs for the evaluation of object detection and segmentation algorithms. In ICDAR, pages 1115–1124. IEEE, 2013.

  15. X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang. East: an efficient and accurate scene text detector. In CVPR, pages 2642–2651, 2017.

A. Matching matrix

Scoring examples by case (R: recall, P: precision, H as labelled in the figure):

  • Missing characters: R 0.5, P 0.5, H 0.5
  • Many-to-one: R 1.0, P 1.0, H 1.0
  • Overlap characters: R 0.75, P 0.75, H 0.75
  • Multiline: R 0.0, P 0.0, H 0.0
  • One-to-one: R 1.0, P 1.0, H 1.0

Matrix labels: D1, D2, s_ik, Recall, G1, ...
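
The scoring cases report recall (R), precision (P) and a combined value H. Assuming H is the usual harmonic mean of R and P (an assumption; the extracted text does not define it), the combination can be sketched as below. The function name is hypothetical, and the zero guard covers cases such as the multiline example where both R and P are 0.

```python
def harmonic_mean(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision; defined as 0 when both are 0."""
    total = recall + precision
    if total == 0:
        return 0.0
    return 2.0 * recall * precision / total


# Cases where R equals P give H equal to both.
h_missing = harmonic_mean(0.5, 0.5)
h_multiline = harmonic_mean(0.0, 0.0)

# Unequal values, e.g. the CTPN panel of Figure 7 (R 0.75, P 1.00).
h_ctpn = harmonic_mean(0.75, 1.00)
```

When R and P differ, the harmonic mean sits closer to the smaller of the two, so a detector cannot offset poor recall with high precision alone.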