TedEval: A Fair Evaluation Metric for Scene Text Detectors

Chae Young Lee; Hwalsuk Lee; Youngmin Baek

arxiv: 1907.01227 · v1 · pith:YNKUHD4Ynew · submitted 2019-07-02 · 💻 cs.CV

Pith reviewed 2026-05-25 11:18 UTC · model grok-4.3

keywords: scene text detection · evaluation metric · TedEval · instance matching · character scoring · text detection evaluation · multiline text · recognition alignment

The pith

TedEval scores scene text detectors by instance matching plus character-level success to align with recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing metrics for scene text detection suffer from problems with granularity, handling of multiline text, and incomplete characters. The paper introduces TedEval, which performs instance-level matching followed by character-level scoring. This approach rewards detections that enable successful recognition rather than penalizing or ignoring partial matches in ways current standards do. The goal is a metric that gives consistent, reliable comparisons of detector quality at every difficulty level. Readers care because evaluation directly shapes which methods advance and how progress is measured.

Core claim

TedEval evaluates text detections by an instance-level matching and a character-level scoring. Based on a firm standard rewarding behaviors that result in successful recognition, TedEval can act as a reliable standard for comparing and quantizing the detection quality throughout all difficulty levels.

What carries the argument

Instance-level matching combined with character-level scoring that rewards only detections supporting successful downstream recognition.
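
To make the character-level half of this concrete, here is a minimal sketch, not the authors' implementation. It assumes axis-aligned boxes, assumes that pseudo character centers (PCC) are obtained by splitting a ground-truth box into equal-width cells (one per character), and assumes the detections passed in have already survived instance-level matching. The names `Box`, `pseudo_character_centers`, and `character_recall` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box; real scene text boxes are usually quadrilaterals."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


def pseudo_character_centers(gt: Box, word_length: int) -> list[tuple[float, float]]:
    """Split the ground-truth box into word_length equal-width cells and
    return each cell's center (an assumed, simplified PCC construction)."""
    cell = (gt.x1 - gt.x0) / word_length
    y_mid = (gt.y0 + gt.y1) / 2
    return [(gt.x0 + (k + 0.5) * cell, y_mid) for k in range(word_length)]


def character_recall(gt: Box, word_length: int, detections: list[Box]) -> float:
    """Fraction of pseudo characters whose center lies in at least one detection."""
    centers = pseudo_character_centers(gt, word_length)
    hit = sum(any(d.contains(x, y) for d in detections) for x, y in centers)
    return hit / word_length


# A 10-character word with a detection covering only its left 60%.
gt = Box(0, 0, 100, 20)
partial = Box(0, 0, 60, 20)
print(character_recall(gt, 10, [partial]))  # 6 of 10 centers are covered
```

Under these assumptions a detection that misses four of ten characters earns partial credit rather than an all-or-nothing score, which is the behavior the summary above attributes to character-level scoring.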

If this is right

Detectors can be compared fairly even when they produce detections of varying completeness or span multiple lines.
Character incompleteness is penalized proportionally instead of ignored or over-penalized.
Evaluation remains stable across easy, medium, and hard text instances.
Development effort can shift toward detections that actually improve recognition outcomes.
TedEval supplies a single numeric score usable for ranking detectors at any difficulty.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Adoption could change which published detectors are viewed as state-of-the-art once results are re-evaluated.
The same matching-plus-scoring pattern might transfer to other fine-grained detection tasks such as layout or diagram element detection.
If TedEval becomes standard, training objectives that optimize directly for its score become a natural next step.
A public leaderboard using TedEval would let researchers measure whether new detectors improve recognition-enabling quality rather than just IoU.

Load-bearing premise

Instance-level matching with character-level scoring correctly fixes the granularity, multiline, and incompleteness problems while matching what leads to actual recognition success.

What would settle it

An experiment that runs the same set of detections through both TedEval and an end-to-end recognizer and finds no positive correlation between TedEval scores and recognition accuracy.
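
One way to run that check is a rank correlation between per-detector metric scores and recognition accuracy. The sketch below is an assumption about how such a test could be set up, not a procedure from the paper; it implements Spearman correlation with the standard library only, and the numbers in the usage lines are made-up placeholders, not measured results.

```python
from statistics import mean


def ranks(values: list[float]) -> list[float]:
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation, computed as Pearson correlation of the ranks.
    Raises ZeroDivisionError if either input is constant."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder values for illustration only (hypothetical, not from the paper).
metric_scores = [0.2, 0.5, 0.7, 0.9]
recognition_accuracy = [0.1, 0.4, 0.35, 0.8]
print(spearman(metric_scores, recognition_accuracy))
```

A coefficient near zero or below on real detector outputs would count against the premise; a clearly positive one would support it.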

Figures

Figures reproduced from arXiv: 1907.01227 by Chae Young Lee, Hwalsuk Lee, Youngmin Baek.

**Figure 2.** Visualization of multiline computation. The angle ...

**Figure 3.** An example of computing PCC of Gi. Red dot: PCC. Red dash: pseudo character box. Grey: Gi.

**Figure 4.** Frequency of factors that TedEval tackles counted by predictions. Numbers are represented as proportions to the ...

**Figure 5.** Examples of incomplete detections. Numbers in caption indicate recall and precision scores of "SUPERKINGS."

**Figure 6.** Examples of scoring in various cases.

**Figure 7.** Granularity. (a) CTPN (R: 0.75, P: 1.00); (b) PixelLink (R: 0.90, P: 0.90); (c) WordSup (R: 0.25, P: 0.50); (d) MaskTS (R: 1.00, P: 1.00).

**Figure 8.** Completeness. (a) CTPN (R: 0.83, P: 1.00); (b) PixelLink (R: 0.83, P: 0.83); (c) MaskTS (R: 1.00, P: 1.00); (d) CRAFT (R: 0.83, P: 1.00).

**Figure 9.** Multiline. (a) TB++ (R: 0.97, P: 0.97); (b) WordSup (R: 0.80, P: 1.00); (c) FOTS (R: 0.92, P: 0.83); (d) CRAFT (R: 1.00, P: 1.00).

**Figure 10.** Text-in-text.

Original abstract

Despite the recent success of scene text detection methods, common evaluation metrics fail to provide a fair and reliable comparison among detectors. They have obvious drawbacks in reflecting the inherent characteristic of text detection tasks, unable to address issues such as granularity, multiline, and character incompleteness. In this paper, we propose a novel evaluation protocol called TedEval (Text detector Evaluation), which evaluates text detections by an instance-level matching and a character-level scoring. Based on a firm standard rewarding behaviors that result in successful recognition, TedEval can act as a reliable standard for comparing and quantizing the detection quality throughout all difficulty levels. In this regard, we believe that TedEval can play a key role in developing state-of-the-art scene text detectors. The code is publicly available at https://github.com/clovaai/TedEval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

TedEval proposes instance-level matching plus character scoring to fix gaps in text detector metrics but supplies no evidence it tracks actual recognition success better.

TedEval is the main thing here. It proposes a new evaluation protocol for scene text detectors that does instance-level matching and then scores at the character level. The idea is to handle cases where existing metrics fall short on granularity, multiline text, and incomplete characters. The paper does a good job laying out why current metrics are problematic for this task. It frames the new metric around rewarding detections that would lead to successful recognition, which is a reasonable goal. Releasing the code on GitHub is a plus for anyone who wants to try it.

The soft spot is the lack of evidence that this new protocol actually does better at predicting recognition success. The abstract makes the claim, but there is no mention of experiments correlating TedEval scores with OCR accuracy or comparing it head-to-head with prior metrics on that basis. Without that, it is hard to know whether the design choices hold up in practice or introduce their own biases. The details of how the matching and scoring work are not fully visible here, so it is tough to assess whether they cover all edge cases such as very dense text or unusual fonts.

This is for researchers in computer vision working on scene text detection. Someone building or benchmarking detectors could get value from trying the metric, especially if the full paper has the implementation details and any validation results. It deserves a serious referee because improving evaluation in this area matters for the field, even if the current version needs more backing. I would recommend sending it to peer review with the expectation that reviewers will ask for empirical validation against downstream tasks.

Referee Report

2 major / 0 minor

Summary. The paper proposes TedEval, a new evaluation metric for scene text detectors using instance-level matching combined with character-level scoring. It claims this protocol fairly addresses limitations of prior metrics (granularity, multiline handling, character incompleteness) and provides a reliable standard for comparing detectors across difficulty levels by rewarding detections that enable successful recognition.

Significance. If the proposed matching and scoring rules demonstrably align with downstream OCR success, TedEval could improve fairness and consistency in benchmarking scene text detectors. The public release of code at the cited GitHub repository supports reproducibility and is a strength.

major comments (2)

[Abstract] The central claim that TedEval supplies a 'firm standard rewarding behaviors that result in successful recognition' and can 'act as a reliable standard' is presented without any quantitative grounding (e.g., correlation with OCR accuracy, ablation against prior metrics on recognition rates, or human preference data). This leaves the alignment between the instance+character protocol and actual recognition success as an untested modeling assumption rather than a demonstrated property.

[Abstract] No evidence or protocol is supplied showing that the instance-level matching plus character-level scoring resolves the stated drawbacks (granularity, multiline, character incompleteness) in a manner that produces scores predictive of recognition success; without such validation the metric remains a new construction whose superiority is asserted rather than measured.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract claims. We address each major comment below and agree that additional clarification on the grounding of TedEval would improve the manuscript.

Referee: [Abstract] The central claim that TedEval supplies a 'firm standard rewarding behaviors that result in successful recognition' and can 'act as a reliable standard' is presented without any quantitative grounding (e.g., correlation with OCR accuracy, ablation against prior metrics on recognition rates, or human preference data). This leaves the alignment between the instance+character protocol and actual recognition success as an untested modeling assumption rather than a demonstrated property.

Authors: The 'firm standard' is the character-level scoring rule, which by construction assigns partial credit proportional to correctly detected characters within each matched instance. This directly rewards detections that supply the complete character set needed for recognition while penalizing incompleteness. Instance-level matching further ensures granularity and multiline cases are handled without the fragmentation issues of prior metrics. Although the current manuscript does not report numerical correlations with specific OCR engines, the protocol's alignment follows from its design rather than an untested assumption. We will revise the abstract and add a clarifying paragraph in Section 3 to make this design rationale explicit.

Revision: yes

Referee: [Abstract] No evidence or protocol is supplied showing that the instance-level matching plus character-level scoring resolves the stated drawbacks (granularity, multiline, character incompleteness) in a manner that produces scores predictive of recognition success; without such validation the metric remains a new construction whose superiority is asserted rather than measured.

Authors: Sections 3 and 4 of the manuscript define the matching and scoring rules and illustrate, via examples and benchmark comparisons, how they resolve granularity (by treating each text line as a single instance), multiline (by avoiding split penalties), and character incompleteness (by character-level rather than binary scoring). These changes produce scores that vary continuously with detection quality in ways prior metrics do not. We acknowledge that explicit predictive validation against downstream OCR accuracy is not present and would strengthen the paper; we will add a short discussion subsection relating TedEval scores to recognition feasibility in the revision.

Revision: yes
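
The rebuttal's contrast between character-level and binary scoring can be illustrated with a one-dimensional toy model. This is a sketch under stated assumptions, not code from the paper: boxes are reduced to intervals along the text line, the binary rule is an IoU threshold of 0.5 (assumed here as a stand-in for IoU-based protocols), and the proportional rule counts equally spaced character centers covered by the detection. All function names are hypothetical.

```python
def iou_1d(gt: tuple[float, float], det: tuple[float, float]) -> float:
    """Intersection over union of two intervals along the text line."""
    inter = max(0.0, min(gt[1], det[1]) - max(gt[0], det[0]))
    union = (gt[1] - gt[0]) + (det[1] - det[0]) - inter
    return inter / union


def binary_score(gt: tuple[float, float], det: tuple[float, float],
                 threshold: float = 0.5) -> float:
    """All-or-nothing credit: 1.0 if IoU reaches the threshold, else 0.0."""
    return 1.0 if iou_1d(gt, det) >= threshold else 0.0


def proportional_score(gt: tuple[float, float], det: tuple[float, float],
                       n_chars: int) -> float:
    """Credit equal to the share of character centers inside the detection."""
    cell = (gt[1] - gt[0]) / n_chars
    centers = [gt[0] + (k + 0.5) * cell for k in range(n_chars)]
    return sum(det[0] <= c <= det[1] for c in centers) / n_chars


gt = (0.0, 100.0)  # a 10-character word spanning 100 units
for width in (40.0, 60.0, 100.0):
    det = (0.0, width)
    print(width, binary_score(gt, det), proportional_score(gt, det, 10))
```

In this toy setting the binary rule jumps from 0.0 to 1.0 between a 40% and a 60% detection, while the proportional rule moves from 0.4 to 0.6, which is the "varies continuously" behavior the authors describe.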

Circularity Check

0 steps flagged

TedEval is a definitional construction with no circular reductions to inputs or self-citations

full rationale

The paper introduces TedEval as a new protocol defined via instance-level matching and character-level scoring, presented as a direct construction without reference to fitted parameters, prior self-citations as load-bearing premises, or reductions of the metric to its own outputs. No equations or claims equate the proposed scoring to pre-existing fitted results by construction. This matches the default case of a self-contained metric definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract only; no explicit free parameters, invented entities, or detailed axioms beyond the stated domain assumptions about metric shortcomings are provided. The ledger reflects only what is visible in the abstract.

axioms (1)

domain assumption: Common evaluation metrics have obvious drawbacks in reflecting the inherent characteristic of text detection tasks, unable to address issues such as granularity, multiline, and character incompleteness.
Directly stated in the abstract as the motivation for the new metric.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

TedEval evaluates text detections by an instance-level matching and a character-level scoring... pseudo character centers... multiline prevention via angles

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

[1] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee. Character region awareness for text detection. In CVPR, pages 4321–

[2] A. Dangla, E. Puybareau, G. Tochon, and J. Fabrizio. A first step toward a fair comparison of evaluation protocols for text detection algorithms. In 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pages 345–350. IEEE, 2018.

[3] D. Deng, H. Liu, X. Li, and D. Cai. Pixellink: Detecting scene text via instance segmentation. In AAAI, 2018.

[4] M. Everingham, S. M. Eslami, L. Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.

[5] H. Hu, C. Zhang, Y. Luo, Y. Wang, J. Han, and E. Ding. Wordsup: Exploiting word annotations for character based text detection. In ICCV, 2017.

[6] M. Liao, B. Shi, and X. Bai. Textboxes++: A single-shot oriented scene text detector. Image Processing, 27(8):3676–3690, 2018.

[7] J. Liu, X. Liu, J. Sheng, D. Liang, X. Li, and Q. Liu. Pyramid mask text detector. arXiv preprint arXiv:1903.11800, 2019.

[8] X. Liu, D. Liang, S. Yan, D. Chen, Y. Qiao, and J. Yan. Fots: Fast oriented text spotting with a unified network. In CVPR, pages 5676–5685, 2018.

[9] Y. Liu, L. Jin, Z. Xie, C. Luo, S. Zhang, and L. Xie. Tightness-aware evaluation protocol for scene text detection. In CVPR, pages 4321–4330. IEEE, 2019.

[10] P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. arXiv preprint arXiv:1807.02242,

[11] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 20(11):3111–3122, 2018.

[12] B. Shi, X. Bai, and S. Belongie. Detecting oriented text in natural images by linking segments. In CVPR, pages 3482–

[13] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao. Detecting text in natural image with connectionist text proposal network. In ECCV, pages 56–72. Springer, 2016.

[14] C. Wolf and J.-M. Jolion. Object count/area graphs for the evaluation of object detection and segmentation algorithms. In ICDAR, pages 1115–1124. IEEE, 2013.

[15] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang. East: an efficient and accurate scene text detector. In CVPR, pages 2642–2651, 2017.

A. Matching matrix

Missing characters: R 0.5, P 0.5, H 0.5
Many-to-one: R 1.0, P 1.0, H 1.0
Overlap characters: R 0.75, P 0.75, H 0.75
Multiline: R 0.0, P 0.0, H 0.0
One-to-one: R 1.0, P 1.0, H 1.0

Matrix labels: D1, D2, s_ik, Recall, G1, ...
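
The appendix fragment above lists R, P, and H values for five scoring cases. The extracted text does not define H; the sketch below assumes it is the harmonic mean of recall and precision (the usual H-mean convention in text detection benchmarks), with the 0/0 case defined as 0.0, and checks that the listed triples are consistent with that reading. The function name `h_mean` is hypothetical.

```python
def h_mean(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision; defined as 0.0 when both are zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (R, P, H) triples copied from the appendix fragment.
cases = {
    "Missing characters": (0.5, 0.5, 0.5),
    "Many-to-one": (1.0, 1.0, 1.0),
    "Overlap characters": (0.75, 0.75, 0.75),
    "Multiline": (0.0, 0.0, 0.0),
    "One-to-one": (1.0, 1.0, 1.0),
}

for name, (r, p, h) in cases.items():
    # Each listed H should match the harmonic mean under the stated assumption.
    assert abs(h_mean(r, p) - h) < 1e-9, name
    print(f"{name}: H = {h_mean(r, p):.2f}")
```

Because R equals P in every listed case, these values cannot distinguish a harmonic mean from an arithmetic one; the check only shows the assumption is not contradicted.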
