Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding

Atsuyuki Miyai; Hikaru Ikuta; Jeonghun Baek; Kiyoharu Aizawa; Shota Onohara

arxiv: 2605.21182 · v1 · pith:6E4DOWM7new · submitted 2026-05-20 · cs.CL · cs.AI · cs.CV

Pith reviewed 2026-05-21 04:33 UTC · model grok-4.3

keywords: manga, annotations, OCR, dataset, multimodal, dialogue text, speech balloons, revision

The pith

Revising Manga109 annotations fixes errors to align with modern OCR and multimodal systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that the original Manga109 dataset has specific annotation problems that limit its usefulness for today's AI tools. It identifies five categories of issues in the dialogue text annotations and fixes them through a mix of automated detection and human review. This matters because Manga109 serves as a key resource for research on understanding manga, which is a popular form of Japanese culture, so better annotations could improve AI performance on tasks like reading text from images and combining visual and textual information. If true, it would mean researchers have a cleaner foundation for building systems that handle the unique layout and style of manga without losing its artistic qualities.

Core claim

The authors claim that by detecting and correcting transcription errors, missing text regions, overlapping dialogue with onomatopoeia, and under-segmented speech balloons in approximately 29,000 annotations, the new Manga109-v2026 version better supports modern OCR and multimodal manga understanding systems while preserving the expressive structures typical of manga.

What carries the argument

OCR-based issue detection combined with manual revision of dialogue annotations
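
The detection half of this pipeline can be pictured as a simple disagreement check between the stored annotation and an OCR reading of the same region. The sketch below is a minimal illustration under stated assumptions, not the authors' method: the function names, the `threshold` value, and the toy strings are all hypothetical, and the similarity measure (Python's `difflib` ratio) is a stand-in for whatever comparison the paper actually uses.

```python
from difflib import SequenceMatcher


def similarity(annotated: str, ocr_text: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, annotated, ocr_text).ratio()


def flag_for_review(pairs, threshold=0.9):
    """Return the ids whose annotation and OCR output disagree.

    `pairs` maps an annotation id to (annotated_text, ocr_text).
    `threshold` is a hypothetical cut-off, not a value from the paper.
    """
    return [
        ann_id
        for ann_id, (annotated, ocr_text) in pairs.items()
        if similarity(annotated, ocr_text) < threshold
    ]


# Toy data, invented for illustration only.
pairs = {
    "a1": ("こんにちは", "こんにちは"),  # identical strings: not flagged
    "a2": ("ありがとう", "ありがとお"),  # last character differs: flagged
}
flagged = flag_for_review(pairs)
```

A disagreement only says that one of the two readings is wrong, not which one, which is why flagged items would still go to a human reviewer rather than being overwritten with the OCR output.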

If this is right

Modern OCR systems can achieve higher precision on manga text when using the revised annotations.
Multimodal models will more accurately interpret the interplay between text and visuals in manga panels.
Evaluation of manga translation and understanding algorithms becomes more reliable with fewer annotation flaws.
The dataset continues to reflect the characteristic visual and textual expressions found in manga.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar revision processes could be applied to other datasets in the field of comic book analysis.
This update may enable new applications in automated manga localization and cultural adaptation tools.
Researchers might explore how these annotation improvements affect the training of large language models on visual narratives.

Load-bearing premise

The process of using OCR to find problems and then manually fixing them catches every issue in the five categories without adding new mistakes or biases to the dataset.

What would settle it

Compare the performance of an OCR model trained on the original Manga109 with one trained on the revised version, using a common test set of manga images. A lack of improvement would challenge the value of the revisions.
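
A cheaper variant of this test scores one fixed set of OCR outputs against both annotation versions and compares the error rates. The sketch below assumes character error rate (CER) as the metric, which may not be the measure the paper reports; the function names and the toy strings are hypothetical placeholders rather than dataset content.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(predictions, references) -> float:
    """Character error rate: total edit distance over total reference length."""
    total_edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# Toy strings, invented for illustration only.
ocr_outputs = ["abcd", "efgh"]
original_refs = ["abxd", "efgh"]  # first reference carries a transcription error
revised_refs = ["abcd", "efgh"]   # same reference after correction

cer_original = cer(ocr_outputs, original_refs)
cer_revised = cer(ocr_outputs, revised_refs)
```

In this toy case the OCR output is penalised under the original reference for a mistake that belongs to the annotation, so the measured error drops once the reference is corrected. A real comparison would need the same image regions in both versions, which is not guaranteed when regions are added or split.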

Figures

Figures reproduced from arXiv: 2605.21182 by Atsuyuki Miyai, Hikaru Ikuta, Jeonghun Baek, Kiyoharu Aizawa, Shota Onohara.

**Figure 1.** Overview of the five annotation issue types addressed in Manga109-v2026, together with representative examples of the original and revised annotations. Images courtesy of Yamada Uduki, Inohara Daisuke, Akamatsu Ken, and Hasegawa Yuichi. More broadly, our work illustrates how AI and humans can collaboratively improve culturally grounded datasets by combining modern AI technologies with human verification. W…

**Figure 2.** Type 1: Transcription Errors, where the annotated text is incorrect. “Original”, “OCR Output”, and “Revised” denote the original annotation, OCR output, and revised annotation, respectively. Red, blue, and green text indicate incorrect characters, correctly recognized OCR outputs, and revised characters, respectively. Images courtesy of Yamada Uduki

**Figure 4.**

**Figure 5.** (a) Type 4: Overlapping Text and Onomatopoeia Annotations, where dialogue and onomatopoeia overlap. Green polygon regions indicate onomatopoeia text. (b) “Natural” denotes translations that preserve the expressive and stylistic characteristics of onomatopoeia, while “Unnatural” denotes translations that treat onomatopoeia as regular dialogue without preserving their stylistic characteristics. Images courte…

**Figure 6.** Type 5: Under-Segmented Speech Balloon Annotations, where multiple connected speech balloons are annotated as a single text region. Images courtesy of Yamada Uduki. gaOCR (Baek et al., 2026), we evaluate the same OCR outputs using both the original Manga109 annotations and the revised Manga109-v2026 annotations. As shown in Table 2, the revised annotations substantially improve OCR evaluation performance,…
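
The Type 4 issue in the captions above (dialogue regions overlapping onomatopoeia regions) lends itself to a simple geometric check. The sketch below is an illustration under a simplifying assumption: it treats both regions as axis-aligned boxes, whereas the figure describes onomatopoeia as polygon regions. The `Box` type, the function names, and the coordinates are hypothetical and do not reflect the dataset's actual schema.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (a simplification of polygons)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two boxes; zero when they do not overlap."""
    width = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    height = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    return max(width, 0.0) * max(height, 0.0)


def overlap_fraction(dialogue: Box, onomatopoeia: Box) -> float:
    """Fraction of the onomatopoeia box that the dialogue box covers."""
    area = (onomatopoeia.xmax - onomatopoeia.xmin) * (
        onomatopoeia.ymax - onomatopoeia.ymin
    )
    if area <= 0:
        return 0.0
    return intersection_area(dialogue, onomatopoeia) / area


# Toy coordinates, invented for illustration only.
dialogue = Box(0, 0, 100, 100)
partly_covered = Box(50, 50, 150, 150)
far_away = Box(200, 200, 300, 300)

partial = overlap_fraction(dialogue, partly_covered)
none = overlap_fraction(dialogue, far_away)
```

A non-zero fraction would mark the pair as a candidate for review; deciding whether the overlapping text is dialogue or onomatopoeia remains a human judgement.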

Original abstract

Manga is a culturally distinctive multimodal medium and one of the most influential forms of Japanese popular culture. As AI systems increasingly target manga understanding, OCR, and translation, Manga109 has become a foundational dataset for manga-related AI research. However, the current Manga109 dataset contains transcription errors and coarse annotations, which do not align well with modern OCR and multimodal manga understanding tasks. In this work, we revisit the dialogue text annotations of Manga109 and identify five categories of annotation issues, including transcription errors, missing text regions, overlapping dialogue and onomatopoeia, and under-segmented speech balloons. To address these issues, we combine OCR-based issue detection and manual revision to construct Manga109-v2026, revising approximately 29,000 dialogue annotations. Our revisions better align Manga109 with modern OCR and multimodal manga understanding systems while preserving expressive structures characteristic of manga.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript describes the creation of Manga109-v2026 by revising approximately 29,000 dialogue annotations in the original Manga109 dataset. It identifies five categories of annotation issues (transcription errors, missing text regions, overlapping dialogue and onomatopoeia, under-segmented speech balloons, etc.) and uses OCR-based detection followed by manual revision to address them, with the goal of better aligning the dataset with modern OCR and multimodal manga understanding systems while preserving manga's expressive features.

Significance. If the revisions demonstrably improve alignment without introducing new biases, the updated dataset would be a useful resource for manga OCR and multimodal research, building on Manga109's established role as a benchmark. The hybrid OCR-plus-manual curation method is a reasonable practical approach for large-scale fixes.

major comments (3)

[Abstract] Abstract: the central claim that revisions 'better align Manga109 with modern OCR and multimodal manga understanding systems' is unsupported, as no before/after metrics, OCR accuracy deltas, detection F1 scores, or end-to-end task results are reported to show measurable improvement.
[Methods] The manuscript provides no quantitative validation or error analysis of the five annotation-issue categories after revision (e.g., residual transcription error rates or introduced biases in onomatopoeia handling), which is required to substantiate the alignment claim.
[Results] No results section, table, or figure presents downstream evaluation on any OCR model or multimodal manga understanding task, leaving open the possibility that changes are neutral or detrimental.

minor comments (1)

[Methods] Clarify the exact criteria used for manual revision decisions to allow reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful review and for highlighting areas where the manuscript's claims and scope could be clarified. The work centers on documenting the annotation revision process and releasing Manga109-v2026 as a resource; we address each major comment below and describe the revisions we will make.

Referee: [Abstract] Abstract: the central claim that revisions 'better align Manga109 with modern OCR and multimodal manga understanding systems' is unsupported, as no before/after metrics, OCR accuracy deltas, detection F1 scores, or end-to-end task results are reported to show measurable improvement.

Authors: We agree that the abstract phrasing overstates the contribution by implying demonstrated improvement. The revisions target concrete, previously documented annotation problems (transcription errors, missing regions, under-segmented balloons, and overlaps) that are known to degrade modern OCR pipelines. In the revised manuscript we will rephrase the abstract to state that the updates correct these specific issues, thereby making the dataset more compatible with current OCR and multimodal systems, without asserting quantitative gains.

revision: yes

Referee: [Methods] The manuscript provides no quantitative validation or error analysis of the five annotation-issue categories after revision (e.g., residual transcription error rates or introduced biases in onomatopoeia handling), which is required to substantiate the alignment claim.

Authors: The methods section outlines the OCR-assisted detection followed by manual correction for each of the five issue categories. We accept that post-revision statistics would improve transparency. We will add a table and accompanying text reporting the number of annotations revised per category, the criteria used for manual verification, and any steps taken to avoid introducing new biases (for example, preserving original onomatopoeia styling where possible).

revision: yes

Referee: [Results] No results section, table, or figure presents downstream evaluation on any OCR model or multimodal manga understanding task, leaving open the possibility that changes are neutral or detrimental.

Authors: The manuscript is structured as a dataset curation paper whose primary contribution is the identification of annotation issues and the release of the corrected annotations. We did not conduct downstream experiments because that would require choosing particular models and tasks outside the stated scope. In revision we will insert a short discussion section that explains, on qualitative grounds, why the targeted fixes (accurate transcription, proper balloon segmentation, separation of dialogue and onomatopoeia) are expected to benefit modern systems, while explicitly noting that empirical benchmarking remains future work.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical dataset curation without derivations or fitted claims

The paper describes an empirical process of identifying five categories of annotation problems in Manga109 (transcription errors, missing regions, overlaps, under-segmented balloons) via OCR-assisted detection and manual revision of ~29k annotations to produce Manga109-v2026. No mathematical derivations, equations, parameter fittings, predictions, or self-citation chains are present that could reduce any claim to its own inputs by construction. The central claim concerns improved alignment with modern OCR/multimodal systems, but this is presented as the outcome of the curation process itself rather than a derived or predicted quantity. The work is self-contained as data revision without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper performs dataset curation rather than theoretical derivation, so it introduces no free parameters, mathematical axioms, or invented entities. The central claim rests on the empirical observation that the original annotations contained the listed issues and that the revision process addressed them.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

[1] Manga109 dataset and creation of metadata. MANPU.
[2] Text detection in manga by combining connected-component-based and region-based classifications. ICIP.
[3] Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. MTAP.
[4] Kiyoharu Aizawa, Azuma Fujimoto, Atsushi Otsubo, Toru Ogawa, Yusuke Matsui, Koki Tsubota, and Hikaru Ikuta. IEEE MultiMedia.
[5] Towards fully automated manga translation. AAAI.
[6] COO: Comic Onomatopoeia Dataset for Recognizing Arbitrary or Truncated Texts. ECCV.
[7] Manga109Dialog: A Large-scale Dialogue Dataset for Comics Speaker Detection. ICME.
[8] Hikaru Ikuta, Leslie Wohler, and Kiyoharu Aizawa. MangaUB: A Manga Understanding Benchmark for Large Multimodal Models.
[9] CoMix: A Comprehensive Benchmark for Multi-Task Comic Understanding. NeurIPS.
[10] Ragav Sachdeva and Andrew Zisserman. CVPR.
[11] Ragav Sachdeva, Gyungin Shin, and Andrew Zisserman. ACCV.
[12] From Panels to Prose: Generating Literary Narratives from Comics. ICCV.
[13] Advancing Manga Analysis: Comprehensive Segmentation Annotations for the Manga109 Dataset. CVPR.
[14] Jeonghun Baek, Kazuki Egashira, Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Hikaru Ikuta, and Kiyoharu Aizawa. EACL Findings.
[15] Gemini 3 Flash: Frontier intelligence built for speed. 2025.
[16] OpenAI GPT-5 System Card. arXiv preprint arXiv:2601.03267.
[17] One missing piece in vision and language: A survey on comics understanding. arXiv preprint arXiv:2409.09502.
