Re:Verse -- Can Your VLM Read a Manga?

Aaditya Baranwal; Madhav Kataria; Naitik Agrawal; Shruti Vyas; Yogesh S Rawat

arxiv: 2508.08508 · v3 · submitted 2025-08-11 · 💻 cs.CV · cs.CL

Pith reviewed 2026-05-18 23:11 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords Vision Language Models · Manga Narrative · Temporal Reasoning · Causal Inference · Cross-panel Cohesion · Story Comprehension · Multimodal Evaluation

The pith

Vision-language models excel at single manga panels but fail at story-level reasoning across sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether Vision Language Models can understand manga as full stories rather than as separate images. It finds that recent models interpret individual panels well but fail to link them through time, keep characters consistent, or infer causes in complex plots. The study applies a new evaluation framework to the Re:Zero manga series, using aligned light novel text as the narrative reference, to expose these gaps in temporal and causal reasoning. A sympathetic reader would care because the result points to capabilities that AI still lacks for following sequential visual stories such as comics or films.

Core claim

While recent large multimodal models excel at individual panel interpretation, they systematically fail at temporal causality and cross-panel cohesion, core requirements for coherent story comprehension. Applying the evaluation framework to the Re:Zero manga across 11 chapters with 308 annotated panels, the authors find that current models lack genuine story-level intelligence, struggling particularly with non-linear narratives, character consistency, and causal inference across extended sequences.

What carries the argument

A novel evaluation framework that combines three parts: fine-grained multimodal annotation linking visual elements to narrative structure via aligned light novel text, cross-modal embedding analysis, and retrieval-augmented assessment.

If this is right

VLMs require new mechanisms to handle non-linear narratives for coherent story comprehension.
Character consistency across extended sequences remains a major unresolved limitation.
Causal inference over long visual sequences is not yet achieved by current architectures.
Retrieval-augmented generation does not resolve the underlying cross-panel cohesion failures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar gaps in sequential reasoning may affect VLM performance on video or film story understanding.
Adding explicit cross-panel memory or story graphs during training could be tested as a fix.
The framework could be extended to other discrete visual narratives like webcomics or animation storyboards.

Load-bearing premise

The annotation protocol that links visual elements to narrative structure through aligned light novel text accurately captures the core requirements for coherent story comprehension.

What would settle it

A model that correctly answers questions about causal events across non-consecutive panels while maintaining character consistency throughout the Re:Zero manga sequence would challenge the claim of systematic failure.

Figures

Figures reproduced from arXiv: 2508.08508 by Aaditya Baranwal, Madhav Kataria, Naitik Agrawal, Shruti Vyas, Yogesh S Rawat.

**Figure 1.** Re:Verse Multimodal Annotation Examples. Two representative examples from our dataset demonstrate the fine-grained alignment between visual manga content and narrative text. Each example shows the original manga page (left) paired with its corresponding aligned narrative text (right), where <D></D> tags indicate spoken dialogue (displayed in cyan) and <T></T> tags indicate internal thoughts (displayed in …

**Figure 2.** Re:Verse curation and annotation pipeline. Raw manga panels and light novel text are aligned via manual bounding box annotation, text classification (<D>, <T>), and semantic grounding, yielding synchronized visual, spatial, and narrative structure.

… comics, though often with automated text extraction quality issues. COMICS Text+ addressed this with high-quality OCR benchmarks for Western comics [28]. The…

**Figure 3.** Overview of experimental setups used to evaluate narrative understanding. (a) depicts the Generative Tasks, where a Vision-Language Model (VLM) is prompted to either (i) synthesize detailed narratives (Story Generation) or (ii) produce concise plot summaries (Summary Generation). (b) shows the Temporal Reasoning Tasks: (i) Next Page Prediction, where the VLM selects the correct continuation of a story, (ii) Int…

**Figure 4.** Grounding Setup: Text Detection, Classification, and Association with characters to evaluate in-page understanding.

3.1. Dataset Construction. We construct our benchmark from Volume 1 of Re:Zero - Starting Life in Another World, chosen for its complex narrative structure featuring time-loop mechanics, diverse character interactions, and sophisticated visual storytelling. The dataset comprises 308 meticulous…

**Figure 5.** (a): Mean Reciprocal Rank (MRR) across all chapters (x-axis: chapter number; y-axis: average MRR score), highlighting the consistently poor overall scores on retrieving the visual counterpart given the text. (b): Mean Reciprocal Rank (MRR) across pages of Chapter 7 (x-axis: page index; y-axis: MRR score), highlighting the poor scores on retrieving the visual counterpart given the text within the con…

**Figure 6.** Chapter 7 similarity heatmap demonstrating semantic …

**Figure 7.** Score distribution for VQA evaluation without RAG enhancement. The distribution shows heavy concentration in lower scores …

**Figure 8.** Score distribution for VQA evaluation with RAG enhancement. The distribution shows improved performance with reduced …

**Figure 9.** Chapter 1 performance analysis (left) and semantic similarity heatmap (right). The heatmap shows irregular patterns with low …

**Figure 10.** Chapter 2 performance analysis (left) and semantic …

**Figure 13.** Chapter 5 performance analysis (left) and semantic …

**Figure 14.** Chapter 6 performance analysis (left) and semantic …

**Figure 15.** Chapter 8 performance analysis (left) and semantic …

**Figure 16.** Chapter 9 performance analysis (left) and semantic …

**Figure 17.** Chapter 10 performance analysis (left) and semantic …
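
The Figure 5 caption reports Mean Reciprocal Rank for retrieving the visual counterpart of a text passage. The following is a minimal sketch of how such a score could be computed from embeddings, assuming one vector per text passage and one per page, and assuming passage i belongs to page i. The function names, the tie-handling rule, and the toy similarity values are illustrative assumptions, not the authors' code; the embedding model and ranking details used in the paper are not given on this page.

```python
from math import sqrt
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_matrix(text_embs, image_embs):
    """sim[i][j] = cosine similarity of text passage i and page image j."""
    return [[cosine(t, v) for v in image_embs] for t in text_embs]


def mean_reciprocal_rank(sim) -> float:
    """MRR for text-to-image retrieval.

    Assumption: the correct page for text i is image i (the diagonal).
    The rank of the correct page is 1 plus the number of pages that
    score strictly higher, so ties are resolved in the model's favour.
    """
    total = 0.0
    for i, row in enumerate(sim):
        rank = 1 + sum(1 for score in row if score > row[i])
        total += 1.0 / rank
    return total / len(sim)


# Toy similarity matrix (hypothetical values): the first passage retrieves
# its page at rank 1, the other two at rank 2.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.1],
    [0.3, 0.7, 0.6],
]
print(round(mean_reciprocal_rank(toy_sim), 3))
```

An MRR of 1.0 would mean every passage ranks its own page first. With N candidate pages, a random ranking puts the correct page at each rank with probability 1/N, which is a useful baseline when reading per-chapter curves.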

read the original abstract

Current Vision Language Models (VLMs) demonstrate a critical gap between surface-level recognition and deep narrative reasoning when processing sequential visual storytelling. Through a comprehensive investigation of manga narrative understanding, we reveal that while recent large multimodal models excel at individual panel interpretation, they systematically fail at temporal causality and cross-panel cohesion, core requirements for coherent story comprehension. We introduce a novel evaluation framework that combines fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment to systematically characterize these limitations. Our methodology includes (i) a rigorous annotation protocol linking visual elements to narrative structure through aligned light novel text, (ii) comprehensive evaluation across multiple reasoning paradigms, including direct inference and retrieval-augmented generation, and (iii) cross-modal similarity analysis revealing fundamental misalignments in current VLMs' joint representations. Applying this framework to Re:Zero manga across 11 chapters with 308 annotated panels, we conduct the first systematic study of long-form narrative understanding in VLMs through three core evaluation axes: generative storytelling, contextual dialogue grounding, and temporal reasoning. Our findings demonstrate that current models lack genuine story-level intelligence, struggling particularly with non-linear narratives, character consistency, and causal inference across extended sequences. This work establishes both the foundation and practical methodology for evaluating narrative intelligence, while providing actionable insights into the capability of deep sequential understanding of Discrete Visual Narratives beyond basic recognition in Multimodal Models. Project Page: https://re-verse.vercel.app

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a concrete benchmark showing VLMs still miss causal and temporal links across manga panels, but the light-novel alignment in the annotations risks testing text matching more than pure visual story sense.

This paper shows current VLMs handle single manga panels fine but lose the thread on longer stories, especially non-linear sequences, character consistency, and cause-effect across panels. They tested this on Re:Zero with a new set of annotations and three focused checks: generative storytelling, dialogue grounding, and temporal reasoning.

The main new piece is the Re:Verse framework itself, which ties 308 panels from 11 chapters to aligned light-novel text and adds cross-modal embedding checks plus retrieval-augmented tests. That gives a practical way to measure story-level gaps that earlier single-image or short-clip benchmarks missed. The work is honest about the failures and supplies a project page with the data, which helps others try the same setup. It also stays grounded in actual model outputs rather than just claiming a new theory.

The annotation protocol is the soft spot worth watching. Linking every visual element to light-novel text makes sense for ground truth, but it could mean some errors come from not matching the written plot summary instead of failing to read the panel transitions or visual cues on their own. If the test rewards textual fidelity too much, the reported struggles with causality might look larger than a purely visual test would show. The abstract claims consistent failures, yet without seeing the exact model prompts, inter-annotator numbers, or full result tables it is hard to judge how representative the chosen VLMs are. Minor issues like that do not sink the central observation that story coherence is still weak.

This is useful for anyone building or evaluating multimodal models that need to follow extended visual narratives, such as in comics, video, or instructional sequences. It does not claim to fix the problem, just maps where the holes are. I would send it for peer review. The empirical angle is clear enough and the domain choice is fresh, so referees can tighten the annotation details and check the numbers without starting from scratch.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Re:Verse evaluation framework for assessing Vision-Language Models on long-form manga narrative understanding, using 308 annotated panels from 11 chapters of the Re:Zero manga. Annotations link visual panels to narrative structure via aligned light novel text; models are tested on generative storytelling, contextual dialogue grounding, and temporal reasoning. The central claim is that current VLMs excel at single-panel recognition but systematically fail at temporal causality, cross-panel cohesion, non-linear narratives, character consistency, and causal inference across sequences.

Significance. If the annotation protocol validly isolates visual narrative reasoning, the work provides the first systematic study of story-level intelligence in VLMs for discrete visual narratives and supplies a reusable methodology with cross-modal embedding analysis and retrieval-augmented evaluation. This could guide future model development toward better sequential and causal understanding. The empirical demonstration of consistent failures on extended sequences is potentially impactful, though its strength hinges on whether the light-novel alignment truly measures visual-specific comprehension rather than textual proxy matching.

major comments (2)

[§3 (Annotation Protocol)] The ground-truth construction aligns visual panels directly to light-novel text summaries. This risks conflating model failures on visual causality cues or panel transitions with mismatches to an already-resolved textual plot, undermining the claim that observed errors demonstrate absence of genuine story-level visual intelligence rather than proxy misalignment.
[§4.2–4.3 (Evaluation Axes and Results)] Without reported inter-annotator agreement scores or explicit criteria for panel selection across the 11 chapters, it is unclear whether the reported consistent failures reflect representative model behavior or post-hoc choices of difficult non-linear sequences.

minor comments (2)

[Abstract] The abstract states this is the 'first systematic study' of long-form narrative understanding; a brief comparison to prior comic/manga VLM benchmarks would strengthen positioning.
[Figures and Project Page] Figure captions and the project page reference should include explicit links to annotation guidelines or released data to support reproducibility claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript introducing the Re:Verse framework. We address each major comment point by point below, providing honest responses and indicating where revisions will be made to improve clarity and rigor.

Referee: §3 (Annotation Protocol): the ground-truth construction aligns visual panels directly to light-novel text summaries. This risks conflating model failures on visual causality cues or panel transitions with mismatches to an already-resolved textual plot, undermining the claim that observed errors demonstrate absence of genuine story-level visual intelligence rather than proxy misalignment.

Authors: We acknowledge the referee's concern that aligning panels to light-novel summaries could introduce proxy misalignment rather than purely measuring visual narrative reasoning. The light novel serves as the source material for the manga adaptation, providing canonical narrative ground truth to evaluate whether VLMs recover temporal causality and cohesion from visuals alone. However, to better isolate visual-specific failures, we will revise the manuscript to add explicit discussion in §3 on this distinction, include qualitative examples of model errors on panel transitions independent of text, and report additional cross-modal embedding analyses that highlight visual-only misalignments. This partial revision will refine our claims without altering the core evaluation design.

Revision: partial

Referee: §4.2–4.3 (Evaluation Axes and Results): without reported inter-annotator agreement scores or explicit criteria for panel selection across the 11 chapters, it is unclear whether the reported consistent failures reflect representative model behavior or post-hoc choices of difficult non-linear sequences.

Authors: We agree that reporting inter-annotator agreement and transparent selection criteria is necessary to establish that the observed failures are representative rather than due to biased panel choices. Although §3 describes the annotation protocol and our intent to cover diverse narrative structures, these quantitative details were omitted. In the revised manuscript, we will add inter-annotator agreement metrics (such as Cohen's kappa) for the annotation tasks and explicitly detail the panel and chapter selection criteria, including stratification for linear vs. non-linear sequences and character consistency challenges. This will directly address the concern and strengthen the empirical validity of the results.

Revision: yes
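
The simulated rebuttal names Cohen's kappa as a candidate agreement metric. As a rough sketch of what that would measure, the snippet below computes kappa for two annotators who each assign one label per text region, using the dialogue (`D`) and thought (`T`) classes mentioned in the figure captions. The label sequences are invented for illustration, and the function name `cohens_kappa` is a hypothetical helper; nothing here reflects agreement numbers from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    own label frequencies.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")

    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n

    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six text regions: D = dialogue, T = thought.
annotator_1 = ["D", "D", "T", "T", "D", "T"]
annotator_2 = ["D", "D", "T", "D", "D", "T"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Here the annotators agree on five of six regions (raw agreement about 0.83), but chance agreement is 0.5, so kappa is lower than the raw figure. That gap is the reason a referee would ask for kappa rather than percent agreement.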

Circularity Check

0 steps flagged

No circularity: empirical evaluation against external human annotations

The paper is a purely empirical study that introduces an annotation protocol linking manga panels to light novel text and then evaluates existing VLMs against the resulting human-annotated ground truth. No equations, fitted parameters, or self-referential derivations are present. All reported results are direct comparisons to independently created annotations rather than quantities that reduce to the paper's own inputs by construction. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that manga panels aligned to light-novel text provide a valid proxy for narrative structure. No free parameters or invented physical entities are introduced; the main addition is the evaluation framework itself.

axioms (1)

Domain assumption: Manga panels can be reliably aligned with light novel text to capture narrative structure and causality.
Invoked in the annotation protocol described in the abstract.

invented entities (1)

Re:Verse evaluation framework (no independent evidence)
Purpose: To measure temporal causality and cross-panel cohesion in VLMs.
Newly introduced combination of annotation, embedding analysis, and retrieval-augmented assessment.

pith-pipeline@v0.9.0 · 5807 in / 1284 out tokens · 28730 ms · 2026-05-18T23:11:12.643148+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction

Relation between the paper passage and the cited Recognition theorem: unclear

We introduce Re:Verse, a comprehensive benchmark for sequential narrative understanding in manga... 308 annotated panels... three core evaluation axes: generative storytelling, contextual dialogue grounding, and temporal reasoning.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

systematic character consistency failure... NER density ranges from 0.009 to 0.027... temporal reasoning degradation with 28.5% accuracy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 3 internal anchors

[1] Kiyoharu Aizawa, Azuma Fujimoto, Atsushi Otsubo, Toru Ogawa, Yusuke Matsui, Koki Tsubota, and Hikaru Ikuta. Building a manga dataset “manga109” with annotations for multimedia applications. IEEE MultiMedia, 27(2):8–18,
[2] Jeonghun Baek, Yusuke Matsui, and Kiyoharu Aizawa. Coo: Comic onomatopoeia dataset for recognizing arbitrary or truncated texts, 2022.
[3] Jeonghun Baek, Kazuki Egashira, Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Hikaru Ikuta, and Kiyoharu Aizawa. Mangavqa and mangalmm: A benchmark and specialized model for multimodal manga understanding, 2025.
[4] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
[5] Christopher Cater and Mark Riedl. Fabula entropy indexing: Objective measures of story coherence. arXiv preprint arXiv:2104.07472, 2021.
[6] Cyril Chhun, Pierre Colombo, Chloé Clavel, and Fabian M. Suchanek. Of human criteria and automatic metrics: A benchmark of the evaluation of story generation, 2022.
[7] Jon Chun. Aistorysimilarity: Quantifying story similarity using narrative for search, ip infringement, and guided creativity. In Proceedings of the 28th Conference on Computational Natural Language Learning (CoNLL), pages 161–177, Miami, FL, USA, 2024. Association for Computational Linguistics.
[8] Neil Cohn. The visual language of comics: Introduction to the structure and cognition of sequential images. Bloomsbury Publishing, 2013.
[9] Neil Cohn, Martin Paczynski, Ray Jackendoff, Phillip J Holcomb, and Gina R Kuperberg. (pea)nuts and bolts of visual narrative: Structure and meaning in sequential image comprehension. Cognitive science, 36(6):1084–1112, 2012.
[10] Clément Guérin, Christophe Rigaud, Antoine Mercier, Farid Ammar-Boudjelal, Karell Bertet, Alain Bouju, Jean-Christophe Burie, Georges Louis, Jean-Marc Ogier, and Arnaud Revel. ebdtheque: a representative database of comics. In Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), pages 1145–1149, 2013.
[11] Hongcheng Guo, Boyang Wang, Jiaqi Bai, Jiaheng Liu, Jian Yang, and Zhoujun Li. M2c: Towards automatic multimodal manga complement. arXiv preprint arXiv:2310.17130, 2023.
[12] Hikaru Ikuta, Leslie Wöhler, and Kiyoharu Aizawa. Mangaub: A manga understanding benchmark for large multimodal models, 2024.
[13] Mohit Iyyer, Varun Manjunatha, Anupam Guha, Yogarshi Vyas, Jordan Boyd-Graber, Hal Daumé III, and Larry Davis. The amazing mysteries of the gutter: Drawing inferences between panels in comic book narratives. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1650–1659, 2017.
[14] Byoungjip Kim, Sungik Choi, Dasol Hwang, Moontae Lee, and Honglak Lee. Transferring pre-trained multimodal representations with cross-modal similarity matching. In Advances in Neural Information Processing Systems, pages 13006–13018, 2022.
[15] Tony Lee, Haoqin Tu, Chi Heem Wong, Wenhao Zheng, Yiyang Zhou, Yifan Mai, Josselin Roberts, and Michihiro Yasunaga. Vhelm: A holistic evaluation of vision language models. In Advances in Neural Information Processing Systems, 2024.
[16] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900, 2022.
[17] Jiaang Li, Yova Kementchedjhieva, Constanza Fierro, and Anders Søgaard. Do vision and language models share concepts? a vector space alignment study. Transactions of the Association for Computational Linguistics, 12:1232–1249,
[18] Jian Li, Weiheng Lu, Zhongzhi Xiong, and Xiaoyun Hao. A survey on benchmarks of multimodal large language models. arXiv preprint arXiv:2408.08632, 2024.
[19] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
[20] Daichi Matsuse, Tappei Nagatsuki, and Shinichirou Otsuka. Re:Zero - Starting Life in Another World, Chapter 1: A Day in the Capital, Vol. 1 (Manga). Yen Press, 2016. Illustrated by Daichi Matsuse.
[21] Tappei Nagatsuki and Shinichirou Otsuka. Re:Zero - Starting Life in Another World, Vol. 1 (Light Novel). Yen Press, 2016. Translated by Jeremiah Borque.
[22] OpenAI: Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis,...
[23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual representations from natural language supervision. In International conference on machine learning, pages 8748–8763, 2021.
[24] Ragav Sachdeva and Andrew Zisserman. From panels to prose: Generating literary narratives from comics, 2025.
[25] Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, and Arman Cohan. Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models. In International Conference on Learning Representations, 2025.
[26] Conghao Tom Shen, Violet Yao, and Yixin Liu. Maru: A manga retrieval and understanding system connecting vision and language. arXiv preprint arXiv:2311.02083, 2023.
[27] Conghao Tom Shen, Violet Yao, and Yixin Liu. Maru: A manga retrieval and understanding system connecting vision and language, 2023.
[28] Gürkan Soykan, Deniz Yuret, and Tevfik Metin Sezgin. A comprehensive gold standard and benchmark for comics text detection and recognition, 2022.
[29] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[30] Danae Sánchez Villegas, Ingo Ziegler, and Desmond Elliott. Imagechain: Advancing sequential image-to-text reasoning in multimodal large language models. arXiv preprint arXiv:2502.19419, 2025.
[31] Emanuele Vivoli, Marco Bertini, and Dimosthenis Karatzas. Comix: A comprehensive benchmark for multi-task comic understanding, 2024.
[32] Emanuele Vivoli, Irene Campaioli, Mariateresa Nardoni, Niccolò Biondi, Marco Bertini, and Dimosthenis Karatzas. Comics datasets framework: Mix of comics datasets for detection benchmarking, 2024.
[33] Heather Harris Wright, Gilson J Capilouto, and Anthony Koutsoftas. Evaluating measures of global coherence ability in stories in adults. International journal of language & communication disorders, 48(3):249–256, 2013.
[34] Dingyi Yang and Qin Jin. What makes a good story and how can we measure it? a comprehensive survey of story evaluation, 2024.
[35] Jian Yao, Yiqun Cui, Dan Roth, and Yizhou Zhang. Openmeva: A benchmark for evaluating open-ended story generation metrics. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6394–6407. Association for Com...
[36] Victoria Yefymenko. Multimodality and cross-modal cohesion in manga. Cognition, communication, discourse, 24:103–114, 2022.
[37] Qiang Yi, Yangfan He, Jianhui Wang, Xinyuan Song, Shiyao Qian, Xinhang Yuan, Li Sun, Yi Xin, Jingqun Tang, Keqin Li, Kuan Lu, Menghao Huo, Jiaqi Chen, and Tianyu Shi. Score: Story coherence and retrieval enhancement for ai narratives, 2025.
[38] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. arXiv preprint arXiv:2304.00685, 2023.
[39] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Supplementary Material. This supplementary material provides comprehensive chapter-by-chapter analysis of VLM performance on manga narrative understanding, ...

[40] show length ratios of 0.8-1.4, while late chapters (8-
[41] demonstrate more extreme variations (0.8-1.6), suggesting that climactic sequences are particularly challenging for content length calibration. Cross-Modal Summarization Extended Analysis. The cross-modal analysis reveals architecture-specific patterns that illuminate the visual processing penalty: InternVL3 Series: Shows consistent 1.9-3.2 point BERTSc...

[1] Kiyoharu Aizawa, Azuma Fujimoto, Atsushi Otsubo, Toru Ogawa, Yusuke Matsui, Koki Tsubota, and Hikaru Ikuta. Building a manga dataset “manga109” with annotations for multimedia applications. IEEE MultiMedia, 27(2):8–18,

[2] Jeonghun Baek, Yusuke Matsui, and Kiyoharu Aizawa. Coo: Comic onomatopoeia dataset for recognizing arbitrary or truncated texts, 2022.

[3] Jeonghun Baek, Kazuki Egashira, Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Hikaru Ikuta, and Kiyoharu Aizawa. Mangavqa and mangalmm: A benchmark and specialized model for multimodal manga understanding, 2025.

[4] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.

[5] Christopher Cater and Mark Riedl. Fabula entropy indexing: Objective measures of story coherence. arXiv preprint arXiv:2104.07472, 2021.

[6] Cyril Chhun, Pierre Colombo, Chloé Clavel, and Fabian M. Suchanek. Of human criteria and automatic metrics: A benchmark of the evaluation of story generation, 2022.

[7] Jon Chun. Aistorysimilarity: Quantifying story similarity using narrative for search, ip infringement, and guided creativity. In Proceedings of the 28th Conference on Computational Natural Language Learning (CoNLL), pages 161–177, Miami, FL, USA, 2024. Association for Computational Linguistics.

[8] Neil Cohn. The visual language of comics: Introduction to the structure and cognition of sequential images. Bloomsbury Publishing, 2013.

[9] Neil Cohn, Martin Paczynski, Ray Jackendoff, Phillip J Holcomb, and Gina R Kuperberg. (pea)nuts and bolts of visual narrative: Structure and meaning in sequential image comprehension. Cognitive science, 36(6):1084–1112, 2012.

[10] Clément Guérin, Christophe Rigaud, Antoine Mercier, Farid Ammar-Boudjelal, Karell Bertet, Alain Bouju, Jean-Christophe Burie, Georges Louis, Jean-Marc Ogier, and Arnaud Revel. ebdtheque: a representative database of comics. In Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), pages 1145–1149, 2013.

[11] Hongcheng Guo, Boyang Wang, Jiaqi Bai, Jiaheng Liu, Jian Yang, and Zhoujun Li. M2c: Towards automatic multimodal manga complement. arXiv preprint arXiv:2310.17130, 2023.

[12] Hikaru Ikuta, Leslie Wöhler, and Kiyoharu Aizawa. Mangaub: A manga understanding benchmark for large multimodal models, 2024.

[13] Mohit Iyyer, Varun Manjunatha, Anupam Guha, Yogarshi Vyas, Jordan Boyd-Graber, Hal Daumé III, and Larry Davis. The amazing mysteries of the gutter: Drawing inferences between panels in comic book narratives. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1650–1659, 2017.

[14] Byoungjip Kim, Sungik Choi, Dasol Hwang, Moontae Lee, and Honglak Lee. Transferring pre-trained multimodal representations with cross-modal similarity matching. In Advances in Neural Information Processing Systems, pages 13006–13018, 2022.

[15] Tony Lee, Haoqin Tu, Chi Heem Wong, Wenhao Zheng, Yiyang Zhou, Yifan Mai, Josselin Roberts, and Michihiro Yasunaga. Vhelm: A holistic evaluation of vision language models. In Advances in Neural Information Processing Systems, 2024.

[16] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900, 2022.

[17] Jiaang Li, Yova Kementchedjhieva, Constanza Fierro, and Anders Søgaard. Do vision and language models share concepts? a vector space alignment study. Transactions of the Association for Computational Linguistics, 12:1232–1249,

[18] Jian Li, Weiheng Lu, Zhongzhi Xiong, and Xiaoyun Hao. A survey on benchmarks of multimodal large language models. arXiv preprint arXiv:2408.08632, 2024.

[19] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.

[20] Daichi Matsuse, Tappei Nagatsuki, and Shinichirou Otsuka. Re:Zero - Starting Life in Another World, Chapter 1: A Day in the Capital, Vol. 1 (Manga). Yen Press, 2016. Illustrated by Daichi Matsuse.

[21] Tappei Nagatsuki and Shinichirou Otsuka. Re:Zero - Starting Life in Another World, Vol. 1 (Light Novel). Yen Press, 2016. Translated by Jeremiah Borque.

[22] OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis,...

[23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual representations from natural language supervision. In International conference on machine learning, pages 8748–8763, 2021.

[24] Ragav Sachdeva and Andrew Zisserman. From panels to prose: Generating literary narratives from comics, 2025.

[25] Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, and Arman Cohan. Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models. In International Conference on Learning Representations, 2025.

[26] Conghao Tom Shen, Violet Yao, and Yixin Liu. Maru: A manga retrieval and understanding system connecting vision and language. arXiv preprint arXiv:2311.02083, 2023.

[27] Conghao Tom Shen, Violet Yao, and Yixin Liu. Maru: A manga retrieval and understanding system connecting vision and language, 2023.

[28] Gürkan Soykan, Deniz Yuret, and Tevfik Metin Sezgin. A comprehensive gold standard and benchmark for comics text detection and recognition, 2022.

[29] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[30] Danae Sánchez Villegas, Ingo Ziegler, and Desmond Elliott. Imagechain: Advancing sequential image-to-text reasoning in multimodal large language models. arXiv preprint arXiv:2502.19419, 2025.

[31] Emanuele Vivoli, Marco Bertini, and Dimosthenis Karatzas. Comix: A comprehensive benchmark for multi-task comic understanding, 2024.

[32] Emanuele Vivoli, Irene Campaioli, Mariateresa Nardoni, Niccolò Biondi, Marco Bertini, and Dimosthenis Karatzas. Comics datasets framework: Mix of comics datasets for detection benchmarking, 2024.

[33] Heather Harris Wright, Gilson J Capilouto, and Anthony Koutsoftas. Evaluating measures of global coherence ability in stories in adults. International journal of language & communication disorders, 48(3):249–256, 2013.

[34] Dingyi Yang and Qin Jin. What makes a good story and how can we measure it? a comprehensive survey of story evaluation, 2024.

[35] Jian Yao, Yiqun Cui, Dan Roth, and Yizhou Zhang. Openmeva: A benchmark for evaluating open-ended story generation metrics. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6394–6407. Association for Com...

[36] Victoria Yefymenko. Multimodality and cross-modal cohesion in manga. Cognition, communication, discourse, 24:103–114, 2022.

[37] Qiang Yi, Yangfan He, Jianhui Wang, Xinyuan Song, Shiyao Qian, Xinhang Yuan, Li Sun, Yi Xin, Jingqun Tang, Keqin Li, Kuan Lu, Menghao Huo, Jiaqi Chen, and Tianyu Shi. Score: Story coherence and retrieval enhancement for ai narratives, 2025.

[38] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. arXiv preprint arXiv:2304.00685, 2023.

[39] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Supplementary Material

This supplementary material provides comprehensive chapter-by-chapter analysis of VLM performance on manga narrative understanding, ...

show length ratios of 0.8-1.4, while late chapters (8-

demonstrate more extreme variations (0.8-1.6), suggesting that climactic sequences are particularly challenging for content length calibration.

Cross-Modal Summarization Extended Analysis

The cross-modal analysis reveals architecture-specific patterns that illuminate the visual processing penalty:

InternVL3 Series: Shows consistent 1.9-3.2 point BERTSc...