SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning
Pith reviewed 2026-05-09 21:59 UTC · model grok-4.3
The pith
The shared task defines narrative similarity as determining which of two stories better matches an anchor story and collects human judgments to test embedding models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a novel definition of narrative similarity, compatible with both narrative theory and intuitive judgment. Based on the similarity judgments collected under this concept, we also evaluate narrative embedding representations. The task operationalizes narrative similarity as a binary classification problem: determining which of two stories is more similar to an anchor story, with a dataset of more than 1,000 story summary triples each backed by at least two agreeing annotators.
What carries the argument
The binary classification setup on story-summary triples, which operationalizes the new similarity definition and supplies the gold labels for both the classification track and the embedding evaluation track.
If this is right
- LLM ensembles achieve the highest scores in the triple-based classification track.
- Systems that apply pre- and post-processing to pretrained embedding models perform about on par with custom fine-tuned solutions in the embedding track.
- Both tracks still contain measurable headroom for automated systems.
- The released dataset supports instance-level analysis and embedding visualizations for all submissions.
Where Pith is reading between the lines
- A definition that separates narrative closeness from surface plot overlap could be reused to compare stories across different media such as novels and films.
- Better narrative embeddings might improve downstream applications like story recommendation or retrieval that currently rely on topical similarity alone.
- Failure cases from the top systems could be examined to isolate which narrative elements (character arcs, causal chains, or thematic resonance) current models still miss.
Load-bearing premise
The binary classification framing and the collected annotations with inter-annotator agreement faithfully operationalize the claimed novel definition of narrative similarity without introducing unmeasured biases or inconsistencies.
What would settle it
A replication study that asks new annotators to judge the same triples and finds that agreement patterns differ markedly from the original annotations or fail to match independent ratings of narrative closeness would undermine the operationalization.
Figures
read the original abstract
We present the shared task on narrative similarity and narrative representation learning - NSNRL (pronounced "nass-na-rel"). The task operationalizes narrative similarity as a binary classification problem: determining which of two stories is more similar to an anchor story. We introduce a novel definition of narrative similarity, compatible with both narrative theory and intuitive judgment. Based on the similarity judgments collected under this concept, we also evaluate narrative embedding representations. We collected at least two annotations each for more than 1,000 story summary triples, with each annotation being backed by at least two annotators in agreement. This paper describes the sampling and annotation process for the dataset; further, we give an overview of the submitted systems and the techniques they employ. We received a total of 71 final submissions from 46 teams across our two tracks. In our triple-based classification setup, LLM ensembles make up many of the top-scoring systems, while in the embedding setup, systems with pre- and post-processing on pretrained embedding models perform about on par with custom fine-tuned solutions. Our analysis identifies potential headroom for improvement of automated systems in both tracks. The task website includes visualizations of embeddings alongside instance-level classification results for all teams.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes SemEval-2026 Task 4 (NSNRL), which operationalizes narrative similarity as a binary classification task on story-summary triples. It introduces a novel definition of narrative similarity claimed to be compatible with narrative theory and intuitive judgment, collects at least two annotations per triple for over 1,000 triples with inter-annotator agreement, and reports an overview of 71 submissions from 46 teams across a classification track (where LLM ensembles dominate) and an embedding track (where pre/post-processing on pretrained models performs comparably to fine-tuned solutions). The paper also includes analysis of headroom for improvement and visualizations.
Significance. If the novel definition and annotation process prove robust, this shared task provides a valuable new benchmark for narrative representation learning in NLP, addressing a gap in story-level similarity evaluation. The dual-track design and large annotation effort (with agreement backing) are strengths that enable direct comparison of classification and embedding approaches. The overview of techniques from 71 submissions offers practical insights into current system capabilities.
major comments (2)
- [§4] §4 (Annotation Process): The claim that 'each annotation being backed by at least two annotators in agreement' is central to dataset validity, yet the section does not report the overall inter-annotator agreement rate (e.g., percentage of triples with full agreement or Cohen's kappa), nor how ties or residual disagreements were resolved. This detail is load-bearing for assessing whether the binary labels faithfully operationalize the novel definition without unmeasured bias.
- [§5] §5 (System Overview): The statement that 'in the embedding setup, systems with pre- and post-processing on pretrained embedding models perform about on par with custom fine-tuned solutions' requires a specific table or figure reference with quantitative scores (e.g., accuracy or ranking metrics per track) to support the 'on par' claim; without it, the headroom analysis in the conclusion rests on an unverified comparison.
minor comments (2)
- [Introduction] The abstract and introduction assert compatibility with 'narrative theory' but provide no inline citations to foundational works (e.g., structuralist or cognitive narrative models); adding 2-3 targeted references would strengthen the novelty claim without altering the core contribution.
- [Results] Figure captions for embedding visualizations (mentioned in the task website description) should explicitly note the dimensionality reduction method and color-coding scheme used for instance-level results to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript and for the positive recommendation of minor revision. We address each major comment below.
read point-by-point responses
-
Referee: [§4] §4 (Annotation Process): The claim that 'each annotation being backed by at least two annotators in agreement' is central to dataset validity, yet the section does not report the overall inter-annotator agreement rate (e.g., percentage of triples with full agreement or Cohen's kappa), nor how ties or residual disagreements were resolved. This detail is load-bearing for assessing whether the binary labels faithfully operationalize the novel definition without unmeasured bias.
Authors: We agree that aggregate inter-annotator agreement statistics and resolution procedures are necessary to fully substantiate the dataset's validity. The current description in §4 notes that each annotation is backed by at least two annotators in agreement but omits the overall rate and handling of disagreements. In the revised manuscript we will add these details to §4, including the percentage of triples achieving full agreement, any computed metrics such as Cohen's kappa, and a description of the tie/disagreement resolution process (e.g., third-annotator adjudication or consensus discussion). revision: yes
-
Referee: [§5] §5 (System Overview): The statement that 'in the embedding setup, systems with pre- and post-processing on pretrained embedding models perform about on par with custom fine-tuned solutions' requires a specific table or figure reference with quantitative scores (e.g., accuracy or ranking metrics per track) to support the 'on par' claim; without it, the headroom analysis in the conclusion rests on an unverified comparison.
Authors: We accept that the qualitative claim in §5 would be strengthened by an explicit quantitative reference. The manuscript currently states that pre- and post-processing approaches perform about on par with fine-tuned solutions without citing the supporting scores. We will revise §5 to include a direct reference to the relevant table or figure that reports the accuracy or ranking metrics for both categories of embedding systems. This addition will also reinforce the headroom analysis presented in the conclusion. revision: yes
Circularity Check
No significant circularity
full rationale
The manuscript defines a shared task operationalizing narrative similarity via binary classification on story-summary triples, introduces a claimed novel definition aligned with narrative theory, collects annotations for over 1,000 triples, and reports results from external system submissions across two tracks. No mathematical derivations, equations, parameter fittings, or predictive steps are present that could reduce to the inputs by construction. The central contributions are task specification and empirical overview rather than any self-referential chain; annotations and evaluations rely on external annotators and submitted systems, with no load-bearing self-citations or ansatzes that collapse the claimed novelty into prior inputs. This is a standard task-definition paper carrying negligible circularity burden.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Duluth at SemEval-2026 Task 4: A hybrid ap- proach to narrative similarity using bi-encoder embed- dings with cross-encoder tie breaking using learned weights. In Proceedings of the 20th International Workshop on Semantic Evaluation (SemEval-2026), San Diego, CA, USA. Association for Computational Linguistics. Samanvitha Bolisetty, Shreya Jayprakash Ashar...
work page 2026
-
[2]
Where Have I Heard This Story Before? Identi- fying Narrative Similarity in Movie Remakes. In Pro- ceedings of the 2018 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol- ume 2 (Short Papers), pages 673–678, New Orleans, Louisiana, USA. Association for Computational Lin- guistics....
work page 2018
-
[3]
On the Multi-Dimensional Differences between German Poetry of Realism and Modernism
Modeling and Measuring Short Text Similari- ties. On the Multi-Dimensional Differences between German Poetry of Realism and Modernism . Journal of Computational Literary Studies , 1(1). Tisa Islam Erana, Azwad Anjum Islam, Anshu Kiran Sharma, and Mark A. Finlayson. 2026. COGNAC at SemEval-2026 Task 4: Evaluating narrative com- ponents with LLMs for hard s...
work page 2026
-
[4]
Narrative nexus at SemEval-2026 Task 4: Mod- eling narrative similarity via instruction-based fine- tuning and synthetic data augmentation. In Proceed- ings of the 20th International Workshop on Semantic Evaluation (SemEval-2026), San Diego, CA, USA. Association for Computational Linguistics. Ahmed Hamdi, Emanuela Boros, Jose G. Moreno, Adam Jatowt, Georg...
work page 2026
-
[5]
Hans Ole Hatzel and Chris Biemann
Narrative Similarity – Annotation Guidelines . Hans Ole Hatzel and Chris Biemann. 2024a. Story em- beddings – narrative-focused representations of fic- tional stories . In Proceedings of the 62nd Annual Meeting of the Association for Computational Lin- guistics, pages 5931–5943, Miami, Florida, USA. As- sociation for Computational Linguistics. Hans Ole Ha...
work page 2024
-
[6]
Lacuna inc. at SemEval-2026 Task 4: Struc- turally gated state-space models for disentangling nar- rative similarity. In Proceedings of the 20th Interna- tional Workshop on Semantic Evaluation (SemEval- 2026), San Diego, CA, USA. Association for Com- putational Linguistics. Wendy G. Lehnert. 1981. Plot units and narrative sum- marization. Cognitive scienc...
work page 2026
-
[7]
JCT at SemEval-2026 Task 4: A multi-method approach to narrative story similarity. In Proceed- ings of the 20th International Workshop on Semantic Evaluation (SemEval-2026), San Diego, CA, USA. Association for Computational Linguistics. Marcin Sawinski. 2026. FactUEP at SemEval-2026 Task 4: Structured narrative similarity scoring with aspect decomposition...
-
[8]
AI-monitors at SemEval-2026 Task 4: A hy- brid embedding and LLM ensemble for narrative sim- ilarity. In Proceedings of the 20th International Work- shop on Semantic Evaluation (SemEval-2026) , San Diego, CA, USA. Association for Computational Lin- guistics. Max Upravitelev, Veronika Solopova, Jing Y ang, Char- lott Jakob, Premtim Sahitaj, Ariana Sahitaj,...
work page 2026
-
[9]
IIITH boys at SemEval-2026 Task 4: StoryNet - understanding narrative story similarity through symbolic representations. In Proceedings of the 20th International Workshop on Semantic Evaluation (SemEval-2026), San Diego, CA, USA. Association for Computational Linguistics. Jianfei Xu, Ting Zhu, Mingyang Chen, and Huizhi Liang
work page 2026
-
[10]
NCL&HKU-NarrSim at SemEval-2026 Task 4: Aspect-based agents and supervised contrastive embeddings for narrative similarity. In Proceedings of the 20th International Workshop on Semantic Eval- uation (SemEval-2026), San Diego, CA, USA. Asso- ciation for Computational Linguistics. Y en Y ee Y am and Hong Meng Y am. 2026. Y am at SemEval-2026 Task 4: Failure...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.