How the Avengers assemble: Ecological modelling of effective cast sizes for movies
Pith reviewed 2026-05-25 19:39 UTC · model grok-4.3
The pith
A Shannon-entropy metric on character counts measures effective cast size in movies and predicts success in the Marvel Cinematic Universe.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper characterises the number of characters in a movie with a Shannon-entropy metric drawn from ecological diversity. Generalised with Jensen-Shannon divergence, the metric yields a measure of similarity between movies that could be used in recommender systems. Applied to the MCU, the measures predict film success and help explain the relationships between stories in the overall film arc.
What carries the argument
Shannon-entropy metric applied to character appearance or mention counts, generalised via Jensen-Shannon divergence to quantify movie similarity.
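A minimal sketch of the entropy step may help here. It assumes the usual ecological convention of converting entropy H into an "effective number" via exp(H); this page does not confirm that the paper uses exactly that conversion. The character names and counts are hypothetical placeholders, not data from any film.

```python
import math
from collections import Counter

def shannon_entropy(counts):
    """Shannon entropy (natural log) of a mapping character -> mention count."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log(p) for p in probs)

def effective_cast_size(counts):
    """exp(H): an assumed ecological-style 'effective number' of characters."""
    return math.exp(shannon_entropy(counts))

# Hypothetical counts: four characters in both films, very different balance.
ensemble = Counter({"A": 10, "B": 10, "C": 10, "D": 10})
star_vehicle = Counter({"A": 37, "B": 1, "C": 1, "D": 1})

ensemble_size = effective_cast_size(ensemble)      # equals the raw count, 4
star_size = effective_cast_size(star_vehicle)      # well below 4
print(ensemble_size, star_size)
```

Both hypothetical films credit four characters, but only the evenly balanced one has an effective cast size equal to four. That gap is what a naive credit count cannot express.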
If this is right
- The metric enables taxonomic classification of movies based on character diversity.
- It supplies a similarity measure for use in recommender systems such as Netflix.
- The measures predict success for films within the MCU.
- They provide insight into relationships between stories across the franchise arc.
Where Pith is reading between the lines
- Similar entropy-based approaches could be tested on other film franchises or television series to analyze narrative complexity.
- Recommender systems could incorporate this similarity to suggest films with comparable cast dynamics.
Load-bearing premise
Character appearance or mention counts can be extracted reliably from available data sources and the resulting entropy meaningfully reflects narrative importance.
What would settle it
Finding that the entropy metric does not correlate with independent assessments of cast size or fails to predict success metrics like box office performance in the MCU.
Original abstract
The number of characters in a movie is an interesting feature. However, it is non-trivial to measure directly. Naive metrics such as the number of credited characters vary wildly. Here, we show that a metric based on the notion of "ecological diversity" as expressed through a Shannon-entropy based metric can characterise the number of characters in a movie, and is useful in taxonomic classification. We also show how the metric can be generalised using Jensen-Shannon divergence to provide a measure of the similarity of characters appearing in different movies, for instance of use in recommender systems, e.g., Netflix. We apply our measures to the Marvel Cinematic Universe (MCU), and show what they teach us about this highly successful franchise of movies. In particular, these measures provide a useful predictor of "success" for films in the MCU, as well as a natural means to understand the relationships between the stories in the overall film arc.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Shannon-entropy metric H = −∑ p_i log p_i computed from proportions of character appearances or mentions to quantify 'effective cast size' in films as an alternative to naive credit counts. It claims this ecological-diversity measure enables taxonomic classification of movies, generalizes via Jensen-Shannon divergence to a similarity metric useful for recommender systems, and when applied to the Marvel Cinematic Universe (MCU) yields insights into narrative relationships and serves as a predictor of film success.
Significance. If the extraction of reliable p_i values and the interpretation of H as reflecting narrative importance rather than data artifacts can be established, the work would offer a novel quantitative bridge between information theory, ecology, and film studies with direct applications to classification, recommendation, and franchise analysis. The MCU case study provides a concrete, high-visibility testbed. The absence of any reported validation, data provenance, or error analysis currently prevents assessment of whether these downstream uses are supported or spurious.
major comments (3)
- [Abstract / Methods] Abstract and Methods (wherever described): The central claims rest on the premise that character mention/appearance counts can be extracted to form a distribution p_i from which H meaningfully quantifies effective cast size. No section specifies the data source (scripts, credits, Wikipedia, etc.), the parsing procedure, or any validation against manual ground-truth counts. This is load-bearing; without it the ecological interpretation fails and the taxonomic, JSD-similarity, and MCU-success claims become untestable.
- [MCU results] MCU results section: The claim that the metric 'provides a useful predictor of success' for MCU films is stated without reported statistical controls, baseline comparisons, or error analysis. If success is measured by box-office or ratings, the manuscript must demonstrate that the entropy term adds explanatory power beyond obvious covariates (budget, release date, prior franchise performance).
- [JSD section] JSD generalization: The extension to Jensen-Shannon divergence for movie similarity is presented as immediately useful for recommender systems, yet no quantitative evaluation (e.g., precision@K on a held-out set, comparison to content-based or collaborative baselines) is supplied to support this utility claim.
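The second major comment asks whether the entropy term adds explanatory power beyond obvious covariates. One way to check this is to compare nested ordinary least-squares fits, sketched below on synthetic data. Every variable name, coefficient, and the sample size is an assumption made for illustration. Nothing here reproduces the paper's data or analysis.

```python
import random

def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def r_squared(features, y):
    """R^2 of an ordinary least-squares fit that includes an intercept."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins (hypothetical): budget, release order, entropy, outcome.
rng = random.Random(0)
n_films = 30
budget = [rng.uniform(1.0, 4.0) for _ in range(n_films)]
release_index = [float(i) for i in range(n_films)]
entropy = [rng.uniform(1.0, 3.0) for _ in range(n_films)]
gross = [2.0 * b + 0.1 * t + 1.5 * h + rng.gauss(0.0, 0.5)
         for b, t, h in zip(budget, release_index, entropy)]

r2_base = r_squared(list(zip(budget, release_index)), gross)
r2_full = r_squared(list(zip(budget, release_index, entropy)), gross)
delta_r2 = r2_full - r2_base  # incremental explanatory power of entropy
print(r2_base, r2_full, delta_r2)
```

Because the models are nested, R^2 cannot decrease when entropy is added, so the size of the increase alone is weak evidence. A real analysis would also need a significance test or out-of-sample comparison.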
minor comments (2)
- [Methods] Notation for the entropy formula should be introduced with an explicit equation number and definition of the summation index i (over characters).
- [Results] The manuscript should include a table or figure showing example p_i distributions and resulting H values for a few well-known films to illustrate the metric.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. The comments highlight important gaps in documentation and validation that we agree need to be addressed. Below we respond point-by-point and indicate the revisions we will make.
- Referee: [Abstract / Methods] Abstract and Methods (wherever described): The central claims rest on the premise that character mention/appearance counts can be extracted to form a distribution p_i from which H meaningfully quantifies effective cast size. No section specifies the data source (scripts, credits, Wikipedia, etc.), the parsing procedure, or any validation against manual ground-truth counts. This is load-bearing; without it the ecological interpretation fails and the taxonomic, JSD-similarity, and MCU-success claims become untestable.
  Authors: We agree that the current manuscript lacks explicit description of the data source and extraction pipeline. We will add a dedicated Methods subsection that specifies the source (Wikipedia plot summaries and cast lists for the films analyzed), the rule-based parsing procedure used to obtain character mention counts, and a small-scale validation exercise comparing the automated counts against manual annotations on a sample of films. This will be included in the revised version.
  Revision: yes
- Referee: [MCU results] MCU results section: The claim that the metric 'provides a useful predictor of success' for MCU films is stated without reported statistical controls, baseline comparisons, or error analysis. If success is measured by box-office or ratings, the manuscript must demonstrate that the entropy term adds explanatory power beyond obvious covariates (budget, release date, prior franchise performance).
  Authors: The MCU analysis as written is descriptive and does not include the requested controls or model comparisons. We will revise the results section to present a multiple regression (or similar) with box-office or rating as the outcome, including covariates for budget, release date, and prior franchise performance. We will report the incremental explanatory power attributable to the entropy metric and any associated error or robustness checks.
  Revision: yes
- Referee: [JSD section] JSD generalization: The extension to Jensen-Shannon divergence for movie similarity is presented as immediately useful for recommender systems, yet no quantitative evaluation (e.g., precision@K on a held-out set, comparison to content-based or collaborative baselines) is supplied to support this utility claim.
  Authors: We acknowledge that the manuscript asserts potential utility for recommender systems without any quantitative backing. We will either add a limited evaluation (for example, ranking movies by JSD and inspecting overlap with known similar titles) or temper the claim to present JSD as a similarity measure whose recommender value remains to be tested. The choice will depend on space and scope considerations in revision.
  Revision: partial
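The limited evaluation mentioned in the last response (ranking movies by JSD and inspecting overlap with known similar titles) could look roughly like the sketch below. The film titles, character counts, and the reference list of "known similar" titles are all hypothetical placeholders. The base-2 logarithm is a choice made here so that JSD lies between 0 and 1.

```python
import math

def normalise(counts):
    """Turn character -> count into character -> proportion p_i."""
    total = sum(counts.values())
    return {name: c / total for name, c in counts.items() if c > 0}

def entropy_bits(dist):
    """Shannon entropy in bits of a character -> proportion mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: entropy of the mixture minus mean entropy."""
    names = set(p) | set(q)
    mix = {n: 0.5 * (p.get(n, 0.0) + q.get(n, 0.0)) for n in names}
    return entropy_bits(mix) - 0.5 * (entropy_bits(p) + entropy_bits(q))

def rank_by_similarity(query, catalogue):
    """Other titles ordered from most to least similar (smallest JSD first)."""
    others = [t for t in catalogue if t != query]
    return sorted(others, key=lambda t: jsd(catalogue[query], catalogue[t]))

def overlap_at_k(ranked, reference, k):
    """Fraction of the top-k ranked titles that appear in the reference list."""
    return len(set(ranked[:k]) & set(reference)) / k

# Hypothetical catalogue of character mention counts.
raw_counts = {
    "film_a": {"hero": 30, "sidekick": 10, "villain": 10},
    "film_b": {"hero": 25, "sidekick": 15, "villain": 5, "mentor": 5},
    "film_c": {"pilot": 20, "engineer": 20},
    "film_d": {"hero": 5, "pilot": 20, "engineer": 15},
}
catalogue = {title: normalise(c) for title, c in raw_counts.items()}

ranked = rank_by_similarity("film_a", catalogue)
reference = ["film_b"]  # hypothetical "known similar" titles for film_a
score = overlap_at_k(ranked, reference, k=1)
print(ranked, score)
```

Films with no characters in common reach the maximum divergence of 1 bit, so this measure cannot distinguish between unrelated films. This is one reason a comparison against content-based or collaborative baselines would still be needed.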
Circularity Check
No circularity detected in derivation
The paper defines its effective cast size via the standard Shannon entropy H = -∑ p_i log p_i applied to character mention proportions p_i obtained from external data sources, then extends it with the likewise standard Jensen-Shannon divergence for inter-movie similarity. No equations or claims reduce the output metric to a fitted parameter, self-defined quantity, or self-citation chain; the central results are direct computations on new inputs rather than tautological restatements of those inputs. The MCU success prediction is presented as an empirical correlation using the independently computed metric, with no indication that it collapses to the input data by construction.
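For reference, the two standard quantities named in this rationale can be written out as follows. The count symbol n_i and the mixture notation are introduced here for illustration, and the paper's own notation and choice of logarithm base may differ.

```latex
\begin{align}
  H(P) &= -\sum_{i} p_i \log p_i,
  \qquad p_i = \frac{n_i}{\sum_{j} n_j}, \\
  \mathrm{JSD}(P \,\|\, Q) &= H\!\left(\tfrac{1}{2}(P + Q)\right)
  - \tfrac{1}{2}\bigl(H(P) + H(Q)\bigr),
\end{align}
```

Here i indexes characters, n_i is the number of appearances or mentions of character i in a film, and P and Q are the character distributions of two films defined over the union of their characters.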