pith. sign in

arxiv: 1906.08403 · v1 · pith:IQXXOSMEnew · submitted 2019-06-20 · 💻 cs.SI · physics.soc-ph

How the Avengers assemble: Ecological modelling of effective cast sizes for movies

Pith reviewed 2026-05-25 19:39 UTC · model grok-4.3

classification 💻 cs.SI physics.soc-ph
keywords Shannon entropycharacter diversityecological diversityMarvel Cinematic UniverseJensen-Shannon divergencerecommender systemsmovie classificationcast size
0
0 comments X

The pith

A Shannon-entropy metric on character counts measures effective cast size in movies and predicts success in the Marvel Cinematic Universe.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Movies have varying numbers of characters, but direct counts like credits are unreliable. The paper proposes using Shannon entropy from ecology to capture the effective diversity of characters based on their appearance frequencies. This metric aids in classifying films and, when combined with Jensen-Shannon divergence, measures similarity between movies for applications like recommendations. When tested on the Marvel Cinematic Universe, the measures also correlate with film success and clarify connections across the series.

Core claim

The number of characters in a movie is characterised using a Shannon-entropy based metric drawn from ecological diversity, and the metric generalised with Jensen-Shannon divergence provides a similarity measure that is useful in recommender systems and predicts success for films in the MCU while understanding relationships in the film arc.

What carries the argument

Shannon-entropy metric applied to character appearance or mention counts, generalised via Jensen-Shannon divergence to quantify movie similarity.

If this is right

  • The metric enables taxonomic classification of movies based on character diversity.
  • It supplies a similarity measure for use in recommender systems such as Netflix.
  • The measures predict success for films within the MCU.
  • They provide insight into relationships between stories across the franchise arc.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar entropy-based approaches could be tested on other film franchises or television series to analyze narrative complexity.
  • Recommender systems could incorporate this similarity to suggest films with comparable cast dynamics.

Load-bearing premise

Character appearance or mention counts can be extracted reliably from available data sources and the resulting entropy meaningfully reflects narrative importance.

What would settle it

Finding that the entropy metric does not correlate with independent assessments of cast size or fails to predict success metrics like box office performance in the MCU.

Figures

Figures reproduced from arXiv: 1906.08403 by Lewis Mitchell, Matthew Roughan, Tobin South.

Figure 1
Figure 1. Figure 1: Effective cast size of each movie in the MCU showing type of movies by shape, and sub-sequences connected by dashed lines. The x-axis is the theatrical release date. sub-sequences of the overall set of movies are indicated by dashed lines. There are many interesting features of this plot. When considered by class we see notable features: most origin movies have a small effective cast, which grows in sequel… view at source ↗
Figure 2
Figure 2. Figure 2: Profitability as a function of effective cast size. the franchise, and as one of the key initiators of the cinematic universe. Quality of acting and direction, cast “star power”, timing and other factors cannot be discounted as important to the overall success of a movie. However, the effective cast size also appears to influence the profitability of a movie. This is a fact that does not seem to be missed … view at source ↗
Figure 3
Figure 3. Figure 3: IMDb rating as a function of effective cast size. movies. The two transcriptions were very different: the second set were performed at a much courser level [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Conflict and dialogue metrics of cast size. Shading indicates less complete (dialogue) datasets. There are additional patterns of note. We can understand that movies that sit above the reference line contain more dialogue-based participation, and less conflict, and in turn, those below the line entail more conflict. Extreme examples are Spider-Man and Captain America: The First Avenger. Origin movies usual… view at source ↗
Figure 5
Figure 5. Figure 5: A comparison of the two distance metrics showing that effective divergence barDeffective lies below the Jensen-Shannon measure D¯ JS. In practical terms the metric performs exactly as you might expect [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Heat map of normalised similarities S¯ effective between pairs of movies. Evident are blocks of movies corresponding to the major sub-sequences, e.g., the Thor movies or the Iron Man movies. Also noticeable is the sharing of cast between the Avengers (team-up) movies and many of the others. projecting into a 2D space, the 3D projection more clearly separates these clusters, but is hard to illustrate here. … view at source ↗
Figure 7
Figure 7. Figure 7: Dendrogram derived from hierarchical clustering of the movies based on the dis-similarities DA,B -0.4 -0.2 0.0 0.2 0.4 -0.4 -0.2 0.0 0.2 0.4 Ant-Man Ant-Man And The Wasp The Avengers Avengers: Age Of Ultron Avengers: Infinity War Black Panther Captain America: Civil War Captain America: The First Avenger Captain America: The Winter Soldier Captain Marvel Doctor Strange Guardians Of The Galaxy Guardians Of … view at source ↗
Figure 8
Figure 8. Figure 8: MDS projection into a 2D space based on the cast dissimilarities. Note that a small translation has been applied to place Avengers: Infinity War at the origin. June 21, 2019 20/25 [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
read the original abstract

The number of characters in a movie is an interesting feature. However, it is non-trivial to measure directly. Naive metrics such as the number of credited characters vary wildly. Here, we show that a metric based on the notion of "ecological diversity" as expressed through a Shannon-entropy based metric can characterise the number of characters in a movie, and is useful in taxonomic classification. We also show how the metric can be generalised using Jensen-Shannon divergence to provide a measure of the similarity of characters appearing in different movies, for instance of use in recommender systems, e.g., Netflix. We apply our measures to the Marvel Cinematic Universe (MCU), and show what they teach us about this highly successful franchise of movies. In particular, these measures provide a useful predictor of "success" for films in the MCU, as well as a natural means to understand the relationships between the stories in the overall film arc.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a Shannon-entropy metric H = −∑ p_i log p_i computed from proportions of character appearances or mentions to quantify 'effective cast size' in films as an alternative to naive credit counts. It claims this ecological-diversity measure enables taxonomic classification of movies, generalizes via Jensen-Shannon divergence to a similarity metric useful for recommender systems, and when applied to the Marvel Cinematic Universe (MCU) yields insights into narrative relationships and serves as a predictor of film success.

Significance. If the extraction of reliable p_i values and the interpretation of H as reflecting narrative importance rather than data artifacts can be established, the work would offer a novel quantitative bridge between information theory, ecology, and film studies with direct applications to classification, recommendation, and franchise analysis. The MCU case study provides a concrete, high-visibility testbed. The absence of any reported validation, data provenance, or error analysis currently prevents assessment of whether these downstream uses are supported or spurious.

major comments (3)
  1. [Abstract / Methods] Abstract and Methods (wherever described): The central claims rest on the premise that character mention/appearance counts can be extracted to form a distribution p_i from which H meaningfully quantifies effective cast size. No section specifies the data source (scripts, credits, Wikipedia, etc.), the parsing procedure, or any validation against manual ground-truth counts. This is load-bearing; without it the ecological interpretation fails and the taxonomic, JSD-similarity, and MCU-success claims become untestable.
  2. [MCU results] MCU results section: The claim that the metric 'provides a useful predictor of success' for MCU films is stated without reported statistical controls, baseline comparisons, or error analysis. If success is measured by box-office or ratings, the manuscript must demonstrate that the entropy term adds explanatory power beyond obvious covariates (budget, release date, prior franchise performance).
  3. [JSD section] JSD generalization: The extension to Jensen-Shannon divergence for movie similarity is presented as immediately useful for recommender systems, yet no quantitative evaluation (e.g., precision@K on a held-out set, comparison to content-based or collaborative baselines) is supplied to support this utility claim.
minor comments (2)
  1. [Methods] Notation for the entropy formula should be introduced with an explicit equation number and definition of the summation index i (over characters).
  2. [Results] The manuscript should include a table or figure showing example p_i distributions and resulting H values for a few well-known films to illustrate the metric.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The comments highlight important gaps in documentation and validation that we agree need to be addressed. Below we respond point-by-point and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and Methods (wherever described): The central claims rest on the premise that character mention/appearance counts can be extracted to form a distribution p_i from which H meaningfully quantifies effective cast size. No section specifies the data source (scripts, credits, Wikipedia, etc.), the parsing procedure, or any validation against manual ground-truth counts. This is load-bearing; without it the ecological interpretation fails and the taxonomic, JSD-similarity, and MCU-success claims become untestable.

    Authors: We agree that the current manuscript lacks explicit description of the data source and extraction pipeline. We will add a dedicated Methods subsection that specifies the source (Wikipedia plot summaries and cast lists for the films analyzed), the rule-based parsing procedure used to obtain character mention counts, and a small-scale validation exercise comparing the automated counts against manual annotations on a sample of films. This will be included in the revised version. revision: yes

  2. Referee: [MCU results] MCU results section: The claim that the metric 'provides a useful predictor of success' for MCU films is stated without reported statistical controls, baseline comparisons, or error analysis. If success is measured by box-office or ratings, the manuscript must demonstrate that the entropy term adds explanatory power beyond obvious covariates (budget, release date, prior franchise performance).

    Authors: The MCU analysis as written is descriptive and does not include the requested controls or model comparisons. We will revise the results section to present a multiple regression (or similar) with box-office or rating as the outcome, including covariates for budget, release date, and prior franchise performance. We will report the incremental explanatory power attributable to the entropy metric and any associated error or robustness checks. revision: yes

  3. Referee: [JSD section] JSD generalization: The extension to Jensen-Shannon divergence for movie similarity is presented as immediately useful for recommender systems, yet no quantitative evaluation (e.g., precision@K on a held-out set, comparison to content-based or collaborative baselines) is supplied to support this utility claim.

    Authors: We acknowledge that the manuscript asserts potential utility for recommender systems without any quantitative backing. We will either add a limited evaluation (for example, ranking movies by JSD and inspecting overlap with known similar titles) or temper the claim to present JSD as a similarity measure whose recommender value remains to be tested. The choice will depend on space and scope considerations in revision. revision: partial

Circularity Check

0 steps flagged

No circularity detected in derivation

full rationale

The paper defines its effective cast size via the standard Shannon entropy H = -∑ p_i log p_i applied to character mention proportions p_i obtained from external data sources, then extends it with the likewise standard Jensen-Shannon divergence for inter-movie similarity. No equations or claims reduce the output metric to a fitted parameter, self-defined quantity, or self-citation chain; the central results are direct computations on new inputs rather than tautological restatements of those inputs. The MCU success prediction is presented as an empirical correlation using the independently computed metric, with no indication that it collapses to the input data by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The entropy calculation implicitly assumes a well-defined probability distribution over characters that can be obtained from movie metadata.

pith-pipeline@v0.9.0 · 5688 in / 1033 out tokens · 21702 ms · 2026-05-25T19:39:45.824916+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 1 internal anchor

  1. [1]

    Ecological Diversity: Measuring the Unmeasurable

    Daly AJ, Baetens JM, De Baets B. Ecological Diversity: Measuring the Unmeasurable. Mathematics. 2018;6(7). doi:10.3390/math6070119

  2. [2]

    Screenplay: The foundations of screenwriting

    Field S. Screenplay: The foundations of screenwriting. Delta; 2005

  3. [3]

    The thirty-six dramatic situations

    Polti G. The thirty-six dramatic situations. JK Reeve; 1921

  4. [4]

    Twenty master plots and how to build them

    Tobias RB. Twenty master plots and how to build them. Piatkus; 1993

  5. [5]

    The seven basic plots: Why we tell stories

    Booker C. The seven basic plots: Why we tell stories. A&C Black; 2004

  6. [6]

    An Estimate of an Upper Bound for the Entropy of English

    Brown PF, Pietra SAD, Pietra VJD, Lai JC, Mercer RL. An Estimate of an Upper Bound for the Entropy of English. Computational Linguistics. 1992;18(1):31–40

  7. [7]

    The complexity and entropy of literary styles

    Kontoyiannis I. The complexity and entropy of literary styles. Department of Statistics, Stanford University; 1997. June 21, 2019 23/25

  8. [8]

    Shakespeare and other English Renaissance authors as characterized by Information Theory complexity quantifiers

    Rosso OA, Craig H, Moscato P. Shakespeare and other English Renaissance authors as characterized by Information Theory complexity quantifiers. Physica A: Statistical Mechanics and its Applications. 2009;388(6):916–926

  9. [9]

    Information flow reveals prediction limits in online social activity

    Bagrow JP, Liu X, Mitchell L. Information flow reveals prediction limits in online social activity. Nature Human Behaviour. 2019;3(2):122

  10. [10]

    Toward a meaningful definition of vocabulary size

    D’Anna CA, Zechmeister EB, Hall JW. Toward a meaningful definition of vocabulary size. Journal of Reading Behavior. 1991;23(1):109–122

  11. [11]

    Humans store about 1.5 megabytes of information during language acquisition

    Mollica F, Piantadosi ST. Humans store about 1.5 megabytes of information during language acquisition. Royal Society Open Science. 2019;6(3):181393

  12. [12]

    Palm Sunday: an autobiographical collage

    Vonnegut K. Palm Sunday: an autobiographical collage. Dial Press; 1999

  13. [13]

    Macroanalysis: Digital methods and literary history

    Jockers ML. Macroanalysis: Digital methods and literary history. University of Illinois Press; 2013

  14. [14]

    Syuzhet: Extract Sentiment and Plot Arcs from Text; 2015

    Jockers ML. Syuzhet: Extract Sentiment and Plot Arcs from Text; 2015. Available from: https://github.com/mjockers/syuzhet

  15. [15]

    The Bestseller Code: Anatomy of the blockbuster novel

    Archer J, Jockers ML. The Bestseller Code: Anatomy of the blockbuster novel. St. Martin’s Press; 2016

  16. [16]

    A multiscale theory for the dynamical evolution of sentiment in novels

    Gao J, Jockers ML, Laudun J, Tangherlini T. A multiscale theory for the dynamical evolution of sentiment in novels. In: 2016 International Conference on Behavioral, Economic and Socio-cultural Computing (BESC); 2016. p. 1–4

  17. [17]

    The emotional arcs of stories are dominated by six basic shapes

    Reagan AJ, Mitchell L, Kiley D, Danforth CM, Dodds PS. The emotional arcs of stories are dominated by six basic shapes. EPJ Data Science. 2016;5(1):31

  18. [18]

    The Data Science of Hollywood: Using Emotional Arcs of Movies to Drive Business Model Innovation in Entertainment Industries

    Del Vecchio M, Kharlamov A, Parry G, Pogrebna G. The Data Science of Hollywood: Using Emotional Arcs of Movies to Drive Business Model Innovation in Entertainment Industries. arXiv preprint arXiv:180702221. 2018

  19. [19]

    Universal properties of mythological networks

    Mac Carron P, Kenna R. Universal properties of mythological networks. EPL (Europhysics Letters). 2012;99(2):28002

  20. [20]

    Viking sagas: Six degrees of Icelandic separation Social networks from the Viking era

    Mac Carron P, Kenna R. Viking sagas: Six degrees of Icelandic separation Social networks from the Viking era. Significance. 2013;10(6):12–17. doi:10.1111/j.1740-9713.2013.00704.x

  21. [21]

    Narrative as a Complex Network: A Study of Victor Hugo’s Les Mis´ erables

    Min S, Park J. Narrative as a Complex Network: A Study of Victor Hugo’s Les Mis´ erables. In: Proceedings of HCI Korea. Hanbit Media, Inc.; 2016. p. 100–107

  22. [22]

    Representation of texts as complex networks: a mesoscopic approach

    Ferraz de Arruda H, Nascimento Silva F, Queiroz Marinho V, Raphael Amancio D, da Fontoura Costa L. Representation of texts as complex networks: a mesoscopic approach. Journal of Complex Networks. 2017;6(1):125–144

  23. [23]

    Linguistic analysis of differences in portrayal of movie characters

    Ramakrishna A, Mart´ ınez VR, Malandrakis N, Singla K, Narayanan S. Linguistic analysis of differences in portrayal of movie characters. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2017. p. 1669–1678

  24. [24]

    Marvel Universe looks almost like a real social network

    Alberich R, Miro-Julia J, Rossell´ o F. Marvel Universe looks almost like a real social network. arXiv preprint cond-mat/0202174. 2002;. June 21, 2019 24/25

  25. [25]

    Diana in the World of Men: a character network approach to analysing gendered vocal representation in Wonder Woman

    Jones P. Diana in the World of Men: a character network approach to analysing gendered vocal representation in Wonder Woman . Feminist Media Studies. 2018;00(00):1–17. doi:10.1080/14680777.2018.1510846

  26. [26]

    The one comparing narrative social network extraction techniques

    Edwards M, Mitchell L, Tuke J, Roughan M. The one comparing narrative social network extraction techniques. arXiv preprint arXiv:181101467. 2018

  27. [27]

    IMDb Help Center

    What does uncredited mean?;. IMDb Help Center

  28. [28]

    Can it be listed on IMDb?

    I worked on a title but did not receive a screen credit. Can it be listed on IMDb?;. IMDb Help Center

  29. [29]

    IMDb Help Center

    Attributes;. IMDb Help Center

  30. [30]

    Axiomatic Characterizations of Information Measures

    Csisz´ ar I. Axiomatic Characterizations of Information Measures. Entropy. 2008;10(3):261–273. doi:10.3390/e10030261

  31. [31]

    A corpus driven study of the potential for vocabulary learning through watching movies

    Webb S. A corpus driven study of the potential for vocabulary learning through watching movies. International Journal of Corpus Linguistics. 2010;15(4):497–519

  32. [32]

    A survey of available corpora for building data-driven dialogue systems

    Serban IV, Lowe R, Henderson P, Charlin L, Pineau J. A survey of available corpora for building data-driven dialogue systems. arXiv preprint arXiv:151205742. 2015

  33. [33]

    https://transcripts.fandom.com/wiki/Transcripts_Wiki

    Transcripts Wiki on Fandom;. https://transcripts.fandom.com/wiki/Transcripts_Wiki

  34. [34]

    https://www.scriptslug.com/

    Script Slug;. https://www.scriptslug.com/

  35. [35]

    Multisemiotic Transcriptions as Film Referencing Systems

    Baldry A. Multisemiotic Transcriptions as Film Referencing Systems. InTRAlinea: Online Translation Journal. 2016

  36. [36]

    Tidy data

    Wickham H. Tidy data. Journal of Statistical Software. Submitted

  37. [37]

    Marvel Cinematic Universe wiki

  38. [38]

    Evolution and measurement of species diversity

    Whittaker RH. Evolution and measurement of species diversity. Taxon. 1972; p. 213–251

  39. [39]

    Early Predictions of Movie Success: The Who, What, and When of Profitability

    Lash MT, Zhao K. Early Predictions of Movie Success: The Who, What, and When of Profitability. Journal of Management Information Systems. 2016;33(3):874–903

  40. [40]

    On Choosing and Bounding Probability Metrics

    Gibbs AL, Su FE. On Choosing and Bounding Probability Metrics. Interdisciplinary Science Reviews. 2002;70(3):419–435. doi:10.1111/j.1751-5823.2002.tb00178.x

  41. [41]

    Similarity-Based Methods For Word Sense Disambiguation

    Dagan I, Lee L, Pereira F. Similarity-Based Methods For Word Sense Disambiguation. In: Thirty-Fifth Annual Meeting of the Association for Computational Linguistics; 1997

  42. [42]

    Divergence measures based on the Shannon entropy

    Lin J. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory. 1991;37:145–151. June 21, 2019 25/25