pith. sign in

arxiv: 2606.24022 · v1 · pith:AHU5PFXEnew · submitted 2026-06-23 · 💻 cs.HC · cs.SI

Do Language Models Pass the Bechdel Test? Auditing Gender Biases in LLM-Generated Screenplays

Pith reviewed 2026-06-25 23:29 UTC · model grok-4.3

classification 💻 cs.HC cs.SI
keywords Bechdel testgender biasLLM screenplayssocial network analysisrepresentational biasAI-generated mediawomen's representation
0
0 comments X

The pith

Human-written screenplays pass the Bechdel test more often than those generated by large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits gender biases in movie screenplays produced by large language models by applying the Bechdel test and social network analysis. It compares outputs from three leading models to 768 human-written scripts and finds that human scripts are more likely to feature conversations between women about topics other than men. This finding is relevant because LLMs are being integrated into media production, potentially shaping the stories audiences see. Additional network measures reveal mixed patterns, with some LLM scripts showing less bias on certain dimensions but all types exhibiting bias overall.

Core claim

Screenplays generated by GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5 are less likely to pass the Bechdel test than corresponding human-written screenplays, though measures of character centrality, homophily, and triadic relationships indicate that LLM scripts sometimes exhibit less representational bias, while every script type shows bias on most measures.

What carries the argument

An automated version of the Bechdel test applied to dialogue and character gender identification, supplemented by social network analysis of character interaction graphs.

If this is right

  • LLMs may reduce the frequency of stories with strong female representation in generated media.
  • Social network measures provide additional ways to quantify bias beyond the Bechdel test.
  • Quantitative auditing tools are needed for AI-generated creative content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the automated test is reliable, then training data curation could reduce such biases in future models.
  • Similar audits could be applied to other forms of LLM output like novels or news articles.
  • Integration of bias-checking mechanisms directly into LLM prompting or fine-tuning might improve outputs.

Load-bearing premise

The automated Bechdel test and social network measures accurately capture representational bias without substantial errors from dialogue parsing, character gender identification, or prompt construction choices.

What would settle it

Finding that human raters disagree with the automated Bechdel test scores on a significant portion of the scripts, or that the gender identification step misclassifies characters frequently.

Figures

Figures reproduced from arXiv: 2606.24022 by Dana\'e Metaxa, Megha N. Govindu, Sorelle A. Friedler, Stephanie T. Wang.

Figure 1
Figure 1. Figure 1: Mean script length across the 4 script types; error bars indicate standard deviation. [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Bechdel test performance before controlling for the number of interactions (left; raw pass rates) and after controlling [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of various network measures across 4 script types. Left: ratio of female centrality to male centrality for [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Ratio of female-involved interactions (edges) to male-involved interactions; error bars indicate mean [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Prompt used to generate structured movie scenes from anonymized synopses. [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt used to generate structured screenplays from a given movie scene and anonymized synopsis. [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
read the original abstract

As large language models (LLMs) are increasingly used in media production from journalistm to filmmaking, what impact do they have on the stories being told? Prior work has shown LLMs to perpetuate social biases, including those related to gender. We complement existing literature on gender bias in LLM outputs by auditing the network structure of LLM-generated movie screenplays through automating the Bechdel test, a popular measure of women's representation in literary and film works. We also introduce the use of social network analysis measures to further analyze representational bias in LLM-generated scripts. We evaluate screenplays generated by three state-of-the-art LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) against 768 corresponding human-written screenplays, finding that human-written scripts are more likely to pass the Bechdel test. However, other network analyses, like centrality, homophily, and triadic relationships demonstrate that in some cases LLM-scripts have less bias, although all script types demonstrate some representational bias under most measures. We conclude by discussing the continued need for further quantitative assessments of media representations and AI-generated content.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript automates the Bechdel test and applies social network analysis (centrality, homophily, triadic relationships) to compare screenplays generated by GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5 against 768 matched human-written screenplays. It reports that human scripts pass the Bechdel test at higher rates, while LLM scripts sometimes exhibit lower bias on network measures, though all script types show representational bias on most metrics. The work positions this as a quantitative audit of gender bias in AI-generated media.

Significance. If the automated pipeline is reliable, the study supplies a replicable, quantitative framework for auditing narrative bias in LLM outputs that complements existing text-level bias analyses. The direct human baseline comparison and extension to screenplay network structure are strengths that could support future media-AI research.

major comments (3)
  1. [Methods] Methods section: The automated Bechdel pipeline (dialogue turn extraction, character name identification, binary gender assignment, and the three-condition check) reports no validation against human annotations—no precision/recall, inter-annotator agreement, or confusion matrix on the 768 screenplay pairs. This is load-bearing for the central claim because unquantified parser errors that differ by script source (e.g., LLM scripts having more ambiguous names or shorter turns) can produce the reported human-LLM difference as an artifact.
  2. [Results] Results section (Bechdel pass-rate comparison): The finding that human-written scripts are more likely to pass the Bechdel test rests entirely on the unvalidated pipeline; without error-rate bounds, it is impossible to determine whether the difference survives plausible levels of gender-inference or segmentation noise.
  3. [Results] Results section (social-network metrics): The same character-node errors propagate to centrality, homophily, and triadic-closure calculations; any claim that LLM scripts are “less biased” on these measures inherits the identical validation gap.
minor comments (2)
  1. [Abstract] Abstract: Typo 'journalistm' should read 'journalism'.
  2. [Methods] The manuscript should clarify the exact prompt templates and length/genre controls used when generating the LLM screenplays, as these choices can affect downstream network statistics.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. The concerns regarding validation of the automated Bechdel pipeline are well-taken and highlight a genuine limitation in the current version. We address each major comment point by point below and agree that revisions are needed to strengthen the work. We will incorporate the suggested validation and sensitivity analyses in the revised manuscript.

read point-by-point responses
  1. Referee: [Methods] Methods section: The automated Bechdel pipeline (dialogue turn extraction, character name identification, binary gender assignment, and the three-condition check) reports no validation against human annotations—no precision/recall, inter-annotator agreement, or confusion matrix on the 768 screenplay pairs. This is load-bearing for the central claim because unquantified parser errors that differ by script source (e.g., LLM scripts having more ambiguous names or shorter turns) can produce the reported human-LLM difference as an artifact.

    Authors: We acknowledge that the manuscript does not include a quantitative validation of the automated pipeline against human annotations, which is a substantive gap. To address this, we will add a dedicated validation subsection in Methods. We will manually annotate a stratified random sample of 100 screenplays (50 human-written, 50 LLM-generated) with two independent annotators, reporting precision, recall, and F1 for each pipeline stage (dialogue extraction, character identification, gender assignment, and Bechdel condition checks), along with inter-annotator agreement via Cohen's kappa. We will also compare error rates between human and LLM sources to test for differential bias in parsing. revision: yes

  2. Referee: [Results] Results section (Bechdel pass-rate comparison): The finding that human-written scripts are more likely to pass the Bechdel test rests entirely on the unvalidated pipeline; without error-rate bounds, it is impossible to determine whether the difference survives plausible levels of gender-inference or segmentation noise.

    Authors: We agree that the Bechdel pass-rate results cannot be fully interpreted without error bounds. In the revision, after adding the validation metrics, we will include a sensitivity analysis in Results. This will simulate plausible error rates (e.g., 5%, 10%, and 15% misclassification in gender or segmentation) drawn from the validation study and recompute pass rates under these perturbations. We will report whether the human-LLM gap remains statistically significant across these scenarios and qualify the main finding accordingly if it does not. revision: yes

  3. Referee: [Results] Results section (social-network metrics): The same character-node errors propagate to centrality, homophily, and triadic-closure calculations; any claim that LLM scripts are “less biased” on these measures inherits the identical validation gap.

    Authors: We concur that character identification and gender assignment errors would affect all downstream network metrics. The same validation study will quantify accuracy for the character nodes and gender labels used in network construction. We will then propagate these error estimates to provide confidence intervals or robustness checks for centrality, homophily, and triadic closure results. Our original text already described the network findings as mixed rather than claiming LLM superiority; we will further emphasize this qualification and note the validation dependency in the revised Results and Discussion sections. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison to external human baseline

full rationale

The paper conducts a direct empirical audit by generating screenplays from three LLMs and comparing them to 768 human-written scripts using an automated Bechdel test plus social-network metrics. No equations, parameter fits, derivations, or predictions appear. The central claim (human scripts pass Bechdel at higher rates) is a straightforward measurement against an external corpus; it does not reduce to any self-defined quantity, fitted input renamed as prediction, or self-citation chain. All load-bearing steps are external data comparisons, so the analysis is self-contained with no circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described or required by the stated claims.

pith-pipeline@v0.9.1-grok · 5751 in / 1113 out tokens · 24271 ms · 2026-06-25T23:29:21.154814+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

66 extracted references · 46 canonical work pages

  1. [1]

    Abubakar Abid, Maheen Farooqi, and James Zou. 2021. Persistent Anti-Muslim Bias in Large Language Models. InProceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (AIES ’21). Association for Computing Machinery, New York, NY, USA, 298–306. doi:10.1145/3461702.3462624

  2. [2]

    Apoorv Agarwal, Sriramkumar Balasubramanian, Jiehan Zheng, and Sarthak Dash. 2014. Parsing Screenplays for Extracting Social Networks from Movies. InProceedings of the 3rd Workshop on Computational Linguistics for Literature (CLFL), Anna Feldman, Anna Kazantseva, and Stan Szpakowicz (Eds.). Association for Computational Linguistics, Gothenburg, Sweden, 50...

  3. [3]

    Apoorv Agarwal, Jiehan Zheng, Shruti Kamath, Sriramkumar Balasubramanian, and Shirin Ann Dey. 2015. Key Female Characters in Film Have More to Talk About Besides Men: Automating the Bechdel Test. InProceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Rada Mihalcea, ...

  4. [4]

    Evan Bailyn. 2025. Top Generative AI Chatbots by Market Share – December 2025. https://firstpagesage.com/reports/top-generative-ai- chatbots/ Section: SEO Blog

  5. [5]

    David Bamman, Rachael Samberg, Richard Jean So, and Naitian Zhou. 2024. Measuring diversity in Hollywood through the large-scale computational analysis of film.Proceedings of the National Academy of Sciences121, 46 (Nov. 2024), e2409770121. doi:10.1073/pnas. 2409770121 Publisher: Proceedings of the National Academy of Sciences

  6. [6]

    Solon Barocas, Kate Crawford, Aaron Shapiro, and Hanna Wallach. 2017. The problem with bias: From allocative to representational harms in machine learning. InSIGCIS conference paper

  7. [7]

    1985.Dykes to Watch Out For

    Alison Bechdel. 1985.Dykes to Watch Out For. Firebrand Books. https://dykestowatchoutfor.com/

  8. [8]

    Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (Technology) is Power: A Critical Survey of “Bias” in NLP. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 5454...

  9. [9]

    Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. InAdvances in Neural Information Processing Systems, Vol. 29. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html

  10. [10]

    Labor Issues Are Queer Issues

    Joel Kim Booster. 2023. GLAAD Media Awards 2023: Fire Island’s Joel Kim Stands Strong With WGA In Acceptance Speech: “Labor Issues Are Queer Issues”. https://glaad.org/glaad-media-awards-2023-fire-islands-joel-kim-stands-strong-wga-acceptance-speech- labor-issues/

  11. [11]

    Boyle and L

    D. Boyle and L. Tandan. 2008. Slumdog Millionaire

  12. [12]

    Semantics derived automatically from language corpora contain human-like biases

    Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases.Science356, 6334 (April 2017), 183–186. doi:10.1126/science.aal4230

  13. [13]

    Serina Chang, Alicja Chaszczewicz, Emma Wang, Maya Josifovska, Emma Pierson, and Jure Leskovec. 2025. LLMs Generate Structurally Realistic Social Networks but Overestimate Political Homophily.Proceedings of the International AAAI Conference on Web and Social Media19 (June 2025), 341–371. doi:10.1609/icwsm.v19i1.35820

  14. [14]

    Kate Crawford. 2017. The Trouble with Bias. InKeynote at NeurIPS

  15. [15]

    Hannah Cyberey, Yangfeng Ji, and David Evans. 2025. Unsupervised Concept Vector Extraction for Bias Control in LLMs. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (Eds.). Association for Computational Linguistics, Suzhou, China, 28333...

  16. [16]

    Kolda, and C

    Nurcan Durak, Ali Pinar, Tamara G. Kolda, and C. Seshadhri. 2012. Degree relations of triangles in real-world networks and graph models. InProceedings of the 21st ACM international conference on Information and knowledge management (CIKM ’12). Association for Computing Machinery, New York, NY, USA, 1712–1716. doi:10.1145/2396761.2398503

  17. [17]

    David Garcia, Ingmar Weber, and Venkata Garimella. 2014. Gender Asymmetries in Reality and Fiction: The Bechdel Test of Social Media.Proceedings of the International AAAI Conference on Web and Social Media8, 1 (May 2014), 131–140. doi:10.1609/icwsm.v8i1.14522

  18. [18]

    Vagrant Gautam, Arjun Subramonian, Anne Lauscher, and Os Keyes. 2024. Stop! In the Name of Flaws: Disentangling Personal Names and Sociodemographic Attributes in NLP. InProceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP), Agnieszka Faleńska, Christine Basta, Marta Costa-jussà, Seraphina Goldfarb-Tarrant, and Debora Nozza...

  19. [19]

    Philip John Gorinski and Mirella Lapata. 2015. Movie Script Summarization as Graph-based Scene Extraction. InProceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Rada Mihalcea, Joyce Chai, and Anoop Sarkar (Eds.). Association for Computational Linguistics, Denver, C...

  20. [20]

    Mark S Granovetter. 1973. The strength of weak ties.American journal of sociology78, 6 (1973), 1360–1380

  21. [21]

    Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King. 2024. AI generates covertly racist decisions about people based on their dialect.Nature633, 8028 (Sept. 2024), 147–154. doi:10.1038/s41586-024-07856-5 Publisher: Nature Publishing Group

  22. [22]

    Hayate Iso, Pouya Pezeshkpour, Nikita Bhutani, and Estevam Hruschka. 2025. Evaluating Bias in LLMs for Job-Resume Matching: Gender, Race, and Education. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), Weizhu Chen, Yi Yang, ...

  23. [23]

    Dima Kagan, Thomas Chesney, and Michael Fire. 2020. Using data science to understand the film industry’s gender gap.Palgrave Communications6, 1 (May 2020), 92. doi:10.1057/s41599-020-0436-1 Publisher: Palgrave

  24. [24]

    D. Kellner. 1995.Media Culture: Cultural Studies, Identity and Politics Between the Modern and the Postmodern. Routledge. https: //books.google.com/books?id=GjbdsiZ0q10C

  25. [25]

    Molly Kinder. 2024. Hollywood writers went on strike to protect their livelihoods from generative AI. Their remarkable victory matters for all workers. https://www.brookings.edu/articles/hollywood-writers-went-on-strike-to-protect-their-livelihoods-from-generative- ai-their-remarkable-victory-matters-for-all-workers/

  26. [26]

    Dreyer, Aleksandar Shtedritski, and Yuki M

    Hannah Rose Kirk, Yennie Jun, Haider Iqbal, Elias Benussi, Filippo Volpin, Frederic A. Dreyer, Aleksandar Shtedritski, and Yuki M. Asano. 2021. Bias out-of-the-box: an empirical analysis of intersectional occupational biases in popular generative language models. In Proceedings of the 35th International Conference on Neural Information Processing Systems ...

  27. [27]

    Kumar, Jasmine Y

    Arjun M. Kumar, Jasmine Y. Q. Goh, Tiffany H. H. Tan, and Cynthia S. Q. Siew. 2022. Gender Stereotypes in Hollywood Movies and Their Evolution over Time: Insights from Network Analysis.Big Data and Cognitive Computing6, 2 (June 2022), 50. doi:10.3390/bdcc6020050 Publisher: Multidisciplinary Digital Publishing Institute

  28. [28]

    Anja Lambrecht and Catherine Tucker. 2019. Algorithmic bias? An empirical study of apparent gender-based discrimination in the display of STEM career ads.Management science65, 7 (2019), 2966–2981

  29. [29]

    David Laniado, Yana Volkovich, Karolin Kappler, and Andreas Kaltenbrunner. 2016. Gender homophily in online dyadic and triadic relationships.EPJ Data Science5, 1 (May 2016), 19. doi:10.1140/epjds/s13688-016-0080-6 Do Language Models Pass the Bechdel Test? FAccT ’26, June 25–28, 2026, Montreal, QC, Canada

  30. [30]

    Frozen in Time

    Peter A. Leavitt, Rebecca Covarrubias, Yvonne A. Perez, and Stephanie A. Fryberg. 2015. “Frozen in Time”: The Impact of Native American Media Representations on Identity and Self-Understanding.Journal of Social Issues71, 1 (2015), 39–53. doi:10.1111/josi.12095 _eprint: https://spssi.onlinelibrary.wiley.com/doi/pdf/10.1111/josi.12095

  31. [31]

    Benjamin Lee. 2024. Lionsgate partners with AI firm to train generative model on film and TV library.The Guardian(Sept. 2024). https://www.theguardian.com/film/2024/sep/18/lionsgate-ai

  32. [32]

    Eric Justin Liu, Wonyoung So, Peko Hosoi, and Catherine D’Ignazio. 2024. Racial Steering by Large Language Models: A Prospective Audit of GPT-4 on Housing Recommendations. InProceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ’24). Association for Computing Machinery, New York, NY, USA, 1–13. doi:1...

  33. [33]

    Li Lucy and David Bamman. 2021. Gender and Representation Bias in GPT-3 Generated Stories. InProceedings of the Third Workshop on Narrative Understanding, Nader Akoury, Faeze Brahman, Snigdha Chaturvedi, Elizabeth Clark, Mohit Iyyer, and Lara J. Martin (Eds.). Association for Computational Linguistics, Virtual, 48–55. doi:10.18653/v1/2021.nuse-1.5

  34. [34]

    Jinna Lv, Bin Wu, Lili Zhou, and Han Wang. 2018. StoryRoleNet: Social Network Construction of Role Relationship in Video.IEEE Access6 (2018), 25958–25969. doi:10.1109/ACCESS.2018.2832087

  35. [35]

    Crawford, Sanjana Gautam, Sorelle A

    Yaaseen Mahomed, Charlie M. Crawford, Sanjana Gautam, Sorelle A. Friedler, and Danaë Metaxa. 2024. Auditing GPT’s Content Moderation Guardrails: Can ChatGPT Write Your Favorite TV Show?. InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24). Association for Computing Machinery, New York, NY, USA, 660–686. doi:1...

  36. [36]

    Pescosolido, and Daniel Tope

    Janice McCabe, Emily Fairchild, Liz Grauerholz, Bernice A. Pescosolido, and Daniel Tope. 2011. Gender in the Twentieth-Century Children’s Books: Patterns of Disparity in Titles and Central Characters.Gender and Society25, 2 (2011), 197–226. http://www.jstor. org/stable/23044136

  37. [37]

    Miller McPherson, Lynn Smith-Lovin, and James M. Cook. 2001. Birds of a Feather: Homophily in Social Networks.Annual Review of Sociology27 (2001), 415–444. http://www.jstor.org/stable/2678628

  38. [38]

    Landay, and Jeff Hancock

    Danaë Metaxa, Joon Sung Park, James A. Landay, and Jeff Hancock. 2019. Search Media and Elections: A Longitudinal Investigation of Political Search Results.Proc. ACM Hum.-Comput. Interact.3, CSCW (Nov. 2019), 129:1–129:17. doi:10.1145/3359231

  39. [39]

    Robertson, Karrie Karahalios, Christo Wilson, Jeff Hancock, and Christian Sandvig

    Danaë Metaxa, Joon Sung Park, Ronald E. Robertson, Karrie Karahalios, Christo Wilson, Jeff Hancock, and Christian Sandvig. 2021. Auditing Algorithms: Understanding Algorithmic Systems from the Outside In.Foundations and Trends®in Human–Computer Interaction 14, 4 (2021), 272–344. doi:10.1561/1100000083

  40. [40]

    M. E. J. Newman. 2003. Mixing patterns in networks.Physical Review E67, 2 (Feb. 2003), 026126. doi:10.1103/PhysRevE.67.026126

  41. [41]

    Marios Papachristou and Yuan Yuan. 2025. Network formation and dynamics among multi-LLMs.PNAS Nexus4, 12 (Dec. 2025), pgaf317. doi:10.1093/pnasnexus/pgaf317

  42. [42]

    O’Brien, Carrie J

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23). Association for Computing Machinery, New York, NY, USA, 1–22. doi:10.1145/35861...

  43. [43]

    Seung-Bo Park, Yoo-Won Kim, Mohammed Nazim Uddin, and Geun-Sik Jo. 2009. Character-Net: Character Network Analysis from Video. In2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, Vol. 1. 305–308. doi:10.1109/WI-IAT.2009.54

  44. [44]

    Seung-Bo Park, Kyeong-Jin Oh, and Geun-Sik Jo. 2012. Social network analysis in a movie using character-net.Multimedia Tools Appl. 59, 2 (2012), 601–627. doi:10.1007/s11042-011-0725-1

  45. [45]

    Crawford, Danaé Metaxa, and Sorelle A

    Grace Proebsting, Oghenefejiro Isaacs Anigboro, Charlie M. Crawford, Danaé Metaxa, and Sorelle A. Friedler. 2025. Identity-related Speech Suppression in Generative AI Content Moderation. InProceedings of the 5th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ’25). Association for Computing Machinery, New York, NY, U...

  46. [46]

    Robertson, Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson

    Ronald E. Robertson, Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. 2018. Auditing Partisan Audience Bias within Google Search.Proceedings of the ACM on Human-Computer Interaction2, CSCW (Nov. 2018), 1–22. doi:10.1145/3274417

  47. [47]

    Muniba Saleem and Srividya Ramasubramanian. 2019. Muslim Americans’ Responses to Social Identity Threats: Effects of Media Representations and Experiences of Discrimination.Media Psychology22, 3 (2019), 373–393. doi:10.1080/15213269.2017.1302345 _eprint: https://doi.org/10.1080/15213269.2017.1302345

  48. [48]

    Maarten Sap, Marcella Cindy Prasettio, Ari Holtzman, Hannah Rashkin, and Yejin Choi. 2017. Connotation Frames of Power and Agency in Modern Films. InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Martha Palmer, Rebecca Hwa, and Sebastian Riedel (Eds.). Association for Computational Linguistics, Copenhagen, Denmark,...

  49. [49]

    Akrati Saxena, George Fletcher, and Mykola Pechenizkiy. 2024. FairSNA: Algorithmic Fairness in Social Network Analysis.Comput. Surveys(April 2024). doi:10.1145/3653711 Publisher: ACMPUB27New York, NY

  50. [50]

    Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The Woman Worked as a Babysitter: On Biases in Language Generation. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguist...

  51. [51]

    Zara Siddique, Liam Turner, and Luis Espinosa-Anke. 2024. Who is better at math, Jenny or Jingzhen? Uncovering Stereotypes in Large Language Models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al- Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, ...

  52. [52]

    Dr Stacy L Smith, Dr Katherine Pieper, and Sam Wheeler. 2023. Inequality in 1,600 popular films: Examining Portrayals of Gender, Race/Ethnicity, LGBTQ+ & Disability from 2007 to 2022. (Aug. 2023). https://assets.uscannenberg.org/docs/aii-inequality-in-1600- popular-films-20230811.pdf

  53. [53]

    Jessica Toonkel. 2025. Exclusive | OpenAI Backs AI-Made Animated Feature Film. https://www.wsj.com/tech/ai/openai-backs-ai-made- animated-feature-film-389f70b0

  54. [54]

    Ownership, Not Just Happy Talk

    Emily Tseng, Meg Young, Marianne Aubin Le Quéré, Aimee Rinehart, and Harini Suresh. 2025. "Ownership, Not Just Happy Talk": Co-Designing a Participatory Large Language Model for Journalism. InProceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’25). Association for Computing Machinery, New York, NY, USA, 3119–3130. ...

  55. [55]

    Riva Tukachinsky, Dana Mastro, and Moran Yarchi. 2015. Documenting Portrayals of Race/Ethnicity on Primetime Television over a 20-Year Span and Their Association with National-Level Racial/Ethnic Attitudes.Communication Faculty Articles and Research(Jan. 2015). doi:10.1111/josi.12094

  56. [56]

    Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron A. Marlow. 2011. The Anatomy of the Facebook Social Graph. (2011)

  57. [57]

    Johann Valentowitsch. 2023. Hollywood caught in two worlds? The impact of the Bechdel test on the international box office performance of cinematic films.Marketing Letters34, 2 (2023), 293–308. doi:10.1007/s11002-022-09652-5

  58. [58]

    Ian Van Buskirk, Aaron Clauset, and Daniel B Larremore. 2023. An Open-Source Cultural Consensus Approach to Name-Based Gender Classification. InProceedings of the International AAAI Conference on Web and Social Media, Vol. 17. 866–877. https://github.com/ ianvanbuskirk/nbgc

  59. [59]

    Kelly is a Warm Person, Joseph is a Role Model

    Yixin Wan, George Pu, Jiao Sun, Aparna Garimella, Kai-Wei Chang, and Nanyun Peng. 2023. “Kelly is a Warm Person, Joseph is a Role Model”: Gender Biases in LLM-Generated Reference Letters. InFindings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Sin...

  60. [60]

    Dickerson

    Angelina Wang, Jamie Morgenstern, and John P. Dickerson. 2025. Large language models that replace human participants can harmfully misportray and flatten identity groups.Nature Machine Intelligence7, 3 (March 2025), 400–411. doi:10.1038/s42256-025-00986-z Publisher: Nature Publishing Group

  61. [61]

    Stephanie Wang, Shengchun Huang, Alvin Zhou, and Danaë Metaxa. 2024. Lower Quantity, Higher Quality: Auditing News Content and User Perceptions on Twitter/X Algorithmic versus Chronological Timelines.Proc. ACM Hum.-Comput. Interact.8, CSCW2 (Nov. 2024), 507:1–507:25. doi:10.1145/3687046

  62. [62]

    1994.Social Network Analysis: Methods and Applications

    Stanley Wasserman and Katherine Faust. 1994.Social Network Analysis: Methods and Applications. Cambridge University Press

  63. [63]

    Chung-Yi Weng, Wei-Ta Chu, and Ja-Ling Wu. 2007. RoleNet: treat a movie as a small society. InProceedings of the international workshop on Workshop on multimedia information retrieval (MIR ’07). Association for Computing Machinery, New York, NY, USA, 51–60. doi:10.1145/1290082.1290092

  64. [64]

    Kyra Wilson and Aylin Caliskan. 2024. Gender, Race, and Intersectional Bias in Resume Screening via Language Model Retrieval. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society7, 1 (Oct. 2024), 1578–1590. doi:10.1609/aies.v7i1.31748

  65. [65]

    Nan Xu and Xuezhe Ma. 2025. LLM The Genius Paradox: A Linguistic and Math Expert’s Struggle with Simple Word-based Counting Problems. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Luis Chiruzzo, Alan Ritter, and Lu Wang (Eds...

  66. [66]

    number": integer •

    Yulin Yu, Yucong Hao, and Paramveer Dhillon. 2022. Unpacking Gender Stereotypes in Film Dialogue. InSocial Informatics, Frank Hopfgartner, Kokil Jaidka, Philipp Mayr, Joemon Jose, and Jan Breitsohl (Eds.). Springer International Publishing, Cham, 398–405. doi:10.1007/978-3-031-19097-1_26 A Screenplay generation prompts Two prompts used to generate screenp...