Cards Against LLMs: Benchmarking Humor Alignment in Large Language Models
Pith reviewed 2026-05-10 17:08 UTC · model grok-4.3
The pith
Large language models align only modestly with human humor preferences but agree with each other far more often when selecting funny cards.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Five frontier language models were tested on 9,894 rounds of Cards Against Humanity, each time selecting the funniest response from ten candidate cards. While every model outperformed random selection in matching human preferences, the degree of alignment stayed modest. In contrast, the models agreed with one another on the best card far more frequently than any model agreed with humans. The authors attribute part of this inter-model consistency to systematic position biases and shared content preferences, which raises doubts about whether the judgments reflect true humor understanding or simply artifacts of the models' training and inference processes.
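As a rough illustration of the quantities compared above, the sketch below computes the random baseline, human-model agreement, and pairwise model-model agreement from per-round picks. The function names, the data layout (one pick per round, `None` for an abstention), and the toy picks are illustrative assumptions, not the authors' code or data.

```python
from itertools import combinations


def agreement_rate(picks_a, picks_b):
    """Fraction of rounds in which both sources picked the same card.

    Rounds where either source has no pick (None) are skipped.
    """
    pairs = [(a, b) for a, b in zip(picks_a, picks_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no rounds with picks from both sources")
    return sum(a == b for a, b in pairs) / len(pairs)


def summarize(human, models, slate_size=10):
    """Collect the baseline and both kinds of agreement in one dict."""
    human_model = {name: agreement_rate(human, picks)
                   for name, picks in models.items()}
    model_model = {(m1, m2): agreement_rate(models[m1], models[m2])
                   for m1, m2 in combinations(sorted(models), 2)}
    return {
        # A uniformly random pick matches the human winner 1 time in slate_size.
        "random_baseline": 1 / slate_size,
        "human_model": human_model,
        "model_model": model_model,
    }


# Toy picks for four rounds (hypothetical card ids).
human = ["c3", "c7", "c1", "c4"]
models = {
    "model_a": ["c3", "c2", "c5", "c9"],
    "model_b": ["c8", "c2", "c5", "c9"],
}
result = summarize(human, models)
print(result)
```

In this toy data the two models match each other in three of four rounds while matching the human pick in at most one, which is the shape of the pattern the paper reports.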
What carries the argument
The benchmark task of selecting the funniest card from a fixed slate of ten candidates in repeated Cards Against Humanity rounds.
If this is right
- Models exceed random baseline in selecting funny responses from fixed lists.
- Inter-model agreement on humor choices substantially exceeds human-model agreement.
- Position biases and content preferences systematically shape model selections.
- LLM humor judgments may reflect inference artifacts rather than learned human-like preferences.
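One simple way to probe the position bias mentioned in the list above is to tabulate which slate position a model picks and compare the counts with a uniform distribution. The sketch below does this with a hand-rolled chi-square goodness-of-fit statistic. The function names and toy counts are hypothetical, and the check is only meaningful if card order does not correlate with card content (for example, when order is randomized per round).

```python
from collections import Counter

# Critical value of the chi-square distribution with 9 degrees of freedom
# at the 0.05 level (ten positions minus one).
CHI2_CRIT_DF9_P05 = 16.919


def position_counts(picked_positions, slate_size=10):
    """Count how often each 1-based slate position was picked."""
    counts = Counter(picked_positions)
    return [counts.get(pos, 0) for pos in range(1, slate_size + 1)]


def chi_square_uniform(counts):
    """Chi-square statistic against equal expected counts per position."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((obs - expected) ** 2 / expected for obs in counts)


# Toy data: position 1 picked 28 times, positions 2..10 picked 8 times each.
picked = [1] * 28 + [pos for pos in range(2, 11) for _ in range(8)]
counts = position_counts(picked)
stat = chi_square_uniform(counts)
biased = stat > CHI2_CRIT_DF9_P05
print(counts, round(stat, 2), biased)
```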
Where Pith is reading between the lines
- The selection format may limit models' ability to express humor preferences that vary as much as human ones do.
- Similar patterns of internal agreement could appear when testing LLMs on other subjective judgments like creativity or social appropriateness.
- Developers could use position randomization or open-ended generation tasks to reduce artifacts in future humor benchmarks.
Load-bearing premise
That selecting the funniest card from a fixed list of ten options accurately measures genuine humor preference without being dominated by model-specific artifacts or biases.
What would settle it
If models were retested with the order of the ten candidate cards randomized in each round, and their agreement rate with humans rose to match or exceed their agreement with each other, the current explanation for modest alignment would be challenged.
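A retest along these lines needs a reproducible per-round shuffle and a way to map the model's chosen position back to the underlying card. The following is a minimal sketch under assumed names (`present_round`, `original_index`, and the seeding scheme are all hypothetical); it is not taken from the paper.

```python
import random


def present_round(cards, round_id, seed=0):
    """Shuffle one slate reproducibly.

    Returns (shown, order), where order[i] is the original index of the
    card displayed at position i (0-based).
    """
    rng = random.Random(f"{seed}:{round_id}")  # independent stream per round
    order = list(range(len(cards)))
    rng.shuffle(order)
    shown = [cards[i] for i in order]
    return shown, order


def original_index(order, picked_position):
    """Map a 1-based displayed position back to the original card index."""
    return order[picked_position - 1]


cards = [f"card_{i}" for i in range(10)]
shown, order = present_round(cards, round_id=42)
picked_position = 3  # e.g. the model answered "3. <card text>"
picked_card = cards[original_index(order, picked_position)]
print(shown, picked_card)
```

Seeding by round id keeps each round's order independent of the others while letting every model see the same shuffled slate, so disagreements between models cannot come from different orderings.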
Original abstract
Humor is one of the most culturally embedded and socially significant dimensions of human communication, yet it remains largely unexplored as a dimension of Large Language Model (LLM) alignment. In this study, five frontier language models play the same Cards Against Humanity games (CAH) as human players. The models select the funniest response from a slate of ten candidate cards across 9,894 rounds. While all models exceed the random baseline, alignment with human preference remains modest. More striking is that models agree with each other substantially more often than they agree with humans. We show that this preference is partly explained by systematic position biases and content preferences, raising the question whether LLM humor judgment reflects genuine preference or structural artifacts of inference and alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports an empirical benchmark in which five frontier LLMs play 9,894 rounds of Cards Against Humanity, each time selecting the funniest card from a fixed slate of ten candidates. All models exceed random baseline agreement with human choices, yet human-model alignment remains modest while inter-model agreement is substantially higher; the authors partially attribute the latter to position biases and content preferences.
Significance. If the measurement is valid, the work supplies a large-scale, culturally grounded test of humor alignment and surfaces a potentially important pattern: models converge on each other more than on humans. The scale of the trial set and the direct comparison to human data are strengths that could inform future preference-modeling research.
major comments (3)
- [§3] §3 (Experimental Protocol): the manuscript provides no exact prompt templates, model version strings, temperature settings, or slate-presentation order randomization. These omissions are load-bearing because the headline inter-model vs. human-model gap could be driven by shared prompting artifacts or fixed-position heuristics rather than humor judgment.
- [§4.2] §4.2 (Agreement Analysis): the claim that elevated model-model agreement reflects 'structural artifacts' rather than genuine preference is under-determined without an ablation that (a) balances or randomizes card order, (b) controls for lexical/taboo priors, and (c) re-computes the human-model versus model-model gap on the corrected data.
- [§4.3] §4.3 (Statistical Controls): no significance tests, confidence intervals, or multiple-comparison corrections are reported for the agreement percentages despite 9,894 trials; this weakens the assertion that human alignment is 'modest' relative to inter-model agreement.
minor comments (2)
- [Table 1] Table 1: clarify whether the 9,894 rounds are the final filtered set or the raw total; report any exclusion criteria.
- [Figure 2] Figure 2: add error bars or bootstrap intervals to the agreement bars so readers can assess the reliability of the model-model versus human-model difference.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important gaps in reproducibility and statistical rigor that we will address in the revision. Below we respond point by point to the major comments.
- Referee: [§3] §3 (Experimental Protocol): the manuscript provides no exact prompt templates, model version strings, temperature settings, or slate-presentation order randomization. These omissions are load-bearing because the headline inter-model vs. human-model gap could be driven by shared prompting artifacts or fixed-position heuristics rather than humor judgment.
  Authors: We agree that these details are critical for reproducibility and to exclude prompting artifacts. The original submission omitted them to conserve space, but the revised manuscript will include a new subsection in §3 with the exact prompt templates (standardized across models with only API-specific adaptations), precise model version strings (e.g., gpt-4o-2024-05-13, claude-3-5-sonnet-20240620), temperature=0 for all models, and explicit confirmation that the order of the ten candidate cards was randomized independently for each of the 9,894 rounds. We will also release the full experimental scripts as supplementary material.
  revision: yes
- Referee: [§4.2] §4.2 (Agreement Analysis): the claim that elevated model-model agreement reflects 'structural artifacts' rather than genuine preference is under-determined without an ablation that (a) balances or randomizes card order, (b) controls for lexical/taboo priors, and (c) re-computes the human-model versus model-model gap on the corrected data.
  Authors: This point is well taken; the attribution to structural artifacts would be stronger with explicit ablations. Although card order was already randomized in the primary experiments, we did not perform the full set of controls for lexical and taboo content. In the revision we will add new analyses that (a) confirm position-balanced results, (b) recompute agreements after filtering or regressing out high-taboo and high-lexical-overlap cards, and (c) report the resulting human-model versus inter-model gaps. Preliminary internal checks indicate the inter-model elevation persists, but the full ablation results will be presented to allow readers to evaluate the claim directly.
  revision: partial
- Referee: [§4.3] §4.3 (Statistical Controls): no significance tests, confidence intervals, or multiple-comparison corrections are reported for the agreement percentages despite 9,894 trials; this weakens the assertion that human alignment is 'modest' relative to inter-model agreement.
  Authors: We accept this criticism. The revised manuscript will report 95% bootstrap confidence intervals for all agreement rates, apply McNemar’s test for paired comparisons between human-model and model-model agreements, and use Bonferroni correction across the five models. These additions will quantify the statistical reliability of the modest human alignment versus higher inter-model agreement.
  revision: yes
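The three procedures named in the last response can be sketched with the standard library alone. The version below is an assumption-laden illustration: it treats each round as a pair of 0/1 indicators (did the model match the human pick, did it match another model's pick), uses a percentile bootstrap, and uses the uncorrected chi-square form of McNemar's test. How the authors actually pair the comparisons is not stated in the text, and the toy data is invented.

```python
import math
import random


def bootstrap_ci(indicators, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 indicators."""
    rng = random.Random(seed)
    n = len(indicators)
    means = sorted(sum(rng.choices(indicators, k=n)) / n
                   for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def mcnemar(x, y):
    """McNemar chi-square statistic (no continuity correction) and p-value."""
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)  # only x agrees
    c = sum(1 for xi, yi in zip(x, y) if yi and not xi)  # only y agrees
    if b + c == 0:
        return 0.0, 1.0
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy data for 100 rounds: 20% human-model agreement, 50% model-model.
human_model = [1] * 20 + [0] * 80
model_model = [1] * 10 + [0] * 10 + [1] * 40 + [0] * 40
lo, hi = bootstrap_ci(human_model)
stat, p_value = mcnemar(human_model, model_model)
adjusted = bonferroni([p_value, 0.5])
print((lo, hi), stat, p_value, adjusted)
```

With 9,894 rounds the intervals would be far narrower than in this 100-round toy example, which is why the referee's request mainly affects how confidently the size of the gap can be stated.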
Circularity Check
No circularity: purely empirical measurement with direct comparisons to external human data
The paper reports an experimental benchmark in which LLMs select the funniest card from fixed 10-card slates across 9,894 rounds and compares selection frequencies to human choices and inter-model agreement. No equations, parameter fits, predictions derived from fitted inputs, or self-citations appear in the provided text. All reported statistics are direct empirical counts against an external human reference set; the central claim (modest human alignment, higher inter-model agreement) is therefore not reducible to any definitional or self-referential step.
[4]
sexual_themes // Sexual content: innuendo, explicit acts, relationships
bodily_functions_gross_out // Anatomy, bodily fluids, gross-out physical humor 2. sexual_themes // Sexual content: innuendo, explicit acts, relationships
-
[5]
violence_crime_death_threat // Physical harm, mortality, criminal acts, threats
-
[6]
politics_ideology_society_culture // Government, activism, social norms, cultural commentary
-
[7]
drugs_alcohol_risky_behavior // Substance use, addiction, reckless actions
-
[8]
pop_culture_media_consumerism // Celebrities, movies, memes, brands, viral trends
-
[9]
food_eating_consumables // Meals, ingredients, dining, consumption
-
[10]
animals_nature_creatures // Wildlife, pets, ecosystems, biological refs
-
[11]
absurdism_surreal_nonsensical // Illogical juxtapositions, nonsense, anti-humor
-
[12]
identity_demographics_traits // Race, gender, age, disability, sexuality, nationality
-
[13]
family_relationships_everyday // Parenting, friendships, domestic life, mundane interactions
-
[14]
emotional_states_mental_health // Anxiety, joy, depression, coping, psychological framing
-
[15]
supernatural_cosmic_paranormal // Ghosts, aliens, magic, existential cosmic themes
-
[16]
money_work_technology_modern // Jobs, finance, digital life, institutional critique
-
[17]
random_objects_miscellaneous // Concrete items/concepts not captured above ### RULES
-
[18]
Use slugs exactly as written (underscores, lowercase, no spaces)
-
[19]
Output ONLY valid JSON. No markdown, no preamble, no extra text. ”’ USER_PROMPT_TEMPLATE = ”’ ### INPUT CARD "{card_text}" ### OUTPUT ”’ All 2074 unique white cards in the gameplay dataset were successfully annotated with 1 to 3 topics. Coherence of the annotations was validated by the authors. We also note the topic annotation was done on our final datas...
work page 2074
Gameplay prompts (appendix fragments)
{card_2} . . . Respond ONLY with: <number>. <exact card text>
where {black_card} is the text of the black card with the blank represented as an underscore, and {card_1} . . . {card_N} are the white cards in the slate (N = 10 in all rounds).
B.1.2 Multi-Blank Prompt
For black cards with two blanks, a target slot was designated (BLANK #1 or BLANK #2) and mode...
{card_2} . . . Respond ONLY with: <number>. <exact card text>
where {black_card} is the text of the black card with blanks represented as underscores, {target_slot} is either 1 or 2 indicating which blank to fill, and {card_1} . . . {card_N} are the white cards in the slate (N = 10 in all rounds).
B.2 Models Abstentions
Not all models engaged with every round....
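Because the prompts ask for a reply of the form `<number>. <exact card text>`, a scoring script has to turn free-text replies into a slate position and decide what counts as an abstention. The sketch below shows one possible approach. The function name, the case-insensitive text check, and the choice to treat any malformed or mismatched reply as an abstention (`None`) are assumptions for illustration; the fragments above do not say how the authors handled these cases.

```python
import re

# "<number>. <exact card text>", allowing surrounding whitespace.
RESPONSE_RE = re.compile(r"^\s*(\d+)\.\s*(.+?)\s*$", re.DOTALL)


def parse_pick(response, slate):
    """Return the 1-based picked position, or None for an unusable reply."""
    match = RESPONSE_RE.match(response)
    if match is None:
        return None  # refusal or free-form text
    number = int(match.group(1))
    text = match.group(2)
    if not 1 <= number <= len(slate):
        return None  # number outside the slate
    # Assumption: the quoted text must match the card at that position.
    if slate[number - 1].strip().lower() != text.strip().lower():
        return None
    return number


slate = [f"White card {i}." for i in range(1, 11)]
good = parse_pick("3. White card 3.", slate)
refusal = parse_pick("I would rather not choose.", slate)
out_of_range = parse_pick("11. White card 11.", slate)
mismatch = parse_pick("2. White card 5.", slate)
print(good, refusal, out_of_range, mismatch)
```

Requiring the number and the card text to agree guards against replies where the model names one position but quotes another card, at the cost of discarding some replies a more lenient parser would keep.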