Cards Against LLMs: Benchmarking Humor Alignment in Large Language Models

Guillaume Bied; Hannu Toivonen; Tijl De Bie; Yousra Fettach

arxiv: 2604.08757 · v1 · submitted 2026-04-09 · 💻 cs.CL · cs.AI

Yousra Fettach, Guillaume Bied, Hannu Toivonen, Tijl De Bie

Pith reviewed 2026-05-10 17:08 UTC · model grok-4.3

keywords: humor alignment · large language models · benchmarking · Cards Against Humanity · preference agreement · position bias · model artifacts

The pith

Large language models align only modestly with human humor preferences but agree with each other far more often when selecting funny cards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper has five frontier language models play 9,894 rounds of Cards Against Humanity, picking the funniest card from ten options each time. All models match human selections more often than chance, but the match remains modest overall. The models agree with each other on their choices much more often than they agree with human players. The authors partly attribute this higher inter-model agreement to systematic position biases and content preferences. Understanding these patterns matters for assessing how well AI systems can handle culturally nuanced communication like humor.

Core claim

Five frontier language models were tested on 9,894 rounds of Cards Against Humanity, each time selecting the funniest response from ten candidate cards. While every model outperformed random selection in matching human preferences, the degree of alignment stayed modest. In contrast, the models agreed with one another on the best card far more frequently than any model agreed with humans. The authors attribute part of this inter-model consistency to systematic position biases and shared content preferences, which raises doubts about whether the judgments reflect true humor understanding or simply artifacts of the models' training and inference processes.
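The three quantities this claim compares (random baseline, human-model agreement, model-model agreement) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of `None` to mark an abstention, and the choice to skip rounds where either party abstained are all assumptions.

```python
from itertools import combinations


def agreement_rate(picks_a, picks_b):
    """Share of rounds in which both pickers chose the same card.

    Rounds where either picker abstained (None) are skipped; that
    handling is an assumption, not something the paper specifies here.
    """
    pairs = [(a, b) for a, b in zip(picks_a, picks_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no rounds with picks from both pickers")
    return sum(a == b for a, b in pairs) / len(pairs)


def summarize(human_picks, model_picks, slate_size=10):
    """Compare each model with the human and each model pair with each other."""
    human_model = {name: agreement_rate(human_picks, picks)
                   for name, picks in model_picks.items()}
    model_model = {(m1, m2): agreement_rate(model_picks[m1], model_picks[m2])
                   for m1, m2 in combinations(sorted(model_picks), 2)}
    return {
        # Uniform guessing over the slate matches the human 1/slate_size of the time.
        "random_baseline": 1 / slate_size,
        "human_model": human_model,
        "model_model": model_model,
    }
```

With ten cards per slate the random baseline is 0.1, so "better than chance" means a human-model rate above that value, while the paper's headline pattern corresponds to the `model_model` rates exceeding the `human_model` rates.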

What carries the argument

The benchmark task of selecting the funniest card from a fixed slate of ten candidates in repeated Cards Against Humanity rounds.

If this is right

Models exceed random baseline in selecting funny responses from fixed lists.
Inter-model agreement on humor choices substantially exceeds human-model agreement.
Position biases and content preferences systematically shape model selections.
LLM humor judgments may reflect inference artifacts rather than learned human-like preferences.
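The position-bias point in the list above can be probed with a simple tally of which slate position each pick came from. The sketch below is hypothetical: it assumes picks are recorded as 0-indexed slate positions and that, absent bias, picks would be spread uniformly across positions (which only holds if card order is unrelated to card quality).

```python
from collections import Counter


def position_shares(picked_positions, slate_size=10):
    """Proportion of picks that landed on each slate position (0-indexed)."""
    total = len(picked_positions)
    if total == 0:
        raise ValueError("no picks to summarize")
    counts = Counter(picked_positions)
    return [counts.get(pos, 0) / total for pos in range(slate_size)]


def chi_square_vs_uniform(picked_positions, slate_size=10):
    """Pearson chi-square statistic against a uniform spread over positions.

    Larger values mean the picks deviate more from uniform; turning this
    into a p-value needs a chi-square distribution with slate_size - 1
    degrees of freedom, which is left to a statistics library.
    """
    total = len(picked_positions)
    if total == 0:
        raise ValueError("no picks to summarize")
    counts = Counter(picked_positions)
    expected = total / slate_size
    return sum((counts.get(pos, 0) - expected) ** 2 / expected
               for pos in range(slate_size))
```

A model with no position preference should produce shares near 0.1 for every position of a ten-card slate.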

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The selection format may limit models' ability to express humor preferences that vary as much as human ones do.
Similar patterns of internal agreement could appear when testing LLMs on other subjective judgments like creativity or social appropriateness.
Developers could use position randomization or open-ended generation tasks to reduce artifacts in future humor benchmarks.

Load-bearing premise

That selecting the funniest card from a fixed list of ten options accurately measures genuine humor preference without being dominated by model-specific artifacts or biases.

What would settle it

If models were retested with the order of the ten candidate cards randomized in each round, and their agreement rate with humans rose to match or exceed their agreement with each other, the current explanation for modest alignment would be challenged.
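A retest of this kind needs the shown order to be shuffled per round and each pick mapped back to the original card. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, positions are 0-indexed, and a seeded `random.Random` is used only to make runs repeatable.

```python
import random


def shuffled_slate(cards, rng):
    """Return the cards in a random order plus the permutation used.

    order[p] is the original index of the card shown at position p.
    """
    order = list(range(len(cards)))
    rng.shuffle(order)
    return [cards[i] for i in order], order


def original_index(picked_position, order):
    """Map a pick made on the shuffled slate back to the original card index."""
    return order[picked_position]


# Example round: shuffle, let the model pick a position, then map it back.
rng = random.Random(0)
cards = [f"card {i}" for i in range(10)]
shown, order = shuffled_slate(cards, rng)
picked_position = 3  # stand-in for a model's answer
picked_card = cards[original_index(picked_position, order)]
```

Keeping the permutation alongside each round lets agreement be computed on card identity while position effects are analyzed separately on `picked_position`.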

Figures

Figures reproduced from arXiv: 2604.08757 by Guillaume Bied, Hannu Toivonen, Tijl De Bie, Yousra Fettach.

**Figure 1.** Framework overview. Given a black card prompt and a slate of 10 white card candidates, five frontier LLMs and a human player independently select the card they deem funniest.

**Figure 3.** Pairwise agreement rate between models, mea…

**Figure 2.** Human-LLM Alignment for all 5 models, with bootstrapped 95% confidence intervals. The dashed line…

**Figure 4.** Accuracy rates by demographic subgroup, aggregated at the player level across replicates and rounds. The…

**Figure 5.** Shares of human (column 1) and LLM (columns 2-5) white card picks involving different topics (a card…

**Figure 6.** Distribution of model picks across slate positions (N=10). Each bar represents the proportion of rounds in…

Original abstract

Humor is one of the most culturally embedded and socially significant dimensions of human communication, yet it remains largely unexplored as a dimension of Large Language Model (LLM) alignment. In this study, five frontier language models play the same Cards Against Humanity games (CAH) as human players. The models select the funniest response from a slate of ten candidate cards across 9,894 rounds. While all models exceed the random baseline, alignment with human preference remains modest. More striking is that models agree with each other substantially more often than they agree with humans. We show that this preference is partly explained by systematic position biases and content preferences, raising the question whether LLM humor judgment reflects genuine preference or structural artifacts of inference and alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports an empirical benchmark in which five frontier LLMs play 9,894 rounds of Cards Against Humanity, each time selecting the funniest card from a fixed slate of ten candidates. All models exceed random baseline agreement with human choices, yet human-model alignment remains modest while inter-model agreement is substantially higher; the authors partially attribute the latter to position biases and content preferences.

Significance. If the measurement is valid, the work supplies a large-scale, culturally grounded test of humor alignment and surfaces a potentially important pattern: models converge on each other more than on humans. The scale of the trial set and the direct comparison to human data are strengths that could inform future preference-modeling research.

major comments (3)

§3 (Experimental Protocol): the manuscript provides no exact prompt templates, model version strings, temperature settings, or slate-presentation order randomization. These omissions are load-bearing because the headline inter-model vs. human-model gap could be driven by shared prompting artifacts or fixed-position heuristics rather than humor judgment.
§4.2 (Agreement Analysis): the claim that elevated model-model agreement reflects 'structural artifacts' rather than genuine preference is under-determined without an ablation that (a) balances or randomizes card order, (b) controls for lexical/taboo priors, and (c) re-computes the human-model versus model-model gap on the corrected data.
§4.3 (Statistical Controls): no significance tests, confidence intervals, or multiple-comparison corrections are reported for the agreement percentages despite 9,894 trials; this weakens the assertion that human alignment is 'modest' relative to inter-model agreement.

minor comments (2)

Table 1: clarify whether the 9,894 rounds are the final filtered set or the raw total; report any exclusion criteria.
Figure 2: add error bars or bootstrap intervals to the agreement bars so readers can assess the reliability of the model-model versus human-model difference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important gaps in reproducibility and statistical rigor that we will address in the revision. Below we respond point by point to the major comments.

Referee: §3 (Experimental Protocol): the manuscript provides no exact prompt templates, model version strings, temperature settings, or slate-presentation order randomization. These omissions are load-bearing because the headline inter-model vs. human-model gap could be driven by shared prompting artifacts or fixed-position heuristics rather than humor judgment.

Authors: We agree that these details are critical for reproducibility and to exclude prompting artifacts. The original submission omitted them to conserve space, but the revised manuscript will include a new subsection in §3 with the exact prompt templates (standardized across models with only API-specific adaptations), precise model version strings (e.g., gpt-4o-2024-05-13, claude-3-5-sonnet-20240620), temperature=0 for all models, and explicit confirmation that the order of the ten candidate cards was randomized independently for each of the 9,894 rounds. We will also release the full experimental scripts as supplementary material.

revision: yes

Referee: §4.2 (Agreement Analysis): the claim that elevated model-model agreement reflects 'structural artifacts' rather than genuine preference is under-determined without an ablation that (a) balances or randomizes card order, (b) controls for lexical/taboo priors, and (c) re-computes the human-model versus model-model gap on the corrected data.

Authors: This point is well taken; the attribution to structural artifacts would be stronger with explicit ablations. Although card order was already randomized in the primary experiments, we did not perform the full set of controls for lexical and taboo content. In the revision we will add new analyses that (a) confirm position-balanced results, (b) recompute agreements after filtering or regressing out high-taboo and high-lexical-overlap cards, and (c) report the resulting human-model versus inter-model gaps. Preliminary internal checks indicate the inter-model elevation persists, but the full ablation results will be presented to allow readers to evaluate the claim directly.

revision: partial

Referee: §4.3 (Statistical Controls): no significance tests, confidence intervals, or multiple-comparison corrections are reported for the agreement percentages despite 9,894 trials; this weakens the assertion that human alignment is 'modest' relative to inter-model agreement.

Authors: We accept this criticism. The revised manuscript will report 95% bootstrap confidence intervals for all agreement rates, apply McNemar’s test for paired comparisons between human-model and model-model agreements, and use Bonferroni correction across the five models. These additions will quantify the statistical reliability of the modest human alignment versus higher inter-model agreement.

revision: yes
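The three procedures named in this response (percentile bootstrap intervals, McNemar's test, Bonferroni correction) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the authors' analysis: the function names are hypothetical, `matches` is assumed to be a per-round list of 0/1 agreement indicators, and the McNemar inputs `b` and `c` are assumed to be the counts of rounds where only one of the two paired agreement events occurred.

```python
import random
from math import comb


def bootstrap_ci(matches, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for an agreement rate.

    matches: one 0/1 (or bool) entry per round, 1 meaning the picks agreed.
    """
    rng = random.Random(seed)
    n = len(matches)
    rates = sorted(sum(rng.choices(matches, k=n)) / n for _ in range(n_boot))
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null, each discordant round is equally likely to fall in
    either cell, so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

Resampling whole rounds, as done here, treats rounds as independent; if the same players or black cards recur across rounds, a clustered resampling scheme may be more appropriate.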

Circularity Check

0 steps flagged

No circularity: purely empirical measurement with direct comparisons to external human data

The paper reports an experimental benchmark in which LLMs select the funniest card from fixed 10-card slates across 9,894 rounds and compares selection frequencies to human choices and inter-model agreement. No equations, parameter fits, predictions derived from fitted inputs, or self-citations appear in the provided text. All reported statistics are direct empirical counts against an external human reference set; the central claim (modest human alignment, higher inter-model agreement) is therefore not reducible to any definitional or self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The study is an empirical benchmark relying on standard statistical comparison of choices; no free parameters, ad-hoc axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5428 in / 1102 out tokens · 44329 ms · 2026-05-10T17:08:31.330426+00:00 · methodology

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

[1]

Salvatore Attardo

Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351. Salvatore Attardo. 1997. The semantic foundations of cognitive theories of humor. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020...

[2]

Evaluating Large Language Models Trained on Code

Help me write a poem: Instruction tuning as a vehicle for collaborative poetry writing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6848–6863. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, and...

[3]

Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal

Beyond correctness: Evaluating subjective writing preferences across cultures. arXiv preprint arXiv:2510.14616. Jifan Zhang, Lalit Jain, Yang Guo, Jiayi Chen, Kuan L Zhou, Siddharth Suresh, Andrew Wagenmaker, Scott Sievert, Timothy Rogers, Kevin Jamieson, and 1 others. 2024. Humor in ai: Massive scale crowd-sourced preferences and benchmarks for cartoon ...

[4]

sexual_themes // Sexual content: innuendo, explicit acts, relationships

bodily_functions_gross_out // Anatomy, bodily fluids, gross-out physical humor 2. sexual_themes // Sexual content: innuendo, explicit acts, relationships

[5]

violence_crime_death_threat // Physical harm, mortality, criminal acts, threats

[6]

politics_ideology_society_culture // Government, activism, social norms, cultural commentary

[7]

drugs_alcohol_risky_behavior // Substance use, addiction, reckless actions

[8]

pop_culture_media_consumerism // Celebrities, movies, memes, brands, viral trends

[9]

food_eating_consumables // Meals, ingredients, dining, consumption

[10]

animals_nature_creatures // Wildlife, pets, ecosystems, biological refs

[11]

absurdism_surreal_nonsensical // Illogical juxtapositions, nonsense, anti-humor

[12]

identity_demographics_traits // Race, gender, age, disability, sexuality, nationality

[13]

family_relationships_everyday // Parenting, friendships, domestic life, mundane interactions

[14]

emotional_states_mental_health // Anxiety, joy, depression, coping, psychological framing

[15]

supernatural_cosmic_paranormal // Ghosts, aliens, magic, existential cosmic themes

[16]

money_work_technology_modern // Jobs, finance, digital life, institutional critique

[17]

random_objects_miscellaneous // Concrete items/concepts not captured above ### RULES

[18]

Use slugs exactly as written (underscores, lowercase, no spaces)

[19]

{card_text}

Output ONLY valid JSON. No markdown, no preamble, no extra text. ''' USER_PROMPT_TEMPLATE = ''' ### INPUT CARD "{card_text}" ### OUTPUT ''' All 2074 unique white cards in the gameplay dataset were successfully annotated with 1 to 3 topics. Coherence of the annotations was validated by the authors. We also note the topic annotation was done on our final datas...

[20]

Respond ONLY with: <number>

{card_2} ... Respond ONLY with: <number>. <exact card text> where {black_card} is the text of the black card with the blank represented as an underscore, and {card_1} ... {card_N} are the white cards in the slate (N = 10 in all rounds). B.1.2 Multi-Blank Prompt For black cards with two blanks, a target slot was designated (BLANK #1 or BLANK #2) and mod- e...

[21]

Respond ONLY with: <number>

{card_2} ... Respond ONLY with: <number>. <exact card text> where {black_card} is the text of the black card with blanks represented as underscores, {target_slot} is either 1 or 2 indicating which blank to fill, and {card_1} ... {card_N} are the white cards in the slate (N = 10 in all rounds). B.2 Models Abstentions Not all models engaged with every round....
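The prompt fragments extracted above ask models to answer in the form `<number>. <exact card text>` and note that some models abstained on some rounds. A parser for that answer format might look like the sketch below. It is hypothetical rather than the authors' code: it assumes cards are numbered from 1 in the prompt, and it treats any reply that does not start with an in-range number as an abstention.

```python
import re

# A leading integer, optionally followed by ". <card text>".
_RESPONSE = re.compile(r"^\s*(\d+)\s*(?:\.\s*(.*))?$", re.DOTALL)


def parse_pick(response, slate):
    """Return the 0-based index of the chosen card, or None for an abstention.

    Assumes the prompt numbered the slate starting at 1; replies that do
    not match the expected format, or whose number is out of range, are
    treated as abstentions.
    """
    match = _RESPONSE.match(response)
    if match is None:
        return None
    number = int(match.group(1))
    if not 1 <= number <= len(slate):
        return None
    return number - 1
```

A stricter variant could also compare the echoed card text against `slate[number - 1]` and flag mismatches for manual review.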
