HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models

Edward Ajayi; Prasenjit Mitra

arxiv: 2604.19786 · v2 · pith:H5JCURQPnew · submitted 2026-03-31 · 💻 cs.CL

Pith reviewed 2026-05-13 23:01 UTC · model grok-4.3

keywords humor generation · LLM evaluation · tournament ranking · General Theory of Verbal Humor · pairwise comparison · Bradley-Terry model · model benchmarking · comedic mechanisms

The pith

HumorRank ranks language models on humor generation through automated joke tournaments that reveal skill in comedic mechanisms over model size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HumorRank as a tournament-based system to evaluate and compare how well large language models create humorous text. Pairwise matchups between model outputs draw on the General Theory of Verbal Humor for judgments, which are then organized through an Adaptive Swiss tournament structure and turned into overall rankings with Bradley-Terry statistical modeling. Testing across nine models from different categories produces clear stratifications that tie higher performance to better command of humor techniques. This replaces scattered individual metrics with one consistent leaderboard. The result supplies a repeatable way to measure and track advances in AI humor generation.
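As a rough illustration of the scheduling step, a generic Swiss round pairs models with similar running scores and avoids rematches where it can. The sketch below assumes that plain scheme only; the paper's Adaptive Swiss rules are not spelled out in this summary, and `swiss_round`, `scores`, and `played` are hypothetical names introduced for the example.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def swiss_round(scores: Dict[str, float],
                played: Set[FrozenSet[str]]) -> List[Tuple[str, str]]:
    """Pair models with similar current scores, avoiding rematches when possible.

    Generic Swiss pairing (an assumption, not the paper's Adaptive Swiss variant).
    """
    # Sort by score, highest first; ties are broken by name for determinism.
    order = sorted(scores, key=lambda m: (-scores[m], m))
    pairs = []
    while len(order) >= 2:
        first = order.pop(0)
        # Prefer the closest-ranked opponent not yet faced;
        # fall back to the closest-ranked one if all have been faced.
        idx = next((i for i, m in enumerate(order)
                    if frozenset((first, m)) not in played), 0)
        pairs.append((first, order.pop(idx)))
    return pairs  # with an odd number of models, the last one sits out


# Hypothetical running scores after two rounds; A and B have already met.
scores = {"A": 2, "B": 2, "C": 1, "D": 0}
print(swiss_round(scores, {frozenset(("A", "B"))}))  # [('A', 'C'), ('B', 'D')]
```

This greedy pass can still force a rematch near the bottom of the table; production schedulers usually solve the pairing as a matching problem instead.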

Core claim

HumorRank is a tournament-based evaluation framework and leaderboard that performs automated pairwise comparisons of LLM-generated humor using judgments grounded in the General Theory of Verbal Humor, aggregates those results via an Adaptive Swiss tournament, and derives globally consistent rankings through Bradley-Terry Maximum Likelihood Estimation, yielding statistically grounded model stratifications that show humor quality depends on mastery of comedic mechanisms rather than model scale.

What carries the argument

HumorRank tournament system that converts GTVH-grounded pairwise judgments into global rankings through Adaptive Swiss scheduling and Bradley-Terry MLE.
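To make the aggregation step concrete, here is a minimal sketch of Bradley-Terry maximum likelihood estimation from a list of pairwise outcomes. It uses the standard minorization-maximization (MM) update and assumes every model wins at least once and that all models are connected through comparisons; the paper's actual fitting procedure and rating scale are not described in this summary, and `fit_bradley_terry` is a hypothetical name.

```python
from typing import Dict, FrozenSet, List, Tuple


def fit_bradley_terry(matches: List[Tuple[str, str]],
                      iters: int = 500) -> Dict[str, float]:
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    MM update: p_i <- W_i / sum_j [ n_ij / (p_i + p_j) ],
    where W_i is i's win count and n_ij the games between i and j.
    Assumes each model has at least one win (otherwise its strength tends to 0).
    """
    models = sorted({m for match in matches for m in match})
    wins = {m: 0 for m in models}
    games: Dict[FrozenSet[str], int] = {}
    for winner, loser in matches:
        wins[winner] += 1
        key = frozenset((winner, loser))
        games[key] = games.get(key, 0) + 1

    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(cnt / (p[i] + p[j])
                        for key, cnt in games.items() if i in key
                        for j in key - {i})
            new[i] = wins[i] / denom
        total = sum(new.values())
        p = {m: v / total for m, v in new.items()}  # fix the scale: sum to 1
    return p


# Hypothetical data: A beats B in 3 of 4 comparisons.
strengths = fit_bradley_terry([("A", "B")] * 3 + [("B", "A")])
print({m: round(v, 3) for m, v in strengths.items()})  # {'A': 0.75, 'B': 0.25}
```

With only two models the fitted strengths reduce to the observed win rates; the model earns its keep when comparisons are incomplete, as in a Swiss schedule where not every pair meets equally often.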

Load-bearing premise

Automated pairwise judgments based on the General Theory of Verbal Humor accurately capture true humor quality without systematic bias.

What would settle it

A direct comparison study in which human raters evaluate the same model outputs and produce model rankings that differ substantially from those generated by HumorRank.

Figures

Figures reproduced from arXiv: 2604.19786 by Edward Ajayi, Prasenjit Mitra.

**Figure 1.** HumorRank Leaderboard (left) and Pairwise Win-Rate Heatmap (right) showing the performance of the 9 models. Remarkably, the specialized HumorGen-7B model (Rank 4, BT = 1092.8) successfully bridges the gap between the mid-tier open-weights and the proprietary frontier, cleanly outperforming models an order of magnitude larger (e.g., GPT OSS 120B, Rank 6) as shown in …

**Figure 2.** Per-model winning feature distributions (Llama 3.3 70B judge). Left: Humor mechanisms (% of wins). Right: Delivery features (% of wins). Frontier models dominate via Conciseness; the specialist model leads on Absurdity and Escalation; baseline models over-index on Wordplay. … higher Overexplained (25.2%) and Buried Punchline (20.4%) failure rates than any other model, indicating that its aggressive structura…

**Figure 3.** Per-model failure mode distributions (Llama 3.3 70B judge). Cliché and Weak Punchline dominate most models, but HumorGen-7B stands out with markedly higher Overexplained and Buried Punchline rates, a byproduct of its deep-structure comedic strategy. Qwen 2.5 72B failure modes are in Appendix F.

**Figure 4.** HumorRank Leaderboard (top) and Pairwise Win-Rate Heatmap (bottom) showing the performance of the 9 models.

**Figure 5.** Four representative LLaMA judge decisions. Winner ✓ (green) and Loser × (red) are labelled directly on each joke box. Feature rows indicate winning humor traits (green), delivery strengths (blue), and loser weaknesses (red). ELO deltas are approximated from the evaluation log.

**Figure 6.** Prompt template used for all 10,800 pairwise comparisons. Template variables ({headline}, {joke a}, {joke b}) are instantiated per comparison. The three feature lists (humor mechanisms, delivery, and loser features) enforce structured and consistent JSON outputs across all evaluations.

**Figure 7.** Per-model winning feature distributions (Qwen 2.5 72B judge). Left: Humor mechanisms. Right: Delivery features. Rank patterns are consistent with the primary Llama judge (…

**Figure 8.** Per-model failure mode distributions (Qwen 2.5 72B judge). HumorGen-7B again shows markedly higher Overexplained (49.5%) rates compared to other models, consistent with findings under the primary Llama judge (…

**Figure 9.** Instructions screen (HumorRank Blind Evaluation)

**Figure 10.** Sample evaluated pair showing the blind comparison interface. Evaluators see two anonymized jokes (Option A and Option B) for a given headline and select the funnier response.

Original abstract

Humor remains difficult to evaluate in large language models (LLMs) because what makes a response funny is subjective, comparative, and shaped by interacting comedic mechanisms rather than a single scalar property. Existing humor evaluation protocols therefore tend to produce isolated scores or task-specific judgments that are difficult to compare across models. We introduce HumorRank, a tournament-based framework for ranking textual humor generation through theory-grounded pairwise preference judgments. Across SemEval-2026 MWAHAHA and Humor Transfer Bench, HumorRank evaluates nine proprietary, open-weight, and specialized models using LLM-based comparative judgments informed by the General Theory of Verbal Humor (GTVH), with tournament aggregation yielding global rankings via Bradley-Terry estimation. The resulting rankings are cross-judge stable: independent Llama and Qwen LLM judges achieve Kendall τ = 0.889 on both benchmarks. The leaderboard reveals clear model stratification, showing that strong humor generation depends not only on scale but on mastery of comedic mechanisms such as incongruity, conciseness, escalation, and absurdity. HumorRank provides a scalable and interpretable methodology for benchmarking LLM-generated humor without relying solely on isolated automatic metrics or limited human evaluation.
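For readers unfamiliar with the cross-judge stability figure, the sketch below computes Kendall's τ between two rankings of the same items. It assumes the tie-free τ-a variant and uses placeholder model names `m1` to `m9`, not the paper's actual leaderboard. With nine items there are 36 pairs, so a value of 0.889 is what two discordant pairs would produce, (34 - 2) / 36, if the statistic was computed this way.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(rank_a: Sequence[str], rank_b: Sequence[str]) -> float:
    """Kendall tau-a between two orderings of the same items (no ties)."""
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Positive product: both judges order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Placeholder rankings: the second judge swaps two adjacent pairs.
judge_1 = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m9"]
judge_2 = ["m2", "m1", "m3", "m4", "m6", "m5", "m7", "m8", "m9"]
print(round(kendall_tau(judge_1, judge_2), 3))  # 0.889
```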

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HumorRank applies Adaptive Swiss tournaments and Bradley-Terry aggregation to GTVH-grounded pairwise judgments for ranking LLM humor generation, which is a new combination, but the lack of human validation on those judgments undercuts the claim that mechanism mastery trumps model scale.

The main takeaway is that HumorRank applies Adaptive Swiss tournaments and Bradley-Terry aggregation to GTVH-grounded pairwise judgments for ranking LLM humor generation, which is a new combination, but the lack of human validation on those judgments undercuts the claim that mechanism mastery trumps model scale.

The paper does a solid job of moving past isolated metrics by creating a unified leaderboard. Using the SemEval-2026 MWAHAHA test set across nine models from different categories, it runs pairwise evaluations based on the General Theory of Verbal Humor and aggregates them into consistent rankings. This setup is scalable and gives a way to compare systems directly, which is useful for tracking progress in this tricky area.

What stands out as new is the specific application of tournament scheduling to humor, along with the MLE for global consistency. Prior work on humor eval didn't combine these elements this way, so the framework itself has some novelty.

The soft spots come down to the automated judgments. The abstract and available details don't include any human validation, inter-rater agreement scores, or bias audits for the GTVH-based evaluators. Since the key result, that humor quality comes from comedic mechanisms rather than scale, depends entirely on those judgments being accurate, this is a real gap. Without that, the stratification could reflect judge biases instead of true differences. Sensitivity to tournament parameters also isn't addressed, which leaves the robustness unclear.

This work is aimed at researchers in LLM evaluation and natural language generation who care about creative tasks. A reader looking for new benchmark ideas would get value from the overall structure and could adapt the tournament approach, but anyone relying on the specific findings would need to verify the judgments first.

It deserves a serious referee. The methods are grounded in established ranking techniques, and the problem of incomparable humor metrics is real, so feedback from reviewers could strengthen it. I would recommend sending it to peer review, provided the authors are prepared to add validation steps and more empirical details on the results.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces HumorRank, a tournament-based evaluation framework and leaderboard for textual humor generation in LLMs. Using the SemEval-2026 MWAHAHA test dataset, it conducts automated pairwise judgments grounded in the General Theory of Verbal Humor (GTVH) across nine models, aggregates outcomes via an Adaptive Swiss tournament, and applies Bradley-Terry MLE to produce globally consistent rankings. The central claim is that these rankings yield statistically grounded stratifications demonstrating that humor quality is driven by mastery of comedic mechanisms rather than model scale alone.

Significance. If the automated judgments prove reliable, HumorRank would provide a valuable, scalable methodology for unified benchmarking of LLM humor generation, replacing isolated incomparable metrics with interpretable global rankings. The GTVH grounding and Bradley-Terry aggregation offer a theoretically motivated and reproducible approach that could help track progress and identify key drivers of humor capability.

major comments (2)

Evaluation pipeline (abstract and methods): The automated GTVH-based pairwise judgments lack any reported human validation, inter-annotator agreement metrics, or bias audit. This is load-bearing for the headline claim that the stratifications show mechanism mastery (not scale) drives humor quality, as systematic bias in the LLM judge correlated with model family or size could artifactually produce the observed inversion.

Results section: No statistical significance tests for the scale-vs-mechanism finding, confidence intervals on the Bradley-Terry parameters, or sensitivity analysis to Adaptive Swiss tournament parameters are reported, leaving the assertion of 'statistically grounded model stratifications' under-supported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which highlight important areas for strengthening the manuscript. We address each major point below and commit to revisions that enhance the rigor and transparency of our evaluation framework.

Referee: Evaluation pipeline (abstract and methods): The automated GTVH-based pairwise judgments lack any reported human validation, inter-annotator agreement metrics, or bias audit. This is load-bearing for the headline claim that the stratifications show mechanism mastery (not scale) drives humor quality, as systematic bias in the LLM judge correlated with model family or size could artifactually produce the observed inversion.

Authors: We agree that human validation is essential to substantiate the reliability of the automated GTVH judgments and rule out potential biases. In the revised manuscript, we will add a dedicated validation subsection reporting results from a human study on a stratified sample of 300 pairwise comparisons. This will include inter-annotator agreement metrics (Cohen's kappa and Fleiss' kappa), a bias audit examining correlations between judgment errors and model family/size, and qualitative analysis of disagreement cases. These additions will directly support the claim that observed stratifications reflect genuine differences in comedic mechanism mastery rather than judge artifacts.

Revision: yes
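The agreement metric named in this response can be illustrated with a short sketch of Cohen's kappa for two raters choosing between Option A and Option B on the same pairs. The labels below are made up for the example, `cohens_kappa` is a hypothetical helper, and nothing here reflects data from the promised human study.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_1: Sequence[str], rater_2: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance.

    Undefined (division by zero) when chance agreement is exactly 1,
    i.e. both raters always give the same single label.
    """
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    freq_1, freq_2 = Counter(rater_1), Counter(rater_2)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_1[k] * freq_2[k] for k in freq_1) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up choices on four joke pairs: the raters disagree on the last one.
human = ["A", "A", "B", "B"]
judge = ["A", "A", "B", "A"]
print(cohens_kappa(human, judge))  # 0.5
```

Raw agreement here is 75%, but half of that is expected by chance given the label frequencies, which is why kappa comes out at 0.5.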
Referee: Results section: No statistical significance tests for the scale-vs-mechanism finding, confidence intervals on the Bradley-Terry parameters, or sensitivity analysis to Adaptive Swiss tournament parameters are reported, leaving the assertion of 'statistically grounded model stratifications' under-supported.

Authors: We acknowledge that the current results section would be strengthened by explicit statistical support. In the revision, we will expand the results to include: bootstrap 95% confidence intervals on all Bradley-Terry parameters; statistical significance tests (Mann-Whitney U and permutation tests) comparing the mechanism-mastery group against scale-based groupings; and sensitivity analyses varying Adaptive Swiss parameters (e.g., round count from 4-12 and reporting Kendall tau rank stability across configurations). These will be presented with tables and figures to rigorously ground the reported stratifications.

Revision: yes
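One way the promised bootstrap intervals could be computed is sketched below: resample the match list with replacement, refit Bradley-Terry strengths each time, and take percentiles. This is an assumption about the procedure, not the authors' implementation. `fit_bt` and `bootstrap_ci` are hypothetical names, and the `prior` pseudo-counts are a regularization added here so that a resample in which some model never wins still yields finite estimates.

```python
import random
from itertools import combinations
from typing import Dict, List, Tuple

Match = Tuple[str, str]  # (winner, loser)


def fit_bt(matches: List[Match], models: List[str],
           iters: int = 200, prior: float = 0.5) -> Dict[str, float]:
    """MM fit of Bradley-Terry strengths, normalized to sum to 1.

    `prior` adds that many virtual wins in each direction for every pair
    (an assumption of this sketch, not part of the plain MLE).
    """
    wins = {m: prior * (len(models) - 1) for m in models}
    games = {frozenset(pair): 2 * prior for pair in combinations(models, 2)}
    for winner, loser in matches:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
    p = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        new = {i: wins[i] / sum(games[frozenset((i, j))] / (p[i] + p[j])
                                for j in models if j != i)
               for i in models}
        total = sum(new.values())
        p = {m: v / total for m, v in new.items()}
    return p


def bootstrap_ci(matches: List[Match], n_boot: int = 200,
                 alpha: float = 0.05, seed: int = 0
                 ) -> Dict[str, Tuple[float, float]]:
    """Percentile bootstrap interval for each model's strength."""
    rng = random.Random(seed)
    models = sorted({m for match in matches for m in match})
    samples: Dict[str, List[float]] = {m: [] for m in models}
    for _ in range(n_boot):
        resample = rng.choices(matches, k=len(matches))  # with replacement
        for m, v in fit_bt(resample, models).items():
            samples[m].append(v)
    ci = {}
    for m, vals in samples.items():
        vals.sort()
        lo = vals[int((alpha / 2) * (n_boot - 1))]
        hi = vals[int((1 - alpha / 2) * (n_boot - 1))]
        ci[m] = (lo, hi)
    return ci


# Made-up outcomes: A beats B in 60 of 80 comparisons.
intervals = bootstrap_ci([("A", "B")] * 60 + [("B", "A")] * 20, n_boot=100)
print({m: (round(lo, 2), round(hi, 2)) for m, (lo, hi) in intervals.items()})
```

Non-overlapping intervals between adjacent leaderboard positions would be one direct way to back the phrase "statistically grounded model stratifications".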

Circularity Check

0 steps flagged

No circularity: HumorRank rankings derive from external GTVH judgments aggregated by standard MLE without self-referential reduction

The paper's derivation chain consists of (1) applying the external General Theory of Verbal Humor to generate automated pairwise judgments on the SemEval-2026 dataset, followed by (2) aggregation via Adaptive Swiss tournament and Bradley-Terry MLE to produce global rankings. No equations, self-citations, or ansatzes reduce the final stratifications or the claim that mechanism mastery (not scale) drives quality to the inputs by construction. The MLE step is a standard statistical aggregation of independent judgment data; GTVH supplies an external theoretical basis rather than a self-defined loop. Absence of human validation is a correctness risk but does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that GTVH supplies reliable criteria for automated pairwise humor judgments and on the statistical model that Bradley-Terry MLE produces globally consistent rankings from tournament data.

free parameters (1)

Bradley-Terry strength parameters: Maximum likelihood estimation fits one strength parameter per model to the observed pairwise win rates.

axioms (1)

Domain assumption: General Theory of Verbal Humor provides valid, automatable criteria for judging relative humor quality. Invoked to ground all pairwise comparisons in the tournament.
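For reference, the statistical model behind the ledger's free-parameter entry can be written out as follows. This is the textbook Bradley-Terry formulation, stated here as background rather than quoted from the paper; the symbols $p_i$ and $w_{ij}$ are notation introduced for this note.

```latex
% Bradley-Terry model: p_i > 0 is the strength of model i,
% w_{ij} is the number of comparisons in which model i beat model j.
\[
  \Pr(i \succ j) = \frac{p_i}{p_i + p_j},
  \qquad
  \ell(p) = \sum_{i \neq j} w_{ij} \, \log \frac{p_i}{p_i + p_j}.
\]
```

The likelihood is unchanged if every $p_i$ is multiplied by the same constant, so the scale has to be fixed by convention (for example $\sum_i p_i = 1$). For $K$ models that leaves $K - 1$ freely fitted values, which would be eight for the nine models evaluated here.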

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: Pairwise judgments grounded in the General Theory of Verbal Humor (GTVH) are aggregated via an Adaptive Swiss tournament, with Bradley-Terry Maximum Likelihood Estimation (MLE) producing globally consistent humor generation capability rankings.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.