Multilingual Humour-Aware Retrieval with Dense and Re-Ranking Models
Pith reviewed 2026-06-29 23:31 UTC · model grok-4.3
The pith
Dense retrieval captures Portuguese humour relevance better than English on the JOKER benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multilingual XLM-RoBERTa dense retrieval combined with re-ranking produces strong results on Portuguese humour queries but markedly weaker results on English queries, with relevant documents frequently ranked low; the gap is attributed to unmodelled surface-level humour phenomena that vary across the two languages.
What carries the argument
XLM-RoBERTa-based dense retrieval plus neural re-ranking applied to the JOKER Task 1 bilingual humour collection.
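The two-stage shape of such a pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_encode` (character-trigram vectors) and `toy_rerank_score` (query-token overlap) are hypothetical stand-ins for the XLM-RoBERTa encoder and the neural re-ranker, chosen only so the sketch runs without model downloads.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_encode(text: str) -> Vector:
    """Stand-in encoder: L2-normalised character-trigram counts.

    ASSUMPTION: placeholder for a dense encoder such as XLM-RoBERTa.
    """
    padded = f"  {text.lower()} "
    grams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    norm = math.sqrt(sum(c * c for c in grams.values())) or 1.0
    return {g: c / norm for g, c in grams.items()}


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two sparse vectors (cosine, since both are normalised)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def toy_rerank_score(query: str, doc: str) -> float:
    """Stand-in re-ranker: fraction of query tokens found in the document.

    ASSUMPTION: placeholder for a neural re-ranking model.
    """
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


def retrieve_then_rerank(
    query: str,
    docs: Dict[str, str],
    encode: Callable[[str], Vector] = toy_encode,
    rerank_score: Callable[[str, str], float] = toy_rerank_score,
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top-k first-stage candidates, re-ordered by the re-ranker."""
    q_vec = encode(query)
    # Stage 1: rank every document by vector similarity and keep the top k.
    candidates = sorted(
        docs, key=lambda d: dot(q_vec, encode(docs[d])), reverse=True
    )[:k]
    # Stage 2: re-score only the candidates with the (more expensive) re-ranker.
    rescored = [(d, rerank_score(query, docs[d])) for d in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

Passing real models through the `encode` and `rerank_score` parameters would keep the control flow unchanged; the first stage bounds recall, so a relevant joke missed there cannot be recovered by the re-ranker.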
If this is right
- Portuguese humour retrieval benefits more from current semantic encoders than English retrieval does.
- Humour mechanisms relying on wordplay and phonetic cues remain poorly captured by multilingual dense representations.
- Dataset characteristics and query-document alignment contribute measurably to cross-lingual performance differences.
- Purely semantic baselines are insufficient for humour-aware tasks and require augmentation with surface-feature modelling.
Where Pith is reading between the lines
- Explicit modelling of phonetic and polysemy features could close the English performance gap observed here.
- The same dense-plus-re-rank pipeline may transfer unevenly to other language pairs that share similar humour structures.
- Ablation studies isolating surface cues versus semantic content would clarify which components drive the reported language split.
Load-bearing premise
Observed performance gaps between languages stem mainly from humour-specific linguistic features rather than differences in dataset construction or query formulation.
What would settle it
Re-evaluate the same models after swapping or balancing the English and Portuguese document collections while keeping humour mechanisms identical; if the Portuguese advantage disappears, the claim that the gap is humour-driven is falsified.
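One way to approximate the balancing step is a matched-subset comparison: downsample per-query scores so that both languages contribute the same number of queries for each humour type, then compare the means. The sketch below assumes each query carries a humour-type label; such labels, the record layout, and the function name `matched_subset_means` are hypothetical and are not taken from the paper or the benchmark.

```python
import random
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# (language, humour_type, per-query score such as average precision)
Record = Tuple[str, str, float]


def matched_subset_means(
    records: List[Record], lang_a: str, lang_b: str, seed: int = 0
) -> Dict[str, float]:
    """Mean score per language after matching humour-type counts.

    ASSUMPTION: humour-type labels exist for every query.
    """
    rng = random.Random(seed)
    by_key: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for lang, humour_type, score in records:
        by_key[(lang, humour_type)].append(score)

    humour_types = sorted({t for (_, t) in by_key})
    kept: Dict[str, List[float]] = {lang_a: [], lang_b: []}
    for t in humour_types:
        a_scores = by_key.get((lang_a, t), [])
        b_scores = by_key.get((lang_b, t), [])
        # Keep the same number of queries from each language for this type.
        n = min(len(a_scores), len(b_scores))
        kept[lang_a] += rng.sample(a_scores, n)
        kept[lang_b] += rng.sample(b_scores, n)

    if not kept[lang_a]:
        raise ValueError("no humour type is shared by both languages")
    return {lang: mean(scores) for lang, scores in kept.items()}
```

If the gap between the two means shrinks substantially after matching, the humour-type mix of the collections explains part of the raw difference. Because the subset is sampled, repeating the procedure over several seeds would give a more stable picture.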
Original abstract
Humour-aware information retrieval poses unique challenges beyond standard semantic retrieval, as systems must account not only for topical relevance but also for humour-specific linguistic phenomena such as wordplay, phonetic ambiguity, and polysemy. In this paper, Team DUTH studies multilingual humour-aware information retrieval using the CLEF 2025 JOKER Task 1 benchmark, which evaluates humour retrieval in English and Portuguese. Our approach combines multilingual XLM-RoBERTa-based dense retrieval with additional system variants, including neural re-ranking, in order to assess the extent to which general-purpose Transformer models can capture humour-specific relevance. The results reveal substantial cross-lingual variation. While the Portuguese runs demonstrate comparatively strong performance across MAP, MRR, and early precision metrics, the English runs perform significantly worse, with relevant humorous documents frequently appearing at lower ranks. These findings highlight the limitations of purely semantic dense representations for humour retrieval, particularly when humour depends on surface-level cues that are not explicitly modelled by multilingual encoders. We further analyse contributing factors to this discrepancy, including dataset characteristics, query-document alignment, and variation in humour mechanisms. Overall, the Team DUTH experiments establish multilingual dense-retrieval and re-ranking baselines and provide insights into the challenges of modelling humour-aware relevance within the JOKER framework.
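For readers unfamiliar with the metrics named in the abstract, the following sketch shows the standard binary-relevance definitions of MAP, MRR, and precision at k. The function names and the `run`/`qrels` dictionary layout are illustrative assumptions, not the benchmark's official evaluation tooling.

```python
from typing import Dict, List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for d in ranking[:k] if d in relevant) / k


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank holding a relevant document."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def reciprocal_rank(ranking: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 5
) -> Dict[str, float]:
    """Average each metric over all queries that have relevance judgements."""
    if not qrels:
        raise ValueError("qrels must contain at least one query")
    n = len(qrels)
    rankings = {q: run.get(q, []) for q in qrels}
    return {
        "MAP": sum(average_precision(rankings[q], qrels[q]) for q in qrels) / n,
        "MRR": sum(reciprocal_rank(rankings[q], qrels[q]) for q in qrels) / n,
        f"P@{k}": sum(precision_at_k(rankings[q], qrels[q], k) for q in qrels) / n,
    }
```

Relevant documents "appearing at lower ranks", as the abstract describes for English, lower all three values at once: reciprocal rank drops with the first hit, and average precision accumulates smaller terms for every later hit.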
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an empirical study of multilingual humour-aware retrieval on the CLEF 2025 JOKER Task 1 benchmark for English and Portuguese. It combines XLM-RoBERTa dense retrieval with neural re-ranking variants to test whether general-purpose multilingual encoders can capture humour-specific phenomena (wordplay, phonetic ambiguity, polysemy) in addition to topical relevance. The central claim is that substantial cross-lingual performance gaps exist, with Portuguese runs showing comparatively strong MAP/MRR/early-precision results while English runs place relevant humorous documents at lower ranks; the authors attribute this to limitations of purely semantic dense representations and analyse contributing factors including dataset characteristics and humour mechanisms.
Significance. If the reported performance differences can be shown to survive controls for dataset construction and query formulation, the work would supply useful multilingual baselines on an external benchmark and illustrate the boundary conditions of dense retrieval for humour. The explicit comparison across languages and the inclusion of re-ranking variants are positive elements.
Major comments (2)
- [Abstract] Abstract (analysis paragraph): the claim that performance differences are primarily attributable to humour-specific linguistic phenomena (rather than dataset construction, query formulation, or document collection procedures) is load-bearing for the central conclusion, yet the manuscript supplies no quantitative controls, matched subsets, humour-type distribution statistics, or ablation isolating these factors.
- [Abstract] Abstract (results paragraph): comparative results on MAP, MRR, and early precision are asserted without any tables, statistical significance tests, error bars, run-by-run scores, or ablation details; this absence prevents assessment of the magnitude and reliability of the reported cross-lingual variation.
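As an illustration of the kind of significance test the second comment asks for, the sketch below runs a two-sample permutation test on per-query scores from two groups of queries (for example, English and Portuguese). It is one reasonable choice among several, not a procedure taken from the paper; the function name and defaults are assumptions, and the two groups are treated as unpaired because the languages have different query sets.

```python
import random
from typing import List, Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def permutation_test(
    group_a: Sequence[float],
    group_b: Sequence[float],
    n_resamples: int = 10000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the difference in mean per-query score.

    Group labels are shuffled at random; the p-value is the share of
    shuffles whose absolute mean difference is at least the observed one.
    """
    if not group_a or not group_b:
        raise ValueError("both groups need at least one score")
    rng = random.Random(seed)
    observed = abs(_mean(group_a) - _mean(group_b))
    pooled: List[float] = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        # Small tolerance so floating-point ties count as "at least as large".
        if diff >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_resamples + 1)
```

When two system variants are compared on the same queries (for instance, dense retrieval with and without re-ranking), a paired test on per-query differences would be the more appropriate design.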
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of results and supporting analyses.
Point-by-point responses
- Referee: [Abstract] Abstract (analysis paragraph): the claim that performance differences are primarily attributable to humour-specific linguistic phenomena (rather than dataset construction, query formulation, or document collection procedures) is load-bearing for the central conclusion, yet the manuscript supplies no quantitative controls, matched subsets, humour-type distribution statistics, or ablation isolating these factors.
  Authors: The manuscript provides qualitative analysis of contributing factors including dataset characteristics, query-document alignment, and variation in humour mechanisms. We agree that the attribution to humour-specific phenomena would be strengthened by quantitative controls. In revision we will add humour-type distribution statistics, matched-subset comparisons, and ablation experiments isolating these factors.
  Revision: yes
- Referee: [Abstract] Abstract (results paragraph): comparative results on MAP, MRR, and early precision are asserted without any tables, statistical significance tests, error bars, run-by-run scores, or ablation details; this absence prevents assessment of the magnitude and reliability of the reported cross-lingual variation.
  Authors: The abstract summarises the findings at a high level while the full manuscript contains the underlying run scores. To improve verifiability we will add statistical significance tests, error bars, explicit run-by-run tables, and ablation details to the results section and ensure the abstract references these elements.
  Revision: yes
Circularity Check
No circularity: empirical benchmark evaluation with no derivations or self-referential claims
Full rationale
The paper reports experimental results from running dense retrieval and re-ranking models on the external CLEF 2025 JOKER Task 1 benchmark for English and Portuguese. No equations, parameter fitting presented as prediction, uniqueness theorems, or ansatzes appear. Central claims rest on observed MAP/MRR/precision differences and qualitative discussion of dataset factors; these do not reduce to the inputs by construction. Self-citation is absent from the provided text. This is a standard empirical comparison whose validity is externally falsifiable against the benchmark.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The CLEF 2025 JOKER Task 1 benchmark constitutes a valid and representative test of humour-aware retrieval in English and Portuguese.