arxiv: 2605.25165 · v1 · pith:NZQCTY6Qnew · submitted 2026-05-24 · 💻 cs.IR

Multilingual Humour-Aware Retrieval with Dense and Re-Ranking Models

Pith reviewed 2026-06-29 23:31 UTC · model grok-4.3

classification 💻 cs.IR
keywords humour-aware retrieval · multilingual information retrieval · dense retrieval · neural re-ranking · XLM-RoBERTa · cross-lingual evaluation · CLEF JOKER

The pith

Dense retrieval captures Portuguese humour relevance better than English on the JOKER benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether general-purpose multilingual encoders can handle the extra demands of humour-aware retrieval, where systems must detect not only topical matches but also wordplay, phonetic ambiguity, and polysemy. Using the CLEF 2025 JOKER Task 1 collection, the authors run XLM-RoBERTa dense retrieval plus neural re-ranking variants on English and Portuguese queries. Portuguese runs achieve comparatively strong MAP, MRR, and early-precision scores while English runs place relevant humorous documents at much lower ranks. The work therefore shows that purely semantic dense representations struggle when humour hinges on surface-level cues the models do not explicitly encode. These results supply the first multilingual baselines and isolate language-specific modelling gaps within the JOKER framework.
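For readers unfamiliar with the metrics named above, the following is a minimal sketch of how MAP, MRR, and precision at a cutoff are conventionally computed from ranked lists and relevance judgements. The query and document identifiers are toy data invented for illustration; they are not drawn from the JOKER collection, and the paper's actual evaluation tooling is not stated in this summary.

```python
from typing import Callable, Dict, List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def reciprocal_rank(ranking: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of precision values taken at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_over_queries(
    metric: Callable[[List[str], Set[str]], float],
    runs: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
) -> float:
    """Average a per-query metric over every judged query."""
    return sum(metric(runs.get(q, []), rel) for q, rel in qrels.items()) / len(qrels)


# Toy data (hypothetical, for illustration only).
runs = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4", "d6"]}
qrels = {"q1": {"d1", "d2"}, "q2": {"d5"}}

map_score = mean_over_queries(average_precision, runs, qrels)
mrr_score = mean_over_queries(reciprocal_rank, runs, qrels)
p_at_1 = mean_over_queries(lambda r, rel: precision_at_k(r, rel, 1), runs, qrels)
```

A run that places relevant documents "at much lower ranks" lowers all three numbers, but MRR and early precision react most sharply because they depend only on the top of the list.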

Core claim

Multilingual XLM-RoBERTa dense retrieval combined with re-ranking produces strong results on Portuguese humour queries but markedly weaker results on English queries, with relevant documents frequently ranked low; the gap is attributed to unmodelled surface-level humour phenomena that vary across the two languages.

What carries the argument

XLM-RoBERTa-based dense retrieval plus neural re-ranking applied to the JOKER Task 1 bilingual humour collection.
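The two-stage shape of such a pipeline can be sketched as follows. This is an illustrative outline only: the short vectors stand in for encoder embeddings, and `toy_pair_scores` is a hypothetical placeholder for whatever a neural re-ranker would return for each query-document pair. Neither reflects the paper's actual models or configuration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dense_retrieve(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """First stage: rank every document by embedding similarity, keep the top_k."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def rerank(
    candidates: List[Tuple[str, float]],
    pair_scorer: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Second stage: rescore only the retrieved candidates and sort again."""
    rescored = [(doc_id, pair_scorer(doc_id)) for doc_id, _ in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored


# Toy embeddings (hypothetical stand-ins for encoder output).
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.8, 0.6], "d3": [0.0, 1.0]}
query_vec = [1.0, 0.2]

# Hypothetical re-ranker scores for this query; a real system would compute them.
toy_pair_scores = {"d1": 0.3, "d2": 0.9, "d3": 0.1}

first_stage = dense_retrieve(query_vec, doc_vecs, top_k=2)
final_ranking = rerank(first_stage, lambda doc_id: toy_pair_scores[doc_id])
```

The design point worth noting is that the re-ranker can only reorder what the first stage retrieved, so a humorous document missed by the dense stage cannot be recovered later.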

If this is right

  • Portuguese humour retrieval benefits more from current semantic encoders than English retrieval does.
  • Humour mechanisms relying on wordplay and phonetic cues remain poorly captured by multilingual dense representations.
  • Dataset characteristics and query-document alignment contribute measurably to cross-lingual performance differences.
  • Purely semantic baselines are insufficient for humour-aware tasks and require augmentation with surface-feature modelling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Explicit modelling of phonetic and polysemy features could close the English performance gap observed here.
  • The same dense-plus-re-rank pipeline may transfer unevenly to other language pairs that share similar humour structures.
  • Ablation studies isolating surface cues versus semantic content would clarify which components drive the reported language split.
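One way to picture the kind of surface-cue modelling and ablation suggested in these bullets is a simple score interpolation. The sketch below is an assumption made for illustration, not something the paper reports: it uses character trigram overlap as a crude proxy for surface similarity and mixes it with a semantic score through a hypothetical weight `alpha`.

```python
from typing import Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Set of lowercase character n-grams in the text."""
    lowered = text.lower()
    return {lowered[i:i + n] for i in range(len(lowered) - n + 1)}


def surface_overlap(query: str, document: str, n: int = 3) -> float:
    """Jaccard overlap of character n-grams, a rough surface-similarity signal."""
    q_grams, d_grams = char_ngrams(query, n), char_ngrams(document, n)
    if not q_grams or not d_grams:
        return 0.0
    return len(q_grams & d_grams) / len(q_grams | d_grams)


def fused_score(semantic: float, surface: float, alpha: float) -> float:
    """Linear interpolation; alpha=1.0 is semantic-only, alpha=0.0 is surface-only."""
    return alpha * semantic + (1.0 - alpha) * surface


# Near-homophones share most trigrams even when their meanings differ.
overlap = surface_overlap("knight", "night")
```

Sweeping `alpha` from 1.0 down to 0.0 on each language separately would be one inexpensive form of the ablation mentioned above, since the two endpoints isolate the semantic and surface signals.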

Load-bearing premise

Observed performance gaps between languages stem mainly from humour-specific linguistic features rather than differences in dataset construction or query formulation.

What would settle it

Re-evaluate the same models after swapping or balancing the English and Portuguese document collections while keeping humour mechanisms identical; if the Portuguese advantage disappears, the claim that the gap is humour-driven is falsified.
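A balancing step of the kind described could look like the sketch below. It assumes each document carries a humour-type label, which is an assumption made here for illustration; this summary does not state that such labels are available in the JOKER collection. Document identifiers and labels are invented.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

LabelledDoc = Tuple[str, str]  # (doc_id, humour_type); labels are hypothetical


def group_by_type(docs: List[LabelledDoc]) -> Dict[str, List[str]]:
    """Collect document ids under their humour-type label."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for doc_id, humour_type in docs:
        groups[humour_type].append(doc_id)
    return groups


def balance_collections(
    docs_a: List[LabelledDoc],
    docs_b: List[LabelledDoc],
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Downsample both collections to the same size within each shared humour type."""
    rng = random.Random(seed)
    groups_a, groups_b = group_by_type(docs_a), group_by_type(docs_b)
    sample_a: List[str] = []
    sample_b: List[str] = []
    for humour_type in sorted(set(groups_a) & set(groups_b)):
        k = min(len(groups_a[humour_type]), len(groups_b[humour_type]))
        sample_a.extend(rng.sample(groups_a[humour_type], k))
        sample_b.extend(rng.sample(groups_b[humour_type], k))
    return sample_a, sample_b


# Toy collections (invented identifiers and labels).
english = [("e1", "pun"), ("e2", "pun"), ("e3", "pun"), ("e4", "homophone")]
portuguese = [("p1", "pun"), ("p2", "homophone"), ("p3", "homophone")]

balanced_en, balanced_pt = balance_collections(english, portuguese)
```

Re-running the same models on the balanced samples, ideally over several seeds, would show whether the language gap persists once collection size and humour-type mix are held equal.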

Original abstract

Humour-aware information retrieval poses unique challenges beyond standard semantic retrieval, as systems must account not only for topical relevance but also for humour-specific linguistic phenomena such as wordplay, phonetic ambiguity, and polysemy. In this paper, Team DUTH studies multilingual humour-aware information retrieval using the CLEF 2025 JOKER Task 1 benchmark, which evaluates humour retrieval in English and Portuguese. Our approach combines multilingual XLM-RoBERTa-based dense retrieval with additional system variants, including neural re-ranking, in order to assess the extent to which general-purpose Transformer models can capture humour-specific relevance. The results reveal substantial cross-lingual variation. While the Portuguese runs demonstrate comparatively strong performance across MAP, MRR, and early precision metrics, the English runs perform significantly worse, with relevant humorous documents frequently appearing at lower ranks. These findings highlight the limitations of purely semantic dense representations for humour retrieval, particularly when humour depends on surface-level cues that are not explicitly modelled by multilingual encoders. We further analyse contributing factors to this discrepancy, including dataset characteristics, query-document alignment, and variation in humour mechanisms. Overall, the Team DUTH experiments establish multilingual dense-retrieval and re-ranking baselines and provide insights into the challenges of modelling humour-aware relevance within the JOKER framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents an empirical study of multilingual humour-aware retrieval on the CLEF 2025 JOKER Task 1 benchmark for English and Portuguese. It combines XLM-RoBERTa dense retrieval with neural re-ranking variants to test whether general-purpose multilingual encoders can capture humour-specific phenomena (wordplay, phonetic ambiguity, polysemy) in addition to topical relevance. The central claim is that substantial cross-lingual performance gaps exist, with Portuguese runs showing comparatively strong MAP/MRR/early-precision results while English runs place relevant humorous documents at lower ranks; the authors attribute this to limitations of purely semantic dense representations and analyse contributing factors including dataset characteristics and humour mechanisms.

Significance. If the reported performance differences can be shown to survive controls for dataset construction and query formulation, the work would supply useful multilingual baselines on an external benchmark and would usefully illustrate the boundary conditions of dense retrieval for humour. The explicit comparison across languages and the inclusion of re-ranking variants are positive elements.

major comments (2)
  1. [Abstract] Abstract (analysis paragraph): the claim that performance differences are primarily attributable to humour-specific linguistic phenomena (rather than dataset construction, query formulation, or document collection procedures) is load-bearing for the central conclusion, yet the manuscript supplies no quantitative controls, matched subsets, humour-type distribution statistics, or ablation isolating these factors.
  2. [Abstract] Abstract (results paragraph): comparative results on MAP, MRR, and early precision are asserted without any tables, statistical significance tests, error bars, run-by-run scores, or ablation details; this absence prevents assessment of the magnitude and reliability of the reported cross-lingual variation.
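The error bars and significance tests requested in the second comment can be produced from per-query scores alone. The sketch below shows two common choices, a bootstrap confidence interval for a mean score and a paired sign-flip randomization test between two runs over the same queries. The score lists are invented for illustration and the default resample counts are arbitrary assumptions.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(
    scores: List[float],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-query scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    tail = (1.0 - level) / 2.0
    low_index = int(tail * n_resamples)
    high_index = int((1.0 - tail) * n_resamples) - 1
    return means[low_index], means[high_index]


def paired_randomization_test(
    scores_a: List[float],
    scores_b: List[float],
    n_permutations: int = 5000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the mean per-query difference between two runs."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs must be scored on the same queries")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)


# Invented per-query average-precision values for two runs on the same queries.
dense_ap = [0.10, 0.40, 0.35, 0.80, 0.60, 0.20, 0.90, 0.50]
reranked_ap = [0.15, 0.45, 0.30, 0.85, 0.70, 0.25, 0.95, 0.55]

ci_low, ci_high = bootstrap_ci(dense_ap)
p_value = paired_randomization_test(dense_ap, reranked_ap)
```

The paired test applies to runs evaluated on the same query set, such as dense retrieval versus a re-ranked variant within one language. English and Portuguese use different queries, so a cross-lingual comparison would rest on the per-language intervals rather than on a paired test.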

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of results and supporting analyses.

Point-by-point responses
  1. Referee: [Abstract] Abstract (analysis paragraph): the claim that performance differences are primarily attributable to humour-specific linguistic phenomena (rather than dataset construction, query formulation, or document collection procedures) is load-bearing for the central conclusion, yet the manuscript supplies no quantitative controls, matched subsets, humour-type distribution statistics, or ablation isolating these factors.

    Authors: The manuscript provides qualitative analysis of contributing factors including dataset characteristics, query-document alignment, and variation in humour mechanisms. We agree that the attribution to humour-specific phenomena would be strengthened by quantitative controls. In revision we will add humour-type distribution statistics, matched-subset comparisons, and ablation experiments isolating these factors.
    Revision: yes

  2. Referee: [Abstract] Abstract (results paragraph): comparative results on MAP, MRR, and early precision are asserted without any tables, statistical significance tests, error bars, run-by-run scores, or ablation details; this absence prevents assessment of the magnitude and reliability of the reported cross-lingual variation.

    Authors: The abstract summarises the findings at a high level while the full manuscript contains the underlying run scores. To improve verifiability we will add statistical significance tests, error bars, explicit run-by-run tables, and ablation details to the results section and ensure the abstract references these elements.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with no derivations or self-referential claims

Full rationale

The paper reports experimental results from running dense retrieval and re-ranking models on the external CLEF 2025 JOKER Task 1 benchmark for English and Portuguese. No equations, parameter fitting presented as prediction, uniqueness theorems, or ansatzes appear. Central claims rest on observed MAP/MRR/precision differences and qualitative discussion of dataset factors; these do not reduce to the inputs by construction. Self-citation is absent from the provided text. This is a standard empirical comparison whose validity is externally falsifiable against the benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical evaluation paper; central claim rests on the representativeness of the JOKER benchmark and the appropriateness of standard IR metrics for humour-aware relevance.

axioms (1)
  • Domain assumption: The CLEF 2025 JOKER Task 1 benchmark constitutes a valid and representative test of humour-aware retrieval in English and Portuguese.
    The paper adopts the benchmark without independent validation of its query or document construction.


