MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
Recognition: 1 theorem link · Lean Theorem
Pith reviewed 2026-05-15 13:49 UTC · model grok-4.3
The pith
Existing RAG systems are inadequate for answering multi-hop queries that require retrieving and reasoning over multiple pieces of evidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors develop MultiHop-RAG consisting of a news article knowledge base, multi-hop queries, ground-truth answers, and supporting evidence. Benchmarking experiments demonstrate that existing RAG methods, including various embedding models and state-of-the-art LLMs, perform poorly in retrieving relevant evidence and in reasoning to produce correct answers for multi-hop queries.
What carries the argument
The MultiHop-RAG dataset, which provides a knowledge base, multi-hop queries, answers, and evidence to benchmark RAG performance on complex queries.
If this is right
- Current embedding models need enhancement to retrieve multiple pieces of evidence effectively for a single query.
- LLMs require better integration of information from several sources to answer multi-hop questions accurately.
- RAG systems evaluated only on simple queries may overestimate their real-world utility.
- Specialized methods for multi-hop retrieval and reasoning should be developed to improve overall RAG effectiveness.
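The last bullet, specialized multi-hop retrieval, can be illustrated with a minimal iterative retrieve-then-bridge loop. This is a hedged sketch of the general technique, not the paper's implementation: `retrieve`, `extract_bridge`, and the toy corpus below are hypothetical stand-ins.

```python
# Sketch of iterative multi-hop retrieval: retrieve evidence for the
# current query, extract a bridging entity, and condition the next
# retrieval on it. All names and data here are illustrative.

def multi_hop_retrieve(query, retrieve, extract_bridge, hops=2):
    """Collect one evidence document per hop, chaining via bridge entities."""
    evidence, current = [], query
    for _ in range(hops):
        doc = retrieve(current)          # top-1 document for the current query
        evidence.append(doc)
        bridge = extract_bridge(doc)     # entity linking to the next hop
        if bridge is None:               # no further hop to follow
            break
        current = f"{query} {bridge}"    # expand the query with the bridge
    return evidence

# Toy two-hop example: the answer requires chaining through "Acme".
corpus = {
    "Who acquired X": "Acme acquired X.",
    "Who acquired X Acme": "Acme is led by Jane Doe.",
}
bridges = {"Acme acquired X.": "Acme"}
docs = multi_hop_retrieve("Who acquired X", corpus.get, bridges.get, hops=2)
print(docs)  # both hops' evidence, in order
```

A single-shot retriever given only "Who acquired X" would never surface the second document; the loop makes the dependency between hops explicit.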
Where Pith is reading between the lines
- The dataset could be extended or adapted to other fields like science or medicine for domain-specific multi-hop benchmarks.
- Improving performance on this benchmark may lead to more reliable LLM applications in research and news analysis.
- Future work might explore training retrieval models directly on multi-hop query patterns from the dataset.
Load-bearing premise
The multi-hop queries generated from the English news article dataset reflect the distribution and difficulty of actual user queries to RAG systems.
What would settle it
An experiment showing significantly different RAG performance on a new collection of multi-hop queries created independently by experts from the same news articles would falsify the representativeness of the benchmark.
Original abstract
Retrieval-augmented generation (RAG) augments large language models (LLM) by retrieving relevant knowledge, showing promising potential in mitigating LLM hallucinations and enhancing response quality, thereby facilitating the great adoption of LLMs in practice. However, we find that existing RAG systems are inadequate in answering multi-hop queries, which require retrieving and reasoning over multiple pieces of supporting evidence. Furthermore, to our knowledge, no existing RAG benchmarking dataset focuses on multi-hop queries. In this paper, we develop a novel dataset, MultiHop-RAG, which consists of a knowledge base, a large collection of multi-hop queries, their ground-truth answers, and the associated supporting evidence. We detail the procedure of building the dataset, utilizing an English news article dataset as the underlying RAG knowledge base. We demonstrate the benchmarking utility of MultiHop-RAG in two experiments. The first experiment compares different embedding models for retrieving evidence for multi-hop queries. In the second experiment, we examine the capabilities of various state-of-the-art LLMs, including GPT-4, PaLM, and Llama2-70B, in reasoning and answering multi-hop queries given the evidence. Both experiments reveal that existing RAG methods perform unsatisfactorily in retrieving and answering multi-hop queries. We hope MultiHop-RAG will be a valuable resource for the community in developing effective RAG systems, thereby facilitating greater adoption of LLMs in practice. The MultiHop-RAG and implemented RAG system is publicly available at https://github.com/yixuantt/MultiHop-RAG/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the MultiHop-RAG benchmark dataset for retrieval-augmented generation on multi-hop queries. It constructs a knowledge base from English news articles, generates a collection of multi-hop queries along with ground-truth answers and supporting evidence, and reports two experiments: one comparing embedding models for evidence retrieval and another evaluating LLMs (GPT-4, PaLM, Llama2-70B) on reasoning over the evidence. The central claim is that existing RAG systems perform unsatisfactorily on multi-hop queries.
Significance. If the synthetic queries prove representative of real multi-hop needs, the benchmark fills a clear gap and, together with the public code and data release, could usefully guide improvements in retrieval and multi-step reasoning for RAG pipelines.
major comments (2)
- [Dataset construction] Dataset construction section: multi-hop queries are generated synthetically from the news corpus with no empirical validation against real user query logs, HotpotQA-style distributions, or metrics of hop depth and entity chaining; this assumption is load-bearing for the claim that existing RAG systems are broadly inadequate.
- [Experiments] Experiments section: query generation rules, exact retrieval metrics, and any statistical significance tests for the reported performance gaps are not fully specified, preventing full verification of the inadequacy conclusions.
minor comments (2)
- [Abstract] Abstract: 'great adoption' should read 'greater adoption'.
- [Figures and tables] Figure captions and table headers could more explicitly state the number of queries and evidence pieces per hop level for immediate clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the paper to improve clarity and completeness where possible.
Point-by-point responses
Referee: [Dataset construction] Dataset construction section: multi-hop queries are generated synthetically from the news corpus with no empirical validation against real user query logs, HotpotQA-style distributions, or metrics of hop depth and entity chaining; this assumption is load-bearing for the claim that existing RAG systems are broadly inadequate.
Authors: We chose synthetic query generation to enable precise control over hop count, entity chaining, and supporting evidence, ensuring each query demonstrably requires multi-hop retrieval and reasoning. This controlled construction is standard for creating focused benchmarks (as in HotpotQA). We agree that direct validation against real user logs would strengthen generalizability claims. In revision we will expand the dataset section with explicit metrics for hop depth and entity chaining, add a comparison of query characteristics to HotpotQA, and include a limitations paragraph noting the synthetic nature and the value of future real-world validation studies. revision: partial
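The hop-depth and entity-chaining metrics promised for the revision can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function names and the toy entity sets are assumptions, and a query's evidence is represented simply as an ordered list of entity sets.

```python
# Illustrative metrics for a synthetic multi-hop query, given the
# ordered gold evidence pieces and the entities each one mentions.
# Names and data are hypothetical, not from the MultiHop-RAG release.

def hop_depth(evidence_entities):
    """Number of gold evidence pieces the query requires (its hop count)."""
    return len(evidence_entities)

def is_entity_chained(evidence_entities):
    """True iff consecutive evidence pieces share a bridging entity."""
    return all(
        bool(evidence_entities[i] & evidence_entities[i + 1])
        for i in range(len(evidence_entities) - 1)
    )

# A 3-hop chain linked by the bridge entities "Tim Cook" and "Austin":
chain = [{"Apple", "Tim Cook"}, {"Tim Cook", "Austin"}, {"Austin", "Texas"}]
print(hop_depth(chain))          # 3
print(is_entity_chained(chain))  # True
```

Reporting the distribution of these two quantities over the dataset would directly address the referee's request for hop-depth and entity-chaining statistics.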
Referee: [Experiments] Experiments section: query generation rules, exact retrieval metrics, and any statistical significance tests for the reported performance gaps are not fully specified, preventing full verification of the inadequacy conclusions.
Authors: We appreciate this observation. The revised manuscript will provide a complete specification of the query generation rules (including the exact templates and logical constraints used to produce multi-hop queries from the news corpus). We will also state the precise retrieval metrics (top-k recall, precision, and MRR) and add statistical significance tests (paired t-tests with Bonferroni correction) for all reported performance differences between embedding models and LLMs. revision: yes
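The retrieval metrics named above (top-k recall and MRR) can be made concrete with a short sketch. This is illustrative only; the function names and document IDs are invented, not taken from the MultiHop-RAG codebase.

```python
# Hedged sketch of standard retrieval metrics for multi-hop queries.
# A multi-hop query has SEVERAL gold evidence chunks, so recall@k
# rewards retrievers that surface all hops, not just one.

def recall_at_k(retrieved, gold, k):
    """Fraction of gold evidence chunks found in the top-k retrieved list."""
    top_k = set(retrieved[:k])
    return len(top_k & set(gold)) / len(gold)

def mrr(retrieved, gold):
    """Reciprocal rank of the first gold chunk retrieved; 0 if none is."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d7", "d1", "d9"]   # ranked retriever output (toy IDs)
gold = ["d1", "d9"]                    # both chunks are needed to answer
print(recall_at_k(retrieved, gold, 2))  # 0.0: neither gold chunk in top 2
print(recall_at_k(retrieved, gold, 4))  # 1.0: both found by k=4
print(mrr(retrieved, gold))             # 1/3: first gold chunk at rank 3
```

Averaging these per-query values over the benchmark, then applying the paired tests the authors propose, would make the reported performance gaps verifiable.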
Circularity Check
No circularity: empirical dataset construction and benchmarking
Full rationale
The paper creates a new benchmark dataset MultiHop-RAG by procedurally extracting multi-hop queries and evidence from an English news corpus, then runs standard retrieval and LLM reasoning experiments on it. No equations, fitted parameters, or predictions are defined in terms of themselves. No self-citations serve as load-bearing premises for the central claim; the inadequacy conclusion follows directly from the reported retrieval and answer accuracy numbers on the constructed data. The work is self-contained as an empirical contribution with public code and data release.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Cost.FunctionalEquation washburn_uniqueness_aczel (tagged: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "existing RAG systems are inadequate in answering multi-hop queries, which require retrieving and reasoning over multiple pieces of supporting evidence"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- MeMo: Memory as a Model
  MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...
- Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing
  Skill-RAG detects retrieval failure states from hidden representations and routes to one of four corrective skills to raise accuracy on persistent hard cases in open-domain QA and reasoning benchmarks.
- Why Retrieval-Augmented Generation Fails: A Graph Perspective
  Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.
- ASTRA-QA: A Benchmark for Abstract Question Answering over Documents
  ASTRA-QA is a benchmark for abstract document question answering that uses explicit topic sets, unsupported content annotations, and evidence alignments to enable direct scoring of coverage and hallucination.
- PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design
  PRISM reduces P99 TTFT by 23.3-37.1% and raises exact-prefix KV-cache hit rates by 5.9-12.2 points versus the strongest baseline on 4B and 13B models by jointly optimizing scheduling and memory.
- FT-RAG: A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning
  FT-RAG introduces a fine-grained graph-based retrieval framework for tables plus a new 9870-pair benchmark, reporting 23.5% and 59.2% gains in table- and cell-level hit rates and 62.2% higher exact-value recall over b...
- Make Any Collection Navigable: Methods for Constructing and Evaluating Hypergraph of Text
  Methods for constructing Hypergraphs of Text are proposed with a new effort ratio metric where TF-IDF baselines match LLM methods in experiments.
- S2G-RAG: Structured Sufficiency and Gap Judging for Iterative Retrieval-Augmented QA
  S2G-RAG improves multi-hop question answering in RAG by using structured sufficiency and gap judging to control iterative retrieval and maintain compact evidence.
- Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents
  ProactAgent learns a proactive retrieval policy via reinforcement learning on paired task continuations, improving lifelong agent performance and cutting retrieval overhead on SciWorld, AlfWorld, and StuLife.
- Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies
  CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.
- Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents
  DoRA is a new synthetic benchmark for RAG-based QA on defense documents where fine-tuning Llama3.1-8B-Instruct on it improves task success by up to 26% and cuts hallucination rates by 47%.
- EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation
  EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.
- Overcoming the "Impracticality" of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagnostic Framework
  Introduces a four-axis difficulty taxonomy integrated into an enterprise RAG benchmark to systematically diagnose multi-dimensional challenges like reasoning complexity and retrieval difficulty.
- From Local to Global: A Graph RAG Approach to Query-Focused Summarization
  GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.
- CodaRAG: Connecting the Dots with Associativity Inspired by Complementary Learning
  CodaRAG improves RAG by using a CLS-inspired three-stage pipeline of knowledge consolidation, multi-dimensional associative navigation, and interference elimination, delivering 7-11% gains on GraphRAG-Bench for factua...
- Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning
  A stateful iterative RAG system converts retrieved documents into scored reasoning units, maintains supportive and non-supportive evidence, and performs deficiency-driven query refinement to achieve more robust QA per...
- Retrieval-Augmented Generation for AI-Generated Content: A Survey
  A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.
- RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation
  A heterogeneous ensemble of seven LLMs plus a judge model won first place in SemEval-2026 Task 8 on faithful multi-turn response generation by selecting optimal candidates from diverse outputs.
- A Reproducibility Study of Metacognitive Retrieval-Augmented Generation
  MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.