pith. machine review for the scientific record.

arxiv: 2401.15391 · v1 · submitted 2024-01-27 · 💻 cs.CL

Recognition: 1 theorem link · Lean Theorem

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords Retrieval-Augmented Generation · Multi-hop Queries · Benchmark Dataset · Large Language Models · Question Answering · Information Retrieval

The pith

Existing RAG systems are inadequate for answering multi-hop queries that require retrieving and reasoning over multiple pieces of evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MultiHop-RAG, a benchmark dataset built from English news articles to test retrieval-augmented generation on queries that need multiple supporting facts. It evaluates embedding models for retrieving evidence and LLMs such as GPT-4, PaLM, and Llama2-70B for answering when given the evidence. The results show unsatisfactory performance in both retrieval and reasoning for these multi-hop tasks. This matters because many practical questions require combining information across documents, which current RAG approaches handle poorly.
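
To make the two experiments concrete, here is a minimal sketch of the retrieval half, with a placeholder embed() standing in for whichever embedding model is under test; the chunking, cutoff k, and success criterion here are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def embed(texts):
    # placeholder: swap in the embedding model under test (returns one vector per text)
    raise NotImplementedError

def top_k(query_vec, chunk_vecs, k=10):
    # rank knowledge-base chunks by cosine similarity to the query
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return set(np.argsort(-sims)[:k].tolist())

def multi_hop_hit(retrieved_ids, gold_evidence_ids):
    # the multi-hop twist: a query counts as a retrieval success only if
    # *every* gold evidence chunk is in the top k, not merely one of them
    return set(gold_evidence_ids) <= retrieved_ids
```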

Core claim

The authors develop MultiHop-RAG consisting of a news article knowledge base, multi-hop queries, ground-truth answers, and supporting evidence. Benchmarking experiments demonstrate that existing RAG methods, including various embedding models and state-of-the-art LLMs, perform poorly in retrieving relevant evidence and in reasoning to produce correct answers for multi-hop queries.

What carries the argument

The MultiHop-RAG dataset, which provides a knowledge base, multi-hop queries, answers, and evidence to benchmark RAG performance on complex queries.
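
For orientation, one plausible shape for a single benchmark record, inferred from the components the paper names; the field names below are illustrative guesses, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class MultiHopExample:
    query: str               # question whose answer needs several facts
    answer: str              # ground-truth answer
    evidence: list[str]      # supporting passages, one per hop
    source_ids: list[str]    # news articles each passage was drawn from
```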

If this is right

  • Current embedding models need enhancement to retrieve multiple pieces of evidence effectively for a single query.
  • LLMs require better integration of information from several sources to answer multi-hop questions accurately.
  • RAG systems evaluated only on simple queries may overestimate their real-world utility.
  • Specialized methods for multi-hop retrieval and reasoning should be developed to improve overall RAG effectiveness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could be extended or adapted to other fields like science or medicine for domain-specific multi-hop benchmarks.
  • Improving performance on this benchmark may lead to more reliable LLM applications in research and news analysis.
  • Future work might explore training retrieval models directly on multi-hop query patterns from the dataset.

Load-bearing premise

The multi-hop queries generated from the English news article dataset reflect the distribution and difficulty of actual user queries to RAG systems.

What would settle it

An experiment in which RAG performance differed significantly on a new collection of multi-hop queries, written independently by experts from the same news articles, would falsify the benchmark's representativeness.
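
A sketch of how that comparison could be scored, assuming per-query correctness indicators are available for both query sets; a bootstrap interval on the accuracy gap that sits well away from zero would count against representativeness.

```python
import numpy as np

def accuracy_gap_ci(orig_correct, expert_correct, n_boot=10_000, seed=0):
    # 95% bootstrap CI on accuracy(original queries) - accuracy(expert queries)
    rng = np.random.default_rng(seed)
    orig = np.asarray(orig_correct, dtype=float)
    expert = np.asarray(expert_correct, dtype=float)
    gaps = [
        rng.choice(orig, orig.size).mean() - rng.choice(expert, expert.size).mean()
        for _ in range(n_boot)
    ]
    return np.percentile(gaps, [2.5, 97.5])
```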

Original abstract

Retrieval-augmented generation (RAG) augments large language models (LLM) by retrieving relevant knowledge, showing promising potential in mitigating LLM hallucinations and enhancing response quality, thereby facilitating the great adoption of LLMs in practice. However, we find that existing RAG systems are inadequate in answering multi-hop queries, which require retrieving and reasoning over multiple pieces of supporting evidence. Furthermore, to our knowledge, no existing RAG benchmarking dataset focuses on multi-hop queries. In this paper, we develop a novel dataset, MultiHop-RAG, which consists of a knowledge base, a large collection of multi-hop queries, their ground-truth answers, and the associated supporting evidence. We detail the procedure of building the dataset, utilizing an English news article dataset as the underlying RAG knowledge base. We demonstrate the benchmarking utility of MultiHop-RAG in two experiments. The first experiment compares different embedding models for retrieving evidence for multi-hop queries. In the second experiment, we examine the capabilities of various state-of-the-art LLMs, including GPT-4, PaLM, and Llama2-70B, in reasoning and answering multi-hop queries given the evidence. Both experiments reveal that existing RAG methods perform unsatisfactorily in retrieving and answering multi-hop queries. We hope MultiHop-RAG will be a valuable resource for the community in developing effective RAG systems, thereby facilitating greater adoption of LLMs in practice. The MultiHop-RAG and implemented RAG system is publicly available at https://github.com/yixuantt/MultiHop-RAG/.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the MultiHop-RAG benchmark dataset for retrieval-augmented generation on multi-hop queries. It constructs a knowledge base from English news articles, generates a collection of multi-hop queries along with ground-truth answers and supporting evidence, and reports two experiments: one comparing embedding models for evidence retrieval and another evaluating LLMs (GPT-4, PaLM, Llama2-70B) on reasoning over the evidence. The central claim is that existing RAG systems perform unsatisfactorily on multi-hop queries.

Significance. If the synthetic queries prove representative of real multi-hop needs, the benchmark fills a clear gap and, together with the public code and data release, could usefully guide improvements in retrieval and multi-step reasoning for RAG pipelines.

major comments (2)
  1. [Dataset construction] Dataset construction section: multi-hop queries are generated synthetically from the news corpus with no empirical validation against real user query logs, HotpotQA-style distributions, or metrics of hop depth and entity chaining; this assumption is load-bearing for the claim that existing RAG systems are broadly inadequate.
  2. [Experiments] Experiments section: query generation rules, exact retrieval metrics, and any statistical significance tests for the reported performance gaps are not fully specified, preventing full verification of the inadequacy conclusions.
minor comments (2)
  1. [Abstract] Abstract: 'great adoption' should read 'greater adoption'.
  2. [Figures and tables] Figure captions and table headers could more explicitly state the number of queries and evidence pieces per hop level for immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the paper to improve clarity and completeness where possible.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: multi-hop queries are generated synthetically from the news corpus with no empirical validation against real user query logs, HotpotQA-style distributions, or metrics of hop depth and entity chaining; this assumption is load-bearing for the claim that existing RAG systems are broadly inadequate.

    Authors: We chose synthetic query generation to enable precise control over hop count, entity chaining, and supporting evidence, ensuring each query demonstrably requires multi-hop retrieval and reasoning. This controlled construction is standard for creating focused benchmarks (as in HotpotQA). We agree that direct validation against real user logs would strengthen generalizability claims. In revision we will expand the dataset section with explicit metrics for hop depth and entity chaining, add a comparison of query characteristics to HotpotQA, and include a limitations paragraph noting the synthetic nature and the value of future real-world validation studies; a sketch of candidate definitions for these metrics follows this exchange. revision: partial

  2. Referee: [Experiments] Experiments section: query generation rules, exact retrieval metrics, and any statistical significance tests for the reported performance gaps are not fully specified, preventing full verification of the inadequacy conclusions.

    Authors: We appreciate this observation. The revised manuscript will provide a complete specification of the query generation rules (including the exact templates and logical constraints used to produce multi-hop queries from the news corpus). We will also state the precise retrieval metrics (top-k recall, precision, and MRR) and add statistical significance tests (paired t-tests with Bonferroni correction) for all reported performance differences between embedding models and LLMs; a sketch of these metrics and the corrected test also follows this exchange. revision: yes
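
On point 1, the promised metrics admit simple first-pass definitions. A sketch, assuming each example exposes its evidence passages and a per-passage entity list; the revision's actual definitions may differ.

```python
from collections import Counter

def hop_depth(evidence):
    # crudest proxy for hop depth: number of distinct supporting passages
    return len(evidence)

def entity_chaining(entities_per_passage):
    # fraction of consecutive evidence pairs sharing at least one entity,
    # i.e. how often hops are bridged by a common entity
    pairs = list(zip(entities_per_passage, entities_per_passage[1:]))
    if not pairs:
        return 0.0
    return sum(bool(set(a) & set(b)) for a, b in pairs) / len(pairs)

def hop_histogram(dataset):
    # distribution of hop depths across the benchmark
    return Counter(hop_depth(ex["evidence"]) for ex in dataset)
```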
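On point 2, the named metrics and test are standard; a compact sketch, assuming scipy for the paired t-test, with the Bonferroni factor set to the number of model pairs compared.

```python
from scipy import stats

def recall_at_k(ranked_ids, gold_ids, k):
    # fraction of gold evidence chunks appearing in the top-k ranking
    return len(set(ranked_ids[:k]) & set(gold_ids)) / len(gold_ids)

def mrr(ranked_ids, gold_ids):
    # reciprocal rank of the first gold chunk; 0 if none is retrieved
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

def paired_test(scores_a, scores_b, n_comparisons):
    # paired t-test on per-query scores of two systems,
    # Bonferroni-corrected across the pairwise comparisons
    t, p = stats.ttest_rel(scores_a, scores_b)
    return t, min(p * n_comparisons, 1.0)
```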

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and benchmarking

Full rationale

The paper creates a new benchmark dataset, MultiHop-RAG, by procedurally extracting multi-hop queries and evidence from an English news corpus, then runs standard retrieval and LLM reasoning experiments on it. No equations, fitted parameters, or predictions are defined in terms of themselves. No self-citations serve as load-bearing premises for the central claim; the inadequacy conclusion follows directly from the reported retrieval and answer accuracy numbers on the constructed data. The work is self-contained as an empirical contribution with public code and data release.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard RAG evaluation practices and the construction of queries from an existing news corpus, with no free parameters, new axioms, or invented entities required.

pith-pipeline@v0.9.0 · 5581 in / 1113 out tokens · 48158 ms · 2026-05-15T13:49:12.581967+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "existing RAG systems are inadequate in answering multi-hop queries, which require retrieving and reasoning over multiple pieces of supporting evidence"

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MeMo: Memory as a Model

    cs.CL 2026-05 unverdicted novelty 7.0

    MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...

  2. Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing

    cs.CL 2026-04 unverdicted novelty 7.0

    Skill-RAG detects retrieval failure states from hidden representations and routes to one of four corrective skills to raise accuracy on persistent hard cases in open-domain QA and reasoning benchmarks.

  3. Why Retrieval-Augmented Generation Fails: A Graph Perspective

    cs.CL 2026-05 unverdicted novelty 6.0

    Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.

  4. ASTRA-QA: A Benchmark for Abstract Question Answering over Documents

    cs.CL 2026-05 unverdicted novelty 6.0

    ASTRA-QA is a benchmark for abstract document question answering that uses explicit topic sets, unsupported content annotations, and evidence alignments to enable direct scoring of coverage and hallucination.

  5. PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design

    cs.LG 2026-05 unverdicted novelty 6.0

    PRISM reduces P99 TTFT by 23.3-37.1% and raises exact-prefix KV-cache hit rates by 5.9-12.2 points versus the strongest baseline on 4B and 13B models by jointly optimizing scheduling and memory.

  6. FT-RAG: A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    FT-RAG introduces a fine-grained graph-based retrieval framework for tables plus a new 9870-pair benchmark, reporting 23.5% and 59.2% gains in table- and cell-level hit rates and 62.2% higher exact-value recall over b...

  7. Make Any Collection Navigable: Methods for Constructing and Evaluating Hypergraph of Text

    cs.IR 2026-04 unverdicted novelty 6.0

    Methods for constructing Hypergraphs of Text are proposed with a new effort ratio metric where TF-IDF baselines match LLM methods in experiments.

  8. S2G-RAG: Structured Sufficiency and Gap Judging for Iterative Retrieval-Augmented QA

    cs.IR 2026-04 unverdicted novelty 6.0

    S2G-RAG improves multi-hop question answering in RAG by using structured sufficiency and gap judging to control iterative retrieval and maintain compact evidence.

  9. Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

    cs.CL 2026-04 unverdicted novelty 6.0

    ProactAgent learns a proactive retrieval policy via reinforcement learning on paired task continuations, improving lifelong agent performance and cutting retrieval overhead on SciWorld, AlfWorld, and StuLife.

  10. Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

    cs.IR 2026-04 unverdicted novelty 6.0

    CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.

  11. Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents

    cs.CL 2026-04 unverdicted novelty 6.0

    DoRA is a new synthetic benchmark for RAG-based QA on defense documents where fine-tuning Llama3.1-8B-Instruct on it improves task success by up to 26% and cuts hallucination rates by 47%.

  12. EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation

    cs.DB 2026-04 unverdicted novelty 6.0

    EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.

  13. Overcoming the "Impracticality" of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagnostic Framework

    cs.CL 2026-04 unverdicted novelty 6.0

    Introduces a four-axis difficulty taxonomy integrated into an enterprise RAG benchmark to systematically diagnose multi-dimensional challenges like reasoning complexity and retrieval difficulty.

  14. From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    cs.CL 2024-04 unverdicted novelty 6.0

    GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.

  15. CodaRAG: Connecting the Dots with Associativity Inspired by Complementary Learning

    cs.CL 2026-04 unverdicted novelty 5.0

    CodaRAG improves RAG by using a CLS-inspired three-stage pipeline of knowledge consolidation, multi-dimensional associative navigation, and interference elimination, delivering 7-11% gains on GraphRAG-Bench for factua...

  16. Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning

    cs.CL 2026-03 unverdicted novelty 5.0

    A stateful iterative RAG system converts retrieved documents into scored reasoning units, maintains supportive and non-supportive evidence, and performs deficiency-driven query refinement to achieve more robust QA per...

  17. Retrieval-Augmented Generation for AI-Generated Content: A Survey

    cs.CV 2024-02 accept novelty 5.0

    A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.

  18. RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation

    cs.CL 2026-05 accept novelty 3.0

    A heterogeneous ensemble of seven LLMs plus a judge model won first place in SemEval-2026 Task 8 on faithful multi-turn response generation by selecting optimal candidates from diverse outputs.

  19. A Reproducibility Study of Metacognitive Retrieval-Augmented Generation

    cs.IR 2026-04 unverdicted novelty 3.0

    MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.

Reference graph

Works this paper leans on

296 extracted references · 296 canonical work pages · cited by 19 Pith papers · 5 internal anchors
