pith. machine review for the scientific record.

arxiv: 2401.15391 · v1 · submitted 2024-01-27 · 💻 cs.CL

Recognition: 1 theorem link · Lean Theorem

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords Retrieval-Augmented Generation · Multi-hop Queries · Benchmark Dataset · Large Language Models · Question Answering · Information Retrieval

The pith

Existing RAG systems are inadequate for answering multi-hop queries that require retrieving and reasoning over multiple pieces of evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MultiHop-RAG, a benchmark dataset built from English news articles to test retrieval-augmented generation on queries that need multiple supporting facts. It evaluates embedding models for retrieving evidence and LLMs such as GPT-4, PaLM, and Llama2-70B for answering when given the evidence. The results show unsatisfactory performance in both retrieval and reasoning for these multi-hop tasks. This matters because many practical questions require combining information across documents, which current RAG approaches handle poorly.
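
To make the two experiments concrete, here is a minimal sketch of the retrieval half, with a placeholder embed() standing in for whichever embedding model is under test; the chunking, cutoff k, and success criterion here are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def embed(texts):
    # placeholder: swap in the embedding model under test (returns one vector per text)
    raise NotImplementedError

def top_k(query_vec, chunk_vecs, k=10):
    # rank knowledge-base chunks by cosine similarity to the query
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return set(np.argsort(-sims)[:k].tolist())

def multi_hop_hit(retrieved_ids, gold_evidence_ids):
    # the multi-hop twist: a query counts as a retrieval success only if
    # *every* gold evidence chunk is in the top k, not merely one of them
    return set(gold_evidence_ids) <= retrieved_ids
```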

Core claim

The authors develop MultiHop-RAG consisting of a news article knowledge base, multi-hop queries, ground-truth answers, and supporting evidence. Benchmarking experiments demonstrate that existing RAG methods, including various embedding models and state-of-the-art LLMs, perform poorly in retrieving relevant evidence and in reasoning to produce correct answers for multi-hop queries.

What carries the argument

The MultiHop-RAG dataset, which provides a knowledge base, multi-hop queries, answers, and evidence to benchmark RAG performance on complex queries.
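
For orientation, one plausible shape for a single benchmark record, inferred from the components the paper names; the field names below are illustrative guesses, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class MultiHopExample:
    query: str               # question whose answer needs several facts
    answer: str              # ground-truth answer
    evidence: list[str]      # supporting passages, one per hop
    source_ids: list[str]    # news articles each passage was drawn from
```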

If this is right

  • Current embedding models need enhancement to retrieve multiple pieces of evidence effectively for a single query.
  • LLMs require better integration of information from several sources to answer multi-hop questions accurately.
  • RAG systems evaluated only on simple queries may overestimate their real-world utility.
  • Specialized methods for multi-hop retrieval and reasoning should be developed to improve overall RAG effectiveness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could be extended or adapted to other fields like science or medicine for domain-specific multi-hop benchmarks.
  • Improving performance on this benchmark may lead to more reliable LLM applications in research and news analysis.
  • Future work might explore training retrieval models directly on multi-hop query patterns from the dataset.

Load-bearing premise

The multi-hop queries generated from the English news article dataset reflect the distribution and difficulty of actual user queries to RAG systems.

What would settle it

An experiment in which RAG performance differed significantly on a new collection of multi-hop queries, written independently by experts from the same news articles, would falsify the benchmark's representativeness.
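
A sketch of how that comparison could be scored, assuming per-query correctness indicators are available for both query sets; a bootstrap interval on the accuracy gap that sits well away from zero would count against representativeness.

```python
import numpy as np

def accuracy_gap_ci(orig_correct, expert_correct, n_boot=10_000, seed=0):
    # 95% bootstrap CI on accuracy(original queries) - accuracy(expert queries)
    rng = np.random.default_rng(seed)
    orig = np.asarray(orig_correct, dtype=float)
    expert = np.asarray(expert_correct, dtype=float)
    gaps = [
        rng.choice(orig, orig.size).mean() - rng.choice(expert, expert.size).mean()
        for _ in range(n_boot)
    ]
    return np.percentile(gaps, [2.5, 97.5])
```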

Original abstract

Retrieval-augmented generation (RAG) augments large language models (LLM) by retrieving relevant knowledge, showing promising potential in mitigating LLM hallucinations and enhancing response quality, thereby facilitating the great adoption of LLMs in practice. However, we find that existing RAG systems are inadequate in answering multi-hop queries, which require retrieving and reasoning over multiple pieces of supporting evidence. Furthermore, to our knowledge, no existing RAG benchmarking dataset focuses on multi-hop queries. In this paper, we develop a novel dataset, MultiHop-RAG, which consists of a knowledge base, a large collection of multi-hop queries, their ground-truth answers, and the associated supporting evidence. We detail the procedure of building the dataset, utilizing an English news article dataset as the underlying RAG knowledge base. We demonstrate the benchmarking utility of MultiHop-RAG in two experiments. The first experiment compares different embedding models for retrieving evidence for multi-hop queries. In the second experiment, we examine the capabilities of various state-of-the-art LLMs, including GPT-4, PaLM, and Llama2-70B, in reasoning and answering multi-hop queries given the evidence. Both experiments reveal that existing RAG methods perform unsatisfactorily in retrieving and answering multi-hop queries. We hope MultiHop-RAG will be a valuable resource for the community in developing effective RAG systems, thereby facilitating greater adoption of LLMs in practice. The MultiHop-RAG and implemented RAG system is publicly available at https://github.com/yixuantt/MultiHop-RAG/.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the MultiHop-RAG benchmark dataset for retrieval-augmented generation on multi-hop queries. It constructs a knowledge base from English news articles, generates a collection of multi-hop queries along with ground-truth answers and supporting evidence, and reports two experiments: one comparing embedding models for evidence retrieval and another evaluating LLMs (GPT-4, PaLM, Llama2-70B) on reasoning over the evidence. The central claim is that existing RAG systems perform unsatisfactorily on multi-hop queries.

Significance. If the synthetic queries prove representative of real multi-hop needs, the benchmark fills a clear gap and, together with the public code and data release, could usefully guide improvements in retrieval and multi-step reasoning for RAG pipelines.

major comments (2)
  1. [Dataset construction] Dataset construction section: multi-hop queries are generated synthetically from the news corpus with no empirical validation against real user query logs, HotpotQA-style distributions, or metrics of hop depth and entity chaining; this assumption is load-bearing for the claim that existing RAG systems are broadly inadequate.
  2. [Experiments] Experiments section: query generation rules, exact retrieval metrics, and any statistical significance tests for the reported performance gaps are not fully specified, preventing full verification of the inadequacy conclusions.
minor comments (2)
  1. [Abstract] Abstract: 'great adoption' should read 'greater adoption'.
  2. [Figures and tables] Figure captions and table headers could more explicitly state the number of queries and evidence pieces per hop level for immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the paper to improve clarity and completeness where possible.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: multi-hop queries are generated synthetically from the news corpus with no empirical validation against real user query logs, HotpotQA-style distributions, or metrics of hop depth and entity chaining; this assumption is load-bearing for the claim that existing RAG systems are broadly inadequate.

    Authors: We chose synthetic query generation to enable precise control over hop count, entity chaining, and supporting evidence, ensuring each query demonstrably requires multi-hop retrieval and reasoning. This controlled construction is standard for creating focused benchmarks (as in HotpotQA). We agree that direct validation against real user logs would strengthen generalizability claims. In revision we will expand the dataset section with explicit metrics for hop depth and entity chaining, add a comparison of query characteristics to HotpotQA, and include a limitations paragraph noting the synthetic nature and the value of future real-world validation studies; a sketch of candidate definitions for these metrics follows this exchange. revision: partial

  2. Referee: [Experiments] Experiments section: query generation rules, exact retrieval metrics, and any statistical significance tests for the reported performance gaps are not fully specified, preventing full verification of the inadequacy conclusions.

    Authors: We appreciate this observation. The revised manuscript will provide a complete specification of the query generation rules (including the exact templates and logical constraints used to produce multi-hop queries from the news corpus). We will also state the precise retrieval metrics (top-k recall, precision, and MRR) and add statistical significance tests (paired t-tests with Bonferroni correction) for all reported performance differences between embedding models and LLMs; a sketch of these metrics and the corrected test also follows this exchange. revision: yes
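
On point 1, the promised metrics admit simple first-pass definitions. A sketch, assuming each example exposes its evidence passages and a per-passage entity list; the revision's actual definitions may differ.

```python
from collections import Counter

def hop_depth(evidence):
    # crudest proxy for hop depth: number of distinct supporting passages
    return len(evidence)

def entity_chaining(entities_per_passage):
    # fraction of consecutive evidence pairs sharing at least one entity,
    # i.e. how often hops are bridged by a common entity
    pairs = list(zip(entities_per_passage, entities_per_passage[1:]))
    if not pairs:
        return 0.0
    return sum(bool(set(a) & set(b)) for a, b in pairs) / len(pairs)

def hop_histogram(dataset):
    # distribution of hop depths across the benchmark
    return Counter(hop_depth(ex["evidence"]) for ex in dataset)
```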
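On point 2, the named metrics and test are standard; a compact sketch, assuming scipy for the paired t-test, with the Bonferroni factor set to the number of model pairs compared.

```python
from scipy import stats

def recall_at_k(ranked_ids, gold_ids, k):
    # fraction of gold evidence chunks appearing in the top-k ranking
    return len(set(ranked_ids[:k]) & set(gold_ids)) / len(gold_ids)

def mrr(ranked_ids, gold_ids):
    # reciprocal rank of the first gold chunk; 0 if none is retrieved
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

def paired_test(scores_a, scores_b, n_comparisons):
    # paired t-test on per-query scores of two systems,
    # Bonferroni-corrected across the pairwise comparisons
    t, p = stats.ttest_rel(scores_a, scores_b)
    return t, min(p * n_comparisons, 1.0)
```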

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and benchmarking

Full rationale

The paper creates a new benchmark dataset, MultiHop-RAG, by procedurally extracting multi-hop queries and evidence from an English news corpus, then runs standard retrieval and LLM reasoning experiments on it. No equations, fitted parameters, or predictions are defined in terms of themselves. No self-citations serve as load-bearing premises for the central claim; the inadequacy conclusion follows directly from the reported retrieval and answer accuracy numbers on the constructed data. The work is self-contained as an empirical contribution with public code and data release.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard RAG evaluation practices and the construction of queries from an existing news corpus, with no free parameters, new axioms, or invented entities required.

pith-pipeline@v0.9.0 · 5581 in / 1113 out tokens · 48158 ms · 2026-05-15T13:49:12.581967+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "existing RAG systems are inadequate in answering multi-hop queries, which require retrieving and reasoning over multiple pieces of supporting evidence"

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MeMo: Memory as a Model

    cs.CL 2026-05 unverdicted novelty 7.0

    MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...

  2. Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing

    cs.CL 2026-04 unverdicted novelty 7.0

    Skill-RAG detects retrieval failure states from hidden representations and routes to one of four corrective skills to raise accuracy on persistent hard cases in open-domain QA and reasoning benchmarks.

  3. Why Retrieval-Augmented Generation Fails: A Graph Perspective

    cs.CL 2026-05 unverdicted novelty 6.0

    Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.

  4. ASTRA-QA: A Benchmark for Abstract Question Answering over Documents

    cs.CL 2026-05 unverdicted novelty 6.0

    ASTRA-QA is a benchmark for abstract document question answering that uses explicit topic sets, unsupported content annotations, and evidence alignments to enable direct scoring of coverage and hallucination.

  5. PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design

    cs.LG 2026-05 unverdicted novelty 6.0

    PRISM reduces P99 TTFT by 23.3-37.1% and raises exact-prefix KV-cache hit rates by 5.9-12.2 points versus the strongest baseline on 4B and 13B models by jointly optimizing scheduling and memory.

  6. FT-RAG: A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    FT-RAG introduces a fine-grained graph-based retrieval framework for tables plus a new 9870-pair benchmark, reporting 23.5% and 59.2% gains in table- and cell-level hit rates and 62.2% higher exact-value recall over b...

  7. Make Any Collection Navigable: Methods for Constructing and Evaluating Hypergraph of Text

    cs.IR 2026-04 unverdicted novelty 6.0

    Methods for constructing Hypergraphs of Text are proposed with a new effort ratio metric where TF-IDF baselines match LLM methods in experiments.

  8. S2G-RAG: Structured Sufficiency and Gap Judging for Iterative Retrieval-Augmented QA

    cs.IR 2026-04 unverdicted novelty 6.0

    S2G-RAG improves multi-hop question answering in RAG by using structured sufficiency and gap judging to control iterative retrieval and maintain compact evidence.

  9. Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

    cs.CL 2026-04 unverdicted novelty 6.0

    ProactAgent learns a proactive retrieval policy via reinforcement learning on paired task continuations, improving lifelong agent performance and cutting retrieval overhead on SciWorld, AlfWorld, and StuLife.

  10. Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

    cs.IR 2026-04 unverdicted novelty 6.0

    CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.

  11. Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents

    cs.CL 2026-04 unverdicted novelty 6.0

    DoRA is a new synthetic benchmark for RAG-based QA on defense documents where fine-tuning Llama3.1-8B-Instruct on it improves task success by up to 26% and cuts hallucination rates by 47%.

  12. EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation

    cs.DB 2026-04 unverdicted novelty 6.0

    EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.

  13. Overcoming the "Impracticality" of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagnostic Framework

    cs.CL 2026-04 unverdicted novelty 6.0

    Introduces a four-axis difficulty taxonomy integrated into an enterprise RAG benchmark to systematically diagnose multi-dimensional challenges like reasoning complexity and retrieval difficulty.

  14. From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    cs.CL 2024-04 unverdicted novelty 6.0

    GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.

  15. CodaRAG: Connecting the Dots with Associativity Inspired by Complementary Learning

    cs.CL 2026-04 unverdicted novelty 5.0

    CodaRAG improves RAG by using a CLS-inspired three-stage pipeline of knowledge consolidation, multi-dimensional associative navigation, and interference elimination, delivering 7-11% gains on GraphRAG-Bench for factua...

  16. Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning

    cs.CL 2026-03 unverdicted novelty 5.0

    A stateful iterative RAG system converts retrieved documents into scored reasoning units, maintains supportive and non-supportive evidence, and performs deficiency-driven query refinement to achieve more robust QA per...

  17. Retrieval-Augmented Generation for AI-Generated Content: A Survey

    cs.CV 2024-02 accept novelty 5.0

    A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.

  18. RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation

    cs.CL 2026-05 accept novelty 3.0

    A heterogeneous ensemble of seven LLMs plus a judge model won first place in SemEval-2026 Task 8 on faithful multi-turn response generation by selecting optimal candidates from diverse outputs.

  19. A Reproducibility Study of Metacognitive Retrieval-Augmented Generation

    cs.IR 2026-04 unverdicted novelty 3.0

    MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.

Reference graph

Works this paper leans on

296 extracted references · 296 canonical work pages · cited by 19 Pith papers · 5 internal anchors
