pith. machine review for the scientific record.

arxiv: 2605.12370 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.IR

Recognition: 2 theorem links · Lean Theorem

Context Convergence Improves Answering Inferential Questions

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:57 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords inferential QA · LLM context selection · convergence metric · passage construction · TriviaHG dataset · open-domain question answering · reasoning in LLMs

The pith

Passages from high-convergence sentences substantially boost LLM accuracy on inferential questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether the structure of context passages affects how well large language models handle inferential questions, where answers must be derived from combined clues rather than directly retrieved. It defines convergence as the degree to which a sentence eliminates incorrect answer options and uses this to build passages from subsets of the TriviaHG dataset. Across six LLMs, passages assembled from higher-convergence sentences produce markedly better accuracy than passages chosen by cosine similarity. Ordering sentences from highest to lowest convergence adds a modest further improvement. The results position convergence as a concrete signal for constructing contexts that support reasoning.
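To make the comparison concrete, here is a minimal sketch (not the authors' code) of the two passage-construction strategies the summary contrasts: ranking candidate sentences by a precomputed convergence score versus by cosine similarity to the question. The sentence-transformers model name and the top-k cutoff are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of the two passage-construction strategies compared in the paper.
# Assumptions: per-sentence convergence scores are already computed (the paper's metric),
# and the embedding model name is a placeholder rather than the authors' choice.
from sentence_transformers import SentenceTransformer, util


def passage_by_convergence(sentences, convergence_scores, k=5):
    """Keep the k highest-convergence sentences, ordered from highest to lowest."""
    ranked = sorted(zip(sentences, convergence_scores), key=lambda p: p[1], reverse=True)
    return " ".join(sent for sent, _ in ranked[:k])


def passage_by_cosine(question, sentences, k=5, model_name="all-MiniLM-L6-v2"):
    """Baseline: keep the k sentences most cosine-similar to the question."""
    model = SentenceTransformer(model_name)
    q_emb = model.encode(question, convert_to_tensor=True)
    s_emb = model.encode(sentences, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, s_emb)[0].tolist()
    ranked = sorted(zip(sentences, sims), key=lambda p: p[1], reverse=True)
    return " ".join(sent for sent, _ in ranked[:k])
```

Either passage would then sit ahead of the question in the few-shot prompt (Figure 2) before querying the target LLM.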

Core claim

Passages built from sentences with higher convergence lead to substantially better answer accuracy than those selected by cosine similarity, indicating that convergence captures meaningful relevance for inferential reasoning. Ordering sentences by descending convergence slightly improves performance, suggesting that LLMs tend to prioritize earlier, information-rich cues.

What carries the argument

Convergence, the measure of how effectively a sentence eliminates incorrect answers, is used to select and order sentences when forming passages for inferential QA.
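One plausible reading of that metric, sketched under stated assumptions: convergence is the fraction of incorrect candidate answers that a sentence rules out, with the elimination judgment left as a pluggable check. The paper may define both the candidate set and the elimination test differently.

```python
# One plausible reading of convergence: the fraction of wrong candidate answers that a
# sentence (hint) rules out. The elimination check is passed in as a callable because this
# page does not specify it; an LLM yes/no judgment is one option, not the paper's definition.
from typing import Callable


def convergence(question: str, sentence: str, wrong_candidates: list[str],
                is_eliminated: Callable[[str, str, str], bool]) -> float:
    """Return the share of incorrect candidates eliminated once the sentence is given."""
    if not wrong_candidates:
        return 0.0
    eliminated = sum(1 for cand in wrong_candidates
                     if is_eliminated(question, sentence, cand))
    return eliminated / len(wrong_candidates)
```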

Load-bearing premise

The accuracy gains arise primarily because of the convergence property rather than other unexamined characteristics of the selected sentences or the particular dataset.

What would settle it

A replication on a different inferential QA dataset in which high-convergence passages show no accuracy advantage over cosine-similarity passages, or an ablation that matches passages on length and lexical features while varying only convergence and finds no difference.
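A hedged sketch of the second settling test, the matched ablation: pair high- and low-convergence sentences that have similar length and similar lexical overlap with the question, so that accuracy differences can only come from convergence. The tolerance values and the overlap measure below are illustrative choices, not taken from the paper.

```python
# Hedged sketch of the proposed matched ablation: pair high- and low-convergence
# sentences with similar length and similar lexical overlap with the question, so the
# two conditions differ only in convergence. Tolerances and the overlap measure are
# illustrative choices, not values from the paper.
def lexical_overlap(question: str, sentence: str) -> float:
    q_tokens, s_tokens = set(question.lower().split()), set(sentence.lower().split())
    return len(q_tokens & s_tokens) / max(len(q_tokens), 1)


def matched_pairs(question, high_conv_sents, low_conv_sents, len_tol=10, overlap_tol=0.1):
    """Greedily pair sentences across conditions when length and overlap are close."""
    pairs, used = [], set()
    for high in high_conv_sents:
        for idx, low in enumerate(low_conv_sents):
            if idx in used:
                continue
            close_length = abs(len(high.split()) - len(low.split())) <= len_tol
            close_overlap = abs(lexical_overlap(question, high)
                                - lexical_overlap(question, low)) <= overlap_tol
            if close_length and close_overlap:
                pairs.append((high, low))
                used.add(idx)
                break
    return pairs
```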

Figures

Figures reproduced from arXiv: 2605.12370 by Adam Jatowt, Bhawna Piryani, Jamshid Mozafari.

Figure 1: An inferential question accompanied by three dis…
Figure 2: Few-shot prompt provided to the language model
Original abstract

While Large Language Models (LLMs) are widely used in open-domain Question Answering (QA), their ability to handle inferential questions-where answers must be derived rather than directly retrieved-remains still underexplored. This study investigates how the structure and quality of passages influence LLM performance on such questions. We focus on convergence, a measure of how effectively sentences (hints) eliminate incorrect answers, as a criterion for constructing passages. Using subsets of the TriviaHG dataset, we form passages by combining sentences with varying convergence levels and evaluate six LLMs of different sizes and architectures. Our results show that passages built from higher convergence sentences lead to substantially better answer accuracy than those selected by cosine similarity, indicating that convergence captures meaningful relevance for inferential reasoning. Additionally, ordering sentences by descending convergence slightly improves performance, suggesting that LLMs tend to prioritize earlier, information-rich cues. These findings highlight convergence as a practical signal for guiding passage construction and analyzing inferential reasoning behavior in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that passages constructed by selecting sentences with higher 'convergence' (a measure of how effectively they eliminate incorrect answers) yield substantially higher answer accuracy for inferential questions on TriviaHG subsets than passages selected via cosine similarity. This holds across six LLMs of varying sizes and architectures; additionally, ordering sentences in descending convergence order provides a slight further boost, suggesting LLMs prioritize early, high-value cues.

Significance. If the reported gains prove robust after controlling for sentence-level confounders, the work would supply a practical, task-specific signal for context construction in open-domain inferential QA that goes beyond standard semantic similarity. The multi-model evaluation adds modest generalizability, but the purely empirical framing and absence of mechanistic analysis or falsifiable predictions limit its theoretical reach.

major comments (3)
  1. [Abstract / Methods] Abstract and Methods: the computation of convergence is not specified (how incorrect answers are generated or eliminated, whether it is normalized, etc.), nor are any controls reported for confounding sentence properties such as length, lexical overlap, embedding magnitude, or original context position. Without these, the accuracy advantage over cosine selection cannot be attributed to the elimination mechanism rather than correlated surface features.
  2. [Results] Results: no statistical significance tests, error bars, or ablation studies are described for the accuracy differences across the six LLMs. The claim that convergence 'captures meaningful relevance' therefore rests on unquantified numerical improvements whose reliability cannot be assessed.
  3. [Results] The secondary observation that descending-convergence ordering helps is consistent with either the proposed elimination account or with simpler positional or salience biases; the design does not isolate the claimed mechanism.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'remains still underexplored' is redundant; 'remains underexplored' is sufficient.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation and strengthen the empirical claims. We address each major point below and indicate the planned revisions.

Point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and Methods: the computation of convergence is not specified (how incorrect answers are generated or eliminated, whether it is normalized, etc.), nor are any controls reported for confounding sentence properties such as length, lexical overlap, embedding magnitude, or original context position. Without these, the accuracy advantage over cosine selection cannot be attributed to the elimination mechanism rather than correlated surface features.

    Authors: We agree that the Methods section must specify the convergence computation in full detail. In the revised manuscript we will add: (1) the exact procedure for generating incorrect answers (zero-shot prompting of the target LLM on the question alone), (2) the formula for convergence as the fraction of those incorrect answers eliminated when the sentence is appended, and (3) any normalization applied. We will also include supplementary analyses that control for sentence length, lexical overlap with the question, embedding magnitude, and original document position. These controls will be reported as additional tables to demonstrate that the observed gains are not reducible to the listed surface features. revision: yes

  2. Referee: [Results] Results: no statistical significance tests, error bars, or ablation studies are described for the accuracy differences across the six LLMs. The claim that convergence 'captures meaningful relevance' therefore rests on unquantified numerical improvements whose reliability cannot be assessed.

    Authors: We acknowledge the lack of statistical quantification. The revised version will report error bars (standard deviation across multiple inference runs with varied random seeds) for all accuracy figures. We will add paired statistical tests (McNemar’s test on per-question correctness, sketched in code after these responses, and, where appropriate, t-tests on accuracy deltas) together with p-values for the convergence versus cosine comparisons. In addition, we will include ablation curves that show accuracy as a function of the number of highest-convergence sentences retained, thereby quantifying the incremental contribution of the convergence signal. revision: yes

  3. Referee: [Results] The secondary observation that descending-convergence ordering helps is consistent with either the proposed elimination account or with simpler positional or salience biases; the design does not isolate the claimed mechanism.

    Authors: The referee is right that the ordering benefit could arise from positional or salience biases rather than the elimination mechanism. We will add a control experiment that keeps the identical set of high-convergence sentences but presents them in random order versus descending-convergence order. The performance difference between these two conditions will be reported; any residual advantage of the descending order can then be interpreted more precisely. We continue to regard the primary result as the selection of sentences by convergence rather than by cosine similarity, with ordering treated as a secondary practical observation. revision: partial
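To make responses 2 and 3 concrete, here is a minimal sketch, an assumption-laden illustration rather than the authors' code, of the proposed controls: the descending-versus-shuffled ordering comparison on a fixed sentence set, and McNemar's test on paired per-question correctness. The use of statsmodels and random.Random are implementation choices, not details from the paper or rebuttal.

```python
# Hedged sketch of the controls named in responses 2 and 3 above. statsmodels for
# McNemar's test and random.Random for the shuffled-order control are implementation
# assumptions, not details taken from the paper or the rebuttal.
import random
from statsmodels.stats.contingency_tables import mcnemar


def ordered_passage(scored_sentences, descending=True, seed=0):
    """Same sentence set, in descending-convergence order or a shuffled control order."""
    sents = [s for s, _ in sorted(scored_sentences, key=lambda p: p[1], reverse=True)]
    if not descending:
        random.Random(seed).shuffle(sents)
    return " ".join(sents)


def mcnemar_on_correctness(correct_a, correct_b):
    """Paired test on per-question correctness of two passage conditions A and B."""
    both = sum(a and b for a, b in zip(correct_a, correct_b))
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    neither = sum(not a and not b for a, b in zip(correct_a, correct_b))
    table = [[both, only_a], [only_b, neither]]
    result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
    return result.statistic, result.pvalue
```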

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of passage-selection heuristics

Full rationale

The paper reports an experimental comparison of two passage-construction strategies (high-convergence sentences vs. cosine-similarity sentences) on fixed TriviaHG subsets, evaluated by accuracy of six LLMs. Convergence is introduced as an externally defined elimination metric and is never fitted or redefined inside the study; the accuracy difference is measured directly rather than derived from any equation or self-referential prediction. No self-citation chain, ansatz, or uniqueness theorem is invoked to justify the central result. The work is therefore self-contained against external benchmarks and contains no load-bearing step that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim relies on the empirical validity of the convergence metric and the experimental design using the TriviaHG dataset.

axioms (2)
  • domain assumption The TriviaHG dataset subsets are representative for testing inferential question answering.
    Used as the basis for experiments.
  • domain assumption LLM performance differences are attributable to passage quality rather than model-specific behaviors.
    Evaluated across multiple models but assumes generalizability.
invented entities (1)
  • convergence no independent evidence
    purpose: A measure of how effectively sentences eliminate incorrect answers for passage construction.
    Defined and used in the study as a new signal for relevance.

pith-pipeline@v0.9.0 · 5472 in / 1268 out tokens · 61427 ms · 2026-05-13T05:57:59.245269+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 7 internal anchors

  1. [1] Heba Abdel-Nabi, Arafat Awajan, and Mostafa Z. Ali. 2022. Deep learning-based question answering: a survey. Knowl. Inf. Syst. 65, 4 (Dec. 2022), 1399–1485. doi:10.1007/s10115-022-01783-5
  2. [2] George Arthur Baker, Ankush Raut, Sagi Shaier, Lawrence E Hunter, and Katharina von der Wense. 2024. Lost in the Middle, and In-Between: Enhancing Language Models’ Ability to Reason Over Long Contexts in Multi-Hop QA. arXiv preprint arXiv:2412.10079 (2024)
  3. [3] Florin Cuconasu, Simone Filice, Guy Horowitz, Yoelle Maarek, and Fabrizio Silvestri. 2025. Do RAG Systems Suffer From Positional Bias? arXiv preprint arXiv:2505.15561 (2025)
  4. [4] Debrup Das, Sam O’Nuallain, and Razieh Rahimi. 2025. RaDeR: Reasoning-aware Dense Retrieval Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (Eds.). Association for Computational Linguistics, Suzhou, China, 19970–19997. doi:10.1...
  5. [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy...
  6. [6] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, et al. 2024. The Llama 3 Herd of Models. arXiv e-prints, Article arXiv:2407.21783 (July 2024). doi:10.48550/arXiv.2407.21783
  7. [7] Nell K. Duke and P. David Pearson. 2008. Effective Practices for Developing Reading Comprehension. The Journal of Education 189, 1/2 (2008), 107–122. http://www.jstor.org/stable/42748663
  8. [8] David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine 31, 3 (2010), 59–79. https://onlinelibrary.wiley.com/doi/pdf/10.1609/aimag.v31i3.2303 doi:10.16...
  9. [9] Anubhav Jangra, Jamshid Mozafari, Adam Jatowt, and Smaranda Muresan. 2025. Navigating the Landscape of Hint Generation Research: From the Past to the Future. Transactions of the Association for Computational Linguistics 13 (06 2025), 505–528. doi:10.1162/tacl_a_00751
  10. [10] Adam Jatowt, Calvin Gehrer, and Michael Färber. 2023. Automatic Hint Generation. In Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval (Taipei, Taiwan) (ICTIR ’23). Association for Computing Machinery, New York, NY, USA, 117–123. doi:10.1145/3578337.3605119
  11. [11] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516 (2025)
  12. [12] Hyo-Jung O. Lee, Hyeon-Jin Kim, and Myung-Gil Jang. 2005. Descriptive Question Answering in Encyclopedia. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, Masaaki Nagata and Ted Pedersen (Eds.). Association for Computational Linguistics, Ann Arbor, Michigan, 21–24. doi:10.3115/1225753.1225759
  13. [13] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F...
  14. [14] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics 12 (2024), 157–173. doi:10.1162/tacl_a_00638
  15. [15] Wenhan Liu, Xinyu Ma, Weiwei Sun, Yutao Zhu, Yuchen Li, Dawei Yin, and Zhicheng Dou. 2025. ReasonRank: Empowering passage ranking with strong reasoning ability. arXiv preprint arXiv:2508.07050 (2025)
  16. [16] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv e-prints, Article arXiv:1907.11692 (July 2019). doi:10.48550/arXiv.1907.11692
  17. [17] Meixiu Long, Duolin Sun, Dan Yang, Junjie Wang, Yecheng Luo, Yue Shen, Jian Wang, Hualei Zhou, Chunxiao Guo, Peng Wei, et al. 2025. Diver: A multi-stage approach for reasoning-intensive information retrieval. arXiv preprint arXiv:2508.07995 (2025)
  18. [18] Man Luo, Kazuma Hashimoto, Semih Yavuz, Zhiwei Liu, Chitta Baral, and Yingbo Zhou. 2022. Choose Your QA Model Wisely: A Systematic Study of Generative and Extractive Readers for Question Answering. In Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge, Rajarshi Das, Patrick Lewis, Sewon Min, June Thai, and Man...
  19. [19] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. 2024. Large language models: A survey. arXiv preprint arXiv:2402.06196 (2024)
  20. [21] Exploring Hint Generation Approaches for Open-Domain Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 9327–9352. doi:10.18653/v1/2024.findings-emnlp.546
  21. [22] Jamshid Mozafari, Abdelrahman Abdallah, Bhawna Piryani, and Adam Jatowt
  22. [23] Reproducing nevir: Negation in neural information retrieval
      Wrong Answers Can Also Be Useful: PlausibleQA - A Large-Scale QA Dataset with Answer Plausibility Scores (SIGIR ’25). Association for Computing Machinery, New York, NY, USA, 3832–3842. doi:10.1145/3726302.3730299
  23. [24] Jamshid Mozafari, Florian Gerhold, and Adam Jatowt. 2025. WikiHint: A Human-Annotated Dataset for Hint Ranking and Generation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (Padua, Italy) (SIGIR ’25). Association for Computing Machinery, New York, NY, USA, 3821–3831. doi:10.1145/3726302.3730284
  24. [25] Jamshid Mozafari, Anubhav Jangra, and Adam Jatowt. 2024. TriviaHG: A Dataset for Automatic Hint Generation from Factoid Questions. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Washington DC, USA) (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 2060–2070. doi:10.1145...
  25. [26] Jamshid Mozafari, Bhawna Piryani, Abdelrahman Abdallah, and Adam Jatowt
  26. [27] HintEval: A Comprehensive Framework for Hint Generation and Evaluation for Questions. arXiv preprint arXiv:2502.00857 (2025)
  27. [28] Jamshid Mozafari, Hamed Zamani, Guido Zuccon, and Adam Jatowt. 2026. Inferential Question Answering. In Proceedings of the ACM Web Conference 2026 (United Arab Emirates) (WWW ’26). Association for Computing Machinery, New York, NY, USA, 2384–2395. doi:10.1145/3774904.3792653
  28. [29] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1, Article 140 (Jan. 2020), 67 pages
  29. [30] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association...
  30. [31] Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen-tau Yih, Pang Wei Koh, et al. 2025. ReasonIR: Training Retrievers for Reasoning Tasks. arXiv preprint arXiv:2504.20595 (2025)
  31. [32] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Chris...
  32. [33] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, and Others
  33. [34] Gemma 3 Technical Report. arXiv:2503.19786 [cs.CL] https://arxiv.org/abs/2503.19786
  34. [35] Sriram Veturi, Saurabh Vaichal, Reshma Lal Jagadheesh, Nafis Irtiza Tripto, and Nian Yan. 2024. RAG based question-answering for contextual response prediction system. arXiv preprint arXiv:2409.03708 (2024)
  35. [36] Mengqiu Wang et al. 2006. A survey of answer extraction techniques in factoid question answering. Computational Linguistics 1, 1 (2006), 1–14
  36. [37] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
  37. [38] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (...
  38. [39] Zongmeng Zhang, Jinhua Zhu, Wengang Zhou, Xiang Qi, Peng Zhang, and Houqiang Li. 2024. BoolQuestions: Does Dense Retrieval Understand Boolean Logic in Language? In Findings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA...