Exploring Knowledge Conflicts for Faithful LLM Reasoning: Benchmark and Method
Pith reviewed 2026-05-10 15:00 UTC · model grok-4.3
The pith
Large language models fail to identify reliable evidence when text and knowledge graph sources conflict, defaulting to one source or prompt cues instead.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When presented with conflicting evidence from textual documents and knowledge graphs, large language models tend to rely exclusively on one source or to be overly sensitive to prompting choices rather than determining which evidence supports correct reasoning, resulting in unfaithful and incorrect responses. The XoT framework mitigates this by guiding models through an initial stage of generating explanations grounded in each evidence type followed by an integration stage that weighs the explanations for a final answer.
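To make the two-stage recipe concrete, here is a minimal sketch of an explain-then-integrate pipeline. The prompt wording and the `ask` callable are illustrative assumptions of this review, not the authors' released implementation.

```python
from typing import Callable

def xot_answer(question: str, text_evidence: str, kg_evidence: str,
               ask: Callable[[str], str]) -> str:
    """Two-stage sketch: per-source explanations first, then integration.
    `ask` stands in for any LLM call (e.g., an API client)."""
    # Stage 1: generate one explanation grounded in each evidence source.
    text_expl = ask(
        f"Question: {question}\nEvidence (text): {text_evidence}\n"
        "State the answer this evidence supports and explain why."
    )
    kg_expl = ask(
        f"Question: {question}\nEvidence (KG triples): {kg_evidence}\n"
        "State the answer this evidence supports and explain why."
    )
    # Stage 2: weigh the explanations against each other and decide,
    # rather than letting the model latch onto one raw source.
    return ask(
        f"Question: {question}\n"
        f"Explanation from text: {text_expl}\n"
        f"Explanation from KG: {kg_expl}\n"
        "The sources may conflict. Judge which explanation is better "
        "supported and give a single final answer."
    )
```

Any stub for `ask` (even `lambda p: "stub"`) exercises the control flow; the point is the separation of per-source explanation generation from cross-source decision making.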
What carries the argument
The ConflictQA benchmark, which systematically instantiates contradictions between textual evidence and KG evidence for question answering, and the XoT two-stage explanation-based thinking framework, which separates per-source explanation generation from cross-source decision making.
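As a rough illustration of what systematically instantiating a conflict could look like, the sketch below swaps the object entity of a KG triple when rendering the textual passage, so exactly one source supports the gold answer. The template and field names are hypothetical; the paper's own construction is described in its §3.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ConflictInstance:
    question: str
    kg_triple: Tuple[str, str, str]  # (subject, relation, object)
    text_passage: str                # asserts a contradicting object
    answer: str                      # gold answer

def make_entity_swap_conflict(subject: str, relation: str,
                              true_obj: str, false_obj: str) -> ConflictInstance:
    # The KG side keeps the true object; the text side asserts the
    # swapped one, creating a controlled cross-source contradiction.
    return ConflictInstance(
        question=f"What is the {relation} of {subject}?",
        kg_triple=(subject, relation, true_obj),
        text_passage=f"According to records, the {relation} of {subject} "
                     f"is {false_obj}.",
        answer=true_obj,
    )
```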
If this is right
- RAG systems that retrieve from both text and KGs will generate more incorrect answers unless they incorporate conflict-handling steps like those in XoT.
- Prompt engineering alone cannot overcome the models' tendency to commit to a single evidence source.
- XoT improves faithfulness by forcing explicit explanation steps before the model chooses an answer.
- The benchmark enables systematic testing of future methods aimed at multi-source evidence integration.
- LLMs need built-in mechanisms to detect and resolve inconsistencies across heterogeneous knowledge formats.
Where Pith is reading between the lines
- If the prompt sensitivity pattern holds beyond the benchmark, it points to a general limitation in how LLMs perform evidence reliability assessment.
- XoT-style explanation stages could be adapted to other conflict types such as multiple retrieved documents or database records.
- Real deployments might benefit from hybrid systems that flag conflicting sources for human review rather than forcing an immediate answer.
- One testable extension is whether fine-tuning on explanation chains from conflicting sources reduces the observed source bias more than prompting alone.
Load-bearing premise
The conflicts created in ConflictQA represent realistic and representative cases of cross-source knowledge conflicts that arise in practical RAG deployments combining text and KGs.
What would settle it
Evaluate the same LLMs on a collection of naturally occurring contradictions extracted from real-world RAG pipelines that retrieve both text passages and KG triples, and check whether the models still show exclusive source reliance or prompt sensitivity.
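One way to operationalize that check: record, for each conflict instance, whether the model's answer matches the text-supported answer, the KG-supported answer, or neither, and inspect how lopsided the distribution is. A hedged sketch of such scoring, with hypothetical record fields:

```python
from collections import Counter
from typing import Dict, List

def source_reliance(records: List[Dict[str, str]]) -> Dict[str, float]:
    """Each record holds 'pred' (the model's answer), 'text_ans' (the
    answer the text passage supports), and 'kg_ans' (the answer the
    KG triples support)."""
    counts: Counter = Counter()
    for r in records:
        if r["pred"] == r["text_ans"]:
            counts["text"] += 1
        elif r["pred"] == r["kg_ans"]:
            counts["kg"] += 1
        else:
            counts["neither"] += 1
    n = max(len(records), 1)
    return {k: counts[k] / n for k in ("text", "kg", "neither")}
```

Exclusive reliance would show up as a rate near 1.0 for one source; prompt sensitivity would show up as large swings in these rates across prompt variants.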
read the original abstract
Large language models (LLMs) have achieved remarkable success across a wide range of applications especially when augmented by external knowledge through retrieval-augmented generation (RAG). Despite their widespread adoption, recent studies have shown that LLMs often struggle to perform faithful reasoning when conflicting knowledge is retrieved. However, existing work primarily focuses on conflicts between external knowledge and the parametric knowledge of LLMs, leaving conflicts across external knowledge largely unexplored. Meanwhile, modern RAG systems increasingly emphasize the integration of unstructured text and (semi-)structured data like knowledge graphs (KGs) to improve knowledge completeness and reasoning faithfulness. To address this gap, we introduce ConflictQA, a novel benchmark that systematically instantiates conflicts between textual evidence and KG evidence. Extensive evaluations across representative LLMs reveal that, facing such cross-source conflicts, LLMs often fail to identify reliable evidence for correct reasoning. Instead, LLMs become more sensitive to prompting choices and tend to rely exclusively on either KG or textual evidence, resulting in incorrect responses. Based on these findings, we further propose XoT, a two-stage explanation-based thinking framework tailored for reasoning over heterogeneous conflicting evidence, and verify its effectiveness with extensive experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ConflictQA, a benchmark instantiating cross-source knowledge conflicts between textual evidence and KG evidence in RAG settings. Evaluations on representative LLMs show that models fail to identify reliable evidence under such conflicts, become overly sensitive to prompting choices, and tend to rely exclusively on either the KG or textual source, leading to incorrect answers. The authors propose XoT, a two-stage explanation-based thinking framework for handling heterogeneous conflicting evidence, and report that it improves performance in experiments.
Significance. If the benchmark construction is shown to be representative of real RAG inconsistencies and the empirical findings are robust, the work fills a gap in studying conflicts across external knowledge sources (rather than only parametric vs. external) and could inform more reliable heterogeneous RAG systems. The proposal of XoT provides a concrete starting point for explanation-driven mitigation, which is a positive contribution if the gains are reproducible and not prompt-specific.
major comments (2)
- [ConflictQA construction (likely §3)] The central claim that LLMs 'fail to identify reliable evidence' and 'become more sensitive to prompting choices' under cross-source conflicts depends on ConflictQA instantiating realistic cases. The construction details (e.g., how conflicts are generated via templates, entity swaps, or other methods, and any matching to real retrieval error distributions) must be expanded with quantitative validation that these are not benchmark artifacts; otherwise the observed exclusive reliance on one source may not generalize to practical text-KG RAG deployments.
- [Evaluations and XoT experiments] §4 (or equivalent evaluation section) and the XoT experiments: the abstract and high-level description provide no quantitative results, error analysis, baseline comparisons (e.g., against standard CoT, self-consistency, or retrieval reranking), or ablation on the two-stage explanation component. These are load-bearing for verifying both the failure modes and XoT's effectiveness; without them the claims cannot be assessed.
minor comments (1)
- [Evaluation setup] Clarify the exact prompting templates used for the sensitivity analysis and ensure they are released with the benchmark to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. We agree that additional details on benchmark construction and expanded experimental analyses will strengthen the paper, and we have revised the manuscript to incorporate these improvements.
read point-by-point responses
- Referee: [ConflictQA construction (likely §3)] The central claim that LLMs 'fail to identify reliable evidence' and 'become more sensitive to prompting choices' under cross-source conflicts depends on ConflictQA instantiating realistic cases. The construction details (e.g., how conflicts are generated via templates, entity swaps, or other methods, and any matching to real retrieval error distributions) must be expanded with quantitative validation that these are not benchmark artifacts; otherwise the observed exclusive reliance on one source may not generalize to practical text-KG RAG deployments.
  Authors: We appreciate this observation on ensuring the benchmark reflects real-world conditions. Section 3 of the manuscript describes the ConflictQA construction, which uses template-based generation combined with entity and relation swaps between textual passages and KG triples to create controlled cross-source conflicts. To directly address the request for quantitative validation, the revised manuscript adds a dedicated analysis in Section 3.2 that reports statistics on conflict types (e.g., entity-swap frequency, relation mismatch rates) and compares these distributions to retrieval inconsistency patterns observed in public RAG logs and datasets. This analysis supports the claim that the reported LLM behaviors are not benchmark-specific artifacts.
  revision: yes
- Referee: [Evaluations and XoT experiments] §4 (or equivalent evaluation section) and the XoT experiments: the abstract and high-level description provide no quantitative results, error analysis, baseline comparisons (e.g., against standard CoT, self-consistency, or retrieval reranking), or ablation on the two-stage explanation component. These are load-bearing for verifying both the failure modes and XoT's effectiveness; without them the claims cannot be assessed.
  Authors: The full manuscript already contains quantitative results in Section 4, including performance tables across LLMs that quantify failure rates, prompting sensitivity, and source reliance under conflicts, as well as comparisons of XoT against CoT. However, we agree that the abstract lacks specific numbers and that additional error analysis, baselines, and ablations would improve verifiability. In the revision, we have updated the abstract to include key quantitative highlights, added a new error-analysis subsection to Section 4, incorporated self-consistency and retrieval reranking as explicit baselines, and expanded the ablation study on the two-stage explanation component with further metrics and controls.
  revision: yes
Circularity Check
No circularity: empirical benchmark and framework with no derivations
full rationale
The paper introduces the ConflictQA benchmark and the XoT framework through direct LLM evaluations on constructed conflicts. There are no equations, no parameter fitting, no self-definitional reductions, and no load-bearing self-citations that collapse claims to unverified inputs. The central observations derive from external model testing rather than from any tautological chain. This matches the default non-circular case for benchmark papers.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can be prompted to produce explanations that help resolve evidence conflicts
invented entities (1)
- XoT framework: no independent evidence