All Relations Lead to Rome: Automated Knowledge Graph Creation and Question Generation

Lorenzo Gatti; Matthijs Jansen op de Haar; Tobias Stähle

arXiv: 2606.22645 · v1 · pith:JYAADSAAnew · submitted 2026-06-21 · cs.IR · cs.CY


Pith reviewed 2026-06-26 09:25 UTC · model grok-4.3


keywords: knowledge graph construction · question generation · information retrieval · hybrid retrieval · fact-grounded QA · automated extraction · Roman Empire dataset


The pith

ARLtR builds a single dataset of knowledge graphs, embeddings, and fact-grounded question-answer pairs from raw text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ARLtR to create resources that support both vector retrieval over text chunks and reasoning over explicit knowledge graphs in one package. It extracts entities and relations from a corpus, assembles them into a graph with embeddings, and generates question-answer pairs tied directly to those facts and their source sentences. The result is a Roman Empire dataset containing more than 19,000 entities, 16,000 text chunks, and 8,400 QA pairs. This unified structure is meant to let researchers test and improve systems that combine dense retrieval with symbolic graph operations. The framework is presented as a general method that can be run on any text collection to produce aligned graph, embedding, and QA outputs.

Core claim

ARLtR jointly constructs a knowledge graph, embeddings, and question-answer pairs that are explicitly grounded in extracted entities, relations, and supporting textual evidence, instantiated as a historical dataset centered on the Roman Empire comprising over 19,000 entities, 16,000 chunks, and 8,400 question-answer pairs.

What carries the argument

The ARLtR automated framework that extracts entities and relations from text to produce a coupled knowledge graph, vector embeddings, and fact-grounded QA pairs.
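To make the coupling concrete, here is a minimal sketch of how chunks, entities, relations, and QA pairs might reference one another, plus a check that no reference dangles. Every class and field name is an illustrative assumption; the published ARLtR schema is not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical record types: every field name here is an assumption,
# not the published ARLtR schema.
@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str

@dataclass(frozen=True)
class Entity:
    entity_id: str
    name: str
    chunk_ids: tuple  # chunks in which the entity was extracted

@dataclass(frozen=True)
class Relation:
    head_id: str
    predicate: str
    tail_id: str
    chunk_id: str  # chunk offered as evidence for the triple

@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    entity_ids: tuple
    chunk_ids: tuple

def dangling_references(chunks, entities, relations, qa_pairs):
    """List every ID that is referenced but never defined."""
    known_chunks = {c.chunk_id for c in chunks}
    known_entities = {e.entity_id for e in entities}
    problems = []
    for e in entities:
        problems += [f"entity {e.entity_id}: chunk {c}"
                     for c in e.chunk_ids if c not in known_chunks]
    for r in relations:
        problems += [f"relation {r.predicate}: entity {x}"
                     for x in (r.head_id, r.tail_id) if x not in known_entities]
        if r.chunk_id not in known_chunks:
            problems.append(f"relation {r.predicate}: chunk {r.chunk_id}")
    for q in qa_pairs:
        problems += [f"qa {q.question!r}: entity {x}"
                     for x in q.entity_ids if x not in known_entities]
        problems += [f"qa {q.question!r}: chunk {c}"
                     for c in q.chunk_ids if c not in known_chunks]
    return problems
```

An empty result only shows that the links are structurally intact. It says nothing about whether the extracted facts are correct, which is the separate question raised under the load-bearing premise.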

If this is right

Hybrid retrieval systems can be trained and evaluated against one resource that supplies both graph structure and dense vector representations.
Semantic steering methods can be tested directly on the aligned graph and embedding layers.
Fact-grounded QA pairs enable direct verification of model answers against source sentences.
The same pipeline can be applied to new domains by swapping the input corpus while keeping the output format consistent.
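The verification mentioned in the list above, checking a model answer against source sentences, could look roughly like the following. This is a deliberately simple sketch that assumes exact token-sequence matching after normalization; the function names are hypothetical and the paper does not prescribe a matching rule.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def answer_supported(answer: str, evidence_sentences: list) -> bool:
    """True if the normalized answer appears as a whole-token sequence
    inside at least one normalized evidence sentence."""
    target = normalize(answer)
    if not target:
        return False
    padded = f" {target} "
    return any(padded in f" {normalize(s)} " for s in evidence_sentences)
```

Exact matching will miss paraphrased answers, so a real evaluation would likely need a softer criterion on top of it.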

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dataset could serve as a testbed for measuring how well models preserve source fidelity when moving between text and graph forms.
Extensions might add multi-hop reasoning tasks over the generated graphs to probe deeper inference capabilities.
If extraction quality holds, the approach offers a scalable route to create similar resources for other historical or technical corpora.

Load-bearing premise

Automated extraction of entities, relations, and questions from text produces accurate outputs that faithfully represent the source corpus without significant hallucinations or omissions.

What would settle it

A manual review of a random sample of generated triples and QA pairs that finds more than a small fraction of errors or unsupported facts relative to the original text.
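A sketch of how such an audit could be quantified: draw a reproducible random sample, count the items a reviewer marks as wrong, and report a Wilson score interval for the error rate. The sample sizes, the seed, and the function names are assumptions chosen for illustration, not values from the paper.

```python
import math
import random

def draw_audit_sample(item_ids, sample_size, seed=0):
    """Pick sample_size distinct items, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(list(item_ids), sample_size)

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for an error proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

For example, `wilson_interval(10, 200)` brackets an observed 5% error rate, which makes explicit how much uncertainty a sample of that size leaves.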

Figures

Figures reproduced from arXiv: 2606.22645 by Lorenzo Gatti, Matthijs Jansen op de Haar, Tobias Stähle.

**Figure 1.** In this work, we propose a framework for (i) knowledge graph construction and (ii) fact-grounded question-answer generation. (i) In the first phase, a corpus of documents is defined together with a weak ontology specifying entity and relation types. The corpus is then segmented into chunks, after which the ontology guides entity extraction from each chunk. Relations between extracted entities are then iden…

**Figure 2.** ARLtR facilitates QA in both a knowledge graph and vector embeddings, resulting in ground-truth facts, or answers, that coincide with ground-truth tags, or annotations.

**Figure 3.** Sampling Variations between Complexity and Relational Question Categories between Entity (A) and Neighbor (B). Example: Given some combination of question categories, an entity with sufficient chunks is randomly selected (e.g., Augustus). In this case, a Double-Entity question was specified. Therefore, a neighboring entity that has a sufficient amount of chunks shared with Augustus is chosen, in this case …

**Figure 4.** Subset of the ARLtR KG, with Rome Highlighted. Accompanying text: Furthermore, the knowledge includes embedding attributes for all chunks and entities, generated using the pre-trained geminiembedding-2 model (i.e., with 3,072 dimensions). This ensures that embeddings do not need to be recomputed for any future studies that use ARLtR. Moreover, the dataset contains vector indices for both entities and chunks, which is suppor…

**Figure 5.** Ontology Grounded in the Historical Domain. Accompanying text (§5.2 Ontology): In the design of ARLtR, we adopt an ontology grounded in existing historical and ontological literature [19, 20, 27]. As discussed, we employ a weak ontology to facilitate extraction, meaning that all defined entity types are allowed to relate to one another without hard constraints on admissible relation pairs. This ensures that the dataset is more a…
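The Double-Entity sampling step described in the Figure 3 caption can be sketched as follows: pick a random entity with enough chunks, then a graph neighbor that shares enough chunks with it. The thresholds `min_chunks` and `min_shared`, the dictionary layout, and the function name are assumptions; the caption only says "sufficient".

```python
import random

def sample_entity_pair(entity_chunks, neighbors, min_chunks=3, min_shared=2, seed=None):
    """entity_chunks: entity -> set of chunk IDs mentioning it.
    neighbors: entity -> set of entities adjacent to it in the graph.
    Returns (entity, neighbor, shared_chunks) or None if no pair qualifies."""
    rng = random.Random(seed)
    # Entities with enough supporting chunks, visited in random order.
    candidates = sorted(e for e, ch in entity_chunks.items() if len(ch) >= min_chunks)
    rng.shuffle(candidates)
    for entity in candidates:
        shared = {
            n: entity_chunks[entity] & entity_chunks.get(n, set())
            for n in neighbors.get(entity, ())
        }
        eligible = sorted(n for n, s in shared.items() if len(s) >= min_shared)
        if eligible:
            neighbor = rng.choice(eligible)
            return entity, neighbor, shared[neighbor]
    return None
```

The shared chunks returned here would then serve as the supporting evidence for a question that involves both entities.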

Original abstract

Large language models have substantially improved information retrieval and question answering; however, existing datasets generally support either vector-based retrieval over unstructured text or reasoning over knowledge graphs, without providing a unified representation that combines both paradigms. Moreover, current benchmarks rarely provide ground-truth entities, relations, and fact-grounded question-answer pairs aligned with the underlying corpus. To address this gap, we introduce All Relations Lead to Rome (ARLtR), a unified framework for automated knowledge graph construction and fact-grounded question-answer generation. ARLtR jointly constructs a knowledge graph, embeddings, and question-answer pairs that are explicitly grounded in extracted entities, relations, and supporting textual evidence. We further instantiate the framework as a historical dataset centered on the Roman Empire, comprising over 19,000 entities, 16,000 chunks, and 8,400 question-answer pairs (https://huggingface.co/datasets/FaynePro/all-relations-lead-to-rome). By tightly coupling symbolic graph representations with dense retrieval representations, ARLtR facilitates the evaluation and development of hybrid retrieval systems and semantic steering approaches within a single coherent resource.
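As an illustration of the kind of hybrid retrieval such a resource is meant to support, the sketch below ranks chunks by cosine similarity and then expands one hop in the graph from the entities found in the top chunks. The two-step design, the data layout, and the toy two-dimensional vectors are assumptions made for brevity; this is not the paper's retrieval method.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_retrieve(query_vec, chunk_vecs, chunk_entities, triples, k=2):
    """Dense step: top-k chunks by cosine similarity.
    Graph step: triples touching any entity mentioned in those chunks."""
    ranked = sorted(chunk_vecs, key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
                    reverse=True)
    seed_chunks = ranked[:k]
    seed_entities = set().union(*(chunk_entities.get(c, set()) for c in seed_chunks))
    hits = [t for t in triples if t[0] in seed_entities or t[2] in seed_entities]
    return seed_chunks, hits
```

Because chunks, entities, and triples share identifiers in a unified dataset, the hand-off from the dense step to the graph step needs no separate entity-linking stage.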

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ARLtR supplies a new integrated dataset for hybrid graph-vector retrieval on Roman history texts, but the automated extraction steps have no reported accuracy checks.


Hi,

The main thing here is a released dataset from Roman Empire texts that bundles a knowledge graph (19k entities), text chunks (16k), embeddings, and 8.4k QA pairs all tied back to the source material. The framework automates entity-relation extraction and question generation in one pipeline, then aligns everything for hybrid retrieval experiments.

What works is the practical integration. Most existing resources stay in either pure text or pure graph mode; this one keeps the symbolic triples, dense vectors, and grounded questions in the same package and puts the result on Hugging Face. That setup lets people run controlled tests on combined systems without building their own data from scratch. The scope is narrow (one historical domain), which matches the claims and avoids overreach.

The clear gap is validation. The pipeline depends on LLM steps for triples and questions, yet the paper gives no precision/recall numbers, no sample error analysis, and no human checks on the outputs. Without those, the claim that the graph and QA pairs are reliably grounded rests on an untested assumption about low hallucination rates. That is the load-bearing part and it is missing.

No circularity or invented entities show up. The work is straightforward about creating an artifact rather than proving a new theorem.

This is for IR researchers who need benchmarks for hybrid retrieval or semantic steering. A reader in that area can get immediate use from the data even if the construction details need more evidence. It deserves a serious referee because a usable new resource can move experiments forward once the quality metrics are added.

Referee Report

2 major / 1 minor

Summary. The manuscript presents the ARLtR framework for automated construction of a knowledge graph, dense embeddings, and fact-grounded question-answer pairs from unstructured text. The framework is instantiated on a historical corpus concerning the Roman Empire, yielding a public dataset of over 19,000 entities, 16,000 chunks, and 8,400 QA pairs explicitly aligned with extracted entities, relations, and supporting textual evidence. The stated purpose is to support development and evaluation of hybrid retrieval systems that combine symbolic graph reasoning with vector-based retrieval.

Significance. If the extraction pipeline produces accurate, low-error artifacts that are verifiably grounded in the source corpus, the released dataset would constitute a useful resource for the IR community by enabling controlled experiments on hybrid KG+embedding systems. The public Hugging Face release and the scale of the artifact are strengths. However, the absence of any quantitative validation leaves the practical significance of the contribution difficult to determine.

major comments (2)

[§3–4] The framework description relies on LLM pipelines for triple extraction, chunking, and question generation, yet the manuscript supplies no precision/recall figures, inter-annotator agreement scores, or error analysis on samples of the 19k entities or 8.4k QA pairs. This directly undermines the central claim that the QA pairs and graph elements are 'explicitly grounded' without significant hallucinations or omissions.
[Abstract and §4] No baseline comparisons, human validation protocol, or construction details (specific models, prompts, or filtering steps) are reported for the automated extraction process. Without these, the dataset cannot be assessed against the grounding claim that constitutes the paper's primary contribution.

minor comments (1)

[Abstract] The dataset link in the abstract should include a persistent identifier or checksum to facilitate reproducibility checks.
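A checksum of the kind requested here can be produced with the standard library alone. The sketch below streams a file through SHA-256; which dataset files to hash, and where to publish the digests, is left open because the release layout is not described in this review.

```python
import hashlib
from pathlib import Path

def sha256_of_file(path, block_size=1 << 20):
    """Hex SHA-256 digest of a file, read in blocks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Publishing one digest per released file alongside the dataset link would let readers confirm they evaluate on the same bytes the authors used.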

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for validation metrics and construction details. We address each major comment below and commit to revisions that strengthen the grounding claims without altering the paper's core contribution as a dataset and framework release.


Referee: [§3–4] The framework description relies on LLM pipelines for triple extraction, chunking, and question generation, yet the manuscript supplies no precision/recall figures, inter-annotator agreement scores, or error analysis on samples of the 19k entities or 8.4k QA pairs. This directly undermines the central claim that the QA pairs and graph elements are 'explicitly grounded' without significant hallucinations or omissions.

Authors: We acknowledge this limitation in the current version. While grounding is enforced by construction (each QA pair is explicitly linked to specific entities, relations, and source chunks), the lack of quantitative error analysis weakens the claims. In revision we will add a dedicated validation subsection reporting precision/recall on a manually annotated sample of 500 entities and 200 QA pairs, plus inter-annotator agreement scores from two annotators. This will quantify hallucination and omission rates.

Revision: yes

Referee: [Abstract and §4] No baseline comparisons, human validation protocol, or construction details (specific models, prompts, or filtering steps) are reported for the automated extraction process. Without these, the dataset cannot be assessed against the grounding claim that constitutes the paper's primary contribution.

Authors: We agree that construction details are missing and will expand §4 with the exact LLMs, prompts, and filtering steps used. A human validation protocol will be described alongside the new error analysis. Baseline comparisons are not provided because the manuscript's purpose is to release the resource for community use in hybrid retrieval experiments rather than to benchmark retrieval methods itself; the dataset is explicitly designed to enable such evaluations.

Revision: partial
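The metrics promised in the rebuttal, precision/recall on an annotated sample and agreement between two annotators, could be computed along these lines. This is a sketch under the assumption that extractions are compared as sets and that each annotator assigns one categorical label per item; Cohen's kappa is used as a common agreement statistic, though the rebuttal does not name one.

```python
from collections import Counter

def precision_recall(predicted, gold):
    """Set-based precision and recall of predicted items against gold items."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label frequencies.
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Reporting kappa next to raw agreement matters here because, if most extracted items are correct, two annotators will agree often by chance alone.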

Circularity Check

0 steps flagged

No circularity: framework constructs new artifacts without derivation chain


The paper describes an automated pipeline for extracting entities and relations from text, building a KG, generating embeddings, and creating grounded QA pairs. No equations, predictions, or first-principles derivations are claimed; the central output is the resulting dataset (19k entities, 8.4k QA pairs) itself. All steps are constructive and externally verifiable against the source corpus. No load-bearing self-citation, fitted-input-as-prediction, or self-definitional patterns appear. This matches the default expectation of a non-circular construction paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, preventing identification of specific free parameters, axioms, or invented entities. The framework description implies reliance on standard LLM-based extraction but provides no details on any fitted values or background assumptions.

pith-pipeline@v0.9.1-grok · 5729 in / 1048 out tokens · 37386 ms · 2026-06-26T09:25:31.486217+00:00 · methodology


Reference graph

Works this paper leans on

34 extracted references · 23 canonical work pages

[1] 2025. Persona-SQ: A Personalized Suggested Question Generation Framework For Real-world Documents. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations), Nouha Dziri, Sean (Xiang) Ren, and Shizhe Diao (Eds.). 210–247. doi:10.18653/v1/2025.naacl-demo.20
[2] Dang Anh-Hoang, Vu Tran, and Le-Minh Nguyen. 2025. Survey and analysis of hallucinations in large language models: attribution to prompting strategies or model behavior. Frontiers in Artificial Intelligence 8 (2025). doi:10.3389/frai.2025.1622292
[3] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, et al. 2014. Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 601–610. doi:10.1145/2623330.2623623
[4] Mohnish Dubey, Debayan Banerjee, Debanjan Chaudhuri, et al. 2018. EARL: Joint Entity and Relation Linking for Question Answering over Knowledge Graphs. In The Semantic Web – ISWC 2018: 17th International Semantic Web Conference, Monterey, CA, USA, October 8–12, 2018, Proceedings, Part I. 108–126. doi:10.1007/978-3-030-00671-6_7
[5] Darren Edge, Ha Trinh, Newman Cheng, et al. 2025. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130 [cs.CL] https://arxiv.org/abs/2404.16130
[6] Amer Farea, Zhen Yang, Kien Duong, et al. 2025. Evaluation of Question Answering Systems: Complexity of Judging a Natural Language. ACM Comput. Surv. 58, 1 (2025), 1–43. doi:10.1145/3744663
[7] Bin Fu, Yunqi Qiu, Chengguang Tang, et al. 2020. A Survey on Complex Question Answering over Knowledge Base: Recent Advances and Challenges. arXiv:2007.13069 [cs.CL] https://arxiv.org/abs/2007.13069
[8] Shash Guo, Lizi Liao, Cuiping Li, et al. 2024. A survey on neural question generation: methods, applications, and prospects. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. Article 889, 8038–8047. doi:10.24963/ijcai.2024/889
[9] Aidan Hogan, Eva Blomqvist, Michael Cochez, et al. 2021. Knowledge Graphs. ACM Comput. Surv. 54, 4, Article 71 (2021), 37 pages. doi:10.1145/3447772
[10] Yizheng Huang and Jimmy Xiangji Huang. 2026. A Survey on Retrieval-Augmented Text Generation for Large Language Models. ACM Comput. Surv. 58, 12, Article 300 (2026), 38 pages. doi:10.1145/3805774
[11] Shaoxiong Ji, Shirui Pan, Erik Cambria, et al. 2022. A Survey on Knowledge Graphs: Representation, Acquisition, and Applications. IEEE Transactions on Neural Networks and Learning Systems 33, 2 (2022), 494–514. doi:10.1109/TNNLS.2021.3070843
[12] Mandar Joshi, Eunsol Choi, Daniel Weld, et al. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1601–1611. doi:10.18653/v1/P17-1147
[13] Amir Abbas Kamalipour, Shahrokh Asadi, and Mohammad Mahyar Amiri Chimeh. 2026. From vectors to knowledge graphs: A comprehensive analysis of modern retrieval-augmented generation architectures. Computer Science Review 61 (2026), 100925. doi:10.1016/j.cosrev.2026.100925
[14] Yuri Kuratov, Aydar Bulatov, Petr Anokhin, et al. 2024. In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss. arXiv:2402.10790 [cs.CL] https://arxiv.org/abs/2402.10790
[15] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, et al. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics 7 (2019), 452–466. doi:10.1162/tacl_a_00276
[16] Yunshi Lan, Gaole He, Jinhao Jiang, et al. 2023. Complex Knowledge Base Question Answering: A Survey. IEEE Transactions on Knowledge and Data Engineering 35, 11 (2023), 11196–11215. doi:10.1109/TKDE.2022.3223858
[17] Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems. Article 793, 9459–9474.
[18] Chuangtao Ma, Yongrui Chen, Tianxing Wu, et al. 2025. Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities. arXiv:2505.20099 [cs.CL] https://arxiv.org/abs/2505.20099
[19] Albert Meroño-Peñuela, Ashkan Ashkpour, Marieke van Erp, et al. 2015. Semantic technologies for historical research: A survey. Semantic Web 6, 6 (2015), 539–564. doi:10.3233/SW-140158
[20] Gabor Nagypal. 2005. History ontology building: The technical view. In Proceedings of the 16th International Conference of the Association for History and Computing (AHC 2005). Royal Netherlands Academy of Arts and Sciences, 207–214.
[21] Antti Oulasvirta, Mikael Wahlström, and K. Anders Ericsson. 2011. What does it mean to be good at using a mobile device? An investigation of three levels of experience and skill. International Journal of Human-Computer Studies 69, 3 (2011), 155–169. doi:10.1016/j.ijhcs.2010.11.003
[22] Fabio Petroni, Aleksandra Piktus, Angela Fan, et al. 2021. KILT: a Benchmark for Knowledge Intensive Language Tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2523–2544. doi:10.18653/v1/2021.naacl-main.200
[23] Haritz Puerto, Gözde Şahin, and Iryna Gurevych. 2023. MetaQA: Combining Expert Agents for Multi-Skill Question Answering. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 3566–3580. doi:10.18653/v1/2023.eacl-main.259
[24] Mikhail Salnikov, Hai Le, Prateek Rajput, et al. 2023. Large Language Models Meet Knowledge Graphs to Answer Factoid Questions. In Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation. 635–644. https://aclanthology.org/2023.paclic-1.63/
[25] Rudi Studer, V. Richard Benjamins, and Dieter Fensel. 1998. Knowledge engineering: principles and methods. Data Knowl. Eng. 25, 1–2 (1998), 161–197. doi:10.1016/S0169-023X(97)00056-6
[26] Alon Talmor and Jonathan Berant. 2018. The Web as a Knowledge-Base for Answering Complex Questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 641–651. doi:10.18653/v1/N18-1059
[27] Esther Travé Allepuz, Pablo del Fresno Bernal, and Alfred Mauri Martí. 2020. Ontology-Mediated Historical Data Modeling: Theoretical and Practical Tools for an Integrated Construction of the Past. Information 11, 4 (2020). doi:10.3390/info11040182
[28] Elena Volkanovska. 2025. A Study of Errors in the Output of Large Language Models for Domain-Specific Few-Shot Named Entity Recognition. Journal for Language Technology and Computational Linguistics 38, 2 (2025), 31–42. doi:10.21248/jlcl.38.2025.281
[29] Yuze Wang, Mingxiang Shi, Xiulei Qin, et al. 2025. Research on Construction and Application of Knowledge Graph in Science and Technology Field Based on Large Language Model. In Proceedings of the 2025 6th International Conference on Education, Knowledge and Information Management. 343–349. doi:10.1145/3756580.3756635
[30] Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems. Article 1800.
[31] Zhilin Yang, Peng Qi, Saizheng Zhang, et al. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2369–2380. doi:10.18653/v1/D18-1259
[32] Wen-tau Yih, Matthew Richardson, Chris Meek, et al. 2016. The Value of Semantic Parse Labeling for Knowledge Base Question Answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 201–206. doi:10.18653/v1/P16-2033
[33] Lingfeng Zhong, Jia Wu, Qian Li, et al. 2023. A Comprehensive Survey on Automatic Knowledge Graph Construction. ACM Comput. Surv. 56, 4, Article 94 (2023), 62 pages. doi:10.1145/3618295
[34] Yutao Zhu, Huaying Yuan, Shuting Wang, et al. 2025. Large Language Models for Information Retrieval: A Survey. ACM Trans. Inf. Syst. 44, 1, Article 12 (2025), 54 pages. doi:10.1145/3748304

A Persona Descriptions and Prompt

In this section, we present the persona descriptions used in ARLtR for reformulating each base question. Thes...
