Recognition: no theorem link
Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models
Pith reviewed 2026-05-15 10:35 UTC · model grok-4.3
The pith
Chain-of-illocution prompting expands queries into implicit questions to raise source adherence in RAG explanations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Illocutionary macro-planning, realized through chain-of-illocution prompting, functions as a descriptive design principle that improves source faithfulness in RAG by expanding each query into a set of implicit explanatory questions that guide retrieval from authoritative textbooks, yielding higher source-adherence scores than standard RAG baselines.
What carries the argument
Chain-of-illocution prompting (CoI), which expands a query into implicit explanatory questions that drive retrieval from source texts.
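The mechanism can be sketched as a minimal pipeline: expand the query with fixed illocutionary question templates, retrieve passages per sub-question, and answer only from the pooled context. The templates, function names, and the `retrieve`/`llm` callables below are illustrative assumptions for the sketch, not the paper's actual implementation.

```python
# Sketch of chain-of-illocution (CoI) prompting over a generic retriever
# and LLM. All names here are hypothetical placeholders.

ILLOCUTIONARY_TEMPLATES = [
    "What is {topic}?",
    "Why does {topic} behave this way?",
    "How is {topic} used in practice?",
]

def expand_query(query: str) -> list[str]:
    """Macro-plan: expand a query into implicit explanatory questions."""
    return [t.format(topic=query) for t in ILLOCUTIONARY_TEMPLATES]

def coi_answer(query, retrieve, llm, k=2):
    """Retrieve passages per implicit question, then generate a grounded answer.

    `retrieve(question, k)` returns up to k source passages;
    `llm(prompt)` returns generated text.
    """
    passages = []
    for question in expand_query(query):
        passages.extend(retrieve(question, k))
    # Deduplicate passages while preserving retrieval order.
    seen, context = set(), []
    for p in passages:
        if p not in seen:
            seen.add(p)
            context.append(p)
    prompt = ("Answer using only these passages:\n"
              + "\n".join(context)
              + "\nQ: " + query)
    return llm(prompt)
```

Because the answer prompt contains only retrieved passages, every claim in the output can in principle be traced back to a source span, which is what the adherence metrics measure.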
If this is right
- RAG pipelines for textbook-grounded education can reach higher traceability without additional training.
- Explanations become easier for users to verify against the exact retrieved passages.
- The method applies uniformly across multiple large language models.
- Gains in adherence leave user ratings of relevance and correctness intact.
Where Pith is reading between the lines
- Similar query-expansion planning could be tested in domains outside programming, such as legal or medical retrieval.
- The same mechanism might reduce cases where models cite sources they have not actually used.
- Automating the generation of the implicit questions could further lower the need for manual prompt engineering.
Load-bearing premise
Expanding queries into implicit explanatory questions will reliably produce better retrieval and higher measured adherence to the provided source documents.
What would settle it
An experiment that applies chain-of-illocution prompting to the same set of Stack Overflow questions and textbooks but records no statistically significant rise in source-adherence metrics relative to standard RAG.
Figures
Original abstract
Natural language explanations produced by large language models (LLMs) are often persuasive, but not necessarily scrutable: users cannot easily verify whether the claims in an explanation are supported by evidence. In XAI, this motivates a focus on faithfulness and traceability, i.e., the extent to which an explanation's claims can be grounded in, and traced back to, an explicit source. We study these desiderata in retrieval-augmented generation (RAG) for programming education, where textbooks provide authoritative evidence. We benchmark six LLMs on 90 Stack Overflow questions grounded in three programming textbooks and quantify source faithfulness via source adherence metrics. We find that non-RAG models have a median source adherence of 0%, while baseline RAG systems still exhibit low median adherence (22-40%, depending on the model). Motivated by Achinstein's illocutionary theory of explanation, we introduce illocutionary macro-planning as a descriptive design principle for source-faithful explanations and instantiate it with chain-of-illocution prompting (CoI), which expands a query into implicit explanatory questions that drive retrieval. Across models, CoI yields statistically significant gains (up to 63%) in source adherence, although absolute adherence remains moderate and the gains are weak or non-significant for some models. A user study with 165 retained participants (220 recruited) indicates that these gains do not harm satisfaction, relevance, or perceived correctness.
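The source-adherence measurement described in the abstract can be sketched as a simple traceability check: the fraction of explanation sentences that can be matched to some retrieved passage. The token-overlap proxy below is a crude stand-in; the paper's actual metric is not specified on this page.

```python
def source_adherence(explanation_sentences, retrieved_passages, threshold=0.5):
    """Fraction of explanation sentences traceable to a retrieved passage.

    Uses bag-of-words overlap as a rough proxy for support; a sentence
    counts as supported if at least `threshold` of its tokens appear in
    some passage.
    """
    def overlap(sentence, passage):
        s = set(sentence.lower().split())
        p = set(passage.lower().split())
        return len(s & p) / max(len(s), 1)

    supported = sum(
        1 for sent in explanation_sentences
        if any(overlap(sent, p) >= threshold for p in retrieved_passages)
    )
    return supported / max(len(explanation_sentences), 1)
```

Under a metric of this shape, a non-RAG answer with no passage support scores 0%, matching the median reported for the non-RAG baselines.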
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper motivates illocutionary macro-planning from Achinstein's theory of explanation and instantiates it as chain-of-illocution (CoI) prompting, which expands queries into implicit explanatory questions to drive retrieval in RAG systems. It evaluates six LLMs on 90 Stack Overflow questions grounded in three programming textbooks, reporting statistically significant source-adherence gains (up to 63%) over baseline RAG, while a user study with 165 participants finds no negative impact on satisfaction, relevance, or perceived correctness.
Significance. If the attribution to illocutionary structure is confirmed, the work supplies a theoretically grounded design principle for source-faithful explanations in educational RAG, with measurable adherence improvements and preserved user experience. The combination of quantitative metrics and a sizable user study adds practical value for XAI in programming-education settings.
Major comments (1)
- Experimental evaluation (implicitly §4–5): CoI is compared only against baseline RAG (no expansion) and non-RAG models. No control condition expands the query into an equivalent number of non-illocutionary questions (e.g., generic factual or procedural expansions). Without this ablation, the reported adherence gains cannot be attributed specifically to the illocutionary framing rather than to increased query specificity or retrieval volume, directly weakening the central claim that Achinstein's theory provides a useful design principle.
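The ablation the referee requests could be set up by matching the two conditions for the number of sub-questions, so that any adherence gap isolates the illocutionary framing from sheer retrieval volume. The templates below are hypothetical placeholders, not drawn from the paper.

```python
# Sketch of a matched-expansion control condition: equal numbers of
# illocutionary vs. generic sub-questions per query.

ILLOCUTIONARY = [
    "Why does {q} happen?",
    "How does {q} work?",
    "What is the purpose behind {q}?",
]

GENERIC = [
    "{q}",
    "Details about {q}",
    "Information on {q}",
]

def expand(query: str, templates: list[str]) -> list[str]:
    """Instantiate each template with the query."""
    return [t.format(q=query) for t in templates]

def matched_conditions(query: str) -> dict[str, list[str]]:
    """Return CoI and control expansions matched for sub-question count."""
    coi = expand(query, ILLOCUTIONARY)
    control = expand(query, GENERIC)
    assert len(coi) == len(control)  # matched retrieval volume
    return {"coi": coi, "control": control}
```

Running both conditions through the same retriever and adherence metric would show whether gains survive once query count and specificity are controlled.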
Minor comments (1)
- Abstract: the statement that absolute adherence remains moderate would be more informative if the exact median adherence values under CoI were reported alongside the relative gains.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful review. The recommendation for major revision is noted, and we address the primary concern regarding experimental controls below. We commit to revisions that strengthen the attribution of results to the illocutionary design principle while preserving the core contributions.
Point-by-point responses
- Referee: Experimental evaluation (implicitly §4–5): CoI is compared only against baseline RAG (no expansion) and non-RAG models. No control condition expands the query into an equivalent number of non-illocutionary questions (e.g., generic factual or procedural expansions). Without this ablation, the reported adherence gains cannot be attributed specifically to the illocutionary framing rather than to increased query specificity or retrieval volume, directly weakening the central claim that Achinstein's theory provides a useful design principle.
- Authors: We agree that this is a substantive limitation in the current design. The manuscript compares CoI against baseline RAG (original query, no expansion) and non-RAG models, but does not include a control that performs query expansion using an equivalent number of non-illocutionary questions. This leaves open the possibility that gains stem from increased retrieval volume or specificity rather than the specific illocutionary structure derived from Achinstein's theory. In the revised manuscript we will add such a control condition (e.g., generic factual or procedural rephrasings matched for number of sub-questions) and report source-adherence metrics for it. We will also include a brief theoretical discussion clarifying why the illocutionary question types (purpose, mechanism, etc.) are expected to differ from generic expansions in their effect on source grounding. These additions will allow readers to better assess the unique contribution of the theory.
- Revision planned: yes
Circularity Check
No circularity: empirical gains measured against independent baselines
Full rationale
The paper's chain begins with an external citation to Achinstein's illocutionary theory, introduces illocutionary macro-planning as a design principle, instantiates it as CoI prompting, and reports empirical source-adherence improvements (up to 63%) versus standard RAG baselines on Stack Overflow questions. No equations or definitions reduce the claimed gains to the inputs by construction; adherence metrics are defined independently of the prompting method, and no self-citation supplies a uniqueness theorem or ansatz that forces the result. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Achinstein's illocutionary theory of explanation provides a useful framework for designing source-faithful explanations in retrieval-augmented generation.
Reference graph
Works this paper leans on
- [1] Achinstein, P.: The Nature of Explanation. Oxford University Press (1985)
- [2] Augenstein, I., Baldwin, T., Cha, M., Chakraborty, T., Ciampaglia, G.L., Corney, D., DiResta, R., Ferrara, E., Hale, S., Halevy, A., et al.: Factuality challenges in the era of large language models. arXiv preprint arXiv:2310.05189 (2023)
- [3] Augenstein, I., Baldwin, T., Cha, M., Chakraborty, T., Ciampaglia, G.L., Corney, D., DiResta, R., Ferrara, E., Hale, S., Halevy, A., et al.: Factuality challenges in the era of large language models and opportunities for fact-checking. Nature Machine Intelligence 6(8), 852–863 (2024)
- [4] Baquero, A.: Net promoter score (NPS) and customer satisfaction: relationship and efficient management. Sustainability 14(4), 2011 (2022)
- [5] Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57(1), 289–300 (1995)
- [6] Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., et al.: A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15(3), 1–45 (2024)
- [7] Chen, L., Zaharia, M., Zou, J.: How is ChatGPT's behavior changing over time? CoRR abs/2307.09009 (2023), https://doi.org/10.48550/arXiv.2307.09009
- [8] Da Silva, L., Samhi, J., Khomh, F.: ChatGPT vs Llama: Impact, reliability, and challenges in Stack Overflow discussions. arXiv preprint arXiv:2402.08801 (2024)
- [9] Dale, R., Moisl, H., Somers, H.: Handbook of Natural Language Processing. CRC Press (2000)
- [10] Department for Education of the UK: A guide to effective practice in curriculum planning. https://assets.publishing.service.gov.uk/media/65fd8652f1d3a0001d32ae0d/A_Guide_To_Effective_Practice_In_Curriculum_Planning_-_March_2024.pdf (March 2024), accessed: 2025-02-13
- [11] Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., Weston, J.: Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495 (2023)
- [12] Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., Weston, J.: Chain-of-verification reduces hallucination in large language models. In: Findings of the Association for Computational Linguistics: ACL 2024, pp. 3563–3578 (2024)
- [13] Downey, A.: Think Python: How to Think Like a Computer Scientist. Green Tea Press (2015)
- [14] Ducasse, S., Chloupis, D., Hess, N., Zagidulin, D.: Pharo by Example 5 (2018)
- [15] Eck, D.J.: Introduction to Programming Using Java. Hobart and William Smith Colleges (2022)
- [16] Elgohary, A., Peskov, D., Boyd-Graber, J.: Can you unpack that? Learning to rewrite questions-in-context (2019)
- [17] Es, S., James, J., Anke, L.E., Schockaert, S.: RAGAS: Automated evaluation of retrieval augmented generation. In: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pp. 150–158 (2024)
- [18] Fisher, N.I., Kordupleski, R.E.: Good and bad market research: A critical review of net promoter score. Applied Stochastic Models in Business and Industry 35(1), 138–151 (2019)
- [19] Gao, T., Yen, H., Yu, J., Chen, D.: Enabling large language models to generate text with citations. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6465–6488 (2023)
- [20] Glatthorn, A.: Developing a Quality Curriculum. Waveland Press (2004), ISBN 9781478631101, URL https://books.google.it/books?id=RaN2CgAAQBAJ
- [21] Hilton, J.: Open educational resources and college textbook choices: A review of research on efficacy and perceptions. Educational Technology Research and Development 64, 573–590 (2016), https://doi.org/10.1007/s11423-016-9434-9
- [22] Holland, J., Holyoak, K., Nisbett, R., Thagard, P.: Induction: Processes of Inference, Learning, and Discovery. Bradford Books, MIT Press (1986), ISBN 9780262580960, URL https://books.google.it/books?id=Z6EFBaLApE8C
- [23] Jacovi, A., Goldberg, Y.: Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4198–4205 (2020)
- [24] Kabir, S., Udo-Imeh, D.N., Kou, B., Zhang, T.: Is Stack Overflow obsolete? An empirical study of the characteristics of ChatGPT answers to Stack Overflow questions. In: Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–17 (2024)
- [25] Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., et al.: ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103, 102274 (2023)
- [26] Khattab, O., Santhanam, K., Li, X.L., Hall, D., Liang, P., Potts, C., Zaharia, M.: Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024 (2022)
- [27] Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T.T., Moazam, H., Miller, H., Zaharia, M., Potts, C.: DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In: The Twelfth International Conference on Learning Representations (2024)
- [28] Kondadadi, R., Howald, B., Schilder, F.: A statistical NLG framework for aggregated planning and realization. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1406–1415 (2013)
- [29] Kostric, I., Balog, K.: A surprisingly simple yet effective multi-query rewriting method for conversational passage retrieval. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2271–2275 (2024)
- [30] Lakens, D.: Sample size justification. Collabra: Psychology 8(1), 33267 (2022)
- [31] Levelt, W.J.: Producing spoken language: A blueprint of the speaker. The Neurocognition of Language 83, 122 (1999)
- [32] Liu, J., Tang, X., Li, L., Chen, P., Liu, Y.: ChatGPT vs. Stack Overflow: An exploratory comparison of programming assistance tools. In: 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security Companion (QRS-C), pp. 364–373, IEEE (2023)
- [33] Masoudnia, S., Ebrahimpour, R.: Mixture of experts: a literature survey. Artificial Intelligence Review 42, 275–293 (2014)
- [34] Mayes, G.R.: Theories of explanation (2001), URL https://iep.utm.edu/explanat/
- [35] Min, S., Krishna, K., Lyu, X., et al.: FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pp. 12076–12100, Association for Computational Linguistics (2023)
- [36] Nasehi, S.M., Sillito, J., Maurer, F., Burns, C.: What makes a good code example? A study of programming Q&A in Stack Overflow. In: 2012 28th IEEE International Conference on Software Maintenance (ICSM), pp. 25–34, IEEE (2012)
- [37] Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N., Lewis, M.: Measuring and narrowing the compositionality gap in language models. In: Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711 (2023)
- [38] Puduppully, R., Lapata, M.: Data-to-text generation with macro planning. Transactions of the Association for Computational Linguistics 9, 510–527 (2021)
- [39] Rashkin, H., Nikolaev, V., Lamm, M., Aroyo, L., Collins, M., Das, D., Petrov, S., Tomar, G.S., Turc, I., Reitter, D.: Measuring attribution in natural language generation models. Computational Linguistics 49(4), 777–840 (2023)
- [40] Sellars, W.: Science, Perception and Reality. Humanities Press, New York (1963)
- [41] Sovrano, F.: How to explain: from theory to practice. Ph.D. thesis, University of Bologna (June 2023), https://doi.org/10.48676/unibo/amsdottorato/10943, URL http://amsdottorato.unibo.it/10943/
- [42] Sovrano, F., Ashley, K., Brusilovsky, P.L., Vitali, F.: How to improve the explanatory power of an intelligent textbook: a case study in legal writing. International Journal of Artificial Intelligence in Education, pp. 1–35 (2024), https://doi.org/10.1007/s40593-024-00399-w
- [43] Sovrano, F., Vilone, G., Lognoul, M., Longo, L.: Legal XAI: a systematic review and interdisciplinary mapping of XAI and EU law, towards a research agenda for legally responsible AI (2025), https://doi.org/10.2139/ssrn.5371124
- [44] Sovrano, F., Vitali, F.: An objective metric for explainable AI: How and why to estimate the degree of explainability. Knowledge-Based Systems 278, 110866 (2023), https://doi.org/10.1016/j.knosys.2023.110866
- [45] Sovrano, F., Vitali, F.: Perlocution vs illocution: How different interpretations of the act of explaining impact on the evaluation of explanations and XAI. In: World Conference on Explainable Artificial Intelligence, pp. 25–47, Springer (2023), https://doi.org/10.1007/978-3-031-44064-9_2
- [46] Spataru, A., Hambro, E., Voita, E., Cancedda, N.: Know when to stop: A study of semantic drift in text generation. arXiv preprint arXiv:2404.05411 (2024)
- [47] Stack Overflow: Temporary policy: Generative AI (e.g., ChatGPT) is banned. https://meta.stackoverflow.com/questions/421831/temporary-policy-generative-ai-e-g-chatgpt-is-banned (2023), accessed: 2024-07-24
- [48] Sun, K., Xu, Y., Zha, H., Liu, Y., Dong, X.L.: Head-to-tail: How knowledgeable are large language models (LLMs)? A.k.a. will LLMs replace knowledge graphs? In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 311–325 (2024)
- [49] Tlili, A., Shehata, B., Adarkwah, M.A., Bozkurt, A., Hickey, D.T., Huang, R., Agyemang, B.: What if the devil is my guardian angel: ChatGPT as a case study of using chatbots in education. Smart Learning Environments 10(1), 15 (2023)
- [50] Trivedi, H., Balasubramanian, N., Khot, T., Sabharwal, A.: Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 10014–10037 (2023)
- [51] Vu, T., Iyyer, M., Wang, X., Constant, N., Wei, J., Wei, J., Tar, C., Sung, Y.H., Zhou, D., Le, Q., et al.: FreshLLMs: Refreshing large language models with search engine augmentation. arXiv preprint arXiv:2310.03214 (2023)
- [52] Xue, F., Zheng, Z., Fu, Y., Ni, J., Zheng, Z., Zhou, W., You, Y.: OpenMoE: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739 (2024)
- [53] Yang, X., Sun, K., Xin, H., Sun, Y., Bhalla, N., Chen, X., Choudhary, S., Gui, R.D., Jiang, Z.W., Jiang, Z., et al.: CRAG: Comprehensive RAG benchmark. arXiv preprint arXiv:2406.04744 (2024)
- [54] Yu, S., Liu, J., Yang, J., Xiong, C., Bennett, P., Gao, J., Liu, Z.: Few-shot generative conversational query rewriting. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1933–1936 (2020)