Recognition: no theorem link
Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models
Pith reviewed 2026-05-15 10:35 UTC · model grok-4.3
The pith
Chain-of-illocution prompting expands queries into implicit questions to raise source adherence in RAG explanations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Illocutionary macro-planning, realized through chain-of-illocution prompting, functions as a descriptive design principle that improves source faithfulness in RAG by expanding each query into a set of implicit explanatory questions that guide retrieval from authoritative textbooks, yielding higher source-adherence scores than standard RAG baselines.
What carries the argument
Chain-of-illocution prompting (CoI), which expands a query into implicit explanatory questions that drive retrieval from source texts.
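The mechanism can be sketched as a minimal pipeline: expand the query with fixed illocutionary question templates, retrieve passages per sub-question, and answer only from the pooled context. The templates, function names, and the `retrieve`/`llm` callables below are illustrative assumptions for the sketch, not the paper's actual implementation.

```python
# Sketch of chain-of-illocution (CoI) prompting over a generic retriever
# and LLM. All names here are hypothetical placeholders.

ILLOCUTIONARY_TEMPLATES = [
    "What is {topic}?",
    "Why does {topic} behave this way?",
    "How is {topic} used in practice?",
]

def expand_query(query: str) -> list[str]:
    """Macro-plan: expand a query into implicit explanatory questions."""
    return [t.format(topic=query) for t in ILLOCUTIONARY_TEMPLATES]

def coi_answer(query, retrieve, llm, k=2):
    """Retrieve passages per implicit question, then generate a grounded answer.

    `retrieve(question, k)` returns up to k source passages;
    `llm(prompt)` returns generated text.
    """
    passages = []
    for question in expand_query(query):
        passages.extend(retrieve(question, k))
    # Deduplicate passages while preserving retrieval order.
    seen, context = set(), []
    for p in passages:
        if p not in seen:
            seen.add(p)
            context.append(p)
    prompt = ("Answer using only these passages:\n"
              + "\n".join(context)
              + "\nQ: " + query)
    return llm(prompt)
```

Because the answer prompt contains only retrieved passages, every claim in the output can in principle be traced back to a source span, which is what the adherence metrics measure.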
If this is right
- RAG pipelines for textbook-grounded education can reach higher traceability without additional training.
- Explanations become easier for users to verify against the exact retrieved passages.
- The method applies uniformly across multiple large language models.
- Gains in adherence leave user ratings of relevance and correctness intact.
Where Pith is reading between the lines
- Similar query-expansion planning could be tested in domains outside programming, such as legal or medical retrieval.
- The same mechanism might reduce cases where models cite sources they have not actually used.
- Automating the generation of the implicit questions could further lower the need for manual prompt engineering.
Load-bearing premise
Expanding queries into implicit explanatory questions will reliably produce better retrieval and higher measured adherence to the provided source documents.
What would settle it
An experiment that applies chain-of-illocution prompting to the same set of Stack Overflow questions and textbooks but records no statistically significant rise in source-adherence metrics relative to standard RAG.
Figures
Original abstract
Natural language explanations produced by large language models (LLMs) are often persuasive, but not necessarily scrutable: users cannot easily verify whether the claims in an explanation are supported by evidence. In XAI, this motivates a focus on faithfulness and traceability, i.e., the extent to which an explanation's claims can be grounded in, and traced back to, an explicit source. We study these desiderata in retrieval-augmented generation (RAG) for programming education, where textbooks provide authoritative evidence. We benchmark six LLMs on 90 Stack Overflow questions grounded in three programming textbooks and quantify source faithfulness via source adherence metrics. We find that non-RAG models have a median source adherence of 0%, while baseline RAG systems still exhibit low median adherence (22-40%, depending on the model). Motivated by Achinstein's illocutionary theory of explanation, we introduce illocutionary macro-planning as a descriptive design principle for source-faithful explanations and instantiate it with chain-of-illocution prompting (CoI), which expands a query into implicit explanatory questions that drive retrieval. Across models, CoI yields statistically significant gains (up to 63%) in source adherence, although absolute adherence remains moderate and the gains are weak or non-significant for some models. A user study with 165 retained participants (220 recruited) indicates that these gains do not harm satisfaction, relevance, or perceived correctness.
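The source-adherence measurement described in the abstract can be sketched as a simple traceability check: the fraction of explanation sentences that can be matched to some retrieved passage. The token-overlap proxy below is a crude stand-in; the paper's actual metric is not specified on this page.

```python
def source_adherence(explanation_sentences, retrieved_passages, threshold=0.5):
    """Fraction of explanation sentences traceable to a retrieved passage.

    Uses bag-of-words overlap as a rough proxy for support; a sentence
    counts as supported if at least `threshold` of its tokens appear in
    some passage.
    """
    def overlap(sentence, passage):
        s = set(sentence.lower().split())
        p = set(passage.lower().split())
        return len(s & p) / max(len(s), 1)

    supported = sum(
        1 for sent in explanation_sentences
        if any(overlap(sent, p) >= threshold for p in retrieved_passages)
    )
    return supported / max(len(explanation_sentences), 1)
```

Under a metric of this shape, a non-RAG answer with no passage support scores 0%, matching the median reported for the non-RAG baselines.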
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper motivates illocutionary macro-planning from Achinstein's theory of explanation and instantiates it as chain-of-illocution (CoI) prompting, which expands queries into implicit explanatory questions to drive retrieval in RAG systems. It evaluates six LLMs on 90 Stack Overflow questions grounded in three programming textbooks, reporting statistically significant source-adherence gains (up to 63%) over baseline RAG, while a user study with 165 participants finds no negative impact on satisfaction, relevance, or perceived correctness.
Significance. If the attribution to illocutionary structure is confirmed, the work supplies a theoretically grounded design principle for source-faithful explanations in educational RAG, with measurable adherence improvements and preserved user experience. The combination of quantitative metrics and a sizable user study adds practical value for XAI in programming-education settings.
Major comments (1)
- Experimental evaluation (implicitly §4–5): CoI is compared only against baseline RAG (no expansion) and non-RAG models. No control condition expands the query into an equivalent number of non-illocutionary questions (e.g., generic factual or procedural expansions). Without this ablation, the reported adherence gains cannot be attributed specifically to the illocutionary framing rather than to increased query specificity or retrieval volume, directly weakening the central claim that Achinstein's theory provides a useful design principle.
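The ablation the referee requests could be set up by matching the two conditions for the number of sub-questions, so that any adherence gap isolates the illocutionary framing from sheer retrieval volume. The templates below are hypothetical placeholders, not drawn from the paper.

```python
# Sketch of a matched-expansion control condition: equal numbers of
# illocutionary vs. generic sub-questions per query.

ILLOCUTIONARY = [
    "Why does {q} happen?",
    "How does {q} work?",
    "What is the purpose behind {q}?",
]

GENERIC = [
    "{q}",
    "Details about {q}",
    "Information on {q}",
]

def expand(query: str, templates: list[str]) -> list[str]:
    """Instantiate each template with the query."""
    return [t.format(q=query) for t in templates]

def matched_conditions(query: str) -> dict[str, list[str]]:
    """Return CoI and control expansions matched for sub-question count."""
    coi = expand(query, ILLOCUTIONARY)
    control = expand(query, GENERIC)
    assert len(coi) == len(control)  # matched retrieval volume
    return {"coi": coi, "control": control}
```

Running both conditions through the same retriever and adherence metric would show whether gains survive once query count and specificity are controlled.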
Minor comments (1)
- Abstract: the statement that absolute adherence remains moderate would be more informative if the exact median adherence values under CoI were reported alongside the relative gains.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful review. The recommendation for major revision is noted, and we address the primary concern regarding experimental controls below. We commit to revisions that strengthen the attribution of results to the illocutionary design principle while preserving the core contributions.
Point-by-point responses
- Referee: Experimental evaluation (implicitly §4–5): CoI is compared only against baseline RAG (no expansion) and non-RAG models. No control condition expands the query into an equivalent number of non-illocutionary questions (e.g., generic factual or procedural expansions). Without this ablation, the reported adherence gains cannot be attributed specifically to the illocutionary framing rather than to increased query specificity or retrieval volume, directly weakening the central claim that Achinstein's theory provides a useful design principle.
- Authors: We agree that this is a substantive limitation in the current design. The manuscript compares CoI against baseline RAG (original query, no expansion) and non-RAG models, but does not include a control that performs query expansion using an equivalent number of non-illocutionary questions. This leaves open the possibility that gains stem from increased retrieval volume or specificity rather than the specific illocutionary structure derived from Achinstein's theory. In the revised manuscript we will add such a control condition (e.g., generic factual or procedural rephrasings matched for number of sub-questions) and report source-adherence metrics for it. We will also include a brief theoretical discussion clarifying why the illocutionary question types (purpose, mechanism, etc.) are expected to differ from generic expansions in their effect on source grounding. These additions will allow readers to better assess the unique contribution of the theory.
- Revision planned: yes
Circularity Check
No circularity: empirical gains measured against independent baselines
Full rationale
The paper's chain begins with an external citation to Achinstein's illocutionary theory, introduces illocutionary macro-planning as a design principle, instantiates it as CoI prompting, and reports empirical source-adherence improvements (up to 63%) versus standard RAG baselines on Stack Overflow questions. No equations or definitions reduce the claimed gains to the inputs by construction; adherence metrics are defined independently of the prompting method, and no self-citation supplies a uniqueness theorem or ansatz that forces the result. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Achinstein's illocutionary theory of explanation provides a useful framework for designing source-faithful explanations in retrieval-augmented generation.
Reference graph
Works this paper leans on
- [1] Achinstein, P.: The Nature of Explanation. Oxford University Press (1985)
- [2] Augenstein, I., Baldwin, T., Cha, M., Chakraborty, T., Ciampaglia, G.L., Corney, D., DiResta, R., Ferrara, E., Hale, S., Halevy, A., et al.: Factuality challenges in the era of large language models. arXiv preprint arXiv:2310.05189 (2023)
- [3] Augenstein, I., Baldwin, T., Cha, M., Chakraborty, T., Ciampaglia, G.L., Corney, D., DiResta, R., Ferrara, E., Hale, S., Halevy, A., et al.: Factuality challenges in the era of large language models and opportunities for fact-checking. Nature Machine Intelligence 6(8), 852–863 (2024)
- [4] Baquero, A.: Net promoter score (NPS) and customer satisfaction: relationship and efficient management. Sustainability 14(4), 2011 (2022)
- [5] Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57(1), 289–300 (1995)
- [6] Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., et al.: A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15(3), 1–45 (2024)
- [7] Chen, L., Zaharia, M., Zou, J.: How is ChatGPT's behavior changing over time? CoRR abs/2307.09009 (2023), https://doi.org/10.48550/arXiv.2307.09009
- [8] Da Silva, L., Samhi, J., Khomh, F.: ChatGPT vs Llama: Impact, reliability, and challenges in Stack Overflow discussions. arXiv preprint arXiv:2402.08801 (2024)
- [9] Dale, R., Moisl, H., Somers, H.: Handbook of Natural Language Processing. CRC Press (2000)
- [10] Department for Education of the UK: A guide to effective practice in curriculum planning. https://assets.publishing.service.gov.uk/media/65fd8652f1d3a0001d32ae0d/A_Guide_To_Effective_Practice_In_Curriculum_Planning_-_March_2024.pdf (March 2024), accessed: 2025-02-13
- [11] Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., Weston, J.: Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495 (2023)
- [12] Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., Weston, J.: Chain-of-verification reduces hallucination in large language models. In: Findings of the Association for Computational Linguistics: ACL 2024, pp. 3563–3578 (2024)
- [13] Downey, A.: Think Python: How to Think Like a Computer Scientist. Green Tea Press (2015)
- [14] Ducasse, S., Chloupis, D., Hess, N., Zagidulin, D.: Pharo by Example 5 (2018)
- [15] Eck, D.J.: Introduction to Programming Using Java. Hobart and William Smith Colleges (2022)
- [16] Elgohary, A., Peskov, D., Boyd-Graber, J.: Can you unpack that? Learning to rewrite questions-in-context (2019)
- [17] Es, S., James, J., Anke, L.E., Schockaert, S.: RAGAS: Automated evaluation of retrieval augmented generation. In: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pp. 150–158 (2024)
- [18] Fisher, N.I., Kordupleski, R.E.: Good and bad market research: A critical review of net promoter score. Applied Stochastic Models in Business and Industry 35(1), 138–151 (2019)
- [19] Gao, T., Yen, H., Yu, J., Chen, D.: Enabling large language models to generate text with citations. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6465–6488 (2023)
- [20] Glatthorn, A.: Developing a Quality Curriculum. Waveland Press (2004), ISBN 9781478631101, URL https://books.google.it/books?id=RaN2CgAAQBAJ
- [21] Hilton, J.: Open educational resources and college textbook choices: A review of research on efficacy and perceptions. Educational Technology Research and Development 64, 573–590 (2016), https://doi.org/10.1007/s11423-016-9434-9
- [22] Holland, J., Holyoak, K., Nisbett, R., Thagard, P.: Induction: Processes of Inference, Learning, and Discovery. Bradford Books, MIT Press (1986), ISBN 9780262580960, URL https://books.google.it/books?id=Z6EFBaLApE8C
- [23] Jacovi, A., Goldberg, Y.: Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4198–4205 (2020)
- [24] Kabir, S., Udo-Imeh, D.N., Kou, B., Zhang, T.: Is Stack Overflow obsolete? An empirical study of the characteristics of ChatGPT answers to Stack Overflow questions. In: Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–17 (2024)
- [25] Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., et al.: ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103, 102274 (2023)
- [26] Khattab, O., Santhanam, K., Li, X.L., Hall, D., Liang, P., Potts, C., Zaharia, M.: Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024 (2022)
- [27] Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T.T., Moazam, H., Miller, H., Zaharia, M., Potts, C.: DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In: The Twelfth International Conference on Learning Representations (2024)
- [28] Kondadadi, R., Howald, B., Schilder, F.: A statistical NLG framework for aggregated planning and realization. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1406–1415 (2013)
- [29] Kostric, I., Balog, K.: A surprisingly simple yet effective multi-query rewriting method for conversational passage retrieval. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2271–2275 (2024)
- [30] Lakens, D.: Sample size justification. Collabra: Psychology 8(1), 33267 (2022)
- [31] Levelt, W.J.: Producing spoken language: A blueprint of the speaker. The Neurocognition of Language 83, 122 (1999)
- [32] Liu, J., Tang, X., Li, L., Chen, P., Liu, Y.: ChatGPT vs. Stack Overflow: An exploratory comparison of programming assistance tools. In: 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security Companion (QRS-C), pp. 364–373, IEEE (2023)
- [33] Masoudnia, S., Ebrahimpour, R.: Mixture of experts: a literature survey. Artificial Intelligence Review 42, 275–293 (2014)
- [34] Mayes, G.R.: Theories of explanation (2001), URL https://iep.utm.edu/explanat/
- [35] Min, S., Krishna, K., Lyu, X., et al.: FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pp. 12076–12100, Association for Computational Linguistics (2023)
- [36] Nasehi, S.M., Sillito, J., Maurer, F., Burns, C.: What makes a good code example? A study of programming Q&A in Stack Overflow. In: 2012 28th IEEE International Conference on Software Maintenance (ICSM), pp. 25–34, IEEE (2012)
- [37] Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N., Lewis, M.: Measuring and narrowing the compositionality gap in language models. In: Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711 (2023)
- [38] Puduppully, R., Lapata, M.: Data-to-text generation with macro planning. Transactions of the Association for Computational Linguistics 9, 510–527 (2021)
- [39] Rashkin, H., Nikolaev, V., Lamm, M., Aroyo, L., Collins, M., Das, D., Petrov, S., Tomar, G.S., Turc, I., Reitter, D.: Measuring attribution in natural language generation models. Computational Linguistics 49(4), 777–840 (2023)
- [40] Sellars, W.: Science, Perception and Reality. Humanities Press, New York (1963)
- [41] Sovrano, F.: How to explain: from theory to practice. Ph.D. thesis, University of Bologna (June 2023), https://doi.org/10.48676/unibo/amsdottorato/10943, URL http://amsdottorato.unibo.it/10943/
- [42] Sovrano, F., Ashley, K., Brusilovsky, P.L., Vitali, F.: How to improve the explanatory power of an intelligent textbook: a case study in legal writing. International Journal of Artificial Intelligence in Education, pp. 1–35 (2024), https://doi.org/10.1007/s40593-024-00399-w
- [43] Sovrano, F., Vilone, G., Lognoul, M., Longo, L.: Legal XAI: a systematic review and interdisciplinary mapping of XAI and EU law, towards a research agenda for legally responsible AI (2025), https://doi.org/10.2139/ssrn.5371124
- [44] Sovrano, F., Vitali, F.: An objective metric for explainable AI: How and why to estimate the degree of explainability. Knowledge-Based Systems 278, 110866 (2023), https://doi.org/10.1016/j.knosys.2023.110866
- [45] Sovrano, F., Vitali, F.: Perlocution vs illocution: How different interpretations of the act of explaining impact on the evaluation of explanations and XAI. In: World Conference on Explainable Artificial Intelligence, pp. 25–47, Springer (2023), https://doi.org/10.1007/978-3-031-44064-9_2
- [46] Spataru, A., Hambro, E., Voita, E., Cancedda, N.: Know when to stop: A study of semantic drift in text generation. arXiv preprint arXiv:2404.05411 (2024)
- [47] Stack Overflow: Temporary policy: Generative AI (e.g., ChatGPT) is banned. https://meta.stackoverflow.com/questions/421831/temporary-policy-generative-ai-e-g-chatgpt-is-banned (2023), accessed: 2024-07-24
- [48] Sun, K., Xu, Y., Zha, H., Liu, Y., Dong, X.L.: Head-to-tail: How knowledgeable are large language models (LLMs)? A.k.a. will LLMs replace knowledge graphs? In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 311–325 (2024)
- [49] Tlili, A., Shehata, B., Adarkwah, M.A., Bozkurt, A., Hickey, D.T., Huang, R., Agyemang, B.: What if the devil is my guardian angel: ChatGPT as a case study of using chatbots in education. Smart Learning Environments 10(1), 15 (2023)
- [50] Trivedi, H., Balasubramanian, N., Khot, T., Sabharwal, A.: Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 10014–10037 (2023)
- [51] Vu, T., Iyyer, M., Wang, X., Constant, N., Wei, J., Wei, J., Tar, C., Sung, Y.H., Zhou, D., Le, Q., et al.: FreshLLMs: Refreshing large language models with search engine augmentation. arXiv preprint arXiv:2310.03214 (2023)
- [52] Xue, F., Zheng, Z., Fu, Y., Ni, J., Zheng, Z., Zhou, W., You, Y.: OpenMoE: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739 (2024)
- [53] Yang, X., Sun, K., Xin, H., Sun, Y., Bhalla, N., Chen, X., Choudhary, S., Gui, R.D., Jiang, Z.W., Jiang, Z., et al.: CRAG: Comprehensive RAG benchmark. arXiv preprint arXiv:2406.04744 (2024)
- [54] Yu, S., Liu, J., Yang, J., Xiong, C., Bennett, P., Gao, J., Liu, Z.: Few-shot generative conversational query rewriting. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1933–1936 (2020)