pith · machine review for the scientific record

arXiv: 2604.06211 · v2 · submitted 2026-03-16 · 💻 cs.CL · cs.AI · cs.SE

Recognition: no theorem link

Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:35 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.SE
keywords retrieval-augmented generation · source faithfulness · explanations · large language models · programming education · prompting methods

The pith

Chain-of-illocution prompting expands queries into implicit questions to raise source adherence in RAG explanations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to make explanations from retrieval-augmented language models more verifiable against their source documents, using programming textbooks as the evidence base. Non-RAG models show a median source adherence of 0%, and baseline RAG systems still show low median adherence (22–40%, depending on the model). Drawing on Achinstein's illocutionary theory of explanation, the authors introduce macro-planning that decomposes a user query into implicit explanatory questions; these questions then steer retrieval. The resulting chain-of-illocution prompting yields source-adherence gains of up to 63%, statistically significant for several tested models though weak or non-significant for others, and a user study finds no significant decline in satisfaction, relevance, or perceived correctness.

Core claim

Illocutionary macro-planning, realized through chain-of-illocution prompting, functions as a descriptive design principle that improves source faithfulness in RAG by expanding each query into a set of implicit explanatory questions that guide retrieval from authoritative textbooks, yielding higher source-adherence scores than standard RAG baselines.

What carries the argument

Chain-of-illocution prompting (CoI), which expands a query into implicit explanatory questions that drive retrieval from source texts.
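The mechanism can be sketched in code. The paper's actual planner and retriever are LLM-based; `plan_illocutions` and `retrieve` below are hypothetical stand-ins (template expansion and lexical overlap) so the control flow is runnable end to end.

```python
def plan_illocutions(query: str) -> list[str]:
    # Hypothetical illocution planner: in the paper this step is performed by
    # an LLM that proposes implicit explanatory questions for the user query.
    templates = [
        "What is {}?",                  # definitional
        "Why does {} matter?",          # purpose
        "How does {} work in practice?" # mechanism
    ]
    topic = query.rstrip("?")
    return [t.format(topic) for t in templates]

def retrieve(question: str, corpus: list[str], k: int = 1) -> list[str]:
    # Toy lexical retriever: rank passages by word overlap with the question.
    q_words = set(question.lower().split())
    ranked = sorted(corpus, key=lambda p: -len(q_words & set(p.lower().split())))
    return ranked[:k]

def build_coi_prompt(query: str, corpus: list[str]) -> str:
    # CoI additions to standard RAG: per-question retrieval, then
    # concatenation of (Q1, C1), ..., (Qn, Cn) into the final prompt.
    pairs = []
    for i, q in enumerate(plan_illocutions(query), start=1):
        context = " ".join(retrieve(q, corpus))
        pairs.append(f"Q{i}: {q}\nC{i}: {context}")
    joined = "\n\n".join(pairs)
    return (
        "Answer the user question using ONLY the contexts below.\n\n"
        f"{joined}\n\nUser question: {query}"
    )
```

Each implicit question retrieves its own context, so a single user query fans out into several targeted lookups against the textbook corpus before generation.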

If this is right

  • RAG pipelines for textbook-grounded education can reach higher traceability without additional training.
  • Explanations become easier for users to verify against the exact retrieved passages.
  • The method transfers across multiple large language models, though the size of the gain varies by model.
  • Gains in adherence leave user ratings of relevance and correctness intact.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar query-expansion planning could be tested in domains outside programming, such as legal or medical retrieval.
  • The same mechanism might reduce cases where models cite sources they have not actually used.
  • Automating the generation of the implicit questions could further lower the need for manual prompt engineering.

Load-bearing premise

Expanding queries into implicit explanatory questions will reliably produce better retrieval and higher measured adherence to the provided source documents.
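What "measured adherence" looks like concretely: the paper's exact adherence metrics are not reproduced in this review, but the following sketch shows the general shape of such a measure under the assumption that adherence is scored sentence-by-sentence — the fraction of explanation sentences whose content words are covered by the retrieved source passages.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "that"}

def content_words(text: str) -> set[str]:
    # Lowercased alphabetic tokens minus a small stopword list.
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def adherence(explanation: str, sources: list[str], threshold: float = 0.8) -> float:
    """Share of explanation sentences mostly covered by the source passages.

    An illustrative proxy, not the paper's metric: a sentence counts as
    adherent when at least `threshold` of its content words appear somewhere
    in the retrieved sources.
    """
    vocab = set().union(*(content_words(s) for s in sources)) if sources else set()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", explanation.strip()) if s]
    if not sentences:
        return 0.0
    adherent = sum(
        1 for sent in sentences
        if (words := content_words(sent)) and len(words & vocab) / len(words) >= threshold
    )
    return adherent / len(sentences)
```

With no retrieved sources the score is 0.0 for any explanation, mirroring the 0% median adherence reported for non-RAG models.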

What would settle it

An experiment that applies chain-of-illocution prompting to the same set of Stack Overflow questions and textbooks but records no statistically significant rise in source-adherence metrics relative to standard RAG.

Figures

Figures reproduced from arXiv: 2604.06211 by Francesco Sovrano and Alberto Bacchelli.

Figure 1. Source faithfulness across six LLMs, with one-sided p-values (significant in bold), Cohen's dz effect sizes, and 95% confidence intervals (CI).
Figure 2. Chain-of-illocution pipeline. Black arrows denote the standard RAG flow from the user question to retrieval, prompting, and final explanation. Green arrows denote the CoI additions: an illocution planner that proposes implicit explanatory questions, per-question retrieval from the knowledge source, and the concatenation of question–context pairs (Q1, C1), …, (Qn, Cn) into the final prompt.
Figure 3 (p. 6). Examples of questions are: "What is dependency injection?"; "How do …"
Figure 3 (p. 16). Study interface with an explanation generated via chain-of-illocution prompting.
Figure 4. User study results for GPT-3.5-turbo.
Figure 5. User study results for GPT-4o.
Original abstract

Natural language explanations produced by large language models (LLMs) are often persuasive, but not necessarily scrutable: users cannot easily verify whether the claims in an explanation are supported by evidence. In XAI, this motivates a focus on faithfulness and traceability, i.e., the extent to which an explanation's claims can be grounded in, and traced back to, an explicit source. We study these desiderata in retrieval-augmented generation (RAG) for programming education, where textbooks provide authoritative evidence. We benchmark six LLMs on 90 Stack Overflow questions grounded in three programming textbooks and quantify source faithfulness via source adherence metrics. We find that non Retrieval-Augmented Generation (RAG) models have median source adherence of 0%, while baseline RAG systems still exhibit low median adherence (22-40%, depending on the model). Motivated by Achinstein's illocutionary theory of explanation, we introduce illocutionary macro-planning as a descriptive design principle for source-faithful explanations and instantiate it with chain-of-illocution prompting (CoI), which expands a query into implicit explanatory questions that drive retrieval. Across models, CoI yields statistically significant gains (up to 63%) in source adherence, although absolute adherence remains moderate and the gains are weak or non-significant for some models. A user study with 165 retained participants (220 recruited) indicates that these gains do not harm satisfaction, relevance, or perceived correctness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper motivates illocutionary macro-planning from Achinstein's theory of explanation and instantiates it as chain-of-illocution (CoI) prompting, which expands queries into implicit explanatory questions to drive retrieval in RAG systems. It evaluates six LLMs on 90 Stack Overflow questions grounded in three programming textbooks, reporting statistically significant source-adherence gains (up to 63%) over baseline RAG, while a user study with 165 participants finds no negative impact on satisfaction, relevance, or perceived correctness.

Significance. If the attribution to illocutionary structure is confirmed, the work supplies a theoretically grounded design principle for source-faithful explanations in educational RAG, with measurable adherence improvements and preserved user experience. The combination of quantitative metrics and a sizable user study adds practical value for XAI in programming-education settings.

major comments (1)
  1. Experimental evaluation (implicitly §4–5): CoI is compared only against baseline RAG (no expansion) and non-RAG models. No control condition expands the query into an equivalent number of non-illocutionary questions (e.g., generic factual or procedural expansions). Without this ablation, the reported adherence gains cannot be attributed specifically to the illocutionary framing rather than to increased query specificity or retrieval volume, directly weakening the central claim that Achinstein's theory provides a useful design principle.
minor comments (1)
  1. Abstract: the statement that absolute adherence remains moderate would be more informative if the exact median adherence values under CoI were reported alongside the relative gains.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and insightful review. The recommendation for major revision is noted, and we address the primary concern regarding experimental controls below. We commit to revisions that strengthen the attribution of results to the illocutionary design principle while preserving the core contributions.

Point-by-point responses
  1. Referee: Experimental evaluation (implicitly §4–5): CoI is compared only against baseline RAG (no expansion) and non-RAG models. No control condition expands the query into an equivalent number of non-illocutionary questions (e.g., generic factual or procedural expansions). Without this ablation, the reported adherence gains cannot be attributed specifically to the illocutionary framing rather than to increased query specificity or retrieval volume, directly weakening the central claim that Achinstein's theory provides a useful design principle.

    Authors: We agree that this is a substantive limitation in the current design. The manuscript compares CoI against baseline RAG (original query, no expansion) and non-RAG models, but does not include a control that performs query expansion using an equivalent number of non-illocutionary questions. This leaves open the possibility that gains stem from increased retrieval volume or specificity rather than the specific illocutionary structure derived from Achinstein's theory. In the revised manuscript we will add such a control condition (e.g., generic factual or procedural rephrasings matched for number of sub-questions) and report source-adherence metrics for it. We will also include a brief theoretical discussion clarifying why the illocutionary question types (purpose, mechanism, etc.) are expected to differ from generic expansions in their effect on source grounding. These additions will allow readers to better assess the unique contribution of the theory. revision: yes
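The committed control can be made concrete. The sketch below (all templates hypothetical, not taken from the paper) expands each query into the same number of sub-questions under both conditions, so any adherence gap between them can be attributed to the illocutionary framing rather than to retrieval volume.

```python
# Hypothetical expansion templates; the paper's planner is LLM-based.
ILLOCUTIONARY = [
    "What is {q}?",            # definitional
    "Why is {q} used?",        # purpose
    "How does {q} work?",      # mechanism
]

GENERIC = [
    "{q}",                     # original query repeated
    "Information about {q}",   # generic factual rephrasing
    "Details on {q}",          # generic factual rephrasing
]

def expand(query: str, templates: list[str]) -> list[str]:
    return [t.format(q=query.rstrip("?")) for t in templates]

def matched_conditions(query: str) -> dict[str, list[str]]:
    # Both conditions produce the same number of sub-questions, isolating
    # the effect of the illocutionary framing from that of expansion volume.
    coi = expand(query, ILLOCUTIONARY)
    control = expand(query, GENERIC)
    assert len(coi) == len(control)
    return {"RAG+CoI": coi, "control": control}
```

Running both conditions through the same retriever and scoring adherence per condition would supply the ablation the referee asks for.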

Circularity Check

0 steps flagged

No circularity: empirical gains measured against independent baselines

full rationale

The paper's chain begins with an external citation to Achinstein's illocutionary theory, introduces illocutionary macro-planning as a design principle, instantiates it as CoI prompting, and reports empirical source-adherence improvements (up to 63%) versus standard RAG baselines on Stack Overflow questions. No equations or definitions reduce the claimed gains to the inputs by construction; adherence metrics are defined independently of the prompting method, and no self-citation supplies a uniqueness theorem or ansatz that forces the result. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of Achinstein's illocutionary theory to RAG prompting design and the assumption that the chosen adherence metrics accurately reflect source faithfulness without post-hoc adjustments.

axioms (1)
  • domain assumption Achinstein's illocutionary theory of explanation provides a useful framework for designing source-faithful explanations in retrieval-augmented generation
    Directly motivates the introduction of illocutionary macro-planning and CoI prompting.

pith-pipeline@v0.9.0 · 5568 in / 1198 out tokens · 51366 ms · 2026-05-15T10:35:37.733176+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages

  1. Achinstein, P.: The Nature of Explanation. Oxford University Press (1985)

  2. Augenstein, I., Baldwin, T., Cha, M., Chakraborty, T., Ciampaglia, G.L., Corney, D., DiResta, R., Ferrara, E., Hale, S., Halevy, A., et al.: Factuality challenges in the era of large language models. arXiv preprint arXiv:2310.05189 (2023)

  3. Augenstein, I., Baldwin, T., Cha, M., Chakraborty, T., Ciampaglia, G.L., Corney, D., DiResta, R., Ferrara, E., Hale, S., Halevy, A., et al.: Factuality challenges in the era of large language models and opportunities for fact-checking. Nature Machine Intelligence 6(8), 852–863 (2024)

  4. Baquero, A.: Net promoter score (NPS) and customer satisfaction: relationship and efficient management. Sustainability 14(4), 2011 (2022)

  5. Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57(1), 289–300 (1995)

  6. Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., et al.: A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15(3), 1–45 (2024)

  7. Chen, L., Zaharia, M., Zou, J.: How is ChatGPT's behavior changing over time? CoRR abs/2307.09009 (2023), https://doi.org/10.48550/arXiv.2307.09009

  8. Da Silva, L., Samhi, J., Khomh, F.: ChatGPT vs LLaMA: Impact, reliability, and challenges in Stack Overflow discussions. arXiv preprint arXiv:2402.08801 (2024)

  9. Dale, R., Moisl, H., Somers, H.: Handbook of Natural Language Processing. CRC Press (2000)

  10. Department for Education of the UK: A guide to effective practice in curriculum planning. https://assets.publishing.service.gov.uk/media/65fd8652f1d3a0001d32ae0d/A_Guide_To_Effective_Practice_In_Curriculum_Planning_-_March_2024.pdf (March 2024), accessed: 2025-02-13

  11. Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., Weston, J.: Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495 (2023)

  12. Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., Weston, J.: Chain-of-verification reduces hallucination in large language models. In: Findings of the Association for Computational Linguistics: ACL 2024, pp. 3563–3578 (2024)

  13. Downey, A.: Think Python: How to Think Like a Computer Scientist. Green Tea Press (2015)

  14. Ducasse, S., Chloupis, D., Hess, N., Zagidulin, D.: Pharo by Example 5 (2018)

  15. Eck, D.J.: Introduction to Programming Using Java. Hobart and William Smith Colleges (2022)

  16. Elgohary, A., Peskov, D., Boyd-Graber, J.: Can you unpack that? Learning to rewrite questions-in-context (2019)

  17. Es, S., James, J., Anke, L.E., Schockaert, S.: RAGAS: Automated evaluation of retrieval augmented generation. In: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pp. 150–158 (2024)

  18. Fisher, N.I., Kordupleski, R.E.: Good and bad market research: A critical review of net promoter score. Applied Stochastic Models in Business and Industry 35(1), 138–151 (2019)

  19. Gao, T., Yen, H., Yu, J., Chen, D.: Enabling large language models to generate text with citations. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6465–6488 (2023)

  20. Glatthorn, A.: Developing a Quality Curriculum. Waveland Press (2004), ISBN 9781478631101, URL https://books.google.it/books?id=RaN2CgAAQBAJ

  21. Hilton, J.: Open educational resources and college textbook choices: A review of research on efficacy and perceptions. Educational Technology Research and Development 64, 573–590 (2016), https://doi.org/10.1007/s11423-016-9434-9

  22. Holland, J., Holyoak, K., Nisbett, R., Thagard, P.: Induction: Processes of Inference, Learning, and Discovery. Bradford Books, MIT Press (1986), ISBN 9780262580960, URL https://books.google.it/books?id=Z6EFBaLApE8C

  23. Jacovi, A., Goldberg, Y.: Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4198–4205 (2020)

  24. Kabir, S., Udo-Imeh, D.N., Kou, B., Zhang, T.: Is Stack Overflow obsolete? An empirical study of the characteristics of ChatGPT answers to Stack Overflow questions. In: Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–17 (2024)

  25. Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., et al.: ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103, 102274 (2023)

  26. Khattab, O., Santhanam, K., Li, X.L., Hall, D., Liang, P., Potts, C., Zaharia, M.: Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024 (2022)

  27. Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T.T., Moazam, H., Miller, H., Zaharia, M., Potts, C.: DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In: The Twelfth International Conference on Learning Representations (2024)

  28. Kondadadi, R., Howald, B., Schilder, F.: A statistical NLG framework for aggregated planning and realization. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1406–1415 (2013)

  29. Kostric, I., Balog, K.: A surprisingly simple yet effective multi-query rewriting method for conversational passage retrieval. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2271–2275 (2024)

  30. Lakens, D.: Sample size justification. Collabra: Psychology 8(1), 33267 (2022)

  31. Levelt, W.J.: Producing spoken language: A blueprint of the speaker. The Neurocognition of Language 83, 122 (1999)

  32. Liu, J., Tang, X., Li, L., Chen, P., Liu, Y.: ChatGPT vs. Stack Overflow: An exploratory comparison of programming assistance tools. In: 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security Companion (QRS-C), pp. 364–373, IEEE (2023)

  33. Masoudnia, S., Ebrahimpour, R.: Mixture of experts: a literature survey. Artificial Intelligence Review 42, 275–293 (2014)

  34. Mayes, G.R.: Theories of explanation (2001), URL https://iep.utm.edu/explanat/

  35. Min, S., Krishna, K., Lyu, X., et al.: FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6–10, 2023, pp. 12076–12100, Association for Computational Linguistics

  36. Nasehi, S.M., Sillito, J., Maurer, F., Burns, C.: What makes a good code example? A study of programming Q&A in Stack Overflow. In: 2012 28th IEEE International Conference on Software Maintenance (ICSM), pp. 25–34, IEEE (2012)

  37. Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N., Lewis, M.: Measuring and narrowing the compositionality gap in language models. In: Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711 (2023)

  38. Puduppully, R., Lapata, M.: Data-to-text generation with macro planning. Transactions of the Association for Computational Linguistics 9, 510–527 (2021)

  39. Rashkin, H., Nikolaev, V., Lamm, M., Aroyo, L., Collins, M., Das, D., Petrov, S., Tomar, G.S., Turc, I., Reitter, D.: Measuring attribution in natural language generation models. Computational Linguistics 49(4), 777–840 (2023)

  40. Sellars, W.: Science, Perception and Reality. New York: Humanities Press (1963)

  41. Sovrano, F.: How to explain: from theory to practice. Ph.D. thesis, University of Bologna (June 2023), https://doi.org/10.48676/unibo/amsdottorato/10943, URL http://amsdottorato.unibo.it/10943/

  42. Sovrano, F., Ashley, K., Brusilovsky, P.L., Vitali, F.: How to improve the explanatory power of an intelligent textbook: a case study in legal writing. International Journal of Artificial Intelligence in Education, pp. 1–35 (2024), https://doi.org/10.1007/s40593-024-00399-w

  43. Sovrano, F., Vilone, G., Lognoul, M., Longo, L.: Legal XAI: a systematic review and interdisciplinary mapping of XAI and EU law, towards a research agenda for legally responsible AI (2025), https://doi.org/10.2139/ssrn.5371124

  44. Sovrano, F., Vitali, F.: An objective metric for explainable AI: How and why to estimate the degree of explainability. Knowledge-Based Systems 278, 110866 (2023), https://doi.org/10.1016/j.knosys.2023.110866

  45. Sovrano, F., Vitali, F.: Perlocution vs illocution: How different interpretations of the act of explaining impact on the evaluation of explanations and XAI. In: World Conference on Explainable Artificial Intelligence, pp. 25–47, Springer (2023), https://doi.org/10.1007/978-3-031-44064-9_2

  46. Spataru, A., Hambro, E., Voita, E., Cancedda, N.: Know when to stop: A study of semantic drift in text generation. arXiv preprint arXiv:2404.05411 (2024)

  47. Stack Overflow: Temporary policy: Generative AI (e.g., ChatGPT) is banned. https://meta.stackoverflow.com/questions/421831/temporary-policy-generative-ai-e-g-chatgpt-is-banned (2023), accessed: 2024-07-24

  48. Sun, K., Xu, Y., Zha, H., Liu, Y., Dong, X.L.: Head-to-tail: How knowledgeable are large language models (LLMs)? A.k.a. will LLMs replace knowledge graphs? In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 311–325 (2024)

  49. Tlili, A., Shehata, B., Adarkwah, M.A., Bozkurt, A., Hickey, D.T., Huang, R., Agyemang, B.: What if the devil is my guardian angel: ChatGPT as a case study of using chatbots in education. Smart Learning Environments 10(1), 15 (2023)

  50. Trivedi, H., Balasubramanian, N., Khot, T., Sabharwal, A.: Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 10014–10037 (2023)

  51. Vu, T., Iyyer, M., Wang, X., Constant, N., Wei, J., Wei, J., Tar, C., Sung, Y.H., Zhou, D., Le, Q., et al.: FreshLLMs: Refreshing large language models with search engine augmentation. arXiv preprint arXiv:2310.03214 (2023)

  52. Xue, F., Zheng, Z., Fu, Y., Ni, J., Zheng, Z., Zhou, W., You, Y.: OpenMoE: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739 (2024)

  53. Yang, X., Sun, K., Xin, H., Sun, Y., Bhalla, N., Chen, X., Choudhary, S., Gui, R.D., Jiang, Z.W., Jiang, Z., et al.: CRAG – comprehensive RAG benchmark. arXiv preprint arXiv:2406.04744 (2024)

  54. Yu, S., Liu, J., Yang, J., Xiong, C., Bennett, P., Gao, J., Liu, Z.: Few-shot generative conversational query rewriting. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1933–1936 (2020)