pith. machine review for the scientific record.

arxiv: 2604.03127 · v1 · submitted 2026-04-03 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:25 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: RAG · dialogue act annotation · pedagogical dialogue · in-context learning · LLM · tutoring · retrieval augmentation

The pith

Domain-adapted retrieval with utterance-level indexing lets frozen LLMs annotate pedagogical dialogues at expert agreement levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that adapting retrieval alone in a RAG setup, by fine-tuning embeddings on tutoring data and indexing at the utterance level, substantially improves LLM performance on annotating tutoring moves. This yields Cohen's kappa scores of 0.53 to 0.74 on two real-world datasets, beating no-retrieval baselines by large margins. The key is better retrieval of few-shot examples rather than changes to the generative model. This matters because it offers an efficient way to scale high-quality annotation of educational dialogues without retraining large models.
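Concretely, the pipeline described here has three moving parts: embed each labeled training utterance, retrieve the nearest neighbors of a new utterance, and paste them into the prompt as few-shot demonstrations for a frozen LLM. A minimal sketch, with a bag-of-words embedding standing in for the paper's fine-tuned encoder and invented labels standing in for the real TalkMoves/Eedi schemes:

```python
from collections import Counter
import math

# Toy stand-in for the fine-tuned embedding model: a bag-of-words vector.
# The paper uses a lightweight neural encoder adapted on tutoring corpora;
# `embed` here is a placeholder, not the authors' model.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Utterance-level index: each labeled utterance is its own retrieval unit,
# rather than indexing whole dialogues as single chunks. Labels are
# illustrative, not the actual annotation scheme.
index = [
    ("Can you explain why you chose that step?", "Press for reasoning"),
    ("Good job, that's exactly right.", "Praise"),
    ("What do we get if we divide both sides by 2?", "Guiding question"),
]
vectors = [(embed(u), u, label) for u, label in index]

def retrieve(query: str, k: int = 2):
    """Return the k most similar labeled utterances as few-shot examples."""
    q = embed(query)
    ranked = sorted(vectors, key=lambda v: cosine(q, v[0]), reverse=True)
    return [(u, label) for _, u, label in ranked[:k]]

def build_prompt(query: str) -> str:
    """Assemble retrieved demonstrations into an annotation prompt."""
    shots = "\n".join(f'Utterance: "{u}" -> Label: {l}'
                      for u, l in retrieve(query))
    return f'{shots}\nUtterance: "{query}" -> Label:'
```

The generative model never changes; only what lands in the prompt does, which is the whole point of the retrieval-only adaptation.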

Core claim

The domain-adapted RAG pipeline, which fine-tunes a lightweight embedding model on tutoring corpora and indexes at the utterance level to retrieve labeled demonstrations, achieves Cohen's κ of 0.526–0.580 on TalkMoves and 0.659–0.743 on Eedi across three different LLMs, outperforming no-retrieval baselines. It also corrects label biases and improves rare labels, with utterance-level indexing driving the gains.

What carries the argument

Domain-adapted retrieval pipeline that fine-tunes a lightweight embedding model and indexes dialogues at the utterance level to provide in-context labeled examples to a frozen generative LLM.

If this is right

  • Utterance-level indexing improves top-1 label match rates from 39.7% to 62.0% on TalkMoves and 52.9% to 73.1% on Eedi.
  • Retrieval corrects systematic label biases in zero-shot prompting.
  • Largest gains are seen for rare and context-dependent labels.
  • The generative model can remain frozen while still reaching higher annotation quality.
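Cohen's κ, the statistic behind all of these numbers, corrects raw agreement for the agreement two annotators would reach by chance given their marginal label frequencies; values around 0.6–0.8 are conventionally read as substantial agreement. A minimal self-contained implementation (not the paper's evaluation code):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both annotators independently pick the
    # same label, given each annotator's marginal label frequencies.
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    if expected == 1:   # both label lists constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)
```

Perfect agreement gives κ = 1, and agreement at exactly chance level gives κ = 0, which is why κ is a stricter yardstick than raw accuracy on the skewed label distributions typical of tutoring data.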

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar retrieval adaptation could scale annotation to much larger tutoring corpora where manual labeling is prohibitive.
  • The dominance of indexing strategy over embedding quality suggests prioritizing dialogue segmentation in future retrieval designs.
  • Applying this to other specialized dialogue domains like medical or legal conversations may yield comparable efficiency gains.
  • Testing on additional datasets would confirm if the method generalizes to varied tutoring styles.

Load-bearing premise

That the gains from utterance-level domain-adapted retrieval will persist when applied to new tutoring datasets or different large language models not tested in the study.

What would settle it

Running the pipeline on a new independent set of tutoring dialogues with a fourth LLM and observing kappa scores no higher than the no-retrieval baseline.

Figures

Figures reproduced from arXiv: 2604.03127 by Bakhtawar Ahtisham, Jinsook Lee, Kirk Vanacore, Rene F. Kizilcec, Zhuqian Zhou.

Figure 1. Overview of the proposed RAG-based annotation pipeline. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2. Normalized confusion matrices for TalkMoves test (top) and Eedi test (bottom) with … [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3. Confidence score distributions by model and retrieval condition on the combined … [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗
Figure 4. Cohen’s κ as a function of retrieval depth k across three LLM backbones on TalkMoves test (top) and Eedi test (bottom). Left panels show RAG NO FINETUNE, right panels show RAG FINETUNED UTT. Domain-adapted retrieval consistently outperforms general-purpose retrieval at every k for all models. [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗
Original abstract

Automated annotation of pedagogical dialogue is a high-stakes task where LLMs often fail without sufficient domain grounding. We present a domain-adapted RAG pipeline for tutoring move annotation. Rather than fine-tuning the generative model, we adapt retrieval by fine-tuning a lightweight embedding model on tutoring corpora and indexing dialogues at the utterance level to retrieve labeled few-shot demonstrations. Evaluated across two real tutoring dialogue datasets (TalkMoves and Eedi) and three LLM backbones (GPT-5.2, Claude Sonnet 4.6, Qwen3-32b), our best configuration achieves Cohen's $\kappa$ of 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, substantially outperforming no-retrieval baselines ($\kappa = 0.275$-$0.413$ and $0.160$-$0.410$). An ablation study reveals that utterance-level indexing, rather than embedding quality alone, is the primary driver of these gains, with top-1 label match rates improving from 39.7\% to 62.0\% on TalkMoves and 52.9\% to 73.1\% on Eedi under domain-adapted retrieval. Retrieval also corrects systematic label biases present in zero-shot prompting and yields the largest improvements for rare and context-dependent labels. These findings suggest that adapting the retrieval component alone is a practical and effective path toward expert-level pedagogical dialogue annotation while keeping the generative model frozen.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a domain-adapted RAG pipeline for annotating pedagogical dialogue acts: a lightweight embedding model is fine-tuned on tutoring corpora, dialogues are indexed at the utterance level, and the resulting labeled demonstrations are retrieved as few-shot context for frozen LLMs. Evaluated on TalkMoves and Eedi with GPT-5.2, Claude Sonnet 4.6 and Qwen3-32b, the best configuration reports Cohen’s κ of 0.526–0.580 and 0.659–0.743 respectively, outperforming no-retrieval baselines; an ablation attributes the gains primarily to utterance-level indexing rather than embedding quality alone.

Significance. If the retrieval index is strictly disjoint from the test sets, the work supplies concrete evidence that retrieval adaptation alone can lift annotation quality on two real tutoring corpora without touching the generative model, together with an ablation that isolates the contribution of utterance-level indexing. The reported top-1 label match rates (62–73%) and bias-correction effects on rare labels constitute useful, falsifiable benchmarks for future in-context annotation systems.

major comments (2)
  1. [§4] §4 (Experimental Setup) and the abstract: it is not stated whether the TalkMoves and Eedi test utterances (or their labels) were excluded from the tutoring corpora used to fine-tune the embedding model and to build the retrieval index. If any test utterance is retrievable, the reported κ gains (0.526–0.743 vs. 0.160–0.413) and the utterance-level vs. embedding-quality ablation become unreliable due to label leakage.
  2. [§5.2] §5.2 (Ablation Study): the contrast between “utterance-level indexing” and “embedding quality alone” presupposes that both conditions operate on identical held-out data; without an explicit statement that the index for the embedding-only condition also excludes test utterances, the claim that indexing is the primary driver cannot be evaluated.
minor comments (2)
  1. [§5] The abstract and §5 omit statistical significance tests (e.g., paired t-tests or bootstrap CIs) for the κ differences; adding these would strengthen the performance claims.
  2. [§4] Data-split sizes, exact number of utterances per corpus, and the precise train/validation/test partitioning are not reported in §4; these details are needed to reproduce the index construction.
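The significance tests requested in minor comment 1 could take the form of a paired bootstrap: resample test utterances jointly, recompute κ for both conditions on each resample, and report a confidence interval on the difference. A minimal sketch (illustrative, not the authors' code):

```python
import random
from collections import Counter

def cohen_kappa(gold, pred):
    """Chance-corrected agreement between gold labels and predictions."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, pred)) / n
    cg, cp = Counter(gold), Counter(pred)
    expected = sum(cg[l] * cp[l] for l in cg) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def bootstrap_kappa_diff(gold, pred_a, pred_b, reps=2000, seed=0):
    """95% paired bootstrap CI for kappa(A) - kappa(B).

    Utterances are resampled jointly so both systems are always scored
    on the same resampled test set, preserving the pairing.
    """
    rng = random.Random(seed)
    n = len(gold)
    diffs = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        diffs.append(cohen_kappa(g, [pred_a[i] for i in idx])
                     - cohen_kappa(g, [pred_b[i] for i in idx]))
    diffs.sort()
    return diffs[int(0.025 * reps)], diffs[int(0.975 * reps)]
```

An interval excluding zero would support the claim that the retrieval-adapted condition genuinely outperforms the baseline rather than reflecting test-set noise.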

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and for highlighting the need for explicit statements on data partitioning. We address each major comment below and will revise the manuscript to add the requested clarifications.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup) and the abstract: it is not stated whether the TalkMoves and Eedi test utterances (or their labels) were excluded from the tutoring corpora used to fine-tune the embedding model and to build the retrieval index. If any test utterance is retrievable, the reported κ gains (0.526–0.743 vs. 0.160–0.413) and the utterance-level vs. embedding-quality ablation become unreliable due to label leakage.

    Authors: We confirm that the test utterances (and their labels) from both TalkMoves and Eedi were strictly excluded from the corpora used to fine-tune the embedding model and to construct the retrieval index. Only the designated training splits were used for domain adaptation. We will add an explicit statement to this effect in the revised §4 and abstract. revision: yes

  2. Referee: [§5.2] §5.2 (Ablation Study): the contrast between “utterance-level indexing” and “embedding quality alone” presupposes that both conditions operate on identical held-out data; without an explicit statement that the index for the embedding-only condition also excludes test utterances, the claim that indexing is the primary driver cannot be evaluated.

    Authors: All ablation conditions, including the embedding-only baseline, were evaluated on the same held-out test sets and used an index built exclusively from the training splits (i.e., test utterances were excluded in every condition). We will insert a clarifying sentence in §5.2 stating that index construction with respect to test-set exclusion is identical across the compared conditions. revision: yes
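The disjointness the authors assert here is mechanically checkable, and a revision could report the result of such a check. A minimal sketch of the guard (illustrative, not the authors' code):

```python
def check_no_leakage(index_utterances, test_utterances):
    """Verify the retrieval index contains no test utterance verbatim.

    Exact string match after light normalization; catching near-duplicates
    (paraphrases, re-punctuated copies) would need fuzzy matching on top.
    """
    norm = lambda s: " ".join(s.lower().split())
    indexed = {norm(u) for u in index_utterances}
    leaked = [u for u in test_utterances if norm(u) in indexed]
    if leaked:
        raise ValueError(f"{len(leaked)} test utterance(s) found in the index")
    return True
```

Running this once per condition, index pair would turn the rebuttal's assurance into a reproducible artifact.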

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on held-out data

Full rationale

The paper reports an empirical RAG pipeline for dialogue act annotation, with performance measured via Cohen's κ and top-1 match rates on two fixed tutoring datasets (TalkMoves, Eedi) using held-out test splits and three LLM backbones. All claims rest on direct experimental comparisons to no-retrieval baselines and ablations (utterance-level indexing vs. embedding quality). No derivations, equations, or predictions are present that reduce to fitted parameters by construction, self-definitional loops, or load-bearing self-citations. The work is self-contained against external benchmarks and contains no mathematical chain or uniqueness theorem that collapses to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the work relies on standard supervised fine-tuning of embeddings and existing RAG prompting.

pith-pipeline@v0.9.0 · 5585 in / 1118 out tokens · 46864 ms · 2026-05-13T20:25:59.013344+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 1 internal anchor

  2. [2]

    Many-shot in-context learning

    Rishabh Agarwal, Avi Singh, Lei Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John D. Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, and Hugo Larochelle. Many-shot in-context learning. In Advances in Neural Information Processing Systems, 2024

  3. [3]

    AI annotation orchestration: Evaluating LLM verifiers to improve the quality of LLM annotations in learning analytics

    Bakhtawar Ahtisham, Kirk Vanacore, Jinsook Lee, Zhuqian Zhou, Doug Pietrzak, and Rene F. Kizilcec. Ai annotation orchestration: Evaluating llm verifiers to improve the quality of llm annotations in learning analytics, 2025. URL https://arxiv.org/abs/2511.09785

  4. [4]

    Human and llm-based assessment of teaching acts in expert-led explanatory dialogues

    Aliki Anagnostopoulou, Nils Feldhus, Yi-Sheng Hsu, Milad Alshomary, Henning Wachsmuth, and Daniel Sonntag. Human and llm-based assessment of teaching acts in expert-led explanatory dialogues. In Proceedings of the 6th Workshop on Computational Approaches to Discourse, pp. 166–181, 2025

  5. [5]

    Llm2vec: Large language models are secretly powerful text encoders

    Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. Llm2vec: Large language models are secretly powerful text encoders. In Proceedings of the 1st Conference on Language Modeling, 2024

  6. [6]

    Rethinking chunk size for long-document retrieval: A multi-dataset analysis

    Sinchana Ramakanth Bhat, Max Rudat, Jannis Spiekermann, and Nicolas Flores-Herr. Rethinking chunk size for long-document retrieval: A multi-dataset analysis. arXiv preprint arXiv:2505.21700, 2025

  7. [7]

    Retrieval-style in-context learning for few-shot hierarchical text classification

    Huiyao Chen, Yu Zhao, Zulong Chen, Mengjia Wang, Liangyue Li, Meishan Zhang, and Min Zhang. Retrieval-style in-context learning for few-shot hierarchical text classification, 2024. URL https://arxiv.org/abs/2406.17534

  8. [8]

    Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15607–15631, Toronto, Canada, July 2023. Association for Computational Linguistics. doi:10.18653/v1/2023.acl-long.870. URL https://acla...

  9. [9]

    A survey on in-context learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. A survey on in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  10. [10]

    Annollm: Making large language models to be better crowdsourced annotators

    Xingwei He, Zhenghao Lin, Yeyun Gong, A-Long Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, and Weizhu Chen. Annollm: Making large language models to be better crowdsourced annotators. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: ...

  11. [11]

    Efficient Natural Language Response Suggestion for Smart Reply

    Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, László Lukács, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. Efficient natural language response suggestion for smart reply. 2017. URL https://arxiv.org/abs/1705.00652

  12. [12]

    A deep transfer learning approach to modeling teacher discourse in the classroom

    Emily Jensen, Samuel L. Pugh, and Sidney K. D'Mello. A deep transfer learning approach to modeling teacher discourse in the classroom. In Proceedings of the 11th International Learning Analytics and Knowledge Conference, pp. 302–312, 2021

  13. [13]

    Billion-scale similarity search with GPUs

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3): 535–547, 2021

  14. [14]

    Unraveling chatgpt: A critical analysis of ai-generated goal-oriented dialogues and annotations

    Tiziano Labruna, Sofia Brenna, Andrea Zaninello, and Bernardo Magnini. Unraveling chatgpt: A critical analysis of ai-generated goal-oriented dialogues and annotations. In International Conference of the Italian Association for Artificial Intelligence, pp. 151–171. Springer, 2023

  15. [15]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, volume 33, pp. 9459–9474, 2020

  16. [16]

    Llama2vec: Unsupervised adaptation of large language models for dense retrieval

    Chaofan Li, Zheng Liu, Shitao Xiao, Yingxia Shao, and Defu Lian. Llama2vec: Unsupervised adaptation of large language models for dense retrieval. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024 a

  17. [17]

    On the role of long-tail knowledge in retrieval augmented large language models

    Dongyang Li, Junbing Yan, Taolin Zhang, Cheng Wang, Xiaofeng He, Longtao Huang, Hui Xue, and Jun Huang. On the role of long-tail knowledge in retrieval augmented large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Short Papers), 2024 b

  18. [18]

    Correlating student acoustic-prosodic profiles with student learning in spoken tutoring dialogues

    Diane Litman and Kate Forbes-Riley. Correlating student acoustic-prosodic profiles with student learning in spoken tutoring dialogues. In Proceedings of the 12th International Conference on Artificial Intelligence in Education, pp. 1–8, 2005

  19. [19]

    Conceptualizing talk moves as tools: Professional development approaches for academically productive discussions

    S. Michaels and C. O'Connor. Conceptualizing talk moves as tools: Professional development approaches for academically productive discussions. In Lauren B. Resnick, Christa Asterhan, and Sherice N. Clarke (eds.), Socializing Intelligence through Talk and Dialogue, pp. 333–347. American Educational Research Association, Washington DC, 2015

  20. [20]

    Sarah Michaels, Catherine O'Connor, and Lauren B. Resnick. Deliberative discourse idealized and realized: Accountable talk in the classroom and in civic life. Studies in Philosophy and Education, 27(4): 283–297, 2008. doi:10.1007/s11217-007-9071-1

  21. [21]

    Llm-assisted automated deductive coding of dialogue data: leveraging dialogue-specific characteristics to enhance contextual understanding

    Ying Na and Shihui Feng. Llm-assisted automated deductive coding of dialogue data: leveraging dialogue-specific characteristics to enhance contextual understanding. In International Conference on Artificial Intelligence in Education, pp. 248–262. Springer, 2025

  22. [22]

    Towards actionable pedagogical feedback: A multi-perspective analysis of mathematics teaching and tutoring dialogue

    Jannatun Naim, Jie Cao, Fareen Tasneem, Jennifer Jacobs, Brent Milne, James Martin, and Tamara Sumner. Towards actionable pedagogical feedback: A multi-perspective analysis of mathematics teaching and tutoring dialogue. In Proceedings of the 18th International Conference on Educational Data Mining, 2025

  23. [23]

    ChatGPT vs. crowdsourcing vs. experts: Annotating open-domain conversations with speech functions

    Lidiia Ostyakova, Veronika Smilga, Kseniia Petukhova, Maria Molchanova, and Daniel Kornev. Chatgpt vs. crowdsourcing vs. experts: Annotating open-domain conversations with speech functions. In Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 242–254, 2023

  24. [24]

    Do LLMs understand dialogues? A case study on dialogue acts

    Ayesha Qamar, Jonathan Tong, and Ruihong Huang. Do LLMs understand dialogues? a case study on dialogue acts. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 26219–26237, Vienna, Austria, July 2025. Association for Computational Linguistics. doi:10.18653/v1/2025.acl-long.1271. URL ht...

  25. [25]

    Edudcm: a novel framework for automatic educational dialogue classification dataset construction via distant supervision and large language models

    Changyong Qi, Longwei Zheng, Yuang Wei, Haoxin Xu, Peiji Chen, and Xiaoqing Gu. Edudcm: a novel framework for automatic educational dialogue classification dataset construction via distant supervision and large language models. Applied Sciences, 15(1): 154, 2024

  26. [26]

    Exploring knowledge tracing in tutor-student dialogues using LLMs

    Alexander Scarlatos, Ryan S. Baker, and Andrew Lan. Exploring knowledge tracing in tutor-student dialogues using llms. In Proceedings of the 15th International Learning Analytics and Knowledge Conference, pp. 1–10, 2025

  27. [27]

    Using large language models to generate, validate, and apply user intent taxonomies

    Chirag Shah, Ryen White, Reid Andersen, Georg Buscher, Scott Counts, Sarkar Das, Ali Montazer, Sathish Manivannan, Jennifer Neville, Nagu Rangan, et al. Using large language models to generate, validate, and apply user intent taxonomies. ACM Transactions on the Web, 19(3): 1–29, 2025

  28. [28]

    Annotating educational dialog act with data augmentation in online one-on-one tutoring

    Dou Shan, Dong Wang, Chen Zhang, Kimberly Britt Kao, and Carol Ka Yuk Chan. Annotating educational dialog act with data augmentation in online one-on-one tutoring. In Proceedings of the 24th International Conference on Artificial Intelligence in Education, pp. 473–486, 2023

  29. [29]

    Large language models for automating fine-grained speech act annotation: A critical evaluation of gpt-4o and deepseek

    Hang Su and Jun Ye. Large language models for automating fine-grained speech act annotation: A critical evaluation of gpt-4o and deepseek. Corpus Pragmatics, pp. 1–20, 2025

  30. [30]

    The TalkMoves dataset: K-12 mathematics lesson transcripts annotated for teacher and student discursive moves

    Abhijit Suresh, Jennifer Jacobs, Charis Harty, Margaret Perkoff, James H. Martin, and Tamara Sumner. The talkmoves dataset: K-12 mathematics lesson transcripts annotated for teacher and student discursive moves. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 4654–4662, Marseille, France, June 2022. European Language R...

  31. [31]

    Does informativeness matter? active learning for educational dialogue act classification

    Wei Tan et al. Does informativeness matter? active learning for educational dialogue act classification. In Proceedings of the 24th International Conference on Artificial Intelligence in Education, pp. 115–127, 2023

  32. [32]

    Reliable annotations with less effort: Evaluating llm-human collaboration in search clarifications

    Leila Tavakoli and Hamed Zamani. Reliable annotations with less effort: Evaluating llm-human collaboration in search clarifications. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR), ICTIR '25, pp. 92–102. Association for Computing Machinery, 2025. doi:10.1145/3731120.3744574

  33. [33]

    Identifying effective moves in tutoring: On the refinement of dialogue act annotation schemes

    Andrew K. Vail and Kristy Elizabeth Boyer. Identifying effective moves in tutoring: On the refinement of dialogue act annotation schemes. In Proceedings of the 12th International Conference on Intelligent Tutoring Systems, pp. 199–209, 2014

  34. [34]

    Document segmentation matters for retrieval-augmented generation

    Zhitong Wang, Cheng Gao, Chaojun Xiao, et al. Document segmentation matters for retrieval-augmented generation. In Findings of the Association for Computational Linguistics: ACL 2025, 2025

  35. [35]

    C-Pack: Packed Resources For General Chinese Embeddings

    Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-pack: Packaged resources to advance general chinese embedding. arXiv preprint arXiv:2309.07597, 2023

  36. [36]

    SimRAG: Self-improving retrieval-augmented generation for adapting large language models to specialized domains

    Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, and Qi He. Simrag: Self-improving retrieval-augmented generation for adapting large language models to specialized domains. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics, 2025

  37. [37]

    Robust educational dialogue act classifiers with low-resource and imbalanced datasets

    Linjuan Yang, Philipp Christmann, and Dragan Gasevic. Robust educational dialogue act classifiers with low-resource and imbalanced datasets. In Proceedings of the 24th International Conference on Artificial Intelligence in Education, pp. 114–126, 2023

  38. [38]

    Assessing the potential of llm-assisted annotation for corpus-based pragmatics and discourse analysis: The case of apology

    Danni Yu, Luyang Li, Hang Su, and Matteo Fuoli. Assessing the potential of llm-assisted annotation for corpus-based pragmatics and discourse analysis: The case of apology. International Journal of Corpus Linguistics, 29(4): 534–561, 2024

  39. [39]

    Question-anchored tutoring dialogues

    Matthew Zent, Digory Smith, and Simon Woodhead. Question-anchored tutoring dialogues. https://huggingface.co/datasets/Eedi/Question-Anchored-Tutoring-Dialogues-2k, 2025. Eedi dataset. Accessed: 2026-03-29

  40. [40]

    @esa (Ref

    \@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

  41. [41]

    \@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

  42. [42]

    @open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...