arxiv: 2604.21137 · v2 · submitted 2026-04-22 · 💻 cs.CL · cs.AI

Enhancing Science Classroom Discourse Analysis through Joint Multi-Task Learning for Reasoning-Component Classification

Pith reviewed 2026-05-09 23:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords classroom discourse analysis · multi-task learning · reasoning component classification · LLM data augmentation · utterance type · science education · inferential reasoning · lag-sequential analysis

The pith

An automated classifier using multi-task learning and LLM-augmented data jointly tags classroom utterances by type and reasoning component, showing teacher feedback-with-question moves as the strongest lead-in to student inferential answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Manual coding of science classroom talk to track how students build knowledge is too slow for broad studies, so the paper builds an automated system that labels both utterance types and reasoning components at once. It handles rare categories by splitting the data carefully and adding LLM-generated synthetic examples before training a RoBERTa model with two prediction heads. The trained system then maps discourse sequences across sessions and finds that teacher moves combining feedback and questions most reliably precede students producing inferential reasoning. This setup also shows the reasoning-component task stays workable even with simple word-based baselines while the augmentation step lifts performance on underrepresented utterance types.

Core claim

The paper introduces an automated discourse analysis system (ADAS) that jointly classifies teacher and student utterances along Utterance Type (UT) and Reasoning Component (RC) dimensions using a dual-probe head RoBERTa-base model. To manage severe class imbalance it applies stratified splitting and LLM-based synthetic augmentation targeted at minority classes. A zero-shot GPT baseline reaches macro-F1 0.467 for UT and 0.476 for RC, which the paper treats as an upper bound for prompt-only approaches and as the motivation for fine-tuning. Subsequent pattern analyses establish that teacher Feedback-with-Question (Fq) moves are the most consistent antecedents of student inferential reasoning (SR-I) and that the RC task remains tractable for lexical baselines.
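
The stratified splitting step can be illustrated with a minimal sketch. It assumes one label per example and a hypothetical `stratified_split` helper; how the paper stratifies over two label dimensions at once is not described on this page, so this is not the authors' procedure. Apart from `Fq`, the label names and all counts are invented.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) preserving per-class proportions.

    Hypothetical helper: single-label stratification only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Classes with a single example stay in train; others send at least one to test.
        n_test = max(1, round(len(indices) * test_fraction)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy corpus with one rare class (counts are invented).
labels = ["Q"] * 50 + ["F"] * 40 + ["Fq"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
```

With these toy counts the rare class contributes 2 of its 10 examples to the test side, so it is represented in both partitions rather than landing in one by chance.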

What carries the argument

Dual-probe head RoBERTa-base classifier trained jointly on Utterance Type and Reasoning Component labels, with LLM synthetic data augmentation to correct minority-class imbalance.
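
One common way to train two heads jointly is to sum the per-task cross-entropy losses. The sketch below shows only that arithmetic on plain Python lists. The weight `rc_weight`, the function names, and the toy logits are assumptions for illustration; the loss the paper actually uses is not stated on this page and may differ.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, computed stably."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]

def joint_loss(ut_logits, ut_target, rc_logits, rc_target, rc_weight=1.0):
    """Weighted sum of the two head losses (rc_weight is a hypothetical knob)."""
    ut_loss = cross_entropy(ut_logits, ut_target)
    rc_loss = cross_entropy(rc_logits, rc_target)
    return ut_loss + rc_weight * rc_loss

# Toy logits standing in for the outputs of two heads over one shared encoding.
loss = joint_loss([2.0, 0.5, -1.0], 0, [0.1, 0.1], 1)
```

Because both terms are computed from the same encoder output, gradients from either task update the shared representation, which is the usual rationale for joint multi-task training.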

If this is right

  • LLM augmentation raises recognition rates for the rarest utterance types without harming overall accuracy.
  • The reasoning-component task can be solved reliably even by simple lexical baselines because of its structural simplicity.
  • Teacher Feedback-with-Question moves emerge as the clearest trigger for student inferential reasoning across sessions.
  • Session-level Cognitive Complexity Index and lag-sequential patterns become computable at scale from automatic labels.
  • Zero-shot GPT performance provides a concrete upper bound for prompt-only methods, which is what motivates fine-tuning.
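
The conditional-probability part of a lag-sequential analysis can be sketched as follows, assuming each turn is a `(speaker, utterance_type)` pair in session order. The function name and the toy session are hypothetical, and a full lag-sequential analysis would normally add a significance statistic that this sketch omits. `Fq` and `Rq` are codes that appear on this page; the other codes are placeholders.

```python
from collections import Counter, defaultdict

def lag1_teacher_to_student(turns):
    """Estimate P(student UT | preceding teacher UT) from adjacent turns."""
    counts = defaultdict(Counter)
    for (prev_speaker, prev_ut), (next_speaker, next_ut) in zip(turns, turns[1:]):
        # Only teacher-to-student transitions at lag 1 are counted.
        if prev_speaker == "T" and next_speaker == "S":
            counts[prev_ut][next_ut] += 1
    return {
        teacher_ut: {ut: n / sum(row.values()) for ut, n in row.items()}
        for teacher_ut, row in counts.items()
    }

# Invented toy session, for illustration only.
session = [("T", "Fq"), ("S", "Rq"), ("T", "Q"), ("S", "R"),
           ("T", "Fq"), ("S", "Rq"), ("T", "Fq"), ("S", "R")]
probs = lag1_teacher_to_student(session)
```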

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint-labeling pipeline could be retrained on recordings from other subjects to test whether the link from teacher Feedback-with-Question (Fq) moves to student inferential reasoning (SR-I) holds outside science.
  • If the identified sequences prove stable, they could inform targeted professional-development modules that coach teachers on specific move types.
  • Real-time versions of the classifier might eventually supply live prompts to instructors during lessons to increase inferential student turns.
  • The approach opens the possibility of large-scale comparative studies across many classrooms without proportional increases in human coding effort.

Load-bearing premise

The synthetic minority-class examples produced by the LLM match the linguistic and semantic properties of real classroom utterances closely enough that they do not distort classifier accuracy on held-out real data.

What would settle it

Train the model on the augmented corpus, then measure its F1 on a fresh set of real, unaugmented minority-class utterances; if performance falls below the non-augmented baseline, the augmentation benefit is refuted.
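
The comparison described above comes down to per-class F1 on real held-out labels. A minimal sketch, assuming two sets of predictions from hypothetical models (one trained with augmentation, one without); all data here is invented and the helper names are not from the paper.

```python
def per_class_f1(gold, pred):
    """Return {label: F1} over every label seen in gold or pred."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)

# Invented toy data: real held-out labels and two hypothetical prediction sets.
gold          = ["Fq", "Fq", "Q", "Q", "Q", "F"]
pred_baseline = ["Q",  "Fq", "Q", "Q", "Q", "F"]
pred_augment  = ["Fq", "Fq", "Q", "Q", "Fq", "F"]

baseline_fq = per_class_f1(gold, pred_baseline)["Fq"]
augment_fq = per_class_f1(gold, pred_augment)["Fq"]
augmentation_helps = augment_fq > baseline_fq
```

The key design point is that `gold` contains only real utterances; scoring on synthetic examples would not test the premise.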

Figures

Figures reproduced from arXiv: 2604.21137 by Jiho Noh, Mukhesh Raghava Katragadda, Raymond Carl, Soon Lee.

Figure 1. Structure of RoBERTa-based contextualized encoder for label classi...
Figure 2. Dual-Probe Head (DPH) architecture. Each of the...
Figure 3. Confusion matrix for Utterance Type (UT) classification (RoBERTa...
Figure 4. Utterance Type (UT) × Reasoning Component (RC) Co-occurrence Heatmap. Left: raw counts; Right: P(RC|UT) (row-normalized)
Figure 5. Aggregated CCI Time Series across Lesson Progression (binned into...
Figure 8. Rq trigger analysis. Left: P(student UT|teacher UT) for all lag-1 T→S transitions (N = 623 pairs). Right: P(student Rq) (bars) and mean student CCI (line) per preceding teacher UT.
Figure 6. 3-Turn IRF Patterns Ranked by Mean Student CCI (min. ...
Figure 7. Teacher-framing heatmap. Left: mean student CCI per (initiation...
Figure 9. Aggregated CCI over lesson progression computed on pseudo-labeled...
Figure 10. 3-turn Teacher–Student–Teacher IRF patterns ranked by mean student...
Figure 11. Rq trigger analysis on pseudo-labeled data. Left: ...
Figure 12. Teacher-framing heatmap on pseudo-labeled data. Left: mean student...
Original abstract

Analyzing the reasoning patterns of students in science classrooms is critical for understanding knowledge construction mechanism and improving instructional practice to maximize cognitive engagement, yet manual coding of classroom discourse at scale remains prohibitively labor-intensive. We present an automated discourse analysis system (ADAS) that jointly classifies teacher and student utterances along two complementary dimensions: Utterance Type and Reasoning Component derived from our prior CDAT framework. To address severe label imbalance among minority classes, we (1) stratify-resplit the annotated corpus, (2) apply LLM-based synthetic data augmentation targeting minority classes, and (3) train a dual-probe head RoBERTa-base classifier. A zero-shot GPT-5.4 baseline achieves macro-F1 of 0.467 on UT and 0.476 on RC, establishing meaningful upper bounds for prompt-only approaches motivating fine-tuning. Beyond classification, we conduct discourse pattern analyses including UTxRC co-occurrence profiling, Cognitive Complexity Index (CCI) computation per session, lag-sequential analysis, and IRF chain analysis, revealing that teacher Feedback-with-Question (Fq) moves are the most consistent antecedents of student inferential reasoning (SR-I). Our results demonstrate that LLM-based augmentation meaningfully improves UT minority-class recognition, and that the structural simplicity of the RC task makes it tractable even for lexical baselines.
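
The UT×RC co-occurrence profiling named in the abstract can be sketched as a count table plus a row-normalized view, which corresponds to the P(RC|UT) panel described in the Figure 4 caption. The function name and the placeholder labels (`UT_a`, `RC_x`, and so on) are assumptions for illustration and are not the paper's codes.

```python
from collections import Counter

def cooccurrence(pairs):
    """Return (raw_counts, row_normalized) for (UT, RC) label pairs."""
    raw = Counter(pairs)
    row_totals = Counter(ut for ut, _ in pairs)
    # Dividing each cell by its UT row total gives P(RC | UT).
    normalized = {(ut, rc): n / row_totals[ut] for (ut, rc), n in raw.items()}
    return raw, normalized

# Invented placeholder labels, for illustration only.
pairs = [("UT_a", "RC_x"), ("UT_a", "RC_x"), ("UT_a", "RC_y"), ("UT_b", "RC_y")]
raw, p_rc_given_ut = cooccurrence(pairs)
```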

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The manuscript presents an Automated Discourse Analysis System (ADAS) for joint multi-task classification of science classroom utterances into Utterance Types (UT) and Reasoning Components (RC) using a dual-probe RoBERTa-base model. It addresses severe class imbalance through stratified corpus splitting and LLM-based synthetic data augmentation targeting minority classes, reports a zero-shot GPT baseline (macro-F1 0.467 on UT, 0.476 on RC), and extends the analysis with UTxRC co-occurrence profiling, Cognitive Complexity Index computation, lag-sequential analysis, and IRF chain analysis, highlighting teacher Feedback-with-Question (Fq) moves as consistent antecedents of student inferential reasoning (SR-I).

Significance. If the claimed improvements from augmentation and joint training are substantiated with complete quantitative results and the synthetic data is shown not to introduce distributional artifacts, the work could enable scalable, automated analysis of classroom discourse patterns, supporting research on knowledge construction and instructional practices. The additional discourse pattern analyses provide concrete, falsifiable observations (e.g., Fq-to-SR-I lags) that could be tested in new corpora. The joint modeling and baseline comparison are standard strengths in the field.

major comments (4)
  1. [Abstract] Abstract: The abstract asserts that 'LLM-based augmentation meaningfully improves UT minority-class recognition' and that 'our results demonstrate' this improvement, yet supplies no numerical performance metrics (F1, accuracy, or per-class scores) for the proposed ADAS model on held-out real data, only for the zero-shot GPT baseline. This omission renders the central empirical claim unevaluable.
  2. [Methods] Methods/Augmentation: No description is given of the LLM augmentation procedure (prompting strategy, model version, generation parameters) or any validation (human fidelity ratings, embedding-distance comparisons to real minority utterances, or ablation isolating augmentation quality from joint-training effects). This directly bears on the weakest assumption that synthetic samples preserve linguistic properties of real minority-class utterances without introducing artifacts.
  3. [Results] Results: The manuscript reports no error analysis, confusion matrices, or ablation experiments that separate the contributions of augmentation, joint multi-task learning, and stratification. Without these, it is impossible to confirm the claim that 'the structural simplicity of the RC task makes it tractable even for lexical baselines' or to assess generalization of the dual-probe classifier.
  4. [Introduction / Data] Labeling and Evaluation: All UT and RC supervision signals derive from the authors' prior CDAT framework (self-cited), creating direct dependence for the core classification task. While extending an existing scheme is legitimate, the absence of independent validation or inter-annotator agreement details on the current corpus makes it difficult to rule out circularity in the reported performance gains.
minor comments (2)
  1. [Abstract] Abstract: 'GPT-5.4' is non-standard nomenclature; specify the exact model (e.g., GPT-4o or GPT-4-turbo) and temperature/settings used for the zero-shot baseline.
  2. [Throughout] Throughout: Define all acronyms (ADAS, CDAT, CCI, IRF, SR-I, Fq) at first use and ensure consistent terminology between the classification tasks and the subsequent discourse analyses.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the clarity and rigor of our work. We address each major point below and will revise the manuscript to incorporate additional details, metrics, and analyses where appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract asserts that 'LLM-based augmentation meaningfully improves UT minority-class recognition' and that 'our results demonstrate' this improvement, yet supplies no numerical performance metrics (F1, accuracy, or per-class scores) for the proposed ADAS model on held-out real data, only for the zero-shot GPT baseline. This omission renders the central empirical claim unevaluable.

    Authors: We agree that the abstract should provide key quantitative results to make the central claims immediately evaluable. The full Results section reports macro-F1 scores for the ADAS model on held-out real data (with and without augmentation), but these were not summarized in the abstract. In the revision we will add the specific metrics, including the improvement in minority-class F1 for UT due to augmentation, alongside the GPT baseline figures already present.
    revision: yes

  2. Referee: [Methods] Methods/Augmentation: No description is given of the LLM augmentation procedure (prompting strategy, model version, generation parameters) or any validation (human fidelity ratings, embedding-distance comparisons to real minority utterances, or ablation isolating augmentation quality from joint-training effects). This directly bears on the weakest assumption that synthetic samples preserve linguistic properties of real minority-class utterances without introducing artifacts.

    Authors: We acknowledge the need for full transparency on the augmentation pipeline. The current Methods section briefly mentions LLM-based synthetic data augmentation but omits procedural details. In the revision we will expand this into a dedicated subsection describing the exact prompting strategy, model version used, generation parameters (e.g., temperature, max tokens), and any validation performed, such as human fidelity checks or distributional comparisons. We will also add an ablation isolating augmentation effects.
    revision: yes

  3. Referee: [Results] Results: The manuscript reports no error analysis, confusion matrices, or ablation experiments that separate the contributions of augmentation, joint multi-task learning, and stratification. Without these, it is impossible to confirm the claim that 'the structural simplicity of the RC task makes it tractable even for lexical baselines' or to assess generalization of the dual-probe classifier.

    Authors: We agree that error analysis and ablations would strengthen interpretability. The current Results section includes overall metrics and the lexical baseline comparison supporting the RC simplicity claim, but lacks confusion matrices and systematic ablations. In the revision we will add confusion matrices for both UT and RC tasks on held-out data and include ablation results separating the effects of augmentation, joint training, and stratification. This will also clarify generalization of the dual-probe architecture.
    revision: yes

  4. Referee: [Introduction / Data] Labeling and Evaluation: All UT and RC supervision signals derive from the authors' prior CDAT framework (self-cited), creating direct dependence for the core classification task. While extending an existing scheme is legitimate, the absence of independent validation or inter-annotator agreement details on the current corpus makes it difficult to rule out circularity in the reported performance gains.

    Authors: The labels follow the established CDAT scheme from our prior work, where inter-annotator agreement was already reported. The current corpus extends that dataset using the identical guidelines, with no new annotations performed for this study. We will revise the Data and Evaluation sections to explicitly reference the original CDAT IAA figures and clarify that performance is measured on held-out splits of the extended corpus. We maintain that this does not introduce circularity, as gains are assessed against external baselines on unseen data.
    revision: partial

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper applies standard multi-task classification and post-hoc statistical analyses (co-occurrence, lag-sequential, IRF) to a fixed annotation scheme. The CDAT framework citation defines the label taxonomy for the supervised task but does not enter the training loop or the reported metrics as a fitted parameter or self-referential premise. Performance gains from LLM augmentation and the Fq-to-SR-I antecedent finding are computed on held-out real data and sequence statistics, respectively; neither reduces to the prior framework by construction. Self-citation here supplies task definition, not load-bearing justification for the empirical claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the prior CDAT framework for defining labels and on the assumption that LLM-generated examples preserve the distribution and semantics of real minority-class utterances.

axioms (1)
  • domain assumption: The CDAT framework provides a reliable and comprehensive categorization of utterance types and reasoning components in science classroom discourse.
    All classifications, co-occurrence profiles, and sequential analyses are performed using labels derived from this prior framework.

pith-pipeline@v0.9.0 · 5543 in / 1497 out tokens · 78383 ms · 2026-05-09T23:49:59.725365+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

  1. [1]

    Classroom Interaction in Science: Teacher questioning and feedback to students’ responses,

    C. Chin, “Classroom Interaction in Science: Teacher questioning and feedback to students’ responses,” International Journal of Science Education, vol. 28, no. 11, pp. 1315–1346, 15 Sep. 2006

  2. [2]

    Children’s talk and the development of reasoning in the classroom,

    N. Mercer, R. Wegerif, and L. Dawes, “Children’s talk and the development of reasoning in the classroom,” British Educational Research Journal, vol. 25, no. 1, pp. 95–111, 1 Feb. 1999

  3. [3]

    Critical Discourse Analysis in education: A review of the literature,

    R. Rogers, E. Malancharuvil-Berkes, M. Mosley, D. Hui, and G. O. Joseph, “Critical Discourse Analysis in education: A review of the literature,” Review of Educational Research, vol. 75, no. 3, pp. 365–416, Sep. 2005

  4. [4]

    Discourse analysis: A sociocultural perspective,

    E. A. Forman and D. E. McCormick, “Discourse analysis: A sociocultural perspective,” Remedial and Special Education: RASE, vol. 16, no. 3, pp. 150–158, May 1995

  5. [5]

    Analyzing teaching behavior,

    N. A. Flanders, “Analyzing teaching behavior,” 1970

  6. [6]

    The instructional process: A review of Flanders’ interaction analysis in a classroom setting,

    V. Odiri Amatari, “The instructional process: A review of Flanders’ interaction analysis in a classroom setting,” International Journal of Secondary Education, vol. 3, no. 5, p. 43, 19 Aug. 2015

  7. [7]

    Towards an analysis of discourse: the English used by teachers and pupils,

    J. Sinclair and M. Coulthard, Towards an analysis of discourse: the English used by teachers and pupils. cir.nii.ac.jp, 1975

  8. [8]

    Development of Two-Dimensional Classroom Discourse Analysis Tool (CDAT): scientific reasoning and dialog patterns in the secondary science classes,

    S. C. Lee and K. E. Irving, “Development of Two-Dimensional Classroom Discourse Analysis Tool (CDAT): scientific reasoning and dialog patterns in the secondary science classes,” International Journal of STEM Education, vol. 5, no. 1, p. 5, 19 Feb. 2018

  9. [9]

    4. Competent membership in the classroom community,

    H. Mehan, “4. Competent membership in the classroom community,” in Learning Lessons. Cambridge, MA and London, England: Harvard University Press, 2014

  10. [10]

    Deliberative discourse idealized and realized: Accountable talk in the classroom and in civic life,

    S. Michaels, C. O’Connor, and L. B. Resnick, “Deliberative discourse idealized and realized: Accountable talk in the classroom and in civic life,” Studies in philosophy and education, vol. 27, no. 4, pp. 283–297, Jul. 2008

  11. [11]

    Towards dialogic teaching: Rethinking classroom talk,

    R. J. Alexander, “Towards dialogic teaching: Rethinking classroom talk,” 2008

  12. [12]

    Next generation science standards: For states, by states,

    NGSS Lead States, “Next generation science standards: For states, by states,” 2013

  13. [13]

    B. S. Bloom, M. D. Engelhart, E. J. Furst, W. H. Hill, and D. R. Krathwohl, Taxonomy of educational objectives, B. S. Bloom, Ed. New York: David McKay Company, 1956

  14. [14]

    Generative and discriminative text classification with recurrent neural networks,

    D. Yogatama, C. Dyer, W. Ling, and P. Blunsom, “Generative and discriminative text classification with recurrent neural networks,” arXiv [stat.ML], 6 Mar. 2017

  15. [15]

    Pretrained language models for sequential sentence classification,

    A. Cohan, I. Beltagy, D. King, B. Dalvi, and D. Weld, “Pretrained language models for sequential sentence classification,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Stroudsburg, PA, USA: Association for Computational Lin...

  16. [16]

    Conditional random fields: Probabilistic models for segmenting and labeling sequence data,

    J. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” 2001

  17. [17]

    Hierarchical neural networks for sequential sentence classification in medical scientific abstracts,

    D. Jin and P. Szolovits, “Hierarchical neural networks for sequential sentence classification in medical scientific abstracts,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2018, pp. 3100–3109

  18. [18]

    Language model pre-training for hierarchical document representations,

    M.-W. Chang, K. Toutanova, K. Lee, and J. Devlin, “Language model pre-training for hierarchical document representations,” arXiv [cs.CL], 25 Jan. 2019

  19. [19]

    Multi-label Sequential Sentence Classification via Large Language Model,

    M. Lan, L. Zheng, S. Ming, and H. Kilicoglu, “Multi-label Sequential Sentence Classification via Large Language Model,” in Conference on Empirical Methods in Natural Language Processing, 2024

  20. [20]

    Transfer learning for text classification via model risk analysis,

    Y. Sun, C. Fan, and Q. Chen, “Transfer learning for text classification via model risk analysis,” in Findings of the Association for Computational Linguistics: EMNLP 2024, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Stroudsburg, PA, USA: Association for Computational Linguistics, 2024, pp. 2814–2825

  21. [21]

    Synthetic data generation with large language models for text classification: Potential and limitations,

    Z. Li, H. Zhu, Z. Lu, and M. Yin, “Synthetic data generation with large language models for text classification: Potential and limitations,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, Dec. 2023, pp. 10443–10461

  22. [22]

    A review of semi-supervised learning for text classification,

    J. M. Duarte and L. Berton, “A review of semi-supervised learning for text classification,” Artificial Intelligence Review, vol. 56, no. 9, pp. 1–69, 31 Jan. 2023

  23. [23]

    Improving semi-supervised text classification with dual meta-learning,

    S. Li, G. Yuan, M. Yang, Y. Shen, C. Li, R. Xu, and X. Zhao, “Improving semi-supervised text classification with dual meta-learning,” ACM Transactions on Information Systems, 20 Feb. 2024

  24. [24]

    AugESC: Dialogue augmentation with large language models for emotional support conversation,

    C. Zheng, S. Sabour, J. Wen, Z. Zhang, and M. Huang, “AugESC: Dialogue augmentation with large language models for emotional support conversation,” in Findings of the Association for Computational Linguistics: ACL 2023. Stroudsburg, PA, USA: Association for Computational Linguistics, 2023, pp. 1552–1568

  25. [25]

    AugGPT: Leveraging ChatGPT for Text Data Augmentation,

    H. Dai, Z. Liu, W. Liao, X. Huang, Y. Cao, Z. Wu, L. Zhao, S. Xu, F. Zeng, W. Liu, N. Liu, S. Li, D. Zhu, H. Cai, L. Sun, Q. Li, D. Shen, T. Liu, and X. Li, “AugGPT: Leveraging ChatGPT for Text Data Augmentation,” IEEE transactions on big data, vol. 11, no. 3, pp. 907–918, Jun. 2025

  26. [26]

    Pseudo-label : The simple and efficient semi-supervised learning method for deep neural networks,

    D.-H. Lee, “Pseudo-label : The simple and efficient semi-supervised learning method for deep neural networks,” ICML 2013 Workshop: Challenges in Representation Learning, 2013

  27. [27]

    FlexMatch: Boosting semi-supervised learning with Curriculum Pseudo Labeling,

    B. Zhang, Y. Wang, W. Hou, H. Wu, J. Wang, M. Okumura, and T. Shinozaki, “FlexMatch: Boosting semi-supervised learning with Curriculum Pseudo Labeling,” arXiv [cs.LG], pp. 18408–18419, 14 Oct. 2021

  28. [28]

    Debiased Self-training for semi-supervised learning,

    B. Chen, J. Jiang, X. Wang, P. Wan, J. Wang, and M. Long, “Debiased Self-training for semi-supervised learning,” arXiv [cs.LG], pp. 32424–32437, 14 Feb. 2022

  29. [29]

    FreeMatch: Self-adaptive thresholding for Semi-supervised learning,

    Y. Wang, H. Chen, Q. Heng, W. Hou, Y. Fan, Z. Wu, J. Wang, M. Savvides, T. Shinozaki, B. Raj, B. Schiele, and X. Xie, “FreeMatch: Self-adaptive thresholding for Semi-supervised learning,” arXiv [cs.LG], 15 May 2022

  30. [30]

    Enhancing Self-Training Methods,

    A. Radhakrishnan, J. Davis, Z. Rabin, B. Lewis, M. Scherreik, and R. Ilin, “Enhancing Self-Training Methods,” arXiv [cs.LG], 17 Jan. 2023

  31. [31]

    A comprehensive evaluation of oversampling techniques for enhancing text classification performance,

    S. F. Taskiran, B. Turkoglu, E. Kaya, and T. Asuroglu, “A comprehensive evaluation of oversampling techniques for enhancing text classification performance,” Scientific Reports, vol. 15, no. 1, p. 21631, 1 Jul. 2025

  32. [32]

    The foundations of cost-sensitive learning,

    C. Elkan, “The foundations of cost-sensitive learning,” International Joint Conference on Artificial Intelligence, pp. 973–978, 4 Aug. 2001

  33. [33]

    On the Stratification of Multi-label Data,

    K. Sechidis, G. Tsoumakas, and I. Vlahavas, “On the Stratification of Multi-label Data,” in Machine Learning and Knowledge Discovery in Databases, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 145–158.

    IEEE TRANSACTIONS ON LEARNING TECHNOLOGIES 11

  34. [34]

    Focal Loss for dense object detection,

    T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal Loss for dense object detection,” arXiv [cs.CV], pp. 2980–2988, 7 Aug. 2017.

Appendix A: Synthetic Data Generation Prompt

The following prompt template was used to generate synthetic dialogue variations via GPT-4.1. Placeholders in curly braces are filled at runtime with the corresponding value...

  • Generate exactly {variations} complete snippet variations
  • In each variation, the target turn (turn {center_1indexed}) must preserve its discourse labels (UT: {label_ut}, RC: {label_rc}) and speaker ({speaker})
  • All surrounding turns must be rewritten so the dialogue flows naturally with the new target; do NOT copy surrounding turns verbatim from the original
  • Use different vocabulary, phrasing, and sentence structures across variations
  • Maintain scientific accuracy and classroom-appropriate language
  • Respond with ONLY valid JSON --- no prose, no markdown fences
</requirements>
<output_format>
[ { "before": ["turn 1 text", "turn 2 text", ...], "target": "target turn text", "after": ["turn N+1 text", "turn N+2 text", ...] }, ... ]
</output_format>

The label_definitions block enumerates all UT and RC labels with their names and definitions....
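
The output format requested in the generation prompt can be checked mechanically before synthetic snippets are added to a training set. A minimal sketch, assuming the reply is a JSON list of objects with `before`, `target`, and `after` keys as the template shows; the `parse_variations` helper and the sample reply are hypothetical, and whether the authors validate replies this way is not stated.

```python
import json

def parse_variations(raw_text, expected_count):
    """Parse a model reply and check the structure the prompt asks for."""
    data = json.loads(raw_text)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, list) or len(data) != expected_count:
        raise ValueError("expected a JSON list with %d variations" % expected_count)
    for item in data:
        if not isinstance(item, dict) or set(item) != {"before", "target", "after"}:
            raise ValueError("each variation needs exactly before/target/after keys")
        if not isinstance(item["target"], str):
            raise ValueError("target must be a single string")
        for key in ("before", "after"):
            turns = item[key]
            if not isinstance(turns, list) or not all(isinstance(t, str) for t in turns):
                raise ValueError(key + " must be a list of strings")
    return data

# Invented reply, for illustration only.
reply = '[{"before": ["What did you see?"], "target": "It got warmer.", "after": ["Why?"]}]'
variations = parse_variations(reply, expected_count=1)
```

Rejecting malformed replies outright, rather than repairing them, keeps the label attached to the target turn unambiguous.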