TRACE: A taxonomy-grounded synthetic dataset for teaching-program generation and session interpretation in Applied Behavior Analysis
Pith reviewed 2026-06-30 11:57 UTC · model grok-4.3
The pith
TRACE supplies a 2,999-example synthetic dataset for teaching-program generation and behavioral interpretation in Applied Behavior Analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present TRACE, a 2,999-example synthetic instruction-tuning dataset for teaching-program generation across Discrete Trial Training, Natural Environment Teaching, and Task Analysis, and multi-session behavioral interpretation across twelve trajectory patterns and thirteen target behaviors. Every example is produced by a deterministic taxonomy-driven generator grounded in the canonical ABA literature, and carries complete sampling provenance.
What carries the argument
A deterministic taxonomy-driven generator that creates examples from a taxonomy extracted from canonical ABA literature, providing each with traceable provenance.
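As a rough illustration of what "deterministic" and "traceable provenance" can mean together, here is a minimal Python sketch. It is not the TRACE generator: the miniature taxonomy, the axis names `skill_domain` and `prompt_level`, the seed string, and the record layout are all assumptions made for this example. Only the three teaching methods are taken from the paper's description.

```python
import hashlib
import random

# Hypothetical miniature taxonomy. Only the three "method" values come from the
# paper's description; the other axes and their values are assumed for illustration.
TAXONOMY = {
    "method": [
        "Discrete Trial Training",
        "Natural Environment Teaching",
        "Task Analysis",
    ],
    "skill_domain": ["communication", "self-care", "play"],
    "prompt_level": ["full physical", "gestural", "independent"],
}


def generate_example(index: int, base_seed: str = "trace-sketch") -> dict:
    """Build one example whose content depends only on (base_seed, index)."""
    # Derive the per-example seed with SHA-256 rather than hash(), because
    # Python's built-in string hash is salted per process.
    digest = hashlib.sha256(f"{base_seed}:{index}".encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16))

    # Sample one cell per axis, visiting axes in a fixed (sorted) order.
    cells = {axis: rng.choice(TAXONOMY[axis]) for axis in sorted(TAXONOMY)}

    instruction = (
        f"Write a {cells['method']} teaching program for a "
        f"{cells['skill_domain']} target at prompt level '{cells['prompt_level']}'."
    )
    # The provenance block records exactly which cells produced the example.
    return {
        "id": index,
        "instruction": instruction,
        "provenance": {"seed": digest[:16], "cells": cells},
    }


example = generate_example(7)
```

Calling `generate_example(7)` twice returns identical records, and the `provenance` field is enough to see which taxonomy cells produced the text.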
If this is right
- Models can be trained on the provided train split of 2,549 examples for the two ABA tasks.
- The validation, test, and sanity splits allow for evaluation of model performance on the generated examples.
- Full sampling provenance enables analysis of how taxonomy cells map to specific examples.
- The CC BY-NC 4.0 license for data permits non-commercial research applications.
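The provenance point above can be made concrete with a small sketch that counts how many examples each taxonomy cell produced. The record layout (`provenance` → `cells` → axis name) is a hypothetical schema chosen for this illustration, not the released file format, and the three records are toy data.

```python
from collections import Counter


def cell_coverage(examples, axis):
    """Count examples per taxonomy cell along one axis of the provenance record."""
    return Counter(ex["provenance"]["cells"][axis] for ex in examples)


# Toy records in an assumed schema; real TRACE records may be laid out differently.
toy_examples = [
    {"id": 0, "provenance": {"cells": {"method": "Task Analysis"}}},
    {"id": 1, "provenance": {"cells": {"method": "Discrete Trial Training"}}},
    {"id": 2, "provenance": {"cells": {"method": "Task Analysis"}}},
]

coverage = cell_coverage(toy_examples, "method")
```

A count of this kind shows which cells are over- or under-represented in a split before any model is trained.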
Where Pith is reading between the lines
- Similar taxonomy-driven synthetic data generation could help other medical or clinical fields facing data privacy barriers.
- Performance of models trained on TRACE on actual clinical cases would test how well the taxonomy captures real-world variability.
- Extensions might include adding more ABA tasks or refining the taxonomy based on expert feedback.
Load-bearing premise
That examples produced by the taxonomy-driven generator are representative of real clinical ABA practice in a way that makes them useful for training models.
What would settle it
Evaluate a model fine-tuned on TRACE on a held-out set of real, de-identified ABA session records. Significantly lower accuracy on the real records than on the TRACE test split would challenge the dataset's utility.
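A minimal sketch of that comparison, assuming the interpretation task is scored as exact-match classification over trajectory-pattern labels. The label names and predictions below are invented toy values, not results from the paper. The sketch reports only a raw accuracy gap and does not run a significance test.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def synthetic_to_real_gap(synth_preds, synth_labels, real_preds, real_labels):
    """Accuracy on synthetic data minus accuracy on real data."""
    return accuracy(synth_preds, synth_labels) - accuracy(real_preds, real_labels)


# Toy values only: hypothetical pattern names and hypothetical model outputs.
synth_labels = ["increasing", "stable", "decreasing", "variable"]
synth_preds = ["increasing", "stable", "decreasing", "variable"]
real_labels = ["increasing", "stable", "decreasing", "variable"]
real_preds = ["increasing", "stable", "stable", "increasing"]

gap = synthetic_to_real_gap(synth_preds, synth_labels, real_preds, real_labels)
```

A large positive `gap` would suggest the model learned regularities of the generator rather than of clinical records.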
Original abstract
Applied Behavior Analysis (ABA) is a clinical discipline whose documentation (teaching programs and multi-session behavioral logs) is formulaic and high-volume, yet real session data is HIPAA-protected and bound by professional confidentiality rules, blocking the release of a training corpus. We present TRACE (Taxonomy-Referenced ABA Clinical Examples), a 2,999-example synthetic instruction-tuning dataset covering two ABA tasks: teaching-program generation across Discrete Trial Training, Natural Environment Teaching, and Task Analysis; and multi-session behavioral interpretation across twelve trajectory patterns and thirteen target behaviors. Every example is produced by a deterministic taxonomy-driven generator grounded in the canonical ABA literature, and every example carries complete sampling provenance: the exact taxonomy cells that produced it. The dataset is released under CC BY-NC 4.0 for data and MIT for code, with stratified train (2,549), validation (149), test (281), and sanity (20) splits. TRACE is a research artifact and has not been clinically validated.
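The split sizes quoted in the abstract can be checked against the stated total with a few lines of Python. The counts come from the abstract. The dictionary keys are naming choices made for this sketch and may not match the released file names.

```python
# Split sizes as stated in the abstract.
SPLITS = {"train": 2549, "validation": 149, "test": 281, "sanity": 20}
STATED_TOTAL = 2999

total = sum(SPLITS.values())

# Share of the full dataset held by each split.
fractions = {name: count / STATED_TOTAL for name, count in SPLITS.items()}

if total != STATED_TOTAL:
    raise ValueError(f"splits sum to {total}, expected {STATED_TOTAL}")
```

The four splits sum to 2,999, so they partition the dataset, and the train split holds roughly 85% of the examples.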
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents TRACE, a 2,999-example synthetic instruction-tuning dataset for two ABA tasks: teaching-program generation across Discrete Trial Training, Natural Environment Teaching, and Task Analysis; and multi-session behavioral interpretation across twelve trajectory patterns and thirteen target behaviors. Every example is produced by a deterministic taxonomy-driven generator grounded in canonical ABA literature, each carrying complete sampling provenance. The release includes stratified train (2,549), validation (149), test (281), and sanity (20) splits under CC BY-NC 4.0 (data) and MIT (code) licenses, with an explicit statement that it is a research artifact that has not been clinically validated.
Significance. If the dataset supports downstream model training on these formulaic ABA documentation tasks, it would provide a valuable public resource where real clinical data cannot be released due to privacy constraints. The deterministic construction from an external taxonomy, full provenance metadata for every example, and explicit non-validation disclaimer are clear strengths that promote transparency and reproducibility without circularity or unstated assumptions about real-world fidelity. These features allow users to assess and extend the artifact directly.
Simulated Author's Rebuttal
We thank the referee for their thorough and positive review, which accurately summarizes the TRACE dataset and its contributions. We appreciate the recommendation to accept and the recognition of the dataset's transparency, provenance, and non-validation disclaimer as strengths.
Circularity Check
No significant circularity identified
The paper's central contribution is the release of a 2,999-example synthetic dataset generated deterministically from a taxonomy extracted from external canonical ABA literature, with explicit provenance metadata and a disclaimer of no clinical validation. No equations, fitted parameters, predictions, or self-citations are load-bearing; the generation process is described as grounded in independent prior literature rather than any self-referential definition or reduction. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A canonical ABA taxonomy from the literature is sufficient to generate representative examples for teaching-program generation and behavioral interpretation tasks.