BuddyBench: A Privacy-Constrained Multi-Task Benchmark for Pediatric Social-Communication Personalization

Jeyeon Eo; Joo Young Kim; Minyoung Jung; Ran Ju; Unggi Lee

arxiv: 2605.28089 · v1 · pith:RDNIZKARnew · submitted 2026-05-27 · 💻 cs.AI

Pith reviewed 2026-06-29 12:27 UTC · model grok-4.3

classification 💻 cs.AI

keywords: benchmark, pediatric, social communication, knowledge tracing, causal inference, privacy, neurodevelopmental, personalization

The pith

BuddyBench supplies a single schema that joins drill-level learning records, clinical assessments, self-reports, and randomized trial results for pediatric social-communication models while enforcing privacy limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BuddyBench as a benchmark that merges an observational cohort of 189 children with dense drill data and an RCT cohort of 86 children into one structure. This structure supports four tasks: tracking knowledge over drills, recommending the next drill, predicting clinical scores, and estimating causal effects of treatments. The authors add a synthetic version for open testing. A reader would care because most existing child development datasets keep behavioral sequences separate from treatment outcomes and from privacy-protected clinical records.

Core claim

BuddyBench combines ND-03 observational data with dense coverage of Tasks 1-2 and ND-02 RCT data for Tasks 3-4 into a unified schema that links drill trajectories, standardized clinical assessments, BuddyPlan self-report, and randomized-treatment endpoints. The benchmark therefore enables knowledge tracing, next-drill recommendation, clinical prediction, and causal inference on the same pediatric records while keeping clinical data protected. Baselines confirm usable signal across the tasks, and BuddyBench-Sim supplies a synthetic copy for reproducible checks.

What carries the argument

The unified benchmark schema that links drill-level learning trajectories, standardized clinical assessments, self-report, and randomized-treatment endpoints across the two cohorts.
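
As a rough illustration of what such a linking schema might look like, here is a minimal Python sketch. Every class, field, and task label in it (`DrillEvent`, `Participant`, `supported_tasks`, `T1_knowledge_tracing`, and so on) is hypothetical and not taken from the paper. The drill and concept codes echo the Figure 3 trace, and reading `C`/`X` there as correct/incorrect is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DrillEvent:
    drill_id: str   # e.g. "D103" (code style borrowed from the Figure 3 trace)
    concept: str    # e.g. "Idm"
    correct: bool   # assumption: "C" means correct, "X" means incorrect


@dataclass
class Participant:
    subject_id: str
    cohort: str                                   # "ND-03" or "ND-02"
    drills: list[DrillEvent] = field(default_factory=list)
    clinical_scores: dict[str, float] = field(default_factory=dict)
    self_report: dict[str, float] = field(default_factory=dict)
    treatment_arm: Optional[str] = None           # hypothetical RCT field
    endpoint: Optional[float] = None              # hypothetical RCT outcome


def supported_tasks(p: Participant) -> set[str]:
    """Return the tasks a record can feed, based only on which fields are filled."""
    tasks: set[str] = set()
    if p.drills:
        tasks |= {"T1_knowledge_tracing", "T2_next_drill"}
    if p.clinical_scores:
        tasks.add("T3_clinical_prediction")
    if p.treatment_arm is not None and p.endpoint is not None:
        tasks.add("T4_causal_inference")
    return tasks
```

In this sketch the schema is shared while the filled fields differ by cohort, so a record with only drill data feeds Tasks 1-2 and a record with only clinical and trial fields feeds Tasks 3-4. No participant has to appear in both cohorts.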

If this is right

Models can trace knowledge state across successive social-communication drills.
Systems can recommend the next drill based on prior performance and clinical context.
Clinical scores can be predicted from sequences of drill outcomes and self-reports.
Causal effects of randomized interventions can be estimated while linking to behavioral trajectories.
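
For the last consequence, the simplest causal estimate in a randomized trial is an intention-to-treat difference in mean outcomes between arms. The sketch below shows that calculation with a Welch-style standard error. It is not the estimator the paper uses (the abstract does not name one), and the function name, arm labels, and toy numbers are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def itt_effect(outcomes, arms, treated_label="treatment"):
    """Difference in mean outcome (treated minus control) and its standard error.

    Participants are grouped by assigned arm, as in an intention-to-treat analysis.
    """
    treated = [y for y, a in zip(outcomes, arms) if a == treated_label]
    control = [y for y, a in zip(outcomes, arms) if a != treated_label]
    if len(treated) < 2 or len(control) < 2:
        raise ValueError("need at least two participants per arm")
    diff = mean(treated) - mean(control)
    # Unpooled (Welch-style) standard error from the sample variances.
    se = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    return diff, se


# Toy data, not from the paper.
effect, se = itt_effect([3.0, 5.0, 1.0, 3.0],
                        ["treatment", "treatment", "control", "control"])
```

With only 86 ITT participants in ND-02, the standard error matters as much as the point estimate when judging whether an effect is detectable.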

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same schema could test whether privacy methods that block direct record linkage still allow cross-task transfer.
One could check whether drill trajectories improve long-term outcome forecasts beyond what cross-sectional assessments alone provide.
Other health domains with sequential behavioral data and trial endpoints could adopt the same multi-task linking pattern.
The synthetic data generation methods used here could be evaluated on how closely they preserve the original cross-task correlations.
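
One way to make the last extension concrete is to compare pairwise Pearson correlations in the real and synthetic tables and report the largest gap. The sketch below is an editorial illustration, not a procedure from the paper. The function names and the column names in the usage lines are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)  # raises ZeroDivisionError for a constant column


def max_correlation_gap(real, synth):
    """Largest absolute difference in pairwise correlation between two tables.

    `real` and `synth` map the same column names to lists of numbers.
    """
    gap = 0.0
    for a, b in combinations(sorted(real), 2):
        gap = max(gap, abs(pearson(real[a], real[b]) - pearson(synth[a], synth[b])))
    return gap


# Toy columns, not from the paper.
real = {"drill_accuracy": [1.0, 2.0, 3.0, 4.0], "clinical_score": [2.0, 4.0, 6.0, 8.0]}
flipped = {"drill_accuracy": [1.0, 2.0, 3.0, 4.0], "clinical_score": [8.0, 6.0, 4.0, 2.0]}
gap = max_correlation_gap(real, flipped)
```

A gap near zero means the synthetic table reproduces the linear association structure. It says nothing about nonlinear dependence or about privacy leakage, which would need separate checks.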

Load-bearing premise

The two cohorts retain enough linked signal across tasks for model training and evaluation even after privacy constraints are imposed.

What would settle it

The claim would fail if baseline models trained on the released data showed no measurable improvement over chance on the knowledge tracing or clinical prediction tasks.
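
For a binary outcome such as drill correctness, "chance" corresponds to an AUC of 0.5. The sketch below computes AUC directly from its pairwise definition, the probability that a positive example is scored above a negative one, with ties counted as one half. The function name and the toy labels and scores are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Toy data, not from the paper.
informative = auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
uninformative = auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

A model that outputs the same score for every drill lands exactly at 0.5. A settling test would ask whether the fold-level AUCs of the baselines are distinguishable from that value.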

Figures

Figures reproduced from arXiv: 2605.28089 by Jeyeon Eo, Joo Young Kim, Minyoung Jung, Ran Ju, Unggi Lee.

**Figure 2.** Per-participant drill coverage and accuracy in BuddyBench. Dashed lines mark cohort means.

**Figure 3.** Annotated TTS trace for a representative sample from fold 0. Sample: subject_d44103baa7 | Target: D111 (concept Rbh). History (8 events): D103/Idm:C | D104/Apr:X | D105/Idm:X | D106/Atb:C | D107/Que:C | D108/Atb:C | D109/Wdr:X | D110/Nvr:X. History prior: overall=0.500, recent=0.400, same_concept=0.333. Scale x1.0 -> confidence 0.43 (4…

**Figure 4.** Real vs. BuddyBench-Sim performance scatter for T1 (AUC), T2 (R@10), and T3 (AUPRC). Each point…

**Figure 5.** Sample-size scaling analysis for T3 (pan…

Original abstract

BuddyBench introduces a privacy-constrained multi-task benchmark for pediatric social-communication personalization. Unlike existing neurodevelopmental repositories that primarily emphasize imaging, genetics, or cross-sectional clinical phenotyping, BuddyBench links drill-level learning trajectories, standardized clinical assessments, BuddyPlan self-report, and randomized-treatment endpoints within a unified benchmark schema. BuddyBench combines two cohorts: ND-03 is an observational cohort with dense drill coverage for Tasks 1-2 (n = 189), and ND-02 is a randomized controlled trial cohort for Tasks 3-4 (n = 86 ITT). Together, they support knowledge tracing, next-drill recommendation, clinical prediction, and causal inference, linking behavioral personalization to clinical evaluation. We additionally introduce BuddyBench-Sim, a synthetic companion dataset for reproducible evaluation. Baselines show signal across tasks while keeping pediatric clinical records protected.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BuddyBench defines a schema linking drill trajectories to clinical and RCT endpoints across two separate cohorts, but the disjoint nature of ND-03 and ND-02 leaves the unified multi-task claim unverified without an explicit linkage mechanism.

The main thing to know is that BuddyBench puts forward a privacy-constrained schema that ties drill-level learning trajectories, clinical assessments, self-reports, and randomized treatment endpoints together to support knowledge tracing, next-drill recommendation, clinical prediction, and causal inference in pediatric social-communication work. It draws on an observational cohort of 189 for the first two tasks and an RCT cohort of 86 for the last two, plus a synthetic companion dataset.

The paper does a reasonable job laying out a unified data structure that moves beyond imaging or genetics repositories and includes a synthetic version to support reproducible checks. That practical addition stands out as useful for anyone who wants to experiment without direct access to protected records.

The soft spots are straightforward. The two cohorts are described as separate in design, size, and task coverage, with no mention of shared participants or any alignment procedure. This makes the central claim that the schema supports all four tasks jointly dependent on an integration step that is not shown. The abstract also gives no quantitative baseline numbers, error bars, privacy mechanism details, or cohort demographics, so it is impossible to judge whether the claimed signal holds or whether the privacy constraints reduce model utility. The stress-test concern about disjoint cohorts matches what the abstract states, so the multi-task and causal-inference capabilities cannot be taken as demonstrated from the given text.

This paper is for researchers working on AI personalization in neurodevelopmental pediatrics who need benchmark schemas that connect behavioral trajectories to clinical endpoints. Readers focused on data infrastructure for healthcare AI would find the structure worth examining.

I would send it to peer review so referees can check whether the full methods section resolves the linkage issue and supplies the missing results and privacy details.

Referee Report

2 major / 2 minor

Summary. BuddyBench is presented as a privacy-constrained multi-task benchmark that unifies drill-level learning trajectories from an observational cohort (ND-03, n=189) with clinical assessments, self-reports, and randomized treatment endpoints from an RCT cohort (ND-02, n=86 ITT) to enable knowledge tracing, next-drill recommendation, clinical prediction, and causal inference in pediatric social-communication personalization. A synthetic dataset, BuddyBench-Sim, is introduced for reproducible evaluation, with baselines indicating signal across tasks while maintaining privacy protections.

Significance. Should the linkage between the disjoint cohorts prove feasible and the privacy mechanisms not unduly degrade model performance, BuddyBench could provide a valuable standardized resource for developing personalized interventions in neurodevelopmental disorders. The inclusion of BuddyBench-Sim stands out as a strength, enabling reproducible research and community benchmarking without access to sensitive pediatric data.

major comments (2)

[Abstract] The claim that the two cohorts together support all four tasks within a unified schema is not supported by the provided description, as ND-03 supplies coverage only for Tasks 1-2 while ND-02 supplies the RCT only for Tasks 3-4, with no mechanism described for individual-level linkage across the separate observational and RCT designs.
[Abstract] The statement that 'baselines show signal across tasks' lacks any quantitative results, error bars, or details on the privacy mechanisms employed, which is load-bearing for assessing whether the benchmark maintains utility under the emphasized privacy constraints.

minor comments (2)

Cohort demographics, inclusion criteria, and exact task definitions are not detailed, which would aid assessment of generalizability and task coverage.
The abstract would benefit from a brief mention of how the benchmark schema is implemented (e.g., data format or API) to clarify usability for the claimed tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below, proposing revisions to improve clarity and completeness.

Referee: [Abstract] The claim that the two cohorts together support all four tasks within a unified schema is not supported by the provided description, as ND-03 supplies coverage only for Tasks 1-2 while ND-02 supplies the RCT only for Tasks 3-4, with no mechanism described for individual-level linkage across the separate observational and RCT designs.

Authors: We appreciate the referee pointing out this potential ambiguity in the abstract. The manuscript describes a unified benchmark schema that standardizes data formats across cohorts to support the four tasks, with ND-03 providing the drill trajectories and clinical assessments for knowledge tracing and recommendation (Tasks 1-2), and ND-02 providing the RCT endpoints for clinical prediction and causal inference (Tasks 3-4). The 'linkage' refers to the common schema enabling multi-task benchmarking rather than individual-level data linkage, which is not claimed or required given the disjoint designs. However, to avoid misinterpretation, we will revise the abstract to explicitly state that the cohorts support complementary tasks within the schema without individual-level linkage, and expand the methods section to detail the schema and evaluation protocol.

Revision: yes

Referee: [Abstract] The statement that 'baselines show signal across tasks' lacks any quantitative results, error bars, or details on the privacy mechanisms employed, which is load-bearing for assessing whether the benchmark maintains utility under the emphasized privacy constraints.

Authors: The referee correctly identifies that the abstract does not include quantitative baseline results or details on privacy mechanisms. The full manuscript presents baseline experiments with performance metrics for each task, including comparisons that demonstrate signal, along with descriptions of the privacy-preserving techniques (such as data anonymization and synthetic data generation). We will revise the abstract to incorporate key quantitative findings with error bars and a concise mention of the privacy approaches to better substantiate the claim.

Revision: yes

Circularity Check

0 steps flagged

No circularity; benchmark schema is descriptive, not derived

The paper introduces a benchmark by describing the combination of two existing cohorts (ND-03 observational for Tasks 1-2 and ND-02 RCT for Tasks 3-4) plus a synthetic companion dataset. No equations, fitted parameters, predictions, or self-referential derivations appear in the provided text. The central claim is the existence of a unified schema supporting multiple tasks; this is a data-organization contribution rather than a modeled result that reduces to its own inputs by construction. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. This matches the default expectation for non-circular benchmark papers and receives score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Abstract-only review yields no explicit free parameters or derivations; the benchmark itself and the two named cohorts constitute the primary invented structure, with privacy constraints treated as an unelaborated domain requirement.

axioms (1)

Domain assumption: Privacy constraints can be applied without eliminating task-relevant signal in the linked drill and clinical data.
Stated implicitly by the claim that baselines show signal while records remain protected.

invented entities (2)

BuddyBench benchmark schema (no independent evidence)
purpose: Unified multi-task data structure linking trajectories, assessments, self-reports, and endpoints
Core contribution introduced in the abstract; no independent evidence supplied beyond the paper's description.

BuddyBench-Sim (no independent evidence)
purpose: Synthetic companion dataset for reproducible evaluation
Introduced to enable testing without real records; no generation details or validation against real data provided.


Order p-values:p (1) ≤p (2) ≤ · · · ≤p (m)
[10]

Find the largestisuch thatp (i) ≤ i m q
[11]

Reject hypothesesH (1), . . . , H(i) Task 1 significance results.For pyKT- framework models with complete five-fold runs under the F3 subject-split protocol, each model’s fold-level AUC vector was compared against the best model (PEBG, mean AUC = 0.723) using a paired t-test across the five outer folds; resulting p-values were corrected with BH-FDR ( q=0....

2021
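
The three Benjamini-Hochberg steps quoted above can be sketched in a few lines of Python. This is a generic implementation of the standard procedure, not the authors' code. The level `q` is a required argument because its value is truncated in the extracted text, and the p-values and the 0.05 in the usage lines are purely illustrative.

```python
def benjamini_hochberg(p_values, q):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    # Step 1: order the p-values, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step 2: find the largest rank i with p_(i) <= (i / m) * q.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    # Step 3: reject the hypotheses with the `cutoff` smallest p-values.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Illustrative p-values and level, not from the paper.
decisions = benjamini_hochberg([0.001, 0.008, 0.039, 0.041], q=0.05)
```

The procedure is step-up: in the usage line the third p-value (0.039) misses its own threshold of 0.0375, yet it is still rejected because the fourth (0.041) meets its threshold of 0.05.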
