Simultaneous hyperkinetic movement disorders phenotyping: a cross-cohort pediatric transfer study using routine videos, markerless pose estimation and a tabular foundation model

Cécile A. Hubsch; Diane Demailly; Eduardo M. Moraud; Gabriella A. Horvath; Gun-Marie Hariz; Jocelyne Bloch; Juan Dario Ortigoza Escobar; Laura Cif; Mayté Castro Jiménez; Morgan Dornadic

arxiv: 2606.07674 · v1 · pith:KEMASVJZnew · submitted 2026-06-04 · 💻 cs.CV · q-bio.NC

Laura Cif, Diane Demailly, Zohra Souei, Muhammad Mushhood Ur Rehman, Juan Dario Ortigoza Escobar, Mayté Castro Jiménez, Cécile A. Hubsch, Sophie Huby, Morgan Dornadic, Gun-Marie Hariz, Eduardo M. Moraud, Jocelyne Bloch, Gabriella A. Horvath, Xavier Vasques

Pith reviewed 2026-06-28 02:09 UTC · model grok-4.3

keywords: hyperkinetic movement disorders, video phenotyping, markerless pose estimation, transfer learning, pediatric neurology, foundation models, simultaneous detection

The pith

A video framework detects eight hyperkinetic movement disorders at once and transfers from adults to children after calibrating only the final decision layer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds and tests a system that takes routine clinical videos, extracts pose and movement features, and outputs simultaneous labels for dystonia, tremor, myoclonus, chorea, athetosis, ballismus, stereotypies, and tics. A backbone model is trained on a small adult cohort under standardized conditions and then applied directly to an independent pediatric group without retraining the core components. Only the last subject-level decision step is adjusted using a clinician-chosen subset of the pediatric cases, after which accuracy on the remaining held-out pediatric patients rises. This setup is presented as a way to support phenotyping in real-world recordings where full retraining on new age groups would be costly.

Core claim

After training a shared predictive backbone on 21 adults and 4 controls, the system is deployed unchanged on 12 pediatric patients with monogenic combined movement disorders; lightweight calibration of only the final decision layer on a clinician-selected subset raises Hamming accuracy from 0.804 to 0.839 and Jaccard index from 0.548 to 0.633 on the seven held-out pediatric cases, with further gains when restricted to phenomenologies showing higher clinician agreement.
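For readers less familiar with the two metrics in this claim, here is a minimal sketch of how Hamming accuracy and a patient-averaged Jaccard index can be computed for multi-label predictions. The function names and toy arrays are illustrative assumptions, not the paper's code, and the paper may aggregate the Jaccard index differently (for example over all labels at once).

```python
import numpy as np

def hamming_accuracy(y_true, y_pred):
    """Fraction of patient-label pairs that are predicted correctly."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    return float(np.mean(y_true == y_pred))

def jaccard_index(y_true, y_pred):
    """Mean over patients of |true AND pred| / |true OR pred|.

    Assumed convention: a patient with no true and no predicted
    labels scores 1.0.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    inter = np.logical_and(y_true, y_pred).sum(axis=1)
    union = np.logical_or(y_true, y_pred).sum(axis=1)
    scores = np.where(union == 0, 1.0, inter / np.maximum(union, 1))
    return float(scores.mean())

# Toy example with made-up values: 2 patients x 4 labels.
y_true = [[1, 0, 1, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [1, 1, 0, 1]]
print(hamming_accuracy(y_true, y_pred))  # 6 of 8 pairs match
print(jaccard_index(y_true, y_pred))     # mean of 1/2 and 2/3

# Arithmetic check under the assumption of 7 patients x 8 labels = 56 pairs.
print(round(45 / 56, 3), round(47 / 56, 3))
```

If Hamming accuracy is taken over all 7 × 8 = 56 held-out patient-label pairs, the reported 0.804 and 0.839 are consistent with 45 and 47 correct pairs, which would correspond to a net change of two labels after calibration.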

What carries the argument

Markerless pose estimation producing kinematic descriptors that feed a pretrained tabular foundation model, followed by lightweight calibration restricted to the final subject-level decision layer.
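To make "calibration restricted to the final decision layer" concrete, the sketch below tunes one decision threshold per phenomenology on a small calibration set while leaving the backbone scores untouched. This is only one simple possibility: the summary does not state the actual form of the decision layer, so `calibrate_thresholds`, `predict`, the threshold grid and the toy scores are all assumptions for illustration.

```python
import numpy as np

def calibrate_thresholds(scores, labels, grid=None, default=0.5):
    """Pick one decision threshold per label on a small calibration set.

    scores: (n_patients, n_labels) subject-level scores from the frozen backbone.
    labels: (n_patients, n_labels) binary reference labels.
    The default threshold is kept unless a grid value is strictly better.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    thresholds = np.full(scores.shape[1], default)
    for j in range(scores.shape[1]):
        best_acc = np.mean((scores[:, j] >= default) == labels[:, j])
        for t in grid:
            acc = np.mean((scores[:, j] >= t) == labels[:, j])
            if acc > best_acc:
                best_acc, thresholds[j] = acc, t
    return thresholds

def predict(scores, thresholds):
    """Apply per-label thresholds to backbone scores."""
    return (np.asarray(scores, dtype=float) >= thresholds).astype(int)

# Made-up calibration data: 3 patients x 2 labels.
calib_scores = [[0.30, 0.80], [0.35, 0.20], [0.12, 0.70]]
calib_labels = [[1, 1], [1, 0], [0, 1]]
thresholds = calibrate_thresholds(calib_scores, calib_labels)

# A made-up held-out patient: the first label flips from 0 to 1 after calibration.
held_out_scores = [[0.20, 0.60]]
print(predict(held_out_scores, np.full(2, 0.5)))
print(predict(held_out_scores, thresholds))
```

The point of the design is that only `thresholds` changes between cohorts; the pose estimation, descriptors and foundation-model scores are reused as they are.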

If this is right

The same backbone supports simultaneous detection of all eight listed phenomenologies from a single routine video.
Transfer to a new age group succeeds without retraining the pose estimation or foundation-model layers.
Performance remains stable when evaluation is limited to the subset of labels with stronger clinician consensus.
The approach works on real-world rather than protocol-controlled recordings in the external cohort.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Clinics could begin using the system on existing video archives with only a few local labels for calibration instead of collecting new large datasets.
The same transfer pattern might apply to other video-based neurological assessments if the kinematic descriptors prove stable across conditions.
Larger studies could test whether random or stratified calibration subsets produce comparable gains, clarifying how much clinician selection matters.

Load-bearing premise

The small clinician-selected subset used for calibration represents the full phenotypic range of the pediatric cohort without selection bias that would inflate measured transfer performance.

What would settle it

Run the same pipeline on a new pediatric cohort where the calibration subset is selected randomly rather than by clinician judgment and measure whether the accuracy gains disappear or reverse.
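A minimal sketch of how such random calibration splits could be generated is shown below. It assumes a calibration subset of 5 patients (12 in the cohort minus the 7 held out, as in the reported study); `random_splits` is a hypothetical helper and the patient identifiers are plain indices.

```python
import math
import random

def random_splits(n_patients=12, n_calibration=5, n_repeats=100, seed=0):
    """Yield (calibration_ids, held_out_ids) pairs drawn uniformly at random."""
    rng = random.Random(seed)
    ids = list(range(n_patients))
    for _ in range(n_repeats):
        calib = sorted(rng.sample(ids, n_calibration))
        held_out = [i for i in ids if i not in calib]
        yield calib, held_out

# Number of distinct 5-of-12 calibration subsets.
print(math.comb(12, 5))

for calib, held_out in random_splits(n_repeats=3):
    print("calibrate on", calib, "evaluate on", held_out)
```

Each split would then go through the same calibrate-then-evaluate procedure, and the distribution of gains could be compared with the gain obtained from the clinician-selected split. With only 792 possible subsets at this cohort size, exhaustive enumeration is also feasible.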

Figures

Figures reproduced from arXiv: 2606.07674 by Cécile A. Hubsch, Diane Demailly, Eduardo M. Moraud, Gabriella A. Horvath, Gun-Marie Hariz, Jocelyne Bloch, Juan Dario Ortigoza Escobar, Laura Cif, Mayté Castro Jiménez, Morgan Dornadic, Muhammad Mushhood Ur Rehman, Sophie Huby, Xavier Vasques, Zohra Souei.

**Figure 1.** Study cohorts and video-based phenotyping pipeline for HMDs. Top: the training cohort (21 patients with combined HMDs and 4 controls, standardized CODY-SAMP protocol) and the external pediatric inference cohort (12 patients with monogenic combined MDs, routine clinical videos), illustrating the contrast in acquisition context between training and inference. Bottom: the six-step pipeline, in which a shared …

**Figure 2.** (A) The eight target phenomenologies were rated by five expert clinicians (LC, DD, GH, MCJ, JDOE) for each of the 12 pediatric patients (96 patient–symptom labels in total), and a patient-level consensus was derived as the symptom being voted positive by at least three of the five raters. Dystonia was present in all 12 patients with unanimous agreement, athetosis in 5 patients, myoclonus and chorea in 4 p…

**Figure 3.** Patient-level performance before and after local calibration (held-out cohort). Jaccard index (A, B) and Hamming accuracy (C, D) for the baseline (uncalibrated) and locally calibrated deployments on the seven held-out pediatric patients, under the main present/absent definition (A, C) and the restrictive agreement-based definition (B, D), at three rater-agreement levels (≥3/5, ≥4/5, 5/5). Under the restric…

**Figure 4.** Confusion-structure shift after local calibration (held-out cohort). Aggregated true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) for the baseline and locally calibrated deployments on the seven held-out patients. Top row, main present/absent definition; bottom row, restrictive agreement-based definition; columns correspond to rater-agreement levels ≥3/5, ≥4/5 and 5/5…
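The consensus rule described in the Figure 2 caption (a symptom counts as present when at least three of the five raters vote positive) can be sketched as follows. The `agreement` output, defined here as the number of raters siding with the majority, is an assumption about how the ≥3/5, ≥4/5 and 5/5 levels might be derived; the votes are made up.

```python
import numpy as np

def consensus_labels(votes, min_positive=3):
    """Derive consensus labels from individual rater votes.

    votes: (n_raters, n_patients, n_labels) binary array.
    Returns (consensus, agreement), each of shape (n_patients, n_labels).
    """
    votes = np.asarray(votes, dtype=int)
    n_raters = votes.shape[0]
    positive = votes.sum(axis=0)
    consensus = (positive >= min_positive).astype(int)
    # Assumed definition: size of the majority side for each patient-label pair.
    agreement = np.maximum(positive, n_raters - positive)
    return consensus, agreement

# Made-up votes: 5 raters, 1 patient, 3 labels.
votes = np.array([
    [[1, 1, 0]],
    [[1, 1, 0]],
    [[1, 1, 1]],
    [[1, 0, 0]],
    [[1, 0, 0]],
])
consensus, agreement = consensus_labels(votes)
print(consensus)  # label present when at least 3 raters vote positive
print(agreement)  # number of raters on the majority side
```

Restricting evaluation to pairs with `agreement >= 4` or `agreement == 5` would then select the labels on which raters were most definite.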

Original abstract

Objective: To develop and externally test a video-based framework for simultaneous detection of hyperkinetic MDs phenomenologies: dystonia, tremor, myoclonus, chorea, athetosis, ballismus, stereotypies, and tics using routine clinical recordings, with explicit testing of external, cross-cohort transfer from adult to pediatric populations.

Methods: In this proof-of-concept study, the framework combines markerless pose estimation, kinematic descriptors, and a pretrained fondation model. A shared predictive backbone was developed on 21 adults with confirmed hyperkinetic MDs and 4 healthy controls assessed under a standardized protocol. External validation was performed on an independent external cohort: a real-world pediatric sample (n=12, monogenic combined MDs). For the external dataset, the backbone was deployed without retraining; lightweight calibration adjusted only the final subject-level decision step using a small labeled subset of patients selected by clinicians as representative of the cohort's phenotypic range.

Results: After local calibration of the decision layer on the clinician-selected subset, performance improved consistently on the held-out pediatric patients (n=7): Hamming accuracy rose from 0.804 to 0.839 and the Jaccard index from 0.548 to 0.633. This calibrated performance was preserved, and the Jaccard index further improved, when the evaluation was restricted to the phenomenologies with more definite clinician agreement (Hamming accuracy 0.9, Jaccard index 0.786), indicating that the gains did not rest on the least-reliable labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Small n=12 with clinician-chosen calibration makes the reported adult-to-pediatric transfer gains hard to trust without bias checks.

The paper's core claim is that a backbone trained on 21 adults can be lightly adapted via final-layer calibration on a clinician-picked subset of the 12 pediatric cases, lifting multi-label performance on the remaining 7 children from a Hamming accuracy of 0.804 to 0.839 and a Jaccard index of 0.548 to 0.633. The simultaneous phenotyping of eight hyperkinetic movement disorder types from routine videos is the practical target.

What works is the setup itself. Markerless pose estimation plus kinematic descriptors fed into a tabular foundation model is a reasonable way to handle real clinical recordings without special equipment. Keeping the adult backbone frozen and only tuning the decision step shows a lightweight transfer path, and the extra check that results improve further on high-agreement labels is a sensible sanity step.

The problems sit in the validation. Total external data is only 12 patients, calibration uses a small clinician-selected subset whose selection rules are not quantified, and the abstract gives no error bars, statistical tests, or confirmation that the subset matches the held-out cases on severity, video quality, or phenotype mix. With such low numbers any non-representative pick can produce the observed lift without proving robust transfer. The stress-test concern about selection bias lands directly on the reported numbers.

This is for groups already working on video-based neurology tools or transfer methods in movement disorders. A reader looking for early examples of multi-label phenotyping on routine footage might pick up the calibration trick, but the current evidence is too preliminary to rely on.

Send it to peer review. The approach is coherent and the question is worth asking, but referees will need to see the full methods, any bias diagnostics, and preferably more cases before the transfer result can be taken as solid.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a proof-of-concept video-based framework for simultaneous multi-label phenotyping of eight hyperkinetic movement disorders (dystonia, tremor, myoclonus, chorea, athetosis, ballismus, stereotypies, tics) from routine clinical recordings. It combines markerless pose estimation, kinematic descriptors, and a tabular foundation model. A shared backbone is trained on 21 adults plus 4 controls; external transfer is tested on an independent pediatric cohort (n=12 monogenic cases) by deploying the backbone without retraining and performing lightweight calibration of only the final subject-level decision layer on a clinician-selected subset, with metrics reported on the remaining n=7 held-out patients (post-calibration Hamming accuracy 0.839, Jaccard index 0.633).

Significance. If the calibration subset proves representative, the result would demonstrate feasible adult-to-pediatric transfer for rare combined movement disorders using minimal additional labels and routine videos, a practically relevant advance given data scarcity in pediatrics. The preservation of gains when restricting to high-agreement phenomenologies and the multi-label simultaneous detection are strengths that could support clinical utility if the small-sample concerns are addressed.

major comments (3)

[Methods (external validation paragraph)] Methods (calibration and external validation): The headline performance lift (Hamming accuracy 0.804→0.839, Jaccard 0.548→0.633 on n=7) is obtained after calibration on a clinician-selected subset whose selection criteria, MD-type distribution, severity, or video-quality match to the held-out cases are not quantified or statistically compared; this leaves the improvement vulnerable to selection bias and undermines the claim of unbiased cross-cohort transfer.
[Results (performance paragraph)] Results: No confidence intervals, p-values, or bootstrap variability estimates accompany the reported metrics despite n=12 total and n=7 test cases; the absence of these makes it impossible to determine whether the observed deltas exceed what could arise from sampling variability alone.
[Abstract (Methods summary) and Methods (backbone description)] Abstract/Methods: No information is given on the foundation model's pretraining corpus, architecture details, or the exact set of kinematic descriptors extracted from pose estimation; these omissions are load-bearing for claims of reproducibility and for interpreting why transfer succeeded.
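The variability estimates requested in the second major comment could take the form of a patient-level percentile bootstrap, sketched below under stated assumptions. The data are synthetic, `bootstrap_ci` is a hypothetical helper, and with only seven patients any such interval will be wide and should be read as descriptive.

```python
import numpy as np

def hamming_accuracy(y_true, y_pred):
    """Fraction of patient-label pairs predicted correctly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling patients (rows) with replacement."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = y_true.shape[0]
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # sample patient indices with replacement
        stats[b] = metric(y_true[idx], y_pred[idx])
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)

# Synthetic stand-in for 7 held-out patients x 8 labels (not the study data).
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=(7, 8))
flip = rng.random((7, 8)) < 0.2
y_pred = np.where(flip, 1 - y_true, y_true)

point = hamming_accuracy(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred, hamming_accuracy)
print(f"Hamming accuracy {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Resampling whole patients rather than individual labels keeps the eight labels of each patient together, which respects the fact that labels within one patient are not independent.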

minor comments (2)

[Abstract] Abstract: 'fondation model' is a typographical error and should read 'foundation model'.
[Abstract (Results)] Abstract: The phrasing 'deployed without retraining' is accurate only for the backbone; the subsequent calibration step should be explicitly distinguished from zero-shot transfer to avoid reader confusion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our proof-of-concept study. We address each major comment below with honest responses and indicate revisions where the manuscript can be strengthened, while noting the constraints of small-sample rare-disease data.

Referee: [Methods (external validation paragraph)] Methods (calibration and external validation): The headline performance lift (Hamming accuracy 0.804→0.839, Jaccard 0.548→0.633 on n=7) is obtained after calibration on a clinician-selected subset whose selection criteria, MD-type distribution, severity, or video-quality match to the held-out cases are not quantified or statistically compared; this leaves the improvement vulnerable to selection bias and undermines the claim of unbiased cross-cohort transfer.

Authors: We agree that the calibration subset requires fuller characterization to evaluate representativeness. In the revised manuscript we will add a supplementary table and text explicitly listing MD-type counts, clinician severity ratings, and video-quality descriptors for the calibration subset versus the n=7 held-out cases, together with any feasible descriptive comparisons. This directly addresses the selection-bias concern while preserving the proof-of-concept framing; we do not claim fully unbiased transfer but rather feasible lightweight adaptation. revision: yes
Referee: [Results (performance paragraph)] Results: No confidence intervals, p-values, or bootstrap variability estimates accompany the reported metrics despite n=12 total and n=7 test cases; the absence of these makes it impossible to determine whether the observed deltas exceed what could arise from sampling variability alone.

Authors: We accept that uncertainty estimates are needed. The revised Results section will report bootstrap confidence intervals (1000 resamples) for both Hamming accuracy and Jaccard index pre- and post-calibration on the held-out patients. Given n=7 we will not present p-values for the delta, as they would be under-powered and potentially misleading; instead we will frame the work as exploratory and highlight the observed variability. This provides the requested quantification without overstating statistical claims. revision: yes
Referee: [Abstract (Methods summary) and Methods (backbone description)] Abstract/Methods: No information is given on the foundation model's pretraining corpus, architecture details, or the exact set of kinematic descriptors extracted from pose estimation; these omissions are load-bearing for claims of reproducibility and for interpreting why transfer succeeded.

Authors: We acknowledge the reproducibility gap. The revised Methods (and a condensed Abstract sentence) will specify the tabular foundation model architecture, its pretraining corpus (large-scale public tabular datasets), and the complete list of kinematic descriptors (joint velocities, inter-joint angles, accelerations, and higher-order statistics derived from the pose keypoints). These details exist in our code and supplementary files and will be elevated to the main text. revision: yes
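The descriptor families named in the last response (joint velocities, accelerations, inter-joint angles and summary statistics) can be illustrated with a short sketch. The array layout, the function names and the choice of statistics are assumptions made for this example; the paper's actual descriptor set is not given in the material reviewed here.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (radians) between segments b->a and b->c."""
    a, b, c = (np.asarray(x, dtype=float) for x in (a, b, c))
    u, v = a - b, c - b
    denom = np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1) + 1e-12
    cos = np.sum(u * v, axis=-1) / denom
    return np.arccos(np.clip(cos, -1.0, 1.0))

def kinematic_descriptors(keypoints, fps):
    """Summarize per-joint speed and acceleration from pose keypoints.

    keypoints: (n_frames, n_joints, 2) coordinates from a pose estimator.
    Returns a dict of per-joint statistics, each of shape (n_joints,).
    """
    kp = np.asarray(keypoints, dtype=float)
    dt = 1.0 / fps
    vel = np.gradient(kp, dt, axis=0)   # finite-difference velocity
    acc = np.gradient(vel, dt, axis=0)  # finite-difference acceleration
    speed = np.linalg.norm(vel, axis=-1)
    acc_mag = np.linalg.norm(acc, axis=-1)
    return {
        "speed_mean": speed.mean(axis=0),
        "speed_std": speed.std(axis=0),
        "acc_mean": acc_mag.mean(axis=0),
    }

# Made-up trajectory: one joint moving along x at 2 units per second, 10 fps.
t = np.arange(20) / 10.0
keypoints = np.zeros((20, 1, 2))
keypoints[:, 0, 0] = 2.0 * t
features = kinematic_descriptors(keypoints, fps=10)
print(features["speed_mean"], features["acc_mean"])

# Right angle at the origin between the x and y axes.
print(joint_angle([1, 0], [0, 0], [0, 1]))
```

Stacking such per-joint statistics into one row per recording gives the kind of tabular input that a tabular foundation model consumes.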

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper explicitly describes deploying the adult-trained backbone without retraining on the pediatric cohort, then performing lightweight calibration of only the final decision layer on a clinician-selected subset before reporting metrics on the separate held-out n=7 cases. This is a standard split-based calibration and evaluation procedure whose outputs are not equivalent to the inputs by construction. No self-definitional equations, fitted parameters renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling are present in the abstract or described methods. The central transfer claim rests on external data splits rather than reducing to its own fitted values.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review limits visibility; the claim rests on the untested premise that adult-derived kinematic features plus a tabular foundation model capture transferable signals for pediatric MDs, with the only explicit free parameter being the weights of the final decision layer adjusted on the clinician-selected subset.

free parameters (1)

final decision layer weights
Lightweight calibration performed on clinician-selected pediatric subset; exact values and regularization not stated.

axioms (1)

domain assumption: Adult-trained backbone produces features that remain useful for pediatric cases without retraining the core model
Invoked by the decision to deploy the backbone unchanged and only calibrate the final layer.

Reference graph

Works this paper leans on

17 extracted references · 4 canonical work pages · 2 internal anchors

[1] Stephen, C. D., Parisi, F., Mancini, M. & Artusi, C. A. Editorial: Digital biomarkers in movement disorders. Front. Neurol. 16, 1600018 (2025)
[2] Brandsma, R., Van Egmond, M. E., Tijssen, M. A. J. & the Groningen Movement Disorder Expertise Centre. Diagnostic approach to paediatric movement disorders: a clinical practice guide. Dev. Med. Child Neurol. 63, 252–258 (2021)
[3] Sadnicka, A. & Edwards, M. J. Between Nothing and Everything: Phenomenology in Movement Disorders. Mov. Disord. 38, 1767–1773 (2023)
[4] Méneret, A. et al. Treatable Hyperkinetic Movement Disorders Not to Be Missed. Front. Neurol. 12, 659805 (2021)
[5] Cif, L. et al. Deep Learning Pose Estimation for Multi-Label Recognition of Combined Hyperkinetic Movement Disorders. Preprint at https://doi.org/10.48550/ARXIV.2602.00163 (2026)
[6] Martínez-García-Peña, R., Koens, L. H., Azzopardi, G. & Tijssen, M. A. J. Video-Based Data-Driven Models for Diagnosing Movement Disorders: Review and Future Directions. Mov. Disord. 40, 2046–2066 (2025)
[7] Tang, W., Van Ooijen, P. M. A., Sival, D. A. & Maurits, N. M. Automatic two-dimensional & three-dimensional video analysis with deep learning for movement disorders: A systematic review. Artif. Intell. Med. 156, 102952 (2024)
[8] Hollmann, N. et al. Accurate predictions on small data with a tabular foundation model. Nature 637, 319–326 (2025)
[9] Qu, J., Holzmüller, D., Varoquaux, G. & Morvan, M. L. TabICLv2: A better, faster, scalable, and open tabular foundation model. Preprint at https://doi.org/10.48550/ARXIV.2602.11139 (2026)
[10] Qu, J., Holzmüller, D., Varoquaux, G. & Morvan, M. L. TabICL: A Tabular Foundation Model for In-Context Learning on Large Data. Preprint at https://doi.org/10.48550/ARXIV.2502.05564 (2025)
[11] Higuchi, T. Approach to an irregular time series on the basis of the fractal theory. Phys. D Nonlinear Phenom. 31, 277–283 (1988)
[12] Bandt, C. & Pompe, B. Permutation Entropy: A Natural Complexity Measure for Time Series. Phys. Rev. Lett. 88, 174102 (2002)
[13] Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020)
[14] Pecoraro, P. M., Marsili, L., Espay, A. J., Bologna, M. & Di Biase, L. Computer Vision Technologies in Movement Disorders: A Systematic Review. Mov. Disord. Clin. Pract. 12, 1229–1243 (2025)
[15] Zuluaga, M. A., Išgum, I. & Bach Cuadra, M. Trustworthy AI in medical image analysis: A unified perspective built on robustness and layers of trust. Curr. Opin. Biomed. Eng. 37, 100649 (2026)
[16] Silva, G. F. D. S., Barcellos Filho, F. N., Wichmann, R. M., Da Silva Junior, F. C. & Chiavegatto Filho, A. D. P. Strategies for detecting and mitigating dataset shift in machine learning for health predictions: A systematic review. J. Biomed. Inform. 170, 104902 (2025)
[17] Garg, A. et al. Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data. Preprint at https://doi.org/10.48550/ARXIV.2507.03971 (2025)
