{"paper":{"title":"What properties of reasoning supervision are associated with improved downstream model quality?","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Intrinsic metrics on reasoning data strongly predict downstream model performance in a scale-dependent way.","cross_cats":[],"primary_cat":"cs.AI","authors_text":"Dzmitry Pihulski, Jan Eliasz, Jan Kocoń, Maciej Piasecki, Michał Rajkowski, Mikołaj Langner, Przemysław Kazienko, Teddy Ferdinan","submitted_at":"2026-05-13T10:04:38Z","abstract_excerpt":"Validating training data for reasoning models typically requires expensive trial-and-error fine-tuning cycles. In this work, we investigate whether the utility of a reasoning dataset can be reliably predicted prior to training using intrinsic data metrics. We propose a suite of quantitative measures and evaluate their predictive power by fine-tuning 8B and 11B models on semantically distinct variants of a Polish reasoning dataset. Our analysis reveals that these intrinsic metrics demonstrate strong and significant correlations with downstream model performance. Crucially, we find that the pred"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our analysis reveals that these intrinsic metrics demonstrate strong and significant correlations with downstream model performance. 
Crucially, we find that the predictors of utility are scale-dependent.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the semantically distinct variants of a single Polish reasoning dataset are representative enough for the observed scale-dependent patterns to generalize to other languages, domains, and model families.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Intrinsic data metrics predict reasoning dataset utility for model fine-tuning, with different predictors working best for smaller versus larger models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Intrinsic metrics on reasoning data strongly predict downstream model performance in a scale-dependent way.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"7f0eb79a92eaab0f73b7ab1aaae158a6f28540d8356609464db8dbac00f15659"},"source":{"id":"2605.13290","kind":"arxiv","version":1},"verdict":{"id":"1fff5dc3-7360-4d54-aaab-d5224eb28b18","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T19:32:58.449379Z","strongest_claim":"Our analysis reveals that these intrinsic metrics demonstrate strong and significant correlations with downstream model performance. 
Crucially, we find that the predictors of utility are scale-dependent.","one_line_summary":"Intrinsic data metrics predict reasoning dataset utility for model fine-tuning, with different predictors working best for smaller versus larger models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the semantically distinct variants of a single Polish reasoning dataset are representative enough for the observed scale-dependent patterns to generalize to other languages, domains, and model families.","pith_extraction_headline":"Intrinsic metrics on reasoning data strongly predict downstream model performance in a scale-dependent way."},"references":{"count":40,"sample":[{"doi":"","year":2024,"title":"Bandarkar, L., et al.: The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. In: ACL. pp. 749–775 (2024)","work_id":"3091917c-7714-465e-9941-18561a936ec8","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Bercovich, A., et al.: Llama-nemotron: Efficient reasoning models (2025)","work_id":"79c13186-7c10-4942-a0c5-e37f4f47ee2e","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"A.et al.Global piqa: Evaluating physical commonsense reasoning across 100+ languages and cultures (2025)","work_id":"5856c7a6-6f3b-42c6-9514-56c1131cee21","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"In: Proceedings of SIGMOD","work_id":"1b11590f-9469-4da6-8e98-8553efac27bd","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Reasoning Models Don't Always Say What They 
Think","work_id":"b9bdcbf5-9ae0-464c-b1a6-de04f85a6e33","ref_index":5,"cited_arxiv_id":"2505.05410","is_internal_anchor":true}],"resolved_work":40,"snapshot_sha256":"47b8fbb337353a491caa8c71e44764f4e4697465cb11ca88ba221d7c10c9eeba","internal_anchors":3},"formal_canon":{"evidence_count":2,"snapshot_sha256":"32ed1c712062ede0c407f0f3e6e1e84463501d7e7edcdd33fb9702a8ab84cf4b"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}