Diagnosing Capability Gaps in Fine-Tuning Data

Bruce Sun; Elias Stengel-Eskin; Emre Kiciman; Guilherme Potje; Leonardo de Oliveira Nunes; Rakshanda Agarwal; Ranveer Chandra; Rohan Jha; Rui Ying; Saeid Asgari Taghanaki

arxiv: 2604.27547 · v1 · submitted 2026-04-30 · 💻 cs.LG

Saeid Asgari Taghanaki, Rakshanda Agarwal, Bruce Sun, Rohan Jha, Elias Stengel-Eskin, Sara Malvar, Rui Ying, Yifei Xu, Guilherme Potje, Tusher Chakraborty, Leonardo de Oliveira Nunes, Ranveer Chandra, Emre Kiciman

Pith reviewed 2026-05-07 08:45 UTC · model grok-4.3

keywords: capability gaps · fine-tuning data · goal decomposition · dataset coverage · LLM alignment scoring · reinforcement fine-tuning · diagnostic framework

The pith

GoalCover identifies capability gaps in fine-tuning datasets by decomposing goals into subgoals and scoring sample coverage with LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that practitioners can detect gaps in the capabilities supported by their fine-tuning data before running expensive training. GoalCover achieves this by breaking a target task into atomic subgoals and using an LLM to rate how well each sample aligns with those subgoals. If true, this would allow users to diagnose and fix incomplete datasets, leading to stronger fine-tuned models on tasks like medical QA or financial summarization without trial-and-error training runs. The experiments show the method picks out targeted weaknesses even in corrupted data and that filtering based on the scores boosts performance.

Core claim

GoalCover guides a practitioner through structured decomposition of a high-level goal into atomic, independently evaluable subgoals; assigns each training sample an LLM-based alignment score against every subgoal; and surfaces missing capabilities through automated analysis of low-scoring sample explanations. Validation across domains shows it distinguishes targeted from non-targeted impacts, and filtering data with it improves reinforcement fine-tuning rewards.
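
The three steps in the core claim (decompose, score every pair, surface low-coverage subgoals) can be sketched as a small scoring loop. This is a minimal sketch under stated assumptions, not the paper's implementation: `keyword_scorer` is a hypothetical offline stand-in for the LLM evaluator, and the subgoal names, keywords, samples, and the 0.5 threshold are all invented for illustration.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str, List[str]], float]


def keyword_scorer(sample: str, keywords: List[str]) -> float:
    """Hypothetical stand-in for the LLM evaluator (an assumption, not the paper's scorer).

    Returns the fraction of a subgoal's keywords that appear in the sample.
    """
    text = sample.lower()
    return sum(kw in text for kw in keywords) / len(keywords)


def coverage_matrix(
    samples: List[str],
    subgoals: Dict[str, List[str]],
    scorer: Scorer = keyword_scorer,
) -> Dict[str, List[float]]:
    """Score every (sample, subgoal) pair; one list of scores per subgoal."""
    return {name: [scorer(s, kws) for s in samples] for name, kws in subgoals.items()}


def find_gaps(matrix: Dict[str, List[float]], threshold: float = 0.5) -> List[str]:
    """Return subgoals whose mean score falls below the (assumed) threshold."""
    return sorted(
        name for name, scores in matrix.items()
        if sum(scores) / len(scores) < threshold
    )


# Illustrative subgoals and samples; none of these come from the paper.
subgoals = {
    "report_figures": ["revenue", "margin"],
    "flag_risks": ["risk", "uncertainty"],
}
samples = [
    "Revenue rose 8% and margin held steady.",
    "Margin fell while revenue was flat.",
    "Revenue and margin both improved.",
]

matrix = coverage_matrix(samples, subgoals)
gaps = find_gaps(matrix)
print(gaps)
```

In this toy data every sample mentions the reported figures but none mentions risk, so only the hypothetical `flag_risks` subgoal is flagged. Passing a different `scorer` is where an actual LLM call would go.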

What carries the argument

GoalCover, the framework that performs interactive goal decomposition into atomic subgoals followed by automated LLM-based scoring of training sample alignment to reveal coverage gaps.

If this is right

Filtering a dataset using GoalCover scores raises the LLM-judge reward in a financial summarization reinforcement fine-tuning task from 3.77 to 4.12.
Controlled corruption of specific subgoals degrades their alignment scores by 25.6% on average while non-targeted subgoals drop only 2.1%.
Combining GoalCover-filtered data with goal-conditioned synthetic samples produces the highest reward of 4.20.
The approach succeeds in medical QA, legal summarization, and code generation domains.
It provides actionable signal for closing gaps before training begins.
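
The filtering step behind the first and third points might look like the following. The summary does not state the filtering rule, so this sketch assumes the simplest one: keep a sample when its mean alignment score across subgoals reaches a cutoff. The function name, the 1 to 5 score scale, and the cutoff of 3.0 are assumptions for illustration.

```python
from typing import List, Sequence


def filter_by_alignment(
    samples: Sequence[str],
    scores: Sequence[Sequence[float]],
    min_mean: float = 3.0,
) -> List[str]:
    """Keep samples whose mean per-subgoal score is at least `min_mean`.

    `scores[i][j]` is the alignment score of sample i against subgoal j.
    The mean-threshold rule is an assumption; the paper may filter differently.
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    kept = []
    for sample, row in zip(samples, scores):
        if sum(row) / len(row) >= min_mean:
            kept.append(sample)
    return kept


# Illustrative data: three samples scored against two subgoals.
samples = ["summary A", "summary B", "summary C"]
scores = [[5, 4], [1, 2], [4, 2]]

kept = filter_by_alignment(samples, scores)
print(kept)
```

A mean-based rule lets a strong score on one subgoal offset a weak score on another; a stricter variant would threshold the minimum score per sample instead.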

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Capability gaps may be a more common reason for poor fine-tuning outcomes than commonly assumed, suggesting dataset diagnostics deserve priority over scaling model size.
Practitioners could extend this by using low-coverage subgoals to automatically generate or retrieve additional training examples.
The framework's reliance on human-guided decomposition leaves room for testing fully automated versions in future applications.
It could generalize to assessing coverage in pre-training data for foundational capabilities.

Load-bearing premise

That the LLM judgments provide accurate and unbiased measures of whether a sample supports a given atomic subgoal and that the decomposition includes all critical capabilities.

What would settle it

An experiment in which deliberately corrupting samples tied to a particular subgoal produces no larger drop in the corresponding GoalCover score than in unrelated subgoals, or where using the filtered data fails to improve fine-tuning performance compared to the original dataset.
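
The first half of that test reduces to comparing relative score drops. The sketch below assumes per-subgoal mean scores before and after corruption are already available; the subgoal names and numbers are invented for illustration and are not the paper's data.

```python
from typing import Dict, Tuple


def relative_drop(clean: float, corrupted: float) -> float:
    """Fractional decrease from the clean score to the corrupted score."""
    if clean <= 0:
        raise ValueError("clean score must be positive")
    return (clean - corrupted) / clean


def targeted_effect(
    clean_means: Dict[str, float],
    corrupted_means: Dict[str, float],
    target: str,
) -> Tuple[float, float]:
    """Return (drop on the target subgoal, largest drop on any other subgoal)."""
    drops = {
        name: relative_drop(clean_means[name], corrupted_means[name])
        for name in clean_means
    }
    others = [d for name, d in drops.items() if name != target]
    return drops[target], max(others)


# Illustrative means on an assumed 1-5 scale; not results from the paper.
clean = {"dosage_accuracy": 4.0, "contraindications": 4.0, "evidence_citation": 4.0}
corrupted = {"dosage_accuracy": 3.0, "contraindications": 4.0, "evidence_citation": 3.5}

target_drop, worst_other = targeted_effect(clean, corrupted, target="dosage_accuracy")
print(target_drop, worst_other)

# The claim would be undermined if the targeted drop were no larger than the others.
claim_survives = target_drop > worst_other
```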

Figures

Figures reproduced from arXiv: 2604.27547 by Bruce Sun, Elias Stengel-Eskin, Emre Kiciman, Guilherme Potje, Leonardo de Oliveira Nunes, Rakshanda Agarwal, Ranveer Chandra, Rohan Jha, Rui Ying, Saeid Asgari Taghanaki, Sara Malvar, Tusher Chakraborty, Yifei Xu.

**Figure 1.** Capability gaps in fine-tuning datasets. The average coverage score across subgoals can look adequate while individual…

**Figure 2.** Overview of GoalCover. The four-phase pipeline takes a practitioner’s high-level goal G, decomposes it into atomic subgoals via an interactive clarification loop, scores every (sample, subgoal) pair, aggregates evaluator explanations from low-scoring samples into capability gaps, and produces remediation recommendations including goal-conditioned synthetic data. We validate the pipeline along two complemen…

Original abstract

Fine-tuning large language models (LLMs) for domain-specific tasks requires training datasets that comprehensively cover the target capabilities a practitioner needs. Yet identifying which capabilities a dataset fails to support, and doing so before an expensive fine-tuning run, remains a largely unsolved problem. We introduce GoalCover, a framework that helps practitioners systematically detect capability gaps in fine-tuning datasets through interactive goal decomposition and automated coverage assessment. GoalCover guides a practitioner through structured decomposition of a high-level goal into atomic, independently evaluable subgoals; assigns each training sample an LLM-based alignment score against every subgoal; and surfaces missing capabilities through automated analysis of low-scoring sample explanations. We validate the framework along two complementary axes. First, through controlled corruption experiments across three domains (medical QA, legal summarization, code generation), we show that GoalCover reliably distinguishes targeted from non-targeted capability impacts: target subgoals degrade by 25.6% on average versus 2.1% for non-target subgoals (Cohen's d=1.24). Second, we demonstrate downstream utility on a financial-summarization Reinforcement Fine-Tuning (RFT) task with Qwen-3-14B: training on GoalCover-filtered data improves the LLM-judge reward from 3.77 to 4.12 (out of 5) over the unfiltered baseline, and combining filtered data with goal-conditioned synthetic samples yields the strongest result (4.20). The two results together show that GoalCover works as a practical pre-fine-tuning diagnostic: it detects capability gaps and produces concrete signal for closing them.
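
For readers who want to see how an effect size such as the reported Cohen's d is computed, here is a minimal sketch. The abstract does not say which variant was used, so the pooled-standard-deviation form for two independent groups is an assumption, and the example values are invented rather than taken from the paper.

```python
import math
import statistics
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d with a pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two values")
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    if pooled_sd == 0:
        raise ValueError("pooled standard deviation is zero")
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Illustrative per-subgoal degradation percentages; not the paper's measurements.
target_drops = [30.0, 20.0, 25.0, 28.0]
non_target_drops = [3.0, 1.0, 2.0, 4.0]
print(cohens_d(target_drops, non_target_drops))
```

By the usual rule of thumb, values of d above 0.8 are read as large effects.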

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GoalCover gives a practical pipeline for spotting dataset gaps via goal decomposition and LLM scoring, but both its tests stay inside LLM judgments.

The main takeaway is that this paper offers GoalCover as a pre-fine-tuning check: break a high-level goal into atomic subgoals interactively, score each training sample against them with an LLM, and flag gaps from the low-score explanations. The controlled corruption tests across medical QA, legal summarization, and code generation show targeted subgoals dropping 25.6% on average while non-targeted ones drop only 2.1% (Cohen's d=1.24). The financial summarization RFT run with Qwen-3-14B then reports a reward lift from 3.77 to 4.12 (out of 5) when training on the filtered data, and 4.20 when goal-conditioned synthetic samples are added.

Referee Report

2 major / 2 minor

Summary. The paper introduces GoalCover, a framework to detect capability gaps in fine-tuning datasets for LLMs by interactively decomposing high-level goals into atomic subgoals, assigning LLM-based alignment scores to training samples for each subgoal, and analyzing low-scoring samples to surface missing capabilities. Validation is provided through controlled corruption experiments in three domains demonstrating that target subgoals degrade by 25.6% compared to 2.1% for non-target subgoals (Cohen's d=1.24), and a downstream reinforcement fine-tuning task on financial summarization where filtering data with GoalCover improves the LLM-judge reward from 3.77 to 4.12.

Significance. If the results hold under independent validation, GoalCover could offer practitioners a systematic way to identify and address capability gaps in training data prior to fine-tuning, potentially leading to more effective domain-specific LLM adaptations and reducing the need for costly trial-and-error fine-tuning runs. The framework's use of interactive decomposition and automated analysis addresses a practical gap in current fine-tuning practices.

major comments (2)

[Validation: Controlled Corruption Experiments] The controlled corruption experiments (described in the validation section) measure degradation using the same LLM alignment scorer employed by GoalCover. This shows sensitivity to the injected artificial edits but does not demonstrate detection of genuine pre-existing capability gaps in unaltered datasets; an independent check such as human evaluation of identified gaps or correlation with external task metrics is needed to support the claim of reliable distinction.
[Validation: Downstream RFT Task] In the downstream RFT task on financial summarization with Qwen-3-14B, the reported reward improvement (3.77 to 4.12) is assessed via an LLM judge. The manuscript should demonstrate that the judge's evaluation criteria are independent of the subgoal alignment prompts, as overlap would undermine the utility claim.

minor comments (2)

[Abstract and Methods] The abstract and methods provide no details on prompt engineering for the LLM scorer, score calibration procedures, or explicit controls for LLM-judge bias; these should be added for reproducibility.
[Framework Description] Clarify the interactive decomposition protocol, including how practitioners ensure subgoals are atomic and independently evaluable without missing critical capabilities.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the detailed and insightful comments on our manuscript. These have helped us better articulate the strengths and limitations of our validation strategy. We address each major comment below, making revisions to the manuscript where necessary to enhance clarity and rigor.

Referee: The controlled corruption experiments (described in the validation section) measure degradation using the same LLM alignment scorer employed by GoalCover. This shows sensitivity to the injected artificial edits but does not demonstrate detection of genuine pre-existing capability gaps in unaltered datasets; an independent check such as human evaluation of identified gaps or correlation with external task metrics is needed to support the claim of reliable distinction.

Authors: We concur that the controlled corruption experiments are designed to test the sensitivity of the GoalCover framework to specific, artificially introduced capability gaps using the LLM scorer, thereby validating that the scoring mechanism can distinguish targeted from non-targeted subgoals (with a large effect size, Cohen's d=1.24). This is a necessary first step to establish the reliability of the automated analysis. For evidence of detecting genuine pre-existing gaps in real, unaltered datasets, we rely on the complementary downstream RFT experiment on financial summarization. Here, GoalCover identifies gaps in the original dataset, and filtering based on these leads to measurable improvements in model performance as evaluated by an independent reward metric from the RFT process. We have updated the manuscript to explicitly frame the controlled experiments as a sensitivity analysis and the RFT task as the validation of real-world gap detection, including a clearer discussion of how the downstream reward serves as the external task metric.
Revision: partial

Referee: In the downstream RFT task on financial summarization with Qwen-3-14B, the reported reward improvement (3.77 to 4.12) is assessed via an LLM judge. The manuscript should demonstrate that the judge's evaluation criteria are independent of the subgoal alignment prompts, as overlap would undermine the utility claim.

Authors: We appreciate this observation, as independence between the filtering mechanism and the evaluation metric is crucial for the validity of our utility claims. The subgoal alignment prompts are specific to the decomposed subgoals for the financial summarization task, focusing on granular aspects such as coverage of particular financial concepts or reasoning steps. The LLM judge prompt, however, assesses the overall quality of the generated summaries using criteria like accuracy, completeness, conciseness, and relevance, which are standard for summarization evaluation and do not reference the subgoal structure or alignment scores. To address this, we have added the full text of both the subgoal alignment prompts and the LLM judge prompt to the appendix of the revised manuscript, enabling direct comparison and confirmation of their distinct criteria.
Revision: yes
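
A reader comparing the two prompts side by side could start with a crude lexical check. The sketch below computes Jaccard overlap between the word sets of two prompts. It is not something the paper or the rebuttal describes: the prompt texts are hypothetical, and low word overlap is only weak evidence, since two prompts can share criteria while using different vocabulary.

```python
import re
from typing import Set


def word_set(text: str) -> Set[str]:
    """Lowercased alphabetic tokens of a prompt."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(prompt_a: str, prompt_b: str) -> float:
    """Jaccard similarity of the two prompts' word sets (0.0 when both are empty)."""
    words_a, words_b = word_set(prompt_a), word_set(prompt_b)
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


# Hypothetical prompt fragments, written for this example only.
judge_prompt = "Rate the summary for accuracy, completeness, conciseness, and relevance."
subgoal_prompt = "Does the sample state quarterly revenue figures with correct units?"

print(jaccard(judge_prompt, subgoal_prompt))
```

A high score would be a reason to read the two prompts closely; a low score does not settle the independence question on its own.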

Circularity Check

0 steps flagged

No significant circularity in framework or validation

The paper presents GoalCover as an interactive framework for goal decomposition and LLM-based alignment scoring to detect dataset gaps, validated empirically via controlled corruption experiments (showing 25.6% target vs 2.1% non-target degradation) and a separate downstream RFT task on financial summarization (LLM-judge reward improving from 3.77 to 4.12). No derivation chain, equations, or first-principles results exist that reduce outputs to inputs by construction. The validations are external empirical tests rather than self-referential fits or renamings; corruption introduces known artificial gaps to test differential scoring, and the RFT uses a distinct task metric. No self-citations, ansatz smuggling, or uniqueness theorems appear as load-bearing elements. The framework remains self-contained against its stated benchmarks with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM judgments can serve as reliable proxies for capability coverage. No explicit free parameters or new physical entities are introduced; the framework itself is the primary addition.

axioms (1)

Domain assumption: Large language models can provide accurate and consistent alignment scores between training samples and atomic subgoals. This assumption underpins the automated coverage assessment and gap detection steps.
