From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment

Hao Chen; Junbo Zhao; Lirong Gao; Liyao Li; Ningtao Wang; Qi Zhang; Wentao Ye; Xiaoyu Shen; Xing Fu; Zhanming Shen

arxiv: 2605.21558 · v1 · pith:WBW2GKCMnew · submitted 2026-05-20 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-22 00:10 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords LLM alignment · fine-tuning · attention heads · data selection · parameter pruning · efficiency · Strong Map Hypothesis

The pith

Task-sensitive attention heads can guide both data selection and parameter pruning to align LLMs far more efficiently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that data selection and parameter updates in LLM fine-tuning are coupled through a small number of key attention heads, which are hypothesized to control how the model responds to particular task data. The P2D method first locates these heads with a lightweight proxy, then uses them both to pick the highest-affinity training samples and to restrict updates to those heads only. This joint approach avoids the inefficiency of handling data and parameters separately. By updating only 10% of the attention heads on 10% of the data, the paper reports an 8.3 pp performance gain over strong baselines and a 7.0x end-to-end speedup.

Core claim

We posit the Strong Map Hypothesis: a sparse subset of attention heads plays a dominant role in task-specific adaptation, acting as keys that unlock specific data patterns. Building on this observation, we propose From Parameters to Data (P2D), a unified framework that leverages these task-sensitive attention heads as a dual compass for both sample mining and structural pruning. To rigorously quantify the total pipeline cost, we introduce the Alignment Efficiency Ratio (AER) metric for both selection latency and training time. Mechanistically, P2D identifies critical heads via a lightweight proxy and uses them as a functional filter to curate high-affinity data, establishing a synergistic pipeline.

What carries the argument

task-sensitive attention heads identified via a lightweight proxy and used as a dual compass for sample mining and structural pruning
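
The "dual compass" idea can be made concrete with a minimal sketch of the first two stages. It assumes per-head sensitivity scores and per-sample head activations have already been computed by some proxy; the names `top_fraction`, `select_heads`, `score_sample`, and `select_data`, and the mean-activation affinity, are hypothetical illustrations rather than the paper's actual scoring rule.

```python
from typing import Dict, Hashable, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def top_fraction(scores: Dict[Hashable, float], fraction: float) -> List[Hashable]:
    """Return the highest-scoring keys, keeping at least one."""
    k = max(1, int(len(scores) * fraction))
    return sorted(scores, key=scores.get, reverse=True)[:k]


def select_heads(head_sensitivity: Dict[Head, float], fraction: float = 0.1) -> List[Head]:
    """Stage i: keep the most task-sensitive heads (scores come from an assumed proxy)."""
    return top_fraction(head_sensitivity, fraction)


def score_sample(head_activation: Dict[Head, float], heads: List[Head]) -> float:
    """Hypothetical affinity: mean activation of the selected heads on one sample."""
    return sum(head_activation[h] for h in heads) / len(heads)


def select_data(
    per_sample_activation: Dict[str, Dict[Head, float]],
    heads: List[Head],
    fraction: float = 0.1,
) -> List[str]:
    """Stage ii: keep the samples on which the selected heads respond most strongly."""
    affinity = {
        sample_id: score_sample(activation, heads)
        for sample_id, activation in per_sample_activation.items()
    }
    return top_fraction(affinity, fraction)
```

The same head set feeds both functions, which is the coupling the paper emphasizes: the data ranking changes whenever the head set changes.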

Load-bearing premise

A sparse subset of attention heads plays a dominant role in task-specific adaptation and acts as keys that unlock specific data patterns.

What would settle it

An experiment in which data selected using randomly chosen attention heads performs no better than data selected using the task-sensitive heads identified by the proxy.
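
A minimal sketch of such a control, assuming a user-supplied `evaluate(heads)` callable that runs selection, fine-tuning, and evaluation for a given head set and returns a task score. Both `evaluate` and `random_head_control` are hypothetical names introduced here for illustration; the paper's own protocol is not described on this page.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def random_head_control(
    all_heads: Sequence[Head],
    proxy_heads: Sequence[Head],
    evaluate: Callable[[List[Head]], float],
    n_trials: int = 5,
    seed: int = 0,
) -> Dict[str, float]:
    """Compare proxy-selected heads against size-matched random head subsets."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    k = len(proxy_heads)
    proxy_score = evaluate(list(proxy_heads))
    random_scores = [evaluate(rng.sample(list(all_heads), k)) for _ in range(n_trials)]
    random_mean = sum(random_scores) / n_trials
    return {
        "proxy": proxy_score,
        "random_mean": random_mean,
        "gap": proxy_score - random_mean,
    }
```

A gap near zero across several seeds would count against the hypothesis, since it would suggest that sparsity alone explains the gains.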

Figures

Figures reproduced from arXiv: 2605.21558 by Hao Chen, Junbo Zhao, Lirong Gao, Liyao Li, Ningtao Wang, Qi Zhang, Wentao Ye, Xiaoyu Shen, Xing Fu, Zhanming Shen.

**Figure 1.** Comparison of AER↓ and performance↑. P2D (marked by ⋆) achieves the optimal trade-off, outperforming other strong baselines. The dashed lines connect adaptation variants for each selection strategy. Notably, P2D synergizes parameter-guided data selection (P2D†) with sparse head adaptation (P2D‡) for superior efficiency. Full SFT utilizes all data and parameters.

**Figure 2.** The overall framework of P2D, comprising three integral stages: i) Fast Head Localization, which identifies task-sensitive attention heads (denoted as H_T) via a lightweight proxy; ii) Parameter-Guided Data Selection (P2D†), which utilizes H_T as a sparse mask during inference to compute attention-based scores for curating a task-specific dataset D_T; and iii) Sparse Head Adaptation (P2D‡), which selectiv…

**Figure 3.** Top 10% task-sensitive attention heads identified by P2D on Qwen-2.5-7B. We observe a clear phenomenon of Structural Specialization: while some heads are broadly active (indicating general-purpose utility), distinct sparse clusters activate exclusively for specific tasks (e.g., GSM8K vs. BioInstruct). These heads span across multiple layers, forming a task-specific functional sub-network. …

**Figure 4.** Data distribution of all samples vs. the 10% subset extracted by P2D on Qwen-2.5-7B-Instruct. The distinct skew toward lower perplexity signifies High-Affinity: the selected data structurally resonates with the identified task-sensitive heads. … that the average sample lengths (both in tokens and words) of the P2D subset are remarkably close to, and in some cases slightly higher than, those of the full datas…

**Figure 5.** Parameter-Guided data selection prompt for GSM8K dataset.

**Figure 6.** Parameter-Guided data selection prompt for DialogSum dataset. Prompt excerpt: Instruction: You are an expert assistant. Your task is to provide clear, concise, and complete summaries for the given dialogues. Your summaries should accurately capture the main points of each dialogue. Avoid unnecessary details and ensure clarity. ### Guidelines for your response: 1. **Summarize the dialogue concisely and fully**, ensuri…

**Figure 7.** Parameter-Guided data selection prompt for BioInstruct dataset.

**Figure 8.** Data distribution of all samples and 10% samples extracted by our method. PPL stands for perplexity and is calculated with Qwen-3-8B.

**Figure 9.** Data distribution of all samples and 10% samples extracted by our method. PPL stands for perplexity and is calculated with Llama-3-8B-Instruct.

**Figure 10.** Heatmap of attention heads on Qwen3-8B for a) GSM8K, b) DialogSum, and c) BioInstruct datasets. The color intensity indicates the sensitivity of each head.

**Figure 11.** Heatmap of attention heads on Llama-3-8B-Instruct for a) GSM8K, b) DialogSum, and c) BioInstruct datasets. The color intensity indicates the sensitivity of each head.

Original abstract

Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely treated data selection and parameter-efficient fine-tuning as isolated processes, our empirical analysis suggests they may be intrinsically coupled. We posit the Strong Map Hypothesis: a sparse subset of attention heads plays a dominant role in task-specific adaptation, acting as keys that unlock specific data patterns. Building on this observation, we propose From Parameters to Data (P2D), a unified framework that leverages these task-sensitive attention heads as a dual compass for both sample mining and structural pruning. To rigorously quantify the total pipeline cost, we introduce the Alignment Efficiency Ratio (AER) metric for both selection latency and training time. Mechanistically, P2D identifies critical heads via a lightweight proxy and uses them as a functional filter to curate high-affinity data, establishing a synergistic pipeline. Empirically, by updating merely 10% of attention heads on 10% of the data, P2D achieves an 8.3 pp performance gain over strong baselines and delivers a 7.0x end-to-end time speedup. These results validate that precise parameter-data synchronization eliminates redundancy, offering a new paradigm for efficient alignment.
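
To illustrate what "updating merely 10% of attention heads" could look like mechanically, here is a minimal sketch of a masked gradient step. It assumes each head owns a contiguous block of `head_dim` rows in a projection matrix and that plain SGD is used; the layout, the optimizer, and the helper names `head_row_mask` and `masked_sgd_step` are all assumptions for illustration, since the abstract does not say which weights P2D updates or how.

```python
from typing import Iterable, List


def head_row_mask(n_heads: int, head_dim: int, selected_heads: Iterable[int]) -> List[float]:
    """Build a 0/1 mask over rows; rows of selected heads get 1.0 (assumed contiguous layout)."""
    selected = set(selected_heads)
    return [
        1.0 if (row // head_dim) in selected else 0.0
        for row in range(n_heads * head_dim)
    ]


def masked_sgd_step(
    weight: List[List[float]],
    grad: List[List[float]],
    row_mask: List[float],
    lr: float,
) -> List[List[float]]:
    """Apply an SGD update only to rows whose mask is 1.0; other rows stay frozen."""
    return [
        [w - lr * m * g for w, g in zip(w_row, g_row)]
        for w_row, g_row, m in zip(weight, grad, row_mask)
    ]
```

In a real training framework the same effect is usually obtained by zeroing gradients for frozen slices before the optimizer step; the list-based version above only shows the masking logic.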

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

P2D couples a small set of attention heads to both parameter updates and data selection for cheaper LLM alignment, but the supporting evidence stays thin and the key hypothesis lacks a direct causal check.

The main takeaway is that this paper introduces P2D, a pipeline that picks a sparse subset of attention heads and uses them to both prune what gets updated and to mine which data samples to keep. They report that touching just 10% of heads on 10% of the data yields an 8.3 point gain and 7x end-to-end speedup over baselines, and they introduce the AER metric to track total pipeline cost from selection onward. The Strong Map Hypothesis frames the heads as task-specific keys that unlock matching data patterns, which is the conceptual link they build the whole thing around.

If the numbers hold, the practical payoff for people doing repeated domain adaptation would be real. The integration itself is the part that feels new: instead of running data selection and PEFT as separate steps, they close the loop so the same lightweight proxy serves both. That kind of unification is worth noting even if the details need work.

The soft spots sit right at the center. The hypothesis is derived from the same observations used to choose the heads and filter the data, and the abstract gives no ablation that swaps in random heads of the same size to test whether the specific selection is what drives the lift. Without that comparison, it is difficult to separate the claimed synchronization effect from the simpler fact that doing less of everything saves time and sometimes still works. The experimental description is also light on baselines, run-to-run variance, and how the proxy itself was validated, which leaves the reported gains hard to interpret.

This is aimed at engineers and researchers who already fine-tune LLMs under tight compute budgets and are looking for ways to cut both data and parameter costs at once. A reader working on attention-based pruning or data curation might borrow the dual-use idea even if they end up modifying the hypothesis.

I would send it to referees. The direction is coherent and the efficiency angle matters in practice, but the missing controls and ablations are exactly what review should surface before the claims can be taken as settled.

Referee Report

3 major / 1 minor

Summary. The paper posits the Strong Map Hypothesis that a sparse subset of attention heads dominates task-specific adaptation in LLMs and proposes the P2D framework to use these heads as a dual filter for data curation and structural pruning. It introduces the Alignment Efficiency Ratio (AER) metric and claims that updating 10% of heads on 10% of data yields an 8.3 pp gain over baselines and a 7.0x end-to-end speedup.

Significance. If the Strong Map Hypothesis is causally validated and the efficiency gains hold under controlled experiments, the work could advance efficient LLM alignment by demonstrating synergistic parameter-data selection. The AER metric provides a potentially useful holistic efficiency measure that accounts for both selection and training costs.

major comments (3)

[Abstract] The central empirical claim of an 8.3 pp performance gain and 7.0x speedup from updating merely 10% of attention heads on 10% of the data is presented without any description of experimental setup, datasets, baseline definitions, number of runs, or statistical tests. This directly undermines evaluation of the load-bearing result.
[Abstract] The Strong Map Hypothesis is derived from the same empirical observations used to identify heads and curate data, with no mention of independent held-out validation, causal ablation (e.g., proxy-selected heads vs. random 10% subsets), or falsification tests. Without such controls, the reported gains risk being attributable to generic sparsity rather than the hypothesized parameter-data mapping.
[Abstract] The Alignment Efficiency Ratio (AER) is introduced to quantify total pipeline cost including selection latency and training time, yet no definition, formula, or computation details are supplied. This leaves the 7.0x speedup claim ungrounded.

minor comments (1)

[Abstract] The abstract refers to 'strong baselines' without naming them; the full manuscript should explicitly list and cite all comparison methods.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the abstract. We address each comment below and will revise the abstract to incorporate the requested clarifications while preserving its conciseness.

Referee: [Abstract] The central empirical claim of an 8.3 pp performance gain and 7.0x speedup from updating merely 10% of attention heads on 10% of the data is presented without any description of experimental setup, datasets, baseline definitions, number of runs, or statistical tests. This directly undermines evaluation of the load-bearing result.

Authors: We agree that the abstract would benefit from additional context to allow readers to evaluate the central claims more readily. In the revised manuscript we will update the abstract to briefly note the primary datasets, the strong baselines used for comparison, the number of random seeds, and the statistical tests applied. Full experimental details remain in the main body and appendices.
revision: yes

Referee: [Abstract] The Strong Map Hypothesis is derived from the same empirical observations used to identify heads and curate data, with no mention of independent held-out validation, causal ablation (e.g., proxy-selected heads vs. random 10% subsets), or falsification tests. Without such controls, the reported gains risk being attributable to generic sparsity rather than the hypothesized parameter-data mapping.

Authors: The hypothesis is motivated by initial observations but is evaluated through downstream task performance on held-out data. To directly address the concern regarding generic sparsity, we will add explicit ablation results comparing task-sensitive heads against random 10% subsets (both for heads and data) in the experiments section and will reference these controls concisely in the revised abstract.
revision: yes

Referee: [Abstract] The Alignment Efficiency Ratio (AER) is introduced to quantify total pipeline cost including selection latency and training time, yet no definition, formula, or computation details are supplied. This leaves the 7.0x speedup claim ungrounded.

Authors: We acknowledge that the abstract omits the formal definition of AER. We will revise the abstract to include a one-sentence definition (AER as the ratio of a method's end-to-end alignment time, incorporating both selection and training costs, to the full fine-tuning time, AER(f) = t_f / t_FFT, so that lower is better) and will retain the full formula and computation details in the main text.
revision: yes
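
As a worked illustration of the metric, the sketch below uses the form quoted elsewhere on this page, AER(f) = t_f / t_FFT, and reads t_f as selection latency plus training time, following the abstract. The function names and the timings in the usage note are hypothetical and are not figures from the paper.

```python
def aer(selection_seconds: float, training_seconds: float, full_ft_seconds: float) -> float:
    """AER(f) = t_f / t_FFT, with t_f taken as selection time plus training time."""
    if full_ft_seconds <= 0:
        raise ValueError("full fine-tuning time must be positive")
    return (selection_seconds + training_seconds) / full_ft_seconds


def speedup(aer_value: float) -> float:
    """End-to-end speedup over full fine-tuning is the reciprocal of AER."""
    if aer_value <= 0:
        raise ValueError("AER must be positive")
    return 1.0 / aer_value
```

With made-up timings of 10 s for selection, 90 s for training, and 700 s for full fine-tuning, `aer(10, 90, 700)` is 1/7, which corresponds to a 7x speedup. Under this form lower AER is better, consistent with the AER↓ label in Figure 1.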

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's chain begins with an empirical observation that data selection and parameter-efficient tuning may be coupled, from which the Strong Map Hypothesis is posited as an interpretive claim. The P2D framework is then constructed by using identified heads as a filter for pruning and sample selection, with results measured via the AER metric and reported performance deltas. This sequence does not reduce any claimed prediction or first-principles result to its inputs by construction; the hypothesis functions as a motivating observation rather than a self-referential definition, and the reported gains are measured outcomes of the proposed pipeline rather than statistically forced re-statements of the initial analysis. No equations, fitted parameters renamed as predictions, or self-citation chains that bear the central load appear in the provided text. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on an ad-hoc hypothesis and tuned percentages whose selection is not independently justified outside the reported gains.

free parameters (1)

10% threshold for heads and data
Chosen to demonstrate the efficiency gains; appears tuned to the experimental outcomes rather than derived from first principles.

axioms (1)

Strong Map Hypothesis (ad hoc to paper): a sparse subset of attention heads plays a dominant role in task-specific adaptation
Posited from empirical analysis but presented without formal proof or external validation.

invented entities (2)

Strong Map Hypothesis: no independent evidence
purpose: To explain the intrinsic coupling between task-sensitive parameters and data patterns
Newly introduced construct without independent falsifiable evidence outside the paper's own experiments.
Alignment Efficiency Ratio (AER): no independent evidence
purpose: To quantify total pipeline cost for selection latency and training time
New metric defined in the paper to support the efficiency claims.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We posit the Strong Map Hypothesis: a sparse subset of attention heads plays a dominant role in task-specific adaptation... P2D identifies critical heads via a lightweight proxy and uses them as a functional filter to curate high-affinity data
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

AER(f) = t_f / t_FFT ... updates merely 10% of attention heads on 10% of the data

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1]

PMLR, 2019. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022. Humane, P., Cudrano, P., Kaplan, D. Z., Matteucci, M., Chakraborty, S., and Rish, I. Influence functions for efficient data selection in reasoning. arXiv preprint arXiv:2510.06108,...

work page arXiv 2019
[2]

The top-ranking heads form the task-sensitive set H_T

Fast Head Identification: Each attention head in M_0 is scored according to its relevance to the downstream task, using a lightweight proxy model or gradient-based criterion. The top-ranking heads form the task-sensitive set H_T.

work page
[3]

The highest-scoring instances are gathered into the curated subset D_T

Parameter-Guided Data Selection: We use H_T to compute a relevance score for every example in D, for instance by measuring how strongly the identified heads activate on each input. The highest-scoring instances are gathered into the curated subset D_T.

work page
[4]

lock”) and the curated high-affinity data (the “key

Sparse Head Adaptation: Only the parameters corresponding to H_T are updated, and training is performed exclusively on D_T. This focused adaptation produces the final model M_P2D while preserving the remaining pretrained weights. By restricting both the param...

work page arXiv 2018
[5]

Your response must contain only step-by-step calculations and the final answer

work page
[6]

Replace ‘<number>’ with the correct final result (either an integer or a floating-point number)

The final output **must** be formatted as: ####<number>. Replace ‘<number>’ with the correct final result (either an integer or a floating-point number). No deviations or alternative formats are allowed

work page
[7]

Do not add any commentary, questions, greetings, or extra remarks

work page
[8]

Please answer each question step by step and provide the final answer following the instructions below

Ensure your calculations are clear, concise, and correct, but only include the steps required to arrive at the final answer. Please answer each question step by step and provide the final answer following the instructions below. Input: Below are some demonstrations of how to format your answers: Question:<Question 1> Answer:<Answer 1> ... **Strictly use t...

work page
[9]

**Summarize the dialogue concisely and fully**, ensuring all main points are captured

work page
[10]

**Avoid adding extra commentary or irrelevant details** that are not part of the dialogue content

work page
[11]

If a dialogue is unclear, incomplete, or lacks meaningful content, respond with ”No valid content to summarize

work page
[12]

Please summarize dialogues based on the given instructions and demonstrations below

**Ensure every summary field is filled out.** Leaving any field blank is not allowed. Please summarize dialogues based on the given instructions and demonstrations below. Input: Below are some demonstrations of how to format your answers: Dialogue:<Dialogue 1> Summary:<Summary 1> ... **Strictly use the format specified below:** Summary 1:<Your summary to ...

work page
[13]

**Ensure your responses are concise, clear, and focused on the provided instruction.**

work page
[14]

**Follow the logical order of questions.** Do not skip or merge responses

work page
[15]

**Avoid adding extra commentary or irrelevant details. DO NOT repeat or summarize the question.** Input: Below are some demonstrations of how to format your answers: Instruction:<Demonstration 1 Instruction> Input:<Demonstration 1 Input> Answer:<Demonstration 1 Answer> ... **Strictly use the format specified below:** Question 1 Answer:<your answer to Ques...

work page arXiv
