
arxiv: 2602.18584 · v2 · pith:2IXGW5PSnew · submitted 2026-02-20 · 💻 cs.LG · cs.AI · cs.CV

GIST: Targeted Data Selection for Instruction Tuning via Coupled Optimization Geometry

Pith reviewed 2026-05-21 12:21 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV
keywords data selection · instruction tuning · LoRA · gradient subspace · SVD · PEFT · optimization geometry · influence estimation

The pith

GIST selects influential examples for instruction tuning by projecting gradients into a low-dimensional subspace recovered via SVD from validation data to capture LoRA's cross-parameter couplings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that axis-aligned surrogates like optimizer statistics break down for data selection in parameter-efficient fine-tuning because LoRA induces strong off-diagonal couplings and confines relevant updates to a low-dimensional space. GIST instead extracts a task-specific subspace from validation gradients with singular value decomposition, projects training gradients into it, and ranks examples by alignment with target directions. This yields selections that match or beat prior methods while slashing storage to 0.29 percent and compute time to 25 percent of the baseline under identical budgets. Readers should care because it directly tackles the resource barrier in scaling targeted instruction tuning for large models without sacrificing influence estimation quality.

Core claim

GIST replaces axis-aligned scaling with robust subspace alignment: it recovers a task-specific subspace from validation gradients via SVD, projects training gradients into this coupled subspace, and scores examples by their alignment with the target directions. The method is motivated by the observation that LoRA optimization geometry exhibits non-trivial off-diagonal interactions that diagonal preconditioners cannot represent, while task-relevant update directions remain low-dimensional.
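
The three steps named above (SVD on validation gradients, projection of training gradients, alignment scoring) can be sketched in NumPy. This is a minimal illustration and not the authors' implementation: the function names, the choice of the mean validation gradient as the target direction, and the normalization by the full gradient norm are all assumptions made here, since this page does not state the exact scoring formula.

```python
import numpy as np

def gist_style_scores(G_val, G_train, rank):
    """Alignment of each training gradient with a validation-gradient subspace.

    G_val: (n_val, d) validation gradients; G_train: (n_train, d) training gradients.
    """
    # Right singular vectors of G_val span the dominant validation directions.
    _, _, Vt = np.linalg.svd(G_val, full_matrices=False)
    basis = Vt[:rank]                        # (rank, d), orthonormal rows
    # Assumed target direction: the mean validation gradient, expressed in the subspace.
    target = basis @ G_val.mean(axis=0)      # (rank,)
    projected = G_train @ basis.T            # (n_train, rank)
    # Assumed score: projected inner product, normalized by the full gradient norm.
    denom = np.linalg.norm(G_train, axis=1) * np.linalg.norm(target) + 1e-12
    return (projected @ target) / denom

def select_top_k(scores, k):
    """Indices of the k highest-scoring training examples."""
    return np.argsort(-scores)[:k]

# Synthetic check: five training gradients are pushed along the validation direction.
rng = np.random.default_rng(0)
d = 64
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
G_val = np.outer(rng.uniform(0.5, 1.5, size=8), direction) + 0.01 * rng.normal(size=(8, d))
G_train = rng.normal(size=(100, d))
G_train[:5] += 10.0 * direction
scores = gist_style_scores(G_val, G_train, rank=2)
chosen = select_top_k(scores, k=5)
```

In this sketch only the `rank` projected coordinates per example are needed for scoring, apart from the per-example gradient norm used in the assumed normalization.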

What carries the argument

Gradient Isometric Subspace Transformation (GIST), which recovers a task-specific low-dimensional subspace from validation gradients via SVD and scores examples by projection alignment within that subspace.

If this is right

  • GIST achieves state-of-the-art selection quality with 0.29 percent storage and 25 percent computation relative to the prior baseline under the same selection budget.
  • The approach directly demonstrates that LoRA induces optimization geometries with non-trivial off-diagonal parameter couplings.
  • Task-relevant update directions can be isolated in a low-dimensional subspace extracted from validation gradients alone.
  • The method remains applicable under fixed selection budgets without requiring full-model gradient storage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The subspace construction may extend to other parameter-efficient adapters that similarly induce low-rank or coupled update structures.
  • Dynamic recomputation of the SVD subspace at intervals during selection could further improve alignment as training progresses.
  • The same projection technique might serve as a diagnostic tool to measure intrinsic task dimensionality across different fine-tuning regimes.
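
As a rough illustration of the diagnostic idea in the last bullet, one could count how many singular directions are needed to reach a chosen share of the gradient matrix's spectral energy. The helper name, the 0.95 threshold, and the synthetic data are assumptions of this sketch; the matrix is not mean-centered, so "energy" here is the squared-singular-value share rather than a statistical variance.

```python
import numpy as np

def explained_energy_rank(G, threshold=0.95):
    """Smallest number of singular directions whose cumulative energy reaches threshold."""
    s = np.linalg.svd(G, compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    # searchsorted returns the first index whose cumulative energy is >= threshold.
    k = int(np.searchsorted(energy, threshold)) + 1
    return k, energy

# Synthetic gradients: 20 examples in 200 dimensions with rank-3 structure plus small noise.
rng = np.random.default_rng(0)
G = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 200)) + 1e-3 * rng.normal(size=(20, 200))
k, energy = explained_energy_rank(G)
```

Because the matrix has at most as many nonzero singular values as rows, the estimate can never exceed the number of validation examples.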

Load-bearing premise

The low-dimensional subspace recovered from validation gradients via SVD captures the task-relevant update directions even though LoRA induces strong cross-parameter couplings that invalidate axis-aligned approximations.

What would settle it

If data subsets chosen by GIST subspace scores yield downstream instruction-tuning performance no better than subsets chosen by random sampling or by standard diagonal optimizer statistics across multiple target tasks, the claimed advantage would be refuted.

Figures

Figures reproduced from arXiv: 2602.18584 by Chen Chen, Guanghui Min, Ke Wan, Tianhao Huang.

Figure 1
Figure 1. The validation-gradient matrix concentrates most of its variance in a low-dimensional principal subspace (rapid spectral decay and early saturation of explained variance), a consequence of the targeted setting where rank(G_val) ≤ |D_val| ≪ d. Importantly, low-rank does not imply axis-alignment. The resulting principal directions are generally linear combinations of coordinates, i.e., a rotated …
Figure 2
Figure 2. Overview of GIST. Step 1: Lightweight warmup performs a short LoRA warmup on a sampled subset and computes validation gradients. Step 2: Spectral filtering applies an SVD on the validation gradient matrix to construct a low-rank target subspace (Target projector). Step 3: Geometric scoring projects candidate gradients onto the target subspace and selects Top-k samples. … tude and (ii) the proxy mismatch, suc…
Figure 3
Figure 3. Impact of Checkpoint Selection. (a) Using single-epoch gradients shows a clear performance drop in later epochs. (b) Aggregating multiple checkpoints (weighted) does not outperform the early-stop strategy, confirming that early gradients contain the essential task optimization directions.
Figure 4
Figure 4. Accuracy as a function of the projection rank used by GIST, compared with LESS at the same selection budget. … performance: while combining Epochs 1-2 yields a slight peak (70.2%), incorporating all four epochs degrades accuracy to 67.3%, significantly lower than using the early checkpoints alone. This aligns with the results in …
Figure 5
Figure 5. Toy 2D optimization dynamics with the same initialization θ0 = (−2.5, 0). Newton (full-matrix) follows the direct descent direction, while Adam (diagonal) cannot express the rotation induced by coupling, leading to a "zig-zag" trajectory on the coupled landscape. … where R is a 2D rotation matrix. Thus, both cases share the same eigenvalues (and condition number), but H_cpl is not diagonal in the coordinate b…
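
The contrast described in the Figure 5 caption can be reproduced on a small quadratic. The sketch below is an illustration under stated assumptions, not the paper's experiment: the eigenvalues (1 and 10), the 45-degree rotation, and the Jacobi-style diagonal preconditioner standing in for Adam's diagonal scaling are all choices made here. Only the initialization θ0 = (−2.5, 0) and the construction of the coupled Hessian by rotation come from the caption.

```python
import numpy as np

theta0 = np.array([-2.5, 0.0])          # initialization from the caption
H_diag = np.diag([1.0, 10.0])           # assumed axis-aligned Hessian
angle = np.pi / 4                       # assumed rotation angle
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
H_cpl = R.T @ H_diag @ R                # same eigenvalues, no longer diagonal

def newton_step(H, theta):
    # Full-matrix step on the quadratic 0.5 * theta^T H theta (gradient is H @ theta).
    return theta - np.linalg.solve(H, H @ theta)

def diagonal_step(H, theta, lr=1.0):
    # Diagonal preconditioner: rescales each coordinate by the Hessian diagonal only.
    return theta - lr * (H @ theta) / np.diag(H)

def run(step, H, theta, n_steps):
    path = [theta]
    for _ in range(n_steps):
        theta = step(H, theta)
        path.append(theta)
    return np.array(path)

newton_path = run(newton_step, H_cpl, theta0, 1)
diag_axis_path = run(diagonal_step, H_diag, theta0, 1)
diag_cpl_path = run(diagonal_step, H_cpl, theta0, 20)
```

On the axis-aligned problem the diagonal step coincides with the Newton step, so both reach the minimum at once. On the rotated problem the diagonal step ignores the off-diagonal term and needs many iterations, even though the eigenvalues are unchanged.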
Figure 6
Figure 6. Spectral Analysis of Gradient Subspaces across Models and Epochs. Top Row: Singular value spectra (log scale) of the gradient covariance matrix. Early epochs (blue solid lines) show slower decay, indicating a higher-dimensional optimization landscape. Bottom Row: Cumulative explained variance. Later epochs (dotted red lines) reach 95% variance with fewer components, signaling dimensional shrinkage. Notably…
Figure 7
Figure 7. Impact of Dataset Scale and Task Type on Gradient Geometry. We analyze the spectral properties of Llama3.2-3B gradients across three datasets with varying sizes. Top Row: Singular value spectra show that intrinsic dimension scales with data size. MMLU exhibits a smooth, heavy-tailed decay, whereas TYDIQA suffers from extreme spectral sparsity due to data scarcity. Bottom Row: Cumulative explained variance.…
Figure 8
Figure 8. Rapid loss convergence in the first epoch creates a stable geometric basin. We observe consistent training dynamics across varying data scales (13.5k vs. 270k) and model architectures, where the loss drops precipitously within the initial phase (< 0.5 epoch) before entering a bounded oscillatory regime. … To verify the assumptions underlying Theorem 3.3, we analyze the training dynamics of multiple instructi…

Original abstract

Targeted data selection has emerged as a crucial paradigm for efficient instruction tuning, aiming to identify a small yet influential subset of training examples for a specific target task. In practice, influence is often measured through the effect of an example on parameter updates. To make selection scalable, many approaches leverage optimizer statistics (e.g., Adam states) as an axis-aligned surrogate for update geometry (i.e., diagonal precondition), implicitly treating parameters as coordinate-wise independent. We show that this assumption breaks down in parameter-efficient fine-tuning (PEFT) methods such as LoRA. In this setting, the induced optimization geometry exhibits strong cross-parameter coupling with non-trivial off-diagonal interactions, while the task-relevant update directions are confined to a low-dimensional subspace. Motivated by this mismatch, we propose GIST (Gradient Isometric Subspace Transformation), a simple yet principled alternative that replaces axis-aligned scaling with robust subspace alignment. GIST recovers a task-specific subspace from validation gradients via singular value decomposition (SVD), projects training gradients into this coupled subspace, and scores examples by their alignment with target directions. Extensive experiments have demonstrated that GIST matches or outperforms the state-of-the-art baseline with only 0.29% of the storage and 25% of the computational time under the same selection budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes GIST for targeted data selection in instruction tuning. It argues that LoRA induces strong cross-parameter coupling in the optimization geometry, rendering axis-aligned surrogates (e.g., Adam states) inadequate. GIST recovers a low-dimensional task-specific subspace via SVD on validation gradients, projects training gradients into this subspace, and scores examples by alignment with target directions. The central empirical claim is that GIST matches or outperforms SOTA baselines while using only 0.29% of the storage and 25% of the computational time under identical selection budgets.

Significance. If the performance claims and the underlying geometry premise hold, the work is significant for scalable instruction tuning. It offers a principled, low-overhead alternative to diagonal surrogates that explicitly accounts for coupled update directions in PEFT, with substantial practical gains in storage and runtime. The approach could inform future data-selection methods that incorporate optimization geometry rather than coordinate-wise statistics.

major comments (3)
  1. [Abstract and method description] The load-bearing premise that LoRA induces strong off-diagonal coupling (making axis-aligned methods inadequate) is stated in the abstract and motivation but is not directly quantified. No measurement of off-diagonal magnitudes in the gradient covariance or approximate Hessian appears in the method or experiments sections, so the motivation for subspace projection over simpler surrogates remains ungrounded.
  2. [Method (subspace recovery procedure)] The SVD subspace recovery (method section) uses validation gradients without reported sensitivity analysis to validation-set size or distributional match to the target task. If the validation set is small or unrepresentative, the top singular vectors can reflect noise rather than coupled task directions, directly threatening the claimed performance edge over axis-aligned baselines.
  3. [Experiments section] Table or figure reporting the main results (experiments section) asserts matching SOTA at 0.29% storage / 25% time, but lacks error bars, multiple random seeds, or ablation on the SVD rank and number of validation examples; without these, it is unclear whether the gains are robust or driven by particular hyperparameter choices.
minor comments (2)
  1. [Method] The notation for the projection of training gradients onto the SVD subspace could be made explicit with a numbered equation rather than prose description.
  2. [Abstract] The abstract mentions 'extensive experiments' but does not name the concrete baselines or datasets; adding one sentence would improve clarity for readers.
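
Major comment 1 asks for a direct measurement of off-diagonal coupling. One possible way to quantify it, sketched below, is the share of squared Frobenius mass that sits off the diagonal of the gradient second-moment matrix. The metric, the function name, and the synthetic data are assumptions of this sketch and are not taken from the paper; forming the full d × d matrix is only feasible for a small parameter block.

```python
import numpy as np

def off_diagonal_ratio(G):
    """Fraction of squared Frobenius mass off the diagonal of the gradient second moment.

    G: (n, d) per-example gradients for a small parameter block.
    """
    C = G.T @ G / G.shape[0]            # uncentered second-moment matrix, (d, d)
    total = np.sum(C ** 2)
    diagonal = np.sum(np.diag(C) ** 2)
    return float((total - diagonal) / total)

rng = np.random.default_rng(0)
# Axis-aligned case: independent coordinates with different scales.
G_axis = rng.normal(size=(5000, 4)) * np.array([1.0, 2.0, 3.0, 4.0])
# Coupled case: the same gradients mixed by a fixed orthogonal (Hadamard) matrix.
Q = 0.5 * np.array([[1,  1,  1,  1],
                    [1, -1,  1, -1],
                    [1,  1, -1, -1],
                    [1, -1, -1,  1]], dtype=float)
G_cpl = G_axis @ Q
ratio_axis = off_diagonal_ratio(G_axis)
ratio_cpl = off_diagonal_ratio(G_cpl)
```

A ratio near zero indicates that a diagonal surrogate loses little; a large ratio indicates substantial coupling between coordinates. The orthogonal mixing leaves the total mass unchanged, so the two ratios are directly comparable.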

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and robustness of our work. We address each major comment below and have incorporated revisions to strengthen the empirical grounding and experimental validation.

  1. Referee: [Abstract and method description] The load-bearing premise that LoRA induces strong off-diagonal coupling (making axis-aligned methods inadequate) is stated in the abstract and motivation but is not directly quantified. No measurement of off-diagonal magnitudes in the gradient covariance or approximate Hessian appears in the method or experiments sections, so the motivation for subspace projection over simpler surrogates remains ungrounded.

    Authors: We agree that explicitly quantifying the off-diagonal coupling would provide stronger motivation for the subspace approach. In the revised version, we have added a new subsection in the method description that computes and visualizes the off-diagonal elements of the gradient covariance matrix for LoRA parameters on a representative task. This shows that off-diagonal terms are non-negligible and comparable in magnitude to diagonal terms, unlike in full fine-tuning where the geometry is more diagonal-dominant. This analysis directly supports the need for coupled subspace projection. revision: yes

  2. Referee: [Method (subspace recovery procedure)] The SVD subspace recovery (method section) uses validation gradients without reported sensitivity analysis to validation-set size or distributional match to the target task. If the validation set is small or unrepresentative, the top singular vectors can reflect noise rather than coupled task directions, directly threatening the claimed performance edge over axis-aligned baselines.

    Authors: We appreciate this point on robustness. We have performed a sensitivity analysis by varying the validation set size (from 50 to 500 examples) and report the resulting selection performance in a new appendix figure. The results show that performance stabilizes beyond 100 examples, with minimal degradation for smaller sets when using the same task distribution. For distributional match, the validation set is sampled from the same target task as the test set, as detailed in the experimental setup. We have clarified this in the revised method section. revision: yes

  3. Referee: [Experiments section] Table or figure reporting the main results (experiments section) asserts matching SOTA at 0.29% storage / 25% time, but lacks error bars, multiple random seeds, or ablation on the SVD rank and number of validation examples; without these, it is unclear whether the gains are robust or driven by particular hyperparameter choices.

    Authors: We acknowledge the importance of statistical robustness. In the revised manuscript, we have updated the main results table to include mean and standard deviation over 5 random seeds. Additionally, we include ablations on the SVD rank (varying k from 5 to 100) and the number of validation examples used for subspace recovery in a new supplementary figure, demonstrating that the performance gains are consistent across reasonable choices of these hyperparameters. revision: yes
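
The sensitivity analysis discussed in the second response could be approached by checking how stable the recovered subspace is as the validation set shrinks. The sketch below compares subspaces through the mean squared cosine of their principal angles. The synthetic gradients, the rank of 3, and the helper names are assumptions made here; the set sizes echo the 50 to 500 range mentioned in the response but nothing else is taken from the paper.

```python
import numpy as np

def top_subspace(G, rank):
    """Orthonormal basis (rank, d) of the leading right singular directions of G."""
    _, _, Vt = np.linalg.svd(G, full_matrices=False)
    return Vt[:rank]

def subspace_overlap(B1, B2):
    """Mean squared cosine of the principal angles between two subspaces, in [0, 1]."""
    return float(np.sum((B1 @ B2.T) ** 2) / B1.shape[0])

# Synthetic validation gradients with rank-3 structure plus noise.
rng = np.random.default_rng(0)
d, true_rank = 100, 3
directions = rng.normal(size=(true_rank, d))
G_val = rng.normal(size=(500, true_rank)) @ directions + 0.05 * rng.normal(size=(500, d))

reference = top_subspace(G_val, true_rank)
overlaps = {
    n: subspace_overlap(top_subspace(G_val[:n], true_rank), reference)
    for n in (50, 100, 200, 500)
}
```

An overlap close to 1 means the smaller validation set recovers essentially the same subspace as the full set; a drop at small sizes would suggest the leading singular vectors are fitting noise.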

Circularity Check

0 steps flagged

No significant circularity; GIST scoring is a novel construction independent of evaluation metrics


The paper's core derivation defines GIST via SVD on validation gradients to obtain a coupled subspace, followed by projection of training gradients and alignment scoring. This procedure is explicitly constructed from the observed off-diagonal interactions in LoRA and does not reduce by the paper's equations to any quantity fitted on the same data used for final performance evaluation. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The method is presented as a direct response to the axis-aligned surrogate mismatch, with experimental validation against external baselines remaining falsifiable and non-tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard linear algebra for SVD and the domain assumption that validation gradients span the relevant task subspace; no free parameters or new invented entities are introduced in the abstract description.

axioms (2)
  • [standard math] Singular value decomposition recovers the principal directions of variation in the validation gradient matrix.
    Invoked to extract the task-specific subspace from validation gradients.
  • [domain assumption] Task-relevant update directions in LoRA lie in a low-dimensional subspace with cross-parameter coupling.
    This premise motivates replacing axis-aligned scaling with subspace alignment.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Let the Target Select for Itself: Data Selection via Target-Aligned Paths

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Target-aligned data selection via normalized endpoint loss drop on a validation-induced reference path achieves competitive performance with reduced computational overhead.

  2. One Algorithm, Two Goals: Dual Scoring for Parameter and Data Selection in LLM Fine-Tuning

    cs.LG · 2026-05 · unverdicted · novelty 5.0

    DualSFT derives parameter masks and data subsets as row- and column-wise aggregations of one gradient interaction matrix under first- and second-order validation-improvement approximations.


    I. దూరంలKనూఉం:;. 2011 RSరతజనగణనగణVం6ాలపC6ారంఈ =ా > మం830 ఇళ2 YZ, 3687 జ[VRSYZ268 \]6ా M ర2లK(స^ _`ంaఉం:;. =ా > మంలKమగbా_`సంఖc1398, ఆడbా_`సంఖc2289. efడూcgh కjల1లసంఖc26 6ా=ాefడూcghYెగలసంఖc2944. =ా > మంkకlజనగణనలm6Aషo6pq584655[2].rిo6pq: 531077. Question:2011లK!ాత!ా$ేర'=ా > మంలKఎంతమం:;uీ^ wలjఉ[V-ర'? Assistant:2289 PC1 PC2 PC3 PC1: Review Sentiment Classificat...