GIST: Targeted Data Selection for Instruction Tuning via Coupled Optimization Geometry

Chen Chen; Guanghui Min; Ke Wan; Tianhao Huang

arXiv: 2602.18584 · v2 · pith:2IXGW5PS · submitted 2026-02-20 · 💻 cs.LG · cs.AI · cs.CV

Pith reviewed 2026-05-21 12:21 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV

keywords data selection · instruction tuning · LoRA · gradient subspace · SVD · PEFT · optimization geometry · influence estimation

The pith

GIST selects influential examples for instruction tuning by projecting gradients into a low-dimensional subspace recovered via SVD from validation data to capture LoRA's cross-parameter couplings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that axis-aligned surrogates like optimizer statistics break down for data selection in parameter-efficient fine-tuning because LoRA induces strong off-diagonal couplings and confines relevant updates to a low-dimensional space. GIST instead extracts a task-specific subspace from validation gradients with singular value decomposition, projects training gradients into it, and ranks examples by alignment with target directions. This yields selections that match or beat prior methods while slashing storage to 0.29 percent and compute time to 25 percent of the baseline under identical budgets. Readers should care because it directly tackles the resource barrier in scaling targeted instruction tuning for large models without sacrificing influence estimation quality.

Core claim

GIST replaces axis-aligned scaling with robust subspace alignment: it recovers a task-specific subspace from validation gradients via SVD, projects training gradients into this coupled subspace, and scores examples by their alignment with the target directions. The method is motivated by the observation that LoRA optimization geometry exhibits non-trivial off-diagonal interactions that diagonal preconditioners cannot represent, while task-relevant update directions remain low-dimensional.
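The three steps named in this claim (SVD on validation gradients, projection, alignment scoring) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `gist_scores` and `select_top_k`, the `rank` argument, and the scoring rule itself (projected inner product with the mean validation gradient, normalized by the full gradient norm) are hypothetical choices, because the summary does not spell out the exact score.

```python
import numpy as np


def gist_scores(val_grads: np.ndarray, train_grads: np.ndarray, rank: int) -> np.ndarray:
    """Score candidates by alignment with a subspace recovered from validation gradients.

    val_grads:   (n_val, d) array, one flattened gradient per validation example.
    train_grads: (n_train, d) array, one flattened gradient per candidate example.
    rank:        number of principal directions to keep (assumed hyperparameter).
    """
    # Thin SVD: the rows of vt are orthonormal directions in parameter space.
    _, _, vt = np.linalg.svd(val_grads, full_matrices=False)
    basis = vt[:rank]                                # (rank, d)

    # Coordinates of every gradient inside the recovered subspace.
    train_proj = train_grads @ basis.T               # (n_train, rank)
    target = val_grads.mean(axis=0) @ basis.T        # (rank,)

    # Assumed score: in-subspace inner product with the target, normalized by
    # the full gradient norm so out-of-subspace mass lowers the score.
    eps = 1e-12
    numerator = train_proj @ target
    denominator = np.linalg.norm(train_grads, axis=1) * np.linalg.norm(target) + eps
    return numerator / denominator


def select_top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest-scoring candidates."""
    return np.argsort(-scores)[:k]
```

Normalizing by the full gradient norm rather than the projected norm is a design choice made here so that a candidate whose gradient lies mostly outside the subspace cannot score highly by chance.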

What carries the argument

Gradient Isometric Subspace Transformation (GIST), which recovers a task-specific low-dimensional subspace from validation gradients via SVD and scores examples by projection alignment within that subspace.

If this is right

GIST matches or outperforms the state-of-the-art baseline while using 0.29 percent of its storage and 25 percent of its computation time under the same selection budget.
The approach directly demonstrates that LoRA induces optimization geometries with non-trivial off-diagonal parameter couplings.
Task-relevant update directions can be isolated in a low-dimensional subspace extracted from validation gradients alone.
The method remains applicable under fixed selection budgets without requiring full-model gradient storage.
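The low-dimensionality claim in the list above can be probed with a cumulative explained-variance curve. The sketch below is illustrative only: the function names and the 95 percent threshold default are assumptions, and the gradient matrix is expected as one flattened gradient per row.

```python
import numpy as np


def explained_variance_curve(grads: np.ndarray) -> np.ndarray:
    """Cumulative share of squared singular values of a (n_examples, d) gradient matrix."""
    singular_values = np.linalg.svd(grads, compute_uv=False)
    variance = singular_values ** 2
    return np.cumsum(variance) / variance.sum()


def components_for(grads: np.ndarray, threshold: float = 0.95) -> int:
    """Smallest number of principal directions whose cumulative share reaches threshold."""
    curve = explained_variance_curve(grads)
    # searchsorted returns the first index whose value is >= threshold.
    return int(np.searchsorted(curve, threshold) + 1)


def top_direction_max_weight(grads: np.ndarray) -> float:
    """Largest share of the top principal direction carried by a single coordinate.

    A value near 1 means the direction is axis-aligned; a small value means it
    is a mixture of many coordinates.
    """
    _, _, vt = np.linalg.svd(grads, full_matrices=False)
    return float(np.max(vt[0] ** 2))
```

Because the matrix has only as many rows as validation examples, its rank can never exceed the validation-set size, which is why the curve saturates early in the targeted setting.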

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The subspace construction may extend to other parameter-efficient adapters that similarly induce low-rank or coupled update structures.
Dynamic recomputation of the SVD subspace at intervals during selection could further improve alignment as training progresses.
The same projection technique might serve as a diagnostic tool to measure intrinsic task dimensionality across different fine-tuning regimes.

Load-bearing premise

The low-dimensional subspace recovered from validation gradients via SVD captures the task-relevant update directions even though LoRA induces strong cross-parameter couplings that invalidate axis-aligned approximations.

What would settle it

If data subsets chosen by GIST subspace scores yield downstream instruction-tuning performance no better than subsets chosen by random sampling or by standard diagonal optimizer statistics across multiple target tasks, the claimed advantage would be refuted.

Figures

Figures reproduced from arXiv: 2602.18584 by Chen Chen, Guanghui Min, Ke Wan, Tianhao Huang.

**Figure 1.** The validation-gradient matrix concentrates most of its variance in a low-dimensional principal subspace (rapid spectral decay and early saturation of explained variance), a consequence of the targeted setting where rank(G_val) ≤ |D_val| ≪ d. Importantly, low rank does not imply axis alignment. The resulting principal directions are generally linear combinations of coordinates, i.e., a rotated …

**Figure 2.** Overview of GIST. Step 1: Lightweight warmup performs a short LoRA warmup on a sampled subset and computes validation gradients. Step 2: Spectral filtering applies an SVD on the validation gradient matrix to construct a low-rank target subspace (Target projector). Step 3: Geometric scoring projects candidate gradients onto the target subspace and selects Top-k samples.

**Figure 3.** Impact of Checkpoint Selection. (a) Using single-epoch gradients shows a clear performance drop in later epochs. (b) Aggregating multiple checkpoints (weighted) does not outperform the early-stop strategy, confirming that early gradients contain the essential task optimization directions.

**Figure 4.** Accuracy as a function of the projection rank used by GIST, compared with LESS at the same selection budget. … performance: while combining Epochs 1-2 yields a slight peak (70.2%), incorporating all four epochs degrades accuracy to 67.3%, significantly lower than using the early checkpoints alone. This aligns with the results in …

**Figure 5.** Toy 2D optimization dynamics with the same initialization θ_0 = (−2.5, 0). Newton (full-matrix) follows the direct descent direction, while Adam (diagonal) cannot express the rotation induced by coupling, leading to a "zig-zag" trajectory on the coupled landscape. … where R is a 2D rotation matrix. Thus, both cases share the same eigenvalues (and condition number), but H_cpl is not diagonal in the coordinate b…

**Figure 6.** Spectral Analysis of Gradient Subspaces across Models and Epochs. Top Row: Singular value spectra (log scale) of the gradient covariance matrix. Early epochs (blue solid lines) show slower decay, indicating a higher-dimensional optimization landscape. Bottom Row: Cumulative explained variance. Later epochs (dotted red lines) reach 95% variance with fewer components, signaling dimensional shrinkage. Notably…

**Figure 7.** Impact of Dataset Scale and Task Type on Gradient Geometry. We analyze the spectral properties of Llama3.2-3B gradients across three datasets with varying sizes. Top Row: Singular value spectra show that intrinsic dimension scales with data size. MMLU exhibits a smooth, heavy-tailed decay, whereas TYDIQA suffers from extreme spectral sparsity due to data scarcity. Bottom Row: Cumulative explained variance. …

**Figure 8.** Rapid loss convergence in the first epoch creates a stable geometric basin. We observe consistent training dynamics across varying data scales (13.5k vs. 270k) and model architectures, where the loss drops precipitously within the initial phase (< 0.5 epoch) before entering a bounded oscillatory regime. To verify the assumptions underlying Theorem 3.3, we analyze the training dynamics of multiple instructi…
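The toy setting described for Figure 5 can be reproduced approximately in a few lines. The sketch below makes several assumptions that are not taken from the paper: the eigenvalues (1 and 25), the 45-degree rotation, the quadratic loss ½ θᵀHθ, and the use of a Jacobi-style diagonal preconditioner (dividing the gradient by the Hessian diagonal) as a stand-in for Adam. Only the initialization θ_0 = (−2.5, 0) comes from the caption.

```python
import numpy as np


def rotation(theta: float) -> np.ndarray:
    """2D rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def newton_step(hessian: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Full-matrix step on the quadratic loss 0.5 * theta^T H theta."""
    grad = hessian @ theta
    return theta - np.linalg.solve(hessian, grad)


def diagonal_step(hessian: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Diagonal-preconditioned step: a simplified stand-in for Adam, not Adam itself."""
    grad = hessian @ theta
    return theta - grad / np.diag(hessian)


def steps_to_converge(hessian, step_fn, theta0, tol=1e-6, max_steps=10_000) -> int:
    """Number of steps until the iterate is within tol of the minimum at the origin."""
    theta = np.array(theta0, dtype=float)
    for step in range(1, max_steps + 1):
        theta = step_fn(hessian, theta)
        if np.linalg.norm(theta) < tol:
            return step
    return max_steps


# Assumed toy landscapes: identical eigenvalues, different orientation.
H_axis = np.diag([1.0, 25.0])
R = rotation(np.pi / 4)
H_cpl = R @ H_axis @ R.T          # same spectrum, but not diagonal
theta0 = (-2.5, 0.0)

if __name__ == "__main__":
    for name, hessian in [("axis-aligned", H_axis), ("coupled", H_cpl)]:
        print(name,
              "newton:", steps_to_converge(hessian, newton_step, theta0),
              "diagonal:", steps_to_converge(hessian, diagonal_step, theta0))
```

On the axis-aligned landscape the diagonal preconditioner is exact, so both methods reach the minimum in one step. On the rotated landscape the diagonal step needs many more iterations even though the eigenvalues are unchanged, which is the mismatch the caption attributes to coupling.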

Original abstract

Targeted data selection has emerged as a crucial paradigm for efficient instruction tuning, aiming to identify a small yet influential subset of training examples for a specific target task. In practice, influence is often measured through the effect of an example on parameter updates. To make selection scalable, many approaches leverage optimizer statistics (e.g., Adam states) as an axis-aligned surrogate for update geometry (i.e., diagonal precondition), implicitly treating parameters as coordinate-wise independent. We show that this assumption breaks down in parameter-efficient fine-tuning (PEFT) methods such as LoRA. In this setting, the induced optimization geometry exhibits strong cross-parameter coupling with non-trivial off-diagonal interactions, while the task-relevant update directions are confined to a low-dimensional subspace. Motivated by this mismatch, we propose GIST (Gradient Isometric Subspace Transformation), a simple yet principled alternative that replaces axis-aligned scaling with robust subspace alignment. GIST recovers a task-specific subspace from validation gradients via singular value decomposition (SVD), projects training gradients into this coupled subspace, and scores examples by their alignment with target directions. Extensive experiments have demonstrated that GIST matches or outperforms the state-of-the-art baseline with only 0.29% of the storage and 25% of the computational time under the same selection budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GIST uses SVD on validation gradients to capture LoRA's coupled update directions for data selection instead of diagonal surrogates, which is a clean idea but the reported efficiency edge depends heavily on validation quality.

Hi, the main point is that GIST replaces axis-aligned surrogates like Adam states with a subspace recovered via SVD on validation gradients, then scores training examples by how their gradients align inside that subspace. This is pitched as a fix for the cross-parameter coupling that shows up in LoRA but gets ignored by diagonal preconditioners. The motivation is laid out clearly and the method itself is simple to implement.

They report matching or beating prior selection methods while using 0.29 percent of the storage and 25 percent of the compute under the same budget, which is the practical hook. That efficiency angle is the part worth paying attention to if the numbers hold up.

The soft spot is exactly the one in the stress-test note. The subspace is only as good as the validation gradients that feed the SVD. If the validation set is small or distributionally off, the top singular vectors can easily reflect sampling artifacts rather than the actual coupled directions that matter for the target task. The abstract does not show ablations on validation-set size or mismatch, nor does it quantify how large the off-diagonal terms really are in the LoRA geometry across models. Without those checks the performance claims rest on an assumption that may not travel.

This paper is for groups working on data-efficient instruction tuning and PEFT, especially anyone already using gradient-based selection and looking for lower overhead. A reader who cares about practical trade-offs in fine-tuning pipelines would get something out of the geometric framing and the storage numbers.

I would send it to peer review. The core substitution of subspace alignment for diagonal scaling is worth a closer look even if the robustness questions need more data.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes GIST for targeted data selection in instruction tuning. It argues that LoRA induces strong cross-parameter coupling in the optimization geometry, rendering axis-aligned surrogates (e.g., Adam states) inadequate. GIST recovers a low-dimensional task-specific subspace via SVD on validation gradients, projects training gradients into this subspace, and scores examples by alignment with target directions. The central empirical claim is that GIST matches or outperforms SOTA baselines while using only 0.29% of the storage and 25% of the computational time under identical selection budgets.

Significance. If the performance claims and the underlying geometry premise hold, the work is significant for scalable instruction tuning. It offers a principled, low-overhead alternative to diagonal surrogates that explicitly accounts for coupled update directions in PEFT, with substantial practical gains in storage and runtime. The approach could inform future data-selection methods that incorporate optimization geometry rather than coordinate-wise statistics.

major comments (3)

[Abstract and method description] The load-bearing premise that LoRA induces strong off-diagonal coupling (making axis-aligned methods inadequate) is stated in the abstract and motivation but is not directly quantified. No measurement of off-diagonal magnitudes in the gradient covariance or approximate Hessian appears in the method or experiments sections, so the motivation for subspace projection over simpler surrogates remains ungrounded.
[Method (subspace recovery procedure)] The SVD subspace recovery (method section) uses validation gradients without reported sensitivity analysis to validation-set size or distributional match to the target task. If the validation set is small or unrepresentative, the top singular vectors can reflect noise rather than coupled task directions, directly threatening the claimed performance edge over axis-aligned baselines.
[Experiments section] Table or figure reporting the main results (experiments section) asserts matching SOTA at 0.29% storage / 25% time, but lacks error bars, multiple random seeds, or ablation on the SVD rank and number of validation examples; without these, it is unclear whether the gains are robust or driven by particular hyperparameter choices.
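The sensitivity concern in the second major comment can be made concrete with a split-half stability check: recover the top-k subspace from two disjoint halves of the validation gradients and measure how much the two subspaces overlap. This is a hedged sketch of one possible diagnostic, not a procedure from the paper; the function names and the overlap measure (mean squared cosine of the principal angles) are assumptions.

```python
import numpy as np


def top_subspace(grads: np.ndarray, rank: int) -> np.ndarray:
    """Top-rank right singular vectors of a (n_examples, d) gradient matrix, as rows."""
    _, _, vt = np.linalg.svd(grads, full_matrices=False)
    return vt[:rank]


def subspace_overlap(basis_a: np.ndarray, basis_b: np.ndarray) -> float:
    """Mean squared cosine of the principal angles between two orthonormal row bases.

    Returns 1.0 for identical subspaces and about rank / d for unrelated random ones.
    """
    cosines = np.linalg.svd(basis_a @ basis_b.T, compute_uv=False)
    return float(np.mean(cosines ** 2))


def split_half_overlap(val_grads: np.ndarray, rank: int, rng: np.random.Generator) -> float:
    """Overlap between subspaces recovered from two random disjoint halves."""
    perm = rng.permutation(len(val_grads))
    half = len(perm) // 2
    basis_a = top_subspace(val_grads[perm[:half]], rank)
    basis_b = top_subspace(val_grads[perm[half:]], rank)
    return subspace_overlap(basis_a, basis_b)
```

A low split-half overlap would suggest that the top singular vectors reflect sampling noise rather than stable task directions, which is the failure mode the referee describes.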

minor comments (2)

[Method] The notation for the projection of training gradients onto the SVD subspace could be made explicit with a numbered equation rather than prose description.
[Abstract] The abstract mentions 'extensive experiments' but does not name the concrete baselines or datasets; adding one sentence would improve clarity for readers.
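The first minor comment asks for the projection to be stated as an explicit equation. One way it might be written is shown below, as a sketch in assumed notation that is not taken from the paper: $G_{\mathrm{val}}$ stacks validation gradients as rows, $V_k$ holds the top $k$ right singular vectors, $g_z$ is the gradient of candidate $z$, $\bar{g}_{\mathrm{val}}$ is the mean validation gradient, and the normalization in $s(z)$ is one plausible choice among several.

```latex
G_{\mathrm{val}} = U \Sigma V^{\top}, \qquad
V_k = [\, v_1, \dots, v_k \,] \in \mathbb{R}^{d \times k}, \qquad
\tilde{g}_z = V_k^{\top} g_z,
\qquad
s(z) = \frac{\big\langle \tilde{g}_z,\; V_k^{\top} \bar{g}_{\mathrm{val}} \big\rangle}
            {\lVert g_z \rVert \,\lVert V_k^{\top} \bar{g}_{\mathrm{val}} \rVert}
```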

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and robustness of our work. We address each major comment below and have incorporated revisions to strengthen the empirical grounding and experimental validation.

Referee: [Abstract and method description] The load-bearing premise that LoRA induces strong off-diagonal coupling (making axis-aligned methods inadequate) is stated in the abstract and motivation but is not directly quantified. No measurement of off-diagonal magnitudes in the gradient covariance or approximate Hessian appears in the method or experiments sections, so the motivation for subspace projection over simpler surrogates remains ungrounded.

Authors: We agree that explicitly quantifying the off-diagonal coupling would provide stronger motivation for the subspace approach. In the revised version, we have added a new subsection in the method description that computes and visualizes the off-diagonal elements of the gradient covariance matrix for LoRA parameters on a representative task. This shows that off-diagonal terms are non-negligible and comparable in magnitude to diagonal terms, unlike in full fine-tuning where the geometry is more diagonal-dominant. This analysis directly supports the need for coupled subspace projection.
revision: yes

Referee: [Method (subspace recovery procedure)] The SVD subspace recovery (method section) uses validation gradients without reported sensitivity analysis to validation-set size or distributional match to the target task. If the validation set is small or unrepresentative, the top singular vectors can reflect noise rather than coupled task directions, directly threatening the claimed performance edge over axis-aligned baselines.

Authors: We appreciate this point on robustness. We have performed a sensitivity analysis by varying the validation set size (from 50 to 500 examples) and report the resulting selection performance in a new appendix figure. The results show that performance stabilizes beyond 100 examples, with minimal degradation for smaller sets when using the same task distribution. For distributional match, the validation set is sampled from the same target task as the test set, as detailed in the experimental setup. We have clarified this in the revised method section.
revision: yes

Referee: [Experiments section] Table or figure reporting the main results (experiments section) asserts matching SOTA at 0.29% storage / 25% time, but lacks error bars, multiple random seeds, or ablation on the SVD rank and number of validation examples; without these, it is unclear whether the gains are robust or driven by particular hyperparameter choices.

Authors: We acknowledge the importance of statistical robustness. In the revised manuscript, we have updated the main results table to include mean and standard deviation over 5 random seeds. Additionally, we include ablations on the SVD rank (varying k from 5 to 100) and the number of validation examples used for subspace recovery in a new supplementary figure, demonstrating that the performance gains are consistent across reasonable choices of these hyperparameters.
revision: yes
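The off-diagonal analysis mentioned in the first response could be summarized by a single number: the share of the gradient second-moment matrix that lies off the diagonal. The sketch below is one hypothetical way to compute such a ratio; the function name, the use of the uncentered second moment, and the squared-Frobenius weighting are assumptions, and forming the full d × d matrix is only practical for small parameter blocks.

```python
import numpy as np


def offdiag_ratio(grads: np.ndarray) -> float:
    """Share of squared Frobenius mass of the gradient second-moment matrix off the diagonal.

    grads: (n_examples, d) array of flattened gradients.
    Returns a value in [0, 1]; 0 means perfectly axis-aligned statistics.
    """
    second_moment = grads.T @ grads / len(grads)     # (d, d), uncentered
    total = np.sum(second_moment ** 2)
    diagonal = np.sum(np.diag(second_moment) ** 2)
    return float((total - diagonal) / total)
```

A diagonal preconditioner only sees the diagonal of this matrix, so a ratio well above zero indicates structure that coordinate-wise statistics discard.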

Circularity Check

0 steps flagged

No significant circularity; GIST scoring is a novel construction independent of evaluation metrics

The paper's core derivation defines GIST via SVD on validation gradients to obtain a coupled subspace, followed by projection of training gradients and alignment scoring. This procedure is explicitly constructed from the observed off-diagonal interactions in LoRA and does not reduce by the paper's equations to any quantity fitted on the same data used for final performance evaluation. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The method is presented as a direct response to the axis-aligned surrogate mismatch, with experimental validation against external baselines remaining falsifiable and non-tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard linear algebra for SVD and the domain assumption that validation gradients span the relevant task subspace; no free parameters or new invented entities are introduced in the abstract description.

axioms (2)

standard math: Singular value decomposition recovers the principal directions of variation in the validation gradient matrix.
Invoked to extract the task-specific subspace from validation gradients.

domain assumption: Task-relevant update directions in LoRA lie in a low-dimensional subspace with cross-parameter coupling.
This premise motivates replacing axis-aligned scaling with subspace alignment.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: GIST recovers a task-specific subspace from validation gradients via singular value decomposition (SVD), projects training gradients into this coupled subspace, and scores examples by their alignment with target directions.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: Theorem 3.2 (LoRA induces off-diagonal curvature)... $\dfrac{\partial^2 L}{\partial A_{k j_1}\, \partial A_{k j_2}} = (B_{:k} \otimes e_{j_1})^{\top} H_W \, (B_{:k} \otimes e_{j_2})$
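The identity quoted from Theorem 3.2 can be sanity-checked numerically on a small example. The sketch below rests on assumptions beyond the truncated quote: a purely quadratic loss ½ vec(W)ᵀ H_W vec(W) with W = W_0 + BA, B held fixed, a row-major flattening of W, and hypothetical function names. It compares the closed-form entry against a central finite difference and is not a proof of the theorem.

```python
import numpy as np


def lora_hessian_entry(B, H_W, k, j1, j2, n):
    """Closed form (B[:, k] kron e_j1)^T H_W (B[:, k] kron e_j2), row-major vec(W)."""
    e1 = np.eye(n)[j1]
    e2 = np.eye(n)[j2]
    return np.kron(B[:, k], e1) @ H_W @ np.kron(B[:, k], e2)


def quadratic_loss(A, B, W0, H_W):
    """Assumed surrogate loss: 0.5 * vec(W)^T H_W vec(W) with W = W0 + B @ A."""
    w = (W0 + B @ A).reshape(-1)        # row-major flattening
    return 0.5 * w @ H_W @ w


def finite_diff_entry(A, B, W0, H_W, k, j1, j2, h=1e-2):
    """Central finite-difference estimate of d^2 L / dA[k, j1] dA[k, j2]."""
    def shifted(d1, d2):
        A_shift = A.copy()
        A_shift[k, j1] += d1
        A_shift[k, j2] += d2
        return quadratic_loss(A_shift, B, W0, H_W)

    return (shifted(h, h) - shifted(h, -h) - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, r, n = 4, 2, 3                    # W is m x n, LoRA rank r
    B = rng.normal(size=(m, r))
    A = rng.normal(size=(r, n))
    W0 = rng.normal(size=(m, n))
    M = rng.normal(size=(m * n, m * n))
    H_W = M @ M.T                        # dense symmetric curvature (assumed)
    print(lora_hessian_entry(B, H_W, 0, 0, 2, n),
          finite_diff_entry(A, B, W0, H_W, 0, 0, 2))
```

With a dense H_W the entry for j1 ≠ j2 is generically non-zero, which illustrates how two different LoRA coordinates in the same row of A can be coupled through the curvature of the full weight matrix.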

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Let the Target Select for Itself: Data Selection via Target-Aligned Paths
cs.LG · 2026-05 · unverdicted · novelty 6.0
Target-aligned data selection via normalized endpoint loss drop on a validation-induced reference path achieves competitive performance with reduced computational overhead.

One Algorithm, Two Goals: Dual Scoring for Parameter and Data Selection in LLM Fine-Tuning
cs.LG · 2026-05 · unverdicted · novelty 5.0
DualSFT derives parameter masks and data subsets as row- and column-wise aggregations of one gradient interaction matrix under first- and second-order validation-improvement approximations.
