SEED: Targeted Data Selection by Weighted Independent Set

Chang Liu; Junwen Pan; Kuan Cheng; Kurt Keutzer; Lifeng Guo; Shanghang Zhang; Wenzhao Zheng; Yuan Zhang

arxiv: 2605.15691 · v1 · pith:4HQJD7RLnew · submitted 2026-05-15 · 💻 cs.LG

Yuan Zhang, Lifeng Guo, Junwen Pan, Chang Liu, Wenzhao Zheng, Kuan Cheng, Kurt Keutzer, Shanghang Zhang

Pith reviewed 2026-05-20 20:24 UTC · model grok-4.3

classification 💻 cs.LG

keywords: data selection · weighted independent set · similarity graph · influence estimation · instruction tuning · visual instruction tuning · semantic segmentation

The pith

SEED selects high-quality diverse training data by solving a weighted independent set problem on a calibrated similarity graph.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates data selection as identifying a maximum weighted independent set in a graph whose nodes are training samples and whose edges link redundant pairs. It introduces two refinements: restricting influence scores to a bilateral salient subspace so that weights reflect task signals rather than gradient noise, and adapting edge thresholds to each sample's local neighborhood density to prevent bias toward sparse regions under domain shifts. These steps together produce compact subsets that remain both influential and non-redundant. A reader would care because large corpora contain massive redundancy, so a reliable way to prune them can cut training cost while preserving or improving final model quality.
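To make the formulation concrete, the sketch below selects nodes with a simple greedy heuristic: visit samples from highest to lowest weight and skip any sample adjacent to one already chosen. The solver SEED actually uses is not described on this page, so the function name `greedy_weighted_independent_set`, the `budget` argument, and the toy weights are all illustrative assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple


def greedy_weighted_independent_set(
    weights: Dict[int, float],
    edges: Iterable[Tuple[int, int]],
    budget: int,
) -> List[int]:
    """Pick up to `budget` nodes, heaviest first, never picking two adjacent nodes.

    Assumption: a plain greedy heuristic, not necessarily the solver used by SEED.
    """
    # Build an adjacency map; an edge marks a redundant pair of samples.
    neighbours: Dict[int, Set[int]] = {v: set() for v in weights}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    selected: List[int] = []
    blocked: Set[int] = set()
    # Visit nodes from most to least influential (ties broken by node id).
    for v in sorted(weights, key=lambda n: (-weights[n], n)):
        if len(selected) >= budget:
            break
        if v in blocked:
            continue
        selected.append(v)
        # The chosen node and all of its redundant neighbours become unavailable.
        blocked.add(v)
        blocked |= neighbours[v]
    return selected


if __name__ == "__main__":
    # Toy graph: (0, 1) and (1, 2) are redundant pairs; node 3 is isolated.
    toy_weights = {0: 0.9, 1: 1.0, 2: 0.8, 3: 0.1}
    toy_edges = [(0, 1), (1, 2)]
    print(greedy_weighted_independent_set(toy_weights, toy_edges, budget=3))
```

On the toy graph this returns `[1, 3]` (total weight 1.1), whereas the independent set {0, 2, 3} has weight 1.8, a reminder that greedy selection is a heuristic rather than an exact solver.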

Core claim

By modeling data selection as the search for a maximum weighted independent set on a similarity graph, and by refining node weights through restriction to the bilateral salient subspace together with local scale normalization of edge thresholds, the resulting SEED pipeline yields subsets that are simultaneously high in task-relevant influence and low in semantic redundancy, and these subsets deliver consistent gains over prior selection methods when used for instruction tuning, visual instruction tuning, and semantic segmentation.
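The selection objective in this claim can be written compactly. The notation below is an assumption made for illustration: $G = (V, E)$ is the similarity graph, $w_v$ is the influence weight of node $v$, and $S$ is the selected subset.

```latex
S^{*} \;=\; \arg\max_{S \subseteq V} \; \sum_{v \in S} w_v
\quad \text{s.t.} \quad (u, v) \notin E \quad \forall\, u, v \in S .
```

The constraint is what encodes diversity: no two selected samples may share an edge, so no redundant pair survives, while the sum rewards high-influence samples.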

What carries the argument

Weighted Independent Set formulation on a similarity graph, refined by bilateral salient subspace node calibration and local scale normalization of edge thresholds.
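A minimal sketch of what density-adaptive edge thresholds might look like follows. It borrows the local-scale idea from self-tuning spectral clustering [68]: each sample gets a scale equal to the distance to its k-th nearest neighbour, and a pair is linked when its distance falls below `alpha` times the geometric mean of the two scales. Whether SEED uses exactly this rule is an assumption; the function name `local_scale_edges` and the default values of `k` and `alpha` are hypothetical.

```python
import numpy as np


def local_scale_edges(emb: np.ndarray, k: int = 2, alpha: float = 1.0) -> np.ndarray:
    """Return a boolean adjacency matrix built with per-pair, density-aware thresholds.

    Assumption: thresholds follow the self-tuning local-scale recipe; this is a
    sketch, not the paper's confirmed edge rule.
    """
    # Pairwise Euclidean distances, shape (n, n).
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Local scale sigma_i: distance from sample i to its k-th nearest neighbour.
    # Column 0 of each sorted row is the self-distance (0), so index k is the k-th neighbour.
    sigma = np.sort(dist, axis=1)[:, k]
    # A pair is redundant when its distance is small relative to both local scales.
    threshold = alpha * np.sqrt(sigma[:, None] * sigma[None, :])
    adj = dist < threshold
    np.fill_diagonal(adj, False)  # no self-loops
    return adj


if __name__ == "__main__":
    # A dense cluster (spacing 1) and a sparse cluster (spacing 20) on a line.
    points = np.array([[0.0], [1.0], [2.0], [100.0], [120.0], [140.0]])
    print(local_scale_edges(points, k=1, alpha=1.5).astype(int))
```

With a single global threshold on this toy data, either the dense cluster becomes fully connected or the sparse cluster loses all of its edges. The per-pair threshold links only adjacent points in both clusters, which is the kind of balance the normalization is meant to restore.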

If this is right

Produces subsets that simultaneously maximize quality and diversity without separate quality and diversity stages.
Yields a compact multimodal dataset, Honeybee-Remake-SEED-200K, that supports strong downstream performance.
Generalizes across instruction tuning, visual instruction tuning, and semantic segmentation for multiple model families.
Scales to large heterogeneous corpora by operating directly on a graph representation of semantic similarity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph refinement steps could be applied to select demonstration data for in-context learning or to curate preference pairs for alignment.
Local density normalization may offer a general remedy for selection bias whenever training distributions contain multiple distinct domains.
If the salient-subspace idea proves robust, it could be combined with cheaper proxy models to reduce the cost of influence estimation itself.

Load-bearing premise

That restricting influence estimation to the bilateral salient subspace and scaling edge thresholds to local density will reliably separate task-relevant signals from noise and correct structural imbalance caused by cross-domain shifts.
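The premise is easier to probe with a concrete picture of a restricted influence score. The sketch below is one plausible reading, not the paper's confirmed procedure: it keeps the channels whose mean gradient magnitude is large on both the training side and the target side, then scores each training sample by the inner product with the mean target gradient over those shared channels only. The function name `salient_subspace_influence`, the `top_frac` parameter, and the use of a first-order inner-product influence are all assumptions.

```python
from typing import Tuple

import numpy as np


def salient_subspace_influence(
    train_grads: np.ndarray,   # shape (n_train, d): one gradient feature vector per training sample
    target_grads: np.ndarray,  # shape (n_target, d): the same features for target samples
    top_frac: float = 0.25,    # hypothetical fraction of channels kept on each side
) -> Tuple[np.ndarray, np.ndarray]:
    """Score training samples using only channels salient for both train and target."""
    d = train_grads.shape[1]
    n_keep = max(1, int(round(top_frac * d)))
    # Mean absolute gradient per channel, computed separately for each side.
    train_mag = np.abs(train_grads).mean(axis=0)
    target_mag = np.abs(target_grads).mean(axis=0)
    # Channels that rank in the top fraction on the training side and on the target side.
    train_top = np.argsort(-train_mag)[:n_keep]
    target_top = np.argsort(-target_mag)[:n_keep]
    shared = np.intersect1d(train_top, target_top)
    # First-order influence: inner product with the mean target gradient,
    # restricted to the shared channels (all zeros if no channel is shared).
    target_mean = target_grads.mean(axis=0)
    scores = train_grads[:, shared] @ target_mean[shared]
    return scores, shared


if __name__ == "__main__":
    # Channel 3 is large for training samples but absent from the target gradients.
    train = np.array([[1.0, 0.0, 0.01, 5.0], [0.5, 0.0, -0.02, -5.0]])
    target = np.array([[2.0, 3.0, 0.0, 0.0]])
    print(salient_subspace_influence(train, target, top_frac=0.5))
```

In the toy input, channel 3 dominates the training gradients but carries no target signal, so it is excluded and the scores depend only on channel 0. If the premise fails in practice, it would show up as the shared set either discarding task-relevant channels or retaining noisy ones.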

What would settle it

An experiment in which models trained on a SEED-selected subset perform no better than models trained on subsets chosen by random sampling or by existing influence-function baselines on the same tasks and model families.

Figures

Figures reproduced from arXiv: 2605.15691 by Chang Liu, Junwen Pan, Kuan Cheng, Kurt Keutzer, Lifeng Guo, Shanghang Zhang, Wenzhao Zheng, Yuan Zhang.

**Figure 1.** Overview of SEED. SEED formulates subset selection as a Weighted Independent Set problem over a similarity graph constructed from training data, with better node weights from a mutual influence subspace and better edges from local scale normalization. The resulting structurally balanced graph enables selecting a compact, diverse, and high-influence subset. Different colors indicate that nodes belong to dif…

**Figure 2.** Mutual Influence Subspace. (a) Gradient magnitudes across channels show a long-tail distribution. (b) Mutual space on both training and target (1st quadrant) form C∗. (c) KDE of influence scores shows that restricting to C∗ reduces noise and sharpens the distribution. Adjacent text in the paper: Definition 1 (Node Weights). The weight w_v of a node v ∈ D_train is defined as its influence on the target dataset D_target.

**Figure 3.** Visualization of density heterogeneity across domains. The data are sampled from FLAN-V2 [39], CoT [61], Dolly [14], and Oasst-1 [35]. (a) Embeddings from different domains exhibit non-uniform densities. (b) Global distance statistics vary significantly across domains. (c) Local scaling mitigates this mismatch by aligning neighborhood distributions. Best viewed in color.

**Figure 4.** Performance of SEED on scratch training for semantic segmentation. Result curves of DeepLabV3 [7] trained using different data selection methods. Best viewed in color.

**Figure 5.** Efficient fine-tuning via SEED-guided proxy model selection. Statistics are reported on Phi-4-14B, where SEED-T achieves comparable accuracy while reducing GPU cost by 2.5×. Adjacent text in the paper: … [66], DeepSeek-LLM-7B [6], Phi-4-14B [2], and Seed-OSS-36B [57]. We report four representative cases here and provide all eight results in Appendix E.3.

**Figure 6.** Curve of performance vs. GPU training cost on Phi-4-14B. Adjacent text in the paper: We investigate how the choice of k and scaling factor α affects data selection performance on instruction tuning tasks. Results in Appendix F show that SEED remains robust across a wide range of k and α values, validating the stability of our design choices.

**Figure 7.** Visualization of data manifolds (left) and influence scores (right). Adjacent text in the paper: To investigate the selection behavior, we visualize projected manifolds via UMAP [44] and influence score distributions in Figure 7.

**Figure 8.** Sensitivity analysis of SEED to k and α. Results are reported on LLaMA2-7B, measured as the average accuracy across TyDiQA, MMLU, and BBH under a 5% data budget.

Original abstract

Data selection seeks to identify a compact yet informative subset from large-scale training corpora, balancing sample quality against collection diversity. We formulate this problem as a Weighted Independent Set (WIS) on a similarity graph, where nodes represent data samples weighted by influence, and edges connect semantically redundant pairs. This formulation naturally yields subsets that are simultaneously high-quality and diverse. However, two challenges arise in practice: naive node weights fail to distinguish informative signals from gradient noise, and edge construction under heterogeneous domain distributions produces structurally imbalanced graphs that bias selection toward sparse regions. To address these issues, we introduce two principled refinements from a unified graph perspective: (1) node value calibration that restricts influence estimation to the bilateral salient subspace to ground node importance in task-relevant signals rather than surface-level statistics; (2) local scale normalization that adapts edge thresholds to local neighborhood density, mitigating graph imbalance induced by cross-domain distribution shifts. Together, these components yield a robust and scalable data selection pipeline dubbed SEED. We further construct Honeybee-Remake-SEED-200K, a compact multimodal dataset curated by SEED. Extensive experiments show that SEED consistently outperforms state-of-the-art methods on instruction tuning, visual instruction tuning, and semantic segmentation across diverse model families.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SEED casts data selection as a weighted independent set on a similarity graph and adds two refinements that produce consistent empirical gains on tuning and segmentation tasks.

The core idea here is to model data selection as finding a high-weight independent set in a graph of samples, where edges mark semantic redundancy. The two refinements, calibrating node weights inside a bilateral salient subspace and normalizing edge thresholds to local density, address noise in influence estimates and imbalance from mixed domains, respectively. That combination is what the paper actually contributes beyond prior graph-based or influence-based selection work.

Referee Report

0 major / 3 minor

Summary. The paper introduces SEED, a data selection method formulated as a Weighted Independent Set (WIS) on a similarity graph, with nodes representing samples weighted by influence and edges connecting semantically redundant pairs. Two refinements are proposed from a graph perspective: node value calibration restricting influence estimation to the bilateral salient subspace, and local scale normalization adapting edge thresholds to local neighborhood density. The approach is shown to outperform state-of-the-art methods on instruction tuning, visual instruction tuning, and semantic segmentation across model families, while also releasing the Honeybee-Remake-SEED-200K multimodal dataset.

Significance. If the empirical results hold, this work provides a principled graph-based framework for balancing quality and diversity in data selection, with targeted fixes for gradient noise and cross-domain imbalance. Strengths include the unified WIS formulation, algorithmic detail sufficient for the pipeline, experiments that control for data volume while reporting consistent gains across tasks and models, and the public release of the curated 200K dataset which supports reproducibility. These elements position the method as a practical tool for curating training subsets for large-scale models.

minor comments (3)

Abstract: the phrase 'restricts influence estimation to the bilateral salient subspace' would benefit from a one-sentence gloss on the subspace construction to aid readers before the technical sections.
The experimental section should explicitly list the values or ranges used for the local scale normalization threshold and bilateral salient subspace parameters to ensure full reproducibility of the reported gains.
A brief complexity analysis of the WIS solver employed would help assess scalability claims for very large corpora.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of our work and the recommendation for minor revision. We appreciate the recognition of the unified WIS formulation, the algorithmic details, the controlled experiments, and the release of the Honeybee-Remake-SEED-200K dataset as practical contributions to data selection for large-scale models.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper formulates data selection as a Weighted Independent Set on a similarity graph and introduces two refinements (bilateral salient subspace calibration and local scale normalization) as independent algorithmic steps grounded in the stated assumptions about influence estimation and cross-domain graph imbalance. No equation or claim reduces a prediction or result to a quantity fitted inside the same pipeline, nor does any load-bearing step rely on a self-citation that itself collapses to the target result. The central pipeline is described with explicit algorithmic detail and evaluated empirically against baselines while controlling for data volume, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The approach rests on domain assumptions about graph construction and influence estimation whose concrete implementation details are not supplied in the abstract.

free parameters (2)

local scale normalization threshold: adapted to local neighborhood density to handle cross-domain imbalance
bilateral salient subspace restriction parameters: used to calibrate node weights away from gradient noise

axioms (1)

domain assumption: a similarity graph can be built such that edges reliably connect semantically redundant samples. This is central to the Weighted Independent Set formulation described in the abstract.

pith-pipeline@v0.9.0 · 5777 in / 1228 out tokens · 60663 ms · 2026-05-20T20:24:45.096473+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: node value calibration that restricts influence estimation to the bilateral salient subspace... local scale normalization that adapts edge thresholds to local neighborhood density

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: $S^{*} = \arg\max_{S} \sum_{i \in S} w_i$ s.t. $(u, v) \notin E$ for all $u, v \in S$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 11 internal anchors

[1] A. Abbas, K. Tirumala, D. Simig, S. Ganguli, and A. S. Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication. arXiv preprint arXiv:2303.09540, 2023.

[2] M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024.

[3] A. Albalak, Y. Elazar, S. M. Xie, S. Longpre, N. Lambert, X. Wang, N. Muennighoff, B. Hou, L. Pan, H. Jeong, et al. A survey on data selection for language models. arXiv preprint arXiv:2402.16827, 2024.

[4] S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.

[5] H. Bansal, D. S. Sachan, K.-W. Chang, A. Grover, G. Ghosh, W.-t. Yih, and R. Pasunuru. Honeybee: Data recipes for vision-language reasoners. arXiv preprint arXiv:2510.12225, 2025.

[6] X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.

[7] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

[8] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

[9] X. Chen, J. Wu, S. Yang, R. Zhan, Z. Wu, M. Yang, S. Huang, L. S. Chao, and D. F. Wong. Neuron-aware data selection in instruction tuning for large language models. In International Conference on Learning Representations (ICLR), 2026.

[10] X. Cheng, W. Zhang, S. Zhang, J. Yang, X. Guan, X. Wu, X. Li, G. Zhang, J. Liu, Y. Mai, Y. Zeng, Z. Wen, K. Jin, B. Wang, W. Zhou, Y. Lu, T. Li, W. Huang, and Z. Li. Simplevqa: Multimodal factuality evaluation for multimodal large language models. arXiv preprint arXiv:2502.13059, 2025.

[11] J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, and J. Palomaki. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 2020.

[12] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[13] B. Colling and M. van de Wiel. corrselect: An r package for correlation-constrained variable selection using maximal independent sets. Journal of Statistical Software, 2025.

[14] M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, and R. Xin. Free Dolly: Introducing the world’s first truly open instruction-tuned LLM, 2023.

[15] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.

[16] Q. Dai, D. Zhang, J. W. Ma, and H. Peng. Improving influence-based instruction tuning data selection for balanced learning of diverse capabilities. In ICLR 2025 Workshop on Navigating and Addressing Data Problems for Foundation Models, 2025.

[17] H. Duan, J. Yang, Y. Qiao, X. Fang, L. Chen, Y. Liu, X. Dong, Y. Zang, P. Zhang, J. Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024.

[18] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The llama 3 herd of models. arXiv e-prints, 2024.

[19] L. Engstrom, A. Feldmann, and A. Madry. Dsdm: Model-aware dataset selection with datamodels. arXiv preprint arXiv:2401.12926, 2024.

[20] M. Fahrbach, T. Fu, and M. Gholami. Gist: Greedy independent set thresholding for diverse data summarization. In International Conference on Learning Representations (ICLR), 2025.

[21] L. Feng, F. Nie, Y. Liu, and A. Alahi. TAROT: Targeted data selection via optimal transport. In International Conference on Machine Learning, 2025.

[22] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, Y. Wu, R. Ji, C. Shan, and R. He. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.

[23] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.

[24] T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y. Yacoob, D. Manocha, and T. Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[25] D. Guo, F. Wu, F. Zhu, F. Leng, G. Shi, H. Chen, H. Fan, J. Wang, J. Jiang, J. Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025.

[26] M. Halldórsson and J. Radhakrishnan. Greed is good: Approximating independent sets in sparse and bounded-degree graphs. In Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pages 439–448, 1994.

[27] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.

[28] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.

[29] A. A. Ismail, H. Corrada Bravo, and S. Feizi. Improving deep learning interpretability by saliency guided training. Advances in Neural Information Processing Systems, 2021.

[30] J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with gpus. IEEE transactions on big data, 7(3):535–547, 2019.

[31] F. Kang, H. A. Just, A. K. Sahu, and R. Jia. Performance scaling via optimal transport: Enabling data selection from partially revealed sources. Advances in Neural Information Processing Systems, 2024.

[32] J. Kim and C. D. Scott. Robust kernel density estimation. The Journal of Machine Learning Research, 13(1):2529–2565, 2012.

[33] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.

[34] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In International conference on machine learning, 2017.

[35] A. Köpf, Y. Kilcher, D. Von Rütte, S. Anagnostidis, Z. R. Tam, K. Stevens, A. Barhoum, D. Nguyen, O. Stanley, R. Nagyfi, et al. Openassistant conversations-democratizing large language model alignment. Advances in neural information processing systems, 2023.

[36] W. Liu, W. Zeng, K. He, Y. Jiang, and J. He. What makes good data for alignment? A comprehensive study of automatic data selection in instruction tuning. In International Conference on Learning Representations (ICLR), 2024.

[37] Z. Liu, A. Karbasi, and T. Rekatsinas. Tsds: Data selection for task-specific model finetuning. Advances in Neural Information Processing Systems, 37:10117–10147, 2024.

[38] Z. Liu, K. Zhou, W. X. Zhao, D. Gao, Y. Li, and J.-R. Wen. Less is more: Data value estimation for visual instruction tuning. arXiv preprint arXiv:2403.09559, 2024.

[39] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. In International Conference on Machine Learning, 2023.

[40] P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations, 2024.

[41] D. Ma, G. Shang, Z. Chen, L. Qin, Y. LUO, H. Xu, L. Pan, S. Fan, K. Yu, and L. Chen. Task-specific data selection for instruction tuning via monosemantic neuronal activations. In Neural Information Processing Systems, 2025.

[42] A. Maharana, P. Yadav, and M. Bansal. D2 pruning: Message passing for balancing diversity and difficulty in data pruning. In International Conference on Learning Representations, 2024.

[43] W. Mai, Z. Zhang, K. Li, Y. Xue, and F. Li. Dynamic graph construction framework for multimodal named entity recognition in social media. IEEE Transactions on Computational Social Systems, 11(2):2513–2522,

[44] L. McInnes, J. Healy, N. Saul, and L. Großberger. Umap: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29), 2018.

[45] B. McKinzie, Z. Gan, J.-P. Fauconnier, S. Dodge, B. Zhang, P. Dufter, D. Shah, X. Du, F. Peng, A. Belyi, et al. Mm1: methods, analysis and insights from multimodal llm pre-training. In European Conference on Computer Vision, pages 304–323. Springer, 2024.

[46] J. Pan, Q. Zhang, R. Zhang, M. Lu, X. Wan, Y. Zhang, C. Liu, and Q. She. Timesearch-r: Adaptive temporal search for long-form video understanding via self-verification reinforcement learning. arXiv preprint arXiv:2511.05489, 2025.

[47] M. Paul, S. Ganguli, and G. K. Dziugaite. Deep learning on a data diet: Finding important examples early in training. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[48] G. Pruthi, F. Liu, S. Kale, and M. Sundararajan. Estimating training data influence by tracing gradient descent. Advances in Neural Information Processing Systems, 2020.

[49] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In European conference on computer vision, 2016.

[50] S. Sanghavi, D. Shah, and A. S. Willsky. Message passing for maximum weight independent set. IEEE Transactions on Information Theory, 55(11), 2009.

[51] M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi. Social iqa: Commonsense reasoning about social interactions. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 4463–4473, 2019.

[52] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, 2017.

[53] O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations (ICLR), 2018.

[54] K. Sun, D. Yu, D. Yu, and C. Cardie. Investigating prior knowledge for challenging chinese machine reading comprehension. Transactions of the Association for Computational Linguistics, 8:141–155, 2020.

[55] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. Le, E. Chi, D. Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics, 2023.

[56] H. Tan, S. Wu, W. Huang, S. Zhao, and X. QI. Data pruning by information maximization. In International Conference on Learning Representations, 2025.

[57] B. S. Team. Seed-oss open-source models. https://github.com/ByteDance-Seed/seed-oss, 2025.

[58] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

[59] X. Wang, Y. Cui, J. Wang, F. Zhang, Y. Wang, X. Zhang, Z. Luo, Q. Sun, Z. Li, Y. Wang, et al. Multimodal learning with next-token prediction for large multimodal models. Nature, 2026.

[60]

Z. Wang, M. Xia, L. He, H. Chen, Y . Liu, R. Zhu, K. Liang, X. Wu, H. Liu, S. Malladi, A. Chevalier, S. Arora, and D. Chen. Charxiv: Charting gaps in realistic chart understanding in multimodal llms.arXiv preprint arXiv:2406.18521, 2024. 3 12

work page arXiv 2024
[61]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, 2022. 5, 6

work page 2022
[62]

K. Wei, R. Iyer, and J. Bilmes. Submodularity in data subset selection and active learning. InInternational Conference on Machine Learning (ICML), 2015. 1

work page 2015
[63]

Grok-1.5 vision preview.https://x.ai/news/grok-1.5v, 2024

xAI. Grok-1.5 vision preview.https://x.ai/news/grok-1.5v, 2024. 3

work page 2024
[64]

M. Xia, S. Malladi, S. Gururangan, S. Arora, and D. Chen. Less: Selecting influential data for targeted instruction tuning. InInternational Conference on Machine Learning, 2024. 1, 6, 7, 8, 3, 4, 5

work page 2024
[65]

S. M. Xie, S. Santurkar, T. Ma, and P. Liang. Data selection for language models via importance resampling. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 1

work page 2023
[66]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 1, 2, 6, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2025
[67]

Q. Yu, Z. Shen, Z. Yue, Y . Wu, B. Qin, W. Zhang, Y . Li, J. Li, S. Tang, and Y . Zhuang. Mastering collaborative multi-modal data selection: A focus on informativeness, uniqueness, and representativeness. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025. 7, 8, 5

work page 2025
[68]

Zelnik-Manor and P

L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering.Advances in neural information processing systems, 17, 2004. 2

work page 2004
[69]

Zhang, C.-X

J. Zhang, C.-X. Zhang, Y . Liu, Y .-X. Jin, X.-W. Yang, B. Zheng, Y . Liu, and L.-Z. Guo. D3: diversity, difficulty, and dependability-aware data selection for sample-efficient llm instruction tuning. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, 2025. 1

work page 2025
[70]

Zhang, D

R. Zhang, D. Jiang, Y . Zhang, H. Lin, Z. Guo, P. Qiu, A. Zhou, P. Lu, K.-W. Chang, P. Gao, and H. Li. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? InEuropean Conference on Computer Vision, 2024. 3

work page 2024
[71]

Zhang, C.-K

Y . Zhang, C.-K. Fan, T. Huang, M. Lu, S. Yu, J. Pan, K. Cheng, Q. She, and S. Zhang. Loss-oriented ranking for automated visual prompting in lvlms.arXiv preprint arXiv:2506.16112, 2025. 7

[72]

Y. Zhang, C.-K. Fan, J. Ma, W. Zheng, T. Huang, K. Cheng, D. A. Gudovskiy, T. Okuno, Y. Nakata, K. Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. In International Conference on Machine Learning, 2025.

[73]

Y. Zhang, F. Xiao, T. Huang, C.-K. Fan, H. Dong, J. Li, J. Wang, K. Cheng, S. Zhang, and H. Guo. Unveiling the tapestry of consistency in large vision-language models. Advances in Neural Information Processing Systems, 2024.

Contents
A Related Work
A.1 Data Selection or Pruning
A.2 Wei...

