arxiv: 2605.15691 · v1 · pith:4HQJD7RLnew · submitted 2026-05-15 · 💻 cs.LG

SEED: Targeted Data Selection by Weighted Independent Set

Pith reviewed 2026-05-20 20:24 UTC · model grok-4.3

classification 💻 cs.LG
keywords data selection · weighted independent set · similarity graph · influence estimation · instruction tuning · visual instruction tuning · semantic segmentation

The pith

SEED selects high-quality diverse training data by solving a weighted independent set problem on a calibrated similarity graph.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates data selection as identifying a maximum weighted independent set in a graph whose nodes are training samples and whose edges link redundant pairs. It introduces two refinements: restricting influence scores to a bilateral salient subspace so that weights reflect task signals rather than gradient noise, and adapting edge thresholds to each sample's local neighborhood density to prevent bias toward sparse regions under domain shifts. These steps together produce compact subsets that remain both influential and non-redundant. A reader would care because large corpora contain massive redundancy, so a reliable way to prune them can cut training cost while preserving or improving final model quality.

Core claim

By modeling data selection as the search for a maximum weighted independent set on a similarity graph, and by refining node weights through restriction to the bilateral salient subspace together with local scale normalization of edge thresholds, the resulting SEED pipeline yields subsets that are simultaneously high in task-relevant influence and low in semantic redundancy, and these subsets deliver consistent gains over prior selection methods when used for instruction tuning, visual instruction tuning, and semantic segmentation.

What carries the argument

Weighted Independent Set formulation on a similarity graph, refined by bilateral salient subspace node calibration and local scale normalization of edge thresholds.
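
As a rough illustration of the formulation above, the following Python sketch picks an independent set greedily: it visits nodes in descending weight order and skips any node adjacent to one already chosen. The greedy rule, the function name `greedy_weighted_independent_set`, and the `budget` argument are assumptions made for illustration; this summary does not state which solver the paper actually uses.

```python
from typing import Dict, List, Sequence, Set, Tuple


def greedy_weighted_independent_set(
    weights: Sequence[float],
    edges: Sequence[Tuple[int, int]],
    budget: int,
) -> List[int]:
    """Greedy heuristic (an assumption, not the paper's solver).

    weights[i] is the influence weight of sample i; each edge (u, v)
    marks samples u and v as redundant with each other.
    """
    neighbors: Dict[int, Set[int]] = {i: set() for i in range(len(weights))}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    selected: List[int] = []
    blocked: Set[int] = set()  # nodes adjacent to something already selected
    # Visit nodes from highest to lowest weight.
    for node in sorted(range(len(weights)), key=lambda i: weights[i], reverse=True):
        if len(selected) >= budget:
            break
        if node in blocked:
            continue
        selected.append(node)
        blocked.update(neighbors[node])
    return selected


# Toy graph: samples 0 and 1 are redundant, as are samples 2 and 3.
subset = greedy_weighted_independent_set(
    weights=[0.9, 0.8, 0.7, 0.1], edges=[(0, 1), (2, 3)], budget=2
)
print(subset)
```

On this toy graph, sample 1 has the second-highest weight but is skipped because it is redundant with sample 0, so the selection moves on to sample 2. That is the sense in which a single pass can trade quality against diversity without separate stages.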

If this is right

  • Produces subsets that simultaneously maximize quality and diversity without separate quality and diversity stages.
  • Yields a compact multimodal dataset, Honeybee-Remake-SEED-200K, that supports strong downstream performance.
  • Generalizes across instruction tuning, visual instruction tuning, and semantic segmentation for multiple model families.
  • Scales to large heterogeneous corpora by operating directly on a graph representation of semantic similarity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph refinement steps could be applied to select demonstration data for in-context learning or to curate preference pairs for alignment.
  • Local density normalization may offer a general remedy for selection bias whenever training distributions contain multiple distinct domains.
  • If the salient-subspace idea proves robust, it could be combined with cheaper proxy models to reduce the cost of influence estimation itself.

Load-bearing premise

That restricting influence estimation to the bilateral salient subspace and scaling edge thresholds to local density will reliably separate task-relevant signals from noise and correct structural imbalance caused by cross-domain shifts.
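
To make the first half of this premise concrete, here is a minimal sketch of influence scoring restricted to channels that are salient on both sides. Everything specific in it is an assumption for illustration: saliency is taken to be the mean absolute gradient per channel, "salient" means the top `keep_frac` fraction, influence is a plain gradient dot product, and the helper names are hypothetical. The paper's actual construction of the subspace is not given in this summary. Plain lists are used so the example has no dependencies.

```python
from typing import List, Sequence, Set


def mean_abs_per_channel(grads: Sequence[Sequence[float]]) -> List[float]:
    """Average absolute gradient magnitude of each channel over a set of samples."""
    n = len(grads)
    return [sum(abs(row[c]) for row in grads) / n for c in range(len(grads[0]))]


def top_channels(saliency: Sequence[float], keep: int) -> Set[int]:
    """Indices of the `keep` channels with the largest saliency."""
    order = sorted(range(len(saliency)), key=lambda c: saliency[c], reverse=True)
    return set(order[:keep])


def salient_channels(train_grads, target_grads, keep_frac: float = 0.1) -> List[int]:
    """Channels that rank as salient on BOTH the training and the target side."""
    dim = len(train_grads[0])
    keep = max(1, round(keep_frac * dim))
    train_top = top_channels(mean_abs_per_channel(train_grads), keep)
    target_top = top_channels(mean_abs_per_channel(target_grads), keep)
    return sorted(train_top & target_top)


def restricted_influence(train_grads, target_grads, channels) -> List[float]:
    """Dot product of each training gradient with the mean target gradient,
    computed over `channels` only."""
    m = len(target_grads)
    target_mean = {c: sum(row[c] for row in target_grads) / m for c in channels}
    return [sum(row[c] * target_mean[c] for c in channels) for row in train_grads]


# Toy data: 2 training samples, 1 target sample, 4 gradient channels.
train = [[5.0, 0.1, 3.0, 0.0], [4.0, -0.1, -3.0, 0.0]]
target = [[2.0, 0.0, 0.1, 6.0]]
channels = salient_channels(train, target, keep_frac=0.5)
scores = restricted_influence(train, target, channels)
print(channels, scores)
```

In the toy data, channel 2 is large only on the training side and channel 3 only on the target side, so neither survives the intersection and the scores depend on channel 0 alone. Whether such a filter reliably separates task signal from noise on real gradients is exactly what the premise leaves open.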

What would settle it

An experiment in which models trained on a SEED-selected subset perform no better than models trained on subsets chosen by random sampling or by existing influence-function baselines on the same tasks and model families.

Figures

Figures reproduced from arXiv: 2605.15691 by Chang Liu, Junwen Pan, Kuan Cheng, Kurt Keutzer, Lifeng Guo, Shanghang Zhang, Wenzhao Zheng, Yuan Zhang.

Figure 1
Figure 1: Overview of SEED. SEED formulates subset selection as a Weighted Independent Set problem over a similarity graph constructed from training data, with better node weights from a mutual influence subspace and better edges from local scale normalization. The resulting structurally balanced graph enables selecting a compact, diverse, and high-influence subset. Different colors indicate that nodes belong to dif…
Figure 2
Figure 2: Mutual Influence Subspace. (a) Gradient magnitudes across channels show a long-tail distribution. (b) The mutual space on both training and target (1st quadrant) forms C*. (c) KDE of influence scores shows that restricting to C* reduces noise and sharpens the distribution. Definition 1 (Node Weights). The weight w_v of a node v ∈ D_train is defined as its influence on the target dataset D_target. Following Defin…
Figure 3
Figure 3: Visualization of density heterogeneity across domains. The data are sampled from FLAN-V2 [39], CoT [61], Dolly [14], and Oasst-1 [35]. (a) Embeddings from different domains exhibit non-uniform densities. (b) Global distance statistics vary significantly across domains. (c) Local scaling mitigates this mismatch by aligning neighborhood distributions. Best viewed in color. However, even if a channel exhibits…
Figure 4
Figure 4: Performance of SEED on scratch training for semantic segmentation. Result curves of DeepLabV3 [7] trained using different data selection methods. Best viewed in color.
Figure 5
Figure 5: Efficient fine-tuning via SEED-guided proxy model selection. Statistics are reported on Phi-4-14B, where SEED-T achieves comparable accuracy while reducing GPU cost by 2.5×. … [66], DeepSeek-LLM-7B [6], Phi-4-14B [2], and Seed-OSS-36B [57]. We report four representative cases here and provide all eight results in Appendix E.3.
Figure 6
Figure 6: Curve of performance vs. GPU training cost on Phi-4-14B. … out fine-grained density differences. We investigate how the choice of k and scaling factor α affects data selection performance on instruction tuning tasks. Results in Appendix F show that SEED remains robust across a wide range of k and α values, validating the stability of our design choices.
Figure 7
Figure 7: Visualization of data manifolds (left) and influence scores (right). To investigate the selection behavior, we visualize projected manifolds via UMAP [44] and influence score distributions in …
Figure 8
Figure 8: Sensitivity analysis of SEED to k and α. Results are reported on LLaMA2-7B, measured as the average accuracy across TyDiQA, MMLU, and BBH under a 5% data budget.
Original abstract

Data selection seeks to identify a compact yet informative subset from large-scale training corpora, balancing sample quality against collection diversity. We formulate this problem as a Weighted Independent Set (WIS) on a similarity graph, where nodes represent data samples weighted by influence, and edges connect semantically redundant pairs. This formulation naturally yields subsets that are simultaneously high-quality and diverse. However, two challenges arise in practice: naive node weights fail to distinguish informative signals from gradient noise, and edge construction under heterogeneous domain distributions produces structurally imbalanced graphs that bias selection toward sparse regions. To address these issues, we introduce two principled refinements from a unified graph perspective: (1) node value calibration that restricts influence estimation to the bilateral salient subspace to ground node importance in task-relevant signals rather than surface-level statistics; (2) local scale normalization that adapts edge thresholds to local neighborhood density, mitigating graph imbalance induced by cross-domain distribution shifts. Together, these components yield a robust and scalable data selection pipeline dubbed SEED. We further construct Honeybee-Remake-SEED-200K, a compact multimodal dataset curated by SEED. Extensive experiments show that SEED consistently outperforms state-of-the-art methods on instruction tuning, visual instruction tuning, and semantic segmentation across diverse model families.
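
In symbols, the formulation described in the abstract can be written as the standard integer program below. The notation is ours rather than the paper's: V is the set of training samples, E the set of redundant pairs, w_v the influence weight of sample v, and x_v a binary selection indicator. The budget B is an assumption added for illustration and is not stated in the abstract.

```latex
\begin{aligned}
\max_{x \in \{0,1\}^{|V|}} \quad & \sum_{v \in V} w_v \, x_v \\
\text{s.t.} \quad & x_u + x_v \le 1 \quad \forall (u, v) \in E, \\
& \sum_{v \in V} x_v \le B \quad \text{(optional budget, assumed)}.
\end{aligned}
```

The edge constraints forbid selecting both endpoints of a redundant pair, which is where diversity comes from, while the objective rewards total influence, which is where quality comes from.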

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces SEED, a data selection method formulated as a Weighted Independent Set (WIS) on a similarity graph, with nodes representing samples weighted by influence and edges connecting semantically redundant pairs. Two refinements are proposed from a graph perspective: node value calibration restricting influence estimation to the bilateral salient subspace, and local scale normalization adapting edge thresholds to local neighborhood density. The approach is shown to outperform state-of-the-art methods on instruction tuning, visual instruction tuning, and semantic segmentation across model families, while also releasing the Honeybee-Remake-SEED-200K multimodal dataset.

Significance. If the empirical results hold, this work provides a principled graph-based framework for balancing quality and diversity in data selection, with targeted fixes for gradient noise and cross-domain imbalance. Strengths include the unified WIS formulation, algorithmic detail sufficient for the pipeline, experiments that control for data volume while reporting consistent gains across tasks and models, and the public release of the curated 200K dataset which supports reproducibility. These elements position the method as a practical tool for curating training subsets for large-scale models.

minor comments (3)
  1. Abstract: the phrase 'restricts influence estimation to the bilateral salient subspace' would benefit from a one-sentence gloss on the subspace construction to aid readers before the technical sections.
  2. The experimental section should explicitly list the values or ranges used for the local scale normalization threshold and bilateral salient subspace parameters to ensure full reproducibility of the reported gains.
  3. A brief complexity analysis of the WIS solver employed would help assess scalability claims for very large corpora.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of our work and the recommendation for minor revision. We appreciate the recognition of the unified WIS formulation, the algorithmic details, the controlled experiments, and the release of the Honeybee-Remake-SEED-200K dataset as practical contributions to data selection for large-scale models.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper formulates data selection as a Weighted Independent Set on a similarity graph and introduces two refinements (bilateral salient subspace calibration and local scale normalization) as independent algorithmic steps grounded in the stated assumptions about influence estimation and cross-domain graph imbalance. No equation or claim reduces a prediction or result to a quantity fitted inside the same pipeline, nor does any load-bearing step rely on a self-citation that itself collapses to the target result. The central pipeline is described with explicit algorithmic detail and evaluated empirically against baselines while controlling for data volume, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on domain assumptions about graph construction and influence estimation whose concrete implementation details are not supplied in the abstract.

free parameters (2)
  • local scale normalization threshold
    Adapted to local neighborhood density to handle cross-domain imbalance
  • bilateral salient subspace restriction parameters
    Used to calibrate node weights away from gradient noise
axioms (1)
  • domain assumption: A similarity graph can be built such that edges reliably connect semantically redundant samples
    Central to the Weighted Independent Set formulation described in the abstract
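
To show how the first free parameter might enter the graph construction, here is a small sketch of edge building with a density-adaptive threshold. The specific rule is an assumption modeled on self-tuning local scaling: each sample's scale is its distance to its k-th nearest neighbor, and a pair is linked when its distance is at most `alpha` times the geometric mean of the two scales. The symbols k and α appear in the paper's sensitivity analysis (Figure 8), but the exact formula, the function name `local_scale_edges`, and the brute-force distance computation are ours.

```python
import math
from typing import List, Sequence, Tuple


def local_scale_edges(
    embeddings: Sequence[Sequence[float]],
    k: int = 1,
    alpha: float = 1.0,
) -> List[Tuple[int, int]]:
    """Link pairs whose distance is small relative to their local density.

    Assumed rule (not confirmed by the source): edge (i, j) exists when
    dist(i, j) <= alpha * sqrt(scale_i * scale_j), where scale_i is the
    distance from sample i to its k-th nearest neighbor. Requires
    len(embeddings) > k. Brute force, O(n^2) time and memory.
    """
    n = len(embeddings)
    dist = [[math.dist(a, b) for b in embeddings] for a in embeddings]
    # sorted(row)[0] is the sample's zero distance to itself,
    # so index k is the distance to its k-th nearest neighbor.
    scale = [sorted(row)[k] for row in dist]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= alpha * math.sqrt(scale[i] * scale[j]):
                edges.append((i, j))
    return edges


# A dense cluster (spacing 1) and a sparse cluster (spacing 30).
points = [[0.0], [1.0], [2.0], [100.0], [130.0], [160.0]]
edges = local_scale_edges(points, k=1, alpha=1.5)
print(edges)
```

In this toy case both clusters receive the same chain of edges even though their spacings differ by a factor of 30. A single global threshold of 1.5 would link only the dense cluster and leave every sparse-cluster sample unconstrained, which is one way the bias toward sparse regions described in the abstract could arise.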

pith-pipeline@v0.9.0 · 5777 in / 1228 out tokens · 60663 ms · 2026-05-20T20:24:45.096473+00:00 · methodology



Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 11 internal anchors

  1. [1]

    A. Abbas, K. Tirumala, D. Simig, S. Ganguli, and A. S. Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication. arXiv preprint arXiv:2303.09540, 2023.

  2. [2]

    M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024.

  3. [3]

    A. Albalak, Y. Elazar, S. M. Xie, S. Longpre, N. Lambert, X. Wang, N. Muennighoff, B. Hou, L. Pan, H. Jeong, et al. A survey on data selection for language models. arXiv preprint arXiv:2402.16827, 2024.

  4. [4]

    S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.

  5. [5]

    H. Bansal, D. S. Sachan, K.-W. Chang, A. Grover, G. Ghosh, W.-t. Yih, and R. Pasunuru. Honeybee: Data recipes for vision-language reasoners. arXiv preprint arXiv:2510.12225, 2025.

  6. [6]

    X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.

  7. [7]

    L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

  8. [8]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

  9. [9]

    X. Chen, J. Wu, S. Yang, R. Zhan, Z. Wu, M. Yang, S. Huang, L. S. Chao, and D. F. Wong. Neuron-aware data selection in instruction tuning for large language models. In International Conference on Learning Representations (ICLR), 2026.

  10. [10]

    X. Cheng, W. Zhang, S. Zhang, J. Yang, X. Guan, X. Wu, X. Li, G. Zhang, J. Liu, Y. Mai, Y. Zeng, Z. Wen, K. Jin, B. Wang, W. Zhou, Y. Lu, T. Li, W. Huang, and Z. Li. Simplevqa: Multimodal factuality evaluation for multimodal large language models. arXiv preprint arXiv:2502.13059, 2025.

  11. [11]

    J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, and J. Palomaki. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 2020.

  12. [12]

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

  13. [13]

    B. Colling and M. van de Wiel. corrselect: An r package for correlation-constrained variable selection using maximal independent sets. Journal of Statistical Software, 2025.

  14. [14]

    M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, and R. Xin. Free Dolly: Introducing the world’s first truly open instruction-tuned LLM, 2023.

  15. [15]

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.

  16. [16]

    Q. Dai, D. Zhang, J. W. Ma, and H. Peng. Improving influence-based instruction tuning data selection for balanced learning of diverse capabilities. In ICLR 2025 Workshop on Navigating and Addressing Data Problems for Foundation Models, 2025.

  17. [17]

    H. Duan, J. Yang, Y. Qiao, X. Fang, L. Chen, Y. Liu, X. Dong, Y. Zang, P. Zhang, J. Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024.

  18. [18]

    A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The llama 3 herd of models. arXiv e-prints, 2024.

  19. [19]

    L. Engstrom, A. Feldmann, and A. Madry. Dsdm: Model-aware dataset selection with datamodels. arXiv preprint arXiv:2401.12926, 2024.

  20. [20]

    M. Fahrbach, T. Fu, and M. Gholami. Gist: Greedy independent set thresholding for diverse data summarization. In International Conference on Learning Representations (ICLR), 2025.

  21. [21]

    L. Feng, F. Nie, Y. Liu, and A. Alahi. TAROT: Targeted data selection via optimal transport. In International Conference on Machine Learning, 2025.

  22. [22]

    C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, Y. Wu, R. Ji, C. Shan, and R. He. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.

  23. [23]

    M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.

  24. [24]

    T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y. Yacoob, D. Manocha, and T. Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

  25. [25]

    D. Guo, F. Wu, F. Zhu, F. Leng, G. Shi, H. Chen, H. Fan, J. Wang, J. Jiang, J. Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025.

  26. [26]

    M. Halldórsson and J. Radhakrishnan. Greed is good: Approximating independent sets in sparse and bounded-degree graphs. In Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pages 439–448, 1994.

  27. [27]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.

  28. [28]

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.

  29. [29]

    A. A. Ismail, H. Corrada Bravo, and S. Feizi. Improving deep learning interpretability by saliency guided training. Advances in Neural Information Processing Systems, 2021.

  30. [30]

    J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with gpus. IEEE transactions on big data, 7(3):535–547, 2019.

  31. [31]

    F. Kang, H. A. Just, A. K. Sahu, and R. Jia. Performance scaling via optimal transport: Enabling data selection from partially revealed sources. Advances in Neural Information Processing Systems, 2024.

  32. [32]

    J. Kim and C. D. Scott. Robust kernel density estimation. The Journal of Machine Learning Research, 13(1):2529–2565, 2012.

  33. [33]

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.

  34. [34]

    P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In International conference on machine learning, 2017.

  35. [35]

    A. Köpf, Y. Kilcher, D. von Rütte, S. Anagnostidis, Z. R. Tam, K. Stevens, A. Barhoum, D. Nguyen, O. Stanley, R. Nagyfi, et al. Openassistant conversations-democratizing large language model alignment. Advances in neural information processing systems, 2023.

  36. [36]

    W. Liu, W. Zeng, K. He, Y. Jiang, and J. He. What makes good data for alignment? A comprehensive study of automatic data selection in instruction tuning. In International Conference on Learning Representations (ICLR), 2024.

  37. [37]

    Z. Liu, A. Karbasi, and T. Rekatsinas. Tsds: Data selection for task-specific model finetuning. Advances in Neural Information Processing Systems, 37:10117–10147, 2024.

  38. [38]

    Z. Liu, K. Zhou, W. X. Zhao, D. Gao, Y. Li, and J.-R. Wen. Less is more: Data value estimation for visual instruction tuning. arXiv preprint arXiv:2403.09559, 2024.

  39. [39]

    S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. In International Conference on Machine Learning, 2023.

  40. [40]

    P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations, 2024.

  41. [41]

    D. Ma, G. Shang, Z. Chen, L. Qin, Y. LUO, H. Xu, L. Pan, S. Fan, K. Yu, and L. Chen. Task-specific data selection for instruction tuning via monosemantic neuronal activations. In Neural Information Processing Systems, 2025.

  42. [42]

    A. Maharana, P. Yadav, and M. Bansal. D2 pruning: Message passing for balancing diversity and difficulty in data pruning. In International Conference on Learning Representations, 2024.

  43. [43]

    W. Mai, Z. Zhang, K. Li, Y. Xue, and F. Li. Dynamic graph construction framework for multimodal named entity recognition in social media. IEEE Transactions on Computational Social Systems, 11(2):2513–2522,

  44. [44]

    L. McInnes, J. Healy, N. Saul, and L. Großberger. Umap: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29), 2018.

  45. [45]

    B. McKinzie, Z. Gan, J.-P. Fauconnier, S. Dodge, B. Zhang, P. Dufter, D. Shah, X. Du, F. Peng, A. Belyi, et al. Mm1: methods, analysis and insights from multimodal llm pre-training. In European Conference on Computer Vision, pages 304–323. Springer, 2024.

  46. [46]

    J. Pan, Q. Zhang, R. Zhang, M. Lu, X. Wan, Y. Zhang, C. Liu, and Q. She. Timesearch-r: Adaptive temporal search for long-form video understanding via self-verification reinforcement learning. arXiv preprint arXiv:2511.05489, 2025.

  47. [47]

    M. Paul, S. Ganguli, and G. K. Dziugaite. Deep learning on a data diet: Finding important examples early in training. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

  48. [48]

    G. Pruthi, F. Liu, S. Kale, and M. Sundararajan. Estimating training data influence by tracing gradient descent. Advances in Neural Information Processing Systems, 2020.

  49. [49]

    S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In European conference on computer vision, 2016.

  50. [50]

    S. Sanghavi, D. Shah, and A. S. Willsky. Message passing for maximum weight independent set. IEEE Transactions on Information Theory, 55(11), 2009.

  51. [51]

    M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi. Social iqa: Commonsense reasoning about social interactions. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 4463–4473, 2019.

  52. [52]

    R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, 2017.

  53. [53]

    O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations (ICLR), 2018.

  54. [54]

    K. Sun, D. Yu, D. Yu, and C. Cardie. Investigating prior knowledge for challenging chinese machine reading comprehension. Transactions of the Association for Computational Linguistics, 8:141–155, 2020.

  55. [55]

    M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. Le, E. Chi, D. Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics, 2023.

  56. [56]

    H. Tan, S. Wu, W. Huang, S. Zhao, and X. QI. Data pruning by information maximization. In International Conference on Learning Representations, 2025.

  57. [57]

    B. S. Team. Seed-oss open-source models. https://github.com/ByteDance-Seed/seed-oss, 2025.

  58. [58]

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

  59. [59]

    X. Wang, Y. Cui, J. Wang, F. Zhang, Y. Wang, X. Zhang, Z. Luo, Q. Sun, Z. Li, Y. Wang, et al. Multimodal learning with next-token prediction for large multimodal models. Nature, 2026.

  60. [60]

    Z. Wang, M. Xia, L. He, H. Chen, Y. Liu, R. Zhu, K. Liang, X. Wu, H. Liu, S. Malladi, A. Chevalier, S. Arora, and D. Chen. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. arXiv preprint arXiv:2406.18521, 2024.

  61. [61]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022.

  62. [62]

    K. Wei, R. Iyer, and J. Bilmes. Submodularity in data subset selection and active learning. In International Conference on Machine Learning (ICML), 2015.

  63. [63]

    xAI. Grok-1.5 vision preview. https://x.ai/news/grok-1.5v, 2024.

  64. [64]

    M. Xia, S. Malladi, S. Gururangan, S. Arora, and D. Chen. Less: Selecting influential data for targeted instruction tuning. In International Conference on Machine Learning, 2024.

  65. [65]

    S. M. Xie, S. Santurkar, T. Ma, and P. Liang. Data selection for language models via importance resampling. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

  66. [66]

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  67. [67]

    Q. Yu, Z. Shen, Z. Yue, Y. Wu, B. Qin, W. Zhang, Y. Li, J. Li, S. Tang, and Y. Zhuang. Mastering collaborative multi-modal data selection: A focus on informativeness, uniqueness, and representativeness. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.

  68. [68]

    L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. Advances in neural information processing systems, 17, 2004.

  69. [69]

    J. Zhang, C.-X. Zhang, Y. Liu, Y.-X. Jin, X.-W. Yang, B. Zheng, Y. Liu, and L.-Z. Guo. D3: diversity, difficulty, and dependability-aware data selection for sample-efficient llm instruction tuning. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, 2025.

  70. [70]

    R. Zhang, D. Jiang, Y. Zhang, H. Lin, Z. Guo, P. Qiu, A. Zhou, P. Lu, K.-W. Chang, P. Gao, and H. Li. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, 2024.

  71. [71]

    Y. Zhang, C.-K. Fan, T. Huang, M. Lu, S. Yu, J. Pan, K. Cheng, Q. She, and S. Zhang. Loss-oriented ranking for automated visual prompting in lvlms. arXiv preprint arXiv:2506.16112, 2025.

  72. [72]

    Y. Zhang, C.-K. Fan, J. Ma, W. Zheng, T. Huang, K. Cheng, D. A. Gudovskiy, T. Okuno, Y. Nakata, K. Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. In International Conference on Machine Learning, 2025.

  73. [73]

    Y. Zhang, F. Xiao, T. Huang, C.-K. Fan, H. Dong, J. Li, J. Wang, K. Cheng, S. Zhang, and H. Guo. Unveiling the tapestry of consistency in large vision-language models. Advances in Neural Information Processing Systems, 2024.