M-DaQ: Retrieving Samples with Multilingual Diversity and Quality for Instruction Fine-Tuning Datasets
Pith reviewed 2026-05-18 16:40 UTC · model grok-4.3
The pith
M-DaQ curates high-quality diverse multilingual data for instruction fine-tuning, yielding models with over 60% win rates on benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose M-DaQ, a diversity-aware sampling framework that jointly optimizes instruction-response quality and cross-lingual semantic diversity. M-DaQ leverages a fine-tuned Quality Scoring Model alongside a maximal marginal relevance-inspired selection strategy to construct balanced, high-fidelity training data. We present the first systematic investigation of the Superficial Alignment Hypothesis in multilingual settings. Extensive evaluations across 18 languages demonstrate that models trained on M-DaQ-curated data achieve average win rates exceeding 60% against strong baselines on Alpaca-Eval and MT-Bench.
What carries the argument
The M-DaQ framework, which uses a fine-tuned Quality Scoring Model and a maximal marginal relevance-inspired selection strategy to optimize quality and cross-lingual diversity.
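As a rough illustration of what a quality-plus-diversity greedy selection can look like, the sketch below implements a generic MMR-style loop in plain Python. It is not the authors' code: the function name `mmr_select`, the trade-off weight `lam`, the use of cosine similarity, and the toy scores and embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(quality, embeddings, k, lam=0.5):
    """Greedily pick up to k indices, trading quality against redundancy.

    Hypothetical scoring rule (an assumption, not taken from the paper):
        score(i) = lam * quality[i] - (1 - lam) * max similarity to selected items
    """
    selected = []
    remaining = set(range(len(quality)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            return lam * quality[i] - (1 - lam) * redundancy
        # Iterate in sorted order so ties resolve deterministically.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: items 0 and 1 are near-duplicates; item 2 differs but scores lower.
quality = [0.9, 0.85, 0.6]
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
picked = mmr_select(quality, embeddings, k=2, lam=0.5)
print(picked)
```

With `lam=1.0` the redundancy term vanishes and the loop reduces to plain top-k by quality, which is one way to see what the diversity term contributes.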
If this is right
- Models exhibit better cultural relevance and contextual appropriateness in multilingual settings.
- The approach supports more effective use of limited data resources for fine-tuning.
- Public code release enables further research and reproducibility in multilingual LLM training.
- The multilingual study of the Superficial Alignment Hypothesis offers new perspectives on model behavior across languages.
Where Pith is reading between the lines
- This curation technique might generalize to improve data selection for other training objectives like reasoning or coding tasks.
- It could be particularly beneficial for low-resource languages where data quality varies widely.
- Future work might combine M-DaQ with other metrics such as safety or factuality in the selection process.
Load-bearing premise
The quality scoring model and diversity selection strategy correctly identify samples that genuinely improve model capabilities in multilingual contexts.
What would settle it
If LLMs retrained on M-DaQ-curated data showed win rates at or below 50% on Alpaca-Eval and MT-Bench, the claim would be challenged.
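To make the 50% threshold concrete, here is a minimal sketch of how a pairwise win rate could be computed from judge outcomes. The `win_rate` helper, the convention of counting a tie as half a win, and the sample judgments are assumptions for illustration; the paper's exact scoring protocol may differ.

```python
def win_rate(outcomes, tie_weight=0.5):
    """Share of pairwise comparisons won by the candidate model.

    `outcomes` is a list of "win", "tie", or "loss" labels from the judge.
    Counting a tie as half a win is an assumed convention, not the paper's.
    """
    if not outcomes:
        raise ValueError("no comparisons to score")
    wins = sum(1 for o in outcomes if o == "win")
    ties = sum(1 for o in outcomes if o == "tie")
    return (wins + tie_weight * ties) / len(outcomes)

# Made-up judgments for illustration only.
judgments = ["win"] * 6 + ["tie"] * 2 + ["loss"] * 2
rate = win_rate(judgments)
print(f"win rate = {rate:.2f}, beats parity: {rate > 0.5}")
```

A rate at or below 0.5 under this definition means the candidate does no better than the baseline it is compared against.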
Original abstract
Multilingual instruction fine-tuning (IFT) empowers large language models to generalize across diverse linguistic and cultural contexts; however, high-quality, systematically curated multilingual IFT datasets remain scarce. To address this gap, we propose M-DaQ (Multilingual Diversity and Quality), a diversity-aware sampling framework that jointly optimizes instruction-response quality and cross-lingual semantic diversity. M-DaQ leverages a fine-tuned Quality Scoring Model alongside a maximal marginal relevance-inspired selection strategy to construct balanced, high-fidelity training data. Furthermore, we present the first systematic investigation of the Superficial Alignment Hypothesis in multilingual settings. Extensive evaluations across 18 languages demonstrate that models trained on M-DaQ-curated data achieve average win rates exceeding 60% against strong baselines on Alpaca-Eval and MT-Bench. Complementary human evaluations corroborate these gains, highlighting significant improvements in cultural relevance, contextual appropriateness, and instruction-following capability. The code are publicly released to facilitate reproducibility and future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes M-DaQ, a diversity-aware sampling framework for multilingual instruction fine-tuning (IFT) datasets. It jointly optimizes instruction-response quality via a fine-tuned Quality Scoring Model and cross-lingual semantic diversity via a maximal marginal relevance-inspired selection strategy. The work also presents the first systematic investigation of the Superficial Alignment Hypothesis in multilingual settings. Extensive evaluations across 18 languages show that models trained on M-DaQ-curated data achieve average win rates exceeding 60% against strong baselines on Alpaca-Eval and MT-Bench, with complementary human evaluations confirming gains in cultural relevance, contextual appropriateness, and instruction-following. Code is publicly released.
Significance. If the results hold under rigorous controls, the framework could meaningfully advance multilingual LLM alignment by addressing the scarcity of high-quality, balanced IFT data through a reproducible curation method. The public code release and explicit focus on cross-lingual diversity are strengths that support verifiability and extension by the community.
major comments (2)
- [§4 and §4.2] §4 (Experiments) and §4.2 (Evaluation Setup): The central claim that the MMR-inspired diversity strategy, when combined with the Quality Scoring Model, causally drives the reported >60% average win rates requires isolation of the diversity component. The manuscript compares M-DaQ only against external baselines; it lacks internal ablations such as (i) quality-only selection from the same candidate pool or (ii) random subsets of matched size. Without these controls, gains could be attributable to high-quality filtering alone rather than the joint optimization described in the framework.
- [§3] §3 (M-DaQ Framework): The description of the maximal marginal relevance-inspired selection does not specify how the diversity penalty is parameterized or balanced against the quality score (e.g., the value of λ or the embedding space used for semantic similarity). This makes it difficult to assess whether the reported performance is robust to reasonable variations in the selection hyper-parameters.
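For reference, the classic maximal marginal relevance criterion picks the next item by balancing a relevance term against redundancy with a single weight λ. Written with a quality score in place of query relevance (an assumed adaptation; the manuscript's exact objective is precisely what this comment asks the authors to specify), it could read as follows, where the symbols C (candidate pool), S (already selected set), q (quality score), and sim (semantic similarity) are illustrative names:

```latex
x^{*} = \arg\max_{x_i \in C \setminus S}
  \Bigl[\, \lambda \, q(x_i) \;-\; (1 - \lambda) \max_{x_j \in S} \operatorname{sim}(x_i, x_j) \Bigr],
  \qquad 0 \le \lambda \le 1
```

Under this form, λ = 1 recovers pure quality ranking and smaller λ penalizes redundancy more heavily, which is why both the value of λ and the embedding space behind sim matter for assessing robustness.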
minor comments (2)
- [Abstract] Abstract: The sentence 'The code are publicly released' contains a subject-verb agreement error and should read 'The code is publicly released'.
- [§5] §5 (Human Evaluation): The human evaluation protocol would benefit from reporting inter-annotator agreement statistics (e.g., Cohen’s κ or Fleiss’ κ) to substantiate the claimed improvements in cultural relevance and contextual appropriateness.
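As a pointer to what such a statistic involves, the sketch below computes Cohen's κ for two annotators from scratch. The function name and the example labels are hypothetical; with more than two annotators, Fleiss' κ would be the corresponding measure.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical preference labels from two annotators on six comparisons.
a = ["A", "A", "B", "B", "A", "B"]
b = ["A", "A", "B", "A", "A", "B"]
kappa = cohens_kappa(a, b)
print(round(kappa, 3))
```

Because κ discounts agreement expected by chance, it is more informative than raw percentage agreement when one label dominates.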
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and detailed comments on our manuscript. We address each major comment point by point below and describe the revisions we will make to strengthen the paper.
- Referee: [§4 and §4.2] §4 (Experiments) and §4.2 (Evaluation Setup): The central claim that the MMR-inspired diversity strategy, when combined with the Quality Scoring Model, causally drives the reported >60% average win rates requires isolation of the diversity component. The manuscript compares M-DaQ only against external baselines; it lacks internal ablations such as (i) quality-only selection from the same candidate pool or (ii) random subsets of matched size. Without these controls, gains could be attributable to high-quality filtering alone rather than the joint optimization described in the framework.
Authors: We agree that internal ablations are needed to isolate the contribution of the diversity component from quality filtering alone. In the revised manuscript, we will add these controls to Section 4: (i) a quality-only baseline that selects from the same candidate pool using only the Quality Scoring Model (without MMR), and (ii) random subsets of matched size drawn from the same pool. These results will be reported alongside the existing comparisons to demonstrate that the joint optimization drives the observed gains. revision: yes
- Referee: [§3] §3 (M-DaQ Framework): The description of the maximal marginal relevance-inspired selection does not specify how the diversity penalty is parameterized or balanced against the quality score (e.g., the value of λ or the embedding space used for semantic similarity). This makes it difficult to assess whether the reported performance is robust to reasonable variations in the selection hyper-parameters.
Authors: We thank the referee for highlighting this omission. The revised Section 3 will explicitly state that λ is set to 0.5 and that semantic similarity is computed in the embedding space of LaBSE multilingual sentence embeddings. We will also add a short sensitivity analysis in the appendix varying λ across [0.3, 0.7] to confirm robustness of the reported performance. revision: yes
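The two controls discussed in the first response can be sketched in a few lines. This is only an illustration of how matched-size comparison arms might be drawn from one candidate pool; the helper names, the `quality` field, and the toy pool are assumptions, not the authors' pipeline.

```python
import random

def quality_only(pool, k):
    """Control (i): top-k by quality score alone, with no diversity term."""
    return sorted(pool, key=lambda s: s["quality"], reverse=True)[:k]

def random_matched(pool, k, seed=0):
    """Control (ii): uniform random subset of the same size k."""
    return random.Random(seed).sample(pool, k)

# Hypothetical candidate pool; the ids and quality scores are invented.
pool = [
    {"id": i, "quality": q}
    for i, q in enumerate([0.2, 0.9, 0.5, 0.7, 0.4, 0.8])
]
k = 3
arms = {
    "quality_only": quality_only(pool, k),
    "random_matched": random_matched(pool, k, seed=42),
}
for name, subset in arms.items():
    print(name, [s["id"] for s in subset])
```

Keeping the pool and the subset size fixed across arms is what lets any difference in downstream win rate be attributed to the selection rule rather than to data volume.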
Circularity Check
No circularity: empirical curation framework evaluated on external benchmarks
The paper describes an empirical data-selection pipeline (fine-tuned Quality Scoring Model plus MMR-inspired sampling) whose central claims rest on comparative win rates (>60% average) and human evaluations against external baselines on Alpaca-Eval and MT-Bench across 18 languages. No equations, derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or described framework. The results are self-contained against public benchmarks and reproducible code, satisfying the criteria for a non-circular empirical contribution.
Axiom & Free-Parameter Ledger
invented entities (1)
- M-DaQ sampling framework: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "M-DaQ leverages a fine-tuned Quality Scoring Model alongside a maximal marginal relevance-inspired selection strategy to construct balanced, high-fidelity training data."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose an unsupervised clustering algorithm that selects diverse samples in a language-agnostic manner."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] INTRODUCTION. Some machine learning methods [1, 2] have been proposed for improving Multilingual Instruction Fine-Tuning (IFT) dataset, which plays a pivotal role in enabling large language models (LLMs) to perform effectively general-purpose tasks. However, constructing high-quality multilingual IFT datasets remains a significant challenge [3], and the...
- [2] We propose M-DaQ, a novel method for improving LLM multilinguality. This method jointly optimizes both sample quality (via QSM) and diversity (via unsupervised clustering algorithm) for Multilingual IFT dataset. We will open-source code to facilitate reproducibility and future research.
- [3] We conduct a comprehensive multilingual IFT evaluation across 18 languages. Our results demonstrate that M-DaQ effectively mitigates multilingual challenges arising from the scarcity of high-quality IFT data and the underrepresentation of culturally specific values.
- [4] We present the first systematic empirical investigation of the SAH in multilingual contexts. Our findings provide novel insights into how the scale and composition of IFT data influence cross-lingual LLM alignment.
- [5] RELATED WORK. IFT data selection has been investigated in several prior studies [12, 13, 14, 2]. For example, Zhao et al. [15] propose heuristic-based selection strategies, while Chen et al. [16] utilize LLMs to score and rank training samples for subsequent selection. However, these methods largely disregard the linguistic and cultural specificities of ...
- [6] METHOD. 3.1. Quality Scoring Model (QSM). We introduce the Quality Scoring Model (QSM), a fine-tuned variant of XLM-RoBERTa [22] designed to estimate the quality of multilingual IFT samples. The model is trained on a curated multilingual dataset derived from MIDB [21], which comprises 18 languages with approximately 2.3K expert-revised IFT samples per lan...
- [7] EXPERIMENT. 4.1. Experimental Setup. We apply our proposed M-DaQ algorithm to the Alpaca 52K dataset, extending it to 18 languages. We fine-tune the Llama 3 8B base model using the selected subset, resulting in the M-DaQ Model. For comparison, we train a Vanilla Model using the original, unfiltered Alpaca 52K dataset. Both models are trained under identical hyp...
- [8] and MT-Bench [23], both of which have been machine-translated and subsequently human-revised for 18 languages [21]. Human Evaluation. To complement automated evaluation and assess nuanced linguistic and cultural quality, we conduct a large-scale human evaluation with 7 native-speaking language experts, each with an average of 3.9 years of professional...
- [9] Scaling IFT dataset from 1K to 10K and 52K samples reduces the win rate by 10.1% and 6.2%, respectively, demonstrating that, compared to larger but noisier datasets, only a few thousand high-quality samples are sufficient for effective alignment with human response preferences. This finding confirms the validity of the SAH in multilingual settings.
- [10] Sensitivity to IFT dataset scale varies across languages: for example, Arabic is less sensitive than French, as evidenced by the shallower slope in the early scaling region (1K–10K) of the win rate curve. This variation likely stems from differences in language-specific pre-training readiness, indicating that the effectiveness of SAH is language-dependen...
- [11] CONCLUSION. In this work, we have presented M-DaQ, a novel, language-agnostic machine learning method for improving LLM multilinguality that jointly optimizes sample quality and diversity of IFT dataset. M-DaQ exhibits particularly strong gains in lower-resource languages, where data scarcity and cultural misalignment most acutely degrade model performance...
- [12] Y. Ge, Y. Liu, C. Hu, W. Meng, S. Tao, X. Zhao, H. Ma, L. Zhang, B. Chen, H. Yang, B. Li, T. Xiao, and J. Zhu, “Clustering and ranking: Diversity-preserved instruction selection through expert-aligned quality estimation,” 2024.
- [13] Y. Zhao, B. Yu, B. Hui, H. Yu, F. Huang, Y. Li, and N. L. Zhang, “A preliminary study of the intrinsic relationship between complexity and alignment,” 2024.
- [14] W. Lai, M. Mesgar, and A. Fraser, “Llms beyond english: Scaling the multilingual capability of llms with cross-lingual feedback,” 2024.
- [15] A. Grattafiori, A. Dubey, et al., “The llama 3 herd of models,” 2024.
- [16] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy, “Lima: Less is more for alignment,” 2023.
- [17] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” https://github.com/tatsu-lab/stanford_alpaca, 2023.
- [18] Y. Wang, S. Mishra, P. Alipoormolabashi, et al., “Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks,” 2022.
- [19] P.-N. Kung, F. Yin, D. Wu, K.-W. Chang, and N. Peng, “Active instruction tuning: Improving cross-task generalization by training on prompt sensitive tasks,” 2023.
- [20] M. Li, Y. Zhang, Z. Li, J. Chen, L. Chen, N. Cheng, J. Wang, T. Zhou, and J. Xiao, “From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning,” 2024.
- [21] H. Gonen, S. Iyer, T. Blevins, N. A. Smith, and L. Zettlemoyer, “Demystifying prompts in language models via perplexity estimation,” 2024.
- [22] X. He, H. Yu, Q. Sun, A. Cheng, T. Zhang, C. Liu, and S. Guo, “Tacos: Open tagging and comparative scoring for instruction fine-tuning data selection,” 2025.
- [23] Y. Chen, Y. Liu, F. Meng, Y. Chen, J. Xu, and J. Zhou, “Improving translation faithfulness of large language models via augmenting instructions,” 2023.
- [24] W. Liu, W. Zeng, K. He, Y. Jiang, and J. He, “What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning,” 2024.
- [25] Q. Du, C. Zong, and J. Zhang, “Mods: Model-oriented data selection for instruction tuning,” 2023.
- [26] H. Zhao, M. Andriushchenko, F. Croce, and N. Flammarion, “Long is more for alignment: A simple but tough-to-beat baseline for instruction fine-tuning,” 2024.
- [27] L. Chen, S. Li, J. Yan, H. Wang, K. Gunaratna, V. Yadav, Z. Tang, V. Srinivasan, T. Zhou, H. Huang, and H. Jin, “Alpagasus: Training a better alpaca with fewer data,” 2024.
- [28] X. Fu and W. Liu, “How reliable is multilingual llm-as-a-judge?,” 2025.
- [29] P. Chen, S. Ji, N. Bogoychev, A. Kutuzov, B. Haddow, and K. Heafield, “Monolingual or multilingual instruction tuning: Which makes a better alpaca,” in Findings of the Association for Computational Linguistics: EACL 2024, 2024, pp. 1347–1356.
- [30] Z. Zhang, D.-H. Lee, Y. Fang, W. Yu, M. Jia, M. Jiang, and F. Barbieri, “Plug: Leveraging pivot language in cross-lingual instruction tuning,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 7025–7046.
- [31] W. Zhu, Y. Lv, Q. Dong, F. Yuan, J. Xu, S. Huang, L. Kong, J. Chen, and L. Li, “Extrapolating large language models to non-english by aligning languages,” arXiv preprint arXiv:2308.04948, 2023.
- [32] Y. Liu, C. Zhao, X. Yang, H. Zeng, S. Tao, W. Meng, M. He, C. Su, Y. Yu, H. Ma, L. Zhang, D. Wei, and H. Yang, “Midb: Multilingual instruction data booster for enhancing multilingual instruction synthesis,” 2025.
- [33] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” 2020.
- [34] G. Bai, J. Liu, X. Bu, Y. He, J. Liu, Z. Zhou, Z. Lin, W. Su, T. Ge, B. Zheng, and W. Ouyang, “Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 7421–7454, Association for Compu...
- [35] X. Li, T. Zhang, Y. Dubois, R. Taori, I. Gulrajani, C. Guestrin, P. Liang, and T. B. Hashimoto, “Alpacaeval: An automatic evaluator of instruction-following models,” https://github.com/tatsu-lab/alpaca_eval, May 2023.