M-DaQ: Retrieving Samples with Multilingual Diversity and Quality for Instruction Fine-Tuning Datasets

Boxing Chen; Chen Liu; Chunguang Zhao; Daimeng Wei; Hongxia Ma; Li Zhang; Minggui He; Pufan Zeng; Shimin Tao; Song Xu

arxiv: 2509.15549 · v2 · submitted 2025-09-19 · 💻 cs.CL

Chunguang Zhao, Yilun Liu, Pufan Zeng, Yuanchang Luo, Shimin Tao, Minggui He, Weibin Meng, Song Xu, Chen Liu, Hongxia Ma, Li Zhang, Boxing Chen, Daimeng Wei

Pith reviewed 2026-05-18 16:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords multilingual instruction fine-tuning · data curation · diversity sampling · quality scoring model · large language models · cross-lingual diversity · instruction tuning

The pith

M-DaQ curates high-quality diverse multilingual data for instruction fine-tuning, yielding models with over 60% win rates on benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors introduce M-DaQ, a framework for sampling instruction fine-tuning data that balances response quality and semantic diversity across languages. It relies on a fine-tuned quality scorer and a diversity-promoting selection technique. Sympathetic readers care because scarce high-quality multilingual data hinders LLMs from performing well beyond English. The method leads to models that win more than 60 percent of comparisons against baselines in evaluations spanning 18 languages. Human judges also note better cultural fit and instruction adherence.

Core claim

We propose M-DaQ, a diversity-aware sampling framework that jointly optimizes instruction-response quality and cross-lingual semantic diversity. M-DaQ leverages a fine-tuned Quality Scoring Model alongside a maximal marginal relevance-inspired selection strategy to construct balanced, high-fidelity training data. We present the first systematic investigation of the Superficial Alignment Hypothesis in multilingual settings. Extensive evaluations across 18 languages demonstrate that models trained on M-DaQ-curated data achieve average win rates exceeding 60% against strong baselines on Alpaca-Eval and MT-Bench.

What carries the argument

The M-DaQ framework, which pairs a fine-tuned Quality Scoring Model with a maximal marginal relevance-inspired selection strategy to optimize both response quality and cross-lingual diversity.
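
A minimal sketch of what a maximal marginal relevance (MMR) style selector could look like. This is not the authors' implementation: the function names, the toy scores, and the two-dimensional embeddings are hypothetical, and the paper excerpts further down describe the diversity step as an unsupervised clustering algorithm, so the exact mechanism may differ. The weight `lam` is an assumed trade-off parameter between quality and redundancy.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mmr_select(quality, embeddings, k, lam=0.5):
    """Greedily pick up to k indices, trading quality against redundancy.

    Score of candidate i given the already selected set S (assumed form):
        lam * quality[i] - (1 - lam) * max over j in S of cosine(emb[i], emb[j])
    """
    if len(quality) != len(embeddings):
        raise ValueError("quality and embeddings must have the same length")
    remaining = list(range(len(quality)))
    selected = []
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            # Redundancy is the similarity to the closest already-picked sample.
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            score = lam * quality[i] - (1.0 - lam) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Toy pool (hypothetical): samples 0 and 1 are near-duplicates, sample 2 differs.
quality = [0.9, 0.85, 0.6]
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
picked = mmr_select(quality, embeddings, k=2, lam=0.5)
print(picked)
```

With `lam=0.5` the near-duplicate sample 1 is skipped in favour of the lower-scored but dissimilar sample 2; with `lam=1.0` the selector reduces to plain top-k by quality.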

If this is right

Models exhibit better cultural relevance and contextual appropriateness in multilingual settings.
The approach supports more effective use of limited data resources for fine-tuning.
Public code release enables further research and reproducibility in multilingual LLM training.
The multilingual study of the Superficial Alignment Hypothesis offers new perspectives on model behavior across languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This curation technique might generalize to improve data selection for other training objectives like reasoning or coding tasks.
It could be particularly beneficial for low-resource languages where data quality varies widely.
Future work might combine M-DaQ with other metrics such as safety or factuality in the selection process.

Load-bearing premise

The quality scoring model and diversity selection strategy correctly identify samples that genuinely improve model capabilities in multilingual contexts.

What would settle it

If LLMs retrained on M-DaQ-curated data showed win rates at or below 50% against the same baselines on Alpaca-Eval and MT-Bench, the central claim would be challenged.
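
To make the 50% threshold concrete, the sketch below computes a win rate from pairwise judgments and a rough interval around it. The counts are hypothetical and not taken from the paper; counting ties as half a win is one common convention and may not match the paper's protocol, and the normal-approximation interval is only a coarse check.

```python
import math


def win_rate(wins, losses, ties=0):
    """Win rate with ties counted as half a win (an assumed convention)."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("no comparisons")
    return (wins + 0.5 * ties) / total


def normal_ci(rate, n, z=1.96):
    """Approximate 95% interval for a proportion (normal approximation)."""
    half = z * math.sqrt(rate * (1.0 - rate) / n)
    return max(0.0, rate - half), min(1.0, rate + half)


# Hypothetical counts for one language; not results from the paper.
wins, losses, ties = 130, 60, 10
rate = win_rate(wins, losses, ties)
low, high = normal_ci(rate, wins + losses + ties)
beats_chance = low > 0.5  # True only if the whole interval sits above 50%
print(round(rate, 3), round(low, 3), round(high, 3), beats_chance)
```

A replication that reported the interval as well as the point estimate would make it easier to tell a real effect from noise in per-language results.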

Original abstract

Multilingual instruction fine-tuning (IFT) empowers large language models to generalize across diverse linguistic and cultural contexts; however, high-quality, systematically curated multilingual IFT datasets remain scarce. To address this gap, we propose M-DaQ (Multilingual Diversity and Quality), a diversity-aware sampling framework that jointly optimizes instruction-response quality and cross-lingual semantic diversity. M-DaQ leverages a fine-tuned Quality Scoring Model alongside a maximal marginal relevance-inspired selection strategy to construct balanced, high-fidelity training data. Furthermore, we present the first systematic investigation of the Superficial Alignment Hypothesis in multilingual settings. Extensive evaluations across 18 languages demonstrate that models trained on M-DaQ-curated data achieve average win rates exceeding 60% against strong baselines on Alpaca-Eval and MT-Bench. Complementary human evaluations corroborate these gains, highlighting significant improvements in cultural relevance, contextual appropriateness, and instruction-following capability. The code are publicly released to facilitate reproducibility and future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

M-DaQ gives a usable curation recipe for multilingual IFT data that reports solid benchmark gains, but the diversity selection step lacks the ablations needed to show it adds anything beyond quality filtering.

The main thing to know is that the authors built M-DaQ to pick instruction-response pairs by scoring quality with a fine-tuned model and then applying an MMR-inspired selector to keep cross-lingual semantic variety. They also run what they call the first systematic check on the Superficial Alignment Hypothesis in multilingual settings. Models trained on the resulting data beat strong baselines with average win rates above 60 percent on Alpaca-Eval and MT-Bench across 18 languages, and human raters note better cultural fit and instruction following. The code release is a plus for anyone who wants to try it.

Referee Report

2 major / 2 minor

Summary. The paper proposes M-DaQ, a diversity-aware sampling framework for multilingual instruction fine-tuning (IFT) datasets. It jointly optimizes instruction-response quality via a fine-tuned Quality Scoring Model and cross-lingual semantic diversity via a maximal marginal relevance-inspired selection strategy. The work also presents the first systematic investigation of the Superficial Alignment Hypothesis in multilingual settings. Extensive evaluations across 18 languages show that models trained on M-DaQ-curated data achieve average win rates exceeding 60% against strong baselines on Alpaca-Eval and MT-Bench, with complementary human evaluations confirming gains in cultural relevance, contextual appropriateness, and instruction-following. Code is publicly released.

Significance. If the results hold under rigorous controls, the framework could meaningfully advance multilingual LLM alignment by addressing the scarcity of high-quality, balanced IFT data through a reproducible curation method. The public code release and explicit focus on cross-lingual diversity are strengths that support verifiability and extension by the community.

major comments (2)

[§4 and §4.2] §4 (Experiments) and §4.2 (Evaluation Setup): The central claim that the MMR-inspired diversity strategy, when combined with the Quality Scoring Model, causally drives the reported >60% average win rates requires isolation of the diversity component. The manuscript compares M-DaQ only against external baselines; it lacks internal ablations such as (i) quality-only selection from the same candidate pool or (ii) random subsets of matched size. Without these controls, gains could be attributable to high-quality filtering alone rather than the joint optimization described in the framework.
[§3] §3 (M-DaQ Framework): The description of the maximal marginal relevance-inspired selection does not specify how the diversity penalty is parameterized or balanced against the quality score (e.g., the value of λ or the embedding space used for semantic similarity). This makes it difficult to assess whether the reported performance is robust to reasonable variations in the selection hyper-parameters.
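
The two control arms requested in the first major comment could be constructed along the following lines. This is a sketch under stated assumptions: the helper names, the six-sample pool, and its scores are hypothetical, and the Jaccard overlap is only one convenient way to see how far a selection departs from a control.

```python
import random


def quality_only(scores, k):
    """Arm (i): top-k indices by quality score, with no diversity term."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def random_matched(pool_size, k, seed=0):
    """Arm (ii): uniform random subset of the same size from the same pool."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(pool_size), k))


def jaccard(a, b):
    """Overlap between two index sets; 1.0 means identical selections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


# Hypothetical quality scores for a six-sample candidate pool.
scores = [0.91, 0.40, 0.88, 0.15, 0.73, 0.66]
k = 3
arm_quality = quality_only(scores, k)
arm_random = random_matched(len(scores), k, seed=0)
print(arm_quality, arm_random, jaccard(arm_quality, arm_random))
```

Each arm would then be used to fine-tune the same base model under the same hyperparameters, so that any difference in win rate can be attributed to the selection rule rather than to subset size or pool composition.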

minor comments (2)

[Abstract] Abstract: The sentence 'The code are publicly released' contains a subject-verb agreement error and should read 'The code is publicly released'.
[§5] §5 (Human Evaluation): The human evaluation protocol would benefit from reporting inter-annotator agreement statistics (e.g., Cohen’s κ or Fleiss’ κ) to substantiate the claimed improvements in cultural relevance and contextual appropriateness.
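
For reference, Cohen's κ for a pair of annotators can be computed as sketched below. The labels are hypothetical and not from the paper's human evaluation; with more than two raters per item, Fleiss' κ would be the more natural statistic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(
        (count_a[c] / n) * (count_b[c] / n) for c in set(count_a) | set(count_b)
    )
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical preference labels from two raters on eight comparisons.
rater_1 = ["win", "win", "loss", "tie", "win", "loss", "win", "tie"]
rater_2 = ["win", "win", "loss", "win", "win", "loss", "tie", "tie"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 3))
```

Here the raters agree on 6 of 8 items (0.75) against a chance agreement of 0.375, giving κ = 0.6.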

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and detailed comments on our manuscript. We address each major comment point by point below and describe the revisions we will make to strengthen the paper.

Referee: [§4 and §4.2] §4 (Experiments) and §4.2 (Evaluation Setup): The central claim that the MMR-inspired diversity strategy, when combined with the Quality Scoring Model, causally drives the reported >60% average win rates requires isolation of the diversity component. The manuscript compares M-DaQ only against external baselines; it lacks internal ablations such as (i) quality-only selection from the same candidate pool or (ii) random subsets of matched size. Without these controls, gains could be attributable to high-quality filtering alone rather than the joint optimization described in the framework.

Authors: We agree that internal ablations are needed to isolate the contribution of the diversity component from quality filtering alone. In the revised manuscript, we will add these controls to Section 4: (i) a quality-only baseline that selects from the same candidate pool using only the Quality Scoring Model (without MMR), and (ii) random subsets of matched size drawn from the same pool. These results will be reported alongside the existing comparisons to demonstrate that the joint optimization drives the observed gains. revision: yes
Referee: [§3] §3 (M-DaQ Framework): The description of the maximal marginal relevance-inspired selection does not specify how the diversity penalty is parameterized or balanced against the quality score (e.g., the value of λ or the embedding space used for semantic similarity). This makes it difficult to assess whether the reported performance is robust to reasonable variations in the selection hyper-parameters.

Authors: We thank the referee for highlighting this omission. The revised Section 3 will explicitly state that λ is set to 0.5 and that semantic similarity is computed in the embedding space of LaBSE multilingual sentence embeddings. We will also add a short sensitivity analysis in the appendix varying λ across [0.3, 0.7] to confirm robustness of the reported performance. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical curation framework evaluated on external benchmarks

The paper describes an empirical data-selection pipeline (fine-tuned Quality Scoring Model plus MMR-inspired sampling) whose central claims rest on comparative win rates (>60% average) and human evaluations against external baselines on Alpaca-Eval and MT-Bench across 18 languages. No equations, derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or described framework. The results are self-contained against public benchmarks and reproducible code, satisfying the criteria for a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed in the provided text. The M-DaQ framework itself functions as the primary new contribution.

invented entities (1)

M-DaQ sampling framework · no independent evidence
purpose: Jointly optimize instruction-response quality and cross-lingual semantic diversity for multilingual IFT datasets
Introduced as the core proposed method in the abstract.

pith-pipeline@v0.9.0 · 5740 in / 1205 out tokens · 36119 ms · 2026-05-18T16:40:56.326046+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: M-DaQ leverages a fine-tuned Quality Scoring Model alongside a maximal marginal relevance-inspired selection strategy to construct balanced, high-fidelity training data.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose an unsupervised clustering algorithm that selects diverse samples in a language-agnostic manner.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

[1] INTRODUCTION. Some machine learning methods [1, 2] have been proposed for improving Multilingual Instruction Fine-Tuning (IFT) dataset, which plays a pivotal role in enabling large language models (LLMs) to perform effectively general-purpose tasks. However, constructing high-quality multilingual IFT datasets remains a significant challenge [3], and the... (internal anchor; arXiv 2025)

[2] We propose M-DaQ, a novel method for improving LLM multilinguality. This method jointly optimizes both sample quality (via QSM) and diversity (via unsupervised clustering algorithm) for Multilingual IFT dataset. We will open-source code to facilitate reproducibility and future research

[3] We conduct a comprehensive multilingual IFT evaluation across 18 languages. Our results demonstrate that M-DaQ effectively mitigates multilingual challenges arising from the scarcity of high-quality IFT data and the underrepresentation of culturally specific values

[4] We present the first systematic empirical investigation of the SAH in multilingual contexts. Our findings provide novel insights into how the scale and composition of IFT data influence cross-lingual LLM alignment

[5] RELATED WORK. IFT data selection has been investigated in several prior studies [12, 13, 14, 2]. For example, Zhao et al. [15] propose heuristic-based selection strategies, while Chen et al. [16] utilize LLMs to score and rank training samples for subsequent selection. However, these methods largely disregard the linguistic and cultural specificities of ...

[6] METHOD. 3.1. Quality Scoring Model (QSM). We introduce the Quality Scoring Model (QSM), a fine-tuned variant of XLM-RoBERTa [22] designed to estimate the quality of multilingual IFT samples. The model is trained on a curated multilingual dataset derived from MIDB [21], which comprises 18 languages with approximately 2.3K expert-revised IFT samples per lan...

[7] EXPERIMENT. 4.1. Experimental Setup. We apply our proposed M-DaQ algorithm to the Alpaca 52K dataset, extending it to 18 languages. We fine-tune the Llama 3 8B base model using the selected subset, resulting in the M-DaQ Model. For comparison, we train a Vanilla Model using the original, unfiltered Alpaca 52K dataset. Both models are trained under identical hyp...

[8] and MT-Bench [23], both of which have been machine-translated and subsequently human-revised for 18 languages [21]. Human Evaluation. To complement automated evaluation and assess nuanced linguistic and cultural quality, we conduct a large-scale human evaluation with 7 native-speaking language experts, each with an average of 3.9 years of professional...

[9] Scaling IFT dataset from 1K to 10K and 52K samples reduces the win rate by 10.1% and 6.2%, respectively, demonstrating that, compared to larger but noisier datasets, only a few thousand high-quality samples are sufficient for effective alignment with human response preferences. This finding confirms the validity of the SAH in multilingual settings

[10] Sensitivity to IFT dataset scale varies across languages; for example, Arabic is less sensitive than French, as evidenced by the shallower slope in the early scaling region (1K–10K) of the win rate curve. This variation likely stems from differences in language-specific pre-training readiness, indicating that the effectiveness of SAH is language-dependen...

[11] CONCLUSION. In this work, we have presented M-DaQ, a novel, language-agnostic machine learning method for improving LLM multilinguality that jointly optimizes samples quality and diversity of IFT dataset. M-DaQ exhibits particularly strong gains in lower-resource languages, where data scarcity and cultural misalignment most acutely degrade model performance...
[12] Y. Ge, Y. Liu, C. Hu, W. Meng, S. Tao, X. Zhao, H. Ma, L. Zhang, B. Chen, H. Yang, B. Li, T. Xiao, and J. Zhu, “Clustering and ranking: Diversity-preserved instruction selection through expert-aligned quality estimation,” 2024

[13] Y. Zhao, B. Yu, B. Hui, H. Yu, F. Huang, Y. Li, and N. L. Zhang, “A preliminary study of the intrinsic relationship between complexity and alignment,” 2024

[14] W. Lai, M. Mesgar, and A. Fraser, “Llms beyond english: Scaling the multilingual capability of llms with cross-lingual feedback,” 2024

[15] A. Grattafiori, A. Dubey, et al., “The llama 3 herd of models,” 2024

[16] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy, “Lima: Less is more for alignment,” 2023

[17] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” https://github.com/tatsu-lab/stanford_alpaca, 2023

[18] Y. Wang, S. Mishra, P. Alipoormolabashi, et al., “Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks,” 2022

[19] P.-N. Kung, F. Yin, D. Wu, K.-W. Chang, and N. Peng, “Active instruction tuning: Improving cross-task generalization by training on prompt sensitive tasks,” 2023

[20] M. Li, Y. Zhang, Z. Li, J. Chen, L. Chen, N. Cheng, J. Wang, T. Zhou, and J. Xiao, “From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning,” 2024

[21] H. Gonen, S. Iyer, T. Blevins, N. A. Smith, and L. Zettlemoyer, “Demystifying prompts in language models via perplexity estimation,” 2024

[22] X. He, H. Yu, Q. Sun, A. Cheng, T. Zhang, C. Liu, and S. Guo, “Tacos: Open tagging and comparative scoring for instruction fine-tuning data selection,” 2025

[23] Y. Chen, Y. Liu, F. Meng, Y. Chen, J. Xu, and J. Zhou, “Improving translation faithfulness of large language models via augmenting instructions,” 2023

[24] W. Liu, W. Zeng, K. He, Y. Jiang, and J. He, “What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning,” 2024

[25] Q. Du, C. Zong, and J. Zhang, “Mods: Model-oriented data selection for instruction tuning,” 2023

[26] H. Zhao, M. Andriushchenko, F. Croce, and N. Flammarion, “Long is more for alignment: A simple but tough-to-beat baseline for instruction fine-tuning,” 2024

[27] L. Chen, S. Li, J. Yan, H. Wang, K. Gunaratna, V. Yadav, Z. Tang, V. Srinivasan, T. Zhou, H. Huang, and H. Jin, “Alpagasus: Training a better alpaca with fewer data,” 2024

[28] X. Fu and W. Liu, “How reliable is multilingual llm-as-a-judge?,” 2025

[29] P. Chen, S. Ji, N. Bogoychev, A. Kutuzov, B. Haddow, and K. Heafield, “Monolingual or multilingual instruction tuning: Which makes a better alpaca,” in Findings of the Association for Computational Linguistics: EACL 2024, 2024, pp. 1347–1356

[30] Z. Zhang, D.-H. Lee, Y. Fang, W. Yu, M. Jia, M. Jiang, and F. Barbieri, “Plug: Leveraging pivot language in cross-lingual instruction tuning,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 7025–7046

[31] W. Zhu, Y. Lv, Q. Dong, F. Yuan, J. Xu, S. Huang, L. Kong, J. Chen, and L. Li, “Extrapolating large language models to non-english by aligning languages,” arXiv preprint arXiv:2308.04948, 2023

[32] Y. Liu, C. Zhao, X. Yang, H. Zeng, S. Tao, W. Meng, M. He, C. Su, Y. Yu, H. Ma, L. Zhang, D. Wei, and H. Yang, “Midb: Multilingual instruction data booster for enhancing multilingual instruction synthesis,” 2025

[33] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” 2020

[34] G. Bai, J. Liu, X. Bu, Y. He, J. Liu, Z. Zhou, Z. Lin, W. Su, T. Ge, B. Zheng, and W. Ouyang, “Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024, p. 7421–7454, Association for Compu...

[35] X. Li, T. Zhang, Y. Dubois, R. Taori, I. Gulrajani, C. Guestrin, P. Liang, and T. B. Hashimoto, “Alpacaeval: An automatic evaluator of instruction-following models,” https://github.com/tatsu-lab/alpaca_eval, 5 2023


work page 2023

[26] [26]

Long is more for alignment: A simple but tough-to-beat baseline for instruction fine-tuning,

H. Zhao, M. Andriushchenko, F. Croce, and N. Flam- marion, “Long is more for alignment: A simple but tough-to-beat baseline for instruction fine-tuning,” 2024

work page 2024

[27] [27]

Alpagasus: Training a better alpaca with fewer data,

L. Chen, S. Li, J. Yan, H. Wang, K. Gunaratna, V . Yadav, Z. Tang, V . Srinivasan, T. Zhou, H. Huang, and H. Jin, “Alpagasus: Training a better alpaca with fewer data,” 2024

work page 2024

[28] [28]

How reliable is multilingual llm-as- a-judge?,

X. Fu and W. Liu, “How reliable is multilingual llm-as- a-judge?,” 2025

work page 2025

[29] [29]

Monolingual or multilingual instruc- tion tuning: Which makes a better alpaca,

P. Chen, S. Ji, N. Bogoychev, A. Kutuzov, B. Haddow, and K. Heafield, “Monolingual or multilingual instruc- tion tuning: Which makes a better alpaca,” inFindings of the Association for Computational Linguistics: EACL 2024, 2024, pp. 1347–1356

work page 2024

[30] [30]

Plug: Leveraging pivot language in cross-lingual instruction tuning,

Z. Zhang, D.-H. Lee, Y . Fang, W. Yu, M. Jia, M. Jiang, and F. Barbieri, “Plug: Leveraging pivot language in cross-lingual instruction tuning,” inProceedings of the 62nd Annual Meeting of the Association for Computa- tional Linguistics (V olume 1: Long Papers), 2024, pp. 7025–7046

work page 2024

[31] [31]

Extrapolating large lan- guage models to non-english by aligning languages,

W. Zhu, Y . Lv, Q. Dong, F. Yuan, J. Xu, S. Huang, L. Kong, J. Chen, and L. Li, “Extrapolating large lan- guage models to non-english by aligning languages,” arXiv preprint arXiv:2308.04948, 2023

work page arXiv 2023

[32] [32]

Midb: Multilingual instruction data booster for enhancing multilingual instruction synthesis,

Y . Liu, C. Zhao, X. Yang, H. Zeng, S. Tao, W. Meng, M. He, C. Su, Y . Yu, H. Ma, L. Zhang, D. Wei, and H. Yang, “Midb: Multilingual instruction data booster for enhancing multilingual instruction synthesis,” 2025

work page 2025

[33] [33]

Unsupervised cross-lingual represen- tation learning at scale,

A. Conneau, K. Khandelwal, N. Goyal, V . Chaudhary, G. Wenzek, F. Guzm´an, E. Grave, M. Ott, L. Zettlemoyer, and V . Stoyanov, “Unsupervised cross-lingual represen- tation learning at scale,” 2020

work page 2020

[34] [34]

Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues,

G. Bai, J. Liu, X. Bu, Y . He, J. Liu, Z. Zhou, Z. Lin, W. Su, T. Ge, B. Zheng, and W. Ouyang, “Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues,” inProceedings of the 62nd Annual Meeting of the Association for Computa- tional Linguistics (V olume 1: Long Papers). 2024, p. 7421–7454, Association for Compu...

work page 2024

[35] [35]

Alpacaeval: An automatic evaluator of instruction-following mod- els,

X. Li, T. Zhang, Y . Dubois, R. Taori, I. Gulrajani, C. Guestrin, P. Liang, and T. B. Hashimoto, “Alpacaeval: An automatic evaluator of instruction-following mod- els,” https://github.com/tatsu-lab/alpaca eval, 5 2023

work page 2023