Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation

Angel X. Chang; Chaofan Tao; Chenming Shang; Hayden Kwok-Hay So; Hengyuan Zhang; Jing Xiong; Ngai Wong; Ruobing Xie; Shiping Yang; Xiao Liang

arxiv: 2510.10925 · v2 · submitted 2025-10-13 · 💻 cs.LG · cs.CL

Hengyuan Zhang, Shiping Yang, Xiao Liang, Chenming Shang, Yuxuan Jiang, Chaofan Tao, Jing Xiong, Hayden Kwok-Hay So, Ruobing Xie, Angel X. Chang, Ngai Wong

Pith reviewed 2026-05-18 07:36 UTC · model grok-4.3

keywords personalized synthetic data · multi-teacher distillation · query-level router · student learnability · route then generate · instruction tuning · math reasoning

The pith

A learnability-aware router assigns each prompt to its best teacher so the resulting synthetic data fits the student model more closely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that stronger teachers are not always the best teachers for a given student because their outputs can exceed what the student can readily learn. PerSyn solves this by first routing every prompt to the single most suitable teacher using a query-level router that balances estimated student learnability with teacher response quality. Only the chosen teacher then generates responses for those prompts, producing a dataset tailored to the student instead of a generic pool. The approach follows a Route-then-Generate workflow that avoids the cost of every teacher writing answers for the entire prompt set. Experiments across model families and sizes report that the resulting data yields performance that is superior or at least comparable to existing baselines on instruction tuning and math reasoning tasks.

Core claim

PerSyn introduces a Route then Generate paradigm in which a query-level router assigns each prompt to the teacher that jointly maximizes student learnability and response quality; each teacher then synthesizes data exclusively for its assigned prompts, yielding a student-specific training set that is both more effective and cheaper to produce than the conventional Generate then Select baseline.

What carries the argument

The query-level router that estimates student learnability for each prompt-teacher pair and uses that estimate plus response quality to make the assignment.
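
The assignment step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses the weighted form r = (1 − α) r_q + α r_l quoted further down this page, and it assumes the quality and learnability scores are already available per teacher. The teacher names and score values are hypothetical; in the paper the router predicts these quantities before any response is generated.

```python
from typing import Dict


def combined_reward(quality: float, learnability: float, alpha: float) -> float:
    """Weighted reward r = (1 - alpha) * r_q + alpha * r_l."""
    return (1.0 - alpha) * quality + alpha * learnability


def route(scores: Dict[str, Dict[str, float]], alpha: float) -> str:
    """Return the teacher with the highest combined reward for one prompt."""
    return max(
        scores,
        key=lambda t: combined_reward(
            scores[t]["quality"], scores[t]["learnability"], alpha
        ),
    )


# Hypothetical per-teacher scores for a single prompt (illustrative values only).
scores = {
    "large_teacher": {"quality": 0.9, "learnability": 0.3},
    "small_teacher": {"quality": 0.7, "learnability": 0.8},
}

quality_only_choice = route(scores, alpha=0.0)  # ignores learnability
balanced_choice = route(scores, alpha=0.5)      # weighs both terms equally
```

With these made-up numbers the quality-only setting picks the larger teacher, while the balanced setting switches to the smaller one, which is the "stronger is not always better" effect the summary describes.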

If this is right

The final synthetic dataset requires far fewer total generations because each teacher writes responses only for its routed subset of prompts.
Performance gains appear consistently across different model families and scales in both instruction-following and mathematical reasoning.
The same routing logic can be applied whenever multiple teachers are available and the goal is to match data difficulty to a target learner.
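
The generation saving in the first point can be made concrete with a rough count. The sketch below assumes one response per prompt per teacher and deliberately ignores the cost of training and running the router itself, which the summary above does not quantify. The prompt and teacher counts are hypothetical.

```python
def generate_then_select_calls(num_prompts: int, num_teachers: int) -> int:
    """Every teacher answers every prompt before selection."""
    return num_prompts * num_teachers


def route_then_generate_calls(num_prompts: int) -> int:
    """Only the routed teacher answers each prompt."""
    return num_prompts


# Hypothetical sizes, chosen only for illustration.
n_prompts, n_teachers = 10_000, 5
baseline_calls = generate_then_select_calls(n_prompts, n_teachers)
routed_calls = route_then_generate_calls(n_prompts)
saving_factor = baseline_calls / routed_calls
```

Under these assumptions the saving factor equals the number of teachers, so the benefit grows as more teachers are added to the pool.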

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same router could be reused or lightly adapted when the student is later fine-tuned on new domains, reducing the need to re-run full multi-teacher synthesis.
If the router proves stable, it opens the possibility of dynamically adding or removing teachers at inference time without regenerating the entire dataset.
Extending the router to also predict how much data each assigned teacher should produce might further improve efficiency and final accuracy.

Load-bearing premise

The router can reliably predict which teacher will be easiest for the student to learn from without first training the student on any of the generated examples.

What would settle it

An ablation in which the router is replaced by random or fixed-teacher assignment and the student still reaches equal or higher accuracy on the downstream tasks would falsify the benefit of personalized routing.
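
The three assignment policies in such an ablation could be set up as follows. This is a sketch under stated assumptions: `toy_router` is a hypothetical stand-in for the learned router, the prompt and teacher names are invented, and the generation and student-training steps that would follow each assignment are omitted.

```python
import random
from typing import Callable, Dict, List


def assign_random(prompts: List[str], teachers: List[str], seed: int = 0) -> Dict[str, str]:
    """Pick a teacher uniformly at random for every prompt."""
    rng = random.Random(seed)
    return {p: rng.choice(teachers) for p in prompts}


def assign_fixed(prompts: List[str], teacher: str) -> Dict[str, str]:
    """Send every prompt to the same teacher."""
    return {p: teacher for p in prompts}


def assign_routed(prompts: List[str], router: Callable[[str], str]) -> Dict[str, str]:
    """Send each prompt to whichever teacher the router selects."""
    return {p: router(p) for p in prompts}


def toy_router(prompt: str) -> str:
    """Hypothetical placeholder rule; a real router would be a trained model."""
    return "t_small" if len(prompt) % 2 == 0 else "t_large"


prompts = ["p1", "p22", "p333", "p4444"]
teachers = ["t_small", "t_large"]

policies = {
    "random": assign_random(prompts, teachers, seed=1),
    "fixed": assign_fixed(prompts, "t_large"),
    "routed": assign_routed(prompts, toy_router),
}
```

Each policy covers the same prompts with the same generation budget, so any downstream accuracy difference can be attributed to the assignment rule rather than to data volume.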

Original abstract

Training student models on synthetic data generated by strong teacher models is a promising way to distill the capabilities of teachers. However, recent studies show that stronger models are not always optimal teachers, revealing a mismatch between teacher outputs and student learnability. To address this issue, we propose PerSyn (Personalized data Synthesis), a novel synthesis strategy that operates under a new "Route then Generate" paradigm to create data tailored to each student model, enabling it to learn more effectively. Specifically, PerSyn first assigns each prompt to its optimal teacher via a query-level router that jointly considers student learnability and teacher response quality. Each teacher then synthesizes data only for its assigned prompts, making the process more efficient than the conventional "Generate then Select" paradigm, where all teachers must generate parallel responses for the entire prompt set before constructing the final dataset. Extensive experiments across different model families and scales demonstrate that PerSyn consistently achieves superior or comparable performance to all baselines in instruct tuning and math reasoning settings. Further analysis verifies the effectiveness of PerSyn and offers extra insights to propel future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PerSyn adds a router to assign prompts to teachers before generation, which is a practical efficiency move but rests on unvalidated proxies for learnability.

The main point is that PerSyn switches to a route-then-generate flow for multi-teacher synthetic data. A query-level router picks the best teacher for each prompt by balancing estimated student learnability and response quality, then only that teacher generates the data. This avoids the usual generate-then-select waste where every teacher produces outputs for every prompt first. The abstract reports steady gains over baselines in both instruct tuning and math reasoning across model families and sizes, which is the practical hook.

Referee Report

2 major / 2 minor

Summary. The paper proposes PerSyn, a personalized synthetic data generation approach under a 'Route then Generate' paradigm. A query-level router assigns each prompt to an optimal teacher by jointly considering student learnability and teacher response quality; only the assigned teacher then generates data for that prompt. This is positioned as more efficient than 'Generate then Select' baselines. Extensive experiments across model families and scales are reported to show that PerSyn achieves superior or comparable performance to baselines in both instruct tuning and math reasoning settings.

Significance. If the router's learnability estimates prove reliable, the method could improve both the efficiency of multi-teacher distillation (by avoiding full parallel generation) and the quality of synthetic data for a given student, providing a practical alternative to uniform teacher selection.

major comments (2)

[Abstract] Abstract and router description: The central performance claim requires that the query-level router's proxy for student learnability (e.g., model similarity or response quality scores) correlates with actual post-training gains. No direct validation of this correlation—such as a comparison of router-assigned data versus random or heuristic assignments measured by downstream accuracy—is reported, leaving open whether the observed gains are attributable to the Route-then-Generate paradigm or to other experimental factors.
[Experiments] Experimental section: The manuscript states consistent gains across settings but provides no statistical significance tests, data-split details, or ablation isolating the router's contribution versus the multi-teacher setup itself. This makes it difficult to assess whether the superiority claim holds under standard reproducibility standards.

minor comments (2)

[Method] Notation for the router objective could be clarified with an explicit equation showing how learnability and quality scores are combined.
[Figures] Figure captions should explicitly state the number of runs and error bars if variance is reported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their thorough review and valuable suggestions. Below, we respond to each major comment and describe how we plan to revise the manuscript accordingly.

Referee: [Abstract] Abstract and router description: The central performance claim requires that the query-level router's proxy for student learnability (e.g., model similarity or response quality scores) correlates with actual post-training gains. No direct validation of this correlation—such as a comparison of router-assigned data versus random or heuristic assignments measured by downstream accuracy—is reported, leaving open whether the observed gains are attributable to the Route-then-Generate paradigm or to other experimental factors.

Authors: We acknowledge that directly validating that the router's learnability proxy correlates with post-training gains would strengthen the claims. The current manuscript provides only indirect evidence: overall performance improvements, plus analysis sections showing that the router selects appropriate teachers. To address this concern directly, we will add an ablation study that compares router-guided assignments against random and heuristic baselines and measures downstream accuracy. This will clarify the contribution of the Route-then-Generate paradigm.

Revision: yes

Referee: [Experiments] Experimental section: The manuscript states consistent gains across settings but provides no statistical significance tests, data-split details, or ablation isolating the router's contribution versus the multi-teacher setup itself. This makes it difficult to assess whether the superiority claim holds under standard reproducibility standards.

Authors: We agree that statistical significance tests, detailed data-split information, and an ablation isolating the router would improve reproducibility. In the revised version, we will report p-values or confidence intervals for the gains, specify the train/validation/test splits used in our experiments, and add an ablation that compares PerSyn (with router) to a multi-teacher setup without the router (e.g., uniform or random assignment). This will better isolate the router's contribution.

Revision: yes
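
One way the promised confidence intervals could be computed is a paired bootstrap over test examples. The sketch below is an assumption about how such a test might be run, not a procedure taken from the paper: the per-example correctness lists are hypothetical, and the interval uses simple percentile cut-offs.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    correct_a: List[int],
    correct_b: List[int],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile interval for accuracy(A) - accuracy(B) on the same test items."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    gains = []
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping the A/B pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        gains.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    gains.sort()
    lo = gains[int((1.0 - level) / 2.0 * (n_resamples - 1))]
    hi = gains[int(round((1.0 + level) / 2.0 * (n_resamples - 1)))]
    return lo, hi


# Hypothetical 0/1 correctness of two students on the same eight test items.
routed_student = [1, 1, 0, 1, 1, 0, 1, 1]
baseline_student = [1, 0, 0, 1, 0, 0, 1, 1]
ci_low, ci_high = paired_bootstrap_ci(routed_student, baseline_student)
```

An interval that excludes zero would support a real gain; with a test set as small as this toy example the interval will usually be too wide to conclude anything.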

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

The paper introduces PerSyn as an empirical synthesis strategy relying on a query-level router trained on observable signals (student learnability proxies and teacher response quality) to assign prompts before data generation. Performance claims rest on external experiments across model families rather than any internal reduction of results to fitted parameters or self-referential definitions. No equations or steps equate outputs to inputs by construction, and no load-bearing self-citations or imported uniqueness theorems are invoked to force the central paradigm. The approach remains self-contained against external benchmarks with no evidence of the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach relies on the existence of a learnability signal that can be estimated before generation and on the assumption that teacher quality can be scored independently of the student. No explicit free parameters or invented physical entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5758 in / 1046 out tokens · 22247 ms · 2026-05-18T07:36:46.280672+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

PerSyn first assigns each prompt to its optimal teacher via a query-level router that jointly considers student learnability and teacher response quality... $r(y_i^{M_n}, \theta) = (1-\alpha)\, r_q(y_i^{M_n}) + \alpha\, r_l(y_i^{M_n}, \theta)$

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear

we adopt the Bradley-Terry (BT) model... $P(C = B \succ A \mid Z = z, X = x) = \sigma(z^{\top} \pi(x))$
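
The Bradley-Terry expression quoted above can be evaluated directly. The sketch below rests on an assumption the page does not confirm: that z encodes the compared pair with +1 for B, −1 for A, and 0 elsewhere, and that π(x) is a vector of per-teacher scores for prompt x. Under that reading the probability reduces to σ(π_B(x) − π_A(x)). The score values are hypothetical.

```python
import math
from typing import List


def sigmoid(t: float) -> float:
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def bt_preference_prob(z: List[float], pi_x: List[float]) -> float:
    """P(B preferred over A) = sigma(z^T pi(x)) for a pair encoding z."""
    if len(z) != len(pi_x):
        raise ValueError("z and pi(x) must have the same length")
    return sigmoid(sum(zi * pi for zi, pi in zip(z, pi_x)))


# Hypothetical scores pi(x) for three teachers on one prompt.
pi_x = [0.2, 1.1, -0.4]
z_b_over_a = [-1.0, 1.0, 0.0]  # assumed encoding: A = teacher 0, B = teacher 1
z_a_over_b = [1.0, -1.0, 0.0]  # the same pair with roles swapped

p_b_over_a = bt_preference_prob(z_b_over_a, pi_x)
p_a_over_b = bt_preference_prob(z_a_over_b, pi_x)
```

Swapping the roles of A and B negates the logit, so the two probabilities sum to one, which is the basic consistency property of the model.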

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
cs.CL · 2026-05 · unverdicted · novelty 7.0

Persistent 'Rock Tokens' in on-policy distillation resist teacher corrections, consume large gradient norms, yet add negligible value to reasoning, allowing targeted bypassing to streamline alignment.

MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
cs.CV · 2026-05 · unverdicted · novelty 6.0

MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
cs.LG · 2026-05 · unverdicted · novelty 3.0

Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.