VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing

Chensheng Peng; Haotian Lin; Jinghuan Shang; Lingfeng Sun; Masayoshi Tomizuka; Mingxiao Huo; Mingyu Ding; Mohit Bansal; Yixiao Wang; Yushi Du

arxiv: 2510.05213 · v2 · pith:N2AQWMAHnew · submitted 2025-10-06 · 💻 cs.RO · cs.AI · cs.LG

Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, Masayoshi Tomizuka

Pith reviewed 2026-05-18 09:08 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG

keywords vision expert transformer · robot learning · foundation model distillation · dynamic expert routing · patchwise routing · policy learning · vision foundation models · parameter-efficient adaptation

The pith

VER distills multiple vision foundation models into an expert library then routes task-relevant experts with a lightweight network for robot policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that distilling several vision foundation models into one expert library during pretraining, followed by training only a tiny routing network to pick relevant experts on the fly, produces stronger visual features for robot learning than any single model or full retraining. A sympathetic reader would care because individual foundation models each handle certain visual situations well but fall short across the full range of robot tasks, so flexible combination could make policies more reliable in varied settings. The approach adds patchwise routing that selects experts at the level of image patches and uses a curriculum to start with more options and gradually focus on the best ones. If correct, this yields state-of-the-art results on 17 tasks while using under 0.4 percent new parameters and shifting attention away from backgrounds toward objects that matter for control.

Core claim

VER distills multiple VFMs into a vision expert library during pretraining. It then fine-tunes only a lightweight routing network to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. Patchwise Expert Routing with Curriculum Top-K Annealing improves both flexibility and precision of dynamic expert selection. The method supports parameter-efficient finetuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance by reducing large-norm outliers in task-irrelevant regions and concentrating on task-critical regions.

What carries the argument

The vision expert library formed by distilling multiple VFMs, paired with a lightweight router that applies Patchwise Expert Routing and Curriculum Top-K Annealing to select experts dynamically per task and image patch.
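
The selection step described above can be pictured with a minimal sketch. It assumes the router emits one score per expert for each image patch and that the selected experts are mixed with softmax weights renormalised over the top-k; the page does not spell out the exact gating formula, so the function names (`route_patch`, `route_image`) and the toy numbers are illustrative only, not the authors' implementation.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def route_patch(logits, expert_outputs, k):
    """Mix the top-k experts for a single patch.

    logits: one router score per expert (hypothetical router output).
    expert_outputs: one feature vector per expert for this patch.
    k: number of experts kept active.
    """
    # Indices of the k highest-scoring experts for this patch.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalise the gate weights over the selected experts only (assumption).
    weights = softmax([logits[i] for i in top])
    dim = len(expert_outputs[0])
    mixed = [0.0] * dim
    for w, i in zip(weights, top):
        for d in range(dim):
            mixed[d] += w * expert_outputs[i][d]
    return top, mixed

def route_image(patch_logits, patch_expert_outputs, k):
    """Apply route_patch independently to every patch of an image."""
    return [route_patch(l, o, k) for l, o in zip(patch_logits, patch_expert_outputs)]

# Toy example: two patches, three experts, 2-D features.
patch_logits = [[2.0, 0.1, -1.0], [-0.5, 0.3, 1.5]]
patch_expert_outputs = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
]
routed = route_image(patch_logits, patch_expert_outputs, k=1)
```

Because each patch is routed on its own, two patches of the same image can end up with different experts, which is the property the patchwise design is meant to provide.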

If this is right

New robot tasks can be addressed by updating only the routing network instead of retraining the full visual backbone.
Attention maps shift toward task-critical patches such as grasped objects rather than background clutter.
Different policy heads for the same robot can share one expert library while each uses its own router.
Additional vision models can be added to the library later through continued distillation without restarting from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same library-plus-router pattern could be tested on non-vision inputs such as depth or tactile signals to see whether cross-modal experts combine similarly.
Inspecting which experts activate for a given action might offer a route to more interpretable robot decisions.
Curriculum annealing of the top-k count could be tried in other mixture-of-experts setups outside robotics to stabilize early training.

Load-bearing premise

The complementary strengths of the original vision foundation models remain intact after distillation and the small router can pick the right experts without seeing task labels at decision time.

What would settle it

If swapping the dynamic router for random expert selection or for a single fixed VFM produces equal or better performance on the same 17 tasks and policy heads, the benefit of the distillation-plus-routing design would be refuted.
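
A rough sketch of how such a control comparison could be wired up follows. It only produces the expert indices each variant would pick per patch; the policy training and evaluation loop is out of scope. The mode names (`"dynamic"`, `"random"`, `"fixed"`) and the helper `select_for_image` are hypothetical and not taken from the paper.

```python
import random

def dynamic_select(logits, k):
    """Learned routing: keep the k experts with the highest router scores."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def random_select(num_experts, k, rng):
    """Control: k distinct experts drawn uniformly, ignoring router scores."""
    return sorted(rng.sample(range(num_experts), k))

def fixed_select(expert_index):
    """Control: always the same single expert."""
    return [expert_index]

def select_for_image(patch_logits, mode, k=1, expert_index=0, seed=0):
    """Return the chosen expert indices for every patch under one routing mode."""
    rng = random.Random(seed)  # seeded so the random control is reproducible
    chosen = []
    for logits in patch_logits:
        if mode == "dynamic":
            chosen.append(dynamic_select(logits, k))
        elif mode == "random":
            chosen.append(random_select(len(logits), k, rng))
        elif mode == "fixed":
            chosen.append(fixed_select(expert_index))
        else:
            raise ValueError(f"unknown routing mode: {mode}")
    return chosen

# Toy router scores: two patches, three experts.
patch_logits = [[2.0, 0.1, -1.0], [-0.5, 0.3, 1.5]]
dynamic = select_for_image(patch_logits, "dynamic", k=2)
control_random = select_for_image(patch_logits, "random", k=2)
control_fixed = select_for_image(patch_logits, "fixed", expert_index=1)
```

Keeping the three variants behind one interface means the downstream policy head and training budget can stay identical, so any performance gap is attributable to the selection rule alone.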

Original abstract

Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation for policy can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full re-training to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient finetuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and codes can be found in https://yixiaowang7.github.io/ver_page/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VER distills multiple vision foundation models into a fixed expert library, then trains a tiny router with patchwise selection and curriculum annealing for robot tasks. The SOTA gains, however, need ablations to show that the router, and not just the fine-tuning or the library, is doing the work.

The main point is that they pretrain by distilling several VFMs into a vision expert library, then fine-tune only a lightweight routing network (under 0.4% of parameters) to pick task-relevant experts per patch. The curriculum top-k annealing is meant to stabilize and sharpen that selection over time. This keeps the heavy models frozen while adding robot-domain adaptation through parameter-efficient fine-tuning, which is a practical move for robotics where full retraining is expensive.
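
To make the "under 0.4% of parameters" figure concrete, here is a small arithmetic sketch. The parameter counts are invented purely for illustration (the page does not report the actual model sizes), and `trainable_fraction` is a hypothetical helper.

```python
def trainable_fraction(param_counts, trainable):
    """Share of all parameters that belong to the trainable modules.

    param_counts: mapping from module name to parameter count.
    trainable: set of module names that are updated during fine-tuning.
    """
    total = sum(param_counts.values())
    tuned = sum(n for name, n in param_counts.items() if name in trainable)
    return tuned / total

# Hypothetical sizes, chosen only to illustrate the arithmetic.
counts = {"expert_library": 300_000_000, "router": 1_000_000}
fraction = trainable_fraction(counts, trainable={"router"})
print(f"trainable share: {fraction:.4%}")
```

With these made-up numbers the router accounts for roughly a third of a percent of the total, which is the order of magnitude the abstract describes.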

Referee Report

2 major / 2 minor

Summary. The paper introduces VER, a Vision Expert Transformer for robot learning. It distills multiple pretrained vision foundation models (VFMs) into a vision expert library during pretraining. A lightweight routing network (<0.4% of parameters) is then fine-tuned using Patchwise Expert Routing with Curriculum Top-K Annealing to dynamically select task-relevant experts for downstream robotic tasks. The approach supports parameter-efficient fine-tuning and claims state-of-the-art performance across 17 diverse robotic tasks with multiple policy heads, along with qualitative improvements in reducing large-norm outliers in task-irrelevant regions and concentrating attention on task-critical areas.

Significance. If the empirical gains are robustly attributable to the dynamic routing mechanism rather than distillation or PEFT alone, VER offers a scalable method for combining complementary strengths of multiple VFMs in robotics without full retraining or task-specific labels at inference. The lightweight router and curriculum annealing are practical contributions that could generalize to other multi-VFM settings.

major comments (2)

[Experiments] Experimental results (across the 17 tasks): The manuscript reports SOTA performance and reduced outliers but does not include ablations that disable the dynamic router (e.g., fixed single-expert selection, static averaging of the library, or removal of Patchwise Expert Routing with Curriculum Top-K Annealing). Since only the <0.4% router is fine-tuned post-distillation, it is unclear whether the reported gains over single-VFM baselines arise from the dynamic selection or from the parameter-efficient adaptation procedure itself. Per-task expert selection statistics correlated with task type would also strengthen attribution.
[Method] § on routing mechanism: The claim that the router reliably identifies task-relevant experts without task-specific labels during routing relies on the assumption that complementary VFM strengths survive distillation into the expert library. No quantitative analysis (e.g., expert activation histograms or correlation with task categories) is provided to verify this on the held-out tasks.
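
The per-task expert selection statistics requested in both major comments could be gathered along the following lines. This is a sketch under the assumption that routing decisions are logged as (task, selected expert indices) pairs, one per routed patch; the task names and the helper `selection_frequencies` are hypothetical.

```python
from collections import Counter, defaultdict

def selection_frequencies(records, num_experts):
    """Normalised expert-selection histogram for each task.

    records: iterable of (task_name, selected_expert_indices) pairs,
             one entry per routed patch (assumed logging format).
    num_experts: size of the expert library.
    """
    counts = defaultdict(Counter)
    for task, experts in records:
        counts[task].update(experts)
    freqs = {}
    for task, counter in counts.items():
        total = sum(counter.values())
        # Counter returns 0 for experts that were never selected.
        freqs[task] = [counter[e] / total for e in range(num_experts)]
    return freqs

# Toy log: two hypothetical tasks, three experts.
records = [
    ("pick", [0, 2]), ("pick", [0, 1]),
    ("push", [2]), ("push", [2]),
]
freqs = selection_frequencies(records, num_experts=3)
```

Histograms that differ clearly between task types would support the claim that the router is specialising; near-identical histograms would point toward the gains coming from elsewhere.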

minor comments (2)

[Introduction] The abstract and introduction would benefit from explicit comparison to prior multi-VFM distillation or routing methods in robotics to better position the novelty of the curriculum annealing schedule.
[Figures] Figure captions for attention visualizations should include quantitative metrics (e.g., norm statistics in background vs. foreground regions) to support the qualitative claims of outlier reduction.
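
The norm statistics suggested for the figure captions could be computed roughly as follows. The sketch assumes per-patch feature vectors and a boolean foreground mask are available (for example from a segmentation annotation); neither the mask source nor the helper `region_norm_stats` comes from the paper.

```python
import math

def l2_norm(vector):
    """Euclidean norm of one patch feature vector."""
    return math.sqrt(sum(x * x for x in vector))

def region_norm_stats(patch_features, foreground_mask):
    """Mean and max feature norm for foreground and background patches.

    patch_features: one feature vector per patch.
    foreground_mask: one boolean per patch, True for task-relevant patches
                     (assumed to come from an external annotation).
    """
    fg = [l2_norm(f) for f, m in zip(patch_features, foreground_mask) if m]
    bg = [l2_norm(f) for f, m in zip(patch_features, foreground_mask) if not m]

    def summary(norms):
        if not norms:
            return None  # region is empty for this image
        return {"mean": sum(norms) / len(norms), "max": max(norms)}

    return {"foreground": summary(fg), "background": summary(bg)}

# Toy example: four patches with 2-D features.
features = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0], [0.0, 2.0]]
mask = [True, False, True, False]
stats = region_norm_stats(features, mask)
```

Reporting the background maximum alongside the mean matters here, since the claim concerns large-norm outliers rather than average magnitude.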

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting areas where additional evidence would strengthen the attribution of VER's gains. We address each major comment below and commit to incorporating the suggested analyses and ablations in the revised manuscript.

Referee: [Experiments] Experimental results (across the 17 tasks): The manuscript reports SOTA performance and reduced outliers but does not include ablations that disable the dynamic router (e.g., fixed single-expert selection, static averaging of the library, or removal of Patchwise Expert Routing with Curriculum Top-K Annealing). Since only the <0.4% router is fine-tuned post-distillation, it is unclear whether the reported gains over single-VFM baselines arise from the dynamic selection or from the parameter-efficient adaptation procedure itself. Per-task expert selection statistics correlated with task type would also strengthen attribution.

Authors: We agree that isolating the contribution of the dynamic routing mechanism is essential to demonstrate that performance improvements are not solely due to parameter-efficient fine-tuning. In the revised manuscript we will add ablations that replace the router with (i) fixed single-expert selection, (ii) static averaging across the expert library, and (iii) removal of Curriculum Top-K Annealing. We will also report per-task expert selection frequencies and their correlation with task categories across the 17 robotic tasks. These results will be presented alongside the existing single-VFM baselines to clarify the source of the observed gains.
Revision: yes

Referee: [Method] § on routing mechanism: The claim that the router reliably identifies task-relevant experts without task-specific labels during routing relies on the assumption that complementary VFM strengths survive distillation into the expert library. No quantitative analysis (e.g., expert activation histograms or correlation with task categories) is provided to verify this on the held-out tasks.

Authors: We acknowledge that direct quantitative verification of expert specialization would better support the claim that complementary strengths from the original VFMs are preserved after distillation. We will include expert activation histograms over the held-out tasks and compute correlations between the most frequently selected experts and task types (e.g., manipulation vs. navigation). These analyses will be added to the method and experimental sections of the revised paper.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on held-out tasks do not reduce to fitted inputs or self-referential definitions

full rationale

The paper introduces VER as a practical architecture that distills multiple VFMs into a fixed expert library during pretraining and then trains only a lightweight router (<0.4% parameters) plus Patchwise Expert Routing with Curriculum Top-K Annealing for downstream tasks. All reported gains (SOTA on 17 robotic tasks, reduced large-norm outliers in task-irrelevant regions) are presented as measured outcomes on held-out robot datasets rather than as quantities derived from the paper's own equations or normalizations. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation of the method; the central performance claims remain independent of any self-referential fitting or renaming of known patterns. The evaluation therefore stands as an external empirical test rather than a tautology.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The approach assumes that multiple VFMs contain complementary, distillable knowledge and that a small learned router can select among them without task-specific supervision during inference. No new physical entities or untestable constants are introduced.

free parameters (2)

Top-K value and annealing schedule: Curriculum Top-K Annealing controls how many experts are active during training; the schedule and final K are chosen to balance flexibility and precision.
Routing network size: The router is constrained to <0.4% of total parameters; its exact hidden dimension and activation choices are design decisions.
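
As an illustration of the first free parameter, a Top-K annealing schedule might look like the sketch below. It assumes a linear decay from a starting count to a final count; the page does not state the schedule's actual shape, so `annealed_top_k`, `k_start`, and `k_end` are hypothetical names and values.

```python
def annealed_top_k(step, total_steps, k_start, k_end):
    """Active-expert count at a given training step.

    Linearly interpolates from k_start (many experts, early training)
    to k_end (few experts, late training). The linear shape is an
    assumption made for illustration.
    """
    if total_steps <= 1:
        return k_end
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return round(k_start + (k_end - k_start) * progress)

# Toy schedule: five steps, shrinking from six active experts to two.
schedule = [annealed_top_k(s, 5, k_start=6, k_end=2) for s in range(5)]
```

Starting with a larger K lets every expert receive gradient signal through the router early on, while the smaller final K forces a sharper choice; both endpoints are tunable and would need to be reported for reproducibility.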

axioms (1)

domain assumption: Distilled expert features remain sufficiently orthogonal and task-discriminative after the initial pretraining stage.
The paper relies on this to justify freezing the expert library while only training the router.

invented entities (1)

Vision Expert library (no independent evidence)
purpose: A fixed collection of distilled VFM representations that the router selects from.
The library is created by the authors' distillation procedure; no independent evidence outside the paper is provided for its optimality.

pith-pipeline@v0.9.0 · 5793 in / 1395 out tokens · 44881 ms · 2026-05-18T09:08:59.418374+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: VER distills multiple VFMs into a vision expert library... lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Patchwise Expert Routing with Curriculum Top-K Annealing

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.