VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing
Pith reviewed 2026-05-18 09:08 UTC · model grok-4.3
The pith
VER distills multiple vision foundation models into an expert library then routes task-relevant experts with a lightweight network for robot policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VER distills multiple VFMs into a vision expert library during pretraining. It then fine-tunes only a lightweight routing network to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. Patchwise Expert Routing with Curriculum Top-K Annealing improves both flexibility and precision of dynamic expert selection. The method supports parameter-efficient finetuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance by reducing large-norm outliers in task-irrelevant regions and concentrating on task-critical regions.
What carries the argument
The vision expert library formed by distilling multiple VFMs, paired with a lightweight router that applies Patchwise Expert Routing and Curriculum Top-K Annealing to select experts dynamically per task and image patch.
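The routing idea can be illustrated with a minimal sketch. It is not the authors' implementation: the linear schedule from a large k down to a small k, the function names (`annealed_k`, `route_patch`, `route_image`), and the toy logits are all assumptions chosen for illustration, since this summary does not specify the paper's actual schedule or router architecture.

```python
import math

def annealed_k(step, total_steps, k_start, k_end):
    """Assumed schedule: linearly shrink the number of active experts over training."""
    if total_steps <= 1:
        return k_end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return max(k_end, round(k_start - frac * (k_start - k_end)))

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_patch(logits, k):
    """Keep the k highest-scoring experts for one patch and renormalise their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))

def route_image(patch_logits, k):
    """Patchwise routing: every patch chooses its own experts independently."""
    return [route_patch(logits, k) for logits in patch_logits]

# Hypothetical router logits for 2 patches over 4 experts.
patch_logits = [
    [2.0, 0.1, -1.0, 0.5],
    [-0.3, 1.5, 1.4, 0.0],
]
early = route_image(patch_logits, annealed_k(0, 100, k_start=4, k_end=1))
late = route_image(patch_logits, annealed_k(99, 100, k_start=4, k_end=1))
```

Under this assumed schedule, `early` spreads weight over all four experts per patch, while `late` commits each patch to its single highest-scoring expert.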
If this is right
- New robot tasks can be addressed by updating only the routing network instead of retraining the full visual backbone.
- Attention maps shift toward task-critical patches such as grasped objects rather than background clutter.
- Different policy heads for the same robot can share one expert library while each uses its own router.
- Additional vision models can be added to the library later through continued distillation without restarting from scratch.
Where Pith is reading between the lines
- The same library-plus-router pattern could be tested on non-vision inputs such as depth or tactile signals to see whether cross-modal experts combine similarly.
- Inspecting which experts activate for a given action might offer a route to more interpretable robot decisions.
- Curriculum annealing of the top-k count could be tried in other mixture-of-experts setups outside robotics to stabilize early training.
Load-bearing premise
The complementary strengths of the original vision foundation models remain intact after distillation and the small router can pick the right experts without seeing task labels at decision time.
What would settle it
If swapping the dynamic router for random expert selection or for a single fixed VFM produces equal or better performance on the same 17 tasks and policy heads, the benefit of the distillation-plus-routing design would be refuted.
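The two comparison conditions named above can be sketched as drop-in replacements for a learned router. This is a hedged illustration only: `random_router`, `fixed_router`, `mix_features`, and the toy feature vectors are hypothetical names and values, and the routing-table format (one `{expert_id: weight}` dictionary per patch) is an assumption rather than the paper's interface.

```python
import random

def random_router(num_patches, num_experts, k, seed=0):
    """Baseline: pick k experts uniformly at random per patch, with equal weights."""
    rng = random.Random(seed)
    return [
        {e: 1.0 / k for e in rng.sample(range(num_experts), k)}
        for _ in range(num_patches)
    ]

def fixed_router(num_patches, expert_id):
    """Baseline: send every patch to one expert (stand-in for a single fixed VFM)."""
    return [{expert_id: 1.0} for _ in range(num_patches)]

def mix_features(routes, expert_features):
    """Weighted sum of expert outputs per patch.

    expert_features[e][p] is the feature vector expert e produces for patch p.
    """
    mixed = []
    for p, route in enumerate(routes):
        dim = len(expert_features[0][p])
        vec = [0.0] * dim
        for e, w in route.items():
            for d in range(dim):
                vec[d] += w * expert_features[e][p][d]
        mixed.append(vec)
    return mixed

# Hypothetical features: 3 experts, 2 patches, 2 dimensions.
expert_features = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
fixed = mix_features(fixed_router(2, expert_id=1), expert_features)
rand = mix_features(random_router(2, 3, k=2, seed=0), expert_features)
```

Feeding `fixed` or `rand` to the same policy head that consumes the learned router's output would give the comparison this test calls for.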
Original abstract
Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation for policy can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full re-training to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient finetuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and codes can be found in https://yixiaowang7.github.io/ver_page/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VER, a Vision Expert Transformer for robot learning. It distills multiple pretrained vision foundation models (VFMs) into a vision expert library during pretraining. A lightweight routing network (<0.4% of parameters) is then fine-tuned using Patchwise Expert Routing with Curriculum Top-K Annealing to dynamically select task-relevant experts for downstream robotic tasks. The approach supports parameter-efficient fine-tuning and claims state-of-the-art performance across 17 diverse robotic tasks with multiple policy heads, along with qualitative improvements in reducing large-norm outliers in task-irrelevant regions and concentrating attention on task-critical areas.
Significance. If the empirical gains are robustly attributable to the dynamic routing mechanism rather than distillation or PEFT alone, VER offers a scalable method for combining complementary strengths of multiple VFMs in robotics without full retraining or task-specific labels at inference. The lightweight router and curriculum annealing are practical contributions that could generalize to other multi-VFM settings.
major comments (2)
- [Experiments] Experimental results (across the 17 tasks): The manuscript reports SOTA performance and reduced outliers but does not include ablations that disable the dynamic router (e.g., fixed single-expert selection, static averaging of the library, or removal of Patchwise Expert Routing with Curriculum Top-K Annealing). Since only the <0.4% router is fine-tuned post-distillation, it is unclear whether the reported gains over single-VFM baselines arise from the dynamic selection or from the parameter-efficient adaptation procedure itself. Per-task expert selection statistics correlated with task type would also strengthen attribution.
- [Method] § on routing mechanism: The claim that the router reliably identifies task-relevant experts without task-specific labels during routing relies on the assumption that complementary VFM strengths survive distillation into the expert library. No quantitative analysis (e.g., expert activation histograms or correlation with task categories) is provided to verify this on the held-out tasks.
minor comments (2)
- [Introduction] The abstract and introduction would benefit from explicit comparison to prior multi-VFM distillation or routing methods in robotics to better position the novelty of the curriculum annealing schedule.
- [Figures] Figure captions for attention visualizations should include quantitative metrics (e.g., norm statistics in background vs. foreground regions) to support the qualitative claims of outlier reduction.
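The norm statistics requested in the last comment could be computed along the following lines. This is a sketch under stated assumptions: a per-patch foreground mask is taken as given, and the function names, the outlier threshold, and the toy features are hypothetical rather than drawn from the paper.

```python
import math
from statistics import mean

def patch_norms(features):
    """L2 norm of each patch feature vector."""
    return [math.sqrt(sum(x * x for x in f)) for f in features]

def region_norm_stats(features, foreground_mask, outlier_threshold):
    """Summarise feature norms separately for foreground and background patches."""
    norms = patch_norms(features)
    fg = [n for n, m in zip(norms, foreground_mask) if m]
    bg = [n for n, m in zip(norms, foreground_mask) if not m]
    if not fg or not bg:
        raise ValueError("mask must contain both foreground and background patches")
    return {
        "fg_mean": mean(fg),
        "bg_mean": mean(bg),
        "bg_max": max(bg),
        # Share of background patches whose norm exceeds the chosen threshold.
        "bg_outlier_frac": sum(n > outlier_threshold for n in bg) / len(bg),
    }

# Hypothetical 4-patch example; True marks a task-critical (foreground) patch.
features = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0], [1.0, 0.0]]
mask = [True, False, False, True]
stats = region_norm_stats(features, mask, outlier_threshold=5.0)
```

Reporting `bg_outlier_frac` for each backbone next to its attention figure would turn the qualitative outlier claim into a number that can be compared across methods.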
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for highlighting areas where additional evidence would strengthen the attribution of VER's gains. We address each major comment below and commit to incorporating the suggested analyses and ablations in the revised manuscript.
- Referee: [Experiments] Experimental results (across the 17 tasks): The manuscript reports SOTA performance and reduced outliers but does not include ablations that disable the dynamic router (e.g., fixed single-expert selection, static averaging of the library, or removal of Patchwise Expert Routing with Curriculum Top-K Annealing). Since only the <0.4% router is fine-tuned post-distillation, it is unclear whether the reported gains over single-VFM baselines arise from the dynamic selection or from the parameter-efficient adaptation procedure itself. Per-task expert selection statistics correlated with task type would also strengthen attribution.
  Authors: We agree that isolating the contribution of the dynamic routing mechanism is essential to demonstrate that performance improvements are not solely due to parameter-efficient fine-tuning. In the revised manuscript we will add ablations that replace the router with (i) fixed single-expert selection, (ii) static averaging across the expert library, and (iii) removal of Curriculum Top-K Annealing. We will also report per-task expert selection frequencies and their correlation with task categories across the 17 robotic tasks. These results will be presented alongside the existing single-VFM baselines to clarify the source of the observed gains.
  Revision: yes
- Referee: [Method] § on routing mechanism: The claim that the router reliably identifies task-relevant experts without task-specific labels during routing relies on the assumption that complementary VFM strengths survive distillation into the expert library. No quantitative analysis (e.g., expert activation histograms or correlation with task categories) is provided to verify this on the held-out tasks.
  Authors: We acknowledge that direct quantitative verification of expert specialization would better support the claim that complementary strengths from the original VFMs are preserved after distillation. We will include expert activation histograms over the held-out tasks and compute correlations between the most frequently selected experts and task types (e.g., manipulation vs. navigation). These analyses will be added to the method and experimental sections of the revised paper.
  Revision: yes
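The per-task expert activation histograms discussed in this exchange could be tallied as follows. This is an illustrative sketch, not the authors' analysis code: the record format (task name paired with the expert ids selected for one patch), the function name, and the task labels are assumptions.

```python
from collections import Counter, defaultdict

def selection_frequencies(records, num_experts):
    """Turn routing logs into per-task expert selection frequencies.

    records: iterable of (task_name, expert_ids_selected_for_one_patch).
    Returns {task_name: [frequency of expert 0, expert 1, ...]}.
    """
    counts = defaultdict(Counter)
    for task, experts in records:
        counts[task].update(experts)
    freqs = {}
    for task, counter in counts.items():
        total = sum(counter.values())
        freqs[task] = [counter[e] / total for e in range(num_experts)]
    return freqs

# Hypothetical routing log for two tasks and three experts.
records = [
    ("pick", [0, 2]),
    ("pick", [0]),
    ("push", [1]),
    ("push", [1, 2]),
]
freqs = selection_frequencies(records, num_experts=3)
```

If the histograms for different task types came out nearly identical, that would be evidence against the claim that the router exploits complementary expert strengths.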
Circularity Check
No circularity: empirical results on held-out tasks do not reduce to fitted inputs or self-referential definitions
The paper introduces VER as a practical architecture that distills multiple VFMs into a fixed expert library during pretraining and then trains only a lightweight router (<0.4% parameters) plus Patchwise Expert Routing with Curriculum Top-K Annealing for downstream tasks. All reported gains (SOTA on 17 robotic tasks, reduced large-norm outliers in task-irrelevant regions) are presented as measured outcomes on held-out robot datasets rather than as quantities derived from the paper's own equations or normalizations. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation of the method; the central performance claims remain independent of any self-referential fitting or renaming of known patterns. The evaluation therefore stands as an external empirical test rather than a tautology.
Axiom & Free-Parameter Ledger
free parameters (2)
- Top-K value and annealing schedule
- Routing network size
axioms (1)
- domain assumption: Distilled expert features remain sufficiently orthogonal and task-discriminative after the initial pretraining stage.
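One rough way to probe the orthogonality part of this assumption is to measure pairwise cosine similarity between expert outputs on the same input. The sketch below is hypothetical throughout: the function names and the toy vectors are illustrative, and the paper may use a different notion of expert diversity.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def pairwise_expert_similarity(expert_vectors):
    """Cosine similarity for every unordered pair of experts."""
    return {
        (i, j): cosine(expert_vectors[i], expert_vectors[j])
        for i, j in combinations(range(len(expert_vectors)), 2)
    }

# Hypothetical outputs of three experts for one patch.
expert_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sims = pairwise_expert_similarity(expert_vectors)
```

Values near zero indicate near-orthogonal experts, while values near one suggest that two experts collapsed onto similar features during distillation.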
invented entities (1)
- Vision Expert library: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean : washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "VER distills multiple VFMs into a vision expert library... lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean : LogicNat (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Patchwise Expert Routing with Curriculum Top-K Annealing"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.