pith. sign in

arxiv: 2605.17743 · v1 · pith:DLADCJDUnew · submitted 2026-05-18 · 💻 cs.CV

MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time Adaptation

Pith reviewed 2026-05-20 12:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords continual test-time adaptationmixture of expertsactivation sparsityon-policy distillationdomain adaptationimage classificationsemantic segmentation
0
0 comments X

The pith

Mixture of activation sparsity experts with on-policy distillation allows models to adapt continually to new domains without forgetting past knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to address continual test-time adaptation for pre-trained models facing non-stationary unlabeled target streams while retaining competence on prior domains. It proposes a plug-in mixture-of-experts module that disentangles domain-agnostic structure from domain-specific texture by forming complementary high- and low-activation pathways through Spatial Differentiable Dropout and high/low-rank bottlenecks. An Activation Sparsity Gate generates input-adaptive thresholds for token selection, and a Domain-Aware Router assigns expert weights based on texture-sensitive cues. Domain-Adaptive On-Policy Distillation then stabilizes learning on unlabeled data via EMA-anchored reverse KL distillation and an augmentation policy driven by entropy and confidence. Experiments on corrupted image classification and cross-domain segmentation report consistent gains over prior methods.

Core claim

The authors establish that Activation Sparsity Experts with Spatial Differentiable Dropout and high/low-rank bottlenecks, guided by an Activation Sparsity Gate and Domain-Aware Router, when combined with EMA-anchored reverse KL distillation and entropy-conditioned augmentation policy, disentangle structure from texture to support robust continual adaptation.

What carries the argument

Mixture of Activation Sparsity Experts (MoASE) using Spatial Differentiable Dropout for pathway formation and Domain-Aware Router for assignment, augmented by Domain-Adaptive On-Policy Distillation for supervision.

Load-bearing premise

The assumption that texture-biased backbones cause error accumulation and catastrophic forgetting, and that the proposed sparsity experts and router can reliably separate domain-agnostic structure from domain-specific texture using only unlabeled streams.

What would settle it

A sequence of domain shifts where accuracy on the original source domain drops sharply after adaptation or where the full method shows no gain over a plain test-time adaptation baseline.

Figures

Figures reproduced from arXiv: 2605.17743 by Aosong Cheng, Dan Wang, Gaole Dai, Huanrui Yang, Jiaming Liu, Leyuan Fang, Li Du, Ronyu Zhang, Shanghang Zhang, Yuan Du, Yulin Luo.

Figure 1
Figure 1. Figure 1: The problem and motivation. Our goal is to adapt pre-trained models to evolving target domains. to inputs from varying domains in [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The visualization analysis of the Class Activation Map (CAM). We adopt CAM to compare the attention of the low-activation, high-activation MoASE, and the original model during the continual adaptation process. The number suggests the mean IoU of the Elastic-transform, Gaussian, Fog, and Snow results based on randomly selected samples from ImageNet-C with the Foreground and Background of the main object seg… view at source ↗
Figure 3
Figure 3. Figure 3: The framework of MoASE++. A policy-driven augmenter samples strength a to create V on-policy views, while the student aligns to a detached EMA teacher via reverse KL on the same views plus a clean-view consistency term. The policy is updated with a reward, enabling a tunable stability–plasticity trade-off. dimension. For a sample index b ∈ {1, . . . , B} and a token index n ∈ {1, . . . , N}, the score is d… view at source ↗
Figure 4
Figure 4. Figure 4: The detailed structure of MoASE. We integrate the MoASE into the linear layer of the student model. idh and idl suggests the expert ID for domain-agnostic experts and domain-specific experts. Let Mlow ∈ {0, 1} B×N with Klow = ⌊N · 1 2 ⌋, and broadcast it to M (3D) low ∈ {0, 1} B×N×D. The low-activation sub-features and routing output are Flow = M (3D) low ⊙ F, ϕ = gDAR(Flow), (6) where ϕ ∈ R B×E satisfies … view at source ↗
Figure 5
Figure 5. Figure 5: Inter-domain and intra-class distance in CIFAR10-C. (a) MoASE++ and MoASE more effectively reduce inter￾domain divergence than the source model. (b) MoASE++ and MoASE significantly improve intra-class feature aggregation, producing results that closely align with those of our proposed method. We maintain an EMA baseline b of rewards. The policy objective combines advantage-weighted log-likelihood with entr… view at source ↗
Figure 6
Figure 6. Figure 6: 5 rounds segmentation CTTA on Cityscape-to￾ACDC. We sequentially repeat the same sequence of target domains 5 times with a different number of experts. performance thereafter. First of all, we define CoTTA∗ as CoTTA with lr=2e-4. We observe that adjusting CoTTA to our setting improves early results but leads to a later drop in segmentation performance and catastrophic forgetting. In addition, MoASE achieve… view at source ↗
Figure 7
Figure 7. Figure 7: Analysis of computational costs for our proposed MoASE++ and previous SOTA baselines, including a detailed comparison in terms of (a) FLOPs, (b) GPU memory, and (c) the number of parameters. This overhead mainly stems from multi-expert routing and the DA-OPD alignment. As future work, we envision an edge–cloud co-inference scheme that keeps lightweight in￾ference on the edge while scheduling heavier full-e… view at source ↗
Figure 8
Figure 8. Figure 8: Ablation study to examine the influence of the hyperparameters for MoASE. TABLE 6: Different number of experts in MoASE. mIoU score for Cityscape-to-ACDC within the first round. num. E. Fog Night Rain Snow Mean↑ E = 2 71.6 44.0 66.5 63.7 61.5 E = 4 72.4 44.5 66.4 63.8 61.8 E = 8 71.4 44.0 65.0 61.7 60.5 E = 16 71.5 44.1 65.9 63.3 61.2 TABLE 7: Balance of DA and DS experts. mIoU score for Cityscape-to-ACDC … view at source ↗
Figure 9
Figure 9. Figure 9: The qualitative analysis of the original and CAM visualization for the source model, TCA, MoASE, and MoASE++ based on ImageNet-C with 15 domains. TABLE 9: Ablation study for the hyperparameters of DA￾OPD on CIFAR100-C. Groups follow the control-variable method: in each block, only one factor changes. Blue sug￾gests our final selection for MoASE++. Methods EMA α OPD-views OPD-temp OPD-weight Mean↑ Alpha swe… view at source ↗
Figure 10
Figure 10. Figure 10: The visualization analysis of the routing strategy for MoASE and MoASE++. E1 s and E3 a stands for the 1st domain-specific and the 3rd domain-agnostic experts. tion is dispersed due to continuous domain shifts, while TCA [16] also fails to detect the complete object in continual domain-adaptive scenarios. In contrast, MoASE effectively concentrates on regions relevant to the target categories, such as dog… view at source ↗
Figure 11
Figure 11. Figure 11: The qualitative analysis of the segmentation qualitative comparison of MoASE with previous SOTA methods on the ACDC dataset. Our method could better segment different pixel-wise classes, such as shown in the white box. Segmentation visualization. For further validation, we provide additional qualitative comparisons in the Cityscapes-to-ACDC CTTA scenario, shown in the [PITH_FULL_IMAGE:figures/full_fig_p0… view at source ↗
read the original abstract

Continual test-time adaptation adapts a source-pretrained model to non-stationary, unlabeled target streams while retaining past competence, yet texture-biased backbones risk error accumulation and catastrophic forgetting. Drawing inspiration from the process of decoupling shape and texture in the human visual system, we introduce MoASE, a plug-in mixture-of-experts that disentangles domain-agnostic structure from domain-specific texture using Activation Sparsity Experts with Spatial Differentiable Dropout, forming complementary high- and low-activation pathways, while high- and low-rank bottlenecks diversify representations. The Activation Sparsity Gate produces input-adaptive SDD thresholds for precise token selection, and the Domain-Aware Router assigns per-sample expert weights using texture-sensitive cues. To curb confirmation bias on unlabeled streams and stabilize supervision, we then introduce Domain-Adaptive On-Policy Distillation to constitute MoASE++, with an EMA-anchored on-policy reverse KL distillation and an augmentation policy conditioned on entropy and confidence that aligns predictions across the same views and improves the robustness-plasticity balance. Extensive experiments on classification (CIFAR-10/100-C, ImageNet-C) and semantic segmentation (Cityscapes->ACDC) demonstrate consistent state-of-the-art performance, offering a principled, controllable approach to continual adaptation in dynamic visual environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes MoASE++, a plug-in mixture-of-experts module for continual test-time adaptation of source-pretrained vision models on non-stationary unlabeled target streams. It introduces Activation Sparsity Experts using Spatial Differentiable Dropout to create complementary high- and low-activation pathways that aim to disentangle domain-agnostic structure from domain-specific texture, augmented by high/low-rank bottlenecks, an input-adaptive Activation Sparsity Gate, and a texture-sensitive Domain-Aware Router. Domain-Adaptive On-Policy Distillation (with EMA-anchored reverse KL and entropy/confidence-conditioned augmentations) is added to mitigate confirmation bias and stabilize the robustness-plasticity trade-off. Experiments claim consistent SOTA results on CIFAR-10/100-C, ImageNet-C classification, and Cityscapes-to-ACDC semantic segmentation.

Significance. If the central mechanisms hold, the work offers a principled, controllable approach to continual adaptation that directly targets texture bias and error accumulation in dynamic visual environments. The integration of sparsity-based expert routing with on-policy distillation on unlabeled streams represents a meaningful technical contribution for unsupervised continual learning settings.

major comments (1)
  1. [Method (Activation Sparsity Gate, Domain-Aware Router, and Domain-Adaptive On-Policy Distillation subsections)] The core claim that Activation Sparsity Experts with Spatial Differentiable Dropout and the Domain-Aware Router reliably disentangle domain-agnostic structure from domain-specific texture (and thereby avoid confirmation bias) rests on indirect supervision via on-policy distillation alone. No auxiliary validation on controlled shape-texture conflict datasets or explicit metrics enforcing shape vs. texture separation is described, leaving open the possibility that the router simply amplifies currently dominant spurious cues rather than breaking the error-accumulation loop.
minor comments (2)
  1. [Abstract] The abstract asserts SOTA results but provides no details on experimental controls, exact baselines, or safeguards against post-hoc selection; these should be summarized upfront for clarity.
  2. [Method] Notation for SDD thresholds and the precise formulation of the entropy-conditioned augmentation policy should be made fully explicit with equations to aid reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and indicate the revisions we intend to incorporate.

read point-by-point responses
  1. Referee: [Method (Activation Sparsity Gate, Domain-Aware Router, and Domain-Adaptive On-Policy Distillation subsections)] The core claim that Activation Sparsity Experts with Spatial Differentiable Dropout and the Domain-Aware Router reliably disentangle domain-agnostic structure from domain-specific texture (and thereby avoid confirmation bias) rests on indirect supervision via on-policy distillation alone. No auxiliary validation on controlled shape-texture conflict datasets or explicit metrics enforcing shape vs. texture separation is described, leaving open the possibility that the router simply amplifies currently dominant spurious cues rather than breaking the error-accumulation loop.

    Authors: We appreciate the referee's observation on the need for more direct evidence of disentanglement. The Activation Sparsity Experts with Spatial Differentiable Dropout are explicitly designed to produce complementary high- and low-activation pathways, where high-activation routes are intended to capture domain-specific texture and low-activation routes to preserve domain-agnostic structure, drawing from the human visual system's shape-texture separation. The Domain-Aware Router further conditions routing on texture-sensitive cues, while the Domain-Adaptive On-Policy Distillation (EMA-anchored reverse KL with entropy- and confidence-conditioned augmentations) supplies stable supervision to mitigate confirmation bias and error accumulation on unlabeled streams. Although the current manuscript does not include auxiliary experiments on dedicated shape-texture conflict datasets or explicit separation metrics, the design choices are validated through consistent state-of-the-art results on texture-shift-heavy benchmarks (CIFAR-10/100-C, ImageNet-C, Cityscapes-to-ACDC) and component ablations that isolate the contribution of the sparsity experts and router. We agree that additional clarification would strengthen the presentation. In the revised version we will expand the method section with a dedicated paragraph on the inductive biases of Spatial Differentiable Dropout and the router, together with qualitative visualizations of expert activation patterns across domain shifts. revision: partial

Circularity Check

0 steps flagged

No circularity: novel components and experimental validation are self-contained

full rationale

The paper introduces MoASE and MoASE++ as new architectural modules (Activation Sparsity Experts with Spatial Differentiable Dropout, high/low-activation pathways, Domain-Aware Router, and Domain-Adaptive On-Policy Distillation with EMA-anchored reverse KL) to address texture bias and confirmation bias in continual test-time adaptation. These are motivated by human vision inspiration and implemented as plug-in mechanisms, with performance claims resting on independent experiments across CIFAR-10/100-C, ImageNet-C, and Cityscapes->ACDC rather than any reduction of outputs to fitted inputs, self-definitions, or self-citation chains by construction. No equations or steps in the provided description equate predictions to the method's own parameters or prior outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The approach rests on several introduced components whose effectiveness is asserted but not derived from first principles; multiple new entities and mechanisms are postulated to solve the forgetting problem.

axioms (2)
  • domain assumption Texture-biased backbones risk error accumulation and catastrophic forgetting in continual test-time adaptation.
    Stated in the opening of the abstract as the core problem motivation.
  • domain assumption The human visual system decouples shape and texture.
    Explicitly drawn as inspiration for the mixture-of-experts design.
invented entities (3)
  • Activation Sparsity Experts with Spatial Differentiable Dropout no independent evidence
    purpose: Disentangle domain-agnostic structure from domain-specific texture using high- and low-activation pathways.
    Introduced as the core plug-in module; no independent evidence provided in abstract.
  • Domain-Aware Router no independent evidence
    purpose: Assign per-sample expert weights using texture-sensitive cues.
    New component for input-adaptive routing; no external validation mentioned.
  • Domain-Adaptive On-Policy Distillation no independent evidence
    purpose: Curbs confirmation bias on unlabeled streams via EMA-anchored reverse KL and entropy-conditioned augmentation.
    Proposed to stabilize supervision; effectiveness claimed via experiments but not independently verified here.

pith-pipeline@v0.9.0 · 5801 in / 1258 out tokens · 41986 ms · 2026-05-20T12:49:19.270389+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 9 internal anchors

  1. [1]

    Decomposing the neurons: Activation sparsity via mixture of experts for continual test time adaptation,

    R. Zhang, A. Cheng, Y. Luo, G. Dai, H. Yang, J. Liu, R. Xu, L. Du, D. Wang, and Y. Du, “Decomposing the neurons: Activation sparsity via mixture of experts for continual test time adaptation,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 42, 2026, pp. 36 057–36 065

  2. [2]

    A survey of autonomous driving from a deep learning perspective,

    J. Zhao, Y. Wu, R. Deng, S. Xu, J. Gao, and A. Burke, “A survey of autonomous driving from a deep learning perspective,”ACM Computing Surveys, vol. 57, no. 10, pp. 1–60, 2025

  3. [3]

    Bevuda++: geometric-aware unsupervised domain adaptation for multi-view 3d object detection,

    R. Zhang, J. Liu, X. Li, X. Chi, D. Wang, L. Du, Y. Du, and S. Zhang, “Bevuda++: geometric-aware unsupervised domain adaptation for multi-view 3d object detection,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 5, pp. 5109–5122, 2024

  4. [4]

    Identifying unob- served road regions in bird’s-eye view for single-vehicle percep- tion,

    J. Yu, C. Lu, X. Meng, F. Shao, and Q. Jiang, “Identifying unob- served road regions in bird’s-eye view for single-vehicle percep- tion,”IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 2410– 2417, 2026

  5. [5]

    Last {0}: Latent spatio-temporal chain-of- thought for robotic vision-language-action model,

    Z. Liu, J. Liu, H. Chen, J. Yu, Z. Guo, C. Hou, C. Gu, X. Mi, R. Zhang, K. Wuet al., “Last {0}: Latent spatio-temporal chain-of- thought for robotic vision-language-action model,”arXiv preprint arXiv:2601.05248, 2026

  6. [6]

    Manipllm: Embodied multimodal large language model for object-centric robotic manipulation,

    X. Li, M. Zhang, Y. Geng, H. Geng, Y. Long, Y. Shen, R. Zhang, J. Liu, and H. Dong, “Manipllm: Embodied multimodal large language model for object-centric robotic manipulation,”arXiv preprint arXiv:2312.16217, 2023

  7. [7]

    A survey on vision–language–action models for embodied ai,

    Y. Ma, Z. Song, Y. Zhuang, J. Hao, and I. King, “A survey on vision–language–action models for embodied ai,”IEEE Transac- tions on Neural Networks and Learning Systems, 2026

  8. [8]

    Efficient deweahter mixture-of-experts with uncertainty-aware feature-wise linear modulation,

    R. Zhang, Y. Luo, J. Liu, H. Yang, Z. Dong, D. Gudovskiy, T. Okuno, Y. Nakata, K. Keutzer, Y. Duet al., “Efficient deweahter mixture-of-experts with uncertainty-aware feature-wise linear modulation,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 16 812–16 820

  9. [9]

    Occuseg: Occupancy- aware 3d instance segmentation,

    L. Han, T. Zheng, L. Xu, and L. Fang, “Occuseg: Occupancy- aware 3d instance segmentation,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2940– 2949

  10. [10]

    Occ3d: A large-scale 3d occupancy prediction bench- mark for autonomous driving,

    X. Tian, T. Jiang, L. Yun, Y. Mao, H. Yang, Y. Wang, Y. Wang, and H. Zhao, “Occ3d: A large-scale 3d occupancy prediction bench- mark for autonomous driving,”Advances in Neural Information Processing Systems, vol. 36, 2024

  11. [11]

    Continual test-time domain adaptation,

    Q. Wang, O. Fink, L. Van Gool, and D. Dai, “Continual test-time domain adaptation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7201–7211

  12. [12]

    Dual domain-attribute learning framework with asynchronous adapters for continual test-time adaptation,

    Y. Tian, K. Li, T. He, L. Wan, P .-A. Heng, and W. Feng, “Dual domain-attribute learning framework with asynchronous adapters for continual test-time adaptation,”IEEE Transactions on Image Processing, 2026

  13. [13]

    Tent: Fully Test-time Adaptation by Entropy Minimization

    D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell, “Tent: Fully test-time adaptation by entropy minimization,”arXiv preprint arXiv:2006.10726, 2020

  14. [14]

    arXiv preprint arXiv:2303.15361 , year=

    J. Liang, R. He, and T. Tan, “A comprehensive survey on test-time adaptation under distribution shifts,”arXiv preprint arXiv:2303.15361, 2023

  15. [15]

    Dota: Distributional test-time adaptation of vision- language models,

    Z. Han, J. Yang, G. Wang, J. Li, Q. Xu, M. Z. Shou, and C. Zhang, “Dota: Distributional test-time adaptation of vision- language models,”Advances in Neural Information Processing Sys- tems, vol. 38, pp. 142 772–142 798, 2026

  16. [16]

    Maintaining consistent inter-class topology in continual test-time adaptation,

    C. Ni, F. Lyu, J. Tan, F. Hu, R. Yao, and T. Zhou, “Maintaining consistent inter-class topology in continual test-time adaptation,” inProceedings of the Computer Vision and Pattern Recognition Confer- ence, 2025, pp. 15 319–15 328

  17. [17]

    Becotta: Input-dependent online blending of experts for continual test-time adaptation,

    D. Lee, J. Yoon, and S. J. Hwang, “Becotta: Input-dependent online blending of experts for continual test-time adaptation,”arXiv preprint arXiv:2402.08712, 2024

  18. [18]

    Adaptive distribution masked autoencoders for continual test-time adaptation,

    J. Liu, R. Xu, S. Yang, R. Zhang, Q. Zhang, Z. Chen, Y. Guo, and S. Zhang, “Adaptive distribution masked autoencoders for continual test-time adaptation,”arXiv preprint arXiv:2312.12480, 2023

  19. [19]

    Vida: Homeostatic visual domain adapter for continual test time adaptation,

    J. Liu, S. Yang, P . Jia, M. Lu, Y. Guo, W. Xue, and S. Zhang, “Vida: Homeostatic visual domain adapter for continual test time adaptation,”arXiv preprint arXiv:2306.04344, 2023

  20. [20]

    Retinal noise and absolute threshold,

    H. B. Barlow, “Retinal noise and absolute threshold,”Josa, vol. 46, no. 8, pp. 634–639, 1956

  21. [21]

    The absolute threshold of cone vision,

    D. Koenig and H. Hofer, “The absolute threshold of cone vision,” Journal of vision, vol. 11, no. 1, pp. 21–21, 2011

  22. [22]

    Structure of cone photoreceptors,

    D. Mustafi, A. H. Engel, and K. Palczewski, “Structure of cone photoreceptors,”Progress in retinal and eye research, vol. 28, no. 4, pp. 289–302, 2009

  23. [23]

    The primate fovea: structure, func- tion and development,

    A. Bringmann, S. Syrbe, K. G ¨orner, J. Kacza, M. Francke, P . Wiede- mann, and A. Reichenbach, “The primate fovea: structure, func- tion and development,”Progress in retinal and eye research, vol. 66, pp. 49–84, 2018

  24. [24]

    Von Helmholtz,Handbuch der physiologischen Optik

    H. Von Helmholtz,Handbuch der physiologischen Optik. Voss, 1867, vol. 9

  25. [25]

    Emergence of shape bias in convolutional neural networks through activation sparsity,

    T. Li, Z. Wen, Y. Li, and T. S. Lee, “Emergence of shape bias in convolutional neural networks through activation sparsity,” Advances in Neural Information Processing Systems, vol. 36, 2024

  26. [26]

    Multi-level personalized federated learning on heterogeneous and long-tailed data,

    R. Zhang and et. al., “Multi-level personalized federated learning on heterogeneous and long-tailed data,”IEEE TMC, 2024

  27. [27]

    Fda: Fourier domain adaptation for se- mantic segmentation,

    Y. Yang and S. Soatto, “Fda: Fourier domain adaptation for se- mantic segmentation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4085–4095

  28. [28]

    A fourier- based framework for domain generalization,

    Q. Xu, R. Zhang, Y. Zhang, Y. Wang, and Q. Tian, “A fourier- based framework for domain generalization,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 14 383–14 392

  29. [29]

    Adaptive mixtures of local experts,

    R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,”Neural computation, vol. 3, no. 1, pp. 79–87, 1991

  30. [30]

    Mixture of experts: a liter- ature survey,

    S. Masoudnia and R. Ebrahimpour, “Mixture of experts: a liter- ature survey,”Artificial Intelligence Review, vol. 42, pp. 275–293, 2014

  31. [31]

    A survey on inference optimization techniques for mixture of experts models,

    J. Liu, P . Tang, W. Wang, Y. Ren, X. Hou, P . A. Heng, M. Guo, and C. Li, “A survey on inference optimization techniques for mixture of experts models,”ACM Computing Surveys, vol. 58, no. 10, pp. 1–37, 2026

  32. [32]

    Retinal processing near absolute threshold: from behavior to mechanism,

    G. D. Field, A. P . Sampath, and F. Rieke, “Retinal processing near absolute threshold: from behavior to mechanism,”Annu. Rev. Physiol., vol. 67, pp. 491–514, 2005

  33. [33]

    Learning multiple layers of features from tiny images,

    A. Krizhevsky, G. Hintonet al., “Learning multiple layers of features from tiny images,” 2009

  34. [34]

    Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

    D. Hendrycks and T. Dietterich, “Benchmarking neural network robustness to common corruptions and perturbations,”arXiv preprint arXiv:1903.12261, 2019

  35. [35]

    The cityscapes dataset for semantic urban scene understanding,

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Be- nenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223

  36. [36]

    Acdc: The adverse con- ditions dataset with correspondences for semantic driving scene understanding,

    C. Sakaridis, D. Dai, and L. Van Gool, “Acdc: The adverse con- ditions dataset with correspondences for semantic driving scene understanding,” inProceedings of the IEEE/CVF International Con- ference on Computer Vision, 2021, pp. 10 765–10 775

  37. [37]

    Ecotta: Memory-efficient continual test-time adaptation via self-distilled regularization,

    J. Song, J. Lee, I. S. Kweon, and S. Choi, “Ecotta: Memory-efficient continual test-time adaptation via self-distilled regularization,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11 920–11 929

  38. [38]

    Decorate the newcomers: Visual domain prompt for continual test time adaptation,

    Y. Gan, Y. Bai, Y. Lou, X. Ma, R. Zhang, N. Shi, and L. Luo, “Decorate the newcomers: Visual domain prompt for continual test time adaptation,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 37, 2023, pp. 7595–7603

  39. [39]

    Exploring sparse visual prompt for cross-domain semantic seg- mentation,

    S. Yang, J. Wu, J. Liu, X. Li, Q. Zhang, M. Pan, and S. Zhang, “Exploring sparse visual prompt for cross-domain semantic seg- mentation,”arXiv preprint arXiv:2303.09792, 2023

  40. [40]

    Distribution-aware continual test time adaptation for semantic segmentation,

    J. Ni, S. Yang, J. Liu, X. Li, W. Jiao, R. Xu, Z. Chen, Y. Liu, and S. Zhang, “Distribution-aware continual test time adaptation for semantic segmentation,”arXiv preprint arXiv:2309.13604, 2023

  41. [41]

    Sparsevit: Revisiting activation sparsity for efficient high-resolution vision transformer,

    X. Chen, Z. Liu, H. Tang, L. Yi, H. Zhao, and S. Han, “Sparsevit: Revisiting activation sparsity for efficient high-resolution vision transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 2061–2070

  42. [42]

    Inducing and exploiting activation sparsity for fast inference on deep neural networks,

    M. Kurtz, J. Kopinsky, R. Gelashvili, A. Matveev, J. Carr, M. Goin, W. Leiserson, S. Moore, N. Shavit, and D. Alistarh, “Inducing and exploiting activation sparsity for fast inference on deep neural networks,” inInternational Conference on Machine Learning. PMLR, 2020, pp. 5533–5543

  43. [43]

    Dasnet: Dynamic activation sparsity for neural network efficiency improvement,

    Q. Yang, J. Mao, Z. Wang, and H. Li, “Dasnet: Dynamic activation sparsity for neural network efficiency improvement,” in2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2019, pp. 1401–1405

  44. [44]

    Sparse reram engine: Joint exploration of JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 16 activation and weight sparsity in compressed neural networks,

    T.-H. Yang, H.-Y. Cheng, C.-L. Yang, I.-C. Tseng, H.-W. Hu, H.-S. Chang, and H.-P . Li, “Sparse reram engine: Joint exploration of JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 16 activation and weight sparsity in compressed neural networks,” inProceedings of the 46th International Symposium on Computer Architecture, 2019, pp. 236–249

  45. [45]

    Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models,

    C. Song, X. Han, Z. Zhang, S. Hu, X. Shi, K. Li, C. Chen, Z. Liu, G. Li, T. Yang, and M. Sun, “Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models,” 2024

  46. [46]

    Relu strikes back: Exploiting activation sparsity in large language models,

    I. Mirzadeh, K. Alizadeh, S. Mehta, C. C. D. Mundo, O. Tuzel, G. Samei, M. Rastegari, and M. Farajtabar, “Relu strikes back: Exploiting activation sparsity in large language models,” 2023

  47. [47]

    Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models,

    C. Song, X. Han, Z. Zhang, S. Hu, X. Shi, K. Li, C. Chen, Z. Liu, G. Li, T. Yanget al., “Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models,” in Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 2626–2644

  48. [48]

    Hierarchical mixtures of experts and the EM algorithm,

    M. I. Jordan and R. A. Jacobs, “Hierarchical mixtures of experts and the EM algorithm,”Neural computation, vol. 6, no. 2, pp. 181– 214, 1994

  49. [49]

    Learning Factored Representations in a Deep Mixture of Experts

    D. Eigen, M. Ranzato, and I. Sutskever, “Learning factored repre- sentations in a deep mixture of experts,”arXiv:1312.4314, 2013

  50. [50]

    Modeling task relationships in multi-task learning with multi-gate mixture- of-experts,

    J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H. Chi, “Modeling task relationships in multi-task learning with multi-gate mixture- of-experts,” inProceedings of the ACM SIGKDD International Con- ference on Knowledge Discovery and Data Mining (KDD), 2018

  51. [51]

    Gshard: Scaling giant models with con- ditional computation and automatic sharding,

    D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with con- ditional computation and automatic sharding,” in9th International Conference on Learning Representations, ICLR 2021, 2021

  52. [52]

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” CoRR, vol. abs/2101.03961, 2021

  53. [53]

    Hash layers for large sparse models,

    S. Roller, S. Sukhbaatar, A. Szlam, and J. Weston, “Hash layers for large sparse models,”CoRR, vol. abs/2106.04426, 2021. [Online]. Available: https://arxiv.org/abs/2106.04426

  54. [54]

    Stablemoe: Stable routing strategy for mixture of experts,

    D. Dai, L. Dong, S. Ma, B. Zheng, Z. Sui, B. Chang, and F. Wei, “Stablemoe: Stable routing strategy for mixture of experts,” in Proceedings of the 60th Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P . Nakov, and A. Villavicencio, Eds. Association for Comp...

  55. [55]

    Designing effective sparse expert models,

    B. Zoph, “Designing effective sparse expert models,” inIEEE International Parallel and Distributed Processing Symposium, IPDPS Workshops 2022, Lyon, France, May 30 - June 3, 2022. IEEE, 2022, p. 1044

  56. [56]

    Flylora: Boosting task decoupling and parameter efficiency via implicit rank-wise mixture-of-experts,

    H. Zou, Y. Zang, W. Xu, Y. Zhu, and X. Ji, “Flylora: Boosting task decoupling and parameter efficiency via implicit rank-wise mixture-of-experts,”Advances in Neural Information Processing Sys- tems, vol. 38, pp. 10 386–10 419, 2026

  57. [57]

    Each rank could be an expert: Single-ranked mixture of experts lora for multi-task learning,

    Z. Zhao, Y. Zhou, X. Yu, Z. Zhang, D. Zhu, T. Shen, Z. Li, J. Yang, X. Wang, J. Suet al., “Each rank could be an expert: Single-ranked mixture of experts lora for multi-task learning,” inProceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, 2026, pp. 1998–2009

  58. [58]

    pfedmoe: Data-level personalization with mixture of experts in model-heterogeneous personalized federated learning,

    L. Yi, H. Yu, G. Wang, X. Liu, and Q. Hu, “pfedmoe: Data-level personalization with mixture of experts in model-heterogeneous personalized federated learning,”IEEE Transactions on Knowledge and Data Engineering, 2026

  59. [59]

    Moe-llava: Mixture of experts for large vision-language models,

    B. Lin, Z. Tang, Y. Ye, J. Huang, J. Zhang, Y. Pang, P . Jin, M. Ning, J. Luo, and L. Yuan, “Moe-llava: Mixture of experts for large vision-language models,”IEEE Transactions on Multimedia, 2026

  60. [60]

    Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation,

    R. Zhang, M. Dong, Y. Zhang, L. Heng, X. Chi, G. Dai, L. Du, D. Wang, Y. Du, and S. Zhang, “Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 22, 2026, pp. 18 764–18 772

  61. [61]

    On-policy distillation of language models: Learning from self-generated mistakes,

    R. Agarwal, N. Vieillard, Y. Zhou, P . Stanczyk, S. Ramos Garea, M. Geist, and O. Bachem, “On-policy distillation of language models: Learning from self-generated mistakes,” inInternational Conference on Learning Representations, vol. 2024, 2024, pp. 21 246– 21 263

  62. [62]

    Minillm: Knowledge distillation of large language models,

    Y. Gu, L. Dong, F. Wei, and M. Huang, “Minillm: Knowledge distillation of large language models,” inInternational Conference on Learning Representations, vol. 2024, 2024, pp. 32 694–32 717

  63. [63]

    Distillm: Towards streamlined distillation for large language models.ArXiv, abs/2402.03898,

    J. Ko, S. Kim, T. Chen, and S.-Y. Yun, “Distillm: Towards streamlined distillation for large language models,”arXiv preprint arXiv:2402.03898, 2024

  64. [64]

    Language-driven policy distillation for cooperative driving in multi-agent reinforcement learning,

    J. Liu, C. Xu, P . Hang, J. Sun, M. Ding, W. Zhan, and M. Tomizuka, “Language-driven policy distillation for cooperative driving in multi-agent reinforcement learning,”IEEE Robotics and Automation Letters, 2025

  65. [65]

    Distillation-ppo: A novel two-stage reinforcement learning framework for humanoid robot perceptive locomotion,

    Q. Zhang, G. Han, J. Sun, W. Zhao, C. Sun, J. Cao, J. Wang, Y. Guo, and R. Xu, “Distillation-ppo: A novel two-stage reinforcement learning framework for humanoid robot perceptive locomotion,” in2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 2916–2922

  66. [66]

    Learning deep features for discriminative localization,

    B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” inPro- ceedings of the IEEE conference on computer vision and pattern recogni- tion, 2016, pp. 2921–2929

  67. [67]

    SAM 3: Segment Anything with Concepts

    N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huanget al., “Sam 3: Segment anything with concepts,”arXiv preprint arXiv:2511.16719, 2025

  68. [68]

    Mean teachers are better role mod- els: Weight-averaged consistency targets improve semi-supervised deep learning results,

    A. Tarvainen and H. Valpola, “Mean teachers are better role mod- els: Weight-averaged consistency targets improve semi-supervised deep learning results,”Advances in neural information processing systems, vol. 30, 2017

  69. [69]

    Robust mean teacher for continual and gradual test-time adaptation,

    M. D ¨obler, R. A. Marsden, and B. Yang, “Robust mean teacher for continual and gradual test-time adaptation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7704–7714

  70. [70]

    Decorate the newcomers: Visual domain prompt for continual test time adaptation,

    Y. Gan, X. Ma, Y. Lou, Y. Bai, R. Zhang, N. Shi, and L. Luo, “Decorate the newcomers: Visual domain prompt for continual test time adaptation,”arXiv preprint arXiv:2212.04145, 2022

  71. [71]

    Analysis of representations for domain adaptation,

    S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of representations for domain adaptation,”Advances in neural information processing systems, vol. 19, 2006

  72. [72]

    A theory of learning from different domains,

    S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Machine learning, vol. 79, no. 1, pp. 151–175, 2010

  73. [73]

    Learning to select data for transfer learning with Bayesian Optimization

    S. Ruder and B. Plank, “Learning to select data for transfer learn- ing with bayesian optimization,”arXiv preprint arXiv:1707.05246, 2017

  74. [74]

    Adversarial learning for zero-shot stance detection on social media,

    E. Allaway, M. Srikanth, and K. McKeown, “Adversarial learning for zero-shot stance detection on social media,”arXiv preprint arXiv:2105.06603, 2021

  75. [75]

    Classification and analysis of multivariate observa- tions,

    J. MacQueen, “Classification and analysis of multivariate observa- tions,” in5th Berkeley Symp. Math. Statist. Probability. University of California Los Angeles LA USA, 1967, pp. 281–297

  76. [76]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020

  77. [77]

    Segformer: Simple and efficient design for semantic segmenta- tion with transformers,

    E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P . Luo, “Segformer: Simple and efficient design for semantic segmenta- tion with transformers,”Advances in Neural Information Processing Systems, vol. 34, pp. 12 077–12 090, 2021

  78. [78]

    Adam: A Method for Stochastic Optimization

    D. P . Kingma and J. Ba, “Adam: A method for stochastic optimiza- tion,”arXiv preprint arXiv:1412.6980, 2014

  79. [79]

    Adaptformer: Adapting vision transformers for scalable visual recognition,

    S. Chen, C. Ge, Z. Tong, J. Wang, Y. Song, J. Wang, and P . Luo, “Adaptformer: Adapting vision transformers for scalable visual recognition,”Advances in Neural Information Processing Systems, vol. 35, pp. 16 664–16 678, 2022

  80. [80]

    Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,

    K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” inProceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034

Showing first 80 references.