MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time Adaptation

Aosong Cheng; Dan Wang; Gaole Dai; Huanrui Yang; Jiaming Liu; Leyuan Fang; Li Du; Ronyu Zhang; Shanghang Zhang; Yuan Du

arxiv: 2605.17743 · v1 · pith:DLADCJDUnew · submitted 2026-05-18 · 💻 cs.CV

Ronyu Zhang, Aosong Cheng, Gaole Dai, Yulin Luo, Jiaming Liu, Li Du, Huanrui Yang, Dan Wang, Leyuan Fang, Yuan Du, Shanghang Zhang

Pith reviewed 2026-05-20 12:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords continual test-time adaptation · mixture of experts · activation sparsity · on-policy distillation · domain adaptation · image classification · semantic segmentation

The pith

Mixture of activation sparsity experts with on-policy distillation allows models to adapt continually to new domains without forgetting past knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to address continual test-time adaptation for pre-trained models facing non-stationary unlabeled target streams while retaining competence on prior domains. It proposes a plug-in mixture-of-experts module that disentangles domain-agnostic structure from domain-specific texture by forming complementary high- and low-activation pathways through Spatial Differentiable Dropout and high/low-rank bottlenecks. An Activation Sparsity Gate generates input-adaptive thresholds for token selection, and a Domain-Aware Router assigns expert weights based on texture-sensitive cues. Domain-Adaptive On-Policy Distillation then stabilizes learning on unlabeled data via EMA-anchored reverse KL distillation and an augmentation policy driven by entropy and confidence. Experiments on corrupted image classification and cross-domain segmentation report consistent gains over prior methods.
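The summary says the augmentation policy is driven by entropy and confidence. As a minimal sketch of how those two signals could be computed from a model's logits, the snippet below uses Shannon entropy of the softmax output and the maximum class probability. The function names are hypothetical, and how the paper's policy actually consumes these signals is not specified on this page.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_and_confidence(logits):
    """Return (Shannon entropy in nats, max softmax probability)."""
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    confidence = max(probs)
    return entropy, confidence

def normalized_entropy(logits):
    """Entropy divided by log(C), so the value lies in [0, 1]."""
    entropy, _ = entropy_and_confidence(logits)
    return entropy / math.log(len(logits))
```

A uniform prediction gives normalized entropy 1 and confidence 1/C, while a sharply peaked prediction gives entropy near 0 and confidence near 1, so the two signals move in opposite directions as the model becomes more certain.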

Core claim

The authors establish that Activation Sparsity Experts with Spatial Differentiable Dropout and high/low-rank bottlenecks, guided by an Activation Sparsity Gate and Domain-Aware Router and combined with EMA-anchored reverse KL distillation and an entropy-conditioned augmentation policy, disentangle structure from texture to support robust continual adaptation.

What carries the argument

Mixture of Activation Sparsity Experts (MoASE) using Spatial Differentiable Dropout for pathway formation and Domain-Aware Router for assignment, augmented by Domain-Adaptive On-Policy Distillation for supervision.
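The Figure 4 excerpt further down describes a binary mask over tokens with K_low = ⌊N · 1/2⌋, broadcast over the feature dimension and applied to the features before routing (Eq. 6). The sketch below illustrates that step in plain Python under stated assumptions: the token score (mean absolute activation) is a placeholder because the paper's score definition is truncated on this page, and `route` is a hypothetical stand-in for the Domain-Aware Router (mean pooling, a linear map, and a softmax), not the authors' implementation.

```python
import math

def token_scores(F):
    """ASSUMPTION: score each token by mean absolute activation over D."""
    return [[sum(abs(v) for v in tok) / len(tok) for tok in sample] for sample in F]

def low_activation_mask(scores, ratio=0.5):
    """Binary mask in {0,1}^(B x N) selecting the K_low = floor(N * ratio) lowest-scoring tokens."""
    masks = []
    for row in scores:
        k = math.floor(len(row) * ratio)
        chosen = set(sorted(range(len(row)), key=lambda n: row[n])[:k])
        masks.append([1 if n in chosen else 0 for n in range(len(row))])
    return masks

def apply_mask(F, mask):
    """F_low = M_low (broadcast over D) * F, elementwise."""
    return [[[m * v for v in tok] for tok, m in zip(sample, ms)]
            for sample, ms in zip(F, mask)]

def route(F_low, W):
    """HYPOTHETICAL router: mean-pool tokens, map to E logits with W (E x D), softmax."""
    weights = []
    for sample in F_low:
        n, d = len(sample), len(sample[0])
        pooled = [sum(tok[j] for tok in sample) / n for j in range(d)]
        logits = [sum(w * x for w, x in zip(row, pooled)) for row in W]
        mx = max(logits)
        exps = [math.exp(z - mx) for z in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With B = 1, N = 4 and D = 2, the mask keeps the two tokens with the smallest scores and zeroes the rest, and `route` returns one weight per expert for each sample.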

Load-bearing premise

The assumption that texture-biased backbones cause error accumulation and catastrophic forgetting, and that the proposed sparsity experts and router can reliably separate domain-agnostic structure from domain-specific texture using only unlabeled streams.

What would settle it

A sequence of domain shifts where accuracy on the original source domain drops sharply after adaptation or where the full method shows no gain over a plain test-time adaptation baseline.

Figures

Figures reproduced from arXiv: 2605.17743 by Aosong Cheng, Dan Wang, Gaole Dai, Huanrui Yang, Jiaming Liu, Leyuan Fang, Li Du, Ronyu Zhang, Shanghang Zhang, Yuan Du, Yulin Luo.

**Figure 1.** The problem and motivation. Our goal is to adapt pre-trained models to evolving target domains.

**Figure 2.** The visualization analysis of the Class Activation Map (CAM). We adopt CAM to compare the attention of the low-activation MoASE, the high-activation MoASE, and the original model during the continual adaptation process. The number indicates the mean IoU of the Elastic-transform, Gaussian, Fog, and Snow results based on randomly selected samples from ImageNet-C with the Foreground and Background of the main object seg…

**Figure 3.** The framework of MoASE++. A policy-driven augmenter samples strength $a$ to create $V$ on-policy views, while the student aligns to a detached EMA teacher via reverse KL on the same views plus a clean-view consistency term. The policy is updated with a reward, enabling a tunable stability–plasticity trade-off.

… dimension. For a sample index $b \in \{1, \dots, B\}$ and a token index $n \in \{1, \dots, N\}$, the score is d…

**Figure 4.** The detailed structure of MoASE. We integrate the MoASE into the linear layer of the student model. $id_h$ and $id_l$ denote the expert IDs for domain-agnostic experts and domain-specific experts.

Let $M_{\text{low}} \in \{0, 1\}^{B \times N}$ with $K_{\text{low}} = \lfloor N \cdot \tfrac{1}{2} \rfloor$, and broadcast it to $M^{(3D)}_{\text{low}} \in \{0, 1\}^{B \times N \times D}$. The low-activation sub-features and routing output are

$$F_{\text{low}} = M^{(3D)}_{\text{low}} \odot F, \qquad \phi = g_{\text{DAR}}(F_{\text{low}}), \tag{6}$$

where $\phi \in \mathbb{R}^{B \times E}$ satisfies …

**Figure 5.** Inter-domain and intra-class distance in CIFAR10-C. (a) MoASE++ and MoASE more effectively reduce inter-domain divergence than the source model. (b) MoASE++ and MoASE significantly improve intra-class feature aggregation, producing results that closely align with those of our proposed method.

We maintain an EMA baseline $b$ of rewards. The policy objective combines advantage-weighted log-likelihood with entr…

**Figure 6.** 5 rounds segmentation CTTA on Cityscape-to-ACDC. We sequentially repeat the same sequence of target domains 5 times with a different number of experts.

… performance thereafter. First of all, we define CoTTA∗ as CoTTA with lr=2e-4. We observe that adjusting CoTTA to our setting improves early results but leads to a later drop in segmentation performance and catastrophic forgetting. In addition, MoASE achieve…

**Figure 7.** Analysis of computational costs for our proposed MoASE++ and previous SOTA baselines, including a detailed comparison in terms of (a) FLOPs, (b) GPU memory, and (c) the number of parameters.

This overhead mainly stems from multi-expert routing and the DA-OPD alignment. As future work, we envision an edge–cloud co-inference scheme that keeps lightweight inference on the edge while scheduling heavier full-e…

**Figure 8.** Ablation study to examine the influence of the hyperparameters for MoASE.

TABLE 6: Different number of experts in MoASE. mIoU score for Cityscape-to-ACDC within the first round.

| Num. experts (E) | Fog | Night | Rain | Snow | Mean↑ |
|---|---|---|---|---|---|
| E = 2 | 71.6 | 44.0 | 66.5 | 63.7 | 61.5 |
| E = 4 | 72.4 | 44.5 | 66.4 | 63.8 | 61.8 |
| E = 8 | 71.4 | 44.0 | 65.0 | 61.7 | 60.5 |
| E = 16 | 71.5 | 44.1 | 65.9 | 63.3 | 61.2 |

TABLE 7: Balance of DA and DS experts. mIoU score for Cityscape-to-ACDC …

**Figure 9.** The qualitative analysis of the original and CAM visualization for the source model, TCA, MoASE, and MoASE++ based on ImageNet-C with 15 domains.

TABLE 9: Ablation study for the hyperparameters of DA-OPD on CIFAR100-C. Groups follow the control-variable method: in each block, only one factor changes. Blue marks our final selection for MoASE++. Columns: Methods, EMA α, OPD-views, OPD-temp, OPD-weight, Mean↑. Alpha swe…

**Figure 10.** The visualization analysis of the routing strategy for MoASE and MoASE++. $E^1_s$ and $E^3_a$ stand for the 1st domain-specific and the 3rd domain-agnostic experts.

…tion is dispersed due to continuous domain shifts, while TCA [16] also fails to detect the complete object in continual domain-adaptive scenarios. In contrast, MoASE effectively concentrates on regions relevant to the target categories, such as dog…

**Figure 11.** Qualitative segmentation comparison of MoASE with previous SOTA methods on the ACDC dataset. Our method better segments individual pixel-wise classes, as shown in the white box.

Segmentation visualization. For further validation, we provide additional qualitative comparisons in the Cityscapes-to-ACDC CTTA scenario, shown in the …
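The Figure 5 excerpt mentions an EMA baseline b of rewards and a policy objective that combines advantage-weighted log-likelihood with a term that is cut off on this page. The sketch below shows one standard way such an update can look, under loud assumptions: the policy is a categorical distribution over a discrete set of augmentation strengths, the truncated term is treated as an entropy bonus, and the step is plain gradient ascent on the policy logits. All names and default values are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_step(logits, action, reward, baseline, lr=0.1, beta=0.01, momentum=0.9):
    """One ascent step on J = (reward - baseline) * log pi(action) + beta * H(pi).

    ASSUMPTIONS: categorical policy over discrete strength levels; the
    entropy bonus stands in for the truncated term in the excerpt.
    """
    probs = softmax(logits)
    advantage = reward - baseline
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    grad = []
    for i, p in enumerate(probs):
        # d log pi(action) / d logit_i = 1[i == action] - pi_i
        g_logp = (1.0 if i == action else 0.0) - p
        # d H / d logit_i = -pi_i * (log pi_i + H)
        g_ent = -p * (math.log(p) + entropy) if p > 0.0 else 0.0
        grad.append(advantage * g_logp + beta * g_ent)
    new_logits = [z + lr * g for z, g in zip(logits, grad)]
    # EMA baseline of rewards, updated after the advantage is computed.
    new_baseline = momentum * baseline + (1.0 - momentum) * reward
    return new_logits, new_baseline
```

Subtracting the baseline leaves the expected gradient unchanged while reducing its variance, which is the usual motivation for keeping a running reward average in a policy-gradient update.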

Original abstract

Continual test-time adaptation adapts a source-pretrained model to non-stationary, unlabeled target streams while retaining past competence, yet texture-biased backbones risk error accumulation and catastrophic forgetting. Drawing inspiration from the process of decoupling shape and texture in the human visual system, we introduce MoASE, a plug-in mixture-of-experts that disentangles domain-agnostic structure from domain-specific texture using Activation Sparsity Experts with Spatial Differentiable Dropout, forming complementary high- and low-activation pathways, while high- and low-rank bottlenecks diversify representations. The Activation Sparsity Gate produces input-adaptive SDD thresholds for precise token selection, and the Domain-Aware Router assigns per-sample expert weights using texture-sensitive cues. To curb confirmation bias on unlabeled streams and stabilize supervision, we then introduce Domain-Adaptive On-Policy Distillation to constitute MoASE++, with an EMA-anchored on-policy reverse KL distillation and an augmentation policy conditioned on entropy and confidence that aligns predictions across the same views and improves the robustness-plasticity balance. Extensive experiments on classification (CIFAR-10/100-C, ImageNet-C) and semantic segmentation (Cityscapes->ACDC) demonstrate consistent state-of-the-art performance, offering a principled, controllable approach to continual adaptation in dynamic visual environments.
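The abstract describes an EMA-anchored reverse KL distillation. As a rough illustration, the sketch below computes the reverse KL divergence KL(student ‖ teacher) between two softmax outputs and applies an exponential-moving-average update to teacher parameters. It is plain Python without autograd, so the teacher being "detached" only appears as a comment; the temperature, the value of alpha, and the flat parameter lists are assumptions rather than the paper's settings.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_logits, temperature=1.0):
    """KL(student || teacher): the expectation is taken under the student.

    In a real training loop the teacher output would be detached so that
    gradients flow only through the student.
    """
    p_s = softmax(student_logits, temperature)
    p_t = softmax(teacher_logits, temperature)
    eps = 1e-12  # guards against log(0)
    return sum(s * (math.log(s + eps) - math.log(t + eps)) for s, t in zip(p_s, p_t))

def ema_update(teacher_params, student_params, alpha=0.999):
    """teacher <- alpha * teacher + (1 - alpha) * student, elementwise."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher_params, student_params)]
```

Reverse KL weights the log-ratio by the student's own probabilities, so it penalizes the student for placing mass where the teacher places little. This mode-seeking behavior is the usual reason given for preferring it over forward KL in on-policy distillation.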

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MoASE++ combines activation sparsity experts with on-policy distillation for continual TTA, but the disentangling claim rests on indirect self-supervision that may not fully break texture bias loops.

The paper introduces MoASE as a plug-in mixture-of-experts module that uses Activation Sparsity Experts with Spatial Differentiable Dropout to create high- and low-activation pathways, plus a Domain-Aware Router that keys off texture cues. MoASE++ then adds Domain-Adaptive On-Policy Distillation with EMA-anchored reverse KL and entropy-conditioned augmentations to stabilize adaptation on unlabeled streams. This specific stack for continual test-time adaptation is the main novelty; it extends prior MoE and distillation work rather than inventing the underlying ideas from scratch. The experiments report consistent gains on CIFAR-10/100-C, ImageNet-C, and the Cityscapes-to-ACDC segmentation shift, which is useful to see on standard benchmarks even if the controls are not fully detailed in the abstract. The method is presented as a controllable way to balance robustness and plasticity, and the on-policy distillation step is a reasonable attempt to limit confirmation bias.

The soft spot is the central mechanism. The claim that the sparsity experts and router reliably separate structure from texture depends on input-adaptive thresholds and texture-sensitive routing, yet every signal is generated from the model's own predictions on the target stream. Without auxiliary validation on shape-texture conflict data, the router could simply amplify whichever cues dominate at each step, recreating the error accumulation the paper aims to avoid. The stress-test note correctly flags this as the least-secured link. The assumption that texture-biased backbones drive forgetting is standard in the literature, but the paper's solution does not include direct tests that would confirm the pathways achieve the intended separation.

This paper is aimed at researchers working on test-time adaptation and continual learning in computer vision. Readers who need practical plug-ins for non-stationary visual streams will find the architecture and benchmark results worth examining. It deserves peer review because the method is concrete, the benchmarks are relevant, and the distillation component addresses a real issue, even though additional ablations on the disentangling step would be needed to make the claims tighter.

Referee Report

1 major / 2 minor

Summary. The paper proposes MoASE++, a plug-in mixture-of-experts module for continual test-time adaptation of source-pretrained vision models on non-stationary unlabeled target streams. It introduces Activation Sparsity Experts using Spatial Differentiable Dropout to create complementary high- and low-activation pathways that aim to disentangle domain-agnostic structure from domain-specific texture, augmented by high/low-rank bottlenecks, an input-adaptive Activation Sparsity Gate, and a texture-sensitive Domain-Aware Router. Domain-Adaptive On-Policy Distillation (with EMA-anchored reverse KL and entropy/confidence-conditioned augmentations) is added to mitigate confirmation bias and stabilize the robustness-plasticity trade-off. Experiments claim consistent SOTA results on CIFAR-10/100-C, ImageNet-C classification, and Cityscapes-to-ACDC semantic segmentation.

Significance. If the central mechanisms hold, the work offers a principled, controllable approach to continual adaptation that directly targets texture bias and error accumulation in dynamic visual environments. The integration of sparsity-based expert routing with on-policy distillation on unlabeled streams represents a meaningful technical contribution for unsupervised continual learning settings.

major comments (1)

[Method (Activation Sparsity Gate, Domain-Aware Router, and Domain-Adaptive On-Policy Distillation subsections)] The core claim that Activation Sparsity Experts with Spatial Differentiable Dropout and the Domain-Aware Router reliably disentangle domain-agnostic structure from domain-specific texture (and thereby avoid confirmation bias) rests on indirect supervision via on-policy distillation alone. No auxiliary validation on controlled shape-texture conflict datasets or explicit metrics enforcing shape vs. texture separation is described, leaving open the possibility that the router simply amplifies currently dominant spurious cues rather than breaking the error-accumulation loop.

minor comments (2)

[Abstract] The abstract asserts SOTA results but provides no details on experimental controls, exact baselines, or safeguards against post-hoc selection; these should be summarized upfront for clarity.
[Method] Notation for SDD thresholds and the precise formulation of the entropy-conditioned augmentation policy should be made fully explicit with equations to aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and indicate the revisions we intend to incorporate.

Referee: [Method (Activation Sparsity Gate, Domain-Aware Router, and Domain-Adaptive On-Policy Distillation subsections)] The core claim that Activation Sparsity Experts with Spatial Differentiable Dropout and the Domain-Aware Router reliably disentangle domain-agnostic structure from domain-specific texture (and thereby avoid confirmation bias) rests on indirect supervision via on-policy distillation alone. No auxiliary validation on controlled shape-texture conflict datasets or explicit metrics enforcing shape vs. texture separation is described, leaving open the possibility that the router simply amplifies currently dominant spurious cues rather than breaking the error-accumulation loop.

Authors: We appreciate the referee's observation on the need for more direct evidence of disentanglement. The Activation Sparsity Experts with Spatial Differentiable Dropout are explicitly designed to produce complementary high- and low-activation pathways, where high-activation routes are intended to capture domain-specific texture and low-activation routes to preserve domain-agnostic structure, drawing from the human visual system's shape-texture separation. The Domain-Aware Router further conditions routing on texture-sensitive cues, while the Domain-Adaptive On-Policy Distillation (EMA-anchored reverse KL with entropy- and confidence-conditioned augmentations) supplies stable supervision to mitigate confirmation bias and error accumulation on unlabeled streams. Although the current manuscript does not include auxiliary experiments on dedicated shape-texture conflict datasets or explicit separation metrics, the design choices are validated through consistent state-of-the-art results on texture-shift-heavy benchmarks (CIFAR-10/100-C, ImageNet-C, Cityscapes-to-ACDC) and component ablations that isolate the contribution of the sparsity experts and router. We agree that additional clarification would strengthen the presentation. In the revised version we will expand the method section with a dedicated paragraph on the inductive biases of Spatial Differentiable Dropout and the router, together with qualitative visualizations of expert activation patterns across domain shifts.

revision: partial

Circularity Check

0 steps flagged

No circularity: novel components and experimental validation are self-contained

The paper introduces MoASE and MoASE++ as new architectural modules (Activation Sparsity Experts with Spatial Differentiable Dropout, high/low-activation pathways, Domain-Aware Router, and Domain-Adaptive On-Policy Distillation with EMA-anchored reverse KL) to address texture bias and confirmation bias in continual test-time adaptation. These are motivated by human vision inspiration and implemented as plug-in mechanisms, with performance claims resting on independent experiments across CIFAR-10/100-C, ImageNet-C, and Cityscapes->ACDC rather than any reduction of outputs to fitted inputs, self-definitions, or self-citation chains by construction. No equations or steps in the provided description equate predictions to the method's own parameters or prior outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The approach rests on several introduced components whose effectiveness is asserted but not derived from first principles; multiple new entities and mechanisms are postulated to solve the forgetting problem.

axioms (2)

domain assumption: Texture-biased backbones risk error accumulation and catastrophic forgetting in continual test-time adaptation.
Stated in the opening of the abstract as the core problem motivation.
domain assumption: The human visual system decouples shape and texture.
Explicitly drawn as inspiration for the mixture-of-experts design.

invented entities (3)

Activation Sparsity Experts with Spatial Differentiable Dropout (no independent evidence)
purpose: Disentangle domain-agnostic structure from domain-specific texture using high- and low-activation pathways.
Introduced as the core plug-in module; no independent evidence provided in abstract.
Domain-Aware Router (no independent evidence)
purpose: Assign per-sample expert weights using texture-sensitive cues.
New component for input-adaptive routing; no external validation mentioned.
Domain-Adaptive On-Policy Distillation (no independent evidence)
purpose: Curb confirmation bias on unlabeled streams via EMA-anchored reverse KL and entropy-conditioned augmentation.
Proposed to stabilize supervision; effectiveness claimed via experiments but not independently verified here.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J-cost uniqueness) · unclear

Paper passage: we propose an intriguing hypothesis: Can we explicitly decompose neurons by activation degree to encode shapes and textures separately... using Activation Sparsity Experts with Spatial Differentiable Dropout, forming complementary high- and low-activation pathways

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: Extensive experiments on classification (CIFAR-10/100-C, ImageNet-C) and semantic segmentation (Cityscapes->ACDC)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 9 internal anchors

[1] R. Zhang, A. Cheng, Y. Luo, G. Dai, H. Yang, J. Liu, R. Xu, L. Du, D. Wang, and Y. Du, “Decomposing the neurons: Activation sparsity via mixture of experts for continual test time adaptation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 42, 2026, pp. 36 057–36 065.
[2] J. Zhao, Y. Wu, R. Deng, S. Xu, J. Gao, and A. Burke, “A survey of autonomous driving from a deep learning perspective,” ACM Computing Surveys, vol. 57, no. 10, pp. 1–60, 2025.
[3] R. Zhang, J. Liu, X. Li, X. Chi, D. Wang, L. Du, Y. Du, and S. Zhang, “Bevuda++: geometric-aware unsupervised domain adaptation for multi-view 3d object detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 5, pp. 5109–5122, 2024.
[4] J. Yu, C. Lu, X. Meng, F. Shao, and Q. Jiang, “Identifying unobserved road regions in bird’s-eye view for single-vehicle perception,” IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 2410–2417, 2026.
[5] Z. Liu, J. Liu, H. Chen, J. Yu, Z. Guo, C. Hou, C. Gu, X. Mi, R. Zhang, K. Wu et al., “Last {0}: Latent spatio-temporal chain-of-thought for robotic vision-language-action model,” arXiv preprint arXiv:2601.05248, 2026.
[6] X. Li, M. Zhang, Y. Geng, H. Geng, Y. Long, Y. Shen, R. Zhang, J. Liu, and H. Dong, “Manipllm: Embodied multimodal large language model for object-centric robotic manipulation,” arXiv preprint arXiv:2312.16217, 2023.
[7] Y. Ma, Z. Song, Y. Zhuang, J. Hao, and I. King, “A survey on vision–language–action models for embodied ai,” IEEE Transactions on Neural Networks and Learning Systems, 2026.
[8] R. Zhang, Y. Luo, J. Liu, H. Yang, Z. Dong, D. Gudovskiy, T. Okuno, Y. Nakata, K. Keutzer, Y. Du et al., “Efficient deweather mixture-of-experts with uncertainty-aware feature-wise linear modulation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 16 812–16 820.
[9] L. Han, T. Zheng, L. Xu, and L. Fang, “Occuseg: Occupancy-aware 3d instance segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2940–2949.
[10] X. Tian, T. Jiang, L. Yun, Y. Mao, H. Yang, Y. Wang, Y. Wang, and H. Zhao, “Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[11] Q. Wang, O. Fink, L. Van Gool, and D. Dai, “Continual test-time domain adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7201–7211.
[12] Y. Tian, K. Li, T. He, L. Wan, P.-A. Heng, and W. Feng, “Dual domain-attribute learning framework with asynchronous adapters for continual test-time adaptation,” IEEE Transactions on Image Processing, 2026.
[13] D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell, “Tent: Fully test-time adaptation by entropy minimization,” arXiv preprint arXiv:2006.10726, 2020.
[14] J. Liang, R. He, and T. Tan, “A comprehensive survey on test-time adaptation under distribution shifts,” arXiv preprint arXiv:2303.15361, 2023.
[15] Z. Han, J. Yang, G. Wang, J. Li, Q. Xu, M. Z. Shou, and C. Zhang, “Dota: Distributional test-time adaptation of vision-language models,” Advances in Neural Information Processing Systems, vol. 38, pp. 142 772–142 798, 2026.
[16] C. Ni, F. Lyu, J. Tan, F. Hu, R. Yao, and T. Zhou, “Maintaining consistent inter-class topology in continual test-time adaptation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 15 319–15 328.
[17] D. Lee, J. Yoon, and S. J. Hwang, “Becotta: Input-dependent online blending of experts for continual test-time adaptation,” arXiv preprint arXiv:2402.08712, 2024.
[18] J. Liu, R. Xu, S. Yang, R. Zhang, Q. Zhang, Z. Chen, Y. Guo, and S. Zhang, “Adaptive distribution masked autoencoders for continual test-time adaptation,” arXiv preprint arXiv:2312.12480, 2023.
[19] J. Liu, S. Yang, P. Jia, M. Lu, Y. Guo, W. Xue, and S. Zhang, “Vida: Homeostatic visual domain adapter for continual test time adaptation,” arXiv preprint arXiv:2306.04344, 2023.
[20] H. B. Barlow, “Retinal noise and absolute threshold,” Josa, vol. 46, no. 8, pp. 634–639, 1956.
[21] D. Koenig and H. Hofer, “The absolute threshold of cone vision,” Journal of vision, vol. 11, no. 1, pp. 21–21, 2011.
[22] D. Mustafi, A. H. Engel, and K. Palczewski, “Structure of cone photoreceptors,” Progress in retinal and eye research, vol. 28, no. 4, pp. 289–302, 2009.
[23] A. Bringmann, S. Syrbe, K. Görner, J. Kacza, M. Francke, P. Wiedemann, and A. Reichenbach, “The primate fovea: structure, function and development,” Progress in retinal and eye research, vol. 66, pp. 49–84, 2018.
[24] H. Von Helmholtz, Handbuch der physiologischen Optik. Voss, 1867, vol. 9.
[25] T. Li, Z. Wen, Y. Li, and T. S. Lee, “Emergence of shape bias in convolutional neural networks through activation sparsity,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[26] R. Zhang et al., “Multi-level personalized federated learning on heterogeneous and long-tailed data,” IEEE TMC, 2024.
[27] Y. Yang and S. Soatto, “Fda: Fourier domain adaptation for semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4085–4095.
[28] Q. Xu, R. Zhang, Y. Zhang, Y. Wang, and Q. Tian, “A fourier-based framework for domain generalization,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 14 383–14 392.
[29] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural computation, vol. 3, no. 1, pp. 79–87, 1991.
[30] S. Masoudnia and R. Ebrahimpour, “Mixture of experts: a literature survey,” Artificial Intelligence Review, vol. 42, pp. 275–293, 2014.
[31] J. Liu, P. Tang, W. Wang, Y. Ren, X. Hou, P. A. Heng, M. Guo, and C. Li, “A survey on inference optimization techniques for mixture of experts models,” ACM Computing Surveys, vol. 58, no. 10, pp. 1–37, 2026.
[32] G. D. Field, A. P. Sampath, and F. Rieke, “Retinal processing near absolute threshold: from behavior to mechanism,” Annu. Rev. Physiol., vol. 67, pp. 491–514, 2005.
[33] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
[34] D. Hendrycks and T. Dietterich, “Benchmarking neural network robustness to common corruptions and perturbations,” arXiv preprint arXiv:1903.12261, 2019.
[35] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
[36] C. Sakaridis, D. Dai, and L. Van Gool, “Acdc: The adverse conditions dataset with correspondences for semantic driving scene understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 765–10 775.
[37] J. Song, J. Lee, I. S. Kweon, and S. Choi, “Ecotta: Memory-efficient continual test-time adaptation via self-distilled regularization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11 920–11 929.
[38] Y. Gan, Y. Bai, Y. Lou, X. Ma, R. Zhang, N. Shi, and L. Luo, “Decorate the newcomers: Visual domain prompt for continual test time adaptation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, 2023, pp. 7595–7603.
[39] S. Yang, J. Wu, J. Liu, X. Li, Q. Zhang, M. Pan, and S. Zhang, “Exploring sparse visual prompt for cross-domain semantic segmentation,” arXiv preprint arXiv:2303.09792, 2023.
[40] J. Ni, S. Yang, J. Liu, X. Li, W. Jiao, R. Xu, Z. Chen, Y. Liu, and S. Zhang, “Distribution-aware continual test time adaptation for semantic segmentation,” arXiv preprint arXiv:2309.13604, 2023.
[41]

Sparsevit: Revisiting activation sparsity for efficient high-resolution vision transformer,

X. Chen, Z. Liu, H. Tang, L. Yi, H. Zhao, and S. Han, “Sparsevit: Revisiting activation sparsity for efficient high-resolution vision transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 2061–2070

work page 2023
[42]

Inducing and exploiting activation sparsity for fast inference on deep neural networks,

M. Kurtz, J. Kopinsky, R. Gelashvili, A. Matveev, J. Carr, M. Goin, W. Leiserson, S. Moore, N. Shavit, and D. Alistarh, “Inducing and exploiting activation sparsity for fast inference on deep neural networks,” inInternational Conference on Machine Learning. PMLR, 2020, pp. 5533–5543

work page 2020
[43]

Dasnet: Dynamic activation sparsity for neural network efficiency improvement,

Q. Yang, J. Mao, Z. Wang, and H. Li, “Dasnet: Dynamic activation sparsity for neural network efficiency improvement,” in2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2019, pp. 1401–1405

work page 2019

[44] T.-H. Yang, H.-Y. Cheng, C.-L. Yang, I.-C. Tseng, H.-W. Hu, H.-S. Chang, and H.-P. Li, “Sparse reram engine: Joint exploration of activation and weight sparsity in compressed neural networks,” in Proceedings of the 46th International Symposium on Computer Architecture, 2019, pp. 236–249

[45] C. Song, X. Han, Z. Zhang, S. Hu, X. Shi, K. Li, C. Chen, Z. Liu, G. Li, T. Yang, and M. Sun, “Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models,” 2024

[46] I. Mirzadeh, K. Alizadeh, S. Mehta, C. C. D. Mundo, O. Tuzel, G. Samei, M. Rastegari, and M. Farajtabar, “Relu strikes back: Exploiting activation sparsity in large language models,” 2023

[47] C. Song, X. Han, Z. Zhang, S. Hu, X. Shi, K. Li, C. Chen, Z. Liu, G. Li, T. Yang et al., “Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models,” in Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 2626–2644

[48] M. I. Jordan and R. A. Jacobs, “Hierarchical mixtures of experts and the EM algorithm,” Neural computation, vol. 6, no. 2, pp. 181–214, 1994

[49] D. Eigen, M. Ranzato, and I. Sutskever, “Learning factored representations in a deep mixture of experts,” arXiv:1312.4314, 2013

[50] J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H. Chi, “Modeling task relationships in multi-task learning with multi-gate mixture-of-experts,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2018

[51] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with conditional computation and automatic sharding,” in 9th International Conference on Learning Representations, ICLR 2021, 2021

[52] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” CoRR, vol. abs/2101.03961, 2021

[53] S. Roller, S. Sukhbaatar, A. Szlam, and J. Weston, “Hash layers for large sparse models,” CoRR, vol. abs/2106.04426, 2021. [Online]. Available: https://arxiv.org/abs/2106.04426

[54] D. Dai, L. Dong, S. Ma, B. Zheng, Z. Sui, B. Chang, and F. Wei, “Stablemoe: Stable routing strategy for mixture of experts,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds. Association for Comp...

[55] B. Zoph, “Designing effective sparse expert models,” in IEEE International Parallel and Distributed Processing Symposium, IPDPS Workshops 2022, Lyon, France, May 30 - June 3, 2022. IEEE, 2022, p. 1044

[56] H. Zou, Y. Zang, W. Xu, Y. Zhu, and X. Ji, “Flylora: Boosting task decoupling and parameter efficiency via implicit rank-wise mixture-of-experts,” Advances in Neural Information Processing Systems, vol. 38, pp. 10 386–10 419, 2026

[57] Z. Zhao, Y. Zhou, X. Yu, Z. Zhang, D. Zhu, T. Shen, Z. Li, J. Yang, X. Wang, J. Su et al., “Each rank could be an expert: Single-ranked mixture of experts lora for multi-task learning,” in Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1, 2026, pp. 1998–2009

[58] L. Yi, H. Yu, G. Wang, X. Liu, and Q. Hu, “pfedmoe: Data-level personalization with mixture of experts in model-heterogeneous personalized federated learning,” IEEE Transactions on Knowledge and Data Engineering, 2026

[59] B. Lin, Z. Tang, Y. Ye, J. Huang, J. Zhang, Y. Pang, P. Jin, M. Ning, J. Luo, and L. Yuan, “Moe-llava: Mixture of experts for large vision-language models,” IEEE Transactions on Multimedia, 2026

[60] R. Zhang, M. Dong, Y. Zhang, L. Heng, X. Chi, G. Dai, L. Du, D. Wang, Y. Du, and S. Zhang, “Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 22, 2026, pp. 18 764–18 772

[61] R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. Ramos Garea, M. Geist, and O. Bachem, “On-policy distillation of language models: Learning from self-generated mistakes,” in International Conference on Learning Representations, vol. 2024, 2024, pp. 21 246–21 263

[62] Y. Gu, L. Dong, F. Wei, and M. Huang, “Minillm: Knowledge distillation of large language models,” in International Conference on Learning Representations, vol. 2024, 2024, pp. 32 694–32 717

[63] J. Ko, S. Kim, T. Chen, and S.-Y. Yun, “Distillm: Towards streamlined distillation for large language models,” arXiv preprint arXiv:2402.03898, 2024

[64] J. Liu, C. Xu, P. Hang, J. Sun, M. Ding, W. Zhan, and M. Tomizuka, “Language-driven policy distillation for cooperative driving in multi-agent reinforcement learning,” IEEE Robotics and Automation Letters, 2025

[65] Q. Zhang, G. Han, J. Sun, W. Zhao, C. Sun, J. Cao, J. Wang, Y. Guo, and R. Xu, “Distillation-ppo: A novel two-stage reinforcement learning framework for humanoid robot perceptive locomotion,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 2916–2922

[66] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921–2929

[67] N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V. Alwala, H. Khedr, A. Huang et al., “Sam 3: Segment anything with concepts,” arXiv preprint arXiv:2511.16719, 2025

[68] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” Advances in neural information processing systems, vol. 30, 2017

[69] M. Döbler, R. A. Marsden, and B. Yang, “Robust mean teacher for continual and gradual test-time adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7704–7714

[70] Y. Gan, X. Ma, Y. Lou, Y. Bai, R. Zhang, N. Shi, and L. Luo, “Decorate the newcomers: Visual domain prompt for continual test time adaptation,” arXiv preprint arXiv:2212.04145, 2022

[71] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of representations for domain adaptation,” Advances in neural information processing systems, vol. 19, 2006

[72] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Machine learning, vol. 79, no. 1, pp. 151–175, 2010

[73] S. Ruder and B. Plank, “Learning to select data for transfer learning with bayesian optimization,” arXiv preprint arXiv:1707.05246, 2017

[74] E. Allaway, M. Srikanth, and K. McKeown, “Adversarial learning for zero-shot stance detection on social media,” arXiv preprint arXiv:2105.06603, 2021

[75] J. MacQueen, “Classification and analysis of multivariate observations,” in 5th Berkeley Symp. Math. Statist. Probability. University of California Los Angeles LA USA, 1967, pp. 281–297

[76] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020

[77] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 12 077–12 090, 2021

[78] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

[79] S. Chen, C. Ge, Z. Tong, J. Wang, Y. Song, J. Wang, and P. Luo, “Adaptformer: Adapting vision transformers for scalable visual recognition,” Advances in Neural Information Processing Systems, vol. 35, pp. 16 664–16 678, 2022

[80] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034

Showing first 80 references.


[32] [32]

Retinal processing near absolute threshold: from behavior to mechanism,

G. D. Field, A. P . Sampath, and F. Rieke, “Retinal processing near absolute threshold: from behavior to mechanism,”Annu. Rev. Physiol., vol. 67, pp. 491–514, 2005

work page 2005

[33] [33]

Learning multiple layers of features from tiny images,

A. Krizhevsky, G. Hintonet al., “Learning multiple layers of features from tiny images,” 2009

work page 2009

[34] [34]

Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

D. Hendrycks and T. Dietterich, “Benchmarking neural network robustness to common corruptions and perturbations,”arXiv preprint arXiv:1903.12261, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1903

[35] [35]

The cityscapes dataset for semantic urban scene understanding,

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Be- nenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223

work page 2016

[36] [36]

Acdc: The adverse con- ditions dataset with correspondences for semantic driving scene understanding,

C. Sakaridis, D. Dai, and L. Van Gool, “Acdc: The adverse con- ditions dataset with correspondences for semantic driving scene understanding,” inProceedings of the IEEE/CVF International Con- ference on Computer Vision, 2021, pp. 10 765–10 775

work page 2021

[37] [37]

Ecotta: Memory-efficient continual test-time adaptation via self-distilled regularization,

J. Song, J. Lee, I. S. Kweon, and S. Choi, “Ecotta: Memory-efficient continual test-time adaptation via self-distilled regularization,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11 920–11 929

work page 2023

[38] [38]

Decorate the newcomers: Visual domain prompt for continual test time adaptation,

Y. Gan, Y. Bai, Y. Lou, X. Ma, R. Zhang, N. Shi, and L. Luo, “Decorate the newcomers: Visual domain prompt for continual test time adaptation,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 37, 2023, pp. 7595–7603

work page 2023

[39] [39]

Exploring sparse visual prompt for cross-domain semantic seg- mentation,

S. Yang, J. Wu, J. Liu, X. Li, Q. Zhang, M. Pan, and S. Zhang, “Exploring sparse visual prompt for cross-domain semantic seg- mentation,”arXiv preprint arXiv:2303.09792, 2023

work page arXiv 2023

[40] [40]

Distribution-aware continual test time adaptation for semantic segmentation,

J. Ni, S. Yang, J. Liu, X. Li, W. Jiao, R. Xu, Z. Chen, Y. Liu, and S. Zhang, “Distribution-aware continual test time adaptation for semantic segmentation,”arXiv preprint arXiv:2309.13604, 2023

work page arXiv 2023

[41] [41]

Sparsevit: Revisiting activation sparsity for efficient high-resolution vision transformer,

X. Chen, Z. Liu, H. Tang, L. Yi, H. Zhao, and S. Han, “Sparsevit: Revisiting activation sparsity for efficient high-resolution vision transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 2061–2070

work page 2023

[42] [42]

Inducing and exploiting activation sparsity for fast inference on deep neural networks,

M. Kurtz, J. Kopinsky, R. Gelashvili, A. Matveev, J. Carr, M. Goin, W. Leiserson, S. Moore, N. Shavit, and D. Alistarh, “Inducing and exploiting activation sparsity for fast inference on deep neural networks,” inInternational Conference on Machine Learning. PMLR, 2020, pp. 5533–5543

work page 2020

[43] [43]

Dasnet: Dynamic activation sparsity for neural network efficiency improvement,

Q. Yang, J. Mao, Z. Wang, and H. Li, “Dasnet: Dynamic activation sparsity for neural network efficiency improvement,” in2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2019, pp. 1401–1405

work page 2019

[44] [44]

Sparse reram engine: Joint exploration of JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 16 activation and weight sparsity in compressed neural networks,

T.-H. Yang, H.-Y. Cheng, C.-L. Yang, I.-C. Tseng, H.-W. Hu, H.-S. Chang, and H.-P . Li, “Sparse reram engine: Joint exploration of JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 16 activation and weight sparsity in compressed neural networks,” inProceedings of the 46th International Symposium on Computer Architecture, 2019, pp. 236–249

work page 2021

[45] [45]

Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models,

C. Song, X. Han, Z. Zhang, S. Hu, X. Shi, K. Li, C. Chen, Z. Liu, G. Li, T. Yang, and M. Sun, “Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models,” 2024

work page 2024

[46] [46]

Relu strikes back: Exploiting activation sparsity in large language models,

I. Mirzadeh, K. Alizadeh, S. Mehta, C. C. D. Mundo, O. Tuzel, G. Samei, M. Rastegari, and M. Farajtabar, “Relu strikes back: Exploiting activation sparsity in large language models,” 2023

work page 2023

[47] [47]

Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models,

C. Song, X. Han, Z. Zhang, S. Hu, X. Shi, K. Li, C. Chen, Z. Liu, G. Li, T. Yanget al., “Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models,” in Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 2626–2644

work page 2025

[48] [48]

Hierarchical mixtures of experts and the EM algorithm,

M. I. Jordan and R. A. Jacobs, “Hierarchical mixtures of experts and the EM algorithm,”Neural computation, vol. 6, no. 2, pp. 181– 214, 1994

work page 1994

[49] [49]

Learning Factored Representations in a Deep Mixture of Experts

D. Eigen, M. Ranzato, and I. Sutskever, “Learning factored repre- sentations in a deep mixture of experts,”arXiv:1312.4314, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[50] [50]

Modeling task relationships in multi-task learning with multi-gate mixture- of-experts,

J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H. Chi, “Modeling task relationships in multi-task learning with multi-gate mixture- of-experts,” inProceedings of the ACM SIGKDD International Con- ference on Knowledge Discovery and Data Mining (KDD), 2018

work page 2018

[51] [51]

Gshard: Scaling giant models with con- ditional computation and automatic sharding,

D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with con- ditional computation and automatic sharding,” in9th International Conference on Learning Representations, ICLR 2021, 2021

work page 2021

[52] [52]

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” CoRR, vol. abs/2101.03961, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[53] [53]

Hash layers for large sparse models,

S. Roller, S. Sukhbaatar, A. Szlam, and J. Weston, “Hash layers for large sparse models,”CoRR, vol. abs/2106.04426, 2021. [Online]. Available: https://arxiv.org/abs/2106.04426

work page arXiv 2021

[54] [54]

Stablemoe: Stable routing strategy for mixture of experts,

D. Dai, L. Dong, S. Ma, B. Zheng, Z. Sui, B. Chang, and F. Wei, “Stablemoe: Stable routing strategy for mixture of experts,” in Proceedings of the 60th Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P . Nakov, and A. Villavicencio, Eds. Association for Comp...

work page 2022

[55] [55]

Designing effective sparse expert models,

B. Zoph, “Designing effective sparse expert models,” inIEEE International Parallel and Distributed Processing Symposium, IPDPS Workshops 2022, Lyon, France, May 30 - June 3, 2022. IEEE, 2022, p. 1044

work page 2022

[56] [56]

Flylora: Boosting task decoupling and parameter efficiency via implicit rank-wise mixture-of-experts,

H. Zou, Y. Zang, W. Xu, Y. Zhu, and X. Ji, “Flylora: Boosting task decoupling and parameter efficiency via implicit rank-wise mixture-of-experts,”Advances in Neural Information Processing Sys- tems, vol. 38, pp. 10 386–10 419, 2026

work page 2026

[57] [57]

Each rank could be an expert: Single-ranked mixture of experts lora for multi-task learning,

Z. Zhao, Y. Zhou, X. Yu, Z. Zhang, D. Zhu, T. Shen, Z. Li, J. Yang, X. Wang, J. Suet al., “Each rank could be an expert: Single-ranked mixture of experts lora for multi-task learning,” inProceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, 2026, pp. 1998–2009

work page 2026

[58] [58]

pfedmoe: Data-level personalization with mixture of experts in model-heterogeneous personalized federated learning,

L. Yi, H. Yu, G. Wang, X. Liu, and Q. Hu, “pfedmoe: Data-level personalization with mixture of experts in model-heterogeneous personalized federated learning,”IEEE Transactions on Knowledge and Data Engineering, 2026

work page 2026

[59] B. Lin, Z. Tang, Y. Ye, J. Huang, J. Zhang, Y. Pang, P. Jin, M. Ning, J. Luo, and L. Yuan, “Moe-llava: Mixture of experts for large vision-language models,” IEEE Transactions on Multimedia, 2026

[60] R. Zhang, M. Dong, Y. Zhang, L. Heng, X. Chi, G. Dai, L. Du, D. Wang, Y. Du, and S. Zhang, “Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 22, 2026, pp. 18 764–18 772

[61] R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. Ramos Garea, M. Geist, and O. Bachem, “On-policy distillation of language models: Learning from self-generated mistakes,” in International Conference on Learning Representations, vol. 2024, 2024, pp. 21 246–21 263

[62] Y. Gu, L. Dong, F. Wei, and M. Huang, “Minillm: Knowledge distillation of large language models,” in International Conference on Learning Representations, vol. 2024, 2024, pp. 32 694–32 717

[63] J. Ko, S. Kim, T. Chen, and S.-Y. Yun, “Distillm: Towards streamlined distillation for large language models,” arXiv preprint arXiv:2402.03898, 2024

[64] J. Liu, C. Xu, P. Hang, J. Sun, M. Ding, W. Zhan, and M. Tomizuka, “Language-driven policy distillation for cooperative driving in multi-agent reinforcement learning,” IEEE Robotics and Automation Letters, 2025

[65] Q. Zhang, G. Han, J. Sun, W. Zhao, C. Sun, J. Cao, J. Wang, Y. Guo, and R. Xu, “Distillation-ppo: A novel two-stage reinforcement learning framework for humanoid robot perceptive locomotion,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 2916–2922

[66] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921–2929

[67] N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V. Alwala, H. Khedr, A. Huang et al., “Sam 3: Segment anything with concepts,” arXiv preprint arXiv:2511.16719, 2025

[68] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” Advances in neural information processing systems, vol. 30, 2017

[69] M. Döbler, R. A. Marsden, and B. Yang, “Robust mean teacher for continual and gradual test-time adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7704–7714

[70] Y. Gan, X. Ma, Y. Lou, Y. Bai, R. Zhang, N. Shi, and L. Luo, “Decorate the newcomers: Visual domain prompt for continual test time adaptation,” arXiv preprint arXiv:2212.04145, 2022

[71] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of representations for domain adaptation,” Advances in neural information processing systems, vol. 19, 2006

[72] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Machine learning, vol. 79, no. 1, pp. 151–175, 2010

[73] S. Ruder and B. Plank, “Learning to select data for transfer learning with Bayesian optimization,” arXiv preprint arXiv:1707.05246, 2017

[74] E. Allaway, M. Srikanth, and K. McKeown, “Adversarial learning for zero-shot stance detection on social media,” arXiv preprint arXiv:2105.06603, 2021

[75] J. MacQueen, “Classification and analysis of multivariate observations,” in 5th Berkeley Symp. Math. Statist. Probability. University of California Los Angeles LA USA, 1967, pp. 281–297

[76] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020

[77] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 12 077–12 090, 2021

[78] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

[79] S. Chen, C. Ge, Z. Tong, J. Wang, Y. Song, J. Wang, and P. Luo, “Adaptformer: Adapting vision transformers for scalable visual recognition,” Advances in Neural Information Processing Systems, vol. 35, pp. 16 664–16 678, 2022

[80] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034