EVA-0: Test-Time Model Evolution with Only Two Forward Passes per Sample

Chunyan Miao; Geng Li; Guohao Chen; Jianfei Yang; Shilin Shan; Shuaicheng Niu; Yunbei Zhang

arxiv: 2605.18867 · v1 · pith:UDYGJ3QLnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI

Guohao Chen, Shuaicheng Niu, Geng Li, Yunbei Zhang, Shilin Shan, Chunyan Miao, Jianfei Yang

Pith reviewed 2026-05-20 21:17 UTC · model grok-4.3

keywords test-time adaptation · zeroth-order optimization · model evolution · forward passes · backpropagation-free · image classification · ViT models

The pith

EVA-0 adapts models at test time using only two forward passes and no backpropagation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that test-time model evolution can be made practical for edge devices and black-box systems by eliminating backpropagation entirely. It does so by identifying three obstacles that appear when adaptation is restricted to almost no computation and then removing those obstacles with three targeted fixes. A sympathetic reader would care because this would let already-deployed models keep improving from new unlabeled data without extra memory, gradients, or specialized hardware. If the approach holds, adaptation moves from a heavy research technique to something that can run on ordinary forward-only inference pipelines.

Core claim

EVA-0 is a minimal zeroth-order adaptation framework that requires no backpropagation and performs both inference and adaptation within only two forward passes per sample. It overcomes shortcut solutions with a scale-invariant loss, controls weight drift through anchor-guided optimization, and estimates reliable update directions via sample-wise symmetric two-sided perturbation. On ImageNet-C with ViT-Base it outperforms both the backpropagation-based DeYO and the forward-only FOA while delivering a 14x speed-up over FOA.
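To make the two-pass budget concrete, here is a minimal sketch of a generic symmetric two-point zeroth-order step. It is not the authors' code: the function names, the toy quadratic loss, the Gaussian direction, and the values of `sigma` and `lr` are all assumptions for illustration, and the paper's sample-wise perturbation and its reuse of the two passes for inference are not specified in the text available here.

```python
import random


def two_sided_zo_step(loss, theta, rng, sigma=1e-3, lr=0.1):
    """One symmetric two-point zeroth-order update on a flat parameter list.

    `loss` is any callable mapping a parameter list to a float. Each call
    stands in for one forward pass, so this step costs exactly two.
    """
    # Draw one random direction and reuse it with both signs.
    u = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = [t + sigma * d for t, d in zip(theta, u)]
    minus = [t - sigma * d for t, d in zip(theta, u)]
    loss_plus = loss(plus)    # forward pass 1
    loss_minus = loss(minus)  # forward pass 2
    # Central-difference estimate of the directional derivative along u.
    slope = (loss_plus - loss_minus) / (2.0 * sigma)
    new_theta = [t - lr * slope * d for t, d in zip(theta, u)]
    return new_theta, loss_plus, loss_minus


def quadratic(theta):
    # Hypothetical toy stand-in for an unsupervised test-time loss.
    return sum((t - 1.0) ** 2 for t in theta)


rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
start = quadratic(theta)
for _ in range(200):
    theta, _, _ = two_sided_zo_step(quadratic, theta, rng)
end = quadratic(theta)
```

No gradient is ever computed: the only information used is the pair of loss values at the two mirrored points, which is why the step fits hardware that exposes forward passes only.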

What carries the argument

The EVA-0 framework built from a scale-invariant loss, anchor-guided optimization, and sample-wise symmetric two-sided perturbation that together enable two-forward-pass test-time evolution.

If this is right

Deployed models can update themselves from unlabeled test data on hardware that supplies only forward passes.
Adaptation becomes possible for quantized, accelerator-specific, and black-box models that cannot expose gradients.
Test-time updates run at speeds that make continual improvement practical rather than a research-only step.
Accuracy on corrupted-image benchmarks improves relative to both gradient-based and prior forward-only baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-pass structure could be tested on tasks outside image classification such as object detection or language-model fine-tuning.
Pairing the method with existing quantization or pruning pipelines might allow on-device adaptation with even lower memory.
One natural next measurement is whether the same three components remain effective if the budget is tightened to a single forward pass.

Load-bearing premise

The three components together are sufficient to remove shortcut solutions, uncontrolled weight drift, and ineffective update directions when adaptation is forced to exactly two forward passes.

What would settle it

A run on ImageNet-C with ViT-Base in which EVA-0 either fails to beat DeYO and FOA or requires more than two forward passes per sample would show the central claim is not correct.

Figures

Figures reproduced from arXiv: 2605.18867 by Chunyan Miao, Geng Li, Guohao Chen, Jianfei Yang, Shilin Shan, Shuaicheng Niu, Yunbei Zhang.

**Figure 2.** (a) Accuracy of different direction estimation and inference strategies.

**Figure 1.** Our IMAGENET-C dataset consists of 15 types of algorithmically generated corruptions from noise, blur, weather, and digital categories. Each type of corruption has five levels of severity, resulting in 75 distinct corruptions. See different severity levels in Appendix B.

Original abstract

Test-time model evolution offers a promising way for deployed models to improve from unlabeled test-time experience, yet most existing methods depend on backpropagation (BP), which incurs substantial memory overhead and makes them difficult to deploy on edge devices, quantized models, specialized accelerators, or black-box models. In this work, we study test-time model evolution under a strict two-forward budget, a setting that pushes adaptation toward highly efficient real-world deployment. We reveal three key obstacles in zeroth-order test-time optimization: susceptibility to shortcut solutions, uncontrolled weight drift, and ineffective update direction estimation. To overcome them, we propose EVA-0, a minimal zeroth-order adaptation framework that: 1) keeps the loss scale-invariant to prevent shortcut solutions; 2) devises an anchor-guided optimization strategy to alleviate weight drift; 3) uses sample-wise symmetric two-sided perturbation for update direction estimation and inference. EVA-0 requires no BP and performs both inference and adaptation within only two forward passes per sample. Results on ImageNet-C & ViT-Base show that EVA-0 outperforms both BP-based DeYO and BP-free FOA, while achieving a 14x speed-up over FOA. Code will be released.
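The abstract does not spell out what a "shortcut solution" looks like. One plausible reading, stated here as an assumption, is that an entropy-style objective can be lowered simply by inflating the magnitude of the logits without changing the prediction. The sketch below illustrates that general effect and how normalizing the logits removes it. The helper names and example logits are hypothetical, and plain L2 normalization is a simplified stand-in, not the paper's loss.

```python
import math


def softmax_entropy(logits):
    """Shannon entropy of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def l2_normalize(logits, eps=1e-8):
    """Rescale logits to (almost exactly) unit L2 norm."""
    norm = math.sqrt(sum(z * z for z in logits))
    return [z / (norm + eps) for z in logits]


logits = [2.0, 1.0, 0.5]
scaled = [10.0 * z for z in logits]  # same argmax, larger magnitude

# Without normalization, scaling alone lowers the entropy.
raw_gap = softmax_entropy(logits) - softmax_entropy(scaled)

# After normalization, the scaled and unscaled logits give the same entropy.
norm_gap = softmax_entropy(l2_normalize(logits)) - softmax_entropy(l2_normalize(scaled))
```

Here `raw_gap` is clearly positive while `norm_gap` is numerically zero, so an optimizer working on the normalized output gains nothing from rescaling alone.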

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EVA-0 fits test-time adaptation into two forward passes with no backprop by fixing three specific zeroth-order problems, which matters for edge and black-box deployment.

Colleague, the paper's main contribution is a minimal zeroth-order framework that does both inference and adaptation in exactly two forward passes per sample. It targets the practical barriers that stop most test-time adaptation from running on quantized models, specialized hardware, or black-box APIs where backprop is impossible or too expensive. The authors name three obstacles (shortcut solutions, weight drift, and bad update directions) and pair each with a fix: scale-invariant loss, anchor-guided optimization, and sample-wise symmetric two-sided perturbation. That combination is what lets them stay inside the two-pass budget while claiming better accuracy than DeYO on ImageNet-C with ViT-Base and a 14x speed-up over FOA.

The engineering is direct and the motivation is clear for anyone who has tried to ship adaptation outside a full training setup. The symmetric perturbation trick is particularly neat because it reuses the same passes for both direction estimation and the final inference output.

On the soft side, the abstract gives almost no experimental details: no error bars, no ablation tables, no description of how the baselines were tuned or how many runs were averaged. That leaves open whether the reported gains are stable across different corruptions or model scales. The work also sits squarely on top of prior TTA literature, so the novelty is mostly in tightening the constraints rather than inventing new primitives.

Readers who work on efficient deployment will find the framework useful even if they end up modifying the pieces. The central argument holds together without obvious internal contradictions, and the practical angle is worth referee time. I would send it out for review.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces EVA-0, a zeroth-order test-time adaptation framework for model evolution that requires no backpropagation and completes both inference and adaptation in only two forward passes per sample. It identifies three obstacles in this constrained setting (shortcut solutions, uncontrolled weight drift, and ineffective update direction estimation) and maps them to three proposed components: a scale-invariant loss, anchor-guided optimization, and sample-wise symmetric two-sided perturbation. Empirical results on ImageNet-C with ViT-Base are reported to show outperformance over BP-based DeYO and BP-free FOA, together with a 14x speedup relative to FOA.

Significance. If the central claims hold, the work offers a practical advance for test-time adaptation on edge devices, quantized models, specialized accelerators, and black-box APIs where backpropagation is unavailable or prohibitively expensive. The strict two-forward-pass budget and explicit component-to-obstacle mapping constitute a clear contribution; the planned code release supports reproducibility.

major comments (1)

[§4] Experiments: the reported outperformance on ImageNet-C lacks error bars, statistical significance tests, or ablation studies isolating the contribution of each of the three components; without these, the claim that the proposed components are jointly sufficient to overcome the three obstacles remains under-supported.

minor comments (1)

[§3] Notation for the symmetric two-sided perturbation (e.g., the exact definition of the two forward passes and how the update direction is extracted) should be made fully explicit with a small algorithmic box or pseudocode for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We address the single major comment below.

Referee: [§4] Experiments: the reported outperformance on ImageNet-C lacks error bars, statistical significance tests, or ablation studies isolating the contribution of each of the three components; without these, the claim that the proposed components are jointly sufficient to overcome the three obstacles remains under-supported.

Authors: We agree that the current experimental presentation would benefit from additional statistical rigor and component-wise analysis. In the revised manuscript we will report mean and standard deviation over at least three independent runs with different random seeds, include paired statistical significance tests (e.g., Wilcoxon signed-rank or t-tests) against the strongest baselines, and add a dedicated ablation table that isolates the contribution of the scale-invariant loss, the anchor-guided optimization, and the symmetric two-sided perturbation. These additions will directly substantiate that the three components together overcome the identified obstacles.

revision: yes
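As a rough illustration of the reporting the rebuttal promises, the sketch below computes a mean, a sample standard deviation, and a paired t statistic over per-seed accuracies using only the Python standard library. The accuracy values are made up for the example and are not results from the paper; the function name is hypothetical.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched runs a[i], b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Made-up per-seed accuracies, purely illustrative (NOT from the paper).
method_a = [0.70, 0.72, 0.71]
method_b = [0.66, 0.69, 0.67]

mean_a = statistics.mean(method_a)
std_a = statistics.stdev(method_a)
t_stat, dof = paired_t_statistic(method_a, method_b)
```

The statistic still has to be compared against a t distribution with `dof` degrees of freedom to obtain a p-value. With only three paired runs, a two-sided Wilcoxon signed-rank test cannot produce a p-value below 0.25, so more seeds would likely be needed for that test to be informative.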

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper constructs EVA-0 as a new zeroth-order framework whose three components are explicitly mapped to three independently stated obstacles (shortcut solutions, weight drift, ineffective direction estimation). No equations or claims reduce by construction to fitted parameters, prior self-citations, or renamed known results; the two-forward-pass budget is a design constraint, not a derived prediction. Empirical results on ImageNet-C are presented as external validation rather than tautological confirmation. The argument remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract does not specify any fitted parameters or new entities; the framework relies on standard zeroth-order optimization assumptions and prior TTA concepts.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

keeps the loss scale-invariant to prevent shortcut solutions... scale normalization and output decentering... $s(o) = \frac{\bar{r}_{t,i}}{\lVert o \rVert_2 + \epsilon}\, o - c_t$

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · refines

refines: relation between the paper passage and the cited Recognition theorem.

anchor-guided optimization strategy to alleviate weight drift... $\theta_{t+1} = (1-\gamma)\,\theta'_{t+1} + \gamma\,\theta_{\mathrm{anc}}$
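The anchor-guided update quoted in the passage, θ_{t+1} = (1−γ)θ′_{t+1} + γθ_anc, is a convex interpolation between the freshly updated weights and a fixed anchor. A minimal sketch of that one formula follows. The function name, the flat-list parameter representation, the choice of anchor, and the value of `gamma` are assumptions for illustration, since the passage does not say how the anchor or γ is chosen.

```python
def anchor_pull(theta_updated, theta_anchor, gamma=0.1):
    """Interpolate updated weights toward an anchor.

    Implements theta_next = (1 - gamma) * theta_updated + gamma * theta_anchor
    element-wise on flat parameter lists.
    """
    return [(1.0 - gamma) * t + gamma * a
            for t, a in zip(theta_updated, theta_anchor)]


# Hypothetical anchor; for example, the weights at deployment time.
anchor = [0.0, 0.0]

# A single pull with gamma = 0.25 moves a weight a quarter of the way back.
one_step = anchor_pull([1.0], [0.0], gamma=0.25)

# With no other updates, repeated pulls shrink the drift geometrically.
theta = [1.0, -2.0]
for _ in range(50):
    theta = anchor_pull(theta, anchor, gamma=0.1)
```

Each pull scales the distance to the anchor by a factor of (1 − γ), so the interpolation acts as a restoring force against accumulated drift rather than a hard reset.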

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 1 internal anchor


[1] [1]

Bartler, A

A. Bartler, A. Bühler, F. Wiewel, M. Döbler, and B. Yang. Mt3: Meta test-time training for self-supervised test-time adaption. InInternational Conference on Artificial Intelligence and Statistics, pages 3080–3090. PMLR, 2022

work page 2022

[2] [2]

Boudiaf, R

M. Boudiaf, R. Mueller, I. Ben Ayed, and L. Bertinetto. Parameter-free online test-time adaptation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8344–8353, 2022

work page 2022

[3] [3]

H. Cai, D. McKenzie, W. Yin, and Z. Zhang. Zeroth-order regularized optimization (zoro): Ap- proximately sparse gradients and adaptive sampling.SIAM Journal on Optimization, 32(2):687– 714, 2022

work page 2022

[4] [4]

Caron, H

M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

work page 2021

[5] [5]

A. Chen, Y . Zhang, J. Jia, J. Diffenderfer, K. Parasyris, J. Liu, Y . Zhang, Z. Zhang, B. Kailkhura, and S. Liu. Deepzero: Scaling up zeroth-order optimization for deep model training. In International Conference on Learning Representations, 2024

work page 2024

[6] [6]

Cheng, G

S. Cheng, G. Wu, and J. Zhu. On the convergence of prior-guided zeroth-order optimization algorithms. InAdvances in Neural Information Processing Systems, volume 34, pages 14620– 14631, 2021

work page 2021

[7] [7]

Choi, D.-Y

W. Choi, D.-Y . Kim, J. Park, J. Lee, Y . Park, D.-J. Han, and J. Moon. Adaptive energy alignment for accelerating test-time adaptation. InThe Thirteenth International Conference on Learning Representations, 2024

work page 2024

[8] [8]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. InIEEE Conference on Computer Vision and Pattern Recognition, pages 248– 255, 2009

work page 2009

[9] [9]

Z. Deng, G. Chen, S. Niu, H. Luo, S. Zhang, Y . Yang, R. Chen, W. Luo, and M. Tan. Test-time model adaptation for quantized neural networks.arXiv preprint arXiv:2508.02180, 2025

work page arXiv 2025

[10] [10]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2021

work page 2021

[11] [11]

J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations.IEEE Transactions on Information Theory, 61(5):2788–2806, 2015

work page 2015

[12] [12]

Dupoux, Y

E. Dupoux, Y . LeCun, and J. Malik. Why ai systems don’t learn and what to do about it: Lessons on autonomous learning from cognitive science.arXiv preprint arXiv:2603.15381, 2026

work page arXiv 2026

[13] [13]

C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. InInternational conference on machine learning, pages 1126–1135. PMLR, 2017

work page 2017

[14] [14]

Flaxman, A

A. Flaxman, A. Kalai, and H. McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. InProceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394, 2005

work page 2005

[15] [15]

Gandelsman, Y

Y . Gandelsman, Y . Sun, X. Chen, and A. Efros. Test-time training with masked autoencoders. InAdvances in Neural Information Processing Systems, volume 35, pages 29374–29385, 2022

work page 2022

[16] [16]

J. Gao, J. Zhang, X. Liu, T. Darrell, E. Shelhamer, and D. Wang. Back to the source: Diffusion- driven adaptation to test-time corruption. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11786–11796, 2023. 10

work page 2023

[17] [17]

Gautam, Y

T. Gautam, Y . Park, H. Zhou, P. Raman, and W. Ha. Variance-reduced zeroth-order methods for fine-tuning language models. InProceedings of the 41st International Conference on Machine Learning, pages 15180–15208, 2024

work page 2024

[18] [18]

Ghadimi and G

S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM journal on optimization, 23(4):2341–2368, 2013

work page 2013

[19] [19]

Gidaris, P

S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. InInternational Conference on Learning Representations, pages 1–14, 2018

work page 2018

[20] [20]

Goyal, M

S. Goyal, M. Sun, A. Raghunathan, and J. Z. Kolter. Test time adaptation via conjugate pseudo-labels.Advances in Neural Information Processing Systems, 35:6204–6218, 2022

work page 2022

[21] [21]

W. Gu, L. Gu, Z. Wang, C. Y . Suen, and Y . Wang. Docttt: Test-time training for handwritten document recognition using meta-auxiliary learning. InProceedings of the Winter Conference on Applications of Computer Vision, pages 1904–1913, 2025

work page 1904

[22] [22]

N. Hansen. The cma evolution strategy: A tutorial.arXiv preprint arXiv:1604.00772, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[23] [23]

K. He, X. Chen, S. Xie, Y . Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

work page 2022

[24] [24]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InIEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

work page 2016

[25] [25]

Hendrycks and T

D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corrup- tions and perturbations. InInternational Conference on Learning Representations, pages 1–11, 2019

work page 2019

[26] [26]

Iwasawa and Y

Y . Iwasawa and Y . Matsuo. Test-time classifier adjustment module for model-agnostic domain generalization. InAdvances in Neural Information Processing Systems, volume 34, pages 2427–2440, 2021

work page 2021

[27] [27]

Jeong, J

W. Jeong, J. Cho, Y . Yoon, and K.-J. Yoon. Synchronizing task behavior: Aligning multiple tasks during test-time training. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24340–24350, 2025

work page 2025

[28] [28]

P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pages 5637–5664, 2021

[29]

J. Lee, D. Jung, S. Lee, J. Park, J. Shin, U. Hwang, and S. Yoon. Entropy is not enough for test-time adaptation: From the perspective of disentangled factors. In International Conference on Learning Representations, 2024

[30]

J. Liang, R. He, and T. Tan. A comprehensive survey on test-time adaptation under distribution shifts. International Journal of Computer Vision, pages 1–34, 2024

[31]

J. Liang, D. Hu, and J. Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, pages 6028–6039, 2020

[32]

J. Liu, R. Xu, S. Yang, R. Zhang, Q. Zhang, Z. Chen, Y. Guo, and S. Zhang. Continual-mae: Adaptive distribution masked autoencoders for continual test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28653–28663, 2024

[33]

S. Liu, P.-Y. Chen, B. Kailkhura, G. Zhang, A. O. Hero III, and P. K. Varshney. A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications. IEEE Signal Processing Magazine, 37(5):43–54, 2020

[34]

S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora. Fine-tuning language models with just forward passes. In Advances in Neural Information Processing Systems, pages 53038–53075, 2023

[35]

S. Niu, C. Miao, G. Chen, P. Wu, and P. Zhao. Test-time model adaptation with only forward passes. In International Conference on Machine Learning, 2024

[36]

S. Niu, J. Wu, Y. Zhang, Z. Wen, Y. Chen, P. Zhao, and M. Tan. Towards stable test-time adaptation in dynamic wild world. In International Conference on Learning Representations, pages 1–14, 2023

[37]

D. Osowiechi, G. A. V. Hakim, M. Noori, M. Cheraghalikhani, A. Bahri, M. Yazdanpanah, I. Ben Ayed, and C. Desrosiers. Nc-ttt: A noise contrastive approach for test-time training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6078–6086, 2024

[38]

D. Osowiechi, G. A. V. Hakim, M. Noori, M. Cheraghalikhani, I. Ben Ayed, and C. Desrosiers. Tttflow: Unsupervised test-time training with normalizing flow. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2126–2134, 2023

[39]

D. Silver and R. S. Sutton. Welcome to the era of experience. Google AI, 1:11, 2025

[40]

J. C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992

[41]

Y. Sun, X. Wang, Z. Liu, J. Miller, A. Efros, and M. Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pages 9229–9248, 2020

[42]

J.-A. Termöhlen, M. Klingner, L. J. Brettin, N. M. Schmidt, and T. Fingscheidt. Continual unsupervised domain adaptation for semantic segmentation by online frequency domain style transfer. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pages 2881–2888. IEEE, 2021

[43]

D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations, pages 1–12, 2021

[44]

Q. Wang, O. Fink, L. Van Gool, and D. Dai. Continual test-time domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7201–7211, 2022

[45]

X. Wang and T. Wang. Fozo: Forward-only zeroth-order prompt optimization for test-time adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2026

[46]

R. Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019

[47]

F. You, J. Li, and Z. Zhao. Test-time batch statistics calibration for covariate shift. arXiv preprint arXiv:2110.04065, 2021

[48]

Y. Yuan, B. Xu, L. Hou, F. Sun, H. Shen, and X. Cheng. Tea: Test-time energy adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 23901–23911, 2024

[49]

Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun. Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization. In European Conference on Computer Vision, pages 191–207. Springer, 2022

[50]

M. M. Zhang, S. Levine, and C. Finn. Memo: Test time robustness via adaptation and augmentation. In Advances in Neural Information Processing Systems, pages 38629–38642, 2022

Supplementary Materials for “EVA-0: Test-Time Model Evolution with Only Two Forward Passes per Sample”

Contents

A Pseudo-Code of EVA-0
B Related Work
C More Design Detai...