Reliability-Gated Source Anchoring for Continual Test-Time Adaptation

Vikash Singh, Debargha Ganguly, Weicong Chen, Sabyasachi Sahoo, Sreehari Sankar, Biyao Zhang, Mohsen Hariri, Shouren Wang, Osama Zafar, Christian Gagné, Vipin Chaudhary

arxiv: 2605.14063 · v2 · pith:7PM2BGWCnew · submitted 2026-05-13 · 💻 cs.LG

Pith reviewed 2026-05-20 20:17 UTC · model grok-4.3

classification 💻 cs.LG

keywords continual test-time adaptation · source anchoring · reliability gating · predictive entropy · distribution shift · online adaptation · model robustness


The pith

Normalized predictive entropy from the source model can gate anchoring to avoid blind reliance when the pretrained checkpoint degrades in continual test-time adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Continual test-time adaptation keeps a model anchored to a frozen source checkpoint while updating on an unlabeled shifting stream, but this anchor harms accuracy once the source loses nearly all predictive power on the target. The paper identifies the resulting blind anchoring failure, where prior methods apply fixed anchor strength even after source top-1 accuracy falls to roughly 1.3 percent. RMemSafe adds a gate that reads the source's normalized predictive entropy and progressively attenuates all source-coupled terms in the objective. When entropy approaches its maximum the gate closes, the anchor and agreement filter disappear, and the loss falls back to source-agnostic terms plus marginal calibration. On continual corruption benchmarks the gated method records the lowest error on eight of nine matched splits and exhibits a measurably shallower performance drop as source quality is artificially degraded.

Core claim

RMemSafe extends ROID by inserting a reliability gate that scales down explicit source-coupled components of the adaptation objective according to the normalized predictive entropy of the frozen source outputs. When the source posterior nears uniformity the gate closes, removing the anchor and agreement filter so that the objective reduces to ROID base losses plus marginal calibration. The resulting procedure yields lower error than ROID plus ASR on eight of nine continual-corruption cells, is the strongest reset-based method on all nine, and produces a 1.13 times shallower harm slope under controlled source degradation.
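The reduction described in the core claim can be written schematically. This is a sketch, not the paper's exact objective: L_ROID, L_marg, and L_anchor are placeholder names for the ROID base losses, marginal calibration, and the source anchor; λ is the anchor strength; R_src is the reliability gate. The agreement filter and the divergence scaling inside the anchor are omitted for brevity.

```latex
\mathcal{L}(R_{\text{src}})
  = \mathcal{L}_{\text{ROID}} + \mathcal{L}_{\text{marg}}
  + R_{\text{src}} \,\lambda\, \mathcal{L}_{\text{anchor}},
\qquad R_{\text{src}} \in [0, 1]
\\[4pt]
R_{\text{src}} \to 0
  \;\Longrightarrow\;
  \mathcal{L} \to \mathcal{L}_{\text{ROID}} + \mathcal{L}_{\text{marg}}
```

Under this reading, an un-gated method corresponds to fixing R_src = 1 regardless of how the source behaves.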

What carries the argument

The reliability gate that uses the frozen source's normalized predictive entropy to attenuate source-anchored terms in the CTTA objective.
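A minimal sketch of how such a gate could be computed, assuming the form R_src = max(0, 1 − H_src) quoted elsewhere on this page, with H_src taken as Shannon entropy divided by log(C) and averaged over the batch. The function names and the batch averaging are illustrative assumptions, not details confirmed by the paper.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of one posterior, divided by log(C) so it lies in [0, 1]."""
    num_classes = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(num_classes)


def source_reliability(batch_probs):
    """Hypothetical gate: R_src = max(0, 1 - H_src).

    H_src is the mean normalized entropy of the frozen source's posteriors
    over the batch (the averaging scheme is an assumption).
    """
    h_src = sum(normalized_entropy(p) for p in batch_probs) / len(batch_probs)
    return max(0.0, 1.0 - h_src)


# A uniform posterior closes the gate; a one-hot posterior opens it fully.
uniform = [[0.25, 0.25, 0.25, 0.25]]
confident = [[1.0, 0.0, 0.0, 0.0]]
print(source_reliability(uniform))    # close to 0
print(source_reliability(confident))  # 1.0
```

Because the gate is a continuous value rather than a hard switch, source-coupled terms multiplied by it fade out progressively as the source posterior flattens.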

If this is right

When source entropy rises the objective automatically drops the anchor and agreement filter and reverts to source-agnostic losses plus marginal calibration.
Combined with ASR, RMemSafe records the lowest error on eight of nine matched-split continual-corruption cells.
It improves ROID plus ASR by 1.05 percentage points on ResNet-50 and 0.48 points on ViT-B/16.
A source-degradation sweep produces a 1.13 times shallower harm slope than the un-gated baseline.
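One way to read the "shallower harm slope" statement is as a ratio of least-squares slopes of error against source accuracy. The sketch below assumes that reading; the paper's exact definition is not given on this page. All numbers are synthetic and chosen only so the arithmetic is easy to follow. They are not results from the paper.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic illustration only: source accuracy (%) and adapter error (%).
source_acc = [70.0, 50.0, 30.0, 10.0]
ungated_err = [40.0 + 0.50 * (70.0 - s) for s in source_acc]
gated_err = [40.0 + 0.40 * (70.0 - s) for s in source_acc]

# Harm slope: error points gained per point of source accuracy lost.
harm_ungated = -least_squares_slope(source_acc, ungated_err)
harm_gated = -least_squares_slope(source_acc, gated_err)
ratio = harm_ungated / harm_gated
print(round(ratio, 2))  # 1.25 for these synthetic lines
```

A ratio above 1 means the gated method loses accuracy more slowly than the un-gated baseline as the source degrades.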

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same entropy gate could be grafted onto other source-anchored CTTA algorithms to obtain similar graceful decay.
Combining entropy with additional signals such as prediction consistency across augmentations might catch the low-entropy error cases the current gate misses.
The mechanism suggests a general template for any online adaptation loop that must decide when to trust versus ignore a fixed reference model.

Load-bearing premise

That high normalized predictive entropy from the frozen source is a reliable indicator that anchoring should be reduced, without missing cases where the source remains low-entropy yet confidently wrong on the current stream.

What would settle it

Run a controlled experiment in which the source model is forced to output low-entropy but systematically incorrect predictions on the target stream; if RMemSafe continues to apply strong anchoring and error rises above an un-gated baseline, the entropy gate alone is insufficient.
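The blind spot this experiment targets follows from a basic property of entropy: it is invariant to relabeling the classes. The sketch below assumes a gate of the form max(0, 1 − H), with H the normalized entropy, and shows that a class-permuted posterior changes the predicted class without changing the gate value. The helper names are hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy divided by log(C), so the result lies in [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def permute_classes(probs, perm):
    """Return a posterior whose entry i is taken from class perm[i]."""
    return [probs[perm[i]] for i in range(len(probs))]


posterior = [0.90, 0.05, 0.03, 0.02]   # confident prediction for class 0
perm = [1, 2, 3, 0]                    # systematic relabeling of the classes
permuted = permute_classes(posterior, perm)

gate_before = max(0.0, 1.0 - normalized_entropy(posterior))
gate_after = max(0.0, 1.0 - normalized_entropy(permuted))

# The predicted class moves, but the entropy-based gate does not.
print(posterior.index(max(posterior)), permuted.index(max(permuted)))
print(abs(gate_before - gate_after) < 1e-12)
```

Any signal computed from entropy alone therefore cannot distinguish a correct confident source from one that is confidently and systematically wrong.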

Figures

Figures reproduced from arXiv: 2605.14063 by Biyao Zhang, Christian Gagné, Debargha Ganguly, Mohsen Hariri, Osama Zafar, Sabyasachi Sahoo, Shouren Wang, Sreehari Sankar, Vikash Singh, Vipin Chaudhary, Weicong Chen.

**Figure 1.** Continual test-time adaptation under collapsing source reliability. CTTA anchors the adapter to its frozen source, presuming the source stays a meaningful reference. Blue: on CCC, frozen RN-50 source top-1 collapses from 76% (clean ImageNet) to 1.3% (CCC-Hard). Red dashed: prior reset-based methods (ROID/ETA/EATA+ASR, ROID+RDumb) hold λ=2 across all severities, pulling the adapter toward near-noise output.…

**Figure 2.** Overview of RMEMSAFE. The reliability engine derives the source-reliability gate Rsrc = 1 − Hsrc from the frozen source's entropy. Rsrc gates all explicit source-coupled uses: the Dynamic Anchor, the agreement filter (interpolating between source agreement and pass-through), and the source-divergence scaling inside the anchor, while leaving the base ROID losses and marginal calibration ungated. At inferenc…

**Figure 3.** Left: paired per-split CCC comparison (n=54); RMEMSAFE is below y=x on 51 splits, on the diagonal (|∆| ≤ 0.02 pp) on 2, and 0.17 pp above on 1 (a CCC-Hard ViT split where both methods are at ∼98% error). Center, right: controlled source-degradation on CIN-C, varying source clean-test accuracy S via Gaussian weight noise (Appendix L). Error bars: ±1 std over 3 seeds; x-axis reversed. The RMEMSAFE+ASR − ROID…

**Figure 4.** Component ablation on CCC ResNet-50 (mean over 27 splits). Two of the five ablated components (anchor, source-expert agreement) are multiplied by Rsrc ≈ 0.26 at CCC-Hard runtime (App. K); the three ungated components (marg. calibration, confidence-scaled LR, decoupled flip) are not. The decoupled flip is the only contribution with a large leave-one-out effect (+0.79 pp). Gated components interpretation [PI…

**Figure 5.** Sweeps each of the five RMEMSAFE hyperparameters independently while holding the other four at their paper values. Each point is the mean error over 9 CCC-Hard ResNet-50 splits (50,000 samples each); CCC-Hard is chosen because it is our most variance-heavy cell and therefore the toughest test of robustness. Every sweep is flat to within 0.21 pp across the full range including 16× changes in λ and α and 100…

**Figure 6.** Source reliability Rsrc (blue, left axis) and gated Jensen–Shannon divergence Rsrc DJS (red, right axis) over a single CCC-Hard ResNet-50 split (split 3, 3,128 test batches). Traces are smoothed using a 25-batch running mean. The reliability stays near a low floor of 0.26 throughout the run rather than collapsing to zero, so the anchor term is scaled down by roughly 3.8× but is not deactivated. This predic…
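A quick check of the Figure 6 numbers, assuming the anchor is scaled multiplicatively by Rsrc and that Rsrc = 1 − Hsrc as stated in Figure 2:

```latex
R_{\text{src}} \approx 0.26
\;\Longrightarrow\;
\frac{1}{R_{\text{src}}} = \frac{1}{0.26} \approx 3.85,
\qquad
H_{\text{src}} = 1 - R_{\text{src}} \approx 0.74
```

This agrees with the caption's "roughly 3.8×" attenuation: on this split the source's normalized entropy sits near 0.74 rather than at its maximum of 1, so the gate attenuates the anchor without fully closing.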


Continual test-time adaptation (CTTA) updates a pretrained model online on an unlabeled, non-stationary stream while anchoring it to a frozen source checkpoint. This anchor is useful only when the source remains reliable. On CCC-Hard, however, a ResNet-50 source falls to approximately 1.3% top-1 accuracy, while existing source-anchored CTTA methods continue applying the same anchor strength. We call this failure mode blind anchoring and propose RMemSafe, a reliability-gated extension of ROID that uses the frozen source's normalized predictive entropy to attenuate all explicit source-coupled uses in the objective. When the source posterior approaches uniformity, the gate closes: the source anchor and agreement filter vanish, and the objective reduces to a source-agnostic fallback comprising ROID's base losses plus marginal calibration. Combined with ASR, RMemSafe achieves the lowest error on 8 of 9 matched-split continual-corruption cells and is the best reset-based method on all 9, improving ROID+ASR by 1.05 pp on ResNet-50 and 0.48 pp on ViT-B/16. A controlled source-degradation sweep shows a 1.13× shallower harm slope than ROID+ASR, consistent with the graceful-decay prediction. The entropy gate detects high-entropy source collapse, not confidently wrong low-entropy sources; this scope is explicitly evaluated and discussed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RMemSafe adds an entropy gate to ROID to mitigate blind anchoring in CTTA, with modest benchmark gains but acknowledged limits on low-entropy errors.


The one or two things to know are that RMemSafe gates the source anchor using entropy from the frozen model to handle cases where the source collapses on hard continual shifts, and it delivers modest improvements over ROID+ASR on the standard benchmarks. What is new is the use of normalized predictive entropy to dynamically remove source-coupled terms from the objective when the source posterior gets close to uniform. This turns the method into a source-agnostic fallback in those moments. The paper combines it with ASR and reports the lowest error on eight of nine matched-split cells, plus a controlled sweep where the harm from source degradation is 1.13 times shallower.

The experiments look solid enough for this kind of work. They use public continual-corruption splits and show consistent wins across ResNet-50 and ViT-B/16. The authors are upfront that the gate only addresses high-entropy failures and not low-entropy wrong predictions, which is a fair scope limitation they discuss.

The soft spots are minor but worth noting. The performance deltas are small, on the order of one percentage point, and the abstract view does not show variance or full implementation details for the gate threshold. If the entropy signal misses some failure modes in practice, the gains could shrink. Still, the central argument holds up within the evaluated setting.

This paper is for anyone working on continual test-time adaptation who wants a practical tweak to source-anchored methods. A reader dealing with streaming applications under distribution shift would find the gate idea and the degradation analysis useful. It has enough concrete results and honest scoping to merit a serious referee rather than an immediate reject. I would recommend sending it for peer review.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes RMemSafe, a reliability-gated extension of ROID for continual test-time adaptation. It uses the frozen source model's normalized predictive entropy to attenuate all explicit source-coupled terms (anchor and agreement filter) in the objective when the source posterior approaches uniformity, reducing to a source-agnostic fallback of ROID base losses plus marginal calibration. On matched-split continual-corruption benchmarks, RMemSafe+ASR reports the lowest error on 8 of 9 cells (best reset-based method on all 9), with 1.05 pp and 0.48 pp gains over ROID+ASR on ResNet-50 and ViT-B/16 respectively, plus a 1.13× shallower harm slope in a controlled source-degradation sweep. The paper explicitly scopes the gate to high-entropy collapse and states that low-entropy confident errors were evaluated.

Significance. If the empirical results hold under full protocol disclosure, the work directly addresses blind anchoring in source-anchored CTTA by providing a simple, entropy-based mechanism for graceful degradation. The controlled degradation sweep and explicit scope discussion are strengths that could inform future adaptive-anchoring designs. The method requires only one additional free parameter (gate scaling/threshold) and builds on public continual-corruption splits.

major comments (2)

The headline gains (lowest error on 8/9 cells, 1.05 pp / 0.48 pp improvements, 1.13× shallower slope) are load-bearing and rest on the gate correctly attenuating source terms precisely when needed. The manuscript states that the entropy gate is limited to high-entropy uniformity and that low-entropy confident errors were evaluated, yet no quantitative breakdown (e.g., frequency of low-entropy source errors on CCC-Hard or ablation isolating their impact on the reported deltas) is referenced in the provided description. This leaves the central robustness claim only partially supported.
Experimental Evaluation: the performance numbers are reported without error bars, run counts, or exact implementation details of the normalized entropy gate and attenuation schedule. These omissions make it difficult to verify that the observed improvements over ROID+ASR are statistically reliable rather than sensitive to seed or hyper-parameter choice.

minor comments (3)

Clarify the precise definition of 'normalized predictive entropy' and the functional form of the gate (multiplicative factor, threshold, or learned scaling) in the method section to support reproducibility.
The abstract notes a source accuracy drop to ~1.3% on CCC-Hard; consider adding a short table or plot showing source accuracy per corruption type to contextualize when the gate activates.
Minor notation: ensure consistent use of 'source-agnostic fallback' versus 'ROID base losses plus marginal calibration' across abstract and main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the detailed review and the suggestion of minor revision. The comments help us improve the clarity and robustness of our claims. We respond to each major comment in turn.


Referee: The headline gains (lowest error on 8/9 cells, 1.05 pp / 0.48 pp improvements, 1.13× shallower slope) are load-bearing and rest on the gate correctly attenuating source terms precisely when needed. The manuscript states that the entropy gate is limited to high-entropy uniformity and that low-entropy confident errors were evaluated, yet no quantitative breakdown (e.g., frequency of low-entropy source errors on CCC-Hard or ablation isolating their impact on the reported deltas) is referenced in the provided description. This leaves the central robustness claim only partially supported.

Authors: We appreciate this observation. The manuscript explicitly scopes the gate to high-entropy collapse and notes that low-entropy confident errors were evaluated during development. To address the request for a quantitative breakdown, we will include in the revised version a supplementary analysis reporting the frequency of low-entropy source errors on CCC-Hard and an ablation study isolating the impact of such cases on the performance deltas. This will further substantiate that the reported gains stem primarily from the gate's handling of high-entropy degradation rather than low-entropy errors.

revision: yes

Referee: Experimental Evaluation: the performance numbers are reported without error bars, run counts, or exact implementation details of the normalized entropy gate and attenuation schedule. These omissions make it difficult to verify that the observed improvements over ROID+ASR are statistically reliable rather than sensitive to seed or hyper-parameter choice.

Authors: We agree that providing these details would enhance reproducibility. In the revised manuscript, we will report results with error bars from multiple independent runs using different random seeds, include the exact run counts, and provide the precise formulation of the normalized entropy computation and the attenuation schedule (including the threshold and scaling parameter) in the main text or appendix. These additions are intended to show that the improvements are consistent across runs.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of internal fits


The paper defines RMemSafe explicitly as an entropy-gated extension of ROID, where normalized predictive entropy from the frozen source attenuates source-coupled terms in the objective when the posterior approaches uniformity. The reported gains (lowest error on 8/9 cells, 1.05 pp and 0.48 pp improvements, 1.13× shallower harm slope) are obtained from direct evaluation on public continual-corruption benchmarks rather than any fitted parameter or self-referential reduction. No self-definitional loop, fitted-input-called-prediction, or load-bearing self-citation chain appears in the derivation; the graceful-decay consistency is an empirical observation, not a tautology. The method remains self-contained against external measurements.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents exhaustive identification of fitted parameters or background assumptions; the gate likely introduces at least one threshold or scaling hyperparameter for entropy attenuation.

free parameters (1)

entropy gate scaling or threshold
The attenuation strength when entropy rises must be controlled by at least one tunable parameter not detailed in the abstract.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: Rsrc = max(0, 1 − Hsrc) ... When the source posterior approaches uniformity, the gate closes ... Proposition 1 (Graceful decay under source collapse)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 2 internal anchors

[1] Chen, W., Singh, V., Rahmani, Z., Ganguly, D., Hariri, M., and Chaudhary, V. (2025). K4: Online log anomaly detection via unsupervised typicality learning. arXiv preprint arXiv:2507.20051.
[2] Döbler, M., Marsden, R. A., and Yang, B. (2023). Robust mean teacher for continual and gradual test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7704–7714.
[3] Ganguly, D., Iyengar, S., Chaudhary, V., and Kalyanaraman, S. (2024). PROOF OF THOUGHT: Neurosymbolic program synthesis allows robust and interpretable reasoning. In The First Workshop on System-2 Reasoning at Scale, NeurIPS'24.
[4] Ganguly, D., Morningstar, W. R., Yu, A. S., and Chaudhary, V. (2025a). Forte: Finding outliers with representation typicality estimation. In The Thirteenth International Conference on Learning Representations.
[5] Ganguly, D., Sankar, S., Zhang, B., Singh, V., Gupta, K., Kavuru, H., Luo, A., et al. (2026). Trust the typical: An out-of-distribution safety detection framework. arXiv preprint arXiv:2602.04581. ICLR 2026.
[6] Ganguly, D., Singh, V., Sankar, S., Zhang, B., Zhang, X., Iyengar, S., Han, X., et al. (2025b). Grammars of formal uncertainty: When to trust LLMs in automated reasoning tasks. arXiv preprint arXiv:2505.20047. NeurIPS 2025.
[7] Gong, T., Jeong, J., Kim, T., Kim, Y., Shin, J., and Lee, S.-J. (2022). NOTE: Robust continual test-time adaptation against temporal correlation. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K., editors, Advances in Neural Information Processing Systems.
[8] Hendrycks, D. and Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations.
[9] Hoang, T.-H., Vo, D. M., and Do, M. N. (2024). Persistent test-time adaptation in recurring testing scenarios. In Advances in Neural Information Processing Systems (NeurIPS).
[10] Iwasawa, Y. and Matsuo, Y. (2021). Test-time classifier adjustment module for model-agnostic domain generalization. In Advances in Neural Information Processing Systems (NeurIPS).
[11] Lee, J., Jung, D., Lee, S., Park, J., Shin, J., Hwang, U., and Yoon, S. (2024). Entropy is not enough for test-time adaptation: From the perspective of disentangled factors. In International Conference on Learning Representations (ICLR). Spotlight (top 5%).
[12] Lee, J.-H. and Chang, J.-H. (2024). Continual momentum filtering on parameter space for online test-time adaptation. In The Twelfth International Conference on Learning Representations.
[13] Liang, J., He, R., and Tan, T. (2023). A comprehensive survey on test-time adaptation under distribution shifts. International Journal of Computer Vision.
[14] Lim, T., Hwang, J.-W., and Lee, K. (2026). When and where to reset matters for long-term test-time adaptation. arXiv preprint arXiv:2603.03796.
[15] Liu, Y., Kothari, P., van Delft, B., Bellot-Gurlet, B., Mordan, T., and Alahi, A. (2021). TTT++: When does self-supervised test-time training fail or thrive? In Advances in Neural Information Processing Systems (NeurIPS).
[16] Marsden, R. A., Döbler, M., and Yang, B. (2024). Universal test-time adaptation through weight ensembling, diversity weighting, and prior correction. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2554–2564.
[17] Mishra, H. (2026). Rdumb++: Drift-aware continual test-time adaptation.
[18] Mummadi, C. K., Hutmacher, R., Rambach, K., Levinkov, E., Brox, T., and Metzen, J. H. (2021). Test-time adaptation to distribution shift by confidence maximization and input transformation. arXiv preprint arXiv:2106.14999.
[19] Niu, S., Wu, J., Zhang, Y., Chen, Y., Zheng, S., Zhao, P., and Tan, M. (2022). Efficient test-time model adaptation without forgetting. In International Conference on Machine Learning, pages 16888–16905. PMLR.
[20] Niu, S., Wu, J., Zhang, Y., Wen, Z., Chen, Y., Zhao, P., and Tan, M. (2023). Towards stable test-time adaptation in dynamic wild world. In International Conference on Learning Representations.
[21] Prabhu, A., Torr, P. H., and Dokania, P. K. (2020). GDumb: A simple approach that questions our progress in continual learning. In European Conference on Computer Vision (ECCV), pages 524–540.
[22] Press, O., Schneider, S., Kümmerer, M., and Bethge, M. (2023). RDumb: A simple approach that questions our progress in continual test-time adaptation. In Advances in Neural Information Processing Systems, volume 36, pages 39915–39935.
[23] Rusak, E., Schneider, S., Pachitariu, G., Eck, L., Gehler, P. V., Bringmann, O., Brendel, W., and Bethge, M. (2022). If your data distribution shifts, use self-learning. Transactions on Machine Learning Research (TMLR).
[24] Schneider, S., Rusak, E., Eck, L., Bringmann, O., Brendel, W., and Bethge, M. (2020). Improving robustness against common corruptions by covariate shift adaptation. In Advances in Neural Information Processing Systems (NeurIPS).
[25] Singh, V., Cassel, D., Weir, N., Feng, N., and Bayless, S. (2026a). VERGE: Formal refinement and guidance engine for verifiable LLM reasoning. arXiv preprint arXiv:2601.20055.
[26] Singh, V., Ganguly, D., Yu, H., Zhou, C., Singh, P., Lee, B., Chaudhary, V., and Datta, G. (2026b). Toward guarantees for clinical reasoning in vision language models via formal verification. arXiv preprint arXiv:2602.24111.
[27] Song, J., Lee, J., Kweon, I. S., and Choi, S. (2023). EcoTTA: Memory-efficient continual test-time adaptation via self-distilled regularization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[28] Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A. A., and Hardt, M. (2020). Test-time training with self-supervision for generalization under distribution shifts. In Proceedings of the 37th International Conference on Machine Learning (ICML).
[29] Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. (2017). Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7167–7176.
[30] Wang, D., Shelhamer, E., Liu, S., Olshausen, B., and Darrell, T. (2021). Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations.
[31] Wang, N., Liang, T., Singh, V., Song, C., Yang, V., Yin, Y., Ma, J., Singh, J., et al. (2026a). HugRAG: Hierarchical causal knowledge graph design for RAG. arXiv preprint arXiv:2602.05143.
[32] Wang, Q., Fink, O., Van Gool, L., and Dai, D. (2022). Continual test-time domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7201–7211.
[33] Wang, S., Yang, W., Ma, C., Ganguly, D., Singh, V., Song, C., Li, X., Long, X., Chaudhary, V., and Han, X. (2026b). Path-lock expert: Separating reasoning mode in hybrid thinking via architecture-level separation.
[34] Yang, W., Ganguly, D., Li, X., Song, C., Wang, S., Singh, V., Chaudhary, V., and Han, X. (2026). Mid-Think: Training-free intermediate-budget reasoning via token-level triggers. arXiv preprint arXiv:2601.07036.
[35] Yuan, L., Xie, B., and Li, S. (2023). Robust test-time adaptation in dynamic scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15922–15932.
[36] Zhang, M., Levine, S., and Finn, C. (2022). MEMO: Test time robustness via adaptation and augmentation. In Advances in Neural Information Processing Systems (NeurIPS).
[37] Zhang, Q., Bian, Y., Kong, X., Zhao, P., and Zhang, C. (2025). COME: Test-time adaption by conservatively minimizing entropy. In International Conference on Learning Representations (ICLR).

A Algorithm. Algorithm 1 gives the full per-batch update rule of RMEMSAFE+ASR. The method combines the ROID backbone (soft-likelihood-ratio loss, diversity weighting, a...
[38] (where available). The offset is approximately constant (∼7 pp) across reset-based methods.

| Method | Local | Streamed | Offset |
|---|---|---|---|
| ROID | 86.07 | ∼79 | +7 |
| ROID+RDumb | 86.77 | ∼80 | +7 |
| ETA+ASR | 89.74 | ∼83 | +7 |
| EATA+ASR | 88.89 | ∼84 | +5 |
| ROID+ASR | 84.56 | 77.79 | +6.8 |
| RMEMSAFE+ASR (ours) | 83.81 | − | − |

R Broader Impact and Limitations. RMEMSAFE is designed for safety in continual test-time adaptation: it aims to...

[39] Scope of the reliability signal. Entropy-only; does not detect confidently miscalibrated sources. ViT-B/16 under a class-permuted source is the empirical witness (∆ = +1.14 pp, App. P).

[40] Reset-paradigm failure on CCC-Hard ViT-B/16. Every reset-based method we evaluate underperforms non-reset ROID on this cell, across base adapters and reset mechanisms (§4.3). The reliability-gated reset trigger (τgate = 0.40, App. O) recovers the non-reset mean but not per-split variance.

[41] Local-data offset on CCC. Our shards yield CCC-Hard numbers ∼7 pp harder than the streamed numbers of Lim et al. [14]; the offset is approximately constant across methods on ResNet-50. Cross-study absolute comparisons on CCC-Hard should be interpreted with caution; the matched-split head-to-head is the unbiased estimator of relative method quality.

[42] Fixed hyperparameters. The five core hyperparameters are held constant across all nine benchmark cells. Per-cell tuning would likely yield further small gains but is discouraged in the unlabeled test-time setting.

[43] Marginal-calibration EMA under abrupt label shift. The EMA prior (ρ = 0.01) lags abrupt label-distribution shifts; our streams exhibit gradual rather than abrupt shift, so this regime is not exercised.
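The EMA lag mentioned in the last limitation can be made concrete. The sketch below assumes the standard update prior ← (1 − ρ)·prior + ρ·observation, which is an assumption about the form of the paper's EMA, and counts how many updates are needed to absorb a given fraction of an abrupt step change. The function names are hypothetical.

```python
import math


def ema_steps_to_fraction(rho, fraction):
    """Smallest number of updates after which the EMA has absorbed `fraction`
    of a step change (after n updates it has moved 1 - (1 - rho)**n of the way)."""
    return math.ceil(math.log(1.0 - fraction) / math.log(1.0 - rho))


def ema_track(rho, old, new, steps):
    """Run the assumed EMA update for `steps` batches after an abrupt old -> new shift."""
    value = old
    for _ in range(steps):
        value = (1.0 - rho) * value + rho * new
    return value


rho = 0.01
print(ema_steps_to_fraction(rho, 0.5))   # 69 updates to cover half of the shift
print(ema_steps_to_fraction(rho, 0.9))   # 230 updates to cover 90% of the shift
```

Under this assumed update rule, ρ = 0.01 needs on the order of a few hundred batches to track an abrupt label shift, which is why the limitation only matters for streams with sudden rather than gradual changes.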

[1] [1]

Chen, W., Singh, V ., Rahmani, Z., Ganguly, D., Hariri, M., and Chaudhary, V . (2025). K4: Online log anomaly detection via unsupervised typicality learning.arXiv preprint arXiv:2507.20051

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

A., and Yang, B

Döbler, M., Marsden, R. A., and Yang, B. (2023). Robust mean teacher for continual and gradual test-time adaptation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7704–7714

work page 2023

[3] [3]

Ganguly, D., Iyengar, S., Chaudhary, V ., and Kalyanaraman, S. (2024). PROOF OF THOUGHT : Neurosymbolic program synthesis allows robust and interpretable reasoning. InThe First Workshop on System-2 Reasoning at Scale, NeurIPS’24

work page 2024

[4] [4]

R., Yu, A

Ganguly, D., Morningstar, W. R., Yu, A. S., and Chaudhary, V . (2025a). Forte : Finding outliers with representation typicality estimation. InThe Thirteenth International Conference on Learning Representations

work page

[5] [5]

Ganguly, D., Sankar, S., Zhang, B., Singh, V ., Gupta, K., Kavuru, H., Luo, A., et al. (2026). Trust the typical: An out-of-distribution safety detection framework.arXiv preprint arXiv:2602.04581. ICLR 2026

work page arXiv 2026

[6] [6]

Ganguly, D., Singh, V ., Sankar, S., Zhang, B., Zhang, X., Iyengar, S., Han, X., et al. (2025b). Grammars of formal uncertainty: When to trust LLMs in automated reasoning tasks.arXiv preprint arXiv:2505.20047. NeurIPS 2025

work page arXiv 2025

[7] [7]

Gong, T., Jeong, J., Kim, T., Kim, Y ., Shin, J., and Lee, S.-J. (2022). NOTE: Robust continual test-time adaptation against temporal correlation. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K., editors,Advances in Neural Information Processing Systems

work page 2022

[8] [8]

and Dietterich, T

Hendrycks, D. and Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. InInternational Conference on Learning Representations

work page 2019

[9]

Hoang, T.-H., Vo, D. M., and Do, M. N. (2024). Persistent test-time adaptation in recurring testing scenarios. In Advances in Neural Information Processing Systems (NeurIPS).

[10]

Iwasawa, Y. and Matsuo, Y. (2021). Test-time classifier adjustment module for model-agnostic domain generalization. In Advances in Neural Information Processing Systems (NeurIPS).

[11]

Lee, J., Jung, D., Lee, S., Park, J., Shin, J., Hwang, U., and Yoon, S. (2024). Entropy is not enough for test-time adaptation: From the perspective of disentangled factors. In International Conference on Learning Representations (ICLR). Spotlight (top 5%).

[12]

Lee, J.-H. and Chang, J.-H. (2024). Continual momentum filtering on parameter space for online test-time adaptation. In The Twelfth International Conference on Learning Representations.

[13]

Liang, J., He, R., and Tan, T. (2023). A comprehensive survey on test-time adaptation under distribution shifts. International Journal of Computer Vision.

[14]

Lim, T., Hwang, J.-W., and Lee, K. (2026). When and where to reset matters for long-term test-time adaptation. arXiv preprint arXiv:2603.03796.

[15]

Liu, Y., Kothari, P., van Delft, B., Bellot-Gurlet, B., Mordan, T., and Alahi, A. (2021). TTT++: When does self-supervised test-time training fail or thrive? In Advances in Neural Information Processing Systems (NeurIPS).

[16]

Marsden, R. A., Döbler, M., and Yang, B. (2024). Universal test-time adaptation through weight ensembling, diversity weighting, and prior correction. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2554–2564.

[17]

Mishra, H. (2026). Rdumb++: Drift-aware continual test-time adaptation.

[18]

Mummadi, C. K., Hutmacher, R., Rambach, K., Levinkov, E., Brox, T., and Metzen, J. H. (2021). Test-time adaptation to distribution shift by confidence maximization and input transformation. arXiv preprint arXiv:2106.14999.

[19]

Niu, S., Wu, J., Zhang, Y., Chen, Y., Zheng, S., Zhao, P., and Tan, M. (2022). Efficient test-time model adaptation without forgetting. In International Conference on Machine Learning, pages 16888–16905. PMLR.

[20]

Niu, S., Wu, J., Zhang, Y., Wen, Z., Chen, Y., Zhao, P., and Tan, M. (2023). Towards stable test-time adaptation in dynamic wild world. In International Conference on Learning Representations.

[21]

Prabhu, A., Torr, P. H., and Dokania, P. K. (2020). GDumb: A simple approach that questions our progress in continual learning. In European Conference on Computer Vision (ECCV), pages 524–540.

[22]

Press, O., Schneider, S., Kümmerer, M., and Bethge, M. (2023). RDumb: A simple approach that questions our progress in continual test-time adaptation. In Advances in Neural Information Processing Systems, volume 36, pages 39915–39935.

[23]

Rusak, E., Schneider, S., Pachitariu, G., Eck, L., Gehler, P. V., Bringmann, O., Brendel, W., and Bethge, M. (2022). If your data distribution shifts, use self-learning. Transactions on Machine Learning Research (TMLR).

[24]

Schneider, S., Rusak, E., Eck, L., Bringmann, O., Brendel, W., and Bethge, M. (2020). Improving robustness against common corruptions by covariate shift adaptation. In Advances in Neural Information Processing Systems (NeurIPS).

[25]

Singh, V., Cassel, D., Weir, N., Feng, N., and Bayless, S. (2026a). VERGE: Formal refinement and guidance engine for verifiable LLM reasoning. arXiv preprint arXiv:2601.20055.

[26]

Singh, V., Ganguly, D., Yu, H., Zhou, C., Singh, P., Lee, B., Chaudhary, V., and Datta, G. (2026b). Toward guarantees for clinical reasoning in vision language models via formal verification. arXiv preprint arXiv:2602.24111.

[27]

Song, J., Lee, J., Kweon, I. S., and Choi, S. (2023). EcoTTA: Memory-efficient continual test-time adaptation via self-distilled regularization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]

Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A. A., and Hardt, M. (2020). Test-time training with self-supervision for generalization under distribution shifts. In Proceedings of the 37th International Conference on Machine Learning (ICML).

[29]

Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. (2017). Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7167–7176.

[30]

Wang, D., Shelhamer, E., Liu, S., Olshausen, B., and Darrell, T. (2021). Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations.

[31]

Wang, N., Liang, T., Singh, V., Song, C., Yang, V., Yin, Y., Ma, J., Singh, J., et al. (2026a). HugRAG: Hierarchical causal knowledge graph design for RAG. arXiv preprint arXiv:2602.05143.

[32]

Wang, Q., Fink, O., Van Gool, L., and Dai, D. (2022). Continual test-time domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7201–7211.

[33]

Wang, S., Yang, W., Ma, C., Ganguly, D., Singh, V., Song, C., Li, X., Long, X., Chaudhary, V., and Han, X. (2026b). Path-lock expert: Separating reasoning mode in hybrid thinking via architecture-level separation.

[34]

Yang, W., Ganguly, D., Li, X., Song, C., Wang, S., Singh, V., Chaudhary, V., and Han, X. (2026). Mid-Think: Training-free intermediate-budget reasoning via token-level triggers. arXiv preprint arXiv:2601.07036.

[35]

Yuan, L., Xie, B., and Li, S. (2023). Robust test-time adaptation in dynamic scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15922–15932.

[36]

Zhang, M., Levine, S., and Finn, C. (2022). MEMO: Test time robustness via adaptation and augmentation. In Advances in Neural Information Processing Systems (NeurIPS).

[37]

Zhang, Q., Bian, Y., Kong, X., Zhao, P., and Zhang, C. (2025). COME: Test-time adaption by conservatively minimizing entropy. In International Conference on Learning Representations (ICLR).

A Algorithm

Algorithm 1 gives the full per-batch update rule of RMemSafe+ASR. The method combines the ROID backbone (soft-likelihood-ratio loss, diversity weighting, a...

(where available). The offset is approximately constant (∼7 pp) across reset-based methods.

Method                 Local    Streamed   Offset (pp)
ROID                   86.07    ∼79        +7
ROID+RDumb             86.77    ∼80        +7
ETA+ASR                89.74    ∼83        +7
EATA+ASR               88.89    ∼84        +5
ROID+ASR               84.56    77.79      +6.8
RMemSafe+ASR (ours)    83.81    −          −

R Broader Impact and Limitations

RMemSafe is designed for safety in continual test-time adaptation: it aims to...

Scope of the reliability signal. The signal is entropy-only, so it does not detect confidently miscalibrated sources. ViT-B/16 under a class-permuted source is the empirical witness (∆ = +1.14 pp, App. P).
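A minimal sketch of why an entropy-only signal cannot see a class-permuted source: Shannon entropy depends only on the set of probability values, so relabeling the classes leaves the normalized entropy unchanged even though the predicted class changes. The normalization by log K, the example probability vector, and the permutation below are illustrative assumptions, not the paper's implementation.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a probability vector divided by log(K), so it lies in [0, 1]."""
    k = len(probs)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)

def permute_classes(probs, perm):
    """Relabel classes: position i of the output takes the mass from position perm[i]."""
    return [probs[j] for j in perm]

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical source prediction: confident in class 0.
source_probs = [0.90, 0.05, 0.03, 0.02]
# Hypothetical class permutation standing in for a corrupted label mapping.
perm = [1, 2, 3, 0]
permuted_probs = permute_classes(source_probs, perm)

h_clean = normalized_entropy(source_probs)
h_permuted = normalized_entropy(permuted_probs)

# The entropy signal is (numerically) identical, but the predicted class differs.
same_signal = math.isclose(h_clean, h_permuted)
same_prediction = argmax(source_probs) == argmax(permuted_probs)
print(same_signal, same_prediction)
```

Any gate that is a function of this entropy alone would therefore treat the permuted source exactly like the clean one, which is the failure mode the limitation describes.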


Reset-paradigm failure on CCC-Hard ViT-B/16. Every reset-based method we evaluate underperforms non-reset ROID on this cell, across base adapters and reset mechanisms (§4.3). The reliability-gated reset trigger (τgate = 0.40, App. O) recovers the non-reset mean but not the per-split variance.

Local-data offset on CCC. Our shards yield CCC-Hard numbers ∼7 pp harder than the streamed numbers of Lim et al. [14]; the offset is approximately constant across methods on ResNet-50. Cross-study absolute comparisons on CCC-Hard should be interpreted with caution; the matched-split head-to-head is the unbiased estimator of relative method quality.

Fixed hyperparameters. The five core hyperparameters are held constant across all nine benchmark cells. Per-cell tuning would likely yield further small gains but is discouraged in the unlabeled test-time setting.

Marginal-calibration EMA under abrupt label shift. The EMA prior (ρ = 0.01) lags abrupt label-distribution shifts; our streams exhibit gradual rather than abrupt shift, so this regime is not exercised.
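To give a sense of the size of this lag, the sketch below assumes the common EMA form prior ← (1 − ρ)·prior + ρ·batch_marginal, with ρ as the weight on the newest batch; the text states only ρ = 0.01, so the update form and the two-class marginals are assumptions for illustration. Under that form, the weight left on the pre-shift prior after n updates is (1 − ρ)^n.

```python
import math

def ema_update(prior, batch_marginal, rho):
    """One EMA step (assumed form): blend the running prior with the new batch marginal."""
    return [(1.0 - rho) * p + rho * q for p, q in zip(prior, batch_marginal)]

def steps_until_residual_below(rho, fraction):
    """Smallest n such that (1 - rho)**n <= fraction."""
    return math.ceil(math.log(fraction) / math.log(1.0 - rho))

rho = 0.01
old_marginal = [0.9, 0.1]   # hypothetical label marginal before the shift
new_marginal = [0.1, 0.9]   # hypothetical label marginal after an abrupt shift

# Number of updates before the stale prior's influence drops to one half.
n_half = steps_until_residual_below(rho, 0.5)

# Simulate the abrupt shift: every batch after the shift has the new marginal.
prior = list(old_marginal)
for _ in range(n_half):
    prior = ema_update(prior, new_marginal, rho)

print(n_half, prior)
```

Under these assumptions the prior needs 69 updates just to move halfway toward the new marginal, which is why an abrupt label shift would leave the calibration term stale for many batches.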
