arxiv: 2505.19607 · v2 · submitted 2025-05-26 · 💻 cs.LG · cs.AI

Contrastive Residual Energy Test-time Adaptation

Pith reviewed 2026-05-19 12:40 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords test-time adaptation · energy-based models · contrastive learning · marginal distribution · model calibration · gradient reweighting · overfitting prevention

The pith

Test-time adaptation is reformulated as learning a residual energy function, yielding a contrastive objective that cancels the partition function and avoids entropy bias.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that test-time adaptation to shifted target distributions can be achieved by modeling the marginal distribution through a residual energy function rather than adjusting conditional predictions. This reformulation produces a contrastive training objective in which the intractable partition function cancels out exactly, eliminating the need for sampling or approximations that limit scalability. The approach further introduces adaptive gradient reweighting based on relative energy differences, which the analysis shows prevents overfitting and the self-confirming bias common in entropy minimization methods. A sympathetic reader would care because the result offers a practical, label-free way to improve model calibration and robustness under real computational constraints.

Core claim

Marginal distribution adaptation is reformulated as learning a residual energy function; the resulting contrastive objective cancels the partition function mathematically and uses relative energy differences for adaptive gradient reweighting that avoids the self-confirming bias of entropy minimization.

What carries the argument

Residual energy function, which represents the adjustment to the source marginal and enables a contrastive objective whose partition function term cancels exactly.
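
The cancellation can be made concrete with a short worked derivation. It starts from the factorization quoted in the appendix excerpt on this page, p_θ(x) = (1/Z) q_ϕ(x) exp(−Ẽ_θ(x)/β). The pairwise logistic loss and the symbol Δ_θ below are assumptions chosen for illustration and may differ from the paper's Equation 5.

```latex
% Log density ratio against the source model: Z does not depend on x.
\log\frac{p_\theta(x)}{q_\phi(x)} = -\frac{1}{\beta}\tilde{E}_\theta(x) - \log Z

% Difference between a target sample x_t and a source sample x_s: \log Z cancels.
\Delta_\theta(x_t, x_s)
  = \log\frac{p_\theta(x_t)}{q_\phi(x_t)} - \log\frac{p_\theta(x_s)}{q_\phi(x_s)}
  = \frac{1}{\beta}\left(\tilde{E}_\theta(x_s) - \tilde{E}_\theta(x_t)\right)

% Assumed pairwise logistic objective (illustrative, not confirmed by the source).
\mathcal{L}(\theta) = -\log \sigma\!\left(\Delta_\theta(x_t, x_s)\right)

% Its gradient carries a weight in (0, 1) that shrinks when the target energy is already lower.
\nabla_\theta \mathcal{L}
  = -\,\sigma\!\left(-\Delta_\theta(x_t, x_s)\right)\,\nabla_\theta \Delta_\theta(x_t, x_s)
```

Under this assumed form, no sampling is needed because only energy differences between observed samples enter the loss, and the factor σ(−Δ_θ) plays the role of the adaptive gradient weight described above.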

If this is right

  • Adaptation runs without sampling or approximation error from the partition function.
  • Predictions remain well-calibrated because the method never relies on uncertain label predictions.
  • Overfitting is reduced by reweighting gradients according to relative energy differences instead of entropy minimization.
  • The method remains practical under strict latency and compute limits typical of real-world deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same residual-energy reformulation could be tested on other marginal-adaptation problems such as domain generalization or continual learning.
  • Removing the relative-energy reweighting term would be expected to restore the overfitting behavior seen in pure entropy methods.
  • Combining the contrastive marginal update with existing conditional TTA techniques might produce hybrid systems that inherit calibration from one and task accuracy from the other.

Load-bearing premise

A residual energy function can be parameterized and optimized so that the contrastive objective preserves the desired marginal properties and supplies stable adaptive reweighting without new instabilities.

What would settle it

Apply CreTTA and a sampling-based energy baseline to the same distribution-shift benchmark and check whether the contrastive version matches or exceeds accuracy and calibration while requiring no sampling steps.

Figures

Figures reproduced from arXiv: 2505.19607 by Seoyun Yang, Taesup Kim, Yewon Han.

Figure 1. Comparison of classification (mCE), calibration (ECE) performance gains over source with corresponding computational costs (GFLOPs) to achieve improvements on CIFAR10-C (Sev 1-5 Avg). CRETTA achieves a better balance between performance and computation than the existing energy-based method, TEA.
Adjacent paper text (truncated): environmental conditions, between training and deployment. It is a major challenge in real-world scenarios, wher…
Figure 2. ImageNet-C (Sev 5) ECE(↓) and Acc(↑) over batch progress. CRETTA maintains stable calibration performance, while TEA experiences approximation error accumulation.
[Plot text: ECE (%) by corruption type (gaussian, shot, impulse, defocus, glass, motion, zoom, snow, frost, fog, brightness, contrast, elastic, pixelate, jpeg); legend: TEA (7.21%), TENT (4.85%), CRETTA (2.69%).]
Figure 3. Comparison of ECE (↓) on ImageNet-C across various corruption types, with results averaged over severities 1–5.
Adjacent paper text (truncated): Scalability. Furthermore, we validate that sampling-free design of CRETTA exhibits even more stable adaptation than TEA on large-scale datasets. As the data complexity increases, unresolvable approximation error of TEA becomes pronounced. As shown in …
Figure 4. Energy trajectories of target samples across severities (left) and the effect of relative residual energy learning on stable adaptation (right) in CIFAR100-C.
Adjacent paper text (truncated): samples. When the energy difference between source and target is already aligned with the desired preference (i.e., E_θ(x_t) < E_θ(x_s)), the gradient weight w(x_t, x_s) in Equation 5 decreases, leading to weaker updates. On the other hand, when the energy…
Figure 5. Comparison of Expected Calibration Error (ECE↓) on the CIFAR10-C dataset across different corruption severity levels.
[Plot: x-axis shift intensity (1 to 5 and Average); y-axis ECE (%), 0% to 20%; methods: Source, BN Adapt, SHOT, TENT, TEA, Ours.]
Figure 6. Comparison of Expected Calibration Error (ECE↓) on the CIFAR100-C dataset across different corruption severity levels.
Figure 7. Comparison of Expected Calibration Error (ECE↓) on the TinyImageNet-C dataset across different corruption severity levels.
Adjacent paper text (truncated): Detailed Performance Comparison on Calibration Error. In this section, we provide a detailed analysis of the Expected Calibration Error (ECE) for CIFAR10-C and CIFAR100-C. This expands upon the results shown in …
Figure 9. Ablation on varying values of β.
[Plot: panels CIFAR10-C and CIFAR100-C; y-axes Accuracy and ECE; x-axis ticks from 1.0 to 2.0; one curve per severity, Severity 1 to Severity 5.]
Figure 8. Ablation on varying β values on CIFAR10-C and CIFAR100-C at severity 1-5.
Adjacent paper text (truncated): Hyperparameter β Ablation. The hyperparameter β in Equation 4 controls the deviation from the pretrained source model, serving as a scaling parameter. To evaluate the robustness of our method, we experiment its performance across varying values of β, assessing both accuracy and expected calibration error (ECE) on CIFAR10-C, CIFAR100-…
Figure 10. The overconfidence problem of entropy minimization in test-time adaptation on CIFAR10-C. TENT tends to increase a model’s confidence in uncertain predictions as adaptation progresses, often leading to worse calibration due to overconfidence. In contrast, CRETTA (Ours) stabilizes the adaptation process by gradually reducing the expected calibration error.
[Plot: x-axis Iteration, 0 to 50.]
Figure 11. The overconfidence problem of entropy minimization in test-time adaptation on CIFAR100-C.
Figure 12. The overconfidence problem of entropy minimization in test-time adaptation on TinyImageNet-C.
Adjacent paper text (truncated): A.3.3. Detailed Analysis. Overconfidence Problem of Entropy Minimization. The overconfidence issue inherent in entropy minimization has been thoroughly investigated in prior works (Liu et al., 2020; Hendrycks & Gimpel, 2016; Guo et al., 2017). Building on this, we explored that increasing a model’s prediction conf…
Original abstract

Test-time adaptation (TTA) enhances model robustness by enabling adaptation to target distributions that differ from training distributions, improving real-world generalizability. However, most existing TTA approaches focus on adjusting the conditional distribution and therefore exhibit poor calibration, as they rely on uncertain predictions in the absence of labels. Energy-based TTA frameworks provide an alternative by modeling the marginal distribution of target data without depending on label predictions, but their reliance on costly sampling hinders scalability in real-world scenarios where decisions must be made without latency. In this work, we propose Contrastive Residual Energy Test-time Adaptation (CreTTA), a practical solution for reliable adaptation. We theoretically reformulate the marginal distribution adaptation as learning a residual energy function. This formulation leads to a contrastive objective where the intractable partition function mathematically cancels out, removing sampling and approximation error. Crucially, our analysis reveals that this design prevents overfitting through an adaptive gradient reweighting mechanism that leverages relative energy differences, avoiding the self-confirming bias of entropy minimization. Extensive experiments demonstrate that CreTTA achieves scalable and well-calibrated adaptation under real-world computational constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Contrastive Residual Energy Test-time Adaptation (CreTTA) for test-time adaptation. It reformulates marginal distribution adaptation as learning a residual energy function. This leads to a contrastive objective in which the intractable partition function cancels exactly, removing sampling and approximation error. The design incorporates an adaptive gradient reweighting mechanism based on relative energy differences that prevents overfitting and avoids the self-confirming bias of entropy minimization. Experiments are reported to demonstrate scalable, well-calibrated adaptation under real-world constraints.

Significance. If the central derivation is sound, the work could advance practical energy-based TTA by eliminating sampling costs while preserving marginal modeling advantages and improving calibration over conditional approaches. The explicit cancellation and adaptive reweighting, if rigorously established without hidden biases, would be a notable technical contribution.

major comments (2)
  1. [§3.2, Eq. (5)] The claim of exact mathematical cancellation of the partition function assumes the residual energy is the sole input-dependent term with shared base energy and normalization across positives and negatives. When the residual is parameterized by an input-dependent neural network, the effective partition function differs per sample, so cancellation is only approximate and the gradient retains an unaccounted bias term. An explicit error bound or revised derivation addressing input dependence is required to support removal of approximation error.
  2. [§4] The adaptive gradient reweighting is presented as preventing overfitting via relative energy differences and avoiding self-confirming bias. This mechanism is load-bearing for the central claim. However, the analysis lacks a formal stability argument or sufficient ablations demonstrating that the reweighting does not introduce new training instabilities under varying residual parameterizations.
minor comments (1)
  1. [Experiments] The experimental section would benefit from explicit reporting of statistical significance, hyperparameter sensitivity, and full protocol details to strengthen reproducibility claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and indicating planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3.2, Eq. (5)] The claim of exact mathematical cancellation of the partition function assumes the residual energy is the sole input-dependent term with shared base energy and normalization across positives and negatives. When the residual is parameterized by an input-dependent neural network, the effective partition function differs per sample, so cancellation is only approximate and the gradient retains an unaccounted bias term. An explicit error bound or revised derivation addressing input dependence is required to support removal of approximation error.

    Authors: We thank the referee for this precise observation on the derivation. The original formulation in Section 3.2 derives the contrastive objective under the assumption that the base energy is fixed from the source model and shared, allowing the partition function to cancel in the ratio of positive to negative residual energies. However, the referee correctly notes that an input-dependent residual network makes the effective normalization sample-specific. To address this, we will revise the derivation to explicitly separate the base and residual contributions, demonstrate that the resulting gradient bias term is bounded by the Lipschitz constant of the residual network, and include a new error bound (Lemma 1 in the revision) quantifying the approximation error under standard assumptions on the energy function. This revision will clarify the exact conditions and support the claim while preserving the core cancellation argument. revision: yes

  2. Referee: [§4] The adaptive gradient reweighting is presented as preventing overfitting via relative energy differences and avoiding self-confirming bias. This mechanism is load-bearing for the central claim. However, the analysis lacks a formal stability argument or sufficient ablations demonstrating that the reweighting does not introduce new training instabilities under varying residual parameterizations.

    Authors: We agree that the adaptive gradient reweighting mechanism in Section 4 requires stronger support to be fully convincing. We will add a formal stability argument showing that the reweighting coefficients, derived from relative energy differences, are strictly bounded in (0,1] and thus cannot amplify gradients beyond a controlled factor. In addition, we will expand the ablation studies to include variations in residual network depth, width, and initialization, reporting training dynamics (gradient norms and loss curves) across these configurations to empirically confirm the absence of new instabilities. These additions will be incorporated into the revised Section 4 and supplementary material. revision: yes
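
The boundedness argument can be illustrated numerically. The sketch below assumes a pairwise logistic loss on energy differences, which is one standard way to obtain a sigmoid-shaped gradient weight. The function name `pairwise_loss_and_weight`, the toy energies, and the loss form are hypothetical and are not taken from the paper's Equation 5.

```python
import numpy as np


def pairwise_loss_and_weight(e_target, e_source, beta=1.0):
    """Assumed logistic contrastive loss on energies and its gradient weight.

    Hypothetical helper for illustration; not the paper's Equation 5.
    """
    e_target = np.asarray(e_target, dtype=float)
    e_source = np.asarray(e_source, dtype=float)
    # delta > 0 means the target sample already has lower energy than the source sample.
    delta = (e_source - e_target) / beta
    # -log(sigmoid(delta)), computed stably as log(1 + exp(-delta)).
    loss = np.logaddexp(0.0, -delta)
    # Gradient weight sigmoid(-delta), computed stably as exp(-log(1 + exp(delta))).
    weight = np.exp(-np.logaddexp(0.0, delta))
    return loss, weight


# Toy energies: the target lies below, equal to, and above a source energy of -10.
losses, weights = pairwise_loss_and_weight([-12.0, -10.0, -8.0], -10.0, beta=1.0)
print(weights)
```

In this toy setting the weight is smallest for the target sample whose energy is already below the source energy, equals 0.5 when the two energies match, and approaches but never exceeds 1 as the target energy rises, which matches the qualitative behavior the rebuttal describes.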

Circularity Check

0 steps flagged

No significant circularity; derivation is a mathematical reformulation

Full rationale

The paper presents a theoretical reformulation of marginal distribution adaptation as learning a residual energy function, which leads to a contrastive objective with mathematical cancellation of the partition function. This is framed as an analytical derivation rather than a data-driven fit or self-referential definition. No load-bearing self-citations, uniqueness theorems from prior author work, or ansatz smuggling are evident in the abstract or claims. The adaptive gradient reweighting is described as an emergent property of the contrastive design, not a fitted quantity renamed as a prediction. The derivation chain remains self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review based solely on abstract; full parameterization details, training objectives, and experimental protocols unavailable. The central innovation rests on introducing a residual energy function and assuming standard energy-based modeling properties hold under the contrastive reformulation.

axioms (1)
  • [domain assumption] Energy functions can represent the marginal distribution of target data
    Foundational assumption for energy-based TTA approaches invoked to motivate the residual reformulation.
invented entities (1)
  • residual energy function [no independent evidence]
    purpose: To reformulate marginal distribution adaptation so that a contrastive objective can be derived with exact cancellation of the partition function
    New postulated function introduced to enable the claimed mathematical simplification and overfitting prevention; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5721 in / 1626 out tokens · 61537 ms · 2026-05-19T12:40:32.822016+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 6 internal anchors

  1. Croce, F., Andriushchenko, M., Sehwag, V., Debenedetti, E., Flammarion, N., Chiang, M., Mittal, P., and Hein, M. Robustbench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670.

  2. Grathwohl, W., Wang, K.-C., Jacobsen, J.-H., Duvenaud, D., Norouzi, M., and Swersky, K. Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263.

  3. Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261.

  4. Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136.

  5. Kim, T. and Bengio, Y. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439.

  6. Kingma, D. P. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  7. Niu, S., Wu, J., Zhang, Y., Wen, Z., Chen, Y., Zhao, P., and Tan, M. Towards stable test-time adaptation in dynamic wild world. arXiv preprint arXiv:2302.12400, 2023.

  8. Song, Y. and Kingma, D. P. How to train your energy-based models. arXiv preprint arXiv:2101.03288.

  9. Wang, D., Shelhamer, E., Liu, S., Olshausen, B., and Darrell, T. Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726.

  10. URL https://arxiv.org/abs/2312.10165.
      Yair, O. and Michaeli, T. Contrastive divergence learning is a time reversal adversarial game. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7.

  11. Zagoruyko, S. Wide residual networks. arXiv preprint arXiv:1605.07146.

  12. URL https://arxiv.org/abs/2306.03536.
      A. Technical Appendices. A.1. Derivation and Function of CRETTA. A.1.1. Derivation of CRETTA. The marginal distribution of target data $p_\theta$ can be written as the product of the pretrained source model $q_\phi$ and an exponential residual term: $p_\theta(x) = \frac{1}{Z}\, q_\phi(x) \exp\!\left(-\frac{1}{\beta}\tilde{E}_\theta(x)\right)$, ...

  13. during adaptation, comparing CRETTA against the variant where the contrastive terms are ablated.

      Table 16. Target energy and ECE of CRETTA vs. without contrastive terms during adaptation on CIFAR100-C (Severity 5).

      Batch Idx | Target Energy (w/ Contrastive) | Target Energy (w/o Contrastive) | ECE (w/ Contrastive) | ECE (w/o Contrastive)
      0 | -9.9781 | -9.9781 | 11.11% | 11.56%
      9 | -10.1208 | -10.9536 | 1...

  14. To further verify this, we additionally conducted an experiment where the buffer consists of only a single source sample.

      Method | CIFAR10-C Acc(↑) | CIFAR10-C ECE(↓) | CIFAR100-C Acc(↑) | CIFAR100-C ECE(↓) | TinyImageNet-C Acc(↑) | TinyImageNet-C ECE(↓)
      CRETTA | 88.30 | 4.15 | 64.52 | 7.99 | 40.30 | 13.52
      CRETTA with single source sample | 87.62 | 5.39 | 62.67 | 8.93 | 40.30 | 14.13

      Extended Buffer Ablation. While the specific content of the buffer has less impact on performance, as shown in Table 6, this does not imply that the source buffer itsel...

  15. and reported results over three different random seeds. These settings ensured consistency across experiments while highlighting the robustness and effectiveness of CRETTA. For the PACS domain-generalization task, we used a learning rate of 1e−3, a batch size of 100, applying source-sample augmentation in the same way as for CIFAR100-C. All experiments we...

  16. is a metric used to measure the calibration quality of a probabilistic model. Calibration refers to how closely the predicted probabilities of a model match the actual probabilities. ECE quantifies the discrepancy between predicted confidence and actual accuracy. ECE is calculated as shown in Equation 9: $\mathrm{ECE} = \sum_{m=1}^{M} \frac{|\mathrm{bin}_m|}{N} \cdot \left|\mathrm{confidence}_m - \mathrm{accuracy}_m\right|$ (9) ...
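
The ECE formula in the last excerpt can be sketched in a few lines. The version below assumes equal-width confidence bins on [0, 1] and a default of 15 bins; both choices, along with the function name `expected_calibration_error`, are assumptions for illustration, since the excerpt does not state the binning scheme.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE = sum_m |bin_m| / N * |confidence_m - accuracy_m| over non-empty bins.

    Assumes equal-width bins on [0, 1]; the bin count is a free choice here.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = confidences.size
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each confidence to a bin index in 0..n_bins-1 using the interior edges.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for m in range(n_bins):
        mask = bin_ids == m
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += mask.sum() / n * gap
    return ece


# Toy example: two confident correct predictions, two less confident with one error.
ece = expected_calibration_error([0.95, 0.95, 0.65, 0.65], [1, 1, 1, 0], n_bins=10)
print(ece)
```

In the toy example each occupied bin holds half of the samples, with gaps of 0.05 and 0.15 between mean confidence and accuracy, so the weighted sum is 0.1.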