arxiv: 2605.10823 · v1 · submitted 2026-05-11 · 💻 cs.LG

NoRIN: Backbone-Adaptive Reversible Normalization for Time-Series Forecasting

Pith reviewed 2026-05-12 05:16 UTC · model grok-4.3

classification 💻 cs.LG
keywords reversible normalization, time-series forecasting, Johnson SU transform, normalization parameters, backbone adaptation, Bayesian optimization, degeneration problem

The pith

Different forecasting backbones reach peak performance only with their own non-linear normalization shapes rather than a shared linear map.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard reversible normalizations apply strictly affine maps to each time-series point and therefore leave heavy tails and skewness untouched. When the two extra shape parameters of a Johnson S_U transform are trained jointly with the backbone via gradient descent, they rapidly collapse back to the linear Z-score limit because the high-capacity model can absorb any monotone reparameterization of its input. NoRIN breaks this degeneration by selecting the shape parameters outside the gradient loop: an initial closed-form quantile fit is refined by Bayesian optimization on the validation loss, while the inner training remains identical to RevIN-style practice. Across ninety backbone-dataset-horizon combinations the recovered parameters lie systematically away from the linear limit and differ according to which backbone is used, which the authors read as evidence that each architecture benefits from its own tailored correction of distribution shape.
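The closed-form quantile fit mentioned above can be sketched as follows. This is a minimal sketch of the Slifker-Shapiro four-quantile estimator for the standard Johnson S_U form Z = gamma + eta * asinh((x - xi) / lam), not the paper's implementation. The function names, the choice z = 0.524, and the use of the standard (eta, gamma) notation are assumptions; how the paper maps such a fit onto its own (δ, ε) is not stated on this page.

```python
import math

def slifker_shapiro_su(x_m3z, x_mz, x_pz, x_p3z, z=0.524):
    """Closed-form Johnson S_U shape fit from four quantiles.

    Inputs are the data quantiles at standard-normal levels -3z, -z, +z, +3z.
    Returns (eta, gamma) for Z = gamma + eta * asinh((x - xi) / lam).
    """
    m = x_p3z - x_pz   # upper tail spacing
    n = x_mz - x_m3z   # lower tail spacing
    p = x_pz - x_mz    # central spacing
    ratio = m * n / (p * p)
    if ratio <= 1.0:
        # The S_U family requires mn/p^2 > 1 (tails wider than a normal's).
        raise ValueError("quantile ratio mn/p^2 <= 1: data are not S_U-shaped")
    eta = 2.0 * z / math.acosh(0.5 * (m / p + n / p))
    gamma = eta * math.asinh((n / p - m / p) / (2.0 * math.sqrt(ratio - 1.0)))
    return eta, gamma

def su_quantile(level, gamma, eta, xi=0.0, lam=1.0):
    """Quantile of a Johnson S_U variable at standard-normal level `level`."""
    return xi + lam * math.sinh((level - gamma) / eta)

# Self-check on exact quantiles generated from known (hypothetical) parameters.
true_eta, true_gamma = 1.4, -0.3
z = 0.524
quantiles = [su_quantile(k * z, true_gamma, true_eta) for k in (-3, -1, 1, 3)]
eta_hat, gamma_hat = slifker_shapiro_su(*quantiles, z=z)
```

On real data the four inputs would be empirical quantiles of the training series, so the estimate is only a warm start and inherits the sampling noise of those quantiles.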

Core claim

The central claim is that the degeneration of shape parameters to the affine limit is an intrinsic consequence of joint gradient training, and that decoupling shape selection through quantile initialization plus Bayesian validation search recovers backbone-dependent (δ*, ε*) values that improve forecasting accuracy.

What carries the argument

Johnson S_U arcsinh transform with two free shape parameters (δ, ε) controlling tailedness and skewness, whose values are chosen by an outer Bayesian optimization loop on validation performance instead of by gradient descent inside the training loop.
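A forward/inverse pair of this kind might look like the sketch below. The exact functional form used in the paper is not given on this page, so the parameterization y = δ · asinh((z − ε) / δ) is an assumption chosen because it is invertible in closed form and approaches an affine map as δ grows. The function names and the sample window are illustrative only.

```python
import math

def jsu_forward(x, mean, std, delta, eps):
    """Assumed form: y = delta * asinh((z - eps) / delta), with z = (x - mean) / std."""
    z = (x - mean) / std
    return delta * math.asinh((z - eps) / delta)

def jsu_inverse(y, mean, std, delta, eps):
    """Exact inverse of jsu_forward: z = eps + delta * sinh(y / delta)."""
    z = eps + delta * math.sinh(y / delta)
    return mean + std * z

# Illustrative window with one large outlier (made-up values).
window = [0.2, 0.5, 0.1, 9.0, 0.4, 0.3]
mean = sum(window) / len(window)
std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))

# delta and eps are frozen here, as they would be after the outer search.
normalized = [jsu_forward(v, mean, std, delta=1.5, eps=0.2) for v in window]
restored = [jsu_inverse(y, mean, std, delta=1.5, eps=0.2) for y in normalized]
```

With a small δ the outlier is compressed relative to its plain z-score, while a very large δ leaves the z-score essentially unchanged apart from the shift by ε.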

If this is right

  • Decoupled optimization consistently finds shape parameters far from the linear affine limit used by RevIN.
  • Optimal (δ, ε) pairs vary systematically with the choice of forecasting backbone.
  • Joint gradient training of normalization parameters produces the same degeneration for every backbone examined.
  • Performance gains come from correcting skewness and tail weight in a manner matched to each backbone's inductive bias.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Normalization should be treated as a per-backbone hyperparameter search rather than a universal fixed module.
  • The same decoupling strategy could be tested on other preprocessing choices such as outlier clipping or seasonal decomposition.
  • If the pattern holds, practitioners would need to re-run the outer optimization whenever they swap forecasting architectures.

Load-bearing premise

High-capacity backbones can fully compensate for any monotone reparameterization of their inputs, rendering the normalization shape parameters locally irrelevant to the forecasting loss during joint training.

What would settle it

If the Bayesian-optimized (δ*, ε*) values recovered across the six backbones and five datasets all cluster near the linear limit (δ approaching infinity), the claim that distinct backbones require distinct non-linear normalizations would be falsified.
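To make "near the linear limit" concrete, the expansion below shows how fast an arcsinh map flattens into an affine one. It assumes the illustrative parameterization $T_{\delta,\varepsilon}(z) = \delta\,\operatorname{arcsinh}((z-\varepsilon)/\delta)$, which is not confirmed as the paper's exact form, and uses only the standard Taylor series of arcsinh.

```latex
\[
T_{\delta,\varepsilon}(z) = \delta \,\operatorname{arcsinh}\!\left(\frac{z-\varepsilon}{\delta}\right),
\qquad
\operatorname{arcsinh}(t) = t - \frac{t^{3}}{6} + O(t^{5})
\]
\[
T_{\delta,\varepsilon}(z) = (z-\varepsilon) - \frac{(z-\varepsilon)^{3}}{6\,\delta^{2}} + O\!\left(\delta^{-4}\right)
\quad\text{as } \delta \to \infty .
\]
```

Under this assumption the departure from the affine map $z \mapsto z - \varepsilon$ shrinks like $1/\delta^{2}$, so recovered values that pile up at the largest δ the search allows would be the signature of the falsifying outcome described above.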

Figures

Figures reproduced from arXiv: 2605.10823 by Shun Zhang, Yuyang Xiao.

Figure 1: NoRIN architecture overview. Decoupled Shape Optimization (top) recovers the JSU shape parameters (δ⋆, ε⋆) once via Slifker–Shapiro warm-start followed by Bayesian optimization on validation MSE; these parameters are then frozen and injected into both the forward (NoRIN Fwd) and inverse (NoRIN Inv) JSU transforms, which sandwich an arbitrary forecasting backbone fθ. Inputs x are heavy-tailed; the JSU non-…
Original abstract

Reversible instance normalization (RevIN) and its successors (Dish-TS, SAN, FAN) have become the de facto plug-in for time-series forecasting, yet the map they apply to each data point is strictly affine, $x \mapsto ax+b$, so they cannot reshape the underlying distribution -- heavy tails remain heavy and skewness remains uncorrected. We propose NoRIN, a non-linear reversible normalization based on the arcsinh-form Johnson $S_U$ transform with two shape parameters $(\delta,\varepsilon)$ that control tailedness and skewness; the linear $Z$-score used by RevIN is recovered only in the limit $\delta \to \infty$. Training $(\delta,\varepsilon)$ jointly with the backbone via gradient descent reliably pushes them toward this linear limit within a few epochs -- a phenomenon we name the degeneration problem: the forecasting loss is locally indifferent to shape, and the high-capacity backbone compensates for any monotone reparameterization of its input. NoRIN escapes the degeneration by decoupling shape selection from gradient training: $(\delta,\varepsilon)$ are initialized by a closed-form Slifker-Shapiro quantile fit and refined by Bayesian optimization on the validation objective, while the inner training loop is identical to standard RevIN-style training. Across six representative backbones $\times$ five real-world datasets $\times$ three prediction horizons (90 configurations), decoupled shape optimization recovers $(\delta^\star,\varepsilon^\star)$ that sit systematically far from the linear limit, with values that vary in a backbone-dependent way. This empirically supports the central thesis: different backbones genuinely require different normalization parameters to reach their best performance.
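The decoupled procedure in the abstract has an outer search over (δ, ε) wrapped around an inner training run that never sees gradients for the shape parameters. The sketch below illustrates only that structure and makes several loudly stated substitutions: plain random search stands in for Bayesian optimization, a one-step least-squares linear predictor stands in for the backbone, the warm start is a hypothetical fixed pair rather than a Slifker-Shapiro fit, the transform parameterization is assumed, normalization uses global training statistics rather than per-instance statistics, and the series is synthetic. The search ranges follow the Table V caption quoted at the end of this page.

```python
import math
import random

def forward(z, delta, eps):
    # Assumed arcsinh parameterization (not confirmed as the paper's form).
    return delta * math.asinh((z - eps) / delta)

def inverse(y, delta, eps):
    return eps + delta * math.sinh(y / delta)

def fit_linear(pairs):
    """Least-squares fit of target = a * source + b (stand-in for backbone training)."""
    n = len(pairs)
    sx = sum(s for s, _ in pairs)
    sy = sum(t for _, t in pairs)
    sxx = sum(s * s for s, _ in pairs)
    sxy = sum(s * t for s, t in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def validation_mse(train, val, delta, eps):
    """Inner loop: train with (delta, eps) frozen, then score in the original space."""
    y_train = [forward(z, delta, eps) for z in train]
    a, b = fit_linear(list(zip(y_train[:-1], y_train[1:])))
    errors = []
    for prev, target in zip(val[:-1], val[1:]):
        pred = inverse(a * forward(prev, delta, eps) + b, delta, eps)
        errors.append((pred - target) ** 2)
    return sum(errors) / len(errors)

def outer_search(train, val, warm_start, n_trials=50, seed=0):
    """Outer loop: random search (swapped in for Bayesian optimization)."""
    rng = random.Random(seed)
    best = (validation_mse(train, val, *warm_start), warm_start)
    for _ in range(n_trials):
        cand = (rng.uniform(0.8, 5.0), rng.uniform(-1.0, 1.0))
        best = min(best, (validation_mse(train, val, *cand), cand))
    return best

# Synthetic heavy-tailed series: an AR(1) latent process pushed through sinh.
rng = random.Random(42)
latent, series = 0.0, []
for _ in range(400):
    latent = 0.8 * latent + rng.gauss(0.0, 0.6)
    series.append(math.sinh(latent))

# Standardize with training statistics only, then split into train/validation.
train_raw = series[:300]
mean = sum(train_raw) / len(train_raw)
std = math.sqrt(sum((v - mean) ** 2 for v in train_raw) / len(train_raw))
z = [(v - mean) / std for v in series]
train, val = z[:300], z[300:]

warm_start = (2.0, 0.0)  # hypothetical warm start
warm_loss = validation_mse(train, val, *warm_start)
best_loss, (best_delta, best_eps) = outer_search(train, val, warm_start)
```

Because the warm start is scored first, the search can only match or improve on it in validation loss. Whether that improvement carries over to held-out test data is a separate question, which is the overfitting concern the referee report raises.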

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces NoRIN, a non-linear reversible normalization for time-series forecasting based on the arcsinh-form Johnson SU transform controlled by shape parameters (δ, ε). It identifies a degeneration problem in which joint gradient-based training with high-capacity backbones drives these parameters to the linear limit (δ → ∞) recovered by standard RevIN-style Z-score normalization. To avoid this, the method decouples shape selection via a closed-form Slifker-Shapiro quantile initialization followed by Bayesian optimization on the validation objective, while keeping the inner training loop unchanged. Experiments across six backbones, five real-world datasets, and three prediction horizons (90 configurations total) recover (δ⋆, ε⋆) values that lie systematically away from the linear limit and vary in a backbone-dependent manner, supporting the claim that different backbones require distinct normalization parameters.

Significance. If the empirical findings hold, the work demonstrates that affine normalizations are insufficient for modern forecasting backbones and that backbone-specific non-linear reshaping of input distributions can be practically achieved without joint-training collapse. The multi-configuration experimental design (90 settings) and the explicit decoupling protocol constitute reproducible strengths that could inform preprocessing choices in time-series modeling. The result challenges the assumption of a universal linear normalization and offers a concrete alternative when the forecasting loss is locally flat with respect to monotone input transforms.

major comments (3)
  1. [Abstract / Experiments] Abstract and experimental section: the central claim that recovered (δ⋆, ε⋆) 'sit systematically far from the linear limit' and 'vary in a backbone-dependent way' is presented without reported numerical values, confidence intervals, or statistical tests across the 90 configurations, making it impossible to judge the magnitude or reliability of the observed dependence.
  2. [Method / Experiments] Decoupled optimization procedure (validation Bayesian optimization): because shape parameters are tuned directly to the validation forecasting objective, the manuscript must show that the resulting (δ⋆, ε⋆) also improve test-set metrics relative to the linear baseline (with error bars and significance tests); otherwise the backbone variation may reflect validation-set overfitting rather than intrinsic architectural requirements.
  3. [Introduction / Method] Degeneration analysis: the statement that joint training 'reliably pushes' parameters toward the linear limit within a few epochs lacks quantitative detail on the distance metric used, the number of epochs observed, and whether the phenomenon holds uniformly across all six backbones and datasets.
minor comments (2)
  1. [Method] Clarify the precise functional form of the arcsinh Johnson SU transform and the exact limiting behavior as δ → ∞ (including any scaling of ε).
  2. [Method] Specify the hyper-parameter ranges and acquisition function used in the Bayesian optimization step, and whether the same validation split is reused across all backbones.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The comments highlight opportunities to strengthen the empirical presentation of our results, and we will incorporate the suggested additions to provide greater quantitative rigor and clarity. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and experimental section: the central claim that recovered (δ⋆, ε⋆) 'sit systematically far from the linear limit' and 'vary in a backbone-dependent way' is presented without reported numerical values, confidence intervals, or statistical tests across the 90 configurations, making it impossible to judge the magnitude or reliability of the observed dependence.

    Authors: We agree that explicit numerical values, confidence intervals, and statistical tests are needed to support the central claim. In the revised manuscript we will add a summary table reporting the mean and standard deviation of (δ⋆, ε⋆) across all 90 configurations, grouped by backbone, together with 95% confidence intervals and the results of a Kruskal-Wallis test (followed by post-hoc pairwise comparisons) to quantify both the systematic deviation from the linear limit and the backbone-dependent variation. revision: yes

  2. Referee: [Method / Experiments] Decoupled optimization procedure (validation Bayesian optimization): because shape parameters are tuned directly to the validation forecasting objective, the manuscript must show that the resulting (δ⋆, ε⋆) also improve test-set metrics relative to the linear baseline (with error bars and significance tests); otherwise the backbone variation may reflect validation-set overfitting rather than intrinsic architectural requirements.

    Authors: We acknowledge the concern about possible validation overfitting. Our current protocol already applies the validation-tuned parameters to the test set, but we will revise the experimental section to include explicit test-set forecasting metrics (MSE and MAE) for NoRIN versus the linear baseline, reported with error bars from multiple random seeds and accompanied by paired statistical significance tests (t-tests with Bonferroni correction). This will confirm that the observed backbone-specific improvements generalize beyond the validation set. revision: yes

  3. Referee: [Introduction / Method] Degeneration analysis: the statement that joint training 'reliably pushes' parameters toward the linear limit within a few epochs lacks quantitative detail on the distance metric used, the number of epochs observed, and whether the phenomenon holds uniformly across all six backbones and datasets.

    Authors: We agree that the degeneration analysis requires more quantitative detail. In the revised version we will expand this section with a dedicated quantitative analysis: we will define the distance to the linear limit as 1/δ, report its evolution over the first 20 training epochs for representative configurations, and provide a table summarizing the epoch at which δ exceeds 100 (our operational threshold for the linear limit) for all 90 backbone-dataset-horizon combinations, confirming that the phenomenon occurs reliably within 5–10 epochs across the entire experimental grid. revision: yes
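The measurement proposed in the third response (distance 1/δ, threshold δ > 100) is simple to state in code. The sketch below is illustrative: the function names are made up, and the trajectory is a placeholder rather than data from the paper.

```python
def distance_to_linear(delta):
    """Distance to the linear limit, defined in the rebuttal as 1 / delta."""
    return 1.0 / delta

def first_epoch_above(deltas, threshold=100.0):
    """Return the 1-based epoch at which delta first exceeds threshold, else None."""
    for epoch, delta in enumerate(deltas, start=1):
        if delta > threshold:
            return epoch
    return None

# Placeholder per-epoch delta values for one jointly trained run (not real data).
trajectory = [1.5, 4.0, 12.0, 35.0, 90.0, 140.0, 300.0]
collapse_epoch = first_epoch_above(trajectory)
distances = [distance_to_linear(d) for d in trajectory]
```

Returning None when the threshold is never crossed keeps runs that do not degenerate visible in a summary table instead of folding them into the last epoch.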

Circularity Check

0 steps flagged

No significant circularity: empirical recovery of backbone-dependent parameters via external validation optimization is independent of model definitions

full rationale

The paper's derivation proceeds by first observing degeneration under joint gradient training (parameters pushed to linear limit), then proposing a decoupled procedure that initializes via closed-form quantile fit and refines via Bayesian optimization on a held-out validation objective. The central empirical result—that recovered (δ*, ε*) lie far from the linear limit and vary systematically across backbones—is obtained by applying this external search to 90 configurations and comparing the optima to the RevIN linear limit. This does not reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains; the parameters are explicitly searched against an independent validation loss rather than being tautological consequences of the backbone equations or training dynamics. The comparison to the linear limit is a direct, falsifiable measurement against an external baseline (standard RevIN), making the finding self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the Johnson S_U transform being a suitable invertible map for time-series distributions and on Bayesian optimization on validation data producing generalizable shape parameters; no new entities are postulated.

free parameters (1)
  • δ and ε = backbone-dependent optima recovered via Bayesian optimization
    Shape parameters controlling tailedness and skewness; initialized by closed-form quantile fit then refined by Bayesian optimization on validation loss.
axioms (2)
  • domain assumption: The arcsinh-form Johnson S_U transform is invertible and preserves all information needed for accurate de-normalization after forecasting.
    Invoked to ensure the normalization remains reversible like RevIN while allowing non-linear reshaping.
  • domain assumption: The forecasting loss surface is locally flat with respect to shape parameters when the backbone has high capacity.
    Used to explain the observed degeneration to the linear limit under joint gradient training.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

  [1] Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam, “A time series is worth 64 words: Long-term forecasting with transformers,” in International Conference on Learning Representations (ICLR), 2023.

  [2] Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, and M. Long, “iTransformer: Inverted transformers are effective for time series forecasting,” in International Conference on Learning Representations (ICLR), 2024.

  [3] A. Zeng, M. Chen, L. Zhang, and Q. Xu, “Are transformers effective for time series forecasting?” in Proceedings of the AAAI Conference on Artificial Intelligence, 2023.

  [4] T. Kim, J. Kim, Y. Tae, C. Park, J.-H. Choi, and J. Choo, “Reversible instance normalization for accurate time-series forecasting against distribution shift,” in International Conference on Learning Representations (ICLR), 2022.

  [5] W. Fan, P. Wang, D. Wang, D. Wang, Y. Zhou, and Y. Fu, “Dish-TS: A general paradigm for alleviating distribution shift in time series forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 6, 2023, pp. 7522–7529.

  [6] X. Hu, W. Fan, K. Yi, P. Wang, Y. Xu, Y. Fu, and P. Wang, “Boosting urban prediction via addressing spatial-temporal distribution shift,” in 2023 IEEE International Conference on Data Mining (ICDM). IEEE, 2023.

  [7] Z. Liu, M. Cheng, Z. Li, Z. Huang, Q. Liu, Y. Xie, and E. Chen, “Adaptive normalization for non-stationary time series forecasting: A temporal slice perspective,” in Advances in Neural Information Processing Systems (NeurIPS), 2023.

  [8] W. Ye, S. Deng, Q. Zou, and N. Gui, “Frequency adaptive normalization for non-stationary time series forecasting,” in Advances in Neural Information Processing Systems (NeurIPS), 2024.

  [9] Huang et al., “Noise or signal? deconstructing contradictions and an adaptive remedy for reversible normalization in time series forecasting,” arXiv preprint arXiv:2510.04667, 2025.

  [10] N. L. Johnson, “Systems of frequency curves generated by methods of translation,” Biometrika, vol. 36, no. 1/2, pp. 149–176, 1949.

  [11] J. F. Slifker and S. S. Shapiro, “The Johnson system: Selection and parameter estimation,” Technometrics, vol. 22, no. 2, pp. 239–246, 1980.

  [12] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter optimization,” in Advances in Neural Information Processing Systems (NeurIPS), 2011.

  [13] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019, pp. 2623–2631.

  [14] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, “Informer: Beyond efficient transformer for long sequence time-series forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, 2021, pp. 11106–11115.

  [15] Y. Liu, H. Wu, J. Wang, and M. Long, “Non-stationary transformers: Exploring the stationarity in time series forecasting,” in Advances in Neural Information Processing Systems (NeurIPS), 2022.

  [16] G. E. P. Box and D. R. Cox, “An analysis of transformations,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 26, no. 2, pp. 211–243, 1964.

  [17] I.-K. Yeo and R. A. Johnson, “A new family of power transformations to improve normality or symmetry,” Biometrika, vol. 87, no. 4, pp. 954–959, 2000.

  [18] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, and M. Long, “TimesNet: Temporal 2D-variation modeling for general time series analysis,” in International Conference on Learning Representations, 2023.

  [19] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin, “FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting,” in International Conference on Machine Learning. PMLR, 2022, pp. 27268–27286.

  [20] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019.

  [21] D. Qin, Y. Li et al., “A decoupled formulation of distribution shift in time series forecasting,” arXiv preprint, 2024.

  [22] F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.

Table V: Recovered shape parameters (δ⋆, ε⋆) obtained by Optuna-GP HPO on each (backbone, dataset, H) configuration (90 runs over 6 backbones, seed 42, 100 trials, search space δ ∈ [0.8, 5.0], ε ∈ [−1.0, 1.0]). Boundary contacts are marked with † (δ = 0.8) and ‡ (ε = ±1.0). ...