NoRIN: Backbone-Adaptive Reversible Normalization for Time-Series Forecasting
Pith reviewed 2026-05-12 05:16 UTC · model grok-4.3
The pith
Different forecasting backbones reach peak performance only with their own non-linear normalization shapes rather than a shared linear map.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the degeneration of shape parameters to the affine limit is an intrinsic consequence of joint gradient training, and that decoupling shape selection through quantile initialization plus Bayesian validation search recovers backbone-dependent (δ*, ε*) values that improve forecasting accuracy.
What carries the argument
Johnson S_U arcsinh transform with two free shape parameters (δ, ε) controlling tailedness and skewness, whose values are chosen by an outer Bayesian optimization loop on validation performance instead of by gradient descent inside the training loop.
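As a minimal sketch of the transform named above, the forward and inverse maps might look as follows. The parameterization is an assumption, not the paper's confirmed form: it takes the classical Johnson S_U map z = γ + δ·asinh((x − ξ)/λ), writes ε in place of γ, and ties λ to δ times a per-instance scale so that large δ approaches a z-score. The function names and the `loc`/`scale` arguments are hypothetical.

```python
import math

def su_forward(x, delta, eps, loc=0.0, scale=1.0):
    """Assumed arcsinh-form map: eps + delta * asinh(u / delta), with u = (x - loc) / scale."""
    u = (x - loc) / scale
    return eps + delta * math.asinh(u / delta)

def su_inverse(y, delta, eps, loc=0.0, scale=1.0):
    """Exact inverse of su_forward, as needed to de-normalize a forecast."""
    u = delta * math.sinh((y - eps) / delta)
    return loc + scale * u

# Usage: a round trip, then a comparison with the plain z-score at very large delta.
x = 7.5
y = su_forward(x, delta=1.5, eps=0.2, loc=2.0, scale=3.0)
print(su_inverse(y, delta=1.5, eps=0.2, loc=2.0, scale=3.0))
print(su_forward(x, delta=1e6, eps=0.0, loc=2.0, scale=3.0), (x - 2.0) / 3.0)
```

Because sinh is the exact inverse of asinh, reversibility holds for any δ > 0; only the shape of the map changes with (δ, ε).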
If this is right
- Decoupled optimization consistently finds shape parameters far from the linear affine limit used by RevIN.
- Optimal (δ, ε) pairs vary systematically with the choice of forecasting backbone.
- Joint gradient training of normalization parameters produces the same degeneration for every backbone examined.
- Performance gains come from correcting skewness and tail weight in a manner matched to each backbone's inductive bias.
Where Pith is reading between the lines
- Normalization should be treated as a per-backbone hyperparameter search rather than a universal fixed module.
- The same decoupling strategy could be tested on other preprocessing choices such as outlier clipping or seasonal decomposition.
- If the pattern holds, practitioners would need to re-run the outer optimization whenever they swap forecasting architectures.
Load-bearing premise
High-capacity backbones can fully compensate for any monotone reparameterization of their inputs, rendering the normalization shape parameters locally irrelevant to the forecasting loss during joint training.
What would settle it
If the Bayesian-optimized (δ*, ε*) values recovered across the six backbones and five datasets all cluster near the linear limit (δ approaching infinity), the claim that distinct backbones require distinct non-linear normalizations would be falsified.
Original abstract
Reversible instance normalization (RevIN) and its successors (Dish-TS, SAN, FAN) have become the de facto plug-in for time-series forecasting, yet the map they apply to each data point is strictly affine, $x \mapsto ax+b$, so they cannot reshape the underlying distribution -- heavy tails remain heavy and skewness remains uncorrected. We propose NoRIN, a non-linear reversible normalization based on the arcsinh-form Johnson $S_U$ transform with two shape parameters $(\delta,\varepsilon)$ that control tailedness and skewness; the linear $Z$-score used by RevIN is recovered only in the limit $\delta \to \infty$. Training $(\delta,\varepsilon)$ jointly with the backbone via gradient descent reliably pushes them toward this linear limit within a few epochs -- a phenomenon we name the degeneration problem: the forecasting loss is locally indifferent to shape, and the high-capacity backbone compensates for any monotone reparameterization of its input. NoRIN escapes the degeneration by decoupling shape selection from gradient training: $(\delta,\varepsilon)$ are initialized by a closed-form Slifker-Shapiro quantile fit and refined by Bayesian optimization on the validation objective, while the inner training loop is identical to standard RevIN-style training. Across six representative backbones $\times$ five real-world datasets $\times$ three prediction horizons (90 configurations), decoupled shape optimization recovers $(\delta^\star,\varepsilon^\star)$ that sit systematically far from the linear limit, with values that vary in a backbone-dependent way. This empirically supports the central thesis: different backbones genuinely require different normalization parameters to reach their best performance.
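The decoupling described in the abstract can be pictured as a two-level loop. The sketch below is only an illustration under stated assumptions: plain random search stands in for Bayesian optimization, `train_and_validate` is a hypothetical stub with a toy loss surface rather than a real backbone, and the default search ranges, trial count, and seed are taken from the Table V caption quoted at the end of the reference list.

```python
import random

def train_and_validate(delta, eps):
    """Hypothetical stand-in for: train the backbone with (delta, eps) frozen,
    then return the validation loss. A toy quadratic keeps the sketch runnable."""
    return (delta - 1.7) ** 2 + 0.5 * (eps + 0.3) ** 2

def outer_search(init, n_trials=100, seed=42,
                 delta_range=(0.8, 5.0), eps_range=(-1.0, 1.0)):
    """Outer loop: pick (delta, eps) by validation loss only.
    Random search is used here in place of Bayesian optimization."""
    rng = random.Random(seed)
    best = init                                # e.g. a quantile-based initial guess
    best_loss = train_and_validate(*init)
    for _ in range(n_trials):
        cand = (rng.uniform(*delta_range), rng.uniform(*eps_range))
        loss = train_and_validate(*cand)       # inner training sees fixed shape values
        if loss < best_loss:
            best, best_loss = cand, loss
    return best, best_loss

best, loss = outer_search(init=(2.5, 0.0))
print(best, loss)
```

The point of the structure is that no gradient ever reaches the shape parameters: the inner loop treats them as constants, so they cannot drift toward the linear limit during training.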
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NoRIN, a non-linear reversible normalization for time-series forecasting based on the arcsinh-form Johnson S_U transform controlled by shape parameters (δ, ε). It identifies a degeneration problem in which joint gradient-based training with high-capacity backbones drives these parameters to the linear limit (δ → ∞) recovered by standard RevIN-style Z-score normalization. To avoid this, the method decouples shape selection via a closed-form Slifker-Shapiro quantile initialization followed by Bayesian optimization on the validation objective, while keeping the inner training loop unchanged. Experiments across six backbones, five real-world datasets, and three prediction horizons (90 configurations total) recover (δ⋆, ε⋆) values that lie systematically away from the linear limit and vary in a backbone-dependent manner, supporting the claim that different backbones require distinct normalization parameters.
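For readers unfamiliar with the quantile initialization mentioned in the summary, here is a hedged sketch of the Slifker-Shapiro closed-form fit for the classical four-parameter Johnson S_U form z = γ + δ·asinh((x − ξ)/λ). How the paper maps these four values onto its (δ, ε) pair is not stated in this page; treating γ as the counterpart of ε is an assumption. The function names, the interpolation rule for empirical quantiles, and the default z = 0.524 are illustrative choices.

```python
import math
from statistics import NormalDist

def _quantile(sorted_x, q):
    """Empirical quantile by linear interpolation, 0 <= q <= 1."""
    pos = q * (len(sorted_x) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (pos - lo) * (sorted_x[hi] - sorted_x[lo])

def slifker_shapiro_su(x, z=0.524):
    """Closed-form S_U fit from four quantiles at normal scores -3z, -z, z, 3z.
    Returns (delta, gamma, xi, lam)."""
    xs = sorted(x)
    cdf = NormalDist().cdf
    x_m3, x_m1, x_p1, x_p3 = (_quantile(xs, cdf(k * z)) for k in (-3, -1, 1, 3))
    m, n, p = x_p3 - x_p1, x_m1 - x_m3, x_p1 - x_m1
    if p <= 0 or m * n / p ** 2 <= 1.0:
        raise ValueError("quantile ratio mn/p^2 <= 1: data are not S_U-shaped")
    a, b = m / p, n / p
    root = math.sqrt(a * b - 1.0)
    delta = 2.0 * z / math.acosh(0.5 * (a + b))
    gamma = delta * math.asinh((b - a) / (2.0 * root))
    lam = 2.0 * p * root / ((a + b - 2.0) * math.sqrt(a + b + 2.0))
    xi = 0.5 * (x_p1 + x_m1) + p * (b - a) / (2.0 * (a + b - 2.0))
    return delta, gamma, xi, lam

# Usage on synthetic S_U-shaped data (not data from the paper).
nd = NormalDist()
sample = [2.0 + 3.0 * math.sinh((nd.inv_cdf((i + 0.5) / 5001) - 0.4) / 1.5)
          for i in range(5001)]
print(slifker_shapiro_su(sample))
```

The fit needs only four order statistics, which is why it is cheap enough to serve as a starting point for a more expensive outer search.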
Significance. If the empirical findings hold, the work demonstrates that affine normalizations are insufficient for modern forecasting backbones and that backbone-specific non-linear reshaping of input distributions can be practically achieved without joint-training collapse. The multi-configuration experimental design (90 settings) and the explicit decoupling protocol constitute reproducible strengths that could inform preprocessing choices in time-series modeling. The result challenges the assumption of a universal linear normalization and offers a concrete alternative when the forecasting loss is locally flat with respect to monotone input transforms.
major comments (3)
- [Abstract / Experiments] Abstract and experimental section: the central claim that recovered (δ⋆, ε⋆) 'sit systematically far from the linear limit' and 'vary in a backbone-dependent way' is presented without reported numerical values, confidence intervals, or statistical tests across the 90 configurations, making it impossible to judge the magnitude or reliability of the observed dependence.
- [Method / Experiments] Decoupled optimization procedure (validation Bayesian optimization): because shape parameters are tuned directly to the validation forecasting objective, the manuscript must show that the resulting (δ⋆, ε⋆) also improve test-set metrics relative to the linear baseline (with error bars and significance tests); otherwise the backbone variation may reflect validation-set overfitting rather than intrinsic architectural requirements.
- [Introduction / Method] Degeneration analysis: the statement that joint training 'reliably pushes' parameters toward the linear limit within a few epochs lacks quantitative detail on the distance metric used, the number of epochs observed, and whether the phenomenon holds uniformly across all six backbones and datasets.
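One way to quantify the backbone dependence questioned in the first major comment is a rank-based test across backbone groups. The sketch below computes the Kruskal-Wallis H statistic by hand; it assumes no tied values, and the numbers are placeholders invented for illustration, not results from the paper.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for k independent groups (assumes no ties)."""
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    rank_sums = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sums[gi] += rank
    n_total = len(pooled)
    s = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    return 12.0 / (n_total * (n_total + 1)) * s - 3.0 * (n_total + 1)

# Placeholder values only, NOT from the paper: delta* for three hypothetical backbones.
delta_star = {
    "backbone_A": [1.1, 1.3, 1.2, 1.4],
    "backbone_B": [2.0, 2.4, 2.2, 2.6],
    "backbone_C": [3.1, 3.5, 3.3, 3.7],
}
h = kruskal_wallis_h(list(delta_star.values()))
print(f"H = {h:.3f} with {len(delta_star) - 1} degrees of freedom")
```

H is compared against a chi-square distribution with k − 1 degrees of freedom. In practice a library routine such as `scipy.stats.kruskal` would be preferable, since it also corrects for ties and returns a p-value.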
minor comments (2)
- [Method] Clarify the precise functional form of the arcsinh-form Johnson S_U transform and the exact limiting behavior as δ → ∞ (including any scaling of ε).
- [Method] Specify the hyper-parameter ranges and acquisition function used in the Bayesian optimization step, and whether the same validation split is reused across all backbones.
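To make the first minor comment concrete, here is the limiting behaviour for one assumed parameterization, $f_{\delta,\varepsilon}(u) = \varepsilon + \delta\,\operatorname{arcsinh}(u/\delta)$ with $u = (x-\mu)/\sigma$. This form is an illustration only; the paper's exact definition may differ.

```latex
\operatorname{arcsinh}(t) = t - \frac{t^{3}}{6} + O(t^{5})
\quad\Longrightarrow\quad
f_{\delta,\varepsilon}(u)
  = \varepsilon + \delta\,\operatorname{arcsinh}\!\left(\frac{u}{\delta}\right)
  = \varepsilon + u - \frac{u^{3}}{6\,\delta^{2}}
    + O\!\left(\frac{u^{5}}{\delta^{4}}\right).
```

Under this assumption the map tends to $u + \varepsilon$ as $\delta \to \infty$ for fixed $u$, so the plain Z-score is recovered exactly only if $\varepsilon$ is zero or is rescaled to vanish in the limit, and the leading non-linear correction decays as $1/\delta^{2}$.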
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. The comments highlight opportunities to strengthen the empirical presentation of our results, and we will incorporate the suggested additions to provide greater quantitative rigor and clarity. We respond to each major comment below.
- Referee: [Abstract / Experiments] Abstract and experimental section: the central claim that recovered (δ⋆, ε⋆) 'sit systematically far from the linear limit' and 'vary in a backbone-dependent way' is presented without reported numerical values, confidence intervals, or statistical tests across the 90 configurations, making it impossible to judge the magnitude or reliability of the observed dependence.
  Authors: We agree that explicit numerical values, confidence intervals, and statistical tests are needed to support the central claim. In the revised manuscript we will add a summary table reporting the mean and standard deviation of (δ⋆, ε⋆) across all 90 configurations, grouped by backbone, together with 95% confidence intervals and the results of a Kruskal-Wallis test (followed by post-hoc pairwise comparisons) to quantify both the systematic deviation from the linear limit and the backbone-dependent variation.
  Revision: yes
- Referee: [Method / Experiments] Decoupled optimization procedure (validation Bayesian optimization): because shape parameters are tuned directly to the validation forecasting objective, the manuscript must show that the resulting (δ⋆, ε⋆) also improve test-set metrics relative to the linear baseline (with error bars and significance tests); otherwise the backbone variation may reflect validation-set overfitting rather than intrinsic architectural requirements.
  Authors: We acknowledge the concern about possible validation overfitting. Our current protocol already applies the validation-tuned parameters to the test set, but we will revise the experimental section to include explicit test-set forecasting metrics (MSE and MAE) for NoRIN versus the linear baseline, reported with error bars from multiple random seeds and accompanied by paired statistical significance tests (t-tests with Bonferroni correction). This will confirm that the observed backbone-specific improvements generalize beyond the validation set.
  Revision: yes
- Referee: [Introduction / Method] Degeneration analysis: the statement that joint training 'reliably pushes' parameters toward the linear limit within a few epochs lacks quantitative detail on the distance metric used, the number of epochs observed, and whether the phenomenon holds uniformly across all six backbones and datasets.
  Authors: We agree that the degeneration analysis requires more quantitative detail. In the revised version we will expand this section with a dedicated quantitative analysis: we will define the distance to the linear limit as 1/δ, report its evolution over the first 20 training epochs for representative configurations, and provide a table summarizing the epoch at which δ exceeds 100 (our operational threshold for the linear limit) for all 90 backbone-dataset-horizon combinations, confirming that the phenomenon occurs reliably within 5–10 epochs across the entire experimental grid.
  Revision: yes
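The degeneration metric proposed in the last response is simple enough to sketch directly. The distance 1/δ and the threshold δ > 100 come from the authors' reply; the helper names and the δ trajectory are hypothetical, and the trajectory is synthetic rather than measured.

```python
def distance_to_linear(delta):
    """Distance to the linear limit, defined in the rebuttal as 1 / delta."""
    return 1.0 / delta

def first_epoch_above(deltas, threshold=100.0):
    """1-based epoch at which delta first exceeds the threshold, or None if it never does."""
    for epoch, d in enumerate(deltas, start=1):
        if d > threshold:
            return epoch
    return None

# Synthetic trajectory for illustration only: delta doubles each epoch, starting at 1.5.
trajectory = [1.5 * 2 ** k for k in range(20)]
print([round(distance_to_linear(d), 4) for d in trajectory[:5]])
print(first_epoch_above(trajectory))  # -> 8, since 1.5 * 2**7 = 192 is the first value above 100
```

Reporting the crossing epoch per configuration, alongside the 1/δ curve, would let a reader check the "within 5–10 epochs" claim row by row.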
Circularity Check
No significant circularity: empirical recovery of backbone-dependent parameters via external validation optimization is independent of model definitions
full rationale
The paper's derivation proceeds by first observing degeneration under joint gradient training (parameters pushed to linear limit), then proposing a decoupled procedure that initializes via closed-form quantile fit and refines via Bayesian optimization on a held-out validation objective. The central empirical result—that recovered (δ*, ε*) lie far from the linear limit and vary systematically across backbones—is obtained by applying this external search to 90 configurations and comparing the optima to the RevIN linear limit. This does not reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains; the parameters are explicitly searched against an independent validation loss rather than being tautological consequences of the backbone equations or training dynamics. The comparison to the linear limit is a direct, falsifiable measurement against an external baseline (standard RevIN), making the finding self-contained against the reported benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- δ and ε = backbone-dependent optima recovered via Bayesian optimization
axioms (2)
- domain assumption: The arcsinh-form Johnson S_U transform is invertible and preserves all information needed for accurate de-normalization after forecasting.
- domain assumption: The forecasting loss surface is locally flat with respect to shape parameters when the backbone has high capacity.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel; dAlembert_to_ODE_general [echoes]
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "the linear Z-score used by RevIN is recovered only in the limit δ→∞... Training (δ,ε) jointly with the backbone via gradient descent reliably pushes them toward this linear limit... the degeneration problem: the forecasting loss is locally indifferent to shape, and the high-capacity backbone compensates for any monotone reparameterization"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: costAlphaLog_fourth_deriv_at_zero; J_uniquely_calibrated_via_higher_derivative [echoes]
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "arcsinh-form Johnson S_U transform with two shape parameters (δ,ε) that control tailedness and skewness"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam, “A time series is worth 64 words: Long-term forecasting with transformers,” in International Conference on Learning Representations (ICLR), 2023.
- [2] Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, and M. Long, “iTransformer: Inverted transformers are effective for time series forecasting,” in International Conference on Learning Representations (ICLR), 2024.
- [3] A. Zeng, M. Chen, L. Zhang, and Q. Xu, “Are transformers effective for time series forecasting?” in Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
- [4] T. Kim, J. Kim, Y. Tae, C. Park, J.-H. Choi, and J. Choo, “Reversible instance normalization for accurate time-series forecasting against distribution shift,” in International Conference on Learning Representations (ICLR), 2022.
- [5] W. Fan, P. Wang, D. Wang, D. Wang, Y. Zhou, and Y. Fu, “Dish-TS: A general paradigm for alleviating distribution shift in time series forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 6, 2023, pp. 7522–7529.
- [6] X. Hu, W. Fan, K. Yi, P. Wang, Y. Xu, Y. Fu, and P. Wang, “Boosting urban prediction via addressing spatial-temporal distribution shift,” in 2023 IEEE International Conference on Data Mining (ICDM). IEEE, 2023.
- [7] Z. Liu, M. Cheng, Z. Li, Z. Huang, Q. Liu, Y. Xie, and E. Chen, “Adaptive normalization for non-stationary time series forecasting: A temporal slice perspective,” in Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [8] W. Ye, S. Deng, Q. Zou, and N. Gui, “Frequency adaptive normalization for non-stationary time series forecasting,” in Advances in Neural Information Processing Systems (NeurIPS), 2024.
- [9] Huang et al., “Noise or signal? deconstructing contradictions and an adaptive remedy for reversible normalization in time series forecasting,” arXiv preprint arXiv:2510.04667, 2025.
- [10] N. L. Johnson, “Systems of frequency curves generated by methods of translation,” Biometrika, vol. 36, no. 1/2, pp. 149–176, 1949.
- [11] J. F. Slifker and S. S. Shapiro, “The Johnson system: Selection and parameter estimation,” Technometrics, vol. 22, no. 2, pp. 239–246, 1980.
- [12] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter optimization,” in Advances in Neural Information Processing Systems (NeurIPS), 2011.
- [13] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019, pp. 2623–2631.
- [14] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, “Informer: Beyond efficient transformer for long sequence time-series forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, 2021, pp. 11106–11115.
- [15] Y. Liu, H. Wu, J. Wang, and M. Long, “Non-stationary transformers: Exploring the stationarity in time series forecasting,” in Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [16] G. E. P. Box and D. R. Cox, “An analysis of transformations,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 26, no. 2, pp. 211–243, 1964.
- [17] I.-K. Yeo and R. A. Johnson, “A new family of power transformations to improve normality or symmetry,” Biometrika, vol. 87, no. 4, pp. 954–959, 2000.
- [18] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, and M. Long, “TimesNet: Temporal 2D-variation modeling for general time series analysis,” in International Conference on Learning Representations, 2023.
- [19] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin, “FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting,” in International Conference on Machine Learning. PMLR, 2022, pp. 27268–27286.
- [20] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019.
- [21] D. Qin, Y. Li et al., “A decoupled formulation of distribution shift in time series forecasting,” arXiv preprint, 2024.
- [22] F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
Table V. Recovered shape parameters (δ⋆, ε⋆) obtained by Optuna-GP HPO on each (backbone, dataset, H) configuration (90 runs over 6 backbones, seed 42, 100 trials, search space δ ∈ [0.8, 5.0], ε ∈ [−1.0, 1.0]). Boundary contacts are marked with † (δ = 0.8) and ‡ (ε = ±1.0). ...