SAGA: A Sequence-Adaptive Generative Architecture for Multi-Horizon Probabilistic Forecasting with Adaptive Temporal Conformal Prediction
Pith reviewed 2026-05-20 12:15 UTC · model grok-4.3
The pith
A decoder-only transformer for irregular earnings sequences reduces long-horizon forecast errors by nearly a third compared to canonical parametric processes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SAGA is a decoder-only transformer for irregular tabular panel sequences paired with a split conformal calibration wrapper that delivers individual-level prediction intervals with finite-sample marginal coverage guarantees. Trained on longitudinal earnings data, it produces annual labor earnings forecasts at one- to thirty-year horizons that reduce continuous ranked probability score by 31.9 percent at the ten-year horizon and mean absolute error by 37.7 percent at the twenty-year horizon relative to canonical parametric processes while achieving nominal coverage to within 0.4 percentage points marginally and within 2.4 percentage points on the worst-case demographic subgroup.
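As a rough illustration of the split conformal step named in the core claim, the sketch below calibrates a symmetric interval on absolute residuals. It is not the paper's adaptive temporal variant: the toy forecasts, the Gaussian noise, and the names `conformal_quantile` and `draw_pairs` are assumptions made for this example. The marginal guarantee it relies on holds only when calibration and test points are exchangeable.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    # Use the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points: only an unbounded interval is valid.
        return math.inf
    return sorted(scores)[k - 1]


def draw_pairs(rng, n):
    """Hypothetical (forecast, actual) pairs; a stand-in for real model output."""
    pairs = []
    for _ in range(n):
        forecast = rng.uniform(200.0, 600.0)
        actual = forecast + rng.gauss(0.0, 50.0)
        pairs.append((forecast, actual))
    return pairs


rng = random.Random(0)
calibration = draw_pairs(rng, 2000)
test = draw_pairs(rng, 5000)
alpha = 0.1

# Nonconformity score: absolute residual on the held-out calibration split.
scores = [abs(actual - forecast) for forecast, actual in calibration]
q_hat = conformal_quantile(scores, alpha)

# Interval for each test point is [forecast - q_hat, forecast + q_hat].
covered = sum(1 for f, a in test if f - q_hat <= a <= f + q_hat)
coverage = covered / len(test)
print(f"q_hat={q_hat:.1f}, empirical coverage={coverage:.3f}")
```

On exchangeable toy data like this, the empirical coverage should sit close to the 90 percent target. Whether the same holds on serially dependent earnings panels is exactly what the referee report questions.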
What carries the argument
A decoder-only transformer architecture for irregular tabular panel sequences, combined with a split conformal calibration wrapper that supplies finite-sample marginal coverage.
Load-bearing premise
Earnings trajectories contain long-range nonlinear structure that transformers can learn from panel sequences but that first- and second-moment parametric processes cannot capture.
What would settle it
A replication on held-out years from the same register or on a comparable register from another country that shows no reduction in continuous ranked probability score or mean absolute error, or that shows conformal intervals missing nominal coverage by more than a few percentage points, would falsify the performance claims.
Original abstract
Microsimulation models used by ministries of finance and central banks rely on parametric processes for lifetime earnings that capture only first and second moments of the conditional distribution and miss long-range nonlinear structure. We propose SAGA, a decoder-only transformer for irregular tabular panel sequences, paired with a split conformal calibration wrapper that delivers individual-level prediction intervals with finite-sample marginal coverage guarantees. Trained on the longitudinal Swedish LISA register over 1990 to 2022, comprising 2,143,817 individuals and 61,284,903 person-years, the model forecasts annual labor earnings at horizons of one to thirty years and aggregates them by Monte Carlo into present-discounted lifetime earnings distributions. Against the canonical Guvenen, Karahan, Ozkan, and Song parametric process and tabular and recurrent baselines, SAGA reduces continuous ranked probability score by 31.9 percent at the ten-year horizon and mean absolute error by 37.7 percent at the twenty-year horizon. Conformal intervals achieve nominal coverage to within 0.4 percentage points marginally and within 2.4 percentage points on the worst-case demographic subgroup. The reconstructed lifetime earnings Gini coefficient is 0.327 against the partially observed truth of 0.341 and the GKOS estimate of 0.378. Model weights, calibration tables, and a synthetic equivalent dataset are released for replication outside the protected SCB MONA environment.
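The Monte Carlo aggregation into present-discounted lifetime earnings and the Gini comparison described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: `sample_path` is a hypothetical log random walk standing in for draws from a fitted forecaster, the 3 percent discount rate and the start-of-discounting convention (first year discounted once) are assumed, and none of the numbers relate to the LISA register.

```python
import math
import random


def present_value(earnings_path, discount_rate):
    """Discounted sum of an annual earnings path; year t is discounted t times."""
    return sum(e / (1.0 + discount_rate) ** t
               for t, e in enumerate(earnings_path, start=1))


def gini(values):
    """Gini coefficient of non-negative values via the sorted-rank formula."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def sample_path(rng, start, horizon):
    """Hypothetical forecaster draw: a log random walk from the starting level."""
    path, level = [], start
    for _ in range(horizon):
        level *= math.exp(rng.gauss(0.0, 0.1))
        path.append(level)
    return path


rng = random.Random(0)
n_people, n_draws, horizon, rate = 300, 30, 30, 0.03
starts = [rng.lognormvariate(5.5, 0.5) for _ in range(n_people)]

# lifetime[i][j]: present value for person i under Monte Carlo draw j.
lifetime = [[present_value(sample_path(rng, s, horizon), rate)
             for _ in range(n_draws)] for s in starts]

# One Gini per draw across people, then averaged over draws.
ginis = [gini([lifetime[i][j] for i in range(n_people)])
         for j in range(n_draws)]
mean_gini = sum(ginis) / len(ginis)
print(f"mean lifetime-earnings Gini over draws: {mean_gini:.3f}")
```

Averaging the Gini over draws is one design choice among several; computing it on per-person mean lifetime earnings instead would understate dispersion, since averaging removes within-person forecast uncertainty.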
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SAGA, a decoder-only transformer architecture for irregular tabular panel sequences of earnings trajectories, paired with an adaptive temporal split conformal prediction wrapper. Trained on the Swedish LISA register (2.1M individuals, 61M person-years), it forecasts annual labor earnings at 1-30 year horizons, aggregates to lifetime distributions via Monte Carlo, and claims 31.9% CRPS reduction at the 10-year horizon and 37.7% MAE reduction at the 20-year horizon versus the Guvenen-Karahan-Ozkan-Song parametric process and tabular/recurrent baselines. Conformal intervals are reported to achieve nominal coverage within 0.4 pp marginally and 2.4 pp on the worst-case subgroup, yielding a reconstructed lifetime Gini of 0.327 versus 0.341 observed and 0.378 from the parametric baseline. Model weights, calibration tables, and a synthetic dataset are released.
Significance. If the performance gains and coverage properties hold under scrutiny, the work could meaningfully improve microsimulation models used by finance ministries and central banks by capturing long-range nonlinear structure missed by first- and second-moment parametric processes. The explicit release of model weights, calibration tables, and a synthetic equivalent dataset is a clear strength that supports external replication and verification outside protected environments.
major comments (2)
- [§4] §4 (Conformal Calibration and Coverage Guarantees): The central claim of finite-sample marginal coverage guarantees (to within 0.4 pp marginally) for the adaptive temporal conformal wrapper rests on standard split conformal theory. However, earnings trajectories exhibit serial correlation, cohort and macroeconomic shocks, and irregular person-year observation grids. These features violate the exchangeability assumption between calibration and test points required for the finite-sample guarantee, particularly at long horizons (10-30 y). This directly threatens the reported coverage numbers and the downstream Gini reconstruction that relies on the calibrated intervals; a robustness check or dependence-adjusted conformal method is needed.
- [Results] Results, performance comparison paragraph: The 31.9% CRPS reduction at the 10-year horizon and 37.7% MAE reduction at the 20-year horizon versus the GKOS parametric process are load-bearing for the superiority claim. Without an ablation isolating the decoder-only transformer’s contribution from the conformal wrapper, or explicit confirmation that baselines were re-estimated on the identical irregular panel structure and loss, it remains unclear whether the gains arise specifically from capturing long-range nonlinear dependencies.
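The exchangeability concern in the first major comment can be probed with a simple time-blocked check. The sketch below compares a random calibration/test split with a split that calibrates on early years and tests on later ones. The residual scores are hypothetical, with a noise scale that drifts upward by an assumed 3 percent of its 1990 level per calendar year; this is a toy mechanism for illustration, not an estimate from the paper's data.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]


def coverage(cal_scores, test_scores, alpha):
    """Share of test scores that fall at or below the calibrated quantile."""
    q = conformal_quantile(cal_scores, alpha)
    return sum(s <= q for s in test_scores) / len(test_scores)


rng = random.Random(1)
alpha = 0.1
records = []  # (year, absolute residual)
for year in range(1990, 2023):
    scale = 1.0 + 0.03 * (year - 1990)  # assumed drift in residual scale
    for _ in range(300):
        records.append((year, abs(rng.gauss(0.0, scale))))

# Random split: calibration and test scores are exchangeable by construction.
shuffled = records[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
random_cov = coverage([s for _, s in shuffled[:half]],
                      [s for _, s in shuffled[half:]], alpha)

# Time-blocked split: calibrate on 1990-2010, test on 2011-2022.
blocked_cov = coverage([s for y, s in records if y <= 2010],
                       [s for y, s in records if y > 2010], alpha)

print(f"random split coverage:  {random_cov:.3f}")
print(f"time-blocked coverage:  {blocked_cov:.3f}")
```

Under this assumed drift the time-blocked coverage falls below the random-split coverage, which is the kind of gap a dependence-adjusted method would need to close.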
minor comments (2)
- [§2.1] §2.1 (Data and Sequence Representation): The description of how irregular person-year grids are tokenized and padded for the decoder-only transformer would benefit from a concrete example sequence for one individual.
- [Figure 4] Figure 4 (Coverage plots): Adding the worst-case demographic subgroup curve would directly illustrate the 2.4 pp deviation cited in the abstract.
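The marginal and worst-case subgroup coverage gaps discussed in the report can be computed along the following lines. The record layout `(group, lower, upper, actual)`, the function names, and the two toy groups are assumptions for illustration; the paper's own subgroup definitions are not reproduced here.

```python
def coverage_by_group(records):
    """Empirical interval coverage per group from (group, lower, upper, actual)."""
    hits, counts = {}, {}
    for group, lower, upper, actual in records:
        counts[group] = counts.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + (lower <= actual <= upper)
    return {g: hits[g] / counts[g] for g in counts}


def coverage_report(records, nominal):
    """Marginal gap and worst subgroup gap, both in percentage points."""
    by_group = coverage_by_group(records)
    marginal = sum(lo <= y <= hi for _, lo, hi, y in records) / len(records)
    # Worst case: the group whose coverage is furthest from nominal.
    worst = max(by_group, key=lambda g: abs(by_group[g] - nominal))
    return {
        "marginal_gap_pp": 100.0 * (marginal - nominal),
        "worst_group": worst,
        "worst_gap_pp": 100.0 * (by_group[worst] - nominal),
    }


# Toy records: group A is covered 9 times out of 10, group B 7 out of 10.
records = (
    [("A", 0.0, 10.0, 5.0)] * 9 + [("A", 0.0, 10.0, 11.0)]
    + [("B", 0.0, 10.0, 5.0)] * 7 + [("B", 0.0, 10.0, 12.0)] * 3
)
report = coverage_report(records, nominal=0.9)
print(report)
```

A marginal figure can look acceptable while one group is badly under-covered, which is why plotting the worst-case curve alongside the marginal one is informative.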
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which help clarify key aspects of our work on SAGA. We address each major comment below and indicate planned revisions to the manuscript.
Point-by-point responses
-
Referee: [§4] §4 (Conformal Calibration and Coverage Guarantees): The central claim of finite-sample marginal coverage guarantees (to within 0.4 pp marginally) for the adaptive temporal conformal wrapper rests on standard split conformal theory. However, earnings trajectories exhibit serial correlation, cohort and macroeconomic shocks, and irregular person-year observation grids. These features violate the exchangeability assumption between calibration and test points required for the finite-sample guarantee, particularly at long horizons (10-30 y). This directly threatens the reported coverage numbers and the downstream Gini reconstruction that relies on the calibrated intervals; a robustness check or dependence-adjusted conformal method is needed.
Authors: We acknowledge that the standard split conformal prediction framework relies on exchangeability, which may be only approximately satisfied in our setting due to serial correlation in earnings trajectories, cohort effects, and macroeconomic shocks. Our adaptive temporal conformal method incorporates time-aware calibration to mitigate some temporal dependencies, and the reported coverage is also supported by empirical validation on held-out data. In the revised manuscript, we will expand the discussion in §4 to explicitly address potential violations of exchangeability, include additional robustness checks using time-blocked calibration sets, and report coverage under these conditions. We will also note this as a limitation for long-horizon applications.
Revision: yes
-
Referee: [Results] Results, performance comparison paragraph: The 31.9% CRPS reduction at the 10-year horizon and 37.7% MAE reduction at the 20-year horizon versus the GKOS parametric process are load-bearing for the superiority claim. Without an ablation isolating the decoder-only transformer’s contribution from the conformal wrapper, or explicit confirmation that baselines were re-estimated on the identical irregular panel structure and loss, it remains unclear whether the gains arise specifically from capturing long-range nonlinear dependencies.
Authors: We clarify that the CRPS and MAE metrics reflect the quality of the probabilistic forecasts generated directly by the decoder-only transformer component of SAGA; the conformal wrapper is applied only post hoc for constructing prediction intervals and does not influence these scoring rules. All baselines, including the Guvenen-Karahan-Ozkan-Song parametric process as well as tabular and recurrent models, were re-estimated on the identical irregular panel structure from the LISA register using the same data splits, preprocessing, and evaluation losses. To further isolate the contribution of the transformer architecture, we will add an ablation study in the revised results section comparing the decoder-only model against LSTM-based recurrent baselines, both with and without the conformal wrapper.
Revision: yes
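To make concrete why CRPS is scored on forecast draws alone, without reference to any interval wrapper, here is a minimal sample-based estimator using the standard representation CRPS(F, y) = E|X - y| - 0.5 E|X - X'| (see Gneiting and Raftery, reference [34]). The function names and the two Gaussian toy forecasts are assumptions for this example; they are not the paper's models or scores.

```python
import random


def crps_from_samples(samples, observed):
    """Sample estimate of CRPS: mean |x - y| minus half the mean |x - x'|."""
    xs = sorted(samples)
    m = len(xs)
    mean_abs_error = sum(abs(x - observed) for x in xs) / m
    # Sum of |x_i - x_j| over all ordered pairs, via the sorted-order identity.
    pair_sum = 2.0 * sum((2 * i - m - 1) * x
                         for i, x in enumerate(xs, start=1))
    return mean_abs_error - pair_sum / (2.0 * m * m)


def relative_reduction(baseline_score, model_score):
    """Percentage reduction of a score relative to a baseline."""
    return 100.0 * (baseline_score - model_score) / baseline_score


rng = random.Random(0)
observed = 400.0
# Two hypothetical forecast distributions centred on the outcome.
sharp = [rng.gauss(400.0, 30.0) for _ in range(500)]
diffuse = [rng.gauss(400.0, 90.0) for _ in range(500)]

sharp_crps = crps_from_samples(sharp, observed)
diffuse_crps = crps_from_samples(diffuse, observed)
reduction = relative_reduction(diffuse_crps, sharp_crps)
print(f"sharp={sharp_crps:.2f}, diffuse={diffuse_crps:.2f}, "
      f"reduction={reduction:.1f}%")
```

A headline figure such as a 31.9 percent reduction would be produced by `relative_reduction` applied to CRPS averaged over individuals at a fixed horizon, so the comparison depends on both models being scored on the same test set.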
Circularity Check
No significant circularity; claims rest on empirical comparisons and standard external theory
Full rationale
The paper reports direct empirical reductions in CRPS and MAE versus the GKOS parametric process and recurrent baselines on the LISA register, which are independent measurements rather than quantities defined in terms of the model's own fitted parameters. The conformal coverage guarantees are invoked from split conformal theory, a pre-existing result that does not depend on the transformer architecture or the specific earnings data. No derivation step equates a prediction to its own inputs by construction, renames a fitted quantity, or relies on a load-bearing self-citation whose content is unverified. The central performance and coverage numbers remain falsifiable against external benchmarks and do not reduce to tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Earnings trajectories contain long-range nonlinear structure not captured by first- and second-moment parametric processes.
invented entities (1)
- SAGA decoder-only transformer architecture: no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
SAGA is a decoder-only transformer for irregular tabular panel sequences... paired with a split conformal calibration wrapper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Conformal intervals achieve nominal coverage to within 0.4 percentage points marginally
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
What do data on millions of US workers reveal about lifecycle earnings dynamics?
F. Guvenen, F. Karahan, S. Ozkan, and J. Song, “What do data on millions of US workers reveal about lifecycle earnings dynamics?” Econometrica, vol. 89, no. 5, pp. 2303–2339, Sept. 2021
-
[2]
Modelling income processes with lots of heterogeneity,
M. Browning, M. Ejrnaes, and J. Alvarez, “Modelling income processes with lots of heterogeneity,” Rev. Econ. Stud., vol. 77, no. 4, pp. 1353–1381, Oct. 2010
-
[3]
On the persistence of income shocks over the life cycle,
F. Karahan and S. Ozkan, “On the persistence of income shocks over the life cycle,” Rev. Econ. Dyn., vol. 16, no. 3, pp. 452–476, July 2013
-
[4]
An empirical investigation of labor income processes,
F. Guvenen, “An empirical investigation of labor income processes,” Rev. Econ. Dyn., vol. 12, no. 1, pp. 58–79, Jan. 2009
-
[5]
Earnings dynamics and its intergenerational transmission: Evidence from Norway,
E. Halvorsen, J. Hubmer, S. Salgado, and S. Solenkova, “Earnings dynamics and its intergenerational transmission: Evidence from Norway,” Discussion Paper, Statistics Norway Research Department, 2024
-
[6]
K. A. McGonagle, R. F. Schoeni, N. Sastry, and V. A. Freedman, “The Panel Study of Income Dynamics: Overview, recent innovations, and potential for life course research,” Longitudinal Life Course Stud., vol. 3, no. 2, pp. 268–284, 2012
-
[7]
Using sequences of life events to predict human lives,
G. Savcisens et al., “Using sequences of life events to predict human lives,” Nature Comput. Sci., vol. 4, no. 1, pp. 43–56, Jan. 2024
-
[8]
Conformalized quantile regression,
Y. Romano, E. Patterson, and E. Candes, “Conformalized quantile regression,” in Adv. Neural Inf. Process. Syst. 32, 2019, pp. 3543–3553
-
[9]
A. Vaswani et al., “Attention is all you need,” in Adv. Neural Inf. Process. Syst. 30, 2017, pp. 5998–6008
-
[10]
TabTransformer: Tabular Data Modeling Using Contextual Embeddings
X. Huang, A. Khetan, M. Cvitkovic, and Z. Karnin, “TabTransformer: Tabular data modeling using contextual embeddings,” arXiv:2012.06678, 2020
-
[11]
On embeddings for numerical features in tabular deep learning,
Y. Gorishniy, I. Rubachev, and A. Babenko, “On embeddings for numerical features in tabular deep learning,” in Adv. Neural Inf. Process. Syst. 35, 2022, pp. 24991–25004
-
[12]
Accurate predictions on small data with a tabular foundation model,
N. Hollmann, S. Muller, K. Eggensperger, and F. Hutter, “Accurate predictions on small data with a tabular foundation model,” Nature, vol. 637, no. 8045, pp. 319–326, Jan. 2025
-
[13]
Transformers in time series: A survey,
Q. Wen et al., “Transformers in time series: A survey,” in Proc. IJCAI, 2023, pp. 6778–6786.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. X, 2026 14
-
[14]
Informer: Beyond efficient transformer for long sequence time-series forecasting,
H. Zhou et al., “Informer: Beyond efficient transformer for long sequence time-series forecasting,” in Proc. AAAI Conf. Artif. Intell. (AAAI), 2021, pp. 11106–11115
-
[15]
Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting,
H. Wu, J. Xu, J. Wang, and M. Long, “Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting,” in Adv. Neural Inf. Process. Syst. 34, 2021, pp. 22419–22430
-
[16]
A time series is worth 64 words: Long-term forecasting with transformers,
Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam, “A time series is worth 64 words: Long-term forecasting with transformers,” in Proc. Int. Conf. Learn. Representations (ICLR), 2023
-
[17]
Dynamic aspects of earning mobility,
L. A. Lillard and R. J. Willis, “Dynamic aspects of earning mobility,” Econometrica, vol. 46, no. 5, pp. 985–1012, Sept. 1978
-
[18]
T. E. MaCurdy, “The use of time series processes to model the error structure of earnings in a longitudinal data analysis,” J. Econometrics, vol. 18, no. 1, pp. 83–114, Jan. 1982
-
[19]
Earnings, consumption and life cycle choices,
C. Meghir and L. Pistaferri, “Earnings, consumption and life cycle choices,” in Handbook of Labor Economics, vol. 4B, O. Ashenfelter and D. Card, Eds. Amsterdam: Elsevier, 2011, pp. 773–854
-
[20]
Conformal time series forecasting,
K. Stankeviciute, A. Alaa, and M. van der Schaar, “Conformal time series forecasting,” in Adv. Neural Inf. Process. Syst. 34, 2021, pp. 6216–6228
-
[21]
Conformal prediction interval for dynamic time-series,
C. Xu and Y. Xie, “Conformal prediction interval for dynamic time-series,” in Proc. Int. Conf. Mach. Learn. (ICML), 2021, pp. 11559–11569
-
[22]
Adaptive conformal prediction for autoregressive forecasting,
A. Bhatnagar, J. Schwarting, and A. Brunner, “Adaptive conformal prediction for autoregressive forecasting,” J. Mach. Learn. Res., vol. 25, no. 87, pp. 1–42, 2024
-
[23]
Microsimulation as a tool for evaluating redistribution policies,
F. Bourguignon and A. Spadaro, “Microsimulation as a tool for evaluating redistribution policies,” J. Econ. Inequality, vol. 4, no. 1, pp. 77–106, Apr. 2006
-
[24]
EUROMOD: The European Union tax-benefit microsimulation model,
H. Sutherland and F. Figari, “EUROMOD: The European Union tax-benefit microsimulation model,” Int. J. Microsimul., vol. 6, no. 1, pp. 4–26, 2013
-
[25]
FASIT: The Swedish micro simulation model for the household sector,
L. Flood, “FASIT: The Swedish micro simulation model for the household sector,” Working Paper, Univ. of Gothenburg, 2024
-
[26]
L. Wheaton, “TRIM3 user’s guide,” Working Paper, Urban Institute, Washington, DC, 2008
-
[27]
Gaussian Error Linear Units (GELUs)
D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv:1606.08415, 2016
-
[28]
On layer normalization in the transformer architecture,
R. Xiong et al., “On layer normalization in the transformer architecture,” in Proc. Int. Conf. Mach. Learn. (ICML), 2020, pp. 10524–10533
-
[29]
Decoupled weight decay regularization,
I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. Int. Conf. Learn. Representations (ICLR), 2019
-
[30]
Deep networks with stochastic depth,
G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep networks with stochastic depth,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 646–661
-
[31]
M. Arellano and S. Bond, “Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations,” Rev. Econ. Stud., vol. 58, no. 2, pp. 277–297, Apr. 1991
-
[32]
LightGBM: A highly efficient gradient boosting decision tree,
G. Ke et al., “LightGBM: A highly efficient gradient boosting decision tree,” in Adv. Neural Inf. Process. Syst. 30, 2017, pp. 3146–3154
-
[33]
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997
-
[34]
Strictly proper scoring rules, prediction, and estimation,
T. Gneiting and A. E. Raftery, “Strictly proper scoring rules, prediction, and estimation,” J. Amer. Statist. Assoc., vol. 102, no. 477, pp. 359–378, Mar. 2007
-
[35]
W. Newey and K. West, “A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix,” Econometrica, vol. 55, no. 3, pp. 703–708, May 1987
-
[36]
Axiomatic attribution for deep networks,
M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” in Proc. Int. Conf. Mach. Learn. (ICML), 2017, pp. 3319–3328
-
[37]
Membership inference attacks against machine learning models,
R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership inference attacks against machine learning models,” in Proc. IEEE Symp. Secur. Privacy (SP), 2017, pp. 3–18
-
[38]
A. Dvoretzky, J. Kiefer, and J. Wolfowitz, “Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator,” Annals of Mathematical Statistics, vol. 27, no. 3, pp. 642-669, 1956
-
[39]
The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality,
P. Massart, “The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality,” Annals of Probability, vol. 18, no. 3, pp. 1269-1283, 1990
-
[40]
Revisiting deep learning models for tabular data,
Y. Gorishniy, I. Rubachev, V. Khrulkov, and A. Babenko, “Revisiting deep learning models for tabular data,” in Adv. Neural Inf. Process. Syst. 34, 2021, pp. 18932–18943
-
[41]
G. Somepalli, M. Goldblum, A. Schwarzschild, C. B. Bruss, and T. Goldstein, “SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training,” arXiv:2106.01342, June 2021
- [42]
-
[43]
Distribution-free predictive inference for regression,
J. Lei, M. G’Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman, “Distribution-free predictive inference for regression,” J. Amer. Statist. Assoc., vol. 113, no. 523, pp. 1094–1111, July 2018
-
[44]
Conformal prediction: A gentle introduction,
A. N. Angelopoulos and S. Bates, “Conformal prediction: A gentle introduction,” Found. Trends Mach. Learn., vol. 16, no. 4, pp. 494–591, 2023
-
[45]
Comparing predictive accuracy,
F. X. Diebold and R. S. Mariano, “Comparing predictive accuracy,” J. Bus. Econ. Statist., vol. 13, no. 3, pp. 253–263, July 1995.
Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov received the M.Sc. degree in statistics and machine learning from Linköping University, Linköping, Sweden, in 2026. He is currently pursuing the B.Sc. degree in milita...