
arXiv: 2604.23968 · v1 · submitted 2026-04-27 · 💻 cs.LG · cs.AI · stat.ML

DecompKAN: Decomposed Patch-KAN for Long-Term Time Series Forecasting

Pith reviewed 2026-05-08 04:42 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords long-term time series forecasting · Kolmogorov-Arnold Networks · decomposition · patching · model interpretability · B-splines · attention-free models

The pith

DecompKAN delivers best or tied-best MSE on 15 of 32 benchmark cases for long-term time series forecasting while exposing its learned functions for inspection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DecompKAN as an attention-free architecture that decomposes each series into trend and residual components, applies channel-wise patching and learned instance normalization, then routes the patches through B-spline Kolmogorov-Arnold Network layers. This design targets competitive accuracy on long-horizon forecasts while making the scalar nonlinearities inside the model directly visualizable and inspectable. A sympathetic reader would care because the approach offers an explicit alternative to attention mechanisms, which could matter for domains that value both prediction quality and the ability to see what transformations the model actually learned.

Core claim

DecompKAN combines trend-residual decomposition, channel-wise patching, learned instance normalization, and B-spline KAN edge functions into a lightweight model. Each KAN edge learns an explicit one-dimensional scalar function over the patch embeddings that can be plotted directly. On standard benchmarks it records best or tied-best MSE on 15 of 32 dataset-horizon combinations among selected baselines and on 20 of 36 comparisons under a controlled same-recipe protocol across nine datasets, including physiological PPG-DaLiA data. Ablation results indicate that the decomposition-patching-normalization pipeline contributes more to performance than the choice of nonlinear layer, while the KAN formulation enables inspection of the learned latent transformations.
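One way to picture the front end of this pipeline is the minimal sketch below, in plain Python. The window of 25 follows the moving-average setting quoted in the Figure 2 caption (KMA=25); the edge-replication padding, the patch length of 16, the stride of 8, and all function names are illustrative assumptions, not values taken from the paper.

```python
def moving_average(series, kernel=25):
    """Centered moving average; the ends are padded by repeating the edge values."""
    half = (kernel - 1) // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * (kernel - 1 - half)
    return [sum(padded[i:i + kernel]) / kernel for i in range(len(series))]


def decompose(series, kernel=25):
    """Split a series into a smooth trend and the residual left after removing it."""
    trend = moving_average(series, kernel)
    residual = [x - t for x, t in zip(series, trend)]
    return trend, residual


def make_patches(series, patch_len=16, stride=8):
    """Cut one channel into overlapping fixed-length windows (patches)."""
    last_start = len(series) - patch_len
    return [series[s:s + patch_len] for s in range(0, last_start + 1, stride)]


# Toy single-channel input: a linear ramp of 96 steps (hypothetical data).
series = [float(i) for i in range(96)]
trend, residual = decompose(series)
trend_patches = make_patches(trend)
residual_patches = make_patches(residual)
```

In a channel-wise setting the same functions would be applied to each variable independently, with the trend patches and the residual patches each feeding their own forecasting branch.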

What carries the argument

B-spline Kolmogorov-Arnold Network edge functions that learn explicit, inspectable 1D scalar transformations over learned patch-embedding coordinates.
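To make "explicit, inspectable 1D transformation" concrete, the sketch below evaluates a single edge function as a weighted sum of B-spline basis functions using the Cox-de Boor recursion. The input range [-3, 3] mirrors the plotting range in the Figure 4 caption; the grid size, the cubic degree, the uniform knot layout, the coefficient values, and the function names are assumptions for illustration. The paper's exact parameterization (for example, whether a base activation is added to the spline) is not stated on this page.

```python
def bspline_basis(i, k, knots, x):
    """Value at x of the i-th B-spline basis function of degree k (Cox-de Boor)."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_den = knots[i + k] - knots[i]
    right_den = knots[i + k + 1] - knots[i + 1]
    left = 0.0
    if left_den != 0:
        left = (x - knots[i]) / left_den * bspline_basis(i, k - 1, knots, x)
    right = 0.0
    if right_den != 0:
        right = (knots[i + k + 1] - x) / right_den * bspline_basis(i + 1, k - 1, knots, x)
    return left + right


def make_knots(lo, hi, grid_size, degree):
    """Uniform knots on [lo, hi], extended by `degree` intervals on each side."""
    step = (hi - lo) / grid_size
    return [lo + (j - degree) * step for j in range(grid_size + 2 * degree + 1)]


def edge_function(x, coeffs, knots, degree=3):
    """phi(x) = sum_i c_i * B_i(x); the coefficients are the learnable parameters."""
    return sum(c * bspline_basis(i, degree, knots, x) for i, c in enumerate(coeffs))


degree, grid_size = 3, 6                      # assumed values
knots = make_knots(-3.0, 3.0, grid_size, degree)
# grid_size + degree = 9 coefficients; the values are hypothetical, not learned.
coeffs = [0.0, 0.2, -0.1, 0.5, 1.0, 0.4, -0.3, 0.1, 0.0]
# Sample the curve on [-3, 3) so it can be plotted or inspected directly.
curve = [(x / 10, edge_function(x / 10, coeffs, knots)) for x in range(-30, 30)]
```

Because each edge is a function of one scalar, the `curve` list is all that is needed to plot it; no attribution method is involved.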

If this is right

  • The model records particular gains on datasets with smooth temporal dynamics such as Solar, ECL, and Weather.
  • It shows competitive results on physiological time series including the PPG-DaLiA benchmark.
  • Ablations indicate the decomposition, patching, and normalization steps matter more for accuracy than the specific nonlinear layer.
  • Visualization of the edge functions reveals qualitatively different latent nonlinearities across domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit functions could support debugging and trust in high-stakes forecasting applications such as energy or health monitoring.
  • The same decomposition-plus-patching recipe might be tested with other function approximators to isolate whether KAN adds unique value beyond interpretability.
  • If the pipeline dominates performance, hybrid architectures that keep the front-end decomposition but swap the backend could be explored systematically.
  • The reported domain-specific nonlinearities suggest that pre-training or meta-learning across domains might further improve generalization.

Load-bearing premise

That the chosen baselines and the controlled same-recipe evaluation fairly represent current methods, and that the reported MSE gains are not artifacts of hyperparameter choices or of preprocessing details omitted from the experiments.

What would settle it

A re-run of the 32 and 36 comparisons in which stronger hyperparameter tuning or additional published baselines close or reverse the reported MSE advantages on the majority of the winning cases.
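The tally that such a re-run would recompute can be sketched as below. The dataset names, baseline names, and MSE values are placeholders invented for illustration and are not results from the paper; the tie tolerance is likewise an assumption, since the page does not say how "tied-best" is defined.

```python
def count_best_or_tied(results, model, tol=5e-4):
    """Count the cases where `model` has the lowest MSE, or is within `tol` of it."""
    wins = 0
    for scores in results.values():
        best = min(scores.values())
        if scores[model] <= best + tol:
            wins += 1
    return wins


# Placeholder numbers only: (dataset, horizon) -> MSE per model.
results = {
    ("dataset_A", 96): {"DecompKAN": 0.300, "baseline_1": 0.310, "baseline_2": 0.305},
    ("dataset_A", 192): {"DecompKAN": 0.350, "baseline_1": 0.350, "baseline_2": 0.360},
    ("dataset_B", 96): {"DecompKAN": 0.200, "baseline_1": 0.190, "baseline_2": 0.210},
}
wins = count_best_or_tied(results, "DecompKAN")
total = len(results)
```

Running the same function before and after stronger tuning of the baselines would show directly how many of the winning cases survive.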

Figures

Figures reproduced from arXiv: 2604.23968 by Naveen Mysore.

Figure 1. Controlled-comparison avg MAE (lower = better). ETT = ETTh1/h2/m1/m2. DECOMPKAN ranks first on 25/36 (…
Figure 2. DECOMPKAN architecture. Input is normalized (RevIN + learned adaptive), decomposed via moving average (KMA=25), and each component is forecast by an independent Patch-KAN branch. Outputs are summed and denormalized. Parameter count varies with L and H (see Table B).
Figure 3. Ablation study at H=96 (seed=42). Bars are sorted by MSE within each dataset. Full DECOMPKAN (red) is best only on Weather; on ETTm2 and ETTh1, replacing KAN with linear or attention layers yields equal or lower MSE, illustrating that the pipeline design matters more than the KAN layer.
Figure 4. Learned B-spline edge functions ϕ_{i→j}(x) from the first KAN layer of DECOMPKAN trained on Weather (H=96). Top row: trend branch (blue); bottom row: residual branch (red). Each subplot shows one edge's activation function over x ∈ [−3, 3]. Edge labels indicate shape classification. The trend branch learns smoother functions; the residual branch learns sharper, more oscillatory transformations.
Figure 5. Learned B-spline edge functions on PPG-DaLiA (…
Figure 6. Mean activation range (max ϕ − min ϕ) per KAN layer on Weather, with error bars showing ±1 standard deviation across all edges. Layer 0 (1312 → 64) exhibits structured sparsity with low mean range, while deeper layers are uniformly more active.
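The per-layer statistic described in the Figure 6 caption can be approximated with the sketch below, which samples each edge function on a grid and summarizes the spread of (max ϕ − min ϕ) across edges. The three toy edge functions, the sampling density, the [-3, 3] range, and the use of the population standard deviation are assumptions for illustration; the caption does not say which estimator was used.

```python
import math
from statistics import mean, pstdev


def activation_range(phi, lo=-3.0, hi=3.0, samples=121):
    """Approximate max(phi) - min(phi) on [lo, hi] by evaluating phi on a uniform grid."""
    xs = [lo + (hi - lo) * k / (samples - 1) for k in range(samples)]
    ys = [phi(x) for x in xs]
    return max(ys) - min(ys)


def layer_summary(edge_functions):
    """Mean and standard deviation of the activation range over a layer's edges."""
    ranges = [activation_range(phi) for phi in edge_functions]
    return mean(ranges), pstdev(ranges)


# Toy stand-ins for learned edges: a flat (inactive) edge, an identity edge,
# and a saturating edge. Real edges would be the trained spline functions.
toy_edges = [lambda x: 0.0, lambda x: x, math.tanh]
mean_range, std_range = layer_summary(toy_edges)
```

A layer dominated by near-flat edges would show a low mean range, which is the pattern the caption reports for layer 0.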
original abstract

Accurate time series forecasting in scientific domains such as climate modeling, physiological monitoring, and energy systems benefits from both competitive predictions and model transparency. This work proposes DecompKAN, a lightweight attention-free architecture that combines trend-residual decomposition, channel-wise patching, learned instance normalization, and B-spline Kolmogorov-Arnold Network (KAN) edge functions. Each KAN edge learns an explicit, inspectable 1D scalar function over learned patch-embedding coordinates that can be directly visualized. On standard benchmarks, DecompKAN achieves best or tied-best MSE on 15 of 32 dataset-horizon combinations among selected published baselines, and achieves best or tied-best MSE on 20 of 36 comparisons under a controlled same-recipe evaluation across 9 datasets including the physiological PPG-DaLiA benchmark. The architecture shows particular strength on datasets with smooth temporal dynamics (Solar -17%, ECL -10% vs. iTransformer, Weather) and physiological time series. Visualization of learned edge functions reveals qualitatively different latent nonlinearities across domains. Ablation analysis shows that the architectural pipeline (decomposition, patching, normalization) drives performance more than the choice of nonlinear layer, while the KAN formulation enables inspection of learned latent transformations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes DecompKAN, a lightweight attention-free architecture for long-term time series forecasting. It integrates trend-residual decomposition, channel-wise patching, learned instance normalization, and B-spline Kolmogorov-Arnold Network (KAN) edge functions. The authors claim that DecompKAN achieves best or tied-best MSE on 15 of 32 dataset-horizon combinations against selected published baselines, and on 20 of 36 comparisons under a controlled same-recipe evaluation across 9 datasets (including PPG-DaLiA). They highlight stronger performance on smooth-dynamics datasets (e.g., Solar, ECL, Weather) and physiological series, provide visualizations of learned KAN edge functions, and report ablation results indicating that the overall pipeline contributes more to performance than the choice of nonlinear layer.

Significance. If the performance claims hold under rigorously controlled conditions, the work offers a useful interpretable alternative to attention-based forecasters, with particular relevance for scientific domains requiring model inspection. The explicit visualization of learned 1D edge functions and the ablation separating pipeline effects from nonlinearity choice are constructive contributions. The approach is lightweight and avoids attention, which could be valuable for resource-constrained or transparency-focused applications.

major comments (2)
  1. [§4.2] §4.2 (Controlled same-recipe evaluation) and Table 3: The claim that DecompKAN outperforms iTransformer by 17% on Solar (and similar deltas elsewhere) under identical conditions is load-bearing for the central performance claim, yet the section does not provide explicit confirmation or a supplementary table listing the precise train/val/test splits, instance normalization procedure, patching parameters, optimizer, epoch count, and early-stopping rule applied uniformly to every baseline (including iTransformer and others). Without this, the reported improvements cannot be unambiguously attributed to the DecompKAN components rather than unequal experimental protocols.
  2. [§5.3] §5.3 (Ablation study): The conclusion that 'the architectural pipeline drives performance more than the nonlinear layer' rests on comparisons that replace KAN with other nonlinearities, but the text does not report the exact MSE deltas or statistical tests when the full pipeline (decomposition + patching + normalization) is held fixed while only swapping the edge functions. This makes it difficult to quantify the incremental contribution of the B-spline KAN formulation versus the preprocessing steps.
minor comments (3)
  1. [Tables 2-3] The reported MSE values in Tables 2 and 3 lack error bars, standard deviations across random seeds, or results from statistical significance tests (e.g., paired t-tests), which is standard for claiming 'best or tied-best' rankings.
  2. [Figure 4] Figure 4 (learned edge function visualizations): The caption and surrounding text could more explicitly state the input coordinate ranges, the number of samples visualized per domain, and whether the functions are shown after training on the full dataset or a subset.
  3. [§1 and §2] The abstract and §1 mention 'learned instance normalization' but the precise formulation (e.g., whether it is affine or includes learnable parameters per channel) is not contrasted with standard RevIN or other normalization baselines in the related-work section.
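The reporting requested in minor comment 1 could look like the sketch below: a mean and sample standard deviation across seeds for each model, plus the paired t statistic on the per-seed differences. The MSE values are placeholders, not numbers from the paper, and only the test statistic is computed; converting it to a p-value needs a Student t distribution with n − 1 degrees of freedom, which is outside the standard library.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(values):
    """Mean and sample standard deviation across seeds."""
    return mean(values), stdev(values)


def paired_t_statistic(a, b):
    """t statistic for paired samples; undefined if all differences are identical."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Placeholder per-seed MSE values for one dataset and horizon (five seeds).
mse_model = [0.301, 0.298, 0.305, 0.299, 0.302]
mse_baseline = [0.310, 0.309, 0.311, 0.306, 0.312]

model_mean, model_std = summarize(mse_model)
baseline_mean, baseline_std = summarize(mse_baseline)
t_stat = paired_t_statistic(mse_model, mse_baseline)  # negative if the model has lower MSE
```

Pairing by seed matters here because both models would share the same data split and initialization seed, so the differences are less noisy than the raw scores.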

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the experimental documentation and ablation reporting.

point-by-point responses
  1. Referee: [§4.2] §4.2 (Controlled same-recipe evaluation) and Table 3: The claim that DecompKAN outperforms iTransformer by 17% on Solar (and similar deltas elsewhere) under identical conditions is load-bearing for the central performance claim, yet the section does not provide explicit confirmation or a supplementary table listing the precise train/val/test splits, instance normalization procedure, patching parameters, optimizer, epoch count, and early-stopping rule applied uniformly to every baseline (including iTransformer and others). Without this, the reported improvements cannot be unambiguously attributed to the DecompKAN components rather than unequal experimental protocols.

    Authors: We agree that explicit protocol documentation is necessary to substantiate the controlled-evaluation claims. In the revised manuscript we have added Supplementary Table S1, which lists the exact train/val/test splits, instance-normalization procedure, patching parameters, optimizer, maximum epoch count, and early-stopping rule applied uniformly to all models (including re-implemented baselines) in the same-recipe experiments of §4.2. These settings were enforced identically across models while following the original baseline papers’ data splits where they exist; the table makes the uniformity verifiable and supports attribution of the reported gains to the DecompKAN components. revision: yes

  2. Referee: [§5.3] §5.3 (Ablation study): The conclusion that 'the architectural pipeline drives performance more than the nonlinear layer' rests on comparisons that replace KAN with other nonlinearities, but the text does not report the exact MSE deltas or statistical tests when the full pipeline (decomposition + patching + normalization) is held fixed while only swapping the edge functions. This makes it difficult to quantify the incremental contribution of the B-spline KAN formulation versus the preprocessing steps.

    Authors: We acknowledge that quantitative deltas and statistical tests would make the ablation more precise. The revised §5.3 now includes an expanded table that reports the exact MSE values for each nonlinearity (KAN, MLP, ReLU, GELU) under the fixed full pipeline, together with the corresponding percentage changes relative to the KAN baseline and results of paired statistical tests (Wilcoxon signed-rank) across the nine datasets. These additions allow readers to directly assess the incremental contribution of the B-spline formulation versus the preprocessing pipeline. revision: yes
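The two quantities named in this response, percentage change relative to the KAN variant and the Wilcoxon signed-rank statistic, can be computed as in the sketch below. The function names are hypothetical, the implementation returns only the test statistic (the smaller of the positive and negative rank sums) without a p-value, and it drops zero differences, which is one common convention and an assumption here.

```python
def pct_change(value, reference):
    """Percentage change of `value` relative to `reference`."""
    return 100.0 * (value - reference) / reference


def wilcoxon_statistic(a, b):
    """Signed-rank statistic min(W+, W-), using average ranks for tied |differences|."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal absolute differences (exact comparison).
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)


# Small integer example so the ranks are easy to check by hand.
w = wilcoxon_statistic([1, 2, 3, 4], [0, 0, 0, 10])
change = pct_change(90.0, 100.0)
```

With only nine datasets the test has little power, so the statistic is best read alongside the raw per-dataset deltas rather than in place of them.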

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on external benchmarks

full rationale

The paper defines DecompKAN as an explicit composition of trend-residual decomposition, channel-wise patching, instance normalization, and B-spline KAN layers; none of these components is defined in terms of the final performance metric or any predicted quantity. Reported results consist of direct MSE comparisons against published baselines and a controlled re-evaluation on fixed datasets, with no equations that rename fitted parameters as predictions, no self-citation chains invoked as uniqueness theorems, and no ansatz smuggled through prior work. Ablation statements compare architectural choices without reducing one to the other by construction. The derivation chain is therefore self-contained against external data and does not collapse to its inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical performance rather than derivation; the model introduces several unspecified hyperparameters (patch size, spline order, number of layers) that are chosen by tuning.

free parameters (2)
  • patch size and stride
    Chosen to balance local context and computational cost; values not stated in abstract.
  • B-spline order and grid size
    Control the flexibility of each KAN edge function; tuned per dataset.
axioms (1)
  • domain assumption: Trend-residual decomposition yields additive components that are easier to model separately.
    Invoked to justify the first stage of the pipeline.



Reference graph

Works this paper leans on

2 extracted references

  1. [1] Branch specialization. The trend branch learns predominantly smooth, slowly varying functions (gradual slopes, soft thresholds), consistent with its role in capturing low-frequency dynamics. The residual branch learns sharper, more oscillatory functions with steeper gradients, appropriate for modeling higher-frequency seasonal and irregular components.

  2. [2] Functional diversity. Even within a single layer, edges learn qualitatively different shapes: smooth monotone mappings, sharp threshold transitions, oscillatory patterns, and near-identity functions. This diversity arises naturally from training without any explicit regularization on edge function shape, suggesting that B-spline KAN layers discover a he...