ITS-Mina: A Harris Hawks Optimization-Based All-MLP Framework with Iterative Refinement and External Attention for Multivariate Time Series Forecasting
Pith reviewed 2026-05-07 06:34 UTC · model grok-4.3
The pith
An all-MLP model using repeated shared layers, learnable-memory attention, and automatic dropout tuning matches or exceeds Transformer accuracy on multivariate time series forecasts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The model combines an iterative refinement loop that reuses a shared-parameter residual mixer stack to deepen temporal representations, an external attention block that captures cross-sample dependencies through learnable memory units at linear cost, and Harris Hawks Optimization to set dataset-specific dropout rates. With these three components, the resulting all-MLP model attains state-of-the-art or highly competitive accuracy against eleven baseline models on six standard benchmarks across multiple forecasting horizons.
What carries the argument
The iterative refinement mechanism that reapplies a shared-parameter residual mixer stack, together with external attention that uses a fixed set of learnable memory units in place of pairwise self-attention.
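To make this mechanism concrete, here is a minimal PyTorch-style sketch of iterative refinement with a shared-parameter residual mixer. The class name, layer layout, and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SharedMixerRefiner(nn.Module):
    """Minimal sketch of iterative refinement: a single residual MLP-mixer
    block whose parameters are reused on every refinement pass, so effective
    depth grows with num_iters while the parameter count stays that of one
    block. Names and sizes are assumptions, not the paper's design."""

    def __init__(self, seq_len: int, hidden: int, num_iters: int = 3, p_drop: float = 0.1):
        super().__init__()
        self.num_iters = num_iters
        # Temporal mixing across the sequence dimension (MLP-Mixer style).
        self.time_mlp = nn.Sequential(
            nn.LayerNorm(seq_len), nn.Linear(seq_len, seq_len),
            nn.GELU(), nn.Dropout(p_drop),
        )
        # Feature mixing across the channel dimension.
        self.feat_mlp = nn.Sequential(
            nn.LayerNorm(hidden), nn.Linear(hidden, hidden),
            nn.GELU(), nn.Dropout(p_drop),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden)
        for _ in range(self.num_iters):  # the same weights are reused each pass
            x = x + self.time_mlp(x.transpose(1, 2)).transpose(1, 2)
            x = x + self.feat_mlp(x)
        return x
```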
If this is right
- Effective model depth can increase without a matching rise in parameter count or memory footprint.
- Global dependency modeling becomes possible at linear rather than quadratic complexity in sequence length.
- Regularization strength can be adapted automatically to each forecasting task without manual intervention.
- Computational cost for real-time applications in energy and finance can drop without sacrificing accuracy.
- Forecasting pipelines can avoid the training overhead of full self-attention layers.
Where Pith is reading between the lines
- The shared-parameter iteration pattern could be tested on other sequence tasks such as speech or video to reduce model size.
- Learnable memory units might serve as a lightweight substitute for attention in domains where quadratic scaling currently limits sequence length.
- If the gains persist under matched tuning budgets, the result would weaken the case for defaulting to Transformer backbones in multivariate forecasting.
- Evaluating the model on streaming data with concept drift would reveal whether the iterative refinement remains stable in non-stationary settings.
Load-bearing premise
That the reported performance edge comes from the three added components rather than from greater hyperparameter search effort or dataset-specific tuning advantages over the baselines.
What would settle it
An ablation experiment that removes iterative refinement, external attention, or HHO dropout tuning one at a time and measures whether accuracy falls on the same six datasets and horizons, or a re-run of all eleven baselines given identical hyperparameter search budgets.
Original abstract
Multivariate time series forecasting plays a pivotal role in numerous real-world applications, including financial analysis, energy management, and traffic planning. While Transformer-based architectures have gained popularity for this task, recent studies reveal that simpler MLP-based models can achieve competitive or superior performance with significantly reduced computational cost. In this paper, we propose ITS-Mina, a novel all-MLP framework for multivariate time series forecasting that integrates three key innovations: (1) an iterative refinement mechanism that progressively enhances temporal representations by repeatedly applying a shared-parameter residual mixer stack, effectively deepening the model's computational capacity without multiplying the number of distinct parameters; (2) an external attention module that replaces traditional self-attention with learnable memory units, capturing cross-sample global dependencies at linear computational complexity; and (3) a Harris Hawks Optimization (HHO) algorithm for automatic dropout rate tuning, enabling adaptive regularization tailored to each dataset. Extensive experiments on six widely-used benchmark datasets demonstrate that ITS-Mina achieves state-of-the-art or highly competitive performance compared to eleven baseline models across multiple forecasting horizons.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ITS-Mina, an all-MLP framework for multivariate time series forecasting. It integrates three components: (1) iterative refinement via a shared-parameter residual mixer stack that deepens computation without increasing distinct parameters, (2) an external attention module using learnable memory units to capture cross-sample global dependencies at linear complexity (replacing self-attention), and (3) Harris Hawks Optimization (HHO) for automatic per-dataset dropout tuning. The central claim is that extensive experiments on six benchmark datasets show ITS-Mina achieves state-of-the-art or highly competitive performance versus eleven baselines across multiple forecasting horizons.
Significance. If the empirical results hold after proper controls, the work could meaningfully contribute to the MLP-vs-Transformer debate in time series forecasting by demonstrating that targeted architectural simplifications plus adaptive regularization can match or exceed more complex models at lower cost. The shared-parameter iterative refinement and memory-based external attention are conceptually interesting efficiency ideas; if ablations confirm they add value beyond HHO tuning, this would strengthen the case for parameter-efficient all-MLP designs in practical applications such as energy and traffic forecasting.
major comments (3)
- [Abstract, §4 (Experiments)] The central performance claim asserts SOTA or competitive results on six datasets versus eleven baselines, yet the abstract supplies no numerical metrics (MAE/MSE), ablation tables, error bars, or statistical significance tests. This is load-bearing because the entire contribution rests on empirical superiority; without these data it is impossible to evaluate whether the three proposed components deliver genuine gains.
- [§4 (Experiments), §3.3 (HHO dropout)] No ablation studies isolate the contributions of iterative refinement and external attention from the effects of HHO hyperparameter search on dropout. The skeptic's concern is valid here: if the eleven baselines did not receive equivalent automated HPO effort, the reported improvements may be artifacts of unequal regularization tuning rather than of the all-MLP innovations. A concrete test (e.g., re-tuning baselines with the same HHO budget) is required to support the claim.
- [§3.2 (External attention)] The description of the learnable memory units for cross-sample dependencies lacks any complexity analysis or direct comparison to prior memory-augmented attention mechanisms (e.g., those in the existing linear-attention or memory-network literature). Without this, it is unclear whether the linear-complexity claim is novel or merely reimplements known techniques, which bears on the significance of the architectural contribution.
minor comments (2)
- [§3] Notation for the shared-parameter residual mixer and the external attention memory units should be introduced with explicit equations and dimension annotations to improve reproducibility (see the sketch after this list).
- [§4] The manuscript should include a clear statement of the baseline hyperparameter tuning protocol (grid search, random search, or none) to allow fair comparison with the HHO-tuned dropout.
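As an illustration of what the requested notation could look like, the sketch below writes out one plausible form of the shared mixer and the external attention with explicit dimensions. Every symbol here (W_1, W_2, M_k, M_v, K, S) is an assumption, not the manuscript's actual definition.

```latex
% Sketch of the requested notation; all dimensions are assumptions.
% Input window: X \in \mathbb{R}^{L \times d} (L time steps, d features).
\begin{align}
  Z^{(0)} &= X, \\
  Z^{(k)} &= Z^{(k-1)} + \sigma\!\big(\mathrm{LN}(Z^{(k-1)})\, W_1\big)\, W_2,
    \qquad k = 1, \dots, K,
\end{align}
% with W_1 \in \mathbb{R}^{d \times h}, W_2 \in \mathbb{R}^{h \times d}
% shared across all K refinement iterations (one set of distinct parameters).
\begin{align}
  A &= \mathrm{Norm}\big(\mathrm{softmax}(Z M_k^{\top})\big) \in \mathbb{R}^{L \times S}, \\
  \mathrm{EA}(Z) &= A\, M_v \in \mathbb{R}^{L \times d},
\end{align}
% where M_k, M_v \in \mathbb{R}^{S \times d} are learnable memory units and
% S is the fixed memory size, giving O(L S) rather than O(L^2) cost.
```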
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each of the major comments below and outline the revisions we plan to incorporate in the updated version to strengthen the presentation of our results and contributions.
Point-by-point responses
Referee: [Abstract, §4 (Experiments)] The central performance claim asserts SOTA or competitive results on six datasets versus eleven baselines, yet the abstract supplies no numerical metrics (MAE/MSE), ablation tables, error bars, or statistical significance tests. This is load-bearing because the entire contribution rests on empirical superiority; without these data it is impossible to evaluate whether the three proposed components deliver genuine gains.
Authors: We concur that the abstract would benefit from including key quantitative results. In the revised manuscript, we will modify the abstract to include representative MAE and MSE values demonstrating the performance of ITS-Mina relative to the baselines. The experiments section (§4) contains the full tables with metrics, but we will augment these with error bars representing standard deviations over multiple random seeds and include statistical significance tests (such as paired t-tests or Wilcoxon tests) to validate the improvements. Ablation tables are already included but will be made more prominent and expanded if necessary.
Revision: yes
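As an illustration of the significance testing the authors propose, here is a minimal sketch using SciPy's Wilcoxon signed-rank test over per-seed scores; the numbers below are placeholders, not reported results.

```python
# Hypothetical per-seed MSE scores for ITS-Mina vs. one baseline,
# evaluated on the same splits with the same 5 random seeds.
from scipy.stats import wilcoxon

its_mina_mse = [0.381, 0.379, 0.385, 0.382, 0.380]  # placeholder values
baseline_mse = [0.395, 0.401, 0.398, 0.396, 0.399]  # placeholder values

# One-sided test: is ITS-Mina's MSE stochastically lower than the baseline's?
stat, p_value = wilcoxon(its_mina_mse, baseline_mse, alternative="less")
print(f"Wilcoxon statistic={stat}, p={p_value:.4f}")  # p < 0.05 => significant gain
```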
Referee: [§4 (Experiments), §3.3 (HHO dropout)] No ablation studies isolate the contributions of iterative refinement and external attention from the effects of HHO hyperparameter search on dropout. The skeptic's concern is valid here: if the eleven baselines did not receive equivalent automated HPO effort, the reported improvements may be artifacts of unequal regularization tuning rather than of the all-MLP innovations. A concrete test (e.g., re-tuning baselines with the same HHO budget) is required to support the claim.
Authors: We recognize the importance of isolating the contributions of our architectural components from the hyperparameter optimization. The baselines were reproduced using the hyperparameters reported in their respective original papers, following common practice for fair comparison in the time series forecasting literature. To address this directly, we will add ablation studies in the revised §4 in which the dropout rate is fixed (without HHO tuning), showing the incremental benefit of the iterative refinement and external attention modules. We will also discuss the computational cost of applying HHO to the baselines and, if feasible within the revision window, provide results for re-tuned versions of a few key baselines. This will help demonstrate that the gains stem from the proposed innovations.
Revision: partial
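For concreteness, a one-dimensional HHO-style search over the dropout rate might look like the sketch below. This heavily simplifies the full update rules of Heidari et al. [18] down to the energy-controlled exploration/exploitation switch, and `score_fn` is a hypothetical train-and-validate callback.

```python
import random

def hho_tune_dropout(score_fn, pop_size=6, iters=10, lo=0.0, hi=0.6, seed=0):
    """Simplified HHO-style search for a dropout rate in [lo, hi].
    score_fn(p) trains/validates the model at dropout p and returns
    validation loss (lower is better). A sketch, not the full algorithm."""
    rng = random.Random(seed)
    hawks = [rng.uniform(lo, hi) for _ in range(pop_size)]
    scores = [score_fn(p) for p in hawks]
    best = min(zip(scores, hawks))  # (loss, dropout) of the current "rabbit"

    for t in range(iters):
        energy = 2 * rng.uniform(-1, 1) * (1 - t / iters)  # escaping energy decays
        for i, p in enumerate(hawks):
            if abs(energy) >= 1:  # exploration: jump relative to a random hawk
                q = hawks[rng.randrange(pop_size)]
                cand = q - rng.random() * abs(q - 2 * rng.random() * p)
            else:                 # exploitation: besiege the current best
                cand = best[1] - energy * abs(best[1] - p)
            cand = min(max(cand, lo), hi)  # clamp to a valid dropout range
            s = score_fn(cand)
            if s < scores[i]:     # greedy acceptance
                hawks[i], scores[i] = cand, s
                best = min(best, (s, cand))
    return best[1]
```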
Referee: [§3.2 (External attention)] The description of the learnable memory units for cross-sample dependencies lacks any complexity analysis or direct comparison to prior memory-augmented attention mechanisms (e.g., those in the existing linear-attention or memory-network literature). Without this, it is unclear whether the linear-complexity claim is novel or merely reimplements known techniques, which bears on the significance of the architectural contribution.
Authors: We will revise §3.2 to include a formal complexity analysis. The external attention mechanism has linear complexity O(N) in the number of input samples, since it computes attention between the input and a fixed-size learnable memory bank rather than pairwise among all inputs. We will also compare directly to related techniques in the memory-augmented networks and linear attention literature (e.g., Neural Turing Machines, MemNet, and efficient Transformers such as Reformer or Performer), clarifying the distinction: our memory units are designed specifically to capture global cross-sample dependencies in a time series context and are integrated within an all-MLP iterative framework. This will better establish the novelty of the approach.
Revision: yes
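For reference, here is a minimal sketch of external attention in the style of Guo et al. [17], the mechanism the rebuttal compares against. The memory size and the point of integration into ITS-Mina are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalAttention(nn.Module):
    """External attention per Guo et al. [17]: attention is computed against
    a small learnable memory bank (two linear layers) instead of pairwise
    among the N tokens, so cost is O(N * S) with fixed memory size S,
    i.e., linear in N."""

    def __init__(self, d_model: int, mem_size: int = 64):
        super().__init__()
        self.mk = nn.Linear(d_model, mem_size, bias=False)  # memory keys
        self.mv = nn.Linear(mem_size, d_model, bias=False)  # memory values

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_model)
        attn = F.softmax(self.mk(x), dim=1)  # softmax over the token dimension
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)  # double normalization
        return self.mv(attn)                 # (batch, N, d_model)
```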
Circularity Check
No significant circularity; empirical claims rest on external benchmarks
full rationale
The paper proposes an all-MLP architecture with iterative refinement (shared-parameter residual mixer), external attention (learnable memory units), and HHO-based dropout tuning. Performance claims are supported by experiments on six external benchmark datasets against eleven baselines. No equations, derivations, or self-referential definitions appear in the provided text that reduce a claimed result to its own inputs by construction. HHO is presented as a standard hyperparameter optimizer applied to dropout rates; this does not constitute a fitted-input-called-prediction pattern because the core model components are not defined in terms of the tuned outputs. No self-citations are invoked as load-bearing uniqueness theorems. The paper is self-contained against external benchmarks, so the derivation chain contains no circular steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- dropout rate
axioms (1)
- domain assumption: recent studies show simpler MLP-based models can match or exceed Transformer performance on time series tasks
invented entities (1)
- external attention module with learnable memory units (no independent evidence)
Reference graph
Works this paper leans on
- [1] H. Wu, J. Xu, J. Wang, M. Long, Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting, in: Advances in Neural Information Processing Systems, Vol. 34, 2021, pp. 22419–22430.
- [2] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, R. Jin, FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting, in: International Conference on Machine Learning, 2022, pp. 27268–27286.
- [3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, Vol. 30, 2017.
- [4] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer: Beyond efficient transformer for long sequence time-series forecasting, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2021, pp. 11106–11115.
- [5] A. Zeng, M. Chen, L. Zhang, Q. Xu, Are transformers effective for time series forecasting?, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, 2023, pp. 11121–11128.
- [6] I. O. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, et al., MLP-Mixer: An all-MLP architecture for vision, in: Advances in Neural Information Processing Systems, Vol. 34, 2021, pp. 24261–24272.
- [7] S.-A. Chen, C.-L. Li, N. C. Yoder, S. Ö. Arık, T. Pfister, TSMixer: An all-MLP architecture for time series forecasting, Transactions on Machine Learning Research, 2023.
- [8] Y. Nie, N. H. Nguyen, P. Sinthong, J. Kalagnanam, A time series is worth 64 words: Long-term forecasting with transformers, in: International Conference on Learning Representations, 2023.
- [9] S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, S. Dustdar, Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting, in: International Conference on Learning Representations, 2022.
- [10] Y. Zhang, J. Yan, Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting, in: International Conference on Learning Representations, 2023.
- [11]
- [12]
- [13] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, L. Kaiser, Universal transformers, in: International Conference on Learning Representations, 2019.
- [14] A. Graves, Adaptive computation time for recurrent neural networks, arXiv preprint arXiv:1603.08983, 2016.
- [15] M. Elbayad, J. Gu, E. Grave, M. Auli, Depth-adaptive transformer, arXiv preprint arXiv:1910.10073, 2019.
- [16]
- [17] M.-H. Guo, Z.-N. Liu, T.-J. Mu, S.-M. Hu, Beyond self-attention: External attention using two linear layers for visual analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (5) (2022) 5436–5447.
- [18] A. A. Heidari, S. Mirjalili, H. Faris, I. Aljarah, M. Mafarja, H. Chen, Harris hawks optimization: Algorithm and applications, Future Generation Computer Systems 97 (2019) 849–872.
- [19] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- [20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (1) (2014) 1929–1958.
- [21] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019, pp. 2623–2631.
- [22] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2015.
- [23] Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, M. Long, iTransformer: Inverted transformers are effective for time series forecasting, in: International Conference on Learning Representations, 2024.
- [24]
- [25] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, M. Long, TimesNet: Temporal 2D-variation modeling for general time series analysis, in: International Conference on Learning Representations, 2023.
- [26] H. Wang, J. Peng, F. Huang, J. Wang, J. Chen, Y. Xiao, MICN: Multi-scale local and global context modeling for long-term series forecasting, 2023.