Logo-LLM: Local and Global Modeling with Large Language Models for Time Series Forecasting

Cheng Chen; Dongyue Guo; Wenjie Ou; Yi Lin; Zhishuo Zhao

arxiv: 2505.11017 · v2 · submitted 2025-05-16 · 💻 cs.LG

Pith reviewed 2026-05-22 15:12 UTC · model grok-4.3

classification 💻 cs.LG

keywords: time series forecasting · large language models · local global modeling · multi-scale features · few-shot learning · zero-shot forecasting · mixer modules

The pith

Extracting local dynamics from shallow LLM layers and global trends from deeper layers improves time series forecasting accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses a limitation in current LLM-based time series forecasting methods that only use the final layer output. It finds that shallow layers focus on local short-term variations while deeper layers handle global long-term dependencies. To make use of this, the authors add lightweight modules called Local-Mixer and Global-Mixer to combine these features from multiple layers with the time series input. Experiments on various benchmarks show this leads to better predictions, particularly in cases with limited training data, and does so efficiently.

Core claim

Through empirical analysis the paper establishes that shallow layers of LLMs capture local dynamics in time series while deeper layers encode global trends. Logo-LLM uses this by extracting multi-scale features and integrating them with Local-Mixer and Global-Mixer modules, resulting in superior performance across benchmarks and strong generalization in few-shot and zero-shot settings at low computational cost.

What carries the argument

The layer-specific feature extraction from pre-trained LLMs paired with Local-Mixer and Global-Mixer modules for aligning and integrating local and global temporal features.
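The routing idea can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the source does not describe the internals of the Local-Mixer or Global-Mixer, so `mix` is a hypothetical stand-in (a linear projection added to the input patch as a residual), and `fuse_local_global`, the layer indices, and the toy features are all invented for the example.

```python
def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector (weight is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def mix(input_patches, layer_features, weight, bias):
    """Hypothetical mixer (assumption, not the paper's module): project each
    layer feature, then add it to the matching input patch as a residual."""
    return [
        [p + q for p, q in zip(patch, linear(feat, weight, bias))]
        for patch, feat in zip(input_patches, layer_features)
    ]


def fuse_local_global(input_patches, hidden_states, local_layer, global_layer,
                      local_params, global_params):
    """Route one shallow layer and one deep layer through separate mixers and
    concatenate the two results per patch."""
    local = mix(input_patches, hidden_states[local_layer], *local_params)
    global_ = mix(input_patches, hidden_states[global_layer], *global_params)
    return [l + g for l, g in zip(local, global_)]


# Toy data: 2 patches, embedding size 2, 3 "layers" of made-up features.
input_patches = [[1.0, 2.0], [3.0, 4.0]]
hidden_states = [
    [[0.5, 0.25], [0.5, 0.25]],    # layer 0 (shallow)
    [[2.0, 2.0], [2.0, 2.0]],      # layer 1
    [[10.0, 20.0], [30.0, 40.0]],  # layer 2 (deep)
]
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])  # (weight, bias)

fused = fuse_local_global(input_patches, hidden_states, 0, -1, identity, identity)
print(fused)
```

Keeping two separately parameterized mixers, rather than one shared projection, is what lets the shallow and deep features be aligned to the input independently; whether that separation matters in practice is the question the referee's ablation request targets.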

Load-bearing premise

That the local-global separation observed in LLM layers for time series is a general property that can be reliably exploited.

What would settle it

If future tests on diverse time series data show that using only the final LLM layer performs as well or better than the multi-layer approach with mixers, the advantage would be called into question.
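A comparison of this kind reduces to scoring two forecasts against the same ground truth. The following is a minimal sketch, assuming MSE and MAE as the metrics (the source does not name the paper's metrics); the series and both forecasts are placeholder numbers made up for illustration, not results from the paper.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def compare(y_true, forecasts):
    """Return {name: (mse, mae)} for each named forecast."""
    return {name: (mse(y_true, pred), mae(y_true, pred))
            for name, pred in forecasts.items()}


# Placeholder values only; they do not come from any experiment.
y_true = [1.0, 2.0, 3.0, 4.0]
forecasts = {
    "final_layer_only": [1.5, 2.5, 2.5, 3.5],
    "multi_layer": [1.0, 2.5, 3.0, 3.5],
}
scores = compare(y_true, forecasts)
for name, (m, a) in scores.items():
    print(f"{name}: MSE={m:.3f} MAE={a:.3f}")
```

For the test to be informative, both variants would need the same data splits, input length, and prediction horizon, so that the only difference is which layers feed the forecast head.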

Figures

Figures reproduced from arXiv: 2505.11017 by Cheng Chen, Dongyue Guo, Wenjie Ou, Yi Lin, Zhishuo Zhao.

**Figure 1.** Comparison of LLM usage paradigms. Prior works treat LLMs as black-box encoders and use only the last-layer feature. Our method explicitly extracts features from multiple layers, leveraging shallow-layer features for local modeling and deep-layer features for global modeling, enabling a more fine-grained understanding of temporal dynamics.

**Figure 2.** Overview of the proposed Logo-LLM framework. Logo-LLM extracts intermediate representations from multiple layers of a pre-trained LLM to explicitly model local and global temporal patterns. Two specialized Mixer modules are introduced to align these hierarchical features with the temporal input, enabling fine-grained modeling of local and global variations. Most LLM parameters are kept frozen, enabling eff…

**Figure 3.** Comparison of Logo-LLM and CALF with various layers on ETTh1 and ETTh2 datasets. The prediction length is set as {96, 192}.

**Figure 4.** Visualization of different selections {1, 2, 3, 4, 5, 6} about local feature layer on ETTh1, ETTm2, and ETTh2. The prediction length is set as 96 with input length 𝐿 = 96. We observe that using the first-layer output as a local feature yields the best performance and performance gradually deteriorates or plateaus when deeper layers are used. This finding supports our design choice and aligns with the repre…

**Figure 5.** Similarity matrices of each patch across Transformer layers in (a) Logo-LLM (b) CALF and (c) Time-LLM, illustrating that shallow layers exhibit pronounced local patterns while deeper layers capture broader global dependencies.
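A patch-to-patch similarity matrix of the kind Figure 5 describes can be computed per layer as follows. This is a hedged sketch: cosine similarity is an assumption (the caption does not say which similarity is used), and `locality_score` is a hypothetical diagnostic invented here to summarize how concentrated similarity is near the diagonal. The two feature sets are toy values, not model outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(patch_features):
    """Pairwise cosine similarity between all patch feature vectors."""
    n = len(patch_features)
    return [[cosine(patch_features[i], patch_features[j]) for j in range(n)]
            for i in range(n)]


def locality_score(sim, band=1):
    """Hypothetical diagnostic: mean similarity of patch pairs within `band`
    positions of each other minus the mean over all more distant pairs."""
    near, far = [], []
    for i, row in enumerate(sim):
        for j, value in enumerate(row):
            if i == j:
                continue
            (near if abs(i - j) <= band else far).append(value)
    if not near or not far:
        raise ValueError("need both near and far patch pairs")
    return sum(near) / len(near) - sum(far) / len(far)


# Toy features: neighbours overlap in the first set, all patches agree in the second.
local_like = [[1.0, 0.5, 0.0, 0.0], [0.5, 1.0, 0.5, 0.0],
              [0.0, 0.5, 1.0, 0.5], [0.0, 0.0, 0.5, 1.0]]
global_like = [[1.0, 1.0, 1.0, 1.0] for _ in range(4)]

local_score = locality_score(similarity_matrix(local_like))
global_score = locality_score(similarity_matrix(global_like))
print(local_score > global_score)
```

A score near zero means similarity is spread evenly across all patch pairs, while a clearly positive score means neighbouring patches resemble each other more than distant ones.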

Original abstract

Time series forecasting is critical across multiple domains, where time series data exhibit both local patterns and global dependencies. While Transformer-based methods effectively capture global dependencies, they often overlook short-term local variations in time series. Recent methods that adapt large language models (LLMs) into time series forecasting inherit this limitation by treating LLMs as black-box encoders, relying solely on the final-layer output and underutilizing hierarchical representations. To address this limitation, we propose Logo-LLM, a novel LLM-based framework that explicitly extracts and models multi-scale temporal features from different layers of a pre-trained LLM. Through empirical analysis, we show that shallow layers of LLMs capture local dynamics in time series, while deeper layers encode global trends. Moreover, Logo-LLM introduces lightweight Local-Mixer and Global-Mixer modules to align and integrate features with the temporal input across layers. Extensive experiments demonstrate that Logo-LLM achieves superior performance across diverse benchmarks, with strong generalization in few-shot and zero-shot settings while maintaining low computational overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Logo-LLM pulls multi-layer features from LLMs with new mixer modules for time series, but the shallow-local versus deep-global claim rests on thin controls.

The main thing to know is that Logo-LLM takes outputs from several layers of a pre-trained LLM, then feeds them through dedicated Local-Mixer and Global-Mixer modules to combine short-term and long-range signals in time series forecasting. This is a step past treating the LLM as a black-box final-layer encoder. The authors add lightweight alignment modules that map the layer features back to the original temporal input, and they report gains on standard benchmarks plus decent few-shot and zero-shot results with low added cost. That combination of multi-scale extraction and cheap integration is the concrete technical move. If the numbers hold with proper ablations, it gives practitioners a straightforward way to squeeze more out of existing LLMs without retraining from scratch. The paper does a reasonable job showing the practical upside on diverse datasets while keeping overhead modest.

The softer part is the load-bearing premise that shallow layers reliably capture local dynamics and deeper layers capture global trends. The stress-test note is fair here: this split could be an artifact of the patching scheme, the specific LLM chosen, or the datasets used for the probing analysis. Without clear controls that vary the input representation or test other model families, it is hard to know whether the performance edge truly comes from the intended layer specialization or from something else in the setup. The abstract states empirical support, but the strength of that support depends on how thoroughly the full paper checks for those confounds.

This work is aimed at researchers adapting LLMs to sequential prediction tasks. Anyone already experimenting with hierarchical features in transformers for forecasting will find the mixer design and the reported few-shot behavior useful to examine. The idea is distinct enough and the empirical claims are presented with enough detail that the paper deserves a serious referee rather than a desk reject.
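To make the patching concern concrete, the sketch below splits a series into fixed-length patches and shows how the stride alone changes what the model sees. The function name `make_patches` and the values of `patch_len` and `stride` are assumptions chosen for illustration; the source does not state the paper's patching settings.

```python
def make_patches(series, patch_len, stride):
    """Split a 1-D series into fixed-length patches taken every `stride` steps.
    Patches overlap when stride < patch_len; a trailing remainder is dropped."""
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    return [series[start:start + patch_len]
            for start in range(0, len(series) - patch_len + 1, stride)]


series = [float(t) for t in range(12)]

# Same series, two hypothetical patching schemes.
coarse = make_patches(series, patch_len=4, stride=4)  # non-overlapping
fine = make_patches(series, patch_len=4, stride=2)    # 50% overlap

print(len(coarse), len(fine))
```

With overlapping patches, adjacent tokens share raw values, so neighbouring patch representations may look similar regardless of what any layer has learned. A control that varies `stride` is one way to check whether observed locality comes from the model or from the input construction.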

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Logo-LLM, a framework that extracts multi-scale temporal features from different layers of a pre-trained LLM for time series forecasting instead of treating the LLM as a black-box final-layer encoder. It claims through empirical analysis that shallow layers capture local dynamics while deeper layers encode global trends, introduces lightweight Local-Mixer and Global-Mixer modules to align and integrate these features with the temporal input, and reports superior performance across diverse benchmarks with strong few-shot and zero-shot generalization at low computational cost.

Significance. If the layer-wise specialization finding proves robust to changes in tokenization, patching, and model family, and the mixer modules demonstrably leverage it, the approach could meaningfully improve utilization of hierarchical LLM representations for multi-scale time series tasks, offering a lightweight alternative to both pure Transformer and black-box LLM baselines.

major comments (2)

[Abstract] Abstract: the load-bearing empirical claim that shallow LLM layers capture local dynamics while deeper layers encode global trends is stated without reference to controls for input representation, patching scheme, or LLM choice; if the observed specialization is an artifact of the specific tokenization or datasets used for probing, the performance advantage over prior final-layer baselines is not explained by the stated mechanism.
[Method] Method section (Local-Mixer and Global-Mixer description): the modules are introduced to align features across layers, yet no ablation isolates their contribution versus simpler concatenation or attention-based fusion; without such controls, it remains unclear whether the reported gains require the full proposed architecture or could be achieved by routing any multi-layer features through a single mixer.

minor comments (1)

[Abstract] Abstract: the phrase 'extensive experiments demonstrate' would benefit from a one-sentence summary of the benchmark datasets, number of baselines, and primary metrics to allow readers to gauge scope immediately.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the constructive feedback. We address each major comment below and have incorporated revisions to strengthen the manuscript.

Referee: [Abstract] Abstract: the load-bearing empirical claim that shallow LLM layers capture local dynamics while deeper layers encode global trends is stated without reference to controls for input representation, patching scheme, or LLM choice; if the observed specialization is an artifact of the specific tokenization or datasets used for probing, the performance advantage over prior final-layer baselines is not explained by the stated mechanism.

Authors: We thank the referee for this observation. The original empirical analysis was conducted under the standard patching and tokenization of the benchmarks using Llama-2. To address robustness concerns, the revised manuscript now includes additional experiments (new Section 4.3 and Appendix C) that vary patching schemes, input representations, and LLM families (including Llama-3 and Mistral). These controls confirm the shallow-local and deep-global specialization persists, supporting that the gains over final-layer baselines arise from the proposed mechanism rather than setup-specific artifacts. The abstract has been updated to reference these controls.

revision: yes

Referee: [Method] Method section (Local-Mixer and Global-Mixer description): the modules are introduced to align features across layers, yet no ablation isolates their contribution versus simpler concatenation or attention-based fusion; without such controls, it remains unclear whether the reported gains require the full proposed architecture or could be achieved by routing any multi-layer features through a single mixer.

Authors: We agree that isolating the mixers' contribution is important. The revised manuscript adds an ablation study (Section 5.2) comparing Logo-LLM against variants using direct multi-layer concatenation and a single attention-based fusion module in place of the separate Local-Mixer and Global-Mixer. Results show the specialized mixers yield further accuracy gains, especially in few-shot settings, indicating the design is not interchangeable with simpler fusion. The method section has been clarified to explain the rationale for separate local and global alignment.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation or claims

The paper's central elements consist of an empirical observation on LLM layer representations for time series (shallow layers for local dynamics, deeper for global trends), followed by introduction of Local-Mixer and Global-Mixer modules to exploit this, and validation via benchmark experiments including few-shot and zero-shot settings. No equations are presented that reduce a claimed prediction or result to fitted inputs or self-definitions by construction. The architecture is motivated by the stated empirical analysis rather than redefining quantities circularly, and no load-bearing self-citations or uniqueness theorems from prior author work are invoked in the provided text. Performance superiority is asserted based on experimental outcomes, which remain externally falsifiable and independent of the design rationale itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central design rests on an empirical observation about LLM layer semantics and introduces two new mixer modules whose value is justified only by the paper's own experiments.

axioms (1)

domain assumption: Shallow layers of LLMs capture local dynamics in time series, while deeper layers encode global trends.
This premise is invoked to justify the multi-layer extraction strategy and is presented as shown through empirical analysis.

invented entities (2)

Local-Mixer module (no independent evidence)
purpose: Align and integrate local features from shallow LLM layers with the temporal input
New lightweight component introduced to handle short-term patterns.

Global-Mixer module (no independent evidence)
purpose: Align and integrate global features from deeper LLM layers with the temporal input
New lightweight component introduced to handle long-term trends.

pith-pipeline@v0.9.0 · 5713 in / 1336 out tokens · 83214 ms · 2026-05-22T15:12:23.176458+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
"Through empirical analysis, we show that shallow layers of LLMs capture local dynamics in time series, while deeper layers encode global trends."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
"Logo-LLM introduces lightweight Local-Mixer and Global-Mixer modules to align and integrate features with the temporal input across layers."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 4 internal anchors

[1]

Scientific reports 12, 16327

Deep language algorithms predict semantic comprehension from brain activity. Scientific Reports 12, 16327. Chen, P., Zhang, Y., Cheng, Y., Shu, Y., Wang, Y., Wen, Q., Yang, B., Guo, C., 2024. Pathformer: Multi-scale transformers with adaptive pathways for time series forecasting. arXiv preprint arXiv:2402.05956. Chen, Y., Liu, H., Yin, H., Fan, B.,

[2]

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555v1. Dai, T., Wu, B., Liu, P., Li, N., Bao, J., Jiang, Y., Xia, S.T., 2024. Periodicity decoupling framework for long-term series forecasting, in: The Twelfth International Conference on Learning Representations. Das, A., Kong, W., Leach, A., Mathur, S., Sen, R., Yu, R.,

[3]

Long-term forecasting with tide: Time-series dense encoder

Long-term forecasting with tide: Time-series dense encoder. arXiv preprint arXiv:2304.08424. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.,

[4]

1997, Neural computation, 9, 1735, doi: 10.1162/neco.1997.9.8.1735

Long short-term memory. Neural Computation 9, 1735–1780. doi:10.1162/neco.1997.9.8.1735. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al., 2022. Lora: Low-rank adaptation of large language models. ICLR 1,

[5]

Time-LLM: Time series forecasting by reprogramming large language models, in: International Conference on Learning Representations (ICLR)

Jin, M., Wang, S., Ma, L., Chu, Z., Zhang, J.Y., Shi, X., Chen, P.Y., Liang, Y., Li, Y.F., Pan, S., Wen, Q., 2024. Time-LLM: Time series forecasting by reprogramming large language models, in: International Conference on Learning Representations (ICLR). Kim, T., Kim, J., Tae, Y., Park, C., Choi, J.H., Choo, J.,

[6]

arXiv preprint arXiv:2410.21353

Causal interventions on causal paths: Mapping gpt-2’s reasoning from syntax to semantics. arXiv preprint arXiv:2410.21353. Li, K., Yu, R., Wang, Z., Yuan, L., Song, G., Chen, J.,

[7]

Locality guidance for improving vision transformers on tiny datasets, in: European Conference on Computer Vision, Springer. pp. 110–127. Li, Z., Qi, S., Li, Y., Xu, Z., 2023a. Revisiting long-term time series forecasting: An investigation on linear mapping. ArXiv abs/2305.10721. Li, Z., Rao, Z., Pan, L., Xu, Z., 2023b. Mts-mixers: Multivariate time series...

[8]

iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting, in: International Conference on Learning Representations. Liu, Y., Hu, T., Zhang, H., Wu, H., Wang, S., Ma, L., Long, M., 2023a. itransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625. Liu, Y., Li, C., Wang, J., Long, M., ...

[9]

arXiv preprint arXiv:2403.01509

Fantastic semantics and where to find them: Investigating which layers of generative llms reflect lexical semantics. arXiv preprint arXiv:2403.01509. Nie, Y., H. Nguyen, N., Sinthong, P., Kalagnanam, J.,

[10]

Enhancing multivariate time series forecasting with multi-scale moving transformation, in: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. 1–5. Ou, W., Zhao, Z., Guo, D., Zhang, Z., Lin, Y., 2024. Winnet: make only one convolutional layer effective for time series forecasting, in: International Conference on Intelli...

[11]

DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks

Salinas, D., Flunkert, V., Gasthaus, J., Januschowski, T., 2020. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting URL: https://doi.org/10.48550/arXiv.1704.04110. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.,

[12]

Going deeper with convolutions. CVPR. Wenjie Ou et al.: Preprint submitted to Elsevier Page 11 of 12 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I., 2017. Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS). Wang, H., Peng, J., Huang, F., Wang, J., Chen, J., Xiao, Y., 2023. Micn: Multi...

[13]

arXiv preprint arXiv:2207.01186

Less is more: Fast multivariate time series forecasting with light sampling-oriented mlp structures. arXiv preprint arXiv:2207.01186. Zhang, Y., Ma, L., Pal, S., Zhang, Y., Coates, M.,

[14]

Informer: Beyond efficient transformer for long sequence time-series forecasting, in: The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, AAAI Press. pp. 11106–11115. Zhou, T., Ma, Z., Wang, X., Wen, Q., Sun, L., Yao, T., Yin, W., Jin, R., 2022a. Film: Frequency improved legendre memory model for long-term time seri...
