pith. machine review for the scientific record.

arxiv: 2604.08398 · v1 · submitted 2026-04-09 · 💻 cs.LG · cs.AI

Recognition: unknown

ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:07 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords time-series classification · pre-training · foundation models · mixed-batch training · data alignment · ADAPT · many-to-one pre-training · self-supervised learning

The pith

ADAPT aligns physical properties of time-series data to enable mixed-batch pre-training on 162 diverse datasets at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ADAPT as a pre-training paradigm that aligns the physical properties of time-series inputs so a single model can handle mixed batches from datasets with wildly different lengths and channel counts. Previous self-supervised methods worked in one-to-many settings but failed to scale when more datasets were added, limiting progress toward generalist time-series models. By solving the alignment problem, ADAPT trains successfully across 162 classification datasets and reaches new state-of-the-art accuracy. A reader should care because the result directly removes a practical barrier to building foundation models that learn from the full variety of available time-series data rather than from isolated collections.

Core claim

ADAPTive Input Training enables many-to-one pre-training for time-series classification by efficiently aligning the physical properties of the data. This alignment overcomes extreme discrepancies in input sizes and channel dimensions, supports mixed-batch training across many datasets simultaneously, and yields new state-of-the-art performance on classification benchmarks after training on 162 datasets.

What carries the argument

ADAPT (ADAPTive Input Training), a method that aligns physical properties of time-series data to permit mixed-batch pre-training without dataset-specific resizing or loss of task information.
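The paper's figures describe this alignment as adaptive pooling into a universal representation space. A minimal sketch of that idea, assuming an adaptive average pooling with PyTorch-style bin boundaries; `TARGET_LEN` and both function names are illustrative, not taken from the paper:

```python
# Illustrative sketch: series of any length pool to one fixed target
# length, so heterogeneous datasets can share a single mixed batch.
# TARGET_LEN and the function names are assumptions, not the paper's.

TARGET_LEN = 8  # shared representation length for every dataset

def adaptive_avg_pool(series, target_len=TARGET_LEN):
    """Average-pool a 1-D series into target_len bins, using
    PyTorch-style adaptive bin boundaries (floor/ceil)."""
    n = len(series)
    out = []
    for i in range(target_len):
        start = (i * n) // target_len                       # floor(i*n/T)
        end = ((i + 1) * n + target_len - 1) // target_len  # ceil((i+1)*n/T)
        window = series[start:end]
        out.append(sum(window) / len(window))
    return out

def align_series(channels, target_len=TARGET_LEN):
    """Pool each channel independently: the output shape is
    (num_channels, target_len) regardless of the input length."""
    return [adaptive_avg_pool(ch, target_len) for ch in channels]

short = align_series([[1.0, 2.0, 3.0, 4.0]])            # length-4 input
long = align_series([[float(t) for t in range(1000)]])  # length-1000 input
assert len(short[0]) == len(long[0]) == TARGET_LEN
```

In a real pipeline this would be `torch.nn.AdaptiveAvgPool1d` over batched tensors; note that channel-count mismatches need their own handling, which this sketch sidesteps by pooling channels independently.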

If this is right

  • Enables simultaneous pre-training on a wide range of time-series datasets with varying lengths and channels.
  • Achieves new state-of-the-art results on multiple classification benchmarks.
  • Provides a practical step toward generalist foundation models that operate across the time-series domain.
  • Removes the scaling barrier that previously restricted pre-training to one-to-many dataset scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment idea could be tested on time-series regression or forecasting tasks to see whether the benefit generalizes beyond classification.
  • If the method scales, it might allow a single model to serve many specialized applications such as sensor monitoring or financial series analysis without separate pre-training runs.
  • Similar physical-property alignment could be explored for other heterogeneous data types where input dimensions vary, such as multi-channel audio or video sequences.

Load-bearing premise

Aligning physical properties of the data is sufficient to overcome extreme discrepancies in input sizes and channel dimensions and produce effective mixed-batch pre-training without introducing new biases or losing task-relevant information.

What would settle it

If a model pre-trained with ADAPT on the 162 datasets shows no accuracy gain or loses accuracy when fine-tuned on a held-out time-series classification task compared with single-dataset pre-training, the alignment approach would be falsified.
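Under that reading, the test reduces to a paired comparison of fine-tuned accuracies on held-out datasets. A toy sketch of the comparison; the accuracy numbers below are invented for illustration:

```python
# Hypothetical falsification check: per-dataset fine-tuned accuracy
# under mixed-batch (ADAPT) vs single-dataset pre-training on
# held-out tasks. All numbers are invented.

def mean_paired_gain(mixed_acc, single_acc):
    """Mean per-dataset accuracy gain of mixed-batch over
    single-dataset pre-training; a value <= 0 on held-out tasks
    would count against the alignment claim under this reading."""
    assert len(mixed_acc) == len(single_acc)
    diffs = [m - s for m, s in zip(mixed_acc, single_acc)]
    return sum(diffs) / len(diffs)

# illustrative held-out results only
mixed = [0.91, 0.84, 0.78, 0.88]
single = [0.89, 0.85, 0.74, 0.86]
assert mean_paired_gain(mixed, single) > 0  # claim survives this toy check
```

A real evaluation would add repeated runs and a paired statistical test rather than a bare mean.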

Figures

Figures reproduced from arXiv: 2604.08398 by Paul Quinlan, Qingguo Li, Xiaodan Zhu.

Figure 1
Figure 1: An overview of the ADAPT process and the time-frequency masking algorithm. Each time series is adaptive-pooled (the [A] boxes) into a universal representation space, enabling mixed-batch training across different modality types. At training time, noise is added to the joint representation in both the time and frequency domains.
Figure 2
Figure 2: Model accuracy on UCR and UEA datasets compared with basic dataset properties. "Length" and "Channel Dimension" correspond to the total length of the input and the number of channels at each time step. "Number of Classes" is the number of classes in the downstream classification dataset, and "Total Dimensional Size" refers to the length of the input multiplied…
Figure 3
Figure 3: t-SNE visualisation of the time and frequency inputs before and after adaptive pooling on the Gesture dataset. There is no discernible drop in input quality after the transformation: the relative groupings and distributions of the 8 classes remain largely intact.
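Figure 1's training-time perturbation, noise added in both the time and frequency domains, can be sketched roughly as follows. The naive DFT round-trip, the sigma value, and the function name are assumptions for illustration, not the paper's implementation:

```python
import cmath
import random

def add_time_frequency_noise(series, sigma=0.05, seed=0):
    """Perturb a series in the time domain, then in the frequency
    domain via a naive DFT round-trip, loosely mirroring the two
    noise sites sketched in Figure 1. Illustrative only."""
    rng = random.Random(seed)
    n = len(series)
    # 1) time-domain noise
    noisy = [x + rng.gauss(0.0, sigma) for x in series]
    # 2) forward DFT
    spec = [sum(noisy[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]
    # 3) frequency-domain noise (real perturbation of each coefficient)
    spec = [c + rng.gauss(0.0, sigma) for c in spec]
    # 4) inverse DFT, keeping the real part
    return [(sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]

out = add_time_frequency_noise([0.0] * 16, sigma=0.01, seed=1)
assert len(out) == 16 and max(abs(v) for v in out) < 0.5
```

A production version would use a batched FFT (e.g. `torch.fft`) and structured masking rather than per-coefficient Gaussian noise; the point here is only that the same representation is perturbed in both domains.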
Original abstract

Recent work on time-series models has leveraged self-supervised training to learn meaningful features and patterns in order to improve performance on downstream tasks and generalize to unseen modalities. While these pretraining methods have shown great promise in one-to-many scenarios, where a model is pre-trained on one dataset and fine-tuned on a downstream dataset, they have struggled to generalize to new datasets when more datasets are added during pre-training. This is a fundamental challenge in building foundation models for time-series data, as it limits the ability to develop models that can learn from a large variety of diverse datasets available. To address this challenge, we present a new pre-training paradigm for time-series data called ADAPT, which can efficiently align the physical properties of data in the time-series domain, enabling mixed-batch pre-training despite the extreme discrepancies in the input sizes and channel dimensions of pre-training data. We trained on 162 time-series classification datasets and set new state-of-the-art performance for classification benchmarks. We successfully train a model within the time-series domain on a wide range of datasets simultaneously, which is a major building block for building generalist foundation models in time-series domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces ADAPT, a pre-training paradigm for time-series data that aligns physical properties (input sizes and channel dimensions) to enable mixed-batch pre-training across heterogeneous datasets. It reports training a model on 162 time-series classification datasets simultaneously and achieving new state-of-the-art performance on classification benchmarks, framing this as a foundational step toward generalist time-series foundation models.

Significance. If the empirical results hold under rigorous evaluation, this would constitute a meaningful advance by demonstrating scalable many-to-one pre-training in the time-series domain, directly addressing the input heterogeneity barrier that has constrained prior self-supervised approaches and supporting the development of more general models.

major comments (2)
  1. Abstract: the claim of setting new state-of-the-art performance after training on 162 datasets supplies no information on baselines, evaluation protocol, statistical testing, ablation studies, or implementation details, so the data cannot be assessed as supporting the central claim of effective mixed-batch pre-training.
  2. The description of ADAPT alignment: the manuscript states that aligning physical properties enables joint pre-training despite extreme discrepancies in input sizes and channel dimensions, but provides no quantitative validation (such as mutual information before/after alignment or single-dataset versus mixed pre-training ablations) that task-relevant information is preserved rather than distorted or lost.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The comments highlight opportunities to improve the clarity of our claims and the rigor of our validation for the ADAPT alignment. We address each major comment below and will revise the manuscript to incorporate the suggested enhancements.

Point-by-point responses
  1. Referee: Abstract: the claim of setting new state-of-the-art performance after training on 162 datasets supplies no information on baselines, evaluation protocol, statistical testing, ablation studies, or implementation details, so the data cannot be assessed as supporting the central claim of effective mixed-batch pre-training.

    Authors: We agree that the abstract would benefit from additional context to allow readers to assess the claims more readily. In the revised manuscript, we will expand the abstract to briefly reference the baselines used for comparison, the evaluation protocol applied across the 162 datasets, the use of repeated runs for statistical assessment, and pointers to the ablation studies and implementation details provided in the main text and supplementary material. These additions will better substantiate the central claim of effective mixed-batch pre-training while preserving the abstract's conciseness. revision: yes

  2. Referee: The description of ADAPT alignment: the manuscript states that aligning physical properties enables joint pre-training despite extreme discrepancies in input sizes and channel dimensions, but provides no quantitative validation (such as mutual information before/after alignment or single-dataset versus mixed pre-training ablations) that task-relevant information is preserved rather than distorted or lost.

    Authors: We acknowledge the value of quantitative validation for the alignment process. The current manuscript describes the ADAPT alignment and reports overall performance gains from mixed-batch training. To strengthen this section, we will add explicit quantitative ablations comparing single-dataset versus mixed pre-training results in the experiments section of the revision. We will also include a discussion of information preservation based on these empirical outcomes, using performance retention and related metrics as proxies, to demonstrate that task-relevant features are maintained rather than lost or distorted. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on training outcomes, not self-defining inputs

Full rationale

The paper proposes ADAPT as an alignment procedure for heterogeneous time-series inputs (lengths, channels) to support mixed-batch pre-training, then reports training on 162 datasets and new SOTA classification numbers. These are presented as experimental outcomes rather than quantities derived by construction from the alignment definition itself. No equations or steps equate a fitted parameter to a claimed prediction, invoke self-citation as the sole justification for a uniqueness claim, or rename an existing empirical pattern under new coordinates. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no technical description of the alignment procedure, loss functions, or architectural choices, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5502 in / 1133 out tokens · 45558 ms · 2026-05-10T17:07:52.222256+00:00 · methodology

