PilotWiMAE: Pilot-Native Representation Learning for Wireless Channels

Berkay Guler; Giovanni Geraci; Hamid Jafarkhani

arXiv: 2605.22856 · v1 · pith:XCYIOACDnew · submitted 2026-05-19 · 📡 eess.SP · cs.AI · cs.IT · cs.LG · cs.NI · math.IT


Pith reviewed 2026-05-25 00:04 UTC · model grok-4.3


keywords: self-supervised learning · wireless channel modeling · pilot observations · beam selection · cross-frequency generalization · channel estimation · factorized attention · noisy pilots


The pith

Pilot-native self-supervised learning produces channel representations that transfer from 3.5 GHz pretraining to 28 GHz evaluation and outperform supervised baselines on beam selection despite using far fewer observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PilotWiMAE as a self-supervised framework that takes noisy pilot observations as direct input rather than assuming full channel state information is available. Its encoder applies factorized attention that separates temporal processing from joint space-frequency processing to exploit the physical separability of wireless channels. This design supports pretraining with 99 percent masking and an auxiliary scale loss that recovers both small-scale and large-scale fading. When pretrained only at 3.5 GHz, the resulting representations enable stronger cross-frequency beam selection and channel characterization at 28 GHz than supervised methods trained on full observations at the target frequency. A subsequent decoder-centric pretraining stage further improves channel estimation performance without degrading the learned representations.

Core claim

PilotWiMAE is a self-supervised framework whose encoder ingests noisy pilot observations directly and whose attention factorizes along the axis separating temporal from joint space-frequency processing. The design allows a 99 percent pretraining mask ratio while using patch-normalized reconstruction and an auxiliary scale loss, plus an AWGN curriculum to match deployment noise. Pretrained solely on 3.5 GHz data, the model achieves superior cross-frequency beam selection and channel characterization at 28 GHz compared with supervised baselines, even though its observation space is up to two orders of magnitude smaller. A decoder-centric pretraining stage weakens the coupling between decoder capacity and representation quality.
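
As a rough illustration of how patch-normalized reconstruction and an auxiliary scale loss might divide the work, the NumPy sketch below normalizes each target patch to zero mean and unit variance, so the reconstruction term sees only the small-scale structure, and regresses a per-patch log-power value separately. The function names, the log-power scale target, and the weight `lam` are assumptions made for illustration; this page does not give the paper's actual loss definitions.

```python
import numpy as np

def patch_normalize(patches, eps=1e-6):
    """Normalize each patch (row) to zero mean and unit variance.

    Also returns a per-patch log-RMS value as a stand-in "scale" target.
    Both choices are illustrative assumptions, not the paper's definition.
    """
    mean = patches.mean(axis=1, keepdims=True)
    var = patches.var(axis=1, keepdims=True)
    normalized = (patches - mean) / np.sqrt(var + eps)
    log_scale = 0.5 * np.log(np.mean(patches ** 2, axis=1) + eps)
    return normalized, log_scale

def pretraining_loss(pred_patches, pred_log_scale, target_patches, mask, lam=0.1):
    """MSE on normalized masked patches plus a weighted scale-regression term.

    mask holds 1.0 for patches that were hidden from the encoder, else 0.0.
    lam is a hypothetical weighting between the two terms.
    """
    target_norm, target_log_scale = patch_normalize(target_patches)
    per_patch = np.mean((pred_patches - target_norm) ** 2, axis=1)
    recon = np.sum(per_patch * mask) / max(mask.sum(), 1)
    scale = np.mean((pred_log_scale - target_log_scale) ** 2)
    return recon + lam * scale
```

In this sketch, multiplying a patch by a constant leaves its normalized form essentially unchanged and only shifts the log-scale target, which is why a separate scale term is needed if large-scale fading information is to survive normalization.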

What carries the argument

Factorized attention that separates temporal processing from joint space-frequency processing, allowing the encoder to build representations from highly masked noisy pilot inputs by exploiting channel separability.
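
A minimal NumPy sketch of one way such a factorization could look, assuming tokens laid out on a grid of shape `(T, S, D)` with `T` time steps, `S` joint space-frequency positions, and embedding size `D`. The single head, identity query/key/value projections, residual connections, and the temporal-then-space-frequency ordering are simplifying assumptions, not details confirmed by this page.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head attention over the second-to-last axis (identity projections)."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def temporal_attention(tokens):
    """For each space-frequency position, attend across the T time steps."""
    x = np.swapaxes(tokens, 0, 1)          # (S, T, D)
    x = x + self_attention(x)              # residual connection
    return np.swapaxes(x, 0, 1)            # back to (T, S, D)

def space_freq_attention(tokens):
    """For each time step, attend jointly across the S space-frequency positions."""
    return tokens + self_attention(tokens)  # attention runs over axis S

def factorized_block(tokens):
    """Temporal attention followed by joint space-frequency attention."""
    return space_freq_attention(temporal_attention(tokens))

def attention_score_counts(T, S):
    """Pairwise scores computed by full attention versus the factorized form."""
    full = (T * S) ** 2
    factorized = S * T * T + T * S * S
    return full, factorized
```

For example, `attention_score_counts(4, 16)` gives 4096 pairwise scores for full attention over all 64 tokens against 1280 for the factorized form, which hints at why the factorized layout is cheaper as the grid grows.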

If this is right

Supports pretraining at a 99 percent mask ratio without collapse.
Reduces required observation space by up to two orders of magnitude while lowering latency.
Enables cross-frequency generalization from sub-6 GHz pretraining to millimeter-wave evaluation.
Yields competitive channel estimation after a decoder-centric pretraining stage.
Removes the deployment assumption of full CSI availability.
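
To make the 99 percent mask ratio concrete, the sketch below draws a random mask over a flat list of tokens and keeps only the visible indices for the encoder. The helper name `random_mask` and the uniform-random sampling are assumptions for illustration; this page does not say whether the paper samples uniformly or over which token set the ratio is measured.

```python
import numpy as np

def random_mask(num_tokens, mask_ratio=0.99, rng=None):
    """Return (visible indices, boolean mask) for MAE-style pretraining.

    mask[i] is True when token i is hidden from the encoder.
    At least one token is always kept visible.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_keep = max(1, int(round(num_tokens * (1.0 - mask_ratio))))
    keep_idx = np.sort(rng.permutation(num_tokens)[:num_keep])
    mask = np.ones(num_tokens, dtype=bool)
    mask[keep_idx] = False
    return keep_idx, mask
```

With 1000 tokens and the default ratio, the encoder in this sketch would see only 10 tokens and the remaining 990 would be reconstruction targets.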

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same factorized structure could be tested on additional frequency pairs or measured outdoor datasets to check robustness beyond ray-tracing simulations.
Releasing the pretrained weights and the Sionna-based channel generator may allow other groups to explore whether the representations transfer to tasks such as positioning or interference management.
If the separability bias proves effective, similar factorized attention patterns might be applied to other physical time-series domains that exhibit space-time-frequency structure.

Load-bearing premise

Wireless channels possess separable structure along temporal versus joint space-frequency axes that factorized attention can reliably extract from noisy, heavily masked pilot observations.

What would settle it

Pretrain the model exclusively on 3.5 GHz pilots, then measure beam selection accuracy at 28 GHz; if performance does not exceed that of supervised baselines given full CSI at 28 GHz, the claimed advantage of pilot-native cross-frequency representations would not hold.
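
The comparison described above ultimately comes down to a beam-selection accuracy number for each method. A minimal sketch of a top-k accuracy metric follows, assuming each model outputs one score per candidate beam and the ground truth is the index of the best beam; the function name and array layout are illustrative assumptions rather than the paper's evaluation code.

```python
import numpy as np

def top_k_accuracy(scores, best_beam, k=3):
    """Fraction of samples whose true best beam is among the k highest scores.

    scores: (N, B) array of predicted scores, one column per beam.
    best_beam: (N,) array of ground-truth beam indices.
    """
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == best_beam[:, None]).any(axis=1)
    return float(hits.mean())
```

Running the same function on the pilot-native model and on the supervised full-CSI baseline, at matched SNR values, would give the two curves whose ordering decides the question.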

Figures

Figures reproduced from arXiv: 2605.22856 by Berkay Guler, Giovanni Geraci, Hamid Jafarkhani.

**Figure 1.** High-level PilotWiMAE pipeline: The model consumes sparse noisy pilot observations directly, pilot representations support direct decision-making

**Figure 2.** PilotWiMAE architecture. Pilot patches feed the FST encoder. The resulting representations are decoded by a JST transformer. Finally, the tokens are …

**Figure 3.** Factorized space-time attention on the patch-token grid with axes …

**Figure 4.** PilotWiMAE pretraining flow and loss groupings. Auxiliary scale …

**Figure 5.** OOD (28 GHz, Los Angeles): top-3 beam-selection accuracy vs SNR

**Figure 8.** Channel characterization (LoS accuracy) in-distribution at 28 GHz.

**Figure 9.** Channel characterization (LoS accuracy) out-of-distribution at 28 GHz.

**Figure 10.** Los Angeles OOD channel estimation at 3.5 GHz using a frozen FST+noise+scale encoder with decoder-only pretraining. Curves show NMSE versus SNR for decoder depths 1, 2, 4, 6, and 12.

**Figure 12.** Visualization of the fixed pilot resource elements (highlighted) on …

Original abstract

Channel foundation models assume access to fully observed channels, an assumption that fails in deployment. We introduce PilotWiMAE, a self-supervised framework whose encoder ingests noisy pilot observations directly and whose attention factorizes along the axis separating temporal from joint space-frequency processing, an inductive bias inspired by the physics of the problem. Pilot input shrinks the observation space by up to two orders of magnitude and also removes the unrealistic assumption of full-CSI availability while incurring lower latency. The factorized design generates robust representations by exploiting the separable channel structure and allows a pretraining mask ratio of 99%. We pair patch-normalized reconstruction, which captures small-scale fading structure, with an auxiliary scale loss that recovers the large-scale fading features, and use an AWGN curriculum to match pilot noise at pretraining and deployment. Pretrained solely on 3.5 GHz and evaluated at 28 GHz across in-distribution and out-of-distribution settings, PilotWiMAE's cross-frequency beam selection and channel characterization beat supervised baselines despite operating on a smaller observation space. To weaken the coupling between decoder capacity and representation quality, we further propose a decoder-centric pretraining stage following the encoder-decoder joint pretraining, which allows PilotWiMAE to demonstrate competitive channel estimation without sacrificing representation quality. To foster further work in this direction, we release the PilotWiMAE pretrained weights and training pipeline, together with CSIGen, our Sionna-based ray-tracing channel-generation tool, and the channel datasets used in this work.
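
The abstract mentions an AWGN curriculum without describing its schedule. As a hedged sketch of the general idea, the code below adds complex white Gaussian noise to pilot observations at a requested SNR and pairs it with a hypothetical linear schedule that moves from high to low SNR over training. The function names, the linear shape, and the 30 dB to 0 dB range are all assumptions for illustration, not the paper's settings.

```python
import numpy as np

def add_awgn(pilots, snr_db, rng):
    """Add complex AWGN so the result has the requested average SNR in dB."""
    signal_power = np.mean(np.abs(pilots) ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    # Split the noise power equally between real and imaginary parts.
    noise = np.sqrt(noise_power / 2.0) * (
        rng.standard_normal(pilots.shape) + 1j * rng.standard_normal(pilots.shape)
    )
    return pilots + noise

def curriculum_snr_db(step, total_steps, start_db=30.0, end_db=0.0):
    """Hypothetical linear schedule from start_db down to end_db."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start_db + frac * (end_db - start_db)
```

A training loop built on this sketch would call `curriculum_snr_db(step, total_steps)` once per step and pass the result to `add_awgn` before patching and masking the pilots.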

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PilotWiMAE sketches a pilot-native self-supervised setup with factorized attention and high masking for wireless channels, but the abstract supplies no numbers so the cross-frequency claims stay uncheckable.


The core pitch is a self-supervised model that ingests noisy pilots directly instead of full CSI, uses attention that splits temporal from joint space-frequency processing, pretrains at 99% masking with an AWGN curriculum, and adds a decoder-centric stage after joint pretraining. It claims this beats supervised baselines on beam selection and channel characterization when moving from 3.5 GHz pretraining to 28 GHz evaluation, all while using a far smaller observation space. They also release the weights, pipeline, Sionna-based generator, and datasets, which is concrete help for follow-on work.

That combination of pilot input, physics-motivated factorization, dual losses for small- and large-scale fading, and the extra decoder stage is not just a rebrand of prior masked autoencoders; it targets a real deployment mismatch. The inductive bias around separable channel structure is reasonable on its face and could explain why heavy masking still yields usable representations. Releasing artifacts lowers the barrier for anyone who wants to test the transfer claim themselves.

The obvious gap is that none of the performance assertions come with numbers, baseline descriptions, dataset sizes, or controls. You cannot tell whether the supervised comparisons are fair, whether the gains are large enough to matter, or whether the separability assumption actually drives the result. The abstract-only view also leaves open whether the factorized attention is implemented in a way that truly exploits the physics or just adds parameters.

This is aimed at researchers building ML tools for realistic 5G/6G channel tasks who already work with pilots and ray-tracing data. A reader who cares about self-supervised methods on structured physical signals might extract useful design ideas even if the empirical claims need verification. I would bring the full paper to a reading group if the experiments are there and reproducible; based on the abstract alone the work is too thin to cite yet. It still deserves peer review because the problem is practical, the approach shows some care in matching deployment constraints, and the artifact release makes it falsifiable once the details appear.

Referee Report

3 major / 1 minor

Summary. The paper introduces PilotWiMAE, a self-supervised framework for wireless channel representation learning that ingests noisy pilot observations directly via an encoder with factorized attention separating temporal from joint space-frequency processing. Pretrained solely on 3.5 GHz data, it is evaluated at 28 GHz for cross-frequency beam selection and channel characterization, claiming to outperform supervised baselines despite a smaller observation space. The approach incorporates patch-normalized reconstruction, an auxiliary scale loss, an AWGN curriculum, and a decoder-centric pretraining stage; the authors release pretrained weights, the training pipeline, the CSIGen Sionna-based tool, and channel datasets.

Significance. If the performance claims hold under rigorous evaluation, the work would be significant for wireless AI by relaxing the full-CSI assumption common to channel foundation models, enabling practical pilot-based deployment with up to 99% masking, and demonstrating cross-band (sub-6 to mmWave) transfer via a physics-motivated inductive bias. The open release of weights, code, and datasets would further support reproducibility and extension.

major comments (3)

[Abstract] The claim that PilotWiMAE 'beat supervised baselines' in cross-frequency beam selection and channel characterization comes with no numerical results, baseline descriptions, dataset sizes, error bars, or experimental controls. The central performance-superiority assertion, which is load-bearing for the paper's main contribution, is therefore impossible to evaluate.
[Abstract] The factorized attention mechanism and its claimed exploitation of separable temporal versus joint space-frequency channel structure are described at a high level without equations, architectural diagrams, or ablation results, preventing verification of how this design enables robust representations from highly masked noisy pilots.
[Abstract] The pretraining elements (patch-normalized reconstruction, auxiliary scale loss, AWGN curriculum, and decoder-centric stage) are named but lack any implementation details, quantitative impacts, or comparisons, which are required to substantiate the claims of competitive channel estimation without loss of representation quality.

minor comments (1)

[Abstract] The statement that the framework 'incurs lower latency' is asserted without supporting analysis or comparison to full-CSI baselines.

Simulated Author's Rebuttal

3 responses · 3 unresolved

We thank the referee for the review and the detailed comments on the abstract. We address each major comment point by point below. The abstract is intentionally concise as a summary of the full manuscript contributions.


Referee: [Abstract] The claim that PilotWiMAE 'beat supervised baselines' in cross-frequency beam selection and channel characterization comes with no numerical results, baseline descriptions, dataset sizes, error bars, or experimental controls. The central performance-superiority assertion, which is load-bearing for the paper's main contribution, is therefore impossible to evaluate.

Authors: The abstract provides a high-level summary of the results. The detailed numerical results, baselines, dataset sizes, error bars, and experimental controls are presented in the experimental sections of the full manuscript. We cannot provide these specifics here as only the abstract is available. Revision: no.

Referee: [Abstract] The factorized attention mechanism and its claimed exploitation of separable temporal versus joint space-frequency channel structure are described at a high level without equations, architectural diagrams, or ablation results, preventing verification of how this design enables robust representations from highly masked noisy pilots.

Authors: The factorized attention mechanism is detailed with equations, diagrams, and ablations in the methods and experiments sections of the full paper. The abstract summarizes the approach at a high level. Revision: no.

Referee: [Abstract] The pretraining elements (patch-normalized reconstruction, auxiliary scale loss, AWGN curriculum, and decoder-centric stage) are named but lack any implementation details, quantitative impacts, or comparisons, which are required to substantiate the claims of competitive channel estimation without loss of representation quality.

Authors: Implementation details, quantitative impacts, and comparisons for the pretraining elements are provided in the pretraining and ablation studies of the full manuscript. Revision: no.

standing simulated objections not resolved

Specific numerical results, baseline descriptions, dataset sizes, error bars, or experimental controls supporting the performance claims
Equations, architectural diagrams, or ablation results for the factorized attention mechanism
Implementation details, quantitative impacts, or comparisons for the pretraining elements (patch-normalized reconstruction, auxiliary scale loss, AWGN curriculum, decoder-centric stage)

Circularity Check

0 steps flagged

No significant circularity detected


Only the abstract is provided; it contains no equations, derivations, fitted parameters, or self-citations. No derivation chain exists to inspect, and the described framework (factorized attention, patch-normalized reconstruction, auxiliary scale loss, AWGN curriculum) is presented at a high level without any reduction of outputs to inputs by construction. The cross-frequency transfer claim is empirical and cannot be evaluated for circularity from the given text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; ledger is empty by necessity.

pith-pipeline@v0.9.0 · 5781 in / 1296 out tokens · 33323 ms · 2026-05-25T00:04:04.268147+00:00 · methodology



[1] [1]

PilotWiMAE: Wireless channel pilots are all you need,

B. Guleret al., “PilotWiMAE: Wireless channel pilots are all you need,” Submitted to the International Conference on Machine Learning (ICML), AI4NextG Workshop, 2026

work page 2026

[2] [2]

LWM: A pre-trained wireless foundation model for universal feature extraction,

S. Alikhaniet al., “LWM: A pre-trained wireless foundation model for universal feature extraction,” inProc. IEEE ICMLCN, May 2025, pp. 1–6

work page 2025

[3] [3]

A MIMO wireless channel foundation model via CIR- CSI consistency,

J. Jianget al., “A MIMO wireless channel foundation model via CIR- CSI consistency,” inProc. IEEE ICMLCN, May 2025, pp. 1–6

work page 2025

[4] [4]

CSI-MAE: A masked autoencoder-based channel foun- dation model,

J. Jianget al., “CSI-MAE: A masked autoencoder-based channel foun- dation model,” 2026, arXiv:2601.03789

work page arXiv 2026

[5] [5]

WiFo: Wireless foundation model for channel prediction,

B. Liuet al., “WiFo: Wireless foundation model for channel prediction,” Science China Information Sciences, vol. 68, no. 6, p. 162302, May 2025a

work page

[6] [6]

LWM-Temporal: Sparse spatio-temporal attention for wireless channel representation learning,

S. Alikhaniet al., “LWM-Temporal: Sparse spatio-temporal attention for wireless channel representation learning,” 2026, arXiv:2603.10024

work page arXiv 2026

[7] [7]

WirelessGPT: A generative foundation model for multi- task integrated sensing and communication,

T. Yanget al., “WirelessGPT: A generative foundation model for multi- task integrated sensing and communication,”IEEE JSAC, vol. 44, pp. 2259–2273, 2026

work page 2026

[8] [8]

LLM4CP: Adapting large language models for channel prediction,

B. Liuet al., “LLM4CP: Adapting large language models for channel prediction,”Journal of Communications and Information Networks, vol. 9, no. 2, pp. 113–125, 2024

work page 2024

[9] [9]

A multi-task foundation model for wireless channel representation using contrastive and masked autoencoder learning,

B. Guleret al., “A multi-task foundation model for wireless channel representation using contrastive and masked autoencoder learning,” IEEE JSAC, vol. 44, pp. 4489–4504, 2026

work page 2026

[10] [10]

WiFo-CF: Wireless foundation model for CSI feedback,

X. Liuet al., “WiFo-CF: Wireless foundation model for CSI feedback,” 2025, arXiv:2508.04068

work page arXiv 2025

[11] [11]

Filter-and-attend: Wireless channel foundation model with noise-plus-interference suppression structure,

Y . Wanget al., “Filter-and-attend: Wireless channel foundation model with noise-plus-interference suppression structure,” 2026, arXiv:2509.15993

work page arXiv 2026

[12] [12]

A wireless foundation model for multi-task prediction,

Y . Shenget al., “A wireless foundation model for multi-task prediction,” 2025, arXiv:2507.05938

work page arXiv 2025

[13] [13]

Large wireless localization model (LWLM): A foundation model for positioning in 6G networks,

G. Panet al., “Large wireless localization model (LWLM): A foundation model for positioning in 6G networks,” 2025, arXiv:2505.10134

work page arXiv 2025

[14] [14]

Reducing pilots in channel estimation with predictive foundation models,

X. Zhouet al., “Reducing pilots in channel estimation with predictive foundation models,” 2026, arXiv:2512.15562

work page arXiv 2026

[15] [15]

WiFo-2: a generalist foundation model unifies heterogeneous wireless system design

B. Liuet al., “WiFo-2: a generalist foundation model unifies heteroge- neous wireless system design,” 2026, arXiv:2511.22222

work page internal anchor Pith review Pith/arXiv arXiv 2026

[16] [16]

6G WavesFM: A foundation model for sensing, communication, and localization,

A. Aboulfotouhet al., “6G WavesFM: A foundation model for sensing, communication, and localization,”IEEE Open J. Commun. Soc., vol. 6, pp. 6792–6807, 2025

work page 2025

[17] [17]

LVM4CSI: Enabling direct application of pre-trained large vision models for wireless channel tasks,

J. Guoet al., “Lvm4csi: Enabling direct application of pre-trained large vision models for wireless channel tasks,” 2025, arXiv:2507.05121

work page arXiv 2025

[18] [18]

MUSE-FM: Multi-task environment-aware foundation model for wireless communications,

T. Zhenget al., “MUSE-FM: Multi-task environment-aware foundation model for wireless communications,” 2026, arXiv:2509.01967

work page arXiv 2026

[19] [19]

OFDM channel estimation by singular value decom- position,

O. Edforset al., “OFDM channel estimation by singular value decom- position,”IEEE Trans. Commun., vol. 46, no. 7, pp. 931–939, Jul. 1998

work page 1998

[20] [20]

Channel estimation techniques based on pilot arrange- ment in OFDM systems,

S. Coleriet al., “Channel estimation techniques based on pilot arrange- ment in OFDM systems,”IEEE Transactions on Broadcasting, vol. 48, no. 3, pp. 223–229, Sep. 2002

work page 2002

[21] D. Hendrycks and T. Dietterich, “Benchmarking neural network robustness to common corruptions and perturbations,” in Proc. ICLR, 2019

[22] R. Taori et al., “Measuring robustness to natural distribution shifts in image classification,” in Proc. NeurIPS, 2020

[23] A. Vaswani et al., “Attention is all you need,” in Proc. NeurIPS, 2017, pp. 6000–6010

[24] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. ICLR, 2021

[25] J. Devlin et al., “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL HLT, Jun. 2019, pp. 4171–4186

[26] K. He et al., “Masked autoencoders are scalable vision learners,” in Proc. CVPR, Jun. 2022, pp. 15 979–15 988

[27] J. Kaplan et al., “Scaling laws for neural language models,” 2020, arXiv:2001.08361

[28] J. Hoffmann et al., “Training compute-optimal large language models,” in Proc. NeurIPS, 2022

[29] X. Zhai et al., “Scaling vision transformers,” in Proc. CVPR, 2022, pp. 1204–1213

[30] 3GPP, “NR; Physical Channels and Modulation,” 3rd Generation Partnership Project (3GPP), Technical Specification TS 38.211, Mar. 2026, v19.3.0

[31] T. Dao et al., “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” 2022, arXiv:2205.14135

[32] P. Bello, “Characterization of randomly time-variant linear channels,” IEEE Transactions on Communications Systems, vol. 11, no. 4, pp. 360–393, Dec. 1963

[33] G. Matz and F. Hlawatsch, “Chapter 1 - fundamentals of time-varying communication channels,” in Wireless Communications Over Rapidly Time-Varying Channels, F. Hlawatsch and G. Matz, Eds. Oxford: Academic Press, 2011, pp. 1–63

[34] A. Salihu et al., “Self-supervised and invariant representations for wireless localization,” IEEE Trans. Wireless Commun., vol. 23, no. 8, pp. 8281–8296, Aug. 2024

[35] V. Chu et al., “WirelessJEPA: A multi-antenna foundation model using spatio-temporal wireless latent predictions,” 2026, arXiv:2601.20190

[36] Q. Zhang et al., “How mask matters: Towards theoretical understandings of masked autoencoders,” in Proc. NeurIPS, 2022

[37] G. Bertasius et al., “Is space-time attention all you need for video understanding?” in Proc. ICML, Jul. 2021, pp. 813–824

[38] A. Arnab et al., “ViViT: A video vision transformer,” in Proc. ICCV, Oct. 2021, pp. 6816–6826

[39] S. S. Yellapragada et al., “Computationally efficient neural receivers via axial self-attention,” 2026, arXiv:2510.12941

[40] A. Zubow et al., “Physics-informed transformer for multi-band channel frequency response reconstruction,” 2026, arXiv:2604.01944

[41] F. A. Aoudia et al., “Sionna RT: Technical report,” 2025, arXiv:2504.21719

[42] 3GPP, “Study on channel model for frequencies from 0.5 to 100 GHz,” 3rd Generation Partnership Project (3GPP), Technical Report TR 38.901

[43] 3GPP, “Physical layer procedures for data,” 3rd Generation Partnership Project (3GPP), Technical Specification TS 38.214

[44] S. A. Dudani, “The distance-weighted k-Nearest-Neighbor rule,” IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-6, no. 4, pp. 325–327, 1976