
arxiv: 2606.04770 · v1 · pith:BOBNWKSEnew · submitted 2026-06-03 · 📡 eess.SP

WiSER: A Wireless Scene Encoder for Geometry-Grounded Multi-View Wireless Prediction

Pith reviewed 2026-06-28 05:06 UTC · model grok-4.3

classification 📡 eess.SP
keywords wireless scene encoding · radiomap prediction · channel impulse response · 3D scene representation · indoor propagation · multi-view wireless prediction · geometry-grounded encoding

The pith

Transmitter-conditioned sparse 3D scene memory supports joint radiomap and multipath channel predictions from one encoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a sparse voxel scene plus a transmitter location can be encoded into a reusable, transmitter-conditioned sparse 3D memory. Two decoders then query this memory: one produces dense receiver-plane path-gain maps, the other produces variable-size sets of delay and power taps. The model is trained and tested on co-registered indoor geometry and ray-traced propagation data (ScanNet++ scenes labeled with Sionna ray tracing), where it is reported to outperform scene-specific radiomap baselines and reference CIR baselines. A sympathetic reader would care because the shared memory suggests that geometry can ground multiple wireless views without training a separate network for each task.

Core claim

WiSER maps a sparse voxel representation of an indoor scene and a transmitter location into a transmitter-conditioned sparse 3D scene memory, which is queried by a ray-corridor decoder for dense receiver-plane path-gain prediction and by a set decoder for variable-cardinality delay and power tap prediction. Experiments on co-registered scene and wireless data show that this design outperforms scene-specific radiomap baselines and substantially improves matched delay and power prediction over reference CIR baselines.
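The ray-corridor idea named above can be illustrated with a small geometric sketch. This summary does not give the paper's exact gathering rule, so the sketch assumes a corridor is simply the set of voxels whose centers lie within a fixed radius of the straight TX-RX segment. The function names and the `radius` parameter are hypothetical and are not taken from the paper.

```python
import math

def point_to_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (all 3-tuples)."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    ab_len_sq = sum(c * c for c in ab)
    if ab_len_sq == 0.0:
        # Degenerate segment: TX and RX coincide, so measure distance to a.
        t = 0.0
    else:
        # Project p onto the line through a and b, then clamp to the segment.
        t = sum(ap[i] * ab[i] for i in range(3)) / ab_len_sq
        t = max(0.0, min(1.0, t))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.sqrt(sum((p[i] - closest[i]) ** 2 for i in range(3)))

def gather_corridor(voxel_centers, tx, rx, radius):
    """Indices of voxels whose centers lie within `radius` of the TX-RX segment.

    Assumption: a fixed-radius tube around the direct path. The decoder in the
    paper may use a different corridor shape or a learned weighting.
    """
    return [
        i for i, center in enumerate(voxel_centers)
        if point_to_segment_distance(center, tx, rx) <= radius
    ]
```

In a learned decoder the selected indices would pick out feature vectors from the scene memory rather than raw voxel positions; the sketch only shows the selection step.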

What carries the argument

A transmitter-conditioned sparse 3D scene memory that encodes scene geometry for reuse across different wireless queries.

If this is right

  • The same encoder can serve heterogeneous propagation queries without separate training per task.
  • Sharing the scene memory improves both aggregate coverage predictions and path-level multipath structure.
  • Geometry-grounded representations become reusable building blocks for multiple wireless prediction problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The memory structure could support transfer to new transmitter locations or slight scene changes with limited additional data.
  • Real-world measurements might be used to adapt the encoded memory beyond simulation-only training.
  • The approach opens a path to combining the encoder with other sensor inputs for tasks such as localization.

Load-bearing premise

Simulated labels from 3D scenes accurately capture the joint statistics of radiomap and multipath structure that would appear in real indoor environments with actual materials and noise.

What would settle it

Compare the model's predicted path gains and delay-power tap sets against direct measurements collected in a physical indoor space whose geometry and transmitter positions match the input scenes.
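Comparing tap sets is less direct than comparing path-gain maps, because the predicted and measured sets are unordered and may differ in size. A minimal sketch of one way to do it, assuming each tap is a `(delay_ns, power_db)` pair and that taps are paired by a minimum-cost one-to-one assignment before errors are computed. The cost function, the `power_weight` parameter, and all function names are illustrative assumptions, not the paper's evaluation protocol. Brute-force search stands in here for the Hungarian algorithm, so the sketch is only practical for small sets.

```python
from itertools import permutations

def tap_cost(tap_a, tap_b, power_weight):
    """Assumed pairing cost: absolute delay gap plus weighted absolute power gap."""
    return abs(tap_a[0] - tap_b[0]) + power_weight * abs(tap_a[1] - tap_b[1])

def match_taps(pred, meas, power_weight=1.0):
    """Minimum-cost one-to-one pairing between two tap sets.

    Returns (pairs, total_cost), where pairs holds (pred_index, meas_index).
    Taps left over in the larger set stay unmatched.
    """
    swap = len(pred) > len(meas)
    small, large = (meas, pred) if swap else (pred, meas)
    best_cost, best_perm = float("inf"), ()
    # Try every injective mapping from the smaller set into the larger one.
    for perm in permutations(range(len(large)), len(small)):
        cost = sum(
            tap_cost(small[i], large[j], power_weight)
            for i, j in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    pairs = [(j, i) if swap else (i, j) for i, j in enumerate(best_perm)]
    return pairs, best_cost

def matched_errors(pred, meas, pairs):
    """Mean absolute delay and power errors over matched pairs, or None if empty."""
    if not pairs:
        return None
    n = len(pairs)
    delay_mae = sum(abs(pred[i][0] - meas[j][0]) for i, j in pairs) / n
    power_mae = sum(abs(pred[i][1] - meas[j][1]) for i, j in pairs) / n
    return delay_mae, power_mae
```

The number of unmatched taps on either side is worth reporting next to the matched errors, since a model could otherwise score well by predicting only the easiest paths.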

Figures

Figures reproduced from arXiv: 2606.04770 by Hao Ye, Jing Qiao, Yiyang Guo.

Figure 1: Overall WiSER architecture. (a) The sparse voxel scene tokens are encoded by repeated sparse attention and sparse downsampling, and the coarsest …
Figure 2: Ray-corridor feature gathering. (a) For a transmitter/receiver (TX/RX) …
Figure 3: Co-registered dataset-generation pipeline. A ScanNet++ scene is converted into a sparse voxel scene for learning and a Sionna-compatible radio …
Figure 4: Qualitative radiomap comparison on four representative scene–TX cases. Columns show ground truth, WiSER, NeRF2, RF-3DGS, and two radiomap …
Figure 5: Qualitative multipath CIR prediction examples with reference baselines and attention ablations. Green circles denote ground-truth path taps, colored …
Original abstract

Indoor wireless propagation is governed by the interaction among three-dimensional (3D) scene geometry, radiomaterial properties, and transmitter and receiver configuration, which jointly determine both aggregate coverage behavior and path-level multipath structure. However, most learning-based site-specific prediction methods are designed for a single wireless representation, such as radiomap estimation or channel impulse response (CIR) prediction, and therefore do not explicitly exploit the propagation structure shared across heterogeneous wireless views. This paper introduces WiSER, a Wireless Scene Encoder for joint radiomap and multipath CIR prediction. WiSER maps a sparse voxel representation of an indoor scene and a transmitter location into a transmitter-conditioned sparse 3D scene memory, which is queried by two structure-aware decoders: a ray-corridor decoder for dense receiver-plane path-gain prediction and a Detection Transformer (DETR)-style set decoder for variable cardinality delay and power tap prediction. To train and evaluate this setting, we construct a co-registered indoor scene and wireless dataset pipeline using ScanNet++ indoor scenes and Sionna Ray Tracing, producing aligned sparse voxel inputs, dense radiomap labels, and unordered multipath CIR tap sets under a common coordinate frame and propagation configuration. Experimental results show that WiSER outperforms scene-specific radiomap baselines and substantially improves matched delay and power prediction over reference CIR baselines. These results suggest that transmitter-conditioned sparse 3D scene representations can serve as reusable wireless scene encoders for heterogeneous propagation queries, providing a geometry-grounded step toward representation learning and foundation-model development for AI-native wireless systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces WiSER, a neural architecture that maps a sparse voxel representation of an indoor 3D scene plus transmitter location into a transmitter-conditioned sparse 3D scene memory. This memory is queried by a ray-corridor decoder for dense receiver-plane path-gain (radiomap) prediction and a DETR-style set decoder for variable-cardinality delay/power tap (CIR) prediction. The model is trained and evaluated on a co-registered dataset generated from ScanNet++ scenes via Sionna ray-tracing; the central claim is that the learned representations outperform scene-specific radiomap baselines and reference CIR baselines, enabling reusable geometry-grounded encoders for heterogeneous wireless queries.

Significance. If the performance gains are substantial, statistically validated, and the representations generalize beyond the synthetic training distribution, the work would constitute a concrete step toward multi-task, geometry-aware representation learning for wireless systems. The construction of an aligned sparse-voxel / radiomap / CIR dataset pipeline is a clear positive contribution that could support future foundation-model efforts in the field.

major comments (2)
  1. [Abstract] The claim that 'WiSER outperforms scene-specific radiomap baselines and substantially improves matched delay and power prediction over reference CIR baselines' is presented without any quantitative metrics, baseline descriptions, ablation tables, or statistical significance tests. Because the central claim of reusable encoders rests on demonstrated superiority, the absence of these results is load-bearing and prevents verification of the asserted gains.
  2. [Dataset construction] As described in the abstract and implied by the experimental sections, all labels are generated exclusively by Sionna ray-tracing over ScanNet++ geometry with default or assumed material properties and no sensor noise. This choice directly affects whether the learned features capture genuine joint statistics of real indoor radiomaps and multipath structure; without real-world validation or sensitivity analysis to material/noise mismatch, the claim that the encoder reflects propagation physics rather than simulation artifacts cannot be assessed.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

point-by-point responses
  1. Referee: [Abstract] The claim that 'WiSER outperforms scene-specific radiomap baselines and substantially improves matched delay and power prediction over reference CIR baselines' is presented without any quantitative metrics, baseline descriptions, ablation tables, or statistical significance tests. Because the central claim of reusable encoders rests on demonstrated superiority, the absence of these results is load-bearing and prevents verification of the asserted gains.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript reports these metrics, baseline descriptions, ablation studies, and statistical details in Sections 4 and 5. We will revise the abstract to incorporate specific performance numbers (e.g., radiomap NMSE improvements and CIR matching accuracy gains) with references to the corresponding tables.

    revision: yes

  2. Referee: [Dataset construction] As described in the abstract and implied by the experimental sections, all labels are generated exclusively by Sionna ray-tracing over ScanNet++ geometry with default or assumed material properties and no sensor noise. This choice directly affects whether the learned features capture genuine joint statistics of real indoor radiomaps and multipath structure; without real-world validation or sensitivity analysis to material/noise mismatch, the claim that the encoder reflects propagation physics rather than simulation artifacts cannot be assessed.

    Authors: The co-registered dataset is generated via Sionna ray-tracing on ScanNet++ to ensure precise alignment between geometry, radiomaps, and CIRs under controlled conditions, which is essential for training the multi-view model. We will add an explicit limitations paragraph discussing the use of default material properties and absence of sensor noise, along with a brief sensitivity discussion where feasible. However, real-world measurements and full mismatch analysis require new data collection that is outside the scope of this work.

    revision: partial
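The first response mentions radiomap NMSE without defining it. For orientation, a commonly used form of normalized mean squared error over a receiver grid is given below. This is an assumption about the metric, not the paper's stated definition: the symbols $G_{ij}$ (reference path gain), $\hat{G}_{ij}$ (predicted path gain), and $\Omega$ (the set of evaluated receiver cells) are illustrative, and the authors may normalize differently or compute the error on dB values.

```latex
\mathrm{NMSE}
  = \frac{\sum_{(i,j)\in\Omega} \bigl(\hat{G}_{ij} - G_{ij}\bigr)^{2}}
         {\sum_{(i,j)\in\Omega} G_{ij}^{2}}
```

Under this form a value of 0 means a perfect prediction, and a value of 1 matches the error of predicting zero everywhere.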

standing simulated objections not resolved
  • Comprehensive real-world validation or sensitivity analysis to material/noise mismatch, as this would require new measurement campaigns beyond the current simulation-based study.

Circularity Check

0 steps flagged

No circularity: standard supervised training on external simulation labels

full rationale

The paper presents a neural network architecture (transmitter-conditioned sparse voxel encoder plus ray-corridor and DETR-style decoders) trained end-to-end on labels generated by an independent external pipeline (Sionna ray-tracing over ScanNet++ geometry). No equation defines a quantity in terms of itself, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on a self-citation chain. The central claim that the learned representations are reusable across heterogeneous queries is an empirical outcome of supervised training and held-out evaluation, not a definitional identity. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim that the learned encoder is reusable across heterogeneous queries rests on the fidelity of the ray-tracing simulator and on the assumption that the chosen voxel resolution and training distribution capture the relevant propagation physics.

free parameters (1)
  • neural-network weights
    All encoder and decoder parameters are fitted to the simulated radiomap and CIR labels.
axioms (1)
  • domain assumption: Sionna ray tracing on ScanNet++ scenes produces labels whose joint radiomap and multipath statistics match real indoor propagation
    All training and evaluation labels are generated by this simulator; any mismatch directly affects both prediction tasks.
invented entities (1)
  • transmitter-conditioned sparse 3D scene memory (no independent evidence)
    purpose: Reusable latent representation queried by both the ray-corridor and DETR-style decoders
    New architectural component introduced to enable joint heterogeneous prediction; no independent evidence outside the trained model is provided.

pith-pipeline@v0.9.1-grok · 5812 in / 1492 out tokens · 41901 ms · 2026-06-28T05:06:57.182332+00:00

