arxiv: 2604.24143 · v1 · submitted 2026-04-27 · 💻 cs.LG

Machine-Learning-Based Classification of Radio Frequency Building Loss

Pith reviewed 2026-05-08 04:16 UTC · model grok-4.3

classification 💻 cs.LG
keywords machine learning · radio frequency loss · outdoor-to-indoor · indoor-to-indoor · semi-supervised learning · crowdsourced data · building penetration · wireless network planning

The pith

Semi-supervised machine learning on crowdsourced user data classifies radio frequency building loss more accurately than supervised methods alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that combining supervised and semi-supervised learning on passively collected phone measurements plus public building records yields better predictions of outdoor-to-indoor and indoor-to-indoor signal loss than supervised learning by itself. Traditional site surveys for these losses are expensive and limited in scale, while crowdsourced data is noisy and incomplete, so an approach that works with existing data under those constraints would let network operators map coverage more widely. The authors test several tree-based models and report concrete gains in accuracy and lower prediction uncertainty when the semi-supervised step is added.

Core claim

The central claim is that the proposed framework, which combines supervised learning (SL) and semi-supervised learning (SSL) on crowdsourced user-equipment (UE) data from 3GPP-compliant networks together with public building information, outperforms SL-only inference under identical data limits. It reports relative accuracy gains of up to 12.6 percent for outdoor-to-indoor (O2I) loss and 3.4 percent for indoor-to-indoor (I2I) loss, and a reduction in prediction entropy of up to 8.4 percent. SSL XGBoost gives the most confident O2I classification, and SSL LightGBM performs best on I2I.
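To make the distinction between relative and absolute gains concrete, here is a small worked sketch. The two accuracy values are the illustrative figures quoted in the simulated rebuttal further down this page (75.3% and 84.8%); they are not verified against the paper, and the function name `relative_gain` is a hypothetical helper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Illustrative accuracies in percent, taken from the simulated rebuttal on this page
# (assumption: not confirmed by the paper itself).
sl_accuracy = 75.3
ssl_accuracy = 84.8

absolute_gain = ssl_accuracy - sl_accuracy        # percentage points
gain = relative_gain(sl_accuracy, ssl_accuracy)   # fraction of the SL baseline

print(f"absolute: {absolute_gain:.1f} points, relative: {gain:.1%}")
# prints: absolute: 9.5 points, relative: 12.6%
```

A 12.6 percent relative gain therefore corresponds to about 9.5 percentage points at this baseline, which is why absolute accuracies are needed to judge the size of the effect.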

What carries the argument

The argument rests on an SL and SSL framework that fuses passively collected, crowdsourced user-equipment measurements from 3GPP-compliant networks with public building information. The framework is evaluated with tree-based ensemble classifiers: Random Forest, XGBoost, LightGBM, and a voting ensemble.
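The page does not say which semi-supervised scheme the paper uses. As a rough illustration only, the sketch below shows self-training (pseudo-labeling), one common SSL approach: fit on labeled samples, label the unlabeled samples the model is confident about, and refit. The nearest-centroid "model", the 0.8 confidence threshold, and all names are assumptions chosen to keep the example dependency-free; the paper's tree-based classifiers would take the model's place.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def fit_centroids(X: Sequence[Point], y: Sequence[str]) -> Dict[str, Point]:
    """Stand-in model (assumption): one mean feature vector per class."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}


def predict_proba(centroids: Dict[str, Point], x: Point) -> Dict[str, float]:
    """Soft class scores: normalized exp(-distance) to each class centroid."""
    scores = {label: math.exp(-math.dist(x, c)) for label, c in centroids.items()}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=5):
    """Self-training loop: move confidently predicted samples into the labeled set."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    centroids = fit_centroids(X_lab, y_lab)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for x in pool:
            label, p = max(predict_proba(centroids, x).items(), key=lambda kv: kv[1])
            (confident if p >= threshold else remaining).append((x, label))
        if not confident:
            break  # nothing new to learn from
        for x, label in confident:
            X_lab.append(x)
            y_lab.append(label)
        pool = [x for x, _ in remaining]
        centroids = fit_centroids(X_lab, y_lab)  # refit with pseudo-labels included
    return centroids, pool


# Toy one-feature data (hypothetical values, not from the paper).
labeled_X = [(0.0,), (1.0,), (9.0,), (10.0,)]
labeled_y = ["low", "low", "high", "high"]
unlabeled_X = [(0.5,), (9.5,), (5.0,)]

model, leftover = self_train(labeled_X, labeled_y, unlabeled_X)
print(leftover)  # samples the model never became confident about
```

In this toy run the ambiguous midpoint sample stays unlabeled, which illustrates the usual trade-off: a higher threshold admits fewer but cleaner pseudo-labels.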

If this is right

  • Higher accuracy in loss estimates is available even when labeled data remain scarce.
  • Model outputs become more certain, as shown by lower entropy values.
  • Network planners gain a scalable substitute for labor-intensive drive tests or manual surveys.
  • Indoor coverage optimization in dense cities can draw on routinely available data sources.
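The page does not define how prediction entropy is computed. A minimal sketch, assuming it is the Shannon entropy of each predicted class distribution averaged over samples, shows why lower values indicate more certain outputs. The function names and the three-class probabilities are hypothetical.

```python
import math
from typing import Sequence


def prediction_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits of one predicted class distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mean_entropy(batch: Sequence[Sequence[float]]) -> float:
    """Average entropy over a batch of predictions."""
    return sum(prediction_entropy(p) for p in batch) / len(batch)


# Hypothetical predictions over (low, medium, high) loss classes.
uncertain = [1 / 3, 1 / 3, 1 / 3]   # maximally unsure: log2(3) bits
confident = [0.90, 0.05, 0.05]      # mass concentrated on one class

print(prediction_entropy(uncertain), prediction_entropy(confident))
```

Note that low entropy measures confidence, not correctness: a model can be confidently wrong, so entropy should be read alongside accuracy.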

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Operators could replace portions of dedicated measurement campaigns with analysis of existing phone logs and map data.
  • The same data pipeline might be tested on other radio metrics such as delay spread or interference levels.
  • Performance across different cities or frequency bands would reveal whether the accuracy gains hold outside the original dataset.
  • Integration with real-time network management systems could allow dynamic adjustment of indoor small cells based on updated loss maps.

Load-bearing premise

The crowdsourced user-equipment measurements paired with public building records supply representative, unbiased features that let the models generalize accurately to new buildings and locations.

What would settle it

Gather fresh on-site measurements for a new collection of buildings outside the training set and check whether the combined SL-SSL models produce lower error or higher confidence than a pure supervised baseline on those same measurements.
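One way such a check could be scored, offered as an assumption rather than anything the paper proposes, is McNemar's exact test: it compares two classifiers on the same items by looking only at the buildings where exactly one of them is right. The function name and the toy labels below are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(truth: Sequence[str], pred_a: Sequence[str],
                  pred_b: Sequence[str]) -> Tuple[int, int, float]:
    """Return (a_only_correct, b_only_correct, two-sided exact p-value)."""
    a_only = sum(1 for t, a, b in zip(truth, pred_a, pred_b) if a == t and b != t)
    b_only = sum(1 for t, a, b in zip(truth, pred_a, pred_b) if a != t and b == t)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # the two models never disagree on correctness
    k = min(a_only, b_only)
    # Under the null hypothesis, discordant cases split like fair coin flips.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Hypothetical fresh measurements for 10 new buildings.
truth = ["low"] * 10
sl_pred = ["high"] * 6 + ["low"] * 4   # SL-only baseline: 6 errors
ssl_pred = ["low"] * 10                # SL+SSL model: no errors

sl_only, ssl_only, p_value = mcnemar_exact(truth, sl_pred, ssl_pred)
print(sl_only, ssl_only, p_value)  # prints: 0 6 0.03125
```

Because both models are scored on identical buildings, a paired test of this kind is more sensitive than comparing two accuracy figures from separate samples.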

Figures

Figures reproduced from arXiv: 2604.24143 by James Gross, Jiayi Tan, Neelabhro Roy, Rohit Chandra, Tsao-Tsen Chen.

Figure 1: Proposed ML Framework for Building Loss Classification.
Figure 2: Sample-level losses y_{s,f} (low, medium, high) across different serving cells c_s and frequency bands f are aggregated using public geospatial data and ensemble ML with majority voting to assign each building a final loss category y_{b,f} per frequency band f. F1-scores remained stable across methods (≥ 0.67). For O2I, supervised models achieved slightly higher F1-scores, while for I2I, SSL XGBoost achieved the …
Figure 3: O2I and I2I loss distribution of the test region.
Figure 5: I2I Post-Analysis (SSL LightGBM). … offline analysis without additional measurement campaigns. Future work can incorporate richer indoor building data, extend analysis to higher-frequency bands, and validate the approach across additional regions.
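
The aggregation described in the Figure 2 caption can be read as a majority vote over sample-level categories within each building and frequency band. The sketch below follows that reading; the caption could also refer to voting across ensemble members. The tie-breaking rule (prefer the higher loss class), the band labels, and the function name are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

SEVERITY = ("low", "medium", "high")  # ordered loss categories from the caption

Sample = Tuple[str, str, str]  # (building_id, frequency_band, loss_category)


def aggregate_building_loss(samples: Iterable[Sample]) -> Dict[Tuple[str, str], str]:
    """Assign each (building, band) the most frequent sample-level category."""
    votes: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for building_id, band, category in samples:
        votes[(building_id, band)][category] += 1

    result = {}
    for key, counter in votes.items():
        top = max(counter.values())
        tied = [c for c in counter if counter[c] == top]
        # Assumption: on a tie, pick the more severe class (conservative for planning).
        result[key] = max(tied, key=SEVERITY.index)
    return result


# Hypothetical sample-level predictions for one building on two bands.
samples = [
    ("B1", "800MHz", "low"),
    ("B1", "800MHz", "low"),
    ("B1", "800MHz", "high"),
    ("B1", "1800MHz", "medium"),
    ("B1", "1800MHz", "high"),
]
print(aggregate_building_loss(samples))
```

Here the 800 MHz samples resolve to "low" by a clear majority, while the tied 1800 MHz samples fall back to "high" under the assumed tie-break.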

Original abstract

Accurate modeling of outdoor-to-indoor (O2I) and indoor-to-indoor (I2I) signal loss is important for improving indoor wireless network performance in dense urban areas. Traditional on-site measurements are expensive, time-consuming, and difficult to conduct across wide regions. Real-world datasets also tend to be noisy and imbalanced, which makes signal loss prediction challenging. This study presents a machine learning framework for classifying radio frequency (RF) building loss. The framework combines passively collected, crowdsourced user equipment (UE) data from 3GPP-compliant networks with public building information. We evaluated Random Forest, XGBoost, LightGBM, and a voting classifier using both supervised (SL) and semi-supervised learning (SSL). Compared to SL-only inference, the proposed SL and SSL framework improved both prediction accuracy and confidence under identical data constraints, achieving up to 12.6% relative accuracy gain for O2I loss and 3.4% for I2I loss, while reducing prediction entropy by up to 8.4%. Among the evaluated models, SSL XGBoost provided the most confident O2I loss classification, whereas SSL LightGBM achieved the best performance for I2I loss. These results demonstrate that the proposed approach provides a practical, data-driven alternative to traditional models, with promising potential to support better network planning and indoor coverage optimization.
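The abstract notes that real-world datasets are imbalanced but does not say how this is handled. One widely used remedy, shown here purely as an assumption, is to weight each class inversely to its frequency using the "balanced" heuristic n_samples / (n_classes × class_count). The label counts are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced label set: mostly low-loss samples.
labels = ["low"] * 6 + ["medium"] * 3 + ["high"] * 1

weights = balanced_class_weights(labels)
for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

Rare classes receive weights above 1 and common classes below 1, so each class contributes equally to a weighted training loss.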

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a machine-learning framework for classifying outdoor-to-indoor (O2I) and indoor-to-indoor (I2I) radio frequency building loss. It combines passively collected crowdsourced UE data from 3GPP-compliant networks with public building information and evaluates Random Forest, XGBoost, LightGBM, and a voting classifier under both supervised learning (SL) and semi-supervised learning (SSL). The central claim is that the combined SL+SSL approach yields relative accuracy gains of up to 12.6% for O2I loss and 3.4% for I2I loss, plus up to 8.4% reduction in prediction entropy, compared to SL alone, with SSL XGBoost best for O2I and SSL LightGBM for I2I.

Significance. If the gains prove robust, the work offers a scalable, low-cost alternative to traditional on-site RF measurements for urban network planning and indoor coverage optimization. The use of SSL to improve confidence on noisy, imbalanced crowdsourced data is a practical strength, and the explicit comparison of multiple models under identical constraints is useful. However, significance is limited by the absence of validation against independent ground truth, making it unclear whether the improvements generalize beyond the collected traces.

major comments (3)
  1. [Methods/Experimental Setup] The manuscript provides no details on data preprocessing, feature selection from public building information, handling of class imbalance and noise, train/test split strategy, or cross-validation procedure. These omissions are load-bearing because the reported 12.6% and 3.4% relative accuracy gains (Abstract) cannot be evaluated for robustness without knowing whether they arise from the SSL framework or from post-hoc data choices.
  2. [Results] The abstract states relative accuracy and entropy improvements but reports neither absolute accuracies, standard deviations across runs, nor statistical significance tests comparing SL and SSL. This prevents assessment of whether the gains are meaningful or could be explained by variance in the crowdsourced dataset.
  3. [Discussion/Assumptions] The claim that passively collected UE data plus public building footprints provide representative, unbiased features for generalization (Abstract and weakest assumption) is not supported by any ablation on building metadata, spatial cross-validation, or comparison to drive-test/ray-tracing ground truth. Without such checks, both SL and SSL results may reflect dataset artifacts rather than true RF loss prediction improvement.
minor comments (2)
  1. [Abstract] The statement that 'SSL XGBoost provided the most confident O2I loss classification' should explicitly tie the 12.6% gain to this model and clarify whether the entropy reduction is also model-specific.
  2. [Notation] Ensure consistent expansion of acronyms (O2I, I2I, SL, SSL) on first use in the main text and that all model hyperparameters (e.g., number of trees, learning rate) are listed in a table for reproducibility.

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and have made revisions to improve the paper's clarity and rigor.

Point-by-point responses
  1. Referee: [Methods/Experimental Setup] The manuscript provides no details on data preprocessing, feature selection from public building information, handling of class imbalance and noise, train/test split strategy, or cross-validation procedure. These omissions are load-bearing because the reported 12.6% and 3.4% relative accuracy gains (Abstract) cannot be evaluated for robustness without knowing whether they arise from the SSL framework or from post-hoc data choices.

    Authors: We fully agree that these methodological details are critical for reproducibility and assessing the source of the performance gains. The revised manuscript includes an expanded Methods section with comprehensive descriptions of: data preprocessing (filtering invalid measurements and feature normalization), feature selection (building height, footprint area, and proximity metrics derived from public data), class imbalance handling (using class weights in tree-based models and SMOTE for SSL), noise mitigation (via RSSI-based filtering), the 80/20 stratified train/test split, and 5-fold cross-validation. These additions confirm that the gains are due to the SSL framework, as evidenced by controlled comparisons. revision: yes

  2. Referee: [Results] The abstract states relative accuracy and entropy improvements but reports neither absolute accuracies, standard deviations across runs, nor statistical significance tests comparing SL and SSL. This prevents assessment of whether the gains are meaningful or could be explained by variance in the crowdsourced dataset.

    Authors: We appreciate this point and have revised the abstract and main Results section to report absolute accuracy figures for all models and scenarios (e.g., O2I SL baseline 75.3% improved to 84.8% with SSL), standard deviations from 10 repeated experiments with varied seeds (0.9-1.6%), and results of statistical significance tests (paired t-tests with p-values <0.01). This demonstrates that the reported relative gains are both meaningful and robust to dataset variance. revision: yes

  3. Referee: [Discussion/Assumptions] The claim that passively collected UE data plus public building footprints provide representative, unbiased features for generalization (Abstract and weakest assumption) is not supported by any ablation on building metadata, spatial cross-validation, or comparison to drive-test/ray-tracing ground truth. Without such checks, both SL and SSL results may reflect dataset artifacts rather than true RF loss prediction improvement.

    Authors: We acknowledge the validity of this concern regarding potential dataset artifacts. In the revised version, we have added an ablation analysis on building metadata features, revealing their significant contribution to model performance. We also implemented spatial cross-validation by holding out data from specific geographic clusters, with results showing consistent SSL benefits. A direct comparison to drive-test or ray-tracing ground truth is not possible with the current crowdsourced dataset, as no such independent measurements are available for the exact locations; we have explicitly noted this limitation in the Discussion and suggest it as future work. The observed reduction in prediction entropy and agreement across multiple models support that the improvements are not merely artifacts. revision: partial
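
The spatial cross-validation mentioned in the third response can be sketched as a grouped hold-out: every sample from a geographic cluster goes to the same fold, so no cluster appears in both training and test data. How the simulated authors formed clusters is not stated; the cluster labels, the greedy fold balancing, and the function name below are assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def spatial_folds(cluster_ids: Sequence[Hashable],
                  n_folds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) with whole clusters held out together."""
    by_cluster: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(idx)
    if n_folds > len(by_cluster):
        raise ValueError("need at least one cluster per fold")

    # Greedy balancing (assumption): place the largest clusters first,
    # always into the currently smallest fold.
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for cluster in sorted(by_cluster, key=lambda c: (-len(by_cluster[c]), str(c))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_cluster[cluster])

    all_indices = set(range(len(cluster_ids)))
    for test in folds:
        test_set = set(test)
        yield sorted(all_indices - test_set), sorted(test_set)


# Hypothetical cluster label for each of six samples.
cluster_ids = ["north", "north", "south", "east", "east", "east"]
folds = list(spatial_folds(cluster_ids, n_folds=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Compared with a random stratified split, this design gives a harder but more honest estimate of how well a model transfers to buildings in areas it has never seen.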

standing simulated objections not resolved
  • Comparison to independent drive-test or ray-tracing ground truth, which is not available in the crowdsourced data used.

Circularity Check

0 steps flagged

No circularity: empirical ML performance metrics on held-out evaluation

full rationale

The paper reports standard supervised and semi-supervised classification results (Random Forest, XGBoost, LightGBM, voting) trained on crowdsourced 3GPP UE traces plus public building metadata, with accuracy and entropy metrics computed under the same data constraints for both settings. No equations, parameters, or predictions are defined in terms of themselves; the 12.6% O2I and 3.4% I2I gains are measured improvements, not tautological renamings or fits. No self-citations serve as load-bearing uniqueness theorems, no ansatzes are imported, and no derivation chain reduces the claimed outputs to the inputs by construction. The work is a self-contained empirical evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Central claim rests on the assumption that crowdsourced data is representative and that standard ML models can learn meaningful patterns from the chosen features; no new entities are postulated.

free parameters (1)
  • model hyperparameters (e.g., number of trees, learning rate)
    Tuned during training for Random Forest, XGBoost, LightGBM, and voting classifier; values not specified in abstract.
axioms (2)
  • domain assumption: Crowdsourced UE measurements accurately reflect true RF propagation conditions
    Invoked when using passively collected data as training labels and features.
  • domain assumption: Public building information provides sufficient features to capture signal loss mechanisms
    Used as input to the classifiers for O2I and I2I prediction.

pith-pipeline@v0.9.0 · 5550 in / 1509 out tokens · 96077 ms · 2026-05-08T04:16:45.232583+00:00 · methodology
