
arxiv: 2604.23153 · v1 · submitted 2026-04-25 · 💻 cs.NI · cs.SE

RANalyzer: Automated Continuous RAN Software Evaluation and Regression Analysis

Pith reviewed 2026-05-08 07:07 UTC · model grok-4.3

classification 💻 cs.NI cs.SE
keywords O-RAN · software regression analysis · residuals analysis · continuous testing · performance attribution · code change impact · RAN evaluation

The pith

RANalyzer attributes wireless performance deviations to specific software code changes by modeling expected behavior from channel and load conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an automated framework to evaluate how software updates affect performance in virtualized radio access networks. It separates the effects of wireless channel conditions and network load from those caused by code modifications, using statistical residuals and semantic categorization of changes. Data from more than 8,600 over-the-air tests spanning 69 software releases support the analysis. The approach matters for systems that rely on frequent, independent updates, because threshold-based monitoring and manual troubleshooting cannot scale to identify which changes introduce problems. If the method holds, continuous integration pipelines can flag regressions automatically instead of relying on manual review.

Core claim

By modeling expected performance and interpreting deviations as software-induced effects, we identify degraded instances attributable to code changes and correlate them with specific change categories.

What carries the argument

Residuals analysis after modeling channel and load conditions, combined with semantic extraction of code changes by protocol layers and functional components.
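
The residuals step can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the linear model, the covariates (RSRP and a normalized load), the robust threshold, and the synthetic data are not taken from the paper, which per the figure list evaluates several learning models.

```python
import numpy as np

def design_matrix(rsrp_dbm, load):
    """Intercept plus two covariates: channel quality (RSRP) and load."""
    return np.column_stack([np.ones(len(rsrp_dbm)), rsrp_dbm, load])

def fit_expected(rsrp_dbm, load, efficiency):
    """Least-squares fit of efficiency ~ b0 + b1 * rsrp + b2 * load."""
    coef, *_ = np.linalg.lstsq(design_matrix(rsrp_dbm, load), efficiency, rcond=None)
    return coef

def residuals(coef, rsrp_dbm, load, efficiency):
    """Measured minus expected performance for each test run."""
    return efficiency - design_matrix(rsrp_dbm, load) @ coef

def flag_degraded(res, k=3.0):
    """Flag runs more than k robust standard deviations below the median."""
    med = np.median(res)
    sigma = 1.4826 * np.median(np.abs(res - med))  # MAD scaled to a std estimate
    return res < med - k * sigma

# Synthetic data (hypothetical): 200 runs, the last 20 carry a simulated regression.
rng = np.random.default_rng(0)
n = 200
rsrp = rng.uniform(-110.0, -80.0, n)
load = rng.uniform(0.1, 1.0, n)
eff = 0.9 + 0.005 * (rsrp + 95.0) - 0.1 * load + rng.normal(0.0, 0.01, n)
eff[-20:] -= 0.10

# Fit the expected-performance model on the reference runs only,
# then score every run against it.
coef = fit_expected(rsrp[:180], load[:180], eff[:180])
res = residuals(coef, rsrp, load, eff)
flags = flag_degraded(res, k=3.0)
print(int(flags.sum()), "runs flagged as degraded")
```

Fitting on reference runs only keeps the simulated regression from leaking into the baseline; whether the paper makes the same choice is not stated in the material reviewed here.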

If this is right

  • Continuous integration pipelines can automatically evaluate the performance impact of each RAN software release.
  • Degraded test runs can be linked directly to categories of code modifications such as those in specific protocol layers.
  • Large historical test datasets become actionable for detecting regressions at scale.
  • Manual troubleshooting for performance variations in stochastic wireless environments can be reduced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same residuals approach could help isolate software effects in other variable systems such as cloud service performance.
  • Patterns across change categories might guide developers toward safer update practices in protocol stacks.
  • Extending the dataset over longer periods could reveal cumulative effects of successive software revisions.

Load-bearing premise

Residuals left after accounting for channel and load conditions can be attributed to software changes rather than unmodeled stochastic effects or hardware variability.

What would settle it

Two observations would undercut the attribution: residuals of comparable size that persist across test runs in which no software update occurred, or performance deviations that do not align with any code change.
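
One way to run such a check is to estimate a noise floor from repeated runs of the same commit and ask whether the shift between two commits clears it. The sketch below is a hedged illustration, not the paper's procedure: the commit labels, the residual values, and the three-standard-error rule are all hypothetical.

```python
import numpy as np

def noise_floor(residuals_by_commit):
    """Pooled within-commit standard deviation of residuals.

    Each commit contributes its spread around its own mean, so the result
    reflects run-to-run variation under unchanged code.
    """
    sum_sq, dof = 0.0, 0
    for values in residuals_by_commit.values():
        r = np.asarray(values, dtype=float)
        if r.size > 1:
            sum_sq += float(np.sum((r - r.mean()) ** 2))
            dof += r.size - 1
    return float(np.sqrt(sum_sq / dof))

def shift_exceeds_floor(before, after, floor, k=3.0):
    """True if the mean residual shift is larger than k standard errors."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    std_err = floor * np.sqrt(1.0 / before.size + 1.0 / after.size)
    return bool(abs(after.mean() - before.mean()) > k * std_err)

# Hypothetical residuals for three consecutive commits, five runs each.
runs = {
    "c1": [0.01, -0.01, 0.00, 0.02, -0.02],
    "c2": [0.00, 0.01, -0.01, 0.01, -0.01],
    "c3": [-0.10, -0.09, -0.11, -0.10, -0.12],
}
floor = noise_floor(runs)
c1_to_c2 = shift_exceeds_floor(runs["c1"], runs["c2"], floor)
c2_to_c3 = shift_exceeds_floor(runs["c2"], runs["c3"], floor)
print(round(floor, 4), c1_to_c2, c2_to_c3)
```

If shifts between commits routinely failed this kind of test, or if same-commit runs showed spread as large as the flagged degradations, the premise above would be in trouble.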

Figures

Figures reproduced from arXiv: 2604.23153 by Leonardo Bonati, Michele Polese, Ravis Shirkhani, Reshma Prasad, Tommaso Melodia.

Figure 1: Moving average of downlink RSRP and throughput efficiency across all …
Figure 2: High-level overview of RANalyzer.
Figure 3: High-level overview of CI/CD and RANalyzer workflow on the 5G …
Figure 4: Metrics across Git commits for low- and high-load test cases.
Figure 5: Keyword-based categorization examples.
  Adjacent text: 2) LLM-Based Refinement: The LLM is provided with a fixed, human-designed instruction prompt together with the commit text and the results of the keyword-based classification. It is instructed to perform structured validation by (i) confirming or rejecting candidate layers and components to eliminate false positives from keyword matching, (ii) enforcing an upper bound…
Figure 6: Variance decomposition for affecting factors.
Figure 7: Predicted vs. actual throughput efficiency for the best learning models.
Figure 8: Commit example which triggered LLM refinement.
Figure 9: Throughput efficiency residuals distribution.
Figure 10: Example cases.
  Adjacent text: Case 1 - Normal operation: In this case, the test achieves similar performance as the baseline prediction. The first two commits of …
Figure 11: Baseline comparison.
  Adjacent text: … however, misses certain degradations that happen with a longer delay after the code change, as their temporal distance reduces their contribution under the exponential weighting. Our residual analysis explicitly models environmental baselines, achieving better precision by only flagging under-performance when conditions were favorable. C. Residual-Based Analysis of Code Change Impact…
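
The text captured with Figure 5 describes a keyword-based first pass over commit text, later validated by an LLM. A minimal sketch of such a first pass is shown below. The keyword lists and layer names are hypothetical (the paper's actual lists are not reproduced on this page), and the LLM refinement step is not implemented.

```python
import re

# Hypothetical keyword map from protocol layer to trigger terms.
LAYER_KEYWORDS = {
    "PHY": ["phy", "pdsch", "pusch", "ofdm", "channel estimation"],
    "MAC": ["mac", "scheduler", "harq", "bsr"],
    "RLC": ["rlc"],
    "PDCP": ["pdcp"],
    "RRC": ["rrc"],
}

def categorize_commit(message):
    """Return candidate layers whose keywords appear as whole words."""
    text = message.lower()
    hits = []
    for layer, keywords in LAYER_KEYWORDS.items():
        # Word boundaries keep "mac" from matching inside "macro".
        if any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords):
            hits.append(layer)
    return hits

print(categorize_commit("Fix HARQ retransmission in MAC scheduler"))
print(categorize_commit("PDSCH channel estimation cleanup and RRC reconfiguration fix"))
print(categorize_commit("Refactor macro definitions in build scripts"))
```

Plain keyword matching still produces false positives (a commit can mention a layer without changing it), which is presumably why the candidates are passed to a refinement step for confirmation or rejection.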
Original abstract

Software-driven O-RAN architectures enable rapid innovation through frequent, independent updates to virtualized components. However, attributing performance variations to specific software changes is challenging due to the stochastic nature of wireless systems, where channel conditions, interference, and hardware variability confound analysis. Traditional threshold-based monitoring and manual troubleshooting do not scale with modern software evolution. This paper presents RANalyzer, an automated test analysis framework that quantifies the performance impact of software updates beyond what can be explained by wireless channel conditions. RANalyzer combines LLM-assisted semantic extraction with residuals analysis. The first categorizes code changes by affected protocol layers and functional components, while the second provides insights on the effect of load, channel, or code changes on the test performance. We contribute an extensive dataset collected over more than two years of continuous over-the-air testing on an experimental O-RAN testbed, comprising over 8,600 automated tests across 69 releases of the OAI stack. By modeling expected performance and interpreting deviations as software-induced effects, we identify degraded instances attributable to code changes and correlate them with specific change categories. The framework can be integrated into CI/CD/CT pipelines for automated, continuous evaluation of software updates at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents RANalyzer, an automated framework for continuous evaluation of O-RAN software updates that combines LLM-assisted categorization of code changes (by protocol layers and components) with residual analysis after modeling expected performance under channel and load conditions. Using a dataset of over 8,600 automated over-the-air tests across 69 OAI releases collected over two years, the authors claim to identify performance degradations attributable to software changes and to correlate them with specific change categories, enabling integration into CI/CD/CT pipelines.

Significance. If the residual attribution methodology is shown to isolate software effects reliably, the work would provide a practical tool for scaling regression analysis in stochastic wireless systems where traditional threshold monitoring fails. The two-year dataset of 8,600 tests is a clear strength that could support community benchmarking; the combination of semantic code analysis with performance residuals is a reasonable direction for automated RAN evaluation.

major comments (3)
  1. [Abstract] Abstract (final paragraph): the claim that 'deviations [can be interpreted] as software-induced effects' and that degraded instances can be 'attributable to code changes' is load-bearing for the entire contribution, yet the abstract supplies no quantitative results, validation metrics, error analysis, or description of how the expected-performance model is constructed, how residuals are computed, or what thresholds define degradation. Without these, the attribution cannot be verified.
  2. [Dataset description] Dataset and evaluation description (implied in abstract's 'extensive dataset' paragraph): no controlled no-change baseline is described that quantifies residual variance under fixed code, fixed hardware, and repeated channel/load conditions. In O-RAN testbeds, unmeasured factors (scheduler nondeterminism, temperature drift, interference) routinely produce variation comparable to software regressions; without such a baseline the correlations with LLM-categorized change types remain vulnerable to confounding.
  3. [Residuals analysis] Residuals analysis section (referenced in abstract): the modeling of 'expected performance' under channel and load is central, but no equations, fitting procedure, cross-validation, or comparison against a null model (e.g., performance variance with no code changes) are supplied. This leaves open whether observed residuals exceed the stochastic/hardware floor.
minor comments (2)
  1. [Abstract] Abstract: the sentence 'the second provides insights on the effect of load, channel, or code changes' is vague; clarify whether the residual model explicitly includes code-change indicators or treats them only post-hoc.
  2. [Dataset] The manuscript would benefit from a table summarizing the 69 releases, number of tests per release, and key performance metrics before/after each major change category.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of clarity and rigor that we have addressed through revisions to the manuscript. We respond point by point to the major comments below.

Point-by-point responses
  1. Referee: [Abstract] Abstract (final paragraph): the claim that 'deviations [can be interpreted] as software-induced effects' and that degraded instances can be 'attributable to code changes' is load-bearing for the entire contribution, yet the abstract supplies no quantitative results, validation metrics, error analysis, or description of how the expected-performance model is constructed, how residuals are computed, or what thresholds define degradation. Without these, the attribution cannot be verified.

    Authors: We agree that the abstract requires additional quantitative context and methodological summary to support the central claims. In the revised manuscript we have expanded the abstract to include key validation metrics from the residuals analysis, a concise description of the expected-performance model (constructed via regression on channel and load covariates), the residual computation procedure, and the statistical threshold used to flag degradation. These additions directly address the need for verifiable support within the abstract while preserving its brevity. revision: yes

  2. Referee: [Dataset description] Dataset and evaluation description (implied in abstract's 'extensive dataset' paragraph): no controlled no-change baseline is described that quantifies residual variance under fixed code, fixed hardware, and repeated channel/load conditions. In O-RAN testbeds, unmeasured factors (scheduler nondeterminism, temperature drift, interference) routinely produce variation comparable to software regressions; without such a baseline the correlations with LLM-categorized change types remain vulnerable to confounding.

    Authors: The referee correctly notes the importance of a no-change baseline for isolating software effects from stochastic and hardware variability. Although our two-year dataset contains repeated tests under comparable conditions for the same releases, the original manuscript did not explicitly present a controlled baseline analysis. We have added a dedicated subsection that quantifies residual variance across no-change test repetitions (fixed code, hardware, and matched channel/load profiles) and demonstrates that the residuals associated with identified software changes exceed this baseline variance. This addition strengthens the attribution claims against potential confounding. revision: yes

  3. Referee: [Residuals analysis] Residuals analysis section (referenced in abstract): the modeling of 'expected performance' under channel and load is central, but no equations, fitting procedure, cross-validation, or comparison against a null model (e.g., performance variance with no code changes) are supplied. This leaves open whether observed residuals exceed the stochastic/hardware floor.

    Authors: We acknowledge that the residuals analysis section provided an overview without the requested mathematical and validation details. We have revised the section to include the explicit regression equation for expected performance, the ordinary-least-squares fitting procedure, cross-validation results confirming model robustness, and a direct comparison of residuals against a null (no-predictor) model as well as the no-change baseline variance. These additions demonstrate that the residuals used for software attribution exceed the stochastic floor established by the data. revision: yes
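
The validation described in the third response (an ordinary-least-squares fit, cross-validation, and comparison against a no-predictor null model) could look roughly like the sketch below. It is an illustration under stated assumptions: the covariates, the fold count, and the synthetic data are hypothetical, and no result from the paper is reproduced.

```python
import numpy as np

def kfold_rmse(X, y, k=5, use_predictors=True, seed=0):
    """Mean held-out RMSE over k folds.

    use_predictors=True  -> OLS on the covariates in X (plus intercept)
    use_predictors=False -> null model that predicts the training mean
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        if use_predictors:
            A_train = np.column_stack([np.ones(len(train)), X[train]])
            coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
            pred = np.column_stack([np.ones(len(test)), X[test]]) @ coef
        else:
            pred = np.full(len(test), y[train].mean())
        errors.append(np.sqrt(np.mean((y[test] - pred) ** 2)))
    return float(np.mean(errors))

# Synthetic data (hypothetical): efficiency driven by RSRP and load plus noise.
rng = np.random.default_rng(1)
n = 300
rsrp = rng.uniform(-110.0, -80.0, n)
load = rng.uniform(0.1, 1.0, n)
y = 0.9 + 0.005 * (rsrp + 95.0) - 0.1 * load + rng.normal(0.0, 0.01, n)
X = np.column_stack([rsrp, load])

rmse_ols = kfold_rmse(X, y, use_predictors=True)
rmse_null = kfold_rmse(X, y, use_predictors=False)
print(f"OLS RMSE: {rmse_ols:.4f}  null RMSE: {rmse_null:.4f}")
```

The held-out RMSE of the fitted model serves as an estimate of the residual floor; only deviations well beyond it are candidates for attribution to software.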

Circularity Check

0 steps flagged

No significant circularity in derivation or attribution chain

full rationale

The paper presents an empirical framework that models expected performance from channel and load conditions, extracts residuals, and attributes deviations to software changes via LLM categorization of code diffs. It contains no equations, self-definitional loops, fitted-input predictions, or load-bearing self-citations that would reduce the attribution claim to its own inputs by construction. The approach is grounded in an external two-year dataset of more than 8,600 over-the-air tests and does not invoke uniqueness theorems or rename known results; the central claim remains independent of the target attribution itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are detailed. The core approach implicitly assumes wireless performance can be decomposed into explainable channel/load components plus software residuals.

axioms (1)
  • domain assumption: Performance variations can be partitioned into channel/load effects and software-induced residuals.
    Foundational to the residuals analysis described in the abstract.
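
Written out, the axiom amounts to an additive decomposition. The notation below is hypothetical and introduced only to make the assumption explicit; the abstract gives no equations.

```latex
% Hypothetical notation, not taken from the paper.
% y_i      : measured KPI of test run i (e.g. throughput efficiency)
% c_i, l_i : channel and load covariates of run i
% v(i)     : software version under test in run i
\begin{align}
  y_i &= f(c_i, \ell_i) + r_i, \\
  r_i &= s_{v(i)} + \varepsilon_i,
  \qquad \mathbb{E}[\varepsilon_i] = 0,\quad
  \operatorname{Var}(\varepsilon_i) = \sigma_\varepsilon^2 .
\end{align}
% Attributing a shift between versions a and b (with n_a and n_b runs)
% to software presumes that the shift clears the no-change noise floor:
\begin{equation}
  \left| \bar{r}_b - \bar{r}_a \right| \gg
  \sigma_\varepsilon \sqrt{\tfrac{1}{n_a} + \tfrac{1}{n_b}} .
\end{equation}
```

Here $s_{v}$ is the software-induced offset of version $v$ and $\varepsilon_i$ collects unmodeled stochastic and hardware effects. The axiom holds only to the extent that $\varepsilon_i$ is small and unrelated to the version under test.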

pith-pipeline@v0.9.0 · 5520 in / 1142 out tokens · 32519 ms · 2026-05-08T07:07:13.120758+00:00 · methodology

