TeleSWEBench: A Commit-Driven Benchmark for Evaluating LLM-Powered Software Engineering in Telecommunications

Ali Mamaghani; Dinesh Bharadia; Pranshav Gajjar; Vijay K Shah

arxiv: 2606.05001 · v1 · pith:BVEU3IX4new · submitted 2026-06-03 · 💻 cs.SE

TeleSWEBench: A Commit-Driven Benchmark for Evaluating LLM-Powered Software Engineering in Telecommunications

Pranshav Gajjar , Ali Mamaghani , Dinesh Bharadia , Vijay K Shah This is my paper

Pith reviewed 2026-06-28 05:08 UTC · model grok-4.3

classification 💻 cs.SE

keywords telecom software engineeringLLM agentscode generation benchmarksrsRAN 5Gautomated software engineeringcommit-driven evaluationO-RANfunctional correctness

0 comments

The pith

LLM-powered agents achieve at most 25 percent shippable changes when modifying real 5G wireless code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Telecom networks have become large, stateful software systems under frameworks such as O-RAN, yet general coding benchmarks do not test the strict requirements of wireless stacks. The paper therefore builds TeleSWEBench by mining actual developer commits from the srsRAN 5G repository and turning them into 734 executable test cases divided into three difficulty levels. Each case includes unit tests plus a hierarchical LLM judge called TeleJudge that scores file-level changes for semantic fit. When current agent frameworks powered by strong reasoning models are run on the benchmark, they show weak ability both to locate the correct code and to produce changes that pass tests, reaching a maximum of 25 percent shippable outcomes.

Core claim

TeleSWEBench distills real commits from the srsRAN 5G repository into structured tasks with unit tests. Evaluation of AIDER, OpenHands, and ClaudeCode powered by models such as Qwen3, GPT OSS, Gemma 4, Kimi, and Qwencoder 2.5 shows that these tools lack both localization accuracy and functional correctness, so that even the strongest configurations produce at most 25 percent shippable changes.

What carries the argument

TeleSWEBench, the commit-driven benchmark of 734 questions drawn from srsRAN 5G commits together with executable unit tests and the TeleJudge hierarchical LLM-as-a-judge scoring system.

If this is right

Agents must improve at identifying the precise files and functions that need modification inside large wireless codebases.
Evaluation in telecom will require both strict unit tests and semantic judgment because many valid changes are not captured by tests alone.
Domain-specific data or fine-tuning may be needed before agents can handle the mathematically constrained logic of 5G stacks.
Operators adopting automated tools will still require substantial human review for changes to production wireless software.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same commit-mining approach could expose similar gaps in other specialized domains that rely on stateful or protocol-heavy code.
Performance on TeleSWEBench may improve if agents are given explicit models of wireless protocol constraints rather than generic code context.
The gap between localization success and functional success points to a need for better intermediate verification steps inside agent loops.
Over time the benchmark itself could serve as a training signal for agents that must operate inside live network management systems.

Load-bearing premise

Commits mined from the srsRAN 5G repository plus their unit tests and TeleJudge scores form a representative measure of an agent's ability to produce shippable changes in real telecom engineering.

What would settle it

A new agent framework that produces more than 40 percent shippable changes across the full set of TeleSWEBench cases under the same two-stage evaluation would falsify the reported performance ceiling.

Figures

Figures reproduced from arXiv: 2606.05001 by Ali Mamaghani, Dinesh Bharadia, Pranshav Gajjar, Vijay K Shah.

**Figure 2.** Figure 2: High-level overview of the TeleSWEBench from benchmark creation to evaluation. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Distribution for the three failure points across [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 5.** Figure 5: Performance comparison between OpenHands [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: One srsRAN_Project ground-truth diff (commit 70c8b9d) shown with representative per-file chunks and corresponding questions across easy, medium and hard. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

read the original abstract

With the telecommunications field embracing zero touch management alongside novel O-RAN and AI-RAN frameworks, contemporary telecom networks now function as immensely intricate and heavily softwareized codebases. While automated software engineering (ASE) tools and Software Engineering (SWE) Agents hold the potential to alleviate the critical code generation bottleneck in this domain, their ability to navigate and modify specialized, mathematically rigorous wireless stacks like srsRAN 5G remains unverified. General-purpose coding benchmarks fail to capture the stateful logic and strict requirements of telecommunications, leaving a critical evaluation gap. In this paper, we introduce TeleSWEBench, the first commit-driven benchmark specifically designed to measure an agent's performance in the telecom domain. We mine real developer commits from the srsRAN 5G repository and distill them into structured test cases across three difficulty tiers (Easy, Medium, and Difficult). Our benchmark consists of 734 questions that are accompanied by executable unit tests. To avoid the rigidity of test cases, we further propose a hierarchical LLM as a Judge framework called TeleJudge that scores agent outputs at the file level and aggregates verdicts holistically. This follows an evaluation based on context and semantic similarity in parallel to a standard unit test-based evaluation. Using this benchmark, we evaluate AIDER, OpenHands, and the ClaudeCode frameworks, powered by state-of-the-art reasoning LLMs, including Qwen3, GPT OSS, Gemma 4, Kimi, and Qwencoder 2.5. Our two-stage evaluation reveals that models suffer from a lack of both localization accuracy and functional correctness, with the strongest ASE tools achieving up to 25% of shippable changes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TeleSWEBench creates a commit-driven benchmark from srsRAN 5G changes with unit tests and TeleJudge, but leaves open whether those tasks actually track real shippable telecom work.

read the letter

TeleSWEBench mines real commits from the srsRAN 5G repository, distills them into 734 tiered tasks, attaches executable unit tests, and adds a hierarchical LLM judge that scores at file level using context and semantic similarity. That construction is the main new piece.

The paper does a solid job identifying the gap: general SWE benchmarks ignore the stateful logic and strict requirements in wireless stacks. Using actual developer commits instead of synthetic problems, plus running both test-based and judge-based evaluation on tools like AIDER and OpenHands, gives a concrete way to surface localization and correctness failures.

The soft spot sits in the unverified premise. Nothing in the work checks whether the selected commits and TeleJudge verdicts line up with what would actually ship in a telecom deployment. There is no cross-check against integration tests, hardware runs, or ratings from domain engineers, so the 25% figure could reflect benchmark artifacts rather than tool limits. Commit selection criteria and tiering process also stay thin.

This is for researchers working on domain-adapted agents or benchmarks in specialized software areas. Someone looking for patterns on how to turn real commits into test suites for complex systems can extract usable ideas even if the numbers need more grounding.

It deserves a serious referee because the benchmark idea is concrete and the performance observations are worth testing further. I would send it to review and ask for validation steps on the judge and task representativeness.

Referee Report

3 major / 2 minor

Summary. The paper introduces TeleSWEBench, the first commit-driven benchmark for LLM-powered ASE in telecommunications. It mines real commits from the srsRAN 5G repository, distills them into 734 structured tasks across Easy/Medium/Difficult tiers with executable unit tests, and proposes TeleJudge (a hierarchical LLM-as-a-judge using file-level context and semantic similarity) as a complement to rigid unit-test evaluation. Two-stage evaluation of AIDER, OpenHands, and ClaudeCode (powered by models including Qwen3, GPT OSS, Gemma 4, Kimi, Qwencoder 2.5) finds that agents lack localization accuracy and functional correctness, with the strongest tools achieving at most 25% shippable changes.

Significance. If the benchmark construction and TeleJudge verdicts are shown to be reliable proxies for deployable telecom changes, the work would fill a documented gap: general SWE benchmarks do not capture the stateful, mathematically rigorous wireless-stack requirements of telecom. The commit-mining approach and dual (unit-test + semantic) evaluation protocol are concrete strengths that could be extended to other specialized domains. The reported performance ceiling supplies a falsifiable baseline for future agent development in O-RAN/AI-RAN contexts.

major comments (3)

[§3] §3 (Benchmark Construction): The paper provides no cross-validation (expert telecom-engineer ratings, integration-test runs, or hardware-in-the-loop checks) that the 734 distilled tasks and their unit tests capture the distinguishing stateful and protocol-level requirements of srsRAN-style wireless stacks. This directly undermines the central claim that the observed 25% shippable-change rate reflects genuine limitations rather than benchmark-construction artifacts.
[§4] §4 (TeleJudge and Two-Stage Evaluation): No correlation analysis is reported between TeleJudge file-level semantic scores and either unit-test outcomes or independent expert judgments of deployability. Without such evidence, the hierarchical judge cannot be treated as a validated complement to unit tests for the functional-correctness dimension of the headline result.
[§5] §5 (Results): The 25% shippable-change figure is presented without statistical significance testing, confidence intervals, or breakdown by difficulty tier and model; given that the benchmark validity itself remains unverified, this figure cannot yet be interpreted as a robust performance ceiling.

minor comments (2)

[Abstract, §2] The abstract and §2 would benefit from an explicit statement of the exact commit-selection heuristics and inter-rater process used to tier tasks as Easy/Medium/Difficult.
[Tables/Figures] Figure captions and Table 1 should clarify whether the reported percentages are macro- or micro-averaged across the 734 tasks.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger validation of the benchmark tasks, TeleJudge, and statistical presentation of results. We address each major comment below with planned revisions where feasible.

read point-by-point responses

Referee: [§3] §3 (Benchmark Construction): The paper provides no cross-validation (expert telecom-engineer ratings, integration-test runs, or hardware-in-the-loop checks) that the 734 distilled tasks and their unit tests capture the distinguishing stateful and protocol-level requirements of srsRAN-style wireless stacks. This directly undermines the central claim that the observed 25% shippable-change rate reflects genuine limitations rather than benchmark-construction artifacts.

Authors: The tasks are distilled directly from real commits authored by srsRAN developers, which inherently encode the stateful, protocol-level, and mathematically rigorous requirements of the wireless stack as implemented in production code. The accompanying unit tests are executable and drawn from the repository's own test infrastructure. We will revise §3 to more explicitly justify the commit-mining approach as a domain-relevant proxy and add an explicit limitations subsection acknowledging the absence of expert ratings or hardware-in-the-loop validation, along with plans for such extensions in future work. revision: partial
Referee: [§4] §4 (TeleJudge and Two-Stage Evaluation): No correlation analysis is reported between TeleJudge file-level semantic scores and either unit-test outcomes or independent expert judgments of deployability. Without such evidence, the hierarchical judge cannot be treated as a validated complement to unit tests for the functional-correctness dimension of the headline result.

Authors: We will add a new subsection in §4 reporting correlation (Pearson and Spearman) between TeleJudge file-level semantic scores and unit-test pass/fail outcomes across all evaluated agents and models. This will quantify the degree of alignment. We note that independent expert judgments of deployability were not collected in the current study due to resource constraints and will discuss this as a limitation. revision: yes
Referee: [§5] §5 (Results): The 25% shippable-change figure is presented without statistical significance testing, confidence intervals, or breakdown by difficulty tier and model; given that the benchmark validity itself remains unverified, this figure cannot yet be interpreted as a robust performance ceiling.

Authors: We agree that additional statistical rigor is required. In the revised §5 we will include: (i) 95% confidence intervals computed via bootstrapping, (ii) statistical significance tests (e.g., McNemar’s test for paired agent comparisons), and (iii) full breakdowns of success rates by difficulty tier (Easy/Medium/Difficult) and by underlying LLM. These additions will allow readers to interpret the 25% ceiling with appropriate caution. revision: yes

standing simulated objections not resolved

Full cross-validation involving multiple independent telecom experts performing ratings or hardware-in-the-loop integration tests, which would require specialized domain expertise, access to proprietary testbeds, and resources beyond the scope of this initial benchmark release.

Circularity Check

0 steps flagged

No significant circularity; benchmark construction is self-contained empirical work

full rationale

The paper introduces TeleSWEBench by mining commits from srsRAN 5G, distilling them into 734 tasks with unit tests, and proposing TeleJudge as an LLM-based scorer. No mathematical derivations, parameter fittings, or predictions are present. No load-bearing self-citations or uniqueness theorems are invoked. The central claims rest on direct empirical evaluation of external ASE tools against the constructed benchmark, with no step reducing by construction to its own inputs. This is standard benchmark paper structure and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the benchmark construction itself implicitly assumes that commit-derived tasks and the TeleJudge framework validly capture 'shippable' performance.

pith-pipeline@v0.9.1-grok · 5846 in / 1170 out tokens · 23523 ms · 2026-06-28T05:08:58.723165+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 3 linked inside Pith

[1]

In18th ACM Conference on Security and Privacy in Wireless and Mobile Networks, pages 53–64

Ai5gtest: Ai-driven specification-aware auto- mated testing and validation of 5g o-ran components. In18th ACM Conference on Security and Privacy in Wireless and Mobile Networks, pages 53–64. GSMA. 2026. Release v1.1.0 · gsma-labs/evals. Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao L...

Pith/arXiv arXiv 2026
[2]

Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, and Zhiqiang Shen

Llms-as-judges: a comprehensive survey on llm-based evaluation methods.arXiv preprint arXiv:2412.05579. Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, and Zhiqiang Shen. 2026. Dive into claude code: The design space of today’s and future ai agent systems.arXiv preprint arXiv:2604.14228. Tianyang Liu, Canwen Xu, and Julian McAuley

Pith/arXiv arXiv 2026
[3]

Ali Maatouk, Fadhel Ayed, Nicola Piovesan, Antonio De Domenico, Merouane Debbah, and Zhi-Quan Luo

Repobench: Benchmarking repository-level code auto-completion systems.arXiv preprint arXiv:2306.03091. Ali Maatouk, Fadhel Ayed, Nicola Piovesan, Antonio De Domenico, Merouane Debbah, and Zhi-Quan Luo. 2025. Teleqna: A benchmark dataset to assess large language models telecommunications knowl- edge.IEEE Network. Ziyi Ni, Huacan Wang, Shuo Zhang, Shuo Lu, ...

Pith/arXiv arXiv 2025
[4]

A Timidness In our evaluation, a recurring behavioral failure mode istimidnessas mentioned in Section 6.1

Telecomgpt: A framework to build telecom- specific large language models.IEEE Transactions on Machine Learning in Communications and Net- working. A Timidness In our evaluation, a recurring behavioral failure mode istimidnessas mentioned in Section 6.1. De- spite successfully deducing the correct files, grasp- ing the required logic, and understanding the...
[5]

include/srsgnb/asn1/asn1_utils.h (contains unbounded_octstring class definition)
[6]

lib/asn1/asn1_utils.cpp (implementation of helpers and methods)
[7]

unittests/asn1/asn1_utils_test.cpp (tests for the changes)
[8]

unittests/asn1/asn1_rrc_nr_test.cpp (tests using octet strings)
[9]

what" and

apps/examples/du/du_example.cpp (usage example) I should list these and stop.... 11 B Standard LLM-as-a-Judge We initially experimented with a single-shot set- ting, which is the standard way to implement LLM judges in the literature (Gu et al., 2024). Here, the judge received the diffs and the questions si- multaneously to produce an output. However, thi...

2024
[10]

The original task description (question)
[11]

The ground truth fix (the accepted human commit diff) for ONE file
[12]

verdict":

The copilot’s proposed changes for the SAME file. Acceptif the copilot’s changes are functionally equivalent to the ground truth for this file — they address the same root cause or implement the same feature, even if variable names, formatting, ordering, or minor stylistic details differ. Rejectif the copilot’s changes miss the core issue, modify the wron...
[13]

The original task description
[14]

verdict":

Per-file verdicts from a first-pass review, each with reasons. Read the reasons carefully, think about the commit holistically, and make your own judgment. A single rejected file does NOT automati- cally mean the whole commit fails — consider whether the rejected file is critical to the task or a minor ancillary change.Return ONLY a JSON object with exact...

2026

[1] [1]

In18th ACM Conference on Security and Privacy in Wireless and Mobile Networks, pages 53–64

Ai5gtest: Ai-driven specification-aware auto- mated testing and validation of 5g o-ran components. In18th ACM Conference on Security and Privacy in Wireless and Mobile Networks, pages 53–64. GSMA. 2026. Release v1.1.0 · gsma-labs/evals. Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao L...

Pith/arXiv arXiv 2026

[2] [2]

Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, and Zhiqiang Shen

Llms-as-judges: a comprehensive survey on llm-based evaluation methods.arXiv preprint arXiv:2412.05579. Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, and Zhiqiang Shen. 2026. Dive into claude code: The design space of today’s and future ai agent systems.arXiv preprint arXiv:2604.14228. Tianyang Liu, Canwen Xu, and Julian McAuley

Pith/arXiv arXiv 2026

[3] [3]

Ali Maatouk, Fadhel Ayed, Nicola Piovesan, Antonio De Domenico, Merouane Debbah, and Zhi-Quan Luo

Repobench: Benchmarking repository-level code auto-completion systems.arXiv preprint arXiv:2306.03091. Ali Maatouk, Fadhel Ayed, Nicola Piovesan, Antonio De Domenico, Merouane Debbah, and Zhi-Quan Luo. 2025. Teleqna: A benchmark dataset to assess large language models telecommunications knowl- edge.IEEE Network. Ziyi Ni, Huacan Wang, Shuo Zhang, Shuo Lu, ...

Pith/arXiv arXiv 2025

[4] [4]

A Timidness In our evaluation, a recurring behavioral failure mode istimidnessas mentioned in Section 6.1

Telecomgpt: A framework to build telecom- specific large language models.IEEE Transactions on Machine Learning in Communications and Net- working. A Timidness In our evaluation, a recurring behavioral failure mode istimidnessas mentioned in Section 6.1. De- spite successfully deducing the correct files, grasp- ing the required logic, and understanding the...

[5] [5]

include/srsgnb/asn1/asn1_utils.h (contains unbounded_octstring class definition)

[6] [6]

lib/asn1/asn1_utils.cpp (implementation of helpers and methods)

[7] [7]

unittests/asn1/asn1_utils_test.cpp (tests for the changes)

[8] [8]

unittests/asn1/asn1_rrc_nr_test.cpp (tests using octet strings)

[9] [9]

what" and

apps/examples/du/du_example.cpp (usage example) I should list these and stop.... 11 B Standard LLM-as-a-Judge We initially experimented with a single-shot set- ting, which is the standard way to implement LLM judges in the literature (Gu et al., 2024). Here, the judge received the diffs and the questions si- multaneously to produce an output. However, thi...

2024

[10] [10]

The original task description (question)

[11] [11]

The ground truth fix (the accepted human commit diff) for ONE file

[12] [12]

verdict":

The copilot’s proposed changes for the SAME file. Acceptif the copilot’s changes are functionally equivalent to the ground truth for this file — they address the same root cause or implement the same feature, even if variable names, formatting, ordering, or minor stylistic details differ. Rejectif the copilot’s changes miss the core issue, modify the wron...

[13] [13]

The original task description

[14] [14]

verdict":

Per-file verdicts from a first-pass review, each with reasons. Read the reasons carefully, think about the commit holistically, and make your own judgment. A single rejected file does NOT automati- cally mean the whole commit fails — consider whether the rejected file is critical to the task or a minor ancillary change.Return ONLY a JSON object with exact...

2026