HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair

· 2026 · cs.SE · arXiv 2605.02215

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Recent Large Language Models (LLMs) have shown strong performance on automated program repair across standard benchmarks. However, these benchmarks evaluate models on a single canonical form of buggy code and do not reflect the syntactic variations commonly observed in real-world software, leaving robustness largely unexamined. In this work, we construct HEJ-Robust, a robustness benchmark built from HumanEval-Java-Bug using eight semantics-preserving code transformations, resulting in 1,450 transformed instances. We evaluate five fine-tuned LLMs on this benchmark and show that model performance drops by over 50% under several transformations, indicating that current LLM-based repair models lack robustness to minor syntactic variations.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Social Bias in LLM-Generated Code: Benchmark and Mitigation

cs.SE · 2026-05-01 · unverdicted · novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.

Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation

cs.SE · 2026-05-04 · unverdicted · novelty 5.0 · 2 refs

A large-scale study finds that many LLM code translation failures are false negatives due to improper evaluation configurations rather than incorrect translations.

citing papers explorer

Showing 2 of 2 citing papers.

Social Bias in LLM-Generated Code: Benchmark and Mitigation cs.SE · 2026-05-01 · unverdicted · none · ref 119 · internal anchor
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation cs.SE · 2026-05-04 · unverdicted · none · ref 29 · 2 links · internal anchor
A large-scale study finds that many LLM code translation failures are false negatives due to improper evaluation configurations rather than incorrect translations.

HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer