COBOLAssist: Analyzing and Fixing Compilation Errors for LLM-Powered COBOL Code Generation

Anh Tuan Nguyen; Anh T. V. Dau; Jinqiu Yang; Nghi D. Q. Bui; Shin Hwei Tan

arxiv: 2604.03978 · v1 · submitted 2026-04-05 · 💻 cs.SE · cs.PL

Pith reviewed 2026-05-13 17:37 UTC · model grok-4.3

keywords: cobol, errors, compilation, code, cobolassist, legacy, llms, common

The pith

COBOLAssist raises compilation success of LLM-generated COBOL code from 29.5% to 64.38% for GPT-4o-mini and from 41.8% to 95.89% for GPT-4o via iterative compiler-guided repairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Many important business systems still run on COBOL, an old programming language, but fewer and fewer experts can maintain it. When modern AI models try to write COBOL, they often produce code that fails to compile because of missing pieces, syntax mistakes, or type errors. The authors created COBOLAssist, which feeds the compiler's error messages back to the AI and asks it to fix the code, repeating the process. Tests across five different AI models showed large gains in how often the code compiles, for example from 41.8% to 95.89% for GPT-4o. One model, mAInframer-34B, reached the highest compilation rate but still struggled to produce code that actually does the intended job.
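The feedback loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: `repair_loop`, `compile_fn`, `repair_fn`, and `max_rounds` are hypothetical names, the compiler and the model are passed in as plain callables, and the toy stand-ins below only mimic one "incomplete code" error so that the example runs without a COBOL toolchain or an LLM API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
#   CompileFn: source -> (compiled_ok, diagnostics text)
#   RepairFn:  (source, diagnostics) -> revised source, e.g. an LLM call
CompileFn = Callable[[str], Tuple[bool, str]]
RepairFn = Callable[[str, str], str]


@dataclass
class RepairResult:
    source: str         # last version of the program
    compiled: bool      # whether that version compiled
    rounds: int         # number of repair attempts made
    history: List[str]  # diagnostics seen along the way


def repair_loop(source: str, compile_fn: CompileFn, repair_fn: RepairFn,
                max_rounds: int = 3) -> RepairResult:
    """Compile, and on failure hand the diagnostics to repair_fn, up to max_rounds times."""
    history: List[str] = []
    for rounds in range(max_rounds + 1):
        ok, diagnostics = compile_fn(source)
        if ok:
            return RepairResult(source, True, rounds, history)
        history.append(diagnostics)
        if rounds == max_rounds:
            break  # repair budget exhausted; do not ask for another fix
        source = repair_fn(source, diagnostics)
    return RepairResult(source, False, max_rounds, history)


def toy_compile(source: str) -> Tuple[bool, str]:
    # Toy stand-in for a compiler: only checks for a PROCEDURE DIVISION header.
    if "PROCEDURE DIVISION." not in source:
        return False, "error: PROCEDURE DIVISION header missing"
    return True, ""


def toy_repair(source: str, diagnostics: str) -> str:
    # Toy stand-in for the model: appends the piece named in the diagnostic.
    if "PROCEDURE DIVISION" in diagnostics:
        return source + "\nPROCEDURE DIVISION.\n    STOP RUN.\n"
    return source


if __name__ == "__main__":
    broken = "IDENTIFICATION DIVISION.\nPROGRAM-ID. DEMO."
    result = repair_loop(broken, toy_compile, toy_repair)
    print(result.compiled, result.rounds)
```

Bounding the loop with `max_rounds` is a design choice of this sketch: a model that keeps returning code with the same error would otherwise loop forever. Note that the loop stops at the first successful compile, which says nothing about whether the program behaves as intended.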

Core claim

Our evaluation using five LLMs including GPT variants and mAInframer, shows a high prevalence of incorrect program structures and function usage in COBOL programs and demonstrates the effectiveness of COBOLAssist, with the compilation success rates increasing from 29.5% to 64.38% for GPT-4o-mini and from 41.8% to 95.89% for GPT-4o. It also improves pass@1 significantly, for example from 9.1 to 22.6 for GPT-4.
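The page does not say how pass@1 is computed. As a hedged illustration, the sketch below uses the unbiased pass@k estimator that is common in code-generation benchmarks, where n samples are drawn per task and c of them pass the tests. Whether the paper uses this estimator, and that figures such as 9.1 and 22.6 are percentages, are assumptions here; the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one task: 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Mean pass@k over tasks, as a percentage. Each item is (n, c) for one task."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Four hypothetical tasks, one sample each, of which only the first passes.
    print(benchmark_pass_at_k([(1, 1), (1, 0), (1, 0), (1, 0)], k=1))
```

For k = 1 the estimator reduces to c / n per task, so pass@1 is simply the average fraction of samples that pass their tests. A program that compiles but fails its tests contributes nothing, which is why compilation success and pass@1 can diverge.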

Load-bearing premise

The approach assumes that compiler feedback alone is sufficient to guide LLMs toward functionally correct COBOL code rather than merely compilable but semantically wrong programs. The abstract's own note that mAInframer-34B achieves the highest compilation success rate but limited functional correctness puts this premise in question.

Original abstract

Legacy programming languages such as COBOL (Common Business-Oriented Language) remain critical in business computing. However, maintaining legacy COBOL systems is increasingly challenging due to a declining pool of skilled developers and the persistence of COBOL errors that require deep domain expertise to resolve. This paper investigates the challenges of COBOL compilation errors and introduces a framework leveraging large language models (LLMs) to address these issues. We first categorize the common compilation errors in LLM-generated COBOL code into three groups: incomplete code errors, syntax errors, and type-related errors. We further propose COBOLAssist, a technique to enhance code correctness through iterative repairs guided by compilation feedback. Our evaluation using five LLMs including GPT variants and mAInframer, shows a high prevalence of incorrect program structures and function usage in COBOL programs and demonstrates the effectiveness of COBOLAssist, with the compilation success rates increasing from 29.5% to 64.38% for GPT-4o-mini and from 41.8% to 95.89% for GPT-4o. It also improves pass@1 significantly, for example from 9.1 to 22.6 for GPT-4. Notably, while mAInframer-34B achieves the highest compilation success rate, its functional correctness remains limited. This research not only highlights the limitations in current LLMs for COBOL but also demonstrates a practical path forward for automated debugging in legacy systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is entirely empirical and relies on standard assumptions about LLM prompting and compiler utility with no explicit free parameters, mathematical axioms, or invented entities.
