COBOLAssist: Analyzing and Fixing Compilation Errors for LLM-Powered COBOL Code Generation
Pith reviewed 2026-05-13 17:37 UTC · model grok-4.3
The pith
COBOLAssist raises compilation success of LLM-generated COBOL code from 29.5% to 64.38% for GPT-4o-mini and from 41.8% to 95.89% for GPT-4o via iterative compiler-guided repairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our evaluation using five LLMs, including GPT variants and mAInframer, shows a high prevalence of incorrect program structures and function usage in COBOL programs and demonstrates the effectiveness of COBOLAssist, with the compilation success rates increasing from 29.5% to 64.38% for GPT-4o-mini and from 41.8% to 95.89% for GPT-4o. It also improves pass@1 significantly, for example from 9.1 to 22.6 for GPT-4.
Load-bearing premise
That compiler feedback alone is sufficient to guide LLMs toward functionally correct COBOL code rather than merely compilable but semantically wrong programs, as evidenced by the note that mAInframer-34B achieves high compilation success but limited functional correctness.
Original abstract
Legacy programming languages such as COBOL (Common Business-Oriented Language) remain critical in business computing. However, maintaining legacy COBOL systems is increasingly challenging due to a declining pool of skilled developers and the persistence of COBOL errors that require deep domain expertise to resolve. This paper investigates the challenges of COBOL compilation errors and introduces a framework leveraging large language models (LLMs) to address these issues. We first categorize the common compilation errors in LLM-generated COBOL code into three groups: incomplete code errors, syntax errors, and type-related errors. We further propose COBOLAssist, a technique to enhance code correctness through iterative repairs guided by compilation feedback. Our evaluation using five LLMs, including GPT variants and mAInframer, shows a high prevalence of incorrect program structures and function usage in COBOL programs and demonstrates the effectiveness of COBOLAssist, with the compilation success rates increasing from 29.5% to 64.38% for GPT-4o-mini and from 41.8% to 95.89% for GPT-4o. It also improves pass@1 significantly, for example from 9.1 to 22.6 for GPT-4. Notably, while mAInframer-34B achieves the highest compilation success rate, its functional correctness remains limited. This research not only highlights the limitations in current LLMs for COBOL but also demonstrates a practical path forward for automated debugging in legacy systems.
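The abstract describes iterative repair guided by compilation feedback but does not spell out the loop. A minimal Python sketch of what such a loop could look like follows. It is not the authors' implementation: the names `repair_loop`, `compile_fn`, and `repair_fn`, and the default cap of three repair rounds, are assumptions made for illustration. The compiler and the LLM are passed in as plain callables so the control flow can be read on its own.

```python
from typing import Callable, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   compile_fn(source) -> (ok, diagnostics)      e.g. a wrapper around a COBOL compiler
#   repair_fn(source, diagnostics) -> new source e.g. an LLM call prompted with the errors
CompileFn = Callable[[str], Tuple[bool, str]]
RepairFn = Callable[[str, str], str]


def repair_loop(
    source: str,
    compile_fn: CompileFn,
    repair_fn: RepairFn,
    max_rounds: int = 3,  # assumed cap; the source does not state one
) -> Tuple[str, bool, int]:
    """Compile, and on failure ask for a repair, up to max_rounds repairs.

    Returns (final_source, compiled_ok, repair_rounds_used).
    """
    for rounds_used in range(max_rounds + 1):
        ok, diagnostics = compile_fn(source)
        if ok:
            return source, True, rounds_used
        if rounds_used == max_rounds:
            break  # repair budget exhausted; give up with the last attempt
        # Feed the compiler's diagnostics back to the model for a revised program.
        source = repair_fn(source, diagnostics)
    return source, False, max_rounds
```

Note that the loop stops as soon as the program compiles, so it optimizes only for compilation success. This matches the caveat raised in the load-bearing premise: a program that exits this loop successfully may still be functionally wrong, and checking that would need a separate test-based step.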