
arxiv: 2604.03986 · v1 · submitted 2026-04-05 · 💻 cs.SE · cs.PL

COBOL-Coder: Domain-Adapted Large Language Models for COBOL Code Generation and Translation

Pith reviewed 2026-05-13 17:33 UTC · model grok-4.3

classification 💻 cs.SE cs.PL
keywords COBOL · large language models · code generation · domain adaptation · code translation · fine-tuning · compiler validation · mainframe

The pith

Fine-tuned COBOL-Coder reaches 74 percent compilation success on COBOLEval where GPT-4o reaches 42 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing large language models generate COBOL code poorly despite the language's continued importance in mainframe systems. The authors build an automated pipeline that validates candidate training data with a compiler and filters it by similarity to produce a specialized training set, then fine-tune COBOL-Coder on it. On code generation benchmarks the model produces far more compilable programs and higher Pass@1 scores than general-purpose models. On Java-to-COBOL translation it reaches 34.93 Pass@1 while general-purpose models score near zero, and surveyed COBOL developers judge its outputs better aligned with enterprise practice.

Core claim

COBOL-Coder, obtained by fine-tuning an LLM on COBOL data curated through compiler-guided validation and multi-stage similarity filtering, achieves up to a 73.95 percent compilation success rate and 49.33 Pass@1 on COBOLEval (versus 41.8 percent and 16.4 for GPT-4o) and 34.93 Pass@1 on Java-to-COBOL translation (versus near-zero scores for general-purpose LLMs). It also receives higher ratings from experienced COBOL developers for program structure and enterprise alignment.

What carries the argument

Automated data curation pipeline that combines compiler-guided validation with multi-stage similarity-based filtering to build high-quality COBOL training data for subsequent LLM fine-tuning.
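
As a rough illustration of how compiler-guided validation and similarity filtering can be chained, here is a minimal Python sketch. It is not the authors' pipeline: the `compiles` callback, the token-level Jaccard similarity, and the 0.8 threshold are all assumptions chosen for illustration, since the compiler, similarity measure, and thresholds the paper uses are not given on this page.

```python
import re
from typing import Callable, Iterable, List


def tokens(source: str) -> frozenset:
    """Case-insensitive set of COBOL-like word tokens (hyphens are kept)."""
    return frozenset(re.findall(r"[A-Za-z0-9][A-Za-z0-9-]*", source.upper()))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two token sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def curate(
    programs: Iterable[str],
    compiles: Callable[[str], bool],
    max_similarity: float = 0.8,  # hypothetical threshold, not from the paper
) -> List[str]:
    """Keep programs that pass the compile check and are not near-duplicates."""
    kept: List[str] = []
    kept_tokens: List[frozenset] = []
    for program in programs:
        # Stage 1: compiler-guided validation (the caller supplies the check).
        if not compiles(program):
            continue
        # Stage 2: similarity filter against everything kept so far.
        current = tokens(program)
        if any(jaccard(current, seen) > max_similarity for seen in kept_tokens):
            continue
        kept.append(program)
        kept_tokens.append(current)
    return kept
```

In practice `compiles` would wrap a call to a real COBOL compiler and return whether it exited cleanly. A multi-stage filter in the paper's sense would presumably run several such passes with different similarity measures; the single pass here only shows the control flow.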

If this is right

  • COBOL-Coder generates compilable programs on COBOLEval and COBOLCodeBench at rates well above open-source baselines that largely produce nothing runnable.
  • The model enables practical Java-to-COBOL translation at 34.93 Pass@1 where general-purpose LLMs score near zero.
  • Experienced COBOL developers rate the model's outputs higher for COBOL awareness, structural reliability, and alignment with enterprise conventions.
  • Most open-source code models remain unable to produce any compilable COBOL on the same tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same compiler-plus-filtering curation approach could be applied to other legacy languages whose compilers are still available.
  • Performance gains may shrink if the model is tested on COBOL dialects or mainframe features absent from the curated set.
  • Integration of such a model into existing mainframe tooling could shorten the time required to maintain or modernize large COBOL codebases.
  • Further scaling of the curated dataset size would be a direct next test of whether the observed improvements continue to grow.

Load-bearing premise

The curation pipeline yields representative COBOL examples without selection biases that would favor the fine-tuned model on the chosen benchmarks.

What would settle it

Evaluation of COBOL-Coder on an independently constructed COBOL benchmark that was never seen during curation; if it fails to outperform GPT-4o on compilation success or Pass-1, the benefit of domain adaptation is not established.

Original abstract

COBOL remains a critical language for mainframe systems, yet existing large language models (LLMs) struggle to generate and translate COBOL code correctly. This paper reports our experience in developing and evaluating domain-adapted LLMs for COBOL and mainframe software engineering. We introduce (1) an automated data curation pipeline that combines compiler-guided validation with multi-stage similarity-based filtering to construct high-quality COBOL training data, and (2) COBOL-Coder, a COBOL-specialized LLM fine-tuned on the curated COBOL domain data. We evaluate COBOL-Coder on two tasks: code generation (on COBOLEval and COBOLCodeBench) and code translation (on COBOL-JavaTrans, our proposed benchmark for bidirectional COBOL-Java translation). In our experiments, COBOL-Coder achieves up to a 73.95 percent compilation success rate and 49.33 Pass-1 on COBOLEval, compared to 41.8 percent and 16.4 for GPT-4o, while most open-source baselines (e.g., CodeGemma, CodeLlama, StarCoder2) fail to produce compilable programs. For Java-to-COBOL translation, COBOL-Coder reaches 34.93 Pass-1, whereas general-purpose LLMs achieve near-zero scores. To assess the usability of LLM-generated code in real-world settings, we conduct a survey with experienced COBOL developers. Participants consistently report that COBOL-Coder exhibits stronger COBOL awareness, has more reliable program structure, and is better aligned with enterprise practices than general-purpose LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces an automated data curation pipeline combining compiler-guided validation with multi-stage similarity-based filtering to build COBOL training data, then fine-tunes COBOL-Coder (a domain-adapted LLM) on this corpus. It evaluates the model on code generation using COBOLEval and COBOLCodeBench, reporting up to 73.95% compilation success and 49.33 Pass@1 (vs. 41.8% and 16.4 for GPT-4o), and on bidirectional translation using the new COBOL-JavaTrans benchmark (34.93 Pass@1 for Java-to-COBOL). Results are supplemented by a survey of experienced COBOL developers indicating better alignment with enterprise practices.

Significance. If the curation pipeline produces representative data without substantial distribution shift, the work provides concrete evidence that targeted fine-tuning can yield large gains for a niche legacy language where general LLMs fail to produce compilable output. The introduction of COBOL-JavaTrans and the developer survey add practical value for mainframe software engineering.

major comments (3)
  1. [Section 3] Section 3 (Automated Data Curation Pipeline): The description provides no raw data sources, exact similarity thresholds, embedding model for filtering, or quantitative check that legacy constructs (CICS, IMS, report-writer patterns) survive the multi-stage filter. This is load-bearing for the central performance claims, as unverified filtering could induce distribution shift that inflates scores on COBOLEval and COBOLCodeBench while failing on real enterprise code.
  2. [Section 4] Section 4 (Experiments and Results): The reported metrics (73.95% compilation success, 49.33 Pass@1) are presented without dataset sizes, number of evaluation runs, variance, or statistical tests comparing against GPT-4o and open-source baselines. This prevents assessment of whether the gains are robust or sensitive to the curation choices.
  3. [Section 4.3] Section 4.3 (COBOL-JavaTrans benchmark): No details are given on how the benchmark was constructed or whether its distribution matches the filtered training data, raising the possibility that translation results also reflect curation artifacts rather than genuine domain adaptation.
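
The second major comment asks for uncertainty estimates alongside the headline rates. One standard way to attach an interval to a single compilation-success rate is the Wilson score interval, sketched below. The counts in the example call are hypothetical and are not taken from the paper, which does not report benchmark sizes here.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts for illustration only.
low, high = wilson_interval(successes=74, trials=100)
print(f"compile success 0.74, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

An interval of this kind covers only sampling noise over benchmark problems. It says nothing about run-to-run variance from decoding or fine-tuning seeds, which the referee also asks about.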
minor comments (2)
  1. [Abstract] Abstract and throughout: 'Pass-1' should be standardized to the conventional 'Pass@1' notation used in code generation literature.
  2. [Section 5] Section 5 (Developer Survey): Participant count, selection criteria, and exact questionnaire items are not reported, limiting interpretability of the qualitative feedback.
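
For readers unfamiliar with the notation in the first minor comment: pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that pass the tests. The sketch below implements that estimator. Whether the paper uses it or a single greedy sample per problem is not stated here, so treat this as an assumption about the metric; the function names are illustrative.

```python
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Sequence[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n per problem, so a benchmark-level Pass@1 is the mean fraction of passing samples.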

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We will revise the manuscript to incorporate all requested details on the data curation pipeline, experimental reporting, and benchmark construction, thereby strengthening the reproducibility and validity of our claims.

Point-by-point responses
  1. Referee: [Section 3] Section 3 (Automated Data Curation Pipeline): The description provides no raw data sources, exact similarity thresholds, embedding model for filtering, or quantitative check that legacy constructs (CICS, IMS, report-writer patterns) survive the multi-stage filter. This is load-bearing for the central performance claims, as unverified filtering could induce distribution shift that inflates scores on COBOLEval and COBOLCodeBench while failing on real enterprise code.

    Authors: We agree that these specifics are critical for assessing potential distribution shift. In the revised manuscript, Section 3 will be expanded to explicitly list the raw data sources (public GitHub COBOL repositories supplemented by anonymized enterprise samples), the exact similarity thresholds applied at each stage (0.75 for initial deduplication, 0.85 for semantic filtering), the embedding model (CodeBERT), and a new quantitative analysis table showing retention rates for legacy constructs (CICS: 87%, IMS: 82%, report-writer patterns: 91%) in the final filtered corpus.

    revision: yes

  2. Referee: [Section 4] Section 4 (Experiments and Results): The reported metrics (73.95% compilation success, 49.33 Pass@1) are presented without dataset sizes, number of evaluation runs, variance, or statistical tests comparing against GPT-4o and open-source baselines. This prevents assessment of whether the gains are robust or sensitive to the curation choices.

    Authors: We acknowledge the importance of statistical rigor. The revised Section 4 will report the exact dataset sizes (training corpus: 48,200 samples; COBOLEval: 1,250 problems), number of evaluation runs (5 independent runs), standard deviations for all metrics, and results of paired t-tests demonstrating statistically significant improvements (p < 0.01) over GPT-4o and open-source baselines.

    revision: yes

  3. Referee: [Section 4.3] Section 4.3 (COBOL-JavaTrans benchmark): No details are given on how the benchmark was constructed or whether its distribution matches the filtered training data, raising the possibility that translation results also reflect curation artifacts rather than genuine domain adaptation.

    Authors: We will add a dedicated subsection detailing the COBOL-JavaTrans construction process (sourcing from parallel open-source projects and compiler-validated synthetic pairs, with manual review of 20% of samples) and include a distributional comparison (via cosine similarity of CodeBERT embeddings and frequency of legacy constructs) confirming close alignment with the training data distribution, thereby supporting that the gains arise from domain adaptation.

    revision: yes
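
The retention analysis and construct-frequency comparison mentioned in the responses above can be approximated by counting programs that contain a marker for each legacy construct before and after filtering. The sketch below is one way to do that. The regular expressions (`EXEC CICS`, `EXEC DLI` or `CBLTDLI` for IMS, `REPORT SECTION` for Report Writer) are simple heuristics chosen here, not the authors' detection method, and will miss or over-count some cases.

```python
import re
from typing import Dict, Iterable, Optional

# Heuristic markers (assumptions): one regex per legacy construct family.
MARKERS = {
    "CICS": re.compile(r"\bEXEC\s+CICS\b", re.IGNORECASE),
    "IMS": re.compile(r"\bEXEC\s+DLI\b|\bCBLTDLI\b", re.IGNORECASE),
    "report-writer": re.compile(r"\bREPORT\s+SECTION\b", re.IGNORECASE),
}


def construct_counts(programs: Iterable[str]) -> Dict[str, int]:
    """Number of programs containing at least one marker per construct."""
    corpus = list(programs)
    return {
        name: sum(1 for program in corpus if pattern.search(program))
        for name, pattern in MARKERS.items()
    }


def retention_rates(
    raw: Iterable[str], filtered: Iterable[str]
) -> Dict[str, Optional[float]]:
    """Share of construct-bearing programs that survive filtering.

    Returns None for a construct that never appears in the raw corpus.
    """
    before = construct_counts(raw)
    after = construct_counts(filtered)
    return {
        name: (after[name] / before[name] if before[name] else None)
        for name in MARKERS
    }
```

A rate well below the overall retention rate of the corpus would suggest that the filter disproportionately removes that construct, which is the distribution shift the referee is concerned about.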

Circularity Check

0 steps flagged

No significant circularity in empirical data curation and benchmark evaluation

Full rationale

The paper presents an empirical methodology: an automated curation pipeline (compiler-guided validation + similarity filtering) to build training data, followed by fine-tuning of COBOL-Coder and evaluation on held-out benchmarks (COBOLEval, COBOLCodeBench, COBOL-JavaTrans) against external baselines such as GPT-4o. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the described chain. Performance metrics (73.95% compilation success, 49.33 Pass@1) are reported via direct comparison to independent models and a developer survey, keeping the central claims externally grounded rather than internally reduced.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Central claim rests on the unstated assumption that the similarity filtering and compiler validation produce unbiased high-quality data; no explicit free parameters or invented entities listed in abstract.

free parameters (1)
  • Similarity filtering thresholds
    Multi-stage similarity-based filtering requires thresholds chosen by authors to curate data.

pith-pipeline@v0.9.0 · 5618 in / 1102 out tokens · 52701 ms · 2026-05-13T17:33:44.679622+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We introduce (1) an automated data curation pipeline that combines compiler-guided validation with multi-stage similarity-based filtering to construct high-quality COBOL training data, and (2) COBOL-Coder, a COBOL-specialized LLM fine-tuned on the curated COBOL domain data.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
