Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language
Pith reviewed 2026-05-20 19:08 UTC · model grok-4.3
The pith
Fine-tuning teaches LLMs the syntax of an unseen programming language but fails to transfer the ability to produce correct code in it.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-tuning on PyLang quickly teaches its syntax yet leaves models unable to map their language-agnostic algorithmic understanding into working implementations. The result is an implementation fidelity gap: internal representations converge across languages (CKA > 0.97) while output performance diverges, with Python outperforming PyLang by up to 19 percent across all tested interventions.
What carries the argument
The implementation fidelity gap, the separation between language-agnostic algorithmic selection and language-specific code realization that persists despite high internal representation similarity.
Load-bearing premise
That PyLang truly never appeared in any pretraining corpus and that the 352 problems keep identical difficulty and logical structure when rewritten from Python into PyLang.
What would settle it
An experiment in which a fine-tuned model reaches equal pass rates on matched PyLang and Python problems, or direct evidence that PyLang fragments existed in the original training data.
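One way to make "equal pass rates on matched problems" concrete is a paired comparison over the same problem IDs. The sketch below is illustrative only and is not the paper's evaluation code: the function name `paired_gap`, the boolean pass lists, and the choice of an exact McNemar test are all assumptions introduced here.

```python
from math import comb


def paired_gap(python_pass, pylang_pass):
    """Compare per-problem pass/fail outcomes for the same problems in two languages.

    Both arguments are hypothetical lists of booleans, one entry per matched problem.
    """
    if len(python_pass) != len(pylang_pass):
        raise ValueError("both lists must cover the same matched problems")
    n = len(python_pass)
    # Discordant pairs: problems solved in exactly one of the two languages.
    python_only = sum(1 for a, b in zip(python_pass, pylang_pass) if a and not b)
    pylang_only = sum(1 for a, b in zip(python_pass, pylang_pass) if b and not a)
    gap = (sum(python_pass) - sum(pylang_pass)) / n
    # Exact two-sided McNemar test: under the null hypothesis, each discordant
    # pair is equally likely to favor either language (binomial with p = 0.5).
    m = python_only + pylang_only
    if m == 0:
        p_value = 1.0
    else:
        k = min(python_only, pylang_only)
        tail = sum(comb(m, i) for i in range(k + 1)) / 2 ** m
        p_value = min(1.0, 2 * tail)
    return {
        "gap": gap,
        "python_only": python_only,
        "pylang_only": pylang_only,
        "p_value": p_value,
    }


# Made-up outcomes for ten matched problems (not data from the paper).
toy_python = [True] * 8 + [False] * 2
toy_pylang = [True] * 5 + [False] * 5
toy_result = paired_gap(toy_python, toy_pylang)
```

A gap near zero with balanced discordant counts would point toward the settling outcome described above; a large gap driven almost entirely by `python_only` would not.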
Original abstract
Large language models (LLMs) achieve high pass rates on code generation benchmarks, yet whether they can transfer this ability to languages absent from pretraining remains poorly understood. We introduce PyLang, a minimal imperative language absent from all pretraining corpora, and evaluate frontier models zero-shot and fine-tuned Qwen3 (4B, 8B, 32B) on 352 problems. We find that fine-tuning quickly teaches syntax but fails to transfer semantic competence: Python outperforms PyLang by up to 19% across all configurations, and no intervention (multi-task learning, preference tuning, code infilling, or latent-space objectives) closes the gap. An LLM judge reveals that frontier models select an identical algorithm to Python 80% of the time, yet cannot translate it into a working PyLang implementation., and CKA analysis confirms that fine-tuned models converge to nearly identical internal representations across languages (CKA > 0.97) while diverging at the output stage. We term this the implementation fidelity gap: models possess language-agnostic algorithmic understanding but cannot express it in an unfamiliar language. Our findings highlight the need for training methods that decouple reasoning from language-specific realization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PyLang, a minimal imperative language absent from pretraining corpora, and evaluates zero-shot and fine-tuned Qwen3 models (4B/8B/32B) on 352 coding problems. It claims that fine-tuning rapidly teaches syntax but fails to transfer semantic competence, with Python outperforming PyLang by up to 19% across configurations; interventions including multi-task learning, preference tuning, code infilling, and latent-space objectives do not close the gap. Supporting measurements include an LLM judge reporting 80% identical algorithm selection and CKA similarity >0.97 between languages, leading to the proposed 'implementation fidelity gap' where models possess language-agnostic algorithmic understanding but cannot realize it in an unfamiliar syntax.
Significance. If substantiated, the result would be significant for understanding limitations in LLM code generation for novel or low-resource languages, showing that current fine-tuning and alignment methods do not suffice to bridge syntax acquisition to semantic realization. The empirical design with multiple interventions and internal representation analysis via CKA provides a solid foundation for the central claim and highlights the need for training paradigms that better decouple reasoning from language-specific output.
major comments (3)
- [§3.2] §3.2 (problem set construction): The translation process from the 352 Python problems to PyLang is not accompanied by reported controls for equivalent difficulty or structure, such as counts of control-flow constructs, output-length statistics, or human-rated difficulty scores. This is load-bearing for the central claim because the 19% gap and failed interventions could arise from translation artifacts rather than an implementation fidelity gap.
- [§4.3] §4.3 (intervention experiments): The descriptions of multi-task learning, preference tuning, code infilling, and latent-space objectives lack hyperparameter details, training curves, or ablation results showing why each intervention was insufficient. Without these, it remains unclear whether the persistent gap is fundamental or could be mitigated by more extensive tuning within the manuscript's scope.
- [§5.1] §5.1 (LLM judge and CKA): The 80% algorithm-match rate from the LLM judge is presented without the judge prompt, inter-annotator agreement, or human validation; likewise the CKA > 0.97 result does not specify the layers or representation pairs compared. These omissions weaken the support for language-agnostic algorithmic understanding.
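For readers unfamiliar with the metric, linear CKA between two sets of activations can be sketched as below. This is a generic pure-Python illustration, not the authors' pipeline: which layers and token positions feed `x` and `y` is exactly what the §5.1 comment asks the paper to specify, and the toy matrices are made up.

```python
import math


def center_columns(rows):
    """Subtract the per-feature mean so each column has zero mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def gram(rows):
    """Linear-kernel Gram matrix: pairwise dot products between examples."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def frobenius_inner(a, b):
    """Sum of elementwise products of two equally shaped matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def linear_cka(x, y):
    """Linear CKA between activations x (n x d1) and y (n x d2) for the same n examples.

    Raises ZeroDivisionError if either input has no variance across examples.
    """
    kx = gram(center_columns(x))
    ky = gram(center_columns(y))
    return frobenius_inner(kx, ky) / math.sqrt(
        frobenius_inner(kx, kx) * frobenius_inner(ky, ky)
    )


# Toy activations: y is x with its two features swapped and scaled by 2.
x = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
y = [[2.0 * b, 2.0 * a] for a, b in x]
z = [[1.0], [-1.0], [1.0], [-1.0]]
same_geometry = linear_cka(x, y)
different_geometry = linear_cka(x, z)
```

Because linear CKA is invariant to orthogonal transforms and isotropic scaling, `same_geometry` comes out at 1 up to floating-point error. A score above 0.97 therefore says the two representation sets share geometry, not that the downstream decoding behaves the same.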
minor comments (2)
- [Abstract] Abstract: The sentence containing the CKA result has a stray period before the comma ('implementation., and CKA') that should be corrected for readability.
- [§2] §2 (Related Work): Additional citations to prior studies on code generation for constructed or low-resource languages would strengthen the positioning of PyLang.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. The comments have identified valuable opportunities to improve transparency and robustness, particularly around experimental controls and details. We address each major comment below and outline the revisions we will make.
Point-by-point responses
- Referee: [§3.2] §3.2 (problem set construction): The translation process from the 352 Python problems to PyLang is not accompanied by reported controls for equivalent difficulty or structure, such as counts of control-flow constructs, output-length statistics, or human-rated difficulty scores. This is load-bearing for the central claim because the 19% gap and failed interventions could arise from translation artifacts rather than an implementation fidelity gap.
  Authors: We agree that explicit controls would strengthen the equivalence argument. The problems are direct translations of the original set, preserving algorithmic structure by design. In the revised manuscript, we will add counts of control-flow constructs (loops, conditionals, etc.) and output-length statistics for both languages. Human-rated difficulty scores were not collected in the original study due to resource limits; we will instead add a discussion clarifying that translation fidelity was verified through manual inspection of a sample. We maintain that the persistent gap across interventions and model scales supports the implementation fidelity gap rather than translation artifacts.
  Revision: partial
- Referee: [§4.3] §4.3 (intervention experiments): The descriptions of multi-task learning, preference tuning, code infilling, and latent-space objectives lack hyperparameter details, training curves, or ablation results showing why each intervention was insufficient. Without these, it remains unclear whether the persistent gap is fundamental or could be mitigated by more extensive tuning within the manuscript's scope.
  Authors: We will expand §4.3 and add a dedicated appendix section with full hyperparameter configurations for each intervention (learning rates, batch sizes, epochs, etc.). We will also include representative training curves and additional ablation results demonstrating the range of tuning explored. These additions will show that the gap persisted despite systematic variation, supporting our claim that current methods do not suffice to bridge the syntax-semantics divide within practical compute budgets.
  Revision: yes
- Referee: [§5.1] §5.1 (LLM judge and CKA): The 80% algorithm-match rate from the LLM judge is presented without the judge prompt, inter-annotator agreement, or human validation; likewise the CKA > 0.97 result does not specify the layers or representation pairs compared. These omissions weaken the support for language-agnostic algorithmic understanding.
  Authors: We will include the complete LLM judge prompt in the appendix for reproducibility. We will specify that CKA was computed on final-layer hidden states for aligned token positions across languages. Although a single automated judge was used (precluding traditional inter-annotator agreement), we will add human validation results on a 50-problem subset to corroborate the 80% algorithm-match rate. These details will be incorporated to better substantiate the language-agnostic algorithmic understanding claim.
  Revision: yes
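The proposed human validation could be summarized with a chance-corrected agreement statistic between the LLM judge and a human annotator. The sketch below uses Cohen's kappa, which is an assumption here: the rebuttal does not say which statistic will be reported, and the label lists are invented for illustration.

```python
def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two raters over the same items."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("both raters must label the same items")
    n = len(judge_labels)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Expected agreement if the raters labeled independently at their own base rates.
    labels = set(judge_labels) | set(human_labels)
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical "same algorithm as the Python solution?" verdicts (True = same).
toy_judge = [True, True, True, False, False, True, False, True, True, False]
toy_human = [True, True, False, False, False, True, False, True, True, True]
toy_kappa = cohens_kappa(toy_judge, toy_human)
```

In this toy case the raters agree on 8 of 10 items, yet kappa is only about 0.58 once chance agreement is removed, which is why raw agreement alone would be a weak corroboration of the 80% figure.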
Circularity Check
No circularity: empirical measurements of pass rates and representations are independent of any fitted derivation.
full rationale
The paper is an empirical study that introduces PyLang as a novel language, runs zero-shot and fine-tuned evaluations on 352 problems, and reports direct metrics such as pass rates (Python outperforming PyLang by up to 19%), LLM judge agreement (80% identical algorithms), and CKA similarity (>0.97). No equations, first-principles derivations, or predictions are defined in terms of quantities fitted to the same data; the central claim about the implementation fidelity gap follows from these independent experimental observations rather than reducing to self-definitional constructs or self-citation chains. The work is therefore self-contained against external benchmarks of model performance.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: PyLang is absent from all pretraining corpora
- domain assumption: The 352 problems are directly comparable in difficulty and structure between Python and PyLang
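The second assumption can be probed by comparing structural statistics of paired solutions. A minimal sketch for the Python side, using the standard `ast` module, is shown below. The node selection and the sample function are assumptions, and a PyLang counterpart would need a PyLang parser, which is not described on this page.

```python
import ast
from collections import Counter

# Node types treated as control-flow constructs in this sketch (an assumption;
# comprehensions and conditional expressions are deliberately not counted).
CONTROL_NODES = {ast.For: "for", ast.While: "while", ast.If: "if", ast.Try: "try"}


def control_flow_counts(source):
    """Count control-flow statements in a Python source string."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        for node_type, name in CONTROL_NODES.items():
            if isinstance(node, node_type):
                counts[name] += 1
    return counts


# Hypothetical reference solution, not taken from the paper's problem set.
sample_solution = """
def count_commas(text):
    i = 0
    comma_count = 0
    while i < len(text):
        if text[i] == ",":
            comma_count += 1
        i += 1
    return comma_count
"""

sample_counts = control_flow_counts(sample_solution)
```

Note that Python's parser represents `elif` as a nested `if`, so each `elif` branch adds one to the `if` count. Any cross-language comparison would need to adopt the same convention on the PyLang side.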
invented entities (1)
- implementation fidelity gap: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: We introduce PyLang, a minimal imperative language absent from all pretraining corpora... implementation fidelity gap
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: CKA analysis confirms that fine-tuned models converge to nearly identical internal representations (CKA > 0.97)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Jacob Austin, Augustus Odena, Maxwell Nye, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
- [2] Sanjay Basu, Sadiq Y. Patel, Parth Sheth, et al. Interpretability without actionability: Mechanistic methods cannot correct language model errors despite near-perfect internal representations. arXiv preprint arXiv:2603.18353.
- [3] Mark Chen, Jerry Tworek, Heewoo Jun, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- [4] Anmol Gupta, Tushar Kataria, and Nasser Nasrabadi. Changing answer order can decrease MMLU accuracy. arXiv preprint arXiv:2406.19470.
- [5] Sama Hadhoud, Alaa Elsetohy, Frederikus Hudi, et al. Idea first, code later: Disentangling problem solving from code generation in evaluating LLMs for competitive programming. arXiv preprint arXiv:2601.11332.
- [6] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
- [7] Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- [8] Aman Sharma and Paras Chopra. EsoLang-Bench: Evaluating genuine reasoning in large language models via esoteric programming languages. arXiv preprint arXiv:2603.09678.
- [9] Chen Shen, Wei Cheng, Jingyue Yang, et al. Bridging the knowledge void: Inference-time acquisition of unfamiliar programming languages for coding tasks. arXiv preprint arXiv:2602.06976.
- [10] Evan Wang, Federico Cassano, Catherine Wu, et al. Planning in natural language improves LLM search for code generation. arXiv preprint arXiv:2409.03733.
- [11] Fanglin Xu, Wei Zhang, Jian Yang, et al. M2G-Eval: Enhancing and evaluating multi-granularity multilingual code generation. arXiv preprint arXiv:2512.22628.
- [12] An Yang, Baosong Yang, Binyuan Hui, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671.
- [13] Zhe Yin, Xiaodong Gu, and Beijun Shen. Neuron-guided interpretation of code LLMs: Where, why, and how? arXiv preprint arXiv:2512.19980.
- [14] Zihan Zheng, Zerui Cheng, Zeyu Shen, et al. LiveCodeBench Pro: How do olympiad medalists judge LLMs in competitive programming? arXiv preprint arXiv:2506.11928.
- [15] and pattern-matching exploitation Gupta et al. (2024). More challenging benchmarks include LiveCodeBench (Jain et al., 2025), SWE-bench (Jimenez et al., 2024), and LiveCodeBench Pro (Zheng et al., 2025), where Olympiad medalists annotate problems and find that frontier models still score 0% on hard problems, succeeding primarily on implementation-heavy ta...
- [16] "; i = 0; line_count = 0; while (i < len(input)) { if (input[i] ==
  introduced multi-granularity evaluation across 18 languages, finding strong cross-language correlations that suggest models learn transferable programming concepts. Our work differs from all of these by evaluating the same problems across two languages, one known, one unseen, to directly isolate what fine-tuning contributes beyond pretraining. Cross-Lingua...