pith. machine review for the scientific record.

arxiv: 2603.09678 · v2 · submitted 2026-03-10 · 💻 cs.AI · cs.LG · cs.SE

Recognition: no theorem link

EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

Authors on Pith no claims yet

Pith reviewed 2026-05-15 13:28 UTC · model grok-4.3

classification 💻 cs.AI cs.LG cs.SE
keywords esoteric programming languages · LLM evaluation · out-of-distribution performance · code generation benchmarks · reasoning tests · Brainfuck · generalization · prompting strategies

The pith

Frontier LLMs solve algorithmic problems with 100 percent accuracy in Python or JavaScript but drop to 0-11 percent on equivalent esoteric-language versions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models perform genuine algorithmic reasoning or simply exploit patterns from common training languages. It introduces EsoLang-Bench by translating 80 problems into five esoteric languages that are Turing-complete yet rarely seen in pre-training data. Top models handle every instance perfectly when the language is Python or JavaScript, but accuracy collapses for the same problems in Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare. Additional techniques such as few-shot examples and self-reflection fail to close the gap. The result implies that current high scores on code benchmarks may reflect familiarity rather than transferable problem-solving skill.

Core claim

By expressing the same 80 algorithmic problems in five esoteric programming languages instead of Python or JavaScript, the authors demonstrate that frontier models achieve only 0 to 11 percent accuracy on the esoteric versions while reaching 100 percent on the familiar-language versions, and that neither few-shot learning nor self-reflection prompting recovers performance.

What carries the argument

EsoLang-Bench, a benchmark of 80 problems translated into five esoteric Turing-complete languages chosen for their hard primitives and minimal presence in training corpora.
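
To make the gap concrete, consider an Easy-tier task of the kind listed in the paper's appendix: read two integers from one line and print their sum (input "5 7", output "12"). The sketch below is purely illustrative and assumes nothing about the authors' reference solutions; the solve function is a hypothetical stand-in, and the Brainfuck idiom quoted in the comments only shows the raw primitives the esoteric version must be built from.

    # Illustrative only: an Easy-tier task of the kind shown in the paper's
    # appendix (read two integers from one line, print their sum).
    def solve(line: str) -> str:
        a, b = map(int, line.split())
        return str(a + b)

    assert solve("5 7") == "12"     # appendix-style test case
    assert solve("-3 10") == "7"

    # A Brainfuck solution has eight primitives (+ - < > [ ] . ,) over a byte
    # tape. Even the standard "add two cells" idiom,
    #
    #     [->+<]    (while cell0 != 0: decrement cell0, increment cell1)
    #
    # only combines two already-loaded cells; decimal parsing, negative
    # numbers, and multi-digit output all have to be assembled from such
    # loops by hand.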

If this is right

  • Standard code-generation benchmarks may overestimate reasoning ability because they rely on languages that dominate training data.
  • LLMs appear to depend on frequency of exposure to specific syntax rather than on abstract algorithmic procedures.
  • Few-shot prompting and self-reflection do not enable models to acquire new programming-language primitives on the fly.
  • Benchmarks built from rare languages can function as contamination-resistant measures of out-of-distribution generalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same pattern of failure could appear in other domains when problems are presented in unfamiliar notations or formats.
  • Augmenting training data with synthetic or esoteric language examples might improve robustness if the core reasoning deficit is confirmed.
  • Future test suites could combine esoteric-language tasks with other out-of-distribution probes to separate memorization from genuine generalization.

Load-bearing premise

That the five esoteric languages have negligible representation in the models' pre-training data and that the problem translations preserve identical difficulty and solvability.

What would settle it

Showing that a model reaches accuracy on the esoteric versions comparable to its Python performance after receiving only a small amount of additional exposure to those languages' syntax and semantics would falsify the claim of absent generalization.
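
One way that experiment could be run, sketched under explicit assumptions: query_model(prompt) is a hypothetical wrapper around whichever model API is used, passes(program, tests) is an interpreter-based checker like the one sketched after the figures below, and the primer plus worked examples stand in for the "small amount of additional exposure"; none of this reproduces the authors' prompts or harness.

    # Hypothetical sketch: give the model a short language primer and a few
    # worked examples, then re-score the same problems with the same checker.
    def accuracy_with_exposure(problems, primer, worked_examples, query_model, passes):
        solved = 0
        for spec, tests in problems:        # spec: task text, tests: (input, expected) pairs
            prompt = (
                f"Language primer:\n{primer}\n\n"
                + "\n\n".join(f"Task: {t}\nProgram:\n{p}" for t, p in worked_examples)
                + f"\n\nTask: {spec}\nReturn only the program."
            )
            program = query_model(prompt)   # hypothetical model call
            solved += passes(program, tests)
        return solved / len(problems)

If this score approaches the Python baseline, the no-generalization reading is undermined; if it stays in the 0-11% band, the paper's claim survives the probe.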

Figures

Figures reproduced from arXiv: 2603.09678 by Aman Sharma, Paras Chopra.

Figure 1
Figure 1. Average accuracy across all five esoteric languages by model and prompting strategy. Self-Scaffolding consistently achieves the highest accuracy, with GPT-5.2 reaching 6.2%. All models perform below 7% even with advanced scaffolding. view at source ↗
Figure 2
Figure 2. EsoLang-Bench Overview. Left: The benchmark comprises five esoteric programming languages spanning diverse computational paradigms, with 80 problems across four difficulty tiers (400 total evaluations). Right: Evaluation pipeline testing five frontier models across multiple prompting strategies, with automated interpreter-based verification. The best model achieves only 3.8% accuracy compared to 100% on e… view at source ↗
Figure 3
Figure 3. Training data scarcity (log scale). Esoteric languages have 5,000× fewer GitHub repositories than Python. view at source ↗
Figure 4
Figure 4. Error distribution by language (GPT-5.2 zero-shot). BF=Brainfuck, Bef=Befunge-98, WS=Whitespace, Unl=Unlambda, Shk=Shakespeare. Whitespace and Unlambda show near-total compile failure; Brainfuck shows primarily logic errors. view at source ↗
Figure 5
Figure 5. Best accuracy achieved per language (across all models and strategies). Befunge-98 is the most tractable (11.2%), while Whitespace remains completely unsolved (0%). view at source ↗
Figure 6
Figure 6. Agentic systems vs best non-agentic approach. Agentic coding systems achieve 2× higher accuracy than the best self-scaffolding approach. view at source ↗
Figure 7
Figure 7. Compile error rates by model across languages. All models show near-identical patterns: low compile errors on Brainfuck/Befunge-98, complete failure (100%) on Whitespace, and high failure (88–95%) on Unlambda. view at source ↗
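
Figure 2 describes the pipeline as automated, interpreter-based verification. Below is a minimal sketch of what such a check could look like for Brainfuck, assuming exact-match comparison on captured output and a step cap so non-terminating candidates fail rather than hang; the authors' actual interpreters and matching rules are not reproduced here.

    # Minimal Brainfuck interpreter plus pass/fail check (illustrative only).
    def run_bf(code: str, stdin: str = "", max_steps: int = 1_000_000) -> str:
        jumps, stack = {}, []
        for i, c in enumerate(code):               # pre-match brackets
            if c == "[":
                stack.append(i)
            elif c == "]":
                j = stack.pop()                    # raises on unbalanced code
                jumps[i], jumps[j] = j, i
        tape, ptr, pc, inp, out = [0] * 30_000, 0, 0, 0, []
        for _ in range(max_steps):
            if pc >= len(code):
                break
            c = code[pc]
            if c == ">":   ptr += 1
            elif c == "<": ptr -= 1
            elif c == "+": tape[ptr] = (tape[ptr] + 1) % 256
            elif c == "-": tape[ptr] = (tape[ptr] - 1) % 256
            elif c == ".": out.append(chr(tape[ptr]))
            elif c == ",":
                tape[ptr] = ord(stdin[inp]) % 256 if inp < len(stdin) else 0
                inp += 1
            elif c == "[" and tape[ptr] == 0: pc = jumps[pc]
            elif c == "]" and tape[ptr] != 0: pc = jumps[pc]
            pc += 1
        return "".join(out)

    def passes(program: str, tests) -> bool:
        """Exact-match verification over (input, expected_output) pairs."""
        try:
            return all(run_bf(program, inp).strip() == exp for inp, exp in tests)
        except (IndexError, KeyError):             # malformed program counts as a failure
            return False

    # Sanity check: prints "A" (8 * 8 + 1 = 65).
    assert run_bf("++++++++[>++++++++<-]>+.") == "A"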
read the original abstract

Large language models achieve near-ceiling performance on code generation benchmarks, yet most of the programming languages used by popular benchmarks such as SWE-bench and HumanEval (e.g. Python, JavaScript) are squarely in-distribution. They appear at scale in pre-training corpora and are heavily reinforced during post-training. To study LLM performance on unfamiliar programming languages, we introduce EsoLang-Bench, a benchmark using five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare). All five of our chosen esoteric languages are Turing-complete, so the same algorithmic problems that are solvable in Python or JavaScript are in principle solvable in each of them. Yet, they are unfamiliar to LLMs, which makes them a good proxy for evaluating out-of-distribution performance. The unfamiliarity of esoteric languages stems from: (i) the hard-by-design primitives comprising the language; (ii) substantially less representation in pre-training corpora (340x to over 60,000x fewer public GitHub repositories than Python); (iii) negligible deployment value, which makes targeted inclusion in post-training data economically irrational. We evaluate five frontier models across five prompting strategies and find a dramatic capability gap. The same 80 problems expressed in Python or JavaScript reach 100% accuracy on top frontier models, while the equivalent esoteric versions score only 0-11%. Few-shot learning and self-reflection also fail to close this gap. EsoLang-Bench therefore provides a contamination-resistant testbed for measuring how well frontier models generalise algorithmic problem-solving to programming languages outside their training distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EsoLang-Bench, a benchmark of 80 algorithmic problems translated into five esoteric Turing-complete languages (Brainfuck, Befunge-98, Whitespace, Unlambda, Shakespeare) to measure LLM generalization on out-of-distribution code. It reports that frontier models reach 100% accuracy on the same problems in Python/JavaScript but only 0-11% on the esoteric versions, with few-shot prompting and self-reflection failing to close the gap, positioning the benchmark as a contamination-resistant test of genuine reasoning.

Significance. If the problem translations are shown to be equivalent in complexity and the evaluation controls are tightened, the benchmark would provide a useful, hard-to-contaminate probe for algorithmic generalization beyond heavily represented languages, highlighting current limits in handling unfamiliar primitives despite Turing completeness.

major comments (2)
  1. [Problem Translation and Equivalence] The central claim of a distribution-shift gap rests on the 80 problems being functionally equivalent across languages. The manuscript provides no human baselines, solution-length statistics, or explicit verification that test cases and functional requirements are preserved after translation; without these, the 0-11% scores could partly reflect increased syntactic difficulty rather than out-of-distribution effects alone.
  2. [Experimental Setup] Evaluation details are insufficient for reproducibility: exact model versions, number of runs, statistical significance tests on the accuracy differences, and controls for prompt-engineering variations are not fully specified, weakening confidence in the reported dramatic gap.
minor comments (2)
  1. [Prompting Strategies] Clarify the exact prompting templates used for each strategy (zero-shot, few-shot, self-reflection) in an appendix to allow direct replication.
  2. [Language Selection] Add a table summarizing repository counts or token frequencies for each esoteric language versus Python to support the '340x to 60,000x fewer' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the strengths and areas for improvement in our presentation of EsoLang-Bench. We address each major comment in turn and commit to revisions that enhance the manuscript's rigor.

read point-by-point responses
  1. Referee: [Problem Translation and Equivalence] The central claim of a distribution-shift gap rests on the 80 problems being functionally equivalent across languages. The manuscript provides no human baselines, solution-length statistics, or explicit verification that test cases and functional requirements are preserved after translation; without these, the 0-11% scores could partly reflect increased syntactic difficulty rather than out-of-distribution effects alone.

    Authors: We agree that additional evidence for functional equivalence would strengthen our central claim. The translations were performed to maintain identical algorithmic logic and test cases, with verification through code review and execution on available interpreters. However, we did not report human baselines or solution-length statistics. In the revised version, we will include human performance baselines on a representative subset of problems (in Python and at least one esoteric language), comparative statistics on solution lengths, and a detailed appendix describing the translation methodology and equivalence checks. This will help demonstrate that the performance disparity arises primarily from distributional shift rather than inherent differences in problem difficulty. revision: yes

  2. Referee: [Experimental Setup] Evaluation details are insufficient for reproducibility: exact model versions, number of runs, statistical significance tests on the accuracy differences, and controls for prompt-engineering variations are not fully specified, weakening confidence in the reported dramatic gap.

    Authors: We appreciate the call for improved reproducibility. The manuscript outlines the evaluation protocol but lacks granular details on model versions, run counts, and statistical methods. We will revise the experimental section to specify the precise model versions used (including dates or checkpoints where applicable), report results from multiple runs (with means and standard deviations), incorporate statistical significance testing for the accuracy differences (e.g., using appropriate non-parametric tests), and provide full prompt templates along with controls for variations in few-shot examples and self-reflection prompts. These changes will allow for better replication and validation of our findings. revision: yes
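
For the paired pass/fail data that revision would produce, one standard non-parametric choice is an exact sign (McNemar-style) test on the per-problem outcomes of the two conditions. The sketch below is illustrative; sign_test_p and the example arrays are hypothetical, not the authors' analysis.

    from math import comb

    def sign_test_p(pass_a, pass_b):
        """Two-sided exact sign (McNemar-style) test on paired 0/1 outcomes."""
        assert len(pass_a) == len(pass_b)
        a_only = sum(1 for x, y in zip(pass_a, pass_b) if x and not y)
        b_only = sum(1 for x, y in zip(pass_a, pass_b) if y and not x)
        n = a_only + b_only                      # only discordant pairs are informative
        if n == 0:
            return 1.0
        k = min(a_only, b_only)
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n   # P(X <= k), X ~ Bin(n, 1/2)
        return min(1.0, 2 * tail)

    # Made-up data shaped like the reported gap: Python solves all 80 problems,
    # the esoteric version solves 3 of them.
    python_pass = [1] * 80
    eso_pass = [1] * 3 + [0] * 77
    print(sign_test_p(python_pass, eso_pass))    # vanishingly small p-value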

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark evaluation

full rationale

The paper presents EsoLang-Bench as a direct empirical measurement of frontier LLM accuracy on 80 fixed algorithmic problems translated across Python/JavaScript versus five esoteric languages. All reported results (100% vs 0-11% accuracy, failure of few-shot and self-reflection) are obtained by running the same models on the same test cases and counting pass rates; no equations, fitted parameters, or derivations are introduced that would make the claimed generalization gap true by construction. Turing-completeness is invoked only as a background fact to establish solvability in principle, with no self-citation chain, ansatz smuggling, or renaming of known results serving as load-bearing steps. The evaluation therefore rests directly on external model runs and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that the selected languages have sufficiently low pre-training exposure to serve as a proxy for genuine out-of-distribution generalization, plus the standard assumption that Turing-completeness preserves problem equivalence.

axioms (2)
  • domain assumption The five esoteric languages are Turing-complete and can express the same algorithmic problems as Python without added difficulty from language design.
    Explicitly stated in the abstract as the basis for using them as equivalent test cases.
  • domain assumption Representation counts (340x to 60,000x fewer GitHub repos) accurately reflect pre-training exposure and post-training exclusion.
    Used to justify that the languages are unfamiliar and economically irrational to include in training.
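
The repository-count premise is checkable in principle. A rough sketch follows, assuming GitHub's repository-search API and its Linguist language labels are an acceptable proxy for public-code prevalence (not every esoteric language has a Linguist entry, and this is not necessarily how the paper counted).

    # Rough proxy for public-code prevalence via GitHub's repository search API.
    # Unauthenticated search is tightly rate-limited, so requests are paced.
    import json
    import time
    import urllib.parse
    import urllib.request

    def repo_count(language: str) -> int:
        query = urllib.parse.quote(f"language:{language}")
        url = f"https://api.github.com/search/repositories?q={query}&per_page=1"
        req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["total_count"]

    if __name__ == "__main__":
        counts = {}
        for lang in ("python", "brainfuck", "befunge"):   # labels Linguist recognizes
            counts[lang] = repo_count(lang)
            time.sleep(10)                                # stay under the search rate limit
        for lang in ("brainfuck", "befunge"):
            print(f"python / {lang}: {counts['python'] / max(counts[lang], 1):,.0f}x")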

pith-pipeline@v0.9.0 · 5589 in / 1335 out tokens · 54318 ms · 2026-05-15T13:28:34.496839+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages
