CrossLangFuzzer: Differential Testing of Cross-Language JVM Compilers

Peng Liang; Qiong Feng; Wei Song; Xiaotian Ma; Yongqiang Tian

arxiv: 2606.28132 · v1 · pith:357GJH5Gnew · submitted 2026-06-26 · 💻 cs.SE

Xiaotian Ma, Qiong Feng, Yongqiang Tian, Wei Song, Peng Liang

Pith reviewed 2026-06-29 03:25 UTC · model grok-4.3

classification 💻 cs.SE

keywords differential testing · cross-language compilation · JVM compilers · compiler testing · Kotlin intermediate representation · miscompilation detection · multi-language software

The pith

CrossLangFuzzer generates cross-language JVM tests from Kotlin's unified IR to expose miscompilations that single-language testers miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Applications now routinely mix Java, Kotlin, Groovy, and Scala inside one JVM process, so compilers must reconcile semantic differences at language boundaries. Single-language testing leaves these boundary errors unexamined. CrossLangFuzzer creates test programs that cross language lines by starting from the Kotlin compiler's shared intermediate representation and then applying seven mutation operators. Running the resulting programs on five production JVM compilers produced 32 confirmed bugs. The method therefore supplies a practical way to check that interoperability does not silently introduce wrong code.

Core claim

CrossLangFuzzer is the first differential testing framework for cross-language JVM compilation. It synthesizes test programs from the Kotlin compiler's unified intermediate representation, applies seven mutation operators to increase diversity, and compares the observable behavior of multiple compilers on the same input; any divergence is reported as a potential miscompilation.
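The comparison step described above can be illustrated with a minimal sketch. It is not the tool's implementation: the pipeline names (`compiler_a`, `compiler_b`, `compiler_c`), the outcome strings, and the helper functions are all hypothetical stand-ins, and each "compiler" is modeled as a plain function from a program to its observable outcome.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Assumption: an outcome is any comparable summary of observable behavior,
# for example "ok:42", "compile-error", or "runtime-error".
Outcome = str
Pipeline = Callable[[str], Outcome]


def differential_check(program: str, pipelines: Dict[str, Pipeline]) -> Dict[Outcome, List[str]]:
    """Run one program through every pipeline and group pipelines by outcome."""
    groups: Dict[Outcome, List[str]] = defaultdict(list)
    for name, run in pipelines.items():
        groups[run(program)].append(name)
    return dict(groups)


def is_divergent(groups: Dict[Outcome, List[str]]) -> bool:
    """More than one outcome group means the pipelines disagree."""
    return len(groups) > 1


# Toy pipelines standing in for real compile-and-run steps (hypothetical).
toy_pipelines: Dict[str, Pipeline] = {
    "compiler_a": lambda src: "ok:42",
    "compiler_b": lambda src: "ok:42",
    "compiler_c": lambda src: "compile-error",
}

groups = differential_check("<generated program>", toy_pipelines)
if is_divergent(groups):
    # A divergence is only a candidate; it still needs triage.
    print("potential miscompilation:", groups)
```

Grouping by outcome, rather than comparing against a single reference compiler, keeps the check symmetric: no compiler is assumed to be correct in advance.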

What carries the argument

Differential testing of cross-language programs synthesized from Kotlin's unified intermediate representation and diversified by seven mutation operators.
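This summary does not list the seven mutation operators, so the following sketch shows only what a mutation over a language-tagged program model could look like. The operator (`mutate_language`), the dictionary-based program model, and the `TARGET_LANGUAGES` list are illustrative assumptions, not operators taken from the paper.

```python
import copy
import random
from typing import Dict, List

# Assumption: each class in the toy model carries the language it will be
# rendered in. The language list mirrors the languages named in the abstract.
TARGET_LANGUAGES = ["java", "kotlin", "groovy", "scala"]


def mutate_language(program: List[Dict[str, str]], rng: random.Random) -> List[Dict[str, str]]:
    """Hypothetical operator: move one class to a different source language.

    The input program is left untouched; a mutated copy is returned.
    """
    mutated = copy.deepcopy(program)
    victim = rng.choice(mutated)
    alternatives = [lang for lang in TARGET_LANGUAGES if lang != victim["language"]]
    victim["language"] = rng.choice(alternatives)
    return mutated


seed_program = [
    {"name": "Base", "language": "java"},
    {"name": "Child", "language": "kotlin"},
]
variant = mutate_language(seed_program, random.Random(0))
print(variant)
```

Passing an explicit `random.Random` instance makes each variant reproducible from its seed, which matters when a divergence has to be re-created for a bug report.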

If this is right

Compiler developers gain a repeatable way to check cross-language features without writing manual interoperability tests.
Bugs that only appear when code crosses language boundaries become detectable in routine testing.
The same IR-based generation technique can be reused for any additional JVM language that shares the Kotlin IR.
Mixed-language applications can be validated more thoroughly before deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The mutation operators could be ported to differential testing of other shared-runtime platforms such as the .NET CLR.
If the framework is extended with coverage-guided selection, the rate of bug discovery might increase beyond the 32 cases already found.
Adoption by compiler teams would shift interoperability testing from ad-hoc manual cases to automated differential checks.

Load-bearing premise

Observed differences in compiler output on the generated tests indicate actual miscompilation rather than acceptable differences in how each language defines the same behavior.

What would settle it

A concrete test case generated by CrossLangFuzzer on which two compilers produce different results, yet both results are shown to match the official language specifications for the source languages involved.

Figures

Figures reproduced from arXiv: 2606.28132 by Peng Liang, Qiong Feng, Wei Song, Xiaotian Ma, Yongqiang Tian.

**Figure 2.** CrossLangFuzzer Overall Framework. (1) Generator: Given a configuration file which specifies target JVM languages, the generator synthesizes an initial cross-language program modeled in our custom IR. The generator is valid by construction: integrated semantic constraint-checking guarantees that the generated class hierarchies and type assignments strictly conform to JVM inheritance rules. (2) Mutator: To…
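The caption's "valid by construction" constraint can be made concrete with a small checker. This is a sketch under stated assumptions: `ClassDecl` and `hierarchy_errors` are hypothetical names, the model is far simpler than the paper's custom IR, and only a few well-known inheritance rules are checked (no extending a final class or an interface, implemented types must be interfaces, no cyclic inheritance).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ClassDecl:
    name: str
    language: str                      # hypothetical tag, e.g. "java" or "kotlin"
    is_interface: bool = False
    is_final: bool = False
    superclass: Optional[str] = None
    interfaces: List[str] = field(default_factory=list)


def hierarchy_errors(decls: List[ClassDecl]) -> List[str]:
    """Return the rule violations found in a toy class hierarchy."""
    by_name: Dict[str, ClassDecl] = {d.name: d for d in decls}
    errors: List[str] = []

    def parents(d: ClassDecl) -> List[str]:
        # Always builds a fresh list, so callers may mutate the result.
        return ([d.superclass] if d.superclass else []) + d.interfaces

    for d in decls:
        if d.superclass is not None:
            sup = by_name.get(d.superclass)
            if d.is_interface:
                # In this toy model an interface may only list super-interfaces.
                errors.append(f"{d.name}: an interface cannot have a superclass")
            elif sup is None:
                errors.append(f"{d.name}: unknown superclass {d.superclass}")
            elif sup.is_interface:
                errors.append(f"{d.name}: cannot extend interface {sup.name}")
            elif sup.is_final:
                errors.append(f"{d.name}: cannot extend final class {sup.name}")
        for iface_name in d.interfaces:
            iface = by_name.get(iface_name)
            if iface is None or not iface.is_interface:
                errors.append(f"{d.name}: {iface_name} is not a known interface")
        # Walk upward through all ancestors; reaching d again means a cycle.
        stack, seen = parents(d), set()
        while stack:
            current = stack.pop()
            if current == d.name:
                errors.append(f"{d.name}: cyclic inheritance")
                break
            if current in seen or current not in by_name:
                continue
            seen.add(current)
            stack.extend(parents(by_name[current]))
    return errors


valid = [
    ClassDecl("Base", "java"),
    ClassDecl("Shape", "kotlin", is_interface=True),
    ClassDecl("Child", "kotlin", superclass="Base", interfaces=["Shape"]),
]
invalid = [
    ClassDecl("Sealed", "java", is_final=True),
    ClassDecl("Child", "kotlin", superclass="Sealed"),
]
print(hierarchy_errors(valid))
print(hierarchy_errors(invalid))
```

A generator that rejects or repairs any candidate with a non-empty error list never emits a hierarchy that all compilers would refuse for trivial reasons, so compile failures that do occur are more likely to be interesting.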

Original abstract

Modern JVM software increasingly integrates multiple programming languages, such as Java, Kotlin, Groovy, and Scala, within a single application. Supporting such interoperability requires JVM compilers to perform cross-language compilation while reconciling subtle semantic differences across language boundaries. Errors in this process can lead to critical miscompilations, yet existing compiler testing techniques focus exclusively on isolated, single-language compilation. To address this gap, we present CrossLangFuzzer, the first differential testing framework for cross-language JVM compilation. CrossLangFuzzer leverages the Kotlin compiler's unified intermediate representation (IR) to synthesize cross-language test programs. It further applies seven mutation operators to diversify generated test programs and improve bug-finding capability. Evaluated on the latest versions of five major JVM compilers, CrossLangFuzzer uncovered 32 confirmed bugs, including 15 in Kotlin, 4 in Groovy, 7 in Scala 3, 2 in Scala 2, and 4 in Java. CrossLangFuzzer is open-source at https://github.com/XYZboom/CrossLangFuzzer

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CrossLangFuzzer is the first differential tester built for cross-language JVM programs and ships 32 externally confirmed bugs plus open code.

The core contribution is a new differential testing setup that generates mixed-language JVM programs from Kotlin IR and applies seven specific mutation operators to hunt for interoperability bugs. That approach is absent from the single-language testing papers it cites, and the evaluation actually ran the tool on current versions of five compilers and produced 32 bugs that compiler teams confirmed.

The work is solid on the practical side. Releasing the code and test cases lets others inspect the generated programs and the mutation operators directly. The bug counts break down across Kotlin, Groovy, Scala 2/3, and Java, which matches the stated goal of targeting cross-language issues rather than isolated compilation.

The main soft spot is narrow. The abstract gives little detail on the exact confirmation steps or how semantic differences across languages were ruled out before calling something a bug. The stress-test note says the bugs are externally validated, which lowers the risk, but a referee would still want the full paper to show the process. No other load-bearing problems appear in the reported results.

This paper is for people who work on compiler validation or multi-language JVM tooling. A reader who needs concrete test cases or wants to adapt the IR-based synthesis method will get usable material. It is worth sending to peer review because the artifacts are available and the findings rest on executed tests rather than fitted models or unverified claims.

Referee Report

1 major / 1 minor

Summary. The paper presents CrossLangFuzzer, the first differential testing framework for cross-language JVM compilers. It leverages the Kotlin compiler's unified IR to synthesize cross-language test programs and applies seven mutation operators for diversification. Evaluation on the latest versions of five major JVM compilers (Kotlin, Groovy, Scala 3, Scala 2, and Java) reports uncovering 32 confirmed bugs.

Significance. If the reported bugs are robustly validated as miscompilations rather than acceptable semantic variations, the work addresses a clear gap in existing single-language compiler testing techniques for multi-language JVM applications. The open-source release at the provided GitHub link supports reproducibility and independent verification of the generated tests and mutation operators.

major comments (1)

[Evaluation] Evaluation section: The central claim of 32 confirmed bugs is load-bearing, yet the manuscript provides no explicit description of the confirmation process, criteria for distinguishing miscompilation from implementation-defined behavior across language boundaries, or measured false-positive rate. This detail is necessary to substantiate that observed differences reliably indicate bugs.

minor comments (1)

[Abstract] The abstract and introduction could more clearly state the exact versions of the five compilers tested and the total number of test programs generated/executed to provide context for the bug count.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback. We address the major comment below and will incorporate the suggested clarifications into the revised manuscript.

Referee: [Evaluation] Evaluation section: The central claim of 32 confirmed bugs is load-bearing, yet the manuscript provides no explicit description of the confirmation process, criteria for distinguishing miscompilation from implementation-defined behavior across language boundaries, or measured false-positive rate. This detail is necessary to substantiate that observed differences reliably indicate bugs.

Authors: We agree that an explicit description of the bug confirmation process is necessary to support the central claim. In the revised version, we will add a dedicated subsection (likely 5.3 or similar) under Evaluation that details: (1) the multi-stage confirmation workflow (automated differential execution followed by manual inspection of bytecode and runtime behavior), (2) the specific criteria used to classify a difference as a miscompilation versus an acceptable implementation-defined or language-specific semantic variation (e.g., requiring the difference to violate documented JVM or language semantics and to be reproducible across multiple runs), and (3) the observed false-positive rate from our internal validation (including how many candidate differences were discarded). This addition will directly address the concern without altering the reported bug count.

revision: yes
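Two of the promised criteria, reproducibility across runs and a measured false-positive rate, are simple enough to sketch. The function names, the default of three attempts, and the label strings below are assumptions made for illustration; the rebuttal does not specify how the authors implement or count either one.

```python
from typing import Callable, List


def is_reproducible(run_once: Callable[[], str], attempts: int = 3) -> bool:
    """Keep a candidate only if every rerun yields the same outcome.

    `run_once` is a hypothetical callable that re-executes the differential
    check for one test program and returns a summary of the divergence.
    """
    outcomes = {run_once() for _ in range(attempts)}
    return len(outcomes) == 1


def false_positive_rate(triage_labels: List[str]) -> float:
    """Share of candidates that manual triage did not label as 'bug'."""
    if not triage_labels:
        raise ValueError("no triaged candidates")
    discarded = sum(1 for label in triage_labels if label != "bug")
    return discarded / len(triage_labels)


# Hypothetical triage result: four candidates, one discarded after inspection.
labels = ["bug", "not-a-bug", "bug", "bug"]
print(false_positive_rate(labels))
```

Reporting the rate alongside the bug count would let a reader judge how often a raw divergence turns out to be an acceptable semantic difference rather than a miscompilation.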

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes an empirical differential testing tool (CrossLangFuzzer) that synthesizes cross-language programs via Kotlin IR and mutation operators, then executes them to surface behavioral differences across JVM compilers. Results consist of 32 externally confirmed bugs; no equations, fitted parameters, predictions, or first-principles derivations are present. All load-bearing claims rest on observable test execution and independent validation rather than any self-referential reduction or self-citation chain, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the work is an empirical tool-construction paper whose central claim rests on the design of the seven mutation operators and the assumption that differential behavior signals bugs.


