CrossLangFuzzer: Differential Testing of Cross-Language JVM Compilers
Pith reviewed 2026-06-29 03:25 UTC · model grok-4.3
The pith
CrossLangFuzzer generates cross-language JVM tests from Kotlin's unified IR to expose miscompilations that single-language testers miss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CrossLangFuzzer is the first differential testing framework for cross-language JVM compilation. It synthesizes test programs from the Kotlin compiler's unified intermediate representation, applies seven mutation operators to increase diversity, and compares the observable behavior of multiple compilers on the same input; any divergence is reported as a potential miscompilation.
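The comparison step described above can be sketched in a few lines. This is a minimal illustration, not CrossLangFuzzer's implementation: the function `find_divergence`, the compiler labels, and the observed-output strings are all hypothetical, and the sketch assumes each compiler's behavior on one generated program has already been reduced to a comparable string.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def find_divergence(outputs: Dict[str, str]) -> Optional[Dict[str, List[str]]]:
    """Group compilers by the behavior they produced for one test program.

    Returns None when every compiler agrees; otherwise returns a mapping
    from each distinct observed behavior to the compilers that showed it.
    """
    groups = defaultdict(list)
    for compiler, observed in outputs.items():
        groups[observed].append(compiler)
    if len(groups) <= 1:
        return None  # no disagreement, nothing to report
    return {observed: sorted(names) for observed, names in groups.items()}


# Hypothetical observations for a single generated cross-language program.
observations = {
    "kotlinc": "OK: 42",
    "javac": "OK: 42",
    "scalac": "compile error",
}
report = find_divergence(observations)
```

A non-`None` report is only a candidate: whether the disagreement is a miscompilation or a legitimate semantic difference still has to be decided separately.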
What carries the argument
Differential testing of cross-language programs synthesized from Kotlin's unified intermediate representation and diversified by seven mutation operators.
If this is right
- Compiler developers gain a repeatable way to check cross-language features without writing manual interoperability tests.
- Bugs that only appear when code crosses language boundaries become detectable in routine testing.
- The same IR-based generation technique can be reused for any additional JVM language that shares the Kotlin IR.
- Mixed-language applications can be validated more thoroughly before deployment.
Where Pith is reading between the lines
- The mutation operators could be ported to differential testing of other shared-runtime platforms such as the .NET CLR.
- If the framework is extended with coverage-guided selection, the rate of bug discovery might increase beyond the 32 cases already found.
- Adoption by compiler teams would shift interoperability testing from ad-hoc manual cases to automated differential checks.
Load-bearing premise
Observed differences in compiler output on the generated tests indicate actual miscompilation rather than acceptable differences in how each language defines the same behavior.
What would settle it
A concrete test case generated by CrossLangFuzzer on which two compilers produce different results, yet both results are shown to match the official language specifications for the source languages involved.
Original abstract
Modern JVM software increasingly integrates multiple programming languages, such as Java, Kotlin, Groovy, and Scala, within a single application. Supporting such interoperability requires JVM compilers to perform cross-language compilation while reconciling subtle semantic differences across language boundaries. Errors in this process can lead to critical miscompilations, yet existing compiler testing techniques focus exclusively on isolated, single-language compilation. To address this gap, we present CrossLangFuzzer, the first differential testing framework for cross-language JVM compilation. CrossLangFuzzer leverages the Kotlin compiler's unified intermediate representation (IR) to synthesize cross-language test programs. It further applies seven mutation operators to diversify generated test programs and improve bug-finding capability. Evaluated on the latest versions of five major JVM compilers, CrossLangFuzzer uncovered 32 confirmed bugs, including 15 in Kotlin, 4 in Groovy, 7 in Scala 3, 2 in Scala 2, and 4 in Java. CrossLangFuzzer is open-source at https://github.com/XYZboom/CrossLangFuzzer
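As a quick consistency check on the abstract's numbers, the per-compiler bug counts can be summed and compared against the stated total. The counts below are copied from the abstract; the variable names are illustrative only.

```python
# Per-compiler confirmed-bug counts as stated in the abstract.
bugs_per_compiler = {
    "Kotlin": 15,
    "Groovy": 4,
    "Scala 3": 7,
    "Scala 2": 2,
    "Java": 4,
}

total = sum(bugs_per_compiler.values())
# The breakdown should add up to the headline figure of 32 confirmed bugs.
assert total == 32
```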
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents CrossLangFuzzer, the first differential testing framework for cross-language JVM compilers. It leverages the Kotlin compiler's unified IR to synthesize cross-language test programs and applies seven mutation operators for diversification. Evaluation on the latest versions of five major JVM compilers (Kotlin, Groovy, Scala 3, Scala 2, and Java) reports uncovering 32 confirmed bugs.
Significance. If the reported bugs are robustly validated as miscompilations rather than acceptable semantic variations, the work addresses a clear gap in existing single-language compiler testing techniques for multi-language JVM applications. The open-source release at the provided GitHub link supports reproducibility and independent verification of the generated tests and mutation operators.
major comments (1)
- [Evaluation] Evaluation section: The central claim of 32 confirmed bugs is load-bearing, yet the manuscript provides no explicit description of the confirmation process, criteria for distinguishing miscompilation from implementation-defined behavior across language boundaries, or measured false-positive rate. This detail is necessary to substantiate that observed differences reliably indicate bugs.
minor comments (1)
- [Abstract] The abstract and introduction could more clearly state the exact versions of the five compilers tested and the total number of test programs generated/executed to provide context for the bug count.
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive feedback. We address the major comment below and will incorporate the suggested clarifications into the revised manuscript.
- Referee: [Evaluation] Evaluation section: The central claim of 32 confirmed bugs is load-bearing, yet the manuscript provides no explicit description of the confirmation process, criteria for distinguishing miscompilation from implementation-defined behavior across language boundaries, or measured false-positive rate. This detail is necessary to substantiate that observed differences reliably indicate bugs.
  Authors: We agree that an explicit description of the bug confirmation process is necessary to support the central claim. In the revised version, we will add a dedicated subsection (likely 5.3 or similar) under Evaluation that details: (1) the multi-stage confirmation workflow (automated differential execution followed by manual inspection of bytecode and runtime behavior), (2) the specific criteria used to classify a difference as a miscompilation versus an acceptable implementation-defined or language-specific semantic variation (e.g., requiring the difference to violate documented JVM or language semantics and to be reproducible across multiple runs), and (3) the observed false-positive rate from our internal validation (including how many candidate differences were discarded). This addition will directly address the concern without altering the reported bug count.
  Revision: yes
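The classification criteria and false-positive rate mentioned in the rebuttal could be made concrete along the following lines. This is a hedged sketch under stated assumptions: the `Candidate` record, its two boolean fields, and the toy data are hypothetical and do not come from the paper, and the rate is taken here to mean the fraction of candidate differences discarded during triage.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One divergence flagged by automated differential execution (hypothetical record)."""
    test_id: str
    violates_documented_semantics: bool  # outcome of manual inspection
    reproducible: bool                   # same divergence across repeated runs


def is_miscompilation(c: Candidate) -> bool:
    # Both criteria named in the rebuttal must hold for a candidate to count.
    return c.violates_documented_semantics and c.reproducible


def false_positive_rate(candidates: List[Candidate]) -> float:
    """Fraction of flagged candidates discarded during triage."""
    if not candidates:
        return 0.0
    discarded = sum(1 for c in candidates if not is_miscompilation(c))
    return discarded / len(candidates)


# Toy data for illustration only; these are not results from the paper.
toy = [
    Candidate("t1", True, True),
    Candidate("t2", False, True),   # acceptable semantic variation
    Candidate("t3", True, False),   # not reproducible
    Candidate("t4", True, True),
]
rate = false_positive_rate(toy)  # 2 of the 4 toy candidates are discarded
```

Reporting this rate alongside the bug count would let readers judge how much manual filtering stands between a raw divergence and a confirmed bug.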
Circularity Check
No significant circularity detected
The paper describes an empirical differential testing tool (CrossLangFuzzer) that synthesizes cross-language programs via Kotlin IR and mutation operators, then executes them to surface behavioral differences across JVM compilers. Results consist of 32 externally confirmed bugs; no equations, fitted parameters, predictions, or first-principles derivations are present. All load-bearing claims rest on observable test execution and independent validation rather than any self-referential reduction or self-citation chain, rendering the work self-contained.