Adversarial Agent Collaboration for Correctness Improvements of C to Safe Rust Translation
Pith reviewed 2026-05-21 20:19 UTC · model grok-4.3
The pith
Adversarial LLM agents raise C-to-Rust translation test pass rates above 90 percent on real utilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that an iterative adversarial loop between a translator agent and a discriminator agent can close the correctness gap on unseen inputs for C-to-Rust translations. The translator refines the Rust version against an existing test suite, after which the discriminator searches for new failing cases by refining a differential fuzzer; the new cases are returned to the translator for the next iteration. This process produces Rust translations that pass over 90 percent of tests on 63 utilities averaging 473 lines of code.
What carries the argument
The ACToR adversarial agent loop consisting of a translator agent that synthesizes Rust code and a discriminator agent that constructs differential fuzzers to expose behavioral divergences.
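The alternation between the two agents can be sketched as follows. This is a minimal illustration, not the paper's implementation: `adversarial_loop`, `toy_translate`, `toy_discriminate`, the `max_iterations` cap, and the stop-when-nothing-new rule are all assumptions made for the example, and the toy agents stand in for LLM calls.

```python
from typing import Callable, List, Tuple

TestCase = str  # hypothetical representation of one program input


def adversarial_loop(
    translate: Callable[[List[TestCase]], str],
    discriminate: Callable[[str, List[TestCase]], List[TestCase]],
    initial_tests: List[TestCase],
    max_iterations: int = 5,  # assumed budget, not taken from the paper
) -> Tuple[str, List[TestCase]]:
    """Alternate translator and discriminator until no new failing test appears."""
    tests = list(initial_tests)
    rust_code = translate(tests)  # translator refines against the current suite
    for _ in range(max_iterations):
        new_failures = discriminate(rust_code, tests)  # search for divergences
        if not new_failures:
            break
        tests.extend(new_failures)    # feed the new cases back
        rust_code = translate(tests)  # next refinement round
    return rust_code, tests


# Toy stand-ins: the "translation" is just the list of inputs it handles.
def toy_translate(tests: List[TestCase]) -> str:
    return ",".join(sorted(tests))


def toy_discriminate(rust_code: str, tests: List[TestCase]) -> List[TestCase]:
    pool = ["a", "b", "c", "d"]  # hypothetical input space
    handled = set(rust_code.split(",")) if rust_code else set()
    return [x for x in pool if x not in handled][:1]


if __name__ == "__main__":
    print(adversarial_loop(toy_translate, toy_discriminate, ["a"]))
```

With the toy agents, each round surfaces one unhandled input until the pool is exhausted, at which point the loop stops early.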
If this is right
- ACToR reaches over 90 percent test pass rate with zero human intervention on 63 C utilities.
- It improves correctness by as much as 36.7 percent compared to a coverage-driven baseline.
- The gains persist across seven different LLM and agent configurations.
- Layering ACToR on existing translators such as C2SaferRust adds a further 16.6 percent to the validation pass rate.
Where Pith is reading between the lines
- The differential fuzzer approach could be adapted to improve automatic translations between other language pairs.
- The method may scale to larger codebases if the fuzzer construction remains effective.
- Combining this loop with static analysis tools might detect more subtle semantic differences.
Load-bearing premise
That the discriminator agent can consistently construct and refine a differential fuzzer capable of surfacing inputs that expose translation divergences missed by the initial test suite, leading to stable improvements when fed back to the translator.
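To make the premise concrete, a differential fuzzer in its simplest form feeds the same random input to both programs and records any input on which they disagree. The sketch below assumes the programs read from stdin and are compared on exit code and stdout; `run`, `find_divergences`, the input alphabet, and the two stand-in commands are hypothetical, and the paper's discriminator refines a fuzzer per utility rather than using a fixed generator like this one.

```python
import random
import subprocess
import sys
from typing import List, Sequence, Tuple


def run(cmd: Sequence[str], data: bytes) -> Tuple[int, bytes]:
    """Run one program with `data` on stdin; return (exit code, stdout)."""
    proc = subprocess.run(cmd, input=data, capture_output=True, timeout=10)
    return proc.returncode, proc.stdout


def find_divergences(
    c_cmd: Sequence[str],
    rust_cmd: Sequence[str],
    trials: int = 20,
    seed: int = 0,
) -> List[bytes]:
    """Return the random inputs on which the two programs behave differently."""
    rng = random.Random(seed)
    divergent = []
    for _ in range(trials):
        length = rng.randint(0, 16)
        # Small assumed alphabet; a real fuzzer would be tailored to the utility.
        data = bytes(rng.choice(b"ab \n") for _ in range(length))
        if run(c_cmd, data) != run(rust_cmd, data):
            divergent.append(data)
    return divergent


# Stand-ins for the compiled C and Rust binaries (hypothetical):
# the "reference" upper-cases stdin, the "translation" also drops spaces.
REFERENCE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
BUGGY = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper().replace(' ', ''))"]

if __name__ == "__main__":
    for case in find_divergences(REFERENCE, BUGGY):
        print(repr(case))
```

Each divergent input found this way is a candidate new test case to hand back to the translator.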
What would settle it
If the same procedure, applied to a new set of C programs, yielded a test pass rate below 80 percent or an improvement over the baseline below 10 percent, that would indicate the claim does not hold.
Original abstract
Translating C to memory-safe languages, like Rust, prevents critical memory safety vulnerabilities that are prevalent in legacy C software. Even with recent LLM-based and tool-augmented translators, the resulting Rust code frequently diverges from the C source on inputs absent from the test suite used during translation; this correctness gap on unseen inputs remains a dominant obstacle to reliable, automatic C-to-Rust translation. In this work, we present ACToR (Adversarial C To Rust), a simple LLM-agent loop that closes this gap by adversarially searching for inputs on which the translation diverges from the C source, and using them to drive subsequent refinements. Inspired by GANs, ACToR pits a translator agent against a discriminator agent that collaborate to iteratively refine the Rust translation. On each iteration, the translator agent synthesizes and refines a Rust translation to pass an existing suite of tests, and then the discriminator agent finds new failing tests by constructing and refining a differential fuzzer over the C and Rust binaries. Across 63 real-world command-line C utilities, with an average size of 473 lines of code and the longest reaching thousands of lines in size, ACToR achieves over 90% test pass rate with zero human intervention. The improvement holds across seven agent-LLM configurations on our micro-benchmark, indicating that the loop is largely independent of the choice of underlying translator and LLM. Compared to a non-adversarial, coverage-driven test-generation baseline, ACToR improves correctness by up to 36.7%. When applied on top of one recent translator, C2SaferRust, ACToR further improves the validation pass rate by 16.6%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ACToR, an LLM-agent adversarial loop inspired by GANs in which a translator agent iteratively refines a C-to-Rust translation to pass an initial test suite while a discriminator agent constructs and refines a differential fuzzer to surface inputs that expose divergences between the C source and the generated Rust binary. The loop is evaluated on 63 real-world command-line C utilities (average 473 LOC), reporting >90% final test pass rates, up to 36.7% improvement over a non-adversarial coverage-driven baseline, and an additional 16.6% gain when layered on C2SaferRust; the gains are stated to hold across seven LLM configurations.
Significance. If the central empirical claims hold, the work offers a practical, largely parameter-free method for closing the correctness gap on unseen inputs that currently limits reliable automatic translation of legacy C code to memory-safe Rust. The evaluation on programs of realistic size and the demonstration of consistent gains across multiple translator/LLM back-ends are notable strengths. The approach also grounds correctness externally, in the observed behavior of the original C programs, rather than in quantities fitted to the same data.
major comments (2)
- [Abstract and §3] Abstract and the ACToR loop description (§3): the central claim is that ACToR produces translations correct on inputs absent from the initial suite. The iterative process refines the translator on discriminator-found failing inputs and then reports >90% pass rate and up to 36.7% improvement. It is not stated whether these final metrics are computed on a held-out validation set disjoint from all inputs generated or used during refinement. If the reported numbers are measured on the union of the initial suite and the loop-generated tests, the results demonstrate only that the translator can be prompted to pass tests it has already seen, not that it generalizes to truly unseen inputs. This distinction is load-bearing for the stated contribution.
- [Evaluation] Evaluation section: the manuscript reports concrete pass-rate gains on 63 programs but provides no details on statistical significance testing, variance across multiple runs, or controls for selection effects in the chosen utilities and test-suite construction. Given the stochastic nature of LLM agents, these omissions weaken confidence that the observed improvements are robust rather than artifacts of particular random seeds or program selection.
minor comments (2)
- [§3] The description of how the differential fuzzer is constructed and refined could be expanded with pseudocode or a concrete example to make the discriminator agent's operation reproducible.
- [Tables/Figures] Table or figure captions should explicitly state whether pass rates are measured on the refinement set or on a separate held-out set; this would immediately clarify the evaluation protocol.
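The distinction the referee draws between the refinement set and a held-out set can be made explicit in the evaluation code. The sketch below is an assumption about how such a report could be computed, not the paper's protocol; `pass_rate`, `report`, and the toy inputs are hypothetical.

```python
from typing import Callable, Dict, Iterable, Set


def pass_rate(passes: Callable[[str], bool], inputs: Iterable[str]) -> float:
    """Fraction of inputs on which the translation matches the C program."""
    items = list(inputs)
    if not items:
        raise ValueError("cannot compute a pass rate over an empty set")
    return sum(1 for x in items if passes(x)) / len(items)


def report(
    passes: Callable[[str], bool],
    initial_suite: Set[str],
    loop_generated: Set[str],
    candidate_holdout: Set[str],
) -> Dict[str, float]:
    """Report pass rates separately for seen inputs and truly unseen ones."""
    seen = initial_suite | loop_generated
    held_out = candidate_holdout - seen  # enforce disjointness from refinement
    return {
        "refinement_set": pass_rate(passes, seen),
        "held_out": pass_rate(passes, held_out),
    }


if __name__ == "__main__":
    # Toy oracle: the translation passes everything it was refined on, plus "h1".
    ok = {"t1", "t2", "g1", "h1"}
    print(report(lambda x: x in ok, {"t1", "t2"}, {"g1"}, {"g1", "h1", "h2"}))
```

In the toy run the refinement-set rate is 1.0 while the held-out rate is 0.5, which is exactly the gap a caption should let a reader see.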
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the practical strengths of ACToR. We address each major comment below with honest clarifications based on the current manuscript and describe the revisions we will implement.
- Referee: [Abstract and §3] Abstract and the ACToR loop description (§3): the central claim is that ACToR produces translations correct on inputs absent from the initial suite. The iterative process refines the translator on discriminator-found failing inputs and then reports >90% pass rate and up to 36.7% improvement. It is not stated whether these final metrics are computed on a held-out validation set disjoint from all inputs generated or used during refinement. If the reported numbers are measured on the union of the initial suite and the loop-generated tests, the results demonstrate only that the translator can be prompted to pass tests it has already seen, not that it generalizes to truly unseen inputs. This distinction is load-bearing for the stated contribution.
  Authors: We agree that the distinction between refinement on seen inputs and generalization to never-seen inputs is critical. The current manuscript computes final pass rates on the union of the initial test suite and the inputs discovered and used for refinement by the discriminator agent. The discriminator does generate new inputs absent from the initial suite, and the reported improvements (including the 36.7% gain over the coverage-driven baseline) reflect the translator's ability to be refined to pass these newly surfaced cases. However, we acknowledge that this does not constitute evaluation on inputs completely disjoint from the entire adversarial process. We will revise §4 (Evaluation) to add a held-out validation set of inputs never used in any refinement iteration or initial suite, and report pass rates on this set to better support the generalization claim. This will be presented as an additional experiment.
  revision: yes
- Referee: [Evaluation] Evaluation section: the manuscript reports concrete pass-rate gains on 63 programs but provides no details on statistical significance testing, variance across multiple runs, or controls for selection effects in the chosen utilities and test-suite construction. Given the stochastic nature of LLM agents, these omissions weaken confidence that the observed improvements are robust rather than artifacts of particular random seeds or program selection.
  Authors: We concur that the stochastic nature of LLM agents makes variance and significance testing important for robustness. The reported results are from single runs per configuration, selected for computational practicality given LLM API costs. We will revise the Evaluation section to include: results aggregated over multiple independent runs (with means and standard deviations), statistical significance tests (e.g., Wilcoxon signed-rank tests) against baselines, and an explicit discussion of utility selection criteria (real-world command-line tools from open-source repositories) along with any observed selection effects. These additions will increase confidence in the consistency of gains across the seven LLM configurations.
  revision: yes
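The kind of aggregation promised in the response can be sketched with the standard library alone. Two caveats: the paired test below is an exact sign-flip permutation test, used here as a dependency-free stand-in for the Wilcoxon signed-rank test the authors mention, and every number in the example is made up for illustration and is not a result from the paper. The names `summarize` and `paired_sign_flip_p_value` are hypothetical.

```python
from itertools import product
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(pass_rates: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over independent runs."""
    return mean(pass_rates), stdev(pass_rates)


def paired_sign_flip_p_value(
    treatment: Sequence[float], baseline: Sequence[float]
) -> float:
    """Exact two-sided sign-flip permutation test on paired differences.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    if len(treatment) != len(baseline):
        raise ValueError("paired samples must have equal length")
    diffs = [t - b for t, b in zip(treatment, baseline)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / total


if __name__ == "__main__":
    # Hypothetical per-program pass rates, NOT taken from the paper.
    adversarial = [0.90, 0.95, 0.92, 0.88, 0.97]
    coverage = [0.70, 0.80, 0.75, 0.60, 0.90]
    print(summarize(adversarial))
    print(paired_sign_flip_p_value(adversarial, coverage))
```

With only five pairs the smallest attainable two-sided p-value is 2/32, which illustrates why a handful of programs or runs cannot support strong significance claims.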
Circularity Check
No significant circularity detected in empirical claims or method
The paper describes an empirical LLM-agent loop for refining C-to-Rust translations via adversarial test generation and iterative fixes. Reported outcomes (e.g., >90% pass rates on 63 utilities and up to 36.7% gains over baseline) rest on external program evaluations and real-world test suites rather than any closed-form derivation, fitted parameter, or self-referential definition. No equations, uniqueness theorems, or ansatzes are invoked that reduce the central result to its own inputs by construction. The work contains no mathematical prediction chain; correctness improvements are measured against program behavior on provided utilities, making the findings self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Current LLMs can act as effective translator and discriminator agents in an iterative adversarial loop without human intervention.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Inspired by GANs, ACToR pits a translator agent against a discriminator agent that collaborate to iteratively refine the Rust translation... the discriminator agent finds new failing tests by constructing and refining a differential fuzzer"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "ACToR achieves over 90% test pass rate with zero human intervention and improves correctness by up to 36.7% over a non-adversarial coverage-driven baseline"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories
  ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 11...
- ORBIT: Guided Agentic Orchestration for Autonomous C-to-Rust Transpilation
  ORBIT achieves 100% compilation success and 91.7% test success on 24 mostly large programs from CRUST-Bench by using dependency-aware orchestration and iterative verification, outperforming prior static and baseline tools.
- SafeTrans: LLM-assisted Transpilation from C to Rust
  SafeTrans achieves up to 80% successful C-to-Rust translations via LLM iterative repair on 2653 programs and two real projects, with some C vulnerabilities carrying over to the Rust output.