
arxiv: 2510.03879 · v3 · pith:UCL54BUJnew · submitted 2025-10-04 · 💻 cs.SE · cs.AI

Adversarial Agent Collaboration for Correctness Improvements of C to Safe Rust Translation

Pith reviewed 2026-05-21 20:19 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords: C to Rust, LLM agents, adversarial learning, differential fuzzing, memory safety, code translation, automatic verification

The pith

Adversarial LLM agents raise C-to-Rust translation test pass rates above 90 percent on real utilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces ACToR, an adversarial collaboration between LLM agents to refine translations from C to memory-safe Rust. A translator agent generates Rust code that passes known tests while a discriminator agent builds a differential fuzzer to locate inputs where the C original and Rust translation diverge. These new inputs drive further refinements in subsequent rounds. The result is over 90 percent of tests passing across 63 real command-line C programs with no human involvement, and up to 36.7 percent higher correctness than a non-adversarial coverage baseline.

Core claim

The central discovery is that an iterative adversarial loop between a translator agent and a discriminator agent can close the correctness gap on unseen inputs for C-to-Rust translations. The translator refines the Rust version against an existing test suite, after which the discriminator searches for new failing cases by refining a differential fuzzer; the new cases are returned to the translator for the next iteration. This process produces Rust translations that pass over 90 percent of tests on 63 utilities averaging 473 lines of code.
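The alternation described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `c_reference`, `ToyTranslation`, `discriminate`, and `adversarial_loop` are hypothetical names, the "C program" is just absolute value, and the "repair" is a flag flip standing in for an LLM agent editing Rust source.

```python
import random
from typing import List, Tuple


def c_reference(x: int) -> int:
    """Toy stand-in for the original C program: absolute value."""
    return abs(x)


class ToyTranslation:
    """Toy stand-in for the Rust translation plus the translator agent."""

    def __init__(self) -> None:
        self.handles_negatives = False  # the initial translation has a bug

    def run(self, x: int) -> int:
        if x < 0 and not self.handles_negatives:
            return x  # bug: the sign is not flipped
        return abs(x)

    def refine(self, tests: List[int]) -> None:
        # A real translator agent would edit Rust source until the suite
        # passes; this toy "repair" only fires when a test actually fails.
        if any(self.run(t) != c_reference(t) for t in tests):
            self.handles_negatives = True


def discriminate(translation: ToyTranslation, rng: random.Random,
                 budget: int = 200, want: int = 3) -> List[int]:
    """Random search for inputs where translation and reference diverge."""
    found: List[int] = []
    for _ in range(budget):
        x = rng.randint(-50, 50)
        if x not in found and translation.run(x) != c_reference(x):
            found.append(x)
            if len(found) == want:
                break
    return found


def adversarial_loop(initial_tests: List[int], max_iterations: int = 5,
                     seed: int = 0) -> Tuple[ToyTranslation, List[int]]:
    rng = random.Random(seed)
    translation = ToyTranslation()
    tests = list(initial_tests)
    for _ in range(max_iterations):
        translation.refine(tests)                   # translator turn
        new_tests = discriminate(translation, rng)  # discriminator turn
        if not new_tests:
            break                                   # no divergence found
        tests.extend(new_tests)                     # feed failures back
    return translation, tests
```

Starting from a suite with only non-negative inputs, the first translator turn sees nothing to fix. The bug is repaired only after the discriminator adds negative inputs to the suite, which is the point of the loop.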

What carries the argument

The ACToR adversarial agent loop consisting of a translator agent that synthesizes Rust code and a discriminator agent that constructs differential fuzzers to expose behavioral divergences.
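A differential fuzzer of the kind the discriminator builds can be approximated as follows. This is a minimal sketch under stated assumptions: the two commands are toy Python one-liners standing in for the compiled C and Rust binaries, and `REFERENCE_CMD`, `CANDIDATE_CMD`, `observe`, `random_input`, and `differential_fuzz` are hypothetical names. The paper's fuzzers are written and refined by an agent and are likely more targeted than uniform random bytes.

```python
import random
import subprocess
import sys
from typing import List, Sequence, Tuple

# Toy stand-ins for the two binaries (assumptions, not the paper's programs).
# The reference upper-cases stdin; the candidate silently truncates to 8 chars.
REFERENCE_CMD = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
CANDIDATE_CMD = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read()[:8].upper())"]


def observe(cmd: Sequence[str], data: bytes) -> Tuple[int, bytes]:
    """Run one program on one input and record its observable behavior."""
    proc = subprocess.run(cmd, input=data, stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL, timeout=10)
    return proc.returncode, proc.stdout


def random_input(rng: random.Random, max_len: int = 16) -> bytes:
    """Generate a short ASCII input; a real fuzzer would be program-aware."""
    alphabet = b"abcxyz 019"
    return bytes(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))


def differential_fuzz(reference: Sequence[str], candidate: Sequence[str],
                      trials: int = 20, seed: int = 0) -> List[bytes]:
    """Return inputs on which exit code or stdout differ between programs."""
    rng = random.Random(seed)
    diverging: List[bytes] = []
    for _ in range(trials):
        data = random_input(rng)
        if observe(reference, data) != observe(candidate, data):
            diverging.append(data)
    return diverging
```

Comparing the pair (exit code, stdout) is one possible notion of behavioral equivalence for command-line utilities. Whether stderr or file-system effects should also be compared depends on the program under test.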

If this is right

  • ACToR reaches over 90 percent test pass rate with zero human intervention on 63 C utilities.
  • It improves correctness by as much as 36.7 percent compared to a coverage-driven baseline.
  • The gains persist across seven different LLM and agent configurations.
  • Layering ACToR on existing translators such as C2SaferRust adds a further 16.6 percent to the validation pass rate.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The differential fuzzer approach could be adapted to improve automatic translations between other language pairs.
  • The method may scale to larger codebases if the fuzzer construction remains effective.
  • Combining this loop with static analysis tools might detect more subtle semantic differences.

Load-bearing premise

That the discriminator agent can consistently construct and refine a differential fuzzer capable of surfacing inputs that expose translation divergences missed by the initial test suite, leading to stable improvements when fed back to the translator.

What would settle it

If the same procedure, applied to a new set of C programs, yields a test pass rate below 80 percent or an improvement over the baseline of less than 10 percent, the claim does not hold.
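The two thresholds can be written as a simple check. This sketch assumes pass rates are fractions in [0, 1] and reads "improvement" as an absolute difference in pass rate (percentage points). The text does not say whether a relative improvement is meant, so that reading is an assumption, and `ReplicationResult` and `claim_survives` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationResult:
    actor_pass_rate: float     # fraction of tests passed with ACToR
    baseline_pass_rate: float  # fraction of tests passed by the baseline


def claim_survives(result: ReplicationResult,
                   min_pass_rate: float = 0.80,
                   min_gain: float = 0.10) -> bool:
    """True if neither falsification threshold is crossed.

    Assumption: gain is measured in absolute pass-rate points.
    """
    gain = result.actor_pass_rate - result.baseline_pass_rate
    return result.actor_pass_rate >= min_pass_rate and gain >= min_gain
```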

Figures

Figures reproduced from arXiv: 2510.03879 by Bo Wang, Brandon Paulsen, Prateek Saxena, Ruishi Li, Tianyu Li, Umang Mathur.

Figure 1: High-level overview of ACToR. The Translator and the Discriminator agents update the translation and the tests in turn to iteratively improve the correctness of the translated program.
Figure 2: Overall correctness (pass rate) achieved by …
Figure 3: The relative comparison among 3 translation …
Figure 4: The validation pass rate of ACToR on different configurations. Pass rate versus the number of new test cases added per iteration (Left). Pass rate versus the number of iterations (Right).
Figure 5: The relative pass rate when cross-comparing …
Figure 6: The task prompt for the translator agent.
Figure 7: The task prompt for the test generator agent of the coverage baseline.
Figure 8: The task prompt for the discriminator agent of ACToR.
Figure 9: The task prompt for eliminating unsafe blocks.
Figure 10: The relative pass rate when cross-comparing …
Original abstract

Translating C to memory-safe languages, like Rust, prevents critical memory safety vulnerabilities that are prevalent in legacy C software. Even with recent LLM-based and tool-augmented translators, the resulting Rust code frequently diverges from the C source on inputs absent from the test suite used during translation; this correctness gap on unseen inputs remains a dominant obstacle to reliable, automatic C-to-Rust translation. In this work, we present ACToR (Adversarial C To Rust), a simple LLM-agent loop that closes this gap by adversarially searching for inputs on which the translation diverges from the C source, and using them to drive subsequent refinements. Inspired by GANs, ACToR pits a translator agent against a discriminator agent that collaborate to iteratively refine the Rust translation. On each iteration, the translator agent synthesizes and refines a Rust translation to pass an existing suite of tests, and then the discriminator agent finds new failing tests by constructing and refining a differential fuzzer over the C and Rust binaries. Across 63 real-world command-line C utilities, with an average size of 473 lines of code and the longest reaching thousands of lines in size, ACToR achieves over 90% test pass rate with zero human intervention. The improvement holds across seven agent-LLM configurations on our micro-benchmark, indicating that the loop is largely independent of the choice of underlying translator and LLM. Compared to a non-adversarial, coverage-driven test-generation baseline, ACToR improves correctness by up to 36.7%. When applied on top of one recent translator, C2SaferRust, ACToR further improves the validation pass rate by 16.6%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ACToR, an LLM-agent adversarial loop inspired by GANs in which a translator agent iteratively refines a C-to-Rust translation to pass an initial test suite while a discriminator agent constructs and refines a differential fuzzer to surface inputs that expose divergences between the C source and the generated Rust binary. The loop is evaluated on 63 real-world command-line C utilities (average 473 LOC), reporting >90% final test pass rates, up to 36.7% improvement over a non-adversarial coverage-driven baseline, and an additional 16.6% gain when layered on C2SaferRust; the gains are stated to hold across seven LLM configurations.

Significance. If the central empirical claims hold, the work offers a practical, largely parameter-free method for closing the correctness gap on unseen inputs that currently limits reliable automatic translation of legacy C code to memory-safe Rust. The evaluation on programs of realistic size and the demonstration of consistent gains across multiple translator/LLM back-ends are notable strengths; the approach also supplies an external, program-based grounding rather than quantities fitted to the same data.

major comments (2)
  1. [Abstract and §3] Abstract and the ACToR loop description (§3): the central claim is that ACToR produces translations correct on inputs absent from the initial suite. The iterative process refines the translator on discriminator-found failing inputs and then reports >90% pass rate and up to 36.7% improvement. It is not stated whether these final metrics are computed on a held-out validation set disjoint from all inputs generated or used during refinement. If the reported numbers are measured on the union of the initial suite and the loop-generated tests, the results demonstrate only that the translator can be prompted to pass tests it has already seen, not that it generalizes to truly unseen inputs. This distinction is load-bearing for the stated contribution.
  2. [Evaluation] Evaluation section: the manuscript reports concrete pass-rate gains on 63 programs but provides no details on statistical significance testing, variance across multiple runs, or controls for selection effects in the chosen utilities and test-suite construction. Given the stochastic nature of LLM agents, these omissions weaken confidence that the observed improvements are robust rather than artifacts of particular random seeds or program selection.
minor comments (2)
  1. [§3] The description of how the differential fuzzer is constructed and refined could be expanded with pseudocode or a concrete example to make the discriminator agent's operation reproducible.
  2. [Tables/Figures] Table or figure captions should explicitly state whether pass rates are measured on the refinement set or on a separate held-out set; this would immediately clarify the evaluation protocol.
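The evaluation protocol raised in the first major comment can be made concrete with a short sketch. It assumes programs are modeled as functions from an input string to an output string; `pass_rate` and `evaluate_with_holdout` are hypothetical helpers, not part of the paper's tooling. The key step is refusing to report a held-out number if any held-out input was also used during refinement.

```python
from typing import Callable, Iterable, Set, Tuple

Program = Callable[[str], str]


def pass_rate(translation: Program, reference: Program,
              inputs: Iterable[str]) -> float:
    """Fraction of inputs on which the translation matches the reference."""
    cases = list(inputs)
    if not cases:
        raise ValueError("pass rate is undefined on an empty input set")
    passed = sum(1 for x in cases if translation(x) == reference(x))
    return passed / len(cases)


def evaluate_with_holdout(translation: Program, reference: Program,
                          refinement_inputs: Set[str],
                          heldout_inputs: Set[str]) -> Tuple[float, float]:
    """Return (pass rate on refinement inputs, pass rate on held-out inputs)."""
    overlap = refinement_inputs & heldout_inputs
    if overlap:
        raise ValueError(f"held-out set leaks {len(overlap)} refinement inputs")
    return (pass_rate(translation, reference, refinement_inputs),
            pass_rate(translation, reference, heldout_inputs))
```

A translation that merely memorizes the refinement inputs scores perfectly on the first number and poorly on the second, which is the gap the referee asks the authors to rule out.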

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical strengths of ACToR. We address each major comment below with honest clarifications based on the current manuscript and describe the revisions we will implement.

  1. Referee: [Abstract and §3] Abstract and the ACToR loop description (§3): the central claim is that ACToR produces translations correct on inputs absent from the initial suite. The iterative process refines the translator on discriminator-found failing inputs and then reports >90% pass rate and up to 36.7% improvement. It is not stated whether these final metrics are computed on a held-out validation set disjoint from all inputs generated or used during refinement. If the reported numbers are measured on the union of the initial suite and the loop-generated tests, the results demonstrate only that the translator can be prompted to pass tests it has already seen, not that it generalizes to truly unseen inputs. This distinction is load-bearing for the stated contribution.

    Authors: We agree that the distinction between refinement on seen inputs and generalization to never-seen inputs is critical. The current manuscript computes final pass rates on the union of the initial test suite and the inputs discovered and used for refinement by the discriminator agent. The discriminator does generate new inputs absent from the initial suite, and the reported improvements (including the 36.7% gain over the coverage-driven baseline) reflect the translator's ability to be refined to pass these newly surfaced cases. However, we acknowledge that this does not constitute evaluation on inputs completely disjoint from the entire adversarial process. We will revise §4 (Evaluation) to add a held-out validation set of inputs never used in any refinement iteration or initial suite, and report pass rates on this set to better support the generalization claim. This will be presented as an additional experiment. revision: yes

  2. Referee: [Evaluation] Evaluation section: the manuscript reports concrete pass-rate gains on 63 programs but provides no details on statistical significance testing, variance across multiple runs, or controls for selection effects in the chosen utilities and test-suite construction. Given the stochastic nature of LLM agents, these omissions weaken confidence that the observed improvements are robust rather than artifacts of particular random seeds or program selection.

    Authors: We concur that the stochastic nature of LLM agents makes variance and significance testing important for robustness. The reported results are from single runs per configuration, selected for computational practicality given LLM API costs. We will revise the Evaluation section to include: results aggregated over multiple independent runs (with means and standard deviations), statistical significance tests (e.g., Wilcoxon signed-rank tests) against baselines, and an explicit discussion of utility selection criteria (real-world command-line tools from open-source repositories) along with any observed selection effects. These additions will increase confidence in the consistency of gains across the seven LLM configurations. revision: yes
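The statistics promised in this response can be sketched with the standard library alone. The code below is an illustration under assumptions, not the authors' analysis: `summarize` and `wilcoxon_signed_rank_exact` are hypothetical names, and the exact test enumerates all sign assignments, so it is only practical for roughly 20 or fewer paired programs. For a benchmark of 63 programs a library routine such as `scipy.stats.wilcoxon` would be the more practical choice.

```python
from itertools import product
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def summarize(runs: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over independent runs."""
    return mean(runs), stdev(runs)


def wilcoxon_signed_rank_exact(a: Sequence[float],
                               b: Sequence[float]) -> float:
    """Two-sided exact p-value of the Wilcoxon signed-rank test.

    Zero differences are dropped and tied magnitudes get average ranks.
    Enumerates 2**n sign patterns, so keep n small.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks: List[float] = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = abs(w_plus - total / 2)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            extreme += 1
    return extreme / 2 ** n
```

For six paired programs with distinct differences where one system wins every pair, only the two all-same-sign patterns are as extreme as the observation, so the function returns 2/64.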

Circularity Check

0 steps flagged

No significant circularity detected in empirical claims or method

The paper describes an empirical LLM-agent loop for refining C-to-Rust translations via adversarial test generation and iterative fixes. Reported outcomes (e.g., >90% pass rates on 63 utilities and up to 36.7% gains over baseline) rest on external program evaluations and real-world test suites rather than any closed-form derivation, fitted parameter, or self-referential definition. No equations, uniqueness theorems, or ansatzes are invoked that reduce the central result to its own inputs by construction. The work contains no mathematical prediction chain; correctness improvements are measured against program behavior on provided utilities, making the findings self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method rests on the empirical effectiveness of current LLMs as translator and discriminator agents rather than on new mathematical constructs or fitted parameters.

axioms (1)
  • domain assumption: Current LLMs can act as effective translator and discriminator agents in an iterative adversarial loop without human intervention.
    The entire refinement process depends on the agents successfully synthesizing, refining, and using differential tests.

pith-pipeline@v0.9.0 · 5854 in / 1252 out tokens · 32307 ms · 2026-05-21T20:19:31.637052+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories

    cs.SE 2026-04 unverdicted novelty 7.0

    ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 11...

  2. ORBIT: Guided Agentic Orchestration for Autonomous C-to-Rust Transpilation

    cs.SE 2026-04 unverdicted novelty 6.0

    ORBIT achieves 100% compilation success and 91.7% test success on 24 mostly large programs from CRUST-Bench by using dependency-aware orchestration and iterative verification, outperforming prior static and baseline tools.

  3. SafeTrans: LLM-assisted Transpilation from C to Rust

    cs.CR 2025-05 accept novelty 6.0

    SafeTrans achieves up to 80% successful C-to-Rust translations via LLM iterative repair on 2653 programs and two real projects, with some C vulnerabilities carrying over to the Rust output.
