
arxiv: 2605.29054 · v2 · pith:GZO6PVE4new · submitted 2026-05-27 · 💻 cs.SE · cs.CL

Converted, Not Equivalent: Benchmarking Codebase Conversion via Observational Equivalence

Pith reviewed 2026-06-29 10:27 UTC · model grok-4.3

classification 💻 cs.SE cs.CL
keywords codebase conversion · observational equivalence · benchmark · code agents · semantic contracts · verification stages · T2J-Bench

The pith

Codebase conversion systems reach only 27 percent true equivalence under a fixed three-stage verifier despite up to 91 percent surface-level passes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces T2J-Bench to treat codebase conversion as transfer under a stable observational-equivalence contract rather than outcome matching. A fixed verifier runs every converted artifact through Spec (interface admissibility), Numeric (outputs, losses, gradients), and Behavioral (short training dynamics) stages in fixed order. Across 355 blind attempts the strongest system still clears the full contract only 26.7-28.9 percent of the time, while every system reports success 66.6-97.8 points higher than the verifier. Token-budget differences produce only modest pass-rate gains. The central finding is that current self-validation routines are misaligned with the contracts users actually need.

Core claim

T2J-Bench reformulates conversion success as observational equivalence checked by a fixed three-stage verifier: Spec confirms interface admissibility, Numeric matches forward outputs, losses, gradients and objective-specific tensors, and Behavioral checks short-horizon training dynamics under fixed seeds. On 355 blind conversions the best system passes the complete contract in only 26.7-28.9 percent of cases even though Spec pass rates reach up to 91.1 percent; a 4.7-fold spread in token budget yields only a 2.2-fold spread in pass rate; and every system overestimates its own success relative to the fixed evaluator by 66.6-97.8 points.
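The overestimation figure is a difference in percentage points between what a system reports about itself and what the fixed verifier accepts. A minimal sketch of that arithmetic follows; the function names and the ten-attempt toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of attempts marked as passing."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return 100.0 * sum(outcomes) / len(outcomes)


def overestimation_points(self_reported: Sequence[bool],
                          verified: Sequence[bool]) -> float:
    """Self-reported pass rate minus verifier pass rate, in percentage points."""
    if len(self_reported) != len(verified):
        raise ValueError("both sequences must cover the same attempts")
    return pass_rate(self_reported) - pass_rate(verified)


# Hypothetical toy data: 10 attempts, illustrative only.
self_reported = [True] * 9 + [False]      # the agent claims 9 of 10 succeeded
verified = [True] * 2 + [False] * 8       # the verifier accepts 2 of 10
gap = overestimation_points(self_reported, verified)
print(f"overestimation: {gap:.1f} points")
```

A gap measured this way is in points, not a ratio, which is why a system with a low verified pass rate can show a gap close to its self-reported rate.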

What carries the argument

T2J-Bench's fixed verifier that enforces observational equivalence through three ordered stages (Spec, Numeric, Behavioral) applied uniformly to every conversion attempt.
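The ordered-stage idea can be sketched in a few lines. Everything below is an illustrative assumption rather than the paper's implementation: the dictionary artifacts, the three toy checks, the tolerances, and the choice to stop at the first failing stage are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A stage check takes (source, converted) artifacts and returns True on pass.
Check = Callable[[dict, dict], bool]


@dataclass
class VerifierResult:
    passed: bool
    failed_stage: Optional[str]
    stages_run: List[str]


def run_fixed_verifier(source: dict, converted: dict,
                       stages: List[Tuple[str, Check]]) -> VerifierResult:
    """Run the stages in their fixed order; stop at the first failure (assumption)."""
    stages_run: List[str] = []
    for name, check in stages:
        stages_run.append(name)
        if not check(source, converted):
            return VerifierResult(False, name, stages_run)
    return VerifierResult(True, None, stages_run)


def spec_check(src: dict, conv: dict) -> bool:
    # Toy stand-in for interface admissibility: signatures must match.
    return src["signature"] == conv["signature"]


def numeric_check(src: dict, conv: dict) -> bool:
    # Toy stand-in for Numeric: compare one loss and one gradient value.
    return (math.isclose(src["loss"], conv["loss"], rel_tol=1e-6)
            and math.isclose(src["grad"], conv["grad"], rel_tol=1e-6))


def behavioral_check(src: dict, conv: dict) -> bool:
    # Toy stand-in for Behavioral: compare a short loss curve step by step.
    return (len(src["curve"]) == len(conv["curve"])
            and all(math.isclose(a, b, rel_tol=1e-4)
                    for a, b in zip(src["curve"], conv["curve"])))


STAGES = [("Spec", spec_check),
          ("Numeric", numeric_check),
          ("Behavioral", behavioral_check)]

# Hypothetical artifacts: same interface and loss, different gradient.
source = {"signature": ("x", "y"), "loss": 0.5, "grad": 2.0,
          "curve": [0.5, 0.4, 0.3]}
converted = {"signature": ("x", "y"), "loss": 0.5, "grad": 1.0,
             "curve": [0.5, 0.4, 0.3]}

result = run_fixed_verifier(source, converted, STAGES)
print(result)
```

In this toy run the converted artifact clears Spec and fails at Numeric, which mirrors the pattern the benchmark reports: high Spec pass rates alongside low full-contract pass rates.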

If this is right

  • Failures arise mainly from contract-misaligned self-validation rather than from token budget or backbone size.
  • Spec-level success is a poor predictor of full equivalence.
  • Larger token budgets yield only modest gains in true pass rate.
  • Current agent self-reports cannot be trusted as evidence of successful conversion.
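One way to read "modest gains" is to turn the two reported spreads (4.7x in token budget, 2.2x in pass rate) into a scaling exponent. The sketch below assumes a power-law relation between budget and pass rate and assumes the two spreads line up across the same systems; neither assumption is stated in the source, so the result is a rough reading aid only.

```python
import math

# Figures reported in the abstract.
budget_ratio = 4.7      # spread in token budget
pass_rate_ratio = 2.2   # spread in overall pass rate

# Assumption (not from the paper): pass_rate is proportional to budget ** k.
# Then k = log(pass_rate_ratio) / log(budget_ratio).
exponent = math.log(pass_rate_ratio) / math.log(budget_ratio)
print(f"implied exponent k = {exponent:.2f}")
```

Under those assumptions the exponent comes out near one half, meaning pass rate would grow roughly with the square root of budget rather than in proportion to it.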

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams that accept agent-converted code on the basis of local checks alone are likely to encounter silent gradient or dynamics mismatches later.
  • An external fixed verifier could serve as a stable yardstick for comparing future code agents on conversion tasks.
  • Training objectives for coding agents may need explicit penalties for over-trusting local validation routines.

Load-bearing premise

The three ordered stages of the fixed verifier accurately capture the semantic contracts users care about in codebase conversion.

What would settle it

A set of converted codebases that pass all three verifier stages yet produce different long-horizon training outcomes or different production behavior on held-out user workloads.
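Such a counterexample could be detected by comparing loss curves beyond the short horizon the Behavioral stage covers. The sketch below is hypothetical: the function name, the tolerances, the five-step horizon, and the synthetic curves are illustrative assumptions, not the benchmark's procedure.

```python
import math
from typing import Optional, Sequence


def first_divergence(reference: Sequence[float], candidate: Sequence[float],
                     rel_tol: float = 1e-3,
                     abs_tol: float = 1e-6) -> Optional[int]:
    """Return the first step where the curves disagree, or None if they agree."""
    if len(reference) != len(candidate):
        raise ValueError("curves must have the same number of steps")
    for step, (ref, cand) in enumerate(zip(reference, candidate)):
        if not math.isclose(ref, cand, rel_tol=rel_tol, abs_tol=abs_tol):
            return step
    return None


# Synthetic curves (illustrative only): identical for 5 steps, then drifting.
reference_curve = [1.0 / (1 + t) for t in range(20)]
candidate_curve = [r if t < 5 else r * (1 + 0.01 * (t - 4))
                   for t, r in enumerate(reference_curve)]

short_horizon = 5  # hypothetical length of a short-horizon check
short_result = first_divergence(reference_curve[:short_horizon],
                                candidate_curve[:short_horizon])
long_result = first_divergence(reference_curve, candidate_curve)
print("short horizon:", short_result, "| full run:", long_result)
```

Here the short-horizon comparison finds no mismatch while the full-length comparison flags step 5, which is the kind of evidence that would challenge the premise that the three stages are sufficient.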

Original abstract

Coding agents increasingly act as codebase-scale collaborators that can assist with codebase conversion, but this progress has exposed a critical weakness: agents often over-trust their own local validation routines and declare success on artifacts that satisfy surface checks while violating the semantic contracts users actually care about. This problem is especially acute in codebase conversion, where prior evaluation is largely outcome-driven and therefore unstable: two implementations can match on a shallow outcome, such as a single forward loss, while diverging in gradients, optimizer behavior, or short-horizon training dynamics. We introduce T2J-Bench, a benchmark for codebase conversion that reformulates conversion as transfer under a fixed equivalence contract. A fixed verifier then compares source and converted codebases through three ordered stages: Spec (interface admissibility), Numeric (forward outputs, losses, gradients, and objective-specific tensors), and Behavioral (short training dynamics under fixed seeds). Across 355 blind conversion attempts, the best system reaches only 26.7–28.9% overall pass rate despite Spec pass rates up to 91.1%; a 4.7x token-budget spread yields only a 2.2x pass-rate spread; and all systems overestimate success by 66.6–97.8 points relative to the fixed evaluator. This suggests that failures stem more from contract-misaligned self-validation than from limited budget or backbone strength.
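The abstract's point that two implementations can match on a single forward loss while diverging in gradients can be shown with a deliberately small example. The two loss functions, the evaluation point, and the finite-difference helper below are hypothetical illustrations, not code from the benchmark.

```python
import math
from typing import Callable


def source_loss(w: float) -> float:
    # Hypothetical "source" objective: analytic gradient is 2 * w.
    return w * w


def converted_loss(w: float) -> float:
    # Hypothetical "converted" objective: gradient is sign(w) away from zero.
    return abs(w)


def numerical_grad(f: Callable[[float], float], w: float,
                   eps: float = 1e-6) -> float:
    """Central finite-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


w0 = 1.0  # a single evaluation point, as in a one-shot forward check

loss_match = math.isclose(source_loss(w0), converted_loss(w0))
grad_source = numerical_grad(source_loss, w0)
grad_converted = numerical_grad(converted_loss, w0)
grad_match = math.isclose(grad_source, grad_converted, rel_tol=1e-4)

print(f"forward loss matches: {loss_match}")
print(f"gradients match:      {grad_match} "
      f"({grad_source:.3f} vs {grad_converted:.3f})")
```

A check that stops at the forward loss would accept this pair, while a gradient comparison rejects it; any optimizer step taken from `w0` would then move the two implementations apart.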

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces T2J-Bench, a benchmark that reformulates codebase conversion as transfer under a fixed observational equivalence contract evaluated by a three-stage verifier (Spec for interface admissibility, Numeric for outputs/gradients/losses, Behavioral for short training dynamics under fixed seeds). Across 355 blind conversion attempts by multiple systems, it reports that the best system achieves only 26.7–28.9% overall pass rate (despite Spec rates up to 91.1%), that a 4.7x token-budget spread produces only a 2.2x pass-rate spread, and that all systems overestimate success by 66.6–97.8 points relative to the fixed verifier, concluding that failures arise primarily from contract-misaligned self-validation rather than budget or model strength.

Significance. If the fixed verifier reliably encodes the semantic contracts that matter to users, the quantitative results on 355 attempts provide a reproducible, outcome-independent evaluation framework that exposes a systematic weakness in current coding-agent self-assessment for conversion tasks. The fixed-verifier design and scale of blind attempts are strengths that could support more stable benchmarking than prior outcome-driven methods.

major comments (2)
  1. [Abstract / §3] Abstract and §3 (verifier description): the central claim that low overall pass rates demonstrate contract-misaligned self-validation rests on the three-stage verifier accurately capturing user semantic contracts, yet the Behavioral stage is limited to short training dynamics under fixed seeds; nothing rules out divergences on longer horizons, different data orders, or optimizer states, which would mean the observed gap does not isolate the claimed cause.
  2. [Abstract] Abstract (355 attempts): the reported pass rates, overestimation figures, and budget-sensitivity results cannot be verified or reproduced because the manuscript provides no details on how the 355 conversion attempts were generated, how the fixed verifier was implemented, or any controls for selection bias in the attempt pool.
minor comments (1)
  1. [Abstract] The abstract states quantitative results without referencing the corresponding tables or figures that contain the per-system breakdowns.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important limitations in our presentation of the verifier's scope and the reproducibility of the experimental setup. We address each point below and will make targeted revisions to improve clarity and completeness.

Point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (verifier description): the central claim that low overall pass rates demonstrate contract-misaligned self-validation rests on the three-stage verifier accurately capturing user semantic contracts, yet the Behavioral stage is limited to short training dynamics under fixed seeds; nothing rules out divergences on longer horizons, different data orders, or optimizer states, which would mean the observed gap does not isolate the claimed cause.

    Authors: We agree that the Behavioral stage examines only short training dynamics under fixed seeds and does not preclude divergences on longer horizons, different data orders, or optimizer states. The manuscript's central claim is therefore scoped to the specific three-stage observational equivalence contract defined in §3, not to exhaustive semantic equivalence. The reported overestimation gap is measured strictly against this fixed contract. To address the concern, we will revise §3 and the abstract to explicitly state the scope and limitations of the Behavioral stage, add a paragraph discussing potential longer-horizon mismatches, and clarify that the benchmark exposes misalignment with the defined contract rather than claiming isolation of all possible semantic failures. revision: partial

  2. Referee: [Abstract] Abstract (355 attempts): the reported pass rates, overestimation figures, and budget-sensitivity results cannot be verified or reproduced because the manuscript provides no details on how the 355 conversion attempts were generated, how the fixed verifier was implemented, or any controls for selection bias in the attempt pool.

    Authors: The referee correctly identifies that the current manuscript lacks sufficient implementation and generation details for full reproducibility. The 355 attempts were produced by executing a fixed set of coding agents on a curated collection of source repositories using controlled token budgets; the verifier follows the three-stage procedure in §3 with concrete library calls for numeric tensor comparison and short training runs. We will add a dedicated appendix (and corresponding references in the abstract and §4) that documents the attempt-generation protocol, exact verifier implementation, seed controls, and steps taken to mitigate selection bias in the attempt pool. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark results are direct empirical measurements against an external fixed verifier

Full rationale

The paper defines T2J-Bench as a fixed three-stage verifier (Spec for interface admissibility, Numeric for outputs/gradients, Behavioral for short training dynamics) and reports pass rates and overestimation gaps as direct comparisons between conversion outputs and this verifier across 355 attempts. No parameters are fitted to subsets of the target data, no predictions reduce to inputs by construction, and no self-citation chains or uniqueness theorems are invoked to justify the central claims. The reported figures (26.7–28.9% overall pass rate, 66.6–97.8 point overestimation) are observational outcomes relative to the stated external contract, with no self-definitional loops or ansatzes smuggled via prior work. The derivation chain is therefore self-contained against the benchmark as presented.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the staged verifier measures the contracts users actually care about; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: The three ordered stages of the fixed verifier capture the semantic contracts users care about.
    Invoked to interpret low overall pass rates as evidence of contract-misaligned self-validation rather than benchmark mismatch.

pith-pipeline@v0.9.1-grok · 5800 in / 1273 out tokens · 38082 ms · 2026-06-29T10:27:29.487546+00:00 · methodology

