Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
Canonical reference. 80% of citing Pith papers cite this work as background.
roles
background 5
citing papers explorer
- Show Your Work: Scratchpads for Intermediate Computation with Language Models
  Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
- MACAA: Belief-Revision Multi-Agent Reasoning for Code Authorship Verification
  MACAA is a belief-revision multi-agent framework for training-free code authorship verification that reports 89.15% F1 on same-language benchmarks and 80% on cross-language pairs while outperforming baselines.
- Can LLMs Deobfuscate Binary Code? A Systematic Analysis of Large Language Models into Pseudocode Deobfuscation
  LLM deobfuscation of binaries to pseudocode depends more on reasoning ability and task-specific fine-tuning than on model size, with reasoning models showing robustness across ISAs and obfuscation levels on the new BinDeObfBench.
- FormulaCode: Evaluating Agentic Optimization on Large Codebases
  FormulaCode is a new benchmark for repository-level LLM agent optimization using 957 mined bottlenecks, expert patches, and multi-objective metrics from real scientific Python repositories.
- NESA: Relational Neuro-Symbolic Static Program Analysis
  NESA presents a neuro-symbolic framework that decomposes static analyses into policy-defined sub-problems solved by parsers and LLMs to enable compilation-free customizable analysis with reduced hallucinations.
- CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora
  CodePivot uses Python as a pivot language plus an Aggressive-Partial-Functional RL reward to train a 7B model that outperforms much larger LLMs on multilingual code transpilation without parallel corpora.
- Fine-Tuning Code Language Models to Detect Cross-Language Bugs
  Fine-tuning 13 CodeLMs on a constructed CLB dataset with nine interaction types improves detection, with UniXcoder-base reaching F1 0.7407 and small models outperforming large ones.
- SafeTrans: LLM-assisted Transpilation from C to Rust
  SafeTrans achieves up to 80% successful C-to-Rust translations via LLM iterative repair on 2653 programs and two real projects, with some C vulnerabilities carrying over to the Rust output.
- Neural Code Translation of Legacy Code: APL to C#
  Guided LLM strategies with custom datasets and execution-based verification enable functional APL-to-C# translation across a range of program complexities.
- Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection
  Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and Rust-Java.
- Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation
  A large-scale study finds that many LLM code translation failures are false negatives due to improper evaluation configurations rather than incorrect translations.
- PLMGH: What Matters in PLM-GNN Hybrids for Code Classification and Vulnerability Detection
  Controlled experiments show PLM-GNN hybrids improve code tasks over GNN-only baselines, with PLM source having larger impact than GNN backbone.
- An Initial Exploration of Contrastive Prompt Tuning to Generate Energy-Efficient Code
  Contrastive Prompt Tuning raises code accuracy on two of three tested models but produces inconsistent energy-efficiency gains that depend on model, language, and task.
- TransAgent: Enhancing LLM-Based Code Translation via Fine-Grained Execution Alignment
  TransAgent improves LLM code translation by up to 33.3% via multi-agent fine-grained execution alignment on a new benchmark of recent tasks.
- Self-Refine: Iterative Refinement with Self-Feedback
  Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.
- Large Language Models for Multilingual Code Intelligence: A Survey
  A survey of methods, benchmarks, and open challenges for large language models in multilingual code generation and translation.
- LLMSniffer: Detecting LLM-Generated Code via GraphCodeBERT and Supervised Contrastive Learning
  LLMSniffer improves detection of LLM-generated code on GPTSniffer and Whodunit benchmarks by fine-tuning GraphCodeBERT via two-stage supervised contrastive learning plus preprocessing and MLP classification.
- Specification-Driven Code Translation Powered by Large Language Models: How Far Are We?
  NL specifications alone do not improve LLM code translation performance, but combining them with source code yields gains in select language pairs with no overall consistent benefit.