arXiv preprint arXiv:2303.05510

Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B Tenenbaum, Chuang Gan · 2023 · arXiv 2303.05510

7 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Gradient-Based Program Synthesis with Neurally Interpreted Languages

cs.LG · 2026-04-20 · unverdicted · novelty 8.0

NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.

POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.

AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

cs.SE · 2026-04-12 · unverdicted · novelty 7.0

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.

ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling

cs.AI · 2025-10-16 · unverdicted · novelty 7.0

ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

Bridging the Gap between User Intent and LLM: A Requirement Alignment Approach for Code Generation

cs.SE · 2026-04-17 · unverdicted · novelty 5.0

REA-Coder improves LLM code generation by iteratively aligning requirements with model understanding and verifying outputs against the aligned spec.

Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions

cs.SE · 2026-04-27 · unverdicted · novelty 4.0

LLM-based SE tools lack stable ground truth and deterministic outputs, making standard evaluation assumptions invalid and requiring new approaches for reliable assessment.

citing papers explorer

Showing 7 of 7 citing papers.

Gradient-Based Program Synthesis with Neurally Interpreted Languages cs.LG · 2026-04-20 · unverdicted · none · ref 49
POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference cs.SE · 2026-05-05 · unverdicted · none · ref 26
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search cs.SE · 2026-04-12 · unverdicted · none · ref 59
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling cs.AI · 2025-10-16 · unverdicted · none · ref 52
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 54
Bridging the Gap between User Intent and LLM: A Requirement Alignment Approach for Code Generation cs.SE · 2026-04-17 · unverdicted · none · ref 47
Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions cs.SE · 2026-04-27 · unverdicted · none · ref 45
