FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
HardTests: Synthesizing high-quality test cases for LLM coding.arXiv preprint arXiv:2505.24098
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4roles
background 1polarities
background 1representative citing papers
VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.
Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.
CP-Agent improves LLM competitive programming performance via calibrated feedback mechanisms that target false-admission risk, evidence against bad programs, and success hazard.
citing papers explorer
-
CP-Agent: A Calibrated Risk-Controlled Agent for Feedback-Driven Competitive Programming
CP-Agent improves LLM competitive programming performance via calibrated feedback mechanisms that target false-admission risk, evidence against bad programs, and success hazard.