
arxiv: 2602.20213 · v2 · pith:TRSP3ZKPnew · submitted 2026-02-23 · 💻 cs.SE · cs.AI · cs.CR

CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions

classification 💻 cs.SE · cs.AI · cs.CR
keywords cases · codehacker · adversarial · code · solutions · test · agent · attacks

The evaluation of Large Language Models (LLMs) for code generation relies heavily on the quality and robustness of test cases. However, existing benchmarks often lack coverage for subtle corner cases, allowing incorrect solutions to pass. To bridge this gap, we propose CodeHacker, an automated agent framework dedicated to generating targeted adversarial test cases that expose latent vulnerabilities in program submissions. Mimicking the hack mechanism in competitive programming, CodeHacker employs a multi-strategy approach, including stress testing, anti-hash attacks, and logic-specific targeting, to break specific code submissions. To ensure the validity and reliability of these attacks, we introduce a Calibration Phase, where the agent iteratively refines its own Validator and Checker via self-generated adversarial probes before evaluating contestant code. Experiments demonstrate that CodeHacker significantly improves the True Negative Rate (TNR) of existing datasets, effectively filtering out incorrect solutions that were previously accepted. Furthermore, generated adversarial cases prove to be superior training data, boosting the performance of RL-trained models on benchmarks like LiveCodeBench.
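The stress-testing strategy named in the abstract can be illustrated with a minimal sketch. This is not CodeHacker's implementation: the problem (maximum non-empty subarray sum), the function names `brute_force_max_subarray`, `buggy_max_subarray`, and `stress_test`, and the input ranges are all hypothetical, chosen only to show the general idea of comparing a submission against a slow but trusted reference on small random inputs until they disagree.

```python
import random


def brute_force_max_subarray(a):
    """Trusted O(n^2) reference: maximum sum over all non-empty subarrays."""
    best = a[0]
    for i in range(len(a)):
        total = 0
        for j in range(i, len(a)):
            total += a[j]
            best = max(best, total)
    return best


def buggy_max_subarray(a):
    """Hypothetical flawed submission: a Kadane-style scan that starts
    from 0, so it implicitly allows an empty subarray and returns 0
    when every element is negative."""
    best = 0
    cur = 0
    for x in a:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def stress_test(candidate, reference, trials=1000, seed=0):
    """Return the first small random input on which candidate and
    reference disagree, or None if no disagreement is found."""
    rng = random.Random(seed)  # fixed seed keeps the search reproducible
    for _ in range(trials):
        n = rng.randint(1, 6)                       # small sizes keep cases readable
        a = [rng.randint(-5, 5) for _ in range(n)]  # small values hit corner cases often
        if candidate(a) != reference(a):
            return a
    return None


if __name__ == "__main__":
    print("counterexample:", stress_test(buggy_max_subarray, brute_force_max_subarray))
```

Any input returned here is a candidate "hack" for the flawed submission. In this toy setup, the two functions can only disagree on arrays whose elements are all negative, which is the kind of corner case a weak test suite may omit.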



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

    cs.AI 2026-05 unverdicted novelty 6.0

    Solvita is an agentic evolution system using Planner, Solver, Oracle, and Hacker agents with trainable graph knowledge networks updated by reinforcement learning on pass/fail and vulnerability signals to achieve SOTA ...

  2. VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation

    cs.SE 2026-05 unverdicted novelty 6.0

    VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.

  3. Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    Latent-GRPO stabilizes reinforcement learning in latent space, delivering 7.86 Pass@1 gains on low-difficulty tasks over latent baselines and 4.27 points over explicit GRPO on high-difficulty tasks with 3-4x shorter r...