Measuring Coding Challenge Competence with

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo

1 Pith paper cites this work. Polarity classification is still being indexed.


representative citing papers

Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate

cs.LG · 2026-05-08 · conditional · novelty 6.0

Mage shows compile-pass rate is anti-correlated with functional correctness in LLM game scene generation; direct NL-to-C# yields 43% runtime but F1~0.12 structure, while IR conditioning recovers structure (F1 up to 1.0) but halves runtime, with granularity levels statistically equivalent.

citing papers explorer

Showing 1 of 1 citing paper.

Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate

cs.LG · 2026-05-08 · conditional · none · ref 3
