The Obfuscated Natural Number Game shows reasoning LLMs keep proof accuracy without semantic cues while general models degrade, establishing a metric for architectural reasoning in alien math domains.
SpecEval: Evaluating Code Comprehension in Large Language Models via Program Specifications
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
SpecSyn generates formal specifications with over 90% precision and 75% recall, successfully verifying 1071 out of 1365 target properties on open-source programs.
citing papers explorer
-
Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game
The Obfuscated Natural Number Game shows reasoning LLMs keep proof accuracy without semantic cues while general models degrade, establishing a metric for architectural reasoning in alien math domains.
-
SpecSyn: LLM-based Synthesis and Refinement of Formal Specifications for Real-world Program Verification
SpecSyn generates formal specifications with over 90% precision and 75% recall, successfully verifying 1071 out of 1365 target properties on open-source programs.