Introduces the Oracle benchmark of 96 black-box environments across 6 task types to measure integrated reasoning in LLMs through interactive function discovery, with o3 leading but all models showing planning weaknesses on hard instances.
**You MUST ONLY output the value, DO NOT contain any other text.** G.2.2 C IRCUIT RULE INFERENCE (CRI)
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction
Introduces the Oracle benchmark of 96 black-box environments across 6 task types to measure integrated reasoning in LLMs through interactive function discovery, with o3 leading but all models showing planning weaknesses on hard instances.