Introduces the Oracle benchmark of 96 black-box environments across 6 task types to measure integrated reasoning in LLMs through interactive function discovery, with o3 leading but all models showing planning weaknesses on hard instances.
H B LACK -BOX DETAILS IN ORACLE V 1.0 We introduce the detailed implementation of black-boxes in each task
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction
Introduces the Oracle benchmark of 96 black-box environments across 6 task types to measure integrated reasoning in LLMs through interactive function discovery, with o3 leading but all models showing planning weaknesses on hard instances.