H BLACK-BOX DETAILS IN ORACLE V1.0

We introduce the detailed implementation of black-boxes in each task.

Evaluation:
- When the given number of interactions is reached, the game ends, we’ll calculate your **score**

Now Let’s Play the Game {algorithm}, the Description Is that {descr

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction

cs.AI · 2025-08-26 · unverdicted · novelty 6.0

Introduces the Oracle benchmark of 96 black-box environments across 6 task types to measure integrated reasoning in LLMs through interactive function discovery, with o3 leading but all models showing planning weaknesses on hard instances.

Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction cs.AI · 2025-08-26 · unverdicted · none · ref 64
