pith. sign in

arxiv: 2508.19035 · v2 · submitted 2025-08-26 · 💻 cs.AI

Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction

Pith reviewed 2026-05-18 21:31 UTC · model grok-4.3

classification 💻 cs.AI
keywords large language modelsreasoning evaluationblack-box environmentsOracle benchmarkexploration strategieshypothesis refinementinteractive discovery
0
0 comments X p. Extension

The pith

Large language models lack high-level planning for efficient exploration in black-box environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes black-box environment interaction as a new paradigm to evaluate LLMs on integrated reasoning rather than isolated deductive, inductive, or abductive tasks. It presents the Oracle benchmark consisting of 96 environments across 6 task types where models must discover hidden functions through limited interactions. Testing 19 LLMs shows that even the best model, o3, achieves over 70% on easy tasks but falls below 40% on hard ones. Analysis points to a shared limitation in creating adaptive exploration strategies to refine hypotheses. This highlights a gap in current LLM capabilities for human-like discovery learning.

Core claim

LLMs lack the high-level planning capability to develop efficient and adaptive exploration strategies for hypothesis refinement when interacting with black-box environments. The Oracle benchmark reveals this by showing performance drops on complex tasks despite success on simpler ones, indicating that LLMs excel at isolated reasoning but struggle with the dynamic, integrated process required for unraveling unknown functions.

What carries the argument

The black-box environment interaction paradigm, where LLMs query a hidden function and reason over input-output pairs to discover the underlying rule within limited exploration turns.

If this is right

  • LLMs will continue to underperform in real-world discovery tasks that require adaptive strategies.
  • The Oracle benchmark can measure future progress toward integrated reasoning abilities.
  • New training focused on high-level planning in interactive settings may improve performance.
  • Hard tasks in the benchmark expose limits that isolated reasoning tests miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar planning shortfalls likely appear in agent systems exploring unknown real-world settings.
  • Explicit training on strategy adaptation could close the gap observed here.
  • Human performance on the same environments would help quantify how far current models lag.

Load-bearing premise

The 96 chosen black-box environments and the specific exploration-turn limits represent the broader class of interactive discovery problems.

What would settle it

An LLM that achieves high accuracy on the hard black-box tasks through novel adaptive exploration strategies without extra training would falsify the lack of high-level planning.

Figures

Figures reproduced from arXiv: 2508.19035 by Alex Gu, Congchi Yin, Jun Shao, Piji Li, Tianyi Wu, Xun Jiang, Yankai Shu, Yunhan Wang.

Figure 1
Figure 1. Figure 1: Illustration of Charles Peirce’s framework of humans reasoning behavior (upper) and an [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Examples of 6 different types of black-box tasks in the O [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The composition of ORACLE V1.0 benchmark. The composition of ORACLE V1.0 benchmark is shown in Fig￾ure 3. Each task consists of a mix of easy and hard black-boxes. The inner ring of the pie chart indicates the total number of black-boxes for each task, while the outer ring breaks down these black-boxes into easy and hard categories. The current bench￾mark includes 6 tasks and 96 black-boxes (50 easy, 46 ha… view at source ↗
Figure 4
Figure 4. Figure 4: The framework for black-box generation, which is used to build the O [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Baseline test for benchmarked models under 12@1. Models superscripted with [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Performance of LLMs in six tasks of the O [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Performance of LLMs in six tasks of the O [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Averaged accuracy increase of 19 LLMs across 6 tasks when turn@shot is ex￾tended from 10@1&1@1 to 20@2&2@1. We aim to reveal a key weakness of modern LLMs in black-box interaction: They struggle to develop efficient and adaptive ex￾ploration strategies. This deficiency in high-level planning high￾lights LLMs’ shortcomings in deductive and inductive reasoning. To substantiate this claim, we analyze the perf… view at source ↗
Figure 9
Figure 9. Figure 9: Cases of exploration behavior under two different settings. Black-box responses and [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Model and human performance in CRI and ERI under two settings. We select three powerful LLMs, gemini-2.5-pro, o3-mini, and o4- mini, and evaluate them in two representative tasks, CRI and ERI, which most challenge models’ ability in optimizing exploration strategy. Results are shown in [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Black-box interaction log generated using different platform code [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Logical error figured out by Refinement LLM [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Black-box interaction log generated using different platform code [PITH_FULL_IMAGE:figures/full_fig_p026_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Logical error figured out by Refinement LLM [PITH_FULL_IMAGE:figures/full_fig_p026_14.png] view at source ↗
read the original abstract

Existing tasks fall short in evaluating reasoning ability of Large Language Models (LLMs) in an interactive, unknown environment. This deficiency leads to the isolated assessment of deductive, inductive, and abductive reasoning, neglecting the integrated reasoning process that is indispensable for human-like discovery learning. We introduce a novel evaluation paradigm, \textit{black-box environment interaction}, to tackle this challenge. A black-box environment is defined by a hidden function that maps a specific set of inputs to outputs. LLMs are required to unravel the hidden function behind the black-box environment by interacting with it in given exploration turns, and reasoning over observed input-output pairs. Leveraging this idea, we build the \textsc{Oracle} benchmark which comprises 6 types of black-box task with 96 black-box environments. 19 modern LLMs are benchmarked. o3, a leading LLM from OpenAI, ranks first in 5 of the 6 tasks, achieving over 70\% accuracy on most easy black-box environments. But it still struggles with some hard black-box tasks, where the average performance drops below 40\%. Further analysis reveals a universal difficulty among LLMs: They lack the high-level planning capability to develop efficient and adaptive exploration strategies for hypothesis refinement. Code is available in https://github.com/lemonsis/Oracle_Benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that existing LLM reasoning benchmarks isolate deductive, inductive, or abductive skills and fail to capture integrated reasoning in interactive discovery settings. It introduces the black-box environment interaction paradigm, in which an LLM must discover a hidden input-output function by querying a black-box environment within a fixed number of exploration turns. The authors construct the Oracle benchmark (6 task types, 96 environments), evaluate 19 LLMs, and report that o3 leads on five of the six tasks with >70% accuracy on easy instances yet falls below 40% on hard instances. From these performance gaps the paper concludes that LLMs universally lack high-level planning needed to devise efficient, adaptive exploration strategies for hypothesis refinement.

Significance. If the empirical results and their interpretation hold, the work supplies a concrete, reproducible benchmark that exposes limitations in current LLMs' ability to perform integrated, interactive reasoning beyond isolated logical operations. The provision of open code and a diverse set of 96 environments constitutes a useful public resource for the community. The finding that even the strongest model (o3) drops sharply on hard tasks supplies a falsifiable prediction that can be tested with future model releases or prompting interventions.

major comments (2)
  1. [Further analysis / §5] Further analysis (abstract and §5): the central claim that LLMs 'lack the high-level planning capability to develop efficient and adaptive exploration strategies' is inferred solely from aggregate accuracy drops below 40% on hard tasks. The manuscript does not report direct measurements of planning behavior (e.g., frequency of strategy revisions, hypothesis-update counts, or adaptive turn allocation) extracted from interaction traces, nor does it include an ablation that compares baseline prompting against explicit planning scaffolds. Consequently the performance gap cannot yet be attributed specifically to planning deficits rather than to hypothesis generation failures, context-window overflow, or prompt-induced myopia.
  2. [§3] Benchmark construction (§3): the paper states that the 96 environments and exploration-turn limits were chosen to span easy-to-hard difficulty, yet provides no pre-registration, pilot-study protocol, or explicit criteria demonstrating that task difficulty and turn budgets were fixed before model evaluation. Without this information it remains possible that the reported performance gaps partly reflect post-hoc selection of environments that current LLMs happen to find difficult.
minor comments (2)
  1. [Results tables] Table 1 (or equivalent results table): report per-task standard deviations or confidence intervals alongside mean accuracies so that the statistical reliability of the <40% hard-task claim can be assessed.
  2. [Methods] The definition of 'success' for each black-box task (exact function recovery versus approximate recovery within tolerance) should be stated explicitly in the methods section rather than only in the appendix.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where we agree that revisions are needed and outlining the specific changes we will make.

read point-by-point responses
  1. Referee: [Further analysis / §5] Further analysis (abstract and §5): the central claim that LLMs 'lack the high-level planning capability to develop efficient and adaptive exploration strategies' is inferred solely from aggregate accuracy drops below 40% on hard tasks. The manuscript does not report direct measurements of planning behavior (e.g., frequency of strategy revisions, hypothesis-update counts, or adaptive turn allocation) extracted from interaction traces, nor does it include an ablation that compares baseline prompting against explicit planning scaffolds. Consequently the performance gap cannot yet be attributed specifically to planning deficits rather than to hypothesis generation failures, context-window overflow, or prompt-induced myopia.

    Authors: We acknowledge that the current interpretation relies on performance patterns rather than direct quantification of planning behaviors from the interaction logs. While the sharp drop on hard instances is consistent across models and task types, we agree this leaves room for alternative explanations. In the revised version we will add a new subsection in §5 that analyzes the collected interaction traces to report metrics such as the average number of distinct hypotheses maintained per episode, the frequency of strategy revisions between turns, and the distribution of query types (e.g., exploratory vs. confirmatory). We will also include a limited ablation on a subset of hard environments comparing the original prompt against an explicit planning scaffold that instructs the model to outline an exploration policy before querying. These additions will allow readers to assess whether planning deficits are the dominant factor. revision: yes

  2. Referee: [§3] Benchmark construction (§3): the paper states that the 96 environments and exploration-turn limits were chosen to span easy-to-hard difficulty, yet provides no pre-registration, pilot-study protocol, or explicit criteria demonstrating that task difficulty and turn budgets were fixed before model evaluation. Without this information it remains possible that the reported performance gaps partly reflect post-hoc selection of environments that current LLMs happen to find difficult.

    Authors: The 96 environments were constructed by sampling from six canonical function families (linear, polynomial, trigonometric, logical, compositional, and noisy variants) with parameters chosen to produce a range of identification complexities. Turn budgets were set according to information-theoretic considerations for each family so that easy instances are solvable with greedy exploration while hard instances require adaptive, multi-step hypothesis testing. We performed internal calibration runs to verify this spectrum before the main evaluation. We did not pre-register the benchmark or publish the pilot protocol in the original manuscript. In the revision we will add an appendix that documents the exact difficulty criteria, the information-theoretic bounds used for turn allocation, and aggregate statistics from the pilot runs (without revealing any model-specific results). This documentation will make the construction process transparent and reduce the plausibility of post-hoc selection. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results independent of inputs

full rationale

The paper introduces a new benchmark (Oracle) consisting of 96 black-box environments across 6 task types and reports empirical performance of 19 LLMs on exploration tasks with fixed turn limits. The central claim—that LLMs lack high-level planning for adaptive exploration—is presented as an inference from observed accuracy drops (e.g., o3 below 40% on hard tasks) rather than any derivation, equation, fitted parameter, or self-citation chain. No load-bearing step reduces a result to its own inputs by construction; the environments and interaction protocol are externally defined and the measurements are replicable outside the paper's fitted values or prior author work. This is a standard empirical evaluation paper whose findings remain falsifiable via new runs on the released environments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the assumption that the chosen black-box tasks adequately sample the space of integrated reasoning problems and that limited-turn interaction is a valid proxy for discovery learning.

axioms (1)
  • domain assumption Performance on the six task types and 96 environments reflects general interactive reasoning ability.
    Stated in the motivation for the benchmark construction.

pith-pipeline@v0.9.0 · 5783 in / 1140 out tokens · 26503 ms · 2026-05-18T21:31:57.218521+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 1 internal anchor

  1. [1]

    18) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al

    (Cited on pg. 18) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented gener- ation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33: 9459–9474, 2020. (Cited on pg. 18) Jiachun Li, Pengfei...

  2. [2]

    arXiv preprint arXiv:2502.01100 , year=

    (Cited on pg. 10) Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022. (Cited on pg. 10) Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Tedd...

  3. [3]

    Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

    (Cited on pg. 10) OpenAI. Introducing gpt-4.1 in the api. 2025a. (Cited on pg. 17) OpenAI. Openai o3 and o4-mini system card. 2025b. (Cited on pg. 1, 17) Dongmin Park, Minkyu Kim, Beongjun Choi, Junhyuck Kim, Keon Lee, Jonghyun Lee, Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, et al. Orak: A foundational benchmark for training and evaluating llm...

  4. [4]

    Double Pendulum

    (Cited on pg. 10) Relu Patrascu and Deborah Stacey. Adaptive exploration in reinforcement learning. In IJCNN’99. International Joint Conference on Neural Networks. Proceedings (Cat. No. 99CH36339) , vol- ume 4, pp. 2276–2281. IEEE, 1999. (Cited on pg. 8) Charles Sanders Peirce. Collected papers of charles sanders peirce, volume 5. Harvard University Press...

  5. [5]

    common-sense

    Violations of unstated "common-sense" rules: Task descriptions often omit implicit con- straints, such as the fact that a player’s score or money cannot be negative. Our interactive process makes these violations apparent, forcing a correction

  6. [6]

    Misinterpretations of ambiguous language : Natural language can be imprecise. The framework corrects for misunderstandings, as demonstrated in the example of circuit task where the term "random" was initially misinterpreted, leading to a non-deterministic im- plementation instead of a fixed random one. 21 Preprint

  7. [7]

    For input size n=4, the blackbox contains a random circuit of 8 gates. The circuit is not for some well-known purpose

    Simple yet critical implementation bugs: This category includes flaws analogous to typos or logical oversights, such as using an incorrect formula in the example of physics task. It also includes bugs that cause runtime errors. These are difficult to spot in a static code review but are readily exposed when the simulation produces incorrect outputs. There...

  8. [8]

    **Insufficient analysis of the mechanical system**: The assistant fails to properly analyze the motion pattern. From the coordinates provided: - t=0: (2.5, 0.0, -4.33) - t=1: (0.63, 2.42, -4.33) - t=2: (-2.18, 1.22, -4.33) - t=3: (-1.73, -1.8, -4.33) - t=4: (1.31, -2.13, -4.33) - t=5: (2.39, 0.73, -4.33) The z-coordinate remains constant at -4.33, and the...

  9. [12]

    - When you answer the puzzle, you should strictly follow the answer format in the **Description**

    Output format: - When you ask a question, you should strictly follow query format in the **Description**. - When you answer the puzzle, you should strictly follow the answer format in the **Description**. - Refer to the examples in the **Description**. for the correct format. - If you figure out the right answer before given turns, keep interacting with t...

  10. [13]

    **You MUST ONLY output the answer itself in the format mentioned in the description, DO NOT contain more text.** Now Let’s Solve the Puzzle wordle

    Evaluation: - When the given number of interactions is reached, you need to give your answer of the puzzle. **You MUST ONLY output the answer itself in the format mentioned in the description, DO NOT contain more text.** Now Let’s Solve the Puzzle wordle. **Description**: Wordle is a word-guessing game where players attempt to deduce a hidden 8-letter wor...

  11. [16]

    The interaction rules are as follows: - Interaction Rule 0: The user will first tell you the rule of the game, and the interaction format that must be followed when playing

    Interaction rules: - To finish the goal, you need to interact with the user. The interaction rules are as follows: - Interaction Rule 0: The user will first tell you the rule of the game, and the interaction format that must be followed when playing. In each turn, the user will tell *current turn* and *remaining turns*. - Interaction Rule 1: You can take ...

  12. [18]

    LLM: I understand the rules

    Evaluation: - When the given number of interactions is reached, the game ends and we’ll calculate your **score** 28 Preprint Now Let’s Play the Game{algorithm}, the Description Is that {description}. LLM: I understand the rules. I will not output any unrelated text! Let us start the interaction. user: ********Exploration Phase Starts, We wll Play the Game...

  13. [19]

    invalid syntax

    to list: invalid syntax (<string>, line 2) Gemini-2.5-pro: <think>**Understanding the Input and Refining My Strategy**Okay, so I got an error message back after my last attempt, indicating a problem with the parsing of my input. It looks like my combined assignment and checkpoint request in a single command, ‘arr = [1, 2, 3, 4, 5]‘ and ‘(1, 1)‘, threw the...

  14. [20]

    hidden code function

    *before* requesting checkpoint values (Rule 2). While the rules don’t explicitly mandate separate turns, the error suggests it’s the safest and most reliable approach. Therefore, for this round, I’ll focus *solely* on assigning a value to ‘arr‘. I’ll keep the input simple and valid – ‘arr = [10, 20, 30]‘. This should ensure the system receives a clean and...

  15. [21]

    ab", and

    I’ve mapped the letters to their 0-based alphabetical index, then to their 5-bit binary representation. This works perfectly so far: ‘a‘ is 00000 (’xxxxx’), ‘b‘ is 00001 (’xxxxy’), and so on.**Testing and Refining**Now, I need to solidify this. My next input will be ‘l‘. Since ‘l‘ is the 11th letter, its binary equivalent is 01011. So, I expect the output...

  16. [22]

    ab" will become

    -> 00010 (’xxxyx’)The hypothesized rule is:1. Take the input lowercase letter and determine its 0-indexed position in the alphabet (a=0, b=1, c=2, ...).2. Convert this position number into a 5-digit binary string, padding with leading zeros if necessary.3. Replace ’0’ with ’x’ and ’1’ with ’y’.To test this hypothesis and confirm my thought process, I need...

  17. [23]

    **Circuit Function**: Implement a function that constructs the circuit according to the language description, given the input wires and returns the output of each gate

  18. [24]

    **Main Function**: Implement the core interaction loop between the LLM player and the blackbox. ## Detailed Coding Instructions ### Required Imports Begin by importing these packages: “‘python im- port os import sys current_path = os.path.abspath(__file__) oracle_path = os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(current_path)))) if or...

  19. [25]

    - ‘blackbox_output‘ should append the remain rounds for interaction

    Create an interaction loop for ‘max_turns‘ iterations: - Call ‘player_output = player.normal_output(blackbox_output)‘ to send the message ‘blackbox_output‘ to the player, and get the player’s output. - ‘blackbox_output‘ should append the remain rounds for interaction. - For the first iteration, ‘blackbox_output‘ should be a prompt telling player the game ...

  20. [26]

    - Call ‘player.evaluate(failure_num, version)‘ - Call ‘player.save_history(output_dir, version)‘

    After the loop completes: - ‘player_output = player.normal_output(blackbox_output)‘ to give the last answer of the player. - Call ‘player.evaluate(failure_num, version)‘ - Call ‘player.save_history(output_dir, version)‘

  21. [27]

    ‘python if __name__ ==

    Add an entry point at the end of the program: “‘python if __name__ == "__main__": args = sys.argv[1:] main(args[0], args[1], args[2], args[3], int(args[4]), args[5], args[6], int(args[7]), args[8], int(args[9]), int(args[10]), args[11], bool(eval(args[12]))) “‘ ## Circuit Details - ‘AND‘ and ‘OR‘ gates always require exactly two inputs - ‘NOT‘ gates alway...

  22. [28]

    You need to understand the detailed working principle of this function by interacting with user in multiple turns

    Task overview: - The user plays the role of a code function, but you don’t know what the function is. You need to understand the detailed working principle of this function by interacting with user in multiple turns

  23. [29]

    Goals: - By interacting with the user within given interaction turns, you need to understand how the code function operates in every checkpoint

  24. [30]

    - The code function has some checkpoints

    User property: - The code function takes some variables as input. - The code function has some checkpoints. You can specify ‘(idx, iter)‘ to get the value of accessible intermediate variables: - ‘idx‘: int is the index of checkpoint (i.e. the idx-th checkpoint). - ‘iter‘: int is the number of visited times (i.e. checkpoint will be visited multiple times i...

  25. [31]

    In each turn, the user will tell *current turn* and *remaining turns*

    Interaction rules: - Rule 0: The user will first tell you the *type* and *name* of all the function input variables, the *total number of checkpoints*. In each turn, the user will tell *current turn* and *remaining turns*. - Rule 1: You can assign or re-assign any values to the input variables freely, but they must match the variable types (e.g. if num is...

  26. [32]

    variable_name_1 = value_1; variable_name_2 = value_2

    Output format: - **You must strictly obey the output format rules, DO NOT output any unrelated text!**: - According to Interaction Rule 1, when you want to assign values to input variables, output "variable_name_1 = value_1; variable_name_2 = value_2; ..." (e.g. arr = [1, 5, 2, 9]; k = 3) - According to Interaction Rule 2, when you want to ask for accessi...

  27. [33]

    **You MUST ONLY output the value, DO NOT contain any other text.** G.2.2 C IRCUIT RULE INFERENCE (CRI)

    Evaluation: - When the given number of interactions is reached, several questions on the variable value at certain checkpoint will be presented. **You MUST ONLY output the value, DO NOT contain any other text.** G.2.2 C IRCUIT RULE INFERENCE (CRI)

  28. [34]

    For each gates, the input are input wires or the output of other gates

    Definitions: A ’circuit’ means a boolean circuit with no circle, whose input wires are a fixed number of 0/1 bits, and consisted of ‘AND‘/‘OR‘/‘NOT‘ gates. For each gates, the input are input wires or the output of other gates

  29. [35]

    Your task is to discover the structure of this circuit by interacting with the platform over multiple turns

    Task overview: You are presented with a platform containing a circuit. Your task is to discover the structure of this circuit by interacting with the platform over multiple turns

  30. [36]

    Goals: You have one final goal: You need to know output of circuit gates given every input

  31. [37]

    Interaction Process: To finish the goal, you need to interact with the platform. The interaction process is as follows: - The platform will first provide you with the input sizes n - In each turn, you can ask the platform for the outputs of every circuit gates with a given input size x = ( x1, x2, ..., xn), xi ∈ { 0, 1}. - Each turn after you ask the plat...

  32. [38]

    Output format in Interaction: **You must strictly obey the output format rules, DO NOT output any unrelated text!**: Format your queries as: (x_1, x_2, ... , x_n). For example, if you what to now the output of circuit given input x = (0, 1, 1), you should output: (0, 1, 1)

  33. [39]

    - Your construction should specify the exact connections and gate types in Cn

    Evaluation: After reaching the maximum number of allowed interactions, you will be tested on your understanding of the circuit family: - You will be given input x = ( x1, x2, ..., xn) and you should answer the output of each gate [y1, y2, ..., ym]. - Your construction should specify the exact connections and gate types in Cn

  34. [40]

    - There are no restrictions on how many times a gate’s output or an input wire can be used as input to other gates

    Circuit Details - ‘AND‘ and ‘OR‘ gates always have exactly two inputs, and ‘NOT‘ gatea always have one input. - There are no restrictions on how many times a gate’s output or an input wire can be used as input to other gates. - Every circuit in the circuit family contains no loops, For the i-th gate, all inputs must come from either input wires or outputs...

  35. [41]

    You need to understand how this classical mechanical system operates by interacting with the user in multiple turns

    Task overview: - The user plays the role of a classical mechanical system, but you don’t know what it is. You need to understand how this classical mechanical system operates by interacting with the user in multiple turns

  36. [42]

    Goal: - By interacting with the user within given interaction turns, you need to understand how the mechanical system operates

  37. [43]

    - The user takes time ‘t: float‘ as input

    User property: - The user will remind you the remaining number of turns in each turn. - The user takes time ‘t: float‘ as input. - The user return the 3-dimensional coordinate ‘(x, y, z)‘ of each object in time ‘t‘

  38. [44]

    You can only assign one ‘t‘ in each turn

    Interaction rules: - Rule 1: You need to assign a value of ‘t‘. You can only assign one ‘t‘ in each turn. Make sure the assigned ‘t‘ is a one-digit decimal. - Rule 2: You will receive the user response, which is 3-dimensional coordinate ‘(x, y, z)‘ of each object in time ‘t‘, after you assign specific ‘t‘

  39. [45]

    Let’s input

    Output format: - You must strictly obey the output format rules: When you want to assign value for ‘t‘, **only output the value**. DO NOT output any unrelated text or symbols like "Let’s input", "I’ll try". - If you understand the mechanical system before reaching given interaction turns, keep interacting with the user to make sure you don’t miss any deta...

  40. [46]

    object1": (x, y, z),

    Evaluation: - When the given number of interactions is reached, you need to calculate the 3- dimensional coordinate ‘(x, y, z)‘ of each object at time ‘t‘ with the format of ‘"object1": (x, y, z), "object2": (x, y, z), ...‘. All coordinates are approximated to two decimal places. 50 Preprint G.2.4 E NCRYPTION RULE INFERENCE (ERI)

  41. [47]

    You need to figure out this rule by interacting with the user in multiple turns

    Task overview: - The user transforms one string into another based on a fixed rule, but you don’t know what the fixed rule is. You need to figure out this rule by interacting with the user in multiple turns

  42. [48]

    Goal: - By interacting with the user within given interaction turns, you need to understand the fixed rule of transforming one string into another

  43. [49]

    - The user will output transformed string in each turn

    User property: - The user will remind you the remaining number of turns in each turn. - The user will output transformed string in each turn

  44. [50]

    You can assign any string freely, but make sure the string **only contains uppercase or lowercase English letters (A-Z, a-z) and blank space**

    Interaction rules: - Rule 1: You must only assign one string in each turn. You can assign any string freely, but make sure the string **only contains uppercase or lowercase English letters (A-Z, a-z) and blank space**. Then you will receive corresponding transformed string from the user

  45. [51]

    Let’s input

    Output format: - You must strictly obey the output format rules: **Only output the string**. DO NOT output any unrelated text or symbols like "Let’s input", "I’ll try", " / n". - If you understand the transforming rule before reaching given interaction turns, keep interacting with the user to make sure you don’t miss any details. DO NOT output text like "...

  46. [52]

    **You MUST ONLY output the value, DO NOT contain any other text.** G.2.5 I NTERACTIVE PUZZLE INFERENCE (IPI)

    Evaluation: - When the given number of interactions is reached, you will be given several test strings and you need to output corresponding transformed string. **You MUST ONLY output the value, DO NOT contain any other text.** G.2.5 I NTERACTIVE PUZZLE INFERENCE (IPI)

  47. [53]

    You need to guess the hidden answer by interacting with the user in multiple turns

    Task overview: - The user plays the role of a puzzle, and you don’t know what the hidden answer is. You need to guess the hidden answer by interacting with the user in multiple turns

  48. [54]

    Goals: - You need to guess the answer to the puzzle within given interaction turns

  49. [55]

    User property: - The user hides the answer which you need to figure out

  50. [56]

    In each turn, the user will tell *current turn* and *remaining turns*

    Interaction rules: - Rule 0: The user will first tell you the rule of the puzzle, and the interaction format that must be followed when querying. In each turn, the user will tell *current turn* and *remaining turns*. - Rule 1: You can ask questions according to the rules of the game and receive corresponding feedback. If your ask is unavailable, the user ...

  51. [57]

    - When you answer the puzzle, you should strictly follow the answer format in the **Descrip- tion**

    Output format: - When you ask a question, you should strictly follow query format in the **Descrip- tion**. - When you answer the puzzle, you should strictly follow the answer format in the **Descrip- tion**. - Refer to the examples in the **Description**. for the correct format. - If you figure out the right answer before given turns, keep interacting wi...

  52. [58]

    **You MUST ONLY output the answer itself in the format mentioned in the description, DO NOT contain more text.** Now Let’s Solve the Puzzle algorithm

    Evaluation: - When the given number of interactions is reached, you need to give your answer of the puzzle. **You MUST ONLY output the answer itself in the format mentioned in the description, DO NOT contain more text.** Now Let’s Solve the Puzzle algorithm. **Description**: description. G.2.6 G AME STRATEGY INFERENCE (GSI)

  53. [59]

    But you don’t know what the stategy is

    Task overview: - The user plays the role of an opponent who takes a fixed strategy in a game. But you don’t know what the stategy is. You need to guess the hidden stategy of your opponent by interacting (playing game) with him in multiple turns

  54. [60]

    The score might depend on winning rate, or minimal cost, etc

    Goals: - You have 1 final goal: You need to guess your opponent’s strategy and try to maximize your score in the game. The score might depend on winning rate, or minimal cost, etc

  55. [61]

    User property: - The user hides his game strategy which you need to figure out to win the game

  56. [62]

    The interaction rules are as follows: - Interaction Rule 0: The user will first tell you the rule of the game, and the interaction format that must be followed when playing

    Interaction rules: - To finish the goal, you need to interact with the user. The interaction rules are as follows: - Interaction Rule 0: The user will first tell you the rule of the game, and the interaction format that must be followed when playing. In each turn, the user will tell *current turn* and *remaining turns*. - Interaction Rule 1: You can take ...

  57. [63]

    Output format: - **You must strictly obey the output format rules, DO NOT output any unrelated text!**:

  58. [64]

    H B LACK -BOX DETAILS IN ORACLE V 1.0 We introduce the detailed implementation of black-boxes in each task

    Evaluation: - When the given number of interactions is reached, the game ends and we’ll calculate your **score** Now Let’s Play the Game {algorithm}, the Description Is that {description}. H B LACK -BOX DETAILS IN ORACLE V 1.0 We introduce the detailed implementation of black-boxes in each task. H.1 C ODE INTENT INFERENCE (CII) In the context of CII tasks...

  59. [65]

    Copy the subcircuit n times

    Construct a subcircuit of add 0/1: x’[0] = x[0] xor input, y[0] = x[0] and input, x’[1] = x[1] xor y[0], y’[1] = x[1] and y[0], x’[2] = x[2] xor y[1]. Copy the subcircuit n times. Notice that a xor b = (not a and b) or (not b and a). (Hard 6) Arbitrary For input size n = 7 and an arbitrary function f : Z 7 2 → Z2. We can construct a circuit to compute f. ...

  60. [66]

    The system is released from rest

    and the inclined plane’s coordinates are (0, 0, 0). The system is released from rest. After the wooden block separates from the inclined plane, both the wooden block and the inclined plane will move along the smooth horizontal surface. g = 10 m/s2. (Hard 3) Three Balls Collision There are three balls, A, B, and C, with masses of 4 kg, 3 kg, and 2 kg, resp...

  61. [67]

    The value of each letter in ciphertext equals its value in plaintext plus a shift number (i.e

    Suppose the letters are represented by a=0, b=1, ..., z=25, A=26, B=27, ..., Z=51. The value of each letter in ciphertext equals its value in plaintext plus a shift number (i.e. the number of index) and then modulo 52. Then convert the value to the corresponding letter. Keep the blank spaces in the ciphertext. (Easy 9) Curve Cipher There is a table with 9...