MINT: Minimal Information Neuro-Symbolic Tree for Objective-Driven Knowledge-Gap Reasoning and Active Elicitation

Mahdi Imani; Tian Lan; Zeyu Fang

arXiv: 2602.05048 · v2 · submitted 2026-02-04 · cs.AI · cs.HC

Pith reviewed 2026-05-16 07:09 UTC · model grok-4.3

classification: cs.AI · cs.HC

keywords: neuro-symbolic planning · active elicitation · knowledge gaps · human-AI teaming · Markov decision processes · uncertainty estimation · LLM reasoning

The pith

MINT builds a symbolic interaction tree with neural uncertainty estimates to let AI agents elicit minimal human input and reach near-expert planning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Minimal Information Neuro-Symbolic Tree (MINT) to handle incomplete information in human-AI joint planning. MINT constructs a tree of possible human-AI interactions, consults a neural policy to measure how remaining knowledge gaps affect planning outcomes, and then uses an LLM to turn that reasoning into a small set of targeted elicitation queries. Self-play optimizes the overall strategy, and the approach is shown to deliver return guarantees in extended Markov decision processes that include knowledge gaps. On three benchmarks with increasing realism and unseen objects, MINT-based agents achieve near-expert returns, higher rewards, and better success rates while asking only a limited number of questions per task.

Core claim

MINT constructs a symbolic tree by proposing propositions about possible human-AI interactions, consults a neural planning policy to estimate uncertainty in outcomes due to knowledge gaps, and leverages an LLM to search and summarize the tree's reasoning into optimal elicitation queries, thereby enabling objective-driven active elicitation in open-world planning.
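
The tree-building idea in this claim can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the knowledge gap is modeled as a finite set of boolean hypotheses, `value_fn` is a hypothetical stand-in for the trained neural planning policy's return estimate, uncertainty is taken to be the variance of estimated returns across the remaining hypotheses, and the LLM search-and-summarize step is omitted. All names (`MintNode`, `best_proposition`, `build_tree`) are invented for this sketch.

```python
from dataclasses import dataclass, field
from statistics import pvariance
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical representation: one hypothesis is one way of filling the knowledge gap.
Hypothesis = Dict[str, bool]
ValueFn = Callable[[Hypothesis], float]  # stand-in for a neural return estimate


@dataclass
class MintNode:
    """Node of an illustrative interaction tree (not the paper's data structure)."""
    proposition: Optional[str]      # proposition to ask about here; None at a leaf
    hypotheses: List[Hypothesis]    # hypotheses consistent with the answers so far
    children: Dict[bool, "MintNode"] = field(default_factory=dict)


def return_uncertainty(hyps: List[Hypothesis], value_fn: ValueFn) -> float:
    """Variance of estimated returns across the remaining hypotheses."""
    returns = [value_fn(h) for h in hyps]
    return pvariance(returns) if len(returns) > 1 else 0.0


def best_proposition(hyps: List[Hypothesis], propositions: List[str],
                     value_fn: ValueFn) -> Tuple[Optional[str], float]:
    """Return the proposition whose answer most reduces expected return uncertainty."""
    base = return_uncertainty(hyps, value_fn)
    best, best_gain = None, 0.0
    for p in propositions:
        yes = [h for h in hyps if h[p]]
        no = [h for h in hyps if not h[p]]
        if not yes or not no:
            continue  # the answer is already determined, so asking gains nothing
        expected = (len(yes) * return_uncertainty(yes, value_fn)
                    + len(no) * return_uncertainty(no, value_fn)) / len(hyps)
        gain = base - expected
        if gain > best_gain:
            best, best_gain = p, gain
    return best, best_gain


def build_tree(hyps: List[Hypothesis], propositions: List[str],
               value_fn: ValueFn, budget: int) -> MintNode:
    """Expand the tree greedily, stopping when no question changes the plan's value."""
    if budget <= 0:
        return MintNode(None, hyps)
    p, _ = best_proposition(hyps, propositions, value_fn)
    node = MintNode(p, hyps)
    if p is not None:
        rest = [q for q in propositions if q != p]
        for answer in (True, False):
            subset = [h for h in hyps if h[p] == answer]
            node.children[answer] = build_tree(subset, rest, value_fn, budget - 1)
    return node


if __name__ == "__main__":
    # Toy gap: an unseen object may be a hazard and may be red; only the hazard matters.
    hyps = [{"is_hazard": a, "is_red": b} for a in (True, False) for b in (True, False)]
    tree = build_tree(hyps, ["is_red", "is_hazard"],
                      lambda h: 0.0 if h["is_hazard"] else 10.0, budget=2)
    print(tree.proposition)
```

In this toy case the sketch asks only about the property that changes the estimated return and stops once the remaining hypotheses all lead to the same outcome, which is the sense of "minimal information" illustrated here.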

What carries the argument

The Minimal Information Neuro-Symbolic Tree (MINT), which builds propositions of human-AI interactions into a symbolic tree and pairs it with a neural policy that quantifies planning uncertainty caused by unresolved knowledge gaps.

If this is right

Agents using MINT issue a small number of questions per task yet reach near-expert returns on planning problems with unknown objects.
Return guarantees hold for any MINT-augmented policy in extended MDPs that model knowledge gaps.
Self-play on the MINT tree produces elicitation strategies that improve both reward and success rate over baselines without active elicitation.
The same tree-plus-LLM pipeline scales across benchmarks of increasing realism while keeping question counts low.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same structure could be applied to sensor-limited robotics tasks where the agent must decide which human clarification to request before acting.
Replacing the LLM summarizer with a smaller distilled model would test whether the performance gain depends on large-language-model quality.
Extending MINT to multi-turn conversations would allow the tree to be updated incrementally rather than rebuilt from scratch after each answer.

Load-bearing premise

A neural planning policy can reliably estimate how much uncertainty remains from knowledge gaps, and an LLM can accurately search and summarize the MINT tree to produce the best elicitation queries.

What would settle it

Run the three benchmark tasks with MINT disabled versus enabled; if the version without MINT, which asks no questions, matches or exceeds the reported rewards and success rates, the central performance claim is false.
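
The shape of such a comparison can be sketched as follows. This is a toy harness under stated assumptions, not the paper's benchmarks: `run_episode` is a hypothetical one-step task in which an unknown object is a hazard half of the time, and the reward values are invented for illustration. What it shows is the reporting pattern, paired seeds plus a mean and standard error for each arm.

```python
import random
from math import sqrt
from statistics import mean, stdev
from typing import Dict, List, Tuple


def run_episode(ask_human: bool, rng: random.Random) -> float:
    """Toy task (hypothetical): reaching the goal pays +10, a hazard costs 10, a detour costs 2."""
    is_hazard = rng.random() < 0.5
    if ask_human:
        # One question resolves the gap, so the agent detours only when needed.
        return 10.0 - (2.0 if is_hazard else 0.0)
    # Without elicitation the agent always takes the short path and gambles.
    return 10.0 - (10.0 if is_hazard else 0.0)


def summarize(returns: List[float]) -> Tuple[float, float]:
    """Mean return and its standard error."""
    return mean(returns), stdev(returns) / sqrt(len(returns))


def compare(n_episodes: int = 1000, seed: int = 0) -> Dict[str, Tuple[float, float]]:
    """Evaluate both arms on the same random draws so the comparison is paired."""
    results = {}
    for name, ask in (("with_elicitation", True), ("without_elicitation", False)):
        rng = random.Random(seed)  # same seed gives the same hazard sequence per arm
        results[name] = summarize([run_episode(ask, rng) for _ in range(n_episodes)])
    return results


if __name__ == "__main__":
    for name, (avg, stderr) in compare().items():
        print(f"{name}: mean return {avg:.2f} +/- {stderr:.2f}")
```

Reusing the seed across arms keeps environment randomness from being mistaken for an effect of elicitation; a real evaluation would also log the number of questions asked per task.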

Figures

Figures reproduced from arXiv: 2602.05048 by Mahdi Imani, Tian Lan, Zeyu Fang.

**Figure 1.** Evaluating, expanding, curating, and acting with MINT. (a) How we build and expand MINT by first consulting a trained neural planning policy as an oracle, and then utilizing the LLM to curate the queries based on MINT and elicit human responses via natural-language interactions. (b) How MINT acts in the environment. AI agent implements the identified queries in its interaction with human in joint planning.…

**Figure 2.** Illustrations of how MINT acts in all 3 environments. (a) The agent faces unknown objects in MiniGrid and curates queries about its impact on transition; (b) The agent in Atari Pacman faces unseen targets (white) and curates queries about its impact on rewards; and (c) The agent in Isaac Search and Rescue reasons about the smoke, interacts with human, and plans its path accordingly.

**Figure 3.** Screenshots of the environments used in this paper. (a) MiniGrid; (b) Atari Pacman; (c-1) an overview of NVIDIA Isaac environment; (c-2) an example of drone view in Isaac environment. The Atari Pacman environment is mainly based on its original game setting. However, we inject an uncertain object marked as the white rectangle in the raw frames. This uncertain object either has an effect on the transition or rew…

Original abstract

Joint planning through language-based interactions is a key area of human-AI teaming. Planning problems in the open world often involve various aspects of incomplete information and unknowns, e.g., objects involved, human goals/intents -- thus leading to knowledge gaps in joint planning. We consider the problem of discovering optimal interaction strategies for AI agents to actively elicit human inputs in objective-driven planning. To this end, we propose Minimal Information Neuro-Symbolic Tree (MINT) to reason about the impact of knowledge gaps and leverage self-play with MINT to optimize the AI agent's elicitation strategies and queries. More precisely, MINT builds a symbolic tree by making propositions of possible human-AI interactions and by consulting a neural planning policy to estimate the uncertainty in planning outcomes caused by remaining knowledge gaps. Finally, we leverage an LLM to search and summarize MINT's reasoning process and curate a set of queries to optimally elicit human inputs for best planning performance. By considering a family of extended Markov decision processes with knowledge gaps, we analyze the return guarantee for a given MINT with active human elicitation. Our evaluation on three benchmarks involving unseen/unknown objects of increasing realism shows that MINT-based planning attains near-expert returns by issuing a limited number of questions per task while achieving significantly improved rewards and success rates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MINT offers a fresh neuro-symbolic method for eliciting information in open-world planning, but its claims need stronger validation of the uncertainty-estimation and LLM components.

The paper's main point is that MINT builds a symbolic tree to model possible interactions and knowledge gaps in human-AI planning, uses a neural policy to gauge uncertainty in outcomes during self-play, and then has an LLM search and summarize the tree to pick good queries for the human. They show this leads to near-expert performance with few questions on benchmarks with unknown objects, and they derive a return guarantee for the extended MDP setting.

What stands out as new is the specific integration of the minimal information tree with neural uncertainty estimation and LLM-mediated query generation. The self-play optimization for elicitation strategies is also a solid step beyond just static planning. It does well in providing that theoretical return guarantee, which gives some assurance that the approach can work in the family of MDPs they consider. The description of how the tree is built by making propositions about interactions is clear enough from the abstract.

The soft spots are around the practical reliability. As the stress-test notes, if the neural estimates of uncertainty aren't accurate or the LLM doesn't faithfully represent the tree's reasoning, then the queries won't be optimal and the performance numbers won't hold up. Since no ablations are mentioned that isolate these parts, it's difficult to know how much the results depend on the quality of the underlying neural policy or the LLM rather than the MINT structure itself. The abstract doesn't include any numbers, baselines, or error bars, so even if the full paper has them, the lack of detail in the summary makes it tough to assess the strength of the evidence right away. The guarantee might also be tied too closely to the self-play setup without independent verification.

This paper is for folks working on interactive AI systems and neuro-symbolic methods for planning under uncertainty. Someone looking for new techniques in active information gathering for teaming tasks would find the framework useful to build on.

I think it deserves a serious referee. The idea is substantive and the formal analysis is there, so peer review can push on the experimental validation and any potential circularity in the guarantees.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Minimal Information Neuro-Symbolic Tree (MINT) framework to address knowledge gaps in joint human-AI planning tasks involving incomplete information about objects, goals, and intents. MINT constructs symbolic trees of possible interactions, consults a neural planning policy to estimate outcome uncertainty from remaining gaps, optimizes elicitation strategies through self-play, and uses an LLM to search and summarize the tree for generating optimal queries. It provides a return-guarantee analysis for a family of extended MDPs and reports empirical results on three benchmarks with unseen objects, claiming near-expert returns with a limited number of questions per task along with significantly improved rewards and success rates.

Significance. If the empirical performance claims and the return guarantee hold under independent verification, the work would represent a meaningful advance in objective-driven active elicitation for open-world planning, demonstrating how neuro-symbolic trees combined with self-play and LLM summarization can reduce interaction overhead while preserving high task performance. The integration of uncertainty estimation over knowledge gaps with formal MDP analysis is a constructive direction for human-AI teaming.

major comments (3)

[Abstract] Abstract: The central empirical claims (near-expert returns, significantly improved rewards and success rates on three benchmarks) are stated without any quantitative numbers, baseline comparisons, error bars, or specific metrics, preventing verification of the magnitude and statistical reliability of the reported gains.
[Return Guarantee Analysis] Return guarantee analysis (extended-MDP setting): The guarantee is derived from the same family of MDPs used for self-play optimization of the MINT policy; without explicit independence between the policy parameters and the bound (e.g., via a separate derivation or worst-case analysis), the guarantee risks reducing to a tautological or fitted quantity rather than an independent performance certificate.
[Methods and Evaluation] Methods and evaluation sections: No ablation experiments isolate the neural planning policy's uncertainty estimation or the LLM's tree-search/summarization steps, both of which are load-bearing for attributing benchmark improvements unambiguously to MINT rather than to the quality of the underlying LLM or neural policy.

minor comments (2)

[Abstract] The acronym MINT is expanded in the title but the abstract introduces it without immediate expansion; define on first use for clarity.
[Preliminaries] Notation for the extended MDP family and the propositions in the symbolic tree should be introduced with explicit definitions and an example tree diagram to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to improve clarity, rigor, and verifiability.

Referee: [Abstract] Abstract: The central empirical claims (near-expert returns, significantly improved rewards and success rates on three benchmarks) are stated without any quantitative numbers, baseline comparisons, error bars, or specific metrics, preventing verification of the magnitude and statistical reliability of the reported gains.

Authors: We agree that the abstract should include quantitative support. In the revised manuscript, we have updated the abstract to incorporate the specific metrics, baseline comparisons, and error bars already reported in the evaluation section, allowing readers to directly assess the magnitude and reliability of the gains.
Revision: yes

Referee: [Return Guarantee Analysis] Return guarantee analysis (extended-MDP setting): The guarantee is derived from the same family of MDPs used for self-play optimization of the MINT policy; without explicit independence between the policy parameters and the bound (e.g., via a separate derivation or worst-case analysis), the guarantee risks reducing to a tautological or fitted quantity rather than an independent performance certificate.

Authors: We thank the referee for this observation. The return guarantee is derived analytically for the entire family of extended MDPs with knowledge gaps and holds for any MINT-based policy in that family; self-play is used only to select a high-performing policy within the family and does not enter the bound derivation. We have revised the relevant section to explicitly separate the general worst-case analysis from the optimization procedure and to restate the independence of the certificate.
Revision: yes

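
The independence argument in this response can be made concrete with a generic Lipschitz-type sketch. The notation is assumed for illustration and is not taken from the paper: $\theta$ denotes the unknown parameters of the extended MDP, $\hat\theta$ the agent's belief after elicitation, $J_\theta(\pi)$ the expected return of policy $\pi$ under $\theta$, $d$ a metric on parameters, and $L$ a constant. The paper's actual theorem, its constants, and its locality conditions are not reproduced here.

```latex
% Assumption (generic, uniform over policies \pi):
%   |J_\theta(\pi) - J_{\theta'}(\pi)| \le L \, d(\theta, \theta').
% Let \pi^*_\theta be optimal under \theta and \hat\pi = \pi^*_{\hat\theta}.
\begin{aligned}
J_\theta(\pi^*_\theta) - J_\theta(\hat\pi)
  &= \underbrace{J_\theta(\pi^*_\theta) - J_{\hat\theta}(\pi^*_\theta)}_{\le\, L\, d(\theta,\hat\theta)}
   + \underbrace{J_{\hat\theta}(\pi^*_\theta) - J_{\hat\theta}(\hat\pi)}_{\le\, 0 \text{ (optimality of } \hat\pi \text{ under } \hat\theta)}
   + \underbrace{J_{\hat\theta}(\hat\pi) - J_\theta(\hat\pi)}_{\le\, L\, d(\theta,\hat\theta)} \\
  &\le 2L \, d(\theta, \hat\theta).
\end{aligned}
```

In a bound of this shape, the right-hand side depends only on how far the agent's belief is from the truth, not on how the policy was selected, which is the kind of separation between the certificate and the self-play procedure that the authors assert.
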
Referee: [Methods and Evaluation] Methods and evaluation sections: No ablation experiments isolate the neural planning policy's uncertainty estimation or the LLM's tree-search/summarization steps, both of which are load-bearing for attributing benchmark improvements unambiguously to MINT rather than to the quality of the underlying LLM or neural policy.

Authors: We acknowledge that targeted ablations would strengthen attribution. In the revised manuscript we have added ablation experiments that (i) replace the neural uncertainty estimator with a uniform heuristic and (ii) replace LLM summarization with exhaustive tree traversal, reporting the resulting drops in return and success rate on all three benchmarks. These results confirm the contribution of each component.
Revision: yes
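
Ablation (i) can be pictured with a small sketch. Everything here is hypothetical: the scoring functions, the boolean-hypothesis representation of the knowledge gap, and `value_fn` as a stand-in for the neural return estimate are assumptions made for illustration, not the authors' code. The point is only how swapping the scorer changes which question gets asked.

```python
import random
from statistics import pvariance
from typing import Callable, Dict, List

Hypothesis = Dict[str, bool]
ValueFn = Callable[[Hypothesis], float]


def spread(hyps: List[Hypothesis], value_fn: ValueFn) -> float:
    """Variance of estimated returns over the remaining hypotheses."""
    vals = [value_fn(h) for h in hyps]
    return pvariance(vals) if len(vals) > 1 else 0.0


def uncertainty_score(prop: str, hyps: List[Hypothesis], value_fn: ValueFn) -> float:
    """Expected drop in return spread if `prop` is answered."""
    yes = [h for h in hyps if h[prop]]
    no = [h for h in hyps if not h[prop]]
    after = (len(yes) * spread(yes, value_fn) + len(no) * spread(no, value_fn)) / len(hyps)
    return spread(hyps, value_fn) - after


def uniform_score(prop: str, hyps: List[Hypothesis], value_fn: ValueFn) -> float:
    """Ablated estimator: every proposition looks equally useful."""
    return 1.0


def pick_query(props: List[str], hyps: List[Hypothesis], value_fn: ValueFn,
               score_fn: Callable[[str, List[Hypothesis], ValueFn], float],
               rng: random.Random) -> str:
    """Choose the highest-scoring proposition, breaking ties at random."""
    scores = {p: score_fn(p, hyps, value_fn) for p in props}
    top = max(scores.values())
    return rng.choice([p for p in props if scores[p] == top])


if __name__ == "__main__":
    hyps = [{"is_hazard": a, "is_red": b} for a in (True, False) for b in (True, False)]
    value_fn = lambda h: 0.0 if h["is_hazard"] else 10.0  # only the hazard matters
    rng = random.Random(0)
    props = ["is_red", "is_hazard"]
    print("full:   ", pick_query(props, hyps, value_fn, uncertainty_score, rng))
    print("ablated:", pick_query(props, hyps, value_fn, uniform_score, rng))
```

With the uniform scorer the agent may spend a question on a property that does not affect the return, which is one way the reported drop in return per question could arise.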

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and context describe MINT construction via neural policy uncertainty estimates, self-play optimization of elicitation strategies, LLM summarization of the tree, and a separate analysis of return guarantees over a family of extended MDPs with knowledge gaps. No equations, self-citations, or derivations are quoted that reduce the central performance claims or guarantees to fitted inputs by construction, self-definitional loops, or load-bearing self-citations. The self-play step and MDP-family analysis follow standard RL practice and remain independent of the reported benchmark results. This is the expected non-finding for a paper whose core claims rest on empirical evaluation rather than a closed mathematical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that neural policies can estimate planning uncertainty from knowledge gaps and that LLM summarization produces optimal queries; these components are introduced by the paper without independent external validation in the provided abstract.

axioms (1)

Domain assumption: planning problems can be modeled as extended Markov decision processes that include explicit knowledge gaps.
Invoked when analyzing the return guarantee for MINT with active elicitation.
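
One way to picture this assumption is a family of MDPs that share states and actions but whose dynamics and rewards depend on an unknown parameter. The sketch below is an assumption-laden toy, not the paper's formalism: `GapMDP`, the boolean parameter `theta`, the three-state corridor, and all reward values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int
Action = str
Theta = bool  # hypothetical knowledge gap: is the unknown cell a hazard?
Policy = Dict[State, Action]


@dataclass(frozen=True)
class GapMDP:
    """A family of MDPs indexed by an unknown parameter theta."""
    actions: Tuple[Action, ...]
    step: Callable[[State, Action, Theta], Tuple[State, float, bool]]
    gamma: float = 0.95


def corridor_step(s: State, a: Action, theta: Theta) -> Tuple[State, float, bool]:
    """State 0 is the start, state 1 lies past the unknown cell, state 2 is the goal."""
    if s == 0:
        if a == "forward":
            # Walking through the unknown cell is cheap unless it is a hazard.
            return 1, (-10.0 if theta else -1.0), False
        return 1, -3.0, False  # the detour has a fixed cost regardless of theta
    return 2, 10.0, True       # from state 1 any action reaches the goal


def make_corridor() -> GapMDP:
    return GapMDP(actions=("forward", "detour"), step=corridor_step)


def rollout(mdp: GapMDP, policy: Policy, theta: Theta, max_steps: int = 20) -> float:
    """Discounted return of a deterministic policy under one value of theta."""
    s, total, discount = 0, 0.0, 1.0
    for _ in range(max_steps):
        s, r, done = mdp.step(s, policy[s], theta)
        total += discount * r
        discount *= mdp.gamma
        if done:
            break
    return total


def return_gap(mdp: GapMDP, policy: Policy, candidates: List[Policy], theta: Theta) -> float:
    """How much return `policy` loses against the best candidate under the true theta."""
    best = max(rollout(mdp, p, theta) for p in candidates)
    return best - rollout(mdp, policy, theta)


if __name__ == "__main__":
    mdp = make_corridor()
    forward = {0: "forward", 1: "forward"}
    detour = {0: "detour", 1: "forward"}
    for theta in (False, True):
        print(theta, return_gap(mdp, forward, [forward, detour], theta))
```

In this toy, a policy planned as if the cell were safe loses return only when the cell is in fact a hazard; resolving `theta` with a question removes that gap, which is the quantity a return guarantee over such a family would need to bound.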

invented entities (1)

Minimal Information Neuro-Symbolic Tree (MINT) · no independent evidence
purpose: To represent possible human-AI interactions and quantify uncertainty from remaining knowledge gaps
New structure proposed by the paper; no independent evidence supplied in the abstract.

pith-pipeline@v0.9.0 · 5539 in / 1343 out tokens · 43237 ms · 2026-05-16T07:09:28.265681+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each tag below describes the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "MINT builds a symbolic tree by making propositions of possible human-AI interactions and by consulting a neural planning policy to estimate the uncertainty in planning outcomes caused by remaining knowledge gaps."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "we prove a local pseudo-Lipschitz continuity of the planning returns and provide an upper bound on the return-gap"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear
Paper passage: "By considering a family of extended Markov decision processes with knowledge gaps"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NonZero: Interaction-Guided Exploration for Multi-Agent Monte Carlo Tree Search
cs.LG · 2026-05 · unverdicted · novelty 7.0

NonZero introduces an interaction score and bandit-formalized proposal rule for local agent deviations in multi-agent MCTS, delivering a sublinear local-regret guarantee and improved sample efficiency on game benchmar...

