pith. sign in

hub Canonical reference

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Canonical reference. 72% of citing Pith papers cite this work as background.

90 Pith papers citing it
Background 72% of classified citations
abstract

Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%. Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.

hub tools

citation-role summary

background 18 baseline 3 method 3 dataset 1

citation-polarity summary

claims ledger

  • abstract Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoug

co-cited works

representative citing papers

Test-Time Hinting for Black-Box Vision-Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.

Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.

Navigating the Conceptual Multiverse

cs.HC · 2026-04-20 · unverdicted · novelty 7.0

The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choices explicit and changeable.

Feedback-Driven Execution for LLM-Based Binary Analysis

cs.CR · 2026-04-16 · unverdicted · novelty 7.0

FORGE uses a reasoning-action-observation loop and Dynamic Forest of Agents to perform scalable LLM-based binary analysis, finding 1,274 vulnerabilities across 591 of 3,457 real-world firmware binaries at 72.3% precision and broader coverage than prior methods.

IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.

Robust Reasoning Benchmark

cs.LG · 2026-03-26 · unverdicted · novelty 7.0 · 2 refs

The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems due to attention dilution during chain-of-thought.

citing papers explorer

Showing 50 of 90 citing papers.