Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489
5 Pith papers cite this work.
citing papers explorer
- Query-Conditioned Test-Time Self-Training for Large Language Models
  QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.
- Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
  LLM planning in four-in-a-row is myopic: move choices match a shallow model that ignores deep nodes expanded in reasoning traces.
- ANO: A Principled Approach to Robust Policy Optimization
  ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF experiments.
- InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling
  InternBootcamp supplies 1000+ verifiable, auto-generated task environments across domains that enable task scaling to improve LLM reasoning, producing a 32B model with state-of-the-art results on the new Bootcamp-EVAL benchmark.
- General Agentic Planning Through Simulative Reasoning with World Models
  SiRA uses LLM world models for simulative reasoning to achieve up to 124% higher task completion and 32.2% navigation success versus reactive baselines in web environments.