Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Darrien McKenzie; Nicklas Hansen; Xiaolong Wang

arxiv: 2606.19750 · v1 · pith:6TJHK3DAnew · submitted 2026-06-18 · 💻 cs.LG · cs.AI· cs.CL

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Darrien McKenzie , Nicklas Hansen , Xiaolong Wang This is my paper

Pith reviewed 2026-06-26 18:27 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords manifold banditscurriculum learninglarge language modelsBayesian learningreinforcement learningproblem samplingtask manifold

0 comments

The pith

Problem sampling as a manifold bandit with Bayesian learning over a hierarchical task tree improves LLM training beyond difficulty prioritization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing curriculum learning methods overlook the structured nature of the task space in LLMs by treating problems as independent. It introduces a framework that models problem relations through the model's latent representation manifold and uses Bayesian methods on a hierarchical task tree to guide sampling. This approach reveals that sampling strategies involve tradeoffs among productivity, diversity, and utility, and that difficulty alone is not enough for strong performance. Incorporating structure and type-awareness leads to better downstream results.

Core claim

Framing problem sampling as a manifold-structured bandit problem with endogenous non-stationarity allows Bayesian learning to steer learning signals across the latent space, and empirical results show that structure-aware sampling outperforms difficulty-only methods by balancing multiple objectives.

What carries the argument

Bayesian Manifold Curriculum (BMC) organizes problems into a hierarchical task tree and uses Bayesian learning to guide sampling decisions over the manifold of the model's latent representations.

If this is right

Different sampling strategies induce tradeoffs between productivity, diversity, and utility.
Prioritizing difficulty alone is insufficient for strong downstream performance.
Incorporating structure and type-awareness into problem sampling improves results.
Sampling decisions can steer how learning signals evolve across the task manifold.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending this to other RL settings beyond LLMs might reveal similar manifold structures in task spaces.
Testing BMC on different model scales could show how the latent geometry changes with model size.
The hierarchical task tree construction might be generalized to automatic discovery without manual organization.

Load-bearing premise

Problems are related through the model's latent representation space in a manifold structure that can be organized into a hierarchical task tree, allowing sampling decisions to steer learning signals in a way Bayesian learning can exploit.

What would settle it

A controlled experiment where a standard difficulty-based curriculum achieves equivalent or better downstream performance than BMC on the same tasks would falsify the claim that structure and type-awareness are necessary.

Figures

Figures reproduced from arXiv: 2606.19750 by Darrien McKenzie, Nicklas Hansen, Xiaolong Wang.

**Figure 1.** Figure 1: Bayesian Manifold Curriculum (BMC). BMC uses Bayesian learning over the policy model’s latent task manifold to construct training batches from productive and diverse problem types. BMC improves the trade-off between training efficiency and broad data coverage. Abstract Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training eff… view at source ↗

**Figure 2.** Figure 2: Latent Task Tree construction and example structure. Left: We construct the Latent Task Tree recursively from LLM (policy) embeddings. At each node, embeddings are reduced with PCA, tested for approximate chart-like structure, projected with UMAP, and clustered with HDBSCAN; recursion continues only when meaningful substructure remains. Right: A (partial) Latent Task Tree induced from DAPO-Math-17K, showin… view at source ↗

**Figure 3.** Figure 3: Bayesian Manifold Curriculum (BMC). Left: During hierarchical Thompson sampling, each batch element independently descends the Latent Task Tree according to sampled reward beliefs, recursively selecting child nodes until reaching a prompt. The figure illustrates the trajectory of a single agent for visualization purposes. Right: After rollout rewards are observed, prompt-level beliefs are updated and propa… view at source ↗

**Figure 4.** Figure 4: Training efficiency across sampling strategies. Dynamic sampling achieves the highest effective ratio and learning signal, but incurs substantially higher wall-clock time due to repeated batch regeneration. In contrast, Difficulty Only and BMC achieve comparable learning speed while maintaining training times close to uniform sampling. Uniform sampling exhibits lower learning signal and slower training pro… view at source ↗

**Figure 5.** Figure 5: Coverage and information sharing across sampling strategies. Left: Rarity-weighted exposure, measuring coverage of underrepresented regions of the task manifold. BMC balances the diversity-focused behavior of the tree-only ablation with the productivity-focused behavior of Difficulty Only. Right: Structure gain, measuring how much variation in learning signal is explained by the Latent Task Tree relative t… view at source ↗

**Figure 6.** Figure 6: Evaluation performance across benchmark categories. Results are shown for Qwen3-8B-Base trained with GSPO; additional settings are provided in Appendix L.6, and perstep trajectories for this run are shown in [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Utility-aware sampling changes downstream capability profiles while preserving productivity. Top: BMC and the two BMC-T variants exhibit similar productivity during training, with comparable learning speed, effective ratio, and learning signal. Bottom: Despite similar productivity, the variants produce different evaluation outcomes depending on the target distribution used to bias sampling. Targeting AIME2… view at source ↗

**Figure 8.** Figure 8: Periodic tree reconstruction. We compare standard BMC, which keeps the Latent Task Tree fixed throughout training, against a variant that reconstructs the tree every 100 steps after round-robin initialization. magnitude of the changes suggests that the static tree remains a reasonable approximation over the training horizon. We therefore keep the tree fixed in the main experiments for simplicity and stabil… view at source ↗

**Figure 9.** Figure 9: Productivity and evaluation behavior under the standard bandit pattern. Top: Training productivity metrics on DAPO-Math-17K. MoPPS and the Difficulty Only ablation both improve learning speed and effective ratio relative to uniform sampling while avoiding the wall-clock overhead of dynamic sampling. Bottom: Evaluation performance across English mathematics, Chinese mathematics, and out-of-distribution scie… view at source ↗

**Figure 10.** Figure 10: Evaluation curves. Performance across evaluations for Qwen3-8B-base while training on DAPO-Math-17k using GSPO. These evaluation trajectories correspond to the bar graphs visualized in [PITH_FULL_IMAGE:figures/full_fig_p037_10.png] view at source ↗

**Figure 11.** Figure 11: Training and evaluation dynamics on AlphaMed19K. Top: Training productivity metrics for Uniform Sampling, Dynamic Sampling, Difficulty Only, Tree Only, and BMC. Dynamic Sampling maintains the highest effective ratio and reward variance, but incurs substantially higher wall-clock time due to repeated filtering. BMC and Difficulty Only improve learning speed (training-set pass@1) relative to Uniform Sampli… view at source ↗

**Figure 12.** Figure 12: Latent Task Tree for DAPO-Math-17k using Qwen3-8B-Base’s latent space. The dataset contains mathematical reasoning problems in both Chinese and English. Themes such as Chinese and Euclidean Geometry are not tree nodes, but descriptive annotations used to summarize the condensed view of the full tree shown in [PITH_FULL_IMAGE:figures/full_fig_p042_12.png] view at source ↗

**Figure 13.** Figure 13: Latent Task Tree for DeepCoder-Preview-Dataset [Luo et al., 2025] using DeepSeekR1-Distilled-Qwen-14B’s latent space. The dataset consists of 24K coding problems. The induced tree contains 49 nodes. On one node with 8×H100 GPUs, loading the latent representations took 12:56 minutes and recursive tree construction took 7:22 minutes, for a total construction time of 20:18 minutes. 43 [PITH_FULL_IMAGE:figu… view at source ↗

**Figure 14.** Figure 14: Latent Task Tree for AlphaMed19k [Liu et al., 2025] using Qwen3-8B’s latent space. The dataset contains medical Q&A multiple-choice problems across varying topics. The induced tree contains 67 nodes. On one node with 8×H100 GPUs, loading the latent representations took 10:07 minutes and recursive tree construction took 10:46 minutes, for a total construction time of 20:53 minutes. Since the corresponding … view at source ↗

**Figure 15.** Figure 15: Latent Task Tree for BarExamQA [Zheng et al., 2025c] using the latent space of Llama-3.1-8B-Instruct [Grattafiori et al., 2024]. The dataset consists of 1K multistate bar exam (MBE) questions. The induced tree contains 56 nodes. On one node with 8×H100 GPUs, loading the latent representations took 52 seconds and recursive tree construction took 51.5 seconds, for a total construction time of 1:44 minutes. … view at source ↗

**Figure 16.** Figure 16: Latent Task Tree for Agentar-DeepFinance-100K [Zhao et al., 2025] using the latent space of Gemma-4-12B [Team et al., 2024] The dataset consists of 100K financial reasoning problems. The dataset is predominantly Chinese. A very small minority of prompts appear to be intended for SFT rather than RLVR. The induced tree contains 51 nodes. On one node with 8×H100 GPUs, loading the latent representations took … view at source ↗

**Figure 17.** Figure 17: Latent Task Tree for GURU-92k [Cheng et al., 2026a] using Qwen3-14B-Base’s latent space. GURU-92k is a multi-domain reasoning dataset spanning math, code, science, logic, simulation, and tabular reasoning problems, allowing us to test whether latent task trees can organize heterogeneous training data beyond a single domain. The induced tree contains 54 nodes. On one node with 8×H100 GPUs, loading the late… view at source ↗

**Figure 18.** Figure 18: Latent Task Tree for IF-RLVR [Pyatkin et al., 2025] using the latent space of Qwen3.5-9B-Base [Qwen Team, 2026] The dataset consists of 95K instruction-following problems spanning mathematics, coding, NLP tasks, multilingual QA, writing, safety-sensitive requests, and other domains. Each prompt pairs a base task with automatically verifiable output constraints, so the induced hierarchy reflects both seman… view at source ↗

**Figure 19.** Figure 19: Latent Task Tree for Geometry3K Lu et al. [2021] using Qwen3-VL-32B-Instruct’s [Bai et al., 2025] latent space. The dataset consists of multimodal geometry problems, and the induced tree contains 54 nodes. On one node with 8×H100 GPUs, loading the latent representations took 3:26 minutes, and recursive tree construction took 2:21 minutes, for a total construction time of 5:47 minutes. Because our cluster … view at source ↗

**Figure 20.** Figure 20: Additional information for [PITH_FULL_IMAGE:figures/full_fig_p050_20.png] view at source ↗

**Figure 21.** Figure 21: Dataset imbalance and frontier imbalance. Dataset imbalance arises from the static composition of the training set, while frontier imbalance arises from the policy-dependent distribution of problems that currently provide learning signal. A productivity-oriented sampler may therefore concentrate on a subset of problem types either because they are common in the dataset, overrepresented on the current fro… view at source ↗

**Figure 22.** Figure 22: Problem skipping in Dynamic Sampling. Dynamic Sampling scans candidate prompts and adds prompts with nonzero reward variance to a composite batch. Once the composite batch is filled, later effective prompts in the current partition may be deferred until a future pass. When effective prompts are unevenly distributed across problem types or dataset partitions, this mechanism can interact with dataset and fr… view at source ↗

**Figure 23.** Figure 23: Dynamic Sampling and minority-type exposure. The training data is dominated by English mathematics problems, while Chinese mathematics problems form a smaller subset. For the 8B model, Dynamic Sampling improves strongly on English mathematics evaluations but plateaus on Chinese mathematics evaluations relative to uniform sampling. This behavior is consistent with an interaction between dataset imbalance, … view at source ↗

**Figure 24.** Figure 24: Evaluation curves. Performance across evaluations for Qwen3-8B-base while training on DAPO-Math-17k using GSPO. This corresponds to the bar graphs visualized in [PITH_FULL_IMAGE:figures/full_fig_p063_24.png] view at source ↗

**Figure 25.** Figure 25: Evaluation curves. Performance across evaluations for Qwen3-4B-base while training on DAPO-Math-17k using GSPO. We refer to Appendix F for additional intuition and discussion on why the performances between the 8B and 4B models can vary on the same dataset. 64 [PITH_FULL_IMAGE:figures/full_fig_p064_25.png] view at source ↗

**Figure 26.** Figure 26: Evaluation curves. Performance across evaluations for Qwen3-4B-base while training on DAPO-Math-17k using GRPO. We refer to Appendix F for additional intuition and discussion on why the performances between the algorithms can vary on the same model-dataset combination. 65 [PITH_FULL_IMAGE:figures/full_fig_p065_26.png] view at source ↗

**Figure 27.** Figure 27: Evaluation curves. Performance across evaluations for Qwen3-8B-base while training on DAPO-Math-17k using variations of BMC-T. These evaluation trajectories correspond to the bar graphs visualized in [PITH_FULL_IMAGE:figures/full_fig_p066_27.png] view at source ↗

read the original abstract

Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper frames LLM curriculum as manifold bandits with a new BMC method but supplies no details or results to test whether the latent structure actually drives gains.

read the letter

The main takeaway is that this work tries to replace standard difficulty-based sampling in LLM RL with a structure-aware bandit that uses the model's latent space as a manifold and Bayesian updates on a hierarchical task tree. The BMC framing is the fresh piece.

It does a clean job laying out why independent-arm bandits miss the heterogeneous, related nature of tasks and why sampling can create tradeoffs across productivity, diversity, and utility. The point that difficulty alone is not enough is stated plainly and matches what many practitioners observe.

The soft spots are exactly where the stress-test note says: the abstract treats the manifold and stable task tree as given modeling choices but shows no construction method, no check that the tree reflects real geometry instead of imposed clusters, and no comparison against non-manifold adaptive baselines. Without those pieces there is no way to attribute any downstream improvement to the manifold component itself. The empirical tradeoffs are asserted but not quantified.

This is for people already working on adaptive sampling or geometric methods inside LLM training loops. Someone outside that niche will not get much from the high-level description.

If the full paper contains reproducible experiments, clear manifold extraction details, and head-to-head results against existing curricula, it is worth sending to referees. Right now the central assumptions remain untested, so the contribution stays at the level of a plausible new modeling choice.

Referee Report

2 major / 0 minor

Summary. The paper frames problem sampling for RL-based LLM reasoning improvement as a manifold-structured bandit problem with endogenous non-stationarity, where problems relate via the model's latent representation space. It introduces Bayesian Manifold Curriculum (BMC), which organizes problems into a hierarchical task tree and applies Bayesian learning for sampling guidance. The central empirical claim is that sampling strategies induce tradeoffs among productivity, diversity, and utility, such that difficulty-only prioritization is insufficient and structure/type-awareness yields stronger downstream performance.

Significance. If the empirical tradeoffs and attribution to manifold structure hold, the work would be significant for curriculum learning in LLMs by moving beyond independent-arm bandits to geometry-aware sampling. No machine-checked proofs, reproducible code, or parameter-free derivations are described.

major comments (2)

[Abstract] Abstract: The claim that 'different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility' and that 'prioritizing difficulty alone is insufficient' is presented without any quantitative results, baselines, error bars, dataset details, or experimental protocol. This is load-bearing for the conclusion that manifold structure and type-awareness improve performance.
[Abstract] Abstract: The modeling choice that 'problems are related through the model's latent representation space' and can be 'organized into a hierarchical task tree' for Bayesian updates is stated without construction details, validation that the tree reflects genuine task geometry (vs. imposed clustering), or ablation showing the Bayesian manifold component outperforms non-manifold adaptive baselines. This is load-bearing for attributing any gains to the proposed structure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed feedback on the abstract claims and modeling choices. We address each major comment below, noting that the full manuscript contains the supporting experimental details and method descriptions, but we agree the abstract can be strengthened for self-containment.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that 'different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility' and that 'prioritizing difficulty alone is insufficient' is presented without any quantitative results, baselines, error bars, dataset details, or experimental protocol. This is load-bearing for the conclusion that manifold structure and type-awareness improve performance.

Authors: We agree the abstract, due to length constraints, summarizes the empirical findings at a high level without including specific quantitative results or protocol details. These are fully reported with baselines, error bars, datasets, and protocols in the Experiments section. The claims are supported by those results. We will revise the abstract to include a concise mention of key empirical outcomes to better ground the statements. revision: yes
Referee: [Abstract] Abstract: The modeling choice that 'problems are related through the model's latent representation space' and can be 'organized into a hierarchical task tree' for Bayesian updates is stated without construction details, validation that the tree reflects genuine task geometry (vs. imposed clustering), or ablation showing the Bayesian manifold component outperforms non-manifold adaptive baselines. This is load-bearing for attributing any gains to the proposed structure.

Authors: Construction of the latent-space embeddings, hierarchical task tree, and Bayesian update procedure are detailed in the Method section, along with validation experiments and ablations versus non-manifold adaptive baselines in the Experiments section. These elements attribute performance gains to the manifold-aware components. We will revise the abstract to briefly note that full construction and validation details appear in the main text. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected; modeling choices presented without self-referential reductions

full rationale

The abstract and description introduce BMC as a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling, framing the task space as manifold-structured with endogenous non-stationarity. No equations, fitted parameters, predictions that reduce to inputs by construction, or load-bearing self-citations are visible. The claims about tradeoffs between productivity, diversity, and utility, and the insufficiency of difficulty-only sampling, are presented as empirical observations without any derivation step that equates outputs to the modeling assumptions themselves. The paper's central premise relies on external validation of the manifold and tree structure rather than internal self-definition, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be extracted or audited.

pith-pipeline@v0.9.1-grok · 5721 in / 1060 out tokens · 26999 ms · 2026-06-26T18:27:25.134639+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

68 extracted references · 3 canonical work pages · 1 internal anchor

[1]

ISBN 979-8-89176-332-6

URLhttps://openreview.net/forum?id=8HvWBamUkS. Dana Arad, Aaron Mueller, and Yonatan Belinkov. Saes are good for steering – if you select the right features. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, page 10252–10270. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.emnlp-main.519. URLh...

work page doi:10.18653/v1/2025.emnlp-main.519 2025
[2]

Correlated bandits or: How to minimize mean-squared error online

URLhttps://arxiv.org/abs/1902.02953. Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration for multi-armed bandit problems, 2010. URLhttps://arxiv.org/abs/0802.2655. Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander. Density-based clustering based on hierarchical density estimates. In Jian Pei, Vincent S. Tseng, Longbing Cao, Hiroshi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/tit.2021.3081508 1902
[3]

Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, and Mengnan Du

URLhttps://arxiv.org/abs/2310.16828. Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, and Mengnan Du. Saif: A sparse autoencoder framework for interpreting and steering instruction following of language models, 2025. URL https://arxiv.org/abs/2502.11356. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, D...

Pith/arXiv arXiv 2025
[4]

Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, and Hao Wang

URLhttps://arxiv.org/abs/2506.08725. Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, and Hao Wang. Vcrl: Variance-based curriculum reinforcement learning for large language models.arXiv preprint arXiv:2509.19803, 2025. Xia Jiang and Rong J. B. Zhu. Empirical bayesian multi-bandit learning, 2025. URL https://arxiv.org/ abs...

arXiv 2025
[5]

Ilia Mahrooghi, Aryo Lotfi, and Emmanuel Abbe

Notion Blog. Ilia Mahrooghi, Aryo Lotfi, and Emmanuel Abbe. Goldilocks rl: Tuning task difficulty to escape sparse rewards for reasoning, 2026. URLhttps://arxiv.org/abs/2602.14868. Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacher-student curriculum learning, 2017. URLhttps://arxiv.org/abs/1707.00183. Maxwell-Jia. Aime 2024 dataset.ht...

Pith/arXiv arXiv 2026
[6]

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu

Hugging Face dataset, v202412. Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa : A large-scale multi-subject multi-choice dataset for medical domain question answering, 2022. URL https://arxiv.org/abs/2203. 14371. Sandeep Pandey, Deepak Agarwal, Deepayan Chakrabarti, and Vanja Josifovski. Bandits for taxonomies: A model-based appro...

arXiv 2022
[7]

Xinyu Tang, Zhenduo Zhang, Yurou Liu, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, and Jun Zhou

Accessed: 2026-05-04. Xinyu Tang, Zhenduo Zhang, Yurou Liu, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, and Jun Zhou. Towards high data efficiency in reinforcement learning with verifiable reward, 2025. URL https://arxiv.org/ abs/2509.01321. Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane...

arXiv 2026
[8]

Manning, Peter Hender- son, and Daniel E

URLhttps://arxiv.org/abs/2510.19178. Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, et al. mhc: Manifold-constrained hyper-connections.arXiv preprint arXiv:2512.24880, 2025. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huan...

work page doi:10.1145/3709025.3712219 2025
[9]

bandit pattern

show that token input embeddings can contain singularities and non-constant local dimension, violating the requirements of a smooth global manifold. We view this as a critique of a strongglobal manifold assumption, whereas BMC adopts a weaker operational view: prompt representations need only exhibit enoughlocalgeometric regularity to support neighborhood...

2025
[10]

assign each promptx i an estimated learning valuev i
[11]

initialize{v i}N i=1 uniformly or from a prior model
[12]

for each training stept: (a) sample a batch of individual prompts according to their estimated values, often without replacement; (b) generate rollouts and observe rewards for the sampled prompts; (c) update the values of the sampled prompts using the observed rollout rewards
[13]

difficulty awareness,

repeat as the policy evolves. The standard bandit pattern is therefore a strong mechanism for improving sampling productivity, but it does not explicitly control several dimensions that can be important when training LLMs on broad, heterogeneous mixtures. In particular, it does not directly controldiversityover problem types orutilityrelative to a target ...

2025
[14]

0 2% E x tr ema l S tructur e s
[15]

3 6 % C om b inat orics 1 .25% Geometr y 1

39% Diagr am s 3 . 3 6 % C om b inat orics 1 .25% Geometr y 1 . 0 2% M i x e d Visual s 1 . 0 9% ??? 16 . 89% P olynomials 4 . 52% P olynomials 4 . 52% ??? 30 .40% S et s 1 . 7 0% Analytic Geometr y
[16]

34% Sequences & R ecurr ences 3

16% Series & Sums 1 . 34% Sequences & R ecurr ences 3 . 05% Sequences & R ecurr ences 3 . 05% Discr et e Algebr a 10 . 64% Discr et e Algebr a 10 . 64% ??? 3 0 .4 0 % S eries & S um s 1 . 34% F loors & C eiling s 1 . 33% I nt eger C ountin g 1 .48% Di v isors & Digit s 1 . 87% Number R elation s 1 . 3 0 % F unctiona l E q uation s 1 . 1 6 % N umbe r Theor y
[17]

8 4% Mone y & Ex change 1 .49% P r obability 3

05% Algebr a Syst ems 4 . 8 4% Mone y & Ex change 1 .49% P r obability 3 . 60% Geometr y - L ogic Mi x 7 .42% Gr aph Theor y 2.26% Games & Str at egy 1 . 99% Grids & Boar ds
[18]

34% T riangle Geometr y 3 .2 8 % Quad ri la t er al G eometr y 1

39% C ir cle Geometr y 1 . 34% T riangle Geometr y 3 .2 8 % Quad ri la t er al G eometr y 1 . 01% Eucli d ean Geometr y 11 .2 4% P oly go n Geometr y 5 . 61% Algebr aic Expr essions 1 .23% Symmetric Equations
[19]

01% M o d ular Arit hmeti c 3 .49% Digit P r oblem s 1

60% T rigonometr y 1 . 01% M o d ular Arit hmeti c 3 .49% Digit P r oblem s 1 . 0 7% M inimu m I nt eger 1 . 11% Dio p hantin e P r oblem s 2.2 6 % Digit Manipulation 1 .2 4% Modular Exponentiation 1 . 58% Di v isors & C ountin g 1 . 3 0 % M one y & Ex change Clust er P olynomial Clust er R e cu rs io n S tat i st ic s C har t T est (64 . 86%) I nsu ff ic...
[20]

54% Sequences 1 .2 4% String Manipulation
[21]

12%) Insufficient Clust ers (5

36% Arr a ys & Subarr a ys Clust er String Clust er R e cu rsi o n Statisti c s Char t T est (94 . 12%) Insufficient Clust ers (5 . 88%) Figure 13.Latent Task Tree forDeepCoder-Preview-Dataset[Luo et al., 2025] using DeepSeek- R1-Distilled-Qwen-14B’s latent space.The dataset consists of 24K coding problems. The induced tree contains 49 nodes. On one node ...

2025
[22]

14% Mechanisms 48

35% I n f ectio n T est s 0 . 14% Mechanisms 48 . 64% Bact eria & R esistance 1 . 63% V accines & V iruses 1 . 98% Cancer Drugs 0 . 7 3% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% Anest hesia 1 . 09% XXX xx.xx% T r eatment Drugs 2.28% XXX xx.xx% Syst emic 1 .4 1% P at hology 3 .23% Aut oimmunue
[23]

11% Mechanisms 48

08% Bones & ENT 1 . 11% Mechanisms 48 . 64% Embr y ology 1 . 00% Sur gical Anat om y
[24]

71% Mechanisms 48

62% Medical L a w 0 . 71% Mechanisms 48 . 64% P o p ulatio n H ealt h 1 . 05% Misc . F act s 1 .47% Digestiv e 0 . 08% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% Cancer 1 . 61% XXX xx.xx% K idne y 1 . 54% XXX xx.xx% Clinical Cases 49 . 90% Mechanisms 48 . 64% Mechanisms 48 . 64% Pr egnancy 1 . 82% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64%...
[25]

64% Mechanisms 48

7 9% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% I n f ections
[26]

64% Mechanisms 48

06% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% L ung Disease 1 . 15% XXX xx.xx% Emer gency Cases 6 . 98% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% H ear t Disease
[27]

64% R ep r o d u c ti v e & B r e a st 2.23% Mechanisms 48

06% XXX xx.xx% Mechanisms 48 . 64% R ep r o d u c ti v e & B r e a st 2.23% Mechanisms 48 . 64% Endocrine
[28]

64% Chr onic Meds 1

34% Mechanisms 48 . 64% Chr onic Meds 1 . 02% Mechanisms 48 . 64% Ge n e ti c s & De v e lo p m e n t 1 . 7 4% Dermat ology
[29]

64% Ne w bor n Car e 1

58% Mechanisms 48 . 64% Ne w bor n Car e 1 . 99% Mechanisms 48 . 64% ??? 11 . 32% A cut e Abdome n 2.26% Mechanisms 48 . 64% Digestiv e Disor ders 1 . 7 3% L iv er & P or tal 0 . 14% Et hics & Consent 0 . 54% XXX xx.xx% Mot or & S p ine 1 . 7 9% XXX xx.xx% T r auma & T o x icology
[30]

32% XXX xx.xx% Br ain & Ey e
[31]

04% Mechanisms 48

32% XXX xx.xx% P ediatric S k i n 0 . 04% Mechanisms 48 . 64% Chil d I n f ections 1 . 33% S k i n Disor ders 1 .21% A cut e Car e 0 . 04% XXX xx.xx% T r auma 1 . 14% XXX xx.xx% T o x icology 1 . 14% XXX xx.xx% V accines & Viruses Clust er Ne wborn Car e Clust er R e cu rs io n S tat i st ic s Insufficient Clust ers (100%) Char t T est (0 . 00%) Figure 14...

2025
[32]

18% E x clusions 5

57% A ut he nti c atio n 1 . 18% E x clusions 5 . 67% H earsa y 3 . 64% W itnesses 5 . 03% Pr obable Cause 0 . 64% Sear ches 1 . 50% Conf essions 1 .28% F oundation
[33]

96% Privile g e

03% Char act e r 0 . 96% Privile g e
[34]

93% N e g li g ence 4

67% R is k 3 .21% T r es p ass 1 . 93% N e g li g ence 4 . 92% Pr oduct s 3 . 64% Con v e y ance
[35]

67% Cont r a c t s 13 .48% Co v enant s 1 .28% Easement s
[36]

32% Land Sale 3

03% R ecor din g 3 . 32% Land Sale 3 . 7 4% Estat es 4 . 71% B r eac h
[37]

39% Pr o x imat e 0

57% Causation 1 . 39% Pr o x imat e 0 . 96% P er f ormance 5 . 67% E n f o r ceme n t
[38]

50% Li f e E stat es 1

89% F ormation 3 .42% R emedies 1 . 50% Li f e E stat es 1 . 39% Cot enanc y 1 . 50% F utur e I nt er est s 1 . 82% O ff ers 1 . 0 7% A cceptance 1 . 39% U CC / G oods 0 . 96% Land Sale Clust er R obber y Clust er R ecursion Statistics Insufficient Clust ers (100 . 00%) Char t T est (0 . 00%) Figure 15.Latent Task Tree forBarExamQA[Zheng et al., 2025c] us...

2024
[39]

51% Compan y L a w 1

9 8 % In v estment F unds 1 . 51% Compan y L a w 1 . 4 0% R i s k M a n a g e m e n t
[40]

5 8 % Insur ance 1

47 % Managerial Finance 1 . 5 8 % Insur ance 1 . 51% Macr o Finance 1 . 62% Capital Budgeting 1 . 13% R oot (all pr oblems) Financial R epor t s 5 . 0 4 % Securities Rules 1 . 05% Business Str at eg y 1 . 05% Securities Compliance 1 . 31% St oc k Sentiment
[41]

6 7 % T a x A ccounting 11 .21% E v ent E x tr actio n 2.23% Industr y N e w s 5

69% N e w s T as k s 9 . 6 7 % T a x A ccounting 11 .21% E v ent E x tr actio n 2.23% Industr y N e w s 5 . 7 5% Mar k et N e w s 1 . 6 8 % A sset A ccounting 1 . 53% A ccounting Entries 3 . 02% A ccounting Mat h
[42]

96% Mar k et Causalit y 3

35% Ris k Entities 1 .25% Finance Concept s 5 . 96% Mar k et Causalit y 3 . 7 7 % P u b lic Finance 3 . 16% Sett lement L a w 1 . 3 4 % Contr act L a w 1 . 7 4 % P olic y N e w s 0 . 8 5% Compan y E v ent s
[43]

03% Mar k et T r ends
[44]

84 % P ersonal Finance 4

87 % Economics 1 . 84 % P ersonal Finance 4 . 0 4 % A d v ice Mi x 0 . 0 8 % T ax E x a m
[45]

48 % T ax V aluation
[46]

01% V alue-A dded T ax 1

09% T a x L a w 1 .2 4 % A ccounting P r o b lems 3 . 01% V alue-A dded T ax 1 . 3 8 % J ournal Entries 1 . 02% Economics Clust er P ersonal Finance Clust er R ecursion Statistics A sset Calculations 1 . 8 1% Financial A sset s 1 .20% Insufficient Clust ers (9 7 . 5%) Char t T est (2. 5%) Figure 16.Latent Task Tree forAgentar-DeepFinance-100K[Zhao et al.,...

2025
[47]

64% Mechanisms 48

04% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% Logic Chains 1 . 64% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% Pyt hon K a t as 5 . 64% XXX xx.xx% Mechanisms 48 . 64% A tCode r 3 . 67% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% M a t h / Cod e M i x 58 . 00% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% Code f or ces 11 . 96% ...
[48]

64% Puzzle Code 3

34% XXX xx.xx% Mechanisms 48 . 64% Puzzle Code 3 . 92% R epor t T ables 1 . 65% Mechanisms 48 . 64% Stat . T ables 3 . 68% V olumes 0 . 13% Mechanisms 48 . 64% Chem. & Fluids 1 . 11% Mechanics
[49]

64% Demogr aphic T ables 1 .22% Char t Mat h 1

65% Sur v e y T ables 1 .27% Mechanisms 48 . 64% Demogr aphic T ables 1 .22% Char t Mat h 1 . 19% Geometr y 12.40% Mechanisms 48 . 64% Algebr a 1 . 09% Mechanisms 48 . 64% Planes / 2D 1 .20% Mechanisms 48 . 64% Mechanisms 48 . 64% Solids / 3D 1 . 39% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% XXX xx.xx% ??? 5 . 80% Mechanisms 48 . 64% Mechanisms 4...
[50]

64% Mechanisms 48

00% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% T rigonome t r y
[51]

85% Mechanisms 48

02% XXX xx.xx% Analytical Geometr y 1 . 85% Mechanisms 48 . 64% Cir cle Geometr y 1 . 65% Hybrid Geometr y
[52]

51% Mechanisms 48

30% Con t e st M i x 44 . 51% Mechanisms 48 . 64% K ata Mi x 13 . 82% Mechanisms 48 . 64% D iscr et e Mat h 4 . 72% Mechanisms 48 . 64% Mechanisms 48 . 64% Chinese Olympiad 1 . 58% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% W o r d M a t h 1 . 37% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% AMC W or d 4 . 94% XXX xx.xx% Mechanisms 48 . 64% ...
[53]

64% Mechanisms 48

0 7% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% F unction Ev al . 1 .40% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% Motion 1 . 19% XXX xx.xx% Co mb ina t o r ia l G eo m e tr y 1 . 06% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% Algebr a Mi x 3 . 38% XXX xx.xx% Mechanisms 48 . 64% Gr id Coun t 1 . 94% N umbe r Theor y 2.49% Mechanis...
[54]

63% Or der ed Set s 3

17% R ecurr ences 1 . 63% Or der ed Set s 3 . 09% Mechanisms 48 . 64% Pr obability 1 . 00% XXX xx.xx% Arit hmeti c 1 . 87% XXX xx.xx% A rr angemen t s
[55]

12% XXX xx.xx% I nequalities 0

0 7% XXX xx.xx% Symboli c 2.23% XXX xx.xx% P olynomials 1 . 12% XXX xx.xx% I nequalities 0 . 02% XXX xx.xx% F unction Pr oper ties 1 .21% Mechanisms 48 . 64% Equation Solving 1 . 19% Logic Chains Clust er Dem og r a p hic T a ble s Clust er R e cu rs io n S tat i st ic s I ns uff ici e n t C lu s t er s (85 .29%) Cha r t T e s t (14 . 71%) Figure 17.Laten...
[56]

53% Natur al Language Inf er ence

71% Common Sense 1 . 53% Natur al Language Inf er ence
[57]

51% C op y T as k s 1

00% R e s ear c h T a sks 1 .20% Cr eativ e Fiction 1 . 51% C op y T as k s 1 . 83% Ed g y / ??? 9 .27% In d ic Q A 1 . 1 6 % R egiona l Q A 2.21% A frican Q A 1 .2 6 % T r a n s l atio n T as k s
[58]

7 4 % Gener al R e q ue s t s 1 9

0 9 % Co d e H el p 2.2 6 % P y t h on F unction s 3 . 7 4 % Gener al R e q ue s t s 1 9 . 0 4 % M at h 6 . 82% Appl ie d P r o b l em s 31 . 4 5% NL P 3 . 33% Pr o g r a mming T ask s 6 . 6 0% Basic Mat h 3 . 30% Algebr a Pr oof s 2.21% Geometr y 1 . 31% Op timi z ation M o d e ls 6 . 33% Modeling Pr oblems 7 . 6 1% Arit hmetic St ories 10 . 82% A d v an...
[59]

88% L a b e l i ng T as k s 1

82% Dy nami c M o d e ls 3 . 88% L a b e l i ng T as k s 1 . 1 6 % L ing u istic T a s k s 1 . 05% T e xt A na lys i s 1 . 12% D ecisio n Mo d e l s 4 . 8 4 % D iscr et e Mat h 1 .2 6 % M i x tur e Op timi z ation 0 .22% R at e A rit h meti c
[60]

6 0% M u l ti - S t e p A rit h meti c 3

51% E v er y d a y Arit hmetic 4 . 6 0% M u l ti - S t e p A rit h meti c 3 . 7 0% Alg orit h ms 4 . 0 7% Da t a S c ien c e 1 . 4 5% Sys t e m D e s ign 1 . 08% Misc . R e qu est s 6 . 11% Sensiti v e R e qu est s 7 . 9 5% W ritin g R e q ue s t s 3 . 37% Con s tr aine d A n sw er s 1 . 6 1% Mat eria l Q u a n tities 0 .25% Da il y Q ua ntities 3 . 10% M...

2025
[61]

33% T angent s

18% Inscribed 1 . 33% T angent s
[62]

Ratios 1

9 2% T rig . Ratios 1 . 00% Congruence 1 . 96 % P r o p or tions 1 . 7 8% Figur e Solv e
[63]

7 4% Segment Ratios 1 .44% Similarity I 1

0 7% Segment s 1 . 7 4% Segment Ratios 1 .44% Similarity I 1 . 11% Quadrilat er als
[64]

Algebr a 2.48% R ounding I 1 .22% P ar aellogr ams 1

18% Similarity II 1 .48% Q uad . Algebr a 2.48% R ounding I 1 .22% P ar aellogr ams 1 . 81% R ounding II 1 .4 1% Ar ea 4 . 37% Angles & Lengt hs 1 . 04% Ar cs I 2.48% Angle Chasing 1 . 7 8% Ar cs II
[65]

0 7% R elations I 3 .40% Angles I 1

7 0% R oot (all pr oblems) Angle Algebr a 1 . 0 7% R elations I 3 .40% Angles I 1 . 85% R elations II 4 . 18% Angles II
[66]

00% Angles III 1

00% Diagr am Only 1 . 00% Angles III 1 . 04% V isual I 1 . 5 9 % P erimet e r & Ar ea 7 . 7 4% V isual II 1 . 5 9 % M i x ed V 1 . 96 % V isual III 1 . 7 0% V isual IV 2.22% V isual V 1 .4 1% M i x ed II
[67]

30% V isual V I I 3 .48% V isual V II I

00% M i x ed III 3 .48% M i x ed IV 1 .22% V isual V I 1 . 30% V isual V I I 3 .48% V isual V II I
[68]

Goldilocks

04% V isual IX 1 . 6 3% M i x ed I 11 .2 9 % Figure 19.Latent Task Tree forGeometry3KLu et al. [2021] using Qwen3-VL-32B-Instruct’s [Bai et al., 2025] latent space.The dataset consists of multimodal geometry problems, and the induced tree contains 54 nodes. On one node with 8×H100 GPUs, loading the latent representations took 3:26 minutes, and recursive t...

2021

[1] [1]

ISBN 979-8-89176-332-6

URLhttps://openreview.net/forum?id=8HvWBamUkS. Dana Arad, Aaron Mueller, and Yonatan Belinkov. Saes are good for steering – if you select the right features. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, page 10252–10270. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.emnlp-main.519. URLh...

work page doi:10.18653/v1/2025.emnlp-main.519 2025

[2] [2]

Correlated bandits or: How to minimize mean-squared error online

URLhttps://arxiv.org/abs/1902.02953. Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration for multi-armed bandit problems, 2010. URLhttps://arxiv.org/abs/0802.2655. Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander. Density-based clustering based on hierarchical density estimates. In Jian Pei, Vincent S. Tseng, Longbing Cao, Hiroshi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/tit.2021.3081508 1902

[3] [3]

Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, and Mengnan Du

URLhttps://arxiv.org/abs/2310.16828. Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, and Mengnan Du. Saif: A sparse autoencoder framework for interpreting and steering instruction following of language models, 2025. URL https://arxiv.org/abs/2502.11356. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, D...

Pith/arXiv arXiv 2025

[4] [4]

Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, and Hao Wang

URLhttps://arxiv.org/abs/2506.08725. Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, and Hao Wang. Vcrl: Variance-based curriculum reinforcement learning for large language models.arXiv preprint arXiv:2509.19803, 2025. Xia Jiang and Rong J. B. Zhu. Empirical bayesian multi-bandit learning, 2025. URL https://arxiv.org/ abs...

arXiv 2025

[5] [5]

Ilia Mahrooghi, Aryo Lotfi, and Emmanuel Abbe

Notion Blog. Ilia Mahrooghi, Aryo Lotfi, and Emmanuel Abbe. Goldilocks rl: Tuning task difficulty to escape sparse rewards for reasoning, 2026. URLhttps://arxiv.org/abs/2602.14868. Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacher-student curriculum learning, 2017. URLhttps://arxiv.org/abs/1707.00183. Maxwell-Jia. Aime 2024 dataset.ht...

Pith/arXiv arXiv 2026

[6] [6]

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu

Hugging Face dataset, v202412. Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa : A large-scale multi-subject multi-choice dataset for medical domain question answering, 2022. URL https://arxiv.org/abs/2203. 14371. Sandeep Pandey, Deepak Agarwal, Deepayan Chakrabarti, and Vanja Josifovski. Bandits for taxonomies: A model-based appro...

arXiv 2022

[7] [7]

Xinyu Tang, Zhenduo Zhang, Yurou Liu, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, and Jun Zhou

Accessed: 2026-05-04. Xinyu Tang, Zhenduo Zhang, Yurou Liu, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, and Jun Zhou. Towards high data efficiency in reinforcement learning with verifiable reward, 2025. URL https://arxiv.org/ abs/2509.01321. Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane...

arXiv 2026

[8] [8]

Manning, Peter Hender- son, and Daniel E

URLhttps://arxiv.org/abs/2510.19178. Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, et al. mhc: Manifold-constrained hyper-connections.arXiv preprint arXiv:2512.24880, 2025. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huan...

work page doi:10.1145/3709025.3712219 2025

[9] [9]

bandit pattern

show that token input embeddings can contain singularities and non-constant local dimension, violating the requirements of a smooth global manifold. We view this as a critique of a strongglobal manifold assumption, whereas BMC adopts a weaker operational view: prompt representations need only exhibit enoughlocalgeometric regularity to support neighborhood...

2025

[10] [10]

assign each promptx i an estimated learning valuev i

[11] [11]

initialize{v i}N i=1 uniformly or from a prior model

[12] [12]

for each training stept: (a) sample a batch of individual prompts according to their estimated values, often without replacement; (b) generate rollouts and observe rewards for the sampled prompts; (c) update the values of the sampled prompts using the observed rollout rewards

[13] [13]

difficulty awareness,

repeat as the policy evolves. The standard bandit pattern is therefore a strong mechanism for improving sampling productivity, but it does not explicitly control several dimensions that can be important when training LLMs on broad, heterogeneous mixtures. In particular, it does not directly controldiversityover problem types orutilityrelative to a target ...

2025

[14] [14]

0 2% E x tr ema l S tructur e s

[15] [15]

3 6 % C om b inat orics 1 .25% Geometr y 1

39% Diagr am s 3 . 3 6 % C om b inat orics 1 .25% Geometr y 1 . 0 2% M i x e d Visual s 1 . 0 9% ??? 16 . 89% P olynomials 4 . 52% P olynomials 4 . 52% ??? 30 .40% S et s 1 . 7 0% Analytic Geometr y

[16] [16]

34% Sequences & R ecurr ences 3

16% Series & Sums 1 . 34% Sequences & R ecurr ences 3 . 05% Sequences & R ecurr ences 3 . 05% Discr et e Algebr a 10 . 64% Discr et e Algebr a 10 . 64% ??? 3 0 .4 0 % S eries & S um s 1 . 34% F loors & C eiling s 1 . 33% I nt eger C ountin g 1 .48% Di v isors & Digit s 1 . 87% Number R elation s 1 . 3 0 % F unctiona l E q uation s 1 . 1 6 % N umbe r Theor y

[17] [17]

8 4% Mone y & Ex change 1 .49% P r obability 3

05% Algebr a Syst ems 4 . 8 4% Mone y & Ex change 1 .49% P r obability 3 . 60% Geometr y - L ogic Mi x 7 .42% Gr aph Theor y 2.26% Games & Str at egy 1 . 99% Grids & Boar ds

[18] [18]

34% T riangle Geometr y 3 .2 8 % Quad ri la t er al G eometr y 1

39% C ir cle Geometr y 1 . 34% T riangle Geometr y 3 .2 8 % Quad ri la t er al G eometr y 1 . 01% Eucli d ean Geometr y 11 .2 4% P oly go n Geometr y 5 . 61% Algebr aic Expr essions 1 .23% Symmetric Equations

[19] [19]

01% M o d ular Arit hmeti c 3 .49% Digit P r oblem s 1

60% T rigonometr y 1 . 01% M o d ular Arit hmeti c 3 .49% Digit P r oblem s 1 . 0 7% M inimu m I nt eger 1 . 11% Dio p hantin e P r oblem s 2.2 6 % Digit Manipulation 1 .2 4% Modular Exponentiation 1 . 58% Di v isors & C ountin g 1 . 3 0 % M one y & Ex change Clust er P olynomial Clust er R e cu rs io n S tat i st ic s C har t T est (64 . 86%) I nsu ff ic...

[20] [20]

54% Sequences 1 .2 4% String Manipulation

[21] [21]

12%) Insufficient Clust ers (5

36% Arr a ys & Subarr a ys Clust er String Clust er R e cu rsi o n Statisti c s Char t T est (94 . 12%) Insufficient Clust ers (5 . 88%) Figure 13.Latent Task Tree forDeepCoder-Preview-Dataset[Luo et al., 2025] using DeepSeek- R1-Distilled-Qwen-14B’s latent space.The dataset consists of 24K coding problems. The induced tree contains 49 nodes. On one node ...

2025

[22] [22]

14% Mechanisms 48

35% I n f ectio n T est s 0 . 14% Mechanisms 48 . 64% Bact eria & R esistance 1 . 63% V accines & V iruses 1 . 98% Cancer Drugs 0 . 7 3% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% Anest hesia 1 . 09% XXX xx.xx% T r eatment Drugs 2.28% XXX xx.xx% Syst emic 1 .4 1% P at hology 3 .23% Aut oimmunue

[23] [23]

11% Mechanisms 48

08% Bones & ENT 1 . 11% Mechanisms 48 . 64% Embr y ology 1 . 00% Sur gical Anat om y

[24] [24]

71% Mechanisms 48

62% Medical L a w 0 . 71% Mechanisms 48 . 64% P o p ulatio n H ealt h 1 . 05% Misc . F act s 1 .47% Digestiv e 0 . 08% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% Cancer 1 . 61% XXX xx.xx% K idne y 1 . 54% XXX xx.xx% Clinical Cases 49 . 90% Mechanisms 48 . 64% Mechanisms 48 . 64% Pr egnancy 1 . 82% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64%...

[25] [25]

64% Mechanisms 48

7 9% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% I n f ections

[26] [26]

64% Mechanisms 48

06% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% L ung Disease 1 . 15% XXX xx.xx% Emer gency Cases 6 . 98% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% H ear t Disease

[27] [27]

64% R ep r o d u c ti v e & B r e a st 2.23% Mechanisms 48

06% XXX xx.xx% Mechanisms 48 . 64% R ep r o d u c ti v e & B r e a st 2.23% Mechanisms 48 . 64% Endocrine

[28] [28]

64% Chr onic Meds 1

34% Mechanisms 48 . 64% Chr onic Meds 1 . 02% Mechanisms 48 . 64% Ge n e ti c s & De v e lo p m e n t 1 . 7 4% Dermat ology

[29] [29]

64% Ne w bor n Car e 1

58% Mechanisms 48 . 64% Ne w bor n Car e 1 . 99% Mechanisms 48 . 64% ??? 11 . 32% A cut e Abdome n 2.26% Mechanisms 48 . 64% Digestiv e Disor ders 1 . 7 3% L iv er & P or tal 0 . 14% Et hics & Consent 0 . 54% XXX xx.xx% Mot or & S p ine 1 . 7 9% XXX xx.xx% T r auma & T o x icology

[30] [30]

32% XXX xx.xx% Br ain & Ey e

[31] [31]

04% Mechanisms 48

32% XXX xx.xx% P ediatric S k i n 0 . 04% Mechanisms 48 . 64% Chil d I n f ections 1 . 33% S k i n Disor ders 1 .21% A cut e Car e 0 . 04% XXX xx.xx% T r auma 1 . 14% XXX xx.xx% T o x icology 1 . 14% XXX xx.xx% V accines & Viruses Clust er Ne wborn Car e Clust er R e cu rs io n S tat i st ic s Insufficient Clust ers (100%) Char t T est (0 . 00%) Figure 14...

2025

[32] [32]

18% E x clusions 5

57% A ut he nti c atio n 1 . 18% E x clusions 5 . 67% H earsa y 3 . 64% W itnesses 5 . 03% Pr obable Cause 0 . 64% Sear ches 1 . 50% Conf essions 1 .28% F oundation

[33] [33]

96% Privile g e

03% Char act e r 0 . 96% Privile g e

[34] [34]

93% N e g li g ence 4

67% R is k 3 .21% T r es p ass 1 . 93% N e g li g ence 4 . 92% Pr oduct s 3 . 64% Con v e y ance

[35] [35]

67% Cont r a c t s 13 .48% Co v enant s 1 .28% Easement s

[36] [36]

32% Land Sale 3

03% R ecor din g 3 . 32% Land Sale 3 . 7 4% Estat es 4 . 71% B r eac h

[37] [37]

39% Pr o x imat e 0

57% Causation 1 . 39% Pr o x imat e 0 . 96% P er f ormance 5 . 67% E n f o r ceme n t

[38] [38]

50% Li f e E stat es 1

89% F ormation 3 .42% R emedies 1 . 50% Li f e E stat es 1 . 39% Cot enanc y 1 . 50% F utur e I nt er est s 1 . 82% O ff ers 1 . 0 7% A cceptance 1 . 39% U CC / G oods 0 . 96% Land Sale Clust er R obber y Clust er R ecursion Statistics Insufficient Clust ers (100 . 00%) Char t T est (0 . 00%) Figure 15.Latent Task Tree forBarExamQA[Zheng et al., 2025c] us...

2024

[39] [39]

51% Compan y L a w 1

9 8 % In v estment F unds 1 . 51% Compan y L a w 1 . 4 0% R i s k M a n a g e m e n t

[40] [40]

5 8 % Insur ance 1

47 % Managerial Finance 1 . 5 8 % Insur ance 1 . 51% Macr o Finance 1 . 62% Capital Budgeting 1 . 13% R oot (all pr oblems) Financial R epor t s 5 . 0 4 % Securities Rules 1 . 05% Business Str at eg y 1 . 05% Securities Compliance 1 . 31% St oc k Sentiment

[41] [41]

6 7 % T a x A ccounting 11 .21% E v ent E x tr actio n 2.23% Industr y N e w s 5

69% N e w s T as k s 9 . 6 7 % T a x A ccounting 11 .21% E v ent E x tr actio n 2.23% Industr y N e w s 5 . 7 5% Mar k et N e w s 1 . 6 8 % A sset A ccounting 1 . 53% A ccounting Entries 3 . 02% A ccounting Mat h

[42] [42]

96% Mar k et Causalit y 3

35% Ris k Entities 1 .25% Finance Concept s 5 . 96% Mar k et Causalit y 3 . 7 7 % P u b lic Finance 3 . 16% Sett lement L a w 1 . 3 4 % Contr act L a w 1 . 7 4 % P olic y N e w s 0 . 8 5% Compan y E v ent s

[43] [43]

03% Mar k et T r ends

[44] [44]

84 % P ersonal Finance 4

87 % Economics 1 . 84 % P ersonal Finance 4 . 0 4 % A d v ice Mi x 0 . 0 8 % T ax E x a m

[45] [45]

48 % T ax V aluation

[46] [46]

01% V alue-A dded T ax 1

09% T a x L a w 1 .2 4 % A ccounting P r o b lems 3 . 01% V alue-A dded T ax 1 . 3 8 % J ournal Entries 1 . 02% Economics Clust er P ersonal Finance Clust er R ecursion Statistics A sset Calculations 1 . 8 1% Financial A sset s 1 .20% Insufficient Clust ers (9 7 . 5%) Char t T est (2. 5%) Figure 16.Latent Task Tree forAgentar-DeepFinance-100K[Zhao et al.,...

2025

[47] [47]

64% Mechanisms 48

04% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% Logic Chains 1 . 64% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% Pyt hon K a t as 5 . 64% XXX xx.xx% Mechanisms 48 . 64% A tCode r 3 . 67% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% M a t h / Cod e M i x 58 . 00% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% Code f or ces 11 . 96% ...

[48] [48]

64% Puzzle Code 3

34% XXX xx.xx% Mechanisms 48 . 64% Puzzle Code 3 . 92% R epor t T ables 1 . 65% Mechanisms 48 . 64% Stat . T ables 3 . 68% V olumes 0 . 13% Mechanisms 48 . 64% Chem. & Fluids 1 . 11% Mechanics

[49] [49]

64% Demogr aphic T ables 1 .22% Char t Mat h 1

65% Sur v e y T ables 1 .27% Mechanisms 48 . 64% Demogr aphic T ables 1 .22% Char t Mat h 1 . 19% Geometr y 12.40% Mechanisms 48 . 64% Algebr a 1 . 09% Mechanisms 48 . 64% Planes / 2D 1 .20% Mechanisms 48 . 64% Mechanisms 48 . 64% Solids / 3D 1 . 39% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% XXX xx.xx% ??? 5 . 80% Mechanisms 48 . 64% Mechanisms 4...

[50] [50]

64% Mechanisms 48

00% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% T rigonome t r y

[51] [51]

85% Mechanisms 48

02% XXX xx.xx% Analytical Geometr y 1 . 85% Mechanisms 48 . 64% Cir cle Geometr y 1 . 65% Hybrid Geometr y

[52] [52]

51% Mechanisms 48

30% Con t e st M i x 44 . 51% Mechanisms 48 . 64% K ata Mi x 13 . 82% Mechanisms 48 . 64% D iscr et e Mat h 4 . 72% Mechanisms 48 . 64% Mechanisms 48 . 64% Chinese Olympiad 1 . 58% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% W o r d M a t h 1 . 37% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% AMC W or d 4 . 94% XXX xx.xx% Mechanisms 48 . 64% ...

[53] [53]

64% Mechanisms 48

0 7% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% F unction Ev al . 1 .40% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% Motion 1 . 19% XXX xx.xx% Co mb ina t o r ia l G eo m e tr y 1 . 06% XXX xx.xx% Mechanisms 48 . 64% Mechanisms 48 . 64% Algebr a Mi x 3 . 38% XXX xx.xx% Mechanisms 48 . 64% Gr id Coun t 1 . 94% N umbe r Theor y 2.49% Mechanis...

[54] [54]

63% Or der ed Set s 3

17% R ecurr ences 1 . 63% Or der ed Set s 3 . 09% Mechanisms 48 . 64% Pr obability 1 . 00% XXX xx.xx% Arit hmeti c 1 . 87% XXX xx.xx% A rr angemen t s

[55] [55]

12% XXX xx.xx% I nequalities 0

0 7% XXX xx.xx% Symboli c 2.23% XXX xx.xx% P olynomials 1 . 12% XXX xx.xx% I nequalities 0 . 02% XXX xx.xx% F unction Pr oper ties 1 .21% Mechanisms 48 . 64% Equation Solving 1 . 19% Logic Chains Clust er Dem og r a p hic T a ble s Clust er R e cu rs io n S tat i st ic s I ns uff ici e n t C lu s t er s (85 .29%) Cha r t T e s t (14 . 71%) Figure 17.Laten...

[56] [56]

53% Natur al Language Inf er ence

71% Common Sense 1 . 53% Natur al Language Inf er ence

[57] [57]

51% C op y T as k s 1

00% R e s ear c h T a sks 1 .20% Cr eativ e Fiction 1 . 51% C op y T as k s 1 . 83% Ed g y / ??? 9 .27% In d ic Q A 1 . 1 6 % R egiona l Q A 2.21% A frican Q A 1 .2 6 % T r a n s l atio n T as k s

[58] [58]

7 4 % Gener al R e q ue s t s 1 9

0 9 % Co d e H el p 2.2 6 % P y t h on F unction s 3 . 7 4 % Gener al R e q ue s t s 1 9 . 0 4 % M at h 6 . 82% Appl ie d P r o b l em s 31 . 4 5% NL P 3 . 33% Pr o g r a mming T ask s 6 . 6 0% Basic Mat h 3 . 30% Algebr a Pr oof s 2.21% Geometr y 1 . 31% Op timi z ation M o d e ls 6 . 33% Modeling Pr oblems 7 . 6 1% Arit hmetic St ories 10 . 82% A d v an...

[59] [59]

88% L a b e l i ng T as k s 1

82% Dy nami c M o d e ls 3 . 88% L a b e l i ng T as k s 1 . 1 6 % L ing u istic T a s k s 1 . 05% T e xt A na lys i s 1 . 12% D ecisio n Mo d e l s 4 . 8 4 % D iscr et e Mat h 1 .2 6 % M i x tur e Op timi z ation 0 .22% R at e A rit h meti c

[60] [60]

6 0% M u l ti - S t e p A rit h meti c 3

51% E v er y d a y Arit hmetic 4 . 6 0% M u l ti - S t e p A rit h meti c 3 . 7 0% Alg orit h ms 4 . 0 7% Da t a S c ien c e 1 . 4 5% Sys t e m D e s ign 1 . 08% Misc . R e qu est s 6 . 11% Sensiti v e R e qu est s 7 . 9 5% W ritin g R e q ue s t s 3 . 37% Con s tr aine d A n sw er s 1 . 6 1% Mat eria l Q u a n tities 0 .25% Da il y Q ua ntities 3 . 10% M...

2025

[61] [61]

33% T angent s

18% Inscribed 1 . 33% T angent s

[62] [62]

Ratios 1

9 2% T rig . Ratios 1 . 00% Congruence 1 . 96 % P r o p or tions 1 . 7 8% Figur e Solv e

[63] [63]

7 4% Segment Ratios 1 .44% Similarity I 1

0 7% Segment s 1 . 7 4% Segment Ratios 1 .44% Similarity I 1 . 11% Quadrilat er als

[64] [64]

Algebr a 2.48% R ounding I 1 .22% P ar aellogr ams 1

18% Similarity II 1 .48% Q uad . Algebr a 2.48% R ounding I 1 .22% P ar aellogr ams 1 . 81% R ounding II 1 .4 1% Ar ea 4 . 37% Angles & Lengt hs 1 . 04% Ar cs I 2.48% Angle Chasing 1 . 7 8% Ar cs II

[65] [65]

0 7% R elations I 3 .40% Angles I 1

7 0% R oot (all pr oblems) Angle Algebr a 1 . 0 7% R elations I 3 .40% Angles I 1 . 85% R elations II 4 . 18% Angles II

[66] [66]

00% Angles III 1

00% Diagr am Only 1 . 00% Angles III 1 . 04% V isual I 1 . 5 9 % P erimet e r & Ar ea 7 . 7 4% V isual II 1 . 5 9 % M i x ed V 1 . 96 % V isual III 1 . 7 0% V isual IV 2.22% V isual V 1 .4 1% M i x ed II

[67] [67]

30% V isual V I I 3 .48% V isual V II I

00% M i x ed III 3 .48% M i x ed IV 1 .22% V isual V I 1 . 30% V isual V I I 3 .48% V isual V II I

[68] [68]

Goldilocks

04% V isual IX 1 . 6 3% M i x ed I 11 .2 9 % Figure 19.Latent Task Tree forGeometry3KLu et al. [2021] using Qwen3-VL-32B-Instruct’s [Bai et al., 2025] latent space.The dataset consists of multimodal geometry problems, and the induced tree contains 54 nodes. On one node with 8×H100 GPUs, loading the latent representations took 3:26 minutes, and recursive t...

2021