The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?

Alexander Hägele; Aryo Pradipta Gema; Ethan Perez; Henry Sleight; Jascha Sohl-Dickstein

arxiv: 2601.23045 · v2 · submitted 2026-01-30 · 💻 cs.AI

Pith reviewed 2026-05-16 09:36 UTC · model grok-4.3

classification 💻 cs.AI

keywords AI alignment · error incoherence · bias-variance decomposition · model scaling · task complexity · sequential decision making · AI safety

The pith

AI failures grow more incoherent as reasoning sequences lengthen, even in larger models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether capable AIs will fail by consistently chasing unintended goals or by producing chaotic, goal-less actions. It decomposes task errors into bias from systematic misalignment and variance from incoherent behavior, measured across repeated runs with test-time randomness. Results show that the share of error due to variance rises steadily with the number of reasoning steps and actions required. This increase persists across frontier models and does not reliably shrink with greater scale. The pattern implies that future high-stakes AI errors are more likely to appear as unpredictable accidents than as coherent pursuit of a wrong objective.

Core claim

Across all tasks and frontier models measured, the longer models spend reasoning and taking actions, the higher the fraction of their error that stems from variance rather than bias; error-incoherence therefore increases with task complexity and does not decrease consistently with model scale.

What carries the argument

Error-incoherence metric: the fraction of total task error attributable to variance across test-time randomness rather than to fixed bias in outcome.
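A minimal sketch of how such a metric could be computed for one task, assuming a squared-error decomposition and a scalar outcome per run. The function name `error_incoherence` and the toy outcome lists are hypothetical; the figure captions indicate the paper itself works with KL- and Brier-based measures, so this is an illustration of the idea rather than the authors' estimator.

```python
from statistics import fmean, pvariance


def error_incoherence(outcomes, target):
    """Share of mean squared error that comes from variance across repeated runs.

    outcomes: scalar task outcomes from repeated runs of the same task
    target:   the reference (desired) outcome for that task
    """
    bias_sq = (fmean(outcomes) - target) ** 2   # systematic part of the error
    variance = pvariance(outcomes)              # run-to-run spread
    total = bias_sq + variance
    if total == 0:
        return 0.0  # no error at all; the ratio is undefined, 0.0 is a convention
    return variance / total


# Toy outcomes (made up for illustration): 1.0 means the desired result.
consistently_wrong = [0.0] * 8        # same wrong outcome every run
erratic = [1.0, 0.0] * 4              # flips between right and wrong

print(error_incoherence(consistently_wrong, target=1.0))
print(error_incoherence(erratic, target=1.0))
```

In this toy setup the consistently wrong runs score 0 (all error is bias), while the erratic runs split their error evenly between bias and variance.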

If this is right

Longer reasoning chains will produce a larger share of unpredictable, nonsensical actions.
Model scale alone will not convert incoherent failures into consistent but misaligned ones.
AI accidents are more likely to resemble erratic industrial mishaps than deliberate goal pursuit.
Alignment work should emphasize prevention of reward hacking and goal misspecification over detection of coherent misalignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety evaluations will need to emphasize extended sequential tasks to surface the predicted incoherence.
Monitoring systems may need to handle stochastic errors in addition to searching for stable misaligned policies.
Training interventions that reduce output variance on complex tasks could be tested as a direct countermeasure.

Load-bearing premise

The bias-variance split measured over test-time randomness actually separates systematic goal misalignment from incoherent nonsensical behavior.

What would settle it

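Finding that error-incoherence falls consistently with increasing model scale on long-horizon sequential tasks would contradict the central pattern; a flat trend would not, since the paper itself reports incoherence that stays constant or rises with scale on the hardest questions (Figure 4).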
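One way such a check could be set up is a log-log fit of incoherence against model size, in the spirit of the regression mentioned in the Figure 14 caption. The sketch below is an assumption-laden illustration: `loglog_slope` is a hypothetical helper, and the data are generated from a known power law purely to show the mechanics, not taken from the paper.

```python
import math


def loglog_slope(sizes, values):
    """Least-squares slope of log(values) against log(sizes)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic data only: incoherence constructed as 0.3 * (size / 1e9) ** 0.1.
sizes = [1e9, 4e9, 8e9, 32e9]
incoherence = [0.3 * (s / 1e9) ** 0.1 for s in sizes]

slope = loglog_slope(sizes, incoherence)
print(f"fitted exponent: {slope:.3f}")
```

On real measurements, a clearly negative fitted exponent on long-horizon tasks would count against the reported pattern; the sign and its uncertainty matter more than the exact value.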

Figures

Figures reproduced from arXiv: 2601.23045 by Alexander Hägele, Aryo Pradipta Gema, Ethan Perez, Henry Sleight, Jascha Sohl-Dickstein.

**Figure 1.** AI can fail because it is misaligned, and produces consistent but undesired outcomes, or because it is incoherent, and does not produce consistent outcomes at all. These failures correspond to bias and variance respectively. As we extrapolate risks from AI, it is important to understand whether failures from more capable models performing more complex tasks will be bias or variance dominated. Bias dominate…

**Figure 2.** Across a variety of settings, as models reason longer or take more actions, they become more incoherent. We assess frontier models (SONNET 4, O3-MINI, O4-MINI, QWEN3) across a variety of different tasks (MCQ, Agentic Coding, Alignment). We evaluate with many samples to estimate bias and variance terms for each question. When sorting questions by average reasoning lengths and grouping into buckets, a clear …

**Figure 3.** For a fixed task and reasoning budget, natural variation in reasoning length and action count is predictive of incoherence. We analyze GPQA (left, (a)) and SWE-BENCH (b) by splitting samples into above- or below-median reasoning length (GPQA) or actions (SWE-BENCH) per question. We then compute performance and incoherence for both groups. (a) The naturally longer reasoning shows increased incoherence for b…

**Figure 4.** Larger and more intelligent systems are often more incoherent. (a) We measure the scaling of incoherence vs. model size for the QWEN3 family, as a function of question difficulty on MMLU. For easy questions, incoherence drops with model scale, while for the hardest questions incoherence remains constant or increases with model scale. The expanded results for this experiment are in …

**Figure 5.** Details for QWEN3 scaling laws: easy tasks become less incoherent, harder tasks more incoherent. We group MMLU questions by reasoning length using a reference model (Qwen3 32B, (a)), which correlates across model sizes (b) and serves as a task complexity proxy, as accuracy drops with longer reasoning (c). These groups reveal distinct bias–variance scaling (d): bias slopes are similar across groups, but var…

**Figure 6.** Details for synthetic optimization: In controlled settings with teacher forcing and a single objective, language models become variance dominated with increasing size. (left) We train autoregressive transformers to predict update steps to minimize a quadratic function using decoding based regression, i.e., next-token prediction. This setting involves sequentially performing steps towards a goal via next to…

**Figure 7.** Ensembling and larger reasoning budgets reduce incoherence. Other forms of error correction may also reduce incoherence. (a) Instructing models to reason longer improves performance (inference scaling laws, …

**Figure 8.** Overview of accuracy and different error metrics with frontier models. Top, (a): We show the performance increase with different reasoning budgets for both the standard discrete choice format (left) and prompting models to provide probabilities of answers being correct (right). The latter shows lower accuracies as models provide nonzero values to other (not chosen) answers, but the inference scaling improv…

**Figure 9.** There is a multiplicative interaction between RL and model scale for performance. The left plot shows the performance (average accuracy) of the QWEN3 model family as a function of model size across base, instruct, and thinking-enabled models. The base and instruct use logprob-based evaluation (i.e., no token generation). There is a noticeable jump in the slope from instruct to thinking models, which sugges…

**Figure 10.** We find qualitatively similar behavior for different bias and variance metrics. The absolute bias and variance errors (top) show the same behavior: the errors increase for questions that have the models reason longer (cf., …

**Figure 11.** KL measures with ensembling. We repeat the plots from …

**Figure 12.** For the hardest tasks, models tend to be more incoherent with scale, also for GPQA. We repeat the analysis from Section 3.2 with GPQA. That is, we group questions by reasoning length using a reference model’s answers (Qwen3 32B) and separately analyze the scaling laws. Analogous to MMLU, we find that for bias, the slope is similar across groups; for variance, however, the slope becomes much shallower. As …

**Figure 13.** Relationship between incoherence and error. We visualize the relationship between incoherence and both bias (x-axis) and variance (y-axis) for both GPQA (left) and MMLU (right) with the QWEN3 model family. Since the incoherence is independent of the magnitude of error, a lower error model (bottom left corner) can have the same level of incoherence as models with higher error. Higher incoherence can be due…

**Figure 14.** Reasoning length has a higher effect on incoherence than model size. To assess the change in incoherence with both reasoning length (x-axis) and model size (y-axis), we perform a log-log regression to infer the incoherence for both GPQA (left) and MMLU (right). The contour shows the prediction from the fitted regression in comparison to the original groups of questions (scatter). Notably, we see how the r…

**Figure 15.** MMLU results across model families. We compare the experimental results for scaling laws for QWEN3, GEMMA3, and LLAMA3 models. Across all models, the same observation holds: while performance (accuracy) strongly improves with model size, the contribution of bias and variance changes in a way that depends on question complexity. For the hardest group of questions (longest reasoning and lowest performance),…

**Figure 16.** GPQA results across model families. We compare the experimental results for scaling laws for QWEN3, GEMMA3, and LLAMA3 models. Note that for GEMMA3 and LLAMA3, we use a 0-shot setup: We observe that in our few-shot setting these models do not reliably produce chain-of-thought responses and performance drops, since they strongly adhere to the few-shot examples on GPQA which are provided without reasoning. …

**Figure 17.** Grouped comparison of reasoning budgets and natural variation in reasoning: natural variation dominates. We analyze GPQA (left, (a)) and SWE-BENCH (b) by splitting samples into above- or below-median reasoning length (GPQA) or actions (SWE-BENCH) per question. We then compute performance and incoherence for both groups. (a) Increasing the reasoning budget improves performance (inference scaling laws, to…

**Figure 18.** Incoherence as a function of wait-ratios in reasoning. We sort questions using the density of “Wait” in each reasoning, i.e., the number of counts compared to the overall length. This is motivated by its potential meaning for backtracking or error-correction. (left) For GPQA, we find no clear relation to incoherence for different models. For MMLU (right), we find a shared positive relation, which might in…

**Figure 19.** Qualitative illustration of incoherence. When presenting SONNET 4 with a question of the MWE suite about being disconnected (Perez et al., 2023), the model’s behavior is highly variable and switches between A and B for almost every sample. The example was chosen as it shows one of the highest variances in the dataset …

**Figure 20.** Rate of absolute answer changes for GPQA: models change answers at least once for a large portion of questions. To illustrate the variance and incoherence, we report the percentage of questions that see at least one different answer across the following settings: 1) pure sampling, i.e., performing autoregressive answer generation with a different seed (resampling); 2) context sensitivity, where we verify…

**Figure 21.** Sampling efficiency for bias and variance estimates. To the best of our knowledge, there are no unbiased estimators for the KL measures and BRIER as used in this paper. We verify with GPQA and O3-MINI that the metrics stabilize. This is done by taking a large sample size (100 samples with medium reasoning) and performing bootstrapping, reporting mean and standard deviation (left: KL, right: BRIER) of the a…

**Figure 22.** Human difficulty labels are not a good indicator for longer reasoning. However, different models’ lengths correlate positively. Similar to QWEN3 (Figures 5(b) and 12(b)), we find that the average reasoning length of frontier models for questions correlates positively, even for different families (b). In contrast, the provided difficulty labels of GPQA do not show a clear indication, as average reasoning …

**Figure 23.** KL metrics of Model-Written Evals question sets. We provide an overview of results for variations of the MWE set (Perez et al., 2023), with bias (left), variance (middle) and resulting incoherence (right). We filter out question sets that do not show noticeable trends. The measures are taken w.r.t. the labelled aligned answer. Results vary across settings and are sometimes more noisy. What they have in co…

**Figure 24.** All scatter variances of model-written eval embeddings. We provide an overview of all open-ended variations of the MWE set (Perez et al., 2023). Using the OpenAI text embedding model (text-embedding-3-large), we obtain a vector embedding for each answer sample, i.e., excluding the reasoning or chain-of-thought traces. This allows us to calculate the variance per question in standard Euclidean space and pl…

**Figure 25.** SWE-BENCH incoherence and error: different x-axes show similar effect. While our main text focuses on the number of rounds (actions or messages, left) as the qualifying measure, we show the alternatives of the total output tokens (middle) and reasoning length (right). The trends are qualitatively similar across plots: the incoherence (a) rises with different slopes and the coverage error (c) increases. A …

**Figure 26.** The improvement of model scale mostly manifests in reduction of bias rather than variance. We show the loss scaling curves with model size (top left, a), which show a known power-law improvement with model size. To understand how this translates to performance improvement, we plot the average bias and variance per step (top right, a). This is the continuation of the incoherence plot from …

**Figure 27.** Grouped results of survey. For each of biological creatures (animals and humans, left), AI models (middle) and human organizations (right), human subjects judged entities to be of higher incoherence (more of a hot mess), the smarter they are judged by a different set of subjects.
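The stability check described in the Figure 21 caption (bootstrapping from a large pool of samples and reporting mean and standard deviation) could look roughly like the sketch below. It is an illustration under assumptions: it uses a squared-error incoherence ratio as a stand-in for the paper's KL and Brier measures, and the names `incoherence`, `bootstrap_stability`, and the 60/40 pool are hypothetical.

```python
import random
from statistics import fmean, pstdev, pvariance


def incoherence(outcomes, target):
    """Variance share of squared error (stand-in for the paper's metrics)."""
    bias_sq = (fmean(outcomes) - target) ** 2
    variance = pvariance(outcomes)
    total = bias_sq + variance
    return variance / total if total > 0 else 0.0


def bootstrap_stability(outcomes, target, sample_size, n_boot=500, seed=0):
    """Mean and spread of the estimate when only `sample_size` runs are drawn."""
    rng = random.Random(seed)
    estimates = [
        incoherence(rng.choices(outcomes, k=sample_size), target)  # resample with replacement
        for _ in range(n_boot)
    ]
    return fmean(estimates), pstdev(estimates)


# Hypothetical pool of 100 graded samples for one question: 60 correct, 40 wrong.
pool = [1.0] * 60 + [0.0] * 40
for k in (5, 20, 100):
    mean, spread = bootstrap_stability(pool, target=1.0, sample_size=k)
    print(k, round(mean, 3), round(spread, 3))
```

The spread of the estimate at each sample size indicates how many runs per question are needed before the metric settles.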

Original abstract

As AI becomes more capable, we entrust it with more general and consequential tasks. The risks from failure grow more severe with increasing task scope. It is therefore important to understand how extremely capable AI models will fail: Will they fail by systematically pursuing goals we do not intend? Or will they fail by being a hot mess, and taking nonsensical actions that do not further any goal? We operationalize this question using a bias-variance decomposition of the errors made by AI models: An AI's error-incoherence on a task is measured over test-time randomness as the fraction of its error that stems from variance rather than bias in task outcome. Across all tasks and frontier models we measure, the longer models spend reasoning and taking actions, the more incoherent their failures become. Error-incoherence changes with model scale in a way that is experiment dependent. However, in several settings, larger, more capable models are more incoherent than smaller models. Consequently, scale alone seems unlikely to eliminate error-incoherence. Instead, as more capable AIs pursue harder tasks, requiring more sequential action and thought, our results predict failures to be accompanied by more incoherent behavior. This suggests a future where AIs sometimes cause industrial accidents (due to unpredictable misbehavior), but are less likely to exhibit consistent pursuit of a misaligned goal. This increases the relative importance of alignment research targeting reward hacking or goal misspecification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's bias-variance measure of error incoherence rises with task length on frontier models, but the split may not cleanly separate goal inconsistency from compounding stochasticity in sequential settings.

Hi, the main point is that this work defines error-incoherence as the share of outcome error coming from variance over test-time randomness, then shows that fraction growing with longer reasoning sequences across the models and tasks they ran. Scale does not reliably lower it, which they take to mean future failures will look more like unpredictable mess than steady pursuit of the wrong goal. That empirical pattern on length is the clearest new piece.

They do a reasonable job of collecting the data on actual frontier models and framing the bias-variance split as a way to operationalize the question. The finding that incoherence tracks sequence length is straightforward enough to be useful for safety discussions.

The soft spot is that the decomposition can pick up mechanical variance from early stochastic steps compounding over many actions, even when the model is following a consistent (misaligned) policy. The abstract gives no explicit ground-truth outcome definition or controls that would isolate that effect, so the jump to “nonsensical actions that do not further any goal” rests on an assumption that may not hold in long-horizon tasks. If the full paper has ablations or precise bias definitions that address this, the claim strengthens; otherwise it stays suggestive.

This is for people working on empirical failure modes and scaling of alignment issues. A reader who wants data on how messiness changes with horizon length will find something here, though the results are more directional than conclusive. I would send it to peer review so the methods and interpretation can be checked properly.

Referee Report

2 major / 1 minor

Summary. The paper claims that a bias-variance decomposition of task-outcome errors, measured over test-time randomness, shows error-incoherence (the variance fraction) increasing with the length of reasoning and sequential actions across frontier models and tasks; scale effects are experiment-dependent but often show larger models as more incoherent, implying that capability scaling is unlikely to eliminate incoherent failures and that alignment efforts should prioritize reward hacking and goal misspecification over pure scaling.

Significance. If the decomposition validly isolates goal inconsistency from mechanical stochasticity, the result would indicate that longer-horizon tasks produce more unpredictable nonsensical behavior rather than consistent misalignment, raising the relative priority of certain alignment techniques. The work provides an empirical framing for distinguishing failure modes but lacks the controls needed to support the load-bearing distinction.

major comments (2)

[Abstract] Abstract: The central claim that 'the longer models spend reasoning and taking actions, the more incoherent their failures become' rests on error-incoherence defined as the variance fraction of outcome error over test-time randomness, yet no ground-truth outcome for the bias term is specified for open-ended sequential tasks; without this, the decomposition cannot distinguish systematic misalignment from compounding stochasticity.
[Abstract] Abstract and measurement approach: The bias-variance decomposition attributes increased variance to 'incoherent nonsensical behavior' rather than misalignment, but in sequential decision-making even a fixed-goal policy can produce high outcome variance via early stochastic actions; the manuscript provides no controls or isolation procedure for this mechanical effect, which directly undermines the interpretation that scale alone is unlikely to eliminate error-incoherence.

minor comments (1)

[Abstract] The abstract uses informal phrasing ('hot mess') that could be replaced with precise terminology for a journal audience.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which correctly identify ambiguities in how our bias-variance decomposition applies to open-ended sequential tasks. We have revised the manuscript to clarify the definition of task outcomes, to discuss the possibility that high variance can arise from stochastic execution of a fixed goal, and to moderate the strength of our interpretive claims. Below we respond point by point to the major comments.

Referee: [Abstract] Abstract: The central claim that 'the longer models spend reasoning and taking actions, the more incoherent their failures become' rests on error-incoherence defined as the variance fraction of outcome error over test-time randomness, yet no ground-truth outcome for the bias term is specified for open-ended sequential tasks; without this, the decomposition cannot distinguish systematic misalignment from compounding stochasticity.

Authors: We agree that a classical bias-variance decomposition requires a well-defined ground-truth outcome. In the revised manuscript we now explicitly state that, for each task, we adopt the benchmark-provided success metric or score as the reference outcome when it exists; bias is then the squared deviation of the mean outcome across stochastic rollouts from this reference, and variance is the variance of the outcome distribution. For tasks lacking a scalar metric we use a binary completion indicator. We acknowledge that this choice is imperfect for fully open-ended settings and have added a limitations paragraph noting that the decomposition therefore captures statistical variability relative to the chosen proxy rather than an absolute ground truth. The observed rise in the variance fraction with horizon length remains, but we now present it as evidence of increasing outcome unpredictability rather than a direct proof of incoherence. revision: yes
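The decomposition described in this response is the standard squared-error identity. As a sketch, with notation that is an assumption of this note rather than the paper's own: let $Y$ be the task outcome over test-time randomness and $y^\ast$ the reference outcome. Adding and subtracting $\mathbb{E}[Y]$ gives

```latex
\begin{align}
\mathbb{E}\big[(Y - y^\ast)^2\big]
  &= \mathbb{E}\big[(Y - \mathbb{E}[Y])^2\big]
   + 2\,\mathbb{E}\big[Y - \mathbb{E}[Y]\big]\,\big(\mathbb{E}[Y] - y^\ast\big)
   + \big(\mathbb{E}[Y] - y^\ast\big)^2 \\
  &= \underbrace{\operatorname{Var}(Y)}_{\text{variance}}
   + \underbrace{\big(\mathbb{E}[Y] - y^\ast\big)^2}_{\text{bias}^2}, \\
\text{error-incoherence}
  &= \frac{\operatorname{Var}(Y)}{\big(\mathbb{E}[Y] - y^\ast\big)^2 + \operatorname{Var}(Y)}.
\end{align}
```

The cross term vanishes because $\mathbb{E}[Y - \mathbb{E}[Y]] = 0$. The ratio depends on the choice of $y^\ast$, which is why the proxy used as reference outcome matters for the interpretation.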
Referee: [Abstract] Abstract and measurement approach: The bias-variance decomposition attributes increased variance to 'incoherent nonsensical behavior' rather than misalignment, but in sequential decision-making even a fixed-goal policy can produce high outcome variance via early stochastic actions; the manuscript provides no controls or isolation procedure for this mechanical effect, which directly undermines the interpretation that scale alone is unlikely to eliminate error-incoherence.

Authors: The referee correctly notes that high outcome variance is compatible with a fixed but noisy policy. Our original experiments did not include controls such as temperature-ablated deterministic rollouts or comparisons against hand-crafted fixed-goal agents. In the revised discussion we now explicitly list this alternative explanation and state that the data cannot yet isolate mechanical stochasticity from goal inconsistency. We retain the empirical observation that variance fraction grows with reasoning length across the tested models and tasks, which still implies that pure capability scaling is unlikely to drive the variance fraction to zero. However, we have softened the language from 'incoherent nonsensical behavior' to 'increased outcome variability' and added a sentence calling for future controlled studies. No new experiments were performed. revision: partial
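The alternative explanation conceded here can be illustrated with a toy simulation. The sketch below is a constructed example under stated assumptions, not an experiment from the paper: a policy always aims at the same slightly wrong goal (a fixed offset), executes open-loop, and accumulates zero-mean noise at every step. The names `rollout`, `variance_fraction`, and all parameter values are hypothetical.

```python
import random
from statistics import fmean, pvariance


def rollout(horizon, goal_offset, step_noise, rng):
    """Final error of one run: a fixed wrong goal plus uncorrected step noise."""
    error = goal_offset  # systematic part: the policy's goal is off by this much
    for _ in range(horizon):
        error += rng.gauss(0.0, step_noise)  # zero-mean execution noise per step
    return error  # measured relative to the intended target


def variance_fraction(horizon, goal_offset=1.0, step_noise=0.5, runs=2000, seed=0):
    """Variance share of squared error across repeated runs at a given horizon."""
    rng = random.Random(seed)
    errors = [rollout(horizon, goal_offset, step_noise, rng) for _ in range(runs)]
    bias_sq = fmean(errors) ** 2
    variance = pvariance(errors)
    return variance / (bias_sq + variance)


for horizon in (1, 10, 100):
    print(horizon, round(variance_fraction(horizon), 3))
```

In this setup the expected variance share is step_noise² · horizon / (goal_offset² + step_noise² · horizon), so it rises with horizon even though the goal never changes. Other noise models behave differently (for example, a per-step systematic drift makes bias grow faster than variance), which is part of why controls are needed to interpret the measured trend.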

Circularity Check

0 steps flagged

No significant circularity detected

The paper operationalizes error-incoherence via a direct empirical bias-variance decomposition of task outcomes measured over test-time randomness, with no equations, derivations, or self-citations that reduce the central claims to fitted inputs or prior results by construction. Claims about increasing incoherence with sequence length and model scale are presented as experimental observations rather than tautological redefinitions. The measure is defined independently of the scaling predictions, satisfying the criteria for a self-contained empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the operationalization of error-incoherence via bias-variance decomposition applied to AI task performance, which is a domain assumption not independently verified from the abstract alone.

axioms (1)

domain assumption: Task outcomes can be decomposed into bias and variance components using test-time randomness, even for sequential decision-making tasks.
This underpins the definition of error-incoherence as the variance fraction of total error.

invented entities (1)

error-incoherence (no independent evidence)
purpose: Quantify the fraction of AI task error stemming from variance rather than bias
New metric introduced to operationalize incoherent vs. coherent failures.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We operationalize this question using a bias-variance decomposition of the errors made by AI models: An AI's error-incoherence on a task is measured over test-time randomness as the fraction of its error that stems from variance rather than bias in task outcome.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and embed_strictMono · unclear

Paper passage: Across all tasks and frontier models we measure, the longer models spend reasoning and taking actions, the more incoherent their failures become.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning
cs.LG · 2026-04 · unverdicted · novelty 6.0

SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched...

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Oxford University Press, Oxford,

10 Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Oxford,

work page
[2]

1 Leo Breiman

ISBN 978-0199678112. 1 Leo Breiman. Bias, variance, and arcing classifiers. 1996. 3 Nghia Tuan Bui, Guergana K Savova, and Lijing Wang. Assessing the macro and micro effects of random seeds on fine-tuning large language models. In Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakrab...

work page 1996
[3]

ISBN 979-8-89176-299-2

The Asian Federation of Natural Language Processing and The Association for Computational Linguistics. ISBN 979-8-89176-299-2. URL https://aclanthology.org/2025.ijcnlp-short.3/. 10, 40 11 Published as a conference paper at ICLR 2026 Andong Chen, Yuchen Song, Wenxin Zhu, Kehai Chen, Muyun Yang, Tiejun Zhao, et al. Evaluating o1-like llms: Unlocking reaso...

work page arXiv 2025
[4]

1 Morris H

Accessed: 2025-10-16. 1 Morris H. Degroot and Stephen E. Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society Series D: The Statistician, 32(1-2):12–22, 12 2018. ISSN 2515-

work page 2025
[5]

DeGroot and Stephen E

doi: 10.2307/2987588. URL https://doi.org/10.2307/2987588. 3 Pedro Domingos. A unified bias-variance decomposition for zero-one and squared loss. AAAI/IAAI, 2000:564–569, 2000. 3, 20 Jacob Dominski and Yong Suk Lee. Advancing ai capabilities and evolving labor outcomes. arXiv preprint arXiv:2507.08244, 2025. 1 Tyna Eloundou, Sam Manning, Pamela Mishkin, and ...

work page doi:10.2307/2987588 2000
[6]

The Platonic Representation Hypothesis

1, 7 John Hughes and safety research. safety-research/safety-tooling: v1.0.0, 2025. URL https://doi.org/10.5281/zenodo.15363603. 22 Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987, 2024. 10, 40 Aaron Jaech, Adam Kalai, Adam ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.5281/zenodo.15363603 2025
[7]

Scaling Laws for Neural Language Models

40 Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66. 4, 23 Andrew Johnston and Christos Makridis. The labor mark...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3616855.3635845 2024
[8]

S1: Simple test-time scaling

40 Nestor Maslej, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Njenga Kariuki, Emily Capstick, Anka Reuel, Erik Brynjolfsson, John Etchemendy, et al. Artificial intelligence index report 2025. arXiv preprint arXiv:2504.07139, 2025. 1 Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettl...

work page doi:10.18653/v1/2025.emnlp-main.1025 2025
[9]

Wait, we don’t need to “wait”! removing thinking tokens improves reasoning efficiency

10, 40 Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=4FWAwZtd2n. 4, 40 Jascha Sohl-Dickstein. The hot mess theory of AI mis...

work page doi:10.18653/v1/2025.findings-emnlp.394 2025
[10]

<PROB>P(A), P(B), P(C), P(D)</PROB>

and recommended parameters for thinking (temperature 0.6, top-k 20, top-p 0.95). Since we consider multiple choice questions that only require a letter to answer, we count reasoning length using the amount of output tokens in the answer, either by the API count or using the actual tokenizer of QWEN3. To estimate the bias and variance metrics across both...

work page 2023
[11]

We sample 20’000 such trajectories, and use 10% as a holdout dataset for valuation loss

To generate our target data, we employ a ground-truth optimizer of steepest descent with fixed step norm, set to 0.005, to generate multiple fixed-length trajectories (of length 4096 steps) from randomly sampled starting points around the minimum, creating a dataset of pairs (x_i, u_i). We sample 20’000 such trajectories, and use 10% as a holdout dataset for v...

work page 2025
[12]

Wait”) improves efficiency, Lee et al. (2025) identify length-accuracy tradeoffs through “token complexity,

suite for self-reported survival instinct. The other results, including separate bias and variance plots, are shown in Fig. 23. We filter for those sets where there are noticeable trends. Open-Ended Formulation. To complete the picture of the embedding variance of open-ended MWE, all question sets are visualized in Fig. 24. While there are few exceptions, ...

work page 2026

[1]

Oxford University Press, Oxford,

10 Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Oxford,

work page

[2]

1 Leo Breiman

ISBN 978-0199678112. 1 Leo Breiman. Bias, variance, and arcing classifiers. 1996. 3 Nghia Tuan Bui, Guergana K Savova, and Lijing Wang. Assessing the macro and micro effects of random seeds on fine-tuning large language models. In Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakrab...

work page 1996

[3]

ISBN 979-8-89176-299-2

The Asian Federation of Natural Language Processing and The Association for Computational Linguistics. ISBN 979-8-89176-299-2. URL https://aclanthology.org/2025.ijcnlp-short.3/. 10, 40 11 Published as a conference paper at ICLR 2026 Andong Chen, Yuchen Song, Wenxin Zhu, Kehai Chen, Muyun Yang, Tiejun Zhao, et al. Evaluating o1-like llms: Unlocking reaso...

work page arXiv 2025

[4]

1 Morris H

Accessed: 2025-10-16. 1 Morris H. DeGroot and Stephen E. Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society Series D: The Statistician, 32(1-2):12–22, 12 2018. ISSN 2515-

work page 2025

[5]

DeGroot and Stephen E

doi: 10.2307/2987588. URL https://doi.org/10.2307/2987588. 3 Pedro Domingos. A unified bias-variance decomposition for zero-one and squared loss. AAAI/IAAI, 2000:564–569, 2000. 3, 20 Jacob Dominski and Yong Suk Lee. Advancing ai capabilities and evolving labor outcomes. arXiv preprint arXiv:2507.08244, 2025. 1 Tyna Eloundou, Sam Manning, Pamela Mishkin, and ...

work page doi:10.2307/2987588 2000

[6]

The Platonic Representation Hypothesis

1, 7 John Hughes and safety research. safety-research/safety-tooling: v1.0.0, 2025. URL https://doi.org/10.5281/zenodo.15363603. 22 Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987, 2024. 10, 40 Aaron Jaech, Adam Kalai, Adam ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.5281/zenodo.15363603 2025
