Deterministic Decisions for High-Stakes AI. A Zero-Egress Pipeline with the Deployability of RAG and the Accuracy of Machine Learning

Craig Atkinson

arxiv: 2606.29280 · v1 · pith:M2HGUGPQnew · submitted 2026-06-28 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-06-30 08:03 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords intervention bias, zero-shot LLM, supervised policy learning, Decision Transformer, educational analytics, calibration error, high-stakes decision making, OULAD dataset

The pith

Zero-shot LLMs recommend interventions 43 percentage points more often than an oracle policy requires, while supervised models trained on the same trajectories reach near-zero calibration error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that zero-shot large language models exhibit a previously unquantified intervention bias when used as educational advisors, over-prescribing action even when hindsight-optimal decisions call for none. On the OULAD dataset at day 56, where the oracle labels 70.1 percent of students as needing no intervention, zero-shot GPT-4o recommends action for 73 percent of cases, and commercial RAG and SQL-augmented retrieval are comparably miscalibrated. Two supervised policy models trained on oracle-labeled prefixes, a trajectory-conditioned Decision Transformer and a snapshot XGBoost classifier, eliminate the bias: both reach near-zero calibration error, and the Decision Transformer reaches macro-F1 of 0.79 at a 0 percent action flip rate and sub-5 ms CPU decision latency. The work also shows that LLM-as-judge scoring (DeepEval G-Eval) remains blind to this over-prescription. These results indicate that deterministic high-stakes decisions benefit from explicit supervised policy learning rather than zero-shot retrieval.

Core claim

Intervention bias is a failure mode of zero-shot LLM educational advisory agents that produces a 43 percentage-point false-positive rate relative to an oracle policy on the OULAD dataset; supervised policy learning on the same oracle-labelled trajectories removes the bias, with a Decision Transformer and XGBoost both achieving near-zero calibration error while preserving deployability and low latency.
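The 43-point figure follows from the two reported day-56 rates: the oracle intervenes for 100 − 70.1 = 29.9 percent of students, and zero-shot GPT-4o for 73 percent. The sketch below computes that gap as a difference in marginal intervention rates. The label names and the toy cohort are assumptions made for illustration; they are not the paper's data or code.

```python
from typing import Sequence

NO_ACTION = "none"  # hypothetical label for the oracle's "no intervention" class


def intervention_rate(actions: Sequence[str]) -> float:
    """Fraction of decisions that recommend any action other than NO_ACTION."""
    if not actions:
        raise ValueError("actions must be non-empty")
    return sum(a != NO_ACTION for a in actions) / len(actions)


def intervention_gap_pp(predicted: Sequence[str], oracle: Sequence[str]) -> float:
    """Predicted minus oracle intervention rate, in percentage points."""
    if len(predicted) != len(oracle):
        raise ValueError("predicted and oracle must have the same length")
    return 100.0 * (intervention_rate(predicted) - intervention_rate(oracle))


# Toy cohort of 1,000 students constructed to match the reported rates:
# the oracle intervenes for 299 (29.9%), the zero-shot advisor for 730 (73%).
oracle = ["contact"] * 299 + [NO_ACTION] * 701
predicted = ["contact"] * 730 + [NO_ACTION] * 270

gap = intervention_gap_pp(predicted, oracle)
print(f"intervention-rate gap: {gap:.1f} pp")
```

Note that a difference in marginal rates is a lower bound on per-student disagreement: two policies can intervene at the same rate and still pick different students.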

What carries the argument

Trajectory-conditioned ONNX Decision Transformer (and snapshot XGBoost) trained on prefix-only features from oracle-labelled student trajectories to map EAV state vectors directly to action policies.
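A minimal sketch of what "prefix-only" means in practice: features for a decision at a given cutoff day are computed solely from events logged on or before that day. The `Event` fields and the three summary features are hypothetical and chosen for illustration; the paper's actual EAV state vector is not specified here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    """One logged observation; field names are illustrative, not OULAD's schema."""
    day: int     # days since course start
    clicks: int  # activity count recorded for that day


def prefix_features(events: Iterable[Event], cutoff_day: int) -> List[float]:
    """Summarise only the events observed on or before cutoff_day."""
    seen = [e for e in events if e.day <= cutoff_day]
    total_clicks = sum(e.clicks for e in seen)
    active_days = len({e.day for e in seen})
    # Days since the last observed activity; cutoff_day + 1 if nothing was seen.
    days_idle = cutoff_day - max(e.day for e in seen) if seen else cutoff_day + 1
    return [float(total_clicks), float(active_days), float(days_idle)]


# The day-90 event lies after the cutoff and must not influence the features.
log = [Event(3, 10), Event(20, 5), Event(56, 2), Event(90, 40)]
features_day_56 = prefix_features(log, cutoff_day=56)
print(features_day_56)
```

A simple leakage check under this design is to recompute the features with all post-cutoff events deleted and confirm the vector is unchanged.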

If this is right

At 10,000 students the supervised pipeline would avoid roughly 4,300 unnecessary advisor contacts per cycle.
Both the Decision Transformer and XGBoost maintain 0 percent action flip rate and sub-5 ms CPU latency under strict prefix-only inputs.
LLM-as-judge scoring (G-Eval) rewards fluent over-prescription and fails to penalize intervention bias.
The supervised arms match each other in calibration; any edge of the Decision Transformer at the final cutoff is indicative only.
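Two of the quantities in this list are simple enough to sketch. The first scales the 43-point gap to a cohort of 10,000. The second computes an action flip rate, under the assumption (not confirmed by the text above) that a "flip" means the same student receiving different actions across repeated runs on identical input. Function names are hypothetical.

```python
from typing import Sequence


def projected_unnecessary_contacts(cohort_size: int, gap_pp: float) -> int:
    """Scale an intervention-rate gap (in percentage points) to a cohort size."""
    return round(cohort_size * gap_pp / 100.0)


def action_flip_rate(runs: Sequence[Sequence[str]]) -> float:
    """Share of students whose action differs across repeated runs.

    runs[i][j] is the action produced by run i for student j.
    Assumed definition; the paper may define flips differently.
    """
    if not runs or not runs[0]:
        raise ValueError("runs must contain at least one non-empty run")
    n_students = len(runs[0])
    if any(len(r) != n_students for r in runs):
        raise ValueError("all runs must cover the same students")
    flipped = sum(len({run[j] for run in runs}) > 1 for j in range(n_students))
    return flipped / n_students


contacts = projected_unnecessary_contacts(10_000, 43.0)
print(contacts)  # 4300

# Toy repeated runs: the third student's action changes between runs.
toy_runs = [["none", "contact", "none"], ["none", "contact", "contact"]]
print(action_flip_rate(toy_runs))
```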

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same oracle-to-supervised pipeline could be tested in other high-stakes sequential decision settings where over-prescription carries measurable cost.
Replacing the separate classifier with direct fine-tuning of an LLM on the identical oracle labels would test whether the calibration gain requires an explicit non-LLM policy head.
The current results depend on structured feature vectors; performance on fully unstructured text inputs remains untested.

Load-bearing premise

The oracle policy extracted from complete student trajectories accurately captures the hindsight-optimal choices for when intervention is required.

What would settle it

Apply the same zero-shot and supervised pipelines to a second dataset whose oracle policy is derived independently and check whether the 43-point false-positive gap between zero-shot and supervised arms disappears.

Original abstract

We identify intervention bias as a previously unquantified failure mode of zero-shot large-language-model (LLM) educational advisory agents: without task-specific training, they recommend action when a hindsight-optimal oracle policy mandates inaction. In a six-arm ablation on the Open University Learning Analytics Dataset (N=800 students, four temporal cutoffs), at day 56 -- when the oracle designates 70.1% of students as needing no intervention -- zero-shot GPT-4o recommends action for 73%, a 43 percentage-point false-positive rate. Commercial RAG and SQL-augmented retrieval are comparably miscalibrated; at 10,000 students this implies about 4,300 unnecessary advisor contacts per cycle. Supervised policy learning eliminates this bias: a trajectory-conditioned ONNX Decision Transformer (DT) and a snapshot XGBoost classifier, trained on the same oracle-labelled trajectories under strict prefix-only features, both achieve near-zero calibration error. The DT reaches macro-F1 0.79 (macro-recall 0.85) across all five action classes, predicting even the rare load-reduction action without collapsing, at a 0% action flip rate and sub-5 ms CPU decision latency. The two supervised arms are on par; the DT's edge over XGBoost at the final cutoff is indicative only (unpaired across cohorts). Scope: we validate Stage-2 decision-making (EAV state vector to supervised policy) under controlled oracle input from structured OULAD data; high fidelity reflects feature-oracle alignment, not general high-stakes-AI capability. The most robust finding is the intervention-bias contrast, not the absolute accuracies. We also show an Evaluation Gap: LLM-as-judge scoring (DeepEval G-Eval) is blind to intervention bias, rewarding fluent over-prescription rather than decision quality.
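The abstract reports macro-F1 and macro-recall across five action classes and stresses that the rare load-reduction action does not collapse. The sketch below shows why macro averaging is sensitive to that: each class contributes equally, so a rare class that is never predicted pulls the average down sharply. The class labels and the toy predictions are invented for illustration and do not reproduce the paper's results.

```python
from typing import Sequence, Tuple


def macro_recall_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, float]:
    """Unweighted mean of per-class recall and F1 over the classes in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    classes = sorted(set(y_true))
    recalls, f1s = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        recalls.append(tp / (tp + fn))           # tp + fn > 0 because c occurs in y_true
        f1s.append(2 * tp / (2 * tp + fp + fn))  # equals 0 when the class is never hit
    return sum(recalls) / len(classes), sum(f1s) / len(classes)


# Toy example: the rare "reduce_load" class is never predicted.
y_true = ["none", "none", "none", "contact", "reduce_load"]
y_pred = ["none", "none", "contact", "contact", "none"]

macro_recall, macro_f1 = macro_recall_f1(y_true, y_pred)
print(f"macro-recall={macro_recall:.3f} macro-F1={macro_f1:.3f}")
```

In this toy case plain accuracy is 3 of 5, yet the missed rare class alone costs a third of the achievable macro score.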

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper quantifies over-prescription by zero-shot LLMs on OULAD advising data relative to an oracle, with supervised models recovering the labels from prefixes.

The main point is that zero-shot GPT-4o recommends action for 73% of students at day 56 when the oracle says only about 30% need it, with retrieval setups comparably miscalibrated, while a Decision Transformer and XGBoost trained on the same labels get near-zero calibration error from prefix features alone, and the Decision Transformer posts a solid macro-F1 of 0.79.

The work does a clean job on the six-arm ablation with 800 students and four cutoffs. It reports concrete rates, shows the DT handling all five action classes without collapse, and notes that LLM-as-judge metrics reward fluency over matching the oracle. The scope statement keeps the claim narrow: this is about recovering one policy from structured data, not general high-stakes capability.

The soft spot is the oracle. Every number, including the 43-point gap and the supervised recovery, is defined against labels taken from full trajectories. The paper gives no external check that those hindsight decisions actually improved outcomes or that unobserved factors did not leave the oracle itself suboptimal. If the oracle is just one reasonable rule rather than the right one, the results show imitation of that rule, not a general bias in zero-shot agents. That assumption is load-bearing but untested here.

This is for people comparing LLM agents to classical ML on educational or similar advising tasks where extra contacts cost resources. Readers who care about calibration on structured logs will get value from the contrast. It deserves peer review because the empirical setup is concrete and the bias observation is worth checking, even if the oracle needs more justification.

Referee Report

1 major / 2 minor

Summary. The paper claims to identify 'intervention bias' as a failure mode of zero-shot LLM educational advisory agents on the OULAD dataset (N=800 students, four temporal cutoffs). At day 56, where an oracle policy labels 70.1% of students as needing no intervention, zero-shot GPT-4o recommends action for 73% (43pp false-positive rate); commercial RAG and SQL variants are similarly miscalibrated. Supervised models (trajectory-conditioned ONNX Decision Transformer and snapshot XGBoost) trained on the same oracle-labeled prefix-only trajectories achieve near-zero calibration error, with the DT reaching macro-F1 0.79 (macro-recall 0.85) across five action classes at 0% flip rate and sub-5ms latency. It also reports an evaluation gap where LLM-as-judge scoring misses the bias.

Significance. If the oracle policy is accepted as a valid benchmark, the work quantifies a concrete, previously unmeasured failure mode in zero-shot agents for high-stakes decisions and shows that supervised policy learning on prefix features can recover the oracle with practical deployability advantages. The LLM-as-judge gap is a useful secondary observation. The manuscript already qualifies its scope to Stage-2 decision-making under controlled oracle input and emphasizes that the bias contrast is the most robust finding.

major comments (1)

[Oracle policy derivation and labeling (methods and abstract)] The oracle policy (derived from full student trajectories) is used to label all training data and compute all false-positive and calibration metrics, yet the manuscript provides no external validation that these labels correspond to decisions that would have improved outcomes. Unobserved confounders could render the oracle itself suboptimal; if so, the reported bias contrast and the supervised models' near-zero calibration error demonstrate only recovery of one particular labeling rule rather than elimination of a general failure mode.

minor comments (2)

[Results (six-arm ablation)] The six-arm ablation reports concrete percentages but does not include statistical significance tests or confidence intervals on the differences between arms.
[Abstract and results] The claim that the DT's edge over XGBoost at the final cutoff is 'indicative only' is appropriate, but the manuscript should state the unpaired nature of the cohorts more explicitly in the abstract as well.
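The first minor comment asks for confidence intervals on differences between arms. One standard way to obtain them, sketched here, is a percentile bootstrap that resamples students and recomputes the gap between an arm's intervention rate and the oracle's. The function name and the toy flags (sized to N=800 and constructed to match the reported 73% and roughly 29.9% rates) are assumptions for illustration; the resulting interval is not a result from the paper.

```python
import random
from typing import Sequence, Tuple


def bootstrap_gap_ci(
    pred_flags: Sequence[int],
    oracle_flags: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean(pred) - mean(oracle), resampling students.

    Flags are 1 if the policy intervenes for that student, else 0. Students are
    resampled jointly so each draw keeps a student's two flags paired.
    """
    n = len(pred_flags)
    if n == 0 or n != len(oracle_flags):
        raise ValueError("flag sequences must be non-empty and equally long")
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(pred_flags[i] - oracle_flags[i] for i in idx) / n)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy flags for 800 students: 584 advisor interventions, 239 oracle interventions.
pred_flags = [1] * 584 + [0] * 216
oracle_flags = [1] * 239 + [0] * 561
point_gap = (sum(pred_flags) - sum(oracle_flags)) / len(pred_flags)

ci_low, ci_high = bootstrap_gap_ci(pred_flags, oracle_flags)
print(f"gap={point_gap:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

For comparisons across cohorts that do not share students, such as the unpaired final-cutoff comparison the abstract mentions, the two groups would need to be resampled independently instead.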

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below while preserving the manuscript's existing scope qualifications.

Referee: The oracle policy (derived from full student trajectories) is used to label all training data and compute all false-positive and calibration metrics, yet the manuscript provides no external validation that these labels correspond to decisions that would have improved outcomes. Unobserved confounders could render the oracle itself suboptimal; if so, the reported bias contrast and the supervised models' near-zero calibration error demonstrate only recovery of one particular labeling rule rather than elimination of a general failure mode.

Authors: We agree that the oracle is a hindsight-derived labeling rule from full trajectories and that the manuscript contains no external validation (e.g., randomized outcome data) showing these labels improve student results. The work is scoped to measuring deviation from this specific benchmark; the intervention-bias contrast quantifies LLM over-prescription relative to the oracle, and the supervised models' calibration shows they recover the same rule under prefix-only features. This remains a valid demonstration of the failure mode even if the oracle is suboptimal due to unobserved confounders. We will revise the abstract and methods to state more explicitly that the oracle serves as a controlled benchmark without external outcome validation, to avoid any implication of general optimality.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation uses external oracle benchmark

The paper extracts an oracle policy from full OULAD trajectories as an independent labeling rule for intervention decisions. Both zero-shot systems and supervised models (DT, XGBoost) are evaluated against this fixed external oracle using prefix-only features. Supervised performance (near-zero calibration error, macro-F1 0.79) measures recovery of the oracle labels, while zero-shot deviation measures the bias contrast. This is standard supervised evaluation against an external benchmark, not a self-referential loop. The paper explicitly caveats that absolute accuracies reflect feature-oracle alignment rather than general capability. No self-citations, self-definitional steps, or fitted inputs renamed as predictions appear in the derivation. The central claims are empirical contrasts to the oracle and remain falsifiable against it.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim depends on the validity of the oracle policy as a ground truth and the assumption that prefix-only features suffice for prediction without future leakage.

free parameters (1)

model hyperparameters for DT and XGBoost
The supervised models are trained, implying fitted parameters, but specific values are not detailed in the abstract.

axioms (1)

domain assumption: The OULAD dataset provides structured features that align with an oracle policy for intervention decisions.
Invoked in the scope note that high fidelity reflects feature-oracle alignment.

