pith. machine review for the scientific record.

arxiv: 2605.07675 · v1 · submitted 2026-05-08 · 💻 cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

FactoryBench: Evaluating Industrial Machine Understanding

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 02:35 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords benchmark · machine understanding · large language models · industrial robotics · causal reasoning · telemetry data · zero-shot evaluation · robotic arms

The pith

No frontier LLM exceeds 50 percent accuracy on structured causal questions about industrial machines or 18 percent on decision-making.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds FactoryBench to test whether time-series models and large language models can understand industrial robotic machines from their sensor data. Questions are generated at four causal levels using structured templates, covering observation of machine states, effects of interventions, what would have happened otherwise, and actual decisions. Zero-shot tests on six leading models show none clearing 50 percent on the structured formats or 18 percent on decision tasks, based on more than 70,000 questions drawn from real robot episodes. A sympathetic reader would care because operational understanding of machines is required for reliable automation, predictive maintenance, and safe factory operations where mistakes carry real costs.

Core claim

FactoryBench shows that current frontier large language models lack operational machine understanding of industrial robots, as no model reaches more than 50 percent on structured questions at the state, intervention, counterfactual, and decision levels or more than 18 percent on decision-making when evaluated zero-shot against normalized telemetry episodes from cobots and industrial arms.

What carries the argument

FactoryBench, a collection of over 70,000 question-answer pairs generated by structured templates from normalized episodes of multivariate sensor data, with deterministic scoring on four structured formats and LLM-as-judge voting on free-form answers.
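The two scoring modes that carry the argument can be illustrated with a minimal sketch. The function names and the exact-match rules below are assumptions for illustration, not the paper's released code; the only detail taken from the review is that structured formats are scored deterministically and free-form answers by a median-of-three LLM-judge vote.

```python
from statistics import median

def score_single_select(predicted: str, gold: str) -> float:
    """Deterministic scoring for a single-select MCQ item:
    exact match after trivial normalization (assumed rule)."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

def score_multi_select(predicted: set, gold: set) -> float:
    """One plausible deterministic rule for a multi-select item:
    the predicted option set must equal the gold set exactly."""
    return 1.0 if predicted == gold else 0.0

def score_free_form(judge_scores: list) -> float:
    """LLM-as-judge voting for free-form answers: the review
    describes taking the median of three judges' scores."""
    return float(median(judge_scores))
```

For example, `score_free_form([0.2, 0.8, 0.6])` returns the middle judge's score, so a single outlier judge cannot move the item score on its own.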

Load-bearing premise

The structured question templates and LLM-as-judge protocol accurately capture genuine machine understanding rather than surface-level pattern matching or judge bias.

What would settle it

A new model that scores above 60 percent on the decision-making level of the benchmark while preserving high scores on the structured levels across repeated evaluations would indicate the claimed performance gap is smaller than reported.

Figures

Figures reproduced from arXiv: 2605.07675 by Alessandro Lombardi, Balazs Gunther, Camilla Mazzoleni, Coral Izquierdo, Federico Martelli, Jonas Petersen, Marcos Gomez-Bracamonte, Matei Ignuta-Ciuncanu, Philipp Petersen, Riccardo Maggioni, Yanis Merzouki.

Figure 1
Figure 1. End-to-end FactoryBench pipeline. (1) Q&A construction: normalized episodes from FactoryWave, AURSAD [1], and voraus-AD [2] are paired with typed-variable templates (plus the knowledge graph for L4) to generate Q&A items at four reasoning levels. (2) Evaluation: the model’s answers are scored deterministically for L1–3 and by the median of three LLM judges for L4 free-form, then chance-corrected per level.
Figure 2
Figure 2. FactoryWave tasks and fault catalogue. Photographs of the robotic platforms (UR3, KUKA …)
Figure 3
Figure 3. Main results. (a) Cross-model panel performance across all four reasoning levels. (b) Per-fault and per-parameter breakdown of the GPT-5.1 L4 mass.
Figure 4
Figure 4. The four levels of machine understanding defined by FactoryBench.
Figure 5
Figure 5. The arm-disturbance fault as physically applied to the KUKA in FactoryWave.
Figure 6
Figure 6. Sensor data distribution across multiple healthy KUKA episodes (blue) versus multiple …
Figure 7
Figure 7. Physical causal schema. Setpoint commands …
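The Figure 1 caption notes that scores are chance-corrected per level. One common convention, assumed here since the page does not reproduce the paper's exact formula, rescales raw accuracy so that random guessing maps to 0 and perfect accuracy to 1:

```python
def chance_corrected(raw: float, chance: float) -> float:
    """Chance-corrected accuracy (assumed convention):
    maps chance-level performance to 0 and perfect performance to 1.

    `chance` is the expected accuracy of random guessing for the
    format, e.g. 0.25 for a four-option single-select MCQ.
    """
    return (raw - chance) / (1.0 - chance)
```

Under this convention a raw 40% on four-option MCQs corresponds to only 20% above-chance performance, which is why a correction of this kind matters when formats have different numbers of options.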
read the original abstract

We introduce FactoryBench, a benchmark for evaluating time-series models and LLMs on machine understanding over industrial robotic telemetry. Q&A pairs are organized along four causal levels (state, intervention, counterfactual, decision) instantiating Pearl's ladder of causation, and span five answer formats: four structured formats are scored deterministically and free-form answers are scored by an LLM-as-judge voting protocol. We propose a scalable Q&A generation framework built around structured question templates, present FactoryWave (a dense, multitask, multivariate sensor dataset collected from a UR3 cobot and a KUKA KR10 industrial arm), and construct FactoryBench as a large-scale benchmark of over 70k Q&A items grounded in roughly 15k normalized episodes from FactoryWave, AURSAD, and voraus-AD. Zero-shot evaluation of six frontier LLMs shows that no model exceeds 50% on structured levels or 18% on decision-making, revealing a wide gap between current models and operational machine understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces FactoryBench, a benchmark of over 70k Q&A pairs grounded in industrial robot telemetry from FactoryWave (new UR3/KUKA dataset), AURSAD, and voraus-AD. Questions follow Pearl's ladder of causation across state, intervention, counterfactual, and decision levels, with four structured answer formats scored deterministically and free-form answers scored via LLM-as-judge voting. A scalable template-based generation framework is proposed. Zero-shot evaluation of six frontier LLMs reports no model exceeding 50% on structured levels or 18% on decision-making, interpreted as evidence of a wide gap to operational machine understanding.

Significance. If the benchmark construction and scoring protocols hold, the work offers a scalable, causally structured resource for probing LLM and time-series model capabilities in industrial settings, with the deterministic structured scoring providing an objective strength. The results, if interpreted with appropriate baselines, could usefully quantify current limitations in causal reasoning over telemetry and motivate targeted improvements.

major comments (1)
  1. [Abstract and §5] Abstract and §5 (Zero-shot Evaluation): the central claim that LLM scores reveal a 'wide gap between current models and operational machine understanding' is not supported without reported human or expert performance baselines on the identical Q&A items. The deterministic structured scores (≤50%) and decision-making scores (≤18%) could reflect task difficulty rather than model deficiencies, undermining the gap interpretation.
minor comments (2)
  1. [§4] §4 (Benchmark Construction): the LLM-as-judge voting protocol for free-form answers would benefit from explicit reporting of inter-judge agreement metrics and full prompt templates to support reproducibility.
  2. [Abstract] The abstract states the benchmark evaluates both time-series models and LLMs, but only LLM zero-shot results are presented; clarify whether time-series models were evaluated and, if not, adjust the scope statement.
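The inter-judge agreement the referee asks for in minor comment 1 could be summarized with a simple statistic such as mean pairwise agreement (Fleiss' kappa would be the chance-corrected analogue). The judge labels below are hypothetical, purely to show the shape of the computation:

```python
from itertools import combinations

def pairwise_agreement(ratings: list) -> float:
    """Mean fraction of items on which each pair of judges gives the
    same label; ratings[j][i] is judge j's label for item i."""
    pairs = list(combinations(range(len(ratings)), 2))
    agree = [
        sum(a == b for a, b in zip(ratings[j], ratings[k])) / len(ratings[j])
        for j, k in pairs
    ]
    return sum(agree) / len(agree)

# Hypothetical verdicts from three LLM judges on four free-form items.
judges = [
    ["correct", "wrong", "correct", "correct"],
    ["correct", "wrong", "wrong",   "correct"],
    ["correct", "correct", "wrong", "correct"],
]
```

Reporting a number like this alongside the judge prompts would let readers gauge how much the median-of-three vote is doing.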

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for identifying a key point regarding the strength of our interpretive claims. We address the major comment below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Abstract and §5] Abstract and §5 (Zero-shot Evaluation): the central claim that LLM scores reveal a 'wide gap between current models and operational machine understanding' is not supported without reported human or expert performance baselines on the identical Q&A items. The deterministic structured scores (≤50%) and decision-making scores (≤18%) could reflect task difficulty rather than model deficiencies, undermining the gap interpretation.

    Authors: We agree that the manuscript's interpretation of a 'wide gap' to operational machine understanding would be strengthened by human or expert performance baselines on the same items. The current version reports only zero-shot LLM results and does not include such baselines, so the claim rests on the design of the questions (grounded in real telemetry and Pearl's ladder) rather than direct comparison. We will revise the abstract and §5 to qualify the language, replacing the 'wide gap' phrasing with a more measured statement that the results demonstrate current frontier LLMs achieve low accuracy on structured causal and decision-making questions over industrial telemetry. We will also add a dedicated limitations paragraph noting the absence of expert baselines and suggesting their collection as valuable future work. This change preserves the empirical findings while addressing the concern that task difficulty alone could explain the scores. revision: partial

Circularity Check

0 steps flagged

Empirical benchmark construction with direct measurements; no derivation chain

full rationale

The paper introduces FactoryBench via structured question templates grounded in external datasets (FactoryWave, AURSAD, voraus-AD) and Pearl's ladder of causation. It reports zero-shot LLM scores as direct empirical measurements on deterministic structured formats and LLM-as-judge free-form scoring. No equations, fitted parameters, self-citations, or ansatzes are used to derive results; the central claims are benchmark scores, not predictions or theorems that reduce to the paper's own inputs by construction. This is self-contained empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or theoretical assumptions beyond standard practices in benchmark construction and LLM evaluation.

pith-pipeline@v0.9.0 · 5508 in / 964 out tokens · 39330 ms · 2026-05-11T02:35:47.627020+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 4 internal anchors

  1. [1]

    AURSAD: Universal robot screwdriving anomaly detection dataset

    Błażej Leporowski, Daniella Tola, Casper Hansen, and Alexandros Iosifidis. AURSAD: Universal robot screwdriving anomaly detection dataset. arXiv preprint arXiv:2102.01409, 2021

  2. [2]

    The voraus-AD dataset for anomaly detection in robot applications

    Jan Thieß Brockmann, Marco Rudolph, Bodo Rosenhahn, and Bastian Wandt. The voraus-AD dataset for anomaly detection in robot applications. IEEE Transactions on Robotics, 40:438–451

  3. [3]

    Informer: Beyond efficient transformer for long sequence time-series forecasting

    Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106–11115, 2021

  4. [4]

    Robust anomaly detection for multivariate time series through stochastic recurrent neural network

    Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019

  5. [5]

    USAD: Unsupervised anomaly detection on multivariate time series

    Julien Audibert, Pietro Michiardi, Frédéric Guyard, Sébastien Marti, and Maria A. Zuluaga. USAD: Unsupervised anomaly detection on multivariate time series. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2020

  6. [6]

    Deep Learning for Anomaly Detection: A Survey

    Raghavendra Chalapathy and Sanjay Chawla. Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407, 2019

  7. [7]

    Evaluating real-time anomaly detection algorithms— the Numenta anomaly benchmark

    Alexander Lavin and Subutai Ahmad. Evaluating real-time anomaly detection algorithms—the Numenta anomaly benchmark. In Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), pages 38–44, 2015

  8. [8]

    Are transformers effective for time series forecasting?

    Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, 2023

  9. [9]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  10. [10]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  11. [11]

    Large language models are zero-shot time series forecasters

    Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew Gordon Wilson. Large language models are zero-shot time series forecasters. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  12. [12]

    Time-LLM: Time series forecasting by reprogramming large language models

    Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y. Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, and Qingsong Wen. Time-LLM: Time series forecasting by reprogramming large language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2024

  13. [13]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017

  14. [14]

    Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting

    Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  15. [15]

    Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting

    Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning, pages 27268–27286. PMLR, 2022

  16. [16]

    A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

    Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730, 2022

  17. [17]

    Chronos: Learning the Language of Time Series

    Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, Jasper Zschiegner, Danielle C. Maddix, Hao Wang, Michael W. Mahoney, Kari Torkkola, Andrew Gordon Wilson, Michael Bohlke-Schneider, and Yuyang Wang. Chronos: Learning the language of time series

  18. [18]

    A decoder-only foundation model for time-series forecasting

    Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. In Proceedings of the International Conference on Machine Learning (ICML), 2024. arXiv:2310.10688

  19. [19]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In Proceedings of the International Conference on Learning Representations (ICLR), 2021

  20. [20]

    Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

    Aarohi Srivastava et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023

  21. [21]

    MMMU: A massive multi-discipline multi- modal understanding and reasoning benchmark for expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multi-modal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  22. [22]

    TimeSeriesExam: A time series understanding exam, 2024

    Yifu Cai, Arjun Choudhry, Mononito Goswami, and Artur Dubrawski. TimeSeriesExam: A time series understanding exam, 2024

  23. [23]

    ChatTS: Aligning time series with LLMs via synthetic data for enhanced understanding and reasoning

    Zhe Xie, Zeyan Li, Xiao He, Longlong Xu, Xidao Wen, Tieying Zhang, Jianjun Chen, Rui Shi, and Dan Pei. ChatTS: Aligning time series with LLMs via synthetic data for enhanced understanding and reasoning. Proceedings of the VLDB Endowment, 18(8):2385–2398, 2025

  24. [24]

    ITFormer: Bridging time series and natural language for multi-modal QA with large-scale multitask dataset, 2025

    Yilin Wang, Peixuan Lei, Jie Song, Yuzhe Hao, Tao Chen, Yuxuan Zhang, Lei Jia, Yuanxiang Li, and Zhongyu Wei. ITFormer: Bridging time series and natural language for multi-modal QA with large-scale multitask dataset, 2025

  25. [25]

    TSAQA: Time series analysis question and answering benchmark, 2026

    Baoyu Jing, Sanhorn Chen, Lecheng Zheng, Boyu Liu, Zihao Li, Jiaru Zou, Tianxin Wei, Zhining Liu, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Yuchen Yan, Dongqi Fu, Jingchao Ni, Jingrui He, and Hanghang Tong. TSAQA: Time series analysis question and answering benchmark, 2026

  26. [26]

    Time-MQA: Time series multi-task question answering with context enhancement, 2025

    Yaxuan Kong, Yiyuan Yang, Yoontae Hwang, Wenjie Du, Stefan Zohren, Zhangyang Wang, Ming Jin, and Qingsong Wen. Time-MQA: Time series multi-task question answering with context enhancement, 2025

  27. [27]

    MTBench: A multimodal time series benchmark for temporal reasoning and question answering, 2026

    Jialin Chen, Aosong Feng, Ziyu Zhao, Juan Garza, Gaukhar Nurbek, Cheng Qin, Ali Maatouk, Leandros Tassiulas, Yifeng Gao, and Rex Ying. MTBench: A multimodal time series benchmark for temporal reasoning and question answering, 2026

  28. [28]

    QuAnTS: Question answering on time series

    Felix Wenkel, Ghada Sokar, Yuan Zhang, and Mirco Ravanelli. QuAnTS: Question answering on time series. arXiv preprint arXiv:2411.04795, 2024

  29. [29]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  30. [30]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2023

  31. [31]

    Tool learning with foundation models

    Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zhiyuan Zeng, Yujia Huang, Chaojun Xiao, Chi Han, et al. Tool learning with foundation models. ACM Computing Surveys, 2024. arXiv:2304.08354

  32. [32]

    Temporal fusion transformers for interpretable multi-horizon time series forecasting

    Bryan Lim, Sercan Ö. Arık, Nicolas Loeff, and Tomas Pfister. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4):1748–1764, 2021

  33. [33]

    N-BEATS: Neural basis expansion analysis for interpretable time series forecasting

    Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In Proceedings of the International Conference on Learning Representations (ICLR), 2020

  34. [34]

    Deep one-class classification

    Lukas Ruff, Robert A. Vandermeulen, Nico Görnitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In Proceedings of the International Conference on Machine Learning (ICML), 2018

  35. [35]

    Causality: Models, Reasoning, and Inference

    Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009

  36. [36]

    Elements of Causal Inference

    Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference. MIT Press, 2017

  37. [37]

    Estimating causal effects of treatments in randomized and nonrandomized studies

    Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701, 1974

  38. [38]

    Investigating causal relations by econometric models and cross-spectral methods

    Clive W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3):424–438, 1969

  39. [39]

    Detecting and quantifying causal associations in large nonlinear time series datasets

    Jakob Runge, Peer Nowack, Marlene Kretschmer, Seth Flaxman, and Dino Sejdinovic. Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5(11):eaau4996, 2019

  40. [40]

    Generic and scalable framework for automated time-series anomaly detection

    Nikolay Laptev, Saeed Amizadeh, and Ian Flint. Generic and scalable framework for automated time-series anomaly detection. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015

  41. [41]

    The UCR time series archive

    Hoang Anh Dau, Anthony Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, and Eamonn Keogh. The UCR time series archive. IEEE/CAA Journal of Automatica Sinica, 6(6):1293–1305, 2019

  42. [42]

    Transformers in time series: A survey

    Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, and Liang Sun. Transformers in time series: A survey. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), pages 6778–6786, 2023. arXiv:2202.07125

  43. [43]

    CLadder: A benchmark to assess causal reasoning capabilities of language models

    Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, and Bernhard Schölkopf. CLadder: A benchmark to assess causal reasoning capabilities of language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  44. [44]

    Deep structural causal models for tractable counterfactual inference

    Nick Pawlowski, Daniel C. Castro, and Ben Glocker. Deep structural causal models for tractable counterfactual inference. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  45. [45]

    A ladder of causal distances

    Maxime Peyrard and Robert West. A ladder of causal distances. arXiv preprint arXiv:2005.02480, 2020

  46. [46]

    Claude Sonnet 4.6 system card, 2026

    Anthropic. Claude Sonnet 4.6 system card, 2026. Anthropic system card, February 2026. https://www.anthropic.com/claude-sonnet-4-6-system-card

  47. [47]

    GPT-5.1 system card, 2025

    OpenAI. GPT-5.1 system card, 2025. OpenAI system card addendum. https://openai.com/index/gpt-5-system-card-addendum-gpt-5-1/

  48. [48]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    DeepSeek-AI. DeepSeek-V3.2: Pushing the frontier of open large language models, 2025. arXiv:2512.02556. https://arxiv.org/abs/2512.02556

  49. [49]

    Mistral Large 3: Model card, 2025

    Mistral AI. Mistral Large 3: Model card, 2025. Mistral AI announcement. https://mistral.ai/news/mistral-3

  50. [50]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report, 2025. arXiv:2505.09388

  51. [51]

    On Level 1 every model is essentially flat or improves slightly under substitution. This is consistent with L1 templates being substantially answerable from the question wording, the option set, and the still-present discrete annotations of task phase and fault label. It also flags L1 as the level where overlap between LLM scores and a random/regression b...

  52. [52]

    what would have happened

    On Level 2 and Level 3 every model drops, often substantially: DeepSeek and Mistral lose 9–11 points on L3, and gpt-5.1 loses 7.7 on L2. The L3 drops are particularly informative because L3 templates ask counterfactual “what would have happened” questions where the answer hinges on the actual signal trajectory rather than on a discrete annotation

  53. [53]

    no anomaly is present

    On Level 4, gpt-5.1’s 7.6% original score collapses to 0.5% under noise substitution, a 15× drop. Inspection of the per-item rubric notes shows that gpt-5.1’s L4 mass on the original data came almost entirely from correctly emitting “no anomaly is present” on truly nominal episodes; once the time series is replaced by Gaussian noise, the same nominal episo...