pith. machine review for the scientific record.

arxiv: 2605.07675 · v1 · submitted 2026-05-08 · 💻 cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

FactoryBench: Evaluating Industrial Machine Understanding

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 02:35 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords benchmark · machine understanding · large language models · industrial robotics · causal reasoning · telemetry data · zero-shot evaluation · robotic arms

The pith

No frontier LLM exceeds 50 percent accuracy on structured causal questions about industrial machines or 18 percent on decision-making.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds FactoryBench to test whether time-series models and large language models can understand industrial robotic machines from their sensor data. Questions are generated at four causal levels using structured templates, covering observation of machine states, effects of interventions, what would have happened otherwise, and actual decisions. Zero-shot tests on six leading models show none clearing 50 percent on the structured formats or 18 percent on decision tasks, based on more than 70,000 questions drawn from real robot episodes. A sympathetic reader would care because operational understanding of machines is required for reliable automation, predictive maintenance, and safe factory operations where mistakes carry real costs.

Core claim

FactoryBench shows that current frontier large language models lack operational machine understanding of industrial robots, as no model reaches more than 50 percent on structured questions at the state, intervention, counterfactual, and decision levels or more than 18 percent on decision-making when evaluated zero-shot against normalized telemetry episodes from cobots and industrial arms.

What carries the argument

FactoryBench, a collection of over 70,000 question-answer pairs generated by structured templates from normalized episodes of multivariate sensor data, with deterministic scoring on four structured formats and LLM-as-judge voting on free-form answers.
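The two scoring modes that carry the argument can be illustrated with a minimal sketch. The function names and the exact-match rules below are assumptions for illustration, not the paper's released code; the only detail taken from the review is that structured formats are scored deterministically and free-form answers by a median-of-three LLM-judge vote.

```python
from statistics import median

def score_single_select(predicted: str, gold: str) -> float:
    """Deterministic scoring for a single-select MCQ item:
    exact match after trivial normalization (assumed rule)."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

def score_multi_select(predicted: set, gold: set) -> float:
    """One plausible deterministic rule for a multi-select item:
    the predicted option set must equal the gold set exactly."""
    return 1.0 if predicted == gold else 0.0

def score_free_form(judge_scores: list) -> float:
    """LLM-as-judge voting for free-form answers: the review
    describes taking the median of three judges' scores."""
    return float(median(judge_scores))
```

For example, `score_free_form([0.2, 0.8, 0.6])` returns the middle judge's score, so a single outlier judge cannot move the item score on its own.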

Load-bearing premise

The structured question templates and LLM-as-judge protocol accurately capture genuine machine understanding rather than surface-level pattern matching or judge bias.

What would settle it

A new model that scores above 60 percent on the decision-making level of the benchmark while preserving high scores on the structured levels across repeated evaluations would indicate the claimed performance gap is smaller than reported.

Figures

Figures reproduced from arXiv: 2605.07675 by Alessandro Lombardi, Balazs Gunther, Camilla Mazzoleni, Coral Izquierdo, Federico Martelli, Jonas Petersen, Marcos Gomez-Bracamonte, Matei Ignuta-Ciuncanu, Philipp Petersen, Riccardo Maggioni, Yanis Merzouki.

Figure 1
Figure 1. End-to-end FactoryBench pipeline. (1) Q&A construction: normalized episodes from FactoryWave, AURSAD [1], and voraus-AD [2] are paired with typed-variable templates (plus the knowledge graph for L4) to generate Q&A items at four reasoning levels. (2) Evaluation: the model’s answers are scored deterministically for L1–3 and by the median of three LLM judges for L4 free-form, then chance-corrected per level.
Figure 2
Figure 2. FactoryWave tasks and fault catalogue. Photographs of the robotic platforms (UR3, KUKA …)
Figure 3
Figure 3. Main results. (a) Cross-model panel performance across all four reasoning levels. (b) Per-fault and per-parameter breakdown of the GPT-5.1 L4 mass.
Figure 4
Figure 4. The four levels of machine understanding defined by FactoryBench.
Figure 5
Figure 5. The arm-disturbance fault as physically applied to the KUKA in FactoryWave.
Figure 6
Figure 6. Sensor data distribution across multiple healthy KUKA episodes (blue) versus multiple …
Figure 7
Figure 7. Physical causal schema. Setpoint commands …
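The Figure 1 caption notes that scores are chance-corrected per level. One common convention, assumed here since the page does not reproduce the paper's exact formula, rescales raw accuracy so that random guessing maps to 0 and perfect accuracy to 1:

```python
def chance_corrected(raw: float, chance: float) -> float:
    """Chance-corrected accuracy (assumed convention):
    maps chance-level performance to 0 and perfect performance to 1.

    `chance` is the expected accuracy of random guessing for the
    format, e.g. 0.25 for a four-option single-select MCQ.
    """
    return (raw - chance) / (1.0 - chance)
```

Under this convention a raw 40% on four-option MCQs corresponds to only 20% above-chance performance, which is why a correction of this kind matters when formats have different numbers of options.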
read the original abstract

We introduce FactoryBench, a benchmark for evaluating time-series models and LLMs on machine understanding over industrial robotic telemetry. Q&A pairs are organized along four causal levels (state, intervention, counterfactual, decision) instantiating Pearl's ladder of causation, and span five answer formats: four structured formats are scored deterministically and free-form answers are scored by an LLM-as-judge voting protocol. We propose a scalable Q&A generation framework built around structured question templates, present FactoryWave (a dense, multitask, multivariate sensor dataset collected from a UR3 cobot and a KUKA KR10 industrial arm), and construct FactoryBench as a large-scale benchmark of over 70k Q&A items grounded in roughly 15k normalized episodes from FactoryWave, AURSAD, and voraus-AD. Zero-shot evaluation of six frontier LLMs shows that no model exceeds 50% on structured levels or 18% on decision-making, revealing a wide gap between current models and operational machine understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces FactoryBench, a benchmark of over 70k Q&A pairs grounded in industrial robot telemetry from FactoryWave (new UR3/KUKA dataset), AURSAD, and voraus-AD. Questions follow Pearl's ladder of causation across state, intervention, counterfactual, and decision levels, with four structured answer formats scored deterministically and free-form answers scored via LLM-as-judge voting. A scalable template-based generation framework is proposed. Zero-shot evaluation of six frontier LLMs reports no model exceeding 50% on structured levels or 18% on decision-making, interpreted as evidence of a wide gap to operational machine understanding.

Significance. If the benchmark construction and scoring protocols hold, the work offers a scalable, causally structured resource for probing LLM and time-series model capabilities in industrial settings, with the deterministic structured scoring providing an objective strength. The results, if interpreted with appropriate baselines, could usefully quantify current limitations in causal reasoning over telemetry and motivate targeted improvements.

major comments (1)
  1. [Abstract and §5] Abstract and §5 (Zero-shot Evaluation): the central claim that LLM scores reveal a 'wide gap between current models and operational machine understanding' is not supported without reported human or expert performance baselines on the identical Q&A items. The deterministic structured scores (≤50%) and decision-making scores (≤18%) could reflect task difficulty rather than model deficiencies, undermining the gap interpretation.
minor comments (2)
  1. [§4] §4 (Benchmark Construction): the LLM-as-judge voting protocol for free-form answers would benefit from explicit reporting of inter-judge agreement metrics and full prompt templates to support reproducibility.
  2. [Abstract] The abstract states the benchmark evaluates both time-series models and LLMs, but only LLM zero-shot results are presented; clarify whether time-series models were evaluated and, if not, adjust the scope statement.
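The inter-judge agreement the referee asks for in minor comment 1 could be summarized with a simple statistic such as mean pairwise agreement (Fleiss' kappa would be the chance-corrected analogue). The judge labels below are hypothetical, purely to show the shape of the computation:

```python
from itertools import combinations

def pairwise_agreement(ratings: list) -> float:
    """Mean fraction of items on which each pair of judges gives the
    same label; ratings[j][i] is judge j's label for item i."""
    pairs = list(combinations(range(len(ratings)), 2))
    agree = [
        sum(a == b for a, b in zip(ratings[j], ratings[k])) / len(ratings[j])
        for j, k in pairs
    ]
    return sum(agree) / len(agree)

# Hypothetical verdicts from three LLM judges on four free-form items.
judges = [
    ["correct", "wrong", "correct", "correct"],
    ["correct", "wrong", "wrong",   "correct"],
    ["correct", "correct", "wrong", "correct"],
]
```

Reporting a number like this alongside the judge prompts would let readers gauge how much the median-of-three vote is doing.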

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for identifying a key point regarding the strength of our interpretive claims. We address the major comment below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Abstract and §5] Abstract and §5 (Zero-shot Evaluation): the central claim that LLM scores reveal a 'wide gap between current models and operational machine understanding' is not supported without reported human or expert performance baselines on the identical Q&A items. The deterministic structured scores (≤50%) and decision-making scores (≤18%) could reflect task difficulty rather than model deficiencies, undermining the gap interpretation.

    Authors: We agree that the manuscript's interpretation of a 'wide gap' to operational machine understanding would be strengthened by human or expert performance baselines on the same items. The current version reports only zero-shot LLM results and does not include such baselines, so the claim rests on the design of the questions (grounded in real telemetry and Pearl's ladder) rather than direct comparison. We will revise the abstract and §5 to qualify the language, replacing the 'wide gap' phrasing with a more measured statement that the results demonstrate current frontier LLMs achieve low accuracy on structured causal and decision-making questions over industrial telemetry. We will also add a dedicated limitations paragraph noting the absence of expert baselines and suggesting their collection as valuable future work. This change preserves the empirical findings while addressing the concern that task difficulty alone could explain the scores. revision: partial

Circularity Check

0 steps flagged

Empirical benchmark construction with direct measurements; no derivation chain

full rationale

The paper introduces FactoryBench via structured question templates grounded in external datasets (FactoryWave, AURSAD, voraus-AD) and Pearl's ladder of causation. It reports zero-shot LLM scores as direct empirical measurements on deterministic structured formats and LLM-as-judge free-form scoring. No equations, fitted parameters, self-citations, or ansatzes are used to derive results; the central claims are benchmark scores, not predictions or theorems that reduce to the paper's own inputs by construction. This is self-contained empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or theoretical assumptions beyond standard practices in benchmark construction and LLM evaluation.

pith-pipeline@v0.9.0 · 5508 in / 964 out tokens · 39330 ms · 2026-05-11T02:35:47.627020+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 4 internal anchors

  1. [1]

    AURSAD: Universal robot screwdriving anomaly detection dataset

    Błażej Leporowski, Daniella Tola, Casper Hansen, and Alexandros Iosifidis. AURSAD: Universal robot screwdriving anomaly detection dataset. arXiv preprint arXiv:2102.01409, 2021

  2. [2]

    The voraus-AD dataset for anomaly detection in robot applications

    Jan Thieß Brockmann, Marco Rudolph, Bodo Rosenhahn, and Bastian Wandt. The voraus-AD dataset for anomaly detection in robot applications. IEEE Transactions on Robotics, 40:438–451

  3. [3]

    Informer: Beyond efficient transformer for long sequence time-series forecasting

    Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106–11115, 2021

  4. [4]

    Robust anomaly detection for multivariate time series through stochastic recurrent neural network

    Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019

  5. [5]

    USAD: Unsupervised anomaly detection on multivariate time series

    Julien Audibert, Pietro Michiardi, Frédéric Guyard, Sébastien Marti, and Maria A. Zuluaga. USAD: Unsupervised anomaly detection on multivariate time series. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2020

  6. [6]

    Deep Learning for Anomaly Detection: A Survey

    Raghavendra Chalapathy and Sanjay Chawla. Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407, 2019

  7. [7]

    Evaluating real-time anomaly detection algorithms— the Numenta anomaly benchmark

    Alexander Lavin and Subutai Ahmad. Evaluating real-time anomaly detection algorithms—the Numenta anomaly benchmark. In Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), pages 38–44, 2015

  8. [8]

    Are transformers effective for time series forecasting?

    Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, 2023

  9. [9]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  10. [10]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  11. [11]

    Large language models are zero-shot time series forecasters

    Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew Gordon Wilson. Large language models are zero-shot time series forecasters. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  12. [12]

    Time-LLM: Time series forecasting by reprogramming large language models

    Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y. Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, and Qingsong Wen. Time-LLM: Time series forecasting by reprogramming large language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2024

  13. [13]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017

  14. [14]

    Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting

    Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  15. [15]

    Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting

    Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning, pages 27268–27286. PMLR, 2022

  16. [16]

    A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

    Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730, 2022

  17. [17]

    Chronos: Learning the Language of Time Series

    Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, Jasper Zschiegner, Danielle C. Maddix, Hao Wang, Michael W. Mahoney, Kari Torkkola, Andrew Gordon Wilson, Michael Bohlke-Schneider, and Yuyang Wang. Chronos: Learning the language of time series

  18. [18]

    A decoder-only foundation model for time-series forecasting

    Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. In Proceedings of the International Conference on Machine Learning (ICML), 2024. arXiv:2310.10688

  19. [19]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In Proceedings of the International Conference on Learning Representations (ICLR), 2021

  20. [20]

    Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

    Aarohi Srivastava et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023

  21. [21]

    MMMU: A massive multi-discipline multi- modal understanding and reasoning benchmark for expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multi-modal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  22. [22]

    TimeSeriesExam: A time series understanding exam, 2024

    Yifu Cai, Arjun Choudhry, Mononito Goswami, and Artur Dubrawski. TimeSeriesExam: A time series understanding exam, 2024

  23. [23]

    ChatTS: Aligning time series with LLMs via synthetic data for enhanced understanding and reasoning

    Zhe Xie, Zeyan Li, Xiao He, Longlong Xu, Xidao Wen, Tieying Zhang, Jianjun Chen, Rui Shi, and Dan Pei. ChatTS: Aligning time series with LLMs via synthetic data for enhanced understanding and reasoning. Proceedings of the VLDB Endowment, 18(8):2385–2398, 2025

  24. [24]

    ITFormer: Bridging time series and natural language for multi-modal QA with large-scale multitask dataset, 2025

    Yilin Wang, Peixuan Lei, Jie Song, Yuzhe Hao, Tao Chen, Yuxuan Zhang, Lei Jia, Yuanxiang Li, and Zhongyu Wei. ITFormer: Bridging time series and natural language for multi-modal QA with large-scale multitask dataset, 2025

  25. [25]

    TSAQA: Time series analysis question and answering benchmark, 2026

    Baoyu Jing, Sanhorn Chen, Lecheng Zheng, Boyu Liu, Zihao Li, Jiaru Zou, Tianxin Wei, Zhining Liu, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Yuchen Yan, Dongqi Fu, Jingchao Ni, Jingrui He, and Hanghang Tong. TSAQA: Time series analysis question and answering benchmark, 2026

  26. [26]

    Time-MQA: Time series multi-task question answering with context enhancement, 2025

    Yaxuan Kong, Yiyuan Yang, Yoontae Hwang, Wenjie Du, Stefan Zohren, Zhangyang Wang, Ming Jin, and Qingsong Wen. Time-MQA: Time series multi-task question answering with context enhancement, 2025

  27. [27]

    MTBench: A multimodal time series benchmark for temporal reasoning and question answering, 2026

    Jialin Chen, Aosong Feng, Ziyu Zhao, Juan Garza, Gaukhar Nurbek, Cheng Qin, Ali Maatouk, Leandros Tassiulas, Yifeng Gao, and Rex Ying. MTBench: A multimodal time series benchmark for temporal reasoning and question answering, 2026

  28. [28]

    QuAnTS: Question answering on time series

    Felix Wenkel, Ghada Sokar, Yuan Zhang, and Mirco Ravanelli. QuAnTS: Question answering on time series. arXiv preprint arXiv:2411.04795, 2024

  29. [29]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  30. [30]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2023

  31. [31]

    Tool learning with foundation models

    Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zhiyuan Zeng, Yujia Huang, Chaojun Xiao, Chi Han, et al. Tool learning with foundation models. ACM Computing Surveys, 2024. arXiv:2304.08354

  32. [32]

    Temporal fusion transformers for interpretable multi-horizon time series forecasting

    Bryan Lim, Sercan Ö. Arık, Nicolas Loeff, and Tomas Pfister. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4):1748–1764, 2021

  33. [33]

    N-BEATS: Neural basis expansion analysis for interpretable time series forecasting

    Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In Proceedings of the International Conference on Learning Representations (ICLR), 2020

  34. [34]

    Deep one-class classification

    Lukas Ruff, Robert A. Vandermeulen, Nico Görnitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In Proceedings of the International Conference on Machine Learning (ICML), 2018

  35. [35]

    Causality: Models, Reasoning, and Inference

    Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009

  36. [36]

    Elements of Causal Inference

    Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference. MIT Press, 2017

  37. [37]

    Estimating causal effects of treatments in randomized and nonrandomized studies

    Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701, 1974

  38. [38]

    Investigating causal relations by econometric models and cross-spectral methods

    Clive W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3):424–438, 1969

  39. [39]

    Detecting and quantifying causal associations in large nonlinear time series datasets

    Jakob Runge, Peer Nowack, Marlene Kretschmer, Seth Flaxman, and Dino Sejdinovic. Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5(11):eaau4996, 2019

  40. [40]

    Generic and scalable framework for automated time-series anomaly detection

    Nikolay Laptev, Saeed Amizadeh, and Ian Flint. Generic and scalable framework for automated time-series anomaly detection. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015

  41. [41]

    The UCR time series archive

    Hoang Anh Dau, Anthony Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, and Eamonn Keogh. The UCR time series archive. IEEE/CAA Journal of Automatica Sinica, 6(6):1293–1305, 2019

  42. [42]

    Transformers in time series: A survey

    Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, and Liang Sun. Transformers in time series: A survey. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), pages 6778–6786, 2023. arXiv:2202.07125

  43. [43]

    CLadder: A benchmark to assess causal reasoning capabilities of language models

    Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, and Bernhard Schölkopf. CLadder: A benchmark to assess causal reasoning capabilities of language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  44. [44]

    Deep structural causal models for tractable counterfactual inference

    Nick Pawlowski, Daniel C. Castro, and Ben Glocker. Deep structural causal models for tractable counterfactual inference. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  45. [45]

    A ladder of causal distances

    Maxime Peyrard and Robert West. A ladder of causal distances. arXiv preprint arXiv:2005.02480, 2020

  46. [46]

    Claude Sonnet 4.6 system card, 2026

    Anthropic. Claude Sonnet 4.6 system card, 2026. Anthropic system card, February 2026. https://www.anthropic.com/claude-sonnet-4-6-system-card

  47. [47]

    GPT-5.1 system card, 2025

    OpenAI. GPT-5.1 system card, 2025. OpenAI system card addendum. https://openai.com/index/gpt-5-system-card-addendum-gpt-5-1/

  48. [48]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    DeepSeek-AI. DeepSeek-V3.2: Pushing the frontier of open large language models, 2025. arXiv:2512.02556. https://arxiv.org/abs/2512.02556

  49. [49]

    Mistral Large 3: Model card, 2025

    Mistral AI. Mistral Large 3: Model card, 2025. Mistral AI announcement. https://mistral.ai/news/mistral-3

  50. [50]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report, 2025. arXiv:2505.09388

  51. [51]

    On Level 1 every model is essentially flat or improves slightly under substitution. This is consistent with L1 templates being substantially answerable from the question wording, the option set, and the still-present discrete annotations of task phase and fault label. It also flags L1 as the level where overlap between LLM scores and a random/regression b...

  52. [52]

    what would have happened

    On Level 2 and Level 3 every model drops, often substantially: DeepSeek and Mistral lose 9–11 points on L3, and gpt-5.1 loses 7.7 on L2. The L3 drops are particularly informative because L3 templates ask counterfactual “what would have happened” questions where the answer hinges on the actual signal trajectory rather than on a discrete annotation

  53. [53]

    no anomaly is present

    On Level 4, gpt-5.1’s 7.6% original score collapses to 0.5% under noise substitution, a 15× drop. Inspection of the per-item rubric notes shows that gpt-5.1’s L4 mass on the original data came almost entirely from correctly emitting “no anomaly is present” on truly nominal episodes; once the time series is replaced by Gaussian noise, the same nominal episo...