pith. machine review for the scientific record.

arxiv: 2605.05134 · v1 · submitted 2026-05-06 · 💻 cs.LG · math.DS


Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction


Pith reviewed 2026-05-08 16:45 UTC · model grok-4.3

classification 💻 cs.LG math.DS
keywords LLM hallucination detection · Koopman operator theory · dynamical systems · black-box methods · embedding space · state-space models · residual scoring · preference calibration

The pith

Projecting LLM responses into an embedding space and fitting Koopman operators to factual and hallucinated regimes enables low-cost single-pass hallucination detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats sequences of LLM outputs as observations from a latent dynamical system once projected by an embedding model. It fits two Koopman transition operators, one to factual responses and one to hallucinated ones, then uses the difference in their prediction errors as a score for whether a new response is hallucinated. A calibration procedure adjusts the decision threshold using a small number of user-provided examples to match different risk tolerances. This single-generation approach avoids the multiple samples or knowledge-base lookups required by earlier methods. Benchmark tests show it reaches state-of-the-art accuracy while lowering compute demands.

Core claim

LLM responses projected via an embedding model form observable realizations of latent state-space dynamics; the factual and hallucinated regimes admit distinct, learnable Koopman transition operators whose respective prediction errors yield a differential residual score that classifies hallucinations.

What carries the argument

The differential residual score computed from the one-step prediction errors of two separately fitted Koopman transition operators, one for factual regime trajectories and one for hallucinated regime trajectories in embedding space.
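
This score can be sketched with a plain least-squares (DMD-style) fit over embedding trajectories. The sketch below assumes identity observables and an unregularized pseudoinverse fit; the paper's exact lifting, regularization, and score normalization may differ.

```python
import numpy as np

def fit_koopman_operator(trajectories):
    """Least-squares (DMD-style) fit of a linear one-step transition
    operator K with x_{t+1} ~= K x_t, pooled over (d, T) trajectories."""
    X = np.hstack([traj[:, :-1] for traj in trajectories])  # states at step t
    Y = np.hstack([traj[:, 1:] for traj in trajectories])   # states at step t+1
    return Y @ np.linalg.pinv(X)  # K = Y X^+

def residual(K, traj):
    """Mean one-step prediction error of operator K on one trajectory."""
    return np.mean(np.linalg.norm(K @ traj[:, :-1] - traj[:, 1:], axis=0))

def differential_residual_score(K_fact, K_hall, traj):
    """Positive when the factual operator predicts the trajectory worse
    than the hallucinated operator, i.e. evidence for hallucination."""
    return residual(K_fact, traj) - residual(K_hall, traj)
```

A new response would be embedded token-by-token into a (d, T) trajectory and flagged as hallucinated when the score exceeds the calibrated threshold η.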

If this is right

  • Detection operates on a single LLM response without requiring additional sampling or external retrieval.
  • The classification threshold can be tuned to user preferences or domain needs using only a small demonstration set.
  • The method runs as a black box, requiring no internal access to the LLM.
  • Resource overhead is reduced compared to consistency-checking or retrieval-augmented baselines while matching their accuracy on standard benchmarks.
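
The threshold tuning in the second bullet can be sketched as a sweep over demonstration scores. The choice of metric and sweep granularity here is illustrative, not the paper's exact calibration procedure; it assumes per-response differential residual scores with binary labels.

```python
import numpy as np

def calibrate_threshold(scores, labels, target="f1"):
    """Sweep candidate thresholds over the demonstration scores and return
    the eta maximizing the target metric. labels: 1 = hallucinated.
    A strict user might optimize recall; a tolerant one F1."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_eta, best_val = float(scores.min()), -1.0
    for eta in np.unique(scores):
        pred = (scores >= eta).astype(int)  # flag hallucination at/above eta
        tp = int(np.sum((pred == 1) & (labels == 1)))
        fp = int(np.sum((pred == 1) & (labels == 0)))
        fn = int(np.sum((pred == 0) & (labels == 1)))
        if target == "recall":
            val = tp / max(tp + fn, 1)
        else:  # default: F1
            val = 2 * tp / max(2 * tp + fp + fn, 1)
        if val > best_val:
            best_eta, best_val = float(eta), val
    return best_eta
```

Only the small demonstration set is needed; no operator refitting is involved, which is what keeps per-user calibration cheap.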

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the embedding space faithfully captures the dynamics, the same separation might allow early prediction of hallucinations within a single long generation.
  • Extending the operator fitting to online updates could support real-time monitoring of chatbot conversations.
  • Similar dynamical modeling might apply to detecting other LLM failure modes such as logical inconsistencies if they also produce distinct trajectory patterns.

Load-bearing premise

That the projected embedding sequences of factual responses and of hallucinated responses obey measurably different linear transition rules in the lifted space.

What would settle it

A test set in which the factual-operator prediction error is not reliably smaller than the hallucinated-operator error on verified factual responses, or in which the differential score shows no correlation with ground-truth hallucination labels.
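
One way to operationalize the second failure condition is a rank-based AUC of the differential score against ground-truth labels: a value near 0.5 on held-out data would falsify the separation claim. This is a generic sketch, not the paper's evaluation code.

```python
import numpy as np

def score_label_auc(scores, labels):
    """Probability that a randomly chosen hallucinated response receives a
    higher differential residual score than a randomly chosen factual one.
    AUC ~ 0.5 means the score carries no signal about the labels."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]  # hallucinated responses
    neg = scores[labels == 0]  # verified factual responses
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```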

Figures

Figures reproduced from arXiv: 2605.05134 by Dan Wilson, Mohamed Akrout.

Figure 1: Our proposed differential residual score …
Figure 2: For 300 factual ground-truth and hallucinated responses from the …
Figure 3: Hallucination Detection Dynamical System (DS): (a) Phase 1: Dynamical system fitting …
Figure 4: Hallucination detection performance using DS classification on 8K samples of the HaluEval …
Figure 5: ROC curves on the HaluEval dataset as a function of the sequence length L for (a) the average classification performance and (b)–(f) each embedding model, showing the variation of the true positive rate against the false positive rate at various prediction error thresholds. Notably, the magnitude of the improvement from shorter to longer sequence length correlates inversely with model size. Smaller emb…
Figure 6: Histograms of token-level residual scores from (6) for Wikibio test data. This effect can be leveraged to implement a calibration process that uses a small set of user-provided demonstrations to determine the classification threshold η. Whether the user is strict or tolerant of minor hallucinations, the method sets the classification threshold η to align with …
Figure 7: User-centric threshold calibration: (1) selection of calibration samples based on user preference (tolerant or strict to minor inaccuracies), (2) computation of per-token differential residual scores ∆E, and (3) calibration of the classification threshold η to maximize the target metric.
read the original abstract

Large Language Models (LLMs) frequently generate plausible but non-factual content, a phenomenon known as hallucination. While existing detection methods typically rely on computationally expensive sampling-based consistency checks or external knowledge retrieval, we propose a new method that treats the LLM as a black-box dynamical system. By projecting LLM responses into a high-dimensional manifold via an embedding model, we characterize the resulting vector sequences as observable realizations of the model's latent state-space dynamics. Leveraging Koopman operator theory, we fit the transition operators for both factual and hallucinated regimes and define a differential residual score based on their respective prediction errors. To accommodate varying user requirements and domain-specific sensitivities, we introduce a preference-aware calibration mechanism that optimizes the classification threshold based on a small set of demonstrations. This approach enables low-cost hallucination detection in a single-sample pass, avoiding the need for secondary sampling or external grounding. Extensive testing across three data benchmarks demonstrates that our method achieves state-of-the-art performance with reduced resource overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a black-box hallucination detection method that embeds LLM response sequences into a manifold and models them as observable realizations of latent state-space dynamics. Separate Koopman transition operators are fitted to factual and hallucinated regimes; a differential residual score derived from their one-step prediction errors is used for classification, with a preference-aware calibration step that tunes the threshold on a small set of demonstrations. The central claim is that this yields state-of-the-art detection performance across three benchmarks at substantially lower computational cost than sampling-based or retrieval-based alternatives.

Significance. If the core dynamical assumption holds, the approach would constitute a meaningful advance by replacing expensive multi-sample consistency checks or external grounding with a single-pass operator-based residual test. The low-cost, black-box framing and the introduction of preference-aware calibration are practically attractive. However, the significance is currently limited by the absence of quantitative evidence that the embedding trajectories actually admit distinct, learnable Koopman operators that separate the two regimes.

major comments (3)
  1. [Abstract] The assertion of 'state-of-the-art performance' is unsupported by any numerical results, error bars, baseline comparisons, or dataset statistics. The claim is load-bearing for the paper's contribution and cannot be evaluated without this quantitative evidence.
  2. [Method] Operator fitting and calibration: transition operators are fitted separately on the factual and hallucinated regimes, while the classification threshold is optimized on demonstrations drawn from the same data distribution. This construction makes the differential residual score depend on quantities derived from the evaluation data, a direct circularity risk that must be addressed with explicit train/test splits or held-out calibration sets.
  3. [§2–3] Core modeling assumption: the manuscript provides no empirical test or theoretical argument that standard embedding trajectories are Koopman-linearizable in a factuality-dependent way. If the fitted operators are not measurably distinct, or if residuals are dominated by embedding noise, the method reduces to a generic embedding classifier and the claimed dynamical advantage disappears.
minor comments (2)
  1. [Method] The notation for the differential residual score and the precise definition of the Koopman operator approximation should be stated explicitly with equations rather than described only in prose.
  2. [Experiments] Figure captions and experimental tables should include the exact embedding model, sequence length, and number of demonstrations used for calibration so that the low-cost claim can be reproduced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment point-by-point below, clarifying aspects of the work and indicating revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The assertion of 'state-of-the-art performance' is unsupported by any numerical results, error bars, baseline comparisons, or dataset statistics. The claim is load-bearing for the paper's contribution and cannot be evaluated without this quantitative evidence.

    Authors: We agree that the abstract would be stronger with explicit quantitative support. The full manuscript reports detailed results in Section 4, including accuracy/F1 scores on the three benchmarks, comparisons against sampling-based and retrieval baselines, error bars from repeated runs, and dataset statistics. We have revised the abstract to incorporate the key numerical findings (e.g., detection performance and cost reductions) while remaining concise. revision: yes

  2. Referee: [Method] Operator fitting and calibration: transition operators are fitted separately on the factual and hallucinated regimes, while the classification threshold is optimized on demonstrations drawn from the same data distribution. This construction makes the differential residual score depend on quantities derived from the evaluation data, a direct circularity risk that must be addressed with explicit train/test splits or held-out calibration sets.

    Authors: This is a valid concern about potential leakage. In the original experiments the operators were fit on training partitions and the preference-aware threshold was tuned on a held-out calibration subset disjoint from the test set. We have revised the Method section to explicitly document the data partitioning (train/calibration/test splits) and to confirm that all reported metrics use these held-out sets, thereby removing any circularity. revision: yes

  3. Referee: [§2–3] Core modeling assumption: the manuscript provides no empirical test or theoretical argument that standard embedding trajectories are Koopman-linearizable in a factuality-dependent way. If the fitted operators are not measurably distinct, or if residuals are dominated by embedding noise, the method reduces to a generic embedding classifier and the claimed dynamical advantage disappears.

    Authors: We acknowledge that the original submission did not foreground direct validation of the Koopman assumption. We have added a new subsection in §3 that (i) reports the Frobenius-norm difference between the factual and hallucinated operators, (ii) shows that the differential residual is not reproduced by noise-only ablations on the embeddings, and (iii) provides a brief theoretical motivation based on the approximation properties of Koopman operators for regime-dependent nonlinear dynamics. These additions demonstrate that the separation is not reducible to a generic embedding classifier. revision: yes
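
The operator-distinctness check described above could be implemented, assuming a plain least-squares operator fit, as a label-permutation test on the Frobenius gap between the two fitted operators. This is an editorial sketch of one such test, not the authors' code.

```python
import numpy as np

def fit_linear_operator(trajs):
    """Pooled least-squares fit of x_{t+1} ~= K x_t over (d, T) trajectories."""
    X = np.hstack([t[:, :-1] for t in trajs])
    Y = np.hstack([t[:, 1:] for t in trajs])
    return Y @ np.linalg.pinv(X)

def operator_gap_pvalue(fact_trajs, hall_trajs, n_perm=200, seed=0):
    """Permutation test: is the Frobenius-norm gap between the factual and
    hallucinated operators larger than label shuffling alone produces?
    A large p-value would suggest the two regimes share one set of dynamics."""
    rng = np.random.default_rng(seed)
    gap = lambda a, b: np.linalg.norm(
        fit_linear_operator(a) - fit_linear_operator(b), ord="fro")
    observed = gap(fact_trajs, hall_trajs)
    pool, n_fact = list(fact_trajs) + list(hall_trajs), len(fact_trajs)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pool))
        if gap([pool[i] for i in idx[:n_fact]],
               [pool[i] for i in idx[n_fact:]]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)  # add-one smoothing
```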

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper projects LLM responses into embeddings, fits separate Koopman transition operators to labeled factual and hallucinated regimes, defines a differential residual score from one-step prediction errors, and calibrates a threshold on a small set of demonstrations before evaluating on separate benchmarks. This is a standard supervised modeling pipeline with offline fitting and held-out testing; no equation or step reduces the claimed detection performance or SOTA results to the inputs by construction. The Koopman fitting is an explicit modeling choice whose validity is tested empirically rather than assumed tautologically, and no self-citation chain or ansatz smuggling supports the central claims.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of Koopman theory to embedded LLM sequences and on the existence of separable factual/hallucinated regimes; these are domain assumptions rather than derived results. Free parameters include the embedding model choice and the calibrated threshold.

free parameters (2)
  • classification threshold
    Optimized via preference-aware calibration on a small set of demonstrations to accommodate user requirements.
  • embedding model
    Choice of model used to project responses into the high-dimensional manifold before fitting operators.
axioms (1)
  • domain assumption: Projected LLM response sequences behave as observable realizations of an underlying latent dynamical system to which Koopman operator theory applies.
    Invoked when the paper states that vector sequences are treated as realizations of the model's latent state-space dynamics.

pith-pipeline@v0.9.0 · 5468 in / 1395 out tokens · 41088 ms · 2026-05-08T16:45:40.075167+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

49 extracted references · 14 canonical work pages · 6 internal anchors

  1. [1]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. A survey of hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023

  4. [4]

    A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity

    Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023

  5. [5]

    Ethical governance of artificial intelligence hallucinations in legal practice

    Muhammad Khurram Shahzad Warraich, Hazrat Usman, Sidra Zakir, and Mohaddas Mehboob. Ethical governance of artificial intelligence hallucinations in legal practice. Social Sciences Spectrum, 4(2):603–615, 2025

  6. [6]

    A survey on hallucination in large language models: Principles, taxonomy, and challenges

    Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. A survey on hallucination in large language models: Principles, taxonomy, and challenges. arXiv preprint arXiv:2411.08009, 2024

  7. [7]

    Calibrated language models must hallucinate

    Adam Tauman Kalai and Santosh S. Vempala. Calibrated language models must hallucinate. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 160–171, 2024

  8. [8]

    On the dangers of stochastic parrots: Can language models be too big?

    Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623, 2021

  9. [9]

    Why Language Models Hallucinate

    Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. Why language models hallucinate. arXiv preprint arXiv:2509.04664, 2025

  10. [10]

    Language models cannot reliably distinguish belief from knowledge and fact

    Mirac Suzgun, Tayfun Gur, Federico Bianchi, Daniel E Ho, Thomas Icard, Dan Jurafsky, and James Zou. Language models cannot reliably distinguish belief from knowledge and fact. Nature Machine Intelligence, pages 1–11, 2025

  11. [11]

    Truthful: A benchmark for evaluating the truthfulness of large language models

    Amos Azaria and Tom Mitchell. Truthful: A benchmark for evaluating the truthfulness of large language models. arXiv preprint arXiv:2310.06689, 2023

  12. [12]

    Truthfulqa: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2201.08045, 2022

  13. [13]

    Benchmarking large language models for news summarization

    Junyi Li, Tianyi Tang, Wayne Xin Zhao, and Ji-Rong Wen. Benchmarking large language models for news summarization. arXiv preprint arXiv:2305.09034, 2023

  14. [14]

    The internal state of an llm knows when its lying

    Amos Azaria and Tom Mitchell. The internal state of an llm knows when its lying. In Findings of the Association for Computational Linguistics: EMNLP 2023 , pages 967–976, 2023

  15. [15]

    Unsupervised real-time hallucination detection based on the internal states of large language models

    Weihang Su, Changyue Wang, Qingyao Ai, Yiran Hu, Zhijing Wu, Yujia Zhou, and Yiqun Liu. Unsupervised real-time hallucination detection based on the internal states of large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14379–14391, 2024

  16. [16]

    INSIDE: LLMs' internal states retain the power of hallucination detection

    Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. INSIDE: LLMs' internal states retain the power of hallucination detection. arXiv preprint arXiv:2402.03744, 2024

  17. [17]

    Beyond the next token: Towards prompt-robust zero-shot classification via efficient multi-token prediction

    Junlang Qian, Zixiao Zhu, Hanzhang Zhou, Zijian Feng, Zepeng Zhai, and Kezhi Mao. Beyond the next token: Towards prompt-robust zero-shot classification via efficient multi-token prediction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: ...

  18. [18]

    Learning on llm output signatures for gray-box behavior analysis

    Guy Bar-Shalom, Fabrizio Frasca, Derek Lim, Yoav Gelberg, Yftah Ziser, Ran El-Yaniv, Gal Chechik, and Haggai Maron. Learning on LLM output signatures for gray-box behavior analysis. In ICML 2025 Workshop on Reliable and Responsible Foundation Models, 2025

  19. [19]

    Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models

    Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, 2023

  20. [20]

    FELM: Benchmarking factuality evaluation of large language models

    Yiran Zhao, Jinghan Zhang, I Chern, Siyang Gao, Pengfei Liu, Junxian He, et al. FELM: Benchmarking factuality evaluation of large language models. Advances in Neural Information Processing Systems, 36:44502–44523, 2023

  21. [21]

    HaluEval: A large-scale hallucination evaluation benchmark for large language models

    Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464, 2023

  22. [22]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025

  23. [23]

    Embedding and gradient say wrong: A white-box method for hallucination detection

    Xiaomeng Hu, Yiming Zhang, Ru Peng, Haozhe Zhang, Chenwei Wu, Gang Chen, and Junbo Zhao. Embedding and gradient say wrong: A white-box method for hallucination detection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1950–1959, 2024

  24. [24]

    LLM-Check: Investigating detection of hallucinations in large language models

    Gaurang Sriramanan, Siddhant Bharti, Vinu Sankar Sadasivan, Shoumik Saha, Priyatham Kattakinda, and Soheil Feizi. LLM-Check: Investigating detection of hallucinations in large language models. Advances in Neural Information Processing Systems, 37:34188–34216, 2024

  25. [25]

    Detecting hallucinations in large language model generation: A token probability approach

    Ernesto Quevedo, Jorge Yero Salazar, Rachel Koerner, Pablo Rivas, and Tomas Cerny. Detecting hallucinations in large language model generation: A token probability approach. In World Congress in Computer Science, Computer Engineering & Applied Computing, pages 154–173. Springer, 2024

  26. [26]

    Detecting hallucinations in large language models using semantic entropy

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024

  27. [27]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

  28. [28]

    Beyond next token probabilities: Learnable, fast detection of hallucinations and data contamination on LLM output distributions

    Guy Bar-Shalom, Fabrizio Frasca, Derek Lim, Yoav Gelberg, Yftah Ziser, Ran El-Yaniv, Gal Chechik, and Haggai Maron. Beyond next token probabilities: Learnable, fast detection of hallucinations and data contamination on LLM output distributions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 30058–30066, 2026

  29. [29]

    Lowest span confidence: A zero-shot metric for efficient and black-box hallucination detection in LLMs

    Yitong Qiao, Licheng Pan, Yu Mi, Lei Liu, Yue Shen, Fei Sun, and Zhixuan Chu. Lowest span confidence: A zero-shot metric for efficient and black-box hallucination detection in LLMs. arXiv preprint arXiv:2601.19918, 2026

  30. [30]

    FactSelfCheck: Fact-level black-box hallucination detection for LLMs

    Albert Sawczyn, Jakub Binkowski, Denis Janiak, Bogdan Gabrys, and Tomasz Jan Kajdanowicz. FactSelfCheck: Fact-level black-box hallucination detection for LLMs. In Findings of the Association for Computational Linguistics: EACL 2026, pages 5603–5621, 2026

  31. [31]

    SAC3: Reliable hallucination detection in black-box language models via semantic-aware cross-check consistency

    Jiaxin Zhang, Zhuohang Li, Kamalika Das, Bradley Malin, and Sricharan Kumar. SAC3: Reliable hallucination detection in black-box language models via semantic-aware cross-check consistency. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15445–15458, 2023

  32. [32]

    Multi-perspective consistency checking for large language model hallucination detection: a black-box zero-resource approach

    Linggang Kong, Xiaofeng Zhong, Jie Chen, Haoran Fu, and Yongjie Wang. Multi-perspective consistency checking for large language model hallucination detection: a black-box zero-resource approach. Frontiers of Information Technology & Electronic Engineering, 26(11):2298–2309, 2025

  33. [33]

    Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency

    Aman Goel, Daniel Schwartz, and Yanjun Qi. Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1982–1999, 2025

  34. [34]

    Applied Koopmanism

    M. Budišić, R. Mohr, and I. Mezić. Applied Koopmanism. Chaos: An Interdisciplinary Journal of Nonlinear Science, 22(4):047510, 2012

  35. [35]

    I. Mezić. Analysis of fluid flows via spectral properties of the Koopman operator. Annual Review of Fluid Mechanics, 45:357–378, 2013

  36. [36]

    J. N. Kutz, S. L. Brunton, B. W. Brunton, and J. L. Proctor. Dynamic Mode Decomposition: Data-Driven Modeling of Complex Systems. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2016

  37. [37]

    P. J. Schmid. Dynamic mode decomposition of numerical and experimental data. Journal of Fluid Mechanics, 656:5–28, 2010

  38. [38]

    C. W. Rowley, I. Mezić, S. Bagheri, P. Schlatter, and D. S. Henningson. Spectral analysis of nonlinear flows. Journal of Fluid Mechanics, 641(1):115–127, 2009

  39. [39]

    M. O. Williams, I. G. Kevrekidis, and C. W. Rowley. A data–driven approximation of the koopman operator: Extending dynamic mode decomposition. Journal of Nonlinear Science , 25(6):1307–1346, 2015

  40. [40]

    D. Wilson. Koopman operator inspired nonlinear system identification. SIAM Journal on Applied Dynamical Systems, 22(2):1445–1471, 2023

  41. [41]

    J. L. Proctor, S. L. Brunton, and J. N. Kutz. Dynamic mode decomposition with control. SIAM Journal on Applied Dynamical Systems , 15(1):142–161, 2016

  42. [42]

    jina-embeddings-v5-text: Task-Targeted Embedding Distillation

    Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther, Maximilian Werk, and Han Xiao. jina-embeddings-v5-text: Task-targeted embedding distillation. arXiv preprint arXiv:2602.15547, 2026

  43. [43]

    Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, et al. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. arXiv preprint arXiv:2601.04720, 2026

  44. [44]

    F2LLM-v2: Inclusive, performant, and efficient embeddings for a multilingual world

    Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, and Rui Wang. F2LLM-v2: Inclusive, performant, and efficient embeddings for a multilingual world. arXiv preprint arXiv:2603.19223, 2026

  45. [45]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023

  46. [46]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  47. [47]

    A quantitative analysis of koopman operator methods for system identification and predictions

    Christophe Zhang and Enrique Zuazua. A quantitative analysis of koopman operator methods for system identification and predictions. Comptes Rendus. Mécanique, 351(S1):1–31, 2023

  48. [48]

    H. S. Hemati, C. W. Rowley, E. A. Deem, and L. N. Cattafesta. De-biasing the dynamic mode decomposition for applied koopman spectral analysis of noisy datasets. Theoretical and Computational Fluid Dynamics, 31(4):349–368, 2017

  49. [49]

    S. T. M. Dawson, H. S. Hemati, M. O. Williams, and C. W. Rowley. Characterizing and correcting for the effect of sensor noise in the dynamic mode decomposition. Experiments in Fluids, 57(3):42, 2016