arXiv: 2605.18801 · v1 · pith:OUQELVQGnew · submitted 2026-05-11 · 💻 cs.AI · cs.IR · cs.LG

Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance

Pith reviewed 2026-05-20 22:25 UTC · model grok-4.3

classification 💻 cs.AI · cs.IR · cs.LG
keywords data probes · synthetic sequences · LLM performance · data characteristics · random processes · model generalization · typical sets · data filtering

The pith

Synthetic sequences from random processes can serve as data probes to systematically reveal how data characteristics shape LLM behavior across training and inference stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current understanding of useful data for LLMs comes mostly from expensive trials with large real-world datasets that yield only rules of thumb. The paper proposes instead to generate synthetic sequences from carefully chosen random processes. These sequences, called data probes, can be inserted into one or more stages of the LLM workflow. Observing model responses on the probes isolates the effects of specific statistical properties on performance, generalization, and robustness. The approach also links observed behaviors to theoretical ideas such as typical sets to move beyond purely empirical methods.

Core claim

Synthetic sequences generated from appropriately defined random processes can reveal useful characteristics when used in stages of the LLM workflow; by studying LLM behavior on these data probes, researchers can systematically examine how data characteristics influence performance, generalization, and robustness, with statistical properties interpreted through concepts such as typical sets.

What carries the argument

Data probes: synthetic sequences produced from defined random processes that exhibit controllable statistical properties for insertion into LLM training, tuning, or inference.
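
As a rough illustration of what such a probe generator could look like, the sketch below draws a random first-order Markov chain and samples one sequence of token ids from it. The Markov-chain family, the 128-state size, and the Dirichlet parameter are taken from the Figure 6 caption; the function names, the uniform initial state, the value alpha = 0.5, and the sequence length are assumptions made for this example and are not the authors' implementation.

```python
import numpy as np

def random_transition_matrix(num_states, alpha, rng):
    """Draw each row of the transition matrix from a symmetric Dirichlet(alpha)."""
    return rng.dirichlet(np.full(num_states, alpha), size=num_states)

def sample_probe(P, length, rng):
    """Sample one sequence of token ids from the chain with transition matrix P."""
    num_states = P.shape[0]
    seq = np.empty(length, dtype=np.int64)
    seq[0] = rng.integers(num_states)  # uniform initial state (an assumption)
    for t in range(1, length):
        seq[t] = rng.choice(num_states, p=P[seq[t - 1]])
    return seq

rng = np.random.default_rng(0)
# alpha is the controllable knob here: smaller values give peakier rows.
P = random_transition_matrix(num_states=128, alpha=0.5, rng=rng)
probe = sample_probe(P, length=256, rng=rng)
print(probe[:10])
```

Sequences like `probe` could then be tokenized one-to-one and fed to a model in place of real text; which statistical properties to control is the open question the paper raises.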

If this is right

  • Studies of data effects can be performed in a controlled and repeatable manner without exclusive dependence on large public datasets.
  • Insights gained from probes can guide more principled methods for data filtering and dataset construction.
  • Theoretical descriptions using typical sets can be applied to explain and predict LLM responses to data variations.
  • The method opens a route to foundational understanding of data's role instead of continued reliance on empirical heuristics.
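
For reference, the typical set mentioned above has a standard textbook definition (as in Cover and Thomas, Elements of Information Theory). It is stated here for a source with entropy H; for stationary ergodic processes the entropy rate plays the role of H. How the paper generalizes this notion to LLMs is not specified on this page, so the block below is the classical definition only.

```latex
A_\varepsilon^{(n)}
  = \left\{ x^n : \left| -\tfrac{1}{n} \log p(x^n) - H \right| \le \varepsilon \right\},
\qquad
\Pr\!\left[ X^n \in A_\varepsilon^{(n)} \right] \to 1 \quad \text{as } n \to \infty .
```

The quantity $-\tfrac{1}{n} \log p(x^n)$ is the average negative log-likelihood (average NLL) of the sequence, so membership in the typical set is a statement about how close the average NLL is to H.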

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Probes could be used to test combined effects of multiple data traits simultaneously in ways that are hard to isolate in real corpora.
  • The technique might help diagnose why certain real datasets cause poor generalization by matching their statistics to probe variants.
  • Extending the probes to measure robustness under distribution shifts would connect the method to questions of model reliability.

Load-bearing premise

That patterns of LLM behavior on the artificial sequences will correspond to the actual causal effects of similar properties in real data rather than arising only from the way the sequences were constructed.

What would settle it

Run controlled tests in which probe properties are varied to predict performance changes, then check whether those same statistical changes in real datasets produce matching shifts in LLM accuracy or robustness.

Figures

Figures reproduced from arXiv: 2605.18801 by Hans Arno Jacobsen, Herbert Woisetschläger, Mingyue Ji, Shiqiang Wang.

Figure 1: Data probes connect theory and practice.
    … data probes will be an important “interface” for connecting theory and practice …
Figure 3: Validity and transfer decision logic for a claim h. If IV(h) = 1 but EV(h) = 0, the result is probe-local. This pass/fail structure makes transfer claims falsifiable rather than narrative. A formal object, predicate definitions, and transfer equations are provided in Appendix B.
Figure 4: Different regimes related to the typical set.
    … of the sequence x^n normalized by the sequence length n, referred to as the average NLL. Intuitively, typical sets capture the bulk of probability mass in a distribution, and a sequence is “typical” if its NLL is near the true entropy rate. Checking whether the average NLL lies within an ε-band around H amounts to verifying whether x^n belongs to the typical s…
Figure 5: Cumulative density function (CDF) of average NLL of generated sequences.
    … interesting that from the average NLL results and their connection to the typical set concept, we are able to observe important LLM behaviors seen in practice from the simple GPT-2 model trained using data probes. This illustrates the potential of data probes for achieving an in-depth understanding and analysis of LLMs, thus we advo…
Figure 6: Entropy value distribution of randomly generated Markov Chains (128 states) with Dirichlet parameter α.
    … synthetic data required negligible storage or curation. This setup’s simplicity highlights how one can conduct controlled LLM experiments without the large overhead of real-text pipelines.
    D. Generating a Markov Chain with Target Entropy Rate
    Let us define M as the number of states in the Markov chain, w…
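
The captions for Figures 4 and 6 refer to the entropy rate of a random Markov chain and to checking whether a sequence's average NLL falls within an ε-band around it. The sketch below shows one way these two quantities could be computed with NumPy. It is an illustration under stated assumptions, not the paper's code: the chain has 8 states rather than 128 to keep the example fast, the Dirichlet parameter is 1.0, ε = 0.05 is an arbitrary band width, logarithms are natural (nats), and the NLL is computed under the true chain rather than under a trained model, ignoring the initial-state term.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(eigvals - 1.0))
    pi = np.real(eigvecs[:, idx])
    return pi / pi.sum()

def entropy_rate(P):
    """Entropy rate in nats: H = -sum_i pi_i * sum_j P_ij * log P_ij."""
    pi = stationary_distribution(P)
    log_p = np.log(np.where(P > 0, P, 1.0))  # treats 0 * log 0 as 0
    return float(-np.sum(pi[:, None] * P * log_p))

def average_nll(seq, P):
    """Mean negative log-probability of the observed transitions under P, in nats."""
    return float(-np.mean(np.log(P[seq[:-1], seq[1:]])))

rng = np.random.default_rng(0)
M = 8  # number of states (assumption; the Figure 6 caption uses 128)
P = rng.dirichlet(np.full(M, 1.0), size=M)

# Sample one long sequence from the chain, starting from a uniform state.
n = 20000
seq = np.empty(n, dtype=np.int64)
seq[0] = rng.integers(M)
for t in range(1, n):
    seq[t] = rng.choice(M, p=P[seq[t - 1]])

H = entropy_rate(P)
nll = average_nll(seq, P)
epsilon = 0.05  # hypothetical band width
is_typical = abs(nll - H) <= epsilon
print(f"H = {H:.4f} nats, average NLL = {nll:.4f} nats, typical: {is_typical}")
```

Replacing `P` inside `average_nll` with a model's predicted next-token probabilities would give the model-side average NLL, which is the quantity the Figure 5 caption refers to.
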
Original abstract

Data is fundamental to large language models (LLMs). However, understanding of what makes certain data useful for different stages of an LLM workflow, including training, tuning, alignment, in-context learning, etc., and why, remains an open question. Current approaches rely heavily on extensive experimentation with large public datasets to obtain empirical heuristics for data filtering and dataset construction. These approaches are compute intensive and lack a principled way of understanding the essence of how specific data characteristics drive LLM behavior. In this position paper, we advocate for the need of developing systematic methodologies for generating synthetic sequences from appropriately defined random processes, with the goal that these sequences can reveal useful characteristics when they are used in one or multiple stages of the LLM workflow. We refer to such sequences as data probes. By observing LLM behavior on data probes, researchers can systematically conduct studies on how data characteristics influence model performance, generalization, and robustness. The probing sequences exhibit statistical properties that can be viewed using theoretical concepts, such as typical sets, which are generalized to describe the behaviors of LLMs. This data-probe approach provides a pathway for uncovering foundational insights into the role of data in LLM training and inference, beyond empirical heuristics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript is a position paper arguing that current empirical approaches to data selection for LLMs—relying on large-scale experimentation with public datasets—are compute-intensive and lack principled understanding of how specific data characteristics drive performance. It advocates developing systematic methodologies to generate synthetic sequences ('data probes') from appropriately defined random processes; these sequences would be inserted into LLM training, tuning, or inference stages and analyzed via generalized typical-set concepts to reveal causal effects on generalization and robustness.

Significance. If the data-probe framework can be made concrete and shown to transfer, it would supply a lower-cost, more controllable alternative to brute-force empirical heuristics and could yield falsifiable, theoretically grounded insights into data's role in LLMs. The position correctly identifies a methodological gap between information-theoretic notions of typicality and practical LLM data curation.

major comments (2)
  1. [Abstract] Abstract and core proposal: the claim that sequences drawn from suitably chosen random processes will expose causal data characteristics on LLMs rests on an unargued mapping from engineered probe statistics to real-corpus effects; no construction of such a process, no toy example, and no validation strategy against distributional shift are supplied, so observed behaviors could be artifacts of the artificial measure rather than transferable insights.
  2. [The data-probe approach] The data-probe approach section: the manuscript does not discuss how to select or parameterize the random processes so that the controlled properties (entropy rate, Markov order, long-range dependence, etc.) are precisely those that matter for LLM behavior, leaving the method vulnerable to the concern that any measured effects are generation artifacts.
minor comments (1)
  1. [Introduction] The introduction of the term 'data probes' would benefit from a brief contrast with existing synthetic-data or probing techniques in the LLM literature to clarify novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where the position paper can be made more concrete while preserving its advocacy focus. We address each major comment below and commit to revisions that strengthen the argument without overclaiming current results.

Point-by-point responses
  1. Referee: [Abstract] Abstract and core proposal: the claim that sequences drawn from suitably chosen random processes will expose causal data characteristics on LLMs rests on an unargued mapping from engineered probe statistics to real-corpus effects; no construction of such a process, no toy example, and no validation strategy against distributional shift are supplied, so observed behaviors could be artifacts of the artificial measure rather than transferable insights.

    Authors: We agree that the manuscript would be strengthened by an explicit discussion of the mapping from probe statistics to real-corpus effects. As a position paper, our primary goal is to advocate for developing such methodologies rather than delivering a fully worked-out implementation. In the revision we will add a new subsection that sketches example constructions (e.g., controlled Markov chains with tunable entropy rates and long-range dependence parameters chosen to approximate linguistic statistics) and outlines a validation strategy that includes (i) matching probe statistics to those measured on real pre-training subsets and (ii) using importance weighting or domain-adaptation diagnostics to check for distributional-shift artifacts. This addition will make the transferability argument more transparent while remaining within the scope of a position paper. revision: yes

  2. Referee: [The data-probe approach] The data-probe approach section: the manuscript does not discuss how to select or parameterize the random processes so that the controlled properties (entropy rate, Markov order, long-range dependence, etc.) are precisely those that matter for LLM behavior, leaving the method vulnerable to the concern that any measured effects are generation artifacts.

    Authors: We acknowledge that the current draft leaves the parameterization question open. We will expand the data-probe approach section to include concrete guidance on process selection. Specifically, we will describe (a) estimating target statistics such as entropy rate and autocorrelation structure from representative real corpora, (b) fitting probe generators (e.g., via maximum-likelihood or moment-matching) to reproduce those statistics, and (c) performing sensitivity analyses that vary one property at a time while holding others fixed. These additions directly address the risk that observed effects are generation artifacts by tying the probes to properties known to influence LLM behavior. revision: yes
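
Step (b) of the response above, fitting a probe generator by maximum likelihood, can be sketched for the simplest case of a first-order Markov chain, where the maximum-likelihood estimate is the row-normalized matrix of transition counts. The function name `fit_markov_mle`, the optional `pseudo_count` smoothing, the uniform fallback for unvisited states, and the toy token sequence are all assumptions made for this illustration; the simulated rebuttal does not specify an implementation.

```python
import numpy as np

def fit_markov_mle(tokens, num_states, pseudo_count=0.0):
    """Estimate a first-order transition matrix from a sequence of token ids.

    With pseudo_count = 0 this is the plain maximum-likelihood estimate;
    a positive pseudo_count adds additive smoothing.
    """
    tokens = np.asarray(tokens)
    counts = np.full((num_states, num_states), pseudo_count, dtype=np.float64)
    # Accumulate one count per observed transition (previous -> next).
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1.0)
    row_sums = counts.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    uniform = np.full_like(counts, 1.0 / num_states)
    # States that never appear as a source fall back to a uniform row.
    return np.where(row_sums > 0, counts / safe_sums, uniform)

# Toy "corpus" of token ids over a 3-symbol vocabulary (hypothetical data).
tokens = [0, 1, 0, 1, 1, 0, 2, 0]
P_hat = fit_markov_mle(tokens, num_states=3)
print(P_hat)
```

The fitted matrix can then be perturbed one property at a time, for example by sharpening or flattening its rows, to run the sensitivity analyses described in step (c).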

Circularity Check

0 steps flagged

No circularity: position paper with no derivations or self-referential predictions

full rationale

This is a position paper that contrasts existing compute-intensive empirical methods for studying data effects on LLMs with a proposed future direction of using synthetic data probes generated from random processes and generalized typical-set concepts. It contains no equations, fitted parameters, predictions, or derivations that could reduce to their own inputs by construction. The central claim is an advocacy for developing new methodologies rather than a closed logical chain or result derived from prior self-citations. No load-bearing self-citations, ansatzes, or renamings of known results are present in the provided text. The argument is self-contained as a call for research and does not rely on any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central proposal rests on the untested premise that controlled synthetic sequences can isolate data effects relevant to real LLM workflows; no free parameters or invented physical entities are introduced, but the data-probe concept itself is postulated without independent validation in the provided text.

axioms (1)
  • domain assumption: Synthetic sequences generated from appropriately defined random processes can isolate and reveal data characteristics that drive LLM behavior in training, tuning, and inference.
    This premise is invoked as the justification for shifting from empirical heuristics to systematic probing.
invented entities (1)
  • data probes (no independent evidence)
    purpose: Controlled synthetic sequences used to study the influence of data properties on LLMs.
    New term and concept introduced to enable the advocated systematic studies.


