Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance

Hans Arno Jacobsen; Herbert Woisetschläger; Mingyue Ji; Shiqiang Wang

arXiv: 2605.18801 · v1 · pith:OUQELVQGnew · submitted 2026-05-11 · 💻 cs.AI · cs.IR · cs.LG

Pith reviewed 2026-05-20 22:25 UTC · model grok-4.3

classification 💻 cs.AI · cs.IR · cs.LG

keywords data probes · synthetic sequences · LLM performance · data characteristics · random processes · model generalization · typical sets · data filtering


The pith

Synthetic sequences from random processes can serve as data probes to systematically reveal how data characteristics shape LLM behavior across training and inference stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current understanding of useful data for LLMs comes mostly from expensive trials with large real-world datasets that yield only rules of thumb. The paper proposes instead to generate synthetic sequences from carefully chosen random processes. These sequences, called data probes, can be inserted into one or more stages of the LLM workflow. Observing model responses on the probes isolates the effects of specific statistical properties on performance, generalization, and robustness. The approach also links observed behaviors to theoretical ideas such as typical sets to move beyond purely empirical methods.

Core claim

Synthetic sequences generated from appropriately defined random processes can reveal useful characteristics when used in stages of the LLM workflow; by studying LLM behavior on these data probes, researchers can systematically examine how data characteristics influence performance, generalization, and robustness, with statistical properties interpreted through concepts such as typical sets.

What carries the argument

Data probes: synthetic sequences produced from defined random processes that exhibit controllable statistical properties for insertion into LLM training, tuning, or inference.
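As a concrete illustration of what a data probe could look like, the sketch below samples a random Markov chain and draws one sequence from it. It is a minimal example under stated assumptions, not the authors' construction: the function names, the uniform initial state, and the small state count are hypothetical choices made here. Only the idea of Markov chains with Dirichlet-distributed transition rows comes from the Figure 6 caption.

```python
import random


def sample_transition_matrix(num_states, alpha, rng):
    """Sample each row from a symmetric Dirichlet(alpha) distribution.

    A Dirichlet draw is obtained by normalizing independent Gamma(alpha, 1)
    draws, which keeps this sketch free of third-party dependencies.
    """
    matrix = []
    for _ in range(num_states):
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_states)]
        total = sum(draws)
        matrix.append([d / total for d in draws])
    return matrix


def generate_probe(matrix, length, rng):
    """Generate one sequence of state indices from a first-order Markov chain."""
    num_states = len(matrix)
    state = rng.randrange(num_states)  # assumption: uniform initial state
    sequence = [state]
    for _ in range(length - 1):
        state = rng.choices(range(num_states), weights=matrix[state], k=1)[0]
        sequence.append(state)
    return sequence


rng = random.Random(0)  # fixed seed so the probe is repeatable
transition_matrix = sample_transition_matrix(num_states=8, alpha=0.5, rng=rng)
probe = generate_probe(transition_matrix, length=32, rng=rng)
print(probe)
```

The state indices would play the role of token IDs when such a sequence is fed to a model. The controllable property here is `alpha`, which governs how concentrated each transition row is.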

If this is right

Studies of data effects can be performed in a controlled and repeatable manner without exclusive dependence on large public datasets.
Insights gained from probes can guide more principled methods for data filtering and dataset construction.
Theoretical descriptions using typical sets can be applied to explain and predict LLM responses to data variations.
The method opens a route to foundational understanding of data's role instead of continued reliance on empirical heuristics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Probes could be used to test combined effects of multiple data traits simultaneously in ways that are hard to isolate in real corpora.
The technique might help diagnose why certain real datasets cause poor generalization by matching their statistics to probe variants.
Extending the probes to measure robustness under distribution shifts would connect the method to questions of model reliability.

Load-bearing premise

The approach assumes that patterns of LLM behavior on the artificial sequences correspond to the actual causal effects of similar properties in real data, rather than arising only from the way the sequences were constructed.

What would settle it

Run controlled tests in which probe properties are varied to predict performance changes, then check whether those same statistical changes in real datasets produce matching shifts in LLM accuracy or robustness.

Figures

Figures reproduced from arXiv: 2605.18801 by Hans Arno Jacobsen, Herbert Woisetschläger, Mingyue Ji, Shiqiang Wang.

**Figure 1.** Data probes connect theory and practice. Excerpt from the paper: "… data probes will be an important “interface” for connecting theory and practice, as illustrated in …"

**Figure 3.** Validity and transfer decision logic for a claim h. If IV(h) = 1 but EV(h) = 0, the result is probe-local. This pass/fail structure makes transfer claims falsifiable rather than narrative. A formal object, predicate definitions, and transfer equations are provided in Appendix B.
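The pass/fail structure described in the Figure 3 caption can be sketched as a small decision function. Only one branch is stated in the caption: IV(h) = 1 with EV(h) = 0 yields a probe-local result. Everything else below is an assumption made for illustration, including the reading of IV and EV as internal and external validity, the labels for the other branches, and the function name. The paper's Appendix B is the authority on the actual predicates.

```python
def classify_claim(iv: bool, ev: bool) -> str:
    """Classify a claim h from two pass/fail predicates.

    iv: value of IV(h); ev: value of EV(h).
    Only the (iv=True, ev=False) outcome is taken from the figure caption;
    the other two labels are hypothetical placeholders.
    """
    if not iv:
        return "not established"  # assumption: no transfer question arises
    if not ev:
        return "probe-local"  # stated in the caption
    return "transferable"  # assumption: both predicates pass


for iv in (False, True):
    for ev in (False, True):
        print(f"IV={int(iv)} EV={int(ev)} -> {classify_claim(iv, ev)}")
```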

**Figure 4.** Different regimes related to the typical set. Excerpt from the paper: "… of the sequence x^n normalized by the sequence length n, referred to as the average NLL. Intuitively, typical sets capture the bulk of probability mass in a distribution, and a sequence is “typical” if its NLL is near the true entropy rate. Checking whether the average NLL lies within an ε-band around H amounts to verifying whether x^n belongs to the typical s…"
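The ε-band test mentioned in the Figure 4 excerpt can be made concrete with a short sketch. It computes the average negative log-likelihood (NLL) of a sequence under a known first-order Markov chain and checks whether it lies within ε of the entropy rate H. The two-state chain, the example sequences, and the function names are illustrative assumptions. In the paper's setting the NLL would come from a trained model rather than from the true process used here as a stand-in.

```python
import math


def average_nll(sequence, matrix, initial_probs):
    """Return -(1/n) * log2 p(x^n) under a first-order Markov chain.

    Raises ValueError if the sequence contains a zero-probability step.
    """
    log_prob = math.log2(initial_probs[sequence[0]])
    for prev, curr in zip(sequence, sequence[1:]):
        log_prob += math.log2(matrix[prev][curr])
    return -log_prob / len(sequence)


def is_typical(avg_nll, entropy_rate, epsilon):
    """Check whether the average NLL lies within an epsilon-band around H."""
    return abs(avg_nll - entropy_rate) <= epsilon


# Hypothetical symmetric two-state chain: stay with prob. 0.9, switch with 0.1.
matrix = [[0.9, 0.1], [0.1, 0.9]]
initial = [0.5, 0.5]
# For this symmetric chain every row has the same entropy, so the entropy
# rate equals the entropy of a single row.
H = -(0.9 * math.log2(0.9) + 0.1 * math.log2(0.1))

constant = [0] * 20                       # never switches: too predictable
mixed = [0] * 7 + [1] * 7 + [0] * 6       # switches twice in 19 transitions
alternating = [0, 1] * 10                 # switches every step: too surprising

for name, seq in [("constant", constant), ("mixed", mixed), ("alternating", alternating)]:
    nll = average_nll(seq, matrix, initial)
    print(name, round(nll, 3), is_typical(nll, H, epsilon=0.1))
```

With ε = 0.1, only the sequence whose switching frequency roughly matches the chain falls inside the band. The other two sit below and above it, which is one way to read "different regimes" relative to the typical set.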

**Figure 5.** Cumulative density function (CDF) of average NLL of generated sequences. Excerpt from the paper: "… interesting that from the average NLL results and their connection to the typical set concept, we are able to observe important LLM behaviors seen in practice from the simple GPT-2 model trained using data probes. This illustrates the potential of data probes for achieving an in-depth understanding and analysis of LLMs, thus we advo…"

**Figure 6.** Entropy value distribution of randomly generated Markov Chains (128 states) with Dirichlet parameter α. Excerpt from the paper: "… synthetic data required negligible storage or curation. This setup’s simplicity highlights how one can conduct controlled LLM experiments without the large overhead of real-text pipelines. D. Generating a Markov Chain with Target Entropy Rate Let us define M as the number of states in the Markov chain, w…"
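To show how the Dirichlet parameter α relates to the entropy of a random Markov chain, the sketch below samples transition matrices for a few α values and computes each chain's entropy rate, H = −Σ_i π_i Σ_j P_ij log2 P_ij, where π is the stationary distribution. It does not reproduce the paper's procedure for hitting a target entropy rate, which is truncated in the excerpt. The 16-state size, the power-iteration solver with a fixed iteration count, and all function names are assumptions chosen to keep the example small.

```python
import math
import random


def sample_transition_matrix(num_states, alpha, rng):
    """Rows drawn from a symmetric Dirichlet(alpha) via normalized Gamma draws."""
    rows = []
    for _ in range(num_states):
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_states)]
        total = sum(draws)
        rows.append([d / total for d in draws])
    return rows


def stationary_distribution(matrix, iterations=500):
    """Approximate the stationary distribution by power iteration.

    Assumes the chain is ergodic, which holds for strictly positive rows.
    """
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return pi


def entropy_rate(matrix):
    """Entropy rate in bits per symbol of a stationary first-order Markov chain."""
    pi = stationary_distribution(matrix)
    return -sum(
        pi[i] * p * math.log2(p)
        for i, row in enumerate(matrix)
        for p in row
        if p > 0.0
    )


rng = random.Random(0)
for alpha in (0.1, 1.0, 10.0):
    rate = entropy_rate(sample_transition_matrix(16, alpha, rng))
    print(f"alpha={alpha}: entropy rate = {rate:.3f} bits/symbol")
```

Small α tends to give sparse, peaked rows and hence low entropy rates, while large α pushes rows toward uniform and the entropy rate toward log2 of the number of states. This makes α a convenient single knob for a controlled sweep.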

Original abstract

Data is fundamental to large language models (LLMs). However, understanding of what makes certain data useful for different stages of an LLM workflow, including training, tuning, alignment, in-context learning, etc., and why, remains an open question. Current approaches rely heavily on extensive experimentation with large public datasets to obtain empirical heuristics for data filtering and dataset construction. These approaches are compute intensive and lack a principled way of understanding the essence of how specific data characteristics drive LLM behavior. In this position paper, we advocate for the need of developing systematic methodologies for generating synthetic sequences from appropriately defined random processes, with the goal that these sequences can reveal useful characteristics when they are used in one or multiple stages of the LLM workflow. We refer to such sequences as data probes. By observing LLM behavior on data probes, researchers can systematically conduct studies on how data characteristics influence model performance, generalization, and robustness. The probing sequences exhibit statistical properties that can be viewed using theoretical concepts, such as typical sets, which are generalized to describe the behaviors of LLMs. This data-probe approach provides a pathway for uncovering foundational insights into the role of data in LLM training and inference, beyond empirical heuristics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This position paper sensibly flags the limits of empirical data heuristics for LLMs and calls for synthetic probes from random processes, but supplies no examples or validation that the idea would transfer to real data.

The colleague should know two things about this paper. First, it is a position piece that argues for creating synthetic data probes from random processes to systematically study data effects on LLMs instead of depending on large-scale empirical trials. Second, while the idea addresses a genuine limitation in current practice, the manuscript does not include any specific constructions or evidence that such probes would work as intended.

The paper does a good job highlighting the shortcomings of existing approaches. It notes that understanding data utility for training, tuning, alignment, and in-context learning is still mostly based on heuristics from experimenting with public datasets. This is compute-intensive and doesn't provide principled insights into why certain data characteristics matter. Proposing controllable synthetic sequences as diagnostic tools is a logical response to that issue. The use of theoretical concepts like generalized typical sets to describe the probes is a nice touch. It suggests a way to connect the synthetic data to formal properties that could be analyzed.

On the soft spots, the main one is the absence of details. There are no examples of what kind of random process to use, how to define the statistical properties, or any small-scale test showing that LLM behavior on probes predicts effects on real data. Without that, the risk is that any findings stay tied to the artificial setup and don't transfer. The stress-test concern about generation artifacts and distributional shift to natural text is fair and not addressed in the text.

This paper is for researchers in data-centric machine learning and LLM development who are frustrated with black-box dataset curation. Someone looking to explore new methods for understanding data influence could find it a useful prompt for their own work.

It deserves peer review. The core argument is coherent and points to an open problem, so referees could help strengthen it by suggesting ways to make the probes more concrete and testable.

Referee Report

2 major / 1 minor

Summary. The manuscript is a position paper arguing that current empirical approaches to data selection for LLMs—relying on large-scale experimentation with public datasets—are compute-intensive and lack principled understanding of how specific data characteristics drive performance. It advocates developing systematic methodologies to generate synthetic sequences ('data probes') from appropriately defined random processes; these sequences would be inserted into LLM training, tuning, or inference stages and analyzed via generalized typical-set concepts to reveal causal effects on generalization and robustness.

Significance. If the data-probe framework can be made concrete and shown to transfer, it would supply a lower-cost, more controllable alternative to brute-force empirical heuristics and could yield falsifiable, theoretically grounded insights into data's role in LLMs. The position correctly identifies a methodological gap between information-theoretic notions of typicality and practical LLM data curation.

major comments (2)

[Abstract] Abstract and core proposal: the claim that sequences drawn from suitably chosen random processes will expose causal data characteristics on LLMs rests on an unargued mapping from engineered probe statistics to real-corpus effects; no construction of such a process, no toy example, and no validation strategy against distributional shift are supplied, so observed behaviors could be artifacts of the artificial measure rather than transferable insights.
[The data-probe approach] The data-probe approach section: the manuscript does not discuss how to select or parameterize the random processes so that the controlled properties (entropy rate, Markov order, long-range dependence, etc.) are precisely those that matter for LLM behavior, leaving the method vulnerable to the concern that any measured effects are generation artifacts.

minor comments (1)

[Introduction] The introduction of the term 'data probes' would benefit from a brief contrast with existing synthetic-data or probing techniques in the LLM literature to clarify novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where the position paper can be made more concrete while preserving its advocacy focus. We address each major comment below and commit to revisions that strengthen the argument without overclaiming current results.

Referee: [Abstract] Abstract and core proposal: the claim that sequences drawn from suitably chosen random processes will expose causal data characteristics on LLMs rests on an unargued mapping from engineered probe statistics to real-corpus effects; no construction of such a process, no toy example, and no validation strategy against distributional shift are supplied, so observed behaviors could be artifacts of the artificial measure rather than transferable insights.

Authors: We agree that the manuscript would be strengthened by an explicit discussion of the mapping from probe statistics to real-corpus effects. As a position paper, our primary goal is to advocate for developing such methodologies rather than delivering a fully worked-out implementation. In the revision we will add a new subsection that sketches example constructions (e.g., controlled Markov chains with tunable entropy rates and long-range dependence parameters chosen to approximate linguistic statistics) and outlines a validation strategy that includes (i) matching probe statistics to those measured on real pre-training subsets and (ii) using importance weighting or domain-adaptation diagnostics to check for distributional-shift artifacts. This addition will make the transferability argument more transparent while remaining within the scope of a position paper.

revision: yes

Referee: [The data-probe approach] The data-probe approach section: the manuscript does not discuss how to select or parameterize the random processes so that the controlled properties (entropy rate, Markov order, long-range dependence, etc.) are precisely those that matter for LLM behavior, leaving the method vulnerable to the concern that any measured effects are generation artifacts.

Authors: We acknowledge that the current draft leaves the parameterization question open. We will expand the data-probe approach section to include concrete guidance on process selection. Specifically, we will describe (a) estimating target statistics such as entropy rate and autocorrelation structure from representative real corpora, (b) fitting probe generators (e.g., via maximum-likelihood or moment-matching) to reproduce those statistics, and (c) performing sensitivity analyses that vary one property at a time while holding others fixed. These additions directly address the risk that observed effects are generation artifacts by tying the probes to properties known to influence LLM behavior.

revision: yes
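Step (b) of the simulated response, fitting a probe generator by maximum likelihood, has a simple closed form for a first-order Markov chain: the estimate of each transition probability is the normalized bigram count. The sketch below shows that estimator. The function name, the optional additive smoothing, and the uniform fallback for unseen states are assumptions introduced here for illustration and are not taken from the paper or the rebuttal.

```python
def fit_transition_matrix(sequence, num_states, smoothing=0.0):
    """Maximum-likelihood fit of a first-order Markov chain from one sequence.

    With smoothing=0.0 this is the plain MLE (normalized bigram counts).
    A positive value adds the same pseudo-count to every transition.
    """
    counts = [[smoothing] * num_states for _ in range(num_states)]
    for prev, curr in zip(sequence, sequence[1:]):
        counts[prev][curr] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Assumption: a state with no observed outgoing transitions
            # falls back to a uniform row so the matrix stays stochastic.
            matrix.append([1.0 / num_states] * num_states)
        else:
            matrix.append([c / total for c in row])
    return matrix


# Toy token sequence standing in for a tokenized real corpus (hypothetical).
observed = [0, 0, 1, 0, 1, 1, 0, 0]
fitted = fit_transition_matrix(observed, num_states=2)
print(fitted)
```

A generator fitted this way reproduces the bigram statistics of the observed sequence but nothing beyond first order. Properties such as long-range dependence would need a richer process family.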

Circularity Check

0 steps flagged

No circularity: position paper with no derivations or self-referential predictions

This is a position paper that contrasts existing compute-intensive empirical methods for studying data effects on LLMs with a proposed future direction of using synthetic data probes generated from random processes and generalized typical-set concepts. It contains no equations, fitted parameters, predictions, or derivations that could reduce to their own inputs by construction. The central claim is an advocacy for developing new methodologies rather than a closed logical chain or result derived from prior self-citations. No load-bearing self-citations, ansatzes, or renamings of known results are present in the provided text. The argument is self-contained as a call for research and does not rely on any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central proposal rests on the untested premise that controlled synthetic sequences can isolate data effects relevant to real LLM workflows; no free parameters or invented physical entities are introduced, but the data-probe concept itself is postulated without independent validation in the provided text.

axioms (1)

Domain assumption: Synthetic sequences generated from appropriately defined random processes can isolate and reveal data characteristics that drive LLM behavior in training, tuning, and inference.
This premise is invoked as the justification for shifting from empirical heuristics to systematic probing.

invented entities (1)

data probes (no independent evidence)
Purpose: Controlled synthetic sequences used to study the influence of data properties on LLMs.
New term and concept introduced to enable the advocated systematic studies.



work page 2024

[10] [10]

lm-evaluation-harness: v0.4.10, January 2026

EleutherAI. lm-evaluation-harness: v0.4.10, January 2026. URL https://doi.org/10.5281/zenodo.18394108

work page doi:10.5281/zenodo.18394108 2026

[11] [11]

Open LLM leaderboard v2

Fourrier, C., Habib, N., Lozovskaya, A., Szafer, K., and Wolf, T. Open LLM leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2024

work page 2024

[12] [12]

Gardner, M., Merrill, W., Dodge, J., Peters, M., Ross, A., Singh, S., and Smith, N. A. Competency problems: On finding and removing artifacts in language data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 1801--1813, November 2021. doi:10.18653/v1/2021.emnlp-main.135. URL https://aclanthology.org/2021.emn...

work page doi:10.18653/v1/2021.emnlp-main.135 2021

[13] [13]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team , Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., Tafti, P., Hussenot, L., Sessa, P. G., Chowdhery, A., Roberts, A., Barua, A., Botev, A., Castro-Ros, A., Slone, A., Héliou, A., Tacchetti, A., Bulanova, A., Paterson, A., Tsai, B., Shahriari, B., Lan, C. L., Choquette-Choo, C. A.,...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

E., Kadhe, S

Gohari, H. E., Kadhe, S. R., Shah, Y., Adam, C. M., Adebayo, A., Adusumilli, P., Ahmed, F., Baracaldo, N., Borse, S. S., Chang, Y.-C., Dang, X.-H., Desai, N., Eres, R., Iwamoto, R., Karve, A. A., Koyfman, Y., Lee, W.-H., Liu, C., Lublinsky, B., Ohko, T., Pesce, P., Touma, M., Wang, S., Witherspooon, S., Woisetschl \"a ger, H., Wood, D., Wu, K.-L., Yoshida...

work page 2026

[15] [15]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., et al. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., and Smith, N. A. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) , pp.\ 107--112, June 2018. doi:10.18653/v...

work page doi:10.18653/v1/n18-2017 2018

[17] [17]

A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Vinyals, O., Rae, J

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Vinyals, O., Rae, J. W., and Sifre, L. Training compute-optimal large language models. In Proce...

work page 2022

[18] [18]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., and Liu, T. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst., 43 0 (2), January 2025. doi:10.1145/3703155. URL https://doi.org/10.1145/3703155

work page doi:10.1145/3703155 2025

[19] [19]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2020

[20] [20]

FineWeb-Edu : the finest collection of educational content, 2024

Lozhkov, A., Ben Allal, L., von Werra, L., and Wolf, T. FineWeb-Edu : the finest collection of educational content, 2024. URL https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu

work page 2024

[21] [21]

V., Bondaschi, M., Girish, A., Nagle, A., Jaggi, M., Kim, H., and Gastpar, M

Makkuva, A. V., Bondaschi, M., Girish, A., Nagle, A., Jaggi, M., Kim, H., and Gastpar, M. Attention with Markov : A curious case of single-layer transformers. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=SqZ0KY4qBD

work page 2025

[22] [22]

Manning, C. D. and Schutze, H. Foundations of Statistical Natural Language Processing. MIT Press, 1999

work page 1999

[23] [23]

Mishra, M., Stallone, M., Zhang, G., Shen, Y., Prasad, A., Soria, A. M., Merler, M., Selvam, P., Surendran, S., Singh, S., Sethi, M., Dang, X.-H., Li, P., Wu, K.-L., Zawad, S., Coleman, A., White, M., Lewis, M., Pavuluri, R., Koyfman, Y., Lublinsky, B., de Bayser, M., Abdelaziz, I., Basu, K., Agarwal, M., Zhou, Y., Johnson, C., Goyal, A., Patel, H., Shah,...

work page arXiv 2024

[24] [24]

J., Mindermann, S., Moscovitz, I., Pan, A

Pacchiardi, L., Chan, A. J., Mindermann, S., Moscovitz, I., Pan, A. Y., Gal, Y., Evans, O., and Brauner, J. M. How to catch an AI liar: Lie detection in black-box LLM s by asking unrelated questions. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=567BjxgaTp

work page 2024

[25] [25]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Penedo, G., Kydlíček, H., allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Von Werra, L., and Wolf, T. The FineWeb datasets: Decanting the web for the finest text data at scale, 2024. URL https://arxiv.org/abs/2406.17557

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [26]

GPT-2 Model Card , 2019 a

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. GPT-2 Model Card , 2019 a . URL https://huggingface.co/openai-community/gpt2

work page 2019

[27] [27]

Language models are unsupervised multitask learners

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI Blog, 1 0 (8): 0 9, 2019 b

work page 2019

[28] [28]

V., Ramchandran, K., and Gastpar, M

Rajaraman, N., Bondaschi, M., Makkuva, A. V., Ramchandran, K., and Gastpar, M. Transformers on Markov data: Constant depth suffices. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=5uG9tp3v2q

work page 2024

[29] [29]

Toward transparent AI : A survey on interpreting the inner structures of deep neural networks

R\"auker, T., Ho, A., Casper, S., and Hadfield-Menell, D. Toward transparent AI : A survey on interpreting the inner structures of deep neural networks. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp.\ 464--483, Los Alamitos, CA, USA, February 2023. IEEE Computer Society. doi:10.1109/SaTML54575.2023.00039. URL https://doi.i...

work page doi:10.1109/satml54575.2023.00039 2023

[30] [30]

NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark

Sainz, O., Campos, J., Garc \'i a-Ferrero, I., Etxaniz, J., de Lacalle, O. L., and Agirre, E. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 10776--10787, December 2023. doi:10.18653/v1/2023.findings-emnlp.722. URL https://aclanthol...

work page doi:10.18653/v1/2023.findings-emnlp.722 2023

[31] [31]

A., and Kolter, J

Sam, D., Finzi, M. A., and Kolter, J. Z. Predicting the performance of black-box language models with follow-up queries. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=IBFnEaArnz

work page 2025

[32] [32]

Shannon, C. E. A mathematical theory of communication. The Bell system technical journal, 27 0 (3): 0 379--423, 1948

work page 1948

[33] [33]

and Yu, Z

Shu, Y. and Yu, Z. Distribution shifts are bottlenecks: Extensive evaluation for grounding language models to knowledge bases. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pp.\ 71--88, March 2024. URL https://aclanthology.org/2024.eacl-srw.7/

work page 2024

[34] [34]

arXiv preprint arXiv:2402.01761 , year=

Singh, C., Inala, J. P., Galley, M., Caruana, R., and Gao, J. Rethinking interpretability in the era of large language models, 2024. URL https://arxiv.org/abs/2402.01761

work page arXiv 2024

[35] [35]

Nemotron- CC : Transforming C ommon C rawl into a refined long-horizon pretraining dataset

Su, D., Kong, K., Lin, Y., Jennings, J., Norick, B., Kliegl, M., Patwary, M., Shoeybi, M., and Catanzaro, B. Nemotron- CC : Transforming C ommon C rawl into a refined long-horizon pretraining dataset. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 2459--2475, July 2025. doi:10.18653...

work page doi:10.18653/v1/2025.acl-long.123 2025

[36] [36]

Transformers learn in-context by gradient descent

Von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., and Vladymyrov, M. Transformers learn in-context by gradient descent. In Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, 2023

work page 2023

[37] [37]

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. SuperGLUE : a stickier benchmark for general-purpose language understanding systems. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019

work page 2019

[38] [38]

Survey on factuality in large language models

Wang, C., Liu, X., Yue, Y., Guo, Q., Hu, X., Tang, X., Zhang, T., Jiayang, C., Yao, Y., Hu, X., Qi, Z., Gao, W., Wang, Y., Yang, L., Wang, J., Xie, X., Zhang, Z., and Zhang, Y. Survey on factuality in large language models. ACM Comput. Surv., 58 0 (1), September 2025. ISSN 0360-0300. doi:10.1145/3742420. URL https://doi.org/10.1145/3742420

work page doi:10.1145/3742420 2025

[39] [39]

Weber, M., Fu, D. Y., Anthony, Q., Oren, Y., Adams, S., Alexandrov, A., Lyu, X., Nguyen, H., Yao, X., Adams, V., Athiwaratkun, B., Chalamala, R., Chen, K., Ryabinin, M., Dao, T., Liang, P., Ré, C., Rish, I., and Zhang, C. RedPajama : an open dataset for training large language models. NeurIPS Datasets and Benchmarks Track, 2024

work page 2024

[40] [40]

Measuring and reducing LLM hallucination without gold-standard answers, 2024

Wei, J., Yao, Y., Ton, J.-F., Guo, H., Estornell, A., and Liu, Y. Measuring and reducing LLM hallucination without gold-standard answers, 2024. URL https://arxiv.org/abs/2402.10412

work page arXiv 2024

[41] [41]

QuRating : selecting high-quality data for training language models

Wettig, A., Gupta, A., Malik, S., and Chen, D. QuRating : selecting high-quality data for training language models. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024

work page 2024

[42] [42]

Are we there yet? revealing the risks of utilizing large language models in scholarly peer review,

Ye, R., Pang, X., Chai, J., Chen, J., Yin, Z., Xiang, Z., Dong, X., Shao, J., and Chen, S. Are we there yet? revealing the risks of utilizing large language models in scholarly peer review, 2024. URL https://arxiv.org/abs/2412.01708

work page arXiv 2024

[43] [43]

Large language models as markov chains

Zekri, O., Odonnat, A., Benechehab, A., Bleistein, L., Boullé, N., and Redko, I. Large language models as Markov chains, 2024. URL https://arxiv.org/abs/2410.02724

work page arXiv 2024

[44] [44]

E., and Stoica, I

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging LLM -as-a-judge with MT -bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023

work page 2023