arxiv: 2604.23267 · v2 · pith:E25CM7BSnew · submitted 2026-04-25 · 💻 cs.CL · cs.LG

Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

Pith reviewed 2026-05-21 00:28 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords: fine-tuning, in-context learning, formal languages, generalization, inductive biases, language proficiency, discriminative test, large language models

The pith

Fine-tuning produces greater language proficiency than in-context learning on in-distribution generalization, while both match on out-of-distribution generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models learn languages either by fine-tuning their internal weights or by in-context learning from prompt examples. To compare the two modes rigorously, the paper uses formal languages whose exact rules define which strings are valid. A discriminative test then measures proficiency by checking whether the model assigns higher probability to valid strings than to invalid ones. The results indicate that fine-tuning is more proficient on in-distribution strings, while the two modes perform equally well on out-of-distribution strings. Their inductive biases are similar while the language is only partially learned but diverge as proficiency grows, and in-context learning is more sensitive to the choice of model and to the language's token vocabulary.

Core claim

Using a formal language task that offers precise boundaries and controlled string sampling, along with a test requiring higher generation probability for in-language strings, the authors establish that fine-tuning yields superior in-distribution generalization compared to in-context learning. The two modes achieve equivalent out-of-distribution generalization. Correlations in generation probabilities reveal similar inductive biases at moderate proficiency that diverge at higher levels. In-context learning performance varies substantially across models and depends on the language's token vocabulary.
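The discriminative test can be illustrated with a minimal sketch, assuming per-string log-probabilities have already been obtained from a model. The score below is the fraction of (in-language, out-of-language) pairs in which the in-language string receives the higher value, which is the rank-based form of the AUC. The function name `discriminative_auc` and the toy numbers are hypothetical and are not taken from the paper or its released code.

```python
from itertools import product


def discriminative_auc(in_lang_logps, out_lang_logps):
    """Fraction of (in, out) pairs where the in-language string scores higher.

    Ties count as half a win, matching the usual rank-based AUC definition.
    """
    if not in_lang_logps or not out_lang_logps:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for pos, neg in product(in_lang_logps, out_lang_logps):
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5
    return wins / (len(in_lang_logps) * len(out_lang_logps))


# Made-up log-probabilities, for illustration only.
valid_scores = [-10.2, -11.5, -9.8]
invalid_scores = [-14.0, -10.9, -15.3]
auc = discriminative_auc(valid_scores, invalid_scores)
print(round(auc, 3))
```

A score of 0.5 corresponds to chance and 1.0 to perfect separation, so two learning modes evaluated on the same string sets can be compared on one scale.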

What carries the argument

The formal language defined by precise generative rules combined with a discriminative test comparing probabilities assigned to valid versus invalid strings.
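To make "precise generative rules" concrete, here is a small sketch of sampling from a probabilistic grammar and computing the probability of the resulting derivation as the product of the probabilities of the rules applied, in the spirit of the Figure 3 caption. The toy grammar, the `sample` helper, and the seed are assumptions made for illustration; they are not any of the paper's grammars.

```python
import random

# Hypothetical toy grammar (not one of the paper's grammars): each non-terminal
# maps to a list of (right-hand side, probability) pairs. Symbols that are not
# keys of GRAMMAR are terminals.
GRAMMAR = {
    "S": [(("A", "B"), 1.0)],
    "A": [(("a",), 0.5), (("a", "A"), 0.5)],
    "B": [(("b",), 0.5), (("b", "b"), 0.5)],
}


def sample(symbol, rng):
    """Expand `symbol` recursively; return (terminal tokens, derivation probability)."""
    if symbol not in GRAMMAR:  # terminal symbol
        return [symbol], 1.0
    rules = GRAMMAR[symbol]
    rhs, rule_prob = rng.choices(rules, weights=[w for _, w in rules], k=1)[0]
    tokens, prob = [], rule_prob
    for child in rhs:
        child_tokens, child_prob = sample(child, rng)
        tokens.extend(child_tokens)
        prob *= child_prob  # multiply the probabilities of all applied rules
    return tokens, prob


rng = random.Random(0)
tokens, prob = sample("S", rng)
print(" ".join(tokens), prob)
```

This toy grammar is unambiguous, so the derivation probability equals the string probability; for an ambiguous grammar the string probability would be the sum over all derivations.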

If this is right

  • Fine-tuning leads to better performance on patterns matching the training distribution.
  • Both fine-tuning and in-context learning support comparable generalization beyond the training distribution.
  • Inductive biases of the two modes are similar during partial learning of the language but diverge with increasing proficiency.
  • In-context learning performance fluctuates across model sizes, model families, and token vocabularies, whereas fine-tuning performance does not.
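The inductive-bias comparison in the third bullet is measured as a Pearson correlation between the generation losses of the two modes on identical test strings (see the Figure 9 caption). A minimal sketch of that computation follows, assuming the per-string losses are already available; the `pearson` helper and the loss values are hypothetical placeholders, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-string losses on the same five test strings (illustration only).
ft_losses = [0.8, 1.1, 2.3, 0.5, 1.9]
icl_losses = [1.0, 1.4, 2.0, 0.9, 2.5]
r = pearson(ft_losses, icl_losses)
print(round(r, 3))
```

A value near 1 would mean the two modes find the same strings easy or hard; a lower value at high proficiency is what the summary describes as diverging biases.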

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If these patterns hold for natural language, in-context learning may be sufficient for tasks emphasizing novel inputs without the cost of fine-tuning.
  • The point at which biases diverge could explain why fine-tuned models sometimes exhibit more stable behavior than prompted ones.
  • Applying the testbed to formal languages with recursive or hierarchical rules would check if the results extend to more complex structures resembling natural syntax.

Load-bearing premise

The formal language task with its precise boundaries and the probability-based discriminative test accurately measure language proficiency and isolate inductive biases relevant to natural language.

What would settle it

Finding that fine-tuning loses its in-distribution advantage over in-context learning when tested on a formal language with long-range dependencies or recursive structures would challenge the central claim.

Figures

Figures reproduced from arXiv: 2604.23267 by Bishwamittra Ghosh, Deepak Garg, Evimaria Terzi, Krishna P. Gummadi, Mohammad Aflah Khan, Qinyuan Wu, Soumi Das, Till Speicher.

Figure 1. Fine-tuning and in-context learning are two…
Figure 3. A string s generated by the grammar in Figure 2. The rule ‘A19 → A18 A16 [1]’ indicates that non-terminal A19 is expanded to A18 followed by A16 with probability 1, and so on, until reaching T. The generation probability of s is the product of the probabilities of the rules applied recursively to generate s, so $P(s) = (0.5)^{23}$.
… illustrate a representative grammar and a sampled string, respectively. Addi…
Figure 4. We visualize the set of all strings in a hierar…
Figure 5. Language proficiency of Mistral-7B on language L1, while varying the number of examples in both learning modes.
… value, the better. Thus, LLM $M$ is more language proficient in $L$ than $M'$ if $\mathrm{auc}_{M}(L, T(L)) > \mathrm{auc}_{M'}(L, T(L))$. We formalize the comparability of the discriminative test in the following claim. Claim 1. For a given language, the discriminative test yields a numerically comparable score between two le…
Figure 6. FT and ICL across different LLMs while learning language L1. Different LLMs demonstrate similar FT performance, but their ICL ability varies.
Accompanying table, ICL ability (AUC range) by model:
  • Good (≥ 0.75): Qwen-2.5-7B, Mistral-7B, Qwen-2.5-1.5B, Llama-2-13B, Qwen-2.5-0.5B, Llama-2-7B, Mistral-12B
  • Moderate (≥ 0.6): Gemma-2-2B, Gemma-2-9B, Pythia-6.9B, Opt-1.3B, Opt-6.7B, Pythia-1B, Llama-3.2-3B, Opt-2.7B, Llama-3.2-1B
  • Poor (<…
Figure 7. In-distribution generalization of FT vs. ICL on L1 in LLMs of comparable size (≈ 7B parameters). FT usually dominates ICL, except in Qwen-2.5-7B, Mistral-7B, and Llama-2-7B, where ICL is close to FT.
… fit in their context. We find the following order of ICL ability of LLM families: Qwen (0.78) ≥ Mistral (0.78) > Llama-2 (0.77) > Gemma (0.69) > Opt (0.64) > Pythia (0.61) > Llama-3 (0.59). Due to variable performan…
Figure 9. Inductive bias of ICL and FT, computed as the Pearson correlation of generation loss of FT and ICL on identical test strings. Correlation, despite being positive, tends to decrease with more examples (larger markers).
… the inductive bias of FT and ICL, we do not focus on how each mode operates internally, but on the correlation between their generation losses when evaluated on the same set of strings. Thus,…
Figure 10. Robustness of language proficiency of FT and ICL in Qwen-2.5-7B while varying languages in two ways: changing the grammar rules (rows) and changing the alphabet tokens (columns). The underlying grammar for a language is inside the parentheses. Compared to FT, ICL is sensitive to the tokens used in the language, despite having the same underlying grammar.
… by Mosbach et al. (2023) (Appendix D). These findi…
Figure 11. Production rules of $G^{\text{Numerical}}_{\alpha}$ (left) and $G^{\text{Latin}}_{\alpha}$ (right).
… occurrences of each string to the training or test set. This process repeats until the initial finite set is exhausted.
Figure 12. Production rules of $G^{\text{Numerical}}_{\beta}$ (left) and $G^{\text{Latin}}_{\beta}$ (right).
Figure 13. Length distribution of considered probabilistic languages, based on…
Figure 14. Representative strings from different languages, annotated with non-terminals applied in different…
Figure 15. Optimal fine-tuning performance in all models across different languages.
Figure 16. In-context learning performance of all models across different languages.
Figure 17. Intra-family FT performance.
Figure 18. Intra-family ICL performance.
Figure 19. Qwen-2.5-7B: comparison between fine-tuning and in-context learning across different languages.
Plot residue (AUC vs. No. Examples), panels: (a) Language L1 ($G^{\text{Numerical}}_{\alpha}$); (b) Language L2 ($G^{\text{Latin}}_{\alpha}$); (c) Language L3 ($G^{\text{Under-trained}}_{\alpha}$); (…
Figure 20. Mistral-7B: comparison between fine-tuning and in-context learning across different languages.
Figure 21. Llama-2-7B: comparison between fine-tuning and in-context learning across different languages.
Plot residue (AUC vs. No. Examples), panels: (a) Language L1 ($G^{\text{Numerical}}_{\alpha}$); (b) Language L3 ($G^{\text{Under-trained}}_{\alpha}$); (c) Language L4 ($G^{\text{Numerical}}_{\beta}$); (…
Figure 22. Llama-3.1-8B: comparison between fine-tuning and in-context learning across different languages.
Figure 23. Inductive bias of ICL and FT on language L1, computed as the Pearson correlation of generation loss of FT and ICL on identical test strings. Correlation, despite being positive, tends to decrease with more examples (larger markers).
Figure 24. Inductive bias of ICL and FT on language L2, computed as the Pearson correlation of generation loss of FT and ICL on identical test strings. Correlation, despite being positive, tends to decrease with more examples (larger markers).
Figure 25. Inductive bias of ICL and FT on language L4, computed as the Pearson correlation of generation loss of FT and ICL on identical test strings. Correlation, despite being positive, tends to decrease with more examples (larger markers).
Figure 26. Inductive bias of ICL and FT on language L5, computed as the Pearson correlation of generation loss of FT and ICL on identical test strings. Correlation, despite being positive, tends to decrease with more examples (larger markers).
Figure 27. Out-of-distribution generalization to languages of increasing distance using…
Figure 28. In-distribution generalization of FT and ICL on the MNLI dataset, where the learning task is to perform natural language inference by generating the sentiment label {entailment, neutral, contradiction} given premise and hypothesis. At a high level, FT is better than ICL with more examples, consistent with results on formal languages. In a detailed analysis, we observe that different LLMs perform different…
Figure 29. MNLI dataset: In-distribution (inference within the same genre, Column…
Figure 30. Testing the limit of utilizing ICL context (1536 examples ≈ 77K tokens) on language L1. Training loss provides a lower bound of test loss in ICL. Long context LLMs cannot further improve from additional examples.
Figure 31. Testing the limit of utilizing ICL context (1536 examples ≈ 77K tokens) on language L2. Training loss provides a lower bound of test loss in ICL. Long context LLMs cannot further improve from additional examples.
Figure 32. Testing the limit of utilizing ICL context on language L4. Training loss provides a lower bound of test loss in ICL. Long context LLMs cannot further improve from additional examples.
Figure 33. Testing the limit of utilizing ICL context on language L5. Training loss provides a lower bound of test loss in ICL. Long context LLMs cannot further improve from additional examples.
Figure 34. Qwen-2.5-7B: Language proficiency according to generative (first row) and discriminative (second row) tests. First two columns are for language L1, and the last two columns are for language L4.
Plot residue (Loss vs. No. Examples; legend: Incorrect Random, Incorrect by 3 Edit, Incorrect by 2 Edit, Incorrect by 1 Edit, Correct Test), panels: (a) FT, Language L1, Generative performance; …
Figure 35. Mistral-7B: Language proficiency according to generative (first row) and discriminative (second row) tests. First two columns are for language L1, and the last two columns are for language L4.
Figure 36. Llama-2-7B: Language proficiency according to generative (first row) and discriminative (second row) tests. First two columns are for language L1, and the last two columns are for language L4.
Plot residue (Loss vs. No. Examples; legend: Incorrect Random, Incorrect by 3 Edit, Incorrect by 2 Edit, Incorrect by 1 Edit, Correct Test), panels: (a) FT, Language L1, Generative performance; …
Figure 37. Llama-3.1-8B: Language proficiency according to generative (first row) and discriminative (second row) tests. First two columns are for language L1, and the last two columns are for language L4.
Figure 38. Gemma-2-9B: Language proficiency according to generative (first row) and discriminative (second row) tests. First two columns are for language L1, and the last two columns are for language L4.
Plot residue (Loss vs. No. Examples; legend: Incorrect Random, Incorrect by 3 Edit, Incorrect by 2 Edit, Incorrect by 1 Edit, Correct Test), panels: (a) FT, Language L1, Generative performance; …
Figure 39. Pythia-6.9B: Language proficiency according to generative (first row) and discriminative (second row) tests. First two columns are for language L1, and the last two columns are for language L4.
Figure 40. Opt-6.7B: Language proficiency according to generative (first row) and discriminative (second row) tests. First two columns are for language L1, and the last two columns are for language L4.
Plot residue (x-axis: No. Examples), panels: (a) Mistral-7B, Training Time (s); (b) Mistral-7B, Inference Time (s); (c) Mistral-7B, GPU Memory (GB); …
Figure 41. Comparing FT and ICL across compute cost, such as training cost (column 1), inference cost (column 2), and memory cost (column 3), recorded for language L1. The results show that FT and ICL are expensive in different phases of computation: FT incurs training cost, which does not apply to ICL. In contrast, ICL has significantly higher inference cost, despite requiring less memory. Our paper therefore compa…
Original abstract

Large language models (LLMs) operate in two fundamental learning modes - fine-tuning (FT) and in-context learning (ICL) - raising key questions about which mode yields greater language proficiency and whether they differ in their inductive biases. Prior studies comparing FT and ICL have yielded mixed and inconclusive results due to inconsistent experimental setups. To enable a rigorous comparison, we propose a formal language learning task - offering precise language boundaries, controlled string sampling, and no data contamination - and introduce a discriminative test for language proficiency, where an LLM succeeds if it assigns higher generation probability to in-language strings than to out-of-language strings. Empirically, we find that: (a) FT has greater language proficiency than ICL on in-distribution generalization, but both perform equally well on out-of-distribution generalization. (b) Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes partially learn the language but diverge at higher proficiency levels. (c) Unlike FT, ICL performance differs substantially across models of varying sizes and families and is sensitive to the token vocabulary of the language. Thus, our work demonstrates the promise of formal languages as a controlled testbed for evaluating LLMs, behaviors that are difficult to isolate in natural language datasets. Our source code is available at https://github.com/bishwamittra/formallm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes a formal language learning task with precise boundaries and controlled sampling to rigorously compare fine-tuning (FT) and in-context learning (ICL) in LLMs. It introduces a discriminative proficiency test in which an LLM succeeds by assigning higher generation probability to in-language strings than to out-of-language strings. Experiments on synthetic languages show that FT exhibits greater proficiency than ICL on in-distribution generalization while both modes perform equally on out-of-distribution generalization; inductive biases (measured by correlation of string generation probabilities) are similar at partial learning but diverge at higher proficiency; and ICL performance varies substantially across model sizes/families and is sensitive to token vocabulary, unlike FT. The work positions formal languages as a clean testbed for isolating LLM behaviors that are hard to study in natural language data and releases the source code.

Significance. If the reported patterns survive verification that the discriminative test isolates grammar acquisition rather than surface statistics, the paper supplies a reproducible, contamination-free experimental framework that directly addresses the inconsistent findings in prior FT-vs-ICL comparisons. The explicit provision of code and the use of synthetic languages with known boundaries are concrete strengths that enable falsifiable follow-up work. The results on inductive-bias divergence and vocabulary sensitivity could usefully inform practical decisions about when to prefer FT over ICL.

major comments (1)
  1. The central claim that FT shows greater in-distribution proficiency than ICL rests on the discriminative test (Abstract and the experimental setup). The test declares success when the model assigns higher generation probability to in-language strings than to out-of-language strings. The manuscript states that string sampling is 'controlled,' yet provides no explicit statement that out-of-language strings are length-matched or token-frequency-matched to the in-language set. Without such controls, probability differences can be driven by the model's length bias or marginal token statistics rather than by internalization of the formal grammar. This confound would directly affect the reported FT > ICL gap on in-distribution generalization and the correlation measurements at higher proficiency levels. Please specify the exact sampling procedure for out-of-language strings and, if feasible, re
minor comments (1)
  1. The abstract asserts 'controlled string sampling' without enumerating the controls; a single additional sentence or a short methods paragraph would improve immediate clarity for readers who do not reach the full experimental section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The feedback correctly identifies the need for greater transparency in our sampling controls, which we will address directly. We respond to the major comment below and will incorporate the necessary clarifications in the revised manuscript.

Point-by-point responses
  1. Referee: The central claim that FT shows greater in-distribution proficiency than ICL rests on the discriminative test (Abstract and the experimental setup). The test declares success when the model assigns higher generation probability to in-language strings than to out-of-language strings. The manuscript states that string sampling is 'controlled,' yet provides no explicit statement that out-of-language strings are length-matched or token-frequency-matched to the in-language set. Without such controls, probability differences can be driven by the model's length bias or marginal token statistics rather than by internalization of the formal grammar. This confound would directly affect the reported FT > ICL gap on in-distribution generalization and the correlation measurements at higher proficiency levels. Please specify the exact sampling procedure for out-of-language strings and, if feasible, re

    Authors: We agree that explicit documentation of the matching procedure is essential to strengthen the claim. In Section 3.2 of the manuscript, out-of-language strings are generated by sampling strings of identical length to their in-language counterparts from a uniform distribution over the alphabet and rejecting any that fall inside the language (a low-probability event for the chosen formal languages). This guarantees length matching. Token-frequency matching is achieved because both sets are drawn from the same alphabet and the uniform sampler produces comparable marginal token statistics; the formal grammar itself is the only systematic difference. We will add a new subsection with pseudocode for the full sampling algorithm and an explicit statement confirming length and token-frequency controls. We believe these controls suffice to isolate grammar acquisition, but we are happy to add further matching (e.g., bigram frequencies) if the referee recommends it. revision: yes
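The sampling procedure described in the simulated response (draw a string of the same length uniformly over the alphabet, reject it if it falls inside the language) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the alphabet, the toy membership test `in_language` (strings of the form a^n b^n, standing in for a real grammar parser), and `sample_negative` are all hypothetical.

```python
import random

ALPHABET = ["a", "b"]  # hypothetical alphabet


def in_language(tokens):
    """Toy membership test: a^n b^n with n >= 1 (stand-in for a grammar parser)."""
    n = len(tokens) // 2
    return len(tokens) % 2 == 0 and n >= 1 and tokens == ["a"] * n + ["b"] * n


def sample_negative(length, rng, max_tries=1000):
    """Uniformly sample a string of `length` tokens that is NOT in the language."""
    for _ in range(max_tries):
        candidate = [rng.choice(ALPHABET) for _ in range(length)]
        if not in_language(candidate):
            return candidate
    raise RuntimeError("no out-of-language string found within max_tries")


rng = random.Random(0)
positives = [["a", "b"], ["a", "a", "b", "b"], ["a", "a", "a", "b", "b", "b"]]
# One length-matched negative per in-language string.
negatives = [sample_negative(len(p), rng) for p in positives]
for pos, neg in zip(positives, negatives):
    print("".join(pos), "".join(neg))
```

Length matching holds by construction in this sketch. Token-frequency matching does not follow automatically: a uniform sampler yields roughly uniform token marginals, so whether they match the in-language set would still need to be checked separately.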

Circularity Check

0 steps flagged

No circularity: claims rest on new experiments with explicitly defined synthetic task and test

Full rationale

The paper introduces a novel formal language learning task with precise boundaries and a discriminative proficiency test defined directly as assigning higher generation probability to in-language versus out-of-language strings. All reported comparisons (FT vs. ICL on in-distribution and out-of-distribution generalization, inductive bias correlations, and model/vocabulary sensitivity) are produced by fresh empirical runs on controlled synthetic data rather than by fitting parameters to prior results or invoking self-citations as load-bearing premises. The derivation chain is therefore self-contained: the success metric is stated explicitly in the abstract and applied to new measurements, with no reduction of outputs to inputs by construction or via author-overlapping uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that formal languages serve as valid proxies for natural language inductive biases and that probability assignment is a sufficient proxy for proficiency; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: LLMs assign generation probabilities to strings that reflect learned language structure.
    Invoked in the definition of the discriminative test for language proficiency.
  • domain assumption: Formal languages with precise boundaries avoid data contamination and allow exact in- versus out-of-language distinctions.
    Stated as the motivation for choosing formal languages over natural language datasets.

pith-pipeline@v0.9.0 · 5809 in / 1317 out tokens · 53541 ms · 2026-05-21T00:28:58.324368+00:00 · methodology


Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · 11 internal anchors

  3. [3]

    Ekin Akyürek, Bailin Wang, Yoon Kim, and Jacob Andreas. 2024. In-context language learning: Architectures and algorithms. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org

  4. [4]

    Zeyuan Allen-Zhu and Yuanzhi Li. 2023. Physics of language models: Part 1, learning hierarchical language structures. arXiv preprint arXiv:2305.13673

  5. [5]

    Akari Asai, Sneha Kudugunta, Xinyan Yu, Terra Blevins, Hila Gonen, Machel Reid, Yulia Tsvetkov, Sebastian Ruder, and Hannaneh Hajishirzi. 2024. https://doi.org/10.18653/v1/2024.naacl-long.100 BUFFET: Benchmarking large language models for few-shot cross-lingual transfer. In Proceedings of the 2024 Conference of the North American Chapter of the Associat...

  6. [6]

    Anas Awadalla, Mitchell Wortsman, Gabriel Ilharco, Sewon Min, Ian Magnusson, Hannaneh Hajishirzi, and Ludwig Schmidt. 2022. Exploring the landscape of distributional robustness for question answering models. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics

  7. [7]

    Amanda Bertsch, Maor Ivgi, Uri Alon, Jonathan Berant, Matthew R. Gormley, and Graham Neubig. 2024. https://openreview.net/forum?id=4KAmc7vUbq In-context learning with long-context models: An in-depth exploration. In First Workshop on Long-Context Foundation Models @ ICML 2024

  8. [8]

    Kush Bhatia, Avanika Narayan, Christopher M De Sa, and Christopher Ré. 2023. TART: A plug-and-play transformer module for task-agnostic reasoning. Advances in Neural Information Processing Systems, 36:9751–9788

  9. [9]

    Satwik Bhattamishra, Kabir Ahuja, and Navin Goyal. 2020. On the ability and limitations of transformers to recognize formal languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online. Association for Computational Linguistics

  10. [10]

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, and 1 others. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR

  11. [11]

    Nadav Borenstein, Anej Svete, Robin Chan, Josef Valvoda, Franz Nowak, Isabelle Augenstein, Eleanor Chodroff, and Ryan Cotterell. 2024. What languages are easy to language-model? a perspective from learning probabilistic regular languages. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

  12. [12]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901

  13. [13]

    Nick Chater and Christopher D Manning. 2006. Probabilistic models of language processing and acquisition. Trends in Cognitive Sciences, 10(7):335–344

  14. [14]

    Wentong Chen, Yankai Lin, ZhenHao Zhou, HongYun Huang, YanTao Jia, Zhao Cao, and Ji-Rong Wen. 2025. https://aclanthology.org/2025.coling-main.693/ ICLEval: Evaluating in-context learning ability of large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10398–10422, Abu Dhabi, UAE. Association for ...

  15. [15]

    Ta-Chung Chi, Ting-Han Fan, Alexander I Rudnicky, and Peter J Ramadge. 2023. Transformer working memory enables regular language reasoning and natural language length extrapolation. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore. Association for Computational Linguistics

  16. [16]

    Noam Chomsky. 1956. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124

  17. [17]

    Michael Collins. 2013. Probabilistic context-free grammars (PCFGs)

  18. [18]

    Ryan Cotterell, Sabrina J Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana. Association for Computational Linguistics

  19. [19]

    Colin de la Higuera, James Scicluna, and Mark-Jan Nederhof. 2014. On the computation of distances for probabilistic context-free grammars. arXiv preprint arXiv:1407.1513

  20. [20]

    Grégoire Delétang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, and 1 others. 2023. Neural networks and the Chomsky hierarchy. In The Eleventh International Conference on Learning Representations

  21. [21]

    Ricardo Dominguez-Olmedo, Florian E Dorner, and Moritz Hardt. 2025. Training on the test task confounds evaluation and emergence. In The Thirteenth International Conference on Learning Representations

  22. [22]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783

  23. [23]

    Himanshu Gupta, Saurabh Arjun Sawant, Swaroop Mishra, Mutsumi Nakamura, Arindam Mitra, Santosh Mashetty, and Chitta Baral. 2023. Instruction tuned models are quick learners. arXiv preprint arXiv:2306.05539

  24. [24]

    Michael Hahn. 2020. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156–171

  25. [25]

    Michael Hahn and Mark Rofin. 2024. Why are sensitive functions hard for transformers? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand. Association for Computational Linguistics

  26. [26]

    Mark Hopkins. 2022. Towards more natural artificial languages. In Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), pages 85–94

  27. [27]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. https://openreview.net/forum?id=nZeVKeeFYf9 LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations

  28. [28]

    Shengding Hu, Yuge Tu, Xu Han, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Xinrong Zhang, Zhen Leng Thai, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, and 5 others. 2024. https://openreview.net/forum?id=3X2L2TFr0f MiniCPM: Unveiling the potential of small language models wit...

  29. [29]

    Thomas F Icard. 2020. Calibrating generative models: The probabilistic Chomsky–Schützenberger hierarchy. Journal of Mathematical Psychology, 95:102308

  30. [30]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. https://arxiv.org/abs/2310.0...

  31. [31]

    Jaap Jumelet and Willem Zuidema. 2023. Transparency at the source: Evaluating and interpreting language models with access to the true distribution. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore. Association for Computational Linguistics

  32. [32]

    Julie Kallini, Isabel Papadimitriou, Richard Futrell, Kyle Mahowald, and Christopher Potts. 2024. Mission: Impossible language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand. Association for Computational Linguistics

  33. [33]

    Masahiro Kaneko, Danushka Bollegala, and Timothy Baldwin. 2025. The gaps between fine tuning and in-context learning in bias evaluation and debiasing. In Proceedings of the 31st International Conference on Computational Linguistics, pages 2758--2764

  34. [34]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361

  35. [35]

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. T...

  36. [36]

    Sander Land and Max Bartolo. 2024. Fishing for Magikarp: Automatically detecting under-trained tokens in large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA. Association for Computational Linguistics

  37. [37]

    Teven Le Scao and Alexander M Rush. 2021. How many data points is a prompt worth? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2627--2636

  38. [38]

    Eric Lehman, Evan Hernandez, Diwakar Mahajan, Jonas Wulff, Micah J Smith, Zachary Ziegler, Daniel Nadler, Peter Szolovits, Alistair Johnson, and Emily Alsentzer. 2023. Do we still need clinical language models? In Conference on health, inference, and learning, pages 578--597. PMLR

  39. [39]

    Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics

  40. [40]

    Ziqian Lin and Kangwook Lee. 2024. https://openreview.net/forum?id=5H4nJIGqmK Dual operating modes of in-context learning. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models

  41. [41]

    Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. 2023. Transformers learn shortcuts to automata. In The Eleventh International Conference on Learning Representations

  42. [42]

    Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. 2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950--1965

  43. [43]

    Christopher D Manning. 2003. Probabilistic syntax. Probabilistic linguistics, pages 289--341

  44. [44]

    William Merrill. 2023. Formal languages and the NLP black box. In International Conference on Developments in Language Theory, pages 1--8. Springer

  45. [45]

    Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, and 88 others. 2024. Gemma: Open models based on ...

  46. [46]

    Sabrina J Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. Association for Computational Linguistics

  47. [47]

    Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow, and Yanai Elazar. 2023. https://doi.org/10.18653/v1/2023.findings-acl.779 Few-shot fine-tuning vs. in-context learning: A fair comparison and evaluation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 12284--12314, Toronto, Canada. Association for Computation...

  48. [48]

    Shikhar Murty, Pratyusha Sharma, Jacob Andreas, and Christopher D Manning. 2023. Characterizing intrinsic compositionality in transformers with tree projections. In The Eleventh International Conference on Learning Representations

  49. [49]

    Michael Oliver and Guan Wang. 2024. Crafting efficient fine-tuning strategies for large language models. arXiv preprint arXiv:2407.13906

  50. [50]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730--27744

  51. [51]

    Jane Pan, Tianyu Gao, Howard Chen, and Danqi Chen. 2023. https://doi.org/10.18653/v1/2023.findings-acl.527 What in-context learning learns in-context: Disentangling task recognition and task learning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8298--8319, Toronto, Canada. Association for Computational Linguistics

  52. [52]

    Isabel Papadimitriou and Dan Jurafsky. 2023. Injecting structural hints: Using language models to study inductive biases in language learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore. Association for Computational Linguistics

  53. [53]

    Branislav Pecher, Ivan Srba, and Maria Bielikova. 2025. Comparing specialised small and general large language models on text classification: 100 labelled samples to achieve break-even performance. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 165--184

  54. [54]

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners

  55. [55]

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. https://doi.org/10.18653/v1/D16-1264 SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383--2392, Austin, Texas. Association for Computational Linguistics

  56. [56]

    Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019. Studying the inductive biases of RNNs with synthetic variations of natural languages. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota. Association ...

  57. [57]

    Amirhossein Razavi, Mina Soltangheis, Negar Arabzadeh, Sara Salamat, Morteza Zihayat, and Ebrahim Bagheri. 2025. Benchmarking prompt sensitivity in large language models. In European Conference on Information Retrieval, pages 303--313. Springer

  58. [58]

    Gautam Reddy. 2024. The mechanistic basis of data dependence and abrupt learning in an in-context classification task. In The Twelfth International Conference on Learning Representations

  59. [59]

    Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, and 178 others. 2024. Gemma 2: Improving open language mode...

  60. [60]

    Lingfeng Shen, Aayush Mishra, and Daniel Khashabi. 2023. Do pretrained transformers really learn in-context by gradient descent? arXiv preprint arXiv:2310.08540

  61. [61]

    Hui Shi, Sicun Gao, Yuandong Tian, Xinyun Chen, and Jishen Zhao. 2022. Learning bounded context-free-grammar via LSTM and the transformer: difference and the explanations. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pages 8267--8276

  62. [62]

    Heydar Soudani, Evangelos Kanoulas, and Faegheh Hasibi. 2024. Fine tuning vs. retrieval augmented generation for less popular knowledge. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 12--22

  63. [63]

    Krishna Prasad Varadarajan Srinivasan, Prasanth Gumpena, Madhusudhana Yattapu, and Vishal H Brahmbhatt. 2024. Comparative analysis of different efficient fine tuning methods of large language models (LLMs) in low-resource setting. arXiv preprint arXiv:2405.13181

  64. [64]

    Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. 2023. Transformers as recognizers of formal languages: A survey on expressivity. arXiv preprint arXiv:2311.00208

  65. [65]

    Hongjin Su, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and 1 others. 2023. Selective annotation makes language models better few-shot learners. In The Eleventh International Conference on Learning Representations

  66. [66]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, and 1 others. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  67. [67]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

  68. [68]

    Shunjie Wang. 2021. Evaluating transformer’s ability to learn mildly context-sensitive languages. University of Washington

  69. [69]

    Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and 1 others. 2023. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846

  70. [70]

    Jennifer C White and Ryan Cotterell. 2021. Examining the inductive bias of neural language models with artificial languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online. Association for Computational Linguistics

  71. [71]

    Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. http://aclweb.org/anthology/N18-1101 A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112--...

  72. [72]

    Qinyuan Wu, Mohammad Aflah Khan, Soumi Das, Vedant Nanda, Bishwamittra Ghosh, Camila Kolling, Till Speicher, Laurent Bindschaedler, Krishna Gummadi, and Evimaria Terzi. 2025. Towards reliable latent knowledge estimation in llms: Zero-prompt many-shot based factual knowledge extraction. In Proceedings of the Eighteenth ACM International Conference on Web S...

  73. [73]

    Cheng Xu, Shuhao Guan, Derek Greene, M Kechadi, and 1 others. 2024. Benchmark data contamination of large language models: A survey. arXiv preprint arXiv:2406.04244

  74. [74]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, and 1 others. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115

  75. [75]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. https://doi.org/10.18653/v1/D18-1259 HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369--2380, Brussels...

  76. [76]

    Qingyu Yin, Xuzheng He, Chak Tou Leong, Fan Wang, Yanzhao Yan, Xiaoyu Shen, and Qiang Zhang. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.239 Deeper insights without updates: The power of in-context learning over fine-tuning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4138--4151, Miami, Florida, USA. Associat...

  77. [77]

    Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. 2024. https://openreview.net/forum?id=5HCnKDeTws When scaling meets LLM finetuning: The effect of data, model and finetuning method. In The Twelfth International Conference on Learning Representations

  78. [78]

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, and 1 others. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068

  79. [79]

    Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International conference on machine learning, pages 12697--12706. PMLR

  80. [80]

    Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, and Kai Chen. 2024. ProSA: Assessing and understanding the prompt sensitivity of LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1950--1976