Categorical Prior Lock-in: Why In-Context Learning Fails for Structured Data

Alberto Trombetta; Antonio Pelusi; Stefano Braghin

arxiv: 2606.11961 · v1 · pith:G3JURJB5new · submitted 2026-06-10 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-27 10:12 UTC · model grok-4.3

keywords: in-context learning, categorical prior lock-in, structured data generation, tabular data, distribution mismatch, large language models, parameter-efficient fine-tuning, LoRA

The pith

In-context learning cannot update categorical priors in LLMs for structured data generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests in-context learning as a way to make LLMs generate structured data like high-cardinality tables drawn from a new distribution. It finds that extra examples steadily improve how well the model matches numerical values, yet categorical features reach a hard limit where rare classes never appear. The authors name this limit categorical prior lock-in and trace it to the token probabilities the model acquired during pre-training. The finding matters because many practical uses of LLMs treat them as on-the-fly generators that should adapt without retraining. The work also shows that low-rank adaptation can break the lock-in but brings its own costs in memorization and output stability.

Core claim

Across two 7B open-weight models, in-context learning improves numerical fidelity with additional examples yet exhibits a sharp ceiling on categorical distributions, failing to reproduce rare classes entirely; the authors identify this behavior as categorical prior lock-in, the inability of ICL to update the model's prior over token distributions inherited from pre-training.

What carries the argument

Categorical prior lock-in: the structural inability of in-context learning to revise the model's inherited prior over categorical token distributions.

If this is right

ICL steadily raises numerical accuracy but plateaus on categorical reproduction regardless of example count.
LoRA fine-tuning removes the categorical ceiling yet introduces measurable memorization of training rows.
In some settings LoRA destabilizes the model's ability to produce valid structured output formats.
A fundamental trade-off exists between distribution adaptability and privacy preservation when moving from ICL to fine-tuning.
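
The memorization and format-stability points above could be checked with simple counts. The following is a minimal sketch, assuming rows are flat dictionaries and the model emits one JSON object per row. Neither the paper's memorization metric nor its output format is stated on this page, so `exact_match_rate`, `valid_output_rate`, and the field names are hypothetical.

```python
import json


def exact_match_rate(training_rows, generated_rows):
    """Share of generated rows that duplicate some training row exactly."""
    if not generated_rows:
        return 0.0
    # Sort items so that key order does not affect the comparison.
    train_set = {tuple(sorted(row.items())) for row in training_rows}
    hits = sum(tuple(sorted(row.items())) in train_set for row in generated_rows)
    return hits / len(generated_rows)


def valid_output_rate(raw_outputs, required_fields):
    """Share of raw outputs that parse as a JSON object with every required field."""
    if not raw_outputs:
        return 0.0
    valid = 0
    for text in raw_outputs:
        try:
            record = json.loads(text)
        except json.JSONDecodeError:
            continue  # malformed output counts as invalid
        if isinstance(record, dict) and set(required_fields).issubset(record):
            valid += 1
    return valid / len(raw_outputs)


# Hypothetical rows, for illustration only.
training = [{"job": "nurse", "age": 41}, {"job": "farrier", "age": 58}]
generated = [{"age": 58, "job": "farrier"}, {"job": "nurse", "age": 30}]
raw = ['{"job": "nurse", "age": 30}', '{"job": "nurse"', '["nurse", 30]']

print(exact_match_rate(training, generated))
print(valid_output_rate(raw, {"job", "age"}))
```

Exact matching is the strictest notion of memorization; a near-duplicate measure would flag more rows, and which one the authors use is not known from this page.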

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prior-lock mechanism may limit ICL on other discrete structured outputs such as graphs or code with constrained vocabularies.
Prompt-only techniques are unlikely to overcome the lock-in because they leave the underlying token prior untouched.
The ceiling observed on 7B models may shift or disappear at substantially larger scales or with different pre-training mixtures.
Low-cardinality categorical features might still be adaptable under ICL while high-cardinality ones remain locked.

Load-bearing premise

The inability to update categorical priors is a structural property of in-context learning rather than an artifact of prompt format, model scale, or the specific high-cardinality tabular test case.

What would settle it

An experiment in which increasing the number of in-context examples allows the model to match the full empirical categorical distribution, including every rare class, on held-out high-cardinality tabular data would falsify the lock-in claim.
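
To make the falsification test concrete, the sketch below computes two quantities such an experiment would need: total variation distance (TVD) between the real and generated category frequencies, and the share of rare real classes that show up at all in the generated sample. TVD follows the standard definition; whether the paper computes it exactly this way, and the 0.05 rarity cutoff, are assumptions. The example values are invented.

```python
from collections import Counter


def category_frequencies(values):
    """Return a dict mapping each category to its relative frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def total_variation_distance(real_values, generated_values):
    """TVD = 0.5 * sum over categories of |p_real - p_generated|."""
    p = category_frequencies(real_values)
    q = category_frequencies(generated_values)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in support)


def rare_class_coverage(real_values, generated_values, rare_threshold=0.05):
    """Fraction of rare real classes (frequency below rare_threshold)
    that appear at least once in the generated sample."""
    p = category_frequencies(real_values)
    rare = {c for c, freq in p.items() if freq < rare_threshold}
    if not rare:
        return 1.0  # nothing rare to cover
    return len(rare & set(generated_values)) / len(rare)


# Hypothetical column: two common classes and two rare ones.
real = ["nurse"] * 60 + ["teacher"] * 36 + ["luthier"] * 2 + ["farrier"] * 2
generated = ["nurse"] * 70 + ["teacher"] * 30

print(total_variation_distance(real, generated))
print(rare_class_coverage(real, generated))
```

In this toy case TVD is roughly 0.10 while rare-class coverage is zero, which shows why the two numbers should be reported together: a moderate TVD can hide the complete absence of rare classes.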

Figures

Figures reproduced from arXiv: 2606.11961 by Alberto Trombetta, Antonio Pelusi, Stefano Braghin.

**Figure 1.** Generation pipeline. Models. Qwen2.5-7B-Instruct [7] and Mistral-7B-Instruct-v0.3 [5] are open-weight, instruction-tuned decoder-only transformers selected for local deployment under data residency constraints. Their differences in training data and methodology allow us to assess whether observed behaviors are model-specific or general to this scale. Generation Strategies. All configurations use a schema-e…

**Figure 2.** TVD vs. cardinality. … Zipf α = 0.5 frequency structure that cannot be approximated from ten or fewer examples. Without weight updates [2], ICL can bias generation toward in-context values but cannot reshape the prior over the full label vocabulary. LoRA fine-tuning on Qwen2.5-7B reduces job TVD to 0.1551 at 10% exposure and 0.1430 at 50%, still above the 0.10 threshold, but a substantial improvement over an…

**Figure 3.** Geographic distribution across generation strategies (Qwen2.5-7B)

**Figure 4.** Geographic distance across generation strategies (Qwen2.5-7B)

**Figure 5.** Correlation heatmap across generation strategies (Qwen2.5-7B)
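
The Figure 2 caption says a Zipf α = 0.5 frequency structure cannot be approximated from ten or fewer examples. A rough sketch of why, assuming category probabilities proportional to 1 / rank^α and a hypothetical vocabulary of 100 categories (the paper's actual cardinality is not given on this page): any distribution supported on at most k categories must sit at least as far from the target, in TVD, as the probability mass outside the k most frequent categories. This bounds what the examples themselves can reveal and says nothing about how a model uses them.

```python
import random


def zipf_probabilities(num_categories, alpha):
    """Normalized Zipf weights: p_i proportional to 1 / i**alpha."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_categories + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def tvd_lower_bound(probs, num_examples):
    """Any distribution supported on at most num_examples categories puts
    zero mass on the rest, so its TVD to probs is at least the mass outside
    the num_examples most frequent categories."""
    top_mass = sum(sorted(probs, reverse=True)[:num_examples])
    return 1.0 - top_mass


def distinct_categories_seen(probs, num_examples, seed=0):
    """Draw num_examples i.i.d. categories and count the distinct ones."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(probs)), weights=probs, k=num_examples)
    return len(set(draws))


# num_categories=100 is a hypothetical cardinality chosen for illustration.
probs = zipf_probabilities(num_categories=100, alpha=0.5)
print(tvd_lower_bound(probs, num_examples=10))
print(distinct_categories_seen(probs, num_examples=10))
```

With a flat exponent such as 0.5 the tail carries most of the mass, so under these assumptions the bound stays well above the 0.10 threshold quoted in the caption no matter which ten examples are shown.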

Original abstract

Large language models (LLMs) are increasingly used as conditional generators for structured data, relying on in-context learning (ICL) to adapt to new distributions without parameter updates. We investigate the limits of ICL for structured generation under distribution mismatch, using high-cardinality tabular data as a controlled test case, and identify a structural failure mode we term categorical prior lock-in: the inability of ICL to update the model's prior over token distributions inherited from pre-training. Across two 7B-parameter open-weight models, ICL improves numerical fidelity with additional examples but exhibits a sharp ceiling on categorical distributions, failing to reproduce rare classes entirely. Parameter-efficient fine-tuning (LoRA) overcomes these limitations but introduces measurable memorization risk and, in some cases, destabilizes structured output generation, highlighting a fundamental trade-off between adaptability and privacy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ICL on 7B models matches numbers but locks out rare categories in high-cardinality tabular data, with the structural claim still needing controls on scale and format.

The paper's main finding is that in-context examples improve numerical fidelity on high-cardinality tabular data but leave categorical distributions stuck at the pre-training prior, so rare classes never appear. They document this on two 7B models and show LoRA can shift the categories, though it adds memorization risk and can destabilize the output format.

What stands out is the isolation of this categorical ceiling as a distinct limit for structured generation tasks. The distribution-mismatch test case gives a clear empirical contrast that prior ICL work on text or simple classification did not highlight.

The soft spot is exactly the one in the stress-test note: the runs stay at 7B scale on tabular prompts with no checks on larger models, different serializations, or lower-cardinality data. Without those, the "structural" label for ICL itself rests on untested choices. The abstract also gives no numbers or stats, so the full paper needs to show effect sizes and variance before the claim lands solidly.

This is for people working on LLM-based synthetic structured data or ICL adaptation. A reader in that corner would get a practical flag worth testing. It deserves peer review because the observation is concrete enough to check and refine, even if the interpretation needs tightening.

Referee Report

2 major / 1 minor

Summary. The paper claims that in-context learning (ICL) in LLMs exhibits a structural failure mode termed 'categorical prior lock-in' when used for conditional generation of structured data under distribution mismatch. Using high-cardinality tabular data as a test case with two 7B open-weight models, it reports that ICL improves numerical fidelity with more examples but shows a sharp ceiling on categorical distributions, failing to reproduce rare classes entirely. LoRA fine-tuning overcomes the limitation but introduces memorization risk and can destabilize structured outputs, highlighting a trade-off between adaptability and privacy.

Significance. If the empirical observation holds after appropriate controls, the result would be significant for applications of LLMs as conditional generators for structured data, as it identifies a concrete limitation of ICL on categorical priors distinct from numerical adaptation and quantifies a practical trade-off with parameter-efficient fine-tuning. The use of open-weight models and focus on high-cardinality tabular data provides a reproducible starting point for studying ICL boundaries in structured generation tasks.

major comments (2)

[Abstract / Experiments] Abstract and experimental setup: The central claim that categorical prior lock-in is a 'structural' property of ICL (distinct from numerical fidelity gains) is load-bearing on the assumption that the observed ceiling on rare classes generalizes beyond the tested conditions. However, the described experiments are restricted to two 7B-parameter models on high-cardinality tabular data with no ablations on prompt serialization formats, model scale, or lower-cardinality/non-tabular structured data; this leaves the structural interpretation dependent on those unvaried choices and does not rule out artifacts of scale, format, or data cardinality.
[Abstract] Abstract: The claim of a 'sharp ceiling' and complete failure to reproduce rare classes is presented without any quantitative results, dataset cardinalities, prompt templates, number of shots, or statistical tests. This absence prevents evaluation of whether the data support the distinction between numerical improvement and categorical lock-in, making the empirical observation unevaluable from the provided summary.
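
One way the requested variance could be reported is a bootstrap interval around each TVD figure. The sketch below is an illustration under assumptions, not the authors' procedure: it uses a percentile bootstrap that resamples both the real and the generated column with replacement, and the function names, resample count, and example data are hypothetical.

```python
import random
from collections import Counter


def tvd(sample_a, sample_b):
    """Total variation distance between two empirical categorical distributions."""
    freq_a, freq_b = Counter(sample_a), Counter(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    support = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a[c] / n_a - freq_b[c] / n_b) for c in support)


def bootstrap_tvd_interval(real, generated, num_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for TVD, resampling both samples
    with replacement. Returns (lower, upper)."""
    rng = random.Random(seed)
    stats = sorted(
        tvd(rng.choices(real, k=len(real)), rng.choices(generated, k=len(generated)))
        for _ in range(num_resamples)
    )
    tail = (1.0 - level) / 2.0
    lower = stats[int(tail * (num_resamples - 1))]
    upper = stats[int((1.0 - tail) * (num_resamples - 1))]
    return lower, upper


# Hypothetical columns, for illustration only.
real = ["nurse"] * 60 + ["teacher"] * 36 + ["luthier"] * 2 + ["farrier"] * 2
generated = ["nurse"] * 70 + ["teacher"] * 30

print(tvd(real, generated))
print(bootstrap_tvd_interval(real, generated))
```

Plug-in TVD estimates tend to be biased upward on small samples, so an interval like this is better read as a measure of run-to-run spread than as an unbiased estimate.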

minor comments (1)

[Abstract] The abstract states the LoRA comparison but supplies no details on the LoRA rank, target modules, or memorization metrics used; these should be added for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope of our claims. We respond to each major comment below and indicate revisions to the manuscript.

Referee: [Abstract / Experiments] Abstract and experimental setup: The central claim that categorical prior lock-in is a 'structural' property of ICL (distinct from numerical fidelity gains) is load-bearing on the assumption that the observed ceiling on rare classes generalizes beyond the tested conditions. However, the described experiments are restricted to two 7B-parameter models on high-cardinality tabular data with no ablations on prompt serialization formats, model scale, or lower-cardinality/non-tabular structured data; this leaves the structural interpretation dependent on those unvaried choices and does not rule out artifacts of scale, format, or data cardinality.

Authors: We agree that the experiments are scoped to two 7B models and high-cardinality tabular data, and that this limits strong claims of universality. The term 'structural' in the manuscript is intended to highlight the consistent distinction between ICL's numerical adaptation and its failure on categorical priors, in contrast to LoRA's behavior within the same experimental setup, rather than to assert invariance across all scales or data types. We will revise the abstract and discussion to explicitly qualify the scope, add a limitations paragraph noting the absence of ablations on serialization formats, model scale, and non-tabular data, and avoid language implying broader generalization without further evidence.
Revision: partial
Referee: [Abstract] Abstract: The claim of a 'sharp ceiling' and complete failure to reproduce rare classes is presented without any quantitative results, dataset cardinalities, prompt templates, number of shots, or statistical tests. This absence prevents evaluation of whether the data support the distinction between numerical improvement and categorical lock-in, making the empirical observation unevaluable from the provided summary.

Authors: We accept this criticism. The abstract as written does not include the requested quantitative anchors. We will revise the abstract to report key details including the dataset cardinalities (e.g., number of categories per column), number of shots tested, the observed reproduction rates for rare classes under ICL (including the reported ceiling), and any statistical measures used to quantify the numerical vs. categorical distinction.
Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical observation with no derivation or load-bearing self-citation.

The paper reports experimental results on ICL behavior with tabular data across two 7B models. The central claim (sharp ceiling on categorical distributions despite numerical gains) is presented as a direct observation from those runs, with no equations, fitted parameters renamed as predictions, or self-citation chains invoked to justify a structural property. No load-bearing step reduces to its own inputs by construction. This matches the reader's 0.0 assessment.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on abstract; no free parameters, invented entities, or detailed axioms are stated.

axioms (1)

Domain assumption: High-cardinality tabular data constitutes a controlled test case that reveals a general structural failure of ICL for structured generation under distribution mismatch.
Invoked to justify the experimental setup and generalization of the lock-in phenomenon.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 5 internal anchors

[1] Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. 2023. Language Models are Realistic Tabular Data Generators. arXiv:2210.06280 [cs.LG] https://arxiv.org/abs/2210.06280

[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

[3] Sorouralsadat Fatemi, Yuheng Hu, and Maryam Mousavi. 2024. A Comparative Analysis of Instruction Fine-Tuning LLMs for Financial Text Classification. arXiv:2411.02476 [cs.CL] https://arxiv.org/abs/2411.02476

[4] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL] https://arxiv.org/abs/2106.09685

[5] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.06825...

[6] Jinhee Kim, Taesung Kim, and Jaegul Choo. 2025. EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models. arXiv:2404.12404 [cs.LG] https://arxiv.org/abs/2404.12404

[7] Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

[8] Nabeel Seedat, Nicolas Huynh, Boris van Breugel, and Mihaela van der Schaar. 2024. Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes. arXiv:2312.12112 [cs.LG] https://arxiv.org/abs/2312.12112

[9] Aivin V. Solatorio and Olivier Dupriez. 2023. REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers. arXiv:2302.02041 [cs.LG] https://arxiv.org/abs/2302.02041

[10] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned Language Models Are Zero-Shot Learners. arXiv:2109.01652 [cs.CL] https://arxiv.org/abs/2109.01652

[11] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. 2019. Modeling Tabular data using Conditional GAN. arXiv:1907.00503 [cs.LG] https://arxiv.org/abs/1907.00503

[12] Zilong Zhao, Robert Birke, and Lydia Chen. 2025. TabuLa: Harnessing Language Models for Tabular Data Synthesis. arXiv:2310.12746 [cs.LG] https://arxiv.org/abs/2310.12746

Manuscript submitted to ACM
