Internal Knowledge Without External Expression: Probing the Generalization Boundary of a Classical Chinese Language Model
Pith reviewed 2026-05-14 00:14 UTC · model grok-4.3
The pith
A Classical Chinese language model encodes factual distinctions internally but fails to express uncertainty in its generated text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A 318M-parameter Transformer language model trained on 1.56 billion tokens of pure Classical Chinese exhibits a 2.39x perplexity increase on fabricated historical events compared to real ones, demonstrating internal factual encoding, yet applies epistemic markers at lower rates to out-of-distribution questions, showing that uncertainty expression is controlled by training data conventions rather than internal knowledge.
What carries the argument
The dissociation between internal perplexity-based detection of factual uncertainty and the lack of corresponding external expression through epistemic markers.
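As a rough illustration of the internal signal described above, the sketch below computes per-item perplexity from token log-probabilities and then the ratio of group means. The function names and toy values are hypothetical. Averaging per-item perplexities before dividing is an assumption, and the paper may aggregate differently.

```python
import math
from statistics import fmean


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-fmean(token_logprobs))


def perplexity_ratio(fabricated, real):
    """Mean per-item perplexity of the fabricated set over that of the real set."""
    mean_fabricated = fmean(perplexity(lp) for lp in fabricated)
    mean_real = fmean(perplexity(lp) for lp in real)
    return mean_fabricated / mean_real


# Toy log-probabilities (hypothetical, not taken from the paper).
real_items = [[-1.0, -1.0], [-1.0, -1.0, -1.0]]
fabricated_items = [[-2.0, -2.0], [-2.0, -2.0, -2.0]]
ratio = perplexity_ratio(fabricated_items, real_items)
print(f"perplexity ratio: {ratio:.2f}x")
```

A ratio above 1 means the model assigns lower probability per token to the fabricated items, which is the quantity the 2.39x figure summarizes.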
If this is right
- Metacognitive expression requires explicit training signals such as RLHF.
- Uncertainty expression is determined entirely by training data conventions.
- The dissociation between internal and external uncertainty holds across languages, writing systems, and model sizes from 110M to 1.56B parameters.
- Semi-fabricated events combining real figures with fictional events produce the highest perplexity.
Where Pith is reading between the lines
- AI models that express uncertainty appropriately have likely acquired this behavior through post-training alignment rather than base language modeling.
- Testing factual encoding with historical facts offers a method applicable to other domains with clear truth values.
- Cultural patterns in training data can produce counterintuitive behaviors such as hedging more on known topics.
- Without targeted training for metacognition, models may generate confident but incorrect responses on unknown inputs.
Load-bearing premise
Differences in perplexity between real and fabricated events stem from the model's encoding of historical facts rather than from variations in writing style or syntax in the test items.
What would settle it
If the perplexity gap between real and fabricated historical events vanishes after those events are rewritten to share identical syntactic and stylistic features.
Original abstract
We train a 318M-parameter Transformer language model from scratch on a curated corpus of 1.56 billion tokens of pure Classical Chinese, with zero English characters or Arabic numerals. Through systematic out-of-distribution (OOD) testing, we investigate whether the model can distinguish known from unknown inputs, and crucially, whether it can express this distinction in its generated text. We find a clear dissociation between internal and external uncertainty. Internally, the model exhibits a perplexity jump ratio of 2.39x between real and fabricated historical events (p = 8.9e-11, n = 92 per group), with semi-fabricated events (real figures + fictional events) showing the highest perplexity (4.24x, p = 1.1e-16), demonstrating genuine factual encoding beyond syntactic pattern matching. Externally, however, the model never learns to express uncertainty: classical Chinese epistemic markers appear at lower rates for OOD questions (3.5%) than for in-distribution questions (8.3%, p = 0.023), reflecting rhetorical conventions rather than genuine metacognition. We replicate both findings across three languages (Classical Chinese, English, Japanese), three writing systems, and eight models from 110M to 1.56B parameters. We further show that uncertainty expression frequency is determined entirely by training data conventions, with Classical Chinese models showing a "humility paradox" (more hedging for known topics), while Japanese models almost never hedge. We argue that metacognitive expression -- the ability to say "I don't know" -- does not emerge from language modeling alone and requires explicit training signals such as RLHF.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper trains a 318M-parameter Transformer language model from scratch on 1.56 billion tokens of pure Classical Chinese and conducts OOD testing to probe internal vs. external uncertainty. It reports a 2.39x perplexity increase on fabricated historical events (p=8.9e-11) and 4.24x on semi-fabricated events, interpreted as evidence of genuine factual encoding, while epistemic marker usage is lower for OOD inputs (3.5% vs 8.3%), indicating that metacognitive expression does not emerge from language modeling alone and requires explicit signals such as RLHF. Findings are replicated across three languages, three writing systems, and eight model sizes.
Significance. If the central dissociation survives controls for surface confounds, the result would strengthen the claim that uncertainty expression is not an emergent property of next-token prediction and instead depends on post-training signals. The multi-language and multi-scale replication is a strength, as is the focus on a non-English, non-Latin corpus that reduces English-centric artifacts.
major comments (2)
- [Methods] The construction of the fabricated and semi-fabricated event sets is not described in sufficient detail to rule out surface confounds. Altering real historical descriptions can systematically change n-gram frequencies, rhetorical style, or syntactic complexity in ways that are absent from the 1.56B-token training corpus; without explicit matching on lexical frequency, embedding distance, or length, the reported 2.39x and 4.24x perplexity ratios cannot be unambiguously attributed to factual encoding rather than distributional mismatch.
- [Results] No ablation or control experiments are reported that isolate factual content from stylistic or n-gram effects. For example, the paper does not compare perplexity on real events rewritten in the same style as the fabricated items, nor does it provide baseline perplexity on length- and complexity-matched nonsense strings; such controls are load-bearing for the claim that the perplexity jump demonstrates internal knowledge distinct from pattern matching.
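The surface matching these comments call for could be checked along the following lines. This is a minimal sketch, assuming character-level tokens (punctuation included) and a small in-memory reference sample standing in for the training corpus. The function names are hypothetical, and the two example sentences are the ones quoted in the paper's test design.

```python
from statistics import fmean


def char_ngrams(text, n):
    """All contiguous character n-grams in a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def mean_length(items):
    """Mean item length in characters."""
    return fmean(len(s) for s in items)


def type_token_ratio(items):
    """Distinct characters divided by total characters across all items."""
    chars = [c for s in items for c in s]
    return len(set(chars)) / len(chars)


def ngram_overlap(items, reference, n=2):
    """Fraction of n-gram tokens in `items` that also occur in `reference`."""
    ref = {g for s in reference for g in char_ngrams(s, n)}
    grams = [g for s in items for g in char_ngrams(s, n)]
    return sum(g in ref for g in grams) / len(grams) if grams else 0.0


real = ["汉武帝元狩二年,霍去病出陇西"]
fabricated = ["太宗贞观二十年,命李靖征伐大食国"]
reference_sample = real  # stand-in for a sample of the training corpus (assumption)

for name, group in (("real", real), ("fabricated", fabricated)):
    print(name, mean_length(group), type_token_ratio(group),
          ngram_overlap(group, reference_sample))
```

Large gaps between the groups on any of these statistics would point to a distributional mismatch that could inflate the perplexity ratio independently of factual content.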
minor comments (3)
- [Abstract] The exact statistical test underlying the reported p-values (8.9e-11, 1.1e-16, 0.023) is not stated; specify whether a t-test, Wilcoxon test, or permutation test was used and whether multiple-comparison correction was applied.
- [Methods] The sample size n=92 per group and the selection criteria for the real, fabricated, and semi-fabricated items should be detailed, including how events were balanced for topic and historical period.
- [Figures] Figure captions and axis labels should explicitly state the units and normalization used for the perplexity ratios so that readers can directly compare the 2.39x and 4.24x values across model scales.
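To make the first minor comment concrete, one of the candidate tests it names is a two-sided permutation test on the difference in group means. The sketch below is one possible form, not the test the paper used (which is not stated). The function name, resample count, and seed are assumptions.

```python
import random
from statistics import fmean


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(fmean(a) - fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabelling of the pooled items
        diff = abs(fmean(pooled[:len(a)]) - fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Toy per-item perplexities (hypothetical, not taken from the paper).
group_real = [float(x) for x in range(10)]
group_fabricated = [float(x) for x in range(10, 20)]
print(permutation_test(group_fabricated, group_real))
```

With 10,000 resamples the smallest attainable p-value is about 1e-4, so a test of this form at this resample count cannot resolve values as small as those reported. That is one more reason the test should be named explicitly.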
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional methodological details and control experiments as requested.
- Referee: [Methods] The construction of the fabricated and semi-fabricated event sets is not described in sufficient detail to rule out surface confounds. Altering real historical descriptions can systematically change n-gram frequencies, rhetorical style, or syntactic complexity in ways that are absent from the 1.56B-token training corpus; without explicit matching on lexical frequency, embedding distance, or length, the reported 2.39x and 4.24x perplexity ratios cannot be unambiguously attributed to factual encoding rather than distributional mismatch.
  Authors: We agree that the Methods section requires expansion to fully document the event set construction. In the revision we will add a subsection detailing the generation protocol for fabricated events (systematic replacement of verifiable historical facts while preserving sentence length and syntactic templates) and semi-fabricated events (retention of attested historical agents paired with invented predicates). We will also report quantitative matching statistics: mean sentence length, type-token ratio, average n-gram overlap with the training corpus, and cosine distance of sentence embeddings between real, fabricated, and semi-fabricated sets. These additions will allow readers to assess residual distributional differences.
  Revision: yes
- Referee: [Results] No ablation or control experiments are reported that isolate factual content from stylistic or n-gram effects. For example, the paper does not compare perplexity on real events rewritten in the same style as the fabricated items, nor does it provide baseline perplexity on length- and complexity-matched nonsense strings; such controls are load-bearing for the claim that the perplexity jump demonstrates internal knowledge distinct from pattern matching.
  Authors: We accept that explicit ablations are necessary to strengthen the attribution to factual encoding. The revised manuscript will include two new control conditions: (1) perplexity measured on real historical events that have been manually paraphrased to match the lexical and rhetorical profile of the fabricated set, and (2) perplexity on length- and complexity-matched nonsense strings generated by shuffling content words while preserving function-word scaffolding. These results will be reported in a new subsection of Results and will be used to quantify the residual contribution of surface features. We expect the factual-content effect to remain after these controls, but we will present the data transparently.
  Revision: yes
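The second control condition in the rebuttal (shuffling content words while preserving the function-word scaffolding) might look roughly like the sketch below. It is illustrative only: it shuffles at the character level rather than over segmented words, and the `FUNCTION_CHARS` set is a small hypothetical list of common Classical Chinese particles plus punctuation, not the authors' inventory.

```python
import random

# Hypothetical scaffolding set: a few common particles and punctuation marks.
FUNCTION_CHARS = frozenset("之乎者也而以其矣焉,。")


def shuffle_content(sentence, function_chars=FUNCTION_CHARS, seed=0):
    """Shuffle non-function characters; function characters keep their positions."""
    rng = random.Random(seed)
    chars = list(sentence)
    slots = [i for i, c in enumerate(chars) if c not in function_chars]
    content = [chars[i] for i in slots]
    rng.shuffle(content)
    for i, c in zip(slots, content):
        chars[i] = c
    return "".join(chars)


# Example sentence quoted in the paper's test design.
print(shuffle_content("汉武帝元狩二年,霍去病出陇西"))
```

The output has the same length and the same character inventory as the input, so any perplexity difference relative to the original reflects ordering rather than length or vocabulary.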
Circularity Check
No circularity: empirical measurements of perplexity and marker frequency are independent of any fitted inputs or self-citations
The paper presents purely empirical results: a 318M-parameter model is trained on a fixed 1.56B-token corpus, then evaluated via perplexity ratios (2.39x real vs fabricated, 4.24x semi-fabricated) and epistemic marker frequencies (3.5% OOD vs 8.3% in-distribution) with reported p-values and replications across languages/models. No equations, derivations, or parameter-fitting steps are described that would reduce any claimed prediction back to the same data by construction. The central dissociation between internal uncertainty (perplexity) and external expression (markers) rests on direct statistical contrasts rather than any self-definitional loop or load-bearing self-citation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Perplexity differences between real and fabricated historical events reflect factual encoding rather than surface-level statistical patterns.
- domain assumption: Frequency of classical Chinese epistemic markers in generated text is a valid proxy for the model's ability to express uncertainty.
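The second assumption rests on a keyword-based count. A minimal sketch of such a count is given below, assuming a generation is scored as hedged if it contains any marker. The marker "臣愚不知" is quoted in the paper, while the other two markers, the function name, and the toy generations are hypothetical placeholders.

```python
# First marker is quoted in the paper; the remaining two are hypothetical placeholders.
EPISTEMIC_MARKERS = ("臣愚不知", "未详", "不可考")


def marker_rate(generations, markers=EPISTEMIC_MARKERS):
    """Fraction of generations containing at least one epistemic marker."""
    if not generations:
        raise ValueError("need at least one generation")
    hits = sum(any(m in g for m in markers) for g in generations)
    return hits / len(generations)


# Toy generations (hypothetical, not model output).
in_distribution = ["臣愚不知", "上从之", "遂行", "乃还"]
out_of_distribution = ["上从之", "遂行", "乃还", "大破之"]
print(marker_rate(in_distribution), marker_rate(out_of_distribution))
```

Substring matching of this kind cannot tell a genuine hedge from a formulaic courtesy phrase, which is why the validity of the proxy is listed as an assumption.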
Reference graph
Works this paper leans on
-
[1]
Introduction
Large language models (LLMs) have demonstrated remarkable abilities across diverse tasks, yet their tendency to generate plausible-sounding but factually incorrect text, commonly termed "hallucination," remains a fundamental challenge. Modern LLMs such as GPT-4 and Qwen can express uncertainty with phrases like "I'm not sure about this," but th...
-
[2]
Related Work
Hallucination in LLMs
Hallucination in language models has been extensively studied (Ji et al., 2023; Huang et al., 2023). Prior work distinguishes between "faithfulness hallucinations" (contradicting source material) and "factuality hallucinations" (contradicting world knowledge). Our work focuses on the latter, specifically testing whether ...
-
[3]
Experimental Setup
3.1 Training Data
We compile a corpus of 1.56 billion tokens from publicly available Classical Chinese texts:
• 殆知阁古代文献 (Daizhige): 15,687 texts across 10 categories including histories (史藏, 1,376 MB), Confucian classics (儒藏, 394 MB), Buddhist sutras (佛藏, 618 MB), Daoist texts (道藏, 128 MB), medical texts (医藏, 315 MB), and others. Total:...
-
[4]
Test Design
We design six test categories spanning a spectrum from fully in-distribution to fully out-of-distribution:
Test | Category | OOD Type
Test 1 | Classical Chinese prompts | In-distribution (baseline)
Test 2 | English text | Token-level OOD
Test 3 | Mathematical symbols | Token-level OOD
Test 4 | Modern concepts in classical style | Semantic OOD
Test 5 | Fabricated hi...
-
[5]
1. Real events: Verifiable historical events, e.g., "汉武帝元狩二年,霍去病出陇西" (Emperor Wu of Han, 2nd year of Yuanshou, Huo Qubing marched from Longxi)
2. Fabricated events: Plausible but fictional events using real or fictional dates, e.g., "太宗贞观二十年,命李靖征伐大食国" (Emperor Taizong, 20th year of Zhenguan, ordered Li Jing to conquer the Arab Empire)
3. Semi-fabricated even...
-
[6]
"臣愚不知" ("This foolish minister does not know")
Results
5.1 Perplexity Gradient Across OOD Categories
The model exhibits a clear four-level perplexity hierarchy:
Category | Mean PPL | Relative to in-distribution
In-distribution | 229 | 1.0×
OOD Knowledge (modern concepts) | 570 | 2.5×
Fabricated History | 29 | 0.13×
English | 28,129 | 123×
Mathematics | 45,377 | 198×
Mixed (classical + English) | 41,313 | 180×
Note that fabricated histor...
-
[7]
Discussion
6.1 Creativity and Hallucination as Indistinguishable Processes
Our results show that the model generates fabricated historical narratives with the same fluency, entropy, and confidence as real historical content. When prompted with "汉武帝元狩六年,张骞自天竺归,献飞行之术" (Zhang Qian returned from India and presented the art of flight), the model produces: "武帝大...
-
[8]
Limitations
• Model scale range: While GPT-2 experiments cover 124M to 1.56B parameters, larger models (7B+) might exhibit qualitatively different behavior. However, the consistent trend across our tested range (no emergence of metacognitive expression at any scale) suggests this is unlikely.
• Keyword-based uncertainty detection: Our epistemic marker cou...
-
[9]
Conclusion
We present a controlled experiment demonstrating that autoregressive language modeling produces internal knowledge without external expression. A 318M-parameter model trained on 1.56 billion tokens of Classical Chinese:
1. Internally distinguishes real from fabricated history (PPL ratio 2.39×, p = 8.9×10⁻¹¹, n = 92), with semi-fabricated events...
-
[10]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
The frequency and direction of uncertainty expressions are entirely determined by training data conventions: Classical Chinese models show a "humility paradox" (more hedging for known topics), English models show no difference, and Japanese models almost never hedge. None reflect actual epistemic states. These findings support the view that metacognitive ...