Internal Knowledge Without External Expression: Probing the Generalization Boundary of a Classical Chinese Language Model

Hao Wu; Hiroshi Sasaki; Jiuting Chen; Jongil Choi; Makoto Kouno; Tianqi Huang; Yuan Lian

arXiv: 2604.14180 · v1 · submitted 2026-03-31 · cs.CL · cs.AI

Jiuting Chen, Yuan Lian, Hao Wu, Tianqi Huang, Hiroshi Sasaki, Makoto Kouno, Jongil Choi

Pith reviewed 2026-05-14 00:14 UTC · model grok-4.3

keywords: Classical Chinese · language model · uncertainty expression · metacognition · perplexity · out-of-distribution · epistemic markers · factual encoding

The pith

A Classical Chinese language model encodes factual distinctions internally but fails to express uncertainty in its generated text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains a language model from scratch on a large corpus of Classical Chinese and tests its ability to distinguish known from unknown information both internally and in its outputs. It finds that the model shows clear internal signals of uncertainty through perplexity differences on real versus fabricated historical events. Externally, however, the model does not use linguistic markers to indicate uncertainty, instead following the conventions of its training data. This suggests that the capacity to express what one knows or does not know does not arise automatically from language modeling and may require additional training signals.

Core claim

A 318M-parameter Transformer language model trained on 1.56 billion tokens of pure Classical Chinese shows a 2.39x perplexity increase on fabricated historical events relative to real ones, which the paper reads as internal factual encoding. Yet its generated text uses epistemic markers less often for out-of-distribution questions (3.5%) than for in-distribution ones (8.3%), so uncertainty expression tracks training data conventions rather than internal knowledge.
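The perplexity jump ratio in the core claim can be made concrete with a small sketch. It assumes perplexity is the exponential of the mean negative log-likelihood per token and that the ratio compares group means; the abstract does not say whether the paper aggregates by mean, median, or per-item ratio. The helper names and the toy log-probabilities are hypothetical.

```python
import math
from statistics import mean


def perplexity(token_logprobs):
    """Perplexity of one sequence from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def jump_ratio(real_ppls, fabricated_ppls):
    """Mean perplexity of fabricated items divided by that of real items.

    Assumption: the paper's "jump ratio" may be defined differently.
    """
    return mean(fabricated_ppls) / mean(real_ppls)


# Toy, made-up log-probabilities; not data from the paper.
real_ppls = [perplexity([-1.0, -1.2, -0.8]), perplexity([-0.9, -1.1])]
fabricated_ppls = [perplexity([-2.0, -1.8, -2.2]), perplexity([-1.9, -2.1])]
ratio = jump_ratio(real_ppls, fabricated_ppls)
print(f"jump ratio: {ratio:.2f}")
```

A ratio above 1 means the fabricated items are, on average, less predictable to the model than the real ones. On its own it does not say why.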

What carries the argument

The dissociation between internal perplexity-based detection of factual uncertainty and the lack of corresponding external expression through epistemic markers.

If this is right

Metacognitive expression requires explicit training signals such as RLHF.
Uncertainty expression is determined entirely by training data conventions.
The dissociation between internal and external uncertainty holds across languages, writing systems, and model sizes from 110M to 1.56B parameters.
Semi-fabricated events combining real figures with fictional events produce the highest perplexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

AI models that express uncertainty appropriately have likely acquired this behavior through post-training alignment rather than base language modeling.
Testing factual encoding with historical facts offers a method applicable to other domains with clear truth values.
Cultural patterns in training data can produce counterintuitive behaviors such as hedging more on known topics.
Without targeted training for metacognition, models may generate confident but incorrect responses on unknown inputs.

Load-bearing premise

Differences in perplexity between real and fabricated events stem from the model's encoding of historical facts rather than from variations in writing style or syntax in the test items.

What would settle it

The factual-encoding reading would be undermined if the perplexity gap between real and fabricated historical events vanished once those events were rewritten to share identical syntactic and stylistic features.

Figures

Figures reproduced from arXiv: 2604.14180 by Hao Wu, Hiroshi Sasaki, Jiuting Chen, Jongil Choi, Makoto Kouno, Tianqi Huang, Yuan Lian.

**Figure 2.** Perplexity comparison for real, fabricated, and semi...

**Figure 3.** The central finding: internal knowledge (PPL jump ratio, red) grows steadily across training, while external uncertainty expression (blue) remains flat.

**Figure 4.** Perplexity trajectories for real, fabricated, and semi...

Original abstract

We train a 318M-parameter Transformer language model from scratch on a curated corpus of 1.56 billion tokens of pure Classical Chinese, with zero English characters or Arabic numerals. Through systematic out-of-distribution (OOD) testing, we investigate whether the model can distinguish known from unknown inputs, and crucially, whether it can express this distinction in its generated text. We find a clear dissociation between internal and external uncertainty. Internally, the model exhibits a perplexity jump ratio of 2.39x between real and fabricated historical events (p = 8.9e-11, n = 92 per group), with semi-fabricated events (real figures + fictional events) showing the highest perplexity (4.24x, p = 1.1e-16), demonstrating genuine factual encoding beyond syntactic pattern matching. Externally, however, the model never learns to express uncertainty: classical Chinese epistemic markers appear at lower rates for OOD questions (3.5%) than for in-distribution questions (8.3%, p = 0.023), reflecting rhetorical conventions rather than genuine metacognition. We replicate both findings across three languages (Classical Chinese, English, Japanese), three writing systems, and eight models from 110M to 1.56B parameters. We further show that uncertainty expression frequency is determined entirely by training data conventions, with Classical Chinese models showing a "humility paradox" (more hedging for known topics), while Japanese models almost never hedge. We argue that metacognitive expression -- the ability to say "I don't know" -- does not emerge from language modeling alone and requires explicit training signals such as RLHF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper finds an internal-external dissociation in a Classical Chinese LM, but without documented controls on the test items, the perplexity signal on fabricated events could reflect surface patterns rather than factual knowledge.

The main thing to know is that this work claims a clear split in a Classical Chinese LM: internally it registers higher perplexity on made-up history, but externally it does not use epistemic markers to signal uncertainty. The authors argue this means metacognitive expression needs explicit training such as RLHF, and they back the claim with replications in English and Japanese.

What stands out as new is the focus on a low-resource historical language, with a model trained from scratch on 1.56B tokens of pure Classical Chinese. The cross-lingual replication and the "humility paradox", where Chinese models hedge more on known topics, are solid additions to the uncertainty literature. The paper does well in setting up systematic OOD tests with real, fabricated, and semi-fabricated events, and in reporting specific figures such as the 2.39x perplexity jump with its p-value. The replication across model sizes from 110M to 1.56B parameters adds some robustness.

The soft spots are in the methods for creating the test items. There is no description of how the authors ensured that the fabricated events match the real ones in length, syntactic complexity, or lexical frequency. If the fakes simply contain rarer n-grams or a different rhetorical style, the perplexity signal could be surface matching rather than factual knowledge. The abstract reports the statistics but skips those controls, which makes the central claim harder to trust without the full paper details.

Overall, this is for people studying how LLMs handle uncertainty in specialized domains or languages. It raises a good question about what pretraining alone can achieve, but the evidence is preliminary. I'd send it to peer review to get the methods scrutinized and to see whether better controls can be added. The idea has potential if the internal part holds up.

Referee Report

2 major / 3 minor

Summary. The paper trains a 318M-parameter Transformer language model from scratch on 1.56 billion tokens of pure Classical Chinese and conducts OOD testing to probe internal vs. external uncertainty. It reports a 2.39x perplexity increase on fabricated historical events (p=8.9e-11) and 4.24x on semi-fabricated events, interpreted as evidence of genuine factual encoding, while epistemic marker usage is lower for OOD inputs (3.5% vs 8.3%), indicating that metacognitive expression does not emerge from language modeling alone and requires explicit signals such as RLHF. Findings are replicated across three languages, three writing systems, and eight model sizes.

Significance. If the central dissociation survives controls for surface confounds, the result would strengthen the claim that uncertainty expression is not an emergent property of next-token prediction and instead depends on post-training signals. The multi-language and multi-scale replication is a strength, as is the focus on a non-English, non-Latin corpus that reduces English-centric artifacts.

major comments (2)

[Methods] The construction of the fabricated and semi-fabricated event sets is not described in sufficient detail to rule out surface confounds. Altering real historical descriptions can systematically change n-gram frequencies, rhetorical style, or syntactic complexity in ways that are absent from the 1.56B-token training corpus; without explicit matching on lexical frequency, embedding distance, or length, the reported 2.39x and 4.24x perplexity ratios cannot be unambiguously attributed to factual encoding rather than distributional mismatch.
[Results] No ablation or control experiments are reported that isolate factual content from stylistic or n-gram effects. For example, the paper does not compare perplexity on real events rewritten in the same style as the fabricated items, nor does it provide baseline perplexity on length- and complexity-matched nonsense strings; such controls are load-bearing for the claim that the perplexity jump demonstrates internal knowledge distinct from pattern matching.
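The matching the major comments ask for can be sketched as a few surface statistics per item set. This is an illustration only, not the authors' protocol: it works at the character level (a simplification for Classical Chinese), `matching_stats` and its field names are hypothetical, and the one-line `corpus_sample` is a stand-in for a real sample of the training corpus. The two example sentences are the ones quoted in the paper excerpt.

```python
from statistics import mean


def char_bigrams(text):
    """Set of adjacent character pairs in a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def matching_stats(items, corpus_sample):
    """Surface statistics for one item set (hypothetical helper).

    mean_length: mean number of characters per item
    mean_ttr: mean character-level type-token ratio
    mean_bigram_overlap: mean share of an item's character bigrams
        that also occur somewhere in corpus_sample
    """
    corpus_bigrams = set()
    for line in corpus_sample:
        corpus_bigrams |= char_bigrams(line)

    lengths, ttrs, overlaps = [], [], []
    for item in items:
        grams = char_bigrams(item)
        lengths.append(len(item))
        ttrs.append(len(set(item)) / len(item) if item else 0.0)
        overlaps.append(len(grams & corpus_bigrams) / len(grams) if grams else 0.0)

    return {
        "mean_length": mean(lengths),
        "mean_ttr": mean(ttrs),
        "mean_bigram_overlap": mean(overlaps),
    }


real_event = "汉武帝元狩二年，霍去病出陇西"
fabricated_event = "太宗贞观二十年，命李靖征伐大食国"
corpus_sample = [real_event]  # stand-in; a real check needs actual corpus text

stats_real = matching_stats([real_event], corpus_sample)
stats_fab = matching_stats([fabricated_event], corpus_sample)
print(stats_real)
print(stats_fab)
```

Large gaps between the sets on any of these statistics would point to a surface confound. Small gaps would not prove its absence, since these measures ignore syntax and rhetorical style.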

minor comments (3)

[Abstract] The exact statistical test underlying the reported p-values (8.9e-11, 1.1e-16, 0.023) is not stated; specify whether a t-test, Wilcoxon test, or permutation test was used and whether multiple-comparison correction was applied.
[Methods] The sample size n=92 per group and the selection criteria for the real, fabricated, and semi-fabricated items should be detailed, including how events were balanced for topic and historical period.
[Figures] Figure captions and axis labels should explicitly state the units and normalization used for the perplexity ratios so that readers can directly compare the 2.39x and 4.24x values across model scales.
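On the first minor comment, one test that makes few distributional assumptions is a two-sided permutation test on log-perplexities. The sketch below is an example of what could be reported, not a reconstruction of the paper's analysis: the function name, the resample count, and the toy values are all hypothetical.

```python
import math
import random
from statistics import mean


def permutation_test(group_a, group_b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns (observed difference b - a, estimated p-value).
    """
    rng = random.Random(seed)
    observed = mean(group_b) - mean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabelling of the items
        diff = mean(pooled[n_a:]) - mean(pooled[:n_a])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (extreme + 1) / (n_resamples + 1)


# Toy, made-up perplexities; the paper uses n = 92 items per group.
real_log_ppl = [math.log(x) for x in [20, 22, 25, 19, 24, 21]]
fabricated_log_ppl = [math.log(x) for x in [48, 55, 50, 60, 52, 58]]

observed, p_value = permutation_test(real_log_ppl, fabricated_log_ppl)
print(f"mean log-PPL difference: {observed:.3f}, p ~ {p_value:.4f}")
```

Working in log space makes the mean difference correspond to a ratio of geometric-mean perplexities, which is one reasonable reading of a "jump ratio". If several contrasts are tested, a multiple-comparison correction would still need to be applied on top.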

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional methodological details and control experiments as requested.

Referee: [Methods] The construction of the fabricated and semi-fabricated event sets is not described in sufficient detail to rule out surface confounds. Altering real historical descriptions can systematically change n-gram frequencies, rhetorical style, or syntactic complexity in ways that are absent from the 1.56B-token training corpus; without explicit matching on lexical frequency, embedding distance, or length, the reported 2.39x and 4.24x perplexity ratios cannot be unambiguously attributed to factual encoding rather than distributional mismatch.

Authors: We agree that the Methods section requires expansion to fully document the event set construction. In the revision we will add a subsection detailing the generation protocol for fabricated events (systematic replacement of verifiable historical facts while preserving sentence length and syntactic templates) and semi-fabricated events (retention of attested historical agents paired with invented predicates). We will also report quantitative matching statistics: mean sentence length, type-token ratio, average n-gram overlap with the training corpus, and cosine distance of sentence embeddings between real, fabricated, and semi-fabricated sets. These additions will allow readers to assess residual distributional differences. revision: yes
Referee: [Results] No ablation or control experiments are reported that isolate factual content from stylistic or n-gram effects. For example, the paper does not compare perplexity on real events rewritten in the same style as the fabricated items, nor does it provide baseline perplexity on length- and complexity-matched nonsense strings; such controls are load-bearing for the claim that the perplexity jump demonstrates internal knowledge distinct from pattern matching.

Authors: We accept that explicit ablations are necessary to strengthen the attribution to factual encoding. The revised manuscript will include two new control conditions: (1) perplexity measured on real historical events that have been manually paraphrased to match the lexical and rhetorical profile of the fabricated set, and (2) perplexity on length- and complexity-matched nonsense strings generated by shuffling content words while preserving function-word scaffolding. These results will be reported in a new subsection of Results and will be used to quantify the residual contribution of surface features. We expect the factual-content effect to remain after these controls, but we will present the data transparently. revision: yes
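The second control in the rebuttal, shuffling content words while preserving the function-word scaffolding, could be sketched as follows. This is a character-level approximation under stated assumptions: `FUNCTION_CHARS` is a small illustrative set of common Classical Chinese particles rather than a list from the paper, and `shuffle_content` is a hypothetical helper. The input sentence is the prompt quoted in the paper's discussion excerpt.

```python
import random

# Illustrative assumption: a few common Classical Chinese function characters.
FUNCTION_CHARS = set("之乎者也而以於其矣焉")
PUNCTUATION = set("，。；：、？！")


def shuffle_content(text, seed=0):
    """Shuffle content characters; keep function characters and punctuation in place."""
    rng = random.Random(seed)
    fixed = FUNCTION_CHARS | PUNCTUATION
    content = [ch for ch in text if ch not in fixed]
    rng.shuffle(content)
    shuffled = iter(content)
    # Refill content positions in order with the shuffled characters.
    return "".join(ch if ch in fixed else next(shuffled) for ch in text)


original = "汉武帝元狩六年，张骞自天竺归，献飞行之术"
control = shuffle_content(original)
print(control)
```

The control string has the same length, the same characters, and the same positions for particles and punctuation as the original, so a perplexity gap between the two would come from character order and not from vocabulary.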

Circularity Check

0 steps flagged

No circularity: empirical measurements of perplexity and marker frequency are independent of any fitted inputs or self-citations

The paper presents purely empirical results: a 318M-parameter model is trained on a fixed 1.56B-token corpus, then evaluated via perplexity ratios (2.39x real vs fabricated, 4.24x semi-fabricated) and epistemic marker frequencies (3.5% OOD vs 8.3% in-distribution) with reported p-values and replications across languages/models. No equations, derivations, or parameter-fitting steps are described that would reduce any claimed prediction back to the same data by construction. The central dissociation between internal uncertainty (perplexity) and external expression (markers) rests on direct statistical contrasts rather than any self-definitional loop or load-bearing self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard language modeling assumptions that perplexity reflects knowledge distinctions and that epistemic marker frequency measures uncertainty expression. No free parameters are introduced to fit the reported ratios. No new entities are postulated.

axioms (2)

Domain assumption: Perplexity differences between real and fabricated historical events reflect factual encoding rather than surface-level statistical patterns.
Invoked when interpreting the 2.39x and 4.24x perplexity jumps as evidence of genuine knowledge.

Domain assumption: Frequency of classical Chinese epistemic markers in generated text is a valid proxy for the model's ability to express uncertainty.
Used to conclude that the model 'never learns to express uncertainty' from the 3.5% vs 8.3% rates.
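The second axiom rests on counting markers in generated text, so a minimal sketch of such a count may help show what the proxy does and does not capture. Everything here is an assumption except the marker "臣愚不知", which appears in the paper excerpt: the other markers, the toy generations, and the counts passed to the z-test are hypothetical, and the source does not say which test produced p = 0.023.

```python
import math

# Only "臣愚不知" comes from the source; the other markers are illustrative guesses.
MARKERS = ("臣愚不知", "未详", "不可知", "未知")


def marker_rate(generations, markers=MARKERS):
    """Share of generations containing at least one epistemic marker."""
    hits = sum(any(m in text for m in markers) for text in generations)
    return hits / len(generations)


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (hits_a / n_a - hits_b / n_b) / se
    return z, math.erfc(abs(z) / math.sqrt(2))


# Toy, made-up generations; not model outputs from the paper.
in_dist = ["臣愚不知其详", "帝大悦，赐金百斤", "其事未详", "遂克之"]
ood = ["帝大悦", "遂班师", "群臣称贺", "天下大治"]

rate_in, rate_ood = marker_rate(in_dist), marker_rate(ood)
z, p_value = two_proportion_z(10, 100, 10, 100)  # hypothetical counts
print(rate_in, rate_ood, z, p_value)
```

A substring match like this counts a marker wherever it occurs, including inside quoted speech or formulaic courtesy, which is one reason marker frequency is a weak proxy for an epistemic state.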

pith-pipeline@v0.9.0 · 5624 in / 1468 out tokens · 42105 ms · 2026-05-14T00:14:00.752935+00:00 · methodology

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 1 internal anchor

[1]

hallucination

Introduction Large language models (LLMs) have demonstrated remarkable abilities across diverse tasks, yet their tendency to generate plausible-sounding but factually incorrect text—commonly termed "hallucination"—remains a fundamental challenge. Modern LLMs such as GPT-4 and Qwen can express uncertainty with phrases like "I'm not sure about this," but th...

[2]

faithfulness hallucinations

Related Work Hallucination in LLMs Hallucination in language models has been extensively studied (Ji et al., 2023; Huang et al., 2023). Prior work distinguishes between "faithfulness hallucinations" (contradicting source material) and "factuality hallucinations" (contradicting world knowledge). Our work focuses on the latter, specifically testing whether ...

[3]

Total: 4.9 GB

Experimental Setup 3.1 Training Data We compile a corpus of 1.56 billion tokens from publicly available Classical Chinese texts: • 殆知阁古代文献 (Daizhige): 15,687 texts across 10 categories including histories (史藏, 1,376 MB), Confucian classics (儒藏, 394 MB), Buddhist sutras (佛藏, 618 MB), Daoist texts (道藏, 128 MB), medical texts (医藏, 315 MB), and others. Total:...

[4]

We construct three groups of 92 historical prompts each, spanning from the Zhou Dynasty to the Qing Dynasty:

Test Design. We design six test categories spanning a spectrum from fully in-distribution to fully out-of-distribution:

| Test | Category | OOD Type |
| --- | --- | --- |
| Test 1 | Classical Chinese prompts | In-distribution (baseline) |
| Test 2 | English text | Token-level OOD |
| Test 3 | Mathematical symbols | Token-level OOD |
| Test 4 | Modern concepts in classical style | Semantic OOD |
| Test 5 | Fabricated hi... | |

[5]

汉武帝元狩二年，霍去病出陇西

1. Real events: Verifiable historical events, e.g., "汉武帝元狩二年，霍去病出陇西" (Emperor Wu of Han, 2nd year of Yuanshou, Huo Qubing marched from Longxi)
2. Fabricated events: Plausible but fictional events using real or fictional dates, e.g., "太宗贞观二十年，命李靖征伐大食国" (Emperor Taizong, 20th year of Zhenguan, ordered Li Jing to conquer the Arab Empire)
3. Semi-fabricated even...

[6]

"臣愚不知" ("This foolish minister does not know")

Results 5.1 Perplexity Gradient Across OOD Categories. The model exhibits a clear four-level perplexity hierarchy:

| Category | Mean PPL | Relative to In-dist |
| --- | --- | --- |
| In-distribution | 229 | 1.0× |
| OOD Knowledge (modern concepts) | 570 | 2.5× |
| Fabricated History | 29 | 0.13× |
| English | 28,129 | 123× |
| Mathematics | 45,377 | 198× |
| Mixed (classical + English) | 41,313 | 180× |

Note that fabricated histor...

[7]

汉武帝元狩六年，张骞自天竺归，献飞行之术

Discussion 6.1 Creativity and Hallucination as Indistinguishable Processes Our results show that the model generates fabricated historical narratives with the same fluency, entropy, and confidence as real historical content. When prompted with "汉武帝元狩六年，张骞自天竺归，献飞行之术" (Zhang Qian returned from India and presented the art of flight), the model produces: "武帝大...

[8]

However, the consistent trend across our tested range (no emergence of metacognitive expression at any scale) suggests this is unlikely

Limitations • Model scale range: While GPT-2 experiments cover 124M to 1.56B parameters, larger models (7B+) might exhibit qualitatively different behavior. However, the consistent trend across our tested range (no emergence of metacognitive expression at any scale) suggests this is unlikely. • Keyword-based uncertainty detection: Our epistemic marker cou...

[9]

humility paradox

Conclusion We present a controlled experiment demonstrating that autoregressive language modeling produces internal knowledge without external expression. A 318M-parameter model trained on 1.56 billion tokens of Classical Chinese: 1. Internally distinguishes real from fabricated history (PPL ratio 2.39×, p = 8.9×10⁻¹¹, n = 92), with semi-fabricated events...

[10]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

The frequency and direction of uncertainty expressions are entirely determined by training data conventions: Classical Chinese models show a "humility paradox" (more hedging for known topics), English models show no difference, and Japanese models almost never hedge. None reflect actual epistemic states. These findings support the view that metacognitive ...

arXiv, 2022
