CAPC-CG: A Large-Scale, Expert-Directed LLM-Annotated Corpus of Adaptive Policy Communication in China

Bolun Sun; Charles Chang; Pingxu Hao; Ruotong Mu; Yuchen Xu; Yuen Yuen Ang; Zhengxin Zhang

arxiv: 2510.08986 · v3 · pith:JO7WHLPLnew · submitted 2025-10-10 · 💻 cs.CL · cs.CE · cs.CY

Pith reviewed 2026-05-21 20:47 UTC · model grok-4.3

classification 💻 cs.CL · cs.CE · cs.CY

keywords Chinese policy corpus · adaptive policy communication · annotated dataset · policy language analysis · inter-annotator agreement · LLM baselines · government directives · clear ambiguous language

The pith

CAPC-CG offers the first open annotated corpus of Chinese central government policies using a five-color taxonomy for clear and ambiguous language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces CAPC-CG, a large dataset of Chinese policy directives from 1949 to 2023, annotated under expert direction according to a theory of adaptive policy communication. The corpus breaks documents into 3.3 million paragraphs and labels them with categories that distinguish clear from ambiguous statements. High agreement among the human annotators on the gold-standard set supports its use for training models, and the authors provide baseline results from large language models along with patterns observed in the data.

Core claim

The paper establishes CAPC-CG as a reliable, expert-annotated resource that applies a five-color taxonomy to categorize the language in national laws, administrative regulations, and ministerial rules, enabling quantitative study of how Chinese authorities communicate policy intent across decades.

What carries the argument

The five-color taxonomy of clear and ambiguous language categories, applied through a two-round expert labeling process to paragraph-level segments of policy documents.

If this is right

A Fleiss's kappa of 0.86 on the gold-standard directive labels supports effective supervised machine learning on policy language classification.
Researchers can analyze historical shifts in the use of clear versus ambiguous directives from 1949 onward.
Baseline LLM performances provide starting points for improving automated annotation of similar government texts.
The released metadata and codebook support replication and extension to other policy domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dataset could help identify whether ambiguous language in policies leads to varied local implementation outcomes.
Comparing annotation patterns before and after key historical events might show changes in central government communication strategies.
Adapting this taxonomy to non-Chinese policy documents could reveal cross-national differences in how governments balance clarity and flexibility.

Load-bearing premise

The five-color taxonomy developed for adaptive policy communication applies reliably to Chinese central government documents without major adaptation or loss of validity.

What would settle it

A new round of annotations by independent experts on a held-out sample of documents yielding substantially lower agreement scores or frequent category mismatches would indicate the taxonomy does not transfer reliably.

Figures

Figures reproduced from arXiv: 2510.08986 by Bolun Sun, Charles Chang, Pingxu Hao, Ruotong Mu, Yuchen Xu, Yuen Yuen Ang, Zhengxin Zhang.

**Figure 1.** Overall Workflow. Adjacent source text (Section 2, Theoretical Foundation: Adaptive Policy Communication): Although several large-scale Chinese datasets have been developed, they are mainly based on Internet and social media content without much quality control (Wang et al., 2025a; Li et al., 2024; Zhang et al., 2023; Yuan et al., 2023, 2021; Xu et al., 2020). In contrast, resources for political and legal documents remain scarce: existin…

**Figure 2.** Workflow for Corpus Creation. Adjacent source text (Section 3.1, Scope of Policy Directives): Our analysis focuses on policy directives, defined as formally issued documents in which a higher-level authority addresses state or public-sector actors with actionable implications; that is, the bureaucrats are expected to notice, interpret, and respond to the directives. Such directives form the "vertical formal channels" of bureaucratic comm…

**Figure 3.** Inter-Coder Agreement Improvement. Adjacent source text: Once we felt confident about the annotators' grasp of the codebook, they proceeded to code paragraphs at scale. Each week, we assigned 1,000 randomly sampled paragraphs and repeated this process without sample replacement. By the end of the period, they annotated 6,000 samples, ensuring an even distribution across all labels. Two-Round Labeling Workflow: Each paragraph is …

**Figure 4.** Two-Round Labeling Workflow. Adjacent source text: This rigorous, two-stage protocol balances theoretical nuance with operational reliability, enabling …

**Figure 5.** Temporal Trend of Policies (1978-2023). Adjacent source table:

| Metric | Score |
| --- | --- |
| Cohen's Kappa (4-class) | 0.833 |
| Overall Accuracy | 87.5% |
| Accuracy, Class 'B' (Authorizing) | 92.0% |
| Accuracy, Class 'C' (Flexible) | 85.0% |
| Accuracy, Class 'G' (Ambiguous) | 84.0% |
| Accuracy, Class 'Y' (Mandating) | 89.0% |

**Figure 6.** The distribution of action-oriented topics

**Figure 7.** Heatmap of governmental intents across pol…

**Figure 8.** The end-to-end process, from raw text file to final segmented output

Original abstract

We introduce CAPC-CG, the Chinese Adaptive Policy Communication (Central Government) Corpus, the first open dataset of Chinese policy directives annotated with a five-color taxonomy of clear and ambiguous language categories, building on Ang's theory of adaptive policy communication. Spanning 1949-2023, this corpus includes national laws, administrative regulations, and ministerial rules issued by China's top authorities. Each document is segmented into paragraphs, producing a total of 3.3 million units. Alongside the corpus, we release comprehensive metadata, a two-round labeling framework, and a gold-standard annotation set developed by expert and trained coders. Inter-annotator agreement achieves a Fleiss's kappa of K = 0.86 on directive labels, indicating high reliability for supervised modeling. We provide baseline classification results with several large language models (LLMs), together with our annotation codebook, and describe patterns from the dataset. This release aims to support downstream tasks and multilingual NLP research in policy communication.
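The reliability figure quoted in the abstract is a Fleiss's kappa. As a minimal sketch of how that statistic is computed when several coders label the same paragraphs, assuming every paragraph receives the same number of ratings, the function below uses only the standard library. The function name, the toy ratings, and the choice of class letters are illustrative assumptions, not the authors' code or data.

```python
import math
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss's kappa for `ratings`, a list of per-item label lists.

    Each inner list holds the labels that the raters gave one item;
    all items must have the same number of ratings (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    label_totals = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        label_totals.update(counts)
        # Share of ordered rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing_pairs / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total = n_items * n_raters
    # Chance agreement from the pooled label proportions.
    expected = sum((t / total) ** 2 for t in label_totals.values())
    if math.isclose(expected, 1.0):
        return float("nan")  # only one label was ever used; kappa is undefined
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): three coders, four paragraphs.
toy = [["B", "B", "Y"], ["Y", "Y", "Y"], ["B", "B", "B"], ["B", "Y", "Y"]]
print(round(fleiss_kappa(toy), 3))
```

A value near 1 indicates agreement well above chance; the toy ratings are deliberately noisy, so their kappa sits far below the 0.86 reported for the gold set.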

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This releases a genuinely large new corpus of Chinese central-government policy texts annotated with a five-color clarity taxonomy, but the headline reliability number only covers the expert gold set and not the LLM labels on the 3.3 million paragraphs.

The main thing to know is that the authors have put together the first openly available corpus of this scale for Chinese policy directives, covering national laws, regulations, and rules from 1949 to 2023 and segmented into 3.3 million paragraphs. They apply a five-color taxonomy based on Ang's adaptive policy communication framework, release metadata, a codebook, the gold-standard expert annotations, and some LLM baseline results. That package fills a real gap for people working on authoritarian policy language or multilingual NLP in under-resourced domains, and nothing comparable appears in the references they cite.

The Fleiss's kappa of 0.86 on the directive labels from the expert and trained coders is a respectable number and shows the taxonomy can be applied consistently on the gold set. They also describe some patterns in the data, which adds immediate usability.

The soft spot is exactly the one the stress-test flags. The released corpus carries LLM-generated labels for the bulk of the material, yet no agreement figures are given between those LLM outputs and the expert gold set across time periods, document types, or edge cases. Without that check, or at least sampled validation on the full span, the claim that the whole thing is ready for supervised modeling rests on an untested step. Paragraph segmentation rules and how ambiguous cases were handled also stay a bit thin in the abstract.

This is a data-release paper aimed at political scientists studying Chinese policy communication and at NLP groups that need labeled Chinese text in this domain. It gives them concrete material to work with rather than a new theoretical claim. I would send it for peer review. The scale and the expert gold set make it worth referee time, and the main fixes needed are tighter validation numbers and clearer documentation on the LLM step.

Referee Report

1 major / 0 minor

Summary. The paper introduces CAPC-CG, the first open large-scale corpus of Chinese policy directives from 1949-2023, consisting of 3.3 million paragraph units annotated using a five-color taxonomy for clear and ambiguous language based on Ang's adaptive policy communication theory. It includes a gold-standard set with expert annotations achieving Fleiss's kappa of 0.86, releases the LLM-annotated full corpus, metadata, annotation framework, codebook, and baseline LLM classification results.

Significance. This work provides a valuable new resource for research in policy communication, NLP, and Chinese politics by offering an unprecedented scale of annotated data spanning decades. The open release of the corpus, gold-standard annotations, and baselines promotes reproducibility and enables supervised modeling tasks. The high inter-annotator agreement on the gold set is a strength, supporting potential use in downstream applications if the LLM annotations maintain similar quality.

major comments (1)

Abstract: The Fleiss's kappa of 0.86 is reported for the gold-standard annotation set by expert and trained coders. However, the primary released corpus uses LLM annotations on the full 3.3 million paragraphs, and no agreement metric (such as accuracy or kappa) between the LLM labels and the expert gold standard is provided. This extrapolation undermines the claim of high reliability for the released dataset.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and recommendation for minor revision. Their comment correctly identifies a point of clarification needed regarding the reliability claims for the LLM-annotated portion of the corpus, which we address directly below.

Referee: Abstract: The Fleiss's kappa of 0.86 is reported for the gold-standard annotation set by expert and trained coders. However, the primary released corpus uses LLM annotations on the full 3.3 million paragraphs, and no agreement metric (such as accuracy or kappa) between the LLM labels and the expert gold standard is provided. This extrapolation undermines the claim of high reliability for the released dataset.

Authors: We appreciate this observation, which highlights a useful distinction. The Fleiss's kappa of 0.86 measures inter-annotator agreement among human experts on the gold-standard set and supports the validity of our five-color taxonomy and codebook. The full 3.3 million paragraphs were annotated via LLMs, with baseline classification results provided to illustrate model performance on held-out data derived from the gold standard. To directly address the concern, the revised manuscript will include explicit agreement metrics (accuracy, F1, and Cohen's kappa) comparing LLM-generated labels to expert annotations on a validation subset. This addition will strengthen transparency without changing the dataset release or core findings.

Revision: yes
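The three metrics named in the rebuttal can be sketched in a few lines. This is a minimal illustration, assuming one expert label and one LLM label per paragraph in a validation subset; the function names and the six toy labels are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def accuracy(gold, pred):
    """Share of paragraphs where the LLM label equals the expert label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohen_kappa(gold, pred):
    """Chance-corrected agreement between two label sequences."""
    n = len(gold)
    gold_freq, pred_freq = Counter(gold), Counter(pred)
    expected = sum(gold_freq[c] * pred_freq[c] for c in gold_freq) / (n * n)
    if math.isclose(expected, 1.0):
        return float("nan")  # both sequences use a single identical label
    return (accuracy(gold, pred) - expected) / (1.0 - expected)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Toy validation subset (hypothetical labels).
expert = ["B", "B", "Y", "Y", "G", "C"]
llm = ["B", "Y", "Y", "Y", "G", "C"]
print(accuracy(expert, llm), cohen_kappa(expert, llm), macro_f1(expert, llm))
```

Reporting kappa next to accuracy matters when one class dominates, because raw accuracy can look high even if the rarer classes are labeled poorly.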

Circularity Check

0 steps flagged

Data release with minor self-citation on taxonomy; no derivation reduces to inputs

The paper is a corpus release that segments 3.3M paragraphs from 1949-2023 Chinese policy documents and annotates them with a five-color taxonomy drawn from Ang's prior theory of adaptive policy communication. One co-author (Yuen Yuen Ang) overlaps with that theory, producing a minor self-citation, but the central claim is the existence and release of the annotated dataset itself together with a gold-standard subset whose Fleiss's kappa is reported as 0.86. No equations, fitted parameters, predictions, or uniqueness theorems appear; the work contains no derivation chain that reduces by construction to its own inputs or to an unverified self-citation. The contribution is therefore self-contained as an empirical resource.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the applicability of Ang's existing theory of adaptive policy communication to Chinese central-government documents and on the assumption that expert annotation can reliably operationalize the five-color taxonomy at paragraph level.

axioms (1)

Domain assumption: Ang's theory of adaptive policy communication supplies a valid five-color taxonomy for distinguishing clear and ambiguous language in Chinese policy directives.
The taxonomy is adopted directly from prior work by one of the co-authors; the paper does not re-derive or validate the categories from first principles within this corpus.

pith-pipeline@v0.9.0 · 5730 in / 1331 out tokens · 31119 ms · 2026-05-21T20:47:07.585927+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We introduce CAPC-CG, the first open dataset of Chinese policy directives annotated with a five-color taxonomy of clear and ambiguous language categories, building on Ang's theory of adaptive policy communication.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Inter-annotator agreement achieves a Fleiss's kappa of K = 0.86 on directive labels

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] Carion, N. et al. Sam 3: Segment anything with concepts (2025)

Inter-annotator agreement is not the ceiling of machine learning performance: Evidence from a comprehensive set of simulations. In Proceedings of the 21st Workshop on Biomedical Language Processing, pages 275–284. Marina Sokolova and Guy Lapalme. 2009. A systematic analysis of performance measures for classification tasks. Information Processing & Man...

[2] Recipient principle: All directives should be read from the perspective of state actors. Private laws and regulations directed solely at individuals, enterprises, or social groups without assigning bureaucratic tasks are not considered directives.

[3] Clarity vs. specificity: Code for clarity of intent, not specificity of detail. A directive may express a clear purpose while remaining vague in its implementation, or contain extensive details yet lack a clear underlying intent.

[4] Avoid keyword-only tagging: Keywords like must, flexible, or forbid can serve as helpful cues, but they should never be used mechanically or treated as the deciding criteria. A.4 Step-by-Step Workflow: This section provides a practical workflow for applying the two-round annotation process. Step 1: Assign Level 1 Label (W / R / N). Determine whether the p...
[5] API Call: Send the entire raw text of document.txt to the LLM.

[6] LLM Task: The LLM internally processes the text and embeds XML tags (e.g., <L1>...) directly into the content.

[7] text editor

LLM Output: The LLM returns the complete, modified text as its response. Cost analysis: At the project's pricing of $0.20/1M input and $0.80/1M output tokens, this model is prohibitively expensive. The output token count is nearly identical to the input count, and output tokens are four times more costly. The total cost for a single document would be approx...
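The cost argument in the excerpt above can be checked with simple arithmetic. The sketch below takes the two per-million-token prices quoted there and compares a full-text rewrite (output roughly equal to input) with a compact reply; the 10% ratio is the upper end of the 5-10% range the page quotes further down for JSON output, and the one-million-token input size and the function name are assumptions made for illustration.

```python
PRICE_IN_PER_M = 0.20   # USD per 1M input tokens, as quoted in the excerpt
PRICE_OUT_PER_M = 0.80  # USD per 1M output tokens, as quoted in the excerpt


def api_cost(input_tokens, output_ratio):
    """USD cost when the reply holds output_ratio * input_tokens tokens."""
    output_tokens = input_tokens * output_ratio
    return (input_tokens * PRICE_IN_PER_M + output_tokens * PRICE_OUT_PER_M) / 1_000_000


tokens = 1_000_000  # hypothetical input size
full_rewrite = api_cost(tokens, 1.0)   # model echoes the whole document with tags
compact_json = api_cost(tokens, 0.10)  # model returns only labels and line numbers
print(f"full rewrite: ${full_rewrite:.2f}, compact JSON: ${compact_json:.2f}")
```

Under these assumptions the output side dominates the bill for a full rewrite, which is why shrinking the reply matters more than shrinking the prompt.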
[8] Local Pre-processing: A Python script first adds <line #> markers to the document text. This step happens locally and incurs no API cost.

[9] API Call: Send the numbered text to the GPT-4.1-mini-2025-04-14 model via the OpenAI Batch API.

[10] LLM Task: The LLM's only task is to identify structural elements and output a compact JSON file listing labels and their corresponding line numbers.

[11] segments

Local Reconstruction: A local script merges the LLM's JSON output with the original text file to create the XML-tagged document. Cost analysis: This architecture transforms the cost equation. The output is now a very small JSON object, typically only 5-10% of the input token size. The total cost is: Cost_optimized ≈ (Tokens_input × $0.20) + (Tokens_JSON × $0.80...
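The local steps in entries [8] and [11] (numbering lines before the API call, then merging the model's JSON back into the original text) might look roughly like the sketch below. The marker format `<n>`, the JSON schema with a `segments` list of `label`/`start`/`end` fields, and both function names are assumptions for illustration; the excerpts do not show the authors' actual schema, and this version produces flat rather than nested tags.

```python
import json


def add_line_markers(text):
    """Prefix each line with a <n> marker so the model can cite line numbers."""
    return "\n".join(
        f"<{i}> {line}" for i, line in enumerate(text.splitlines(), start=1)
    )


def reconstruct(text, llm_json):
    """Wrap the cited line ranges of the ORIGINAL text in XML-style tags."""
    lines = text.splitlines()
    spans = sorted(json.loads(llm_json)["segments"], key=lambda s: s["start"])
    tagged = []
    for span in spans:
        # Line numbers are treated as 1-based and inclusive (an assumption).
        body = "\n".join(lines[span["start"] - 1 : span["end"]])
        tagged.append(f"<{span['label']}>{body}</{span['label']}>")
    return "\n".join(tagged)


raw = "Chapter 1 General Provisions\nArticle 1 ...\nArticle 2 ..."
numbered = add_line_markers(raw)  # this is what would be sent to the model
# Hypothetical model reply: labels plus line ranges, no document text.
reply = (
    '{"segments": [{"label": "L1", "start": 1, "end": 1},'
    ' {"label": "L2", "start": 2, "end": 3}]}'
)
print(reconstruct(raw, reply))
```

Because the model never re-emits the document, the tagged output is built from the untouched original text, so the reply stays small and the model cannot silently alter the wording.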
[12] optimal segmentation layer

Metadata Generation: Heuristic Content Analysis. Before segmentation, the script first analyzes the tagged document to determine its most substantively important structural layer. This is achieved by scoring the content within each hierarchical layer (e.g., all L1 tags, all L2 tags) based on a set of heuristic parameters. The goal of this step is to pr...

[13] 必须", "应",

Final Segmentation: Deterministic Rule Application

| Parameter | Value | Rationale |
| --- | --- | --- |
| COLOR_DIVERSITY_WEIGHT | 3.0 | Emphasizes the presence of action-oriented keywords. |
| LENGTH_WEIGHT | 0.5 | Favors longer, more substantive text blocks. |
| LAYER_PENALTY_FACTOR | 1.5 | Penalizes deeper layers (L3, L4), creating a bias for higher-level structure. |
| MIN_LENGTH_THRESHOLD | 15.0 | Minimum... |
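The excerpts above are truncated before the scoring formula, so how the four parameters actually combine is not shown. Purely as a hypothetical illustration of layer scoring with these values, the sketch below counts distinct keyword hits, adds a damped length term, ignores blocks shorter than the threshold, and divides by a penalty that grows with depth. The formula, the two-item keyword list (taken from the fragment visible in entry [13]), and the function names are all assumptions, not the authors' rule.

```python
import math

COLOR_DIVERSITY_WEIGHT = 3.0
LENGTH_WEIGHT = 0.5
LAYER_PENALTY_FACTOR = 1.5
MIN_LENGTH_THRESHOLD = 15.0
KEYWORDS = ("必须", "应")  # only the two keywords visible in the excerpt


def layer_score(blocks, depth):
    """Hypothetical score for one structural layer (depth 1 = L1 tags)."""
    kept = [b for b in blocks if len(b) >= MIN_LENGTH_THRESHOLD]
    if not kept:
        return 0.0
    # Number of distinct keywords that occur anywhere in the layer.
    distinct_hits = sum(any(k in b for b in kept) for k in KEYWORDS)
    mean_length = sum(len(b) for b in kept) / len(kept)
    raw = COLOR_DIVERSITY_WEIGHT * distinct_hits + LENGTH_WEIGHT * math.log(mean_length)
    return raw / (LAYER_PENALTY_FACTOR ** (depth - 1))


def pick_layer(layers):
    """Return the depth whose blocks score highest; `layers` maps depth -> blocks."""
    return max(layers, key=lambda depth: layer_score(layers[depth], depth))


layers = {
    1: [
        "各级人民政府必须加强对本行政区域内环境保护工作的领导。",
        "有关部门应当依照本法规定履行监督管理职责。",
    ],
    2: ["短句"],  # shorter than the threshold, so it is ignored
}
print(pick_layer(layers))
```

Once a layer is chosen, segmentation would presumably split the document at that layer's tag boundaries, matching the "deterministic rule application" named in entry [13].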
