
arxiv: 2510.08986 · v3 · pith:JO7WHLPLnew · submitted 2025-10-10 · 💻 cs.CL · cs.CE · cs.CY

CAPC-CG: A Large-Scale, Expert-Directed LLM-Annotated Corpus of Adaptive Policy Communication in China

Pith reviewed 2026-05-21 20:47 UTC · model grok-4.3

classification 💻 cs.CL · cs.CE · cs.CY
keywords Chinese policy corpus · adaptive policy communication · annotated dataset · policy language analysis · inter-annotator agreement · LLM baselines · government directives · clear ambiguous language

The pith

CAPC-CG offers the first open annotated corpus of Chinese central government policies using a five-color taxonomy for clear and ambiguous language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces CAPC-CG, a large dataset of Chinese policy directives from 1949 to 2023, labeled by LLMs under expert direction according to Ang's theory of adaptive policy communication. The corpus breaks documents into 3.3 million paragraphs and labels each with categories that distinguish clear from ambiguous statements. High agreement among human annotators on the gold-standard set (Fleiss's kappa of 0.86) supports its use for training models, and the authors provide baseline results from large language models along with patterns observed in the data.

Core claim

The paper establishes CAPC-CG as a reliable, expert-annotated resource that applies a five-color taxonomy to categorize the language in national laws, administrative regulations, and ministerial rules, enabling quantitative study of how Chinese authorities communicate policy intent across decades.

What carries the argument

The five-color taxonomy of clear and ambiguous language categories, applied through a two-round expert labeling process to paragraph-level segments of policy documents.

If this is right

  • Inter-annotator agreement of Fleiss's kappa = 0.86 on the gold-standard set supports supervised machine learning on policy language classification.
  • Researchers can analyze historical shifts in the use of clear versus ambiguous directives from 1949 onward.
  • Baseline LLM performances provide starting points for improving automated annotation of similar government texts.
  • The released metadata and codebook support replication and extension to other policy domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could help identify whether ambiguous language in policies leads to varied local implementation outcomes.
  • Comparing annotation patterns before and after key historical events might show changes in central government communication strategies.
  • Adapting this taxonomy to non-Chinese policy documents could reveal cross-national differences in how governments balance clarity and flexibility.

Load-bearing premise

The five-color taxonomy developed for adaptive policy communication applies reliably to Chinese central government documents without major adaptation or loss of validity.

What would settle it

A new round of annotations by independent experts on a held-out sample of documents yielding substantially lower agreement scores or frequent category mismatches would indicate the taxonomy does not transfer reliably.

Figures

Figures reproduced from arXiv: 2510.08986 by Bolun Sun, Charles Chang, Pingxu Hao, Ruotong Mu, Yuchen Xu, Yuen Yuen Ang, Zhengxin Zhang.

Figure 1: Overall Workflow
    2 Theoretical Foundation: Adaptive Policy Communication. Although several large-scale Chinese datasets have been developed, they are mainly based on Internet and social media content without much quality control (Wang et al., 2025a; Li et al., 2024; Zhang et al., 2023; Yuan et al., 2023, 2021; Xu et al., 2020). In contrast, resources for political and legal documents remain scarce: existin…
Figure 2: Workflow for Corpus Creation
    3.1 Scope of Policy Directives. Our analysis focuses on policy directives, defined as formally issued documents in which a higher-level authority addresses state or public-sector actors with actionable implications; that is, the bureaucrats are expected to notice, interpret, and respond to the directives. Such directives form the "vertical formal channels" of bureaucratic comm…
Figure 3: Inter-Coder Agreement Improvement
    Once we felt confident about the annotators' grasp of the codebook, they proceeded to code paragraphs at scale. Each week, we assigned 1,000 randomly sampled paragraphs and repeated this process without sample replacement. By the end of the period, they annotated 6,000 samples, ensuring an even distribution across all labels. Two-Round Labeling Workflow. Each paragraph is …
Figure 4: Two-Round Labeling Workflow
    This rigorous, two-stage protocol balances theoretical nuance with operational reliability, enabling…
Figure 5: Temporal Trend of Policies (1978-2023)
    Metric | Score
    Cohen's Kappa (4-class) | 0.833
    Overall Accuracy | 87.5%
    Accuracy per class:
    Class 'B' (Authorizing) | 92.0%
    Class 'C' (Flexible) | 85.0%
    Class 'G' (Ambiguous) | 84.0%
    Class 'Y' (Mandating) | 89.0%
Figure 6: The distribution of action-oriented topics
Figure 7: Heatmap of governmental intents across pol
Figure 8: End-to-end process, from raw text file to final segmented output
Original abstract

We introduce CAPC-CG, the Chinese Adaptive Policy Communication (Central Government) Corpus, the first open dataset of Chinese policy directives annotated with a five-color taxonomy of clear and ambiguous language categories, building on Ang's theory of adaptive policy communication. Spanning 1949-2023, this corpus includes national laws, administrative regulations, and ministerial rules issued by China's top authorities. Each document is segmented into paragraphs, producing a total of 3.3 million units. Alongside the corpus, we release comprehensive metadata, a two-round labeling framework, and a gold-standard annotation set developed by expert and trained coders. Inter-annotator agreement achieves a Fleiss's kappa of K = 0.86 on directive labels, indicating high reliability for supervised modeling. We provide baseline classification results with several large language models (LLMs), together with our annotation codebook, and describe patterns from the dataset. This release aims to support downstream tasks and multilingual NLP research in policy communication.
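For readers who want to see how a figure like the abstract's Fleiss's kappa is obtained, here is a minimal sketch of the standard statistic in plain Python. It is not the authors' code. The function name `fleiss_kappa` and the toy ratings are illustrative assumptions, and the label codes B and Y are borrowed from the class codes shown under Figure 5.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labeled by the same number of raters.

    ratings: list of items; each item is a list of labels, one per rater.
    categories: every label that raters were allowed to use.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(item) for item in ratings]

    # Per-item agreement: share of rater pairs that chose the same label.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each label.
    shares = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    expected = sum(p * p for p in shares)

    if expected == 1.0:  # only one label was ever used
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (invented for illustration): 2 paragraphs, 3 coders each.
toy = [["B", "B", "Y"], ["Y", "Y", "B"]]
print(round(fleiss_kappa(toy, ["B", "Y"]), 3))
```

Values near 1 indicate agreement well above chance, values near 0 indicate chance-level agreement, and negative values (as in the toy data) indicate agreement below chance.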

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces CAPC-CG, the first open large-scale corpus of Chinese policy directives from 1949-2023, consisting of 3.3 million paragraph units annotated using a five-color taxonomy for clear and ambiguous language based on Ang's adaptive policy communication theory. It includes a gold-standard set with expert annotations achieving Fleiss's kappa of 0.86, releases the LLM-annotated full corpus, metadata, annotation framework, codebook, and baseline LLM classification results.

Significance. This work provides a valuable new resource for research in policy communication, NLP, and Chinese politics by offering an unprecedented scale of annotated data spanning decades. The open release of the corpus, gold-standard annotations, and baselines promotes reproducibility and enables supervised modeling tasks. The high inter-annotator agreement on the gold set is a strength, supporting potential use in downstream applications if the LLM annotations maintain similar quality.

major comments (1)
  1. Abstract: The Fleiss's kappa of 0.86 is reported for the gold-standard annotation set by expert and trained coders. However, the primary released corpus uses LLM annotations on the full 3.3 million paragraphs, and no agreement metric (such as accuracy or kappa) between the LLM labels and the expert gold standard is provided. This extrapolation undermines the claim of high reliability for the released dataset.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and recommendation for minor revision. Their comment correctly identifies a point of clarification needed regarding the reliability claims for the LLM-annotated portion of the corpus, which we address directly below.

Point-by-point responses
  1. Referee: Abstract: The Fleiss's kappa of 0.86 is reported for the gold-standard annotation set by expert and trained coders. However, the primary released corpus uses LLM annotations on the full 3.3 million paragraphs, and no agreement metric (such as accuracy or kappa) between the LLM labels and the expert gold standard is provided. This extrapolation undermines the claim of high reliability for the released dataset.

    Authors: We appreciate this observation, which highlights a useful distinction. The Fleiss's kappa of 0.86 measures inter-annotator agreement among human experts on the gold-standard set and supports the validity of our five-color taxonomy and codebook. The full 3.3 million paragraphs were annotated via LLMs, with baseline classification results provided to illustrate model performance on held-out data derived from the gold standard. To directly address the concern, the revised manuscript will include explicit agreement metrics (accuracy, F1, and Cohen's kappa) comparing LLM-generated labels to expert annotations on a validation subset. This addition will strengthen transparency without changing the dataset release or core findings. revision: yes
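The metrics named in the rebuttal (accuracy, F1, and Cohen's kappa between LLM labels and expert labels) can be sketched in a few lines of plain Python. This is an illustrative computation only: the function name `agreement_report`, the choice of macro-averaged F1, and the toy label lists are assumptions, not details taken from the paper.

```python
from collections import Counter


def agreement_report(gold, pred):
    """Compare model labels (pred) with expert labels (gold) on the same paragraphs."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")

    n = len(gold)
    labels = sorted(set(gold) | set(pred))
    gold_freq, pred_freq = Counter(gold), Counter(pred)

    # Accuracy doubles as the observed agreement for Cohen's kappa.
    accuracy = sum(g == p for g, p in zip(gold, pred)) / n

    # Chance agreement from the two marginal label distributions.
    expected = sum(gold_freq[l] * pred_freq[l] for l in labels) / (n * n)
    kappa = 1.0 if expected == 1.0 else (accuracy - expected) / (1.0 - expected)

    # Macro F1: unweighted mean of per-label F1 scores.
    f1_scores = []
    for l in labels:
        tp = sum(g == l and p == l for g, p in zip(gold, pred))
        fp = pred_freq[l] - tp
        fn = gold_freq[l] - tp
        # Each label occurs in gold or pred, so the denominator is positive.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    macro_f1 = sum(f1_scores) / len(f1_scores)

    return {"accuracy": accuracy, "macro_f1": macro_f1, "cohen_kappa": kappa}


# Toy validation subset (invented): expert labels vs. LLM labels.
expert = ["B", "B", "Y", "Y"]
llm = ["B", "Y", "Y", "Y"]
print(agreement_report(expert, llm))
```

Reporting kappa alongside accuracy matters when labels are imbalanced, because accuracy alone can look high even when the model mostly predicts the majority class.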

Circularity Check

0 steps flagged

Data release with minor self-citation on taxonomy; no derivation reduces to inputs

full rationale

The paper is a corpus release that segments 3.3M paragraphs from 1949-2023 Chinese policy documents and annotates them with a five-color taxonomy drawn from Ang's prior theory of adaptive policy communication. One co-author (Yuen Yuen Ang) overlaps with that theory, producing a minor self-citation, but the central claim is the existence and release of the annotated dataset itself together with a gold-standard subset whose Fleiss's kappa is reported as 0.86. No equations, fitted parameters, predictions, or uniqueness theorems appear; the work contains no derivation chain that reduces by construction to its own inputs or to an unverified self-citation. The contribution is therefore self-contained as an empirical resource.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the applicability of Ang's existing theory of adaptive policy communication to Chinese central-government documents and on the assumption that expert annotation can reliably operationalize the five-color taxonomy at paragraph level.

axioms (1)
  • domain assumption: Ang's theory of adaptive policy communication supplies a valid five-color taxonomy for distinguishing clear and ambiguous language in Chinese policy directives.
    The taxonomy is adopted directly from prior work by one of the co-authors; the paper does not re-derive or validate the categories from first principles within this corpus.

pith-pipeline@v0.9.0 · 5730 in / 1331 out tokens · 31119 ms · 2026-05-21T20:47:07.585927+00:00 · methodology



Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. [1]

    29. Carion, N. et al. Sam 3: Segment anything with concepts (2025)

    Inter-annotator agreement is not the ceiling of machine learning performance: Evidence from a comprehensive set of simulations. In Proceedings of the 21st workshop on biomedical language processing, pages 275–284. Marina Sokolova and Guy Lapalme. 2009. A systematic analysis of performance measures for classification tasks. Information Processing & Man...

  2. [2]

    Private laws and regulations directed solely at individuals, enterprises, or social groups without assigning bureaucratic tasks are not considered directives

    Recipient principle: All directives should be read from the perspective of state actors. Private laws and regulations directed solely at individuals, enterprises, or social groups without assigning bureaucratic tasks are not considered directives

  3. [3]

    specificity: Code for clarity of intent, not specificity of detail

    Clarity vs. specificity: Code for clarity of intent, not specificity of detail. A directive may express a clear purpose while remaining vague in its implementation, or contain extensive details yet lack a clear underlying intent

  4. [4]

    A.4 Step-by-Step Workflow This section provides a practical workflow for applying the two-round annotation process

    Avoid keyword-only tagging: Keywords like must, flexible, or forbid can serve as helpful cues, but they should never be used mechanically or treated as the deciding criteria. A.4 Step-by-Step Workflow. This section provides a practical workflow for applying the two-round annotation process. Step 1: Assign Level 1 Label (W / R / N). Determine whether the p...

  5. [5]

    API Call: Send the entire raw text of document.txt to the LLM

  6. [6]

    LLM Task: The LLM internally processes the text and embeds XML tags (e.g., <L1>...) directly into the content

  7. [7]

    text editor

    LLM Output: The LLM returns the complete, modified text as its response. Cost analysis: At the project’s pricing of $0.20/1M input and $0.80/1M output tokens, this model is prohibitively expensive. The output token count is nearly identical to the input count, and output tokens are four times more costly. The total cost for a single document would be approx...

  8. [8]

    This step happens locally and incurs no API cost

    Local Pre-processing: A Python script first adds <line #> markers to the document text. This step happens locally and incurs no API cost

  9. [9]

    API Call: Send the numbered text to the GPT-4.1-mini-2025-04-14 model via the OpenAI Batch API

  10. [10]

    LLM Task: The LLM’s only task is to identify structural elements and output a compact JSON file listing labels and their corresponding line numbers

  11. [11]

    segments

    Local Reconstruction: A local script merges the LLM’s JSON output with the original text file to create the XML-tagged document. Cost analysis: This architecture transforms the cost equation. The output is now a very small JSON object, typically only 5-10% of the input token size. The total cost is: Cost_optimized ≈ (Tokens_input × $0.20) + (Tokens_JSON × $0.80...

  12. [12]

    optimal segmentation layer

    Metadata Generation: Heuristic Content Analysis. Before segmentation, the script first analyzes the tagged document to determine its most substantively important structural layer. This is achieved by scoring the content within each hierarchical layer (e.g., all L1 tags, all L2 tags) based on a set of heuristic parameters. The goal of this step is to pr...

  13. [13]

    必须", "应",

    Final Segmentation: Deterministic Rule Application

    Parameter | Value | Rationale
    COLOR_DIVERSITY_WEIGHT | 3.0 | Emphasizes the presence of action-oriented keywords.
    LENGTH_WEIGHT | 0.5 | Favors longer, more substantive text blocks.
    LAYER_PENALTY_FACTOR | 1.5 | Penalizes deeper layers (L3, L4), creating a bias for higher-level structure.
    MIN_LENGTH_THRESHOLD | 15.0 | Minimum...
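
The extracted snippets above describe a line-number-based segmentation pipeline: number the lines locally, have the LLM return only a compact JSON of labels and line numbers, then rebuild the tagged document locally. The sketch below illustrates that idea under stated assumptions and is not the authors' script. The `<line N>` marker format, the JSON schema (`segments`, `label`, `start`, `end`), and all function names are hypothetical. Only the per-token prices and the 5-10% output ratio come from the snippets.

```python
import json

# Prices quoted in the snippets: $0.20 per 1M input tokens, $0.80 per 1M output tokens.
PRICE_PER_INPUT_TOKEN = 0.20 / 1_000_000
PRICE_PER_OUTPUT_TOKEN = 0.80 / 1_000_000


def add_line_markers(text):
    """Local pre-processing: prefix each line with a marker (format is an assumption)."""
    return "\n".join(
        f"<line {i}> {line}" for i, line in enumerate(text.splitlines(), start=1)
    )


def reconstruct(text, llm_json):
    """Local reconstruction: wrap line ranges in tags using the LLM's compact JSON.

    Assumed (hypothetical) schema:
    {"segments": [{"label": "L1", "start": 1, "end": 2}, ...]}
    with 1-based, inclusive line numbers. XML escaping is omitted for brevity.
    """
    lines = text.splitlines()
    segments = json.loads(llm_json)["segments"]
    tagged = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        body = "\n".join(lines[seg["start"] - 1 : seg["end"]])
        tagged.append(f"<{seg['label']}>{body}</{seg['label']}>")
    return "\n".join(tagged)


def api_cost(input_tokens, output_tokens):
    """Dollar cost of one call at the quoted per-token prices."""
    return input_tokens * PRICE_PER_INPUT_TOKEN + output_tokens * PRICE_PER_OUTPUT_TOKEN


doc = "Chapter 1\nArticle 1 text\nArticle 2 text"
print(add_line_markers(doc))
fake_llm_output = json.dumps(
    {"segments": [{"label": "L1", "start": 1, "end": 1},
                  {"label": "L2", "start": 2, "end": 3}]}
)
print(reconstruct(doc, fake_llm_output))

# Full-text echo (output about equal to input) vs. compact JSON (about 10% of input).
print(api_cost(1_000_000, 1_000_000), api_cost(1_000_000, 100_000))
```

With equal input size, shrinking the output from roughly the full text to roughly a tenth of it removes most of the output-token charge, which is the more expensive side at these prices.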