The GDN-CC Dataset: Automatic Corpus Clarification for AI-enhanced Democratic Citizen Consultations
Pith reviewed 2026-05-16 12:40 UTC · model grok-4.3
The pith
Fine-tuned small language models match or outperform larger ones at turning noisy citizen consultation texts into clear argumentative units.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Corpus Clarification transforms noisy citizen contributions into structured argumentative units. Fine-tuned small language models can reproduce the manual annotations in the GDN-CC dataset of 1,231 contributions and support effective opinion clustering, which enables the creation of the 240k-example GDN-CC-large corpus.
What carries the argument
Corpus Clarification, the preprocessing framework that converts multi-topic contributions into self-contained argumentative units annotated for structure.
If this is right
- Clarified units make topic modeling and political analysis more reliable on large consultation datasets.
- Small open-weight models allow local and transparent processing without sending data to external providers.
- The 240k-example GDN-CC-large corpus becomes available as the largest annotated democratic consultation resource.
- Opinion clustering gains accuracy when run on single-topic argumentative units instead of raw multi-topic texts.
Where Pith is reading between the lines
- The method could lower barriers to using AI in public deliberations by avoiding reliance on large proprietary models.
- Similar clarification datasets could be built for citizen input in other languages or national contexts.
- Testing whether the clarified units retain the original range of citizen nuance would be a direct next measurement.
Load-bearing premise
The manual clarification process produces consistent high-quality argumentative units that small models can learn and apply to new consultation data.
What would settle it
The transfer claim would be falsified by a new set of citizen contributions, drawn from a different consultation, on which humans rate the small-model clarifications as substantially less clear or faithful than the manual standard.
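One way to make this check concrete is a paired comparison of human ratings on the same contributions. The sketch below is illustrative only: the function name, the 1-to-5 rating scale, and the sample ratings are assumptions, not anything described in the paper. It uses a one-sided sign-flip permutation test to ask whether the model clarifications are rated lower than the manual ones.

```python
import random


def paired_permutation_test(manual, model, n_resamples=10_000, seed=0):
    """One-sided paired sign-flip permutation test (hypothetical helper).

    Asks whether `model` ratings are lower than `manual` ratings on the
    same items. Returns (mean_difference, p_value), where the difference
    is manual minus model.
    """
    if len(manual) != len(model) or not manual:
        raise ValueError("need two equally sized, non-empty rating lists")

    diffs = [a - b for a, b in zip(manual, model)]
    observed = sum(diffs) / len(diffs)

    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, the sign of each paired difference
        # is arbitrary, so we flip each sign with probability 0.5.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if resampled >= observed:
            at_least_as_extreme += 1

    # The add-one correction keeps the estimated p-value strictly positive.
    p_value = (at_least_as_extreme + 1) / (n_resamples + 1)
    return observed, p_value


if __name__ == "__main__":
    # Hypothetical 1-to-5 clarity ratings for the same ten contributions.
    manual_ratings = [5, 4, 5, 4, 4, 5, 3, 4, 5, 4]
    model_ratings = [4, 4, 3, 4, 3, 4, 3, 3, 4, 4]
    mean_diff, p = paired_permutation_test(manual_ratings, model_ratings)
    print(f"mean difference = {mean_diff:.2f}, p = {p:.4f}")
```

A small p-value together with a substantial mean difference would point toward the falsification scenario. What counts as "substantially less clear or faithful" still has to be fixed in advance by the evaluators.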
Original abstract
LLMs are ubiquitous in modern NLP, and while their applicability extends to texts produced for democratic activities such as online deliberations or large-scale citizen consultations, ethical questions have been raised for their usage as analysis tools. We continue this line of research with two main goals: (a) to develop resources that can help standardize citizen contributions in public forums at the pragmatic level, and make them easier to use in topic modeling and political analysis; (b) to study how well this standardization can reliably be performed by small, open-weights LLMs, i.e. models that can be run locally and transparently with limited resources. Accordingly, we introduce Corpus Clarification as a preprocessing framework for large-scale consultation data that transforms noisy, multi-topic contributions into structured, self-contained argumentative units ready for downstream analysis. We present GDN-CC, a manually-curated dataset of 1,231 contributions to the French Grand Débat National, comprising 2,285 argumentative units annotated for argumentative structure and manually clarified. We then show that finetuned Small Language Models match or outperform LLMs on reproducing these annotations, and measure their usability for an opinion clustering task. We finally release GDN-CC-large, an automatically annotated corpus of 240k contributions, the largest annotated democratic consultation dataset to date.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a Corpus Clarification preprocessing framework that converts noisy, multi-topic citizen contributions from democratic consultations into structured, self-contained argumentative units. It presents the manually curated GDN-CC dataset of 1,231 French Grand Débat National contributions yielding 2,285 annotated argumentative units, shows that fine-tuned small language models match or outperform larger LLMs at reproducing the annotations, evaluates the resulting annotations on an opinion clustering task, and releases the automatically annotated GDN-CC-large corpus of 240k contributions as the largest such dataset to date.
Significance. If the central performance and generalization claims hold after additional validation, the work supplies a practical, ethically preferable pipeline for standardizing large-scale public consultation data using locally runnable open models. The public release of GDN-CC and GDN-CC-large would constitute a substantial resource for downstream topic modeling, political analysis, and AI-assisted democratic deliberation research.
major comments (2)
- [Model evaluation section] The manuscript reports no inter-annotator agreement statistics, exact evaluation metrics, baseline comparisons, or statistical significance tests for the claim that fine-tuned SLMs match or outperform LLMs on reproducing the annotations (Abstract and model evaluation section). These omissions make the central performance result difficult to interpret or replicate.
- [Results on GDN-CC-large] The usability claim for the 240k automatically annotated contributions rests on in-sample reproduction metrics only; the paper provides no held-out test split evaluation, no human quality ratings on samples from GDN-CC-large, and no stability comparison of opinion clustering when the model is applied to fresh consultation text (Abstract and results on GDN-CC-large). This leaves the generalization of the learned mapping unconfirmed and load-bearing for the practical contribution.
minor comments (1)
- [Abstract] The abstract could explicitly state the train/test split sizes used for the fine-tuning experiments and the precise clustering evaluation protocol.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have revised the manuscript to strengthen the reporting of evaluation details and to provide additional evidence on the generalization of the GDN-CC-large annotations. Our point-by-point responses follow.
- Referee: [Model evaluation section] The manuscript reports no inter-annotator agreement statistics, exact evaluation metrics, baseline comparisons, or statistical significance tests for the claim that fine-tuned SLMs match or outperform LLMs on reproducing the annotations (Abstract and model evaluation section). These omissions make the central performance result difficult to interpret or replicate.
  Authors: We agree that the original reporting was insufficiently detailed. In the revised manuscript we have added inter-annotator agreement statistics (Cohen's kappa for both segmentation and clarification subtasks), the precise definitions and formulas for all evaluation metrics, explicit baseline comparisons (zero-shot and few-shot prompting of the same model families plus simple rule-based methods), and statistical significance tests (paired tests with p-values). These additions are now presented in a new subsection of the model evaluation section and make the performance claims fully interpretable and replicable.
  Revision: yes
- Referee: [Results on GDN-CC-large] The usability claim for the 240k automatically annotated contributions rests on in-sample reproduction metrics only; the paper provides no held-out test split evaluation, no human quality ratings on samples from GDN-CC-large, and no stability comparison of opinion clustering when the model is applied to fresh consultation text (Abstract and results on GDN-CC-large). This leaves the generalization of the learned mapping unconfirmed and load-bearing for the practical contribution.
  Authors: We accept that stronger evidence of out-of-distribution performance is needed. The revised version now includes (i) reproduction metrics on a held-out portion of the manually annotated GDN-CC data, (ii) human quality ratings collected on a random sample of 150 contributions drawn from GDN-CC-large, and (iii) a stability experiment in which the fine-tuned model is applied to an independent citizen-consultation corpus and the resulting clusters are compared for coherence against a manually annotated reference. We also added an explicit limitations paragraph noting that exhaustive human validation of the full 240k corpus remains infeasible. These changes address the core concern while remaining within the scope of feasible revisions.
  Revision: partial
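Cohen's kappa, named in the first response above, corrects the raw agreement between two annotators for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement. A minimal sketch follows. The label names and the two toy annotation lists are hypothetical, and the source does not specify how the segmentation and clarification subtasks would be mapped onto categorical labels.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally sized, non-empty label lists")

    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label
    # frequencies, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # formally undefined here, and returning 1.0 is a convention
        # chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical labels for four text segments.
    annotator_1 = ["claim", "claim", "premise", "premise"]
    annotator_2 = ["claim", "premise", "premise", "premise"]
    print(cohens_kappa(annotator_1, annotator_2))
```

Kappa assumes exactly two annotators and categorical labels. A free-text clarification task would first need some discretisation (for example, accept or reject judgments) before this statistic applies.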
Circularity Check
No circularity: empirical dataset construction and model evaluation remain self-contained
The paper presents a manually curated dataset of 1,231 contributions with 2,285 annotated argumentative units, followed by empirical fine-tuning and evaluation of small language models on reproducing those annotations plus a downstream clustering task. No mathematical derivations, first-principles predictions, or parameter-fitting steps are claimed; performance numbers are direct empirical measurements on the provided annotations rather than outputs that reduce to the inputs by construction. No self-citations serve as load-bearing uniqueness theorems, and the automatic annotation of the 240k corpus is presented as an application of the trained models rather than a validated generalization. The work is therefore self-contained as a resource and benchmarking contribution without circular reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Manual expert annotation of argumentative structure produces consistent and usable gold labels for downstream NLP tasks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  unclear: Relation between the paper passage and the cited Recognition theorem.
  We introduce Corpus Clarification as a preprocessing framework... finetuned Small Language Models match or outperform LLMs on reproducing these annotations
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  We present GDN-CC... 2,285 argumentative units annotated for argumentative structure
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.