The GDN-CC Dataset: Automatic Corpus Clarification for AI-enhanced Democratic Citizen Consultations

Benjamin Piwowarski; François Yvon; Gaël Lejeune; Laurène Cave; Léo Labat; Pierre-Antoine Lequeu

arxiv: 2601.14944 · v3 · submitted 2026-01-21 · 💻 cs.CL

Pierre-Antoine Lequeu, Léo Labat, Laurène Cave, Gaël Lejeune, François Yvon, Benjamin Piwowarski

Pith reviewed 2026-05-16 12:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords: corpus clarification · citizen consultations · small language models · argumentative units · opinion clustering · democratic deliberation · Grand Débat National

The pith

Small language models match larger ones at turning noisy citizen consultation texts into clear argumentative units.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops Corpus Clarification as a preprocessing step that splits raw, multi-topic citizen posts into self-contained argumentative units suitable for topic modeling and political analysis. It supplies the GDN-CC dataset of 1,231 manually clarified French Grand Débat National contributions containing 2,285 annotated units. Experiments show that finetuned small language models reproduce these annotations at least as accurately as large models and perform well on subsequent opinion clustering. The work releases an automatically annotated corpus of 240,000 contributions to support further research on transparent, locally runnable analysis of democratic input.

Core claim

Corpus Clarification transforms noisy citizen contributions into structured argumentative units, and finetuned small language models can reproduce the manual annotations in the GDN-CC dataset of 1,231 contributions while supporting effective opinion clustering, enabling the creation of the 240k-example GDN-CC-large corpus.

What carries the argument

Corpus Clarification, the preprocessing framework that converts multi-topic contributions into self-contained argumentative units annotated for structure.

If this is right

Clarified units make topic modeling and political analysis more reliable on large consultation datasets.
Small open-weight models allow local and transparent processing without sending data to external providers.
The 240k-example GDN-CC-large corpus becomes available as the largest annotated democratic consultation resource.
Opinion clustering gains accuracy when run on single-topic argumentative units instead of raw multi-topic texts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could lower barriers to using AI in public deliberations by avoiding reliance on large proprietary models.
Similar clarification datasets could be built for citizen input in other languages or national contexts.
Testing whether the clarified units retain the original range of citizen nuance would be a direct next measurement.

Load-bearing premise

The manual clarification process produces consistent high-quality argumentative units that small models can learn and apply to new consultation data.

What would settle it

A new set of citizen contributions from a different consultation where humans rate the small-model clarifications as substantially less clear or faithful than the manual standard would falsify the transfer claim.

Original abstract

LLMs are ubiquitous in modern NLP, and while their applicability extends to texts produced for democratic activities such as online deliberations or large-scale citizen consultations, ethical questions have been raised for their usage as analysis tools. We continue this line of research with two main goals: (a) to develop resources that can help standardize citizen contributions in public forums at the pragmatic level, and make them easier to use in topic modeling and political analysis; (b) to study how well this standardization can reliably be performed by small, open-weights LLMs, i.e. models that can be run locally and transparently with limited resources. Accordingly, we introduce Corpus Clarification as a preprocessing framework for large-scale consultation data that transforms noisy, multi-topic contributions into structured, self-contained argumentative units ready for downstream analysis. We present GDN-CC, a manually-curated dataset of 1,231 contributions to the French Grand Débat National, comprising 2,285 argumentative units annotated for argumentative structure and manually clarified. We then show that finetuned Small Language Models match or outperform LLMs on reproducing these annotations, and measure their usability for an opinion clustering task. We finally release GDN-CC-large, an automatically annotated corpus of 240k contributions, the largest annotated democratic consultation dataset to date.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper delivers a practical dataset for clarifying citizen inputs and competitive small-model results, but the large corpus application lacks external validation.

The main thing to know is that this paper creates the GDN-CC gold dataset of 1,231 French Grand Débat contributions turned into 2,285 clarified argumentative units, releases a 240k auto-annotated version, and shows fine-tuned small models match or beat larger LLMs at reproducing the manual clarifications while testing the output on opinion clustering. The Corpus Clarification framework itself is a straightforward preprocessing step that breaks noisy multi-topic posts into self-contained units for easier downstream use in topic modeling and political analysis. Releasing both the curated set and the large corpus is the clearest practical step forward here, especially since it targets ethical concerns around running big models on public consultation data by favoring local small models. The motivation and task setup read as honest and grounded in real civic NLP needs.

The soft spot is the lack of checks on how well the automatic clarifications hold up beyond the gold set. The reported numbers and clustering results come from the 1,231 examples, with no held-out split, no human ratings on samples from the 240k items, and no test on fresh consultation text. That leaves the usability claim for the big corpus as an in-sample result whose reliability for new data is unconfirmed.

The paper is aimed at researchers doing applied work on political text, argument mining, or civic platforms, especially anyone who needs French-language resources or a preprocessing pipeline for messy forum data. Readers looking for datasets to experiment with will find it directly useful. It deserves a serious referee because the resources are new and the core idea is workable, even if the validation on scale needs tightening. I would send it for peer review with a request to add out-of-sample checks.

Referee Report

2 major / 1 minor

Summary. The paper introduces a Corpus Clarification preprocessing framework that converts noisy, multi-topic citizen contributions from democratic consultations into structured, self-contained argumentative units. It presents the manually curated GDN-CC dataset of 1,231 French Grand Débat National contributions yielding 2,285 annotated argumentative units, shows that fine-tuned small language models match or outperform larger LLMs at reproducing the annotations, evaluates the resulting annotations on an opinion clustering task, and releases the automatically annotated GDN-CC-large corpus of 240k contributions as the largest such dataset to date.

Significance. If the central performance and generalization claims hold after additional validation, the work supplies a practical, ethically preferable pipeline for standardizing large-scale public consultation data using locally runnable open models. The public release of GDN-CC and GDN-CC-large would constitute a substantial resource for downstream topic modeling, political analysis, and AI-assisted democratic deliberation research.

major comments (2)

[Model evaluation section] The manuscript reports no inter-annotator agreement statistics, exact evaluation metrics, baseline comparisons, or statistical significance tests for the claim that fine-tuned SLMs match or outperform LLMs on reproducing the annotations (Abstract and model evaluation section). These omissions make the central performance result difficult to interpret or replicate.
[Results on GDN-CC-large] The usability claim for the 240k automatically annotated contributions rests on in-sample reproduction metrics only; the paper provides no held-out test split evaluation, no human quality ratings on samples from GDN-CC-large, and no stability comparison of opinion clustering when the model is applied to fresh consultation text (Abstract and results on GDN-CC-large). This leaves the generalization of the learned mapping unconfirmed and load-bearing for the practical contribution.

minor comments (1)

[Abstract] The abstract could explicitly state the train/test split sizes used for the fine-tuning experiments and the precise clustering evaluation protocol.
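
The held-out evaluation requested in the second major comment could be set up along the following lines. This is a minimal sketch, not the authors' protocol: the field name `contribution_id`, the helper names `assign_split` and `split_units`, and the 20% test fraction are all assumptions made for illustration. The one design point it shows is splitting at the contribution level, so that argumentative units drawn from the same citizen contribution never end up on both sides of the split.

```python
import hashlib


def assign_split(contribution_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a contribution id to 'train' or 'test'.

    Hashing the id (rather than shuffling) keeps the assignment stable
    across runs and independent of the order of the input.
    """
    digest = hashlib.sha256(contribution_id.encode("utf-8")).hexdigest()
    # Turn the first 32 bits of the hash into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


def split_units(units, test_fraction: float = 0.2):
    """Split units so that all units of one contribution share a side.

    `units` is assumed to be a list of dicts with a hypothetical
    'contribution_id' key; this is not the released data schema.
    """
    train, test = [], []
    for unit in units:
        side = assign_split(unit["contribution_id"], test_fraction)
        (test if side == "test" else train).append(unit)
    return train, test


if __name__ == "__main__":
    # Toy data: 20 contributions with two units each.
    toy_units = [
        {"contribution_id": f"c{i // 2}", "text": f"unit {i}"}
        for i in range(40)
    ]
    train, test = split_units(toy_units)
    print(len(train), "training units,", len(test), "held-out units")
```

Splitting by unit instead would let near-duplicate context from one contribution leak into the test set, which would tend to inflate reproduction scores.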

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to strengthen the reporting of evaluation details and to provide additional evidence on the generalization of the GDN-CC-large annotations. Our point-by-point responses follow.

Referee: [Model evaluation section] The manuscript reports no inter-annotator agreement statistics, exact evaluation metrics, baseline comparisons, or statistical significance tests for the claim that fine-tuned SLMs match or outperform LLMs on reproducing the annotations (Abstract and model evaluation section). These omissions make the central performance result difficult to interpret or replicate.

Authors: We agree that the original reporting was insufficiently detailed. In the revised manuscript we have added inter-annotator agreement statistics (Cohen's kappa for both segmentation and clarification subtasks), the precise definitions and formulas for all evaluation metrics, explicit baseline comparisons (zero-shot and few-shot prompting of the same model families plus simple rule-based methods), and statistical significance tests (paired tests with p-values). These additions are now presented in a new subsection of the model evaluation section and make the performance claims fully interpretable and replicable.
revision: yes

Referee: [Results on GDN-CC-large] The usability claim for the 240k automatically annotated contributions rests on in-sample reproduction metrics only; the paper provides no held-out test split evaluation, no human quality ratings on samples from GDN-CC-large, and no stability comparison of opinion clustering when the model is applied to fresh consultation text (Abstract and results on GDN-CC-large). This leaves the generalization of the learned mapping unconfirmed and load-bearing for the practical contribution.

Authors: We accept that stronger evidence of out-of-distribution performance is needed. The revised version now includes (i) reproduction metrics on a held-out portion of the manually annotated GDN-CC data, (ii) human quality ratings collected on a random sample of 150 contributions drawn from GDN-CC-large, and (iii) a stability experiment in which the fine-tuned model is applied to an independent citizen-consultation corpus and the resulting clusters are compared for coherence against a manually annotated reference. We also added an explicit limitations paragraph noting that exhaustive human validation of the full 240k corpus remains infeasible. These changes address the core concern while remaining within the scope of feasible revisions.
revision: partial
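
For readers unfamiliar with the agreement statistic named in the first response, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a plain-Python illustration under stated assumptions. The function name `cohen_kappa` and the toy labels `B`, `I`, `O` are hypothetical and are not taken from the paper's annotation scheme.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two
    # annotators' marginal frequencies, summed over labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical token-level labels from two annotators.
    annotator_a = ["B", "I", "I", "B", "I", "O"]
    annotator_b = ["B", "I", "O", "B", "I", "O"]
    # Observed agreement is 5/6, chance agreement is 12/36 = 1/3,
    # so kappa = (5/6 - 1/3) / (1 - 1/3) = 0.75.
    print(round(cohen_kappa(annotator_a, annotator_b), 3))
```

Kappa applies directly to categorical decisions such as segmentation labels; how agreement on free-text clarifications would be scored is not specified in the text above.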

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and model evaluation remain self-contained

The paper presents a manually curated dataset of 1,231 contributions with 2,285 annotated argumentative units, followed by empirical fine-tuning and evaluation of small language models on reproducing those annotations plus a downstream clustering task. No mathematical derivations, first-principles predictions, or parameter-fitting steps are claimed; performance numbers are direct empirical measurements on the provided annotations rather than outputs that reduce to the inputs by construction. No self-citations serve as load-bearing uniqueness theorems, and the automatic annotation of the 240k corpus is presented as an application of the trained models rather than a validated generalization. The work is therefore self-contained as a resource and benchmarking contribution without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that expert manual clarification produces reliable gold labels and that small-model fine-tuning can generalize from this limited set. No free parameters or invented entities are introduced; the work relies on standard supervised learning assumptions.

axioms (1)

domain assumption: Manual expert annotation of argumentative structure produces consistent and usable gold labels for downstream NLP tasks.
The entire evaluation pipeline depends on the quality and consistency of the 1,231 manually clarified contributions and their 2,285 argumentative units.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We introduce Corpus Clarification as a preprocessing framework... finetuned Small Language Models match or outperform LLMs on reproducing these annotations

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We present GDN-CC... 2,285 argumentative units annotated for argumentative structure

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.