arxiv: 2512.04316 · v7 · submitted 2025-12-03 · 💻 cs.HC

ConsentDiff at Scale: Longitudinal Audits of Web Privacy Policy Changes and UI Frictions

Pith reviewed 2026-05-17 01:42 UTC · model grok-4.3

classification 💻 cs.HC
keywords privacy policy · consent interface · longitudinal audit · web measurement · UI friction · policy churn · consent banner · alignment score

The pith

Longitudinal audits show privacy policies keep churning while consent banners shift toward easier rejection and better policy alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ConsentDiff, a pipeline that takes monthly snapshots of websites to track how privacy policy text and consent user interfaces evolve together. It semantically aligns individual policy clauses across time to measure churn and combines DOM signals with screenshot cues to classify common UI patterns such as banner designs. A new weighted claim-UI alignment score then connects specific policy promises to observable interface features, enabling comparisons across time, regions, and site categories. Measurements indicate ongoing clause-level policy changes, a systematic reduction in higher-friction banner designs, and substantially higher alignment scores on sites where the reject option is visible and low-effort. This approach addresses the gap in understanding whether consent interfaces actually deliver on the commitments stated in policies.

Core claim

ConsentDiff provides a reproducible pipeline that snapshots sites every month, semantically aligns policy clauses to track clause-level churn, and classifies consent-UI patterns by combining DOM signals with cues from screenshots. It introduces a weighted claim-UI alignment score that links common policy claims to observable predicates, supporting comparisons over time, regions, and verticals. The resulting measurements indicate continued policy churn, systematic changes to eliminate a higher-friction banner design, and significantly higher alignment where rejecting is visible and lower friction.
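To make the clause-level churn idea concrete, here is a minimal sketch under stated assumptions. The paper's actual alignment method is not described on this page, so this example swaps in lexical similarity from Python's `difflib` as a stand-in for semantic similarity. The function names, the greedy matching strategy, the 0.6 threshold, and the churn definition are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Lexical similarity in [0, 1]; a stand-in for a semantic similarity measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align_clauses(old, new, threshold=0.6):
    """Greedily pair old and new clauses, highest similarity first (assumed strategy)."""
    candidates = sorted(
        ((similarity(o, n), i, j) for i, o in enumerate(old) for j, n in enumerate(new)),
        reverse=True,
    )
    pairs, used_old, used_new = [], set(), set()
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining candidates are all below the threshold
        if i in used_old or j in used_new:
            continue
        pairs.append((i, j, score))
        used_old.add(i)
        used_new.add(j)
    return pairs


def churn_summary(old, new, threshold=0.6):
    """Count unchanged, modified, removed, and added clauses between two snapshots."""
    pairs = align_clauses(old, new, threshold)
    unchanged = sum(1 for _, _, s in pairs if s == 1.0)
    modified = len(pairs) - unchanged
    removed = len(old) - len(pairs)
    added = len(new) - len(pairs)
    total = unchanged + modified + removed + added
    churn = (modified + removed + added) / total if total else 0.0
    return {"unchanged": unchanged, "modified": modified,
            "removed": removed, "added": added, "churn": churn}


# Made-up clauses from two hypothetical monthly snapshots.
january = [
    "We share data with advertising partners.",
    "You can reject cookies at any time.",
    "We retain logs for 30 days.",
]
february = [
    "We share data with advertising and analytics partners.",
    "You can reject cookies at any time.",
    "We sell aggregated statistics to researchers.",
]
print(churn_summary(january, february))
```

A production pipeline would likely replace `similarity` with an embedding-based measure and use an optimal assignment instead of greedy matching; the bookkeeping around it would stay the same.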

What carries the argument

The ConsentDiff pipeline, which performs monthly site snapshots, semantic clause alignment for policy churn tracking, and DOM-plus-screenshot classification of consent UI patterns to produce a weighted claim-UI alignment score.
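One plausible reading of a weighted claim-UI alignment score is a weighted fraction of policy claims whose UI predicate holds. The sketch below illustrates that reading only; the claim names, weights, predicates, and UI feature keys are assumptions made for illustration and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set

UIObservation = Dict[str, object]


@dataclass(frozen=True)
class ClaimRule:
    claim: str                                  # policy claim identifier (hypothetical)
    weight: float                               # relative importance (hypothetical)
    predicate: Callable[[UIObservation], bool]  # observable check on the consent UI


RULES = [
    ClaimRule("reject_available", 3.0,
              lambda ui: bool(ui.get("reject_visible"))),
    ClaimRule("reject_as_easy_as_accept", 2.0,
              lambda ui: ui.get("clicks_to_reject", 99) <= ui.get("clicks_to_accept", 1)),
    ClaimRule("no_preticked_purposes", 1.0,
              lambda ui: ui.get("preticked_boxes", 0) == 0),
]


def alignment_score(policy_claims: Set[str], ui: UIObservation,
                    rules=RULES) -> Optional[float]:
    """Weighted share of the policy's claims that the observed UI satisfies.

    Returns None when the policy makes none of the tracked claims.
    """
    applicable = [r for r in rules if r.claim in policy_claims]
    total = sum(r.weight for r in applicable)
    if total == 0:
        return None
    met = sum(r.weight for r in applicable if r.predicate(ui))
    return met / total


claims = {"reject_available", "reject_as_easy_as_accept", "no_preticked_purposes"}
low_friction = {"reject_visible": True, "clicks_to_reject": 1,
                "clicks_to_accept": 1, "preticked_boxes": 0}
high_friction = {"reject_visible": False, "clicks_to_reject": 3,
                 "clicks_to_accept": 1, "preticked_boxes": 0}
print(alignment_score(claims, low_friction), alignment_score(claims, high_friction))
```

Normalizing by the weights of the claims a policy actually makes keeps scores comparable between sites whose policies promise different things.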

Load-bearing premise

The pipeline's semantic alignment of policy clauses and classification of UI patterns from DOM and screenshots accurately capture real-world policy-UI relationships without substantial interpretation errors or sampling bias.
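The DOM half of the classification named in this premise could be as simple as keyword rules over button labels, which also shows where interpretation errors can creep in. The sketch below uses only Python's standard `html.parser`; the keyword lists and the pattern names are hypothetical, and the screenshot cues the paper combines with DOM signals are not modeled.

```python
from html.parser import HTMLParser

# Hypothetical keyword lists; a real classifier would need multilingual coverage.
ACCEPT = ("accept", "agree", "allow all")
REJECT = ("reject", "decline", "deny")
SETTINGS = ("settings", "manage", "preferences", "options")


class ButtonCollector(HTMLParser):
    """Collects the visible text of <button> and <a> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0
        self._buf = []
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag in ("button", "a"):
            self._depth += 1
            if self._depth == 1:
                self._buf = []

    def handle_endtag(self, tag):
        if tag in ("button", "a") and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                label = " ".join("".join(self._buf).split()).lower()
                if label:
                    self.labels.append(label)

    def handle_data(self, data):
        if self._depth > 0:
            self._buf.append(data)


def classify_banner(html: str) -> str:
    """Map first-layer banner markup to a coarse friction pattern (assumed taxonomy)."""
    parser = ButtonCollector()
    parser.feed(html)
    parser.close()

    def has(words):
        return any(w in label for label in parser.labels for w in words)

    if has(REJECT):
        return "reject_on_first_layer"
    if has(SETTINGS):
        return "reject_behind_settings"
    if has(ACCEPT):
        return "accept_only"
    return "no_consent_controls"


print(classify_banner("<div><button>Accept all</button><button>Reject all</button></div>"))
```

Rules like these fail on icon-only buttons, localized labels, and banners rendered inside iframes or shadow DOM, which is one reason the premise is load-bearing.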

What would settle it

A manual audit of several hundred sites revealing that the computed alignment scores frequently mismatch human judgments of whether the displayed consent interface actually implements the specific claims found in the current policy text.
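A sketch of how such an audit might be summarized: binarize the computed score, compare it with the human judgment for each site, and report the mismatch rate with a Wilson score interval. The 0.5 threshold, the function names, and the example numbers are assumptions for illustration only.

```python
import math


def count_mismatches(scores, human_aligned, threshold=0.5):
    """Count sites where the thresholded score disagrees with the human label."""
    if len(scores) != len(human_aligned) or not scores:
        raise ValueError("inputs must be non-empty and of equal length")
    mismatches = sum((s >= threshold) != bool(h)
                     for s, h in zip(scores, human_aligned))
    return mismatches, len(scores)


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n (z=1.96 for about 95%)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Made-up audit: four sites, one disagreement.
k, n = count_mismatches([0.9, 0.2, 0.7, 0.4], [True, False, False, False])
print(k, n, wilson_interval(k, n))
```

With only a few hundred audited sites the interval stays wide, so a single point estimate of the mismatch rate would overstate how much the audit settles.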

Figures

Figures reproduced from arXiv: 2512.04316 by Haoze Guo.

Figure 1: Alignment score distributions by region (top) and …

Original abstract

Web privacy is experienced via two public artifacts: site utterances in policy texts, and the actions users are required to take during consent interfaces. In the extensive cross-section audits we've studied, there is a lack of longitudinal data detailing how these artifacts are changing together, and if interfaces are actually doing what they promise in policy. ConsentDiff provides that longitudinal view. We build a reproducible pipeline that snapshots sites every month, semantically aligns policy clauses to track clause-level churn, and classifies consent-UI patterns by pulling together DOM signals with cues provided by screenshots. We introduce a novel weighted claim-UI alignment score, connecting common policy claims to observable predicates, and enabling comparisons over time, regions, and verticals. Our measurements suggest continued policy churn, systematic changes to eliminate a higher-friction banner design, and significantly higher alignment where rejecting is visible and lower friction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces ConsentDiff, a reproducible pipeline for monthly website snapshots that semantically aligns privacy policy clauses to measure clause-level churn, classifies consent UI patterns by combining DOM signals with screenshot cues, and defines a novel weighted claim-UI alignment score linking policy claims to observable UI predicates. Measurements indicate continued policy churn, systematic removal of higher-friction banner designs, and significantly higher alignment scores in cases where rejection options are visible and friction is low.

Significance. If the pipeline components prove reliable, the work supplies valuable longitudinal empirical data on the co-evolution of privacy policy text and consent interfaces, enabling comparisons across time, regions, and verticals. The reproducible pipeline and the claim-UI alignment score are concrete strengths that support falsifiable follow-up studies and could inform regulatory audits of GDPR/CCPA-style consent mechanisms.

major comments (1)
  1. [Pipeline and measurement sections] The abstract and methods description present the semantic alignment procedure and UI classifier as central to all reported trends, yet supply no accuracy metrics, error rates, inter-annotator agreement, or ground-truth validation for either component. Because the headline claims of continued churn, systematic banner changes, and 'significantly higher alignment' rest directly on these measurements, the absence of quantitative validation is load-bearing and must be addressed before the observational results can be interpreted with confidence.
minor comments (2)
  1. [Abstract] The phrase 'significantly higher alignment' should be accompanied by the statistical test, p-value threshold, and effect-size information used.
  2. [Data collection] The manuscript would benefit from an explicit statement of the sampling frame (how sites and regions were selected) and any exclusion rules applied to snapshots.
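As an illustration of what the first minor comment asks for, the sketch below pairs a two-sided permutation test with Cohen's d for two groups of alignment scores. The paper does not state which test it used, so this is one reasonable choice and not the authors' method; the scores are made-up numbers.

```python
import random
from statistics import mean, stdev


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
              / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled


def permutation_p_value(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the absolute difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Made-up alignment scores: sites with a visible reject button vs. sites without.
reject_visible = [0.90, 0.85, 0.80, 0.95, 0.90, 0.88]
reject_hidden = [0.50, 0.55, 0.60, 0.45, 0.52, 0.58]
print(permutation_p_value(reject_visible, reject_hidden),
      cohens_d(reject_visible, reject_hidden))
```

Reporting the p-value together with the effect size lets readers separate "statistically detectable" from "practically large", which is the distinction the comment is pressing on.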

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for validation of the core pipeline components. We agree this is a substantive issue that must be addressed to strengthen the interpretability of the results and will incorporate the requested metrics in the revision.

Point-by-point responses
  1. Referee: [Pipeline and measurement sections] The abstract and methods description present the semantic alignment procedure and UI classifier as central to all reported trends, yet supply no accuracy metrics, error rates, inter-annotator agreement, or ground-truth validation for either component. Because the headline claims of continued churn, systematic banner changes, and 'significantly higher alignment' rest directly on these measurements, the absence of quantitative validation is load-bearing and must be addressed before the observational results can be interpreted with confidence.

    Authors: We acknowledge that the submitted manuscript does not report quantitative validation metrics (accuracy, error rates, inter-annotator agreement, or ground-truth comparisons) for the semantic alignment procedure or the UI classifier. This omission limits confidence in the downstream claims, as noted. In the revised manuscript we will add a new subsection under Methods that details: (1) a manually annotated ground-truth set of 200 policy clauses for semantic alignment, with reported precision/recall and inter-annotator agreement (Cohen’s kappa); (2) a held-out test set of 150 consent UIs with screenshot+DOM labels, reporting classification accuracy and confusion matrices; and (3) an explicit discussion of remaining error sources and their potential impact on the longitudinal trends. We will also release the validation annotations alongside the pipeline code to support reproducibility.

    revision: yes
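The validation metrics promised in this response are standard and easy to compute once annotations exist. A minimal sketch, with made-up labels and hypothetical function names, of Cohen's kappa for two annotators and precision/recall for one class:

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def precision_recall(gold, predicted, positive):
    """Precision and recall of `predicted` against `gold` for one class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, predicted))
    fp = sum(g != positive and p == positive for g, p in zip(gold, predicted))
    fn = sum(g == positive and p != positive for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for eight clause pairs: did the clause change between snapshots?
annotator_1 = ["same", "same", "changed", "changed", "same", "changed", "same", "same"]
annotator_2 = ["same", "same", "changed", "same", "same", "changed", "changed", "same"]
print(cohens_kappa(annotator_1, annotator_2))
print(precision_recall(annotator_1, annotator_2, positive="changed"))
```

Here raw agreement is 6 of 8, but kappa is much lower because both annotators label most pairs "same", so a large share of the agreement is expected by chance.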

Circularity Check

0 steps flagged

No significant circularity in empirical measurement pipeline

Full rationale

The paper is an empirical measurement study that builds a reproducible pipeline to snapshot websites monthly, semantically align policy clauses for churn tracking, classify consent UI patterns from DOM signals and screenshots, and compute a novel weighted claim-UI alignment score linking policy claims to observable predicates. No derivation chain, equations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations are present in the abstract or description. The central findings rely on the pipeline's outputs as direct measurements rather than reducing to inputs by construction. This is self-contained empirical work; the absence of validation metrics is a separate limitation on reliability, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the work relies on standard semantic alignment and DOM/screenshot classification techniques.

pith-pipeline@v0.9.0 · 5438 in / 1004 out tokens · 22770 ms · 2026-05-17T01:42:49.619797+00:00 · methodology
