ConsentDiff at Scale: Longitudinal Audits of Web Privacy Policy Changes and UI Frictions
Pith reviewed 2026-05-17 01:42 UTC · model grok-4.3
The pith
Longitudinal audits show privacy policies keep churning while consent banners shift toward easier rejection and better policy alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ConsentDiff provides a reproducible pipeline that snapshots sites every month, semantically aligns policy clauses to track clause-level churn, and classifies consent-UI patterns by combining DOM signals with cues from screenshots. It introduces a weighted claim-UI alignment score that links common policy claims to observable predicates, supporting comparisons over time, regions, and verticals. The resulting measurements indicate continued policy churn, systematic changes to eliminate a higher-friction banner design, and significantly higher alignment where rejecting is visible and lower friction.
What carries the argument
The ConsentDiff pipeline, which performs monthly site snapshots, semantic clause alignment for policy churn tracking, and DOM-plus-screenshot classification of consent UI patterns to produce a weighted claim-UI alignment score.
Load-bearing premise
The pipeline's semantic alignment of policy clauses and classification of UI patterns from DOM and screenshots accurately capture real-world policy-UI relationships without substantial interpretation errors or sampling bias.
What would settle it
A manual audit of several hundred sites revealing that the computed alignment scores frequently mismatch human judgments of whether the displayed consent interface actually implements the specific claims found in the current policy text.
Figures
read the original abstract
Web privacy is experienced via two public artifacts: site utterances in policy texts, and the actions users are required to take during consent interfaces. In the extensive cross-section audits we've studied, there is a lack of longitudinal data detailing how these artifacts are changing together, and if interfaces are actually doing what they promise in policy. ConsentDiff provides that longitudinal view. We build a reproducible pipeline that snapshots sites every month, semantically aligns policy clauses to track clause-level churn, and classifies consent-UI patterns by pulling together DOM signals with cues provided by screenshots. We introduce a novel weighted claim-UI alignment score, connecting common policy claims to observable predicates, and enabling comparisons over time, regions, and verticals. Our measurements suggest continued policy churn, systematic changes to eliminate a higher-friction banner design, and significantly higher alignment where rejecting is visible and lower friction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ConsentDiff, a reproducible pipeline for monthly website snapshots that semantically aligns privacy policy clauses to measure clause-level churn, classifies consent UI patterns by combining DOM signals with screenshot cues, and defines a novel weighted claim-UI alignment score linking policy claims to observable UI predicates. Measurements indicate continued policy churn, systematic removal of higher-friction banner designs, and significantly higher alignment scores in cases where rejection options are visible and friction is low.
Significance. If the pipeline components prove reliable, the work supplies valuable longitudinal empirical data on the co-evolution of privacy policy text and consent interfaces, enabling comparisons across time, regions, and verticals. The reproducible pipeline and the claim-UI alignment score are concrete strengths that support falsifiable follow-up studies and could inform regulatory audits of GDPR/CCPA-style consent mechanisms.
major comments (1)
- [Pipeline and measurement sections] Pipeline and measurement sections: the abstract and methods description present the semantic alignment procedure and UI classifier as central to all reported trends, yet supply no accuracy metrics, error rates, inter-annotator agreement, or ground-truth validation for either component. Because the headline claims of continued churn, systematic banner changes, and 'significantly higher alignment' rest directly on these measurements, the absence of quantitative validation is load-bearing and must be addressed before the observational results can be interpreted with confidence.
minor comments (2)
- [Abstract] Abstract: the phrase 'significantly higher alignment' should be accompanied by the statistical test, p-value threshold, and effect-size information used.
- [Data collection] The manuscript would benefit from an explicit statement of the sampling frame (how sites and regions were selected) and any exclusion rules applied to snapshots.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for validation of the core pipeline components. We agree this is a substantive issue that must be addressed to strengthen the interpretability of the results and will incorporate the requested metrics in the revision.
read point-by-point responses
-
Referee: [Pipeline and measurement sections] Pipeline and measurement sections: the abstract and methods description present the semantic alignment procedure and UI classifier as central to all reported trends, yet supply no accuracy metrics, error rates, inter-annotator agreement, or ground-truth validation for either component. Because the headline claims of continued churn, systematic banner changes, and 'significantly higher alignment' rest directly on these measurements, the absence of quantitative validation is load-bearing and must be addressed before the observational results can be interpreted with confidence.
Authors: We acknowledge that the submitted manuscript does not report quantitative validation metrics (accuracy, error rates, inter-annotator agreement, or ground-truth comparisons) for the semantic alignment procedure or the UI classifier. This omission limits confidence in the downstream claims, as noted. In the revised manuscript we will add a new subsection under Methods that details: (1) a manually annotated ground-truth set of 200 policy clauses for semantic alignment, with reported precision/recall and inter-annotator agreement (Cohen’s kappa); (2) a held-out test set of 150 consent UIs with screenshot+DOM labels, reporting classification accuracy and confusion matrices; and (3) an explicit discussion of remaining error sources and their potential impact on the longitudinal trends. We will also release the validation annotations alongside the pipeline code to support reproducibility. revision: yes
Circularity Check
No significant circularity in empirical measurement pipeline
full rationale
The paper is an empirical measurement study that builds a reproducible pipeline to snapshot websites monthly, semantically align policy clauses for churn tracking, classify consent UI patterns from DOM signals and screenshots, and compute a novel weighted claim-UI alignment score linking policy claims to observable predicates. No derivation chain, equations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations are present in the abstract or description. The central findings rely on the pipeline's outputs as direct measurements rather than reducing to inputs by construction. This is self-contained empirical work; the absence of validation metrics is a separate limitation on reliability, not circularity.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We build a reproducible pipeline that snapshots sites every month, semantically aligns policy clauses to track clause-level churn, and classifies consent-UI patterns by pulling together DOM signals with cues provided by screenshots. We introduce a novel weighted claim-UI alignment score
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We pair policy claims with necessary UI predicates (e.g., default-off, visible “Reject all”, steps-to-reject≤ 2) to obtain an alignment score A∈[0,1]
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. 2014. The Web Never Forgets: Persistent Tracking Mechanisms in the Wild. InProceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS). 674–689. doi:10.1145/2660267. 2660347
-
[2]
Angrist and Jörn-Steffen Pischke
Joshua D. Angrist and Jörn-Steffen Pischke. 2009.Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press
work page 2009
-
[3]
Martin Degeling, Christine Utz, Christopher Lentzsch, Henry Hosseini, Florian Schaub, and Thorsten Holz. 2019. We Value Your Privacy... Now Take Some Cook- ies: Measuring the GDPR’s Impact on Web Privacy. InNetwork and Distributed System Security Symposium (NDSS). https://www.ndss-symposium.org/wp- content/uploads/2019/02/ndss2019_01A-3_Degeling_paper.pdf
work page 2019
-
[4]
Steven Englehardt and Arvind Narayanan. 2016. Online Tracking: A 1-Million- Site Measurement and Analysis. InNetwork and Distributed System Security Symposium (NDSS). https://webtransparency.cs.princeton.edu/webcensus/
work page 2016
-
[5]
European Data Protection Board. 2020. Guidelines 05/2020 on Consent under Regulation 2016/679. https://edpb.europa.eu/our-work-tools/our-documents/ guidelines/guidelines-052020-consent-under-regulation-2016679_en
work page 2020
-
[6]
Gray, Nataliia Bielova, Cristiana Santos, et al
Colin M. Gray, Nataliia Bielova, Cristiana Santos, et al . 2021. Dark Patterns and the Legal Requirements of Consent Banners. InProceedings of the 2021 CHI Conference on Human Factors in Computing Systems. ACM. https://www- sop.inria.fr/members/Nataliia.Bielova/papers/Gray-etal-21-CHI.pdf
work page 2021
- [7]
- [8]
-
[9]
Haoze Guo and Ziqi Wei. 2026. Temporal Drift in Privacy Recall: Users Misremem- ber From Verbatim Loss to Gist-Based Overexposure. arXiv:2509.16962 [cs.HC]
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[10]
Hamza Harkous, Kassem Fawaz, Reza Shokri, Bryan Ford, and Karl Aberer. 2018. Polisis: Automated Analysis and Presentation of Privacy Policies Using Deep Learning. In27th USENIX Security Symposium (USENIX Security). 531–548. https: //www.usenix.org/conference/usenixsecurity18/presentation/harkous
work page 2018
-
[11]
IAB Europe. 2020. Transparency & Consent Framework (TCF) v2.0: Policies and Specifications. https://iabeurope.eu/tcf-2-0/
work page 2020
-
[12]
Rebecca Killick, Paul Fearnhead, and Idris A. Eckley. 2012. Optimal Detection of Changepoints With a Linear Computational Cost.J. Amer. Statist. Assoc.107, 500 (2012), 1590–1598. doi:10.1080/01621459.2012.737745
-
[13]
Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner
-
[14]
InProceedings of the 2016 ACM Web Science Conference (WebSci)
Internet Jones and the Raiders of the Lost Trackers: An Archaeological Study of Web Tracking from 1996 to 2016. InProceedings of the 2016 ACM Web Science Conference (WebSci). 237–246. doi:10.1145/2908131.2908165
-
[15]
Vladimir I. Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals.Soviet Physics Doklady10 (1966), 707–710
work page 1966
-
[16]
Marco Lippi, Paolo Torroni, et al. 2019. CLAUDETTE: an Automated Detector of Potentially Unfair Clauses in Online Terms of Service.Artificial Intelligence and Law27, 2 (2019), 117–139. doi:10.1007/s10506-019-09243-2
-
[17]
Arunesh Mathur, Gunes Acar, Michael J. Friedman, Elena Lucherini, Jonathan Mayer, Marshini Chetty, and Arvind Narayanan. 2019. Dark Patterns at Scale: Findings from a Crawl of 11K Shopping Websites.Proceedings of the ACM on ConsentDiff at Scale: Longitudinal Audits of Web Privacy Policy Changes and UI Frictions CHI EA ’26, April 13–17, 2026, Barcelona, Sp...
-
[18]
Célestin Matte, Nataliia Bielova, and Cristiana Santos. 2020. Do Cookie Banners Respect My Choice? Measuring Legal Compliance of Banners from IAB Europe’s Transparency and Consent Framework. In2020 IEEE Symposium on Security and Privacy (SP). IEEE, 791–809. doi:10.1109/SP40000.2020.00025
-
[19]
Midas Nouwens, Ilaria Liccardi, Michael Veale, David Karger, and Lalana Kagal
-
[20]
URLhttp://dx.doi.org/10.1145/3313831.3376327
Dark Patterns after the GDPR: Scraping Consent Pop-ups and Demonstrat- ing Their Influence. InProceedings of the 2020 CHI Conference on Human Factors in Computing Systems. ACM. doi:10.1145/3313831.3376321
-
[21]
Victor Le Pochat, Tom Van Goethem, Samaneh Tajalizadehkhoob, Maciej Ko- rczyński, and Wouter Joosen. 2019. Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation. InNetwork and Distributed System Security Symposium (NDSS). https://tranco-list.eu
work page 2019
-
[22]
Data Programming: Creating Large Training Sets, Quickly
Alexander J. Ratner, Christopher M. De Sa, Sen Wu, Daniel Selsam, and Christo- pher Ré. 2017. Data Programming: Creating Large Training Sets, Quickly. In Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/ abs/1605.07723
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[23]
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 3982–3992. https://arxiv. org/abs/1908.10084
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[24]
Jeffrey M. Wooldridge. 2010.Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press
work page 2010
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.