
arxiv: 2604.27550 · v1 · submitted 2026-04-30 · 💻 cs.CL · cs.AI

APPSI-139: A Parallel Corpus of English Application Privacy Policy Summarization and Interpretation

Pith reviewed 2026-05-07 08:24 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords privacy policy · summarization · parallel corpus · hybrid framework · readability · reliability · legal interpretation · data practices

The pith

A new expert-annotated corpus and hybrid framework let smaller AI systems summarize and interpret privacy policies with better readability and reliability than GPT-4o or LLaMA-3-70B.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Privacy policies are often long, complex, and filled with legal and technical terms that leave users unaware of how their data is handled. The paper introduces APPSI-139, a corpus of 139 English policies accompanied by expert-rewritten parallel summaries and over 36,000 fine-grained annotations across 11 data practice categories. It also presents TCSI-pp-V2, a hybrid framework that alternates training across multiple expert modules to balance accuracy and efficiency. Experiments indicate that systems built on this corpus and framework produce summaries judged more readable and reliable than those from GPT-4o and LLaMA-3-70B. If the results hold, users could gain practical access to clearer explanations of data practices without relying solely on general large language models.

Core claim

The paper establishes that the APPSI-139 parallel corpus of 139 English privacy policies, containing 15,692 rewritten summary pairs and 36,351 annotations across 11 categories, combined with the TCSI-pp-V2 hybrid framework that coordinates expert modules through alternating training, produces summarization and interpretation outputs superior in readability and reliability to those from GPT-4o and LLaMA-3-70B.
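
To give a sense of scale, the reported totals can be turned into per-policy averages. The sketch below uses only the counts quoted above. The averages are plain arithmetic, and they assume nothing about how pairs or annotations are actually distributed across individual policies, which this summary does not report.

```python
# Counts as reported in the paper's abstract.
NUM_POLICIES = 139
NUM_REWRITTEN_PAIRS = 15_692
NUM_ANNOTATIONS = 36_351


def per_policy(total: int, policies: int = NUM_POLICIES) -> float:
    """Average number of items per policy, rounded to one decimal place."""
    return round(total / policies, 1)


# Averages only; the real per-policy distribution may be skewed.
pairs_per_policy = per_policy(NUM_REWRITTEN_PAIRS)
annotations_per_policy = per_policy(NUM_ANNOTATIONS)
print(pairs_per_policy, annotations_per_policy)
```

On average, then, each policy contributes a little over a hundred rewritten pairs and a few hundred annotation labels.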

What carries the argument

APPSI-139 provides the domain-specific parallel corpus with expert rewrites and category annotations, while TCSI-pp-V2 supplies the hybrid coordination of expert modules via alternating training to maintain both efficiency and accuracy.
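
The summary does not spell out what "alternating training" means in TCSI-pp-V2. One common reading, assumed here, is a round-robin schedule in which one module is updated while the others stay frozen. The toy sketch below illustrates only that scheduling idea. The module names `classifier` and `rewriter`, the scalar parameters, and the quadratic loss are all hypothetical stand-ins, not the authors' implementation.

```python
from typing import Dict, Iterator, List, Tuple


def alternating_schedule(module_names: List[str], num_rounds: int) -> Iterator[Tuple[int, str]]:
    """Yield (round index, active module), cycling through modules in order."""
    for round_idx in range(num_rounds):
        yield round_idx, module_names[round_idx % len(module_names)]


def train_alternating(
    params: Dict[str, float],
    targets: Dict[str, float],
    num_rounds: int,
    lr: float = 0.5,
) -> Dict[str, float]:
    """Toy alternating training: each module is one scalar parameter.

    The loss for a module is (param - target) ** 2. In each round only the
    active module takes a gradient step; all other modules stay frozen.
    """
    params = dict(params)  # copy so the caller's dict is not mutated
    for _, active in alternating_schedule(list(params), num_rounds):
        grad = 2.0 * (params[active] - targets[active])
        params[active] -= lr * grad
    return params


# Hypothetical two-module setup.
start = {"classifier": 0.0, "rewriter": 0.0}
goal = {"classifier": 1.0, "rewriter": 2.0}
after_one = train_alternating(start, goal, num_rounds=1)
after_two = train_alternating(start, goal, num_rounds=2)
```

After the first round only the first module has moved. The second module changes only when its turn comes, which is the property an alternating schedule is meant to provide.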

Load-bearing premise

Domain-expert annotations serve as objective ground truth for legal clarity and the chosen readability and reliability metrics accurately reflect real-world usefulness without systematic bias from annotator selection or test-set construction.

What would settle it

A follow-up evaluation on a fresh collection of privacy policies, using blind human ratings of comprehension and decision accuracy, would settle it: the claim fails if the hybrid system no longer outperforms GPT-4o there.
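
One way such a blind comparison could be scored is an exact two-sided sign test on paired ratings, where each rater scores both systems on the same policy. The sketch below is an illustration under that assumption. The paper's own statistical procedure is not described in this summary, and the ratings in the example are made up.

```python
from math import comb
from typing import Sequence


def sign_test_p_value(ratings_a: Sequence[float], ratings_b: Sequence[float]) -> float:
    """Exact two-sided sign test for paired ratings; ties are discarded."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("ratings must be paired item by item")
    wins = sum(a > b for a, b in zip(ratings_a, ratings_b))
    losses = sum(a < b for a, b in zip(ratings_a, ratings_b))
    n = wins + losses
    if n == 0:
        return 1.0  # every pair tied: no evidence either way
    k = min(wins, losses)
    # Probability of k or fewer successes under a fair coin, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical 1-5 comprehension ratings on ten policies (not real data).
hybrid = [5, 5, 5, 5, 5, 5, 5, 5, 1, 1]
baseline = [3] * 10
p_value = sign_test_p_value(hybrid, baseline)
```

With eight wins out of ten, the test does not reach the conventional 0.05 level, which shows why a small follow-up study may be inconclusive in either direction.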

Figures

Figures reproduced from arXiv: 2604.27550 by Deyi Xiong, Jinfei Liu, Junxu Liu, Kui Ren, Long Wen, Pengyun Zhu, Qiheng Sun, Yanbo Wang, Yang Cao, Zhibo Wang.

Figure 1: Summarization by TCSI-pp-V2.
Figure 2: The organization of APPSI-139.
Figure 3: The framework of TCSI-pp-V2.
Figure 4: Application Privacy Policy.
Figure 5: Summarization by TCSI-pp-V2.
Figure 6: Summarization by GPT-4o.
Figure 7: Summarization by Llama3-70b.
Figure 8: Summarization by Kimi (Moonshot V1).
Figure 9: Annotation of Privacy Policy in Doccano.
Figure 10: Rewritten of Privacy Policy in Doccano.
Original abstract

Privacy policies are essential for users to understand how service providers handle their personal data. However, these documents are often long and complex, as well as filled with technobabble and legalese, causing users to unknowingly accept terms that may even contradict the law. While summarizing and interpreting these privacy policies is crucial, there is a lack of high-quality English parallel corpus optimized for legal clarity and readability. To address this issue, we introduce APPSI-139, a high-quality English privacy policy corpus meticulously annotated by domain experts, specifically designed for summarization and interpretation tasks. The corpus includes 139 English privacy policies, 15,692 rewritten parallel corpora, and 36,351 fine-grained annotation labels across 11 data practice categories. Concurrently, we propose TCSI-pp-V2, a hybrid privacy policy summarization and interpretation framework that employs an alternating training strategy and coordinates multiple expert modules to effectively balance computational efficiency and accuracy. Experimental results show that the hybrid summarization system built on APPSI-139 corpus and the TCSI-pp-V2 framework outperform large language models, such as GPT-4o and LLaMA-3-70B, in terms of readability and reliability. The source code and dataset are available at https://github.com/EnlightenedAI/APPSI-139.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces APPSI-139, a parallel corpus of 139 English privacy policies containing 15,692 rewritten summaries and 36,351 fine-grained expert annotations across 11 data-practice categories. It also presents TCSI-pp-V2, a hybrid summarization and interpretation framework that alternates training across multiple expert modules. The central claim is that systems built from this corpus and framework outperform GPT-4o and LLaMA-3-70B on readability and reliability.

Significance. If the empirical claims are substantiated with complete evaluation details, the released corpus would constitute a reusable benchmark for legal-text summarization, and the hybrid framework could demonstrate practical advantages of modular, alternating training over pure LLM approaches in a high-stakes domain. Public code and data release supports reproducibility.

major comments (2)
  1. Abstract: the headline claim that the hybrid system 'outperform[s] large language models ... in terms of readability and reliability' is unsupported by any reported metric definitions, numerical scores, statistical tests, or baseline implementation details, preventing verification of the central empirical result.
  2. Evaluation section (implied by abstract): the readability and reliability metrics are computed against the same domain-expert annotations used to construct the training data, yet no inter-annotator agreement figures, guideline validation, or external correlation with user-comprehension studies are referenced, leaving open the possibility that reported gains reflect annotation-style bias rather than genuine improvement in legal clarity.
minor comments (1)
  1. Abstract: the informal phrase 'technobabble and legalese' could be replaced by a more precise description of the linguistic phenomena targeted by the annotations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and valuable suggestions. We provide point-by-point responses to the major comments and have prepared revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: Abstract: the headline claim that the hybrid system 'outperform[s] large language models ... in terms of readability and reliability' is unsupported by any reported metric definitions, numerical scores, statistical tests, or baseline implementation details, preventing verification of the central empirical result.

    Authors: We concur that the abstract would be strengthened by including references to the supporting evidence. In the revised version, we will update the abstract to briefly define the readability and reliability metrics, report the key numerical improvements over GPT-4o and LLaMA-3-70B, mention the statistical tests performed, and point to the detailed baseline descriptions in the Evaluation section. revision: yes

  2. Referee: Evaluation section (implied by abstract): the readability and reliability metrics are computed against the same domain-expert annotations used to construct the training data, yet no inter-annotator agreement figures, guideline validation, or external correlation with user-comprehension studies are referenced, leaving open the possibility that reported gains reflect annotation-style bias rather than genuine improvement in legal clarity.

    Authors: The evaluation is conducted on a held-out test set of policies and annotations disjoint from the training data. The annotation guidelines were developed and validated through multiple rounds of expert review, as described in the corpus construction section. We acknowledge that inter-annotator agreement figures and external user-comprehension correlations are not reported in the current manuscript. We will add IAA statistics and a limitations discussion that addresses potential annotation bias and proposes future user studies to correlate with real-world comprehension. revision: partial
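
The inter-annotator agreement statistic promised in the rebuttal is often reported as Cohen's kappa for two annotators. The sketch below shows that calculation on a handful of made-up labels. The rebuttal does not say which agreement statistic the authors will use, so treating it as Cohen's kappa is an assumption. The category names are borrowed from the paper's data practice categories, and the label sequences are invented.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels for four clauses (not taken from the corpus).
annotator_1 = ["Usage", "Usage", "Data Retention", "Data Retention"]
annotator_2 = ["Usage", "Usage", "Data Retention", "Usage"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here raw agreement is 0.75, but chance agreement is 0.5, so kappa comes out at 0.5. This gap between raw and chance-corrected agreement is why the referee asks for an agreement statistic rather than a description of the review process.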

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on new corpus and framework

Full rationale

The paper introduces APPSI-139 as a new annotated corpus and TCSI-pp-V2 as a hybrid framework, then reports experimental outperformance on readability/reliability metrics computed from the corpus's expert labels. This is standard supervised ML evaluation with train/test splits on held-out data rather than any derivation that reduces by construction to fitted inputs or self-citations. No equations, uniqueness theorems, or load-bearing self-citations appear in the provided text; the central claim remains falsifiable against external benchmarks or user studies. Minor risk (score 2) stems only from reliance on the same expert annotation process for both training and evaluation, which is common and non-circular in data-driven work.
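
For the held-out evaluation to be non-circular, whole policies rather than individual sentences should be assigned to a single split. Otherwise clauses from the same policy can leak between training and test data. The sketch below shows one way to do a policy-level split. The 80/10/10 ratios, the seed, and the `policy_NNN` identifiers are assumptions for illustration, and the paper's actual split is not reproduced in this summary.

```python
import random
from typing import Dict, Iterable, Set


def split_by_policy(
    policy_ids: Iterable[str],
    train_frac: float = 0.8,  # assumed ratio, not from the paper
    val_frac: float = 0.1,    # assumed ratio, not from the paper
    seed: int = 0,
) -> Dict[str, Set[str]]:
    """Assign each policy to exactly one of train / validation / test."""
    unique = sorted(set(policy_ids))  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(unique)
    n_train = int(len(unique) * train_frac)
    n_val = int(len(unique) * val_frac)
    return {
        "train": set(unique[:n_train]),
        "validation": set(unique[n_train:n_train + n_val]),
        "test": set(unique[n_train + n_val:]),
    }


# Hypothetical identifiers for the 139 policies.
ids = [f"policy_{i:03d}" for i in range(139)]
splits = split_by_policy(ids)
```

Every annotation or rewritten pair then inherits the split of its parent policy, so no test-set clause shares a source document with the training data.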

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the quality and representativeness of expert annotations and on the validity of the chosen evaluation metrics for readability and reliability. No new physical entities or mathematical axioms are introduced; the work is empirical and data-driven.

free parameters (1)
  • hyperparameters of TCSI-pp-V2 modules and alternating schedule
    The hybrid framework necessarily involves tunable parameters whose values are chosen or fitted during development; exact values are not reported in the abstract.
axioms (1)
  • domain assumption: Domain-expert annotations provide reliable ground truth for legal clarity and readability
    The corpus construction and all downstream claims depend on the accuracy and consistency of the expert annotations described in the abstract.

pith-pipeline@v0.9.0 · 5563 in / 1546 out tokens · 81543 ms · 2026-05-07T08:24:45.017367+00:00 · methodology

