
arXiv: 2605.17034 · v1 · pith:IZ3TAFUZnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI · cs.CR

Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented Generation

Pith reviewed 2026-05-19 20:22 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CR
keywords privacy policy enforcement · retrieval augmented generation · one-class classification · contextual data leakage · synthetic data · RAG guardrails · anomaly detection

The pith

Dual one-class density estimators detect contextual privacy leaks in RAG with over 0.93 AUROC on borderline cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard PII filters miss contextual data leakage in RAG systems where non-regulated attributes together identify individuals. This paper introduces a Privacy Policy Enforcement framework built on dual one-class density estimators that use fused text embeddings and include a calibrated abstain region for out-of-distribution inputs. The authors generate training data with an axis-stratified multi-LLM pipeline spanning medicine, finance, and law to include safe and borderline-safe cases. A reader would care because the resulting T3+OCSVM detector reaches over 0.93 AUROC on hard borderline tests, cuts false positives sharply, and runs in milliseconds, offering better practicality than alternatives.

Core claim

The central discovery is that training a T3+OCSVM detector on safe and borderline-safe synthetic data allows it to identify privacy policy violations in RAG queries with a borderline AUROC of 0.93 or more. This comes with a 44-55 percentage point reduction in false positives compared to Gaussian Mixture models while keeping inference at millisecond speeds. The method proves more operationally viable than supervised MLP classifiers, which abstain too often, or 14B-parameter LLM judges, which are too slow and poorly calibrated.
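The two headline metrics can be made concrete with a small sketch. AUROC is the probability that a randomly chosen violation scores above a randomly chosen safe query, and a "percentage point reduction" is an absolute difference between two false-positive rates. All scores below are made-up toy values, not results from the paper, and the function names and the 0.5 threshold are illustrative assumptions.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def false_positive_rate(safe_scores, threshold):
    """Share of safe queries whose violation score reaches the threshold."""
    return sum(s >= threshold for s in safe_scores) / len(safe_scores)


# Toy violation scores (higher = more likely a policy violation). Not paper data.
violation_scores = [0.9, 0.6, 0.4]
safe_scores = [0.7, 0.3, 0.1]
toy_auroc = auroc(violation_scores, safe_scores)

# Toy scores that two hypothetical detectors assign to the same five safe queries.
baseline_safe = [0.8, 0.7, 0.6, 0.2, 0.1]
detector_safe = [0.7, 0.3, 0.2, 0.1, 0.1]
threshold = 0.5  # assumed operating point

fpr_baseline = false_positive_rate(baseline_safe, threshold)
fpr_detector = false_positive_rate(detector_safe, threshold)

# Percentage-point reduction is the absolute gap between the two rates, times 100.
pp_reduction = (fpr_baseline - fpr_detector) * 100.0

print(f"AUROC={toy_auroc:.3f}  FP reduction={pp_reduction:.1f} pp")
```

A reduction quoted in percentage points depends on the chosen threshold, so a comparison of this kind is only meaningful when both detectors are evaluated at a stated operating point.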

What carries the argument

Dual one-class density estimators (specifically T3+OCSVM) applied to fused text embeddings, creating a model of safe query density with an abstain region for inputs that might leak private information.
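A minimal sketch of how such a detector might be wired together, using scikit-learn's `OneClassSVM`. This is one possible reading, not the authors' implementation: the summary does not define "T3" or say what the two estimators are fit on, so fitting one estimator on safe examples and a second on known-unsafe examples is an assumption here. The three "encoders" are random stand-ins, fusion is plain concatenation, and the abstain rule (neither or both estimators claim the input) is hypothetical and uncalibrated.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)


def fake_encoders(center, n):
    """Stand-in for three frozen text encoders (hypothetical): three random views."""
    return [rng.normal(center, 1.0, size=(n, dim)) for dim in (4, 2, 2)]


def fuse(views):
    """Fuse per-encoder embeddings by concatenation (one possible fusion choice)."""
    return np.concatenate(views, axis=1)


# Toy training sets: two well-separated clusters in the fused 8-dimensional space.
safe_train = fuse(fake_encoders(0.0, 200))
unsafe_train = fuse(fake_encoders(6.0, 200))

# One one-class estimator per side. ASSUMPTION: the paper may pair them differently.
safe_model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(safe_train)
unsafe_model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(unsafe_train)


def decide(fused_embedding):
    """Three-way decision: allow, block, or abstain on out-of-distribution input."""
    x = np.asarray(fused_embedding, dtype=float).reshape(1, -1)
    in_safe = safe_model.decision_function(x)[0] > 0      # positive means inlier
    in_unsafe = unsafe_model.decision_function(x)[0] > 0
    if in_safe and not in_unsafe:
        return "allow"
    if in_unsafe and not in_safe:
        return "block"
    return "abstain"  # neither region (or both) claims the input


print(decide(np.zeros(8)), decide(np.full(8, 6.0)), decide(np.full(8, 100.0)))
```

The point of the abstain branch is that a one-class model only describes the region it was fit on, so an input far from every training region is routed to review instead of being forced into allow or block.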

If this is right

  • RAG systems gain a practical tool to block contextual leaks without high computational overhead.
  • Synthetic data from multi-LLM pipelines can serve as a reliable proxy for training privacy detectors across domains.
  • Gaussian Mixture baselines are inadequate for borderline cases because they latch onto linguistic register instead of semantic content.
  • The framework sets a standard for stress-testing any classifier trained on synthetic privacy data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar density-based approaches might help in other areas like detecting misinformation or bias in generated outputs.
  • Real-world deployment could involve feedback loops where abstained cases are reviewed to refine the model.
  • The method's focus on borderline cases suggests it could generalize to evolving privacy regulations by updating the training distribution.

Load-bearing premise

The axis-stratified multi-LLM synthetic data pipeline creates borderline-safe examples that have the same privacy leakage properties as those in real RAG deployments in medicine, finance, and law.
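To make "axis-stratified" concrete, the sketch below builds a generation plan with an equal quota for every combination of axis values. The domain and label values come from the summary above; the generator names, the `per_cell` quota, and the choice of exactly these three axes are hypothetical, since the paper's actual stratification axes are not given here.

```python
from collections import Counter
from itertools import product

DOMAINS = ["medicine", "finance", "law"]   # domains named in the summary
LABELS = ["safe", "borderline-safe"]       # label classes named in the summary
GENERATORS = ["llm_a", "llm_b"]            # hypothetical placeholder model names


def build_generation_plan(per_cell):
    """Return one request per slot so every (domain, label, generator) cell is equally filled."""
    plan = []
    for domain, label, generator in product(DOMAINS, LABELS, GENERATORS):
        for slot in range(per_cell):
            plan.append(
                {"domain": domain, "label": label, "generator": generator, "slot": slot}
            )
    return plan


plan = build_generation_plan(per_cell=5)

# Check the stratification: each (domain, label) pair gets the same number of requests.
cell_counts = Counter((row["domain"], row["label"]) for row in plan)
print(len(plan), dict(cell_counts))
```

Spreading each cell across more than one generator is one way to keep a single model's phrasing habits from becoming a shortcut feature, which is the kind of register confound the summary attributes to the Gaussian Mixture baselines.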

What would settle it

Running the T3+OCSVM detector on real production RAG queries from medical, financial, or legal applications and verifying if the borderline AUROC stays above 0.93 with comparable false positive reductions.

Figures

Figures reproduced from arXiv: 2605.17034 by Alexander Nemecek, Debargha Ganguly, Erman Ayday, Osama Zafar, Vikash Singh, Vipin Chaudhary, Wenbiao Li, Yiqian Zhang.

Figure 1: Layered privacy enforcement for RAG. Layer-1 catches direct identifiers via regex/NER; Layer-2 (this work) detects contextual QI-cluster leakage.
Figure 2: Axis-stratified multi-LLM data generation pipeline, instantiated per domain with the QI taxonomy of Table …
Figure 3: Layer-2 detector architecture. Three frozen encoders produce a …
Figure 4: Case-style confound and its remediation. (a) AUROC performance …
original abstract

Standard PII filters often miss contextual data leakage in RAG systems, such as non-regulated attribute clusters that collectively identify individuals. We introduce a Privacy Policy Enforcement (PPE) framework using dual one-class density estimators with fused text embeddings and a calibrated abstain region for out-of-distribution inputs. Using an axis-stratified, multi-LLM synthetic data pipeline across medicine, finance, and law, we found that traditional Gaussian Mixture baselines fail on borderline-safe stress tests by focusing on linguistic register rather than content. Our proposed T3+OCSVM detector, trained on safe and borderline-safe data, achieves a borderline AUROC of 0.93+ while reducing false positives by 44-55 percentage points and maintaining millisecond latency. Compared to supervised MLP classifiers or 14B-parameter LLM judges, our framework offers superior operational suitability, as the former suffers from high abstention rates and the latter from latency and calibration issues. This methodology provides a robust stress-testing standard for any synthetic-data-trained classifier.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a Privacy Policy Enforcement (PPE) framework for RAG systems that employs dual one-class density estimators (specifically T3+OCSVM) on fused text embeddings together with a calibrated abstain region for OOD inputs. Using an axis-stratified multi-LLM synthetic data generation pipeline spanning medicine, finance, and law, the authors report that their detector achieves a borderline AUROC above 0.93, reduces false positives by 44-55 percentage points relative to Gaussian Mixture baselines, and operates at millisecond latency while outperforming supervised MLP classifiers and large LLM judges on operational metrics.

Significance. If the central performance claims hold under more rigorous validation, the work offers a practical, low-latency guardrail for contextual privacy leakage in RAG deployments that standard PII filters miss. The emphasis on one-class learning trained on safe and borderline-safe examples, combined with an explicit stress-testing protocol for synthetic-data classifiers, could influence guardrail design in regulated domains.

major comments (3)
  1. Abstract and Evaluation section: the reported borderline AUROC of 0.93+ and 44-55 pp false-positive reductions are given without error bars, number of runs, or ablation studies on embedding fusion and abstain-region calibration; these omissions prevent assessment of whether the gains over Gaussian Mixture, MLP, and LLM baselines are statistically robust or sensitive to hyper-parameters.
  2. Synthetic data pipeline description (Abstract and §3): the central operational claims rest on the assumption that axis-stratified multi-LLM generated borderline-safe examples reproduce the content-based privacy leakage distributions encountered in real RAG systems across medicine, finance, and law. No cross-validation against real query logs or attribute-cluster statistics is presented, leaving open the possibility that reported AUROC and FP reductions reflect LLM artifacts rather than transferable leakage patterns.
  3. Comparison to baselines (Evaluation): the superiority claims versus 14B-parameter LLM judges cite latency and calibration issues, yet no quantitative latency measurements or calibration plots (e.g., ECE or reliability diagrams) are referenced for the proposed T3+OCSVM detector itself, making the operational-suitability argument incomplete.
minor comments (2)
  1. Clarify the precise definition of 'borderline-safe' examples and the axis stratification criteria used in the synthetic pipeline; a short table or pseudocode would improve reproducibility.
  2. The manuscript should state the embedding model and dimensionality explicitly when describing the fused text embeddings.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. We plan to incorporate several revisions to address the concerns raised.

point-by-point responses
  1. Referee: Abstract and Evaluation section: the reported borderline AUROC of 0.93+ and 44-55 pp false-positive reductions are given without error bars, number of runs, or ablation studies on embedding fusion and abstain-region calibration; these omissions prevent assessment of whether the gains over Gaussian Mixture, MLP, and LLM baselines are statistically robust or sensitive to hyper-parameters.

    Authors: We agree that the current presentation lacks sufficient statistical detail. In the revised version, we will report results averaged over 10 independent training and evaluation runs, with error bars showing the standard deviation of both the AUROC and the false-positive reduction. We will also include ablation studies on the embedding fusion method (e.g., concatenation vs. averaging) and on the abstain-region calibration thresholds, to test whether the performance gains are robust to these choices.
    revision: yes

  2. Referee: Synthetic data pipeline description (Abstract and §3): the central operational claims rest on the assumption that axis-stratified multi-LLM generated borderline-safe examples reproduce the content-based privacy leakage distributions encountered in real RAG systems across medicine, finance, and law. No cross-validation against real query logs or attribute-cluster statistics is presented, leaving open the possibility that reported AUROC and FP reductions reflect LLM artifacts rather than transferable leakage patterns.

    Authors: This is a valid concern regarding the generalizability of our synthetic data approach. We designed the multi-LLM, axis-stratified pipeline specifically to generate diverse and challenging borderline-safe examples across the specified domains. However, we do not have access to proprietary real-world RAG query logs for cross-validation, as such data would contain sensitive information. We will expand §3 with additional details on the generation process and add a new subsection on limitations, explicitly discussing the potential influence of LLM artifacts and the need for future validation on real data where possible.
    revision: partial

  3. Referee: Comparison to baselines (Evaluation): the superiority claims versus 14B-parameter LLM judges cite latency and calibration issues, yet no quantitative latency measurements or calibration plots (e.g., ECE or reliability diagrams) are referenced for the proposed T3+OCSVM detector itself, making the operational-suitability argument incomplete.

    Authors: We will enhance the Evaluation section by providing explicit quantitative latency measurements for the T3+OCSVM detector, including average and percentile inference times on standard hardware. Additionally, we will include calibration analysis with Expected Calibration Error (ECE) values and reliability diagrams for our detector to enable a complete comparison with the LLM-based baselines.
    revision: yes
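
The two measurements promised in this response can be sketched with the standard library alone. The ECE below uses equal-width confidence bins, which is the common formulation but only one of several binning choices. The latency helper times any callable and reports the mean and selected percentiles. Function names, bin count, repeat count, and the toy inputs are illustrative assumptions, not the authors' protocol.

```python
import statistics
import time


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def latency_percentiles_ms(fn, inputs, repeats=5):
    """Time fn on each input several times; return mean, p50, p95, p99 in milliseconds."""
    samples = []
    for _ in range(repeats):
        for item in inputs:
            start = time.perf_counter()
            fn(item)
            samples.append((time.perf_counter() - start) * 1000.0)
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points: cuts[k-1] is the k-th percentile
    return {
        "mean": statistics.fmean(samples),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Toy calibration data (not paper results): predicted confidence and whether it was right.
toy_ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])

# Toy workload standing in for a detector's scoring call.
toy_latency = latency_percentiles_ms(lambda x: x * x, list(range(50)))
print(f"ECE={toy_ece:.3f}", toy_latency)
```

Reporting tail percentiles alongside the mean matters for a guardrail that sits in the request path, because a few slow calls dominate the user-visible delay even when the average is low.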

standing simulated objections not resolved
  • Validation of the synthetic data pipeline through cross-validation against real query logs or attribute-cluster statistics from production RAG systems.

Circularity Check

0 steps flagged

No circularity: empirical evaluation on held-out synthetic data with no derivations or self-referential definitions

full rationale

The manuscript presents an applied ML framework for privacy policy enforcement in RAG systems. It trains a T3+OCSVM detector on axis-stratified synthetic safe and borderline-safe examples and reports standard evaluation metrics (borderline AUROC 0.93+, FP reduction) on held-out portions of that data. No equations, mathematical derivations, parameter-fitting steps that are then relabeled as predictions, or load-bearing self-citations appear in the abstract or described methodology. The central claims are falsifiable empirical performance numbers rather than quantities defined in terms of themselves. The paper is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review limited to abstract; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that synthetic data distributions capture real contextual leakage patterns and that one-class density estimation on embeddings is sufficient to separate safe from borderline content.

pith-pipeline@v0.9.0 · 5731 in / 1310 out tokens · 57643 ms · 2026-05-19T20:22:58.470375+00:00 · methodology


