
arxiv: 2606.18782 · v1 · pith:CJ3WZZIOnew · submitted 2026-06-17 · 💻 cs.CL · cs.AI

RedactionBench

Pith reviewed 2026-06-26 21:15 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: redaction, PII, contextual integrity, benchmark, privacy, language models, entity extraction, R-Score

The pith

Contextual redaction of personally identifiable information is not solved by current models or tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RedactionBench, a benchmark of 200 manually annotated documents across 11 domains, to evaluate whether PII is redacted according to context rather than merely identified as an entity. It also proposes R-Score, a character-level metric that treats semantically similar redactions as equivalent instead of requiring exact formatting. Evaluations of NER models, small language models, and tool-equipped frontier models show poor performance on contextual decisions. Human annotators largely agree with the target labels on mandatory redactions (89.4 percent) and safe text (94.1 percent), but only 47.7 percent of the time on contextual redactions.

Core claim

Grounded in contextual integrity, RedactionBench provides target labels for redaction decisions in diverse documents. The R-Score metric decouples performance from shallow formatting choices. Across 35 models, contextual redaction remains unsolved, and the subjective nature of contextual privacy is shown by low human consensus on those cases.

What carries the argument

RedactionBench, a manually annotated benchmark of 200 documents, paired with the R-Score metric that treats semantically similar redactions equally.

If this is right

  • Current approaches to PII extraction fail to account for context in privacy decisions.
  • Standardized benchmarks like RedactionBench are needed to evaluate privacy-preserving systems.
  • Model design should focus on understanding contextual privacy norms rather than just entity detection.
  • Human variance in privacy perceptions motivates metrics that handle ambiguity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future systems might need to handle ambiguous cases by deferring to user input or probabilistic outputs.
  • Extending the benchmark to more domains could reveal patterns in privacy norms across contexts.
  • The separation of mandatory and contextual redactions suggests hybrid approaches combining rules and learned context.
  • Releasing the benchmark establishes a baseline that can drive competition on privacy tasks.

Load-bearing premise

The manually created target labels in RedactionBench accurately represent the correct contextual privacy decisions.

What would settle it

A model that achieves high scores on RedactionBench while matching human consensus rates on contextual redactions, or a study showing consistent human agreement on contextual cases.

Figures

Figures reproduced from arXiv: 2606.18782 by Diptanshu Purwar, Esha Sali, Madhav Aggarwal, Sean Brynjólfsson, Shashvat Jayakrishnan.

Figure 1: REDACTIONBENCH provides rich segmentations across two tiers: mandatory (red) and contextual (yellow). An alternating shade of yellow is used to disambiguate adjacent contextual spans. Combinators, which connect parts of coherent entities, are represented by light blue.
Figure 2: Length Distributions across popular general-purpose PII benchmarks.
Figure 3: Cross-architectural comparison of model performance (mean overall …
Figure 4: (a) Mean target-match rate by unit type over all study windows, restricted to units exposed to …
Figure 5: Redaction Study Tool. User’s view of a labeling window from our study tool. Grayed-out …
Figure 6: (a) The distribution of initial conditions, R-Score, for each redaction window with endpoints …
Figure 7: Probability of non-decrease in R-Score after user edits as a function of the starting annotation window’s R-Score. Separate bins at 0.0 and 1.0 performance are shown as thin bars. To estimate the confidence in proportion, we used Wilson’s 95% binomial confidence intervals.
Figure 8: Evolution of the window score distribution as a fraction of user edits made.
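
The Figure 7 caption mentions Wilson's 95% binomial confidence intervals. For readers unfamiliar with them, the following is a minimal sketch of the standard Wilson score interval for a proportion. The function name `wilson_interval` and its arguments are illustrative choices made here, not taken from the paper, and the paper's own computation may differ in detail.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.959963984540054) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    `z` defaults to the two-sided 95% normal quantile. The name and
    signature are illustrative assumptions, not the paper's code.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")

    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    # Centre of the interval is pulled from p_hat toward 0.5.
    centre = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Example: 5 non-decreases out of 10 windows in a bin (made-up counts).
    print(wilson_interval(5, 10))
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and remains informative when a bin has few observations or a proportion near 0 or 1, which is presumably why it suits the separate 0.0 and 1.0 bins mentioned in the caption.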

Original abstract

Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction mechanics with privacy semantics. A public phone number is not equivalent to a phone number in a medical record. Whether information constitutes a violation depends heavily on who holds it, why, and in what context, fundamentally differentiating redaction from simple entity recognition. Grounded in contextual integrity, we introduce RedactionBench, a manually annotated benchmark comprising 200 diverse documents across 11 domains, mostly seeded from real-world sources. We also introduce R-Score, a novel character-level metric that treats semantically similar redactions equally and nullifies shallow formatting choices, such as varying masking styles for phone numbers. Evaluations across Named Entity Recognition models, entity extraction Small Language Models, and frontier models equipped with agentic tools demonstrate that contextual redaction remains an unsolved problem. A human evaluation with over 80 users on RedactionBench reveals a stark dichotomy in privacy perceptions. Annotators show consensus with target labels for mandatory redactions (89.4 percent) and safe text preservations (94.1 percent), but fail to agree on contextual redactions (47.7 percent). This variance demonstrates the subjective nature of contextual privacy and motivates R-Score, which decouples contextual ambiguity from strict precision. We compare 35 models across families and report their performance in redacting PII. Finally, we release RedactionBench to establish a baseline for future privacy-preserving systems, hoping to inspire efficient model design and standardized evaluations.
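
The agreement figures quoted in the abstract are per-tier rates of agreement with the target labels. A minimal sketch of that arithmetic follows, assuming a hypothetical record format in which each human judgment is a `(tier, matched_target)` pair. The tier names, the function `match_rate_by_tier`, and the toy data are all assumptions for illustration and do not come from the paper.

```python
from collections import defaultdict


def match_rate_by_tier(judgments):
    """Fraction of judgments that match the target label, per tier.

    `judgments` is an iterable of (tier, matched_target) pairs; this
    record format is a hypothetical stand-in for the study data.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for tier, matched in judgments:
        totals[tier] += 1
        matches[tier] += int(bool(matched))
    return {tier: matches[tier] / totals[tier] for tier in totals}


if __name__ == "__main__":
    # Toy data only; these are not the paper's measurements.
    toy = [
        ("mandatory", True),
        ("mandatory", True),
        ("contextual", True),
        ("contextual", False),
        ("safe", True),
    ]
    print(match_rate_by_tier(toy))
```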

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RedactionBench, a manually annotated benchmark of 200 documents across 11 domains (mostly real-world sources) for evaluating contextual PII redaction grounded in contextual integrity. It proposes the R-Score, a character-level metric that equates semantically similar redactions and ignores formatting variations. Evaluations of 35 models (NER systems, entity-extraction SLMs, and frontier LLMs with agentic tools) show poor performance on contextual cases, while a human study with >80 users reports 89.4% agreement with targets on mandatory redactions, 94.1% on safe preservations, and only 47.7% on contextual redactions; the authors conclude that contextual redaction remains unsolved and release the benchmark.

Significance. If the target labels constitute stable, inter-subjectively validated ground truth, RedactionBench and R-Score would usefully separate contextual privacy decisions from mechanical entity extraction and provide a needed public baseline for privacy-preserving systems. The release of the dataset itself is a concrete strength that enables future work.

major comments (2)
  1. [Benchmark construction / human evaluation] Benchmark construction / human evaluation section: the target labels are described as 'manually annotated' and 'manually created' but the manuscript supplies no information on the number of annotators who produced them, their selection or expertise, the adjudication procedure used to resolve disagreements, or any inter-annotator agreement statistics computed on the targets themselves. This is load-bearing because all model scores (including the claim that contextual redaction is unsolved) are computed against these targets, yet the same human study reports only 47.7% agreement precisely on the contextual subset.
  2. [Abstract and evaluation results] Abstract and evaluation results: the central claim that 'contextual redaction remains an unsolved problem' is supported only by model performance against the author-defined targets; the reported 47.7% human agreement on contextual cases directly undercuts the assumption that those targets encode a reliable standard rather than one (or a small set of) subjective privacy judgments. Without additional validation (e.g., multi-annotator consensus labels or external expert review), model under-performance may simply track label idiosyncrasy.
minor comments (2)
  1. [Benchmark construction] Document selection criteria and domain sampling procedure are not described in sufficient detail to allow replication or assessment of coverage bias.
  2. [R-Score definition] The exact definition and implementation of R-Score (character-level matching rules, handling of partial overlaps, normalization) should be given with a worked example or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments raise important points about the transparency of our target label creation process and the strength of evidence for our central claim. We address each below and indicate where revisions will be made.

  1. Referee: [Benchmark construction / human evaluation] Benchmark construction / human evaluation section: the target labels are described as 'manually annotated' and 'manually created' but the manuscript supplies no information on the number of annotators who produced them, their selection or expertise, the adjudication procedure used to resolve disagreements, or any inter-annotator agreement statistics computed on the targets themselves. This is load-bearing because all model scores (including the claim that contextual redaction is unsolved) are computed against these targets, yet the same human study reports only 47.7% agreement precisely on the contextual subset.

    Authors: The target labels were produced by the three lead authors, each with prior experience in NLP and privacy research. Annotation proceeded via iterative individual review followed by group discussion to reach consensus on each document; no external annotators or crowd-sourcing were used because contextual-integrity judgments require sustained domain familiarity. We will add a new subsection (likely 3.2) that explicitly states the number of annotators, their backgrounds, the consensus procedure, and the rationale for not computing IAA on the targets themselves (they represent the authors' agreed reference standard rather than an averaged crowd label). The separate human study (>80 participants) was conducted afterward precisely to measure agreement against these targets and to surface the subjectivity that appears in the 47.7% contextual figure. Adding this description addresses the transparency concern without altering the experimental design. revision: yes

  2. Referee: [Abstract and evaluation results] Abstract and evaluation results: the central claim that 'contextual redaction remains an unsolved problem' is supported only by model performance against the author-defined targets; the reported 47.7% human agreement on contextual cases directly undercuts the assumption that those targets encode a reliable standard rather than one (or a small set of) subjective privacy judgments. Without additional validation (e.g., multi-annotator consensus labels or external expert review), model under-performance may simply track label idiosyncrasy.

    Authors: We agree that the 47.7% figure on contextual cases demonstrates subjectivity, but we view this as supporting rather than undermining the claim. The targets achieve high agreement on mandatory redactions (89.4%) and safe preservations (94.1%), indicating they are reliable where privacy norms are clear; the drop on contextual cases is the very phenomenon we argue makes the task unsolved. Model failures are measured against a fixed, reproducible reference that is already shown to be non-idiosyncratic on the non-contextual subsets. We will revise the abstract and Section 5 to clarify that the targets constitute one expert-validated standard (not the sole possible labeling) and to emphasize that both model and human performance remain low on contextual items. This framing keeps the claim intact while acknowledging the inherent variability that R-Score is designed to accommodate. No further external validation round was performed, but the existing human study already supplies the multi-annotator data the referee requests. revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark and metric are introduced without self-referential derivations

The paper introduces RedactionBench (200 documents, 11 domains) and R-Score (character-level metric) as new artifacts. No equations, fitted parameters, or predictions are defined; evaluations consist of direct model runs on the manually labeled set. The human agreement figures (89.4%, 94.1%, 47.7%) are reported as observations rather than inputs to any derivation. No self-citations serve as load-bearing premises for uniqueness or ansatzes. The work is self-contained as an empirical benchmark release.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that contextual integrity provides the right normative frame for redaction decisions and that the 200-document sample is representative enough to declare the problem unsolved.

axioms (1)
  • domain assumption: Contextual integrity theory supplies the correct criteria for deciding whether a piece of information should be redacted.
    Stated in the abstract as the grounding for the benchmark design.
invented entities (1)
  • R-Score (no independent evidence)
    purpose: Character-level metric that equates semantically similar redactions and ignores formatting differences such as masking style.
    Newly introduced metric whose definition is only sketched in the abstract.
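
Because the R-Score definition is only sketched, a concrete illustration may help show what "character-level" and "insensitive to masking style" could mean in practice. The following is not the paper's R-Score. It is a minimal sketch of a generic character-level F1 under two assumptions made here: redactions are given as `(start, end)` character spans over the original text, and only alphanumeric characters are counted so that separators such as hyphens do not affect the score. All names are hypothetical.

```python
def char_mask(length, spans):
    """Boolean mask marking which character positions are redacted."""
    mask = [False] * length
    for start, end in spans:
        for i in range(max(0, start), min(length, end)):
            mask[i] = True
    return mask


def char_level_f1(text, target_spans, predicted_spans):
    """Character-level F1 between target and predicted redaction spans.

    Assumption (not from the paper): only alphanumeric characters are
    scored, so differing treatment of punctuation or spaces is ignored.
    """
    target = char_mask(len(text), target_spans)
    predicted = char_mask(len(text), predicted_spans)
    scored = [i for i, ch in enumerate(text) if ch.isalnum()]

    tp = sum(1 for i in scored if target[i] and predicted[i])
    fp = sum(1 for i in scored if predicted[i] and not target[i])
    fn = sum(1 for i in scored if target[i] and not predicted[i])

    if tp + fp + fn == 0:
        return 1.0  # nothing to redact and nothing redacted
    return 2 * tp / (2 * tp + fp + fn)


if __name__ == "__main__":
    text = "Call 555-0100 now"
    target = [(5, 13)]  # the whole phone number
    print(char_level_f1(text, target, [(5, 13)]))          # masks everything
    print(char_level_f1(text, target, [(5, 8), (9, 13)]))  # leaves the hyphen
    print(char_level_f1(text, target, [(5, 8)]))           # masks only "555"
```

In this sketch the first two predictions score identically even though one leaves the hyphen visible, while the third is penalised for exposing four digits. How the actual metric handles contextual spans, partial overlaps, and normalization is exactly what the referee's second minor comment asks the authors to specify.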

