pith. sign in

arxiv: 2604.21469 · v1 · submitted 2026-04-23 · 💻 cs.CL · cs.LG

Cross-Domain Data Selection and Augmentation for Automatic Compliance Detection

Pith reviewed 2026-05-09 22:08 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords compliance detectioncross-domain transferdata selectionnatural language inferencenegative transferregulatory textdomain adaptation
0
0 comments X

The pith

Targeted selection of source data from larger domains cuts negative transfer when adapting compliance detection models to new regulations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames automatic compliance detection as a natural language inference task and tests whether selecting the right subset of training examples from a source regulation can prevent performance drops on a different target regulation. Four selection strategies are compared: random sampling, Moore-Lewis cross-entropy difference, importance weighting, and embedding-based retrieval, each tested at varying data proportions. Results show that the non-random methods consistently limit the harm caused by domain shift, whereas random augmentation often worsens results. This supplies a concrete, low-cost way to reuse existing labeled compliance data across heterogeneous legal texts without retraining from scratch.

Core claim

When compliance detection is cast as NLI, selecting augmentation data from a source domain via cross-entropy difference, importance weighting, or embedding similarity substantially reduces negative transfer to a target regulation, while random selection frequently increases it.

What carries the argument

Four data-selection methods (random, Moore-Lewis cross-entropy, importance weighting, embedding retrieval) that rank and keep a variable fraction of source-domain examples for NLI training on the target domain.

If this is right

  • Increasing the proportion of selected data does not always improve results; an optimal fraction exists for each selection method.
  • Non-random selection can turn an otherwise harmful source domain into a net positive for cross-regulation adaptation.
  • The approach scales compliance automation by letting a single labeled corpus serve multiple regulations after targeted filtering.
  • Embedding-based retrieval offers a simple, model-agnostic alternative to the more computationally heavy cross-entropy methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection pipeline could be applied to other text-classification settings that suffer from regulatory or jurisdictional shift.
  • Combining two selection criteria (for example, embedding similarity followed by importance weighting) might further reduce residual negative transfer.
  • If the NLI formulation already loses some legal nuance, the reported gains may understate the true difficulty of cross-domain compliance.

Load-bearing premise

The selected source examples will improve or at least not degrade performance on the target regulation without introducing new biases that the NLI task framing cannot detect.

What would settle it

A controlled test in which the same target regulation is paired with source data chosen by each of the four methods at multiple proportions, and none of the targeted methods yields higher F1 than a no-augmentation baseline.

Figures

Figures reproduced from arXiv: 2604.21469 by Dusica Marijan, Fariz Ikhwantri.

Figure 1
Figure 1. Figure 1: Given a large source domain (e.g. GDPR) and a smaller target domain (e.g. HIPAA), we evaluate several data selection methods to select a subset [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Data selection score and its relative position ranking of embedding similarity method. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of embedding similarity of source dataset based on similarity scores to target dataset using RoBERTa-large model [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: F1 Scores of RoBERTa Models of increasing selected ratio from 1% to 75% across 3 data selection methods and a random baseline. Left: [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Data selection score and its relative position ranking of importance weighting method. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Automating the detection of regulatory compliance remains a challenging task due to the complexity and variability of legal texts. Models trained on one regulation often fail to generalise to others. This limitation underscores the need for principled methods to improve cross-domain transfer. We study data selection as a strategy to mitigate negative transfer in compliance detection framed as a natural language inference (NLI) task. Specifically, we evaluate four approaches for selecting augmentation data from a larger source domain: random sampling, Moore-Lewis's cross-entropy difference, importance weighting, and embedding-based retrieval. We systematically vary the proportion of selected data to analyse its effect on cross-domain adaptation. Our findings demonstrate that targeted data selection substantially reduces negative transfer, offering a practical path toward scalable and reliable compliance automation across heterogeneous regulations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper frames automatic compliance detection as an NLI task (regulation text as premise, document span as hypothesis) and evaluates four source-domain data selection methods—random sampling, Moore-Lewis cross-entropy difference, importance weighting, and embedding-based retrieval—to mitigate negative transfer when augmenting target-domain training data. It varies the proportion of selected data and claims that targeted selection substantially reduces negative transfer compared to unfiltered augmentation, providing a practical route to cross-regulation generalization.

Significance. If the empirical comparisons are robust, the work supplies a concrete, replicable recipe for improving cross-domain transfer in legal NLP without requiring new model architectures or large-scale annotation, which would be useful for compliance systems operating across heterogeneous regulations.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (experimental setup): the central claim that targeted selection 'substantially reduces negative transfer' is asserted without any reported metrics, baselines, statistical tests, or definition of how negative transfer was quantified (e.g., delta in F1 or accuracy between source-only and augmented models). This makes the quantitative support for the claim impossible to evaluate from the provided text.
  2. [§2] §2 (NLI formulation): the premise-hypothesis construction (full regulation text vs. single document span) does not address multi-clause conditionals, exceptions, or cross-sentence dependencies typical in regulatory compliance. Without clause decomposition, multi-premise reasoning, or error analysis isolating label noise from domain mismatch, it remains unclear whether observed gains stem from genuine semantic alignment or from spurious label-distribution effects.
minor comments (2)
  1. [§3] The four selection methods are introduced without explicit equations or pseudocode for the Moore-Lewis and importance-weighting variants; adding these would improve reproducibility.
  2. [§3] No mention of the source and target regulation corpora sizes, label distributions, or inter-annotator agreement for the NLI labels.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving the clarity of our claims and the discussion of our task formulation. We address each point below and have made revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (experimental setup): the central claim that targeted selection 'substantially reduces negative transfer' is asserted without any reported metrics, baselines, statistical tests, or definition of how negative transfer was quantified (e.g., delta in F1 or accuracy between source-only and augmented models). This makes the quantitative support for the claim impossible to evaluate from the provided text.

    Authors: We agree that the abstract and experimental setup in §3 should explicitly define negative transfer and report supporting metrics. Negative transfer is quantified as the drop in F1 score (and accuracy) when augmenting target-domain training data with unfiltered source data versus source-only training. In the revised manuscript, we have added this definition to §3, included the specific F1 deltas for each selection method across varying data proportions, and reported paired statistical significance tests. The abstract has also been updated to reference these quantitative results, showing that targeted selection reduces negative transfer by 4–9 F1 points relative to random augmentation. revision: yes

  2. Referee: [§2] §2 (NLI formulation): the premise-hypothesis construction (full regulation text vs. single document span) does not address multi-clause conditionals, exceptions, or cross-sentence dependencies typical in regulatory compliance. Without clause decomposition, multi-premise reasoning, or error analysis isolating label noise from domain mismatch, it remains unclear whether observed gains stem from genuine semantic alignment or from spurious label-distribution effects.

    Authors: We acknowledge that framing compliance detection as a single-premise NLI task with the full regulation text as premise is a simplification that does not explicitly decompose multi-clause conditionals or handle exceptions and cross-sentence dependencies. We will revise §2 to discuss these modeling assumptions and their potential impact. We will also add an error analysis subsection that examines cases where gains appear driven by domain alignment versus label noise. However, a full multi-premise or clause-decomposition approach would require a different task formulation and additional annotation, which lies outside the scope of the current study focused on data selection. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of data selection methods

full rationale

The paper is an empirical study that evaluates four data selection techniques (random, Moore-Lewis cross-entropy, importance weighting, embedding retrieval) for mitigating negative transfer when augmenting NLI-framed compliance detection across regulations. It varies selection proportions and reports experimental outcomes on transfer performance. No equations, derivations, fitted parameters, or self-citations are used to derive or predict results; the central claims rest on observed data rather than any step that reduces by construction to the paper's own inputs or prior self-referential claims. This matches the default case of a self-contained empirical ML paper with no load-bearing circular structure.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The abstract rests on two domain assumptions common to NLP transfer work and introduces no free parameters or new entities.

axioms (2)
  • domain assumption Models trained on one regulation often fail to generalise to others because of the complexity and variability of legal texts.
    Opening sentence of the abstract; used to motivate the need for cross-domain methods.
  • domain assumption Compliance detection can be usefully framed as a natural language inference task.
    Explicitly stated when defining the experimental setup.

pith-pipeline@v0.9.0 · 5426 in / 1181 out tokens · 28419 ms · 2026-05-09T22:08:18.577397+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 2 internal anchors

  1. [1]

    Cross-policy compliance detection via question answering,

    M. Saeidi, M. Yazdani, and A. Vlachos, “Cross-policy compliance detection via question answering,” inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, Eds. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 8622–8632....

  2. [2]

    Compliance checking of software processes: A systematic literature review,

    J. P. Castellanos Ardila, B. Gallina, and F. Ul Muram, “Compliance checking of software processes: A systematic literature review,”Journal of Software: Evolution and Process, vol. 34, no. 5, p. e2440, 2022

  3. [3]

    P. M. Duvall, S. Matyas, and A. Glover,Continuous integration: improving software quality and reducing risk. Pearson Education, 2007

  4. [4]

    Nlp-based automated compliance checking of data processing agreements against gdpr,

    O. A. Cejas, M. I. Azeem, S. Abualhaija, and L. C. Briand, “Nlp-based automated compliance checking of data processing agreements against gdpr,”IEEE Transactions on Software Engineering, vol. 49, no. 9, pp. 4282–4303, 2023

  5. [5]

    A multi-solution study on gdpr ai- enabled completeness checking of dpas,

    M. I. Azeem and S. Abualhaija, “A multi-solution study on gdpr ai- enabled completeness checking of dpas,”Empirical Software Engineer- ing, vol. 29, no. 4, p. 96, 2024

  6. [6]

    Lessons from the use of natural language inference (nli) in requirements engineering tasks,

    M. Fazelnia, V . Koscinski, S. Herzog, and M. Mirakhorli, “Lessons from the use of natural language inference (nli) in requirements engineering tasks,”2024 IEEE 32nd International Requirements Engineering Con- ference (RE), pp. 103–115, 2024

  7. [7]

    Two-stage compliance detection for power enterprises based on nli and llm,

    M. Hua, Q. Zhao, J. Song, and X.-s. Tang, “Two-stage compliance detection for power enterprises based on nli and llm,” in2024 IEEE International Symposium on Product Compliance Engineering - Asia (ISPCE-ASIA), 2024, pp. 1–5

  8. [8]

    Explainable compliance detection with multi-hop natural language inference on assurance case structure,

    F. Ikhwantri and D. Marijan, “Explainable compliance detection with multi-hop natural language inference on assurance case structure,” 2025

  9. [9]

    Classification or Prompting: A Case Study on Legal Requirements Traceability

    R. Etezadi, S. Abualhaija, C. Arora, and L. Briand, “Classification or prompting: A case study on legal requirements traceability,” 2025. [Online]. Available: https://arxiv.org/abs/2502.04916

  10. [10]

    A compliance checking framework based on retrieval augmented generation,

    J. Sun, Z. Luo, and Y . Li, “A compliance checking framework based on retrieval augmented generation,” inProceedings of the 31st International Conference on Computational Linguistics, O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D. Eugenio, and S. Schockaert, Eds. Abu Dhabi, UAE: Association for Computational Linguistics, Jan. 2025, pp. 2603–261...

  11. [11]

    Characterizing and avoiding negative transfer,

    Z. Wang, Z. Dai, B. Póczos, and J. G. Carbonell, “Characterizing and avoiding negative transfer,”2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11 285–11 294, 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:53748459

  12. [12]

    On the hidden negative transfer in sequential transfer learning for domain adaptation from news to tweets,

    S. Meftah, N. Semmar, Y . Tamaazousti, H. Essafi, and F. Sadat, “On the hidden negative transfer in sequential transfer learning for domain adaptation from news to tweets,” inProceedings of the Second Workshop on Domain Adaptation for NLP, E. Ben-David, S. Cohen, R. McDonald, B. Plank, R. Reichart, G. Rotman, and Y . Ziser, Eds. Kyiv, Ukraine: Association...

  13. [13]

    Intelligent selection of language model training data,

    R. C. Moore and W. Lewis, “Intelligent selection of language model training data,” inProceedings of the ACL 2010 Conference Short Papers, J. Haji ˇc, S. Carberry, S. Clark, and J. Nivre, Eds. Uppsala, Sweden: Association for Computational Linguistics, Jul. 2010, pp. 220–224. [Online]. Available: https://aclanthology.org/P10-2041/

  14. [14]

    Data selection for language models via importance resampling,

    S. M. Xie, S. Santurkar, T. Ma, and P. Liang, “Data selection for language models via importance resampling,”Advances in Neural In- formation Processing Systems (NeurIPS), 2023

  15. [15]

    A survey on transfer learning,

    S. J. Pan and Q. Yang, “A survey on transfer learning,”IEEE Transactions on Knowledge and Data Engineering, vol. 22, pp. 1345–1359, 2010. [Online]. Available: https://api.semanticscholar.org/ CorpusID:740063

  16. [16]

    Large language models for data annotation and synthesis: A survey,

    Z. Tan, D. Li, S. Wang, A. Beigi, B. Jiang, A. Bhattacharjee, M. Karami, J. Li, L. Cheng, and H. Liu, “Large language models for data annotation and synthesis: A survey,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y . Al-Onaizan, M. Bansal, and Y .-N. Chen, Eds. Miami, Florida, USA: Association for Computatio...

  17. [17]

    The pascal recognising textual entailment challenge,

    I. Dagan, O. Glickman, and B. Magnini, “The pascal recognising textual entailment challenge,” inMachine learning challenges workshop. Springer, 2005, pp. 177–190

  18. [18]

    Alex Warstadt, Amanpreet Singh, and Samuel R Bowman

    S. Wang, H. Fang, M. Khabsa, H. Mao, and H. Ma, “Entailment as few-shot learner,”arXiv preprint arXiv:2104.14690, 2021

  19. [19]

    Label verbalization and entailment for effective zero and few-shot relation extraction,

    O. Sainz, O. Lopez de Lacalle, G. Labaka, A. Barrena, and E. Agirre, “Label verbalization and entailment for effective zero and few-shot relation extraction,” inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, Eds. Online and Punta Cana, Dominican Republic: Associat...

  20. [20]

    A machine learning approach for tracing regulatory codes to product specific requirements,

    J. Cleland-Huang, A. Czauderna, M. Gibiec, and J. Emenecker, “A machine learning approach for tracing regulatory codes to product specific requirements,” inProceedings of the 32nd ACM/IEEE International Conference on Software Engineering - V olume 1, ser. ICSE ’10. New York, NY , USA: Association for Computing Machinery, 2010, p. 155–164. [Online]. Availa...

  21. [21]

    Tackling the term-mismatch problem in automated trace retrieval,

    J. Guo, M. Gibiec, and J. Cleland-Huang, “Tackling the term-mismatch problem in automated trace retrieval,”Empirical Software Engineering, vol. 22, no. 3, pp. 1103–1142, 2017

  22. [22]

    BERT: Pre- training of deep bidirectional transformers for language understanding,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- training of deep bidirectional transformers for language understanding,” inProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, V olume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Ed...

  23. [23]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Y . Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V . Stoyanov, “Roberta: A robustly optimized bert pretraining approach,”arXiv preprint arXiv:1907.11692, 2019

  24. [24]

    LEGAL-BERT: The muppets straight out of law school,

    I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, and I. Androutsopoulos, “LEGAL-BERT: The muppets straight out of law school,” inFindings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y . He, and Y . Liu, Eds. Online: Association for Computational Linguistics, Nov. 2020, pp. 2898–2904. [Online]. Available: https://aclant...

  25. [25]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughanet al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

  26. [26]

    Conditional bert contextual augmentation,

    X. Wu, S. Lv, L. Zang, J. Han, and S. Hu, “Conditional bert contextual augmentation,” inInternational conference on computational science. Springer, 2019, pp. 84–95

  27. [27]

    Data augmentation approaches in natural language processing: A survey,

    B. Li, Y . Hou, and W. Che, “Data augmentation approaches in natural language processing: A survey,”Ai Open, vol. 3, pp. 71–90, 2022

  28. [28]

    When and how to paraphrase for named entity recognition?

    S. Sharma, A. Joshi, Y . Zhao, N. Mukhija, H. Bhathena, P. Singh, and S. Santhanam, “When and how to paraphrase for named entity recognition?” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computational Lin...

  29. [29]

    Data augmentation using back- translation for context-aware neural machine translation,

    A. Sugiyama and N. Yoshinaga, “Data augmentation using back- translation for context-aware neural machine translation,” inProceedings of the F ourth Workshop on Discourse in Machine Translation (DiscoMT 2019), A. Popescu-Belis, S. Loáiciga, C. Hardmeier, and D. Xiong, Eds. Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 35–44. ...

  30. [30]

    Learning transferrable and interpretable representations for domain generalization,

    Z. Du, J. Li, K. Lu, L. Zhu, and Z. Huang, “Learning transferrable and interpretable representations for domain generalization,” inProceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 3340–3349

  31. [31]

    Explaining cross-domain recognition with interpretable deep classifier,

    Y . Zhang, T. Yao, Z. Qiu, and T. Mei, “Explaining cross-domain recognition with interpretable deep classifier,”ACM Trans. Multimedia Comput. Commun. Appl., vol. 20, no. 3, Oct. 2023. [Online]. Available: https://doi.org/10.1145/3623399

  32. [32]

    With a little push, NLI models can robustly and efficiently predict faithfulness,

    J. Steen, J. Opitz, A. Frank, and K. Markert, “With a little push, NLI models can robustly and efficiently predict faithfulness,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (V olume 2: Short Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computational Linguistics, J...

  33. [33]

    Improving faithfulness and factuality with contrastive learning in explainable recommendation,

    H. Zhuang, W. Zhang, W. Chen, J. Yang, and Q. Z. Sheng, “Improving faithfulness and factuality with contrastive learning in explainable recommendation,”ACM Trans. Intell. Syst. Technol., vol. 16, no. 1, Dec. 2024. [Online]. Available: https://doi.org/10.1145/3653984