
arxiv: 1906.10671 · v1 · pith:5P74WJJTnew · submitted 2019-06-25 · 💻 cs.LG · stat.ML

Explaining Deep Learning Models with Constrained Adversarial Examples

Pith reviewed 2026-05-25 16:17 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords counterfactual explanations · adversarial examples · model interpretability · constrained optimization · deep learning · explainable AI · classification

The pith

Constrained adversarial examples generate counterfactual explanations that respect domain rules like categories and ranges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Constrained Adversarial Examples (CADEX) to generate counterfactual explanations for deep learning models. These explanations show specific input changes that would produce a different classification outcome. The changes are optimized to obey explicit constraints such as valid values for categorical attributes and allowed numerical ranges. This produces actionable suggestions that remain feasible in real applications. A reader would care because unconstrained counterfactuals often suggest impossible inputs that cannot be applied.

Core claim

The paper claims that adversarial perturbations can be optimized under explicit constraints on input features to produce counterfactual examples that remain valid under those constraints, thereby yielding explanations that incorporate business or domain rules such as categorical attributes and range constraints while still reflecting the model's decision process.

What carries the argument

Constrained Adversarial Examples (CADEX), which generate perturbations optimized subject to constraints to produce valid counterfactuals for model explanations.
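
As a rough illustration of the idea (not the authors' implementation), a constrained perturbation search can be sketched as projected gradient descent on a toy logistic model. The function name `constrained_counterfactual`, the model parameters `w` and `b`, and the box bounds `lower`/`upper` are all assumptions made for this sketch; the paper targets deep networks and handles more than range constraints.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def constrained_counterfactual(x, w, b, lower, upper, target=1,
                               lr=0.1, max_steps=500):
    """Search for an input that a toy logistic model assigns to `target`.

    Hypothetical helper for illustration only. The model is
    p(y=1 | x) = sigmoid(w . x + b); the constraints are box bounds.
    Returns the counterfactual, or None if none is found in max_steps.
    """
    x_cf = np.asarray(x, dtype=float).copy()
    for _ in range(max_steps):
        p = sigmoid(w @ x_cf + b)
        if (p > 0.5) == bool(target):
            return x_cf  # prediction now matches the target class
        # Gradient of the cross-entropy loss toward `target` w.r.t. the input.
        grad = (p - target) * w
        # Gradient step, then projection back onto the allowed ranges.
        x_cf = np.clip(x_cf - lr * grad, lower, upper)
    return None

w, b = np.array([1.0, -1.0]), 0.0
x = np.array([0.0, 1.0])                      # classified as 0
cf = constrained_counterfactual(x, w, b, lower=np.zeros(2), upper=np.full(2, 2.0))
print(cf)
```

Returning `None` makes a failed search explicit instead of handing back an input that violates the bounds.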

If this is right

  • Explanations can directly handle categorical attributes by producing only valid category values.
  • Suggested changes stay within feasible numerical ranges without needing separate repair steps.
  • The same optimization framework applies across different real-world datasets that carry business rules.
  • Counterfactuals become directly usable as recommendations rather than requiring post-hoc filtering.
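
The first two points can be made concrete with a small sketch of a projection step that snaps a perturbed input back onto the feasible set. This is a generic argmax-and-clip projection written for illustration, not the adjustment mechanism described in the paper; `project_to_valid`, `cat_groups`, `lower` and `upper` are hypothetical names.

```python
import numpy as np

def project_to_valid(x, cat_groups, lower, upper):
    """Return a copy of x that satisfies range and one-hot constraints.

    cat_groups: iterable of index tuples, each a one-hot encoded attribute.
    lower/upper: per-feature bounds (use 0 and 1 for one-hot columns).
    """
    # np.clip returns a new array, so the caller's input is left untouched.
    x = np.clip(np.asarray(x, dtype=float), lower, upper)
    for group in cat_groups:
        idx = list(group)
        # Keep only the strongest category in the group (ties: first index).
        winner = idx[int(np.argmax(x[idx]))]
        x[idx] = 0.0
        x[winner] = 1.0
    return x

# Feature 0: age in [18, 90]; features 1-3: one-hot category; feature 4: count in [0, 10].
lower = np.array([18.0, 0.0, 0.0, 0.0, 0.0])
upper = np.array([90.0, 1.0, 1.0, 1.0, 10.0])
perturbed = np.array([35.7, 0.6, 0.3, 0.1, -2.0])
print(project_to_valid(perturbed, [(1, 2, 3)], lower, upper))
```

Applying a step like this inside the search loop, rather than once at the end, is what would remove the need for a separate repair pass.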

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach might integrate with automated compliance checkers to ensure explanations meet regulatory standards.
  • Applying similar constraints during model training could produce models whose decisions are easier to explain from the start.
  • Testing the method on regression or ranking tasks would show whether the same constraint-handling logic generalizes beyond classification.

Load-bearing premise

The approach assumes that adversarial perturbations can be optimized under explicit constraints while still producing valid, interpretable counterfactuals that meaningfully reflect the model's decision process rather than artifacts of the constraint enforcement.

What would settle it

Feed the generated counterfactual inputs back into the model and check two things: that the prediction changes to the intended alternative outcome, and that every constraint remains strictly satisfied. If the prediction does not change or any constraint is violated, the method has failed to deliver a valid explanation.
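
A check of this kind is simple to automate. The sketch below assumes a `predict` callable that returns a class label and constraints expressed as box bounds plus one-hot groups; the function name and its parameters are hypothetical and not taken from the paper.

```python
import numpy as np

def is_valid_counterfactual(predict, x_cf, target, lower, upper,
                            cat_groups=(), tol=1e-9):
    """True only if x_cf flips the prediction AND meets every constraint."""
    x_cf = np.asarray(x_cf, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # Range constraints: every feature inside its allowed interval.
    in_range = bool(np.all(x_cf >= lower - tol) and np.all(x_cf <= upper + tol))
    # Categorical constraints: each group is exactly one-hot.
    one_hot_ok = all(
        np.isclose(x_cf[list(g)].sum(), 1.0)
        and np.all(np.isin(x_cf[list(g)], (0.0, 1.0)))
        for g in cat_groups
    )
    # Model check: the prediction equals the intended alternative outcome.
    flipped = predict(x_cf) == target
    return bool(in_range and one_hot_ok and flipped)

# Toy model: class 1 when feature 0 exceeds feature 1.
toy_predict = lambda x: int(x[0] > x[1])
ok = is_valid_counterfactual(toy_predict, [1.5, 1.0, 1.0, 0.0], target=1,
                             lower=[0, 0, 0, 0], upper=[2, 2, 1, 1],
                             cat_groups=[(2, 3)])
print(ok)
```

Keeping the three conditions separate makes it easy to report which one failed when a counterfactual is rejected.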

Figures

Figures reproduced from arXiv: 1906.10671 by Chris Watkins, Jonathan Moore, Nils Hammerla.

Figure 1: Visual explanation of categorical attribute adjustment. On the left, the internal state of …
Figure 2: Number of solutions found by nchanged. From the surrounding text: "… elements, it may get stuck in a local minimum or simply not point at the right direction to cross the decision boundary. To see how significant this is, we plot histograms of how many solutions were found per training set item, for the 3 values of nchanged. As can be seen in figure 2, for most samples CADEX finds at least 3 or 4 explanations which should be enough for …"
Figure 3: Cumulative distribution of distances found using CADEX vs. training set
Figure 4: Distribution of zero SHAP attributes, which were used to produce counterfactual explana…
Original abstract

Machine learning algorithms generally suffer from a problem of explainability. Given a classification result from a model, it is typically hard to determine what caused the decision to be made, and to give an informative explanation. We explore a new method of generating counterfactual explanations, which, instead of explaining why a particular classification was made, explain how a different outcome can be achieved. This gives the recipients of the explanation a better way to understand the outcome, and provides an actionable suggestion. We show that the introduced method of Constrained Adversarial Examples (CADEX) can be used in real-world applications, and yields explanations which incorporate business or domain constraints such as handling categorical attributes and range constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Constrained Adversarial Examples (CADEX), a method for generating counterfactual explanations of deep learning classification decisions. Instead of explaining why a given outcome occurred, CADEX produces minimal perturbations that achieve a different target outcome while enforcing explicit constraints such as categorical attribute handling and range bounds. The central claim is that this constrained optimization procedure yields feasible, actionable explanations suitable for real-world applications.

Significance. If the central claim holds, the work has moderate significance for the XAI literature by extending adversarial-example techniques to respect domain constraints, which is a practical requirement for deployment. The explicit incorporation of constraints via projection or penalization is a clear technical contribution over unconstrained counterfactual methods, and the reported experiments on standard datasets demonstrate basic feasibility. No machine-checked proofs or parameter-free derivations are present.

major comments (2)
  1. [Abstract, Experiments] Abstract and Experiments section: the assertion that CADEX 'can be used in real world applications' is not supported by the reported evaluation, which uses only standard benchmark datasets without domain-specific business constraints or real deployment validation; this weakens the applicability claim.
  2. [Method] Method section: the description of how constraints are enforced (projection vs. penalty) lacks an explicit statement of the optimization objective and convergence criteria, making it difficult to assess whether the generated examples remain faithful to the model's decision boundary rather than artifacts of the constraint enforcement.
minor comments (2)
  1. [Method] Notation for the constrained optimization problem should be introduced with a numbered equation for clarity.
  2. [Experiments] The paper should include a comparison table against at least one prior counterfactual method (e.g., Wachter et al.) on the same metrics to quantify improvement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation of minor revision. We address each major comment below.

  1. Referee: [Abstract, Experiments] Abstract and Experiments section: the assertion that CADEX 'can be used in real world applications' is not supported by the reported evaluation, which uses only standard benchmark datasets without domain-specific business constraints or real deployment validation; this weakens the applicability claim.

    Authors: We agree that the experiments are limited to standard benchmark datasets and do not include real-world deployment or proprietary business constraints. The original claim in the abstract is therefore not fully supported by the empirical results. We will revise the abstract to state that CADEX 'is designed for use in real-world applications by incorporating domain constraints' and add a paragraph in the discussion section illustrating how the method's constraint mechanisms can be instantiated with typical business rules. revision: yes

  2. Referee: [Method] Method section: the description of how constraints are enforced (projection vs. penalty) lacks an explicit statement of the optimization objective and convergence criteria, making it difficult to assess whether the generated examples remain faithful to the model's decision boundary rather than artifacts of the constraint enforcement.

    Authors: We acknowledge that the method section would benefit from a more explicit mathematical formulation. In the revision we will add a dedicated paragraph stating the full optimization objective (adversarial loss plus constraint penalty or projection term) and the convergence criteria used (e.g., maximum iterations or objective change threshold). This will clarify that the generated examples are driven by the model's decision boundary rather than solely by the constraint mechanism. revision: yes
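
The objective that this response describes in words can be written generically as follows. This is a textbook-style formulation offered as an assumption, not the paper's own equation: $f$ is the classifier, $\ell$ an adversarial loss toward the target class $y_{\text{target}}$, $d$ a distance between the original input $x$ and the counterfactual $x'$, $\lambda$ and $\mu$ trade-off weights, $g$ a non-negative measure of constraint violation, and $\mathcal{C}$ the set of inputs that satisfy the domain constraints. All of these symbols are introduced here for illustration.

```latex
% Projection variant: constraints enforced by restricting the search set
\min_{x' \in \mathcal{C}} \; \ell\bigl(f(x'),\, y_{\text{target}}\bigr) + \lambda\, d(x, x')

% Penalty variant: constraint violation g(x') \ge 0 added to the objective
\min_{x'} \; \ell\bigl(f(x'),\, y_{\text{target}}\bigr) + \lambda\, d(x, x') + \mu\, g(x')

% Example stopping rule: iteration cap or small change in the objective J
t \ge T_{\max} \quad \text{or} \quad \lvert J_t - J_{t-1} \rvert < \varepsilon
```

In the projection variant every iterate is feasible by construction, whereas the penalty variant only discourages violations, so its output would still need a final feasibility check.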

Circularity Check

0 steps flagged

No significant circularity


The paper introduces CADEX as a constrained optimization procedure for generating counterfactual explanations that respect domain constraints such as categorical handling and range bounds. No derivation chain, first-principles result, or prediction is presented that reduces by construction to fitted inputs, self-definitions, or self-citation chains; the method is described algorithmically and evaluated on standard datasets, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new postulated entities.



Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 5 internal anchors

  1. [1]

    Synthesizing Robust Adversarial Examples

    Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. CoRR abs/1707.07397 (2017), http://arxiv.org/abs/1707.07397

  2. [2]

    Adversarial Patch

    Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. CoRR abs/1712.09665 (2017), http://arxiv.org/abs/1712.09665

  3. [3]

    UCI Machine Learning Repository

    Dua, D., Karra Taniskidou, E.: UCI machine learning repository (2017), http://archive.ics.uci.edu/ml

  4. [4]

    Explaining and Harnessing Adversarial Examples

    Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: International Conference on Learning Representations (2015), http://arxiv.org/abs/1412.6572

  5. [5]

    Adam: A Method for Stochastic Optimization

    Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014), http://arxiv.org/abs/1412.6980

  6. [6]

    Adversarial examples in the physical world

    Kurakin, A., Goodfellow, I.J., Bengio, S.: Adversarial examples in the physical world. CoRR abs/1607.02533 (2016)

  7. [7]

    Comparison-Based Inverse Classification for Interpretability in Machine Learning

    Laugel, T., Lesot, M.J., Marsala, C., Renard, X., Detyniecki, M.: Comparison-based inverse classification for interpretability in machine learning. In: Medina, J., Ojeda-Aciego, M., Verdegay, J.L., Pelta, D.A., Cabrera, I.P., Bouchon-Meunier, B., Yager, R.R. (eds.) Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory ...

  8. [8]

    A Unified Approach to Interpreting Model Predictions

    Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 4765–4774. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreti...

  9. [9]

    Miller, T.: Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 267, 1–38 (2019)

  10. [10]

    Practical Black-Box Attacks against Machine Learning

    Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against machine learning. In: Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. pp. 506–519. ASIA CCS '17, ACM, New York, NY, USA (2017). https://doi.org/10.1145/3052973.3053009, http://doi.acm.org/10.1145/305...

  11. [11]

    Why Should I Trust You?

    Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?": Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1135–1144. KDD '16, ACM, New York, NY, USA (2016). https://doi.org/10.1145/2939672.2939778, http://doi.acm.org/10.1145/2939672.2939778

  12. [12]

    Learning Important Features Through Propagating Activation Differences

    Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 3145–3153. PMLR, International Convention Centre, Sydney, Australia (06–11 Aug 2017...

  13. [13]

    Intriguing properties of neural networks

    Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. In: International Conference on Learning Representations (2014), http://arxiv.org/abs/1312.6199

  14. [14]

    Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR

    Wachter, S., Mittelstadt, B., Russell, C.: Counterfactual explanations without opening the black box: automated decisions and the GDPR. Harvard Journal of Law and Technology 31(2), 841–887 (2018)

  15. [15]

    Visual Interpretability for Deep Learning: A Survey

    Zhang, Q.s., Zhu, S.c.: Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering 19(1), 27–39 (Jan 2018). https://doi.org/10.1631/FITEE.1700808