Explaining Deep Learning Models with Constrained Adversarial Examples

Chris Watkins; Jonathan Moore; Nils Hammerla

arXiv: 1906.10671 · v1 · submitted 2019-06-25 · cs.LG · stat.ML

Pith reviewed 2026-05-25 16:17 UTC · model grok-4.3

keywords: counterfactual explanations, adversarial examples, model interpretability, constrained optimization, deep learning, explainable AI, classification

The pith

Constrained adversarial examples generate counterfactual explanations that respect domain rules like categories and ranges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Constrained Adversarial Examples (CADEX) to generate counterfactual explanations for deep learning models. These explanations show specific input changes that would produce a different classification outcome. The changes are optimized to obey explicit constraints such as valid values for categorical attributes and allowed numerical ranges. This produces actionable suggestions that remain feasible in real applications. This matters because unconstrained counterfactuals often suggest changes that cannot be carried out in practice.

Core claim

The paper claims that adversarial perturbations can be optimized under explicit constraints on input features to produce counterfactual examples that remain valid under those constraints, thereby yielding explanations that incorporate business or domain rules such as categorical attributes and range constraints while still reflecting the model's decision process.

What carries the argument

Constrained Adversarial Examples (CADEX), which generate perturbations optimized subject to constraints to produce valid counterfactuals for model explanations.
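
As a rough illustration of the idea, the sketch below runs projected gradient descent on the input of a toy logistic model: each step moves the input toward the target class and then clips it back into the allowed range. The model, the function name `constrained_counterfactual`, the step size, and the box constraints are all assumptions made for this example; this page does not specify the paper's actual loss, optimizer, or constraint handling, so this should not be read as the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def constrained_counterfactual(x, w, b, lo, hi, target=1, lr=0.1, max_steps=500):
    """Search for a counterfactual of x under box constraints lo <= x <= hi.

    Hypothetical toy setup: the "model" is logistic regression with weights w
    and bias b. Assumes the starting point x already lies inside [lo, hi].
    """
    x_cf = np.asarray(x, dtype=float).copy()
    for _ in range(max_steps):
        p = sigmoid(w @ x_cf + b)
        if (p > 0.5) == bool(target):
            return x_cf  # predicted class now equals the target class
        grad = (p - target) * w  # gradient of the cross-entropy loss w.r.t. the input
        x_cf = np.clip(x_cf - lr * grad, lo, hi)  # gradient step, then project onto the box
    return None  # no counterfactual found within the step budget


w = np.array([1.0, -1.0])
x = np.array([0.2, 0.8])  # classified as 0 by the toy model
x_cf = constrained_counterfactual(x, w, 0.0, np.zeros(2), np.ones(2), target=1)
print("counterfactual:", x_cf)
```

Returning `None` when the budget runs out makes the failure case explicit: under tight constraints there may be no reachable input that flips the prediction.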

If this is right

Explanations can directly handle categorical attributes by producing only valid category values.
Suggested changes stay within feasible numerical ranges without needing separate repair steps.
The same optimization framework applies across different real-world datasets that carry business rules.
Counterfactuals become directly usable as recommendations rather than requiring post-hoc filtering.
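
To make the categorical point concrete, here is one possible way to force a perturbed vector back onto valid category values, assuming categorical attributes are one-hot encoded. The helper name `snap_one_hot` and the argmax rule are illustrative assumptions; the page says only that CADEX handles categorical attributes, not how.

```python
import numpy as np


def snap_one_hot(x, groups):
    """Return a copy of x in which every one-hot group has exactly one 1.

    `groups` is a list of index lists, one per categorical attribute
    (a hypothetical encoding chosen for this sketch).
    """
    out = np.asarray(x, dtype=float).copy()
    for idx in groups:
        idx = list(idx)
        winner = idx[int(np.argmax(out[idx]))]  # category with the largest score
        out[idx] = 0.0
        out[winner] = 1.0
    return out


# Feature 0 is numeric, features 1-3 encode one categorical attribute,
# feature 4 is numeric and is left untouched.
perturbed = np.array([0.3, 0.2, 0.6, 0.1, 35.0])
print(snap_one_hot(perturbed, groups=[[1, 2, 3]]))
```

Snapping after every step (rather than once at the end) keeps intermediate points valid, at the cost of a less smooth search path.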

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach might integrate with automated compliance checkers to ensure explanations meet regulatory standards.
Applying similar constraints during model training could produce models whose decisions are easier to explain from the start.
Testing the method on regression or ranking tasks would show whether the same constraint-handling logic generalizes beyond classification.

Load-bearing premise

The approach assumes that adversarial perturbations can be optimized under explicit constraints while still producing valid, interpretable counterfactuals that meaningfully reflect the model's decision process rather than artifacts of the constraint enforcement.

What would settle it

Feed the generated counterfactual inputs back into the model and check whether the prediction changes to the intended alternative outcome while every constraint remains strictly satisfied. If the prediction does not change or a constraint is violated, the method fails to deliver valid explanations.
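
The check described above can be sketched as a small validation function. Everything here is hypothetical scaffolding for illustration: `predict` stands in for any classifier that returns a class label, and the range and one-hot checks are assumed forms of the "domain constraints" the paper refers to.

```python
import numpy as np


def is_valid_counterfactual(predict, x_cf, target, lo, hi, groups=()):
    """True only if x_cf flips the prediction to `target` and meets every constraint."""
    x_cf = np.asarray(x_cf, dtype=float)
    # Range constraint: every feature inside its allowed interval.
    in_range = bool(np.all(x_cf >= lo) and np.all(x_cf <= hi))
    # Categorical constraint: each one-hot group holds only 0/1 and sums to 1.
    one_hot_ok = all(
        bool(np.all(np.isin(x_cf[list(g)], (0.0, 1.0)))) and x_cf[list(g)].sum() == 1.0
        for g in groups
    )
    # Model check: the prediction must actually change to the intended class.
    flipped = predict(x_cf) == target
    return bool(in_range and one_hot_ok and flipped)


def toy_predict(v):
    # Hypothetical stand-in model: class 1 when the first feature exceeds 0.5.
    return int(v[0] > 0.5)


print(is_valid_counterfactual(toy_predict, [0.7, 1.0, 0.0], 1, 0.0, 1.0, groups=[[1, 2]]))
```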

Figures

Figures reproduced from arXiv: 1906.10671 by Chris Watkins, Jonathan Moore, Nils Hammerla.

**Figure 2.** Number of solutions found by nchanged. Accompanying text from the paper: "elements, it may get stuck in a local minimum or simply not point at the right direction to cross the decision boundary. To see how significant this is, we plot histograms of how many solutions were found per training set item, for the 3 values of nchanged. As can be seen in figure 2, for most samples CADEX finds at least 3 or 4 explanations which should be enough for …"

**Figure 3.** Cumulative distribution of distances found using CADEX vs. training set

**Figure 4.** Distribution of zero SHAP attributes, which were used to produce counterfactual explana…

Machine learning algorithms generally suffer from a problem of explainability. Given a classification result from a model, it is typically hard to determine what caused the decision to be made, and to give an informative explanation. We explore a new method of generating counterfactual explanations, which, instead of explaining why a particular classification was made, explains how a different outcome can be achieved. This gives the recipients of the explanation a better way to understand the outcome, and provides an actionable suggestion. We show that the introduced method of Constrained Adversarial Examples (CADEX) can be used in real world applications, and yields explanations which incorporate business or domain constraints such as handling categorical attributes and range constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CADEX adapts adversarial examples to respect explicit constraints for counterfactuals, a useful but incremental step.

The paper introduces CADEX as a way to generate counterfactual explanations by constraining adversarial perturbations so they respect categorical features and range bounds. This directly tackles a real deployment issue where plain counterfactuals often suggest impossible changes. The authors frame it as a practical extension rather than a theoretical overhaul, and the stress-test note confirms the optimization uses projection or penalties without obvious inconsistencies in the reported results on standard datasets. That part holds up on the evidence given.

What stands out is the explicit handling of business constraints in the generation loop, which is not automatic in basic adversarial setups. The experiments appear to show feasible outputs, which is the main positive.

Soft spots are modest. The method leans on existing adversarial machinery, so the novelty sits mostly in the constraint integration rather than a new optimization trick. Without deeper ablations on how much the constraints change explanation quality or fidelity to the model, it is hard to judge whether the outputs remain truly informative or just constraint-compliant. The abstract-level claims about real-world applicability are reasonable but rest on the experiments described in the full text.

This work is for people already using counterfactuals in tabular ML settings who need to add domain rules. A reader working on XAI tooling would pick up the constraint technique and test it themselves. It is coherent enough on its own terms to warrant referee time rather than a desk reject, even if revisions would likely focus on stronger baselines and more varied datasets.

Referee Report

2 major / 2 minor

Summary. The paper introduces Constrained Adversarial Examples (CADEX), a method for generating counterfactual explanations of deep learning classification decisions. Instead of explaining why a given outcome occurred, CADEX produces minimal perturbations that achieve a different target outcome while enforcing explicit constraints such as categorical attribute handling and range bounds. The central claim is that this constrained optimization procedure yields feasible, actionable explanations suitable for real-world applications.

Significance. If the central claim holds, the work has moderate significance for the XAI literature by extending adversarial-example techniques to respect domain constraints, which is a practical requirement for deployment. The explicit incorporation of constraints via projection or penalization is a clear technical contribution over unconstrained counterfactual methods, and the reported experiments on standard datasets demonstrate basic feasibility. No machine-checked proofs or parameter-free derivations are present.

major comments (2)

[Abstract, Experiments] Abstract and Experiments section: the assertion that CADEX 'can be used in real world applications' is not supported by the reported evaluation, which uses only standard benchmark datasets without domain-specific business constraints or real deployment validation; this weakens the applicability claim.
[Method] Method section: the description of how constraints are enforced (projection vs. penalty) lacks an explicit statement of the optimization objective and convergence criteria, making it difficult to assess whether the generated examples remain faithful to the model's decision boundary rather than artifacts of the constraint enforcement.

minor comments (2)

[Method] Notation for the constrained optimization problem should be introduced with a numbered equation for clarity.
[Experiments] The paper should include a comparison table against at least one prior counterfactual method (e.g., Wachter et al.) on the same metrics to quantify improvement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation of minor revision. We address each major comment below.

Referee: [Abstract, Experiments] Abstract and Experiments section: the assertion that CADEX 'can be used in real world applications' is not supported by the reported evaluation, which uses only standard benchmark datasets without domain-specific business constraints or real deployment validation; this weakens the applicability claim.

Authors: We agree that the experiments are limited to standard benchmark datasets and do not include real-world deployment or proprietary business constraints. The original claim in the abstract is therefore not fully supported by the empirical results. We will revise the abstract to state that CADEX 'is designed for use in real-world applications by incorporating domain constraints' and add a paragraph in the discussion section illustrating how the method's constraint mechanisms can be instantiated with typical business rules. revision: yes

Referee: [Method] Method section: the description of how constraints are enforced (projection vs. penalty) lacks an explicit statement of the optimization objective and convergence criteria, making it difficult to assess whether the generated examples remain faithful to the model's decision boundary rather than artifacts of the constraint enforcement.

Authors: We acknowledge that the method section would benefit from a more explicit mathematical formulation. In the revision we will add a dedicated paragraph stating the full optimization objective (adversarial loss plus constraint penalty or projection term) and the convergence criteria used (e.g., maximum iterations or objective change threshold). This will clarify that the generated examples are driven by the model's decision boundary rather than solely by the constraint mechanism. revision: yes
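
For orientation, one conventional way such an objective and stopping rule could be written is shown below. This is an assumed formulation, not the paper's: the symbols $f$ (model), $\mathcal{C}$ (feasible set defined by the constraints), $\Pi_{\mathcal{C}}$ (projection onto that set), $d$ (distance to the original input $x$), $\lambda$, $\eta$, and the step budget $T$ are all introduced here for illustration.

```latex
\begin{aligned}
x^{\star} &= \arg\min_{x' \in \mathcal{C}} \; \mathcal{L}\bigl(f(x'),\, y_{\mathrm{target}}\bigr) + \lambda\, d(x, x') \\
x'_{t+1} &= \Pi_{\mathcal{C}}\!\Bigl(x'_{t} - \eta\, \nabla_{x'} \bigl[\mathcal{L}\bigl(f(x'_{t}),\, y_{\mathrm{target}}\bigr) + \lambda\, d(x, x'_{t})\bigr]\Bigr) \\
\text{stop when}\quad & \arg\max_{k} f_{k}(x'_{t}) = y_{\mathrm{target}} \quad\text{or}\quad t = T
\end{aligned}
```

Written this way, the first term is what ties the result to the model's decision boundary, while the projection only restricts where the search may go; a penalty variant would instead add a constraint-violation term to the objective and drop $\Pi_{\mathcal{C}}$.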

Circularity Check

0 steps flagged

No significant circularity

The paper introduces CADEX as a constrained optimization procedure for generating counterfactual explanations that respect domain constraints such as categorical handling and range bounds. No derivation chain, first-principles result, or prediction is presented that reduces by construction to fitted inputs, self-definitions, or self-citation chains; the method is described algorithmically and evaluated on standard datasets, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new postulated entities.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 5 internal anchors

[1] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. CoRR abs/1707.07397 (2017), http://arxiv.org/abs/1707.07397

[2] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. CoRR abs/1712.09665 (2017), http://arxiv.org/abs/1712.09665

[3] Dua, D., Karra Taniskidou, E.: UCI machine learning repository (2017), http://archive.ics.uci.edu/ml

[4] Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: International Conference on Learning Representations (2015), http://arxiv.org/abs/1412.6572

[5] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014), http://arxiv.org/abs/1412.6980

[6] Kurakin, A., Goodfellow, I.J., Bengio, S.: Adversarial examples in the physical world. CoRR abs/1607.02533 (2016)

[7] Laugel, T., Lesot, M.J., Marsala, C., Renard, X., Detyniecki, M.: Comparison-based inverse classification for interpretability in machine learning. In: Medina, J., Ojeda-Aciego, M., Verdegay, J.L., Pelta, D.A., Cabrera, I.P., Bouchon-Meunier, B., Yager, R.R. (eds.) Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory ... (2018)

[8] Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 4765–4774. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreti...

[9] Miller, T.: Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 267, 1–38 (2019)

[10] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against machine learning. In: Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. pp. 506–519. ASIA CCS '17, ACM, New York, NY, USA (2017). https://doi.org/10.1145/3052973.3053009, http://doi.acm.org/10.1145/305...

[11] Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?": Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1135–1144. KDD '16, ACM, New York, NY, USA (2016). https://doi.org/10.1145/2939672.2939778, http://doi.acm.org/10.1145/2939672.2939778

[12] Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 3145–3153. PMLR, International Convention Centre, Sydney, Australia (06–11 Aug 2017...

[13] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. In: International Conference on Learning Representations (2014), http://arxiv.org/abs/1312.6199

[14] Wachter, S., Mittelstadt, B., Russell, C.: Counterfactual explanations without opening the black box: automated decisions and the GDPR. Harvard Journal of Law and Technology 31(2), 841–887 (2018)

[15] Zhang, Q.s., Zhu, S.c.: Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering 19(1), 27–39 (Jan 2018). https://doi.org/10.1631/FITEE.1700808
