Explaining Deep Learning Models with Constrained Adversarial Examples
Pith reviewed 2026-05-25 16:17 UTC · model grok-4.3
The pith
Constrained adversarial examples generate counterfactual explanations that respect domain rules like categories and ranges.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that adversarial perturbations can be optimized under explicit constraints on input features, so that the resulting counterfactual examples remain valid under those constraints. The explanations therefore incorporate business or domain rules, such as categorical attributes and range constraints, while still reflecting the model's decision process.
What carries the argument
Constrained Adversarial Examples (CADEX), which generate perturbations optimized subject to constraints to produce valid counterfactuals for model explanations.
If this is right
- Explanations can directly handle categorical attributes by producing only valid category values.
- Suggested changes stay within feasible numerical ranges without needing separate repair steps.
- The same optimization framework applies across different real-world datasets that carry business rules.
- Counterfactuals become directly usable as recommendations rather than requiring post-hoc filtering.
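To make the first two points concrete, here is a minimal Python sketch of what "producing only valid category values" and "staying within feasible ranges" could look like as a projection step. It is not taken from the paper: the function name `project`, the feature layout, and the bounds are all assumptions chosen for illustration, and one-hot encoding of categoricals is assumed.

```python
import numpy as np

def project(x, lower, upper, onehot_groups):
    """Project a candidate onto simple domain constraints (illustrative only).

    lower/upper: per-feature bounds.
    onehot_groups: list of index lists, one per one-hot encoded categorical.
    """
    # np.clip returns a new array, so the caller's candidate is not modified.
    x = np.clip(np.asarray(x, dtype=float), lower, upper)
    for idx in onehot_groups:
        # Snap each group to the single category with the largest value.
        winner = idx[int(np.argmax(x[idx]))]
        x[idx] = 0.0
        x[winner] = 1.0
    return x

# Hypothetical layout: [age, income, job=A, job=B, job=C]
lower = np.array([18.0, 0.0, 0.0, 0.0, 0.0])
upper = np.array([100.0, 1e6, 1.0, 1.0, 1.0])
candidate = np.array([15.2, 52000.0, 0.3, 0.6, 0.1])
projected = project(candidate, lower, upper, onehot_groups=[[2, 3, 4]])
print(projected)
```

In this toy case the out-of-range age is clipped up to the lower bound and the soft job scores collapse to a single valid category, so no separate repair step is needed afterwards.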
Where Pith is reading between the lines
- This approach might integrate with automated compliance checkers to ensure explanations meet regulatory standards.
- Applying similar constraints during model training could produce models whose decisions are easier to explain from the start.
- Testing the method on regression or ranking tasks would show whether the same constraint-handling logic generalizes beyond classification.
Load-bearing premise
The approach assumes that adversarial perturbations can be optimized under explicit constraints while still producing valid, interpretable counterfactuals that meaningfully reflect the model's decision process rather than artifacts of the constraint enforcement.
What would settle it
Feed the generated counterfactual inputs back into the model and check whether the prediction changes to the intended alternative outcome while every constraint remains strictly satisfied. If the prediction does not change, or a constraint is violated, the method fails to deliver valid explanations.
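The test described here can be sketched as a small checker. This is a hedged illustration, not code from the paper: `is_valid_counterfactual`, `toy_predict`, the weights, and the bounds are hypothetical, and the classifier is a stand-in linear rule rather than a deep model.

```python
import numpy as np

def is_valid_counterfactual(predict, x_cf, target, lower, upper, onehot_groups, tol=1e-9):
    """Return True only if x_cf flips the prediction AND satisfies every constraint."""
    x_cf = np.asarray(x_cf, dtype=float)
    lower, upper = np.asarray(lower, dtype=float), np.asarray(upper, dtype=float)
    in_range = np.all(x_cf >= lower - tol) and np.all(x_cf <= upper + tol)
    # Each categorical group must be exactly one-hot.
    onehot_ok = all(
        np.all(np.isin(x_cf[g], (0.0, 1.0))) and np.isclose(x_cf[g].sum(), 1.0)
        for g in onehot_groups
    )
    flipped = predict(x_cf) == target
    return bool(in_range and onehot_ok and flipped)

# Toy stand-in for a trained classifier (hypothetical weights).
w, b = np.array([1.0, 0.5, -0.5]), -1.0
def toy_predict(x):
    return int(w @ x + b > 0)

lower, upper, groups = [0.0, 0.0, 0.0], [2.0, 1.0, 1.0], [[1, 2]]
ok = is_valid_counterfactual(toy_predict, [0.8, 1.0, 0.0], 1, lower, upper, groups)
soft = is_valid_counterfactual(toy_predict, [1.2, 0.5, 0.5], 1, lower, upper, groups)
out_of_range = is_valid_counterfactual(toy_predict, [2.5, 1.0, 0.0], 1, lower, upper, groups)
print(ok, soft, out_of_range)
```

The second input is the interesting case: it does flip the toy model's prediction, but its categorical group is not one-hot, so it is rejected. A flipped prediction alone is not enough to pass the check.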
Original abstract
Machine learning algorithms generally suffer from a problem of explainability. Given a classification result from a model, it is typically hard to determine what caused the decision to be made, and to give an informative explanation. We explore a new method of generating counterfactual explanations, which, instead of explaining why a particular classification was made, explain how a different outcome can be achieved. This gives the recipients of the explanation a better way to understand the outcome, and provides an actionable suggestion. We show that the introduced method of Constrained Adversarial Examples (CADEX) can be used in real-world applications, and yields explanations which incorporate business or domain constraints such as handling categorical attributes and range constraints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Constrained Adversarial Examples (CADEX), a method for generating counterfactual explanations of deep learning classification decisions. Instead of explaining why a given outcome occurred, CADEX produces minimal perturbations that achieve a different target outcome while enforcing explicit constraints such as categorical attribute handling and range bounds. The central claim is that this constrained optimization procedure yields feasible, actionable explanations suitable for real-world applications.
Significance. If the central claim holds, the work has moderate significance for the XAI literature by extending adversarial-example techniques to respect domain constraints, which is a practical requirement for deployment. The explicit incorporation of constraints via projection or penalization is a clear technical contribution over unconstrained counterfactual methods, and the reported experiments on standard datasets demonstrate basic feasibility. No machine-checked proofs or parameter-free derivations are present.
major comments (2)
- [Abstract, Experiments] Abstract and Experiments section: the assertion that CADEX 'can be used in real world applications' is not supported by the reported evaluation, which uses only standard benchmark datasets without domain-specific business constraints or real deployment validation; this weakens the applicability claim.
- [Method] Method section: the description of how constraints are enforced (projection vs. penalty) lacks an explicit statement of the optimization objective and convergence criteria, making it difficult to assess whether the generated examples remain faithful to the model's decision boundary rather than artifacts of the constraint enforcement.
minor comments (2)
- [Method] Notation for the constrained optimization problem should be introduced with a numbered equation for clarity.
- [Experiments] The paper should include a comparison table against at least one prior counterfactual method (e.g., Wachter et al.) on the same metrics to quantify improvement.
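As an illustration of the kind of numbered equation the first minor comment asks for, one generic way to write a constrained counterfactual objective is shown below. This is a template in the general spirit of distance-regularized counterfactual objectives, not the paper's own formulation. Every symbol is an assumption introduced here: $f$ is the classifier, $x$ the original input, $y_{\mathrm{target}}$ the desired class, $\ell$ a classification loss, $d$ a distance, $\lambda$ a trade-off weight, $N$ the set of numeric features with bounds $l_j, u_j$, and $G_k$ the index set of the $k$-th one-hot encoded categorical attribute with unit vectors $e_i$.

```latex
\begin{equation}
  x^{\mathrm{cf}} \;=\; \arg\min_{x' \in \mathcal{C}}
  \; \ell\bigl(f(x'),\, y_{\mathrm{target}}\bigr) \;+\; \lambda\, d(x, x')
\end{equation}
\begin{equation}
  \mathcal{C} \;=\; \bigl\{\, x' \;:\; l_j \le x'_j \le u_j \;\; \forall j \in N,
  \quad x'_{G_k} \in \{e_1, \dots, e_{|G_k|}\} \;\; \forall k \,\bigr\}
\end{equation}
```

Written this way, the projection variant keeps $x'$ inside $\mathcal{C}$ at every step, while a penalty variant would drop the constraint set and add a violation term to the objective instead.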
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation of minor revision. We address each major comment below.
- Referee: [Abstract, Experiments] Abstract and Experiments section: the assertion that CADEX 'can be used in real world applications' is not supported by the reported evaluation, which uses only standard benchmark datasets without domain-specific business constraints or real deployment validation; this weakens the applicability claim.
  Authors: We agree that the experiments are limited to standard benchmark datasets and do not include real-world deployment or proprietary business constraints. The original claim in the abstract is therefore not fully supported by the empirical results. We will revise the abstract to state that CADEX 'is designed for use in real-world applications by incorporating domain constraints' and add a paragraph in the discussion section illustrating how the method's constraint mechanisms can be instantiated with typical business rules.
  Revision: yes
- Referee: [Method] Method section: the description of how constraints are enforced (projection vs. penalty) lacks an explicit statement of the optimization objective and convergence criteria, making it difficult to assess whether the generated examples remain faithful to the model's decision boundary rather than artifacts of the constraint enforcement.
  Authors: We acknowledge that the method section would benefit from a more explicit mathematical formulation. In the revision we will add a dedicated paragraph stating the full optimization objective (adversarial loss plus constraint penalty or projection term) and the convergence criteria used (e.g., maximum iterations or objective change threshold). This will clarify that the generated examples are driven by the model's decision boundary rather than solely by the constraint mechanism.
  Revision: yes
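The stopping rules mentioned in the second response (maximum iterations, objective change threshold) can be sketched in a small projected-gradient loop. This is an assumption-laden illustration and not the authors' algorithm: the model is a toy logistic regression with hypothetical weights, only range constraints are projected, and `find_counterfactual` and its parameters are names invented here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def find_counterfactual(x0, w, b, lower, upper, lr=0.1, max_iter=500, tol=1e-8):
    """Projected gradient descent toward class 1 of a toy logistic model.

    Stops when the prediction flips, when the loss change drops below `tol`,
    or after `max_iter` steps. Returns (x, steps_taken, status).
    """
    x = np.array(x0, dtype=float)
    prev_loss = np.inf
    for step in range(1, max_iter + 1):
        p = sigmoid(w @ x + b)
        if p > 0.5:
            return x, step - 1, "flipped"
        loss = -np.log(p)                      # adversarial loss toward class 1
        if abs(prev_loss - loss) < tol:
            return x, step - 1, "stalled"      # objective change threshold
        prev_loss = loss
        grad = -(1.0 - p) * w                  # gradient of -log(sigmoid(w.x + b)) w.r.t. x
        x = np.clip(x - lr * grad, lower, upper)  # projection onto range constraints
    return x, max_iter, "max_iter"

w, b = np.array([1.0, -2.0]), -0.5             # hypothetical toy weights
lower, upper = np.array([0.0, 0.0]), np.array([1.0, 1.0])
x_cf, steps, status = find_counterfactual([0.5, 0.5], w, b, lower, upper)

# With bounds that pin x to its starting point, no flip is reachable.
pinned = np.array([0.5, 0.5])
_, _, status_pinned = find_counterfactual([0.5, 0.5], w, b, pinned, pinned)
print(status, status_pinned)
```

The returned status is what speaks to the referee's concern. A run that ends in "stalled" or "max_iter" was stopped by the constraints or the budget rather than by crossing the decision boundary, and its output should not be reported as an explanation.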
Circularity Check
No significant circularity
The paper introduces CADEX as a constrained optimization procedure for generating counterfactual explanations that respect domain constraints such as categorical handling and range bounds. No derivation chain, first-principles result, or prediction is presented that reduces by construction to fitted inputs, self-definitions, or self-citation chains; the method is described algorithmically and evaluated on standard datasets, remaining self-contained against external benchmarks.
Reference graph
Works this paper leans on
- [1] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. CoRR abs/1707.07397 (2017), http://arxiv.org/abs/1707.07397
- [2] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. CoRR abs/1712.09665 (2017), http://arxiv.org/abs/1712.09665
- [3] Dua, D., Karra Taniskidou, E.: UCI machine learning repository (2017), http://archive.ics.uci.edu/ml
- [4] Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: International Conference on Learning Representations (2015), http://arxiv.org/abs/1412.6572
- [5] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014), http://arxiv.org/abs/1412.6980
- [6] Kurakin, A., Goodfellow, I.J., Bengio, S.: Adversarial examples in the physical world. CoRR abs/1607.02533 (2016)
- [7] Laugel, T., Lesot, M.J., Marsala, C., Renard, X., Detyniecki, M.: Comparison-based inverse classification for interpretability in machine learning. In: Medina, J., Ojeda-Aciego, M., Verdegay, J.L., Pelta, D.A., Cabrera, I.P., Bouchon-Meunier, B., Yager, R.R. (eds.) Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory ...
- [8] Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 4765–4774. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreti...
- [9] Miller, T.: Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 267, 1–38 (2019)
- [10] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against machine learning. In: Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. pp. 506–519. ASIA CCS '17, ACM, New York, NY, USA (2017). https://doi.org/10.1145/3052973.3053009, http://doi.acm.org/10.1145/305...
- [11] Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?": Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1135–1144. KDD '16, ACM, New York, NY, USA (2016). https://doi.org/10.1145/2939672.2939778, http://doi.acm.org/10.1145/2939672.2939778
- [12] Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 3145–3153. PMLR, International Convention Centre, Sydney, Australia (06–11 Aug 2017...
- [13] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. In: International Conference on Learning Representations (2014), http://arxiv.org/abs/1312.6199
- [14] Wachter, S., Mittelstadt, B., Russell, C.: Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law and Technology 31(2), 841–887 (2018)
- [15] Zhang, Q.s., Zhu, S.c.: Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering 19(1), 27–39 (Jan 2018). https://doi.org/10.1631/FITEE.1700808