
arXiv: 2410.05284 · v2 · submitted 2024-09-29 · 💻 cs.CR · cs.AI · cs.CV · cs.LG

Hypnopaedia-Aware Machine Unlearning via Psychometrics of Artificial Mental Imagery

Pith reviewed 2026-05-23 21:05 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CV · cs.LG
keywords neural backdoors · machine unlearning · model inversion · backdoor detection · artificial mental imagery · AI security · cybersecurity · hypnopaedia

The pith

A self-aware unlearning method uses model inversion to generate artificial mental imagery from neural networks, then applies hypothesis analysis to estimate and detach backdoor triggers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create an autonomous system that detects and removes hidden backdoor triggers in trained models without external labels or manual intervention. It does so by inverting the model to produce artificial mental imagery, applying stochastic disruption during inversion, and then running statistical hypothesis tests to score how likely each candidate pattern is to be the actual trigger. The goal is to preserve the model's original task performance while breaking the conditioned response to the trigger. A reader would care because backdoors allow stealthy control of AI systems through data poisoning, and an internal detection loop could reduce reliance on trusted training sources.

Core claim

Through reverse engineering and model inversion, artificial mental imagery is elicited from the network; stochastic processes are introduced to prevent convergence on flawed patterns; hypothesis analysis then assigns infection probabilities to candidate triggers, enabling the model to autonomously detach its behavior from the backdoor while retaining knowledge fidelity.
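
The inversion step with stochastic disruption can be illustrated on a toy problem. The sketch below is not the paper's procedure: this page gives no inversion objective, so it assumes a hypothetical logistic scorer (`WEIGHTS`), gradient ascent on the input, a small decay term, and Gaussian noise injected at every step. The names `target_score`, `invert`, `decay` and `noise_scale` are all illustrative assumptions.

```python
import math
import random

# Hypothetical toy "model": a logistic scorer with fixed weights.
# The large weight on feature 2 stands in for a backdoor shortcut.
WEIGHTS = [0.2, -0.1, 3.0, 0.1]


def target_score(x):
    """Probability the toy model assigns to the target class."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))


def invert(steps=200, lr=0.5, decay=0.1, noise_scale=0.02, seed=0):
    """Noisy gradient ascent on the input (assumed objective, not the paper's)."""
    rng = random.Random(seed)
    x = [rng.uniform(-0.1, 0.1) for _ in WEIGHTS]  # random initial noise
    for _ in range(steps):
        p = target_score(x)
        updated = []
        for xi, w in zip(x, WEIGHTS):
            # Gradient of sigmoid(w.x) - (decay / 2) * ||x||^2 with respect to x_i.
            grad = p * (1.0 - p) * w - decay * xi
            # Gaussian noise perturbs the optimisation path at each step.
            xi = xi + lr * grad + rng.gauss(0.0, noise_scale)
            updated.append(min(1.0, max(-1.0, xi)))  # keep the input in [-1, 1]
        x = updated
    return x


image = invert()
# Index of the feature with the largest magnitude in the inverted input.
dominant = max(range(len(image)), key=lambda i: abs(image[i]))
```

In this toy setting the inverted input concentrates on the feature carrying the shortcut weight. Whether real inversions expose real triggers this cleanly is exactly the premise the review questions.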

What carries the argument

The psychometrics of artificial mental imagery, produced by model inversion followed by hypothesis analysis that scores each generated pattern for likelihood of being the true trigger.

If this is right

  • The framework supports continuous monitoring of threats from untrustworthy data sources without halting training.
  • Autonomous detachment from the trigger occurs while the equilibrium between task accuracy and security is preserved.
  • Deceptive patterns are identified through statistical inference rather than human inspection of weights or data.
  • Infection likelihood is quantified for each potential trigger, allowing prioritized removal decisions.
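
The last consequence, prioritized removal, amounts to ranking candidates by their estimated infection likelihood. A minimal sketch follows, assuming the likelihoods are already available as numbers in [0, 1]; the candidate names, values and the `cutoff` parameter are hypothetical and chosen only for illustration.

```python
# Hypothetical infection likelihoods per candidate trigger (illustrative values only).
candidates = {"patch_a": 0.91, "patch_b": 0.12, "patch_c": 0.47}


def removal_queue(likelihoods, cutoff=0.4):
    """Return candidates at or above the cutoff, most suspicious first."""
    flagged = [(name, p) for name, p in likelihoods.items() if p >= cutoff]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


queue = removal_queue(candidates)
```

The cutoff trades missed triggers against unnecessary unlearning; the page does not say how the paper sets it.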

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same inversion-plus-hypothesis loop could be tested on models that have undergone multiple sequential backdoor insertions to measure cumulative detection accuracy.
  • If the generated imagery proves separable in practice, the approach might be combined with periodic self-audits during deployment to catch triggers introduced after initial training.
  • Extending the stochastic disruption step to other inversion-based attacks could reveal whether the method generalizes beyond backdoors to related model-extraction vulnerabilities.

Load-bearing premise

Model inversion will reliably surface the backdoor trigger inside the generated imagery as a pattern that can be statistically distinguished from legitimate features.

What would settle it

Apply the full pipeline to a model with a known, verifiable backdoor trigger and check whether the hypothesis analysis assigns a markedly higher infection probability to the true trigger than to any other candidate pattern.
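
One way to make "markedly higher" concrete is a ratio test against the strongest competing candidate. The sketch below assumes the hypothesis analysis outputs one non-negative score per candidate; the function names and the `min_ratio` value of 2.0 are assumptions for illustration, not quantities taken from the paper.

```python
def trigger_margin(scores, true_trigger):
    """Ratio of the true trigger's score to the best competing candidate's score."""
    rivals = [s for name, s in scores.items() if name != true_trigger]
    if not rivals:
        raise ValueError("need at least one competing candidate")
    best_rival = max(rivals)
    if best_rival <= 0:
        return float("inf")
    return scores[true_trigger] / best_rival


def passes_check(scores, true_trigger, min_ratio=2.0):
    """True when the known trigger outscores every rival by at least min_ratio."""
    return trigger_margin(scores, true_trigger) >= min_ratio


# Illustrative scores only; "true" marks the planted trigger.
clear_case = {"true": 0.8, "decoy_1": 0.2, "decoy_2": 0.1}
weak_case = {"true": 0.3, "decoy_1": 0.25}
```

Here `passes_check(clear_case, "true")` holds while `passes_check(weak_case, "true")` does not, which is the distinction the proposed experiment would need to show on a real backdoored model.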

Figures

Figures reproduced from arXiv: 2410.05284 by Anastasia Kordoni, Ching-Chun Chang, Christopher Leckie, Isao Echizen, Kai Gao, Shuying Xu.

Figure 1: Cybernetic framework that consists of learner, controller and unlearner
Figure 2: Psychometric profile that shows probability of infection, backdoor
Figure 3: Illustration of backdoor attack through implanting triggers into samples
Figure 4: Systematic pipeline for backdoor defence consisting of model inversion, hypothesis analysis and machine unlearning.
Figure 5: Illustration of multi-scale model inversion for projecting an artificial mental image with a random initial noise.
Figure 6: Illustration of hypothesis analysis process for an artificial mental
Figure 7: Illustration of machine unlearning process.
Figure 8: Visualisation of reverse-engineered triggers with various methods.
Figure 9: Visualisation of artificial mental images from models of both infected and uninfected states.
Figure 10: Visualisation of natural triggers along with artificial mental images from test and surrogate models of uninfected state.
Original abstract

Neural backdoors represent insidious cybersecurity loopholes that render learning machinery vulnerable to unauthorised manipulations, potentially enabling the weaponisation of artificial intelligence with catastrophic consequences. A backdoor attack involves the clandestine infiltration of a trigger during the learning process, metaphorically analogous to hypnopaedia, where ideas are implanted into a subject's subconscious mind under the state of hypnosis or unconsciousness. When activated by a sensory stimulus, the trigger evokes a conditioned reflex that directs a machine to mount a predetermined response. In this study, we propose a cybernetic framework for constant surveillance of backdoor threats, driven by the dynamic nature of untrustworthy data sources. We develop a self-aware unlearning mechanism to autonomously detach a machine's behaviour from the backdoor trigger. Through reverse engineering and statistical inference, we detect deceptive patterns and estimate the likelihood of backdoor infection. We employ model inversion to elicit artificial mental imagery, using stochastic processes to disrupt optimisation pathways and avoid convergent but potentially flawed patterns. This is followed by hypothesis analysis, which estimates the likelihood of each potentially malicious pattern as the true trigger and infers the probability of infection. The primary objective of this study is to maintain a stable state of equilibrium between knowledge fidelity and backdoor vulnerability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes a cybernetic framework for detecting and unlearning neural backdoors, analogizing them to hypnopaedia. It claims a self-aware unlearning mechanism that performs reverse engineering followed by model inversion (using stochastic processes to disrupt optimization pathways) to elicit 'artificial mental imagery,' then applies hypothesis analysis to estimate the likelihood that each pattern is the true trigger and infers the probability of backdoor infection, with the goal of maintaining equilibrium between knowledge fidelity and vulnerability.

Significance. If the claimed pipeline could be formalized and shown to work, it would address an important problem in AI security by offering an autonomous, self-aware unlearning approach. The manuscript, however, supplies no algorithms, objective functions, statistical tests, or results, so no assessment of actual significance is possible.

major comments (3)
  1. [Abstract] Abstract: the central claim that model inversion will 'elicit artificial mental imagery' amenable to statistical separation from legitimate features is unsupported; no inversion objective, loss, or stochastic disruption procedure is defined, and no condition guaranteeing distinguishability is stated. This is load-bearing for the entire unlearning and probability-estimation pipeline.
  2. [Abstract] Abstract: the hypothesis analysis step is described only as estimating 'the likelihood of each potentially malicious pattern as the true trigger,' with no hypothesis test, likelihood model, or decision rule provided; without these, the infection-probability inference cannot be performed or validated.
  3. [Abstract] Abstract / full text: the manuscript states the existence of mechanisms and an objective but contains no derivation, algorithm, dataset, experiment, or proof that the described steps achieve the claimed equilibrium between fidelity and vulnerability.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed review. The manuscript introduces a conceptual cybernetic framework for backdoor unlearning via the hypnopaedia analogy and self-aware mechanisms, without formal algorithms or experiments. We address each major comment below, agreeing where the presentation lacks detail and noting planned revisions to clarify scope.

point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that model inversion will 'elicit artificial mental imagery' amenable to statistical separation from legitimate features is unsupported; no inversion objective, loss, or stochastic disruption procedure is defined, and no condition guaranteeing distinguishability is stated. This is load-bearing for the entire unlearning and probability-estimation pipeline.

    Authors: We agree that the abstract presents model inversion at a conceptual level only, without defining an objective function, loss, or explicit stochastic disruption procedure. The manuscript relies on the high-level description of using stochastic processes to avoid flawed convergence. This is a limitation of the current text. We will revise to add a high-level description of the inversion as a stochastic optimization process with noise injection to promote exploration of potential trigger patterns. revision: partial

  2. Referee: [Abstract] Abstract: the hypothesis analysis step is described only as estimating 'the likelihood of each potentially malicious pattern as the true trigger,' with no hypothesis test, likelihood model, or decision rule provided; without these, the infection-probability inference cannot be performed or validated.

    Authors: We agree that no specific hypothesis test, likelihood model, or decision rule is provided. The description remains at the level of statistical inference on elicited patterns. This reflects the conceptual focus of the work. In revision we will outline a possible likelihood estimation approach based on pattern consistency across multiple inversions and a simple threshold rule for inferring infection probability. revision: partial

  3. Referee: [Abstract] Abstract / full text: the manuscript states the existence of mechanisms and an objective but contains no derivation, algorithm, dataset, experiment, or proof that the described steps achieve the claimed equilibrium between fidelity and vulnerability.

    Authors: We acknowledge that the manuscript contains no derivations, algorithms, datasets, experiments, or proofs, as it is positioned as a high-level framework proposal introducing the hypnopaedia-inspired analogy. The equilibrium is stated as the primary objective but not demonstrated. We will revise the abstract and introduction to explicitly characterize the contribution as conceptual and to indicate that formalization and validation are left for subsequent work. revision: yes
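
The approach the simulated authors outline in response 2, pattern consistency across multiple inversions followed by a threshold rule, could be sketched as below. This is an assumption-laden illustration: it treats each inversion run as a binary mask over input positions, measures consistency as mean pairwise Jaccard similarity, and uses a hypothetical `threshold` of 0.6. None of these choices is specified in the source.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two binary masks; two empty masks count as identical."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0


def consistency(masks):
    """Mean pairwise Jaccard similarity across inversion runs."""
    pairs = list(combinations(masks, 2))
    if not pairs:
        raise ValueError("need at least two inversion runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def looks_infected(masks, threshold=0.6):
    """Simple threshold rule: flag the model when runs agree on a pattern."""
    return consistency(masks) >= threshold


# Illustrative masks only: three runs that agree, and three that do not.
stable_runs = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]]
scattered_runs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
```

The intuition is that a genuine trigger should reappear across independent stochastic inversions while incidental patterns should not. A threshold on that agreement yields a flag rather than a calibrated probability, so a likelihood model would still be needed to answer the referee's point in full.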

standing simulated objections not resolved
  • The manuscript contains no algorithms, objective functions, statistical tests, datasets, experiments, or proofs, preventing any empirical or formal demonstration of the pipeline.

Circularity Check

0 steps flagged

No circularity; derivation remains at descriptive level without self-referential reductions

full rationale

The provided abstract and description outline a high-level pipeline of reverse engineering, model inversion with stochastic disruption, and hypothesis analysis to estimate infection probability, but contain no equations, fitted parameters presented as independent predictions, self-citations, or uniqueness theorems. No load-bearing step reduces by construction to its own inputs; the claims are methodological assertions without demonstrated mathematical equivalence to the data or priors used. The derivation is therefore self-contained at the level of description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review supplies no explicit free parameters, axioms, or invented entities with independent evidence; the central narrative rests on the unstated premise that model inversion produces interpretable trigger representations.

invented entities (1)
  • artificial mental imagery (no independent evidence)
    purpose: Elicit backdoor triggers via model inversion for subsequent statistical analysis
    Introduced as the output of model inversion in the abstract; no independent evidence or falsifiable prediction is supplied.



Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages

  1. [1]

    Backdoor learning: A survey,

    Y . Li, Y . Jiang, Z. Li, and S.-T. Xia, “Backdoor learning: A survey,” IEEE Trans. Neural Netw. Learn. Syst. , vol. 35, no. 1, pp. 5–22, 2024

  2. [2]

    I. P. Pavlov, Conditioned Reflexes: An Investigation of the Physiological Activity of the Cerebral Cortex . Oxford, UK: Oxford University Press, 1927

  3. [3]

    Communication-efficient learning of deep networks from decentralized data,

    B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. Int. Conf. Artif. Intell. Stat. (AISTATS) , vol. 54, Fort Lauderdale, FL, USA, 2017, pp. 1273–1282

  4. [4]

    Evolutionary principles in self-referential learning,

    J. Schmidhuber, “Evolutionary principles in self-referential learning,” Diploma Thesis, Technische Universitä t Mü nchen, Munich, Germany, 1987

  5. [5]

    Thrun, Lifelong Learning Algorithms

    S. Thrun, Lifelong Learning Algorithms. New York, NY , USA: Springer, 1998, pp. 181–209

  6. [6]

    Building machines that learn and think like people,

    B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, “Building machines that learn and think like people,” Behav. Brain Sci., vol. 40, pp. 1–72, 2017

  7. [7]

    Online meta- learning,

    C. Finn, A. Rajeswaran, S. Kakade, and S. Levine, “Online meta- learning,” in Proc. Int. Conf. Mach. Learn. (ICML), vol. 97, Long Beach, CA, USA, 2019, pp. 1920–1930

  8. [8]

    Backdoor attacks against deep learning systems in the physi- cal world,

    E. Wenger, J. Passananti, A. N. Bhagoji, Y . Yao, H. Zheng, and B. Y . Zhao, “Backdoor attacks against deep learning systems in the physi- cal world,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Nashville, TN, USA, 2021, pp. 6202–6211

  9. [9]

    BadNets: Evaluating backdooring attacks on deep neural networks,

    T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg, “BadNets: Evaluating backdooring attacks on deep neural networks,” IEEE Access, vol. 7, pp. 47 230–47 244, 2019

  10. [10]

    Invisible backdoor attack with sample-specific triggers,

    Y . Li, Y . Li, B. Wu, L. Li, R. He, and S. Lyu, “Invisible backdoor attack with sample-specific triggers,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Montréal, QC, Canada, 2021, pp. 16 443–16 452

  11. [11]

    Dynamic backdoor attacks against machine learning models,

    A. Salem, R. Wen, M. Backes, S. Ma, and Y . Zhang, “Dynamic backdoor attacks against machine learning models,” in Proc. IEEE Eur. Symp. Secur. Priv. (EuroS&P), Genoa, Italy, 2022, pp. 703–718

  12. [12]

    Poison ink: Robust and invisible backdoor attack,

    J. Zhang, C. Dongdong, Q. Huang, J. Liao, W. Zhang, H. Feng, G. Hua, and N. Yu, “Poison ink: Robust and invisible backdoor attack,” IEEE Trans. Image Process., vol. 31, pp. 5691–5705, 2022

  13. [13]

    Trojaning attack on neural networks,

    Y . Liu, S. Ma, Y . Aafer, W.-C. Lee, J. Zhai, W. Wang, and X. Zhang, “Trojaning attack on neural networks,” in Proc. Netw. Distrib. Syst. Secur. Symp. (NDSS), San Diego, CA, USA, 2018, pp. 1–15

  14. [14]

    Witches’ brew: Industrial scale data poisoning via gradient matching,

    J. Geiping, L. H. Fowl, W. R. Huang, W. Czaja, G. Taylor, M. Moeller, and T. Goldstein, “Witches’ brew: Industrial scale data poisoning via gradient matching,” in Proc. Int. Conf. Learn. Representations (ICLR) , Vienna, Austria, 2021, pp. 1–24

  15. [15]

    Sleeper agent: Scalable hidden trigger backdoors for neural networks trained from scratch,

    H. Souri, L. Fowl, R. Chellappa, M. Goldblum, and T. Goldstein, “Sleeper agent: Scalable hidden trigger backdoors for neural networks trained from scratch,” in Proc. Int. Conf. Neural Inf. Process. Syst. (NeurIPS), vol. 35, New Orleans, LA, USA, 2022, pp. 19 165–19 178

  16. [16]

    How to backdoor federated learning,

    E. Bagdasaryan, A. Veit, Y . Hua, D. Estrin, and V . Shmatikov, “How to backdoor federated learning,” in Proc. Int. Conf. Artif. Intell. Stat. (AISTATS), vol. 108, Online, 2020, pp. 2938–2948

  17. [17]

    Back- door embedding in convolutional neural network models via invisible perturbation,

    H. Zhong, C. Liao, A. C. Squicciarini, S. Zhu, and D. Miller, “Back- door embedding in convolutional neural network models via invisible perturbation,” in Proc. ACM Conf. Data Appl. Secur. Priv. (CODASPY), New Orleans, LA, USA, 2020, pp. 97–108

  18. [18]

    Invisible backdoor attacks on deep neural networks via steganography and regularization,

    S. Li, M. Xue, B. Z. H. Zhao, H. Zhu, and X. Zhang, “Invisible backdoor attacks on deep neural networks via steganography and regularization,” IEEE Trans. Dependable Secure Comput., vol. 18, no. 5, pp. 2088–2105, 2021

  19. [19]

    An invisible black-box backdoor attack through frequency domain,

    T. Wang, Y . Yao, F. Xu, S. An, H. Tong, and T. Wang, “An invisible black-box backdoor attack through frequency domain,” in Proc. Eur. Conf. Comput. Vis. (ECCV) , Tel Aviv, Israel, 2022, pp. 396–413

  20. [20]

    Spectral signatures in backdoor attacks,

    B. Tran, J. Li, and A. M ˛ adry, “Spectral signatures in backdoor attacks,” in Proc. Int. Conf. Neural Inf. Process. Syst. (NeurIPS) , Montréal, QC, Canada, 2018, pp. 8011–8021

  21. [21]

    Detecting backdoor attacks on deep neural networks by activation clustering,

    B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy, and B. Srivastava, “Detecting backdoor attacks on deep neural networks by activation clustering,” in Proc. Assoc. Adv. Artif. Intell. Workshop (AAAI), vol. 2301, Honolulu, Hawaii, 2019, pp. 1–8

  22. [22]

    Demon in the variant: Sta- tistical analysis of DNNs for robust backdoor contamination detection,

    D. Tang, X. Wang, H. Tang, and K. Zhang, “Demon in the variant: Sta- tistical analysis of DNNs for robust backdoor contamination detection,” in Proc. USENIX Secur. Symp. (USENIX), Online, 2021, pp. 1541–1558

  23. [23]

    SPECTRE: defending against backdoor attacks using robust statistics,

    J. Hayase, W. Kong, R. Somani, and S. Oh, “SPECTRE: defending against backdoor attacks using robust statistics,” in Proc. Int. Conf. Mach. Learn. (ICML) , vol. 139, Online, 2021, pp. 4129–4139

  24. [24]

    Rethinking the backdoor attacks’ triggers: A frequency perspective,

    Y . Zeng, W. Park, Z. M. Mao, and R. Jia, “Rethinking the backdoor attacks’ triggers: A frequency perspective,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Montréal, QC, Canada, 2021, pp. 16 453–16 461

  25. [25]

    Strong data augmentation sanitizes poisoning and backdoor attacks without an accuracy tradeoff,

    E. Borgnia, V . Cherepanova, L. Fowl, A. Ghiasi, J. Geiping, M. Gold- blum, T. Goldstein, and A. Gupta, “Strong data augmentation sanitizes poisoning and backdoor attacks without an accuracy tradeoff,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP) , Toronto, ON, Canada, 2021, pp. 3855–3859

  26. [26]

    CutMix: Reg- ularization strategy to train strong classifiers with localizable features,

    S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y . Yoo, “CutMix: Reg- ularization strategy to train strong classifiers with localizable features,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Seoul, Korea, 2019, pp. 6023–6032

  27. [27]

    Can you really backdoor federated learning?

    A. T. Suresh, B. McMahan, P. Kairouz, and Z. Sun, “Can you really backdoor federated learning?” in Proc. Int. Workshop Neural Inf. Pro- cess. Syst. (NeurIPS) , Vancouver, BC, Canada, 2019, pp. 1–10

  28. [28]

    Robust anomaly detection and backdoor attack detection via differential privacy,

    M. Du, R. Jia, and D. Song, “Robust anomaly detection and backdoor attack detection via differential privacy,” in Proc. Int. Conf. Learn. Represent. (ICLR), Addis Ababa, Ethiopia, 2020, pp. 1–19

  29. [29]

    Local and central differential privacy for robustness and privacy in federated learning,

    M. Naseri, J. Hayes, and E. D. Cristofaro, “Local and central differential privacy for robustness and privacy in federated learning,” in Proc. Netw. Distrib. Syst. Secur. Symp. (NDSS) , San Diego, CA, USA, 2022, pp. 1–18

  30. [30]

    Deep learning with differential privacy,

    M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Tal- war, and L. Zhang, “Deep learning with differential privacy,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur. (CCS) , Vienna, Austria, 2016, pp. 308–318

  31. [31]

    Deep partition aggregation: Provable defenses against general poisoning attacks,

    A. Levine and S. Feizi, “Deep partition aggregation: Provable defenses against general poisoning attacks,” in Proc. Int. Conf. Learn. Represent. (ICLR), Online, 2021, pp. 1–20

  32. [32]

    Intrinsic certified robustness of bagging against data poisoning attacks,

    J. Jia, X. Cao, and N. Z. Gong, “Intrinsic certified robustness of bagging against data poisoning attacks,” in Proc. Assoc. Adv. Artif. Intell. Conf. (AAAI), vol. 35, no. 9, Online, 2021, pp. 7961–7969

  33. [33]

    Certified robustness of nearest neighbors against data poisoning and backdoor attacks,

    J. Jia, Y . Liu, X. Cao, and N. Z. Gong, “Certified robustness of nearest neighbors against data poisoning and backdoor attacks,” in Proc. Assoc. Adv. Artif. Intell. Conf. (AAAI) , vol. 36, no. 9, Online, 2022, pp. 9575– 9583

  34. [34]

    Bagging predictors,

    L. Breiman, “Bagging predictors,” Mach. Learn. , vol. 24, no. 2, pp. 123–140, 1996

  35. [35]

    STRIP: A defence against trojan attacks on deep neural networks,

    Y . Gao, C. Xu, D. Wang, S. Chen, D. C. Ranasinghe, and S. Nepal, “STRIP: A defence against trojan attacks on deep neural networks,” in Proc. Annu. Comput. Secur. Appl. Conf. (ACSAC) , San Juan, PR, USA, 2019, pp. 113–125

  36. [36]

    Deep probabilistic models to detect data poisoning attacks,

    M. Subedar, N. A. Ahuja, R. Krishnan, I. J. Ndiour, and O. Tickoo, “Deep probabilistic models to detect data poisoning attacks,” inProc. Int. Workshop Neural Inf. Process. Syst. (NeurIPS), Vancouver, BC, Canada, 2019, pp. 1–5

  37. [37]

    CleaNN: Accelerated trojan shield for embedded neural networks,

    M. Javaheripi, M. Samragh, G. Fields, T. Javidi, and F. Koushanfar, “CleaNN: Accelerated trojan shield for embedded neural networks,” in Proc. Int. Conf. Comput. Aided Des. (ICCAD) , Online, 2020, pp. 1–9

  38. [38]

    DeepSweep: An evaluation framework for mitigating DNN backdoor attacks using data augmentation,

    H. Qiu, Y . Zeng, S. Guo, T. Zhang, M. Qiu, and B. Thuraisingham, “DeepSweep: An evaluation framework for mitigating DNN backdoor attacks using data augmentation,” in Proc. ACM Asia Conf. Comput. Commun. Secur. (AsiaCCS), Online, 2021, pp. 363–377. JOURNAL OF LATEX 15

  39. [39]

    Februus: Input purification defense against trojan attacks on deep neural network systems,

    B. G. Doan, E. Abbasnejad, and D. C. Ranasinghe, “Februus: Input purification defense against trojan attacks on deep neural network systems,” in Proc. Annu. Comput. Secur. Appl. Conf. (ACSAC). Austin, TX, USA: Association for Computing Machinery, 2020, pp. 897–912

  40. [40]

    Model agnostic defence against backdoor attacks in machine learning,

    S. Udeshi, S. Peng, G. Woo, L. Loh, L. Rawshan, and S. Chattopadhyay, “Model agnostic defence against backdoor attacks in machine learning,” IEEE Trans. Reliab., vol. 71, no. 2, pp. 880–895, 2022

  41. [41]

    On certifying robust- ness against backdoor attacks via randomized smoothing,

    B. Wang, X. Cao, J. jia, and N. Z. Gong, “On certifying robust- ness against backdoor attacks via randomized smoothing,” in Proc. IEEE/CVF Workshop Comput. Vis. Pattern Recognit. (CVPR) , Online, 2020, pp. 1–5

  42. [42]

    Certified robustness to label-flipping attacks via randomized smoothing,

    E. Rosenfeld, E. Winston, P. Ravikumar, and J. Z. Kolter, “Certified robustness to label-flipping attacks via randomized smoothing,” in Proc. Int. Conf. Mach. Learn. (ICML) , Online, 2020, pp. 8230–8241

  43. [43]

    RAB: Provable robustness against backdoor attacks,

    M. Weber, X. Xu, B. Karlaš, C. Zhang, and B. Li, “RAB: Provable robustness against backdoor attacks,” in Proc. IEEE Symp. Secur. and Priv. (SP), San Francisco, CA, USA, 2023, pp. 1311–1328

  44. [44]

    Universal litmus patterns: Revealing backdoor attacks in CNNs,

    S. Kolouri, A. Saha, H. Pirsiavash, and H. Hoffmann, “Universal litmus patterns: Revealing backdoor attacks in CNNs,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Seattle, W A, USA, 2020, pp. 298–307

  45. [45]

    Detecting AI trojans using meta neural analysis,

    X. Xu, Q. Wang, H. Li, N. Borisov, C. A. Gunter, and B. Li, “Detecting AI trojans using meta neural analysis,” in Proc. IEEE Symp. Secur. and Priv. (SP), San Francisco, CA, USA, 2021, pp. 103–120

  46. [46]

    One-pixel signature: Character- izing CNN models for backdoor detection,

    S. Huang, W. Peng, Z. Jia, and Z. Tu, “One-pixel signature: Character- izing CNN models for backdoor detection,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Glasgow, UK, 2020, pp. 326–341

  47. [47]

    Practical detection of trojan neural networks: Data-limited and data- free cases,

    R. Wang, G. Zhang, S. Liu, P.-Y . Chen, J. Xiong, and M. Wang, “Practical detection of trojan neural networks: Data-limited and data- free cases,” in Proc. Eur. Conf. Comput. Vis. (ECCV) , Glasgow, UK, 2020, pp. 222–238

  48. [48]

    Topological detection of trojaned neural networks,

    S. Zheng, Y . Zhang, H. Wagner, M. Goswami, and C. Chen, “Topological detection of trojaned neural networks,” in Proc. Int. Conf. Neural Inf. Process. Syst. (NeurIPS) , vol. 34, Online, 2021, pp. 17 258–17 272

  49. [49]

    Post-training detection of backdoor attacks for two-class and multi-attack scenarios,

    Z. Xiang, D. Miller, and G. Kesidis, “Post-training detection of backdoor attacks for two-class and multi-attack scenarios,” in Proc. Int. Conf. Learn. Represent. (ICLR) , Online, 2022, pp. 1–34

  50. [50]

    Overcom- ing catastrophic forgetting in neural networks,

    J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell, “Overcom- ing catastrophic forgetting in neural networks,” Proc. Nat. Acad. Sci. (PNAS), vol. 114, no. 13, pp. 3521–3526, 2017

  51. [51]

    Neural trojans,

    Y . Liu, Y . Xie, and A. Srivastava, “Neural trojans,” in Proc. IEEE Int. Conf. Comput. Des. (ICCD) , Boston, MA, USA, 2017, pp. 45–48

  52. [52]

    Adversarial unlearning of backdoors via implicit hypergradient,

    Y . Zeng, S. Chen, W. Park, Z. Mao, M. Jin, and R. Jia, “Adversarial unlearning of backdoors via implicit hypergradient,” in Proc. Int. Conf. Learn. Representations (ICLR) , Online, 2022, pp. 1–28

  53. [53]

    Distilling the knowledge in a neural network,

    G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in Proc. Int. Workshop Neural Inf. Process. Syst. (NeurIPS) , 2015, pp. 1–9

  54. [54]

    Disabling backdoor and identifying poison data by using knowledge distillation in backdoor attacks on deep neural networks,

    K. Yoshida and T. Fujino, “Disabling backdoor and identifying poison data by using knowledge distillation in backdoor attacks on deep neural networks,” in Proc. ACM Workshop Artif. Intell. Secur. (AISec) , Online, 2020, pp. 117–127

  55. [55]

    Neural attention distillation: Erasing backdoor triggers from deep neural networks,

    Y . Li, X. Lyu, N. Koren, L. Lyu, B. Li, and X. Ma, “Neural attention distillation: Erasing backdoor triggers from deep neural networks,” in Proc. Int. Conf. Learn. Representations (ICLR) , Vienna, Austria, 2021, pp. 1–19

  56. [56]

    Neural net pruning: Why and how,

    Sietsma and Dow, “Neural net pruning: Why and how,” in Proc. IEEE Int. Conf. Neural Netw. (ICNN), vol. 1, San Diego, CA, USA, 1988, pp. 325–333

  57. [57]

    Fine-pruning: Defending against backdooring attacks on deep neural networks,

    K. Liu, B. Dolan-Gavitt, and S. Garg, “Fine-pruning: Defending against backdooring attacks on deep neural networks,” in Proc. Int. Symp. Res. Attacks Intrusions Defenses (RAID) , Heraklion, Greece, 2018, pp. 273– 294

  58. [58]

    Adversarial neuron pruning purifies backdoored deep models,

    D. Wu and Y . Wang, “Adversarial neuron pruning purifies backdoored deep models,” in Proc. Int. Conf. Neural Inf. Process. Syst. (NeurIPS) , Online, 2021, pp. 1–13

  59. [59]

    Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks,

    B. Wang, Y . Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y . Zhao, “Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks,” in Proc. IEEE Symp. Secur. and Priv. (SP) , San Francisco, CA, USA, 2019, pp. 707–723

  60. [60]

    Towards inspecting and eliminating trojan backdoors in deep neural networks,

    W. Guo, L. Wang, Y . Xu, X. Xing, M. Du, and D. Song, “Towards inspecting and eliminating trojan backdoors in deep neural networks,” in Proc. IEEE Int. Conf. Data Mining (ICDM) , Sorrento, Italy, 2020, pp. 162–171

  61. [61]

    Better trigger inversion optimization in backdoor scanning,

    G. Tao, G. Shen, Y . Liu, S. An, Q. Xu, S. Ma, P. Li, and X. Zhang, “Better trigger inversion optimization in backdoor scanning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , New Orleans, LA, USA, 2022, pp. 13 358–13 368

  62. [62]

    Huxley, Brave New World

    A. Huxley, Brave New World. London, UK: Chatto & Windus, 1932

  63. [63]

    W. R. Ashby, An Introduction to Cybernetics . London, UK: Chapman & Hall, 1956

  64. [64]

    Wiener, Cybernetics or Control and Communication in the Animal and the Machine

    N. Wiener, Cybernetics or Control and Communication in the Animal and the Machine. Cambridge, MA, USA: MIT Press, 1948

  65. [65]

    The cybernetics of learning,

    B. Cope and M. Kalantzis, “The cybernetics of learning,” Educ. Philos. Theory, vol. 54, no. 14, pp. 2352–2388, Dec. 2022

  66. [66]

    Motivated forgetting and the study of repression,

    B. Weiner, “Motivated forgetting and the study of repression,” J. Pers., vol. 36, no. 2, pp. 213–234, 1968

  67. [67]

    Machine unlearning,

    L. Bourtoule, V. Chandrasekaran, C. A. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot, “Machine unlearning,” in Proc. IEEE Symp. Secur. and Priv. (SP), San Francisco, CA, USA, 2021, pp. 141–159

  68. [68]

    Model inversion attacks that exploit confidence information and basic countermeasures,

    M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacks that exploit confidence information and basic countermeasures,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur. (CCS), Denver, CO, USA, 2015, pp. 1322–1333

  69. [69]

    Enhancement of eyewitness memory with the cognitive interview,

    R. E. Geiselman, R. P. Fisher, D. P. MacKinnon, and H. L. Holland, “Enhancement of eyewitness memory with the cognitive interview,” Am. J. Psychol., vol. 99, no. 3, pp. 385–401, 1986

  70. [70]

    Statistics of mental imagery,

    F. Galton, “Statistics of mental imagery,” Mind, vol. 5, no. 19, pp. 301–318, 1880

  71. [71]

    Mental representation,

    D. Pitt, “Mental representation,” in The Stanford Encyclopedia of Philosophy. Stanford University, 2000

  72. [72]

    Representation, similarity, and the chorus of prototypes,

    S. Edelman, “Representation, similarity, and the chorus of prototypes,” Minds Mach., vol. 5, no. 1, pp. 45–68, 1995

  73. [73]

    Sensory deprivation and hallucinations,

    J. Vernon, T. Marton, and E. Peterson, “Sensory deprivation and hallucinations,” Science, vol. 133, no. 3467, pp. 1808–1812, 1961

  74. [74]

    Intriguing properties of neural networks,

    C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in Proc. Int. Conf. Learn. Represent. (ICLR), Banff, AB, Canada, 2014, pp. 1–10

  75. [75]

    Explaining and harnessing adversarial examples,

    I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Proc. Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, 2015, pp. 1–11

  76. [76]

    Adversarial machine learning at scale,

    A. Kurakin, I. J. Goodfellow, and S. Bengio, “Adversarial machine learning at scale,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2017, pp. 1–17

  77. [77]

    Ensemble adversarial training: Attacks and defenses

    F. Tramèr, A. Kurakin, N. Papernot, I. J. Goodfellow, D. Boneh, and P. D. McDaniel, “Ensemble adversarial training: Attacks and defenses,” in Proc. Int. Conf. Learn. Representations (ICLR), Vancouver, BC, Canada, 2018, pp. 1–20

  78. [78]

    Adversarial training and robustness for multiple perturbations,

    F. Tramèr and D. Boneh, “Adversarial training and robustness for multiple perturbations,” in Proc. Int. Conf. Neural Inf. Process. Syst. (NeurIPS), Vancouver, BC, Canada, 2019, pp. 5866–5876

  79. [79]

    Towards deep learning models resistant to adversarial attacks

    A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in Proc. Int. Conf. Learn. Representations (ICLR), Vancouver, BC, Canada, 2018, pp. 1–23

  80. [80]

    Gradient-based learning applied to document recognition,

    Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998
