pith. machine review for the scientific record.

arxiv: 2605.00874 · v1 · submitted 2026-04-25 · 💻 cs.CV · cs.AI · cs.LG · cs.MM

Recognition: unknown

Latent Space Probing for Adult Content Detection in Video Generative Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:03 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.MM
keywords latent space probing · adult content detection · video generative models · diffusion models · content moderation · real-time detection · CogVideoX
0 comments

The pith

Intercepting latent representations during video generation allows real-time adult content detection at 97.29% F1 with 4-6 ms overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that the internal denoised latent states created by a video diffusion model carry enough information to separate adult from non-adult content without needing the final pixels or the input prompt. A reader would care because current moderation methods miss material that only appears during the generation process itself. The authors assembled a dataset of more than eleven thousand short clips, roughly half labeled violating, and trained two lightweight classifiers directly on the latents produced by CogVideoX. Their experiments report that this internal probe reaches high accuracy while adding almost no extra time to inference. The work therefore treats the latent trajectory as a richer signal than either the prompt or the decoded output for content safety tasks.

Core claim

The central claim is that latent-space signals produced during CogVideoX inference encode strong discriminative features for harmful content. By attaching lightweight classifiers to the denoised latents and training them on a new binary dataset of 11,039 ten-second clips, the method reaches 97.29 percent F1 on held-out data while adding only 4-6 milliseconds of computation. The authors conclude that probing the latent space improves both detection performance and cost relative to prompt-only or pixel-space approaches.
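To make the claimed mechanism concrete, the sketch below shows the general shape of such a probe: a small classifier reading a denoised latent tensor, with a rough latency check against the reported 4-6 ms budget. The latent shape, the pooling scheme, and the layer sizes are illustrative assumptions, not the paper's two actual probe architectures (those appear in Figures 3 and 4 below).

```python
import time

import torch
import torch.nn as nn


class LatentProbe(nn.Module):
    """Toy latent probe: mean-pool the latent volume, classify with a small MLP."""

    def __init__(self, latent_channels: int = 16, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(latent_channels, hidden),
            nn.GELU(),
            nn.Linear(hidden, 2),  # binary logits: violating / non-violating
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, T, C, H, W) denoised latent; pool away time and space.
        return self.head(z.mean(dim=(1, 3, 4)))


probe = LatentProbe().eval()
z = torch.randn(1, 13, 16, 60, 90)  # assumed CogVideoX-like latent shape
with torch.no_grad():
    start = time.perf_counter()
    logits = probe(z)
print(f"probe overhead: {(time.perf_counter() - start) * 1e3:.2f} ms, logits: {tuple(logits.shape)}")
```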

What carries the argument

Lightweight probing classifiers attached to the sequence of denoised latent representations generated during CogVideoX inference.

Load-bearing premise

The binary labels assigned to clips sourced from adult websites and YouTube remain correct markers of adult content in the particular latent trajectories that CogVideoX produces from new prompts.

What would settle it

Run the trained classifiers on latents from a different video diffusion model on the same held-out clips and observe whether F1 drops below 80 percent.
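A minimal sketch of that test, assuming the probe and latents are available as PyTorch tensors; `trained_probe` and `other_latents` below are random placeholders standing in for the probe trained on CogVideoX latents and for the same held-out clips encoded by a second model's VAE.

```python
import torch
import torch.nn as nn
from sklearn.metrics import f1_score


@torch.no_grad()
def probe_f1(probe: nn.Module, latents: torch.Tensor, labels: torch.Tensor) -> float:
    """Binary F1 of a probe over a batch of latent tensors."""
    preds = probe(latents.flatten(1)).argmax(dim=-1)
    return float(f1_score(labels.numpy(), preds.numpy()))


# Placeholders; in the real test these come from a second diffusion model's VAE.
trained_probe = nn.Linear(13 * 16 * 60 * 90, 2)
other_latents = torch.randn(8, 13, 16, 60, 90)
labels = torch.randint(0, 2, (8,))

f1 = probe_f1(trained_probe, other_latents, labels)
print(f"cross-model F1: {f1:.4f} (claim at risk if below 0.80)")
```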

Figures

Figures reproduced from arXiv: 2605.00874 by Alizishaan Khatri, Chiquita Prabhu.

Figure 1: Model training and dataset construction workflow.
Figure 2: CogVideoX inference pipeline with latent probe attachment (shown …).
Figure 3: Overview of the proposed CNN-transformer video classifier. A compressed video latent tensor is first processed by a 3D convolutional stem, followed ….
Figure 4: Architecture of CNN3DClassifier. The backbone (top row) extracts volumetric features; the classifier head (bottom row, right-to-left) produces binary logits.
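Figure 4's caption, together with the layer summary in the paper's appendix (a Conv3D block mapping 16 to 32 channels with kernel (3×5×5) and stride (1,2,2), batch normalization and ReLU, dropout at p = 0.3, and a final linear layer producing two logits), is enough to sketch the CNN3DClassifier. The sketch below follows those details for Block 1; the later blocks' channel widths and the pooling sizes are assumptions.

```python
import torch
import torch.nn as nn


class CNN3DClassifier(nn.Module):
    """Sketch of the paper's 3D-CNN probe. Block 1 matches the published layer
    summary; the widths of Blocks 2-3 and the pooling sizes are assumptions."""

    def __init__(self, p_drop: float = 0.3):
        super().__init__()

        def block(c_in, c_out, k=(3, 3, 3), s=(1, 1, 1)):
            return nn.Sequential(
                nn.Conv3d(c_in, c_out, k, stride=s, padding=tuple(x // 2 for x in k)),
                nn.BatchNorm3d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool3d((1, 2, 2)),
            )

        self.backbone = nn.Sequential(
            block(16, 32, k=(3, 5, 5), s=(1, 2, 2)),  # Block 1, per the appendix
            block(32, 64),                            # assumed width
            block(64, 128),                           # assumed width
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Dropout(p_drop),      # p = 0.3 per the appendix
            nn.Linear(128, 2),       # binary logits
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, 16, D, H, W) latent volume, channels-first as in the paper's table.
        return self.head(self.backbone(z))


logits = CNN3DClassifier()(torch.randn(1, 16, 13, 60, 90))
print(tuple(logits.shape))  # (1, 2)
```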
read the original abstract

The rapid proliferation of AI-powered video generation systems has introduced significant challenges in content moderation, particularly with respect to adult and sexually explicit material. Existing detection methods operate on either prompts or decoded pixel-space outputs; both approaches are therefore blind to the rich internal representations formed during generation. In this paper, we propose a novel latent space probing framework that intercepts the denoised latent representations produced by the CogVideoX video diffusion model during inference and attaches lightweight classifiers to perform real-time adult content detection. To support this work, we construct a large-scale binary dataset of 11,039 ten-second video clips (5,086 violating, 5,953 non-violating) sourced from adult websites and YouTube respectively. We introduce two lightweight probing classifier architectures, which we train and evaluate on this dataset. Our work demonstrates that latent-space signals encode strong discriminative features for harmful content detection, achieving 97.29% F1 on our held-out test set with overhead in the 4-6 ms range. Our results suggest that probing the latent space improves both detection performance and cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes a latent-space probing framework to detect adult content in videos generated by the CogVideoX diffusion model. The authors construct a dataset of 11,039 ten-second clips (5,086 violating, 5,953 non-violating) sourced from adult websites and YouTube, introduce two lightweight classifier architectures that operate on denoised latents intercepted during inference, and report 97.29% F1 on a held-out test split together with 4-6 ms overhead. The central claim is that latent representations encode strong discriminative features for harmful-content detection, yielding both higher accuracy and lower cost than prompt- or pixel-based alternatives.

Significance. If the reported performance generalizes to latents produced by actual text-to-video inference on novel prompts, the approach would supply an efficient, low-overhead mechanism for real-time moderation inside generative pipelines without requiring full pixel decoding. The emphasis on internal representations addresses a clear gap left by existing prompt- or output-based detectors, and the modest overhead is practically attractive for deployment.

major comments (1)
  1. The 97.29% F1 is measured on a held-out split of the real-video dataset. The intended use case, however, is detection on the specific denoised latent trajectories that CogVideoX produces when starting from noise and following its text-conditioned diffusion schedule on new prompts. No experiments are reported that evaluate the probes on actual model-generated videos, so the headline metric does not directly substantiate the deployment claim made in the abstract and introduction.
minor comments (2)
  1. The abstract and method description do not specify the exact procedure used to obtain 'denoised latent representations' for the real clips in the training set, nor which of the two proposed probe architectures produced the reported 97.29% F1.
  2. No details are given on the labeling protocol, inter-annotator agreement, handling of class imbalance, or verification that train/test splits contain no leakage.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and for identifying a key distinction between our current evaluation and the intended deployment setting. We address the major comment below.

read point-by-point responses
  1. Referee: The 97.29% F1 is measured on a held-out split of the real-video dataset. The intended use case, however, is detection on the specific denoised latent trajectories that CogVideoX produces when starting from noise and following its text-conditioned diffusion schedule on new prompts. No experiments are reported that evaluate the probes on actual model-generated videos, so the headline metric does not directly substantiate the deployment claim made in the abstract and introduction.

    Authors: We agree that the reported 97.29% F1 is obtained by encoding real video clips from our dataset into the CogVideoX VAE latent space and does not include evaluation on full denoising trajectories generated from noise under text conditioning. This constitutes a genuine gap for directly supporting the real-time moderation use case inside generative pipelines. Although the underlying VAE latent space is identical and we expect the learned discriminative features to transfer, we do not currently provide empirical evidence on model-generated content. In the revised manuscript we will add a new experimental section that generates videos with CogVideoX using both safe and adult-oriented prompts, intercepts the denoised latents at the final (and optionally intermediate) timesteps, and evaluates the same probes on these trajectories. The resulting metrics will be reported alongside the existing real-video results to directly address the deployment claim. revision: yes
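For readers who want to attempt the promised experiment, a hedged sketch of latent interception using the step-end callback of the diffusers CogVideoX pipeline follows. The model id, the callback contract, and the latent layout are assumptions drawn from the public diffusers API, not from the authors' code.

```python
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16
).to("cuda")

captured = []  # one denoised-latent tensor per sampling step


def intercept(pipe, step_index, timestep, callback_kwargs):
    # "latents" is exposed here because it is requested via tensor_inputs below.
    captured.append(callback_kwargs["latents"].detach().clone())
    return callback_kwargs


video = pipe(
    "a calm beach at sunset",
    callback_on_step_end=intercept,
    callback_on_step_end_tensor_inputs=["latents"],
).frames[0]

# Score the final-step latents with a probe (placeholder linear head here).
probe = torch.nn.Linear(16, 2).to("cuda")
z = captured[-1].float()               # assumed layout (B, T, C, H, W)
logits = probe(z.mean(dim=(1, 3, 4)))  # pool to (B, C), then classify
```

Scoring every element of `captured`, not just the last, would give the intermediate-timestep evaluation the response mentions.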

Circularity Check

0 steps flagged

Empirical supervised probing on held-out real-video latents; no derivation reduces to self-inputs

full rationale

The paper reports a standard machine-learning pipeline: construct a labeled dataset of real 10-second clips, extract latent representations (via the model's encoder or equivalent), train lightweight classifiers, and measure F1 on a held-out split of the same dataset. The 97.29% F1 and 4-6 ms overhead are direct empirical measurements on unseen examples drawn from the training distribution; no equations, ansatzes, or uniqueness theorems are presented that would make the reported metric equivalent to a fitted parameter by construction. No self-citations are invoked as load-bearing premises, and the central claim does not rename a known result or smuggle an ansatz. The work is therefore self-contained as an empirical measurement rather than a closed derivation.
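As a concrete reading of that pipeline's first stage, here is a sketch of latent-encoding a real clip with the CogVideoX VAE via diffusers. The paper does not publish its exact procedure (the referee's first minor comment), so the model id, input layout, and scaling below are assumptions from the public diffusers API.

```python
import torch
from diffusers import AutoencoderKLCogVideoX

# Only the VAE is needed to latent-encode real clips for probe training.
vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16
).to("cuda").eval()


@torch.no_grad()
def encode_clip(frames: torch.Tensor) -> torch.Tensor:
    """frames: (B, 3, T, H, W), pixel values scaled to [-1, 1]."""
    posterior = vae.encode(frames.half().to("cuda")).latent_dist
    return posterior.mean * vae.config.scaling_factor


clip = torch.rand(1, 3, 49, 480, 720) * 2 - 1  # stand-in for a real 10-second clip
latents = encode_clip(clip)
print(tuple(latents.shape))  # roughly (1, 16, 13, 60, 90) after 4x temporal / 8x spatial compression
```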

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that latent representations carry content-type information and on standard supervised learning assumptions; no new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • probe classifier weights
    The lightweight classifiers are trained on the collected dataset, so their parameters are fitted values.
axioms (1)
  • domain assumption: denoised latent representations during inference encode semantic features sufficient to distinguish adult from non-adult content
    Invoked when the paper states that latent-space signals encode strong discriminative features.

pith-pipeline@v0.9.0 · 5499 in / 1206 out tokens · 34385 ms · 2026-05-09T21:03:44.735758+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 31 canonical work pages · 13 internal anchors

  1. [1]

    Sora: Creating video from text,

    OpenAI, “Sora: Creating video from text,” 2024. [Online]. Available: https://openai.com/sora

  2. [2]

    Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Y. Liu, K. Zhang, Y. Li, Z. Yan, C. Gao, R. Chen, Z. Yuan, Y. Huang, H. Sun, J. Gao, L. He, and L. Sun, “Sora: A review on background, technology, limitations, and opportunities of large vision models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.17177

  3. [3]

    Wan: Open and Advanced Large-Scale Video Generative Models

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang et al., “Wan: Open and advanced large-scale video generative models,” arXiv preprint arXiv:2503.20314, 2025

  4. [4]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, V. Jampani, and R. Rombach, “Stable video diffusion: Scaling latent video diffusion models to large datasets,” 2023. [Online]. Available: https://arxiv.org/abs/2311.15127

  5. [5]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, D. Yin, Y. Zhang, W. Wang, Y. Cheng, B. Xu, X. Gu, Y. Dong, and J. Tang, “Cogvideox: Text-to-video diffusion models with an expert transformer,” 2025. [Online]. Available: https://arxiv.org/abs/2408.06072

  6. [6]

    Deepfakes on demand: the rise of accessible non-consensual deepfake image generators,

    W. Hawkins, C. Russell, and B. Mittelstadt, “Deepfakes on demand: the rise of accessible non-consensual deepfake image generators,” 2025. [Online]. Available: https://arxiv.org/abs/2505.03859

  7. [7]

    Video deepfake abuse: How company choices predictably shape misuse patterns,

    M. Kamachee, S. Casper, M. L. Ding, R.-J. Yew, A. Reuel, S. Biderman, and D. Hadfield-Menell, “Video deepfake abuse: How company choices predictably shape misuse patterns,” 2026. [Online]. Available: https://arxiv.org/abs/2512.11815

  8. [8]

    Adversarial attacks and defenses on text-to-image diffusion models: A survey,

C. Zhang, M. Hu, W. Li, and L. Wang, “Adversarial attacks and defenses on text-to-image diffusion models: A survey,” Information Fusion, vol. 114, p. 102701, 2025

  9. [9]

    Jailbreak attacks and defenses against multimodal generative models: A survey,

X. Liu, X. Cui, P. Li, Z. Li, H. Huang, S. Xia, M. Zhang, Y. Zou, and R. He, “Jailbreak attacks and defenses against multimodal generative models: A survey,” arXiv preprint arXiv:2411.09259, 2024

  10. [10]

    Aeiou: A unified defense framework against nsfw prompts in text-to-image models,

Y. Wang, J. Chen, Q. Li, T. Zhang, R. Zeng, X. Yang, and S. Ji, “Aeiou: A unified defense framework against nsfw prompts in text-to-image models,” arXiv preprint arXiv:2412.18123, 2024

  11. [11]

    Latent guard: a safety framework for text-to-image generation,

R. Liu, A. Khakzar, J. Gu, Q. Chen, P. Torr, and F. Pizzati, “Latent guard: a safety framework for text-to-image generation,” in European Conference on Computer Vision. Springer, 2024, pp. 93–109

  12. [12]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine et al., “Llama guard: Llm-based input-output safeguard for human-ai conversations,” arXiv preprint arXiv:2312.06674, 2023. [Online]. Available: http://arxiv.org/abs/2312.06674

  13. [13]

    Sora safety,

OpenAI, “Sora safety,” 2024. [Online]. Available: https://openai.com/sora#safety

  14. [14]

    Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models,

Y. Qu, X. Shen, X. He, M. Backes, S. Zannettou, and Y. Zhang, “Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, 2023, pp. 3403–3417

  15. [15]

Safewatch: An efficient safety-policy following video guardrail model with transparent explanations,

Z. Chen, F. Pinto, M. Pan, and B. Li, “Safewatch: An efficient safety-policy following video guardrail model with transparent explanations,” in International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=xjKz6IxgCX

  16. [16]

    T2vsafetybench: Evaluating the safety of text-to-video generative models,

Y. Miao, Y. Zhu, Y. Dong, L. Yu, J. Zhu, and X.-S. Gao, “T2vsafetybench: Evaluating the safety of text-to-video generative models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.05965

  17. [17]

    Representation Engineering: A Top-Down Approach to AI Transparency

A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowski et al., “Representation engineering: A top-down approach to ai transparency,” Computing Research Repository, vol. arXiv:2310.01405, 2023. [Online]. Available: http://arxiv.org/abs/2310.01405

  18. [18]

    Understanding intermediate layers using linear classifier probes

G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,” 2018. [Online]. Available: https://arxiv.org/abs/1610.01644

  19. [19]

    Safety beyond the interface: Detecting harm via latent llm states,

A. Khatri, C. Prabhu, and O. Neogi, “Safety beyond the interface: Detecting harm via latent llm states,” ResearchGate Preprint, 2026. [Online]. Available: https://www.researchgate.net/publication/402378765_Safety_Beyond_the_Interface_Detecting_Harm_via_Latent_LLM_States

  20. [20]

    Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming,

Anthropic, “Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming,” 2025. [Online]. Available: https://www.anthropic.com/research/constitutional-classifiers

  21. [21]

    Building production-ready probes for Gemini

J. Kramár, J. Engels, Z. Wang, B. Chughtai, R. Shah, N. Nanda, and A. Conmy, “Building production-ready probes for gemini,” arXiv preprint arXiv:2601.11516, 2026

  22. [22]

Seeing it before it happens: In-generation nsfw detection for diffusion-based text-to-image models,

F. Yang, Y. Huang, J. Zhu, L. Shi, G. Pu, J. S. Dong, and K. Wang, “Seeing it before it happens: In-generation nsfw detection for diffusion-based text-to-image models,” arXiv preprint arXiv:2508.03006, 2025

  23. [23]

    Diffusion probe: Generated image result prediction using cnn probes,

B. Cui, B. Huang, Z. Ye, X. Dong, T. Chen, H. Xue, D. Yang, L. Huang, J. Tang, and H. Hong, “Diffusion probe: Generated image result prediction using cnn probes,” arXiv preprint arXiv:2602.23783, 2026

  24. [24]

    Finding naked people,

M. M. Fleck, D. A. Forsyth, and C. Bregler, “Finding naked people,” in Proc. European Conference on Computer Vision. Springer, 1996, pp. 593–602. [Online]. Available: http://luthuli.cs.uiuc.edu/~daf/papers/naked.pdf

  25. [25]

    NudeNet: Neural nets for nudity classification, detection and selective censoring,

    P. Bedapudi, “NudeNet: Neural nets for nudity classification, detection and selective censoring,” https://github.com/notAI-tech/NudeNet, 2019, open-source software

  26. [26]

    Open NSFW: Not suitable for work (NSFW) classification using deep neural network Caffe models,

Yahoo, “Open NSFW: Not suitable for work (NSFW) classification using deep neural network Caffe models,” 2016, GitHub repository (archived October 2019). Model based on a thin ResNet-50 variant fine-tuned for image NSFW scoring. [Online]. Available: https://github.com/yahoo/open_nsfw

  27. [27]

    Learning Transferable Visual Models From Natural Language Supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” 2021. [Online]. Available: https://arxiv.org/abs/2103.00020

  28. [28]

    T2vs meet vlms: A scalable multimodal dataset for visual harmfulness recognition,

C. Yeh, Y.-M. Chang, W.-C. Chiu, and N. Yu, “T2vs meet vlms: A scalable multimodal dataset for visual harmfulness recognition,” 2024. [Online]. Available: https://arxiv.org/abs/2409.19734

  29. [29]

    Safe-clip: Removing nsfw concepts from vision-and-language models,

S. Poppi, T. Poppi, F. Cocchi, M. Cornia, L. Baraldi, and R. Cucchiara, “Safe-clip: Removing nsfw concepts from vision-and-language models,” [Online]. Available: https://arxiv.org/abs/2311.16254

  31. [31]

    Video pornography detection through deep learning techniques and motion information,

M. Perez, S. Avila, D. Moreira, D. Moraes, V. Testoni, E. Valle, S. Goldenstein, and A. Rocha, “Video pornography detection through deep learning techniques and motion information,” Neurocomputing, vol. 230, pp. 279–293, 2017

  32. [32]

    Spatiotemporal cnns for pornography detection in videos,

M. V. da Silva and A. N. Marana, “Spatiotemporal cnns for pornography detection in videos,” 2018. [Online]. Available: https://arxiv.org/abs/1810.10519

  33. [33]

    PEDA 376K: A novel dataset for deep-learning based porn-detectors,

D. C. Moreira, E. T. Pereira, and M. Alvarez, “PEDA 376K: A novel dataset for deep-learning based porn-detectors,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2020, pp. 1–8

  34. [34]

BERT Rediscovers the Classical NLP Pipeline

    I. Tenney, D. Das, and E. Pavlick, “Bert rediscovers the classical nlp pipeline,” 2019. [Online]. Available: https://arxiv.org/abs/1905.05950

  35. [35]

    A structural probe for finding syntax in word representations,

J. Hewitt and C. D. Manning, “A structural probe for finding syntax in word representations,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4129–4138. [Online]. Available: https://aclanthology.org/N19-1419

  36. [36]

Probing Classifiers: Promises, Shortcomings, and Advances

Y. Belinkov, “Probing classifiers: Promises, shortcomings, and advances,” 2021. [Online]. Available: https://arxiv.org/abs/2102.12452

  37. [37]

    Multimodal neurons in artificial neural networks,

G. Goh, N. Cammarata, C. Voss, S. Carter, M. Petrov, L. Schubert, A. Radford, and C. Olah, “Multimodal neurons in artificial neural networks,” Distill, 2021. [Online]. Available: https://distill.pub/2021/multimodal-neurons

  38. [38]

The Internal State of an LLM Knows When It's Lying

    A. Azaria and T. Mitchell, “The internal state of an llm knows when it’s lying,” 2023. [Online]. Available: https://arxiv.org/abs/2304.13734

  39. [39]

    Constitutional AI: Harmlessness from AI Feedback

Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R. ...

  40. [40]

    Training language models to follow instructions with human feedback

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” 2022. [Online]. Available: https://arxiv.org/abs/2203.02155

  41. [41]

    Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models,

P. Schramowski, M. Brack, B. Deiseroth, and K. Kersting, “Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models,” [Online]. Available: https://arxiv.org/abs/2211.05105

  43. [43]

Aligning AI with Shared Human Values

    D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt, “Aligning ai with shared human values,” 2023. [Online]. Available: https://arxiv.org/abs/2008.02275

  44. [44]

    Scalable diffusion models with transformers,

W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4195–4205

  45. [45]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift,

S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning. PMLR, 2015, pp. 448–456

  46. [46]

    Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778

  47. [47]

    Squeeze-and-excitation networks,

    J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141

  48. [48]

    Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017

  49. [49]

    On layer normalization in the transformer architecture,

R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu, “On layer normalization in the transformer architecture,” in International Conference on Machine Learning. PMLR, 2020, pp. 10524–10533

  50. [50]

    Layer Normalization

J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016

  51. [51]

    Gaussian Error Linear Units (GELUs)

D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016

  52. [52]

    Adam: A method for stochastic optimization,

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” [Online]. Available: https://arxiv.org/abs/1412.6980

  54. [54]

    3d convolutional neural networks for human action recognition,

S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2012

  55. [55]

Rectified linear units improve restricted Boltzmann machines,

V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814

  56. [56]

    Network In Network

    M. Lin, Q. Chen, and S. Yan, “Network in network,”arXiv preprint arXiv:1312.4400, 2013

  57. [57]

Dropout: a simple way to prevent neural networks from overfitting,

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014
