pith. machine review for the scientific record.

arxiv: 2605.00874 · v1 · submitted 2026-04-25 · 💻 cs.CV · cs.AI · cs.LG · cs.MM

Recognition: unknown

Latent Space Probing for Adult Content Detection in Video Generative Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:03 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.MM
keywords latent space probing · adult content detection · video generative models · diffusion models · content moderation · real-time detection · CogVideoX
0 comments

The pith

Intercepting latent representations during video generation allows real-time adult content detection at 97.29% F1 with 4-6 ms overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that the internal denoised latent states created by a video diffusion model carry enough information to separate adult from non-adult content without needing the final pixels or the input prompt. A reader would care because current moderation methods miss material that only appears during the generation process itself. The authors assembled a dataset of more than eleven thousand short clips, roughly half labeled violating, and trained two lightweight classifiers directly on the latents produced by CogVideoX. Their experiments report that this internal probe reaches high accuracy while adding almost no extra time to inference. The work therefore treats the latent trajectory as a richer signal than either the prompt or the decoded output for content safety tasks.

Core claim

The central claim is that latent-space signals produced during CogVideoX inference encode strong discriminative features for harmful content. By attaching lightweight classifiers to the denoised latents and training them on a new binary dataset of 11,039 ten-second clips, the method reaches 97.29 percent F1 on held-out data while adding only 4-6 milliseconds of computation. The authors conclude that probing the latent space improves both detection performance and cost relative to prompt-only or pixel-space approaches.
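To make the claimed mechanism concrete, the sketch below shows the general shape of such a probe: a small classifier reading a denoised latent tensor, with a rough latency check against the reported 4-6 ms budget. The latent shape, the pooling scheme, and the layer sizes are illustrative assumptions, not the paper's two actual probe architectures (those appear in Figures 3 and 4 below).

```python
import time

import torch
import torch.nn as nn


class LatentProbe(nn.Module):
    """Toy latent probe: mean-pool the latent volume, classify with a small MLP."""

    def __init__(self, latent_channels: int = 16, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(latent_channels, hidden),
            nn.GELU(),
            nn.Linear(hidden, 2),  # binary logits: violating / non-violating
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, T, C, H, W) denoised latent; pool away time and space.
        return self.head(z.mean(dim=(1, 3, 4)))


probe = LatentProbe().eval()
z = torch.randn(1, 13, 16, 60, 90)  # assumed CogVideoX-like latent shape
with torch.no_grad():
    start = time.perf_counter()
    logits = probe(z)
print(f"probe overhead: {(time.perf_counter() - start) * 1e3:.2f} ms, logits: {tuple(logits.shape)}")
```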

What carries the argument

Lightweight probing classifiers attached to the sequence of denoised latent representations generated during CogVideoX inference.

Load-bearing premise

The binary labels assigned to clips sourced from adult websites and YouTube remain correct markers of adult content in the particular latent trajectories that CogVideoX produces from new prompts.

What would settle it

Run the trained classifiers on latents from a different video diffusion model on the same held-out clips and observe whether F1 drops below 80 percent.
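A minimal sketch of that test, assuming the probe and latents are available as PyTorch tensors; `trained_probe` and `other_latents` below are random placeholders standing in for the probe trained on CogVideoX latents and for the same held-out clips encoded by a second model's VAE.

```python
import torch
import torch.nn as nn
from sklearn.metrics import f1_score


@torch.no_grad()
def probe_f1(probe: nn.Module, latents: torch.Tensor, labels: torch.Tensor) -> float:
    """Binary F1 of a probe over a batch of latent tensors."""
    preds = probe(latents.flatten(1)).argmax(dim=-1)
    return float(f1_score(labels.numpy(), preds.numpy()))


# Placeholders; in the real test these come from a second diffusion model's VAE.
trained_probe = nn.Linear(13 * 16 * 60 * 90, 2)
other_latents = torch.randn(8, 13, 16, 60, 90)
labels = torch.randint(0, 2, (8,))

f1 = probe_f1(trained_probe, other_latents, labels)
print(f"cross-model F1: {f1:.4f} (claim at risk if below 0.80)")
```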

Figures

Figures reproduced from arXiv: 2605.00874 by Alizishaan Khatri, Chiquita Prabhu.

Figure 1: Model training and dataset construction workflow.
Figure 2: CogVideoX inference pipeline with latent probe attachment (shown …).
Figure 3: Overview of the proposed CNN-transformer video classifier. A compressed video latent tensor is first processed by a 3D convolutional stem, followed ….
Figure 4: Architecture of CNN3DClassifier. The backbone (top row) extracts volumetric features; the classifier head (bottom row, right-to-left) produces binary logits.
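Figure 4's caption, together with the layer summary in the paper's appendix (a Conv3D block mapping 16 to 32 channels with kernel (3×5×5) and stride (1,2,2), batch normalization and ReLU, dropout at p = 0.3, and a final linear layer producing two logits), is enough to sketch the CNN3DClassifier. The sketch below follows those details for Block 1; the later blocks' channel widths and the pooling sizes are assumptions.

```python
import torch
import torch.nn as nn


class CNN3DClassifier(nn.Module):
    """Sketch of the paper's 3D-CNN probe. Block 1 matches the published layer
    summary; the widths of Blocks 2-3 and the pooling sizes are assumptions."""

    def __init__(self, p_drop: float = 0.3):
        super().__init__()

        def block(c_in, c_out, k=(3, 3, 3), s=(1, 1, 1)):
            return nn.Sequential(
                nn.Conv3d(c_in, c_out, k, stride=s, padding=tuple(x // 2 for x in k)),
                nn.BatchNorm3d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool3d((1, 2, 2)),
            )

        self.backbone = nn.Sequential(
            block(16, 32, k=(3, 5, 5), s=(1, 2, 2)),  # Block 1, per the appendix
            block(32, 64),                            # assumed width
            block(64, 128),                           # assumed width
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Dropout(p_drop),      # p = 0.3 per the appendix
            nn.Linear(128, 2),       # binary logits
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, 16, D, H, W) latent volume, channels-first as in the paper's table.
        return self.head(self.backbone(z))


logits = CNN3DClassifier()(torch.randn(1, 16, 13, 60, 90))
print(tuple(logits.shape))  # (1, 2)
```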
read the original abstract

The rapid proliferation of AI-powered video generation systems has introduced significant challenges in content moderation, particularly with respect to adult and sexually explicit material. Existing detection methods operate on either prompts or decoded pixel-space outputs; both approaches are therefore blind to the rich internal representations formed during generation. In this paper, we propose a novel latent space probing framework that intercepts the denoised latent representations produced by the CogVideoX video diffusion model during inference and attaches lightweight classifiers to perform real-time adult content detection. To support this work, we construct a large-scale binary dataset of 11,039 ten-second video clips (5,086 violating, 5,953 non-violating) sourced from adult websites and YouTube respectively. We introduce two lightweight probing classifier architectures, which we train and evaluate on this dataset. Our work demonstrates that latent-space signals encode strong discriminative features for harmful content detection, achieving 97.29% F1 on our held-out test set with overhead in the 4-6 ms range. Our results suggest that probing the latent space improves both detection performance and cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes a latent-space probing framework to detect adult content in videos generated by the CogVideoX diffusion model. The authors construct a dataset of 11,039 ten-second clips (5,086 violating, 5,953 non-violating) sourced from adult websites and YouTube, introduce two lightweight classifier architectures that operate on denoised latents intercepted during inference, and report 97.29% F1 on a held-out test split together with 4-6 ms overhead. The central claim is that latent representations encode strong discriminative features for harmful-content detection, yielding both higher accuracy and lower cost than prompt- or pixel-based alternatives.

Significance. If the reported performance generalizes to latents produced by actual text-to-video inference on novel prompts, the approach would supply an efficient, low-overhead mechanism for real-time moderation inside generative pipelines without requiring full pixel decoding. The emphasis on internal representations addresses a clear gap left by existing prompt- or output-based detectors, and the modest overhead is practically attractive for deployment.

major comments (1)
  1. The 97.29% F1 is measured on a held-out split of the real-video dataset. The intended use case, however, is detection on the specific denoised latent trajectories that CogVideoX produces when starting from noise and following its text-conditioned diffusion schedule on new prompts. No experiments are reported that evaluate the probes on actual model-generated videos, so the headline metric does not directly substantiate the deployment claim made in the abstract and introduction.
minor comments (2)
  1. The abstract and method description do not specify the exact procedure used to obtain 'denoised latent representations' for the real clips in the training set, nor which of the two proposed probe architectures produced the reported 97.29% F1.
  2. No details are given on the labeling protocol, inter-annotator agreement, handling of class imbalance, or verification that train/test splits contain no leakage.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and for identifying a key distinction between our current evaluation and the intended deployment setting. We address the major comment below.

read point-by-point responses
  1. Referee: The 97.29% F1 is measured on a held-out split of the real-video dataset. The intended use case, however, is detection on the specific denoised latent trajectories that CogVideoX produces when starting from noise and following its text-conditioned diffusion schedule on new prompts. No experiments are reported that evaluate the probes on actual model-generated videos, so the headline metric does not directly substantiate the deployment claim made in the abstract and introduction.

    Authors: We agree that the reported 97.29% F1 is obtained by encoding real video clips from our dataset into the CogVideoX VAE latent space and does not include evaluation on full denoising trajectories generated from noise under text conditioning. This constitutes a genuine gap for directly supporting the real-time moderation use case inside generative pipelines. Although the underlying VAE latent space is identical and we expect the learned discriminative features to transfer, we do not currently provide empirical evidence on model-generated content. In the revised manuscript we will add a new experimental section that generates videos with CogVideoX using both safe and adult-oriented prompts, intercepts the denoised latents at the final (and optionally intermediate) timesteps, and evaluates the same probes on these trajectories. The resulting metrics will be reported alongside the existing real-video results to directly address the deployment claim. revision: yes
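For readers who want to attempt the promised experiment, a hedged sketch of latent interception using the step-end callback of the diffusers CogVideoX pipeline follows. The model id, the callback contract, and the latent layout are assumptions drawn from the public diffusers API, not from the authors' code.

```python
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16
).to("cuda")

captured = []  # one denoised-latent tensor per sampling step


def intercept(pipe, step_index, timestep, callback_kwargs):
    # "latents" is exposed here because it is requested via tensor_inputs below.
    captured.append(callback_kwargs["latents"].detach().clone())
    return callback_kwargs


video = pipe(
    "a calm beach at sunset",
    callback_on_step_end=intercept,
    callback_on_step_end_tensor_inputs=["latents"],
).frames[0]

# Score the final-step latents with a probe (placeholder linear head here).
probe = torch.nn.Linear(16, 2).to("cuda")
z = captured[-1].float()               # assumed layout (B, T, C, H, W)
logits = probe(z.mean(dim=(1, 3, 4)))  # pool to (B, C), then classify
```

Scoring every element of `captured`, not just the last, would give the intermediate-timestep evaluation the response mentions.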

Circularity Check

0 steps flagged

Empirical supervised probing on held-out real-video latents; no derivation reduces to self-inputs

full rationale

The paper reports a standard machine-learning pipeline: construct a labeled dataset of real 10-second clips, extract latent representations (via the model's encoder or equivalent), train lightweight classifiers, and measure F1 on a held-out split of the same dataset. The 97.29% F1 and 4-6 ms overhead are direct empirical measurements on unseen examples drawn from the training distribution; no equations, ansatzes, or uniqueness theorems are presented that would make the reported metric equivalent to a fitted parameter by construction. No self-citations are invoked as load-bearing premises, and the central claim does not rename a known result or smuggle an ansatz. The work is therefore self-contained as an empirical measurement rather than a closed derivation.
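As a concrete reading of that pipeline's first stage, here is a sketch of latent-encoding a real clip with the CogVideoX VAE via diffusers. The paper does not publish its exact procedure (the referee's first minor comment), so the model id, input layout, and scaling below are assumptions from the public diffusers API.

```python
import torch
from diffusers import AutoencoderKLCogVideoX

# Only the VAE is needed to latent-encode real clips for probe training.
vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16
).to("cuda").eval()


@torch.no_grad()
def encode_clip(frames: torch.Tensor) -> torch.Tensor:
    """frames: (B, 3, T, H, W), pixel values scaled to [-1, 1]."""
    posterior = vae.encode(frames.half().to("cuda")).latent_dist
    return posterior.mean * vae.config.scaling_factor


clip = torch.rand(1, 3, 49, 480, 720) * 2 - 1  # stand-in for a real 10-second clip
latents = encode_clip(clip)
print(tuple(latents.shape))  # roughly (1, 16, 13, 60, 90) after 4x temporal / 8x spatial compression
```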

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that latent representations carry content-type information and on standard supervised learning assumptions; no new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • probe classifier weights
    The lightweight classifiers are trained on the collected dataset, so their parameters are fitted values.
axioms (1)
  • domain assumption: denoised latent representations during inference encode semantic features sufficient to distinguish adult from non-adult content
    Invoked when the paper states that latent-space signals encode strong discriminative features.

pith-pipeline@v0.9.0 · 5499 in / 1206 out tokens · 34385 ms · 2026-05-09T21:03:44.735758+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 31 canonical work pages · 13 internal anchors

  1. [1]

    Sora: Creating video from text,

    OpenAI, “Sora: Creating video from text,” 2024. [Online]. Available: https://openai.com/sora

  2. [2]

    Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Y. Liu, K. Zhang, Y. Li, Z. Yan, C. Gao, R. Chen, Z. Yuan, Y. Huang, H. Sun, J. Gao, L. He, and L. Sun, “Sora: A review on background, technology, limitations, and opportunities of large vision models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.17177

  3. [3]

    Wan: Open and Advanced Large-Scale Video Generative Models

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang et al., “Wan: Open and advanced large-scale video generative models,” arXiv preprint arXiv:2503.20314, 2025

  4. [4]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, V. Jampani, and R. Rombach, “Stable video diffusion: Scaling latent video diffusion models to large datasets,” 2023. [Online]. Available: https://arxiv.org/abs/2311.15127

  5. [5]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, D. Yin, Y. Zhang, W. Wang, Y. Cheng, B. Xu, X. Gu, Y. Dong, and J. Tang, “Cogvideox: Text-to-video diffusion models with an expert transformer,” 2025. [Online]. Available: https://arxiv.org/abs/2408.06072

  6. [6]

    Deepfakes on demand: the rise of accessible non-consensual deepfake image generators,

    W. Hawkins, C. Russell, and B. Mittelstadt, “Deepfakes on demand: the rise of accessible non-consensual deepfake image generators,” 2025. [Online]. Available: https://arxiv.org/abs/2505.03859

  7. [7]

    Video deepfake abuse: How company choices predictably shape misuse patterns,

    M. Kamachee, S. Casper, M. L. Ding, R.-J. Yew, A. Reuel, S. Biderman, and D. Hadfield-Menell, “Video deepfake abuse: How company choices predictably shape misuse patterns,” 2026. [Online]. Available: https://arxiv.org/abs/2512.11815

  8. [8]

    Adversarial attacks and defenses on text-to-image diffusion models: A survey,

C. Zhang, M. Hu, W. Li, and L. Wang, “Adversarial attacks and defenses on text-to-image diffusion models: A survey,” Information Fusion, vol. 114, p. 102701, 2025

  9. [9]

    Jailbreak attacks and defenses against multimodal generative models: A survey,

X. Liu, X. Cui, P. Li, Z. Li, H. Huang, S. Xia, M. Zhang, Y. Zou, and R. He, “Jailbreak attacks and defenses against multimodal generative models: A survey,” arXiv preprint arXiv:2411.09259, 2024

  10. [10]

    Aeiou: A unified defense framework against nsfw prompts in text-to-image models,

Y. Wang, J. Chen, Q. Li, T. Zhang, R. Zeng, X. Yang, and S. Ji, “Aeiou: A unified defense framework against nsfw prompts in text-to-image models,” arXiv preprint arXiv:2412.18123, 2024

  11. [11]

    Latent guard: a safety framework for text-to-image generation,

R. Liu, A. Khakzar, J. Gu, Q. Chen, P. Torr, and F. Pizzati, “Latent guard: a safety framework for text-to-image generation,” in European Conference on Computer Vision. Springer, 2024, pp. 93–109

  12. [12]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine et al., “Llama guard: Llm-based input-output safeguard for human-ai conversations,” arXiv preprint arXiv:2312.06674, 2023. [Online]. Available: http://arxiv.org/abs/2312.06674

  13. [13]

    Sora safety,

OpenAI, “Sora safety,” 2024. [Online]. Available: https://openai.com/sora#safety

  14. [14]

    Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models,

Y. Qu, X. Shen, X. He, M. Backes, S. Zannettou, and Y. Zhang, “Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, 2023, pp. 3403–3417

  15. [15]

Safewatch: An efficient safety-policy following video guardrail model with transparent explanations,

Z. Chen, F. Pinto, M. Pan, and B. Li, “Safewatch: An efficient safety-policy following video guardrail model with transparent explanations,” in International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=xjKz6IxgCX

  16. [16]

    T2vsafetybench: Evaluating the safety of text-to-video generative models,

Y. Miao, Y. Zhu, Y. Dong, L. Yu, J. Zhu, and X.-S. Gao, “T2vsafetybench: Evaluating the safety of text-to-video generative models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.05965

  17. [17]

    Representation Engineering: A Top-Down Approach to AI Transparency

A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowski et al., “Representation engineering: A top-down approach to ai transparency,” Computing Research Repository, vol. arXiv:2310.01405, 2023. [Online]. Available: http://arxiv.org/abs/2310.01405

  18. [18]

    Understanding intermediate layers using linear classifier probes

G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,” 2018. [Online]. Available: https://arxiv.org/abs/1610.01644

  19. [19]

    Safety beyond the interface: Detecting harm via latent llm states,

A. Khatri, C. Prabhu, and O. Neogi, “Safety beyond the interface: Detecting harm via latent llm states,” ResearchGate Preprint, 2026. [Online]. Available: https://www.researchgate.net/publication/402378765_Safety_Beyond_the_Interface_Detecting_Harm_via_Latent_LLM_States

  20. [20]

    Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming,

Anthropic, “Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming,” 2025. [Online]. Available: https://www.anthropic.com/research/constitutional-classifiers

  21. [21]

    Building production-ready probes for Gemini

J. Kramár, J. Engels, Z. Wang, B. Chughtai, R. Shah, N. Nanda, and A. Conmy, “Building production-ready probes for gemini,” arXiv preprint arXiv:2601.11516, 2026

  22. [22]

Seeing it before it happens: In-generation nsfw detection for diffusion-based text-to-image models,

F. Yang, Y. Huang, J. Zhu, L. Shi, G. Pu, J. S. Dong, and K. Wang, “Seeing it before it happens: In-generation nsfw detection for diffusion-based text-to-image models,” arXiv preprint arXiv:2508.03006, 2025

  23. [23]

    Diffusion probe: Generated image result prediction using cnn probes,

B. Cui, B. Huang, Z. Ye, X. Dong, T. Chen, H. Xue, D. Yang, L. Huang, J. Tang, and H. Hong, “Diffusion probe: Generated image result prediction using cnn probes,” arXiv preprint arXiv:2602.23783, 2026

  24. [24]

    Finding naked people,

M. M. Fleck, D. A. Forsyth, and C. Bregler, “Finding naked people,” in Proc. European Conference on Computer Vision. Springer, 1996, pp. 593–602. [Online]. Available: http://luthuli.cs.uiuc.edu/~daf/papers/naked.pdf

  25. [25]

    NudeNet: Neural nets for nudity classification, detection and selective censoring,

    P. Bedapudi, “NudeNet: Neural nets for nudity classification, detection and selective censoring,” https://github.com/notAI-tech/NudeNet, 2019, open-source software

  26. [26]

    Open NSFW: Not suitable for work (NSFW) classification using deep neural network Caffe models,

Yahoo, “Open NSFW: Not suitable for work (NSFW) classification using deep neural network Caffe models,” 2016, GitHub repository (archived October 2019). Model based on a thin ResNet-50 variant fine-tuned for image NSFW scoring. [Online]. Available: https://github.com/yahoo/open_nsfw

  27. [27]

    Learning Transferable Visual Models From Natural Language Supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” 2021. [Online]. Available: https://arxiv.org/abs/2103.00020

  28. [28]

    T2vs meet vlms: A scalable multimodal dataset for visual harmfulness recognition,

C. Yeh, Y.-M. Chang, W.-C. Chiu, and N. Yu, “T2vs meet vlms: A scalable multimodal dataset for visual harmfulness recognition,” 2024. [Online]. Available: https://arxiv.org/abs/2409.19734

  29. [29]

    Safe-clip: Removing nsfw concepts from vision-and-language models,

S. Poppi, T. Poppi, F. Cocchi, M. Cornia, L. Baraldi, and R. Cucchiara, “Safe-clip: Removing nsfw concepts from vision-and-language models,” [Online]. Available: https://arxiv.org/abs/2311.16254

  31. [31]

    Video pornography detection through deep learning techniques and motion information,

M. Perez, S. Avila, D. Moreira, D. Moraes, V. Testoni, E. Valle, S. Goldenstein, and A. Rocha, “Video pornography detection through deep learning techniques and motion information,” Neurocomputing, vol. 230, pp. 279–293, 2017

  32. [32]

    Spatiotemporal cnns for pornography detection in videos,

M. V. da Silva and A. N. Marana, “Spatiotemporal cnns for pornography detection in videos,” 2018. [Online]. Available: https://arxiv.org/abs/1810.10519

  33. [33]

    PEDA 376K: A novel dataset for deep-learning based porn-detectors,

D. C. Moreira, E. T. Pereira, and M. Alvarez, “PEDA 376K: A novel dataset for deep-learning based porn-detectors,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2020, pp. 1–8

  34. [34]

BERT Rediscovers the Classical NLP Pipeline

    I. Tenney, D. Das, and E. Pavlick, “Bert rediscovers the classical nlp pipeline,” 2019. [Online]. Available: https://arxiv.org/abs/1905.05950

  35. [35]

    A structural probe for finding syntax in word representations,

J. Hewitt and C. D. Manning, “A structural probe for finding syntax in word representations,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4129–4138. [Online]. Available: https://aclanthology.org/N19-1419

  36. [36]

Probing Classifiers: Promises, Shortcomings, and Advances

Y. Belinkov, “Probing classifiers: Promises, shortcomings, and advances,” 2021. [Online]. Available: https://arxiv.org/abs/2102.12452

  37. [37]

    Multimodal neurons in artificial neural networks,

G. Goh, N. Cammarata, C. Voss, S. Carter, M. Petrov, L. Schubert, A. Radford, and C. Olah, “Multimodal neurons in artificial neural networks,” Distill, 2021. [Online]. Available: https://distill.pub/2021/multimodal-neurons

  38. [38]

The Internal State of an LLM Knows When It's Lying

    A. Azaria and T. Mitchell, “The internal state of an llm knows when it’s lying,” 2023. [Online]. Available: https://arxiv.org/abs/2304.13734

  39. [39]

    Constitutional AI: Harmlessness from AI Feedback

Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R. ...

  40. [40]

    Training language models to follow instructions with human feedback

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” 2022. [Online]. Available: https://arxiv.org/abs/2203.02155

  41. [41]

    Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models,

P. Schramowski, M. Brack, B. Deiseroth, and K. Kersting, “Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models,” [Online]. Available: https://arxiv.org/abs/2211.05105

  43. [43]

Aligning AI with Shared Human Values

    D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt, “Aligning ai with shared human values,” 2023. [Online]. Available: https://arxiv.org/abs/2008.02275

  44. [44]

    Scalable diffusion models with transformers,

W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4195–4205

  45. [45]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift,

S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning. PMLR, 2015, pp. 448–456

  46. [46]

    Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778

  47. [47]

    Squeeze-and-excitation networks,

    J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141

  48. [48]

    Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017

  49. [49]

    On layer normalization in the transformer architecture,

R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu, “On layer normalization in the transformer architecture,” in International Conference on Machine Learning. PMLR, 2020, pp. 10524–10533

  50. [50]

    Layer Normalization

J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016

  51. [51]

    Gaussian Error Linear Units (GELUs)

D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016

  52. [52]

    Adam: A method for stochastic optimization,

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” [Online]. Available: https://arxiv.org/abs/1412.6980

  54. [54]

    3d convolutional neural networks for human action recognition,

S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2012

  55. [55]

Rectified linear units improve restricted Boltzmann machines,

V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814

  56. [56]

    Network In Network

    M. Lin, Q. Chen, and S. Yan, “Network in network,”arXiv preprint arXiv:1312.4400, 2013

  57. [57]

Dropout: a simple way to prevent neural networks from overfitting,

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014
