Opto-Atomic Spatio-Temporal Holographic Correlators for High-Speed 3D CNNs

Bowen Qi; Selim M. Shahriar; Tabassom Hamidfar; Xi Shen

arxiv: 2604.24800 · v1 · submitted 2026-04-27 · 💻 cs.AR · eess.IV


Pith reviewed 2026-05-08 01:14 UTC · model grok-4.3


keywords: 3D CNN, opto-atomic computing, holographic correlator, video classification, atomic coherence, rubidium atoms, spatio-temporal processing, hybrid architecture


The pith

Hybrid opto-atomic hardware performs 3D convolutions using atomic coherence in rubidium for high-speed video tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the computationally heavy 3D convolutional layers in neural networks for video can be handed off to a specialized opto-atomic device instead of running entirely on silicon. Temporal features are held as coherence patterns inside an array of cold rubidium-85 atoms while spatial features are handled by a conventional optical correlator, allowing both dimensions to be processed together. This hybrid setup is tested on a four-class human action recognition task, where it reaches 59.72 percent accuracy with kernels that are 30 by 40 pixels across 8 time frames. The architecture is projected to support frame rates as high as 125,000 per second. A sympathetic reader would care because current 3D CNNs scale poorly in time and energy when applied to real video streams.

Core claim

The authors establish that storing temporal information as atomic coherence in an array of inhomogeneously broadened cold Rubidium-85 atoms and combining it with a 2D spatial correlator enables simultaneous space-time correlation. This opto-atomic Spatio-temporal Holographic Correlator serves as the core of a hybrid optoelectronic architecture for 3D CNNs, delivering 59.72 percent classification accuracy on a four-class human action dataset with 30 by 40 pixel spatial and 8-frame temporal kernels while projecting operation at up to 125,000 frames per second.

What carries the argument

The Spatio-temporal Holographic Correlator (STHC) that encodes temporal data in atomic coherence within cold Rubidium-85 atoms and merges it with optical 2D spatial correlation to execute 3D convolutions in a single step.
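To make concrete what the single step replaces, here is a minimal NumPy sketch of the operation itself. It is an illustration, not the authors' simulation: it compares a direct sliding-window 3D cross-correlation with the same result computed through Fourier transforms (the correlation theorem, which is the mathematics behind a Fourier-plane correlator). The function names, the (time, height, width) axis order, the 12 × 40 × 50 clip size, and the random data are assumptions; only the 8-frame, 30 by 40 pixel kernel size comes from the paper.

```python
import numpy as np

def correlate3d_direct(video, kernel):
    """Valid-mode 3D cross-correlation by explicit sliding window (reference)."""
    T, H, W = video.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                window = video[i:i + t, j:j + h, k:k + w]
                out[i, j, k] = np.sum(window * kernel)
    return out

def correlate3d_fft(video, kernel):
    """Same result via the correlation theorem: V * conj(K), then inverse FFT."""
    shape = video.shape
    axes = (0, 1, 2)
    V = np.fft.rfftn(video, s=shape, axes=axes)
    K = np.fft.rfftn(kernel, s=shape, axes=axes)  # kernel zero-padded to clip size
    full = np.fft.irfftn(V * np.conj(K), s=shape, axes=axes)  # circular correlation
    t, h, w = kernel.shape
    # Lags 0..(N - n) along each axis involve no wrap-around: the valid region.
    return full[: shape[0] - t + 1, : shape[1] - h + 1, : shape[2] - w + 1]

# Hypothetical clip and kernel; only the kernel dimensions follow the paper.
rng = np.random.default_rng(0)
video = rng.standard_normal((12, 40, 50))   # (time, height, width), assumed
kernel = rng.standard_normal((8, 30, 40))   # 8 frames, 30 x 40 pixels

direct = correlate3d_direct(video, kernel)
via_fft = correlate3d_fft(video, kernel)
print(direct.shape)                   # (5, 11, 11)
print(np.allclose(direct, via_fft))   # True
```

The direct version touches every kernel element for every output position, which is the cost the paper wants to move off silicon. The Fourier version shows that the same numbers come out of a multiply in the transform domain. It says nothing about whether the atomic medium reproduces that multiply faithfully.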

If this is right

Large parallel kernels spanning dozens of pixels and multiple frames can be applied without the full cubic cost falling on electronic processors.
Video classification hardware can reach frame rates of 125,000 per second under the projected operating conditions.
Energy use for 3D CNN inference drops by moving the dominant correlation work into optical and atomic domains.
Modest accuracy around 60 percent is shown to be reachable on basic action datasets even with the proposed kernel dimensions.
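The scale behind these points can be checked with a few lines of arithmetic. The sketch below uses only the kernel size and the projected frame rate quoted above. The variable names are illustrative, and the per-second figure covers a single kernel at a single output position, a simplification chosen here rather than a number from the paper.

```python
# Reported kernel dimensions and projected frame rate (from the abstract).
KERNEL_H, KERNEL_W, KERNEL_T = 30, 40, 8
FRAME_RATE = 125_000  # frames per second (projected, not measured)

# Multiply-accumulate operations needed for ONE output value of ONE kernel
# when the correlation is computed directly on an electronic processor.
macs_per_output = KERNEL_H * KERNEL_W * KERNEL_T

# Time available per frame at the projected rate, in microseconds.
frame_period_us = 1e6 / FRAME_RATE

# Electronic workload to refresh that single output value once per frame.
macs_per_second = macs_per_output * FRAME_RATE

print(macs_per_output)    # 9600
print(frame_period_us)    # 8.0
print(macs_per_second)    # 1200000000
```

Multiplying by the number of output positions and the number of parallel kernels gives the full electronic cost. Neither count is stated in the text available here, so the sketch stops short of that.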

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same atomic storage method could be adapted to other temporal signal-processing tasks that currently rely on electronic memory.
Larger atomic arrays would allow kernels with higher spatial or temporal resolution without changing the core operating principle.
Embedding the device inside existing camera pipelines could produce compact systems for real-time video analysis at the edge.
Overcoming the cubic scaling of 3D convolutions this way points toward broader hybrid optical-electronic designs for other deep-learning layers.

Load-bearing premise

Atomic coherence inside the rubidium array stores and returns the temporal part of each video kernel with enough fidelity and speed that the combined spatio-temporal correlation remains accurate without decoherence or noise dominating the result.

What would settle it

A direct measurement of coherence lifetime and retrieval error for sequences spanning eight video frames in the rubidium atom array; if fidelity falls below the level needed to match the reported 59.72 percent accuracy, the claimed performance cannot hold.
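The timescale this measurement has to cover follows directly from the quoted numbers: eight frames at the projected rate span 64 microseconds. The sketch below computes that window and, as a toy model only, shows how a single-exponential decay exp(-age / T2) would reweight the eight stored frames. The decay model and every T2 value are hypothetical, since the text available here gives no coherence time or decoherence model for the device.

```python
import math

FRAMES = 8            # temporal kernel length reported in the paper
FRAME_RATE = 125_000  # projected frames per second

def storage_window_s(frames=FRAMES, frame_rate=FRAME_RATE):
    """Time needed to write `frames` consecutive frames, in seconds."""
    return frames / frame_rate

def frame_weights(t2_s, frames=FRAMES, frame_rate=FRAME_RATE):
    """Relative amplitude of each stored frame at readout, oldest first.

    Toy assumption: readout happens as the last frame is written, and each
    frame has decayed as exp(-age / T2) with a single hypothetical T2.
    """
    dt = 1.0 / frame_rate
    return [math.exp(-(frames - 1 - i) * dt / t2_s) for i in range(frames)]

print(storage_window_s() * 1e6)  # window in microseconds

# Hypothetical coherence times, chosen only to bracket the window.
for t2 in (10e-6, 100e-6, 1e-3):
    oldest = frame_weights(t2)[0]
    print(f"T2 = {t2 * 1e6:.0f} us -> oldest frame weight {oldest:.3f}")
```

Under this toy model, a T2 well below the window would suppress the earliest frames and shrink the effective temporal kernel. A measured T2 and retrieval error would replace the assumed values.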

Figures

Figures reproduced from arXiv: 2604.24800 by Bowen Qi, Selim M. Shahriar, Tabassom Hamidfar, Xi Shen.

**Figure 5.** The positive kernel K⁺ retains all positive weight values from the original kernel while …
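The caption is cut off, so the following is an assumption about where it is heading rather than a description of the paper's method. A common way to handle signed weights in a system that can only represent non-negative values is to split the kernel into two non-negative parts, K = K⁺ − K⁻, correlate with each separately, and subtract the results electronically. The sketch below illustrates that decomposition; `split_kernel`, the K⁻ definition, and the sample data are all hypothetical.

```python
import numpy as np

def split_kernel(kernel):
    """Split a signed kernel into non-negative parts with K = K_pos - K_neg.

    Assumed decomposition: K_pos keeps the positive weights and zeroes the
    rest; K_neg keeps the magnitudes of the negative weights.
    """
    k_pos = np.maximum(kernel, 0.0)
    k_neg = np.maximum(-kernel, 0.0)
    return k_pos, k_neg

# Hypothetical signed kernel and input patch of matching shape.
rng = np.random.default_rng(1)
kernel = rng.standard_normal((8, 30, 40))
patch = rng.random((8, 30, 40))  # non-negative, like an intensity image

k_pos, k_neg = split_kernel(kernel)

# Correlation is linear, so two non-negative passes recover the signed response.
signed_response = np.sum(patch * kernel)
two_pass_response = np.sum(patch * k_pos) - np.sum(patch * k_neg)
print(np.isclose(signed_response, two_pass_response))  # True
```

If the paper follows this pattern, each signed kernel costs two passes through the correlator plus one electronic subtraction.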

Original abstract

Three-dimensional convolutional neural networks (3D CNNs) have demonstrated remarkable performance in video recognition tasks by processing both spatial and temporal features. However, the cubic scaling of computational complexity poses significant time and energy efficiency challenges for conventional silicon-based hardware. To address this, we propose a hybrid optoelectronic architecture that delegates the computationally intensive 3D convolutional layer to an opto-atomic Spatio-temporal Holographic Correlator (STHC). This system stores temporal information as atomic coherence in an array of inhomogeneously broadened cold Rubidium-85 atoms and combines a traditional 2D spatial correlator to perform correlation in both space and time simultaneously. Our results on a four-class human action dataset demonstrate a classification accuracy of 59.72% using parallel large-scale kernels (30X40 pixels spatially, 8 frames temporally), with potential operating speeds projected up to 125,000 frames per second. This approach offers a pathway to massively accelerated video classification through a hybrid architecture.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is an early hardware concept for 3D CNN acceleration via atomic coherence plus holography, but the accuracy and speed numbers rest on unexamined assumptions about fidelity in the Rb atoms.


The paper describes a hybrid system that stores temporal information as coherence in cold Rb-85 atoms and handles spatial correlation with a 2D holographic setup to run 3D kernels in parallel. That specific combination for video-scale convolutions does not appear in prior hardware work I know of, and the motivation around the cubic cost of standard 3D CNNs is straightforward and worth pursuing. They give one concrete number: 59.72% accuracy on a four-class action dataset using 30x40 spatial by 8-frame temporal kernels, with a projected 125k fps rate. That is the part a reader can take away immediately. The rest of the contribution is the architectural sketch itself.

The soft spot is exactly where the stress-test note points. No derivation, simulation, or bound is supplied for coherence time, retrieval efficiency under inhomogeneous broadening, spontaneous emission, or Doppler noise at the claimed speed. Without those, it is impossible to judge whether the reported accuracy could actually come from the physical system or whether the correlation matrix would stay close enough to a software 3D conv. The abstract presents the accuracy as an empirical result and the speed as a projection, but supplies none of the supporting methods or error analysis. This leaves the central claim unevaluable from the text.

The work is aimed at researchers in optical and atomic hardware for machine learning who are looking for new directions rather than finished systems. A reader who needs reproducible results, baselines, or quantitative modeling of the atomic channel will not find enough here. I would not cite it yet. It could deserve peer review if the authors add the missing atomic-physics calculations and experimental details in revision, but the current version does not meet the threshold for a serious referee on its own.

Referee Report

2 major / 0 minor

Summary. The paper proposes a hybrid optoelectronic architecture for 3D CNNs in which the computationally intensive 3D convolutional layer is delegated to an opto-atomic Spatio-temporal Holographic Correlator (STHC). Temporal information is stored as atomic coherence in an array of inhomogeneously broadened cold Rubidium-85 atoms and combined with a conventional 2D spatial correlator to perform joint spatio-temporal correlation. On a four-class human action dataset the system is reported to achieve 59.72% classification accuracy with kernels of 30×40 pixels spatially and 8 frames temporally, with a projected operating speed of up to 125,000 frames per second.

Significance. If the performance claims are substantiated, the work would offer a concrete route to overcoming the cubic scaling of 3D CNNs through a hybrid optical-atomic-electronic platform capable of massively parallel, high-speed spatio-temporal correlations. The approach is novel in its use of atomic coherence for temporal kernel storage and could enable energy-efficient video processing at speeds far beyond current silicon implementations.

major comments (2)

Abstract and Results section: the classification accuracy of 59.72% is stated without any description of the dataset, software baseline, training procedure, error bars, or experimental/simulation protocol used to obtain the figure. Because the central claim rests on the STHC delivering performance equivalent to a conventional 3D convolution, the absence of these details prevents verification that the reported accuracy is attributable to the proposed hardware rather than an idealized software simulation.
Abstract and Results section: no quantitative analysis, simulation, or bounds are supplied for atomic coherence time T2, retrieval efficiency under inhomogeneous broadening, or additive noise (spontaneous emission, Doppler broadening) when storing and retrieving 8-frame temporal sequences at the projected 125 kfps rate. Without such analysis the fidelity of the spatio-temporal correlation matrix cannot be assessed, directly undermining both the accuracy number and the speed projection.
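The error-bar concern in the first comment can be illustrated with a standard Wilson score interval for a reported accuracy. The sketch below is hedged accordingly: the test-set sizes are hypothetical, because the text available here does not state how many clips were evaluated, and the 25% chance level assumes four balanced classes.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p_hat + z2 / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z2 / (4.0 * n * n)) / denom
    return center - half, center + half

REPORTED_ACCURACY = 0.5972  # from the abstract
CHANCE_LEVEL = 0.25         # four classes, assumed balanced

# Hypothetical test-set sizes; the paper's actual size is not given here.
for n_test in (100, 1000):
    lo, hi = wilson_interval(REPORTED_ACCURACY, n_test)
    print(f"n = {n_test}: [{lo:.3f}, {hi:.3f}], above chance: {lo > CHANCE_LEVEL}")
```

The interval width depends strongly on the sample count, which is why the referee's request for the evaluation protocol matters as much as the headline number.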

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments. We address each major comment below and will revise the manuscript accordingly to improve clarity and substantiation of our claims.


Referee: Abstract and Results section: the classification accuracy of 59.72% is stated without any description of the dataset, software baseline, training procedure, error bars, or experimental/simulation protocol used to obtain the figure. Because the central claim rests on the STHC delivering performance equivalent to a conventional 3D convolution, the absence of these details prevents verification that the reported accuracy is attributable to the proposed hardware rather than an idealized software simulation.

Authors: We agree that these details are essential for verification. In the revised manuscript we will expand the Results section with a complete description of the four-class human action dataset, the conventional 3D CNN software baseline, the training procedure (including hyperparameters and optimizer), error bars from repeated simulations, and the exact simulation protocol used to model the STHC. This will confirm that the reported accuracy arises from the modeled opto-atomic correlator rather than an idealized software convolution.

Revision: yes

Referee: Abstract and Results section: no quantitative analysis, simulation, or bounds are supplied for atomic coherence time T2, retrieval efficiency under inhomogeneous broadening, or additive noise (spontaneous emission, Doppler broadening) when storing and retrieving 8-frame temporal sequences at the projected 125 kfps rate. Without such analysis the fidelity of the spatio-temporal correlation matrix cannot be assessed, directly undermining both the accuracy number and the speed projection.

Authors: We acknowledge the need for quantitative physical bounds. In the revision we will add a dedicated subsection (in Methods or a new Discussion section) supplying estimates and bounds for T2 in cold Rb-85, retrieval efficiency under inhomogeneous broadening, and noise contributions from spontaneous emission and Doppler broadening for 8-frame sequences at the projected rate. These will be derived from established atomic-physics parameters and will include a fidelity estimate for the spatio-temporal correlation matrix.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical accuracy and projected speed rest on experimental results, not self-referential derivations


The manuscript presents an opto-atomic hardware proposal for 3D CNN acceleration and reports an empirical classification accuracy of 59.72% on a four-class action dataset using 30x40x8 kernels. No equations, derivations, fitted parameters, or uniqueness theorems appear in the provided text. The accuracy figure is stated as a measured outcome rather than a prediction derived from the architecture itself, and the 125 kfps speed is explicitly labeled a projection. Absent any load-bearing self-citation chains, ansatzes, or reductions of outputs to inputs by construction, the central claims remain independent of the circularity patterns enumerated in the analysis criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified feasibility of using atomic coherence for temporal storage and holographic readout for spatial correlation at the stated kernel sizes; no free parameters or invented entities beyond the named correlator are quantified in the abstract.

axioms (1)

Domain assumption: Atomic coherence in inhomogeneously broadened cold Rubidium-85 atoms can store temporal information for correlation purposes.
Invoked in the abstract as the mechanism for handling the temporal dimension of 3D convolution.

invented entities (1)

Spatio-temporal Holographic Correlator (STHC): no independent evidence
Purpose: To perform simultaneous spatial and temporal correlation for 3D CNN layers using atomic coherence and holography.
Introduced as the core delegated component of the hybrid architecture.



