An Analysis of Deep Neural Networks with Attention for Action Recognition from a Neurophysiological Perspective

Oswald Lanz; Swathikiran Sudhakaran

arxiv: 1907.01273 · v1 · pith:476IN7U7new · submitted 2019-07-02 · 💻 cs.CV

Pith reviewed 2026-05-25 11:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords: action recognition, deep learning, attention, neurophysiology, brain hypotheses, comparative analysis, video understanding

The pith

Three deep learning methods for action recognition parallel hypotheses about human brain function.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews three recent deep learning methods for recognizing actions in video. It offers a comparative analysis of these methods from a neurophysiological perspective. The authors posit analogies between the methods and existing hypotheses on how the human brain processes visual information for actions. A sympathetic reader would care because this suggests the artificial models may implement computational principles similar to those refined by biology.

Core claim

We review three recent deep learning based methods for action recognition and present a brief comparative analysis of the methods from a neurophysiological point of view. We posit that there are some analogy between the three presented deep learning based methods and some of the existing hypotheses regarding the functioning of human brain.

What carries the argument

The posited functional analogies between attention-based deep networks for action recognition and neurophysiological hypotheses on brain processing.

Load-bearing premise

The three deep learning methods can be meaningfully compared to specific neurophysiological hypotheses in a way that reveals functional analogies.

What would settle it

A detailed mapping showing that the internal computations in the three methods do not align with the core operations described in the brain hypotheses would disprove the analogies.

Figures

Figures reproduced from arXiv: 1907.01273 by Oswald Lanz, Swathikiran Sudhakaran.

**Figure 2.** Attention maps for some frames in HMDB51 dataset. Top row: action class

Original abstract

We review three recent deep learning based methods for action recognition and present a brief comparative analysis of the methods from a neurophyisiological point of view. We posit that there are some analogy between the three presented deep learning based methods and some of the existing hypotheses regarding the functioning of human brain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a short review that summarizes three action recognition models and posits loose analogies to brain hypotheses, without new experiments or quantitative analysis.

This paper is a short comparative review of three deep learning methods for action recognition, viewed through a neurophysiological lens. The authors review the methods and suggest some analogies to existing ideas about brain function. That's the core of it.

The paper does a decent job of summarizing the three methods and pointing out possible parallels. If the analogies are drawn carefully, it could help readers think about how these models relate to biological systems. Credit to the authors for trying to bridge the two fields in a concise way.

The main limitation is that the analogies are presented as posits without detailed mappings, quantitative support, or new experiments. The abstract frames it as a review, so this is expected, but it means the contribution is mostly in the comparison rather than in advancing either field. No new results or derivations are introduced.

This kind of paper is for researchers already working on action recognition who are curious about neuro connections, or neuroscientists looking at DL models. It might spark some discussion in a reading group, but it doesn't have the depth or novelty for a full journal article. I wouldn't cite it in my own work for any specific finding. I'd recommend against sending it to peer review. It fits better as a workshop or arXiv note.

Referee Report

1 major / 2 minor

Summary. The manuscript reviews three recent deep learning-based methods for action recognition and presents a brief comparative analysis from a neurophysiological perspective. It posits that analogies exist between these methods and existing hypotheses on human brain functioning.

Significance. If the analogies are articulated clearly, the paper could serve as a modest bridge between computer vision and neuroscience literature, highlighting potential functional parallels. As a short review without new empirical data, quantitative metrics, or falsifiable predictions, its primary value would lie in prompting interdisciplinary discussion rather than establishing rigorous mappings.

major comments (1)

[Abstract] The central claim consists of positing 'some analogy' between the three DL methods and neurophysiological hypotheses, yet no quantitative comparisons, error analysis, or explicit mappings are described. This leaves the claim as an opinion-based assertion rather than a substantiated comparative result.

minor comments (2)

[Abstract] 'neurophyisiological' is misspelled; 'some analogy' should be 'some analogies' to agree with the plural verb 'are'.
The manuscript is described as a 'brief comparative analysis'; expanding the review with at least one concrete example of a shared mechanism (e.g., attention weighting versus a specific cortical pathway) would improve clarity without altering the review format.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and the recommendation of minor revision. We address the single major comment below.

Referee: [Abstract] The central claim consists of positing 'some analogy' between the three DL methods and neurophysiological hypotheses, yet no quantitative comparisons, error analysis, or explicit mappings are described. This leaves the claim as an opinion-based assertion rather than a substantiated comparative result.

Authors: We agree that the manuscript contains no quantitative comparisons, error analyses, or explicit mappings; this is by design. The work is a short review whose stated goal (see abstract and introduction) is to review three attention-based methods and to posit qualitative analogies with existing neurophysiological hypotheses in order to stimulate interdisciplinary discussion. The referee's own significance assessment correctly notes that the paper's primary value lies in prompting such discussion rather than in establishing rigorous mappings. The abstract accurately reflects this limited scope. No changes to the abstract or addition of quantitative material are planned.

Revision: no

Circularity Check

0 steps flagged

No significant circularity

The paper is a review and comparative analysis that posits analogies between three deep learning methods for action recognition and existing neurophysiological hypotheses. It contains no equations, derivations, fitted parameters, or load-bearing mathematical steps. The central claim is a modest positing of observed parallels permitted by a review format, with no reduction of any result to its own inputs by construction or self-citation chain. The paper is self-contained as a qualitative review against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that meaningful analogies can be drawn between the reviewed DL methods and brain hypotheses; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Analogies exist between the three DL attention methods and existing neurophysiological hypotheses on brain function.
This is the core positing of the paper, presented without new supporting evidence in the abstract.

pith-pipeline@v0.9.0 · 5564 in / 1007 out tokens · 29883 ms · 2026-05-25T11:15:07.120990+00:00

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
Paper passage: "We posit that there are some analogy between the three presented deep learning based methods and some of the existing hypotheses regarding the functioning of human brain."

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
Paper passage: "top-down attention... multiple pathway hypothesis... parallel information streams"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1] C. F. Cadieu, H. Hong, D. Yamins, N. Pinto, D. Ardila, E. Solomon, N. Majaj, and J. DiCarlo. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS computational biology, 10(12), 2014.
[2] J. Duncan. Selective attention and the organization of visual information. Journal of Experimental Psychology: General, 113(4):501, 1984.
[3] M. Eickenberg, A. Gramfort, G. Varoquaux, and B. Thirion. Seeing it all: Convolutional network layers map the function of the human visual system. NeuroImage, 152:184–194.
[4] K. Fukushima and S. Miyake. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern recognition, 15(6):455–469, 1982.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.
[6] D. Hubel and T. Wiesel. Ferrier lecture: Functional architecture of macaque monkey visual cortex. Proceedings of the Royal Society of London. Series B, Biological Sciences, pages 1–59, 1977.
[7] S. Kheradpisheh, M. Ghodrati, M. Ganjtabesh, and T. Masquelier. Deep networks can resemble human feed-forward vision in invariant object recognition. Scientific reports, 6:32672, 2016.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[9] J. Nassi and E. Callaway. Parallel processing strategies of the primate visual system. Nature reviews neuroscience, 10(5):360, 2009.
[10] S. Sudhakaran, S. Escalera, and O. Lanz. LSTA: Long Short-Term Attention for Egocentric Action Recognition. In Proc. CVPR, 2019.
[11] S. Sudhakaran and O. Lanz. Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition. In Proc. British Machine Vision Conference (BMVC), 2018.
[12] S. Sudhakaran and O. Lanz. Top-down Attention Recurrent VLAD Encoding for Action Recognition in Videos. In Proc. 17th International Conference of the Italian Association for Artificial Intelligence (AI*IA), 2018.
[13] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proc. 31st AAAI Conference on Artificial Intelligence, 2017.
[14] S. Thorpe, D. Fize, and C. Marlot. Speed of processing in the human visual system. Nature, 381(6582):520, 1996.
[15] T. Tu, J. Koss, and P. Sajda. Relating deep neural network representations to EEG-fMRI spatiotemporal dynamics in a perceptual decision-making task. In Proc. CVPR Workshops, pages 1985–1991, 2018.
[16] S. Ungerleider and L. G. Mechanisms of visual attention in the human cortex. Annual review of neuroscience, 23(1):315–341, 2000.
[17] E. Warrington and R. McCarthy. Categories of knowledge: Further fractionations and an attempted integration. Brain, 110(5):1273–1296, 1987.
[18] D. Yamins and J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature neuroscience, 19(3):356, 2016.
