Appearance-free Action Recognition: Zero-shot Generalization in Humans and a Two-Pathway Model


arxiv: 2604.16675 · v1 · submitted 2026-04-17 · 💻 cs.CV


Prerana Kumar (1, 2), Martin A. Giese (1) ((1) Hertie Institute, University of Tuebingen, (2) IMPRS-IS)

Pith reviewed 2026-05-10 08:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords: videos, appearance-free, action, humans, model, motion, generalization, models


The pith

Humans generalize zero-shot to appearance-free action videos, and a two-pathway CNN model with coherence-gating outperforms standard video models while matching this behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The researchers trained people on regular videos of five actions and then tested them on versions where body shapes were stripped away, leaving only moving dots or dense noise that carries the original motion. Participants still identified the actions well above chance (20% for five classes). The team built a computer model with one stream for static form and another for optical-flow motion, plus a gating step that emphasizes coherently moving regions, in the spirit of Gestalt common-fate grouping. The motion stream proved essential for the degraded videos, while the form stream helped on normal ones. The model beat other video classifiers on the appearance-free tests.
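To make the form/motion split and the ablation logic concrete, here is a minimal sketch of late fusion over two streams. It is not the paper's fusion rule, which this page does not specify: summing per-class logits, the `fuse` function, its `use_form`/`use_motion` switches, and all logit values are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(form_logits, motion_logits, use_form=True, use_motion=True):
    """Sum the per-class logits of the enabled streams (assumed fusion rule)."""
    if not (use_form or use_motion):
        raise ValueError("at least one stream must be enabled")
    fused = [
        (f if use_form else 0.0) + (m if use_motion else 0.0)
        for f, m in zip(form_logits, motion_logits)
    ]
    return softmax(fused)

# Hypothetical appearance-free clip: the form stream sees only noise and is
# uninformative, while the motion stream favours class index 3.
form_logits = [0.0, 0.0, 0.0, 0.0, 0.0]
motion_logits = [0.1, 0.2, 0.0, 2.0, 0.3]

two_stream = fuse(form_logits, motion_logits)
form_only = fuse(form_logits, motion_logits, use_motion=False)
predicted = max(range(5), key=lambda i: two_stream[i])
```

In this toy case the form-only ablation falls back to a uniform guess over the five classes, while the two-stream output follows the motion evidence, which is the qualitative pattern the ablations are reported to show.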

Core claim

Our model generalizes to both appearance-free datasets and outperforms contemporary video classification models, narrowing the gap to human performance. We find that the motion pathway is critical for generalization to appearance-free videos, while the form pathway improves performance on naturalistic videos.

Load-bearing premise

That the specific appearance-free transformations (dense-noise from AFD5 and random-dot videos) isolate motion cues sufficiently for recognition without participants or the model exploiting residual static shape information from training.

Figures

Figures reproduced from arXiv: 2604.16675 by Prerana Kumar and Martin A. Giese.

**Figure 1.** Appearance-free stimulus generation. AFD5 videos (Ilic et al., 2022) are generated by warping dense noise through time using RAFT optical flow (Teed & Deng, 2020) from the source RGB video. AFF5 videos (this work) are generated by warping sparse dots with the same RAFT flow and introducing finite-lifetime random dots in the background. Panel labels: Jumping Jacks, Lunges.
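The warping idea in the Figure 1 caption can be sketched on a toy grid. This is a simplified illustration under stated assumptions, not the AFD5 or AFF5 pipeline: the flow here is a hand-made integer field rather than RAFT output, sampling is nearest-pixel with wrap-around borders, and `warp_frame` is a hypothetical helper name.

```python
import random

def warp_frame(frame, flow):
    """Backward-warp a 2-D frame with an integer flow field.

    flow[y][x] = (dx, dy) means the content now at (x, y) came from
    (x - dx, y - dy) in the previous frame. Borders wrap around.
    """
    h, w = len(frame), len(frame[0])
    warped = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            warped[y][x] = frame[(y - dy) % h][(x - dx) % w]
    return warped

rng = random.Random(0)
H, W = 6, 8
noise = [[rng.random() for _ in range(W)] for _ in range(H)]

# Assumed toy flow: rows 2-3 (the "actor") move one pixel right,
# everything else is static.
flow = [[(1, 0) if y in (2, 3) else (0, 0) for _ in range(W)] for y in range(H)]

next_frame = warp_frame(noise, flow)
```

Each individual frame is still just noise with the same pixel values; the moving region only becomes visible when consecutive frames are compared, which is the property the appearance-free stimuli rely on.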

**Figure 2.** Example stimuli. Representative single frames from two action videos across UCF5 (RGB), AFD5 (dense-noise appearance-free), and AFF5 (sparse random-dot appearance-free) videos. Single frames in AFD5/AFF5 videos contain minimal static appearance cues.

… novel synthetic stimuli preserving the object motion from the original videos. Recent findings from neuroscience (Robert et al., 2023) demonstrated that motion…

**Figure 3.** Model schematic (CG2-X3D). Two-stream 3D CNN-based architecture with an RGB (form) stream and an explicit motion stream with fusion of streams for classification. The motion stream includes coherence-gated optical flow representations.

… training block first, followed by the UCF5 test block. The order of the two appearance-free blocks was counterbalanced (half the subjects viewed AFD5 videos before AFF5 and…
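The counterbalancing mentioned in the excerpt can be illustrated with a short sketch. The assignment rule is an assumption: the page only says that half the participants saw AFD5 before AFF5, so alternating by participant index, the block labels, and placing both appearance-free blocks after the UCF5 test block are hypothetical choices.

```python
def block_order(participant_index):
    """Return the block sequence for one participant (assumed scheme)."""
    appearance_free = ["AFD5", "AFF5"]
    if participant_index % 2 == 1:
        # Odd-indexed participants see the two appearance-free blocks swapped.
        appearance_free.reverse()
    return ["UCF5 training", "UCF5 test"] + appearance_free

n_participants = 22  # sample size reported in the abstract
orders = [block_order(i) for i in range(n_participants)]
afd5_first = sum(order[2] == "AFD5" for order in orders)
```

With 22 participants this yields 11 per order, so any order effect between the two appearance-free blocks is balanced across the group.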

**Figure 4.** Human accuracy across conditions. Each light blue point/line shows one participant’s accuracy across UCF5, AFD5, and AFF5. The dark blue points/line indicate the group mean ± SEM across participants (n = 22). The dotted horizontal line indicates chance accuracy (20%).
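The 20% chance line follows from five equally likely classes. As a sketch of how "above chance" could be checked for one participant, the block below computes an exact one-sided binomial tail. The trial and hit counts are hypothetical placeholders, not data from the paper, and the paper's own statistical test is not stated on this page.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_classes = 5
chance = 1 / n_classes  # 0.2, the dotted line in Figure 4

# Hypothetical participant: 20 correct out of 50 appearance-free trials.
n_trials, n_correct = 50, 20
p_value = binom_tail(n_correct, n_trials, chance)
```

A small `p_value` means that many correct responses would be unlikely under pure guessing; a group-level analysis would typically test participant accuracies against 0.2 instead.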

**Figure 6.** Ablations. Mean accuracy across five seeds for RGB-only model, flow-only model, and the two-stream model across UCF5, AFD5, and AFF5. Error bars indicate ±1 SD across seeds.

… training on appearance-free stimuli. This extends classical findings from simplified motion displays to naturalistic, highly variable stimulus classes. Prior work in this area evaluated appearance-free recognition using AFD5 stimuli,…
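The error bars in the Figure 4 and Figure 6 captions differ in kind: SEM across participants for humans, SD across seeds for the model. The sketch below shows both computations with Python's `statistics` module. The accuracy values are made-up placeholders used only to exercise the functions, and `mean_sd` and `mean_sem` are hypothetical helper names.

```python
from math import sqrt
from statistics import mean, stdev

def mean_sd(values):
    """Mean and sample standard deviation (n - 1 in the denominator)."""
    return mean(values), stdev(values)

def mean_sem(values):
    """Mean and standard error of the mean, SD / sqrt(n)."""
    m, sd = mean_sd(values)
    return m, sd / sqrt(len(values))

# Placeholder accuracies for five seeds; NOT results from the paper.
seed_accuracies = [0.5, 0.6, 0.7, 0.8, 0.9]

model_mean, model_sd = mean_sd(seed_accuracies)
_, model_sem = mean_sem(seed_accuracies)
```

Because SEM shrinks with the square root of the sample size, SEM bars over 22 participants and SD bars over five seeds are not directly comparable in width.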

Original abstract

Action recognition is a fundamental ability for social species. Yet, its underlying computations are not well understood. Classical psychophysical studies using simplified stimuli have shown that humans can perceive body motion even under degradation of relevant shape cues. Recent work using real-world action videos and their appearance-free counterparts (that preserve motion but lack static shape cues) included explicit training of humans and models on the appearance-free videos. Whether humans and vision models generalize in a zero-shot manner to appearance-free transformations of real-world action videos is not yet known. To measure this generalization in humans, we conducted a laboratory-based psychophysics experiment. 22 participants were trained to recognize five action categories using naturalistic videos (UCF5 dataset), and tested zero-shot on two types of appearance-free transformations: (i) dense-noise motion videos from an existing dataset (AFD5) and (ii) random-dot appearance-free videos. We find that participants recognize actions in both types of appearance-free videos well above chance, albeit with reduced accuracy compared to naturalistic videos. To model this behavior, we developed a two-pathway 3D CNN-based model combining an RGB (form) stream and an optical flow (motion) stream, including a coherence-gating mechanism inspired by Gestalt common-fate grouping. Our model generalizes to both appearance-free datasets and outperforms contemporary video classification models, narrowing the gap to human performance. We find that the motion pathway is critical for generalization to appearance-free videos, while the form pathway improves performance on naturalistic videos. Our findings highlight the importance of motion-based representations for generalization to appearance-free videos, and support the use of multi-stream architectures to model video-based action recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Humans generalize zero-shot to appearance-free action videos after training on normal ones, and the two-pathway model with coherence gating beats standard classifiers on those cases while showing the motion stream matters most. The experiment trains 22 participants on UCF5 categories then tests them on AFD5 dense-noise videos and random-dot versions with no prior exposure to the degraded formats. The model runs an RGB form stream and an optical flow motion stream in parallel, adds a coherence-gating step inspired by Gestalt common fate, and reports better accuracy than contemporary video classifiers on the appearance-free sets. Ablations indicate the motion pathway drives the generalization while the form pathway helps on the original videos.

This setup gives a direct comparison between human behavior and a trainable architecture on the same zero-shot task. The link to classical psychophysics on body motion under degraded shape is a clear strength, and the pathway split makes the computational claim testable.

The main soft spot is whether the transformations truly isolate motion. Dense noise and random dots are intended to remove static shape, but if any residual low-frequency or correlated form cues remain, both people and the model could succeed by detecting those instead of extracting unseen motion patterns. The paper needs explicit checks, such as control tests on shape-only versions or feature analysis, to confirm the isolation. Without them the claim that motion is critical rests on an assumption that may not fully hold.

Readers working on robust action recognition or bio-inspired video models would get value from the human data and the architecture variant. It is not a large leap but it supplies a concrete data point worth checking. I would send it for peer review because the human experiment and the model comparison are substantive enough to deserve referee time.

Referee Report

2 major / 3 minor

Summary. The manuscript reports a psychophysics experiment in which 22 participants trained on naturalistic UCF5 action videos achieve above-chance zero-shot recognition on two appearance-free transformations (dense-noise videos from AFD5 and random-dot videos). It introduces a two-pathway 3D CNN with an RGB form stream, an optical-flow motion stream, and a coherence-gating mechanism inspired by Gestalt common-fate grouping; the model generalizes to the same appearance-free stimuli, outperforms contemporary video classifiers, narrows the gap to human accuracy, and shows via ablations that the motion pathway is critical for appearance-free generalization while the form pathway aids naturalistic performance.

Significance. If the central claims hold, the work is significant for linking human zero-shot generalization under cue degradation to a concrete computational architecture. The direct human-model comparison on identical stimuli and the pathway ablations provide testable insights into the computational role of motion versus form. The zero-shot design (no retraining on appearance-free data) and the explicit Gestalt-inspired gating are strengths that distinguish the contribution from standard supervised video classification.

major comments (2)

[Methods (Stimuli)] Methods, Stimuli subsection: The claim that participants and the model perform zero-shot generalization to motion cues requires that the AFD5 dense-noise and random-dot videos contain no residual static or low-frequency shape information correlated with the five UCF5 classes. No control analysis (e.g., Fourier spectrum comparison, static-frame classifier accuracy, or human performance on single frames) is reported to verify cue elimination. This verification is load-bearing for interpreting the above-chance results and the ablation finding that the motion pathway is critical.
[Results] Results, Model evaluation: The reported superiority over contemporary video classification models and the narrowing of the gap to human performance are presented without tabulated accuracies, standard errors, participant-level or run-level statistics, or explicit training protocols for the baselines. These omissions prevent assessment of whether the two-pathway advantage is robust or driven by implementation details.

minor comments (3)

[Abstract] The abstract states 'well above chance' and 'outperforms' without numerical values; the main text should include exact percentages, chance levels, and confidence intervals for both human and model conditions.
[Methods] Participant details beyond n=22 (age range, screening, exact trial counts per condition) are needed for reproducibility of the psychophysics results.
[Model Architecture] The coherence-gating mechanism is described at a high level; a precise equation or pseudocode for how gating is computed from the two streams would clarify the architectural contribution.
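To illustrate the kind of formulation the last minor comment asks for, here is one possible coherence gate. It is an assumption, not the paper's mechanism, which this page describes only at a high level: local coherence is taken as the length of the summed flow vectors in a small window divided by the sum of their lengths, and that value in [0, 1] scales the flow. The names `local_coherence`, `gate_flow`, and `radius` are hypothetical.

```python
import math

def local_coherence(flow, radius=1, eps=1e-8):
    """Per-pixel coherence in [0, 1]: |sum of vectors| / sum of |vectors|."""
    h, w = len(flow), len(flow[0])
    coherence = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sum_u = sum_v = sum_mag = 0.0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    u, v = flow[yy][xx]
                    sum_u += u
                    sum_v += v
                    sum_mag += math.hypot(u, v)
            coherence[y][x] = math.hypot(sum_u, sum_v) / (sum_mag + eps)
    return coherence

def gate_flow(flow, radius=1):
    """Scale each flow vector by the coherence of its neighbourhood."""
    coherence = local_coherence(flow, radius)
    return [
        [(c * u, c * v) for (u, v), c in zip(flow_row, coh_row)]
        for flow_row, coh_row in zip(flow, coherence)
    ]

# Common-fate region: every pixel moves right by one unit.
coherent = [[(1.0, 0.0)] * 4 for _ in range(4)]
# Incoherent region: neighbouring pixels move in opposite directions.
incoherent = [
    [(1.0, 0.0) if (x + y) % 2 == 0 else (-1.0, 0.0) for x in range(4)]
    for y in range(4)
]

coh_map = local_coherence(coherent)
incoh_map = local_coherence(incoherent)
gated = gate_flow(incoherent)
```

Under this assumed gate, regions that move together pass through almost unchanged, while flicker-like motion with conflicting directions is strongly attenuated.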

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to improve the manuscript.


Referee: [Methods (Stimuli)] Methods, Stimuli subsection: The claim that participants and the model perform zero-shot generalization to motion cues requires that the AFD5 dense-noise and random-dot videos contain no residual static or low-frequency shape information correlated with the five UCF5 classes. No control analysis (e.g., Fourier spectrum comparison, static-frame classifier accuracy, or human performance on single frames) is reported to verify cue elimination. This verification is load-bearing for interpreting the above-chance results and the ablation finding that the motion pathway is critical.

Authors: We agree that explicit verification of cue elimination is essential to support the zero-shot generalization interpretation. Although the AFD5 stimuli were constructed to preserve motion while removing static appearance, we did not report the requested controls in the original submission. In the revised manuscript we will add: (1) Fourier spectrum comparisons between the original UCF5 videos and their AFD5/random-dot counterparts, (2) accuracy of a static-frame classifier trained and tested on single frames from the appearance-free videos, and (3) human performance on single frames of the same stimuli. These analyses will be placed in the Methods and Results sections and will directly address whether residual static shape information could explain the above-chance performance.

Revision: yes

Referee: [Results] Results, Model evaluation: The reported superiority over contemporary video classification models and the narrowing of the gap to human performance are presented without tabulated accuracies, standard errors, participant-level or run-level statistics, or explicit training protocols for the baselines. These omissions prevent assessment of whether the two-pathway advantage is robust or driven by implementation details.

Authors: We acknowledge that the current presentation lacks the quantitative detail needed for full evaluation. In the revision we will add a table (and supplementary tables) reporting mean accuracies with standard errors for the two-pathway model, all baseline models, and human participants. We will also include participant-level and run-level (multiple random seeds) statistics, and provide explicit training protocols, hyperparameters, and implementation details for every baseline. These additions will appear in the Results section and will allow readers to assess the robustness of the reported advantages.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements on external datasets


The paper's claims rest on direct psychophysical testing of 22 human participants (trained on UCF5 naturalistic videos, tested zero-shot on AFD5 dense-noise and random-dot videos) and on empirical accuracy/ablation results of a two-pathway 3D CNN evaluated on the same held-out transformed datasets. These are measured outcomes, not quantities derived from the model's equations that reduce by construction to fitted parameters or architectural choices. The coherence-gating mechanism is presented as an inspired design decision rather than a definitional necessity, and no self-citation chain or uniqueness theorem is invoked to force the central conclusions about pathway contributions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that appearance-free stimuli isolate motion without usable shape leakage and that the coherence-gating mechanism implements common-fate grouping in a way that transfers to the tested datasets.

axioms (1)

Domain assumption: Humans can perceive body motion even under degradation of relevant shape cues.
Invoked in the abstract, with reference to classical psychophysical studies, as the basis for testing zero-shot generalization.



