SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting
Pith reviewed 2026-05-23 22:17 UTC · model grok-4.3
The pith
A new optical flow feature inside compact sliding windows combined with a multi-scale spatio-temporal transformer improves spotting of facial expressions, especially micro-expressions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose the CSW-MRO feature, which calculates multi-temporal-resolution optical flow of the input image sequence within compact sliding windows whose length is tailored to perceive complete micro-expressions and distinguish between general macro- and micro-expressions. We propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes spatio-temporal relationships of the CSW-MRO features for accurate frame-level probability estimation, using the Facial Local Graph Pooling operation and convolutional layers to extract multi-scale features. Supervised contrastive learning is introduced into SpotFormer to enhance discriminability between different types of facial-ex
What carries the argument
SpotFormer, the multi-scale spatio-temporal Transformer that encodes relationships in CSW-MRO features via Facial Local Graph Pooling and convolutional layers to produce frame-level expression probabilities.
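The abstract does not describe how frame-level probabilities are turned into spotted intervals. As a minimal sketch, assuming a plain fixed-threshold rule (the function name, `threshold`, and `min_len` are hypothetical and not taken from the paper), runs of consecutive above-threshold frames could be grouped like this:

```python
def probabilities_to_intervals(probs, threshold=0.5, min_len=1):
    """Group consecutive frames with probability >= threshold into
    inclusive (start, end) intervals. Illustrative post-processing only;
    the paper's actual procedure is not given in the abstract."""
    intervals = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a run of above-threshold frames begins here
        elif p < threshold and start is not None:
            if i - start >= min_len:
                intervals.append((start, i - 1))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(probs) - start >= min_len:
        intervals.append((start, len(probs) - 1))
    return intervals


print(probabilities_to_intervals([0.1, 0.7, 0.8, 0.2, 0.9]))
```

Raising `min_len` discards very short runs, which is one simple way to suppress isolated false positives.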
If this is right
- Frame-level probability estimates become more accurate for both macro- and micro-expressions.
- Micro-expression spotting performance exceeds that of prior state-of-the-art models on SAMM-LV, CAS(ME)^2, and CAS(ME)^3.
- Supervised contrastive learning increases separation between expression categories.
- Model variants confirm that the combination of multi-scale encoding and the proposed pooling operation contributes to the reported gains.
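The abstract does not give the form of the contrastive objective. The sketch below assumes the commonly used supervised contrastive loss, in which each anchor is pulled toward samples with the same label and pushed away from all others. The function names, the temperature value, and the label coding are illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have at
    least one same-label positive. Assumed formulation, for illustration."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in others
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Hypothetical 2-D embeddings forming two tight clusters.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(supcon_loss(emb, [0, 0, 1, 1]) < supcon_loss(emb, [0, 1, 0, 1]))
```

The loss is lower when labels agree with the cluster structure of the embeddings, which is the sense in which such an objective increases separation between categories.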
Where Pith is reading between the lines
- The sliding-window optical-flow design could be adapted for spotting other brief, low-amplitude events in video such as small gestures or anomalies.
- Performance on datasets with different frame rates or head-motion statistics may require re-tuning of window lengths.
- Adding audio or head-pose normalization could address remaining cases of irrelevant facial movements not handled by the current optical-flow step.
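To make the frame-rate point concrete, here is a small sketch of how a window defined by duration would translate into frames at different capture rates. The 0.5 s duration and the two frame rates are illustrative assumptions; the abstract does not state the window lengths actually used.

```python
def window_length_frames(duration_s, fps):
    """Convert a target window duration in seconds into a frame count,
    never returning fewer than one frame."""
    return max(1, round(duration_s * fps))


# Hypothetical target duration, evaluated at a low and a high frame rate.
for fps in (30, 200):
    print(fps, window_length_frames(0.5, fps))
```

A window tuned in frames for one frame rate covers a very different time span at another, which is why re-tuning may be needed.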
Load-bearing premise
The multi-temporal-resolution optical flow computed inside compact sliding windows of tailored length will reveal subtle micro-expression motions while preventing head movements from dominating the signal on the evaluated datasets.
What would settle it
If the method shows no accuracy gain over baselines on a new video set containing larger head movements or micro-expressions whose durations fall outside the chosen window lengths, the central performance claim would be challenged.
Original abstract
Facial expression spotting, identifying periods where facial expressions occur in a video, is a significant yet challenging task in facial expression analysis. The issues of irrelevant facial movements and the challenge of detecting subtle motions in micro-expressions remain unresolved, hindering accurate expression spotting. In this paper, we propose an efficient framework for facial expression spotting. First, we propose a Compact Sliding-Window-based Multi-temporal-Resolution Optical flow (CSW-MRO) feature, which calculates multi-temporal-resolution optical flow of the input image sequence within compact sliding windows. The window length is tailored to perceive complete micro-expressions and distinguish between general macro- and micro-expressions. CSW-MRO can effectively reveal subtle motions while avoiding the optical flow being dominated by head movements. Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes spatio-temporal relationships of the CSW-MRO features for accurate frame-level probability estimation. In SpotFormer, we use the proposed Facial Local Graph Pooling (FLGP) operation and convolutional layers to extract multi-scale spatio-temporal features. We show the validity of the architecture of SpotFormer by comparing it with several model variants. Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions. Extensive experiments on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 show that our method outperforms state-of-the-art models, particularly in micro-expression spotting. Code is available at https://github.com/KinopioIsAllIn/SpotFormer.
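The abstract does not specify how frames are paired inside each window. As one possible reading of "multi-temporal-resolution optical flow within compact sliding windows", the sketch below enumerates, for every window, the frame pairs between which flow would be computed at several temporal offsets. The pairing rule (window start as reference frame), the stride of 1, and all names are assumptions for illustration. The optical flow estimator itself is deliberately left out.

```python
def csw_mro_pairs(num_frames, window_len, offsets):
    """Return a list of (window_start, [(ref, target), ...]) entries.
    Each window pairs its first frame with frames at the given temporal
    offsets. Assumed pairing scheme, not the paper's definition."""
    for d in offsets:
        if not 0 < d < window_len:
            raise ValueError("each offset must lie inside the window")
    windows = []
    for start in range(num_frames - window_len + 1):
        pairs = [(start, start + d) for d in offsets]
        windows.append((start, pairs))
    return windows


# Hypothetical 6-frame clip, 4-frame windows, short and long offsets.
for start, pairs in csw_mro_pairs(6, 4, [1, 3]):
    print(start, pairs)
```

Because every pair stays inside a short window, accumulated head motion between the two frames is bounded by the window length, which matches the motivation stated in the abstract.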
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CSW-MRO, a compact sliding-window multi-temporal-resolution optical flow feature with tailored window lengths to capture complete micro-expressions while mitigating head movement dominance, and SpotFormer, a multi-scale spatio-temporal Transformer using Facial Local Graph Pooling (FLGP), convolutional layers, and supervised contrastive learning for frame-level expression probability estimation. It claims this framework outperforms state-of-the-art models on SAMM-LV, CAS(ME)^2, and CAS(ME)^3, especially for micro-expression spotting, with code released publicly.
Significance. If the experimental claims hold, the work could advance micro-expression spotting by combining multi-resolution optical flow with transformer-based multi-scale encoding and contrastive learning to improve discriminability. The public code link is a positive factor for reproducibility.
major comments (1)
- [Abstract] The central claim of outperformance on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 (particularly in micro-expression spotting) is asserted without any quantitative results, metrics, baselines, ablations, error bars, dataset statistics, or statistical tests. This evidence is load-bearing for the claim but entirely absent from the provided manuscript.
minor comments (2)
- [Abstract] The window length is described as 'tailored' to perceive complete micro-expressions, but no values, selection method, or per-dataset details are supplied.
- [Abstract] The claim that validity is shown 'by comparing it with several model variants' comes with no description of those variants or of the comparison protocol.
Simulated Author's Rebuttal
We thank the referee for their comment on the abstract. We provide our response below.
Point-by-point responses
- Referee: [Abstract] The central claim of outperformance on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 (particularly in micro-expression spotting) is asserted without any quantitative results, metrics, baselines, ablations, error bars, dataset statistics, or statistical tests. This evidence is load-bearing for the claim but entirely absent from the provided manuscript.
  Authors: We agree that the abstract claims outperformance without including quantitative evidence, and that such details are not present in the provided manuscript, which consists solely of the abstract. In the revised version of the manuscript, we will modify the abstract to incorporate key quantitative results from our experiments on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 to better support the central claims.
  Revision: yes
Circularity Check
No significant circularity identified
Full rationale
Only the abstract is available and it contains no equations, derivations, fitted parameters, or self-citations. The text describes the CSW-MRO feature, SpotFormer with FLGP, and supervised contrastive learning as novel components, then states experimental outperformance without showing any internal reduction of results to inputs by construction. No load-bearing step can be quoted or exhibited as circular.
Axiom & Free-Parameter Ledger
invented entities (3)
- CSW-MRO feature: no independent evidence
- SpotFormer architecture: no independent evidence
- Facial Local Graph Pooling (FLGP): no independent evidence