
arxiv: 2407.20799 · v3 · submitted 2024-07-30 · 💻 cs.CV

SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting

Pith reviewed 2026-05-23 22:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords facial expression spotting · micro-expression · optical flow · spatio-temporal transformer · contrastive learning · SpotFormer · CSW-MRO feature

The pith

A new optical flow feature inside compact sliding windows combined with a multi-scale spatio-temporal transformer improves spotting of facial expressions, especially micro-expressions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework to locate periods of facial expression in video by first computing a Compact Sliding-Window-based Multi-temporal-Resolution Optical flow feature. This feature extracts motion information at several time scales inside short windows sized to capture full micro-expressions while limiting interference from head motion. The resulting features are passed to SpotFormer, a transformer that jointly models spatial and temporal relations at multiple scales through Facial Local Graph Pooling and convolutional layers, trained with supervised contrastive learning to sharpen distinctions among expression categories. Experiments on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 report higher accuracy than prior methods, with the largest gains on micro-expressions. If the approach holds, automated systems could locate fleeting expression events more reliably without being overwhelmed by unrelated motion.
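
As a rough illustration of the multi-temporal-resolution idea, the sketch below lists the frame pairs on which an optical flow estimator might be run inside one compact window. The function names, the choice of anchoring every pair at the window's first frame, and the stride values are assumptions made for illustration only; the abstract does not say how the paper pairs frames or what window length it uses.

```python
def window_frame_pairs(start, window_len, strides, num_frames):
    """Return (reference, target) frame index pairs for one sliding window.

    Hypothetical scheme: every pair starts at the window's first frame and
    ends `stride` frames later, so several temporal resolutions are covered
    while each pair stays inside the short window.
    """
    end = min(start + window_len, num_frames)  # exclusive upper bound
    pairs = []
    for stride in strides:
        target = start + stride
        if target < end:  # drop pairs that would leave the window or the video
            pairs.append((start, target))
    return pairs


def all_windows(num_frames, window_len, strides, hop=1):
    """Slide the window over the video and collect the pairs per window start."""
    return {
        start: window_frame_pairs(start, window_len, strides, num_frames)
        for start in range(0, num_frames, hop)
    }
```

Keeping every pair inside a short window is one way to limit how much slow head drift can accumulate between the two frames, which is the motivation the abstract gives for the compact window.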

Core claim

We propose the CSW-MRO feature, which calculates multi-temporal-resolution optical flow of the input image sequence within compact sliding windows whose length is tailored to perceive complete micro-expressions and distinguish between general macro- and micro-expressions. We propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes spatio-temporal relationships of the CSW-MRO features for accurate frame-level probability estimation, using the Facial Local Graph Pooling operation and convolutional layers to extract multi-scale features. Supervised contrastive learning is introduced into SpotFormer to enhance discriminability between different types of facial-ex

What carries the argument

SpotFormer, the multi-scale spatio-temporal Transformer that encodes relationships in CSW-MRO features via Facial Local Graph Pooling and convolutional layers to produce frame-level expression probabilities.
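
Frame-level probabilities still have to be turned into expression intervals before they count as spotting results. The sketch below shows one simple way to do that by thresholding and grouping consecutive frames. The function name, the threshold of 0.5, and the minimum-length filter are illustrative assumptions; the abstract does not describe the post-processing the paper actually uses.

```python
def probabilities_to_intervals(probs, threshold=0.5, min_len=1):
    """Group consecutive frames with prob >= threshold into intervals.

    Returns a list of (onset, offset) frame indices, both inclusive.
    Runs shorter than `min_len` frames are discarded.
    """
    intervals = []
    onset = None  # index where the current above-threshold run started
    for i, p in enumerate(probs):
        if p >= threshold and onset is None:
            onset = i
        elif p < threshold and onset is not None:
            if i - onset >= min_len:
                intervals.append((onset, i - 1))
            onset = None
    # Close a run that is still open at the end of the sequence.
    if onset is not None and len(probs) - onset >= min_len:
        intervals.append((onset, len(probs) - 1))
    return intervals
```

For example, `probabilities_to_intervals([0.1, 0.7, 0.8, 0.2, 0.9])` yields two intervals, and raising `min_len` to 2 drops the single-frame run at the end.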

If this is right

  • Frame-level probability estimates become more accurate for both macro- and micro-expressions.
  • Micro-expression spotting performance exceeds that of prior state-of-the-art models on SAMM-LV, CAS(ME)^2, and CAS(ME)^3.
  • Supervised contrastive learning increases separation between expression categories.
  • Model variants confirm that the combination of multi-scale encoding and the proposed pooling operation contributes to the reported gains.
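
To make the contrastive-learning point concrete, here is a minimal pure-Python sketch of the commonly used supervised contrastive (SupCon) loss, in which embeddings that share a label are pulled together and all others are pushed apart. It assumes L2-normalised embeddings and a temperature of 0.1; whether SpotFormer uses this exact variant, which features it is applied to, and which temperature it uses are not stated in the abstract.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have positives.

    embeddings: list of equal-length vectors, assumed L2-normalised.
    labels: one class label per embedding.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-class partner contributes nothing
        logits = {
            j: dot(embeddings[i], embeddings[j]) / temperature
            for j in range(n) if j != i
        }
        # Log-sum-exp over all other samples, shifted by the max for stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

The loss is small when same-class embeddings point in the same direction and large when they are aligned with samples of another class, which is the sense in which it "increases separation between expression categories".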

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The sliding-window optical-flow design could be adapted for spotting other brief, low-amplitude events in video such as small gestures or anomalies.
  • Performance on datasets with different frame rates or head-motion statistics may require re-tuning of window lengths.
  • Adding audio or head-pose normalization could address remaining cases of irrelevant facial movements not handled by the current optical-flow step.
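
The re-tuning point can be made concrete with simple arithmetic: a window meant to cover a complete micro-expression has to span more frames at a higher frame rate. The sketch below assumes the commonly cited upper bound of about 0.5 s for micro-expression duration and uses 30 fps and 200 fps purely as example values; the paper's actual window lengths are not given in the abstract.

```python
import math


def window_length_frames(fps, max_duration_s=0.5):
    """Number of frames needed to span an event of at most `max_duration_s` seconds."""
    return math.ceil(fps * max_duration_s)


# Example values only: a 0.5 s event needs 15 frames at 30 fps
# but 100 frames at 200 fps.
low_rate = window_length_frames(30)
high_rate = window_length_frames(200)
```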

Load-bearing premise

The multi-temporal-resolution optical flow computed inside compact sliding windows of tailored length will reveal subtle micro-expression motions while preventing head movements from dominating the signal on the evaluated datasets.

What would settle it

If the method shows no accuracy gain over baselines on a new video set containing larger head movements or micro-expressions whose durations fall outside the chosen window lengths, the central performance claim would be challenged.
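
Such a test needs an interval-level score. A minimal sketch follows, assuming the protocol common in expression-spotting benchmarks: a predicted interval counts as a true positive when its intersection-over-union (IoU) with an unmatched ground-truth interval reaches 0.5, and the overall result is reported as F1. The function names and the greedy matching order are illustrative assumptions, and the abstract does not confirm which protocol the paper follows.

```python
def interval_iou(a, b):
    """IoU of two inclusive frame intervals given as (onset, offset)."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def spotting_f1(predicted, ground_truth, iou_threshold=0.5):
    """F1 score where each ground-truth interval can be matched at most once."""
    matched = set()
    tp = 0
    for p in predicted:
        for k, g in enumerate(ground_truth):
            if k not in matched and interval_iou(p, g) >= iou_threshold:
                matched.add(k)
                tp += 1
                break
    fp = len(predicted) - tp      # predictions with no matching ground truth
    fn = len(ground_truth) - tp   # ground-truth intervals that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Comparing this score for the proposed method and for baselines on the new video set would show whether the reported gain survives larger head movements or out-of-range durations.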

Original abstract

Facial expression spotting, identifying periods where facial expressions occur in a video, is a significant yet challenging task in facial expression analysis. The issues of irrelevant facial movements and the challenge of detecting subtle motions in micro-expressions remain unresolved, hindering accurate expression spotting. In this paper, we propose an efficient framework for facial expression spotting. First, we propose a Compact Sliding-Window-based Multi-temporal-Resolution Optical flow (CSW-MRO) feature, which calculates multi-temporal-resolution optical flow of the input image sequence within compact sliding windows. The window length is tailored to perceive complete micro-expressions and distinguish between general macro- and micro-expressions. CSW-MRO can effectively reveal subtle motions while avoiding the optical flow being dominated by head movements. Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes spatio-temporal relationships of the CSW-MRO features for accurate frame-level probability estimation. In SpotFormer, we use the proposed Facial Local Graph Pooling (FLGP) operation and convolutional layers to extract multi-scale spatio-temporal features. We show the validity of the architecture of SpotFormer by comparing it with several model variants. Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions. Extensive experiments on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 show that our method outperforms state-of-the-art models, particularly in micro-expression spotting. Code is available at https://github.com/KinopioIsAllIn/SpotFormer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes CSW-MRO, a compact sliding-window multi-temporal-resolution optical flow feature with tailored window lengths to capture complete micro-expressions while mitigating head movement dominance, and SpotFormer, a multi-scale spatio-temporal Transformer using Facial Local Graph Pooling (FLGP), convolutional layers, and supervised contrastive learning for frame-level expression probability estimation. It claims this framework outperforms state-of-the-art models on SAMM-LV, CAS(ME)^2, and CAS(ME)^3, especially for micro-expression spotting, with code released publicly.

Significance. If the experimental claims hold, the work could advance micro-expression spotting by combining multi-resolution optical flow with transformer-based multi-scale encoding and contrastive learning to improve discriminability. The public code link is a positive factor for reproducibility.

major comments (1)
  1. [Abstract] The central claim of outperformance on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 (particularly in micro-expression spotting) is asserted without any quantitative results, metrics, baselines, ablations, error bars, dataset statistics, or statistical tests. This evidence is load-bearing for the claim but entirely absent from the provided manuscript.
minor comments (2)
  1. [Abstract] The window length is described as 'tailored' to perceive complete micro-expressions, but no values, selection method, or per-dataset details are supplied.
  2. [Abstract] The claim that validity is shown 'by comparing it with several model variants' comes with no description of those variants or of the comparison protocol.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their comment on the abstract. We provide our response below.

Point-by-point responses
  1. Referee: [Abstract] The central claim of outperformance on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 (particularly in micro-expression spotting) is asserted without any quantitative results, metrics, baselines, ablations, error bars, dataset statistics, or statistical tests. This evidence is load-bearing for the claim but entirely absent from the provided manuscript.

    Authors: We agree with the observation that the abstract makes claims of outperformance without including quantitative evidence, and such details are not present in the provided manuscript, which consists solely of the abstract. In the revised version of the manuscript, we will modify the abstract to incorporate key quantitative results from our experiments on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 to better support the central claims.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

Only the abstract is available and it contains no equations, derivations, fitted parameters, or self-citations. The text describes the CSW-MRO feature, SpotFormer with FLGP, and supervised contrastive learning as novel components, then states experimental outperformance without showing any internal reduction of results to inputs by construction. No load-bearing step can be quoted or exhibited as circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

Abstract-only review limits visibility into free parameters or background axioms; the paper introduces three new named components (CSW-MRO, SpotFormer, FLGP) whose internal hyperparameters are not disclosed.

invented entities (3)
  • CSW-MRO feature (no independent evidence)
    purpose: Compute multi-temporal-resolution optical flow inside compact sliding windows to capture micro-expressions without head-movement dominance
    Newly proposed in the abstract as the input representation.
  • SpotFormer architecture (no independent evidence)
    purpose: Multi-scale spatio-temporal transformer that encodes CSW-MRO features for frame-level probability estimation
    Newly proposed model in the abstract.
  • Facial Local Graph Pooling (FLGP) (no independent evidence)
    purpose: Extract multi-scale spatio-temporal features inside SpotFormer
    New operation introduced in the abstract.

pith-pipeline@v0.9.0 · 5785 in / 1269 out tokens · 17600 ms · 2026-05-23T22:17:42.827982+00:00 · methodology
