arxiv: 2604.08106 · v1 · submitted 2026-04-09 · 💻 cs.CV

EPIR: An Efficient Patch Tokenization, Integration and Representation Framework for Micro-expression Recognition

Pith reviewed 2026-05-10 16:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords micro-expression recognition · efficient transformer · patch tokenization · token integration · discriminative token selection · facial analysis · attention mechanism · small dataset learning

The pith

The EPIR framework cuts the number of tokens a Transformer processes to improve both the accuracy and the efficiency of micro-expression recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to solve the problem of high computational demands in Transformer-based micro-expression recognition, which arises from processing many tokens, along with difficulties in learning from small datasets. It does this through a dual norm shifted tokenization module that learns spatial pixel relationships, a token integration module that combines tokens across blocks to cut their number, and a discriminative token extractor that uses improved attention and dynamic selection of important tokens. A sympathetic reader would find this relevant because micro-expressions provide insight into genuine emotions at the moment they occur, and an efficient method could make such recognition more accessible for applications in various fields. If the approach succeeds, it would demonstrate that careful token management can maintain or enhance performance despite reduced complexity.
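To make the cost argument concrete: the attention score matrix in a self-attention layer has one entry per query-key pair, so its size grows with the square of the token count. The sketch below is a back-of-envelope calculation only. The 196-token starting count is a hypothetical value (a 14×14 patch grid), not a figure from the paper, and `relative_attention_cost` is an illustrative helper name. The integration rates are the ones named in the Figure 6 caption.

```python
def relative_attention_cost(integration_rate: float, num_tokens: int = 196) -> float:
    """Fraction of query-key pairs left after removing a share of the tokens.

    num_tokens=196 is a hypothetical starting count, not a value from the paper.
    """
    kept = round(num_tokens * (1.0 - integration_rate))
    return (kept * kept) / (num_tokens * num_tokens)


if __name__ == "__main__":
    # Integration rates taken from the Figure 6 caption.
    for rate in (0.0, 0.3, 0.6, 0.8):
        share = relative_attention_cost(rate)
        print(f"integration rate {rate:.0%}: {share:.1%} of the original pair count")
```

Only the score matrix scales quadratically. The linear projections and feed-forward layers scale linearly with the token count, so end-to-end savings would be smaller than this ratio suggests.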

Core claim

We propose the EPIR framework which first uses a dual norm shifted tokenization module implemented by refined spatial transformation and dual norm projection to learn spatial relationships. Then a token integration module integrates partial tokens among cascaded Transformer blocks to reduce token count without information loss. Finally a discriminative token extractor improves attention to reduce focus on self-tokens and uses dynamic token selection to capture more discriminative representations, resulting in performance gains over state-of-the-art on several datasets.
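As a rough illustration of the first step, the sketch below follows the general idea of shifted patch tokenization: the input is concatenated with diagonally shifted copies of itself so that each patch also sees its neighbors, and normalization is applied both before and after the linear projection. This is an assumption about what a dual norm shifted tokenization could look like, not the authors' implementation. The paper's refined spatial transformation is not specified on this page, and `shift`, `layer_norm`, and `shifted_patch_tokens` are hypothetical names.

```python
import numpy as np


def shift(img, dy, dx):
    """Shift an H x W x C array by (dy, dx) pixels, filling vacated pixels with zeros."""
    h, w, _ = img.shape
    out = np.zeros_like(img)
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out


def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned scale or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def shifted_patch_tokens(img, patch, proj):
    """Assumed pipeline: shift, concatenate, patchify, norm, project, norm."""
    h, w, c = img.shape
    half = patch // 2
    # Original view plus four diagonal shifts of half a patch (an assumption).
    views = [img] + [shift(img, dy, dx)
                     for dy in (-half, half) for dx in (-half, half)]
    stacked = np.concatenate(views, axis=-1)            # (H, W, 5C)
    gh, gw = h // patch, w // patch
    patches = stacked[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, 5 * c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * 5 * c)
    # "Dual norm" is read here as one norm before and one after the projection.
    return layer_norm(layer_norm(patches) @ proj)
```

With a 28×28×2 input, a patch size of 7, and a projection of shape (490, 16), this returns 16 tokens of width 16. In a trained model `proj` would be a learned weight matrix rather than a fixed array.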

What carries the argument

The combination of dual norm shifted tokenization, token integration across blocks, and dynamic token selection in the discriminative extractor to manage tokens efficiently.
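One simple way to picture token integration is to keep the highest-scoring tokens and fold the rest into a single aggregate token. The sketch below does exactly that with a score-weighted average. It is an illustrative stand-in, not the module described in the paper: the scoring rule, the single fused token, and the name `integrate_tokens` are all assumptions.

```python
import numpy as np


def integrate_tokens(tokens, scores, integration_rate):
    """Fold the lowest-scoring share of tokens into one weighted-average token.

    tokens: (N, d) array; scores: (N,) strictly positive importance scores
    (how scores are obtained is left open here); integration_rate: share of
    tokens to fold, between 0 and 1.
    """
    n = tokens.shape[0]
    n_fold = round(n * integration_rate)
    if n_fold == 0:
        return tokens.copy()
    order = np.argsort(scores)                 # ascending: least important first
    fold_idx = order[:n_fold]
    keep_idx = np.sort(order[n_fold:])         # preserve the original token order
    weights = scores[fold_idx] / scores[fold_idx].sum()
    fused = (weights[:, None] * tokens[fold_idx]).sum(axis=0, keepdims=True)
    return np.concatenate([tokens[keep_idx], fused], axis=0)
```

With 10 tokens and an integration rate of 0.6, the output has 5 tokens: 4 kept plus 1 fused. Averaging of this kind is lossy in general, which is why any claim of reduction without information loss needs a direct measurement.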

If this is right

  • Achieves up to a 9.6% improvement in UF1 over prior state-of-the-art methods on the CAS(ME)^3 dataset.
  • Achieves a 4.58% improvement in UAR over prior state-of-the-art methods on the SMIC dataset.
  • Lowers computational complexity compared to standard Transformer approaches.
  • Enables effective representation learning on small-scale micro-expression datasets.
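UF1 and UAR are class-balanced metrics: UF1 is the unweighted mean of per-class F1 scores, and UAR is the unweighted mean of per-class recall, so a rare class counts as much as a common one. A minimal sketch, assuming integer class labels; the function name `uf1_uar` and the toy labels are illustrative and are not data from the paper.

```python
import numpy as np


def uf1_uar(y_true, y_pred, num_classes):
    """Return (UF1, UAR): unweighted means of per-class F1 and per-class recall."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    f1_scores, recalls = [], []
    for c in range(num_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        f1_denominator = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0 by convention here.
        f1_scores.append(2 * tp / f1_denominator if f1_denominator else 0.0)
        recalls.append(tp / (tp + fn) if (tp + fn) else 0.0)
    return float(np.mean(f1_scores)), float(np.mean(recalls))


if __name__ == "__main__":
    # Toy labels: class 1 has a single sample, and it is misclassified.
    truth = [0, 0, 0, 0, 1, 2]
    preds = [0, 0, 0, 0, 0, 2]
    uf1, uar = uf1_uar(truth, preds, num_classes=3)
    print(f"UF1={uf1:.4f}  UAR={uar:.4f}")
```

In the toy example plain accuracy is 5/6, but UAR is only 2/3 because the single missed minority-class sample zeroes out one of three classes. That sensitivity is why these metrics are preferred on imbalanced micro-expression data.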

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The token reduction techniques could be applied to other video-based recognition tasks facing similar data scarcity issues.
  • This framework might inspire hybrid models that combine efficiency modules with other backbone architectures.
  • Practical deployment in real-time systems for emotion-aware interfaces becomes more feasible.

Load-bearing premise

The proposed tokenization, integration, and selection steps do not cause loss of essential information required for accurate micro-expression classification.

What would settle it

A controlled experiment on the same datasets in which the DNSPT, token integration, or DTSM module is removed or replaced with standard token handling, showing whether the claimed improvements disappear.
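Such an experiment amounts to a small grid over which modules are switched on. A sketch of enumerating that grid follows; the configuration keys and the `ablation_configs` helper are hypothetical, and the training and evaluation code that would consume each configuration is deliberately left out.

```python
from itertools import product

# Module names as used on this page; the dictionary keys are illustrative.
MODULES = ("DNSPT", "token_integration", "DTSM")


def ablation_configs():
    """Every on/off combination of the three modules, full model first."""
    return [dict(zip(MODULES, flags))
            for flags in product((True, False), repeat=len(MODULES))]


if __name__ == "__main__":
    for config in ablation_configs():
        enabled = [name for name, on in config.items() if on] or ["baseline"]
        print(" + ".join(enabled))
```

The grid has eight entries, from the full model down to a baseline with standard token handling. Running each with the same data splits and seeds is what lets a difference be attributed to a module.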

Figures

Figures reproduced from arXiv: 2604.08106 by Junbo Wang, Kun Hu, Liangyu Fu, Xuecheng Wu, Yining Zhu, Yuke Li.

Figure 1: Comparison of (a) our EPIR and (b) the previous Transformer-based micro-expression recognition …
Figure 2: The overall framework of our proposed EPIR. The left column shows the overall framework of …
Figure 3: Confusion matrices for the proposed EPIR on the composite database (SAMM, SMIC, CASME …
Figure 4: Confusion matrices for the proposed EPIR on the CAS(ME) …
Figure 5: Ablation experiments on the number of Transformer blocks, the two subfigures on the left are the …
Figure 6: Actual effect of token integration module on micro-expression samples. We take two micro-expression samples (first and third rows) and their corresponding optical flow feature maps (second and fourth rows), from left to right, with integration rates of 0%, 30%, 60%, and 80%, respectively.
Original abstract

Micro-expression recognition can obtain the real emotion of the individual at the current moment. Although deep learning-based methods, especially Transformer-based methods, have achieved impressive results, these methods have high computational complexity due to the large number of tokens in the multi-head self-attention. In addition, the existing micro-expression datasets are small-scale, which makes it difficult for Transformer-based models to learn effective micro-expression representations. Therefore, we propose a novel Efficient Patch tokenization, Integration and Representation framework (EPIR), which can balance high recognition performance and low computational complexity. Specifically, we first propose a dual norm shifted tokenization (DNSPT) module to learn the spatial relationship between neighboring pixels in the face region, which is implemented by a refined spatial transformation and dual norm projection. Then, we propose a token integration module to integrate partial tokens among multiple cascaded Transformer blocks, thereby reducing the number of tokens without information loss. Furthermore, we design a discriminative token extractor, which first improves the attention in the Transformer block to reduce the unnecessary focus of the attention calculation on self-tokens, and uses the dynamic token selection module (DTSM) to select key tokens, thereby capturing more discriminative micro-expression representations. We conduct extensive experiments on four popular public datasets (i.e., CASME II, SAMM, SMIC, and CAS(ME)3). The experimental results show that our method achieves significant performance gains over the state-of-the-art methods, such as 9.6% improvement on the CAS(ME)^3 dataset in terms of UF1 and 4.58% improvement on the SMIC dataset in terms of UAR metric.
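The abstract's discriminative token extractor has two parts: attention that spends less weight on each token's own position, and a selection step that keeps key tokens. The sketch below shows one plausible reading of each, using a single attention head. Masking the diagonal of the score matrix and ranking patch tokens by the attention they receive from a class token at index 0 are both assumptions, and the function names are hypothetical. The abstract does not say how EPIR implements either step.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_masked_attention(q, k, v):
    """Single-head attention in which no token may attend to itself."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    logits[np.diag_indices(n)] = -np.inf      # remove the self-token term
    weights = softmax(logits)
    return weights @ v, weights


def select_top_tokens(tokens, weights, k):
    """Keep the class token (assumed at index 0) plus the k patch tokens it attends to most."""
    cls_scores = weights[0, 1:]
    keep = np.sort(np.argsort(cls_scores)[-k:]) + 1   # +1 skips the class token
    return np.concatenate([tokens[:1], tokens[keep]], axis=0), keep
```

A fixed `k` is used here for brevity. A dynamic selection module would presumably choose the number or identity of kept tokens per input, which this sketch does not attempt.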

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces EPIR, an efficient Transformer-based framework for micro-expression recognition on small datasets. It proposes a dual norm shifted tokenization (DNSPT) module using refined spatial transformation and dual norm projection, a token integration module to reduce token count across cascaded blocks without information loss, and a discriminative token extractor incorporating improved attention and a dynamic token selection module (DTSM). Experiments on CASME II, SAMM, SMIC, and CAS(ME)^3 report gains over prior SOTA, including 9.6% UF1 on CAS(ME)^3 and 4.58% UAR on SMIC.

Significance. If the performance gains prove robust and causally linked to the proposed modules, the work would advance efficient micro-expression recognition by mitigating Transformer token overhead while addressing small-dataset challenges, offering a practical balance of accuracy and complexity on public benchmarks.

major comments (2)
  1. [Abstract] Abstract: The claim that the token integration module reduces tokens 'without information loss' is unsupported by any quantitative check (e.g., reconstruction error, mutual information, or ablation comparing full vs. integrated tokens), which is load-bearing for the efficiency-without-sacrifice central claim.
  2. [Experimental Results] Experimental Results (as summarized): Concrete gains such as 9.6% UF1 on CAS(ME)^3 and 4.58% UAR on SMIC are reported without ablation studies, error bars, statistical significance tests, or details on data splits/hyperparameter selection, leaving open the possibility that gains arise from overfitting or unstated implementation choices on small datasets rather than the DNSPT/token integration/DTSM components.
minor comments (1)
  1. [Abstract] Abstract: Dataset name appears as 'CAS(ME)3' without the required superscript for consistency with 'CAS(ME)^3'.
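Major comment 1 asks for a quantitative check such as reconstruction error. One minimal proxy is sketched below, under stated assumptions: every original token is replaced by its nearest token in the reduced set, and the relative error of that substitution is reported. The metric and the name `relative_reconstruction_error` are illustrative choices, not something the paper or the referee specifies.

```python
import numpy as np


def relative_reconstruction_error(full_tokens, reduced_tokens):
    """Relative Frobenius error when each full token is replaced by its nearest reduced token.

    full_tokens: (N, d) tokens before reduction.
    reduced_tokens: (M, d) tokens after reduction, M <= N.
    Returns 0.0 when every full token also appears in the reduced set.
    """
    diffs = full_tokens[:, None, :] - reduced_tokens[None, :, :]   # (N, M, d)
    distances = np.linalg.norm(diffs, axis=-1)                     # (N, M)
    nearest = reduced_tokens[distances.argmin(axis=1)]             # (N, d)
    return float(np.linalg.norm(full_tokens - nearest) / np.linalg.norm(full_tokens))
```

A value near zero would support the claim that little is lost. A large value would not by itself show that classification suffers, so a check like this complements an accuracy ablation and does not replace it.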

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript on the EPIR framework. We appreciate the feedback on strengthening the claims regarding information preservation and experimental robustness. Below we address each major comment point by point, with commitments to revisions that enhance the paper without altering its core contributions.

point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that the token integration module reduces tokens 'without information loss' is unsupported by any quantitative check (e.g., reconstruction error, mutual information, or ablation comparing full vs. integrated tokens), which is load-bearing for the efficiency-without-sacrifice central claim.

    Authors: We acknowledge that the manuscript does not provide explicit quantitative verification, such as reconstruction error or mutual information metrics, to support the 'without information loss' phrasing for the token integration module. The module is motivated by integrating partial tokens across cascaded Transformer blocks to maintain essential spatial and discriminative features while reducing overhead, as described in the method section. To directly address this, we will add ablation studies comparing full versus integrated token configurations and include quantitative checks for information preservation in the revised version. revision: yes

  2. Referee: [Experimental Results] Experimental Results (as summarized): Concrete gains such as 9.6% UF1 on CAS(ME)^3 and 4.58% UAR on SMIC are reported without ablation studies, error bars, statistical significance tests, or details on data splits/hyperparameter selection, leaving open the possibility that gains arise from overfitting or unstated implementation choices on small datasets rather than the DNSPT/token integration/DTSM components.

    Authors: We agree that additional rigor is needed to substantiate the performance gains on small-scale micro-expression datasets. The reported results follow established protocols for CASME II, SAMM, SMIC, and CAS(ME)^3, but the current version lacks module-specific ablations, error bars, and statistical tests. In the revision, we will include comprehensive ablations isolating DNSPT, token integration, and DTSM; report means and standard deviations from multiple runs with error bars; perform statistical significance tests; and expand details on data splits and hyperparameter selection to confirm the gains stem from the proposed components. revision: yes
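The reporting promised in the second response could take roughly the following shape: a mean and sample standard deviation over repeated runs, plus a paired test on per-run scores. The sketch uses a sign-flip permutation test, which is one reasonable choice among several and is not named in the rebuttal. The function names are hypothetical, and any scores passed in are the caller's own.

```python
import numpy as np


def summarize(scores):
    """Mean and sample standard deviation (ddof=1) over repeated runs."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean()), float(scores.std(ddof=1))


def paired_sign_flip_pvalue(scores_a, scores_b, num_resamples=10000, seed=0):
    """Two-sided Monte Carlo sign-flip test on paired per-run score differences.

    Pairs must share the same split or seed. Under the null hypothesis the sign
    of each difference is arbitrary, so signs are flipped at random.
    """
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(num_resamples, diffs.size))
    resampled = np.abs((signs * diffs).mean(axis=1))
    # Small tolerance so floating-point ties count as "at least as extreme".
    extreme = np.sum(resampled >= observed - 1e-12)
    # Add-one correction keeps the estimated p-value strictly positive.
    return float((extreme + 1) / (num_resamples + 1))
```

With only a handful of runs the smallest achievable p-value is bounded by the number of sign patterns (2 out of 2^n for n pairs), so a single-digit run count limits how strong the evidence can be.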

Circularity Check

0 steps flagged

No circularity; empirical architecture validated on external benchmarks

full rationale

The paper proposes an empirical Transformer-based architecture (DNSPT module, token integration, DTSM) for micro-expression recognition and reports performance numbers on four public datasets. No equations, uniqueness theorems, or self-citations are used to derive the claimed UF1/UAR gains; the improvements are presented strictly as experimental outcomes. The central claim therefore does not reduce to any fitted parameter or prior self-result by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the work relies on standard Transformer assumptions and empirical tuning of module hyperparameters.


