EPIR: An Efficient Patch Tokenization, Integration and Representation Framework for Micro-expression Recognition
Pith reviewed 2026-05-10 16:59 UTC · model grok-4.3
The pith
The EPIR framework reduces the number of tokens processed by Transformer models to improve both the accuracy and the efficiency of micro-expression recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose the EPIR framework, which first uses a dual norm shifted tokenization module, implemented by refined spatial transformation and dual norm projection, to learn spatial relationships. Then a token integration module integrates partial tokens among cascaded Transformer blocks to reduce the token count without information loss. Finally, a discriminative token extractor improves attention to reduce focus on self-tokens and uses dynamic token selection to capture more discriminative representations, resulting in performance gains over the state of the art on several datasets.
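As a rough illustration of what "dynamic token selection" can mean in practice, the sketch below keeps the highest-scoring tokens and discards the rest. It is a generic top-k filter, not the paper's DTSM: the function name `select_top_tokens`, the importance scores, and the value of `keep` are all hypothetical.

```python
def select_top_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    # Rank token indices by score, highest first, and cut to the budget.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    # Re-sort the surviving indices so spatial order is not scrambled.
    return [tokens[i] for i in sorted(ranked)]

# Hypothetical tokens and importance scores, for illustration only.
tokens = ["t0", "t1", "t2", "t3"]
scores = [0.10, 0.70, 0.05, 0.15]
selected = select_top_tokens(tokens, scores, keep=2)
print(selected)
```

How the scores are produced is the part that makes a selector "dynamic"; the source does not specify this, so the scores here are simply given as input.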
What carries the argument
The combination of dual norm shifted tokenization, token integration across blocks, and dynamic token selection in the discriminative extractor to manage tokens efficiently.
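A rough cost model shows why token count is the lever here. The sketch below counts multiply-accumulate operations for the two matrix products in single-head self-attention (the score matrix and the attention-weighted sum of values), ignoring projections and softmax. The token counts and embedding width are illustrative assumptions, not values from the paper.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Multiply-accumulates for Q @ K^T plus A @ V (one head, no projections)."""
    scores = num_tokens * num_tokens * dim        # (N x d) @ (d x N)
    weighted_sum = num_tokens * num_tokens * dim  # (N x N) @ (N x d)
    return scores + weighted_sum

# Hypothetical sizes chosen only for illustration.
full = attention_macs(num_tokens=196, dim=64)
reduced = attention_macs(num_tokens=98, dim=64)
print(full, reduced, full / reduced)
```

Because this term grows with the square of the token count, halving the tokens cuts it by a factor of four, which is the motivation for merging or dropping tokens between blocks.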
If this is right
- Achieves up to a 9.6% improvement in UF1 on the CAS(ME)^3 dataset.
- Achieves a 4.58% improvement in UAR on the SMIC dataset.
- Lowers computational complexity compared to standard Transformer approaches.
- Enables effective representation learning on small-scale micro-expression datasets.
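The two metrics quoted above, UF1 (unweighted F1) and UAR (unweighted average recall), are per-class scores averaged with equal weight per class, so rare classes count as much as common ones. The sketch below computes both from label lists; the labels are made up for illustration and are not drawn from any of the datasets.

```python
from collections import Counter

def uf1_uar(y_true, y_pred):
    """Return (UF1, UAR): unweighted means of per-class F1 and per-class recall."""
    classes = sorted(set(y_true))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1  # missed an instance of the true class
            fp[p] += 1  # wrongly predicted class p
    f1s, recalls = [], []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1s.append(2 * tp[c] / denom if denom else 0.0)
        recalls.append(tp[c] / (tp[c] + fn[c]))
    return sum(f1s) / len(classes), sum(recalls) / len(classes)

# Toy, imbalanced labels (illustrative only).
y_true = ["neg", "neg", "neg", "neg", "pos", "sur"]
y_pred = ["neg", "neg", "neg", "pos", "pos", "neg"]
uf1, uar = uf1_uar(y_true, y_pred)
print(round(uf1, 4), round(uar, 4))
```

On this toy example plain accuracy is 4/6, while UF1 and UAR are lower because the single "sur" sample is misclassified and that class contributes a zero to both averages.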
Where Pith is reading between the lines
- The token reduction techniques could be applied to other video-based recognition tasks facing similar data scarcity issues.
- This framework might inspire hybrid models that combine efficiency modules with other backbone architectures.
- Practical deployment in real-time systems for emotion-aware interfaces becomes more feasible.
Load-bearing premise
The proposed tokenization, integration, and selection steps do not cause loss of essential information required for accurate micro-expression classification.
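One simple way to probe this premise is to measure how much of the original token set can be recovered after a reduction step. The sketch below uses pairwise averaging as a stand-in reduction; this is an assumption made for illustration and is not the paper's token integration module. All function names are hypothetical.

```python
def merge_pairs(tokens):
    """Average each adjacent pair of token vectors (assumes an even token count)."""
    return [
        [(a + b) / 2 for a, b in zip(tokens[i], tokens[i + 1])]
        for i in range(0, len(tokens), 2)
    ]

def unmerge_pairs(merged):
    """Naive inverse: repeat every merged token twice."""
    return [list(tok) for tok in merged for _ in range(2)]

def relative_error(original, reconstructed):
    """Squared reconstruction error divided by the squared norm of the original."""
    err = sum(
        (x - y) ** 2
        for o, r in zip(original, reconstructed)
        for x, y in zip(o, r)
    )
    norm = sum(x ** 2 for o in original for x in o)
    return err / norm

# Toy tokens: the first pair is identical, the second pair differs.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
merged = merge_pairs(tokens)
loss = relative_error(tokens, unmerge_pairs(merged))
print(merged, loss)
```

Merging the identical pair costs nothing, while merging the differing pair leaves a nonzero error. A claim of reduction "without information loss" would correspond to an error near zero under whatever inverse the method admits, or to unchanged downstream accuracy.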
What would settle it
A controlled experiment on the same datasets in which the dual norm shifted tokenization (DNSPT) module, the token integration module, or the dynamic token selection module (DTSM) is removed or replaced with standard token handling, showing whether the claimed improvements disappear.
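Such an experiment amounts to a small ablation grid. The sketch below enumerates every on/off combination of the three modules; the configuration keys are hypothetical labels, and the training and evaluation step is deliberately left out because the source gives no details about it.

```python
from itertools import product

# Hypothetical switch names for the three proposed components.
MODULES = ("DNSPT", "token_integration", "DTSM")

def ablation_configs(modules=MODULES):
    """Yield every on/off combination of the listed modules as a dict."""
    for flags in product((False, True), repeat=len(modules)):
        yield dict(zip(modules, flags))

configs = list(ablation_configs())
for cfg in configs:
    enabled = [name for name, on in cfg.items() if on] or ["baseline"]
    print("+".join(enabled))
```

With three modules this gives eight runs per dataset, from the plain baseline to the full model, which is enough to attribute a gain to a single component or to an interaction between components.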
Original abstract
Micro-expression recognition can obtain the real emotion of the individual at the current moment. Although deep learning-based methods, especially Transformer-based methods, have achieved impressive results, these methods have high computational complexity due to the large number of tokens in the multi-head self-attention. In addition, the existing micro-expression datasets are small-scale, which makes it difficult for Transformer-based models to learn effective micro-expression representations. Therefore, we propose a novel Efficient Patch tokenization, Integration and Representation framework (EPIR), which can balance high recognition performance and low computational complexity. Specifically, we first propose a dual norm shifted tokenization (DNSPT) module to learn the spatial relationship between neighboring pixels in the face region, which is implemented by a refined spatial transformation and dual norm projection. Then, we propose a token integration module to integrate partial tokens among multiple cascaded Transformer blocks, thereby reducing the number of tokens without information loss. Furthermore, we design a discriminative token extractor, which first improves the attention in the Transformer block to reduce the unnecessary focus of the attention calculation on self-tokens, and uses the dynamic token selection module (DTSM) to select key tokens, thereby capturing more discriminative micro-expression representations. We conduct extensive experiments on four popular public datasets (i.e., CASME II, SAMM, SMIC, and CAS(ME)3). The experimental results show that our method achieves significant performance gains over the state-of-the-art methods, such as 9.6% improvement on the CAS(ME)$^3$ dataset in terms of UF1 and 4.58% improvement on the SMIC dataset in terms of UAR metric.
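The abstract does not say how attention is changed to reduce focus on self-tokens. One common way to get that effect is to mask the diagonal of the attention score matrix before the softmax, so that no token can attend to itself. The sketch below shows that generic technique as an assumption; it should not be read as the EPIR implementation.

```python
import math

def attention_weights(scores, mask_diagonal=False):
    """Row-wise softmax over a square score matrix, optionally excluding self-tokens."""
    weights = []
    for i, row in enumerate(scores):
        # Setting a logit to -inf gives that position exactly zero weight.
        logits = [
            -math.inf if (mask_diagonal and i == j) else s
            for j, s in enumerate(row)
        ]
        peak = max(logits)  # subtract the row maximum for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy scores where each token matches itself most strongly.
scores = [[4.0, 1.0, 1.0], [1.0, 4.0, 1.0], [1.0, 1.0, 4.0]]
plain = attention_weights(scores)
masked = attention_weights(scores, mask_diagonal=True)
print(plain[0], masked[0])
```

In the unmasked case each token spends most of its attention on itself; with the diagonal masked, that weight is redistributed to the other tokens. The sketch assumes at least two tokens, since a single masked row would have no finite logit left.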
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EPIR, an efficient Transformer-based framework for micro-expression recognition on small datasets. It proposes a dual norm shifted tokenization (DNSPT) module using refined spatial transformation and dual norm projection, a token integration module to reduce token count across cascaded blocks without information loss, and a discriminative token extractor incorporating improved attention and a dynamic token selection module (DTSM). Experiments on CASME II, SAMM, SMIC, and CAS(ME)^3 report gains over prior SOTA, including 9.6% UF1 on CAS(ME)^3 and 4.58% UAR on SMIC.
Significance. If the performance gains prove robust and causally linked to the proposed modules, the work would advance efficient micro-expression recognition by mitigating Transformer token overhead while addressing small-dataset challenges, offering a practical balance of accuracy and complexity on public benchmarks.
major comments (2)
- [Abstract] Abstract: The claim that the token integration module reduces tokens 'without information loss' is unsupported by any quantitative check (e.g., reconstruction error, mutual information, or ablation comparing full vs. integrated tokens), which is load-bearing for the efficiency-without-sacrifice central claim.
- [Experimental Results] Experimental Results (as summarized): Concrete gains such as 9.6% UF1 on CAS(ME)^3 and 4.58% UAR on SMIC are reported without ablation studies, error bars, statistical significance tests, or details on data splits/hyperparameter selection, leaving open the possibility that gains arise from overfitting or unstated implementation choices on small datasets rather than the DNSPT/token integration/DTSM components.
minor comments (1)
- [Abstract] Abstract: Dataset name appears as 'CAS(ME)3' without the required superscript for consistency with 'CAS(ME)^3'.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review of our manuscript on the EPIR framework. We appreciate the feedback on strengthening the claims regarding information preservation and experimental robustness. Below we address each major comment point by point, with commitments to revisions that enhance the paper without altering its core contributions.
- Referee: [Abstract] Abstract: The claim that the token integration module reduces tokens 'without information loss' is unsupported by any quantitative check (e.g., reconstruction error, mutual information, or ablation comparing full vs. integrated tokens), which is load-bearing for the efficiency-without-sacrifice central claim.
  Authors: We acknowledge that the manuscript does not provide explicit quantitative verification, such as reconstruction error or mutual information metrics, to support the 'without information loss' phrasing for the token integration module. The module is motivated by integrating partial tokens across cascaded Transformer blocks to maintain essential spatial and discriminative features while reducing overhead, as described in the method section. To directly address this, we will add ablation studies comparing full versus integrated token configurations and include quantitative checks for information preservation in the revised version.
  Revision: yes
- Referee: [Experimental Results] Experimental Results (as summarized): Concrete gains such as 9.6% UF1 on CAS(ME)^3 and 4.58% UAR on SMIC are reported without ablation studies, error bars, statistical significance tests, or details on data splits/hyperparameter selection, leaving open the possibility that gains arise from overfitting or unstated implementation choices on small datasets rather than the DNSPT/token integration/DTSM components.
  Authors: We agree that additional rigor is needed to substantiate the performance gains on small-scale micro-expression datasets. The reported results follow established protocols for CASME II, SAMM, SMIC, and CAS(ME)^3, but the current version lacks module-specific ablations, error bars, and statistical tests. In the revision, we will include comprehensive ablations isolating DNSPT, token integration, and DTSM; report means and standard deviations from multiple runs with error bars; perform statistical significance tests; and expand details on data splits and hyperparameter selection to confirm the gains stem from the proposed components.
  Revision: yes
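The reporting promised in the second response (means, standard deviations, and a significance test over repeated runs) could look like the sketch below. It uses a paired sign-flip permutation test as one reasonable choice; the rebuttal does not name a specific test. The score lists are synthetic placeholders, not results from the paper, and the function name is hypothetical.

```python
import random
import statistics

def paired_permutation_pvalue(a, b, trials=10000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis, each paired difference is equally likely
        # to have either sign, so flip signs at random.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)

# Synthetic placeholder scores for five paired runs (NOT results from the paper).
proposed = [0.71, 0.69, 0.73, 0.70, 0.72]
baseline = [0.66, 0.68, 0.65, 0.67, 0.66]

mean_p, std_p = statistics.mean(proposed), statistics.stdev(proposed)
mean_b, std_b = statistics.mean(baseline), statistics.stdev(baseline)
p_value = paired_permutation_pvalue(proposed, baseline)
print(f"proposed {mean_p:.3f} +/- {std_p:.3f}, baseline {mean_b:.3f} +/- {std_b:.3f}, p ~ {p_value:.3f}")
```

With only five paired runs, the smallest two-sided p-value this test can reach is 2/32 = 0.0625, even when the proposed method wins every run. Reaching a conventional 0.05 threshold therefore requires more runs or per-fold pairing.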
Circularity Check
No circularity; empirical architecture validated on external benchmarks
The paper proposes an empirical CNN-Transformer hybrid architecture (DNSPT module, token integration, DTSM) for micro-expression recognition and reports performance numbers on four public datasets. No equations, uniqueness theorems, or self-citations are used to derive the claimed UF1/UAR gains; the improvements are presented strictly as experimental outcomes. The central claim therefore does not reduce to any fitted parameter or prior self-result by construction.