Investigating Gender Bias in Touch Biometrics
Pith reviewed 2026-06-27 11:09 UTC · model grok-4.3
The pith
Swipe authentication shows no significant gender differences in error rates across two datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Statistical tests (Kolmogorov-Smirnov, Mann-Whitney, and Wasserstein permutation) applied to false acceptance and false rejection rates from XGBoost and DenseNet classifiers found no significant gender differences in authentication error rates across almost all experimental settings on the BBMAS and ANTAL datasets, while XGBoost reached 92 percent and 94 percent accuracy respectively.
What carries the argument
Comparison of gender-grouped false acceptance and false rejection rates via Kolmogorov-Smirnov, Mann-Whitney, and Wasserstein permutation tests on swipe gesture features.
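As a rough illustration of how such a comparison might be run, the sketch below applies a permutation test to two groups of per-user error rates, using either the Kolmogorov-Smirnov statistic or the one-dimensional Wasserstein distance as the test statistic. It relies only on the Python standard library. The function names, the permutation count, and the sample values are assumptions made for illustration and are not taken from the paper, which may have used asymptotic versions of the KS and Mann-Whitney tests instead.

```python
import bisect
import random


def _ecdfs(a, b):
    """Yield (x, F_a(x), F_b(x)) for every x in the pooled, sorted support."""
    sa, sb = sorted(a), sorted(b)
    for x in sorted(set(sa) | set(sb)):
        fa = bisect.bisect_right(sa, x) / len(sa)
        fb = bisect.bisect_right(sb, x) / len(sb)
        yield x, fa, fb


def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    return max(abs(fa - fb) for _, fa, fb in _ecdfs(a, b))


def wasserstein_1d(a, b):
    """Area between the two empirical CDFs (1-D Wasserstein-1 distance)."""
    points = list(_ecdfs(a, b))
    return sum(
        abs(fa - fb) * (x_next - x)
        for (x, fa, fb), (x_next, _, _) in zip(points, points[1:])
    )


def permutation_p_value(a, b, statistic, n_perm=2000, seed=0):
    """Share of random relabelings whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # The +1 terms count the observed labeling itself, so p is never exactly 0.
    return (hits + 1) / (n_perm + 1)


# Hypothetical per-user false rejection rates, made up for illustration.
male_frr = [0.06, 0.08, 0.07, 0.10, 0.05, 0.09]
female_frr = [0.07, 0.09, 0.06, 0.08, 0.11]
p_ks = permutation_p_value(male_frr, female_frr, ks_statistic)
p_w = permutation_p_value(male_frr, female_frr, wasserstein_1d)
```

A permutation test makes no distributional assumption beyond exchangeability under the null, which is one reason it pairs naturally with a distance such as Wasserstein that has no simple closed-form null distribution.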
If this is right
- Swipe-based systems can deliver over 90 percent accuracy while maintaining similar error rates for both genders.
- Behavioral biometric authentication may avoid the gender fairness issues reported in some other modalities.
- The approach supports continuous authentication without introducing detectable gender disparity in the tested conditions.
Where Pith is reading between the lines
- If the finding holds in larger populations, swipe biometrics could be deployed in mixed-gender settings with reduced fairness auditing for gender.
- The same datasets and tests could be reused to check other demographic splits such as age or handedness.
- High accuracy combined with gender parity may encourage wider adoption of touch-based continuous authentication on mobile devices.
Load-bearing premise
The chosen statistical tests have enough power to detect gender differences if they exist and the user samples represent broader populations.
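One way to probe this premise is a simulation-based power estimate. The sketch below is a minimal example under stated assumptions: per-user error rates are drawn from normal distributions that differ only by a location shift, the test is a two-sided Mann-Whitney test with the normal approximation (no tie correction), and the group sizes 72/45 and 56/15 are the male-to-female ratios quoted in the paper's methodology excerpt. The shift values, the simulation count, and the function names are illustrative choices, not the paper's procedure.

```python
import math
import random


def mann_whitney_p(a, b):
    """Two-sided Mann-Whitney p-value, normal approximation, assuming no ties."""
    n1, n2 = len(a), len(b)
    ranked = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    rank_sum_a = sum(rank for rank, (_, group) in enumerate(ranked, start=1)
                     if group == 0)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_a - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


def simulated_power(n1, n2, shift, sd=1.0, alpha=0.05, n_sim=500, seed=0):
    """Fraction of simulated studies that reject at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        group_a = [rng.gauss(0.0, sd) for _ in range(n1)]
        group_b = [rng.gauss(shift, sd) for _ in range(n2)]
        if mann_whitney_p(group_a, group_b) < alpha:
            rejections += 1
    return rejections / n_sim


# Shift is expressed in standard-deviation units (an assumed effect size).
power_bbmas = simulated_power(72, 45, shift=0.5)
power_antal = simulated_power(56, 15, shift=0.5)
```

Sweeping `shift` until the estimated power crosses a target such as 0.8 gives a rough minimum detectable effect for each dataset's group sizes.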
What would settle it
A new experiment on a larger, balanced dataset that reports a statistically significant difference in error rates between male and female users would falsify the no-difference claim.
Original abstract
Behavioral biometrics offer a promising approach for continuous authentication, but their fairness across demographic groups remains largely unexplored. This paper investigates gender bias in swipe-based authentication using the BBMAS (117 users) and ANTAL (71 users) datasets and evaluates XGBoost and DenseNet classifiers through False Acceptance Rate (FAR) and False Rejection Rate (FRR). XGBoost achieved authentication accuracies of 92% and 94% on the BBMAS and ANTAL datasets, respectively, while statistical tests (Kolmogorov-Smirnov, Mann-Whitney, and Wasserstein permutation) found no significant gender differences in authentication error rates across almost all experimental settings. These findings suggest that swipe-based authentication can achieve high accuracy while maintaining comparable performance for male and female users, supporting its potential as a fair and reliable behavioral biometric modality.
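For readers unfamiliar with the two error rates, the following minimal sketch shows how FAR and FRR are conventionally computed from match scores at a fixed decision threshold. The score values, the threshold, and the function name are hypothetical; the paper's own scoring and thresholding details are not given in the material reproduced here.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) when a score >= threshold means 'accept'."""
    # FAR: share of impostor attempts that were wrongly accepted.
    false_accepts = sum(score >= threshold for score in impostor_scores)
    # FRR: share of genuine attempts that were wrongly rejected.
    false_rejects = sum(score < threshold for score in genuine_scores)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


# Hypothetical classifier scores for one enrolled user.
genuine = [0.9, 0.8, 0.4, 0.7]
impostor = [0.1, 0.6, 0.2, 0.3]
far, frr = far_frr(genuine, impostor, threshold=0.5)
```

Computing the pair once per user and then grouping users by gender yields the two samples that a distributional test can compare.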
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates gender bias in swipe-based touch biometrics on the public BBMAS (117 users) and ANTAL (71 users) datasets. It trains XGBoost and DenseNet classifiers, evaluates them with FAR and FRR, and reports XGBoost authentication accuracies of 92% on BBMAS and 94% on ANTAL. It then applies Kolmogorov-Smirnov, Mann-Whitney, and Wasserstein permutation tests and finds no significant gender differences in error rates across almost all settings. The central claim is that swipe biometrics can achieve high accuracy while remaining gender-fair.
Significance. If the non-significant results are shown to be adequately powered, the work would supply reproducible evidence from two public datasets that a behavioral biometric modality can be both accurate and equitable, which matters for fairness requirements in continuous authentication systems. The use of multiple standard statistical tests and public data is a strength that supports verifiability.
Major comments (2)
- [Abstract and Results] The claim of no significant gender differences (and thus fairness) rests on non-significant outcomes from the KS, MW, and Wasserstein tests, yet the manuscript provides no power analysis, effect sizes, minimum detectable effect, or exact p-values. With total N=117 and N=71 (and smaller per-gender groups after partitioning), non-significance is compatible with both true equality and undetected moderate bias; this directly affects the load-bearing interpretation of fairness.
- [Methods and Results] No per-gender sample sizes, error bars on FAR/FRR, or train/test split details by demographic are reported. These omissions prevent assessment of whether the tests had sufficient power or whether post-hoc choices influenced the 'almost all settings' conclusion.
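To make the effect-size request concrete, the sketch below computes Cliff's delta, a rank-based effect size that pairs naturally with the Mann-Whitney test, together with a percentile bootstrap confidence interval. This is one possible choice among several and is not something the manuscript reports; the function names, the bootstrap settings, and the sample values are assumptions for illustration.

```python
import random


def cliffs_delta(a, b):
    """P(a > b) - P(a < b) over all cross-group pairs; ranges from -1 to 1."""
    greater = sum(x > y for x in a for y in b)
    less = sum(x < y for x in a for y in b)
    return (greater - less) / (len(a) * len(b))


def bootstrap_ci(a, b, statistic, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling each group independently."""
    rng = random.Random(seed)
    draws = sorted(
        statistic(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    lower = draws[int((1 - level) / 2 * n_boot)]
    upper = draws[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Hypothetical per-user false acceptance rates, made up for illustration.
male_far = [0.05, 0.07, 0.06, 0.09, 0.04, 0.08]
female_far = [0.06, 0.08, 0.05, 0.07, 0.10]
delta = cliffs_delta(male_far, female_far)
ci_low, ci_high = bootstrap_ci(male_far, female_far, cliffs_delta)
```

An interval that is narrow and centered near zero would support a claim of parity far more directly than a non-significant p-value does; a wide interval signals that the data cannot rule out a moderate difference.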
Minor comments (2)
- [Abstract] The phrase 'across almost all experimental settings' is imprecise; the specific settings or number of comparisons that did show differences should be stated explicitly.
- [Statistical Analysis] The manuscript should clarify whether the three statistical tests were pre-specified or chosen after inspection, and whether multiple-comparison correction was applied.
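As an example of what a multiple-comparison correction could look like here, the sketch below implements the Holm step-down adjustment, which controls the family-wise error rate across a set of tests. Holm is only one reasonable option (Bonferroni or a false-discovery-rate procedure are alternatives), and nothing in the reviewed material says which, if any, the authors applied. The p-values shown are invented for illustration.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from several dataset/classifier/metric settings.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
```

With three tests per metric, two metrics, two classifiers, and two datasets, the number of comparisons grows quickly, so reporting adjusted values alongside raw ones would make the 'almost all settings' statement easier to interpret.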
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing statistical rigor in interpreting non-significant results. We address each major comment below and will incorporate revisions to strengthen the manuscript.
- Referee: [Abstract and Results] The claim of no significant gender differences (and thus fairness) rests on non-significant outcomes from the KS, MW, and Wasserstein tests, yet the manuscript provides no power analysis, effect sizes, minimum detectable effect, or exact p-values. With total N=117 and N=71 (and smaller per-gender groups after partitioning), non-significance is compatible with both true equality and undetected moderate bias; this directly affects the load-bearing interpretation of fairness.
  Authors: We agree that non-significance alone does not establish equality and that the absence of power analysis, effect sizes, and exact p-values limits the strength of the fairness interpretation. In the revised manuscript we will add: (i) exact p-values for all Kolmogorov-Smirnov, Mann-Whitney, and Wasserstein tests; (ii) effect-size measures (e.g., Cohen's d for continuous comparisons and appropriate equivalents for the permutation test); and (iii) a post-hoc power analysis reporting the minimum detectable effect size at 80% power given the observed per-gender sample sizes. These additions will be placed in the Results section and referenced in the Abstract.
  Revision: yes
- Referee: [Methods and Results] No per-gender sample sizes, error bars on FAR/FRR, or train/test split details by demographic are reported. These omissions prevent assessment of whether the tests had sufficient power or whether post-hoc choices influenced the 'almost all settings' conclusion.
  Authors: We will revise the Methods and Results sections to report the exact number of male and female users retained after each partitioning step for both BBMAS and ANTAL. We will also add error bars (standard deviation or 95% confidence intervals) to all FAR/FRR bar plots and include a table or paragraph detailing the train/test split ratios stratified by gender. These changes will allow readers to evaluate power and reproducibility directly.
  Revision: yes
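For the promised error bars, one simple option is a Wilson score interval on each group's pooled error rate. The sketch below is a minimal version under a strong assumption: it treats every authentication attempt as an independent trial, which attempts from the same user generally are not, so a per-user bootstrap may be more appropriate in practice. The counts and the function name are hypothetical.

```python
import math


def wilson_interval(errors, attempts, z=1.959964):
    """Wilson score interval for an error rate; z=1.959964 gives about 95%."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p = errors / attempts
    denom = 1 + z * z / attempts
    center = (p + z * z / (2 * attempts)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / attempts + z * z / (4 * attempts ** 2)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical pooled counts: false rejections out of genuine attempts.
male_low, male_high = wilson_interval(errors=40, attempts=500)
female_low, female_high = wilson_interval(errors=18, attempts=200)
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when the error count is small, which matters for the smaller gender group in each dataset.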
Circularity Check
No significant circularity detected
The paper reports authentication accuracies and statistical test outcomes (KS, Mann-Whitney, Wasserstein) computed directly from public BBMAS and ANTAL datasets using standard XGBoost and DenseNet classifiers. No derivation chain, fitted parameters, or self-citations are described that reduce the central claims to inputs by construction; the non-significance findings rest on external data and off-the-shelf tools without self-definitional loops or load-bearing internal citations.
Reference graph
Works this paper leans on
- [1] Investigating Gender Bias in Touch Biometrics. arXiv, 2026. INTRODUCTION: User authentication is an essential component of smartphone security, particularly as mobile devices store sensitive personal, financial, and organizational information [14, 9]. Traditional authentication mechanisms, such as PINs, passwords, fingerprints, and facial recognition, typically verify identity only at the point of login [11]....
- [2] RELATED WORK: Prior studies have demonstrated that biometric systems, particularly face recognition, can exhibit demographic disparities, raising concerns about fairness in real-world deployment [3]. Swipe-based authentication has consistently shown promising authentication performance and usability [9, 15, 1, 10], yet its fairness across demographic ...
- [3] METHODOLOGY. Datasets: In this study, we use two biometric authentication datasets, BBMAS [7] and ANTAL, which contain authentication data from users with varied male-to-female ratios, with 72/45 for BBMAS and 56/15 for ANTAL. The BBMAS dataset has 117 users, whereas ANTAL contains
- [4] These datasets provide a broad sample for evaluating authentication methods and investigating gender bias across gender groups. Unfortunately, most publicly available touch authentication datasets do not include demographic attributes [10]. Classification models: We train our authentication models using two classifiers: XGBoost [4] and DenseNet [16]. X...
- [5] RESULTS AND DISCUSSION: Table 1 presents the accuracy, FAR, and FRR for both classifiers on the BBMAS and ANTAL datasets. XGBoost achieved authentication accuracies of 92% on the BBMAS dataset and 94% on the ANTAL dataset, compared with 91% and 93%, respectively, for DenseNet. XGBoost also maintained lower False Rejection Rates (FRR) of 8% and 7%, compared...
- [6] CONCLUSION AND FUTURE WORK: Our study found no statistically significant gender bias in swipe-based authentication across two publicly available datasets and two authentication models. Consistent results from the Kolmogorov–Smirnov, Mann–Whitney, and Wasserstein permutation tests indicate that False Acceptance Rate (FAR) and False Rejection Rate (FRR) ar...
- [7] M. Agrawal et al. Gantouch: An attack-resilient framework for touch-based continuous authentication system. IEEE TBIOM, 2022.
- [8] M. Antal and L. Z. Szabó. Biometric authentication based on touchscreen swipe patterns. Procedia Technology, 2016.
- [9] J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In FAT*, 2018.
- [10] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794. ACM, Aug. 2016.
- [11] D. Deb and M. M. Guirguis. Use of auxiliary classifier generative adversarial network in touchstroke authentication. In ICMLA, 2020.
- [12] P. Drozdowski, C. Rathgeb, A. Dantcheva, N. Damer, and C. Busch. Demographic bias in biometrics: A survey on an emerging challenge. IEEE Transactions on Technology and Society, 1(2):89–103, 2020.
- [13]
- [14] M. Frank, R. Biedert, E. Ma, I. Martinovic, and D. Song. Touchalytics: On the applicability of touchscreen input as a behavioral biometric for continuous authentication. IEEE T-IFS, 2012.
- [15] M. Georgiev, S. Eberz, and I. Martinovic. Techniques for continuous touch-based authentication. 2022.
- [16] M. Georgiev, S. Eberz, H. Turner, G. Lovisotto, and I. Martinovic. Feta: Fair evaluation of touch-based authentication. 2023.
- [17] A. K. Jain, D. Deb, and J. J. Engelsma. Biometrics: Trust, but verify. IEEE Transactions on Biometrics, Behavior, and Identity Science, 4:303–323, 2021.
- [18] R. Kumar, V. V. Phoha, and A. Serwadda. Continuous authentication of smartphone users by fusing typing, swiping, and phone movement patterns. In IEEE BTAS, 2016.
- [19] L. Li, X. Zhao, and G. Xue. Unobservable re-authentication for smartphones. In NDSS, 2013.
- [20] V. M. Patel, R. Chellappa, D. Chandra, and B. Barbello. Continuous user authentication on mobile devices: Recent progress and remaining challenges. IEEE Signal Processing Magazine, 2016.
- [21] A. Serwadda, V. V. Phoha, and Z. Wang. Which verifiers work?: A benchmark evaluation of touch-based authentication algorithms. In IEEE BTAS, 2013.
- [22] C. Zhang, P. Benz, D. M. Argaw, S. Lee, J. Kim, F. Rameau, J.-C. Bazin, and I. S. Kweon. ResNet or DenseNet? Introducing dense shortcuts to ResNet. In WACV, 2021.