
arxiv: 2606.29504 · v1 · pith:KM4H7JD2new · submitted 2026-06-28 · 💻 cs.CV · cs.CR

Empirical Evaluation of Multi-Modal Touch Detection in Over-the-Shoulder Video Surveillance

Pith reviewed 2026-06-30 06:59 UTC · model grok-4.3

classification 💻 cs.CV cs.CR
keywords: touch detection, keystroke reconstruction, over-the-shoulder video, multi-modal fusion, skin color filtering, mobile interface surveillance, video intelligence, empirical evaluation

The pith

A multi-modal touch detector for over-the-shoulder video fails to reconstruct keystrokes reliably outside staged tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates whether four parallel detection methods can turn third-person video of phone use into accurate keystroke sequences. On a controlled 120-frame staged recording the best single-modality runs reach roughly 18 percent F1 while the fused system sits at 16.7 percent F1 and 3 percent sequence match. When the same pipeline is run on five real third-person phone videos it produces a median of 57 reported touches per frame, one to three orders of magnitude above the true tap rate, because the skin-color filter marks the entire hand rather than the fingertip contact point. The authors conclude that performance observed under calibration conditions does not carry over to uncontrolled footage.

Core claim

The combined MediaPipe-landmark, HSV-skin, motion-differencing, and Canny-edge pipeline achieves only 16.7 percent F1 and 3 percent sequence similarity on a staged first-person passcode video; on five real third-person videos the detector emits a median 57 touch points per frame because the skin filter responds to the whole hand rather than fingertip contact, so the staged keystroke result does not survive contact with uncontrolled footage.
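
For readers who want the arithmetic behind the F1 figures, the sketch below computes F1 from true-positive, false-positive, and false-negative counts. The function name and the counts are hypothetical: the counts were chosen only so that the result lands near the reported 16.7 percent and are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Hypothetical counts (NOT from the paper): 2 correct touches,
# 10 spurious detections, 10 missed touches.
example = f1_score(tp=2, fp=10, fn=10)
print(f"{example:.3f}")
```

With equally many spurious and missed touches, precision and recall coincide, so F1 equals both; a low F1 in this regime means most reported touches are wrong and most real touches are missed.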

What carries the argument

Four parallel detection modalities (MediaPipe hand landmarks, HSV skin filtering, temporal frame differencing, shape-guided Canny edges) whose outputs are thresholded and mapped to a reference screen layout to produce touch sequences.
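
The skin-filtering modality can be illustrated with a toy example. The sketch below is not the paper's implementation: the HSV bounds, the pixel colors, and the 4x4 frame are all assumptions made for illustration, and it uses Python's standard `colorsys` module rather than a vision library. It shows why a color threshold alone yields one candidate per hand pixel rather than one per fingertip contact.

```python
import colorsys

# Hypothetical HSV bounds for skin (illustrative assumption, not the paper's values).
# colorsys returns hue, saturation, and value each scaled to [0, 1].
H_MAX, S_MIN, S_MAX, V_MIN = 50 / 360, 0.20, 0.70, 0.35


def is_skin(rgb):
    """Return True when an 8-bit RGB pixel falls inside the assumed skin range."""
    r, g, b = (c / 255 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return h <= H_MAX and S_MIN <= s <= S_MAX and v >= V_MIN


def skin_pixels(frame):
    """List the (x, y) coordinates of every pixel classified as skin."""
    return [(x, y) for y, row in enumerate(frame)
            for x, px in enumerate(row) if is_skin(px)]


SKIN, DARK = (224, 172, 105), (40, 40, 40)
# Toy 4x4 frame: a 3x3 "hand" region on a dark background.
frame = [
    [DARK, DARK, DARK, DARK],
    [DARK, SKIN, SKIN, SKIN],
    [DARK, SKIN, SKIN, SKIN],
    [DARK, SKIN, SKIN, SKIN],
]
candidates = skin_pixels(frame)
print(len(candidates))
```

Every pixel of the toy hand passes the filter, so the candidate count scales with the visible hand area, not with the number of taps.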

If this is right

  • Ablation and resolution-decay tests show that no single modality exceeds 18.5 percent F1 (Motion-Only), even on the staged recording.
  • The skin filter's over-response produces between one and three orders of magnitude more candidate touches than actual keystrokes.
  • Sequence similarity for the combined configuration is only 3.0 percent when its output is mapped to the iOS passcode layout.
  • Proximity-threshold tuning and noise-sensitivity sweeps map the narrow operational envelope of the current pipeline.
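
The "orders of magnitude" gap in the second bullet can be made concrete with a short calculation. Only the median of 57 detections per frame comes from the paper; the frame rate and typing speed below are assumptions chosen for illustration, and the function name is hypothetical.

```python
import math


def orders_of_magnitude(detections_per_frame: float, true_taps_per_frame: float) -> float:
    """Base-10 log of the ratio between reported and actual touches per frame."""
    return math.log10(detections_per_frame / true_taps_per_frame)


MEDIAN_DETECTIONS = 57  # reported median detections per frame

# Assumptions (not from the paper): 30 fps video and 2 taps per second.
FPS, TAPS_PER_SECOND = 30, 2
true_rate = TAPS_PER_SECOND / FPS  # taps per frame under these assumptions

gap = orders_of_magnitude(MEDIAN_DETECTIONS, true_rate)
print(round(gap, 2))
```

Under these assumptions the median detection count is a little under three orders of magnitude above the tap rate; slower video or faster typing would narrow the gap.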

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future detectors would need an explicit fingertip-isolation step before skin or edge cues are applied.
  • The same failure mode would appear in any surveillance pipeline that relies on whole-hand skin segmentation for fine motor events.
  • Sequence reconstruction accuracy may remain low even if per-frame false-positive rate is reduced, because coordinate jitter compounds across frames.

Load-bearing premise

The five real third-person videos are representative enough to show that the skin filter's response to the whole hand is a general failure mode.

What would settle it

A larger collection of uncontrolled third-person videos in which the detector produces a number of touch points per frame close to the actual tap rate would falsify the generalization claim.

Figures

Figures reproduced from arXiv: 2606.29504 by Mohammadreza Rashidi.

Figure 1: The BEHINT pipeline. Each video frame is processed by four parallel detectors (MediaPipe hand landmarks, HSV skin filtering, …
Figure 2: Ablation results: F1-score and sequence similarity across the individual and combined detector configurations.
Figure 3: Resolution decay curves for F1-score and Sequence Similarity.
Figure 4: System F1-Score under increasing Gaussian noise.
Figure 5: Touch detections per frame (log scale) per video. A real typing session produces on the order of …
Original abstract

Video Intelligence Surveillance (VIDINT) on over-the-shoulder footage is a proposed vector for monitoring human-computer interaction patterns without direct screen recording access. In this paper, we evaluate a Behavioral Intelligence (BEHINT) touch-detection framework designed to reconstruct keystroke events on mobile keypad interfaces from physical finger interactions. Our system integrates four parallel detection modalities: (1) anatomical hand landmarks via MediaPipe, (2) HSV skin color filtering, (3) temporal frame differencing for motion detection, and (4) shape-guided Canny edge analysis. We map relative touch coordinates to a reference screen layout to reconstruct typing sequences. Evaluation on a 120-frame first-person staged video of passcode entry reveals that while MediaPipe and Skin Detection fail to run autonomously due to partial hand occlusion and ambient noise, Motion-Only and Edge-Only configurations achieve F1-scores of 18.5% and 18.2%, respectively. The combined multi-modal configuration achieves an F1-score of 16.7% and a sequence similarity of 3.0% when mapped to the iOS passcode layout. We conduct ablation, resolution decay, noise sensitivity, and proximity threshold tuning to characterize the system's operational envelope. We then audit generalization on 5 real, publicly licensed third-person phone videos and find that the detector emits a median of 57 touch points per frame (peaking at 205), one to three orders of magnitude more than the rate of real taps, because the skin filter responds to the whole hand rather than to fingertip contact. The staged keystroke result does not survive contact with uncontrolled footage; the system does not achieve reliable keystroke reconstruction outside the calibrated staged setting.
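
Temporal frame differencing, the third modality listed in the abstract, can be sketched in a few lines. This is a generic illustration rather than the authors' code: the intensity threshold of 25 and the 3x3 grayscale frames are assumptions, and plain Python lists stand in for image arrays.

```python
def motion_mask(prev_frame, curr_frame, threshold=25):
    """Mark pixels whose grayscale intensity changed by more than `threshold`."""
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


# Two toy grayscale frames: one pixel changes strongly, one changes slightly.
prev_frame = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
curr_frame = [[10, 10, 10], [10, 200, 12], [10, 10, 10]]

mask = motion_mask(prev_frame, curr_frame)
changed = sum(cell for row in mask for cell in row)
print(changed)
```

The small change (10 to 12) stays below the threshold and is ignored, which is how differencing suppresses sensor noise. Any moving part of the hand would still register, so motion alone does not isolate the moment of contact.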

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that a multi-modal touch detection system (MediaPipe hand landmarks, HSV skin filtering, temporal frame differencing, and shape-guided Canny edges) for reconstructing keystrokes from over-the-shoulder video fails to generalize. On a 120-frame staged first-person passcode video, motion-only and edge-only achieve F1 scores of 18.5% and 18.2%, while the combined system reaches only 16.7% F1 and 3.0% sequence similarity on an iOS layout. Ablations cover resolution decay, noise, and proximity threshold. On five real third-person videos, the detector emits a median 57 detections per frame (peak 205), orders of magnitude above actual taps, because the skin filter responds to the whole hand rather than fingertip contact. Conclusion: reliable reconstruction does not hold outside the calibrated staged setting.

Significance. If the results hold, the work supplies a direct, quantitative demonstration of generalization failure in multi-modal CV for touch detection, contrasting controlled staged performance against uncontrolled real footage with concrete metrics (F1, sequence similarity, per-frame counts). The mechanistic attribution to the skin filter, together with the reported ablations on resolution, noise sensitivity, and threshold tuning, delineates operational limits and supplies falsifiable evidence that could steer future VIDINT/BEHINT research away from skin- and landmark-heavy pipelines. The empirical focus on failure rather than claimed success is a constructive contribution.

minor comments (2)
  1. The sequence similarity metric (reported as 3.0% when mapped to the iOS passcode layout) is not defined or referenced; an explicit formula or citation would aid replication.
  2. In the real-video audit, reporting only the median and peak detection counts without per-video breakdown or variance leaves the consistency of the skin-filter failure mode less transparent.
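
To make the first comment concrete, one candidate definition of sequence similarity is a normalized edit distance between the true and reconstructed key sequences. The sketch below is an assumption offered for illustration: the paper does not state which definition it uses, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sequence_similarity(truth: str, predicted: str) -> float:
    """1.0 for identical sequences, 0.0 when every position must be edited."""
    longest = max(len(truth), len(predicted))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(truth, predicted) / longest


print(sequence_similarity("1234", "1294"))
```

Under this definition a detector that emits far more touches than were typed is penalized heavily, because every surplus key counts as an edit against the longer sequence.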

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed summary, positive significance assessment, and recommendation to accept. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity


The paper is a purely empirical evaluation reporting direct measurements (F1-scores, sequence similarity, detection counts) from staged and real-video experiments. No derivation chain, fitted parameters, predictions, or self-citations are present; all metrics are computed from the described detection modalities and ablation tests without reduction to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work relies on established computer vision primitives and one tuned proximity threshold; no new physical entities or ad-hoc mathematical constructs are introduced.

free parameters (1)
  • proximity threshold
    Tuned during ablation studies to map relative touch coordinates to the reference screen layout.
axioms (1)
  • [standard math] MediaPipe hand landmarks, HSV skin segmentation, temporal frame differencing, and Canny edge detection perform as documented in their original references under the tested conditions
    These are invoked as off-the-shelf components without re-derivation in the paper.
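
The role of the proximity threshold listed under free parameters can be sketched as a nearest-key lookup. Everything in this example is an assumption made for illustration: the normalized key centers, the default threshold of 0.10, and the function name are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical normalized key centers for a 3x4 passcode keypad (x, y in [0, 1]).
KEY_CENTERS = {
    "1": (0.25, 0.20), "2": (0.50, 0.20), "3": (0.75, 0.20),
    "4": (0.25, 0.40), "5": (0.50, 0.40), "6": (0.75, 0.40),
    "7": (0.25, 0.60), "8": (0.50, 0.60), "9": (0.75, 0.60),
    "0": (0.50, 0.80),
}


def map_touch(x, y, threshold=0.10):
    """Return the nearest key if it lies within `threshold`, otherwise None."""
    key, dist = min(
        ((k, math.hypot(x - cx, y - cy)) for k, (cx, cy) in KEY_CENTERS.items()),
        key=lambda pair: pair[1],
    )
    return key if dist <= threshold else None


print(map_touch(0.52, 0.41))   # touch close to the center of the keypad
print(map_touch(0.05, 0.95))   # touch far from every key
```

A larger threshold accepts more stray detections as keystrokes, while a smaller one rejects touches that land between keys, which is the trade-off a threshold sweep explores.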


