iPay: Integrated Payment Action Recognition via Multimodal Networks and Adaptive Spatial Prior Learning

Kaicong Huang; Ruimin Ke; Thomas Guggisberg; Weiheng Oh

arxiv: 2605.10732 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-12 04:11 UTC · model grok-4.3

keywords action recognition · multimodal fusion · transit surveillance · payment actions · graph convolutional networks · spatial difference · edge deployment

The pith

A multimodal network fuses RGB video and skeleton data with a spatial motion discriminator to recognize transit payment actions from noisy surveillance footage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that payment actions in onboard transit videos can be recognized reliably by running four coordinated streams in parallel: one focused on local RGB details, one using graph convolutions on skeleton joints for global motion, one that transfers information between the two via dual attention, and one that explicitly scores hand-to-anchor spatial differences. This matters because manual inspection of fare payments does not scale, while earlier vision or skeleton methods lose either fine spatial cues or temporal stability under real surveillance conditions. The authors collected their own set of over 500 labeled clips from actual transit cameras to test the system and report that the combined architecture reaches 83.45 percent accuracy at a computational cost low enough for vehicle-edge hardware.

Core claim

iPay is a four-stream multimodal mixture-of-experts model in which an RGB expert emphasizes local evidence, a skeleton expert models articulated motion via graph convolutions, a dual-attention fusion stream transfers temporal information from skeleton to RGB and spatial information from RGB to skeleton, and a prior-driven Spatial Difference Discriminator explicitly models hand-to-anchor relative motion; together these components produce 83.45 percent recognition accuracy on real onboard transit footage while remaining efficient for edge deployment.
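As a rough illustration of how four expert streams could be combined into one prediction, the sketch below mixes per-stream class logits with softmax-normalized gate weights. The gating scheme, the stream names, the two-class setup, and every number are assumptions made for illustration; the summary above does not specify how iPay merges its streams.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mix_experts(stream_logits, gate_scores):
    """Combine per-stream class logits with softmax-normalized gate weights.

    stream_logits: dict mapping stream name -> list of class logits
    gate_scores:   dict mapping stream name -> unnormalized gate score
    Returns (class_probabilities, gate_weights).
    """
    names = sorted(stream_logits)
    weights = softmax([gate_scores[n] for n in names])
    num_classes = len(stream_logits[names[0]])
    mixed = [0.0] * num_classes
    for name, w in zip(names, weights):
        for c, logit in enumerate(stream_logits[name]):
            mixed[c] += w * logit
    return softmax(mixed), dict(zip(names, weights))

# Hypothetical logits for two classes: [no_payment, payment].
logits = {
    "rgb": [0.2, 1.1],
    "skeleton": [0.5, 0.4],
    "fusion": [0.1, 1.5],
    "sdd": [-0.3, 2.0],
}
# Equal gate scores give every stream the same weight (assumed, not learned).
gates = {"rgb": 0.0, "skeleton": 0.0, "fusion": 0.0, "sdd": 0.0}
probs, weights = mix_experts(logits, gates)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

In a trained model the gate scores would presumably depend on the input rather than being fixed constants; they are held constant here only to keep the example small.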

What carries the argument

The Spatial Difference Discriminator (SDD) that scores hand-to-anchor relative motion inside a multimodal mixture-of-experts architecture with dual-attention fusion between RGB and skeleton streams.
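To make "hand-to-anchor relative motion" concrete, here is a minimal sketch that turns a hand track and an anchor track into per-frame distances and frame-to-frame changes. It is not the paper's SDD: the choice of anchor (assumed here to be a fixed point such as a fare reader), the 2D pixel coordinates, and the `approach_score` summary are all hypothetical.

```python
import math

def hand_to_anchor_features(hand_track, anchor_track):
    """Per-frame hand-to-anchor distance and its frame-to-frame change.

    hand_track, anchor_track: lists of (x, y) coordinates, one per frame.
    Returns (distances, deltas) where deltas[i] = distances[i+1] - distances[i];
    a negative delta means the hand moved closer to the anchor.
    """
    if len(hand_track) != len(anchor_track):
        raise ValueError("tracks must have the same number of frames")
    distances = [
        math.hypot(hx - ax, hy - ay)
        for (hx, hy), (ax, ay) in zip(hand_track, anchor_track)
    ]
    deltas = [b - a for a, b in zip(distances, distances[1:])]
    return distances, deltas

def approach_score(deltas):
    """Fraction of frame transitions in which the hand approached the anchor."""
    if not deltas:
        return 0.0
    return sum(1 for d in deltas if d < 0) / len(deltas)

# Hypothetical 4-frame clip: the hand moves toward a fixed anchor point.
hand = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (9.0, 12.0)]
anchor = [(9.0, 12.0)] * 4
distances, deltas = hand_to_anchor_features(hand, anchor)
score = approach_score(deltas)
```

A learned discriminator would likely consume richer inputs than a single scalar score, but distance and its change over time are the simplest features that capture motion relative to an anchor rather than absolute position.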

If this is right

Automated recognition replaces manual review for scalable fare auditing and passenger-flow analytics.
The system can run on vehicle-mounted edge devices without requiring cloud offload.
Skeleton features supply global temporal structure while RGB features supply local spatial detail when the two are fused with attention.
Explicit modeling of relative hand motion improves discriminability for subtle payment gestures that standard action models miss.
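The attention-based fusion mentioned in the list above can be sketched with plain scaled dot-product cross-attention, run once in each direction. This is a generic illustration under assumptions: the feature dimensions and values are made up, and there are no learned projections, so it should not be read as the paper's dual-attention module.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    queries: list of d-dim vectors from the receiving modality
    keys, values: equal-length lists of vectors from the source modality
    Returns (outputs, attention_weights), one row per query.
    """
    scale = math.sqrt(len(queries[0]))
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical per-frame features: 3 frames, 2 dimensions per modality.
rgb_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
skel_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# Skeleton-to-RGB direction: RGB frames query the skeleton sequence.
rgb_enhanced, attn_s2r = cross_attention(rgb_feats, skel_feats, skel_feats)
# RGB-to-skeleton direction: skeleton frames query the RGB sequence.
skel_enhanced, attn_r2s = cross_attention(skel_feats, rgb_feats, rgb_feats)
```

Running the same routine in both directions is what makes the fusion "dual": each modality's features are re-expressed as a weighted mix of the other's.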

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same fusion-plus-prior pattern could be tested on other fine-grained surveillance actions such as fare evasion or passenger assistance.
Adding a lightweight domain-specific discriminator may be a general way to boost performance when standard RGB or skeleton pipelines under-emphasize task-critical local motions.
The efficiency profile supports extending the approach to continuous real-time monitoring rather than post-hoc clip analysis.

Load-bearing premise

The 500-plus payment clips collected from local transit agencies contain enough variety in lighting, camera angles, and passenger behavior to train a model that generalizes beyond the specific recordings used.

What would settle it

Evaluating the trained model on an independent collection of transit surveillance videos recorded in a different city or with different camera hardware and checking whether accuracy stays near 83 percent or drops sharply.
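One hedged way to judge whether accuracy "drops sharply" on new footage is to put a confidence interval around each accuracy figure. The sketch below uses the Wilson score interval; the clip counts (100 per test set) and the out-of-domain result are hypothetical, since the abstract does not report test-set sizes.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical counts: 83 of 100 in-domain clips correct,
# 60 of 100 clips from a different city correct.
in_domain = wilson_interval(83, 100)
out_domain = wilson_interval(60, 100)
sharp_drop = not intervals_overlap(in_domain, out_domain)
```

Non-overlapping intervals are a conservative signal of a real difference. With only a few hundred clips the intervals are wide, which is one reason a single accuracy figure without error bars is hard to interpret.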

Figures

Figures reproduced from arXiv: 2605.10732 by Kaicong Huang, Ruimin Ke, Thomas Guggisberg, Weiheng Oh.

**Figure 2.** Overview of the proposed iPay framework. The system integrates onboard surveillance RGB and skeleton streams within a multimodal mixture-of…

**Figure 3.** Construction of the spatiotemporal ROI (ST-ROI). Action-relevant…

**Figure 4.** Overview of the proposed Spatial Difference Discriminator (SDD),…

Original abstract

Automated transit payment analysis is vital for scalable fare auditing and passenger analytics, yet practice still relies on limited manual inspection. Prior vision- and skeleton-based methods remain brittle under noisy onboard surveillance and often depend on poorly generalizable handcrafted features. Building on the success of graph convolutional networks in human action recognition, we observe that skeleton features excel at modeling global spatiotemporal dependencies but tend to underemphasize the subtle local relative motions that distinguish payment actions. In contrast, RGB features preserve fine-grained spatial details yet often lack reliable temporal continuity in surveillance footage. To bridge both system-level deployment needs and model-level design challenges, we present iPay, an integrated payment action recognition framework for onboard transit surveillance systems. iPay adopts a multimodal mixture-of-experts architecture with four tightly coupled streams: (1) an RGB expert stream emphasizing local evidence via region-focused computation; (2) a skeleton expert stream modeling articulated motion with a graph convolutional backbone; (3) a dual-attention fusion stream enabling skeleton-to-RGB temporal transfer and RGB-to-skeleton spatial enhancement; and (4) a prior-driven Spatial Difference Discriminator (SDD) that explicitly models hand-to-anchor relative motion to improve task-specific discriminability. We also collaborate with local transit agencies to collect over 55 hours of real onboard surveillance footage, yielding 500+ payment clips. Experiments show that iPay outperforms prior methods and achieves 83.45% recognition accuracy with competitive computational efficiency, making it suitable for edge deployment. Code is available at https://github.com/ccoopq/iPay.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

iPay adds a domain-specific Spatial Difference Discriminator to a four-stream multimodal setup for transit payment actions, but the 83.45% claim sits on a modest single-source dataset without clear external checks.

The paper's core contribution is a practical multimodal system for spotting payment gestures in onboard transit video. It runs an RGB expert for local details, a GCN skeleton stream for global motion, dual-attention fusion to move temporal cues in one direction and spatial cues in the other, and a new prior-driven Spatial Difference Discriminator that explicitly tracks hand-to-anchor relative motion. The authors also built a 500-plus clip dataset from 55 hours of real local agency footage, which is new for this narrow task, and they release code. That combination addresses a real operational pain point where manual fare checks still dominate, and the architecture choices line up with known strengths of each modality. The SDD looks like a reasonable way to inject task-specific priors without overcomplicating the backbone. The reported accuracy and efficiency numbers are at least stated clearly in the abstract.

The main weakness is the evaluation setup. A four-stream model trained on clips from one local source risks learning camera angles, lighting, or passenger demographics that do not travel. The abstract gives no k-fold details, no held-out agency test, no public dataset comparison, and no error bars, so the 83.45% figure and the edge-deployment claim are hard to weigh. If the full paper shows solid ablations and some cross-condition testing, the gap narrows; otherwise the gains could be dataset-specific.

This work is aimed at applied computer vision groups working on surveillance or smart transit rather than core action-recognition theorists. A reader who needs ideas for fusing modalities on small domain datasets or who wants a starting point for payment analytics could extract useful pieces. It deserves peer review because the task is well-motivated, the method is described without obvious internal contradictions, and the dataset is a concrete addition even if the experiments need more validation work.

Referee Report

3 major / 2 minor

Summary. The manuscript presents iPay, a multimodal mixture-of-experts framework for recognizing payment actions from onboard transit surveillance videos. It combines an RGB expert stream for local spatial evidence, a graph convolutional skeleton stream for global motion, a dual-attention fusion module for cross-modal transfer, and a prior-driven Spatial Difference Discriminator (SDD) that explicitly models hand-to-anchor relative motions. The authors collected a custom dataset of 500+ clips from 55 hours of real footage with local transit agencies and report that iPay reaches 83.45% accuracy while outperforming prior methods and offering competitive efficiency for edge deployment. Code is released at https://github.com/ccoopq/iPay.

Significance. If the performance and generalization claims hold, the work could support practical deployment in automated fare auditing and passenger analytics by mitigating brittleness of existing RGB- and skeleton-only methods under noisy surveillance conditions. The targeted multimodal design that addresses complementary weaknesses of the two modalities, together with the public code release, provides a reproducible starting point for further research on action recognition in constrained real-world settings.

major comments (3)

[Experiments] Experiments section: the headline 83.45% accuracy and outperformance claim rest on a custom dataset of only 500+ clips collected from local transit agencies. No details are supplied on train/test split ratios, k-fold cross-validation, or evaluation on held-out footage from a second agency or different camera/lighting conditions; this directly undermines the generalizability needed to support the edge-deployment suitability assertion.
[Method] Method section (SDD): the claim that the prior-driven Spatial Difference Discriminator improves task-specific discriminability by modeling hand-to-anchor relative motion lacks supporting ablation results (e.g., accuracy with vs. without SDD, or vs. a simple spatial prior baseline). Without these numbers the contribution of the invented SDD component cannot be isolated from the rest of the four-stream architecture.
[Experiments] Experiments section: baseline comparisons are asserted but no table or quantitative breakdown is referenced that lists the exact prior methods, their reported accuracies, parameter counts, and inference latencies on the same dataset. This makes the outperformance statement difficult to verify and weakens the central empirical claim.
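The split and cross-validation details requested in the comments above could be reported with something as simple as a stratified k-fold protocol. The following is a minimal sketch using only the standard library; the 300/200 class counts and the label names are hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs with classes balanced across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's clips round-robin so every fold gets a fair share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test

# Hypothetical label list: 500 clips, imbalanced 300/200 across two classes.
labels = ["payment"] * 300 + ["no_payment"] * 200
splits = list(stratified_k_fold(labels, k=5, seed=42))
```

For surveillance data, clips cut from the same trip or showing the same passenger would ideally be kept in the same fold, since splitting them across train and test can inflate accuracy; that grouping step is omitted here for brevity.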

minor comments (2)

[Abstract] Abstract: the phrases 'over 55 hours' and '500+ payment clips' should be replaced by exact figures and a brief note on class balance or annotation protocol.
[Introduction] Introduction: a short paragraph contrasting iPay with recent multimodal action-recognition works that also use attention-based fusion would better situate the dual-attention design.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We appreciate the opportunity to clarify and strengthen our work. We address each major comment below and will incorporate the necessary revisions in the updated manuscript.

Referee: [Experiments] Experiments section: the headline 83.45% accuracy and outperformance claim rest on a custom dataset of only 500+ clips collected from local transit agencies. No details are supplied on train/test split ratios, k-fold cross-validation, or evaluation on held-out footage from a second agency or different camera/lighting conditions; this directly undermines the generalizability needed to support the edge-deployment suitability assertion.

Authors: We agree that more details on the dataset partitioning are necessary to support the generalizability claims. In the revised manuscript, we will add a dedicated subsection in the Experiments section detailing the train/test split ratios used, the cross-validation procedure, and additional experiments on subsets with varying lighting conditions from the collected footage. Regarding evaluation on a second agency, we currently do not have access to such data, but we will expand the discussion on the diversity of the existing dataset (55 hours from multiple routes and times of day) to better support the deployment claims.

revision: partial

Referee: [Method] Method section (SDD): the claim that the prior-driven Spatial Difference Discriminator improves task-specific discriminability by modeling hand-to-anchor relative motion lacks supporting ablation results (e.g., accuracy with vs. without SDD, or vs. a simple spatial prior baseline). Without these numbers the contribution of the invented SDD component cannot be isolated from the rest of the four-stream architecture.

Authors: We acknowledge the importance of isolating the contribution of the SDD module. We will include ablation experiments in the revised manuscript to quantify the contribution of the SDD component, including comparisons with and without it, as well as against a baseline using a simple spatial prior. This will be added as a new table in the Experiments section.

revision: yes

Referee: [Experiments] Experiments section: baseline comparisons are asserted but no table or quantitative breakdown is referenced that lists the exact prior methods, their reported accuracies, parameter counts, and inference latencies on the same dataset. This makes the outperformance statement difficult to verify and weakens the central empirical claim.

Authors: We apologize for the lack of a comprehensive comparison table. We will add a new table in the Experiments section that lists all compared methods, along with their accuracies on our dataset, parameter counts, and inference latencies measured on the same hardware setup to allow verification of the outperformance claims.

revision: yes

standing simulated objections not resolved

Evaluation on held-out footage from a second agency, as we do not have access to data from additional transit agencies beyond the one we collaborated with.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper proposes a new multimodal mixture-of-experts architecture (RGB expert, GCN skeleton, dual-attention fusion, prior-driven SDD) and reports empirical accuracy of 83.45% on a separately collected dataset of 500+ clips from 55 hours of transit footage. No equations, predictions, or first-principles results are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains; the method applies standard techniques to a new task and evaluates via conventional training/testing rather than deriving outputs from inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The abstract relies on domain assumptions about the complementary strengths of RGB and skeleton modalities and introduces one new architectural component. No explicit free parameters or invented physical entities are described.

axioms (2)

domain assumption: Skeleton features excel at modeling global spatiotemporal dependencies but underemphasize subtle local relative motions.
Direct observation stated in the abstract as motivation for the multimodal design.
domain assumption: RGB features preserve fine-grained spatial details yet often lack reliable temporal continuity in surveillance footage.
Observation presented in the abstract to justify the dual-stream approach.

invented entities (1)

Spatial Difference Discriminator (SDD): no independent evidence
purpose: Explicitly models hand-to-anchor relative motion to improve task-specific discriminability.
New module introduced as part of the four-stream architecture.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each tag describes the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: multimodal mixture-of-experts architecture with four tightly coupled streams: RGB expert, skeleton GCN, dual-attention fusion, prior-driven Spatial Difference Discriminator (SDD)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: achieves 83.45% recognition accuracy with competitive computational efficiency

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Survey of automated fare collection solutions in public transportation,

M. Bieler, A. Skretting, P. Büdinger, and T.-M. Grønli, “Survey of automated fare collection solutions in public transportation,”IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, pp. 14 248–14 266, 2022

work page 2022

[2] [2]

Fare inspection in proof-of- payment transit networks: A review,

B. Barabino, M. Carra, and G. Currie, “Fare inspection in proof-of- payment transit networks: A review,”Journal of Public Transportation, vol. 26, p. 100101, 2024

work page 2024

[3] [3]

Measuring and controlling subway fare evasion: Improving safety and security at new york city transit authority,

A. V . Reddy, J. Kuhls, and A. Lu, “Measuring and controlling subway fare evasion: Improving safety and security at new york city transit authority,”Transportation Research Record, vol. 2216, no. 1, pp. 85– 99, 2011

work page 2011

[4] [4]

Transitreid: Transit od data collection with occlusion-resistant dynamic passenger re-identification,

K. Huang, T. Azfar, J. Reilly, and R. Ke, “Transitreid: Transit od data collection with occlusion-resistant dynamic passenger re-identification,” arXiv preprint arXiv:2504.11500, 2025

work page arXiv 2025

[5] [5]

Bus violence: An open benchmark for video violence detection on public transport,

L. Ciampi, P. Foszner, N. Messina, M. Staniszewski, C. Gennaro, F. Falchi, G. Serao, M. Cogiel, D. Golba, A. Szcz˛ esnaet al., “Bus violence: An open benchmark for video violence detection on public transport,”Sensors, vol. 22, no. 21, p. 8345, 2022

work page 2022

[6] [6]

Slowfast networks for video recognition,

C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” inProceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6202–6211

work page 2019

[7] [7]

Video swin transformer,

Z. Liu, J. Ning, Y . Cao, Y . Wei, Z. Zhang, S. Lin, and H. Hu, “Video swin transformer,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 3202–3211

work page 2022

[8] [8]

Videomae: Masked autoen- coders are data-efficient learners for self-supervised video pre-training,

Z. Tong, Y . Song, J. Wang, and L. Wang, “Videomae: Masked autoen- coders are data-efficient learners for self-supervised video pre-training,” Advances in neural information processing systems, vol. 35, pp. 10 078– 10 093, 2022

work page 2022

[9] [9]

Spatial temporal graph convolutional networks for skeleton-based action recognition,

S. Yan, Y . Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” inProceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018

work page 2018

[10] [10]

Two-stream adaptive graph convolutional networks for skeleton-based action recognition,

L. Shi, Y . Zhang, J. Cheng, and H. Lu, “Two-stream adaptive graph convolutional networks for skeleton-based action recognition,” inPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12 026–12 035

work page 2019

[11] [11]

Channel- wise topology refinement graph convolution for skeleton-based action recognition,

Y . Chen, Z. Zhang, C. Yuan, B. Li, Y . Deng, and W. Hu, “Channel- wise topology refinement graph convolution for skeleton-based action recognition,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 13 359–13 368

work page 2021

[12] [12]

Infogcn: Representation learning for human skeleton-based action recognition,

H.-g. Chi, M. H. Ha, S. Chi, S. W. Lee, Q. Huang, and K. Ramani, “Infogcn: Representation learning for human skeleton-based action recognition,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 20 186–20 196

work page 2022

[13] Y. Zhou, X. Yan, Z.-Q. Cheng, Y. Yan, Q. Dai, and X.-S. Hua, “Blockgcn: Redefine topology awareness for skeleton-based action recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 2049–2058

[14] W. Myung, N. Su, J.-H. Xue, and G. Wang, “Degcn: Deformable graph convolutional networks for skeleton-based action recognition,” IEEE Transactions on Image Processing, vol. 33, pp. 2477–2490, 2024

[15] N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V. Alwala, H. Khedr, A. Huang et al., “Sam 3: Segment anything with concepts,” arXiv preprint arXiv:2511.16719, 2025

[16] X. Bruce, Y. Liu, X. Zhang, S.-h. Zhong, and K. C. Chan, “Mmnet: A model-based multimodal network for human action recognition in rgb-d videos,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3522–3538, 2022

[17] S. Huang, X. Liu, W. Chen, G. Song, Z. Zhang, L. Yang, and B. Zhang, “A detection method of individual fare evasion behaviours on metros based on skeleton sequence and time series,” Information Sciences, vol. 589, pp. 62–79, 2022

[18] S. Huang, G. Song, W. Chen, J. Qin, X. Liu, B. Zhang, and Z. Zhang, “Time series–based detection on tailgating fare evasions using human pose estimation,” Journal of Transportation Engineering, Part A: Systems, vol. 148, no. 7, p. 04022035, 2022

[19] A. Adanyin and J. Odede, “Ai-driven fare evasion detection in public transportation: A multi-technology approach integrating behavioural ai, IoT, and privacy-preserving systems,” 2024

[20] P. Wauyo, D. Bwiza, A. Murara, E. Mugume, and E. Umuhoza, “Towards fairer public transit: Real-time tensor-based multimodal fare evasion and fraud detection,” arXiv preprint arXiv:2510.02165, 2025

[21] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1110–1118

[22] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu rgb+d: A large scale dataset for 3d human activity analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1010–1019

[23] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot, “Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2684–2701, 2019

[24] J. Wang, X. Nie, Y. Xia, Y. Wu, and S.-C. Zhu, “Cross-view action modeling, learning and recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 2649–2656

[25] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308

[26] Y. Zhao and P. Krähenbühl, “Real-time online video detection with temporal smoothing transformers,” in European Conference on Computer Vision. Springer, 2022, pp. 485–502

[27] Z. Pang, F. Sener, and A. Yao, “Context-enhanced memory-refined transformer for online action detection,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 8700–8710

[28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

[29] S. Cho and T.-K. Kim, “Body-hand modality expertized networks with cross-attention for fine-grained skeleton action recognition,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 11 614–11 621

[30] Y. Zhu, H. Shuai, G. Liu, and Q. Liu, “Multilevel spatial–temporal excited graph network for skeleton-based action recognition,” IEEE Transactions on Image Processing, vol. 32, pp. 496–508, 2022