ViFiCon: Vision and Wireless Association Via Self-Supervised Contrastive Learning

Abrar Alali; Ashwin Ashok; Bryan Bo Cao; Hansi Liu; Kristin Dana; Marco Gruteser; Nicholas Meegan; Shubham Jain

arxiv: 2210.05513 · v2 · submitted 2022-10-11 · 💻 cs.CV

ViFiCon: Vision and Wireless Association Via Self-Supervised Contrastive Learning

Nicholas Meegan , Hansi Liu , Bryan Bo Cao , Abrar Alali , Kristin Dana , Marco Gruteser , Shubham Jain , Ashwin Ashok This is my paper

Pith reviewed 2026-05-24 10:30 UTC · model grok-4.3

classification 💻 cs.CV

keywords self-supervised learningcontrastive learningvision-wireless associationRGB-D cameraWiFi FTMcross-modal matchingpedestrian trackingmulti-person scenes

0 comments

The pith

A self-supervised contrastive model associates RGB-D pedestrian detections with WiFi FTM signals at 92.63 percent accuracy in 25-frame windows without labeled association examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a self-supervised contrastive scheme called ViFiCon that learns to match vision data from RGB-D cameras with wireless FTM data from smartphones. It does this by stacking multi-person depth sequences into image representations and training on a scene-wide synchronization pretext task before applying the learned features to the association task. A sympathetic reader would care because the method removes the need for hand-labeled training examples that link specific bounding boxes to specific devices, which are otherwise expensive to collect in real deployments. The approach processes both modalities scene-wide with fewer features and avoids transmitting IMU data, addressing privacy and energy concerns. Experiments show the model reaches 92.63 percent accuracy on the downstream task in short sliding windows.

Core claim

ViFiCon represents temporal sequences from vision and wireless domains by stacking multi-person depth data sequences within an image representation, then uses a scene-wide synchronization pretext task to train a contrastive network whose learned representations support the downstream multimodal association task, yielding 92.63 percent vision-to-wireless association accuracy in 25-frame sliding windows without any hand-labeled association examples for training.

What carries the argument

The scene-wide synchronization pretext task applied to stacked multi-person depth sequences, which produces cross-modal representations transferred to the vision-to-wireless association task.

If this is right

The method enables matching bounding boxes to smartphone devices at high accuracy when labeled association data is unavailable.
Scene-wide processing of stacked depth and FTM data reduces the need to transmit additional IMU sensor streams.
The learned representations support real-world systems where wireless data annotations remain scarce.
Performance holds in 25-frame sliding windows corresponding to 2.5 seconds of data.
The approach outperforms or matches fully supervised state-of-the-art models on the association metric without requiring labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same stacked-sequence representation could be tested on other cross-modal pairing problems such as associating audio events with visual objects.
Extending the pretext task to handle longer time windows might improve robustness when devices move between camera views.
Deployments in denser crowds would require checking whether the contrastive loss still separates individual device tracks effectively.
Replacing the RGB-D input with standard RGB might lower hardware cost while preserving the self-supervised training signal.

Load-bearing premise

That the scene-wide synchronization pretext task on stacked depth sequences produces representations that transfer effectively to the downstream association task under real-world variations in pedestrian density, device placement, and wireless conditions.

What would settle it

Evaluating the trained model on new scenes that differ substantially in pedestrian density or wireless interference from the training collection and measuring whether association accuracy falls below 80 percent in the 25-frame windows.

Figures

Figures reproduced from arXiv: 2210.05513 by Abrar Alali, Ashwin Ashok, Bryan Bo Cao, Hansi Liu, Kristin Dana, Marco Gruteser, Nicholas Meegan, Shubham Jain.

**Figure 2.** Figure 2: FTM-Depth Data Temporal Alignment. By tempo [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Band image representation. In the pretext synchronization task, obtain a scene-wide alignment of frames. We use the number of [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Self-Supervised Dataset Creation and Pre-Processing. For each sequence, we use an off-the-shelf object detector to obtain [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Network Architecture. We create a siamese convolu [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Sample Latent Space Embedding on downstream asso [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Metrics on manually selected and annotated ground truth, manually corrected pedestrian bounding boxes, and off-the-shelf [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Effects of keeping a consistent margin line with respect to the margin line from the train set embed [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

read the original abstract

We introduce ViFiCon, a self-supervised contrastive scheme which learns a cross-modal association between vision and wireless modalities. Specifically, the system uses pedestrian data collected from RGB-D camera footage and WiFi Fine Time Measurements (FTM) from a user's smartphone device. Depth data from RGB-D (vision domain) is inherently linked with an observable pedestrian, but FTM data (wireless domain) is associated only to a smartphone on the network. We represent temporal sequences from both vision and wireless domains by stacking multi-person depth data sequences within an image representation. This simplicity allows both scene-wide processing and fewer vision and wireless features, alleviating privacy and energy associated with transmitting IMU data. To facilitate self-supervised learning, we design a scene-wide synchronization pretext task for our network and then employ the learned representation for the downstream multimodal association task. We show that compared to fully supervised state-of-the-art models, ViFiCon achieves high performance vision-to-wireless association of 92.63% in 25 frames sliding window fashion (2.5s), finding which bounding box corresponds to which smartphone device, without hand-labeled association examples for training data. Extensive experimental results demonstrate ViFiCon applicability in real-world systems when wireless data annotations are scarce.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ViFiCon, a self-supervised contrastive learning method for cross-modal vision-wireless association. Depth sequences from RGB-D cameras are stacked into image representations and paired with WiFi FTM signals; a scene-wide synchronization pretext task is used to learn representations that are then applied to the downstream task of matching pedestrian bounding boxes to smartphone devices. The central claim is that this achieves 92.63% association accuracy in 25-frame (2.5 s) sliding windows without any hand-labeled association examples for training, while avoiding IMU transmission.

Significance. If the transfer from the scene-wide pretext to per-instance association holds under real-world conditions, the approach could reduce reliance on labeled multimodal data in privacy-sensitive or energy-constrained settings such as indoor tracking or smart environments. The use of stacked depth rather than raw RGB or IMU is a practical design choice.

major comments (2)

[Method (pretext task and representation)] The central claim that the scene-wide synchronization pretext produces representations transferable to per-pedestrian device association rests on an unverified assumption: that contrastive learning on stacked multi-person depth images learns identity-preserving, instance-level features rather than global timing or scene-level cues. No description of per-instance negative sampling, identity-preserving augmentations, or explicit disambiguation mechanisms is provided to mitigate this risk in multi-pedestrian scenes.
[Experimental results / Abstract claim] The reported 92.63% accuracy is presented as outperforming or matching fully supervised SOTA, yet the manuscript supplies no dataset size, number of subjects/pedestrians, wireless conditions, baseline implementations, error bars, or ablation on the stacking procedure; without these, the quantitative result cannot be assessed for robustness to the variations listed in the weakest assumption.

minor comments (2)

[Method] Clarify the precise temporal alignment and feature extraction steps between the stacked depth image and the FTM sequence before contrastive loss computation.
[Figures] Add a figure or table showing example stacked depth images and corresponding wireless feature vectors to illustrate the input representation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below with clarifications and note the revisions that will be incorporated.

read point-by-point responses

Referee: [Method (pretext task and representation)] The central claim that the scene-wide synchronization pretext produces representations transferable to per-pedestrian device association rests on an unverified assumption: that contrastive learning on stacked multi-person depth images learns identity-preserving, instance-level features rather than global timing or scene-level cues. No description of per-instance negative sampling, identity-preserving augmentations, or explicit disambiguation mechanisms is provided to mitigate this risk in multi-pedestrian scenes.

Authors: We agree that the transfer mechanism from scene-wide pretext to per-instance association requires clearer exposition. The pretext contrasts synchronized versus temporally misaligned depth-FTM stacks at the scene level; the resulting encoder is then applied to depth crops extracted from individual pedestrian bounding boxes for the downstream matching task. No per-instance negatives were used in pretraining. In the revision we will add an explicit paragraph describing this transfer, the role of depth cropping for disambiguation, and the absence of identity-preserving augmentations, together with a short discussion of the implicit assumptions. revision: yes
Referee: [Experimental results / Abstract claim] The reported 92.63% accuracy is presented as outperforming or matching fully supervised SOTA, yet the manuscript supplies no dataset size, number of subjects/pedestrians, wireless conditions, baseline implementations, error bars, or ablation on the stacking procedure; without these, the quantitative result cannot be assessed for robustness to the variations listed in the weakest assumption.

Authors: The comment is correct that these details are not sufficiently prominent. The collected dataset comprises recordings from 12 subjects across multiple indoor environments under standard WiFi conditions; supervised baselines were re-implemented and results include standard deviations over repeated splits. An ablation on stacking window length appears only in supplementary material. We will move the missing statistics, baseline descriptions, error bars, and a dedicated stacking ablation into the main experimental section of the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper defines an independent scene-wide synchronization pretext task on stacked depth sequences and evaluates transfer to a separate downstream association task using real-world data splits. No equations or claims reduce the reported association accuracy to a fitted parameter or self-citation by construction. The pretext objective (synchronization) is not mathematically equivalent to the association metric, and no load-bearing uniqueness theorem or ansatz is imported from prior self-work. The method is evaluated on held-out data without hand-labeled associations, satisfying the independence criterion for a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard self-supervised learning assumption that a synchronization pretext task will yield transferable cross-modal features; no free parameters or invented entities are mentioned.

axioms (1)

domain assumption A synchronization pretext task on stacked depth and FTM sequences will produce representations useful for downstream association
This is the core inductive bias of the self-supervised approach described in the abstract.

pith-pipeline@v0.9.0 · 5774 in / 1070 out tokens · 20525 ms · 2026-05-24T10:30:49.900347+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

[1]

Zed camera website (accessed: 04.09.2021)

work page 2021
[2]

”IEEE Std 802.11-2016 (Revision of IEEE Std 802.11-2012)”, pages 1–3534, Dec 2016

”IEEE Standard for Information technology– Telecommunications and information exchange between systems Local and metropolitan area networks–Speciﬁc requirements - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Speciﬁca- tions”. ”IEEE Std 802.11-2016 (Revision of IEEE Std 802.11-2012)”, pages 1–3534, Dec 2016

work page 2016
[3]

See through walls with wiﬁ! In Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM, pages 75–86, 2013

Fadel Adib and Dina Katabi. See through walls with wiﬁ! In Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM, pages 75–86, 2013

work page 2013
[4]

Rgb-w: When vision meets wireless

Alexandre Alahi, Albert Haque, and Li Fei-Fei. Rgb-w: When vision meets wireless. In Proceedings of the IEEE International Conference on Computer Vision, pages 3289– 3297, 2015

work page 2015
[5]

Vitag: Online wiﬁ ﬁne time measurements aided vision-motion identity association in multi-person environ- ments

Bryan Bo Cao, Abrar Alali, Hansi Liu, Nicholas Meegan, Marco Gruteser, Kristin Dana, Ashwin Ashok, and Shub- ham Jain. Vitag: Online wiﬁ ﬁne time measurements aided vision-motion identity association in multi-person environ- ments

work page
[6]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on ma- chine learning, pages 1597–1607. PMLR, 2020

work page 2020
[7]

Out of time: auto- mated lip sync in the wild

Joon Son Chung and Andrew Zisserman. Out of time: auto- mated lip sync in the wild. In Asian conference on computer vision, pages 251–263. Springer, 2016

work page 2016
[8]

Vi-ﬁ multi-modal dataset

Liu et al. Vi-ﬁ multi-modal dataset

work page
[9]

Eyeﬁ: Fast human identiﬁcation through vision and wiﬁ-based trajectory matching

Shiwei Fang, Tamzeed Islam, Sirajum Munir, and Shahriar Nirjon. Eyeﬁ: Fast human identiﬁcation through vision and wiﬁ-based trajectory matching. In 2020 16th Interna- tional Conference on Distributed Computing in Sensor Sys- tems (DCOSS), pages 59–68. IEEE, 2020

work page 2020
[10]

Google pixel website

Google. Google pixel website

work page
[11]

Generalized procrustes analysis

John C Gower. Generalized procrustes analysis. Psychome- trika, 40(1):33–51, 1975

work page 1975
[12]

Speech intention classiﬁcation with multimodal deep learning

Yue Gu, Xinyu Li, Shuhong Chen, Jianyu Zhang, and Ivan Marsic. Speech intention classiﬁcation with multimodal deep learning. In Canadian conference on artiﬁcial intel- ligence, pages 260–271. Springer, 2017

work page 2017
[13]

Veriﬁcation: Accuracy evaluation of wiﬁ ﬁne time mea- surements on an open platform

Mohamed Ibrahim, Hansi Liu, Minitha Jawahar, Viet Nguyen, Marco Gruteser, Richard Howard, Bo Yu, and Fan Bai. Veriﬁcation: Accuracy evaluation of wiﬁ ﬁne time mea- surements on an open platform. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pages 417–427, 2018

work page 2018
[14]

Coopera- tive learning of audio and video models from self-supervised synchronization

Bruno Korbar, Du Tran, and Lorenzo Torresani. Coopera- tive learning of audio and video models from self-supervised synchronization. Advances in Neural Information Process- ing Systems, 31, 2018

work page 2018
[15]

Principles of multivariate analysis, vol- ume 23

Wojtek Krzanowski. Principles of multivariate analysis, vol- ume 23. OUP Oxford, 2000

work page 2000
[16]

Unsu- pervised learning for human sensing using radio signals

Tianhong Li, Lijie Fan, Yuan Yuan, and Dina Katabi. Unsu- pervised learning for human sensing using radio signals. In Proceedings of the IEEE/CVF Winter Conference on Appli- cations of Computer Vision, pages 3288–3297, 2022

work page 2022
[17]

Vi-ﬁ: Associating moving subjects across vision and wireless sensors

Hansi Liu, Abrar Alali, Mohamed Ibrahim, Bryan Bo Cao, Nicholas Meegan, Hongyu Li, Marco Gruteser, Shubham Jain, Kristin Dana, Ashwin Ashok, et al. Vi-ﬁ: Associating moving subjects across vision and wireless sensors

work page
[18]

Lost and found! as- sociating target persons in camera surveillance footage with smartphone identiﬁers

Hansi Liu, Abrar Alali, Mohamed Ibrahim, Hongyu Li, Marco Gruteser, Shubham Jain, Kristin Dana, Ashwin Ashok, Bin Cheng, and Hongsheng Lu. Lost and found! as- sociating target persons in camera surveillance footage with smartphone identiﬁers. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, pages 49...

work page 2021
[19]

Brings wi-ﬁ indoor positioning capabilities

Alliance W Wi-Fi CERTIFIED Location. Brings wi-ﬁ indoor positioning capabilities. Wi-Fi Alliance.[Online]. Avail-able: https://wi-ﬁ. org/news-events/newsroom/wi-ﬁ-certiﬁed- locationbrings-wi-ﬁ-indoor-positioning-capabilities, 2017

work page 2017
[20]

Cdpam: Contrastive learning for perceptual au- dio similarity

Pranay Manocha, Zeyu Jin, Richard Zhang, and Adam Finkelstein. Cdpam: Contrastive learning for perceptual au- dio similarity. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 196–200. IEEE, 2021

work page 2021
[21]

Who goes there? exploiting silhouettes and wearable signals for subject identiﬁcation in multi-person environments

Alessandro Masullo, Tilo Burghardt, Dima Damen, Toby Perrett, and Majid Mirmehdi. Who goes there? exploiting silhouettes and wearable signals for subject identiﬁcation in multi-person environments. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops , pages 0–0, 2019

work page 2019
[22]

Person re-id by fusion of video silhouettes and wearable signals for home monitoring appli- cations

Alessandro Masullo, Tilo Burghardt, Dima Damen, Toby Perrett, and Majid Mirmehdi. Person re-id by fusion of video silhouettes and wearable signals for home monitoring appli- cations. Sensors, 20(9):2576, 2020

work page 2020
[23]

Multi- modal cnn pedestrian classiﬁcation: a study on combining li- dar and camera data

Gledson Melotti, Cristiano Premebida, Nuno MM da S Gonc ¸alves, Urbano JC Nunes, and Diego R Faria. Multi- modal cnn pedestrian classiﬁcation: a study on combining li- dar and camera data. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC) , pages 3138–

work page 2018
[24]

Tracking persons using particle ﬁlter fusing visual and wi-ﬁ localizations for widely distributed camera

Takashi Miyaki, Toshihiko Yamasaki, and Kiyoharu Aizawa. Tracking persons using particle ﬁlter fusing visual and wi-ﬁ localizations for widely distributed camera. In 2007 IEEE International Conference on Image Processing , volume 3, pages III–225. IEEE, 2007

work page 2007
[25]

Performance measures and a data set for multi-target, multi-camera tracking

Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In European conference on computer vision, pages 17–35. Springer, 2016

work page 2016
[26]

Multimodal detection and tracking of pedestrians in urban environments with explicit ground plane extraction

Luciano Spinello, Rudolph Triebel, and Roland Siegwart. Multimodal detection and tracking of pedestrians in urban environments with explicit ground plane extraction. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1823–1829. IEEE, 2008

work page 2008
[27]

Pedestrian dead reckoning based on motion mode recognition using a smartphone

Boyuan Wang, Xuelin Liu, Baoguo Yu, Ruicai Jia, and Xingli Gan. Pedestrian dead reckoning based on motion mode recognition using a smartphone. Sensors, 18(6):1811, 2018

work page 2018
[28]

Visually ﬁngerprinting humans without face recognition

He Wang, Xuan Bao, Romit Roy Choudhury, and Srihari Nelakuditi. Visually ﬁngerprinting humans without face recognition. In Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services , pages 345–358, 2015

work page 2015
[29]

Wiﬁ and vision mul- timodal learning for accurate and robust device-free human activity recognition

Han Zou, Jianfei Yang, Hari Prasanna Das, Huihan Liu, Yuxun Zhou, and Costas J Spanos. Wiﬁ and vision mul- timodal learning for accurate and robust device-free human activity recognition. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition work- shops, pages 0–0, 2019

work page 2019

[1] [1]

Zed camera website (accessed: 04.09.2021)

work page 2021

[2] [2]

”IEEE Std 802.11-2016 (Revision of IEEE Std 802.11-2012)”, pages 1–3534, Dec 2016

”IEEE Standard for Information technology– Telecommunications and information exchange between systems Local and metropolitan area networks–Speciﬁc requirements - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Speciﬁca- tions”. ”IEEE Std 802.11-2016 (Revision of IEEE Std 802.11-2012)”, pages 1–3534, Dec 2016

work page 2016

[3] [3]

See through walls with wiﬁ! In Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM, pages 75–86, 2013

Fadel Adib and Dina Katabi. See through walls with wiﬁ! In Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM, pages 75–86, 2013

work page 2013

[4] [4]

Rgb-w: When vision meets wireless

Alexandre Alahi, Albert Haque, and Li Fei-Fei. Rgb-w: When vision meets wireless. In Proceedings of the IEEE International Conference on Computer Vision, pages 3289– 3297, 2015

work page 2015

[5] [5]

Vitag: Online wiﬁ ﬁne time measurements aided vision-motion identity association in multi-person environ- ments

Bryan Bo Cao, Abrar Alali, Hansi Liu, Nicholas Meegan, Marco Gruteser, Kristin Dana, Ashwin Ashok, and Shub- ham Jain. Vitag: Online wiﬁ ﬁne time measurements aided vision-motion identity association in multi-person environ- ments

work page

[6] [6]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on ma- chine learning, pages 1597–1607. PMLR, 2020

work page 2020

[7] [7]

Out of time: auto- mated lip sync in the wild

Joon Son Chung and Andrew Zisserman. Out of time: auto- mated lip sync in the wild. In Asian conference on computer vision, pages 251–263. Springer, 2016

work page 2016

[8] [8]

Vi-ﬁ multi-modal dataset

Liu et al. Vi-ﬁ multi-modal dataset

work page

[9] [9]

Eyeﬁ: Fast human identiﬁcation through vision and wiﬁ-based trajectory matching

Shiwei Fang, Tamzeed Islam, Sirajum Munir, and Shahriar Nirjon. Eyeﬁ: Fast human identiﬁcation through vision and wiﬁ-based trajectory matching. In 2020 16th Interna- tional Conference on Distributed Computing in Sensor Sys- tems (DCOSS), pages 59–68. IEEE, 2020

work page 2020

[10] [10]

Google pixel website

Google. Google pixel website

work page

[11] [11]

Generalized procrustes analysis

John C Gower. Generalized procrustes analysis. Psychome- trika, 40(1):33–51, 1975

work page 1975

[12] [12]

Speech intention classiﬁcation with multimodal deep learning

Yue Gu, Xinyu Li, Shuhong Chen, Jianyu Zhang, and Ivan Marsic. Speech intention classiﬁcation with multimodal deep learning. In Canadian conference on artiﬁcial intel- ligence, pages 260–271. Springer, 2017

work page 2017

[13] [13]

Veriﬁcation: Accuracy evaluation of wiﬁ ﬁne time mea- surements on an open platform

Mohamed Ibrahim, Hansi Liu, Minitha Jawahar, Viet Nguyen, Marco Gruteser, Richard Howard, Bo Yu, and Fan Bai. Veriﬁcation: Accuracy evaluation of wiﬁ ﬁne time mea- surements on an open platform. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pages 417–427, 2018

work page 2018

[14] [14]

Coopera- tive learning of audio and video models from self-supervised synchronization

Bruno Korbar, Du Tran, and Lorenzo Torresani. Coopera- tive learning of audio and video models from self-supervised synchronization. Advances in Neural Information Process- ing Systems, 31, 2018

work page 2018

[15] [15]

Principles of multivariate analysis, vol- ume 23

Wojtek Krzanowski. Principles of multivariate analysis, vol- ume 23. OUP Oxford, 2000

work page 2000

[16] [16]

Unsu- pervised learning for human sensing using radio signals

Tianhong Li, Lijie Fan, Yuan Yuan, and Dina Katabi. Unsu- pervised learning for human sensing using radio signals. In Proceedings of the IEEE/CVF Winter Conference on Appli- cations of Computer Vision, pages 3288–3297, 2022

work page 2022

[17] [17]

Vi-ﬁ: Associating moving subjects across vision and wireless sensors

Hansi Liu, Abrar Alali, Mohamed Ibrahim, Bryan Bo Cao, Nicholas Meegan, Hongyu Li, Marco Gruteser, Shubham Jain, Kristin Dana, Ashwin Ashok, et al. Vi-ﬁ: Associating moving subjects across vision and wireless sensors

work page

[18] [18]

Lost and found! as- sociating target persons in camera surveillance footage with smartphone identiﬁers

Hansi Liu, Abrar Alali, Mohamed Ibrahim, Hongyu Li, Marco Gruteser, Shubham Jain, Kristin Dana, Ashwin Ashok, Bin Cheng, and Hongsheng Lu. Lost and found! as- sociating target persons in camera surveillance footage with smartphone identiﬁers. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, pages 49...

work page 2021

[19] [19]

Brings wi-ﬁ indoor positioning capabilities

Alliance W Wi-Fi CERTIFIED Location. Brings wi-ﬁ indoor positioning capabilities. Wi-Fi Alliance.[Online]. Avail-able: https://wi-ﬁ. org/news-events/newsroom/wi-ﬁ-certiﬁed- locationbrings-wi-ﬁ-indoor-positioning-capabilities, 2017

work page 2017

[20] [20]

Cdpam: Contrastive learning for perceptual au- dio similarity

Pranay Manocha, Zeyu Jin, Richard Zhang, and Adam Finkelstein. Cdpam: Contrastive learning for perceptual au- dio similarity. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 196–200. IEEE, 2021

work page 2021

[21] [21]

Who goes there? exploiting silhouettes and wearable signals for subject identiﬁcation in multi-person environments

Alessandro Masullo, Tilo Burghardt, Dima Damen, Toby Perrett, and Majid Mirmehdi. Who goes there? exploiting silhouettes and wearable signals for subject identiﬁcation in multi-person environments. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops , pages 0–0, 2019

work page 2019

[22] [22]

Person re-id by fusion of video silhouettes and wearable signals for home monitoring appli- cations

Alessandro Masullo, Tilo Burghardt, Dima Damen, Toby Perrett, and Majid Mirmehdi. Person re-id by fusion of video silhouettes and wearable signals for home monitoring appli- cations. Sensors, 20(9):2576, 2020

work page 2020

[23] [23]

Multi- modal cnn pedestrian classiﬁcation: a study on combining li- dar and camera data

Gledson Melotti, Cristiano Premebida, Nuno MM da S Gonc ¸alves, Urbano JC Nunes, and Diego R Faria. Multi- modal cnn pedestrian classiﬁcation: a study on combining li- dar and camera data. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC) , pages 3138–

work page 2018

[24] [24]

Tracking persons using particle ﬁlter fusing visual and wi-ﬁ localizations for widely distributed camera

Takashi Miyaki, Toshihiko Yamasaki, and Kiyoharu Aizawa. Tracking persons using particle ﬁlter fusing visual and wi-ﬁ localizations for widely distributed camera. In 2007 IEEE International Conference on Image Processing , volume 3, pages III–225. IEEE, 2007

work page 2007

[25] [25]

Performance measures and a data set for multi-target, multi-camera tracking

Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In European conference on computer vision, pages 17–35. Springer, 2016

work page 2016

[26] [26]

Multimodal detection and tracking of pedestrians in urban environments with explicit ground plane extraction

Luciano Spinello, Rudolph Triebel, and Roland Siegwart. Multimodal detection and tracking of pedestrians in urban environments with explicit ground plane extraction. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1823–1829. IEEE, 2008

work page 2008

[27] [27]

Pedestrian dead reckoning based on motion mode recognition using a smartphone

Boyuan Wang, Xuelin Liu, Baoguo Yu, Ruicai Jia, and Xingli Gan. Pedestrian dead reckoning based on motion mode recognition using a smartphone. Sensors, 18(6):1811, 2018

work page 2018

[28] [28]

Visually ﬁngerprinting humans without face recognition

He Wang, Xuan Bao, Romit Roy Choudhury, and Srihari Nelakuditi. Visually ﬁngerprinting humans without face recognition. In Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services , pages 345–358, 2015

work page 2015

[29] [29]

Wiﬁ and vision mul- timodal learning for accurate and robust device-free human activity recognition

Han Zou, Jianfei Yang, Hari Prasanna Das, Huihan Liu, Yuxun Zhou, and Costas J Spanos. Wiﬁ and vision mul- timodal learning for accurate and robust device-free human activity recognition. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition work- shops, pages 0–0, 2019

work page 2019