Fast YOLO: A Fast You Only Look Once System for Real-time Embedded Object Detection in Video
Object detection is considered one of the most challenging problems in the field of computer vision, as it combines object classification and object localization within a scene. Recently, deep neural networks (DNNs) have been demonstrated to achieve superior object detection performance compared to other approaches, with YOLOv2 (an improved You Only Look Once model) being one of the state-of-the-art DNN-based object detection methods in terms of both speed and accuracy. Although YOLOv2 can achieve real-time performance on a powerful GPU, it remains very challenging to leverage this approach for real-time object detection in video on embedded computing devices with limited computational power and limited memory. In this paper, we propose a new framework called Fast YOLO, a fast You Only Look Once framework that accelerates YOLOv2 so that it can perform object detection in video on embedded devices in real time. First, we leverage the evolutionary deep intelligence framework to evolve the YOLOv2 network architecture and produce an optimized architecture (referred to as O-YOLOv2 here) that has 2.8X fewer parameters with just a ~2% IOU drop. To further reduce power consumption on embedded devices while maintaining performance, a motion-adaptive inference method is introduced into the proposed Fast YOLO framework to reduce the frequency of deep inference with O-YOLOv2 based on temporal motion characteristics. Experimental results show that the proposed Fast YOLO framework reduces the number of deep inferences by an average of 38.13% and achieves an average speedup of ~3.3X for object detection in video compared to the original YOLOv2, allowing Fast YOLO to run at an average of ~18FPS on an Nvidia Jetson TX1 embedded system.
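The idea of motion-adaptive inference, running the detector only when a frame has changed enough and otherwise reusing the previous result, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract does not say how motion is measured, so plain frame differencing is used here as a stand-in, and the names `mean_abs_diff`, `motion_adaptive_detect`, `detector`, and the `threshold` value are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[Sequence[int]]        # grayscale image as rows of pixel intensities
Detections = List[Tuple[str, float]]   # hypothetical (label, score) pairs


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute per-pixel difference between two equally sized frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count if count else 0.0


def motion_adaptive_detect(
    frames: Sequence[Frame],
    detector: Callable[[Frame], Detections],
    threshold: float = 5.0,  # assumed value, chosen only for illustration
) -> Tuple[List[Detections], int]:
    """Run `detector` only on frames that differ enough from the reference frame.

    Returns the per-frame detections and the number of deep inferences run.
    """
    results: List[Detections] = []
    inferences = 0
    reference = None
    cached: Detections = []
    for frame in frames:
        if reference is None or mean_abs_diff(frame, reference) > threshold:
            # Enough motion (or first frame): run the expensive detector.
            cached = detector(frame)
            reference = frame
            inferences += 1
        # Otherwise reuse the detections from the last inferred frame.
        results.append(cached)
    return results, inferences
```

In this sketch the reference frame is updated only when the detector actually runs, so slow cumulative drift eventually crosses the threshold and triggers a fresh inference. A real system would replace `detector` with a call into the detection network and tune the motion measure to the video source.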
Forward citations
Cited by 2 Pith papers
- Multi-Cue Vehicle Detection for Semantic Video Compression In Georegistered Aerial Videos
  A multi-cue pipeline combining deep learning appearance detection and flux tensor spatio-temporal filtering achieves high-precision moving vehicle detection in aerial videos while enabling over 100:1 semantic compression.
- Real-time Vision-based Depth Reconstruction with NVidia Jetson
  A comparison of FCNN architectures for monocular depth estimation yields a model suitable for real-time operation on NVidia Jetson hardware with evaluation in vSLAM.