Towards a real-time vehicle detection: SSD multibox approach

Vivek Yadav
Published in Chatbots Life · Jan 20, 2017

Overview

Over the past few weeks, I have been working on developing a real-time vehicle detection algorithm. During this process, I read several deep learning papers from arXiv. I believe the best way to learn something is to implement it yourself, so that you understand the small details you might overlook when reading the paper or someone else's code. Therefore, to better understand the underlying ideas and potential limitations, I implemented several deep learning architectures for object detection. In the coming weeks we will go over a few of these architectures; in this post, we will look at results from applying one of them, the single shot multibox detector (SSD) model.

Learnings from readings

I have found that most detection algorithms are variants of a relatively small set of ideas, which I classify into two categories: segmentation-based models and scale-based models. In segmentation-based models, we make a pixel-wise prediction to determine whether each pixel belongs to an object. The U-Net deep learning architecture is one example of such a segmentation model. In scale-based models, the main idea is to build a strong classifier and then pass patches of the image at different scales (resolutions) through the same classifier to obtain the probability that a patch contains an object. By repeating this process at different resolutions, objects of different sizes and aspect ratios can be detected. This is the underlying idea behind many image-pyramid based computer vision applications, and it has been adopted in deep neural networks as well. For example, in SPPNet a spatial pyramid pooling layer after the fifth convolutional block (conv5 in VGGNet) pools the feature maps at several different scales, flattens and concatenates the pooled outputs into a fixed-length vector, and passes it to the classifier. By combining features at different scales, SPPNet is able to identify objects of different sizes.

Spatial pyramid pooling layer in SPPNet
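To make this concrete, below is a minimal NumPy sketch of spatial-pyramid-style pooling (an illustration, not the original SPPNet code): the feature map is max-pooled into bins at several grid sizes, and the pooled bins are concatenated into a vector whose length does not depend on the input size.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (H, W, C) feature map into an n x n grid of bins for each
    level and concatenate the results into a single fixed-length vector."""
    h, w, c = feature_map.shape
    pooled = []
    for n in levels:
        # Bin edges for an n x n grid that covers the whole feature map.
        ys = np.linspace(0, h, n + 1, dtype=int)
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                bin_region = feature_map[ys[i]:ys[i + 1], xs[j]:xs[j + 1], :]
                pooled.append(bin_region.max(axis=(0, 1)))  # one (C,) vector per bin
    return np.concatenate(pooled)  # length = C * sum(n * n for n in levels)

# Feature maps of different spatial sizes produce the same output length.
print(spatial_pyramid_pool(np.random.rand(13, 13, 256)).shape)  # (5376,)
print(spatial_pyramid_pool(np.random.rand(17, 24, 256)).shape)  # (5376,)
```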

A framework that combines region proposals and pooling is the group of models called Region-based Convolutional Neural Networks (R-CNNs). There are three variants: R-CNN, Fast R-CNN, and Faster R-CNN. In R-CNN, a selective search algorithm is first used to identify potential object locations, and the image crops from these regions are passed through a pretrained convolutional neural network. Because each region is passed through the convolutional block separately, the computation time and overhead are high (around 50 s per frame). Fast R-CNN was developed to avoid this computationally expensive step. Note that the convolutional layers themselves place no constraint on the input image size, so a much more efficient approach is to pass the full-resolution image through the convolutional block once and project the selective-search proposals onto the resulting feature maps. Each proposal is then passed through an ROI-pooling layer that rescales its feature-map region to a fixed 7X7 size (as expected by the flattened layer) and uses it to make predictions. This reduced the time per image to about 2 s. Ironically, the computationally expensive part was now the region proposal step itself. To alleviate this issue, Faster R-CNN introduced a region proposal network (RPN) that places a set of 9 anchor boxes at each feature-map location and predicts whether an object is present there, along with the corresponding bounding box. Non-maximum suppression (NMS) is used to combine these boxes into a few selected region proposals, which are resized with the ROI-pooling layer and passed through the feedforward part of the network. This brought the detection speed up to 5–7 frames per second. However, 5–7 fps is still not real time, and passing every region proposal through the ROI-pooling and prediction layers remains computationally expensive.
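Since non-maximum suppression appears in nearly all of these detectors, here is a minimal NumPy sketch of the greedy NMS step (a simplified illustration rather than the exact implementation used in Faster R-CNN): boxes are processed in order of decreasing score, and any remaining box whose IoU with a kept box exceeds a threshold is discarded.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,).
    Returns indices of the boxes kept after greedy NMS."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]     # highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the selected box with all remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Keep only the boxes that do not overlap the selected box too much.
        order = order[1:][iou <= iou_threshold]
    return keep
```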

YOLO (You Only Look Once) tries to resolve some of these issues by directly predicting bounding boxes and class labels with a single network. In YOLO, a bounding box and class prediction is made for each cell of the final feature map, and non-maximum suppression is applied to obtain the final boxes. However, this final feature map is only 7X7 in shape, which amounts to dividing the original image into a coarse 7X7 grid and making one set of predictions per grid cell. This coarse grid makes it hard for YOLO to localize small objects precisely. Although YOLO runs very fast, at close to 45 fps (150 fps for the small version of YOLO), it has lower accuracy and detection rate than Faster R-CNN.
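As a rough illustration of what one prediction per grid cell means, the sketch below decodes a hypothetical 7X7 output tensor into image-space boxes; the exact output encoding and post-processing differ in the real YOLO implementation.

```python
import numpy as np

S, B, C = 7, 2, 20                 # grid size, boxes per cell, number of classes
IMG_W, IMG_H = 448, 448            # YOLO's input resolution

# Hypothetical network output: for each of the S*S cells,
# B boxes of (x, y, w, h, confidence) plus C class probabilities.
output = np.random.rand(S, S, B * 5 + C)

detections = []
for row in range(S):
    for col in range(S):
        cell = output[row, col]
        class_probs = cell[B * 5:]
        for b in range(B):
            x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
            # (x, y) are offsets within the cell; (w, h) are relative to the image.
            cx = (col + x) / S * IMG_W
            cy = (row + y) / S * IMG_H
            bw, bh = w * IMG_W, h * IMG_H
            score = conf * class_probs.max()
            detections.append((cx - bw / 2, cy - bh / 2, bw, bh, score))
# Non-maximum suppression would then be applied to `detections`.
```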

Single shot multibox detector

The final architecture, and the one named in the title of this post, is the Single Shot Multibox Detector (SSD). SSD addresses the low-resolution issue in YOLO by making predictions from feature maps taken at several different stages of the convolutional network, and it is as accurate as, and in some cases more accurate than, the state-of-the-art Faster R-CNN. The layers closer to the input image have higher resolution and therefore contribute many more candidate boxes; to keep the computation manageable, SSD uses atrous (dilated) convolutional layers. Atrous convolutions are inspired by the "algorithme à trous" in wavelet signal processing, where zeros ("holes") are inserted into the filters so that a larger receptive field is covered without extra computation. The figure below shows how features from different convolutional blocks are collected to form a multiscale (or multibox) detector. At each selected feature map, predictions are made for the class labels and bounding boxes. Because features are pooled from feature maps at different scales, the overall algorithm can detect objects of different sizes and is more accurate than Faster R-CNN. Further, since all predictions are made in a single pass, SSD is significantly faster than Faster R-CNN: on the VOC2007 data set, SSD runs at 59 fps with 74.3% mAP, compared with Faster R-CNN at 7 fps with 73.2% mAP and YOLO at 45 fps with 63.4% mAP.

Single Shot Multibox Detector vs YOLO
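To make the multiscale idea concrete, here is a hedged Keras-style sketch of how prediction heads can be attached to feature maps at different resolutions. The backbone, layer sizes, and number of default boxes per location are illustrative stand-ins, not the exact SSD configuration; the point is that each selected feature map gets a small convolutional predictor for class scores and box offsets, and the per-map outputs are concatenated before non-maximum suppression.

```python
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D,
                                     Reshape, Concatenate)
from tensorflow.keras.models import Model

num_classes = 21      # 20 VOC classes + background
boxes_per_loc = 4     # default boxes per feature-map location (illustrative)

def prediction_head(feature_map, name):
    """Predict class scores and box offsets at every location of one feature map."""
    cls = Conv2D(boxes_per_loc * num_classes, 3, padding='same',
                 name=name + '_cls')(feature_map)
    loc = Conv2D(boxes_per_loc * 4, 3, padding='same',
                 name=name + '_loc')(feature_map)
    return Reshape((-1, num_classes))(cls), Reshape((-1, 4))(loc)

inputs = Input(shape=(300, 300, 3))
x = Conv2D(64, 3, padding='same', activation='relu')(inputs)   # stand-in for a
x = MaxPooling2D(4)(x)                                          # VGG-style backbone
fmap1 = Conv2D(128, 3, padding='same', activation='relu')(x)            # 75x75 map
x = MaxPooling2D(2)(fmap1)
fmap2 = Conv2D(256, 3, padding='same', dilation_rate=2,                 # atrous conv,
               activation='relu')(x)                                    # 37x37 map

heads = [prediction_head(fmap1, 'head1'), prediction_head(fmap2, 'head2')]
all_cls = Concatenate(axis=1)([cls for cls, loc in heads])   # (total_boxes, 21)
all_loc = Concatenate(axis=1)([loc for cls, loc in heads])   # (total_boxes, 4)
model = Model(inputs, [all_cls, all_loc])
```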

Experiments:

Based on the results in the literature, we first tested how well the SSD model performs on vehicle detection in still images from Udacity's data set. For quick prototyping, we used a model pretrained on the VOC data set and replaced the last feedforward (prediction) layers with our own. This involved redefining the connections from all of the earlier convolutional layers to the new prediction layers. We then froze the weights in the first 4 convolutional blocks and retrained the model on Udacity's data.
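A hedged Keras-style sketch of this setup is shown below. The helper `build_ssd`, the weight-file name, and the `conv1`–`conv4` layer-name prefixes are placeholders for whatever SSD implementation is actually used, not calls from a specific library.

```python
# Hypothetical helpers: build the SSD graph with a new number of classes
# (here: car + background) and load the VOC-pretrained weights by layer name,
# skipping the prediction layers whose shapes no longer match.
model = build_ssd(input_shape=(300, 300, 3), num_classes=2)
model.load_weights('ssd300_voc_weights.h5', by_name=True, skip_mismatch=True)

# Freeze the first four convolutional blocks so that only the later layers
# and the new prediction heads are retrained on the Udacity data.
for layer in model.layers:
    if layer.name.startswith(('conv1', 'conv2', 'conv3', 'conv4')):
        layer.trainable = False
```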

Results:

The figure below presents representative data from Udacity's data set, where each car is annotated with a bounding box.

The figures below present predictions from the model after it was modified to detect only cars.

Performance before training

As expected, the performance of the network before retraining is poor. We next tested the performance after training for 32 epochs, using the Adam optimizer with a learning rate of 0.0001; a minimal training sketch is shown below, followed by the resulting vehicle and bounding-box predictions. The numbers in the boxes show how confident the model is that the object inside the box is a car.
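This is a minimal sketch of the corresponding training call, continuing the hypothetical Keras-style setup from the Experiments section; `ssd_loss` stands in for the multibox loss (localization plus confidence terms) and `train_generator` for a generator yielding preprocessed Udacity images with encoded box targets.

```python
from tensorflow.keras.optimizers import Adam

# Train only the unfrozen layers for 32 epochs with a small learning rate.
model.compile(optimizer=Adam(learning_rate=1e-4), loss=ssd_loss)
model.fit(train_generator, epochs=32)
```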

Bounding box predictions from SSD network

As the figures above show, the SSD model is able to detect vehicles with high accuracy. The prediction time for the SSD network alone was about 20 ms, or roughly 50 fps. This time increased when combined with the other parts of the code: making bounding-box predictions took 20 ms, loading and preprocessing images took about 40 ms, and drawing bounding boxes took 250–270 ms. Put together, the total processing time per frame was between 280 and 330 ms, resulting in a speed of 3–5 fps on a Titan X machine. This code was obviously not optimized with multithreading. A good solution could be to run the prediction pipeline as a server and run a separate drawing program. In the coming weeks we will modify our program to run the neural network and the drawing routines in parallel, and hopefully push the detection rate above 32 fps.
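One way to decouple prediction from drawing, as proposed above, is a simple producer-consumer pipeline: a prediction thread pushes frames and boxes into a queue while a separate routine draws them. A minimal Python sketch follows; `read_frame`, `predict_boxes`, and `draw_boxes` are placeholders for the corresponding pieces of the pipeline.

```python
import threading
import queue

results = queue.Queue(maxsize=8)   # frames + predicted boxes awaiting drawing

def prediction_worker():
    while True:
        frame = read_frame()                        # load + preprocess (~40 ms/frame)
        if frame is None:                           # end of the video stream
            results.put(None)
            break
        results.put((frame, predict_boxes(frame)))  # SSD prediction (~20 ms/frame)

def drawing_worker():
    while True:
        item = results.get()
        if item is None:
            break
        frame, boxes = item
        draw_boxes(frame, boxes)   # slow drawing (~250-270 ms) runs off the prediction thread

threading.Thread(target=prediction_worker, daemon=True).start()
drawing_worker()                   # run drawing in the main thread
```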

Reflections

This was a very interesting project, as I got to play with a complex architecture. I was able to achieve only 3–5 fps, not the roughly 45 fps reported in the research papers. The lower frame rate, however, was due to the fact that I was loading the image, preprocessing it, making predictions, and drawing on it all sequentially. As the drawing routine is the slowest step, a good solution is to run a prediction server in parallel with the drawing commands. I was surprised to see a roughly 10X drop in performance when the drawing routines ran in series with prediction. In the coming weeks, I will work to separate the prediction routine from the drawing routines so that more frames per second can be processed.

Update

Since I wrote this post, I have switched from matplotlib-style plotting to OpenCV's (cv2) drawing routines and now get a detection rate of 40–50 fps.
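For reference, the faster OpenCV drawing path looks roughly like the sketch below: boxes and confidence scores are drawn directly onto the image array with `cv2.rectangle` and `cv2.putText`, avoiding the overhead of building a new matplotlib figure for every frame.

```python
import cv2

def draw_detections(image, detections):
    """detections: list of (x1, y1, x2, y2, score) in pixel coordinates."""
    for x1, y1, x2, y2, score in detections:
        cv2.rectangle(image, (int(x1), int(y1)), (int(x2), int(y2)),
                      color=(0, 255, 0), thickness=2)
        cv2.putText(image, f'car: {score:.2f}', (int(x1), int(y1) - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return image
```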

