Pedestrian Recognition

The goal of this project is to build a system to recognize pedestrians in static images.

The approach (and some of the images here) is largely based on the following papers:

- Oren, Papageorgiou, Sinha, Osuna, Poggio. Pedestrian Detection Using Wavelet Templates. CVPR 1997.
- Papageorgiou and Poggio. Trainable Pedestrian Detection. International Conference on Image Processing 1999.

This document is essentially a writeup of this PowerPoint presentation: 16899project.ppt

Motivation

A system for recognizing pedestrians could be implemented as a warning system inside vehicles, alerting the driver when pedestrians are near the vehicle. This is one possible application of the work described here, but many issues would need to be solved before it could really be deployed this way, and we do not deal with them here. We focus instead on the ability of a computer to automate such a recognition process.

A system for identifying the template or outline of a human could also be useful for Valerie, the roboceptionist in the front lobby of Newell-Simon Hall. Currently, Valerie detects humans by detecting motion. Even if there is only a small amount of motion (e.g., a person walks by hurriedly), it will try to greet that person. This is not very human-like. If it could detect a person standing in front of the booth, it could instead greet only those who actually stop in front of the booth for a while.

Overview

We start with positive and negative samples of what a pedestrian looks like (examples below). Negative samples can be anything that does not contain a pedestrian. The pedestrians are cut out and rescaled so that they approximately match in size. The positive images here come from a database available on the web. There is much variety in body pose, background, color, and texture in the pedestrian (positive) samples. Given these samples, we extract features from each image. Each image is one instance, with a number of features and a classification (pedestrian or no pedestrian). This data forms the basis for the classifier that we built.

With the classifier, we can process a new image. We "cut out" smaller overlapping images from the new, larger image and run these through the classifier, which identifies the ones that contain pedestrians. We may have to rescale the new image and repeat this process, since the people in it need to be of a similar size to those in the database.
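The cutting-out step can be sketched as a sliding window. This is a minimal sketch rather than our actual code: the function names, the list-of-rows image representation, and the `classify` callback are all assumptions made for illustration, and rescaling the image between passes is left out.

```python
def sliding_windows(width, height, win_w, win_h, step):
    """Yield the (x, y) top-left corner of each overlapping window."""
    for y in range(0, height - win_h + 1, step):
        for x in range(0, width - win_w + 1, step):
            yield x, y


def crop(image, x, y, win_w, win_h):
    """Cut a win_w x win_h sub-image out of an image stored as a list of rows."""
    return [row[x:x + win_w] for row in image[y:y + win_h]]


def detect(image, classify, win_w, win_h, step):
    """Return the corners of all windows that classify() labels as pedestrian."""
    height, width = len(image), len(image[0])
    return [(x, y)
            for x, y in sliding_windows(width, height, win_w, win_h, step)
            if classify(crop(image, x, y, win_w, win_h))]


# Toy usage: a 4 x 8 image whose right half is bright, and a hypothetical
# stand-in "classifier" that fires only on windows that are entirely bright.
toy_image = [[0, 0, 0, 0, 1, 1, 1, 1] for _ in range(4)]
all_bright = lambda window: all(all(row) for row in window)
hits = detect(toy_image, all_bright, win_w=4, win_h=4, step=2)
```

Choosing a step smaller than the window size is what makes neighboring windows overlap.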

Details of Algorithm

For each image, we compute these wavelet templates, which are then used as features in the classification problem.

Wavelet Template

In the figure below, we demonstrate how we compute the coefficients (represented by the grayscale template on the right) from an image. We lay a vertical wavelet over overlapping blocks of pixels in the image (as shown in the figure). The vertical wavelet is used as a convolution mask: applying it to one block of pixels produces one coefficient. The darker the color in the template, the higher the coefficient value. For RGB images, we compute the coefficient for each channel and take the largest absolute value.

This vertical wavelet identifies "vertical color differences" in the image. If the wavelet is applied to a block of pixels that all have a similar color, the coefficient will be relatively small. But if the block has, say, red pixels on its left side and green pixels on its right side, the coefficient will be relatively large. In other words, the coefficient will be large if there is a vertical edge in the block of pixels. This allows the template (seen above) to approximately identify the vertical edges of the pedestrian in the image. The higher values, or darker colors, mark the places where there are vertical edges. The template shown in the above figure is averaged over many samples; hence it is fairly symmetrical and looks very generic.
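As a minimal sketch of this computation, assume the vertical wavelet is a Haar-like mask with -1 over the left half of the block and +1 over the right half, and that the response is divided by the block area. The mask values, the normalization, and the function names are assumptions for illustration; this writeup does not specify them.

```python
def vertical_coefficient(block):
    """Right half minus left half of a block (list of rows, even width),
    divided by the number of pixels in the block."""
    half = len(block[0]) // 2
    left = sum(sum(row[:half]) for row in block)
    right = sum(sum(row[half:]) for row in block)
    return (right - left) / (len(block) * len(block[0]))


def rgb_coefficient(red, green, blue):
    """Largest absolute response over the three color channels of one block."""
    return max(abs(vertical_coefficient(channel))
               for channel in (red, green, blue))


# A block of uniform color gives a zero coefficient.
flat = [[5, 5, 5, 5] for _ in range(4)]
flat_value = vertical_coefficient(flat)

# Red on the left, green on the right: a vertical edge in two channels.
red = [[255, 255, 0, 0] for _ in range(4)]
green = [[0, 0, 255, 255] for _ in range(4)]
blue = [[0, 0, 0, 0] for _ in range(4)]
edge_value = rgb_coefficient(red, green, blue)
```

Here `flat_value` is zero while `edge_value` is large, which matches the red/green example in the text.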

More Wavelet Templates

The above procedure is repeated for 2 different wavelet sizes (scales) and 3 types of wavelets (vertical, horizontal, and diagonal). For each image, this gives a set of templates that look like the ones below.
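The three wavelet types can be sketched as +1/-1 masks that differ only in how the block is split. This is an illustrative sketch under assumptions: Haar-like masks, responses stored as absolute values, and a shift of half the wavelet size between blocks. None of these choices, nor the function names, are stated in this writeup.

```python
def haar_mask(kind, size):
    """Build a size x size mask of +1/-1 entries for one wavelet type."""
    half = size // 2

    def sign(r, c):
        if kind == "vertical":      # left half vs. right half
            return -1 if c < half else 1
        if kind == "horizontal":    # top half vs. bottom half
            return -1 if r < half else 1
        if kind == "diagonal":      # opposite quadrants share a sign
            return 1 if (r < half) == (c < half) else -1
        raise ValueError(f"unknown wavelet type: {kind}")

    return [[sign(r, c) for c in range(size)] for r in range(size)]


def template(image, kind, size, step):
    """Grid of absolute mask responses over overlapping blocks of the image."""
    mask = haar_mask(kind, size)
    height, width = len(image), len(image[0])

    def response(x, y):
        total = sum(mask[r][c] * image[y + r][x + c]
                    for r in range(size) for c in range(size))
        return abs(total / (size * size))

    return [[response(x, y) for x in range(0, width - size + 1, step)]
            for y in range(0, height - size + 1, step)]


# Toy 4 x 4 image with a single vertical edge down the middle.
edge_image = [[0, 0, 8, 8] for _ in range(4)]
templates = {(kind, size): template(edge_image, kind, size, step=size // 2)
             for kind in ("vertical", "horizontal", "diagonal")
             for size in (2, 4)}
```

On this toy image only the vertical templates respond; the horizontal and diagonal ones are zero everywhere.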

Features

The coefficients in the templates represent the feature values. So each image is one instance with 1326 features and one classification. We repeat the same procedure for negative samples.
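The count of 1326 can be checked with a little arithmetic. The image size, wavelet sizes, and overlap are not stated in this writeup; the values below (128 x 64 pixel samples, 16 and 32 pixel wavelets, a shift of one quarter of the wavelet size) are assumptions taken from the setup we recall from the cited papers, and they are one configuration that reproduces the number.

```python
def positions(length, size, step):
    """Number of block positions of a given size along one image dimension."""
    return (length - size) // step + 1


def feature_count(width, height, wavelet_sizes, wavelet_types=3):
    """Total coefficients over all wavelet sizes and types (assumed layout)."""
    per_type = 0
    for size in wavelet_sizes:
        step = size // 4  # assumed shift: a quarter of the wavelet size
        per_type += positions(width, size, step) * positions(height, size, step)
    return wavelet_types * per_type


# Assumed configuration: 64 wide, 128 tall, wavelets of 16 and 32 pixels.
total = feature_count(width=64, height=128, wavelet_sizes=(16, 32))
```

Under these assumptions the 16 pixel wavelet fits at 13 x 29 positions and the 32 pixel wavelet at 5 x 13 positions, so `total` is 3 x (377 + 65) = 1326.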

Test Case

We have a training database of 282 positive samples and 236 negative samples. For testing, we have 20 positive samples and 20 negative samples. Here are some example positive images:

And here are some example negative images. We cut out smaller overlapping images that have the same size as the positive ones.

Results

We tried a nearest neighbor classifier, provided by the WEKA software (search the web for "weka" for more information). We get 95% accuracy on the 40 test images, with 2 false positives:
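For readers unfamiliar with the method, here is a minimal sketch of a nearest neighbor classifier. It is not WEKA's implementation or API: the function name, the Euclidean distance, and the majority vote are assumptions chosen to illustrate the idea, and the toy feature vectors have 2 values instead of 1326.

```python
import math
from collections import Counter


def knn_classify(training, query, k=1):
    """Label `query` by majority vote among its k nearest training samples.

    training: list of (feature_vector, label) pairs.
    """
    nearest = sorted(training, key=lambda sample: math.dist(sample[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy database: two negative and three positive samples.
training = [
    ((0.0, 0.0), "no pedestrian"),
    ((0.0, 1.0), "no pedestrian"),
    ((5.0, 5.0), "pedestrian"),
    ((5.0, 6.0), "pedestrian"),
    ((6.0, 5.0), "pedestrian"),
]
label_a = knn_classify(training, (5.0, 4.0), k=3)
label_b = knn_classify(training, (0.0, 0.4), k=1)
```

There is no training step beyond storing the samples, which is why adding new samples to the database later is cheap.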

We also tried a decision tree classifier (C4.5 algorithm). We get 90% accuracy, with 3 false positives and 1 false negative:

It's difficult to understand intuitively why these were classified incorrectly.

10-fold Cross Validation

We took the whole set of data (302 positives and 256 negatives) and did a 10-fold cross validation. The nearest neighbor classifier has a 94.27% accuracy, with 30 false positives and 2 false negatives. The decision tree classifier has an 86.74% accuracy, with 47 false positives and 27 false negatives.
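The cross validation procedure can be sketched as follows. This is an illustration only: we used WEKA for the actual runs, and the fold assignment here (every tenth sample, no shuffling or stratification) and all function names are assumptions that may differ from what WEKA does internally.

```python
def k_fold_indices(n_samples, n_folds=10):
    """Split range(n_samples) into n_folds test folds of nearly equal size."""
    return [list(range(i, n_samples, n_folds)) for i in range(n_folds)]


def cross_validate(samples, predict, n_folds=10):
    """Return (false_positives, false_negatives) summed over all folds.

    samples: list of (features, is_pedestrian) pairs.
    predict(training_samples, features) -> bool.
    """
    false_pos = false_neg = 0
    for fold in k_fold_indices(len(samples), n_folds):
        held_out = set(fold)
        training = [s for i, s in enumerate(samples) if i not in held_out]
        for i in fold:
            features, is_pedestrian = samples[i]
            predicted = predict(training, features)
            if predicted and not is_pedestrian:
                false_pos += 1
            elif is_pedestrian and not predicted:
                false_neg += 1
    return false_pos, false_neg


# Toy check: a predictor that always answers "pedestrian" turns every
# negative sample into a false positive and produces no false negatives.
toy = [((float(i),), i % 2 == 0) for i in range(20)]
toy_errors = cross_validate(toy, lambda training, features: True)
```

Each sample is tested exactly once, by a classifier built from the other nine folds.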

Incremental Bootstrapping

Given the results we had, we decided to proceed with the nearest neighbor (k = 5) classifier. The biggest problem was the large number of false positives, so we did incremental bootstrapping:

The main idea here is that we take new images and run them through the classifier. Any images that are false positives are then added to the database that forms the classifier. This adds more negative samples to the database, without adding unnecessary ones that the existing database would already have classified as negative. We took the database of 558 total samples and performed incremental bootstrapping with new images. Here is an example of a new image, and one instance that was incorrectly classified as positive. This one is interesting because there seems to be a (very dim) outline of a pedestrian there:

We performed bootstrapping with about 10 images. This process added about 100 negative samples, for a total of 656 samples. The results that follow use these 656 samples in the classifier.
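The bootstrapping loop can be sketched as below. This is a sketch under assumptions: it takes windows that are known to contain no pedestrian, and it adds each false positive to the database immediately, so later windows are classified against the enlarged database. The function names and the toy one-dimensional classifier are hypothetical.

```python
def bootstrap(database, classify, pedestrian_free_windows):
    """Add every false positive to the database as a new negative sample.

    database: list of (features, is_pedestrian) pairs, extended in place.
    classify(database, features) -> bool.
    Returns the number of samples added.
    """
    added = 0
    for features in pedestrian_free_windows:
        if classify(database, features):          # wrongly called a pedestrian
            database.append((features, False))    # store it as a negative
            added += 1
    return added


def nearest_label(database, features):
    """Toy 1-nearest-neighbor stand-in for the real classifier."""
    return min(database, key=lambda s: abs(s[0][0] - features[0]))[1]


# Toy database with one negative (0.0) and one positive (10.0) sample.
database = [((0.0,), False), ((10.0,), True)]
added = bootstrap(database, nearest_label, [(9.0,), (8.5,), (1.0,)])
```

In the toy run only the first window is added: once it is in the database, the second window is already classified as negative, and the third was never misclassified.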

Result of Bootstrapping

We took 2 completely new images and tested whether bootstrapping helped. We cut out smaller images from these new images and ran them through the classifier twice: once using the original 558 samples, and once using the 656 samples after bootstrapping. For the first image, the results before bootstrapping were 85.06% accuracy, 65 false positives, and 0 false negatives; after bootstrapping they were 90.11% accuracy, 43 false positives, and 0 false negatives. For the second image, the results before bootstrapping were 75.86% accuracy, 100 false positives, and 5 false negatives; after bootstrapping they were 81.15% accuracy, 77 false positives, and 5 false negatives. So bootstrapping decreased the number of false positives on both images without adding any false negatives.
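The accuracy figures follow directly from the error counts. The number of smaller images cut from each test image is not stated here; 435 is the value that is consistent with all four reported accuracies, so it should be read as an inference rather than a reported number. The function name is hypothetical.

```python
def accuracy(n_windows, false_pos, false_neg):
    """Percentage of windows classified correctly."""
    return 100.0 * (n_windows - false_pos - false_neg) / n_windows


N_WINDOWS = 435  # inferred from the reported percentages, not stated in the text

first_before = round(accuracy(N_WINDOWS, 65, 0), 2)
first_after = round(accuracy(N_WINDOWS, 43, 0), 2)
second_before = round(accuracy(N_WINDOWS, 100, 5), 2)
second_after = round(accuracy(N_WINDOWS, 77, 5), 2)
```

The same formula reproduces the cross validation figures: with 558 samples, 30 false positives and 2 false negatives give 94.27%, and 47 false positives and 27 false negatives give 86.74%.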

Results of Larger Test Images

This picture was split up into 560 images, and about 30 of them were classified as positive.

Here are some of those incorrectly classified as positive:

And here are the ones correctly classified as positive:

And here are more results. The red rectangles mark the places where the system thought there was a pedestrian.

Fewer Features

One idea that we did not implement is to use a smaller number of features (the 1326 that we used may seem unnecessarily large). We could have taken the templates calculated for each positive image, and averaged the corresponding coefficients:

We can then pick those coefficients that are lightest or darkest in the above figure. The darkest coefficients mean that there was a significant color difference (in the vertical, horizontal, or diagonal direction) within the corresponding block of pixels. The lightest coefficients mean that the colors of the pixels within the corresponding block were similar. The coefficients at these two extremes (lightest or darkest) are the ones that can be used to distinguish the classification. Hence we could have picked features by looking at this average template, and used a much smaller number of them. Fewer features would allow for faster classification, which is more appropriate for real-time applications. However, this may or may not decrease the accuracy, and some experiments in the papers cited above suggest that it might. So we did not implement this, and instead tried to get the best classification accuracy that we could.
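Since we did not implement this idea, the following is only a sketch of how the selection could look. The function names and the choice to keep equal numbers of lightest and darkest coefficients are assumptions, and `n_keep` is assumed to be smaller than the number of features.

```python
def average_template(templates):
    """Average each coefficient over all positive samples.

    templates: list of equal-length coefficient lists, one per image.
    """
    count = len(templates)
    return [sum(t[i] for t in templates) / count
            for i in range(len(templates[0]))]


def select_extreme_features(templates, n_keep):
    """Indices of the coefficients with the smallest and largest averages."""
    average = average_template(templates)
    order = sorted(range(len(average)), key=average.__getitem__)
    n_lightest = n_keep // 2
    n_darkest = n_keep - n_lightest
    return sorted(order[:n_lightest] + order[len(order) - n_darkest:])


# Toy data: two positive samples with four coefficients each.
toy_templates = [[0.1, 5.0, 2.0, 9.0],
                 [0.3, 5.2, 2.2, 8.0]]
kept = select_extreme_features(toy_templates, n_keep=2)
```

The selected indices would then be used to shrink every feature vector, positive and negative, before building the classifier.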

Conclusions

Our system detects the pedestrians in new test images well. However, it still produces many false positives. We believe that more bootstrapping on additional images would decrease the number of false positives. We also think that it is a good idea to build a system that is adapted for use within a specific area (see limitations below).

Limitations

Our system essentially recognizes a template/outline of a pedestrian. There may be other objects in the environment that have similar templates. This will cause our system to fail.

It is difficult to define what a negative sample is. We could have picked any images without pedestrians. It might be better to build a system for use in a specific environment. If we know a system will only be used within a certain area, we can take many negative images of this area. Such a system would be adapted for use in that area, and would potentially have fewer false positives.

Our system requires the pedestrians to be seen completely by the system. It will fail if the pedestrians are partially occluded (examples in the figures below).

Questions? Contact Manfred Lau