From image recognition to object recognition
The FusionNet deep learning architecture tackles three-dimensional objects with component networks that “think” in three dimensions.
With the advent of augmented reality and self-driving cars, 3D data is booming. In the near future, algorithms consuming 3D data will be used for applications like robot navigation or smart user interfaces based on augmented reality. Inspired by this, our recent paper from Matroid presents FusionNet, an architecture for classifying 3D CAD objects in a standard data set called the Princeton ModelNet.
At the heart of FusionNet are new three-dimensional Convolutional Neural Networks (CNNs) applied to 3D objects. We had to modify the traditional CNN in several ways before it could be applied here. For the sake of explanation, let’s first look at two-dimensional CNNs for image classification. The idea is that the machine learning researcher builds a model out of several layers, each handling connections from the previous layer in a different way. In the first layer, a window slides a patch across the two-dimensional image, and each patch becomes an input to that layer. This is called a convolutional layer because the patch “convolves” with the image: successive patches overlap as the window passes over it. Several other types of layers follow. The last layer has one neuron for each potential output, and each of those activations corresponds to a particular label that identifies the image. In the ModelNet40 Challenge data set there are 40 classes, and thus 40 neurons in the final layer. The first class might be a cat, the second a car, and so on for all 40 classes in the training data. If the first neuron fires the most out of the 40, the input is identified as belonging to the first class, a cat.
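To make the layer-by-layer picture concrete, here is a minimal 2D CNN classifier sketched in PyTorch. The layer sizes, input resolution, and framework are illustrative assumptions, not the networks used in FusionNet:

```python
import torch
import torch.nn as nn

class Tiny2DCNN(nn.Module):
    """A minimal 2D CNN sketch: two convolutional layers followed by a
    fully connected layer with 40 outputs, one neuron per class."""
    def __init__(self, num_classes=40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2),   # patch slides over the image
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # assumes 224x224 inputs

    def forward(self, x):
        x = self.features(x)          # spatially local features
        x = x.flatten(1)
        return self.classifier(x)     # one activation per class

logits = Tiny2DCNN()(torch.randn(1, 3, 224, 224))
predicted_class = logits.argmax(dim=1)   # index of the most active output neuron
```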
This all assumes the input is an image, i.e., two-dimensional. How do you extend that to three dimensions? One way would be to treat the object as an image by projecting it down to two dimensions, the same way your monitor shows three-dimensional objects, and then run a standard two-dimensional CNN over it. Indeed, the current leading submissions to the Princeton ModelNet Challenge use Convolutional Neural Networks on pixel representations, treating each 3D object as a set of 2D projections from several viewpoints. FusionNet does use CNNs on the pixel representation, but, crucially, it adds a new type of three-dimensional CNN into the mix as well.
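To see what the projection idea looks like in code, here is a toy sketch that rotates a 3D point set and rasterizes binary silhouettes from a few viewpoints. Real multi-view pipelines render the CAD mesh properly; the angles, resolution, and point-cloud stand-in are assumptions made only for illustration:

```python
import numpy as np

def silhouette(points, angle, size=64):
    """Rotate a 3D point set about the z-axis, then project it onto the x-y
    plane as a binary image. A toy stand-in for rendering views of a CAD model."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    xy = (points @ rot.T)[:, :2]                       # drop the depth coordinate
    xy = (xy - xy.min(0)) / (xy.max(0) - xy.min(0) + 1e-9)
    ij = np.clip((xy * (size - 1)).astype(int), 0, size - 1)
    img = np.zeros((size, size), dtype=np.uint8)
    img[ij[:, 1], ij[:, 0]] = 1                        # mark pixels hit by a point
    return img

points = np.random.rand(1000, 3)                       # toy "object"
views = [silhouette(points, a) for a in np.linspace(0.0, np.pi, 4, endpoint=False)]
```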
Instead of sliding a two-dimensional patch over a two-dimensional image, we can slide a three-dimensional volume over an object! In this representation, there is no need for a projection step. Treating an object this way is known as the volumetric representation of the object.
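For comparison with the 2D case, a single 3D convolutional layer slides a small volume over a voxel grid. The sketch below uses PyTorch with an assumed 3 x 3 x 3 kernel and channel counts chosen only for illustration:

```python
import torch
import torch.nn as nn

# One 3D convolution: a 3x3x3 volume slides over a 30x30x30 occupancy grid
# (the resolution described next). Channel counts are illustrative choices.
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
voxels = torch.zeros(1, 1, 30, 30, 30)   # batch of one binary voxel grid
features = conv3d(voxels)                # shape: (1, 8, 30, 30, 30)
```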
In our volumetric representation, the 3D object is discretized into a 30 x 30 x 30 grid of voxels; some example objects are shown in the accompanying figure. If any part of the object is present inside a 1 x 1 x 1 voxel, that voxel is assigned a 1, and 0 otherwise. Unlike previous work, we use both the pixel and the voxel representations to learn features of the objects and to classify the 3D CAD objects better than either representation manages in isolation.
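Here is a rough sketch of that occupancy encoding, using points sampled from the object’s surface as a stand-in for rasterizing the CAD mesh; the paper’s exact voxelization procedure may differ:

```python
import numpy as np

def voxelize(points, resolution=30):
    """Map 3D points into a binary occupancy grid:
    1 if any point falls inside a voxel, 0 otherwise."""
    mins, maxs = points.min(0), points.max(0)
    scaled = (points - mins) / (maxs - mins + 1e-9) * (resolution - 1)
    idx = scaled.astype(int)
    grid = np.zeros((resolution, resolution, resolution), dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

grid = voxelize(np.random.rand(5000, 3))   # toy point set; grid.shape == (30, 30, 30)
```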
We built two Convolutional Neural Networks to process voxel data (V-CNN I and V-CNN II) and one to process pixel data (MV-CNN). The next figure shows how all of these networks are combined to give a final decision on the object’s label. This fusion of representations is unlike a standard CNN on 2D images, which learns only spatially local features from the image.
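One simple way to combine the component networks’ outputs is to average their per-class probabilities, as in the sketch below. This late-fusion averaging is shown purely for illustration; FusionNet’s actual combination of the three networks may weight them differently:

```python
import torch
import torch.nn.functional as F

def fuse_predictions(logits_vcnn1, logits_vcnn2, logits_mvcnn):
    """Average class probabilities from the three component networks.
    Illustrative late fusion, not necessarily FusionNet's exact scheme."""
    probs = [F.softmax(l, dim=1) for l in (logits_vcnn1, logits_vcnn2, logits_mvcnn)]
    fused = torch.stack(probs).mean(dim=0)   # shape: (batch, 40)
    return fused.argmax(dim=1)               # predicted class per object
```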
As the basis for the 2D network, MV-CNN, we use a standard pre-trained neural network (AlexNet) and warm-start it on 2D projections of the 3D object. The pre-training of AlexNet was done on ImageNet, a large-scale database of 2D pixel images. Because of the pre-training, many of the features required for classifying 2D images need not be learned from scratch.
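A common way to warm-start from AlexNet is to load ImageNet weights and swap the final layer for a 40-way classifier before fine-tuning on the rendered views. The torchvision calls below are an assumption about tooling, not necessarily how the paper’s MV-CNN was built:

```python
import torch.nn as nn
from torchvision import models

# Load AlexNet pre-trained on ImageNet and swap in a 40-class output layer.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.classifier[6] = nn.Linear(4096, 40)   # last layer: 40 ModelNet classes
# Fine-tune on 2D projections of the 3D objects; earlier layers start from
# features already learned on ImageNet rather than from scratch.
```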
Here is a diagram illustrating the architecture of one of our Volumetric CNNs (V-CNN I). The architecture of V-CNN I was inspired by GoogLeNet, which uses inception modules. An inception module is a concatenation of outputs from kernels of different sizes; it helps the network learn features at different scales and puts them on the same footing for the convolution layer that follows the module.
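The sketch below shows the basic shape of an inception-style module in 3D: parallel convolutions with different kernel sizes whose outputs are concatenated along the channel dimension. The kernel sizes and channel counts are illustrative, not V-CNN I’s exact configuration:

```python
import torch
import torch.nn as nn

class Inception3D(nn.Module):
    """Parallel 3D convolutions with different kernel sizes, concatenated.
    Kernel sizes and channel counts are illustrative only."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv3d(in_ch, 8, kernel_size=1)
        self.branch3 = nn.Conv3d(in_ch, 8, kernel_size=3, padding=1)
        self.branch5 = nn.Conv3d(in_ch, 8, kernel_size=5, padding=2)

    def forward(self, x):
        # Each branch sees the object at a different scale; concatenation puts
        # the resulting features on the same footing for the next layer.
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)

out = Inception3D(1)(torch.zeros(1, 1, 30, 30, 30))   # shape: (1, 24, 30, 30, 30)
```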
In summary, FusionNet is a fusion of three convolutional neural networks: one trained on the pixel representation and two trained on voxelized objects. It exploits the strengths of each component network in order to improve classification performance. Each component network considers multiple views or orientations of each object before classifying it. While it is intuitive that multiple views of an object carry more information than a single view, it is not trivial to put that information together in a way that improves accuracy. We use information from 20 views for the pixel representation and 60 CAD object orientations for the voxel representation before predicting the object class. FusionNet outperforms the current leading submission on the Princeton ModelNet leaderboard for the 40-class data set, showcasing its discriminative power.
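As a rough sketch of how scores from many orientations of one object might be pooled before the final decision (the paper’s aggregation may differ), one can average class probabilities over all rotated copies; `model` here stands for any trained volumetric network:

```python
import torch
import torch.nn.functional as F

def classify_over_orientations(model, rotated_grids):
    """Run one network over every orientation of an object and average the scores.
    `rotated_grids` has shape (num_orientations, 1, 30, 30, 30)."""
    with torch.no_grad():
        probs = F.softmax(model(rotated_grids), dim=1)   # one row per orientation
    return probs.mean(dim=0).argmax().item()             # single predicted class
```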