Karen Simonyan and Andrew Zisserman

Overview

Convolutional networks (ConvNets) currently set the state of the art in visual recognition. The aim of this project is to investigate how ConvNet depth affects accuracy in the large-scale image recognition setting.

Our main contribution is a rigorous evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers, substantially deeper than previously used. To limit the number of parameters in such very deep networks, we use very small 3×3 filters in all convolutional layers (with the convolution stride set to 1). Please see our publication for more details.
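For illustration, the 16-layer configuration (D) can be sketched in PyTorch as follows. This is an independent re-implementation for clarity, not our original training code: thirteen 3×3 convolutional layers with stride 1, interleaved with 2×2 max-pooling, followed by three fully connected layers.

```python
import torch.nn as nn

def make_vgg16(num_classes=1000):
    """Configuration D: thirteen 3x3 conv layers (stride 1, padding 1)
    plus three fully connected layers; 'M' marks 2x2 max-pooling."""
    cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
           512, 512, 512, "M", 512, 512, 512, "M"]
    layers, in_ch = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            # 3x3 filters with stride 1; padding 1 preserves spatial resolution
            layers.append(nn.Conv2d(in_ch, v, kernel_size=3, stride=1, padding=1))
            layers.append(nn.ReLU(inplace=True))
            in_ch = v
    # For a 224x224 input, the feature map at this point is 512 x 7 x 7
    classifier = nn.Sequential(
        nn.Flatten(),
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, num_classes),
    )
    return nn.Sequential(*layers, classifier)
```

The motivation for the small filters: a stack of two 3×3 layers has a 5×5 effective receptive field, and a stack of three has 7×7, while using fewer parameters and more non-linearities than a single layer with a large filter.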

Results

ImageNet Challenge

The very deep ConvNets were the basis of our ImageNet ILSVRC-2014 submission, where our team (VGG) secured first place in the localisation task and second place in the classification task. After the competition, we further improved our models, which has led to the ImageNet classification results reported in our publication.

Generalisation

Very deep models generalise well to other datasets: a combination of multi-scale convolutional features and a linear SVM matches or outperforms more complex recognition pipelines built around shallower features. Our results on the PASCAL VOC and Caltech image classification benchmarks are reported in our publication; a simplified sketch of this recipe is given below.
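The sketch below is a rough illustration of the recipe, not our exact pipeline (which uses the dense multi-scale evaluation and feature aggregation described in the publication). It pools convolutional features of a pre-trained VGG-16 over several scales and trains a linear SVM on top; the torchvision model and scikit-learn classifier are assumptions of convenience rather than our original tooling.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from torchvision.models import vgg16, VGG16_Weights

model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).eval()
normalise = transforms.Normalize(mean=[0.485, 0.456, 0.406],   # torchvision's
                                 std=[0.229, 0.224, 0.225])    # ImageNet stats
to_tensor = transforms.ToTensor()

@torch.no_grad()
def describe(img, scales=(224, 288, 384)):
    """512-d descriptor for a PIL image: convolutional features,
    globally average-pooled at several scales, averaged across scales,
    then L2-normalised (a simplification of the paper's aggregation)."""
    feats = []
    for s in scales:
        x = normalise(to_tensor(img.resize((s, s)))).unsqueeze(0)
        fmap = model.features(x)             # 1 x 512 x (s/32) x (s/32)
        feats.append(fmap.mean(dim=(2, 3)))  # global average pooling
    f = torch.stack(feats).mean(dim=0)       # average over scales
    return F.normalize(f, dim=1).squeeze(0).numpy()

# With labelled images (imgs, labels), a linear SVM is trained on top, e.g.:
#   from sklearn.svm import LinearSVC
#   clf = LinearSVC(C=1.0).fit([describe(im) for im in imgs], labels)
```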


Models

We release our two best-performing models, with 16 and 19 weight layers (denoted as configurations D and E in the publication). The models are released under the Creative Commons Attribution License. Please cite our technical report if you use the models.

The models are compatible with the Caffe toolbox. They are available in the Caffe format from the Caffe Model Zoo (model repository), or directly from the following links:

The models can also be used with the MatConvNet toolbox. They are available in the MatConvNet format from the corresponding model repository.
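As an example of using the Caffe release, the snippet below follows the standard pycaffe pattern. The file names are those of the 16-layer model as listed in the Model Zoo (adjust them to your download), and the subtracted BGR channel means (103.939, 116.779, 123.68) are the training-set means the models expect.

```python
import numpy as np
import caffe

# Model definition and weights as named in the Model Zoo listing;
# adjust the paths to match your download
net = caffe.Net("VGG_ILSVRC_16_layers_deploy.prototxt",
                "VGG_ILSVRC_16_layers.caffemodel",
                caffe.TEST)

def classify(image_bgr):
    """image_bgr: 224 x 224 x 3 float array, BGR channel order, values in [0, 255]."""
    # Subtract the per-channel training-set mean (BGR order)
    x = image_bgr - np.array([103.939, 116.779, 123.68])
    x = x.transpose(2, 0, 1)[np.newaxis, ...]   # to 1 x 3 x 224 x 224
    net.blobs["data"].reshape(*x.shape)
    net.blobs["data"].data[...] = x
    probs = net.forward()["prob"][0]            # 1000-way softmax output
    return probs.argmax()
```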

Please note that the aforementioned ConvNet toolboxes might not include a ready-made implementation of our dense multi-scale evaluation procedure, so image classification results obtained with them may differ from ours. The sketch below illustrates the idea behind dense evaluation.
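In outline, dense evaluation converts the fully connected layers into convolutions (fc6 becomes a 7×7 convolution; fc7 and fc8 become 1×1 convolutions), applies the resulting fully convolutional network to the whole rescaled image, and averages the class score map spatially. A minimal PyTorch sketch of the conversion, assuming the torchvision VGG-16 layout rather than our released models:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

def to_fully_convolutional(model):
    """Copy the FC weights of a (torchvision-layout) VGG-16 into
    equivalent convolutions: fc6 -> 7x7 conv, fc7/fc8 -> 1x1 convs."""
    fc6, fc7, fc8 = model.classifier[0], model.classifier[3], model.classifier[6]
    conv6, conv7, conv8 = (nn.Conv2d(512, 4096, 7),
                           nn.Conv2d(4096, 4096, 1),
                           nn.Conv2d(4096, 1000, 1))
    with torch.no_grad():
        conv6.weight.copy_(fc6.weight.view(4096, 512, 7, 7)); conv6.bias.copy_(fc6.bias)
        conv7.weight.copy_(fc7.weight.view(4096, 4096, 1, 1)); conv7.bias.copy_(fc7.bias)
        conv8.weight.copy_(fc8.weight.view(1000, 4096, 1, 1)); conv8.bias.copy_(fc8.bias)
    # The fixed-size average pool of the torchvision model is deliberately
    # omitted, so the network accepts whole images of any size >= 224
    return nn.Sequential(model.features, conv6, nn.ReLU(inplace=True),
                         conv7, nn.ReLU(inplace=True), conv8)

@torch.no_grad()
def dense_scores(fcn, image):
    """image: 1 x 3 x H x W normalised tensor, H, W >= 224. Returns 1000
    class scores averaged over all positions of the dense score map."""
    return fcn(image).mean(dim=(2, 3)).squeeze(0)   # map is 1 x 1000 x h x w

# fcn = to_fully_convolutional(vgg16(weights=VGG16_Weights.IMAGENET1K_V1).eval())
```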

Publications


K. Simonyan, A. Zisserman
Very Deep Convolutional Networks for Large-Scale Image Recognition
International Conference on Learning Representations, 2015

Acknowledgements

This work was supported by ERC grant VisRec no. 228180. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.
