|LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), pp.2278-2324.
|Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. Advances in neural information processing systems, 25.
||Introduced ReLU activations and dropout regularization to deep CNNs
|Simonyan, K. and Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
||Used a large number of small (3×3) filters in each layer to learn complex features
|Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A., 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).
||Introduced Inception Modules consisting of multiple parallel convolutional layers, designed to recognize different features at multiple scales
|Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. and Wojna, Z., 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818-2826).
||Design optimizations of the Inception modules that improved performance and accuracy
|He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
||Introduced residual connections, which are shortcuts that bypass one or more layers in the network.
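The shortcut idea can be sketched in a few lines of NumPy. This is a toy dense-layer version for illustration only, not the paper's convolutional blocks; the layer sizes are arbitrary:

```python
import numpy as np

def layer(x, w):
    """A toy nonlinear layer: linear map followed by ReLU."""
    return np.maximum(0, x @ w)

def residual_block(x, w1, w2):
    """Two stacked layers whose output is added back to the input.
    The identity shortcut lets gradients bypass the layers."""
    return x + layer(layer(x, w1), w2)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1, w2 = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
y = residual_block(x, w1, w2)

# With zero weights the block collapses to the identity mapping,
# which is why residual networks are easy to optimize when deep.
assert np.allclose(residual_block(x, np.zeros((8, 8)), np.zeros((8, 8))), x)
```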
|Szegedy, C., Ioffe, S., Vanhoucke, V. and Alemi, A., 2017, February. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI conference on artificial intelligence (Vol. 31, No. 1).
||Hybrid approach combining Inception modules with ResNet-style residual connections
|Huang, G., Liu, Z., Van Der Maaten, L. and Weinberger, K.Q., 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700-4708).
||Each layer receives input from all previous layers, creating a dense pattern of connections that allows the network to learn more diverse features
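Because every layer's output is concatenated onto the running feature map, the input width grows linearly with depth. A small sketch of that bookkeeping, using the common 64-channel stem and growth rate 32 as illustrative numbers:

```python
def densenet_channels(c0, growth_rate, n_layers):
    """Track the input width of each layer in a dense block.
    Every layer emits growth_rate new channels, and each layer's
    input is the concatenation of all earlier outputs."""
    widths = []
    c = c0
    for _ in range(n_layers):
        widths.append(c)   # channels this layer sees as input
        c += growth_rate   # concatenation grows the feature map
    return widths, c

widths, final = densenet_channels(64, 32, 6)
# widths == [64, 96, 128, 160, 192, 224]; after the block, 256 channels
```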
|Chollet, F., 2017. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1251-1258).
||Based on Inception-v3, but uses depthwise separable convolutions instead of Inception modules
|Xie, S., Girshick, R., Dollár, P., Tu, Z. and He, K., 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1492-1500).
||Built over ResNet, introduces the concept of grouped convolutions, where the filters in a convolutional layer are divided into multiple groups
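The saving from grouping is easy to see by counting weights. A back-of-the-envelope sketch; the 256-channel layer and 32 groups are illustrative choices (32 is the cardinality used in the paper's main configuration), not a full model:

```python
def conv_params(k, c_in, c_out, groups=1):
    """Weight count of a k×k convolution with c_out filters.
    With g groups, each filter sees only c_in // groups input
    channels, so the count shrinks by a factor of g."""
    assert c_in % groups == 0 and c_out % groups == 0
    return k * k * (c_in // groups) * c_out

standard = conv_params(3, 256, 256)            # 589_824 weights
grouped = conv_params(3, 256, 256, groups=32)  # 18_432 weights (32x fewer)
```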
|Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M. and Adam, H., 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
||Uses depthwise separable convolutions to reduce the number of parameters and computation required
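The factorization replaces one k×k convolution with a per-channel k×k depthwise step plus a 1×1 pointwise step. Counting parameters shows where the savings come from; the 128-to-256-channel sizes below are illustrative, not from the paper:

```python
def standard_conv_params(k, c_in, c_out):
    """One k×k filter bank mixing all channels at once."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """One k×k spatial filter per input channel (depthwise),
    then a 1×1 pointwise convolution to mix channels."""
    return k * k * c_in + c_in * c_out

std = standard_conv_params(3, 128, 256)        # 294_912 weights
sep = depthwise_separable_params(3, 128, 256)  # 33_920 weights, ~8.7x fewer
```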
|Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. and Chen, L.C., 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4510-4520).
||Built upon the MobileNetV1 architecture; uses inverted residuals and linear bottlenecks
|Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V. and Le, Q.V., 2019. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 1314-1324).
||Uses hardware-aware neural architecture search (AutoML) to find the best possible network architecture for a given problem
|Tan, M. and Le, Q., 2019, May. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105-6114). PMLR.
||Uses a compound scaling method to scale the network’s depth, width, and resolution to achieve a high accuracy with a relatively low computational cost
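The compound rule scales depth, width, and resolution together by α^φ, β^φ, γ^φ for a single coefficient φ, with the grid-searched constants α=1.2, β=1.1, γ=1.15 from the paper chosen so that α·β²·γ² ≈ 2, i.e. FLOPs roughly double per unit of φ:

```python
# EfficientNet compound scaling (constants from the paper's grid search)
alpha, beta, gamma = 1.2, 1.1, 1.15

def scale_factors(phi):
    """Depth, width, and resolution multipliers for coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

d, w, r = scale_factors(1)  # the B1 step up from the B0 baseline
flops_growth = alpha * beta ** 2 * gamma ** 2  # ~1.92, close to 2
```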
|Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S. and Uszkoreit, J., 2020. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
||Images are split into fixed-size patches, which are treated as tokens; the sequence of linear embeddings of these patches is fed to a Transformer
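The tokenization step is just a reshape. A minimal NumPy sketch using the paper's 224×224 input and 16×16 patches (the learned linear projection that follows is omitted):

```python
import numpy as np

def image_to_patches(img, patch):
    """Split an H×W×C image into non-overlapping patch×patch tokens,
    each flattened into a vector of length patch*patch*C."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    img = img.reshape(h // patch, patch, w // patch, patch, c)
    return img.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

img = np.zeros((224, 224, 3))
tokens = image_to_patches(img, 16)
# 14×14 = 196 tokens, each 16*16*3 = 768 values, before linear projection
assert tokens.shape == (196, 768)
```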
|Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S. and Guo, B., 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012-10022).
||A hierarchical vision transformer that uses shifted windows to address the challenges of adapting the Transformer to computer vision
|Mehta, S. and Rastegari, M., 2021. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178.
||A lightweight vision transformer designed for mobile devices, effectively combining the strengths of CNNs and ViTs
|Trockman, A. and Kolter, J.Z., 2022. Patches are all you need?. arXiv preprint arXiv:2201.09792.
||Processes image patches using standard convolutions for mixing spatial and channel dimensions