Sep 27, 2022
PRESS RELEASE
Research paper accepted by NeurIPS 2022, a prestigious international conference on machine learning
~Introduces image classification model replacing Vision Transformer~
Keyword: RESEARCH
Objective
A paper summarizing the results of groundbreaking research conducted by a Rikkyo doctoral student and associate professor has been accepted at Neural Information Processing Systems (NeurIPS) 2022, an illustrious annual international conference in the field of machine learning. Yuki Tatsunami, a first-year doctoral student at Rikkyo’s Graduate School of Artificial Intelligence and Science (who also works for AnyTech Co., Ltd.), and Associate Professor Masato Taki authored the paper.

Diagram showing two approaches to image recognition. Above: the image patches (small blocks of pixels treated as single units) interact globally through an attention mechanism. Below: information from the image patches is integrated sequentially by a recurrent neural network.
This research revealed that the attention mechanism (a technique meant to mimic cognitive attention) of the Transformer used in cutting-edge computer vision can be substituted with classic recurrent neural networks (which deal with sequential data). Based on this discovery, the researchers proposed a new image classification model called “Sequencer,” which does not use convolution or attention mechanisms. Sequencer is the first purely Japanese-made image classification model that has been adopted by PyTorch Image Models.
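To make the idea concrete, here is a minimal PyTorch sketch of a Sequencer-like token-mixing block, assuming bidirectional LSTMs that scan the rows and columns of the patch grid in place of attention or convolution. The class name, layer sizes, and fusion step are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch (assumed design, not the published Sequencer code): mix information
# across the patch grid with bidirectional LSTMs along rows and columns.
import torch
import torch.nn as nn

class LSTM2DMixer(nn.Module):
    """Hypothetical token-mixing block: BiLSTMs over rows and columns of patches."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.row_lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.col_lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.fuse = nn.Linear(4 * hidden, dim)  # project concatenated row/column features back to dim

    def forward(self, x):                       # x: (B, H, W, C) grid of patch embeddings
        B, H, W, C = x.shape
        rows, _ = self.row_lstm(x.reshape(B * H, W, C))                       # scan each row (both directions)
        cols, _ = self.col_lstm(x.permute(0, 2, 1, 3).reshape(B * W, H, C))   # scan each column (both directions)
        rows = rows.reshape(B, H, W, -1)
        cols = cols.reshape(B, W, H, -1).permute(0, 2, 1, 3)
        return x + self.fuse(torch.cat([rows, cols], dim=-1))                 # residual connection

x = torch.randn(2, 14, 14, 192)                  # 14x14 grid of 192-dim patch embeddings
print(LSTM2DMixer(dim=192, hidden=48)(x).shape)  # torch.Size([2, 14, 14, 192])
```

In a full model, several such blocks would typically be stacked together with per-patch feed-forward layers and a classification head, much as attention blocks are stacked in a Vision Transformer.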
Research background
In computer vision, convolutional neural networks (CNNs) have been the de facto standard inductive bias since the deep learning breakthrough around 2012. CNNs trace back to Kunihiko Fukushima's neocognitron (a hierarchical, multilayered artificial neural network) in the 1970s and have long been recognized as a powerful method in computer vision.
In 2020, however, it was discovered that the Vision Transformer, a network built from attention mechanisms, outperforms CNNs in various computer vision tasks. Attention mechanisms are an inductive bias originally discovered in natural language processing research. Impressive achievements in machine translation, language generation, and image generation from text have all been made possible thanks to attention mechanisms. Because they have also proved useful in computer vision, many experts believe the field's de facto standard will shift from CNNs to attention mechanisms.
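For comparison, the sketch below shows the attention-based approach in its simplest form: a Vision-Transformer-style block in which every image patch attends to every other patch in one global step. It uses PyTorch's built-in multi-head attention with illustrative sizes and is a simplified sketch, not a full Vision Transformer.

```python
# Minimal sketch of global patch interaction via self-attention (simplified ViT-style block).
import torch
import torch.nn as nn

class PatchSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):             # tokens: (B, N_patches, C)
        h = self.norm(tokens)
        out, _ = self.attn(h, h, h)        # queries, keys, and values all come from the patches
        return tokens + out                # residual connection

tokens = torch.randn(2, 196, 192)          # 196 patches (14x14), 192-dim embeddings
print(PatchSelfAttention(dim=192, num_heads=3)(tokens).shape)  # torch.Size([2, 196, 192])
```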
Research results
While attention mechanisms have increasingly been regarded as important in computer vision since around 2020, some research results began to question the significance of these mechanisms in 2021. Consequently, research into the role attention mechanisms play in computer vision models, and whether they are even necessary, has become closely watched around the world.
Against this backdrop, the researchers discovered that a model's image classification accuracy improves when its attention mechanisms are replaced with recurrent neural networks (RNNs), a classic mechanism. RNNs, which have been known for decades, can serve as a substitute for the complex attention mechanism. This result means that an attention mechanism is not strictly necessary, an outcome that calls for reconsidering its role.
Furthermore, the proposed Sequencer outperforms even advanced CNNs. In a nutshell, RNNs provide a more suitable inductive bias for image recognition tasks, including classification. Because RNNs were originally developed for time-series processing, it had gone largely unnoticed that they can achieve superb image recognition performance. This result may offer a new approach to design principles for future computer vision models.
Expectations and challenges
Keywords
- NeurIPS (Neural Information Processing Systems): a world-class international conference on machine learning and computational neuroscience.
- Computer vision: a field of study to enable computers to recognize and process images.
- Attention mechanism: a mechanism by which deep learning models choose and focus on important information, similar to how humans direct their attention to important information.
- Transformer: a deep learning model created using the attention mechanism, primarily used for natural language processing tasks such as machine translation.
- Vision Transformer: an improved Transformer model used for image recognition.
- Inductive bias: built-in assumptions, supplied by the model or learning algorithm itself rather than by the data, that guide what a machine learning model can learn.
- Convolution: a technique for extracting information from images by aggregating local information. It is a kind of filtering.
- Convolutional neural network: a neural network specializing in image processing that uses convolution.
- Recurrent neural network: a neural network that learns relationships in sequential data by processing it step by step over time (contrasted with convolution in the sketch after this list).
- PyTorch Image Models: one of the most widely used and advanced libraries of image classification models, used in academia, industry, and other fields.
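To illustrate the contrast between the convolution and recurrent neural network entries above, the following minimal sketch (with illustrative shapes only) applies a convolution, which aggregates local neighborhoods of an image, and an LSTM, which processes a sequence step by step.

```python
# Minimal sketch contrasting the two keyword entries above; sizes are illustrative only.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 32, 32)                   # one RGB image, 32x32 pixels
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
print(conv(image).shape)                            # torch.Size([1, 8, 32, 32]) - local filtering

sequence = torch.randn(1, 10, 16)                   # one sequence of 10 steps, 16 features each
rnn = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
outputs, (h_n, c_n) = rnn(sequence)                 # processed sequentially over the 10 steps
print(outputs.shape)                                # torch.Size([1, 10, 32]) - one output per step
```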
Paper information
Authors: Yuki Tatsunami, Masato Taki