Sep 27, 2022
PRESS RELEASE
Research paper accepted by NeurIPS 2022, a prestigious international conference on machine learning
~Introduces image classification model replacing Vision Transformer~
Keyword: RESEARCH
Objective
A paper summarizing the results of groundbreaking research conducted by a Rikkyo doctoral student and associate professor has been accepted at Neural Information Processing Systems (NeurIPS) 2022, an illustrious annual international conference in the field of machine learning. Yuki Tatsunami, a first-year doctoral student at Rikkyo’s Graduate School of Artificial Intelligence and Science (who also works for AnyTech Co., Ltd.), and Associate Professor Masato Taki authored the paper.

Diagram showing two approaches to image recognition. Above: the image patches (small blocks of pixels treated as single units) interact globally through an attention mechanism. Below: information from the image patches is integrated sequentially by a recurrent neural network.
This research revealed that the attention mechanism (a technique meant to mimic cognitive attention) of the Transformer used in cutting-edge computer vision can be substituted with classic recurrent neural networks (which deal with sequential data). Based on this discovery, the researchers proposed a new image classification model called “Sequencer,” which does not use convolution or attention mechanisms. Sequencer is the first purely Japanese-made image classification model that has been adopted by PyTorch Image Models.
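To make the idea concrete, here is a minimal PyTorch sketch of a Sequencer-like token-mixing block, assuming bidirectional LSTMs that scan the rows and columns of the patch grid in place of attention or convolution. The class name, layer sizes, and fusion step are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch (assumed design, not the published Sequencer code): mix information
# across the patch grid with bidirectional LSTMs along rows and columns.
import torch
import torch.nn as nn

class LSTM2DMixer(nn.Module):
    """Hypothetical token-mixing block: BiLSTMs over rows and columns of patches."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.row_lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.col_lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.fuse = nn.Linear(4 * hidden, dim)  # project concatenated row/column features back to dim

    def forward(self, x):                       # x: (B, H, W, C) grid of patch embeddings
        B, H, W, C = x.shape
        rows, _ = self.row_lstm(x.reshape(B * H, W, C))                       # scan each row (both directions)
        cols, _ = self.col_lstm(x.permute(0, 2, 1, 3).reshape(B * W, H, C))   # scan each column (both directions)
        rows = rows.reshape(B, H, W, -1)
        cols = cols.reshape(B, W, H, -1).permute(0, 2, 1, 3)
        return x + self.fuse(torch.cat([rows, cols], dim=-1))                 # residual connection

x = torch.randn(2, 14, 14, 192)                  # 14x14 grid of 192-dim patch embeddings
print(LSTM2DMixer(dim=192, hidden=48)(x).shape)  # torch.Size([2, 14, 14, 192])
```

In a full model, several such blocks would typically be stacked together with per-patch feed-forward layers and a classification head, much as attention blocks are stacked in a Vision Transformer.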
Research background
In computer vision, convolutional neural networks (CNNs) have been the de facto standard inductive bias since the deep learning breakthrough around 2012. CNNs trace back to Kunihiko Fukushima's neocognitron (a hierarchical, multilayered artificial neural network) in the 1970s and have long been recognized as a powerful method in computer vision.
In 2020, however, it was discovered that the Vision Transformer, a network built from attention mechanisms, outperforms CNNs in various computer vision tasks. Attention mechanisms are an inductive bias originally discovered in natural language processing research. Impressive achievements in machine translation, language generation, and image generation from text have all been made possible thanks to attention mechanisms. Because they have also proved useful in computer vision, many experts believe the field's de facto standard will shift from CNNs to attention mechanisms.
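For comparison, the sketch below shows the attention-based approach in its simplest form: a Vision-Transformer-style block in which every image patch attends to every other patch in one global step. It uses PyTorch's built-in multi-head attention with illustrative sizes and is a simplified sketch, not a full Vision Transformer.

```python
# Minimal sketch of global patch interaction via self-attention (simplified ViT-style block).
import torch
import torch.nn as nn

class PatchSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):             # tokens: (B, N_patches, C)
        h = self.norm(tokens)
        out, _ = self.attn(h, h, h)        # queries, keys, and values all come from the patches
        return tokens + out                # residual connection

tokens = torch.randn(2, 196, 192)          # 196 patches (14x14), 192-dim embeddings
print(PatchSelfAttention(dim=192, num_heads=3)(tokens).shape)  # torch.Size([2, 196, 192])
```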
Research results
While attention mechanisms have increasingly been regarded as important in computer vision since around 2020, some research results began to question the significance of these mechanisms in 2021. Consequently, research into the role attention mechanisms play in computer vision models, and whether they are even necessary, has become closely watched around the world.
Against this backdrop, the researchers discovered that a model's image classification accuracy improves when its attention mechanisms are replaced with recurrent neural networks (RNNs), a classic mechanism. RNNs, which have been known for decades, can serve as a substitute for the complex attention mechanism. This result means that an attention mechanism is not strictly necessary, an outcome that calls for reconsidering its role.
Furthermore, the proposed Sequencer outperforms even advanced CNNs. In a nutshell, RNNs provide a more suitable inductive bias for image recognition tasks, including classification. Because RNNs were originally developed for time-series processing, it had gone largely unnoticed that they can achieve superb image recognition performance. This result may offer a new approach to design principles for future computer vision models.
Expectations and challenges
Keywords
- NeurIPS (Neural Information Processing Systems): a world-class international conference on machine learning and computational neuroscience.
- Computer vision: a field of study to enable computers to recognize and process images.
- Attention mechanism: a mechanism by which deep learning models choose and focus on important information, similar to how humans direct their attention to important information.
- Transformer: a deep learning model created using the attention mechanism, primarily used for natural language processing tasks such as machine translation.
- Vision Transformer: an improved Transformer model used for image recognition.
- Inductive bias: built-in assumptions, supplied by the model or learning algorithm itself rather than by the data, that guide what a machine learning model can learn.
- Convolution: a technique for extracting information from images by aggregating local information. It is a kind of filtering.
- Convolutional neural network: a neural network specializing in image processing that uses convolution.
- Recurrent neural network: a neural network that learns relationships in sequential data by processing it step by step over time (contrasted with convolution in the sketch after this list).
- PyTorch Image Models: one of the most widely used and advanced libraries of image classification models, used in academia, industry, and other fields.
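To illustrate the contrast between the convolution and recurrent neural network entries above, the following minimal sketch (with illustrative shapes only) applies a convolution, which aggregates local neighborhoods of an image, and an LSTM, which processes a sequence step by step.

```python
# Minimal sketch contrasting the two keyword entries above; sizes are illustrative only.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 32, 32)                   # one RGB image, 32x32 pixels
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
print(conv(image).shape)                            # torch.Size([1, 8, 32, 32]) - local filtering

sequence = torch.randn(1, 10, 16)                   # one sequence of 10 steps, 16 features each
rnn = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
outputs, (h_n, c_n) = rnn(sequence)                 # processed sequentially over the 10 steps
print(outputs.shape)                                # torch.Size([1, 10, 32]) - one output per step
```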
Paper information
Authors: Yuki Tatsunami, Masato Taki