Insect Pest Classification with State Space Model

Qianning Wang^1, Chenglin Wang^2, Zhixin Lai^3, Yucheng Zhou^4
^1 Nanjing Audit University, ^2 East China Normal University
^3 Snapchat, ^4 SKL-IOTSC, CIS, University of Macau
yucheng.zhou@connect.um.edu.mo

Abstract

The classification of insect pests is a critical task in agricultural technology, vital for ensuring food security and environmental sustainability. However, the complexity of pest identification, due to factors like high camouflage and species diversity, poses significant obstacles. Existing methods struggle with the fine-grained feature extraction needed to distinguish between closely related pest species. Although recent advancements have utilized modified network structures and combined deep learning approaches to improve accuracy, challenges persist due to the similarity between pests and their surroundings. To address this problem, we introduce InsectMamba, a novel approach that integrates State Space Models (SSMs), Convolutional Neural Networks (CNNs), the Multi-Head Self-Attention mechanism (MSA), and Multilayer Perceptrons (MLPs) within Mix-SSM blocks. This integration facilitates the extraction of comprehensive visual features by leveraging the strengths of each encoding strategy. A selective module is also proposed to adaptively aggregate these features, enhancing the model's ability to discern pest characteristics. InsectMamba was evaluated against strong competitors across five insect pest classification datasets. The results demonstrate its superior performance, and an ablation study verifies the significance of each model component.

1 Introduction

In agricultural production, pests significantly impact crop yields, so the identification and classification of pests are pivotal for ensuring food security and sustainability in agricultural technology. Insect pest classification aims to leverage vision models to automate insect pest recognition Liu et al. (2022); Wu et al. (2019). This task is crucial for maintaining crop health, potentially reducing pesticide usage, and fostering environmentally sustainable agricultural practices. Furthermore, accurate identification of insect pests benefits crop management by minimizing damage and optimizing yields.

Pests often exhibit a high degree of camouflage within their natural habitats An et al. (2023); Wu et al. (2019), which makes visual recognition difficult and illustrates the complexity of insect pest classification. The similarity between pests and their surroundings, coupled with the vast diversity of species, poses significant obstacles to traditional image processing algorithms. Furthermore, the necessity for fine-grained feature extraction to distinguish between closely related pest species adds another layer of complexity to this challenge Anwar and Masood (2023); Butera et al. (2022). Recent work has proposed modified capsule networks that improve the network structure, thereby better modeling the hierarchical and spatial relationships of features to increase classification accuracy Liu et al. (2022); Butera et al. (2022). Additionally, some studies have combined multiple deep networks to exploit complementary features from multiple perspectives and thus enhance recognition rates and robustness An et al. (2023); Anwar and Masood (2023). Nonetheless, these approaches still face challenges due to the similarity between pests and their surroundings.

In addressing the challenges of accurately identifying and classifying pests in varied conditions, different visual encoding strategies offer different advantages. Convolutional Neural Networks (CNNs O'Shea and Nash (2015)) excel at local feature extraction, whereas the Multi-Head Self-Attention mechanism (MSA Vaswani et al. (2017)) is adept at capturing global features. State Space Models (SSMs Gu and Dao (2023)) are particularly effective at modeling long-distance dependencies, and Multilayer Perceptrons (MLPs Murtagh (1990)) specialize in channel-aware information inference.

To integrate the advantages of these visual encoding strategies, we propose a novel approach, InsectMamba, consisting of Mix-SSM blocks that integrate SSM, CNN, MSA, and MLP to extract more comprehensive visual features for insect pest classification. In addition, we propose a selective module to adaptively aggregate visual features from different encoding strategies. By leveraging the complementary capabilities of these visual encoding strategies, our method aims to capture both the local and global features of pests, thus addressing the critical challenges of camouflage and species diversity.

In the experiments, we evaluate our model and other strong competitors on five insect pest classification datasets. To make the datasets more challenging, we re-split them. The experimental results show that our method outperforms the other methods, which demonstrates its effectiveness. Moreover, we conduct an ablation study to verify the significance of each module of our model. Furthermore, we conduct extensive analyses of our model design to demonstrate its effectiveness.

The main contributions of this study are as follows:

  • We propose InsectMamba, which is the first attempt to apply SSM-based models to insect pest classification.

  • We present Mix-SSM blocks that seamlessly integrate SSM, CNN, MSA, and MLP. This integration allows our model to capture a comprehensive range of visual features for insect pest classification.

  • We propose a selective aggregation module designed to adaptively combine visual features derived from different encoding strategies. This module allows the model to select relevant features that are utilized for classification.

  • We have rigorously evaluated InsectMamba across five insect pest classification datasets, demonstrating its superior performance compared to existing models.

2 Related Work

2.1 Image Classification

The rapid advancements in computer vision Liu et al. (2024b); Zhou et al. (2024); Su et al. (2024); Zhang et al. (2023) have led to its extensive application across various areas, including AI security Lyu et al. (2024), generated-content detection Lai et al. (2024b), biomedicine Lai et al. (2024a), and agricultural technology Wu et al. (2024). Notably, image classification Krizhevsky et al. (2017); Dosovitskiy et al. (2021); Liu et al. (2024c) stands out as a fundamental technique for many applications in computer vision, and it aims to distinguish different categories of images. Some works He et al. (2016); Simonyan and Zisserman (2015) employ Convolutional Neural Networks (CNNs) for image classification, owing to the convolutional layers' capability to capture local features within images. For instance, AlexNet Krizhevsky et al. (2017), consisting of five convolutional layers and three fully connected layers, achieves strong image classification performance. VGG Simonyan and Zisserman (2015) and ResNet He et al. (2016) respectively propose enhancements by increasing the depth of the original network and integrating skip connections to further improve the model's classification capabilities. However, CNNs have limitations in understanding global information and lack robustness when capturing global and long-distance dependencies Dosovitskiy et al. (2021). The Vision Transformer (ViT) Dosovitskiy et al. (2021) leverages multi-head self-attention (MSA) Vaswani et al. (2017) to capture the context of each patch, which enhances the model's capability to capture global dependencies. Moreover, the Swin Transformer Liu et al. (2021) adopts a windowed self-attention mechanism and a hierarchical structural design, which not only retains the global modeling capabilities of MSA but also enhances the extraction of local features. Furthermore, MLP-Mixer Tolstikhin et al. (2021) proposes a pure MLP-based architecture to capture different contextual relationships and enhance visual representation. In addition, VMamba Liu et al. (2024c) improves visual classification tasks by integrating a novel Sequence State Space (S4) model with a selection mechanism and scan computation, termed Mamba.

2.2 Insect Pest Classification

Insect pest classification helps people better understand the population dynamics and potential damage of pests and formulate effective pest management strategies, which is very important for the agricultural economy and environmental science. However, compared to general images, the feature differences in the insect pest domain can be very subtle and the backgrounds more complex, which places higher requirements on the classification model and requires more accurate extraction of effective features Doan (2023); Ung et al. (2022). To address this challenge, some works Cheng et al. (2017); Liu et al. (2016); Wang et al. (2017); Kasinathan and Reddy (2019); Ren et al. (2019) improve CNN-based models to capture pest features against complex backgrounds. In addition, Faster-PestNet Ali et al. (2023) used MobileNet Howard et al. (2017) to extract sample attributes and redesigned an improved Faster-RCNN method to recognize crop pests. Ung et al. (2022) propose a CNN-based model with an attention mechanism to further focus on insects in the image; An et al. (2023) propose a feature fusion network that synthesizes representations from different backbone models to enhance insect image classification; Anwar and Masood (2023) employ a deep ensemble method Hu et al. (2023) to improve accuracy and robustness in insect and pest detection from images. Moreover, Peng and Wang (2022) investigated the ViT architecture in the insect domain and aggregated CNNs and self-attention models to further improve the capability for insect pest classification.

3 Preliminaries

3.1 Convolutional Neural Networks

Convolutional Neural Networks (CNNs O'Shea and Nash (2015)) are widely applied in computer vision owing to their strong capability for image feature extraction. A CNN consists of sets of fixed-size learnable parameters known as filters, which perform convolutional computations over a sliding window across the input images. Specifically, given visual features $\bm{V}\in\mathbb{R}^{H\times W\times C}$, where $H$, $W$, and $C$ are the height, width, and number of channels, we can use convolution kernels $w$ of size $F_h\times F_w\times C_{in}$ to calculate the pixel value of each channel of the output visual features, i.e.,

$$\bm{V}_{out}[i,j,k]=\sum_{l=0}^{C_{in}-1}\Big(\sum_{m=0}^{F_h-1}\sum_{n=0}^{F_w-1}\bm{V}[i\times S+m,\, j\times S+n,\, l]\times w[m,n,l,k]\Big)+b[k] \qquad (1)$$

where $\bm{V}_{out}$ is the output feature map, $(i,j,k)$ is the index, $S$ is the stride, and $b[k]$ is the bias of channel $k$. Through this cascaded structure, a CNN can gradually learn feature representations from low-level to high-level from the original data, and finally achieve effective classification.
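To make Equation (1) concrete, the following is a minimal NumPy sketch of the convolution as an explicit loop over output positions and channels; the function name, the absence of padding, and the default unit stride are our own illustrative choices rather than details from the paper.

```python
import numpy as np

def conv2d(V, w, b, stride=1):
    """Naive 2D convolution following Eq. (1).

    V: input features of shape (H, W, C_in)
    w: kernels of shape (F_h, F_w, C_in, C_out)
    b: bias of shape (C_out,)
    """
    H, W, C_in = V.shape
    F_h, F_w, _, C_out = w.shape
    H_out = (H - F_h) // stride + 1
    W_out = (W - F_w) // stride + 1
    V_out = np.zeros((H_out, W_out, C_out))
    for i in range(H_out):
        for j in range(W_out):
            # Extract the sliding window starting at (i*S, j*S).
            patch = V[i * stride:i * stride + F_h, j * stride:j * stride + F_w, :]
            for k in range(C_out):
                # Sum over spatial positions and input channels, then add the bias.
                V_out[i, j, k] = np.sum(patch * w[:, :, :, k]) + b[k]
    return V_out

# Example: a 3x3 kernel bank over an 8x8 three-channel input.
V = np.random.rand(8, 8, 3)
w = np.random.rand(3, 3, 3, 4)
b = np.zeros(4)
print(conv2d(V, w, b).shape)  # (6, 6, 4)
```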

3.2 Multi-Head Self-Attention

Multi-Head Self-Attention (MSA) was proposed by Vaswani et al. (2017) and is widely used in many natural language processing tasks Zhou et al. (2023); Han et al. (2024); Liu et al. (2024a). Unlike convolutional neural networks, MSA allows the model to weigh the importance of different input tokens when generating output representations, enabling it to effectively capture global dependencies and contextual information within the sequence. Recently, Transformer-like architectures have also demonstrated powerful modeling capabilities in computer vision Dosovitskiy et al. (2021). Specifically, given visual features $\bm{V}\in\mathbb{R}^{H\times W\times C}$, the multi-head self-attention modeling of the visual features can be defined as:

$$\operatorname{Attn}_t^h=\mathrm{softmax}\!\left(\frac{\bm{Q}_t^h\cdot(\bm{K}_t^h)^T}{\sqrt{d_{\bm{K}_t^h}}}\right),\quad \text{where } \bm{Q}_t^h=\bm{W}_Q^h\cdot\bm{V},\ \bm{K}_t^h=\bm{W}_k^h\cdot\bm{V}, \qquad (2)$$

where $\bm{W}_Q^h\in\mathbb{R}^{D\times d}$ and $\bm{W}_k^h\in\mathbb{R}^{D\times d}$ are linear projections that map the $D$-dimensional input vectors into the query $\bm{Q}_t^h\in\mathbb{R}^{N\times d}$ and the key $\bm{K}_t^h\in\mathbb{R}^{N\times d}$, respectively. Each attention matrix $\operatorname{Attn}_t^h$ is multiplied with the value to obtain an updated representation that fuses global information, i.e.,

$$\bm{V}:=\operatorname{Attn}_t^h\cdot\bm{V}_t^h,\quad \text{where } \bm{V}_t^h=\bm{W}_t^h\cdot\bm{V} \qquad (3)$$

In vision tasks, MSA typically needs to be pre-trained on large-scale datasets to compensate for its lack of the inductive biases built into CNNs, such as translation invariance and locality.
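As a sketch of Equations (2) and (3), the snippet below implements a single self-attention head over flattened patch tokens with NumPy; the token count, dimensions, and random projections are illustrative assumptions, and a full MSA would run several such heads in parallel and concatenate their outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(V, W_q, W_k, W_v):
    """One attention head over N flattened patch tokens (Eqs. 2-3).

    V: (N, D) patch features; W_q, W_k, W_v: (D, d) projection matrices.
    """
    Q, K, Val = V @ W_q, V @ W_k, V @ W_v            # (N, d) each
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (N, N) attention weights
    return attn @ Val                                # (N, d) globally mixed features

# Example: 16 patch tokens with 32-dim features, projected to 8 dims.
rng = np.random.default_rng(0)
V = rng.normal(size=(16, 32))
W_q, W_k, W_v = (rng.normal(size=(32, 8)) for _ in range(3))
print(self_attention_head(V, W_q, W_k, W_v).shape)   # (16, 8)
```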

3.3 Multi-Layer Perceptron

The multi-layer perceptron (MLP) is a commonly used neural network module for many tasks Murtagh (1990); Tolstikhin et al. (2021). An MLP mainly contains $N$ linear layers; each layer has learnable weight and bias parameters as well as an activation function, which models the non-linear relationship between input and output. Specifically, given visual features $\bm{V}\in\mathbb{R}^{H\times W\times C}$, the MLP, operating only over channels, maps each channel vector to a $D$-dimensional hidden vector $\bm{h}_i$, i.e.,

$$\bm{h}_i=\operatorname{Activation}\Big(\sum_{j}\bm{W}_{ij}\cdot c_j+b_i\Big),\quad \bm{W}\in\mathbb{R}^{C\times D},\ b\in\mathbb{R}^{D}, \qquad (4)$$

where $\bm{h}_i$ is the $i$-th dimension of $\bm{H}$, obtained by weighting the $C$ channels with the learnable parameters in the $i$-th column of the weight matrix $\bm{W}$. $\operatorname{Activation}$ is an activation function that adjusts the output through a non-linear transformation.
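A minimal sketch of the channel-wise mapping in Equation (4) is shown below; the GELU activation is an assumption, since the paper does not state which activation function is used.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def channel_mlp(V, W, b):
    """Map the C channels at every spatial position to D hidden dims (Eq. 4).

    V: (H, W, C) features, W: (C, D) weights, b: (D,) bias.
    """
    return gelu(V @ W + b)  # broadcasting applies the same MLP at every position

V = np.random.rand(8, 8, 16)
H = channel_mlp(V, np.random.rand(16, 32), np.zeros(32))
print(H.shape)  # (8, 8, 32)
```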

3.4 State Space Models

State Space Models (SSMs) Gu and Dao (2023); Liu et al. (2024c) have recently been extended with a novel Cross-Scan Module (CSM) for improved directional sensitivity and computational efficiency. SSMs model the dynamics of visual systems through equations that describe temporal evolution and observation generation. The state transition function is as follows:

$$\bm{x}_{t+1}=\bm{A}\cdot\bm{x}_t+\bm{B}\cdot\bm{u}_t+\bm{w}_t, \qquad (5)$$

where $\bm{x}_t$ denotes the system state at time $t$, $\bm{u}_t$ represents the control inputs, and $\bm{w}_t$ is the process noise, indicating uncertainties in state transitions. Moreover, the observation function can be defined as:

$$\bm{y}_t=\bm{C}\cdot\bm{x}_t+\bm{D}\cdot\bm{u}_t+\bm{v}_t, \qquad (6)$$

with $\bm{y}_t$ as the observation at time $t$ and $\bm{v}_t$ as the observation noise, highlighting discrepancies between modeled and actual observations. The matrices $\bm{A},\bm{B},\bm{C},\bm{D}$ define the dynamics, linking state transitions to observations. The Cross-Scan Module (CSM) further addresses directional sensitivity by structuring visual features into ordered patch sequences through:

$$\operatorname{CSM}(\bm{V})=\operatorname{Order}(\operatorname{Traverse}(\bm{V})), \qquad (7)$$

where $\bm{V}$ is a visual feature input. This process allows for effective handling of spatial information, improving the model's dynamic processing capabilities.
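The recurrence in Equations (5) and (6) can be sketched as a simple roll-out of a discrete linear system, as below; the matrices are chosen arbitrarily and the noise terms are omitted, and the sketch does not include the input-dependent (selective) parameterization or the cross-scan ordering used by Mamba-style models.

```python
import numpy as np

def run_ssm(A, B, C, D, u_seq, x0):
    """Roll out the linear state-space recurrence of Eqs. (5)-(6).

    A: (n, n), B: (n, m), C: (p, n), D: (p, m)
    u_seq: (T, m) input sequence, x0: (n,) initial state.
    """
    x, ys = x0, []
    for u in u_seq:
        y = C @ x + D @ u      # observation (Eq. 6), noise omitted
        ys.append(y)
        x = A @ x + B @ u      # state transition (Eq. 5), noise omitted
    return np.stack(ys)

n, m, p, T = 4, 2, 3, 10
rng = np.random.default_rng(1)
A = 0.9 * np.eye(n)                       # stable, slowly decaying state
B, C, D = rng.normal(size=(n, m)), rng.normal(size=(p, n)), np.zeros((p, m))
print(run_ssm(A, B, C, D, rng.normal(size=(T, m)), np.zeros(n)).shape)  # (10, 3)
```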

4 InsectMamba

This section elaborates on the architecture of our InsectMamba model, a novel vision model for insect pest classification. The backbone of our model is the Mix-SSM Block, designed to integrate features from various visual encoding strategies. Finally, we introduce our proposed Selective Module, which adaptively integrates the representations derived from the different visual encoding strategies.

4.1 Overall Architecture

[Figure 1: Overall architecture of the InsectMamba model.]

As shown in Figure 1, given an image $\bm{I}\in\mathbb{R}^{H\times W\times 3}$, the image is initially segmented into multiple non-overlapping $4\times 4$ patches. Subsequently, a patch embedding layer Dosovitskiy et al. (2021) is utilized to transform these patches into a lower-dimensional latent space, resulting in dimensions $\frac{H}{4}\times\frac{W}{4}\times C$, where $C$ denotes the number of channels in the latent space, i.e.,

$$\bm{V}=\operatorname{PatchEmbed}(\bm{I}),\quad \bm{V}\in\mathbb{R}^{\frac{H}{4}\times\frac{W}{4}\times C}. \qquad (8)$$

Subsequently, we pass the features $\bm{V}$ into the Mix-SSM Blocks for feature extraction, and dimensionality reduction is achieved through a Patch Merging operation Liu et al. (2021), i.e.,

$$\bm{V}:=\operatorname{PatchMerging}(\operatorname{Mix\text{-}SSM\text{-}Block}(\bm{V})). \qquad (9)$$

After several iterations of Mix-SSM Blocks and Patch Merging operations, as shown in Figure 1, the final visual representation of the image, $\bm{v}\in\mathbb{R}^{L}$, is derived. Lastly, $\bm{v}$ is passed through a linear layer $\operatorname{Linear}$ to transform its dimension to the number of classes, i.e.,

$$\bm{h}=\operatorname{Linear}(\bm{v}),\qquad \bm{p}=\mathrm{softmax}(\bm{h}) \qquad (10)$$

where $\mathrm{softmax}$ converts the hidden features $\bm{h}$ into a probability distribution $\bm{p}$ over the classes.
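The forward pass of Equations (8)-(10) can be sketched at a high level as follows, assuming PyTorch; the Mix-SSM Block stacks are represented by placeholder modules, and the stage widths, depths, and final pooling are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class InsectMambaSketch(nn.Module):
    """Schematic forward pass: patch embedding -> [blocks + patch merging] -> classifier."""

    def __init__(self, num_classes, dim=96, depths=(2, 2, 4)):
        super().__init__()
        # Patch embedding over 4x4 non-overlapping patches (Eq. 8).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=4, stride=4)
        stages, merges = [], []
        for i, d in enumerate(depths):
            c = dim * 2 ** i
            # Placeholder for a stack of Mix-SSM Blocks at this stage.
            stages.append(nn.Sequential(*[nn.Identity() for _ in range(d)]))
            # Stand-in for Patch Merging: halve the resolution, double the channels (cf. Eq. 9).
            merges.append(nn.Conv2d(c, c * 2, kernel_size=2, stride=2))
        self.stages, self.merges = nn.ModuleList(stages), nn.ModuleList(merges)
        self.head = nn.Linear(dim * 2 ** len(depths), num_classes)  # Eq. 10

    def forward(self, img):
        v = self.patch_embed(img)                      # (B, C, H/4, W/4)
        for blocks, merge in zip(self.stages, self.merges):
            v = merge(blocks(v))
        v = v.mean(dim=(2, 3))                         # globally pooled representation
        return torch.softmax(self.head(v), dim=-1)     # class probabilities

probs = InsectMambaSketch(num_classes=102)(torch.randn(2, 3, 224, 224))
print(probs.shape)  # torch.Size([2, 102])
```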

4.2 Mix-SSM Block

[Figure 2: The Mix-SSM Block, which combines SSM, Conv, MLP, and MSA branches with the Selective Module.]

The Mix-SSM Block is composed of several key components: a Selective Scan Module (SSM), convolutional layers (Conv), a Multi-Layer Perceptron (MLP), a Multi-Head Self-Attention mechanism (MSA), and a Selective Module. The details of the different kinds of visual encoding strategies, i.e., SSM, Conv, MLP, and MSA, can be found in Section 3.

As shown in Figure 2, given the features $\bm{V}$ from Equation 8, we pass them into the Mix-SSM Blocks. The features $\bm{V}$ are encoded by each of the visual encoding strategies, yielding $\bm{F}_{\mathrm{SSM}}$, $\bm{F}_{\mathrm{Conv}}$, $\bm{F}_{\mathrm{MLP}}$, and $\bm{F}_{\mathrm{MSA}}$, i.e.,

$$\bm{F}_{\mathrm{SSM}}=\operatorname{SSM}(\bm{V}),\quad \bm{F}_{\mathrm{Conv}}=\operatorname{Conv}(\bm{V}),\quad \bm{F}_{\mathrm{MLP}}=\operatorname{MLP}(\bm{V}),\quad \bm{F}_{\mathrm{MSA}}=\operatorname{MSA}(\bm{V}), \qquad (11)$$

where $\bm{F}_m$, $m\in\{\mathrm{SSM},\mathrm{Conv},\mathrm{MLP},\mathrm{MSA}\}$, denotes the encoded features, all of the same dimension. $\operatorname{SSM}$ adaptively aggregates spatial information with long-distance dependencies based on the input features, $\operatorname{Conv}$ extracts local visual features, $\operatorname{MLP}$ processes channel-wise information, and $\operatorname{MSA}$ captures global dependencies of the visual features. Subsequently, the features $\bm{F}_{\mathrm{SSM}}$, $\bm{F}_{\mathrm{Conv}}$, $\bm{F}_{\mathrm{MLP}}$, and $\bm{F}_{\mathrm{MSA}}$ are passed through the Selective Module to adaptively aggregate them.
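A schematic of the four parallel branches in Equation (11) is given below, using off-the-shelf PyTorch layers as stand-ins; in particular, a depthwise convolution substitutes for the SSM branch because a selective-scan operator is not part of standard PyTorch, so this is only an approximation of the real block.

```python
import torch
import torch.nn as nn

class FourBranchEncoder(nn.Module):
    """Produce F_SSM, F_Conv, F_MLP, F_MSA from shared features V (Eq. 11).

    The SSM branch is approximated by a depthwise convolution; the real model
    uses a selective-scan (Mamba-style) operator instead.
    """

    def __init__(self, dim, heads=4):
        super().__init__()
        self.ssm = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # stand-in for SSM
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)              # local features
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, v):                                          # v: (B, C, H, W)
        b, c, h, w = v.shape
        tokens = v.flatten(2).transpose(1, 2)                      # (B, H*W, C)
        f_ssm = self.ssm(v)
        f_conv = self.conv(v)
        f_mlp = self.mlp(tokens).transpose(1, 2).reshape(b, c, h, w)
        f_msa, _ = self.msa(tokens, tokens, tokens)
        f_msa = f_msa.transpose(1, 2).reshape(b, c, h, w)
        return f_ssm, f_conv, f_mlp, f_msa                         # all (B, C, H, W)

feats = FourBranchEncoder(dim=96)(torch.randn(2, 96, 14, 14))
print([f.shape for f in feats])
```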

4.3 Selective Module

To integrate features from different encoding strategies and leverage their respective advantages, we introduce the Selective Module, which enables the model to adaptively adjust visual features across different encoding strategies. Specifically, we first integrate the features $\bm{F}_{\mathrm{SSM}}$, $\bm{F}_{\mathrm{Conv}}$, $\bm{F}_{\mathrm{MLP}}$, and $\bm{F}_{\mathrm{MSA}}$ from the different encoding strategies as follows:

$$\bm{F}=\bm{F}_{\mathrm{SSM}}+\bm{F}_{\mathrm{Conv}}+\bm{F}_{\mathrm{MLP}}+\bm{F}_{\mathrm{MSA}}. \qquad (12)$$

Then, we aggregate the information across each channel by employing global average pooling to obtain an embedded global feature vector $\bm{g}\in\mathbb{R}^{\bar{C}}$, where $\bar{C}$, $\bar{W}$, and $\bar{H}$ denote the number of feature channels, the width, and the height of the features entering the Selective Module within the Mix-SSM Block, respectively. In particular, the $c$-th element of $\bm{g}$ is computed by spatially downsampling $\bm{F}_c$ over the dimensions $\bar{H}\times\bar{W}$:

$$\bm{g}_c=\operatorname{GlobalPool}(\bm{F}_c)=\frac{1}{\bar{H}\times\bar{W}}\sum_{i=1}^{\bar{H}}\sum_{j=1}^{\bar{W}}\bm{F}_c(i,j). \qquad (13)$$

To enable the model to infer the weight of the different encoding strategies for each channel, we further encode $\bm{g}$ using an MLP to obtain hidden features $\bm{h}$:

$$\bm{h}=\operatorname{MLP}_h(\bm{g}),\quad \bm{h}\in\mathbb{R}^{\bar{C}\times n} \qquad (14)$$

where $n$ represents the number of visual encoding strategies. Cross-channel soft attention is applied to adaptively select information across different spatial scales. Specifically, a softmax operation is applied to the channel dimensions of the hidden features $\bm{h}$:

$$\bm{p}=\mathrm{softmax}(\bm{h}),\quad \bm{p}\in\mathbb{R}^{\bar{C}\times n} \qquad (15)$$

where $\bm{p}$ signifies the weight of the different encoding strategies for each channel. The features $\bm{F}_{\mathrm{SSM}}$, $\bm{F}_{\mathrm{Conv}}$, $\bm{F}_{\mathrm{MLP}}$, and $\bm{F}_{\mathrm{MSA}}$ are weighted by $\{\bm{p}_{\mathrm{SSM}},\bm{p}_{\mathrm{Conv}},\bm{p}_{\mathrm{MLP}},\bm{p}_{\mathrm{MSA}}\}\in\bm{p}$ to obtain $\bm{V}$:

$$\bm{V}:=\bm{p}_{\mathrm{SSM}}\cdot\bm{F}_{\mathrm{SSM}}+\bm{p}_{\mathrm{Conv}}\cdot\bm{F}_{\mathrm{Conv}}+\bm{p}_{\mathrm{MLP}}\cdot\bm{F}_{\mathrm{MLP}}+\bm{p}_{\mathrm{MSA}}\cdot\bm{F}_{\mathrm{MSA}}. \qquad (16)$$
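Below is a minimal PyTorch sketch of the Selective Module described by Equations (12)-(16); the hidden width of the MLP and the application of the softmax over the strategy axis are our reading of the text, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SelectiveModule(nn.Module):
    """Adaptively fuse features from n encoding strategies (Eqs. 12-16)."""

    def __init__(self, channels, n_strategies=4, reduction=4):
        super().__init__()
        self.n = n_strategies
        # MLP that predicts one weight per (channel, strategy) pair (Eq. 14).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels * n_strategies),
        )

    def forward(self, feats):                        # list of n tensors, each (B, C, H, W)
        F = torch.stack(feats, dim=1)                # (B, n, C, H, W)
        fused = F.sum(dim=1)                         # Eq. 12
        g = fused.mean(dim=(2, 3))                   # global average pooling, Eq. 13 -> (B, C)
        h = self.mlp(g).view(g.size(0), self.n, -1)  # (B, n, C), Eq. 14
        p = torch.softmax(h, dim=1)                  # weights over strategies per channel, Eq. 15
        p = p.unsqueeze(-1).unsqueeze(-1)            # (B, n, C, 1, 1)
        return (p * F).sum(dim=1)                    # weighted sum, Eq. 16

sm = SelectiveModule(channels=96)
feats = [torch.randn(2, 96, 14, 14) for _ in range(4)]
print(sm(feats).shape)  # torch.Size([2, 96, 14, 14])
```

In this reading, each channel receives its own convex combination of the four strategies, which matches the description of cross-channel soft attention.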

5 Experiment

In experiments, we evaluate the performance of our InsectMamba model on five insect pest classification datasets. We compare the performance of our model with several state-of-the-art models. We also conduct an ablation study to investigate the effectiveness of different components in our model.

5.1 Dataset and Metrics

Table 1: Statistics of the five insect pest classification datasets.

Dataset | Category | Train | Test
Farm Insects | 15 | 160 | 1,368
Agricultural Pests | 12 | 240 | 5,254
Insect Recognition | 24 | 768 | 612
Forestry Pest Identification | 31 | 599 | 6,564
IP102 | 102 | 1,909 | 65,805

To more effectively and comprehensively evaluate existing vision models, we curated and re-split five insect pest classification datasets to provide a challenging evaluation. The datasets employed in our experiments are Farm Insects (https://www.kaggle.com/datasets/tarundalal/dangerous-insects-dataset), Agricultural Pests (https://www.kaggle.com/datasets/gauravduttakiit/agricultural-pests-dataset), Insect Recognition Xie et al. (2015), Forestry Pest Identification Liu et al. (2022), and IP102 Wu et al. (2019), with details provided in Table 1. We reduce the number of samples in the training set to compare the capabilities of different vision models to encode visual features. In addition, we use accuracy (ACC), precision (Prec), recall (Rec), and the F1 score as evaluation metrics to evaluate the models comprehensively.

5.2 Implementation Details

For model training, the batch size is set to 32 and the learning rate to $5\times 10^{-5}$. We train for 10 epochs using the Adam optimizer Kingma and Ba (2015). The dimensions of the input images are fixed at $224\times 224\times 3$. For comparative analysis, we fine-tune various models on the five datasets, i.e., ResNet18 He et al. (2016), ResNet50 He et al. (2016), ResNet101 He et al. (2016), ResNet152 He et al. (2016), DeiT-S Touvron et al. (2021), DeiT-B Touvron et al. (2021), Swin-T Liu et al. (2021), Swin-S Liu et al. (2021), Swin-B Liu et al. (2021), Vmamba-T Liu et al. (2024c), Vmamba-S Liu et al. (2024c), and Vmamba-B Liu et al. (2024c). "T", "S", and "B" denote the "Tiny", "Small", and "Base" sizes of the corresponding models, respectively. To initialize part of our model's parameters, we utilize the pre-trained parameters of Vmamba-B.
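A minimal sketch of this training setup (batch size 32, learning rate 5e-5, Adam, 10 epochs, 224x224 inputs) is shown below, assuming PyTorch and an ImageFolder-style directory layout; the data path and the ResNet18 stand-in backbone are illustrative, since the re-split datasets and InsectMamba weights are not distributed with the text.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Hypothetical data directory; the authors' re-split datasets are not packaged with the paper.
transform = transforms.Compose([
    transforms.Resize((224, 224)),   # inputs fixed at 224x224x3
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("data/farm_insects/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)

# Stand-in backbone; in the paper this would be InsectMamba initialized from Vmamba-B weights.
model = models.resnet18(num_classes=len(train_set.classes))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=5e-5)

model.train()
for epoch in range(10):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```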

5.3 Main Results

Table 2: Results on the Farm Insects dataset.

Method | ACC | Prec | Rec | F1
ResNet18 | 0.43 | 0.51 | 0.43 | 0.42
ResNet50 | 0.50 | 0.54 | 0.50 | 0.47
ResNet101 | 0.52 | 0.55 | 0.52 | 0.49
ResNet152 | 0.53 | 0.59 | 0.53 | 0.50
DeiT-S | 0.49 | 0.50 | 0.50 | 0.47
DeiT-B | 0.55 | 0.59 | 0.55 | 0.54
Swin-T | 0.54 | 0.53 | 0.54 | 0.52
Swin-S | 0.56 | 0.60 | 0.56 | 0.55
Swin-B | 0.62 | 0.68 | 0.63 | 0.63
Vmamba-T | 0.56 | 0.58 | 0.56 | 0.55
Vmamba-S | 0.53 | 0.56 | 0.53 | 0.51
Vmamba-B | 0.52 | 0.56 | 0.52 | 0.48
InsectMamba | 0.66 | 0.67 | 0.66 | 0.65

Table 3: Results on the Agricultural Pests dataset.

Method | ACC | Prec | Rec | F1
ResNet18 | 0.69 | 0.71 | 0.67 | 0.64
ResNet50 | 0.68 | 0.77 | 0.67 | 0.65
ResNet101 | 0.75 | 0.76 | 0.74 | 0.73
ResNet152 | 0.78 | 0.80 | 0.76 | 0.76
DeiT-S | 0.83 | 0.82 | 0.82 | 0.82
DeiT-B | 0.86 | 0.86 | 0.85 | 0.85
Swin-T | 0.80 | 0.82 | 0.80 | 0.79
Swin-S | 0.74 | 0.79 | 0.73 | 0.74
Swin-B | 0.83 | 0.86 | 0.83 | 0.82
Vmamba-T | 0.78 | 0.81 | 0.77 | 0.77
Vmamba-S | 0.83 | 0.84 | 0.82 | 0.80
Vmamba-B | 0.89 | 0.90 | 0.89 | 0.89
InsectMamba | 0.91 | 0.91 | 0.90 | 0.91

Table 4: Results on the Insect Recognition dataset.

Method | ACC | Prec | Rec | F1
ResNet18 | 0.66 | 0.69 | 0.66 | 0.64
ResNet50 | 0.57 | 0.73 | 0.57 | 0.56
ResNet101 | 0.57 | 0.66 | 0.57 | 0.55
ResNet152 | 0.52 | 0.67 | 0.52 | 0.52
DeiT-S | 0.73 | 0.76 | 0.74 | 0.73
DeiT-B | 0.76 | 0.82 | 0.76 | 0.76
Swin-T | 0.70 | 0.81 | 0.70 | 0.69
Swin-S | 0.75 | 0.80 | 0.76 | 0.76
Swin-B | 0.81 | 0.86 | 0.82 | 0.82
Vmamba-T | 0.79 | 0.85 | 0.79 | 0.79
Vmamba-S | 0.81 | 0.84 | 0.80 | 0.79
Vmamba-B | 0.83 | 0.87 | 0.84 | 0.84
InsectMamba | 0.86 | 0.88 | 0.86 | 0.86

Table 5: Results on the Forestry Pest Identification dataset.

Method | ACC | Prec | Rec | F1
ResNet18 | 0.80 | 0.83 | 0.80 | 0.80
ResNet50 | 0.80 | 0.84 | 0.80 | 0.80
ResNet101 | 0.79 | 0.85 | 0.79 | 0.79
ResNet152 | 0.77 | 0.82 | 0.77 | 0.76
DeiT-S | 0.87 | 0.88 | 0.87 | 0.87
DeiT-B | 0.90 | 0.91 | 0.90 | 0.90
Swin-T | 0.83 | 0.88 | 0.84 | 0.84
Swin-S | 0.85 | 0.88 | 0.85 | 0.85
Swin-B | 0.86 | 0.89 | 0.87 | 0.86
Vmamba-T | 0.90 | 0.91 | 0.90 | 0.90
Vmamba-S | 0.91 | 0.92 | 0.92 | 0.91
Vmamba-B | 0.92 | 0.93 | 0.93 | 0.93
InsectMamba | 0.94 | 0.94 | 0.94 | 0.94

Table 6: Results on the IP102 dataset.

Method | ACC | Prec | Rec | F1
ResNet18 | 0.27 | 0.27 | 0.25 | 0.21
ResNet50 | 0.24 | 0.26 | 0.22 | 0.18
ResNet101 | 0.18 | 0.28 | 0.19 | 0.16
ResNet152 | 0.25 | 0.23 | 0.19 | 0.16
DeiT-S | 0.22 | 0.24 | 0.21 | 0.17
DeiT-B | 0.28 | 0.27 | 0.25 | 0.20
Swin-T | 0.29 | 0.25 | 0.27 | 0.22
Swin-S | 0.30 | 0.25 | 0.27 | 0.21
Swin-B | 0.39 | 0.36 | 0.37 | 0.32
Vmamba-T | 0.28 | 0.29 | 0.29 | 0.23
Vmamba-S | 0.35 | 0.31 | 0.34 | 0.27
Vmamba-B | 0.32 | 0.36 | 0.33 | 0.28
InsectMamba | 0.43 | 0.38 | 0.42 | 0.37

The experimental results, shown in Tables 2, 3, 4, 5, and 6, demonstrate the superior performance of the InsectMamba model across multiple insect classification tasks. InsectMamba consistently outperforms the established benchmarks, including various configurations of ResNet, DeiT, Swin Transformer, and Vmamba, across all evaluation metrics: Accuracy (ACC), Precision (Prec), Recall (Rec), and F1 Score (F1). On the Farm Insects dataset, InsectMamba achieves an ACC of 0.66, surpassing the next best model, Swin-B, by 4%. Similarly, significant improvements are observed on the Agricultural Pests dataset, where InsectMamba reaches an ACC of 0.91, outperforming the strong Vmamba-B baseline by 2%. These results are consistent across the Insect Recognition and Forestry Pest Identification datasets, which shows InsectMamba's strong capability to extract features from images. The results on the IP102 dataset further verify InsectMamba's robustness, achieving an ACC of 0.43, a clear improvement over the previous best of 0.39 by Swin-B. These results demonstrate that the Mix-SSM Block can integrate multiple visual encoding strategies to ensure comprehensive feature capture from the input images. The Selective Module further enhances the model's capability by adaptively weighting the contribution of the different encoding strategies.

5.4 Ablation Study

Table 7: Ablation study on the Farm Insects, Insect Recognition, and IP102 datasets (Accuracy / F1).

Method | Farm Insects (Acc / F1) | Insect Recognition (Acc / F1) | IP102 (Acc / F1)
InsectMamba | 0.66 / 0.65 | 0.86 / 0.86 | 0.43 / 0.37
w/o CNN | 0.60 / 0.58 | 0.84 / 0.85 | 0.38 / 0.33
w/o MSA | 0.62 / 0.60 | 0.85 / 0.86 | 0.40 / 0.34
w/o MLP | 0.63 / 0.60 | 0.85 / 0.85 | 0.41 / 0.34
w/o CNN, MSA | 0.54 / 0.50 | 0.83 / 0.84 | 0.34 / 0.29
w/o CNN, MLP | 0.55 / 0.53 | 0.84 / 0.84 | 0.34 / 0.30
w/o MSA, MLP | 0.57 / 0.55 | 0.84 / 0.84 | 0.35 / 0.31
w/o CNN, MSA, MLP | 0.52 / 0.48 | 0.83 / 0.84 | 0.32 / 0.28

The ablation study results shown in Table 7 systematically evaluate the contribution of each component within the InsectMamba model, namely Convolutional Neural Networks (CNN), Multi-Layer Perceptron (MLP), and Multi-Head Self-Attention (MSA), across three datasets: Farm Insects, Insect Recognition, and IP102. The results highlight the significant role each component plays in achieving high accuracy and F1 scores. The complete InsectMamba model achieves the best performance across all datasets, which underscores the synergistic effect of combining CNN, MLP, and MSA for feature extraction and representation learning. Removing any single component (CNN, MSA, or MLP) leads to a decrease in both accuracy and F1 scores across all datasets, indicating that each component contributes unique and valuable information for classification. The most significant performance degradation is observed when multiple components are removed simultaneously, particularly when CNN, MSA, and MLP are all excluded. This configuration results in the lowest accuracy and F1 scores, demonstrating that the integration of multiple visual encoding strategies is crucial for capturing the comprehensive visual characteristics of insects.

5.5 Analysis

Impact of Feature Aggregation Methods.

[Figure 3: Comparison of feature aggregation methods (Selective Module, Max Pooling, Average Pooling) in terms of ACC and F1 on the Farm Insects and IP102 datasets.]

To investigate the effectiveness of different feature aggregation methods within the InsectMamba model, we compare the Selective Module against Max Pooling and Average Pooling. As depicted in Figure 3, the Selective Module consistently outperforms Max Pooling and Average Pooling in terms of Accuracy (ACC) and F1 Score across two distinct datasets: Farm Insects and IP102. For the Farm Insects dataset, the Selective Module achieves the highest ACC and F1 Score, indicating its superior capability in capturing and integrating salient features for insect pest classification. In particular, the ACC and F1 improvement over Max Pooling is pronounced, underlining the Selective Module's effectiveness in handling more nuanced classification tasks within a diverse set of insect species. On the IP102 dataset, the Selective Module still maintains an advantage. Moreover, the variance in performance across the two datasets also highlights the adaptive nature of the Selective Module: it can dynamically adjust the integration of visual features from different visual encoding strategies according to the dataset's complexity and diversity.
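For reference, the two baseline aggregation schemes compared here can be written as unlearned element-wise reductions over the stacked strategy features; the sketch below reflects our reading of those baselines rather than the authors' exact implementation.

```python
import torch

def aggregate(feats, method="avg"):
    """Fuse a list of (B, C, H, W) strategy features without learned weights."""
    F = torch.stack(feats, dim=1)          # (B, n, C, H, W)
    if method == "max":
        return F.max(dim=1).values         # element-wise max over strategies
    return F.mean(dim=1)                   # element-wise average over strategies

feats = [torch.randn(2, 96, 14, 14) for _ in range(4)]
print(aggregate(feats, "max").shape, aggregate(feats, "avg").shape)
```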

Impact of Kernel Size in the Selective Module.

[Figure 4: Impact of kernel size in the Selective Module on (a) the Farm Insects dataset and (b) the IP102 dataset.]

In the process of optimizing our InsectMamba model, we investigated the impact of different kernel sizes within the Selective Module on classification performance. As shown in Figure 4, the Selective Module was evaluated with kernel sizes of $1\times 1$, $3\times 3$, $5\times 5$, and $7\times 7$. For the Farm Insects dataset, shown in Figure 4(a), we observe that both Accuracy (ACC) and F1 Score (F1) peak at a kernel size of $3\times 3$. The performance declines when the kernel size is increased to $5\times 5$ and drops significantly at $7\times 7$, which demonstrates that smaller kernel sizes are more effective at capturing the relevant visual features for insect pest classification. Moreover, the IP102 dataset, shown in Figure 4(b), exhibits a consistent trend. This highlights the importance of the kernel size in the Selective Module for adaptively integrating the different visual encoding strategies.

Impact of Pooling Methods in the Selective Module.

[Figure 5: Impact of pooling methods (Average, Max, L2, and Stochastic Pooling) in the Selective Module on the Farm Insects and IP102 datasets.]

We investigate the impact of various pooling methods on the performance of the Selective Module within our InsectMamba model. Specifically, we evaluated Average Pooling, Max Pooling, L2 Pooling, and Stochastic Pooling for synthesizing the global features as prescribed in Equation 13. Figure 5 shows the comparative performance on two datasets, i.e., Farm Insects and IP102. For the Farm Insects dataset, Average Pooling achieved the best accuracy and F1 score, indicating its effectiveness in preserving feature representations for classification. Moreover, the results on the IP102 dataset show a consistent trend, with Average Pooling performing as well as it does on the Farm Insects dataset.
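The pooling options compared above all reduce a feature map to one statistic per channel in place of the average in Equation (13); the sketch below gives plausible definitions, and the stochastic variant (sampling a spatial position per channel in proportion to its activation) is an assumption on our part.

```python
import torch

def global_pool(F, method="avg"):
    """Collapse the spatial dimensions of F (B, C, H, W) into a (B, C) vector."""
    flat = F.flatten(2)                                    # (B, C, H*W)
    if method == "max":
        return flat.max(dim=-1).values
    if method == "l2":
        return flat.pow(2).mean(dim=-1).sqrt()             # root mean square per channel
    if method == "stochastic":
        probs = flat.clamp(min=1e-6)
        probs = probs / probs.sum(dim=-1, keepdim=True)    # sample one position per channel
        idx = torch.multinomial(probs.flatten(0, 1), 1).view(F.size(0), F.size(1), 1)
        return flat.gather(-1, idx).squeeze(-1)
    return flat.mean(dim=-1)                               # average pooling (Eq. 13)

F = torch.rand(2, 96, 14, 14)
print([global_pool(F, m).shape for m in ("avg", "max", "l2", "stochastic")])
```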

6 Conclusion

In this work, we proposed a novel model, InsectMamba, for insect pest classification. The model is designed to amalgamate the strengths of State Space Models, Convolutional Neural Networks, Multi-Head Self-Attention mechanisms, and Multilayer Perceptrons. By integrating these varied visual encoding strategies through Mix-SSM blocks and a selective aggregation module, InsectMamba addresses the challenges posed by pest camouflage and species diversity. In the experiments, we conducted an extensive evaluation comparing our method with other strong competitors on five insect pest classification datasets. The experimental results show that our model outperforms the other models, which demonstrates its effectiveness. We also illuminate the importance of each integrated module through comprehensive ablation studies.

References

  • Ali etal. [2023]Farooq Ali, Huma Qayyum, and MuhammadJaved Iqbal.Faster-pestnet: A lightweight deep learning framework for crop pest detection and classification.IEEE Access, 11:104016–104027, 2023.doi: 10.1109/ACCESS.2023.3317506.URL https://doi.org/10.1109/ACCESS.2023.3317506.
  • An etal. [2023]Jingmin An, Yong Du, Peng Hong, Lei Zhang, and Xiaogang Weng.Insect recognition based on complementary features from multiple views.Scientific Reports, 13(1):2966, 2023.
  • Anwar and Masood [2023]Zeba Anwar and Sarfaraz Masood.Exploring deep ensemble model for insect and pest detection from images.Procedia Computer Science, 218:2328–2337, 2023.
  • Butera etal. [2022]Luca Butera, Alberto Ferrante, Mauro Jermini, Mauro Prevostini, and Cesare Alippi.Precise agriculture: Effective deep learning strategies to detect pest insects.IEEE CAA J. Autom. Sinica, 9(2):246–258, 2022.doi: 10.1109/JAS.2021.1004317.URL https://doi.org/10.1109/JAS.2021.1004317.
  • Cheng et al. [2017] Xi Cheng, Youhua Zhang, Yiqiong Chen, Yun-Zhi Wu, and Yi Yue. Pest identification via deep residual learning in complex background. Comput. Electron. Agric., 141:351–356, 2017. doi: 10.1016/J.COMPAG.2017.08.005. URL https://doi.org/10.1016/j.compag.2017.08.005.
  • Doan [2023]Thanh-Nghi Doan.Large-scale insect pest image classification.Journal of Advances in Information Technology, 14(2):328–341, 2023.
  • Dosovitskiy etal. [2021]Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby.An image is worth 16x16 words: Transformers for image recognition at scale.In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.URL https://openreview.net/forum?id=YicbFdNTTy.
  • Gu and Dao [2023]Albert Gu and Tri Dao.Mamba: Linear-time sequence modeling with selective state spaces.CoRR, abs/2312.00752, 2023.doi: 10.48550/ARXIV.2312.00752.URL https://doi.org/10.48550/arXiv.2312.00752.
  • Han etal. [2024]Guangzeng Han, Weisi Liu, Xiaolei Huang, and Brian Borsari.Chain-of-interaction: Enhancing large language models for psychiatric behavior understanding by dyadic contexts.arXiv preprint arXiv:2403.13786, 2024.
  • He etal. [2016]Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition.In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778. IEEE Computer Society, 2016.doi: 10.1109/CVPR.2016.90.URL https://doi.org/10.1109/CVPR.2016.90.
  • Howard et al. [2017] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017. URL http://arxiv.org/abs/1704.04861.
  • Hu etal. [2023]Zhengyu Hu, Jieyu Zhang, Haonan Wang, Siwei Liu, and Shangsong Liang.Leveraging relational graph neural network for transductive model ensemble.In AmbujK. Singh, Yizhou Sun, Leman Akoglu, Dimitrios Gunopulos, Xifeng Yan, Ravi Kumar, Fatma Ozcan, and Jieping Ye, editors, Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023, pages 775–787. ACM, 2023.doi: 10.1145/3580305.3599414.URL https://doi.org/10.1145/3580305.3599414.
  • Kasinathan and Reddy [2019]Thenmozhi Kasinathan and U.Srinivasulu Reddy.Crop pest classification based on deep convolutional neural network and transfer learning.Comput. Electron. Agric., 164, 2019.doi: 10.1016/J.COMPAG.2019.104906.URL https://doi.org/10.1016/j.compag.2019.104906.
  • Kingma and Ba [2015]DiederikP. Kingma and Jimmy Ba.Adam: A method for stochastic optimization.In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.URL http://arxiv.org/abs/1412.6980.
  • Krizhevsky etal. [2017]Alex Krizhevsky, Ilya Sutskever, and GeoffreyE. Hinton.Imagenet classification with deep convolutional neural networks.Commun. ACM, 60(6):84–90, 2017.doi: 10.1145/3065386.URL https://doi.org/10.1145/3065386.
  • Lai etal. [2024a]Zhixin Lai, Jing Wu, Suiyao Chen, Yucheng Zhou, Anna Hovakimyan, and Naira Hovakimyan.Language models are free boosters for biomedical imaging tasks.arXiv preprint arXiv:2403.17343, 2024a.
  • Lai etal. [2024b]Zhixin Lai, Xuesheng Zhang, and Suiyao Chen.Adaptive ensembles of fine-tuned transformers for llm-generated text detection, 2024b.
  • Liu etal. [2022]Bing Liu, Luyang Liu, Ran Zhuo, Weidong Chen, Rui Duan, and Guishen Wang.A dataset for forestry pest identification.Frontiers in Plant Science, 13:857104, 2022.
  • Liu etal. [2024a]Tianrui Liu, Changxin Xu, Yuxin Qiao, Chufeng Jiang, and Weisheng Chen.News recommendation with attention mechanism.CoRR, abs/2402.07422, 2024a.doi: 10.48550/ARXIV.2402.07422.URL https://doi.org/10.48550/arXiv.2402.07422.
  • Liu etal. [2024b]Tianrui Liu, Changxin Xu, Yuxin Qiao, Chufeng Jiang, and Jiqiang Yu.Particle filter SLAM for vehicle localization.CoRR, abs/2402.07429, 2024b.doi: 10.48550/ARXIV.2402.07429.URL https://doi.org/10.48550/arXiv.2402.07429.
  • Liu etal. [2024c]Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu.Vmamba: Visual state space model.CoRR, abs/2401.10166, 2024c.doi: 10.48550/ARXIV.2401.10166.URL https://doi.org/10.48550/arXiv.2401.10166.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 9992–10002. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00986. URL https://doi.org/10.1109/ICCV48922.2021.00986.
  • Liu etal. [2016]Ziyi Liu, Junfeng Gao, Guoguo Yang, Huan Zhang, and Yong He.Localization and classification of paddy field pests using a saliency map and deep convolutional neural network.Scientific reports, 6(1):20410, 2016.
  • Lyu etal. [2024]Weimin Lyu, Xiao Lin, Songzhu Zheng, LuPang, Haibin Ling, Susmit Jha, and Chao Chen.Task-agnostic detector for insertion-based backdoor attacks.arXiv preprint arXiv:2403.17155, 2024.
  • Murtagh [1990]Fionn Murtagh.Multilayer perceptrons for classification and regression.Neurocomputing, 2(5):183–197, 1990.doi: 10.1016/0925-2312(91)90023-5.URL https://doi.org/10.1016/0925-2312(91)90023-5.
  • O’Shea and Nash [2015]Keiron O’Shea and Ryan Nash.An introduction to convolutional neural networks.CoRR, abs/1511.08458, 2015.URL http://arxiv.org/abs/1511.08458.
  • Peng and Wang [2022]Yingshu Peng and YiWang.CNN and transformer framework for insect pest classification.Ecol. Informatics, 72:101846, 2022.doi: 10.1016/J.ECOINF.2022.101846.URL https://doi.org/10.1016/j.ecoinf.2022.101846.
  • Ren etal. [2019]Fuji Ren, Wenjie Liu, and Guoqing Wu.Feature reuse residual networks for insect pest recognition.IEEE Access, 7:122758–122768, 2019.doi: 10.1109/ACCESS.2019.2938194.URL https://doi.org/10.1109/ACCESS.2019.2938194.
  • Simonyan and Zisserman [2015]Karen Simonyan and Andrew Zisserman.Very deep convolutional networks for large-scale image recognition.In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.URL http://arxiv.org/abs/1409.1556.
  • Su etal. [2024]Jing Su, Chufeng Jiang, Xin Jin, Yuxin Qiao, Tingsong Xiao, Hongda Ma, Rong Wei, Zhi Jing, Jiajun Xu, and Junhong Lin.Large language models for forecasting and anomaly detection: A systematic literature review.CoRR, abs/2402.10350, 2024.doi: 10.48550/ARXIV.2402.10350.URL https://doi.org/10.48550/arXiv.2402.10350.
  • Tolstikhin et al. [2021] Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. MLP-Mixer: An all-MLP architecture for vision. In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 24261–24272, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/cba0a4ee5ccd02fda0fe3f9a3e7b89fe-Abstract.html.
  • Touvron etal. [2021]Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou.Training data-efficient image transformers & distillation through attention.In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR, 2021.URL http://proceedings.mlr.press/v139/touvron21a.html.
  • Ung et al. [2022] Hieu Trung Ung, Quang Huy Ung, Trung T. Nguyen, and Binh T. Nguyen. An efficient insect pest classification using multiple convolutional neural network based models. In Hamido Fujita, Yutaka Watanobe, and Takuya Azumi, editors, New Trends in Intelligent Software Methodologies, Tools and Techniques - Proceedings of the 21st International Conference on New Trends in Intelligent Software Methodologies, Tools and Techniques, SoMeT 2022, Kitakyushu, Japan, 20-22 September, 2022, volume 355 of Frontiers in Artificial Intelligence and Applications, pages 584–595. IOS Press, 2022. doi: 10.3233/FAIA220287. URL https://doi.org/10.3233/FAIA220287.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
  • Wang etal. [2017]RuJing Wang, Jie Zhang, Wei Dong, Jian Yu, ChengJun Xie, Rui Li, TianJiao Chen, and HongBo Chen.A crop pests image classification algorithm based on deep convolutional neural network.TELKOMNIKA (Telecommunication Computing Electronics and Control), 15(3):1239–1246, 2017.
  • Wu etal. [2024]Jing Wu, Zhixin Lai, Suiyao Chen, Ran Tao, Pan Zhao, and Naira Hovakimyan.The new agronomists: Language models are experts in crop management.arXiv preprint arXiv:2403.19839, 2024.
  • Wu etal. [2019]Xiaoping Wu, Chi Zhan, Yu-Kun Lai, Ming-Ming Cheng, and Jufeng Yang.IP102: A large-scale benchmark dataset for insect pest recognition.In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 8787–8796. Computer Vision Foundation / IEEE, 2019.doi: 10.1109/CVPR.2019.00899.URL http://openaccess.thecvf.com/content_CVPR_2019/html/Wu_IP102_A_Large-Scale_Benchmark_Dataset_for_Insect_Pest_Recognition_CVPR_2019_paper.html.
  • Xie etal. [2015]Chengjun Xie, Jie Zhang, Rui Li, Jinyan Li, Peilin Hong, Junfeng Xia, and Peng Chen.Automatic classification for field crop insects via multiple-task sparse representation and multiple-kernel learning.Comput. Electron. Agric., 119:123–132, 2015.doi: 10.1016/J.COMPAG.2015.10.015.URL https://doi.org/10.1016/j.compag.2015.10.015.
  • Zhang etal. [2023]Jieyu Zhang, Bohan Wang, Zhengyu Hu, PangWei Koh, and AlexanderJ. Ratner.On the trade-off of intra-/inter-class diversity for supervised pre-training.In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.URL http://papers.nips.cc/paper_files/paper/2023/hash/ca9567d8ef6b2ea2da0d7eed57b933ee-Abstract-Conference.html.
  • Zhou etal. [2023]Yucheng Zhou, Xiubo Geng, Tao Shen, Chongyang Tao, Guodong Long, Jian-Guang Lou, and Jianbing Shen.Thread of thought unraveling chaotic contexts.CoRR, abs/2311.08734, 2023.doi: 10.48550/ARXIV.2311.08734.URL https://doi.org/10.48550/arXiv.2311.08734.
  • Zhou etal. [2024]Yucheng Zhou, Xiang Li, Qianning Wang, and Jianbing Shen.Visual in-context learning for large vision-language models.CoRR, abs/2402.11574, 2024.doi: 10.48550/ARXIV.2402.11574.URL https://doi.org/10.48550/arXiv.2402.11574.