Insect Pest Classification with State Space Model

Qianning Wang^1, Chenglin Wang^2, Zhixin Lai^3, Yucheng Zhou^4
^1 Nanjing Audit University, ^2 East China Normal University
^3 Snapchat, ^4 SKL-IOTSC, CIS, University of Macau
yucheng.zhou@connect.um.edu.mo

Abstract

The classification of insect pests is a critical task in agricultural technology, vital for ensuring food security and environmental sustainability. However, the complexity of pest identification, due to factors like high camouflage and species diversity, poses significant obstacles. Existing methods struggle with the fine-grained feature extraction needed to distinguish between closely related pest species. Although recent advancements have utilized modified network structures and combined deep learning approaches to improve accuracy, challenges persist due to the similarity between pests and their surroundings. To address this problem, we introduce InsectMamba, a novel approach that integrates State Space Models (SSMs), Convolutional Neural Networks (CNNs), the Multi-Head Self-Attention mechanism (MSA), and Multilayer Perceptrons (MLPs) within Mix-SSM blocks. This integration facilitates the extraction of comprehensive visual features by leveraging the strengths of each encoding strategy. A selective module is also proposed to adaptively aggregate these features, enhancing the model's ability to discern pest characteristics. InsectMamba was evaluated against strong competitors across five insect pest classification datasets. The results demonstrate its superior performance, and an ablation study verifies the significance of each model component.

1 Introduction

In agricultural production, pests significantly impact crop yields, so the identification and classification of pests are pivotal for ensuring food security and sustainability in agricultural technology. Insect pest classification aims to leverage vision models to automate insect pest recognition Liu et al. (2022); Wu et al. (2019). This task is crucial for maintaining crop health, potentially reducing pesticide usage, and fostering environmentally sustainable agricultural practices. Furthermore, accurate identification of insect pests benefits crop management by minimizing damage and optimizing yields.

Pests often exhibit a high degree of camouflage within their natural habitats An et al. (2023); Wu et al. (2019), which makes visual recognition difficult and illustrates the complexity of insect pest classification. The similarity between pests and their surroundings, coupled with the vast diversity of species, poses significant obstacles to traditional image processing algorithms. Furthermore, the necessity for fine-grained feature extraction to distinguish between closely related pest species adds another layer of complexity to this challenge Anwar and Masood (2023); Butera et al. (2022). Recent work has proposed modified capsule networks that improve the network structure, thereby better modeling the hierarchical and spatial relationships of features to increase classification accuracy Liu et al. (2022); Butera et al. (2022). Additionally, some studies have combined multiple deep networks to exploit complementary features from multiple perspectives and thus enhance recognition rates and robustness An et al. (2023); Anwar and Masood (2023). Nonetheless, these approaches still face challenges due to the similarity between pests and their surroundings.

In addressing the challenges of accurately identifying and classifying pests in varied conditions, different visual encoding strategies offer different advantages. Convolutional Neural Networks (CNNs O'Shea and Nash (2015)) excel at local feature extraction, whereas the Multi-Head Self-Attention mechanism (MSA Vaswani et al. (2017)) is adept at capturing global features. State Space Models (SSMs Gu and Dao (2023)) are particularly effective at modeling long-distance dependencies, and Multilayer Perceptrons (MLPs Murtagh (1990)) specialize in channel-aware information inference.

To integrate the advantages of these visual encoding strategies, we propose a novel approach, InsectMamba, consisting of Mix-SSM blocks that integrate SSM, CNN, MSA, and MLP to extract more comprehensive visual features for insect pest classification. In addition, we propose a selective module to adaptively aggregate visual features from different encoding strategies. By leveraging the complementary capabilities of these visual encoding strategies, our method aims to capture both the local and global features of pests, thus addressing the critical challenges of camouflage and species diversity.

In the experiments, we evaluate our model and other strong competitors on five insect pest classification datasets. To make the datasets more challenging, we re-split them. The experimental results show that our method outperforms the other methods, which demonstrates its effectiveness. Moreover, we conduct an ablation study to verify the significance of each module of our model. Furthermore, we conduct extensive analyses of our model design to demonstrate its effectiveness.

The main contributions of this study are as follows:

  • We propose InsectMamba, which is the first attempt to apply SSM-based models to insect pest classification.

  • We present Mix-SSM blocks that seamlessly integrate SSM, CNN, MSA, and MLP. This integration allows our model to capture a comprehensive range of visual features for insect pest classification.

  • We propose a selective aggregation module designed to adaptively combine visual features derived from different encoding strategies. This module allows the model to select relevant features that are utilized for classification.

  • We have rigorously evaluated InsectMamba across five insect pest classification datasets, demonstrating its superior performance compared to existing models.

2 Related Work

2.1 Image Classification

The rapid advancements in computer vision Liu et al. (2024b); Zhou et al. (2024); Su et al. (2024); Zhang et al. (2023) have led to its extensive application across various areas, including AI security Lyu et al. (2024), generated-content detection Lai et al. (2024b), biomedicine Lai et al. (2024a), and agricultural technology Wu et al. (2024). Notably, image classification Krizhevsky et al. (2017); Dosovitskiy et al. (2021); Liu et al. (2024c) stands out as a fundamental technique for many applications in computer vision, and it aims to distinguish different categories of images. Some works He et al. (2016); Simonyan and Zisserman (2015) employ Convolutional Neural Networks (CNNs) for image classification, owing to the convolutional layers' capability to capture local features within images. For instance, AlexNet Krizhevsky et al. (2017), consisting of five convolutional layers and three fully connected layers, achieves strong image classification performance. VGG Simonyan and Zisserman (2015) and ResNet He et al. (2016) respectively propose enhancements by increasing the depth of the original network and integrating skip connections to further improve the model's classification capabilities. However, CNNs have limitations in understanding global information and lack robustness when capturing global and long-distance dependencies Dosovitskiy et al. (2021). The Vision Transformer (ViT) Dosovitskiy et al. (2021) leverages multi-head self-attention (MSA) Vaswani et al. (2017) to capture the context of each patch, which enhances the model's capability to capture global dependencies. Moreover, the Swin Transformer Liu et al. (2021) adopts a windowed self-attention mechanism and a hierarchical structural design, which not only retains the global modeling capabilities of MSA but also enhances the extraction of local features. Furthermore, MLP-Mixer Tolstikhin et al. (2021) proposes a pure MLP-based architecture to capture different contextual relationships and enhance visual representation. In addition, VMamba Liu et al. (2024c) improves visual classification tasks by integrating a novel Sequence State Space (S4) model with a selection mechanism and scan computation, termed Mamba.

2.2 Insect Pest Classification

Insect pest classification helps people better understand the population dynamics and potential damage of pests and formulate effective pest management strategies, which is very important for the agricultural economy and environmental science. However, compared to general images, the feature differences in the insect pest domain can be very subtle and the backgrounds more complex, which places higher requirements on the classification model and requires more accurate extraction of effective features Doan (2023); Ung et al. (2022). To address this challenge, some works Cheng et al. (2017); Liu et al. (2016); Wang et al. (2017); Kasinathan and Reddy (2019); Ren et al. (2019) improve CNN-based models to capture pest features against complex backgrounds. In addition, Faster-PestNet Ali et al. (2023) used MobileNet Howard et al. (2017) to extract sample attributes and redesigned an improved Faster-RCNN method to recognize crop pests. Ung et al. (2022) propose a CNN-based model with an attention mechanism to further focus on insects in the image; An et al. (2023) propose a feature fusion network that synthesizes representations from different backbone models to enhance insect image classification; Anwar and Masood (2023) employ a deep ensemble method Hu et al. (2023) to improve accuracy and robustness in insect and pest detection from images. Moreover, Peng and Wang (2022) investigated the ViT architecture in the insect domain and aggregated CNNs and self-attention models to further improve the capability for insect pest classification.

3 Preliminaries

3.1 Convolutional Neural Networks

Convolutional Neural Networks (CNNs O'Shea and Nash (2015)) are widely applied in computer vision owing to their strong capability for image feature extraction. A CNN consists of sets of fixed-size learnable parameters known as filters, which perform convolutional computations over a sliding window across the input images. Specifically, given visual features $\bm{V}\in\mathbb{R}^{H\times W\times C}$, where $H$, $W$, and $C$ are the height, width, and number of channels, we can use convolution kernels $w$ of size $F_h\times F_w\times C_{in}$ to calculate the pixel value of each channel of the output visual features, i.e.,

$$\bm{V}_{out}[i,j,k]=\sum_{l=0}^{C_{in}-1}\Big(\sum_{m=0}^{F_h-1}\sum_{n=0}^{F_w-1}\bm{V}[i\times S+m,\, j\times S+n,\, l]\times w[m,n,l,k]\Big)+b[k] \qquad (1)$$

where $\bm{V}_{out}$ is the output feature map, $(i,j,k)$ is the index, $S$ is the stride, and $b[k]$ is the bias of channel $k$. Through this cascaded structure, a CNN can gradually learn feature representations from low-level to high-level from the original data, and finally achieve effective classification.
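To make Equation (1) concrete, the following is a minimal NumPy sketch of the convolution as an explicit loop over output positions and channels; the function name, the absence of padding, and the default unit stride are our own illustrative choices rather than details from the paper.

```python
import numpy as np

def conv2d(V, w, b, stride=1):
    """Naive 2D convolution following Eq. (1).

    V: input features of shape (H, W, C_in)
    w: kernels of shape (F_h, F_w, C_in, C_out)
    b: bias of shape (C_out,)
    """
    H, W, C_in = V.shape
    F_h, F_w, _, C_out = w.shape
    H_out = (H - F_h) // stride + 1
    W_out = (W - F_w) // stride + 1
    V_out = np.zeros((H_out, W_out, C_out))
    for i in range(H_out):
        for j in range(W_out):
            # Extract the sliding window starting at (i*S, j*S).
            patch = V[i * stride:i * stride + F_h, j * stride:j * stride + F_w, :]
            for k in range(C_out):
                # Sum over spatial positions and input channels, then add the bias.
                V_out[i, j, k] = np.sum(patch * w[:, :, :, k]) + b[k]
    return V_out

# Example: a 3x3 kernel bank over an 8x8 three-channel input.
V = np.random.rand(8, 8, 3)
w = np.random.rand(3, 3, 3, 4)
b = np.zeros(4)
print(conv2d(V, w, b).shape)  # (6, 6, 4)
```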

3.2 Multi-Head Self-Attention

Multi-Head Self-Attention (MSA) was proposed by Vaswani et al. (2017) and is widely used in many natural language processing tasks Zhou et al. (2023); Han et al. (2024); Liu et al. (2024a). Unlike convolutional neural networks, MSA allows the model to weigh the importance of different input tokens when generating output representations, enabling it to effectively capture global dependencies and contextual information within the sequence. Recently, Transformer-like architectures have also demonstrated powerful modeling capabilities in computer vision Dosovitskiy et al. (2021). Specifically, given visual features $\bm{V}\in\mathbb{R}^{H\times W\times C}$, the multi-head self-attention modeling of the visual features can be defined as:

$$\operatorname{Attn}_t^h=\mathrm{softmax}\!\left(\frac{\bm{Q}_t^h\cdot(\bm{K}_t^h)^T}{\sqrt{d_{\bm{K}_t^h}}}\right),\quad \text{where } \bm{Q}_t^h=\bm{W}_Q^h\cdot\bm{V},\ \bm{K}_t^h=\bm{W}_k^h\cdot\bm{V}, \qquad (2)$$

where $\bm{W}_Q^h\in\mathbb{R}^{D\times d}$ and $\bm{W}_k^h\in\mathbb{R}^{D\times d}$ are linear projections that map the $D$-dimensional input vectors into the query $\bm{Q}_t^h\in\mathbb{R}^{N\times d}$ and the key $\bm{K}_t^h\in\mathbb{R}^{N\times d}$, respectively. Each attention matrix $\operatorname{Attn}_t^h$ is multiplied with the value to obtain an updated representation that fuses global information, i.e.,

$$\bm{V}:=\operatorname{Attn}_t^h\cdot\bm{V}_t^h,\quad \text{where } \bm{V}_t^h=\bm{W}_t^h\cdot\bm{V} \qquad (3)$$

In vision tasks, MSA typically needs to be pre-trained on large-scale datasets to compensate for its lack of the inductive biases built into CNNs, such as translation invariance and locality.
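As a sketch of Equations (2) and (3), the snippet below implements a single self-attention head over flattened patch tokens with NumPy; the token count, dimensions, and random projections are illustrative assumptions, and a full MSA would run several such heads in parallel and concatenate their outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(V, W_q, W_k, W_v):
    """One attention head over N flattened patch tokens (Eqs. 2-3).

    V: (N, D) patch features; W_q, W_k, W_v: (D, d) projection matrices.
    """
    Q, K, Val = V @ W_q, V @ W_k, V @ W_v            # (N, d) each
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (N, N) attention weights
    return attn @ Val                                # (N, d) globally mixed features

# Example: 16 patch tokens with 32-dim features, projected to 8 dims.
rng = np.random.default_rng(0)
V = rng.normal(size=(16, 32))
W_q, W_k, W_v = (rng.normal(size=(32, 8)) for _ in range(3))
print(self_attention_head(V, W_q, W_k, W_v).shape)   # (16, 8)
```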

3.3 Multi-Layer Perceptron

The multi-layer perceptron (MLP) is a commonly used neural network module for many tasks Murtagh (1990); Tolstikhin et al. (2021). An MLP mainly contains $N$ linear layers; each layer has learnable weight and bias parameters as well as an activation function, which models the non-linear relationship between input and output. Specifically, given visual features $\bm{V}\in\mathbb{R}^{H\times W\times C}$, the MLP, operating only over channels, maps each channel vector to a $D$-dimensional hidden vector $\bm{h}_i$, i.e.,

$$\bm{h}_i=\operatorname{Activation}\Big(\sum_{j}\bm{W}_{ij}\cdot c_j+b_i\Big),\quad \bm{W}\in\mathbb{R}^{C\times D},\ b\in\mathbb{R}^{D}, \qquad (4)$$

where $\bm{h}_i$ is the $i$-th dimension of $\bm{H}$, obtained by weighting the $C$ channels with the learnable parameters in the $i$-th column of the weight matrix $\bm{W}$. $\operatorname{Activation}$ is an activation function that adjusts the output through a non-linear transformation.
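A minimal sketch of the channel-wise mapping in Equation (4) is shown below; the GELU activation is an assumption, since the paper does not state which activation function is used.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def channel_mlp(V, W, b):
    """Map the C channels at every spatial position to D hidden dims (Eq. 4).

    V: (H, W, C) features, W: (C, D) weights, b: (D,) bias.
    """
    return gelu(V @ W + b)  # broadcasting applies the same MLP at every position

V = np.random.rand(8, 8, 16)
H = channel_mlp(V, np.random.rand(16, 32), np.zeros(32))
print(H.shape)  # (8, 8, 32)
```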

3.4 State Space Models

State Space Models (SSMs) Gu and Dao (2023); Liu et al. (2024c) have recently been extended with a novel Cross-Scan Module (CSM) for improved directional sensitivity and computational efficiency. SSMs model the dynamics of visual systems through equations that describe temporal evolution and observation generation. The state transition function is as follows:

$$\bm{x}_{t+1}=\bm{A}\cdot\bm{x}_t+\bm{B}\cdot\bm{u}_t+\bm{w}_t, \qquad (5)$$

where $\bm{x}_t$ denotes the system state at time $t$, $\bm{u}_t$ represents the control inputs, and $\bm{w}_t$ is the process noise, indicating uncertainties in state transitions. Moreover, the observation function can be defined as:

$$\bm{y}_t=\bm{C}\cdot\bm{x}_t+\bm{D}\cdot\bm{u}_t+\bm{v}_t, \qquad (6)$$

with $\bm{y}_t$ as the observation at time $t$ and $\bm{v}_t$ as the observation noise, highlighting discrepancies between modeled and actual observations. The matrices $\bm{A},\bm{B},\bm{C},\bm{D}$ define the dynamics, linking state transitions to observations. The Cross-Scan Module (CSM) further addresses directional sensitivity by structuring visual features into ordered patch sequences through:

$$\operatorname{CSM}(\bm{V})=\operatorname{Order}(\operatorname{Traverse}(\bm{V})), \qquad (7)$$

where $\bm{V}$ is a visual feature input. This process allows for effective handling of spatial information, improving the model's dynamic processing capabilities.
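The recurrence in Equations (5) and (6) can be sketched as a simple roll-out of a discrete linear system, as below; the matrices are chosen arbitrarily and the noise terms are omitted, and the sketch does not include the input-dependent (selective) parameterization or the cross-scan ordering used by Mamba-style models.

```python
import numpy as np

def run_ssm(A, B, C, D, u_seq, x0):
    """Roll out the linear state-space recurrence of Eqs. (5)-(6).

    A: (n, n), B: (n, m), C: (p, n), D: (p, m)
    u_seq: (T, m) input sequence, x0: (n,) initial state.
    """
    x, ys = x0, []
    for u in u_seq:
        y = C @ x + D @ u      # observation (Eq. 6), noise omitted
        ys.append(y)
        x = A @ x + B @ u      # state transition (Eq. 5), noise omitted
    return np.stack(ys)

n, m, p, T = 4, 2, 3, 10
rng = np.random.default_rng(1)
A = 0.9 * np.eye(n)                       # stable, slowly decaying state
B, C, D = rng.normal(size=(n, m)), rng.normal(size=(p, n)), np.zeros((p, m))
print(run_ssm(A, B, C, D, rng.normal(size=(T, m)), np.zeros(n)).shape)  # (10, 3)
```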

4 InsectMamba

This section elaborates on the architecture of our InsectMamba model, a novel vision model for insect pest classification. The backbone of our model is the Mix-SSM Block, designed to integrate features from various visual encoding strategies. Finally, we introduce our proposed Selective Module, which adaptively integrates the representations derived from the different visual encoding strategies.

4.1 Overall Architecture

[Figure 1: Overall architecture of the InsectMamba model.]

As shown in Figure 1, given an image $\bm{I}\in\mathbb{R}^{H\times W\times 3}$, the image is initially segmented into multiple non-overlapping $4\times 4$ patches. Subsequently, a patch embedding layer Dosovitskiy et al. (2021) is utilized to transform these patches into a lower-dimensional latent space, resulting in dimensions $\frac{H}{4}\times\frac{W}{4}\times C$, where $C$ denotes the number of channels in the latent space, i.e.,

$$\bm{V}=\operatorname{PatchEmbed}(\bm{I}),\quad \bm{V}\in\mathbb{R}^{\frac{H}{4}\times\frac{W}{4}\times C}. \qquad (8)$$

Subsequently, we pass the features $\bm{V}$ into the Mix-SSM Blocks for feature extraction, and dimensionality reduction is achieved through a Patch Merging operation Liu et al. (2021), i.e.,

$$\bm{V}:=\operatorname{PatchMerging}(\operatorname{Mix\text{-}SSM\text{-}Block}(\bm{V})). \qquad (9)$$

After several iterations of Mix-SSM Blocks and Patch Merging operations, as shown in Figure 1, the final visual representation of the image, $\bm{v}\in\mathbb{R}^{L}$, is derived. Lastly, $\bm{v}$ is passed through a linear layer $\operatorname{Linear}$ to transform its dimension to the number of classes, i.e.,

$$\bm{h}=\operatorname{Linear}(\bm{v}),\qquad \bm{p}=\mathrm{softmax}(\bm{h}) \qquad (10)$$

where $\mathrm{softmax}$ converts the hidden features $\bm{h}$ into a probability distribution $\bm{p}$ over the classes.
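The forward pass of Equations (8)-(10) can be sketched at a high level as follows, assuming PyTorch; the Mix-SSM Block stacks are represented by placeholder modules, and the stage widths, depths, and final pooling are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class InsectMambaSketch(nn.Module):
    """Schematic forward pass: patch embedding -> [blocks + patch merging] -> classifier."""

    def __init__(self, num_classes, dim=96, depths=(2, 2, 4)):
        super().__init__()
        # Patch embedding over 4x4 non-overlapping patches (Eq. 8).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=4, stride=4)
        stages, merges = [], []
        for i, d in enumerate(depths):
            c = dim * 2 ** i
            # Placeholder for a stack of Mix-SSM Blocks at this stage.
            stages.append(nn.Sequential(*[nn.Identity() for _ in range(d)]))
            # Stand-in for Patch Merging: halve the resolution, double the channels (cf. Eq. 9).
            merges.append(nn.Conv2d(c, c * 2, kernel_size=2, stride=2))
        self.stages, self.merges = nn.ModuleList(stages), nn.ModuleList(merges)
        self.head = nn.Linear(dim * 2 ** len(depths), num_classes)  # Eq. 10

    def forward(self, img):
        v = self.patch_embed(img)                      # (B, C, H/4, W/4)
        for blocks, merge in zip(self.stages, self.merges):
            v = merge(blocks(v))
        v = v.mean(dim=(2, 3))                         # globally pooled representation
        return torch.softmax(self.head(v), dim=-1)     # class probabilities

probs = InsectMambaSketch(num_classes=102)(torch.randn(2, 3, 224, 224))
print(probs.shape)  # torch.Size([2, 102])
```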

4.2 Mix-SSM Block

[Figure 2: The Mix-SSM Block, which combines SSM, Conv, MLP, and MSA branches with the Selective Module.]

The Mix-SSM Block is composed of several key components: a Selective Scan Module (SSM), convolutional layers (Conv), a Multi-Layer Perceptron (MLP), a Multi-Head Self-Attention mechanism (MSA), and a Selective Module. The details of the different kinds of visual encoding strategies, i.e., SSM, Conv, MLP, and MSA, can be found in Section 3.

As shown in Figure 2, given the features $\bm{V}$ from Equation 8, we pass them into the Mix-SSM Blocks. The features $\bm{V}$ are encoded by each of the visual encoding strategies, yielding $\bm{F}_{\mathrm{SSM}}$, $\bm{F}_{\mathrm{Conv}}$, $\bm{F}_{\mathrm{MLP}}$, and $\bm{F}_{\mathrm{MSA}}$, i.e.,

$$\bm{F}_{\mathrm{SSM}}=\operatorname{SSM}(\bm{V}),\quad \bm{F}_{\mathrm{Conv}}=\operatorname{Conv}(\bm{V}),\quad \bm{F}_{\mathrm{MLP}}=\operatorname{MLP}(\bm{V}),\quad \bm{F}_{\mathrm{MSA}}=\operatorname{MSA}(\bm{V}), \qquad (11)$$

where $\bm{F}_m$, $m\in\{\mathrm{SSM},\mathrm{Conv},\mathrm{MLP},\mathrm{MSA}\}$, denotes the encoded features, all of the same dimension. $\operatorname{SSM}$ adaptively aggregates spatial information with long-distance dependencies based on the input features, $\operatorname{Conv}$ extracts local visual features, $\operatorname{MLP}$ processes channel-wise information, and $\operatorname{MSA}$ captures global dependencies of the visual features. Subsequently, the features $\bm{F}_{\mathrm{SSM}}$, $\bm{F}_{\mathrm{Conv}}$, $\bm{F}_{\mathrm{MLP}}$, and $\bm{F}_{\mathrm{MSA}}$ are passed through the Selective Module to adaptively aggregate them.
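A schematic of the four parallel branches in Equation (11) is given below, using off-the-shelf PyTorch layers as stand-ins; in particular, a depthwise convolution substitutes for the SSM branch because a selective-scan operator is not part of standard PyTorch, so this is only an approximation of the real block.

```python
import torch
import torch.nn as nn

class FourBranchEncoder(nn.Module):
    """Produce F_SSM, F_Conv, F_MLP, F_MSA from shared features V (Eq. 11).

    The SSM branch is approximated by a depthwise convolution; the real model
    uses a selective-scan (Mamba-style) operator instead.
    """

    def __init__(self, dim, heads=4):
        super().__init__()
        self.ssm = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # stand-in for SSM
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)              # local features
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, v):                                          # v: (B, C, H, W)
        b, c, h, w = v.shape
        tokens = v.flatten(2).transpose(1, 2)                      # (B, H*W, C)
        f_ssm = self.ssm(v)
        f_conv = self.conv(v)
        f_mlp = self.mlp(tokens).transpose(1, 2).reshape(b, c, h, w)
        f_msa, _ = self.msa(tokens, tokens, tokens)
        f_msa = f_msa.transpose(1, 2).reshape(b, c, h, w)
        return f_ssm, f_conv, f_mlp, f_msa                         # all (B, C, H, W)

feats = FourBranchEncoder(dim=96)(torch.randn(2, 96, 14, 14))
print([f.shape for f in feats])
```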

4.3 Selective Module

To integrate features from different encoding strategies and leverage their respective advantages, we introduce the Selective Module, which enables the model to adaptively adjust visual features across different encoding strategies. Specifically, we first integrate the features $\bm{F}_{\mathrm{SSM}}$, $\bm{F}_{\mathrm{Conv}}$, $\bm{F}_{\mathrm{MLP}}$, and $\bm{F}_{\mathrm{MSA}}$ from the different encoding strategies as follows:

$$\bm{F}=\bm{F}_{\mathrm{SSM}}+\bm{F}_{\mathrm{Conv}}+\bm{F}_{\mathrm{MLP}}+\bm{F}_{\mathrm{MSA}}. \qquad (12)$$

Then, we aggregate the information across each channel by employing global average pooling to obtain an embedded global feature vector $\bm{g}\in\mathbb{R}^{\bar{C}}$, where $\bar{C}$, $\bar{W}$, and $\bar{H}$ denote the number of feature channels, the width, and the height of the features entering the Selective Module within the Mix-SSM Block, respectively. In particular, the $c$-th element of $\bm{g}$ is computed by spatially downsampling $\bm{F}_c$ over the dimensions $\bar{H}\times\bar{W}$:

$$\bm{g}_c=\operatorname{GlobalPool}(\bm{F}_c)=\frac{1}{\bar{H}\times\bar{W}}\sum_{i=1}^{\bar{H}}\sum_{j=1}^{\bar{W}}\bm{F}_c(i,j). \qquad (13)$$

To enable the model to infer the weight of the different encoding strategies for each channel, we further encode $\bm{g}$ using an MLP to obtain hidden features $\bm{h}$:

$$\bm{h}=\operatorname{MLP}_h(\bm{g}),\quad \bm{h}\in\mathbb{R}^{\bar{C}\times n} \qquad (14)$$

where $n$ represents the number of visual encoding strategies. Cross-channel soft attention is applied to adaptively select information across different spatial scales. Specifically, a softmax operation is applied to the channel dimensions of the hidden features $\bm{h}$:

$$\bm{p}=\mathrm{softmax}(\bm{h}),\quad \bm{p}\in\mathbb{R}^{\bar{C}\times n} \qquad (15)$$

where $\bm{p}$ signifies the weight of the different encoding strategies for each channel. The features $\bm{F}_{\mathrm{SSM}}$, $\bm{F}_{\mathrm{Conv}}$, $\bm{F}_{\mathrm{MLP}}$, and $\bm{F}_{\mathrm{MSA}}$ are weighted by $\{\bm{p}_{\mathrm{SSM}},\bm{p}_{\mathrm{Conv}},\bm{p}_{\mathrm{MLP}},\bm{p}_{\mathrm{MSA}}\}\in\bm{p}$ to obtain $\bm{V}$:

$$\bm{V}:=\bm{p}_{\mathrm{SSM}}\cdot\bm{F}_{\mathrm{SSM}}+\bm{p}_{\mathrm{Conv}}\cdot\bm{F}_{\mathrm{Conv}}+\bm{p}_{\mathrm{MLP}}\cdot\bm{F}_{\mathrm{MLP}}+\bm{p}_{\mathrm{MSA}}\cdot\bm{F}_{\mathrm{MSA}}. \qquad (16)$$
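Below is a minimal PyTorch sketch of the Selective Module described by Equations (12)-(16); the hidden width of the MLP and the application of the softmax over the strategy axis are our reading of the text, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SelectiveModule(nn.Module):
    """Adaptively fuse features from n encoding strategies (Eqs. 12-16)."""

    def __init__(self, channels, n_strategies=4, reduction=4):
        super().__init__()
        self.n = n_strategies
        # MLP that predicts one weight per (channel, strategy) pair (Eq. 14).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels * n_strategies),
        )

    def forward(self, feats):                        # list of n tensors, each (B, C, H, W)
        F = torch.stack(feats, dim=1)                # (B, n, C, H, W)
        fused = F.sum(dim=1)                         # Eq. 12
        g = fused.mean(dim=(2, 3))                   # global average pooling, Eq. 13 -> (B, C)
        h = self.mlp(g).view(g.size(0), self.n, -1)  # (B, n, C), Eq. 14
        p = torch.softmax(h, dim=1)                  # weights over strategies per channel, Eq. 15
        p = p.unsqueeze(-1).unsqueeze(-1)            # (B, n, C, 1, 1)
        return (p * F).sum(dim=1)                    # weighted sum, Eq. 16

sm = SelectiveModule(channels=96)
feats = [torch.randn(2, 96, 14, 14) for _ in range(4)]
print(sm(feats).shape)  # torch.Size([2, 96, 14, 14])
```

In this reading, each channel receives its own convex combination of the four strategies, which matches the description of cross-channel soft attention.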

5 Experiment

In experiments, we evaluate the performance of our InsectMamba model on five insect pest classification datasets. We compare the performance of our model with several state-of-the-art models. We also conduct an ablation study to investigate the effectiveness of different components in our model.

5.1 Dataset and Metrics

Table 1: Statistics of the five insect pest classification datasets.

Dataset | Category | Train | Test
Farm Insects | 15 | 160 | 1,368
Agricultural Pests | 12 | 240 | 5,254
Insect Recognition | 24 | 768 | 612
Forestry Pest Identification | 31 | 599 | 6,564
IP102 | 102 | 1,909 | 65,805

To more effectively and comprehensively evaluate existing vision models, we curated and re-split five insect pest classification datasets to provide a challenging evaluation. The datasets employed in our experiments are Farm Insects (https://www.kaggle.com/datasets/tarundalal/dangerous-insects-dataset), Agricultural Pests (https://www.kaggle.com/datasets/gauravduttakiit/agricultural-pests-dataset), Insect Recognition Xie et al. (2015), Forestry Pest Identification Liu et al. (2022), and IP102 Wu et al. (2019), with details provided in Table 1. We reduce the number of samples in the training set to compare the capabilities of different vision models to encode visual features. In addition, we use accuracy (ACC), precision (Prec), recall (Rec), and the F1 score as evaluation metrics to evaluate the models comprehensively.

5.2 Implementation Details

For model training, the batch size is set to 32 and the learning rate to $5\times 10^{-5}$. We train for 10 epochs using the Adam optimizer Kingma and Ba (2015). The dimensions of the input images are fixed at $224\times 224\times 3$. For comparative analysis, we fine-tune various models on the five datasets, i.e., ResNet18 He et al. (2016), ResNet50 He et al. (2016), ResNet101 He et al. (2016), ResNet152 He et al. (2016), DeiT-S Touvron et al. (2021), DeiT-B Touvron et al. (2021), Swin-T Liu et al. (2021), Swin-S Liu et al. (2021), Swin-B Liu et al. (2021), Vmamba-T Liu et al. (2024c), Vmamba-S Liu et al. (2024c), and Vmamba-B Liu et al. (2024c). "T", "S", and "B" denote the "Tiny", "Small", and "Base" sizes of the corresponding models, respectively. To initialize part of our model's parameters, we utilize the pre-trained parameters of Vmamba-B.
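A minimal sketch of this training setup (batch size 32, learning rate 5e-5, Adam, 10 epochs, 224x224 inputs) is shown below, assuming PyTorch and an ImageFolder-style directory layout; the data path and the ResNet18 stand-in backbone are illustrative, since the re-split datasets and InsectMamba weights are not distributed with the text.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Hypothetical data directory; the authors' re-split datasets are not packaged with the paper.
transform = transforms.Compose([
    transforms.Resize((224, 224)),   # inputs fixed at 224x224x3
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("data/farm_insects/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)

# Stand-in backbone; in the paper this would be InsectMamba initialized from Vmamba-B weights.
model = models.resnet18(num_classes=len(train_set.classes))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=5e-5)

model.train()
for epoch in range(10):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```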

5.3 Main Results

Table 2: Results on the Farm Insects dataset.

Method | ACC | Prec | Rec | F1
ResNet18 | 0.43 | 0.51 | 0.43 | 0.42
ResNet50 | 0.50 | 0.54 | 0.50 | 0.47
ResNet101 | 0.52 | 0.55 | 0.52 | 0.49
ResNet152 | 0.53 | 0.59 | 0.53 | 0.50
DeiT-S | 0.49 | 0.50 | 0.50 | 0.47
DeiT-B | 0.55 | 0.59 | 0.55 | 0.54
Swin-T | 0.54 | 0.53 | 0.54 | 0.52
Swin-S | 0.56 | 0.60 | 0.56 | 0.55
Swin-B | 0.62 | 0.68 | 0.63 | 0.63
Vmamba-T | 0.56 | 0.58 | 0.56 | 0.55
Vmamba-S | 0.53 | 0.56 | 0.53 | 0.51
Vmamba-B | 0.52 | 0.56 | 0.52 | 0.48
InsectMamba | 0.66 | 0.67 | 0.66 | 0.65

Table 3: Results on the Agricultural Pests dataset.

Method | ACC | Prec | Rec | F1
ResNet18 | 0.69 | 0.71 | 0.67 | 0.64
ResNet50 | 0.68 | 0.77 | 0.67 | 0.65
ResNet101 | 0.75 | 0.76 | 0.74 | 0.73
ResNet152 | 0.78 | 0.80 | 0.76 | 0.76
DeiT-S | 0.83 | 0.82 | 0.82 | 0.82
DeiT-B | 0.86 | 0.86 | 0.85 | 0.85
Swin-T | 0.80 | 0.82 | 0.80 | 0.79
Swin-S | 0.74 | 0.79 | 0.73 | 0.74
Swin-B | 0.83 | 0.86 | 0.83 | 0.82
Vmamba-T | 0.78 | 0.81 | 0.77 | 0.77
Vmamba-S | 0.83 | 0.84 | 0.82 | 0.80
Vmamba-B | 0.89 | 0.90 | 0.89 | 0.89
InsectMamba | 0.91 | 0.91 | 0.90 | 0.91

Table 4: Results on the Insect Recognition dataset.

Method | ACC | Prec | Rec | F1
ResNet18 | 0.66 | 0.69 | 0.66 | 0.64
ResNet50 | 0.57 | 0.73 | 0.57 | 0.56
ResNet101 | 0.57 | 0.66 | 0.57 | 0.55
ResNet152 | 0.52 | 0.67 | 0.52 | 0.52
DeiT-S | 0.73 | 0.76 | 0.74 | 0.73
DeiT-B | 0.76 | 0.82 | 0.76 | 0.76
Swin-T | 0.70 | 0.81 | 0.70 | 0.69
Swin-S | 0.75 | 0.80 | 0.76 | 0.76
Swin-B | 0.81 | 0.86 | 0.82 | 0.82
Vmamba-T | 0.79 | 0.85 | 0.79 | 0.79
Vmamba-S | 0.81 | 0.84 | 0.80 | 0.79
Vmamba-B | 0.83 | 0.87 | 0.84 | 0.84
InsectMamba | 0.86 | 0.88 | 0.86 | 0.86

Table 5: Results on the Forestry Pest Identification dataset.

Method | ACC | Prec | Rec | F1
ResNet18 | 0.80 | 0.83 | 0.80 | 0.80
ResNet50 | 0.80 | 0.84 | 0.80 | 0.80
ResNet101 | 0.79 | 0.85 | 0.79 | 0.79
ResNet152 | 0.77 | 0.82 | 0.77 | 0.76
DeiT-S | 0.87 | 0.88 | 0.87 | 0.87
DeiT-B | 0.90 | 0.91 | 0.90 | 0.90
Swin-T | 0.83 | 0.88 | 0.84 | 0.84
Swin-S | 0.85 | 0.88 | 0.85 | 0.85
Swin-B | 0.86 | 0.89 | 0.87 | 0.86
Vmamba-T | 0.90 | 0.91 | 0.90 | 0.90
Vmamba-S | 0.91 | 0.92 | 0.92 | 0.91
Vmamba-B | 0.92 | 0.93 | 0.93 | 0.93
InsectMamba | 0.94 | 0.94 | 0.94 | 0.94

Table 6: Results on the IP102 dataset.

Method | ACC | Prec | Rec | F1
ResNet18 | 0.27 | 0.27 | 0.25 | 0.21
ResNet50 | 0.24 | 0.26 | 0.22 | 0.18
ResNet101 | 0.18 | 0.28 | 0.19 | 0.16
ResNet152 | 0.25 | 0.23 | 0.19 | 0.16
DeiT-S | 0.22 | 0.24 | 0.21 | 0.17
DeiT-B | 0.28 | 0.27 | 0.25 | 0.20
Swin-T | 0.29 | 0.25 | 0.27 | 0.22
Swin-S | 0.30 | 0.25 | 0.27 | 0.21
Swin-B | 0.39 | 0.36 | 0.37 | 0.32
Vmamba-T | 0.28 | 0.29 | 0.29 | 0.23
Vmamba-S | 0.35 | 0.31 | 0.34 | 0.27
Vmamba-B | 0.32 | 0.36 | 0.33 | 0.28
InsectMamba | 0.43 | 0.38 | 0.42 | 0.37

The experimental results, shown in Tables 2, 3, 4, 5, and 6, demonstrate the superior performance of the InsectMamba model across multiple insect classification tasks. InsectMamba consistently outperforms the established benchmarks, including various configurations of ResNet, DeiT, Swin Transformer, and Vmamba, across all evaluation metrics: Accuracy (ACC), Precision (Prec), Recall (Rec), and F1 Score (F1). On the Farm Insects dataset, InsectMamba achieves an ACC of 0.66, surpassing the next best model, Swin-B, by 4%. Similarly, significant improvements are observed on the Agricultural Pests dataset, where InsectMamba reaches an ACC of 0.91, outperforming the strong Vmamba-B baseline by 2%. These results are consistent across the Insect Recognition and Forestry Pest Identification datasets, which shows InsectMamba's strong capability to extract features from images. The results on the IP102 dataset further verify InsectMamba's robustness, achieving an ACC of 0.43, a clear improvement over the previous best of 0.39 by Swin-B. These results demonstrate that the Mix-SSM Block can integrate multiple visual encoding strategies to ensure comprehensive feature capture from the input images. The Selective Module further enhances the model's capability by adaptively weighting the contribution of the different encoding strategies.

5.4 Ablation Study

Table 7: Ablation study on the Farm Insects, Insect Recognition, and IP102 datasets (Accuracy / F1).

Method | Farm Insects (Acc / F1) | Insect Recognition (Acc / F1) | IP102 (Acc / F1)
InsectMamba | 0.66 / 0.65 | 0.86 / 0.86 | 0.43 / 0.37
w/o CNN | 0.60 / 0.58 | 0.84 / 0.85 | 0.38 / 0.33
w/o MSA | 0.62 / 0.60 | 0.85 / 0.86 | 0.40 / 0.34
w/o MLP | 0.63 / 0.60 | 0.85 / 0.85 | 0.41 / 0.34
w/o CNN, MSA | 0.54 / 0.50 | 0.83 / 0.84 | 0.34 / 0.29
w/o CNN, MLP | 0.55 / 0.53 | 0.84 / 0.84 | 0.34 / 0.30
w/o MSA, MLP | 0.57 / 0.55 | 0.84 / 0.84 | 0.35 / 0.31
w/o CNN, MSA, MLP | 0.52 / 0.48 | 0.83 / 0.84 | 0.32 / 0.28

The ablation study results shown in Table 7 systematically evaluate the contribution of each component within the InsectMamba model, namely Convolutional Neural Networks (CNN), Multi-Layer Perceptron (MLP), and Multi-Head Self-Attention (MSA), across three datasets: Farm Insects, Insect Recognition, and IP102. The results highlight the significant role each component plays in achieving high accuracy and F1 scores. The complete InsectMamba model achieves the best performance across all datasets, which underscores the synergistic effect of combining CNN, MLP, and MSA for feature extraction and representation learning. Removing any single component (CNN, MSA, or MLP) leads to a decrease in both accuracy and F1 scores across all datasets, indicating that each component contributes unique and valuable information for classification. The most significant performance degradation is observed when multiple components are removed simultaneously, particularly when CNN, MSA, and MLP are all excluded. This configuration results in the lowest accuracy and F1 scores, demonstrating that the integration of multiple visual encoding strategies is crucial for capturing the comprehensive visual characteristics of insects.

5.5 Analysis

Impact of Feature Aggregation Methods.

[Figure 3: Comparison of feature aggregation methods (Selective Module, Max Pooling, Average Pooling) in terms of ACC and F1 on the Farm Insects and IP102 datasets.]

To investigate the effectiveness of different feature aggregation methods within the InsectMamba model, we compare the Selective Module against Max Pooling and Average Pooling. As depicted in Figure 3, the Selective Module consistently outperforms Max Pooling and Average Pooling in terms of Accuracy (ACC) and F1 Score across two distinct datasets: Farm Insects and IP102. For the Farm Insects dataset, the Selective Module achieves the highest ACC and F1 Score, indicating its superior capability in capturing and integrating salient features for insect pest classification. In particular, the ACC and F1 improvement over Max Pooling is pronounced, underlining the Selective Module's effectiveness in handling more nuanced classification tasks within a diverse set of insect species. On the IP102 dataset, the Selective Module still maintains an advantage. Moreover, the variance in performance across the two datasets also highlights the adaptive nature of the Selective Module: it can dynamically adjust the integration of visual features from different visual encoding strategies according to the dataset's complexity and diversity.
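For reference, the two baseline aggregation schemes compared here can be written as unlearned element-wise reductions over the stacked strategy features; the sketch below reflects our reading of those baselines rather than the authors' exact implementation.

```python
import torch

def aggregate(feats, method="avg"):
    """Fuse a list of (B, C, H, W) strategy features without learned weights."""
    F = torch.stack(feats, dim=1)          # (B, n, C, H, W)
    if method == "max":
        return F.max(dim=1).values         # element-wise max over strategies
    return F.mean(dim=1)                   # element-wise average over strategies

feats = [torch.randn(2, 96, 14, 14) for _ in range(4)]
print(aggregate(feats, "max").shape, aggregate(feats, "avg").shape)
```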

Impact of Kernel Size in the Selective Module.

[Figure 4: Impact of kernel size in the Selective Module on (a) the Farm Insects dataset and (b) the IP102 dataset.]

In the process of optimizing our InsectMamba model, we investigated the impact of different kernel sizes within the Selective Module on classification performance. As shown in Figure 4, the Selective Module was evaluated with kernel sizes of $1\times 1$, $3\times 3$, $5\times 5$, and $7\times 7$. For the Farm Insects dataset, shown in Figure 4(a), we observe that both Accuracy (ACC) and F1 Score (F1) peak at a kernel size of $3\times 3$. The performance declines when the kernel size is increased to $5\times 5$ and drops significantly at $7\times 7$, which demonstrates that smaller kernel sizes are more effective at capturing the relevant visual features for insect pest classification. Moreover, the IP102 dataset, shown in Figure 4(b), exhibits a consistent trend. This highlights the importance of the kernel size in the Selective Module for adaptively integrating the different visual encoding strategies.

Impact of Pooling Methods in the Selective Module.

[Figure 5: Impact of pooling methods (Average, Max, L2, and Stochastic Pooling) in the Selective Module on the Farm Insects and IP102 datasets.]

We investigate the impact of various pooling methods on the performance of the Selective Module within our InsectMamba model. Specifically, we evaluated Average Pooling, Max Pooling, L2 Pooling, and Stochastic Pooling for synthesizing the global features as prescribed in Equation 13. Figure 5 shows the comparative performance on two datasets, i.e., Farm Insects and IP102. For the Farm Insects dataset, Average Pooling achieved the best accuracy and F1 score, indicating its effectiveness in preserving feature representations for classification. Moreover, the results on the IP102 dataset show a consistent trend, with Average Pooling performing as well as it does on the Farm Insects dataset.
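The pooling options compared above all reduce a feature map to one statistic per channel in place of the average in Equation (13); the sketch below gives plausible definitions, and the stochastic variant (sampling a spatial position per channel in proportion to its activation) is an assumption on our part.

```python
import torch

def global_pool(F, method="avg"):
    """Collapse the spatial dimensions of F (B, C, H, W) into a (B, C) vector."""
    flat = F.flatten(2)                                    # (B, C, H*W)
    if method == "max":
        return flat.max(dim=-1).values
    if method == "l2":
        return flat.pow(2).mean(dim=-1).sqrt()             # root mean square per channel
    if method == "stochastic":
        probs = flat.clamp(min=1e-6)
        probs = probs / probs.sum(dim=-1, keepdim=True)    # sample one position per channel
        idx = torch.multinomial(probs.flatten(0, 1), 1).view(F.size(0), F.size(1), 1)
        return flat.gather(-1, idx).squeeze(-1)
    return flat.mean(dim=-1)                               # average pooling (Eq. 13)

F = torch.rand(2, 96, 14, 14)
print([global_pool(F, m).shape for m in ("avg", "max", "l2", "stochastic")])
```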

6 Conclusion

In this work, we proposed a novel model, InsectMamba, for insect pest classification. The model is designed to amalgamate the strengths of State Space Models, Convolutional Neural Networks, Multi-Head Self-Attention mechanisms, and Multilayer Perceptrons. By integrating these varied visual encoding strategies through Mix-SSM blocks and a selective aggregation module, InsectMamba addresses the challenges posed by pest camouflage and species diversity. In the experiments, we conducted an extensive evaluation comparing our method with other strong competitors on five insect pest classification datasets. The experimental results show that our model outperforms the other models, which demonstrates its effectiveness. We also illuminate the importance of each integrated module through comprehensive ablation studies.

References

  • Ali etal. [2023]Farooq Ali, Huma Qayyum, and MuhammadJaved Iqbal.Faster-pestnet: A lightweight deep learning framework for crop pest detection and classification.IEEE Access, 11:104016–104027, 2023.doi: 10.1109/ACCESS.2023.3317506.URL https://doi.org/10.1109/ACCESS.2023.3317506.
  • An etal. [2023]Jingmin An, Yong Du, Peng Hong, Lei Zhang, and Xiaogang Weng.Insect recognition based on complementary features from multiple views.Scientific Reports, 13(1):2966, 2023.
  • Anwar and Masood [2023]Zeba Anwar and Sarfaraz Masood.Exploring deep ensemble model for insect and pest detection from images.Procedia Computer Science, 218:2328–2337, 2023.
  • Butera etal. [2022]Luca Butera, Alberto Ferrante, Mauro Jermini, Mauro Prevostini, and Cesare Alippi.Precise agriculture: Effective deep learning strategies to detect pest insects.IEEE CAA J. Autom. Sinica, 9(2):246–258, 2022.doi: 10.1109/JAS.2021.1004317.URL https://doi.org/10.1109/JAS.2021.1004317.
  • Cheng et al. [2017] Xi Cheng, Youhua Zhang, Yiqiong Chen, Yun-Zhi Wu, and Yi Yue. Pest identification via deep residual learning in complex background. Comput. Electron. Agric., 141:351–356, 2017. doi: 10.1016/J.COMPAG.2017.08.005. URL https://doi.org/10.1016/j.compag.2017.08.005.
  • Doan [2023]Thanh-Nghi Doan.Large-scale insect pest image classification.Journal of Advances in Information Technology, 14(2):328–341, 2023.
  • Dosovitskiy etal. [2021]Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby.An image is worth 16x16 words: Transformers for image recognition at scale.In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.URL https://openreview.net/forum?id=YicbFdNTTy.
  • Gu and Dao [2023]Albert Gu and Tri Dao.Mamba: Linear-time sequence modeling with selective state spaces.CoRR, abs/2312.00752, 2023.doi: 10.48550/ARXIV.2312.00752.URL https://doi.org/10.48550/arXiv.2312.00752.
  • Han etal. [2024]Guangzeng Han, Weisi Liu, Xiaolei Huang, and Brian Borsari.Chain-of-interaction: Enhancing large language models for psychiatric behavior understanding by dyadic contexts.arXiv preprint arXiv:2403.13786, 2024.
  • He etal. [2016]Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition.In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778. IEEE Computer Society, 2016.doi: 10.1109/CVPR.2016.90.URL https://doi.org/10.1109/CVPR.2016.90.
  • Howard et al. [2017] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017. URL http://arxiv.org/abs/1704.04861.
  • Hu etal. [2023]Zhengyu Hu, Jieyu Zhang, Haonan Wang, Siwei Liu, and Shangsong Liang.Leveraging relational graph neural network for transductive model ensemble.In AmbujK. Singh, Yizhou Sun, Leman Akoglu, Dimitrios Gunopulos, Xifeng Yan, Ravi Kumar, Fatma Ozcan, and Jieping Ye, editors, Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023, pages 775–787. ACM, 2023.doi: 10.1145/3580305.3599414.URL https://doi.org/10.1145/3580305.3599414.
  • Kasinathan and Reddy [2019]Thenmozhi Kasinathan and U.Srinivasulu Reddy.Crop pest classification based on deep convolutional neural network and transfer learning.Comput. Electron. Agric., 164, 2019.doi: 10.1016/J.COMPAG.2019.104906.URL https://doi.org/10.1016/j.compag.2019.104906.
  • Kingma and Ba [2015]DiederikP. Kingma and Jimmy Ba.Adam: A method for stochastic optimization.In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.URL http://arxiv.org/abs/1412.6980.
  • Krizhevsky etal. [2017]Alex Krizhevsky, Ilya Sutskever, and GeoffreyE. Hinton.Imagenet classification with deep convolutional neural networks.Commun. ACM, 60(6):84–90, 2017.doi: 10.1145/3065386.URL https://doi.org/10.1145/3065386.
  • Lai etal. [2024a]Zhixin Lai, Jing Wu, Suiyao Chen, Yucheng Zhou, Anna Hovakimyan, and Naira Hovakimyan.Language models are free boosters for biomedical imaging tasks.arXiv preprint arXiv:2403.17343, 2024a.
  • Lai etal. [2024b]Zhixin Lai, Xuesheng Zhang, and Suiyao Chen.Adaptive ensembles of fine-tuned transformers for llm-generated text detection, 2024b.
  • Liu etal. [2022]Bing Liu, Luyang Liu, Ran Zhuo, Weidong Chen, Rui Duan, and Guishen Wang.A dataset for forestry pest identification.Frontiers in Plant Science, 13:857104, 2022.
  • Liu etal. [2024a]Tianrui Liu, Changxin Xu, Yuxin Qiao, Chufeng Jiang, and Weisheng Chen.News recommendation with attention mechanism.CoRR, abs/2402.07422, 2024a.doi: 10.48550/ARXIV.2402.07422.URL https://doi.org/10.48550/arXiv.2402.07422.
  • Liu etal. [2024b]Tianrui Liu, Changxin Xu, Yuxin Qiao, Chufeng Jiang, and Jiqiang Yu.Particle filter SLAM for vehicle localization.CoRR, abs/2402.07429, 2024b.doi: 10.48550/ARXIV.2402.07429.URL https://doi.org/10.48550/arXiv.2402.07429.
  • Liu etal. [2024c]Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu.Vmamba: Visual state space model.CoRR, abs/2401.10166, 2024c.doi: 10.48550/ARXIV.2401.10166.URL https://doi.org/10.48550/arXiv.2401.10166.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 9992–10002. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00986. URL https://doi.org/10.1109/ICCV48922.2021.00986.
  • Liu etal. [2016]Ziyi Liu, Junfeng Gao, Guoguo Yang, Huan Zhang, and Yong He.Localization and classification of paddy field pests using a saliency map and deep convolutional neural network.Scientific reports, 6(1):20410, 2016.
  • Lyu etal. [2024]Weimin Lyu, Xiao Lin, Songzhu Zheng, LuPang, Haibin Ling, Susmit Jha, and Chao Chen.Task-agnostic detector for insertion-based backdoor attacks.arXiv preprint arXiv:2403.17155, 2024.
  • Murtagh [1990]Fionn Murtagh.Multilayer perceptrons for classification and regression.Neurocomputing, 2(5):183–197, 1990.doi: 10.1016/0925-2312(91)90023-5.URL https://doi.org/10.1016/0925-2312(91)90023-5.
  • O’Shea and Nash [2015]Keiron O’Shea and Ryan Nash.An introduction to convolutional neural networks.CoRR, abs/1511.08458, 2015.URL http://arxiv.org/abs/1511.08458.
  • Peng and Wang [2022]Yingshu Peng and YiWang.CNN and transformer framework for insect pest classification.Ecol. Informatics, 72:101846, 2022.doi: 10.1016/J.ECOINF.2022.101846.URL https://doi.org/10.1016/j.ecoinf.2022.101846.
  • Ren etal. [2019]Fuji Ren, Wenjie Liu, and Guoqing Wu.Feature reuse residual networks for insect pest recognition.IEEE Access, 7:122758–122768, 2019.doi: 10.1109/ACCESS.2019.2938194.URL https://doi.org/10.1109/ACCESS.2019.2938194.
  • Simonyan and Zisserman [2015]Karen Simonyan and Andrew Zisserman.Very deep convolutional networks for large-scale image recognition.In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.URL http://arxiv.org/abs/1409.1556.
  • Su etal. [2024]Jing Su, Chufeng Jiang, Xin Jin, Yuxin Qiao, Tingsong Xiao, Hongda Ma, Rong Wei, Zhi Jing, Jiajun Xu, and Junhong Lin.Large language models for forecasting and anomaly detection: A systematic literature review.CoRR, abs/2402.10350, 2024.doi: 10.48550/ARXIV.2402.10350.URL https://doi.org/10.48550/arXiv.2402.10350.
  • Tolstikhin et al. [2021] Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. MLP-Mixer: An all-MLP architecture for vision. In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 24261–24272, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/cba0a4ee5ccd02fda0fe3f9a3e7b89fe-Abstract.html.
  • Touvron etal. [2021]Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou.Training data-efficient image transformers & distillation through attention.In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR, 2021.URL http://proceedings.mlr.press/v139/touvron21a.html.
  • Ung et al. [2022] Hieu Trung Ung, Quang Huy Ung, Trung T. Nguyen, and Binh T. Nguyen. An efficient insect pest classification using multiple convolutional neural network based models. In Hamido Fujita, Yutaka Watanobe, and Takuya Azumi, editors, New Trends in Intelligent Software Methodologies, Tools and Techniques - Proceedings of the 21st International Conference on New Trends in Intelligent Software Methodologies, Tools and Techniques, SoMeT 2022, Kitakyushu, Japan, 20-22 September, 2022, volume 355 of Frontiers in Artificial Intelligence and Applications, pages 584–595. IOS Press, 2022. doi: 10.3233/FAIA220287. URL https://doi.org/10.3233/FAIA220287.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
  • Wang etal. [2017]RuJing Wang, Jie Zhang, Wei Dong, Jian Yu, ChengJun Xie, Rui Li, TianJiao Chen, and HongBo Chen.A crop pests image classification algorithm based on deep convolutional neural network.TELKOMNIKA (Telecommunication Computing Electronics and Control), 15(3):1239–1246, 2017.
  • Wu etal. [2024]Jing Wu, Zhixin Lai, Suiyao Chen, Ran Tao, Pan Zhao, and Naira Hovakimyan.The new agronomists: Language models are experts in crop management.arXiv preprint arXiv:2403.19839, 2024.
  • Wu etal. [2019]Xiaoping Wu, Chi Zhan, Yu-Kun Lai, Ming-Ming Cheng, and Jufeng Yang.IP102: A large-scale benchmark dataset for insect pest recognition.In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 8787–8796. Computer Vision Foundation / IEEE, 2019.doi: 10.1109/CVPR.2019.00899.URL http://openaccess.thecvf.com/content_CVPR_2019/html/Wu_IP102_A_Large-Scale_Benchmark_Dataset_for_Insect_Pest_Recognition_CVPR_2019_paper.html.
  • Xie etal. [2015]Chengjun Xie, Jie Zhang, Rui Li, Jinyan Li, Peilin Hong, Junfeng Xia, and Peng Chen.Automatic classification for field crop insects via multiple-task sparse representation and multiple-kernel learning.Comput. Electron. Agric., 119:123–132, 2015.doi: 10.1016/J.COMPAG.2015.10.015.URL https://doi.org/10.1016/j.compag.2015.10.015.
  • Zhang etal. [2023]Jieyu Zhang, Bohan Wang, Zhengyu Hu, PangWei Koh, and AlexanderJ. Ratner.On the trade-off of intra-/inter-class diversity for supervised pre-training.In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.URL http://papers.nips.cc/paper_files/paper/2023/hash/ca9567d8ef6b2ea2da0d7eed57b933ee-Abstract-Conference.html.
  • Zhou etal. [2023]Yucheng Zhou, Xiubo Geng, Tao Shen, Chongyang Tao, Guodong Long, Jian-Guang Lou, and Jianbing Shen.Thread of thought unraveling chaotic contexts.CoRR, abs/2311.08734, 2023.doi: 10.48550/ARXIV.2311.08734.URL https://doi.org/10.48550/arXiv.2311.08734.
  • Zhou etal. [2024]Yucheng Zhou, Xiang Li, Qianning Wang, and Jianbing Shen.Visual in-context learning for large vision-language models.CoRR, abs/2402.11574, 2024.doi: 10.48550/ARXIV.2402.11574.URL https://doi.org/10.48550/arXiv.2402.11574.