Solar Radio Burst Detection Based on the MobileViT-SSDLite Lightweight Model

Hailan He et al. 2023 ApJS 269 51. Published 2023 November 30. DOI: 10.3847/1538-4365/ad036c

© 2023. The Author(s). Published by the American Astronomical Society. Open access.

Abstract

Real-time detection of solar radio bursts is crucial in solar physics research and space weather forecasting. However, current research on the automatic detection of solar radio bursts is limited to identifying the presence or absence of a burst or to recognizing only a single burst type, such as type II or III. Furthermore, existing methods cannot jointly learn spectral and temporal features and often rely on large network models, resulting in slow detection. This paper proposes an automatic recognition and localization method for solar radio burst events based on a lightweight object detection model. We collected observation data from e-CALLISTO and established a data set containing type II, III, IV, and V solar radio bursts. To address the real-time requirements of practical applications and exploit the temporal and frequency domain information of spectrogram images, we improved a vision transformer with a self-attention mechanism and adopted a lightweight model for detection. The experimental results demonstrate that our proposed method achieves an average precision at a 50% intersection-over-union threshold (AP50) of 78.2% and a recall rate of 92% on the established solar radio burst data set. Additionally, the model operates at a detection speed of 54.8 frames per second, where a frame refers to a spectral image with a duration of 15 minutes, enabling efficient automated detection and localization of type II, III, IV, and V solar radio bursts.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Solar radio bursts are a direct physical manifestation of high-energy particle activity on the Sun, and they serve as critical observational tools for diagnosing the physical state of active solar regions and the processes of high-energy electron acceleration and propagation (Zhou et al. 2022). These bursts reflect the evolution and interaction of the background plasma, nonthermal electrons, and magnetic fields in active regions during solar activity (Huang 2010). Based on the different characteristics recorded by dynamic spectrographs in the microwave observation frequency range, solar radio bursts are typically classified into five types (Bouratzis et al. 2015): noise storms (including enhanced radiation and type I bursts) and type II, III, IV, and V bursts. Type I radio storms indicate the occurrence of solar storms. Type II radio bursts exhibit narrowband radiation with a slow frequency drift over time, serving as tracers of coronal shocks. Type III radio bursts display rapid frequency drifts with a steep decrease over time, providing the best indicator of high-energy electron beams in the corona. They are important for studying energy release and particle acceleration processes in flares and coronal mass ejections. Type IV radio bursts exhibit wideband continuous radiation from synchrotron emission by high-energy electrons gyrating in magnetic fields. Type V radio bursts have a relatively wide frequency band and high intensity. While rare, they are usually observed following type III bursts (McLean 1985) and reflect the intense scattering of high-energy electrons in the corona. Each type of solar radio burst event is associated with specific solar activity phenomena and correlates with various phenomena in the near-Earth space environment.

There is a demand for solar radio burst detection and classification in solar physics research and space weather forecasting. Accurate space weather alerts are crucial for the safety of aerospace, satellite communications, and power grid operations. Each solar radio spectrograph observes the Sun for approximately 10 hr per day and covers a wide frequency range. However, based on our analysis of e-CALLISTO (Benz et al. 2009; Husien et al. 2016) observation data, the accumulated duration of solar radio bursts accounts for only approximately 0.3% of the total observation time. Traditional methods for detecting and classifying solar radio bursts typically rely on manual approaches, which consume human resources, lack real-time processing capabilities, and are prone to missing events. Therefore, the automatic real-time detection and classification of solar radio burst events will positively affect solar physics research and space weather forecasting (Lin 2002).

Deep learning has played a role in various astronomical data processing tasks, helping astronomers analyze massive quantities of data and extract knowledge, thereby improving data processing efficiency and reducing the manual burden (Tao et al. 2020). With the development of solar radio spectrographs, there is an urgent need for efficient real-time processing of large-scale observational data and extraction of useful information. In recent years, domestic and international researchers have also begun to utilize deep-learning techniques for solar radio spectrogram classification and detection research.

Currently, most research focuses on detecting a single type of solar radio burst, with a significant emphasis on detecting type III bursts or on classifying spectra as burst, nonburst, or calibration, as demonstrated in the literature referenced below. In Zhang (2020), Faster R-CNN was used to detect type III bursts and small-scale spike burst events, extracting features such as start and stop frequencies, frequency drift rates, and positional coordinates. Scully et al. (2021) utilized YOLOv2 to classify type III solar radio storms. Solar radio spectrogram images have temporal and spectral characteristics, with time represented on the horizontal axis and frequency on the vertical axis. Yu et al. (2017) proposed a solar radio spectrogram classification method based on long short-term memory. Cheng & Yuan (2022) introduced a classification method based on convolutional long short-term memory. Guo et al. (2022b) presented a hybrid structure combining convolutional and memory units to simultaneously extract frequency structural features and time series features, enhancing sensitivity to small features in the spectral image and enabling solar radio spectrum classification. Li et al. (2022) proposed a self-supervised learning method for classifying solar radio spectrograms into burst and nonburst events. The model was pretrained on a large amount of existing data using self-masking and then fine-tuned on a solar radio spectrogram data set, achieving classification accuracy similar to that of supervised learning. In Chen et al. (2023), a solar radio spectrogram classification method based on a Swin transformer was introduced, classifying radio spectrograms into bursts and nonbursts with relatively few model parameters.

In addition, researchers have conducted studies on classifying various types of solar radio bursts. In Guo et al. (2022a), a transfer learning–based method for small-sample object detection was used to detect type II, III, and IV radio burst events. Since solar radio bursts are rare, the distribution of spectrogram data samples is imbalanced, with far fewer burst spectrograms than quiet-Sun spectrograms. It is therefore necessary to balance the sample distribution through data augmentation or to generate sample images with generative adversarial networks. Zhang et al. (2021) utilized solar radio spectroscopic data from the Culgoora and Learmonth observatories and proposed a conditional information-based deep convolutional generative adversarial network model. It automatically classified the five types of solar radio burst events and partially addressed the overfitting caused by insufficient data samples.

Existing methods can either detect only a single type of burst or merely determine the presence or absence of a burst, without being able to detect and classify multiple common burst types. Methods that can detect and classify multiple types of bursts, in turn, tend to have low detection accuracy, large parameter counts, and slow detection speeds, which hinders the real-time and accurate provision of solar activity warning information.

To detect and classify solar radio burst events from massive observational data, efficient and accurate detection and classification models need to be designed. In computer vision, this is an object detection problem. A typical object detection model comprises a feature extractor and a detection head. The feature extractor is usually a convolutional neural network (CNN)–based architecture that extracts low-level features from the shallow layer (near the input) and high-level features from the deep layer (near the output), and the output feature maps are passed to the detection head, which performs classification and regression to determine the labels and positions of object instances (Arani et al. 2022). In the basic CNN building layers, standard convolutions are computationally expensive. Separable convolutions (Chollet 2017) were introduced and widely used in lightweight CNNs for visual tasks, including MobileNets (Howard et al. 2017; Sandler et al. 2018) and ShuffleNetv2 (Ma et al. 2018). These lightweight CNNs are versatile and easy to train, reducing network size and improving latency, thus achieving real-time object detection.

Due to the high real-time requirements for solar radio burst detection and classification, we consider using a fast, lightweight feature extractor and detector. MobileViT is a lightweight, versatile, and embedded device–friendly vision transformer (ViT) model (Mehta & Rastegari 2021). It combines the advantages of convolutional networks and ViTs (Dosovitskiy et al. 2020) to construct lightweight ViTs. MobileViT has been applied in various visual tasks, such as classification, object detection, and instance segmentation, achieving a good balance between model compactness and accuracy. We also consider the single-shot multibox detector (SSD; Liu et al. 2016). SSD is an anchor-based one-stage detector that balances accuracy and speed well. SSDLite is a lightweight variant that reduces the computational cost while maintaining the model's detection performance, enabling real-time object detection.

In this paper, we propose a solar radio burst detection method based on MobileViT-SSDLite, leveraging the characteristics of the solar radio spectrum. This lightweight model enables real-time detection and classification of type II, III, IV, and V solar radio bursts. To enhance the model's robustness and convergence speed, we incorporate the CIoULoss (Zheng et al. 2019) localization loss function, which exhibits faster convergence. Additionally, we employ various data augmentation strategies to further improve the model's performance. The accuracy of our model is evaluated using the AP50 evaluation metric, which measures the precision of the model in detecting solar radio bursts. Furthermore, we assess the speed performance and model size using frames per second (FPS; where a frame refers to a spectral image with a temporal resolution of 15 minutes), floating-point operations (FLOPs), and model parameters. Experimental results demonstrate the effectiveness of our proposed method, showcasing good detection performance in accurately identifying solar radio bursts.

The paper makes several significant contributions.

1. We curated and established a comprehensive data set of solar radio spectrogram images. These images were obtained from e-CALLISTO, and we meticulously annotated the target detection boxes and classified the types of radio burst events. We plan to publicly release this data set, allowing other researchers to benefit.

2. To effectively capture the time and frequency domain characteristics of spectrogram images, we integrated the MobileViT network into our methodology. By combining global perception and local inductive bias, this network serves as an excellent feature extractor, enhancing the accuracy of our detection system.

3. Our developed model is a lightweight neural network, enabling real-time solar radio burst classification performance. Furthermore, its versatility extends beyond solar physics, as it can be applied to classify other astronomical data sets with similar temporal and frequency domain characteristics.

These contributions not only provide valuable resources to the scientific community but also advance the field of solar physics and astronomical data analysis.

2. Data Set Construction and Augmentation

2.1. Source and Construction of Solar Radio Burst Spectrogram Image Data Set

We utilized publicly available radio spectrogram data from the e-CALLISTO network (Monstein et al. 2023) to establish our data set. The e-CALLISTO network combines all Compound Astronomical Low-frequency Low-cost Instrument for Spectroscopy and Transportable Observatory (CALLISTO) spectrometers. CALLISTO can continuously observe the solar radio spectrum 24 hr per day throughout the year. Its main applications include observing solar radio bursts. The data obtained from CALLISTO are FITS files with up to 400 frequencies per sweep. Our process of producing spectrogram images is as follows.

First, CALLISTO solar spectrogram FITS files were downloaded from the e-CALLISTO website. Each FITS file is composed of four parts: the ASCII-format header, the binary spectrum, and two binary tables, one for the time axis and the other for the frequency axis. Then, the pyCALLISTO (Pawase & Raja 2020) tool was used to parse the FITS files into radio spectrogram images with background subtraction. Finally, based on the event record information provided by the system, we annotated the events with position and category information, constructing an object detection data set in the same format as COCO (Lin et al. 2014). Table 1 presents the number of instances for each type in the data set.

Table 1. Quantity Statistics of Solar Radio Burst Data Sets

             Number of Images   Number of Instances   II     III     IIIs    IV    V
Train. set   6126               7584                  338    5736    1296    47    167
Val. set     2626               3238                  126    2474    561     18    70

Note. The training and validation sets are randomly divided with a ratio of 7:3. Multiple burst instances may exist in the same spectrogram image. Hence, the number of solar radio burst instances exceeds the number of spectrogram images. The total number of labeled burst instances is 10,822.

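For illustration, the parsing step described above can be reproduced along the following lines. This is a minimal sketch using astropy in place of the pyCALLISTO tool; the 'TIME' and 'FREQUENCY' column names, the single axis-table layout, and the per-channel median background subtraction are assumptions that should be checked against each file's header.

```python
# A hedged sketch of the FITS-to-spectrogram step, not the paper's pipeline.
import numpy as np
import matplotlib.pyplot as plt
from astropy.io import fits

def fits_to_spectrogram(fits_path, png_path):
    with fits.open(fits_path) as hdul:
        data = hdul[0].data.astype(np.float32)  # binary spectrum: rows = frequency channels
        axes = hdul[1].data                     # axis table; assumed as one binary extension
        time = axes['TIME'][0]                  # seconds within the 15 minute file (assumed name)
        freq = axes['FREQUENCY'][0]             # MHz, typically ordered high to low (assumed name)
    # Background subtraction: remove each channel's median over time to
    # suppress the slowly varying quiet-Sun and receiver background.
    data -= np.median(data, axis=1, keepdims=True)
    plt.figure(figsize=(8, 4))
    plt.imshow(data, aspect='auto',
               extent=[time[0], time[-1], freq[-1], freq[0]])
    plt.xlabel('Time (s)')
    plt.ylabel('Frequency (MHz)')
    plt.savefig(png_path, dpi=150, bbox_inches='tight')
    plt.close()
```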

Type I bursts are not included in the data set because they consist of short-duration bursts superimposed on a stable or slowly varying continuous background, which makes them difficult to collect and annotate from the database. Additionally, since more than one burst event can appear on the same spectrogram image, the number of burst instances exceeds the number of spectrogram images.

Furthermore, based on the classification and burst catalog provided by the e-CALLISTO website, type III bursts have been further divided into individual (type III) and grouped (type IIIs) bursts, as shown in Figure 1.

Figure 1. Solar radio burst spectrograms. Three example images are provided for each burst type. The x-axis of the image is the time axis (the recorded time length is 15 minutes because a FITS source file contains 15 minute records), and the y-axis is the frequency axis.

2.2. Data Augmentation

Due to the limited overall quantity of the obtained data set and the homogeneous backgrounds of various classes, we employed data augmentation techniques to diversify the training samples and improve the robustness and detection accuracy of the model. The augmentation techniques used include random flipping, random cropping with minimum intersection over union (IoU), and photometric distortions, as shown in Figure 2.

Figure 2. Data augmentation processing of the solar radio spectrogram.

In Figure 2, (1) RandomFlip provides the model with invariance to mirror image reflections. (2) MinIoURandomCrop involves randomly selecting a target in the image and calculating its IoU with other targets. Then, based on a predefined IoU threshold, whether to keep the target is randomly determined. If it is retained, a randomly cropped image patch of a certain size is obtained around the target as a training sample. (3) PhotoMetricDistortion (PMD) techniques typically involve random perturbations of brightness, contrast, saturation, hue, and random color space transformations. By applying PMD techniques, more diverse training samples can be generated.
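To make this concrete, the three transforms above exist under those names in MMDetection (the framework used in Section 4.1), so the training pipeline can be sketched in its config style. The image scale and normalization values below are illustrative assumptions, not the authors' released configuration.

```python
# Sketch of an MMDetection v2 training pipeline with the three augmentations.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    # (3) Random perturbation of brightness/contrast/saturation/hue.
    dict(type='PhotoMetricDistortion'),
    # (2) Random crop that keeps targets whose IoU with the crop patch
    # exceeds a randomly chosen threshold.
    dict(type='MinIoURandomCrop', min_ious=(0.1, 0.3, 0.5, 0.7, 0.9),
         min_crop_size=0.3),
    dict(type='Resize', img_scale=(320, 320), keep_ratio=False),
    # (1) Random horizontal flip for mirror invariance.
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', mean=[123.675, 116.28, 103.53],
         std=[58.395, 57.12, 57.375], to_rgb=True),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
```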

By employing a combination of various data augmentation techniques, we can generate richer and more diverse training data for the radio spectrum, thereby improving the model's generalization ability and robustness and enhancing its performance. It is important to note that the augmented images are used only for training purposes; experiments on real spectrogram images are conducted during testing.

3. Methods

Due to the high time sensitivity requirements of space weather alerts, real-time detection of solar radio spectrogram images is necessary to provide timely solar activity warnings. We therefore use lightweight modules and select a feature extraction module that can simultaneously learn the temporal and frequency domain features of radio spectrogram images.

Through experimental comparisons, we combine the MobileViT feature extraction network with the SSDLite detector to establish a lightweight solar radio burst detection model, illustrated in Figure 3. The radio burst spectrogram image is processed by the MobileViT backbone network to extract global temporal and spectral features. The neck network produces feature maps of different sizes using a pyramid structure, allowing the model to detect both large and small objects. Each unit in each detection head outputs bounding box positions and classification prediction scores, which are used to establish the objective function for training the model.

Figure 3. The overall structure of the detection model for solar radio burst spectrograms comprises MobileViT and SSDLite. MobileViT is the feature extractor, while SSDLite includes a neck network and detection head. Extra layers (convolutional feature layers) are added after the truncated MobileViT network to form the neck network. Figure 6 shows the details of the SSDLite structure. DSC consists of depthwise and pointwise convolutions. MV2 denotes the MobileNetv2 block (inverted residual structure), which performs downsampling when s = 2.

Next, we introduce the composition of the feature extractor MobileViT and the lightweight detection head SSDLite in our model and the improvements made to the localization loss function.

3.1. Feature Extractor: MobileViT

While the vision transformer (ViT; Dosovitskiy et al. 2020) architecture has great potential in computer vision, it has certain drawbacks, such as a large parameter count, high computational demands, a lack of spatial inductive biases, and difficulty in training, requiring more training data and iterations. To address these issues, MobileViT introduces several modifications to the ViT architecture. It incorporates depthwise separable convolution (DSC), fewer transformer layers, and feature pooling layers, reducing the model's parameter count and computational costs. Additionally, MobileViT adopts a hybrid architecture that combines a CNN and a transformer, benefiting from the lightweight and efficient nature of CNNs and from the self-attention mechanism and global receptive field of transformers. Extensive experiments on multiple benchmark data sets have demonstrated the effectiveness of MobileViT in various computer vision tasks, including image classification and object detection.

MobileViT leverages the strengths of both CNN and ViT, enabling it to learn better representations with fewer parameters and more straightforward training configurations. It consists of two main components: MobileViTBlock and inverted residual structures, as illustrated in Figure 4.

Figure 4. Structure diagram of the MobileViTBlock module. The transformer (×L) represents the stacking of L transformer structures, where L is specified by the number in the MobileViTBlock box in Figure 3. X is the input tensor, and the output feature tensor Y is obtained by fusing the local and global representations with X through feature concatenation and convolution.

The MobileViTBlock models local and global information using an input tensor with fewer parameters. It employs standard convolutional layers and pointwise convolutions to capture local representations. The global representations are then established through transformers, utilizing the self-attention mechanism to establish long-range receptive fields. As a result, MobileViT possesses a global receptive field, as shown in Figure 5. In Figure 5, each pixel can see other pixels within the MobileViTBlock. For example, the transformer processes the green pixel using the white pixel (corresponding position in other blocks). Since the white pixel has already encoded information about adjacent pixels using convolutions, the green pixel effectively encodes information from all pixels in the image. The black box represents a block, and the gray grid represents a pixel.

Figure 5. Global information of MobileViT.

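As a concrete reference for the unfold-transform-fold computation described above, the following is a simplified PyTorch sketch of a MobileViTBlock. It substitutes nn.TransformerEncoder for the original transformer implementation, and the patch size and head count are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MobileViTBlock(nn.Module):
    """Simplified sketch of the MobileViT block: local convolutional
    encoding, patch-wise self-attention, then fusion with the input X."""

    def __init__(self, channels, dim, depth, patch=2):
        super().__init__()
        self.patch = patch  # input H and W must be divisible by patch
        # Local representation: 3x3 conv, then 1x1 projection to the
        # transformer dimension.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, dim, 1),
        )
        # Transformer (xL): a stack of `depth` encoder layers.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=2 * dim, batch_first=True)
        self.global_rep = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Conv2d(dim, channels, 1)
        # Fusion: concatenate with the input X and mix with a 3x3 conv.
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x):
        b, _, h, w = x.shape
        p = self.patch
        y = self.local(x)                                  # (B, dim, H, W)
        d = y.shape[1]
        # Unfold: attention runs across patches at the same intra-patch
        # pixel position, giving the global receptive field of Figure 5.
        y = y.reshape(b, d, h // p, p, w // p, p).permute(0, 3, 5, 2, 4, 1)
        y = y.reshape(b * p * p, (h // p) * (w // p), d)
        y = self.global_rep(y)
        # Fold back into a feature map and fuse with the input.
        y = y.reshape(b, p, p, h // p, w // p, d).permute(0, 5, 3, 1, 4, 2)
        y = self.proj(y.reshape(b, d, h, w))
        return self.fuse(torch.cat([x, y], dim=1))         # output tensor Y
```

For example, `MobileViTBlock(channels=64, dim=96, depth=2)(torch.randn(1, 64, 32, 32))` returns a tensor with the same shape as its input.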

The inverted residual structure (the MV2 block in Figure 3) is a bottleneck depthwise separable convolution with a residual connection, introduced in MobileNetV2 (Sandler et al. 2018). It first expands the channel dimension of a low-dimensional feature map with a pointwise convolution, filters it with a depthwise convolution, and finally projects the channels back down to the target value with another convolution. In addition to regular convolutional layers, the inverted residual structure includes DSC. The computational cost of DSC is approximately one-eighth to one-ninth that of standard convolution, making it an important component of lightweight convolutional neural networks. DSC decomposes the convolution operation into two steps: a depthwise convolution that applies a separate kernel to each channel of the input feature map, and a pointwise convolution that further mixes the feature map along the channel dimension. DSC achieves results similar to standard convolutional layers with far fewer parameters and computations.
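The two building blocks just described can be sketched as follows. Channel counts, the expansion factor, and the ReLU6 activations are illustrative assumptions in the style of MobileNetV2, not the exact configuration used in the paper.

```python
import torch.nn as nn

def dsc(cin, cout, stride=1):
    """Depthwise separable convolution: a 3x3 per-channel (depthwise) conv
    followed by a 1x1 (pointwise) conv that mixes channels. For a 3x3 kernel
    the cost is roughly 1/cout + 1/9 of a standard convolution."""
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, stride, 1, groups=cin, bias=False),
        nn.BatchNorm2d(cin), nn.ReLU6(inplace=True),
        nn.Conv2d(cin, cout, 1, bias=False),
        nn.BatchNorm2d(cout), nn.ReLU6(inplace=True),
    )

class InvertedResidual(nn.Module):
    """MV2 block sketch: expand channels with a 1x1 conv, filter with a
    depthwise 3x3 conv, project back down, and add a residual when shapes match."""

    def __init__(self, cin, cout, stride=1, expand=4):
        super().__init__()
        mid = cin * expand
        self.use_res = stride == 1 and cin == cout
        self.block = nn.Sequential(
            nn.Conv2d(cin, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, cout, 1, bias=False),
            nn.BatchNorm2d(cout),  # linear bottleneck: no activation here
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y
```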

3.2. Lightweight Detection Head: SSDLite

Due to the significant size differences among different types of solar radio bursts in the spectrogram, we consider using a detection head that can adapt to various scales while meeting real-time requirements.

SSD is a one-stage detection model based on anchor boxes (prior boxes). In general, multiple anchor boxes with different scales or aspect ratios are set for each unit, and the predicted bounding boxes are based on these anchor boxes, which reduces the training difficulty to some extent. By using large-scale feature maps to detect small objects and small-scale feature maps to detect large objects, SSD can handle the detection of both large and small targets. This SSD characteristic makes it well suited for different types of solar radio bursts.
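To illustrate the multiscale anchor mechanism, here is a sketch of SSD-style default box generation. The scale range and aspect ratios follow the convention of the SSD paper and are assumptions, not the tuned values of our model.

```python
import itertools
import math

def ssd_default_boxes(fmap_sizes, ratios=(1.0, 2.0, 0.5)):
    """Sketch of SSD-style default (prior) box generation. Each cell of each
    feature map receives boxes at one scale and several aspect ratios, so
    large feature maps carry small boxes and small feature maps large ones."""
    k = len(fmap_sizes)
    # Scales spaced linearly between 0.2 and 0.9 of the image size,
    # as in the SSD paper.
    scales = [0.2 + (0.9 - 0.2) * i / max(k - 1, 1) for i in range(k)]
    boxes = []
    for fsize, s in zip(fmap_sizes, scales):
        for i, j in itertools.product(range(fsize), repeat=2):
            cx, cy = (j + 0.5) / fsize, (i + 0.5) / fsize  # normalized center
            for r in ratios:
                boxes.append((cx, cy, s * math.sqrt(r), s / math.sqrt(r)))
    return boxes

# e.g., square feature maps of 20, 10, and 5 cells per side:
# len(ssd_default_boxes([20, 10, 5])) == (400 + 100 + 25) * 3
```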

The SSD model achieves high accuracy but has a higher computational cost and model size. SSDLite is a variant of SSD that improves the convolutional layers of the SSD model by replacing them with lightweight modules, achieving a better balance between accuracy and speed.

The original SSD model utilized the Visual Geometry Group (VGG; Simonyan & Zisserman 2014) as the feature extraction network. However, the VGG model has many parameters and is unsuitable for real-time detection. In the SSDLite model, a lightweight model called MobileNetv2 (Sandler et al. 2018) is employed as the feature extractor. It uses smaller default boxes and fewer feature map layers. The SSDLite model is illustrated in Figure 6.

Figure 6. SSDLite model structure (includes a neck network and detection head). Unlike the depiction in Figure 3, in the original SSDLite model, the extra layers are added after the truncated MobileNetV2 network to form the neck network, which produces multiscale feature maps. The detection head is responsible for classification and bounding box regression, and postprocessing with nonmaximum suppression is applied to produce the final predictions.

In Figure 6, many SSDLite components are similar to SSD. The main difference is the split of the convolution operation into two parts: depthwise convolution and pointwise convolution, represented by the DSC layer in the diagram. This split reduces the computational cost and number of parameters. The SSDLite model has demonstrated excellent detection performance on various benchmark data sets while significantly reducing the model size and computational costs. Consequently, the SSDLite model is widely adopted for object detection tasks on mobile and embedded devices, providing high practical value.

3.3. Loss Function

The SSDLite loss function consists of a weighted sum of the classification and localization losses, as shown in Equation (1):

$$L(x, c, l, g) = \frac{1}{N}\left[L_{\rm conf}(x, c) + \alpha L_{\rm loc}(x, l, g)\right]. \quad (1)$$

In the formula, Lconf represents the classification loss, Lloc represents the localization loss, N represents the number of positive (matched) prior boxes, x is the indicator for matching prior boxes to ground-truth boxes, c denotes the predicted class confidence, l denotes the predicted location values for the corresponding bounding boxes of the prior boxes, g represents the position parameters of the ground-truth boxes, and α is set to 1, as in SSD (Liu et al. 2016).

SSDLite employs the softmax loss function for the classification loss, which calculates the difference between the predicted class probabilities for each prior box and the actual class. For the localization loss, SSDLite uses the smooth L1 loss function. However, smooth L1 loss may lead to unstable training when dealing with many negative samples. Therefore, our approach replaces the localization loss with a loss function based on IoU.

Successive refinements have produced several IoU-based localization losses, including GIoULoss, DIoULoss, and CIoULoss (Zheng et al. 2019). GIoULoss better measures the overlap between predicted and ground-truth boxes and penalizes nonoverlapping regions more effectively. DIoULoss and CIoULoss converge faster than IoU loss and GIoULoss, with CIoULoss providing the best convergence speed and localization accuracy because it considers comprehensive geometric factors: overlap area, center point distance, and aspect ratio.

Compared to smooth L1 loss, the localization loss function based on CIoU is more stable and accurate, leading to improved model performance. The CIoULoss we use is given as

$$L_{\rm CIoU} = 1 - {\rm IoU} + \frac{\rho^{2}(l, g)}{C^{2}} + \alpha\nu, \quad (2)$$

where ρ²(l, g) is the squared Euclidean distance between the center points of the predicted box l and the ground-truth box g, C² is the squared length of the diagonal of the minimum enclosing region that simultaneously contains the predicted and ground-truth boxes, ν measures the aspect ratio consistency between the predicted box and the target box, and α is a parameter used to balance the terms. The loss function tends to optimize toward increasing the overlap area, especially when the IoU is zero.

Figure 7 illustrates the principle of CIoULoss, which introduces a penalty term for the aspect ratio of the predicted and ground-truth bounding boxes on top of DIoULoss, making the loss function pay more attention to the shape of the bounding boxes. Figure 7(a) shows DIoULoss, where the green box represents the ground-truth box, the yellow box is the predicted box, and the gray box is the minimum enclosing region. Compared to GIoULoss, DIoULoss can measure the distance between the two when the target box encompasses the predicted box. However, in the case shown in Figure 7(b), when the predicted box is inside the target box and multiple predicted boxes have the same center point position, DIoULoss cannot differentiate the positions of these two predicted boxes because it does not consider different aspect ratios.

Figure 7. Schematic diagram of CIoULoss. In (a), DIoULoss considers the IoU and center point distance, and CIoULoss introduces the aspect ratio factor (b) on the basis of DIoULoss.

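For reference, Equation (2) can be implemented directly. The following is a sketch for boxes in (x1, y1, x2, y2) format; it is not the exact MMDetection implementation we use in the experiments.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss (Zheng et al. 2019) for (N, 4) boxes in x1y1x2y2 format."""
    # Intersection and IoU.
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # rho^2: squared distance between the box centers.
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2 +
            (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    # C^2: squared diagonal of the minimum enclosing box.
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # nu: aspect ratio consistency; alpha: trade-off weight.
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    nu = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) -
                               torch.atan(wp / (hp + eps))) ** 2
    with torch.no_grad():
        alpha = nu / (1 - iou + nu + eps)
    return (1 - iou + rho2 / c2 + alpha * nu).mean()
```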

4. Experimental Results and Analysis

4.1. Experimental Configuration and Evaluation Metrics

All experiments were conducted on a machine with the following hardware configuration: i7-10700k CPU, NVIDIA GeForce RTX 2080 Ti GPU, and 32 GB of memory. The implementation was performed using PyTorch 1.8.1 (Paszke et al. 2019) and MMDetection v2.15.0. The initial learning rate was set to 0.015, and the optimizer used was SGD with a momentum parameter of 0.9 and a weight decay of 0.0004. For the learning rate decay strategy, we employed the cosine decay learning rate method and set the training epochs to 110 to gradually reduce the learning rate and increase model stability.
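In MMDetection v2 config form, the schedule just described corresponds to something like the following; the keys are standard MMDetection options, and the values follow the text.

```python
# Optimizer and learning-rate schedule sketch for MMDetection v2.
optimizer = dict(type='SGD', lr=0.015, momentum=0.9, weight_decay=0.0004)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='CosineAnnealing', min_lr=0)   # cosine decay
runner = dict(type='EpochBasedRunner', max_epochs=110)
```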

The detection accuracy metric used was AP50, where AP refers to the average precision, averaged over all classes in multiclass prediction. AP50 specifically represents the area under the precision-recall curve when the IoU threshold is 0.5. It jointly accounts for precision and recall, reflecting the model's performance in recognizing a specific class. Precision and recall are calculated as follows:

$${\rm Precision} = \frac{TP}{TP + FP}, \qquad {\rm Recall} = \frac{TP}{TP + FN}, \quad (3)$$

where TP represents the number of detection boxes with IoU > 0.5, FP represents the number of detection boxes with IoU ≤ 0.5 or the number of redundant detection boxes for the same ground truth (GT), and FN represents the number of GTs that were not detected.

The detection speed metric is measured in FPS, which evaluates the speed of object detection. It indicates the number of spectrogram images that can be processed per second. FLOPs can be used to measure the complexity of an algorithm and are commonly used as an indirect measure of the speed of neural network models. The parameter count (Params) refers to the total number of trainable parameters in the model and is used to assess the spatial complexity of the model.
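As an aside, Params and a rough FPS estimate can be obtained for any PyTorch model along these lines; this is a hypothetical sketch for a module that accepts a single image tensor, and FLOPs counting requires an external tool such as mmcv's get_model_complexity_info.

```python
import time
import torch

def params_and_fps(model, img_size=320, n_runs=50):
    """Count trainable parameters and time n_runs forward passes."""
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.randn(1, 3, img_size, img_size)
    model.eval()
    with torch.no_grad():
        model(x)                      # warm-up pass
        t0 = time.time()
        for _ in range(n_runs):
            model(x)                  # one "frame" = one spectrogram image
    return n_params, n_runs / (time.time() - t0)
```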

4.2. Comparative Experiments with Other Models

In the experiments, we utilized MobileViT as the feature extraction network and SSDLite as the object detection head. We replaced the original localization loss (smooth L1 loss) with CIoULoss while keeping the cross-entropy classification loss, and we employed multiple data augmentation strategies to enhance the robustness of the model. The performance of the improved model was evaluated using AP50 for accuracy and FPS for speed, and comparative experiments were conducted to analyze the effectiveness of the improvements.

We trained the model on the created solar radio spectrogram data set with the aforementioned experimental settings and model architecture. Figure 8 displays the changes in training loss and validation detection accuracy (AP50). The training loss curve reflects a gradual decrease and convergence of the loss, while the AP50 curve demonstrates a gradual increase and stabilization of the model's detection accuracy.

Figure 8. Training loss and AP50 change curves.

To evaluate the model's detection accuracy, we used AP50 as the metric and compared it with several classic models. The comparison included YOLOv3 (Redmon & Farhadi 2018) with MobileNetv2 as the backbone network, YOLOv5s with K-means anchors (referred to as YOLOv5s[a] in Table 2), and YOLOv5s without K-means anchors, using default sizes (referred to as YOLOv5s[b] in Table 2). We also compared our model with the generalized focal loss (GFL; Li et al. 2020) method combined with adaptive training sample selection and SSDLite with SwinTransformer Tiny as the backbone network. The experimental results and comparisons are summarized in Table 2.

Table 2. Comparative Experimental Results of Other Detection Methods

Model            mAP@50                                                Recall   FLOPs    Params   FPS
                 Average   II       III      IIIs     IV       V
YOLOv3           0.738     0.774    0.827    0.814    0.517    0.759    89.5%    1.69G    3.74M    61.8
GFL              0.742     0.806    0.798    *0.819   0.530    0.756    96.6%    20.78G   32.04M   44.5
MV2-SSDLite      0.750     0.811    0.787    0.777    0.652    0.721    88.8%    0.71G    3.05M    64.8
YOLOv5s[a]       0.752     0.753    *0.867   0.809    0.566    0.766    69.5%    7.943G   7.033M   75.4
YOLOv5s[b]       0.761     0.795    0.778    0.813    0.649    *0.769   70.4%    7.943G   7.033M   75.4
SwinT-SSDLite    0.764     0.810    0.842    0.781    0.714    0.689    90.8%    9.82G    28.79M   41.3
Ours             *0.782    0.827    0.827    0.803    *0.725   0.728    92%      2.9G     5.32M    54.8

Note. Values marked with an asterisk (*) indicate the best results compared to other entries in the table.


In Table 2, our improved model achieved higher average detection accuracy, especially for the less frequent radio burst type IV, where the detection performance was significantly better than other methods. It also outperformed other methods in burst type II. Nevertheless, it slightly underperformed compared to GFL in burst types IIIs and V. Since minimizing the omission of solar radio bursts in this application is crucial, we are particularly interested in the model's recall rate. Our improved model achieved a recall rate of 92%, which was higher than that of the three methods in the YOLO series but lower than that of the GFL model. This difference may be attributed to the fact that the GFL method learns a joint representation of classification scores and localization quality and models the general distribution of predicted box positions, making it more suitable for imbalanced samples.

The comparison of detection speeds among different models can be found in Table 2. The FLOPs of our model are 2.9 G, and the model parameters amount to 5.32 M, indicating a low model complexity suitable for devices with lower computational performance. Due to the presence of self-attention in the computation process of MobileViT, the detection speed of our model was slower than that of the MobileNetV2 backbone network. However, it achieved improved detection accuracy. Compared with SwinTransformer Tiny, which also uses self-attention, it reduces the number of model parameters and improves the detection speed. Compared to the GFL detection network using ResNet50 as the backbone and FPN for feature fusion, our model shows improvements in both speed and accuracy. Compared to the lightweight versions of YOLOv3 (with MobileNetV2 as the backbone) and YOLOv5s, our model improved accuracy and detection speed. However, it required a longer convergence time during training due to SSD's denser set of prior boxes compared to YOLO.

4.3. Experimental Results of the Feature Extraction Network

We conducted experiments comparing the MobileViT-SSDLite and MobileNetV2-SSDLite detection networks on solar radio spectrogram images, and the results are shown in Table 3.

Table 3. Experimental Results of MobileNetV2 and MobileViT

Backbone               mAP@50                                                  Recall
                       II        III       IIIs      IV        V        Average
MobileNetV2-SSDLite    *0.811    0.787     0.777     0.652     0.721    0.750     88.8%
MobileViT-SSDLite      0.795     *0.790    *0.781    *0.746    *0.742   *0.771    90.3%

Note. Values marked with an asterisk (*) indicate the best results compared to other entries in the table.


According to the experimental results in Table 3, the MobileViT feature extraction network not only improved the overall detection accuracy but also significantly enhanced the detection performance for the rare class, type IV solar radio bursts. The average AP50 for all burst types increased by 2.1%, while type IV bursts saw a 9.4% improvement. Furthermore, we conducted heat map analysis using GradCAM++ (Chattopadhay et al. 2018) to examine the attention distribution of the two models. GradCAM++ reflects the sensitivity of the detection output to the pixel values of the input image: the higher the temperature, the more sensitive the region. As shown in Figure 9, MobileViT focused more on the frequency and time domain drift features of the radio bursts.

Figure 9. Comparison of GradCAM++ heat maps of solar radio burst spectrum detection images.

4.4. Model Improvement Effectiveness Analysis

In addition to the MobileViT feature extraction network analysis, a comparative experimental analysis was conducted on the data augmentation and localization loss improvements proposed in this paper. The results are shown in Table 4, which demonstrates the effectiveness of these improvements in further enhancing the detection accuracy of the model.

Table 4. Ablation Experiments

MobileViT   Data Augmentation   CIoULoss   mAP@50    Average Recall
                                           0.620     81.8%
            ✓                              0.750     88.8%
✓           ✓                              0.771     90.3%
✓           ✓                   ✓          *0.782    92.0%

Note. The value marked with an asterisk (*) indicates the best result compared to other entries in the table. Check marks indicate which components are enabled in each configuration.


4.4.1. Data Augmentation Improves Performance

If data augmentation is not performed and the solar radio spectrum images are used directly for training, as illustrated in Figure 10(a), the validation accuracy peaks after a few epochs, reaching a maximum mAP@50 of only 62%. Subsequently, the accuracy declines, indicating that the model is overfitting. To address this issue, we implemented online data augmentation on the data set. The data augmentation strategies include RandomFlip, MinIoURandomCrop, and PMD, as presented in Figure 2. These techniques played a crucial role in enhancing detection accuracy. We conducted further experiments on the role of RandomFlip. As shown in Figure 10(b), compared with augmentation using only PMD and MinIoURandomCrop, adding RandomFlip improves detection accuracy. By incorporating scale and color transformations and flipping into the training process, the model becomes more resilient to variations in the spectrum images.

Figure 10. (a) Without data augmentation, accuracy is poor; (b) adding RandomFlip to the augmentation pipeline further improves accuracy.

Furthermore, it learns features invariant to different image transformations, enhancing the model's generalization capability. Data augmentation mitigates overfitting and enables the model to adapt to the diverse range of solar radio spectrum images. This ultimately leads to improved performance and enhanced generalization of the model.

4.4.2. Comparison of Localization Loss Functions

Smooth L1 loss treats the box coordinates as independent variables and lacks scale invariance. To address this, we replaced the localization loss with IoU-based losses and conducted comparative experiments with GIoULoss and CIoULoss, as shown in Figure 11. The experiments demonstrate that CIoULoss enables more stable model training and achieves better detection accuracy, with a 1.1% increase in AP50.

Figure 11. Comparison of localization loss functions.

4.5. Visualization of Detection Results

As mentioned previously, we enhanced the feature extraction network, employed data augmentation techniques, and utilized an improved localization loss to improve the detection network. We applied this improved model to detect solar radio burst images. Figure 12 illustrates the visualized detection results for different types of solar radio bursts, including types II, III, IIIs, IV, and V. The blue boxes represent the ground-truth annotations, while the orange boxes indicate the detection results, providing both localization and classification information for the radio bursts. Our detection method improves the accuracy of the bounding boxes and reduces the occurrence of redundant boxes, particularly when the spectral features have a wide range.

Figure 12. Solar radio burst detection results of type II, III, IIIs, IV, and V.

4.6. Model Robustness Analysis

The lightweight MobileViT-SSDLite model employed in this paper achieves good results in detecting the fine structural features of radio bursts while processing large numbers of radio spectrum images quickly. It contributes to the automation of solar radio spectrum detection research and enables efficient, real-time, and accurate detection. During the construction of the solar radio spectrum object detection data set, a small portion of incorrect annotations occurred, as shown in Figure 13, where type III was erroneously labeled as types IIIs and V. However, based on the visualized detection results, the detection model made correct judgments, indicating that our detection model performs well and can provide reference annotation information for further improvements in the establishment of solar radio spectrum data sets.

Figure 13. Correct prediction of the model for incorrect labeling.

While collecting radio burst data, we did not gather noise data specifically for annotation purposes. As a result, the model incorrectly identified certain spectrum images that resembled burst types as radio bursts, as depicted in Figure 14. Interference originating from the observation instrument can be present and exhibit distinct radio spectrum shapes. Such interference may resemble radio bursts in the spectrum images, leading to erroneous detection outcomes. The absence of annotated noise images, together with a small number of mislabeled samples, affected the quantitative results; in most cases, the misclassifications involved background types. Certain noise patterns closely resemble radio bursts and necessitate further research to enhance detection accuracy.

Figure 14. Examples of erroneous detections in which noise is identified as solar radio bursts.

5. Conclusion

Astronomical observations have entered the era of big data, necessitating real-time processing of massive quantities of observational data to support scientific research. In this paper, we take the first step by establishing an e-CALLISTO solar radio burst spectrogram data set and annotating four types of bursts: types II, III, IV, and V, with type III further divided into individual (type III) and grouped (type IIIs) bursts. We then propose a solar radio burst detection model based on the lightweight MobileViT-SSDLite architecture. The model automatically detects the five classes of solar radio bursts from large numbers of spectrograms and accurately localizes their positions in the spectrum, achieving an AP50 of 78.2%. Following localization, it facilitates the automatic extraction of parameters such as the bursts' start and end times, duration, and spectral range.

Astronomers require a high recall rate to minimize the omission of solar radio bursts. The experiments demonstrate that our model achieves a recall rate of 92%. Additionally, the model utilizes a lightweight network architecture, enabling a detection speed of 54.8 FPS. This capability allows for real-time burst information retrieval, greatly benefiting solar physics and space weather research. The findings of this paper also hold potential insights for processing other small-sample astronomical data.

However, it is important to acknowledge the limitations of this paper. One such limitation is the difficulty in distinguishing spectrograms affected by noise interference from those exhibiting genuine burst spectral features. Collecting noise spectrogram images to train the model and improve its classification accuracy would be a valuable direction for future research. We can further enhance the proposed model's performance and applicability in solar radio burst detection by addressing these limitations.

Acknowledgments

This research was funded by the National Natural Science Foundation of China (grant Nos. 12263008, 62061049), the Application and Foundation Project of Yunnan Province (grant No. 202001BB050032), the Key R&D Projects of Yunnan Province (grant No. 202202AD080004), and the Yunnan Provincial Department of Science and Technology-Yunnan University Joint Special Project for Double-Class Construction (grant No. 202201BF070001-005).
