Introduction

Colorectal cancer is the third most commonly diagnosed cancer and the second leading cause of cancer-related death worldwide [1]. In China, colorectal cancer ranks third in mortality [2]. Adenomas, the precursors of colorectal cancer, are typically identified and excised during colonoscopy before malignant progression, reducing the incidence of colorectal cancer; accordingly, the adenoma detection rate (ADR) is inversely correlated with colorectal cancer occurrence [3]. Increasing the ADR, which is closely related to both the incidence of colon cancer and the miss rate, is therefore imperative [4]. Studies have shown that about one quarter of polyps are missed during colonoscopy [5], underscoring the need to detect and remove polyps overlooked in the field of view [6]. Even well-trained endoscopists achieve only about 80% accuracy in detecting and diagnosing adenomas [7]. In clinical practice, differences in endoscopists' knowledge, together with high-intensity workloads and mental fatigue, mean that some polyps are easily missed and small early polyps are difficult to identify [8,9,10]. Some studies suggest that a second observer can increase the ADR, although this remains controversial [11,12,13]. Therefore, using computer-aided diagnosis (CAD) technology to improve diagnostic efficiency is of great clinical significance for the early prevention of colorectal cancer [14].

To better detect and diagnose colorectal polyps, various endoscopic imaging technologies have been developed, such as narrow-band imaging (NBI) [15, 16], flexible spectral imaging color enhancement (FICE) [17], and confocal laser endomicroscopy [18]. These techniques allow direct observation of the glandular tubes and microvessels on the polyp surface and, combined with endoscopic polyp classification systems (such as the Japanese JNET classification), support more accurate pathological judgments. However, polyps are still easily missed during image review owing to factors such as the intestinal environment, lighting, and viewing angle. With the development of computer-aided technology, artificial intelligence has improved the detection of intestinal polyps [19]. Convolutional neural networks (CNNs) have been shown to improve the polyp detection rate during colonoscopy [20, 21]. Tang et al. employed Faster R-CNN with transfer learning to improve polyp detection [22], while Jiang et al. used CNN1 to detect polyps, colitis, and CRC patients with high accuracy [23]. The YOLO family is presently the mainstream approach to polyp recognition: Guo et al. introduced an automatic polyp detection framework based on YOLOv3 and active learning to decrease the false positive rate of polyp detection [24], and Wan et al. used YOLOv5 to identify polyp images [25], significantly increasing detection speed, although recognition ability did not improve substantially. In current CAD systems, polyp detection and analysis are primarily carried out on single frames extracted from colonoscopy videos to detect the presence and location of polyps [26, 27]. In actual clinical practice, however, intestinal examination is a dynamic process.
Capsule endoscopy involves the patient swallowing a capsule camera that captures images at a fixed frame rate, which are then delivered to a computer for evaluation. Traditional endoscopy, in contrast, transmits video directly to a computer through a cable and therefore provides continuous video data containing richer temporal information. The temporal correlations in this data are crucial for guiding polyp detection. However, studies exploiting inter-frame relationships are still relatively scarce; most work focuses on static images and does not fully account for the dynamic nature of colonoscopy, which requires a more thorough analysis of temporal correlations. In clinical practice, the background of endoscopic images changes constantly, producing significant variation between frames, which makes it difficult for a model to combine features from adjacent frames and make reliable predictions. To address this challenge and meet the requirements of clinical practice, we developed a deep learning-based auxiliary diagnosis model for colon polyps using Spatio-Temporal Feature Transformation (STFT) and conducted a series of comparative studies to evaluate its effectiveness. The STFT model uses neighboring support frames to predict whether polyps are present in the target frame. It generates multi-level features and a static prediction for the target frame. In the l-th layer, the model uses the target frame's static proposals to guide the spatial transformation of the target feature and to align the features of each support frame with the target frame. The model then captures how the features change over time and fuses spatial and temporal information to localize the target detection box.

Adenomas are closely associated with the development of colorectal cancer, and the adenoma detection rate is inversely related to cancer incidence [3]. As a result, the detection and diagnosis of adenomas are of utmost importance, and the integration of CAD and digestive endoscopy holds significant clinical value. Studies have demonstrated that the STFT-based detection model of colorectal polyps can effectively identify polyps and assist endoscopists in improving the detection rate of polyps. An STFT-based detection model for colorectal polyps can enable real-time detection and diagnosis and can play a significant role in future clinical practice as computer-aided diagnosis and endoscopic technology advance.

Artificial intelligence, built on machine computation and learning capabilities, can effectively solve complex recognition problems [28]. It has therefore been widely applied and developed in medical image recognition, including diabetic retinopathy [29], tissue pathology [30], radiology diagnosis [31], and skin cancer classification [32].

Materials and Methods

Inclusion and Exclusion Criteria

Inclusion criteria were listed as follows: (1) Patients voluntarily participated in this study and signed an informed consent form. (2) Patients had a systolic blood pressure (SBP) level of less than 150 mmHg during the clinical examination. (3) Patients were male or female and over 18 years old. (4) Patients had no heart or lung function problems, no mental illness, and no familial inherited polyp disease. Exclusion criteria were listed as follows: (1) Patients with particularly poor bowel preparation, indicated by a Boston score of less than 3 points. (2) Patients who have undergone partial colectomy. (3) Patients who have had colon polyps removed. (4) Patients with colon cancer. (5) Patients with inflammatory bowel disease or proliferative polyps in the left colon.

Study Design and Patients

The study involved the selection of 600 colonoscopy videos from patients who underwent colonoscopy at the Digestive Endoscopy Center of the First Affiliated Hospital of Anhui Medical University from January 2018 to November 2022. The recorded Olympus 290 colonoscopy videos were first categorized by the patients' Boston Bowel Preparation Scores: 439 videos met the standard (score ≥ 3) and 161 did not (score < 3). Within the cohort meeting the Boston score standard, 72 videos depicted colorectal cancer, 63 showed colitis, 124 contained no polyps, and 180 contained one or more polyps. Because our model is designed to identify polyps, we selected these 180 eligible videos as the video dataset. The data were then divided into approximately 80% for training, 10% for validation, and 10% for testing. By random sampling we chose 160 videos, comprising the training and validation sets, designated Data set 1; the remaining 20 videos constituted the test set, designated Data set 2. All patients included in the study had previously signed informed consent forms. Videos that did not meet the inclusion criteria were excluded. Two physicians of medium experience (less than five years of endoscopy experience) then classified the remaining videos, and the classifications were verified by two highly experienced physicians (more than 10 years of endoscopy experience). Each colonoscopy video captured the withdrawal of the endoscope from the ileocecal valve to the vicinity of the anus, and each video included at least one colon polyp.
The two highly experienced doctors analyzed every frame of each video to determine the presence and number of polyps; their diagnosis served as the study's gold standard and as the benchmark for comparison and assessment. The selected videos were divided into two datasets. Data set 1 comprised videos from 160 colonoscopy patients examined between January 2018 and January 2021, including 200 colon polyp video segments totaling 40,266 frames, of which 33,884 were colon polyp images and 6,382 were non-polyp images; it was used for building, training, and validating the STFT polyp detection model. Data set 2 consisted of 20 colonoscopy patient videos recorded from February 2021 to August 2022, encompassing 38 polyps, with 20,005 colon polyp images and 148,147 non-polyp images; it was used to evaluate and test the performance of the STFT polyp detection model on Olympus 290 videos.

Equipment and Collection Mode

The colonoscopy videos were captured using the Japanese Olympus CF-H290I equipment. The videos were all recorded in white light without magnification.

Model Construction

Colonoscopy Image Labeling

For Data set 1, the polyps in the video segments were labeled by the two highly experienced doctors. During the labeling process, if there was any uncertainty regarding the presence of colon polyps in an image, both doctors had to reach a consensus before confirming the label; where no consensus could be reached, the image was judged as non-polyp. The image labeling tool used was DataLabel, which facilitated the annotation of the colonoscopy images. Additionally, the videos were cropped and divided into separate images using the video cropping software VirtualDub for further analysis and labeling.

Datasets

Firstly, to ensure the security of the data, we expunged any information that could potentially compromise the anonymity of participants, including personal identifiers such as names and genders. Second, we implemented conventional approaches to normalize the images and adjust their dimensions to generate datasets amenable to our model. Lastly, to augment the model’s generalizability and diversify the training dataset, we applied data augmentation strategies during the training phase. Specifically, we employed techniques such as rotations, flips, sliding-window cropping, and brightness transformations to expand the training set, thereby enhancing the model’s adaptability to varying viewpoints and environmental shifts.
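The augmentation strategies above can be sketched as follows. This is a minimal illustration using NumPy arrays; the function name and the specific crop ratio and brightness range are hypothetical, not the study's actual pipeline:

```python
import numpy as np

def augment(image, rng):
    """Apply one randomly chosen augmentation to an H x W x 3 uint8 frame.

    Hypothetical illustration of the strategies named in the text
    (rotation, flip, sliding-window crop, brightness transformation).
    """
    choice = rng.integers(4)
    if choice == 0:  # rotation by a multiple of 90 degrees
        return np.rot90(image, k=rng.integers(1, 4)).copy()
    if choice == 1:  # horizontal or vertical flip
        return np.flip(image, axis=rng.integers(2)).copy()
    if choice == 2:  # sliding-window crop (here: 90% of each side, assumed)
        h, w = image.shape[:2]
        ch, cw = int(h * 0.9), int(w * 0.9)
        y = rng.integers(0, h - ch + 1)
        x = rng.integers(0, w - cw + 1)
        return image[y:y + ch, x:x + cw].copy()
    # brightness transformation: additive shift, clipped to the valid range
    shift = rng.integers(-40, 41)
    return np.clip(image.astype(np.int16) + shift, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
out = augment(frame, rng)
```

In a training loop, such a function would be applied on the fly to each sampled frame so that the model never sees exactly the same view twice.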

Model Working Steps

The neighboring support frame set \(\left\{ {{I_s}} \right\}\) is utilized in the STFT model to help predict the presence of polyps in the target frame \({I_t}\). First, multi-level feature maps and static predictions are generated for the target frame. In the l-th layer, the model uses the static proposal set \(\left\{ {{P_t}} \right\}\) of the target frame to guide the spatial transformation of the target feature \(F_{t}^{l}\). Meanwhile, the static proposal set \(\left\{ {{P_t}} \right\}\) of the target frame and the proposal set \(\left\{ {{P_s}} \right\}\) of each support frame guide the spatial transformation of the support frame feature \(F_{s}^{l}\), so that it remains feature-aligned with the target frame. The temporal feature transformation module then models the channel-aware relationships among all spatially aligned features. The classification and regression modules use the temporally transformed features to predict, for each layer, the static scores and the offsets of the static proposals. Finally, the static proposals and proposal offsets are input into a non-linear activation function to fuse temporal and spatial information and determine the position of the target detection box (Fig. 1).
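The temporal step can be illustrated with a simplified sketch. The real STFT module operates on multi-level convolutional features with learned, channel-aware attention; the toy version below shows only the core idea of weighting spatially aligned support-frame features by their per-channel cosine similarity to the target feature before fusing them (all names and shapes are hypothetical):

```python
import numpy as np

def temporal_fuse(target_feat, support_feats):
    """Fuse spatially aligned support-frame features into the target frame.

    Simplified sketch of channel-aware temporal aggregation: each support
    feature (shape C x H x W) gets a per-channel weight from its cosine
    similarity to the target feature; weights are softmax-normalized over
    the support frames and used for a weighted sum. This is an
    illustrative reduction of the STFT module, not its exact form.
    """
    c = target_feat.shape[0]
    t = target_feat.reshape(c, -1)
    weights = []
    for f in support_feats:
        s = f.reshape(c, -1)
        # per-channel cosine similarity between target and support frame
        num = (t * s).sum(axis=1)
        den = np.linalg.norm(t, axis=1) * np.linalg.norm(s, axis=1) + 1e-8
        weights.append(num / den)
    w = np.stack(weights)                  # (num_support, C)
    w = np.exp(w) / np.exp(w).sum(axis=0)  # softmax over support frames
    return sum(wi[:, None, None] * f for wi, f in zip(w, support_feats))

rng = np.random.default_rng(1)
target = rng.standard_normal((8, 4, 4))
supports = [rng.standard_normal((8, 4, 4)) for _ in range(3)]
fused = temporal_fuse(target, supports)
```

The design intuition is that support frames whose foreground resembles the target frame should contribute more, per channel, than blurred or off-target frames.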

Fig. 1

Steps of the STFT polyp recognition model

Model Training and Validation

The model was trained and validated using the annotated polyp videos of Data set 1: 32,213 images were used for training and model optimization, while 7,953 images were reserved for validation.

Model Testing

(1) Testing of the STFT model in colon polyp videos.

Twenty colonoscopy videos from Data set 2 were obtained retrospectively to test the STFT model, which recognizes polyp images in colonoscopy video at 25 frames per second. The colonoscopists converted the videos into individual images, selected those meeting the quality standards, and classified them as polyp or non-polyp images. The STFT model was then used to identify all of the images. During the video test, a voting filter module was incorporated into the STFT model's output, based on the study by Jin et al. [33]. This module analyzes the video content and adjusts the output and recognition results accordingly, and has proven effective in reducing errors in dynamic detection. According to the predetermined criterion, a polyp target was deemed correctly identified if it was found in more than 2 of 5 consecutive video frames within 0.5 s (Fig. 2).

Fig. 2

The STFT model recognizes the test-set videos after they are converted into pictures. If the model identifies two polyp pictures among five consecutive pictures within 0.5 s, it outputs all 5 as polyp pictures after processing; this is the voting rule
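The voting rule illustrated in Fig. 2 can be sketched as a sliding-window filter over the per-frame predictions. This is an illustrative reduction; the filter used in the study follows Jin et al. [33] and may differ in detail:

```python
def voting_filter(raw, window=5, votes=2):
    """Smooth per-frame polyp predictions with a Fig. 2-style voting rule.

    raw    : list of 0/1 per-frame model outputs (1 = polyp detected)
    window : number of consecutive frames examined together
    votes  : minimum positive frames for the window to count as a detection

    If at least `votes` of `window` consecutive frames are positive, the
    whole window is output as positive; isolated single-frame hits are
    suppressed. Sketch only, not the study's exact implementation.
    """
    out = [0] * len(raw)
    for i in range(len(raw) - window + 1):
        if sum(raw[i:i + window]) >= votes:
            for j in range(i, i + window):
                out[j] = 1
    return out

print(voting_filter([0, 1, 0, 0, 0, 0, 0, 0]))  # -> [0, 0, 0, 0, 0, 0, 0, 0]
print(voting_filter([0, 1, 0, 1, 0, 0, 0, 0]))  # -> [1, 1, 1, 1, 1, 1, 0, 0]
```

A lone false positive is thus filtered out, while two hits close together promote the surrounding frames to detections, which matches the error-reduction behavior described above.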

(2) Comparison of the STFT polyp recognition model and endoscopists of different levels.

The experiment randomly selected 1500 images from the test set, with a ratio of 2:1 for polyp and non-polyp images. This study invited 9 endoscopists to participate in the test, who were divided into 3 groups based on their level of experience: high-experience endoscopists (with more than 10 years of colonoscopy experience), medium-experience endoscopists (with less than 5 years of colonoscopy experience), and low-experience endoscopists (with less than 2 years of colonoscopy experience), with 3 endoscopists in each group. Each endoscopist completed the test independently. The selected endoscopists recognized and diagnosed the 1500 images on the same computer. The order of the images was repeatedly changed throughout the test to ensure the stability of the test model and the balance of the results when compared with endoscopists of various levels. The computer was then used for analysis, recording, and statistics of the accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of polyp recognition by endoscopists of different levels. The STFT model was then used to identify and categorize the 1500 images, and the outcomes were compared with those of endoscopists of various levels.

(3) Comparison of the accuracy of the STFT model and endoscopists with different years of experience in diagnosing polyps of different sizes in the test set (n/N).

The test set was cropped into 38 colon polyp videos, each featuring only polyp segments. Two attending physicians then grouped the videos by polyp size according to the colonoscopy report results. A total of 400 images were randomly selected from each size group, and 9 endoscopists with different years of experience from the First Affiliated Hospital of Anhui Medical University were invited to participate in the test. Each physician completed the test independently, with the images arranged in random order. The STFT polyp detection model was used to identify the same images, and the results were compared with those of the endoscopists.

Applicability of the Model to Videos from Different Companies

On one hand, the deep learning models preprocess input videos through the OpenCV module, using rotation, flipping, sliding-window cropping, and brightness transformation to automatically adjust image sizes for recognition. On the other hand, our approach reconstructs images from videos using a spatiotemporal geometric structure; regardless of differences in presentation or formatting across devices from different manufacturers, the fundamental objective of polyp detection remains the same. To evaluate generalizability, each trained model was tested on consecutively extracted frames using both the generation method applied during training and techniques not seen during training, thereby probing the generalization capability of the different architectures.

Statistical Analysis

The data were analyzed and processed using SPSS 26.0 statistical software (IBM Corp.). The main evaluation indicators of the STFT colon polyp detection model are accuracy, precision, recall, and F1-score (also known as the balanced F-score). The performance of the STFT model in image and video testing was evaluated by calculating the receiver operating characteristic (ROC) curve and the area under the curve (AUC) to obtain the best threshold. Apart from the AUC, the indicators are calculated from true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) (see supplementary materials for formulas). The STFT polyp detection model and the endoscopists were compared using the chi-square test, with accuracy, sensitivity, specificity, positive predictive value, and negative predictive value calculated. A P value < 0.05 indicates a statistically significant difference.
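For reference, the indicators named above follow directly from the confusion matrix. The counts in the example below are hypothetical and serve only to illustrate the formulas; beta = 2 gives the F2 score reported for the validation set:

```python
def metrics(tp, tn, fp, fn, beta=1.0):
    """Standard evaluation indicators computed from the confusion matrix.

    Returns accuracy, precision (PPV), recall (sensitivity), specificity,
    NPV, and the F-beta score (beta=1 gives F1, beta=2 gives F2).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    npv = tn / (tn + fn)
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return accuracy, precision, recall, specificity, npv, f_beta

# hypothetical counts, chosen only to illustrate the formulas
acc, prec, rec, spec, npv, f1 = metrics(tp=90, tn=85, fp=10, fn=15)
```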

Results

Results of STFT Model for Identifying Colon Polyps

Data set 1, used in this study, was divided into training and validation sets totaling 40,266 images, of which 33,884 were colon polyp images and 6,382 were non-polyp images. The precision of the STFT model for identifying colon polyps in Data set 1 was 0.950, the recall was 0.880, the F1 score was 0.914, and the F2 score was 0.893. For testing we utilized Data set 2, which comprised the 20 videos of the test set. The participants included 14 males and 6 females, with an average age of 50.2 years. The dataset consisted of 167,914 images in total, with 20,005 colon polyp images and 148,147 non-polyp images. The STFT model performed well in identifying colon polyps in the test set, with a precision of 0.884, accuracy of 0.967, recall of 0.832, and F1-score of 0.857. The STFT model's output images were obtained (Table 1), and the ROC curve shows how well the model performed in identifying colon polyps (Fig. 3).

Fig. 3

The ROC curve of the white light polyp videos test. The optimal threshold for the model is 0.636
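An optimal ROC threshold such as the 0.636 reported here is commonly chosen by maximizing the Youden index (sensitivity + specificity − 1) over the curve. Below is a minimal sketch with hypothetical scores and labels; the study's threshold came from its own ROC analysis, not from these numbers:

```python
import numpy as np

def best_threshold(scores, labels):
    """Pick the score cutoff maximizing the Youden index (TPR - FPR).

    scores : model confidence per image; labels : 1 = polyp, 0 = non-polyp.
    Illustrative only; real ROC tooling (e.g. SPSS or scikit-learn)
    computes the same quantity more efficiently.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best_j, best_t = -1.0, None
    for t in np.unique(scores):
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / (labels == 1).sum()
        fpr = (pred & (labels == 0)).sum() / (labels == 0).sum()
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t

t = best_threshold([0.1, 0.4, 0.35, 0.8, 0.7, 0.2], [0, 0, 1, 1, 1, 0])
```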

Table 1 Comparison of results for identifying colon polyps in the test set using the STFT model and different models

Comparison of the STFT Model and Different Endoscopists with Different Years of Experience in Identifying Colon Polyps

The STFT model identified the 1500 colon images with relatively balanced evaluation indicators: accuracy 0.902, sensitivity 0.904, specificity 0.898, PPV 0.947, and NPV 0.824. Its accuracy and specificity differed significantly from those of the high-experience endoscopists, who achieved an accuracy of 0.925 and specificity of 0.986; were similar to those of the medium-experience endoscopists (accuracy 0.900, specificity 0.908); and were significantly higher than those of the low- to medium-experience endoscopists (accuracy 0.809, sensitivity 0.800). The sensitivity of the STFT model (0.904) was significantly higher than that of all 9 endoscopists with varying years of experience (P < 0.05). Additionally, the overall accuracy of the STFT model was higher than that of the low- to medium-experience group (P < 0.05) (Table 2).

Table 2 Comparison of results for identifying colon polyps using the STFT model and endoscopists with different years of experience

Comparison of the Accuracy of the STFT Model and Different Endoscopists in Identifying and Diagnosing Polyps of Different Sizes (n/N)

Among the 38 colon polyp videos that were cropped from Dataset 2, 2 were found in the ileocecal region, 4 in the ascending colon, 6 in the transverse colon, 3 in the descending colon, 14 in the sigmoid colon, and 9 in the rectum. There were 38 polyps in total, with 13 in the proximal colon and 25 in the distal colon. Among these polyps, 16 were ≤ 0.5 cm in size, 10 were between 0.6 and 1.0 cm, and 12 were > 1.0 cm. The 38 polyp videos had a combined duration of approximately 14 min, resulting in a total of 19,900 frames when expanded at 25 frames per second. The overall detection results of the STFT polyp detection model are shown in (Table 3). The model detected all 38 polyps, and its sensitivity in detecting each polyp was 100.00% (38/38).

Table 3 Results of the STFT polyp model in identifying polyps of different sizes

In a randomized sample experiment involving 1200 randomly selected images of polyps of varying sizes, the STFT model demonstrated significantly higher accuracy in identifying polyps ≤ 0.5 cm and those between 0.6 and 1.0 cm, with accuracies of 0.905 and 0.955, respectively, than the three low-experience endoscopists (P < 0.05). For polyps > 1.0 cm, the STFT model achieved an accuracy of 0.998, which did not differ statistically from the accuracy attained by the nine endoscopists of various experience levels (P > 0.05; Table 4; Fig. 4).

Fig. 4

The model can identify polyps of different sizes at different locations. In (a–c), polyps are larger than 1.0 cm; in (d–f), polyps are 0.6–1.0 cm; and in (g–i), polyps are smaller than 0.5 cm. The model has a good recognition effect on both raised polyps and small polyps

Table 4 Comparison of the accuracy of the STFT model and 9 endoscopists with different years of experience in identifying polyps of different sizes (n/N)

False positive analysis: A statistical analysis of the images misjudged as polyps by the model in Data set 2 revealed that focal mucosal bubbles, intestinal folds, and endoscopic reflection rings accounted for approximately two-thirds (1436/2191) of the misjudged images. The remaining third (755/2191) were attributed to hazy images caused by low illumination and motion blur (Fig. 5).

Fig. 5

False positive images misjudged as polyp images. (a) Reflection ring; (b) Intestinal wall; (c) Bubble; (d) Motion blur; (e) Dark view; (f) Fecal residue.

Discussion

The integration of deep learning with endoscopic systems primarily entails using various models for colorectal polyp recognition and diagnosis, yielding highly effective polyp detection. Nonetheless, most current polyp detection models concentrate on feature extraction, processing, and analysis of individual images, failing to fully leverage the temporal and spatial information inherent in colonoscopy videos; as a result, small polyps may be missed. In contrast, the STFT model developed by our research group leverages the correlation between temporally and spatially adjacent frames to identify polyps. When a polyp is present, the model uses temporally consistent features to identify it quickly and displays a diagnostic box with an immediate indication of its recognition confidence, which is a significant advantage. Regarding spatial analysis, the model employs deformable convolution as a feature-alignment building block, which adapts well to images with significant variation, and it further adjusts the predicted offsets of the deformable convolution based on the image. Regarding temporal analysis, the STFT model incorporates a channel-aware attention module with two components for estimating variation in frame quality: the first uses cosine similarity to model the foreground correlation between frames, while the second reweights each channel to estimate the change in frame quality.

These two modules work together to strike a balance between representational power and computational efficiency. The STFT model distinguishes itself from other models by proposing a proposal-guided spatial transformation method that enhances object-center awareness during feature alignment; this reduces feature inconsistency between adjacent frames when the camera moves, resulting in more accurate polyp detection. Additionally, to aggregate features while balancing representational capacity and computational cost, the STFT model introduces a novel channel-aware attention module. The STFT model is therefore better equipped to handle the inter-frame blurring caused by dynamic polyp images and the interference of light and shadow on polyp judgment, offering unique advantages over models such as Faster R-CNN, CNN1, and YOLOv5.

This study significantly advances the use of artificial intelligence for the detection and diagnosis of colorectal polyps. The STFT model may help endoscopists recognize polyps more precisely and quickly in the clinical setting. Testing and analysis of the original videos stored on the server demonstrate that the STFT model achieves high accuracy (0.967), precision (0.884), recall (0.832), and F1 score (0.856) on the whole of Data set 2. Overall, the STFT model exhibits good performance and high accuracy in detecting colorectal polyps. Although the recall of the YOLOv5 model is higher (0.921), it was validated on static images rather than the dynamic videos used in this study [27]. Our dataset was derived from dynamic videos, which contain a significantly larger number of polyp images of different types and locations than those in other studies; this explains why the sensitivity of the STFT model on the test set is slightly lower than that of YOLOv5. Despite this challenge, the STFT model still correctly identifies polyps and non-polyps with high accuracy, surpassing models such as Faster R-CNN, CNN1, and YOLOv5.

The STFT model developed in this study has demonstrated high accuracy, precision, recall, and F1-score in the test set, enabling real-time diagnosis in retrospective colonoscopy videos. The accuracy of the STFT model surpasses that of junior endoscopists and is comparable to that of mid-to-high-level physicians. The STFT model also has a better chance of detecting small polyps. Comparing the accuracy of the STFT model with that of endoscopists of different levels in identifying and diagnosing polyps of varying sizes is of unique significance. For polyps with a size of ≤ 0.5 cm, the STFT model’s accuracy in identifying polyps is 90.5%, significantly higher than that of junior endoscopists, comparable to that of mid-level endoscopists, and slightly lower than that of senior endoscopists. For polyps with a size between 0.6 and 1.0 cm, the STFT model’s accuracy in identifying polyps is 95.0%, higher than that of junior endoscopists and lower than that of mid-to-high-level endoscopists. For polyps with a size > 1.0 cm, the STFT model’s accuracy in identifying polyps is 99.75%, with no significant difference from endoscopists of different levels. A clinical randomized trial was carried out by Wang et al. [11] to compare the efficiency of DL-based CAD for polyp detection with conventional colonoscopy. The study found that the ADR of CAD in identifying polyps was superior to that of standard colonoscopy, with the increase in ADR limited to tiny polyps. However, the diagnosis of polyps larger than 10 mm did not differ significantly. As a result, the model’s accuracy in identifying polyps improves with polyp size. The STFT model has demonstrated good performance in identifying small polyps. In real-world clinical examinations, junior endoscopists are primarily responsible for performing colonoscopies, and their skills are often influenced by experience and operational level, which are the primary factors leading to missed diagnoses. 
Thus, utilizing the STFT model can effectively assist junior endoscopists in improving the detection rate of polyps and reducing the risk of adenoma occurrence to some extent. The spatiotemporal feature transformation-based model has demonstrated significant clinical benefits in diagnosing and detecting colorectal polyps, and can effectively aid endoscopists in improving the detection rate of polyps, particularly in identifying small polyps, which can provide greater value.

At present, most intestinal polyp detection models studied in China and internationally concentrate on identifying and diagnosing single images [16, 17]. This approach, however, introduces selection bias: the selected polyp images are typically clear and static and may not fully capture the dynamic nature of polyp recognition. While such models may achieve high recognition rates, they may not reflect the real-world clinical context. Using a model designed for single-image recognition to detect dynamic videos has inherent limitations and yields significantly lower accuracy in polyp detection, mainly because the model cannot properly exploit the temporal information in video sequences.

In contrast, the STFT model employed by our research group is specifically designed to handle dynamic videos, enabling the detection of polyps in continuous dynamic processes. The model excels at quickly identifying polyps even in complex intestinal environments by examining temporal and spatial consistencies, thereby reducing the possibility of missed polyp diagnoses.

The STFT model created by our research team has also demonstrated applicability to endoscopes of various brands. We have expanded its clinical applications by utilizing transfer learning approaches previously applied to different image types [25]. In the initial phase of our research, the STFT model was employed to identify polyps in Fujifilm colonoscope videos; although its sensitivity and accuracy were slightly lower than on Olympus 290 videos, it still holds significant clinical value and potential. Considering that primary hospitals mainly use Fujifilm colonoscopes, the STFT model can be especially advantageous there: incorporating the model can improve polyp recognition and classification, increase detection rates, and decrease misdiagnosis. This type of polyp detection model therefore holds immense practical value across various colonoscope models.

While our research group has made significant progress in efficiently detecting polyps in dynamic videos by utilizing the model to identify changes in spatiotemporal frame relationships, some limitations still warrant attention. On one hand, the STFT model has not yet achieved real-time dynamic polyp detection and is currently limited to retrospective video analysis; in future work we will explore targeted solutions that enable real-time detection. On the other hand, we have only studied databases from representative hospitals in one province; more public data from a wider range of sources should be collected to establish the general applicability of the STFT model, and multicenter studies can improve its multi-class performance and generalizability. Although the model is not currently trained to recognize inflammatory polyps and intestinal tumors, we plan to develop this capability in the future to enhance the model's versatility.

Conclusion

The STFT colon polyp detection model, based on spatiotemporal frame relationships, has shown high specificity and accuracy in detecting and identifying polyps. It can be a valuable tool for endoscopists, especially those with limited experience in performing colonoscopies, and it has the potential to aid in the training and diagnostic support of endoscopists in underdeveloped areas.