1 Introduction

The diagnosis of digestive tract diseases depends on gastrointestinal endoscopy, imaging, and pathology. Deep learning (DL) has been widely applied in these fields. It can automatically build an image recognition system without manual feature engineering and achieve high diagnostic efficiency. In recent years, various advanced algorithms and models for computer-aided diagnosis (CAD) have been proposed, which are expected to reduce doctors’ workload and misdiagnosis rates (Fig. 1).

Fig. 1 Mind map

Artificial intelligence (AI) can be defined as the intelligence displayed by machines that mimic human cognitive functions [1, 2]. Machine learning (ML), a subdomain of AI, refers to algorithms that are trained from data to perform a task rather than executing an explicitly written program. Representation learning (RL) is a sub-category of ML that learns the core features of the data autonomously and uses them for classification [3]. DL is a kind of RL. DL learns combinations of features that reflect the hierarchical structure of the data and uses them to provide detailed image classification output. At present, DL represented by convolutional neural networks (CNN) is the most widely used form of AI in medicine [4]. DL can extract pathological features by learning from massive clinical data, without features being specified in advance, and can perform CAD on the basis of these features. CAD can significantly reduce clinicians’ workload and assist doctors in making more accurate and rapid diagnoses. Besides, advanced diagnosis and treatment technologies can be shared across a wider region, and medical resources can be rebalanced through CAD.

2 Application of DL in gastrointestinal endoscopy

Digestive endoscopy is an essential method for diagnosing and treating digestive tract diseases and plays a vital role in screening for precancerous lesions and early cancers. The detection rate of early precancerous lesions under endoscopy is relatively low, so improving the endoscopic detection rate of early tumors is of great significance for the prognosis of patients with digestive tract tumors. AI-assisted endoscopic diagnosis is expected to increase endoscopists’ detection rate of gastrointestinal lesions and reduce misdiagnosis and missed diagnosis [5]. With the continuous iteration of computer technology and the arrival of the big data era, research on AI-assisted endoscopic diagnosis is flourishing.

DL has been applied in the endoscope-assisted diagnosis of tumors and precancerous lesions of the esophagus [6, 7], stomach [8], small intestine [9], and colorectum [10, 11]. The vast majority of scholars use endoscopic photographs or videos to carry out DL. The number and size of training sets adopted by different studies vary greatly, but most CAD systems’ accuracy in diagnosing tumors or precancerous lesions can exceed 80%.

Due to the lack of large-scale, authoritative public data sets, studies have often relied on single-center endoscopic data. The number of patients is usually fewer than 100, which limits the accuracy and generalizability of DL and introduces selection bias. To address this, one study improved data utilization and the diagnostic accuracy for Barrett’s esophagus by establishing an adversarial network [12]. Multi-center randomized controlled trials provide the most compelling evidence. However, there have been few multi-center prospective studies of AI in gastrointestinal endoscopy so far. Wu and Xu et al. conducted two randomized controlled trials to verify the effectiveness of ENDOANGEL, a CAD system, in white-light imaging (WLI) and image-enhanced endoscopy (IEE) examination of early gastric cancer [13, 14].

Fully supervised CNN training is challenging in endoscopy because it is difficult to obtain depth maps that correspond directly to real endoscopic images. Weakly annotated images may be a cost-effective approach in future. A weakly supervised convolutional neural network (WCNN) can identify abnormal video frames and detect specific pathological points within them [15]. In this way, images need only image-level annotations instead of detailed pixel-level annotations. The system can automatically analyze detailed lesion areas from a rough partition of the image, thus achieving favorable detection and localization performance. Mahmood et al. put forward an unsupervised reverse domain adaptation framework to avoid the need for extensive annotation [16]. Their system used adversarial training to remove patient-specific details from real endoscopic images while preserving diagnostic details. However, their research was limited to static image recognition and could not adapt to endoscopic videos recorded in poor light or in scenes of unknown depth. Ozyoruk et al. proposed an unsupervised method for monocular visual odometry and depth estimation to solve the problems of frequently changing lighting conditions and scale inconsistency between consecutive frames [17]. The algorithm was optimized with mixed loss functions and used spatial attention modules to direct the network to focus on tissue areas. Besides, the system used a photometric loss to improve robustness to fast inter-frame illumination changes in endoscopic videos. Itoh et al. performed unsupervised DL by introducing the Lambertian reflection model as an auxiliary task for domain conversion between real and virtual colonoscopy images. The system can accurately extract 3D information, reducing the impact of specular reflection and colon wall texture on depth estimation [18]. Hwang et al. proposed a self-supervised monocular depth estimation method to assess spatio-temporal consistency in the colonic environment by detecting depth differences between adjacent frames [19]. They used a loss function and a depth feedback network to estimate depth information in the next frame from the data of previous frames.

The diagnostic accuracy of narrow-band imaging (NBI) for esophageal disease is higher than that of WLI, but there are few DL studies on NBI at present. AI diagnostic efficiency on NBI images does not differ significantly from that on traditional WLI images, because NBI improves lesion detection sensitivity but also increases the possibility of overdiagnosis, which reduces diagnostic specificity. However, NBI helps to improve the accuracy of histological diagnostic grading [6]. Moreover, NBI can enhance the ability to differentiate squamous cell carcinoma microvessels [20]. A multi-center study showed that a CAD system based on magnifying endoscopy with narrow-band imaging (ME-NBI) matched the predictive performance of senior endoscopists in early gastric cancer. Nevertheless, the system, which was studied on images rather than videos, requires endoscopic magnification of the suspected lesion site before it can be used. Moreover, the system cannot determine the depth of tumor invasion [21].

Colorectal cancer is the third most common cancer in the world [22]. Colorectal adenomas have a 50% chance of malignant transformation, so early detection plays a crucial role in reducing mortality. About a quarter of adenomas are missed during standard colonoscopy [23]. DL therefore has excellent application value in identifying and classifying colorectal polyps. Bora et al. collected WLI and NBI images of the colorectum to address the complex problem of systematic visualization [24]. They used Generic Fourier Descriptors (GFD) to quantify shapes and the Nonsubsampled Contourlet Transform (NSCT) to extract texture and color features, and performed analysis of variance to confirm that the GFD and NSCT features of neoplastic and non-neoplastic polyps were significantly different. After constructing a CNN model, Lai et al. found that both full-color NBI and red-green dual-channel NBI had better sensitivity than WLI in detecting polyps under colonoscopy [25].

Endoscopic ultrasonography (EUS) can improve imaging function and provide various methods for treating biliary tract diseases. Its steep learning curve and over-reliance on operators limit its clinical application in remote areas. Seven et al. predicted the mitotic index of gastrointestinal stromal tumors (GISTs) in EUS by DL. The system was able to automatically determine the prognosis of patients by EUS images [26]. The DL model designed by Yao’s research team can accurately identify the bile duct in EUS and automatically calibrate the anatomical position to measure the bile duct’s diameter, thus significantly improving the accuracy of the operator. The ability to identify lesions needs to be further developed in future [27].

The practical application of AI in gastrointestinal endoscopy is strongly time-sensitive, so CAD must be integrated into the endoscopy workflow. Uneven lighting, gas, liquid, and surgical scars are the critical factors affecting the real-time application of AI in endoscopy. Using manually filtered or standardized images for DL may reduce the system’s robustness. Gutierrez et al. collected clinical endoscopic videos of patients with ulcerative colitis from hundreds of different sites using different equipment, which significantly increased the area under the receiver operating characteristic curve (AUROC). Besides, these videos did not need to be annotated by professional endoscopists. The system automatically preprocesses and screens the original endoscopic videos and automatically trains the CNN, significantly reducing clinicians’ workload and the bias caused by manual selection [28].

Confocal laser endoscopy (CLE) can detect various focal lesions with an accuracy close to that of pathological examination. CLE can also dynamically observe lesions at microscopic resolution, so it has great application value in diagnosing and treating inflammatory bowel disease (IBD). However, CLE requires accurate image interpretation, which only experienced endoscopists can provide. Udristoiu’s team designed a DL system that can distinguish between ulcerated and healed Crohn’s disease in CLE images [29]. Still, the algorithm was unable to distinguish active ulcers from inactive ulcers.

Wireless capsule endoscopy (WCE) can move along the entire digestive tract to identify gastrointestinal polyps and other lesions and allows patients to avoid the discomfort of traditional endoscopes. At present, there are two research hotspots: one is how to use DL models to accurately find the lesion site among the thousands of pictures taken by WCE; the other is how to control the capsule so that it can actively recognize lesions, reach the lesion site, and deliver drugs or take biopsies. Up to now, DL has been able to identify intestinal vascular dilatation [30], hemorrhage [31], polyps [32], colorectal tumors [33], and ulcers caused by Crohn’s disease [34] from the tens of thousands of photographs taken during each WCE examination. However, current research consists mainly of retrospective studies, and most of the data sets are composed of still images. Therefore, multi-center prospective studies with large samples are required to verify CNN’s effectiveness in WCE image recognition. Incetan et al. introduced virtual reality (VR) technology into WCE. Their group used computed tomography (CT) images to create a 3D organ model and then removed interference such as bone, fat, and skin [35]. The system can precisely reproduce the patient’s real organ, including mucosal texture and vascular network images, so that the capsule can be located accurately through DL technology. Furthermore, the capsule, which contains magnets, is controlled by an external robotic arm, making it possible for physicians to observe and perform relevant tasks with the help of WCE (Table 1).

Table 1 Application of DL in gastrointestinal endoscopy

3 Application of DL in digestive system imaging

3.1 Computed tomography

Gastroscopy has been recommended for screening patients with cirrhosis for esophageal and gastric varices. However, this invasive procedure carries risks such as bleeding. Therefore, some studies suggested that platelet count, spleen length, and the ratio of platelet count to spleen length should be used as a non-invasive examination to determine the degree of shunting in esophageal varices and thereby evaluate the risk of varices [36]. Ma’s team used DL to assess the CT volume of the liver and spleen in patients with hepatitis B virus-related cirrhosis and combined it with the patients’ platelet ratio to perform a computer-assisted assessment of their risk of varices [37].

Zhang et al. established a 3D learning network to evaluate models from a data set of CT images collected from three medical centers, achieving promising performance in gastric tumor edge segmentation and lymph node classification [38]. Another study established a dual-energy computed tomography (DECT) radiology DL model [39]. The predictive value of its response to chemotherapy was analyzed to predict patients’ treatment response during chemotherapy, which may help adjust treatment strategies in time through semi-automatic segmentation of advanced gastric cancer. Due to the small sample size, performing a performance analysis for each chemotherapy regimen was impossible. The DL system developed by Jiang’s research team can predict occult peritoneal metastasis of gastric cancer preoperatively by analyzing CT images, thus reducing the risk of blindly performing extensive total gastrectomy [40]. The next research direction may be the judgment of peritoneal metastasis after neoadjuvant chemotherapy and the DL of 3D images of other tumors.

DL also plays a role in the interpretation of CT in patients with cystic pancreatic lesions [41], pancreatic neuroendocrine tumors [42], and pancreatic cancer [43]. It can also achieve automatic localization and boundary segmentation of the pancreas in CT [44]. Due to their high degree of malignancy, pancreatic cancers present irregular contours and unclear margins on CT, which makes them difficult to demarcate from surrounding tissues. Besides, it is challenging to label CT images manually because of the complex anatomy around the pancreas. Liu et al. manually labeled CT images of pancreatic cancer, increased data utilization by shifting and flipping the images, and reduced the number of convolutional layers to reduce the model’s complexity [43]. Besides, they limited the patch size to 50 × 50 pixels, because smaller patches contain too little information about the relationship between the tumor and adjacent tissue, whereas larger patches add irrelevant image interference. As a result, the model achieved a high AUROC for the diagnosis of pancreatic cancer in patients of different races.
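The augmentation and patch-size choices described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in [43]: the function names (`augment_shift_flip`, `extract_patches`), the shift offsets, the non-overlapping stride, and the synthetic input are all assumptions made for illustration only.

```python
import numpy as np


def augment_shift_flip(image, shifts=((0, 5), (5, 0))):
    """Return the original image plus flipped and shifted copies.

    The shift offsets are illustrative assumptions, not values from [43].
    """
    augmented = [image, np.fliplr(image), np.flipud(image)]
    for dy, dx in shifts:
        # np.roll wraps pixels around the border; a real pipeline might pad instead.
        augmented.append(np.roll(image, shift=(dy, dx), axis=(0, 1)))
    return augmented


def extract_patches(image, patch_size=50):
    """Cut a 2D image into non-overlapping patch_size x patch_size patches."""
    height, width = image.shape
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return patches


# Synthetic stand-in for one CT slice (random values, not real data).
rng = np.random.default_rng(0)
ct_slice = rng.random((200, 150))

augmented_slices = augment_shift_flip(ct_slice)
patches = [p for s in augmented_slices for p in extract_patches(s)]
```

In this toy setting one slice yields five augmented copies, each cut into 50 × 50 patches, which shows how shifting and flipping multiply the number of training samples obtained from a small labeled data set.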

3.2 Magnetic resonance imaging

Compared with CT, there are few DL studies on magnetic resonance imaging (MRI). Most current research has focused on the diagnosis of liver, pancreatic, and rectal diseases, such as liver cancer [45], liver fibrosis [46], liver fat segmentation [47], pancreatic tumors [48], and rectal cancer [49]. Abdominal organ segmentation and fat segmentation are the advantages of MRI. Automatic segmentation of high-risk organs has important application value in MRI-guided radiotherapy. The robotic abdominal multi-organ segmentation technology developed by Chen’s team can accurately segment nine abdominal organs with fewer parameters, although duodenum segmentation still needs further improvement [50].

The quantification of human adipose tissue depots can help doctors understand a patient’s health. Belly fat has been linked to high blood pressure, inflammation, and type 2 diabetes [51]. Langner’s multi-center study demonstrated the robustness of their DL model in fat quantification [52]. In recent years, studies have found interactions and pathological similarities between IBD and metabolic disorders, including metabolic tissue disorders, inadequate immune response, and inflammatory response [53]. Patients with non-alcoholic fatty liver diseases (NAFLD) or a high body fat percentage are at higher risk for IBD [54, 55]. Combined with the patient’s clinical symptoms, MRI fat quantification could be applied to CAD of IBD and diabetes in future.

3.3 Positron emission tomography

Positron emission tomography (PET) imaging is commonly used in clinical oncology for diagnosis, staging, restaging, and monitoring of treatment response [56]. Image quality is crucial for visual interpretation and quantitative analysis [57]. Photons scattered outside the receiving energy window are discarded and thus contribute to attenuation. Within the receiving energy window, the path variation caused by photon scattering needs to be corrected. Attenuation and scattering events result in a local decrease or increase of detected counts, which leads to underestimation or overestimation of tracer uptake, respectively. Thus, image contrast is reduced, and quantification error is introduced. In PET imaging analysis, CNN has been applied to image reconstruction [58] and image denoising [59]. These technologies will help radiologists produce more accurate PET images without obtaining CT images. While earlier studies were limited to the brain, current studies tend to look at whole-body scans. The DL models of Shiri and Mostafapour can automatically correct attenuation and scatter in PET images [60, 61]. The most conspicuous advantage of their systems is that they do not require anatomical information as input. Nevertheless, they were susceptible to artifacts, leading to misjudgment of organ boundaries, especially between the lungs and liver and between the abdomen and pelvis. To avoid misjudgment in whole-body dynamic PET, appropriate function and kinetic models are required, along with whole-body motion correction.

In the digestive system, PET images are used to help detect lesions in liver CT scans. Using a combination of a generative adversarial network (GAN) and a fully convolutional network (FCN) to generate PET images from CT scans, a research team reduced the false positive rate by 28% [62]. Wang et al. introduced a GAN-based method for generating high-quality PET images from low-dose tracer scans, thereby reducing the risk from radioactive isotopes [63]. They introduced a 3D-based progressive refinement scheme to improve image quality.

3.4 Ultrasound

Although MRI is accurate and non-invasive, the cost of using MRI to assess liver fat is high, so some research teams want to quantify liver fat by ultrasound. For example, Byra et al. used MRI to obtain the proton density fat fraction (PDFF) of patients and then matched the ultrasound images for the training model, achieving qualified diagnostic accuracy [64].

Ultrasound is the front line of screening for abdominal diseases. At present, research on the application of DL in ultrasound is gradually increasing. Yang’s team created a mouse model of intestinal inflammation to collect micro-ultrasound (μUS) images of the cecum, small intestine, and colon. Three DL networks were trained to distinguish between healthy tissue and early inflammatory tissue [65]. A prospective five-center study applying DL to ultrasound videos of biliary atresia achieved higher diagnostic accuracy than human experts. The research team also developed a mobile app based on DL of ultrasound pictures, enabling rural doctors in remote areas to perform CAD by taking and uploading photographs of suspected biliary atresia [66].

Hepatic cystic echinococcosis is still endemic in some areas and has five subtypes [67]. Its ultrasonic appearance may change naturally over time or in response to treatment, making diagnosis difficult [68]. Although microscopic examination after surgical treatment is the gold standard for diagnosing subtypes and stages of hepatic cystic echinococcosis, accurate ultrasound diagnosis is of great value for patients who can be cured with medical treatment [69]. Wu et al. used three types of CNNs for DL; because their architectures and feature extraction differed, their results were not wholly consistent. The three systems complemented each other, which further improved the model’s accuracy and ultimately enabled the accurate classification of hepatic cystic echinococcosis under ultrasound [70].
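One common way to let several classifiers complement each other is soft voting, in which the class probabilities of the individual models are averaged. The source does not state how the three CNNs in [70] were combined, so the following is only a sketch under that assumption; the function name `soft_vote` and all probability values are made up for illustration.

```python
import numpy as np


def soft_vote(probabilities_per_model):
    """Average class probabilities over models and return the winning class per sample.

    probabilities_per_model: array-like of shape (n_models, n_samples, n_classes).
    """
    stacked = np.asarray(probabilities_per_model, dtype=float)
    mean_probs = stacked.mean(axis=0)
    return mean_probs.argmax(axis=1), mean_probs


# Made-up softmax outputs of three hypothetical CNNs for two images and
# five classes (one per subtype); these numbers are not from [70].
model_a = [[0.60, 0.10, 0.10, 0.10, 0.10], [0.10, 0.40, 0.30, 0.10, 0.10]]
model_b = [[0.50, 0.20, 0.10, 0.10, 0.10], [0.10, 0.20, 0.50, 0.10, 0.10]]
model_c = [[0.70, 0.10, 0.10, 0.05, 0.05], [0.10, 0.30, 0.40, 0.10, 0.10]]

predicted, mean_probs = soft_vote([model_a, model_b, model_c])
```

For the second image the three models disagree (the first favors class 1, the other two favor class 2), and averaging resolves the disagreement in favor of class 2, which is the kind of complementarity the paragraph above describes.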

Ultrasonography (US) has crucial diagnostic value for benign and malignant lesions of the liver. Due to the low contrast between lesions and normal liver tissue, the diagnosis of solid lesions is a challenge. Ryu et al. used 4309 US images of focal liver disease, including liver cysts, hemangioma, metastasis, and hepatocellular carcinoma, for DL and achieved precise segmentation and classification of focal liver lesions [71]. Contrast-enhanced ultrasound (CEUS) allows real-time scanning and provides dynamic perfusion information, so it has the potential to surpass CT and MRI in liver and gallbladder diseases [72, 73]. Hu’s CEUS system can help junior ultrasound physicians achieve higher diagnostic sensitivity for liver tumors [74].

Imaging is an essential method for the diagnosis of liver diseases. With CT, MRI, and ultrasound, clinicians can accurately determine whether a patient has liver fibrosis, cirrhosis, non-alcoholic fatty liver disease (NAFLD), benign tumors, or hepatocellular carcinoma (HCC). With the development of next-generation sequencing and multi-omics tools, precision medicine can help doctors more comprehensively understand the health status of patients [75, 76]. In future, omics information can be integrated into imaging data to facilitate the development of precision medicine, provide professional health care strategies for patients with sub-health, and design the best diagnosis and treatment plan for patients [77] (Table 2).

Table 2 Application of AI in digestive system imaging

4 Application of AI in digestive pathology

Pathological biopsy is the gold standard for diagnosing benign or malignant diseases of the digestive tract, but the number of pathologists is relatively small, so DL can effectively reduce pathologists’ workload. In recent studies, images at different magnifications were extracted from standardized HE-stained specimens, and affine transformations were used to make up for deficiencies in the data sets. Whole slide image (WSI) learning could then be performed using these pictures. Standardized images have the advantage of removing differences in staining between samples, but retrospective studies can also lead to selection bias, and different staining conditions can affect CAD diagnoses. There have been retrospective studies on DL in the pathological diagnosis and prognosis analysis of Helicobacter pylori gastritis [78], rectal cancer [79], pancreatic tumors [80], and gastrointestinal and endocrine tumors [81]. Prospective, multi-center, and large-scale trials have also begun to verify these algorithms’ usability [82]. However, the CAD results in these studies are generally difficult to interpret. The DL system developed by Ma et al. can distinguish between normal gastric mucosa, chronic gastritis, and gastric cancer. They used visualization techniques to display the DL model’s content and revealed how the AI program extracted the morphological characteristics of gastric mucosal lesions at different stages. Eventually, gastric cancer progression was revealed, and the black-box effect of CAD was attenuated [83].

The number of metastatic lymph nodes is an essential determinant of the TNM staging of gastrointestinal malignant tumors and is also one of the most critical factors in evaluating gastric cancer prognosis. The clinicopathological diagnosis of lymph nodes is influenced by subjective factors and requires much time and effort [84]. The DL system of Pan and Ding can quickly count esophageal and rectal lymph node metastases in a large field of view. However, as their system only supports rectangular annotation, it lacks robustness for small objects or complex contours [85]. Hu’s model improved on this basis and achieved excellent contour segmentation of single lymph nodes, thus effectively improving the accuracy of lymph node quantification [86]. Wang et al. proposed a DL framework to analyze WSIs of lymph nodes from patients with gastric carcinoma. The system can accurately identify and segment lymph node regions and then compute the ratio of tumor area to mesenteric lymph node area to predict patients’ prognosis. The system even found several poorly differentiated tumor cells missed by pathologists [87]. Kwak’s team also used WSI for the DL of lymph node metastasis in patients with colorectal cancer. They found that the peri-tumoral stroma (PTS) score was a reference for predicting the number of lymph node metastases [88].

A large sample of prospective studies recently investigated the DL system’s application in the pathological diagnosis of gastric cancer. The algorithm achieved 100% sensitivity and 97% specificity for gastric epithelial-derived tumors, and the vast majority of false positives were due to ulcers or inflammation [89]. However, the system often mistakenly identified GIST as atypical hyperplasia because there was no targeted dataset for non-epithelial tumors. Therefore, it is still worthwhile to establish a new learning model for mesenchymal tumors.

The evaluation of surgical margins is inseparable from prognostic analysis. However, due to the extremely high resolution of WSIs, prognostic analysis based on WSI is often costly [90]. A promising approach is to divide the whole WSI into small patches and then automatically analyze the prognosis of patients through DL. While the border between tumor and normal tissue can be delineated manually, labeled tumor areas can still contain normal tissue. Pixel-level annotation can alleviate this problem, but it is a drain on the pathologist’s energy. Saillard et al. extracted regions randomly for manual annotation as input to the DL system. They found that vascular spaces, the macrotrabecular architectural pattern, and a lack of immune infiltration suggested a poor prognosis of HCC. Although highlighting these areas can increase the accuracy of their system, it still does not utilize all pathological information [91]. Weakly supervised learning (WSL) utilizes easily available image-level annotations to infer pixel-level information automatically. Pathologists label a WSI as cancer as long as a small portion of the image contains a cancerous area, and they only need to mark the lesion type without specifying the exact location of cancer cells [92]. This greatly reduces pathologists’ annotation burden and is particularly applicable to the field of histopathology. Shao et al. divided each WSI into about 1000 patches of 512 × 512 pixels and then used WSL to conduct DL recognition on all patches, so as to fully capture the background information of the pathological images. The effectiveness of the WSI-level patient prognosis assessment was validated in three cancer datasets from The Cancer Genome Atlas (TCGA) [93]. Due to the large size of WSIs and the small proportion of lesions in some cases, image-level labels make automatic diagnosis difficult. The recalibrated multi-instance deep learning method (RMDL) can automatically find the key instances. After the high-precision localization network and the recalibrated multi-instance learning were optimized, the accuracy reached 86.5% [94].
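The two ideas in this paragraph, tiling a WSI into fixed-size patches and deriving a slide-level result from patch-level scores, can be sketched as follows. This is a simplified illustration and not the method of [93] or [94]: the max-pooling aggregation rule is a basic multi-instance learning baseline, `toy_patch_score` is a hypothetical stand-in for a trained patch classifier, and the "slide" is a synthetic array.

```python
import numpy as np


def tile_image(image, patch_size=512):
    """Split an (H, W, C) array into non-overlapping square patches, dropping partial edges."""
    height, width = image.shape[:2]
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return patches


def toy_patch_score(patch):
    """Hypothetical stand-in for a trained patch classifier: mean intensity scaled to [0, 1]."""
    return float(patch.mean()) / 255.0


def slide_level_score(patches, score_fn):
    """Max-pooling multi-instance rule: a slide is as suspicious as its most suspicious patch."""
    scores = np.array([score_fn(p) for p in patches])
    return scores.max(), int(scores.argmax())


# Synthetic "slide": dark background with one bright 512 x 512 region.
slide = np.zeros((2048, 1536, 3), dtype=np.uint8)
slide[512:1024, 1024:1536] = 255

patches = tile_image(slide)
score, key_index = slide_level_score(patches, toy_patch_score)
```

Only the image-level outcome (`score`) would be compared with the slide label during weakly supervised training, while `key_index` points to the patch that drove the decision, which corresponds loosely to the "key instance" mentioned above.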

Hyperspectral imaging (HSI) is a non-contact, non-contrast, non-invasive optical imaging technique that provides per-pixel spectral and spatial information about the analyzed region. It has been applied to both gastric and colorectal cancer. Jansen-Winkeln et al. combined HSI with AI technology to distinguish colorectal cancer or adenomas from healthy mucosa on specimens. Besides, they used visualization techniques to help clinicians understand the model’s decisions [95]. In future, with increased time efficiency, the technology may be used in the operating room on freshly removed specimens or even integrated into the laparoscope to help surgeons determine the extent of lymph node dissection in real time.

In one study, 12 specimens of GIST were irradiated with near-infrared (NIR) light. NIR irradiation transparency distinguished the specific HSI information of GIST, and the lesion range of GIST was predicted by ML [96]. This technique may be utilized in the prediction of all submucosal tumors in future. However, light is often affected by the specimen’s thickness, and the training set is sometimes too small (Table 3).

Table 3 Application of AI in digestive pathology

5 Major techniques and issues

DL is a kind of ML technique that can recognize highly complex patterns in large data sets. As mentioned above, DL can be broadly divided into supervised learning and unsupervised learning. The most popular architectures in supervised learning are CNN and recurrent neural network (RNN) (Fig. 2). In addition, there are also the spatial convolutional network (SCN), temporal convolutional network (TCN), and spatio-temporal attention convolutional network (STACN), which are used, respectively, to extract the appearance information of RGB images, capture the motion information of flow fields, and learn the appearance information of areas with significant attention to motion [97]. The latter three methods are used relatively infrequently in medicine.

Fig. 2 Node graphs of architectures commonly used in medical imaging. a convolutional neural network, b recurrent neural network, c auto-encoder, d multi-stream convolutional neural network. (Adapted from Litjens’ survey)

CNN is mainly composed of alternating convolutional and pooling layers, and each layer contains trainable filter banks [98]. CNN can continuously learn abstract features and integrate them in the fully connected layers to compute weights and generate output values, thus completing the task [99]. Many of the studies described above designed and optimized their systems by modifying the number of kernels, channels, or filter sizes.

The basic mathematical expressions underlying CNN are the one- and two-dimensional discrete convolutions [98]:

$$ y\left( n \right) = x\left( n \right)*\omega \left( n \right) = \mathop \sum \limits_{m = - \infty }^{\infty } x\left( m \right)\omega \left( {n - m} \right), $$
(1)
$$ Y\left( {i,j} \right) = X\left( {i,j} \right)*\omega \left( {i,j} \right) = \mathop \sum \limits_{m} \mathop \sum \limits_{n} X\left( {m,n} \right)\omega \left( {i - m,j - n} \right). $$
(2)
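Equations 1 and 2 can be checked numerically with a short NumPy sketch. The loops below follow the sums literally for finite-length signals (values outside the arrays are treated as zero); the function names `conv1d` and `conv2d` and the example arrays are assumptions introduced here for illustration.

```python
import numpy as np


def conv1d(x, w):
    """Full discrete 1D convolution, y(n) = sum_m x(m) w(n - m), for finite sequences."""
    y = np.zeros(len(x) + len(w) - 1)
    for n in range(len(y)):
        for m in range(len(x)):
            if 0 <= n - m < len(w):
                y[n] += x[m] * w[n - m]
    return y


def conv2d(X, W):
    """Full discrete 2D convolution, Y(i, j) = sum_m sum_n X(m, n) W(i - m, j - n)."""
    rows = X.shape[0] + W.shape[0] - 1
    cols = X.shape[1] + W.shape[1] - 1
    Y = np.zeros((rows, cols))
    for m in range(X.shape[0]):
        for n in range(X.shape[1]):
            # Each input sample adds a shifted, scaled copy of the kernel.
            Y[m:m + W.shape[0], n:n + W.shape[1]] += X[m, n] * W
    return Y


x = np.array([1.0, 2.0, 3.0])
w = np.array([0.0, 1.0, 0.5])
y = conv1d(x, w)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
W = np.array([[1.0, 0.0], [0.0, -1.0]])
Y = conv2d(X, W)
```

Note that many DL frameworks implement the convolutional layer as a cross-correlation (the kernel is not flipped); because the kernel weights are learned, this difference does not affect what the network can represent.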

RNNs are designed for discrete sequence analysis. Each point in the sequence generates an internal signal that is fed through the neural network to the next layer. Hidden layers preserve the information in the observed sequence and update it in real time [100]. Medical reports are typically processed by RNNs. To integrate information from medical reports, it is often necessary to use a hybrid network.

The basic mathematical expressions underlying RNN are [100]:

$$ a^{l} \left( t \right) = f^{l} \left( {n^{l} \left( t \right)} \right); $$
(3)
$$ \begin{aligned} n^{1} \left( t \right) = & {\text{IW}}^{1,1} \left[ {p\left( t \right);p\left( {t - 1} \right); \ldots p\left( {t - {\text{TDL}}_{in} } \right)} \right] \\ & + {\text{LW}}^{1,1} \left[ {a^{1} \left( {t - 1} \right); \ldots a^{1} \left( {t - {\text{TDL}}_{{\text{int}}} } \right)} \right] \\ & + {\text{LW}}^{1,2} \left[ {a^{2} \left( {t - 1} \right); \ldots a^{2} \left( {t - {\text{TDL}}_{{\text{int}}} } \right)} \right] \\ & + {\text{LW}}^{1,3} \left[ {a^{3} \left( {t - 1} \right); \ldots a^{3} \left( {t - {\text{TDL}}_{out} } \right)} \right] + \underline {b}^{1} ; \\ \end{aligned} $$
(4)
$$ \begin{aligned} n^{2} \left( t \right) = & {\text{LW}}^{2,1} a^{1} \left( t \right) + {\text{LW}}^{2,2} \left[ {a^{2} \left( {t - 1} \right); \ldots a^{2} \left( {t - {\text{TDL}}_{{\text{int}}} } \right)} \right] \\ & + {\text{LW}}^{2,3} \left[ {a^{3} \left( {t - 1} \right); \ldots a^{3} \left( {t - {\text{TDL}}_{{\text{int}}} } \right)} \right] + \underline {b}^{2} ; \\ \end{aligned} $$
(5)
$$ n^{3} \left( t \right) = {\text{LW}}^{3,2} a^{2} \left( t \right) + {\text{LW}}^{3,3} \left[ {a^{3} \left( {t - 1} \right); \ldots a^{3} \left( {t - {\text{TDL}}_{{\text{int}}} } \right)} \right] + \underline {b}^{3} . $$
(6)
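Eqs. (3) to (6) describe a multi-layer network with tapped delay lines (TDL). The sketch below is a deliberately simplified illustration, not the full form: it assumes a single layer, a delay of one time step, and a tanh transfer function. The names `rnn_forward` and `matvec` are hypothetical; `IW`, `LW`, and `b` mirror the input weights, layer (recurrent) weights, and bias in the equations.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def rnn_forward(inputs, IW, LW, b):
    """Run a one-layer recurrent network over a sequence of input vectors.

    n(t) = IW p(t) + LW a(t - 1) + b   (one-step delay, assumed)
    a(t) = tanh(n(t))                  (transfer function f, assumed)
    """
    a = [0.0] * len(b)  # initial hidden state a(0), assumed to be zero
    states = []
    for p in inputs:
        n = [iw + lw + b_i for iw, lw, b_i in zip(matvec(IW, p), matvec(LW, a), b)]
        a = [math.tanh(v) for v in n]
        states.append(a)
    return states
```

Because `a` is carried from one step to the next, an input seen at time t still influences the state at time t + 1 even if the later input is zero, which is the "memory" property the text attributes to hidden layers.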

Although most studies are based on supervised learning with per-pixel annotation, WSL with image-level labels and even unsupervised learning have high application value. WSL uses labeled data to train the entire network and unlabeled data to train encoders and decoders [101]. Original data for unsupervised learning come in the form of images without any expert-annotated labels. A common technique in unsupervised learning is converting input data into low-dimensional subspaces and then grouping them. The most common method of unsupervised learning is GAN. GAN has been widely used in medical imaging for tasks such as denoising, modal transfer, anomaly detection, and image synthesis [102]. In addition, unsupervised learning also includes auto-encoders (AEs), stacked auto-encoders (SAEs), restricted Boltzmann machines (RBMs), deep belief networks (DBNs), and variational auto-encoders (VAEs) [103]. AEs can perform nonlinear dimensionality reduction by learning a compressed representation of the raw information in a low-dimensional space [104]. These techniques have rarely been used in medicine, but because unsupervised learning allows networks to be trained on large amounts of unlabeled data and makes the best use of the available information, it may have broad applications in the future.

The core mathematical expressions underlying GAN are [12]:

$$ \mathop {\min }\limits_{G} \mathop {\max }\limits_{D} \varphi \left( {D,G} \right) = E_{x} \left[ {\log D\left( x \right)} \right] + E_{z} \left[ {\log \left( {1 - D\left( {G\left( z \right)} \right)} \right)} \right], $$
(7)
$$ F_{{{\rm B}_{i} }} \left( {\varphi_{j} ,\phi } \right) = \arg_{k} \min G\left( {\varphi_{j} ,\phi_{k} } \right), $$
(8)
$$ S\left( {{\rm B}_{i} } \right) = \frac{1}{{|{\rm B}_{i} |}}\mathop \sum \limits_{j = 1}^{{\left| {{\rm B}_{i} } \right|}} F_{{{\rm B}_{i} }} \left( {\varphi_{j} ,\phi } \right). $$
(9)
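To make the objective in Eq. (7) concrete, the following sketch estimates the value function from a batch of discriminator outputs, replacing the two expectations with sample means. The function name `gan_value` and the example probabilities are hypothetical; no generator or discriminator network is trained here.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E_x[log D(x)] + E_z[log(1 - D(G(z)))] from Eq. (7).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term
```

The discriminator tries to increase this value (outputs near 1 on real data and near 0 on generated data), while the generator tries to decrease it. When the discriminator cannot tell the two apart and outputs 0.5 everywhere, the value equals -log 4.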

The core mathematical expression underlying AEs is [103]:

$$ h = \sigma \left( {w_{x,h} x + b_{x,h} } \right). $$
(10)
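Eq. (10) is the encoder half of an auto-encoder. As a minimal sketch, assuming a logistic sigmoid for the activation and plain Python lists for the weights, it can be evaluated as follows. The names `encode` and `sigmoid` are hypothetical, and the decoder and training loop are omitted.

```python
import math


def sigmoid(z):
    """Logistic activation, assumed here for sigma in Eq. (10)."""
    return 1.0 / (1.0 + math.exp(-z))


def encode(x, W, b):
    """Compute the hidden code h = sigma(W x + b).

    W has one row per hidden unit; using fewer rows than len(x)
    gives the low-dimensional representation described in the text.
    """
    return [sigmoid(sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_j)
            for row, b_j in zip(W, b)]
```

With a 2 x 3 weight matrix, for instance, a three-dimensional input is compressed into a two-dimensional code whose entries all lie between 0 and 1.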

The core mathematical expressions underlying RBMs are [103]:

$$ E\left( {x,h} \right) = - h^{T} Wx - c^{T} x - b^{T} h, $$
(11)
$$ p\left( {x,h} \right) = \frac{1}{Z}\exp \left\{ { - E\left( {x,h} \right)} \right\}. $$
(12)
$$ P\left( {h_{j} |x} \right) = \frac{1}{{1 + \exp \left\{ { - b_{j} - W_{j} x} \right\}}}. $$
(13)
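The closed form in Eq. (13) can be checked numerically against the joint distribution in Eq. (12). The sketch below assumes binary hidden units and the standard sign convention in which the interaction term enters the energy with a negative sign, which is the convention consistent with Eq. (13). All function names are hypothetical, and the brute-force sum is only feasible for a handful of hidden units.

```python
import math
from itertools import product


def energy(x, h, W, b, c):
    """E(x, h) = -h^T W x - c^T x - b^T h (negative interaction term assumed)."""
    interaction = sum(h[j] * W[j][i] * x[i]
                      for j in range(len(h)) for i in range(len(x)))
    visible = sum(c_i * x_i for c_i, x_i in zip(c, x))
    hidden = sum(b_j * h_j for b_j, h_j in zip(b, h))
    return -interaction - visible - hidden


def p_hidden_on(j, x, W, b):
    """Closed form from Eq. (13): P(h_j = 1 | x) = 1 / (1 + exp(-b_j - W_j x))."""
    return 1.0 / (1.0 + math.exp(-b[j] - sum(w * x_i for w, x_i in zip(W[j], x))))


def p_hidden_on_bruteforce(j, x, W, b, c):
    """Same probability, obtained by summing exp(-E) over all binary hidden states."""
    numerator = 0.0
    denominator = 0.0
    for h in product([0, 1], repeat=len(b)):
        weight = math.exp(-energy(x, h, W, b, c))
        denominator += weight
        if h[j] == 1:
            numerator += weight
    return numerator / denominator
```

The partition function Z cancels in the conditional probability, which is why neither function needs to compute it.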

Transfer learning (TL) can fine-tune or retrain the original DL model by using new annotations. Tajbakhsh demonstrated that a pre-trained CNN with fine-tuning is superior to a CNN trained from scratch [105]. Fine-tuning can significantly reduce costs compared with retraining. When ideal training sets are small, TL can bring greater performance improvement. So far, most approaches have started pre-training with natural image data. It may be possible to design cross-domain data sets in the future, for example, using TL between CT, MRI, ultrasound, and PET.
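One simple form of TL keeps the pre-trained layers frozen and trains only a new output layer on the small target data set. The toy sketch below illustrates that idea under strong assumptions: `pretrained_features` is a hypothetical stand-in for a frozen pre-trained network, the new head is a logistic regression unit, and the learning rate and epoch count are arbitrary. It is not the procedure used in [105].

```python
import math


def pretrained_features(x):
    """Hypothetical frozen feature extractor; its parameters are never updated."""
    return [x[0] + x[1], x[0] - x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(data, labels, epochs=200, lr=0.1):
    """Train only the new classification head with stochastic gradient descent."""
    w = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            f = pretrained_features(x)  # frozen part: forward pass only
            p = sigmoid(sum(w_i * f_i for w_i, f_i in zip(w, f)) + bias)
            grad = p - y  # gradient of the log loss with respect to the logit
            w = [w_i - lr * grad * f_i for w_i, f_i in zip(w, f)]
            bias -= lr * grad
    return w, bias


def predict(x, w, bias):
    """Return the predicted class (0 or 1) for one sample."""
    f = pretrained_features(x)
    return 1 if sigmoid(sum(w_i * f_i for w_i, f_i in zip(w, f)) + bias) >= 0.5 else 0
```

Only three numbers are learned here (two weights and a bias), which is why this style of training can work with very few labeled samples. Full fine-tuning would additionally update the pre-trained layers, usually with a smaller learning rate.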

Active learning (AL) learns in an interactive environment by selecting learning strategies through trial and error. The system tries to achieve its goals based on feedback from its own behavior and experience. At present, no application of AL in the digestive system has been found, which may be due to the high inherent coupling between AL selection strategies and the model being trained. This coupling may result in later data sets that are not conducive to model training [106].
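The coupling between the selection strategy and the model can be seen in a small sketch. The example below assumes one common selection strategy, uncertainty sampling for a binary classifier, which is not named in the source; the function name `select_most_uncertain` is hypothetical. The samples chosen for annotation depend entirely on the current model's predicted probabilities.

```python
def select_most_uncertain(probabilities, k):
    """Return the indices of the k unlabeled samples the model is least sure about.

    probabilities: predicted probability of the positive class for each
    unlabeled sample, produced by the model currently being trained.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]
```

For instance, `select_most_uncertain([0.95, 0.52, 0.10, 0.45], 2)` picks the second and fourth samples. A different model would produce different probabilities and therefore a different labeled set, so data collected for one model may suit a later model poorly.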

6 Overview

6.1 Applications of DL in medicine

AI is becoming increasingly valuable in the early diagnosis of digestive tract diseases. DL systems can significantly reduce the workload of clinicians while maintaining high diagnostic accuracy and system robustness. As public datasets expand, more and more high-quality algorithms will be developed. However, large-sample prospective studies are needed to verify the effectiveness of these algorithms. Although DL has been extensively studied in the image processing of endoscopy, imaging, and pathology of gastrointestinal tract diseases, each auxiliary diagnostic method has its limitations. At present, there is still no CAD system that can comprehensively recognize the image data of different auxiliary examinations. This review summarizes the progress of DL in different auxiliary examinations, in the hope that the data from different auxiliary examinations can one day be integrated to improve diagnostic accuracy.

AI can help us better identify endoscopic images or pathological data from a single angle, but perhaps its most outstanding value is that it can help us break through traditional thinking patterns, move beyond fixed diagnostic ideas, and open a broader space for exploration. In the near future, AI may help us implement diagnosis and treatment methods more flexibly, integrate disciplines more thoroughly, and evaluate conditions more comprehensively. AI can open up many possibilities for our future.

Aslam used CNN analysis to study the characteristics of exhaled gas compounds in patients with gastric cancer, and the diagnostic accuracy for early and advanced gastric cancer reached 97.3% and 98.7%, respectively [107]. With the development of computer technology and the iteration of CNN, computers may help identify more non-invasive examination items in the future.

Xiao led a prospective multi-center study that used a slit lamp to apply DL to the fundus and iris of patients with several common liver diseases and achieved excellent results in identifying liver cancer and chronic cirrhosis. In the future, ophthalmic imaging may be used as a tool for the early screening of liver and biliary diseases [108]. This project is innovative because, by linking two seemingly distant organs, this kind of interdisciplinary computer-aided research can reveal biological phenomena that have not been discovered before.

COVID-19 has become a serious public health problem, and companies are racing to develop drugs. Recently, Li's research team used DL models to predict drug-induced liver injury, thereby reducing the cost of clinical drug development and testing [109]. In the future, when facing unfamiliar emerging diseases, we may also use DL technology to input disease information and let the computer assess the patient's condition, treatment, and prognosis.

AI has shown its superiority in the early diagnosis of gastrointestinal diseases. However, if clinicians rely too much on AI, images acquired under a specific condition may be repeatedly missed due to limitations of the algorithm or the data set. AI will help clinicians discover the potential links between diseases and comprehensively assess the patient's condition and prognosis, but it also requires clinicians to continuously accept, learn, and improve this new technology.

Besides, Wong comprehensively evaluated the severity of non-alcoholic fatty liver disease based on clinical information, including electronic health records, liver biopsies, and liver images [110]. In the future, AI health assessment of patients may not be limited to cross-sectional studies; it may also collect patients’ dynamic data in more detail and analyze them to provide well-supported professional recommendations.

6.2 Characteristics and Challenges of DL in medicine

Medical image analysis has three main tasks: disease diagnosis, lesion detection, and lesion and organ segmentation. It also includes other related tasks, such as image reconstruction, image retrieval, and report generation. The digestive field emphasizes the ability to recognize abnormalities, such as polyps and early cancer. Since medicine is a human-facing science, DL has its own characteristics and challenges in medical image processing compared with other computer vision (CV) scenarios.

Characteristics:

1. Physiological structures are often irregular and disordered, making it difficult to conceptualize them as matrices. There are multiple stages, such as precancerous lesions, between tumor and normal tissue. Medical judgment is subjective and may vary from doctor to doctor, so extensive expert annotation is required to reach a consensus.

2. Image recognition is easily disturbed by viewpoint, noise, background motion, and illumination changes [111]. In medicine, diagnosis often needs background information for reference to achieve higher accuracy. The use of implicit information in biological systems has attracted great attention.

3. The compatibility of the DL system between hospitals needs special attention. Different light sources, resolutions, doctors' skills, and examination habits may affect judgment accuracy.

4. Medicine values the prediction of causal effects in order to evaluate curative effects in time. Genomics will be more widely used in the future, and DL will become a daily tool for analysis [112].

5. Medical images require high resolution, which makes image analysis costly and time-consuming. In the future, DL models may be trained using cloud computing, with instances deployed at different sites and trained on local data while sharing standard parameters, enabling the use of multiple GPUs at a reasonable cost and helping to protect medical data privacy.
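The last item above, in which sites train on local data and share only parameters, is commonly realized by averaging the locally trained parameters. The sketch below assumes a sample-size-weighted average of flat parameter lists, in the style of federated averaging, a technique not named in the source. The function name `federated_average` and the example numbers are hypothetical.

```python
def federated_average(site_weights, site_sizes):
    """Combine per-site model parameters into one shared parameter vector.

    site_weights: one flat list of parameters per hospital, all the same length
    site_sizes:   number of local training samples at each hospital
    Only parameters and sample counts leave a site; the images stay local.
    """
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [sum(w[k] * n for w, n in zip(site_weights, site_sizes)) / total
            for k in range(n_params)]
```

For example, `federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])` gives more influence to the second site because it contributed three times as many samples. In practice this step would be repeated over many rounds, with each site continuing training from the shared parameters.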

Challenges:

1. Most DL applications are considered a "black box". Users find it hard to explain, understand, or correct how the model makes predictions. The system needs to explain its predictions further to gain the trust of doctors and patients.

2. Where is the application boundary of AI? The abuse of DL may infringe personal privacy, disturb natural law, and violate ethics. For example, what should a doctor do when the AI decides that abandoning treatment is the best option?

3. In health care, minor errors can lead to catastrophic results. How to further improve accuracy is always the challenge and pursuit of engineers.

4. In the classification training of rare diseases, overfitting will occur if the number of samples in one class is much larger than in another. Computer vision techniques can alleviate the overfitting problem. However, model complexity reduction and data augmentation techniques focus only on the target task on a given data set without introducing new information into the DL model. Today, introducing more information beyond a given medical data set has become a promising approach to solving the problem of small medical data sets. In addition to broader collaboration, enhanced data extraction using unsupervised learning and integration using different DL techniques are likely to mitigate this problem.
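As one illustration of handling the class imbalance described in the last item, the sketch below computes inverse-frequency class weights, so that errors on a rare class count more in the training loss. This weighting scheme is an assumption chosen for illustration and is not a method proposed in the source; the function name `class_weights` and the labels are hypothetical.

```python
from collections import Counter


def class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Classes with few samples receive large weights; a perfectly
    balanced data set gives every class a weight of 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
```

With nine common cases and one rare case, the rare class receives a weight ten times that of the common class. As the text notes, reweighting only redistributes the information already in the data set and does not add new information.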

6.3 Restriction on AI’s clinical application

Currently, three key factors restrict the clinical application of AI: first, the compatibility of each system; second, daily maintenance and fault handling of the DL system; third, legal liability. When a test result is wrong, errors involving computer knowledge are often difficult to explain by doctors’ experience alone, so who should be held responsible during the clinical process? Therefore, more multi-center prospective studies should be conducted in the future, and relevant laws and regulations should be improved to translate scientific and technological achievements into practical applications.

AI in digestive system imaging also faces limitations that are specific to the field:

1. Digestive endoscopy plays a significant role in clinical diagnosis and treatment. Inadequate intestinal preparation affects image recognition and may cause debris to be misdiagnosed as a tumor.

2. While AI can serve as a second set of eyes for endoscopic physicians, there are still misdiagnosis rates to overcome.

3. The determination of tumor invasion depth depends on EUS, but the accuracy is limited. It is difficult to accurately distinguish the origin of tumors, which complicates the selection of surgical methods.

4. Blind spots are a major factor leading to missed diagnoses. It is necessary to further develop DL systems that can automatically prompt blind spots.

5. The digestive system covers a large number of organs and has high requirements for lesion localization in imaging. Currently, some systems can accurately achieve organ segmentation, but localization inside organs is not accurate enough.

6. In the endoscopy process, rapid determination of polyp or tumor properties has high application value, but there is still a long way to go.

7 Conclusion

DL will be widely used in the medical field in the near future, especially for image recognition. CAD can significantly narrow the technical gap between physicians, reduce work pressure, and improve patients’ experience. However, there are many technical, ethical, and legal hurdles to overcome before AI is finally used in clinical practice.