Machine Learning for Medical Image Classification
Husain, Mayer, Bekbolatova, Vathappallil, Matalia, Toma, Rančić, and Quintana
Academia Medicine, vol. 1, no. 4, 2024. doi:10.20935/AcadMed7444


Abstract

This review article focuses on the application of machine learning (ML) algorithms in medical image classification. It highlights the intricate process involved in selecting the most suitable ML algorithm for predicting specific medical conditions, emphasizing the critical role of real-world data in testing and validation. It surveys various ML methods utilized in healthcare, including Supervised Learning, Unsupervised Learning, Self-Supervised Learning, Deep Neural Networks, Reinforcement Learning, and Ensemble Methods. Given the vast array of options available, the challenge lies not merely in selecting an ML algorithm but in identifying the one most appropriate for a specific task. Each unique dataset requires a comparative analysis to determine the best-performing algorithm, yet testing all available algorithms is impractical. This article examines the performance of various ML algorithms in recent studies, focusing on their applications across different imaging modalities for diagnosing specific medical conditions. It provides a summary of these studies, offering a starting point for those seeking to select the most suitable ML algorithm for specific medical conditions and imaging modalities.

Academia Medicine Special Issue: Artificial Intelligence in Modern Clinical Practice

1. Introduction

The selection of a machine learning (ML) algorithm for predicting a specific medical condition is a complex process that requires careful consideration of various factors [1]. It is typically guided by the algorithm’s relative performance when compared to other ML algorithms [2]. This performance is evaluated through a series of tests and validations, involving the use of real-world data [3]. Different ML algorithms are better suited to different types of data [4]. For instance, convolutional neural networks (CNNs) are particularly effective when dealing with image data, making them a popular choice for medical imaging tasks. This is because CNNs are designed to automatically and adaptively learn spatial hierarchies of features from the input images, which can be highly beneficial in medical imaging where the data are often complex and high dimensional [5]. For example, in the 2019 study by Pasa et al., an optimized deep learning architecture was developed for diagnosing tuberculosis from chest X-rays [5]. The proposed model, a simple CNN, outperformed previous models in terms of speed and efficiency while maintaining similar accuracy. The model was trained on two public databases, namely, the NIH Tuberculosis Chest X-ray dataset and the Belarus Tuberculosis Portal dataset, and it achieved an accuracy of 86.2% and an AUC (Area Under the Curve) score of 0.925 on the combined dataset. Notably, the model’s efficiency was highlighted by its significantly reduced computational complexity, with only about 230,000 parameters compared to other deep learning models that have up to 60 million parameters.
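
To make the preceding discussion concrete, the sketch below shows what a deliberately small CNN for binary chest X-ray classification might look like. This is not the architecture of Pasa et al.; it is a minimal illustrative example (assuming single-channel images resized to 128×128 and a PyTorch environment) whose convolution and pooling stages learn spatial feature hierarchies, while global average pooling keeps the parameter count in the tens of thousands rather than the millions.

    import torch
    import torch.nn as nn

    class SmallChestXrayCNN(nn.Module):
        """Minimal CNN sketch for binary chest X-ray classification (illustrative only)."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),        # global average pooling keeps the model small
            )
            self.classifier = nn.Linear(64, 1)  # single logit: disease present vs. absent

        def forward(self, x):                   # x: (batch, 1, 128, 128) grayscale images
            return self.classifier(self.features(x).flatten(1))

    model = SmallChestXrayCNN()
    print(sum(p.numel() for p in model.parameters()))  # roughly 2.3e4 parameters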

On the other hand, algorithms like support vector machines (SVMs) or random forests (RFs) may be more effective when dealing with structured, tabular data. These algorithms can handle high-dimensional data and are less prone to overfitting, making them suitable for tasks such as predicting disease outcomes based on patient records. The specific disease in question also plays a significant role in the choice of an ML algorithm. Some diseases may have specific patterns or characteristics that are more easily detected by certain algorithms. For example, a study might find that a deep learning algorithm is more effective at detecting signs of skin cancer in dermatological images, while another study might find that a different algorithm is better at predicting the progression of Alzheimer’s disease (AD) based on brain scans.

A recent comprehensive review of techniques for the detection and classification of skin cancer indicated that the most frequently utilized and reliable ML algorithms for the identification of pigmented skin lesions are SVMs, k-nearest neighbors (KNNs), RF, Naïve Bayes (NB), decision tree (DT), and logistic regression (LR) [6]. This finding highlights the necessity of considering a range of algorithms, even for the same medical condition. The selection of the most fitting algorithm is contingent on the specific task, the nature of the data, and the specific disease under consideration.

Performance metrics are another crucial factor in this decision-making process [7]. These metrics provide a quantitative measure of an algorithm’s effectiveness. Common metrics include accuracy, precision, recall, and the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). The choice of metric can depend on the specific task and the relative importance of different types of errors. For instance, in a task where false negatives are particularly costly (such as cancer detection), recall might be a more important metric than overall accuracy.
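
As a brief illustration of how these metrics are computed in practice, the following sketch uses scikit-learn on a small set of hypothetical labels and predicted probabilities; the numbers are invented purely for demonstration.

    from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

    # Hypothetical ground truth (1 = disease present) and model outputs.
    y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
    y_score = [0.1, 0.3, 0.2, 0.8, 0.7, 0.9, 0.4, 0.2, 0.6, 0.1]  # predicted probabilities
    y_pred  = [1 if s >= 0.5 else 0 for s in y_score]             # hard labels at a 0.5 threshold

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))    # sensitivity; key when false negatives are costly
    print("AUC-ROC  :", roc_auc_score(y_true, y_score))  # threshold-independent, uses the raw scores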

This review highlights a variety of ML algorithms used for processing medical images to diagnose and predict diseases. These algorithms include DT, SVM, KNN, LR, deep learning, CNN, light gradient boosting machine (LightGBM), linear discriminant analysis (LDA), Google teachable machine (GTM), NB, RF, extreme gradient boosting (XGBoost), and adaptive boosting (AdaBoost). Each algorithm has its strengths and is chosen based on the specific requirements of the diagnostic task. The integration of these algorithms in medical diagnostics holds great promise for improving accuracy and efficiency in disease detection.
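
Because the best choice is dataset-dependent, a comparative analysis of several candidate algorithms is usually the first step. The sketch below illustrates one way such a comparison might be run with scikit-learn cross-validation; the built-in breast-cancer dataset is used only as a stand-in for features extracted from medical images.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)  # stand-in for image-derived features

    candidates = {
        "DT":  DecisionTreeClassifier(random_state=0),
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
        "LR":  make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "RF":  RandomForestClassifier(random_state=0),
    }

    for name, model in candidates.items():
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{name}: mean cross-validated AUC = {auc:.3f}")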

2. Medical imaging modalities

Medical imaging modalities have revolutionized the healthcare industry, providing a non-invasive method for diagnosing a wide range of medical conditions. Technologies such as X-rays, ultrasound, computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), and digital mammography are essential tools in diagnosing a variety of medical conditions. Each modality has its strengths and is chosen based on the specific clinical scenario, allowing for accurate diagnosis and effective patient management. Some of the most commonly diagnosed conditions associated with different imaging techniques are as follows:

  • X-rays

    – Fractures and bone injuries: X-rays are primarily used to diagnose fractures, dislocations, and other bone-related injuries due to their effectiveness in visualizing hard tissues.

    – Chest conditions: They are also commonly employed to assess conditions such as pneumonia, lung infections, and heart enlargement.

  • Ultrasound

    – Soft tissue abnormalities: Ultrasound is frequently used to evaluate soft tissue conditions, including cysts, tumors, and organ abnormalities. It is particularly useful in obstetrics for monitoring fetal development.

    – Cardiovascular issues: Doppler ultrasound helps visualize blood flow and can diagnose conditions related to blood vessels and heart function.

  • Computed tomography

    – Trauma and internal injuries: CT scans are often used in emergency settings to quickly assess internal injuries, especially in trauma cases.

    – Cancer detection: They are also instrumental in identifying tumors and assessing the extent of cancer.

  • Magnetic resonance imaging

    – Neurological disorders: MRI is the preferred modality for diagnosing conditions such as multiple sclerosis, brain tumors, and spinal cord injuries due to its high-resolution images of soft tissues.

    – Joint and ligament injuries: It is also widely used to evaluate injuries in joints, particularly in the knee and shoulder.

  • Positron emission tomography

    – Cancer diagnosis and monitoring: PET scans are particularly effective in detecting cancer and monitoring treatment response, as they can highlight metabolic activity in tissues.

    – Neurological conditions: They are also used in diagnosing conditions like AD by assessing brain metabolism.

  • Digital mammography

    – Breast cancer screening: This imaging technique is specifically designed for the early detection and diagnosis of breast cancer, making it a vital tool in women’s health.

As illustrated in Figure 1, the diverse landscape of medical imaging modalities, ranging from X-rays to MRI, is closely intertwined with a wide array of medical conditions, providing a comprehensive overview that will guide our subsequent exploration of this complex field and its intersection with ML.

The advent of medical imaging modalities has undeniably transformed the landscape of the healthcare industry, providing unprecedented insights into the human body and enabling accurate diagnoses of a myriad of medical conditions. However, the emergence of ML is set to propel this revolution to new heights, enhancing the capabilities of these imaging modalities and redefining the future of medical diagnostics.

Each of the medical conditions previously mentioned, from fractures and lung infections diagnosed via X-rays, to neurological disorders and joint injuries identified through MRI, can now be detected with greater speed and precision. This is made possible by the integration of ML algorithms with these imaging techniques, which can analyze vast amounts of data and identify patterns that may be overlooked by the human eye.

Figure 1. A representation of various medical imaging modalities and their associated medical conditions.

3. Machine learning methods in healthcare

ML methods encompass a variety of approaches, including Supervised Learning, Unsupervised Learning, Self-Supervised Learning, Deep Neural Networks (DNNs), Reinforcement Learning, and Ensemble Methods. Supervised learning methods utilize algorithms that learn from labeled training data, applying this knowledge to new data. These methods are widely used in healthcare for predicting outcomes based on historical data. Unsupervised learning methods, in contrast, discover patterns in data without the need for labels, and are often employed in healthcare for exploratory data analysis and understanding patient subgroups. Self-supervised learning is a type of ML where the data themselves provide the supervision, often through a pretext task. It is gaining traction in healthcare for tasks like anomaly detection where labeled data are scarce. DNNs, a specific subset of ML, excel at recognizing patterns. Their use in healthcare is on the rise, particularly for tasks such as image analysis and predictive modeling. Reinforcement learning involves an agent learning to make decisions by taking actions in an environment to achieve a goal. This approach has potential applications in healthcare for tasks such as treatment optimization. Finally, ensemble methods combine the predictions of multiple ML models to improve accuracy. These methods are frequently used in healthcare to enhance the robustness and stability of predictive models.

Table 1 presents a summary of ML algorithms that are categorized under the ML methods mentioned earlier. Each algorithm is accompanied by a brief explanation and its specific applications within the healthcare industry.

Table 1

Summary of (some) machine learning algorithms and their relations to healthcare

ML method ML algorithm Brief description Relation to healthcare
Supervised learning GLMs GLMs extend ordinary regression models to allow for response variables that have error distribution models other than a normal distribution [8]. GLMs can be used to predict patient outcomes based on various factors such as age, gender, and medical history [9].
DTs A DT is a flowchart-like structure in which each internal node represents a feature, each branch represents a decision rule, and each leaf node represents an outcome [10]. DTs can be used in healthcare to support decision-making, such as predicting the likelihood of a disease based on symptoms [11].
KNN KNN is a type of instance-based learning where the function is only approximated locally and all computation is deferred until function evaluation [12]. KNN can be used for diagnostic systems, such as predicting whether a tumor is malignant or benign based on the characteristics of neighboring tumors [13].
SVMs SVMs are models that represent the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible [14]. SVMs can be used in healthcare for classification tasks, such as distinguishing between healthy and diseased patients based on medical images [15].
LDA and NB These are statistical methods used in ML and statistics to classify objects based on their characteristics [16]. LDA and NB can be used for diagnostic purposes, such as predicting the likelihood of a disease based on patient symptoms [17].
Unsupervised learning K-means clustering K-means is a method of vector quantization that aims to partition n observations into k clusters [18]. K-means can be used to identify patient subgroups based on similarities in their medical records [19].
PCA PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components [20]. PCA can be used to reduce the dimensionality of medical data, making it easier to visualize and analyze [21].
t-SNE t-SNE is an ML algorithm for visualization [22]. t-SNE can be used to visualize high-dimensional medical data in a way that preserves the relationships between data points [23].
Autoencoders An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner [24]. Autoencoders can be used for anomaly detection, such as identifying unusual patterns in patient data that could indicate a disease [25].
Self-supervised learning CPC CPC aims to learn representations of data by predicting future samples in latent space using autoregressive models [26]. CPC can be utilized for tasks such as medical imaging or ECG analysis, where it can learn meaningful representations from large amounts of unlabeled data [27].
MoCo MoCo leverages the concept of contrastive learning, where encoders are trained to perform dictionary look-up [28]. MoCo’s contrastive learning approach can be particularly beneficial in medical imaging, where it can aid in extracting robust features from unlabeled data [29].
SimCLR SimCLR is an approach where labels are inherent to the data without any human intervention. It maximizes agreement between differently augmented views of the same data example via a contrastive loss in the latent space [30]. SimCLR can be particularly useful in medical imaging, where it can learn rich representations from unlabeled data [31].
DINOv2 DINOv2 is a vision transformer model that works without labels or annotations, allowing it to learn full visual representations used in various tasks [32]. DINOv2’s ability to learn from unlabeled data can be particularly useful in medical imaging, where it can enhance the accuracy of diagnostic models and improve patient care by learning comprehensive visual representations [33].
BYOL In BYOL, two neural networks, known as the online and target networks, interact and learn from each other. The online network is trained to predict the target network representation of an augmented view of an image [34]. The BYOL methodology can be applied to enhance medical imaging analysis, enabling the extraction of complex patterns from unlabeled data [35].
DNNs MLPs MLPs are feedforward artificial neural networks that consist of at least three layers of nodes [36]. MLPs can be used for tasks such as predicting patient outcomes based on a wide range of inputs [37].
CNNs CNNs are a class of DNNs, most commonly applied to analyzing visual imagery [38]. CNNs can be used for tasks such as analyzing medical images to detect diseases [39].
RNNs RNNs are a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence [40]. RNNs can be used for tasks such as predicting the progression of diseases over time based on patient data [41].
GCNs GCNs are a variation of CNNs, designed to work with data structured as graphs [42]. GCNs can be used for tasks such as predicting drug interactions based on the structure of molecular graphs [43].
Reinforcement learning Q-Learning Q-Learning is a model-free reinforcement learning algorithm that enables a model to iteratively learn and improve over time by taking the correct action [44]. Q-Learning could be used to develop treatment strategies that maximize patient outcomes over time [45].
DQN DQN is a reinforcement learning technique that combines Q-Learning with DNNs [46]. DQN could be used to develop more complex treatment strategies that take into account a wide range of factors [47].
Policy gradients Policy gradients are a family of reinforcement learning methods in which the policy is parameterized by neural networks [48]. Policy gradients could be used to develop personalized treatment strategies that adapt to each patient’s unique characteristics and responses to treatment [49].
Ensemble methods Stacking Stacking is an ensemble learning technique that combines multiple classification or regression models via a meta-classifier or a meta-regressor [50]. Stacking can be used to combine the predictions of multiple models, potentially leading to more accurate and robust predictions [51].
Bagging Bagging is an ensemble method for improving unstable estimation or classification schemes [52]. Bagging can be used to improve the stability of predictive models, such as those used to predict patient outcomes [53].
Boosting Boosting is an ensemble meta-algorithm primarily for reducing bias, and also variance, in supervised learning [54]. Boosting can be used to improve the accuracy of predictive models, such as those used to predict disease risk [55].

ML, machine learning; GLMs, generalized linear models; DTs, decision trees; KNN, k-nearest neighbors; SVMs, support vector machines; LDA, linear discriminant analysis; NB, Naïve Bayes; PCA, principal component analysis; t-SNE, t-distributed stochastic neighbor embedding; CPC, contrastive predictive coding; MoCo, momentum contrast; SimCLR, a simple framework for contrastive learning of visual representations; BYOL, bootstrap your own latent; MLPs, multilayer perceptrons; CNNs, convolutional neural networks; RNNs, recurrent neural networks; GCNs, graph convolutional networks; DQN, deep Q network.

4. Inclusion criteria

Recent articles that report AUC and/or accuracy are included in this review. These metrics are used to evaluate the performance of ML models; however, they measure different aspects of a model’s performance and are used in different contexts. Accuracy is a straightforward metric suitable for binary classification problems. It calculates the proportion of correct predictions (both true positives and true negatives) over all predictions. However, accuracy can be misleading if the classes are imbalanced. For instance, if 95% of the data belong to Class A and only 5% belong to Class B, a model that always predicts Class A will have an accuracy of 95%. Despite this high accuracy, it is not a good model because it fails to correctly predict Class B. AUC, on the other hand, is used with the ROC (Receiver Operating Characteristic) curve, a performance measurement for classification problems at various threshold settings. AUC measures the two-dimensional area underneath the entire ROC curve from (0,0) to (1,1). It provides an aggregate measure of performance across all possible classification thresholds and is not sensitive to imbalanced classes. A model with high accuracy might not have a high AUC, and vice versa. For example, a model that classifies everything into the majority class will have high accuracy but low AUC, because its true-positive rate will be low. Conversely, a model that ranks positive examples higher than negative examples can have a high AUC but low accuracy if the classification threshold is not chosen well. Hence, while both AUC and accuracy are useful metrics, they provide different information. The choice of metric depends on the specific problem and the balance of classes in the dataset.
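
The imbalanced-class example above can be reproduced directly. In the hedged sketch below, a majority-class baseline reaches 95% accuracy on a 95/5 split while its AUC remains at 0.5, i.e., no discriminative power; the data are synthetic and the features are deliberately uninformative.

    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, roc_auc_score

    rng = np.random.default_rng(0)
    y = np.array([0] * 950 + [1] * 50)         # 95% Class A (0), 5% Class B (1)
    X = rng.normal(size=(1000, 3))             # uninformative features, for illustration only

    majority = DummyClassifier(strategy="most_frequent").fit(X, y)
    y_pred  = majority.predict(X)              # always predicts Class A
    y_score = majority.predict_proba(X)[:, 1]  # constant score for Class B

    print("accuracy:", accuracy_score(y, y_pred))   # 0.95, despite never detecting Class B
    print("AUC-ROC :", roc_auc_score(y, y_score))   # 0.5, i.e. no discriminative power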

5. Guided selection of machine learning models

Although every dataset is unique, this section outlines the procedure for determining an initial choice of the most appropriate ML algorithm for predicting specific medical conditions using a particular medical imaging modality. Table 2 presents a detailed summary of the studies published from 2023 to 2024. Most of these studies performed comparative analyses of multiple ML models to determine the most effective one. Only the highest-performing ML algorithms are included in the table. These studies focused on a specific medical condition and utilized a particular imaging modality. The table encapsulates this information, detailing the medical condition under investigation and the imaging modality used. The table incorporates crucial metrics such as accuracy and AUC, specifically for the validation cohort, while metrics for the testing cohort are not included. The validation set, used for tuning the model and selecting the most effective one, serves as a more dependable indicator of the model’s prospective performance on new data. To reiterate, these metrics provide a measure of the model’s ability to correctly classify cases (accuracy) and its discriminative power (AUC), which are essential for assessing the model’s performance on unseen data. Unfortunately, some studies have omitted either accuracy or AUC in their reports. This is a significant oversight, as both metrics provide unique insights into the model’s performance and should be reported together for a comprehensive evaluation. Future studies should report both metrics to provide a complete picture of the model’s performance and to allow for a more accurate comparison between different models.

The following are instructions on how a physician or researcher can use this summary to decide which ML model to use for diagnosis, depending on the medical condition and the imaging modality:

  • Step 1 (Identify the medical condition): Determine the medical condition that you need to diagnose. For example, if you’re diagnosing breast cancer, you would focus on the rows in Table 2 that pertain to breast cancer.

  • Step 2 (Identify the imaging modality): Determine the imaging modality that you have available. This could be, for example, CT, MRI, or ultrasound images. In Table 2, look at the rows that correspond to your medical condition and the available imaging modality.

  • Step 3 (Refer to comparative analysis): Look at the comparative analysis provided in Table 2. The table shows which ML models have been used for the identified medical condition and imaging modality, along with their performance metrics such as Accuracy and AUC.

  • Step 4 (Choose the ML model): Choose the ML model based on the comparative analysis. For example, if you’re diagnosing breast cancer and your imaging modality is CT, you would choose between LDA and LR as they have similar Accuracy and AUC. If your imaging modality is MRI, you would choose between LR and XGBoost. If your imaging modality is ultrasound, the corresponding study referenced in the last column has already concluded that CNN is the best-performing algorithm.

  • Step 5 (Apply the chosen ML model): Once you have chosen the ML model, you can then apply it to your diagnosis process.

This table is designed to offer a consolidated summary of the comparative analyses of diverse ML models across various medical conditions and imaging modalities. It obviates the need for conducting comparative analyses each time a diagnosis is required for a medical condition. By referencing the table, physicians or researchers can efficiently identify the most fitting ML algorithm for predicting specific medical conditions using a certain medical imaging modality.
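
Steps 1 through 4 amount to a lookup keyed by medical condition and imaging modality. The sketch below encodes a small excerpt of Table 2 (the breast-cancer rows) to show how such a lookup might be organized; it is illustrative only and is not a complete or validated mapping.

    # Hypothetical excerpt of Table 2, keyed by (condition, modality).
    # Each entry lists (algorithm, accuracy, AUC, reference); None means the study did not report that metric.
    TABLE2_EXCERPT = {
        ("breast cancer", "ct"):         [("LDA", 0.872, 0.880, "[75]"), ("LR", 0.846, 0.880, "[75]")],
        ("breast cancer", "mri"):        [("LR", None, 0.995, "[76]"), ("XGBoost", None, 0.995, "[76]")],
        ("breast cancer", "ultrasound"): [("CNN", 0.856, 0.915, "[77]")],
    }

    def suggest_models(condition: str, modality: str):
        """Steps 1-4 of the guided selection: look up prior comparative results for the task."""
        return TABLE2_EXCERPT.get((condition.lower(), modality.lower()), [])

    for algo, acc, auc, ref in suggest_models("Breast cancer", "CT"):
        print(f"{algo}: accuracy={acc}, AUC={auc}, see {ref}")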

Table 2

Summary of machine learning algorithms used for medical image classification tasks

Medical condition Imaging modality ML algorithm Accuracy1 AUC1 Ref. Year
Neurocognitive disorders fMRI DT 91.1% 0.99 [56] 2024
Knee pain MRI SVM N/A 0.68 [57] 2024
Corneal ulcer Slit lamp imaging XGBoost 91.0% 0.97 [58] 2024
Lung cancer CT LR N/A 0.878 [59] 2024
Lung cancer CT XGBoost-RF-AdaBoost 87.1% 0.895 [60] 2024
Lung cancer CT NB 96.6% 0.910 [61] 2024
Lung cancer CT CNN 95.0% 0.980 [62] 2024
Lung cancer Histopathology LightGBM 99.8% N/A [63] 2024
Pancreatic cancer Ultrasound LR N/A 0.850 [64] 2024
Pancreatic cancer EUS RF N/A 0.649 [65] 2024
Colorectal cancer Histopathology LightGBM 99.8% N/A [63] 2024
Colorectal cancer Endoscopic imaging RF3 87.0% N/A [66] 2023
Gastric cancer CT RF N/A 0.854 [67] 2024
Gastric cancer CT LightGBM 87.3% 0.923 [68] 2024
Kidney stones CT LR 80.6% N/A [69] 2024
Renal lesions CT LightGBM 92.3% 0.962 [70] 2024
Cardiomegaly X-ray CNN 81.2% 0.90 [62] 2024
Drusen OCT CNN 99.5% 0.99 [62] 2024
Diabetic retinopathy OCT LightGBM 95.5% N/A [71] 2024
Glaucoma OCT CNN 98.0% 0.943 [72] 2024
Glaucoma Fundus imaging KNN 92.6% N/A [73] 2024
Breast cancer X-ray CNN 96.3% 0.987 [74] 2024
Breast cancer CT LDA 87.2% 0.880 [75] 2024
Breast cancer CT LR 84.6% 0.880 [75] 2024
Breast cancer MRI LR N/A 0.995 [76] 2024
Breast cancer MRI XGBoost N/A 0.995 [76] 2024
Breast cancer Histopathology LightGBM 95.2% N/A [63] 2024
Breast cancer Ultrasound CNN 85.6% 0.915 [77] 2024
Metabolic syndrome Facial imaging AdaBoost N/A 0.935 [78] 2024
Spinal cancer MRI SVM 86.2% 0.870 [79] 2024
Knee osteoarthritis Radiography RF 82.6% N/A [80] 2024
Liver cancer CT XGBoost 99.2% 0.998 [81] 2024
Liver cancer CT SVM N/A 0.986 [81] 2024
Liver cancer MRI AdaBoost 79.2% 0.858 [82] 2024
Oral cancer MRI RF 87.5% 0.875 [83] 2024
Bone metastasis MRI XGBoost 88.4% 0.928 [82] 2024
Tuberculosis X-ray GTM 91.0% 0.975 [84] 2024
Brain cancer MRI SVM 91.5% N/A [85] 2024
Brain cancer MRI AdaBoost 87.5% 0.927 [86] 2024
Brain cancer MRI LR 98.4% 0.978 [87] 2024
Brain cancer MRI CNN 98.0% 0.999 [88] 2024
Thyroid cancer Ultrasound SVM 93.0% 0.980 [89] 2023
Thyroid cancer Scintigraphy DT 78.5% 0.809 [90] 2024
Thyroid cancer Scintigraphy RF 79.0% 0.773 [90] 2024
Bladder cancer CT SVM N/A 0.893 [91] 2024
Uterine cancer MRI NB N/A 0.869 [92] 2024
Skin cancer Dermoscopy CNN 80.5% 0.810 [93] 2024
Kidney transplant Scintigraphy CNN 67.0% 0.720 [94] 2024
Pulmonary perfusion CT GLMB 67.0% 0.730 [95] 2024
Coronary artery disease PET CNN N/A 0.900 [96] 2024
Alzheimer’s disease MRI KNN 45.86% 0.552 [97] 2024

ML, machine learning; AUC, area under the curve; fMRI, functional magnetic resonance imaging; DT, decision tree; MRI, magnetic resonance imaging; SVM, support vector machines; XGBoost, extreme gradient boosting; CT, computed tomography; LR, logistic regression; NB, Naïve Bayes; CNN, convolutional neural network; LightGBM, light gradient boosting machine; EUS, endoscopic ultrasonography; RF, random forest; RF3, RoboFlow 3.0 object detection; OCT, optical coherence tomography; KNN, k-nearest neighbors; LDA, linear discriminant analysis; GTM, Google teachable machine; GLMB, general linear model boosting; PET, positron emission tomography.

Please note that these values may be overly optimistic when applied to a general patient population due to potential limitations in the studies.

In reference to Table 2, it has been noted that a substantial number of these studies do not include external prospective validation [98]. Additionally, they appear to have significant leakage problems [99]. The weaknesses of the ML studies suggest that the reported accuracy values may be overly optimistic when applied to a general patient population. External prospective validation is a crucial step in the evaluation of ML models. It involves testing the model on a separate dataset that was not used during the training or internal validation stages. This dataset is typically collected in a different setting or time period, providing a more realistic assessment of how the model will perform in real-world conditions. The lack of external prospective validation in many of the studies listed in Table 2 raises concerns about the generalizability of their findings. Data leakage is a problem that occurs when information from outside the training dataset is used in the model [100]. This can lead to overly optimistic performance estimates, as the model may be using information that would not be available in a real-world setting. The presence of severe leakage issues in these studies further undermines the reliability of their reported accuracy values.
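
One frequent source of the leakage described above is fitting preprocessing on the full dataset before splitting it, so that statistics from the test set contaminate the training features. The sketch below contrasts that pattern with a leak-free pipeline; it uses scikit-learn and a generic feature matrix as stand-ins, not the data from any study in Table 2.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)  # stand-in for image-derived features

    # Leaky pattern: the scaler is fitted on ALL samples before the split, so the
    # "held-out" evaluation is informed by test-set statistics.
    X_scaled = StandardScaler().fit_transform(X)
    X_tr_leaky, X_te_leaky, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)

    # Leak-free pattern: split first, then fit every preprocessing step inside a
    # pipeline on the training fold only.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
    print("held-out accuracy:", model.score(X_te, y_te))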

While the selection of an appropriate ML algorithm for medical image classification is important, it is equally important to first clearly define the clinical question being addressed, as this will guide not only the choice of algorithm but also the data collection, feature selection, and evaluation metrics used in the study [101]. Moreover, even with all these studies consolidated, it is worth noting that every dataset is unique, with its own characteristics and complexities. This uniqueness can significantly influence the choice of the appropriate ML algorithm. In the context of medical imaging, the choice of ML algorithm can greatly impact the performance and accuracy of the imaging analysis. The characteristics of the imaging dataset, such as the size, complexity, and type of imaging data, should be carefully considered when selecting the ML algorithm. The nature of the data, such as whether it’s structured or unstructured, can also dictate the choice of the ML algorithm.

Figure 2 illustrates a flowchart depicting the process of selecting and implementing an ML algorithm for medical image classification. The flowchart begins with input conditions, including the specific medical condition and imaging modality. It then moves to the ML algorithm selection phase, which is based on the information provided in Table 2. The process continues with model training and evaluation, followed by a decision point where the performance metrics are assessed. If the metrics are satisfactory, the process proceeds to the final diagnosis. If not, it enters an algorithm optimization loop, where other ML algorithms are tested and evaluated before returning to the model training phase.

In Figure 2, “satisfactory metrics” refers to the performance measures of the ML model (e.g., accuracy, AUC, sensitivity, specificity) that are deemed sufficient for the intended medical diagnostic application. The exact threshold for what is considered “satisfactory” depends on the specific medical condition and imaging modality being used. To determine whether metrics are satisfactory, researchers and clinicians can compare their model’s performance to the metrics reported in Table 2 for similar studies. This comparison allows them to assess how their model performs relative to other published work in the same domain. The process involves identifying studies in Table 2 that focus on the same medical condition and imaging modality as the research in question, and comparing the model’s accuracy and AUC (if available) to those reported in similar studies.
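
The decision point in Figure 2 can be thought of as a simple loop: evaluate a candidate algorithm, compare its metrics against a reference value taken from Table 2 for the same condition and modality, and either stop or try the next candidate. The function below is a hedged sketch of that loop using cross-validated AUC; the candidate dictionary and reference value are assumed inputs chosen by the user.

    from sklearn.model_selection import cross_val_score

    def select_model(candidates, X, y, reference_auc, cv=5):
        """Sketch of the Figure 2 loop: try candidates until one reaches a
        'satisfactory' AUC, defined here as matching a Table 2 reference value."""
        best_name, best_auc = None, -1.0
        for name, model in candidates.items():
            auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
            if auc > best_auc:
                best_name, best_auc = name, auc
            if auc >= reference_auc:       # satisfactory: proceed to final diagnosis
                return name, auc
        return best_name, best_auc         # otherwise return the best algorithm found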

Figure 2. Flowchart for machine learning algorithm selection and implementation in medical image classification. This process integrates the medical condition, imaging modality, and algorithm performance to guide the selection and optimization of machine learning models for diagnostic tasks.

Several factors can be considered when evaluating whether a model’s metrics are satisfactory:

  • Context: A model with slightly lower metrics might still be considered satisfactory if it offers other advantages, such as faster processing time or lower computational requirements.

  • Clinical significance: Even if a model’s metrics are lower than some published studies, they may still be satisfactory if they meet the minimum threshold for clinical utility in the specific application.

  • Dataset characteristics: If the dataset is more challenging or diverse than those used in published studies, achieving comparable metrics could be considered satisfactory.

  • Trade-offs: Sometimes, a slight decrease in one metric (e.g., accuracy) might be acceptable if it leads to significant improvements in another crucial metric (e.g., AUC) for the specific medical application.

By comparing a model’s performance to the metrics reported in Table 2 and considering these factors, researchers can make an informed decision about whether the model’s metrics are satisfactory for their specific medical image classification task. This approach allows for a more nuanced evaluation that takes into account the current state of the art in the field.

The following sections offer detailed explanations of the studies referenced in Table 2.

6. Decision trees

DTs are flowchart-like structures used for differential diagnosis based on patient symptoms and medical history. They are effective in identifying risk factors for diseases like diabetes and cardiovascular disorders. Sadeghi et al. used a gradient-boosted DT model to analyze resting-state functional magnetic resonance imaging (fMRI) and clinical data, aiming to distinguish between AD, frontotemporal dementia, mild cognitive impairment, and healthy controls [56]. The model was trained using resting-state fMRI studies and clinical data. When using only imaging, the model achieved a mean balanced accuracy of 74.4%. When clinical data were added to the model inputs, the balanced accuracy increased to 91.1%.

Sabouri et al. conducted a study to develop an artificial intelligence (AI)-assisted model for diagnosing thyroid pathologies using thyroid scintigraphy images [90]. The study involved the extraction of 100 radiomic features from the images of 191 patients, which were then fed into eight different ML models. The DT model performed the best in terms of the AUC with a score of 0.809, while the RF model achieved the highest accuracy and F1 score, with values of 0.790 and 0.835, respectively. The study concluded that ML algorithms, when combined with radiomic features, can effectively differentiate between healthy cases and patients with thyroid disease.

7. Support vector machine

SVMs are used for classifying medical images for disease detection, such as in breast or lung cancer, and for protein sequence classification in genetic testing. Zhao et al. focused on the application of ML methods to identify key structural factors associated with the severity of knee pain in osteoarthritis patients. The data were sourced from knee radiographs and MRI scans of 567 individuals [57]. Among the five ML methods employed—RFs, SVM, LR, DT, and Bayesian—the SVM model demonstrated the highest performance, underscoring its effectiveness in assessing the severity of knee pain. The study by Cao et al. focused on the application of ML models to differentiate between spinal multiple myeloma and metastases using MRI sequences. The multi-kernel learning-based SVM (MKL-SVM) model outperformed others, achieving an accuracy of 86.2% and an AUC of 0.870 [79].

In the study conducted by Zhang et al., the emphasis was on the application of ML algorithms to differentiate between gross tumor volume and normal liver tissue in hepatocellular carcinoma (HCC) patients using CT imaging [81]. The XGBoost model was the top performer, achieving a mean accuracy of 0.9921 and an AUC of 0.9975. However, the SVM model also showcased a strong performance, with a mean specificity of 0.9490 and an AUC of 0.9856 [81]. This underscores the potential of SVM as an appropriate ML model for enhancing the efficiency and accuracy of radiotherapy treatment for HCC, closely rivaling the top-performing XGBoost model.

An AI-assisted brain tumor detection system was developed by Sinha and Kumar [85]. They used a multimodal framework, which underwent multiple rounds of training and testing to build its accuracy. The model was trained using a dataset of 1747 MRI brain scans, which were divided into four categories: glioma, meningioma, pituitary, and normal (no tumor) cases. The model was trained using deep learning structures for medical image analysis, including a multi-layer perceptron (MLP) algorithm with back propagation and SVMs. The model achieved an accuracy of 91.5% in accurately categorizing diverse brain tumor types and normal cases. The model was also integrated into a user-friendly smartphone app, MediScan. This app provides heatmap visualizations and generates diagnostic reports, supporting medical professionals in making swift decisions. More research studies have been published on the topic of brain tumor detection [102].

A study conducted by Agyekum et al. explores the application of ML algorithms in predicting BRAFV600E mutations in patients with papillary thyroid carcinoma (PTC) using ultrasound elastography [89]. They developed six different ML models, among which the SVM with radial basis function (SVM_RBF) emerged as the best performer. The SVM_RBF model demonstrated good performance metrics in the validation cohort, achieving an accuracy of 0.93 and an AUC of 0.98.

A study conducted by Xiong et al. investigated the application value of ML-based CT radiomics in the preoperative prediction of bladder cancer (BLCA) [91]. They used a retrospective cohort of 105 patients with initial BLCA and applied ML algorithms, including the SVM, to establish a radiomics model. The best-performing model was the clinical-radiomics model, which combined radiomics features with clinical parameters. This model achieved an AUC value of 0.958 and 0.893 in the training and testing cohorts, respectively.

8. K-Nearest neighbor

KNN predicts disease outcomes based on symptom patterns and classifies heart disease patients. It is particularly effective for classification and regression analysis. Habeb et al. introduced a new method for improving medical image classification, particularly for diagnosing glaucoma, by combining the Caputo fractional order with the cuckoo search algorithm (CFO-CS) for superior feature selection [73]. The KNN classifier emerged as the top-performing ML algorithm, achieving 92.62% accuracy, 94.70% precision, 93.52% F1 score, 92.98% specificity, 92.36% sensitivity, and an 85.00% Matthews correlation coefficient.

In a study conducted by Khasanah, the KNN algorithm was applied to a preprocessed Alzheimer MRI dataset to diagnose AD [97]. The dataset, which was sourced from Kaggle, was divided into four classes: non-demented, mild demented, moderate demented, and very mild demented. The KNN algorithm showed moderate performance, with accuracy ranging from 45.86% to 50.47% and AUC from 0.552 to 0.589. The study concluded that class imbalance significantly impacted the algorithm’s performance, particularly for the underrepresented Moderate Demented class, and suggested future research to explore advanced methods to enhance classification accuracy.

9. Logistic regression

LR is used for predicting patient readmissions and diagnosing diseases like diabetes. It provides probabilities in addition to classifications, making it useful for binary classification problems. A clinical model was developed and validated to differentiate between peripheral lung cancer (PLC) and solitary pulmonary tuberculosis (SP-TB) using clinical and imaging features [59]. The LR model, which used independent characteristic variables such as age, smoking history, and certain lesion characteristics, demonstrated the highest AUC value of 0.878 for the internal validation group, indicating its strong performance in discriminating between the two diseases [59].

Wen et al. investigated the application of ML models in predicting lymph node metastasis in pancreatic cancer using ultrasound image-omics features [64]. The most effective model was found to be the LR model. The LR model demonstrated high prediction efficiency, with an AUC of 0.773 in the training set. This performance was further validated in the test set, where the model achieved an AUC of 0.850. Zhu et al. utilized CT-based radiomics and ML to identify high-risk patients for kidney stones, with a total of 513 independent kidneys randomly divided into training and validation sets [69]. The LR model emerged as the best-performing ML model, achieving an AUC of 0.858 in the training set and 0.806 in the validation set.

The research by Shen et al. applied deep learning techniques to MRI data from patients with glioblastoma (GBM) and solitary brain metastases (SBMs) to enhance preoperative classification accuracy. Among the tested algorithms, LR emerged as the best-performing ML model, particularly when used in multimodal fusion models. The model achieved an AUC value of 0.978, indicating a high level of performance in distinguishing between GBM and SBMs [87].

10. Convolutional neural network

CNNs are a type of neural network mainly used for image recognition. They analyze medical images like X-rays, MRIs, and CT scans for various diseases. Sengupta et al. evaluated three different binary classification tasks to investigate the classification performance of baseline CNN and the VGG-16 architecture specifically designed for image classification tasks [62]. The first task involved tumor detection using a simulated digital mammography dataset, where the baseline CNN achieved an accuracy of 77.8% and VGG-16 achieved an accuracy of 79.8%. The second task was cardiomegaly detection using chest X-ray images, with the baseline CNN and VGG-16 achieving accuracies of 83.3% and 81.2%, respectively. The final task was drusen detection using optical coherence tomography (OCT) images of the human retina, where the baseline CNN and VGG-16 achieved accuracies of 99.1% and 99.5%, respectively.

A study conducted by Agrawal et al. focuses on the classification of breast cancer from mammograms using a framework that includes mammogram enhancement, discrete cosine transform (DCT) dimensionality reduction, and deep convolutional neural network (DCNN) [74]. The authors used the digital database for screening mammography, which contains approximately 55,000 mammogram images, to test their model. The DCNN model outperformed standard techniques, achieving a precision of 0.929, a recall of 0.963, an accuracy of 0.963, an F1 score of 0.962, and an AUC value of 0.987. The results suggest that the proposed approach could assist radiologists in accurately diagnosing breast cancer, thereby facilitating early detection and timely intervention.

A study conducted by Adebiyi et al. focused on the classification of benign and malignant skin lesions, specifically skin cancers, using dermoscopy images. They trained three popular deep learning models—ResNet50, DenseNet121, and Inception-V3 [93]. The medical condition under investigation in this study was skin cancer. The authors used dermoscopy images to classify skin lesions into benign and malignant categories. The DenseNet121 model demonstrated an accuracy of 81% and an AUC score of 0.81 on the testing data, making it the most effective model in this study.

A study conducted by Li et al. focuses on the diagnostic value of ultrasound habitat sub-region radiomics feature parameters in relation to breast cancer Ki-67 status [77]. Utilizing ultrasound images from 760 cases of female breast cancer, the researchers employed deep learning methods to outline the gross tumor volume, perform habitat clustering, and extract radiomics features. The best-performing ML model was a fully connected neural network (FCNN) model, which was used to develop a prediction model for the Ki-67 status of breast cancer patients. The FCNN model achieved an accuracy of 0.856 and an AUC of 0.915 on the testing data.

In a study, Bagheri et al. investigated the use of a deep learning-based algorithm to detect acute tubular necrosis (ATN) and acute rejection (AR) in patients with transplanted kidneys, using nuclear medicine scans [94]. The imaging was performed using a single-head gamma camera after the intravenous administration of Tc-99m ethylene dicysteine (EC). The InceptionResNet model was used as the neural network architecture, and it achieved an accuracy of 0.67 and an AUC of 0.72 on the test data. The authors concluded that deep learning models could be useful in diagnosing anomalies in renal transplants, despite certain limitations such as the data’s singular center and insufficient sample size.

In a study, Hunter et al. developed deep learning and LR models to predict obstructive coronary artery disease (CAD) using Rb-82 PET perfusion imaging [96]. The study, which was a retrospective review of patients referred for Rb-82 PET perfusion imaging due to known or suspected CAD from 2012 to 2022, found that the deep learning CNN model significantly outperformed the LR model. The CNN model yielded higher AUC values both per patient (0.90) and per vessel (0.80–0.87) compared to LR (0.81 and 0.66–0.73, respectively), and also produced significantly higher sensitivity and specificity at the prespecified operating points. The authors concluded that the deep learning CNN model has the potential to improve the diagnosis of obstructive CAD using Rb-82 PET blood flow imaging. Other recent research studies have been published on this topic as well [103, 104].

A study conducted by Rasel et al. explores the use of 2D and 3D CNN algorithms for detecting glaucoma, a progressive neurodegenerative disease, from OCT images [72]. The researchers compared the performance of several state-of-the-art 2D-CNN models, 3D adaptations of these 2D-CNN models with specific weight transfer techniques, and a custom 5-layer 3D-CNN-Encoder algorithm. The best-performing ML model was the pretrained 2D-ResNet18, which consistently provided robust results compared to their 3D counterparts. The 2D-ResNet18 model achieved an AUC of 0.960 and 0.943, and an accuracy of 0.901 and 0.890 for the macular and optic nerve head (ONH) OCT test images, respectively.

A study conducted by Alshuhail et al. is centered on the application of ML, specifically CNNs, for the diagnosis of brain tumors using MRI scans [88]. The researchers trained the CNN model on a comprehensive dataset of MRI images, which were categorized into different types of brain tumors. The CNN model demonstrated a significant improvement in diagnostic accuracy, achieving an overall accuracy of 98% on the test dataset. The precision, recall, and F1 scores of the model ranged from 97% to 98%. Furthermore, the model’s effectiveness was validated using Grad-CAM visualizations and an AUC ranging from 0.99 to 1.00 for each tumor category.

11. Light gradient boosting machine

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is known for its efficiency and speed, making it suitable for large datasets in medical image analysis. A study by Kanber et al. explores the use of ML algorithms, specifically LightGBM, for diagnosing breast cancer using histopathological images [63]. They developed a robust feature extraction pipeline and evaluated ten ML algorithms, with LightGBM showing the highest accuracy. The accuracy ranged from 95.16% to 96.52% before augmentation and 98.70% to 99.87% after augmentation, demonstrating the effectiveness of augmentation in refining classification accuracy. The same method was also applied to the classification of lung cancer images, achieving an accuracy of 99.83%. Unfortunately, the study did not report AUC values, a crucial metric for evaluating a classification model, as it measures the model’s ability to distinguish between classes.
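
For readers unfamiliar with the framework, the hedged sketch below shows a typical LightGBM classification workflow on tabular features such as those extracted from histopathology or CT images; the synthetic feature matrix and the hyperparameter values are illustrative assumptions, not those of the studies cited here.

    import lightgbm as lgb
    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a radiomics-style feature matrix extracted from images.
    X, y = make_classification(n_samples=2000, n_features=50, n_informative=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05, num_leaves=31, random_state=0)
    clf.fit(X_tr, y_tr)

    print("held-out AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))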

In a study by Bidwai et al., the use of ML in diagnosing diabetic retinopathy (DR), a leading cause of blindness worldwide, was investigated [71]. They employed an ML model, Dunnock–Scheduler optimization-based LightGBM (DkSO-LightGBM), which uses optical coherence tomography angiography (OCTA) images to generate feature maps and combines OCTA with fundus images for more precise DR detection. The DkSO-LightGBM model outperformed other conventional techniques, achieving an impressive accuracy of 94.32% at a training percentage (TP) of 90 and 95.53% in terms of k-fold 6. Other key performance metrics such as sensitivity, specificity, precision, F1 score, balanced accuracy, and Matthews correlation coefficient (MCC) were all above 90%, demonstrating the model’s high performance. Other studies focus more on diabetic patients and their cardiovascular health management in general [105].

A study conducted by Yu et al. focused on the development of an ML model to distinguish between benign and malignant cystic renal lesions (CRLs) using contrast-enhanced computed tomography (CECT) radiomics features [70]. They employed the LightGBM algorithm to create a nomogram that combined radiomics features with clinical factors. The radiomics nomogram demonstrated excellent predictive performance. In the validation cohort, the model achieved an accuracy of 0.923 and an AUC of 0.962.

A study conducted by Wang et al. aimed to develop a predictive model for postoperative complications in patients with gastric cancer who underwent radical gastrectomy at the First Affiliated Hospital of Nanjing Medical University [68]. They used preoperative abdominal CT scans and a 3D convolutional neural network (3D-CNN) to extract image features from the visceral fat region of the patients. The LightGBM was the best-performing ML model, surpassing the performance of XGBoost, RF, and Gradient Boosting DT models. The LightGBM model achieved an accuracy of 87.28% and an AUC value of 0.9232, demonstrating its potential to facilitate individualized clinical decision-making and the early recovery of patients with gastric cancer post-surgery.

12. General linear model boosting

In a study, Hajianfar et al. investigated the detection of pulmonary perfusion deficiency (PPD) in lung subsegments using single-photon emission computed tomography (SPECT) images, ML, and radiomics [95]. The study involved 186 patients who underwent CT and SPECT scans, with the PPD diagnosis based on physicians’ reports for each lung subsegment. The best-performing model was the Recursive Feature Elimination with General Linear Model Boosting (RFE+GLMB) applied to a combined CT-SPECT AC feature set, achieving an AUC of 0.73 and an accuracy of 0.67 on the testing data. The study concluded that CT-SPECT AC, SPECT AC, and SPECT NoAC radiomic features have potential in PPD diagnosis, with the combined CT-SPECT AC model performing significantly better than using solely CT radiomic features.

13. Linear discriminant analysis

LDA is used for dimensionality reduction and classification tasks in medical diagnostics. It helps in distinguishing between different disease states based on medical images. The study by Xianfei et al. focused on the application of ML models to diagnose human epidermal growth factor receptor 2 (HER2)-low expression breast cancer using contrast-enhanced cone-beam breast computed tomography (CE-CBBCT) images [75]. The best-performing models were the LDA and LR. The LDA model achieved an accuracy of 0.872 and an AUC of 0.880. The LR model also performed well, with an accuracy of 0.846 and an AUC of 0.880 in the validation cohort [75].

14. Naïve Bayes

The NB algorithm is a probabilistic classifier based on Bayes’ theorem. It is particularly useful for medical diagnosis due to its simplicity and effectiveness in handling large datasets with multiple features. A study conducted by Zhang et al. aimed to create predictive models that could distinguish between two types of lung cancer prior to surgery: pulmonary pure invasive mucinous adenocarcinoma (pIMA) and mixed mucinous adenocarcinoma (mIMA) [61]. They utilized both clinical parameters and radiomic features derived from contrast-enhanced CT scans in their analysis. The Gaussian NB model emerged as the best-performing ML model in this study. Integrating radiomics features and clinical parameters, the combined model achieved an accuracy of 0.841 in the training group and a notably higher accuracy of 0.966 in the test group. Similarly, the AUC value was 0.81 for the training group, and it increased to 0.91 for the test group. This result is somewhat counterintuitive as ML models are generally expected to perform better on the training data, upon which they are built and optimized. However, the higher performance on the test data in this study, while atypical, is not an impossibility. The reasons behind this counterintuitive result could be manifold. It could be due to the nature of the test data, the model selection process, or even random chance.

The research conducted by Cao et al. focused on the development of a radiomics nomogram, utilizing multiparameter MRI, for the preoperative differentiation of type II and type I endometrial carcinoma (EC) [92]. The team employed six different ML algorithms to construct radiomics models, with the NB model showing the best performance. This model was developed using 12 radiomics features derived from ADC and DCE4 sequences. It demonstrated strong performance in both the training and validation sets, with AUC values of 0.927 and 0.869, respectively. Furthermore, a combined nomogram was created, integrating the radiomics model with significant clinical-radiological features and immunohistochemistry (IHC) markers. This combined model exhibited superior performance in both the training (AUC = 0.951) and the validation sets (AUC = 0.915).

15. Random forests

RFs are ensembles of DTs that offer robust performance and insights into feature importance. They are widely used in medical diagnostics for their accuracy and ability to handle large datasets with many variables. A study by Zong et al. used six ML models to differentiate Epstein-Barr virus-associated gastric cancer (EBVaGC) from non-EBVaGC using CT radiomics and clinical characteristics [67]. The RF model performed the best, with an AUC of 0.854, demonstrating stable performance and high accuracy, sensitivity, and specificity in the testing dataset.

Mo et al. focused on the development and validation of various ultrasonic models, utilizing endoscopic ultrasonography (EUS) to distinguish between pancreatic neuroendocrine tumors (PNETs) and pancreatic cancer [65]. The best-performing ML model was the RF, which achieved an AUC of 0.999 in the training cohort and an AUC of 0.649 in the test cohort. Note that while the AUC in the training cohort was very high, the lower AUC in the test cohort suggests that the model may not generalize well to new, unseen data. This is a common challenge in ML and underscores the importance of using separate data for training and testing to get a realistic estimate of a model’s performance.

Li et al. utilized ML algorithms to analyze dynamic plantar pressure data, collected from wearable devices, for the detection of knee osteoarthritis (KOA) changes [80]. Radiography was used as the imaging modality. Five ML algorithms, namely KNN, SVM, RF, AdaBoost, and XGBoost, were evaluated for their performance. The RF model outperformed the others, demonstrating better generalization ability with an accuracy of 82.61% and an F1 score of 0.8000.

In their research, Yang et al. investigated the use of ML models to predict phosphorylated mesenchymal-epithelial transition factor (p-MET) expression in oral tongue squamous cell carcinoma (OTSCC) [83]. They utilized MRI-derived texture features and clinical features to train various ML models. Among the models tested, including AdaBoost, LR, NB, RF, and SVM, the RF model demonstrated the highest performance. It achieved an accuracy of 87.5% in correctly classifying p-MET expression status in OTSCC cases, with an AUC of 0.875 [83].

In a study conducted by Sabouri et al., the primary objective was to create an AI-assisted model for diagnosing thyroid pathologies using thyroid scintigraphy images [90]. The study extracted 100 radiomic features from the images of 191 patients, which were then processed through eight different ML models. Among these, the RF model achieved the highest accuracy and F1 score, at 0.790 and 0.835, respectively. Although the DT model achieved the highest AUC (0.809), the RF model was only slightly behind in AUC while outperforming it in accuracy. The study concluded that ML algorithms, especially the RF model, when combined with radiomic features, can effectively differentiate between healthy individuals and patients with thyroid disease.

16. Adaptive boosting

AdaBoost is an ensemble learning algorithm that combines multiple weak classifiers into a strong classifier and is used in medical diagnostics for its ability to improve the performance of simple models. Wu et al. focused on creating a facial image database and assessing the diagnostic effectiveness of an AI-based facial recognition (AI-FR) system for various endocrine and metabolic syndromes [78]. Three algorithms (SVM, KNN, and AdaBoost) were used to train AI-FR diagnostic models for each disease. The models exhibited diagnostic AUC values between 0.766 and 0.935, with AdaBoost achieving the highest accuracy. The authors found that diseases with a higher facial recognition intensity yielded a higher optimal AUC, suggesting that inherent facial features can enhance the performance of AI-FR models.
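
The following minimal sketch, with synthetic data standing in for image-derived features, shows the idea in scikit-learn: many depth-one decision trees ("stumps") are combined by AdaBoost into a stronger classifier. Note that the weak-learner argument is named base_estimator rather than estimator in scikit-learn versions before 1.2.

```python
# Minimal sketch with synthetic data: boosting many decision stumps.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 10))                 # placeholder image-derived features
y = (np.sin(X[:, 0]) + X[:, 1] ** 2 > 1).astype(int)

# Each weak learner is a depth-1 tree; AdaBoost reweights samples so later
# stumps focus on the cases earlier stumps misclassified. (Use base_estimator
# instead of estimator on scikit-learn < 1.2.)
stump = DecisionTreeClassifier(max_depth=1)
ada = AdaBoostClassifier(estimator=stump, n_estimators=200, random_state=0)

print("5-fold CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```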

Ren et al. focused on diagnosing lung adenocarcinoma using CT images [60]. Three ensemble-learning algorithms were employed: gradient boosting, RF, and AdaBoost. The best-performing model was a rule-based diagnostic model, which integrated the strengths of all three algorithms. This model achieved an AUC of 0.9621 and an accuracy of 0.9433 in the training cohort, an AUC of 0.9529 and an accuracy of 0.9292 in the testing cohort, and an AUC of 0.8953 and an accuracy of 0.8706 in the validation cohort, demonstrating its effectiveness in assessing lung adenocarcinoma subtypes.
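
The rule-based fusion used by Ren et al. is not reproduced here; as a hedged stand-in, the sketch below combines the same three learners with scikit-learn’s soft-voting ensemble on synthetic data, which is one simple way to fuse their predictions.

```python
# Hedged sketch: soft voting over gradient boosting, random forest, and
# AdaBoost as a simple stand-in for the authors' rule-based fusion.
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              AdaBoostClassifier, VotingClassifier)
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 30))                 # placeholder CT radiomics features
y = (X[:, :3].sum(axis=1) > 0).astype(int)

vote = VotingClassifier(
    estimators=[("gb", GradientBoostingClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("ada", AdaBoostClassifier(random_state=0))],
    voting="soft",                             # average predicted probabilities
)
print("5-fold CV AUC:", cross_val_score(vote, X, y, cv=5, scoring="roc_auc").mean())
```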

In their study, Feng et al. utilized MRI to predict microvascular invasion in hepatocellular carcinoma (HCC), a type of liver cancer. The AdaBoost ML algorithm was found to be the best-performing model, with an accuracy of 79.2% in the independent test cohort and an AUC of 0.858 [82].

A study conducted by Zeng et al. explored the use of radiomic features derived from multi-parameter MRI to predict the progression-free survival (PFS) of patients diagnosed with grade II meningiomas [86]. Several ML models were employed, including Bagged AdaBoost, Stochastic Gradient Boosting, RF, and a Neural Network, with the Bagged AdaBoost and Neural Network models emerging as the top performers. The Bagged AdaBoost model achieved an accuracy of 87.5% and an AUC of 0.927 on the test set, while the Neural Network model achieved an accuracy of 84.4%.

17. Extreme gradient boosting

XGBoost is an ML algorithm that is frequently utilized in medical image classification tasks due to its efficiency and high performance. It operates on the principle of gradient boosting and can handle both regression and classification tasks. XGBoost can be used to classify different types of tissue, identify disease markers, or predict disease progression, among other applications. The algorithm is especially beneficial in medical image classification because it can effectively manage high-dimensional data, handle missing values, and deliver robust performance even with noisy data, all of which are common characteristics of medical imaging datasets.
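
The sketch below, on synthetic data with randomly injected missing values, illustrates the point about missing data: the XGBoost classifier accepts NaN entries directly, without a separate imputation step.

```python
# Minimal sketch with synthetic data: XGBoost trains directly on a feature
# matrix containing NaN values, learning a default split direction for them.
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 40))
X[rng.random(X.shape) < 0.05] = np.nan         # inject ~5% missing values
y = (np.nan_to_num(X[:, 0]) + np.nan_to_num(X[:, 1]) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                    eval_metric="logloss", random_state=0)
clf.fit(X_tr, y_tr)
print("test AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```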

In their research, Azeroual et al. aimed to predict breast cancer recurrence by applying ML algorithms to three data configurations: clinical data alone, radiomic data alone, and a combination of both [76]. The most effective algorithms were LR and XGBoost. When applied to clinical data, these algorithms achieved AUC values of 0.898 and 0.954, respectively; for radiomic data, the AUC values were 0.877 and 0.994, respectively. When both data types were combined, both algorithms achieved an AUC value of 0.995 [76].

In the study conducted by Zhang et al., ML models were utilized to analyze CT images for the diagnosis of HCC, a type of liver cancer [81]. The XGBoost model emerged as the top performer, achieving a mean accuracy of 0.9921 and an AUC of 0.9975. This demonstrates the potential of ML, particularly the XGBoost model, in enhancing the accuracy and efficiency of HCC diagnosis and treatment planning.

In their study, Feng et al. utilized biparametric magnetic resonance imaging (bpMRI) to develop an ML radiomics model for predicting bone metastasis in patients newly diagnosed with prostate cancer [82]. The XGBoost algorithm emerged as the top-performing ML model, achieving an accuracy of 0.884 and an AUC of 0.928. The study underscores the potential of integrating radiomics features from axial T2-weighted imaging (T2WI) and diffusion-weighted imaging (DWI) tumor regions to enhance diagnostic performance.

A study conducted by Wang et al. centered on the creation of a prognostic model for corneal ulcer, a serious eye condition, using ML techniques [58]. They utilized slit-lamp images from patients with corneal ulcers to train a deep learning model designed to segment and classify five different types of lesions. The most effective ML models in this study were XGBoost and LightGBM, which were employed to predict the outcomes of corneal ulcer perforation and visual impairment. The XGBoost model demonstrated an accuracy of 85% and an AUC of 0.81 for predicting ulcer perforation after 1 month, and an accuracy of 91% and an AUC of 0.97 for predicting ulcer perforation after 3 months. The LightGBM model achieved an AUC of 0.98 for predicting visual impairment after 3 months.

18. RoboFlow 3.0 object detection

RoboFlow is a platform that provides tools and resources for computer vision tasks, including object detection in medical imaging. RoboFlow can process various types of medical images, including ultrasounds, X-rays, endoscopy, thermography, and MRIs. This versatility allows healthcare professionals to apply object detection models across different diagnostic tools, enhancing the detection and analysis of medical conditions. A study conducted by Siddiqui et al. focused on the comparison of three ML models, including RoboFlow 3.0 object detection (RF3), in their ability to detect and classify colon polyps [66]. The models were trained using an open-source online gastrointestinal endoscopy database, “HyperKvasir”. RF3 showed promising results with precision, recall, and F1 scores of 0.75, 1.00, and 0.86, respectively, for the “normal” class, and an overall accuracy of 0.87. The ultimate goal of the study was to enhance early detection and intervention of colon polyps, which are common protrusions in the colon’s lumen that carry potential risks of developing into colorectal cancer.
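
Regardless of the detection framework, the per-class precision, recall, and F1 values reported above can be computed from predicted and true class labels in a few lines; the labels below are hypothetical and serve only to illustrate the calculation.

```python
# Small sketch: per-class precision, recall, and F1 from hypothetical
# predictions (not data from the cited study).
from sklearn.metrics import classification_report

y_true = ["polyp", "normal", "polyp", "normal", "polyp", "normal", "polyp", "normal"]
y_pred = ["polyp", "normal", "polyp", "polyp",  "polyp", "normal", "normal", "normal"]

print(classification_report(y_true, y_pred, digits=2))
```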

19. Google teachable machine

GTM is a user-friendly tool that allows non-experts to train ML models. It can be used for medical image analysis to identify various diseases. For example, the study conducted by Chen et al. utilized GTM to develop a deep learning-based algorithm for detecting pulmonary tuberculosis (TB) from chest X-rays (CXRs) [84]. The algorithm was trained on two datasets, one for TB detection and the other for abnormality detection, and was externally validated using 250 CXRs from their hospital. In the validation dataset, the algorithm demonstrated an AUC of 0.975 and an overall accuracy of 91%, indicating high efficacy in distinguishing TB from normal conditions. This performance was comparable to that of experienced clinical physicians, suggesting the potential of this AI tool in aiding TB detection.
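
Teachable Machine is operated through a web interface, but a trained image model can be exported and run locally. The sketch below assumes a Keras-format export with the file names keras_model.h5 and labels.txt; these names and the simple normalization shown are assumptions for illustration, not details taken from the cited study.

```python
# Hedged sketch: load a Teachable Machine model exported in Keras format and
# classify one image. File names and preprocessing are assumptions; adjust
# them to match the actual export.
import numpy as np
import tensorflow as tf
from PIL import Image

model = tf.keras.models.load_model("keras_model.h5")      # assumed export name
labels = [line.strip() for line in open("labels.txt")]    # assumed label file

img = Image.open("chest_xray.png").convert("RGB").resize((224, 224))
x = np.asarray(img, dtype=np.float32)[None, ...] / 255.0  # generic [0, 1] scaling

probs = model.predict(x)[0]
print(labels[int(np.argmax(probs))], float(probs.max()))
```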

20. Discussion

The complexities involved in selecting an appropriate ML algorithm for predicting specific medical conditions are multifaceted, encompassing the need for large, annotated datasets for training, ensuring the model’s generalizability across diverse populations, and addressing ethical and legal considerations associated with the use of ML in healthcare [106, 107]. However, the potential benefits of ML in medical imaging are immense. A detailed summary of recent studies that have used various ML methods for image classification tasks is presented, providing a starting point for selecting the most suitable ML algorithm for specific medical conditions and imaging modalities.

This review is aimed primarily at readers who may lack a comprehensive understanding of ML algorithms. Medical professionals interested in ML studies are therefore encouraged to collaborate with ML experts [108, 109], who can not only implement these algorithms but also ensure their practical applicability in a real-world medical context. Preventing data leakage and overfitting is intricate, and there is a substantial risk of unintentionally mixing the validation dataset with data from the training dataset, an issue not addressed in detail in this article; a minimal safeguard is sketched below. The involvement of ML experts can be instrumental in addressing these challenges, thereby enhancing the reliability and effectiveness of ML studies in medical diagnosis.
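
One common safeguard against such leakage is to wrap every data-dependent preprocessing step inside a pipeline so that it is refit only on the training portion of each cross-validation fold; the sketch below illustrates this with scikit-learn on synthetic data.

```python
# Hedged sketch of one leakage safeguard: fit preprocessing (here, feature
# scaling) inside a pipeline so it never sees the held-out fold's statistics.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(11)
X = rng.normal(size=(200, 25))
y = (X[:, 0] > 0).astype(int)

# Leaky pattern: scaling the full dataset before cross-validation lets
# test-fold statistics influence training. The pipeline below avoids this by
# refitting the scaler within each training fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print("leak-free CV AUC:", cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean())
```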

Despite the significant potential of ML in medical imaging, several challenges need to be addressed. One of the primary obstacles is the requirement for large, annotated datasets to train AI models. The availability of such datasets is often limited due to privacy concerns and the labor-intensive process of annotation. Another challenge lies in ensuring the generalizability of the models. While an ML model may perform exceptionally well on a specific dataset, it may not necessarily yield the same results when applied to data from different populations. This raises the question of how to develop models that can accurately predict medical conditions across diverse populations.

Additionally, the “black box” phenomenon in ML refers to the opacity of the decision-making processes of complex models. This is particularly concerning in healthcare, where understanding the reasoning behind an AI suggestion is crucial for trust, accountability, and the ethical use of such systems [110]. However, current techniques for transparent ML are driven largely by computational feasibility and rarely consider end users such as clinical stakeholders. This gap in user consideration can lead to both misuse and disuse of ML models in the clinical domain. The INTRPRT guideline proposes a set of themes for designing transparent ML models: Incorporation, Interpretability, Target, Reporting, Prior, and Task [111]. It also addresses the challenges of following a human-centered design approach in healthcare and proposes potential solutions.

Funding

The authors declare no financial support for the research, authorship, or publication of this article.

Author contributions

Conceptualization, G.H., J.M., M.B., P.V., M.M. and M.T.; methodology, G.H., J.M., M.B., P.V., M.M. and M.T.; software, G.H., J.M., M.B., P.V., M.M. and M.T.; validation, G.H., J.M., M.B., P.V., M.M. and M.T.; formal analysis, G.H., J.M., M.B., P.V., M.M. and M.T.; investigation, G.H., J.M., M.B., P.V., M.M. and M.T.; resources, G.H., J.M., M.B., P.V., M.M. and M.T.; data curation, G.H., J.M., M.B., P.V., M.M. and M.T.; writing—original draft preparation, G.H., J.M., M.B., P.V., M.M. and M.T.; writing—review and editing, G.H., J.M., M.B., P.V., M.M. and M.T.; visualization, G.H., J.M., M.B., P.V., M.M. and M.T.; supervision, M.T.; project administration, M.T.; funding acquisition, M.T. All authors have read and agreed to the published version of the manuscript.

Conflict of interest

The authors declare no conflict of interest.

Competing interest

The authors declare that there are no financial or non-financial conflicts of interest related to this study. We confirm that we have no personal, academic, or other interests that could be perceived to influence the objectivity, integrity, or value of this research. Furthermore, we state that there are no factors that could affect our decisions and actions regarding the presentation, analysis, or interpretation of the data that may undermine or be perceived as undermining the objectivity, integrity, and value of this publication.

Data availability statement

The original contributions presented in the study are included in the article, and further inquiries can be directed to the corresponding author. The studies and their associated data, which are summarized in Table 2, have also been made accessible online at this link: https://comresearchdata.nyit.edu/redcap/surveys/?s=E9RNRMNWPWYN7NH9. By selecting the appropriate medical condition and corresponding imaging modality, users can conveniently locate the most recent studies on the same combination and decide which ML method to use as a starting point. Those interested in having a study included in the online list are encouraged to contact the co-author of this review article, M. Matalia, directly with an inquiry for inclusion.

Institutional review board statement

Not applicable.

Informed consent statement

Not applicable.

Publisher’s note

Academia.edu Journals stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1

An Q, Rahman S, Zhou J, Kang JJ. A comprehensive review on machine learning in healthcare industry: classification, restrictions, opportunities and challenges. Sensors. 2023;23(9):4178. doi: 10.3390/s23094178

2

Decoux A, Duron L, Habert P, Roblot V, Arsovic E, Chassagnon G, et al. Comparative performances of machine learning algorithms in radiomics and impacting factors. Sci Rep. 2023;13(1):14069. doi: 10.1038/s41598-023-39738-7

3

Huang Y, Li J, Li M, Aparasu RR. Application of machine learning in predicting survival outcomes involving real-world data: a scoping review. BMC Med Res Methodol. 2023;23(1):268. doi: 10.1186/s12874-023-02078-1

4

Siraj-Ud-Doulah M, Islam MN. Performance evaluation of machine learning algorithm in various datasets. J Artif Intell Mach Learn Neural Netw. 2023;3(2):14–32. doi: 10.55529/jaimlnn.32.14.32

5

Pasa F, Golkov V, Pfeiffer F, Cremers D, Pfeiffer D. Efficient deep network architectures for fast chest x-ray tuberculosis screening and visualization. Sci Rep. 2019;9(1):6268. doi: 10.1038/s41598-019-42557-4

6

Lyakhova UA, Lyakhov PA. Systematic review of approaches to detection and classification of skin cancer using artificial intelligence: development and prospects. Comput Biol Med. 2024;178:108742. doi: 10.1016/j.compbiomed.2024.108742

7

Van Thieu N. Permetrics: a framework of performance metrics for machine learning models. J Open Source Softw. 2024;9(95):6143. doi: 10.21105/joss.06143

8

Huang A, Rathouz PJ. Orthogonality of the mean and error distribution in generalized linear models. Commun Stat Theory Methods. 2016;46(7):3290–6. doi: 10.1080/03610926.2013.851241

9

Ikemura K, Bellin E, Yagi Y, Billett H, Saada M, Simone K, et al. Using automated machine learning to predict the mortality of patients with covid-19: prediction model development study. J Med Internet Res. 2021;23(2):e23458. doi: 10.2196/23458

10

Fürnkranz J. Decision tree. US: Springer; 2011. p. 263–7. doi: 10.1007/978-0-387-30164-8_204

11

Badawy M, Ramadan N, Hefny HA. Healthcare predictive analytics using machine learning and deep learning techniques: a survey. J Electr Syst Inf Technol. 2023;10(1):40. doi: 10.1186/s43067-023-00108-y

12

Kang S. k-nearest neighbor learning with graph neural networks. Mathematics. 2021;9(8):830. doi: 10.3390/math9080830

13

Song C, Li X. Cost-sensitive knn algorithm for cancer prediction based on entropy analysis. Entropy. 2022;24(2):253. doi: 10.3390/e24020253

14

Abe S. Support vector machines for pattern classification. London: Springer; 2010.

15

Chen Y, Mao Q, Wang B, Duan P, Zhang B, Hong Z. Privacy-preserving multi-class support vector machine model on medical diagnosis. IEEE J Biomed Health Inform. 2022;26(7):3342–53. doi: 10.1109/JBHI.2022.3157592

16

Graf R, Zeldovich M, Friedrich S. Comparing linear discriminant analysis and supervised learning algorithms for binary classification–a method comparison study. Biom J. 2022;66(1):2200098. doi: 10.1002/bimj.202200098

17

Park DJ, Park MW, Lee H, Kim YJ, Kim Y, Park YH. Development of machine learning model for diagnostic disease prediction based on laboratory tests. Sci Rep. 2021;11(1):7567. doi: 10.1038/s41598-021-87171-5

18

Jin X, Han J. K-Means clustering. US: Springer; 2011. p. 563–4. doi: 10.1007/978-0-387-30164-8_425

19

Grant RW, McCloskey J, Hatfield M, Uratsu C, Ralston JD, Bayliss E, et al. Use of latent class analysis and k-means clustering to identify complex patient profiles. JAMA Netw Open. 2020;3(12):e2029068. doi: 10.1001/jamanetworkopen.2020.29068

20

Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philos Trans R Soc A Math Phys Eng Sci. 2016;374(2065):20150202. doi: 10.1098/rsta.2015.0202

21

Qureshi NA, Suthar V, Magsi H, Sheikh MJ, Pathan M, Qureshi B. Application of principal component analysis (pca) to medical data. Ind J Sci Technol. 2017;10(20):1–9. doi: 10.17485/ijst/2017/v10i20/91294

22

Kawase Y, Mitarai K, Fujii K. Parametric t-stochastic neighbor embedding with quantum neural network. Phys Rev Res. 2022;4(4):043199. doi: 10.1103/PhysRevResearch.4.043199

23

Parmar H, Nutter B, Long R, Antani S, Mitra S. Visualizing temporal brain-state changes for fmri using t-distributed stochastic neighbor embedding. J Med Imag. 2021;8(4):046001. doi: 10.1117/1.JMI.8.4.046001

24

Nguyen TGL, Ardizzone L, Köthe U. Training invertible neural networks as autoencoders. Cham: Springer International Publishing; 2019. p. 442–55.

25

Oliveira HD, Martin P, Ludovic L, Vincent A, Xiaolan X. Explaining predictive factors in patient pathways using autoencoders. PLoS One. 2022;17(11):e0277135. doi: 10.1371/journal.pone.0277135

26

Haresamudram H, Essa I, Plötz T. Contrastive predictive coding for human activity recognition. Proc ACM Interact Mob Wearable Ubiquitous Technol. 2021;5(2):1–26. doi: 10.1145/3463506

27

Schüler W, Spicher N, Deserno TM. Cardiopulmonary coupling analysis using smart wearables and mobile computing. Curr Direct Biomed Eng. 2021;7(2):291–4. doi: 10.1515/cdbme-2021-2074

28

Garnot VSF, Landrieu L, Giordano S, Chehata N. Satellite image time series classification with pixel-set encoders and temporal self-attention. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2020; Seattle (WA). doi: 10.1109/CVPR42600.2020.01234

29

Huang SC, Pareek A, Jensen M, Lungren MP, Yeung S, Chaudhari AS. Self-supervised learning for medical image classification: a systematic review and implementation guidelines. NPJ Digit Med. 2023;6(1):74. doi: 10.1038/s41746-023-00811-0

30

Zheng Y, Luo Y, Shao H, Zhang L, Li L. Dabaclt: a data augmentation bias-aware contrastive learning framework for time series representation. Appl Sci. 2023;13(13):7908. doi: 10.3390/app13137908

31

Wolf D, Payer T, Lisson CS, Lisson CG, Beer M, Götz M, et al. Self-supervised pre-training with contrastive and masked autoencoder methods for dealing with small datasets in deep learning for medical imaging. Sci Rep. 2023;13(1):20260. doi: 10.1038/s41598-023-46433-0

32

Singh KP, Chandra B, Kalra PK, Narang R. Amazing power of dinov2 for automatic diagnosis of 12-lead ecg. In 2023 International Conference on Computational Science and Computational Intelligence (CSCI). IEEE; 2023; Las Vegas (NV). doi: 10.1109/CSCI62032.2023.00227

33

Vorontsov E, Bozkurt A, Casson A, Shaikovski G, Zelechowski M, Severson K, et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nat Med. 2024;30:2924–35. doi: 10.1038/s41591-024-03141-0

34

Chen X, Zhou J, Chen Y, Han S, Wang Y, Du T, et al. Self-supervised clustering models based on byol network structure. Electronics. 2023;12(23):4723. doi: 10.3390/electronics12234723

35

Chen W, Li C, Chen D, Luo X. A knowledge-based learning framework for self-supervised pre-training towards enhanced recognition of biomedical microscopy images. Neural Netw. 2023;167:810–26. doi: 10.1016/j.neunet.2023.09.001

36

Aggarwal CC. Restricted Boltzmann machines. Amsterdam: Springer International Publishing; 2018. p. 235–70. doi: 10.1007/978-3-319-94463-0_6

37

Dayal K, Shukla M, Mahapatra S. Disease prediction using a modified multi-layer perceptron algorithm in diabetes. EAI Endorsed Trans Pervasive Health Technol. 2023;9. doi: 10.4108/eetpht.9.3926

38

Chen L, Li S, Bai Q, Yang J, Jiang S, Miao Y. Review of image classification algorithms based on convolutional neural networks. Remote Sens. 2021;13(22):4712. doi: 10.3390/rs13224712

39

Anwar SM, Majid M, Qayyum A, Awais M, Alnowami M, Khan MK. Medical image analysis using convolutional neural networks: a review. J Med Syst. 2018;42(11):226. doi: 10.1007/s10916-018-1088-1

40

Salem FM. Recurrent neural networks: from simple to gated architectures. Cham: Springer International Publishing; 2022.

41

Nguyen M, He T, An L, Alexander DC, Feng J, Yeo BTT. Predicting alzheimer’s disease progression using deep recurrent neural networks. NeuroImage. 2020;222:117203. doi: 10.1016/j.neuroimage.2020.117203

42

Wu L, Cui P, Pei J, Zhao L. Graph neural networks: foundations, frontiers, and applications. Singapore: Springer Singapore; 2022.

43

Wang F, Lei X, Liao B, Wu FX. Predicting drug-drug interactions by graph convolutional network with multi-kernel. Brief Bioinform. 2021;23(1):bbab511. doi: 10.1093/bib/bbab511

44

Sutton RS, Barto AG. Reinforcement learning: an introduction. Chapter 3: Temporal-difference learning, Q-learning, and n-step algorithms. Cambridge (MA): The MIT Press; 2018.

45

Lee J, Kim JM. Personalized treatment policies with the novel buckley-james q-learning algorithm. Axioms. 2024;13(4):212. doi: 10.3390/axioms13040212

46

Winder P. Reinforcement learning. Chapter 4: Deep Q-networks. Sebastopol (CA): O’Reilly Media; 2024.

47

Al-Hamadani M, Fadhel M, Alzubaidi L, Harangi B. Reinforcement learning algorithms and applications in healthcare and robotics: a comprehensive and systematic review. Sensors. 2024;24(8):2461. doi: 10.3390/s24082461

48

Sutton RS, Barto AG. Reinforcement learning: an introduction. Chapter 6: Policy gradient methods. Cambridge (MA): The MIT Press; 2018.

49

Eghbali N, Alhanai T, Ghassemi MM. Patient-specific sedation management via deep reinforcement learning. Front Digit Health. 2021;3:608893. doi: 10.3389/fdgth.2021.608893

50

Dangeti P. Statistics for machine learning. Birmingham: Packt Publishing; 2024.

51

Lu H, Uddin S. Explainable stacking-based model for predicting hospital readmission for diabetic patients. Information. 2022;13(9):436. doi: 10.3390/info13090436

52

Brown G. Ensemble learning. In Encyclopedia of machine learning and data mining. New York City (NY): Springer; 2017. p. 393–402.

53

Liu CL, Lee MH, Hsueh SN, Chung CC, Lin CJ, Chang PH, et al. A bagging approach for improved predictive accuracy of intradialytic hypotension during hemodialysis treatment. Comput Biol Med. 2024;172:108244. doi: 10.1016/j.compbiomed.2024.108244

54

Ramasamy S, Kantharaju HC, Bindu Madhavi N, Haripriya MP. Toward artificial general intelligence. Chapter 8: Meta-learning through ensemble approach: bagging, boosting, and random forest strategies. Berlin: De Gruyter; 2023.

55

Akinola S, Leelakrishna R, Varadarajan V. Enhancing cardiovascular disease prediction: a hybrid machine learning approach integrating oversampling and adaptive boosting techniques. AIMS Med Sci. 2024;11(2):58–71. doi: 10.3934/medsci.2024005

56

Sadeghi MA, Stevens D, Kundu S, Sanghera R, Dagher R, Yedavalli V, et al. Detecting alzheimer’s disease stages and frontotemporal dementia in time courses of resting-state fmri data using a machine learning approach. J Imag Inform Med. 2024;37:2768–83. doi: 10.1007/s10278-024-01101-1

57

Zhao Z, Zhao M, Yang T, Li J, Qin C, Wang B, et al. Identifying significant structural factors associated with knee pain severity in patients with osteoarthritis using machine learning. Sci Rep. 2024;14(1):14705. doi: 10.1038/s41598-024-65613-0

58

Wang MT, Cai YR, Jang V, Meng HJ, Sun LB, Deng LM, et al. Establishment of a corneal ulcer prognostic model based on machine learning. Sci Rep. 2024;14(1):16154. doi: 10.1038/s41598-024-66608-7

59

Gao X, Tan H, Zhu M, Zhang G, Cao Y. Construction and validation of a clinical differentiation model between peripheral lung cancer and solitary pulmonary tuberculosis. Lung Cancer. 2024;193:107851. doi: 10.1016/j.lungcan.2024.107851

60

Ren H, Wang Q, Xiao Z, Mo R, Guo J, Hide GR, et al. Fusing diverse decision rules in 3d-radiomics for assisting diagnosis of lung adenocarcinoma. J Imag Inform Med. 2024;37:2135–48. doi: 10.1007/s10278-024-00967-5

61

Zhang J, Hao L, Xu Q, Gao F. Radiomics and clinical characters based gaussian naive bayes (gnb) model for preoperative differentiation of pulmonary pure invasive mucinous adenocarcinoma from mixed mucinous adenocarcinoma. Technol Cancer Res Treat. 2024;23:15330338241258415. doi: 10.1177/15330338241258415

62

Sengupta S, Anastasio MA. A test statistic estimation-based approach for establishing self-interpretable cnn-based binary classifiers. IEEE Trans Med Imag. 2024;43(5):1753–65. doi: 10.1109/TMI.2023.3348699

63

Kanber BM, Smadi AA, Noaman NF, Liu B, Gou S, Alsmadi MK. Lightgbm: a leading force in breast cancer diagnosis through machine learning and image processing. IEEE Access. 2024;12:39811–32. doi: 10.1109/ACCESS.2024.3375755

64

Wen D, Chen J, Tang Z, Pang J, Qin Q, Zhang L, et al. Noninvasive prediction of lymph node metastasis in pancreatic cancer using an ultrasound-based clinicoradiomics machine learning model. BioMed Eng OnLine. 2024;23(1):56. doi: 10.1186/s12938-024-01259-3

65

Mo S, Huang C, Wang Y, Zhao H, Wei H, Qin H, et al. Construction and validation of an endoscopic ultrasonography-based ultrasomics nomogram for differentiating pancreatic neuroendocrine tumors from pancreatic cancer. Front Oncol. 2024;14:1359364. doi: 10.3389/fonc.2024.1359364

66

Abraham A, Jose R, Ahmad J, Joshi J, Jacob T, Khalid A, et al. Comparative analysis of machine learning models for image detection of colonic polyps vs. resected polyps. J Imag. 2023;9(10):215. doi: 10.3390/jimaging9100215

67

Zong R, Ma X, Shi Y, Geng L. Can machine learning models based on computed tomography radiomics and clinical characteristics provide diagnostic value for epstein-barr virus-associated gastric cancer? J Comput Assist Tomogr. 2024;48(6):859–67. doi: 10.1097/RCT.0000000000001636

68

Wang W, Sheng R, Liao S, Wu Z, Wang L, Liu C, et al. Lightgbm is an effective predictive model for postoperative complications in gastric cancer: a study integrating radiomics with ensemble learning. J Imag Inform Med. 2024;37:3034–48. doi: 10.1007/s10278-024-01172-0

69

Zhu B, Nie Y, Zheng S, Lin S, Li Z, Wu W. Ct-based radiomics of machine-learning to screen high-risk individuals with kidney stones. Urolithiasis. 2024;52(1):91. doi: 10.1007/s00240-024-01593-0

70

Yu T, Yan Z, Li Z, Yang M, Yu Z, Chen Y, et al. A contrast-enhanced computed tomography-based radiomics nomogram for preoperative differentiation between benign and malignant cystic renal lesions. Transl Androl Urol. 2024;13(6):949–61. doi: 10.21037/tau-23-656

71

Bidwai P, Gite S, Pahuja N, Pahuja K, Kotecha K, Jain N, Ramanna S. Multimodal image fusion for the detection of diabetic retinopathy using optimized explainable ai-based light gbm classifier. Inf Fusion. 2024;111:102526. doi: 10.1016/j.inffus.2024.102526

72

Rasel RK, Wu F, Chiariglione M, Choi SS, Doble N, Gao XR. Assessing the efficacy of 2d and 3d cnn algorithms in oct-based glaucoma detection. Sci Rep. 2024;14(1):11758. doi: 10.1038/s41598-024-62411-6

73

Habeb AAAA, Taresh MM, Li J, Gao Z, Zhu N. Enhancing medical image classification with an advanced feature selection algorithm: a novel approach to improving the cuckoo search algorithm by incorporating caputo fractional order. Diagnostics. 2024;14(11):1191. doi: 10.3390/diagnostics14111191

74

Agrawal R, Singh NP, Shelke NA, Tripathi KN, Singh RK. Cbcerdl: classification of breast cancer from mammograms using enhance image reduction and deep learning framework. Multimed Tools Appl. 2024. doi: 10.1007/s11042-024-19616-8

75

Chen X, Li M, Liang X, Su D. Performance evaluation of ml models for preoperative prediction of her2-low bc based on ce-cbbct radiomic features: a prospective study. Medicine. 2024;103(24):e38513. doi: 10.1097/MD.0000000000038513

76

Azeroual S, Ben-Bouazza F, Naqi A, Sebihi R. Predicting disease recurrence in breast cancer patients using machine learning models with clinical and radiomic characteristics: a retrospective study. J Egypt Nat Cancer Inst. 2024;36(1):20. doi: 10.1186/s43046-024-00222-6

77

Li Y, Long W, Zhou H, Tan T, Xie H. Revolutionizing breast cancer ki-67 diagnosis: ultrasound radiomics and fully connected neural networks (fcnn) combination method. Breast Cancer Res Treat. 2024;207:453–68. doi: 10.1007/s10549-024-07375-x

78

Wu D, Qiang J, Hong W, Du H, Yang H, Zhu H, et al. Artificial intelligence facial recognition system for diagnosis of endocrine and metabolic syndromes based on a facial image database. Diabetes Metab Syndr Clin Res Rev. 2024;18(4):103003. doi: 10.1016/j.dsx.2024.103003

79

Cao J, Li Q, Zhang H, Wu Y, Wang X, Ding S, et al. Radiomics model based on mri to differentiate spinal multiple myeloma from metastases: a two-center study. J Bone Oncol. 2024;45:100599. doi: 10.1016/j.jbo.2024.100599

80

Li G, Li S, Xie J, Zhang Z, Zou J, Yang C, et al. Identifying changes in dynamic plantar pressure associated with radiological knee osteoarthritis based on machine learning and wearable devices. J NeuroEng Rehabil. 2024;21(45):45. doi: 10.1186/s12984-024-01337-6

81

Zhang H, Huang D, Wang Y, Zhong H, Pang H. Ct radiomics based on different machine learning models for classifying gross tumor volume and normal liver tissue in hepatocellular carcinoma. Cancer Imag. 2024;24(1):20. doi: 10.1186/s40644-024-00652-4

82

Feng B, Wang L, Zhu Y, Ma X, Cong R, Cai W, et al. Gastrointestinal radiology: the value of li-rads and radiomic features from mri for predicting microvascular invasion in hepatocellular carcinoma within 5 cm. Acad Radiol. 2024;31(6):2381–90. doi: 10.1016/j.acra.2023.12.007

83

Yang G, Xiao Z, Ren J, Xia R, Wu Y, Yuan Y, et al. Machine learning based on magnetic resonance imaging and clinical parameters helps predict mesenchymal-epithelial transition factor expression in oral tongue squamous cell carcinoma: a pilot study. Oral Surg Oral Med Oral Pathol Oral Radiol. 2024;137(4):421–30. doi: 10.1016/j.oooo.2023.12.789

84

Chen CF, Hsu CH, Jiang YC, Lin WR, Hong WC, Chen IY, et al. A deep learning-based algorithm for pulmonary tuberculosis detection in chest radiography. Sci Rep. 2024;14(1):14917. doi: 10.1038/s41598-024-65703-z

85

Sinha A, Kumar T. Enhancing medical diagnostics: integrating ai for precise brain tumour detection. Procedia Comput Sci. 2024;235:456–67. doi: 10.1016/j.procs.2024.04.045

86

Zeng Q, Tian Z, Dong F, Shi F, Xu P, Zhang J, et al. Multi-parameter mri radiomic features may contribute to predict progression-free survival in patients with who grade ii meningiomas. Front Oncol. 2024;14:1246730. doi: 10.3389/fonc.2024.1246730

87

Shen S, Li C, Fan Y, Lu S, Yan Z, Liu H, et al. Development and validation of a multi-modality fusion deep learning model for differentiating glioblastoma from solitary brain metastases. J Cent South Univ (Med Sci). 2024;49(1):58–67. doi: 10.11817/j.issn.1672-7347.2024.230248

88

Alshuhail A, Thakur A, Chandramma R, Mahesh TR, Almusharraf A, Vinoth Kumar V, Khan SB. Refining neural network algorithms for accurate brain tumor classification in mri imagery. BMC Med Imaging. 2024;24(1):118. doi: 10.1186/s12880-024-01285-6

89

Agyekum EA, Wang Y, Xu FJ, Akortia D, Ren Y, Chambers KH, et al. Predicting brafv600e mutations in papillary thyroid carcinoma using six machine learning algorithms based on ultrasound elastography. Sci Rep. 2023;13(1):12604. doi: 10.1038/s41598-023-39747-6

90

Sabouri M, Avval AH, Bagheri S, Asadzadeh A, Sehati M, Mogharrabi M, et al. Machine learning and radiomics-based classification of thyroid disease using 99mtc-pertechnetate scintigraphy. J Nucl Med. 2024;65(suppl 2):242315.

91

Xiong S, Fu Z, Deng Z, Li S, Zhan X, Zheng F, et al. Machine learning-based ct radiomics enhances bladder cancer staging predictions: a comparative study of clinical, radiomics, and combined models. Med Phys. 2024;51(9):5965–77. doi: 10.1002/mp.17288

92

Cao Y, Zhang W, Wang X, Lv X, Zhang Y, Guo K, et al. Multiparameter mri-based radiomics analysis for preoperative prediction of type ii endometrial cancer. Heliyon. 2024;10(12):e32940. doi: 10.1016/j.heliyon.2024.e32940

93

Adebiyi A, Rao P, Hirner J, Anokhin A, Smith EH, Simoes EJ, et al. Comparison of three deep learning models in accurate classification of 770 dermoscopy skin lesion images. AMIA Jt Summits Transl Sci Proc. 2024:46–53. PMID: 38827104.

94

Bagheri S, Hajianfar G, Barashki S, Mohammadi MRM, Askari E, Alipourfiroozabadi L, et al. Deep learning-assisted automatic differentiated diagnosis of acute tubular necrosis from acute rejection in transplanted kidney scintigraphy. J Nucl Med. 2024;65(suppl 2):242505.

95

Hajianfar G, Salimi Y, Jafari E, Zareian H, Ahadi M, Amini M, et al. Pulmonary perfusion deficiency detection in lung subsegments of spect/ct images using radiomics and machine learning algorithms. J Nucl Med. 2024;65(suppl 2):241915.

96

Hunter C, Moulton E, Chong AY, Beanlands R, deKemp R. Deep learning improves diagnosis of obstructive cad using rb-82 pet imaging of myocardial blood flow. J Nucl Med. 2024;65(suppl 2):241721.

97

Khasanah I. Enhancing alzheimer’s disease diagnosis with k-nn: a study on pre-processed mri data. Int J Artif Intell Med Issues. 2024;2(1):49–60. doi: 10.56705/ijaimi.v2i1.150

98

Riley RD, Archer L, Snell KIE, Ensor J, Dhiman P, Martin GP, et al. Evaluation of clinical prediction models (part 2): how to undertake an external validation study. BMJ. 2024;384:e074820. doi: 10.1136/bmj-2023-074820

99

Kapoor S, Narayanan A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns. 2023;4(9):100804. doi: 10.1016/j.patter.2023.100804

100

Tampu EM, Eklund A, Haj-Hosseini N. Inflation of test accuracy due to data leakage in deep learning-based classification of oct images. Sci Data. 2022;9(1):580. doi: 10.1038/s41597-022-01618-6

101

Leek JT, Peng RD. What is the question? Science. 2015;347(6228):1314–5. doi: 10.1126/science.aaa6146

102

Abraham A, Jose R, Farooqui N, Mayer J, Ahmad J, Satti Z, et al. The role of artificial intelligence in brain tumor diagnosis: an evaluation of a machine learning model. Cureus. 2024;16(6):e61483. doi: 10.7759/cureus.61483

103

Jose R, Thomas A, Guo J, Steinberg R, Toma M. Evaluating machine learning models for prediction of coronary artery disease. Glob Transl Med. 2024;3(1):2669. doi: 10.36922/gtm.2669

104

Thomas A, Jose R, Syed F, Wei OC, Toma M. Machine learning-driven predictions and interventions for cardiovascular occlusions. Technol Health Care. 2024;32(5):3535–56. doi: 10.3233/THC-240582

105

Jose R, Syed F, Thomas A, Toma M. Cardiovascular health management in diabetic patients with machine-learning-driven predictions and interventions. Appl Sci. 2024;14(5):2132. doi: 10.3390/app14052132

106

Bekbolatova M, Mayer J, Ong CW, Toma M. Transformative potential of AI in healthcare: definitions, applications, and navigating the ethical landscape and public perspectives. Healthcare. 2024;12(2):125. doi: 10.3390/healthcare12020125

107

Toma M, Wei OC. Predictive modeling in medicine. Encyclopedia. 2023;3(2):590–601. doi: 10.3390/encyclopedia3020042

108

Ng ES. Bridge over troubled waters: connecting doctors and engineers. J Interprof Care. 2011;25(6):449–51. doi: 10.3109/13561820.2011.601823

109

Toma M, Syed F, McCoy L, Nizich M, Blazey W. Engineering in medicine: bridging the cognitive and emotional distance between medical and non-medical students. Int J Educ Math Sci Technol. 2023;12(1):99–113. doi: 10.46328/ijemst.3089

110

de Hond AAH, Leeuwenberg AM, Hooft L, Kant IMJ, Nijman SWJ, van Os HJA, et al. Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review. NPJ Digit Med. 2022;5:2. doi: 10.1038/s41746-021-00549-7

111

Chen H, Gomez C, Huang CM, Unberath M. Explainable medical imaging ai needs human-centered design: guidelines and evidence from a systematic review. NPJ Digit Med. 2022;5:156. doi: 10.1038/s41746-022-00699-2

About this article

Publication dates & DOI

Received
September 10, 2024
Accepted
November 25, 2024
Published
December 23, 2024

Keywords

medical imaging, machine learning, classification tasks