1 Introduction

Machine learning (ML) uses data and algorithms to replicate how humans learn, gradually improving its accuracy. Statistical techniques are applied to train algorithms that perform visual tasks and make predictions. As data continue to expand, the demand for optimal solutions has become widespread, and the amount of data required has grown accordingly. Based on the input data, a data pattern is estimated using an optimization algorithm, as shown in Fig. 1 [32]. Given the data, the objective function measures the model prediction and model accuracy. Once the model can fit the data points in the training set, the weights are adjusted to reduce the distance between the known data and the model prediction [7]. Supervised learning uses labeled datasets to train algorithms that predict outcomes. As more data are introduced into the model, the weights are continuously adjusted until the model is properly fitted; hence, one of the important tasks is to ensure that the model suffers from neither overfitting nor underfitting [75]. Organizations use supervised learning to tackle a range of real-world problems at different scales, such as classifying spam into a separate folder of an email account. Unsupervised learning analyzes unlabeled datasets via ML techniques. Deep learning (DL) is a popular method of addressing a variety of real-world problems. In DL, a dataset is used to train a computer so that its performance improves over time [81]. When an input value is given to the model, a function is applied to it, and it is transformed into an output value through a series of layers. Thereafter, the generated output is compared with the real output, and the model calculates the difference [80]. The resulting error is then propagated back through the model to lessen the difference. The DL architecture adjusts the weights and repeats the process until convergence is achieved [46, 77]. The goal is to find an algorithm that speeds up the learning process while producing the best results. The main motivation behind this study is to compare a variety of optimizers to determine which one is best suited to solving medical diagnosis datasets without the need for human intervention. Such algorithms can uncover hidden patterns in the data, revealing similarities and differences useful for computer vision tasks. The challenge in optimization is to identify, for a given objective function, the input points at which the function attains its maximum or minimum value. Several optimization techniques have been created and tested to solve a variety of problems. This survey investigates the impact of the most extensively used optimization algorithms on the learning process [89]. ML and DL rely on optimization methods to learn the parameters from the input data [82]. The researchers of this study view optimization techniques as critical to successfully implementing real-world solutions [59].

Fig. 1 The general overview of the optimization algorithm idea

ML optimization is the process of adjusting the hyperparameters to minimize the cost function using a certain optimization approach. The cost function must be minimized because it measures the difference between the true value of an estimated parameter and the value predicted by the model [13]. However, prior to this task, the model parameters must be distinguished from the hyperparameters. Hyperparameters, such as the number of clusters and the learning rate (LR), must be specified before the model is trained, and they describe the model’s structure. The model’s parameters, by contrast, can only be obtained during training; at present, no existing method can calculate them ahead of time.
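To make the distinction concrete, the following minimal Python sketch (the function and variable names are hypothetical and not taken from the surveyed works) fixes the hyperparameters before training and lets gradient descent learn the parameters:

```python
import numpy as np

# Hyperparameters: chosen before training and never learned from the data.
LEARNING_RATE = 0.01   # step size of each update
NUM_EPOCHS = 100       # number of passes over the training set

def train_linear_model(x, y):
    """Fit y ~ w*x + b; w and b are the model parameters learned during training."""
    w, b = 0.0, 0.0                        # parameters start at arbitrary values
    for _ in range(NUM_EPOCHS):
        y_pred = w * x + b
        error = y_pred - y                 # residual of the current fit
        grad_w = 2.0 * np.mean(error * x)  # d(MSE)/dw
        grad_b = 2.0 * np.mean(error)      # d(MSE)/db
        w -= LEARNING_RATE * grad_w        # parameters are updated by the optimizer
        b -= LEARNING_RATE * grad_b
    return w, b
```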

Similarly, the model’s weights should be known in advance, but obtaining them remains a challenge. Currently, trial and error is adopted with the loss function, and optimizers use the result to determine how to alter a neural network’s weights or LRs to reduce the loss [90]. Optimization algorithms are used to minimize the losses and ultimately deliver the most precise outcomes possible. The process normally starts by defining a loss function for a DL problem. Once the loss function is obtained, an optimization procedure is applied to minimize it [60]. During optimization, the loss function is frequently referred to as the objective function of the optimization problem. Historically and in practice, the majority of optimization algorithms have focused on minimization; a straightforward way to maximize is simply to reverse the sign of the objective. Although optimization contributes to DL by lowering the loss function, the goals of optimization and DL are fundamentally different [64]. Optimization is focused on minimizing an objective, whereas DL is oriented towards finding a good model given a finite amount of data. Moreover, the training error and the generalization error differ from each other: the objective function of an optimization algorithm is usually a loss function defined on the training dataset, so the goal of the optimization is to reduce the training error [87].

A location problem, for example, may take into account a number of distinct (and potentially competing) objectives, such as obtaining a level of service commensurate with the location’s importance, lowering the worst-case service level, and raising the average service level. Taking all of these goals into account in a single mathematical problem could result in a great number of solutions that confound the decision-maker rather than aid them. For this reason, related studies compare various location solution characteristics using a battery of key performance indicators (KPIs) and examine the trend of the KPIs over successive interventions to produce long-term managerial insights, since charging infrastructures are often expected to be located through a series of progressive interventions over a predetermined time horizon [29, 30].

By contrast, DL aims to reduce the generalization error. To achieve the latter, both overfitting and the optimization procedure must be considered when lowering the training error. Rather than focusing on the generalization error of the model, this survey emphasizes the performance of the optimization techniques in minimizing the objective function. The majority of the objective functions in DL are complex and devoid of analytical solutions; thus, numerical optimization algorithms must be used instead. All of the optimization algorithms discussed in this paper fall into this category. Nonetheless, DL optimization is fraught with difficulties: local minima, saddle points, and vanishing gradients are among the most perplexing issues. For example, the objective functions of DL models frequently have plenty of local optima. When the gradient of the objective function approaches or becomes zero, the numerical solution found by the final iteration may only minimize the objective function locally rather than globally. This issue becomes apparent when the numerical solution of an optimization problem approaches a local optimum.
Only a small amount of noise is needed for the parameters to escape a local minimum [53]. In practice, the natural variation of the gradients across mini-batches can dislodge the parameters from local minima; this practical property is one of the advantages of mini-batch stochastic gradient descent (SGD) [32]. This study offers the following contributions:

  • The methods of selecting optimization algorithms in computer vision tasks are comprehensively surveyed.

  • The motivations for using optimization algorithms to improve computer vision tasks are summarized.

  • The open challenges pertaining to the effects of optimization algorithms in computer vision tasks are investigated.

  • The effects of the selected algorithms on the final result are compared on the basis of measure metrics.

The rest of the comparative study is organized as follows. Section 2 describes the optimization algorithms. Section 3 presents a case study for skin cancer diagnosis. Section 4 concludes the survey.

2 Optimization algorithms

Optimization algorithms are the foundation on which a machine learns from its mistakes: gradients are calculated, and the loss function is reduced to the smallest possible value. Learning can be implemented in many ways using optimization techniques, as shown in Fig. 2 [7, 68, 75]. The algorithms selected in this study are presented in the next sections. We highlight the most common optimization algorithms, namely, gradient descent variants and gradient descent optimization algorithms. The gradient descent variants are generally categorized into batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, whereas the gradient descent optimization algorithms can be classified into Momentum, AdaGrad, AdaDelta, RMSProp, Adam, and Nesterov accelerated gradient. SGD and mini-batch gradient descent are particularly helpful for handling the overfitting problem as well as the optimization problem to boost the evaluation accuracy [26]. Moreover, the Adam optimizer is most commonly used to handle medical images [70].

Fig. 2 The general structure of optimization algorithms

2.1 Gradient descent algorithm

Neural networks are trained by taking a batch of data and performing a form of gradient descent on it. Gradient descent computes the slope of the loss landscape, i.e., the derivative of the loss at the current point with respect to the weights, as shown in Eq. (1) [75].

$$w = w - lr \cdot \nabla_w L(w)$$
(1)

A constant LR value determines the step size at each iteration as the calculation moves towards a minimum of the loss function [7, 81]. SGD is a fast and computationally efficient approach, but it adds noise to the estimate of the gradient, and the frequent weight updates can lead to large oscillations that make the training process unstable. A list of stochastic optimization techniques is given in the succeeding subsections, each differing in how it updates the weights. Gradient descent has many advantages: it is easy to compute, implement, and understand. It also has some defects: the weights are changed only after the gradient has been computed on the whole dataset, so if the dataset is very large, convergence to the minimum can be extremely slow, and a large amount of memory is required to compute the gradient over the entire dataset.
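As an illustration of Eq. (1), the following minimal NumPy sketch (the function and variable names are illustrative assumptions, not tied to any surveyed implementation) performs full-batch gradient descent on a mean-squared-error loss:

```python
import numpy as np

def full_batch_gradient_descent(X, y, lr=0.1, epochs=200):
    """Minimise the MSE of a linear model y ~ X @ w using Eq. (1): w <- w - lr * grad_w L(w)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        residual = X @ w - y                       # model error on the *whole* dataset
        grad = 2.0 / n_samples * (X.T @ residual)  # gradient of the MSE w.r.t. w
        w -= lr * grad                             # gradient descent step
    return w

# Example usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=256)
print(full_batch_gradient_descent(X, y))           # should approach true_w
```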

2.1.1 Stochastic gradient descent (SGD)

SGD is a basic algorithm that is widely used in ML. Instead of calculating the gradient over all training examples before updating the weights, SGD updates the weights for each individual training example (xi, yi), as shown in Eq. (2) [32].

$$w = w - lr \cdot \nabla_w L\left(x_i, y_i, w\right)$$
(2)

The central idea is to start from a random point and then, at each iteration, apply an update rule while descending the slope. At each iteration, the SGD method randomly selects a single data point from the entire dataset to ease the computation. In "mini-batch" gradient descent, which is the more common technique, a small number of data points rather than a single one is sampled at each step [7]. However, this basic version of SGD has certain limitations that can negatively affect training. If the loss function changes quickly in one direction and slowly in another, the gradients oscillate strongly, making training progress extremely slow [32]. Furthermore, if the loss function has a local minimum, SGD is likely to become stuck there, and a good minimum cannot be reached. These problems occur when the gradient reaches zero and the weights or other relevant parameters are no longer updated. The gradients are also noisy because they are estimated from only a small sample of the dataset, so the noisy updates may not correlate well with the true direction of the loss function [75]. Selecting a good LR is challenging and requires time-consuming experimentation with different hyperparameters. Finally, the same LR is applied to all parameters, which is problematic for features with different frequencies or significance. Many improvements have been proposed over the years to overcome some of these issues. Figure 3 shows the main tasks in which the SGD optimizer is commonly used; among them, federated learning and image classification achieve the highest precision [81]. Figure 4 shows a plot of the loss, revealing the distinct properties of the SGD optimizer and its style of convergence towards a specific coordinate.
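Under the same assumptions as the previous sketch, the loop below illustrates Eq. (2) and the mini-batch variant: instead of the full dataset, a small random subset of examples is used for each weight update.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.05, epochs=50, batch_size=32, rng=None):
    """Mini-batch SGD on an MSE loss: w <- w - lr * grad_w L(x_i, y_i, w), Eq. (2)."""
    rng = rng or np.random.default_rng(0)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)             # shuffle once per epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]      # indices of the current mini-batch
            Xb, yb = X[idx], y[idx]
            grad = 2.0 / len(idx) * (Xb.T @ (Xb @ w - yb))
            w -= lr * grad                             # noisy but cheap update
    return w
```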

Fig. 3 SGD optimizer algorithm tasks in several computer vision problems

Fig. 4 A plot of the loss reveals distinct properties of the SGD optimizer and its style of convergence, produced with the ensmallen visualization tool; the steps that the optimizer takes are plotted in red. Coordinates: (6.00, 14.00); global minimum: (1, 3); optimizer minimum: (1.034, 2.966) [18, 85]

Wenzel et al. [80] demonstrated that the posterior predictive created by the Bayes posterior produces systematically inferior predictions compared with simpler approaches, such as the point estimates provided by SGD, via the Markov chain Monte Carlo sampling method. Numerous theories have been proposed to explain this cold posterior effect, and the predictions have been tested by experiments. Their research cast doubt on the goal of accurate posterior approximation in Bayesian DL. Noroozi et al. [57] suggested a model for the Schema-Guided Dialogue dataset, which includes natural language descriptions for all elements. Table 1 presents some common tasks that use the SGD optimizer algorithm. According to a previous study, increasing the batch size of SGD does not change the expectation of the stochastic gradient, but it reduces its variance; when the batch size is large, the LR can therefore be increased to take larger steps in the opposite direction of the gradient. In general, SGD plays an important role in computer vision tasks, but it has not yet solved the two major problems associated with gradient descent. Thus, SGD is often combined with other algorithms, such as Momentum and AdaGrad, which are presented in the following sections. Using SGD has a number of advantages: the model parameters are updated frequently, so it converges more rapidly; the values of the loss function over the whole dataset need not be stored, so it uses less memory; and new minima may be discovered. Nonetheless, SGD has certain limitations, such as excessive variance in the model parameters, and the algorithm may continue to oscillate even after attaining the global minimum. For SGD to achieve the same convergence as gradient descent, the LR must be gradually reduced.

Table 1 Some Related works for using SGD optimizer with vision datasets


2.1.2 SGD with momentum

In this approach, a momentum term is added to regular SGD to overcome the limitations of the gradient descent algorithm, i.e., gradient descent with momentum. Borrowing the principle of momentum from physics, SGD is forced to keep moving in the same direction as in the previous time steps. This is accomplished by introducing two new quantities, namely, velocity and friction, as given by Eqs. (3) and (4), respectively [46].

$$v_{t+1} = \rho v_t + \nabla_w L\left(x, w\right)$$
(3)
$$w = w - lr \cdot v_{t+1}$$
(4)

Velocity is computed as the running mean of the gradients up to a certain point in time, indicating the direction in which the gradient should keep moving. Friction is a constant that controls the decay. At each time step, the velocity is updated by decaying the previous velocity by this factor and adding the gradient of the weights at the current time; the weights are then updated in the direction of the velocity vector.
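The following NumPy fragment (illustrative only; the variable names are assumptions) implements Eqs. (3) and (4), with rho playing the role of the friction/momentum coefficient:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, rho=0.9):
    """One SGD-with-momentum update: v <- rho*v + grad (Eq. 3); w <- w - lr*v (Eq. 4)."""
    velocity = rho * velocity + grad      # decay old velocity, add the current gradient
    w = w - lr * velocity                 # move in the direction of the velocity
    return w, velocity

# Usage inside a training loop (grad_fn is any function returning dL/dw):
# v = np.zeros_like(w)
# for step in range(num_steps):
#     w, v = sgd_momentum_step(w, grad_fn(w), v)
```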

Radosavovic et al. [65] investigated the ResNet design space and found network designs that contradict common practice. The ResNet design space offers simple and fast networks that perform well across a variety of flop regimes; the resulting models outperform the popular EfficientNet models in similar training settings and are five times faster on GPUs. The authors in [93] proposed a modularized architecture that applies channel-wise attention on multiple network branches to improve the ability to capture cross-feature interactions and learn diverse representations; in their work, the unified and simple computation block can be specified using only a few variables. Similarly, the authors in [52] adopted an architecture with a simple and unified computing block that may be parameterized with only a few variables. The pre-trained model can outperform EfficientNet in terms of the accuracy-latency tradeoff during image classification. ResNet has also been adopted in the winning submissions of the COCO-LVIS challenge, achieving superior transfer learning outcomes on multiple public benchmarks when acting as the backbone. Table 2 presents some of the tasks that commonly use SGD with the Momentum optimizer algorithm. Figure 5 shows a plot of the loss, revealing the distinct properties of SGD with the Momentum optimizer and its style of convergence towards a specific coordinate. As for the scale-decreased backbone, Du et al. [24] proposed that the encoder-decoder architecture can be dispensed with when creating strong multi-scale features. SpineNet is a backbone comprising scale-permuted intermediate features and cross-scale connections, which are learned by applying the neural architecture search (NAS) method to an object detection problem. Khosla et al. [42] investigated two different variants of the supervised contrastive loss to determine which one is the most effective. The top-1 accuracy is 81.4% on the ImageNet dataset with ResNet-200, a value that is 0.8% higher than the best value previously reported for this architecture. On other datasets and two ResNet variations, the cross-entropy loss is consistently surpassed. The loss presents advantages in terms of robustness to natural corruptions, and it is relatively stable with respect to hyperparameter settings such as optimizers and data augmentations [27, 71, 73]. Moreover, in [66], SGD with the Momentum optimizer applied to the ImageNet dataset achieved a loss rate of 37.1% in the classification stage. In [23], the MNIST and CIFAR-10 datasets were further used with SGD to boost the classification process.

Table 2 Some Related works for using SGD with Momentum optimizer with vision datasets
Fig. 5 A plot of the loss reveals distinct properties of SGD with the Momentum optimizer and its style of convergence, produced with the ensmallen visualization tool; the steps that the optimizer takes are plotted in red. Coordinates: (6.00, 14.00); global minimum: (1, 3); optimizer minimum: (1.023, 2.977) [18, 85]

Plain gradient descent may not reach the global minimum point; instead, it merely reaches a local minimum point. By contrast, SGD with Momentum helps the metaphorical ball cross the inclined region before reaching its destination. However, as the ball moves nearer to the target, it takes a long time to damp out the slope variations before coming to a complete halt; this behavior is explained by the ball's momentum. The algorithm has several advantages, including reducing the oscillations and the high variance of the parameters, and it converges faster than gradient descent. Its disadvantage is the addition of an extra hyperparameter that must be specified manually and precisely.

2.2 Runge Kutta optimizer

The Runge Kutta (RK) optimizer can address a wide range of optimization challenges. As a promising and logical global optimization search process, the RK optimizer employs the logic of slope variations computed by the Runge-Kutta method. Figure 6 shows the RK optimization process and the common tasks that use this technique. When examining the prospective regions of a feature space with the aim of reaching the global optimum, this search strategy benefits from two active phases, namely, exploration and exploitation. The efficiency of the RK algorithm was compared with that of other metaheuristic algorithms on 50 mathematical test functions and four real-world engineering problems [1]. RK optimization can provide promising and competitive outcomes given its superior exploration and exploitation phases, fast convergence rate, and avoidance of local optima. Nonetheless, the suitability of this optimizer as a tool for real-world optimization should still be evaluated [1, 9]. Figure 7 shows a loss plot revealing the distinct properties of the Runge Kutta optimizer and its style of convergence towards a specific coordinate.

Fig. 6 RK optimizer algorithm tasks in several computer vision problems

Fig. 7 A plot of the loss reveals distinct properties of the Runge Kutta optimizer and its style of convergence, produced with the ensmallen visualization tool; the steps that the optimizer takes are plotted in red. Coordinates: (6.00, 14.00); global minimum: (1, 3); optimizer minimum: (1.426, 2.568) [18, 85]

The BigDL framework, which has been utilized by a range of industrial users for building DL models, was studied by Dai et al. [19]. DL applications can run on an Apache Hadoop cluster to directly process production data and partake in the deployment and management pipeline for end-to-end data analysis; real-world experiences from using BigDL have been published in the past. Xu et al. [83] predicted facial box and landmark positions in real time with high accuracy. Their proposed method can be classified as an anchor-free approach: the bounding box of each position potentially containing a face is learned, and semantic maps are adopted for each position. Ding et al. [23] described a convolutional neural network (CNN) architecture with a VGG-like inference-time body composed of a stack of 3 × 3 convolutions and ReLU, together with a multi-branch training-time model. A structural re-parameterization technique was applied to decouple the training-time and inference-time architectures; the model is appropriately called RepVGG. The accuracy of this approach is over 80% on ImageNet, reported as the first time a plain model has obtained this rate. Hoffman et al. [36] presented a framework for reinforcement learning algorithms created in academic and corporate labs. Baseline implementations of several algorithms were built with the framework; rather than dwelling on the primary design considerations, the focus was on Acme and how it could be leveraged to create the baselines. The agents were tested at various levels of complexity and computational ability, including their distributed versions. Table 3 lists some of the tasks that commonly use the RK optimizer algorithm.

Table 3 Some common related works for using RK optimizer algorithm

2.3 Adaptive learning rate (AdaGrad)

AdaGrad performs small updates on parameters associated with frequently occurring features and larger updates on those associated with infrequently occurring features, which allows it to overcome some of the issues encountered by SGD. AdaGrad is thus a technique for adjusting the LR per parameter, as shown in Fig. 8: parameters linked to frequently occurring features are adjusted only slightly, whereas parameters linked to infrequently occurring features receive larger updates, resulting in different effective LRs. Both the root of the accumulated squared gradients and the magnitude of the current gradient are considered. An optimization approach of the AdaGrad family was studied by Defazio et al. [22]; the AdaGrad update is given in Eq. (5).

Fig. 8 A plot of the loss reveals distinct properties of the AdaGrad optimizer and its style of convergence, produced with the ensmallen visualization tool; the steps that the optimizer takes are plotted in red. Coordinates: (6.00, 14.00); global minimum: (1, 3); optimizer minimum: (1.018, 2.982) [28]

$$w = w - \frac{lr}{\sqrt{G} + e} \odot \nabla_w L\left(x, y, w\right)$$
(5)

where \(G = \sum_{t=1}^{T} \left(\nabla_w L(x, y, w_t)\right)^2\) is the accumulated sum of squared gradients. AdaGrad outperforms the other DL optimization algorithms in a variety of disciplines, including vision classification and image-to-image tasks. Even on problems in which adaptive methods typically perform poorly, AdaGrad can match SGD and Adaptive Moment Estimation (Adam) on the test sets of certain tasks. According to [31], recall and precision enhancements are two effective options in AdaGrad, and they can be integrated into an end-to-end network. The method, called the Corner Proposal Network (CPN), can detect objects of varied sizes while also avoiding being misled by many false-positive proposals. CPN achieves an AP of 49.2% on the MS-COCO dataset and is competitive with state-of-the-art object detection algorithms. Different from first-order methods (e.g., SGD and Adam), second-order algorithms are among the most powerful optimization algorithms and entail superior convergence features. In an unsupervised domain adaptation (UDA) setting, Yao et al. [86] described a strategy for encoding visual task correlations to boost model performance. Semantic segmentation and monocular depth estimation were shown to be complementary tasks, and appropriately encoding their links increased the performance on both tasks in a multitask learning scenario. According to Chen et al. [17], CADA is a collection of new rules for adaptive stochastic gradients that can be implemented to save on communication uploads. The new methods adaptively reuse stale Adam gradients, conserving communication while maintaining convergence rates similar to those of the Adam optimizer. Table 4 shows some of the tasks that commonly use the AdaGrad optimizer algorithm. A notable drawback of AdaGrad is the decreasing LR over time caused by the monotonic growth of the running squared sum, which can make the learning pace extremely slow and freeze the training. Nonetheless, one of its most obvious advantages is that it eliminates the need to manually tune the LR; by simply setting the default learning speed (e.g., to 0.01), the algorithm adjusts itself.
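A minimal NumPy sketch of the per-parameter AdaGrad update in Eq. (5) (the names are illustrative; e is the small stabilizing constant):

```python
import numpy as np

def adagrad_step(w, grad, G, lr=0.01, eps=1e-7):
    """AdaGrad: accumulate squared gradients in G and scale each coordinate's step by 1/sqrt(G)."""
    G = G + grad ** 2                        # running sum of squared gradients (per parameter)
    w = w - lr * grad / (np.sqrt(G) + eps)   # rarely updated parameters keep a larger effective LR
    return w, G

# Usage: G = np.zeros_like(w); then w, G = adagrad_step(w, grad_fn(w), G) at each iteration.
```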

Table 4 Some common related works for using AdaGrad optimizer algorithm

2.4 RMSProp optimizer

The sizes of the gradients vary by weight and change over time, making it difficult to select a single global LR. RMSProp addresses this aspect by retaining a moving average of the squared gradient and scaling the weight updates by this magnitude. The gradient updates are elaborated in [43], as shown in Eqs. (6) and (7).

$$v_t = \delta v_{t-1} + \left(1 - \delta\right)\left(\nabla_w L\left(x, y, w_t\right)\right)^2$$
(6)
$$w = w - \frac{lr}{\sqrt{v_t} + e} \odot \nabla_w L\left(x, y, w_t\right)$$
(7)

Table 5 presents some of the related works, including that of Khosla et al. [42], who compared two different supervised contrastive loss models. A top-1 accuracy of 81.4% is achieved on the ImageNet dataset with ResNet-200, a value that is 0.8% higher than the best value previously reported for this architecture; on the other datasets and two ResNet variations, the cross-entropy loss is consistently surpassed. A semi-supervised learning algorithm presented by Pham et al. [62] achieves an accuracy of 90.29% on ImageNet. Meta Pseudo-Labels (MPL) uses a teacher network to instruct a student network by generating pseudo-labels on unlabeled data; different from Pseudo-Labels, in which the teacher is fixed, the teacher in MPL is continually adapted via feedback from the student's performance on the labeled dataset. Graham et al. [34] considered various measurements of efficiency on different hardware platforms and a wide range of application scenarios; tests were performed to experimentally support the technical choices in their study, eventually determining their approach to be applicable to a majority of systems, with an accuracy of 80% on ImageNet. EvoNorms, a series of innovative normalization-activation layers whose architectures surpass the established design patterns, were discovered by Liu et al. [48]; their tests showed that EvoNorms could outperform the established layers in different image classification models, such as ResNet and Mask R-CNN with SpineNet, as well as in image synthesis and segmentation models. The denominator of Eq. (7) is the root mean square (RMS) of the gradients, hence the name of the algorithm. In most adaptive-rate algorithms, a very small value denoted by e is added to prevent the nullification of the denominator; usually, e is equal to 1e-7. The most obvious benefit of using RMSProp is that it solves AdaGrad's problem of the progressively decreasing learning pace (i.e., the learning speed decreases over time, slowing down the training and possibly freezing it). As for the drawback, the RMSProp algorithm, like Momentum, may converge only to a local minimum rather than the global minimum [12]. Momentum can be integrated with RMSProp to create the Adam algorithm, as discussed in the next section.
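A minimal NumPy sketch of the RMSProp update in Eqs. (6) and (7) (names and default values are assumptions for illustration):

```python
import numpy as np

def rmsprop_step(w, grad, v, lr=0.001, delta=0.9, eps=1e-7):
    """RMSProp: exponential moving average of squared gradients (Eq. 6) scales the step (Eq. 7)."""
    v = delta * v + (1.0 - delta) * grad ** 2   # moving average of the squared gradient
    w = w - lr * grad / (np.sqrt(v) + eps)      # divide by the RMS of recent gradients
    return w, v

# Usage: v = np.zeros_like(w); then w, v = rmsprop_step(w, grad_fn(w), v) at each iteration.
```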

Table 5 Some common related works for using RMSProp optimizer algorithm


2.5 Adaptive moment estimation (Adam)

Adam is a first-order, gradient-based optimization technique for stochastic objective functions that relies on adaptive estimates of lower-order moments. Instead of the usual SGD procedure, Adam iteratively updates the network weights based on the training data, as shown in Fig. 9. Adam is built on exponentially decaying estimates of the first and second moments of the gradient, as shown in Eqs. (8), (9), and (10).

$$m_t = \delta_1 m_{t-1} + \left(1 - \delta_1\right)\nabla_w L\left(x, y, w_t\right)$$
(8)
$$v_t = \delta_2 v_{t-1} + \left(1 - \delta_2\right)\left(\nabla_w L\left(x, y, w_t\right)\right)^2$$
(9)
$$w = w - lr \cdot \frac{m_t}{\sqrt{v_t} + e}$$
(10)
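A minimal NumPy sketch of the Adam update in Eqs. (8)-(10). The bias-correction terms from the original Adam formulation are included for completeness, although they do not appear in the equations above (an assumption on our part):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, delta1=0.9, delta2=0.999, eps=1e-7):
    """Adam: first moment m (Eq. 8), second moment v (Eq. 9), scaled update (Eq. 10)."""
    m = delta1 * m + (1.0 - delta1) * grad         # moving average of gradients
    v = delta2 * v + (1.0 - delta2) * grad ** 2    # moving average of squared gradients
    m_hat = m / (1.0 - delta1 ** t)                # bias correction for early steps
    v_hat = v / (1.0 - delta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage: m = np.zeros_like(w); v = np.zeros_like(w)
# for t in range(1, num_steps + 1):
#     w, m, v = adam_step(w, grad_fn(w), m, v, t)
```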
Fig. 9 Adam optimizer algorithm tasks in several computer vision problems

Adam has many attractive features and is the most common and fastest optimizer for ML techniques. Its advantages include (i) easy implementation, (ii) efficient computation, and (iii) low memory requirements. Adam can be viewed as a combination of RMSProp and Momentum. The algorithm has grown in popularity over the years, and efforts have been pursued to further optimize this technique; the two most promising variations of Adam are AdaMax and Nadam, which are supported by most DL frameworks. Xin et al. [82] found that large pre-trained language models such as BERT have a sluggish inference speed, making them difficult to use in real-time applications. Aiming to speed up BERT inference, they proposed DeeBERT, a simple but effective approach that allows samples to exit the model early without passing through the complete network. Experimental results suggest that DeeBERT can reduce the inference time by up to 40% without compromising model quality. Furthermore, their examination demonstrated the varied behaviors of the BERT transformer layers and their redundancy; consequently, new ways of using deep transformer-based models were recommended for downstream problems. MobileBERT is a method proposed by Sun et al. [76] to compress and speed up the popular BERT model. Like the original BERT, MobileBERT is task-agnostic and may thus be applied generically to various downstream NLP jobs with slight fine-tuning. A specially developed teacher model, an inverted-bottleneck variant incorporating BERT-LARGE, is trained first, before training MobileBERT. According to empirical investigations, MobileBERT is 4.3 times smaller and 5.5 times faster than BERT-BASE, and it attains competitive results on well-known benchmarks with an F1 score of 90%. Akbari et al. [2] showed that the convolution-free VATT outperforms state-of-the-art ConvNet-based designs in downstream tasks. VATT's vision transformer achieved new performance marks (82.1% accuracy on Kinetics-400, 83.6% on Kinetics-600, and 41.1% on Moments in Time) while avoiding supervised pre-training. On ImageNet, the transfer-to-image-classification result is 78.7% top-1 accuracy, superseding the train-from-scratch result of 64.7% with the same Transformer, which demonstrates the generalizability of the model despite the domain mismatch between videos and images. In the fields of cross-lingual classification and unsupervised and supervised machine translation, Lample et al. [44] achieved state-of-the-art results. On XNLI, their technique improves the state of the art by 4.94% in absolute accuracy. On WMT16 German-English, 34.3 BLEU was attained via unsupervised machine translation, outperforming the prior state of the art by more than 9 BLEU, and on WMT16 Romanian-English, a new state-of-the-art 38.5 BLEU for supervised machine translation was achieved, outperforming the previous best approach by more than 4 BLEU. Table 6 lists some of the most frequently encountered Adam optimizer tasks. As previously stated, Adam is a mix of Momentum and RMSProp: if Adam is pictured as an extremely heavy ball with friction, then Momentum is the ball that plunges downhill, quickly moving from a local minimum towards the global minimum, which it may nonetheless fail to reach. Furthermore, because the oscillation around the target takes a long time to die out due to friction, the algorithm may also stop early, as shown in Fig. 10.

Table 6 Some common related works for using Adam optimizer algorithm
Fig. 10 A plot of the loss reveals distinct properties of the Adam optimizer and its style of convergence, produced with the ensmallen visualization tool; the steps that the optimizer takes are plotted in red. Coordinates: (6.00, 14.00); global minimum: (1, 3); optimizer minimum: (1.022, 2.976) [28]

2.6 Deep ensembles (DE)

Ensemble learning combines several individual models to improve generalization performance. Although the method requires hyperparameter tuning, it is well suited for large-scale distributed data and can readily be implemented in a wide variety of architectures, such as CNNs and models that do not use dropout [25], as shown in Fig. 11. Non-Bayesian approaches and other less popular evaluation metrics have been recommended for assessing the predictive uncertainty of deep ensembles. Izmailov et al. [38] suggested using a posterior representation comparable to multiple short chains. The performance of Bayesian neural networks was unaffected by the prior scale, and the results were similar for diagonal Gaussian priors and mixtures of Gaussians. Nonetheless, less costly alternatives, such as deep ensembles (DEs), can enhance generalization much further when a weight normalization step is added during training, followed by a substitution of the output layer with a Gaussian process. Ahmadianfar et al. [1] recommended a model to improve the distance-awareness abilities of modern deep neural networks (DNNs); on a set of vision tasks, the scheme is competitive with DEs in terms of prediction quality. Basak et al. [9] generated and assembled simplicial complexes that outperform separately trained DEs in terms of accuracy and robustness to dataset shift; a pre-trained model was utilized, and the method required only a few training epochs to determine the low-loss simplex. Ritter et al. [67] expanded Matheron's conditional Gaussian sampling rule to achieve fast weight sampling, which allowed the inference technique to run faster than ensembles. More importantly, by using fully connected neural networks and ResNets, competitive performance was achieved with respect to state-of-the-art models in prediction and uncertainty estimation tasks, and the parameter size was decreased to 24.3% of that of a single neural network. Siems et al. [74] used multiple regression models on a dataset and built surrogates via DEs to model the uncertainties; the merits of using a surrogate benchmark over a tabular one were also determined, and the NAS-Bench-301 dataset can be used to acquire results equivalent to those of the true benchmark for a fraction of the cost. Furthermore, because the training is easily parallelized, separate networks can be trained independently. Explicitly decorrelating network predictions, similar to the approach in [61], may enhance ensemble diversity and performance, and an adaptive mixture of experts can increase the performance even further by optimizing the ensemble weights. Implicit ensembles, in which the members share the same parameters, may also be considered. Ensembles generally have relatively high prediction accuracy, and the ensemble size affects the test results. Furthermore, ensembles can overcome the common challenges of other techniques, although each approach has its own unique features; for instance, during data wrangling and tweaking, various member models can be tuned to improve the fit. As for the disadvantages, ensembles are difficult to interpret: even the best insights attained via ensembles are not always able to persuade decision-makers, and they are not always adopted by end-consumers. Finally, creating, training, and deploying ensembles is more expensive than the other methods.
The return on investment of the ensemble technique should therefore be carefully studied, as increasing the complexity is not always a good approach [11, 49], as shown in Table 7. Related spatiotemporal applications include a method based on data aggregation for predicting citywide population movements using dynamic spatiotemporal correlations [4]; the use of spatiotemporal patterns and deep hybrid neural networks to predict citywide traffic crowd flows, presented by Ali et al. [3]; and attention-based neural networks that predict citywide traffic flow from dynamic spatiotemporal correlations, together with convolutional neural networks over dynamic spatiotemporal graphs for forecasting citywide traffic patterns [5, 6].
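A minimal sketch of the deep-ensemble idea (the member models and the `predict_proba` helper are placeholders, not from the cited works): train M networks independently from different random initializations and average their class-probability outputs at test time.

```python
import numpy as np

def ensemble_predict(models, predict_proba, x):
    """Average the class-probability outputs of independently trained ensemble members.

    models        : list of M trained networks (each trained from a different random init)
    predict_proba : function(model, x) -> array of shape (n_samples, n_classes)
    """
    probs = np.stack([predict_proba(m, x) for m in models])  # (M, n_samples, n_classes)
    mean_probs = probs.mean(axis=0)             # ensemble predictive distribution
    uncertainty = probs.var(axis=0)             # disagreement between members ~ uncertainty
    return mean_probs.argmax(axis=1), mean_probs, uncertainty
```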

Fig. 11 General deep ensembles idea

Table 7 Some common related works for using Deep Ensembles optimizer algorithm

2.7 Feedback alignment

By comparing simple domains with demanding robot simulation tasks, Zhang et al. [94] empirically demonstrated the benefit of the suggested algorithms and their nonlinear variants against a competing density-ratio-based approach. Feedback alignment (FA) assumes the presence of a global feedback path, which may be biologically implausible because the feedback signal needs to travel a long physical distance. The principle of FA is centered on driving the error signal: in the alignment stage, a layer cannot learn unless its upper layers are roughly aligned. Bass et al. [10] used the Human Connectome Project and UK Biobank datasets. Their method was validated via Mini-Mental State Examination cognitive test score prediction on the Alzheimer's Disease Neuroimaging Initiative cohort, together with brain age prediction for both neurodevelopment and neurodegeneration. The generated FA maps could help to explain outlier predictions, enabling the regression module to enhance the latent space disentanglement. Batch normalization, which adds implicit contrastive terms, was leveraged by Kefato et al. [41]. They then implemented four feature augmentation (FA) strategies for graphs, as data augmentation is critical in contrastive learning; even though topological augmentation (TA) of the graph is widely employed, their empirical data showed that feature augmentation is as competitive as TA. Najafi et al. [55] proposed a parallelizable model that can handle several data points simultaneously; the semantic similarity of two tweets was compared, and the suggested strategy was found to be more effective than existing approaches. Kamran et al. [40] trained a generative adversarial network utilizing multiple weighted losses on separate data modalities via a semi-supervised technique. According to their tests, the proposed design outperforms previously reported generative networks in fundus-to-angiography synthesis. Furthermore, their vision-transformer-based discriminators for retinal illness prediction generalize well to out-of-distribution datasets. Table 8 presents some of the tasks that commonly use the FA optimizer algorithm. FA requires a thorough investigation of its asymmetric outcomes under certain network assumptions.

Table 8 Some common related works for using Feedback Alignment optimizer algorithm

2.8 Direct feedback alignment

A very deep 100-layer network can be trained with direct feedback alignment (DFA). In DFA, the reciprocal feedback assumption is replaced with a single feedback projection per layer; DFA can therefore be viewed as skipping the connections on the feedback path, allowing for more flexibility in the actual form of the feedback connections. One of the first attempts at error-driven learning used directly coupled feedback routes, and by skipping non-differentiable layers, the method can be utilized to deliver error signals. FA presumes a global feedback channel, which may be biologically untenable because of the enormous physical distance the feedback must travel. In both FA and DFA, the error signal is driven by the notion of feedback alignment: a layer cannot learn unless the layers above it are roughly aligned in the alignment step. Thus, FA and DFA are less effective optimization techniques. Replacing backpropagation with a learning system that has superior generalization performance is a more appropriate and biologically plausible path; in this scheme, the weights of a layer are updated by initially fixing the layer's activation. Nonetheless, the theoretical findings on the negative descending direction have been inconclusive. Zhuge et al. [95] attempted to define the concept of integrity at the micro and macro levels: at the micro level, the model should highlight all the components corresponding to a specific salient object, whereas at the macro level, the model must discover all salient objects in the given visual scene. The novel Integrity Cognition Network (ICON) was designed to aid integrity learning for salient object recognition, and it was used to investigate three key components related to learning strong integrity features.

Ohana et al. [58] demonstrated the use of the intrinsic noise of optical random projections to develop a differentially private DFA mechanism, providing privacy-by-design training. Their theoretical study focused on the adaptive privacy technique, meticulously quantifying how the noise of optical random projections yields differential privacy. According to the test results, the proposed learning technique achieves high end-task performance. Jinia et al. [50] investigated the extent to which DNN model training may be accomplished using a globally broadcast learning signal combined with local weight updates. A learning rule called global error-vector broadcasting and a family of DNNs called vectored nonnegative networks that use this rule were proposed. In this scheme, when the postsynaptic unit is activated, the learning rule generalizes three-factor Hebbian learning by updating each weight by an amount proportional to the inner product of the presynaptic activation and a globally broadcast error vector. Liu et al. [50] proposed learning the backward weight matrices in DFA using the Kolen-Pollack learning methodology to increase the training and inference accuracy of DNNs. Through training, the strategy improves the learning accuracy and lowers the gap between parallel and serial training. Table 9 lists some of the tasks that commonly use the DFA optimizer algorithm.
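As a rough illustration of the mechanism described above (a sketch of the general DFA idea, not the implementation of any cited work), the fragment below trains a one-hidden-layer network in which the output error reaches the hidden layer through a fixed random feedback matrix instead of the transposed forward weights:

```python
import numpy as np

def dfa_update(W1, W2, B1, x, y_true, lr=0.1):
    """One direct-feedback-alignment step for a 1-hidden-layer network with tanh units and MSE loss."""
    # Forward pass
    a1 = x @ W1                      # hidden pre-activation
    h1 = np.tanh(a1)
    y_pred = h1 @ W2                 # linear output layer
    e = y_pred - y_true              # output error

    # The error is sent straight to the hidden layer through a fixed random matrix B1,
    # instead of being backpropagated through W2.T as in standard backprop.
    delta1 = (e @ B1) * (1.0 - h1 ** 2)   # tanh derivative

    W2 -= lr * h1.T @ e
    W1 -= lr * x.T @ delta1
    return W1, W2

# Shapes: x (n, d_in); W1 (d_in, d_hid); W2 (d_hid, d_out); B1 (d_out, d_hid), drawn once, e.g.:
# B1 = np.random.default_rng(0).normal(scale=0.1, size=(d_out, d_hid))
```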

Table 9 Some common related works for using Direct Feedback Alignment optimizer algorithm

2.9 Layer-wise adaptive rate scaling (LARS)

Layer-wise adaptive rate scaling (LARS) is a technique for large-batch optimization. LARS differs from other adaptive algorithms, such as Adam or RMSProp, in two ways: first, it employs a separate LR for each layer rather than for each weight; second, the magnitude of the update is controlled with respect to the weight norm to better manage the training pace. Goyal et al. [16] investigated whether self-supervision can be successfully implemented when large models are trained on non-curated images with no supervision. A model with 1.3 billion parameters, trained on 1 billion random images with 512 GPUs, achieved 84.2% accuracy, exceeding the best self-supervised pre-trained model by 1%; this finding demonstrates that self-supervised learning can be implemented in real-world settings. According to Chen et al. [16], improvements were possible without requiring specialized architectures or a memory bank; the major components of their framework were thoroughly explored to determine which contrastive prediction tasks yield effective representations. Khosla et al. [42] proposed two different versions of the supervised contrastive loss to show which one performs best. An accuracy of 81.4% on the ImageNet dataset with ResNet-200 was achieved, a value 0.8% higher than the best value previously reported for this architecture. On the other datasets and two ResNet variations, the cross-entropy loss was consistently surpassed. The loss is robust to natural corruptions, and it is more stable when handling hyperparameter settings such as optimizers and data augmentations. Table 10 shows some of the tasks that commonly use the LARS optimizer algorithm.
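A minimal sketch of the layer-wise trust-ratio idea behind LARS (the coefficients and names are illustrative assumptions, not the exact published formulation):

```python
import numpy as np

def lars_step(weights, grads, base_lr=0.1, trust_coeff=0.001, weight_decay=1e-4):
    """Scale each layer's step by ||w|| / (||g|| + wd*||w||) so every layer trains at a similar pace."""
    new_weights = []
    for w, g in zip(weights, grads):             # one (w, g) pair per layer
        g = g + weight_decay * w                 # L2 regularization folded into the gradient
        w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
        local_lr = trust_coeff * w_norm / (g_norm + 1e-9) if w_norm > 0 and g_norm > 0 else 1.0
        new_weights.append(w - base_lr * local_lr * g)   # layer-wise scaled update
    return new_weights
```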

Table 10 Some common related works for using LARS optimizer algorithm

2.10 Adafactor

Adafactor is a stochastic optimization method based on Adam that uses less memory while maintaining the empirical benefits of adaptivity. This is accomplished by maintaining a factored representation of the squared-gradient accumulator across training steps: for matrix-valued variables, only the moving averages of the row and column sums of the squared gradients are tracked, and a low-rank approximation of the exponentially smoothed accumulator is reconstructed at each training step, which is optimal with respect to the generalized Kullback-Leibler divergence. The authors of [45] presented two strategies to improve transformer efficiency: the dot-product attention was replaced with locality-sensitive hashing to reduce the model complexity, and reversible residual layers were used instead of normal residuals, allowing the activations to be stored only once during training rather than N times, where N is the number of layers. The resulting Reformer is comparable to transformer models in performance but significantly faster on long sequences. Xue et al. [84] presented byte-level models that are competitive with token-level models; byte-level models are highly robust to noise and perform well on tasks sensitive to spelling and pronunciation. As part of their contribution, a new set of pre-trained byte-level transformer models based on the T5 architecture was released. Table 11 shows some of the tasks that commonly use the Adafactor optimizer algorithm.
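A rough sketch of the factored second-moment idea for a single matrix-shaped parameter (simplified: the bias correction, relative step sizes, and update clipping of the full method are omitted, and all names are assumptions):

```python
import numpy as np

def adafactor_like_step(W, grad, R, C, lr=0.01, beta=0.999, eps=1e-30):
    """Keep only row/column statistics of grad**2 and rebuild a rank-1 second-moment estimate."""
    G2 = grad ** 2 + eps
    R = beta * R + (1.0 - beta) * G2.sum(axis=1)     # per-row sums (length n_rows)
    C = beta * C + (1.0 - beta) * G2.sum(axis=0)     # per-column sums (length n_cols)
    V = np.outer(R, C) / R.sum()                     # rank-1 approximation of E[grad**2]
    W = W - lr * grad / np.sqrt(V)
    return W, R, C

# Memory: R and C store n_rows + n_cols values instead of n_rows * n_cols for the full accumulator.
# Usage: R = np.zeros(W.shape[0]); C = np.zeros(W.shape[1]); then update each step.
```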

Table 11 Some common related works for using Adafactor optimizer algorithm

2.11 AMSGrad

AMSGrad is a stochastic optimization method aimed at fixing a convergence issue of Adam-based optimizers. AMSGrad updates the parameters by using the maximum of the previously seen squared gradients rather than their exponential average, as shown in Fig. 12. Lim et al. [47] proposed a non-asymptotic analysis of the tamed unadjusted stochastic Langevin algorithm (TUSLA). Non-asymptotic error bounds were established for TUSLA in the Wasserstein-1 and Wasserstein-2 distances; the latter result allowed the further derivation of non-asymptotic estimates of the expected excess risk. Table 12 presents some of the tasks that commonly use the AMSGrad optimizer algorithm. Wang et al. [78] proposed a new motivation for designing the proximal function of adaptive algorithms, called Marginal Regret Bound Minimization. On this basis, a class of adaptive algorithms was proposed that not only achieves marginal optimality but can also potentially converge much faster than existing adaptive algorithms in the long term; the superiority of this class was proven theoretically and demonstrated empirically in DL experiments.
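A minimal sketch of the AMSGrad modification (it follows the Adam sketch above but keeps the running maximum of the second-moment estimate; the names are illustrative):

```python
import numpy as np

def amsgrad_step(w, grad, m, v, v_max, lr=0.001, delta1=0.9, delta2=0.999, eps=1e-7):
    """AMSGrad: like Adam, but normalize by the maximum of all past second-moment estimates."""
    m = delta1 * m + (1.0 - delta1) * grad
    v = delta2 * v + (1.0 - delta2) * grad ** 2
    v_max = np.maximum(v_max, v)                 # never let the effective LR grow again
    w = w - lr * m / (np.sqrt(v_max) + eps)
    return w, m, v, v_max

# Usage: m, v, v_max all start as np.zeros_like(w).
```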

Fig. 12 A plot of the loss reveals distinct properties of the AMSGrad optimizer and its style of convergence, produced with the ensmallen visualization tool; the steps that the optimizer takes are plotted in red. Global minimum: (1, 3); optimizer minimum: (1.024, 2.975) [28]

Table 12 Some common related works for using AMSGrad optimizer algorithm

2.12 Gravity

Gravity is a kinematic approach to gradient-based optimization, i.e., a new take on gradient-based updates. Introduced by Bahrami et al. [8], it describes how the parameters can be changed to lower the DL model's loss. Three intuitive hyperparameters with suggested optimal values were proposed, along with a moving-average option. Five common datasets were trained on two VGGNet models with a batch size of 128 for 100 epochs to compare the performance of the Gravity optimizer with two common optimizers (Adam and RMSProp). According to Wang et al. [79], SwingBot is a robot that can learn the physical properties of a held object through tactile exploration. Tactile information is provided by two exploration actions (tilting and shaking), which are used to build a physical feature embedding space; by using this embedding, SwingBot can anticipate the swing angle achieved by the robot when conducting dynamic swing-up actions on a previously encountered object. Table 13 shows some of the tasks that commonly use the Gravity optimizer algorithm.

Table 13 Some common related works for using Gravity optimizer algorithm

Table 14 illustrates some common optimization algorithm tasks that help in detecting the best optimizer for a specific computer vision task. It is critical for anyone selecting an optimizer to identify the hyperparameters, which may differ from one optimizer to the next. Model parameters are configuration variables internal to the model that the model learns on its own; their values can be estimated by optimization algorithms such as gradient descent. Model hyperparameters, such as the learning rate used to train a neural network, must instead be set externally, and detecting suitable initial values for these hyperparameters is essential for all of the optimizer types shown in Table 14. The forthcoming section applies the surveyed optimization algorithms to two important challenges with two different types of datasets: the first is Seven Skin Cancer (SSC) detection based on the ISIC dataset, while the second uses COVID-19 CT and X-ray images extracted from the COVIDx dataset. The two datasets are utilized to assess the effect of the optimizer algorithms on different types of medical images. Accordingly, we implement the two scenarios using the same hyperparameter values listed in Table 15. The SGD and Adam optimizers achieved reliable and promising results compared with the other optimization algorithms.

Table 14 The most common optimization algorithms task
Table 15 The default values for common optimization algorithms

3 Proposed method

In this section, we consider two different types of medical images: colored skin cancer images and grayscale COVID-19 images. Skin cancer is one of the most common diseases in the world; given that the skin is the body's largest organ, it is natural that skin cancer is the most prevalent type of cancer in humans [56]. DL reduces the need for feature engineering by learning and extracting meaningful features from raw data automatically. Many fields, particularly computer vision, have been transformed by DL, and DL has recently achieved many successes in biomedical engineering, as shown in Table 15. Datta et al. [20] compared the performance of the VGG, ResNet, InceptionResNetV2, and DenseNet architectures with and without a Soft-Attention mechanism for classifying skin lesions; when coupled with Soft-Attention, the original network outperforms the baseline by 4.7% while achieving a precision of 93.7% on the HAM10000 dataset. Related approaches were also reported by Nadipineni et al. [54]. Mahboda et al. [51] developed a baseline classifier as a reference model without using any segmentation mask; on this basis, they used either manually or automatically created segmentation masks in both the training and test phases under different scenarios and investigated the classification performance. Using the 2019 International Skin Imaging Collaboration (ISIC) dataset, Hosny et al. [37] suggested a CAD system for skin lesions. However, this dataset suffers from several issues, including class imbalance. A multiclass SVM with a bootstrap-weighted classifier was therefore used; this classifier adjusts the weights according to the image class. GoogleNet was also given a new class with a different quantity of unknown images, acquired from various sources, for training. Hameed et al. [35] suggested a multiclass, multilevel algorithm-based skin lesion classification system in which both traditional ML and DL methods can be applied.

3.1 Data augmentation

Data augmentation is conducted through affine and intensity transformations and involves the following elements: (i) random brightness changes, (ii) contrast changes, (iii) random flipping, (iv) random rotation, (v) random scaling, and (vi) random shear.
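A possible implementation of such a pipeline using torchvision (the exact parameter ranges are assumptions, as they are not specified in the text):

```python
from torchvision import transforms

# (i)-(ii) brightness/contrast, (iii) flipping, (iv) rotation, (v) scaling, (vi) shear
train_transforms = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # random brightness / contrast
    transforms.RandomHorizontalFlip(p=0.5),                 # random flipping
    transforms.RandomAffine(degrees=30,                     # random rotation
                            scale=(0.8, 1.2),               # random scaling
                            shear=10),                      # random shear
    transforms.ToTensor(),
])
```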

3.2 Building deep learning model

The proposed Seven Skin Cancer (SSC) model consists of sequential CNN layers, as shown in Fig. 13. The focus of this comparative survey, in constructing an automated model for skin lesion classification, is to enhance model accuracy by incorporating new methodologies. Although the CNN model has only two layers, appropriate preprocessing, input handling, and training procedures can significantly improve its accuracy. Data augmentation, image generation via a generative adversarial network, and transfer learning can help to overcome the difficulty of training with a small dataset. Some academics rely on private datasets from the Internet; because such datasets are not publicly available, it is more difficult to reproduce the findings and outcomes, and the image selection from the Internet may be biased. Another key issue in this field is the production of large public image collections containing photographs that fully represent the world's population in order to eliminate racial bias, as shown in Fig. 13. Discrimination based on race and gender must be considered: for people of an underrepresented gender or ethnicity, AI discrimination means that models and algorithms fail to produce optimal results. In most current datasets, skin lesions on light-colored skin are the most prevalent.

Fig. 13
figure 13

The proposed work model layers with input and output size
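For orientation only, the following is a minimal sketch of a two-block sequential CNN with a seven-class output, written in PyTorch; the layer widths and kernel sizes are assumptions for illustration and do not reproduce the exact input/output sizes shown in Fig. 13.

```python
import torch
import torch.nn as nn

# Minimal two-block sequential CNN sketch; sizes are illustrative only.
ssc_sketch = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.LazyLinear(7),  # seven skin-lesion classes
)

# Example forward pass on a dummy RGB image batch.
logits = ssc_sketch(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 7])
```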


3.3 Results

The proposed SSC model is applied to the ISIC dataset for skin cancer detection. Model evaluation is a core stage in measuring the performance of a model. In the following subsections, we compare three optimizers (SGD, RMSProp, and Adam) commonly used for image classification tasks.

3.3.1 Datasets

This research applied the ISIC dataset [15] and the COVIDx dataset [88] to review and evaluate well-known datasets of skin cancer color images and COVID-19 grayscale CT images, as shown in Figs. 14 and 15. The pre-trained ResNet-50 model was fine-tuned on the ISIC dataset, which contains 2594 images covering seven classes. The implementation was written with CUDA support and ran on a GPU, which helps manage the voluminous training data efficiently while keeping the model error rate low. The final three layers of the proposed SSC model (fully connected, softmax, and classification layers) were removed and replaced with new layers suited to the target task. The original final layers of the pre-trained ResNet-50 were built to classify 1000 classes, whereas only seven classes (melanoma, melanocytic nevus, basal cell carcinoma, actinic keratosis, benign keratosis, dermatofibroma, and vascular lesion) were needed in the proposed work.
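The following hedged sketch illustrates the general idea of replacing the 1000-class head of a pre-trained ResNet-50 with a seven-class head using the standard torchvision API; it is not the exact SSC implementation, and the choice of weights and head structure are assumptions for illustration.

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet-50 (recent torchvision weights API).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Replace the original 1000-class fully connected head with a 7-class head;
# the softmax/classification stage is typically handled by the loss function
# (e.g., nn.CrossEntropyLoss) during training.
backbone.fc = nn.Linear(backbone.fc.in_features, 7)
```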

Fig. 14
figure 14

Random sample from seven classes of ISIC dataset

Fig. 15
figure 15

Example chest CT images from the COVIDx-CT dataset, (a) COVID-19 cases, and (b) Normal cases

COVID-19 has infected over 1.3 million people worldwide and caused over 106,000 deaths. Inefficient and insufficient diagnosis are two major roadblocks to controlling the progression of this disease. We compared the different optimizers on a second dataset, COVIDx-CT, a benchmark CT image dataset derived from a variety of CT imaging sources and currently comprising 104,009 images across 1489 patient cases. We used a sample of 13,413 cases divided into two class labels: 7395 COVID-19-infected cases and 6018 normal (non-infected) cases. Figure 15 shows sample chest CT images of COVID-19 and normal cases.

The proposed model’s reliability was assessed by considering several performance indicators, including sensitivity (recall), specificity, precision, negative predictive value (NPV), false-positive rate (FPR), false discovery rate (FDR), false-negative rate (FNR), accuracy, F1-score, and Matthews correlation coefficient (MCC). These measures are computed as in Eqs. (11)–(20) [72]:

$$\mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(11)
$$\mathrm{Specificity}=\frac{\mathrm{TN}}{\mathrm{FP}+\mathrm{TN}}$$
(12)
$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(13)
$$\mathrm{NPV}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FN}}$$
(14)
$$\mathrm{FPR}=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}$$
(15)
$$\mathrm{FDR}=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TP}}$$
(16)
$$\mathrm{FNR}=\frac{\mathrm{FN}}{\mathrm{FN}+\mathrm{TP}}$$
(17)
$$\mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN}}$$
(18)
$$\mathrm{F1\text{-}score}=\frac{2\,\mathrm{TP}}{2\,\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}$$
(19)
$$\mathrm{MCC}=\frac{\mathrm{TP}\cdot\mathrm{TN}-\mathrm{FP}\cdot\mathrm{FN}}{\sqrt{\left(\mathrm{TP}+\mathrm{FP}\right)\left(\mathrm{TP}+\mathrm{FN}\right)\left(\mathrm{TN}+\mathrm{FP}\right)\left(\mathrm{TN}+\mathrm{FN}\right)}}$$
(20)

where TP, FP, FN, and TN refer to a true positive, false positive, false negative, and true negative, respectively.
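As a convenience for readers, the following minimal Python function evaluates Eqs. (11)–(20) directly from the four confusion-matrix counts; it is a straightforward transcription of the formulas above, not code from the original study, and it assumes non-zero denominators.

```python
import math

def confusion_metrics(tp, fp, fn, tn):
    """Evaluate Eqs. (11)-(20) from confusion-matrix counts."""
    return {
        "recall":      tp / (tp + fn),                      # Eq. (11)
        "specificity": tn / (fp + tn),                      # Eq. (12)
        "precision":   tp / (tp + fp),                      # Eq. (13)
        "npv":         tn / (tn + fn),                      # Eq. (14)
        "fpr":         fp / (fp + tn),                      # Eq. (15)
        "fdr":         fp / (fp + tp),                      # Eq. (16)
        "fnr":         fn / (fn + tp),                      # Eq. (17)
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),     # Eq. (18)
        "f1":          2 * tp / (2 * tp + fp + fn),         # Eq. (19)
        "mcc":         (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)), # Eq. (20)
    }

# Example with arbitrary counts.
print(confusion_metrics(tp=90, fp=10, fn=5, tn=95))
```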

3.3.2 Optimizer algorithms

Optimizers are methods or strategies for lowering losses by altering the neural network’s attributes, such as the weights and the LR. They are used to address optimization problems by minimizing the loss function. The main metric values obtained with the ISIC dataset are presented in Table 16.

Table 16 Related works on skin cancer diagnosis using the ISIC dataset

SGD optimizer

Gradient descent has the disadvantage of requiring voluminous memory, because the entire dataset of n points must be loaded at once to compute the derivative of the loss function. Nonetheless, some of the disadvantages of the SGD algorithm can be alleviated: Nesterov momentum is a slight variation of normal gradient descent that can significantly speed up training and improve convergence. We applied the SGD optimizer to the ISIC dataset and achieved an accuracy of 0.9445, as shown in Figs. 16 and 17.
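A minimal sketch of SGD with Nesterov momentum in PyTorch is shown below; the placeholder model, learning rate, and momentum value are illustrative assumptions, not the exact settings used for the ISIC experiments.

```python
import torch
import torch.nn as nn

# Placeholder model and dummy mini-batch purely for illustration.
model = nn.Linear(16, 7)
x, y = torch.randn(8, 16), torch.randint(0, 7, (8,))

# SGD with Nesterov momentum; lr and momentum are assumed values.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, nesterov=True)

loss = nn.CrossEntropyLoss()(model(x), y)  # loss on the mini-batch
optimizer.zero_grad()
loss.backward()
optimizer.step()                           # one Nesterov-accelerated update
```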

Fig. 16
figure 16

a The Training Loss, b The Training accuracy of the proposed model based on SGD optimizer on ISIC dataset

Fig. 17
figure 17

a The Validation Loss, b The Validation accuracy of the proposed model based on SGD optimizer on ISIC dataset

RMSProp optimizer

The RMSProp optimizer aids various computer vision tasks by utilizing leaky averaging, a feature it shares with momentum methods. Figures 18 and 19 show the accuracy metrics obtained with RMSProp.
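To make the leaky-averaging idea concrete, the following toy NumPy sketch implements a single RMSProp update; the decay rate, learning rate, and epsilon values are common defaults assumed for illustration.

```python
import numpy as np

def rmsprop_step(w, grad, sq_avg, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp update using a leaky average of squared gradients."""
    sq_avg = rho * sq_avg + (1 - rho) * grad ** 2     # leaky (exponential) average
    w_new = w - lr * grad / (np.sqrt(sq_avg) + eps)   # per-parameter scaled step
    return w_new, sq_avg

w, sq_avg = np.zeros(3), np.zeros(3)
w, sq_avg = rmsprop_step(w, np.array([0.5, -0.2, 0.1]), sq_avg)
print(w)
```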

Fig. 18
figure 18

a The Training Loss, b The Training accuracy of the proposed model based on RMSProp on ISIC dataset

Fig. 19
figure 19

a The Validation Loss, b The Validation accuracy of the proposed model based on RMSProp on ISIC dataset

Adam

Adam can be viewed as a combination of RMSProp and SGD with momentum. Adam calculates an adaptive LR for each parameter, as shown in Figs. 20 and 21, which present the training and validation loss and accuracy, respectively.
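The following toy sketch shows how Adam combines a momentum-style first moment with an RMSProp-style second moment, plus bias correction; the hyperparameters are the commonly cited defaults, assumed for illustration rather than taken from this study.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus RMSProp-style scaling (v), bias-corrected."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (RMSProp term)
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w_new = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w_new, m, v

w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
w, m, v = adam_step(w, np.array([0.5, -0.2, 0.1]), m, v, t=1)
print(w)
```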

Fig. 20
figure 20

a The Training Loss, b The Training accuracy of the proposed model based on Adam optimizer on ISIC dataset

Fig. 21
figure 21

a The Validation Loss, b The Validation accuracy of the proposed model based on Adam optimizer on ISIC dataset

Comparing against another dataset is a helpful way to distinguish between different optimization algorithms. The SSC model was implemented on the COVIDx dataset and achieved metric results comparable to those obtained on the ISIC dataset. Based on the previous implementation with ISIC, we selected the Adam and SGD optimizers for this comparison. Figures 22 and 23 show the training and validation curves with the Adam optimizer, while Figs. 24 and 25 show the training and validation curves with the SGD optimizer. Metric values are presented in Table 17. We used the subsampled COVIDx dataset and analyzed the performance of the proposed algorithm with both the Adam and SGD optimizers, as shown in Table 18, and found slightly improved results. We also plan to use other classifiers to monitor the performance of algorithms that may exhibit performance degradation. This behavior arises because classifiers may still use other features to deliver accurate performance even if one feature has declined; however, if the quality of every feature decreased, the algorithm’s performance would likewise decrease. Every classifier therefore acted appropriately when the sampling duration was long, and the ideal classifier for applying this technique is one that delivers consistent performance under these conditions [14].
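For completeness, the sketch below outlines one way such an optimizer comparison could be organized: the same architecture is trained with each optimizer and evaluated on a validation split. The build_model factory, the data loaders, the learning rates, and the epoch count are hypothetical placeholders, not the actual SSC training configuration.

```python
import torch
import torch.nn as nn

def compare_optimizers(build_model, train_loader, val_loader, epochs=5):
    """Train the same architecture with different optimizers and report val accuracy."""
    optimizer_factories = {
        "SGD":  lambda p: torch.optim.SGD(p, lr=0.01, momentum=0.9),
        "Adam": lambda p: torch.optim.Adam(p, lr=0.001),
    }
    results = {}
    loss_fn = nn.CrossEntropyLoss()
    for name, make_opt in optimizer_factories.items():
        model = build_model()                 # fresh model per optimizer
        opt = make_opt(model.parameters())
        for _ in range(epochs):
            model.train()
            for x, y in train_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:
                correct += (model(x).argmax(dim=1) == y).sum().item()
                total += y.numel()
        results[name] = correct / total
    return results
```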

Fig. 22
figure 22

a The Training Loss, b The Training accuracy of the proposed model based on Adam optimizer on COVIDx dataset

Fig. 23
figure 23

a The Validation Loss, b The Validation accuracy of the proposed model based on Adam optimizer on COVIDx dataset

Fig. 24
figure 24

a The Training Loss, b The Training accuracy of the proposed model based on SGD optimizer on COVIDx dataset

Fig. 25
figure 25

a The Validation Loss, b The Validation accuracy of the proposed model based on SGD optimizer on COVIDx dataset

Table 17 The overall common metrics for all optimization algorithms on the ISIC dataset
Table 18 The overall common metrics for the training and testing stages on the COVIDx dataset

This research presents a comparative survey of several optimization algorithms and a comprehensive study of diagnosing skin cancer with deep CNN models. The selected, widely available algorithms are described and then compared. The comparison of skin lesion classification methods indicates that the problem formulations of the individual studies vary slightly. An efficient melanoma detection process entails five core elements: data acquisition (collection), fine-tuning, feature selection, DL, and final model development. The first step involves acquiring skin cancer detection data from publicly available benchmarks as well as unlisted, non-public databases, such as melanoma images collected from the Internet.

4 Conclusion

This survey examined optimization algorithms such as AdaMax, SGD, Root Mean Square Propagation (RMSProp), the Adaptive Gradient Algorithm (AdaGrad), Nadam, and Adam. Optimization algorithms are widely available and commonly used to solve complex problems. A comprehensive survey was conducted to gain deeper insights into the different aspects of these algorithms. Among them, results are better when trapping in local optimal solutions is prevented. Of the selected algorithms, AdaMax performs best in terms of numerical function optimization. DL makes intelligent decisions on its own and ultimately achieves a higher accuracy rate. Pre-trained DL models and handcrafted methods based on the DL approach have already shown promising results for high-precision melanoma detection. In this study, we highlighted the importance and effect of optimization algorithms in improving accuracy on medical image datasets with different challenges, namely skin cancer and COVIDx. We further highlighted the local-optimum problem and how to tackle it to boost algorithm performance with different classifiers and datasets. In the future, we plan to monitor the performance of these algorithms on sub-sampled datasets, which would make it possible to determine which algorithm extracts more information from the data.