1 Introduction

Image recognition is becoming popular in many fields, including medicine (disease diagnosis from image scans), security and the fight against cybercrime, robotics, and many others. Society has increasingly directed resources toward identifying how artificial intelligence can help solve its problems. The analysis of human behavior provides valuable insights to science and technology, and many great human inventions have emerged from the observation of nature and its mechanisms, as is the case of Neural Networks, the basis of deep learning. In this context, driven by the success of the competitions promoted by the ImageNet platform [1], one of whose objectives is to improve the performance of image recognition algorithms, we conducted experiments to evaluate the performance of pre-trained Neural Network architectures freely available on the internet. In this study, a Convolutional Neural Network (CNN) was used to classify driver behaviors captured in images. A CNN is a type of Neural Network specialized in image processing that focuses on learning, directly from the input, the most discriminative features for the target task. Furthermore, such networks demonstrate an enormous capacity for generalization when compared to fully connected networks [2]. We also carried out an extensive evaluation of the CNN architectures published in recent years by the deep learning research community. Finally, we delved into the mathematical concepts underlying Neural Networks to analyze the impact on the results of several adjustable parameters, such as the choice of optimizer [3], the learning rate used in gradient descent, and the use of data augmentation.

This paper is organized as follows. In Sect. 1, we introduce our motivation and briefly review related works. We also present the mathematical concepts that are the foundation of neural networks and discuss some techniques implemented in the algorithms to overcome problems faced when training the models. In Sect. 2, we present the methodology adopted to develop our models. In Sect. 3, we summarize and discuss the results obtained. Finally, in Sect. 4, we describe final considerations and directions for future work.

1.1 Motivation and related works

Predicting the future is certainly one of the greatest ambitions of human beings. What distinguishes neural networks is that they learn through a process: rather than relying on pre-defined programming, they rely on feedback systems that generate a response, compare the predicted result with the real result, and adjust their own parameters accordingly, so that after many iterations their predictions improve. This makes neural networks extremely powerful in countless approaches and in the most diverse contexts, such as stock market prediction [4].

The forecasting of energy demand and electrical load has become increasingly efficient through the use of neural networks in the analysis of a significant volume of data collected by smart meters, thus improving the operational quality of distribution networks [5, 6].

The chemical industry has also discovered the advantages of neural networks in its process of identifying chemical compounds and developing new drugs [7, 8].

Neural networks, especially those specialized in image analysis, have presented outstanding results in several applications. In the medical area, there are many successful cases supporting the diagnosis of conditions such as skin lesions [9], brain tumors (through segmentation) [10], and chronic eye diseases [11]. In 2020, the impact of Coronavirus disease 2019 (COVID-19) on healthcare systems worldwide led the scientific community to devote great efforts not only to tracing the origin of the disease, but also to providing fast, safe, and large-scale diagnostic tools. In this sense, many works with excellent results have been presented in which convolutional neural networks are used to classify chest Computed Tomography (CT) [12] or X-ray [13] images to support the diagnosis of COVID-19.

In agriculture, controlling pests and diseases through image analysis of plants [14] can prevent further crop damage. Furthermore, the classification of fruits [15] or food-producing species [16] can raise the quality level of crops and assist in the preservation of species.

Vehicle automation based on human motion recognition is another field where neural networks can be successfully applied [17], but it is in the security and crime control area that some challenges arise, depending on what is to be classified. From facial recognition [18] to the prediction of criminal trends based on criminal psychological theories, learning analysis in time and space, behavioral recognition, and scene recognition that can identify human behavior in a given environment in real time [19], the vast majority of applications aimed at the security area need to deal with several details that, when added together, bring great complexity to the data set (for example, features related to the human face associated with movements in different directions). The major problem is that, in the universe of neural networks, such complexity is handled with large databases and high data processing capacity, which can be very costly. Another aspect of the safety area concerns the behavior of drivers at the wheel. According to the Federal Council of Medicine (CFM), an assessment covering 2009 to 2018 shows that traffic accidents left more than 1.6 million people injured in Brazil, implying a cost of almost $600 million to the Unified Health System (SUS). Part of these accidents happen due to recklessness and distractions by drivers, such as answering messages or talking on cell phones. In the USA, the insurance company State Farm, also concerned with the alarming statistic that 1 in 5 car accidents is caused by a distracted driver, launched a challenge on the Kaggle platform to test whether dashboard cameras can automatically detect drivers who engage in distracted behavior.

In this sense, encouraged by recent successes of machine learning applied to the analysis of images in different areas, we decided to use the set of 102,150 images provided by the Kaggle platform for the challenge launched by State Farm to explore some Neural Network models specialized in image processing, including transfer learning techniques for each architecture. Our main goal is to evaluate the performance of each architecture while using basic or free computational resources within most people’s reach.

1.2 Mathematical fundamentals of neural networks

Despite the great evolution in recent years, the mathematical concept of the basic unit of a Neural Network emerged from the studies developed by Frank Rosenblatt [20]. The objections raised by Marvin Minsky and Seymour Papert [21], who questioned the perceptron’s ability to solve problems involving nonlinearly separable patterns (e.g., XOR), led artificial intelligence into a period known as the AI Winter. Some adaptations of Rosenblatt’s Perceptron appeared years later, however, bringing artificial intelligence back into scientific discussions.

Basically, Artificial Neural Networks are complex formations of several units known as neurons. Every neuron has a set of input channels, and each input is assigned a weight. As in Fig. 1, the inputs of each neuron are multiplied by their respective weights and combined into a weighted sum. This result is the activation potential of the respective neuron. To this weighted sum, a polarization constant, also known as a bias, is added, which represents an activation threshold for each neuron. Once this operation has been performed, an activation function is applied that determines the output format of the data. The output of the activation function is represented by the variable \(\hat{y}\), which is the output of each neuron.

Fig. 1 Rosenblatt’s perceptron
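The computation performed by a single neuron can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the example weights (chosen so that the neuron behaves like a logical AND) are hypothetical and are not taken from the experiments.

```python
import math


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through an activation."""
    potential = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(potential)


def step(z):
    """Threshold activation, as in a Rosenblatt-style perceptron."""
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    """Smooth alternative that maps the potential to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and bias that make the neuron act as a logical AND.
and_weights, and_bias = [1.0, 1.0], -1.5
y_hat = neuron_output([1.0, 1.0], and_weights, and_bias, step)
```

Swapping `step` for `sigmoid` changes only the output format of the neuron, which is the role attributed to the activation function above.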

The arrangement of neurons within a network and the way they connect to each other determines the architecture of a Neural Network, which, in combination with input and output layers, can have countless intermediate layers. Mathematically, Neural Networks can be defined as nonlinear statistical models, and their trainability is highly dependent on network architecture design choices, optimizer selection, variable initialization, and some other factors. As discussed in [22], the effect of each of these choices on the structure of the underlying loss surface is not clear. Because of the prohibitive cost of loss function evaluations (which require looping over all data points in the training set), studies in this field have remained predominantly theoretical.

As in models based on predetermined functions (Linear Regression, Logistic Regression), submitting input data to a Neural Network must generate a hypothesis, represented by the variable \(\hat{y}\). Comparing the hypothesis (\(\hat{y}\)) with previously labeled data containing the real values, represented by the variable y, yields a cost function. The cost function represents the distance between the values predicted by the algorithm and the real values over all elements submitted to training. A common choice for this function is the mean squared error. Once the distance between the real and the predicted values is known, it is necessary to traverse the network in the reverse direction and readjust the weights of each neuron in proportion to its contribution to the final error. This optimization is done by gradient descent, which minimizes the cost function while trying to find its global minimum.
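To make the loop of prediction, cost evaluation, and weight adjustment concrete, the sketch below applies it to the simplest possible model, a single linear unit \(\hat{y} = wx + b\) trained with the mean squared error. The data, the learning rate of 0.05, and the function names are assumptions chosen for illustration only.

```python
def mse(predictions, targets):
    """Mean squared error between predicted and real values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def gradient_step(w, b, xs, ys, learning_rate):
    """One gradient descent update of w and b for the model y_hat = w * x + b."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
    grad_b = 2.0 / n * sum(errors)
    return w - learning_rate * grad_w, b - learning_rate * grad_b


# Toy data generated from y = 2x + 1 (illustrative, not from the paper).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0
for _ in range(2000):
    w, b = gradient_step(w, b, xs, ys, learning_rate=0.05)

final_cost = mse([w * x + b for x in xs], ys)
```

In a multi-layer network the same idea applies, except that the gradients of the inner layers are obtained by propagating the error backwards through the network.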

1.3 Learning rate, gradient descent, optimization methods

Gradient descent is a crucial element, but also a challenging one when applied to real data. For the well-behaved functions commonly studied in mathematics, the global minimum is easily identified. However, when dealing with real data, we may be faced with loss surfaces such as the one in Fig. 2, where finding the global minimum is a great challenge.

Fig. 2 A complicated loss landscape image. Credits: https://www.cs.umd.edu/~tomg/projects/landscapes

One technique used in machine learning to deal with this problem is to control the speed of gradient descent through a parameter \(\alpha\) called the learning rate (written \(\eta\) in Formulas 1 to 3). The learning rate can help the optimization escape in the case of functions with several local minima, which sometimes make it difficult to minimize the cost function. It is a constant whose objective is to update the weights slowly and smoothly, thus avoiding large steps and chaotic behavior. In our experiments, we observed significant differences in the results when varying the learning rate. There are many optimizers, and their advantages and disadvantages are often related to the specific task. We can separate optimizers into two distinct categories: gradient descent optimizers and adaptive optimizers.
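The influence of the learning rate can be seen even on a one-dimensional toy problem. The sketch below minimizes \(f(\theta) = \theta^2\), whose gradient is \(2\theta\), with three learning rates; the function and the values 0.001, 0.1, and 1.1 are assumptions chosen only to show slow convergence, good convergence, and divergence, and they are unrelated to the rates used in the experiments.

```python
def descend(theta, learning_rate, steps):
    """Plain gradient descent on f(theta) = theta**2."""
    for _ in range(steps):
        gradient = 2.0 * theta
        theta = theta - learning_rate * gradient
    return theta


start = 5.0
too_small = descend(start, learning_rate=0.001, steps=50)  # barely moves
adequate = descend(start, learning_rate=0.1, steps=50)     # approaches the minimum at 0
too_large = descend(start, learning_rate=1.1, steps=50)    # each step overshoots and grows
```

A real loss surface such as the one in Fig. 2 is not convex, so this example illustrates only the step-size effect, not the behavior around local minima.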

According to [3], gradient descent has three variants, which differ in how much data is used to compute the gradient of the objective function and, consequently, in the time each update takes. They are:

  • Stochastic Gradient Descent In this variation, the network parameters are adjusted after each training example \((x_i, y_i)\) processed by the algorithm. Formula 1 is used for updating the weights:

    $$\begin{aligned} \theta =\theta -\eta \cdot \nabla _\theta J(\theta ,x_i; y_i). \end{aligned}$$
    (1)

    The major problem of this type of optimization is frequent weight updates with high variance, thus causing large fluctuations in the cost function.

  • Batch Gradient Descent (Vanilla) In this case, parameter adjustments are made after processing the entire data set at each epoch. An epoch is defined as a complete pass (round trip, or propagation and back propagation) through all examples in the training set. In other words, it computes the gradients \((\nabla )\) of the objective function J with respect to the parameters \(\theta\) for the entire training set, according to Formula 2:

    $$\begin{aligned} \theta =\theta -\eta \cdot \nabla _\theta J(\theta ). \end{aligned}$$
    (2)

    This process can be very slow and is therefore impractical for very large databases.

  • Mini-Batch Gradient Descent For this variation, the parameters are adjusted after processing subsets (mini-batches) of n examples. The mini-batch size is configured in the algorithm and commonly ranges between 50 and 256, but can vary for different applications. This is the most used method in training convolutional networks, since it reduces the variance of the updates, leading to a more stable convergence. Its update of the weights is based on Formula 3:

    $$\begin{aligned} \theta =\theta -\eta \cdot \nabla _\theta J(\theta ,x_{i:i+n}; y_{i:i+n}). \end{aligned}$$
    (3)

    According to [3], Mini-Batch Gradient Descent is typically the algorithm of choice when training a Neural Network, and the term Stochastic Gradient Descent (SGD) is usually also employed when mini-batches are used. In our tests, when we mention SGD, we are referring to Mini-Batch Gradient Descent.
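The three variants differ only in how many examples feed each update, so a single routine with a `batch_size` argument can illustrate all of them. The sketch below is a hedged toy example on a linear model with a squared-error cost; the data, the learning rate of 0.01, the number of epochs, and all names are assumptions made for illustration.

```python
import random


def run_epoch(w, b, data, learning_rate, batch_size, rng):
    """One complete pass over the data, updating after every batch."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        batch = shuffled[start:start + batch_size]
        errors = [(w * x + b) - y for x, y in batch]
        grad_w = 2.0 / len(batch) * sum(e * x for e, (x, _) in zip(errors, batch))
        grad_b = 2.0 / len(batch) * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def train(data, batch_size, epochs=3000, learning_rate=0.01, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        w, b = run_epoch(w, b, data, learning_rate, batch_size, rng)
    return w, b


# Toy data from y = 2x + 1 (illustrative only).
data = [(float(x), 2.0 * x + 1.0) for x in range(8)]

w_sgd, b_sgd = train(data, batch_size=1)              # Formula 1: one example per update
w_batch, b_batch = train(data, batch_size=len(data))  # Formula 2: whole set per update
w_mini, b_mini = train(data, batch_size=4)            # Formula 3: mini-batches of n = 4
```

With the same number of epochs, the batch variant performs one update per epoch while the other two perform several, which is one way to see why it is the slowest of the three on large data sets.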

With respect to adaptive optimizers, they have emerged to help mitigate some challenges of gradient descent algorithms applied to real data. In this work, we tried the types described below:

  • Adagrad This optimizer, proposed by [23] and available in Keras (a deep learning application programming interface (API) written in Python that runs on top of the machine learning platform TensorFlow), has parameter-specific learning rates that are adapted according to how often a parameter is updated during training. The more updates a parameter receives, the smaller the updates. The great advantage of the Adagrad optimizer is that it is not necessary to tune the learning rate manually, since it is adjusted automatically by the algorithm in the course of training; the only value configured in the Keras model is the initial learning rate. The disadvantage is that the sum of squared gradients accumulated in the denominator of the update rule keeps growing, which can make the learning rate shrink until it approaches zero and, from this moment on, the network stops learning.

  • RMSProp Basically, this optimizer was proposed to correct this drawback of Adagrad. Instead of accumulating all past squared gradients, it keeps an exponentially decaying average of them, which acts as a window over the recent gradient history. In this way, we do not have the problem of the vanishing learning rate.

  • Adaptive moment estimation (ADAM) This optimizer, proposed by [24], also computes an adaptive learning rate for each parameter, but now using running estimates of the mean and the (uncentered) variance of the gradients up to the current step.

  • Adamax Also proposed by [24], it is a variant of ADAM based on the infinity norm.
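The update rules behind the first three adaptive optimizers can be written compactly for a single scalar parameter. The sketch below follows their commonly published forms; it is not the Keras implementation, and the default hyperparameters, the `state` dictionary, and the helper `minimise` are assumptions made for illustration.

```python
import math


def adagrad_step(theta, grad, state, lr=0.1, eps=1e-8):
    """Adagrad: divide by the root of the ever-growing sum of squared gradients."""
    state["sum_sq"] = state.get("sum_sq", 0.0) + grad ** 2
    return theta - lr * grad / (math.sqrt(state["sum_sq"]) + eps)


def rmsprop_step(theta, grad, state, lr=0.01, rho=0.9, eps=1e-8):
    """RMSProp: replace the sum by an exponentially decaying average."""
    state["avg_sq"] = rho * state.get("avg_sq", 0.0) + (1 - rho) * grad ** 2
    return theta - lr * grad / (math.sqrt(state["avg_sq"]) + eps)


def adam_step(theta, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected running mean and uncentered variance of the gradient."""
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


def minimise(step_fn, theta=1.0, steps=500):
    """Minimise f(theta) = theta**2 (gradient 2 * theta) with the given rule."""
    state = {}
    for _ in range(steps):
        theta = step_fn(theta, 2.0 * theta, state)
    return theta


results = {fn.__name__: minimise(fn) for fn in (adagrad_step, rmsprop_step, adam_step)}
```

Calling `adagrad_step` repeatedly with a constant gradient yields steps of decreasing size, which is the shrinking learning rate described for Adagrad; the decaying average in `rmsprop_step` avoids that effect.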

Along with the use of optimizers, there is also the concept of momentum [25], a method proposed to accelerate SGD in relevant directions and dampen oscillations in non-relevant directions. This is done by adding a fraction of the update vector from the previous step to the current update vector.
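A minimal sketch of the momentum update follows, again on the toy function \(f(\theta) = \theta^2\). The fraction `gamma = 0.9` and the learning rate of 0.01 are assumed values for illustration; the momentum coefficient used in the experiments is not stated in the text.

```python
def momentum_step(theta, velocity, grad, learning_rate=0.01, gamma=0.9):
    """Add a fraction gamma of the previous update vector to the current one."""
    velocity = gamma * velocity + learning_rate * grad
    return theta - velocity, velocity


def run(use_momentum, steps=100, learning_rate=0.01, gamma=0.9):
    """Minimise f(theta) = theta**2 with or without momentum."""
    theta, velocity = 1.0, 0.0
    for _ in range(steps):
        grad = 2.0 * theta
        if use_momentum:
            theta, velocity = momentum_step(theta, velocity, grad, learning_rate, gamma)
        else:
            theta = theta - learning_rate * grad
    return theta


plain = run(use_momentum=False)
with_momentum = run(use_momentum=True)
```

After the same number of steps and with the same learning rate, the run with momentum ends closer to the minimum than the plain one, which is the acceleration effect described above.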

1.4 Pre-trained architectures with transfer learning

In recent years, the research community has proposed various CNN architectures, optimization methods, and regularization techniques. In addition to focusing on improving generalization and effectiveness on particular tasks, some work has also aimed to increase the efficiency of these networks by allowing them to train faster and operate on low-power devices. Pre-trained nets are based on the argument that a model pre-trained on a task other than the current problem can provide a very useful starting point for different tasks, i.e., features learned during training on the old task are useful for the new task. The great motivation for using a pre-trained network arises from the fact that the initial layers of a Neural Network are very difficult to train. This is because of the way the weights are optimized. That is, as we back-propagate the error calculated through the partial derivative, it is very likely that when we reach the initial layers (which are the last in this process), the weight updates will be insignificant to the point of compromising the network learning. On the other hand, considering the way convolutional networks work, we know that the initial layers are intended to represent only the most primitive elements, such as edge identification, for example. Later layers, on the other hand, will capture the particular elements specific to our data set. Therefore, considering these two aspects, we can conclude that the final layers, besides being more relevant because of the particularities specific to the database being trained, are easier and faster to train, which generates an immediate impact on the final results. Furthermore, training a neural net from scratch is a complicated task that requires time and robust computational resources, depending on the size and complexity of the database. 
Therefore, the concept of pre-trained networks is based on the possibility of training only the later layers while reusing, for the initial layers, the weights already learned by these networks in previous work. The more similar your data and problem are to the source data of the pre-trained net, the better the final result will tend to be for the specific data you are trying to classify.

Some winning architectures in the competitions promoted by the ImageNet platform have brought significant advances to the Neural Network training process. Among them, we can highlight:

  • Visual geometry group (VGG16) [26] This architecture, developed by the Visual Geometry Group at Oxford University, England, was one of the top-performing entries in the 2014 ImageNet competition (first place in localization and second in classification). The differential of VGG16 was its depth. Its 16 layers exceeded the 8 layers of the earlier AlexNet model, and it was this differential that increased its ability to learn complex features (i.e., by combining more nonlinear operations).

  • Residual network (ResNet) [27] Until 2014, it was empirically known that the learning capacity of a Neural Network increased by adding hidden layers to its architecture, as this created the power to solve nonlinear problems. However, this technique had a practical limit of about 30 layers. Beyond this limit, a problem known as exploding or vanishing gradients started to arise, which limited the learning capacity of the network. Another relevant factor observed was that the accuracy of convolutional networks does not increase in the same proportion as the number of layers; in some cases, the accuracy could even degrade. To overcome these limitations, ResNet emerged in 2015 using the concept of residual learning, in which a main path performs the convolutions, batch normalization, and rectified linear unit (ReLU) activation on the input feature maps. Connected to this main path is a parallel path (skip connection) that simply forwards the result of the previous layer without any changes. In this configuration, if the result of the function in the main path is zero, at least what was learned previously is transmitted forward. The ability of ResNet to mitigate the vanishing gradient problem in the optimization process, with very small error rates and low computational cost, makes this architecture one of the most effective tools for Neural Network processing today.

  • EfficientNet Family Launched by Google in 2019, the EfficientNet family [28] came with the proposal that balancing network depth, width, and resolution can lead to better performance. Unlike other networks, where mainly the depth is taken into account, the idea here is to scale these 3 factors (depth, width, and resolution) uniformly using fixed scaling coefficients. Since 2012, convolutional networks have become increasingly accurate as they have grown larger. While the 2014 winner GoogLeNet used around 6.8 million parameters, the 2015 winner ResNet in its version 50 uses around 23.6 million parameters. SENet, the 2017 winner, on the other hand, managed to achieve the impressive error rate of \(2.3\%\) using about 145 million parameters. EfficientNet, in its most robust version (B7), uses 66.6 million parameters. In our experiments, with the purpose of using basic computational resources, we were able to obtain results up to version B3. For larger versions, a larger machine processing capacity is required.
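The skip connection described for ResNet can be illustrated without any deep learning library. In the sketch below, `transform` stands for the main path (in a real network, convolutions with batch normalization) and is a hypothetical placeholder; the block adds its output to the unchanged input and applies ReLU.

```python
def relu(values):
    """Rectified linear unit applied element by element."""
    return [max(0.0, v) for v in values]


def residual_block(x, transform):
    """Return relu(F(x) + x): the skip path forwards x unchanged."""
    fx = transform(x)
    return relu([f + xi for f, xi in zip(fx, x)])


def zero_branch(x):
    """Hypothetical main path that has learned nothing and outputs zeros."""
    return [0.0 for _ in x]


features = [0.5, 1.2, 0.0]
out = residual_block(features, zero_branch)
```

When the main path outputs zero, the block returns its (non-negative) input unchanged, which is the property that lets previously learned features be transmitted forward.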

2 Methodology

The objective of this work is to solve a typical supervised machine learning problem, in which we apply image processing techniques with convolutional networks to multi-class classification. Our goal is to define, train, and compare different convolutional network architectures, as well as to show how certain adjustments in the algorithms strongly influence the results. The first step is the definition of the resources to be used in the training process of our classification models. Due to the large volume of data to be processed, we chose to use the Google Colab tool (https://colab.research.google.com). This collaborative tool allows the development of Python code in any web browser without the need to install or configure any application. To access it, you just need to create a Google account. In addition, Google Colab provides limited, free GPU access. Using Python, the proposal is to rely on the TensorFlow Keras library, an open source framework suitable for implementing deep learning solutions, which automatically performs the optimization of the model based on the gradient descent technique. Furthermore, this framework also enables the use of the pre-trained architectures chosen for our models: VGG16, ResNet, and the EfficientNet family. Table 1 contains a summary of the parameters used in our experiments.

Table 1 List of parameters applied in our experiments

2.1 Data set

The dataset was obtained from the Kaggle platform. It is a set of 102,150 images captured by cameras installed inside vehicles. Of this total, 22,424 images are intended for model training and validation. These images are labeled in 10 classes to be predicted, as shown in Fig. 3.

In addition to the set of images for training and validation, the platform provides 79,726 images that should be used for testing.

Fig. 3 Driver behavior classes

2.2 Data preparation

A preliminary analysis of the database showed an acceptable balance in the number of samples in each of the 10 classes to be classified. However, as reported by the Kaggle platform, the images were produced in a controlled environment with actors representing the various risk behaviors. Under these circumstances, a single driver can appear in more than one class of behavior within our dataset. To prevent the models from learning the faces of the actors rather than their behavior, we included commands in the algorithm to ensure that, when separating the training and validation data, the drivers in the training set are not the same as those in the validation set and vice versa. Each model evaluated in our analysis expects an input image with the specific dimensions recommended for the corresponding pre-trained architecture. Therefore, before training and classification, the images were standardized according to the recommendations of each pre-trained network.
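A driver-disjoint split of this kind can be sketched as follows. The record layout (driver identifier, image name, class label), the validation fraction of 20%, and the function name are assumptions made for illustration; the text does not specify how the split was actually coded.

```python
import random


def split_by_driver(records, validation_fraction=0.2, seed=42):
    """Split (driver_id, image_name, label) records so no driver is in both sets."""
    drivers = sorted({driver for driver, _, _ in records})
    rng = random.Random(seed)
    rng.shuffle(drivers)
    n_val = max(1, round(len(drivers) * validation_fraction))
    val_drivers = set(drivers[:n_val])
    train = [r for r in records if r[0] not in val_drivers]
    val = [r for r in records if r[0] in val_drivers]
    return train, val


# Hypothetical records: 5 drivers with 4 images each.
records = [
    (f"p{d:02d}", f"img_{d}_{i}.jpg", f"c{i % 10}")
    for d in range(5)
    for i in range(4)
]
train_set, val_set = split_by_driver(records)
```

Splitting at the level of drivers rather than images means the validation score reflects behavior on people the model has never seen, at the cost of a validation set whose size depends on how many images each selected driver has.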

2.3 Classification models

Since the goal of our models was to obtain an image classifier, we chose to use CNN. Training a CNN from scratch is a challenging task that requires time and robust computational resources, but our proposal was to work with the computational capacity available to an average end user. Therefore, the approach chosen was to use networks pre-trained on ImageNet [1], which is an object classification data set with over 14 million images. The logic behind this is that deep networks tend to learn, in their initial and intermediate layers, similar concepts that are common to most visual tasks. In this sense, the strategy is to use the weights already learned by these pre-trained networks and then tune the network for the objective task, which in our case is to identify the behavior of distracted drivers. To do this, we simply remove the last fully connected layer, responsible for classifying the classes proposed by ImageNet, and replace it with a new fully connected layer with as many outputs as there are classes to be classified (in our case, 10), activated by the Softmax operation.
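With the TensorFlow Keras library, replacing the classification layer can be sketched as below. This is an illustrative sketch and not the code used in the experiments: the choice of ResNet50, the 224 x 224 input size, global average pooling, the momentum value of 0.9, and the plain accuracy metric are all assumptions, and the helper names are hypothetical.

```python
import math

import tensorflow as tf


def build_classifier(num_classes=10, weights="imagenet", freeze_base=True):
    """Pre-trained backbone without its ImageNet head, plus a new softmax layer."""
    base = tf.keras.applications.ResNet50(
        include_top=False,          # drop the original 1000-class layer
        weights=weights,            # "imagenet" downloads pre-trained weights
        input_shape=(224, 224, 3),
        pooling="avg",              # global average pooling over feature maps
    )
    base.trainable = not freeze_base
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
    model = tf.keras.Model(inputs=base.input, outputs=outputs)
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


def count_trainable(model):
    """Total number of scalar values in the trainable weights."""
    return sum(math.prod(w.shape) for w in model.trainable_weights)
```

Calling `build_classifier()` with the defaults downloads the ImageNet weights and leaves only the new layer trainable; passing `freeze_base=False` corresponds to keeping the pre-trained layers unfrozen.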

2.4 Data augmentation

According to [29], an important aspect in the training of Neural Networks is that certain computational techniques require a large volume of data to provide reliable solutions that generalize to different inputs. It is in this context that Data Augmentation techniques for images are used. Such techniques apply transformations (rotations, cropping, scale changes, color shading, random image blending) to existing data in order to expand the data set and provide more examples for training. Besides limited data, class imbalance can be an additional obstacle. Considering that neural nets are mathematically based on probabilistic models, balance between classes is very important to ensure that the algorithm produces equally reliable classification rules for all classes, thus avoiding a lower recognition rate for the class with the fewest samples. In ethical terms, this is a basic precept of AI to avoid any biased or discriminatory results. However, such techniques need to be applied carefully. In our models, for example, where there are distinct classes for left and right positions, data augmentation techniques such as flipping, rotation, or translation of images are not recommended. In our preliminary tests, the most successful techniques were zooming and low-angle shear.

2.5 Metrics

We chose two different metrics to evaluate our models: Balanced Accuracy and Log Loss. Balanced accuracy, which we will use to measure the performance of the models in the training and validation phases, is generated from confusion matrix values using Formula 4. The higher the balanced accuracy, the better the performance of our model:

$$\begin{aligned} \textrm{AccB} = \frac{1}{2} \left( \frac{\mathrm{{TP}}}{\mathrm{{TP}}+\mathrm{{FN}}} + \frac{\mathrm{{TN}}}{\mathrm{{TN}}+\mathrm{{FP}}} \right) . \end{aligned}$$
(4)

where:

  • TP = True Positive;

  • FP = False Positive;

  • TN = True Negative;

  • FN = False Negative;
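Formula 4 is written for two classes. The sketch below implements it directly and also shows one common multi-class extension, the mean of the per-class recalls, which matches the definition used by scikit-learn's `balanced_accuracy_score`. Whether this exact extension was used for the 10 classes in the experiments is an assumption, and the function names are hypothetical.

```python
def balanced_accuracy_binary(tp, fn, tn, fp):
    """Formula 4: average of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)


def balanced_accuracy_multiclass(y_true, y_pred):
    """Mean of the recall obtained on each class present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Illustrative counts and labels, not results from the paper.
binary_score = balanced_accuracy_binary(tp=90, fn=10, tn=5, fp=5)
multi_score = balanced_accuracy_multiclass([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

Because every class contributes equally to the average, a class with few samples weighs as much as a class with many, which is why this metric is preferred over plain accuracy when classes are not perfectly balanced.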

The log loss is used to measure the performance of our models in the testing phase, and it is the same metric used by the Kaggle platform to choose the winner of the competition. The lower the log loss, the greater the generalization capacity of our model. Log loss is defined by Formula 5:

$$\begin{aligned} \mathrm{{log\; loss}} = -\frac{1}{N}\sum _{i=1}^{N}{\sum _{j=1}^{M}{y_{ij}\log (\hat{y}_{ij})}}, \end{aligned}$$
(5)

where:

  • N is the number of images in the test set;

  • M is the number of image class labels;

  • \(y_{ij}=1\) if the observation i belongs to the class j. Otherwise, \(y_{ij}=0\);

  • \(\hat{y}_{ij}\) is the predicted probability that the observation i belongs to the class j.
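Since \(y_{ij}\) is 1 only for the true class, the inner sum of Formula 5 reduces to the logarithm of the probability assigned to the correct class, with a leading minus sign so that the loss is positive. The sketch below computes it this way; the clipping constant `eps`, used to avoid taking the logarithm of zero, and the function name are assumptions for illustration.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Multi-class log loss.

    y_true: list of true class indices, one per observation.
    y_prob: list of rows, each holding the predicted probability of every class.
    """
    total = 0.0
    for label, row in zip(y_true, y_prob):
        p = min(max(row[label], eps), 1.0 - eps)  # keep log() finite
        total += math.log(p)
    return -total / len(y_true)


# A classifier that assigns probability 0.1 to each of 10 classes.
uniform_rows = [[0.1] * 10, [0.1] * 10]
uniform_loss = log_loss([0, 1], uniform_rows)
```

For 10 classes, the uniform prediction above scores \(\ln 10 \approx 2.303\), which gives a reference point for reading the log loss values reported in Sect. 3. A confident but wrong prediction is penalized far more heavily than an uncertain one.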

3 Results and discussions

We established our baseline after evaluating different CNN architectures and after an attempt to create a Neural Network from scratch, which was not successful due to its poor results. To train the predictive models, we organized the original training set into training and validation splits, taking care that the drivers in the training set did not appear in the validation set, as discussed in Sect. 2.2. The first architecture with good performance was VGG16, with its 13 convolutional layers and 3 dense layers at the end of the network, combined with its layers without trainable parameters, namely 5 max pooling layers and 2 dropout layers. In addition, we kept the pre-trained layers unfrozen. This setup resulted in a total of about 33 million parameters to be trained, and the result was a balanced accuracy of \(73.7\%\) in the validation phase. We applied this model to the test data and submitted the results to the Kaggle platform, obtaining a log loss of 2.42887. Considering that the winning model of the competition had a log loss below 0.1, we concluded that we still needed to greatly improve the generalization ability of our model.

Next, we evaluated the ResNet architecture. The first advantage noticed is that the runtime of ResNet was much shorter than that of VGG16. While VGG16 took about 3 h of execution time, the models with ResNet were completed in about one and a half hours. A possible explanation is that, while VGG16 has about 33.6 million parameters, ResNet has approximately 23.5 million. Moreover, if we freeze the already trained layers of the network and train only the final classification layer, which refines the network for our task, the total number of trainable parameters drops to 20,490, which means a large reduction in the number of mathematical operations performed by the algorithm. Our first result with ResNet was a balanced accuracy of \(67\%\). However, when we submitted the test results to the Kaggle platform, we achieved a log loss of 1.34406. In other words, although we obtained a lower balanced accuracy than with the VGG16 model, the performance of the model on the test data was much better.
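The figure of 20,490 trainable parameters is consistent with a single fully connected layer of 10 outputs placed on top of a 2048-dimensional feature vector, as the short calculation below shows. The value 2048 is an assumption (it is the size of the pooled output of a ResNet50-style backbone); the text does not state the feature dimension explicitly.

```python
def dense_parameters(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


# Assumed 2048 pooled features feeding the 10-class softmax layer.
head_params = dense_parameters(2048, 10)  # 2048 * 10 + 10
```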

Starting from this model, we decided to explore the ResNet architecture a bit further by varying optimizers and learning rates. We chose the optimizers ADAM, RMSProp, Adagrad, Adamax, and SGD+momentum. As for the learning rate, we chose 3 different values: 0.01, 0.001, and 0.0001. The balanced accuracies obtained for each optimizer are presented in Table 2.

Table 2 ResNet: balanced accuracy X learning rates

The results in Table 2 make clear the importance of choosing a learning rate that is adequate for the chosen optimizer. In the case of the ADAM optimizer, using a learning rate equal to 0.01, the balanced accuracy was \(28.1\%\), but when we changed the learning rate to 0.0001, the balanced accuracy jumped to \(84.5\%\). We noticed a similar behavior for the RMSProp and Adamax optimizers. The optimizers SGD+momentum and Adagrad obtained better results with a learning rate of 0.01.

Considering the top result for each optimizer, we applied these models to the test data to obtain the log loss; the results are presented in Table 3.

Table 3 ResNet: balanced accuracy X log loss

An interesting aspect of these models is that our highest balanced accuracy did not get the best result when subjected to the test data. Our model with ADAM optimizer and learning rate \(= 0.0001\) achieved a balanced accuracy of \(84.5\%\) and a log loss \(= 0.82788\). Our model with SGD+momentum optimizer and learning rate \(= 0.01\), despite a much lower balanced accuracy (\(76.7\%\)), achieved a much better generalization ability with a log loss \(= 0.55075\).

Considering the top three models, we evaluated their performance when applying data augmentation, using shear_range with a \(20\%\) distortion rate and zoom_range with random zooming of \(20\%\) on the images. As we can see in Table 4, the best result was obtained with the ADAM optimizer, with a balanced accuracy of \(81.2\%\) and a log loss of 0.45620. However, the model using SGD+momentum obtained very close results, with a balanced accuracy of \(80.8\%\) and a log loss of 0.53517.

Table 4 ResNet: balanced accuracy and log loss using data augmentation

Considering the results obtained with ResNet, we started exploring the EfficientNet family of networks. Since the ADAM and SGD+momentum optimizers presented good performance for ResNet, we decided to apply them with their respective learning rates (ADAM with LR \(=0.0001\) and SGD+momentum with LR \(=0.01\)). We also applied the same data augmentation techniques, and the results are presented in Table 5.

Table 5 EfficientNet: balanced accuracy and log loss X optimizers

As we can see in the results of Table 5, the more robust the version of the EfficientNet family, the better the performance, which is strongly related to the number of parameters that each version has. The higher the version, the larger the number of parameters and, therefore, the higher its capacity to deal with the complexity of the images. We can also see that, in all versions, the SGD+momentum optimizer performed better than the ADAM optimizer.

Another observation that is important to highlight is the different levels of complexity among classes. The results presented in Sect. 3 so far represent an average across the 10 classes. However, when looking at the results for each class, we see that some classes were more successful than others. As we can see in Table 6, classes C5 and C7 had the best performance, while classes C0, C9, and C8 had lower scores. One possible explanation for these results is that classes like C9 (talking to passenger), C8 (hair and makeup), and C0 (safe driving) represent similar body positions, which causes the models to confuse them. Classes C5 (operating the radio), C1 (texting, right), and C7 (reaching behind) represent very specific body positions or movements and are easier for the algorithms to identify.

Table 6 Match per class

4 Final considerations

In this work, we evaluated some pre-trained convolutional Neural Network architectures published in the literature to classify images representing the behavior of drivers behind the wheel. The experiments showed that transfer learning techniques can achieve good results by leveraging the generalization of initial layers trained in a different domain. We were also able to verify that adjusting some parameters, such as the choice of optimizer combined with an adequate learning rate, can bring significant improvements to the models, and for this, it is important to combine an understanding of the data being analyzed with the mathematical concepts that underlie the algorithms used. Another significant aspect, especially for the generalization of models, is the use of data augmentation, which aims to artificially increase the available data. In our experiments, probably because we received a database with reasonably balanced classes, this technique did not need to be explored much, but in cases where such balance in real data is not possible, it can improve the results. Still, our experiments were enough to show that the methods used (shear, zoom, rotation, etc.) need to be chosen carefully; otherwise, they may confuse the classifier. Considering the progression of results demonstrated by our models, it is likely that EfficientNetB4 or larger versions would achieve better results; however, one of our aims here was to find out how far we could go using free resources. Up to EfficientNetB3, we were able to run our models using the no-cost version of Google Colab. Starting with version B4, more memory is required, and thus we were unable to continue the process. Therefore, for future work, in addition to more computational resources, it will be necessary to investigate ways to handle the particularities of the classes with lower hit rates presented in Table 6.