1 Introduction

Identification of speech and voice in any form has various real-life applications. Gender identification from voice, in particular, has applications such as automatic gender identification in telephone calls, video classification and labelling, multimedia indexing, gender-based advertisements, speech emotion recognition, automated greetings and sound classification. Medicine suggestion and diagnosis by artificial intelligence (AI) assistants such as Amazon Alexa (e.g., on Echo Dot devices), driven by the voice of the user, can be better customized if the gender is recognized correctly. Similarly, speaker and emotion identification from voice can be used for crime investigation purposes by identifying criminals from their telephone conversations. It can also be used in next-generation security systems as a voice-print biometric signature. Recognizing the emotion in a voice can help AI agents set up a more personalized experience for their users, for example, playing motivational videos when sadness is detected in the user's voice, or selecting music based on the emotion detected in the user's voice. Previously, owing to the limited availability of labelled data and to computational constraints, semi-supervised learning algorithms came into existence as an alternative to the commonly used supervised learning methods. With the increase in data availability, however, these constraints are no longer valid in many research fields, and such fields are reaching new highs owing to the successful use of deep learning models.

The voice of an individual not only conveys the meaning of the spoken words, but also carries characteristics of the speaker. For example, the gender, identity, age or emotional state of the speaker can be inferred from that person's voice. A natural voice recognition system is the human ear, which is considered a very efficient instrument: it can effortlessly recognize the gender of a speaker from voice and speech using characteristics such as frequency and loudness. This very fact motivates researchers to ask whether a machine can be taught to do likewise. Hence, methods are devised to build an automatic system from the information extracted from voice data for the said purpose. Such algorithms capture acoustic qualities such as duration, intensity, frequency and distance. In this paper, we develop a computerized gender, emotion and speaker identification system that transforms the human voice from analog to digital form in order to extract useful features, and then constructs an identification model on those features.

The rest of the paper has been organized as follows: Sect. 2 provides a literature survey about the works done on this topic. The methodology used is described in Sect. 3. This is followed by the results and concluding remarks in Sects. 4 and 5, respectively.

2 Literature survey

In this section, some recently published significant works on gender, emotion and speaker identification are discussed.

2.1 Gender and emotion identification

In recent years, several researchers have leveraged classic supervised machine learning classifiers to predict gender and emotion from voice signals. Buyukyilmaz et al. [6] have reported results for a frequency-based baseline model, a logistic regression model, a classification and regression tree (CART) model, a random forest model, a boosted tree model, a support vector machine (SVM) model, an XGBoost model and a stacked model on the Voice Gender dataset; these results are compared with ours in Table 10.

The paper by Shafran et al. [39] has explored the problem of automated and accurate extraction of voice signatures from a speaker's voice. They have analysed two approaches for extracting speaker traits. The first approach focuses on general acoustic and prosodic features, while the second works on the choice of words used by the speaker. In the first approach, they have shown that a standard speech/non-speech hidden Markov model (HMM), conditioned on speaker traits and evaluated on cepstral and pitch features, achieves accuracy well above chance for all examined traits. The second approach, an SVM with rational kernels, has been applied to speech recognition lattices and has achieved an accuracy of approximately 81% in binary classification of emotion. They have used a speech corpus collected from a deployed customer-care application (HMIHY 0300).

Yacoub et al. [42] have reported results on emotion recognition from speech signals. They have mainly focused on extracting emotion-based features from short utterances, which are important for interactive voice response (IVR) applications; in particular, distinguishing neutral speech from anger is important in call centre applications. They have also worked on classifying emotions comprising boredom, happiness, sadness and cold anger, and have compared results obtained using decision trees, neural networks, SVMs and K-nearest neighbours (KNN). The database they used was created by the Linguistic Data Consortium at the University of Pennsylvania and consists of recordings of 8 actors expressing 15 emotions. Their results show that hot anger can be distinguished from neutral utterances with above 90% accuracy.

Several researchers have also implemented various unsupervised and semi-supervised machine learning models for the task. Li Wei et al. [24] have designed a feasible identification system for non-semantic voice information, namely language and gender. Their system is text and speaker independent, concatenates the language and gender identification models, and utilizes feasible acoustic features and an optimal model-training method. They have evaluated system performance, in terms of recognition rate, over four different Gaussian mixture model (GMM) training approaches. Their analysis shows that their method achieves accuracies of 85.25% and 93.2% for the language and gender recognition tasks, respectively.

In the work by Livieris et al. [27], the authors have presented a semi-supervised algorithm, named iCST-Voting, for gender identification using voice information. Their algorithm constitutes an ensemble of prominent self-labelled algorithms, namely self-training, tri-training and co-training, used as base learners. The contribution of their approach over the state of the art is that they utilize an ensemble of classifiers as base learners in place of the single learner used in other self-labelled algorithms, resulting in better classification performance than classical supervised algorithms.

However, the techniques discussed above yielded unsatisfactory results in accurately classifying gender and emotion from voice signals.

Hence, research models using a hybrid of neural network and machine learning models have also been utilized for the task. Zvarevashe et al. [44] have worked on a gender recognition method from voice that performs feature selection using the random forest recursive feature elimination (RF-RFE) algorithm together with the gradient boosting machine (GBM) algorithm. The GBM algorithm has been evaluated against a feedforward neural network and an extreme learning machine. The results obtained indicate that GBM outperforms all the algorithms compared against it in classification accuracy, and it has proved to be a suitable methodology for gender voice recognition.

Pahwa et al. [32] have proposed a gender recognition system using voice samples of 46 speakers and hybrid frameworks combining deep learning and machine learning models. Specifically, they have made use of one of the most popular and widely applied speech features, Mel-frequency cepstral coefficients (MFCCs), together with their first- and second-order derivatives. Their proposed model uses SVM and neural network classifiers in a stacking methodology.

Scherer et al. [36] have explained why studies investigating the ability of listeners to recognize different emotions from a wide variety of standardized vocal stimuli achieve accuracy levels that largely exceed chance expectation. The work summarizes a series of recognition studies conducted by Scherer, Banse and Wallbott in nine different countries across North America, Europe and Asia. They have made use of vocal emotion portrayals of content-free sentences produced by professional German actors.

Nasef et al. [31] in 2021 presented two self-attention-based models to deliver an end-to-end voice gender recognition system under unconstrained environments. The first model consists of a stack of six self-attention layers and a dense layer. The second model adds a set of convolution layers and six inception-residual blocks before the self-attention layers of the first model. Mel-frequency cepstral coefficients (MFCCs) were used as the representation of the audio data, and logistic regression was used for classification. The experiments were performed under unconstrained conditions such as background noise and different languages, accents and ages of the speakers.

2.2 Speaker recognition

In the past few years, researchers have leveraged several neural network frameworks to classify the speaker from voice signals.

Joon Son Chung et al. [9] have tried to solve the problem of speaker recognition under unconstrained noisy conditions. They have put forward a very large-scale audio-visual speaker recognition dataset crawled from open-source media. Using a fully automated pipeline, they have curated the dataset VoxCeleb2, which consists of over a million utterances by more than 6000 speakers. They have also developed and compared convolutional neural network (CNN) models and training pipelines that can effectively carry out speaker identification from voice under different conditions. Models trained on the VoxCeleb2 dataset have surpassed the performance of previous works on benchmark datasets by a considerable margin.

Zhenhao Ge et al. [15] have built a speaker recognition system based on neural networks whose parameters are optimized with grid search. They have made use of regularization parameters that are dynamically reduced in order to avoid termination of training in a local minimum. In speaker verification, performance has been improved by normalizing the prediction scores, rewarding speaker-identity indices with distinct peaks and penalizing weak ones with high scores. The corpus used was the TIMIT corpus with a sampling rate of 8 kHz, from which 200 male speakers were used to train and test classification performance. For validation purposes, the testing data were used as registered in-domain speakers, while data from the remaining 126 male speakers were used as out-of-domain test data.

In 2021, Biswas et al. [4] made use of an enhanced approach to identify a singer's voice using neural networks. Their model uses a set of songs to create the training data, and its efficiency was observed for detecting new and unknown singers. The neural network models used for classification were an ANN and an MLP; both were tested, and the better-performing model was used to predict the speakers/singers for the test songs.

Though neural network-based models provide state-of-the-art results, they are generally very computationally expensive, which can be a major problem in a resource-constrained environment.

Frameworks based on machine learning models have also been implemented. Bisio et al. [3] have developed an Android SPEech proCessing plaTform as smaRtphone Application (SPECTRA) for gender, speaker and language recognition using multiple SVM classifiers. Generally, every classifier is trained on a fixed training set, which makes it difficult to make decisions on new data. The authors have overcome this problem: every time a user acquires a new audio clip, all the feature vectors are extracted, sent to a web server for retraining, and passed through the SVMs for classification. However, the framework failed to provide accurate classification results.

Researchers have also produced several review papers comparing various methods for the speaker recognition task. Reynolds [34] has provided a brief overview of the area of speaker recognition, describing the various applications, the underlying techniques used and some indications of performance. Following this overview, he has discussed some of the advantages and loopholes of current speaker recognition technologies and outlined potential future trends in research, development and applications.

In their article, Hansen et al. [19] have reviewed various works on speaker recognition by machines imitating humans, emphasising the state-of-the-art speaker-modelling techniques that have emerged in the last decade. They discuss different aspects of automated architectures, including voice-activity detection (VAD), speaker models, features, datasets for standard evaluation and performance metrics. Speaker recognition is also discussed in terms of forensic speaker-recognition methods, illustrating from a neuroscience perspective how a naive listener performs this task.

The above discussion reveals that over the past few years, traditional machine learning approaches have been explored for the classification of emotion, gender and speaker from voice signals. However, a major drawback is that only a few models perform all three tasks single-handedly. This brings in the need for an efficient model that can serve all the said classification tasks and combine them into a single, effective system. There is also considerable ambiguity regarding which features should be extracted from the voice signals to feed the models, which calls for an efficient and accurate feature selection method applied to a common set of features extracted from the voice clips.

2.3 Motivation and contributions

The effectiveness of a classifier is greatly determined by the quality of the features available in the training data. Therefore, extracting useful voice features plays an important role in developing an efficient model, since the human voice carries much extraneous and non-useful information. Works on improving the efficiency of voice classification systems are abundant, particularly on efficient feature extraction from voice data, which includes identifying the language content of voice signal components and removing non-useful content such as background noise. Over the years, several methods have been developed for gender, emotion and speaker identification, and many of them have achieved very precise results. However, to the best of our knowledge, there are only a few works in which the researchers performed all three tasks single-handedly, since it is quite challenging to design a single model for gender, emotion and speaker identification that works on the same set of features and yields precise output. This inspires us to design a model that serves all these purposes in a single system. Here, we have extracted features from the voice clips, applied feature selection to the feature vector and fed the selected features to a customized neural network model. In a nutshell, our contributions are as follows:

  • We have converted the input voice signals into spectral images and extracted a common set of features for emotion, gender and speaker identification tasks.

  • We have used a wrapper-filter-based feature selection framework, where minimum redundancy maximum relevance (mRMR) selected features are fed to the meta-heuristic Mayfly algorithm combined with adaptive beta hill climbing (A\(\beta\)HC) algorithm for selecting the optimal feature subset.

  • The final classification model, called GRaNN (Golden Ratio-aided Neural Network), has a backbone of a deep multilayer perceptron (DMLP). As deciding the number of units for each layer in DMLP is a challenging issue, we have done this using the concept of Golden ratio.

  • We have evaluated our model on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [28] dataset and the Voice Gender dataset for emotion and gender identification purposes. We have also performed speaker identification on the RAVDESS dataset, which has not been done in the literature to date.

  • Our experimental results reveal that the proposed model performs better than many state-of-the art methods.

3 Proposed methodology

The flowchart of the whole pipeline is shown in Fig. 1. Our proposed methodology has four main modules—(a) Feature engineering, (b) Scaling, (c) Feature selection and (d) Classification. These modules are explained in the following subsections in detail.

Fig. 1
figure 1

Working pipeline of our proposed feature selection using GRaNN model

3.1 Feature engineering

It is worth mentioning that a large number of features [14] can be extracted from a voice signal, so we need to extract them meticulously and process them according to our requirements. For loading the audio files and preparing them for feature extraction, we have used the library Librosa [30]. The following acoustic properties of each voice in both datasets are measured and applied as features for all classification tasks:

  • meanfreq: Mean of frequencies (in kHz)

  • sd: Standard deviation of frequency

  • Q25: First quartile of frequency (in kHz)

  • Q75: Third quartile of frequency (in kHz)

  • kurt: Kurtosis

  • median: Median of frequencies (in kHz)

  • sp.ent: Spectral entropy

  • sfm: Spectral flatness (calculated by dividing the geometric mean of the power spectrum by the arithmetic mean of the power spectrum)

  • mode: Mode of frequencies

  • centroid: Frequency centroid

  • IQR: Interquartile range (in kHz)

  • skew: Skewness

  • peakf: Peak frequency (frequency having highest energy)

  • meanfun: Mean of all fundamental frequencies measured across acoustic signal

  • dfrange: Range of all dominant frequencies measured across acoustic signal

  • minfun: Minimum fundamental frequencies measured across acoustic signal

  • meandom: Mean of all dominant frequencies measured across acoustic signal

  • mindom: Minimum of all dominant frequencies measured across acoustic signal

  • maxdom: Maximum of all dominant frequencies measured across acoustic signal

  • maxfun: Maximum fundamental frequencies measured across acoustic signal

  • modindx: Modulation index (calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies, divided by the frequency range)

For the RAVDESS dataset, some additional features incorporated are listed below:

  • MFCCs (Mel Frequency Cepstral Coefficients)

  • Spectral bandwidth

  • Spectral contrast

  • Spectral rolloff

  • Tonnetz

Total number of features: For the Voice Gender dataset, we have used 21 features in total, and for the RAVDESS [28] dataset, we have extracted 1000 features, which are then passed through feature selection algorithms to get the optimal feature subset.
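To make the pipeline concrete, the following minimal Python sketch shows how several of the listed features can be computed with Librosa. It is not our exact extraction script; the frame parameters, the 40-coefficient MFCC setting and the aggregation by mean are illustrative assumptions.

```python
# Illustrative sketch of the feature extraction step using Librosa and NumPy.
import librosa
import numpy as np

def extract_features(path):
    y, sr = librosa.load(path, sr=None)          # keep native sampling rate
    S = np.abs(librosa.stft(y))                  # magnitude spectrogram

    # Frequency-domain summary statistics (meanfreq, sp.ent, ...)
    freqs = librosa.fft_frequencies(sr=sr)
    spec_energy = S.mean(axis=1)
    weights = spec_energy / spec_energy.sum()

    features = {
        "meanfreq": (freqs * weights).sum() / 1000.0,             # in kHz
        "sp.ent": -(weights * np.log2(weights + 1e-12)).sum(),    # spectral entropy
        "sfm": librosa.feature.spectral_flatness(y=y).mean(),
        "centroid": librosa.feature.spectral_centroid(y=y, sr=sr).mean(),
        # Additional features used for the RAVDESS dataset:
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).mean(axis=1),
        "bandwidth": librosa.feature.spectral_bandwidth(y=y, sr=sr).mean(),
        "contrast": librosa.feature.spectral_contrast(y=y, sr=sr).mean(axis=1),
        "rolloff": librosa.feature.spectral_rolloff(y=y, sr=sr).mean(),
        "tonnetz": librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr).mean(axis=1),
    }
    return features
```

Per-clip statistics such as these are concatenated into one feature vector per voice clip before the scaling step described next.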

3.2 Scaling

All the features are scaled and normalized before being fed to the classification model. For this purpose, we have used scikit-learn's MinMaxScaler. All features are passed through this transformer column-wise, and the output feature values lie in the range [0, 1]. These values are then fed to the classification model, which is discussed in the next section.
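A minimal example of this scaling step with scikit-learn is given below; the toy feature matrix is purely illustrative.

```python
# Column-wise min-max scaling into [0, 1] with scikit-learn.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.12, 200.0], [0.30, 150.0], [0.25, 280.0]])  # toy feature matrix
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)   # fit on training data only
# X_test_scaled = scaler.transform(X_test)       # reuse the same min/max at test time
```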

3.3 Feature selection

The feature engineering procedure may create some irrelevant and redundant features in the feature space. It is essential to remove such irrelevant attributes, which helps not only to enhance the overall classification accuracy but also to minimize the computational overhead. In the proposed work, we have used a two-stage feature selection approach: in the first stage, a filter method called mRMR is used to remove irrelevant and redundant features from the feature space, and in the second stage, a modified wrapper method, the Mayfly algorithm combined with the A\(\beta\)HC algorithm, is used. This approach selects the most relevant features from the original feature set so that the classification accuracy can be improved. The adopted approach is described in detail in the following subsections.

3.3.1 Minimum redundancy maximum relevance

The mRMR is a filter method, which uses a statistical or probabilistic approach to rank the feature attributes by an assigned score; features can then be selected or removed based on their scores/ranks. It tends to select features with a high correlation with the class (output) and a low correlation among themselves. The first idea behind mRMR is as follows. Consider a feature space S with m features \(X_i\) and an output class C. Maximum relevance is implemented using the mean mutual information of all features \(X_i\) with the class C. The mutual information between two random variables X and Y, with joint probability p(x, y) and individual (marginal) probabilities p(x) and p(y), can be stated formally as:

$$\begin{aligned} {\mathrm{Mutual Info}} = \sum _{x\in X}\sum _{y\in Y}p(x,y)\log {\frac{p(x,y)}{p(x)p(y)}} \end{aligned}$$
(1)

The measure of global relevance of the variables in S with respect to C is:

$$\begin{aligned} {\mathrm{GR}}_I(S) = {\frac{1}{|S|}}\sum _{X_i\in S}{\mathrm{Mutual Info}}(C, X_i) \end{aligned}$$
(2)

Here, \({\mathrm{Mutual Info}}(C, X_i)\) is the mutual information of feature \(X_i\) with class C. The second idea behind mRMR is that the maximum relevance criterion should be supplemented by minimum redundancy among the features: if only relevance is maximized, there is a high chance of increased dependency between the selected features. The minimum redundancy between features is measured as follows:

$$\begin{aligned} W_I(S) = {\frac{1}{|S|^2}}\sum _{X_i\in S}\sum _{X_j\in S}{\mathrm{Mutual Info}}(X_i, X_j) \end{aligned}$$
(3)

Here, \({\mathrm{Mutual Info}}(X_i, X_j)\) is the mutual information of feature \(X_i\) with \(X_j\). The mRMR criterion combines the above two constraints to optimize relevance and redundancy in order to obtain a good subset of features:

$$\begin{aligned} {\mathrm{max}}\phi ({\mathrm{GR}}_I(S),W_I(S)) \end{aligned}$$
(4)

where \(\phi = {\mathrm{GR}}_I(S) - W_I(S)\).
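As an illustration of how Eqs. (1)-(4) translate into a greedy selection procedure, the following hedged Python sketch ranks features using scikit-learn's mutual-information estimators. The paper does not prescribe a particular MI estimator, so this choice, and the greedy incremental form of the criterion, are assumptions.

```python
# A sketch of greedy mRMR selection based on Eqs. (2)-(4).
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k):
    n = X.shape[1]
    relevance = mutual_info_classif(X, y)        # MutualInfo(C, X_i), Eq. (2)
    selected, remaining = [], list(range(n))
    redundancy = np.zeros((n, n))                # cache of MutualInfo(X_i, X_j)
    for _ in range(k):
        best, best_score = None, -np.inf
        for j in remaining:
            red = np.mean([redundancy[j, s] for s in selected]) if selected else 0.0
            score = relevance[j] - red           # phi = GR_I - W_I, Eq. (4)
            if score > best_score:
                best, best_score = j, score
        # fill the redundancy cache for the newly selected feature, Eq. (3)
        redundancy[:, best] = mutual_info_regression(X, X[:, best])
        redundancy[best, :] = redundancy[:, best]
        selected.append(best)
        remaining.remove(best)
    return selected
```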

3.3.2 Mayfly algorithm

The Mayfly algorithm has been developed by Zervoudakis and Tsafarakis [43] and has been used for various research purposes. Liu [26] proposed a multiobjective version of the Mayfly algorithm to estimate the optimal weight coefficients for integrating the forecasting values of sub-series in a forecasting system for short-term wind speed. Liu and Chai [25] proposed a novel resonance demodulation frequency band selection method to diagnose bearing faults with the help of a modified Mayfly algorithm. Guo [17] used a modified Mayfly algorithm for optimizing the component size and operation strategy of a high-temperature proton exchange membrane fuel cell (PEMFC)-powered combined cooling, heat and power (CCHP) system. The Mayfly algorithm has also recently been used successfully in the domain of feature selection: Bhattacharyya [2] proposed a hybrid of the Mayfly algorithm and harmony search and used the combined algorithm for feature selection on various datasets. The Mayfly algorithm modifies existing algorithms such as PSO, which is likely to get stuck in a local optimum, especially in feature spaces of higher dimension, thereby achieving better performance across both small- and large-scale feature sets. The algorithm is based on the mating process of male and female mayfly insects. Most male adults assemble in a swarm a few metres above the water to attract females, performing a nuptial dance of characteristic up-and-down movements. Female mayflies fly to the swarm for mating. The mating process lasts only a few seconds, after which the females drop their eggs in the water and the cycle continues. The Mayfly algorithm has four components, given as follows (a minimal code sketch of the male update rules follows this list):

  • Movement of male mayflies: The new position acquired by a male mayfly from a present position of \(x_i^t\) is updated by:

    $$\begin{aligned} x_i^{t+1} = x_i^t + v_i^{t+1} \end{aligned}$$
    (5)

    where, \(v_i^{t+1}\) is the velocity of the male mayfly. The new velocity of male mayfly m in dimension j at time t having a previous velocity of \(v_{mj}^{t}\) is updated by:

    $$\begin{aligned} v_{mj}^{t+1} = g\times v_{mj}^t + a_1\times e^{-\beta r_p^2}\times ({\mathrm{opbest}}_{mj}-x_{mj}^t) + a_2\times e^{-\beta r_g^2}\times ({\mathrm{cpbest}}_{j}-x_{mj}^t) \end{aligned}$$
    (6)

    where \(x_{mj}^{t}\) is the position of the mayfly at time t, and the positive attraction constants \(a_1\) and \(a_2\) measure the contribution of the cognitive and social components, respectively. \(\beta\) is a fixed visibility coefficient used to limit a mayfly's visibility to others, and g is a gravitational coefficient. \({\mathrm{opbest}}_m\) is the best position that the particular mayfly m has ever visited, and \({\mathrm{cpbest}}_j\) is the jth component of the position of the best male mayfly. \(r_p\) is the Cartesian distance between \(x_m\) and \({\mathrm{opbest}}_m\), while \(r_g\) is the Cartesian distance between \(x_m\) and \({\mathrm{cpbest}}_j\). Since this is a minimization task, \({\mathrm{opbest}}_m\) is updated as:

    $$\begin{aligned} {\mathrm{opbest}}_{m} = {\left\{ \begin{array}{ll} x_{m}^{t+1} &{} {\mathrm{if}}\,\, {\mathrm{fit}}(x_{m}^{t+1}) < {\mathrm{fit}}({\mathrm{opbest}}_{m})\\ {\mathrm{opbest}}_{m} &{} {\mathrm{otherwise}} \end{array}\right. } \end{aligned}$$
    (7)

    where \({\mathrm{fit}}(x_m^{t})\) gives the fitness value of a position, i.e. the quality of a solution. A stochastic element is present in the algorithm, as it is necessary that the best mayflies keep performing the nuptial dance; this dance is represented mathematically as:

    $$\begin{aligned} v_{mj}^{t+1} = g \times v_{mj}^t + d \times rval \end{aligned}$$
    (8)

    where d is the nuptial dance coefficient and rval is a random value such that \({\mathrm{rval}} \in [-1, 1]\). Also, d is reduced from an initial value \(d_0\) by a random value \(\delta \in [0, 1]\) as \(d_{{\mathrm{itr}}} = d_0 \times \delta ^{{\mathrm{itr}}}\), where itr is the current iteration number.

  • Movement of female mayflies: The new position acquired by a female mayfly, which moves towards a male mayfly from a present position of \(y_i^t\), is updated by:

    $$\begin{aligned} y_i^{t+1} = y_i^t + v_i^{t+1} \end{aligned}$$
    (9)

    where, \(v_i^{t+1}\) is the velocity of the female mayfly. The velocity of a female mayfly is updated based on conditions since the attraction process between mayflies depends on the quality of the current solution. Therefore, the new velocity of mth female mayfly in jth component at time t having a previous velocity of \(v_{mj}^{t}\) is updated by:

    $$\begin{aligned} v_{mj}^{t+1} = {\left\{ \begin{array}{ll} g \times v_{mj}^{t} + a_2\times e^{-\beta r_g^2}\times (x_{mj}^{t} - y_{mj}^{t}) &{} {\mathrm{if}}\,\, {\mathrm{fit}}(y_m) > {\mathrm{fit}}(x_k)\\ g \times v_{mj}^{t} + wc \times {\mathrm{rval}} &{} {\mathrm{if}}\,\, {\mathrm{fit}}(y_m) \le {\mathrm{fit}}(x_k) \end{array}\right. } \end{aligned}$$
    (10)

    where \(x_{mj}^{t}\) is the jth component of the position of the male mayfly at time t; the positive attraction constant \(a_2\), the coefficients \(\beta\) and g are as defined in Eq. 6; rval is a random value with \({\mathrm{rval}} \in [-1, 1]\); and wc is a random-walk coefficient used when a female is not attracted by a male.

  • Crossover between mayflies: To perform the crossover operation, first a male and then a female mayfly are selected on the basis of their fitness values, i.e. the best male breeds with the best female. The following equation shows the two offspring produced by the crossover operation:

    $$\begin{aligned} {\mathrm{offspring}}_1&= {\mathrm{rval}}_{{\mathrm{of}}} \times {\mathrm{pmale}} + (1-{\mathrm{rval}}_{{\mathrm{of}}}) \times {\mathrm{pfemale}}\\ {\mathrm{offspring}}_2&= {\mathrm{rval}}_{{\mathrm{of}}} \times {\mathrm{pfemale}} + (1-{\mathrm{rval}}_{{\mathrm{of}}}) \times {\mathrm{pmale}} \end{aligned}$$
    (11)

    Here, pmale and pfemale are the parent male and female mayflies, respectively, and \({\mathrm{rval}}_{{\mathrm{of}}}\) is a stipulated value between 0 and 1. Initial velocities of 0 are set for the offspring.

  • Mutation of mayflies: To enhance the exploration ability of the algorithm, the newly generated offspring are mutated by simply adding a normally distributed random number to the offspring’s variable.
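The minimal Python sketch below illustrates the male-mayfly updates of Eqs. (5)-(8). All coefficient values are illustrative assumptions rather than the settings used in our experiments.

```python
# Sketch of the male-mayfly velocity/position updates (minimization task).
import numpy as np

g, a1, a2, beta, d = 0.8, 1.0, 1.5, 2.0, 0.1     # assumed coefficients

def male_update(x, v, opbest, cpbest, fit, best_fit):
    rp = np.linalg.norm(x - opbest)              # distance to personal best
    rg = np.linalg.norm(x - cpbest)              # distance to swarm best
    if fit > best_fit:                           # not the best male: Eq. (6)
        v = (g * v
             + a1 * np.exp(-beta * rp**2) * (opbest - x)
             + a2 * np.exp(-beta * rg**2) * (cpbest - x))
    else:                                        # best males keep dancing: Eq. (8)
        v = g * v + d * np.random.uniform(-1, 1, size=x.shape)
    return x + v, v                              # Eq. (5)
```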

3.3.3 Adaptive \(\beta\)-hill climbing

A\(\beta\)HC is a meta-heuristic algorithm that is an adaptive version of \(\beta\)HC, itself a modified version of the hill climbing (HC) algorithm. HC is a comprehensible form of local search, but it often gets stuck in local optima; \(\beta\)HC was proposed to overcome this limitation. Given a solution \(S = (s_1,s_2,\ldots ,s_D)\), \(\beta\)HC iteratively generates an improved solution \(S'' = (s''_1,s''_2,\ldots ,s''_D)\) based on the \(\beta\) operator and the neighbourhood operator N, which randomly chooses a neighbour \(S' = (s'_1,s'_2,\ldots ,s'_D)\) of the solution S defined by the equation:

$$\begin{aligned} s_i' = s_i \pm U(0,1) \times N\,\,\,\, {\mathrm{such}} \, {\mathrm{that}}\,\, i \in [1,D] \end{aligned}$$
(12)

where i is a randomly selected index and D is the dimension of the problem under consideration. The neighbourhood operator N denotes the maximum distance possible between the current solution and its neighbour. For the new solution, values are assigned either from the current solution or randomly from the corresponding range with probability \(\beta \in [0, 1]\), as given by:

$$\begin{aligned} s_i'' = {\left\{ \begin{array}{ll} s_r &{} {\mathrm{if}}\,\, {\mathrm{rnd}} \le \beta \\ s_i &{} {\mathrm{else}} \end{array}\right. } \end{aligned}$$
(13)

where rnd and \(s_r\) are random values, with \({\mathrm{rnd}} \in [0,1]\) and \(s_r\) drawn from the range of the corresponding dimension of the problem. Hence, the outcome of \(\beta\)HC largely depends on the values chosen for N and \(\beta\), which normally requires extensive experiments. A\(\beta\)HC was proposed to avoid this issue: the values of N and \(\beta\) are expressed as functions of the iteration number. N(k), i.e. N for the kth iteration, is defined by:

$$\begin{aligned} N(k) = 1 - {\frac{k^{\frac{1}{T}}}{{\mathrm{IterMax}}^{\frac{1}{T}}}} \end{aligned}$$
(14)

where, T is a constant and IterMax is the maximum number of iterations. Also, \(\beta (k)\), i.e. \(\beta\) for kth iteration, is defined by:

$$\begin{aligned} \beta (k) = \beta _{\mathrm{min}} +(\beta _{\mathrm{max}} - \beta _{\mathrm{min}}) \times {\frac{k}{\mathrm{IterMax}}} \end{aligned}$$
(15)

Now, S is replaced with \(S''\) if the generated neighbour \(S''\) is better than the current solution S.
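A compact sketch of A\(\beta\)HC as described by Eqs. (12)-(15) is given below. The hyperparameter values (T, \(\beta_{\mathrm{min}}\), \(\beta_{\mathrm{max}}\)) and the assumption of a [0, 1] search domain are ours.

```python
# Sketch of adaptive beta-hill climbing for a minimization objective.
import numpy as np

def adaptive_beta_hc(S, fitness, iter_max, T=4.0, beta_min=0.01, beta_max=0.1):
    D = len(S)
    for k in range(1, iter_max + 1):
        N = 1.0 - k**(1.0 / T) / iter_max**(1.0 / T)            # Eq. (14)
        beta = beta_min + (beta_max - beta_min) * k / iter_max  # Eq. (15)
        neighbour = S.copy()
        i = np.random.randint(D)
        neighbour[i] += np.random.choice([-1, 1]) * np.random.rand() * N  # Eq. (12)
        # beta operator: reset each dimension at random with probability beta, Eq. (13)
        mask = np.random.rand(D) <= beta
        neighbour[mask] = np.random.rand(mask.sum())            # assumes [0, 1] domain
        if fitness(neighbour) < fitness(S):                     # greedy acceptance
            S = neighbour
    return S
```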

3.3.4 The proposed Mayfly algorithm combined with A\(\beta\)HC

In the first stage, the filter method mRMR is used to filter out k features, where the value of k is chosen from the graphical analysis shown in Fig. 2. The figure illustrates the variation of classification accuracy with k for the RAVDESS dataset, on which the analysis for mRMR has been done. According to the graphs in Fig. 2, 500 insignificant features are filtered out of the previously extracted features for both the gender and emotion classification tasks, and 600 for the speaker classification task; the remaining features are then used as input to the combination of the Mayfly and A\(\beta\)HC algorithms for each classification task.

Fig. 2
figure 2

Graph used to show the selection of k value for mRMR for gender, emotion and speaker recognition tasks on RAVDESS dataset

The proposed wrapper method is presented in Algorithm 1. Since feature selection operates in a binary feature space, the continuous search space must be converted to a binary one. The S-shaped transfer function is used to achieve this conversion: it gives the probability of choosing a particular feature in a solution vector. It is a general and reliable function that has been used previously by many researchers. The S-shaped transfer function used in this algorithm is:

$$\begin{aligned} Sf(x) = \frac{1}{1+e^{-x}} \end{aligned}$$
(16)

The updated agent (or feature subset) during the binary conversion is given by:

$$\begin{aligned} Fs_d^{t+1}= {\left\{ \begin{array}{ll} 1\,\,\,\,{\mathrm{if}}\,\,Sf(Fs_d^{t+1}) > {\mathrm{randval}}\\ 0\,\,\,\,{\mathrm{if}}\,\,Sf(Fs_d^{t+1}) \le {\mathrm{randval}} \end{array}\right. } \end{aligned}$$
(17)

The fitness function used in the learning algorithm combines the classification error and the number of selected features, with the intent of simultaneously increasing the accuracy and reducing the number of features; for this purpose, classification error is used instead of accuracy. Combining the two terms yields a single objective function:

$$\begin{aligned} \downarrow {\mathrm{fit}} = \gamma \times E + (1 - \gamma ) \times \frac{|{\mathrm{Ns}}|}{|{\mathrm{Nf}}|} \end{aligned}$$
(18)

where \(|{\mathrm{Nf}}|\) is the number of features in the given dataset, \(|{\mathrm{Ns}}|\) is the number of features in the feature subset, E is the classification error and \(\gamma \in [0, 1]\) is a parameter that controls the relative contribution of the classification error and the number of features. In order to find the optimal feature subset, the method needs to find the global optimum, which requires proper exploration and exploitation of the search space. Hence, A\(\beta\)HC is used to enhance the exploitation ability of the Mayfly algorithm.
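The transfer, binarization and fitness computations of Eqs. (16)-(18) can be sketched as follows; the weighting \(\gamma = 0.99\) is an assumed value, not necessarily the one used in our experiments.

```python
# Sketch of binary conversion (Eqs. 16-17) and the fitness function (Eq. 18).
import numpy as np

def binarize(velocity):
    prob = 1.0 / (1.0 + np.exp(-velocity))                        # Eq. (16)
    return (prob > np.random.rand(*velocity.shape)).astype(int)   # Eq. (17)

def fitness(subset_mask, error_rate, gamma=0.99):
    # Weighted sum of classification error and fraction of features kept, Eq. (18)
    return gamma * error_rate + (1 - gamma) * subset_mask.sum() / subset_mask.size
```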

figure a

3.4 GRaNN: golden ratio-aided neural network

In the present work, a variant of MLP is used, which is deep as it consists of 15 hidden layers; the next subsection describes it in detail. It is to be noted that our main aim in this section revolves around the significance of the Golden ratio; we have made no changes to the architecture of the MLP itself. A framework is required here in which the number of units can be varied without greatly affecting the results. We have selected this architecture because of its lightweight characteristic and low computational complexity, thereby demonstrating the effectiveness of the Golden ratio.

3.4.1 Definition

MLP is a special class of network belonging to the class of feed-forward artificial neural network (ANN). The term MLP is used ambiguously, sometimes loosely to refer to any feed-forward ANN, sometimes strictly referring to networks that are composed of multiple layers of perceptrons with threshold activation.

A basic MLP unit (hypothetical representation shown in Fig. 3) consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses an activation function which is nonlinear. Generally, it applies a supervised learning technique called back propagation for training. Its multiple layers and nonlinear activation help it to distinguish data that are not separable by using linear techniques.

Fig. 3
figure 3

Hypothetical representation of deep MLP

3.5 Optimization of MLP layers

As already mentioned, deciding the number of nodes in the hidden layers of a DMLP is always challenging and generally requires a huge amount of trial and error. To overcome this, in this work, we have used the concept of the Golden ratio, which helps us set the number of nodes in the hidden layers of the DMLP used here. In this section, we first define the Golden ratio, then describe its use in various applications, specifically in an optimization technique called the Golden Section Search method. Finally, we describe how it is incorporated in our architecture for optimization purposes.

3.5.1 Golden ratio

In mathematics, two quantities are said to be in the Golden ratio if their ratio is the same as the ratio of their sum to the larger of the two quantities. Expressed algebraically, for quantities a and b with a > b > 0,

$$\begin{aligned} {\displaystyle {\frac{a+b}{a}}={\frac{a}{b}}\ {{\mathop {=}\limits ^{\text {def}}}}\ \varphi ,} \end{aligned}$$
(19)

where the letter phi \(({\displaystyle \varphi }\) or \({\displaystyle \phi } )\) depicts the Golden ratio. It is an irrational number that is a solution to the quadratic equation \({\displaystyle x^{2}-x-1=0}\), with a value of:

$$\begin{aligned} {\displaystyle \varphi ={\frac{1+{\sqrt{5}}}{2}}=1.6180339887\ldots .} \end{aligned}$$
(20)

3.5.2 Use of golden ratio

The Golden ratio possibly remains invariant in any number system, including those with a positive integral radix (base), a negative radix or a variable radix [38]. Specifically, digits chosen, even randomly or systematically, from consecutive digits or consecutive blocks of digits of the Golden ratio can be used as a source of uniformly distributed random numbers. Unlike the several quasi- and pseudo-random number generators based on various methods, no stringent procedure is needed here: we only have to select consecutive or non-consecutive blocks of digits from the stored Golden ratio, making this one of the fastest means of obtaining random numbers. This way of obtaining random sequences can be an efficient ingredient in solving numerous optimization problems, including NP-hard problems, by polynomial-time heuristics and randomized algorithms. Whether these random numbers sieved out of the Golden ratio are more uniformly distributed (quasi-random) or pseudo-random may be studied further, including their scope among other random number generators.

3.5.3 Golden section search method

The Golden Section Search method [18] is used to find the maximum or minimum of a unimodal function, i.e. a function containing only one maximum or minimum in an interval [a, b].

There are many methods for finding a local maximum or minimum, the equal-interval search method being one of the simplest. Referring to Fig. 4, an interval of width \(\epsilon\) is chosen, over which the maximum is assumed to occur. Then, \(f\left( \dfrac{a+b}{2}+\dfrac{\epsilon }{2}\right)\) and \(f\left( \dfrac{a+b}{2}-\dfrac{\epsilon }{2}\right)\) are computed.

If \(f\left( \dfrac{a+b}{2}+\dfrac{\epsilon }{2}\right) \ge f\left( \dfrac{a+b}{2}-\dfrac{\epsilon }{2}\right)\), then the maximum occurs in the interval, \(\left[ \dfrac{a+b}{2}-\dfrac{\epsilon }{2},b\right]\), else it occurs in \(\left[ a,\dfrac{a+b}{2}+\dfrac{\epsilon }{2}\right]\). This helps in reducing the interval of occurrence of the local maximum. These steps are repeated until the interval is reduced to the precision of desired choice.

Fig. 4
figure 4

Equal-interval search method

The equal-interval search method has its drawbacks: it becomes inefficient when the interval is small, resulting in an undesirably long time to find the maximum of a function. The Golden Section Search method was introduced to overcome this inefficiency.

As Fig. 5 shows, three points \(x_l\), \(x_1\) and \(x_u\) \((x_l< x_1 < x_u)\) are chosen along the X-axis with corresponding function values \(f(x_l)\), \(f(x_1)\) and \(f(x_u)\), respectively. Since \(f(x_1)>f(x_l)\) and \(f(x_1)>f(x_u)\), the maximum ought to lie between \(x_l\) and \(x_u\). Now a fourth point, denoted by \(x_2\), is chosen in the larger of the two intervals \([x_l,x_1]\) and \([x_1,x_u]\). Assuming that the interval \([x_l,x_1]\) is larger than \([x_1,x_u]\), \(x_2\) is chosen within \([x_l,x_1]\). If \(f(x_2)>f(x_1)\), then the new three points are \((x_l< x_2 < x_1)\); else, if \(f(x_2)<f(x_1)\), the new points are \((x_2< x_1 < x_u)\). This procedure is repeated until the distance between the outer points is sufficiently small to match the desired precision. The question then arises of how the intermediate points in the Golden Section Search are determined. Here comes the role of the Golden ratio [1].

Fig. 5
figure 5

Determination of the first intermediate point

First, the intermediate point \(x_1\) is chosen so as to equalize the ratio of the lengths as shown in Eq. (19), where a and b are the distances shown in Fig. 5. It should be noted that a+b is equal to the distance between the upper and lower boundary points \(x_u\) and \(x_l\).

The second intermediate point \(x_2\) is chosen similarly in the interval a to satisfy the following ratio in Eq. (21) where the distances of a and b are shown in Fig. 6.

$$\begin{aligned} \dfrac{b}{a}=\dfrac{a-b}{b} \end{aligned}$$
(21)
Fig. 6
figure 6

Determining the second intermediate point

The ratios in Eqs. (19) and (21) are equal and have a value equal to the Golden ratio.
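For concreteness, a short sketch of the Golden Section Search for the maximum of a unimodal function is given below; the tolerance and interface are our choices.

```python
# Golden Section Search for the maximum of a unimodal f on [a, b].
import math

def golden_section_max(f, a, b, tol=1e-6):
    inv_phi = (math.sqrt(5) - 1) / 2          # 1/phi, about 0.618
    x1 = b - inv_phi * (b - a)                # first intermediate point
    x2 = a + inv_phi * (b - a)                # second intermediate point
    while b - a > tol:
        if f(x1) < f(x2):                     # maximum lies in [x1, b]
            a, x1 = x1, x2
            x2 = a + inv_phi * (b - a)
        else:                                 # maximum lies in [a, x2]
            b, x2 = x2, x1
            x1 = b - inv_phi * (b - a)
    return (a + b) / 2
```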

This method finds the minima and maxima of functions, and the nonlinear functions underlying the MLP architecture require finding the same at various steps of learning. Hence, we have been motivated to experiment with and incorporate the Golden ratio in our model, and the results are quite promising.

3.5.4 Final architecture

The number of units in each layer of the DMLP has been set keeping the Golden ratio in mind. The model consists of 1 input layer, 1 output layer and 15 hidden layers; the usage of 15 hidden layers is purely experimental. Our main aim is to preserve the Golden ratio in deciding the number of hidden units, while ensuring that the model neither overfits due to too many hidden layers nor underfits due to too few. The magic numbers used here are 110, 68 and 42, corresponding to the numbers of units in the hidden layers. As we can see,

$$\begin{aligned} 110/68&= 1.61764\approx \phi \\ 68/42&= 1.61904\approx \phi \end{aligned}$$

These are multiples of consecutive Fibonacci numbers (110 = 2 × 55, 68 = 2 × 34, 42 = 2 × 21), and the ratio of consecutive Fibonacci numbers converges to the Golden ratio itself. This pattern in the hidden neurons brings uniformity throughout the network while preserving the Golden ratio; purely linearly increasing or decreasing numbers of units per hidden layer are avoided in the present work.
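A hedged Keras sketch of the resulting backbone is shown below. How the three golden-ratio unit counts are arranged across the 15 hidden layers, and the choice of ReLU activations and the Adam optimizer, are assumptions for illustration rather than our exact configuration.

```python
# Sketch of the GRaNN backbone: 15 hidden layers whose unit counts
# (110, 68, 42) follow the Golden ratio.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

def build_grann(n_features, n_classes):
    sizes = [110, 68, 42] * 5                 # 15 hidden layers (assumed arrangement)
    model = Sequential([Input(shape=(n_features,))])
    for units in sizes:
        model.add(Dense(units, activation="relu"))
    model.add(Dense(n_classes, activation="softmax"))  # 2, 8 or 24 classes
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The output layer size (2, 8 or 24) is the only part changed across the three tasks, as described in the next subsection.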

3.6 Output

The output layer has a Softmax activation; the class with the maximum activation (argmax) is predicted as the output class. For gender classification, the shape of the output layer is set to fit 2 label classes; for emotion identification, it is changed to fit 8 emotion classes; and for speaker identification, it is modified to fit 24 speaker classes.

The schematic diagram of our optimized GRaNN architecture used in the present work is shown in Fig. 7.

Fig. 7
figure 7

Optimized GRaNN architecture used in the present work

4 Experimental results

The organization of this section is as follows: Sect. 4.1 describes the datasets used and Sect. 4.2 presents the evaluation metrics used to evaluate our proposed GRaNN model, whereas Sects. 4.3, 4.4 and 4.5 report the detailed results of the gender, emotion and speaker identification tasks, respectively. The overall accuracies (rounded off to 2 decimal places) of the GRaNN model on both the RAVDESS and Voice Gender datasets are given in Table 4, and the results obtained using the feature selection framework for the RAVDESS and Voice Gender datasets are given in Tables 5 and 6, respectively. It is to be noted that we have applied our model to the gender classification problem on both the RAVDESS and Voice Gender datasets, whereas for the emotion identification problem, we have considered only the RAVDESS dataset. To test the effectiveness of using both the feature selection process and the Golden ratio in our proposed model, the classification accuracies attained before and after applying feature selection and the Golden ratio for the gender classification task are shown in Table 7. The gender identification results obtained by our GRaNN model are given in Table 8. We have also compared our gender classification performance with the semi-supervised algorithm K-RMS [13] in Table 9. A performance comparison of our proposed model with some existing methods on the Voice Gender dataset is shown in Table 10.

The performance (evaluated in terms of accuracy and number of features used) for the emotion identification problem using our feature selection framework is given in Table 11. The classification accuracies achieved before and after applying feature selection and the Golden ratio for the emotion identification task are shown in Table 12. The performance on emotion identification produced by our feature selection-based GRaNN model is shown in Table 13, whereas a comparison with some existing methods on the RAVDESS dataset is given in Table 14.

However, to check the versatility of our model, we have used the RAVDESS dataset for speaker identification by changing the shape of the output layer while keeping the rest of the model intact. The performance (measured in terms of accuracy and number of features used) for the speaker identification problem using our feature selection framework is given in Table 16. The classification accuracies produced before and after applying feature selection and the Golden ratio for the speaker identification task are shown in Table 17. The detailed speaker-wise performance is shown in Table 15, whereas the confusion matrix generated by the proposed feature selection-based GRaNN model is given in Table 19. We have also compared the performance on the speaker identification task with a logistic regression [41] model in Table 20, owing to the fact that it gives quite impressive results for multiclass classification tasks.

To convey the basis of the selection of the proposed model configuration, experiments have been conducted by varying the network configurations of hidden units on both datasets for all three classification tasks. The overall results are shown in Table 18.

4.1 Datasets used

The different tasks performed by our model for the datasets are given in Table 1.

Table 1 Usage stats of datasets for different purposes

4.1.1 RAVDESS dataset

The RAVDESS dataset [28] consists of 7356 data files. The database contains the voices of 24 professional actors, of whom 12 are female and 12 are male. They have vocalized two lexically matched statements in a neutral North American accent. The Song files of the dataset contain the emotions "Happy", "Sad", "Angry", "Calm" and "Fearful". The speech data include expressions of "Calm", "Happy", "Sad", "Angry", "Fearful", "Surprise" and "Disgust". Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. The emotion dataset contains 4948 instances. The gender and speaker identification work is based on the speech files of the dataset, which comprise 1440 files: 60 trials per actor with 24 actors. The detailed division is shown in Table 2.

Table 2 Training, validation and test data division of RAVDESS dataset for emotion, gender and speaker identification

4.1.2 Voice Gender dataset

This dataset [12] was created to serve the purpose of identifying a human speech signal as male or female, based on certain acoustic properties of the voice and speech. The dataset consists of 3168 voice samples recorded and collected from male and female speakers. The voice samples are pre-processed using speech signal analysis with the help of several voice processing packages. The analysed frequency range is 0 Hz-280 Hz, which covers the human vocal range. The detailed division is shown in Table 3.

Table 3 Training, validation and test data division of Voice Gender dataset for gender identification
Table 4 Comparison of accuracies for gender, emotion and speaker identification using our proposed GRaNN model after feature selection

4.2 Evaluation metrics used

For analysing the performance of our model on the datasets, we have calculated some well-known metrics, namely Accuracy, Precision, Recall [7] and F1-score values with corresponding class support division.
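These metrics are defined in the standard way in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN):

$$\begin{aligned} {\mathrm{Accuracy}} = \frac{TP+TN}{TP+TN+FP+FN},\quad {\mathrm{Precision}} = \frac{TP}{TP+FP},\quad {\mathrm{Recall}} = \frac{TP}{TP+FN},\quad F1 = \frac{2\times {\mathrm{Precision}}\times {\mathrm{Recall}}}{{\mathrm{Precision}}+{\mathrm{Recall}}} \end{aligned}$$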

4.3 Gender identification results

A comparison of the accuracy and number of features has been made for the gender classification task on the RAVDESS dataset among the model without any feature selection, with only Mayfly and A\(\beta\)HC, and with the mRMR, Mayfly and A\(\beta\)HC feature selection method, in Table 5. From the table, it can be clearly seen that after applying the mRMR, Mayfly and A\(\beta\)HC feature selection method, the accuracy increases while the number of features decreases, which is the favourable outcome of any feature selection method. It is also to be noted that with the use of the mRMR method, the number of features is greatly reduced along with a fair increase in accuracy. This indicates that reducing the dimension of the feature space using a filter method has a major positive impact on the achieved results.

Similarly, Table 6 presents a comparison of the accuracy and number of features on the Voice Gender dataset between the model without any feature selection and with only Mayfly and A\(\beta\)HC. As this is a small dataset of just 21 features, the mRMR filter method is not needed to reduce the dimensionality of the feature space. From Table 6, it can be noticed that there is a large reduction in the number of features after applying the Mayfly and A\(\beta\)HC feature selection method, along with a slight increase in the overall accuracy. This further validates the efficacy of our proposed model with the feature selection method.

Table 5 Results achieved before and after applying feature selection for gender classification task for RAVDESS dataset
Table 6 Results achieved before and after applying feature selection for gender classification task for Voice Gender dataset

Table 7 shows the classification accuracies attained before and after applying feature selection and the Golden ratio for the gender identification task. It can be seen from Table 7 that the combination of feature selection with Golden ratio-based optimization comes out the winner among all configurations, producing accuracies of 99.306% and 95.68% on the RAVDESS and Voice Gender datasets, respectively. Furthermore, the Golden ratio plays a more important role than feature selection, as it yields a larger increment in accuracy than using feature selection only: using the Golden ratio, the model attains accuracies of 98.611% and 95.32%, whereas using feature selection, it produces accuracies of 95.53% and 91.44% on the RAVDESS and Voice Gender datasets, respectively.

Table 7 Results achieved before and after applying feature selection and Golden ratio for gender classification task on both RAVDESS and Voice Gender datasets

Our feature selection-based GRaNN model performs quite well for gender identification on both datasets, as evident from Tables 8, 9 and 10. Table 8 shows that the overall gender identification accuracy on the RAVDESS dataset is almost 99%, which is quite good given that the speech data are free of noise. In the case of the Voice Gender dataset, the F1-measure is 95%, which is relatively low. This may be due to the presence of zero values in some instances of the features used for training, as compared to the RAVDESS dataset. It is to be noted that we have not applied any feature selection procedure to this dataset, because its creators had already performed it before finalizing the dataset; redundant features are almost absent, making the dataset suitable for passing directly through the classifier for training. Moreover, the size of the dataset is a limitation, and the presence of noise hinders the overall performance, as does the random presence of silent zones in audio segments.

In the present work, the performance of the gender classification task is also measured using the semi-supervised K-RMS [13] algorithm, and the overall results are tabulated in Table 9. It is observed from Table 9 that the K-RMS algorithm performs quite well considering that it works on unlabelled data, but our model, being a supervised algorithm, outperforms it. As seen in Table 10, most of the existing gender classification models have over-fitted to the data, resulting in poor test accuracies. In contrast, the train and test accuracies of the proposed model are approximately the same and quite high, indicating proper training and the absence of either over-fitting or under-fitting.

Table 8 Gender identification results obtained by GRaNN model on RAVDESS and Voice Gender datasets
Table 9 Gender identification results obtained by K-RMS algorithm on RAVDESS and Voice Gender datasets
Table 10 Comparison of accuracies on Voice Gender dataset using our GRaNN model with state-of-the art methods

4.4 Emotion identification results

For the RAVDESS dataset, a comparison of the accuracy and number of features has been made for the emotion classification task considering three cases: (a) without any feature selection method, (b) with only Mayfly and A\(\beta\)HC, and (c) with the mRMR, Mayfly and A\(\beta\)HC feature selection method; the results are given in Table 11. It can be seen from Table 11 that after applying the mRMR, Mayfly and A\(\beta\)HC feature selection method, the accuracy increases while the number of features decreases, which is the favourable outcome of any feature selection method. It is also to be noted that with the use of the mRMR method, the number of features is greatly reduced along with a fair increase in accuracy, indicating the major positive impact a filter method has on the achieved results by reducing the dimension of the feature space, even for a multiclass problem.

Table 11 Results achieved before and after applying feature selection for emotion classification for RAVDESS dataset

From Table 12, it can be said that the combination of feature selection with Golden ratio-based optimization comes out the winner among all configurations, with an accuracy of 95.27% on the RAVDESS dataset. Moreover, the Golden ratio concept plays a more important role than feature selection, as it yields better results than using feature selection only: an accuracy of 92.727% is achieved using the Golden ratio, whereas only 88.56% is achieved using feature selection.

Table 12 Results achieved before and after applying feature selection and Golden ratio for emotion classification task on RAVDESS dataset

Our feature selection-based GRaNN model also performs almost equally well for emotion identification on the RAVDESS dataset. All the emotion classes are classified with approximately the same precision, which is quite high, as seen in Table 13.

Table 13 Emotion identification results obtained by GRaNN model on RAVDESS dataset

Figure 8 shows the confusion matrix for the classification of emotions on the RAVDESS dataset using our GRaNN model. It can be observed from Fig. 8 that the emotions "Surprised", "Happy" and "Disgust" have lower F1-score values than the others. This is due to the fact that the emotion class "Calm" coincides with the "Happy" class in many cases. Similarly, the emotion class "Disgust" coincides with both the "Angry" and "Sad" classes, and the emotion class "Surprised" coincides with "Fearful" in terms of voice notes. This leads to lower precision and recall values, and hence comparatively poorer results. From the identification results for each emotion class presented in Table 13, it can be seen that the feature selection-based GRaNN model performs reasonably well for all of the emotion classes.
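
The per-class metrics in Table 13 and the matrix in Fig. 8 can be computed with standard tooling; a sketch is shown below, where y_true and y_pred are tiny placeholders for the actual ground-truth labels (classes 1-8) and the GRaNN predictions, not real results.

```python
# Sketch of how the per-class precision/recall/F1 values (Table 13) and
# the confusion matrix (Fig. 8) can be produced; the arrays below are
# placeholders, not results from the paper.
from sklearn.metrics import classification_report, confusion_matrix

emotions = ["Neutral", "Calm", "Happy", "Sad",
            "Angry", "Fearful", "Disgust", "Surprised"]
y_true = [1, 2, 3, 4, 5, 6, 7, 8, 3, 8]              # placeholder ground truth
y_pred = [1, 3, 3, 4, 5, 6, 5, 6, 3, 8]              # placeholder predictions
print(classification_report(y_true, y_pred, labels=range(1, 9),
                            target_names=emotions, zero_division=0))
print(confusion_matrix(y_true, y_pred))              # rows: true, columns: predicted
```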

Fig. 8 Confusion matrix for emotion identification on RAVDESS dataset using GRaNN model (Class 1 = "Neutral", Class 2 = "Calm", Class 3 = "Happy", Class 4 = "Sad", Class 5 = "Angry", Class 6 = "Fearful", Class 7 = "Disgust", Class 8 = "Surprised")

Figure 9 shows the variation of the model's training and validation accuracy over the number of training epochs. It can be inferred that the learning is quite gradual and involves very few spikes, justifying the robustness of our model against any noisy data present in the dataset.
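
A curve such as the one in Fig. 9 can be plotted directly from the training history; the sketch below assumes a Keras-style history object returned by model.fit with a validation split, which is an assumption rather than a detail stated here.

```python
# Minimal sketch for a Fig. 9-style accuracy curve; `history` is assumed
# to be the object returned by a Keras model.fit(..., validation_split=...).
import matplotlib.pyplot as plt

plt.plot(history.history["accuracy"], label="training accuracy")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```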

Fig. 9 Variation of training and validation accuracy with the number of epochs on RAVDESS dataset for the emotion classification problem

Figure 10 shows the training, validation and test accuracies obtained by decomposing the multiclass classification problem into a set of binary classification problems, keeping one emotion fixed and treating every other emotion as the other class. The emotion "Sad" yields comparatively worse results, the reason being the lower amplitude of the voice in the corresponding speech samples, leading to a loss of features.
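
A hedged sketch of this decomposition is given below: each emotion is binarized in turn and a separate binary classifier is trained and scored. The make_model argument is a hypothetical stand-in for a classifier constructor and is not part of our implementation.

```python
# Illustrative one-emotion-versus-rest decomposition; `make_model` is a
# hypothetical classifier factory, not a function from the paper.
import numpy as np
from sklearn.model_selection import train_test_split

def one_vs_rest_scores(X, y, classes, make_model):
    scores = {}
    y = np.asarray(y)
    for c in classes:
        y_bin = (y == c).astype(int)                 # 1 = target emotion, 0 = all others
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y_bin, test_size=0.2, stratify=y_bin, random_state=0)
        model = make_model()
        model.fit(X_tr, y_tr)
        scores[c] = model.score(X_te, y_te)          # per-emotion binary test accuracy
    return scores
```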

Fig. 10 Illustration of emotion identification accuracies obtained by decomposing the multiclass classification into one emotion versus others binary classification

Table 14 reports a comparison, in terms of accuracy, between previously proposed emotion classification models and our feature selection-based GRaNN model. It is clear from Table 14 that our model outperforms all the other models on the emotion recognition task.

Table 14 Comparison of our proposed GRaNN model with other previously proposed models on RAVDESS dataset for emotion identification problem

4.5 Speaker identification results

Our model also performs satisfactorily in classifying the speakers, although a huge drop in accuracy is noticed for the speaker identification task, as evident from Table 15. It is noticed from Table 15 that for speaker IDs 2, 4, 11, 13, 14, 18 and 19, the values of both precision and F1-score are quite low. This may be due to lower amplitude, failure to maintain constant pitch, volume and inflection, and continuous fluctuation in the vocal tract musculature. As the same set of features is used for all the tasks, the accuracy on the speaker identification task is not as high as expected. However, this has been done in order to check the flexibility of our proposed GRaNN model in handling other speech classification problems.

Table 15 Speaker identification results obtained by GRaNN model on RAVDESS dataset

A comparison of the accuracy and the number of features on the RAVDESS dataset has been performed among three methods for the speaker classification task: (a) without any feature selection method, (b) with only the Mayfly and A\(\beta\)HC feature selection methods, and (c) with the mRMR, Mayfly and A\(\beta\)HC feature selection methods. The results are shown in Table 16. Even though the accuracies are lower owing to the high number of target classes present in the dataset, it can still be seen that after applying the mRMR, Mayfly and A\(\beta\)HC feature selection methods, there is a comparative increase in accuracy along with a decrease in the number of features, which is the favourable outcome of any feature selection method. It is also to be noted from Table 16 that with the use of the mRMR method, the number of features is reduced hugely along with a fair increase in the accuracy. This indicates the major positive impact a filter method has on the achieved results by reducing the dimension of the feature space, even for a problem with such a high number of speaker classes.

From Table 17, it is to be noted that the combination of feature selection with Golden ratio-based optimization comes out as the winner among all configurations, with an accuracy of 67.172% for speaker classification on the RAVDESS dataset. Similar to the gender and emotion classification tasks, the use of the Golden ratio proves to be more significant than feature selection, as it shows better accuracy than using feature selection alone: the model with the Golden ratio produces an accuracy of 59.37%, whereas with feature selection alone, an accuracy of 56.8% is achieved.

Table 16 Results achieved before and after applying feature selection for speaker classification for RAVDESS dataset
Table 17 Results achieved before and after applying feature selection and Golden ratio for speaker identification task on RAVDESS dataset

The accuracies obtained while training, validating and testing the identification of speakers, obtained by decomposing the multiclass problem into binary classifications of one speaker versus the others, are shown in Fig. 11.

It can be seen from Fig. 11 that these accuracies are exceptionally high compared to the multiclass classification accuracies, showing the existence of feature overlaps between the voices of various speakers in terms of tonal features like pitch, entropy, etc. Our model performs relatively worse for speaker IDs 3, 5, 7 and 20 than for the others, the reasons being the low amplitude and frequencies of the voice in the corresponding speech samples, which were checked and verified manually.

Fig. 11 Speaker identification accuracies obtained by decomposing the multiclass classification into one speaker versus others binary classification

Fig. 12 Comparison of running times for various network configuration ratios on RAVDESS dataset

The running times of the model, consisting of training, validation and testing over 200 epochs for various ratio configurations of the model's hidden units on the RAVDESS dataset, are shown in Fig. 12. The accuracies for all the configurations have a standard deviation of 1-2%. The lowest running time being that of the Golden ratio configuration shows that it enables more efficient and faster training and testing of our model.
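
For reference, running times like those in Fig. 12 can be collected with a simple harness such as the sketch below; build_and_run is a hypothetical helper standing in for one full 200-epoch train/validate/test cycle of a given configuration.

```python
# Illustrative timing harness for Fig. 12-style comparisons; build_and_run
# is an assumed helper, not a function from our implementation.
import time

for name in ["golden_ratio", "uniform_55", "increasing", "decreasing"]:
    start = time.perf_counter()
    build_and_run(name, epochs=200)    # one full train/validate/test cycle (assumed)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.1f} s")
```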

Table 18 Results for various network configurations of hidden units on both datasets

It can be observed from Table 18 that the Golden ratio has a significant effect on the accuracy of the overall model. For the strictly increasing or strictly decreasing hidden unit configurations, the maximum number of units in a hidden layer is kept less than the number of features, and the minimum number is kept greater than the number of classes. The maximum number of hidden layers satisfying these constraints in the given range is 7. For the configuration with a uniform number of units in all the hidden layers, a larger number of layers leads to better training; the number of units has been set to 55, keeping in mind the average number of units used in our experiments.
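
A minimal sketch of one plausible reading of these constraints is given below: successive hidden-layer widths shrink by the Golden ratio \(\phi \approx 1.618\), the widest layer stays below the number of features, the narrowest stays above the number of classes, and at most 7 layers are generated. The example feature and class counts are illustrative assumptions.

```python
# Sketch of Golden ratio-based hidden-layer sizing under the constraints
# described above; the input sizes are illustrative, not from the paper.
PHI = (1 + 5 ** 0.5) / 2                   # the Golden ratio, ~1.618

def golden_ratio_layers(n_features, n_classes, max_layers=7):
    width = n_features - 1                 # start just below the feature count
    layers = []
    while width > n_classes and len(layers) < max_layers:
        layers.append(int(width))
        width /= PHI                       # each layer is the previous one divided by phi
    return layers

print(golden_ratio_layers(n_features=120, n_classes=8))
# -> [119, 73, 45, 28, 17, 10]
```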

The confusion matrix generated using our GRaNN model for solving speaker identification problem on RAVDESS dataset is shown in Table 19.

Table 19 Confusion matrix for speaker identification problem using our GRaNN model

Moreover, the size constraints of the data contribute to this problem. Since this type of experiment has been performed for the first time on the RAVDESS dataset, no other state-of-the-art methods are available for comparison; hence, we have compared our performance with a logistic regression model (described in [41]) only, since the logistic regression model performs quite well in classification tasks. For this purpose, we have used the features before feature selection and have not included any kind of optimization. The results obtained using the logistic regression model (with the number of iterations set to 500) are shown in Table 20.
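
A sketch of such a baseline is shown below; X and y are assumed to hold the full pre-feature-selection feature matrix and the speaker labels, and the split parameters are illustrative choices rather than values stated here.

```python
# Logistic regression baseline with 500 iterations, as used for Table 20;
# X and y are assumed inputs, and the 80/20 split is an illustrative choice.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
clf = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```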

Table 20 Comparison of speaker identification results with logistic regression on RAVDESS dataset

5 Conclusion

In this paper, we have proposed a deep neural network model, called GRaNN, which is optimized using the concept of the Golden ratio, for the purpose of emotion, gender and speaker recognition from human speech. We have elaborated on the relevance and usage of the Golden ratio in optimization by discussing its use in other domains and the role it plays in our architecture. We have used feature engineering to convert the voice clips of the RAVDESS dataset into a set of features and have used the features already present in the Voice Gender dataset; these features are passed through feature selection algorithms and then fed to our model for classification. We have applied a wrapper-filter-based feature selection technique for this purpose, aiming for improved learning by the model. Though we have primarily aimed at the gender and emotion identification tasks, in order to check the versatility of our model, we have also exploited the RAVDESS dataset (comprising voices of 24 different actors) for the speaker identification task, for which we achieved encouraging results, as shown in Sect. 4. We have presented one versus rest binary classification accuracies for the speaker and emotion recognition tasks. We have compared our model with a semi-supervised model and other state-of-the-art models, and our model outperforms them in terms of classification accuracy.

The availability of larger datasets and the investigation of better feature extraction methods can lead to more accurate results in the future. Due to the absence of prior work on speaker identification on the RAVDESS dataset, we are unable to show an extensive comparison, but we have included the results for logistic regression. Applying other approaches for speaker identification on this dataset and checking their performances can therefore be considered a part of our future work. We can also consider the use of various other nature-inspired feature selection techniques and check their effectiveness. Since our experiments provide quite promising results, we intend to extend them by applying the proposed algorithm to other voice signal-based datasets for speaker identification.