Introduction

By definition, ionic liquids (ILs) are molten salts with melting points below 100 °C; thus, they remain liquid at and below room temperature. Low volatility and vapor pressure, satisfactory chemical stability, high conductivity, and good solvent behavior have given ILs several applications in various fields1. Photovoltaic cells2, batteries3, thermo-electrochemical cells4, water treatment technologies5, thermal energy storage devices6, carbon capture7, and other green utilizations8 are among these applications. The refractive index, an optical property, is investigated to analyze the purity of an IL or to obtain useful information about the behavior of molecules in a solution9. IUPAC has defined the refractive index as “the ratio of the speed of light in vacuum to that in a given medium”9, and it has recently gained attention in the quality control and characterization of ILs10. Because acquiring the physical properties of ILs through experiments is not efficient in terms of time and cost11, many researchers have in recent years developed general models that can predict the physical properties of diverse ILs12. In addition, it is not realistic to experimentally screen ILs with preferable properties, considering that the number of potential ILs is estimated to be on the order of 10813. Predicting these physical and thermodynamic properties of ILs can be performed using machine learning methods, which are increasingly advancing in research fields11. Researchers have for some time investigated applying machine learning methods to predict different properties of ILs14, such as viscosity15, electrical conductivity16, thermal conductivity17, gas solubility18, and surface tension19. The rapid improvement of machine learning has opened a new perspective for scientists12.

Many models have been proposed and used in the literature to predict the refractive index of ILs. Iglesias-Otero et al.20 suggested a correlation to determine the density of a binary IL, which can also be used to predict the refractive index through inverse prediction. The differences between the correlation and 131 experimental data points were less than 0.001. Another correlation was used by Koller et al.21 for 32 data points, and an average RMS deviation of 0.007 was reported. Likewise, Safdar et al.22 suggested a correlation equation using 35 data points, leading to an R2 > 0.99 and an average standard deviation of 0.023. In addition, Tong et al.23 used a semiempirical method with 25 data points, obtaining an R2 of 0.99 and an average standard deviation of 2.78 × 10–5. Similarly, Xu et al.24 presented a semiempirical method using 35 data points, resulting in an R2 of 0.99 and an average standard deviation of 4.15 × 10–5. Other researchers used group contribution methods to estimate the refractive index of ILs. Gardas et al.25 gathered 245 data points from the literature and obtained an AARD of 0.18% using a group contribution method. Almeida et al.26 also used a group contribution method utilizing 105 data points and found an AARD of 0.02%. Sattari et al.27 brought together a more extensive database of 931 data points and, using a group contribution method, found an AARD of 0.34%, an R2 of 0.964, and an RMSE of 9.97 × 10–3.

Researchers have also included machine learning methods in their predictions. Díaz-Rodríguez et al.28 applied artificial neural networks in the form of multilayer perceptrons (MLP) and compared their accuracy with multiple linear regression (MLR) models. After analyzing 72 data points and reporting the R2 and mean prediction error (MPE) of both models (R2MLP = 0.98, R2MLR = 0.76; MPEMLP = 0.24%, MPEMLR = 0.72%), they concluded that the MLP method was quite convincing in predicting the refractive indices of pure ILs, while the MLR model was not accurate enough for their system. In another study, Díaz-Rodríguez et al.29 used the MLP method with a dataset of 156 points to predict the refractive index of ILs. They found an R2 of more than 0.99 and an MPE of 0.02%, which were satisfactory. Furthermore, Díaz-Rodríguez et al.30 employed two MLP models and predicted the refractive index of different pure ILs with an MPE of less than 0.48% using 39 data points. In another study, Golzar et al.31 used 85 data points to predict the refractive index of ILs with an ANN, and the resulting R2 was very close to unity. Similarly, Cancilla et al.32 used 72 data points of ternary ILs to develop their ANN model and estimated the refractive indices with an MPE of 0.05%. In their next paper, Cancilla et al.33 expanded their dataset to 146 data points and utilized four models based on MLPs to estimate the refractive index of ILs with MPEs of less than 1%. More recently, researchers have generally tended to use larger databases. Soriano et al.34 employed a database including 752 data points of binary ILs in their ANN model, with a mean absolute error of 0.00783 and an overall average percentage error of 0.55%. Mesbah et al.12 used 362 data points to model ternary systems through two separate models, ANN and GEP. Their ANN model had an R2 of 0.9225, an MSE of 2.47 × 10–5, and an AARD% of 0.2773 in predicting the refractive index, while the error analysis of the refractive index correlation provided by GEP showed an R2 of 0.9765, an MSE of 7.20 × 10–6, and an AARD% of 0.1383. Kang et al.35 used a machine learning method called ELM to model 1194 data points and compared the results with those obtained from the MLR method. The R2 and AARD% values obtained by MLR were 0.841 and 0.855%, respectively, whereas they were 0.957 and 0.295%, respectively, for the ELM model; they therefore concluded that ELM was better than MLR at predicting the refractive indices. Another large database, consisting of 3147 data points, was used by Venkatraman et al.10 to model the refractive index of ILs with various machine learning models. They reported an MAE of less than 0.01 and an R2 of more than 0.85 across both test and training data. In a different study, Soroush et al.11 used an ANN model with 812 data points and obtained an R2 of 0.9993, an MSE of 6.91 × 10−7, and an AARD% of 0.04 for estimating the refractive index of various ternary IL systems. Wang et al.36 used an MLP model with 688 data points of different binary systems and achieved average error parameters of MSE = 8.45 × 10–6, R2 = 0.9905, and AARD = 2.42%.

Sattari et al.9 proposed a QSPR model for estimating the refractive index of ILs using 931 data points. Their statistical analysis showed an R2 of 0.935, an RMSE of 1.07 × 10−2, and an AARD of 0.51%. Other researchers combined machine learning methods with different models to obtain better results. Wang et al.37 developed and compared a Group Contribution-Artificial Neural Network (GC-ANN) model with a Group Contribution (GC) model utilizing 2138 data points. Since the AARDs of the GC-ANN and GC models were 0.179% and 0.628%, respectively, and their R2 values were 0.961 and 0.886, respectively, they concluded that the GC-ANN model gave the better outcome. In another study, Ding et al.38 gathered two datasets of refractive indices consisting of 3147 (first dataset) and 931 (second dataset) data points to build an XGBoost-assisted QSAR model capable of predicting the refractive index of a wide variety of ILs. They used two molecular fingerprints (MF), the Morgan fingerprint and the atom-pair fingerprint, as molecular descriptors. The errors of their model were RMSE = 0.017 and R2 = 0.782 for the first dataset using the Morgan fingerprint descriptor, and RMSE = 0.013 and R2 = 0.853 for the second dataset. With the atom-pair fingerprint, an RMSE of 0.016 and an R2 of 0.836 were obtained for the first dataset, and an RMSE of 0.022 and an R2 of 0.568 for the second dataset. Furthermore, Sun et al.13 used 3964 data points and four types of IL descriptors to develop their QSPR-XGBoost model and assess its predictive performance. They concluded that the QSPR model developed with a molecular fingerprint combined with molecular descriptors had the best accuracy, with an R2 of 0.951 and an RMSE of 0.0088. Recently, Baskin et al.39 predicted the refractive index of various ILs using a collection of 6443 data points. They utilized several machine learning methods to develop their QSPR models using different molecular descriptors. They first predicted the refractive indices at a single temperature and then repeated the estimation across 8 different temperatures. They concluded that the best ML methods for the one-temperature and eight-temperature investigations were ASNN and DNN, respectively. Additionally, the best representation for their QSPR model to predict the refractive index was CDK23 in both cases. The statistical outcome of their research was an R2 of 0.86, an RMSE of 0.016, and an MAE of 0.0081 for the single temperature, and an R2 of 0.922, an RMSE of 0.0112, and an MAE of 0.00725 when the temperature was varied over the specified range. As the literature review illustrates, the studies that used methods other than machine learning incorporated few ILs and data points in their modeling, whereas the studies that developed machine learning models with many data points achieved only modest accuracy. Further, using wavelength as an input to machine learning models has not yet been investigated.

This study uses six robust chemical structure-based machine learning models, namely XGBoost, LightGBM, CatBoost, CNN, Ada-DT, and Ada-SVM, to predict the refractive indices of an extensive database, with the temperature, wavelength, and chemical substructures of each data point as inputs. The database contains 6098 data points from 483 ILs. In addition, this study investigates how each chemical substructure, the temperature, and the wavelength affect the refractive index. The best model is identified through statistical analysis and graphical representation.

Data collection

The dataset used in this study was obtained from the NIST Ionic Liquids Database (SRD#147 v2.0)40,41. The extracted data comprise 6098 data points belonging to 483 different pure ILs. The temperatures at which these ILs were measured range from 278.15 to 368.1 K, and the wavelengths in the experimental data range from 430.1 to 822.7 nm; the corresponding refractive indices range from 1.335 to 1.7. Additionally, the molecular weights of the ILs used in this study range from 77.08 to 866.64 g/mol. Brief statistics of the data used in this study are presented in Table 1.

Table 1 Statistical characteristics of the gathered data.

Table S1 in the Supplementary Information presents all the ILs used in the present study with the ranges of their temperature, wavelength, and refractive index, as well as the number of corresponding data points.

Graphical demonstrations of the distribution of the database over temperature and wavelength are presented in Figs. 1 and 2. Figure 1 shows that most data points lie between 298.14 K and 303.14 K. Moreover, as shown in Fig. 2, the majority of the data points were measured at a wavelength of 589.3 nm, the sodium D line. The scarcity of other wavelengths should not be interpreted to mean that this parameter plays a negligible role in determining the refractive index. As the literature states, the pressure, temperature, composition, and light-source wavelength are the variables that correlate with the refractive indices of liquid mixtures42. Cauchy’s equation and the Sellmeier equation, used by Guo et al.43 and Arosa et al.1, respectively, express the empirical relation between wavelength and refractive index. Thus, the influence of wavelength on the refractive index of ionic liquids is well established in the literature.
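
To give a concrete sense of this wavelength dependence, the short sketch below evaluates Cauchy’s empirical form n(λ) = A + B/λ² + C/λ⁴; the coefficient values are purely illustrative assumptions and are not fitted to any IL in this work.

```python
def cauchy_refractive_index(wavelength_nm, A, B, C=0.0):
    """Cauchy's empirical dispersion relation n(lambda) = A + B/lambda^2 + C/lambda^4.

    A, B, C are material-specific coefficients; the values passed below are
    placeholders for illustration, not fitted constants.
    """
    lam_um = wavelength_nm / 1000.0  # convert nm to micrometres, the usual convention
    return A + B / lam_um**2 + C / lam_um**4

# Example: evaluate at the sodium D line (589.3 nm) with placeholder coefficients
print(cauchy_refractive_index(589.3, A=1.40, B=0.005))
```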

Figure 1
figure 1

Dispersion of input data points over the temperature range.

Figure 2
figure 2

Wavelength allocation of the input data.

Each ionic liquid consists of an anionic and a cationic family. These family names and the number of ILs containing them are summarized in Table 2. The cationic and anionic family combinations used in the current study, with the number of ILs comprising them, are displayed in Fig. 3. It can be seen that the most frequent ionic liquid family combination in the database is [im][NTf2].

Table 2 Cation and anion family names, their code, and their number used in this paper.
Figure 3
figure 3

The number of the used ILs regarding their cation and anion families combinations.

Modeling

Modeling procedure

The process of modeling, starting with data gathering and ending with results analysis, is shown in the flowchart in Fig. 4. The models used in this study are CatBoost, XGBoost, LightGBM, Ada-DT, CNN, and Ada-SVM. The following sections present detailed descriptions of these models, their hyperparameters, and their inputs.

Figure 4
figure 4

Flowchart of this research’s steps.

Model development

XGBoost

XGBoost is a scalable tree boosting system that is one of the most successful and broadly used machine learning methods. The model class of XGBoost is the “decision tree ensemble”, which comprises a set of classification and regression trees (CARTs). Because a single tree is generally not accurate enough in practice, an ensemble model that sums the predictions of multiple trees is commonly used44. The model can be written as Eq. (1):

$${\widehat{y}}_{i}=\sum_{k=1}^{K}{f}_{k}\left({x}_{i}\right), {f}_{k}\in \mathcal{F},$$
(1)

where \(K\) is the number of trees, \({f}_{k}\) is a function in the regression tree space \(\mathcal{F}\), and \(\mathcal{F}\) is the set of all possible CARTs. To train the model, the objective function is defined and minimized.

$$L=\sum_{i}l\left({\widehat{y}}_{i},{y}_{i}\right)+\sum_{k}\Omega \left({f}_{k}\right),$$
(2)

where \(\Omega \left({f}_{k}\right)=\gamma T+\frac{1}{2}\lambda {\Vert w\Vert }^{2}\), \(T\) is the number of leaves, and \(w\) is the vector of leaf weights. The first term of Eq. (2) indicates the training loss, and the second one is the regularization term, which controls the complexity of the model. The regularization term is necessary to avoid overfitting. Since it is intractable to learn all the trees at once, XGBoost uses an additive training strategy. Therefore, the prediction of the i-th instance at the t-th iteration is substituted into the objective function:

$$L=\sum_{i}l\left({y}_{i},{\widehat{y}}_{i}^{\left(t-1\right)}+{f}_{t}\left({x}_{i}\right)\right)+\Omega \left({f}_{t}\right).$$
(3)

By adding a new tree at a time, new predictive values are generated step by step44.

$${\widehat{y}}_{i}^{\left(t\right)}={\widehat{y}}_{i}^{\left(t-1\right)}+{f}_{t}\left({x}_{i}\right).$$
(4)
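
A minimal sketch of how Eqs. (1)–(4) translate into practice with the xgboost Python package is given below; the placeholder data and hyperparameter values are illustrative assumptions, not the tuned settings reported in Table 3.

```python
import numpy as np
import xgboost as xgb

# Placeholder inputs: rows of [temperature, wavelength, 36 substructure counts]
X = np.random.rand(200, 38)
y = 1.3 + 0.4 * np.random.rand(200)   # placeholder refractive indices

model = xgb.XGBRegressor(
    n_estimators=500,    # K, the number of trees in Eq. (1)
    learning_rate=0.05,  # shrinkage applied to each new tree f_t in Eq. (4)
    max_depth=6,
    reg_lambda=1.0,      # lambda of the regularization term in Eq. (2)
    gamma=0.0,           # gamma penalty per leaf in Eq. (2)
)
model.fit(X, y)          # additive training: trees are added one iteration at a time
y_hat = model.predict(X)
```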

LightGBM

Another algorithm that, like XGBoost, uses the GBDT framework is LightGBM. The primary purpose of this method is to increase computational efficiency so that the prediction problem can be solved with less effort45. LightGBM has two features that help solve the problem more cost-effectively: a histogram-based decision tree algorithm and a leaf-wise growth strategy. Unlike XGBoost and many other boosting tools, which use pre-sort-based algorithms for decision tree learning, LightGBM uses histogram-based algorithms. In a histogram-based decision tree algorithm, floating-point feature values are discretized into bins that are used to construct the histogram. Once the histogram has accumulated the gradients and sample counts within each bin, the optimal split point can be found from the discrete values of the histogram. As shown in Fig. 5, in a level-wise growth approach, the leaves on each layer are split at the same time. This strategy is inefficient in terms of memory consumption because many leaves have low information gain, and it is unnecessary to search and split them. The leaf-wise growth approach, instead, splits only the leaf with the largest information gain at each step. This strategy reduces memory usage and speeds up training45,46.

Figure 5
figure 5

Level-wise and leaf-wise tree growth comparison46.
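
The sketch below shows how these two features surface in the lightgbm Python API: max_bin sets the histogram discretization and num_leaves bounds the leaf-wise growth. The values shown are illustrative assumptions, not the tuned hyperparameters of this study.

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 38)              # placeholder inputs (T, wavelength, substructures)
y = 1.3 + 0.4 * np.random.rand(200)      # placeholder refractive indices

model = lgb.LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_bin=255,      # number of bins used by the histogram-based tree algorithm
    num_leaves=31,    # cap on leaves per tree under leaf-wise (best-first) growth
)
model.fit(X, y)
```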

CatBoost

CatBoost is a gradient boosting method that primarily aims to decrease the prediction shift during data training47. The prediction shift arises from a particular type of target leakage present in all implementations of gradient boosting algorithms48. One of the benefits of CatBoost is that it uses an innovative algorithm to convert categorical features into numerical ones. It also combines categorical features to exploit the connections between them, enriching the feature dimensions. In addition, it employs a symmetric tree model to overcome the overfitting problem, so the algorithm becomes more accurate and better generalized49.
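
As an illustration only, the sketch below instantiates a CatBoost regressor with ordered boosting (which counteracts the prediction shift) and symmetric trees; none of the settings are the tuned values used in this study.

```python
import numpy as np
from catboost import CatBoostRegressor

X = np.random.rand(200, 38)              # placeholder inputs
y = 1.3 + 0.4 * np.random.rand(200)      # placeholder refractive indices

model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    boosting_type="Ordered",      # ordered boosting reduces the prediction shift
    grow_policy="SymmetricTree",  # symmetric (oblivious) trees help against overfitting
    verbose=False,
)
model.fit(X, y)
```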

Convolutional neural network

In the literature, CNN architectures can be found in a wide variety of variations; however, they are all based on similar fundamental principles. A typical CNN comprises three kinds of layers besides the input and output layers: convolutional layers, pooling layers, and fully-connected layers. Learning characteristic representations of the inputs is the purpose of the convolutional layer. By reducing the feature maps’ resolution, the pooling layer achieves shift-invariance. Some fully-connected layers may follow several convolutional and pooling layers. The final layer of a CNN is the output layer. The optimal parameters for a particular task are obtained by minimizing a loss function on that task. The CNN loss function is defined as follows50:

$$L=\frac{1}{N}\sum_{n=1}^{N}\ell\left({\varvec{\theta}};{{\varvec{y}}}^{\left(n\right)},{{\varvec{o}}}^{(n)}\right),$$
(5)

where \(N\) is the number of desired input–output pairs, \({\varvec{\theta}}\) denotes all the parameters of the CNN, \({{\varvec{y}}}^{\left(n\right)}\) is the target label corresponding to the n-th data point, and \({{\varvec{o}}}^{(n)}\) is the output of the CNN. The best fitting parameters can be obtained by minimizing the \(L\) function. In effect, training a CNN is a global optimization problem50.
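
A minimal Keras sketch of such a network for the 38 tabular inputs of this work is shown below; the layer sizes are assumptions for illustration and do not reproduce the CNN architecture actually trained here.

```python
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Input(shape=(38, 1)),                          # inputs reshaped to (features, 1)
    layers.Conv1D(32, kernel_size=3, activation="relu"),  # convolutional layer
    layers.MaxPooling1D(pool_size=2),                     # pooling layer
    layers.Flatten(),
    layers.Dense(64, activation="relu"),                  # fully-connected layer
    layers.Dense(1),                                      # output: predicted refractive index
])
cnn.compile(optimizer="adam", loss="mse")                 # loss L of Eq. (5) chosen as the MSE
# cnn.fit(X.reshape(-1, 38, 1), y, epochs=50, batch_size=32)
```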

AdaBoost

Generally, boosting is a solid approach for increasing the predictive power and accuracy of regression and classification models. AdaBoost is a boosting algorithm that proceeds in the following four steps51 (a combined sketch of the Ada-DT and Ada-SVM variants follows the list):

  1. The input, containing the number of cycles, a learning algorithm, and a set of training samples, is given to the AdaBoost algorithm.

  2. AdaBoost assigns identical weights to all training samples.

  3. It calls the learning algorithm to train a classifier on the weighted training samples and calculates the error. Then, it sets the weight of the component classifier and updates the weights of the training samples over a defined number of loops.

  4. This procedure continues for the specified number of cycles, and finally, AdaBoost linearly combines all the component classifiers into a single output.
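
The sketch below shows one way the two AdaBoost variants of this study (Ada-DT and Ada-SVM) could be assembled with scikit-learn’s AdaBoostRegressor (an AdaBoost.R2 implementation), using a decision tree and an SVM (both described in the next subsections) as component regressors. The base-learner settings and numbers of cycles are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

X = np.random.rand(200, 38)              # placeholder inputs
y = 1.3 + 0.4 * np.random.rand(200)      # placeholder refractive indices

# 'estimator' is named 'base_estimator' in scikit-learn versions before 1.2
ada_dt = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=8),  # Ada-DT
    n_estimators=100,                              # number of boosting cycles (step 1)
    learning_rate=0.5,
)
ada_svm = AdaBoostRegressor(
    estimator=SVR(C=10.0, epsilon=0.001),          # Ada-SVM
    n_estimators=50,
    learning_rate=0.5,
)
ada_dt.fit(X, y)    # samples are reweighted in each cycle (steps 2-4)
ada_svm.fit(X, y)
```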

Decision tree

Decision trees work by splitting a dataset sequentially into small segments until the target values within each segment match or until the dataset cannot be divided any further. The algorithm is greedy because it makes the best decision at the current step without considering global optimality52. There are different types of DT algorithms, but all of them use a similar structure, explained in the following steps53:

  1. Assign every training instance to the tree’s root and set this node as the current node.

  2. Find the split feature and value in accordance with the split criterion. The criterion might be the Gini coefficient, information gain, or information gain ratio.

  3. Use the split feature and threshold value to divide all data points in the current node.

  4. Designate all partitions of the current node as child nodes.

  5. Tag each child node containing instances of only one class as a leaf and return; otherwise, set the child node as the current node and return to step 2.

Support vector machine

Support Vector Machine is another well-known supervised learning algorithm that can be utilized for both regression and classification problems. This algorithm plots each data point as a single point in an n-dimensional space, where n is the number of inputs. The target of the SVM algorithm is to find the best line that separates the n-dimensional space into discrete classes so that new data points can be assigned to the appropriate category later. This line is called a hyperplane; the farther it is from the points of any class, the better the separation achieved. Thus, there is not only one hyperplane capable of separating the data, but the best one is the one with the largest margin between the two classes. It is also worth mentioning that the points closest to the hyperplane are called support vectors. Figure 6 shows two hyperplanes with small and maximal margins (H2 and H3) and one that fails to separate the classes correctly (H1). The decision function in the SVM algorithm is51:

$$f\left({\varvec{x}}\right)=\langle {\varvec{w}},\phi \left({\varvec{x}}\right)\rangle +b,$$
(6)

where \(b\) is the bias term, \(\phi \left({\varvec{x}}\right)\) is a mapping of \({\varvec{x}}\) from the input space to a higher-dimensional feature space, and \({\varvec{w}}\) is the weight vector. To obtain the optimal values of \({\varvec{w}}\) and \(b\), the following optimization problem has to be solved:

$$\mathrm{minimize}: g\left(w,\xi \right)=\frac{1}{2}{\Vert w\Vert }^{2}+C\sum_{i=1}^{n}{\xi }_{i},$$
(7)
$$\mathrm{subject \; to}: {y}_{i}\left(\langle w,\phi \left({x}_{i}\right)\rangle +b\right)\ge 1-{\xi }_{i}, \quad {\xi }_{i}\ge 0,$$
(8)

where the regularization parameter is \(C\) and \({\xi }_{i}\) is the i-th slack variable51.

Figure 6
figure 6

Three hyperplanes separating data: H1 that fails to classify correctly, H2 with a small margin, and H3 with the maximal margin.
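
As a small, self-contained illustration of the maximal-margin idea in Fig. 6 (unrelated to the refractive-index data), the snippet below fits a linear SVM classifier on a toy two-class set and inspects the hyperplane parameters and support vectors.

```python
import numpy as np
from sklearn.svm import SVC

X_toy = np.array([[1, 1], [2, 1], [1, 2], [6, 6], [7, 6], [6, 7]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)   # C is the regularization parameter of Eq. (7)
clf.fit(X_toy, y_toy)

print(clf.coef_, clf.intercept_)    # w and b of the hyperplane f(x) = <w, x> + b
print(clf.support_vectors_)         # the training points closest to the hyperplane
```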

Hyperparameters optimization

Adapting a machine learning model to different problems requires tuning its hyperparameters. As a result, choosing the appropriate hyperparameter configuration is a crucial step in the development of machine learning models, as it directly affects their performance54. Table 3 lists all the tuned hyperparameters together with the ranges over which they were examined.

Table 3 Hyperparameters with their corresponding ranges.
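
One common way to search ranges such as those in Table 3 is a cross-validated random search; the sketch below uses scikit-learn’s RandomizedSearchCV on an XGBoost regressor purely as an example, and the ranges shown are assumptions rather than the exact grid of this study.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

X = np.random.rand(200, 38)              # placeholder inputs
y = 1.3 + 0.4 * np.random.rand(200)      # placeholder refractive indices

param_distributions = {                  # illustrative ranges, not those of Table 3
    "n_estimators": [100, 300, 500, 1000],
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}
search = RandomizedSearchCV(
    XGBRegressor(), param_distributions, n_iter=20, cv=5,
    scoring="neg_root_mean_squared_error", random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```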

Input parameters

The independent inputs of our models are the temperature, the wavelength, and the chemical substructures of the ILs. Since the temperature has a well-known influence on the refractive index, it is included in the model inputs37. The wavelength, however, has not previously been considered as an input to machine learning models. Numerous studies focused on analyzing and determining the refractive index at a single wavelength, the sodium D line, which equals 589.3 nm. A material’s chromatic dispersion, which is the variation of the refractive index over a range of wavelengths, is left out of a study if the wavelength is excluded from the modeling. Furthermore, much information about the chemical composition and physical properties can be obtained by considering the wavelength as an input1.

While dispersion is minimized in certain applications (for instance, optical communication systems and imaging systems), it benefits others (for instance, dispersive prisms in laser cavities, compensation of dispersion introduced by other optical components, or optical spectrometers). In both cases, the refractive index dispersion of an optical device must be accurately characterized to ensure optimal performance55.

The chemical substructures of an ionic liquid were also used as model inputs, following the approach proposed by Valderrama et al.56. A list of the chemical substructures used in this study is presented in Table 4. Figure 7 shows an example of the process by which an ionic liquid can be fragmented into its substructures. In Fig. 7, the cation of the presented ionic liquid has been fragmented into two –CH3, one [>N=]+ (with rings), three =CH– (with rings), one >N– (with rings), and one –CH2– substructures. Likewise, the anion of the ionic liquid can be fragmented into four –F and one –B substructures. It is also worth mentioning that pressure was not included in the inputs because pressure changes have little effect on the refractive index57.

Table 4 A set of 36 chemical substructures utilized in this study.
Figure 7
figure 7

Fragmentation of an ionic liquid into its substructures.
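
To make the input construction concrete, the sketch below turns the Fig. 7 fragmentation into a fixed-length feature vector of substructure counts concatenated with temperature and wavelength. The helper function and the truncated substructure list are illustrative assumptions; Table 4 defines the full set of 36 substructures actually used.

```python
# Truncated, illustrative list of substructure labels (Table 4 contains 36 entries)
SUBSTRUCTURES = ["-CH3", "-CH2-", "[>N=]+ (ring)", "=CH- (ring)", ">N- (ring)", "-F", "-B"]

def build_input_vector(counts, temperature_K, wavelength_nm):
    """Return [T, wavelength, count of each substructure in a fixed order]."""
    return [temperature_K, wavelength_nm] + [counts.get(s, 0) for s in SUBSTRUCTURES]

# Counts taken from the Fig. 7 example (cation: two -CH3, one -CH2-, imidazolium ring
# atoms; anion: four -F and one -B)
example_counts = {"-CH3": 2, "-CH2-": 1, "[>N=]+ (ring)": 1, "=CH- (ring)": 3,
                  ">N- (ring)": 1, "-F": 4, "-B": 1}
x = build_input_vector(example_counts, temperature_K=298.15, wavelength_nm=589.3)
```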

Assessment of models

Statistical assessment

It is crucial to evaluate the proposed models’ accuracy by statistical and graphical analysis of the results. Statistical model performance is assessed through a set of metrics: average absolute percent relative error (AAPRE), coefficient of determination (R2), root mean square error (RMSE), and mean absolute error (MAE), defined in Eqs. (9)–(12).

$$AAPRE=\frac{1}{N}\sum_{i=1}^{N}\left|\frac{{x}_{i,exp}-{x}_{i,pred}}{{x}_{i,exp}}\right|\times 100,$$
(9)
$${R}^{2}=1-\frac{\sum_{i=1}^{N}{\left({x}_{i,exp}-{x}_{i,pred}\right)}^{2}}{\sum_{i=1}^{N}{\left({\overline{x} }_{exp}-{x}_{i,exp}\right)}^{2}},$$
(10)
$$RMSE=\sqrt{\frac{1}{N}\sum_{i=1}^{N}{\left({x}_{i,exp}-{x}_{i,pred}\right)}^{2}},$$
(11)
$$MAE=\frac{\sum_{i=1}^{N}\left|{x}_{i,exp}-{x}_{i,pred}\right|}{N},$$
(12)

where \(N\) is the number of data points, \({x}_{i,exp}\) is the i-th experimental value of the refractive index, \({x}_{i,pred}\) is the i-th predicted value of the refractive index, and \({\overline{x} }_{exp}\) is the average value of the experimental amount of the refractive index.
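
For clarity, straightforward NumPy implementations of Eqs. (9)–(12) are sketched below; they mirror the definitions above and are not taken from the code used in this study.

```python
import numpy as np

def aapre(y_exp, y_pred):                       # Eq. (9)
    return np.mean(np.abs((y_exp - y_pred) / y_exp)) * 100

def r_squared(y_exp, y_pred):                   # Eq. (10)
    ss_res = np.sum((y_exp - y_pred) ** 2)
    ss_tot = np.sum((y_exp - np.mean(y_exp)) ** 2)
    return 1 - ss_res / ss_tot

def rmse(y_exp, y_pred):                        # Eq. (11)
    return np.sqrt(np.mean((y_exp - y_pred) ** 2))

def mae(y_exp, y_pred):                         # Eq. (12)
    return np.mean(np.abs(y_exp - y_pred))
```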

Graphical assessment

In addition to statistical assessment, the models can be evaluated through visual plots, including cross plots, error distribution plots, box and whisker diagrams, heatmaps, and cumulative frequency plots. A cross plot shows how well the data are distributed around the ideal X = Y line, on which all points would lie if the prediction were perfect. An error distribution plot depicts the relative error versus the experimental value of the data; prediction accuracy decreases as data deviate from the Y = 0 line.

$$Relative \; Error=\frac{{x}_{i,exp}-{x}_{i,pred}}{{x}_{i,exp}}\times 100.$$
(13)

Box and whisker diagrams illustrate how well the model predicts the refractive index across the four quartiles of a specific anionic or cationic family. The heatmaps show the distribution of certain parameters with respect to the relative error. The cumulative frequency plot shows what fraction of the data has an absolute relative error below a given value; therefore, its x-axis is the absolute relative error.

$$Absolute \; Relative \; Error=\left|\frac{{x}_{i,exp}-{x}_{i,pred}}{{x}_{i,exp}}\right|\times 100,$$
(14)

where \({x}_{i,exp}\) is the i-th experimental value of the refractive index and \({x}_{i,pred}\) is the i-th predicted value of the refractive index.

Results and discussion

Statistical analysis

The results of this study demonstrate the effectiveness of machine learning methods in predicting the refractive indices of a wide range of ILs, which was the original purpose of this study. Six well-known machine learning models based on the chemical structures of numerous ILs were employed to achieve this aim. Although all the models performed satisfactorily, the most accurate ML method was CatBoost, followed by XGBoost, LightGBM, Ada-DT, CNN, and Ada-SVM. Table 5 presents the statistical results of the training, testing, and overall data for each model. The results showed remarkably low errors for most of the models. Among them, the CatBoost results, with an overall R2, AAPRE, MAE, and RMSE of 0.9973, 0.0545, 0.0008, and 0.0021, respectively, reveal the extraordinary power of this ML model to predict the refractive index of more than 6000 data points. The least accurate model is Ada-SVM, with an R2 of 0.9227, an AAPRE of 0.6618, an MAE of 0.0097, and an RMSE of 0.0115. Although Ada-SVM is the least accurate model used in the current study, comparison with other studies shows that it is still reasonably acceptable.

Table 5 The output statistics of each model.

A comprehensive comparison of the results of this study with those from the literature is presented in Table 6. According to Table 6, this research considered the largest number of ILs for predicting the refractive index, and the number of input data points was higher than in most studies. The current study is therefore comparable with that of Baskin et al.39 regarding the extensiveness of the input data. However, concerning the errors, this research showed much better results than Baskin et al.’s study39. Unlike previous ML studies, our study includes the wavelength as a model input, taking another essential factor into account for the refractive index prediction. Additionally, a wide range of temperatures was used in our dataset, giving this study an edge over that of Baskin et al.39.

Table 6 Detailed description of the reported results of the literature and the present study.

Figures 8 and 9 compare the R2 of the present study’s CatBoost model and the number of data points, respectively, with the pure ionic liquid studies from the literature. The superiority of the current research in terms of R2 and the number of data points is evident in Figs. 8 and 9, respectively. Figure 10 shows these two parameters together in one diagram. The number of data points indicates a study’s comprehensiveness, while the R2 is a criterion for assessing its accuracy. The upper right region of Fig. 10 is where the most accurate and comprehensive studies lie. While Baskin et al.39 utilized more data points in their research, their accuracy was lower than that of the current study. It is evident that the best accuracy was obtained in the current study.

Figure 8
figure 8

The R2 obtained from the CatBoost model of the current study and the literature.

Figure 9
figure 9

The number of data points used in the current study and the literature.

Figure 10
figure 10

Comparison of the R2 and the number of data points between the current study and the literature. The numbers refer to the following: 1: Safdar et al.22, 2: Tong et al.23, 3: Xu et al.24, 4: Sattari et al.27, 5: Díaz-Rodríguez et al.28, 6: Kang et al.35, 7: Venkatraman et al.10, 8: Sattari et al.9, 9: Wang et al.37, 10: Ding et al.38, 11: Ding et al.38, 12: Sun et al.13.

In addition to our six main models, we developed an auxiliary MLP model for a better comparison with the literature on pure ionic liquids, since several of those studies used MLP models. This research’s MLP model has two hidden layers with 4 and 2 neurons, respectively, and the transfer functions used in the layers are "tansig" for the first hidden layer, "logsig" for the second hidden layer, and "purelin" for the output layer. A comparison between the results of our MLP model and the literature is summarized in Table 7. A closer inspection of the table reveals that MLP models perform very well with smaller datasets. While a fully fair comparison of all the error metrics is difficult because the literature studies did not report every metric, our MLP’s results in Table 7 show generally acceptable errors. Compared with the other models, however, MLP may not be as accurate for larger datasets.

Table 7 Result comparison of the studies utilized the MLP model and our auxiliary MLP model.

Graphical analysis

Graphical analysis is provided to present the results in another unambiguous way. Figure 11 exhibits the deviation of the data from the ideal X = Y line; the closer the data get to this line, the better the prediction. A visual inspection of Fig. 11 confirms the accuracy of the models used and, again, the superior accuracy of the CatBoost model. The error distribution diagrams shown in Fig. 12 lay out the relative error of the predicted refractive index with respect to the experimental data. Again, the results are satisfactory since most points have a relative error of less than 2%. A visual comparison confirms that the best model is CatBoost and the worst is Ada-SVM.

Figure 11
figure 11

Cross plots of the six models.

Figure 12
figure 12

Error distribution plots of the six models.

In addition, the relative errors of the four quartiles of each cation and anion family using the CatBoost model are presented in Figs. 13 and 14, respectively. The box and whisker diagrams show that the refractive index is predicted accurately across all quartiles of both the cation and anion families, as indicated by their low relative errors. The central marks on the boxes indicate the median. The boxes show the second and third quartiles, and the whiskers and outliers illustrate the first and fourth quartiles of the data. Outlier points are not visible in the figures.

Figure 13
figure 13

Box and whisker diagram of the relative error against the different cation families. The diagram shows four quartiles without the outlier points.

Figure 14
figure 14

Box and whisker diagram of the relative error against the different anion families. The diagram shows four quartiles without the outlier points.

Figure 15 illustrates the dataset’s mean absolute relative errors for each cation–anion family pair using the CatBoost model. Most ionic liquid families have low mean absolute relative errors, while a lack of sufficient data points in some specific ionic liquid families resulted in higher, though still acceptable, mean absolute relative error values. The distribution of the relative error as a function of temperature using the CatBoost model is shown in Fig. 16. As expected, most data points are located near a relative error of zero; for example, 743 data points lie in the temperature range of 293.15–298.15 K, with relative errors between −0.12% and 0.30%. The diagram also shows that very few data points have large relative errors, and even the largest relative error is less than 5%, which is acceptable.

Figure 15
figure 15

Mean absolute relative errors of every cation–anion family pair.

Figure 16
figure 16

Relative error distribution as a function of temperature.

An additional informative diagram providing a suitable model comparison is shown in Fig. 17. As mentioned before, the CatBoost model performs better than the others in predicting the refractive index of ILs. The dashed line at an absolute relative error of 0.13% indicates that 90% of the data analyzed with our best model, CatBoost, have an absolute relative error of less than 0.13%. The diagram also illustrates that the most accurate model was CatBoost, followed by XGBoost, LightGBM, Ada-DT, CNN, and Ada-SVM.

Figure 17
figure 17

Cumulative frequency plot against absolute relative error.

Sensitivity analysis

Understanding the influence of each input on the output requires a sensitivity analysis of the results. Here, a relevancy factor was used, which is defined as follows58:

$$r\left({x}_{v},y\right)=\frac{\sum_{i=1}^{n}\left({x}_{v,i}-\overline{x }\right)\left({y}_{i}-\overline{y }\right)}{\sqrt{\sum_{i=1}^{n}{\left({x}_{v,i}-\overline{x }\right)}^{2}\sum_{i=1}^{n}{\left({y}_{i}-\overline{y }\right)}^{2}}},$$
(15)

where \({y}_{i}\) and \(\overline{y }\) are the i-th value and the average value of the predicted refractive index, respectively; and \({x}_{v,i}\) and \(\overline{x }\) are the i-th value and average value of the v-th input. The input with the highest absolute relevancy factor has the greatest impact on the output. Figure 18 shows the complete set of the models’ inputs and their corresponding relevancy factors for the CatBoost model. The colors and the names correspond from top to bottom to avoid confusion. As the figure depicts, the most substantial impact on the refractive index is generated by the –F and >C< substructures, with corresponding relevancy factors of −0.75 and −0.47, respectively. The notation “with rings” in Fig. 18 implies that the mentioned chemical substructure is located inside a ring in the chemical structure, as defined by Valderrama et al.56. In addition, a negative r factor means that the input decreases the refractive index, and vice versa. The absolute values of these factors are collected in Table 8, which clearly indicates that temperature and wavelength are not among the most dominant half of the factors influencing the refractive index.
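
Equation (15) is simply the Pearson correlation between an input column and the predicted refractive index; a direct NumPy sketch is given below, with the placeholder array names chosen for illustration.

```python
import numpy as np

def relevancy_factor(x_v, y):
    """Pearson correlation (Eq. 15) between one input column x_v and the predictions y."""
    x_c, y_c = x_v - x_v.mean(), y - y.mean()
    return np.sum(x_c * y_c) / np.sqrt(np.sum(x_c ** 2) * np.sum(y_c ** 2))

# Example with placeholder data: r for every input column of a feature matrix X
X = np.random.rand(200, 38)
y_pred = 1.3 + 0.4 * np.random.rand(200)
r_values = [relevancy_factor(X[:, j], y_pred) for j in range(X.shape[1])]
```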

Figure 18
figure 18

Relevancy factors of the inputs up to down corresponding to their names.

Table 8 Input ranking regarding their absolute relevancy factors.

The relevancy rank of the wavelength is 28th among the 38 inputs. Although it did not gain a high rank, the importance of the wavelength is not ruled out. The fact that the existence or the quantity of certain substructures outranks the wavelength emphasizes the necessity of choosing appropriate materials, not the negligible role of the wavelength. Once the material is chosen (i.e., all the substructure counts are fixed), the wavelength change shows its significant role.

The Pearson correlation coefficient used here is especially suitable when the data distribution is normal. Otherwise, it is advisable to calculate the correlation coefficients from the ranks of the data instead of their actual values. The coefficients recommended for this purpose are Kendall's tau and Spearman's rho. Since some researchers suggest that Kendall's tau may support more accurate generalizations than Spearman's rho59, the absolute Kendall's tau is reported in Table 9. Kendall’s tau formula in the case of tied ranks is as follows60:

$${\tau }_{B}=\frac{2\times \left({N}_{a}-{N}_{d}\right)}{\sqrt{\left[N\left(N-1\right)-\sum {t}_{x}\left({t}_{x}-1\right)\right]\times \left[N\left(N-1\right)-\sum {t}_{y}\left({t}_{y}-1\right)\right]}},$$
(16)

where \({\tau }_{B}\) is Kendall's tau with tied rank adjustments, \({N}_{a}\) is the number of agreements in order, \({N}_{d}\) is the number of disagreements in order, \(N\) is the number of data points, and \({t}_{x}\) and \({t}_{y}\) are the number of tied observations on the first and second variables, respectively.
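
In practice, the tau-b statistic of Eq. (16) can be computed directly with SciPy, as sketched below on placeholder arrays standing in for one input column and the model predictions.

```python
import numpy as np
from scipy.stats import kendalltau

x_v = np.random.rand(200)                   # placeholder input column
y_pred = 1.3 + 0.4 * np.random.rand(200)    # placeholder predicted refractive indices

tau_b, p_value = kendalltau(x_v, y_pred)    # SciPy computes the tie-adjusted tau-b variant
```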

Table 9 Input ranking regarding their absolute Kendall’s tau.

While some of the ranks have changed compared to Table 8 owing to the change of correlation method, Table 9 confirms that the –F and >C< substructures are highly correlated with the refractive index.

Trend analysis

Understanding the influence of changing input values on the output is another illuminating way of analyzing the results. This insight can be gained from the trend analysis of the alkyl chain length, temperature, and wavelength. All parameters except the one under consideration are fixed to display only the effect of the parameter in question. The CatBoost model was used in the trend analysis as it is our most accurate model.

Figure 19 shows the trend of refractive index with respect to the number of carbons in the cation of three ILs named 1-alkyl-3-methylimidazolium tetrafluoroborate61, 1-alkyl-3-methylimidazolium hexafluorophosphate62, and 1-alkyl-3-methylimidazolium trifluoromethanesulfonate63 for experimental and predicted data. According to Fig. 19, the refractive index rises with the increase in the alkyl chain length of the cation of imidazolium-based ionic liquids. This behavior is due to the molar refraction variation with the number of carbon atoms1.

Figure 19
figure 19

Effect of changing the alkyl chain length of the cation on the refractive index of imidazolium-based ILs.

The effect of temperature on the refractive index is demonstrated in Fig. 20 for 1-butyl-3-methylimidazolium tetrafluoroborate64, 1-ethyl-3-methylimidazolium acetate65, and tributylmethylphosphonium methyl sulfate66. Unlike the alkyl chain length, an increase in the ILs’ temperature reduces the refractive index value. This phenomenon happens because when the temperature rises, the density of the ionic liquid decreases, which consequently increases the free molar volume; this increase in the free molar volume causes the refractive index reduction1. This result is in accordance with the conclusions of the literature36.

Figure 20
figure 20

Effect of changing the temperature on the refractive index of ILs.

Finally, the effect of wavelength change on the refractive index is illustrated in Fig. 21. The three ILs chosen to display this effect are 1-ethyl-3-methylimidazolium tetrafluoroborate, 1-ethyl-3-methylimidazolium bis((trifluoromethyl)sulfonyl)imide, and 1-butyl-3-methylimidazolium trifluoromethanesulfonate1. The observed decrease in refractive index with increasing wavelength confirms what the literature describes as normal dispersion67.

Figure 21
figure 21

Effect of changing the wavelength on the refractive index of ILs.

Leverage approach

Leverage analysis is a method that could reveal outliers and the approximate range within which a prediction is likely to be accurate. Identifying the leverage points is essential because they might influence the prediction considerably. The hat matrix needs to be introduced as follows to determine the leverages of the inputs68:

$$H=X{\left({X}^{T}X\right)}^{-1}{X}^{T}.$$
(17)

The diagonal elements of the hat matrix are named leverages and satisfy:

$$0\le {h}_{ii}\le 1,$$
(18)

where \({h}_{ii}\) are the diagonal elements of the hat matrix. A threshold can be introduced as the upper limit of acceptable leverage values, which is usually defined as:

$${H}^{*}=\frac{3(a+1)}{n},$$
(19)

where \(a\) is the number of inputs and \(n\) is the number of data points69. The following equation calculates standardized residuals:

$${R}_{i}=\frac{{e}_{i}}{\sqrt{MSE\left(1-{h}_{ii}\right)}},$$
(20)

where \(MSE\) is the mean square error and \({e}_{i}\) is the ordinary residual of the i-th observation. As an accepted standard among researchers, if the absolute value of a data point’s standardized residual is less than 3, the data point is considered valid; data outside this boundary are considered suspect70.
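
The quantities plotted in the Williams diagram follow directly from Eqs. (17)–(20); a compact NumPy sketch is given below, where X is the n × a input matrix and the cut-off values are those stated above. The function is illustrative, not the exact code of this study.

```python
import numpy as np

def williams_quantities(X, y_exp, y_pred):
    """Leverages, warning leverage H*, and standardized residuals for a Williams plot."""
    n, a = X.shape
    H = X @ np.linalg.pinv(X.T @ X) @ X.T          # hat matrix, Eq. (17); pinv for stability
    h = np.diag(H)                                 # leverages, Eq. (18)
    h_star = 3 * (a + 1) / n                       # warning leverage, Eq. (19)
    e = y_exp - y_pred
    R = e / np.sqrt(np.mean(e ** 2) * (1 - h))     # standardized residuals, Eq. (20)
    return h, h_star, R

# Valid points satisfy |R| < 3 and h < H*; points with |R| < 3 but h >= H* are
# "good high leverage" points, and |R| >= 3 marks suspected data.
```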

With the leverages and standardized residuals available, the Williams diagram is plotted in Fig. 22. The lines R = 3 and R = −3 are drawn to indicate the limits of the valid data. Also, to emphasize the limit beyond which the data points have high leverages, Hat = 0.019 is displayed. With these criteria, 95 out of 6098 data points (roughly 1.5%) were detected as suspect data, and 273 (approximately 4.5%) were good high leverage points, so the number of valid data points was 5730 (about 94% of the total data). This finding shows that only a small portion of the data was unreasonable and that the CatBoost model’s performance was impressive. The y-axis and x-axis are set to display data in the ranges of [−10, 10] and [0, 0.15], respectively, so that the variation of the data points is displayed clearly. Four points have an R of less than −10, two points have an R of more than 10, and one point has a leverage value of more than 0.15. These seven points are not shown in Fig. 22, but the supplementary data section provides the entire plot (Fig. S1). The dataset has an unusual point with a very high Hat value (around 0.33). The point belongs to an ionic liquid with three –I substructures, which is unprecedented in the dataset. Because the leverage method focuses only on the inputs regardless of the output values, this anomaly in the inputs inflates the leverage value.

Figure 22
figure 22

Williams plot of the CatBoost model’s results.

Conclusions

This study aimed to predict the refractive index of a large number of ILs. As a novel approach, the wavelength and 36 chemical substructures were considered as inputs, along with the temperature. More than 6000 data points were gathered and used in six different chemical structure-based machine learning models, namely XGBoost, LightGBM, CatBoost, CNN, Ada-DT, and Ada-SVM, to achieve this study’s aim. The statistical and visual analysis of the results reveals that the most accurate model was CatBoost. The other models also performed effectively and can be ranked, in order of accuracy, as XGBoost, LightGBM, Ada-DT, CNN, and Ada-SVM. Other findings of the research are highlighted as follows:

  • The sensitivity analysis showed that the –F and >C< substructures have the most influence over the predicted refractive index of an ionic liquid. The presence of these substructures in an ionic liquid decreases the refractive index.

  • Apart from the type of IL, the temperature has a stronger effect on the refractive index than the wavelength.

  • Neither temperature nor wavelength was among the most influential half of the inputs on the refractive index. The type of ionic liquid, or more precisely, the presence of certain chemical substructures, had more impact on the output than temperature and wavelength.

  • The results of the leverage approach show that some points have uncommon leverage values. This could result from unusual chemical substructures in the ILs.

  • Using machine learning methods to predict the refractive index of a vast number of ILs showed extraordinary performance: even our worst model was acceptable, with an R2 of 0.9227 and an AAPRE of 0.6618, while our best model’s statistical results were exceptional, with an R2 of 0.9973 and an AAPRE of 0.0545.

  • The trend analysis reveals that the refractive indices of ILs decline as wavelength and temperature rise, while the refractive indices of imidazolium-based ILs increase with increasing alkyl chain length.