Abstract
Several categorical effect size methods exist in the literature, but it is not clear which one performs best for a given dataset, and selecting the correct method is challenging. To address questions such as “Which method should we choose?” and “Which categorical effect size method is more reliable for a given dataset?”, this paper introduces an adaptive categorical effect size method based on intuitionistic meta fuzzy functions. The main motivation of the proposed method is to obtain more accurate outcomes by combining the results of better-performing methods instead of relying on a single method. In the study, the intuitionistic fuzzy c-means clustering algorithm is adapted to meta fuzzy functions by incorporating not only membership degrees but also non-membership degrees, to improve the clustering accuracy of meta fuzzy functions. Meta fuzzy functions are linear combinations of seven categorical effect size methods, with weights calculated from the membership grades of the intuitionistic fuzzy c-means algorithm. Among the functions, the one with the lowest mean absolute percentage error is selected as the best. To evaluate the performance of the proposed method, 2 × 3, 2 × 4, and 3 × 4 contingency tables were simulated. The performance of the proposed method is also assessed by applying it to a real-world dataset. Experimental results show that the proposed method outperforms the seven evaluated categorical effect size methods in terms of mean absolute percentage error. Also, the bias of the calculated effect sizes is within ±10%. Thus, the results verify that the proposed method achieves greater reliability.
Introduction
The p-value used to assess statistical significance is the probability of observing a difference between two groups at least as large as the one observed if only chance were at work. If the p-value is greater than the chosen alpha level, it is assumed that any observed difference can be explained by sampling variability. When statistical comparisons are conducted with exceptionally large sample sizes, the p-value is highly likely to indicate a significant difference. However, statistically significant differences that arise from the large number of data points do not always represent meaningful differences in reality1. Statistical significance depends on both the sample size and the effect size (ES), whereas the effect size is generally independent of the sample size2. Therefore, reporting only the p-value, especially in large samples, is not sufficient for readers to fully understand the implications3,4. The effect size is an essential part of quantitative research, and it indicates the real magnitude of the effect. In addition to statistical significance, it enables researchers to understand the practical significance of the findings. Statistical hypothesis tests can be misleading because of type 1 and type 2 errors, whose rates depend on the sample size. For this reason, in many disciplines it is necessary to report the effect size as well as the p-value.
The seven categorical effect size methods used in this study for \(r\times c\) contingency tables are explained in section “Categorical effect size methods”. The \(Cramer{^\prime}s V\) effect size measure has some disadvantages. First, it is a symmetric measure of association5,6,7. Second, it is zero under the assumption of independence. Third, it is difficult to interpret8. The \(Tschuprow{^\prime}s T\) measure is closely related to \(Cramer{^\prime}s V\) but less well known9. Since it is a simple function of the Pearson chi-square statistic, it is among the commonly used effect sizes. However, its bias is large in small samples and it is difficult to interpret8. \(Cohen{^\prime}s w\) is more appropriate for larger contingency tables10. The \(Uncertainty coefficient (U)\) is also a commonly used effect size for measuring the validity of a statistical classification algorithm11.
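The earlier point that statistical significance depends on the sample size while the effect size does not can be illustrated with a minimal sketch. The two tables below are hypothetical and share the same cell proportions; only the total count differs. The closed-form p-value used here holds only for two degrees of freedom, which is the case for a 2 × 3 table.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and total count of an r x c table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    chi2, n = chi_square(table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

def p_value_2x3(table):
    # A 2 x 3 table has df = (2-1)(3-1) = 2, for which the chi-square
    # survival function has the closed form exp(-chi2 / 2).
    chi2, _ = chi_square(table)
    return math.exp(-chi2 / 2.0)

small = [[22, 30, 48], [28, 30, 42]]              # hypothetical counts, n = 200
large = [[50 * x for x in row] for row in small]  # same proportions, n = 10000

for name, tab in (("small", small), ("large", large)):
    print(name, "p =", p_value_2x3(tab), "V =", round(cramers_v(tab), 4))
```

In this sketch the p-value falls below any conventional alpha level for the larger table, while Cramér's V is identical for both tables.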
Considering the disadvantages of the ES methods, it is important to select the correct ES method for a given dataset. To overcome the aforementioned disadvantages, the seven selected ES methods are aggregated in functions based on their performance on a given dataset. In this sense, the motivation of this paper is to combine different categorical effect size methods in functions using Meta Fuzzy Functions \((MFF)\) based on the Intuitionistic Fuzzy C-Means Clustering \(\left(IFCM\right)\) algorithm. The fuzzy c-means (FCM) clustering algorithm is normally used in MFF. FCM, proposed by Bezdek et al.12, is one of the most frequently employed clustering methods because of its simplicity and the benefits it offers compared to the k-means clustering algorithm. Nevertheless, it has certain drawbacks, including its dependence on initial settings and its sensitivity to noise. For this reason, IFCM, which accounts for the hesitancy of an object belonging to a cluster, is employed in MFF here. Intuitionistic Fuzzy Sets \((IFSs)\) were introduced by Atanassov13 as a modification of Zadeh’s fuzzy set theory14. The main difference is that fuzzy sets consider only the membership degree, while \(IFSs\) consider both membership and non-membership degrees. That is, \(IFSs\) also account for the hesitancy of membership grades in clusters, so the cluster centers are obtained more accurately. Studies have found that \(IFSs\) handle uncertainty more effectively than traditional fuzzy set theory15. \(IFSs\) have been commonly used for forecasting and engineering problems. In addition to time series and forecasting methods, \(IFSs\) are widely used in medicine for image clustering and diagnostics16,17,18. Numerous studies employing \(IFSs\) have been published, including Fan et al.19, Kumar and Gangwar20, Lei et al.21, Tak22 and Gwak et al.23.
Because of the aforementioned advantages of IFCM reported in the literature, it is employed in MFF. The \(MFF\) was proposed by Tak24. Its purpose is to combine methods or definitions used for the same purpose, and its logic is based on meta-analysis, a method that combines the outcomes of multiple studies to yield stronger results for a specific purpose. For example, Tak and Gök25 and Gök and Tak26 used the MFF to merge different definitions of currency crisis, aiming to enhance the accuracy and reliability of their analysis. Similarly, Tak et al.27 employed the MFF to combine various time series methods, with the objective of improving forecasting performance by integrating multiple forecasting techniques within the framework. Cevik et al.28 used the MFF approach to forecast the number of immigrants within the maritime line. Tak29 used the MFF approach for forecast combination. These studies have shown that combining different methods with the \(MFF\) yields better estimation accuracy.
Yabacı Tak and Ercan30 ensembled several ES definitions for two independent groups with the MFF to obtain a more accurate effect size value. They combined six effect size methods for numerical variables, which can be used with or without the assumption of normal distribution, using the MFF approach with the classical fuzzy c-means algorithm \((FCM)\). The methods combined in that study were not used for categorical variables; therefore, several categorical ES methods are combined in this study. In addition, the \(FCM\) clustering method uses only membership degrees when calculating the cluster centers. Thus, an \(MFF\) approach with the \(IFCM\), which provides a more accurate estimation of the cluster centers, has been developed in this study.
In the light of this information, we introduce the intuitionistic meta fuzzy categorical effect size functions \(\left( {I - MFCESF} \right)\) approach. The aim of the study is to obtain better outcomes by combining seven categorical effect size measures in functions. The ES measures are combined on the assumption that each measure may carry full or partial information about a given dataset. Therefore, the methods that perform better are gathered into one function, while the methods that perform worse are gathered into another. In the remainder of the paper, the IFCM and the meta fuzzy functions are described briefly in the section “Preliminaries”. The proposed method \(\left(I-MFCESF\right)\) is discussed in section “Intuitionistic meta fuzzy categorical effect size functions (I-MFCESF)”. The performance of the proposed method is evaluated with applications to simulated and real datasets in section “Evaluation”. Finally, the results of the proposed method are discussed in section “Conclusion”.
Preliminaries
The methods (effect sizes, intuitionistic fuzzy c-means and meta fuzzy function) that are used in the paper are detailed in this section.
Categorical effect size methods
Short descriptions of the seven ES measures for \(r\times c\) contingency tables are provided below. \(Cramer{^\prime}s V\) was proposed in 1946 and is an effect size measure generally used with nominal variables in \(r\times c\) contingency tables7,31,32,33,34. It is calculated as in Eq. (1) from Pearson’s chi-square statistic and takes values between 0 and +1.
$$V = \sqrt{\frac{\chi^{2}}{n \cdot \min\left(r-1,\, c-1\right)}}$$(1)
where \({\chi }^{2}\) is Pearson’s chi-squared statistic, \(n\) is the total number of observations, \(c\) is the number of columns and \(r\) is the number of rows. In Eq. (1), the numerator is based on the observed frequencies and the denominator on unobserved frequencies. Therefore, when \(Cramer{^\prime}s V=1\), the marginal frequencies are non-zero and the rows or columns contain non-zero cell frequencies.
\(Tschuprow{^\prime}s T\) is an ES that measures the association between two nominal variables in \(r\times c\) contingency tables35. It takes values between 0 and +1 and is calculated as in Eq. (2).
$$T = \sqrt{\frac{\chi^{2}}{n\sqrt{\left(r-1\right)\left(c-1\right)}}}$$(2)
where \({\chi }^{2}\) is Pearson’s chi-squared statistic, \(n\) is the total number of observations, \(c\) is the number of columns and \(r\) is the number of rows.
Another measure of categorical effect size is the \(Pearson{^\prime}s contingency coefficient\) (\(Pearson{^\prime} s c\)). It takes values between 0 and +1 and is calculated as in Eq. (3)36.
$$C = \sqrt{\frac{\chi^{2}}{\chi^{2}+n}}$$(3)
where \(\chi^{2}\) is Pearson’s chi-squared statistic and \(n\) is the total number of observations.
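The three chi-square-based measures above can be computed from a table of counts with a few lines of code. The following is a minimal sketch using the standard textbook definitions of Cramér's V, Tschuprow's T and Pearson's contingency coefficient; the function names and the example table are illustrative assumptions, and the study itself used R packages rather than this code.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and total count of an r x c table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    # V = sqrt(chi2 / (n * min(r - 1, c - 1)))
    chi2, n = chi_square(table)
    r, c = len(table), len(table[0])
    return math.sqrt(chi2 / (n * min(r - 1, c - 1)))

def tschuprows_t(table):
    # T = sqrt(chi2 / (n * sqrt((r - 1)(c - 1))))
    chi2, n = chi_square(table)
    r, c = len(table), len(table[0])
    return math.sqrt(chi2 / (n * math.sqrt((r - 1) * (c - 1))))

def pearsons_c(table):
    # C = sqrt(chi2 / (chi2 + n))
    chi2, n = chi_square(table)
    return math.sqrt(chi2 / (chi2 + n))

table = [[10, 20, 30], [30, 20, 10]]  # hypothetical 2 x 3 counts
print(cramers_v(table), tschuprows_t(table), pearsons_c(table))
```

For a 2 × 3 table, \(T \le V\) because \(\sqrt{(r-1)(c-1)} \ge \min(r-1, c-1)\); the two coincide only for square tables.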
The \(Cohen{^\prime} s w\) effect size was proposed by Cohen37. \(Cohen{^\prime} s w\) should be used for larger contingency tables. It is obtained as in Eq. (4).
$$w = \sqrt{\sum_{i=1}^{m} \frac{\left(p_{1i}-p_{0i}\right)^{2}}{p_{0i}}}$$(4)
where m is the number of cells, \(p_{0i}\) is the proportion in the ith cell under the null hypothesis and \(p_{1i}\) is the proportion in the ith cell under the alternative hypothesis.
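A short sketch of Cohen's w follows. It assumes, for illustration only, that the null proportions are those implied by independence of the row and column variables and that the alternative proportions are the observed ones; under that assumption w equals \(\sqrt{\chi^{2}/n}\). The helper names are hypothetical.

```python
import math

def cohens_w(p0, p1):
    """Cohen's w from cell proportions under H0 (p0) and H1 (p1)."""
    return math.sqrt(sum((b - a) ** 2 / a for a, b in zip(p0, p1)))

def cohens_w_from_table(table):
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]
    col_p = [sum(col) / n for col in zip(*table)]
    # Assumption: H0 is independence, H1 is the observed table (row-major order).
    p0 = [rp * cp for rp in row_p for cp in col_p]
    p1 = [cell / n for row in table for cell in row]
    return cohens_w(p0, p1)

table = [[10, 20, 30], [30, 20, 10]]  # hypothetical 2 x 3 counts
print(cohens_w_from_table(table))
```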
\(Goodman - Kruskal Tau \left( {G - K Tau} \right)\) is another ES measure for nominal variables. It measures, as a percentage, how well the column or row variable can be predicted given the value of the other variable38,39. The measure varies between 0 and 1. \(G - K Tau\) is calculated as in Eq. (5)40.
where \(n\) is the total number of observations, \(a_{ij}\) is the number of observations in the ith row and jth column, \(a_{.j}\) is the total number of observations in the jth column and \(a_{i.}\) is the total number of observations in the ith row.
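As a hedged illustration, one common textbook form of the Goodman-Kruskal tau for predicting the column variable from the row variable, written with the symbols above, is \(\tau = \left[\sum_{i}\sum_{j} a_{ij}^{2}/(n\,a_{i.}) - \sum_{j} (a_{.j}/n)^{2}\right] / \left[1 - \sum_{j} (a_{.j}/n)^{2}\right]\). The sketch below implements this form; it is an assumption that this matches the exact layout of Eq. (5), and the function name is illustrative.

```python
def goodman_kruskal_tau(table):
    """Tau for predicting the column variable from the row variable."""
    n = sum(sum(row) for row in table)
    col_tot = [sum(col) for col in zip(*table)]
    # Sum over cells of a_ij^2 / (n * a_i.), skipping empty rows.
    within = sum(
        cell ** 2 / (n * sum(row))
        for row in table if sum(row) > 0
        for cell in row
    )
    # Sum over columns of (a_.j / n)^2.
    marginal = sum((a / n) ** 2 for a in col_tot)
    return (within - marginal) / (1.0 - marginal)

print(goodman_kruskal_tau([[10, 20, 30], [30, 20, 10]]))  # hypothetical counts
```

The measure is asymmetric: transposing the table gives the tau for predicting the row variable instead.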
The \(Uncertainty coefficient \left( U \right)\) was first introduced by Theil41. It is also called Proficiency, Entropy Coefficient or Theil’s U. It is often used as a measure of the ES of nominal variables in statistics and takes values between 0 and +1. This measure is defined in Eq. (6).
where \(H\left( X \right)\) is the entropy of a single distribution, \(H\left( {X|Y} \right)\) is the conditional entropy and \(U\left( {X|Y} \right)\) is the uncertainty coefficient. \(P_{X,Y} \left( {x,y} \right)\) is the joint distribution of X and Y.
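A minimal sketch of the uncertainty coefficient follows, assuming the usual definition \(U(X|Y) = \left[H(X) - H(X|Y)\right]/H(X)\) with \(H(X) = -\sum_{x} P_{X}(x)\log P_{X}(x)\) and \(H(X|Y) = -\sum_{x,y} P_{X,Y}(x,y)\log\left[P_{X,Y}(x,y)/P_{Y}(y)\right]\). Treating X as the row variable and Y as the column variable is an illustrative choice, not something fixed by the text.

```python
import math

def uncertainty_coefficient(table):
    """U(X|Y) with X = row variable and Y = column variable."""
    n = sum(sum(row) for row in table)
    p_x = [sum(row) / n for row in table]
    p_y = [sum(col) / n for col in zip(*table)]
    # Entropy of X; empty categories contribute nothing.
    h_x = -sum(p * math.log(p) for p in p_x if p > 0)
    # Conditional entropy H(X|Y), summed over non-empty cells.
    h_x_given_y = 0.0
    for row in table:
        for j, cell in enumerate(row):
            if cell > 0:
                p_xy = cell / n
                h_x_given_y -= p_xy * math.log(p_xy / p_y[j])
    # Undefined when X has a single populated category (h_x == 0).
    return (h_x - h_x_given_y) / h_x

print(uncertainty_coefficient([[10, 20, 30], [30, 20, 10]]))  # hypothetical counts
```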
The \(Goodman - Kruskal Lambda \left( \lambda \right)\) statistic is an effect size proposed to measure the strength of the relationship between two nominal variables by evaluating the proportional reduction of error (PRE)39. \(\lambda\) is an asymmetric measure and takes values between 0 and 1. The \(\lambda\) statistic is calculated as in Eq. (8).
$$\lambda = \frac{E_{1}-E_{2}}{E_{1}}$$(8)
where \(E_{1}\) is the number of prediction errors made when the independent variable is ignored and \(E_{2}\) is the number of prediction errors made when the prediction is based on the independent variable.
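The proportional-reduction-of-error logic can be sketched as follows, assuming that the row variable is the independent variable and that every prediction is the modal category: \(E_{1}\) counts the observations outside the modal column overall, and \(E_{2}\) counts the observations outside the modal column within each row. The function name and the example counts are illustrative.

```python
def goodman_kruskal_lambda(table):
    """Lambda for predicting the column variable from the row variable."""
    n = sum(sum(row) for row in table)
    col_tot = [sum(col) for col in zip(*table)]
    e1 = n - max(col_tot)                           # errors ignoring the rows
    e2 = sum(sum(row) - max(row) for row in table)  # errors using the rows
    # Convention assumed here: return 0 when there is no error to reduce.
    return (e1 - e2) / e1 if e1 > 0 else 0.0

print(goodman_kruskal_lambda([[10, 20, 30], [30, 20, 10]]))  # hypothetical counts
```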
IFCM
Over the past decades, the fuzzy set theory proposed by Zadeh14 has been expanded with different approaches. Among these, intuitionistic fuzzy set theory, which has been commonly used in the literature and has many applications in different fields, was developed by Atanassov13. While only the membership degree is taken into account in FCM, the non-membership degree is also taken into account in IFCM, so the centers of the clusters are calculated more accurately. The algorithm is given below22:
- Step-1.: Determine the number of clusters \(\left( c \right)\) and the fuzziness index \(\left( f \right)\), and initialize the cluster centers \(\left( {v_{i} } \right)\) randomly.
- Step-2.: Calculate the degrees of membership (\(\mu\)) and non-membership (\(u\)) using Eqs. (9–11):
$$\mu_{ik} = \left[ \sum_{j = 1}^{c} \left( \frac{d\left( x_{k} ,v_{i} \right)}{d\left( x_{k} ,v_{j} \right)} \right)^{\frac{2}{f - 1}} \right]^{ - 1} , \;\;i = 1,2, \ldots ,c ;\;\;k = 1,2, \ldots ,n$$(9)
where \(d\left( . \right)\) is the Euclidean distance between the kth data point and the ith cluster center.
$$u_{ik} = \left( 1 - \mu_{ik}^{\alpha } \right)^{1/\alpha } , \;\;\alpha > 0$$(10)
$$\mu_{ik}^{*} = 1 - u_{ik}$$(11)
- Step-3.: Update the cluster centers using Eq. (12):
$$v_{i} = \frac{\sum_{k = 1}^{n} \left( \mu_{ik}^{*} \right)^{f} x_{k} }{\sum_{k = 1}^{n} \left( \mu_{ik}^{*} \right)^{f} } , \;\;i = 1,2, \ldots ,c$$(12)
- Step-4.: Stop the algorithm if the change between two successive iterations drops below a given threshold ε; otherwise, repeat Step-2 and Step-3.
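The four steps above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the code used in the study: the value `alpha=0.85`, the stopping rule based on the largest center shift, the optional `init_centers` argument and the small floor on distances are all choices made for this sketch.

```python
import math
import random

def ifcm(points, c=2, f=2.0, alpha=0.85, eps=1e-6, max_iter=100,
         init_centers=None, seed=0):
    """Sketch of Steps 1-4; alpha and the other defaults are assumptions."""
    n, dim = len(points), len(points[0])
    # Step-1: random initial centers unless the caller supplies them.
    if init_centers is None:
        centers = [list(p) for p in random.Random(seed).sample(points, c)]
    else:
        centers = [list(v) for v in init_centers]
    mu_star = []
    for _ in range(max_iter):
        # Step-2: membership (Eq. 9), non-membership (Eq. 10), Eq. (11).
        mu_star = [[0.0] * n for _ in range(c)]
        for k, x in enumerate(points):
            # Floor the distances so a point sitting on a center is handled.
            d = [max(math.dist(x, v), 1e-12) for v in centers]
            for i in range(c):
                mu = 1.0 / sum((d[i] / d[j]) ** (2.0 / (f - 1.0)) for j in range(c))
                u = max(0.0, 1.0 - min(mu, 1.0) ** alpha) ** (1.0 / alpha)
                mu_star[i][k] = 1.0 - u
        # Step-3: center update (Eq. 12).
        new_centers = []
        for i in range(c):
            w = [m ** f for m in mu_star[i]]
            total = sum(w)
            new_centers.append(
                [sum(w[k] * points[k][a] for k in range(n)) / total for a in range(dim)]
            )
        # Step-4: stop when no center moves more than eps.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < eps:
            break
    return centers, mu_star

# Hypothetical data: two well-separated groups of three points each.
pts = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (5.0, 5.1), (5.1, 5.0), (4.95, 5.05)]
centers, mu_star = ifcm(pts, c=2, init_centers=[(0.0, 0.0), (1.0, 1.0)])
```

With \(\alpha = 1\), Eq. (10) gives \(u_{ik} = 1 - \mu_{ik}\), so \(\mu_{ik}^{*} = \mu_{ik}\) and the sketch reduces to ordinary FCM.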
Meta fuzzy functions
Tak24 proposed the MFF to combine different methods or definitions, for example in prediction and forecasting. The MFF consists of three components: functions, weights, and the best meta fuzzy function. Functions: linear combinations of the weights and the findings of the selected methods. Weights: computed from the membership grades obtained from the FCM clustering algorithm. The best meta fuzzy function: the function with the best value of the evaluation criterion. Meta fuzzy functions begin by collecting the outcomes of the methods chosen for a purpose as the input matrix. The input matrix is then clustered using the fuzzy c-means clustering algorithm to separate the categorical ES methods based on how well they predict outcomes. As a result, each method is assigned to a cluster with a membership grade. Then, using the membership grades for each cluster, the weights of the methods are calculated. The number of functions therefore equals the number of clusters. Finally, the best meta fuzzy function is selected based on its evaluation criterion.
Intuitionistic meta fuzzy categorical effect size functions \(\left( I - MFCESF \right)\)
\(Cramer{^\prime} s V, \;\;Tschuprow{^\prime} s T, \;\;Pearson{^\prime} s c, \;\;Cohen{^\prime} s w,\;\; G - K Tau,\;\; U\) and \(\lambda\) methods can be used to calculate effect size measures for a dataset. However, the literature gives no definite guidance on which method is better or in which situations each should be used, and the performance of the methods may change with the type of dataset. Because the performance of the ES measures is uncertain in advance, we look for the optimum weights of the ES measures in the combination function. For this purpose, the \(I - MFCESF\) method is proposed in this paper. The ES measures are clustered based on their performance using the IFCM. There are as many functions as clusters, and each function is obtained by multiplying each method by its weight in the corresponding cluster and summing the products. The ES measures that perform better for the dataset have a higher membership degree in one function, while the ES measures that perform worse have a higher membership degree in another. Finally, the function with the minimum value of the model evaluation criterion is selected as \(I - MFCESF _{best}\) and the new effect size value is calculated for the dataset. Thus, the \(I - MFCESF\) method is an adaptive combination of categorical effect size measures. The step-by-step algorithm, pseudocode and flowchart of the \(I - MFCESF\) approach are given below.
Algorithm 1
- Step 1.: Determine \(m\) categorical ES measures and simulate the data randomly for t repeats. Obtain the input matrix (Z) by applying the \(m\) measures to the simulated dataset in each of the t repeats.
$$Z = [Z_{ij} ] , \;\;i = 1,2, \ldots ,t ;\;\; j = 1,2, \ldots ,m$$(13)
where \(Z_{ij}\) is the ES value of the ith repeat for the jth measure.
$$Z = \left[ \begin{array}{cccc} Z_{1,1} & Z_{1,2} & \ldots & Z_{1,m} \\ Z_{2,1} & Z_{2,2} & \ldots & Z_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{t,1} & Z_{t,2} & \ldots & Z_{t,m} \end{array} \right]$$
- Step 2.: Cluster the input matrix using intuitionistic fuzzy c-means.
- Step 2.1.: Determine the number of clusters (c) and the fuzziness index (f), and initialize the cluster centers randomly.
- Step 2.2.: Calculate the membership and non-membership degrees using Eqs. (9–11).
- Step 2.3.: Calculate the new cluster centers using Eq. (12).
- Step 2.4.: If the difference between two iterations drops below a given threshold, stop the algorithm; otherwise, repeat Step 2.2 and Step 2.3.
- Step 3.: Obtain the intuitionistic meta categorical effect size functions. \(I - MFCESF\) is given in Eq. (14).
$$I - MFCESF_{i} \left( z \right) = \sum_{j = 1}^{m} w_{ij} z_{j} , \;\; i = 1,2, \ldots ,c$$(14)
$$w_{ij} = \frac{\mu_{ij}^{*} }{\sum_{j = 1}^{m} \mu_{ij}^{*} } ,\;\; i = 1,2, \ldots , c$$(15)
where c is the number of clusters, \(\mu_{ij}^{*}\) is the membership grade of the jth method in the \(i\)th cluster, \(I - MFCESF_{i}\) is the ith intuitionistic meta categorical effect size function, and \(w_{ij}\) is the weight of the jth method in the \(i\)th cluster.
- Step 4.: Select the best intuitionistic meta categorical effect size function, that is, the one with the minimum mean absolute percentage error (MAPE). MAPE values are calculated to select \(I - MFCESF_{best}\). The MAPE formula is given in Eq. (16).
$$MAPE = \frac{1}{t}\sum_{i = 1}^{t} \left| \frac{y_{i} - \widehat{y_{i}}}{y_{i}} \right| \times 100$$(16)
where \(y_{i }\) is the mean of the ES values calculated by each method for the population and \(\widehat{{y_{i} }}\) is the predicted ES value obtained from the 1000 simulated samples. The pseudocode and the flow chart of \(I - MFCESF\) based on MFF are given in Algorithm 2 and Fig. 1, respectively.
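Steps 3 and 4 of Algorithm 1 can be sketched as follows, taking the intuitionistic membership grades as given. The function names, the toy input matrix, the membership grades and the reference value are all hypothetical; the MAPE is assumed to be the usual mean absolute percentage error averaged over the repeats.

```python
def weights_from_memberships(mu_star):
    """Eq. (15): normalise each cluster's membership grades to sum to 1."""
    return [[m / sum(row) for m in row] for row in mu_star]

def meta_functions(Z, weights):
    """Eq. (14): one weighted combination per cluster for every row of Z."""
    return [[sum(w * z for w, z in zip(w_i, row)) for row in Z] for w_i in weights]

def mape(actual, predicted):
    """Mean absolute percentage error over all repeats."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def select_best(Z, mu_star, actual):
    """Step 4: index of the function with the smallest MAPE."""
    weights = weights_from_memberships(mu_star)
    predictions = meta_functions(Z, weights)
    errors = [mape(actual, p) for p in predictions]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, weights, errors

# Hypothetical example: t = 3 repeats, m = 3 measures, c = 2 clusters.
Z = [[0.10, 0.12, 0.30], [0.11, 0.13, 0.28], [0.09, 0.11, 0.32]]
mu_star = [[0.9, 0.8, 0.1], [0.1, 0.2, 0.9]]   # rows: clusters, columns: measures
actual = [0.11, 0.11, 0.11]                     # assumed reference ES value
best, weights, errors = select_best(Z, mu_star, actual)
```

Here the first function gives most of its weight to the two measures that lie close to the reference value, so it attains the smaller MAPE and is selected.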
Evaluation
The estimation performance of the proposed I-MFCESF method is evaluated through both simulation studies and real-world datasets. In the simulation study, two categorical variables (x and y) are generated randomly to create contingency tables of different sizes (2 × 3, 2 × 4, and 3 × 4). These tables are generated for a sample size of N = 1000 and repeated for t = 1000 iterations. Real-world datasets are obtained from the UCI Machine Learning Repository42, and 1000 different samples are taken with replacement from these datasets. By applying the selected categorical effect size methods to each dataset, an input matrix (Z) is obtained. The I-MFCESF method has two crucial parameters: the number of clusters (c) and the fuzziness index (f). To determine the optimal number of clusters, the MAPE of the I-MFCESF is calculated for each c between 2 and 5, and the value with the minimum MAPE is chosen. Because there is no consensus on the optimal value of the fuzziness index of IFCM (intuitionistic fuzzy c-means algorithm), a value of 2 is selected for this study. The performance of the proposed method is evaluated using the MAPE, which measures the average absolute percentage difference between the estimated values and the true values.
The simulation study and real-world dataset applications of the I-MFCESF method are conducted in RStudio. Various R packages, namely “ppclust”, “effectsize”, “DescTools”, “fclust”, “rcompanion”, “remotes”, “githubinstall” and “Metrics”, are utilized43,44,45,46,47. As an application, the seven selected categorical ES measures are combined using the MFF based on intuitionistic fuzzy c-means to obtain more accurate results for all datasets.
Simulated \(2\times 3\), \(2\times 4\) and \(3\times 4\) contingency tables for the datasets of categorical variables
Two categorical variables x and y (\(2 \times 3\), \(2 \times 4\) and \(3 \times 4\) contingency tables) are simulated randomly for a sample size of N = 1000 and t = 1000 iterations. The selected measures, \(Cramer{^\prime} s V\) (Measure 1), \(Tschuprow{^\prime} s T\) (Measure 2), \(Pearson{^\prime} s c\) (Measure 3), \(Cohen{^\prime} s w\) (Measure 4), \(G - K Tau\) (Measure 5), \(U\) (Measure 6) and \(\lambda\) (Measure 7), are applied to all datasets. The input matrix \(\left( Z \right)\) consists of the outcomes of the ES measures for the simulated dataset. After the input matrix is obtained, the IFCM clustering algorithm is applied with the fuzziness index (f) set to 2. The number of functions is equal to the optimal number of clusters. Each function is obtained by multiplying the values of the methods by their weights and summing the products (Eq. 14). The weights of each method in each function are obtained as in Algorithm 1 (Step 3). Finally, the MAPE value is calculated for each of the obtained \(I - MFCESF\) functions. When calculating the MAPE values, the actual value is taken as the average of the values that the seven selected ES measures give for the dataset. The function with the lowest MAPE is chosen as \(I - MFCESF_{best }\), and the new ES value is computed from it.
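The construction of the input matrix from simulated tables can be sketched as follows. The text does not state how the two variables are generated, so drawing the pairs from a fixed set of joint cell probabilities is an assumption of this sketch, as are the probability values and the function names; only one measure is included to keep the example short.

```python
import math
import random

def simulate_table(cell_probs, n, rng):
    """Draw n observations of (x, y) from the given joint cell probabilities."""
    r, c = len(cell_probs), len(cell_probs[0])
    cells = [(i, j) for i in range(r) for j in range(c)]
    weights = [cell_probs[i][j] for i, j in cells]
    table = [[0] * c for _ in range(r)]
    for i, j in rng.choices(cells, weights=weights, k=n):
        table[i][j] += 1
    return table

def cramers_v(table):
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # skip cells in an empty row or column
                chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(table), len(table[0])) - 1)))

def build_input_matrix(cell_probs, measures, n=1000, t=1000, seed=0):
    """Rows are repeats, columns are ES measures, as in Eq. (13)."""
    rng = random.Random(seed)
    return [[m(tab) for m in measures]
            for tab in (simulate_table(cell_probs, n, rng) for _ in range(t))]

probs = [[0.20, 0.15, 0.15], [0.10, 0.15, 0.25]]  # hypothetical 2 x 3 probabilities
Z = build_input_matrix(probs, [cramers_v], n=1000, t=100)
```

Passing all seven measures in the `measures` list would give the \(t \times 7\) matrix described in the text.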
The first dataset is simulated for 2 \(\times\) 3 contingency table and the input matrix (\(Z\)) is obtained by applying the selected categorical ES methods. The first five and last five prediction values of the input matrix are summarized in Table 1.
For the first simulated dataset, the optimal cluster number, which is set to 2, is determined by selecting the minimum MAPE value for \(I-MFCESFs\). As a result, two functions are obtained by multiplying each method with their respective weights. The weights for the \(I-MFCESF\) are computed using intuitionistic membership grades, as outlined in Table 2. The functions of the proposed method are obtained using the following equations (Eqs. 17, 18).
Table 2 provides a clear depiction that \(I - MFCESF_{2 }\) exhibits the lowest MAPE. Therefore, \(I - MFCESF_{2 }\) is identified as the best I-MFCESF. The MAPE values are computed and presented in Table 3, to assess the performance of the proposed method.
Table 3 makes clear that the I-MFCESF outperforms the other categorical ES methods in terms of MAPE. According to Li et al.48, a parameter prediction is considered acceptable when the bias is within ± 10%. The bias of the proposed method is − 1% in Table 3. Thus, the accuracy of the method is also sufficient in terms of bias.
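The ± 10% criterion can be checked with a small helper. The text does not give the bias formula, so the sketch below assumes a relative (percent) bias, defined as the mean prediction minus the reference value, divided by the reference value; the numbers are hypothetical.

```python
def percent_bias(actual, predicted):
    """Assumed definition: 100 * (mean prediction - actual) / actual."""
    mean_pred = sum(predicted) / len(predicted)
    return 100.0 * (mean_pred - actual) / actual

def bias_acceptable(bias, limit=10.0):
    """Criterion cited in the text: bias within +/- 10%."""
    return abs(bias) <= limit

bias = percent_bias(0.20, [0.19, 0.20, 0.205])  # hypothetical ES predictions
print(round(bias, 3), bias_acceptable(bias))
```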
A subsequent dataset is simulated for a 2 × 4 contingency table, and the input matrix (Z) is obtained by applying the chosen categorical ES methods. Table 4 provides a summary of the first five and last five prediction values found in the input matrix.
The weights for the \(I-MFCESF\) are calculated by using intuitionistic membership grades as in Table 5, and the functions of the proposed method are obtained as in Eqs. (19, 20).
Table 5 clearly shows \({I-MFCESF}_{2 }\) has the lowest MAPE. Thus, the best I-MFCESF is \({I-MFCESF}_{2}\). The MAPE values of the methods are computed, and the results are presented in Table 6 to assess the performance of the proposed method.
Table 6 shows that the I-MFCESF method performs better than the individual categorical effect size methods in terms of MAPE. The bias of the proposed method is − 1.9%. Because the bias is within ± 10%, the accuracy of the proposed method is also sufficient in terms of bias.
Lastly, a dataset is simulated for a 3 × 4 contingency table, and the input matrix (Z) is generated by applying the selected categorical ES methods. Table 7 provides a summary of the first five and last five prediction values found in the input matrix.
The weights for the \(I-MFCESF\) are calculated by using intuitionistic membership grades as in Table 8.
Table 8 demonstrates that two functions are computed by multiplying each method with their respective weights. In the case of I-MFCESF, the weights are determined using intuitionistic membership grades. The functions of the proposed method are derived using the equations provided in Eqs. (21, 22).
According to Table 8, it is evident that \({I-MFCESF}_{2 }\) exhibits the lowest MAPE. Therefore, \({I-MFCESF}_{2}\) is identified as the best I-MFCESF. The MAPE of the methods are computed and presented in Table 9 to assess the performance of the proposed method.
Table 9 shows that the I-MFCESF method outperforms the individual categorical ES methods in terms of MAPE. The bias of the I-MFCESF is − 3.2%. Because the bias is within ± 10%, the accuracy of the proposed method is also sufficient in terms of bias. Figures 2, 3 and 4 illustrate the MAPE and bias values of the proposed method and the selected methods for the various contingency tables.
Real-world categorical dataset for \(2\times 3\), \(2\times 4\) and \(3\times 4\) contingency tables
The first dataset contains 34 variables, 33 of which are categorical and one numerical, and 366 observations. The dataset relates to the differential diagnosis of erythematous-squamous diseases. The data are taken from the UCI Machine Learning Repository and are openly accessible at https://archive.ics.uci.edu/ml/datasets/Dermatology. The “family history”, “eosinophi” and “erythema” variables of the “Dermatology” dataset are used. The family history feature has the value “1” if any of these diseases has been observed in the family, and “0” otherwise. Eosinophi has the value “0” if the feature is not present, “1” for a relative intermediate value, and “2” for the largest amount possible. Erythema has the value “0” if the feature is not present, “3” for the largest amount possible, and “1” or “2” for relative intermediate values. A total of 1000 different samples are drawn with replacement from the Dermatology dataset. In the proposed method, the input matrix \((Z)\) is obtained from the outputs of the categorical ES measures calculated for these samples. Then, the membership grades are obtained by clustering the input matrix with the IFCM algorithm, with the fuzziness index (\(f\)) set to 2. Using the membership grades, the weights of each categorical ES method in each cluster are calculated. The next step is to obtain the fuzzy functions from the weights; there are as many fuzzy functions as the optimum number of clusters. The optimum cluster number is searched iteratively between 2 and 5. Finally, the fuzzy function with the smallest MAPE is chosen and the new effect size value is calculated.
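Drawing samples with replacement and cross-tabulating two variables can be sketched as follows. The pairs below are made-up toy values, not the Dermatology data, and resampling to the original number of observations is an assumption of this sketch since the text does not state the size of each sample.

```python
import random

def crosstab(pairs, row_levels, col_levels):
    """Contingency table of counts for (x, y) pairs."""
    table = [[0] * len(col_levels) for _ in row_levels]
    r_idx = {v: i for i, v in enumerate(row_levels)}
    c_idx = {v: j for j, v in enumerate(col_levels)}
    for x, y in pairs:
        table[r_idx[x]][c_idx[y]] += 1
    return table

def bootstrap_tables(pairs, row_levels, col_levels, b=1000, seed=0):
    """b resamples with replacement, each the size of the original data."""
    rng = random.Random(seed)
    return [crosstab(rng.choices(pairs, k=len(pairs)), row_levels, col_levels)
            for _ in range(b)]

# Toy stand-in for a binary variable (rows) and a three-level variable (columns).
pairs = ([(0, 0)] * 40 + [(0, 1)] * 25 + [(0, 2)] * 10 +
         [(1, 0)] * 5 + [(1, 1)] * 10 + [(1, 2)] * 10)
tables = bootstrap_tables(pairs, [0, 1], [0, 1, 2], b=20)
```

Applying the seven ES measures to each resampled table would give one row of the input matrix per resample.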
Family history and Eosinophi variables (\(2\times 3\) contingency tables)
“Family history” and “Eosinophi” variables are selected in the Dermatology dataset for \(2\times 3\) contingency table. The input matrix \((Z)\) is obtained from outcomes of seven ES measures for these variables. The first five and last five prediction values of the input matrix are summarized in Table 10.
The weights for the \(I-MFCESF\) are calculated as in Table 11 and \({I-MFCESF}_{1}\) and \({I-MFCESF}_{2}\) are obtained as in Eqs. (23, 24) for Family history and Eosinophi variables.
Table 11 shows that \({I-MFCESF}_{2 }\) has the lowest MAPE. Thus, the best I-MFCESF is \({I-MFCESF}_{2}\). All seven methods contribute to the performance of the second function. The sixth method contributes the most, but the seventh, fifth, third, fourth, second and first methods also affect the effectiveness of I-MFCESF. The MAPE values of the methods are computed, and the results are presented in Table 12 to evaluate the performance of the proposed method. Additionally, Fig. 5 provides a visual representation of the MAPE and bias values of the proposed and selected methods for the family history and eosinophi variables.
According to Table 12, the proposed I-MFCESF outperforms the other categorical effect size methods in terms of the MAPE criterion. Moreover, the bias of the proposed method is within ± 10%, so it is sufficient in terms of bias. As a result, the new ES value is calculated as 0.020 from Eq. (25).
Family history and Erythema variables (\(2\times 4\) contingency tables)
For the \(2\times 4\) contingency table, the “Family history” and “Erythema” variables are selected from the Dermatology dataset. The input matrix of \(I-MFCESF\) is obtained from the outcomes of the seven effect size measures for these variables. The input matrix is summarized in Table 13.
When the number of clusters was varied iteratively between 2 and 5 to obtain the smallest MAPE, it was determined as 3 for this dataset. The weights for the \(I-MFCESF\) are calculated as in Table 14, and \({I-MFCESF}_{1}\), \({I-MFCESF}_{2}\) and \({I-MFCESF}_{3}\) are obtained as in Eqs. (26–28).
According to Table 14, \({I-MFCESF}_{2 }\) has the lowest MAPE, so the best I-MFCESF is \({I-MFCESF}_{2}\). All seven methods contribute to the performance of the proposed method. The first method contributes the most, followed in order by the third, fifth, seventh, fourth, second and sixth methods. The MAPE values of the methods are given in Table 15 to evaluate the performance of the proposed method. Also, Fig. 6 shows the MAPE and bias values of the proposed and selected methods for the family history and erythema variables.
Table 15 makes clear that the proposed I-MFCESF gives very accurate prediction results for both evaluation criteria, MAPE and bias. The MAPE of the proposed method is better than that of the other categorical effect size methods, and the bias is within ± 10%. Therefore, I-MFCESF is sufficient in terms of MAPE and bias. As a result, the new effect size value is calculated as 0.1328 from Eq. (29).
Eosinophi and Erythema variables (\(3\times 4\) contingency tables)
For the \(3\times 4\) contingency table, the “Eosinophi” and “Erythema” variables are selected from the Dermatology dataset. The input matrix of \(I-MFCESF\) is obtained from the outcomes of the seven effect size measures for these variables. The input matrix is summarized in Table 16.
Table 17 shows the weights calculated for the eosinophi and erythema variables. The functions \({I-MFCESF}_{1}\) and \({I-MFCESF}_{2}\), which are built from these weights, are given in Eqs. (30) and (31).
Table 17 makes clear that, with respect to the MAPE criterion, the \({I-MFCESF}_{2 }\) function has the best prediction performance for this contingency table. \(Pearson{^\prime}s c\) contributes the most to the performance of the proposed method, while the other selected methods have a smaller impact on the performance of the best function. Figure 7 shows the MAPE and bias values of the selected and proposed methods for the eosinophi and erythema variables.
Table 18 lists the performance of the selected methods and the proposed method. The MAPE and bias values show that the best performance is produced by the proposed method. The bias of the proposed method is within ± 10%, and its MAPE is the lowest among the effect size methods. Finally, the new effect size value is calculated using Eq. (32).
Conclusion
Two key points of the study can be highlighted. First, a new categorical effect size method based on IFCM and MFF is used to build an ensemble of seven different categorical effect size measures. Thus, instead of depending on a single categorical effect size method, seven methods are aggregated to obtain more reliable and accurate outcomes. Second, I-MFCESF is an adaptive method that adjusts itself to the given dataset. Some advantages of I-MFCESF are listed below:
The proposed method incorporates seven different categorical effect size measures that were proposed under various conditions. In the literature, the \(Cramer{^\prime}s V\), \(Pearson{^\prime}s c\), \(Tschuprow{^\prime}s T\), \(Cohen{^\prime}s w\), \(G-K Tau\), \(U\) and \(\lambda\) effect size measures are the most commonly used for \(r\times c\) contingency tables. The interpretation ranges of these measures are on the same scale. Thus, these measures are selected for the proposed method.
\(IFCM\), which takes into consideration the hesitancy in the degree of membership of an object in a cluster, is used to improve the performance of the proposed method and obtain more accurate results.
\(I-MFCESF\) gathers the information of the selected effect size measures into functions by considering their accuracy for a given dataset. For example, the X method may perform better than the Y method on one dataset, while the Y method may perform better than the X method on another. In this case, the weight of the X method in the best function will be higher for the first dataset, while the weight of the Y method in the best function will be higher for the second dataset. For this reason, the proposed method is adaptive.
\(I-MFCESF\) usually assigns a higher weight to the effect size measures that perform best in terms of MAPE among the seven measures.
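The selection rule behind this adaptivity, choosing the function with the lowest MAPE, can be sketched as below. This is a minimal illustration rather than the author's code: the helper names are hypothetical, MAPE is assumed to be the mean absolute relative error, and the reference and candidate values are made up for the example.

```python
def mape(reference, predictions):
    """Mean absolute percentage error as a fraction; references must be non-zero."""
    return sum(abs((r - p) / r) for r, p in zip(reference, predictions)) / len(reference)


def select_best_function(reference, candidates):
    """Return the name of the candidate with the lowest MAPE and all scores.

    candidates maps a function name to its list of predictions.
    """
    scores = {name: mape(reference, preds) for name, preds in candidates.items()}
    best_name = min(scores, key=scores.get)
    return best_name, scores


# Hypothetical reference values and outputs of two candidate functions.
reference = [0.20, 0.30, 0.25]
candidates = {
    "I-MFCESF_1": [0.26, 0.21, 0.30],
    "I-MFCESF_2": [0.22, 0.28, 0.24],
}
best_name, scores = select_best_function(reference, candidates)
```

Running the same selection on a different dataset may return a different function, with different weights for the individual measures, which is the sense in which the method adapts to the data.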
To demonstrate the performance of the proposed method, we generated two independent random categorical variables with a sample size of N = 1000 and t = 1000 replications. In addition, we investigated the real-world Dermatology dataset, which is taken from the UCI Machine Learning Repository. In the simulation study, the MAPE was 0.4168 with a bias of −0.0106 for the 2 × 3 contingency table, 0.3581 with a bias of −0.0019 for the 2 × 4 contingency table, and 0.2753 with a bias of −0.0032 for the 3 × 4 contingency table. For the real data, the MAPE was 0.3196 with a bias of −0.0083 for the 2 × 3 contingency table, 0.4767 with a bias of −0.0595 for the 2 × 4 contingency table, and 0.3335 with a bias of −0.0370 for the 3 × 4 contingency table. Both the simulation study and the real-data applications showed that the proposed method predicts better than the other effect size measures in terms of MAPE and bias. The MAPE of the proposed method was lower than that of the other methods in all applications, and the bias was within ±10%. These results suggest that I-MFCESFs improve prediction accuracy by combining the results of different effect size measures.
A limitation of the study is that the performance of the proposed method depends on the performance of the clustering algorithm. Although IFCM accounts for the hesitancy of an object in belonging to a cluster, it does not account for outliers in the dataset. A possibilistic fuzzy clustering algorithm, which does account for outliers, could be adapted to MFF; this is left for future study. As a future research direction, we therefore plan to utilize possibilistic fuzzy c-means and to combine effect size measures used for different types of variables. In addition, to improve the performance of the proposed method, different categorical effect size measures can be included in MFF.
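The simulation design summarized above (two independent categorical variables, N = 1000, t = 1000 replications) can be sketched as follows. This is a rough illustration under stated assumptions, not the author's simulation code: category probabilities are assumed to be uniform, only \(Cramer{^\prime}s V\) is computed, and the function names and the seed are hypothetical.

```python
import math
import random


def simulate_independent_table(n, r, c, rng):
    """Cross-tabulate n draws of two independent, uniformly distributed factors."""
    table = [[0] * c for _ in range(r)]
    for _ in range(n):
        table[rng.randrange(r)][rng.randrange(c)] += 1
    return table


def cramers_v(table):
    """Cramer's V from an r x c table of counts."""
    r, c = len(table), len(table[0])
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # skip cells whose row or column is empty
                chi2 += (table[i][j] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * min(r - 1, c - 1)))


def mean_v_under_independence(n=1000, t=1000, r=2, c=3, seed=0):
    """Average Cramer's V over t simulated tables of size r x c."""
    rng = random.Random(seed)
    total = sum(cramers_v(simulate_independent_table(n, r, c, rng)) for _ in range(t))
    return total / t
```

Calling `mean_v_under_independence(r=2, c=4)` or `mean_v_under_independence(r=3, c=4)` covers the other two table sizes. Because the variables are generated independently, the simulated values of V should stay close to zero, with the remaining deviation reflecting sampling variation at N = 1000.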
Data availability
The real dataset is taken from the UCI Machine Learning Repository and is openly accessible at https://archive.ics.uci.edu/ml/datasets/Dermatology. The simulated datasets generated during the current study are available from the corresponding author on reasonable request.
References
Sullivan, G. M. & Feinn, R. Using effect size—or why the P value is not enough. J. Grad. Med. Educ. 4, 279–282 (2012).
Kelley, K. & Preacher, K. J. On effect size. Psychol. Methods 17, 137 (2012).
Ellis, P. D. Thresholds for interpreting effect sizes. Retrieved January 13, 2014 (2009).
Sullivan, G. M. & Feinn, R. Using effect size—or why the P value is not enough. J. Grad. Med. Educ. 4, 279–282 (2012).
Yule, G. U. On the methods of measuring association between two attributes. J. R. Stat. Soc. 75, 579–652 (1912).
Pearson, K. & Heron, D. On theories of association. Biometrika 9, 159–315 (1913).
Berry, K. J., Johnston, J. E. & Mielke, P. W. Jr. A measure of effect size for R× C contingency tables. Psychol. Rep. 99, 251–256 (2006).
Bergsma, W. A bias-correction for Cramér’s V and Tschuprow’s T. J. Korean Stat. Soc. 42, 323–328 (2013).
Tschuprow, A. A. & Tschuprow, A. Grundbegriffe und Grundprobleme der Korrelationstheorie (BG Teubner, 1925).
Cohen, J. The Concepts of Power Analysis. Statistical Power Analysis for the Behavioral Sciences (Erlbaum, 1988).
Mills, P. Efficient statistical classification of satellite measurements. Int. J. Remote Sens. 32, 6109–6132 (2011).
Bezdek, J. C., Ehrlich, R. & Full, W. FCM: The fuzzy c-means clustering algorithm. Comput. Geosci. 10, 191–203 (1984).
Atanassov, K. T. Intuitionistic Fuzzy Sets 1–137 (Springer, 1986).
Zadeh, L. A. Fuzzy sets. Inform. Control 8, 338–353 (1965).
Xu, Z. Some similarity measures of intuitionistic fuzzy sets and their applications to multiple attribute decision making. Fuzzy Optim. Decis. Making 6, 109–121 (2007).
Adlassnig, K.-P. Fuzzy set theory in medical diagnosis. IEEE Trans. Syst. Man Cybern. 16, 260–265 (1986).
De, S. K., Biswas, R. & Roy, A. R. An application of intuitionistic fuzzy sets in medical diagnosis. Fuzzy Sets Syst. 117, 209–213 (2001).
Chaira, T. Intuitionistic fuzzy segmentation of medical images. IEEE Trans. Biomed. Eng. 57, 1430–1436 (2010).
Fan, X., Lei, Y. & Wang, Y. Adaptive partition intuitionistic fuzzy time series forecasting model. J. Syst. Eng. Electron. 28, 585–596 (2017).
Kumar, S. & Gangwar, S. S. Intuitionistic fuzzy time series: An approach for handling nondeterminism in time series forecasting. IEEE Trans. Fuzzy Syst. 24, 1270–1281 (2015).
Lei, Y., Lei, Y. & Fan, X. Multi-factor high-order intuitionistic fuzzy time series forecasting model. J. Syst. Eng. Electron. 27, 1054–1062 (2016).
Tak, N. Type-1 recurrent intuitionistic fuzzy functions for forecasting. Expert Syst. Appl. 140, 112913 (2020).
Gwak, J., Garg, H. & Jan, N. Investigation of robotics technology based on bipolar complex intuitionistic fuzzy soft relation. Int. J. Fuzzy Syst. 25, 1834–1852 (2023).
Tak, N. Meta fuzzy functions: Application of recurrent type-1 fuzzy functions. Appl. Soft Comput. 73, 1–13 (2018).
Tak, N. & Gök, A. Dating currency crises and designing early warning systems: Meta-possibilistic fuzzy index functions. Int. J. Financ. Econ. 27, 3773–3790 (2022).
Gök, A. & Tak, N. Dating currency crisis and assessing the determinants based on meta fuzzy index functions. Comput. Econ. 2022, 1–26 (2022).
Tak, N., Egrioglu, E., Bas, E. & Yolcu, U. An adaptive forecast combination approach based on meta intuitionistic fuzzy functions. J. Intell. Fuzzy Syst. 40, 9567–9581 (2021).
Cevik, F. C., Gever, B., Tak, N. & Khaniyev, T. Forecast combination approach with meta-fuzzy functions for forecasting the number of immigrants within the maritime line security project in Turkey. Soft Comput. 2023, 1–27 (2023).
Tak, N. Forecast combination with meta possibilistic fuzzy functions. Inf. Sci. 560, 168–182 (2021).
Yabacı Tak, A. & Ercan, I. Ensemble of effect size methods based on meta fuzzy functions. Eng. Appl. Artif. Intell. 119, 105804 (2023).
Cramer, H. Mathematical Methods of Statistics (Princeton University Press, 1946).
Gravetter, F. J., Wallnau, L. B., Forzano, L.-A.B. & Witnauer, J. E. Essentials of Statistics for the Behavioral Sciences (Cengage Learning, 2020).
Healey, J. F. Statistics: A Tool for Social Research (Cengage Learning, 2014).
Howell, D. C. Statistical Methods for Psychology 6th edn (Thomson Wadsworth, 2007).
Tschuprow, A. A. Principles of the Mathematical Theory of Correlation (1939).
Pearson, K. I. Mathematical contributions to the theory of evolution—VII On the correlation of characters not quantitatively measurable. Philos. Trans. R. Soc. Lond. Ser. A Contain. Pap. Math. Phys. Char. 195, 1–47 (1900).
Cohen, J. Statistical Power Analysis for the Behavioral Sciences 20–26 (Lawrence Erlbaum Associates, 1988).
Goodman, L. A. & Kruskal, W. H. Measures of association for cross classifications III: Approximate sampling theory. J. Am. Stat. Assoc. 58, 310–364 (1963).
Kruskal, W. H. & Goodman, L. Measures of association for cross classifications. J. Am. Stat. Assoc. 49, 732–764 (1954).
Somers, R. H. A Similarity between Goodman and Kruskal’s Tau and Kendall’s Tau, with a Partial Interpretation of the Latter. J. Am. Stat. Assoc. 57, 804–812 (1962).
Theil, H. Statistical Decomposition Analysis: With Applications in the Social and Administrative Sciences (North-Holland Publishing Company, 1972).
Asuncion, A. & Newman, D. J. UCI Machine Learning Repository (University of California, Irvine, 2007).
Cebeci, Z. Partitioning Cluster Analysis with Possibilistic C-Means. (2017).
Ben-Shachar, M. S., Makowski, D., Lüdecke, D., Kelley, K. & Stanley, D. (2021).
Mangiafico, S. & Mangiafico, M. S. Package ‘rcompanion’. Cran Repos. 20, 1–71 (2017).
Ferraro, M. B., Giordani, P. & Serafini, A. fclust: An r package for fuzzy clustering. R J. 11, 198 (2019).
Hamner, B., Frasco, M. & LeDell, E. Package ‘Metrics’. In R Foundation for Statistical Computing (2018).
Li, J. C. H., Chan, W. & Cui, Y. Bootstrap standard error and confidence intervals for the correlations corrected for indirect range restriction. Br. J. Math. Stat. Psychol. 64, 367–387 (2011).
Author information
Contributions
A.Y.T.: conceptualization, methodology, software, validation, formal analysis, investigation, writing—original draft, writing—review & editing.
Ethics declarations
Competing interests
The author declares no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yabacı Tak, A. An adaptive categorical effect size method based on intuitionistic meta fuzzy functions. Sci Rep 13, 17403 (2023). https://doi.org/10.1038/s41598-023-44691-6
DOI: https://doi.org/10.1038/s41598-023-44691-6