Introduction

Statistical significance (p-value) is the probability that the observed difference between two groups is due to chance. If the p-value is greater than the chosen alpha level, any observed difference is assumed to be explained by sampling variability. When conducting statistical comparisons with exceptionally large sample sizes, the p-value is highly likely to indicate a significant difference consistently. However, statistically significant differences that arise from a large number of data points do not always represent meaningful differences in reality1; a statistically significant result may arise simply from using a large sample. Statistical significance depends on both the sample size and the effect size (ES), whereas the effect size is generally independent of the sample size2. Therefore, reporting only the p-value, especially in large samples, is not sufficient for readers to fully understand the implications3,4. Effect size (ES) is a fundamental quantity of quantitative research, and it indicates the real magnitude of an effect. In addition to statistical significance, it enables researchers to understand the practical significance of the findings. Statistical hypothesis tests can be misleading due to type 1 and type 2 errors, whose rates depend on the sample size. For this reason, many disciplines require the effect size to be reported alongside the p-value. The seven categorical effect size methods used for \(r\times c\) contingency tables in this study are explained in section “Categorical effect size methods”. The \(Cramer{^\prime}s V\) effect size measure has some disadvantages. First, \(Cramer{^\prime}s V\) is a symmetric measure of association5,6,7. Second, it is zero under the assumption of independence. Third, interpretation of \(Cramer{^\prime}s V\) effect size measures is difficult8. \(Tschuprow{^\prime}s T\) is closely related to \(Cramer{^\prime}s V\) but less well-known9. Since it is a simple function of the Pearson chi-square statistic, it is among the commonly used effect sizes. However, the bias of the measure is large in small samples and it is difficult to interpret8. \(Cohen{^\prime}s w\) is more appropriate for larger contingency tables10. The \(Uncertainty coefficient (U)\) is also a commonly used effect size to measure the validity of a statistical classification algorithm11.

Considering the disadvantages of the ES methods, it is important to select the correct ES method for a given dataset. To overcome the aforementioned disadvantages, the seven selected ES methods are aggregated in functions based on their performance for a given dataset. In this sense, the motivation of this paper is to combine different categorical effect size methods in functions with Meta Fuzzy Functions \((MFF)\) based on the Intuitionistic Fuzzy C-Means \(\left(IFCM\right)\) clustering algorithm. The fuzzy c-means (FCM) clustering algorithm is used in MFF. FCM, proposed by Bezdek et al.12, stands out as one of the most frequently employed methods because of its simplicity and the benefits it offers compared to the k-means clustering algorithm. Nevertheless, it has certain drawbacks, including its susceptibility to initial settings and its sensitivity to noise. In this sense, IFCM, which accounts for the hesitancy of an object belonging to a cluster, is employed in MFF. Intuitionistic Fuzzy Sets \((IFSs)\) were introduced by Atanassov13 as a modification of Zadeh’s fuzzy set theory14. The main difference between fuzzy sets and \(IFSs\) is that fuzzy sets only consider the membership degree, while \(IFSs\) consider both membership and non-membership degrees. That is, IFSs also account for the hesitancy of membership grades in clusters. Thus, the centers of the clusters are obtained more accurately. Studies have shown that IFSs handle uncertainty more effectively than traditional fuzzy set theory15. \(IFSs\) have been commonly used for forecasting and engineering problems. In addition to time series and forecasting methods, \(IFSs\) are widely used in the field of medicine for clustering images and diagnostics16,17,18. Numerous studies employing IFSs have been proposed by Fan et al.19, Kumar and Gangwar20, Lei et al.21, Tak22, and Gwak et al.23.

Because of the aforementioned advantages of IFCM reported in the literature, it is employed in MFF. The \(MFF\) approach was proposed by Tak24. The purpose of MFF is to combine methods or definitions used for the same purpose. Its logic is simply based on meta-analysis, a method that combines the outcomes of multiple studies to yield stronger results for a specific purpose. For example, Tak and Gök25 and Gök and Tak26 utilized the MFF to merge different definitions of currency crisis. By employing this approach, they aimed to enhance the accuracy and reliability of their analysis. Similarly, Tak et al.27 employed the MFF to combine various time series methods; their objective was to improve forecasting performance by integrating multiple forecasting techniques within the framework. Cevik et al.28 used the MFF approach to forecast the number of immigrants within the maritime line. Tak29 used the MFF approach for forecast combination. These studies have shown that combining different methods with the \(MFF\) yields better estimation accuracy.

Yabacı Tak and Ercan30 ensembled several ES definitions for two independent groups with MFF to obtain a more accurate effect size value. They combined six effect size methods for numerical variables with the MFF approach by using the classical fuzzy c-means algorithm \((FCM)\), which can be used with or without the assumption of a normal distribution. The methods combined in that study were not applicable to categorical variables; thus, numerous categorical ES methods are combined in the present study. Besides, the \(FCM\) clustering method only uses membership degrees while calculating the cluster centers. Therefore, the \(MFF\) approach with the \(IFCM\), which provides a more accurate estimation of the cluster centers, is developed in this study.

In the light of this information, we introduce the intuitionistic meta fuzzy categorical effect size functions \(\left( {I - MFCESF} \right)\) approach. The aim of the study is to obtain better outcomes by combining seven categorical effect size measures in functions. The rationale for combining the ES measures is that each measure may carry full or only partial information for a given dataset. Therefore, the methods that perform better are gathered into one function, while the methods that perform worse are gathered into another. In the remainder of the paper, we describe the IFCM and the meta fuzzy functions briefly in the section “Preliminaries”. The proposed method \(\left(I-MFCESF\right)\) is discussed in section “Intuitionistic meta fuzzy categorical effect size functions (I-MFCESF)”. The performance of the proposed method is evaluated with applications on simulated and real datasets in section “Evaluation”. Finally, the results of the proposed method are discussed in section “Conclusion”.

Preliminaries

The methods used in the paper (categorical effect size measures, intuitionistic fuzzy c-means, and meta fuzzy functions) are detailed in this section.

Categorical effect size methods

Short descriptions of seven ES measures for \(r\times c\) contingency tables are provided. \(Cramer{\prime}s V\) was proposed in 1946 and is an effect size measure that is generally used with nominal variables in \(r\times c\) contingency tables7,31,32,33,34. It is calculated as in Eq. (1), based on Pearson’s chi-square statistic, and takes values between 0 and +1.

$$V = \sqrt{\frac{\varphi^{2}}{\min \left( {c - 1,r - 1} \right)}} = \sqrt{\frac{\chi^{2} /n}{\min \left( {c - 1,r - 1} \right)}}$$
(1)

where, \({\chi }^{2}\) is the Pearson chi-square statistic, \(n\) is the total number of observations, \(c\) is the number of columns, \(r\) is the number of rows, and \(\varphi^{2}={\chi }^{2}/n\). In Eq. (1), the numerator depends on the observed frequencies, whereas the denominator depends only on the dimensions of the table. \(Cramer{^\prime}s V=1\) is attained only under complete association, that is, when one of the variables is completely determined by the other.
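As an illustration, a minimal base-R sketch of Eq. (1) is given below. The function assumes a contingency table built with `table()` or supplied as a matrix, and the \(2\times 3\) example table is hypothetical.

```r
# Minimal sketch of Eq. (1); chisq.test() from base R supplies the Pearson chi-square statistic.
cramers_v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n    <- sum(tab)
  as.numeric(sqrt((chi2 / n) / (min(nrow(tab), ncol(tab)) - 1)))
}

# Hypothetical 2 x 3 contingency table
tab <- matrix(c(20, 35, 45, 30, 40, 30), nrow = 2, byrow = TRUE)
cramers_v(tab)
```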

\(Tschuprow{^\prime}s T\) is an ES measure of the association between two nominal variables in \(r\times c\) contingency tables35. It takes values between 0 and +1 and is calculated as in Eq. (2).

$$T = \sqrt{\frac{\varphi^{2}}{\sqrt{\left( {c - 1} \right)\left( {r - 1} \right)}}}$$
(2)

where, \(\varphi^{2}={\chi }^{2}/n\), \({\chi }^{2}\) is the Pearson chi-square statistic, \(c\) is the number of columns and \(r\) is the number of rows.

Another categorical effect size measure is the \(Pearson{^\prime}s contingency coefficient\) (\(Pearson{^\prime} s c\)). It takes values between 0 and +1 and is calculated as in Eq. (3)36.

$$Pearson{^\prime}s c = \sqrt {\frac{{\chi^{2} }}{{\chi^{2} + n}}}$$
(3)

where, \(\chi^{2}\) is the Pearson chi-square statistic and \(n\) is the total number of observations.
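A base-R sketch of Eqs. (2) and (3) is given below, under the same assumptions as the \(Cramer{^\prime}s V\) sketch; note that the Tschuprow denominator involves the square root of \((c-1)(r-1)\), as in Eq. (2).

```r
# Sketch of Eq. (2): Tschuprow's T
tschuprows_t <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  phi2 <- chi2 / sum(tab)
  as.numeric(sqrt(phi2 / sqrt((nrow(tab) - 1) * (ncol(tab) - 1))))
}

# Sketch of Eq. (3): Pearson's contingency coefficient
pearsons_c <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  as.numeric(sqrt(chi2 / (chi2 + sum(tab))))
}
```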

The \(Cohen{^\prime} s w\) effect size was proposed by Cohen37. \(Cohen{^\prime} s w\) should be used for larger contingency tables. The \(Cohen{^\prime}s w\) effect size measure is obtained as in Eq. (4).

$$w = \sqrt {\mathop \sum \limits_{i = 1}^{m} \frac{{\left( {p_{1i} - p_{0i} } \right)^{2} }}{{p_{0i} }}}$$
(4)

where, m is the number of cells, \(p_{0i}\) is the proportion in the ith cell under the null hypothesis, and \(p_{1i}\) is the proportion in the ith cell under the alternative hypothesis.
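A sketch of Eq. (4) for a contingency table is given below, assuming that \(p_{0i}\) are the cell proportions expected under independence and \(p_{1i}\) the observed cell proportions.

```r
# Sketch of Eq. (4): Cohen's w for an r x c table
cohens_w <- function(tab) {
  p1 <- tab / sum(tab)                     # observed cell proportions
  p0 <- outer(rowSums(p1), colSums(p1))    # cell proportions expected under independence (H0)
  sqrt(sum((p1 - p0)^2 / p0))
}
```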

\(Goodman - Kruskal Tau \left( {G - K Tau} \right)\) is another ES measure for nominal variables. It measures the proportional predictability of the column (or row) variable given the value of the other variable. The measure varies between 0 and 138,39. \(G - K Tau\) is calculated as in Eq. (5)40.

$$GK - Tau = \frac{{n\mathop \sum \nolimits_{ij} \left( {\frac{{a_{ij}^{2} }}{{a_{.j} }}} \right) - \mathop \sum \nolimits_{i} a_{i.}^{2} }}{{n^{2} - \mathop \sum \nolimits_{i} a_{i.}^{2} }} , \;\; i = 1, \ldots r,\;\; j = 1, \ldots c$$
(5)

where, \(n\) is the total number of observations, \(a_{ij}\) is the number of observations in the ith row and jth column, \(a_{.j}\) is the total number of observations in the jth column, and \(a_{i.}\) is the total number of observations in the ith row.
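A direct base-R transcription of Eq. (5) is sketched below; it treats the row variable as the one being predicted from the column variable, following the notation above.

```r
# Sketch of Eq. (5): Goodman-Kruskal tau
gk_tau <- function(tab) {
  n       <- sum(tab)
  row_tot <- rowSums(tab)
  col_tot <- colSums(tab)
  num <- n * sum(sweep(tab^2, 2, col_tot, "/")) - sum(row_tot^2)
  den <- n^2 - sum(row_tot^2)
  num / den
}
```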

The \(Uncertainty coefficient \left( U \right)\) was first introduced by Theil41. It is also called Proficiency, Entropy Coefficient, or Theil’s U. It is often used as an ES measure for nominal variables in statistics and takes values between 0 and +1. This measure is defined in Eq. (6).

$$U\left( {X \mid Y} \right) = \frac{{H\left( X \right) - H\left( {X \mid Y} \right)}}{{H\left( X \right)}} = \frac{{I\left( {X;Y} \right)}}{{H\left( X \right)}}$$
(6)
$$H\left( X \right) = - \mathop \sum \limits_{x} P_{X} \left( x \right)\log P_{X} \left( x \right), \quad H\left( {X \mid Y} \right) = - \mathop \sum \limits_{x,y} P_{X,Y} \left( {x,y} \right)\log P_{X \mid Y} \left( {x \mid y} \right)$$
(7)

where, \(H\left( X \right)\) is the entropy of the marginal distribution of \(X\), \(H\left( {X \mid Y} \right)\) is the conditional entropy, \(U\left( {X \mid Y} \right)\) is the uncertainty coefficient, and \(P_{X,Y} \left( {x,y} \right)\) is the joint distribution.
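The entropies in Eqs. (6)–(7) can be estimated from the observed cell proportions of a contingency table, as in the sketch below; natural logarithms are used, and the base cancels in the ratio.

```r
# Sketch of Eqs. (6)-(7): uncertainty coefficient U(X|Y), with X the row variable
uncertainty_coef <- function(tab) {
  p_xy <- tab / sum(tab)                       # joint distribution P(x, y)
  p_x  <- rowSums(p_xy)                        # marginal distribution of X
  p_y  <- colSums(p_xy)                        # marginal distribution of Y
  h_x  <- -sum(p_x * log(p_x), na.rm = TRUE)   # H(X)
  p_x_given_y <- sweep(p_xy, 2, p_y, "/")      # P(x | y)
  h_x_given_y <- -sum(p_xy * log(p_x_given_y), na.rm = TRUE)  # H(X | Y)
  (h_x - h_x_given_y) / h_x
}
```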

The \(Goodman - Kruskal Lambda \left( \lambda \right)\) statistic is an effect size proposed to measure the strength of the relationship between two nominal variables by evaluating the proportional reduction in error (PRE)39. \(\lambda\) is an asymmetrical measure and takes values between 0 and 1. The \(\lambda\) statistic is calculated as in Eq. (8).

$$\lambda = \frac{{E_{1} - E_{2 } }}{{E_{1} }}$$
(8)

where, \(E_{1}\) is the number of prediction errors made when the independent variable is ignored, and \(E_{2}\) is the number of prediction errors made when the prediction is based on the independent variable.
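For a contingency table with the row variable as the dependent variable, \(E_{1}\) and \(E_{2}\) can be computed from the marginal and column-wise modal frequencies, as in the sketch below.

```r
# Sketch of Eq. (8): Goodman-Kruskal lambda (predicting the row variable from the column variable)
gk_lambda <- function(tab) {
  n  <- sum(tab)
  e1 <- n - max(rowSums(tab))          # errors when the column variable is ignored
  e2 <- n - sum(apply(tab, 2, max))    # errors when predicting the modal row within each column
  (e1 - e2) / e1
}
```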

IFCM

Over the past decades, the fuzzy set theory proposed by Zadeh14 has been expanded with different approaches. Among these, intuitionistic fuzzy set theory, which has been commonly used in the literature and has many applications in different fields, was developed by Atanassov13. While only the membership degree is taken into account in FCM, the non-membership degree is also taken into account in IFCM, so the cluster centers are calculated more accurately. The algorithm is given below22:

Step-1.:

Determine the number of clusters \(\left( c \right)\), the fuzziness index (f), and initialize the cluster centers \(\left( {v_{i} } \right)\) randomly.

Step-2.:

Calculate the degrees of membership (\(\mu\)) and non-membership (\(u\)). The formulas are given in Eqs. (9)–(11):

$$\mu_{ik} = \left[ {\mathop \sum \limits_{j = 1}^{c} \left( {\frac{{d\left( {x_{k} ,v_{i} } \right)}}{{d\left( {x_{k} ,v_{j} } \right)}}} \right)^{{\frac{2}{f - 1}}} } \right]^{ - 1} , \;\;i = 1,2, \ldots ,c ;\;\;k = 1,2, \ldots ,n$$
(9)

where \(d\left( \cdot \right)\) is the Euclidean distance between the kth data point and the ith cluster center:

$$u_{ik} = \left( {1 - \mu_{ik}^{\alpha } } \right)^{1/\alpha } , \quad \alpha > 0$$
(10)
$$\mu_{ik}^{*} = 1 - u_{ik}$$
(11)
Step-3.:

Update the cluster centers by using Eq. (12):

$$v_{i} = \frac{{\mathop \sum \nolimits_{k = 1}^{n} \left( {\mu_{ik}^{*} } \right)^{f} x_{k} }}{{\mathop \sum \nolimits_{k = 1}^{n} \left( {\mu_{ik}^{*} } \right)^{f} }} , i = 1,2, \ldots ,c$$
(12)
Step-4.:

The algorithm terminates when the difference between two consecutive iterations drops below a given threshold ε; otherwise, Step-2 and Step-3 are repeated. A compact sketch of these steps is given below.
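The following base-R sketch implements Steps 1–4 (Eqs. 9–12) under simplifying assumptions: the data are given as a numeric matrix `x` with one row per object, the initial centers are random rows of `x`, and `alpha` denotes the intuitionistic parameter of Eq. (10). It is a minimal sketch, not a reference implementation.

```r
ifcm <- function(x, c = 2, f = 2, alpha = 2, max_iter = 100, eps = 1e-6) {
  x <- as.matrix(x)
  n <- nrow(x)
  v <- x[sample(n, c), , drop = FALSE]                # Step 1: random initial centers
  for (iter in seq_len(max_iter)) {
    # Step 2: Euclidean distances (n x c) and membership degrees (Eq. 9)
    d <- t(apply(x, 1, function(xk) sqrt(colSums((t(v) - xk)^2))))
    d[d < 1e-12] <- 1e-12
    p  <- 2 / (f - 1)
    mu <- 1 / (d^p * rowSums(1 / d^p))
    # Non-membership and hesitancy-corrected membership (Eqs. 10-11)
    u       <- (1 - mu^alpha)^(1 / alpha)
    mu_star <- 1 - u
    # Step 3: update the cluster centers (Eq. 12)
    v_new <- (t(mu_star^f) %*% x) / colSums(mu_star^f)
    # Step 4: stop when the centers no longer move
    if (max(abs(v_new - v)) < eps) { v <- v_new; break }
    v <- v_new
  }
  list(centers = v, membership = mu_star)
}
```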

Meta fuzzy functions

Tak24 proposed MFF to combine different methods or definitions, such as prediction and forecasting methods. The MFF consists of three components: the functions, which are linear combinations of the weights and the outcomes of the selected methods; the weights, which are computed from the membership grades obtained from the fuzzy c-means clustering algorithm; and the best meta fuzzy function, which is the function with the best evaluation criterion. Meta fuzzy functions begin with obtaining the outcomes of the methods chosen for a purpose as the input matrix. The input matrix is then clustered using the fuzzy c-means clustering algorithm to separate the categorical ES methods based on how well they predict outcomes. As a result, each method is assigned to a cluster with a membership grade. Then, using the membership grades for each cluster, the weights of the methods are calculated. In this case, there are as many functions as clusters. Finally, the best meta fuzzy function is selected based on its evaluation criterion.

Intuitionistic meta fuzzy categorical effect size functions \(\left( {{\varvec{I}} - {\varvec{MFCESF}}} \right)\)

\(Cramer{^\prime} s v, \;\;Tschuprow{^\prime} s T, \;\;Pearson{^\prime} s c, \;\;Cohen{^\prime} s w,\;\; G - K Tau,\;\; U\) and \(\lambda\) can be used to calculate effect size measures for a dataset. However, there is no definite information in the literature about which method is better or in which situations each should be used; the performance of the methods may therefore change with the type of dataset. Because the performance of the ES measures is uncertain, we search for the optimum weights of the ES measures in the combination function. For this purpose, the \(I - MFCESF\) method is proposed in this paper. The ES measures are clustered based on their performances by using the IFCM, and there are as many functions as clusters. Functions are obtained by multiplying each method by its weight in the clusters. The ES measures that perform better for the dataset appear in one function with higher membership degrees, while the ES measures that perform worse appear in another function with higher membership degrees. Finally, the function with the minimum model evaluation criterion is selected as \(I - MFCESF _{best}\) and the new effect size value is calculated for the dataset. Thus, \(I - MFCESF\) is an adaptive combination of categorical effect size measures. The step-by-step algorithm, pseudocode, and flowchart of the \(I - MFCESF\) approach are given below.

Algorithm 1

Step 1.:

Determine \(m\) categorical ES measures and simulate data randomly for \(t\) iterations. Obtain the input matrix (\(Z\)) by applying the \(m\) measures to the simulated dataset over the \(t\) repeats.

$$Z = [Z_{ij} ] , i = 1,2, \ldots ,t ;\;\; j = 1,2, \ldots ,m$$
(13)

where, \(Z_{ij}\) is the ES value of the ith repeat for the jth measure.

$$Z = \left[ {\begin{array}{*{20}c} {Z_{1,1} } & {Z_{1,2} } & \ldots & {Z_{1,m} } \\ {Z_{2,1} } & {Z_{2,2} } & \ldots & {Z_{2,m} } \\ \vdots & \vdots & \ddots & \vdots \\ {Z_{t,1} } & {Z_{t,2} } & \ldots & {Z_{t,m} } \\ \end{array} } \right]$$
Step 2.:

The input matrix is clustered by using intuitionistic fuzzy c-means.

Step 2.1.:

The number of fuzzy clusters \(\left( c \right)\) is determined and fuzzy index value \(\left( f \right)\) and center of clusters \(\left( v \right)\) are initialized.

Step 2.2.:

The degrees of membership (\(\mu\)) and non-membership are calculated in each cluster with Eqs. (9)–(11).

Step 2.3.:

The new clusters center is calculated by using Eq. (12).

Step 2.4.:

If the difference between two consecutive iterations drops below a given threshold, stop the algorithm; otherwise, repeat Step 2.2 and Step 2.3.

Step 3.:

Intuitionistic meta categorical effect size functions are obtained. \(I - MFCESF\) is given in Eq. (14).

$$I - MFCESF_{i} \left( z \right) = \mathop \sum \limits_{j = 1}^{m} w_{ij} z_{j} , \;\; i = 1,2, \ldots ,c$$
(14)
$$w_{ij} = \frac{{\mu_{ij}^{*} }}{{\mathop \sum \nolimits_{j = 1}^{m} \mu_{ij}^{*} }} ,\; i = 1,2, \ldots , c$$
(15)

where, c is the number of clusters, \(\mu_{ij}^{*}\) is the membership grade of the jth method in the \(i\)th cluster, \(I - MFCESF_{i}\) is the ith intuitionistic meta categorical effect size function, and \(w_{ij}\) is the weight of the jth method in the \(i\)th cluster.

Step 4.:

Select the best intuitionistic meta categorical effect size function, that is, the one with the minimum mean absolute percentage error (MAPE).

MAPE values are calculated to select \(I - MFCESF_{best}\). The MAPE formula is given in Eq. (16).

$$MAPE = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left| {\frac{{y_{i} - \hat{y}_{i} }}{{y_{i} }}} \right|$$
(16)

where, \(y_{i }\) is the mean of the ES values calculated from each method for the population and \(\widehat{{y_{i} }}\) is the predicted ES value obtained from the 1000 simulated samples. The pseudocode and the flowchart of \(I - MFCESF\) based on MFF are given in Algorithm 2 and Fig. 1, respectively; a sketch of Steps 3 and 4 is also given below.

Figure 1. Flowchart of \(I-MFCESF\).

Algorithm 2. Pseudocode of \(I-MFCESF\).
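A hedged base-R sketch of Steps 3–4 follows. It assumes `Z` is the \(t\times m\) input matrix of Eq. (13), reuses the `ifcm()` routine sketched earlier (run on the transposed matrix, since the methods are the objects being clustered), and takes the reference (“actual”) ES value per repeat as the row mean of `Z`, as described for the MAPE computation.

```r
# Sketch of Eqs. (14)-(16): weights, functions, and selection of the best function by MAPE
i_mfcesf <- function(Z, c = 2, f = 2, alpha = 2) {
  fit     <- ifcm(t(Z), c = c, f = f, alpha = alpha)          # memberships: m methods x c clusters
  mu_star <- fit$membership
  w       <- apply(mu_star, 2, function(col) col / sum(col))  # Eq. (15): m x c weight matrix
  funcs   <- Z %*% w                                          # Eq. (14): t x c combined ES values
  actual  <- rowMeans(Z)                                      # reference ES value per repeat
  mape    <- colMeans(abs((actual - funcs) / actual))         # Eq. (16), one value per function
  best    <- which.min(mape)
  list(weights = w, mape = mape, best = best, es = funcs[, best])
}
```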

Evaluation

The estimation performance of the proposed I-MFCESF method is evaluated through both simulation studies and real-world datasets. In the simulation study, two categorical variables (x and y) are generated randomly to create contingency tables of different sizes (2 × 3, 2 × 4, and 3 × 4). These tables are generated for a sample size of N = 1000 and repeated for t = 1000 iterations. Real-world datasets are obtained from the UCI Machine Learning Repository42, and 1000 different samples are taken with replacement from these datasets. By applying the selected categorical effect size methods to each dataset, an input matrix (Z) is obtained. The I-MFCESF method incorporates two crucial parameters: the number of clusters (c) and the fuzziness index parameter (f). To determine the optimal number of clusters (c), the MAPE of the I-MFCESF is calculated iteratively for cluster numbers between 2 and 5 and the value with the minimum MAPE is chosen. Due to the lack of consensus on the optimal value of the fuzziness index parameter of IFCM (the intuitionistic fuzzy c-means algorithm), a value of 2 is selected for this study. The performance of the proposed method is evaluated using the MAPE, which measures the average percentage difference between the estimated values and the true values.

The simulation study and real-world dataset applications of the I-MFCESF method are conducted in R Studio. Various R packages, namely “ppclust”, “effectsize”, “DescTools”, “fclust”, “rcompanion”, “remotes”, “githubinstall”, and “Metrics”, are utilized43,44,45,46,47. As an application, the seven selected categorical ES measures are combined by using MFF based on intuitionistic fuzzy c-means to obtain more accurate results for all datasets.

Simulated \(2\times 3\), \(2\times 4\) and \(3\times 4\) contingency tables for categorical variables

Two categorical variables x and y (\(2 \times 3\), \(2 \times 4\) and \(3 \times 4\) contingency tables) are simulated randomly for N = 1000 sample size and t = 1000 iterations. The selected measures, \(Cramer{^\prime} s v\) (Measure 1), \(Tschuprow{^\prime} s T\) (Measure 2), \(Pearson{^\prime} s c\) (Measure 3), \(Cohen{^\prime} s w\) (Measure 4), \(G - K Tau\) (Measure 5), \(U\) (Measure 6) and \(\lambda\) (Measure 7), are applied to all datasets. The input matrix \(\left( Z \right)\) consists of the outcomes of the ES measures for the simulated dataset. The proposed method utilizes the IFCM clustering algorithm, where the fuzziness index parameter (f) is set to 2. After obtaining the input matrix, the IFCM algorithm is applied. In this method, the number of functions is equal to the optimal number of clusters. Functions are obtained by multiplying the outcomes of the methods by their weights and summing them up (Eq. 14). The weights of each method in each function are obtained as in Algorithm 1 (Step 3). Finally, the MAPE values are calculated for each of the obtained \(I - MFCESF\) functions. When calculating the MAPE values, the actual value is taken as the average of the values calculated from the dataset over the selected seven ES measures. The function with the lowest MAPE is chosen as \(I - MFCESF_{best }\) and the new ES value is computed based on this selection.
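A minimal sketch of this simulation design is given below, reusing the measure functions and the `i_mfcesf()` routine sketched earlier. The uniform generating scheme and category probabilities are illustrative assumptions, not the exact design used in the study.

```r
set.seed(1)
N <- 1000; t_rep <- 1000
measures <- list(cramers_v, tschuprows_t, pearsons_c, cohens_w, gk_tau, uncertainty_coef, gk_lambda)

# Input matrix Z (Eq. 13): one row per repeat, one column per measure (2 x 3 design)
Z <- t(replicate(t_rep, {
  x <- sample(1:2, N, replace = TRUE)
  y <- sample(1:3, N, replace = TRUE)
  sapply(measures, function(m) m(table(x, y)))
}))
colnames(Z) <- c("CramersV", "TschuprowsT", "PearsonsC", "CohensW", "GKTau", "U", "Lambda")

# Cluster-number search between 2 and 5 by minimum MAPE, then the final fit
mape_by_c <- sapply(2:5, function(cc) min(i_mfcesf(Z, c = cc)$mape))
fit_best  <- i_mfcesf(Z, c = (2:5)[which.min(mape_by_c)])
```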

The first dataset is simulated for 2 \(\times\) 3 contingency table and the input matrix (\(Z\)) is obtained by applying the selected categorical ES methods. The first five and last five prediction values of the input matrix are summarized in Table 1.

Table 1 Input Matrix for 2 \(\times\) 3 contingency table

For the first simulated dataset, the optimal number of clusters is determined as 2 by selecting the minimum MAPE value over the candidate \(I-MFCESF\)s. As a result, two functions are obtained by multiplying each method by its respective weight. The weights for the \(I-MFCESF\) are computed using the intuitionistic membership grades, as outlined in Table 2. The functions of the proposed method are obtained using the following equations (Eqs. 17, 18).

$$\begin{aligned} I - MFCESF_{1 } & = Cramer^{\prime}s v \times 0.2117 + Pearson^{\prime}s c \times 0.2352 + Tschuprows^{\prime}T \times 0.2388 \\ & \;\;\; + Cohen^{\prime}s w \times 0.2349 + G - K Tau \times 0.0082 + U \times 0.0133 \\ & \;\;\; + \lambda \times 0.0578 \\ \end{aligned}$$
(17)
$$\begin{aligned} I - MFCESF_{2 } & = Cramer^{\prime}s v \times 0.0866 + Pearson^{\prime}s c \times 0.0143 + Tschuprows^{\prime}T \times 0.0057 \\ & \;\;\; + Cohen^{\prime}s w \times 0.0150 + G - K Tau \times 0.3022 + U \times 0.2985 \\ & \;\;\; + \lambda \times 0.2777 \\ \end{aligned}$$
(18)
Table 2 Weights of the \(\mathrm{I}-\mathrm{MFCESF}\) for 2 \(\times\) 3 contingency table

Table 2 provides a clear depiction that \(I - MFCESF_{2 }\) exhibits the lowest MAPE. Therefore, \(I - MFCESF_{2 }\) is identified as the best I-MFCESF. The MAPE values are computed and presented in Table 3, to assess the performance of the proposed method.

Table 3 MAPE and Bias values of the proposed and selected effect size methods for 2 \(\times\) 3 contingency table

Table 3 shows that the I-MFCESF outperforms the other categorical ES methods in terms of the MAPE values. According to Li et al.48, a parameter prediction is considered acceptable when the bias is within ± 10%. The bias value of the proposed method was determined as − 1% in Table 3. Thus, the accuracy of the method is also sufficient in terms of bias.

A subsequent dataset is simulated for a 2 × 4 contingency table, and the input matrix (Z) is obtained by applying the chosen categorical ES methods. Table 4 provides a summary of the first five and last five prediction values found in the input matrix.

Table 4 Input matrix for 2 \(\times\) 4 contingency table

The weights for the \(I-MFCESF\) are calculated by using intuitionistic membership grades as in Table 5, and the functions of the proposed method are obtained as in Eqs. (19, 20).

$$\begin{aligned} I - MFCESF_{1 } & = Cramer^{\prime}s v \times 0.239 + Pearson^{\prime}s c \times 0.2386 + Tschuprows^{\prime}T \times 0.2335 \\ & \;\;\; + Cohen^{\prime}s w \times 0.2380 + G - K Tau \times 0.0039 + U \times 0.0116 \\ & \;\;\; + \lambda \times 0.0395 \\ \end{aligned}$$
(19)
$$\begin{aligned} I - MFCESF_{2 } & = I - MFCESF_{best } = Cramer^{\prime}s v \times 0.0386 + Pearson^{\prime}s c \times 0.0144 \\ & \;\;\; + Tschuprows^{\prime}T \times 0.0252 + Cohen^{\prime}s w \times 0.0156 + G \\ & \;\;\; - K Tau \times 0.3088 + U \times 0.3036 + \lambda \times 0.2938 \\ \end{aligned}$$
(20)
Table 5 Weights of the \(\mathrm{I}-\mathrm{MFCESF}\) for 2 \(\times\) 4 contingency table

Table 5 clearly shows \({I-MFCESF}_{2 }\) has the lowest MAPE. Thus, the best I-MFCESF is \({I-MFCESF}_{2}\). The MAPE values of the methods are computed, and the results are presented in Table 6 to assess the performance of the proposed method.

Table 6 MAPE and Bias values of the proposed and selected effect size methods for \(2\times 4\) contingency tables

Based on the information provided in Table 6, it is evident that the I-MFCESF method demonstrates superior performance compared to the individual categorical effect size methods in terms of MAPE. The bias of the proposed method is determined as − 1.9%. Because the bias is within ± 10%, the accuracy of the proposed method is also sufficient in terms of bias.

Lastly, a dataset is simulated for a 3 × 4 contingency table, and the input matrix (Z) is generated by applying the selected categorical ES methods. Table 7 provides a summary of the first five and last five prediction values found in the input matrix.

Table 7 Input Matrix for 3 \(\times\) 4 contingency table

The weights for the \(I-MFCESF\) are calculated by using intuitionistic membership grades as in Table 8.

Table 8 Weights of the \(\mathrm{I}-\mathrm{MFCESF}\) for 3 \(\times\) 4 contingency table

Table 8 demonstrates that two functions are computed by multiplying each method with their respective weights. In the case of I-MFCESF, the weights are determined using intuitionistic membership grades. The functions of the proposed method are derived using the equations provided in Eqs. (21, 22).

$$\begin{aligned} I - MFCESF_{1 } & = Cramer^{\prime}s v \times 0.1541 + Pearson^{\prime}s c \times 0.2722 + Tschuprows^{\prime}T \times 0.2496 \\ & \;\;\; + Cohen^{\prime}s w \times 0.2708 + G - K Tau \times 0.0155 + U \times 0.0182 \\ & \;\;\; + \lambda \times 0.0197 \\ \end{aligned}$$
(21)
$$\begin{aligned} I - MFCESF_{2 } & = I - MFCESF_{best } = Cramer^{\prime}s v \times 0.1699 + Pearson^{\prime}s c \times 0.0161 \\ & \;\;\; + Tschuprows^{\prime}T \times 0.0677 + Cohen^{\prime}s w \times 0.0181 + G - K Tau \times 0.2439 \\ & \;\;\; + U \times 0.2425 + \lambda \times 0.2417 \\ \end{aligned}$$
(22)

According to Table 8, it is evident that \({I-MFCESF}_{2 }\) exhibits the lowest MAPE. Therefore, \({I-MFCESF}_{2}\) is identified as the best I-MFCESF. The MAPE of the methods are computed and presented in Table 9 to assess the performance of the proposed method.

Table 9 MAPE and Bias values of the proposed and selected effect size methods for \(3\times 4\) contingency tables

Based on the information provided in Table 9, it is evident that the I-MFCESF method outperforms the individual categorical ES methods in terms of MAPE. The bias value of the I-MFCESF is determined as − 3.2%. Because the bias is within ±10%, the accuracy of the proposed method is also sufficient in terms of bias. Figures 2, 3 and 4 illustrate the MAPE and Bias values of the proposed and selected methods for the various contingency tables.

Figure 2. MAPE and Bias values of the I-MFCESF and effect size methods for \(2\times 3\) simulated data.

Figure 3. MAPE and Bias values of the I-MFCESF and effect size methods for \(2\times 4\) simulated data.

Figure 4. MAPE and Bias values of the I-MFCESF and effect size methods for \(3\times 4\) simulated data.

Real-world categorical dataset for 2 \(\times\) 3, 2 \(\times\) 4 and 3 \(\times\) 4 contingency tables

The first dataset contains 34 variables, 33 of which are categorical and one of which is numerical. There are 366 observations in the dataset. The dataset is related to the differential diagnosis of erythemato-squamous diseases. The data are taken from the UCI Machine Learning Repository database and can be openly accessed via (https://archive.ics.uci.edu/ml/datasets/Dermatology). The “family history”, “eosinophi”, and “erythema” variables in the “Dermatology” dataset are used. In the dataset, the family history feature has the value “1” if any of these diseases has been observed in the family, and “0” otherwise. Eosinophi has the value “0” if the feature was not present, “1” for a relative intermediate value, and “2” for the largest amount possible. Erythema has the value “0” if the feature was not present, “3” for the largest amount possible, and “1” and “2” for the relative intermediate values. A total of 1000 different samples with replacement are drawn from the Dermatology dataset. In the proposed method, the input matrix \((Z)\) is obtained from the outputs of the categorical ES measures calculated for these samples. Then, the membership grades are obtained by clustering the input matrix with the IFCM algorithm. The fuzziness index parameter (\(f)\) is taken as 2. Using the membership grades, the weights of each categorical ES method in each cluster are calculated. The next step is to obtain the fuzzy functions by using the weights; there are as many fuzzy functions as the optimum number of clusters. The optimum cluster number is searched iteratively between 2 and 5. Finally, the fuzzy function with the smallest MAPE is chosen and the new effect size value is calculated.
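A hedged sketch of this resampling scheme is given below, reusing the `measures` list from the simulation sketch. The data frame name `derma` and its column names are assumptions about how the UCI file has been read, not the actual identifiers used in the study.

```r
# 1000 bootstrap samples from the Dermatology data; one row of Z_real per sample (2 x 3 case)
B <- 1000
Z_real <- t(replicate(B, {
  idx <- sample(nrow(derma), replace = TRUE)
  tab <- table(derma$family_history[idx], derma$eosinophils[idx])
  sapply(measures, function(m) m(tab))
}))
fit_real <- i_mfcesf(Z_real, c = 2)   # weights, MAPE per function, and the new combined ES value
```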

Family history and Eosinophi variables (\(2\times 3\) contingency tables)

“Family history” and “Eosinophi” variables are selected from the Dermatology dataset for the \(2\times 3\) contingency table. The input matrix \((Z)\) is obtained from the outcomes of the seven ES measures for these variables. The first five and last five prediction values of the input matrix are summarized in Table 10.

Table 10 Input Matrix for family history and eosinophi variables contingency table

The weights for the \(I-MFCESF\) are calculated as in Table 11 and \({I-MFCESF}_{1}\) and \({I-MFCESF}_{2}\) are obtained as in Eqs. (23, 24) for Family history and Eosinophi variables.

$$\begin{aligned} I - MFCESF_{1 } & = Cramer^{\prime}s v \times 0.2384 + Pearson^{\prime}s c \times 0.2474 + Tschuprows^{\prime}T \times 0.2441 \\ & \;\;\; + Cohen^{\prime}s w \times 0.2451 + G - K Tau \times 0.0144 + U \times 0.0016 \\ & \;\;\; + \lambda \times 0.0090 \\ \end{aligned}$$
(23)
$$\begin{aligned} I - MFCESF_{2 } & = Cramer^{\prime}s v \times 0.0490 + Pearson^{\prime}s c \times 0.0117 + Tschuprows^{\prime}T \times 0.0186 \\ & \;\;\; + Cohen^{\prime}s w \times 0.0165 + G - K Tau \times 0.2975 + U \times 0.3056 \\ & \;\;\; + \lambda \times 0.3012 \\ \end{aligned}$$
(24)
Table 11 Weights of the \(\mathrm{I}-\mathrm{MFCESF}\) for family history and eosinophi variables

In consideration of Table 11, it is clearly seen that \({I-MFCESF}_{2 }\) has the lowest MAPE. Thus, the best I-MFCESF is \({I-MFCESF}_{2}\). All seven methods contribute to the performance of the second function; the sixth method makes the largest contribution, but the seventh, fifth, third, fourth, second, and first methods also affect the effectiveness of I-MFCESF. The MAPE values of the methods are computed, and the results are presented in Table 12 to evaluate the performance of the proposed method. Additionally, Fig. 5 provides a visual representation of the MAPE and Bias values of the proposed and selected methods for the family history and eosinophi variables.

Table 12 MAPE and Bias values of the proposed and selected effect size methods for family history and eosinophi variables
Figure 5. MAPE and Bias values of the I-MFCESF and effect size methods for family history and eosinophi variables.

According to Table 12, the proposed I-MFCESF clearly outperforms the other categorical effect size methods in terms of the MAPE criterion. Moreover, the bias value of the proposed method is within ± 10%, so it is also sufficient in terms of bias. As a result, the new ES value is calculated as 0.020 from Eq. (25).

$$\begin{aligned} I - MFCESF_{best } & = 0.0833 \times 0.0490 + 0.1106 \times 0.0117 + 0.0936 \times 0.0186 \\ & \;\;\; + 0.114 \times 0.0165 + 0.0124 \times 0.2975 + 0.0230 \times 0.3056 \\ & \;\;\; + 0.0012 \times 0.3012 = 0.020 \\ \end{aligned}$$
(25)

Family history and Erythema variables (\(2\times 4\) contingency tables)

For the \(2\times 4\) contingency table, the “Family history” and “Erythema” variables are selected from the Dermatology dataset. The input matrix of \(I-MFCESF\) is obtained from the outcomes of the seven effect size measures for these variables. The input matrix is summarized in Table 13.

Table 13 Input matrix for family history and erythema variables contingency table

When the number of clusters is tried iteratively between 2 and 5 to obtain the smallest MAPE, it is determined as 3 for this dataset. The weights for the \(I-MFCESF\) are calculated as in Table 14 and \({I-MFCESF}_{1}\), \({I-MFCESF}_{2}\) and \({I-MFCESF}_{3}\) are obtained as in Eqs. (26)–(28).

Table 14 Weights of the \(\mathrm{I}-\mathrm{MFCESF}\) for family history and erythema variables
$$\begin{aligned} I - MFCESF_{1 } & = Cramer^{\prime}s v \times 0.0115 + Pearson^{\prime}s c \times 0.0003 + Tschuprows^{\prime}T \times 0.0159 \\ & \;\;\; + Cohen^{\prime}s w \times 0.0018 + G - K Tau \times 0.3194 + U \times 0.3301 \\ & \;\;\; + \lambda \times 0.3209 \\ \end{aligned}$$
(26)
$$\begin{aligned} I - MFCESF_{2 } & = Cramer^{\prime}s v \times 0.4854 + Pearson^{\prime}s c \times 0.0036 + Tschuprows^{\prime}T \times 0.4241 \\ & \;\;\; + Cohen^{\prime}s w \times 0.0186 + G - K Tau \times 0.0437 + U \times 0.0035 \\ & \;\;\; + \lambda \times 0.0211 \\ \end{aligned}$$
(27)
$$\begin{aligned} I - MFCESF_{3 } & = Cramer^{\prime}s v \times 0.0458 + Pearson^{\prime}s c \times 0.4004 + Tschuprows^{\prime}T \times 0.1290 \\ & \;\;\; + Cohen^{\prime}s w \times 0.3939 + G - K Tau \times 0.0186 + U \times 0.0017 \\ & \;\;\; + \lambda \times 0.0106 \\ \end{aligned}$$
(28)

According to Table 14, \({I-MFCESF}_{2 }\) has the lowest MAPE, and the best I-MFCESF is therefore \({I-MFCESF}_{2}\). All seven methods contribute to the performance of the proposed method; the first method makes the largest contribution, but the third, fifth, seventh, fourth, second, and sixth methods also affect the effectiveness of I-MFCESF. The MAPE values of the methods are given in Table 15 to evaluate the performance of the proposed method. Also, Fig. 6 represents the MAPE and Bias values of the proposed and selected methods for the family history and erythema variables.

Table 15 MAPE and Bias values of the proposed and selected effect size methods for family history and erythema variables
Figure 6. MAPE and Bias values of the I-MFCESF and selected methods for family history and erythema variables.

It is clear from Table 15 that the proposed I-MFCESF gives very accurate prediction results for both evaluation criteria, MAPE and bias. The MAPE value of the proposed method is lower than those of the other categorical effect size methods, and the bias value is within ± 10%. Therefore, I-MFCESF is sufficient in terms of both MAPE and bias. As a result, the new effect size value is calculated as 0.1328 from Eq. (29).

$$\begin{aligned} I - MFCESF_{best } & = 0.1478 \times 0.4854 + 0.1706 \times 0.0036 \\ & \;\;\; + 0.1316 \times 0.4241 + 0.1732 \times 0.0186 + 0.0300 \times 0.0437 \\ & \;\;\; + 0.0240 \times 0.0035 + 0.0010 \times 0.0211 = 0.1328 \\ \end{aligned}$$
(29)

Eosinophi and Erythema variables (\(3\times 4\) contingency tables)

For the \(3\times 4\) contingency table, the “Eosinophi” and “Erythema” variables are selected from the Dermatology dataset. The input matrix of \(I-MFCESF\) is obtained from the outcomes of the seven effect size measures for these variables. The input matrix is summarized in Table 16.

Table 16 Input matrix for eosinophi and erythema variables contingency table

Table 17 shows the weights calculated for the eosinophi and erythema variables. The functions \({I-MFCESF}_{1}\) and \({I-MFCESF}_{2}\), which are created from these weights, are given in Eqs. (30) and (31).

$$\begin{aligned} I - MFCESF_{1 } & = Cramer^{\prime}s v \times 0.2132 + Pearson^{\prime}s c \times 0.0032 + Tschuprows^{\prime}T \times 0.1232 \\ & \;\;\; + Cohen^{\prime}s w \times 0.0096 + G - K Tau \times 0.2186 + U \times 0.2152 \\ & \;\;\; + \lambda \times 0.2169 \\ \end{aligned}$$
(30)
$$\begin{aligned} I - MFCESF_{2 } & = Cramer^{\prime}s v \times 0.0446 + Pearson^{\prime}s c \times 0.3405 + Tschuprows^{\prime}T \times 0.2297 \\ & \;\;\; + Cohen^{\prime}s w \times 0.3354 + G - K Tau \times 0.0121 + U \times 0.0211 \\ & \;\;\; + \lambda \times 0.0166 \\ \end{aligned}$$
(31)
Table 17 Weights of the \(\mathrm{I}-\mathrm{MFCESF}\) for eosinophi and erythema variables

Considering Table 17, it is clear that, regarding the MAPE criterion, the \({I-MFCESF}_{2 }\) function has the best prediction performance for this contingency table. \(Pearson{^\prime}s c\) makes the largest contribution to the performance of the proposed method, while the other selected methods have a smaller impact on the best function. Figure 7 represents the MAPE and Bias values of the selected and proposed methods for the eosinophi and erythema variables.

Figure 7. MAPE and Bias values of the I-MFCESF and selected methods for eosinophi and erythema variables.

Table 18 lists the performances of the selected and proposed methods. The MAPE and Bias values show that the best performance is produced by the proposed method: the bias value of the proposed method is within ± 10%, and its MAPE value is the lowest among the effect size methods. Finally, the new effect size value is calculated by using Eq. (32).

$$\begin{aligned} I - MFCESF_{best } & = 0.0190 \times 0.0446 + 0.1522 \times 0.3405 + 0.0910 \times 0.2297 \\ & \;\;\; + 0.1541 \times 0.3354 + 0.0177 \times 0.0121 + 0.0128 \times 0.0211 \\ & \;\;\; + 0.0162 \times 0.0166 = 0.1260 \\ \end{aligned}$$
(32)
Table 18 MAPE and Bias values of the proposed and selected effect size methods for eosinophi and erythema variables

Conclusion

The two key points of the study can be highlighted as follows. First, a new categorical effect size approach based on the IFCM and MFF is used to ensemble seven different categorical effect size measures; thus, instead of depending on a single categorical effect size method, seven categorical effect size methods are aggregated for more reliable and accurate outcomes. Second, I-MFCESF is an adaptive method that adjusts itself to the given dataset. Some advantages of I-MFCESF are listed below:

The proposed method incorporates seven different categorical effect size measures that were proposed under various conditions. In the literature, the \(Cramer{^\prime}s v\), \(Pearson{^\prime}s c\), \(Tschuprows{^\prime}T\), \(Cohen{^\prime}s w\), \(G-K Tau\), \(U\) and \(\lambda\) effect size measures are the ones most commonly used for \(r\times c\) contingency tables, and their interpretation ranges are on the same scale. Thus, these techniques are selected for the proposed method.

\(IFCM,\) in which the hesitancy of an object belonging to a cluster with a degree of membership value is taken into consideration, is used to improve the performance of the proposed method and obtain more accurate results.

\(I-MFCESF\) gathers the information of the selected effect size measures in functions by considering their accuracy for a given dataset. For example, for one dataset, method X may perform better than method Y, while for another dataset, method Y may perform better than method X. In this case, the weight of method X will be higher in the best function for the first dataset, while the weight of method Y will be higher in the best function for the second dataset. For this reason, the proposed method has adaptive properties.

\(I-MFCESF\) usually selects the best-performing effect size measures, in terms of MAPE, with higher weights among the seven measures.

To demonstrate the performance of the proposed method, we generate two independent categorical variables randomly for N = 1000 sample size and t = 1000 repeats. Besides, we investigate the real-world Dermatology dataset, which is taken from the UCI Machine Learning Repository database. According to the simulation results, the MAPE was obtained as 0.4168 with a bias of − 0.0106 for the 2 × 3 contingency table, 0.3581 with a bias of − 0.0019 for the 2 × 4 contingency table, and 0.2753 with a bias of − 0.0032 for the 3 × 4 contingency table. The results obtained from the real data, on the other hand, were 0.3196 MAPE with a bias of − 0.0083 for the 2 × 3 contingency table, 0.4767 MAPE with a bias of − 0.0595 for the 2 × 4 contingency table, and 0.3335 MAPE with a bias of − 0.0370 for the 3 × 4 contingency table. Both the simulation study and the applications on the real dataset showed that the proposed method predicts the results better than the other effect size measures in terms of MAPE and bias values. The MAPE value of the proposed method was lower than that of the other methods in all applications, and the bias value was within ± 10%. From these results we can claim that I-MFCESF improves prediction accuracy by combining the results of different effect sizes. A limitation of the study is that the performance of the proposed method is affected by the performance of the clustering algorithm. Although IFCM accounts for the hesitancy of an object belonging to a cluster, it does not consider outliers in the dataset. In this sense, a possibilistic fuzzy clustering algorithm, which accounts for outliers, can be adapted in MFF; this scenario is left for future study. Therefore, as a future research direction, we plan to combine the effect size measures used for different types of variables and to utilize possibilistic fuzzy c-means. Also, to improve the performance of the proposed method, different categorical effect size measures can be included in MFF.