1 Introduction

Regression is a data analysis approach that supports decision-making. It is a data mining technique used to predict continuous values or ranges of numeric values. The regression equation involves two kinds of variables: predictor (independent) variables and a response variable (the value to be predicted). Manoj Kumar Gupta and Pravin Chandra [11] presented a systematic and detailed survey of data mining tasks and techniques, together with several real-life data mining applications, and explained how data mining tasks are realized. Copeland, Karen [1] introduced non-parametric methods that relax assumptions about the form of the regression function and help to explore the data before a suitable regression function is chosen; combining these non-parametric methods with parametric techniques can yield immensely powerful data analysis tools. Eva Ostertagová [4] examined the polynomial regression model: when the relationship between two variables is curvilinear, polynomial regression is useful for prediction, and she used it to characterize the relationship between strains and drilling depth. The least squares method is used to estimate the parameters of the model; after fitting, common indicators are used to evaluate the accuracy of the regression model. Fawcett et al. [2] described an automatic approach to fraud detection based on transaction records, in which the system learns the relevant features and generates confidence alarms for users.

Samar Wazir, Sufyan Beg and Tanvir Ahmad [12] proposed the Master Apriori algorithm, which measures estimated frequent items for a combination of certain and uncertain databases, using Apriori with planned support for the certain database and UApriori for the uncertain one. They extended this earlier work for the uncertain database by using UApriori based on the Poisson and normal distributions. In the proposed algorithms there is only one round of communication between sites in which data is transmitted, which decreases the communication overhead. The scalability and efficiency of the proposed algorithms were tested on real and synthetic databases, and performance was evaluated by comparing the execution time and the number of frequent items generated by each algorithm.

Rimal and Almøy [9] introduced a novel approach that can be evaluated on various aspects of the data; however, it is limited in the case of multiple response variables. The approach was applied to both real and simulated data sets, and the authors compared it with well-established prediction methods. It is specifically designed by varying properties such as multicollinearity, the correlation between multiple responses, and the position of the relevant principal components of the predictors. Tahani S. Gendy [6] discusses the thermal structure of stabilized lifted jet diffusion flames in the presence of various geometries of bluff-body burners, which has been modeled scientifically; for two stabilizer disc burners, the radial mean temperature was measured to develop and stabilize flames at multiple normalized axial distances. Stangierski, Weiss and Kaczmarek [10] compared multiple linear regression (MLR) and artificial neural networks (ANN) for predicting the overall quality of spreadable Gouda cheese during storage at 8 °C, 20 °C and 30 °C. The ANN-based models, with higher coefficients of determination and lower RMSE values, proved to be more accurate.

A polynomial mathematical model has been used to study this phenomenon and to find the best fit for new data. Least squares regression analysis has been applied to estimate the coefficients of the polynomial and to inspect its adequacy. The study identified that prediction over large data sets can incur high data-maintenance costs and requires a lot of execution time. The proposed method reduces the maintenance cost of large data sets, and with efficient selection of the variables required for prediction, it reduces execution time and improves prediction accuracy with low errors. Ramjeet Singh Yadav [13] found that, among six models, the root mean square error of the sixth-degree polynomial is much smaller than that of the quadratic, third-degree, fourth-degree, fifth-degree and exponential polynomials. The sixth-degree polynomial regression model for COVID-2019 data in India is therefore a very good model for predicting the next 6 days; the author found that over the following week it could help Indian physicians and the government prepare their plans, and that it can be optimized for longer-term forecasting through additional regression analysis studies. Apurbalal Senapati, Amitava Nag, Arunendu Mondal and Soumen Maji [14] found from recent COVID-19 data that the number of infections per day first follows a linear pattern and then increases exponentially. This property was used in their prediction, and piecewise linear regression proved the most suitable model for capturing it. Their experimental results indicated the superiority of the proposed scheme, which was, to the best of their knowledge, a new approach to COVID-19 prediction.

Felix Schönbrodt [7] discusses response surface analysis in psychological science, eliminating several problems surrounding it and introducing the concept of fit patterns, which provides a theoretical basis for testing fit hypotheses with incommensurable scales. New statistical models, namely the shifted (and rotated) squared difference models and their extensions with rising ridges, extend the statistical toolbox and allow researchers to test fit hypotheses without having to rely on impractical assumptions. These models have greater statistical power to detect genuine fit patterns and provide easily interpretable parameters, which can be used to test hypotheses that would be difficult or impossible to test with traditional methods. Finally, new open-source software provides easy-to-use functions that make polynomial regression methodology accessible to researchers from a wide range of scientific fields. In data mining, various types of data sets are analyzed: stream data, temporal data, continuous data, discrete data, spatial data, etc. Some data sets consist of one independent variable and some consist of two or more. Generally, data sets fall into three categories:

  • Uni-variate data sets consist of one variable.

  • Bi-variate data sets consist of two variables.

  • Multivariate data sets consist of more than two variables.

Uni-variate data is the simplest type of data set: only one variable is considered. Such a data set describes a single entity and does not address causes. Patterns in this kind of data can be summarized using measures of central tendency such as the mean, median and mode. Bi-variate data sets include two different variables and are used to examine the relationship between them, covering outcomes, analysis, causes, comparison and explanation. Multivariate data sets are similar to bi-variate ones but contain more than one independent variable; their analysis depends on the results to be achieved and is carried out through various algorithms and tools.
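As a minimal sketch, the central-tendency summaries mentioned above can be computed for a hypothetical uni-variate sample (the humidity readings below are illustrative, not taken from Table 1):

```python
from statistics import mean, median, mode

# Hypothetical uni-variate data set: a single variable (humidity readings)
humidity = [46.2, 45.9, 46.2, 47.1, 45.5, 46.2, 46.8]

print(mean(humidity))    # arithmetic mean of the readings
print(median(humidity))  # middle value of the sorted readings
print(mode(humidity))    # most frequently occurring reading
```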

A multivariate data set consists of more than two independent variables and is generally used for explanatory purposes; analysis is performed on the basis of two or more independent variables. Depending on the objectives of the analysis, various regression methods can be applied: regression analysis, path analysis, factor analysis and multivariate analysis of variance are some of the available techniques. Table 1 shows an example multivariate data set of energy consumption: humidity (inside the building) is recorded every 10 min and is estimated on the basis of several parameters: temperature (inside the building), wind speed, pressure (Press_m_hg), outside temperature (t_out) and outside humidity (RH_out).

Table 1 Energy consumption multivariate data sets

2 Regression components and data analysis

For prediction and analysis, various regression techniques are used: linear regression (for numeric data sets), logistic regression (for binary data sets), polynomial regression, stepwise regression, ridge regression, lasso regression and elastic net regression. The choice of prediction technique depends on the type of data set; for example, if a data set involves logical values such as 0 and 1, logistic regression is applied (Fig. 1).

Fig. 1
figure 1

Components of regression equation

All regression techniques achieve different levels of prediction accuracy. The choice among them is usually determined by three parameters: the number of independent variables, the type of dependent variable and the shape of the regression line.

2.1 Data analysis

Data analysis is the statistical study of more than one resultant variable at a time. It is the process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions and supporting decision-making. Its aim is to extract the needed information from the data and make assessments based on it. In the analysis of multivariate data sets, one variable is treated as the dependent variable and the others as independent variables. The data analysis process includes several steps, as shown in Fig. 2.

Fig. 2
figure 2

Steps for data analysis

2.2 Data requirements specification

Identify the inputs necessary for analysis and their types, e.g. whether each should be continuous, logical or discrete.
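A minimal sketch of such a type check, using hypothetical variable names and readings:

```python
# Classify each input variable as continuous, discrete or logical
# before analysis begins (variable names and values are illustrative).
samples = {
    "temperature": [19.8, 20.1, 20.5],   # continuous
    "occupants":   [2, 3, 2],            # discrete
    "heating_on":  [True, False, True],  # logical
}

def variable_kind(values):
    # bool is a subclass of int in Python, so check logical first
    if all(isinstance(v, bool) for v in values):
        return "logical"
    if all(isinstance(v, int) for v in values):
        return "discrete"
    return "continuous"

for name, values in samples.items():
    print(name, "->", variable_kind(values))
```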

2.3 Data collection

The process of gathering the required data or variables for the target variables. Data collection should be accurate.

2.4 Data pre-processing

It is a method for structuring the data according to the analysis method.

2.5 Data cleaning

It is a method for detecting and removing errors in data sets: eliminate duplicate values, irrelevant values and incomplete records.
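A minimal cleaning sketch over hypothetical (temperature, humidity) records, removing duplicates and incomplete rows:

```python
# Hypothetical records: (temperature, humidity). One row is an exact
# duplicate and one has a missing value; both should be dropped.
rows = [
    (19.8, 46.2),
    (19.8, 46.2),   # duplicate record
    (20.1, None),   # incomplete record
    (20.5, 45.9),
]

def clean(rows):
    seen, result = set(), []
    for row in rows:
        if any(v is None for v in row):   # skip incomplete data
            continue
        if row in seen:                   # skip duplicate values
            continue
        seen.add(row)
        result.append(row)
    return result

print(clean(rows))  # [(19.8, 46.2), (20.5, 45.9)]
```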

2.6 Data analysis

It is a technique to recognize, interpret and draw conclusions based on the requirements of the analysis.

2.7 Communication

The results of the data analysis are reported in a format required by the users to support their decisions and further action.

The benefits of data analysis are as follows:

  • On the basis of changing scenarios in the market or organization, production can be increased or decreased.

  • Analysis of variance (ANOVA) helps in decision making.

  • Data modeling is applied to reduce a large number of variables to a smaller number of variables.

  • A scale or index can be confirmed by showing that its constituent items load on the same factor, and proposed scale items that cross-load on more than one factor can be dropped.

3 Forecasting method and its execution

Forecasting is an approach to making predictions on the basis of historical data, current data sets and recent trends. Earlier studies have used various algorithms to predict one-dimensional and two-dimensional stream data, and various methods have been used to improve prediction accuracy and reduce prediction errors. Sagar et al. [5] introduced a regression-based prediction algorithm for time series data sets to improve performance. The prediction method is a multiple regression equation up to the nth degree relating the explanatory variable x and the response variable y. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y|x), and has been used to describe nonlinear phenomena. Ostertagová [3] developed a polynomial model of the relationship between strain and drilling depth; the model parameters were estimated using the least squares method, the fitted model was evaluated using some of the most commonly used indicators of regression accuracy, and the data were analyzed using MATLAB. Polynomial regression can be considered a special case of multiple linear regression.
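The least squares polynomial fit described above can be sketched as follows. The data is synthetic, generated from a known quadratic so that the recovered coefficients can be checked, and `numpy.polyfit` stands in for the least squares estimation:

```python
import numpy as np

# Synthetic curvilinear data: x depends quadratically on y
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x = 2.0 + 0.5 * y + 0.25 * y**2

coeffs = np.polyfit(y, x, deg=2)   # least squares estimate of the coefficients
model = np.poly1d(coeffs)          # polynomial usable for prediction

print(model(6.0))                  # prediction at an unseen point
```

Since the data contains no noise, the fit recovers the generating coefficients almost exactly.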

Step 1 Find the total number of regression models, 2^n, where n is the number of independent variables. Find the ANOVA for each regression equation, as shown in Table 2.

$$y_1 = \text{Temperature},\quad y_2 = \text{T\_out},\quad y_3 = \text{Pressure},\quad y_4 = \text{R\_h},\quad y_5 = \text{WindSpeed},\quad x = \text{Humidity}$$
Table 2 The 2^n possible regression equations
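Step 1 can be sketched by enumerating all subsets of the independent variables; the predictor names follow the notation of Table 2:

```python
from itertools import combinations

# The five independent variables of the energy data set
predictors = ["y1", "y2", "y3", "y4", "y5"]

# Enumerate every subset: each one defines a candidate regression equation
models = []
for k in range(len(predictors) + 1):
    for subset in combinations(predictors, k):
        models.append(subset)

print(len(models))  # 2^5 = 32 candidate equations (incl. the intercept-only one)
```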

Step 2 Find the coefficient of determination and the mean square error for each of the regression equations defined in Table 2. For example, with four independent variables y1, y2, y3, y4, the coefficients are b0, b1, b2, b3, b4. All possible sets of independent variables are considered in the regression equations: in Model 2, each equation includes one independent variable and two coefficients; in Model 3, each equation includes two independent variables and three coefficients; and so on, so that every possible combination of independent variables is covered. Find the ANOVA for each regression equation; from the ANOVA model, the coefficient of determination and the MSE are calculated.

$$\text{Coefficient of determination} = \frac{\sum \text{SumSq}\left( y_1 + y_2 + y_3 \right)}{\sum \text{SumSq}\left( y_1 + y_2 + y_3 \right) + \text{SumSq}(\text{residual})} \times 100$$
(1)

MSE = the value of “Mean Sq” corresponding to the residuals in the ANOVA model.

The ANOVA models for the regression equations of Table 2 are shown in Table 3a–d; in each ANOVA the response is Humidity (x, the dependent variable).

Table 3 ANOVA for regression equations of (a) Model 2, (b) Model 3, (c) Model 4 and (d) Model 5

In each model, the regression equation with the highest coefficient of determination and the minimum mean square error (MSE) is selected. The selected (highlighted) equations are regression Eq. 3 from Model 2, regression Eq. 1 from Model 3, regression Eq. 1 from Model 4 and regression Eq. 1 from Model 5. Comparing all selected values of r2p and MSE in Table 4, Model 4 is finally chosen: although the r2p value 95.91 of Model 5 is higher than the r2p value 95.87 of Model 4, the MSE value 0.07 of Model 4 is the lowest of all. The independent variables in the selected regression model (high coefficient of determination, lowest MSE) are the most relevant for prediction, and these variables are used to structure the improved prediction method.

Table 4 MSE and r2p (coefficient of determination)
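The selection rule can be sketched as follows. Only the r2p and MSE values for Model 4 and the r2p of Model 5 come from the discussion above; the remaining entries are illustrative placeholders:

```python
# Best equation of each model with its r2p (%) and MSE.
# Model 4's values and Model 5's r2p are from the text; the rest
# are hypothetical placeholders for illustration.
candidates = {
    "Model 2": {"r2p": 91.50, "mse": 0.15},  # placeholder
    "Model 3": {"r2p": 94.20, "mse": 0.10},  # placeholder
    "Model 4": {"r2p": 95.87, "mse": 0.07},
    "Model 5": {"r2p": 95.91, "mse": 0.09},  # placeholder MSE
}

# Prefer the lowest MSE; break ties by the highest r2p
best = min(candidates, key=lambda m: (candidates[m]["mse"], -candidates[m]["r2p"]))
print(best)  # Model 4: lowest MSE wins despite Model 5's higher r2p
```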

Step 3 Table 5 lists all the values selected from Table 4. In Table 4, the regression equation of Model 4 is chosen for identifying the variables relevant for regression analysis. Regression Eq. 1 in Model 4 has three independent variables, y1, y2, y3, which are the most important and appropriate for the prediction model. The dependent variable x is arranged as a one-column matrix as in Eq. (2), and the independent variables are arranged in the matrix Y as in Eq. (3). With Y' denoting the transpose of Y, the regression coefficient matrix b can be calculated as follows:

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix} \quad \mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix}$$
(2)
$$\mathbf{Y} = \begin{bmatrix} 1 & y_{1,1} & y_{1,2} & \cdots & y_{1,k} \\ 1 & y_{2,1} & y_{2,2} & \cdots & y_{2,k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & y_{n,1} & y_{n,2} & \cdots & y_{n,k} \end{bmatrix}$$
(3)
$$\mathbf{Y}'\mathbf{x} = \mathbf{Y}'\mathbf{Y}\mathbf{b}$$
(4)
$$\left( \mathbf{Y}'\mathbf{Y} \right)^{-1}\mathbf{Y}'\mathbf{Y}\mathbf{b} = \left( \mathbf{Y}'\mathbf{Y} \right)^{-1}\mathbf{Y}'\mathbf{x}$$
(5)
$$\mathbf{b} = \left( \mathbf{Y}'\mathbf{Y} \right)^{-1}\mathbf{Y}'\mathbf{x}$$
(6)
$$x = b_0 + b_1 y_1 + b_2 y_2^{2} + b_3 y_3^{3} + b_4 y_4^{4} + \cdots + b_N y_N^{N}$$
(7)
Table 5 Selection of highest coefficient of determination with lowest mean square error
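Equation (6) can be sketched with NumPy on synthetic data with known coefficients (b0 = 1, b1 = 2, b2 = 3, chosen for illustration):

```python
import numpy as np

# Synthetic predictors and a dependent variable built from known coefficients
rng = np.random.default_rng(0)
y1, y2 = rng.random(20), rng.random(20)

Y = np.column_stack([np.ones(20), y1, y2])   # design matrix with a column of 1s
x = 1.0 + 2.0 * y1 + 3.0 * y2                # dependent variable (no noise)

b = np.linalg.inv(Y.T @ Y) @ Y.T @ x         # b = (Y'Y)^{-1} Y'x
print(np.round(b, 6))
```

In practice, `numpy.linalg.lstsq` is preferred over forming the explicit inverse, for numerical stability; the normal-equations form above mirrors Eq. (6) directly.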

Mean absolute error (MAE):

$$MAE = \frac{1}{n}\sum\limits_{i = 1}^{n} \left| y_i - \text{predicted } y_i \right|$$
(8)
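Equation (8) can be sketched directly; the actual and predicted values below are illustrative:

```python
import numpy as np

# Hypothetical actual and predicted humidity values
actual    = np.array([46.2, 45.9, 47.1, 45.5])
predicted = np.array([46.0, 46.1, 46.9, 45.8])

# MAE: mean of the absolute differences, as in Eq. (8)
mae = np.mean(np.abs(actual - predicted))
print(round(float(mae), 3))  # 0.225
```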

4 Experimental results

The data set is collected from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction). A sample of the data set is presented in Table 1; humidity is the dependent variable and the rest are independent variables. Humidity (inside a building) is to be predicted on the basis of wind speed, temperature (inside the building), outside temperature (t_out), wind pressure, etc. The experiments are implemented in R Studio 3.3.2, covering both the general polynomial regression and the improved polynomial regression; the improved method is based on selection of variables using the coefficient of determination and the mean square error. In this paper the residuals, MAE and average errors are analyzed. In Fig. 3, blue plots represent the forecasting errors of the existing polynomial regression models and red plots represent the forecasting errors of the enhanced polynomial regression models. Using the coefficient of determination, the effective independent variables can be selected and included in the prediction model, which reduces the errors during prediction.
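A minimal sketch of the comparison, on synthetic data rather than the UCI set: x truly depends only on y1–y3 (y4 and y5 are pure noise), so a model restricted to the selected variables achieves in-sample error comparable to the full model:

```python
import numpy as np

# Synthetic data: 5 candidate predictors, but x depends only on the first 3
rng = np.random.default_rng(1)
n = 200
Y_all = rng.random((n, 5))
x = 1.0 + Y_all[:, 0] + 2.0 * Y_all[:, 1] - Y_all[:, 2] + rng.normal(0, 0.05, n)

def fit_mae(cols):
    # Fit by least squares on the chosen columns and return in-sample MAE
    Y = np.column_stack([np.ones(n), Y_all[:, cols]])
    b, *_ = np.linalg.lstsq(Y, x, rcond=None)
    return float(np.mean(np.abs(Y @ b - x)))

mae_full = fit_mae([0, 1, 2, 3, 4])   # all predictors
mae_selected = fit_mae([0, 1, 2])     # selected predictors only
print(mae_full, mae_selected)         # the two errors are nearly identical
```

Dropping the irrelevant predictors barely changes the error here, illustrating why variable selection can shrink the model without hurting accuracy.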

Fig. 3
figure 3

Error analysis graph

Figure 4 shows that the average errors of the improved method are lower than those of the existing polynomial method. In Table 4, the coefficient of determination (which should be highest) and the mean square error (which should be lowest) of all prediction equations in each model are considered; regression Eq. 1 of Model 4 is selected, having a high coefficient of determination with a low mean square error.

Fig. 4
figure 4

Analysis of average error rate

Fig. 5
figure 5

Analysis of MAE

Corresponding to Table 2, the first regression equation of Model 4, “x = b0 + b1y1 + b2y2 + b3y3”, is presented as the forecasting model, with appropriate selection of variables. This means that the variables y1, y2 and y3 form an efficient prediction model. Appropriate selection of variables for prediction models can improve the prediction accuracy (Fig. 5).

5 Conclusion and future scope

In multivariate data sets there may be many independent variables, and not all of them need to be included in the prediction model. Some variables carry less weight in the results; using the coefficient of determination, the appropriate independent variables are chosen to formulate an efficient prediction model with reduced error rates, and the variables not required for forecasting are excluded. Reprocessing fewer variables in a data set consumes less time, and eliminating irrelevant variables from huge data sets reduces the cost of data maintenance. In the future, this improved algorithm can be applied in agriculture, for different zones of the states of a country, to estimate crop production on the basis of water supply, temperature, use of pesticides, humidity, etc. Prediction algorithms can also be used in industrial production, market analysis, weather forecasting, and the diagnosis of disease on the basis of symptoms.