Variable selection and data pre-processing in NN modelling of complex chemical processes

https://doi.org/10.1016/j.compchemeng.2005.01.004

Abstract

Neural network models are now a powerful tool for the identification of complicated processes. However, because they are data-driven “black box” models, they cannot escape the consequences of the “garbage in–garbage out” rule. This work proposes a simultaneous data balancing-variable selection procedure based on traditional statistical techniques and modern information theoretic approaches. It is applied to a complicated dataset of restricted quality from a commercial aldol condensation unit (BASF). Based on the pre-processed database, a neural model for the prediction of the process yield has been developed. The results confirm the importance of the pre-processing stage, both for generalization accuracy and for the simpler network structure that results from the data-variable selection procedure. Finally, the model trends have been analyzed to assess the qualitative characteristics of the model, which was then used in industrial test runs and led to an improvement of the process operation.

Introduction

Neural networks (NNs) are among the most promising modelling techniques of our time. As “universal approximators” they can handle non-linear multivariate systems of any complexity level, while as “black box” models they do not require extensive knowledge of the process to be modelled. Instead, they are built from databases and tolerate the faults and noise contained in them (Hornik, Stinchcombe, & White, 1989). Although they are not new in concept, interest in them has increased significantly in the last decade, mainly due to the tremendous evolution of digital computing. Published applications of neural networks in chemical engineering include fault diagnosis in chemical plants (Venkatasubramanian & Chan, 1989), system identification and control (Pollard, Broussard, Garrison, & San, 1992; Psichogios & Ungar, 1991), sensor data analysis (Piovoso & Owens, 1991), and chemical composition analysis (Weixiang, Dezhao, & Shangxu, 1998).

As a result of their modelling success, a number of software packages for the design and development of neural networks have become available. Most of these packages offer a variety of options for the design of the network architecture, the parameters of the training algorithm, the stop criteria and the model analysis. Above all, they provide a user-friendly environment that hides the details of the complicated mathematical and computational training procedure from the user and makes neural network development much easier. However, the neural network modelling task also involves the data pre-processing stage, which can be decisive for the development of a successful model: no matter how powerful the neural network modelling technique may be, it cannot escape the “garbage in–garbage out” curse of “black box” modelling.

More specifically, as far as commercial databases are concerned, raw data obtained from plant operation should not be used unprocessed in identification studies. First, the special cases of process startup and shutdown have to be recognized and the respective data removed. Secondly, outliers should be detected in order to avoid using data that correspond to measurement errors or operation faults. Not all outliers are of this nature, however: they may also reflect an extreme but possible process set-up and therefore contain useful information. Cooperation between the neural network model developer and the process engineers is always of great importance for all these aspects as well as for the variable selection. The proper variables have to be selected from a large number of potential input variables, which may also correlate strongly with each other due to the process control system. Furthermore, for effective modelling the data must be information rich over the process operation range (Neelakantan & Guiver, 1998). A well-balanced dataset may cure some of the problems arising from the fact that the range of the variables in industrial datasets is usually restricted by the operational limitations set by the process control system.
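As a rough illustration of the screening steps described above, the following Python sketch flags candidate outliers with a modified z-score and lists strongly correlated pairs of input variables. It is not the procedure used in this work: the function names and the thresholds of 3.5 and 0.9 are assumptions chosen for illustration. Points are flagged rather than deleted, since an extreme value may correspond to a valid process set-up and should be reviewed with the process engineers.

```python
import math
from itertools import combinations
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of points whose modified z-score exceeds the threshold.

    The median and the median absolute deviation (MAD) are used because
    they are less distorted by extreme values than the mean and std.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:  # at least half the values equal the median: score undefined
        return []
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # a constant column has no defined correlation
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(columns, limit=0.9):
    """List pairs of column names whose absolute correlation reaches the limit."""
    return [(a, b) for a, b in combinations(columns, 2)
            if abs(pearson(columns[a], columns[b])) >= limit]
```

Called on a dictionary that maps hypothetical variable names to lists of measurements, `correlated_pairs` returns the pairs that are candidates for merging or removal, while `flag_outliers` returns row indices to be inspected manually.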

From the short description above, it is clear that the data pre-processing stage is in fact a complicated task of data and variable selection, in which a substantial number of aspects must be taken into account. Because neural network modelling is usually applied to complex, multidimensional problems, no standard data pre-processing procedure has been developed so far that treats all these aspects and could therefore be integrated into a neural network software package and performed automatically. Instead, users very commonly follow case-oriented data pre-processing methods, relying to a great extent on their intuition and on the idiosyncrasies of the specific dataset.

In this paper, a data pre-processing method is proposed that combines typical statistical techniques (Hair, Anderson, Tatham, & Black, 1998) with information theoretic approaches for variable selection and data preparation in neural network modelling (Sridhar, Bartlett, & Seagrave, 1998). The method treats the data selection procedure systematically, setting various variable selection levels simultaneously and indicating the points at which the experience of the process engineers is valuable. A database from a commercial aldol condensation process has been used to implement both the method and the neural network modelling technique. The product of this process is an important intermediate for the production of fungicidal agro-chemicals. The complexity of the modelling task, which makes a neural network modelling effort worthwhile, arises mainly from the side reactions in combination with the crystallization occurring in the semi-batch reactor, as will be discussed later. Finally, the paper examines the effect of the proposed pre-processing methods on model accuracy and generalization capability, analyzes the performance of the model in terms of theoretical consistency in some case studies, and validates the model predictions with operational test runs on the plant.
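One quantity that commonly underlies information theoretic variable selection is the mutual information between a candidate input and the output. The Python sketch below estimates it with a simple equal-width histogram and ranks candidate inputs accordingly. It is a generic estimator written for illustration, not the subset selection algorithm cited above nor the procedure of this paper; the function names and the default number of bins are assumptions.

```python
import math
from collections import Counter


def bin_index(value, low, high, bins):
    """Map a value to one of `bins` equal-width bins on [low, high]."""
    if high == low:  # constant variable: everything falls into one bin
        return 0
    index = int((value - low) / (high - low) * bins)
    return min(index, bins - 1)  # the maximum belongs to the last bin


def mutual_information(x, y, bins=4):
    """Histogram estimate of the mutual information I(X; Y) in bits."""
    n = len(x)
    xb = [bin_index(v, min(x), max(x), bins) for v in x]
    yb = [bin_index(v, min(y), max(y), bins) for v in y]
    joint = Counter(zip(xb, yb))
    count_x, count_y = Counter(xb), Counter(yb)
    # Sum p(i, j) * log2( p(i, j) / (p(i) * p(j)) ) over occupied cells.
    return sum((c / n) * math.log2(c * n / (count_x[i] * count_y[j]))
               for (i, j), c in joint.items())


def rank_inputs(candidates, y, bins=4):
    """Sort candidate inputs by estimated mutual information with y."""
    scores = [(name, mutual_information(values, y, bins))
              for name, values in candidates.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Histogram estimates depend on the bin count and are biased for small samples, so such a ranking is best read as a screening aid alongside process knowledge rather than as a final selection.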

Section snippets

Combinatory data pre-processing method

Typically, data pre-processing methods for neural network modelling consist of two stages. In the first stage, simple techniques for handling missing data and detecting outliers perform a first-level data screening. Then, based on these results, more sophisticated multivariate techniques are applied in order to decrease the dimensionality of the input variable space (variable selection) and to homogenize the information distribution for input and output variables (data sample selection). Usually,

Process description

A brief description of the present type of process is given in order to illustrate the complexity level of the modelling task as a typical field of application for NN modelling.

Commercial database—implementation of pre-processing methods

The training and validation of the NN model was based upon industrial data provided by BASF (Schwarzheide). The 19 candidate input variables and the one output variable (the primary product yield P1, where P1 is simply denoted as P in the following) are presented in Table 1.

From the description of the variables it is clear that operational variables of the reactor and the separation unit as well as laboratory variables relative to the reagent quality have been taken into account. The commercial

Results and discussion

The training of the NNs was carried out with ATLAN-tec’s NeuroModel software package (Version 1.41). This package automatically normalizes the data with a linear transformation to the range [0.1, 0.9]. It uses multilayer perceptrons (MLP) with one hidden layer and, for training, a version of the error back-propagation (EBP) algorithm with a momentum term. The nodes of the hidden layer use a sigmoid transfer function and the number of the hidden nodes is a parameter that can be
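The linear normalization to [0.1, 0.9] mentioned above corresponds to x' = 0.1 + 0.8 (x - x_min) / (x_max - x_min). A minimal Python sketch of this scaling, its inverse, and a generic gradient step with a momentum term is given below. The internals of NeuroModel are not described here, so the function names, the learning rate and the momentum coefficient are assumptions for illustration only.

```python
def scale(value, vmin, vmax, low=0.1, high=0.9):
    """Linearly map a value from [vmin, vmax] to [low, high]."""
    if vmax == vmin:  # constant variable: place it mid-range
        return (low + high) / 2
    return low + (high - low) * (value - vmin) / (vmax - vmin)


def unscale(scaled, vmin, vmax, low=0.1, high=0.9):
    """Inverse of scale(): map a network output back to physical units."""
    return vmin + (scaled - low) * (vmax - vmin) / (high - low)


def momentum_step(weight, gradient, previous_update, rate=0.1, momentum=0.9):
    """One gradient-descent update with a momentum term.

    The new update combines the current negative gradient with a
    fraction of the previous update, which damps oscillations.
    Returns the new weight and the update to carry into the next step.
    """
    update = -rate * gradient + momentum * previous_update
    return weight + update, update
```

Keeping targets away from 0 and 1 is a common choice with sigmoid units, because the sigmoid only approaches those values asymptotically.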

Conclusions

The data pre-processing stage in neural network modelling is of great importance, especially for commercial datasets of restricted quality. The simultaneous “data balancing-variable selection” procedure proposed in this paper has been implemented in the complex case of a commercial aldol condensation process, and the results have shown that the generalization accuracy of the neural networks improves in comparison with that achieved when only the typical “outliers removal”

Acknowledgements

Financial support provided by BASF Schwarzheide GmbH and the exchange of knowledge with its research and operating sections, particularly with Udo Rotermund, Olaf Otto, Rainer Noack and Jan Rudloff, are gratefully acknowledged.

Cited by (28)

  • Soft methodology selection of wind turbine parameters to large affect wind energy conversion

    2015, International Journal of Electrical Power and Energy Systems
    Citation Excerpt :

    The objective here is to select the proper explanatory (input) parameters and thereby reduce and minimize the error that exists between the observed values and the model estimations of the explained variables. Among the many neural network system, one of the most used and powerful is the ANFIS; and the ANFIS was employed here, for the purposes of this study, in the variable selection part [22–26]. In order to determine how the four main parameters affect the wind turbine power coefficient: blade pitch angle, rotor speed, wind speed and rotor radius, a parameter search by employing the ANFIS was conducted.

  • Soft sensor for real-time cement fineness estimation

    2015, ISA Transactions
    Citation Excerpt :

    Information theory is used to analyze interdependency between the process output and inputs to determine, from the entire input data, a subset of input variables that contains most of the information necessary to predict the output. Theoretical knowledge regarding the process that is modeled should also be incorporated when using ITSS to ensure that all important variables to the model are included in the subset [22,23]. In this paper the soft sensor is proposed to estimate cement fineness and provide real-time information on a process variable available only from off-line laboratory tests.
