2008 Special Issue
Interactive data analysis and clustering of genomic data☆
Introduction
In recent years many papers in the bioinformatics literature have addressed the use of clustering techniques for gene expression analysis (Amato et al., 1995, Ando et al., 2002, Balasubramaniyan et al., 2005, Belacel et al., 2006, Ben-Dor et al., 1999, Bussermaker et al., 2001, Chan et al., 2005, Chang et al., 2003, Eisen et al., 1998, Jiang et al., 2004, Törönen et al., 1999, Tamayo et al., 1999, Yeung et al., 2001, Zhu and Zhang, 2000). The main idea of cluster analysis is to group together genes with similar behavior, both in their response to specific stimuli and in the time evolution of their expression levels. To reach results of clear and meaningful biological significance, the clustering process must include assessment and validation phases. The whole clustering procedure can be divided into the following steps:
- Preprocessing and identification of what subset of data is really to be taken into account.
- Model selection (number and value of parameters, figure of merit, type of learning algorithm, etc.).
- Finding the optimal solution of the clustering problem given the selected model.
- Evaluating the stability and reliability of the obtained clustering.
- Finding the biological validation of the proposed solution.
Correct preprocessing is clearly needed so that the clustering algorithms can use the data adequately. Likewise, a priori information is useful to reduce the complexity of the task and to steer the work toward the aims of the experiment designer. In most cases a significant reduction of the data dimension is obtained by eliminating genes that are too noisy.
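As an illustration of this kind of gene rejection, the following minimal sketch drops genes with too many missing samples. The criterion (fraction of missing values), the threshold of 0.2, the function name `filter_genes` and the toy profiles are all assumptions made for the example, not the procedure used in this paper.

```python
from typing import Optional, Sequence

Profile = Sequence[Optional[float]]

def filter_genes(profiles: dict[str, Profile],
                 max_missing_fraction: float = 0.2) -> dict[str, Profile]:
    """Keep genes whose fraction of missing samples (None) is within the threshold."""
    kept = {}
    for gene, values in profiles.items():
        missing = sum(1 for v in values if v is None)
        # Hypothetical rejection rule: too many missing samples means "too noisy".
        if values and missing / len(values) <= max_missing_fraction:
            kept[gene] = values
    return kept

# Toy expression profiles (made-up numbers, five time points each).
toy = {
    "gene_a": [0.1, 0.4, None, 0.3, 0.2],
    "gene_b": [None, None, None, 0.5, None],
    "gene_c": [1.0, 0.9, 0.8, 0.7, 0.6],
}
filtered = filter_genes(toy, max_missing_fraction=0.2)
print(sorted(filtered))
```

Other rejection rules, such as a minimum variance across time points, could be plugged into the same loop.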
The second step is model selection: it consists of experimenting with different clustering algorithms, such as hierarchical clustering, K-means and Self-Organizing Maps (Duda, Hart, & Stork, 2001), or Genetic Clustering, or fuzzy variations of them (Bandyopadhyay and Maulik, 2002, Di Gesù et al., 2005, Laszlo and Mukherjee, 2006, Lin, 1999, Lin and Chen, 2007, Pakhira et al., 2005). In general, every clustering technique implements an algorithm that tries to minimize an objective function (Vector Quantization, Distortion, etc.). The only way of changing the figure of merit is to use a different metric, and this choice is also part of the model selection step. Since each clustering approach is characterized by some parameters, first of all the number of clusters to be used, parameter estimation is another fundamental step of the process (Buhmann and Held, 1999, Jung et al., 2003, Li et al., 1999).
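To make the notion of an objective function concrete, here is a minimal sketch of the distortion (sum of squared Euclidean distances between each point and its assigned centroid) that K-means-style algorithms try to minimize. The toy points, centroids and function names are assumptions chosen for illustration; swapping `squared_distance` for another metric changes the figure of merit.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distortion(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[k]) for p, k in zip(points, labels))

# Toy data: two well-separated groups and one centroid per group.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

good_labels = [0, 0, 1, 1]  # each point assigned to its nearest centroid
bad_labels = [0, 1, 0, 1]   # two points assigned to the far centroid

print(distortion(points, centroids, good_labels))  # 4.0
print(distortion(points, centroids, bad_labels))   # 204.0
```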
The most widely used algorithms start from a random or arbitrary initial configuration and then evolve to a local minimum of the objective function. In complex problems (that is, in many real cases) there are several minima, and more than one of them can explain the data distribution convincingly. In this case we need at least to run the algorithms many times and choose the most reasonable solutions. This is due to the intrinsic ill-posedness of the clustering problem, where the existence of a global solution cannot be assured and small perturbations of the data due to noise can lead to very different solutions. Some discussion of this point can be found in (Agarwal and Mustafa, 2004, Hu and Hu, 2007, Kaukoranta et al., 1996).
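The dependence on the initial configuration can be sketched with a plain Lloyd-style K-means restarted from several random seeds. This is an illustrative toy implementation under assumed settings (eight 2-D points, two clusters, twenty restarts), not the code used for the experiments. On this toy data, splitting the points by their second coordinate is also a fixed point of the iteration, with a much higher distortion than the left/right split, so different seeds may end in different minima.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, max_iter=100):
    """Plain Lloyd iteration started from k randomly chosen data points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing: a local minimum was reached
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    cost = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return cost, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0)]

# Run from 20 different seeds and keep the solution with the lowest distortion.
runs = [kmeans(points, k=2, seed=s) for s in range(20)]
best_cost, best_labels = min(runs, key=lambda run: run[0])
distinct_costs = sorted({round(cost, 6) for cost, _ in runs})
print("distinct final costs:", distinct_costs)
print("best cost:", best_cost)
```

Keeping all runs, rather than only the best one, is what allows the different solutions to be compared afterwards.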
As a consequence, in some cases researchers try to assess the stability and reliability of the obtained clusterings (Aidarkhanov and La, 2003, Bertoni and Valentini, 2005, Kerr and Churchill, 2001, Kuncheva and Vetrov, 2006, Smith and Dubes, 1980, Valentini and Ruffino, 2006). Stability means that slight perturbations of the inputs do not perturb the obtained clusters too much, i.e. elements belonging to a cluster tend to stay together after the input perturbation. Reliability, instead, means that a gene tends to belong to the same cluster with high probability or possibility.
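One simple way to quantify the "staying together" idea is the fraction of co-clustered pairs that remain co-clustered after the perturbation. The sketch below is an assumed, illustrative measure (the name `pair_retention` and the toy label vectors are hypothetical), not one of the stability indices from the cited works; the label vectors stand in for clusterings of the original and of the perturbed data.

```python
from itertools import combinations

def pair_retention(labels_before, labels_after):
    """Fraction of pairs co-clustered before the perturbation that stay together after it."""
    together = [(i, j)
                for i, j in combinations(range(len(labels_before)), 2)
                if labels_before[i] == labels_before[j]]
    if not together:
        return 1.0  # nothing was grouped, so nothing could be split
    kept = sum(1 for i, j in together if labels_after[i] == labels_after[j])
    return kept / len(together)

before = [0, 0, 0, 1, 1, 1]
after_small_noise = [1, 1, 1, 0, 0, 0]  # same groups, only the label names differ
after_large_noise = [0, 0, 1, 1, 2, 2]  # groups broken apart

print(pair_retention(before, after_small_noise))  # 1.0
print(pair_retention(before, after_large_noise))
```

Because only pairs are compared, the measure does not depend on how the clusters happen to be numbered in each run.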
Finally, biological validation means two different things: first, that the results agree with existing biological knowledge and, second, that some results extend the biological knowledge in a reasonable way (i.e. after new biological experiments).
A first approach to this clustering methodology was proposed in Ciaramella et al. (2007), where clustering parameter estimation and visualization techniques were applied to a gene expression data set. The results of the experiments were validated using Gene Ontology annotations.
Ontologies have long been used in an attempt to describe all entities within an area of reality and all relationships between those entities. An ontology comprises a set of well-defined terms with well-defined relationships. Computer scientists have made significant contributions to linguistic formalisms and computational tools for developing complex vocabulary systems. Gene Ontology (GO; The Gene Ontology Consortium, 2000) is a structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products in any organism. GO terms are the nodes of a network in which the connections between parents and children are known and form a directed acyclic graph. GO classifies genes according to three points of view: the Biological Process (the biological objective to which the gene product contributes), the Molecular Function (the biochemical activity of a gene product) and the Cellular Component (the place in the cell where a gene product is active).
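The directed acyclic graph structure can be sketched as a mapping from each term to its parents, where a term may have more than one parent. The term names below are invented placeholders, not real GO terms or identifiers, and the traversal is a generic graph walk rather than any GO tool's API.

```python
# Toy DAG: each term maps to the list of its parent terms (placeholder names).
parents = {
    "term_root": [],
    "term_A": ["term_root"],
    "term_B": ["term_root"],
    "term_C": ["term_A", "term_B"],  # two parents: this is a DAG, not a tree
    "term_D": ["term_C"],
}

def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

print(sorted(ancestors("term_D", parents)))
# ['term_A', 'term_B', 'term_C', 'term_root']
```

A gene annotated with a specific term is implicitly annotated with all of that term's ancestors, which is why such a traversal is useful when annotations are compared at different levels of detail.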
In this paper the above-mentioned analysis is extended. More precisely, after the parameter estimation step we perform a deeper clustering assessment phase, including the analysis of multiple solutions obtained from random initializations of the algorithms. To compare such solutions we introduce an entropy-based similarity measure and a visualization exploiting it. Moreover, a reliability method is used for the final tuning of the best solutions found.
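As a rough illustration of how entropy can compare two clusterings, the sketch below computes the conditional entropy H(A | B) of one labeling given another. This is a standard information-theoretic quantity chosen here as an assumption; it is not claimed to be the similarity measure defined in this paper. It is zero when clustering B determines clustering A, for example when B is a subclustering of A, and it is not symmetric.

```python
from collections import Counter
from math import log2

def conditional_entropy(a, b):
    """H(A | B) in bits: uncertainty left about the labels in `a` once `b` is known."""
    n = len(a)
    joint = Counter(zip(a, b))   # counts of (label in a, label in b) pairs
    b_counts = Counter(b)        # cluster sizes in b
    return sum((n_ab / n) * log2(b_counts[label_b] / n_ab)
               for (_, label_b), n_ab in joint.items())

coarse = [0, 0, 0, 0, 1, 1, 1, 1]
fine = [0, 0, 1, 1, 2, 2, 3, 3]  # every fine cluster lies inside one coarse cluster

print(conditional_entropy(coarse, fine))  # 0.0
print(conditional_entropy(fine, coarse))  # 1.0
```

The asymmetry is what makes this kind of quantity sensitive to subclustering relations that a plain overlap score would miss.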
The aim of our approach is not just to propose new clustering algorithms for specific problems, but to introduce an integrated framework in which many well-known methods are available to the user, allowing data exploration in an interactive and visual environment. Data visualization is also an important means of extracting useful information from large quantities of raw data. For example, biologists need a visual environment that facilitates the exploration of high-dimensional data that depend on many parameters.
In this framework, the above-mentioned steps are executed in a sequential process that the user can interactively modify at any point, according to his or her needs, as feedback from the observed results (see Fig. 1).
In this paper we applied our exploration environment to the analysis of an excerpt from the HeLa database of Whitfield et al. (2002). The experiment we consider analysed the gene expression of human tumor cells undergoing the cell division process. The dataset is composed of expression values for 1099 genes (out of 1134, including missing data) of cancer cells monitored for two days and sampled once per hour, making up an initial 1099×48 dataset.
The paper is organized as follows: in Section 2 we describe the kind of data used in our experiments; in Section 3 we introduce the preprocessing approaches used to eliminate genes that carry no information and to extract features from the data with a Robust Principal Component Analysis Neural Network; in Section 4 we introduce the data clustering assessment approach; in Section 5 a novel method for exploring the space of clustering solutions, namely Clustering Maps, is introduced and the results of its use are shown; in Section 6 some considerations are reported about the biological significance of the selected clusters; in Section 7 we show the results obtained in clustering, labeling, visualizing and validating genes periodically expressed in the human cancer cell line; finally, in Section 8 some comments on the approach are presented as concluding remarks.
Section snippets
Dataset
Microarray technology allows scientists to obtain a picture of several biological phenomena at a whole-genome scale. Genes sharing similar expression profiles might be functionally related or co-regulated, especially in time-series experiments. Furthermore, similar expression profiles can potentially be used to predict the functions of gene products with unknown functions, and to identify sets of genes that are regulated by the same mechanism. This kind of experiment generates both a great
Preprocessing and data acquisition
Microarray data is very noisy, and thus preprocessing plays an important role. Preprocessing is needed to filter out noise and to deal with missing data points. We used a two-step preprocessing phase: a preliminary rejection of noisy data, followed by nonlinear PCA feature extraction. In the first step we simply eliminated the genes that have no samples in the particular experiment we consider. Moreover, in most cases an interpolation is needed to
Clustering and assessment
The clustering phase can be viewed both as the process of building the final clustering for a dataset and as the process of reducing the number of objects to present to the user for subsequent steps of analysis. In our experiments we used some of the best-known classical methods in the literature:
- K-Means;
- Self Organizing Maps (SOM);
- Probabilistic Principal Surfaces (PPS) (Amato et al., 1995, Chang and Ghosh, 2001, Ciaramella et al., 2006).
Unfortunately, the clustering process is not straightforward
Clustering maps
When dealing with data clusterings, it often happens that two or more of them have to be compared. Many techniques are available for this problem; their aim is generally to measure how closely the two clusterings can be superimposed. In many cases this approach can miss important relations between clusterings, such as subclustering.
Subclustering analysis consists in studying the subclusters of each cluster. For example, in hierarchical clustering, it is equivalent
Biological findings
We used three different methods to cluster the expression profiles of genes from the cycling HeLa cells dataset (Whitfield et al., 2002), namely K-means, PPS and SOM. In order to check whether biological functions were significantly overrepresented in clusters, we assessed whether a cluster was enriched in a GO Biological Process (The Gene Ontology Consortium, 2000) compared to the reference (all genes of the dataset), by means of the NIAID DAVID tool (Dennis et al., 2003). Significant overrepresented GO
Cluster reliability
Once a clustering solution is obtained, an important issue arises: how to assess its reliability. By reliability we mean a quantitative measure of how much we can trust the membership of a specific data point in a cluster.
Given a clustering on a dataset, the similarity matrix for the clustering is the matrix with binary entries defined as follows:
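A common form of such a binary matrix, assumed here for illustration rather than taken from the text, is the co-membership matrix: entry (i, j) is 1 when data points i and j fall in the same cluster and 0 otherwise. The function name and the toy labels in this minimal sketch are hypothetical.

```python
def comembership_matrix(labels):
    """Binary matrix with entry (i, j) = 1 if points i and j share a cluster, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

labels = [0, 0, 1]  # toy clustering of three data points
for row in comembership_matrix(labels):
    print(row)
# [1, 1, 0]
# [1, 1, 0]
# [0, 0, 1]
```

The matrix is symmetric with ones on the diagonal, and it does not depend on how the clusters are numbered.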
For computing the reliability of each cluster in a clustering we first
Conclusions
In this work we presented a multi-step approach to data exploration that allows the extraction of the information contained in a microarray dataset. The aim is to find groups of genes with similar behavior and to search for a correlation between them and the biological annotations of the GO database. Three different clustering methodologies have been used and assessed to find the best parameters (i.e. number of centroids or neurons) and the best clusterings (after the Clustering Map process).
References (46)
- et al. Genetic clustering for automatic evolution of clusters and application to image classification. Pattern Recognition (2002)
- et al. Global optimization in clustering using hyperbolic cross points. Pattern Recognition (2007)
- et al. An invisible hybrid color image system using spread vector quantization neural networks with penalized FCM. Pattern Recognition (2007)
- et al. A study of some fuzzy cluster validity indices, genetic clustering and application to pixel classification. Fuzzy Sets and Systems (2005)
- et al. Stability of a hierarchical clustering. Pattern Recognition (1980)
- et al. Analysis of gene expression data using self-organizing maps. FEBS Letters (1999)
- et al. Soft computing methodologies for spectral analysis in cyclostratigraphy. Computer and Geosciences (2001)
- et al. k-means projective clustering
- Aidarkhanov, M. B., & La, L. L. (2003). On stability of group fuzzy classification algorithms, PRL(24), No. 12, August...
- et al. A multi-step approach to time series analysis and gene expression clustering. Bioinformatics (1995)
- Fuzzy neural network applied to gene expression profiling for producing the prognosis of diffuse large B-cell lymphoma. Cancer Research
- Clustering of gene expression data using a local shape-based similarity measure. Bioinformatics
- Clustering methods for microarray gene expression data. Omics
- Clustering gene expression patterns. Journal of Computational Biology
- Model selection in clustering by uniform convergence bounds. Advances in Neural Information Processing Systems
- Regulatory element detection using correlation with expression. Nature Genetics
- A hybrid genetic algorithm and expectation maximization method for global gene trajectory clustering. Journal of Bioinformatics and Computational Biology
- A unified model for probabilistic principal surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Gene expression pattern analysis via latent variable models coupled with topographic clustering. Genomics & Informatics
- A multifrequency analysis of radio variability of blazars. Astronomy & Astrophysics
☆ An abbreviated version of some portions of this article appeared in Ciaramella et al. (2007) as part of the IJCNN 2007 Conference Proceedings, published under IEEE copyright.