Elsevier

Neural Networks

Volume 21, Issues 2–3, March–April 2008, Pages 368-378

2008 Special Issue
Interactive data analysis and clustering of genomic data

https://doi.org/10.1016/j.neunet.2007.12.026

Abstract

In this work a new clustering approach is used to explore a well-known dataset [Whitfield, M. L., Sherlock, G., Saldanha, A. J., Murray, J. I., Ball, C. A., Alexander, K. E., et al. (2002). Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Molecular Biology of the Cell, 13, 1977–2000] of time-dependent gene expression profiles in the human cell cycle. The approach is realized as a multi-step procedure: after preprocessing, parameters are chosen by using data subsampling and stability measures; for each model used, several different clustering solutions are obtained by random initialization and are selected on the basis of a similarity measure and a figure of merit; finally, the selected solutions are tuned by evaluating a reliability measure. Three different clustering models, K-means, Self-Organizing Maps and Probabilistic Principal Surfaces, are compared. The comparative analysis considers the similarity between the best solutions obtained through the three methods, the absolute distortion value, and validation through Gene Ontology (GO) annotations. The GO annotations are used to give significance to the obtained clusters and to compare the results with those obtained in the work cited above.

Introduction

In recent years many papers have been published in the bioinformatics literature on the use of clustering techniques for gene expression analysis (Amato et al., 1995, Ando et al., 2002, Balasubramaniyan et al., 2005, Belacel et al., 2006, Ben-Dor et al., 1999, Bussermaker et al., 2001, Chan et al., 2005, Chang et al., 2003, Eisen et al., 1998, Jiang et al., 2004, Törönen et al., 1999, Tamayo et al., 1999, Yeung et al., 2001, Zhu and Zhang, 2000). The main idea of cluster analysis is to group together genes with similar behavior, both in their response to specific stimuli and in the time evolution of their expression levels. To reach results of clear and meaningful biological significance, the clustering process must include assessment and validation phases. The whole clustering procedure can be divided into the following steps:

  • Preprocessing and identification of the subset of data that is actually to be taken into account.

  • Model selection (number and value of parameters, figure of merit, type of learning algorithm, etc.).

  • Finding the optimal solution of the clustering problem given the selected model.

  • Evaluating the stability and reliability of the obtained clustering.

  • Finding the biological validation of the proposed solution.

Correct preprocessing is clearly needed so that the data can be used adequately by the clustering algorithms. In the same way, a priori information is useful to reduce the complexity of the task and to direct the work toward the aims of the experiment designer. In most cases a significant reduction of the data dimension is obtained by eliminating genes that are too noisy.

The second step is model selection: it consists of experimenting with different clustering algorithms, such as hierarchical clustering, K-means and Self-Organizing Maps (Duda, Hart, & Stork, 2001), or Genetic Clustering, or fuzzy variations of them (Bandyopadhyay and Maulik, 2002, Di Gesù et al., 2005, Laszlo and Mukherjee, 2006, Lin, 1999, Lin and Chen, 2007, Pakhira et al., 2005). In general, all clustering techniques implement an algorithm that tries to minimize an objective function (Vector Quantization, Distortion, etc.). The only way to change the figure of merit is to use a different metric; this choice is also part of the model selection step. Since each clustering approach is characterized by some parameters, first of all the number of clusters to be used, parameter estimation is another fundamental step of the process (Buhmann and Held, 1999, Jung et al., 2003, Li et al., 1999).

The most widely used algorithms start from a random or arbitrary initial configuration and then evolve to a local minimum of the objective function. In complex problems (that is, in many real cases) there are several minima, and more than one of them can explain the data distribution convincingly. In this case the algorithms must at least be run many times in order to choose the most reasonable solutions. This is due to the intrinsic ill-posedness of the clustering problem, where the existence of a global solution cannot be assured and small perturbations of the data due to noise can lead to very different solutions. Some discussion of this point can be found in (Agarwal and Mustafa, 2004, Hu and Hu, 2007, Kaukoranta et al., 1996).
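The dependence on initialization can be illustrated with a minimal sketch, assuming plain K-means with squared Euclidean distortion as the objective. The function names (`kmeans`, `best_of_restarts`) and the restart count are illustrative choices, not taken from the paper.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points]

def kmeans(points, k, rng, max_iterations=100):
    """Lloyd iterations from a random choice of k data points as centroids."""
    centroids = rng.sample(points, k)
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members)
                                     for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # a local minimum of the distortion has been reached
        centroids = updated
    labels = assign(points, centroids)
    distortion = sum(squared_distance(p, centroids[label])
                     for p, label in zip(points, labels))
    return labels, centroids, distortion

def best_of_restarts(points, k, restarts=20, seed=0):
    """Run K-means from several random starts; return the best run and all runs."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2]), runs
```

Keeping all runs, rather than only the one with the lowest distortion, makes it possible to inspect alternative local minima that may also describe the data reasonably well.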

As a consequence, in some cases researchers try to assess the stability and reliability of the obtained clusterings (Aidarkhanov and La, 2003, Bertoni and Valentini, 2005, Kerr and Churchill, 2001, Kuncheva and Vetrov, 2006, Smith and Dubes, 1980, Valentini and Ruffino, 2006). Stability means that slight perturbations of the inputs do not perturb the obtained clusters too much, i.e. elements belonging to a cluster tend to stay together after the input perturbation. Reliability, instead, means that a gene tends always to belong to the same cluster with high probability or possibility.
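One way to make "elements tend to stay together" concrete is to compare the pairs of items that share a cluster before and after a perturbation. The sketch below uses a Jaccard coefficient over co-clustered pairs; this is an assumed stand-in, since the stability measure actually used in the paper is not given in this excerpt, and the function names are hypothetical.

```python
from itertools import combinations

def co_clustered_pairs(labels):
    """Set of index pairs (i, j), i < j, whose items share a cluster label."""
    return {(i, j)
            for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

def pair_stability(labels_original, labels_perturbed):
    """Jaccard coefficient between the co-clustered pairs of two clusterings.

    Both label lists must refer to the same items in the same order.
    The value is 1.0 when exactly the same pairs stay together and 0.0
    when no pair does; it does not depend on how clusters are numbered.
    """
    a = co_clustered_pairs(labels_original)
    b = co_clustered_pairs(labels_perturbed)
    if not a and not b:
        return 1.0  # both clusterings consist of singletons only
    return len(a & b) / len(a | b)
```

Because only pair membership is compared, renaming the clusters between the two runs does not affect the score.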

Finally, biological validation covers two different outcomes: first, that the results agree with existing biological knowledge; second, that some results extend biological knowledge in a reasonable way (i.e. after confirmation by new biological experiments).

A first approach to this clustering methodology was proposed in Ciaramella et al. (2007), where clustering parameter estimation and visualization techniques were applied to a gene expression data set. The results of the experiments were validated by using Gene Ontology annotations.

Ontologies have long been used in an attempt to describe all entities within an area of reality and all relationships between those entities. An ontology comprises a set of well-defined terms with well-defined relationships. Computer scientists have made significant contributions to linguistic formalisms and computational tools for developing complex vocabulary systems. Gene Ontology (GO, The Gene Ontology Consortium (2000)) is a structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products in any organism. GO terms are the nodes of a network, so the connections between parents and children are known and form a directed acyclic graph. The GO classifies genes according to three points of view: the Biological Process (the biological objective to which the gene product contributes), the Molecular Function (the biochemical activity of a gene product) and the Cellular Component (the place in the cell where a gene product is active).
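The directed acyclic graph structure can be sketched with a small parent map. The term names below are hypothetical placeholders, not real GO identifiers, and the helper `ancestors` is an illustrative name.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following child-to-parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical toy fragment: each key maps a term to its direct parents.
toy_parents = {
    "root": [],
    "process_a": ["root"],
    "process_b": ["root"],
    "process_c": ["process_a", "process_b"],  # two parents: a DAG, not a tree
    "process_d": ["process_c"],
}

print(sorted(ancestors("process_d", toy_parents)))
```

A term with more than one parent is what distinguishes the graph from a simple tree; a gene annotated with a specific term is usually also regarded as annotated with every ancestor of that term.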

In this paper the above-mentioned analysis is extended. More precisely, after the parameter estimation step we perform a deeper clustering assessment phase, which includes the analysis of multiple solutions obtained from random initializations of the algorithms. To compare such solutions we introduce an entropy-based similarity measure and a visualization that exploits it. Moreover, a reliability method is used for the final tuning of the best solutions found.
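The exact entropy-based measure is not reproduced in this excerpt. As an assumed stand-in, the sketch below uses the conditional entropy between two label vectors, which is one standard information-theoretic way to compare clusterings; the function names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the cluster-size distribution."""
    n = len(labels)
    return -sum((count / n) * log2(count / n)
                for count in Counter(labels).values())

def conditional_entropy(labels_a, labels_b):
    """H(A | B): uncertainty left about A's cluster once B's cluster is known.

    Computed as H(A, B) - H(B). It is zero when every cluster of B lies
    entirely inside one cluster of A, i.e. when B is a subclustering of A.
    """
    joint = list(zip(labels_a, labels_b))
    return entropy(joint) - entropy(labels_b)

coarse = [0, 0, 0, 0, 1, 1, 1, 1]
fine = [0, 0, 1, 1, 2, 2, 3, 3]   # each coarse cluster split in two
print(conditional_entropy(coarse, fine))  # 0.0
print(conditional_entropy(fine, coarse))  # 1.0
```

The measure is asymmetric, which is useful here: a value of zero in one direction only signals that one clustering refines the other, a relation that a symmetric overlap score would not reveal.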

The aim of our approach is not just to propose new clustering algorithms for specific problems, but to introduce an integrated framework in which many well-known methods are available to the user for data exploration in an interactive and visual environment. Data visualization is itself an important means of extracting useful information from large quantities of raw data. For example, biologists need a visual environment that facilitates the exploration of high-dimensional data that depend on many parameters.

In this framework, the above-mentioned steps are executed in a sequential process that the user can modify interactively at any point, according to need, in response to the observed results (see Fig. 1).

In this paper we apply our exploration environment to the analysis of an excerpt from the HeLa database of Whitfield et al. (2002). The experiment we consider analysed the gene expression of human tumor cells undergoing the cell division process. The dataset is composed of expression values for 1099 genes (out of 1134, including missing data) of cancer cells monitored for two days and sampled once per hour, making up a 1099×48 initial dataset.

The paper is organized as follows: in Section 2 we describe the data used in our experiments; in Section 3 we introduce the preprocessing approaches used to eliminate genes that carry no information and to extract features from the data by means of a Robust Principal Component Analysis Neural Network; in Section 4 we introduce the data clustering assessment approach; in Section 5 a novel method for exploring the space of clustering solutions, namely Clustering Maps, is introduced and the results of its use are shown; in Section 6 some considerations are reported about the biological significance of the selected clusters; in Section 7 we show the results obtained in clustering, labelling, visualizing and validating genes periodically expressed in the human cancer cell line; and finally in Section 8 some comments on the approach are presented as concluding remarks.

Section snippets

Dataset

Microarray technology gives scientists a picture of several biological phenomena on a whole-genome scale. Genes sharing similar expression profiles might be functionally related or co-regulated, especially in time-series experiments. Furthermore, similar expression profiles can potentially be utilized to predict the functions of gene products with unknown functions, and to identify sets of genes that are regulated by the same mechanism. This kind of experiment generates both a great

Preprocessing and data acquisition

Microarray data are very noisy, and thus preprocessing plays an important role. Preprocessing is needed to filter out noise and to deal with missing data points. We used a two-step preprocessing phase: a preliminary procedure of noisy data rejection, followed by nonlinear PCA feature extraction. In the first part of the preprocessing we simply eliminated the genes that have no samples in the particular experiment that we consider. Moreover, in most cases an interpolation is needed to

Clustering and assessment

The clustering phase can be viewed both as the process of building the final clustering for a dataset and as the process of reducing the number of objects presented to the user in subsequent steps of the analysis. In our experiments we used some of the best-known classical methods in the literature:

  • K-Means;

  • Self-Organizing Maps (SOM);

  • Probabilistic Principal Surfaces (PPS) (Amato et al., 1995, Chang and Ghosh, 2001, Ciaramella et al., 2006).

Unfortunately, the clustering process is not straightforward

Clustering maps

When working with data clusterings, it is often necessary to compare two or more of them. Many techniques are available for this problem, and their aim is generally to measure how closely two clusterings can be superimposed. In many cases this criterion can miss important relations between clusterings, such as subclustering.

Subclustering analysis consists in studying the subclusters of each cluster. For example, in hierarchical clustering, it is equivalent

Biological findings

We used three different methods to cluster the expression profiles of genes from the cycling HeLa cell dataset (Whitfield et al., 2002), namely K-means, PPS and SOM. In order to check whether biological functions were significantly overrepresented in clusters, we assessed whether a cluster was enriched in a GO Biological Process (The Gene Ontology Consortium, 2000) compared to the reference (all genes of the dataset), by means of the NIAID DAVID tool (Dennis et al., 2003). Significant overrepresented GO
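To illustrate what "enriched compared to the reference" means numerically, the following sketch computes a plain hypergeometric upper-tail probability. This is an assumption made for illustration: the statistic reported by DAVID may differ, and the function name and the numbers in the usage line are hypothetical, not results from the paper.

```python
from math import comb

def enrichment_p_value(population, annotated, cluster_size, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, cluster_size).

    population   : number of genes in the reference set
    annotated    : reference genes carrying the GO term
    cluster_size : number of genes in the cluster
    hits         : cluster genes carrying the GO term
    """
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(comb(annotated, x) * comb(population - annotated, cluster_size - x)
               for x in range(hits, upper + 1))
    return tail / total

# Hypothetical counts, chosen only to show the call.
print(enrichment_p_value(population=1000, annotated=50, cluster_size=100, hits=15))
```

A small value indicates that the term appears in the cluster more often than random sampling from the reference would be expected to produce; when many terms are tested, a multiple-testing correction is normally applied on top of this.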

Cluster reliability

Once a clustering solution is obtained, an important issue arises: how to assess its reliability. By reliability we mean a quantitative measure of how much we can trust the membership of a specific data point in a cluster.

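Given a clustering $C=\{C_1,\ldots,C_K\}$ on a dataset $D=\{x_1,\ldots,x_N\}$, the similarity matrix for $C$ is the $N\times N$ matrix $M$ with binary entries defined as follows:

$$M_{i,j}=\begin{cases}1 & \text{if } \exists\, l\in[1,\ldots,K] \mid x_i,x_j\in C_l\\ 0 & \text{otherwise,}\end{cases}\qquad i,j=1,\ldots,N.$$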
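The definition can be written out directly as a minimal sketch, assuming a hard clustering encoded as one label per data point; the function name `similarity_matrix` is illustrative.

```python
def similarity_matrix(labels):
    """N x N binary matrix: entry (i, j) is 1 iff items i and j share a cluster.

    `labels[i]` is the cluster index of data point x_i, so each point
    belongs to exactly one cluster.
    """
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

for row in similarity_matrix([0, 0, 1]):
    print(row)
# [1, 1, 0]
# [1, 1, 0]
# [0, 0, 1]
```

The matrix is symmetric with ones on the diagonal, and it is unchanged when the cluster labels are renumbered, which makes it convenient for comparing clusterings produced by different runs.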

For computing the reliability of each cluster in a clustering we first

Conclusions

In this work we presented a multi-step approach to data exploration that allows the extraction of the information contained in a microarray dataset. The aim is to find groups of genes with similar behavior and to search for a correlation between them and the biological annotations of the GO database. Three different clustering methodologies have been used and assessed in order to find the best parameters (i.e. number of centroids or neurons) and the best clusterings (after the Clustering Map process).

References (46)

  • T. Ando et al. Fuzzy neural network applied to gene expression profiling for producing the prognosis of diffuse large B-cell lymphoma. Cancer Research (2002)
  • R. Balasubramaniyan et al. Clustering of gene expression data using a local shape-based similarity measure. Bioinformatics (2005)
  • N. Belacel et al. Clustering methods for microarray gene expression data. Omics (2006)
  • A. Ben-Dor et al. Clustering gene expression patterns. Journal of Computational Biology (1999)
  • Ben-Hur, A., Elisseeff, A., & Guyon, I. (2002). A stability based method for discovering structure in clustered data....
  • Bertoni, A., & Valentini, G. (2005). Random projections for assessing gene expression cluster stability, IJCNN ’05. In...
  • J.M. Buhmann et al. Model selection in clustering by uniform convergence bounds. Advances in Neural Information Processing Systems (1999)
  • H.J. Bussermaker et al. Regulatory element detection using correlation with expression. Nature Genetics (2001)
  • Z.S.H. Chan et al. A hybrid genetic algorithm and expectation maximization method for global gene trajectory clustering. Journal of Bioinformatics and Computational Biology (2005)
  • K. Chang et al. A unified model for probabilistic principal surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence (2001)
  • J.H. Chang et al. Gene expression pattern analysis via latent variable models coupled with topographic clustering. Genomics & Informatics (2003)
  • A. Ciaramella et al. A multifrequency analysis of radio variability of blazars. Astronomy & Astrophysics (2004)
  • A. Ciaramella et al.

An abbreviated version of some portions of this article appeared in Ciaramella et al. (2007) as part of the IJCNN 2007 Conference Proceedings, published under IEEE copyright.