doi:10.1016/j.compbiolchem.2004.03.002    

Copyright © 2004 Elsevier Ltd. All rights reserved.

Cluster-C, an algorithm for the large-scale clustering of protein sequences based on the extraction of maximal cliques


S. Mohseni-Zadeh, P. Brézellec and J.-L. Risler (corresponding author)

Laboratoire Génome et Informatique, CNRS UMR 8116, Tour Evry2, 523 Place des Terrasses, 91034, Evry Cedex, France


Received 11 March 2004; revised 11 March 2004; accepted 22 March 2004. Available online 1 July 2004.

Abstract

Although proteins cannot be characterized by sequence similarity alone, it has been widely shown that massive all-vs-all sequence comparison can be an effective basis for predicting biochemical functions or for delineating shared properties. The program Cluster-C presented here enables the stand-alone, efficient construction of protein families within whole proteomes. The algorithm, which is based on the detection of cliques, ensures a high level of connectivity within the clusters. In contrast to the single transitive linkage method, Cluster-C can classify a large number of sequences without multidomain proteins producing a chain-grouping effect that results in meaningless clusters. Moreover, some proteins can be present in several different but relevant clusters, which helps in determining their functional domains. In the present analysis we used the Z-value, an evaluation of the significance of the similarity score, as the criterion for connecting sequences (the user can freely define the threshold of the similarity criterion). The clusters built with a rather low threshold (Z=14) include more than 97% of the sequences and are consistent with known protein families and PROSITE patterns.
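
The contrast drawn in the abstract between clique-based clustering and single transitive linkage can be illustrated with a small sketch. This is not the Cluster-C implementation: the abstract does not say how cliques are enumerated, so the sketch assumes a basic Bron-Kerbosch search, and the sequence names, Z-values and helper functions (`build_graph`, `maximal_cliques`, `single_linkage`) are all invented for illustration. Only the threshold Z=14 comes from the abstract.

```python
def build_graph(z_scores, threshold):
    """Connect two sequences when their pairwise Z-value meets the threshold."""
    graph = {}
    for (a, b), z in z_scores.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if z >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting).

    Assumption: chosen here for brevity; the abstract does not name
    the clique-detection procedure that Cluster-C uses.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-explored vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


def single_linkage(graph):
    """Single transitive linkage: clusters are the connected components."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(frozenset(component))
    return components


# Invented toy data: A1, A2 carry a hypothetical domain A; B1, B2 carry a
# hypothetical domain B; AB is a multidomain protein carrying both.
z_scores = {
    ("A1", "A2"): 30, ("A1", "AB"): 20, ("A2", "AB"): 25,
    ("B1", "B2"): 28, ("B1", "AB"): 18, ("B2", "AB"): 22,
    ("A1", "B1"): 3, ("A1", "B2"): 2, ("A2", "B1"): 4, ("A2", "B2"): 1,
}

graph = build_graph(z_scores, threshold=14)  # Z=14 as quoted in the abstract
cliques = maximal_cliques(graph)
components = single_linkage(graph)

print("clique clusters:", sorted(sorted(c) for c in cliques))
print("single linkage: ", sorted(sorted(c) for c in components))
```

On this toy input, the clique search yields two clusters that both contain the multidomain protein `AB`, whereas single linkage chains all five sequences into one cluster through `AB`. The later steps listed in the article outline (extension and merging of clusters) are not modelled here.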

Author Keywords: Cluster-C; Protein clustering; Maximal clique; Multidomain proteins; Sequence similarity

Article Outline

1. Introduction
2. Methods and algorithm
2.1. Massive protein sequence comparison
2.2. Clustering algorithm
2.2.1. Step 1
2.2.2. Step 2
2.2.3. Step 3 (optional)
2.2.4. Step 4
3. Examples of application
3.1. Description of the data
3.2. Clique searching step
3.3. Extension of clusters
3.4. Merging of clusters
3.5. Comparison with the single linkage method
3.6. Validation of clusters
4. Conclusion
Acknowledgements
References





Corresponding author. Tel.: +33-1-60-87-38-67; fax: +33-1-60-87-38-97.


 