Abstract
Protein fingerprints are groups of conserved motifs that can be used as diagnostic signatures to identify and characterize collections of protein sequences. These fingerprints are stored in the PRINTS database after time-consuming annotation by domain experts, who must first determine the fingerprint type, i.e., whether a fingerprint depicts a protein family, superfamily or domain. To alleviate the annotation bottleneck, a system called PRECIS has been developed which automatically generates PRINTS records, provisionally stored in a supplement called prePRINTS. One limitation of PRECIS is that its classification heuristics, handcoded by proteomics experts, often misclassify fingerprint type; their error rate has been estimated at 40%. This paper reports on an attempt to build more accurate classifiers based on information drawn from the fingerprints themselves and from the SWISS-PROT database. Extensive experimentation using 10-fold cross-validation led to the selection of a model combining the ReliefF feature selector with an SVM-RBF learner. The final model’s error rate was estimated at 14.1% on a blind test set, an accuracy gain of roughly 26 percentage points over PRECIS’ handcrafted rules.
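To make the feature-selection step more concrete, the following is a minimal sketch of ReliefF-style feature scoring in plain Python. It is not the authors' implementation: the function name `relieff_scores`, the toy data, the Manhattan distance, the choice of k = 1 and the exhaustive pass over all instances (instead of random sampling) are assumptions made for illustration. The three class labels mirror the fingerprint types named in the abstract; the two numeric features are hypothetical.

```python
from collections import Counter


def relieff_scores(X, y, k=1):
    """Score features with a simplified ReliefF pass over every instance.

    Higher scores mean a feature differs more between classes than within
    a class among nearest neighbours.
    """
    n, d = len(X), len(X[0])

    # Per-feature value ranges, used to normalise differences to [0, 1].
    spans = []
    for j in range(d):
        col = [row[j] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        # Manhattan distance over normalised features (an assumption).
        return sum(diff(j, a, b) for j in range(d))

    priors = {c: cnt / n for c, cnt in Counter(y).items()}
    weights = [0.0] * d

    for i in range(n):
        # Group every other instance by class, with its distance to i.
        by_class = {}
        for m in range(n):
            if m != i:
                by_class.setdefault(y[m], []).append((dist(X[i], X[m]), m))

        for c, neigh in by_class.items():
            neigh.sort()
            nearest = [m for _, m in neigh[:k]]
            if c == y[i]:
                factor = -1.0  # nearest hits: penalise within-class spread
            else:
                # nearest misses: reward separation, weighted by class prior
                factor = priors[c] / (1.0 - priors[y[i]])
            for m in nearest:
                for j in range(d):
                    weights[j] += factor * diff(j, X[i], X[m]) / (n * len(nearest))
    return weights


# Toy data (hypothetical): feature 0 tracks the class, feature 1 is noise.
X = [
    [0.0, 0.30], [0.1, 0.90], [0.2, 0.10],
    [1.0, 0.80], [1.1, 0.20], [1.2, 0.70],
    [2.0, 0.40], [2.1, 0.95], [2.2, 0.00],
]
y = ["family"] * 3 + ["superfamily"] * 3 + ["domain"] * 3

scores = relieff_scores(X, y, k=1)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

In a pipeline of the kind the abstract describes, scores like these would be used to keep only the top-ranked features before training the classifier, with the selection repeated inside each cross-validation fold so that the held-out fold does not influence which features are chosen.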
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Hilario, M., Mitchell, A., Kim, JH., Bradley, P., Attwood, T. (2004). Classifying Protein Fingerprints. In: Boulicaut, JF., Esposito, F., Giannotti, F., Pedreschi, D. (eds) Knowledge Discovery in Databases: PKDD 2004. Lecture Notes in Computer Science, vol 3202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30116-5_20
DOI: https://doi.org/10.1007/978-3-540-30116-5_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23108-0
Online ISBN: 978-3-540-30116-5