Classifying Protein Fingerprints

Hilario, Melanie; Mitchell, Alex; Kim, Jee-Hyub; Bradley, Paul; Attwood, Terri

doi:10.1007/978-3-540-30116-5_20

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3202)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

Protein fingerprints are groups of conserved motifs which can be used as diagnostic signatures to identify and characterize collections of protein sequences. These fingerprints are stored in the PRINTS database after time-consuming annotation by domain experts, who must first determine the fingerprint type, i.e., whether a fingerprint depicts a protein family, superfamily or domain. To alleviate the annotation bottleneck, a system called PRECIS has been developed which automatically generates PRINTS records, provisionally stored in a supplement called prePRINTS. One limitation of PRECIS is that its classification heuristics, handcoded by proteomics experts, often misclassify fingerprint type; their error rate has been estimated at 40%. This paper reports on an attempt to build more accurate classifiers based on information drawn from the fingerprints themselves and from the SWISS-PROT database. Extensive experimentation using 10-fold cross-validation led to the selection of a model combining the ReliefF feature selector with an SVM-RBF learner. The final model's error rate was estimated at 14.1% on a blind test set, an accuracy gain of about 26 percentage points over PRECIS' handcrafted rules.
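
The feature-selection step named in the abstract can be illustrated with a minimal sketch of ReliefF-style feature weighting. This is not the authors' implementation: the abstract does not give the feature set, the data, or the ReliefF settings, so the function name relieff_weights, the parameter k, and the toy data below are all assumptions made for illustration. The sketch handles numeric features only and visits every instance once instead of drawing a random sample. The SVM-RBF learner and the 10-fold cross-validation are not shown.

```python
from collections import Counter


def relieff_weights(X, y, k=1):
    """Return one ReliefF-style relevance weight per numeric feature."""
    n, d = len(X), len(X[0])
    # Feature ranges normalise per-feature differences to [0, 1].
    spans = []
    for j in range(d):
        col = [row[j] for row in X]
        spans.append((max(col) - min(col)) or 1.0)
    # Class priors weight the contribution of each "miss" class.
    priors = {c: cnt / n for c, cnt in Counter(y).items()}

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        r, cls = X[i], y[i]
        for c in priors:
            # Up to k nearest neighbours of r within class c (r excluded).
            pool = [X[t] for t in range(n) if t != i and y[t] == c]
            near = sorted(pool, key=lambda p: dist(r, p))[:k]
            if not near:
                continue
            # Same-class neighbours (hits) lower a weight; other-class
            # neighbours (misses) raise it, scaled by the class prior.
            scale = -1.0 if c == cls else priors[c] / (1.0 - priors[cls])
            for p in near:
                for j in range(d):
                    weights[j] += scale * diff(j, r, p) / (n * len(near))
    return weights


# Hypothetical toy data: feature 0 tracks the class, feature 1 does not.
X = [[0.0, 0.0], [0.1, 1.0], [0.5, 0.0], [0.6, 1.0], [0.9, 0.0], [1.0, 1.0]]
y = ["family", "family", "superfamily", "superfamily", "domain", "domain"]
weights = relieff_weights(X, y, k=1)
print(weights)
```

On this toy data the class-tracking feature receives a positive weight and the uninformative one a negative weight. A selector would keep the top-ranked features before passing them to the learner.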

Author information

Authors and Affiliations

Artificial Intelligence Laboratory, University of Geneva, Switzerland
Melanie Hilario & Jee-Hyub Kim
European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
Alex Mitchell & Paul Bradley
School of Biological Sciences, University of Manchester, UK
Terri Attwood

Editor information

Editors and Affiliations

INSA-Lyon, LIRIS CNRS UMR5205, F-69621, Villeurbanne, France
Jean-François Boulicaut
Dipartimento di Informatica, Università degli Studi di Bari
Floriana Esposito
Pisa KDD Laboratory, ISTI - CNR, Area della Ricerca di Pisa, Via Giuseppe Moruzzi 1, Pisa, Italy
Fosca Giannotti
Dipartimento di Informatica, Via F. Buonarroti 2, 56127, Pisa, Italy
Dino Pedreschi

Cite this paper

Hilario, M., Mitchell, A., Kim, JH., Bradley, P., Attwood, T. (2004). Classifying Protein Fingerprints. In: Boulicaut, JF., Esposito, F., Giannotti, F., Pedreschi, D. (eds) Knowledge Discovery in Databases: PKDD 2004. PKDD 2004. Lecture Notes in Computer Science, vol 3202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30116-5_20

DOI: https://doi.org/10.1007/978-3-540-30116-5_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23108-0
Online ISBN: 978-3-540-30116-5
eBook Packages: Springer Book Archive
