Abstract
We studied 1372 LacI-family transcription factors and their 4484 DNA binding sites using machine learning algorithms and feature selection techniques. The naive Bayes classifier and logistic regression were used to predict binding sites from transcription factor sequences and to classify factor-site pairs as binding or non-binding. Prediction accuracy was estimated by 10-fold cross-validation. The experiments showed that nucleotide densities at selected site positions are best predicted from only a few key protein sequence positions. These positions are stably selected by forward feature selection based on the mutual information of factor-site position pairs.
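The forward selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact selection criterion, so the version here is an assumption: at each step it greedily adds the protein alignment column that maximizes the empirical mutual information between the jointly selected columns and the nucleotide at one site position. The function names (`mutual_information`, `forward_select`) and the toy sequences are hypothetical and are not taken from the article; the classifiers and the cross-validation are left out.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two equal-length label sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed (x, y) pairs of p(x, y) * log2(p(x, y) / (p(x) * p(y))).
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def forward_select(proteins, nucleotides, k):
    """Greedily pick k protein columns whose joint value is most informative
    about the nucleotide observed at a single site position (assumed criterion)."""
    selected = []
    remaining = list(range(len(proteins[0])))
    for _ in range(k):
        def score(col):
            cols = selected + [col]
            joint = [tuple(p[c] for c in cols) for p in proteins]
            return mutual_information(joint, nucleotides)

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: aligned protein fragments and, for each, the nucleotide
# at one site position. Here the nucleotide is fully determined by column 2.
proteins = ["ARKL", "ARQL", "GRKL", "GRQL", "ARKM", "GRQM", "AKKL", "GKQL"]
nucleotides = ["A", "G", "A", "G", "A", "G", "A", "G"]

first_pick = forward_select(proteins, nucleotides, 1)
```

On this toy input the first selected column is the one that determines the nucleotide. With real alignments the joint mutual information estimate becomes unreliable as more columns are added, which is one reason to stop after a few positions.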
Additional information
Original Russian Text © G.G. Fedonin, A.B. Rakhmaninova, Yu.D. Korostelev, O.N. Laikova, M.S. Gelfand, 2011, published in Molekulyarnaya Biologiya, 2011, Vol. 45, No. 4, pp. 724–737.
The article was translated by the authors.
About this article
Cite this article
Fedonin, G.G., Rakhmaninova, A.B., Korostelev, Y.D. et al. Machine learning study of DNA binding by transcription factors from the LacI family. Mol Biol 45, 667–679 (2011). https://doi.org/10.1134/S0026893311040054