
Machine learning study of DNA binding by transcription factors from the LacI family

  • Bioinformatics
Molecular Biology

Abstract

We studied 1372 LacI-family transcription factors and their 4484 DNA binding sites using machine learning algorithms and feature selection techniques. The Naive Bayes classifier and logistic regression were used to predict binding sites from transcription factor sequences and to classify factor-site pairs into binding and non-binding ones. Prediction accuracy was estimated by 10-fold cross-validation. The experiments showed that nucleotide densities at selected site positions are predicted best from only a few key protein sequence positions. These positions are consistently selected by forward feature selection based on the mutual information of factor-site position pairs.
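The idea of forward feature selection driven by mutual information can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring rule (joint mutual information between the residues at the chosen protein positions and the nucleotide at one site position), the function names `mutual_information` and `forward_select`, and the toy alignment columns are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two equal-length
    sequences of symbols, estimated from co-occurrence counts."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def forward_select(protein_columns, site_column, k):
    """Greedily choose k protein positions (hypothetical criterion):
    at each step add the position that maximizes the mutual information
    between the joint residues at the chosen positions and the nucleotide
    at the given site position."""
    selected = []
    remaining = set(protein_columns)
    for _ in range(k):
        def score(pos):
            cols = [protein_columns[p] for p in selected + [pos]]
            joint = list(zip(*cols))  # one residue tuple per factor-site pair
            return mutual_information(joint, site_column)
        # sorted() makes tie-breaking deterministic (lowest position wins)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (invented): each string is one alignment column, with one
# character per factor-site pair.
protein_columns = {
    0: "AAAAVVVV",  # fully determines the nucleotide below
    1: "KRKRKRKR",  # independent of the nucleotide
    2: "QQQQQQQQ",  # conserved, carries no information
}
site_column = "AAAATTTT"

print(forward_select(protein_columns, site_column, k=1))
```

In this toy example position 0 is chosen first because it carries one full bit of information about the nucleotide, while the other two columns carry none. On real alignments the counts would need pseudocounts or a significance correction, which this sketch omits.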



Author information

Corresponding author

Correspondence to G. G. Fedonin.

Additional information

Original Russian Text © G.G. Fedonin, A.B. Rakhmaninova, Yu.D. Korostelev, O.N. Laikova, M.S. Gelfand, 2011, published in Molekulyarnaya Biologiya, 2011, Vol. 45, No. 4, pp. 724–737.

The article was translated by the authors.

About this article

Cite this article

Fedonin, G.G., Rakhmaninova, A.B., Korostelev, Y.D. et al. Machine learning study of DNA binding by transcription factors from the LacI family. Mol Biol 45, 667–679 (2011). https://doi.org/10.1134/S0026893311040054

  • DOI: https://doi.org/10.1134/S0026893311040054
