A Constraint Based Structure Description Language for Biosequences

Constraints

Abstract

We report an investigation into how constraint solving techniques can be used to search for patterns in sequences (or strings) of symbols over a finite alphabet. We define a constraint-based structure description language for biosequences and give an algorithm that solves the structure searching problem as a constraint satisfaction problem (CSP). The methodology can describe two-dimensional structures of biosequences, such as tandem repeats, stem loops, palindromes and pseudo-knots. We also report on an implementation of the language in the constraint logic programming language clp(FD), with test results for a simple searching algorithm, and on results from a preliminary implementation in C++ that uses consistency checking techniques from CSP solving.
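
To make the idea of structure searching as a CSP concrete, the following is a minimal Python sketch for one of the structures named above, the stem loop. It is not the authors' description language, their clp(FD) program, or their C++ implementation. The function name `find_stem_loops`, the `StemLoop` record, the fixed stem length, and the restriction to Watson-Crick DNA pairs are all assumptions made for illustration. The variables are the start position and the loop length; the constraints are that the structure fits inside the sequence and that each base of the first stem arm pairs with the mirrored base of the second arm.

```python
from typing import Iterator, NamedTuple

# Watson-Crick pairs for DNA. This alphabet and pairing rule are an
# assumption of the sketch, not something taken from the paper.
PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}


class StemLoop(NamedTuple):
    """Hypothetical record describing one match."""
    start: int     # index of the first base of the 5' stem arm
    stem_len: int  # number of paired bases in the stem
    loop_len: int  # number of unpaired bases between the two arms


def find_stem_loops(
    seq: str, stem_len: int, min_loop: int, max_loop: int
) -> Iterator[StemLoop]:
    """Enumerate stem loops by plain generate-and-test over two variables.

    Variables: start position and loop length.
    Constraints: the whole structure lies inside seq, and base k of the
    first arm pairs with base k (counted from the end) of the second arm.
    """
    n = len(seq)
    for start in range(n):
        for loop_len in range(min_loop, max_loop + 1):
            end = start + 2 * stem_len + loop_len  # one past the last base
            if end > n:
                break  # longer loops cannot fit from this start either
            # all() stops at the first violated pairing constraint, so a
            # candidate is abandoned as soon as one pair fails.
            if all(
                (seq[start + k], seq[end - 1 - k]) in PAIRS
                for k in range(stem_len)
            ):
                yield StemLoop(start, stem_len, loop_len)
```

For instance, `list(find_stem_loops("GGGAAAACCC", 3, 3, 5))` yields the single match `StemLoop(start=0, stem_len=3, loop_len=4)`. Setting `min_loop` and `max_loop` to 0 turns the same constraints into a search for reverse-complement palindromes such as `GAATTC`. A real solver would propagate constraints to shrink variable domains before enumerating; this sketch only tests candidates and stops early on the first failed pair.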

Cite this article

Eidhammer, I., Jonassen, I., Grindhaug, S.H. et al. A Constraint Based Structure Description Language for Biosequences. Constraints 6, 173–200 (2001). https://doi.org/10.1023/A:1011481521835

  • DOI: https://doi.org/10.1023/A:1011481521835
