Abstract
Repetitive DNA sequences cause genomic instability and are important genetic markers. Identification of repeats is a critical step in genome annotation and analysis. Repeats also pose a technical challenge for genome assembly and alignment programs that use NGS data. RFGR is a comprehensive tool that finds exact repetitive sequences in complete genomes, assembled genomes, and NGS reads of prokaryotes. For complete genomes, RFGR uses a suffix tree to find seed repeats of repetitive sequences of fixed length with indels. For assembled genomes, RFGR uses a modified Bowtie aligner to find seed repeats of exact repetitive sequences in the contigs/scaffolds, which are then extended to maximal repeats. The repeats are classified, and for repeats near a gene, RFGR also reports the gene. For the control datasets of E. coli UTI89 and E. coli K12, RFGR reports 35,141 and 49,352 repeats, respectively. For NGS reads, RFGR uses the frequency of repetitive k-mers to identify FASTQ reads containing repetitive sequences and removes them from the dataset. An E. coli K12 NGS dataset pre-processed with RFGR gives an improved assembly compared with the original dataset: the N50 value improves by 22.86%, and the size of the assembly graph decreases by nearly 50%. Thus, with RFGR, we achieve a better assembly with reduced computation. RFGR could be improved with respect to the minimum repeat length it can find, and extended to find approximate repeats and to handle eukaryotes as well.
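The read-filtering idea in the abstract (using k-mer frequencies to flag reads that contain repetitive sequence) can be illustrated with a minimal Python sketch. This is not RFGR's implementation: the function names, the parameters `k`, `min_count`, and `max_fraction`, and the rule "remove a read when more than a given fraction of its k-mers are frequent" are all assumptions made for illustration, and reads are plain strings rather than parsed FASTQ records.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def filter_repetitive_reads(reads, k=4, min_count=3, max_fraction=0.5):
    """Split reads into (kept, removed).

    Assumed rule (hypothetical, not taken from RFGR): a k-mer is
    'repetitive' if it occurs at least min_count times in the dataset,
    and a read is removed when more than max_fraction of its k-mers
    are repetitive.
    """
    counts = count_kmers(reads, k)
    kept, removed = [], []
    for read in reads:
        total = len(read) - k + 1
        if total <= 0:
            # Read shorter than k: nothing to judge, so keep it.
            kept.append(read)
            continue
        repetitive = sum(1 for km in kmers(read, k) if counts[km] >= min_count)
        if repetitive / total > max_fraction:
            removed.append(read)
        else:
            kept.append(read)
    return kept, removed


reads = ["ACACACACAC", "ACACACACGT", "TTGCAAGGCT"]
kept, removed = filter_repetitive_reads(reads, k=4, min_count=3, max_fraction=0.5)
```

In this toy input, the first two reads are dominated by the k-mers ACAC and CACA and are removed, while the third read is kept. On real sequencing data every genomic k-mer is seen many times because of coverage, so a fixed `min_count` would presumably have to be set relative to the expected coverage depth; how RFGR chooses its threshold is not stated in the abstract.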
Code Availability
RFGR, the tool developed and introduced in this paper, is publicly available at https://github.com/s-rashmi/rfgr
Funding
RS and KS acknowledge the support of the Kerala State Council for Science, Technology and Environment (KSCSTE) in providing the research fellowship. RS, KS, and ASN acknowledge the SIUCEB support at the Department of Computational Biology and Bioinformatics, University of Kerala, in providing the facilities necessary to carry out the work.
Author information
Contributions
RS: Conceptualization, Methodology, Software, Writing—original draft. KS: Methodology, Writing—review & editing. ASN: Project administration, Supervision, Writing—review & editing.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Ethical Approval and Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
About this article
Cite this article
Sukumaran, R., Shahina, K. & Nair, A.S. RFGR: Repeat Finder for Complete and Assembled Whole Genomes and NGS Reads. Biochem Genet (2024). https://doi.org/10.1007/s10528-023-10628-x
DOI: https://doi.org/10.1007/s10528-023-10628-x