RFGR: Repeat Finder for Complete and Assembled Whole Genomes and NGS Reads

Sukumaran, Rashmi; Shahina, K.; Nair, Achuthsankar S.

doi:10.1007/s10528-023-10628-x

Methodology Article
Published: 12 January 2024

Biochemical Genetics (2024)

Rashmi Sukumaran¹, K. Shahina¹ & Achuthsankar S. Nair¹

Abstract

Repetitive DNA sequences cause genomic instability and are important genetic markers. Identifying repeats is a critical step in genome annotation and analysis. At the same time, repeats pose a technical challenge for genome assembly and alignment programs that use NGS data. RFGR is a comprehensive tool that finds exact repetitive sequences in complete genomes, assembled genomes, and NGS reads of prokaryotes. For complete genomes, RFGR uses suffix trees to find seed repeats of repetitive sequences of fixed length with indels. For assembled genomes, RFGR uses a modified Bowtie aligner to find seed repeats of exact repetitive sequences in the contigs/scaffolds, which are then extended to maximal repeats. The repeats are classified, and for repeats near a gene, RFGR also reports the gene. For the control datasets of E. coli UTI89 and E. coli K12, RFGR reports 35,141 and 49,352 repeats, respectively. For NGS reads, RFGR uses the frequency of repetitive k-mers to identify FASTQ reads containing repetitive sequences and removes them from the dataset. Compared with the original dataset, an E. coli K12 NGS dataset pre-processed using RFGR gives an improved assembly: the N50 value improves by 22.86%, and the size of the assembly graph decreases by nearly 50%. Thus, with RFGR, we achieve a better assembly with reduced computation. RFGR can be improved in terms of the minimum repeat length found, by extending it to find approximate repeats, and by making it applicable to eukaryotes as well.
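
The read-filtering idea in the abstract (using k-mer frequency to flag reads that contain repetitive sequence) can be illustrated with a minimal Python sketch. This is not RFGR's implementation. The function names and the parameters `k`, `min_count`, and `max_fraction` are hypothetical choices made for illustration, reads are given as plain sequence strings rather than parsed FASTQ records, and reverse complements are ignored.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every substring of length k in seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def filter_repetitive_reads(reads, k=5, min_count=3, max_fraction=0.5):
    """Split reads into (kept, removed).

    A k-mer is treated as repetitive when it occurs at least min_count
    times in the whole dataset. A read is removed when more than
    max_fraction of its k-mers are repetitive. Both thresholds are
    assumptions for this sketch, not values taken from the paper.
    """
    counts = count_kmers(reads, k)
    kept, removed = [], []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            # Reads shorter than k carry no k-mer evidence; keep them.
            kept.append(read)
            continue
        repetitive = sum(1 for km in read_kmers if counts[km] >= min_count)
        if repetitive / len(read_kmers) > max_fraction:
            removed.append(read)
        else:
            kept.append(read)
    return kept, removed


if __name__ == "__main__":
    reads = ["ACGTACGTACGT", "ACGTACGTACGA", "TTGACCAGGCTA"]
    kept, removed = filter_repetitive_reads(reads, k=4, min_count=3)
    print(kept)     # ['TTGACCAGGCTA']
    print(removed)  # ['ACGTACGTACGT', 'ACGTACGTACGA']
```

In this toy input, the first two reads are built from the same tandem unit, so nearly all of their 4-mers occur three or more times and both reads are removed. The third read shares no 4-mers with the others and is kept. On real data the thresholds would need to be tuned against sequencing depth, since at high coverage even unique k-mers occur many times.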

Code Availability

RFGR, the tool developed and introduced in this paper, is publicly available at https://github.com/s-rashmi/rfgr

Funding

RS and KS acknowledge the support of the Kerala State Council for Science, Technology and Environment (KSCSTE) for providing the research fellowship. RS, KS, and ASN acknowledge the SIUCEB support at the Department of Computational Biology and Bioinformatics, University of Kerala, for providing the necessary facilities to carry out the work.

Author information

Authors and Affiliations

Department of Computational Biology and Bioinformatics, University of Kerala, Karyavattom, Trivandrum, Kerala, India
Rashmi Sukumaran, K. Shahina & Achuthsankar S. Nair

Contributions

RS: Conceptualization, Methodology, Software, Writing—original draft. KS: Methodology, Writing—review & editing. ASN: Project administration, Supervision, Writing—review & editing.

Corresponding author

Correspondence to Rashmi Sukumaran.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Ethical Approval and Consent to Participate

Not applicable.

Consent for Publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 250 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sukumaran, R., Shahina, K. & Nair, A.S. RFGR: Repeat Finder for Complete and Assembled Whole Genomes and NGS Reads. Biochem Genet (2024). https://doi.org/10.1007/s10528-023-10628-x

Received: 14 September 2023
Accepted: 08 December 2023
Published: 12 January 2024
DOI: https://doi.org/10.1007/s10528-023-10628-x
