RFGR: Repeat Finder for Complete and Assembled Whole Genomes and NGS Reads

Methodology Article
Biochemical Genetics

Abstract

Repetitive DNA sequences cause genomic instability and are important genetic markers. Identifying repeats is therefore a critical step in genome annotation and analysis. Repeats also pose a technical challenge for genome assembly and alignment programs that use NGS data. RFGR is a comprehensive tool that finds exact repetitive sequences in complete genomes, assembled genomes, and NGS reads of prokaryotes. For complete genomes, RFGR uses a suffix tree to find seed repeats of repetitive sequences of fixed length with indels. For assembled genomes, RFGR uses a modified Bowtie aligner to find seed repeats of exact repetitive sequences in the contigs/scaffolds, which are then extended to maximal repeats. The repeats are classified, and for repeats located near a gene, RFGR also reports the gene. For the control datasets of E. coli UTI89 and E. coli K12, RFGR reports 35,141 and 49,352 repeats, respectively. For NGS reads, RFGR uses the frequency of repetitive k-mers to identify FASTQ reads that contain repetitive sequences and removes them from the dataset. An E. coli K12 NGS dataset pre-processed with RFGR gives an improved assembly compared with the original dataset: the N50 value improves by 22.86%, and the size of the assembly graph decreases by nearly 50%. Thus, RFGR achieves a better assembly with reduced computation. RFGR can be improved in the minimum repeat length it detects, extended to find approximate repeats, and made applicable to eukaryotes.
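
The three ideas named in the abstract (extending an exact seed repeat to a maximal repeat, discarding reads that contain high-frequency k-mers, and scoring an assembly by N50) can be illustrated with a minimal Python sketch. This is not RFGR's implementation: the function names `extend_seed`, `filter_repetitive_reads`, and `n50`, the parameters `k` and `max_count`, and the rule "drop a read if any of its k-mers occurs more than `max_count` times" are all assumptions made for illustration. How RFGR chooses its k-mer length and frequency cut-off is not stated in the abstract.

```python
from collections import Counter
from typing import Iterator, List, Sequence, Tuple


def extend_seed(genome: str, i: int, j: int, length: int) -> Tuple[int, int, int]:
    """Extend an exact seed repeat (copies at i < j) to a maximal exact repeat.

    Illustrative only: returns (start_i, start_j, length) of the extended pair
    and does not check whether the two copies overlap each other.
    """
    # Grow to the left while the bases before both copies agree.
    while i > 0 and genome[i - 1] == genome[j - 1]:
        i -= 1
        j -= 1
        length += 1
    # Grow to the right while the bases after both copies agree.
    while j + length < len(genome) and genome[i + length] == genome[j + length]:
        length += 1
    return i, j, length


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer of seq."""
    return (seq[p:p + k] for p in range(len(seq) - k + 1))


def filter_repetitive_reads(reads: Sequence[str], k: int, max_count: int) -> List[str]:
    """Keep reads in which no k-mer occurs more than max_count times overall.

    The threshold rule is a hypothetical stand-in for RFGR's own criterion.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return [r for r in reads if all(counts[km] <= max_count for km in kmers(r, k))]


def n50(contig_lengths: Sequence[int]) -> int:
    """Length of the contig at which the running sum of lengths, taken from
    longest to shortest, first reaches half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    toy_genome = "TTACGTACGGAACGTACGCC"
    # Seed "GTA" occurs at positions 4 and 13; extension recovers "ACGTACG".
    print(extend_seed(toy_genome, 4, 13, 3))
    print(filter_repetitive_reads(["ACGTAC", "TTTTTT", "GGCATG"], k=3, max_count=2))
    print(n50([2, 3, 4, 5, 6]))
```

On real sequencing data every genomic k-mer appears roughly as often as the coverage depth, so a practical cut-off would have to be set relative to the expected coverage rather than as a small constant as in this toy example.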

Code Availability

RFGR, the tool developed and introduced in this paper, is publicly available at https://github.com/s-rashmi/rfgr

Funding

RS and KS acknowledge the support of the Kerala State Council for Science, Technology and Environment (KSCSTE) in providing the research fellowship. RS, KS, and ASN acknowledge the SIUCEB support at the Department of Computational Biology and Bioinformatics, University of Kerala, which provided the facilities needed to carry out the work.

Author information

Contributions

RS: Conceptualization, Methodology, Software, Writing—original draft. KS: Methodology, Writing—review & editing. ASN: Project administration, Supervision, Writing—review & editing.

Corresponding author

Correspondence to Rashmi Sukumaran.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Ethical Approval and Consent to Participate

Not applicable.

Consent for Publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 250 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sukumaran, R., Shahina, K. & Nair, A.S. RFGR: Repeat Finder for Complete and Assembled Whole Genomes and NGS Reads. Biochem Genet (2024). https://doi.org/10.1007/s10528-023-10628-x

  • DOI: https://doi.org/10.1007/s10528-023-10628-x
