2020_Sample accessions retreived from NCBI GenBank.xlsx (78.58 kB)

Sample accessions of the rbcL and matK markers retrieved from NCBI GenBank for comparative analysis

journal contribution
posted on 2020-04-24, 11:33 authored by Rahul Jamdade
Abstract

Arabia is the largest peninsula in the world, with >3000 species of vascular plants. However, little effort has been made to catalog these species digitally. DNA barcodes from plants of Arabia (1502 sequences) belonging to the rbcL and matK markers, deposited during the decade 2009-2019, were mined and retrieved from NCBI GenBank. These plant barcodes were analyzed using unsupervised OTU picking methods (TaxonDNA and ABGD) with the alignment (AL) technique, and using supervised machine learning (SML) methods with both the alignment (AL) and alignment-free (AF) techniques. The SML-based AF approach demonstrated the highest accuracy for the rbcL dataset. In contrast, the AL approach was competent enough for the matK dataset. In addition, the AL approach is suitable for recognizing true species in both datasets. However, a lower rate of species identification was seen, which might be due to the variable length of the matK sequences along with missing data or gaps; the matK dataset also has fewer sequences and a higher number of singletons. As demonstrated in the manuscript, our datasets lack sufficient species membership; generating sufficient DNA barcodes could improve the rate of identification.

Methods

Plant DNA sequences from the Arabian Peninsula, published between 2009 and 2019 and belonging to the two barcode regions rbcL and matK, were extracted from NCBI GenBank. An initial filtering criterion was applied at retrieval: sequences longer than 400 bp were considered standard barcodes. A total of 1502 sequences were retrieved and pre-processed to prepare two datasets, 'Alignment' (AL) and 'Alignment-Free' (AF).
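The length criterion described above can be illustrated with a minimal Python sketch. This is not the authors' retrieval script; the `Record` class, the function names and the `EX…` accessions are hypothetical placeholders, and only the 400 bp threshold comes from the text.

```python
from dataclasses import dataclass

# Threshold stated in the Methods text: more than 400 bp counts as a standard barcode.
MIN_BARCODE_LENGTH = 400


@dataclass
class Record:
    accession: str  # hypothetical placeholder ID, not a real GenBank accession
    marker: str     # "rbcL" or "matK"
    sequence: str


def is_standard_barcode(record, min_length=MIN_BARCODE_LENGTH):
    """Return True if the sequence is strictly longer than min_length bp."""
    return len(record.sequence) > min_length


def filter_barcodes(records, min_length=MIN_BARCODE_LENGTH):
    """Keep only the records that satisfy the length criterion."""
    return [r for r in records if is_standard_barcode(r, min_length)]


# Made-up records for illustration only.
records = [
    Record("EX000001", "rbcL", "A" * 550),
    Record("EX000002", "matK", "C" * 400),  # exactly 400 bp, so not "more than 400"
    Record("EX000003", "matK", "G" * 820),
]
kept = filter_barcodes(records)
print([r.accession for r in kept])
```

The comparison is strict because the text says "more than 400 bp"; whether a sequence of exactly 400 bp was kept in practice is not stated in the description.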

The alignment dataset was prepared by obtaining the best possible alignment in Geneious Prime v11.0.3. Sequences that failed to align were eliminated and the alignment was trimmed, giving a maximum sequence length of 582 base pairs (bp) for rbcL and 837 bp for matK. The sequences were then re-aligned to achieve the highest possible pairwise identity (rbcL = 90%, matK = 80.6%). The alignment-free datasets were prepared from the NCBI GenBank accessions that represent the AL datasets.
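As a rough illustration of what a pairwise identity figure measures, the sketch below averages the percent identity over all pairs of sequences in a toy alignment. It assumes one common definition (identical columns divided by compared columns, skipping columns where both sequences have a gap); Geneious Prime may compute its statistic differently, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations

GAP = "-"


def pairwise_identity(a, b):
    """Percent of identical columns between two aligned sequences.

    Columns where both sequences carry a gap are skipped; a gap opposite
    a base counts as a mismatch. This is an assumed definition.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == GAP and y == GAP:
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def mean_pairwise_identity(alignment):
    """Average pairwise_identity over every unordered pair of sequences."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment, not data from the deposited spreadsheet.
toy_alignment = ["ATGC-A", "ATGCTA", "ATGA-A"]
print(round(mean_pairwise_identity(toy_alignment), 2))
```

Under this definition, removing poorly aligned sequences or trimming ragged ends raises the average, which is consistent with the re-alignment step described above.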

The final dataset consisted of 1118 and 227 sequences, representing rbcL and matK, respectively. Because the efficiency of the analytical method depends on how good the alignment is, both alignment-based and alignment-free methods were adopted for our analysis.

Usage Notes

The accession numbers used for the analysis are listed along with location (country), gene type, size in base pairs (bp), date of release, year of release, and whether the sample passed or failed pre-processing.

Abbreviations: SA: Saudi Arabia, JO: Jordan, IR: Iraq, KU: Kuwait, BA: Bahrain, QA: Qatar, UAE: United Arab Emirates, OM: Oman, YE: Yemen
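A short Python sketch of how the country abbreviations might be used when summarizing the accession list. The abbreviation map is taken from the line above; the row layout, the key names (`accession`, `country`, `gene`, `size_bp`, `status`) and the example rows are assumptions for illustration and may not match the actual column headers in the spreadsheet.

```python
from collections import Counter

# Abbreviations as given in the Usage Notes.
COUNTRY_NAMES = {
    "SA": "Saudi Arabia",
    "JO": "Jordan",
    "IR": "Iraq",
    "KU": "Kuwait",
    "BA": "Bahrain",
    "QA": "Qatar",
    "UAE": "United Arab Emirates",
    "OM": "Oman",
    "YE": "Yemen",
}

# Hypothetical rows; accessions and values are made up for illustration.
rows = [
    {"accession": "EX000001", "country": "SA", "gene": "rbcL", "size_bp": 552, "status": "pass"},
    {"accession": "EX000002", "country": "SA", "gene": "matK", "size_bp": 410, "status": "fail"},
    {"accession": "EX000003", "country": "UAE", "gene": "rbcL", "size_bp": 580, "status": "pass"},
    {"accession": "EX000004", "country": "OM", "gene": "matK", "size_bp": 830, "status": "pass"},
]


def count_by_country(rows, status=None):
    """Count rows per full country name, optionally restricted to one status."""
    counts = Counter()
    for row in rows:
        if status is None or row["status"] == status:
            counts[COUNTRY_NAMES[row["country"]]] += 1
    return counts


passed = count_by_country(rows, status="pass")
for country, n in sorted(passed.items()):
    print(f"{country}: {n}")
```

To work with the real file, the rows would first be read from the spreadsheet and the key names adjusted to its headers.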
