Keywords
Argane, Argania spinosa, Endemic, Genome, Assembly, Morocco, International Argane Genome Consortium
This article is included in the Genomics and Genetics gateway.
Argania spinosa (L.) Skeels is a tree endemic to south-western Morocco, where it occupies arid and semi-arid regions totaling around 900,000 ha [1]. The argane forest was recognized by UNESCO as a biosphere reserve (Arganeraie Biosphere Reserve) in 1998 [2]. It is the only member of the tropical Sapotaceae family in Morocco [3]. In addition to its ecological role in preventing soil erosion and desertification, the argane tree has great cultural and socio-economic importance. The oil extracted from the seed is considered the most expensive edible oil in the world, with great cosmetic value and therapeutic potential [4–6]. Argane oil is a significant source of dietary fatty acids, while the Argane fruit is used as livestock feed by the local population [7–9]. Phytochemical analysis of Argane fruits reveals several classes of bioactive compounds, including essential oils, fatty acids, triacylglycerols, flavonoids and their acylglycosyl derivatives, monophenols, phenolic acids, cinnamic acids, saponins, triterpenes, phytosterols, ubiquinone, melatonin, new aminophenols, and vitamin E. Argane oil contains high levels of antioxidant compounds. Its long-chain fatty acids are dominated by unsaturated oleic acid, followed by linoleic acid, palmitic acid and stearic acid [10].
The distribution area of the Argane forest decreased drastically during the 18th century, and about 44% of the forest was lost again between 1970 and 2007. While there are multiple causes, desertification and overgrazing are the main pressures on the Argane forest [11,12]. The management and conservation of the remaining genetic resources of the Argane forest are therefore urgent priorities. In recent decades, several studies have evaluated the genetic diversity of the Argane tree using morphological [13,14], chemical [10,15], biochemical [6,16] and standard molecular marker techniques, all with the aim of describing this diversity and addressing ecological and conservation issues [17–25]. The karyotype of A. spinosa comprises ten pairs of chromosomes (2n = 2x = 20) [3]. Until now, no reference genome has been available for A. spinosa. Here, we present the Argane tree genome assembled from short and long DNA reads using a hybrid assembly strategy.
The Argane tree chosen for sequencing (Argania spinosa, taxid 85883, family Sapotaceae, order Ericales), named Argane AMGHAR, was selected for its biological and ecological characteristics (Figure 1). It was a 9-year-old shrub with a weeping form (geotropic, unlike the erect habit) and only one main trunk, 3 m in height. The ripe fruits have a rounded shape, and the plant has semi-evergreen dwarf leaves. The shrub is native to the valley of the Sous plain, which has an arid climate with an average annual rainfall of around 220 mm and lies between the hills of the Anti-Atlas to the south-east, the Western High Atlas to the north-west and the Atlantic Ocean to the west (30°24′00″N, 9°32′00″W; altitude: 126 m).
Genomic DNA was extracted from lyophilized leaf tissues of a single tree (Argane AMGHAR) using the DNeasy Plant Mini Kit (Qiagen, USA). The Argane tree genome was shotgun-sequenced using both PacBio™ (Menlo Park, CA, USA) and Illumina™ (San Diego, CA, USA) sequencing technologies, generating 7.2 Gb and 144 Gb of data, respectively. A gel-based size selection of DNA was performed for fragments ≥ 2 kb. Paired-end libraries with an average insert size of 600 bp were constructed with the Nextera™ DNA Library Prep Kit for Illumina (New England Biolabs™, New Brunswick, MA, USA). These libraries were sequenced on an Illumina HiSeq XTen platform using the PE-150 module and yielded 957,451,810 reads (Table 1). These data were trimmed of adapters and low-quality sequences, yielding a clean set of 936,053,040 reads, representing 160× genome coverage, assuming a genome size of 573 Mb as estimated by the k-mer frequency analysis (described below). Raw reads were deposited at the NCBI Sequence Read Archive (SRA) under accession numbers SRX3207155 and SRX3207156, corresponding to two independent runs from the same plant DNA sample.
In addition, single-molecule long reads from the PacBio RS II platform (Pacific Biosciences, USA) were used to assist the subsequent de novo assembly of the Illumina reads. Genomic sequencing libraries with a size range of 2-15 kb were constructed using the PacBio DNA template preparation kit 2.0 (Pacific Biosciences of California, Inc., Menlo Park, CA) for SMRT sequencing on the PacBio RS II machine, according to the manufacturer's instructions. The constructed libraries were sequenced on six SMRT cells on a PacBio RS II sequencer. The sequences of the six SMRT cell runs were deposited at the NCBI SRA under accession numbers SRX1898029, SRX1898030, SRX1898031, SRX1898032, SRX1898033 and SRX1898034.
The sequencing runs produced about 7.2 Gb, consisting of 6,705,437 reads with an average read length of 2.5 kb and representing about 12× genome coverage, again assuming a genome size of 573 Mb (Table 1).
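The fold-coverage figure quoted for the long reads follows from dividing the sequenced bases by the assumed genome size. The minimal Python sketch below reproduces that arithmetic from the rounded values given in the text; the function and variable names are illustrative and do not come from the article.

```python
def fold_coverage(total_bases: float, genome_size_bp: float) -> float:
    """Return sequencing depth: total sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size_bp


# Rounded values quoted in the text (assumed inputs for this sketch).
GENOME_SIZE_BP = 573e6  # 21-mer based genome size estimate
PACBIO_BASES = 7.2e9    # approximate PacBio yield

pacbio_depth = fold_coverage(PACBIO_BASES, GENOME_SIZE_BP)
print(f"PacBio depth: {pacbio_depth:.1f}x")
```

With these rounded inputs the depth comes out between 12× and 13×, in line with the "about 12×" stated above.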
Quality-filtered reads from the Illumina platform were subjected to k-mer frequency distribution analysis with JELLYFISH v2.1.4 [26,27]. Analysis parameters were set at -k 21 and 25, and the final result was plotted as a frequency graph (Figure 2). Two distinct modes were observed in the distribution curve: the higher peak, at a depth of 44, reflects the high heterozygosity of the Argane genome, while the lower peak, at a depth of 87, provided the peak depth used to estimate the genome size [28]. Based on the total number of k-mers obtained, the Argane genome size was calculated to be approximately 573 Mb and 615 Mb for 21- and 25-mers, respectively, using the following formula: genome size = total number of k-mers / peak depth. The double peak of the k-mer distribution indicates heterozygosity, with a rate estimated at 1.58% (Figure 2). The estimated genome size is plausible when compared with those of four other members of the Sapotaceae family. According to the Plant DNA C-values Database, the genome sizes of these four species range from 273 Mb in Mimusops elengi L. (c = 0.28 pg) to 2,513 Mb in Isonandra villosa L. (c = 2.57 pg). The other two species are Planchonella eerwah (c = 0.54 pg, 528 Mb) and Madhuca longifolia (c = 0.99 pg, 968 Mb).
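The genome size formula and the C-value comparison above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: the input is assumed to be a two-column "depth count" histogram such as the one written by `jellyfish histo`, the `min_depth` cutoff for discarding likely error k-mers is a hypothetical choice, and the factor of 978 Mb per picogram is the standard conversion, which is consistent with the database values quoted in the text.

```python
from typing import Dict, Iterable

PG_TO_MB = 978.0  # standard conversion: 1 pg of DNA is about 978 Mb


def parse_histogram(lines: Iterable[str]) -> Dict[int, int]:
    """Parse 'depth count' pairs (two whitespace-separated columns per line)."""
    histo: Dict[int, int] = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            histo[int(parts[0])] = int(parts[1])
    return histo


def estimate_genome_size(histo: Dict[int, int], peak_depth: int,
                         min_depth: int = 5) -> float:
    """Genome size = total number of k-mers / peak depth.

    Depths below min_depth are skipped as probable sequencing errors;
    this cutoff is an assumption of the sketch, not a value from the article.
    """
    total_kmers = sum(depth * count for depth, count in histo.items()
                      if depth >= min_depth)
    return total_kmers / peak_depth


def cvalue_to_mb(c_pg: float) -> float:
    """Convert a C-value in picograms to a genome size in megabases."""
    return c_pg * PG_TO_MB


# Toy histogram with a heterozygous peak at 44 and a homozygous peak at 87.
toy = parse_histogram(["1 1000", "2 200", "44 87", "87 100"])
toy_size = estimate_genome_size(toy, peak_depth=87)
print(toy_size)
print(int(cvalue_to_mb(0.54)))  # Planchonella eerwah, c = 0.54 pg
```

Passing the homozygous peak depth explicitly, rather than detecting it automatically, mirrors the choice described in the text, where the peak at depth 87 is the one used for the estimate.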
Prior to assembly, Illumina and PacBio raw reads were trimmed for quality and adaptor removal using bbduk.sh from the BBMap suite (https://github.com/BioInfoTools/BBMap). Short and long reads were assembled following a hybrid approach using the MaSuRCA assembler v3.2.2 [29]. The initial assembly consisted of 671,690,540 bp in 82,183 contigs, with the largest contig being 422,848 bp and an N50 of 43,654 bp. The very few contigs (8) shorter than 200 bp were filtered out, and the remaining contigs were scaffolded into 75,327 scaffolds totaling 670,096,797 bp; the N50 reached 49,916 bp and the assembly contained 2,982,868 Ns, or 445.14 Ns per 100 kb (Table 2). Scaffolding was performed on the initial contigs by the MaSuRCA v3.2.4 assembler script using Celera Assembler v8.3. The GC content was estimated to be 33%. The assembly was screened with VecScreen to detect and remove remaining vector contamination. Based on the VecScreen report, contigs containing mitochondrial or chloroplast sequences were also removed. Trimmed paired-end reads were mapped onto the final assembly using CLC Genomics (v11.0, CLCbio, Aarhus, Denmark) with a length fraction of 0.8 and a similarity fraction of 0.9. In total, 94% of the reads mapped to the Argane genome. The 6% of reads that remained unmapped may result from the stringency of the mapping criteria used.
| | Number | Total size (bp) | N50 (bp) | Largest (bp) |
|---|---|---|---|---|
| Contigs | 82,183 | 671,690,540 | 43,654 | 422,848 |
| Scaffolds ≥ 200 bp | 75,327 | 670,096,797 | 49,916 | 422,848 |
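For readers less familiar with the assembly metrics reported here, the following Python sketch shows how an N50 and a count of Ns per 100 kb are conventionally computed. It is a generic illustration using the standard definitions, not the code used by the authors; the function names and the toy length list are hypothetical.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


def ns_per_100kb(n_count: int, total_bp: int) -> float:
    """Number of ambiguous bases (N) per 100 kb of assembled sequence."""
    return n_count / total_bp * 100_000


# Toy example: total is 30 bases, so half is 15; 10 + 8 = 18 reaches it.
toy_lengths = [2, 10, 4, 8, 6]
print(n50(toy_lengths))

# Scaffold-level figures quoted in the text.
print(round(ns_per_100kb(2_982_868, 670_096_797), 2))
```

In practice the length list would be read from the assembly FASTA file, one entry per contig or scaffold.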
The difference between the estimated genome size and the assembly size may be due to the use of parameters that exclude extremely high-frequency k-mers. Such k-mers often represent organelle sequences or possible contaminants that would inflate the genome size estimate [30], or the highly repetitive regions found in plant genomes. Furthermore, the genome is highly heterozygous, and divergent allelic regions assembled separately would inflate the assembly size. To assess the completeness of the final assembly, Benchmarking Universal Single-Copy Orthologs (BUSCO) v3 was run with Arabidopsis lineage-specific orthologous groups [31]; the assembly contained 89% (1271 genes) complete and 4.3% (62 genes) partial sequences of the Arabidopsis orthologs.
This draft genome assembly is a first step towards a global and integrative omics strategy for exhaustive characterization of the Argane tree. In particular, future work will focus on structurally annotating the genome using predictive tools and transcriptome analysis. Other future work will focus on functional gene annotation, finding evidence for genome duplication and comparative genome evolution. A reliable annotation is highly dependent on transcriptomic research, and sequencing and analysis of the Argane transcriptome from different parts and developmental stages of the plant are ongoing. The metabolome, Argane oil biosynthesis and the tree's microbiome should also be analyzed. To this end, and in order to coordinate the strong interest of the plant genomics community in this precious tree, the International Argane Genome Consortium (IAGC) and a resource website have been created (www.arganome.org).
All of the A. spinosa datasets can be retrieved under BioProject accession number PRJNA294096: http://identifiers.org/bioproject:PRJNA294096. The raw reads are available at the NCBI Sequence Read Archive under accession number SRP077839: http://identifiers.org/insdc.sra:SRP077839. The complete genome sequence assembly project has been deposited at GenBank under accession number QLOD00000000: http://identifiers.org/ncbigi/GI:1408199612. Data can also be retrieved via the International Argane Genome Consortium (IAGC) website: http://www.arganome.org.
Slimane Khayi and Nour Elhouda Azza are co-first authors; Rachid Mentag and Hassan Ghazal contributed equally as supervisors.
This work was supported by the Iridian Genome Foundation (MD, USA). H.G. is supported by a grant from the NIH (MD, USA) for H3ABioNet/H3Africa (grant numbers U41HG006941 and U24 HG006941-06). O.B. and B.C. are Fulbright JSD (USA) grant recipients. This work also benefited from the support of the Midterm Research Program of INRA-Morocco through the use of its bioinformatics platform.
Thanks are due to the Fulbright Program for supporting Morocco to US exchange PhD students. We would like to thank Lieven Sterck for a critical reading of the manuscript.
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Not applicable
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Whole genome shotgun assembly, pan-genome analysis
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Invited reviewers: 2 (reports available for both versions)
Version 2 (revision): 04 May 20
Version 1: 17 Aug 18