Background & Summary

Groupers (Family Epinephelidae, Subfamily Epinephelinae) are prominent marine fishes, mostly distributed in tropical and temperate marine areas, comprising 167 species that belong to 15 genera1. Due to their high protein, low fat, tender meat quality, and good taste, groupers are high-quality economic fish species in Asia2,3. Given the huge commercial interests at stake, groupers are highly susceptible to human-induced impacts, including overfishing, making them considered threatened by the International Union for Conservation of Nature (IUCN)4. Therefore, how to scientifically develop and protect their resources has become the top priority5.

The leopard coral grouper (Plectropomus leopardus) has a beautiful skin color and is a valuable marine fish that commands a higher price6,7,8. Wild populations are suffering sharp declines due to overfishing and the destruction of spawning aggregations9. In recent years, the increasing market demands have promoted the development of artificial breeding in leopard coral grouper10,11,12. A high-quality reference genome resource has become increasingly important to facilitate the genomic breeding program, biological phenomena investigation and germplasm conservation13,14. Although the leopard coral grouper genome has been released6,8,15, the completeness of genome assembly and annotations still need to be further improved. For examples, the reported chromosomal-scale assembly of the sequence contigs only anchored 87.7% of the whole genome sequence using Hi-C technology6. Additionally, a wide range of gene structure annotation errors existed in the previous versions15, or the annotation information is not released and accessible to the public8.

In the present study, we generated an improved high-quality chromosome-level genome assembly of leopard coral grouper using Nanopore long-reads, Illumina short reads, and the Hi-C sequencing data. Approximately 849.74 Mb genome was assembled, consisted of 45 contigs with the contig N50 length of 35.59 Mb. A total of 846.49 Mb (99.6%) of the assembled sequences were anchored to 24 pseudo-chromosomes with low missing bases, only about 2, 354 gaps. Based on this improved genome assembly, we have significantly improved upon previous gene annotations combining de novo prediction, homology-based searches and transcriptome-assisted methods. BUSCO alignment showed that our final assembly contained 4, 469 (97.5%) complete BUSCOs. Taken together, this high-quality reference genome provides a valuable basis for the conservation and utilization of germplasm resources, and the further genetic breeding program in leopard coral grouper.

Methods

De novo genome assembly

First, we estimated the genome size and heterozygosity of leopard coral grouper using GenomeScope v2.016 by k-mer analysis with clean Illumina short data. Program ontbc (https://github.com/FlyPythons/ontbc) was used to filter the Nanopore raw reads with parameters “-min_score 7 -min_length 1000”. Then, the filtered Nanopore reads self-corrected the base errors by the long-read assembler NextDenovo v2.3 (https://github.com/Nextomics/NextDenovo). Finally, clean long reads were assembled using NextDenovo v2.3 (https://github.com/Nextomics/NextDenovo) with the parameters: read_cutoff = 5k’ and ‘seed_cutoff = 40k’. We used purge_dups v1.2.517 to remove the haplotypic duplication after mapping the Nanopore reads with minimap2 v2.118. The assembly sequence was then polished using NextPolish v1.3.119 with default parameters based on Nanopore long reads. To ensure high accuracy of the genome assembly, Illumina paired-end clean reads were aligned to the assembly using BWA v0.7.1520, and the results were used to conduct another round of polishing by Pilon v1.2321 with the parameters:--fix SNPs, indels. The contig-level assembly covered 849.74 Mb of the genome consisted of 45 contigs with a contig N50 value of 35.59 Mb.

Hi-C analysis and chromosome assembly

To obtain the chromosome-level genome, we further anchored all 45 contigs of the draft assembly onto 24 chromosomes using a 3D-DNA pipeline (version 201008)22 based on the published high-quality HiC reads15. The HiC reads were aligned to the polished genome using Juicer v1.5.7 software23 with default parameters. Mis-joins, order and orientation were corrected by the 3D-DNA pipeline22 with the following parameters: -r 2. After the first round of 3D-DNA, we manually adjusted the assembly with Juicebox23 and rerun the 3D-DNA. The Hi-C scaffolding resulted in 24 chromosome-length scaffolds (Fig. 1a).

Fig. 1
figure 1

Statistics on genome assembly and Comparison of four version annotations of the leopard coral grouper, Plectropomus leopardus. (a) Hi-C interaction heat map for Plectropomus leopardus. (b) BUSCO evaluation on the genome assembly completeness. (c) BUSCO evaluation on the predicted gene models.

Repeat annotation

De novo and structure-based searches were used to identify repetitive sequences with both RepeatModeler v224 (http://www.repeatmasker.org/RepeatModeler/) and RepeatMasker v4.0.925 (http://www.repeatmasker.org). Candidate LTR-RTs repetitive sequence library was identified using LTR_finder26 with parameters ‘-D 15000 -d 1000 -L 7000 -l 100 -p 20 -C -M 0.9’ and LTRharvest v1.5.827 with parameters ‘-minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 -motif TGCA -motifmis 1 -similar 85 -vic 10 -seed 20 -seqids yes’. The identified LTR-RT candidates were filtered with LTR_retriever v2.528 program with default parameters. RepeatScout v1.0.529 LTR_retriever v2.528 and RepeatModeler v224 were used to build de novo repeat libraries. The combined repeat library was used as the final library to identify repetitive sequences using RepeatMasker v4.0.925 with parameters ‘-q -no_is -norna -nolow -div 40’.

Gene prediction and annotation

To comprehensively annotate genes, protein-coding genes prediction was undertaken using the BRAKER v2.1.530 annotation pipeline which integrated different evidence, including de novo prediction, homology-based searches and transcriptome-assisted methods. First, for de novo gene prediction, we downloaded published RNA-seq (SRP20194331 and SRP32903132) and then mapped to the soft masked genome using Hi-SAT2 v. 2.1.033. Then, all mapping results were used to build transcript models using BRAKER v2.1.530 and StringTie v2.1.634. BRAKER v2.1.530 was run with Semi-HMM-based Nucleic Acid Parser (SNAP, v2013.11.29)35 and Augustus v3.3.336 which pre-trained using released gene models of P. leopardus6,15. Second, protein-coding sequences of from P. leopardus6,15, E. fuscoguttatus37, E. lanceolatus38, and E. moara39 were aligned to the genome assembly using TBLASTN and GeneWise v2.2.040. Third, Trinity v2.1.141 was used to generate the transcripts. The transcriptome data were further assembled using the PASA pipeline v2.5.242 with BLAT v3543 and GMAP (version 20150921)44 as the aligner. Finally, all evidences were merged to form a consensus gene set using EVidenceModeler v1.1.145. Finally, we identified a total of 25,965 protein-coding genes (Table 2). The noncoding RNA genes including rRNAs, tRNAs, snRNAs and miRNAs were screened using INFERNAL v 1.1.246 and tRNAscan-SE v1.447. Four types of noncoding RNAs, including 746 miRNAs, 1,224 tRNAs, 439 rRNAs and 596 sRNAs, were identified from the P. leopardus genome (Table 3).

In order to explore the function of predicted protein-coding genes in leopard coral grouper, InterPro30, Pfam32, PANTHER 14.1, Superfamily 1.75, Gene3D 4.2.0, SMART 7.1 and TrEMBL32 databases were respectively used to predict protein function based on the conserved protein domains by InterProScan v5.3648. We performed functional annotation by aligning the protein sequences to NCBI nr databases and SwissProt using BLASTP. The result showed more than 99.9% (25,927) of protein-coding genes were annotated (Table 4).

Data Records

The assembled genome has been deposited at GenBank under the accession GCA_026936395.149. Moreover, the whole genome sequence data reported in this paper have been deposited in the Genome Warehouse in National Genomics Data Center50,51, Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, under accession number GWHBPCI00000000 that is publicly accessible at https://ngdc.cncb.ac.cn/gwh/Assembly/29542/show52. In addition, the genome annotation files had been submitted at the figshare53. The Nanopore long reads, Illumina genomic sequencing data and Hi-C data were downloaded from CNGBdb51,54 under the accession CNP000085955. Transcriptomic sequences can be retrieved under the following accession numbers: SRP20194331 and SRP32903132.

Technical Validation

To evaluate the quality of genome assembly, first, we assessed genome continuity with QUAST v5.0.256. Contig N50 (the length such that half of all sequence is in contigs of this size) has achieved a significant improvement to 35.59 Mb, which is much higher than other versions6,8,15 or closely related species (Epinephelus fuscoguttatus, Epinephelus lanceolatus, Epinephelus moara) assembled with long-read sequencing from 0.12 to 13.8 Mb. Meanwhile, in the latest version, there are very few gaps in the genome (2.77 per 100 kbp), which is remarkably less than the previous from 68.31 per 100 kbp to 1793.38 per 100 kbp6,8,15 (Table 1; Fig. 2). Second, Illumina paired-end clean reads and Nanopore long reads were mapped to the final reference genome assembly by using BWA v0.7.1520 and Minimap2 v2.118, respectively. The mapping rate of Illumina and Nanopore reads reached 99.18% and 99.95%. We only detected 6, 900 (0.0008%) conflicting sites in the final assembly, indicating that this is a high level of the complete genome (Fig. 2; Table 5). Finally, we evaluated the completeness of our genome assembly using Benchmarking Universal Single-Copy Orthologs (BUSCO, v3.0)57 with the actinopterygii_odb9 database. The actinopterygii_odb9 database contained 4,584 conserved core genes while our assembled genome contained 4,469 (97.5%) of the expected actinopterygii genes (including 4,393 (95.2%) single and 106 (2.3%) duplicated ones). Obviously, our data had complete gene coverage, and 48 (1.0%) were identified as fragmented, respectively, while 67 (1.5%) were missing in our assembled genome (Fig. 1b). Furthermore, we also used BUSCO to evaluate the completeness of gene annotations57, and only 22 (0.5%) genes were missing in the final annotation version (Fig. 1c) Table 5.

Table 1 Comparison of genome assembly metrics in groupers.
Fig. 2
figure 2

Global genome landscape of the leopard coral grouper, Plectropomus leopardus. From outer to inner circles: Density of genes with 500 kbp windows, ranging from 0 to 70; GC content with 500 kbp windows, ranging from 0.30 to 45; depth of coverage of Nanopore reads with 100 kbp windows, ranging from 20 to 150; depth of coverage of Illumina short reads with 100 kbp windows, ranging from 10 to 35; distribution of heterozygous SNPs with 500 kbp windows, ranging from 0 to 3,420; distribution of homozygous SNPs with 500 kbp windows, ranging from 0 to 3,420.

Table 2 Comparison of the genome-wide statistics for annotations of groupers.
Table 3 The statistics of functional annotation in the leopard coral grouper.
Table 4 The statistics of functional annotation in the leopard coral grouper.
Table 5 The statistics of the leopard coral grouper (MGB_pleo_1.0) SNPs.