Background & Summary

Cynomolgus macaque (or crab-eating macaque, Macaca fascicularis) and common marmoset (Callithrix jacchus), belonging to old world monkey and new world monkey respectively, have been widely used in human biomedical research and drug developments with expectations that they recapitulate human physiology and pathology1. Their genomes, consisting of 42 and 46 chromosomes in diploids2,3,4, respectively, were assembled initially using first- and second-generation sequencing technologies. Short-read de novo assembly was not able to resolve complex repetitive genomic regions, and the resulting contigs tended to remain fragmentary. Techniques such as mate-pair sequencing were commonly used to join the contigs into longer scaffolds, albeit with sequence gaps in between. Long-standing non-human primate (NHP) genome assemblies released earlier used the human genome for ordering and orienting the assemblies into chromosomes, which prevents the observation of intrinsic structural differences between the primate genomes5. For example, a large inversion of around 20 Mb was observed in chromosome 16 of the earlier marmoset genome assembly, which should have been the result of the ‘humanization’ bias6.

Recent technological advancements allow us to obtain chromosome-scale assemblies without relying on existing genome assemblies, where such errors or bias can be avoided. Single-molecule long-read sequencing (Pacific Biosciences [PacBio] and Oxford Nanopore Technologies) have drastically increased the contiguity of assemblies, and chromatin contact profiling with Hi-C and other techniques such as optical mapping have paved the way to reconstructing chromosome-scale sequences. Taking advantage of these recent advancements, the genome sequences of some non-human primates (NHP) including gorilla, orangutan, and chimpanzee were largely improved5,7, followed by the ones for bonobo (Bioproject accession: PRJNA526933), and Northern white-cheeked gibbon (PRJNA369439). High-quality genome assemblies of old world monkeys were also recently reported, such as the ones for Rhesus macaque by three different research groups (PRJNA4764748, PRJNA5094459, and PRJNA51419610), the ones for olive baboon (PRJNA52787411), golden snub-nosed monkey (PRJNA52494912), and Francois’s langur (PRJNA48853013). We had also previously produced pseudo-chromosome assembly by using PacBio long-reads for common marmoset6, where ‘humanized’ sequences were still used as a reference.

In this study, we focus on cynomolgus macaque and common marmoset to establish a solid baseline for human biomedical research. We performed genome sequencing and de novo assembly for both species by using PacBio long-reads, along with Hi-C for chromosome-scale scaffolding through optimized Hi-C data acquisition14. We corrected misjoins of the scaffolds through examination of the Hi-C contact maps. We further investigated misjoins through cross-checking alternate contigs made from PacBio data and corrected them. Lastly, we assessed the quality of the resultant assemblies based on the contiguity of assembled sequences (N50), completeness of conserved protein-coding genes, and consistency to BAC clone sequences. The results indicate that our genome assemblies are of chromosomal-scale contiguity, with nearly complete coverage of gene space, and highly concordant to the conventional genomic resource obtained independently to our data set. Above all, for cynomolgus macaque, our genome assembly achieved the optimal quality in comparison with other resources available for this species.

Methods

Sample preparation, sequencing, and de novo assembly

A cynomolgus monkey was purchased by Shiga University of Medical Science from Shin Nippon Biomedical Laboratories, Ltd through Angkor Primates Center Inc in Kingdom of Cambodia. The identification number is CE1976F in Shiga University of Medical Science and K150090 in Shin Nippon Biomedical Laboratories, Ltd. We followed the Reporting in Vivo Experiments (ARRIVE) guidelines developed by the National Centre for the Replacement, Refinement & Reduction of Animals in Research (NC3Rs). All animal experimental procedures were approved by the Animal Care and Use Committee of Shiga University of Medical Science (approval number: 2017-10-2 (H1)). Genomic DNA (gDNA) was extracted from the kidney of the 6-year-8-month-old female cynomolgus macaque with MagAttract HMW DNA kit (48) [Qiagen 5067–5583; Cat No 67563] according to the manufacturer’s instruction. After measurement of concentration with Qubit3 - Qubit dsDNA (BR) [Thermo Fisher Scientific] and optical density (OD) with nanoDrop 2000 [Thermo Fisher Scientific], ethanol precipitation of the extracted gDNA was performed with DNA Clean & Concentrator-10 (25) [Zymo research] to condense it. Then, concentration and OD of the gDNA were again measured with NanoDrop 2000, Qubit3 - Qubit dsDNA (BR): the amount, 70.6 μg (307 ng/μL × 230 μL); 260/280, 1.87; 260/230, 2.28. Also, the successful extraction of long gDNA was confirmed by electrophoresis with TapeStation ScreenTape GenomicDNA on TapeStation [Agilent]. Using PacBio Sequel II, approximately 81x sequence data of the cynomolgus macaque was sequenced with a read N50 length of 12.05 kbp. For common marmoset, the 43x PacBio RSII read dataset from our earlier sequencing effort (PRJDB8242), was used to reassemble the genome (Fig. 1a; Table 1).

Fig. 1
figure 1

Quality assessment of the experiments. (a) Read length distribution of the common marmoset and cynomolgus macaque genomes, with a vertical line representing the N50 length. (b) Length distribution of the Hi-C DNA analyzed with Agilent TapeStation using the Genomic DNA ScreenTape. (c) Length distribution of the Hi-C library analyzed with Agilent TapeStation using the High Sensitivity D1000 ScreenTape.

Table 1 Overview of the obtained sequences.

To assemble both the NHP genomes, we employed a series of assembly programs (supplementary fig. S1). Based on the performances from the earlier marmoset genome assembly study, we chose the assemblers Flye15, Redbean (wtdbg2)16, SMARTdenovo17, and miniasm18,19. After de novo assembly, all the assemblies were processed by PacBio’s polishing tool, arrow, to eliminate possible sequencing errors. After considering the results from multiple de novo assembly tools, Redbean was chosen as the base assembler, as it produced the most contiguous assemblies for both the common marmoset (contig N50 of 8.04 Mb; Table 2) and the cynomolgus macaque (contig N50 of 5.84 Mb) genomes (Table 3).

Table 2 Properties of the obtained and existing assemblies of common marmoset.
Table 3 Properties of the obtained and existing assemblies of cynomolgus macaque.

Hi-C data acquisition and scaffolding

Hi-C libraries were constructed with the restriction enzyme DpnII following the iconHi-C protocol14, using the skeletal muscle tissues of an adult female marmoset and a cynomolgus macaque that were kept frozen in −80 °C after dissection and snap-freezing in liquid nitrogen. Fixed tissue materials containing 2 μg of DNA were used for the preparation of Hi-C DNA by in situ restriction digestion and ligation. Library preparation was performed using 1 μg of Hi-C DNA with 5 cycles of PCR for the marmoset and 6 cycles of PCR for the cynomolgus macaque. Quality control of the Hi-C DNA and the Hi-C library was performed as described in the iconHi-C protocol14. Quality control of the Hi-C DNA showed an expected pattern of a shift in size – shortening after digestion and elongation after ligation, indicating successful preparation of the Hi-C DNA (Fig. 1b). Quality control of the Hi-C library by restriction digestion confirmed the existence of expected ligation junction sequences inside the library molecules at a high proportion and successful generation of the library (Fig. 1c).

Sequencing to obtain Hi-C reads was performed on an Illumina HiSeq X in paired-ends with 151 cycles. The obtained reads were processed using Trim Galore to remove low-quality regions and adapter sequences. Post-sequencing quality control of the Hi-C libraries was performed as described previously14. Hi-C read statistics obtained by HiC-Pro20, using one million subsampled read pairs from the large scale sequencing data mapped to the PacBio contigs, confirmed high quality of the libraries while showing a high proportion of valid interaction read pairs, a low proportion of invalid ligation products (dangling end pairs), and a low proportion of contiguous restriction fragments (re-ligation pairs) (Supplementary table S1). Approximately 408 and 377 million Hi-C reads for the marmoset and the cynomolgus macaque were mapped to the Redbean contig sequences of the common marmoset and the cynomolgus macaque respectively using Juicer21. With the Juicer output files, Hi-C scaffolding was performed using 3d-dna22. Inversions and misjoins in the assemblies that occurred during the Hi-C scaffolding process were corrected by using Juicebox based on the frequency of Hi-C contacts23.

As a result of the genome assembly workflow involving de novo assembly followed by Hi-C scaffolding (Fig. 2a), the N50 lengths increased from approximately 8 to 135 Mb for the marmoset (Table 2) and from 5.8 to 149 Mb for the cynomolgus macaque (Table 3), and the number of scaffolds longer than 10 Mb presumed to be at the chromosome level became close to the actual number of chromosomes (Fig. 3). For the cynomolgus macaque, two pairs of the scaffold sequences longer than 10 Mb were merged into two pseudo-chromosomes, referred to as chromosomes 2 and 8 later, based on alignment concordance with the 3d-dna scaffold sequences of alternate assemblies, as well as the previous reference sequence.

Fig. 2
figure 2

Computational steps of genome assembly. (a) De novo assembly and Hi-C scaffolding workflow. (b) Misjoin correction and gap-filling using contigs from genome assemblies using alternate assembly tools.

Fig. 3
figure 3

Hi-C contact map on the final assemblies. Hi-C contact maps of the pseudo-chromosome assemblies of (a) common marmoset and (b) cynomolgus macaque.

Misjoin detection and gap-filling

All the alternate assemblies were also independently scaffolded using 3d-dna. These were later aligned against the Hi-C scaffolds from Redbean assembly to observe whether any misjoins were introduced by Hi-C scaffolding. Because the contiguity profiles of the assemblies are different across the employed assemblers, contigs which are broken into two or more sequences in one assembly could have been assembled into a single contiguous piece in another assembler (Supplementary fig. S1). This contiguity information from alternate assemblies was used to identify misjoins in the scaffold sequences from the Redbean assembly (Fig. 2b). When the alignments of the alternate assemblies against Redbean scaffolds were visualized using D-genies dot plots24, occasionally the scaffolds did not exhibit consistency in chromosomal structure among the 3d-dna scaffolds. These indicated the presence of misjoins introduced by Hi-C scaffolding (Supplementary fig. S2).

In case that such discrepancies in contig orders were observed in the alignments, a two-step check was performed, A) whether gaps are present in the Redbean scaffolds, and B) whether at least two of the alternate assemblies are consistent in those regions and discrepant against the Redbean scaffolds. If the above case is true, instead of re-ordering the misjoined scaffolds, we replaced the alignment block with one of the alternate assemblies making it a hybrid assembly. A similar procedure was also performed for gap-filling in regions without misjoins also. When a gap-containing region in Redbean scaffolds aligned to a gap-free region of contigs from alternate assemblies, those gap-containing regions were replaced with the corresponding blocks from alternate assemblies (Fig. 2b). When two or more alternate assemblies can be used to fill the gaps, one of them was randomly chosen to fill the gaps, followed by additional two rounds of polishing using arrow. This gap-filling procedure filled 1,147 gaps of length 573.94 kb in cynomolgus macaque, and 1,061 gaps of length 522.16 kb in common marmoset respectively.

Data Records

The N50 length of the contigs increased from 8.04 Mb to 24.82 Mb, and from 5.84 Mb to 26.27 Mb for the final assemblies of common marmoset and cynomolgus macaque genomes, respectively. The final genome assemblies displayed chromosome-sized sequences, designated here as ‘pseudo-chromosomes’, which exhibited scaffold N50 lengths of 132.27 Mb and 149.88 Mb, for the cynomolgus macaque and the common marmoset genomes, respectively (Table 2; Table 3).

The raw genome sequence data of the cynomolgus macaque are deposited at the DDBJ under the accession, DRA00958425. The Hi-C raw sequence data are deposited under the accessions, DRA00964126, and DRA00998727, for the cynomolgus macaque and the common marmoset genomes, respectively. The assembled genome sequences are deposited under the accessions, BLPH01000001-BLPH0100040028, and BLSI01000001-BLSI0100187229, for the cynomolgus macaque and the common marmoset genomes, respectively.

Technical Validation

Gene space completeness assessment

The web server gVolante30, in which two established pipelines CEGMA31 and BUSCO32 are implemented, was used to assess the sequence length distributions and gene space completeness in a uniform environment. The latter was based on the coverage of one-to-one reference orthologues with the ortholog search pipeline CEGMA and the gene set CVG that is specifically optimized to assess vertebrate genome sequences33. CEGMA identified pre-selected conserved protein-coding genes by using a combination of tblastn, and GeneWise for sequence alignment34,35, and GeneID for gene prediction36, in addition to HMMER3 for further homology search37, which are used to assess the completeness of the genome assemblies. The analysis of gene space completeness revealed a smaller number of missing CVG genes for both species, in comparison with the scores for the genome assemblies released earlier (Table 2; Table 3).

BAC-end alignment

BAC-end read pairs of the common marmoset from our earlier sequencing effort4, were aligned against the constructed assembly using bowtie238. After the misjoin detection and gap-filling procedure, the number of concordantly aligned read pairs increased by a count of 684, while the number of discordantly aligned read pairs decreased by a count of 401 (Table 2).

Comparison of common marmoset assembly against previous assemblies

For the common marmoset, NCBI hosts nine genome assemblies (https://www.ncbi.nlm.nih.gov/assembly/organism/9483/latest/), with cj1700 being the representative reference genome. The assembled genome in this study has a total scaffold length of 2.81 Gb with a contig N50 length of 24.82 Mb, in comparison to the total scaffold length of 2.79 Gb with the contig N50 length of 6.38 Mb from our earlier effort (CJ2019). Our newly assembled contigs have slightly lesser contiguity than the recently submitted assembly cj1700 which has a total scaffold length of 2.9 Gb with a contig N50 length of 26.62 Mb (Fig. 4a). Notably, our assembly is 23 Mb longer than cj1700 after excluding gap regions, and more than half of the sequence gaps in the pseudo-chromosomal sequences of the cj1700 assembly were identified as contiguous regions in our assembly. These results imply that these assemblies are similarly useful and, in some difficult regions, complementary to each other (Supplementary table S2).

Fig. 4
figure 4

Contiguity plots of the existing and obtained assemblies. N(X) plots comparing the contiguity profiles of existing and obtained genome assemblies of (a) common marmoset and (b) cynomolgus macaque. The contig lengths of MacFas_Jun2011 are well below 10 kb and did not fit into the scale of the plotted graph, and hence MacFas_Jun2011 was omitted from the figure.

Comparison of cynomolgus macaque assembly against previous assemblies

NCBI hosts five cynomolgus macaque genome assemblies (https://www.ncbi.nlm.nih.gov/assembly/organism/9541/latest/), with the macFas5 genome assembly being the representative reference genome (Table 3). Three of the assemblies including the representative genome have contig N50 lengths shorter than 100 kbp. In contrast, the recently submitted Macaca_Fascicularis_6.0 assembly had produced a much larger contig N50 length of 21.34 Mb. It has a total scaffold length 2.91 Gb, which is smaller than that of the representative reference genome’s 2.95 Gb. The assembly from our present study has a contig N50 length of 26.27 Mb and a total scaffold length 2.91 Gb. It is 65 Mb shorter than Macaca_Fascularis_6.0 even after excluding gap regions, and the ideal total length of the assembly is not yet conclusive. Notably, our assembly outperformed all the assemblies in terms of contiguity (Fig. 4b) and its coverage of orthologous genes in our assembly achieved the highest ratio, 93%, whereas the one in Macaca_Fascicularis_6.0 was 90% (Table 3). It indicates that the assembly produced here outperformed the rest not only in terms of contiguity, but also in base accuracy.

Comparison against other non-human primate genome assemblies

The contiguity of the two genomes assembled in this study was compared against other non-human primate genomes such as those of Rhesus macaque, Francois langur, chimpanzee, gorilla, orangutan, white-cheeked gibbon, golden snub-nosed monkey, bonobo, and olive baboon. Although there were three different sub-species genome assemblies for Rhesus macaque, we herein considered only the best assembly out of them in terms of contiguity. In comparison to the other non-primate genome assemblies, it was evident that the common marmoset and cynomolgus macaque assemblies produced the second and the third-best contiguities, with the Rhesus macaque genome assembly8, being the best among all in terms of contiguity (Supplementary fig. S3).