Introduction

Nonsyndromic oral clefting (NSOC) is a spectrum of isolated prenatal developmental disorders with complex aetiologies, comprising genetic and environment factors along with possible interactions within and between, consistent with it being a complex genetic disease [1, 2].

Among congenital disorders, NSOC is one of the most prevalent worldwide with a prevalence of ~1.7/1000 births [3]. This relatively-high prevalence along with its clinical, economical and psychological effects on affected, has led to a plethora of studies trying to decipher the underlying aetiology of these disorders given that the genetic and environmental factors are largely undetermined. Many investigations, which started by the pioneer work of Ardinger et al [4]. roughly three decades ago, have been focused on identifying the genetic factors involved in NSOC. Hitherto, multiple genes have been associated with different types of NSOC, particularly with non-syndromic cleft lip with or without palate (NSCL/P) through candidate gene, GWA and whole exome-based studies [5]. Nevertheless, only a few genes that have been confirmed with the most prominent being IRF6, a gene confirmed to be strongly associated with NSOC [6, 7]. Nevertheless, even this finding has not been replicated in some populations [8, 9]. More astonishing is non-replication of studies within a single population. For example, we recently showed that IRF6 is associated with NSCL/P in Iran [10], however, a previous study on the same population suggested otherwise [11]. These observations suggest that substantial heterogeneity exists in the genetics of this complex disease and seems to be unearthed at all levels including aetiological, locus and allelic heterogeneity [10].

In addition, geneā€“gene interactions have been reported for genes implicated in NSOC. These interactions, which have been shown either statistically [12, 13] or based on functional assays [6, 14, 15], suggest that the inter-relationship of NSOC genes is a major contributory factor of NSOC phenotypic variation. This level of complexity in the genetics of NSOC suggests that genetic analyses based on single genes is no longer effective in understanding the aetiology of these disorders since geneā€“gene interactions which are central to complex traits [16] are not addressed and considering the significant missing heritability [17], a move must be made towards identifying the underlying ā€˜networkā€™ as proposed by Kousa and Schutte [18]. This systematic approach has been previously applied to a number of complex phenotypes from autism spectrum disorders [19] to severe spermatogenic failure [20]. For NSOC, however, studies have been limited to either single gene-centric analysis [21] of IRF6 or gene ontology (GO) enrichment analysis based on NSOC-associated genes [22] or those common between orofacial clefting and neural tube defects [23]. No attempt has therefore been made to expand the gene regulatory network based on multi-omics integrated systems analysis to identify causative genes explaining the missing heritability and in turn deliver more effective diagnosis in the form of targeted gene sequencing panels. We therefore took a systems genetics approach similar to that of Ansari-Pour et al. [20] by integrating physical proteinā€“protein and regulatory interactome data (with NSOC-implicated genes as network ā€˜seedsā€™) with whole-transcriptome data to not only generate a tissue-specific network explicating the underlying genetic heterogeneity of NSOC, but to also identify genes, modules and cellular functions likely to be key drivers in biological processes perturbed in this complex disease.

Methods

NSOC gene set extraction

Genes implicated in nonsyndromic oral clefts (NSOC) at the genomic level were manually curated by conducting a systematic literature review by using the keyword ā€œnonsyndromicā€ with ā€œoral cleftā€, ā€œcleft lip with or without palateā€, ā€œcleft lipā€, ā€œcleft palateā€ and ā€œcleft lip/palateā€ and ā€œcleft lip and palateā€ plus ā€œgeneā€, ā€œgeneticsā€, ā€œlocusā€, ā€œmutationā€, ā€œpolymorphismā€, ā€œvariantā€, ā€œhaplotypeā€ and ā€œGWASā€ in PubMed and Google Scholar. To select seed genes at GWA loci, the gene closest to the top signal was reported. Manually curated GWA-based genes were then compared with entries in the GWAS Catalog (https://www.ebi.ac.uk/gwas, accessed on 18/12/2017) [24] to ensure completeness and consistency. Interestingly, of the twenty GWA-based genes, thirteen had been previously associated with NSOC (see Supplementary TableĀ S1). For gene-wide association studies, three quality control criteria were applied, namely a sample size larger than 200, significance remaining after correction for multiple testing and, under a case-control setting, the control group not significantly deviating from Hardy-Weinberg equilibrium expectations at the associated SNP [25]. For genes identified by SNP association (both gene-wide and genome-wide), only those with at least one independent replication were retained. The final gene list fulfilling the above criteria tallied 43 (see Supplementary TableĀ S1).

Microarray whole-transcriptome data

The Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) database was searched for whole-transcriptome microarray datasets on NSOC. Surprisingly, only one dataset based on NSCL/P samples was available (GEO accession number GSE42589) [26] which comprised seven NSCL/P and six control dental pulp stem cell samples. The dataset was normalised with RMA [27] and expression values were obtained in Log2 format. Genes were considered not expressed if their median expression level in the control group was below the minimum of median expression levels observed among the seed genes and were removed from the dataset. Given that only one dataset was available, we used a combination of four different well-known statistical methods to identify differentially expressed genes (DEGs) with a higher confidence. The independent t-test and the Mannā€“Whitney U-test (both with multiple testing correction by permutation) were undertaken independently. Furthermore, we used the Significance Analysis of Microarrays (SAM) method [28] and a geometrical multivariate approach, namely the Characteristic Direction [29] to also select DEGs. Finally, based on the four independent lists of DEGs obtained, we employed the Robust Rank Aggregation (RRA) method [30] to acquire a single ranked DEG list. The RRA method uses a probabilistic model for aggregation that is robust to noise and also facilitates the calculation of significance probabilities for all the genes in the final ranking. Given that based on conventional statistical analyses in DEG identification, a large number of genes (sometimes in thousands) show significant expression change [31], we limited the DEGs to the top 500 as a conservative measure. This set was then filtered based on expression fold-change, calculated as 2^(mean expression value in disease state/mean expression value in control state), and only those that satisfied a fold-change cut-off of 1.7 were retained. All of the above analyses were implemented within the R environment for statistical computing (http://www.r-project.org/).

Global interactome data

The global proteinā€“protein interaction (PPI) network used was constructed based only on physical interactions reported in BioGRID (limited to human interactions only) [32] and HPRD [33] to form the global interactome in this study. This merged interactome consisted of 16,629 nodes and 22,1831 edges, all of which are based on experimental evidence.

TREEDUST regulatory network generation

We converted all transcription factor-target interactions reported in the TRED database [34] into text format and then merged with those interactions reported in the TRRUST database [35] to generate the comprehensive TREEDUST human regulatory interaction network (meta-dataset available upon request). The generated network was used to identify downstream targets of TFs identified in the PPI network. In addition, wherever a target was also a TF and differentially expressed, its DEG targets were also added to the network. Next, we conducted a literature search to specifically mine functionally validated regulatory interactions among the curated seeds potentially missing from TREEDUST. These pairwise interactions were combined with the physical PPI and TREEDUST interactions for network visualisation, however, given the bias in reported interactions in the literature, these interactions were excluded from any topological analysis.

Network reconstruction and topological analysis

We used the approach of Ansari-Pour et al. [20] to identify the local network of the seed nodes and their interactions with DEGs, which were identified from the whole-trancriptome microarray dataset. This approach in network reconstruction is based on the ā€˜local hypothesisā€™ of disease aetiology [36]. Networks were visualised in Gephi 0.9.1 (http://gephi.github.io/) [37]. In a given network, nodes and edges are representative of proteins and pairwise interactions, respectively. Directed and undirected edges represent regulatory and physical interactions respectively. The NSOC network component was clustered using the fast unfolding clustering algorithm [38] to identify modules within the network. For each node, two critical topological parameters were calculated in Cytoscape [39]. First, degree, the number of interactions, was calculated. Next, betweenness centrality, a measure expressing the importance of a node in intermodular crosstalk, was computed. Modules were named based on the representative node, which displayed the highest betweenness centrality. Shortest path (SP), an algorithm in identifying the shortest distance between any two nodes in a given network [40], was also implemented to compute the distance of all one-to-one interactions among all seeds and DEGs, thus examining the level of functional association between the DEG and seed gene sets.

Functional enrichment analysis

The NSOC network and its constituent modules were analysed for functional enrichment of gene ontology and pathway terms by using PANTHER [41] and based on a custom reference gene set comprising the unique set of genes with reported interactions in either the merged interactome or TREEDUST. Enrichment significance values were corrected for multiple testing by the FDR method and terms covered with over 10% of NSOC genes were retained to exclude terms overlapping scantily with the NSOC network. To potentially identify more unique terms associated with the functional units of the reconstructed network, enrichment analysis was also undertaken for each module of the network independently and terms enriched solely in each module (i.e. not enriched in the overall analysis) were sought.

Results

NSOC network reconstruction

Initially, we scoured the merged PPI interactome to identify interactions among the genes implicated in NSOC (i.e. seeds). Of the total 43 seeds, 40 were present in the merged PPI interactome, of which 19 interacted physically with at least one other seed. Interestingly, 15 (79%) of these formed a single network component.

To expand the NSOC network based on dys-regulated genes, we identified DEGs from the whole-transcriptome microarray dataset, tallying to a total of 445 genes. Of these 43 were not mapped to any protein in the PPI interactome and none were among the seeds To construct the global NSOC network, we sought to identify the interactions between seeds and DEGs using SP. Interestingly, SP ranged from one to six for all pairwise interactions with minimum SP ranging one to three between any DEG and the seed set. More specifically, of the 402 DEGs in the PPI interactome, 57 directly interacted with the seed gene set, 318 were two-step neighbours, and the remaining 27 were three-step neighbours (mean SPā€‰=ā€‰1.86). We then assessed whether this level of proximity is a non-random observation in the human PPI, we generated a null distribution of ā€˜mean SPā€™ based on 10,000 random 402-node gene sets sampled from the global set of expressed genes. The mean of ā€˜mean SPā€™ observed between the seeds and the random sets was 21.87 and the propinquity of seed-DEG interactivity was highly significant (Pā€‰<ā€‰0.0001). To focus on interactions with more significant functional relevance, we further examined the interactions of directly interacting DEGs. For this, all three-node (V shape) and four-node (U shape) sub-graphs (see Ansari-Pour et al. [20]) of seeds were systematically identified in the global interactome. Of the 57 one-step DEG neighbours, 38 were among those forming V and U subgraphs with the seeds, suggesting their strong bilateral functional associations. We then merged the seeds with the 38 shared DEGs, as genes with the highest interactivity with seeds, to undertake network analysis. After merging, an extra three seeds were connected to the initial seed component through the DEGs alone, resulting in the presence of 22 interconnected seeds (see Fig.Ā 1a for all physical interactions). We then calculated the number of interactions of each shared DEG with the seed gene set not to only identify the level of connection heterogeneity among the DEGs, but to also identify highly interacting DEGs. The mean number of connections was 1.45ā€‰Ā±ā€‰0.86 however, the range of these interactions was 1ā€“5, displaying considerable connection heterogeneity. Subsequently, one hub-like DEG was found with 5 interactions, namely BRCA1. Next, to expand the NSOC network by integrating regulatory interactions resulting in expression dysregulation, the TREEDUST dataset was used to identify TFs in the NSOC network. Of the 11 TF seeds, five had at least one regulatory interaction with a DEG, tallying 15 interactions. These five TF seeds directly targeted 13 unique DEGs, of which one (BRCA1) was also a TF and regulated two downstream DEGs (see Fig.Ā 1b). Interestingly, of these target DEGs, four overlapped with the 38 shared DEGs of which the only one with downstream regulatory interactions was BRCA1. DLX4, one of the five TF seeds, which regulates BRCA1 but has no reported physical interaction was also added to the network as a seed (see Fig.Ā 1b). The NSOC network thus comprised 23 seeds, 38 shared DEGs and 10 DEGs unique to regulatory interactions, and a total of 139 pairwise interactions. Moreover, we identified a total of six well-defined interactions among the seeds which were absent in the merged PPI interactome and TREEDUST (i.e. green edges in Fig.Ā 1b; Supplementary TableĀ S2). Although these interactions represented only 4% of the network edges, they were removed when calculating topological parameters due to the inherent overrepresentation bias of reported interactions in the literature.

Fig. 1
figure 1

Graphical representation of a physical protein-protein and b regulatory interactions in the local NSOC network. Node colours represent the combination of the source and regulatory attributes of a given gene included in the network. Node size is proportionate to degree. TF transcription factor, PPI physical proteinā€“protein interaction, REG TREEDUST regulatory interaction, LIT literature-based regulatory interaction. All pairwise interactions in this network are given in Supplementary TableĀ S2

Network analysis

All nodes in the NSOC network were analysed topologically based on their degree and betweenness centrality. Among all nodes, MYC, CREBBP, BRCA1, and TFAP2A not only had the highest betweenness centrality in the network (>0.20), but they were also had high degrees (ranges of 8ā€“27), thus being crucial hubs in the network (see Supplementary TableĀ S3). We then analysed which of these were cut-vertices to identify which nodes in the network solely interact with these highly central nodes. Interestingly, BRCA1, MYC, and TFAP2A were cut-vertices, thus further emphasising the importance of the topological position of these central genes. A total of 10 genes were disconnected by removing the cut-vertices with each cut-vertex disconnecting 2ā€“4 genes from the network (TableĀ 1; Supplementary FiguresĀ S1-S3).

Table 1 Details of the three hub cut-vertices in the NSOC network and their dependent private neighbours

Next, to identify potential functional units in the network, we clustered the network based on all edges and identified eight clusters within the NSOC network (see Fig.Ā 2). The number of nodes per module ranged from 3 to 19.

Fig. 2
figure 2

Graphical representation of the nine modules within the NSOC network. Modularity class and betweenness centrality can be distinguished by node colour and node size respectively. The node with the highest betweenness centrality in each module is brought to the centre of the network. Edges (interactions) are coloured based on the combination of the colours of any two nodes

Gene enrichment analysis

Based on the NSOC network gene set (Supplementary TableĀ S4), we identified 319 enriched terms (see Supplementary TableĀ S5). The top terms with the highest fold enrichment were all biological process terms associated with either embryonic limb/appendage morphogeneis or with DNA replication. Consistently, the top enriched pathways were Hedgehog signalling and DNA replication. Enrichment analysis based on individual network modules did not result in the enrichment of any further terms.

Discussion

Isolated NSOC is not only the most common nonsyndromic congenital deformity in humans, but it is also a highly heterogeneous disorder with a complex aetiology. To address this complexity, we undertook a systems genetics approach by conducting network analysis of integrated multi-omics data. Given that a gene is generally either functionally disrupted at the protein level or dysregulated at the transcript level, NSOC-implicated genes identified by variant screening and/or genetic association studies were combined with differentially expressed genes to reconstruct the local network of NSOC (comprising 101 nodes and 9 modules). The NSOC network can explain the likely functional relationships of 23 of the 43 curated NSOC-implicated genes by being unified in a highly connected dense network, many of which may seem as disparate entities associated with the same disorder. Interestingly, all but one connected via a physical interaction, thus emphasising the importance of integrating physical interactome data in multi-omics studies.

A key issue in whole-transcriptome studies is that the most significantly differentially expressed genes identified either do not overlap or have no plausible association with confirmed genes in a particular disease. Also, in this specific case, the microarray transcriptome dataset used was based on dental pulp stem cells [26], which do not accurately represent the NSOC-affected tissue. Given the difficulty in accurate sampling of the developing palatal shelves, this remains the only published dataset associated with NSOC. However, to test the relevance of this dataset for this analysis, we examined the level of interaction between DEGs and the seed gene set using SP. Although any given DEG could have had an SP value of infinity (i.e. impossible interaction given that the global interactome comprises 15 disconnnected super-components) to the seed gene set, we observed a highly significant non-random co-locality of DEGs and seeds in the network (SP ranging from 1 to 3, Pā€‰<ā€‰0.0001). According to the ā€˜localā€™ hypothesis in network medicine [36], this robust functional association of DEGs make them functionally relevant to NSOC and therefore the dataset that they emanate from may be deemed functionally relevant. Moreover, this suggests that a computational systems approach is superior in identifying the most functionally relevant DEGs when compared with the most common status-quo approach of conducting GO analysis on the overall DEG list, since one would expect a ā€˜molecularā€™ association among perturbed genes. It should be noted that, based on the FC range, down-regulation of implicated DEGs is relatively low in magnitude (max of fivefold down-regulation). This may be due to the low sample size in the whole-transcriptome experiment. However, a recent report showed that low expression changes of key genes significantly affect facial morphogenesis [14]. Thus, it is likely that the observed FC range may still have a significant impact on NSOC development.

Identification of bona fide candidate genes

We used the NSOC network as a framework to rank DEGs based on their topological features. ANGTPL4, BRCA1 and PCNA were the DEGs with the most favourable topological features in the network. For instance, ANGPTL4 which has been implicated in disrupting endothelial cellā€“cell adhesion [42] is not only the second most dysregulated gene in the whole-transcriptome, but it also interacts directly with MYC, thus making it a bona fide candidate gene in NSOC development. More interestingly, another top DEG, BRCA1 (the highly penetrant breast cancer predisposition gene), not only interacts physically with five seeds, but it is also regulated directly by one seed (see Supplementary FigureĀ S1), thus displaying a strong functional association with NSOC. This, however, does not necessarily imply that BRCA1 is an upstream aetiological factor. Rather, it is more plausible that variants in its regulatory factors such as that observed in DLX4 may have dysregulated its expression. Although, due to the paucity of exome studies, there is no report of BRCA1 variants in NSOC patients, no study has also shown BRCA1-positive breast cancer patients to have had NSOC in their childhood. Nonetheless, the potential effects it may have on its physical seed interactors can not be ruled out.

The aetiological overlap of NSOC with cancer

A number of epidemiological studies have suggested an association between different types of cancer and subtypes of nonsyndromic NSOC with the most reliable and significant association reported with breast cancer [43]. This suggestion has been further strengthened by recent genetic studies showing an overlap of predisposition genes [44, 45] and GWA signals [46] between cancer and NSOC. The presence of key cancer genes in the network as hub cut-vertices (e.g. BRCA1) and enriched DNA replication and repair GO and KEGG terms, support the aetiological overlap of these concurrent diseases.

Prioritisation of GWAS genes

Among all genes identified through GWAS only (see Supplementary TableĀ S1), we prioritised MYC as the associated gene at 8q24.21. This is due to the highly favourable topological features of this gene in the network than any other flanking gene. Also, given the suggested aetiological overlap of NSOC with cancer and the genome-wide significant association of multiple cancers with this locus [47], this prioritisation seems more palpable. Interestingly, this finding is further supported with a recent functional study showing that MYC regulation by enhancers in the 8q24.21 locus is essential for normal facial morphogenesis [48]. Variant detection studies in large cohorts of patients may validate this finding and explain, to some extent, the missing heritability of syndromic and nonsyndromic forms of orofacial cleft, especially in syndromic patients ā€“ known for their strong genetic predisposition ā€“ that are IRF6-negative.

Limitations

The global physical interactome and the TREEDUST regulatory interactome used here are inherently incomplete and missing interactions are likely. Consistently, of the six interactions identified in the literature among seeds, none were reported in the combined physical interactome even though all nodes were present in both the PPI and TREEDUST datasets. Furthermore, the five interactions that we identified for IRF6 are not reported in either of the two datasets for this well-studied NSOC gene. For instance, one of the best established interactions has been demonstrated between TFAP2A and IRF6 (with rs642961 as the only aetiological SNP at the IRF6 locus present in the upstream AP-2Ī± binding site) [6], however, this interaction was among those missing. This shows that further manual curation of the literature is required to update the current physical/regulatory interactome. Another limitation of multi-omics integrated systems analysis for NSOC is the inaccessibility of tissue samples representing the developing palatal shelves. This may become scarcely available with sporadic research-donated foetal tissue samples. However, the human developmental biology resource (HDBR) [49], which is an ongoing effort in human embryonic and foetal tissue collection for scientific research, is likely to resolve this issue in a more systematic way.

There is some evidence for subtype-specific genes in NSOC [12], thus warranting independent systems analysis of each subtype. However, the number of such genes is currently not sufficient to allow appropriate network analysis. Nevertheless, increase in the identification of subtype-specific genes in the future will create the potential to reconstruct gene networks of specific NSOC subtypes, which are known to be aetiologically distinct [22].

Conclusion

In conclusion, we show that the reconstructed local network of disease aetiology is a solid molecular framework to address missing heritability of complex diseases by identifying potential bona fide candidate genes and prioritise genes in GWA loci based on topological parameters and the level of dysregulation. It has been recently proposed that mouse models, due to their high similarity in gene signatures, may be a useful tool to unravel the underlying heterogeneity of NSOC given that human studies are cumbersome [23]. However, we show that this approach is able to tackle this issue based on human data and importantly, does not require the translational step, which is often not straightforward. A targeted gene sequencing panel custom-designed to include the genes comprising the reconstructed NSOC network would be able to validate the extent of genetic heterogeneity explicated here.