SUMMARY
Numerous disease syndromes are associated with regions of copy number variation (CNV) in the human genome and, in most cases, the pathogenicity of the CNV is thought to be related to altered dosage of the genes contained within the affected segment. However, establishing the contribution of individual genes to the overall pathogenicity of CNV syndromes is difficult and often relies on the identification of potential candidates through manual searches of the literature and online resources. We describe here the development of a computational framework to comprehensively search phenotypic information from model organisms and single-gene human hereditary disorders, and thus speed the interpretation of the complex phenotypes of CNV disorders. There are currently more than 5000 human genes about which nothing is known phenotypically but for which detailed phenotypic information for the mouse and/or zebrafish orthologs is available. Here, we present an ontology-based approach to identify similarities between human disease manifestations and the mutational phenotypes in characterized model organism genes; this approach can therefore be used even in cases where there is little or no information about the function of the human genes. We applied this algorithm to detect candidate genes for 27 recurrent CNV disorders and identified 802 gene-phenotype associations, approximately half of which involved genes that were previously reported to be associated with individual phenotypic features and half of which were novel candidates. A total of 431 associations were made solely on the basis of model organism phenotype data. Additionally, we observed a striking, statistically significant tendency for individual disease phenotypes to be associated with multiple genes located within a single CNV region, a phenomenon that we denote as pheno-clustering. Many of the clusters also display statistically significant similarities in protein function or vicinity within the protein-protein interaction network. Our results provide a basis for understanding previously un-interpretable genotype-phenotype correlations in pathogenic CNVs and for mobilizing the large amount of model organism phenotype data to provide insights into human genetic disorders.
INTRODUCTION
Genomic disorders make up a family of genetic diseases that are characterized by large genomic rearrangements, including deletions, duplications and inversions of specific genomic segments. Many such rearrangements result in the loss or gain of specific genomic segments and thus are referred to as copy number variants (CNV). These regions can contain multiple genes. The phenotypic abnormalities seen in diseases associated with CNVs are thought to be related to altered gene dosage effects in most cases (Branzei and Foiani, 2007). In assessing the medical relevance of a CNV for a patient with a range of observed phenotypic abnormalities, it is essential to ascertain whether the CNV is causative for the disease and/or is merely incidental. If the CNV is, in fact, the cause of the disease, it is then important to know which of the genes located within the CNV are associated with which of the phenotypic features. In this study we focus on the latter challenge.
At present, information on Mendelian disorders that are associated with about 2000 human genes is available from sources such as OMIM (Online Mendelian Inheritance in Man) (Hamosh et al., 2005). However, substantially more information is available from model organisms such as the mouse and the zebrafish (Schofield et al., 2012). Furthermore, it has previously been shown that model organism phenotype data can be used for the analysis of human CNV disorders. For instance, Webber and co-workers investigated CNVs associated with mental retardation by linking the genes in these CNVs with phenotypes found in mouse gene-knockout models and showed that pathogenic mental-retardation-associated CNVs are significantly enriched with genes whose mouse orthologs, when disrupted, result in a nervous system phenotype (Boulding and Webber, 2012; Hehir-Kwa et al., 2010; Webber et al., 2009).
Clinical issue
More than 60 disease syndromes, covering a wide range of systems, have been associated with copy number variation (CNV) in the human genome. With the advent of whole genome sequencing, many more CNVs are being found in patients with previously unreported phenotypes. With currently available approaches, it is difficult to determine whether these CNVs cause the disease phenotype, or whether dosage effects of certain genes in the segment are responsible for specific aspects of the disease. Moreover, there are more than 5000 human genes about which nothing is known phenotypically, but for which detailed phenotypic information about their orthologs in model organisms is available. This study introduces a novel computational method for prioritizing candidate human disease genes using model organism phenotype data, and is applicable on a genome-wide scale.
Results
For 27 recurrent CNV disorders, the authors identify 802 gene-phenotype associations. Approximately half of these associations were previously reported and half were novel candidate associations. 431 associations were made solely on the basis of model organism phenotype data. The authors also observed a striking and statistically significant tendency for individual disease phenotypes to be associated with multiple genes located within a single CNV region, a phenomenon that they denote as ‘pheno-clustering’. Many of the members in a given cluster display statistically significant similarity in protein function, or vicinity within the protein-protein interaction network.
Implications and future directions
This study represents proof-of-principle for using phenotype data from model organisms to augment human clinical data to establish candidate disease genes. This method can, in principle, be extended to help with prioritizing candidates in GWAS and other association studies. In summary, this method represents an important new computational application for model organism phenotype data, and is expected to be widely applicable for interpreting individual genotype and precision phenotype data emerging from personalized medicine.
The use of non-human models has proved to be one of the most powerful approaches to understanding human disease (Rosenthal and Brown, 2007; Schofield et al., 2010). The description of abnormal phenotypes to model organisms can inform our understanding of the pathogenicity of human mutations, help prioritize candidate genes identified from genome-wide association studies (GWAS) and other investigations, and help dissect complex disease syndromes (Boulding and Webber, 2012). For the mouse, we now have phenotypes for around 8500 genes, and 40,000 genotypes with phenotypic annotations, in the Mouse Genome Informatics database (Blake et al., 2011). All of these phenotypes are coded using the Mammalian Phenotype Ontology (MPO) (Smith and Eppig, 2009; Smith et al., 2005). For zebrafish, there are more than 60,000 phenotypic descriptions of many thousands of genotypes, encoded using Entity Quality syntax (Washington et al., 2009). The recently launched International Mouse Phenotyping Consortium and the Zebrafish Mutation Project, which aim to systematically phenotype knockout lines for every protein-coding gene in the mouse and zebrafish genomes (Bartsch et al., 2005; Brown and Moore, 2012), will provide the most comprehensive phenotypic descriptions of any higher organisms. Together with data already available in other model organism databases, this provides an increasingly rich resource that can be leveraged to understand the consequences of human mutation and functionally dissect the human genome. The barrier to computational use of these data has been the disparate and non-standardized way of describing human phenotypic data, which has traditionally relied on free text or terminologies designed for medical management, billing and epidemiology (Schofield et al., 2010). The advent of the Human Phenotype Ontology (HPO) (Robinson et al., 2008; Robinson and Mundlos, 2010) addresses these problems with human data and is increasingly being used by clinical geneticists and systems biologists; we are now in a position to address the cross-mining of phenotype data from humans and model organisms to enormous benefit. Cross-species ontological approaches that use computer reasoning over phenotype ontologies offer a promising new methodology to identify similarities between human disease manifestations and observations made in genetically modified model organisms (Hoehndorf et al., 2011; Mungall et al., 2010; Washington et al., 2009). Ontologies are knowledge representations that use controlled vocabularies designed to facilitate knowledge capture and computer reasoning (Robinson and Bauer, 2011). An ontology provides a computational representation of the concepts of a domain together with the semantic relations between them. The use of ontologies for phenotypic analysis is discussed in Schofield et al. and Gkoutos et al. (Schofield et al., 2011; Gkoutos et al., 2012).
In this study, we introduce a computational algorithm that takes advantage of computable definitions of human, mouse and zebrafish phenotypes to perform genome-wide interspecies phenotype comparisons to detect candidate genes in recurrent hereditary CNV disorders. We have computationally examined the relationships between phenotypes associated with recurrent CNV disorders and phenotypes associated with human and model organism single-gene diseases whose genes are located within the CNV intervals. We have identified a total of 802 candidate genes for individual phenotypic features, approximately half of which were not previously reported in the literature. We additionally found a striking tendency for individual phenotypic features in CNV disorders to be associated with two or more individual genes located within the CNV. In many cases, these genes share functions or are located in close proximity to one another within the protein interactome. Thus, our work provides a framework for the interpretation of CNV-associated phenotypes, suggesting that clustering of functionally related genes within CNVs might be an important factor related to the phenotypic abnormalities seen in affected individuals.
RESULTS
We have developed a computational framework to harness phenotypic information from model organisms and single-gene human hereditary disorders to gain insights into the genetic etiology of the complex phenotypes of CNV disorders (Fig. 1). The goal of our analysis was to identify genes located within CNVs that are most likely to be responsible for the individual phenotypic abnormalities of the disease (Table 1). All available phenotype data for humans, mice and zebrafish were integrated for this analysis, resulting in a total of 7546 phenotypically described gene families, including 5703 for which phenotypes were only available from the model organisms (Fig. 2). In total, 802 genes were identified for individual phenotypic features of 27 different recurrent CNV disorders, including 346 newly identified associations (Table 2; supplementary material Tables S1, S2). For the 27 CNV diseases investigated in this work, a total of 468 of their respective phenotypic features could be explained on the basis of phenotypes associated with 802 single-gene mutations in humans, mice or zebrafish. The analysis consisted of four primary tasks: construction of logical definitions of pre-composed phenotype ontology terms, cross-species integration, calculation of information content of individual phenotype terms, and statistical comparison of phenotypes. For the frequent cases in which multiple individual genes provide a potential explanation for a phenotypic feature, the relationship of the genes to one another in the protein interactome and Gene Ontology functional space was examined (Fig. 1; see Materials and Methods, and supplementary material Table S3, which contains further details of the methods used).
The integration is possible because the computable definitions make use of the more atomic and species-agnostic elements from which each of the individual phenotypic classes is formed. For example, the HPO term ‘increased bone mineral density’ is composed of the Quality (PATO) term ‘increased density’ and the human anatomy term ‘bone organ’, which itself is a subclass of the Uberon (cross-species anatomy ontology) class ‘bone.’ A variety of such atomic ontologies covering biological processes, small molecules, cell types and anatomical structures are used to construct the logical (i.e. computable) definitions. In the second phase, our algorithm traverses the phenotype ontologies to integrate information from humans, mice and zebrafish semantically into a single composite ontology: ‘uberpheno’. Thus, we combine information about the phenotype of a human CNV disorder, the phenotypes (if known) of every human gene contained within the various possible CNVs, and the phenotypes of each of the gene orthologs in mouse and zebrafish. In the third step, the information content (IC) (Shannon, 1948) is calculated for each of the phenotypes in the composite ontology to provide a measure of how informative those annotations are. The IC is based on the number of genes annotated to the term in humans, mice and zebrafish. Very specific terms are associated only with very few genes and therefore have a high IC; nonspecific phenotypes associated with many genes have a low IC (Resnik, 1995). In the comparison step, each of the phenotypic descriptions for each of the genes is searched for similarity to each of the CNV phenotypes. Every phenotype scoring above a similarity threshold is selected to be part of the resulting ‘phenogram’ (described below; see also ‘Quantification of phenogram score’ in the Materials and Methods section), and each phenogram is scored according to its IC power sum. An empirical P-value is calculated for a phenogram by performing the entire analysis 5000 times with the same CNV phenotypes, but using randomized sets of genes. Our results are highly statistically significant; compared with the analyzed 27 CNV diseases, a total of ∼480 of their respective phenotypic features could be explained, whereas in the randomized experiments, the average number of phenotypic features explained was ∼250 (Fig. 3A).
For a number of the 27 CNVs analyzed, a critical gene is already known or highly suspected. Our results are highly consistent with these previous findings, supporting our computational methodology and interpretation, and we could additionally identify many previously unknown gene-phenotype connections (Table 3). The label ‘major gene’ in Table 3 refers to CNV disorders for which one gene is known to be the major player in the disease because intragenic mutations lead to a very similar phenotype in which most of the major phenotypic abnormalities characterizing the disease are present. It is common for individuals with the CNV disorder to present with more severe and/or additional features compared with individuals with intragenic single-gene mutations. Phenotypic variability of course also occurs within both cohorts of individuals – those with point mutations and those with the CNV disorder – which can make it difficult to pinpoint the systematic phenotypic differences between single-gene and CNV disorders. The underlying difficulty is also apparent in the current use of nomenclature: patients with the CNV disorder and patients with a point mutation in the ‘main’ disease-causing gene are defined as having ‘Sotos syndrome’ or ‘neurofibromatosis’ or ‘Smith-Magenis syndrome’, for example. Certainly, an individual with a microdeletion at 5q35 can be said to have Sotos syndrome but, even though most major features of the syndrome are present in cases of microdeletion and intragenic mutation, more or less obvious differences in phenotypic severity and additional abnormalities might be observed. In these cases, our method indicates where such additional effects might be present and offers information on additional genes and their potential contribution to certain phenotypic features of the disease.
Inclusion of data from different species into analyses such as ours is important because different model organisms can contribute distinct kinds of information. Our analysis results also support this concept, with interesting findings not just from mouse models but also from zebrafish: for the 15q26 overgrowth syndrome, micrognathia might be associated with haploinsufficiency of the gene CHSY1, an association drawn as a result of the zebrafish phenotype ‘abnormally decreased size mandibular arch skeleton’, whereas sensorineural hearing impairment might be a result of haploinsufficiency of IGF1R, which causes ‘abnormally absent inner ear hair cell’ in zebrafish models. Future projects will aim to include additional model organism data. The need for phenotype comparisons with further species has also been made evident only recently by the discovery of the major disease-causing gene for the 17q21.3 microdeletion syndrome, KANSL1. For this gene, no information was available from human, mouse or zebrafish, but it has been shown in 2010 by Lone et al. that Drosophila mutants show defects in synaptic vesicle biogenesis and trafficking, resulting in reduced learning ability (Lone et al., 2010).
Phenograms: prediction and visualization of genotype-phenotype associations
A phenogram of a CNV represents the network of genes and related phenotypes that have been associated with the genes in a particular CNV interval. All phenotype matches above a threshold, calculated based on the phenotype IC of the closest match (Resnik, 1995), are used to form a phenogram. For example, Rubinstein-Taybi syndrome is thought to result from haploinsufficiency of a single gene and, unsurprisingly, 19 of the phenotypic features of Rubinstein-Taybi syndrome were all assigned to CREBBP (Fig. 4). A more representative phenogram is observed in the analysis for Williams syndrome: from the original 68 frequent phenotypes and 13 phenotypically described genes in the CNV, the analysis generated a profile of 32 phenotypic abnormalities connected to 11 candidate genes through 39 associations. These included 16 of 23 previously reported associations (Pober, 2010) and 23 novel associations (Fig. 5). Note that, for Williams syndrome, we identified many phenotypic features that were associated with more than one candidate gene, i.e. multiple different genes all produce a similar phenotype.
In addition, the results indicate that dosage effects of other genes in an interval can present as modifying factors. For instance, the 9q subtelomeric deletion syndrome, clinically characterized by mental retardation, childhood hypotonia and facial dysmorphism, has been thought to result from haploinsufficiency of EHMT1, because point mutations in this gene cause a similar phenotype (Verhoeven et al., 2010). Our analysis identifies CACNA1B as a potential additional contributor to sleep disturbances and behavioral problems found in affected individuals. Similarly, point mutations in HNF1B are a known cause of RCAD (renal cysts and diabetes), but, in cases of microdeletions, haploinsufficiency of ACACA, which plays a role in glucose homeostasis, might contribute to the diabetes phenotype, and haploinsufficiency of LHX1 might modify susceptibility to renal abnormalities.
Phenograms for all 27 CNV disorders investigated in this work can be downloaded from http://compbio.charite.de/tl_files/groupmembers/koehler/.
Pheno-clusters: composite effects of genes in CNV regions
A striking observation of our analysis was that numerous CNV phenotypes could be clearly associated with multiple genes in the interval, each of which in isolation has been shown to result in a similar phenotypic abnormality. Such pheno-clusters – physical clusters of genes associated with particular shared phenotypes in the genome – might be causative for a larger subset of the phenotypes observed in CNV disorders. Even genes that do not show dosage effects in isolation might cause phenotypic abnormalities if one or more additional pathway members are simultaneously deleted. Such an effect has been observed for the SHFM1 locus, where the genes DLX5 and DLX6 are known to cause split-hand/split-foot malformation (SHFM). Existing mouse models exhibit the SHFM phenotype only if both genes are knocked out (Merlo et al., 2002; Robledo et al., 2002). Similarly, studies in murine models of Williams syndrome indicate that deletion of the Eln gene combined with the presence (non-deletion) of Ncf1 contributes to the observed murine hypertension (Adams and Schmaier, 2012). The interaction between different genes affected by a CNV is potentially a very important determinant of clinical severity.
We investigated whether such phenotypic summation effects due to pheno-clusters occur more often than would be expected by chance. The total number of pheno-clusters in our analysis was more than twice that of the randomized data (Fig. 3B). Because this could in part be a consequence of the lower overall number of genes and phenotypes identified in the randomized data, we examined the percentage of phenotypes in the randomized data explained by multiple genes. Even here, the percentage of pheno-clusters was significantly greater for the real CNVs than expected by random chance (P=0.02; Fig. 3C). In all, pheno-clusters were predicted for 220 phenotypes, corresponding to 135 gene clusters (in some cases, the same genes were associated with more than one phenotype). Thus, from the total of 802 gene-phenotype predictions, 220 (∼25%) were explained by multiple genes, corresponding to pheno-clusters. It is known that the chromosomal location of genes can be related to their function. For instance, genes located adjacent to gene deserts very often function as transcriptional regulators (Ovcharenko et al., 2005). We therefore asked whether the functions of the genes identified in the pheno-clusters tend to functionally cluster as well. We analyzed the functional similarity of genes within each of the 135 pheno-clusters based on Gene Ontology criteria (see Materials and Methods). Overall, 49 of the pheno-clusters demonstrated a statistically significant intracluster similarity. We also used random walk analysis to investigate whether the gene products of the genes in the 135 pheno-clusters were in closer proximity in the protein interactome than expected by chance (Köhler et al., 2008). In total, 27 of the pheno-clusters showed a statistically significant vicinity score (see Materials and Methods), and 62 of the pheno-clusters (46%) were validated by both methods. Fig. 3D,E shows examples of statistically significant protein-protein interaction (PPI) network results. Although these examples are not based on experimental evidence, we note that explanations of CNV phenotypes currently given in the literature are almost exclusively based on ‘guilt-by-association’ from manual literature searches. In contrast, our results are based on a comprehensive phenome-wide search across data from humans and two model organisms. Our results also indicate the importance of collecting detailed phenotype-genotype information on patients with different forms of CNV diseases (i.e. due to point mutations in a single gene, and due to microdeletions), because this information could be relevant to their clinical management. Similar conclusions apply to CNV disorders characterized by variable intervals. For example, in Phelan McDermid syndrome, SHANK3 was initially thought to be responsible for most of the phenotype, because it was included in the minimal critical region (Bonaglia et al., 2001). Our analysis suggests that MAPK81P2 might contribute to the behavioral and autistic features of the disease. Not all individuals with this disease have deletions encompassing SHANK3. Thus, either SHANK3 or MAPK81P2 might cause behavioral problems, and combined haploinsufficiency of both genes might increase the likelihood of autism. Currently, publicly available data do not include sufficient information about the phenotype and the extent of the deletions to draw this conclusion. Our analysis thus motivates further specific genotype-phenotype studies for CNV disorders such as Phelan McDermid syndrome that are characterized by variable interval sizes.
DISCUSSION
In clinical genetics, it is often difficult to decide whether a quantitative variation in the genome is related to the observed phenotype, and predicting consequences of haploinsufficiency is challenging (Huang et al., 2010). To understand the functional impact of a given CNV region, not only does the general issue of pathogenicity need to be answered, but also the question of which of the genes included in the CNV region are associated with which phenotypic abnormalities present in the patient. Such information is invaluable for clinical management. Patients with overlapping but different sized deletions or duplications might present with different phenotypes that correlate with the affected genes. For patient follow-up and screening procedures, the information that one patient might, for example, have a high cancer risk or a risk for developing diabetes or hypertension, whereas another patient does not, might have a huge impact on individual prognosis and treatment.
In this work, we developed a semantic algorithm for mapping model organism phenotype data to equivalent human phenotypic features. We used this algorithm to address the question of which genes in CNVs are most likely to be causally related to individual phenotypic features seen in the CNV based on the assumption that an abnormal dosage in a gene is likely to lead to similar phenotypic abnormalities as a loss- or gain-of-function mutation in the same gene. In this way, we are able to exploit the wealth of phenotypic information available for 5703 genes in model organisms for which phenotypes of mutations in the human orthologs are unknown (Fig. 2). For the 27 well-characterized CNV disorders analyzed in this work, we identified a total of 802 phenogram matches, i.e. genes in which a monogenic disease in humans or model organisms is associated with a phenotypic feature that is also seen in (or similar to) one of the features of the CNV disorder. In order to test the performance of our algorithm, we performed the identical analysis 5000 times on randomized data. On average, only 250 features were identified, and the maximum number of features found in any of the randomized runs was ∼350 (Fig. 3A). We performed an extensive literature search for previously reported phenotype associations (supplementary material Table S2); comparison of the results of our algorithm revealed that we identified 457 previously reported associations. Additionally, we found 346 novel phenotype associations that, to the best of our knowledge, have not been previously recognized in the medical literature. Our algorithms might additionally be valuable to incorporate model organism data into other areas in human genetics, such as the prioritization of variants found in exome sequencing projects.
Our work has several limitations. In Table 1, empirical probabilities (P-values) for the phenogram scores (SPG) are given for each of the 27 CNV disorders. In total, 14 of the CNV disorders displayed statistically significant scores (P<0.05), including clinically distinct disorders such as WAGR and Sotos syndrome. There are several possible reasons for the lack of statistical significance of the remaining 13 disorders, which could relate either to the limitations of our computational approach, inadequate phenotypic annotations, or lack of knowledge about the genes located within the CNV. An important limitation of the approach as implemented in the current work is that it depends upon the granularity of the phenotype descriptions. More broadly used, nonspecific descriptions of abnormalities, such as autism or intellectual disability, are not flagged as ‘statistically significant’ because they are so frequently used. The calculated P-values are based on the IC of the phenotypic features, and the IC of intellectual disability is very low (IC=3.2) because so many genes (currently 392) are annotated to this term. Indeed, P-values of the phenogram scores reported in Table 1 correlate with the granularity of the phenotypic descriptions of the CNV disorders, shown as the average IC of the CNV phenotypes and the IC of the phenogram-matches; unsurprisingly, the P-values also correlate with the size of the intervals, measured with respect to the absolute numbers of genes and the numbers of genes with available phenotype information (see figure 3.2 in section 3.9 of supplementary material Table S3). Many recently characterized CNV disorders that have been delineated on the basis of array-CGH (comparative genomic hybridization) screening rather than clinical studies have substantially less-specific clinical pictures. Nonspecific clinical phenotypes and high phenotypic variability complicate diagnosis and could explain why diseases associated with microdeletions or duplications of 3q29 (Ballif et al., 2008) and microdeletions of 15q24 (Andrieux et al., 2009) do not score as well as more distinct CNV disorders. Although the current work concentrated on identifying statistically significant phenotypic matches, an implementation of our method as a clinical decision support system could be designed to show the best match or matches for both specific and less specific phenotypic abnormalities. We also note that the P-value as calculated in this work is not a measure of the probability that the CNV is the cause of the disease phenotype, which is the type of hypothesis testing that one would use in a diagnostic setting. Rather, the statistical hypothesis is a measure of whether the phenotypic abnormalities associated with the individual genes within the CNV match the phenotype of the CNV disorder better than one would expect by random chance, which is a conservative way of evaluating the results of semantic phenotype matching. It is to be expected that some degree of phenotypic similarity to CNVs with complex phenotypes exists at many other loci in the genome. For instance, hundreds of distinct CNVs could be associated with phenotypes such as autism (Levy et al., 2011), and indeed there is a high likelihood that a large deletion anywhere in the genome will be pathogenic and result in one or more abnormal phenotypic features (Vermeesch et al., 2007). Therefore, the method presented in this work would need to be extended to include other data, such as previous reports of comparable CNVs in databases such as DECIPHER (Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources) (Firth et al., 2009) and ISCA (Riggs et al., 2012), to be useful as a clinical differential diagnosis support tool.
It is difficult to provide any direct experimental proof in humans that altered dosage of a specific gene is responsible for a specific phenotypic abnormality in a CNV disorder, unless monogenic lesions also occur in isolation in other patients. Rather, candidate genes are proposed based on the similarity of their single-gene mutation phenotypes to the CNV phenotypes; for instance, haploinsufficiency for ELN was proposed as the cause of supravalvular aortic stenosis observed in individuals with Williams syndrome because point mutations in the ELN gene also give rise to this phenotype. A total of 456 of the 802 associations identified by computational analysis in our study have been previously proposed in the literature, thereby supporting our computational approach (supplementary material Table S2). To the best of our knowledge, 346 of the 802 associations offer novel candidate genes for individual phenotypic features. We examined associations from the literature that were not detected by our approach to determine the extent and possible reasons for such false-negatives. Some associations, such as FZD9 and ‘intellectual disability’ for Williams syndrome (Pober, 2010), fell below the threshold for detection by our method because of their very low IC. Others, such as the association of genetic variants in STX1A and ‘impaired glucose tolerance’ (Pober, 2010; Romeo et al., 2008), were not detected by our method because they were based on human association studies and are so far not reported in OMIM or any of the information sources used in this study. Inclusion of these kinds of data, for example from resources such as the Genetic Association Database (GAD) (Becker et al., 2004) or GWAS Central (Thorisson et al., 2009), in candidate gene prediction algorithms such as ours will be addressed in future projects.
For neurological and neuropsychiatric phenotypes such as intellectual disability, seizures, schizophrenia, mood disorders and autism, genetic heterogeneity and variable expressivity and penetrance are well-known features. Cytogenetic imbalances are the most frequently identified cause of intellectual disability (Aradhya et al., 2007), and CNVs are increasingly being detected by array-CGH in individuals with neurological and neuropsychiatric phenotypes (Akil et al., 2010). For such phenotypes, dysregulation of relevant neural circuits might be caused by disruption of single genes, but combinatorial effects of variations in many genes affecting shared pathways have also been proposed (Shaikh et al., 2011). Similarly, clustering of functionally related genes has been proposed for bovine quantitative trait loci (Salih and Adelson, 2009). We identified clustering of functionally related genes within CNVs as a second important factor for pathogenicity of CNVs in the human genome, not only for neurological phenotypes, but also for various other phenotypic features such as genitourinary, skeletal and metabolic abnormalities. We found evidence that genes involved in pheno-clusters are often functionally related to one another and tend to be near one another in the PPI network (Fig. 3D,E).
There is abundant evidence now that there is functional clustering in all mammalian genomes. Presumably, the phenotypic clustering observed in our study is related to the clustering of functional neighborhoods of genes across chromosomes, which is even partially conserved across species (Al-Shahrour et al., 2010). In some cases, clustering is associated with areas of strong linkage disequilibrium, suggesting that coinheritance of combinations of alleles of genes whose products interact or are associated with the same pathway or function might be the evolutionary driving force. Interestingly, functional clusters shared by different species do not always seem to consist of orthologs, suggesting that evolutionary pressure is exerted upon the cluster's function rather than the individual genes within it (Al-Shahrour et al., 2010; Michalak, 2008; Petkov et al., 2005). To date, there has been no explicit global analysis of the clustering of gene function, location, process, pathway or expression patterns involved in human CNVs, but the possibility of epistatic relationships between these genes would be predicted to be strong. There is, however, some evidence that certain functional gene classes are overrepresented in areas of the genome containing common CNVs (Conrad et al., 2010). Our data provides for the first time a breakdown of the ‘phenotypic readout’ from regions involved in CNVs and strongly suggests that they contain functionally clustered genes (Michalak, 2008; Petkov et al., 2005; Stranger et al., 2007). The results of our study shed new light on the pathobiology of human CNVs and provide evidence that the concept of clustering of phenotypically related genes plays an important role in genome pathology.
Another important aspect of our study is that 377 of the 629 genes analyzed did not have any human or model organism phenotype information. Thus, systematic genome-wide phenotyping efforts such as the International Mouse Phenotyping Consortium (Brown and Moore, 2012) and corresponding efforts in zebrafish (Kettleborough et al., 2011; Wang et al., 2007), such as the Zebrafish Mutation Project, have great potential to provide additional insights and candidates for genes involved in human disease. Algorithms such as ours that make use of phenotypic similarities between human and model organisms will facilitate the computational integration of information from these projects, harnessing these increasingly rich resources to help us understand the consequences of human mutation and functionally dissect the human genome. Our algorithms can be adapted to assist with interpretation and understanding of the diagnostic results from array-CGH analyses. Similar algorithms can be developed for interpreting next-generation sequencing data, thereby moving closer to the objective of a personalized genetic approach to medical care.
MATERIALS AND METHODS
Phenotype annotations using ‘uberpheno’, a cross-species phenotype ontology
We downloaded 17 ontologies from the Open Biological and Biomedical Ontologies (OBO) Foundry website (Smith et al., 2007) and constructed the logical definitions for HPO terms and MPO terms from these. The definitions can thereby serve to relate entities across the three species for common biological processes, small biological molecules and cell types. The anatomical terms used in the phenotype definitions, from the corresponding anatomical ontologies of the three species, were related to one another using the metazoan anatomy ontology Uberon (Mungall et al., 2012). Using these definitions, we created a single combined cross-species ontology called ‘uberpheno’ that represents phenotypes in mouse, human and zebrafish (Köhler et al., 2011). Full details of the construction of uberpheno are provided in supplementary material Table S3.
The phenotype ontologies do not themselves represent diseases, but rather describe individual phenotypic abnormalities. Any one disease may comprise one or more such abnormalities; therefore, each disease is represented computationally by an annotation to multiple phenotypic abnormalities. For this work, we compiled phenotype annotations from multiple sources. All available phenotypic information from humans was extracted from the HPO annotations, the majority of which are based on data from the OMIM knowledgebase (Amberger et al., 2009). Data on 6535 murine models were obtained from the Mouse Genome Informatics (MGI) database (Shaw, 2009), and detailed phenotypic annotations for 1625 zebrafish models were taken from ZFIN (Zebrafish Model Organism Database) (Bradford et al., 2011). The 27 recurrent CNV disorders examined in this work were manually curated using HPO terms to generate comprehensive sets of annotations. The main sources for the manual annotation were recent publications, GeneReviews (Pagon et al., 1993), OMIM (Amberger et al., 2009) and a standard reference work on dysmorphology in human genetics (Jones and Smith, 1997). The intervals and corresponding genes included in the intervals of the chosen CNVs were taken from DECIPHER (Firth et al., 2009). For this project, a conservative approach was chosen by including all genes in a maximal critical region as stated by DECIPHER. For some diseases, a gene that was not included by DECIPHER was added to the list for the corresponding CNV because of evidence from recent publications stating involvement of the gene. For detailed information on individual annotations including references for all annotated phenotypes, as well as the complete gene lists for the intervals of all 27 CNV disorders, see supplementary material Table S4.
Computational strategy for CNV analysis
The analysis of a CNV begins with the set of genes (GCNV= {g1, g2, …, gn}) located within the CNV. For each of the genes (gi ∈ GCNV) there is a set (Tgi) of associated phenotype terms from human single-gene disorders and from available mouse and zebrafish models. Similarly, TCNV represents the set of phenotypes associated with the particular CNV disorder. Phenotype annotations for humans (HPO), mouse (MPO) and zebrafish (directly composed Entity and Quality annotations) are mapped to the corresponding terms in uberpheno. Phenotypic features that are only rarely associated with the CNV (i.e. less than 15% of affected persons show the feature) were removed from TCNV before further analysis.
Information content of phenotype terms
The IC of a term t is defined as the negative logarithm of the frequency of annotations to the term (Resnik, 1995): IC(t) = −logpt, where pt is the probability of annotations to term t of uberpheno among all annotated genes in humans, mice and zebrafish.
Common ancestors
We define anc(·) as a function that, for a given term or set of terms, returns the set of ancestral terms (i.e. inferred super-classes). Note, this function is reflexive, i.e. t ∈ anc(t). The set of common ancestors of an uberpheno term Tgi associated with gene gi and the set of uberpheno terms associated with the CNV is defined as:
We define tmax(tgi,TCNV) to be a term with the highest information content from the set CA(tgi,TCNV).
Phenograms
We define the phenogram of the CNV as a structure (G,P,D,E,λ) where G refers to the genes that have a matching phenotype above the IC threshold λ, P are the matching gene phenotypes, and D are the disorder phenotypes, with E the edges that connect them. G consists of all genes in GCNV for which a single-gene phenotype tg matches with a phenotype of the CNV with an information content above the threshold, i.e. IC[tmax(tg,TCNV)]≥λ (these genes are shown as green squares in Figs 4 and 5). P consists of all phenotypes in tmax(tg,TCNV) (shown as beige triangles in Figs 4, 5). The genes are connected to one or more shared phenotype terms with the connections labeled according to the source organism (HP, MP or ZP). Finally, the triangles are connected to the phenotype terms of the CNV (TCNV; shown as blue circles) that are explained by the matches. The phenograms were visualized using Cytoscape (Shannon et al., 2003; Smoot et al., 2011).
Quantification of phenogram score
For each gene, g ∈ GCNV, a phenomatch score Sg is defined based on the information content of all matching terms with specificity above a certain similarity threshold λ to exclude relatively nonspecific phenotypic features:
In our analysis, we selected λ to be 2.5, corresponding to a frequency for the feature of 786 among all 9580 analyzed genes. By choosing k>1, terms with a higher IC receive higher weighting. For this study, we selected k to be 5. The full phenogram score across all genes located in the CNVs can then be calculated as:
We note that the HPO annotations for the 27 CNV disorders in this work were created by manual curation. Additionally, OMIM contains some entries that correspond to CNV diseases such as Rubinstein Taybi Deletion syndrome (MIM:610543). To avoid bias, these annotations were excluded when analyzing the corresponding CNV. Table 1 shows empirical P-values based on the SPG scores of the 27 CNV disorders. To calculate the P-values, 5000 intervals containing the same number of genes as the original CNV were generated at random and SPG was calculated. The P-value was estimated as the proportion of times in which the randomized interval scored at least as high as the original CNV.
Distribution of phenogram matches
Each of the phenotypic matches, i.e. genes (g) annotated to some term (tg) whose similarity to a term in C exceeds the information content threshold λ, represents a potential ‘explanation’ of a phenotypic feature of the CNV. We reasoned that, although individual matches could be due to chance, the total number of above-threshold matches could provide a useful measure of the utility of our method. For the complete analysis, we included all genes from model organisms that have an ortholog in human as well as phenotype information. All human genes located within the CNV interval (GCNV) were then analyzed as described above, and the number of terms in P were summed over all of the 27 CNV disorders under consideration. We calculated an empirical P-value for the distribution by keeping the set of CNVs and their phenotypic abnormalities fixed while comparing them to randomly chosen sets of genes (Gr) to replace the original set of genes (G). This was done by randomly selecting a human gene (gr) and defining a random interval (Gr) surrounding gr that contained the same number of human genes as the original CNV.
Pheno-clusters and functional relatedness
In our analysis, we identified groups of genes (Gp) located in the same CNV that are associated with the same phenotypic abnormality. We investigated the hypothesis that these ‘phenoclusters’ of genes are not only related to the same phenotype but also share similarity based on other biological measures. Here, we calculated similarity based on Gene Ontology (GO) annotations of the genes in Gp and examined the vicinity of the genes to one another within PPI networks.
To compute the homogeneity of Gp = {g1, g2,…, gn} based on GO, we compute the average pairwise similarity for all unique pairs of genes in Gp:
For a pair of genes, we calculate the symmetric semantic similarity [simGO (gi, gj)] as in equation 2 of Köhler et al. (Köhler et al, 2009). To determine a P-value for a given homogeneity score [HOMGO(Gp)], we set up the empirical score distribution by randomly generating 10,000 random gene groups (Gr) and computing HOMGO(Gr), then estimating the P-value as the fraction of cases in which HOMGO(Gr) ≥ HOMGO(Gp).
In order to test the hypothesis that genes in the same phenocluster also tend to cluster in the human PPI network, we analyzed a network containing 10,742 nodes, corresponding to human genes coding for proteins with known interactions, as described previously (Köhler et al., 2008). We constructed the column-normalized adjacency matrix A and then computed the random walk matrix P by P = [I − (1 − r)A]−1 × r, where I is the identity matrix. Every entry (Pij) represents the probability of a random walker starting at node i and being at node j after an infinite number of steps. In every step, the walker randomly visits adjacent nodes. Note that, with probability r, the walker is reset to the start node i. In our study we set r to 0.75.
For a group of genes (Gp = {g1, g2, …, gn}), [GNP(Gp)] by:
whereby p∞ is calculated as . To set up the vector of start probabilities , the start probability of a network node k is defined as :
if gk ∈ {Gp\gi}, and 0 otherwise. Thus, when analyzing a particular gi, the random walker starts with equal probability from all nodes in Gp except gi. Then, the random walk distance from all the start nodes to gi is computed and the average over all gi ∈ Gp is taken as the GNP(Gp).
Similar to the GO analysis, we determine a P-value for a given score GNP(Gp) by calculating the empirical score distribution. This was done by randomly generating 10,000 random gene groups (Gr) and computing GNP(Gr). Afterwards, we estimate the P-value as the fraction of cases in which GNP(Gr) ≥ GNP(Gp).
Acknowledgements
FUNDING: This work was supported by the Director, Office of Science, Office of Basic Energy Sciences, of the US Department of Energy under Contract No. DE-AC02-05CH11231, and by grants of the Deutsche Forschungsgemeinschaft (DFG RO 2005/4-1), the Bundesministerium für Bildung und Forschung (BMBF project number 0313911), the MGD grant from the National Institutes of Health, HG000330, the ZFIN grant from the National Institutes of Health, U41-HG002659 and the PATO grant from the National Institutes of Health, R01-HG004838.
AUTHOR CONTRIBUTIONS: P.N.R., S.E.L., P.N.S. and M.W. conceived, coordinated and supervised the study. S.K., C.J.M., S.C.D. and S.B. developed the computational methods. S.K. and S.C.D. performed the analysis and analyzed the data. S.C.D., S.K., C.J.M., G.V.G., B.J.R., C.S., D.S., E.K., P.N.R., P.N.S. and M.W. worked on ontology development and annotations. S.C.D., S.K., S.E.L., P.N.S. and P.N.R. wrote the paper.
References
COMPETING INTERESTS: The authors declare that they do not have any competing or financial interests.