A major challenge of the post-genomic era is coding phenotype data from humans and model organisms such as the mouse, to permit the meaningful translation of phenotype descriptions between species. This ability is essential if we are to facilitate phenotype-driven gene function discovery and empower comparative pathobiology. Here, we review the current state of the art for phenotype and disease description in mice and humans, and discuss ways in which the semantic gap between coding systems might be bridged to facilitate the discovery and exploitation of new mouse models of human diseases.
Mouse models of human diseases
The value of the mouse as a model for human disease has become firmly established as new mutants are repeatedly validated as models of human disease and, increasingly, the similarities in the pathobiology of the two species provide new insights into disease mechanisms and aetiologies (Peters et al., 2007; Rosenthal and Brown, 2007; Justice, 2008; Brown et al., 2009). Mutant strains derived from hypothesis-driven research are now being augmented by large-scale mutagenesis efforts that are being undertaken worldwide (Brown et al., 2009). Following the successful phenotype-driven N-ethyl-N-nitrosourea (ENU) mutagenesis projects, the products of which are still being analysed, large-scale gene knockout programmes have been established to provide the mutant embryonic stem (ES) cells and mice that are needed to discover the functions of all of the protein-coding genes in the mouse genome. The International Knockout Mouse Consortium (IKMC; www.knockoutmouse.org) (International Mouse Knockout Consortium et al., 2007), composed of four international partners (EUCOMM, KOMP, NorCOMM and TIGM), is currently producing large collections of targeted and gene-trapped mouse mutants. To date, 13,374 genes have been knocked out of a target number close to 24,000. More than 500 mouse lines are expected to be systematically phenotyped within the next five years using standardised phenotyping procedures developed by the EUMORPHIA (European Union Mouse Research for Public Health and Industrial Applications) and EUMODIC (The European Mouse Disease Clinics) consortia (www.eumodic.org) (Brown et al., 2005).
The mutagenesis efforts are not the only new sources of large amounts of systematic phenotyping data. The Shock-Ellison Medical Foundation-funded mouse aging programme at the Jackson Laboratory (http://agingmice.jax.org/index.html) is keeping mice from 31 different strains for the entirety of their natural life span to generate a huge volume of age-dependent phenotype data covering physiology, pathology and gene expression (Yuan et al., 2009). Longitudinal, cross-sectional and targeted studies of these mice provide interesting insights into the pathophysiology of aging. By using the high-resolution single nucleotide polymorphism (SNP) maps that are now available, these data will generate new gene/phenotype associations for many age-related conditions and complex traits.
To make the best use of the sheer volume and depth of the emerging mouse phenotype data we need to be able to relate it to human ‘phenotype’ or disease data in a way that is amenable to computation; it is this challenge that we discuss here.
What is a phenotype?
The concept of a phenotype is used in a variety of ways, not all of which are compatible with each other. Descriptions of clinical diseases (signs and symptoms), pathological lesions and entities; summative disease nomenclature (e.g. syndromes); the appearance or behaviour of mutants; genetically determined traits of strains; and, at the molecular level, transcriptome and gene expression patterns all represent examples of the common understanding of the concept of phenotype. When defined properly, the phenome itself is all of the genetically determined traits manifested under the prevailing environmental conditions and a phenotype is an observable property of the organism in the specified environment. Another useful concept is the phenoset, which represents a group of phenotypes in the same individual (e.g. behaviour, cancer, adiposity) that, together, characterise it.
Phenotypes versus traits
The term phenotype is often used as a synonym for a trait, especially in the description of human disease. This leads to considerable confusion. In the development of ontologies, the distinction between traits and phenotypes is essential for logical clarity and, in line with other developers (e.g. Hughes et al., 2008), we adopt the following definitions. A trait is a heritable, specifically measurable or identifiable feature of an organism, which can be followed through the genetic segregation of one or more phenotypes – such as short legs or dark hair. The traits here are ‘leg length’ and ‘hair colour’. ‘Short legs’ and ‘dark hair’ are phenotypes, which are properties that can be measured or categorised under given environmental conditions.
The importance of phenotype data
For the mouse, useful associations can be made between genotypes and phenotypes where the mutation is known; either from identification of the ENU-induced or insertional alleles, or from targeted mutations. Additionally, where haplotype analysis is possible between inbred strains, making the association between phenotype and genotype permits the association of phenotype differences with specific haplotypes or SNPs, which is invaluable for complex trait analysis. This is now facilitated greatly by SNP discovery using high-throughput sequencing (Nikolaev et al., 2009).
Phenotypes that are shared between humans and mice can help identify candidate genes for human diseases. For example, candidate disease genes within association intervals in human genetic mapping studies, e.g. genome-wide association studies, may be triaged by looking at phenotypes of genes within the orthologous interval in the mouse. Evolutionary conservation of gene co-expression patterns for closely related phenotypes allows candidate gene prioritisation and, apart from identifying mouse mutants that can act as models for human diseases, we now see instances where high-resolution phenotyping of the mouse generates novel insights into human conditions (Ishimori et al., 2006; Ackert-Bicknell et al., 2008; Lisse et al., 2008).
The development of a common framework to describe human diseases and similar phenotypes in model organisms is needed to integrate the huge amount of phenotypic and genetic data that is generated from clinical genetic studies and the analysis of mutant animals. The problem is how to construct such a harmonised framework starting from the existing, well-established, but fundamentally different, approaches to describing phenotypes in humans and mice.
Coding of phenotype data
Both mice and humans have been ‘phenotyped’ for many years. Phenotypic variation in mice was recognised by the ancient Chinese (Keeler and Fuji, 1937) more than 2000 years ago. The Eh Yah dictionary (1100 B.C.) has a special term for a ‘mouse with the hair pattern of a leopard’, which is maybe the first description of a spontaneous mutation in the endothelin type B receptor gene, such as piebald (Ednrbs-l) (Lane, 1966), which shows a characteristic black spotting. The ‘waltzing’ phenotype, which is probably the result of vestibular defects that are similar to the familiar spontaneous waltzer mutations, e.g. Cdh23v-5J, was valued by the Japanese. A treatise on ‘The Breeding of Curious Varieties of Mice’, was published in 1787 by Chobei Zeniya of Kyoto, Japan. In this work, the author describes the crossing of various types of fancy mice and identifiably mentions the albino, non-agouti, recessive piebald, lilac with pink-eye and other heritable phenotypes. Our interest in mouse phenotypes and their genetics has a rich history.
A medical classification of human disease, known as nosology, has been attempted many times; in antiquity by Hippocrates and Isidore of Seville, and then later by Carl Linnaeus, who undertook one of the first attempts at a modern systematic classification of disease on the basis of symptoms (von Linne and Schroeder, 1763). Although subsequent classification systems have successively replaced these, the current system of International Classification of Diseases (ICD) for humans (World Health Organisation, 2008), still works in a paradigm that Linnaeus would recognise. To date, however, there have been few systematic attempts to harmonise the description of abnormality or disease between different species.
Phenotypes are generally described in natural language, frequently using a mixture of unstructured terminologies and free text, with variations that are widely understood within specific disciplines. Qualitative data is represented using disparate data models and indexed with simple text descriptions. At worst, the descriptions used for human phenotypes reflect local informal term usage or domain-specific controlled vocabularies. At best, they use terms from internationally accepted frameworks such as the Unified Medical Language System (UMLS), Medical Subject Headings (MeSH), International Classification of Diseases (ICD-9/10) or Systematised Nomenclature of Medicine Clinical Terms (SNOMED-CT) terminology. SNOMED-CT and ICD-9 are designed and structured for use in a clinical context, and both UMLS and MeSH are predominantly designed for describing human diseases and therapies. The structure of these nomenclatures precludes their use for logical inference and, in many cases, the terms are aetiologically or anatomically predicated in a way that cannot be used to describe disease in non-human organisms. The result is that the coding of human data using these large and complex terminologies is logically and semantically incompatible with the type of coding and nomenclature used for model organisms such as mice.
Since natural language is highly expressive, the range of information it can capture in phenotype descriptions is usually both deep and broad. For example, the ‘hoarse cry’ in Opitz GBBB syndrome (OMIM: 145410), and the ‘striking upslanting of the palpebral fissures, small nose with broad root, abnormally modelled ears, short neck with loose skin’ in Opitz C syndrome (OMIM: 211750) are difficult to express as concisely in any other way. Thus, natural language is the most obvious medium in which to record and express phenotypes. However, it is hard to carry out computation on descriptions based on natural language, and the task suffers from the now often-rehearsed problems of ambiguity, semantic complexity and lack of structure. For example, the term hedgehog can refer to one of several human or mouse genes; human or mouse gene products; a small mammal of the order Erinaceomorpha; or an arrangement of pineapple and cheese impaled on cocktail sticks. Disambiguation and semantic standardisation are vital but difficult to achieve.
The key to providing terminological clarity is to use far more formalised language sets than are provided by natural language. The bioinformatics community realised this more than a decade ago and has produced complex term hierarchies describing various areas of knowledge (gene properties, anatomies, etc.) where the terms are linked by relationships (e.g. part of, is a, derived from, etc.). These ontologies have provided computational tools to capture knowledge within a domain and to express it within a relational framework that can be used by a broad range of clinicians and scientists (see Box 1) (Bard and Rhee, 2004). The most important ontologies for describing human abnormalities are the Human Phenotype Ontology (HPO) (Robinson et al., 2008) and the Disease Ontology (DO) (Du et al., 2009; Osborne et al., 2009), whereas those for the mouse are the Mammalian Phenotype Ontology (MP) (Smith et al., 2005) and the Mouse Pathology Ontology (MPATH; www.obofoundry.org/cgi-bin/detail.cgi?id=mouse_pathology). All are members of the Open Biological Ontology (OBO) family (Smith et al., 2007) and can be downloaded from the OBO foundry site (www.obofoundry.org/).
An ontology is a formal conceptual representation of a domain of knowledge with the primary aim of creating a shared understanding of a domain and the relationships within that domain. It contains common defined symbols for the concepts within a domain and meaningful relationships between those concepts. These relationships permit inference – the propagation of meaning across the ontology.
Most biomedical ontologies are structured as simple hierarchies of information using is_a or part_of relationships. For example, the big toe is part_of the foot and the heart is_a thoracic organ. These hierarchies are termed directed acyclic graphs because cyclic relationships are not permitted, i.e. one term is not permitted to be both the parent and the child of another term, and the flow of meaning through the hierarchy is from the most specific term to the least specific.
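The structure described here can be sketched in a few lines of Python. This is a minimal illustration rather than the data model of any particular ontology: the `EDGES` list, the `foot part_of hindlimb` edge and the function names are assumptions introduced for the example, and is_a and part_of edges are followed without distinction, which real reasoners do not do.

```python
from collections import defaultdict

# Hypothetical toy ontology: (child, relation, parent) triples.
# The first and third triples come from the text; the others are illustrative.
EDGES = [
    ("big toe", "part_of", "foot"),
    ("foot", "part_of", "hindlimb"),
    ("heart", "is_a", "thoracic organ"),
    ("thoracic organ", "is_a", "organ"),
]


def build_parents(edges):
    """Map each term to the set of its direct parents."""
    parents = defaultdict(set)
    for child, _relation, parent in edges:
        parents[child].add(parent)
    return parents


def ancestors(term, parents):
    """Collect every term reachable by walking from specific to general."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_acyclic(edges):
    """Return True if no term appears among its own ancestors."""
    parents = build_parents(edges)
    return all(term not in ancestors(term, parents) for term in list(parents))
```

Under these assumptions, an annotation made to ‘big toe’ can be retrieved by a query for ‘foot’ because ‘foot’ is among its ancestors, and adding an edge such as `("organ", "is_a", "heart")` would make `is_acyclic` return `False`.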
Mammalian phenotype ontology and the mouse pathology ontology
The Mouse Genome Informatics (MGI) databases (www.informatics.jax.org/) (Eppig et al., 2007; Bult et al., 2008) hold qualitative (categorical) data coded by the MP (Smith et al., 2005). The MP consists largely of ‘pre-coordinated’ terms (see Box 2) – i.e. terms that include, for example, severity qualifiers or anatomical locations – and currently contains 9861 concepts.
The MP is currently the most successful and readily applicable approach to describing a wide range of aspects of phenotype and disease using a set of carefully defined descriptive terms. The terminology effectively captures various abnormal phenotypes and processes, as well as summative diagnoses and other descriptors of phenodeviance, which is the deviance of a phenotype in an animal, or cohort of genetically identical animals, away from what is typical in a reference population. Phenodeviance includes abnormal values for characteristics such as weight, coat colour or blood metabolites. The upper level terms of the MP ontology include physiological systems, behaviour, developmental phenotypes and ageing, and below this level, physiological systems are divided into morphological and physiological phenotypes. Many disease manifestations can be coded readily by MP and currently there are 88,600 annotations of approximately 21,000 genotypes in the MGI database. MP is a classically structured, hierarchy-based ontology and is designed to enable phenotype databases to be searched in order to find mutations and alleles with specific phenotypes; allow gene clustering based on mutant phenotypes; and discover genes in related pathways or potential mouse models of human diseases.
MPATH was originally designed as a description ontology for images of mouse histopathology and is segmented into aspects of pathology that would be familiar to traditionally trained pathologists. The most recent release is fully defined and contains terms covering all of the major classes of pathological lesions (594 to date), with specific reference to the mouse. These classes are arranged as a hierarchy within a directed acyclic graph (DAG), six levels deep, using the is_a relationship (e.g. a Harderian gland carcinoma is_a glandular tumour, is_a neoplasm) with each item having an MPATH ID that can be used for database interoperability and analysis. Many tissue responses are common to multiple anatomical sites and, as far as possible, the redundancy of specifying a particular response in multiple tissues has been avoided. The additional topographical or anatomical information for each image comes from the curatorial creation of crossproducts with an appropriate anatomy ontology such as MA, the mouse adult anatomy (Hayamizu et al., 2005). For example, colon adenocarcinoma=[MPATH; 0000268 (Adenocarcinoma) + MA; 0000335 (Colon)]. The use of cross-products prevents the combinatorial explosion that causes ‘ontology bloat’ in poorly structured ontologies – the inclusion in the ontology of all possible pre-composed variations of instances of an entity (see Box 2).
Also known as pre-coordination methodology, pre-composition uses a predefined set of phenotype terms created in advance by the ontology developer. Each term combines, for example, an entity, say ‘big toe’, and a quality of that entity, say ‘large’, into a single term: [‘large big toe’].
Also known as post-coordination methodology, post-composition involves construction of phenotype description at the time of annotation. In this case, there would be a term for ‘big toe’ and a term for ‘large’, and the post-composed term would combine these: [‘big toe’ + ‘large’]. This avoids, for example, the combinatorial explosion that is evident when the big toes might have many attributes that could also describe other toes, e.g. [‘small big toe’], [‘blue big toe’], [‘short big toe’], [‘short little toe’], [‘large little toe’], etc.
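The difference in vocabulary size between the two approaches can be illustrated with a short Python sketch. The entity and quality lists below are taken from the toe examples in the text; the variable names and the `annotate` helper are hypothetical.

```python
from itertools import product

entities = ["big toe", "little toe"]
qualities = ["large", "small", "short", "blue"]

# Pre-composition: every entity/quality combination is its own ontology term,
# so the vocabulary grows as len(entities) * len(qualities).
pre_composed = [f"{quality} {entity}" for entity, quality in product(entities, qualities)]

# Post-composition: only the building blocks are stored, so the vocabulary
# grows as len(entities) + len(qualities).
post_composed_vocabulary = entities + qualities


def annotate(entity, quality):
    """Compose an (entity, quality) pair at annotation time."""
    if entity not in entities or quality not in qualities:
        raise ValueError(f"unknown entity or quality: {entity!r}, {quality!r}")
    return (entity, quality)
```

With two entities and four qualities the gap is small (eight pre-composed terms against six building blocks), but it widens quickly: 1000 entities and 100 qualities would need up to 100,000 pre-composed terms against 1100 building blocks.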
Human disease ontology and human phenotype ontology
The full DO and its cut-down version, DO-lite (Du et al., 2009; Osborne et al., 2009), are based on ICD-9 and referenced to UMLS and SNOMED-CT. The full version contains 11,961 terms in the form of a hierarchy, of which 4399 terms are internal nodes lying up to 16 levels deep. HPO, the human phenotype ontology, was however derived from the terms found in the ‘clinical synopsis’ section of Online Mendelian Inheritance in Man (OMIM; www.ncbi.nlm.nih.gov/omim/) (Hamosh et al., 2005), and therefore covers largely monogenic diseases with mendelian inheritance. Although a hugely valuable resource, OMIM is not structured formally and the terminology used does not follow any consistent pattern. The construction of the HPO therefore represents a major improvement in the utility of OMIM and provides immediate structured genotype annotation to all of the 4779 annotated diseases. Both GeneRIFs (Mitchell et al., 2003) and GeneReviews (www.ncbi.nlm.nih.gov/projects/GeneTests/static/about/content/reviews.shtml) are additional useful sources of genotype/phenotype data but again are textual resources only.
The use of ontologies for recording human phenotypes is in its infancy and it is fair to say that the mouse research community has been much more pro-active in accepting and implementing standard terminologies than its human counterpart. The call for a human phenome project in 2003 (Freimer and Sabatti, 2003) with emphasis on the need for standards and international integration has not yet met with a concerted response, and human phenotype and trait data remain scattered, without coordination, throughout databases and resources across the world. Much human phenotype data relates to disease and its predisposition, and is largely captured with free text. In the best situations, it is coded using clinical informatics formalisms such as ICD-9/10 or SNOMED-CT. These systems are structured, unambiguous and widely accepted but suffer from being highly pre-composed (e.g. aetiologically and anatomically predicated), and are not organised in such a way as to support inference or computer reasoning. Nevertheless, one great advantage is that tools and resources such as UMLS and MetaMap (Bodenreider, 2004) are available for using ICD and SNOMED coding systems. These provide synonyms, cross-references and mark-up facilities, which are of assistance in comparing data between databases and within literature records, and have recently been used in crossing species boundaries (see below) (Marquet et al., 2007).
Human genetic databases may be divided into core databases and locus-specific databases (LSDBs). Core databases attempt to provide data on all pathological variation and its consequences, for example, the human gene mutation database (HGMD) (Stenson et al., 2008), which uses a local controlled vocabulary. LSDBs, by contrast, each focus on a single gene or locus (for discussion, see Patrinos and Brookes, 2005). The genetic association database (GAD) (Becker et al., 2004) contains associations between complex diseases and disorders and individual human genes curated from the literature; here, diseases are categorised using a controlled vocabulary drawn from MeSH terms. Quantitative data sets on human populations are held by the database of genotypes and phenotypes, dbGaP (Mailman et al., 2007), and again are indexed in a largely unstructured way through MeSH-defined terms. The human genome variation database, HGVbase G2P (Thorisson et al., 2009), is one of the most useful collections of genotype/phenotype associations, although it uses only a local controlled vocabulary to record phenotype data.
The consequence of the terminology ‘Babel’ in human clinical databases is that text mining is often the only approach to extract information from these resources (Perez-Iratxeta et al., 2002; Hristovski et al., 2005; van Driel et al., 2006). Text mining is fraught with problems, including issues of semantics, over-representation of common phenotypes and insufficient granularity.
Misinterpretation of the literature, combined with inaccurate database curation, can generate misleading hypotheses through implied disease orthology. However, the following example of the mouse hairless gene and its incorrect link to the complex polygenic disease known as alopecia universalis in humans shows that more considered analysis of such errors can ultimately create a much greater understanding of a particular disease. The hairless phenotype and its more severe form, known as rhino (short for rhinoceros), were first described in mice in 1856 (Gaskoin, 1856). The human homologue, atrichia with papules or, as it later became known, papular atrichia, was first described in 1954, nearly 100 years later (Damste and Prakken, 1954). The link between the mouse and human disease was made some 30 years afterwards (Sundberg et al., 1989; Sundberg, 1994). The hairless gene was traditionally linked to a simple, recessively inherited form of alopecia universalis based on a curation call in the OMIM entry (OMIM: 203655) (Ahmad et al., 1998). The OMIM designation was based on morphologic diagnosis: a total lack of hair in patients with an autosomal recessive pattern of inheritance. Alopecia universalis is actually a well-characterised, complex genetic-based autoimmune skin disease in both humans (Martinez-Mir et al., 2007) and mice (Sundberg et al., 2004). Although this mismatch was initially of great concern (Sundberg et al., 1999), it subsequently led to a much better understanding of papular atrichia. Many mutations have now been identified in the human hairless gene, as well as in rodents and non-human primates (Panteleyev et al., 1998; Ahmad et al., 2002).
Crossing the species divide; granularity and specificity
Accurate phenotype descriptions can discover new relationships between genes and phenotypes, and new functions for previously uncharacterised genes and alleles. A good example is PhenomicDB (Groth et al., 2007), which contains one of the most wide-ranging cross-species datasets on gene/phenotype associations. This database combines data from OMIM, the Mouse Genome Database (MGD), WormBase, FlyBase, the Comprehensive Yeast Genome Database (CYGD), the Zebrafish Information Network (ZFIN), and the MIPS Arabidopsis thaliana database (MAtDB). Groth et al. (Groth et al., 2008) queried the resulting PhenomicDB ‘warehouse’ using a text-mining approach that generated a summary phenotypic statement for each gene, and then clustered the statements to produce what Oti and Brunner (Oti and Brunner, 2007) have termed ‘phenoclusters’ – groups of genes with overlapping phenotypes, which may then be used for the discovery of new disease or functional associations. This phenotype-driven approach to the discovery of gene function has distinct advantages over the gene-driven approach to phenotype prediction because, although many closely related phenotypes are caused by mutations in different genes whose gene products interact directly or are on the same pathway, mutations in the same gene can have diverse phenotypic outcomes depending on which function of a multifunctional gene product is compromised. Several related disease candidate gene discovery approaches have been developed (for examples, see Tiffin et al., 2006; van Driel and Brunner, 2006; Oti and Brunner, 2007). However, in the absence of systematic coding, all of these approaches depend to a greater or lesser extent on text mining from their data sources, and make use, at best, of UMLS and MeSH terms in abstracts and database phenotype fields.
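The idea of a phenocluster can be illustrated with a small Python sketch. It is not the text-clustering procedure used by Groth et al.; it assumes instead that each gene is already annotated with a set of phenotype terms and groups genes by Jaccard overlap with a simple single-linkage rule. The gene names, the annotations and the 0.5 threshold are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of phenotype terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def phenoclusters(annotations, threshold=0.5):
    """Group genes whose phenotype sets overlap (single linkage).

    A gene joins, and merges, every existing cluster that contains at
    least one gene with Jaccard similarity >= threshold.
    """
    clusters = []
    for gene in sorted(annotations):
        linked = [
            cluster for cluster in clusters
            if any(jaccard(annotations[gene], annotations[other]) >= threshold
                   for other in cluster)
        ]
        merged = {gene}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters


# Hypothetical annotations, for illustration only.
annotations = {
    "geneA": {"deafness", "circling", "head tossing"},
    "geneB": {"deafness", "circling"},
    "geneC": {"white spotting", "megacolon"},
    "geneD": {"white spotting", "megacolon", "deafness"},
}
clusters = phenoclusters(annotations)
```

With these inputs geneA and geneB fall into one cluster and geneC and geneD into another; the single shared term ‘deafness’ is not enough to link the two groups at this threshold. The result depends entirely on how consistently the phenotype terms are coded.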
Despite impressive results from many of these approaches, it is clear that a standardised description of phenotypes and diseases would greatly increase the power and specificity of cross-species data mining.
A key problem is the assumption that the currently dominant paradigm for disease conceptualisation, based on clinical medicine, is useful for biomedical science applications. It is a mistake to assume that the human ‘phenome’ is a list of ‘diseases’ that form more or less distinct entities. The realisation that diseases of separate genetic aetiology may share similar phenotypes may seem obvious, but it is only recently that this has generated attention. Work by Brunner and others (Oti and Brunner, 2007; Oti et al., 2008) demonstrated that shared aspects of phenotype may be viewed as a proxy for a common underlying pathogenetic mechanism, and that this mechanism may be shared by dysfunction of a group of genes whose products either interact or are on the same functional pathway. This ‘modularity’ of phenotypes should not come as a surprise, but it makes the formulation of a new concept of disease description all the more urgent. The generation of phenoclusters depends on the ability to code phenotypes in as granular a way as possible. This approach was used originally in making gene/phenotype associations in RNA interference (RNAi)-generated phenotypes (by our definition, phenosets) in C. elegans, where each was expressed as a combination of 45 phenotypic features, enabling clustering of functionally related genes (Piano et al., 2002).
The use of a phenotype-driven approach to discover new information about gene/phenotype relationships within a species requires a sufficiently high level of specificity and granularity to discriminate between closely related phenotypes with overlapping components. This is particularly true of complex traits. Joy and Hegele (Joy and Hegele, 2008) provide an excellent discussion of the problems caused by the inaccuracy and variability of definitions in the context of metabolic syndrome and the resulting problems with candidate gene association and linkage studies. Description problems inhibited gene association studies in X-linked mental retardation, where there are insufficient phenotypic features to ‘unbundle’ non-syndromic cases in gene association studies (Ropers and Hamel, 2005).
The requirement for ‘deep phenotyping’ using well-defined criteria is clearly important in human gene association studies. It is also crucial if human phenotypes are to be compared with those from model organisms. The deficiency in cross-species interoperability of phenotype description formalisms is well demonstrated by the analysis of cross-species phenoclustering that was carried out using PhenomicDB by Groth et al. (Groth et al., 2007), discussed above. More than 90% of the clusters they generated contained genes from a single species and there was a tendency for genes to fall into species-specific clusters. They interpret this as an indication that the terminology used to describe phenotype in each species fails to cross the species barrier, even though many phenotypes clearly have their equivalents between species. It is therefore clear that, if our aim is to understand the underlying processes and genetic aetiology through using model organisms, we need a change in the way in which diseases and phenotypes are described.
The ontologies described earlier were all developed for particular species and, like many other controlled vocabularies, are not readily interoperable for cross-species queries, for example, between different genotype/phenotype databases. Semantic inconsistency and anatomical incompatibility, together with different traditions of disease description in different organisms, prevent the matching of phenotype ontologies either lexically or conceptually.
Two related problems impede the bridging of different ontologies that have been derived for either the same or separate species. None of the ontologies or controlled vocabularies for describing disease is truly orthogonal (generally used in this context to mean complementary and non-redundant), although they were designed to cover the same area of knowledge, for example DO and HPO. This means that, even within a species, the terminology used and the underlying structure may be different. For example, the term ‘melanoma in situ’ is used within SNOMED-CT and MPATH to represent a potentially cancerous lesion, whereas the National Toxicology Program (NTP) Toxicology Data Management System (TDMS) pathology code table for microscopic lesions (http://hazel.niehs.nih.gov/user_spt/pct_terms.htm) defines only ‘melanoma benign’ and ‘melanoma malignant’. Similarly, only the NTP TDMS pathology code table and the SNOMED-CT vocabularies define a benign melanoma term (‘melanoma benign’ and ‘benign melanocytic neoplasm’, respectively). HPO provides ‘especially prone to malignant melanoma’, ‘malignant intraocular melanoma’ and ‘malignant melanoma’. HPO does not address pre-neoplastic or benign lesions, but provides an anatomically predicated version and a predisposition syndrome. DO provides 96 melanoma terms, many of which are pre-composed and are both anatomically predicated and include morphological and prognostic qualifiers. Interestingly, there is no term for the preneoplastic lesion. MP only contains the anatomically predicated ‘intraocular melanoma’. Even this superficial comparison shows that comparing the data coded to each of these ontologies is very difficult and impossible to do automatically using simple lexical matching. A major problem is the use of complex pre-composed terms (see Box 2). 
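The failure of simple lexical matching can be shown directly with the melanoma terms quoted above. The sketch below is illustrative only: the `VOCAB` dictionary holds just the terms mentioned in this paragraph, not the full vocabularies (DO is omitted because its 96 terms are not listed), and the function names are assumptions.

```python
from itertools import combinations

# Only the melanoma terms quoted in the text, grouped by vocabulary.
VOCAB = {
    "SNOMED-CT": ["melanoma in situ", "benign melanocytic neoplasm"],
    "MPATH": ["melanoma in situ"],
    "NTP TDMS": ["melanoma benign", "melanoma malignant"],
    "HPO": [
        "especially prone to malignant melanoma",
        "malignant intraocular melanoma",
        "malignant melanoma",
    ],
    "MP": ["intraocular melanoma"],
}


def normalise(term):
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def exact_matches(vocab):
    """Return the terms shared verbatim by each pair of vocabularies."""
    hits = {}
    for (name_a, terms_a), (name_b, terms_b) in combinations(sorted(vocab.items()), 2):
        shared = {normalise(t) for t in terms_a} & {normalise(t) for t in terms_b}
        if shared:
            hits[(name_a, name_b)] = shared
    return hits


matches = exact_matches(VOCAB)
```

Of the ten possible pairings, only MPATH and SNOMED-CT share a term (‘melanoma in situ’). ‘Malignant melanoma’ in HPO and ‘melanoma malignant’ in the NTP TDMS table are not matched, even though a human reader would treat them as equivalent.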
In comparison to this, the issue of species-specific lesions, for example, as is found when comparing mouse haematopoietic neoplasms with those in humans, is relatively easy to deal with (Kogan et al., 2002; Morse et al., 2002). Making use of the subsumption (incorporation of a term into a higher order or parental category) that is available within an ontology permits relation of species-specific variants through a common parent. For example, the mouse small T-cell lymphoma (STL), which probably has no counterpart in the human (Morse et al., 2002), can be classified as a ‘mature T-cell neoplasm’ – a parent category that is common to human and mouse malignancies. Searching a database of mouse and human tumours using an ontology, where STL is_a ‘mature T-cell neoplasm’, would recover any human data coded to the 16 ‘mature T-cell neoplasms’ listed in ICD.
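A query through a common parent might be sketched as follows. This is a simplified illustration: each term has a single is_a parent, the human terms and the record identifiers are hypothetical examples chosen for the sketch, and no real ICD codes are used.

```python
# Single-parent is_a hierarchy (child -> parent). The first entry comes from
# the text; the remaining entries are illustrative.
IS_A = {
    "small T-cell lymphoma": "mature T-cell neoplasm",
    "mycosis fungoides": "mature T-cell neoplasm",
    "adult T-cell leukaemia/lymphoma": "mature T-cell neoplasm",
    "follicular lymphoma": "mature B-cell neoplasm",
    "mature T-cell neoplasm": "lymphoid neoplasm",
    "mature B-cell neoplasm": "lymphoid neoplasm",
}

# Hypothetical tumour records: (species, record id, diagnosis term).
RECORDS = [
    ("mouse", "m1", "small T-cell lymphoma"),
    ("human", "h1", "mycosis fungoides"),
    ("human", "h2", "follicular lymphoma"),
]


def subsumed_by(term, ancestor):
    """Return True if `term` equals `ancestor` or lies below it."""
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)
    return False


def related_records(query_term, records):
    """Return records coded to any term under the query term's direct parent."""
    parent = IS_A.get(query_term)
    if parent is None:
        return []
    return [record for record in records if subsumed_by(record[2], parent)]


hits = related_records("small T-cell lymphoma", RECORDS)
```

Under these assumptions the query for the mouse-specific STL returns the mouse record and the human record coded to another mature T-cell neoplasm, but not the B-cell lymphoma record.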
One approach to bridging the nomenclature gap between species is to make use of the UMLS resource of the National Library of Medicine (NLM). The UMLS thesaurus (Bodenreider, 2004) is a large and well-curated resource of terms and synonyms that can be used for semantic mapping between terminologies. This was used by Osborne et al. (Osborne et al., 2009) to annotate the human genome to the DO ontology and has proved a valuable approach to cross-mapping the DO, MP and MPATH (Marquet et al., 2007). However, apart from what might be described as the ‘straightforward’ compatibility problem, there is a more complex problem that needs to be considered: that of the composition of disease terms themselves.
An alternative way to represent phenotype: the E+Q approach
It is clear from the preceding discussion that a major problem in describing phenotypes and diseases is that many of the terms that are commonly used to describe them are complex and subsume a multitude of meanings. This is both a problem for cross-linking phenotype and disease, and restrictive computationally. The MP ontology, for example, only allows the description of abnormal phenotypes and does not allow quantitative descriptions. An alternative approach is to break down complex pre-composed terms into their constituent logical parts, an approach known as the E+Q (entity plus quality) approach, which is used in the capture of raw mouse phenotype data (Bard and Rhee, 2004; Gkoutos et al., 2004; Gkoutos et al., 2005; Mungall et al., 2007; Beck et al., 2009). The E+Q syntax uses a combination of relevant descriptive ontologies. It represents entities (E), such as anatomical structures or chemical compounds, using ontologies such as MA, the Foundational Model of Anatomy (FMA) (Rosse and Mejino, 2003) and Chemical Entities of Biological Interest (ChEBI) (Degtyarenko et al., 2008), etc., and represents the qualities (Q) inhering in the entities, such as colour, size or shape, using the Phenotype and Trait Ontology, PATO (www.obofoundry.org/cgi-bin/detail.cgi?id=quality). The combination of E and Q terms can then be used to represent either traits (e.g. ‘tail+length’) or phenotypes (e.g. ‘tail+long’); within PATO, ‘long’ in this example is a child of ‘length’, so that the trait is implicit in the phenotype (Gkoutos et al., 2004; Gkoutos et al., 2005; Beck et al., 2009). The basic E+Q syntax can be extended to increase expressivity to include E2, which is an additional optional entity type for relational qualities, and the modifier M: E+Q+E2+M.
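One possible way to hold an E+Q statement in code, and to recover the trait that is implicit in a phenotype, is sketched below. The `EQ` class, its field names and the two-entry `QUALITY_PARENT` table are assumptions made for illustration; only the ‘long’ is a child of ‘length’ relationship is taken from the text, and plain labels are used in place of ontology identifiers.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical fragment of a quality hierarchy: value -> parent attribute.
QUALITY_PARENT = {
    "long": "length",   # stated in the text
    "short": "length",  # assumed for illustration
}


@dataclass(frozen=True)
class EQ:
    """An entity + quality statement, with the optional E2 and M extensions."""
    entity: str
    quality: str
    related_entity: Optional[str] = None  # E2, for relational qualities
    modifier: Optional[str] = None        # M

    def trait(self):
        """Return the trait statement implied by this phenotype.

        A quality with no recorded parent is assumed to be a trait already.
        """
        return EQ(self.entity, QUALITY_PARENT.get(self.quality, self.quality))


phenotype = EQ("tail", "long")
implied_trait = phenotype.trait()
```

Here `implied_trait` equals `EQ("tail", "length")`, so two animals annotated with ‘tail+long’ and ‘tail+short’ can both be found by a query on the trait ‘tail+length’.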
The E+Q approach is referred to as ‘post-composition’, reflecting the composition of compound terms from components. It is used in the EuroPhenome mouse phenotype database (www.europhenome.org/) to describe raw phenotype data from high-throughput phenotyping experiments (Beck et al., 2009; Morgan et al., 2010), and in some model organism databases such as ZFIN and FlyBase (Drysdale, 2008; Sprague et al., 2008). For example, to describe the phenotype of Sox9 mutants, MGI uses the pre-composed term MP:0005587 (abnormal Meckel’s cartilage), whereas ZFIN uses the E+Q approach – entity, ZFA:0001205 (Meckel’s cartilage); quality, PATO:0000587 (decreased size).
The E+Q approach can be used to provide a ‘logical definition’ of a pre-composed ontology term. Applying a decomposition process to pre-composed terms in principle allows terms with different names to be linked via shared logical definitions, a process that could be used to link phenotypes across species or, in principle, phenotypes to diseases. Using this approach, Mungall and co-workers (Mungall et al., 2010) recently reported the association of 8285 classes from four species-specific ontologies to E+Q definitions, using a cross-species upper level ontology of anatomy, Uberon (Haendel et al., 2009). Leveraging the E+Q definitions that were available for mouse, human and zebrafish phenotypes, Washington et al. (Washington et al., 2009) have been able to identify orthologous and biologically relevant genes on the basis of E+Q phenotype similarity, matching within and between species for a defined test set of genes, thereby validating the approach.
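As a rough illustration of how shared logical definitions can link differently named terms, the sketch below groups pre-composed terms whose E+Q definitions coincide once species-specific anatomy has been mapped to a common bridging class. All identifiers and definitions here are hypothetical placeholders; they are not taken from MP, ZFA, PATO or Uberon, and the matching is exact rather than reasoning over an ontology hierarchy.

```python
from collections import defaultdict

# Hypothetical mapping from species-specific anatomy terms to a bridging class.
ANATOMY_BRIDGE = {
    "mouse:meckel_cartilage": "bridge:meckel_cartilage",
    "zebrafish:meckel_cartilage": "bridge:meckel_cartilage",
    "mouse:tail": "bridge:tail",
}

# Hypothetical logical definitions: pre-composed term -> (entity, quality).
LOGICAL_DEFINITIONS = {
    "mouse_pheno:small_meckel_cartilage": ("mouse:meckel_cartilage", "decreased size"),
    "fish_pheno:reduced_meckel_cartilage": ("zebrafish:meckel_cartilage", "decreased size"),
    "mouse_pheno:long_tail": ("mouse:tail", "long"),
}


def link_by_definition(definitions, bridge):
    """Group terms that share the same (bridged entity, quality) definition."""
    groups = defaultdict(set)
    for term, (entity, quality) in definitions.items():
        # Fall back to the original entity if no bridging class is known.
        key = (bridge.get(entity, entity), quality)
        groups[key].add(term)
    # Only definitions shared by two or more terms constitute a link.
    return {key: terms for key, terms in groups.items() if len(terms) > 1}


for definition, terms in link_by_definition(LOGICAL_DEFINITIONS, ANATOMY_BRIDGE).items():
    print(definition, sorted(terms))
```

A practical implementation would also need to match terms whose qualities or entities are related by subsumption rather than identity, which this sketch does not attempt.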
The relationship between pre- and post-composed ontologies is additionally advantageous as pre-composed ontologies, such as MP, are ‘human-readable’, whereas post-composed ontologies are better for computational analysis. An example of this is the EuroPhenome database (Beck et al., 2009; Morgan et al., 2010). Here, quantitative parameters for specific phenotypic assays are stored in the database. Mutant cohorts are then compared with control cohort data and statistically abnormal lines are annotated dynamically to E+Q statements of phenodeviance using preset parameters. Logical definitions then allow E+Q statements to be translated into pre-composed MP terms. Both quantitative and qualitative data can be represented in this way, and representation in MP allows the data to be queried in a consistent and transparent way that offers a powerful paradigm for the annotation and computational analysis of mutant phenotype data.
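The flow from quantitative measurement to E+Q statement to pre-composed term can be sketched as follows. This is a schematic under stated assumptions, not the EuroPhenome implementation: the z-score test and its threshold are stand-ins for whatever statistics and preset parameters the database actually applies, and the measurements, function name and lookup table are invented for illustration.

```python
from statistics import mean, stdev


def call_phenodeviance(entity, trait, control, mutant, threshold=2.0):
    """Return an (entity, quality) pair, or None if the cohort is not deviant.

    Stand-in test: the mutant mean is compared with the control mean in
    units of the control standard deviation.
    """
    spread = stdev(control)
    if spread == 0:
        raise ValueError("control cohort shows no variation")
    z = (mean(mutant) - mean(control)) / spread
    if z >= threshold:
        return (entity, f"increased {trait}")
    if z <= -threshold:
        return (entity, f"decreased {trait}")
    return None


# Hypothetical logical definitions used to translate E+Q statements into
# human-readable pre-composed labels (not real MP terms).
PRECOMPOSED = {
    ("tail", "increased length"): "long tail",
    ("tail", "decreased length"): "short tail",
}

control_lengths = [80.0, 82.0, 79.0, 81.0, 78.0]   # invented measurements
mutant_lengths = [90.0, 92.0, 91.0]

statement = call_phenodeviance("tail", "length", control_lengths, mutant_lengths)
print(statement, "->", PRECOMPOSED.get(statement))
```

Keeping the statistical call separate from the term lookup means the same E+Q statement can be produced from either quantitative or qualitative data before translation into a pre-composed term.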
Disaggregation of disease entities
In principle, the ontology decomposition approach described above might be used to map phenotypes to diseases, and phenotypes between species. However, as discussed above, the term ‘phenotype’ is used to encompass a multitude of logically disparate entities. This is especially true with the terms that are commonly used to describe diseases in humans. Human diseases are complex collections of phenotypic observations and pathological processes, and a diagnosis involves establishing the presence of a set of phenotypes, which is often probabilistic. For example, Beckwith-Wiedemann syndrome is defined by the simultaneous presence in the proband of all of the three most common phenotypes (macroglossia, anterior abdominal wall defect and overgrowth), or two of these phenotypes combined with five of more than a dozen other manifestations (Elliott et al., 1994; Cooper et al., 2005).
As long as we do not have a formalism to capture the probabilistic phenotypic elements of diseases – often the underlying observations used by clinicians as diagnostic criteria – high-level disease terms will be difficult to use for detecting overlaps between diseases and between phenotypes in different species. Additional elements also need to be captured to accurately record aspects of genetic disease that are used for differential diagnosis and stratification, such as the mode of inheritance, penetrance, pleiotropy, expressivity and progression. This is a challenge for the E+Q framework.
A first step towards a solution is to disaggregate disease terms into individual phenotypic components, which, in combination, make up the disease entity. An example of this is shown in Fig. 1, using the HPO. Here, the congenital heart defect tetralogy of Fallot, which is very difficult to render into a satisfactory E+Q statement directly, is broken down into its constituent endophenotypes, which are then amenable to E+Q definition. With the provision of a bridging anatomy ontology, the remaining terms used in the E+Q statements are from PATO and are species agnostic, allowing species-specific phenotype data to be traversed readily.
As an illustration of the potential utility of this disaggregation approach, we set out to search MGI for models of the tetralogy of Fallot. MP does not contain the term ‘tetralogy of Fallot’, but searching MGI with the intersection of MP:0000273 (overriding aorta), MP:0000486 (abnormal pulmonary trunk morphology), MP:0000276 (heart right ventricle hypertrophy) and MP:0008823 (abnormal membranous ventricular septum morphology) yields two mutants: the homozygous knockout of hairy/enhancer-of-split related with YRPW motif 2 (Hey2tm1Uts), which is already annotated as a model for the tetralogy of Fallot in OMIM, and the homozygous knockout of polyhomeotic-like 1 (Phc1tm1Os), which has not previously been linked to this condition.
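The intersection search described above amounts to a subset test over genotype annotations. In the minimal sketch below, the four MP identifiers are those given in the text, but the annotation table is toy data constructed to mirror the reported outcome; it is not a copy of MGI records, and the two ‘hypothetical_allele’ entries are invented.

```python
# Component phenotypes of the tetralogy of Fallot, as listed in the text.
TOF_COMPONENTS = {
    "MP:0000273",  # overriding aorta
    "MP:0000486",  # abnormal pulmonary trunk morphology
    "MP:0000276",  # heart right ventricle hypertrophy
    "MP:0008823",  # abnormal membranous ventricular septum morphology
}

# Toy genotype -> annotated MP terms table (illustrative, not MGI data).
ANNOTATIONS = {
    "Hey2tm1Uts": set(TOF_COMPONENTS),
    "Phc1tm1Os": set(TOF_COMPONENTS),
    "hypothetical_allele_A": {"MP:0000273", "MP:0000276"},
    "hypothetical_allele_B": {"MP:0000486"},
}


def find_models(components, annotations):
    """Return genotypes annotated with every component phenotype."""
    return sorted(g for g, terms in annotations.items() if components <= terms)


def coverage(components, annotations):
    """Fraction of component phenotypes matched by each genotype."""
    return {g: len(components & terms) / len(components)
            for g, terms in annotations.items()}


print(find_models(TOF_COMPONENTS, ANNOTATIONS))
print(coverage(TOF_COMPONENTS, ANNOTATIONS))
```

The coverage function is included because a strict intersection discards partial matches, which may still be of interest when a mutant has been phenotyped incompletely. A search against real annotations would also need to count annotations to more specific child terms of each component.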
This is a relatively straightforward example. However, disease description is a complex domain and disaggregating disease terms requires expert input if the disaggregations are to accurately reflect the clinical nature of the disease. Although automatic approaches, such as the ones used in the HPO, are a great advance, the cooperation of experts in individual disease areas is needed to produce a well-founded, ‘post-composed’ disease ontology.
Ontologies and description frameworks for capturing data on disease and phenotype are essential tools to support mouse functional genomics and, in a broader context, the assignment of functions to genes. The tools that are currently available are still in the early stages of development and may need to be applied in new ways to fully serve the requirements of cross-species phenotype mapping. Even a preliminary attempt to implement existing ontologies in the E+Q framework demonstrates the need for more terms to describe measured entities, both in humans and in mice; a mammalian trait ontology, for example, would be of great utility. Another area in need of development is that of the non-anatomical phenotype traits, notably behaviour. In humans, it will not always be possible to obtain or record measurements with the same completeness or precision as with mice in a laboratory setting, although some clinical biobanking projects approach this. In many cases, phenotype descriptions from the literature will inevitably be only qualitative, if only because they constitute legacy data. The power of the decompositional approach is that it is applicable to both qualitative and quantitative data and, in either case, lends itself to computational analysis. The difficulty and amount of labour necessary to implement effective cross-species ontologies are daunting, but success will yield valuable insights from model organisms.
This work was funded by the Commission of the European Community Contract number LSHG-CT-2006-037811; CASIMIR. J.P.S. acknowledges support of the National Institutes of Health (CA089713) and the Ellison Medical Foundation. The authors thank Prof. Jonathan Bard, Prof. Janan Eppig, Dr Peter Robinson and Dr Anita Burgun for discussions and for helpful comments on the manuscript. Deposited in PMC for release after 12 months.
The authors declare no competing financial interests.