Model organisms remain a cornerstone of human disease research. However, extrapolating genotype–phenotype relationships from humans to model species, and vice versa, is still a significant challenge for clinical translation. Although identifying human disease-causing alleles by phenotypic analogy in the mouse has been exceptionally successful, a significant gap persists. Mapping human disease-associated phenotypes in other established model organisms has the potential to address this gap. To evaluate whether established models, such as zebrafish, fruit fly and fission yeast, could complement the phenotypic annotations for human disease-causing genes, Hoehndorf and colleagues analysed model organism database information with machine learning methods to measure how much the reported phenotypes contribute to the identification of human disease-associated genes.

The authors collected data on phenotypes associated with loss-of-function mutations using two phenotype ontologies, uPheno and Pheno-e, and combined these annotations with those of human Mendelian diseases to test whether, and to what extent, different model organisms could contribute to the phenotype-based computational discovery of disease-associated genes. Despite the long and successful history of modelling human genetics in Drosophila, zebrafish and yeast, the authors conclude that only the mouse consistently predicts disease-causing genes based on phenotype ontology annotations. Importantly, their thoroughly tested analysis uncovered biases in how human disease-associated orthologues are annotated in model organism databases, as well as issues with how phenotype-based computational methods then use these annotations. To support the disease modelling community's future work, the authors endeavoured to correct some of these biases across databases.

Although the use of machine learning to infer human disease-causing genes from model organism phenotypes is a fairly new development, it holds immense potential to transform our understanding of human disease. A thorough understanding of the data fed into the algorithms will help to ensure that this potential is realised.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution and reproduction in any medium provided that the original work is properly attributed.