Ben Stear1, MS; Taha Mohseni Ahooyi1, PhD; Shubha Vasisht1; Jonathan Silverstein3,4, MD, MS, FACS, FACMI; Tiffany Callahan5, PhD; Deanne Taylor1,2, PhD. 1Department of Biomedical and Health Informatics, Children's Hospital of Philadelphia; 2Perelman School of Medicine, University of Pennsylvania; 3Health Sciences and Institute for Precision Medicine, University of Pittsburgh; 4Department of Biomedical Informatics, University of Pittsburgh; 5Anschutz Medical Campus, University of Colorado Denver
Poster Not on Display
Background: Complex genetic diseases may result from multiple interacting variations that affect gene function. Current annotation pipelines typically assess the deleterious nature of one variation at a time, reflecting Mendelian models of gene inheritance and disease effect. A common approach to studying multi-gene contributions to disease is to use "gene sets" from various sources. Integrated datasets with deeply typed ontological categorizations can provide a richer background for determining multi-genic functional relationships. This approach allows the identification of new gene sets related by functional, semantic and categorical links that are not easily accessible without massive data integration. As a result, we can explore the effects of new multi-gene interactions in complex genetic diseases.

Methods: The Unified Medical Language System (UMLS) is a large repository of biomedical ontologies, vocabularies, and relationships. We "bring the data to the ontologies" by mapping quantitative data from various sources into the UMLS, which we model as a property graph on the Neo4j graph database platform. Into the ontology systems native to the UMLS knowledge graph (UMLS-KG) framework, we integrate additional ontologies, several quantitative datasets, and gene-to-phenotype mappings across and between the mouse and human genomes.

Results: We integrated data from over a dozen sources into a Neo4j property graph implementation of the UMLS. This knowledge graph implementation allows for fast, complex queries across a wide range of biomedical terms and quantitative data. Our integration has produced a knowledge graph with approximately 50 million nodes and 160 million relationships. We discuss the resulting graph characteristics and the query results from this massively complex ontological, categorical and quantitative data integration.
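
Illustrative sketch: the property-graph modeling described in Methods can be pictured with a small in-memory example. This is not the UMLS-KG schema or the Neo4j API. The class `PropertyGraph`, the node identifiers (`GENE_A`, `PHENO:1`, `EXPR:1`), the relationship types (`HAS_PHENOTYPE`, `ORTHOLOG_OF`, `HAS_MEASUREMENT`) and the expression value are all hypothetical, chosen only to show how quantitative data and cross-species mappings might hang off ontology concepts, and how a gene set could be derived from shared phenotype links.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny in-memory stand-in for a property graph (not Neo4j itself)."""

    def __init__(self):
        self.nodes = {}       # node id -> {"label": str, "props": dict}
        self.out_edges = {}   # source id -> [(rel_type, target id, props)]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_rel(self, source, rel_type, target, **props):
        self.out_edges.setdefault(source, []).append((rel_type, target, props))

    def neighbors(self, source, rel_type):
        """Targets reachable from `source` over one relationship of `rel_type`."""
        return [t for (r, t, _) in self.out_edges.get(source, []) if r == rel_type]


def build_toy_graph():
    """Build a toy graph; every identifier and value here is hypothetical."""
    g = PropertyGraph()
    # Ontology-side concept nodes.
    g.add_node("PHENO:1", "Concept", name="phenotype one")
    g.add_node("PHENO:2", "Concept", name="phenotype two")
    for gene in ("GENE_A", "GENE_B", "GENE_C"):
        g.add_node(gene, "Concept", name=gene, species="human")
    g.add_node("mGene_a", "Concept", name="mGene_a", species="mouse")
    # Gene-to-phenotype mappings and a mouse-to-human ortholog link.
    g.add_rel("GENE_A", "HAS_PHENOTYPE", "PHENO:1")
    g.add_rel("GENE_B", "HAS_PHENOTYPE", "PHENO:1")
    g.add_rel("GENE_C", "HAS_PHENOTYPE", "PHENO:2")
    g.add_rel("mGene_a", "ORTHOLOG_OF", "GENE_A")
    g.add_rel("mGene_a", "HAS_PHENOTYPE", "PHENO:2")
    # Quantitative data attached to a concept ("data brought to the ontology").
    g.add_node("EXPR:1", "Measurement", value=12.5, unit="TPM", tissue="heart")
    g.add_rel("GENE_A", "HAS_MEASUREMENT", "EXPR:1")
    return g


def gene_sets_by_phenotype(g):
    """Group human genes by phenotype, mapping mouse genes to their orthologs."""
    sets = defaultdict(set)
    for source in g.out_edges:
        orthologs = g.neighbors(source, "ORTHOLOG_OF")
        genes = orthologs or [source]   # a mouse gene contributes its human ortholog
        for pheno in g.neighbors(source, "HAS_PHENOTYPE"):
            sets[pheno].update(genes)
    return dict(sets)


if __name__ == "__main__":
    graph = build_toy_graph()
    for phenotype, genes in sorted(gene_sets_by_phenotype(graph).items()):
        print(phenotype, sorted(genes))
```

In this toy graph the mouse phenotype annotation places `GENE_A` in the same set as `GENE_C`, a grouping that neither human annotation alone would produce. In a graph database, the same traversal would typically be written as a pattern-matching query rather than a Python loop.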