  • mabc307

Petagraph: A biomedical knowledge graph built into the UMLS

Updated: Sep 29, 2022

Ben Stear (1), MS; Taha Mohseni Ahooyi (1), PhD; Shubha Vasisht (1); Jonathan Silverstein (3,4), MD, MS, FACS, FACMI; Tiffany Callahan (5), PhD; Deanne Taylor (1,2), PhD.

1. Department of Biomedical and Health Informatics, Children's Hospital of Philadelphia
2. Perelman School of Medicine, University of Pennsylvania
3. Health Sciences and Institute for Precision Medicine, University of Pittsburgh
4. Department of Biomedical Informatics, University of Pittsburgh
5. Anschutz Medical Campus, University of Colorado Denver


Background: Complex genetic diseases may be a consequence of multiple interacting variations that affect gene function. Current annotation pipelines for determining the effect of genetic variation typically measure the deleterious nature of one variation at a time, reflecting Mendelian models of gene inheritance and disease effect. A common approach to studying multi-gene contributions to disease is to use "gene sets" from various sources. Integrated datasets with deeply typed ontological categorizations can provide a richer background for determining multi-genic functional relationships. This approach allows the identification of new gene sets related by functional, semantic, and categorical links that are not easily accessible without massive data integration. As a result, we can explore the effects of new multi-gene interactions in complex genetic diseases.

Methods: The Unified Medical Language System (UMLS) is a large repository of biomedical ontologies, vocabularies, and relationships. We "bring the data to the ontologies" by mapping quantitative data from various sources into the UMLS, which we model as a property graph on the Neo4j graph database platform. Into the ontology systems native to the UMLS knowledge graph (UMLS-KG) framework we integrate additional ontologies, several quantitative datasets, and gene-to-phenotype mappings across and between the mouse and human genomes.

Results: We integrated data from over a dozen sources on a Neo4j property graph implementation of the UMLS. This knowledge graph implementation allows for fast, complex queries across a wide range of biomedical terms and quantitative data. Our integration has produced a knowledge graph with approximately 50 million nodes and 160 million relationships. We discuss the resulting graph characteristics and the query results from this massively complex ontological, categorical, and quantitative data integration.
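As a rough illustration of the property-graph idea described in the Methods, the sketch below builds a tiny in-memory graph in plain Python and derives a human gene set by following gene-to-phenotype and mouse-to-human links. It does not use Neo4j or the UMLS, and it is not the Petagraph schema: the `PropertyGraph` class, the node labels (`MouseGene`, `HumanGene`, `Phenotype`), the relationship types (`HAS_PHENOTYPE`, `ORTHOLOG_OF`), and all identifiers such as `PHENO_X` are hypothetical placeholders chosen for this example.

```python
from collections import defaultdict


class PropertyGraph:
    """Minimal in-memory property graph (hypothetical, for illustration only)."""

    def __init__(self):
        # node_id -> {"label": str, "props": dict}
        self.nodes = {}
        # source node_id -> list of (relationship type, target node_id, props)
        self.out_edges = defaultdict(list)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_rel(self, source, rel_type, target, **props):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.out_edges[source].append((rel_type, target, props))

    def neighbors(self, node_id, rel_type):
        """Targets reachable from node_id over one relationship of rel_type."""
        return [t for (r, t, _) in self.out_edges[node_id] if r == rel_type]


def human_genes_for_mouse_phenotype(graph, phenotype_id):
    """Collect human orthologs of mouse genes annotated with a phenotype.

    Traversal pattern: (MouseGene)-[HAS_PHENOTYPE]->(Phenotype) and
    (MouseGene)-[ORTHOLOG_OF]->(HumanGene). All names are assumptions.
    """
    result = set()
    for gene_id, node in graph.nodes.items():
        if node["label"] != "MouseGene":
            continue
        if phenotype_id in graph.neighbors(gene_id, "HAS_PHENOTYPE"):
            result.update(graph.neighbors(gene_id, "ORTHOLOG_OF"))
    return sorted(result)


# Placeholder data only; these are not real genes, phenotypes, or annotations.
g = PropertyGraph()
g.add_node("PHENO_X", "Phenotype", name="placeholder phenotype")
g.add_node("PHENO_Y", "Phenotype", name="another placeholder phenotype")
g.add_node("mouse_gene_a", "MouseGene")
g.add_node("mouse_gene_b", "MouseGene")
g.add_node("mouse_gene_c", "MouseGene")
g.add_node("HUMAN_GENE_A", "HumanGene")
g.add_node("HUMAN_GENE_B", "HumanGene")
g.add_node("HUMAN_GENE_C", "HumanGene")
g.add_rel("mouse_gene_a", "HAS_PHENOTYPE", "PHENO_X")
g.add_rel("mouse_gene_b", "HAS_PHENOTYPE", "PHENO_X")
g.add_rel("mouse_gene_c", "HAS_PHENOTYPE", "PHENO_Y")
g.add_rel("mouse_gene_a", "ORTHOLOG_OF", "HUMAN_GENE_A")
g.add_rel("mouse_gene_b", "ORTHOLOG_OF", "HUMAN_GENE_B")
g.add_rel("mouse_gene_c", "ORTHOLOG_OF", "HUMAN_GENE_C")

gene_set = human_genes_for_mouse_phenotype(g, "PHENO_X")
print(gene_set)
```

In a graph database such as Neo4j, a traversal of this kind would normally be written as a single pattern-matching query rather than a Python loop; the loop here only makes the path being followed explicit.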
