Poster #92 - Cong Liu(2)
- vitod24
- Oct 20
Evaluating the Cost and Feasibility of LLM-Based HPO Term Extraction in Clinical Genomics
Cong Liu PhD, Department of Pediatrics, Division of Genetics and Genomics, Boston Children's Hospital
Human Phenotype Ontology (HPO) terms are essential for variant interpretation in whole-exome and whole-genome sequencing (WES/WGS) analysis and reanalysis. Natural language processing (NLP) tools have been widely used to extract HPO terms from clinical notes in electronic health records (EHRs), streamlining automated reanalysis pipelines. Recent advances in large language models (LLMs) have further improved the accuracy and efficiency of HPO term extraction. At Boston Children's Hospital, we conducted a pilot study to evaluate the cost-effectiveness of deploying an LLM-powered pipeline for regular WES/WGS reanalysis. In collaboration with OpenAI, we deployed a HIPAA-compliant GPT model within a secure internal Azure cloud environment. The pipeline prompts the model (GPT-4.1 assistant) to extract HPO terms and IDs from clinical notes and return structured JSON output. In post-processing, final HPO IDs are confirmed through stepwise validation against a pre-indexed vector store: first requiring both the name and the ID to match, then accepting exact name matches, then semantic matches based on cosine similarity, and finally falling back to ID-only matches. We tested the pipeline on over 1,200 clinical notes from three patients. The median input note contained 184 tokens (excluding static cached instructions) and produced a median output of 81 tokens. Although many notes yielded no extractable information, aggregation across patients produced up to hundreds of unique HPO terms. Processing averaged ~6 seconds per patient for all of that patient's notes, with an estimated cost of $0.60 per patient (plus <$0.10 for cached instruction input). The process can be further optimized and parallelized to increase speed and reduce costs. Our results indicate that deploying an LLM-powered infrastructure for large-scale, incremental data processing is both feasible and economically viable.
However, processing an entire EHR system at scale (millions of patients) would require substantial computational resources for the initial pass.
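
The stepwise validation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The four-entry index, the function names (`embed`, `cosine`, `validate`), the character-trigram "embedding" and the 0.6 similarity threshold are all assumptions made for illustration. A real pipeline would query a vector store built from the full HPO release using a proper embedding model.

```python
import math
from collections import Counter

# Tiny illustrative index (assumption); a real deployment would load the full HPO release.
HPO_INDEX = {
    "HP:0001250": "Seizure",
    "HP:0001263": "Global developmental delay",
    "HP:0000252": "Microcephaly",
    "HP:0004322": "Short stature",
}
NAME_TO_ID = {name.lower(): hpo_id for hpo_id, name in HPO_INDEX.items()}


def embed(text):
    """Toy stand-in for an embedding model: character-trigram counts."""
    padded = f"  {text.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def validate(name, hpo_id, threshold=0.6):
    """Return (validated_id, step) for one LLM-extracted pair, or (None, 'rejected')."""
    key = (name or "").lower().strip()
    # Step 1: the name and the ID both match the same index entry.
    if hpo_id in HPO_INDEX and HPO_INDEX[hpo_id].lower() == key:
        return hpo_id, "name+id"
    # Step 2: exact name match; the indexed ID replaces the model's ID.
    if key in NAME_TO_ID:
        return NAME_TO_ID[key], "exact-name"
    # Step 3: closest indexed name by cosine similarity (threshold is an assumption).
    if key:
        query = embed(key)
        best_id, best_score = max(
            ((i, cosine(query, embed(n))) for i, n in HPO_INDEX.items()),
            key=lambda pair: pair[1],
        )
        if best_score >= threshold:
            return best_id, "semantic"
    # Step 4: fall back to the ID alone if it exists in the index.
    if hpo_id in HPO_INDEX:
        return hpo_id, "id-only"
    return None, "rejected"
```

For example, `validate("Seizures", "HP:9999999")` resolves to `HP:0001250` at the semantic step under this toy similarity, and a pair whose name and ID both fail every check is rejected. Placing the ID-only fallback last reflects the ordering given in the abstract, which trusts term names over model-generated IDs.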

