Poster #93 - Youssef Mokssit

  • vitod24
  • Oct 20

Optimizing Clinical LLM Agents: Design Choices for FHIR and Rare Disease Diagnosis Workflows


Youssef Mokssit, MS; Cong Liu, PhD; Junyoung Kim, MS; Mengshu Nie, MS

Affiliation: Department of Pediatrics, Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA, USA


This project systematically evaluates which architectural choices most affect the effectiveness of large language model (LLM)-based clinical agents in genomics and rare disease workflows. We unify two workflows within a single, modular evaluation framework: a FHIR agent that executes complex, free-text-driven FHIR tasks, and a diagnosis agent for rare disease differential diagnosis. Before evaluation, both agents undergo an "onboarding" phase in which they are trained on a set of tasks and evaluated against gold-standard outcomes and tool usage patterns. This process yields textual gradients, which are used to optimize the system prompts, and reflections, which the agents can later retrieve through dedicated memory MCP servers.

The evaluation proceeds in three stages, with the best-performing configuration from each stage carried into the next, which lets us isolate the impact of each architectural choice. First, we benchmark the agents under three feedback paradigms: a baseline with no learned feedback, Reflexion (verbal reinforcement based on long-term memory), and TextGrad (system prompt optimization). Second, we compare tool access mechanisms by evaluating each agent with a dedicated MCP tools server versus direct REST API-based function calling. Finally, we compare three agent collaboration paradigms: a baseline agent, a single-agent ReAct system that combines planning and execution, and a multi-agent plan-and-execute system with dedicated planning and execution agents. For the FHIR workflow, the baseline is a straightforward tool-calling agent with no explicit planning; for the diagnosis workflow, the baseline mirrors the deterministic, hard-coded planning and execution logic of the DeepRare workflow.
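The stage-wise design described above, in which each stage's winner is fixed before the next stage is evaluated, can be sketched as a greedy search over configurations. This is only an illustration under stated assumptions: the names `STAGES`, `DEFAULTS`, `select_stagewise`, `TOY_SCORES`, and `toy_evaluate` are hypothetical, the starting defaults are a guess, and the toy scores are arbitrary placeholders, not results from the poster.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]

# Hypothetical stage definitions; option labels paraphrase the abstract.
STAGES: List[Tuple[str, List[str]]] = [
    ("feedback", ["none", "reflexion", "textgrad"]),
    ("tool_access", ["mcp_server", "rest_function_calling"]),
    ("collaboration", ["baseline", "react_single_agent", "plan_and_execute_multi_agent"]),
]

# Assumed starting configuration for dimensions not yet under test.
DEFAULTS: Config = {
    "feedback": "none",
    "tool_access": "mcp_server",
    "collaboration": "baseline",
}


def select_stagewise(
    stages: List[Tuple[str, List[str]]],
    evaluate: Callable[[Config], float],
    defaults: Config,
) -> Tuple[Config, List[Tuple[str, Dict[str, float]]]]:
    """Greedy stage-wise selection: vary one dimension at a time and
    carry the best-scoring option into all later stages."""
    config = dict(defaults)
    history: List[Tuple[str, Dict[str, float]]] = []
    for name, options in stages:
        scores: Dict[str, float] = {}
        for option in options:
            candidate = {**config, name: option}  # change only this stage's dimension
            scores[option] = evaluate(candidate)
        config[name] = max(scores, key=scores.get)  # ties go to the first option listed
        history.append((name, scores))
    return config, history


# Arbitrary placeholder scores so the sketch runs; NOT measured results.
TOY_SCORES: Dict[str, float] = {
    "none": 1.0, "reflexion": 3.0, "textgrad": 2.0,
    "mcp_server": 2.0, "rest_function_calling": 1.0,
    "baseline": 1.0, "react_single_agent": 2.0, "plan_and_execute_multi_agent": 3.0,
}


def toy_evaluate(config: Config) -> float:
    """Stand-in for running an agent on a benchmark and averaging its scores."""
    return sum(TOY_SCORES[value] for value in config.values())


if __name__ == "__main__":
    best, history = select_stagewise(STAGES, toy_evaluate, DEFAULTS)
    print(best)
    for stage, scores in history:
        print(stage, scores)
```

A greedy search of this kind evaluates 3 + 2 + 3 = 8 configurations rather than the 3 × 2 × 3 = 18 of a full grid, at the cost of missing interactions between choices made in different stages.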
Preliminary results from the first stage indicate that, for the FHIR agent, Reflexion performs best with an average task success rate of 0.37, followed closely by TextGrad (0.35); both marginally outperform the baseline (0.34). The diagnosis agent shows the reverse pattern: the baseline performs best, with an average LLM-evaluated diagnosis similarity score of 0.5, compared with 0.35 for Reflexion and 0.3 for TextGrad.
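To put the reported first-stage scores on a common footing, the short sketch below computes each method's absolute and relative difference from its baseline. The scores are the ones quoted above; the dictionary keys and the helper name `deltas_vs_baseline` are hypothetical labels introduced for illustration.

```python
from typing import Dict, Tuple

# Stage-one averages as reported in the abstract.
REPORTED: Dict[str, Dict[str, float]] = {
    "fhir_task_success": {"baseline": 0.34, "reflexion": 0.37, "textgrad": 0.35},
    "diagnosis_similarity": {"baseline": 0.5, "reflexion": 0.35, "textgrad": 0.3},
}


def deltas_vs_baseline(scores: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    """Return (absolute difference, relative difference in percent) per method."""
    base = scores["baseline"]
    return {
        name: (round(score - base, 2), round(100 * (score - base) / base, 1))
        for name, score in scores.items()
        if name != "baseline"
    }


if __name__ == "__main__":
    for metric, scores in REPORTED.items():
        print(metric, deltas_vs_baseline(scores))
```

On these numbers, the FHIR gains are small (+0.03 for Reflexion and +0.01 for TextGrad, roughly 9% and 3% relative), while the diagnosis drops are much larger (-0.15 and -0.20, or 30% and 40% relative). The abstract reports no variance or confidence intervals, so these differences are descriptive only.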



 
 
 
