Poster #93 - Youssef Mokssit
- Oct 20
Optimizing Clinical LLM Agents: Design Choices for FHIR and Rare Disease Diagnosis Workflows
Youssef Mokssit, MS; Cong Liu, PhD; Junyoung Kim, MS; Mengshu Nie, MS
Affiliation: Department of Pediatrics, Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA, USA
This project aims to systematically evaluate the architectural choices that most affect the effectiveness of large language model (LLM)-based clinical agents in genomics and rare disease workflows. We unify two workflows within a single, modular evaluation framework: a FHIR agent capable of executing complex, free-text-driven FHIR tasks, and a diagnosis agent for rare disease differential diagnosis.

Prior to evaluation, both agents undergo an "onboarding" phase in which they are trained on a set of tasks and evaluated against gold-standard outcomes and tool usage patterns. This process yields textual gradients used to optimize the system prompts, as well as reflections that are subsequently accessible via dedicated memory MCP servers.

The evaluation framework proceeds in three stages, where the best-performing configuration in each stage is adopted in subsequent stages, enabling us to isolate the impact of distinct architectural choices. First, we benchmark agents under three feedback paradigms: a baseline with no learned feedback, Reflexion (verbal, long-term memory-based reinforcement), and TextGrad (system prompt optimization). Second, we compare tool access mechanisms by evaluating each agent's performance when operating through a dedicated MCP tools server versus direct REST API-based function calling. Finally, we evaluate agent collaboration paradigms by comparing three approaches: a baseline agent, a single-agent ReAct system that combines planning and execution, and a multi-agent plan-and-execute system with dedicated planning and execution agents. The baseline for the FHIR workflow is a straightforward tool-calling agent with no explicit planning; for the diagnosis workflow, the baseline mirrors the deterministic, hard-coded planning and execution logic of the DeepRare workflow.
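The Reflexion-style onboarding loop described above can be sketched as follows. This is a minimal illustration only: the function names, the memory structure, and the stub agent/evaluator are assumptions for exposition, not the authors' implementation (which uses real LLM calls and a memory MCP server rather than an in-process store):

```python
from dataclasses import dataclass, field

@dataclass
class ReflexionMemory:
    """Long-term store of verbal reflections, keyed by task type (hypothetical)."""
    reflections: dict = field(default_factory=dict)

    def add(self, task_type: str, reflection: str) -> None:
        self.reflections.setdefault(task_type, []).append(reflection)

    def recall(self, task_type: str) -> list:
        return self.reflections.get(task_type, [])

def onboard(agent_step, evaluate, reflect, tasks, memory: ReflexionMemory) -> float:
    """One onboarding pass: attempt each task with prior reflections as context,
    score against the gold standard, and store a reflection on imperfect attempts.
    Returns the average score over the task set."""
    scores = []
    for task in tasks:
        hints = memory.recall(task["type"])       # prior verbal feedback, if any
        answer = agent_step(task, hints)
        score = evaluate(answer, task["gold"])
        scores.append(score)
        if score < 1.0:                           # imperfect attempt: reflect and remember
            memory.add(task["type"], reflect(task, answer, score))
    return sum(scores) / len(scores)

# Toy demonstration with deterministic stubs (no real LLM or FHIR server).
memory = ReflexionMemory()
tasks = [{"type": "fhir", "prompt": "find patient", "gold": "Patient/123"}]
agent = lambda task, hints: task["gold"] if hints else "Patient/456"
evaluate = lambda answer, gold: 1.0 if answer == gold else 0.0
reflect = lambda task, answer, score: f"Wrong resource for: {task['prompt']}"

first = onboard(agent, evaluate, reflect, tasks, memory)   # fails, stores a reflection
second = onboard(agent, evaluate, reflect, tasks, memory)  # succeeds using the stored hint
```

The key design point this sketch captures is that learning is verbal and persistent: nothing about the agent's weights or prompt changes between passes, only the contents of the memory it is allowed to recall.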
Preliminary results from the first stage indicate that, for the FHIR agent, Reflexion yields the best results with an average task success rate of 0.37, followed closely by TextGrad (0.35), with both marginally outperforming the baseline (0.34). For the diagnosis agent, the pattern reverses: the baseline performs best, with an average LLM-evaluated diagnosis similarity score of 0.5, compared with 0.35 for Reflexion and 0.3 for TextGrad.

