William K Flynn, Greg Barren, Angela Agunloye, Eric Peterson, Cherise Green, Surendra Buddi
Poster # 81
Most bioinformatics research requires clinical data from an electronic medical record (EMR). One common approach is for an analyst to extract, transform, and deliver a unique dataset for each research project. This approach can present challenges for data provenance and reuse, and it can be difficult for a small team to create many unique datasets.

To address these challenges, we developed an ecosystem of tools and standards that supports delivery of standardized and custom relational data models scoped to research cohorts. So far, our approach has been used to deliver clinical data for more than 80 research projects.

Once we define a cohort, via a SQL query or a list of patients, we parameterize a SQL code generation workflow based on the project's cohort, research aims, and privacy requirements (e.g., whether the project can access protected health information (PHI)). The generated SQL code can then be further customized and version controlled. Next, we build a "staging" dataset using the generated code. The code and descriptive statistics from this staging dataset (e.g., the distribution of encounters per patient) are then reviewed by other analysts using GitHub. On approval, the dataset is encoded and delivered to a secure computational environment that can be accessed through an internet browser. For retrospective research, subsequent data pulls are versioned, reviewed, and delivered, while old versions are retained for reproducibility and migration.

Because all versions of the code and data up to the point of delivery are maintained as part of a standardized process, we ensure continuity across analysts and have improved time-to-delivery of datasets. Additionally, since the same data model is used across multiple projects, code and suggested data model improvements from one lab can often be reused in another. We will also describe new challenges presented by this approach.
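Below is a minimal sketch of what the parameterized SQL generation step could look like. All schema, table, and column names here (emr.encounter, proj81_cohort, deid_patient_key, and so on) are hypothetical illustrations, not the actual data model or workflow described in the poster.

```python
# Hypothetical sketch of cohort-scoped SQL generation. Table and column
# names are illustrative assumptions, not the authors' actual schema.
from string import Template

# One template per table in the standardized relational data model.
ENCOUNTER_TEMPLATE = Template("""
CREATE TABLE $staging_schema.encounter AS
SELECT e.$id_column, e.encounter_date, e.encounter_type
FROM emr.encounter e
JOIN $cohort_table c ON c.patient_id = e.patient_id;
""")

def generate_encounter_sql(staging_schema: str,
                           cohort_table: str,
                           phi_allowed: bool) -> str:
    # Privacy requirements parameterize the generated code: projects
    # without PHI access receive a de-identified surrogate key instead
    # of a direct patient identifier (an assumed convention).
    id_column = "patient_mrn" if phi_allowed else "deid_patient_key"
    return ENCOUNTER_TEMPLATE.substitute(
        staging_schema=staging_schema,
        cohort_table=cohort_table,
        id_column=id_column,
    )

if __name__ == "__main__":
    # Emit SQL scoped to a hypothetical project's cohort table.
    print(generate_encounter_sql("proj81_staging", "proj81_cohort",
                                 phi_allowed=False))
```

Generating the SQL as text, rather than running it directly, is what allows the code to be further customized, version controlled, and reviewed in GitHub before the staging dataset is built.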