Data Archiving for Long-Term Research Sustainability
Allison Olsen, MS, MA; Nicole Feldman, MSIS; Kelsey Zhu, MS; Juliana Pakstis, MI; Jennae Luecke, MSIS; Ene Belleh, MBA, MLS, AHIP-D
Poster # 30
The Arcus Omics Initiative at the CHOP Research Institute is establishing a central, shared repository for omics data. The Initiative was inspired by the 2023 NIH Data Management and Sharing requirements; contributing data to Arcus will help fulfill grant requirements and drive further progress in this area. Robust archiving is essential for ensuring universal access, preserving data through technological change, and promoting open participation in scientific advances. The Initiative collects genomic, transcriptomic, proteomic, and phenomic data from various research studies and sequencing providers and harmonizes the data for research use. The team works with data contributors to record metadata in a standardized structure, using protocol and manifest files to capture essential information. After data ingest and quality assurance steps, archived data undergoes bioinformatics analysis workflows that include exhaustive QA/QC measures. These measures help reduce human error and sequencing artifacts arising from library preparation or PCR amplification bias, so researchers can conduct their analyses quickly and with confidence. In the second year of the Initiative, the Arcus Omics team implemented improvements to infrastructure, ingest, and delivery workflows based on user feedback. For example, the team is evaluating and integrating an Amazon Web Services Sequencing Store to replace the current S3 selection for storing omics data. Persistent cloud storage with backup, access controls, and versioning is an improvement over individual, segregated servers. The dynamic archival process, paired with these infrastructure improvements, ensures that data is preserved and remains available for research reuse.
The long-term goal is to establish continuous collection of omics data over time, potentially merging the existing sequence data of current CHOP patients with sequences from their future descendants to create a longitudinal database.