Poster #44 - Ellyse Lai
- vitod24
- Oct 20
Harnessing Longitudinal Claims Data to Reveal Predictive Signals for Clinical Event Forecasting
Lai, E., Chen, J., & Dubrawski, A.
Ellyse Lai, Student/Research Intern, CMU Auton Lab
Jieshi Chen, Researcher, CMU Auton Lab
Artur Dubrawski, Professor, CMU Auton Lab
Introduction: Large administrative health datasets are a powerful resource for public health surveillance, but their utility is often hindered by recording complexities, such as changes in diagnostic coding systems. These changes create artificial discontinuities that may confound traditional time-series analysis. We use a data harmonization and feature engineering pipeline that overcomes these challenges to create consistent, temporally stable signals for predictive modeling. We target the severe medical outcome of amputation in order to explore a possible correlation between drug abuse and amputation.
Methods: We used nine years of non-public California patient discharge claims data, spanning the transition from ICD-9 to ICD-10 diagnosis coding. The data include diagnoses, discharge dates, and demographic details. We developed a code mapping to harmonize diagnosis codes and flag key clinical concepts, then aggregated the data into a time series of claim counts. We created 182-day rolling-window features to capture trends and to ensure there were no artificial breaks in the data. We then tested the predictive validity of these features by training an XGBoost model to forecast future amputation events.
Results: Our data harmonization produced continuous, stable time-series features that were free from artifacts of the coding system change. The resulting features demonstrated predictive power: models forecast population-level amputation trends 182 days into the future with high predictive utility (R² > 0.80). Stimulant misuse emerged as a primary leading indicator, which suggests that our data processing uncovered clinically relevant signals.
Conclusion: This work shows the potential of complex, aggregated, longitudinal healthcare data to support robust predictive modeling. By focusing on the creation of temporally consistent features, we can effectively track and forecast significant clinical events for public health surveillance.
Future work using this rich dataset will examine other possible effects of the significant increase in drug abuse, at both the population and individual levels.
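The harmonization and rolling-window steps described in Methods can be illustrated with a minimal sketch. It is not the authors' pipeline: the `CONCEPT_MAP` entries, the dotted code format, the function names, and the use of a trailing-window sum are all assumptions made for illustration, and the XGBoost training step is omitted. The sketch maps ICD-9 and ICD-10 codes to shared concepts, counts claims per day, builds trailing-window counts, and pairs each day's feature with the target value a fixed horizon later.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical mapping from (coding system, code prefix) to a shared concept.
# Illustrative entries only; a real mapping should come from a vetted crosswalk.
CONCEPT_MAP = {
    ("ICD9", "304.4"): "stimulant_misuse",
    ("ICD10", "F15"): "stimulant_misuse",
    ("ICD9", "V49.7"): "amputation",
    ("ICD10", "Z89"): "amputation",
}

def harmonize(system, code):
    """Return the shared concept for a code, or None if it is not flagged."""
    for (sys_name, prefix), concept in CONCEPT_MAP.items():
        if system == sys_name and code.startswith(prefix):
            return concept
    return None

def daily_counts(claims):
    """claims: iterable of (discharge_date, system, code) tuples.
    Returns {concept: {date: number of claims}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for day, system, code in claims:
        concept = harmonize(system, code)
        if concept is not None:
            counts[concept][day] += 1
    return counts

def rolling_sum(series, start, end, window_days=182):
    """Trailing-window sum for every day in [start, end].
    Days without claims contribute 0, so the output has no gaps."""
    out = {}
    running = 0
    day = start
    while day <= end:
        running += series.get(day, 0)
        expired = day - timedelta(days=window_days)
        if expired >= start:  # only subtract days that were actually added
            running -= series.get(expired, 0)
        out[day] = running
        day += timedelta(days=1)
    return out

def make_supervised(feature, target, horizon_days=182):
    """Pair the feature at day t with the target at day t + horizon_days."""
    rows = []
    for day in sorted(feature):
        future = day + timedelta(days=horizon_days)
        if future in target:
            rows.append((day, feature[day], target[future]))
    return rows

# Toy claims on either side of a coding-system switch (made-up records).
claims = [
    (date(2015, 9, 29), "ICD9", "304.40"),
    (date(2015, 10, 2), "ICD10", "F15.20"),
    (date(2015, 10, 3), "ICD10", "Z89.511"),
]
counts = daily_counts(claims)
stimulant = rolling_sum(counts["stimulant_misuse"], date(2015, 9, 29),
                        date(2015, 10, 5), window_days=4)
amputation = rolling_sum(counts["amputation"], date(2015, 9, 29),
                         date(2015, 10, 5), window_days=4)
pairs = make_supervised(stimulant, amputation, horizon_days=2)
```

Because both coding systems feed the same concept counter, the rolling feature runs through the switch date without a break. The rows returned by `make_supervised` are the kind of (feature, future target) pairs that a regression model could then be trained on.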

