Using Decision Trees to Predict the Clinical Isolation Source of Haemophilus influenzae Based on Pan
Updated: Sep 29, 2022
Koser, K. (2,3,4,5); Ehrlich, R. L. (1,2,3,4); Hammond, J. (1,2,3,4); Czerski, S. (1,2,3,4); Mell, J. C. (1,2,3,4,5); Earl, J. P. (1,2,3,4); Ahmed, A. (1,2,3,4); Ehrlich, G. D. (1,2,3,4,5)
Affiliations:
(1) Department of Microbiology and Immunology, Drexel University College of Medicine, Philadelphia, PA 19102
(2) Institute for Molecular Medicine & Infectious Disease, Drexel University College of Medicine, Philadelphia, PA 19102
(3) Center for Advanced Microbial Processing, Drexel University College of Medicine, Philadelphia, PA 19102
(4) Center for Genomic Sciences, Drexel University College of Medicine, Philadelphia, PA 19102
(5) Molecular & Cellular Biology & Genetics Graduate Program, Drexel University College of Medicine, Philadelphia, PA 19102
Haemophilus influenzae, like many human-associated bacterial species, is a commensal that can sometimes become pathogenic and cause disease. H. influenzae normally colonizes the nasopharynx (NP) of healthy individuals, but it can also cause infections at other body sites, including the ear, eye, and lung. Less commonly, it causes invasive infections. This commensal-to-pathogen transition stems from a variety of factors, including host genetics, viral co-infections, and environmental factors. Another major driver of this transition is the high degree of pan-genomic diversity among strains. We hypothesized that the presence of key accessory genes in the H. influenzae pan-genome can predict a strain's clinical isolation source. To test this, we trained XGBoost, a decision tree-based machine learning (ML) algorithm, to classify 1275 genome-sequenced H. influenzae strains (observations), using 794 intermediate-frequency accessory genes (features), into one of five clinical sources from which they were originally isolated: (a) carriage (from the NP of a healthy subject), (b) eye (from a patient with conjunctivitis), (c) ear (from the middle ear of a child with otitis media), (d) lung (from a chronic lung infection), and (e) invasive (from blood or cerebrospinal fluid). We found that, on average, our model predicted the clinical provenance of a given H. influenzae strain significantly better than random (55% versus 31%). Using recursive feature elimination, we identified a reduced set of 100 genes with comparable accuracy. This result underlines the importance of specific bacterial genes in H. influenzae pathogenesis. Furthermore, these genes could serve as diagnostic biomarkers of pathogenic H. influenzae or provide new insights into the role of accessory genes in the commensal-to-pathogen transition.
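The recursive feature elimination step mentioned above follows a simple loop: score the remaining genes, drop the weakest, and repeat until the target number remains. The sketch below illustrates only that loop on a toy gene presence/absence matrix. It is not the authors' pipeline: the scoring function `class_spread` is a hypothetical stand-in for XGBoost feature importances, the function names are invented for illustration, and the six-strain, four-gene matrix is made-up data. Because the stand-in scores each gene independently, the scores do not change between rounds, whereas a refitted tree model would re-estimate importances on each reduced gene set.

```python
from collections import Counter


def class_spread(X, y, gene):
    """Stand-in importance (an assumption, not XGBoost's measure):
    max minus min per-class presence frequency of one gene."""
    totals = Counter(y)
    present = Counter(label for row, label in zip(X, y) if row[gene])
    freqs = [present[label] / totals[label] for label in totals]
    return max(freqs) - min(freqs)


def recursive_feature_elimination(X, y, n_keep):
    """Repeatedly score the remaining genes and drop the lowest-scoring one."""
    kept = list(range(len(X[0])))
    while len(kept) > n_keep:
        # Re-score the genes that are still in play (the "refit" step).
        scores = {g: class_spread(X, y, g) for g in kept}
        # Drop the single weakest gene; ties go to the lowest column index.
        kept.remove(min(kept, key=scores.get))
    return kept


# Toy data: rows are strains, columns are accessory genes (1 = present).
X = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
y = ["carriage", "carriage", "ear", "ear", "lung", "lung"]

selected = recursive_feature_elimination(X, y, n_keep=2)
print(selected)  # [0, 1]
```

Genes 0 and 1 survive because each is present in exactly one class. Gene 2 is equally frequent in every class and gene 3 is present in every strain, so neither separates the sources and both are eliminated.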