Proceedings Article | 15 February 2021
KEYWORDS: Tumor growth modeling, Tumors, Genomics, Lung cancer, Performance modeling, Cancer, Statistical modeling, Principal component analysis, Feature extraction, Data modeling
This study aims to develop a radiogenomic model to identify high-risk non-small cell lung cancer (NSCLC) patients and predict overall survival. Baseline CT images of 85 NSCLC patients (male/female: 58/27; event: death; adenocarcinoma/squamous cell carcinoma/unspecified: 41/32/12; stages I/II/III/unspecified: 39/25/12/9) with gene expression profiles (microarray data) of 33 genes were taken from the NSCLC-Radiomics Genomics dataset, publicly available from the National Cancer Institute’s Cancer Imaging Archive (TCIA). The 33 genes were selected because they represent three major co-expression patterns in the dataset: histology, neuroendocrine (NE), and pulmonary surfactant systems (PSS) signature genes. Radiomic features (429) characterizing the primary tumor were extracted from the 3D tumor volume using PyRadiomics. As the first step of our analysis, we used the Mann-Whitney U test to identify a subset of 224 features that were robust (p-value < 0.05) to scan differences in slice thickness, reconstruction kernel, and contrast enhancement. Principal component analysis (PCA) was then applied separately to the radiomic and the genomic features, extracting from each set ten principal components (PCs) that captured 85% of the variance. Three models were created to assess the prognostic performance of the radiomic and genomic PCs. Model 1 consisted of the first five radiomic PCs; Model 2 consisted of the first five genomic PCs; and Model 3 consisted of the first three radiomic PCs and the first two genomic PCs. For each model, a five-fold cross-validated multivariate Cox proportional hazards model (200 iterations) was used to compute the concordance index, which measures the ability of the model to predict overall survival. The concordance index values for the Cox models were (mean (min, max)): Model 1: 0.59 (0.57, 0.62); Model 2: 0.57 (0.54, 0.61); Model 3: 0.62 (0.61, 0.64).
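For readers unfamiliar with the concordance index reported above, the following is a minimal sketch of Harrell's C in plain Python. It is not the authors' implementation: the function name `concordance_index`, its arguments, and the toy data are illustrative assumptions, and pairs with tied survival times are simply skipped for brevity.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable patient pairs in which the patient
    with the higher predicted risk has the shorter observed survival.

    times       -- follow-up time per patient
    events      -- 1 if death was observed, 0 if censored
    risk_scores -- model output, higher means worse predicted prognosis
    (All names are hypothetical; tied times are skipped as a simplification.)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient a has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            # Not comparable: tied times, or the earlier patient was censored.
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # ordering of risk matches ordering of survival
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up for illustration, not taken from the study).
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.4, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the three models' values of 0.57 to 0.62 should be read.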
In addition to the cross-validated c statistics, we built a model on the complete dataset for each of the three sets of predictors and used Kaplan-Meier analysis to evaluate how well each model separated participants above versus below the median prognostic score. We also constructed separate radiogenomic signatures for the NSCLC patients based on their histology, lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD), and conducted a similar survival analysis on each, showing that the combined radiogenomic model was the only one with statistically significant curve separation (p < 0.05). This preliminary study suggests that combining radiomic and genomic information could give a more comprehensive assessment of the tumor’s characteristics at baseline and yield a better prognostic model than radiomic or genomic information alone.
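The median-split Kaplan-Meier comparison described above can be sketched as follows, assuming each patient has a follow-up time, an event flag, and a prognostic score. The helper names `kaplan_meier` and `split_by_median` and the toy values are hypothetical, and the significance test on curve separation is omitted; this is an illustration of the technique rather than the study's code.

```python
from statistics import median


def kaplan_meier(times, events):
    """Return the Kaplan-Meier curve as [(t, S(t))] at each distinct
    time where at least one death was observed."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def split_by_median(scores):
    """Return patient indices with scores above the median and those at or
    below it (assigning ties to the lower group is an assumption)."""
    cutoff = median(scores)
    high = [i for i, s in enumerate(scores) if s > cutoff]
    low = [i for i, s in enumerate(scores) if s <= cutoff]
    return high, low


# Toy data (made up for illustration, not taken from the study).
toy_times = [2, 4, 6, 8, 10, 12]
toy_events = [1, 1, 0, 1, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

high_idx, low_idx = split_by_median(toy_scores)
high_curve = kaplan_meier([toy_times[i] for i in high_idx],
                          [toy_events[i] for i in high_idx])
low_curve = kaplan_meier([toy_times[i] for i in low_idx],
                         [toy_events[i] for i in low_idx])
```

In practice the two curves would be compared with a test such as the log-rank test to obtain the p-value for curve separation.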