Evidentiary and interpretable prediction
Binary and Multi-classification algorithm for adverse outcome detection, survival classification, and endpoint prediction (see references for details)
Objectives of this project
Build the mccv python package: easily implement and perform MCCV for learning and prediction tasks.
Showcase accessibly to build, validate, and interpret MCCV classifiers.
Demonstrate use in both python and R for diverse community implementations.
Installation
mkdir ~/my_directory #choose where to clone the mccv repository
cd ~/my_directory
git clone https://github.com/ngiangre/mccv.git
cd mccv/
python3 -m pip install .
Usage
import pandas as pd
data = pd.read_csv('data/data.csv' ,index_col= 0 ) # Feature column name is 'biomarker' and response column name is 'status'
data.head()
status biomarker
obs
1 0 1.665731
2 0 -0.875837
3 0 -1.391374
4 0 -0.297352
5 1 0.189857
import mccv
mccv_obj = mccv.mccv(num_bootstraps= 200 )
mccv_obj.set_X( data.loc[:,['biomarker' ]] )
mccv_obj.set_Y( data.loc[:,['status' ]] )
mccv_obj.run_mccv()
mccv_obj.run_permuted_mccv()
#Output
for n in mccv_obj.mccv_data:
print (n)
mccv_obj.mccv_data[n].head()
Model Learning
bootstrap model ... train_roc_auc validation_roc_auc
0 0 Logistic Regression ... 0.529453 0.611111
1 1 Logistic Regression ... 0.515235 0.732143
2 2 Logistic Regression ... 0.543056 0.400000
3 3 Logistic Regression ... 0.519728 0.727273
4 4 Logistic Regression ... 0.554054 0.574074
[5 rows x 5 columns]
Feature Importance
bootstrap feature importance model
0 0 biomarker 1.009705 Logistic Regression
1 0 Intercept -0.598575 Logistic Regression
0 1 biomarker 0.509433 Logistic Regression
1 1 Intercept -0.226550 Logistic Regression
0 2 biomarker 1.598627 Logistic Regression
Patient Predictions
bootstrap model y_pred y_proba y_true
obs
27 0 Logistic Regression 0 0.384723 1
87 0 Logistic Regression 1 0.601359 0
3 0 Logistic Regression 0 0.401320 0
56 0 Logistic Regression 1 0.512481 1
76 0 Logistic Regression 0 0.393009 0
Performance
model metric performance_bootstrap value
0 Logistic Regression roc_auc 0 0.467487
1 Logistic Regression roc_auc 1 0.467776
2 Logistic Regression roc_auc 2 0.480176
3 Logistic Regression roc_auc 3 0.480679
4 Logistic Regression roc_auc 4 0.475859
for n in mccv_obj.mccv_permuted_data:
print (n)
mccv_obj.mccv_permuted_data[n].head()
Model Learning
bootstrap model ... train_roc_auc validation_roc_auc
0 0 Logistic Regression ... 0.506233 0.642857
1 1 Logistic Regression ... 0.492030 0.703704
2 2 Logistic Regression ... 0.510135 0.537037
3 3 Logistic Regression ... 0.506944 0.703704
4 4 Logistic Regression ... 0.589547 0.340909
[5 rows x 5 columns]
Feature Importance
bootstrap feature importance model
0 0 biomarker -0.196116 Logistic Regression
1 0 Intercept 0.079220 Logistic Regression
0 1 biomarker -0.628093 Logistic Regression
1 1 Intercept 0.236617 Logistic Regression
0 2 biomarker 0.166196 Logistic Regression
Patient Predictions
bootstrap model y_pred y_proba y_true
obs
27 0 Logistic Regression 1 0.513536 1
87 0 Logistic Regression 0 0.470809 0
3 0 Logistic Regression 1 0.510160 0
56 0 Logistic Regression 0 0.488317 1
76 0 Logistic Regression 1 0.511844 1
Performance
model metric performance_bootstrap value
0 Logistic Regression roc_auc 0 0.440616
1 Logistic Regression roc_auc 1 0.442506
2 Logistic Regression roc_auc 2 0.449941
3 Logistic Regression roc_auc 3 0.440162
4 Logistic Regression roc_auc 4 0.449896
if (! requireNamespace ("readr" )){install.packages ("readr" )}
Loading required namespace: readr
library (readr)
data <- read_csv ("data/data.csv" ,col_types = c ("iid" )) #set obs as integer, status as integer, and biomarker as double
head (data)
# A tibble: 6 × 3
obs status biomarker
<int> <int> <dbl>
1 1 0 1.67
2 2 0 -0.876
3 3 0 -1.39
4 4 0 -0.297
5 5 1 0.190
6 6 0 2.20
if (! requireNamespace ("reticulate" )){install.packages ("reticulate" )}
mccv = reticulate:: import ('mccv' )
mccv_obj = mccv$ mccv (num_bootstraps = as.integer (200 ))
X = reticulate:: r_to_py (data[,c ('obs' ,'biomarker' )])
X = X$ set_index (reticulate:: r_to_py ('obs' ))
y = reticulate:: r_to_py (data[,c ('obs' ,'status' )])
y = y$ set_index (reticulate:: r_to_py ('obs' ))
mccv_obj$ set_X (X)
mccv_obj$ set_Y (y)
mccv_obj$ run_mccv ()
mccv_obj$ run_permuted_mccv ()
#Output
lapply (mccv_obj$ mccv_data,head)
Warning in py_to_r.pandas.core.frame.DataFrame(object): index contains
duplicated values: row names not set
Warning in py_to_r.pandas.core.frame.DataFrame(object): index contains
duplicated values: row names not set
$`Model Learning`
bootstrap model test_roc_auc train_roc_auc validation_roc_auc
1 0 Logistic Regression 1.0000 0.5294525 0.6111111
2 1 Logistic Regression 0.8000 0.5152355 0.7321429
3 2 Logistic Regression 1.0000 0.5430556 0.4000000
4 3 Logistic Regression 0.8750 0.5197279 0.7272727
5 4 Logistic Regression 0.8125 0.5540541 0.5740741
6 5 Logistic Regression 1.0000 0.5499325 0.5357143
$`Feature Importance`
bootstrap feature importance model
1 0 biomarker 1.0097049 Logistic Regression
2 0 Intercept -0.5985751 Logistic Regression
3 1 biomarker 0.5094328 Logistic Regression
4 1 Intercept -0.2265503 Logistic Regression
5 2 biomarker 1.5986271 Logistic Regression
6 2 Intercept -0.9420031 Logistic Regression
$`Patient Predictions`
bootstrap model y_pred y_proba y_true
1 0 Logistic Regression 0 0.3847230 1
2 0 Logistic Regression 1 0.6013587 0
3 0 Logistic Regression 0 0.4013202 0
4 0 Logistic Regression 1 0.5124811 1
5 0 Logistic Regression 0 0.3930090 0
6 0 Logistic Regression 0 0.4660667 1
$Performance
model metric performance_bootstrap value
1 Logistic Regression roc_auc 0 0.4674874
2 Logistic Regression roc_auc 1 0.4677764
3 Logistic Regression roc_auc 2 0.4801763
4 Logistic Regression roc_auc 3 0.4806793
5 Logistic Regression roc_auc 4 0.4758592
6 Logistic Regression roc_auc 5 0.4687351
lapply (mccv_obj$ mccv_permuted_data,head)
Warning in py_to_r.pandas.core.frame.DataFrame(object): index contains
duplicated values: row names not set
Warning in py_to_r.pandas.core.frame.DataFrame(object): index contains
duplicated values: row names not set
$`Model Learning`
bootstrap model test_roc_auc train_roc_auc validation_roc_auc
1 0 Logistic Regression 0.5500 0.5062327 0.6428571
2 1 Logistic Regression 0.8000 0.4920305 0.7037037
3 2 Logistic Regression 0.5625 0.5101351 0.5370370
4 3 Logistic Regression 0.8000 0.5069444 0.7037037
5 4 Logistic Regression 0.9000 0.5895470 0.3409091
6 5 Logistic Regression 0.7000 0.5360111 0.5178571
$`Feature Importance`
bootstrap feature importance model
1 0 biomarker -0.19611610 Logistic Regression
2 0 Intercept 0.07921951 Logistic Regression
3 1 biomarker -0.62809256 Logistic Regression
4 1 Intercept 0.23661698 Logistic Regression
5 2 biomarker 0.16619555 Logistic Regression
6 2 Intercept -0.01455491 Logistic Regression
$`Patient Predictions`
bootstrap model y_pred y_proba y_true
1 0 Logistic Regression 1 0.5135363 1
2 0 Logistic Regression 0 0.4708091 0
3 0 Logistic Regression 1 0.5101595 0
4 0 Logistic Regression 0 0.4883168 1
5 0 Logistic Regression 1 0.5118443 1
6 0 Logistic Regression 0 0.4973405 1
$Performance
model metric performance_bootstrap value
1 Logistic Regression roc_auc 0 0.4406164
2 Logistic Regression roc_auc 1 0.4425061
3 Logistic Regression roc_auc 2 0.4499406
4 Logistic Regression roc_auc 3 0.4401616
5 Logistic Regression roc_auc 4 0.4498963
6 Logistic Regression roc_auc 5 0.4436607
Contribute
Please do! Reach out to Nick directly (nick.giangreco@gmail.com), make an issue, or make a pull request.
License
This software is released under the MIT license, which can be found in LICENSE in the root directory of this repository.