| Title: | Model Selection Based on Machine Learning (ML) | 
| Version: | 1.0.0.1 | 
| Description: | Model evaluation based on a modified version of the recursive feature elimination algorithm. This package is designed to determine the optimal model(s) by leveraging all available features. | 
| License: | GPL (≥ 3) | 
| URL: | https://github.com/mommy003/MSML | 
| Encoding: | UTF-8 | 
| RoxygenNote: | 7.3.1 | 
| Depends: | R (≥ 2.10) | 
| Imports: | r2redux, R2ROC | 
| LazyData: | true | 
| NeedsCompilation: | no | 
| Packaged: | 2024-03-04 05:04:48 UTC; cvasu | 
| Author: | Hong Lee [aut, cph], Moksedul Momin [aut, cre, cph] | 
| Maintainer: | Moksedul Momin <cvasu.momin@gmail.com> | 
| Repository: | CRAN | 
| Date/Publication: | 2024-03-04 05:20:02 UTC | 
3 sets of covariates for training data set
Description
A dataset containing N sets of covariates (N=3 as an example here) intended for constant use across all model configurations (refer to the 'model_configuration2' function) when using a training dataset. Please note that if constant covariates are not required, this file is unnecessary (refer to the 'model_configuration' function).
Usage
cov_train
Format
A data frame for training dataset:
- V1
 covariate 1
- V2
 covariate 2
- V3
 covariate 3
3 sets of covariates for validation data set
Description
A dataset containing N sets of covariates (N=3 as an example here) intended for constant use across all model configurations (refer to the 'model_configuration2' function) when using a validation dataset. Please note that if constant covariates are not required, this file is unnecessary (refer to the 'model_configuration' function).
Usage
cov_valid
Format
A data frame for validation dataset:
- V1
 covariate 1
- V2
 covariate 2
- V3
 covariate 3
7 sets of PRSs for test dataset and target phenotype
Description
A dataset containing 7 sets of PRSs for test dataset and target phenotype
Usage
data_test
Format
A data frame for test dataset:
- V1
 Feature 1 (or PRS1 constructed using the first subset of SNPs from GWAS summary statistics)
- V2
 Feature 2 (or PRS2 constructed using the second subset of SNPs from GWAS summary statistics)
- V3
 Feature 3 (or PRS3 constructed using the third subset of SNPs from GWAS summary statistics)
- V4
 Feature 4 (or PRS4 constructed using the fourth subset of SNPs from GWAS summary statistics)
- V5
 Feature 5 (or PRS5 constructed using the fifth subset of SNPs from GWAS summary statistics)
- V6
 Feature 6 (or PRS6 constructed using the sixth subset of SNPs from GWAS summary statistics)
- V7
 Feature 7 (or PRS7 constructed using the seventh subset of SNPs from GWAS summary statistics)
- phenotype
 Phenotypic values
7 sets of PRSs for training data set and target phenotype
Description
A dataset containing 7 sets of PRSs for training data set and target phenotype
Usage
data_train
Format
A data frame for training dataset:
- V1
 Feature 1 (or PRS1 constructed using the first subset of SNPs from GWAS summary statistics)
- V2
 Feature 2 (or PRS2 constructed using the second subset of SNPs from GWAS summary statistics)
- V3
 Feature 3 (or PRS3 constructed using the third subset of SNPs from GWAS summary statistics)
- V4
 Feature 4 (or PRS4 constructed using the fourth subset of SNPs from GWAS summary statistics)
- V5
 Feature 5 (or PRS5 constructed using the fifth subset of SNPs from GWAS summary statistics)
- V6
 Feature 6 (or PRS6 constructed using the sixth subset of SNPs from GWAS summary statistics)
- V7
 Feature 7 (or PRS7 constructed using the seventh subset of SNPs from GWAS summary statistics)
- phenotype
 Phenotypic values
7 sets of PRSs for validation dataset and target phenotype
Description
A dataset containing 7 sets of PRSs for validation dataset and target phenotype
Usage
data_valid
Format
A data frame for validation dataset:
- V1
 Feature 1 (or PRS1 constructed using the first subset of SNPs from GWAS summary statistics)
- V2
 Feature 2 (or PRS2 constructed using the second subset of SNPs from GWAS summary statistics)
- V3
 Feature 3 (or PRS3 constructed using the third subset of SNPs from GWAS summary statistics)
- V4
 Feature 4 (or PRS4 constructed using the fourth subset of SNPs from GWAS summary statistics)
- V5
 Feature 5 (or PRS5 constructed using the fifth subset of SNPs from GWAS summary statistics)
- V6
 Feature 6 (or PRS6 constructed using the sixth subset of SNPs from GWAS summary statistics)
- V7
 Feature 7 (or PRS7 constructed using the seventh subset of SNPs from GWAS summary statistics)
- phenotype
 Phenotypic values
model_configuration function
Description
This function generates predicted values for the validation dataset by applying optimal weights to features, which were estimated in the training dataset for each model configuration. The total number of model configurations is determined by summing the combinations for each possible number of features, ranging from 1 to 'n' (C(n, k)), where 'n choose k' (C(n, k)) represents the binomial coefficient. Here, 'n' denotes the total number of features, and 'k' indicates the number of features included in each model. For example, with n=7, the total number of model configurations is 127.
Usage
model_configuration(data_train, data_valid, mv, model = "lm")
Arguments
data_train | 
 This includes the dataframe of the training dataset in a matrix format  | 
data_valid | 
 This includes the dataframe of the validation dataset in a matrix format  | 
mv | 
 The total number of columns in data_train/data_valid  | 
model | 
 This is the type of model (e.g. lm (default) or glm)  | 
Value
This function will generate all possible model outcomes for validation and test dataset
Examples
data_train <- data_train
data_valid  <- data_valid
mv=8
out=model_configuration(data_train,data_valid,mv,model = "lm")
#This process will produce predicted values for the validation datasets,
#corresponding to each model configuration trained on the training dataset.
#The outcome of this function will yield variables named 'predict_validation'
#and 'total_model_configurations.
#To print the outcomes run out$predict_validation and out$total_model_configurations.
#For details (see https://github.com/mommy003/MSML). 
model_configuration2 function
Description
This function is similar to the model_configuration function, with the added capability to maintain constant variables across models during training and prediction (see cov_train and cov_valid in page 2). Additionally, users have the option to select between linear or logistic regression models.
Usage
model_configuration2(
  data_train,
  data_valid,
  mv,
  cov_train,
  cov_valid,
  model = "lm"
)
Arguments
data_train | 
 This includes the dataframe of the training dataset in a matrix format  | 
data_valid | 
 This includes the dataframe of the validation dataset in a matrix format  | 
mv | 
 The total number of columns in data_train/data_valid  | 
cov_train | 
 This includes dataframe of covariates for training dataset in a matrix format  | 
cov_valid | 
 This includes dataframe of covariates for validation dataset in a matrix format  | 
model | 
 This is the type of model (e.g. lm (default) or glm (logistic regression))  | 
Value
This function will generate all possible model outcomes for validation and test dataset
Examples
data_train <- data_train
data_valid  <- data_valid
mv=8
cov_train <- cov_train
cov_valid  <- cov_valid
out=model_configuration2(data_train,data_valid,mv,cov_train, cov_valid, model = "lm")
#This process will produce predicted values for the validation datasets,
#corresponding to each model configuration trained on the training dataset.
#The outcome of this function will yield variables named 'predict_validation'
#and 'total_model_configurations.
#To print the outcomes run out$predict_validation and out$total_model_configurations.
#For details (see https://github.com/mommy003/MSML). 
#If a user intends to employ logistic regression without constant covariates, 
#we advise preparing a covariate file where all values are set to 1.
model_evaluation function
Description
This function will identify the best model in the validation and test dataset.
Usage
model_evaluation(dat, mv, tn, prev, pthreshold = 0.05, method = "R2ROC")
Arguments
dat | 
 This is the dataframe for all the combinations of the model in a matrix format  | 
mv | 
 The total number of columns in data_train/data_valid  | 
tn | 
 The total number of best models to be identified  | 
prev | 
 The prevalence of disease in the data  | 
pthreshold | 
 The significance p value threshold when comparing models (default 0.05)  | 
method | 
 The methods to be used to evaluate models (e.g. R2ROC (default) or r2redux)  | 
Value
This function will generate all possible model outcomes for validation and test dataset
Examples
dat <- predict_validation
mv=8
tn=15
prev=0.047
out=model_evaluation(dat,mv,tn,prev)
#This process will generate three output files.
#out$out_all, contains AUC, p values for AUC, R2, and p values for R2, 
#respectively for all models.
#out$out_start, contains AUC, p values for AUC, R2, and p values for R2,
#respectively for top tn models.
#out$out_selected, contains AUC, p values for AUC, R2, and p values for R2,
#respectively for best models.  This also includes selected features for models
#For details (see https://github.com/mommy003/MSML).
target phenotype and 127 sets of model configurations based on validation dataset
Description
A dataset containing target phenotype and 127 sets of model configurations based on validation dataset
Usage
predict_validation
Format
A data frame for predicted values for target dataset from model configurations_test:
- V1
 Phenotypic values in target dataset
- V2
 predicted values for target dataset from model configuration1
- V3
 predicted values for target dataset from model configuration2
- V4
 predicted values for target dataset from model configuration3
- V5
 predicted values for target dataset from model configuration4
- V6
 predicted values for target dataset from model configuration5
- V7
 predicted values for target dataset from model configuration6
- V8
 predicted values for target dataset from model configuration7
- V9
 predicted values for target dataset from model configuration8
- V10
 predicted values for target dataset from model configuration9
- V11
 predicted values for target dataset from model configuration10
- V12
 predicted values for target dataset from model configuration11
- V13
 predicted values for target dataset from model configuration12
- V14
 predicted values for target dataset from model configuration13
- V15
 predicted values for target dataset from model configuration14
- V16
 predicted values for target dataset from model configuration15
- V17
 predicted values for target dataset from model configuration16
- V18
 predicted values for target dataset from model configuration17
- V19
 predicted values for target dataset from model configuration18
- V20
 predicted values for target dataset from model configuration19
- V21
 predicted values for target dataset from model configuration10
- V22
 predicted values for target dataset from model configuration21
- V23
 predicted values for target dataset from model configuration22
- V24
 predicted values for target dataset from model configuration23
- V25
 predicted values for target dataset from model configuration24
- V26
 predicted values for target dataset from model configuration25
- V27
 predicted values for target dataset from model configuration26
- V28
 predicted values for target dataset from model configuration27
- V29
 predicted values for target dataset from model configuration28
- V30
 predicted values for target dataset from model configuration29
- V31
 predicted values for target dataset from model configuration30
- V32
 predicted values for target dataset from model configuration31
- V33
 predicted values for target dataset from model configuration32
- V34
 predicted values for target dataset from model configuration33
- V35
 predicted values for target dataset from model configuration34
- V36
 predicted values for target dataset from model configuration35
- V37
 predicted values for target dataset from model configuration36
- V38
 predicted values for target dataset from model configuration37
- V39
 predicted values for target dataset from model configuration38
- V40
 predicted values for target dataset from model configuration39
- V41
 predicted values for target dataset from model configuration40
- V42
 predicted values for target dataset from model configuration41
- V43
 predicted values for target dataset from model configuration42
- V44
 predicted values for target dataset from model configuration43
- V45
 predicted values for target dataset from model configuration44
- V46
 predicted values for target dataset from model configuration45
- V47
 predicted values for target dataset from model configuration46
- V48
 predicted values for target dataset from model configuration47
- V49
 predicted values for target dataset from model configuration48
- V50
 predicted values for target dataset from model configuration49
- V51
 predicted values for target dataset from model configuration50
- V52
 predicted values for target dataset from model configuration51
- V53
 predicted values for target dataset from model configuration52
- V54
 predicted values for target dataset from model configuration53
- V55
 predicted values for target dataset from model configuration54
- V56
 predicted values for target dataset from model configuration55
- V57
 predicted values for target dataset from model configuration56
- V58
 predicted values for target dataset from model configuration57
- V59
 predicted values for target dataset from model configuration58
- V60
 predicted values for target dataset from model configuration59
- V61
 predicted values for target dataset from model configuration60
- V62
 predicted values for target dataset from model configuration61
- V63
 predicted values for target dataset from model configuration62
- V64
 predicted values for target dataset from model configuration63
- V65
 predicted values for target dataset from model configuration64
- V66
 predicted values for target dataset from model configuration65
- V67
 predicted values for target dataset from model configuration66
- V68
 predicted values for target dataset from model configuration67
- V69
 predicted values for target dataset from model configuration68
- V70
 predicted values for target dataset from model configuration69
- V71
 predicted values for target dataset from model configuration70
- V72
 predicted values for target dataset from model configuration71
- V73
 predicted values for target dataset from model configuration72
- V74
 predicted values for target dataset from model configuration73
- V75
 predicted values for target dataset from model configuration74
- V76
 predicted values for target dataset from model configuration75
- V77
 predicted values for target dataset from model configuration76
- V78
 predicted values for target dataset from model configuration77
- V79
 predicted values for target dataset from model configuration78
- V80
 predicted values for target dataset from model configuration79
- V81
 predicted values for target dataset from model configuration80
- V82
 predicted values for target dataset from model configuration81
- V83
 predicted values for target dataset from model configuration82
- V84
 predicted values for target dataset from model configuration83
- V85
 predicted values for target dataset from model configuration84
- V86
 predicted values for target dataset from model configuration85
- V87
 predicted values for target dataset from model configuration86
- V88
 predicted values for target dataset from model configuration87
- V89
 predicted values for target dataset from model configuration88
- V90
 predicted values for target dataset from model configuration89
- V91
 predicted values for target dataset from model configuration90
- V92
 predicted values for target dataset from model configuration91
- V93
 predicted values for target dataset from model configuration92
- V94
 predicted values for target dataset from model configuration93
- V95
 predicted values for target dataset from model configuration94
- V96
 predicted values for target dataset from model configuration95
- V97
 predicted values for target dataset from model configuration96
- V98
 predicted values for target dataset from model configuration97
- V99
 predicted values for target dataset from model configuration98
- V100
 predicted values for target dataset from model configuration99
- V101
 predicted values for target dataset from model configuration100
- V102
 predicted values for target dataset from model configuration101
- V103
 predicted values for target dataset from model configuration102
- V104
 predicted values for target dataset from model configuration103
- V105
 predicted values for target dataset from model configuration104
- V106
 predicted values for target dataset from model configuration105
- V107
 predicted values for target dataset from model configuration106
- V108
 predicted values for target dataset from model configuration107
- V109
 predicted values for target dataset from model configuration108
- V110
 predicted values for target dataset from model configuration109
- V111
 predicted values for target dataset from model configuration110
- V112
 predicted values for target dataset from model configuration111
- V113
 predicted values for target dataset from model configuration112
- V114
 predicted values for target dataset from model configuration113
- V115
 predicted values for target dataset from model configuration114
- V116
 predicted values for target dataset from model configuration115
- V117
 predicted values for target dataset from model configuration116
- V118
 predicted values for target dataset from model configuration117
- V119
 predicted values for target dataset from model configuration118
- V120
 predicted values for target dataset from model configuration119
- V121
 predicted values for target dataset from model configuration120
- V122
 predicted values for target dataset from model configuration121
- V123
 predicted values for target dataset from model configuration122
- V124
 predicted values for target dataset from model configuration123
- V125
 predicted values for target dataset from model configuration124
- V126
 predicted values for target dataset from model configuration125
- V127
 predicted values for target dataset from model configuration126
- V128
 predicted values for target dataset from model configuration127