`select_pred.Rd`

Selection of a subset of non-collinear predictors that have relevant relationships with a given target outcome, using a random forest procedure.

```
select_pred(
  databa,
  Y = NULL,
  Z = NULL,
  ID = 1,
  OUT = "Y",
  quanti = NULL,
  nominal = NULL,
  ordinal = NULL,
  logic = NULL,
  convert_num = NULL,
  convert_class = NULL,
  thresh_cat = 0.3,
  thresh_num = 0.7,
  thresh_Y = 0.2,
  RF = TRUE,
  RF_ntree = 500,
  RF_condi = FALSE,
  RF_condi_thr = 0.2,
  RF_SEED = sample(1:1e+06, 1)
)
```

- databa
a data.frame with a column of identifiers (of rows, or of databases in the case of two concatenated databases), an outcome, and a set of predictors. The number of columns can exceed the number of rows.

- Y
the label of a first target variable with quotes

- Z
the label of a second target variable, with quotes, when `databa` is the result of two overlaid databases.

- ID
the column index of the database identifier (the first column by default) in the case of two concatenated databases, a row identifier otherwise

- OUT
a character string that indicates the outcome to predict in the context of overlaid databases. By default, the outcome declared in the argument `Y` is predicted. The outcome declared in the argument `Z` can be predicted instead by setting `OUT = "Z"`.

- quanti
a vector of integers corresponding to the column indexes of all the numeric predictors.

- nominal
a vector of integers which corresponds to the column indexes of all the categorical nominal predictors.

- ordinal
a vector of integers which corresponds to the column indexes of all the categorical ordinal predictors.

- logic
a vector of integers indicating the indexes of the logical predictors. No index is set by default.

- convert_num
a vector of integers indicating the indexes of the quantitative variables to convert into ordered factors. No index is set by default. Each selected index must also be declared as quantitative in the argument `quanti`.

- convert_class
a vector of integers indicating the number of classes for each quantitative variable converted into an ordered factor. The length of this vector cannot exceed the length of the argument `convert_num`. Nevertheless, if length(`convert_num`) > 1 and length(`convert_class`) = 1, all the quantitative predictors selected for discretization will share the same number of classes.

- thresh_cat
a threshold associated to the Cramer's V coefficient (= 0.30 by default)

- thresh_num
a threshold associated to the Spearman's coefficient of correlation (= 0.70 by default)

- thresh_Y
a threshold linked to the RF approach, that corresponds to the minimal cumulative percent of importance measure required to be kept in the final list of predictors.

- RF
a boolean set to TRUE (default) if a random forest procedure must be applied to select the best subset of predictors according to the outcome. Otherwise, only pairwise associations between predictors are used for the selection.

- RF_ntree
the number of bootstrap samples drawn from the raw data source during the random forest procedure

- RF_condi
a boolean specifying whether the conditional importance measures must be assessed from the random forest procedure (`TRUE`) rather than the standard variable importance measures (`FALSE` by default)

- RF_condi_thr
a threshold linked to (1 - p-value) of an association test between each predictor \(X\) and the other variables. A threshold value of zero includes all variables in the computation of the conditional importance measure of \(X\) (0.20 is the default value). Conversely, a larger threshold only keeps the subset of variables that is strongly correlated to \(X\) for the computation of the variable importance measure of \(X\).

- RF_SEED
an integer passed to set.seed() to initialize the random number generator (a random integer by default). This value is only used by the RF method.

A list of 14 objects (if `RF = TRUE`) or 11 objects (only the first eleven if `RF = FALSE`) is returned:

- seed
the seed of the random number generator used for the study

- outc
the identifier of the outcome to predict

- thresh
a summary of the different thresholds fixed for the study

- convert_num
the labels of the continuous predictors transformed into categorical form

- DB_USED
the final database used after potential transformations of predictors

- vcrm_OUTC_cat
a table of pairwise associations between the outcome and the categorical predictors (Cramer's V)

- cor_OUTC_num
a table of pairwise associations between the outcome and the continuous predictors (Rank correlation)

- vcrm_X_cat
a table of pairwise associations between the categorical predictors (Cramer's V)

- cor_X_num
a table of pairwise associations between the continuous predictors (Rank correlation)

- FG_test
the results of the Farrar and Glauber tests, with and without approximation form

- collinear_PB
a table of predictors with collinearity problems according to the fixed thresholds

- drop_var
the labels of the predictors to drop after the RF process (optional output: only if `RF = TRUE`)

- RF_PRED
the table of variable importance measurements, conditional or not, according to the argument `RF_condi` (optional output: only if `RF = TRUE`)

- RF_best
the labels of the best predictors selected according to the value of the argument `thresh_Y` (optional output: only if `RF = TRUE`)

The `select_pred` function provides several tools to identify the relationships between predictors, in particular potential collinearity problems, and proposes a parsimonious subset of relevant predictors of the outcome using appropriate random forest procedures.
The function, which can be used as a preliminary step before prediction in regression settings, is particularly suited to data fusion: it provides relevant subsets of predictors (the matching variables) to algorithms dedicated to solving recoding problems.

A. REQUIRED STRUCTURE FOR THE DATABASE

The expected input database is a data.frame that requires a specific column of row identifiers and a target variable (or outcome) with a finite number of values or classes (ordinal, nominal or discrete type). Note that if the chosen outcome is numeric, it is automatically converted to an ordinal type.
The number of predictors is not a constraint for `select_pred` (although selecting variables among fewer than three makes little sense), and it can exceed the number of rows (high dimensionality is not a problem here).
The predictors can be continuous (quantitative), boolean, nominal or ordinal, with or without missing values.
When numeric variables are present, users can discretize all or some of them beforehand. They can also rely on the internal process integrated in the function: two arguments, `convert_num` and `convert_class`, are dedicated to these transformations.
These options make `select_pred` particularly suited to the function `OT_joint`, which only accepts a data.frame with categorical covariates.
With the argument `convert_num`, users choose the continuous variables to convert, and the related argument `convert_class` specifies the number of classes chosen for each discretization.
This is why these two arguments must be two vectors of the same length. The only exception is when `convert_class` is a scalar \(S\): in this case, all the continuous predictors selected for conversion are discretized with the same number of classes \(S\).
For example, if `convert_class = 4`, all the continuous variables specified in the `convert_num` argument are discretized by quartiles. Moreover, missing values in the predictors to convert are not taken into account during the conversion, and each predictor specified in the argument `convert_num` must also be specified in the argument `quanti`.
When the input is a single database, the label of the outcome must be entered in the argument `Y`, and the arguments `Z` and `OUT` must keep their default values.
Finally, the order of the columns holding the identifier and the outcome does not matter.
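The quantile-based discretization described above can be sketched in base R. This is only an illustration: `discretize_quantile` is a hypothetical helper, and the use of empirical quantiles as cut points is an assumption, so the internal rule applied by `select_pred` may differ.

```r
# Hypothetical helper: discretize a numeric vector into `n_class` ordered
# classes, using empirical quantiles as cut points (an assumption; the
# internal rule used by select_pred may differ).
discretize_quantile <- function(x, n_class) {
  probs  <- seq(0, 1, length.out = n_class + 1)
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE))
  # Missing values are ignored when computing the breaks and stay NA
  cut(x, breaks = breaks, include.lowest = TRUE, ordered_result = TRUE)
}

# Toy ages with one missing value, discretized by quartiles
age   <- c(23, 35, 41, 52, 58, 64, 70, NA, 29, 47, 33, 61)
age_q <- discretize_quantile(age, n_class = 4)
table(age_q, useNA = "ifany")
```

With `n_class = 4` the cut points are the quartiles, which mirrors the `convert_class = 4` case mentioned above.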

For more flexibility, the input database can also be the result of two overlaid databases.
In this case, the structure of the database must be similar to that of the datasets `simu_data` and `tab_test` available in the package: a column of database identifiers, one target outcome per database (2 columns), and a subset of shared predictors.
Note that two separate databases can easily be overlaid beforehand using the function `merge_dbs`.
The labels of the two outcomes must be specified in the argument `Y` for the top database and in `Z` for the bottom one.
Note also that `select_pred` deals with only one outcome at a time, specified in the argument `OUT`, which must be equal to "Y" to study the top database or "Z" to study the bottom one.
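To make the overlaid structure concrete, here is a minimal base R sketch that stacks two toy databases by hand. All data and column names are hypothetical, and this is not how `merge_dbs` works internally. It only shows the expected layout: a database identifier, one outcome per database (missing in the other database), and the shared predictors.

```r
# Toy data, all names hypothetical: two databases sharing the predictors
# Gender and Age, each with its own outcome.
base_A <- data.frame(Yb1 = c("low", "high", "low"),
                     Gender = c("F", "M", "F"), Age = c(34, 51, 46))
base_B <- data.frame(Yb2 = c(1, 3, 2, 2),
                     Gender = c("M", "M", "F", "F"), Age = c(29, 63, 40, 55))

# Top database: Yb2 is unknown. Bottom database: Yb1 is unknown.
overlaid <- rbind(
  data.frame(DB = "A", Yb1 = base_A$Yb1, Yb2 = NA,
             base_A[, c("Gender", "Age")]),
  data.frame(DB = "B", Yb1 = NA, Yb2 = base_B$Yb2,
             base_B[, c("Gender", "Age")])
)
str(overlaid)
```

In such a table, `Y = "Yb1"`, `Z = "Yb2"` and `ID = 1` would describe the layout, with `OUT` selecting which of the two outcomes is studied.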

Finally, whatever the structure of the input database, each column index of the database must be entered once (and only once) in one of the following four arguments: `quanti`, `nominal`, `ordinal`, `logic`.
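The "once and only once" rule can be checked before calling the function. The sketch below uses a hypothetical helper, `check_declared`, which is not part of the package. It reports indexes declared more than once and indexes not declared at all.

```r
# Hypothetical helper (not part of the package): check that every column
# index from 1 to n_col is declared once and only once.
check_declared <- function(n_col, quanti = NULL, nominal = NULL,
                           ordinal = NULL, logic = NULL) {
  declared <- c(quanti, nominal, ordinal, logic)
  list(
    duplicated = sort(unique(declared[duplicated(declared)])),
    missing    = setdiff(seq_len(n_col), declared)
  )
}

# Same declaration as in Example 1 (eight column indexes)
chk_ok <- check_declared(8, quanti = c(3, 8), nominal = c(1, 4:5, 7),
                         ordinal = c(2, 6))

# A faulty declaration: index 2 appears twice, indexes 3 and 4 are missing
chk_bad <- check_declared(4, quanti = c(1, 2), nominal = 2)
```

Two empty components in the result mean that the declaration is consistent with the rule.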

B. PAIRWISE ASSOCIATIONS BETWEEN PREDICTORS

As a first step, `select_pred` calculates standard pairwise associations between predictors according to their types.

- Between categorical predictors (ordinal, nominal and logical): Cramer's V (and the bias-corrected Cramer's V, see (1) for more details) is calculated between categorical predictors, and the argument `thresh_cat` fixes the associated threshold beyond which two predictors can be considered as redundant. A similar process is applied between the target variable and the subset of categorical variables, which provides in output a first table ranking the top scoring predictors. This table summarizes the ability of each variable to predict the target outcome.

- Between continuous predictors: If the `ordinal` and `logic` arguments differ from NULL, all the corresponding predictors are first converted to rank values. For numeric (quantitative), logical and ordinal predictors, pairwise correlations between ranks (Spearman) are calculated, and the argument `thresh_num` fixes the related threshold beyond which two predictors can be considered as redundant. A similar process is applied between the outcome and the subset of discrete variables, which provides in output a table ranking the top scoring predictors and summarizing their ability to predict the target. In addition, the result of a Farrar and Glauber test is provided. This test is based on the determinant of the correlation matrix of the covariates, and its null hypothesis corresponds to an absence of collinearity between them (see (2) for more details about the method). In the presence of a large number of numeric covariates and/or ordered factors, the approximate Farrar-Glauber test, based on the normal approximation of the null distribution, is better suited, and its result is also provided in output. These two tests are highly sensitive; consequently, it is suggested to consider their results as simple indicators of collinearity between predictors rather than as an essential condition of acceptability.
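The two association measures used in this step can be illustrated in base R on toy data. This is a sketch only: `cramers_v` is a hypothetical helper computing the uncorrected Cramer's V, the bias-corrected version of (1) is not shown, and the package's own computations may differ in detail.

```r
# Uncorrected Cramer's V computed from a contingency table
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n    <- sum(tab)
  k    <- min(nrow(tab), ncol(tab)) - 1
  as.numeric(sqrt(chi2 / (n * k)))
}

set.seed(1)
smoking <- sample(c("no", "yes"), 200, replace = TRUE)
copycat <- smoking                                   # perfectly redundant
gender  <- sample(c("F", "M"), 200, replace = TRUE)  # unrelated

v_redundant <- cramers_v(smoking, copycat)   # 1 up to rounding
v_unrelated <- cramers_v(smoking, gender)    # expected to be small

# Spearman rank correlation between two numeric predictors
age    <- rnorm(200, 50, 10)
income <- age^3 + rnorm(200, 0, 100)         # monotone relation with age
rho <- cor(age, income, method = "spearman", use = "pairwise.complete.obs")
abs(rho) > 0.70   # such a pair would exceed thresh_num = 0.70
```

A pair is a candidate for redundancy when its coefficient exceeds the corresponding threshold (`thresh_cat` for Cramer's V, `thresh_num` for the rank correlation).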

If the initial number of predictors is not too large, this information can be sufficient for the user to visualize potential collinearity problems and to select a subset of predictors (`RF = FALSE`).
It is nevertheless often necessary to complete this visualization with an automatic selection process such as the Random Forest approach (see Breiman 2001 for a better understanding of the method) integrated in the function `select_pred` (`RF = TRUE`).
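For readers who want to see what a determinant-based collinearity test looks like, here is a sketch of the classical chi-square form of the Farrar and Glauber statistic, \(-(n - 1 - (2p + 5)/6) \log |R|\) with \(p(p-1)/2\) degrees of freedom. The helper `fg_chisq` is hypothetical, and using Spearman correlations for \(R\) is an assumption made to match the rank correlations of the previous step. The implementation in the package, and its approximate variant, may differ.

```r
# Hypothetical helper: chi-square form of the Farrar-Glauber test,
# based on the determinant of the (rank) correlation matrix R.
fg_chisq <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  R <- cor(X, method = "spearman")
  stat <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df   <- p * (p - 1) / 2
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + rnorm(100, sd = 0.1)   # almost a copy of x1

fg_indep <- fg_chisq(cbind(x1, x2))        # no collinearity built in
fg_colli <- fg_chisq(cbind(x1, x2, x3))    # strong collinearity built in
fg_colli
```

A determinant close to zero inflates the statistic, so the null hypothesis of no collinearity is rejected for the second set of covariates.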

C. RANDOM FOREST PROCEDURE

As a final step of the process, a random forest approach (RF, (3)) is preferred here to regression models for two main reasons: RF methods allow the number of variables to exceed the number of rows, and they remain applicable whatever the types of covariates considered.
The function `select_pred` integrates in its algorithm the functions `cforest` and `varimp` of the package party (Hothorn, 2006) and so gives access to their main arguments.

A RF approach generally provides two types of measures for estimating the mean variable importance of each covariate in the prediction of an outcome: the Gini importance and the permutation importance. These measurements must be used with caution, by taking into account the following constraints:

- The Gini importance criterion can produce bias in favor of continuous variables and variables with many categories. To avoid this problem, only the permutation criterion is available in the function.

- The permutation importance criterion can overestimate the importance of highly correlated predictors.
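The idea behind the permutation criterion can be shown on a toy example. The sketch below is model-agnostic and deliberately simplified: a linear model stands in for the forest, and the error is measured on the training data rather than on out-of-bag samples. It does not reproduce the computation done by `varimp`. All names are hypothetical.

```r
set.seed(123)
n   <- 300
x1  <- rnorm(n)                # informative predictor
x2  <- rnorm(n)                # pure noise
y   <- 2 * x1 + rnorm(n)
dat <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = dat)   # stand-in for a forest (assumption)
mse <- function(d) mean((d$y - predict(fit, newdata = d))^2)

# Importance of a variable = mean increase in prediction error after
# shuffling its values, which breaks its link with the outcome.
perm_importance <- function(var, n_perm = 50) {
  base <- mse(dat)
  mean(replicate(n_perm, {
    shuffled <- dat
    shuffled[[var]] <- sample(shuffled[[var]])
    mse(shuffled) - base
  }))
}

imp <- c(x1 = perm_importance("x1"), x2 = perm_importance("x2"))
imp
```

Shuffling the informative predictor degrades the predictions far more than shuffling the noise variable, which is what the importance score captures.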

The function `select_pred` proposes three different scenarios according to the types of predictors:

- The first one consists in reducing the problem to a set of categorical variables (ordered or not) by discretizing all the continuous predictors beforehand, using the internal `convert_num` argument or another method, and then working with the conditional importance measures (`RF_condi = TRUE`), which give unbiased estimations. In the spirit of a partial correlation, the conditional importance measure of a variable \(X\) for the prediction of an outcome \(Y\) only uses the subset of variables most correlated with \(X\) for its computation. The argument `RF_condi_thr`, which corresponds exactly to the argument `threshold` of the function `varimp`, fixes a ratio above which a variable \(Z\) is considered sufficiently correlated to \(X\) to be used as an adjustment variable in the computation of the importance measure of \(X\) (in other words, \(Z\) is included in the conditioning for the computation, see (4) and (5) for more details). A threshold value of zero includes all variables in the computation of the conditional importance measure of each predictor \(X\), while a positive threshold \(< 1\) only includes a subset of variables. Two remarks about this method: first, taking into account only subsets of predictors in the computation of the variable importance measures can save a substantial amount of execution time. Second, because this approach does not handle incomplete information, the method is only applied to complete data (incomplete rows are temporarily removed for the study).

- The second possibility, still in the presence of mixed types of predictors, consists in running two successive RF procedures. The first one is used to select a unique candidate in each subset of correlated predictors (detected in the first step), while the second one extracts the permutation measures from the remaining subset of uncorrelated predictors (`RF_condi = FALSE`, by default). This second possibility has the advantage of working in the presence of incomplete predictors.

- The third scenario consists in first running the function without the RF process (`RF = FALSE`); then, depending on whether highly correlated predictors are present, users can remove redundant predictors manually and re-run the function with the subset of remaining non-collinear predictors to avoid potential biases introduced by the standard permutation measures.

The three scenarios finally lead to a list of uncorrelated predictors of the outcome sorted by order of importance. The argument `thresh_Y` corresponds to the minimal percent of importance required (and fixed by the user) for a variable to be considered as a reliable predictor of the outcome.
Finally, because all random forest results are subject to random variation, users can check whether the same importance ranking is achieved by varying the random seed parameter (`RF_SEED`) or by increasing the number of trees (`RF_ntree`).
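The stability check suggested above can be sketched as follows. The importance scores are hypothetical values standing for two runs with different seeds. They are not outputs of the package, and the exact layout of the `RF_PRED` table is not assumed here.

```r
# Hypothetical importance scores from two runs with different seeds
imp_seed1 <- c(Treatment = 0.42, Dosage = 0.31, Smoking = 0.15, Gender = 0.02)
imp_seed2 <- c(Treatment = 0.39, Dosage = 0.33, Smoking = 0.12, Gender = 0.04)

# Variables sorted from the most to the least important
rank_order <- function(imp) names(sort(imp, decreasing = TRUE))

# Is the ranking identical across the two runs?
same_ranking <- identical(rank_order(imp_seed1), rank_order(imp_seed2))

# Rank agreement between the two runs, on a common variable order
rho <- cor(imp_seed1, imp_seed2[names(imp_seed1)], method = "spearman")
```

If the rankings disagree for the top predictors, increasing `RF_ntree` before interpreting the selection is the remedy suggested in the text.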

1. Bergsma W. (2013). A bias-correction for Cramer's V and Tschuprow's T. Journal of the Korean Statistical Society, 42, 323–328.

2. Farrar D, and Glauber R. (1968). Multicollinearity in regression analysis. Review of Economics and Statistics, 49, 92–107.

3. Breiman L. (2001). Random Forests. Machine Learning, 45(1), 5–32.

4. Hothorn T, Buehlmann P, Dudoit S, Molinaro A, Van Der Laan M (2006). “Survival Ensembles.” Biostatistics, 7(3), 355–373.

5. Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A (2008). Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9, 307. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307

```
### Example 1
#-----
# - From two overlayed databases: using the table simu_data
# - Searching for the best predictors of "Yb1"
# - Using the raw database
# - The RF approaches are not required
#-----
data(simu_data)
sel_ex1 <- select_pred(simu_data,
Y = "Yb1", Z = "Yb2", ID = 1, OUT = "Y",
quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
RF = FALSE
)
#> The select_pred function is running for outcome= Yb1. Please wait ...
#> The process is now successfully completed
#> ---------
#> For comparison with another outcome from two overlayed tables :
#> just adapt the OUT option keeping all the others unchanged in the function
#> ---
#> For comparison with another outcome from two unoverlayed tables:
#> just adapt the arguments from Y to convert_class
#> ---------
### Example 2
#-----
# - With same conditions as example 1
# - Searching for the best predictors of "Yb2"
#-----
sel_ex2 <- select_pred(simu_data,
Y = "Yb1", Z = "Yb2", ID = 1, OUT = "Z",
quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
RF = FALSE
)
#> The select_pred function is running for outcome= Yb2. Please wait ...
#> The process is now successfully completed
#> ---------
#> For comparison with another outcome from two overlayed tables :
#> just adapt the OUT option keeping all the others unchanged in the function
#> ---
#> For comparison with another outcome from two unoverlayed tables:
#> just adapt the arguments from Y to convert_class
#> ---------
# \donttest{
### Example 3
#-----
# - With same conditions as example 1
# - Using a RF approach to estimate the standard variable importance measures
# and determine the best subset of predictors
# - Here a seed is required
#-----
sel_ex3 <- select_pred(simu_data,
Y = "Yb1", Z = "Yb2", ID = 1, OUT = "Y",
quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
RF = TRUE, RF_condi = FALSE, RF_SEED = 3023
)
#> The select_pred function is running for outcome= Yb1. Please wait ...
#> Risk of collinearity between predictors detected: Some predictors will be dropped during RF process
#> The process is now successfully completed
#> ---------
#> For comparison with another outcome from two overlayed tables :
#> just adapt the OUT option keeping all the others unchanged in the function
#> ---
#> For comparison with another outcome from two unoverlayed tables:
#> just adapt the arguments from Y to convert_class
#> ---------
### Example 4
#-----
# - With same conditions as example 1
# - Using a RF approach to estimate the conditional variable importance measures
# and determine the best subset of predictors
# - This approach requires to convert the numeric variables: Only "Age" here
# discretized in 3 levels
#-----
sel_ex4 <- select_pred(simu_data,
Y = "Yb1", Z = "Yb2", ID = 1, OUT = "Z",
quanti = c(3, 8), nominal = c(1, 4:5, 7), ordinal = c(2, 6),
convert_num = 8, convert_class = 3,
thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
RF = TRUE, RF_condi = TRUE, RF_condi_thr = 0.60, RF_SEED = 3023
)
#> The select_pred function is running for outcome= Yb2. Please wait ...
#> Risk of collinearity between predictors detected: Some predictors will be dropped during RF process
#> The process is now successfully completed
#> ---------
#> For comparison with another outcome from two overlayed tables :
#> just adapt the OUT option keeping all the others unchanged in the function
#> ---
#> For comparison with another outcome from two unoverlayed tables:
#> just adapt the arguments from Y to convert_class
#> ---------
### Example 5
#-----
# - Starting with a unique database
# - Same conditions as example 1
#-----
simu_A <- simu_data[simu_data$DB == "A", -3] # Base A
sel_ex5 <- select_pred(simu_A,
Y = "Yb1",
quanti = 7, nominal = c(1, 3:4, 6), ordinal = c(2, 5),
thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
RF = FALSE
)
#> The select_pred function is running for outcome= Yb1. Please wait ...
#> The process is now successfully completed
#> ---------
#> For comparison with another outcome from two overlayed tables :
#> just adapt the OUT option keeping all the others unchanged in the function
#> ---
#> For comparison with another outcome from two unoverlayed tables:
#> just adapt the arguments from Y to convert_class
#> ---------
### Example 6
#-----
# - Starting with a unique database
# - Using a RF approach to estimate the conditional variable importance measures
# and determine the best subset of predictors
# - This approach requires to convert the numeric variables: Only "Age" here
# discretized in 3 levels
#-----
simu_B <- simu_data[simu_data$DB == "B", -2] # Base B
sel_ex6 <- select_pred(simu_B,
Y = "Yb2",
quanti = 7, nominal = c(1, 3:4, 6), ordinal = c(2, 5),
convert_num = 7, convert_class = 3,
thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
RF = TRUE, RF_condi = TRUE, RF_condi_thr = 0.60, RF_SEED = 3023
)
#> The select_pred function is running for outcome= Yb2. Please wait ...
#> Risk of collinearity between predictors detected: Some predictors will be dropped during RF process
#> The process is now successfully completed
#> ---------
#> For comparison with another outcome from two overlayed tables :
#> just adapt the OUT option keeping all the others unchanged in the function
#> ---
#> For comparison with another outcome from two unoverlayed tables:
#> just adapt the arguments from Y to convert_class
#> ---------
# }
```