This function sequentially assigns individual predictions using a nearest neighbors procedure to solve recoding problems of data fusion.

indiv_grp_closest(
proxim,
jointprobaA = NULL,
jointprobaB = NULL,
percent_closest = 1,
which.DB = "BOTH"
)

## Arguments

proxim

a proxim_dist object or an object of similar structure

jointprobaA

a matrix whose number of columns corresponds to the number of modalities of the target variable $$Y$$ in database A, and which number of rows corresponds to the number of modalities of Z in database B. It gives an estimation of the joint probability of $$(Y,Z)$$ in A. The sum of cells of this matrix must be equal to 1

jointprobaB

a matrix whose number of columns equals to the number of modalities of the target variable $$Y$$ in database A, and which number of rows corresponds to the number of modalities of $$Z$$ in database B. It gives an estimation of the joint probability of $$(Y,Z)$$ in B. The sum of cells of this matrix must be equal to 1

percent_closest

a value between 0 and 1 (by default) corresponding to the fixed percent closest of individuals remained in the computation of the average distances

which.DB

a character string (with quotes) that indicates which individual predictions need to be computed: only the individual predictions of $$Y$$ in B ("B"), only those of $$Z$$ in A ("A") or the both ("BOTH" by default)

## Value

A list of two vectors of numeric values:

YAtrans

a vector corresponding to the individual predictions of $$Y$$ (numeric form) in the database B using the Optimal Transportation algorithm

ZBtrans

a vector corresponding to the individual predictions of $$Z$$ (numeric form) in the database A using the Optimal Transportation algorithm

## Details

A. THE RECODING PROBLEM IN DATA FUSION

Assuming that $$Y$$ and $$Z$$ are two variables which refered to the same target population in two separate databases A and B respectively (no overlapping rows), so that $$Y$$ and $$Z$$ are never jointly observed. Assuming also that A and B share a subset of common covariates $$X$$ of any types (same encodings in A and B) completed or not. Integrating these two databases often requires to solve the recoding problem by creating an unique database where the missing information of $$Y$$ and $$Z$$ is fully completed.

B. DESCRIPTION OF THE FUNCTION

The function indiv_grp_closest is an intermediate function used in the implementation of an algorithm called OUTCOME (and its enrichment R-OUTCOME, see the reference (2) for more details) dedicated to the solving of recoding problems in data fusion using Optimal Transportation theory. The model is implemented in the function OT_outcome which integrates the function indiv_grp_closest in its syntax as a possible second step of the algorithm. The function indiv_grp_closest can also be used separately provided that the argument proxim receives an output object of the function proxim_dist. This latter is available in the package and is so directly usable beforehand.

The algorithms OUTCOME (and R-OUTCOME) are made of two independent parts. Assuming that the objective consists in the prediction of $$Z$$ in the database A:

• The first part of the algorithm solves the optimization problem by providing a solution called $$\gamma$$ that corresponds here to an estimation of the joint distribution $$(Y,Z)$$ in A.

• From the first part, a nearest neighbor procedure is carried out as a second part to provide the individual predictions of $$Z$$ in A: this procedure is implemented in the function indiv_group_closest. In other words, this function sequentially assigns to each individual of A the modality of $$Z$$ that is closest.

Obviously, this algorithm runs in the same way for the prediction of $$Y$$ in the database B. The function indiv_grp_closest integrates in its syntax the function avg_dist_closest. Therefore, the related argument percent_closest is identical in the two functions. Thus, when computing average distances between an individual $$i$$ and a subset of individuals assigned to a same level of $$Y$$ or $$Z$$ is required, user can decide if all individuals from the subset of interest can participate to the computation (percent_closest=1) or only a fixed part p (<1) corresponding to the closest neighbors of $$i$$ (in this case percent_closest = p).

The arguments jointprobaA and jointprobaB correspond to the estimations of $$\gamma$$ (sum of cells must be equal to 1) in A and/or B respectively, according to the which.DB argument. For example, assuming that $$n_{Y_1}$$ individuals are assigned to the first modality of $$Y$$ in A, the objective consists in the individual predictions of $$Z$$ in A. Then, if jointprobaA[1,2] = 0.10, the maximum number of individuals that can be assigned to the second modality of $$Z$$ in A, can not exceed $$0.10 \times n_A$$. If $$n_{Y_1} \leq 0.10 \times n_A$$ then all individuals assigned to the first modality of $$Y$$ will be assigned to the second modality of $$Z$$. At the end of the process, each individual with still no affectation will receive the same modality of $$Z$$ as those of his nearest neighbor in B.

## References

1. Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Korosok MR, savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679. doi:10.1515/ijb-2018-0106

2. Gares V, Omer J (2020) Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association. doi:10.1080/01621459.2020.1775615

## See also

proxim_dist,avg_dist_closest, ,OT_outcome

## Author

Gregory Guernec, Valerie Gares, Jeremy Omer

otrecod.pkg@gmail.com

## Examples

data(simu_data)

### Example with the Manhattan distance

man1 <- transfo_dist(simu_data,
quanti = c(3, 8), nominal = c(1, 4:5, 7),
ordinal = c(2, 6), logic = NULL, prep_choice = "M"
)
#> Your target DB[, "Z"] was numeric ... By default, it has been converted in factor of integers
#> 5 remaining levels
mat_man1 <- proxim_dist(man1, norm = "M")

### Y(Yb1) and Z(Yb2) are a same information encoded in 2 different forms:
### (3 levels for Y and 5 levels for Z)
### ... Stored in two distinct databases, A and B, respectively
### The marginal distribution of Y in B is unknown,
### as the marginal distribution of Z in A ...

# Empirical distribution of Y in database A:
freqY <- prop.table(table(man1$Y)) freqY #> #> [20-40] [40-60[ [60-80] #> 0.4333333 0.4233333 0.1433333 # Empirical distribution of Z in database B freqZ <- prop.table(table(man1$Z))
freqZ
#>
#>      1      2      3      4      5
#> 0.3625 0.1275 0.0875 0.1075 0.3150

# By supposing that the following matrix called transport symbolizes
# an estimation of the joint distribution L(Y,Z) ...
# Note that, in reality this distribution is UNKNOWN and is
# estimated in the OT function by resolving an optimisation problem.

transport1 <- matrix(c(0.3625, 0, 0, 0.07083333, 0.05666667,
0, 0, 0.0875, 0, 0, 0.1075, 0,
0, 0.17166667, 0.1433333),
ncol = 5, byrow = FALSE)

# ... So that the marginal distributions of this object corresponds to freqY and freqZ:
apply(transport1, 1, sum) # = freqY
#> [1] 0.4333333 0.4233333 0.1433333
apply(transport1, 2, sum) # = freqZ
#> [1] 0.3625 0.1275 0.0875 0.1075 0.3150

# The affectation of the predicted values of Y in database B and Z in database A
# are stored in the following object:

pred_man1 <- indiv_grp_closest(mat_man1,
jointprobaA = transport1, jointprobaB = transport1,
percent_closest = 0.90
)
summary(pred_man1)
#>         Length Class  Mode
#> YAtrans 400    -none- numeric
#> ZBtrans 300    -none- numeric

# For the prediction of Z in A only, add the corresponding argument:
pred_man1_A <- indiv_grp_closest(mat_man1,
jointprobaA = transport1, jointprobaB = transport1,
percent_closest = 0.90, which.DB = "A"
)