ham.Rd
This function computes a matrix distance using the Hamming distance as proximity measure.
ham(mat_1, mat_2)
a vector, a matrix or a data.frame of binary values that may contain missing data
a vector, a matrix or a data.frame of binary values with the same number of columns as mat_1
that may contain missing data
A distance matrix
ham
returns the pairwise distances between rows (observations) of a single matrix if mat_1
equals mat_2
.
Otherwise ham
returns the matrix distance between rows of the two matrices mat_1
and mat_2
if this 2 matrices are different in input.
Computing the Hamming distance stays possible despite the presence of missing data by applying the following formula. Assuming that A and B are 2 matrices such as ncol(A) = ncol(B)
.
The Hamming distance between the \(i^{th}\) row of A and the \(k^{th}\) row of B equals:
$$\mbox{ham}(A_i,B_k) = \frac{\sum_j 1_{\left\{A_{ij} \neq B_{kj}\right\}}}{\sum_j 1}\times\left(\frac{\sum_j 1}{\sum_j 1_{\left\{!\mbox{is.na}(A_{ij}) \& !\mbox{is.na}( B_{kj})\right\}}}\right)$$
where: \(i = 1,\dots,\mbox{nrow}(A)\) and \(k = 1,\dots,\mbox{nrow}(B)\); And the expression located to the right term of the multiplication corresponds to a specific weigh applied in presence of NAs in \(A_i\) and/or \(B_k\).
This specificity is not implemented in the cdist
function and the Hamming distance can not be computed using the dist
function either.
The Hamming distance can not be calculated in only two situations:
If a row of A or B has only missing values (ie for each of the columns of A or B respectively).
The union of the indexes of the missing values in row i of A with the indexes of the missing values in row j of B concerns the indexes of all considered columns.
Example: Assuming that \(\mbox{ncol}(A) = \mbox{ncol}(B) = 3\), if \(A_i = (1,\mbox{NA},0)\) and \(B_j = (\mbox{NA},1,\mbox{NA})\), for each column, either the information in row i is missing in A, or the information is missing in B, which induces: \(\mbox{ham}(A_i,B_k) = \mbox{NA}\).
If mat_1
is a vector and mat_2
is a matrix (or data.frame) or vice versa, the length of mat_1
must be equal to the number of columns of mat_2
.
Roth R (2006). Introduction to Coding Theory. Cambridge University Press.
set.seed(3010)
sample_A <- sample(c(0, 1), 12, replace = TRUE)
set.seed(3007)
sample_B <- sample(c(0, 1), 15, replace = TRUE)
A <- matrix(sample_A, ncol = 3)
B <- matrix(sample_B, ncol = 3)
# These 2 matrices have no missing values
# Matrix of pairwise distances with A:
ham(A, A)
#> [,1] [,2] [,3] [,4]
#> [1,] 0.0000000 0.6666667 0.6666667 1.0000000
#> [2,] 0.6666667 0.0000000 0.6666667 0.3333333
#> [3,] 0.6666667 0.6666667 0.0000000 0.3333333
#> [4,] 1.0000000 0.3333333 0.3333333 0.0000000
# Matrix of distances between the rows of A and the rows of B:
ham(A, B)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 1.0000000 0.0000000 0.3333333 0.6666667 0.6666667
#> [2,] 0.3333333 0.6666667 1.0000000 0.6666667 0.6666667
#> [3,] 0.3333333 0.6666667 0.3333333 0.0000000 0.6666667
#> [4,] 0.0000000 1.0000000 0.6666667 0.3333333 0.3333333
# If mat_1 is a vector of binary values:
ham(c(0, 1, 0), B)
#> [1] 0.6666667 0.3333333 0.6666667 0.3333333 1.0000000
# Now by considering A_NA and B_NA two matrices built from A and B respectively,
# where missing values have been manually added:
A_NA <- A
A_NA[3, 1] <- NA
A_NA[2, 2:3] <- rep(NA, 2)
B_NA <- B
B_NA[2, 2] <- NA
ham(A_NA, B_NA)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 1.0 0 0.3333333 0.6666667 0.6666667
#> [2,] 0.0 1 1.0000000 0.0000000 1.0000000
#> [3,] 0.5 0 0.0000000 0.0000000 0.5000000
#> [4,] 0.0 1 0.6666667 0.3333333 0.3333333