Calculating overlap (and distance measures) for categorical variables in R
I am trying to calculate the distance between rows (data points) on the basis of the categorical variables in the columns. The simplest method I have seen is to calculate the overlap, in other words the proportion of variables on which x and y take identical values.
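As a minimal sketch of that definition (the two vectors `x` and `y` are made-up illustrative observations, not taken from any real data), the overlap is just the mean of the element-wise equality test:

```r
# Two hypothetical observations described by three categorical variables
x <- c(country = "uk", category = "private", level = "high")
y <- c(country = "uk", category = "public",  level = "low")

# Overlap: proportion of variables on which x and y agree
overlap <- mean(x == y)

# A corresponding dissimilarity is one minus the overlap
dissimilarity <- 1 - overlap
```

Here only `country` matches, so the overlap is one third.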
Imagine I have a dataset as follows:
    id <- 1:5
    dummy <- data.frame(country = c("uk", "uk", "usa", "usa", "usa"),
                        category = c("private", "public", "private", "private", "public"),
                        level = c("high", "low", "low", "low", "high"))
I want to calculate the proportional overlap (as defined above) between each pair of rows.
I defined a function to do this:
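    calcoverlap <- function(id, df) {
      n <- length(id)
      # n x n result matrix; only the lower triangle (i > j) is filled
      results <- matrix(NA, n, n)
      for (i in 1:n) {
        for (j in 1:n) {
          if (i > j) {
            # share of columns in which rows i and j hold the same value
            results[i, j] <- length(which(df[i, ] == df[j, ])) / ncol(df)
          }
        }
      }
      results
    }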
I think this works:
    dummy
    calcoverlap(id, dummy)
My question is: has this already been implemented more neatly somewhere? More generally, is there a package that calculates distance measures for categorical variables?
Thanks!
Here is one way to do it:
    df <- dummy  # the data frame from the question
    outer(seq(nrow(df)), seq(nrow(df)),
          Vectorize(function(x, y) mean(df[x, ] == df[y, ])))
              [,1]      [,2]      [,3]      [,4]      [,5]
    [1,] 1.0000000 0.3333333 0.3333333 0.3333333 0.3333333
    [2,] 0.3333333 1.0000000 0.3333333 0.3333333 0.3333333
    [3,] 0.3333333 0.3333333 1.0000000 1.0000000 0.3333333
    [4,] 0.3333333 0.3333333 1.0000000 1.0000000 0.3333333
    [5,] 0.3333333 0.3333333 0.3333333 0.3333333 1.0000000
However, this computes more comparisons than needed: every pair is evaluated twice, and each row is also compared with itself. To avoid that, there's combn:
    # values
    v = combn(seq(nrow(df)), 2, function(x) mean(df[x[1], ] == df[x[2], ]))
    # [1] 0.3333333 0.3333333 0.3333333 0.3333333 0.3333333 0.3333333 0.3333333 1.0000000 0.3333333 0.3333333

    # row combos
    r = combn(seq(nrow(df)), 2)
    #      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    # [1,]    1    1    1    1    2    2    2    3    3     4
    # [2,]    2    3    4    5    3    4    5    4    5     5
If you want the values in a matrix, there's
    m = matrix(, nrow(df), nrow(df))
    m[t(r)] <- v
    #      [,1]      [,2]      [,3]      [,4]      [,5]
    # [1,]   NA 0.3333333 0.3333333 0.3333333 0.3333333
    # [2,]   NA        NA 0.3333333 0.3333333 0.3333333
    # [3,]   NA        NA        NA 1.0000000 0.3333333
    # [4,]   NA        NA        NA        NA 0.3333333
    # [5,]   NA        NA        NA        NA        NA
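A possible alternative, offered only as a sketch and not as part of the approach above, is to work column by column instead of row pair by row pair: build one match matrix per variable with outer, then average them. The names `matches`, `overlap` and `d` are illustrative, and the data frame is re-created here so the block runs on its own.

```r
# Same toy data as in the question; as.character() below makes the
# comparison work whether the columns are strings or factors
df <- data.frame(
  country  = c("uk", "uk", "usa", "usa", "usa"),
  category = c("private", "public", "private", "private", "public"),
  level    = c("high", "low", "low", "low", "high")
)

# For each column, build an n x n logical matrix of pairwise matches
matches <- lapply(df, function(col) {
  col <- as.character(col)
  outer(col, col, "==")
})

# Sum the match matrices and divide by the number of variables
overlap <- Reduce(`+`, matches) / ncol(df)

# Convert to a "dist" object (dissimilarity = 1 - overlap), e.g. for hclust()
d <- as.dist(1 - overlap)
```

This loops over columns rather than over pairs of rows, which may be faster when there are many rows and few variables, at the cost of holding full n x n matrices in memory. On the package question: daisy() from the cluster package with metric = "gower" is one commonly used option, and for purely nominal (factor) columns it should give the same dissimilarity as 1 - overlap, though that is worth checking against your own data.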