Calculating overlap (and distance measures) for categorical variables in R


I am trying to calculate the distance between rows (data points) on the basis of the categorical variables in the columns. The simplest method I have seen is to calculate the overlap, in other words the proportion of variables on which x and y take identical values.

Imagine I have a dataset as follows:
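To make the definition concrete, here is a minimal sketch for a single pair of data points. The vectors `x` and `y` are hypothetical values made up for illustration, and treating `1 - overlap` as the distance is an assumption (the usual simple-matching convention), not something fixed by the definition above.

```r
# Two hypothetical data points described by three categorical variables
x <- c(country = "uk", category = "private", level = "high")
y <- c(country = "uk", category = "public",  level = "low")

# Overlap: proportion of variables on which x and y take identical values
overlap <- mean(x == y)

# Assumed convention: turn the similarity into a distance
distance <- 1 - overlap
```

Here only `country` matches, so the overlap is one third.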

    id = 1:5
    dummy <- data.frame(country = c("uk", "uk", "usa", "usa", "usa"),
                        category = c("private", "public", "private", "private", "public"),
                        level = c("high", "low", "low", "low", "high"))

and I want to calculate the proportional overlap (as above) between pairs of rows.

I can define a function to do this:

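    calcoverlap <- function(id, df) {
      n <- length(id)
      # n x n result matrix, filled with NA until computed
      results <- matrix(NA, n, n)
      for(i in 1:n) {
        for(j in 1:n) {
          # only fill the lower triangle; the measure is symmetric
          if(i > j) {
            # share of columns in which rows i and j agree
            results[i, j] <- length(which(df[i,] == df[j,])) / ncol(df)
          }
        }
      }
      results
    }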

I think this worked...

    dummy
    calcoverlap(id, dummy)

My question is: has this been implemented more neatly and more efficiently somewhere? More generally, is there a package to calculate distance measures for categorical variables?

Thanks!

Here is one way to do it:

    # df is the data frame of categorical variables (dummy in the question)
    outer(seq(nrow(df)), seq(nrow(df)), Vectorize(function(x,y) mean(df[x,]==df[y,])))
    #           [,1]      [,2]      [,3]      [,4]      [,5]
    # [1,] 1.0000000 0.3333333 0.3333333 0.3333333 0.3333333
    # [2,] 0.3333333 1.0000000 0.3333333 0.3333333 0.3333333
    # [3,] 0.3333333 0.3333333 1.0000000 1.0000000 0.3333333
    # [4,] 0.3333333 0.3333333 1.0000000 1.0000000 0.3333333
    # [5,] 0.3333333 0.3333333 0.3333333 0.3333333 1.0000000
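As a variation on the same idea, the full matrix can also be built one column at a time instead of one pair of rows at a time. This is only a sketch, not part of the original approach: it rebuilds the question's example data under the name `df`, and the helper names `match_per_col` and `overlap` are made up for illustration.

```r
# Example data from the question, kept as plain character columns
df <- data.frame(
  country  = c("uk", "uk", "usa", "usa", "usa"),
  category = c("private", "public", "private", "private", "public"),
  level    = c("high", "low", "low", "low", "high"),
  stringsAsFactors = FALSE
)

# For each column, an n x n logical matrix: TRUE where rows i and j agree
match_per_col <- lapply(df, function(col) {
  col <- as.character(col)   # also covers factor columns
  outer(col, col, "==")
})

# Sum the per-column matches and divide by the number of columns
overlap <- Reduce(`+`, match_per_col) / ncol(df)
overlap
```

Each `outer()` call here works on a whole column at once, so no data frame rows are indexed inside a loop. It still produces the full symmetric matrix, including the diagonal.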

However, the outer() call computes more comparisons than needed, since every pair is evaluated in both orders along with the diagonal. To avoid that, there's combn:

    # values
    v = combn(seq(nrow(df)), 2, function(x) mean(df[x[1],]==df[x[2],]))
    # [1] 0.3333333 0.3333333 0.3333333 0.3333333 0.3333333 0.3333333 0.3333333 1.0000000 0.3333333 0.3333333

    # row combos
    r = combn(seq(nrow(df)), 2)
    #      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    # [1,]    1    1    1    1    2    2    2    3    3     4
    # [2,]    2    3    4    5    3    4    5    4    5     5

If you want it in a matrix, there's

    # empty (all-NA) n x n matrix, then fill the upper triangle by (row, col) index pairs
    m = matrix(,nrow(df),nrow(df))
    m[t(r)] <- v

    #      [,1]      [,2]      [,3]      [,4]      [,5]
    # [1,]   NA 0.3333333 0.3333333 0.3333333 0.3333333
    # [2,]   NA        NA 0.3333333 0.3333333 0.3333333
    # [3,]   NA        NA        NA 1.0000000 0.3333333
    # [4,]   NA        NA        NA        NA 0.3333333
    # [5,]   NA        NA        NA        NA        NA
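On the more general question of a package: one option that may fit is `daisy()` from the `cluster` package. With `metric = "gower"` and only factor (nominal) columns, its dissimilarity should reduce to the proportion of mismatching variables, i.e. one minus the overlap. The sketch below assumes `cluster` is installed (the call is skipped otherwise), and the name `overlap_pkg` is made up for illustration.

```r
# Example data from the question; daisy() treats factor columns as nominal
dummy <- data.frame(
  country  = c("uk", "uk", "usa", "usa", "usa"),
  category = c("private", "public", "private", "private", "public"),
  level    = c("high", "low", "low", "low", "high"),
  stringsAsFactors = TRUE
)

# Only run the package call if cluster is available
if (requireNamespace("cluster", quietly = TRUE)) {
  # Gower dissimilarity: for nominal variables, the share of mismatches
  d <- cluster::daisy(dummy, metric = "gower")

  # Convert the dissimilarity back to a proportional overlap
  overlap_pkg <- 1 - as.matrix(d)
  print(round(overlap_pkg, 3))
}
```

The object returned by `daisy()` can be passed on to clustering functions such as `hclust()` via `as.dist()`, which may be more convenient than carrying a hand-built matrix around.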
