[R] Compact Indicator Matrices

Mon May 12 15:30:49 CEST 2008

On Sun, May 11, 2008 at 9:49 AM, amarkos <amarkos at gmail.com> wrote:
> On May 11, 4:47 pm, "Douglas Bates" <ba... at stat.wisc.edu> wrote:
>
>> Do you mean that you want to collapse similar rows into a single row
>> and perhaps a count of the number of times that this row occurs?
>
> Let me rephrase the problem by providing an example.
>
> Input:
>
> A =
>      [,1] [,2]
>  [1,]    1    1
>  [2,]    1    3
>  [3,]    2    1
>  [4,]    1    2
>  [5,]    2    1
>  [6,]    1    2
>  [7,]    1    1
>  [8,]    1    2
>  [9,]    1    3
> [10,]    2    1

An important question here is whether you start with two or more
variables like the columns of your matrix A.  If so, there is a more
direct method of getting the answers that you want.  The natural way
to store such variables in R is as factors.  I prefer to use letters
instead of numbers to represent the levels of a factor (that way I
don't confuse a factor with a numeric variable when I look at rows),
so I would create a data frame with two factors instead of a matrix.

> V1 <- factor(c(1,1,2,1,2,1,1,1,1,2), labels = LETTERS[1:2])
> V2 <- factor(c(1,3,1,2,1,2,1,2,3,1), labels = letters[1:3])
> df <- data.frame(f1 = V1, f2 = V2)
> df
   f1 f2
1   A  a
2   A  c
3   B  a
4   A  b
5   B  a
6   A  b
7   A  a
8   A  b
9   A  c
10  B  a
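
If the data already sit in a numeric matrix such as A, the same data
frame can be built from its columns.  This is only a sketch: the name
df2 is made up for illustration, and it assumes each column contains
exactly as many distinct codes as labels are supplied, so that the
labels line up with the sorted codes.

```r
# The matrix A from the original question, entered column by column
A <- cbind(c(1, 1, 2, 1, 2, 1, 1, 1, 1, 2),
           c(1, 3, 1, 2, 1, 2, 1, 2, 3, 1))

# factor() sorts the distinct codes and attaches the labels in that order
df2 <- data.frame(f1 = factor(A[, 1], labels = LETTERS[1:2]),
                  f2 = factor(A[, 2], labels = letters[1:3]))
str(df2)
```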

You could produce the indicator matrix and check for unique rows, etc.
(I will show that below), but all you need is the interaction of the
two factors:

> df$f12 <- with(df, f1:f2)[drop = TRUE]
> df
   f1 f2 f12
1   A  a A:a
2   A  c A:c
3   B  a B:a
4   A  b A:b
5   B  a B:a
6   A  b A:b
7   A  a A:a
8   A  b A:b
9   A  c A:c
10  B  a B:a
> str(df)
'data.frame':	10 obs. of  3 variables:
 $ f1 : Factor w/ 2 levels "A","B": 1 1 2 1 2 1 1 1 1 2
 $ f2 : Factor w/ 3 levels "a","b","c": 1 3 1 2 1 2 1 2 3 1
 $ f12: Factor w/ 4 levels "A:a","A:b","A:c",..: 1 3 4 2 4 2 1 2 3 4
> table(df$f12)

A:a A:b A:c B:a
  2   3   2   3
> as.numeric(df$f12)
 [1] 1 3 4 2 4 2 1 2 3 4

This shows that there are four distinct combinations, which occur
2, 3, 2 and 3 times respectively.  The first combination occurs in
rows 1 and 7 and consists of the first level of f1 and the first
level of f2, and so on.
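
To keep the row ids for each combination, one possibility (a sketch,
certainly not the only way) is to split the row numbers by the
interaction factor.  The name row_ids is chosen here purely for
illustration.

```r
# Rebuild the data frame so that the example is self-contained
V1 <- factor(c(1, 1, 2, 1, 2, 1, 1, 1, 1, 2), labels = LETTERS[1:2])
V2 <- factor(c(1, 3, 1, 2, 1, 2, 1, 2, 3, 1), labels = letters[1:3])
df <- data.frame(f1 = V1, f2 = V2)
df$f12 <- with(df, f1:f2)[drop = TRUE]

# One list element per observed combination, holding its row numbers
row_ids <- split(seq_len(nrow(df)), df$f12)
row_ids
```

Each element of the list is named after a level of f12, so
row_ids[["A:c"]] gives the rows in which that combination occurs.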

If you really do want the indicator matrix you could generate it as

> (ind <- cbind(model.matrix(~ 0 + f1, df), model.matrix(~ 0 + f2, df)))
   f1A f1B f2a f2b f2c
1    1   0   1   0   0
2    1   0   0   0   1
3    0   1   1   0   0
4    1   0   0   1   0
5    0   1   1   0   0
6    1   0   0   1   0
7    1   0   1   0   0
8    1   0   0   1   0
9    1   0   0   0   1
10   0   1   1   0   0
> unique(ind)
  f1A f1B f2a f2b f2c
1   1   0   1   0   0
2   1   0   0   0   1
3   0   1   1   0   0
4   1   0   0   1   0

but working with the factors is generally much simpler than working
with the indicators.
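
If you do stay with the indicators, the counts and row ids for each
distinct row can be recovered by matching every row against the unique
rows.  The following is a sketch under the assumption that pasting a
row of 0/1 entries into a string identifies it uniquely; the names
key, compact, grp, counts and ids are made up for illustration.

```r
# Rebuild the factors and the indicator matrix
df <- data.frame(
  f1 = factor(c(1, 1, 2, 1, 2, 1, 1, 1, 1, 2), labels = LETTERS[1:2]),
  f2 = factor(c(1, 3, 1, 2, 1, 2, 1, 2, 3, 1), labels = letters[1:3]))
ind <- cbind(model.matrix(~ 0 + f1, df), model.matrix(~ 0 + f2, df))

# One string per row, e.g. "10100", used as a lookup key
key     <- apply(ind, 1, paste, collapse = "")
compact <- unique(ind)                     # distinct rows, first-seen order
grp     <- match(key, apply(compact, 1, paste, collapse = ""))

counts <- tabulate(grp)                    # times each distinct row occurs
ids    <- split(seq_along(grp), grp)       # original row numbers per row
```

Here compact has one row per distinct pattern, in the order in which
the patterns first appear, and counts and ids follow that same order.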

> # Indicator matrix: first convert the columns of A to factors
> obj <- data.frame(lapply(data.frame(A), as.factor))
>
> nocases <- dim(obj)[1]
> novars  <- dim(obj)[2]
>
> # variable levels
> levels.n <- sapply(obj, nlevels)
> n        <- cumsum(levels.n)
>
> # Indicator matrix calculations
> Z        <- matrix(0, nrow = nocases, ncol = n[length(n)])
> newdat   <- lapply(obj, as.numeric)
> offset   <- (c(0, n[-length(n)]))
> for (i in 1:novars)
>  Z[1:nocases + (nocases * (offset[i] + newdat[[i]] - 1))] <- 1
>
> #######
>
> Output:
>
> Z =
>
>       [,1] [,2] [,3] [,4] [,5]
>  [1,]    1    0    1    0    0
>  [2,]    1    0    0    0    1
>  [3,]    0    1    1    0    0
>  [4,]    1    0    0    1    0
>  [5,]    0    1    1    0    0
>  [6,]    1    0    0    1    0
>  [7,]    1    0    1    0    0
>  [8,]    1    0    0    1    0
>  [9,]    1    0    0    0    1
> [10,]    0    1    1    0    0
>
>
> Z is an indicator matrix in the Multiple Correspondence Analysis
> framework.  My problem is to collapse identical rows (e.g. 2 and 9)
> into a single row and store the row ids.