[R] How should I denormalise a data frame list of lists column?

Wed Jul 1 21:31:30 CEST 2009

Hi,

(apologies for initial html posting)

I have a data frame where one column is a list of lists. I would like to
subset the data frame based on membership of the lists in that column and be
able to 'denormalise' the data frame so that a row is duplicated for each of
its list elements. Example code follows:

# The data is read in in this form with the c2 list values in single strings
which I then split to give lists:
> f1 <- data.frame(c1=0:2, c2=c("A,B,C", "A,E", "F,G,H"))
> f1$Split <- strsplit(as.character(f1$c2), ",")
> f1
  c1    c2   Split
1  0 A,B,C A, B, C
2  1   A,E    A, E
3  2 F,G,H F, G, H

# So f1$Split is the list of lists column I want to denormalise or use as
the subject for subsetting

# f2 is data to use to select subsets from f1
> f2 <- data.frame(c1=LETTERS[0:8], c2=c("Apples",
"Badger","Camel","Dog","Elephants","Fish","Goat","Horse"))
> f2
  c1       c2
1  A    Apple
2  B   Badger
3  C    Camel
4  D      Dog
5  E Elephant
6  F     Fish
7  G     Goat
8  H    Horse

# I was able to find which rows of f2 are represented in the f1 lists (not
entirely sure if this is the best way to do this):
> f3 <- f2[f2$c1 %in% unlist(f1$Split),]
> f3
  c1       c2
1  A    Apple
2  B   Badger
3  C    Camel
5  E Elephant
6  F     Fish
7  G     Goat
8  H    Horse

# Note that 'D' is missing from f3 because it is not in any of the f1$Split
lists

# f4 is a subset of f3 and I want to find the rows of f1 where f1$Split
contains any of f4$c1:
> f4 <- f3[c(1,3),]
> f4
  c1    c2
1  A Apple
3  C Camel

# I tried this and it didn't work, presumably because it's trying to match
against each list object rather than the list elements, but unlist doesn't
do the trick here because I need the individual rows, I need to unlist on a
row by row basis.
> f1[f1$Split %in% f4$c1,]
[1] c1    c2    Split
<0 rows> (or 0-length row.names)
> f1[f4$c1 %in% f1$Split,]
[1] c1    c2    Split
<0 rows> (or 0-length row.names)
> f1[match(f4$c1, f1$Split),]
     c1   c2 Split
NA   NA <NA>  NULL
NA.1 NA <NA>  NULL

I also looked at reshape which I don't think helps. I thought I might be
able to create a new data frame with the f1$Split denormalised and use that,
but couldn't find a way to do this, the result I'd want there is something
like:
> f1_denorm
  c1    c2   Split       SplitDenorm
1  0 A,B,C A, B, C   A
2  0 A,B,C A, B, C   B
3  0 A,B,C A, B, C   C
4  1   A,E    A, E     A
5  1   A,E    A, E     E
6  2 F,G,H F, G, H   F
7  2 F,G,H F, G, H  G
8  2 F,G,H F, G, H  H

I thought perhaps for loops would be the next thing to try, but there must
be a better way!

Thanks for any help.

Tim