[R] How should I denormalise a data frame list of lists column?
Tim Slidel
tim at slidel.net
Wed Jul 1 21:31:30 CEST 2009
Hi,
(apologies for initial html posting)
I have a data frame where one column is a list of lists. I would like to
subset the data frame based on membership of the lists in that column and be
able to 'denormalise' the data frame so that a row is duplicated for each of
its list elements. Example code follows:
# The data is read in in this form with the c2 list values in single strings
which I then split to give lists:
> f1 <- data.frame(c1=0:2, c2=c("A,B,C", "A,E", "F,G,H"))
> f1$Split <- strsplit(as.character(f1$c2), ",")
> f1
c1 c2 Split
1 0 A,B,C A, B, C
2 1 A,E A, E
3 2 F,G,H F, G, H
# So f1$Split is the list of lists column I want to denormalise or use as
the subject for subsetting
# f2 is data to use to select subsets from f1
> f2 <- data.frame(c1=LETTERS[0:8], c2=c("Apples",
"Badger","Camel","Dog","Elephants","Fish","Goat","Horse"))
> f2
c1 c2
1 A Apple
2 B Badger
3 C Camel
4 D Dog
5 E Elephant
6 F Fish
7 G Goat
8 H Horse
# I was able to find which rows of f2 are represented in the f1 lists (not
entirely sure if this is the best way to do this):
> f3 <- f2[f2$c1 %in% unlist(f1$Split),]
> f3
c1 c2
1 A Apple
2 B Badger
3 C Camel
5 E Elephant
6 F Fish
7 G Goat
8 H Horse
# Note that 'D' is missing from f3 because it is not in any of the f1$Split
lists
# f4 is a subset of f3 and I want to find the rows of f1 where f1$Split
contains any of f4$c1:
> f4 <- f3[c(1,3),]
> f4
c1 c2
1 A Apple
3 C Camel
# I tried this and it didn't work, presumably because it's trying to match
against each list object rather than the list elements, but unlist doesn't
do the trick here because I need the individual rows, I need to unlist on a
row by row basis.
> f1[f1$Split %in% f4$c1,]
[1] c1 c2 Split
<0 rows> (or 0-length row.names)
> f1[f4$c1 %in% f1$Split,]
[1] c1 c2 Split
<0 rows> (or 0-length row.names)
> f1[match(f4$c1, f1$Split),]
c1 c2 Split
NA NA <NA> NULL
NA.1 NA <NA> NULL
I also looked at reshape which I don't think helps. I thought I might be
able to create a new data frame with the f1$Split denormalised and use that,
but couldn't find a way to do this, the result I'd want there is something
like:
> f1_denorm
c1 c2 Split SplitDenorm
1 0 A,B,C A, B, C A
2 0 A,B,C A, B, C B
3 0 A,B,C A, B, C C
4 1 A,E A, E A
5 1 A,E A, E E
6 2 F,G,H F, G, H F
7 2 F,G,H F, G, H G
8 2 F,G,H F, G, H H
I thought perhaps for loops would be the next thing to try, but there must
be a better way!
Thanks for any help.
Tim
More information about the R-help
mailing list