[R] Problem with ddply in the plyr-package: surprising output of a date-column
Brian Diggs
diggsb at ohsu.edu
Mon Apr 25 20:05:05 CEST 2011
On 4/25/2011 10:19 AM, Christoph Jäckel wrote:
> Hi Together,
>
> I have a problem with the plyr package - more precisely with the ddply
> function - and would be very grateful for any help. I hope the example
> here is precise enough for someone to identify the problem. Basically,
> in this step I want to identify observations that are identical in
> terms of certain identifiers (ID1, ID2, ID3) and just want to save
> those observations (in this step, without deleting any rows or
> manipulating any data) in a separate data.frame. However, I get the
> warning message below and the column with dates is messed up.
> Interestingly, the value column (the type is factor here, but if you
> change that with as.integer it doesn't make any difference) is handled
> correctly. Any idea what I am doing wrong?
>
> df<- data.frame(cbind(ID1=c(1,2,2,3,3,4,4),ID2=c('a','b','b','c','d','e','e'),ID3=c("v1","v1","v1","v1","v2","v1","v1"),
>
> Date=c("1985-05-1","1985-05-2","1985-05-3","1985-05-4","1985-05-5","1985-05-6","1985-05-7"),
> Value=c(1,2,3,4,5,6,7)))
> df[,1]<- as.character(df[,1])
> df[,2]<- as.character(df[,2])
> df$Date<- strptime(df$Date,"%Y-%m-%d")
>
> #Apparently there are two observations that have the same IDs: ID1=2 and ID1=4
> ddply(df,.(ID1,ID2,ID3),nrow)
> #I want to save those IDs in a separate data.frame, so the desired output is:
> df[c(2:3,6:7),]
>
> #My idea: Write a custom function that only returns observations with
> #multiple rows.
> #Seems to work except that the Date column doesn't make any sense anymore
> #Warning message: In output[[var]][rng]<- df[[var]]: number of items
> #to replace is not a multiple of replacement length
> ddply(df,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
>
> #Notice that it works perfectly if I only have one observation with
> #multiple rows
> ddply(df[1:6,],.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
Works for me:
> df[c(2:3,6:7),]
  ID1 ID2 ID3      Date Value
2   2   b  v1 1985-05-2     2
3   2   b  v1 1985-05-3     3
6   4   e  v1 1985-05-6     6
7   4   e  v1 1985-05-7     7
> ddply(df,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
  ID1 ID2 ID3      Date Value
1   2   b  v1 1985-05-2     2
2   2   b  v1 1985-05-3     3
3   4   e  v1 1985-05-6     6
4   4   e  v1 1985-05-7     7
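
As a cross-check that does not depend on plyr, the same rows can be picked out in base R. This is only a sketch of an alternative, not what the code above does: it uses duplicated() on the three ID columns, and the names DF, ids, in_multi and dups are made up for the illustration (the Date column is left out to keep the example independent of the date handling).

```r
# Same IDs and values as in the example, built without cbind()
DF <- data.frame(ID1 = c(1, 2, 2, 3, 3, 4, 4),
                 ID2 = c("a", "b", "b", "c", "d", "e", "e"),
                 ID3 = c("v1", "v1", "v1", "v1", "v2", "v1", "v1"),
                 Value = 1:7)

ids <- DF[, c("ID1", "ID2", "ID3")]

# duplicated() flags the later repeats of an ID combination;
# fromLast = TRUE flags the earlier ones. The union marks every row
# whose ID combination occurs more than once.
in_multi <- duplicated(ids) | duplicated(ids, fromLast = TRUE)

dups <- DF[in_multi, ]
dups
```

Unlike the ddply() result, dups keeps the original row names, so it can be compared directly with df[c(2:3,6:7),].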
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] plyr_1.5.2

loaded via a namespace (and not attached):
[1] tools_2.13.0

A couple of things: there was just an update of plyr to 1.5.2; maybe
that fixes what you are seeing? Also, your df consists of only factors.
cbind-ing the data before turning it into a data.frame makes it a
character matrix which gets converted to factors.
> str(df)
'data.frame':   7 obs. of  5 variables:
 $ ID1  : Factor w/ 4 levels "1","2","3","4": 1 2 2 3 3 4 4
 $ ID2  : Factor w/ 5 levels "a","b","c","d",..: 1 2 2 3 4 5 5
 $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
 $ Date : Factor w/ 7 levels "1985-05-1","1985-05-2",..: 1 2 3 4 5 6 7
 $ Value: Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7
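
The coercion described above can be checked on a small case. This is a minimal sketch with made-up object names (m, from_matrix, direct); whether the string columns come out as factor or character depends on the stringsAsFactors setting of the R version in use, so no particular class is assumed for them here.

```r
# cbind() on mixed numeric and character vectors yields a character matrix
m <- cbind(ID1 = c(1, 2, 2), ID2 = c("a", "b", "b"), Value = c(1, 2, 3))
is.character(m)   # the numbers have become strings

# Converting that matrix gives string-derived columns only
# (factor or character, depending on stringsAsFactors)
from_matrix <- data.frame(m)
sapply(from_matrix, class)

# Passing the vectors directly lets each column keep its own type
direct <- data.frame(ID1 = c(1, 2, 2), ID2 = c("a", "b", "b"),
                     Value = c(1, 2, 3))
sapply(direct, class)
```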
Maybe that has something to do with the odd "dates" since they are not
really dates at all, just string representations of factor levels.
Compare with:
DF <- data.frame(ID1=c(1,2,2,3,3,4,4),
                 ID2=c('a','b','b','c','d','e','e'),
                 ID3=c("v1","v1","v1","v1","v2","v1","v1"),
                 Date=as.Date(c("1985-05-1","1985-05-2","1985-05-3",
                                "1985-05-4","1985-05-5","1985-05-6","1985-05-7")),
                 Value=c(1,2,3,4,5,6,7))
str(DF)
#'data.frame':   7 obs. of  5 variables:
# $ ID1  : num  1 2 2 3 3 4 4
# $ ID2  : Factor w/ 5 levels "a","b","c","d",..: 1 2 2 3 4 5 5
# $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
# $ Date : Date, format: "1985-05-01" "1985-05-02" ...
# $ Value: num  1 2 3 4 5 6 7
This version also works for me.
ddply(DF,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
#  ID1 ID2 ID3       Date Value
#1   2   b  v1 1985-05-02     2
#2   2   b  v1 1985-05-03     3
#3   4   e  v1 1985-05-06     6
#4   4   e  v1 1985-05-07     7
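
One difference from the original code that may matter: there the Date column is converted with strptime(), which returns a POSIXlt object, whereas DF uses as.Date(). A POSIXlt object is stored as a list of date-time fields rather than as one number per element. Whether that is what triggers the "number of items to replace" warning is an assumption, not something confirmed in this thread, but the storage difference itself is easy to inspect. The names dates, lt, d and ct below are only for illustration.

```r
dates <- c("1985-05-1", "1985-05-2", "1985-05-3")

lt <- strptime(dates, "%Y-%m-%d")  # POSIXlt: a list of fields (sec, min, hour, ...)
d  <- as.Date(dates)               # Date: one number per element
ct <- as.POSIXct(lt)               # POSIXct: one number per element

class(lt)
is.list(unclass(lt))     # list-based storage
is.numeric(unclass(d))   # plain numeric storage
is.numeric(unclass(ct))  # plain numeric storage
```

If date-times rather than plain dates are needed, wrapping the strptime() result in as.POSIXct() before storing it in the data.frame keeps the column as an ordinary numeric-based vector.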
> Thanks in advance,
>
> Christoph
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Christoph Jäckel (Dipl.-Kfm.)
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Research Assistant
>
> Chair for Financial Management and Capital Markets | Lehrstuhls für
> Finanzmanagement und Kapitalmärkte
>
> TUM School of Management | Technische Universität München
>
> Arcisstr. 21 | D-80333 München | Germany
>
--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University