[R] Problem with ddply in the plyr-package: surprising output of a date-column

Mon Apr 25 20:05:05 CEST 2011

On 4/25/2011 10:19 AM, Christoph Jäckel wrote:
> Hi all,
>
> I have a problem with the plyr package, more precisely with the ddply
> function, and would be very grateful for any help. I hope the example
> here is precise enough for someone to identify the problem. In this
> step I want to identify observations that are identical in terms of
> certain identifiers (ID1, ID2, ID3) and save just those observations
> in a separate data.frame, without deleting any rows or manipulating
> any data. However, I get the warning message below and the Date
> column is messed up. Interestingly, the Value column (a factor here,
> but converting it with as.integer makes no difference) is handled
> correctly. Any idea what I am doing wrong?
>
> df <- data.frame(cbind(ID1=c(1,2,2,3,3,4,4),
>                        ID2=c('a','b','b','c','d','e','e'),
>                        ID3=c("v1","v1","v1","v1","v2","v1","v1"),
>                        Date=c("1985-05-1","1985-05-2","1985-05-3","1985-05-4",
>                               "1985-05-5","1985-05-6","1985-05-7"),
>                        Value=c(1,2,3,4,5,6,7)))
> df[,1] <- as.character(df[,1])
> df[,2] <- as.character(df[,2])
> df$Date <- strptime(df$Date,"%Y-%m-%d")
>
> #Apparently there are two sets of observations that share the same IDs: ID1=2 and ID1=4
> ddply(df,.(ID1,ID2,ID3),nrow)
> #I want to save those IDs in a separate data.frame, so the desired output is:
> df[c(2:3,6:7),]
>
> #My idea: Write a custom function that only returns observations with
> #multiple rows.
> #Seems to work except that the Date column doesn't make any sense anymore
> #Warning message: In output[[var]][rng]<- df[[var]]: number of items
> #to replace is not a multiple of replacement length
> ddply(df,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
>
> #Notice that it works perfectly if I only have one observation with
> #multiple rows
> ddply(df[1:6,],.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})

Works for me:

 > df[c(2:3,6:7),]
   ID1 ID2 ID3      Date Value
2   2   b  v1 1985-05-2     2
3   2   b  v1 1985-05-3     3
6   4   e  v1 1985-05-6     6
7   4   e  v1 1985-05-7     7
 > ddply(df,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
   ID1 ID2 ID3      Date Value
1   2   b  v1 1985-05-2     2
2   2   b  v1 1985-05-3     3
3   4   e  v1 1985-05-6     6
4   4   e  v1 1985-05-7     7
 > sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] plyr_1.5.2

loaded via a namespace (and not attached):
[1] tools_2.13.0

A couple of things: plyr was just updated to 1.5.2; maybe that fixes 
what you are seeing?  Also, your df consists only of factors. 
cbind-ing the vectors before turning them into a data.frame creates a 
character matrix, whose columns are then converted to factors.

 > str(df)
'data.frame':   7 obs. of  5 variables:
  $ ID1  : Factor w/ 4 levels "1","2","3","4": 1 2 2 3 3 4 4
  $ ID2  : Factor w/ 5 levels "a","b","c","d",..: 1 2 2 3 4 5 5
  $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
  $ Date : Factor w/ 7 levels "1985-05-1","1985-05-2",..: 1 2 3 4 5 6 7
  $ Value: Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7
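
The coercion described above can be reproduced in a few lines of base R. This is only a sketch: the object names (m, df_fac, df_typed) are made up for illustration, and stringsAsFactors = TRUE is spelled out because R 4.0.0 and later no longer convert strings to factors by default.

```r
# cbind() on vectors of mixed type returns a matrix, and a matrix can hold
# only one type, so the numbers are coerced to character strings.
m <- cbind(ID1 = c(1, 2, 2), ID2 = c("a", "b", "b"), Value = c(1, 2, 3))
is.character(m)

# Converting that matrix to a data.frame turns every column into a factor
# (with stringsAsFactors = TRUE, which was the default before R 4.0.0).
df_fac <- data.frame(m, stringsAsFactors = TRUE)
col_classes_fac <- sapply(df_fac, class)

# Passing the vectors to data.frame() directly keeps each column's own type.
df_typed <- data.frame(ID1 = c(1, 2, 2), ID2 = c("a", "b", "b"),
                       Value = c(1, 2, 3), stringsAsFactors = TRUE)
col_classes_typed <- sapply(df_typed, class)
```

Printing col_classes_fac and col_classes_typed side by side shows which columns lost their numeric type on the way through cbind().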

Maybe that has something to do with the odd "dates" since they are not 
really dates at all, just string representations of factor levels. 
Compare with:

DF <- data.frame(ID1=c(1,2,2,3,3,4,4),
	ID2=c('a','b','b','c','d','e','e'),
	ID3=c("v1","v1","v1","v1","v2","v1","v1"),
	Date=as.Date(c("1985-05-1","1985-05-2","1985-05-3",
		"1985-05-4","1985-05-5","1985-05-6","1985-05-7")),
	Value=c(1,2,3,4,5,6,7))
str(DF)
#'data.frame':   7 obs. of  5 variables:
# $ ID1  : num  1 2 2 3 3 4 4
# $ ID2  : Factor w/ 5 levels "a","b","c","d",..: 1 2 2 3 4 5 5
# $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
# $ Date : Date, format: "1985-05-01" "1985-05-02" ...
# $ Value: num  1 2 3 4 5 6 7

This version also works for me.

ddply(DF,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
#  ID1 ID2 ID3       Date Value
#1   2   b  v1 1985-05-02     2
#2   2   b  v1 1985-05-03     3
#3   4   e  v1 1985-05-06     6
#4   4   e  v1 1985-05-07     7
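
One further difference between df and DF is how the dates are stored. In the original post the Date column is created with strptime(), which returns a POSIXlt object; internally that is a list of components (seconds, minutes, hours, and so on) rather than one value per row. Whether this is what triggers the warning quoted above is an assumption that nothing in this exchange confirms, but the structural difference is easy to inspect in base R. The variable names below are illustrative only.

```r
dates_chr <- c("1985-05-1", "1985-05-2", "1985-05-3")

# strptime() returns POSIXlt: a list of date-time components under the hood.
lt <- strptime(dates_chr, "%Y-%m-%d")
class(lt)
is.list(unclass(lt))

# as.Date() stores one number per element (days since 1970-01-01).
d <- as.Date(dates_chr, format = "%Y-%m-%d")
class(d)
is.list(unclass(d))

# as.POSIXct() converts POSIXlt to a numeric vector of seconds since the epoch.
ct <- as.POSIXct(lt)
class(ct)
is.numeric(unclass(ct))
```

If a date-time rather than a plain date is needed, as.POSIXct() gives a column that is stored as one number per row, like Date.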

> Thanks in advance,
>
> Christoph
>
> --
> Christoph Jäckel (Dipl.-Kfm.)
> Research Assistant
> Chair for Financial Management and Capital Markets | Lehrstuhls für
> Finanzmanagement und Kapitalmärkte
> TUM School of Management | Technische Universität München
> Arcisstr. 21 | D-80333 München | Germany

-- 
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University