[R] collapsing records

Mon Jan 20 04:10:37 CET 2014

On 01/20/2014 11:44 AM, Bill wrote:
> I am trying to read a csv file with a date-time field. There are many rows
> with the same date but different times. I first want to clear the times so
> that rows from the same day have the same date-time field (called Date).
> There is another field called Text and I want to collapse all the records
> with the same date so that there is only one record for this date and with
> a text field that contains all the strings from all the corresponding text
> fields. At the same time I want to create a new field that has the count of
> how many records were collapsed for each date. There is a third field
> called Tw.ID and I was trying to use tapply on this field to do this. Later
> I will create a DocumentTermMatrix with the tm package on this dataframe.
> In the code below I have not figured out how to collapse the data so that
> there is only one record for each date and I don't really have a good way
> to add in a count field. Can anyone make any suggestions?
> Thanks.
>
> install.packages(c("tm"))
> library(tm)
> y.df=read.csv("YHOO3000.csv", header=TRUE)
> y.df$Date= as.POSIXlt( y.df$Date)
> ysub14.df=y.df
> ysub14.df$Date=y.df$Date -14*3600 #I pushed the record times back a little here.
> ysub14.df$Date=as.Date(ysub14.df$Date, "%Y-%m-%d")
> # might want to use groups<-unstack(data.frame(ysub14.df$Text,ysub14.df$Date))
> # to put all the tweets for one day into a group. This makes a list
> # I think, with the name of the list being the Date and
> # the tweets for that date being stored in a vector.
> countgroup2=tapply(ysub14.df$Tw.ID,ysub14.df$Date,length)
>
Hi Bill,
Here is one way:

# get some date-time strings
dates<-paste("2014-01-",10:15," ",sample(0:23,20),
  ":",sample(0:59,20),":",sample(0:59,20),sep="")
# function to return stupid text
sillytext<-function(n) {
  return(paste(sample(letters[1:26],n),sep="",collapse=""))
}
# get the stupid text
ttext<-sapply(rep(10,20),sillytext)
# make the data frame
y.df<-data.frame(dates,ttext)
# convert the date-time strings to dates
y.df$dates<-
  as.Date(format(as.Date(dates,"%Y-%m-%d %H:%M:%S"),
  "%Y-%m-%d"),"%Y-%m-%d")
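# stretch_df is in the prettyR package
library(prettyR)
# stretch out all the text strings for each day: one row per date,
# with the strings in separate columns (padded with NA)
y2.df<-stretch_df(y.df,"dates","ttext")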
# get the dimension of the resulting data frame
ydim<-dim(y2.df)
# function to count the NAs
nna<-function(x) return(sum(is.na(x)))
# add a column with a count of _not_ NAs
y2.df$nrec<-
  (ydim[2]-1)-apply(as.matrix(y2.df[,2:ydim[2]]),1,nna)
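
If you would rather have all the text for a day in a single field, as in your description, something along these lines in base R might also do it. This is only a sketch: the toy data frame stands in for your CSV file, I am assuming the columns are named Date, Text and Tw.ID as in your message, and collapsed.df, alltext and nrec are just names I made up.

```r
# toy data standing in for the CSV; the values are invented
y.df<-data.frame(
  Date=c("2014-01-10 09:15:00","2014-01-10 17:40:12",
    "2014-01-11 08:05:30","2014-01-12 12:00:00",
    "2014-01-12 23:59:59"),
  Text=c("first","second","third","fourth","fifth"),
  Tw.ID=1:5,
  stringsAsFactors=FALSE)
# drop the time of day, keeping only the calendar date
y.df$Date<-as.Date(substr(y.df$Date,1,10))
# paste together all the text strings for each date
alltext<-tapply(y.df$Text,y.df$Date,paste,collapse=" ")
# count the records for each date
nrec<-tapply(y.df$Tw.ID,y.df$Date,length)
# one row per date; both tapply results are in the same date order
collapsed.df<-data.frame(Date=as.Date(names(alltext)),
  Text=as.vector(alltext),nrec=as.vector(nrec),
  stringsAsFactors=FALSE)
```

If you shift the times back by some hours first, as in your code, do that before the date-times are cut down to dates.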

Jim