[R] collapsing records

Jim Lemon jim at bitwrit.com.au
Mon Jan 20 04:10:37 CET 2014

On 01/20/2014 11:44 AM, Bill wrote:
> I am trying to read a csv file with a date-time field. There are many rows
> with the same date but different times. I first want to clear the times so
> that rows from the same day have the same date-time field (called Date).
> There is another field called Text and I want to collapse all the records
> with the same date so that there is only one record for this date and with
> a text field that contains all the strings from all the corresponding text
> fields. At the same time I want to create a new field that has the count of
> how many records were collapsed for each date. There is a third field
> called Tw.ID and I was trying to use tapply on this field to do this. Later
> I will create a DocumentTermMatrix with the tm package on this dataframe.
> In the code below I have not figured out how to collapse the data so that
> there is only one record for each date and I don't really have a good way
> to add in a count field. Can anyone make any suggestions?
> Thanks.
> install.packages(c("tm"))
> library(tm)
> y.df=read.csv("YHOO3000.csv", header=TRUE)
> y.df$Date= as.POSIXlt( y.df$Date)
> ysub14.df=y.df
> ysub14.df$Date=y.df$Date -14*3600 # I pushed the record times back a little here.
> ysub14.df$Date=as.Date(ysub14.df$Date, "%Y-%m-%d")
> # might want to use groups<-unstack(data.frame(ysub14.df$Text,ysub14.df$Date))
> # to put all the tweets for one day into a group. This makes a list
> # I think, with the name of the list being the Date and
> # the tweets for that date being stored in a vector.
> countgroup2=tapply(ysub14.df$Tw.ID,ysub14.df$Date,length)
Hi Bill,
Here is one way:

# get some date-time strings
dates<-paste("2014-01-",10:15," ",sample(0:23,20),
# function to return stupid text
sillytext<-function(n) {
# get the stupid text
# make the data frame
# convert the date-time strings to dates
  as.Date(format(as.Date(dates,"%Y-%m-%d %H:%M:%S"),
# stretch out all the text strings for each day
# get the dimension of the resulting data frame
# function to count the NAs
nna<-function(x) return(sum(is.na(x)))
# add a column with a count of _not_ NAs
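
For comparison, here is a minimal base R sketch of the same collapsing task. It is not necessarily the approach outlined in the comments above. The toy data are made up, the column names Date, Text and Tw.ID are taken from the question, and the object name collapsed and the fixed "UTC" time zone are assumptions for illustration only.

```r
# Toy data standing in for YHOO3000.csv (hypothetical values; the column
# names Date, Text and Tw.ID come from the question).
y.df <- data.frame(
  Date = c("2014-01-10 09:15:00", "2014-01-10 17:40:12",
           "2014-01-11 08:05:30", "2014-01-11 12:00:00",
           "2014-01-11 23:59:59", "2014-01-12 10:30:00"),
  Text = c("first tweet", "second tweet", "third tweet",
           "fourth tweet", "fifth tweet", "sixth tweet"),
  Tw.ID = 1:6,
  stringsAsFactors = FALSE
)

# Drop the time of day so that rows from the same day share one Date value.
# The time zone is fixed to UTC (an assumption) so the local zone cannot
# shift a record onto a neighbouring day.
y.df$Date <- as.Date(as.POSIXct(y.df$Date, tz = "UTC"), tz = "UTC")

# One value per date: paste the texts together and count the records.
# Both calls group by the same index, so the results line up.
texts  <- tapply(y.df$Text, y.df$Date, paste, collapse = " ")
counts <- tapply(y.df$Tw.ID, y.df$Date, length)

# Assemble one record per date with the combined text and the count.
collapsed <- data.frame(
  Date  = as.Date(names(texts)),
  Text  = as.vector(texts),
  Count = as.vector(counts),
  stringsAsFactors = FALSE
)
print(collapsed)
```

The Text column of collapsed holds one string per day, which should be usable as input for a corpus later on, and Count gives the number of records that were merged for that day.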

