[R] How to extract Friday data from daily data.
Gabor Grothendieck
ggrothendieck at gmail.com
Sat Nov 6 03:29:25 CET 2010
On Fri, Nov 5, 2010 at 8:24 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> On Fri, Nov 5, 2010 at 1:22 PM, thornbird <huachang396 at gmail.com> wrote:
>>
>> I am new to using R for data analysis. I have an incomplete time series
>> dataset that is in daily format. I want to extract only Friday data from it.
>> However, there are two problems with it.
>>
>> First, if Friday data is missing in that week, I need to extract the data of
>> the day prior to that Friday (e.g. Thursday).
>>
>> Second, sometimes there are duplicate Friday data (say Friday morning and
>> afternoon), but I only need the latest one (Friday afternoon).
>>
>> My question is how I can only extract the Friday data and make it a new
>> dataset so that I have data for every single week for the convenience of
>> data analysis.
>>
>
>
> There are several approaches depending on exactly what is to be
> produced. We show two of them here using zoo.
>
>
> # read in data
>
> Lines <- " views number timestamp day time
> 1 views 910401 1246192687 Sun 6/28/2009 12:38
> 2 views 921537 1246278917 Mon 6/29/2009 12:35
> 3 views 934280 1246365403 Tue 6/30/2009 12:36
> 4 views 986463 1246888699 Mon 7/6/2009 13:58
> 5 views 995002 1246970243 Tue 7/7/2009 12:37
> 6 views 1005211 1247079398 Wed 7/8/2009 18:56
> 7 views 1011144 1247135553 Thu 7/9/2009 10:32
> 8 views 1026765 1247308591 Sat 7/11/2009 10:36
> 9 views 1036856 1247436951 Sun 7/12/2009 22:15
> 10 views 1040909 1247481564 Mon 7/13/2009 10:39
> 11 views 1057337 1247568387 Tue 7/14/2009 10:46
> 12 views 1066999 1247665787 Wed 7/15/2009 13:49
> 13 views 1077726 1247778752 Thu 7/16/2009 21:12
> 14 views 1083059 1247845413 Fri 7/17/2009 15:43
> 15 views 1083059 1247845824 Fri 7/17/2009 18:45
> 16 views 1089529 1247914194 Sat 7/18/2009 10:49"
>
> library(zoo)
>
> # read in and create a zoo series
> # - skip = 1 skips the header line
> # - index = 3: the time index is the third non-removed column
> # - format = "%m/%d/%Y" converts the index to Date class
> # - col.names = as specified
> # - aggregate = keeps the last value among duplicate dates
> # - colClasses = "NULL" marks the columns we want to remove
>
> colClasses <-
> c("NULL", "NULL", "numeric", "numeric", "NULL", "character", "NULL")
>
> col.names <- c(NA, NA, "views", "number", NA, NA, NA)
>
> # z <- read.zoo("myfile.dat", skip = 1, index = 3,
> z <- read.zoo(textConnection(Lines), skip = 1, index = 3,
>               format = "%m/%d/%Y", col.names = col.names,
>               aggregate = function(x) tail(x, 1), colClasses = colClasses)
>
> ## Now that we have read it in, let's process it
>
> ## 1.
>
> # extract all Thursdays and Fridays
> z45 <- z[format(time(z), "%w") %in% 4:5,]
>
> # keep last entry in each week
> # and show result on R console
> z45[!duplicated(format(time(z45), "%U"), fromLast = TRUE), ]
>
>
> # 2. alternative approach
> # the approach above labels each point with its original date,
> # so if Thursday is used it gets the date of that Thursday.
> # Another approach is to always label the resulting point as Friday
> # and also use the last available value even if it's not Thursday
>
> # create daily grid
> g <- seq(start(z), end(z), by = "day")
>
> # fill in daily grid so Friday is filled in with prior value
> # if Friday is NA
> z.filled <- na.locf(z, xout = g)
>
> # extract Fridays (including those filled in from previous)
> # and show result on R console
> z.filled[format(time(z.filled), "%w") == "5", ]
>
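To see which observation each Friday ends up with under the second
approach, here is a rough base R sketch (no zoo) that only tracks dates.
The names obs, grid, fridays and source_day are made up for this
illustration, and findInterval() is used here as a stand-in for the
na.locf() step, so treat it as a cross-check rather than as the method
itself.

```r
# Dates from the sample data above, one per day (duplicate Friday collapsed)
obs <- as.Date(c("2009-06-28", "2009-06-29", "2009-06-30", "2009-07-06",
                 "2009-07-07", "2009-07-08", "2009-07-09", "2009-07-11",
                 "2009-07-12", "2009-07-13", "2009-07-14", "2009-07-15",
                 "2009-07-16", "2009-07-17", "2009-07-18"))

# Daily grid over the observed range, then keep only the Fridays ("%w" == "5")
grid    <- seq(min(obs), max(obs), by = "day")
fridays <- grid[format(grid, "%w") == "5"]

# For each Friday, pick the most recent observation on or before it
# (assumes obs is sorted and its first date precedes the first Friday)
source_day <- obs[findInterval(as.numeric(fridays), as.numeric(obs))]

data.frame(friday = fridays, taken_from = source_day)
```

For the sample data this should pair the Fridays 2009-07-03, 2009-07-10
and 2009-07-17 with the observations of 2009-06-30, 2009-07-09 and
2009-07-17 respectively. The first approach, by contrast, returns nothing
for the week of 2009-06-28 because that week has neither a Thursday nor a
Friday observation.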
Note that if the data can span more than one year then "%U" above
should be replaced with "%Y-%U" so that weeks in one year are not
lumped with weeks in other years.
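
A small base R sketch of why the year matters, using two made-up Friday
observations one year apart (the dates and values are hypothetical and
are not from the original data):

```r
# Hypothetical Friday observations one year apart; both fall in week "01"
dates  <- as.Date(c("2009-01-09", "2010-01-08"))
values <- c(10, 20)

# Week-only key: the two Fridays share a key, so only the later one is kept
keep_week <- !duplicated(format(dates, "%U"), fromLast = TRUE)

# Year-and-week key: each Friday gets its own key and both are kept
keep_year_week <- !duplicated(format(dates, "%Y-%U"), fromLast = TRUE)

values[keep_week]       # 20
values[keep_year_week]  # 10 20
```

The same substitution applies to the duplicated() call on z45 in the
first approach.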
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com