[R] How to extract Friday data from daily data.

Sat Nov 6 01:24:42 CET 2010

On Fri, Nov 5, 2010 at 1:22 PM, thornbird <huachang396 at gmail.com> wrote:
>
> I am new to using R for data analysis. I have an incomplete time series
> dataset that is in daily format. I want to extract only Friday data from it.
> However, there are two problems with it.
>
> First, if Friday data is missing in that week, I need to extract the data of
> the day prior to that Friday (e.g. Thursday).
>
> Second, sometimes there are duplicate Friday data (say Friday morning and
> afternoon), but I only need the latest one (Friday afternoon).
>
> My question is how I can only extract the Friday data and make it a new
> dataset so that I have data for every single week for the convenience of
> data analysis.
>

There are several approaches depending on exactly what is to be
produced.  We show two of them here using zoo.

# read in data

Lines <- "  views  number  timestamp day            time
1  views  910401 1246192687 Sun 6/28/2009 12:38
2  views  921537 1246278917 Mon 6/29/2009 12:35
3  views  934280 1246365403 Tue 6/30/2009 12:36
4  views  986463 1246888699 Mon  7/6/2009 13:58
5  views  995002 1246970243 Tue  7/7/2009 12:37
6  views 1005211 1247079398 Wed  7/8/2009 18:56
7  views 1011144 1247135553 Thu  7/9/2009 10:32
8  views 1026765 1247308591 Sat 7/11/2009 10:36
9  views 1036856 1247436951 Sun 7/12/2009 22:15
10 views 1040909 1247481564 Mon 7/13/2009 10:39
11 views 1057337 1247568387 Tue 7/14/2009 10:46
12 views 1066999 1247665787 Wed 7/15/2009 13:49
13 views 1077726 1247778752 Thu 7/16/2009 21:12
14 views 1083059 1247845413 Fri 7/17/2009 15:43
15 views 1083059 1247845824 Fri 7/17/2009 18:45
16 views 1089529 1247914194 Sat 7/18/2009 10:49"

library(zoo)

# read in and create a zoo series
# - skip= skips over the header line
# - colClasses= specifies "NULL" for columns we want to remove
# - index= the time index is the third non-removed column
# - format= converts the index to Date class using the indicated format
# - col.names= names the retained columns (the count column becomes
#   "views" and the timestamp column becomes "number")
# - aggregate= collapses duplicate dates, keeping the last entry

colClasses <-
 c("NULL", "NULL", "numeric", "numeric", "NULL", "character", "NULL")

col.names <- c(NA, NA, "views", "number", NA, NA, NA)

# z <- read.zoo("myfile.dat", skip = 1, index = 3,
z <- read.zoo(textConnection(Lines), skip = 1, index = 3,
	format = "%m/%d/%Y", col.names = col.names,
	aggregate = function(x) tail(x, 1), colClasses = colClasses)
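
As a rough illustration of what the aggregate= argument does with the two
Friday rows, here is a base R sketch using three rows copied from the data
above. tapply is only a stand-in here, not what read.zoo does internally,
and the names date, stamp and last.per.day are made up for the example.
It assumes the rows are already in time order.

```r
# three rows from the data above: one Thursday and two entries for the same Friday
date  <- as.Date(c("7/16/2009", "7/17/2009", "7/17/2009"), format = "%m/%d/%Y")
stamp <- c(1247778752, 1247845413, 1247845824)

# for each distinct date keep only the last entry, as tail(x, 1) does
last.per.day <- tapply(stamp, date, function(x) tail(x, 1))
last.per.day
```

The Friday afternoon timestamp is the one that survives.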

## Now that we have read it in, let's process it

## 1.

# extract all Thursdays and Fridays
z45 <- z[format(time(z), "%w") %in% 4:5,]

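# keep last entry in each week
# (the key is "%Y-%U" rather than "%U" alone so that the same week
# number in different years is not treated as a duplicate)
# and show result on R console
z45[!duplicated(format(time(z45), "%Y-%U"), fromLast = TRUE), ]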
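
To see how the weekday and week keys behave, here is a small base R sketch
on three dates taken from the data (a Thursday, then a Thursday and Friday
one week later). The year prefix in the week key and the alternative
"friday" key are my additions for illustration; the variable names are
made up.

```r
# a Thursday, then the Thursday and Friday of the following week
d <- as.Date(c("2009-07-09", "2009-07-16", "2009-07-17"))

# day of week as a number: 0 = Sunday, ..., 4 = Thursday, 5 = Friday
wday <- as.integer(format(d, "%w"))

# year plus week of year, with weeks starting on Sunday
week <- format(d, "%Y-%U")

# TRUE for the last entry within each week
keep <- !duplicated(week, fromLast = TRUE)

# alternative key: the Friday that each Thursday/Friday belongs to;
# unlike a week number this does not split at a year boundary
friday <- d + (5L - wday)
keep.by.friday <- !duplicated(friday, fromLast = TRUE)

data.frame(d, wday, week, keep, friday, keep.by.friday)
```

Both keys keep the lone Thursday of the first week and the Friday of the
second week.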

## 2. alternative approach

# The approach above keeps each point's original date, so if a
# Thursday is used the result carries the date of that Thursday.
# Another approach is to always label the resulting point as Friday
# and to use the last available value even if it is not from Thursday.

# create daily grid
g <- seq(start(z), end(z), by = "day")

# fill in daily grid so Friday is filled in with prior value
# if Friday is NA
z.filled <- na.locf(z, xout = g)

# extract Fridays (including those filled in from previous)
# and show result on R console
z.filled[format(time(z.filled), "%w") == "5", ]
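
For anyone who wants to cross-check the idea without zoo, here is a rough
base R sketch of the same "last value on or before each Friday" logic. It
swaps in findInterval for na.locf, uses three observations copied from the
data above, and the names obs.date, obs.value, fridays and weekly are made
up for the example.

```r
# observations copied from the data above: Tue 6/30, Thu 7/9 and
# the later of the two Fri 7/17 entries
obs.date  <- as.Date(c("2009-06-30", "2009-07-09", "2009-07-17"))
obs.value <- c(934280, 1011144, 1083059)

# every Friday between the first and last observation
g <- seq(min(obs.date), max(obs.date), by = "day")
fridays <- g[format(g, "%w") == "5"]

# position of the latest observation on or before each Friday
idx <- findInterval(as.numeric(fridays), as.numeric(obs.date))

# one row per week, always labelled with the Friday date
weekly <- data.frame(friday = fridays, value = obs.value[idx])
weekly
```

Here the first Friday picks up the Tuesday value and the second picks up
the Thursday value, since nothing later is available in those weeks.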

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com