[R] fill data forward in data frame.
Petr Savicky
savicky at cs.cas.cz
Thu Mar 1 22:57:00 CET 2012
On Thu, Mar 01, 2012 at 02:31:01PM -0700, Ben quant wrote:
> Hello,
>
> My direct desire is a good (fast) way to fill values forward until there is
> another value, then fill that value forward in the data xx (at the bottom of
> this email). For example, from row 1 to row 45 should be NA (no change),
> but from row 46 to row 136 the value should be 12649, and from row 137 to the
> next value should be 13039.00. The last line of code is all you need for
> this part.
>
> If you are so inclined, my goal is this: I want to create a weekly time
> series out of some data based on the report date. The report date is 'rd'
> below, and is the correct date for the time series. My idea (in part seen
> below) is to align rd and ua via the incorrect date (the time series date),
> then merge that using the report date (rd) and a daily series (so I capture
> all of the dates) of dates (dt). That gets the data in the right start
> period. I've done all of this so far below and it looks fine. Then I plan
> to roll all of those values forward to the next value (see question above),
> then I'll do something like this:
>
> xx[weekdays(xx[,1]) == "Friday",]
>
> ...to get a weekly series of Friday values. I'm thinking someone probably
> has a faster way of doing this. I have to do this many times, so speed is
> important. Thanks!
>
> Here is what I have done so far:
>
> dt <- seq(from =as.Date("2009-06-01"), to = Sys.Date(), by = "day")
>
> > nms
>  [1] "2009-06-30" "2009-09-30" "2009-12-31" "2010-03-31" "2010-06-30" "2010-09-30" "2010-12-31" "2011-03-31" "2011-06-30" "2011-09-30"
> [11] "2011-12-31"
>
> > rd
>   2009-06-30   2009-09-30   2009-12-31   2010-03-31   2010-06-30
> "2009-07-16" "2009-10-15" "2010-01-19" "2010-04-19" "2010-07-19"
>   2010-09-30   2010-12-31   2011-03-31   2011-06-30   2011-09-30
> "2010-10-18" "2011-01-18" "2011-04-19" "2011-07-18" "2011-10-17"
>   2011-12-31
> "2012-01-19"
>
> > ua
> 2009-06-30 2009-09-30 2009-12-31 2010-03-31 2010-06-30 2010-09-30
>   12649.00   13039.00   13425.00   13731.00   14014.00   14389.00
> 2010-12-31 2011-03-31 2011-06-30 2011-09-30 2011-12-31
>   14833.00   15095.00   15481.43   15846.43   16186.43
>
> > x = merge(ua,rd,by='row.names')
> > names(x) = c('z.date','val','rt_date')
> > xx = merge(dt,x,by.y= 'rt_date',by.x=1,all.x=T)
> > xx
> x z.date val
> 1 2009-06-01 <NA> NA
> 2 2009-06-02 <NA> NA
> 3 2009-06-03 <NA> NA
> 4 2009-06-04 <NA> NA
> 5 2009-06-05 <NA> NA
>
> ...etc....
>
> 36 2009-07-06 <NA> NA
> 37 2009-07-07 <NA> NA
> 38 2009-07-08 <NA> NA
> 39 2009-07-09 <NA> NA
> 40 2009-07-10 <NA> NA
> 41 2009-07-11 <NA> NA
> 42 2009-07-12 <NA> NA
> 43 2009-07-13 <NA> NA
> 44 2009-07-14 <NA> NA
> 45 2009-07-15 <NA> NA
> 46 2009-07-16 2009-06-30 12649
> 47 2009-07-17 <NA> NA
> 48 2009-07-18 <NA> NA
> 49 2009-07-19 <NA> NA
> 50 2009-07-20 <NA> NA
> 51 2009-07-21 <NA> NA
> 52 2009-07-22 <NA> NA
> 53 2009-07-23 <NA> NA
> 54 2009-07-24 <NA> NA
> 55 2009-07-25 <NA> NA
> 56 2009-07-26 <NA> NA
> 57 2009-07-27 <NA> NA
> 58 2009-07-28 <NA> NA
>
> ...etc....
>
> 129 2009-10-07 <NA> NA
> 130 2009-10-08 <NA> NA
> 131 2009-10-09 <NA> NA
> 132 2009-10-10 <NA> NA
> 133 2009-10-11 <NA> NA
> 134 2009-10-12 <NA> NA
> 135 2009-10-13 <NA> NA
> 136 2009-10-14 <NA> NA
> 137 2009-10-15 2009-09-30 13039.00
> 138 2009-10-16 <NA> NA
> 139 2009-10-17 <NA> NA
> 140 2009-10-18 <NA> NA
> 141 2009-10-19 <NA> NA
> 142 2009-10-20 <NA> NA
> 143 2009-10-21 <NA> NA

Hi.

First, try the following simpler version.

# an input vector
x <- rep(NA, times=20)
x[4] <- "A"
x[9] <- "B"
x[17] <- "C"

# extending the values forward
values <- c(NA, x[!is.na(x)])
ind <- cumsum(!is.na(x)) + 1
y <- values[ind]

# compare with the original
cbind(x, y)
      x   y
 [1,] NA  NA
 [2,] NA  NA
 [3,] NA  NA
 [4,] "A" "A"
 [5,] NA  "A"
 [6,] NA  "A"
 [7,] NA  "A"
 [8,] NA  "A"
 [9,] "B" "B"
[10,] NA  "B"
[11,] NA  "B"
[12,] NA  "B"
[13,] NA  "B"
[14,] NA  "B"
[15,] NA  "B"
[16,] NA  "B"
[17,] "C" "C"
[18,] NA  "C"
[19,] NA  "C"
[20,] NA  "C"
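
If the fill is needed many times, the same indexing idea could be wrapped in a small function. The following is only a sketch; the name fill_forward is made up for illustration and is not part of base R.

```r
# Hypothetical helper (name chosen for illustration only):
# carry the last non-NA value forward; leading NAs stay NA.
fill_forward <- function(v) {
  ok <- !is.na(v)
  # element 1 of 'values' is NA and serves the entries that
  # come before the first observed value
  values <- c(NA, v[ok])
  values[cumsum(ok) + 1]
}

# same input vector as above
x <- rep(NA, times = 20)
x[4] <- "A"
x[9] <- "B"
x[17] <- "C"

y <- fill_forward(x)
```

The work is one cumsum() and one indexing step, with no explicit loop, so it should scale well to long vectors. If an extra package is acceptable, na.locf() in the zoo package with na.rm = FALSE does a similar fill.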

This can be applied directly to the last two columns of your
data frame "xx". However, it may be more natural to build the
vector "values" from the original input data (for example, ua
and rd) rather than from their sparse form in the data frame.
Also, the logical vector !is.na(x) is identical for the last two
columns of your data frame, so it needs to be computed only once.
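
As a rough sketch of how this might look on a data frame shaped like "xx": the small data frame below is a made-up stand-in (only the 2009-07-16 row follows your output; the second value is placed arbitrarily for illustration), and the Friday test via format(..., "%u") is my assumption of a locale-independent replacement for weekdays().

```r
# Made-up stand-in for 'xx'; dates and values are illustrative only.
# z.date is kept as character, since c(NA, <Date>) would drop the
# Date class.
xx <- data.frame(
  x      = seq(as.Date("2009-07-14"), by = "day", length.out = 8),
  z.date = c(NA, NA, "2009-06-30", NA, NA, NA, "2009-09-30", NA),
  val    = c(NA, NA, 12649, NA, NA, NA, 13039, NA),
  stringsAsFactors = FALSE
)

# the NA pattern is the same for both columns, so compute it once
ok  <- !is.na(xx$val)
ind <- cumsum(ok) + 1

# fill both columns forward with the shared index
xx$z.date <- c(NA, xx$z.date[ok])[ind]
xx$val    <- c(NA, xx$val[ok])[ind]

# weekly series: "%u" gives the weekday as 1-7 with Monday = 1,
# so "5" is Friday regardless of locale
fridays <- xx[format(xx$x, "%u") == "5", ]
```

Computing "ind" once and reusing it keeps the cost per additional column to a single indexing operation.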
Hope this helps.
Petr Savicky.