[R] find data (date) gaps in time series

David Winsemius dwinsemius at comcast.net
Fri Nov 20 16:27:20 CET 2009


On Nov 20, 2009, at 9:21 AM, Marc Schwartz wrote:

> On Nov 20, 2009, at 8:04 AM, David Winsemius wrote:
>
>>
>> On Nov 20, 2009, at 6:26 AM, Stefan Strohmeier wrote:
>>
>>> Dear R users,
>>>
>>> I have a time series of precipitation data. The time series  
>>> comprises ~ 20 years and it is supposed to be constant (one value  
>>> per day), but due to some failure of the measuring device some  
>>> days or periods are missing. I would like to find these missing  
>>> days or periods just to get a first idea about the reliability of  
>>> the measurements. The only function I could find was  
>>> is.constant(), but of course I only get a true or false statement  
>>> instead of the dates missing.
>>> Google searches and a look at the R help mailing did not reveal an  
>>> answer.
>>>
>>> Please find attached a few dates of the time series with missing  
>>> values from February to April. I would like R to detect those  
>>> missing dates.
>>>
>> > dtdta <- read.table(textConnection("2916 2002-02-17  0.0
>> + 2917 2002-02-18  0.3
>> + 2918 2002-02-19  3.8
>> + 2919 2002-02-20 43.6
>> + 2920 2002-02-21  1.0
>> + 2921 2002-02-22  5.6
>> + 2922 2002-02-23 10.6
>> + 2923 2002-02-24  2.8
>> + 2924 2002-02-25 19.1
>> + 2925 2002-02-26 20.5
>> + 2926 2002-03-06  0.0
>> + 2927 2002-05-06  0.0
>> + 2928 2002-05-07  0.0
>> + 2929 2002-05-08  0.0
>> + 2930 2002-05-09  0.0") )
>>
>> > dtdta[dtdta$V3 == 0, ]
>>
>>    V1         V2 V3
>> 1  2916 2002-02-17  0
>> 11 2926 2002-03-06  0
>> 12 2927 2002-05-06  0
>> 13 2928 2002-05-07  0
>> 14 2929 2002-05-08  0
>> 15 2930 2002-05-09  0
>>
>> You seem to be using "0" as a missing marker. That's bad practice,  
>> but I suppose it's possible you cannot change how your instruments  
>> report. You should be using NA and the functions that support  
>> proper treatment of "missingness".
>
>
> David,
>
> I think that he is actually looking for dates where there is no  
> measurement as opposed to dates where the measurement is 0.
>
> Thus:
>
> > DF
>     V1         V2   V3
> 1  2916 2002-02-17  0.0
> 2  2917 2002-02-18  0.3
> 3  2918 2002-02-19  3.8
> 4  2919 2002-02-20 43.6
> 5  2920 2002-02-21  1.0
> 6  2921 2002-02-22  5.6
> 7  2922 2002-02-23 10.6
> 8  2923 2002-02-24  2.8
> 9  2924 2002-02-25 19.1
> 10 2925 2002-02-26 20.5
> 11 2926 2002-03-06  0.0
> 12 2927 2002-05-06  0.0
> 13 2928 2002-05-07  0.0
> 14 2929 2002-05-08  0.0
> 15 2930 2002-05-09  0.0

You're right. I slipped a gear in reading that.
>
> # Convert V2 to dates
> # Default format is "%Y-%m-%d"
> # See ?as.Date
> DF$V2 <- as.Date(DF$V2)

At this point an alternative approach:

# Scan for differences > 1
 > diff(DF$V2)
Time differences in days
  [1]  1  1  1  1  1  1  1  1  1  8 61  1  1  1

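# Records at the start of gaps (last observation before each gap).
# Uses DF, where V2 is already a Date; padding with FALSE keeps the index at 15 rows.
 > DF[c(diff(DF$V2) > 1, FALSE), ]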
      V1         V2   V3
10 2925 2002-02-26 20.5
11 2926 2002-03-06  0.0

# Records at the end of gaps (first observation after each gap)
 > DF[c(FALSE, diff(DF$V2) > 1), ]
      V1         V2 V3
11 2926 2002-03-06  0
12 2927 2002-05-06  0

# Gap data frame: last date before each gap and first date after it
 > gap <- diff(DF$V2) > 1
 > dfgaps <- data.frame(start = DF$V2[c(gap, FALSE)],
 +                      end   = DF$V2[c(FALSE, gap)])
 > dfgaps
        start        end
1 2002-02-26 2002-03-06
2 2002-03-06 2002-05-06
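
The same diff() idea can also report how many days are missing inside each gap. What follows is only a minimal sketch, not something posted in the original exchange: the dates are retyped from the sample data above, and the names obs_dates, step, is_gap and gaps are illustrative choices of mine.

```r
# Dates retyped from the sample data; all object names here are illustrative
obs_dates <- as.Date(c("2002-02-17", "2002-02-18", "2002-02-19", "2002-02-20",
                       "2002-02-21", "2002-02-22", "2002-02-23", "2002-02-24",
                       "2002-02-25", "2002-02-26", "2002-03-06", "2002-05-06",
                       "2002-05-07", "2002-05-08", "2002-05-09"))

# Days between consecutive observations; anything above 1 marks a gap
step <- as.numeric(diff(obs_dates))
is_gap <- step > 1

gaps <- data.frame(
  last_before  = obs_dates[c(is_gap, FALSE)],  # last observed day before the gap
  first_after  = obs_dates[c(FALSE, is_gap)],  # first observed day after the gap
  days_missing = step[is_gap] - 1              # days with no record in between
)
print(gaps)
```

Summing days_missing should give the same count as the number of dates returned by the %in% approach quoted below, which is a cheap cross-check between the two methods.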

>
>
> # Get the range of dates covered
> DateRange <- seq(min(DF$V2), max(DF$V2), by = 1)
>
>
> # Get the dates in DateRange that are not in DF$V2
> # See ?"%in%"
> > DateRange[!DateRange %in% DF$V2]
> [1] "2002-02-27" "2002-02-28" "2002-03-01" "2002-03-02" "2002-03-03"
> [6] "2002-03-04" "2002-03-05" "2002-03-07" "2002-03-08" "2002-03-09"
> [11] "2002-03-10" "2002-03-11" "2002-03-12" "2002-03-13" "2002-03-14"
> [16] "2002-03-15" "2002-03-16" "2002-03-17" "2002-03-18" "2002-03-19"
> [21] "2002-03-20" "2002-03-21" "2002-03-22" "2002-03-23" "2002-03-24"
> [26] "2002-03-25" "2002-03-26" "2002-03-27" "2002-03-28" "2002-03-29"
> [31] "2002-03-30" "2002-03-31" "2002-04-01" "2002-04-02" "2002-04-03"
> [36] "2002-04-04" "2002-04-05" "2002-04-06" "2002-04-07" "2002-04-08"
> [41] "2002-04-09" "2002-04-10" "2002-04-11" "2002-04-12" "2002-04-13"
> [46] "2002-04-14" "2002-04-15" "2002-04-16" "2002-04-17" "2002-04-18"
> [51] "2002-04-19" "2002-04-20" "2002-04-21" "2002-04-22" "2002-04-23"
> [56] "2002-04-24" "2002-04-25" "2002-04-26" "2002-04-27" "2002-04-28"
> [61] "2002-04-29" "2002-04-30" "2002-05-01" "2002-05-02" "2002-05-03"
> [66] "2002-05-04" "2002-05-05"
>
> HTH,
>
> Marc Schwartz
>
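
If the goal is to carry the missing days along in the data rather than just list them, one option is to merge the full date sequence back onto the observations so that every absent day becomes a row with NA. This is a hedged sketch under my own assumptions, not part of Marc's reply: it uses three rows retyped from the sample, and the names full, filled and missing_days are illustrative.

```r
# Three rows retyped from the sample data; object names below are illustrative
DF <- read.table(text = "
2924 2002-02-25 19.1
2925 2002-02-26 20.5
2926 2002-03-06  0.0
", col.names = c("V1", "V2", "V3"), stringsAsFactors = FALSE)
DF$V2 <- as.Date(DF$V2)

# One row per calendar day in the observed range
full <- data.frame(V2 = seq(min(DF$V2), max(DF$V2), by = "day"))

# Left join: days without a record get NA in V1 and V3
filled <- merge(full, DF, by = "V2", all.x = TRUE)

# The missing days are the rows where the measurement is NA
missing_days <- filled$V2[is.na(filled$V3)]
print(missing_days)
```

A genuine reading of 0.0 stays 0.0 here, so dry days and missing days remain distinguishable. The caveat is that if the original data already contains NA measurements, is.na(filled$V3) alone cannot tell those apart from days that had no record at all; testing is.na(filled$V1) on the record-number column would be one way around that.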

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



