[R] Need a faster function to replace missing data

Tim Clark mudiver1200 at yahoo.com
Tue May 26 10:42:10 CEST 2009


Many thanks to Jim, Bill, and Carl.  Using indexes instead of the for loop gave me my answer in minutes instead of hours!  Thanks for all of your great suggestions!

Aloha,

Tim

Tim Clark
Department of Zoology 
University of Hawaii


--- On Fri, 5/22/09, jim holtman <jholtman at gmail.com> wrote:

> From: jim holtman <jholtman at gmail.com>
> Subject: Re: [R] Need a faster function to replace missing data
> To: "Tim Clark" <mudiver1200 at yahoo.com>
> Cc: r-help at r-project.org
> Date: Friday, May 22, 2009, 4:59 PM
>
> Here is a modification that should now find the closest:
>
> > myvscan<-data.frame(c(1,NA,1.5),as.POSIXct(c("12:00:00","12:14:00","12:20:00"),
> + format="%H:%M:%S"))
> > # convert to numeric
> > names(myvscan)<-c("Latitude","DateTime")
> > myvscan$tn <- as.numeric(myvscan$DateTime)  # numeric for findInterval
> > mygarmin<-data.frame(c(20,30,40),as.POSIXct(c("12:00:00","12:10:00","12:15:00"),
> + format="%H:%M:%S"))
> > names(mygarmin)<-c("Latitude","DateTime")
> > mygarmin$tn <- as.numeric(mygarmin$DateTime)
> >
> > # use 'findInterval'
> > na.indx <- which(is.na(myvscan$Latitude))  # find NAs
> >
> > # create matrix of values to test the range
> > indices <- findInterval(myvscan$tn[na.indx],mygarmin$tn)
> > x <- cbind(indices,
> +            abs(myvscan$tn[na.indx] - mygarmin$tn[indices]), # lower
> +            abs(myvscan$tn[na.indx] - mygarmin$tn[indices + 1]))  #higher
> > # now determine which index is closer
> > closest <- x[,1] + (x[,2] > x[,3])  # determine the proper index
> > # replace with garmin latitude
> > myvscan$Latitude[na.indx] <- mygarmin$Latitude[closest]
> >
> > myvscan
>   Latitude            DateTime         tn
> 1      1.0 2009-05-23 12:00:00 1243080000
> 2     40.0 2009-05-23 12:14:00 1243080840
> 3      1.5 2009-05-23 12:20:00 1243081200
>
> On Fri, May 22, 2009 at 7:39 PM, Tim Clark <mudiver1200 at yahoo.com> wrote:
>
> Jim,
>
> Thanks!  I like the way you use indexing instead of the loops.
> However, the findInterval function does not give the right result.
> I have been playing with it, and it seems to return the closest
> value that is not greater than the one of interest.  In this case,
> the correct replacement should have been 40, not 30, since 12:15 in
> mygarmin is closer to 12:14 in myvscan than 12:10 is.  Is there a
> way to get the function to find the closest value instead of the
> next smaller value?  I was trying to use which.min to get the
> closest date but can't seem to get that to work right either.
> 
> 
> 
> Aloha,
> 
> Tim
> 
> 
> Tim Clark
> Department of Zoology
> University of Hawaii
> 
> 
> --- On Fri, 5/22/09, jim holtman <jholtman at gmail.com> wrote:
>
> > From: jim holtman <jholtman at gmail.com>
> > Subject: Re: [R] Need a faster function to replace missing data
> > To: "Tim Clark" <mudiver1200 at yahoo.com>
> > Cc: r-help at r-project.org
> > Date: Friday, May 22, 2009, 7:24 AM
> >
> > I think this does what you want.  It uses 'findInterval' to
> > determine where a possible match is:
> >
> > > myvscan<-data.frame(c(1,NA,1.5),as.POSIXct(c("12:00:00","12:14:00","12:20:00"),
> > + format="%H:%M:%S"))
> > > # convert to numeric
> > > names(myvscan)<-c("Latitude","DateTime")
> > > myvscan$tn <- as.numeric(myvscan$DateTime)  # numeric for findInterval
> > > mygarmin<-data.frame(c(20,30,40),as.POSIXct(c("12:00:00","12:10:00","12:15:00"),
> > + format="%H:%M:%S"))
> > > names(mygarmin)<-c("Latitude","DateTime")
> > > mygarmin$tn <- as.numeric(mygarmin$DateTime)
> > >
> > > # use 'findInterval'
> > > na.indx <- which(is.na(myvscan$Latitude))  # find NAs
> > > # replace with garmin latitude
> > > myvscan$Latitude[na.indx] <- mygarmin$Latitude[findInterval(myvscan$tn[na.indx],
> > + mygarmin$tn)]
> > >
> > > myvscan
> >   Latitude            DateTime         tn
> > 1      1.0 2009-05-22 12:00:00 1243008000
> > 2     30.0 2009-05-22 12:14:00 1243008840
> > 3      1.5 2009-05-22 12:20:00 1243009200
> >
> > On Fri, May 22, 2009 at 12:45 AM, Tim Clark <mudiver1200 at yahoo.com> wrote:
> >
> > Dear List,
> >
> > I need some help in coming up with a function that will take two
> > data sets, determine if a value is missing in one, find a value in
> > the second that was taken at about the same time, and substitute
> > the second value in for where the first should have been.  My
> > problem is from a fish tracking study.  We put acoustic tags in
> > fish and track them for several days.  Location data is supposed
> > to be automatically recorded every time we detect a "ping" from
> > the fish.  Unfortunately the GPS had some problems and sometimes
> > the fish's depth was recorded but not its location.  Fortunately I
> > had a back-up GPS that was taking location data every five
> > minutes.  I would like to merge the two files, replacing the
> > missing value in the vscan (automatic) file with the location from
> > the garmin file.  Since we were getting vscan records every 1-2
> > seconds and garmin records every 5 minutes, I need to find the
> > right place in the vscan file to place the garmin record, i.e. the
> > closest in time, but not more than 5 minutes away.  I have written
> > a function that does this.  However, it works with my test data
> > but locks up my computer with my real data.  I have several
> > million vscan records and several thousand garmin records.  Is
> > there a better way to do this?
> >
> > My function and test data:
> >
> > library(chron)  # times() is from the chron package (assumed; not in the original post)
> > myvscan<-data.frame(c(1,NA,1.5),times(c("12:00:00","12:14:00","12:20:00")))
> > names(myvscan)<-c("Latitude","DateTime")
> > mygarmin<-data.frame(c(20,30,40),times(c("12:00:00","12:10:00","12:15:00")))
> > names(mygarmin)<-c("Latitude","DateTime")
> >
> > minute.diff<-1/24/12   #Time diff is in days, so this is 5 minutes
> >
> > for (k in 1:nrow(myvscan))
> > {
> > if (is.na(myvscan$Latitude[k]))
> > {
> > if ((min(abs(mygarmin$DateTime-myvscan$DateTime[k]))) < minute.diff )
> > {
> > index.min.date<-which.min(abs(mygarmin$DateTime-myvscan$DateTime[k]))
> > myvscan$Latitude[k]<-mygarmin$Latitude[index.min.date]
> > }}}
> >
> > I appreciate your help and advice.
> >
> > Aloha,
> >
> > Tim
> >
> >
> >
> >
> > Tim Clark
> > Department of Zoology
> > University of Hawaii
> >
> 
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> > --
> > Jim Holtman
> > Cincinnati, OH
> > +1 513 646 9390
> >
> > What is the problem that you are trying to solve?
> >
> >
>
> -- 
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
> 
> What is the problem that you are trying to solve?
> 
> 
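
The index-based approach discussed in this thread can be collected into one function. What follows is only a sketch under stated assumptions, not code posted by anyone in the thread: the name fill_nearest and its arguments are hypothetical, times are assumed to be plain numerics in a common unit (for example seconds from as.numeric() on POSIXct values), and the max.gap argument restores the 5-minute limit from the original loop, which the findInterval versions above leave out. It also clamps the neighbour indices so that a vscan time falling before the first or after the last garmin record does not produce an invalid index.

```r
# Hypothetical helper (not from the thread): fill NA entries of x.val with the
# y.val whose time is nearest, but only when that time is within max.gap.
# Assumes x.time and y.time are numeric and in the same unit.
fill_nearest <- function(x.time, x.val, y.time, y.val, max.gap) {
  # findInterval() requires its second argument sorted in increasing order
  ord <- order(y.time)
  y.time <- y.time[ord]
  y.val <- y.val[ord]
  n <- length(y.time)

  na.indx <- which(is.na(x.val))
  if (n == 0L || length(na.indx) == 0L) return(x.val)
  t <- x.time[na.indx]

  # index of the last y.time <= t (0 when t precedes every y.time)
  lo <- findInterval(t, y.time)
  # clamp so that both candidate neighbours are valid indices
  lower <- pmax(lo, 1L)
  upper <- pmin(lo + 1L, n)

  d.lower <- abs(t - y.time[lower])
  d.upper <- abs(t - y.time[upper])

  # pick the nearer neighbour; ties go to the earlier record
  closest <- ifelse(d.upper < d.lower, upper, lower)
  gap <- pmin(d.lower, d.upper)

  # fill only where the nearest record is close enough (gap <= max.gap)
  ok <- !is.na(gap) & gap <= max.gap
  x.val[na.indx[ok]] <- y.val[closest[ok]]
  x.val
}

# The thread's test data, with times as seconds after 12:00:00
vscan.t    <- c(0, 14, 20) * 60
vscan.lat  <- c(1, NA, 1.5)
garmin.t   <- c(0, 10, 15) * 60
garmin.lat <- c(20, 30, 40)

filled <- fill_nearest(vscan.t, vscan.lat, garmin.t, garmin.lat,
                       max.gap = 5 * 60)
filled   # 1.0 40.0 1.5
```

Because every step is a vector operation, the work is done in a single pass over the missing rows rather than one scan of the garmin data per vscan row, which is the change that brought the run time down in the thread.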






