[R] Help searching a matrix for only certain records

Matt Borkowski mathias1979 at yahoo.com
Mon Mar 4 03:10:15 CET 2013


I appreciate all the feedback on this. I ended up using this line to solve my problem, just because I stumbled upon it first...

> alldata <- alldata[alldata$REC.TYPE == "SAO  " | alldata$REC.TYPE == "FM-15",,drop=FALSE]

But I think Jim's solution would work equally well. I was a bit confused by the relative complexity of the data frame solution, as it seems to take more steps than necessary.
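
For anyone finding this thread later, here is a minimal sketch of the same filter on a small made-up data frame. Only the REC.TYPE column and its two accepted values come from my real data; the TEMP column, the other record types and the object names (by_or, by_in, by_trim, wanted) are invented for illustration. It shows the line I used next to what I believe is the equivalent %in% form, plus one way to sidestep the trailing blanks in "SAO  ".

```r
# Made-up stand-in for the real data; TEMP and the extra record types are invented.
alldata <- data.frame(
  REC.TYPE = c("SAO  ", "FM-15", "OTHER", "SAO  ", "FM-12"),
  TEMP = c(10.1, 11.3, 9.8, 12.0, 8.7),
  stringsAsFactors = FALSE
)

# The filter from the message above: one comparison per accepted value.
by_or <- alldata[alldata$REC.TYPE == "SAO  " | alldata$REC.TYPE == "FM-15", , drop = FALSE]

# Same filter with %in%; accepting another value only means extending `wanted`.
wanted <- c("SAO  ", "FM-15")
by_in <- alldata[alldata$REC.TYPE %in% wanted, , drop = FALSE]

# Exact matching is sensitive to trailing blanks: "SAO" would not match "SAO  ".
# Stripping them first avoids having to type the padding.
by_trim <- alldata[sub(" +$", "", alldata$REC.TYPE) %in% c("SAO", "FM-15"), , drop = FALSE]
```

All three objects should hold the same rows, so the choice is mostly about readability.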

Thanks again for the input!

-Matt




Again, thanks for the feedback!

--- On Sun, 3/3/13, arun <smartpink111 at yahoo.com> wrote:

> From: arun <smartpink111 at yahoo.com>
> Subject: Re: [R] Help searching a matrix for only certain records
> To: "Matt Borkowski" <mathias1979 at yahoo.com>
> Cc: "R help" <r-help at r-project.org>, "jim holtman" <jholtman at gmail.com>
> Date: Sunday, March 3, 2013, 1:29 PM
> Hi,
> You could also use ?data.table() 
> 
> n<- 300000
> set.seed(51)
>  mat1<- as.matrix(data.frame(REC.TYPE=
> sample(c("SAO","FAO","FL-1","FL-2","FL-15"),n,replace=TRUE),Col2=rnorm(n),Col3=runif(n),stringsAsFactors=FALSE))
>  dat1<- as.data.frame(mat1,stringsAsFactors=FALSE)
>  table(mat1[,1])
> #
>  # FAO  FL-1 FL-15  FL-2   SAO 
> #60046 60272 59669 59878 60135 
> system.time(x1 <- subset(mat1, grepl("(SAO|FL-15)",
> mat1[, "REC.TYPE"])))
>  #user  system elapsed 
>  # 0.076   0.004   0.082 
>  system.time(x2 <- subset(mat1, mat1[, "REC.TYPE"] %in%
> c("SAO", "FL-15")))
>  #  user  system elapsed 
>  # 0.028   0.000   0.030 
> 
> system.time(x3 <- mat1[match(mat1[, "REC.TYPE"]
>                             , c("SAO", "FL-15")
>                             , nomatch = 0) != 0
>                             ,, drop = FALSE]
>             )
> #user  system elapsed 
> #  0.028   0.000   0.028 
>  table(x3[,1])
> #
> #FL-15   SAO 
> #59669 60135 
> 
> 
> library(data.table)
> 
> dat2<- data.table(dat1) 
>  system.time(x4<- dat2[match(REC.TYPE,c("SAO",
> "FL-15"),nomatch=0)!=0,,drop=FALSE])
>   # user  system elapsed 
>   #0.024   0.000   0.025 
>  table(x4$REC.TYPE)
> 
> #FL-15   SAO 
> #59669 60135 
> A.K.
> 
> 
> 
> 
> 
> 
> 
> 
> ----- Original Message -----
> From: jim holtman <jholtman at gmail.com>
> To: Matt Borkowski <mathias1979 at yahoo.com>
> Cc: "r-help at r-project.org"
> <r-help at r-project.org>
> Sent: Sunday, March 3, 2013 11:52 AM
> Subject: Re: [R] Help searching a matrix for only certain
> records
> 
> If you are using matrices, then here are several ways of doing it for
> size 300,000.  You can determine if the difference of 0.1 seconds is
> important in terms of the performance you are after.  It is taking you
> more time to type in the statements than it is taking them to execute:
> 
> > n <- 300000
> > testdata <- matrix(
> +     sample(c("SAO ", "FL-15", "Other"), n, TRUE, prob = c(1,2,1000))
> +     , nrow = n
> +     , dimnames = list(NULL, "REC.TYPE")
> +     )
> > table(testdata[, "REC.TYPE"])
> 
> FL-15  Other   SAO
>    562 299151    287
> > system.time(x1 <- subset(testdata, grepl("(SAO |FL-15)", testdata[, "REC.TYPE"])))
>    user  system elapsed
>    0.17    0.00    0.17
> > system.time(x2 <- subset(testdata, testdata[, "REC.TYPE"] %in% c("SAO ", "FL-15")))
>    user  system elapsed
>    0.05    0.00    0.05
> > system.time(x3 <- testdata[match(testdata[, "REC.TYPE"]
> +                             , c("SAO ", "FL-15")
> +                             , nomatch = 0) != 0
> +                             ,, drop = FALSE]
> +             )
>    user  system elapsed
>    0.03    0.00    0.03
> > identical(x1, x2)
> [1] TRUE
> > identical(x2, x3)
> [1] TRUE
> >
> 
> 
> On Sun, Mar 3, 2013 at 11:22 AM, Jim Holtman <jholtman at gmail.com>
> wrote:
> > There are much "more efficient" ways of doing many of the operations,
> > but you probably won't see any difference unless you have very large
> > objects (several hundred thousand entries) or have to do it many
> > times.  My background is in computer performance, and for the most
> > part I have found that the easiest, most straightforward ways are
> > fine most of the time.
> >
> > a more efficient way might be:
> >
> > testdata <- testdata[match(testdata$REC.TYPE, c('SAO ', 'FL-15'),
> >                            nomatch = 0) != 0, , drop = FALSE]
> >
> > you can always use 'system.time' to determine how long
> actions take.
> >
> > for multiple comparisons use %in%
> >
> > Sent from my iPad
> >
> > On Mar 3, 2013, at 9:22, Matt Borkowski <mathias1979 at yahoo.com>
> wrote:
> >
> >> Thank you for your response Jim! I will give this
> one a try! But a couple followup questions...
> >>
> >> In my search for a solution, I had seen something
> stating match() is much more efficient than subset() and
> will cut down significantly on computing time. Is there any
> truth to that?
> >>
> >> Also, I found the following solution, which works for matching a
> >> single condition, but I couldn't quite figure out how to modify it
> >> to search for both of my acceptable conditions...
> >>
> >>> testdata <- testdata[testdata$REC.TYPE ==
> "SAO",,drop=FALSE]
> >>
> >> -Matt
> >>
> >>
> >>
> >>
> >> --- On Sun, 3/3/13, jim holtman <jholtman at gmail.com>
> wrote:
> >>
> >> From: jim holtman <jholtman at gmail.com>
> >> Subject: Re: [R] Help searching a matrix for only
> certain records
> >> To: "Matt Borkowski" <mathias1979 at yahoo.com>
> >> Cc: r-help at r-project.org
> >> Date: Sunday, March 3, 2013, 8:00 AM
> >>
> >> Try this:
> >>
> >> dataset <- subset(dataset, grepl("(SAO |FL-15)",
> REC.TYPE))
> >>
> >>
> >> On Sun, Mar 3, 2013 at 1:11 AM, Matt Borkowski
> <mathias1979 at yahoo.com>
> wrote:
> >>> Let me start by saying I am rather new to R and
> generally consider myself to be a novice programmer...so
> don't assume I know what I'm doing :)
> >>>
> >>> I have a large matrix, approximately 300,000 x
> 14. It's essentially a 20-year dataset of 15-minute data.
> However, I only need the rows where the column I've named
> REC.TYPE contains the string "SAO  " or "FL-15".
> >>>
> >>> My horribly inefficient solution was to search
> the matrix row by row, test the REC.TYPE column and
> essentially delete the row if it did not match my criteria.
> Essentially...
> >>>
> >>>> j <- 1
> >>>> for (i in 1:nrow(dataset)) {
> >>>>     if (dataset$REC.TYPE[j] != "SAO  " && dataset$REC.TYPE[j] != "FL-15") {
> >>>>       dataset <- dataset[-j, ]
> >>>>     } else {
> >>>>       j <- j + 1
> >>>>     }
> >>>> }
> >>>
> >>> After watching my code get through only about
> 10% of the matrix in an hour and slowing with every row...I
> figure there must be a more efficient way of pulling out
> only the records I need...especially when I need to repeat
> this for another 8 datasets.
> >>>
> >>> Can anyone point me in the right direction?
> >>>
> >>> Thanks!
> >>>
> >>> Matt
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org
> mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained,
> reproducible code.
> >>
> >>
> >>
> >> --
> >> Jim Holtman
> >> Data Munger Guru
> >>
> >> What is the problem that you are trying to solve?
> >> Tell me what you want to do, not how you want to do
> it.
> >>
> 
> 
> 
> 
> 
>
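
One point from the quoted thread that is easy to trip over is the argument order of match(). The sketch below uses a tiny made-up matrix (the object names m, keep, first_only, all_rows and same are only for illustration) to show that match(x, table) returns the first position of each element of x in table, so putting the wanted values first yields one row per value rather than every matching row. Reversing the arguments, or using %in%, flags all of them.

```r
# Made-up example data; the values and object names are illustrative only.
m <- matrix(c("SAO ", "Other", "FL-15", "SAO ", "FL-15", "Other"),
            ncol = 1, dimnames = list(NULL, "REC.TYPE"))
keep <- c("SAO ", "FL-15")

# For each wanted value, the FIRST row where it occurs: one index per value.
first_only <- match(keep, m[, "REC.TYPE"])

# For each row, whether its value is among the wanted ones: one flag per row.
all_rows <- match(m[, "REC.TYPE"], keep, nomatch = 0) != 0

# %in% is built on match() and gives the same logical vector.
same <- m[, "REC.TYPE"] %in% keep

# drop = FALSE keeps the result a matrix even with a single column.
x_match <- m[all_rows, , drop = FALSE]
x_in    <- m[same, , drop = FALSE]
```

Here first_only picks out only two rows, while x_match and x_in both keep all four matching rows.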


