[R] what is the faster way to search for a pattern in a few million entries data frame ?

Duncan Murdoch murdoch.duncan at gmail.com
Sun Apr 10 20:40:24 CEST 2016


On 10/04/2016 2:03 PM, Fabien Tarrade wrote:
> Hi there,
>
> I have a data frame DF with 40 million strings and their frequencies. I
> am searching for strings that match a given pattern, and I am trying to
> speed up this part of my code. I have tried many options, but so far I am
> not satisfied. I tried:
> - grepl and subset are equivalent in terms of processing time
>      grepl(paste0("^",pattern),df$Strings)
>      subset(df, grepl(paste0("^",pattern), df$Strings))
>
> - lookup(pattern,df) is not what I am looking for since it does exact
> matching
>
> - I tried converting my data frame to a data table, but it didn't improve
> things (though reading/writing this DT will probably be much faster)
>
> - the only way I found was to remove the 1/3 of the data frame with the
> lowest-frequency strings, which sped up the process by a factor of 10!
>
> - I haven't tried parRapply yet; on a multicore machine I could gain
> another factor.
>      I did use parLapply for some other code, but I had many memory
> issues (crashing my Mac).
>      I had to subdivide the dataset to get it working correctly, but I
> didn't manage to fully understand the issue.
>
> I am sure there is some other smart way to do that. Are there any good
> articles/blogs or suggestions that can give me some guidance?

Didn't you post the same question yesterday?  Perhaps nobody answered 
because your question is unanswerable.  You need to describe what the 
strings are like and what the patterns are like if you want advice on 
speeding things up.
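
To illustrate the kind of detail that would help, a post could include a
small reproducible example like the sketch below. Everything in it is an
assumption made for illustration: the random lowercase strings, the scaled-down
row count, the Freq column, and the literal prefix "ab". It also assumes the
pattern is a plain prefix with no regular-expression metacharacters, in which
case startsWith() (base R 3.3.0 or later) may be worth timing against the
anchored grepl() call from the question.

```r
# Hypothetical reproducible example: sizes, string shapes and the pattern
# are illustrative assumptions, not taken from the original data.
set.seed(1)
n <- 1e5  # scaled down from the 40 million rows mentioned in the question

# Build one random lowercase string of length k
make_string <- function(k) paste(sample(letters, k, replace = TRUE), collapse = "")

df <- data.frame(
  Strings = vapply(sample(3:10, n, replace = TRUE), make_string, character(1)),
  Freq    = sample.int(1000, n, replace = TRUE),
  stringsAsFactors = FALSE
)

pattern <- "ab"  # assumed to be a literal prefix, no regex metacharacters

# Approach from the question: anchored regular expression
system.time(hits_regex <- grepl(paste0("^", pattern), df$Strings))

# Possible alternative for a literal prefix: plain string comparison
system.time(hits_prefix <- startsWith(df$Strings, pattern))

# Both approaches should select the same rows
stopifnot(identical(hits_regex, hits_prefix))
```

Whether the second call is actually faster depends on the real strings and
patterns, which is why a description of both matters.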

Duncan Murdoch


