[R] Frequency of a character in a string

Charles C. Berry ccberry at ucsd.edu
Mon Nov 14 20:55:44 CET 2016


On Mon, 14 Nov 2016, Marc Schwartz wrote:

>
>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
>>
>> On Mon, 14 Nov 2016, Bert Gunter wrote:
>>
[stuff deleted]

> Hi,
>
> Both gsub() and strsplit() are using regex based pattern matching 
> internally. That being said, they are ultimately calling .Internal code, 
> so both are pretty fast.
>
> For comparison:
>
> ## Create a 1,000,000 character vector
> set.seed(1)
> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
>
>> nchar(Vec)
> [1] 1000000
>
> ## Split the vector into single characters and tabulate
>> table(strsplit(Vec, split = "")[[1]])
>
>    a     b     c     d     e     f     g     h     i     j     k     l
> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
>    m     n     o     p     q     r     s     t     u     v     w     x
> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
>    y     z
> 38265 38299
>
>
> ## Get just the count of "a"
>> table(strsplit(Vec, split = "")[[1]])["a"]
>    a
> 38664
>
>> nchar(gsub("[^a]", "", Vec))
> [1] 38664
>
>
> ## Check performance
>> system.time(table(strsplit(Vec, split = "")[[1]])["a"])
>   user  system elapsed
>  0.100   0.007   0.107
>
>> system.time(nchar(gsub("[^a]", "", Vec)))
>   user  system elapsed
>  0.270   0.001   0.272
>
>
> So, the above would suggest that using strsplit() is somewhat faster 
> than using gsub(). However, as Chuck notes, in the absence of more 
> exhaustive benchmarking, the difference may or may not be more 
> generalizable.
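
One way to act on that caveat is to repeat each timing a few times and compare medians instead of relying on a single run. This is only a rough sketch in base R: `time_median` is a hypothetical helper name, not something from this thread, and the choice of 5 repetitions is arbitrary.

```r
## Rebuild the same 1,000,000 character string used above
set.seed(1)
Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")

## Hypothetical helper: median elapsed time of f() over n runs
time_median <- function(f, n = 5) {
  median(replicate(n, system.time(f())[["elapsed"]]))
}

timings <- c(
  strsplit_table = time_median(function() table(strsplit(Vec, split = "")[[1]])["a"]),
  gsub_nchar     = time_median(function() nchar(gsub("[^a]", "", Vec)))
)
timings
```

The numbers will vary with machine, R version, and input, so this only shows how the comparison could be made more stable.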


Whether splitting on fixed strings rather than treating them as
regexes (i.e., `fixed=TRUE') makes a big difference seems to depend on
what you split on:

First repeating what Marc did...

> system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
    user  system elapsed
   0.132   0.010   0.139 
> system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
    user  system elapsed
   0.130   0.010   0.138

... fixed=TRUE hardly matters. But the idiom I proposed...

> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
    user  system elapsed
   0.017   0.000   0.018 
> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
    user  system elapsed
   0.104   0.000   0.104
>

... is more than 5 times faster with fixed=TRUE for this case
(0.018 vs. 0.104 seconds elapsed).
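
The "X" padding in that idiom is there because strsplit() drops the empty piece after a match at the very end of a string, so an unpadded split undercounts when the string ends in the target character. A small sketch of the difference follows. `count_char` is a hypothetical wrapper name used only for illustration, and it assumes the character being counted is not "X" itself.

```r
## Hypothetical wrapper around the split-and-count idiom.
## Assumes ch is a single fixed character other than "X".
count_char <- function(x, ch) {
  ## Pad both ends so a match at the start or end of x still has a
  ## non-empty piece on each side of it.
  padded <- paste0("X", x, "X")
  ## n matches produce n + 1 pieces, hence the "- 1".
  sum(lengths(strsplit(padded, ch, fixed = TRUE)) - 1)
}

count_char("banana", "a")
## [1] 3

## Without padding, the trailing "a" is lost:
lengths(strsplit("banana", "a", fixed = TRUE)) - 1
## [1] 2
```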

This result matches Marc's count:

> sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
[1] 38664
>
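
For anyone who wants to confirm the agreement without the full 1,000,000 character string, the three approaches from this thread can be cross-checked on a smaller input. This is a sketch only: the 1,000 character size and the variable names are arbitrary choices, not from the posts above.

```r
## Smaller random string, built the same way as Vec
set.seed(1)
small <- paste(sample(letters, 1000, replace = TRUE), collapse = "")

## 1. Split into single characters and tabulate
by_table <- as.integer(table(strsplit(small, split = "")[[1]])["a"])

## 2. Delete everything that is not "a" and count what is left
by_gsub <- nchar(gsub("[^a]", "", small))

## 3. Split on "a" (fixed string) after padding both ends
by_split <- sum(lengths(strsplit(paste0("X", small, "X"), "a", fixed = TRUE)) - 1)

## All three should agree; stopifnot() raises an error if they do not
stopifnot(by_table == by_gsub, by_gsub == by_split)
```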

Chuck


