[R] Frequency of a character in a string

Mon Nov 14 21:23:28 CET 2016

Chuck, Marc, and anyone else who still has interest in this odd little
discussion ...

Yes, and with fixed = TRUE my approach took about 1/3 as much time as
Chuck's on a 10-element vector, each element of which is a character
string of length 1e5:

> set.seed(1001)
> x <- sapply(1:10, function(i) paste0(sample(letters, 1e5, replace = TRUE), collapse = ""))

> system.time(sum(lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1))
   user  system elapsed
  0.012   0.000   0.012
> system.time(nchar(gsub("[^a]", "", x,fixed = TRUE)))
   user  system elapsed
  0.004   0.000   0.004
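
A sanity check that seems worth running alongside timings like these: with
fixed = TRUE, gsub() treats its pattern as a literal string rather than a
regular expression, so "[^a]" stops being a character class. The sketch
below compares the counts each idiom returns. It assumes a much smaller
input than the one timed above (3 strings of 1000 letters), and the names
ref, n_regex, n_fixed and n_split are made up for illustration only.

```r
# Small illustrative input (an assumption; not the 1e5-character strings timed above)
set.seed(1001)
x <- sapply(1:3, function(i) paste0(sample(letters, 1000, replace = TRUE), collapse = ""))

# Reference count of "a" per element: split into single characters and compare
ref <- sapply(strsplit(x, "", fixed = TRUE), function(ch) sum(ch == "a"))

# Regex version: "[^a]" is a character class, so everything except "a" is removed
n_regex <- nchar(gsub("[^a]", "", x))

# fixed = TRUE version: "[^a]" is the literal four-character string, which
# never occurs in a string of lowercase letters, so nothing is removed
n_fixed <- nchar(gsub("[^a]", "", x, fixed = TRUE))

# Padded strsplit idiom, kept per element instead of summed
n_split <- lengths(strsplit(paste0("X", x, "X"), "a", fixed = TRUE)) - 1L

identical(n_regex, ref)      # regex gsub() agrees with the reference count
identical(n_split, ref)      # padded strsplit() agrees with the reference count
all(n_fixed == nchar(x))     # fixed = TRUE gsub() left every string untouched
```

If the last comparison comes back TRUE, the gsub() call with fixed = TRUE is
not counting anything, and its timing is not a like-for-like comparison with
the strsplit() idiom.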

Best,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Mon, Nov 14, 2016 at 11:55 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
> On Mon, 14 Nov 2016, Marc Schwartz wrote:
>
>>
>>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
>>>
>>> On Mon, 14 Nov 2016, Bert Gunter wrote:
>>>
> [stuff deleted]
>
>
>> Hi,
>>
>> Both gsub() and strsplit() use regex-based pattern matching
>> internally. That being said, they ultimately call .Internal code, so
>> both are pretty fast.
>>
>> For comparison:
>>
>> ## Create a single string of 1,000,000 characters
>> set.seed(1)
>> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
>>
>>> nchar(Vec)
>>
>> [1] 1000000
>>
>> ## Split the string into single characters and tabulate
>>>
>>> table(strsplit(Vec, split = "")[[1]])
>>
>>
>>    a     b     c     d     e     f     g     h     i     j     k     l
>> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
>>    m     n     o     p     q     r     s     t     u     v     w     x
>> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
>>    y     z
>> 38265 38299
>>
>>
>> ## Get just the count of "a"
>>>
>>> table(strsplit(Vec, split = "")[[1]])["a"]
>>
>>    a
>> 38664
>>
>>> nchar(gsub("[^a]", "", Vec))
>>
>> [1] 38664
>>
>>
>> ## Check performance
>>>
>>> system.time(table(strsplit(Vec, split = "")[[1]])["a"])
>>
>>   user  system elapsed
>>  0.100   0.007   0.107
>>
>>> system.time(nchar(gsub("[^a]", "", Vec)))
>>
>>   user  system elapsed
>>  0.270   0.001   0.272
>>
>>
>> So, the above suggests that strsplit() is roughly 2.5 times faster than
>> gsub() here (0.107 vs. 0.272 seconds elapsed). However, as Chuck notes,
>> without more exhaustive benchmarking, the difference may not generalize.
>
>
>
> Whether splitting on fixed strings rather than treating them as
> regexes (i.e., `fixed=TRUE') makes a big difference seems to depend on
> what you split on:
>
> First repeating what Marc did...
>
>> system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
>
>    user  system elapsed
>   0.132   0.010   0.139
>>
>> system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
>
>    user  system elapsed
>   0.130   0.010   0.138
>
> ... fixed=TRUE hardly matters. But the idiom I proposed...
>
>> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
>
>    user  system elapsed
>   0.017   0.000   0.018
>>
>> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
>
>    user  system elapsed
>   0.104   0.000   0.104
>>
>>
>
> ... is roughly 6 times faster with fixed=TRUE for this case (0.018 vs.
> 0.104 seconds elapsed).
>
> This result matches Marc's count:
>
>> sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
>
> [1] 38664
>>
>>
>
> Chuck