[R] Frequency of a character in a string

Mon Nov 14 23:19:47 CET 2016

On 11/14/2016 12:44 PM, Bert Gunter wrote:
> (Sheepishly)...
>
> Yes, thank you Hervé. It would have been nice if I had given correct
> solutions. Of course, fixed = TRUE could not have worked with the ["a"]
> character class!
>
> Here's what I found with a 10 element vector each member of which is a
> 1e5 length string:
>
>> system.time((lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1))
>    user  system elapsed
>   0.013   0.000   0.013
>
>> system.time(nchar(gsub("[^a]", "", x,fixed = FALSE)))
>    user  system elapsed
>   0.251   0.000   0.252
> ## WAYYYY slower
>
>
>> system.time(nchar(x) - nchar(gsub("a", "", x,fixed = TRUE)))
>    user  system elapsed
>   0.007   0.000   0.007
> ## twice as fast
>
>
>
> Clearly and unsurprisingly, the message is to avoid fixed = FALSE;
> after that, it seems mostly to be: who cares?!
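
For anyone who wants to check the three idioms timed above against each
other, here is a minimal sketch. The helper names count_strsplit(),
count_regex() and count_fixed() and the small test vector are made up
for illustration; they are not from the thread. The sketch assumes 'ch'
is a single ordinary letter: not "X", which would collide with the
padding, and not a character that is special inside a regex character
class.

```r
# Hypothetical helpers (names are illustrative, not from the thread).
# Each returns, for every element of the character vector `x`, the number
# of occurrences of the single character `ch`.

count_strsplit <- function(x, ch) {
  # Pad both ends so that leading and trailing matches still yield a piece.
  # Assumption: ch is not "X", otherwise the padding itself gets split.
  lengths(strsplit(paste0("X", x, "X"), ch, fixed = TRUE)) - 1L
}

count_regex <- function(x, ch) {
  # Delete everything that is not `ch`, then count what is left.
  # Assumption: ch has no special meaning inside [ ] (e.g. not "^" or "]").
  nchar(gsub(paste0("[^", ch, "]"), "", x))
}

count_fixed <- function(x, ch) {
  # Delete every `ch` as a literal string and compare the lengths.
  nchar(x) - nchar(gsub(ch, "", x, fixed = TRUE))
}

x <- c("banana", "", "aaa", "xyz", "abracadabra")

res_split <- count_strsplit(x, "a")
res_regex <- count_regex(x, "a")
res_fixed <- count_fixed(x, "a")

print(res_fixed)
stopifnot(identical(res_split, res_regex),
          identical(res_regex, res_fixed))
```

All three are vectorized over x, so they can be timed on the same input
with system.time() as in the quoted message.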

Another message is to pay attention to the "cost" of generating big
intermediate objects like the list returned by strsplit(). On a big
character vector made of 5000 strings of about 1e5 random letters
each, the strsplit-based solution uses more than 2 GB of RAM on my
Ubuntu system. The gsub( , fixed=TRUE) solution uses less than 1 GB.
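
A rough way to see where the difference comes from is to compare the
size of the intermediate object each approach builds. The sketch below
uses a scaled-down input (50 strings of 1e4 letters, sizes picked only
for illustration) and object.size(), which measures the finished
objects rather than peak RAM usage, so it is only a proxy for the
figures above.

```r
set.seed(1)
# Scaled-down stand-in for the big input (sizes chosen for illustration):
# 50 strings of 1e4 random lowercase letters each.
x <- replicate(50, paste(sample(letters, 1e4, replace = TRUE), collapse = ""))

# Intermediate object built by each approach.
pieces   <- strsplit(paste0("X", x, "X"), "a", fixed = TRUE)  # list of many short strings
stripped <- gsub("a", "", x, fixed = TRUE)                    # one string per input string

# Both routes give the same counts.
n_split <- lengths(pieces) - 1L
n_gsub  <- nchar(x) - nchar(stripped)
stopifnot(identical(n_split, n_gsub))

# Rough size in bytes of the input and of each intermediate object.
sizes <- c(input    = as.numeric(object.size(x)),
           strsplit = as.numeric(object.size(pieces)),
           gsub     = as.numeric(object.size(stripped)))
print(round(sizes / 1024))  # in KiB
```

The strsplit() list holds one small string per piece, and each of
those carries its own per-object overhead, which is why it tends to be
much bigger than the single stripped string per element that gsub()
returns.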

Cheers,
H.

>
>
> Cheers,
> Bert
>
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Mon, Nov 14, 2016 at 12:26 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
>> Hi,
>>
>> FWIW using gsub( , fixed=TRUE) is faster than using gsub( , fixed=FALSE)
>> or strsplit( , fixed=TRUE):
>>
>>   set.seed(1)
>>   Vec <- paste(sample(letters, 5000000, replace = TRUE), collapse = "")
>>
>>   system.time(res1 <- nchar(gsub("[^a]", "", Vec)))
>>   #  user  system elapsed
>>   # 0.585   0.000   0.586
>>
>>   system.time(res2 <- lengths(strsplit(Vec,"a",fixed=TRUE)) - 1L)
>>   #  user  system elapsed
>>   # 0.061   0.000   0.061
>>
>>   system.time(res3 <- nchar(Vec) - nchar(gsub("a", "", Vec, fixed=TRUE)))
>>   #  user  system elapsed
>>   # 0.039   0.000   0.039
>>
>>   identical(res1, res2)
>>   # [1] TRUE
>>   identical(res1, res3)
>>   # [1] TRUE
>>
>> The gsub( , fixed=TRUE) solution also uses slightly less memory than the
>> strsplit( , fixed=TRUE) solution.
>>
>> Cheers,
>> H.
>>
>>
>> On 11/14/2016 11:55 AM, Charles C. Berry wrote:
>>>
>>> On Mon, 14 Nov 2016, Marc Schwartz wrote:
>>>
>>>>
>>>>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
>>>>>
>>>>> On Mon, 14 Nov 2016, Bert Gunter wrote:
>>>>>
>>> [stuff deleted]
>>>
>>>> Hi,
>>>>
>>>> Both gsub() and strsplit() are using regex based pattern matching
>>>> internally. That being said, they are ultimately calling .Internal
>>>> code, so both are pretty fast.
>>>>
>>>> For comparison:
>>>>
>>>> ## Create a single string of 1,000,000 characters
>>>> set.seed(1)
>>>> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
>>>>
>>>>> nchar(Vec)
>>>>
>>>> [1] 1000000
>>>>
>>>> ## Split the vector into single characters and tabulate
>>>>>
>>>>> table(strsplit(Vec, split = "")[[1]])
>>>>
>>>>
>>>>    a     b     c     d     e     f     g     h     i     j     k     l
>>>> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
>>>>    m     n     o     p     q     r     s     t     u     v     w     x
>>>> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
>>>>    y     z
>>>> 38265 38299
>>>>
>>>>
>>>> ## Get just the count of "a"
>>>>>
>>>>> table(strsplit(Vec, split = "")[[1]])["a"]
>>>>
>>>>    a
>>>> 38664
>>>>
>>>>> nchar(gsub("[^a]", "", Vec))
>>>>
>>>> [1] 38664
>>>>
>>>>
>>>> ## Check performance
>>>>>
>>>>> system.time(table(strsplit(Vec, split = "")[[1]])["a"])
>>>>
>>>>   user  system elapsed
>>>>  0.100   0.007   0.107
>>>>
>>>>> system.time(nchar(gsub("[^a]", "", Vec)))
>>>>
>>>>   user  system elapsed
>>>>  0.270   0.001   0.272
>>>>
>>>>
>>>> So, the above would suggest that using strsplit() is somewhat faster
>>>> than using gsub(). However, as Chuck notes, in the absence of more
>>>> exhaustive benchmarking, the difference may or may not generalize.
>>>
>>>
>>>
>>> Whether splitting on fixed strings rather than treating them as
>>> regex'es (i.e.`fixed=TRUE') makes a big difference seems to depend on
>>> what you split:
>>>
>>> First repeating what Marc did...
>>>
>>>> system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
>>>
>>>    user  system elapsed
>>>   0.132   0.010   0.139
>>>>
>>>> system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
>>>
>>>    user  system elapsed
>>>   0.130   0.010   0.138
>>>
>>> ... fixed=TRUE hardly matters. But the idiom I proposed...
>>>
>>>> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
>>>
>>>    user  system elapsed
>>>   0.017   0.000   0.018
>>>>
>>>> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
>>>
>>>    user  system elapsed
>>>   0.104   0.000   0.104
>>>>
>>>>
>>>
>>> ... is 5 times faster with fixed=TRUE for this case.
>>>
>>> This result matches Marc's count:
>>>
>>>> sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
>>>
>>> [1] 38664
>>>>
>>>>
>>>
>>> Chuck
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fredhutch.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
