[R] Frequency of a character in a string
Bert Gunter
bgunter.4567 at gmail.com
Mon Nov 14 21:44:10 CET 2016
(Sheepishly)...
Yes, thank you Hervé. It would have been nice if I had given correct
solutions. Of course, fixed = TRUE could not have worked with a ["a"]
character class!
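
To make the point concrete, here is a minimal sketch. It assumes the pattern in question is a negated character class such as "[^a]" (the one used in the timings in this thread); the toy string "banana" is made up for illustration.

```r
s <- "banana"

# fixed = TRUE: "[^a]" is searched for as a literal string. That literal
# text does not occur in s, so nothing is removed.
n_fixed <- nchar(gsub("[^a]", "", s, fixed = TRUE))

# fixed = FALSE (the default): "[^a]" is a regex character class meaning
# "any character except a", so everything but the a's is removed.
n_regex <- nchar(gsub("[^a]", "", s, fixed = FALSE))

print(c(fixed = n_fixed, regex = n_regex))
```

With fixed = TRUE the result is just the length of the untouched string, not a count of "a".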
Here's what I found with a 10-element vector, each member of which is a
string of length 1e5:
> system.time((lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1))
user system elapsed
0.013 0.000 0.013
> system.time(nchar(gsub("[^a]", "", x,fixed = FALSE)))
user system elapsed
0.251 0.000 0.252
## WAYYYY slower
> system.time(nchar(x) - nchar(gsub("a", "", x,fixed = TRUE)))
user system elapsed
0.007 0.000 0.007
## twice as fast
Clearly and unsurprisingly, the message is to avoid fixed = FALSE;
after that, it seems mostly to be: who cares?!
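
For anyone wanting to rerun the comparison: the construction of x is not shown above, so the one below is an assumption (10 strings of 1e5 random lowercase letters), and the helper names are hypothetical. This is only a sketch that checks the three approaches agree before anyone bothers timing them.

```r
set.seed(1)
# Assumed test data: 10 strings of 1e5 random lowercase letters each.
x <- replicate(10, paste(sample(letters, 1e5, replace = TRUE), collapse = ""))

# Hypothetical helpers, one per approach timed above. ch is assumed to be a
# single ordinary letter other than "X".
count_split <- function(x, ch) {
  # Pad both ends: strsplit() drops a trailing empty piece, so a match at
  # the very end of a string would otherwise go uncounted.
  lengths(strsplit(paste0("X", x, "X"), ch, fixed = TRUE)) - 1L
}
count_regex <- function(x, ch) {
  # Regex route: delete everything that is not ch, then count what is left.
  nchar(gsub(paste0("[^", ch, "]"), "", x))
}
count_diff <- function(x, ch) {
  # Fixed-string route: length before minus length after deleting ch.
  nchar(x) - nchar(gsub(ch, "", x, fixed = TRUE))
}

stopifnot(identical(count_split(x, "a"), count_regex(x, "a")),
          identical(count_split(x, "a"), count_diff(x, "a")))

# Without the padding, a string ending in the target character is undercounted.
lengths(strsplit("banana", "a", fixed = TRUE)) - 1L   # 2, not 3
count_split("banana", "a")                            # 3
```

Wrapping any of the helpers in system.time() should reproduce the ordering above, though the absolute numbers will depend on the machine.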
Cheers,
Bert
Bert Gunter
"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
On Mon, Nov 14, 2016 at 12:26 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
> Hi,
>
> FWIW using gsub( , fixed=TRUE) is faster than using gsub( , fixed=FALSE)
> or strsplit( , fixed=TRUE):
>
> set.seed(1)
> Vec <- paste(sample(letters, 5000000, replace = TRUE), collapse = "")
>
> system.time(res1 <- nchar(gsub("[^a]", "", Vec)))
> # user system elapsed
> # 0.585 0.000 0.586
>
> system.time(res2 <- lengths(strsplit(Vec,"a",fixed=TRUE)) - 1L)
> # user system elapsed
> # 0.061 0.000 0.061
>
> system.time(res3 <- nchar(Vec) - nchar(gsub("a", "", Vec, fixed=TRUE)))
> # user system elapsed
> # 0.039 0.000 0.039
>
> identical(res1, res2)
> # [1] TRUE
> identical(res1, res3)
> # [1] TRUE
>
> The gsub( , fixed=TRUE) solution also uses slightly less memory than the
> strsplit( , fixed=TRUE) solution.
>
> Cheers,
> H.
>
>
> On 11/14/2016 11:55 AM, Charles C. Berry wrote:
>>
>> On Mon, 14 Nov 2016, Marc Schwartz wrote:
>>
>>>
>>>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
>>>>
>>>> On Mon, 14 Nov 2016, Bert Gunter wrote:
>>>>
>> [stuff deleted]
>>
>>> Hi,
>>>
>>> Both gsub() and strsplit() are using regex based pattern matching
>>> internally. That being said, they are ultimately calling .Internal
>>> code, so both are pretty fast.
>>>
>>> For comparison:
>>>
>>> ## Create a 1,000,000 character vector
>>> set.seed(1)
>>> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
>>>
>>>> nchar(Vec)
>>>
>>> [1] 1000000
>>>
>>> ## Split the vector into single characters and tabulate
>>>>
>>>> table(strsplit(Vec, split = "")[[1]])
>>>
>>>
>>>     a     b     c     d     e     f     g     h     i     j     k     l
>>> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
>>>     m     n     o     p     q     r     s     t     u     v     w     x
>>> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
>>>     y     z
>>> 38265 38299
>>>
>>>
>>> ## Get just the count of "a"
>>>>
>>>> table(strsplit(Vec, split = "")[[1]])["a"]
>>>
>>>     a
>>> 38664
>>>
>>>> nchar(gsub("[^a]", "", Vec))
>>>
>>> [1] 38664
>>>
>>>
>>> ## Check performance
>>>>
>>>> system.time(table(strsplit(Vec, split = "")[[1]])["a"])
>>>
>>> user system elapsed
>>> 0.100 0.007 0.107
>>>
>>>> system.time(nchar(gsub("[^a]", "", Vec)))
>>>
>>> user system elapsed
>>> 0.270 0.001 0.272
>>>
>>>
>>> So, the above would suggest that using strsplit() is somewhat faster
>>> than using gsub(). However, as Chuck notes, in the absence of more
>>> exhaustive benchmarking, the difference may or may not be more
>>> generalizable.
>>
>>
>>
>> Whether splitting on fixed strings rather than treating them as
>> regexes (i.e. `fixed=TRUE') makes a big difference seems to depend on
>> what you split:
>>
>> First repeating what Marc did...
>>
>>> system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
>>
>> user system elapsed
>> 0.132 0.010 0.139
>>>
>>> system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
>>
>> user system elapsed
>> 0.130 0.010 0.138
>>
>> ... fixed=TRUE hardly matters. But the idiom I proposed...
>>
>>> system.time(sum(lengths(strsplit(paste0("X", Vec,
>>> "X"),"a",fixed=TRUE)) - 1))
>>
>> user system elapsed
>> 0.017 0.000 0.018
>>>
>>> system.time(sum(lengths(strsplit(paste0("X", Vec,
>>> "X"),"a",fixed=FALSE)) - 1))
>>
>> user system elapsed
>> 0.104 0.000 0.104
>>>
>>>
>>
>> ... is 5 times faster with fixed=TRUE for this case.
>>
>> This result matches Marc's count:
>>
>>> sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
>>
>> [1] 38664
>>>
>>>
>>
>> Chuck
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>