[R] Frequency of a character in a string
Hervé Pagès
hpages at fredhutch.org
Mon Nov 14 21:26:50 CET 2016
Hi,
FWIW using gsub( , fixed=TRUE) is faster than using gsub( , fixed=FALSE)
or strsplit( , fixed=TRUE):
set.seed(1)
Vec <- paste(sample(letters, 5000000, replace = TRUE), collapse = "")
system.time(res1 <- nchar(gsub("[^a]", "", Vec)))
# user system elapsed
# 0.585 0.000 0.586
system.time(res2 <- lengths(strsplit(Vec,"a",fixed=TRUE)) - 1L)
# user system elapsed
# 0.061 0.000 0.061
system.time(res3 <- nchar(Vec) - nchar(gsub("a", "", Vec, fixed=TRUE)))
# user system elapsed
# 0.039 0.000 0.039
identical(res1, res2)
# [1] TRUE
identical(res1, res3)
# [1] TRUE
The gsub( , fixed=TRUE) solution also uses slightly less memory than the
strsplit( , fixed=TRUE) solution.
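For reuse, the gsub( , fixed=TRUE) idiom above could be wrapped in a small helper. This is only a sketch: the name count_fixed and its arguments are made up for illustration and do not come from this thread. Dividing by nchar(ch) is an addition so that a multi-character literal is also handled (non-overlapping matches).

```r
# Hypothetical helper, not from the thread: count non-overlapping
# occurrences of the literal string `ch` in each element of `x`.
count_fixed <- function(x, ch) {
  stopifnot(is.character(x), is.character(ch), length(ch) == 1L, nzchar(ch))
  # Deleting every match shortens the string by nchar(ch) per occurrence.
  (nchar(x) - nchar(gsub(ch, "", x, fixed = TRUE))) %/% nchar(ch)
}

count_fixed("banana", "a")             # 3
count_fixed(c("abc", "", "aaa"), "a")  # 1 0 3
```

Unlike the strsplit() variant, this needs no "X" padding to cope with a string that ends in the target character, because strsplit() drops a trailing empty piece.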
Cheers,
H.
On 11/14/2016 11:55 AM, Charles C. Berry wrote:
> On Mon, 14 Nov 2016, Marc Schwartz wrote:
>
>>
>>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
>>>
>>> On Mon, 14 Nov 2016, Bert Gunter wrote:
>>>
> [stuff deleted]
>
>> Hi,
>>
>> Both gsub() and strsplit() use regex-based pattern matching by
>> default. That being said, they ultimately call .Internal code, so
>> both are pretty fast.
>>
>> For comparison:
>>
>> ## Create a single string of 1,000,000 characters
>> set.seed(1)
>> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
>>
>>> nchar(Vec)
>> [1] 1000000
>>
>> ## Split the string into single characters and tabulate
>>> table(strsplit(Vec, split = "")[[1]])
>>
>> a b c d e f g h i j k l
>> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
>> m n o p q r s t u v w x
>> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
>> y z
>> 38265 38299
>>
>>
>> ## Get just the count of "a"
>>> table(strsplit(Vec, split = "")[[1]])["a"]
>> a
>> 38664
>>
>>> nchar(gsub("[^a]", "", Vec))
>> [1] 38664
>>
>>
>> ## Check performance
>>> system.time(table(strsplit(Vec, split = "")[[1]])["a"])
>> user system elapsed
>> 0.100 0.007 0.107
>>
>>> system.time(nchar(gsub("[^a]", "", Vec)))
>> user system elapsed
>> 0.270 0.001 0.272
>>
>>
>> So, the above would suggest that using strsplit() is somewhat faster
>> than using gsub(). However, as Chuck notes, in the absence of more
>> exhaustive benchmarking, the difference may or may not generalize.
>
>
> Whether splitting on fixed strings rather than treating them as
> regexes (i.e. `fixed=TRUE') makes a big difference seems to depend on
> what you split on:
>
> First repeating what Marc did...
>
>> system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
> user system elapsed
> 0.132 0.010 0.139
>> system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
> user system elapsed
> 0.130 0.010 0.138
>
> ... fixed=TRUE hardly matters. But the idiom I proposed...
>
>> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
> user system elapsed
> 0.017 0.000 0.018
>> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
> user system elapsed
> 0.104 0.000 0.104
>>
>
> ... is 5 times faster with fixed=TRUE for this case.
>
> This result matches Marc's count:
>
>> sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
> [1] 38664
>>
>
> Chuck
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319