[R] Frequency of a character in a string
Charles C. Berry
ccberry at ucsd.edu
Mon Nov 14 20:55:44 CET 2016
On Mon, 14 Nov 2016, Marc Schwartz wrote:
>
>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
>>
>> On Mon, 14 Nov 2016, Bert Gunter wrote:
>>
[stuff deleted]
> Hi,
>
> Both gsub() and strsplit() are using regex based pattern matching
> internally. That being said, they are ultimately calling .Internal code,
> so both are pretty fast.
>
> For comparison:
>
> ## Create a single string of 1,000,000 characters
> set.seed(1)
> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
>
>> nchar(Vec)
> [1] 1000000
>
> ## Split the vector into single characters and tabulate
>> table(strsplit(Vec, split = "")[[1]])
>
>     a     b     c     d     e     f     g     h     i     j     k     l
> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
>     m     n     o     p     q     r     s     t     u     v     w     x
> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
>     y     z
> 38265 38299
>
>
> ## Get just the count of "a"
>> table(strsplit(Vec, split = "")[[1]])["a"]
>     a
> 38664
>
>> nchar(gsub("[^a]", "", Vec))
> [1] 38664
>
>
> ## Check performance
>> system.time(table(strsplit(Vec, split = "")[[1]])["a"])
>    user  system elapsed
>   0.100   0.007   0.107
>
>> system.time(nchar(gsub("[^a]", "", Vec)))
>    user  system elapsed
>   0.270   0.001   0.272
>
>
> So, the above would suggest that using strsplit() is somewhat faster
> than using gsub(). However, as Chuck notes, in the absence of more
> exhaustive benchmarking, the difference may or may not be more
> generalizable.
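Marc's caveat about single timings can be addressed, at least roughly, by repeating each measurement and comparing medians. The following is only a sketch in base R: `median_elapsed` is a hypothetical helper name introduced here, and the choice of 5 repetitions is an arbitrary assumption, not something from this thread.

```r
# Rebuild the test string used above.
set.seed(1)
Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")

# median_elapsed() is a hypothetical helper (not part of base R):
# it calls f() `reps` times and returns the median elapsed time in seconds.
median_elapsed <- function(f, reps = 5) {
  median(replicate(reps, system.time(f())[["elapsed"]]))
}

# The two approaches Marc compared, each timed several times.
t_strsplit <- median_elapsed(function() table(strsplit(Vec, split = "")[[1]])["a"])
t_gsub     <- median_elapsed(function() nchar(gsub("[^a]", "", Vec)))

c(strsplit = t_strsplit, gsub = t_gsub)
```

The absolute numbers will vary by machine and R version; only the relative ordering across repeated runs is of interest.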
Whether splitting on fixed strings rather than treating them as
regexes (i.e. `fixed=TRUE') makes a big difference seems to depend on
what you split on:
First repeating what Marc did...
> system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
   user  system elapsed
  0.132   0.010   0.139
> system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
   user  system elapsed
  0.130   0.010   0.138
... fixed=TRUE hardly matters. But the idiom I proposed...
> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
   user  system elapsed
  0.017   0.000   0.018
> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
   user  system elapsed
  0.104   0.000   0.104
>
... is nearly 6 times faster with fixed=TRUE for this case (0.018 vs.
0.104 seconds elapsed).
This result matches Marc's count:
> sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
[1] 38664
>
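The reason for pasting "X" onto both ends may not be obvious: strsplit() drops a trailing empty string, so a match at the very end of the input would otherwise go uncounted, and an empty input would give a count of -1. A small sketch follows; `count_char` is a hypothetical wrapper name used only for illustration, and it assumes the character being counted is not "X" itself.

```r
# count_char() is a hypothetical wrapper around the idiom above.
# Assumption: `ch` is a single fixed character other than the pad "X".
count_char <- function(x, ch) {
  sum(lengths(strsplit(paste0("X", x, "X"), ch, fixed = TRUE)) - 1)
}

# Without padding, the final "a" of "banana" is missed, because
# strsplit() returns "b" "n" "n" (no trailing empty string):
unpadded <- lengths(strsplit("banana", "a", fixed = TRUE)) - 1   # 2

# With padding, the pieces are "Xb" "n" "n" "X", so all three are counted:
padded <- count_char("banana", "a")                              # 3

# The padding also keeps an empty input from yielding -1:
empty <- count_char("", "a")                                     # 0
```

Because of the sum(), the wrapper returns a single total when given a vector of several strings.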
Chuck