[R] Frequency of a character in a string
Bert Gunter
bgunter.4567 at gmail.com
Mon Nov 14 21:23:28 CET 2016
Chuck, Marc, and anyone else who still has interest in this odd little
discussion ...
Yes, and with fixed = TRUE my approach took 1/3 as much time as
Chuck's on a 10-element vector, each element of which is a character
string of length 1e5:
> set.seed(1001)
> x <- sapply(1:10, function(x) paste0(sample(letters, 1e5, replace = TRUE), collapse = ""))
> system.time(sum(lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1))
user system elapsed
0.012 0.000 0.012
> system.time(nchar(gsub("[^a]", "", x,fixed = TRUE)))
user system elapsed
0.004 0.000 0.004
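
One thing worth checking before reading much into the gsub() timing above: with fixed = TRUE the pattern "[^a]" is matched as a literal four-character string, not as a character class, so on lowercase-letter input nothing is removed and nchar() just returns the string lengths. A small sketch on a made-up two-element vector (the fixed-string alternative at the end is only a suggestion, not something timed in this thread):

```r
x <- c("banana", "abracadabra")   # toy input: 3 and 5 occurrences of "a"

# fixed = TRUE: "[^a]" is a literal string that never occurs, so x comes back unchanged
literal <- gsub("[^a]", "", x, fixed = TRUE)
len_only <- nchar(literal)        # string lengths, not counts of "a"

# Regex version: delete everything that is not "a", then count what is left
count_regex <- nchar(gsub("[^a]", "", x))

# Fixed-string alternative: delete every "a" and measure how much the string shrank
count_fixed <- nchar(x) - nchar(gsub("a", "", x, fixed = TRUE))

print(len_only)
print(count_regex)
print(count_fixed)
```

If the two counts agree on the real data, the fixed-string form is the one to time against the strsplit() idiom.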
Best,
Bert
Bert Gunter
"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Mon, Nov 14, 2016 at 11:55 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
> On Mon, 14 Nov 2016, Marc Schwartz wrote:
>
>>
>>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
>>>
>>> On Mon, 14 Nov 2016, Bert Gunter wrote:
>>>
> [stuff deleted]
>
>
>> Hi,
>>
>> Both gsub() and strsplit() use regex-based pattern matching
>> internally. That being said, they ultimately call .Internal code, so
>> both are pretty fast.
>>
>> For comparison:
>>
>> ## Create a single string of 1,000,000 characters
>> set.seed(1)
>> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
>>
>>> nchar(Vec)
>>
>> [1] 1000000
>>
>> ## Split the vector into single characters and tabulate
>>>
>>> table(strsplit(Vec, split = "")[[1]])
>>
>>
>> a b c d e f g h i j k l
>> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
>> m n o p q r s t u v w x
>> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
>> y z
>> 38265 38299
>>
>>
>> ## Get just the count of "a"
>>>
>>> table(strsplit(Vec, split = "")[[1]])["a"]
>>
>> a
>> 38664
>>
>>> nchar(gsub("[^a]", "", Vec))
>>
>> [1] 38664
>>
>>
>> ## Check performance
>>>
>>> system.time(table(strsplit(Vec, split = "")[[1]])["a"])
>>
>> user system elapsed
>> 0.100 0.007 0.107
>>
>>> system.time(nchar(gsub("[^a]", "", Vec)))
>>
>> user system elapsed
>> 0.270 0.001 0.272
>>
>>
>> So, the above suggests that strsplit() is somewhat faster than
>> gsub() here. However, as Chuck notes, without more exhaustive
>> benchmarking it is unclear whether the difference generalizes.
>
>
>
> Whether splitting on fixed strings rather than treating them as
> regexes (i.e., `fixed=TRUE`) makes a big difference seems to depend on
> what you split on:
>
> First repeating what Marc did...
>
>> system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
>
> user system elapsed
> 0.132 0.010 0.139
>>
>> system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
>
> user system elapsed
> 0.130 0.010 0.138
>
> ... fixed=TRUE hardly matters. But the idiom I proposed...
>
>> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
>
> user system elapsed
> 0.017 0.000 0.018
>>
>> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
>
> user system elapsed
> 0.104 0.000 0.104
>>
>>
>
> ... is 5 times faster with fixed=TRUE for this case.
>
> This result matches Marc's count:
>
>> sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
>
> [1] 38664
>>
>>
>
> Chuck
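
For anyone wondering why the strsplit() idiom quoted above pads the string with "X" on both sides: strsplit() does not return an empty piece after a match at the very end of a string, so counting pieces on the raw string undercounts whenever the string ends in the target character. A minimal sketch, where count_char() is just a hypothetical wrapper name for the quoted idiom and the inputs are made up:

```r
s <- "banana"   # three "a"s, the last one at the very end

# Without padding: the pieces are "b", "n", "n", so this gives 2 rather than 3
unpadded <- lengths(strsplit(s, "a", fixed = TRUE)) - 1L

# Hypothetical wrapper around the padded idiom; vectorised over x
count_char <- function(x, ch) {
  # "X" on each side guarantees a non-empty first and last piece
  lengths(strsplit(paste0("X", x, "X"), ch, fixed = TRUE)) - 1L
}

test_strings <- c("banana", "xyz", "aaa", "")
padded <- count_char(test_strings, "a")

# Cross-check against the regex idiom from earlier in the thread
regex_count <- nchar(gsub("[^a]", "", test_strings))

print(unpadded)
print(padded)
print(regex_count)
```

The padding character has to differ from the character being counted; counting "X" itself this way would be off by two.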