# [R] Strplit code

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Thu Dec 4 11:04:47 CET 2008

John Fox wrote:
> By coincidence, I have a version of strsplit() that I've used to
> illustrate recursion:
>
> Strsplit <- function(x, split){
>     if (length(x) > 1) {
>         return(lapply(x, Strsplit, split))  # vectorization
>         }
>     result <- character(0)
>     if (nchar(x) == 0) return(result)
>     posn <- regexpr(split, x)
>     if (posn <= 0) return(x)
>     c(result, substring(x, 1, posn - 1),
>         Recall(substring(x, posn+1, nchar(x)), split))  # recursion
>     }
>
>

well, it is both inefficient and wrong.

inefficient because of the non-tail recursion and recursive
concatenation, which is justified for the sake the purpose of showing
recursion, but for practical purposes you'd rather use gregexepr.

wrong because of how you pick the remaining part of the string to be
split -- it works just under the assumption the pattern is a single
character:

Strsplit("hello-dolly,--sweet", "--")
# the pattern is *two* hyphens
# [1] "hello-dolly" "-sweet"

Strsplit("hello dolly", "")
# the pattern is the empty string
#  [1] "" "" "" "" "" "" "" "" "" "" ""

here's a quick rewrite -- i haven't tested it on extreme cases, it may
not be perfect, and there's a hidden source of inefficiency here as well:

strsplit =
function(strings, split) {
positions = gregexpr(split, strings)
lapply(1:length(strings), function(i)
substring(strings[[i]], c(1, positions[[i]] +
attr(positions[[i]], "match.length")), c(positions[[i]]-1,
nchar(strings[[i]]))))
}

n = 1000; m = 100
strings = replicate(n, paste(sample(c(letters, " "), 100, replace=TRUE),
collapse=""))
system.time(replicate(m, strsplit(strings, " ")))
system.time(replicate(m, Strsplit(strings, " ")))

vQ