[R] Strplit code
John Fox
jfox at mcmaster.ca
Thu Dec 4 13:14:06 CET 2008
Dear Wacek,
"Wrong" is a bit strong, I think -- limited to single-character patterns is
more accurate. Moreover, it isn't hard to make the function work with
multiple-character matches as well:
Strsplit <- function(x, split){
    if (length(x) > 1) {
        return(lapply(x, Strsplit, split)) # vectorization
    }
    result <- character(0)
    if (nchar(x) == 0) return(result)
    posn <- regexpr(split, x)
    if (posn <= 0) return(x)
    c(result, substring(x, 1, posn - 1),
      Recall(substring(x, posn + attr(posn, "match.length"),
                       nchar(x)), split)) # recursion
}
On the other hand, your function is much more efficient.
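As a quick check, here is a minimal sketch that runs the revised function on the two-hyphen example from the quoted message and compares the result with base R's strsplit(). The function definition is repeated only to keep the snippet self-contained; the variable names s, recursive, and builtin are illustrative, and the empty-pattern case is not exercised here.

```r
# Revised recursive splitter, repeated so the example runs on its own
Strsplit <- function(x, split) {
  if (length(x) > 1) {
    return(lapply(x, Strsplit, split))  # vectorization
  }
  result <- character(0)
  if (nchar(x) == 0) return(result)
  posn <- regexpr(split, x)
  if (posn <= 0) return(x)
  # skip the whole match, not just one character
  c(result, substring(x, 1, posn - 1),
    Recall(substring(x, posn + attr(posn, "match.length"), nchar(x)),
           split))  # recursion
}

s <- "hello-dolly,--sweet"          # the pattern below is *two* hyphens
recursive <- Strsplit(s, "--")
builtin   <- strsplit(s, "--")[[1]] # base R result for comparison

print(recursive)
# [1] "hello-dolly," "sweet"
stopifnot(identical(recursive, builtin))
```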
Regards,
John
------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
web: socserv.mcmaster.ca/jfox
> -----Original Message-----
> From: Wacek Kusnierczyk [mailto:Waclaw.Marcin.Kusnierczyk at idi.ntnu.no]
> Sent: December-04-08 5:05 AM
> To: John Fox
> Cc: R help
> Subject: Re: [R] Strplit code
>
> John Fox wrote:
> > By coincidence, I have a version of strsplit() that I've used to
> > illustrate recursion:
> >
> > Strsplit <- function(x, split){
> >     if (length(x) > 1) {
> >         return(lapply(x, Strsplit, split)) # vectorization
> >     }
> >     result <- character(0)
> >     if (nchar(x) == 0) return(result)
> >     posn <- regexpr(split, x)
> >     if (posn <= 0) return(x)
> >     c(result, substring(x, 1, posn - 1),
> >       Recall(substring(x, posn+1, nchar(x)), split)) # recursion
> > }
> >
> >
>
> well, it is both inefficient and wrong.
>
> inefficient because of the non-tail recursion and recursive
> concatenation, which is justified for the purpose of showing
> recursion, but for practical purposes you'd rather use gregexpr.
>
> wrong because of how you pick the remaining part of the string to be
> split -- it works only under the assumption that the pattern is a single
> character:
>
> Strsplit("hello-dolly,--sweet", "--")
> # the pattern is *two* hyphens
> # [1] "hello-dolly," "-sweet"
>
> Strsplit("hello dolly", "")
> # the pattern is the empty string
> # [1] "" "" "" "" "" "" "" "" "" "" ""
>
>
> here's a quick rewrite -- i haven't tested it on extreme cases, it may
> not be perfect, and there's a hidden source of inefficiency here as well:
>
> strsplit =
> function(strings, split) {
>     positions = gregexpr(split, strings)
>     lapply(1:length(strings), function(i)
>         substring(strings[[i]],
>                   c(1, positions[[i]] + attr(positions[[i]], "match.length")),
>                   c(positions[[i]]-1, nchar(strings[[i]]))))
> }
>
>
> n = 1000; m = 100
> strings = replicate(n, paste(sample(c(letters, " "), 100, replace=TRUE),
> collapse=""))
> system.time(replicate(m, strsplit(strings, " ")))
> system.time(replicate(m, Strsplit(strings, " ")))
>
>
> vQ