[R] split strings

Thu May 28 14:30:14 CEST 2009

(diverted to r-devel, a source code patch attached)

Wacek Kusnierczyk wrote:
> Allan Engelhardt wrote:
>   
>> Immaterial, yes, but it is always good to test :) and your solution
>> *is* faster and it is even faster if you can assume byte strings:
>>     
>
> :)
>
> indeed;  though if the speed is immaterial (and in this case it
> supposedly was), it's probably not worth risking fixed=TRUE removing
> '.tif' from the middle of the name, however unlikely this might be (cf
> murphy's laws).
>
> but if you can assume that each string ends with a '.tif' (or any other
> \..{3} substring), then substr is marginally faster than sub, even as a
> three-pass approach, while avoiding the risk of removing '.tif' from the
> middle:
>
>     strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
> paste(sample(letters, 10), collapse='')))
>     library(rbenchmark)
>     benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
>        substr={basenames=basename(strings); substr(basenames, 1,
> nchar(basenames)-4)},
>        sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
>     #     test elapsed
>     # 1 substr   3.176
>     # 2    sub   3.296
>   
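
to illustrate the risk mentioned in the quoted text, here is a small sketch; the file names are made up for the example.  with fixed=TRUE, sub removes the first literal '.tif' it finds, wherever that is, while an anchored regex or the substr/nchar approach only touches the end of the string:

```r
# made-up file names; the second one contains '.tif' in the middle as well
x = c('scan.tif', 'my.tiffany.tif')

# fixed=TRUE removes the first literal '.tif', not necessarily the last one
fixed = sub('.tif', '', x, fixed=TRUE)
# "scan" "myfany.tif"

# anchoring the pattern at the end of the string avoids this
anchored = sub('\\.tif$', '', x)
# "scan" "my.tiffany"

# dropping the last four characters, assuming every name ends in '.tif'
trimmed = substr(x, 1, nchar(x)-4)
# "scan" "my.tiffany"
```

the anchored regex gives up the fixed=TRUE speedup, which is why the substr variant is the one benchmarked here.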

btw., i wonder why substr silently treats negative indices as 1:

    substr('foobar', -5, 5)
    # "fooba"
    # substr('foobar', 1, 5)
    substr('foobar', 2, -2)
    # ""
    # substr('foobar', 2, 1)

this does not seem to be documented in ?substr.  negative indices could
be made meaningful, e.g., by counting them from the end of the string
(as in, e.g., perl), so that -k refers to position nchar(x)-k+1:

    # hypothetical
    substr('foobar', -5, 5)
    # "ooba"
    # substr('foobar', 6-5+1, 5)
    substr('foobar', 2, -2)
    # "ooba"
    # substr('foobar', 2, 6-2+1)
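
the same semantics can be sketched in plain r, without touching the c sources.  this is only an illustration of the index mapping, not the attached patch; the name substr_from_end is hypothetical, and an index of 0 is simply passed through to substr unchanged:

```r
# hypothetical wrapper: a negative index -k is mapped to nchar(x)-k+1,
# non-negative indices are passed to substr as they are
substr_from_end = function(x, start, stop) {
    n = nchar(x)
    # recycle the indices to the length of x, so that ifelse works elementwise
    start = rep(start, length.out=length(x))
    stop = rep(stop, length.out=length(x))
    start = ifelse(start < 0, n + start + 1, start)
    stop = ifelse(stop < 0, n + stop + 1, stop)
    substr(x, start, stop)
}

substr_from_end('foobar', -5, 5)
# "ooba"
substr_from_end('foobar', 2, -2)
# "ooba"
substr_from_end(c('abc.tif', 'hello.tif'), 1, -5)
# "abc"   "hello"
```

the extra nchar and ifelse passes make this slower than doing the mapping inside substr itself, so it is useful for checking the semantics rather than for speed.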

there is a trivial fix to src/main/character.c that gives substr the
extended functionality -- see the attached patch.  the patch has been
created and tested as follows:

    svn co https://svn.r-project.org/R/trunk r-devel
    cd r-devel
    # modifications made to src/main/character.c
    svn diff > character.c.diff
    svn revert -R .
    patch -p0 < character.c.diff

    ./configure
    make
    make check-all
    # no problems reported

with the patched substr, the original problem can be solved more
concisely in two passes (basename, then substr), and still slightly
faster than the sub/fixed/bytes version:

    strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
        paste(sample(letters, 10), collapse='')))
    library(rbenchmark)
    benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
        substr=substr(basename(strings), 1, -5),
        'substr-nchar'={
            basenames=basename(strings)
            substr(basenames, 1, nchar(basenames)-4) },
        sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
    #           test elapsed
    # 1       substr   2.981
    # 2 substr-nchar   3.206
    # 3          sub   3.273
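
before comparing timings it may be worth confirming that the approaches return the same result on this input.  a minimal check with an unpatched r and no extra packages could look like the following; the patched substr(x, 1, -5) variant is left out, as it needs the patch, and the variable names are only for illustration:

```r
# same test data as in the benchmark: random 10-letter names ending in '.tif'
strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
    paste(sample(letters, 10), collapse='')))
basenames = basename(strings)

# drop the last four characters
by_substr = substr(basenames, 1, nchar(basenames)-4)
# remove the first literal '.tif'
by_sub = sub('.tif', '', basenames, fixed=TRUE, useBytes=TRUE)

# the names consist of letters only, so both approaches should agree
stopifnot(identical(by_substr, by_sub))
```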

if this sounds interesting, i can update the docs accordingly.

vQ