[R] split strings

Allan Engelhardt allane at cybaea.com
Wed May 27 08:34:14 CEST 2009


Immaterial, yes, but it is always good to test :) Your solution *is* 
faster, and it is faster still if you can assume byte strings:

 > strings = sprintf('f:/foo/bar//%s.tif',
     replicate(1000, paste(sample(letters, 10), collapse='')))
 > library(rbenchmark)
 > benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
   'one-pass, perl'=sub('.*//(.*)[.]tif$', '\\1', strings, perl=TRUE),
   'two-pass, perl'=sub('.tif$', '', basename(strings), perl=TRUE),
   'one-pass, no perl'=sub('.*//(.*)[.]tif$', '\\1', strings, perl=FALSE),
   'two-pass, no perl'=sub('.tif$', '', basename(strings), perl=FALSE),
   'fixed'=sub(".tif", "", basename(strings), fixed=TRUE),
   'fixed, bytes'=sub(".tif", "", basename(strings), fixed=TRUE,
     useBytes=TRUE))

               test elapsed
1    one-pass, perl   2.946
2    two-pass, perl   3.858
3 one-pass, no perl  15.884
4 two-pass, no perl   3.788
5             fixed   2.264
6      fixed, bytes   1.813
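
Speed aside, the variants timed above are not interchangeable on every input. The following is only a small sketch of where they differ; the sample paths are made up for illustration, and tools::file_path_sans_ext() is not used anywhere in this thread (it is shown as a base R alternative, assuming a version of R that ships it).

```r
# Hypothetical inputs: one ordinary name, one with ".tif" inside the name
paths <- c("f:/foo/bar//buz.tif", "f:/foo/bar//a.tif.b.tif")

# fixed=TRUE removes the FIRST literal ".tif", not the trailing one
fixed_res <- sub(".tif", "", basename(paths), fixed = TRUE)
# "buz" "a.b.tif"

# For plain ASCII input, useBytes=TRUE gives the same strings
bytes_res <- sub(".tif", "", basename(paths), fixed = TRUE, useBytes = TRUE)

# An anchored regex with an escaped dot removes only the trailing extension
anchored_res <- sub("[.]tif$", "", basename(paths))
# "buz" "a.tif.b"

# An unescaped dot matches any character, so '.tif$' also matches "otif"
loose_res <- sub(".tif$", "", "motif")
# "m"

# Base R helper that strips the last extension (assumption: not from the thread)
tools_res <- tools::file_path_sans_ext(basename(paths))
```

So the 'fixed' variants trade a little generality for speed, as Gabor notes below, and the '.tif$' patterns in the benchmark would be stricter written as '[.]tif$'.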

Allan

Gabor Grothendieck wrote:
> Although speed is really immaterial here this is likely
> to be faster than all shown so far:
>
> sub(".tif", "", basename(metr_list), fixed = TRUE)
>
> It does not allow file names with .tif in the middle
> of them since it will delete the first occurrence rather
> than the last but such a situation is highly unlikely.
>
>
> On Tue, May 26, 2009 at 4:24 PM, Wacek Kusnierczyk
> <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
>   
>> Monica Pisica wrote:
>>     
>>> Hi everybody,
>>>
>>> Thank you for the suggestions and especially the explanation Waclaw provided for his code. Maybe one day i will be able to wrap my head around this.
>>>
>>> Thanks again,
>>>
>>>       
>> you're welcome.  note that if efficiency is an issue, you'd better have
>> perl=TRUE there:
>>
>>    output = sub('.*//(.*)[.]tif$', '\\1', input, perl=TRUE)
>>
>> with perl=TRUE, the one-pass solution is somewhat faster than the
>> two-pass solution of gabor's -- which, however, is probably easier to
>> understand;  with perl=FALSE (the default), the performance drops:
>>
>>    strings = sprintf(
>>        'f:/foo/bar//%s.tif',
>>        replicate(1000, paste(sample(letters, 10), collapse='')))
>>    library(rbenchmark)
>>    benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
>>       'one-pass, perl'=sub('.*//(.*)[.]tif$', '\\1', strings, perl=TRUE),
>>       'two-pass, perl'=sub('.tif$', '', basename(strings), perl=TRUE),
>>       'one-pass, no perl'=sub('.*//(.*)[.]tif$', '\\1', strings,
>>          perl=FALSE),
>>       'two-pass, no perl'=sub('.tif$', '', basename(strings), perl=FALSE))
>>    # 1    one-pass, perl   3.391
>>    # 2    two-pass, perl   4.944
>>    # 3 one-pass, no perl  18.836
>>    # 4 two-pass, no perl   5.191
>>
>> vQ
>>
>>
>>     
>>> Monica
>>>
>>> ----------------------------------------
>>>
>>>       
>>>> Date: Tue, 26 May 2009 15:46:21 +0200
>>>> From: Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
>>>> To: pisicandru at hotmail.com
>>>> CC: r-help at r-project.org
>>>> Subject: Re: [R] split strings
>>>>
>>>> Monica Pisica wrote:
>>>>
>>>>         
>>>>> Hi everybody,
>>>>>
>>>>> I have a vector of characters and i would like to extract certain parts. My vector is named metr_list:
>>>>>
>>>>> [1] "F:/Naval_Live_Oaks/2005/data//BE.tif"
>>>>> [2] "F:/Naval_Live_Oaks/2005/data//CH.tif"
>>>>> [3] "F:/Naval_Live_Oaks/2005/data//CRR.tif"
>>>>> [4] "F:/Naval_Live_Oaks/2005/data//HOME.tif"
>>>>>
>>>>> And i would like to extract BE, CH, CRR, and HOME in a different vector named "names.id"
>>>>>
>>>>>           
>>>> one way that seems reasonable is to use sub:
>>>>
>>>> output = sub('.*//(.*)[.]tif$', '\\1', input)
>>>>
>>>> which says 'from each string remember the substring between the
>>>> rightmost two slashes and a .tif extension, exclusive, and replace the
>>>> whole thing with the captured part'. if the pattern does not match, you
>>>> get the original input:
>>>>
>>>> sub('.*//(.*)[.]tif$', '\\1', 'f:/foo/bar//buz.tif')
>>>> # buz
>>>>
>>>> vQ
>>>>
>>>>         
>>> _________________________________________________________________
>>>       
>>     
>
>
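
For reference, a small sketch of how the one-pass pattern quoted above behaves, including the no-match case Wacek describes. The input strings are invented for illustration only.

```r
# Hypothetical inputs: two that match the pattern, one that does not
input <- c("f:/foo/bar//buz.tif", "F:/a//b//HOME.tif", "f:/foo/bar/buz.png")

# The greedy '.*' consumes everything up to the LAST '//'; the group then
# captures what lies between that '//' and the trailing '.tif'
output <- sub(".*//(.*)[.]tif$", "\\1", input)

# output[1] is "buz", output[2] is "HOME";
# output[3] has no '//' and no '.tif', so it is returned unchanged
print(output)
```

Because a non-matching string passes through untouched, it may be worth checking the result (for example with grepl() on the same pattern) before using it as a set of names.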



