[R] Matching long strings ... was Re: Memory management in R

David Winsemius dwinsemius at comcast.net
Sun Oct 10 20:00:11 CEST 2010


On Oct 10, 2010, at 11:35 AM, Martin Morgan wrote:

> On 10/10/2010 07:11 AM, David Winsemius wrote:
>>
>> On Oct 10, 2010, at 9:27 AM, Lorenzo Isella wrote:
>>
>>>
>>>> I already offered the Biostrings package. It provides more robust
>>>> methods for string matching than does grepl. Is there a reason that
>>>> you choose not to use it?
>>>>
>>>
>>> Indeed that is the way I should go, and I have installed the
>>> package after some struggling.
>>
>> For me it was a matter of waiting. The only struggle was coming from
>> my inner timer saying it was taking too long.
>>
>>> Since Biostrings is a fairly complex package and I need only a way to
>>> check whether a certain string A is a substring of string B, do you
>>> know the Biostrings functions to achieve this?
>>> I see a lot of methods for biological (DNA, RNA) sequences, and they
>>> may not apply to my series (which are definitely not from biology).
>>> Cheers
>>
>> It appeared to me that the function matchPattern should replace your
>> grepl invocation that was failing. It returns a more complex
>> structure, so you would need to determine what would be an exact
>> replacement for grepl(...) != 1. It looks like a no-match event
>> results in the start and end items being of length 0.
>>
>>> str(  matchPattern("A", BString("BBB")) )
>
> A couple of things from this thread.
>
> To install a Bioconductor package follow directions here
>
>  http://bioconductor.org/install/index.html#install-bioconductor-packages
>
> which leads to
>
>   source("http://bioconductor.org/biocLite.R")
>   biocLite("Biostrings")
>
> biocLite is just a wrapper around install.packages with appropriate
> repositories defined.
>
> Some Bioconductor packages are relatively mature and make relatively
> advanced use of S4 classes, so looking at str() is not that helpful --
> the way the user is meant to interact with the object is different
> from the way the object is implemented. So the best bet is to look at
> the relevant help pages
>
>  result = matchPattern("A", BString("BBB"))
>  class(result)
>  class?XStringViews

The above was the most surprising example for me (not being  
particularly S4-savvy). Looks like it parses as:
`?`(class, XStringViews)

Is that an S4 sort of extension for accessing documentation or have I  
just missed a more general method? I tried looking at the help Index  
for the "methods" package.
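
For what it is worth, the parse can be checked in base R without evaluating the call. This is only a minimal sketch. As far as I can tell from the help page for `?` in the utils package, the type?topic form is a general feature of that operator rather than something specific to Biostrings. The alias built on the last line ("XStringViews-class") is my assumption about how class documentation is filed, not something confirmed in this thread.

```r
# Capture the expression unevaluated so that no help lookup is triggered.
e <- quote(class?XStringViews)

# A binary use of `?` is a call with three elements: the function `?`,
# the left operand (the documentation type) and the right operand (the topic).
parts <- as.list(e)
print(parts)

is_help_call <- identical(parts[[1]], as.name("?"))
doc_type     <- as.character(parts[[2]])
topic        <- as.character(parts[[3]])

# Assumption: class documentation is registered under "<topic>-<type>".
alias <- paste(topic, doc_type, sep = "-")
print(alias)
```

If that assumption holds, help("XStringViews-class") should reach the same page once Biostrings is loaded.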

>
> and the help pages referenced there, or from which XStringViews  
> inherits
>
>   getClass("XStringViews")
>
> and in particular
>
>   class?Ranges
>
> Rather than accessing the 'start' slot, use start(result). Vignettes
> are used heavily in Bioconductor packages, and in particular
>
>   browseVignettes("Biostrings")
>
> pops up a page with several relevant vignettes, e.g., 'A short
> presentation of the basic classes...' and perhaps 'Pairwise Sequence
> Alignment'. These are also accessible on the Bioconductor web site,
> e.g., on the pages linked from
>
>  http://bioconductor.org/help/bioc-views/release/bioc/
>
> The rule of thumb hinted at below -- that an operation seems to be
> taking longer than it should -- probably indicates that the function
> is being invoked in an inefficient way. If the documentation is
> opaque then definitely the place to seek additional help is on the
> Bioconductor mailing list
>
>  http://bioconductor.org/help/mailing-list/
>
> Hope this helps.
>
> Martin
>
>
>> Formal class 'XStringViews' [package "Biostrings"] with 7 slots
>>  ..@ subject        :Formal class 'BString' [package "Biostrings"] with 6 slots
>>  .. .. ..@ shared         :Formal class 'SharedRaw' [package "IRanges"] with 2 slots
>>  .. .. .. .. ..@ xp                    :<externalptr>
>>  .. .. .. .. ..@ .link_to_cached_object:<environment: 0x11e0e59f8>
>>  .. .. ..@ offset         : int 0
>>  .. .. ..@ length         : int 3
>>  .. .. ..@ elementMetadata: NULL
>>  .. .. ..@ elementType    : chr "ANY"
>>  .. .. ..@ metadata       : list()
>>  ..@ start          : int(0)
>>  ..@ width          : int(0)
>>  ..@ NAMES          : NULL
>>  ..@ elementMetadata: NULL
>>  ..@ elementType    : chr "integer"
>>  ..@ metadata       : list()
>>
>> Perhaps:
>>
>> length(matchPattern(fut_string, past_string)@start ) == 0
>>
>> You do need to use BString() on at least the past_string argument and
>> maybe the fut_string as well. The Bioconductor mailing list would
>> have a larger audience with experience using this package, so they
>> should probably be your next avenue for advice. I am just reading the
>> help pages as you should be able to do. The help page
>> help("lowlevel-matching") should probably be reviewed since there may
>> be efficiency issues to consider as mentioned below.
>>
>> When dropped into your function with the BString coercion, it
>> replicated your small example results and did not crash after a long
>> period with your larger example, so I then terminated it and inserted
>> a "reporter" line to monitor progress. With that reporter I got up
>> into the 200's for count_len without error. My laptop CPU was warming
>> up the case and I was getting sleepy so I terminated the process. (I
>> had no way of checking for accuracy, even if I had let it proceed,
>> since you did not offer a "correct" answer.)
>>
>> By the way, the construct ... grepl(. , .) != 1 ... is perhaps
>> inefficient. It could more compactly be expressed as ... !grepl(. , .)
>> which would not be doing coercion of logicals to numbers.
>>
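
On the grepl(...) != 1 point quoted just above, a small base-R sketch of the equivalence. The vector x and the pattern "bc" are made-up inputs for illustration only, and fixed = TRUE is my choice so that the pattern is treated as a literal string.

```r
# Hypothetical inputs, not taken from the original problem.
x   <- c("abc", "xyz", "bcd")
hit <- grepl("bc", x, fixed = TRUE)

# Comparing with 1 coerces the logical vector to numeric first.
no_match_cmp <- hit != 1

# Negation keeps everything logical and reads more directly.
no_match_neg <- !hit

# Both forms flag the same elements as "no match".
print(identical(no_match_cmp, no_match_neg))
```
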
>
>
> -- 
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
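
Putting the quoted suggestions together, the substring test might be sketched as follows. The helper name has_match is hypothetical. It uses countPattern() from Biostrings when that package is installed, and otherwise falls back to base grepl() with fixed = TRUE; the fallback is my own stand-in and was not proposed in the thread. I have not checked how either branch behaves on very long strings.

```r
# Hypothetical helper: TRUE if `pattern` occurs somewhere in `subject`.
has_match <- function(pattern, subject) {
  if (requireNamespace("Biostrings", quietly = TRUE)) {
    # countPattern() gives the number of exact matches of pattern in subject;
    # BString() wraps an arbitrary (non-biological) string.
    Biostrings::countPattern(pattern, Biostrings::BString(subject)) > 0
  } else {
    # Stand-in when Biostrings is unavailable: literal substring search.
    grepl(pattern, subject, fixed = TRUE)
  }
}

print(has_match("A", "BBB"))    # no "A" in "BBB"
print(has_match("BB", "ABBB"))  # "BB" occurs in "ABBB"
```

A no-match test equivalent to the original grepl(...) != 1 would then be !has_match(fut_string, past_string).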


