[R] Matching long strings ... was Re: Memory management in R
David Winsemius
dwinsemius at comcast.net
Sun Oct 10 20:00:11 CEST 2010
On Oct 10, 2010, at 11:35 AM, Martin Morgan wrote:
> On 10/10/2010 07:11 AM, David Winsemius wrote:
>>
>> On Oct 10, 2010, at 9:27 AM, Lorenzo Isella wrote:
>>
>>>
>>>> I already offered the Biostrings package. It provides more robust
>>>> methods for string matching than does grepl. Is there a reason that
>>>> you choose not to?
>>>>
>>>
>>> Indeed that is the way I should go for and I have installed the
>>> package after some struggling.
>>
>> For me it was a matter of waiting. The only struggle was coming from
>> my inner timer saying it was taking too long.
>>
>>> Since Biostrings is a fairly complex package and I need only a way to
>>> check whether a certain string A is a substring of string B, do you
>>> know the Biostrings functions to achieve this?
>>> I see a lot of methods for biological (DNA, RNA) sequences, and they
>>> may not apply to my series (which are definitely not from biology).
>>> Cheers
>>
>> It appeared to me that the function matchPattern should replace your
>> grepl invocation that was failing. It returns a more complex structure,
>> so you would need to determine what would be an exact replacement for
>> grepl(...) != 1. Looks like a no-match event results in the start and
>> end items being of length 0.
>>
>>> str( matchPattern("A", BString("BBB")) )
>
> A couple of things from this thread.
>
> To install a Bioconductor package follow directions here
>
> http://bioconductor.org/install/index.html#install-bioconductor-packages
>
> which leads to
>
> source("http://bioconductor.org/biocLite.R")
> biocLite("Biostrings")
>
> biocLite is just a wrapper around install.packages with appropriate
> repositories defined.
>
> Some Bioconductor packages are relatively mature and make relatively
> advanced use of S4 classes, so looking at str() is not that helpful --
> the way the user is meant to interact with the object is different from
> the way the object is implemented. So the best bet is to look at the
> relevant help pages
>
> result = matchPattern("A", BString("BBB"))
> class(result)
> class?XStringViews
The above was the most surprising example for me (not being
particularly S4-savvy). Looks like it parses as:
`?`(class, XStringViews)
Is that an S4 sort of extension for accessing documentation or have I
just missed a more general method? I tried looking at the help Index
for the "methods" package.
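A quick way to check that parse, as a minimal sketch in base R. Nothing
here needs Biostrings, because quote() captures the expression without
evaluating it. As far as I can tell from help("?"), this is the general
type?topic form and not an S4-only extension; with type "class" it should
look up the topic "XStringViews-class". The variable names e, fun and args
are only for illustration.

```r
# quote() returns the unevaluated call, so no help lookup is triggered
# and XStringViews does not need to exist in the session.
e <- quote(class?XStringViews)

# First element of a call is the function; the rest are its arguments.
fun  <- as.character(e[[1]])
args <- vapply(as.list(e)[-1], deparse, character(1))

print(fun)
print(args)
```

If that reading is right, methods?show would be the same mechanism with a
different type.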
>
> and the help pages referenced there, or from which XStringViews inherits
>
> getClass("XStringViews")
>
> and in particular
>
> class?Ranges
>
> Rather than accessing the 'start' slot, use start(result). Vignettes are
> used heavily in Bioconductor packages, and in particular
>
> browseVignettes("Biostrings")
>
> pops up a page with several relevant vignettes, e.g., 'A short
> presentation of the basic classes...' and perhaps 'Pairwise Sequence
> Alignment'. These are also accessible on the Bioconductor web site,
> e.g., on the pages linked from
>
> http://bioconductor.org/help/bioc-views/release/bioc/
>
> The rule of thumb hinted at below -- that an operation seems to be
> taking longer than it should -- probably indicates that the function is
> being invoked in an inefficient way. If the documentation is opaque then
> definitely the place to seek additional help is on the Bioconductor
> mailing list
>
> http://bioconductor.org/help/mailing-list/
>
> Hope this helps.
>
> Martin
>
>
>> Formal class 'XStringViews' [package "Biostrings"] with 7 slots
>> ..@ subject :Formal class 'BString' [package "Biostrings"] with 6 slots
>> .. .. ..@ shared :Formal class 'SharedRaw' [package "IRanges"] with 2 slots
>> .. .. .. .. ..@ xp :<externalptr>
>> .. .. .. .. ..@ .link_to_cached_object:<environment: 0x11e0e59f8>
>> .. .. ..@ offset : int 0
>> .. .. ..@ length : int 3
>> .. .. ..@ elementMetadata: NULL
>> .. .. ..@ elementType : chr "ANY"
>> .. .. ..@ metadata : list()
>> ..@ start : int(0)
>> ..@ width : int(0)
>> ..@ NAMES : NULL
>> ..@ elementMetadata: NULL
>> ..@ elementType : chr "integer"
>> ..@ metadata : list()
>>
>> Perhaps:
>>
>> length(matchPattern(fut_string, past_string)@start ) == 0
>>
>> You do need to use BString() on at least the past_string argument and
>> maybe the fut_string as well. The Bioconductor mailing list would have a
>> larger audience with experience using this package, so they should
>> probably be your next avenue for advice. I am just reading the help
>> pages as you should be able to do. The help page
>> help("lowlevel-matching") should probably be reviewed since there may be
>> efficiency issues to consider as mentioned below.
>>
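Taking up the advice about accessors, the no-match test can be written
without reaching into the @start slot. This is only a sketch and assumes
Biostrings is installed: has_match is a hypothetical helper name (not a
Biostrings function), and the short strings stand in for fut_string and
past_string.

```r
# has_match() is a hypothetical helper, not part of Biostrings.
# It returns TRUE when `pattern` occurs at least once in `subject`.
has_match <- function(pattern, subject) {
  hits <- Biostrings::matchPattern(pattern, Biostrings::BString(subject))
  # length() of the returned views object is the number of matches found
  length(hits) > 0
}

# Only run the calls when the package is actually available.
if (requireNamespace("Biostrings", quietly = TRUE)) {
  print(has_match("A", "BBB"))
  print(has_match("BB", "ABBA"))
}
```

The equivalent of grepl(...) != 1 would then be !has_match(fut_string,
past_string).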
>> When dropped into your function with the BString coercion, it replicated
>> your small example results and did not crash after a long period with
>> your larger example, so I then terminated it and inserted a "reporter"
>> line to monitor progress. With that reporter I got up into the 200's for
>> count_len without error. My laptop CPU was warming up the case and I was
>> getting sleepy so I terminated the process. (I had no way of checking
>> for accuracy, even if I had let it proceed, since you did not offer a
>> "correct" answer.)
>>
>> By the way, the construct ... grepl(. , .) != 1 ... is perhaps
>> inefficient. It could more compactly be expressed as ... !grepl(. , .)
>> which would not be coercing logicals to numbers.
>>
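To make that equivalence concrete, a small base-R sketch. The strings are
made-up illustrative values, not the original data, and fixed = TRUE is an
addition here: it asks grepl for a literal substring match instead of a
regular expression, which seems closer to the stated goal.

```r
past_string <- "3-7-12-5-9"   # illustrative values only
fut_string  <- "12-5"         # occurs in past_string
other       <- "5-12"         # does not occur in past_string

# Original construct: the logical result is compared with the number 1.
no_match_old <- grepl(fut_string, past_string, fixed = TRUE) != 1

# More compact form: negate the logical directly.
no_match_new <- !grepl(fut_string, past_string, fixed = TRUE)

no_match_other <- !grepl(other, past_string, fixed = TRUE)

print(c(no_match_old, no_match_new, no_match_other))
```

Both forms give the same answer for every input; the second just skips
the comparison step.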
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793