[R] Matching long strings ... was Re: Memory management in R
mtmorgan at fhcrc.org
Sun Oct 10 20:22:43 CEST 2010
On 10/10/2010 11:00 AM, David Winsemius wrote:
> On Oct 10, 2010, at 11:35 AM, Martin Morgan wrote:
>> On 10/10/2010 07:11 AM, David Winsemius wrote:
>>> On Oct 10, 2010, at 9:27 AM, Lorenzo Isella wrote:
>>>>> I already offered the Biostrings package. It provides more robust
>>>>> methods for string matching than does grepl. Is there a reason that
>>>>> you chose not to use it?
>>>> Indeed that is the way I should go, and I have installed the
>>>> package after some struggling.
>>> For me it was a matter of waiting. The only struggle was coming from my
>>> inner timer saying it was taking too long.
>>>> Since Biostrings is a fairly complex package and I need only a way to
>>>> check if a certain string A is a substring of string B, do you know the
>>>> Biostrings functions to achieve this?
>>>> I see a lot of methods for biological (DNA, RNA) sequences, and they
>>>> may not apply to my series (which are definitely not from biology).
>>> It appeared to me that the function matchPattern should replace your
>>> grepl invocation that was failing. It returns a more complex structure,
>>> so you would need to determine what would be an exact replacement for
>>> grepl(...) != 1. Looks like a no-match event results in the start and
>>> end items being of length 0.
>>> > str( matchPattern("A", BString("BBB")) )
>> A couple of things from this thread.
>> To install a Bioconductor package follow directions here
>> which leads to
>> biocLite is just a wrapper around install.packages with appropriate
>> repositories defined.
>> Some Bioconductor packages are relatively mature and make relatively
>> advanced use of S4 classes, so looking at str() is not that helpful --
>> the way the user is meant to interact with the object is different from
>> the way the object is implemented. So the best bet is to look at the
>> relevant help pages
>> result = matchPattern("A", BString("BBB"))
> The above was the most surprising example for me (not being particularly
> S4-savvy). Looks like it parses as:
> `?`(class, XStringViews)
> Is that an S4 sort of extension for accessing documentation or have I
> just missed a more general method? I tried looking at the help Index for
> the "methods" package.
?"?" documents type?topic. It is more general, in that package?stats
takes one to the 'stats' topic amongst the 'package' doc-type help
pages. It relies on package authors choosing appropriate docTypes for
their man pages.
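As a small illustration of the parsing point (a sketch only; the help calls at the end assume an interactive session with the relevant topics installed), quote() shows that type?topic is an ordinary two-argument call to the `?` operator:

```r
# type?topic is parsed as a two-argument call to the `?` operator
e <- quote(class?XStringViews)

is_help_call <- identical(e[[1]], as.name("?"))  # is the called function `?`
doc_type     <- as.character(e[[2]])             # the doc-type part
topic        <- as.character(e[[3]])             # the topic part

# Opening help pages only makes sense in an interactive session
if (interactive()) {
  help("?")        # same page as ?"?"
  package?stats    # the 'stats' topic among the 'package' doc-type pages
}
```

Nothing here is specific to S4; the doc-type lookup works for whatever docTypes the package author declared.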
One S4 paradigm that can be useful is the analog of methods(class="lm"),
which is showMethods(class="XStringViews", where="package:Biostrings").
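A self-contained sketch of that analogy, which also shows the accessor-versus-slot idea. The class Span and the generic spanStart are hypothetical stand-ins invented for this example; they are not part of Biostrings:

```r
library(methods)

# Hypothetical S4 class standing in for something like XStringViews
setClass("Span", representation(start = "integer", width = "integer"))

# Hypothetical accessor generic; users call this instead of reading the slot
setGeneric("spanStart", function(x, ...) standardGeneric("spanStart"))
setMethod("spanStart", "Span", function(x, ...) x@start)

s <- new("Span", start = 5L, width = 3L)
spanStart(s)   # preferred over s@start

# S3: list the methods available for class "lm"
methods(class = "lm")

# S4: list the methods with "Span" in their signature
# (the formal argument is 'classes'; class= works through partial matching)
showMethods(classes = "Span")
```

Going through the accessor keeps user code working even if the class author later changes how the object is stored internally.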
>> and the help pages referenced there, or from which XStringViews inherits
>> and in particular
>> Rather than accessing the 'start' slot, use start(result). Vignettes are
>> used heavily in Bioconductor packages, and in particular
>> pops up a page with several relevant vignettes, e.g., 'A short
>> presentation of the basic classes...' and perhaps 'Pairwise Sequence
>> Alignment'. These are also accessible on the Bioconductor web site,
>> e.g., on the pages linked from
>> The rule of thumb hinted at below -- that an operation seems to be
>> taking longer than it should -- probably indicates that the function is
>> being invoked in an inefficient way. If the documentation is opaque then
>> definitely the place to seek additional help is on the Bioconductor
>> mailing list
>> Hope this helps.
>>> Formal class 'XStringViews' [package "Biostrings"] with 7 slots
>>> ..@ subject :Formal class 'BString' [package "Biostrings"] with 6 slots
>>> .. .. ..@ shared :Formal class 'SharedRaw' [package "IRanges"] with 2 slots
>>> .. .. .. .. ..@ xp :<externalptr>
>>> .. .. .. .. ..@ .link_to_cached_object:<environment: 0x11e0e59f8>
>>> .. .. ..@ offset : int 0
>>> .. .. ..@ length : int 3
>>> .. .. ..@ elementMetadata: NULL
>>> .. .. ..@ elementType : chr "ANY"
>>> .. .. ..@ metadata : list()
>>> ..@ start : int(0)
>>> ..@ width : int(0)
>>> ..@ NAMES : NULL
>>> ..@ elementMetadata: NULL
>>> ..@ elementType : chr "integer"
>>> ..@ metadata : list()
>>> length(matchPattern(fut_string, past_string)@start ) == 0
>>> You do need to use BString() on at least the past_string argument and
>>> maybe the fut_string as well. The Bioconductor Mailing List would have a
>>> larger audience with experience using this package, so they should
>>> probably be your next avenue for advice. I am just reading the help
>>> pages as you should be able to do. The help page
>>> help("lowlevel-matching") should probably be reviewed since there may be
>>> efficiency issues to consider as mentioned below.
>>> When dropped into your function with the BString coercion, it replicated
>>> your small example results and did not crash after a long period with
>>> your larger example, so I then terminated it and inserted a "reporter"
>>> line to monitor progress. With that reporter I got up into the 200's for
>>> count_len without error. My laptop CPU was warming up the case and I was
>>> getting sleepy so I terminated the process. (I had no way of checking
>>> for accuracy, even if I had let it proceed, since you did not offer a
>>> "correct" answer.)
>>> By the way, the construct ... grepl(. , .) != 1 ... is perhaps
>>> inefficient. It could be expressed more compactly as ... !grepl(. , .) ...
>>> which would avoid coercing the logical result to numeric.
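A minimal base R sketch of that comparison. The two strings are made-up stand-ins for the poster's data, and fixed = TRUE is an assumption added here so that the pattern is treated as a literal substring rather than as a regular expression:

```r
past_string <- "abcabcabc"   # hypothetical example data
fut_string  <- "cab"         # hypothetical example data

# Original form: the logical result is coerced to numeric before comparing
no_match_verbose <- grepl(fut_string, past_string, fixed = TRUE) != 1

# Compact form: negate the logical directly
no_match <- !grepl(fut_string, past_string, fixed = TRUE)

# Both forms agree on whether the pattern is absent
identical(no_match_verbose, no_match)
```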
>> Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>> Location: M1-B861
>> Telephone: 206 667-2793