[Bioc-devel] Feasibility of Parallel Extraction of Matches with extractAllMatches
Hervé Pagès
hpages at fredhutch.org
Wed Nov 16 19:58:27 CET 2016
Hi Dario,
On 11/16/2016 02:00 AM, Dario Strbenac wrote:
> Good day,
>
> I'd like to request that extractAllMatches works when subject is an XStringSet. The function could check that subject and mindex have the same length and then process them in parallel. Currently, the following example isn't immediately possible.
>
> words <- BStringSet(c("xxGOATzz", "xxMOATzz", "xxNOTEzz"))
> matches <- vmatchPattern("GOAT", words, max.mismatch = 1)
> similarWords <- extractAllMatches(words, matches) # Not possible.
This is not possible because extractAllMatches() returns a Views object,
and a Views object can only represent views defined on a *single* subject.
extractAllMatches() is old and predates extractAt(), which can be used
for this. See the man page for extractAt/replaceAt (?extractAt) for more
information. In particular, the "(C) ADVANCED EXAMPLES" section of the
man page shows how to use extractAt() to extract the matches returned by
vmatchPattern().
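For the example above, the extractAt() approach could look roughly like
the sketch below. It assumes that extractAt() accepts the MIndex object
returned by vmatchPattern() directly as its 'at' argument; the variable
names are taken from the original example, and the man page remains the
authoritative reference.

```r
library(Biostrings)

words <- BStringSet(c("xxGOATzz", "xxMOATzz", "xxNOTEzz"))

# One set of match ranges per word (an MIndex object, parallel to 'words').
matches <- vmatchPattern("GOAT", words, max.mismatch = 1)

# Assumption: the MIndex can be passed as 'at' without prior coercion.
# The result is a list-like object with one element per word, so words
# without a match are kept as empty elements rather than dropped.
similarWords <- extractAt(words, matches)

# Number of matches extracted from each word.
elementNROWS(similarWords)
```

Because the result has the same length as 'words', the i-th element
always holds the matches found in the i-th word.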
>
> Could that be implemented for the next release of Biostrings? Or, perhaps it can be deprecated since it duplicates the functionality of substr?
>
>> substr(words, start(matches), end(matches))
> [1] "GOAT" "MOAT" NA
Two issues with substr():

  (1) It will be quite inefficient if there are millions of matches
      to extract, since it actually generates a copy of the matches.
      extractAllMatches() and extractAt() don't have this problem
      because they don't generate copies of the original sequence
      data. This holds even for extractAt(): the DNAStringSetList
      object it returns actually contains views on the original
      DNAStringSet subject. These views are Biostrings internal
      business, though, and not something that can easily be seen
      unless you look at the internals of the DNAStringSet and
      DNAStringSetList objects. (The same applies to the other
      XStringSet types, such as the BStringSet in your example.)

  (2) substr() returns a "flat" vector, so in general the mapping
      between the matches and the individual sequences in the
      DNAStringSet subject is lost.
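To illustrate point (2), here is a small sketch of how the mapping
between matches and sequences can be recovered from the list-like result
of extractAt() even after flattening it. The names 'extracted',
'word_index' and 'match_table' are made up for this illustration, and
passing the MIndex object directly to extractAt() is an assumption.

```r
library(Biostrings)

words <- BStringSet(c("xxGOATzz", "xxMOATzz", "xxNOTEzz"))
matches <- vmatchPattern("GOAT", words, max.mismatch = 1)

# List-like result: one element per word, possibly empty.
extracted <- extractAt(words, matches)

# Flatten the matches, but remember which word each one came from by
# repeating each word's index once per match found in that word.
flat <- unlist(extracted)
word_index <- rep(seq_along(words), elementNROWS(extracted))

match_table <- data.frame(
    word = word_index,
    match = unname(as.character(flat)),
    stringsAsFactors = FALSE
)
match_table
```

A word with several matches contributes several rows, and a word with no
match contributes none, so the table stays unambiguous in both cases.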
>
> Also, the expected subsetting fails for MIndex objects.
>
>> class(matches)
> [1] "ByPos_MIndex"
>> length(matches)
> [1] 3
>> length(matches[1])
> [1] 3
This should be addressed in Biostrings 2.43.1. Thanks!
H.
>
> --------------------------------------
> Dario Strbenac
> University of Sydney
> Camperdown NSW 2050
> Australia
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319