[Bioc-devel] Feasibility of Parallel Extraction of Matches with extractAllMatches

Hervé Pagès hpages at fredhutch.org
Wed Nov 16 19:58:27 CET 2016


Hi Dario,

On 11/16/2016 02:00 AM, Dario Strbenac wrote:
> Good day,
>
> I'd like to request that extractAllMatches works when subject is an XStringSet. The function could check that subject and mindex have the same length and then process them in parallel. Currently, the following example isn't immediately possible.
>
> words <- BStringSet(c("xxGOATzz", "xxMOATzz", "xxNOTEzz"))
> matches <- vmatchPattern("GOAT", words, max.mismatch = 1)
> similarWords <- extractAllMatches(words, matches) # Not possible.

Not possible because extractAllMatches() returns a Views object and
a Views object can only represent views defined on a *single* subject.
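
To illustrate the constraint, here is a minimal sketch (the sequence and coordinates are made up for illustration, reusing one word from your example). Every view in a Views object refers back to the same single subject:

```r
library(Biostrings)

# A Views object is built on exactly one subject sequence.
subj <- BString("xxGOATzz")
v <- Views(subj, start = 3, end = 6)

# All views share this one subject; there is no slot for a second one.
viewedSubject <- subject(v)
viewText <- as.character(v)
```

This is why a set of matches spread over several sequences in an XStringSet cannot be returned as one Views object.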

extractAllMatches() is old and predates extractAt(), which can be used
for this. See the man page for extractAt/replaceAt (?extractAt) for
more information. In particular, the "(C) ADVANCED EXAMPLES" section
of the man page shows how to use extractAt() to extract the matches
returned by vmatchPattern().
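
Applied to your example, a rough sketch could look like the following. The explicit coercion of the MIndex object to a CompressedIRangesList is an assumption on my part to keep the example explicit, and the variable names `at` and `hitsPerWord` are only illustrative; the man page remains the reference.

```r
library(Biostrings)

words <- BStringSet(c("xxGOATzz", "xxMOATzz", "xxNOTEzz"))
matches <- vmatchPattern("GOAT", words, max.mismatch = 1)

# Assumption: turn the MIndex into a list of ranges, one list element
# per sequence in 'words', before handing it to extractAt().
at <- as(matches, "CompressedIRangesList")

# One list element per input sequence, so the grouping of the matches
# by sequence is preserved (sequences without a match get an empty element).
similarWords <- extractAt(words, at)

# Number of matches found in each word.
hitsPerWord <- elementNROWS(similarWords)
```

The result is a list-like object parallel to `words`, so `similarWords[[i]]` holds the matches found in `words[[i]]`.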

>
> Could that be implemented for the next release of Biostrings? Or, perhaps it can be deprecated since it duplicates the functionality of substr?
>
>> substr(words, start(matches), end(matches))
> [1] "GOAT" "MOAT" NA

Two issues with substr():

   (1) It will be quite inefficient if there are millions of matches
       to extract, since it actually generates a copy of each match.
       extractAllMatches() and extractAt() don't have this problem
       because they don't copy the original sequence data. This is
       true even for extractAt(): the DNAStringSetList object it
       returns actually contains views on the original DNAStringSet
       subject. These views are Biostrings internal business, though,
       and not something you can easily see unless you look at the
       internals of the DNAStringSet and DNAStringSetList objects.

   (2) substr() returns a "flat" vector so in general the mapping
       between the matches and the individual sequences in the
       DNAStringSet subject is lost.
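
Point (2) can be seen with plain character vectors. This is only a sketch: the match positions below are made up by hand (they are not the output of vmatchPattern()), and the first word was extended so that it contains two matches.

```r
# Plain character vectors; the first word contains two matches.
words <- c("xxGOATzzGOAT", "xxMOATzz", "xxNOTEzz")

# Hypothetical start positions of the width-4 matches, grouped by word.
starts <- list(c(3L, 9L), 3L, integer(0))
n <- lengths(starts)

# Flat extraction: each word is repeated once per match, and the result
# no longer says which word a given match came from.
flat <- substring(rep(words, n), unlist(starts), unlist(starts) + 3L)

# Recovering the grouping requires carrying the per-word counts around.
wordIndex <- factor(rep(seq_along(words), n), levels = seq_along(words))
grouped <- split(flat, wordIndex)
```

Here `flat` has three elements for three input words only by coincidence: two come from the first word and none from the last. A list-like result such as the one returned by extractAt() keeps that grouping without the extra bookkeeping.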

>
> Also, the expected subsetting fails for MIndex objects.
>
>> class(matches)
> [1] "ByPos_MIndex"
>> length(matches)
> [1] 3
>> length(matches[1])
> [1] 3

This should be addressed in Biostrings 2.43.1. Thanks!

H.

>
> --------------------------------------
> Dario Strbenac
> University of Sydney
> Camperdown NSW 2050
> Australia
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


