[Bioc-sig-seq] match, vmatch

Hervé Pagès hpages at fhcrc.org
Thu Aug 6 22:24:46 CEST 2009


Hi Jim,

I guess the subject of your email refers to the matchPattern/vmatchPattern
and matchPDict/vmatchPDict functions defined in the Biostrings package
(something I suspected from the subject line but could only confirm
after reading almost the entire email -- you never actually mention
Biostrings).

James Bullard wrote:
> I suppose I could go digging through the code and determine why, but I 
> am curious why there need be a match and a vmatch. This seems to me to 
> be one of the glaring advantages of R's system of generics and multiple 
> dispatch, i.e. that you can dispatch on both pattern and subject and 
> therefore no need to have the two different names match and vmatch which 
> pollute the namespace.

I'll try an analogy with get()/mget(). Some aspects of the strsplit()
function are also enlightening.

The main difference between get() and mget() is that one deals with
a single symbol and the other one deals with multiple symbols.
Same thing with matchPattern() and vmatchPattern(): one deals with a
single sequence subject and the other one with a multiple sequence
subject.

There is a simple (metacircular) relationship between get()/mget():

    o get(x, envir) is equivalent to mget(x, envir)[[1]]

    o mget(x, envir) is equivalent to lapply(x, get, envir)

So, strictly speaking, since each of them can be defined using the
other, we don't need both. But having both is convenient for the user,
even if that introduces a little bit of redundancy.
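
To make the get()/mget() relationship concrete, here is a minimal base R sketch. The environment 'e' and the symbols 'a' and 'b' are made up for illustration. Note that mget() returns a named list, so the equivalence with lapply() holds only up to names.

```r
# A throwaway environment with two illustrative symbols.
e <- new.env()
assign("a", 1, envir = e)
assign("b", "two", envir = e)

# Single symbol: get() gives the same value as the first element of mget().
same_single <- identical(get("a", envir = e), mget("a", envir = e)[[1]])

# Multiple symbols: mget() gives the same values as looping with get().
syms <- c("a", "b")
via_mget   <- mget(syms, envir = e)          # named list
via_lapply <- lapply(syms, get, envir = e)   # unnamed list
same_multi <- identical(unname(via_mget), via_lapply)

print(c(same_single, same_multi))
```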

I guess the same could be argued about matchPattern/vmatchPattern. There
is the same kind of (metacircular) relationship between them:

   o If 'subject' is an XString object (i.e. a single sequence), then
     matchPattern(pattern, subject) is equivalent to
     Views(subject, vmatchPattern(pattern, as(subject, "XStringSet"))[[1]])

   o If 'subject' is an XStringSet object (i.e. a set of sequences), then
     vmatchPattern(pattern, subject) does something similar to
     lapply(subject, matchPattern, pattern=pattern), except that the former
     returns an MIndex object and the latter a list of Views objects.
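
The two bullets above can be sketched roughly as follows. This assumes the Biostrings package is installed; the sequences and the pattern are invented for the example, and DNAStringSet(single) is used here as one way to turn a single sequence into a set of length 1.

```r
library(Biostrings)

pattern <- "ACG"
single  <- DNAString("ACGTACGT")                            # one sequence
set     <- DNAStringSet(c(s1 = "ACGTACGT", s2 = "TTACG"))   # a set of sequences

# Single subject: the result is a Views object on 'single'.
m_single <- matchPattern(pattern, single)

# Multiple subjects: the result is an MIndex, one element per sequence.
m_set <- vmatchPattern(pattern, set)

# Wrapping the single sequence in a set of length 1 and taking the first
# element of the MIndex recovers the same match locations.
m_wrapped   <- vmatchPattern(pattern, DNAStringSet(single))[[1]]
same_starts <- identical(start(m_single), start(m_wrapped))

# Looping with matchPattern() gives a plain list of Views objects instead
# of an MIndex.
m_loop <- lapply(seq_along(set), function(i) matchPattern(pattern, set[[i]]))
```

The class of the result is the visible difference: class(m_set) and class(m_loop) will not be the same, which is the reason given below for keeping the two functions separate.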

So here too I would tend to think that it's convenient to have both.
And since they return different types of objects, I don't really like
the idea of putting them under the same generic. I generally don't like
the idea of having a function that will return an object of a different
class, depending on whether the length of the input is 1 or not.

See for example the strsplit() function. Even when the input is of length
1 (a very common use case), it will return a list (of length 1). So I
end up doing a lot of strsplit(x, split)[[1]]. But I'm ok with it. It's
really a small inconvenience and I certainly would not like to see
strsplit() do that simplification for me i.e. to treat the "length 1"
case as a special case.
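
A short base R illustration of the strsplit() behaviour described above (the input string is arbitrary):

```r
x <- "a,b,c"

# Even for an input of length 1, strsplit() returns a list (of length 1).
parts <- strsplit(x, ",")
print(is.list(parts))
print(length(parts))

# Hence the usual [[1]] to get at the character vector itself.
fields <- strsplit(x, ",")[[1]]
print(fields)
```

The return type stays the same whatever the length of the input, so code written for one string keeps working when it is later handed a vector of strings.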

Another (more subtle) reason for having matchPattern/vmatchPattern
separated is to disambiguate what must be returned when the subject
is an XStringViews object. In that case matchPattern() treats this
subject as a single sequence and only returns the matches that occur
in the views defined on it. The match locations that are returned (in
another XStringViews object -- the input and output are now both
XStringViews objects that share the same subject) are relative to
the underlying sequence of the input.
vmatchPattern() does something very different: it treats each view
as a separate subsequence and returns the MIndex object containing
the mapping between the views and the matches that were found in
the corresponding view. Also now the locations of the matches
are relative to the view they belong to.
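
A rough sketch of that difference, assuming Biostrings is installed. The sequence and the two views are hypothetical, and the comments restate the behaviour described above rather than checked output.

```r
library(Biostrings)

subject <- DNAString("ACGTTTACGTTTACG")

# Two illustrative views on the sequence: positions 1-5 and 11-15.
v <- Views(subject, start = c(1, 11), end = c(5, 15))

# matchPattern() treats 'v' as one sequence restricted to its views.
# Match locations are relative to the underlying sequence.
m <- matchPattern("ACG", v)

# vmatchPattern() treats each view as a separate subsequence.
# Match locations are relative to the view they were found in.
vm <- vmatchPattern("ACG", v)

start(m)          # positions in 'subject'
start(vm[[2]])    # positions within the second view
```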

> 
> Out of curiosity I am wondering why this had to be done. Additionally, 
> the help pages leave me a little less than satisfied in regards to why 
> there is a matchPDict and matchPattern;

No doubt that the man pages in Biostrings can be improved, and suggestions
are welcome. But I'm not sure they are the appropriate place for
discussing software design. When writing a man page I like to stick to
what the function does and how to use it.

> It seems again that the use case 
> is the same, i.e. match some stuff.

As are the match(), charmatch(), pmatch(), match.arg(), match.fun(),
grep(), grepl(), regexpr(), gregexpr() functions (and probably more)
in the base package. Also in some way, the pairwiseAlignment() function
in Biostrings could be seen as a tool to "match some stuff".
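
To illustrate how differently some of these base functions "match some stuff", here is a small sketch with an arbitrary two-element table. Each one has its own semantics and return type, which is why they are not methods of a single generic.

```r
tbl <- c("mean", "median")

r_match     <- match("mean", tbl)       # exact match only: 1
r_pmatch    <- pmatch("mea", tbl)       # unique partial match: 1
r_charmatch <- charmatch("me", tbl)     # ambiguous partial match: 0
r_grepl     <- grepl("me", tbl)         # regex match per element: TRUE TRUE
r_matcharg  <- match.arg("med", tbl)    # partial match, returns "median"

print(list(r_match, r_pmatch, r_charmatch, r_grepl, r_matcharg))
```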

The generic/method model has its downsides too. For example it doesn't
make method documentation easy. If you want your generic to be broadly
reusable, then you can't really make any assumption about what its
arguments should be or what kind of thing it will return. So you have
to make the signature as minimalist as possible e.g. 1 named arg 'x'
followed by 3 dots, and then leave the responsibility to specify and
document the extra arguments to the individual methods. Will that make
life easier for the end user? How does that scale when the documentation
for each method becomes so big that it becomes impossible to document
all the methods for this generic in a single man page? And if you split
the man pages, then how many users will find the man page for the method
they want to use?
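
Here is a minimal sketch of what such a minimalist generic looks like with S4. The generic 'describe' and its arguments 'digits' and 'upper' are invented for the example and are not part of Biostrings.

```r
library(methods)

# The generic only commits to 'x' and '...'.
setGeneric("describe", function(x, ...) standardGeneric("describe"))

# Each method adds (and has to document) its own extra arguments.
setMethod("describe", "numeric", function(x, digits = 2, ...) {
  paste("number:", round(x, digits))
})

setMethod("describe", "character", function(x, upper = FALSE, ...) {
  if (upper) toupper(x) else x
})

out_num <- describe(3.14159, digits = 1)
out_chr <- describe("abc", upper = TRUE)
print(c(out_num, out_chr))
```

Nothing in the generic's signature tells the user that 'digits' or 'upper' exist, or what each method returns. That information lives only in the documentation of the individual methods.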

> I can fully understand that the 
> speed of these things might be at the heart of these design choices, but 
> that seems unfortunate, and again I am wondering if dispatch could be 
> used so that we don't have so many generics to sort through.
> 
> So my main question is why isn't dispatch being used to choose which 
> matchPattern to call and then just have that be the single entry point 
> to fast string matching.

Just for the record, here are some stats:

   Generic:                        Nb of methods
   matchPattern / vmatchPattern    5 / 7
   countPattern / vcountPattern    6 / 5
   matchPDict / vmatchPDict        4 / 3
   countPDict / vcountPDict        4 / 4
   whichPDict / vwhichPDict        1 / 4

So dispatch is already used. Yes, in theory this could be pushed one step
further by putting everything under the same roof. It's just that I'm not
convinced that there is much to gain by doing this.

Cheers,
H.

> 
> thanks, jim
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


