[Bioc-devel] C library or C package API for regular expressions

Hervé Pagès hpages at fredhutch.org
Mon Jan 25 23:34:59 CET 2016


Hi Jiri,

On 01/25/2016 09:40 AM, Jiří Hon wrote:
> Hi Martin
>
> On 25.1.2016 at 13:08, Morgan, Martin wrote:
>> There is discussion at
>>
>> http://stackoverflow.com/questions/23556205/using-boost-regex-with-rcpp
>>
>> pointing to
>>
>> http://gallery.rcpp.org/articles/boost-regular-expressions/
>>
>> There is a Bioconductor example that bundles the regex library in
>>  flowCore/src/
>>
>> https://github.com/Bioconductor-mirror/flowCore
>>
>> A second example is in the mzR package.
>
> Thank you for pointing me to the flowCore and mzR packages, these
> examples are really helpful.
>
>> A real question is, do you really need this functionality at the C
>> level?
>
> I think it's unavoidable in my case for performance reasons. I'm trying
> to detect all possible overlapping motifs in DNA composed of
> elements matching some regular expression.

I think Martin's question is: are you sure you need this at the C
level? What makes you think that calling a regex engine from C will
perform better than calling it from R?

Note that using a regex for finding motifs in a DNA sequence has 2
fundamental problems:

(1) It doesn't always find all the matches. For example if 2 matches
     are overlapping, it only returns the 1st of the 2 matches:

   > library(Biostrings)

   > matchPattern("ATAAT", "CCATAATAATGATAAT")
     Views on a 16-letter BString subject
   subject: CCATAATAATGATAAT
   views:
       start end width
   [1]     3   7     5 [ATAAT]
   [2]     6  10     5 [ATAAT]
   [3]    12  16     5 [ATAAT]

   > gregexpr("ATAAT", "CCATAATAATGATAAT")[[1]]
   [1]  3 12
   attr(,"match.length")
   [1] 5 5
   attr(,"useBytes")
   [1] TRUE

(2) It's inefficient on a long DNA sequence:

   > library(BSgenome.Hsapiens.UCSC.hg19)
   > chr1 <- BSgenome.Hsapiens.UCSC.hg19$chr1
   > system.time(m1 <- matchPattern("ATAAT", chr1))
      user  system elapsed
     0.946   0.000   0.940
   > chr1c <- as.character(chr1)
   > system.time(m2 <- gregexpr("ATAAT", chr1c)[[1]])
      user  system elapsed
     4.109   0.000   4.109

This was actually the very first motivating use case for developing
the Biostrings package. It's important to realize that using the regex
engine at the C level wouldn't make much difference.
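
As an aside on point (1): the overlap limitation of a plain regex
search can be worked around by wrapping the pattern in a zero-width
lookahead and using the PCRE engine (perl = TRUE). This is only a
minimal sketch in base R; find_overlapping() is a hypothetical helper
name, it reports start positions only, and it does nothing about the
speed issue in point (2).

```r
# Base R only. A zero-width lookahead consumes no characters, so the
# engine advances one position at a time and can report overlapping hits.
# find_overlapping() is a hypothetical helper, not part of any package.
find_overlapping <- function(pattern, subject) {
  hits <- gregexpr(paste0("(?=", pattern, ")"), subject, perl = TRUE)[[1]]
  starts <- as.integer(hits)              # drop the match.length attributes
  if (length(starts) == 1L && starts[1] == -1L) integer(0) else starts
}

starts <- find_overlapping("ATAAT", "CCATAATAATGATAAT")
print(starts)   # 3 6 12, the same start positions matchPattern() reports
```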

matchPattern() and family don't support regular expressions. However,
when working with DNA motifs, the motifs can often be described with
IUPAC ambiguity letters. For example, instead of describing a motif
with the regular expression AT(A|G|T)T(A|C)GG.G, you can describe it with
ATDTMGGNG. Then you can call matchPattern() on this pattern with
fixed=FALSE to find all the matches. Additionally, you can use the
'max.mismatch' and/or 'with.indels' arguments to allow a small number
of mismatches and/or indels. See ?matchPattern for more information
and examples.
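
To make the correspondence between the two notations concrete, here is
a rough base R sketch that expands an IUPAC motif into an equivalent
regular expression. iupac_to_regex() is a hypothetical helper (it is
not part of Biostrings), and it assumes that N stands for A, C, G or T,
so the result agrees with the "." version only on sequences made of
those four letters.

```r
# IUPAC nucleotide ambiguity codes and the bases each one stands for.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           R = "AG", Y = "CT", S = "CG", W = "AT", K = "GT", M = "AC",
           B = "CGT", D = "AGT", H = "ACT", V = "ACG", N = "ACGT")

# Hypothetical helper: turn each ambiguity letter into a character class.
iupac_to_regex <- function(motif) {
  codes <- strsplit(motif, "", fixed = TRUE)[[1]]
  stopifnot(all(codes %in% names(iupac)))
  bases <- iupac[codes]
  paste0(ifelse(nchar(bases) == 1L, bases, paste0("[", bases, "]")),
         collapse = "")
}

rx <- iupac_to_regex("ATDTMGGNG")
print(rx)                            # "AT[AGT]T[AC]GG[ACGT]G"
print(grepl(rx, "CCATGTCGGTGAA"))    # TRUE: ATGTCGGTG starts at position 3

# With Biostrings, no expansion is needed; the motif is passed as is:
#   matchPattern("ATDTMGGNG", subject, fixed = FALSE)
```

Every position of such a motif matches exactly one letter, which is
why the pattern has a fixed width.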

Of course this has its own limitations: you can only do this for a
subset of regular expressions. For example, regular expressions that
use * or + to allow for repetitions cannot be replaced by a sequence
with just IUPAC codes, so the string matching tools in Biostrings
cannot be used in that case.

Cheers,
H.

>
>> A secondary question is that if several packages are using this
>> functionality, then perhaps the library could be bundled separately
>> and made available just once; zlibbioc does something like this (sort
>> of; zlib is only needed on Windows). The flowCore and mzR maintainers
>> (cc'd) might be a valuable resource in this regard.
>
> Efficient regexp algorithms seem useful to me for solving many
> bioinformatic problems. So it would be natural to have a package with a C
> API to the most efficient regexp libraries.
>
>> Martin
>>
>> ________________________________________
>> From: Bioc-devel <bioc-devel-bounces at r-project.org> on behalf of
>> Jiří Hon <xhonji01 at stud.fit.vutbr.cz>
>> Sent: Monday, January 25, 2016 4:33 AM
>> To: Charles Determan
>> Cc: bioc-devel at r-project.org
>> Subject: Re: [Bioc-devel] C library or C package API for regular expressions
>>
>> Hi Charles,
>>
>> thank you a lot for your helpful hint. There is still a thing that
>> I'm not sure about - Boost manual says that Boost.Regex is not header
>> only [1]. So as BH package contains only headers, I will have to
>> bundle the Boost.Regex library into the package code anyway. Am I
>> right?
>>
>> Jiri
>>
>> [1]
>> http://www.boost.org/doc/libs/1_60_0/more/getting_started/unix-variants.html#header-only-libraries
>>
>>
> On 23.1.2016 at 13:35, Charles Determan wrote:
>>> Hi Jiri,
>>>
>>> I believe you can use the BH package. It contains most of the
>>> Boost headers.
>>>
>>> Regards, Charles
>>>
>>> On Saturday, January 23, 2016, Jiří Hon
>>> <xhonji01 at stud.fit.vutbr.cz> wrote:
>>>
>>>> Dear package developers,
>>>>
>>>> I would like to ask you for advice. Please, what is the most
>>>> seamless way to use regular expressions in C/C++ code of
>>>> R/Bioconductor package? Is it allowed to bundle some C/C++
>>>> library for that (like PCRE or Boost.Regex)? Or is there existing
>>>> C API of some package I can depend on and import?
>>>>
>>>> Thank you a lot for your attention and please have a nice day :)
>>>>
>>>> Jiri Hon
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


