[Bioc-devel] C library or C package API for regular expressions

Tue Jan 26 12:18:25 CET 2016

On 25.1.2016 at 23:34, Hervé Pagès wrote:
> Hi Jiri,
>
> On 01/25/2016 09:40 AM, Jiří Hon wrote:
>> Hi Martin
>>
>> On 25.1.2016 at 13:08, Morgan, Martin wrote:
>>> There is discussion at
>>>
>>> http://stackoverflow.com/questions/23556205/using-boost-regex-with-rcpp
>>>
>>> pointing to
>>>
>>> http://gallery.rcpp.org/articles/boost-regular-expressions/
>>>
>>> There is a Bioconductor example that bundles the regex library in
>>>  flowCore/src/
>>>
>>> https://github.com/Bioconductor-mirror/flowCore
>>>
>>> A second example is in the mzR package.
>>
>> Thank you for pointing me to the flowCore and mzR packages, these
>> examples are really helpful.
>>
>>> A real question is, do you really need this functionality at the C
>>> level?
>>
>> I think it's unavoidable in my case for performance reasons. I'm trying
>> to detect all possible overlapping motifs in DNA composed of
>> elements matching some regular expression.
>
> I think Martin's question is: are you sure you need this at the C
> level? What makes you think that calling a regex engine from C will
> perform better than calling it from R?
>
> Note that using a regex for finding motifs in a DNA sequence has 2
> fundamental problems:
>
> (1) It doesn't always find all the matches. For example if 2 matches
>      are overlapping, it only returns the 1st of the 2 matches:
>
>    > library(Biostrings)
>
>    > matchPattern("ATAAT", "CCATAATAATGATAAT")
>      Views on a 16-letter BString subject
>    subject: CCATAATAATGATAAT
>    views:
>        start end width
>    [1]     3   7     5 [ATAAT]
>    [2]     6  10     5 [ATAAT]
>    [3]    12  16     5 [ATAAT]
>
>    > gregexpr("ATAAT", "CCATAATAATGATAAT")[[1]]
>    [1]  3 12
>    attr(,"match.length")
>    [1] 5 5
>    attr(,"useBytes")
>    [1] TRUE
>
> (2) It's inefficient on a long DNA sequence:
>
>    > library(BSgenome.Hsapiens.UCSC.hg19)
>    > chr1 <- BSgenome.Hsapiens.UCSC.hg19$chr1
>    > system.time(m1 <- matchPattern("ATAAT", chr1))
>       user  system elapsed
>      0.946   0.000   0.940
>    > chr1c <- as.character(chr1)
>    > system.time(m2 <- gregexpr("ATAAT", chr1c)[[1]])
>       user  system elapsed
>      4.109   0.000   4.109
>
> This was actually the very first motivating use case for developing
> the Biostrings package. It's important to realize that using the regex
> engine at the C level wouldn't make much difference.
>
> matchPattern() and family don't support regex though. However when
> working with DNA motifs, the motifs can often be described with IUPAC
> ambiguity letters. For example, instead of describing the motif
> with the regular expression AT(A|G|T)T(A|C)GG.G, you can describe it with
> ATDTMGGNG. Then you can use matchPattern() on this pattern and with
> fixed=FALSE to find all the matches. Additionally you can use the
> 'max.mismatch' and/or 'with.indels' arguments to allow a small number
> of mismatches and/or indels. See ?matchPattern for more information
> and examples.
>
> Of course this has its own limitations: you can only do this for a
> subclass of regular expressions. For example regular expressions that
> use * or + to allow for repetitions cannot be replaced by a sequence
> with just IUPAC codes, so the string matching tools in Biostrings
> cannot be used in that case.
>
> Cheers,
> H.
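
To make the IUPAC/regex correspondence above concrete: each ambiguity
letter stands for a fixed set of bases, so a pattern such as ATDTMGGNG can
be expanded mechanically into a regex built from bracket expressions. The
following is only a minimal C sketch; the names iupac_bases and
iupac_to_regex are hypothetical and are not part of Biostrings or of any
existing library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bases denoted by one IUPAC nucleotide code, or NULL if unknown.
 * (Hypothetical helper, upper-case codes only.) */
static const char *iupac_bases(char code)
{
    switch (code) {
    case 'A': return "A";
    case 'C': return "C";
    case 'G': return "G";
    case 'T': return "T";
    case 'R': return "AG";
    case 'Y': return "CT";
    case 'S': return "CG";
    case 'W': return "AT";
    case 'K': return "GT";
    case 'M': return "AC";
    case 'B': return "CGT";
    case 'D': return "AGT";
    case 'H': return "ACT";
    case 'V': return "ACG";
    case 'N': return "ACGT";
    default:  return NULL;
    }
}

/* Expand an IUPAC pattern into a regex of literals and bracket
 * expressions, e.g. "ATDTMGGNG" -> "AT[AGT]T[AC]GG[ACGT]G".
 * Returns 0 on success, -1 on an unknown code or a too-small buffer. */
static int iupac_to_regex(const char *iupac, char *out, size_t out_size)
{
    size_t used = 0;

    assert(iupac != NULL && out != NULL);
    if (out_size == 0)
        return -1;
    for (; *iupac != '\0'; iupac++) {
        const char *bases = iupac_bases(*iupac);
        size_t len, need;

        if (bases == NULL)
            return -1;
        len = strlen(bases);
        need = (len == 1) ? 1 : len + 2;   /* +2 for '[' and ']' */
        if (used + need + 1 > out_size)    /* +1 for the final '\0' */
            return -1;
        if (len > 1)
            out[used++] = '[';
        memcpy(out + used, bases, len);
        used += len;
        if (len > 1)
            out[used++] = ']';
    }
    out[used] = '\0';
    return 0;
}
```

The expansion only goes in this direction: a regex that uses * or + has no
IUPAC equivalent, which is the limitation described above.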

Thank you, Hervé, for your tips. I'm aware of the limited power of regular
expressions, but using matchPattern doesn't solve my problem. The
reason for using a regexp library at the C level is that I plan to call it
a million times (on short DNA parts), and I suppose it would be better to
avoid the calling and for-loop overhead. Therefore I wanted to get an
idea of which regex C APIs I can use, or whether a library is usually bundled.
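
To illustrate the kind of C-level loop I have in mind, here is a minimal
sketch that assumes the POSIX <regex.h> interface (regcomp, regexec,
regfree). That header is not part of standard C and may be missing on
Windows toolchains, which is one reason a library might need to be bundled.
The function names count_overlapping and scan_sequences are hypothetical.
The pattern is compiled once and reused for every sequence, and overlapping
hits are found by restarting the search one character past the start of the
previous hit.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Count matches of an already compiled regex in one subject string,
 * including overlapping ones: after each hit the search restarts one
 * character past the start of that hit. */
static size_t count_overlapping(const regex_t *re, const char *subject)
{
    regmatch_t m;
    size_t offset = 0, hits = 0;

    while (subject[offset] != '\0') {
        /* REG_NOTBOL keeps '^' from matching at a restart position. */
        int eflags = (offset > 0) ? REG_NOTBOL : 0;

        if (regexec(re, subject + offset, 1, &m, eflags) != 0)
            break;                      /* no further match */
        hits++;
        offset += (size_t) m.rm_so + 1;
    }
    return hits;
}

/* Compile the pattern once, then scan every sequence in a C loop.
 * Returns the total number of hits, or -1 if the pattern is invalid. */
static long scan_sequences(const char *pattern,
                           const char *const *seqs, size_t n_seqs)
{
    regex_t re;
    long total = 0;
    size_t i;

    assert(pattern != NULL && (seqs != NULL || n_seqs == 0));
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    for (i = 0; i < n_seqs; i++)
        total += (long) count_overlapping(&re, seqs[i]);
    regfree(&re);
    return total;
}
```

Note that this reports at most one match per start position (the one the
engine picks there), so two matches of different lengths that begin at the
same position are counted once.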

Jirka

>
>>
>>> A secondary question is that if several packages are using this
>>> functionality, then perhaps the library could be bundled separately
>>> and made available just once; zlibbioc does something like this (sort
>>> of; zlib is only needed on Windows). The flowCore and mzR maintainers
>>> (cc'd) might be a valuable resource in this regard.
>>
>> Efficient regexp algorithms seem useful to me for solving many
>> bioinformatics problems. So it would be natural to have a package with a C
>> API to the most efficient regexp libraries.
>>
>>> Martin
>>>
>>> ________________________________________
>>> From: Bioc-devel <bioc-devel-bounces at r-project.org> on behalf of
>>> Jiří Hon <xhonji01 at stud.fit.vutbr.cz>
>>> Sent: Monday, January 25, 2016 4:33 AM
>>> To: Charles Determan
>>> Cc: bioc-devel at r-project.org
>>> Subject: Re: [Bioc-devel] C library or C package API for regular expressions
>>>
>>> Hi Charles,
>>>
>>> thank you a lot for your helpful hint. There is still a thing that
>>> I'm not sure about - Boost manual says that Boost.Regex is not header
>>> only [1]. So as BH package contains only headers, I will have to
>>> bundle the Boost.Regex library into the package code anyway. Am I
>>> right?
>>>
>>> Jiri
>>>
>>> [1]
>>> http://www.boost.org/doc/libs/1_60_0/more/getting_started/unix-variants.html#header-only-libraries
>>>
>> On 23.1.2016 at 13:35, Charles Determan wrote:
>>>> Hi Jiri,
>>>>
>>>> I believe you can use the BH package. It contains most of the
>>>> Boost headers.
>>>>
>>>> Regards, Charles
>>>>
>>>> On Saturday, January 23, 2016, Jiří Hon
>>>> <xhonji01 at stud.fit.vutbr.cz> wrote:
>>>>
>>>>> Dear package developers,
>>>>>
>>>>> I would like to ask you for advice. What is the most
>>>>> seamless way to use regular expressions in the C/C++ code of an
>>>>> R/Bioconductor package? Is it allowed to bundle a C/C++
>>>>> library for that (like PCRE or Boost.Regex)? Or is there an existing
>>>>> C API of some package I can depend on and import?
>>>>>
>>>>> Thank you a lot for your attention and please have a nice day :)
>>>>>
>>>>> Jiri Hon
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>
>>>
>>>
>>
>