[Bioc-devel] Creating Biostrings DNAStringSet from Rcpp/C++

Hervé Pagès hpages at fredhutch.org
Tue Jan 5 20:12:07 CET 2016


Hi Sean,

The advantage of using choice #2 is better performance (memory
and speed).

In addition to Michael's pointer, there is this post

   https://stat.ethz.ch/pipermail/r-devel/2013-June/066890.html

on R-devel that provides some details and pseudo-code. Unfortunately
a few things have changed since 2013:

   1) You need to replace:

        cachedXVectorList        with   XVectorList_holder
        cachedCharSeq            with   Chars_holder
        cache_XVectorList        with   hold_XVectorList
        get_cachedXRawList_elt   with   get_elt_from_XRawList_holder

   2) For readability you probably want to rename the variables
      accordingly. So replace:

        cached_ans               with   ans_holder
        cached_ans_elt           with   ans_elt_older

   3) The 'seq' member of a Chars_holder struct was renamed 'ptr'.
      So replace:

        cached_ans_elt->seq      with   ans_elt_older->ptr

Also since you want the result in a DNAStringSet object, you'll need
to replace any occurrence of BStringSet with DNAStringSet. One extra
difficulty with DNAStringSet objects is that the incoming string data
needs to be encoded before it's stored in the object. This encoding
can be performed by replacing

   memcpy((char *) cached_ans_elt->seq, foo,
          INTEGER(width)[i] * sizeof(char));

with

   Ocopy_bytes_to_i1i2_with_lkup(0, ans_elt_holder.length - 1,
         (char *) ans_elt_holder.ptr, ans_elt_holder.length,
         foo, INTEGER(width)[i] * sizeof(char),
         INTEGER(lkup), LENGTH(lkup));

where 'lkup' is an integer vector that you need to prepare at the
R level with:

   lkup <- get_seqtype_conversion_lookup("B", "DNA")

and that you then need to pass all the way down to the
read_text_file_in_DNAStringSet() function as an extra argument:

SEXP read_text_file_in_DNAStringSet(FILE *FN, SEXP lkup)
{
     ...
}

Let me know if you need further help with this.

H.

On 01/05/2016 05:31 AM, Michael Lawrence wrote:
> You're going to want to use the Biostrings C API. You can see
> TwoBitFile_read() in src/twoBit.c in rtracklayer for an example. It's
> not trivial.
>
> Michael
>
>
> On Tue, Jan 5, 2016 at 5:21 AM, Sean Davis <seandavi at gmail.com> wrote:
>> Forgive the really naive questions, but my C/C++ and Biostrings skills are
>> pretty rough.  I have been playing with Rcpp to interface to an external
>> library for reading sequence formats.  I am reading DNA sequences
>> one-at-a-time and would like to package a set of those up as a DNAStringSet.
>> I have at least two options:
>>
>> 1) Return a character vector back to R and then create a DNAStringSet in R.
>> This works now and works fine.
>> 2) Create the DNAStringSet in the cpp code.
>>
>> Are there advantages to using choice #2?  If so, where would I start to get
>> the necessary Rcpp/Biostrings glue in place? I know how to make new S4
>> objects in Rcpp, so it is really a question of knowing what class hierarchy
>> I need to instantiate and, roughly, how.  Any pointers or code examples?
>>
>> Thanks,
>> Sean
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list