[Bioc-devel] Creating Biostrings DNAStringSet from Rcpp/C++
hpages at fredhutch.org
Tue Jan 5 20:12:07 CET 2016
The advantage of using choice #2 is better performance (memory
In addition to Michael's pointer, there is this post
on R-devel that provides some details and pseudo-code. Unfortunately
a few things have changed since 2013:
1) You need to replace:
cachedXVectorList with XVectorList_holder
cachedCharSeq with Chars_holder
cache_XVectorList with hold_XVectorList
get_cachedXRawList_elt with get_elt_from_XRawList_holder
2) For readability you probably want to rename the variables
accordingly. So replace:
cached_ans with ans_holder
cached_ans_elt with ans_elt_older
3) The 'seq' member of a Chars_holder struct was renamed 'ptr'.
cached_ans_elt->seq with ans_elt_older->ptr
Also since you want the result in a DNAStringSet object, you'll need
to replace any occurrence of BStringSet with DNAStringSet. One extra
difficulty with DNAStringSet objects is that the incoming string data
needs to be encoded before it's stored in the object. This encoding
can be performed by replacing
memcpy((char *) cached_ans_elt->seq, foo,
INTEGER(width)[i] * sizeof(char));
Ocopy_bytes_to_i1i2_with_lkup(0, ans_elt_holder.length - 1,
(char *) ans_elt_holder.ptr, ans_elt_holder.length,
foo, INTEGER(width)[i] * sizeof(char),
where 'lkup' is an integer vector that you need to prepare at the
R level with:
lkup <- get_seqtype_conversion_lookup("B", "DNA")
and that you then need to pass all the way down to the
read_text_file_in_DNAStringSet() function as an extra argument:
SEXP read_text_file_in_DNAStringSet(FILE *FN, SEXP lkup)
Let me know if you need further help with this.
On 01/05/2016 05:31 AM, Michael Lawrence wrote:
> You're going to want to use the Biostrings C API. You can see
> TwoBitFile_read() in src/twoBit.c in rtracklayer for an example. It's
> not trivial.
> On Tue, Jan 5, 2016 at 5:21 AM, Sean Davis <seandavi at gmail.com> wrote:
>> Forgive the really naive questions, but my C/C++ and Biostrings skills are
>> pretty rough. I have been playing with Rcpp to interface to an external
>> library for reading sequence formats. I am reading DNA sequences
>> one-at-a-time and would like to package a set of those up as a DNAStringSet.
>> I have at least two options:
>> 1) Return a character vector back to R and then create a DNAStringSet in R.
>> This works now and works fine.
>> 2) Create the DNAStringSet in the cpp code.
>> Are there advantages to using choice #2? If so, where would I start to get
>> the necessary Rcpp/Biostrings glue in place? I know how to make new S4
>> objects in Rcpp, so it is really a question of knowing what class hierarchy
>> I need to instantiate and, roughly, how. Any pointers or code examples?
>> Bioc-devel at r-project.org mailing list
> Bioc-devel at r-project.org mailing list
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel