[Bioc-devel] Getting pointers to data inside XStringSet object

Sat Dec 15 03:07:05 CET 2012

Hi Ulrich,

A C interface is provided in the Biostrings package for accessing the
string data stored in an XString or XStringSet object. This interface
serves 2 purposes:

   (1) Make it easy to access the string data from C code without
       having any knowledge of the internal representation of
       XString or XStringSet objects.
       Note that having this separation also allows changes in the
       XString or XStringSet class definitions without breaking existing
       C code, even though, recompilation will sometimes be needed.

   (2) Make it efficient to repeatedly access different parts of the
       same XString/XStringSet (a very common use case) by using
       an intermediate C structure for storing some book-keeping
       information that only needs to be computed once.

This interface is exposed to 3rd-party packages thru C header file
Biostrings/inst/include/Biostrings_interface.h, but, unfortunately,
is not documented yet. Packages that use it can be recognized by the
presence of Biostrings in the LinkingTo field of their DESCRIPTION
file. There are currently 5 BioC packages that use it: Rsamtools,
ShortRead, VariantAnnotation, DECIPHER, and rSFFreader.

Here is a brief tutorial.

The central structures/functions of this interface are:

   ** For accessing an XString:

      - The cachedCharSeq typedef: a simple C struct defined as
        follow:

          typedef struct cached_charseq {
            const char *seq;
            int length;
          } cachedCharSeq;

      - The cache_XRaw(SEXP x) function that takes an XString object
        (which is a particular type of XRaw object) and returns
        a cachedCharSeq struct.

      A typical use of this structure & function is:

        #include "Biostrings_interface.h"

        cachedCharSeq X = cache_XRaw(x);
        int x_len = X.length;  // get the length of the sequence in 'x'
        const char *x_seq = X.seq;  // get a pointer to the sequence in 'x'

      2 notes:

      The cachedCharSeq typedef and the cache_XRaw() functions
      are actually part of the IRanges C interface so your package will
      also need to be LinkingTo IRanges (like the 5 BioC packages I
      listed above).

      The sequence data in a given XString, or XStringSet object
      might be shared with other XString or XStringSet objects.
      More generally, the sequence data in a given XVector
      (which XString derives from, via XRaw) or XVectorList
      (which XStringSet derives from, via XRawList) object
      might be shared with other XString or XStringSet objects.
      This why the seq field in the cachedCharSeq struct is
      a const char * instead of a char *, so it is very important
      to not try to write there (via a cast to char *).

   ** For accessing an XStringSet:

      - The cachedXStringSet typedef: this is an opaque C struct i.e.
        you should not try to access its fields directly.

      - The cache_XStringSet(SEXP x) function that takes an XStringSet
        object and returns a cachedXStringSet struct.

      - The get_cachedXStringSet_length(const cachedXStringSet *cached_x)
        function that takes a pointer to a cachedXStringSet struct and
        returns the length of the XStringSet object as an int.

      - The get_cachedXStringSet_elt(const cachedXStringSet *cached_x, 
int i)
        function that takes a pointer to a cachedXStringSet struct and
        an int i and returns the i-th sequence of the XStringSet object
        as a cachedCharSeq struct.

      A typical use of these structures & functions is:

        // Code for walking along each sequence of an XStringSet object
        // and read 1 character at a time.

        #include "Biostrings_interface.h"

        cachedXStringSet X = cache_XStringSet(x);
        int x_len = get_cachedXStringSet_length(&X);
        for (int i = 0; i < x_len; i++) {
            cachedCharSeq X_elt = get_cachedXStringSet_elt(&X, i);
            int x_elt_len = X_elt.length;
            const char *x_elt_seq = X_elt.seq;
            for (int j = 0; j < x_elt_len; j++) {
                char c = x_elt_seq[j];
            }
        }

This was for the basics.

One complication though is that DNA and RNA sequences are encoded (and 
that should answer your 2nd question) but not as four letters per byte,
only 1 letter per byte but the letters in DNA_ALPHABET are represented
internally by a char value between 1 and 32. This translation is
summarized in the following table:

   > Biostrings:::.DNA_CODES
    A  C  G  T  M  R  W  S  Y  K  V  H  D  B  N  -  +
    1  2  4  8  3  5  9  6 10 12  7 11 13 14 15 16 32

   > Biostrings:::.RNA_CODES
    A  C  G  U  M  R  W  S  Y  K  V  H  D  B  N  -  +
    1  2  4  8  3  5  9  6 10 12  7 11 13 14 15 16 32

Note that the codes for the IUPAC ambiguity letters are based on
the codes of the multiple bases that they represent (e.g. D is
A + G + T). The reasons for this encoding scheme are mostly historical
e.g. the shift-or algo is taking advantage of it and was implemented
in the early version of the package. Additionally, it allows switching
an object back and forth between DNAString/DNAStringSet and
RNAString/RNAStringSet without copying the sequence data (only the
table used to decode on the fly needs to be changed).

4 functions are provided in the C interface for dealing with
encoding/decoding the DNA/RNA sequences:

   char DNAencode(char c);
   char DNAdecode(char code);
   char RNAencode(char c);
   char RNAdecode(char code);

BString/BStringSet and AAString/AAStringSet objects don't encode the
sequence data. The BStringSet container is the most general (no
alphabet attached to it i.e. anything can go there) and is the
closest to an ordinary character vector there is in Biostrings
from a user point of view.

Hope this gets you started. Don't hesitate to have a look at the
5 packages listed above, and, in particular, note that you will also
need to put those 2 stub files (Biostrings_stubs.c and IRanges_stubs.c)
under your src/ folder, for reasons too long to explain here.

Let me know if you need further help.

Cheers,
H.

On 12/14/2012 01:04 AM, Ulrich Bodenhofer wrote:
> Hi,
>
> Some colleagues and I are currently developing some R packages with the
> aim to integrate some existing C/C++ libraries into R. The algorithms we
> are trying to integrate, not too surprisingly, expect sequences as "char
> *" objects. Of course, we would like our packages to be nicely
> interoperable with Biostrings. In particular, since the algorithms do
> not change their input data, it would be nice to avoid copying. That is
> why I tried to find out how to get pointers to the data hidden inside
> XStringSet objects. As far as I understood it, an XStringSet object
> consists of one large data container of class "SharedRaw_Pool" and a
> "GroupedIRanges" object that defines the views on this container. I had
> no problem disecting the GroupedIRanges object in my C++ code, but I
> could not yet find a way to get the pointer to the container right. I
> searched the web and could not find any information. Biostrings and
> IRanges are indeed very well documented, but only on a user level, not
> the internals. I also looked at the C code included in these two
> packages, but, to be frank, I got lost. So, let me ask you the following
> questions:
>
> - Is there a way to get a plain "char *" pointer that points to the
> first element of the data container?
> - Are sequences actually encoded as plain text or not (it would make
> sense to me to encode DNA/RNA sequences as four letters per byte)? If
> not, my approach is not reasonable anyway and I will have to resort to a
> conversion to character vectors anyway.
>
> Thanks a lot in advance for your inputs!
>
> Best regards,
> Ulrich
>
>
> ------------------------------------------------------------------------
> *Dr. Ulrich Bodenhofer*
> Associate Professor
> Institute of Bioinformatics
>
> *Johannes Kepler University*
> Altenberger Str. 69
> 4040 Linz, Austria
>
> Tel. +43 732 2468 4526
> Fax +43 732 2468 4539
> bodenhofer at bioinf.jku.at <mailto:bodenhofer at bioinf.jku.at>
> http://www.bioinf.jku.at/ <http://www.bioinf.jku.at>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319