[Bioc-devel] [Rd] Read a text file into R with .Call()

Fri Jun 28 21:38:59 CEST 2013

Hi Ge,

On 06/28/2013 04:45 AM, Ge Tan wrote:
> Hi  Hervé,
>
> Thank you so much!!! I tweaked it a little and it works exactly in the Biostrings way which I want to learn!
> You are my idol!!
>
> One minor question I want to ask is when I change the
> alloc_XRawList("BStringSet", "BString", ans_width));
> to DNAStringSet.

So you have a text file with one DNA sequence per line. I wonder what
kind of file is that. Wouldn't it be easier if you managed to store
your DNA sequences in FASTA format so you could just reuse
readDNAStringSet() to load it? If you leave the record headers
empty, the FASTA file won't be significantly bigger than the text file
with one DNA sequence per line.

> It reprots an error,
>> myAxt
>    A DNAStringSet instance of length 315050
>           width seq
> Error in .Call2("new_CHARACTER_from_XString", x, xs_dec_lkup(x), PACKAGE = "Biostrings") :
>    key 65 (char 'A') not in lookup table

Chars stored in a DNAStringSet need to be encoded.

The encoding scheme is:

   - DNA base letters A, C, G, and T are associated to bits 0, 1, 2,
     and 3, respectively. So A -> 0x1, C -> 0x2, G -> 0x4, T -> 0x8.

   - IUPAC ambiguity codes are represented by a bit pattern that
     combines the bits of the base letters that they are associated
     with. For example, R is associated with A and G so is represented
     by bits 0 and 2 set to 1 and all other bits set to 0. So R -> 0x5.

   - Special letters - and + are represented by 0x10 and 0x20,
     respectively.

This encoding scheme is summarized by the following lookup table:

   > lkup <- get_seqtype_conversion_lookup("B", "DNA")
   > lkup
     [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 
NA NA NA NA
    [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 32 NA 16 
NA NA NA NA
    [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA  1 14  2 13 NA NA 
  4 11 NA NA
    [76] 12 NA  3 15 NA NA NA  5  6  8 NA  7  9 NA 10 NA NA NA NA NA NA 
NA  1 14  2
   [101] 13 NA NA  4 11 NA NA 12 NA  3 15 NA NA NA  5  6  8 NA  7  9 NA 10

'lkup' is an integer vector that allows fast encoding from ASCII to
DNA (accepts lower-case DNA letters):

   > lkup[as.integer(charToRaw("ACGTRa-+Xz")) + 1L]
   [1]  1  2  4  8  5  1 16 32 NA NA

To use this lookup table at the C level, just pass it as an extra arg
to your .Call entry point. Then, at the C level, encode the incoming
bytes with something like:

     int lkup_len, key, val;
     char encoded_byte;

     ...
     lkup_len = LENGTH(lkup);
     key = (unsigned char) incoming_byte;
     if (key >= lkup_len || (val = INTEGER(lkup)[key]) == NA_INTEGER) {
         /* Make sure to UNPROTECT() whetever is currently PROTECTed,
            and to free() whatever user-controlled memory is currently
            allocated before you call error(). */
         error("invalid char: '%c'", incoming_byte);
     }
     encoded_byte = (char) val;
     ...

HTH,
H.

>
> Do you have any idea?
>
> Thanks again!
> Ge
>
> ------------------ Original ------------------
> From:  "Herv  Pag s"<hpages at fhcrc.org>;
> Date:  Fri, Jun 28, 2013 05:44 AM
> To:  "Ge Tan"<184523479 at qq.com>;
> Cc:  "r-devel"<r-devel at r-project.org>;
> Subject:  Re: [Rd] Read a text file into R with .Call()
>
>
>
> Hi Ge,
>
> Here is one way to do this with the Biostrings C API. It does
> 2 passes on the file. There is also a 1-pass way but not necessarily
> worth it because not as memory efficient.
>
> The code below is untested (not even guaranteed to compile)!
>
> SEXP read_text_file_in_BStringSet(FILE *FN)
> {
>       SEXP ans, width;
>       IntAE width_buf;
>       char *foo;
>       cachedXVectorList cached_ans;
>       cachedCharSeq cached_ans_elt;
>       int i;
>
>       /* 1st pass: compute the 'width' vector. */
>       width_buf = new_IntAE(0, 0, 0);
>       while (foo = readLine(FN)) {
>           IntAE_insert_at(&width_buf, IntAE_get_nelt(width_buf),
> strlen(foo));
>       }
>       PROTECT(width = new_INTEGER_from_IntAE(&width_buf));
>
>       /* Allocate 'ans'. */
>       PROTECT(ans = alloc_XRawList("BStringSet", "BString", ans_width));
>
>       /* 2nd pass: Fill 'ans' with data. */
>       cached_ans = cache_XVectorList(ans);
>       rewind(FN);
>       i = 0;
>       while (foo = readLine(FN)) {
>           cached_ans_elt = get_cachedXRawList_elt(&cached_ans, i);
>           memcpy((char *) cached_ans_elt->seq, foo, INTEGER(width)[i] *
> sizeof(char));
>           i++;
>       }
>
>       UNPROTECT(2);
>       return ans;
> }
>
> The returned object is a BStringSet object.
>
> Note that I kept your
>
>       foo = readLine(FN)
>
> approach for reading the file line by line. A more efficient way would
> be to use something like
>
>       n = getLineLength(FN)
>
> for the 1st pass (no need to load the line of text into the foo
> buffer), and something like
>
>       readLine(FN, (char *) cached_ans_elt->seq)
>
> for the 2nd pass so the data is loaded directly to where it needs to
> go (i.e. without going first thru the foo buffer, hence avoiding the
> memcpy()).
>
> Cheers,
> H.
>
> On 06/27/2013 12:37 PM, Ge Tan wrote:
>> Hi Simons,
>>
>> Thanks for your reply.
>> 10000 is just an example I wrote. In fact, there can be millions of strings (all of them are different and each has thousands of characters) I want to read from the file. So if I use mkChar it will store the same amount of the copies in the global cache.
>> The problem is when I get the returned qNames in R, and then rm(qNames) and do the gc().
>> gc() shows a normal amout of memory it uses. But from the top command, this R session can still use several GB. The rm() and gc() does not take effect on the memory release. (I suspect the release of the global cache is not done, even now there is not objects pointing to them.)
>> I am sure there is no other memory leak problem. Once I run the mkChar, the memory issue emerges.
>>
>> So I am comfused how to read lines from text files and make it into R character vectors to pass back to R. We cannot store each of them into the global cache nor is not necessary as they are not duplicated.
>> Regarding the raw vector method, I am not quite clear how to manipulate it. Could you give some more detailed examples?
>>
>> I attached more complete code I wrote. BTW, I am using R version 2.15.2.
>>
>> Thanks!
>> Ge
>>
>>     PROTECT(qNames = NEW_CHARACTER(nrAxts));
>>     PROTECT(qStart = NEW_INTEGER(nrAxts));
>>     PROTECT(qEnd = NEW_INTEGER(nrAxts));
>>     PROTECT(qStrand = NEW_CHARACTER(nrAxts));
>>     PROTECT(qSym = NEW_CHARACTER(nrAxts));
>>     PROTECT(tNames = NEW_CHARACTER(nrAxts));
>>     PROTECT(tStart = NEW_INTEGER(nrAxts));
>>     PROTECT(tEnd = NEW_INTEGER(nrAxts));
>>     PROTECT(tStrand = NEW_CHARACTER(nrAxts));
>>     PROTECT(tSym = NEW_CHARACTER(nrAxts));
>>     PROTECT(score = NEW_INTEGER(nrAxts));
>>     PROTECT(symCount = NEW_INTEGER(nrAxts));
>>     PROTECT(returnList = NEW_LIST(12));
>>     int *p_qStart, *p_qEnd, *p_tStart, *p_tEnd, *p_score, *p_symCount;
>>     p_qStart = INTEGER_POINTER(qStart);
>>     p_qEnd = INTEGER_POINTER(qEnd);
>>     p_tStart = INTEGER_POINTER(tStart);
>>     p_tEnd = INTEGER_POINTER(tEnd);
>>     p_score = INTEGER_POINTER(score);
>>     p_symCount = INTEGER_POINTER(symCount);
>>     int j = 0;
>>     i = 0;
>>     for(j = 0; j < nrAxtFiles; j++){
>>       char *filepath_elt = (char *) R_alloc(strlen(CHAR(STRING_ELT(filepath, j))), sizeof(char));
>>       strcpy(filepath_elt, CHAR(STRING_ELT(filepath, j)));
>>       lf = lineFileOpen(filepath_elt, TRUE);
>>       while((axt = axtRead(lf)) != NULL){
>>         SET_STRING_ELT(qNames, i, mkChar(axt->qName));
>>         p_qStart[i] = axt->qStart + 1;
>>         p_qEnd[i] = axt->qEnd;
>>         if(axt->qStrand == '+')
>>           SET_STRING_ELT(qStrand, i, mkChar("+"));
>>         else
>>           SET_STRING_ELT(qStrand, i, mkChar("-"));
>>           SET_STRING_ELT(qSym, i, mkChar(axt->qSym));
>>         SET_STRING_ELT(tNames, i, mkChar(axt->tName));
>>         p_tStart[i] = axt->tStart + 1;
>>         p_tEnd[i] = axt->tEnd;
>>         if(axt->tStrand == '+')
>>           SET_STRING_ELT(tStrand, i, mkChar("+"));
>>         else
>>           SET_STRING_ELT(tStrand, i, mkChar("-"));
>>           SET_STRING_ELT(tSym, i, mkChar(axt->tSym));
>>         p_score[i] = axt->score;
>>         p_symCount[i] = axt->symCount;
>>         i++;
>>         axtFree(&axt);
>>       }
>>       lineFileClose(&lf);
>>     }
>>     SET_VECTOR_ELT(returnList, 0, tNames);
>>     SET_VECTOR_ELT(returnList, 1, tStart);
>>     SET_VECTOR_ELT(returnList, 2, tEnd);
>>     SET_VECTOR_ELT(returnList, 3, tStrand);
>>     SET_VECTOR_ELT(returnList, 4, tSym);
>>     SET_VECTOR_ELT(returnList, 5, qNames);
>>     SET_VECTOR_ELT(returnList, 6, qStart);
>>     SET_VECTOR_ELT(returnList, 7, qEnd);
>>     SET_VECTOR_ELT(returnList, 8, qStrand);
>>     SET_VECTOR_ELT(returnList, 9, qSym);
>>     SET_VECTOR_ELT(returnList, 10, score);
>>     SET_VECTOR_ELT(returnList, 11, symCount);
>>     UNPROTECT(13);
>>     //axtFree(&curAxt);
>>     //return R_NilValue;
>>     return returnList;
>>
>>
>>
>>
>>
>> ------------------ Original ------------------
>> From:  "r-devel"<r-devel at r-project.org>;
>> Date:  Fri, Jun 28, 2013 03:08 AM
>> To:  "Ge Tan"<184523479 at qq.com>;
>> Cc:  "r-devel"<r-devel at r-project.org>;
>> Subject:  Re: [Rd] Read a text file into R with .Call()
>>
>>
>>
>>
>> On Jun 27, 2013, at 9:18 AM, Ge Tan wrote:
>>
>>> Hi,
>>>
>>> I want to read a text file into R with .Call().
>>> So I define some NEW_CHARACTER() to store the chracters read and use SET_STRING_ELT to fill the elements.
>>>
>>> e.g.
>>> PROTECT(qNames = NEW_CHARACTER(10000));
>>> char *foo; // This foo holds the string I want.
>>> while(foo = readLine(FN)){
>>>    SET_STRING_ELT(qNames, i, mkChar(foo)));
>>> }
>>>
>>> In this way, I can get the desired character from qNames. The only problem is that "mkChar" will make every foo string into a global CHARSXP cache. When I have a huge amount of file to read, the CHARSXP cache use too much memory. I do not know whether there is any other way to SET_STRING_ELT without the mkChar operation.
>>
>> No. *all* strings in R are in the cache. The whole point of it is that is uses less memory by not duplicating strings - and the overhead for as little as 10000 strings is minuscule. So I suspect that is not your problem since if that was the case, you would not have enough memory to just load the file. Check you code, chances are the issue is elsewhere.
>>
>> That said, you can always load the file into a raw vector and use accessor function to create strings on demand - but this is only meaningful when you plan to use a very small subset.
>>
>> Cheers,
>> Simon
>>
>>
>>> I know I cam refer to the Biostrings pakcage's way of readDNAStringSet, but that is a bit complicated I have not full understood it.
>>>
>>> Any help will be appreciated!!
>>>
>>> Ge
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319