[Rd] Read a text file into R with .Call()

Hervé Pagès hpages at fhcrc.org
Fri Jun 28 20:26:22 CEST 2013


Hi Ge,

On 06/28/2013 04:45 AM, Ge Tan wrote:
> Hi  Hervé,
>
> Thank you so much!!! I tweaked it a little and it works exactly in the Biostrings way which I want to learn!
> You are my idol!!
>
> One minor question I want to ask is when I change the
> alloc_XRawList("BStringSet", "BString", ans_width));
> to DNAStringSet.
> It reprots an error,
>> myAxt
>    A DNAStringSet instance of length 315050
>           width seq
> Error in .Call2("new_CHARACTER_from_XString", x, xs_dec_lkup(x), PACKAGE = "Biostrings") :
>    key 65 (char 'A') not in lookup table

Glad it works for you. I'll answer your new question on the Bioc-devel
mailing list. See you there.

H.

>
> Do you have any idea?
>
> Thanks again!
> Ge
>
> ------------------ Original ------------------
> From:  "Herv  Pag s"<hpages at fhcrc.org>;
> Date:  Fri, Jun 28, 2013 05:44 AM
> To:  "Ge Tan"<184523479 at qq.com>;
> Cc:  "r-devel"<r-devel at r-project.org>;
> Subject:  Re: [Rd] Read a text file into R with .Call()
>
>
>
> Hi Ge,
>
> Here is one way to do this with the Biostrings C API. It does
> 2 passes on the file. There is also a 1-pass way but not necessarily
> worth it because not as memory efficient.
>
> The code below is untested (not even guaranteed to compile)!
>
> SEXP read_text_file_in_BStringSet(FILE *FN)
> {
>       SEXP ans, width;
>       IntAE width_buf;
>       char *foo;
>       cachedXVectorList cached_ans;
>       cachedCharSeq cached_ans_elt;
>       int i;
>
>       /* 1st pass: compute the 'width' vector. */
>       width_buf = new_IntAE(0, 0, 0);
>       while (foo = readLine(FN)) {
>           IntAE_insert_at(&width_buf, IntAE_get_nelt(width_buf),
> strlen(foo));
>       }
>       PROTECT(width = new_INTEGER_from_IntAE(&width_buf));
>
>       /* Allocate 'ans'. */
>       PROTECT(ans = alloc_XRawList("BStringSet", "BString", ans_width));
>
>       /* 2nd pass: Fill 'ans' with data. */
>       cached_ans = cache_XVectorList(ans);
>       rewind(FN);
>       i = 0;
>       while (foo = readLine(FN)) {
>           cached_ans_elt = get_cachedXRawList_elt(&cached_ans, i);
>           memcpy((char *) cached_ans_elt->seq, foo, INTEGER(width)[i] *
> sizeof(char));
>           i++;
>       }
>
>       UNPROTECT(2);
>       return ans;
> }
>
> The returned object is a BStringSet object.
>
> Note that I kept your
>
>       foo = readLine(FN)
>
> approach for reading the file line by line. A more efficient way would
> be to use something like
>
>       n = getLineLength(FN)
>
> for the 1st pass (no need to load the line of text into the foo
> buffer), and something like
>
>       readLine(FN, (char *) cached_ans_elt->seq)
>
> for the 2nd pass so the data is loaded directly to where it needs to
> go (i.e. without going first thru the foo buffer, hence avoiding the
> memcpy()).
>
> Cheers,
> H.
>
> On 06/27/2013 12:37 PM, Ge Tan wrote:
>> Hi Simons,
>>
>> Thanks for your reply.
>> 10000 is just an example I wrote. In fact, there can be millions of strings (all of them are different and each has thousands of characters) I want to read from the file. So if I use mkChar it will store the same amount of the copies in the global cache.
>> The problem is when I get the returned qNames in R, and then rm(qNames) and do the gc().
>> gc() shows a normal amout of memory it uses. But from the top command, this R session can still use several GB. The rm() and gc() does not take effect on the memory release. (I suspect the release of the global cache is not done, even now there is not objects pointing to them.)
>> I am sure there is no other memory leak problem. Once I run the mkChar, the memory issue emerges.
>>
>> So I am comfused how to read lines from text files and make it into R character vectors to pass back to R. We cannot store each of them into the global cache nor is not necessary as they are not duplicated.
>> Regarding the raw vector method, I am not quite clear how to manipulate it. Could you give some more detailed examples?
>>
>> I attached more complete code I wrote. BTW, I am using R version 2.15.2.
>>
>> Thanks!
>> Ge
>>
>>     PROTECT(qNames = NEW_CHARACTER(nrAxts));
>>     PROTECT(qStart = NEW_INTEGER(nrAxts));
>>     PROTECT(qEnd = NEW_INTEGER(nrAxts));
>>     PROTECT(qStrand = NEW_CHARACTER(nrAxts));
>>     PROTECT(qSym = NEW_CHARACTER(nrAxts));
>>     PROTECT(tNames = NEW_CHARACTER(nrAxts));
>>     PROTECT(tStart = NEW_INTEGER(nrAxts));
>>     PROTECT(tEnd = NEW_INTEGER(nrAxts));
>>     PROTECT(tStrand = NEW_CHARACTER(nrAxts));
>>     PROTECT(tSym = NEW_CHARACTER(nrAxts));
>>     PROTECT(score = NEW_INTEGER(nrAxts));
>>     PROTECT(symCount = NEW_INTEGER(nrAxts));
>>     PROTECT(returnList = NEW_LIST(12));
>>     int *p_qStart, *p_qEnd, *p_tStart, *p_tEnd, *p_score, *p_symCount;
>>     p_qStart = INTEGER_POINTER(qStart);
>>     p_qEnd = INTEGER_POINTER(qEnd);
>>     p_tStart = INTEGER_POINTER(tStart);
>>     p_tEnd = INTEGER_POINTER(tEnd);
>>     p_score = INTEGER_POINTER(score);
>>     p_symCount = INTEGER_POINTER(symCount);
>>     int j = 0;
>>     i = 0;
>>     for(j = 0; j < nrAxtFiles; j++){
>>       char *filepath_elt = (char *) R_alloc(strlen(CHAR(STRING_ELT(filepath, j))), sizeof(char));
>>       strcpy(filepath_elt, CHAR(STRING_ELT(filepath, j)));
>>       lf = lineFileOpen(filepath_elt, TRUE);
>>       while((axt = axtRead(lf)) != NULL){
>>         SET_STRING_ELT(qNames, i, mkChar(axt->qName));
>>         p_qStart[i] = axt->qStart + 1;
>>         p_qEnd[i] = axt->qEnd;
>>         if(axt->qStrand == '+')
>>           SET_STRING_ELT(qStrand, i, mkChar("+"));
>>         else
>>           SET_STRING_ELT(qStrand, i, mkChar("-"));
>>           SET_STRING_ELT(qSym, i, mkChar(axt->qSym));
>>         SET_STRING_ELT(tNames, i, mkChar(axt->tName));
>>         p_tStart[i] = axt->tStart + 1;
>>         p_tEnd[i] = axt->tEnd;
>>         if(axt->tStrand == '+')
>>           SET_STRING_ELT(tStrand, i, mkChar("+"));
>>         else
>>           SET_STRING_ELT(tStrand, i, mkChar("-"));
>>           SET_STRING_ELT(tSym, i, mkChar(axt->tSym));
>>         p_score[i] = axt->score;
>>         p_symCount[i] = axt->symCount;
>>         i++;
>>         axtFree(&axt);
>>       }
>>       lineFileClose(&lf);
>>     }
>>     SET_VECTOR_ELT(returnList, 0, tNames);
>>     SET_VECTOR_ELT(returnList, 1, tStart);
>>     SET_VECTOR_ELT(returnList, 2, tEnd);
>>     SET_VECTOR_ELT(returnList, 3, tStrand);
>>     SET_VECTOR_ELT(returnList, 4, tSym);
>>     SET_VECTOR_ELT(returnList, 5, qNames);
>>     SET_VECTOR_ELT(returnList, 6, qStart);
>>     SET_VECTOR_ELT(returnList, 7, qEnd);
>>     SET_VECTOR_ELT(returnList, 8, qStrand);
>>     SET_VECTOR_ELT(returnList, 9, qSym);
>>     SET_VECTOR_ELT(returnList, 10, score);
>>     SET_VECTOR_ELT(returnList, 11, symCount);
>>     UNPROTECT(13);
>>     //axtFree(&curAxt);
>>     //return R_NilValue;
>>     return returnList;
>>
>>
>>
>>
>>
>> ------------------ Original ------------------
>> From:  "r-devel"<r-devel at r-project.org>;
>> Date:  Fri, Jun 28, 2013 03:08 AM
>> To:  "Ge Tan"<184523479 at qq.com>;
>> Cc:  "r-devel"<r-devel at r-project.org>;
>> Subject:  Re: [Rd] Read a text file into R with .Call()
>>
>>
>>
>>
>> On Jun 27, 2013, at 9:18 AM, Ge Tan wrote:
>>
>>> Hi,
>>>
>>> I want to read a text file into R with .Call().
>>> So I define some NEW_CHARACTER() to store the chracters read and use SET_STRING_ELT to fill the elements.
>>>
>>> e.g.
>>> PROTECT(qNames = NEW_CHARACTER(10000));
>>> char *foo; // This foo holds the string I want.
>>> while(foo = readLine(FN)){
>>>    SET_STRING_ELT(qNames, i, mkChar(foo)));
>>> }
>>>
>>> In this way, I can get the desired character from qNames. The only problem is that "mkChar" will make every foo string into a global CHARSXP cache. When I have a huge amount of file to read, the CHARSXP cache use too much memory. I do not know whether there is any other way to SET_STRING_ELT without the mkChar operation.
>>
>> No. *all* strings in R are in the cache. The whole point of it is that is uses less memory by not duplicating strings - and the overhead for as little as 10000 strings is minuscule. So I suspect that is not your problem since if that was the case, you would not have enough memory to just load the file. Check you code, chances are the issue is elsewhere.
>>
>> That said, you can always load the file into a raw vector and use accessor function to create strings on demand - but this is only meaningful when you plan to use a very small subset.
>>
>> Cheers,
>> Simon
>>
>>
>>> I know I cam refer to the Biostrings pakcage's way of readDNAStringSet, but that is a bit complicated I have not full understood it.
>>>
>>> Any help will be appreciated!!
>>>
>>> Ge
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the R-devel mailing list