[Rd] Read a text file into R with .Call()

Simon Urbanek simon.urbanek at r-project.org
Thu Jun 27 22:13:01 CEST 2013


On Jun 27, 2013, at 3:37 PM, Ge Tan wrote:

> Hi Simon,
> 
> Thanks for your reply.
> 10000 was just an example. In fact, I want to read millions of strings from the file (all of them different, each thousands of characters long). So if I use mkChar, it will store just as many copies in the global cache.

Those are not "copies" - those are the actual strings, there is no other copy - that's the whole point.


> The problem appears when I get the returned qNames in R, then call rm(qNames) and gc().
> gc() shows a normal amount of memory in use, but according to the top command this R session still uses several GB. rm() and gc() do not seem to release the memory. (I suspect the global cache is not released, even though no objects point to those strings any more.)

Nope - that is normal - see FAQ 7.42 - the cache is released as you see in gc().
So is there any actual problem?


> I am sure there is no other memory leak. The memory issue emerges as soon as I call mkChar.
> 
> So I am confused about how to read lines from text files and turn them into R character vectors to pass back to R. Storing each of them in the global cache is neither feasible nor necessary, since none of them are duplicated.
> Regarding the raw vector method, I am not quite clear on how to manipulate it. Could you give a more detailed example?
> 

Let's say you have a file with millions of lines, then you can simply use something like

x <- readBin(fn, raw(), file.info(fn)$size)

to read it all. When you need a particular string, you create the one you need, e.g.

rawToChar(x[1:16])

As for how to know where it is, you can use grepRaw("\n", x, fixed=TRUE, all=TRUE) to get an index of the lines (without all=TRUE only the first match is returned).
As I said, the premise here is that you need only a small fraction of the content. If you really need them all as strings, then you have to load them all anyway.
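
The same idea can be sketched in plain C, which may be closer to what a .Call() routine would do: keep the file contents in one byte buffer, record where each line starts, and build a string only for the line that is actually needed. This is a minimal sketch, not code from this thread; index_lines() and line_at() are hypothetical helper names, and the buffer is assumed to have been read already (e.g. with fread()).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helpers for illustration; not part of R's API. */

/* Record the offset at which each line of buf[0..len) starts.
   Returns a malloc'ed array (caller frees) and stores its length in
   *n_lines; returns NULL if allocation fails. */
size_t *index_lines(const char *buf, size_t len, size_t *n_lines)
{
    assert(buf != NULL && n_lines != NULL);
    size_t count = 0, cap = 16, pos = 0;
    size_t *starts = malloc(cap * sizeof *starts);
    if (starts == NULL) return NULL;

    while (pos < len) {
        if (count == cap) {                 /* grow the index as needed */
            cap *= 2;
            size_t *tmp = realloc(starts, cap * sizeof *starts);
            if (tmp == NULL) { free(starts); return NULL; }
            starts = tmp;
        }
        starts[count++] = pos;
        const char *nl = memchr(buf + pos, '\n', len - pos);
        if (nl == NULL) break;              /* last line has no newline */
        pos = (size_t)(nl - buf) + 1;       /* next line starts after it */
    }
    *n_lines = count;
    return starts;
}

/* Create line i (0-based) on demand as a NUL-terminated copy without the
   trailing newline. Caller frees; returns NULL if i is out of range. */
char *line_at(const char *buf, size_t len,
              const size_t *starts, size_t n_lines, size_t i)
{
    if (i >= n_lines) return NULL;
    size_t begin = starts[i];
    size_t end = (i + 1 < n_lines) ? starts[i + 1] : len;
    if (end > begin && buf[end - 1] == '\n') end--;   /* drop the newline */

    char *out = malloc(end - begin + 1);    /* +1 for the terminating NUL */
    if (out == NULL) return NULL;
    memcpy(out, buf + begin, end - begin);
    out[end - begin] = '\0';
    return out;
}
```

In an R routine, the copy returned by line_at() would presumably be the only string handed to mkChar(), so only the lines actually requested end up in the CHARSXP cache.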

Cheers,
Simon


> I attached more complete code I wrote.

PS: Just minor details (not relevant to the above): The R_alloc() call leads to a buffer overflow - you're not allocating enough bytes (you should not really need to allocate anything; ideally lineFileOpen() should be non-destructive and you could pass CHAR() directly). Also, I find it a bit safer to create the result list first, PROTECT() it, and add elements as you allocate them [e.g., x = SET_VECTOR_ELT(res, i, allocVector(STRSXP, n))] - then you don't need any other protects at all, since all its elements are protected by the list as you add them, so you don't need to worry about getting the protection count right.
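
To make the off-by-one concrete: strlen() does not count the terminating NUL, but strcpy() writes it, so a copy needs strlen() + 1 bytes. The sketch below shows this in plain C, with malloc() standing in for R_alloc(); copy_string() is a hypothetical name, not a function from R or from the code in this thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a writable, NUL-terminated copy of src.
   Caller frees; returns NULL if allocation fails. */
char *copy_string(const char *src)
{
    assert(src != NULL);
    size_t n = strlen(src) + 1;   /* +1: strlen() excludes the final '\0' */
    char *dst = malloc(n);
    if (dst != NULL)
        memcpy(dst, src, n);      /* copies the characters and the '\0' */
    return dst;
}
```

In the quoted loop, the corresponding change would presumably be R_alloc(strlen(CHAR(STRING_ELT(filepath, j))) + 1, sizeof(char)).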


> BTW, I am using R version 2.15.2.
> 
> Thanks!
> Ge
> 
>  PROTECT(qNames = NEW_CHARACTER(nrAxts));
>  PROTECT(qStart = NEW_INTEGER(nrAxts));
>  PROTECT(qEnd = NEW_INTEGER(nrAxts));
>  PROTECT(qStrand = NEW_CHARACTER(nrAxts));
>  PROTECT(qSym = NEW_CHARACTER(nrAxts));
>  PROTECT(tNames = NEW_CHARACTER(nrAxts));
>  PROTECT(tStart = NEW_INTEGER(nrAxts));
>  PROTECT(tEnd = NEW_INTEGER(nrAxts));
>  PROTECT(tStrand = NEW_CHARACTER(nrAxts));
>  PROTECT(tSym = NEW_CHARACTER(nrAxts));
>  PROTECT(score = NEW_INTEGER(nrAxts));
>  PROTECT(symCount = NEW_INTEGER(nrAxts));
>  PROTECT(returnList = NEW_LIST(12));
>  int *p_qStart, *p_qEnd, *p_tStart, *p_tEnd, *p_score, *p_symCount;
>  p_qStart = INTEGER_POINTER(qStart);
>  p_qEnd = INTEGER_POINTER(qEnd);
>  p_tStart = INTEGER_POINTER(tStart);
>  p_tEnd = INTEGER_POINTER(tEnd);
>  p_score = INTEGER_POINTER(score);
>  p_symCount = INTEGER_POINTER(symCount);
>  int j = 0;
>  i = 0;
>  for(j = 0; j < nrAxtFiles; j++){
>    char *filepath_elt = (char *) R_alloc(strlen(CHAR(STRING_ELT(filepath, j))), sizeof(char));
>    strcpy(filepath_elt, CHAR(STRING_ELT(filepath, j)));
>    lf = lineFileOpen(filepath_elt, TRUE);
>    while((axt = axtRead(lf)) != NULL){
>      SET_STRING_ELT(qNames, i, mkChar(axt->qName));
>      p_qStart[i] = axt->qStart + 1;
>      p_qEnd[i] = axt->qEnd;
>      if(axt->qStrand == '+')
>        SET_STRING_ELT(qStrand, i, mkChar("+"));
>      else
>        SET_STRING_ELT(qStrand, i, mkChar("-"));
>      SET_STRING_ELT(qSym, i, mkChar(axt->qSym));
>      SET_STRING_ELT(tNames, i, mkChar(axt->tName));
>      p_tStart[i] = axt->tStart + 1;
>      p_tEnd[i] = axt->tEnd;
>      if(axt->tStrand == '+')
>        SET_STRING_ELT(tStrand, i, mkChar("+"));
>      else
>        SET_STRING_ELT(tStrand, i, mkChar("-"));
>      SET_STRING_ELT(tSym, i, mkChar(axt->tSym));
>      p_score[i] = axt->score;
>      p_symCount[i] = axt->symCount;
>      i++;
>      axtFree(&axt);
>    }
>    lineFileClose(&lf);
>  }
>  SET_VECTOR_ELT(returnList, 0, tNames);
>  SET_VECTOR_ELT(returnList, 1, tStart);
>  SET_VECTOR_ELT(returnList, 2, tEnd);
>  SET_VECTOR_ELT(returnList, 3, tStrand);
>  SET_VECTOR_ELT(returnList, 4, tSym);
>  SET_VECTOR_ELT(returnList, 5, qNames);
>  SET_VECTOR_ELT(returnList, 6, qStart);
>  SET_VECTOR_ELT(returnList, 7, qEnd);
>  SET_VECTOR_ELT(returnList, 8, qStrand);
>  SET_VECTOR_ELT(returnList, 9, qSym);
>  SET_VECTOR_ELT(returnList, 10, score);
>  SET_VECTOR_ELT(returnList, 11, symCount);
>  UNPROTECT(13);
>  //axtFree(&curAxt);
>  //return R_NilValue;
>  return returnList;
> 
> 
> 
> 
> 
> ------------------ Original ------------------
> From:  "r-devel"<r-devel at r-project.org>;
> Date:  Fri, Jun 28, 2013 03:08 AM
> To:  "Ge Tan"<184523479 at qq.com>; 
> Cc:  "r-devel"<r-devel at r-project.org>; 
> Subject:  Re: [Rd] Read a text file into R with .Call()
> 
> 
> 
> 
> On Jun 27, 2013, at 9:18 AM, Ge Tan wrote:
> 
>> Hi,
>> 
>> I want to read a text file into R with .Call().
>> So I define some NEW_CHARACTER() vectors to store the characters read and use SET_STRING_ELT to fill the elements.
>> 
>> e.g.
>> PROTECT(qNames = NEW_CHARACTER(10000));
>> char *foo; // This foo holds the string I want.
>> int i = 0; // index of the next element to fill
>> while((foo = readLine(FN)) != NULL){
>>   SET_STRING_ELT(qNames, i, mkChar(foo));
>>   i++;
>> }
>> 
>> In this way, I can get the desired character vector in qNames. The only problem is that mkChar puts every foo string into the global CHARSXP cache. When I have a huge file to read, the CHARSXP cache uses too much memory. Is there any way to use SET_STRING_ELT without mkChar?
> 
> No. *all* strings in R are in the cache. The whole point of it is that it uses less memory by not duplicating strings - and the overhead for as few as 10000 strings is minuscule. So I suspect that is not your problem, since if that were the case, you would not have enough memory to just load the file. Check your code; chances are the issue is elsewhere.
> 
> That said, you can always load the file into a raw vector and use an accessor function to create strings on demand - but this is only meaningful when you plan to use a very small subset.
> 
> Cheers,
> Simon
> 
> 
>> I know I can refer to how the Biostrings package implements readDNAStringSet, but that is a bit complicated and I have not fully understood it.
>> 
>> Any help will be appreciated!!
>> 
>> Ge
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> 
>> 


