[Rd] Read a text file into R with .Call()
Hervé Pagès
hpages at fhcrc.org
Thu Jun 27 23:44:01 CEST 2013
Hi Ge,
Here is one way to do this with the Biostrings C API. It makes
2 passes over the file. There is also a 1-pass way, but it is not
necessarily worth it because it is not as memory efficient.
The code below is untested (not even guaranteed to compile)!
SEXP read_text_file_in_BStringSet(FILE *FN)
{
    SEXP ans, width;
    IntAE width_buf;
    char *foo;
    cachedXVectorList cached_ans;
    cachedCharSeq cached_ans_elt;
    int i;

    /* 1st pass: compute the 'width' vector. */
    width_buf = new_IntAE(0, 0, 0);
    while ((foo = readLine(FN)) != NULL) {
        /* Append the length of the current line to the buffer. */
        IntAE_insert_at(&width_buf, IntAE_get_nelt(&width_buf),
                        strlen(foo));
    }
    PROTECT(width = new_INTEGER_from_IntAE(&width_buf));

    /* Allocate 'ans', one element per line, sized by 'width'. */
    PROTECT(ans = alloc_XRawList("BStringSet", "BString", width));

    /* 2nd pass: Fill 'ans' with data. */
    cached_ans = cache_XVectorList(ans);
    rewind(FN);
    i = 0;
    while ((foo = readLine(FN)) != NULL) {
        /* 'cached_ans_elt' is a struct returned by value, hence '.seq'. */
        cached_ans_elt = get_cachedXRawList_elt(&cached_ans, i);
        memcpy((char *) cached_ans_elt.seq, foo,
               INTEGER(width)[i] * sizeof(char));
        i++;
    }
    UNPROTECT(2);
    return ans;
}
The returned object is a BStringSet object.
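For readers who want to try the two-pass idea without Biostrings, here is a rough sketch in plain C. It is only an illustration of the layout (a 'width' vector plus one contiguous data buffer), not what Biostrings does internally. LineSet, chomp_len(), read_lines_two_pass() and MAX_LINE are names made up for this example, and lines longer than MAX_LINE - 1 bytes are simply split into several chunks.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE 4096  /* assumption: longer lines get split into chunks */

/* Hypothetical container: all lines stored back to back in 'data'.
   Line i occupies data[start[i]] .. data[start[i] + width[i] - 1]
   and is NOT NUL-terminated. */
typedef struct {
    size_t nlines;
    size_t *width;
    size_t *start;
    char *data;
} LineSet;

/* Length of the text in 'buf' up to the first CR or LF. */
static size_t chomp_len(const char *buf)
{
    return strcspn(buf, "\r\n");
}

/* Returns 0 on success, -1 on allocation failure. */
int read_lines_two_pass(FILE *fp, LineSet *out)
{
    char buf[MAX_LINE];
    size_t n = 0, total = 0, i;

    /* 1st pass: count the lines and the total number of bytes needed. */
    while (fgets(buf, sizeof buf, fp) != NULL) {
        n++;
        total += chomp_len(buf);
    }

    /* Allocate everything once, now that the sizes are known. */
    out->width = malloc((n ? n : 1) * sizeof *out->width);
    out->start = malloc((n ? n : 1) * sizeof *out->start);
    out->data = malloc(total ? total : 1);
    if (out->width == NULL || out->start == NULL || out->data == NULL) {
        free(out->width);
        free(out->start);
        free(out->data);
        return -1;
    }

    /* 2nd pass: copy each line to its final location. */
    rewind(fp);
    total = 0;
    for (i = 0; i < n && fgets(buf, sizeof buf, fp) != NULL; i++) {
        out->width[i] = chomp_len(buf);
        out->start[i] = total;
        memcpy(out->data + total, buf, out->width[i]);
        total += out->width[i];
    }
    out->nlines = i;
    return 0;
}
```

The point of the first pass is that the data buffer is allocated exactly once at its final size, so nothing has to be grown or copied a second time.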
Note that I kept your
foo = readLine(FN)
approach for reading the file line by line. A more efficient way would
be to use something like
n = getLineLength(FN)
for the 1st pass (no need to load the line of text into the foo
buffer), and something like
readLine(FN, (char *) cached_ans_elt.seq)
for the 2nd pass so the data is loaded directly to where it needs to
go (i.e. without going first thru the foo buffer, hence avoiding the
memcpy()).
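To make the suggestion above a bit more concrete, here is a possible sketch of the two helpers in standard C. Neither function exists in R or Biostrings: get_line_length() and read_line_into() are hypothetical names standing in for the getLineLength() and two-argument readLine() mentioned above, and the exact return conventions are my assumption.

```c
#include <stdio.h>

/* Hypothetical stand-in for getLineLength(): consumes one line and
   returns its length without the newline, or -1 if the stream was
   already at end of file. Nothing is stored anywhere. */
long get_line_length(FILE *fp)
{
    long n = 0;
    int c;

    while ((c = getc(fp)) != EOF && c != '\n')
        n++;
    if (c == EOF && n == 0)
        return -1;
    return n;
}

/* Hypothetical stand-in for the two-argument readLine(): consumes one
   line and writes at most 'width' of its bytes directly into 'dest'
   (no newline, no terminating NUL). Returns the number of bytes
   stored. 'width' is expected to come from the 1st pass. */
long read_line_into(FILE *fp, char *dest, long width)
{
    long stored = 0;
    int c;

    while ((c = getc(fp)) != EOF && c != '\n') {
        if (stored < width)
            dest[stored++] = (char) c;
    }
    return stored;
}
```

With helpers like these, the 1st pass never fills an intermediate buffer and the 2nd pass writes each line exactly once, at its destination.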
Cheers,
H.
On 06/27/2013 12:37 PM, Ge Tan wrote:
> Hi Simon,
>
> Thanks for your reply.
> 10000 is just an example I wrote. In fact, there can be millions of strings (all of them different, each with thousands of characters) that I want to read from the file. So if I use mkChar, it will store that many copies in the global cache.
> The problem appears when I get the returned qNames in R, then rm(qNames) and run gc().
> gc() shows a normal amount of memory in use. But according to the top command, this R session can still use several GB. rm() and gc() have no effect on releasing the memory. (I suspect the global cache is not released, even though no objects point to its entries any more.)
> I am sure there is no other memory leak problem. As soon as I call mkChar, the memory issue emerges.
>
> So I am confused about how to read lines from text files and turn them into R character vectors to pass back to R. We cannot store each of them in the global cache, nor is it necessary, as they are not duplicated.
> Regarding the raw vector method, I am not quite clear how to manipulate it. Could you give some more detailed examples?
>
> I attached more complete code I wrote. BTW, I am using R version 2.15.2.
>
> Thanks!
> Ge
>
> PROTECT(qNames = NEW_CHARACTER(nrAxts));
> PROTECT(qStart = NEW_INTEGER(nrAxts));
> PROTECT(qEnd = NEW_INTEGER(nrAxts));
> PROTECT(qStrand = NEW_CHARACTER(nrAxts));
> PROTECT(qSym = NEW_CHARACTER(nrAxts));
> PROTECT(tNames = NEW_CHARACTER(nrAxts));
> PROTECT(tStart = NEW_INTEGER(nrAxts));
> PROTECT(tEnd = NEW_INTEGER(nrAxts));
> PROTECT(tStrand = NEW_CHARACTER(nrAxts));
> PROTECT(tSym = NEW_CHARACTER(nrAxts));
> PROTECT(score = NEW_INTEGER(nrAxts));
> PROTECT(symCount = NEW_INTEGER(nrAxts));
> PROTECT(returnList = NEW_LIST(12));
> int *p_qStart, *p_qEnd, *p_tStart, *p_tEnd, *p_score, *p_symCount;
> p_qStart = INTEGER_POINTER(qStart);
> p_qEnd = INTEGER_POINTER(qEnd);
> p_tStart = INTEGER_POINTER(tStart);
> p_tEnd = INTEGER_POINTER(tEnd);
> p_score = INTEGER_POINTER(score);
> p_symCount = INTEGER_POINTER(symCount);
> int j = 0;
> i = 0;
> for(j = 0; j < nrAxtFiles; j++){
>   /* +1 leaves room for the terminating NUL written by strcpy() */
>   char *filepath_elt = (char *) R_alloc(strlen(CHAR(STRING_ELT(filepath, j))) + 1, sizeof(char));
>   strcpy(filepath_elt, CHAR(STRING_ELT(filepath, j)));
>   lf = lineFileOpen(filepath_elt, TRUE);
>   while((axt = axtRead(lf)) != NULL){
>     SET_STRING_ELT(qNames, i, mkChar(axt->qName));
>     p_qStart[i] = axt->qStart + 1;
>     p_qEnd[i] = axt->qEnd;
>     if(axt->qStrand == '+')
>       SET_STRING_ELT(qStrand, i, mkChar("+"));
>     else
>       SET_STRING_ELT(qStrand, i, mkChar("-"));
>     SET_STRING_ELT(qSym, i, mkChar(axt->qSym));
>     SET_STRING_ELT(tNames, i, mkChar(axt->tName));
>     p_tStart[i] = axt->tStart + 1;
>     p_tEnd[i] = axt->tEnd;
>     if(axt->tStrand == '+')
>       SET_STRING_ELT(tStrand, i, mkChar("+"));
>     else
>       SET_STRING_ELT(tStrand, i, mkChar("-"));
>     SET_STRING_ELT(tSym, i, mkChar(axt->tSym));
>     p_score[i] = axt->score;
>     p_symCount[i] = axt->symCount;
>     i++;
>     axtFree(&axt);
>   }
>   lineFileClose(&lf);
> }
> SET_VECTOR_ELT(returnList, 0, tNames);
> SET_VECTOR_ELT(returnList, 1, tStart);
> SET_VECTOR_ELT(returnList, 2, tEnd);
> SET_VECTOR_ELT(returnList, 3, tStrand);
> SET_VECTOR_ELT(returnList, 4, tSym);
> SET_VECTOR_ELT(returnList, 5, qNames);
> SET_VECTOR_ELT(returnList, 6, qStart);
> SET_VECTOR_ELT(returnList, 7, qEnd);
> SET_VECTOR_ELT(returnList, 8, qStrand);
> SET_VECTOR_ELT(returnList, 9, qSym);
> SET_VECTOR_ELT(returnList, 10, score);
> SET_VECTOR_ELT(returnList, 11, symCount);
> UNPROTECT(13);
> //axtFree(&curAxt);
> //return R_NilValue;
> return returnList;
>
> ------------------ Original ------------------
> From: "r-devel"<r-devel at r-project.org>;
> Date: Fri, Jun 28, 2013 03:08 AM
> To: "Ge Tan"<184523479 at qq.com>;
> Cc: "r-devel"<r-devel at r-project.org>;
> Subject: Re: [Rd] Read a text file into R with .Call()
>
> On Jun 27, 2013, at 9:18 AM, Ge Tan wrote:
>
>> Hi,
>>
>> I want to read a text file into R with .Call().
>> So I define some NEW_CHARACTER() to store the characters read and use SET_STRING_ELT to fill the elements.
>>
>> e.g.
>> PROTECT(qNames = NEW_CHARACTER(10000));
>> char *foo; // This foo holds the string I want.
>> while(foo = readLine(FN)){
>> SET_STRING_ELT(qNames, i, mkChar(foo));
>> }
>>
>> In this way, I can get the desired characters from qNames. The only problem is that "mkChar" puts every foo string into the global CHARSXP cache. When I have a huge amount of file data to read, the CHARSXP cache uses too much memory. I do not know whether there is any other way to SET_STRING_ELT without the mkChar operation.
>
> No. *all* strings in R are in the cache. The whole point of it is that it uses less memory by not duplicating strings - and the overhead for as little as 10000 strings is minuscule. So I suspect that is not your problem, since if that were the case, you would not have enough memory to just load the file. Check your code; chances are the issue is elsewhere.
>
> That said, you can always load the file into a raw vector and use an accessor function to create strings on demand - but this is only meaningful when you plan to use a very small subset.
>
> Cheers,
> Simon
>
>
>> I know I can refer to the way the Biostrings package's readDNAStringSet does it, but that is a bit complicated and I have not fully understood it.
>>
>> Any help will be appreciated!!
>>
>> Ge
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319