[R] Memory leak with character arrays?
Peter Waltman
waltman at cs.nyu.edu
Wed Jan 17 22:54:31 CET 2007
Hi -
When I try to read a text file into a labeled character array, the
memory footprint of R grows to 4 gigs or more. I've seen this behavior
on Mac OS X, Linux for AMD_64 and X86_64, and the R versions are 2.4,
2.4 and 2.2, respectively. So, it would seem that this is platform and
R version independent.
The file that I'm reading contains the upstream regions of the yeast
genome, with each upstream region labeled using a FASTA header, i.e.:
FASTA header for gene 1
upstream region.....
.....
....
FASTA header for gene 2
upstream....
....
The script I use - code below - opens the file, parses for a FASTA
header, and then parses the header for the gene name. Once this is
done, it reads the following lines, which contain the upstream region,
and adds that region as an item to the character array, using the gene
name as the name of the item. It then continues on to the following
genes.
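A minimal sketch of the procedure just described, run on a made-up input, may help make the header-then-sequence logic concrete. It is not the script from this post: the toy lines, the gene names, and the helper name parse.fasta.sketch are all hypothetical, it ignores synonyms and coordinates, and it groups lines with cumsum() and tapply() instead of looping.

```r
# Hypothetical toy input: ">" marks a FASTA header, other lines are sequence
toy.lines <- c( ">geneA; synA", "acgt", "ttga", "", ">geneB; synB", "ggcc" )

# Hypothetical helper: collect the sequence lines under each header and
# return a named character vector (names = gene, values = sequence)
parse.fasta.sketch <- function( lines ) {
  lines <- lines[ lines != "" ]               # drop blank lines
  is.header <- substr( lines, 1, 1 ) == ">"
  record <- cumsum( is.header )               # record index for every line
  # gene name = header text between ">" and the first ";"
  genes <- toupper( sub( "^>([^;]*).*$", "\\1", lines[ is.header ] ) )
  # paste together the sequence lines belonging to each record
  seqs <- tapply( lines[ !is.header ], record[ !is.header ],
                  paste, collapse="" )
  result <- toupper( as.vector( seqs ) )
  names( result ) <- genes[ as.integer( names( seqs ) ) ]
  result
}

upstream.toy <- parse.fasta.sketch( toy.lines )
print( upstream.toy )
```

The result is a named character vector with one element per gene, which is the same shape as the upstream array built by the script.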
Each upstream region (the text to be added) is 550 bases (characters)
long. With ~6000 genes in the file I'm reading in, this would be 550 *
6000 = 3.3 million characters, i.e. about 3.3 Megs at one byte per
ascii char (roughly 26 megabits at 8 bits per char).
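The back-of-the-envelope estimate can be checked directly in R. The sketch below is only an illustration under stated assumptions: it takes one byte per ASCII character, uses the approximate counts given above, and builds simulated data (random bases and hypothetical gene names such as GENE0001) to see what object.size() reports for a named character vector of that shape.

```r
n.genes <- 6000          # approximate gene count from the post
bases.per.gene <- 550    # upstream region length from the post

# Raw size, assuming one byte per ASCII character
raw.bytes <- n.genes * bases.per.gene
cat( "raw sequence data:", raw.bytes / 1e6, "MB\n" )

# Simulated data of the same shape (random bases, hypothetical names)
set.seed( 1 )
sim <- replicate( n.genes,
                  paste( sample( c( "A", "C", "G", "T" ), bases.per.gene,
                                 replace=TRUE ), collapse="" ) )
names( sim ) <- sprintf( "GENE%04d", seq_len( n.genes ) )

# Size of the named character vector as R stores it
print( object.size( sim ), units="Mb" )
```

Note that object.size() measures the finished vector only, so it says nothing about temporary copies made while the vector is being built.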
I realize that the character arrays/vectors will have a higher memory
stamp b/c they are a named array and most likely aren't storing the text
as ascii, but 4 gigs and up seems a bit excessive. Or is it?
For example, this is the output of top at the point at which R has
processed around 5000 genes:
 PID  USER     PR  NI  VIRT    RES    SHR  S  %CPU  %MEM  TIME+     COMMAND
 4969 waltman  18  0   *6746m  3.4g*  920  D  2.7   88.2  19:09.19  R
Is this expected behavior? Can anyone recommend a less memory intensive
way to store this data? The relevant code that reads in the file follows:
....code....
# valid.gene.regexp and cat.new() are defined elsewhere in the script
lines <- readLines( gzfile( seqs.fname ) )
n.seqs <- 0
upstream <- gene.names <- character()
syn <- character( 0 )
gene.start <- gene.end <- integer()
gene <- seq <- ""
for ( i in seq_along( lines ) ) {
  line <- lines[ i ]
  if ( line == "" ) next
  if ( substr( line, 1, 1 ) == ">" ) {
    # New FASTA header: store the sequence collected for the previous gene
    if ( seq != "" && gene != "" ) upstream[ gene ] <- toupper( seq )
    # Take the first tab-separated field, then split it on "; "
    splitted <- strsplit( line, "\t" )[[ 1 ]]
    splitted <- strsplit( splitted[ 1 ], ";\\ " )[[ 1 ]]
    # Gene name is the first field without the leading ">"
    gene <- toupper( substr( splitted[ 1 ], 2, nchar( splitted[ 1 ] ) ) )
    syn <- splitted[ 2 ]
    # Use the synonym if only it matches; skip the header if neither matches
    if ( ! is.null( syn ) &&
         length( grep( valid.gene.regexp, gene, perl=TRUE ) ) == 0 &&
         length( grep( valid.gene.regexp, syn, perl=TRUE ) ) == 1
       ) gene <- syn
    else if ( length( grep( valid.gene.regexp, gene, perl=TRUE,
                            ignore.case=TRUE ) ) == 0 &&
              length( grep( valid.gene.regexp, syn, perl=TRUE,
                            ignore.case=TRUE ) ) == 0 ) next
    gene.start[ gene ] <- as.integer( splitted[ 9 ] )
    gene.end[ gene ] <- as.integer( splitted[ 10 ] )
    # Progress report every 100 sequences
    if ( n.seqs %% 100 == 0 ) cat.new( n.seqs, gene, "|", syn,
                                       "| length=", nchar( seq ),
                                       gene.end[gene]-gene.start[gene]+1, "\n" )
    if ( ! is.na( syn ) && syn != "" ) gene.names[ gene ] <- syn
    else gene.names[ gene ] <- toupper( gene )
    n.seqs <- n.seqs + 1
    seq <- ""
  } else {
    # Sequence line: append it to the current gene's upstream region
    seq <- paste( seq, line, sep="" )
  }
}
# Store the sequence of the last gene in the file
if ( seq != "" && gene != "" ) upstream[ gene ] <- toupper( seq )
....code....
Thanks,
Peter Waltman