[Bioc-devel] Reducing memory footprint of large object

Christian Arnold christian.arnold at embl.de
Thu Nov 5 16:22:07 CET 2015


Hi all,

I wanted to ask this list full of experts whether any of you have 
advice on the following problem:

I got a large SNPhood object from someone (package SNPhood, which I 
developed) from an analysis of about 200,000 SNPs. It stores lots of 
read counts and the positions of overlapping reads in general. In total, 
the object is 2 GB in size. I examined the object and identified the slot 
that needs the most memory. This slot holds a nested list that stores 
the read start positions of all overlapping reads for each SNP region.

For example, for one individual, this is a list of 120,049 integer 
vectors that contain 20,853,838 elements in total:

 > format(object.size(SNPhood.o@internal$readStartPos$ambiguous$GM12878), units = "Mb")
[1] "86 Mb"

Unsurprisingly, unlisting improves this only slightly:

 > format(object.size(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878)), units = "Mb")
[1] "79.6 Mb"

 > length(SNPhood.o@internal$readStartPos$ambiguous$GM12878)
[1] 120049
 > length(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878))
[1] 20853838
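
As a rough check of where the memory goes, here is a minimal sketch on simulated data. The names `pos_list` and `n_regions` and all sizes in it are made up for illustration and are not taken from the real SNPhood object. It suggests that the flat vector costs about 4 bytes per integer, and that the gap between the list and the unlisted vector is a fixed per-vector overhead, which would be consistent with 86 Mb versus 79.6 Mb for 120,049 list elements.

```r
# Simulated stand-in for the real slot (hypothetical, much smaller):
# a list with one sorted integer vector of start positions per SNP region.
set.seed(1)
n_regions <- 1000L
pos_list <- lapply(seq_len(n_regions), function(i) {
  sort(sample.int(1000000L, 170L, replace = TRUE))
})

# Flatten the list into a single integer vector, as unlist() does above.
flat <- unlist(pos_list)

size_list <- as.numeric(object.size(pos_list))
size_flat <- as.numeric(object.size(flat))

# An R integer takes 4 bytes, so the payload alone is 4 * length bytes.
bytes_payload <- 4 * length(flat)

# What unlisting saves is the per-vector overhead (vector headers and
# the list's pointers), not the positions themselves.
overhead_per_region <- (size_list - size_flat) / n_regions

cat("list size (bytes):          ", size_list, "\n")
cat("flat vector size (bytes):   ", size_flat, "\n")
cat("payload, 4 * length (bytes):", bytes_payload, "\n")
cat("overhead per region (bytes):", overhead_per_region, "\n")
```

If this holds for the real object, almost all of the memory is the integers themselves, so only a more compact encoding of the positions could reduce it substantially.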

The vector of read start positions may look like this:

 > head(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878), 50)
  [1] 714086 714087 714088 714089 714099 714100 714106 714108 714110 714114
 [11] 714114 714123 714123 714123 714125 714125 714128 714130 714138 714139
 [21] 714145 714148 714149 714150 714151 714152 714154 714164 714164 714172
 [31] 714173 714184 714186 714187 714188 714189 714192 714194 714198 714204
 [41] 714206 714209 714209 714212 714216 714219 714219 714223 714224 714224

So a few reads share identical start sites, but this does not happen 
often. I do need all of this information for further processing.

Do you have any idea how I can store this information more efficiently so 
that the overall object size is reduced? I could try an Rle, but the 
structure of the data does not seem ideal for this...
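
To illustrate why run-length encoding looks unpromising here, the sketch below encodes the 50 start positions shown above. It uses base R's `rle()` as a stand-in for the Bioconductor `Rle` class, so the exact byte counts for an `Rle` would differ, but the run counts would be the same.

```r
# The 50 read start positions shown above, as an integer vector.
x <- c(714086L, 714087L, 714088L, 714089L, 714099L, 714100L, 714106L,
       714108L, 714110L, 714114L, 714114L, 714123L, 714123L, 714123L,
       714125L, 714125L, 714128L, 714130L, 714138L, 714139L, 714145L,
       714148L, 714149L, 714150L, 714151L, 714152L, 714154L, 714164L,
       714164L, 714172L, 714173L, 714184L, 714186L, 714187L, 714188L,
       714189L, 714192L, 714194L, 714198L, 714204L, 714206L, 714209L,
       714209L, 714212L, 714216L, 714219L, 714219L, 714223L, 714224L,
       714224L)

# Base R run-length encoding (stand-in for S4Vectors' Rle, an assumption).
r <- rle(x)
n_runs <- length(r$values)

# rle() keeps one run length and one value per run, i.e. 2 * n_runs
# integers, so it only pays off when runs are long on average.
stored_ints <- 2L * n_runs

cat("original values:", length(x), "\n")
cat("runs:           ", n_runs, "\n")
cat("integers stored:", stored_ints, "\n")
cat("plain vector (bytes):", as.numeric(object.size(x)), "\n")
cat("rle object (bytes):  ", as.numeric(object.size(r)), "\n")

# The encoding is lossless: decoding gives back the original vector.
stopifnot(identical(inverse.rle(r), x))
```

With so few repeated start sites, the encoding needs more integers than the plain vector, which is why I doubt an Rle would help.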

Any tips are very much appreciated!

Thanks,
Christian

-- 
—————————————————————————
Christian Arnold, PhD
Staff Bioinformatician

SCB Unit - Computational Biology
Joint appointment Genome Biology
Joint appointment European Bioinformatics Institute (EMBL-EBI)

European Molecular Biology Laboratory (EMBL)
Meyerhofstrasse 1; 69117, Heidelberg, Germany

Email: christian.arnold at embl.de
Phone: +49(0)6221-387-8472
Web: http://www.zaugg.embl.de/


