[Bioc-devel] Reducing memory footprint of large object
Christian Arnold
christian.arnold at embl.de
Thu Nov 5 16:22:07 CET 2015
Hi all,
I am hoping that the experts on this list can offer some advice on the
following problem:
I received a large SNPhood object (from the SNPhood package, which I
developed) that comes from an analysis of about 200,000 SNPs. It stores
many read counts as well as the positions of the overlapping reads. In
total, the object is 2 GB in size. I examined the object and identified
the slots that use the most memory. One such slot stores a nested list
holding the read start positions of all overlapping reads for each SNP
region.
For example, for one individual, this is a list of 120,049 integer
vectors containing 20,853,838 elements in total:
> format(object.size(SNPhood.o@internal$readStartPos$ambiguous$GM12878),
units = "Mb")
[1] "86 Mb"
Unsurprisingly, unlisting improves this only slightly:
> format(object.size(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878)),
units = "Mb")
[1] "79.6 Mb"
> length(SNPhood.o@internal$readStartPos$ambiguous$GM12878)
[1] 120049
> length(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878))
[1] 20853838
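As a rough check on these numbers, the sketch below reproduces the arithmetic and shows the flat-vector-plus-lengths layout that unlisting implies. It assumes 64-bit R, where an integer takes 4 bytes; the `toy` list and all variable names are made up for illustration and are not part of SNPhood.

```r
# Figures reported above for one individual
n_vectors  <- 120049      # list elements (one integer vector per SNP region)
n_integers <- 20853838    # read start positions in total

# R stores each integer in 4 bytes, so the bare payload is:
payload_mb <- n_integers * 4 / 1024^2
round(payload_mb, 1)      # matches the size of the unlisted vector

# What remains of the 86 Mb is per-vector overhead (headers, pointers);
# this comes out at roughly 56 bytes per list element
overhead_per_vector <- (86 - payload_mb) * 1024^2 / n_vectors

# Toy illustration (hypothetical data): one flat vector plus the lengths
# of the original elements carries the same information as the list
toy  <- list(c(714086L, 714087L, 714088L), c(714099L, 714100L), integer(0))
flat <- unlist(toy, use.names = FALSE)
len  <- lengths(toy)

# Rebuild the list; the factor levels keep empty regions in place
grp     <- factor(rep(seq_along(len), len), levels = seq_along(len))
rebuilt <- unname(split(flat, grp))
identical(rebuilt, toy)
```

In other words, almost all of the memory is the integers themselves, so dropping the list structure can only save the per-vector overhead.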
The vector of read start positions may look like this:
> head(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878),50)
[1] 714086 714087 714088 714089 714099 714100 714106 714108 714110
714114 714114 714123 714123 714123 714125 714125 714128 714130 714138
714139 714145 714148 714149 714150 714151 714152 714154 714164 714164
714172 714173 714184 714186 714187 714188 714189 714192 714194 714198
714204 714206 714209 714209 714212 714216 714219 714219 714223 714224 714224
So there are a few reads with identical start sites, but this does not
occur too often. I indeed need all of this information for further
processing.
Do you have any ideas on how I could store this information more
efficiently, so that the overall object size is reduced? I could try an
Rle, but the structure of the data does not seem ideal for this...
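To make the concern about run-length encoding concrete, here is a small check on the 50 start positions shown above. It uses base R's rle() as a stand-in for the Bioconductor Rle class, on the assumption that both keep one value and one length per run; the variable names are illustrative only.

```r
# The 50 read start positions printed above
pos <- c(714086L, 714087L, 714088L, 714089L, 714099L, 714100L, 714106L,
         714108L, 714110L, 714114L, 714114L, 714123L, 714123L, 714123L,
         714125L, 714125L, 714128L, 714130L, 714138L, 714139L, 714145L,
         714148L, 714149L, 714150L, 714151L, 714152L, 714154L, 714164L,
         714164L, 714172L, 714173L, 714184L, 714186L, 714187L, 714188L,
         714189L, 714192L, 714194L, 714198L, 714204L, 714206L, 714209L,
         714209L, 714212L, 714216L, 714219L, 714219L, 714223L, 714224L,
         714224L)

# Run-length encode: consecutive identical positions collapse into one run
runs   <- rle(pos)
n_runs <- length(runs$values)

# A run-length encoding stores a value and a length for every run,
# so compare the number of stored integers in each representation
stored <- c(plain = length(pos), rle = 2L * n_runs)
stored
```

Here the 50 positions collapse to only 42 runs, so values plus lengths need 84 numbers instead of 50. With so few repeated start sites, run-length encoding would make this sample larger rather than smaller.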
Any tips are very much appreciated!
Thanks,
Christian
--
—————————————————————————
Christian Arnold, PhD
Staff Bioinformatician
SCB Unit - Computational Biology
Joint appointment Genome Biology
Joint appointment European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory (EMBL)
Meyerhofstrasse 1; 69117, Heidelberg, Germany
Email: christian.arnold at embl.de
Phone: +49(0)6221-387-8472
Web: http://www.zaugg.embl.de/