[Bioc-devel] Reducing memory footprint of large object

Fischer, Bernd b.fischer at dkfz-heidelberg.de
Thu Nov 5 17:41:56 CET 2015


Hi Christian,

you should have a look at packages that allow partial reading of data, e.g.
bigmemory, which keeps data on disk and loads it only partially into memory,
or implement partial reading yourself using HDF5 and rhdf5.
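
For example, a rough sketch of the rhdf5 route (the file and dataset names
are made up, and readStarts stands for one flat integer vector of read
start positions):

  library(rhdf5)

  ## write the positions once to an HDF5 file on disk
  h5createFile("readStartPos.h5")
  h5write(readStarts, "readStartPos.h5", "GM12878")

  ## later, read back only a slice instead of holding everything in RAM
  slice <- h5read("readStartPos.h5", "GM12878", index = list(1:1000))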

Best,

Bernd



> On 05.11.2015, at 16:22, Christian Arnold <christian.arnold at embl.de> wrote:
> 
> 
> Hi all,
> 
> I wanted to ask around on this list full of experts whether any of you 
> has advice about the following problem:
> 
> I got a large SNPhood object from someone (package SNPhood, which I 
> developed) from an analysis of roughly 200,000 SNPs. It stores lots of 
> read counts and the positions of overlapping reads. In total, the 
> object is 2 GB in size. I examined the object and identified the slots 
> that need the most memory. The largest of these slots stores a nested 
> list that saves the read start positions of all overlapping reads for 
> each SNP region.
> 
> For example, for one individual, a list of length 120,049 containing 
> integer vectors with 20,853,838 elements in total:
> 
>> format(object.size(SNPhood.o@internal$readStartPos$ambiguous$GM12878), units = "Mb")
> [1] "86 Mb"
> 
> Unsurprisingly, unlisting improves this only a little:
>> format(object.size(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878)), units = "Mb")
> [1] "79.6 Mb"
> 
>> length(SNPhood.o@internal$readStartPos$ambiguous$GM12878)
> [1] 120049
>> length(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878))
> [1] 20853838
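> 
> (As a sanity check on these numbers: 20,853,838 integers x 4 bytes is 
> about 79.6 MB, which matches the unlisted size exactly; the remaining 
> ~6.4 MB of the list version is roughly 120,049 vectors times ~56 bytes 
> of per-vector header, i.e. pure list overhead.)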
> 
> The vector of read start positions may look like this:
> 
> head(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878), 50)
>  [1] 714086 714087 714088 714089 714099 714100 714106 714108 714110 
> 714114 714114 714123 714123 714123 714125 714125 714128 714130 714138 
> 714139 714145 714148 714149 714150 714151 714152 714154 714164 714164 
> 714172 714173 714184 714186 714187 714188 714189 714192 714194 714198 
> 714204 714206 714209 714209 714212 714216 714219 714219 714223 714224 714224
> 
> So a few reads have identical start sites, but this does not occur 
> very often. I do need all of this information for further 
> processing.
> 
> Do you have any idea how I could store this information more efficiently 
> so that the overall object size is reduced? I could try an Rle, but the 
> structure of the data does not seem ideal for this...
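> 
> Something like this is what I would measure (a rough sketch, assuming 
> the IRanges package is available; x is just a shorthand used here):
> 
>   library(IRanges)   # also loads S4Vectors, which provides Rle
>   x <- SNPhood.o@internal$readStartPos$ambiguous$GM12878
> 
>   ## Rle on the flat positions: runs are short, so gains may be small
>   r <- Rle(unlist(x, use.names = FALSE))
>   format(object.size(r), units = "Mb")
> 
>   ## alternative idea: a CompressedIntegerList keeps one flat vector
>   ## plus a partitioning, dropping the per-vector header overhead of a
>   ## plain list while preserving the per-region grouping
>   il <- as(x, "IntegerList")
>   format(object.size(il), units = "Mb")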
> 
> Any tips are very much appreciated!
> 
> Thanks,
> Christian
> 
> -- 
> —————————————————————————
> Christian Arnold, PhD
> Staff Bioinformatician
> 
> SCB Unit - Computational Biology
> Joint appointment Genome Biology
> Joint appointment European Bioinformatics Institute (EMBL-EBI)
> 
> European Molecular Biology Laboratory (EMBL)
> Meyerhofstrasse 1; 69117, Heidelberg, Germany
> 
> Email: christian.arnold at embl.de
> Phone: +49(0)6221-387-8472
> Web: http://www.zaugg.embl.de/
> 