[Bioc-devel] Reducing memory footprint of large object

Stephanie M. Gogarten sdmorris at u.washington.edu
Fri Nov 6 01:22:00 CET 2015


gdsfmt is another option for storing large datasets on disk, similar to 
HDF5. Take a look at the SNPRelate, GWASTools, and SeqArray packages, 
which all use it to store genotype data.
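
For illustration, a minimal sketch of the general idea (file name and
data below are made up, not from your object):

library(gdsfmt)

# write a large integer vector to a compressed on-disk GDS file
f <- createfn.gds("readStartPos.gds")
add.gdsn(f, "GM12878", val = sample.int(1e6, 1e5),
         storage = "int32", compress = "ZIP")
closefn.gds(f)

# later, read back only a slice instead of loading the whole vector
f <- openfn.gds("readStartPos.gds")
head50 <- read.gdsn(index.gdsn(f, "GM12878"), start = 1, count = 50)
closefn.gds(f)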

Stephanie

On 11/5/15 8:41 AM, Fischer, Bernd wrote:
> Hi Christian,
>
> you should have a look at packages that allow for partial reading of data,
> e.g. bigmemory, which only loads parts of the data into memory, or implement
> partial reading yourself using HDF5 and rhdf5.
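>
> For illustration, a rough sketch of partial reading with rhdf5 (the
> file and dataset names here are invented):
>
> library(rhdf5)
> h5createFile("readStartPos.h5")
> # write a stand-in integer vector; in practice this would be the real data
> h5write(sample.int(1e6, 1e5), "readStartPos.h5", "GM12878")
> # read only the first 50 values from disk rather than the whole vector
> x <- h5read("readStartPos.h5", "GM12878", index = list(1:50))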
>
> Best,
>
> Bernd
>
>
>
>> On 05.11.2015, at 16:22, Christian Arnold <christian.arnold at embl.de> wrote:
>>
>>
>> Hi all,
>>
>> I wanted to ask around on this list full of experts whether any of you
>> have advice about the following problem:
>>
>> I got a large SNPhood object from someone (package SNPhood, which I
>> developed) from an analysis of roughly 200,000 SNPs; it stores read
>> counts and the positions of overlapping reads. In total, the object is
>> 2 GB. I examined the object and identified the slots that need the most
>> memory. The largest slot stores a nested list that saves the read start
>> positions of all overlapping reads for each SNP region.
>>
>> For example, for one individual, a list of length 120,049 of integer
>> vectors, holding 20,853,838 elements in total:
>>
>>> format(object.size(SNPhood.o@internal$readStartPos$ambiguous$GM12878), units = "Mb")
>> [1] "86 Mb"
>>
>> Unsurprisingly, unlisting improves this only a little:
>>
>>> format(object.size(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878)), units = "Mb")
>> [1] "79.6 Mb"
>>
>>> length(SNPhood.o@internal$readStartPos$ambiguous$GM12878)
>> [1] 120049
>>> length(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878))
>> [1] 20853838
>>
>> The vector of read start positions may look like this:
>>
>>> head(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878), 50)
>>   [1] 714086 714087 714088 714089 714099 714100 714106 714108 714110
>> 714114 714114 714123 714123 714123 714125 714125 714128 714130 714138
>> 714139 714145 714148 714149 714150 714151 714152 714154 714164 714164
>> 714172 714173 714184 714186 714187 714188 714189 714192 714194 714198
>> 714204 714206 714209 714209 714212 714216 714219 714219 714223 714224 714224
>>
>> So there are some reads with identical start sites, but this does not
>> happen too often. I do need all of this information for further
>> processing.
>>
>> Do you have any idea whether I can store this information more
>> efficiently, so that the overall object size is reduced? I could try an
>> Rle, but the structure of the data does not seem ideal for this...
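>>
>> For reference, a quick sketch of how one could check the Rle idea
>> (Rle comes from S4Vectors; I make no claim about the actual savings):
>>
>> library(S4Vectors)
>> pos <- unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878)
>> # run-length encoding only pays off when values repeat in long runs
>> format(object.size(Rle(pos)), units = "Mb")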
>>
>> Any tips are very much appreciated!
>>
>> Thanks,
>> Christian
>>
>> --
>> —————————————————————————
>> Christian Arnold, PhD
>> Staff Bioinformatician
>>
>> SCB Unit - Computational Biology
>> Joint appointment Genome Biology
>> Joint appointment European Bioinformatics Institute (EMBL-EBI)
>>
>> European Molecular Biology Laboratory (EMBL)
>> Meyerhofstrasse 1; 69117, Heidelberg, Germany
>>
>> Email: christian.arnold at embl.de
>> Phone: +49(0)6221-387-8472
>> Web: http://www.zaugg.embl.de/
>>


