[R] Manage huge database
Thomas Lumley
tlumley at u.washington.edu
Mon Sep 22 22:38:19 CEST 2008
On Mon, 22 Sep 2008, Martin Morgan wrote:
> "José E. Lozano" <lozalojo at jcyl.es> writes:
>
>>> Maybe you've not lurked on R-help for long enough :) Apologies!
>>
>> Probably.
>>
>>> So, how much "design" is in this data? If none, and what you've
>>> basically got is a 2000x500000 grid of numbers, then maybe a more raw
>>
>> Exactly, raw data, but a little more complex: since all 500,000 variables
>> are in text format, the line width is around 2,500,000 characters.
>> <snip>
>> It is genetic DNA data (genotyped individuals), hence the large number of
>> columns to analyze.
>
> The Bioconductor package snpMatrix is designed for this type of
> data. See
>
> http://www.bioconductor.org/packages/2.2/bioc/html/snpMatrix.html
>
> and if that looks promising
>
>> source('http://bioconductor.org/biocLite.R')
>> biocLite('snpMatrix')
>
> Likely you'll quickly want a 64-bit (Linux or Mac) machine.
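As a rough, hedged sketch of what using snpMatrix could look like once installed (the PLINK file names here are hypothetical, and the exact reader functions vary by package version, so check the package vignette):

```r
library(snpMatrix)

# Read genotypes from PLINK binary files (file names are hypothetical;
# see help(package = "snpMatrix") for the readers in your version).
geno <- read.plink("mydata.bed", "mydata.bim", "mydata.fam")

# Genotypes are stored one byte per call in a snp.matrix object,
# so a 2000 x 500000 grid needs on the order of 1 GB in memory --
# hence the suggestion of a 64-bit machine.
summary(geno)
```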
>
netCDF is another useful option -- we have been using the ncdf package for
large genomic datasets. We read the data in one person at a time and
write it to a netCDF file; for analysis we can then read back any subset.
Since we have imputed SNP data as well as measured SNPs, this comes to
about 2.5 million variables on 4,000 people for one of our data sets.
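The write-one-person, read-any-subset workflow above can be sketched with the ncdf package; the dimension sizes and the read.one.sample() helper below are illustrative assumptions, not the poster's actual code:

```r
library(ncdf)  # the (pre-ncdf4) package mentioned above

n.snps    <- 2500000   # illustrative sizes
n.samples <- 4000

# Define a 2-D variable: genotypes indexed by (SNP, sample).
snp.dim    <- dim.def.ncdf("snp",    units = "", vals = 1:n.snps)
sample.dim <- dim.def.ncdf("sample", units = "", vals = 1:n.samples)
geno.var   <- var.def.ncdf("genotype", units = "",
                           dim = list(snp.dim, sample.dim), missval = -1)
nc <- create.ncdf("genotypes.nc", geno.var)

# Write one person (one column) at a time, so the whole grid
# never has to fit in memory.
for (i in 1:n.samples) {
  g <- read.one.sample(i)   # hypothetical: returns n.snps genotype codes
  put.var.ncdf(nc, geno.var, g, start = c(1, i), count = c(n.snps, 1))
}

# Later, read an arbitrary subset: SNPs 1000..1999 for samples 1..100.
block <- get.var.ncdf(nc, geno.var,
                      start = c(1000, 1), count = c(1000, 100))
close.ncdf(nc)
```

The start/count arguments are what make the subset reads cheap: netCDF fetches only the requested hyperslab from disk rather than the whole array.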
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle