[R] reordering huge data file

Thomas Lumley tlumley at u.washington.edu
Tue Jan 22 02:24:47 CET 2008

On Mon, 21 Jan 2008, Boks, M.P.M. wrote:

> Dear R-experts,
> My problem is how to handle a 10GB data file containing genotype data. The
> file is in a particular format (Illumina final report) and needs to be
> altered and merged with phenotype data for further analysis.

If the data have all the SNPs for one individual, then all the SNPs for the next individual, and so on, you can read in 305000 lines of data at a time, look up the phenotype, then write out one line of output, e.g. with cat().
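
A minimal sketch of that read-a-block, write-a-line loop follows. It assumes (none of this is stated in the original question) a tab-separated file with no header and three columns: sample ID, SNP name, genotype. It also assumes the phenotypes sit in a named vector indexed by sample ID. The function name reshape_by_person and its arguments are hypothetical, and a real Illumina final report would first need its header block skipped.

```r
# Hypothetical helper: turn a long file (all SNPs for person 1, then all SNPs
# for person 2, ...) into one output line per person:
#   id <TAB> phenotype <TAB> genotype_1 <TAB> ... <TAB> genotype_n
reshape_by_person <- function(infile, outfile, pheno, n_snps) {
  con_in  <- file(infile, open = "r")
  con_out <- file(outfile, open = "w")
  on.exit({ close(con_in); close(con_out) })

  repeat {
    # An open connection remembers its position, so each call returns
    # the next n_snps lines, i.e. the next person's block.
    block <- readLines(con_in, n = n_snps)
    if (length(block) == 0) break            # end of file

    fields <- strsplit(block, "\t", fixed = TRUE)
    id     <- fields[[1]][1]                 # assumed: column 1 is sample ID
    geno   <- vapply(fields, function(f) f[3], character(1))

    # Look up the phenotype and write a single line for this person.
    cat(id, pheno[[id]], geno, file = con_out, sep = "\t")
    cat("\n", file = con_out)
  }
  invisible(NULL)
}

# Toy data: 2 people x 3 SNPs (the real call would use n_snps = 305000).
infile  <- tempfile()
outfile <- tempfile()
writeLines(c("A\trs1\tAA", "A\trs2\tAB", "A\trs3\tBB",
             "B\trs1\tAB", "B\trs2\tAB", "B\trs3\tAA"), infile)
pheno <- c(A = 1, B = 0)

reshape_by_person(infile, outfile, pheno, n_snps = 3)
result <- readLines(outfile)
```

Only one person's block is held in memory at a time, so the 10GB input never has to be loaded whole.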

As another approach, I've been using the ncdf package for handling Illumina genotype data (slightly larger datasets, and multiple phenotypes).  This has been faster and more compact than SQLite (because it doesn't need indexes to do random access by person and by SNP). It is then easy to write analyses by SNP (association tests), analyses by person (allele sharing, population structure), and even analyses by genomic region (all SNPs in chr9q21.3).
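
To illustrate the kind of random access described above, here is a rough sketch using ncdf4, which is swapped in as an assumption because it is the current successor to the ncdf package. The dimension and variable names, the 0/1/2 genotype coding, and the toy sizes are all hypothetical, not taken from any real pipeline.

```r
# Sketch only: genotypes stored as a SNP x person byte array in a netCDF file.
# Runs only if the ncdf4 package is installed.
if (requireNamespace("ncdf4", quietly = TRUE)) {
  n_snp <- 5
  n_person <- 4
  # Toy genotypes coded 0/1/2 (hypothetical coding); one column per person.
  geno <- matrix(c(0, 1, 2, 1, 0,
                   2, 2, 1, 0, 0,
                   1, 1, 1, 2, 0,
                   0, 0, 2, 1, 1), nrow = n_snp)

  snp_dim    <- ncdf4::ncdim_def("snp", units = "index", vals = seq_len(n_snp))
  person_dim <- ncdf4::ncdim_def("person", units = "index",
                                 vals = seq_len(n_person))
  geno_var   <- ncdf4::ncvar_def("genotype", units = "allele count",
                                 dim = list(snp_dim, person_dim),
                                 missval = -1, prec = "byte")

  path <- tempfile(fileext = ".nc")
  nc <- ncdf4::nc_create(path, geno_var)
  ncdf4::ncvar_put(nc, geno_var, geno)
  ncdf4::nc_close(nc)

  nc <- ncdf4::nc_open(path)
  # By SNP: SNP 3 for every person (count = -1 means "the whole dimension").
  snp3 <- ncdf4::ncvar_get(nc, "genotype", start = c(3, 1), count = c(1, -1))
  # By person: every SNP for person 2.
  person2 <- ncdf4::ncvar_get(nc, "genotype", start = c(1, 2), count = c(-1, 1))
  # By region: SNPs 2 to 4 for every person (assumes SNPs are stored in
  # genomic order, so a region is a contiguous index range).
  region <- ncdf4::ncvar_get(nc, "genotype", start = c(2, 1), count = c(3, -1))
  ncdf4::nc_close(nc)
}
```

Each read pulls only the requested slice from disk, so the same file serves per-SNP, per-person, and per-region analyses without a separate index.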


Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
