[R] Reshaping genetic data from long to wide

Thu Apr 13 01:18:21 CEST 2006

Farrel Buchinsky <fbuchins at wpahs.org> wrote:

> 2) Storing all the SNP data as a string seems quite clever and a space-saving
> way of doing it. However, if you were to analyze a whole chromosome at a
> time you would still be creating one almighty big table albeit only
> temporarily. Do you use R to run TDT analyses? If so, how are you setting up
> your data frames and then what commands do you issue to analyze what is in
> your dataframes?

A lot of how I do this is tied to how we've set up our genotype
database, so it would be hard to describe briefly and probably not much
use to you.  I often work with larger data sets (i.e., more than 200K
SNPs and more than 1000 individuals) where I really don't want to try
to manipulate the entire data set in memory at once.  Instead I'll
analyze the data in chunks and store those results back into the
database.
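
As a rough illustration of the chunking idea only (not our actual
database code), the loop below splits a list of SNP ids into fixed-size
chunks and processes one chunk at a time.  The ids, the chunk size, and
analyze.chunk() are all made up for the example, and the results are
collected in memory here rather than written back to a database.

```r
# Hypothetical SNP ids; in practice each chunk would be fetched from
# the database with its own query.
snp.ids <- sprintf("snp%04d", 1:25)
chunk.size <- 10

# Split the ids into consecutive chunks of at most chunk.size SNPs.
chunks <- split(snp.ids, ceiling(seq_along(snp.ids) / chunk.size))

# Placeholder analysis returning one row of results per SNP in the chunk.
analyze.chunk <- function(ids)
{
  data.frame(snp = ids, n.char = nchar(ids), stringsAsFactors = FALSE)
}

# Process one chunk at a time, so only that chunk's data is needed at once.
results <- do.call(rbind, lapply(chunks, analyze.chunk))
```

In a real setup the rbind() step would be replaced by writing each
chunk's results to a results table before moving on to the next chunk.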

I haven't specifically done TDT in R, though I know there is code out
there to do that.

> Currently I have my data such that I can access it from R through an ODBC
> connection to Microsoft Access which in turn has an ODBC connection to the
> Sybase database. Whether I go through strings or not, I still need to find a
> way that I can assemble it so that a program can systematically run a TDT
> analysis on all the loci. I can see how strings help me in my storing of the
> data but that is already a fait accompli. Can you explain to me how it would
> help me with sequential analysis of each locus? Do you have any history
> files so that I can see what you were doing?

I don't think a history file would be very helpful here since what I
do is mostly buried in libraries that are dependent on our database
system.  Basically, I first build up a data frame of sample details,
and a second data frame with SNP information that also contains a
column of genotype strings.  Then I do an lapply() on that genotype
column, calling a function that unpacks the genotypes, appends them as
a new column to the sample details table, and then scores that SNP.
So something like:

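  # load.sample.info() and load.snp.data() are placeholders for our
  # database-specific loaders
  samples <- load.sample.info()   # one row per individual
  snps <- load.snp.data()         # one row per SNP, with a genotype string
  fn <- function(gstr)
  {
    # one character per individual, in the same order as the rows of samples
    g <- strsplit(as.character(gstr), "")[[1]]
    # "0" (missing) matches none of the levels and so becomes NA
    g <- factor(g, levels=1:3, labels=c("AA", "AB", "BB"))
    samples$genotype <- g         # modifies a local copy only
    # drop missing genotypes so both models are fitted to the same rows
    d <- samples[!is.na(samples$genotype), ]
    m1 <- lm(height~age+gender*genotype, d)
    m2 <- lm(height~age+gender, d)
    anova(m2, m1)                 # test the genotype terms
  }
  results <- lapply(snps$genotypes, fn)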

(here I've coded the genotypes as a string like '121231012310' where
1..3 are genotypes AA, AB, BB and 0 is missing data, and if I actually
cared about what nucleotides correspond to alleles 'A' and 'B', I
would have columns for that in the SNP table)
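
To make the encoding concrete, here is the unpacking step on its own,
applied to the example string above.  It uses only base R; the variable
names are just for illustration.

```r
genotype.string <- "121231012310"

# One single-character code per individual.
codes <- strsplit(genotype.string, "")[[1]]

# Codes 1..3 map to AA, AB, BB; "0" matches no level and becomes NA.
g <- factor(codes, levels = c("1", "2", "3"), labels = c("AA", "AB", "BB"))

# Tabulate the genotypes, counting missing values as well.
counts <- table(g, useNA = "ifany")
```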

If you are doing haplotype analysis and need to manipulate data for
multiple SNPs at once, then this would get more complicated, but I
still think the string format is a convenient one, and is quicker to
reshape into other formats than the "long" format.  
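
For example, going from genotype strings to a "wide" samples-by-SNPs
matrix takes only a few lines.  This is a minimal sketch with a made-up
SNP table and made-up sample ids, assuming every string has one
character per sample in the same sample order.

```r
# Hypothetical SNP table: one row per SNP, one character per sample.
snps <- data.frame(
  snp = c("rs1", "rs2", "rs3"),
  genotypes = c("1230", "2211", "0312"),
  stringsAsFactors = FALSE
)
sample.ids <- c("s1", "s2", "s3", "s4")

# Split each string into integer codes; sapply() binds the equal-length
# vectors as columns, giving a samples x SNPs matrix.
wide <- sapply(strsplit(snps$genotypes, ""), as.integer)

# Code 0 means missing data.
wide[wide == 0] <- NA
dimnames(wide) <- list(sample.ids, snps$snp)
```

From there as.data.frame(wide) gives one column per SNP, which can be
bound onto the sample details table with cbind().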

-- Dave