[R] Using PCA to correct p-values from snpMatrix

Justin Reese jtr4v at yahoo.com
Mon Jan 3 04:46:51 CET 2011


Hi R-help folks,
I have been doing some single SNP association work using snpMatrix. This works 
well, but produces a lot of false positives, because of population structure in 
my data. I would like to correct the p-values (which snpMatrix gives me) for 
population structure, possibly using principle component analysis (PCA). 

My data is complicated, so here's a simple example of what I'd like to do:
    
# 3x8 matrix of example snp data, 1 = allele A, -1 = allele B, 0 = hetero       
                            
snp.data = matrix(                               
        c(0,1,0,-1,-1,1,1,-1,0,1,1,0,0,1,0,-1,-1,NA,0,-1,0,0,1,0),  
        nrow=3,  
        dimnames = list(   
                 c("bob", "frita", "trudy"),     
                 c("snp1", "snp2", "snp3", "snp4", "snp5","snp6", "snp7", 
"snp8") 
                 )   
         )

# phenotype data - resistant or susceptible to zombie infection
phenotype.data = matrix( 
        c("bob", "frita", "trudy", 
        "resistant", "susceptible", "resistant"),
        nrow=3,   
        dimnames = list(                                           
                 c("bob", "frita", "trudy"),   
                 c("rowNames", "cc")   
                 )      
         )

library("snpMatrix") 
# add one in the following line so genotypes are 0,1 or 2, so single.snp.tests() 
method doesn't complain
snp.matrix <- as(snp.data + 1,'snp.matrix')   
single.snp.assoc <- single.snp.tests(cc, data=as.data.frame(phenotype.data), 
snp.data=snp.matrix )

Okay, so now I have p-values for the association between SNPs and resistance to 
zombie infection.


I do this for PCA:
snp.data = replace( snp.data, is.na(snp.data), 0) # workaround, b/c I can't get 
prcomp to ignore NAs no matter what I do
pca = prcomp(snp.data)

So, the question is, can I use the PCA data to correct my p-values (in 
single.snp.association) for population structure? In my real data, population 
structure is causing a lot of type I errors (false positive SNPs). 

I have read of some standalone software that does this sort of thing, for 
example EIGENSTRAT:
http://www.biostat.jhsph.edu/~iruczins/teaching/misc/gwas/papers/price2006.pdf
But I'd like to stick to R if possible. 

Any advice/comments welcome.



More information about the R-help mailing list