[R] Deleting columns where the frequency of values are too disparate

Mon Jan 19 13:13:27 CET 2009

> Please consider the following "toy" data matrix example, called "x" 
> for simplicity. There are 20 different individuals  ("ID"), with 
> information about the alleles (A,T, G, C) at six different loci 
> ("Locus1" -  "Locus6") for each of these 20 individuals. At any 
> single locus (e.g., "Locus1" or "Locus2", ... or "Locus6"), the 
> individuals have either one allele (from the set of A,T,C,G) or one 
> other allele (from the set of A,T,C, G). For example, at Locus1 
> individuals have either the A or T allele only; at Locus2 the 
> individuals can have either C or G only; at Locus3 the individuals 
> can have either T or G only.
> 
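> ID Locus1 Locus2 Locus3 Locus4 Locus5 Locus6
> 1  A      G      T      A      A      C
> 2  A      G      G      A      C      C
> 3  A      C      G      G      C      C
> 4  A      C      G      G      C      C
> 5  A      G      G      G      A      C
> 6  T      G      G      G      C      C
> 7  T      C      G      G      C      C
> 8  T      C      G      G      A      C
> 9  T      G      G      G      C      C
> 10 T      C      G      G      C      C
> 11 A      G      G      G      A      C
> 12 A      C      G      G      C      C
> 13 A      G      G      G      C      C
> 14 A      G      G      G      A      C
> 15 A      C      G      G      C      C
> 16 T      C      G      G      C      C
> 17 T      G      G      G      A      C
> 18 T      G      G      G      C      C
> 19 T      G      G      G      C      C
> 20 T      C      G      G      A      C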
> 
> I want to delete any columns from the dataset where the rarer of the
> two alleles has a frequency of ten percent or less. In other words, 
> I would like to delete Locus3, Locus4, and Locus6 in this data 
> matrix, because the frequency of the rare allele is not greater than
> ten percent (and conversely, the frequency of the common allele is 
> not less than ninety percent). Please note that the frequency of the
> rare allele in Locus6 is equal to zero (conversely, the frequency of
> the common allele is equal to one hundred percent).
> 
> Would one of you know of a simple way to write this sort of code? (In 
> my real dataset, there are 1096 loci, so this cannot be done easily 
> "by eye.")

Most of the problem is just organising the data into a sensible form.

# read in data
data <- readLines(tc <- textConnection("1AGTAAC
2AGGACC
3ACGGCC
4ACGGCC
5AGGGAC
6TGGGCC
7TCGGCC
8TCGGAC
9TGGGCC
10TCGGCC
11AGGGAC
12ACGGCC
13AGGGCC
14AGGGAC
15ACGGCC
16TCGGCC
17TGGGAC
18TGGGCC
19TGGGCC
20TCGGAC")); close(tc)

# retrieve the useful bit (strip the leading ID digits)
loci <- sub("^[[:digit:]]+", "", data)

# you may also want this (the ID numbers themselves, not the row positions)
ID <- as.integer(sub("^([[:digit:]]+).*$", "\\1", data))

# find out how many times each base occurs at each locus
freqs <- list()
n <- length(ID)
for(i in seq_len(nchar(loci[1])))
{
   assign(paste("locus", i, sep=""), factor(substring(loci,i,i),
      levels=c("A","C","G","T")))
   freqs[[i]] <- summary(get(paste("locus", i, sep="")))
}
freqs

# flag loci where one base accounts for 90% or more of cases
# (i.e. the rarer allele is at 10% or less)
loci.to.remove <- sapply(freqs, function(x) any(x >= 0.9*n))
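Note that loci.to.remove only flags the loci; nothing has been dropped yet. One way to do the actual removal is to hold the alleles in a character matrix and subset its columns. This is only a sketch: the names allele.matrix, major.freq, keep and filtered are my own, and the allele strings are retyped from the toy data so that the snippet runs by itself.

```r
# The 20 six-character allele strings from the toy data (IDs already stripped)
loci <- c("AGTAAC", "AGGACC", "ACGGCC", "ACGGCC", "AGGGAC",
          "TGGGCC", "TCGGCC", "TCGGAC", "TGGGCC", "TCGGCC",
          "AGGGAC", "ACGGCC", "AGGGCC", "AGGGAC", "ACGGCC",
          "TCGGCC", "TGGGAC", "TGGGCC", "TGGGCC", "TCGGAC")

# One row per individual, one column per locus
allele.matrix <- do.call(rbind, strsplit(loci, ""))
colnames(allele.matrix) <- paste("Locus", seq_len(ncol(allele.matrix)), sep = "")

# Proportion of individuals carrying the most common allele at each locus
major.freq <- apply(allele.matrix, 2,
                    function(column) max(table(column)) / length(column))

# Keep a locus only if its common allele is below 90%,
# i.e. its rarer allele is above 10%
keep <- major.freq < 0.9
filtered <- allele.matrix[, keep, drop = FALSE]

colnames(filtered)
# [1] "Locus1" "Locus2" "Locus5"
```

Because nothing here depends on the number of columns, the same lines should carry over to the 1096-locus dataset once it is in matrix form.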

Regards,
Richie.

Mathematical Sciences Unit
HSL
