[R] Deleting columns where the frequency of values are too disparate
Richard.Cotton at hsl.gov.uk
Mon Jan 19 13:13:27 CET 2009
> Please consider the following "toy" data matrix example, called "x"
> for simplicity. There are 20 different individuals ("ID"), with
> information about the alleles (A,T, G, C) at six different loci
> ("Locus1" - "Locus6") for each of these 20 individuals. At any
> single locus (e.g., "Locus1" or "Locus2", ... or "Locus6"), the
> individuals have either one allele (from the set of A,T,C,G) or one
> other allele (from the set of A,T,C, G). For example, at Locus1
> individuals have either the A or T allele only; at Locus2 the
> individuals can have either C or G only; at Locus3 the individuals
> can have either T or G only.
>
> IDLocus1Locus2Locus3Locus4Locus5Locus6
> 1AGTAAC
> 2AGGACC
> 3ACGGCC
> 4ACGGCC
> 5AGGGAC
> 6TGGGCC
> 7TCGGCC
> 8TCGGAC
> 9TGGGCC
> 10TCGGCC
> 11AGGGAC
> 12ACGGCC
> 13AGGGCC
> 14AGGGAC
> 15ACGGCC
> 16TCGGCC
> 17TGGGAC
> 18TGGGCC
> 19TGGGCC
> 20TCGGAC
>
> I want to delete any columns from the dataset where the rarer of the
> two alleles has a frequency of ten percent or less. In other words,
> I would like to delete Locus3, Locus4, and Locus6 in this data
> matrix, because the frequency of the rare allele is not greater than
> ten percent (and conversely, the frequency of the common allele is
> not less than ninety percent). Please note that the frequency of the
> rare allele in Locus6 is equal to zero (conversely, the frequency of
> the common allele is equal to one hundred percent).
>
> Would one of you know of a simple way to write this sort of code? (In
> my real dataset, there are 1096 loci, so this cannot be done easily "by
> eye.")
Most of the problem is just organising the data into a sensible form.
# read in data
data <- readLines(tc <- textConnection("1AGTAAC
2AGGACC
3ACGGCC
4ACGGCC
5AGGGAC
6TGGGCC
7TCGGCC
8TCGGAC
9TGGGCC
10TCGGCC
11AGGGAC
12ACGGCC
13AGGGCC
14AGGGAC
15ACGGCC
16TCGGCC
17TGGGAC
18TGGGCC
19TGGGCC
20TCGGAC")); close(tc)
# retrieve the useful bit (strip the leading ID digits)
loci <- sub("^[[:digit:]]{1,2}", "", data)
# you may also want this (the numeric ID at the start of each line)
ID <- as.integer(sub("[^[:digit:]]+$", "", data))
# find out how many of each base occurs at each locus
freqs <- list()
n <- length(ID)
for(i in 1:6)
{
  assign(paste("locus", i, sep=""),
         factor(substring(loci, i, i), levels=c("A","C","G","T")))
  freqs[[i]] <- summary(get(paste("locus", i, sep="")))
}
freqs
# flag loci with 90% or more cases of same base
# (>= rather than >, so a locus at exactly 90%, such as Locus4, is flagged)
loci.to.remove <- sapply(freqs, function(x) any(x >= 0.9*n))
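If you would rather hold the alleles in a character matrix and actually drop the columns, something along these lines might do. It is only a sketch on the toy data: the names raw, x, minor.count, too.rare and x.kept are my own, and it assumes each line is an ID immediately followed by one letter per locus.

```r
# Toy data from the question: ID digits fused to six allele letters.
raw <- c("1AGTAAC", "2AGGACC", "3ACGGCC", "4ACGGCC", "5AGGGAC",
         "6TGGGCC", "7TCGGCC", "8TCGGAC", "9TGGGCC", "10TCGGCC",
         "11AGGGAC", "12ACGGCC", "13AGGGCC", "14AGGGAC", "15ACGGCC",
         "16TCGGCC", "17TGGGAC", "18TGGGCC", "19TGGGCC", "20TCGGAC")

# Strip the leading ID, then split each string into one letter per locus.
alleles <- sub("^[[:digit:]]+", "", raw)
x <- do.call(rbind, strsplit(alleles, ""))
colnames(x) <- paste0("Locus", seq_len(ncol(x)))
rownames(x) <- sub("[^[:digit:]]+$", "", raw)

# Count of the rarer allele at each locus; a locus where only one
# allele is present has a rare-allele count of zero.
minor.count <- apply(x, 2, function(col) {
  counts <- table(col)
  if (length(counts) < 2) 0L else min(counts)
})

# Flag loci whose rare allele makes up 10% or less of the individuals.
# Comparing 10 * count with n keeps the boundary test in whole numbers.
too.rare <- 10 * minor.count <= nrow(x)

# Keep the remaining columns; drop = FALSE preserves the matrix shape.
x.kept <- x[, !too.rare, drop = FALSE]
colnames(x.kept)   # Locus1, Locus2 and Locus5 remain
```

Nothing here is hard-coded to six loci, so the same steps should carry over to 1096 columns once the real data are in a matrix.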
Regards,
Richie.
Mathematical Sciences Unit
HSL