[BioC] Bioc Interface for 1,000 Genomes SNP Frequencies

Tue Jul 17 19:09:41 CEST 2012

On 07/17/2012 08:36 AM, Valerie Obenchain wrote:
> Hi Tim,
>
> 1000 Genomes data can be read in with readVcf(). The 'param' argument 
> allows you to specify chromosomes and/or genome positions you are 
> interested in
> Using a sample file from 1000 genomes,
>
>   library(VariantAnnotation)
>   fl <- 
> "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/AFR.2of4intersection_allele_freq.20100804.genotypes.vcf.gz"
>   genomes <- readVcf(TabixFile(fl), "hg19", param=GRanges("12", 
> IRanges(101266000, 101422000)))
>
> The data are read into a VCF obejct. See ?VCF for class details. If a 
> snp has an rsid, it will appear as the rowname. If not, the rowname is 
> a concatenation of chromosome:position.
>
> > genomes
> class: VCF
> dim: 1325 174
> genome: hg19
> exptData(1): header
> fixed(4): REF ALT QUAL FILTER
> info(6): AF DP ... AFR_R2 ASN_R2
> geno(7): AD DP ... GD OG
> rownames(1325): rs75462666 rs78597949 ... 12:101421722 rs67962959
> rowData values names(1): paramRangeID
> colnames(174): HG00553 HG00554 ... NA19982 NA20414
> colData names(1): Samples
>
> We don't have a function that computes the frequencies but that can be 
> accomplished fairly easily. The genotype data can be accessed with the 
> 'geno' function.
>
> > geno(genomes)
> SimpleList of length 7
> names(7): AD DP GL GQ GT GD OG
>
> > geno(genomes)$GT[1:5,1:3]
>              HG00553 HG00554 HG00637
> rs75462666   "0|0"   "0|0"   "0|0"
> rs78597949   "0|0"   "0|0"   "0|0"
> rs78693793   "0|0"   "0|0"   "0|0"
> 12:101266725 "0|0"   "0|0"   "0|0"
> 12:101266729 "0|0"   "0|0"   "0|0"
>
> The frequency of the alternate allele is computed as  [ 0|1 + 2*(1|1)] 
> / 2*(total number of individuals).  Maybe something like,
>
>   heterozyg <- apply(geno(genomes)$GT, 2, function(x) x %in% c("0|1", 
> "1|0"))
>   homozyg <- apply(geno(genomes)$GT, 2, function(x) x %in% "1|1")
>   nsub <- ncol(geno(genomes)$GT)
>    freq <- (heterozyg + 2*homozyg) / 2*nsub
>

or this more elegant solution from Martin,

 > m = geno(vcf)$GT
 > freq = rowSums((m == "0|1" | m == "1|0") / 2 + (m == "1|1")) / ncol(m)

which saves a couple of apply's.

Valerie

> Valerie
>
> On 07/17/2012 08:09 AM, Vincent Carey wrote:
>> Val Obenchain can reply more definitively, but, briefly, 
>> VariantAnnotation
>> package has a scanVcf capability that
>> allows survey of vcf archives with managed memory consumption.  If 
>> you do
>> not see enough details in the vignette
>> to solve this use case efficiently, let us know.
>>
>> On Tue, Jul 17, 2012 at 10:53 AM, Timothy Duff<timothy.duff at ncf.edu>  
>> wrote:
>>
>>> Hi. I have a set of rsIDs for which I'm interested in obtaining allele
>>> frequencies for (in a particular population surveyed by 1kGP.) The
>>> corresponding .vcf files are pretty memory consumptive. Do there 
>>> currently
>>> exist basic BC tools that could help with this sort of thing? Thank 
>>> you all
>>> for your consideration.
>>>
>>> -- 
>>> Tim
>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>     [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor