[BioC] Bioc Interface for 1,000 Genomes SNP Frequencies
Valerie Obenchain
vobencha at fhcrc.org
Tue Jul 17 19:09:41 CEST 2012
On 07/17/2012 08:36 AM, Valerie Obenchain wrote:
> Hi Tim,
>
> 1000 Genomes data can be read in with readVcf(). The 'param' argument
> allows you to specify chromosomes and/or genome positions you are
> interested in
> Using a sample file from 1000 genomes,
>
> library(VariantAnnotation)
> fl <-
> "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/AFR.2of4intersection_allele_freq.20100804.genotypes.vcf.gz"
> genomes <- readVcf(TabixFile(fl), "hg19", param=GRanges("12",
> IRanges(101266000, 101422000)))
>
> The data are read into a VCF obejct. See ?VCF for class details. If a
> snp has an rsid, it will appear as the rowname. If not, the rowname is
> a concatenation of chromosome:position.
>
> > genomes
> class: VCF
> dim: 1325 174
> genome: hg19
> exptData(1): header
> fixed(4): REF ALT QUAL FILTER
> info(6): AF DP ... AFR_R2 ASN_R2
> geno(7): AD DP ... GD OG
> rownames(1325): rs75462666 rs78597949 ... 12:101421722 rs67962959
> rowData values names(1): paramRangeID
> colnames(174): HG00553 HG00554 ... NA19982 NA20414
> colData names(1): Samples
>
> We don't have a function that computes the frequencies but that can be
> accomplished fairly easily. The genotype data can be accessed with the
> 'geno' function.
>
> > geno(genomes)
> SimpleList of length 7
> names(7): AD DP GL GQ GT GD OG
>
> > geno(genomes)$GT[1:5,1:3]
> HG00553 HG00554 HG00637
> rs75462666 "0|0" "0|0" "0|0"
> rs78597949 "0|0" "0|0" "0|0"
> rs78693793 "0|0" "0|0" "0|0"
> 12:101266725 "0|0" "0|0" "0|0"
> 12:101266729 "0|0" "0|0" "0|0"
>
> The frequency of the alternate allele is computed as [ 0|1 + 2*(1|1)]
> / 2*(total number of individuals). Maybe something like,
>
> heterozyg <- apply(geno(genomes)$GT, 2, function(x) x %in% c("0|1",
> "1|0"))
> homozyg <- apply(geno(genomes)$GT, 2, function(x) x %in% "1|1")
> nsub <- ncol(geno(genomes)$GT)
> freq <- (heterozyg + 2*homozyg) / 2*nsub
>
or this more elegant solution from Martin,
> m = geno(vcf)$GT
> freq = rowSums((m == "0|1" | m == "1|0") / 2 + (m == "1|1")) / ncol(m)
which saves a couple of apply's.
Valerie
> Valerie
>
> On 07/17/2012 08:09 AM, Vincent Carey wrote:
>> Val Obenchain can reply more definitively, but, briefly,
>> VariantAnnotation
>> package has a scanVcf capability that
>> allows survey of vcf archives with managed memory consumption. If
>> you do
>> not see enough details in the vignette
>> to solve this use case efficiently, let us know.
>>
>> On Tue, Jul 17, 2012 at 10:53 AM, Timothy Duff<timothy.duff at ncf.edu>
>> wrote:
>>
>>> Hi. I have a set of rsIDs for which I'm interested in obtaining allele
>>> frequencies for (in a particular population surveyed by 1kGP.) The
>>> corresponding .vcf files are pretty memory consumptive. Do there
>>> currently
>>> exist basic BC tools that could help with this sort of thing? Thank
>>> you all
>>> for your consideration.
>>>
>>> --
>>> Tim
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list