[BioC] how to convert genotype snp matrix to nucleotide genotypes?
Stephanie M. Gogarten
sdmorris at u.washington.edu
Mon May 13 17:58:22 CEST 2013
You might have better luck using the methods in VariantAnnotation to
access the genotype matrix, rather than converting to SnpMatrix. For
example:
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
geno <- geno(vcf)$GT
geno
NA00001 NA00002 NA00003
rs6054257 "0|0" "1|0" "1/1"
20:17330 "0|0" "0|1" "0/0"
rs6040355 "1|2" "2|1" "2/2"
20:1230237 "0|0" "0|0" "0/0"
microsat1 "0/1" "0/2" "1/1"
ref <- ref(vcf)
alt <- alt(vcf)
geno2 <- geno
for (i in 1:nrow(geno)) {
geno2[i,] <- gsub("0", as.character(ref[i]), geno[i,])
for (j in 1:elementLengths(alt[i])) {
geno2[i,] <- gsub(as.character(j),
as.character(alt[[i]][j]),
geno2[i,])
}
}
geno2
NA00001 NA00002 NA00003
rs6054257 "G|G" "A|G" "A/A"
20:17330 "T|T" "T|A" "T/T"
rs6040355 "G|T" "T|G" "T/T"
20:1230237 "T|T" "T|T" "T/T"
microsat1 "GTC/G" "GTC/GTCT" "G/G"
There is probably a much more efficient way to do the string
substitution than my nested for loop, but this gives you the idea.
Stephanie
On 5/12/13 6:47 PM, Vincent Carey wrote:
> On Sat, May 11, 2013 at 9:18 PM, Tereza Jezkova UNLV <
> jezkovat at unlv.nevada.edu> wrote:
>
>> Hi Vincent,
>>
>> Thanks so much for your help. I did what you suggested but the resulting
>> matrix has only two values: NA (majority of cells) or A/B. There is no A/A
>> or B/B. SO I know I did something wrong. This is my entire code:
>>
>>> fl <- system.file("extdata", "Lizard_std.vcf",
>> package="VariantAnnotation")
>>> vcf <- readVcf(fl, "1342gen_fasta")
>>> mat <- genotypeToSnpMatrix(vcf)
>> Warning messages:
>> 1: In .local(x, ...) : variants with >1 ALT allele are set to NA
>> 2: In .local(x, ...) : non-single nucleotide variations are set to NA
>> 3: In .local(x, ...) : non-diploid variants are set to NA
>>
>
> could it be that your genome does not have many diallelic loci? you may
> need to work directly with
> the genotype data and not the SnpMatrix representation.
>
>
>>> MAT <- as(mat$genotypes, "character")
>>> MAT_TRAN <- t(MAT)
>>> write.csv(MAT_TRAN, "mat.csv")
>>
>>
>> The resulting matrix (when opened in excel looks like this. I assume it
>> has something to do with the warning messages I received but I am not sure
>> what to do.
>> 1 2 3 4 5 6 7 8 9 10 RADid_0000001_depth_39:0000000058 NA NA NA NA NA
>> NA NA NA A/B NA RADid_0000003_depth_152:0000000007 NA NA NA NA NA NA NA NA
>> NA NA RADid_0000003_depth_152:0000000034 NA NA NA NA NA NA NA NA NA NA
>> RADid_0000003_depth_152:0000000046 NA NA NA A/B NA A/B NA NA NA NA
>> RADid_0000010_depth_57:0000000010 NA NA NA NA NA NA NA A/B NA NA
>> RADid_0000010_depth_57:0000000019 A/B A/B NA NA NA NA NA NA NA NA
>> RADid_0000010_depth_57:0000000020 A/B A/B A/B A/B NA NA NA A/B NA NA
>> RADid_0000010_depth_57:0000000059 A/B A/B NA A/B NA NA A/B NA NA NA
>> RADid_0000010_depth_57:0000000062 NA NA A/B NA NA NA NA NA NA NA
>>
>>
>> Will be very grateful for any advice.
>>
>> Thanks, Tereza
>>
>>
>>
>>
>>
>> *From:* Vincent Carey <stvjc at channing.harvard.edu>
>> *Sent:* Friday, May 10, 2013 7:51 PM
>> *To:* Tereza Jezkova UNLV <jezkovat at unlv.nevada.edu>
>> *Cc:* bioconductor at r-project.org
>> *Subject:* Re: [BioC] how to convert genotype snp matrix to nucleotide
>> genotypes?
>>
>>
>>
>> On Fri, May 10, 2013 at 10:02 PM, Tereza Jezkova UNLV <
>> jezkovat at unlv.nevada.edu> wrote:
>>
>>> I created a Snp matrix using a genotypeToSnpMatrix command
>>>
>>> The matrix looks like this:
>>>
>>>> mat
>>> $genotypes
>>> A SnpMatrix with 10 rows and 50581 columns
>>> Row names: Rodriguez_Lizard_1240_sequence_1_pileup.txt ...
>>> Rodriguez_Lizard_623_sequence_1_pileup.txt
>>> Col names: RADid_0000001_depth_39:0000000058 ...
>>> RADid_0078132_depth_33:0000000081
>>>
>>> $map
>>> DataFrame with 50581 rows and 4 columns
>>> snp.names allele.1
>>> allele.2 ignore
>>> <character> <DNAStringSet>
>>> <DNAStringSetList> <logical>
>>> 1 RADid_0000001_depth_39:0000000058 C
>>> A FALSE
>>> 2 RADid_0000003_depth_152:0000000007 G
>>> A,T TRUE
>>> 3 RADid_0000003_depth_152:0000000034 G
>>> T,C TRUE
>>> 4 RADid_0000003_depth_152:0000000046 T
>>> C FALSE
>>> 5 RADid_0000010_depth_57:0000000010 T
>>> C FALSE
>>> ... ... ...
>>> ... ...
>>> 50577 RADid_0078129_depth_31:0000000062 C
>>> T FALSE
>>> 50578 RADid_0078132_depth_33:0000000025 T
>>> C FALSE
>>> 50579 RADid_0078132_depth_33:0000000033 C
>>> T FALSE
>>> 50580 RADid_0078132_depth_33:0000000044 C
>>> A FALSE
>>> 50581 RADid_0078132_depth_33:0000000081 C
>>> T FALSE
>>>
>>>
>>> How do I convert my matrix to a a nucleotide genotype matrix?
>>> I would like my data to look something like:
>>>
>>> Sample 1 Snp 1 T/T
>>> Sample 2 Snp 1 T/A
>>> Sample 3 Snp 1 A/A etc.
>>>
>>>
>>
>> as(mat, "character") will yield a conforming matrix with "A/B" notation
>> in each cell. the
>> representation is only useful for diallelic SNP
>>
>> from ?read.snps.long
>>
>> For nucleotide coding, nucleotides are assigned to the nominal alleles
>> in alphabetic order. Thus, for a SNP with either "T" and "A"
>> nucleotides in the variant position,
>> the nominal genotypes AA, AB and BB will refer to A/A,
>> A/T and T/T.
>>
>> provided this convention is observed in the VCF translation, you could use
>> string substitutions
>> to transform the A/B notation to nucleotide notation. oftentimes this is
>> not really needed.
>>
>>
>>> I noticed that people have been creating their Snp matrix from a txt
>>> file. I am importing from a vcf file and I can’t figure out how to get the
>>> desired format.
>>>
>>> Thanks a lot for your kind help,
>>>
>>> Tereza
>>> [[alternative HTML version deleted]]
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>
> [[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list