[BioC] how to convert genotype snp matrix to nucleotide genotypes?
Martin Morgan
mtmorgan at fhcrc.org
Mon May 13 19:42:16 CEST 2013
On 05/13/2013 08:58 AM, Stephanie M. Gogarten wrote:
> You might have better luck using the methods in VariantAnnotation to access the
> genotype matrix, rather than converting to SnpMatrix. For example:
>
> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> vcf <- readVcf(fl, "hg19")
> geno <- geno(vcf)$GT
> geno
> NA00001 NA00002 NA00003
> rs6054257 "0|0" "1|0" "1/1"
> 20:17330 "0|0" "0|1" "0/0"
> rs6040355 "1|2" "2|1" "2/2"
> 20:1230237 "0|0" "0|0" "0/0"
> microsat1 "0/1" "0/2" "1/1"
>
> ref <- ref(vcf)
> alt <- alt(vcf)
> geno2 <- geno
> for (i in 1:nrow(geno)) {
> geno2[i,] <- gsub("0", as.character(ref[i]), geno[i,])
> for (j in 1:elementLengths(alt[i])) {
> geno2[i,] <- gsub(as.character(j),
> as.character(alt[[i]][j]),
> geno2[i,])
> }
> }
> geno2
> NA00001 NA00002 NA00003
> rs6054257 "G|G" "A|G" "A/A"
> 20:17330 "T|T" "T|A" "T/T"
> rs6040355 "G|T" "T|G" "T/T"
> 20:1230237 "T|T" "T|T" "T/T"
> microsat1 "GTC/G" "GTC/GTCT" "G/G"
>
> There is probably a much more efficient way to do the string substitution than
> my nested for loop, but this gives you the idea.
Maybe the following gets away from iterating? The idea is to update the "0|0"
elements in a vectorized fashion, then the "0|1", then the "1|1" (dealing with
elementLength(alt(vcf)) != 1 is left as an exercise...!).
geno2geno <-
function(vcf)
{
## standardize
idx <- elementLengths(alt(vcf)) == 1L
if (!all(idx)) {
warning("only coercing single-element 'alt' records")
vcf <- vcf[idx]
}
geno(vcf)$GT <- sub("/", "|", geno(vcf)$GT)
## replacement rule
.replace <- function(GT, patterns, alleleA, alleleB)
{
repl <- paste(alleleA, alleleB, sep="|") # replacement value
idx0 <- which(GT %in% patterns) # update these entries in GT...
idx <- (idx0 - 1L) %% nrow(GT) + 1L # ...with these entries in repl
GT[idx0] <- repl[idx]
GT
}
## replace
GT <- geno(vcf)$GT
GT <- .replace(GT, "0|0", ref(vcf), ref(vcf))
## FIXME: loop over subset with alt(vcf) > 1 ?
GT <- .replace(GT, c("0|1", "1|0"), ref(vcf), unlist(alt(vcf)))
GT <- .replace(GT, "1|1", unlist(alt(vcf)), unlist(alt(vcf)))
geno(vcf)$GT <- GT
vcf
}
Mostly I tried not to separate out, say, unlist(alt(vcf)) into a separate
variable, even though this might be more efficient / less typing, because it
opens the door for mixing up the ordering of genotype and alt rows.
> geno(geno2geno(vcf))$GT
NA00001 NA00002 NA00003
rs6054257 "G|G" "G|A" "A|A"
20:17330 "T|T" "T|A" "T|T"
20:1230237 "T|T" "T|T" "T|T"
Warning message:
In geno2geno(vcf) : only coercing single-element 'alt' records
Martin
>
> Stephanie
>
> On 5/12/13 6:47 PM, Vincent Carey wrote:
>> On Sat, May 11, 2013 at 9:18 PM, Tereza Jezkova UNLV <
>> jezkovat at unlv.nevada.edu> wrote:
>>
>>> Hi Vincent,
>>>
>>> Thanks so much for your help. I did what you suggested but the resulting
>>> matrix has only two values: NA (majority of cells) or A/B. There is no A/A
>>> or B/B. SO I know I did something wrong. This is my entire code:
>>>
>>>> fl <- system.file("extdata", "Lizard_std.vcf",
>>> package="VariantAnnotation")
>>>> vcf <- readVcf(fl, "1342gen_fasta")
>>>> mat <- genotypeToSnpMatrix(vcf)
>>> Warning messages:
>>> 1: In .local(x, ...) : variants with >1 ALT allele are set to NA
>>> 2: In .local(x, ...) : non-single nucleotide variations are set to NA
>>> 3: In .local(x, ...) : non-diploid variants are set to NA
>>>
>>
>> could it be that your genome does not have many diallelic loci? you may
>> need to work directly with
>> the genotype data and not the SnpMatrix representation.
>>
>>
>>>> MAT <- as(mat$genotypes, "character")
>>>> MAT_TRAN <- t(MAT)
>>>> write.csv(MAT_TRAN, "mat.csv")
>>>
>>>
>>> The resulting matrix (when opened in excel looks like this. I assume it
>>> has something to do with the warning messages I received but I am not sure
>>> what to do.
>>> 1 2 3 4 5 6 7 8 9 10 RADid_0000001_depth_39:0000000058 NA NA NA NA NA
>>> NA NA NA A/B NA RADid_0000003_depth_152:0000000007 NA NA NA NA NA NA NA NA
>>> NA NA RADid_0000003_depth_152:0000000034 NA NA NA NA NA NA NA NA NA NA
>>> RADid_0000003_depth_152:0000000046 NA NA NA A/B NA A/B NA NA NA NA
>>> RADid_0000010_depth_57:0000000010 NA NA NA NA NA NA NA A/B NA NA
>>> RADid_0000010_depth_57:0000000019 A/B A/B NA NA NA NA NA NA NA NA
>>> RADid_0000010_depth_57:0000000020 A/B A/B A/B A/B NA NA NA A/B NA NA
>>> RADid_0000010_depth_57:0000000059 A/B A/B NA A/B NA NA A/B NA NA NA
>>> RADid_0000010_depth_57:0000000062 NA NA A/B NA NA NA NA NA NA NA
>>>
>>>
>>> Will be very grateful for any advice.
>>>
>>> Thanks, Tereza
>>>
>>>
>>>
>>>
>>>
>>> *From:* Vincent Carey <stvjc at channing.harvard.edu>
>>> *Sent:* Friday, May 10, 2013 7:51 PM
>>> *To:* Tereza Jezkova UNLV <jezkovat at unlv.nevada.edu>
>>> *Cc:* bioconductor at r-project.org
>>> *Subject:* Re: [BioC] how to convert genotype snp matrix to nucleotide
>>> genotypes?
>>>
>>>
>>>
>>> On Fri, May 10, 2013 at 10:02 PM, Tereza Jezkova UNLV <
>>> jezkovat at unlv.nevada.edu> wrote:
>>>
>>>> I created a Snp matrix using a genotypeToSnpMatrix command
>>>>
>>>> The matrix looks like this:
>>>>
>>>>> mat
>>>> $genotypes
>>>> A SnpMatrix with 10 rows and 50581 columns
>>>> Row names: Rodriguez_Lizard_1240_sequence_1_pileup.txt ...
>>>> Rodriguez_Lizard_623_sequence_1_pileup.txt
>>>> Col names: RADid_0000001_depth_39:0000000058 ...
>>>> RADid_0078132_depth_33:0000000081
>>>>
>>>> $map
>>>> DataFrame with 50581 rows and 4 columns
>>>> snp.names allele.1
>>>> allele.2 ignore
>>>> <character> <DNAStringSet>
>>>> <DNAStringSetList> <logical>
>>>> 1 RADid_0000001_depth_39:0000000058 C
>>>> A FALSE
>>>> 2 RADid_0000003_depth_152:0000000007 G
>>>> A,T TRUE
>>>> 3 RADid_0000003_depth_152:0000000034 G
>>>> T,C TRUE
>>>> 4 RADid_0000003_depth_152:0000000046 T
>>>> C FALSE
>>>> 5 RADid_0000010_depth_57:0000000010 T
>>>> C FALSE
>>>> ... ... ...
>>>> ... ...
>>>> 50577 RADid_0078129_depth_31:0000000062 C
>>>> T FALSE
>>>> 50578 RADid_0078132_depth_33:0000000025 T
>>>> C FALSE
>>>> 50579 RADid_0078132_depth_33:0000000033 C
>>>> T FALSE
>>>> 50580 RADid_0078132_depth_33:0000000044 C
>>>> A FALSE
>>>> 50581 RADid_0078132_depth_33:0000000081 C
>>>> T FALSE
>>>>
>>>>
>>>> How do I convert my matrix to a a nucleotide genotype matrix?
>>>> I would like my data to look something like:
>>>>
>>>> Sample 1 Snp 1 T/T
>>>> Sample 2 Snp 1 T/A
>>>> Sample 3 Snp 1 A/A etc.
>>>>
>>>>
>>>
>>> as(mat, "character") will yield a conforming matrix with "A/B" notation
>>> in each cell. the
>>> representation is only useful for diallelic SNP
>>>
>>> from ?read.snps.long
>>>
>>> For nucleotide coding, nucleotides are assigned to the nominal alleles
>>> in alphabetic order. Thus, for a SNP with either "T" and "A"
>>> nucleotides in the variant position,
>>> the nominal genotypes AA, AB and BB will refer to A/A,
>>> A/T and T/T.
>>>
>>> provided this convention is observed in the VCF translation, you could use
>>> string substitutions
>>> to transform the A/B notation to nucleotide notation. oftentimes this is
>>> not really needed.
>>>
>>>
>>>> I noticed that people have been creating their Snp matrix from a txt
>>>> file. I am importing from a vcf file and I can’t figure out how to get the
>>>> desired format.
>>>>
>>>> Thanks a lot for your kind help,
>>>>
>>>> Tereza
>>>> [[alternative HTML version deleted]]
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>>
>>
>> [[alternative HTML version deleted]]
>>
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list