[BioC] how to convert genotype snp matrix to nucleotide genotypes?

Mon May 13 19:42:16 CEST 2013

On 05/13/2013 08:58 AM, Stephanie M. Gogarten wrote:
> You might have better luck using the methods in VariantAnnotation to access the
> genotype matrix, rather than converting to SnpMatrix.  For example:
>
> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> vcf <- readVcf(fl, "hg19")
> geno <- geno(vcf)$GT
> geno
>             NA00001 NA00002 NA00003
> rs6054257  "0|0"   "1|0"   "1/1"
> 20:17330   "0|0"   "0|1"   "0/0"
> rs6040355  "1|2"   "2|1"   "2/2"
> 20:1230237 "0|0"   "0|0"   "0/0"
> microsat1  "0/1"   "0/2"   "1/1"
>
> ref <- ref(vcf)
> alt <- alt(vcf)
> geno2 <- geno
> for (i in 1:nrow(geno)) {
>    geno2[i,] <- gsub("0", as.character(ref[i]), geno[i,])
>    for (j in 1:elementLengths(alt[i])) {
>      geno2[i,] <- gsub(as.character(j),
>                        as.character(alt[[i]][j]),
>                        geno2[i,])
>    }
> }
> geno2
>             NA00001 NA00002    NA00003
> rs6054257  "G|G"   "A|G"      "A/A"
> 20:17330   "T|T"   "T|A"      "T/T"
> rs6040355  "G|T"   "T|G"      "T/T"
> 20:1230237 "T|T"   "T|T"      "T/T"
> microsat1  "GTC/G" "GTC/GTCT" "G/G"
>
> There is probably a much more efficient way to do the string substitution than
> my nested for loop, but this gives you the idea.

Maybe the following gets away from iterating? The idea is to update the "0|0" 
elements in a vectorized fashion, then the "0|1", then the "1|1" (dealing with 
elementLength(alt(vcf)) != 1 is left as an exercise...!).

geno2geno <-
     function(vcf)
{
     ## standardize
     idx <- elementLengths(alt(vcf)) == 1L
     if (!all(idx)) {
         warning("only coercing single-element 'alt' records")
         vcf <- vcf[idx]
     }
     geno(vcf)$GT <- sub("/", "|", geno(vcf)$GT)

     ## replacement rule
     .replace <- function(GT, patterns, alleleA, alleleB)
     {
         repl <- paste(alleleA, alleleB, sep="|") # replacement value
         idx0 <- which(GT %in% patterns)          # update these entries in GT...
         idx <- (idx0 - 1L) %% nrow(GT) + 1L      # ...with these entries in repl
         GT[idx0] <- repl[idx]
         GT
     }

     ## replace
     GT <- geno(vcf)$GT
     GT <- .replace(GT, "0|0", ref(vcf), ref(vcf))
     ## FIXME: loop over subset with alt(vcf) > 1 ?
     GT <- .replace(GT, c("0|1", "1|0"), ref(vcf), unlist(alt(vcf)))
     GT <- .replace(GT, "1|1", unlist(alt(vcf)), unlist(alt(vcf)))
     geno(vcf)$GT <- GT

     vcf
}

Mostly I tried not to separate out, say, unlist(alt(vcf)) into a separate 
variable, even though this might be more efficient / less typing, because it 
opens the door for mixing up the ordering of genotype and alt rows.

 > geno(geno2geno(vcf))$GT
            NA00001 NA00002 NA00003
rs6054257  "G|G"   "G|A"   "A|A"
20:17330   "T|T"   "T|A"   "T|T"
20:1230237 "T|T"   "T|T"   "T|T"
Warning message:
In geno2geno(vcf) : only coercing single-element 'alt' records

Martin

>
> Stephanie
>
> On 5/12/13 6:47 PM, Vincent Carey wrote:
>> On Sat, May 11, 2013 at 9:18 PM, Tereza Jezkova UNLV <
>> jezkovat at unlv.nevada.edu> wrote:
>>
>>>    Hi Vincent,
>>>
>>> Thanks so much for your help. I did what you suggested but the resulting
>>> matrix has only two values: NA (majority of cells) or A/B. There is no A/A
>>> or B/B. SO I know I did something wrong. This is my entire code:
>>>
>>>> fl <- system.file("extdata", "Lizard_std.vcf",
>>> package="VariantAnnotation")
>>>> vcf <- readVcf(fl, "1342gen_fasta")
>>>> mat <- genotypeToSnpMatrix(vcf)
>>> Warning messages:
>>> 1: In .local(x, ...) : variants with >1 ALT allele are set to NA
>>> 2: In .local(x, ...) : non-single nucleotide variations are set to NA
>>> 3: In .local(x, ...) : non-diploid variants are set to NA
>>>
>>
>> could it be that your genome does not have many diallelic loci?  you may
>> need to work directly with
>> the genotype data and not the SnpMatrix representation.
>>
>>
>>>> MAT <- as(mat$genotypes, "character")
>>>> MAT_TRAN <- t(MAT)
>>>> write.csv(MAT_TRAN, "mat.csv")
>>>
>>>
>>> The resulting matrix (when opened in excel looks like this. I assume it
>>> has something to do with the warning messages I received but I am not sure
>>> what to do.
>>>     1 2 3 4 5 6 7 8 9 10 RADid_0000001_depth_39:0000000058 NA NA NA NA NA
>>> NA NA NA A/B NA RADid_0000003_depth_152:0000000007 NA NA NA NA NA NA NA NA
>>> NA NA RADid_0000003_depth_152:0000000034 NA NA NA NA NA NA NA NA NA NA
>>> RADid_0000003_depth_152:0000000046 NA NA NA A/B NA A/B NA NA NA NA
>>> RADid_0000010_depth_57:0000000010 NA NA NA NA NA NA NA A/B NA NA
>>> RADid_0000010_depth_57:0000000019 A/B A/B NA NA NA NA NA NA NA NA
>>> RADid_0000010_depth_57:0000000020 A/B A/B A/B A/B NA NA NA A/B NA NA
>>> RADid_0000010_depth_57:0000000059 A/B A/B NA A/B NA NA A/B NA NA NA
>>> RADid_0000010_depth_57:0000000062 NA NA A/B NA NA NA NA NA NA NA
>>>
>>>
>>> Will be very grateful for any advice.
>>>
>>> Thanks, Tereza
>>>
>>>
>>>
>>>
>>>
>>>   *From:* Vincent Carey <stvjc at channing.harvard.edu>
>>> *Sent:* Friday, May 10, 2013 7:51 PM
>>> *To:* Tereza Jezkova UNLV <jezkovat at unlv.nevada.edu>
>>> *Cc:* bioconductor at r-project.org
>>> *Subject:* Re: [BioC] how to convert genotype snp matrix to nucleotide
>>> genotypes?
>>>
>>>
>>>
>>> On Fri, May 10, 2013 at 10:02 PM, Tereza Jezkova UNLV <
>>> jezkovat at unlv.nevada.edu> wrote:
>>>
>>>> I created a Snp matrix using a genotypeToSnpMatrix command
>>>>
>>>> The matrix looks like this:
>>>>
>>>>> mat
>>>> $genotypes
>>>> A SnpMatrix with  10 rows and  50581 columns
>>>> Row names:  Rodriguez_Lizard_1240_sequence_1_pileup.txt ...
>>>> Rodriguez_Lizard_623_sequence_1_pileup.txt
>>>> Col names:  RADid_0000001_depth_39:0000000058 ...
>>>> RADid_0078132_depth_33:0000000081
>>>>
>>>> $map
>>>> DataFrame with 50581 rows and 4 columns
>>>>                                 snp.names       allele.1
>>>> allele.2    ignore
>>>>                               <character> <DNAStringSet>
>>>> <DNAStringSetList> <logical>
>>>> 1      RADid_0000001_depth_39:0000000058              C
>>>> A     FALSE
>>>> 2     RADid_0000003_depth_152:0000000007              G
>>>> A,T      TRUE
>>>> 3     RADid_0000003_depth_152:0000000034              G
>>>> T,C      TRUE
>>>> 4     RADid_0000003_depth_152:0000000046              T
>>>> C     FALSE
>>>> 5      RADid_0000010_depth_57:0000000010              T
>>>> C     FALSE
>>>> ...                                  ...            ...
>>>> ...       ...
>>>> 50577  RADid_0078129_depth_31:0000000062              C
>>>> T     FALSE
>>>> 50578  RADid_0078132_depth_33:0000000025              T
>>>> C     FALSE
>>>> 50579  RADid_0078132_depth_33:0000000033              C
>>>> T     FALSE
>>>> 50580  RADid_0078132_depth_33:0000000044              C
>>>> A     FALSE
>>>> 50581  RADid_0078132_depth_33:0000000081              C
>>>> T     FALSE
>>>>
>>>>
>>>> How do I convert my matrix to a a nucleotide genotype matrix?
>>>> I would like my data to look something like:
>>>>
>>>> Sample 1 Snp 1 T/T
>>>> Sample 2 Snp 1 T/A
>>>> Sample 3 Snp 1 A/A etc.
>>>>
>>>>
>>>
>>> as(mat, "character")  will yield a conforming matrix with "A/B" notation
>>> in each cell.  the
>>> representation is only useful for diallelic SNP
>>>
>>> from ?read.snps.long
>>>
>>>    For nucleotide coding, nucleotides are assigned to the nominal alleles
>>>    in alphabetic order. Thus, for a SNP with either  "T" and "A"
>>>    nucleotides in the variant position,
>>>    the nominal genotypes AA, AB and BB will refer to A/A,
>>>    A/T and T/T.
>>>
>>> provided this convention is observed in the VCF translation, you could use
>>> string substitutions
>>> to transform the A/B notation to nucleotide notation.  oftentimes this is
>>> not really needed.
>>>
>>>
>>>> I noticed that people have been creating their Snp matrix from a txt
>>>> file. I am importing from a vcf file and I can’t figure out how to get the
>>>> desired format.
>>>>
>>>> Thanks a lot for your kind help,
>>>>
>>>> Tereza
>>>>          [[alternative HTML version deleted]]
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>>
>>
>>     [[alternative HTML version deleted]]
>>
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793