[BioC] Finding coding SNPs with predictCoding

Sat Mar 3 01:58:17 CET 2012

Hi Valerie,

Based on my experience the position in the complete protein (rather than
CDS) sequence would be the most important piece of information to record
here. For instance, if a SNP changes the 88th triplet CAT (His) in the
ORF of a transcript to CAA (Gln) then you want to record it like this:
His88 (CAT) -> Gln88 (CAA). This way the user can map this change to a
protein structure or inject it into the corresponding protein sequence
without any additional remapping or coding. 

Another feature to consider are SNPs affecting splice sites (commonly 
first and last two nucleotides of an intron). 

If possible it would be also very useful to support users who want to
work with their custom genomes and annotations files provided as FASTA
and GFF/GTF files, respectively.

Best,

Thomas

On Fri, Mar 02, 2012 at 11:31:57PM +0000, Valerie Obenchain wrote:
> Slight change to this -
> 
> I'm now returning the following new columns,
> 
>    \item{\code{seqTxLoc}}{
>        Location in transcript-based coordinates of the first nucleotide in
>        the codon sequence to be translated. This position corresponds to the
>        first nucleotide in both the \code{refSeq} and \code{varSeq} columns.
>      }
>      \item{\code{varTxLoc}}{
>        Location in transcript-based coordinates of the first nucleotide in
>        the variant. This value will be the same as \code{seqTxLoc} when the
>        variant starts exactly at the beginning of a codon.
>      }
>      \item{\code{varCdsLoc}}{
>        Location in cds-based coordinates of the first nucleotide in
>        the variant. This position is relative to the start of the cds region
>        defined in the \code{subject} annotation.
>      }
>      \item{\code{subjStrand}}{
>        The strand of the \code{subject} the variant matched. 
> \code{predictCoding}
>        determines which variants fall in a coding region by finding the 
> overlaps
>        between the \code{query} and \code{subject}. The \code{query} may be
>        un-stranded \sQuote{*} but the \code{subject} annotation will 
> have a strand.
>      }
> 
> 
> You are interested in 'protein coordinates'. Does 'varCdsLoc' described 
> above meet the need or are you looking for the actual codon number in 
> the coding sequence? I am interested in hearing more about what you are 
> doing with the protein coordinates, how you are using them. It would 
> help us better design future functions.
> 
> Thanks,
> Valerie
> 
> On 03/02/2012 01:11 AM, Alex Gutteridge wrote:
> > Thanks Valerie - much appreciated!
> >
> > On 01.03.2012 21:30, Valerie Obenchain wrote:
> >> A 'txLoc' column has been added to the output of predictCoding.
> >> Available in devel version 1.1.57.
> >>
> >> Valerie
> >>
> >>
> >> On 02/28/2012 08:20 AM, Valerie Obenchain wrote:
> >>> Good suggestion. Yes, predictCoding is does this internally. I'll 
> >>> post back here when this has been added.
> >>>
> >>> Valerie
> >>>
> >>>
> >>>
> >>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote:
> >>>> Hi Valerie,
> >>>>
> >>>> Thanks everything works great now. One small feature request - 
> >>>> would it be hard to output the protein sequence position of the 
> >>>> coding SNPs? At the moment once I've run predictCoding I'm 
> >>>> re-extracting the cds and working out the position of each coding 
> >>>> SNP so I can see where in the protein sequence it is, but it seems 
> >>>> like this is probably just replicating what predictCoding must be 
> >>>> doing internally anyway?
> >>>>
> >>>> Alex Gutteridge
> >>>
> >>>
> >>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote:
> >>>> Hi Alex,
> >>>>
> >>>> Thanks for the bug report. The cdsID was taken from an overlap 
> >>>> between the query and GRangesList of cds by transcripts. This gave 
> >>>> the correct transcript number but (incorrectly) took the first cds 
> >>>> number in the list by default. Now fixed in devel 1.1.55.
> >>>>
> >>>> I've also updated the man page.
> >>>>
> >>>> Valerie
> >>>>
> >>>>
> >>>>
> >>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote:
> >>>>> On 22.02.2012 18:58, Herv? Pag?s wrote:
> >>>>>> Hi Alex,
> >>>>>>
> >>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote:
> >>>>>
> >>>>> [...]
> >>>>>
> >>>>>>> But the predictCoding call gives this error:
> >>>>>>>
> >>>>>>> Error in .setSeqNames(x, value) :
> >>>>>>> The replacement value for isActiveSeq must be a logical vector, 
> >>>>>>> with
> >>>>>>> names that match the seqlevels of the object
> >>>>>>
> >>>>>> The error message doesn't help much but I think the pb is that you
> >>>>>> didn't rename chMT properly. Try to do this:
> >>>>>>
> >>>>>>   seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
> >>>>>>
> >>>>>> before you start the for(eg in entrez.ids){..} loop again.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> H.
> >>>>>
> >>>>> Thanks Herv? that nailed it. I'm having some difficulty joining up 
> >>>>> the output of predictCoding() with the query SNPs though. If 
> >>>>> someone could point out where the disconnect in my thinking is I 
> >>>>> would appreciate it!
> >>>>>
> >>>>> Here's my (now edited down) script:
> >>>>>
> >>>>> library(BSgenome.Hsapiens.UCSC.hg19)
> >>>>> library(VariantAnnotation)
> >>>>> library(SNPlocs.Hsapiens.dbSNP.20110815)
> >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> >>>>>
> >>>>> entrez.ids = c('6335')
> >>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene
> >>>>>
> >>>>> snps   = getSNPlocs(c("ch1","ch2"),as.GRanges=T)
> >>>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps))
> >>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
> >>>>>
> >>>>> gene.list = cdsBy(txdb19, by="gene")
> >>>>> vsd.list = gene.list[entrez.ids]
> >>>>> cds.list = cdsBy(txdb19,by="tx")
> >>>>>
> >>>>> eg = entrez.ids[1]
> >>>>>
> >>>>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]])))
> >>>>> eg.snps = snps[snp.idx]
> >>>>> iupac   = values(eg.snps)[,"alleles_as_ambig"]
> >>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac]))
> >>>>> variant.alleles = 
> >>>>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[1]]) 
> >>>>>
> >>>>>
> >>>>>
> >>>>> aa = 
> >>>>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=variant.alleles)
> >>>>>
> >>>>> #####
> >>>>>
> >>>>> Then if I query the predictCoding results in aa in an interactive 
> >>>>> session I get the following (see inline comments for what I think 
> >>>>> should be happening, but I must be misinterpreting what queryID 
> >>>>> means)
> >>>>>
> >>>>> The docs for predictCoding() contain a small typo 
> >>>>> (s/queryHits/queryID), but otherwise seem clear?
> >>>>>
> >>>>> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?,
> >>>>>      ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?.
> >>>>>
> >>>>>      ?queryHits? The ?queryHits? column provides a map back to the
> >>>>>           variants in the original ?query?. If the ?query? was a 
> >>>>> ?VCF?
> >>>>>           object this index corresponds to the row in the 
> >>>>> ?GRanges? in
> >>>>>           the ?rowData? slot. If ?query? was an expanded ?GRanges?,
> >>>>>           ?RangedData? or ?RangesList? the index corresponds to 
> >>>>> the row
> >>>>>           in the expanded object.
> >>>>>
> >>>>> #####
> >>>>>
> >>>>>> aa[1,]
> >>>>> DataFrame with 1 row and 9 columns
> >>>>>     queryID   consequence         refSeq         varSeq         refAA
> >>>>> <integer> <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
> >>>>> 1         1 nonsynonymous            CTC            ATC            L
> >>>>>           varAA        txID   geneID     cdsID
> >>>>> <AAStringSet> <character> <factor> <integer>
> >>>>> 1             I       10921     6335     33668
> >>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx 
> >>>>>> '10921' and cds '33668'.
> >>>>>> #If I look at the first query SNP I get this:
> >>>>>> eg.snps.exp[aa[1,'queryID'],]
> >>>>> GRanges with 1 range and 2 elementMetadata values:
> >>>>>       seqnames                 ranges strand |   RefSNP_id 
> >>>>> alleles_as_ambig
> >>>>> <Rle> <IRanges> <Rle> | <character> <character>
> >>>>>   [1]     chr2 [167055370, 167055370]      * |   111558968         
> >>>>>      R
> >>>>>   ---
> >>>>>   seqlengths:
> >>>>>     chr1  chr2  chr3  chr4  chr5  chr6 ... chr20 chr21 chr22  chrX 
> >>>>> chrY  chrM
> >>>>>       NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA 
> >>>>>   NA    NA
> >>>>>> #So SNP 1 is at 167055370 on chr2
> >>>>>> #But if I check tx '10921' I see that the cds overlapping 
> >>>>>> 167055370 is actually '33651'
> >>>>>> #And cds '33668' is at the other end of the tx:
> >>>>>> cds.list[[aa[1,'txID']]]
> >>>>> GRanges with 26 ranges and 3 elementMetadata values:
> >>>>>        seqnames                 ranges strand   |    cds_id    
> >>>>> cds_name
> >>>>> <Rle> <IRanges> <Rle>   | <integer> <character>
> >>>>>    [1]     chr2 [167168009, 167168266]      -   |     33668 <NA>
> >>>>>    [2]     chr2 [167163466, 167163584]      -   |     33667 <NA>
> >>>>>    [3]     chr2 [167163020, 167163109]      -   |     33666 <NA>
> >>>>>    [4]     chr2 [167162302, 167162430]      -   |     33647 <NA>
> >>>>>    [5]     chr2 [167160748, 167160839]      -   |     33646 <NA>
> >>>>>    [6]     chr2 [167159600, 167159812]      -   |     33645 <NA>
> >>>>>    [7]     chr2 [167151109, 167151172]      -   |     33644 <NA>
> >>>>>    [8]     chr2 [167149741, 167149882]      -   |     33643 <NA>
> >>>>>    [9]     chr2 [167144947, 167145153]      -   |     33642 <NA>
> >>>>>    ...      ...                    ...    ... ...       ...        
> >>>>> ...
> >>>>>   [18]     chr2 [167099012, 167099166]      -   |     33659 <NA>
> >>>>>   [19]     chr2 [167094604, 167094777]      -   |     33658 <NA>
> >>>>>   [20]     chr2 [167089850, 167089972]      -   |     33657 <NA>
> >>>>>   [21]     chr2 [167085201, 167085482]      -   |     33656 <NA>
> >>>>>   [22]     chr2 [167084180, 167084233]      -   |     33655 <NA>
> >>>>>   [23]     chr2 [167083077, 167083214]      -   |     33654 <NA>
> >>>>>   [24]     chr2 [167060870, 167060974]      -   |     33653 <NA>
> >>>>>   [25]     chr2 [167060465, 167060735]      -   |     33652 <NA>
> >>>>>   [26]     chr2 [167055182, 167056374]      -   |     33651 <NA>
> >>>>>        exon_rank
> >>>>> <integer>
> >>>>>    [1]         2
> >>>>>    [2]         3
> >>>>>    [3]         4
> >>>>>    [4]         5
> >>>>>    [5]         6
> >>>>>    [6]         7
> >>>>>    [7]         8
> >>>>>    [8]         9
> >>>>>    [9]        10
> >>>>>    ...       ...
> >>>>>   [18]        19
> >>>>>   [19]        20
> >>>>>   [20]        21
> >>>>>   [21]        22
> >>>>>   [22]        23
> >>>>>   [23]        24
> >>>>>   [24]        25
> >>>>>   [25]        26
> >>>>>   [26]        27
> >>>>>   ---
> >>>>>   seqlengths:
> >>>>>                     chr1                  chr2 ... 
> >>>>> chr18_gl000207_random
> >>>>>                249250621             243199373 ...                 
> >>>>> 4262
> >>>>>
> >>>>>
> >>>>
> >>>> _______________________________________________
> >>>> Bioconductor mailing list
> >>>> Bioconductor at r-project.org
> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>> Search the archives: 
> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives: 
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor