[BioC] Finding coding SNPs with predictCoding

Tue May 15 22:17:52 CEST 2012

Hi Thomas, Alex,

Marc has been working on a makeTranscriptDbFromGFF() function which is 
now in GenomicFeatures v 1.9.11 in devel. This addresses the last of 
Thomas' points below wrt the user providing annotations in GFF or GTF 
format. The function creates a TxDb which can be given to 
locateVariants() and predictCoding(). Let us know how it goes.

Valerie

On 04/05/2012 01:47 PM, Valerie Obenchain wrote:
> Hi Thomas, Alex,
>
> Changes below have been made in release 1.2.4 and devel 1.3.4.
>
> predictCoding :
>
> - Now returns a proteinID that is the triplet number in the protein
>
> - The refAA and varAA are the 1-letter amino acid code but not the 
> 3-letter code (i.e., 'G' but not 'Gly'). We don't currently have a 
> function that returns the 3-letter code. Is this of interest/use?
>
> - Over the next dev cycle I will implement methods for annotations 
> (subject argument) to be gff/gtf. Currently genomes (seqSource 
> argument) can be a BSgenome or fasta file.
>
>
> locateVariants:
>
> To add flexibility for finding variants in a particular region the 
> function now dispatches on a 'region' argument (e.g., 
> CodingVariants(), IntronVariants(), etc.). I've included a 
> SpliceSiteVariants() that counts ranges that overlap with any portion 
> of the first 2 or last 2 nucleotides of an intron. Called as,
>
>     locateVariants(query, subject, range=SpliceSiteVariants())
>
> Thanks again for the suggestions.
>
> Valerie
>
>
> On 03/05/2012 01:25 AM, Alex Gutteridge wrote:
>> Hi Valerie,
>>
>> I'd exactly echo what Thomas wrote (thanks Thomas!). My specific use 
>> case is mapping coding SNPs to PDB protein structures, so something 
>> that encodes 'His88 (CAT) -> Gln88 (CAA)' is all I need.
>>
>> Alex Gutteridge.
>>
>> On 03.03.2012 00:58, Thomas Girke wrote:
>>> Hi Valerie,
>>>
>>> Based on my experience the position in the complete protein (rather 
>>> than
>>> CDS) sequence would be the most important piece of information to 
>>> record
>>> here. For instance, if a SNP changes the 88th triplet CAT (His) in the
>>> ORF of a transcript to CAA (Gln) then you want to record it like this:
>>> His88 (CAT) -> Gln88 (CAA). This way the user can map this change to a
>>> protein structure or inject it into the corresponding protein sequence
>>> without any additional remapping or coding.
>>>
>>> Another feature to consider are SNPs affecting splice sites (commonly
>>> first and last two nucleotides of an intron).
>>>
>>> If possible it would be also very useful to support users who want to
>>> work with their custom genomes and annotations files provided as FASTA
>>> and GFF/GTF files, respectively.
>>>
>>> Best,
>>>
>>> Thomas
>>>
>>>
>>> On Fri, Mar 02, 2012 at 11:31:57PM +0000, Valerie Obenchain wrote:
>>>> Slight change to this -
>>>>
>>>> I'm now returning the following new columns,
>>>>
>>>>    \item{\code{seqTxLoc}}{
>>>>        Location in transcript-based coordinates of the first 
>>>> nucleotide in
>>>>        the codon sequence to be translated. This position 
>>>> corresponds to the
>>>>        first nucleotide in both the \code{refSeq} and \code{varSeq} 
>>>> columns.
>>>>      }
>>>>      \item{\code{varTxLoc}}{
>>>>        Location in transcript-based coordinates of the first 
>>>> nucleotide in
>>>>        the variant. This value will be the same as \code{seqTxLoc} 
>>>> when the
>>>>        variant starts exactly at the beginning of a codon.
>>>>      }
>>>>      \item{\code{varCdsLoc}}{
>>>>        Location in cds-based coordinates of the first nucleotide in
>>>>        the variant. This position is relative to the start of the 
>>>> cds region
>>>>        defined in the \code{subject} annotation.
>>>>      }
>>>>      \item{\code{subjStrand}}{
>>>>        The strand of the \code{subject} the variant matched.
>>>> \code{predictCoding}
>>>>        determines which variants fall in a coding region by finding 
>>>> the
>>>> overlaps
>>>>        between the \code{query} and \code{subject}. The 
>>>> \code{query} may be
>>>>        un-stranded \sQuote{*} but the \code{subject} annotation will
>>>> have a strand.
>>>>      }
>>>>
>>>>
>>>> You are interested in 'protein coordinates'. Does 'varCdsLoc' 
>>>> described
>>>> above meet the need or are you looking for the actual codon number in
>>>> the coding sequence? I am interested in hearing more about what you 
>>>> are
>>>> doing with the protein coordinates, how you are using them. It would
>>>> help us better design future functions.
>>>>
>>>> Thanks,
>>>> Valerie
>>>>
>>>> On 03/02/2012 01:11 AM, Alex Gutteridge wrote:
>>>> > Thanks Valerie - much appreciated!
>>>> >
>>>> > On 01.03.2012 21:30, Valerie Obenchain wrote:
>>>> >> A 'txLoc' column has been added to the output of predictCoding.
>>>> >> Available in devel version 1.1.57.
>>>> >>
>>>> >> Valerie
>>>> >>
>>>> >>
>>>> >> On 02/28/2012 08:20 AM, Valerie Obenchain wrote:
>>>> >>> Good suggestion. Yes, predictCoding is does this internally. I'll
>>>> >>> post back here when this has been added.
>>>> >>>
>>>> >>> Valerie
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote:
>>>> >>>> Hi Valerie,
>>>> >>>>
>>>> >>>> Thanks everything works great now. One small feature request -
>>>> >>>> would it be hard to output the protein sequence position of the
>>>> >>>> coding SNPs? At the moment once I've run predictCoding I'm
>>>> >>>> re-extracting the cds and working out the position of each coding
>>>> >>>> SNP so I can see where in the protein sequence it is, but it 
>>>> seems
>>>> >>>> like this is probably just replicating what predictCoding must be
>>>> >>>> doing internally anyway?
>>>> >>>>
>>>> >>>> Alex Gutteridge
>>>> >>>
>>>> >>>
>>>> >>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote:
>>>> >>>> Hi Alex,
>>>> >>>>
>>>> >>>> Thanks for the bug report. The cdsID was taken from an overlap
>>>> >>>> between the query and GRangesList of cds by transcripts. This 
>>>> gave
>>>> >>>> the correct transcript number but (incorrectly) took the first 
>>>> cds
>>>> >>>> number in the list by default. Now fixed in devel 1.1.55.
>>>> >>>>
>>>> >>>> I've also updated the man page.
>>>> >>>>
>>>> >>>> Valerie
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote:
>>>> >>>>> On 22.02.2012 18:58, Herv? Pag?s wrote:
>>>> >>>>>> Hi Alex,
>>>> >>>>>>
>>>> >>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote:
>>>> >>>>>
>>>> >>>>> [...]
>>>> >>>>>
>>>> >>>>>>> But the predictCoding call gives this error:
>>>> >>>>>>>
>>>> >>>>>>> Error in .setSeqNames(x, value) :
>>>> >>>>>>> The replacement value for isActiveSeq must be a logical 
>>>> vector,
>>>> >>>>>>> with
>>>> >>>>>>> names that match the seqlevels of the object
>>>> >>>>>>
>>>> >>>>>> The error message doesn't help much but I think the pb is 
>>>> that you
>>>> >>>>>> didn't rename chMT properly. Try to do this:
>>>> >>>>>>
>>>> >>>>>>   seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>>>> >>>>>>
>>>> >>>>>> before you start the for(eg in entrez.ids){..} loop again.
>>>> >>>>>>
>>>> >>>>>> Cheers,
>>>> >>>>>> H.
>>>> >>>>>
>>>> >>>>> Thanks Herv? that nailed it. I'm having some difficulty 
>>>> joining up
>>>> >>>>> the output of predictCoding() with the query SNPs though. If
>>>> >>>>> someone could point out where the disconnect in my thinking is I
>>>> >>>>> would appreciate it!
>>>> >>>>>
>>>> >>>>> Here's my (now edited down) script:
>>>> >>>>>
>>>> >>>>> library(BSgenome.Hsapiens.UCSC.hg19)
>>>> >>>>> library(VariantAnnotation)
>>>> >>>>> library(SNPlocs.Hsapiens.dbSNP.20110815)
>>>> >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>> >>>>>
>>>> >>>>> entrez.ids = c('6335')
>>>> >>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene
>>>> >>>>>
>>>> >>>>> snps   = getSNPlocs(c("ch1","ch2"),as.GRanges=T)
>>>> >>>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps))
>>>> >>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>>>> >>>>>
>>>> >>>>> gene.list = cdsBy(txdb19, by="gene")
>>>> >>>>> vsd.list = gene.list[entrez.ids]
>>>> >>>>> cds.list = cdsBy(txdb19,by="tx")
>>>> >>>>>
>>>> >>>>> eg = entrez.ids[1]
>>>> >>>>>
>>>> >>>>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]])))
>>>> >>>>> eg.snps = snps[snp.idx]
>>>> >>>>> iupac   = values(eg.snps)[,"alleles_as_ambig"]
>>>> >>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac]))
>>>> >>>>> variant.alleles =
>>>> >>>>> 
>>>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[1]]) 
>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> aa =
>>>> >>>>> 
>>>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=variant.alleles)
>>>> >>>>>
>>>> >>>>> #####
>>>> >>>>>
>>>> >>>>> Then if I query the predictCoding results in aa in an 
>>>> interactive
>>>> >>>>> session I get the following (see inline comments for what I 
>>>> think
>>>> >>>>> should be happening, but I must be misinterpreting what queryID
>>>> >>>>> means)
>>>> >>>>>
>>>> >>>>> The docs for predictCoding() contain a small typo
>>>> >>>>> (s/queryHits/queryID), but otherwise seem clear?
>>>> >>>>>
>>>> >>>>> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?,
>>>> >>>>>      ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?.
>>>> >>>>>
>>>> >>>>>      ?queryHits? The ?queryHits? column provides a map back 
>>>> to the
>>>> >>>>>           variants in the original ?query?. If the ?query? was a
>>>> >>>>> ?VCF?
>>>> >>>>>           object this index corresponds to the row in the
>>>> >>>>> ?GRanges? in
>>>> >>>>>           the ?rowData? slot. If ?query? was an expanded 
>>>> ?GRanges?,
>>>> >>>>>           ?RangedData? or ?RangesList? the index corresponds to
>>>> >>>>> the row
>>>> >>>>>           in the expanded object.
>>>> >>>>>
>>>> >>>>> #####
>>>> >>>>>
>>>> >>>>>> aa[1,]
>>>> >>>>> DataFrame with 1 row and 9 columns
>>>> >>>>>     queryID   consequence         refSeq         
>>>> varSeq        refAA
>>>> >>>>> <integer> <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
>>>> >>>>> 1         1 nonsynonymous            CTC            
>>>> ATC           L
>>>> >>>>>           varAA        txID   geneID     cdsID
>>>> >>>>> <AAStringSet> <character> <factor> <integer>
>>>> >>>>> 1             I       10921     6335     33668
>>>> >>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx
>>>> >>>>>> '10921' and cds '33668'.
>>>> >>>>>> #If I look at the first query SNP I get this:
>>>> >>>>>> eg.snps.exp[aa[1,'queryID'],]
>>>> >>>>> GRanges with 1 range and 2 elementMetadata values:
>>>> >>>>>       seqnames                 ranges strand |   RefSNP_id
>>>> >>>>> alleles_as_ambig
>>>> >>>>> <Rle> <IRanges> <Rle> | <character> <character>
>>>> >>>>>   [1]     chr2 [167055370, 167055370]      * |   111558968
>>>> >>>>>      R
>>>> >>>>>   ---
>>>> >>>>>   seqlengths:
>>>> >>>>>     chr1  chr2  chr3  chr4  chr5  chr6 ... chr20 chr21 chr22  
>>>> chrX
>>>> >>>>> chrY  chrM
>>>> >>>>>       NA    NA    NA    NA    NA    NA ...    NA    NA    
>>>> NA   NA
>>>> >>>>>   NA    NA
>>>> >>>>>> #So SNP 1 is at 167055370 on chr2
>>>> >>>>>> #But if I check tx '10921' I see that the cds overlapping
>>>> >>>>>> 167055370 is actually '33651'
>>>> >>>>>> #And cds '33668' is at the other end of the tx:
>>>> >>>>>> cds.list[[aa[1,'txID']]]
>>>> >>>>> GRanges with 26 ranges and 3 elementMetadata values:
>>>> >>>>>        seqnames                 ranges strand   |    cds_id
>>>> >>>>> cds_name
>>>> >>>>> <Rle> <IRanges> <Rle>   | <integer> <character>
>>>> >>>>>    [1]     chr2 [167168009, 167168266]      -   |     33668 <NA>
>>>> >>>>>    [2]     chr2 [167163466, 167163584]      -   |     33667 <NA>
>>>> >>>>>    [3]     chr2 [167163020, 167163109]      -   |     33666 <NA>
>>>> >>>>>    [4]     chr2 [167162302, 167162430]      -   |     33647 <NA>
>>>> >>>>>    [5]     chr2 [167160748, 167160839]      -   |     33646 <NA>
>>>> >>>>>    [6]     chr2 [167159600, 167159812]      -   |     33645 <NA>
>>>> >>>>>    [7]     chr2 [167151109, 167151172]      -   |     33644 <NA>
>>>> >>>>>    [8]     chr2 [167149741, 167149882]      -   |     33643 <NA>
>>>> >>>>>    [9]     chr2 [167144947, 167145153]      -   |     33642 <NA>
>>>> >>>>>    ...      ...                    ...    ... ...       ...
>>>> >>>>> ...
>>>> >>>>>   [18]     chr2 [167099012, 167099166]      -   |     33659 <NA>
>>>> >>>>>   [19]     chr2 [167094604, 167094777]      -   |     33658 <NA>
>>>> >>>>>   [20]     chr2 [167089850, 167089972]      -   |     33657 <NA>
>>>> >>>>>   [21]     chr2 [167085201, 167085482]      -   |     33656 <NA>
>>>> >>>>>   [22]     chr2 [167084180, 167084233]      -   |     33655 <NA>
>>>> >>>>>   [23]     chr2 [167083077, 167083214]      -   |     33654 <NA>
>>>> >>>>>   [24]     chr2 [167060870, 167060974]      -   |     33653 <NA>
>>>> >>>>>   [25]     chr2 [167060465, 167060735]      -   |     33652 <NA>
>>>> >>>>>   [26]     chr2 [167055182, 167056374]      -   |     33651 <NA>
>>>> >>>>>        exon_rank
>>>> >>>>> <integer>
>>>> >>>>>    [1]         2
>>>> >>>>>    [2]         3
>>>> >>>>>    [3]         4
>>>> >>>>>    [4]         5
>>>> >>>>>    [5]         6
>>>> >>>>>    [6]         7
>>>> >>>>>    [7]         8
>>>> >>>>>    [8]         9
>>>> >>>>>    [9]        10
>>>> >>>>>    ...       ...
>>>> >>>>>   [18]        19
>>>> >>>>>   [19]        20
>>>> >>>>>   [20]        21
>>>> >>>>>   [21]        22
>>>> >>>>>   [22]        23
>>>> >>>>>   [23]        24
>>>> >>>>>   [24]        25
>>>> >>>>>   [25]        26
>>>> >>>>>   [26]        27
>>>> >>>>>   ---
>>>> >>>>>   seqlengths:
>>>> >>>>>                     chr1                  chr2 ...
>>>> >>>>> chr18_gl000207_random
>>>> >>>>>                249250621             243199373 ...
>>>> >>>>> 4262
>>>> >>>>>
>>>> >>>>>
>>>> >>>>
>>>> >>>> _______________________________________________
>>>> >>>> Bioconductor mailing list
>>>> >>>> Bioconductor at r-project.org
>>>> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> >>>> Search the archives:
>>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> >>>
>>>> >>> _______________________________________________
>>>> >>> Bioconductor mailing list
>>>> >>> Bioconductor at r-project.org
>>>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> >>> Search the archives:
>>>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> >
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: 
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>