[BioC] Finding coding SNPs with predictCoding
Thomas Girke
thomas.girke at ucr.edu
Tue May 15 22:55:18 CEST 2012
Great, and thanks for doing this!
Creating TxDb objects directly from gff/gtf files has been for a long
time on my wish list. If this works now for most of the GFF/GTF files
provided by ncbi (ftp://ftp.ncbi.nih.gov/genomes/), ensembl
(http://useast.ensembl.org/info/data/ftp/index.html) or other genome
databases, then this will save many of us a lot of time and headache.
Thomas
On Tue, May 15, 2012 at 08:17:52PM +0000, Valerie Obenchain wrote:
> Hi Thomas, Alex,
>
> Marc has been working on a makeTranscriptDbFromGFF() function which is
> now in GenomicFeatures v 1.9.11 in devel. This addresses the last of
> Thomas' points below wrt the user providing annotations in GFF or GTF
> format. The function creates a TxDb which can be given to
> locateVariants() and predictCoding(). Let us know how it goes.
>
> Valerie
>
>
>
> On 04/05/2012 01:47 PM, Valerie Obenchain wrote:
> > Hi Thomas, Alex,
> >
> > Changes below have been made in release 1.2.4 and devel 1.3.4.
> >
> > predictCoding :
> >
> > - Now returns a proteinID that is the triplet number in the protein
> >
> > - The refAA and varAA are the 1-letter amino acid code but not the
> > 3-letter code (i.e., 'G' but not 'Gly'). We don't currently have a
> > function that returns the 3-letter code. Is this of interest/use?
> >
> > - Over the next dev cycle I will implement methods for annotations
> > (subject argument) to be gff/gtf. Currently genomes (seqSource
> > argument) can be a BSgenome or fasta file.
> >
> >
> > locateVariants:
> >
> > To add flexibility for finding variants in a particular region the
> > function now dispatches on a 'region' argument (e.g.,
> > CodingVariants(), IntronVariants(), etc.). I've included a
> > SpliceSiteVariants() that counts ranges that overlap with any portion
> > of the first 2 or last 2 nucleotides of an intron. Called as,
> >
> > locateVariants(query, subject, range=SpliceSiteVariants())
> >
> > Thanks again for the suggestions.
> >
> > Valerie
> >
> >
> > On 03/05/2012 01:25 AM, Alex Gutteridge wrote:
> >> Hi Valerie,
> >>
> >> I'd exactly echo what Thomas wrote (thanks Thomas!). My specific use
> >> case is mapping coding SNPs to PDB protein structures, so something
> >> that encodes 'His88 (CAT) -> Gln88 (CAA)' is all I need.
> >>
> >> Alex Gutteridge.
> >>
> >> On 03.03.2012 00:58, Thomas Girke wrote:
> >>> Hi Valerie,
> >>>
> >>> Based on my experience the position in the complete protein (rather
> >>> than
> >>> CDS) sequence would be the most important piece of information to
> >>> record
> >>> here. For instance, if a SNP changes the 88th triplet CAT (His) in the
> >>> ORF of a transcript to CAA (Gln) then you want to record it like this:
> >>> His88 (CAT) -> Gln88 (CAA). This way the user can map this change to a
> >>> protein structure or inject it into the corresponding protein sequence
> >>> without any additional remapping or coding.
> >>>
> >>> Another feature to consider are SNPs affecting splice sites (commonly
> >>> first and last two nucleotides of an intron).
> >>>
> >>> If possible it would be also very useful to support users who want to
> >>> work with their custom genomes and annotations files provided as FASTA
> >>> and GFF/GTF files, respectively.
> >>>
> >>> Best,
> >>>
> >>> Thomas
> >>>
> >>>
> >>> On Fri, Mar 02, 2012 at 11:31:57PM +0000, Valerie Obenchain wrote:
> >>>> Slight change to this -
> >>>>
> >>>> I'm now returning the following new columns,
> >>>>
> >>>> \item{\code{seqTxLoc}}{
> >>>> Location in transcript-based coordinates of the first
> >>>> nucleotide in
> >>>> the codon sequence to be translated. This position
> >>>> corresponds to the
> >>>> first nucleotide in both the \code{refSeq} and \code{varSeq}
> >>>> columns.
> >>>> }
> >>>> \item{\code{varTxLoc}}{
> >>>> Location in transcript-based coordinates of the first
> >>>> nucleotide in
> >>>> the variant. This value will be the same as \code{seqTxLoc}
> >>>> when the
> >>>> variant starts exactly at the beginning of a codon.
> >>>> }
> >>>> \item{\code{varCdsLoc}}{
> >>>> Location in cds-based coordinates of the first nucleotide in
> >>>> the variant. This position is relative to the start of the
> >>>> cds region
> >>>> defined in the \code{subject} annotation.
> >>>> }
> >>>> \item{\code{subjStrand}}{
> >>>> The strand of the \code{subject} the variant matched.
> >>>> \code{predictCoding}
> >>>> determines which variants fall in a coding region by finding
> >>>> the
> >>>> overlaps
> >>>> between the \code{query} and \code{subject}. The
> >>>> \code{query} may be
> >>>> un-stranded \sQuote{*} but the \code{subject} annotation will
> >>>> have a strand.
> >>>> }
> >>>>
> >>>>
> >>>> You are interested in 'protein coordinates'. Does 'varCdsLoc'
> >>>> described
> >>>> above meet the need or are you looking for the actual codon number in
> >>>> the coding sequence? I am interested in hearing more about what you
> >>>> are
> >>>> doing with the protein coordinates, how you are using them. It would
> >>>> help us better design future functions.
> >>>>
> >>>> Thanks,
> >>>> Valerie
> >>>>
> >>>> On 03/02/2012 01:11 AM, Alex Gutteridge wrote:
> >>>> > Thanks Valerie - much appreciated!
> >>>> >
> >>>> > On 01.03.2012 21:30, Valerie Obenchain wrote:
> >>>> >> A 'txLoc' column has been added to the output of predictCoding.
> >>>> >> Available in devel version 1.1.57.
> >>>> >>
> >>>> >> Valerie
> >>>> >>
> >>>> >>
> >>>> >> On 02/28/2012 08:20 AM, Valerie Obenchain wrote:
> >>>> >>> Good suggestion. Yes, predictCoding is does this internally. I'll
> >>>> >>> post back here when this has been added.
> >>>> >>>
> >>>> >>> Valerie
> >>>> >>>
> >>>> >>>
> >>>> >>>
> >>>> >>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote:
> >>>> >>>> Hi Valerie,
> >>>> >>>>
> >>>> >>>> Thanks everything works great now. One small feature request -
> >>>> >>>> would it be hard to output the protein sequence position of the
> >>>> >>>> coding SNPs? At the moment once I've run predictCoding I'm
> >>>> >>>> re-extracting the cds and working out the position of each coding
> >>>> >>>> SNP so I can see where in the protein sequence it is, but it
> >>>> seems
> >>>> >>>> like this is probably just replicating what predictCoding must be
> >>>> >>>> doing internally anyway?
> >>>> >>>>
> >>>> >>>> Alex Gutteridge
> >>>> >>>
> >>>> >>>
> >>>> >>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote:
> >>>> >>>> Hi Alex,
> >>>> >>>>
> >>>> >>>> Thanks for the bug report. The cdsID was taken from an overlap
> >>>> >>>> between the query and GRangesList of cds by transcripts. This
> >>>> gave
> >>>> >>>> the correct transcript number but (incorrectly) took the first
> >>>> cds
> >>>> >>>> number in the list by default. Now fixed in devel 1.1.55.
> >>>> >>>>
> >>>> >>>> I've also updated the man page.
> >>>> >>>>
> >>>> >>>> Valerie
> >>>> >>>>
> >>>> >>>>
> >>>> >>>>
> >>>> >>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote:
> >>>> >>>>> On 22.02.2012 18:58, Herv? Pag?s wrote:
> >>>> >>>>>> Hi Alex,
> >>>> >>>>>>
> >>>> >>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote:
> >>>> >>>>>
> >>>> >>>>> [...]
> >>>> >>>>>
> >>>> >>>>>>> But the predictCoding call gives this error:
> >>>> >>>>>>>
> >>>> >>>>>>> Error in .setSeqNames(x, value) :
> >>>> >>>>>>> The replacement value for isActiveSeq must be a logical
> >>>> vector,
> >>>> >>>>>>> with
> >>>> >>>>>>> names that match the seqlevels of the object
> >>>> >>>>>>
> >>>> >>>>>> The error message doesn't help much but I think the pb is
> >>>> that you
> >>>> >>>>>> didn't rename chMT properly. Try to do this:
> >>>> >>>>>>
> >>>> >>>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
> >>>> >>>>>>
> >>>> >>>>>> before you start the for(eg in entrez.ids){..} loop again.
> >>>> >>>>>>
> >>>> >>>>>> Cheers,
> >>>> >>>>>> H.
> >>>> >>>>>
> >>>> >>>>> Thanks Herv? that nailed it. I'm having some difficulty
> >>>> joining up
> >>>> >>>>> the output of predictCoding() with the query SNPs though. If
> >>>> >>>>> someone could point out where the disconnect in my thinking is I
> >>>> >>>>> would appreciate it!
> >>>> >>>>>
> >>>> >>>>> Here's my (now edited down) script:
> >>>> >>>>>
> >>>> >>>>> library(BSgenome.Hsapiens.UCSC.hg19)
> >>>> >>>>> library(VariantAnnotation)
> >>>> >>>>> library(SNPlocs.Hsapiens.dbSNP.20110815)
> >>>> >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> >>>> >>>>>
> >>>> >>>>> entrez.ids = c('6335')
> >>>> >>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene
> >>>> >>>>>
> >>>> >>>>> snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T)
> >>>> >>>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps))
> >>>> >>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
> >>>> >>>>>
> >>>> >>>>> gene.list = cdsBy(txdb19, by="gene")
> >>>> >>>>> vsd.list = gene.list[entrez.ids]
> >>>> >>>>> cds.list = cdsBy(txdb19,by="tx")
> >>>> >>>>>
> >>>> >>>>> eg = entrez.ids[1]
> >>>> >>>>>
> >>>> >>>>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]])))
> >>>> >>>>> eg.snps = snps[snp.idx]
> >>>> >>>>> iupac = values(eg.snps)[,"alleles_as_ambig"]
> >>>> >>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac]))
> >>>> >>>>> variant.alleles =
> >>>> >>>>>
> >>>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[1]])
> >>>>
> >>>> >>>>>
> >>>> >>>>>
> >>>> >>>>>
> >>>> >>>>> aa =
> >>>> >>>>>
> >>>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=variant.alleles)
> >>>> >>>>>
> >>>> >>>>> #####
> >>>> >>>>>
> >>>> >>>>> Then if I query the predictCoding results in aa in an
> >>>> interactive
> >>>> >>>>> session I get the following (see inline comments for what I
> >>>> think
> >>>> >>>>> should be happening, but I must be misinterpreting what queryID
> >>>> >>>>> means)
> >>>> >>>>>
> >>>> >>>>> The docs for predictCoding() contain a small typo
> >>>> >>>>> (s/queryHits/queryID), but otherwise seem clear?
> >>>> >>>>>
> >>>> >>>>> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?,
> >>>> >>>>> ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?.
> >>>> >>>>>
> >>>> >>>>> ?queryHits? The ?queryHits? column provides a map back
> >>>> to the
> >>>> >>>>> variants in the original ?query?. If the ?query? was a
> >>>> >>>>> ?VCF?
> >>>> >>>>> object this index corresponds to the row in the
> >>>> >>>>> ?GRanges? in
> >>>> >>>>> the ?rowData? slot. If ?query? was an expanded
> >>>> ?GRanges?,
> >>>> >>>>> ?RangedData? or ?RangesList? the index corresponds to
> >>>> >>>>> the row
> >>>> >>>>> in the expanded object.
> >>>> >>>>>
> >>>> >>>>> #####
> >>>> >>>>>
> >>>> >>>>>> aa[1,]
> >>>> >>>>> DataFrame with 1 row and 9 columns
> >>>> >>>>> queryID consequence refSeq
> >>>> varSeq refAA
> >>>> >>>>> <integer> <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
> >>>> >>>>> 1 1 nonsynonymous CTC
> >>>> ATC L
> >>>> >>>>> varAA txID geneID cdsID
> >>>> >>>>> <AAStringSet> <character> <factor> <integer>
> >>>> >>>>> 1 I 10921 6335 33668
> >>>> >>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx
> >>>> >>>>>> '10921' and cds '33668'.
> >>>> >>>>>> #If I look at the first query SNP I get this:
> >>>> >>>>>> eg.snps.exp[aa[1,'queryID'],]
> >>>> >>>>> GRanges with 1 range and 2 elementMetadata values:
> >>>> >>>>> seqnames ranges strand | RefSNP_id
> >>>> >>>>> alleles_as_ambig
> >>>> >>>>> <Rle> <IRanges> <Rle> | <character> <character>
> >>>> >>>>> [1] chr2 [167055370, 167055370] * | 111558968
> >>>> >>>>> R
> >>>> >>>>> ---
> >>>> >>>>> seqlengths:
> >>>> >>>>> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22
> >>>> chrX
> >>>> >>>>> chrY chrM
> >>>> >>>>> NA NA NA NA NA NA ... NA NA
> >>>> NA NA
> >>>> >>>>> NA NA
> >>>> >>>>>> #So SNP 1 is at 167055370 on chr2
> >>>> >>>>>> #But if I check tx '10921' I see that the cds overlapping
> >>>> >>>>>> 167055370 is actually '33651'
> >>>> >>>>>> #And cds '33668' is at the other end of the tx:
> >>>> >>>>>> cds.list[[aa[1,'txID']]]
> >>>> >>>>> GRanges with 26 ranges and 3 elementMetadata values:
> >>>> >>>>> seqnames ranges strand | cds_id
> >>>> >>>>> cds_name
> >>>> >>>>> <Rle> <IRanges> <Rle> | <integer> <character>
> >>>> >>>>> [1] chr2 [167168009, 167168266] - | 33668 <NA>
> >>>> >>>>> [2] chr2 [167163466, 167163584] - | 33667 <NA>
> >>>> >>>>> [3] chr2 [167163020, 167163109] - | 33666 <NA>
> >>>> >>>>> [4] chr2 [167162302, 167162430] - | 33647 <NA>
> >>>> >>>>> [5] chr2 [167160748, 167160839] - | 33646 <NA>
> >>>> >>>>> [6] chr2 [167159600, 167159812] - | 33645 <NA>
> >>>> >>>>> [7] chr2 [167151109, 167151172] - | 33644 <NA>
> >>>> >>>>> [8] chr2 [167149741, 167149882] - | 33643 <NA>
> >>>> >>>>> [9] chr2 [167144947, 167145153] - | 33642 <NA>
> >>>> >>>>> ... ... ... ... ... ...
> >>>> >>>>> ...
> >>>> >>>>> [18] chr2 [167099012, 167099166] - | 33659 <NA>
> >>>> >>>>> [19] chr2 [167094604, 167094777] - | 33658 <NA>
> >>>> >>>>> [20] chr2 [167089850, 167089972] - | 33657 <NA>
> >>>> >>>>> [21] chr2 [167085201, 167085482] - | 33656 <NA>
> >>>> >>>>> [22] chr2 [167084180, 167084233] - | 33655 <NA>
> >>>> >>>>> [23] chr2 [167083077, 167083214] - | 33654 <NA>
> >>>> >>>>> [24] chr2 [167060870, 167060974] - | 33653 <NA>
> >>>> >>>>> [25] chr2 [167060465, 167060735] - | 33652 <NA>
> >>>> >>>>> [26] chr2 [167055182, 167056374] - | 33651 <NA>
> >>>> >>>>> exon_rank
> >>>> >>>>> <integer>
> >>>> >>>>> [1] 2
> >>>> >>>>> [2] 3
> >>>> >>>>> [3] 4
> >>>> >>>>> [4] 5
> >>>> >>>>> [5] 6
> >>>> >>>>> [6] 7
> >>>> >>>>> [7] 8
> >>>> >>>>> [8] 9
> >>>> >>>>> [9] 10
> >>>> >>>>> ... ...
> >>>> >>>>> [18] 19
> >>>> >>>>> [19] 20
> >>>> >>>>> [20] 21
> >>>> >>>>> [21] 22
> >>>> >>>>> [22] 23
> >>>> >>>>> [23] 24
> >>>> >>>>> [24] 25
> >>>> >>>>> [25] 26
> >>>> >>>>> [26] 27
> >>>> >>>>> ---
> >>>> >>>>> seqlengths:
> >>>> >>>>> chr1 chr2 ...
> >>>> >>>>> chr18_gl000207_random
> >>>> >>>>> 249250621 243199373 ...
> >>>> >>>>> 4262
> >>>> >>>>>
> >>>> >>>>>
> >>>> >>>>
> >>>> >>>> _______________________________________________
> >>>> >>>> Bioconductor mailing list
> >>>> >>>> Bioconductor at r-project.org
> >>>> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>> >>>> Search the archives:
> >>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>> >>>
> >>>> >>> _______________________________________________
> >>>> >>> Bioconductor mailing list
> >>>> >>> Bioconductor at r-project.org
> >>>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>> >>> Search the archives:
> >>>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>> >
> >>>>
> >>>> _______________________________________________
> >>>> Bioconductor mailing list
> >>>> Bioconductor at r-project.org
> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>> Search the archives:
> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >
>
More information about the Bioconductor
mailing list