[BioC] Extracting protein sequence associated to UCSC transcript id

Sat Jan 11 00:10:00 CET 2014

Hi Raffaele,

There is also workflow that describes all this stuff in a high level 
overview here:

http://www.bioconductor.org/help/workflows/annotation/annotation/

This is a higher level view of things.  So it may help you know about 
the different kinds of resources available and how to make use of each kind.

   Marc

On 01/09/2014 07:01 AM, rcaloger wrote:
> Great!
> Thanks
> Raf
>
> On 09/01/14 15:19, James W. MacDonald wrote:
>> Hi Raf,
>>
>> The GenomicFeatures vignette covers these packages:
>>
>> http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf 
>>
>>
>> Best,
>>
>> Jim
>>
>>
>> On 1/9/2014 5:24 AM, rcaloger wrote:
>>> Dear Michael and Hervé,
>>> many thanks for the great help
>>> I followed the suggestion of Hervé
>>> Since I need to retrieve only two sequences I did the following
>>> ...
>>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>> cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE)
>>> trs <- c(id1,id2)
>>> if(length(which(trs %in% names(cds_by_tx)))==1){
>>>         cat("\nOnly names(cds_by_tx)[which(trs %in% 
>>> names(cds_by_tx))] is coding\n"
>>>         return()
>>>     }else if(length(which(trs %in% names(cds_by_tx)))==0){
>>>         cat("\nNon of the two transcripts is coding\n"
>>>         return()
>>>     }
>>>     cds_seqs <- 
>>> extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, 
>>> cds_by_tx[which(names(cds_by_tx) %in% trs)])
>>>     proteome <- translate(cds_seqs)
>>>     names(proteome) <- names(cds_seqs)
>>>     p1 <- proteome[which(names(proteome)==id1)]
>>>     p2 <- proteome[which(names(proteome)==id2)]
>>> ...
>>>
>>> Many thanks again!
>>>
>>> One point that would be of general interest.
>>> TxDb.XX.YY.ZZ.knownGene are very powerful dbs but, as far as I know, 
>>> there is no vignette associated to them, and documentation, in my 
>>> opinion does not allow an easy understanding of the db.
>>> Do you know if some tutorial like that available for BSgenome is 
>>> also present for TxDb.XX.YY.ZZ.knownGene?
>>> Cheers
>>> Raf
>>>
>>> On 09/01/14 00:49, Michael Lawrence wrote:
>>>> I had issues doing this with Homo.sapiens:
>>>>
>>>> First I tried just:
>>>>> cdsBy(Homo.sapiens)
>>>> Error in .select(x, keys, columns, keytype, ...) :
>>>>    You must provide cols argument
>>>>
>>>> Not really sure why? And the argument is "columns"...
>>>>
>>>> Then:
>>>>> cdsBy(Homo.sapiens, columns=character())
>>>> Error in res[, .reverseColAbbreviations(x, cnames), drop = FALSE] :
>>>>    incorrect number of dimensions
>>>>
>>>>
>>>>> cdsBy(Homo.sapiens, columns="UCSCKG")
>>>> Warning messages:
>>>> 1: In .generateExtraRows(tab, keys, jointype) :
>>>>    'select' and duplicate query keys resulted in 1:many mapping 
>>>> between
>>>> keys and return rows
>>>> 2: In .generateExtraRows(tab, keys, jointype) :
>>>>    'select' resulted in 1:many mapping between keys and return rows
>>>> 3: In .generateExtraRows(tab, keys, jointype) :
>>>>    'select' and duplicate query keys resulted in 1:many mapping 
>>>> between
>>>> keys and return rows
>>>> 4: In `levels<-`(`*tmp*`, value = if (nl == nL) 
>>>> as.character(labels) else
>>>> paste0(labels,  :
>>>>    duplicated levels in factors are deprecated
>>>>
>>>> Worked, but a bunch of warnings (not sure if all of them are helpful?)
>>>>
>>>> Finally, I was not aware of the use.names argument. Seems very 
>>>> useful. But
>>>> apparently not supported yet for the OrganismDb objects:
>>>>
>>>>> cdsBy(Homo.sapiens, columns="UCSCKG", use.names=TRUE)
>>>> Error in .local(x, by, ...) : unused argument (use.names = TRUE)
>>>>
>>>> I think the OrganismDb objects are in theory a great step forward, 
>>>> because
>>>> they let us easily integrate other identifiers. But I've had little 
>>>> luck
>>>> getting them to work so far.
>>>>
>>>> Michael
>>>>
>>>>
>>>> On Wed, Jan 8, 2014 at 11:59 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>
>>>>> Hi Raf,
>>>>>
>>>>>
>>>>> On 01/08/2014 10:25 AM, rcaloger wrote:
>>>>>
>>>>>> I found a way to extract the protein sequences querying the UCSC 
>>>>>> web page.
>>>>>> However, there should be a  more elegant way to do it.
>>>>>> library(RCurl)
>>>>>> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2")
>>>>>> myquery<- list()
>>>>>> for(i in 1:length(trs)){
>>>>>>       myquery[[i]] <-
>>>>>> getURL(paste("http://genome-euro.ucsc.edu/cgi-bin/hgGene?
>>>>>> hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=",
>>>>>> trs[i],sep=""))
>>>>>>       Sys.sleep(30)
>>>>>> }
>>>>>>
>>>>> Your 3rd transcript is non-coding hence no protein sequence for it.
>>>>>
>>>>> Otherwise you get exactly what you wanted using Michael's suggestion:
>>>>>
>>>>>    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>>    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>>>>    cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE)
>>>>>
>>>>> See which of your transcripts are coding:
>>>>>
>>>>>
>>>>>    > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2")
>>>>>    > trs %in% names(cds_by_tx)
>>>>>    [1]  TRUE  TRUE FALSE
>>>>>
>>>>> Extract and translate the CDS sequences (note that here I choose to
>>>>> compute the full proteome but I don't have to):
>>>>>
>>>>>    library(BSgenome.Hsapiens.UCSC.hg19)
>>>>>    cds_seqs <- 
>>>>> extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19,
>>>>> cds_by_tx)
>>>>>    proteome <- translate(cds_seqs)
>>>>>    names(proteome) <- names(cds_seqs)
>>>>>
>>>>> Then:
>>>>>
>>>>>    > proteome
>>>>>      A AAStringSet instance of length 63691
>>>>>            width seq names
>>>>>        [1]   134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* 
>>>>> uc010nxq.1
>>>>>        [2]   306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* 
>>>>> uc001aal.1
>>>>>        [3]   390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* 
>>>>> uc009vjk.2
>>>>>        [4]   390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* 
>>>>> uc001aau.3
>>>>>        [5]   267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* 
>>>>> uc021oeh.1
>>>>>        ...   ... ...
>>>>>    [63687]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* 
>>>>> uc031tgq.1
>>>>>    [63688]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* 
>>>>> uc031tgr.1
>>>>>    [63689]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* 
>>>>> uc031tgs.1
>>>>>    [63690]   279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* 
>>>>> uc011mgh.2
>>>>>    [63691]   280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* 
>>>>> uc011mgi.2
>>>>>
>>>>>    > trs <- c("uc003mfv.3", "uc001ajb.1")
>>>>>
>>>>>    > proteome[trs]
>>>>>      A AAStringSet instance of length 2
>>>>>        width seq names
>>>>>    [1]   204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* 
>>>>> uc003mfv.3
>>>>>    [2]   498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* 
>>>>> uc001ajb.1
>>>>>
>>>>> You can drop the trailing "*" with
>>>>>
>>>>>    subseq(proteome[trs], end=-2)
>>>>>
>>>>> Cheers,
>>>>> H.
>>>>>
>>>>>
>>>>>
>>>>>> It is interesting that in bioconductor there are no databases 
>>>>>> linking
>>>>>> transcripts to proteins
>>>>>> Cheers
>>>>>> Raf
>>>>>>
>>>>>> On 08/01/14 17:10, Michael Lawrence wrote:
>>>>>>
>>>>>>> In theory, you should be able to get the cds regions using e.g. the
>>>>>>> Homo.sapiens package, but it seems kind of tough to retrieve 
>>>>>>> those for
>>>>>>> UCSC
>>>>>>> Known Gene identifiers (assuming that is what you have). Marc 
>>>>>>> Carlson
>>>>>>> could
>>>>>>> probably help more.
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger 
>>>>>>> <raffaele.calogero at unito.it>
>>>>>>> wrote:
>>>>>>>
>>>>>>>   Dear Michael,
>>>>>>>> thank for the kind suggestion but unfortunately it does not 
>>>>>>>> solve my
>>>>>>>> problem because, using the approach you are suggesting, I do 
>>>>>>>> not have
>>>>>>>> access to the position of the start codon for the different 
>>>>>>>> isoforms.
>>>>>>>> Cheers
>>>>>>>> Raf
>>>>>>>>
>>>>>>>>
>>>>>>>> On 07/01/14 16:44, Michael Lawrence wrote:
>>>>>>>>
>>>>>>>>   If you had the transcript coordinates (as GRangesList, 
>>>>>>>> perhaps from an
>>>>>>>>> exonsBy() on a TranscriptDb) you could use
>>>>>>>>> extractTranscriptsFromGenome()
>>>>>>>>> and translate, see the GenomicFeatures vignette for an example.
>>>>>>>>>
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger 
>>>>>>>>> <raffaele.calogero at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>    Hi,
>>>>>>>>>
>>>>>>>>>> In order to validate fusion products I need to be sure that the
>>>>>>>>>> peptides
>>>>>>>>>> encoded by the the two fused proteins are in the same frame.
>>>>>>>>>> I have now a function that allows to confirm the protein1 and 
>>>>>>>>>> protein2
>>>>>>>>>> have sequences located in the same frame.
>>>>>>>>>> However, I got stack to retrieve those proteins sequences 
>>>>>>>>>> from UCSC. I
>>>>>>>>>> did
>>>>>>>>>> not found a quick way to retrieve the protein sequence 
>>>>>>>>>> associated to a
>>>>>>>>>> UCSC
>>>>>>>>>> ID.
>>>>>>>>>> Indeed the protein sequence is present in the UCSC genome 
>>>>>>>>>> browser,
>>>>>>>>>> but I
>>>>>>>>>> do not know how to grab it.
>>>>>>>>>> Any suggestion?
>>>>>>>>>> Cheers
>>>>>>>>>> Raffaele
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>>
>>>>>>>>>> ----------------------------------------
>>>>>>>>>> Prof. Raffaele A. Calogero
>>>>>>>>>> Bioinformatics and Genomics Unit
>>>>>>>>>> MBC Centro di Biotecnologie Molecolari
>>>>>>>>>> Via Nizza 52, Torino 10126
>>>>>>>>>> tel.   ++39 0116706457
>>>>>>>>>> Fax    ++39 0112366457
>>>>>>>>>> Mobile ++39 3333827080
>>>>>>>>>> email: raffaele.calogero at unito.it
>>>>>>>>>>           raffaele[dot]calogero[at]gmail[dot]com
>>>>>>>>>> www:   http://www.bioinformatica.unito.it
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Bioconductor mailing list
>>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>>> Search the archives: http://news.gmane.org/gmane.
>>>>>>>>>> science.biology.informatics.conductor
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>   --
>>>>>>>> ----------------------------------------
>>>>>>>> Prof. Raffaele A. Calogero
>>>>>>>> Bioinformatics and Genomics Unit
>>>>>>>> MBC Centro di Biotecnologie Molecolari
>>>>>>>> Via Nizza 52, Torino 10126
>>>>>>>> tel.   ++39 0116706457
>>>>>>>> Fax    ++39 0112366457
>>>>>>>> Mobile ++39 3333827080
>>>>>>>> email: raffaele.calogero at unito.it
>>>>>>>>          raffaele[dot]calogero[at]gmail[dot]com
>>>>>>>> www:   http://www.bioinformatica.unito.it
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>> -- 
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpages at fhcrc.org
>>>>> Phone:  (206) 667-5791
>>>>> Fax:    (206) 667-1319
>>>>>
>>>
>>>
>>
>
>