[BioC] Extracting protein sequence associated to UCSC transcript id

Thu Jan 9 16:01:22 CET 2014

Great!
Thanks
Raf

On 09/01/14 15:19, James W. MacDonald wrote:
> Hi Raf,
>
> The GenomicFeatures vignette covers these packages:
>
> http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf 
>
>
> Best,
>
> Jim
>
>
> On 1/9/2014 5:24 AM, rcaloger wrote:
>> Dear Michael and Hervé,
>> many thanks for the great help
>> I followed the suggestion of Hervé
>> Since I need to retrieve only two sequences I did the following
>> ...
>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>> cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE)
>> trs <- c(id1,id2)
>> if(length(which(trs %in% names(cds_by_tx)))==1){
>>         cat("\nOnly names(cds_by_tx)[which(trs %in% 
>> names(cds_by_tx))] is coding\n"
>>         return()
>>     }else if(length(which(trs %in% names(cds_by_tx)))==0){
>>         cat("\nNon of the two transcripts is coding\n"
>>         return()
>>     }
>>     cds_seqs <- 
>> extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, 
>> cds_by_tx[which(names(cds_by_tx) %in% trs)])
>>     proteome <- translate(cds_seqs)
>>     names(proteome) <- names(cds_seqs)
>>     p1 <- proteome[which(names(proteome)==id1)]
>>     p2 <- proteome[which(names(proteome)==id2)]
>> ...
>>
>> Many thanks again!
>>
>> One point that would be of general interest.
>> TxDb.XX.YY.ZZ.knownGene are very powerful dbs but, as far as I know, 
>> there is no vignette associated to them, and documentation, in my 
>> opinion does not allow an easy understanding of the db.
>> Do you know if some tutorial like that available for BSgenome is also 
>> present for TxDb.XX.YY.ZZ.knownGene?
>> Cheers
>> Raf
>>
>> On 09/01/14 00:49, Michael Lawrence wrote:
>>> I had issues doing this with Homo.sapiens:
>>>
>>> First I tried just:
>>>> cdsBy(Homo.sapiens)
>>> Error in .select(x, keys, columns, keytype, ...) :
>>>    You must provide cols argument
>>>
>>> Not really sure why? And the argument is "columns"...
>>>
>>> Then:
>>>> cdsBy(Homo.sapiens, columns=character())
>>> Error in res[, .reverseColAbbreviations(x, cnames), drop = FALSE] :
>>>    incorrect number of dimensions
>>>
>>>
>>>> cdsBy(Homo.sapiens, columns="UCSCKG")
>>> Warning messages:
>>> 1: In .generateExtraRows(tab, keys, jointype) :
>>>    'select' and duplicate query keys resulted in 1:many mapping between
>>> keys and return rows
>>> 2: In .generateExtraRows(tab, keys, jointype) :
>>>    'select' resulted in 1:many mapping between keys and return rows
>>> 3: In .generateExtraRows(tab, keys, jointype) :
>>>    'select' and duplicate query keys resulted in 1:many mapping between
>>> keys and return rows
>>> 4: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) 
>>> else
>>> paste0(labels,  :
>>>    duplicated levels in factors are deprecated
>>>
>>> Worked, but a bunch of warnings (not sure if all of them are helpful?)
>>>
>>> Finally, I was not aware of the use.names argument. Seems very 
>>> useful. But
>>> apparently not supported yet for the OrganismDb objects:
>>>
>>>> cdsBy(Homo.sapiens, columns="UCSCKG", use.names=TRUE)
>>> Error in .local(x, by, ...) : unused argument (use.names = TRUE)
>>>
>>> I think the OrganismDb objects are in theory a great step forward, 
>>> because
>>> they let us easily integrate other identifiers. But I've had little 
>>> luck
>>> getting them to work so far.
>>>
>>> Michael
>>>
>>>
>>> On Wed, Jan 8, 2014 at 11:59 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>
>>>> Hi Raf,
>>>>
>>>>
>>>> On 01/08/2014 10:25 AM, rcaloger wrote:
>>>>
>>>>> I found a way to extract the protein sequences querying the UCSC 
>>>>> web page.
>>>>> However, there should be a  more elegant way to do it.
>>>>> library(RCurl)
>>>>> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2")
>>>>> myquery<- list()
>>>>> for(i in 1:length(trs)){
>>>>>       myquery[[i]] <-
>>>>> getURL(paste("http://genome-euro.ucsc.edu/cgi-bin/hgGene?
>>>>> hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=",
>>>>> trs[i],sep=""))
>>>>>       Sys.sleep(30)
>>>>> }
>>>>>
>>>> Your 3rd transcript is non-coding hence no protein sequence for it.
>>>>
>>>> Otherwise you get exactly what you wanted using Michael's suggestion:
>>>>
>>>>    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>>>    cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE)
>>>>
>>>> See which of your transcripts are coding:
>>>>
>>>>
>>>>    > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2")
>>>>    > trs %in% names(cds_by_tx)
>>>>    [1]  TRUE  TRUE FALSE
>>>>
>>>> Extract and translate the CDS sequences (note that here I choose to
>>>> compute the full proteome but I don't have to):
>>>>
>>>>    library(BSgenome.Hsapiens.UCSC.hg19)
>>>>    cds_seqs <- 
>>>> extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19,
>>>> cds_by_tx)
>>>>    proteome <- translate(cds_seqs)
>>>>    names(proteome) <- names(cds_seqs)
>>>>
>>>> Then:
>>>>
>>>>    > proteome
>>>>      A AAStringSet instance of length 63691
>>>>            width seq names
>>>>        [1]   134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* 
>>>> uc010nxq.1
>>>>        [2]   306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* 
>>>> uc001aal.1
>>>>        [3]   390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* 
>>>> uc009vjk.2
>>>>        [4]   390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* 
>>>> uc001aau.3
>>>>        [5]   267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* 
>>>> uc021oeh.1
>>>>        ...   ... ...
>>>>    [63687]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* 
>>>> uc031tgq.1
>>>>    [63688]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* 
>>>> uc031tgr.1
>>>>    [63689]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* 
>>>> uc031tgs.1
>>>>    [63690]   279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* 
>>>> uc011mgh.2
>>>>    [63691]   280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* 
>>>> uc011mgi.2
>>>>
>>>>    > trs <- c("uc003mfv.3", "uc001ajb.1")
>>>>
>>>>    > proteome[trs]
>>>>      A AAStringSet instance of length 2
>>>>        width seq names
>>>>    [1]   204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* 
>>>> uc003mfv.3
>>>>    [2]   498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* 
>>>> uc001ajb.1
>>>>
>>>> You can drop the trailing "*" with
>>>>
>>>>    subseq(proteome[trs], end=-2)
>>>>
>>>> Cheers,
>>>> H.
>>>>
>>>>
>>>>
>>>>> It is interesting that in bioconductor there are no databases linking
>>>>> transcripts to proteins
>>>>> Cheers
>>>>> Raf
>>>>>
>>>>> On 08/01/14 17:10, Michael Lawrence wrote:
>>>>>
>>>>>> In theory, you should be able to get the cds regions using e.g. the
>>>>>> Homo.sapiens package, but it seems kind of tough to retrieve 
>>>>>> those for
>>>>>> UCSC
>>>>>> Known Gene identifiers (assuming that is what you have). Marc 
>>>>>> Carlson
>>>>>> could
>>>>>> probably help more.
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger 
>>>>>> <raffaele.calogero at unito.it>
>>>>>> wrote:
>>>>>>
>>>>>>   Dear Michael,
>>>>>>> thank for the kind suggestion but unfortunately it does not 
>>>>>>> solve my
>>>>>>> problem because, using the approach you are suggesting, I do not 
>>>>>>> have
>>>>>>> access to the position of the start codon for the different 
>>>>>>> isoforms.
>>>>>>> Cheers
>>>>>>> Raf
>>>>>>>
>>>>>>>
>>>>>>> On 07/01/14 16:44, Michael Lawrence wrote:
>>>>>>>
>>>>>>>   If you had the transcript coordinates (as GRangesList, perhaps 
>>>>>>> from an
>>>>>>>> exonsBy() on a TranscriptDb) you could use
>>>>>>>> extractTranscriptsFromGenome()
>>>>>>>> and translate, see the GenomicFeatures vignette for an example.
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger 
>>>>>>>> <raffaele.calogero at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>    Hi,
>>>>>>>>
>>>>>>>>> In order to validate fusion products I need to be sure that the
>>>>>>>>> peptides
>>>>>>>>> encoded by the the two fused proteins are in the same frame.
>>>>>>>>> I have now a function that allows to confirm the protein1 and 
>>>>>>>>> protein2
>>>>>>>>> have sequences located in the same frame.
>>>>>>>>> However, I got stack to retrieve those proteins sequences from 
>>>>>>>>> UCSC. I
>>>>>>>>> did
>>>>>>>>> not found a quick way to retrieve the protein sequence 
>>>>>>>>> associated to a
>>>>>>>>> UCSC
>>>>>>>>> ID.
>>>>>>>>> Indeed the protein sequence is present in the UCSC genome 
>>>>>>>>> browser,
>>>>>>>>> but I
>>>>>>>>> do not know how to grab it.
>>>>>>>>> Any suggestion?
>>>>>>>>> Cheers
>>>>>>>>> Raffaele
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>>
>>>>>>>>> ----------------------------------------
>>>>>>>>> Prof. Raffaele A. Calogero
>>>>>>>>> Bioinformatics and Genomics Unit
>>>>>>>>> MBC Centro di Biotecnologie Molecolari
>>>>>>>>> Via Nizza 52, Torino 10126
>>>>>>>>> tel.   ++39 0116706457
>>>>>>>>> Fax    ++39 0112366457
>>>>>>>>> Mobile ++39 3333827080
>>>>>>>>> email: raffaele.calogero at unito.it
>>>>>>>>>           raffaele[dot]calogero[at]gmail[dot]com
>>>>>>>>> www:   http://www.bioinformatica.unito.it
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioconductor mailing list
>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>> Search the archives: http://news.gmane.org/gmane.
>>>>>>>>> science.biology.informatics.conductor
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>   --
>>>>>>> ----------------------------------------
>>>>>>> Prof. Raffaele A. Calogero
>>>>>>> Bioinformatics and Genomics Unit
>>>>>>> MBC Centro di Biotecnologie Molecolari
>>>>>>> Via Nizza 52, Torino 10126
>>>>>>> tel.   ++39 0116706457
>>>>>>> Fax    ++39 0112366457
>>>>>>> Mobile ++39 3333827080
>>>>>>> email: raffaele.calogero at unito.it
>>>>>>>          raffaele[dot]calogero[at]gmail[dot]com
>>>>>>> www:   http://www.bioinformatica.unito.it
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>> -- 
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org
>>>> Phone:  (206) 667-5791
>>>> Fax:    (206) 667-1319
>>>>
>>
>>
>

-- 

----------------------------------------
Prof. Raffaele A. Calogero
Bioinformatics and Genomics Unit
MBC Centro di Biotecnologie Molecolari
Via Nizza 52, Torino 10126
tel.   ++39 0116706457
Fax    ++39 0112366457
Mobile ++39 3333827080
email: raffaele.calogero at unito.it
        raffaele[dot]calogero[at]gmail[dot]com
www:   http://www.bioinformatica.unito.it