[BioC] Extracting protein sequence associated to UCSC transcript id

Thu Jan 9 11:24:52 CET 2014

Dear Michael and Hervé,
many thanks for the great help
I followed the suggestion of Hervé
Since I need to retrieve only two sequences I did the following
...
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE)
trs <- c(id1,id2)
if(length(which(trs %in% names(cds_by_tx)))==1){
         cat("\nOnly names(cds_by_tx)[which(trs %in% names(cds_by_tx))] 
is coding\n"
         return()
     }else if(length(which(trs %in% names(cds_by_tx)))==0){
         cat("\nNon of the two transcripts is coding\n"
         return()
     }
     cds_seqs <- 
extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, 
cds_by_tx[which(names(cds_by_tx) %in% trs)])
     proteome <- translate(cds_seqs)
     names(proteome) <- names(cds_seqs)
     p1 <- proteome[which(names(proteome)==id1)]
     p2 <- proteome[which(names(proteome)==id2)]
...

Many thanks again!

One point that would be of general interest.
TxDb.XX.YY.ZZ.knownGene are very powerful dbs but, as far as I know, 
there is no vignette associated to them, and documentation, in my 
opinion does not allow an easy understanding of the db.
Do you know if some tutorial like that available for BSgenome is also 
present for TxDb.XX.YY.ZZ.knownGene?
Cheers
Raf

On 09/01/14 00:49, Michael Lawrence wrote:
> I had issues doing this with Homo.sapiens:
>
> First I tried just:
>> cdsBy(Homo.sapiens)
> Error in .select(x, keys, columns, keytype, ...) :
>    You must provide cols argument
>
> Not really sure why? And the argument is "columns"...
>
> Then:
>> cdsBy(Homo.sapiens, columns=character())
> Error in res[, .reverseColAbbreviations(x, cnames), drop = FALSE] :
>    incorrect number of dimensions
>
>
>> cdsBy(Homo.sapiens, columns="UCSCKG")
> Warning messages:
> 1: In .generateExtraRows(tab, keys, jointype) :
>    'select' and duplicate query keys resulted in 1:many mapping between
> keys and return rows
> 2: In .generateExtraRows(tab, keys, jointype) :
>    'select' resulted in 1:many mapping between keys and return rows
> 3: In .generateExtraRows(tab, keys, jointype) :
>    'select' and duplicate query keys resulted in 1:many mapping between
> keys and return rows
> 4: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else
> paste0(labels,  :
>    duplicated levels in factors are deprecated
>
> Worked, but a bunch of warnings (not sure if all of them are helpful?)
>
> Finally, I was not aware of the use.names argument. Seems very useful. But
> apparently not supported yet for the OrganismDb objects:
>
>> cdsBy(Homo.sapiens, columns="UCSCKG", use.names=TRUE)
> Error in .local(x, by, ...) : unused argument (use.names = TRUE)
>
> I think the OrganismDb objects are in theory a great step forward, because
> they let us easily integrate other identifiers. But I've had little luck
> getting them to work so far.
>
> Michael
>
>
> On Wed, Jan 8, 2014 at 11:59 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>> Hi Raf,
>>
>>
>> On 01/08/2014 10:25 AM, rcaloger wrote:
>>
>>> I found a way to extract the protein sequences querying the UCSC web page.
>>> However, there should be a  more elegant way to do it.
>>> library(RCurl)
>>> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2")
>>> myquery<- list()
>>> for(i in 1:length(trs)){
>>>       myquery[[i]] <-
>>> getURL(paste("http://genome-euro.ucsc.edu/cgi-bin/hgGene?
>>> hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=",
>>> trs[i],sep=""))
>>>       Sys.sleep(30)
>>> }
>>>
>> Your 3rd transcript is non-coding hence no protein sequence for it.
>>
>> Otherwise you get exactly what you wanted using Michael's suggestion:
>>
>>    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>    cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE)
>>
>> See which of your transcripts are coding:
>>
>>
>>    > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2")
>>    > trs %in% names(cds_by_tx)
>>    [1]  TRUE  TRUE FALSE
>>
>> Extract and translate the CDS sequences (note that here I choose to
>> compute the full proteome but I don't have to):
>>
>>    library(BSgenome.Hsapiens.UCSC.hg19)
>>    cds_seqs <- extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19,
>> cds_by_tx)
>>    proteome <- translate(cds_seqs)
>>    names(proteome) <- names(cds_seqs)
>>
>> Then:
>>
>>    > proteome
>>      A AAStringSet instance of length 63691
>>            width seq                                           names
>>        [1]   134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* uc010nxq.1
>>        [2]   306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* uc001aal.1
>>        [3]   390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc009vjk.2
>>        [4]   390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc001aau.3
>>        [5]   267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* uc021oeh.1
>>        ...   ... ...
>>    [63687]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgq.1
>>    [63688]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgr.1
>>    [63689]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgs.1
>>    [63690]   279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgh.2
>>    [63691]   280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgi.2
>>
>>    > trs <- c("uc003mfv.3", "uc001ajb.1")
>>
>>    > proteome[trs]
>>      A AAStringSet instance of length 2
>>        width seq                                               names
>>    [1]   204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* uc003mfv.3
>>    [2]   498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* uc001ajb.1
>>
>> You can drop the trailing "*" with
>>
>>    subseq(proteome[trs], end=-2)
>>
>> Cheers,
>> H.
>>
>>
>>
>>> It is interesting that in bioconductor there are no databases linking
>>> transcripts to proteins
>>> Cheers
>>> Raf
>>>
>>> On 08/01/14 17:10, Michael Lawrence wrote:
>>>
>>>> In theory, you should be able to get the cds regions using e.g. the
>>>> Homo.sapiens package, but it seems kind of tough to retrieve those for
>>>> UCSC
>>>> Known Gene identifiers (assuming that is what you have). Marc Carlson
>>>> could
>>>> probably help more.
>>>>
>>>> Michael
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger <raffaele.calogero at unito.it>
>>>> wrote:
>>>>
>>>>   Dear Michael,
>>>>> thank for the kind suggestion but unfortunately it does not solve my
>>>>> problem because, using the approach you are suggesting, I do not have
>>>>> access to the position of the start codon for the different isoforms.
>>>>> Cheers
>>>>> Raf
>>>>>
>>>>>
>>>>> On 07/01/14 16:44, Michael Lawrence wrote:
>>>>>
>>>>>   If you had the transcript coordinates (as GRangesList, perhaps from an
>>>>>> exonsBy() on a TranscriptDb) you could use
>>>>>> extractTranscriptsFromGenome()
>>>>>> and translate, see the GenomicFeatures vignette for an example.
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger <raffaele.calogero at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>    Hi,
>>>>>>
>>>>>>> In order to validate fusion products I need to be sure that the
>>>>>>> peptides
>>>>>>> encoded by the the two fused proteins are in the same frame.
>>>>>>> I have now a function that allows to confirm the protein1 and protein2
>>>>>>> have sequences located in the same frame.
>>>>>>> However, I got stack to retrieve those proteins sequences from UCSC. I
>>>>>>> did
>>>>>>> not found a quick way to retrieve the protein sequence associated to a
>>>>>>> UCSC
>>>>>>> ID.
>>>>>>> Indeed the protein sequence is present in the UCSC genome browser,
>>>>>>> but I
>>>>>>> do not know how to grab it.
>>>>>>> Any suggestion?
>>>>>>> Cheers
>>>>>>> Raffaele
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> ----------------------------------------
>>>>>>> Prof. Raffaele A. Calogero
>>>>>>> Bioinformatics and Genomics Unit
>>>>>>> MBC Centro di Biotecnologie Molecolari
>>>>>>> Via Nizza 52, Torino 10126
>>>>>>> tel.   ++39 0116706457
>>>>>>> Fax    ++39 0112366457
>>>>>>> Mobile ++39 3333827080
>>>>>>> email: raffaele.calogero at unito.it
>>>>>>>           raffaele[dot]calogero[at]gmail[dot]com
>>>>>>> www:   http://www.bioinformatica.unito.it
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives: http://news.gmane.org/gmane.
>>>>>>> science.biology.informatics.conductor
>>>>>>>
>>>>>>>
>>>>>>>   --
>>>>> ----------------------------------------
>>>>> Prof. Raffaele A. Calogero
>>>>> Bioinformatics and Genomics Unit
>>>>> MBC Centro di Biotecnologie Molecolari
>>>>> Via Nizza 52, Torino 10126
>>>>> tel.   ++39 0116706457
>>>>> Fax    ++39 0112366457
>>>>> Mobile ++39 3333827080
>>>>> email: raffaele.calogero at unito.it
>>>>>          raffaele[dot]calogero[at]gmail[dot]com
>>>>> www:   http://www.bioinformatica.unito.it
>>>>>
>>>>>
>>>>>
>>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>

-- 

----------------------------------------
Prof. Raffaele A. Calogero
Bioinformatics and Genomics Unit
MBC Centro di Biotecnologie Molecolari
Via Nizza 52, Torino 10126
tel.   ++39 0116706457
Fax    ++39 0112366457
Mobile ++39 3333827080
email: raffaele.calogero at unito.it
        raffaele[dot]calogero[at]gmail[dot]com
www:   http://www.bioinformatica.unito.it