[BioC] Extracting protein sequence associated to UCSC transcript id

Wed Jan 8 20:59:38 CET 2014

Hi Raf,

On 01/08/2014 10:25 AM, rcaloger wrote:
> I found a way to extract the protein sequences querying the UCSC web page.
> However, there should be a  more elegant way to do it.
> library(RCurl)
> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2")
> myquery<- list()
> for(i in 1:length(trs)){
>      myquery[[i]] <-
> getURL(paste("http://genome-euro.ucsc.edu/cgi-bin/hgGene?hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=",
> trs[i],sep=""))
>      Sys.sleep(30)
> }

Your 3rd transcript is non-coding hence no protein sequence for it.

Otherwise you get exactly what you wanted using Michael's suggestion:

   library(TxDb.Hsapiens.UCSC.hg19.knownGene)
   txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
   cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE)

See which of your transcripts are coding:

   > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2")
   > trs %in% names(cds_by_tx)
   [1]  TRUE  TRUE FALSE

Extract and translate the CDS sequences (note that here I choose to
compute the full proteome but I don't have to):

   library(BSgenome.Hsapiens.UCSC.hg19)
   cds_seqs <- extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, 
cds_by_tx)
   proteome <- translate(cds_seqs)
   names(proteome) <- names(cds_seqs)

Then:

   > proteome
     A AAStringSet instance of length 63691
           width seq                                           names 

       [1]   134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* uc010nxq.1
       [2]   306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* uc001aal.1
       [3]   390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc009vjk.2
       [4]   390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc001aau.3
       [5]   267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* uc021oeh.1
       ...   ... ...
   [63687]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgq.1
   [63688]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgr.1
   [63689]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgs.1
   [63690]   279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgh.2
   [63691]   280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgi.2

   > trs <- c("uc003mfv.3", "uc001ajb.1")

   > proteome[trs]
     A AAStringSet instance of length 2
       width seq                                               names 

   [1]   204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* uc003mfv.3
   [2]   498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* uc001ajb.1

You can drop the trailing "*" with

   subseq(proteome[trs], end=-2)

Cheers,
H.

>
> It is interesting that in bioconductor there are no databases linking
> transcripts to proteins
> Cheers
> Raf
>
> On 08/01/14 17:10, Michael Lawrence wrote:
>> In theory, you should be able to get the cds regions using e.g. the
>> Homo.sapiens package, but it seems kind of tough to retrieve those for
>> UCSC
>> Known Gene identifiers (assuming that is what you have). Marc Carlson
>> could
>> probably help more.
>>
>> Michael
>>
>>
>>
>>
>>
>> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger <raffaele.calogero at unito.it>
>> wrote:
>>
>>> Dear Michael,
>>> thank for the kind suggestion but unfortunately it does not solve my
>>> problem because, using the approach you are suggesting, I do not have
>>> access to the position of the start codon for the different isoforms.
>>> Cheers
>>> Raf
>>>
>>>
>>> On 07/01/14 16:44, Michael Lawrence wrote:
>>>
>>>> If you had the transcript coordinates (as GRangesList, perhaps from an
>>>> exonsBy() on a TranscriptDb) you could use
>>>> extractTranscriptsFromGenome()
>>>> and translate, see the GenomicFeatures vignette for an example.
>>>>
>>>> Michael
>>>>
>>>>
>>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger <raffaele.calogero at gmail.com>
>>>> wrote:
>>>>
>>>>   Hi,
>>>>> In order to validate fusion products I need to be sure that the
>>>>> peptides
>>>>> encoded by the the two fused proteins are in the same frame.
>>>>> I have now a function that allows to confirm the protein1 and protein2
>>>>> have sequences located in the same frame.
>>>>> However, I got stack to retrieve those proteins sequences from UCSC. I
>>>>> did
>>>>> not found a quick way to retrieve the protein sequence associated to a
>>>>> UCSC
>>>>> ID.
>>>>> Indeed the protein sequence is present in the UCSC genome browser,
>>>>> but I
>>>>> do not know how to grab it.
>>>>> Any suggestion?
>>>>> Cheers
>>>>> Raffaele
>>>>>
>>>>> --
>>>>>
>>>>> ----------------------------------------
>>>>> Prof. Raffaele A. Calogero
>>>>> Bioinformatics and Genomics Unit
>>>>> MBC Centro di Biotecnologie Molecolari
>>>>> Via Nizza 52, Torino 10126
>>>>> tel.   ++39 0116706457
>>>>> Fax    ++39 0112366457
>>>>> Mobile ++39 3333827080
>>>>> email: raffaele.calogero at unito.it
>>>>>          raffaele[dot]calogero[at]gmail[dot]com
>>>>> www:   http://www.bioinformatica.unito.it
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: http://news.gmane.org/gmane.
>>>>> science.biology.informatics.conductor
>>>>>
>>>>>
>>> --
>>>
>>> ----------------------------------------
>>> Prof. Raffaele A. Calogero
>>> Bioinformatics and Genomics Unit
>>> MBC Centro di Biotecnologie Molecolari
>>> Via Nizza 52, Torino 10126
>>> tel.   ++39 0116706457
>>> Fax    ++39 0112366457
>>> Mobile ++39 3333827080
>>> email: raffaele.calogero at unito.it
>>>         raffaele[dot]calogero[at]gmail[dot]com
>>> www:   http://www.bioinformatica.unito.it
>>>
>>>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319