[BioC] Extracting protein sequence associated to UCSC transcript id
Hervé Pagès
hpages at fhcrc.org
Wed Jan 8 20:59:38 CET 2014
Hi Raf,
On 01/08/2014 10:25 AM, rcaloger wrote:
> I found a way to extract the protein sequences querying the UCSC web page.
> However, there should be a more elegant way to do it.
> library(RCurl)
> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2")
> myquery<- list()
> for(i in 1:length(trs)){
> myquery[[i]] <-
> getURL(paste("http://genome-euro.ucsc.edu/cgi-bin/hgGene?hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=",
> trs[i],sep=""))
> Sys.sleep(30)
> }
Your 3rd transcript is non-coding hence no protein sequence for it.
Otherwise you get exactly what you wanted using Michael's suggestion:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE)
See which of your transcripts are coding:
> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2")
> trs %in% names(cds_by_tx)
[1] TRUE TRUE FALSE
Extract and translate the CDS sequences (note that here I choose to
compute the full proteome but I don't have to):
library(BSgenome.Hsapiens.UCSC.hg19)
cds_seqs <- extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19,
cds_by_tx)
proteome <- translate(cds_seqs)
names(proteome) <- names(cds_seqs)
Then:
> proteome
A AAStringSet instance of length 63691
width seq names
[1] 134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* uc010nxq.1
[2] 306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* uc001aal.1
[3] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc009vjk.2
[4] 390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc001aau.3
[5] 267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* uc021oeh.1
... ... ...
[63687] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgq.1
[63688] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgr.1
[63689] 425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgs.1
[63690] 279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgh.2
[63691] 280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgi.2
> trs <- c("uc003mfv.3", "uc001ajb.1")
> proteome[trs]
A AAStringSet instance of length 2
width seq names
[1] 204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* uc003mfv.3
[2] 498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* uc001ajb.1
You can drop the trailing "*" with
subseq(proteome[trs], end=-2)
Cheers,
H.
>
> It is interesting that in bioconductor there are no databases linking
> transcripts to proteins
> Cheers
> Raf
>
> On 08/01/14 17:10, Michael Lawrence wrote:
>> In theory, you should be able to get the cds regions using e.g. the
>> Homo.sapiens package, but it seems kind of tough to retrieve those for
>> UCSC
>> Known Gene identifiers (assuming that is what you have). Marc Carlson
>> could
>> probably help more.
>>
>> Michael
>>
>>
>>
>>
>>
>> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger <raffaele.calogero at unito.it>
>> wrote:
>>
>>> Dear Michael,
>>> thank for the kind suggestion but unfortunately it does not solve my
>>> problem because, using the approach you are suggesting, I do not have
>>> access to the position of the start codon for the different isoforms.
>>> Cheers
>>> Raf
>>>
>>>
>>> On 07/01/14 16:44, Michael Lawrence wrote:
>>>
>>>> If you had the transcript coordinates (as GRangesList, perhaps from an
>>>> exonsBy() on a TranscriptDb) you could use
>>>> extractTranscriptsFromGenome()
>>>> and translate, see the GenomicFeatures vignette for an example.
>>>>
>>>> Michael
>>>>
>>>>
>>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger <raffaele.calogero at gmail.com>
>>>> wrote:
>>>>
>>>> Hi,
>>>>> In order to validate fusion products I need to be sure that the
>>>>> peptides
>>>>> encoded by the the two fused proteins are in the same frame.
>>>>> I have now a function that allows to confirm the protein1 and protein2
>>>>> have sequences located in the same frame.
>>>>> However, I got stack to retrieve those proteins sequences from UCSC. I
>>>>> did
>>>>> not found a quick way to retrieve the protein sequence associated to a
>>>>> UCSC
>>>>> ID.
>>>>> Indeed the protein sequence is present in the UCSC genome browser,
>>>>> but I
>>>>> do not know how to grab it.
>>>>> Any suggestion?
>>>>> Cheers
>>>>> Raffaele
>>>>>
>>>>> --
>>>>>
>>>>> ----------------------------------------
>>>>> Prof. Raffaele A. Calogero
>>>>> Bioinformatics and Genomics Unit
>>>>> MBC Centro di Biotecnologie Molecolari
>>>>> Via Nizza 52, Torino 10126
>>>>> tel. ++39 0116706457
>>>>> Fax ++39 0112366457
>>>>> Mobile ++39 3333827080
>>>>> email: raffaele.calogero at unito.it
>>>>> raffaele[dot]calogero[at]gmail[dot]com
>>>>> www: http://www.bioinformatica.unito.it
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: http://news.gmane.org/gmane.
>>>>> science.biology.informatics.conductor
>>>>>
>>>>>
>>> --
>>>
>>> ----------------------------------------
>>> Prof. Raffaele A. Calogero
>>> Bioinformatics and Genomics Unit
>>> MBC Centro di Biotecnologie Molecolari
>>> Via Nizza 52, Torino 10126
>>> tel. ++39 0116706457
>>> Fax ++39 0112366457
>>> Mobile ++39 3333827080
>>> email: raffaele.calogero at unito.it
>>> raffaele[dot]calogero[at]gmail[dot]com
>>> www: http://www.bioinformatica.unito.it
>>>
>>>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list