[BioC] Extracting protein sequence associated to UCSC transcript id

Thu Jan 9 15:19:59 CET 2014

Hi Raf,

The GenomicFeatures vignette covers these packages:

http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf

Best,

Jim

On 1/9/2014 5:24 AM, rcaloger wrote:
> Dear Michael and Hervé,
> many thanks for the great help
> I followed the suggestion of Hervé
> Since I need to retrieve only two sequences I did the following
> ...
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE)
> trs <- c(id1,id2)
> if(length(which(trs %in% names(cds_by_tx)))==1){
>         cat("\nOnly names(cds_by_tx)[which(trs %in% names(cds_by_tx))] 
> is coding\n"
>         return()
>     }else if(length(which(trs %in% names(cds_by_tx)))==0){
>         cat("\nNon of the two transcripts is coding\n"
>         return()
>     }
>     cds_seqs <- 
> extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, 
> cds_by_tx[which(names(cds_by_tx) %in% trs)])
>     proteome <- translate(cds_seqs)
>     names(proteome) <- names(cds_seqs)
>     p1 <- proteome[which(names(proteome)==id1)]
>     p2 <- proteome[which(names(proteome)==id2)]
> ...
>
> Many thanks again!
>
> One point that would be of general interest.
> TxDb.XX.YY.ZZ.knownGene are very powerful dbs but, as far as I know, 
> there is no vignette associated to them, and documentation, in my 
> opinion does not allow an easy understanding of the db.
> Do you know if some tutorial like that available for BSgenome is also 
> present for TxDb.XX.YY.ZZ.knownGene?
> Cheers
> Raf
>
> On 09/01/14 00:49, Michael Lawrence wrote:
>> I had issues doing this with Homo.sapiens:
>>
>> First I tried just:
>>> cdsBy(Homo.sapiens)
>> Error in .select(x, keys, columns, keytype, ...) :
>>    You must provide cols argument
>>
>> Not really sure why? And the argument is "columns"...
>>
>> Then:
>>> cdsBy(Homo.sapiens, columns=character())
>> Error in res[, .reverseColAbbreviations(x, cnames), drop = FALSE] :
>>    incorrect number of dimensions
>>
>>
>>> cdsBy(Homo.sapiens, columns="UCSCKG")
>> Warning messages:
>> 1: In .generateExtraRows(tab, keys, jointype) :
>>    'select' and duplicate query keys resulted in 1:many mapping between
>> keys and return rows
>> 2: In .generateExtraRows(tab, keys, jointype) :
>>    'select' resulted in 1:many mapping between keys and return rows
>> 3: In .generateExtraRows(tab, keys, jointype) :
>>    'select' and duplicate query keys resulted in 1:many mapping between
>> keys and return rows
>> 4: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) 
>> else
>> paste0(labels,  :
>>    duplicated levels in factors are deprecated
>>
>> Worked, but a bunch of warnings (not sure if all of them are helpful?)
>>
>> Finally, I was not aware of the use.names argument. Seems very 
>> useful. But
>> apparently not supported yet for the OrganismDb objects:
>>
>>> cdsBy(Homo.sapiens, columns="UCSCKG", use.names=TRUE)
>> Error in .local(x, by, ...) : unused argument (use.names = TRUE)
>>
>> I think the OrganismDb objects are in theory a great step forward, 
>> because
>> they let us easily integrate other identifiers. But I've had little luck
>> getting them to work so far.
>>
>> Michael
>>
>>
>> On Wed, Jan 8, 2014 at 11:59 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>
>>> Hi Raf,
>>>
>>>
>>> On 01/08/2014 10:25 AM, rcaloger wrote:
>>>
>>>> I found a way to extract the protein sequences querying the UCSC 
>>>> web page.
>>>> However, there should be a  more elegant way to do it.
>>>> library(RCurl)
>>>> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2")
>>>> myquery<- list()
>>>> for(i in 1:length(trs)){
>>>>       myquery[[i]] <-
>>>> getURL(paste("http://genome-euro.ucsc.edu/cgi-bin/hgGene?
>>>> hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=",
>>>> trs[i],sep=""))
>>>>       Sys.sleep(30)
>>>> }
>>>>
>>> Your 3rd transcript is non-coding hence no protein sequence for it.
>>>
>>> Otherwise you get exactly what you wanted using Michael's suggestion:
>>>
>>>    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>>    cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE)
>>>
>>> See which of your transcripts are coding:
>>>
>>>
>>>    > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2")
>>>    > trs %in% names(cds_by_tx)
>>>    [1]  TRUE  TRUE FALSE
>>>
>>> Extract and translate the CDS sequences (note that here I choose to
>>> compute the full proteome but I don't have to):
>>>
>>>    library(BSgenome.Hsapiens.UCSC.hg19)
>>>    cds_seqs <- 
>>> extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19,
>>> cds_by_tx)
>>>    proteome <- translate(cds_seqs)
>>>    names(proteome) <- names(cds_seqs)
>>>
>>> Then:
>>>
>>>    > proteome
>>>      A AAStringSet instance of length 63691
>>>            width seq names
>>>        [1]   134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* 
>>> uc010nxq.1
>>>        [2]   306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* 
>>> uc001aal.1
>>>        [3]   390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* 
>>> uc009vjk.2
>>>        [4]   390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* 
>>> uc001aau.3
>>>        [5]   267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* 
>>> uc021oeh.1
>>>        ...   ... ...
>>>    [63687]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* 
>>> uc031tgq.1
>>>    [63688]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* 
>>> uc031tgr.1
>>>    [63689]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* 
>>> uc031tgs.1
>>>    [63690]   279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* 
>>> uc011mgh.2
>>>    [63691]   280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* 
>>> uc011mgi.2
>>>
>>>    > trs <- c("uc003mfv.3", "uc001ajb.1")
>>>
>>>    > proteome[trs]
>>>      A AAStringSet instance of length 2
>>>        width seq names
>>>    [1]   204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* 
>>> uc003mfv.3
>>>    [2]   498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* 
>>> uc001ajb.1
>>>
>>> You can drop the trailing "*" with
>>>
>>>    subseq(proteome[trs], end=-2)
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>>
>>>> It is interesting that in bioconductor there are no databases linking
>>>> transcripts to proteins
>>>> Cheers
>>>> Raf
>>>>
>>>> On 08/01/14 17:10, Michael Lawrence wrote:
>>>>
>>>>> In theory, you should be able to get the cds regions using e.g. the
>>>>> Homo.sapiens package, but it seems kind of tough to retrieve those 
>>>>> for
>>>>> UCSC
>>>>> Known Gene identifiers (assuming that is what you have). Marc Carlson
>>>>> could
>>>>> probably help more.
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger <raffaele.calogero at unito.it>
>>>>> wrote:
>>>>>
>>>>>   Dear Michael,
>>>>>> thank for the kind suggestion but unfortunately it does not solve my
>>>>>> problem because, using the approach you are suggesting, I do not 
>>>>>> have
>>>>>> access to the position of the start codon for the different 
>>>>>> isoforms.
>>>>>> Cheers
>>>>>> Raf
>>>>>>
>>>>>>
>>>>>> On 07/01/14 16:44, Michael Lawrence wrote:
>>>>>>
>>>>>>   If you had the transcript coordinates (as GRangesList, perhaps 
>>>>>> from an
>>>>>>> exonsBy() on a TranscriptDb) you could use
>>>>>>> extractTranscriptsFromGenome()
>>>>>>> and translate, see the GenomicFeatures vignette for an example.
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger 
>>>>>>> <raffaele.calogero at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>    Hi,
>>>>>>>
>>>>>>>> In order to validate fusion products I need to be sure that the
>>>>>>>> peptides
>>>>>>>> encoded by the the two fused proteins are in the same frame.
>>>>>>>> I have now a function that allows to confirm the protein1 and 
>>>>>>>> protein2
>>>>>>>> have sequences located in the same frame.
>>>>>>>> However, I got stack to retrieve those proteins sequences from 
>>>>>>>> UCSC. I
>>>>>>>> did
>>>>>>>> not found a quick way to retrieve the protein sequence 
>>>>>>>> associated to a
>>>>>>>> UCSC
>>>>>>>> ID.
>>>>>>>> Indeed the protein sequence is present in the UCSC genome browser,
>>>>>>>> but I
>>>>>>>> do not know how to grab it.
>>>>>>>> Any suggestion?
>>>>>>>> Cheers
>>>>>>>> Raffaele
>>>>>>>>
>>>>>>>> -- 
>>>>>>>>
>>>>>>>> ----------------------------------------
>>>>>>>> Prof. Raffaele A. Calogero
>>>>>>>> Bioinformatics and Genomics Unit
>>>>>>>> MBC Centro di Biotecnologie Molecolari
>>>>>>>> Via Nizza 52, Torino 10126
>>>>>>>> tel.   ++39 0116706457
>>>>>>>> Fax    ++39 0112366457
>>>>>>>> Mobile ++39 3333827080
>>>>>>>> email: raffaele.calogero at unito.it
>>>>>>>>           raffaele[dot]calogero[at]gmail[dot]com
>>>>>>>> www:   http://www.bioinformatica.unito.it
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives: http://news.gmane.org/gmane.
>>>>>>>> science.biology.informatics.conductor
>>>>>>>>
>>>>>>>>
>>>>>>>>   --
>>>>>> ----------------------------------------
>>>>>> Prof. Raffaele A. Calogero
>>>>>> Bioinformatics and Genomics Unit
>>>>>> MBC Centro di Biotecnologie Molecolari
>>>>>> Via Nizza 52, Torino 10126
>>>>>> tel.   ++39 0116706457
>>>>>> Fax    ++39 0112366457
>>>>>> Mobile ++39 3333827080
>>>>>> email: raffaele.calogero at unito.it
>>>>>>          raffaele[dot]calogero[at]gmail[dot]com
>>>>>> www:   http://www.bioinformatica.unito.it
>>>>>>
>>>>>>
>>>>>>
>>>>
>>> -- 
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099