[BioC] retrieving mRNA sequences via biomaRt

Thu Aug 6 19:02:08 CEST 2009

Thanks for the recommendation.

So far, I have only read Steffen's and your biomaRt user’s guide and 
looked at the BioMart 0.7 documentation, since I needed quick results.
I'm now going to look at the recommended book and paper.

In the meantime, I arrived at a solution, though not a very satisfying one:

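library(biomaRt)

# connect to the Ensembl BioMart, then select the human gene dataset
ensembl = useMart("ensembl")
ensembl = useDataset("hsapiens_gene_ensembl", mart=ensembl)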

myAttributes = c("embl", "cdna", "5utr", "coding", "3utr", "5_utr_end", 
"3_utr_start", "sequence_cdna_length","cds_length")

...

qresult = getBM(attributes=myAttributes,
                filters=...,
                values=...,
                mart=ensembl)

finalResult = mySeqCdsLengthFilter(qresult, c(3000, 5000), c(2000, 3000))

For now, I filter my query results manually, using
the values of "sequence_cdna_length" and "cds_length" as limits.
I wish these attributes were available as filters ...
or that there were a BioMart and a database that I could use in a linked 
query via getLDS.

I'm still curious about a smarter solution.
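
The manual filtering step can be sketched roughly as follows. This is only 
an illustration: the body of mySeqCdsLengthFilter is not shown above, so the 
helper name filterBySeqAndCdsLength, the toy data, and the temporary output 
file are assumptions, as is the expectation that the result columns carry 
the attribute names requested from getBM. The last lines keep the first 200 
hits and write them to a tab-separated file, one possible format for Task 2.

```r
# Hypothetical helper (not the mySeqCdsLengthFilter from the post):
# keep rows whose cDNA length and CDS length both fall inside the given
# inclusive ranges. Column names are assumed to equal the attribute names.
filterBySeqAndCdsLength <- function(result, seqRange, cdsRange,
                                    seqCol = "sequence_cdna_length",
                                    cdsCol = "cds_length") {
  seqLen <- as.numeric(result[[seqCol]])
  cdsLen <- as.numeric(result[[cdsCol]])
  # rows with a missing length are dropped rather than kept
  keep <- !is.na(seqLen) & !is.na(cdsLen) &
    seqLen >= seqRange[1] & seqLen <= seqRange[2] &
    cdsLen >= cdsRange[1] & cdsLen <= cdsRange[2]
  result[keep, , drop = FALSE]
}

# Toy data standing in for a getBM() result (IDs and values are made up)
toy <- data.frame(
  embl = c("A1", "B2", "C3", "D4"),
  sequence_cdna_length = c(3500, 2500, 4800, 4000),
  cds_length = c(2400, 2100, 3500, NA),
  stringsAsFactors = FALSE
)

hits <- filterBySeqAndCdsLength(toy, c(3000, 5000), c(2000, 3000))

# Keep at most the first 200 hits and store them as a tab-separated file
outFile <- tempfile(fileext = ".tsv")
write.table(head(hits, 200), file = outFile, sep = "\t",
            quote = FALSE, row.names = FALSE)
```

With a real query, toy would be replaced by qresult and outFile by a 
permanent file name.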

Best regards,
Simon

Wolfgang Huber wrote:
> 
> Hi Simon,
> 
> with all due respect, for a first contact with the Bioconductor project 
> I'd also recommend studying some of the documentation.
> 
> Two (slightly biased) starting points are the "Bioconductor 
> Case Studies" book by Hahne, Huber, Gentleman, Falcon and the paper 
> "Mapping identifiers for the integration of genomic datasets with the 
> R/Bioconductor package biomaRt." by Durinck et al. in Nature Protocols 
> 2009;4(8):1184-91.
> 
>     Best wishes
>     Wolfgang
> 
> 
> 
> 
> Simon wrote:
>> Hello everybody,
>>
>> I am trying to solve the following tasks as a first contact with the 
>> Bioconductor project:
>>
>> # Task 1:
>> # find:
>> #   * mRNA sequence (5'UTR, Coding region, 3'UTR)
>> #   * position of start codon in sequence
>> #   * position of stop codon in sequence
>> #   * ID (Which ID(s) would I choose to reference my
>> #     sequence hits? Embl, ensembl transcript id,
>> #     Entrez Gene id, RefSeq, etc.?)
>> #   * name of associated protein product
>> #
>> #  where:
>> #   * origin is human
>> #     Entrez Search would be: human[ORGN]
>> #   * sequence is mRNA transcript
>> #     Entrez Search for Molecule Type: biomol_mRNA[PROP]?
>> #   * mRNA sequence length is 3000 to 5000 nts
>> #     * Entrez Search for Sequence Length: 3000:5000[SLEN]
>> #   * coding region of mRNA length is 2000 to 3000 nts
>> #     * Entrez Search Field for stop and start of
>> #       coding region: start:stop[CDS]
>> #
>> #
>> # Task 2:
>> # store the retrieved information to file for the first 200 hits
>> # (Which would be a suitable file format?)
>>
>> I started by using and playing around with the biomaRt package for R, 
>> but I got overwhelmed by its many possibilities.
>>
>> I would be glad to get any feedback on how to start on, or even solve, 
>> my tasks.
>>
>> Best regards,
>> Simon