[BioC] retrieving genomic sequences with biomaRt

Steffen Durinck durincks at mail.nih.gov
Tue Oct 10 19:53:01 CEST 2006


Hi Jose,

This should be possible with biomaRt. I would recommend using it in 
MySQL mode, as the web service sometimes hangs when retrieving sequences.

Which type of identifiers do you have?

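As an example, with Entrez Gene identifiers you could do:

library(biomaRt)
# connect to Ensembl (human genes) in MySQL mode
ensmart = useMart("ensembl", dataset = "hsapiens_gene_ensembl", mysql = TRUE)
# Entrez Gene identifiers to look up
entrez = c("100", "330")
# gene annotation, including chromosome, start and end
gene = getGene(id = entrez, type = "entrezgene", mart = ensmart)
# sequence from 2000 bp before the gene start to 1000 bp after the gene end
getSequence(chromosome = gene$chromosome, start = gene$start - 2000,
            end = gene$end + 1000, mart = ensmart)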

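Note that you'll have to make it a little more complicated than this, 
as you have to take into account the strand on which the genes are 
located.  For a gene on the reverse strand, transcription starts at the 
higher coordinate (the gene end), so the upstream region lies beyond the 
end rather than before the start.  The strand is also returned by the 
getGene function.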
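
To illustrate the strand handling, here is a rough sketch in plain R. 
It computes a window of 2000 bp upstream and 1000 bp downstream of the 
transcription start. The function promoter_window is a hypothetical 
helper, not part of biomaRt, and it assumes the strand is coded as 1 
(forward) or -1 (reverse). The coordinates are made up for the example.

```r
# Hypothetical helper (not part of biomaRt): window around the
# transcription start site (TSS), taking the strand into account.
# Assumption: strand is coded as 1 (forward) or -1 (reverse).
promoter_window <- function(start, end, strand,
                            upstream = 2000, downstream = 1000) {
  forward <- strand == 1
  # TSS is the gene start on the forward strand and the gene end on
  # the reverse strand
  tss <- ifelse(forward, start, end)
  data.frame(
    tss = tss,
    # clamp at 1 so the window never runs off the chromosome start
    start = pmax(1, ifelse(forward, tss - upstream, tss - downstream)),
    end = ifelse(forward, tss + downstream, tss + upstream)
  )
}

# Made-up coordinates, for illustration only
win <- promoter_window(start = c(10000, 50000),
                       end = c(20000, 60000),
                       strand = c(1, -1))
print(win)
```

The resulting start and end columns could then be passed to getSequence 
in place of gene$start - 2000 and gene$end + 1000. For genes on the 
reverse strand you may also need to reverse-complement the returned 
sequence, depending on how it is delivered.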

best,
Steffen

J.delasHeras at ed.ac.uk wrote:
> Hi,
>
> I am putting a little script together to look at promoter regions, 
> search for CpG islands etc, given a list of gene ids.
>
> I thought I would be able to use biomaRt to retrieve the sequence of 
> interest, but I just can't find how right now... either I'm too tired 
> (possible) or I'm missing something and it's just not possible. In 
> particular 'getSequence' looked promising, but no luck so far.
>
> What I want to do is:
> 1) I have a gene id, from it I want to locate the chromosome location 
> of the gene, the start of transcription.
> 2) knowing the transcriptional start, I want to retrieve the genomic 
> sequence around that area... say 2000bp upstream and 1000 downstream.
>
> Does anybody know if this is possible, or could give me any pointers?
>
> Thank you!
>
> Jose
>
>   


-- 
Steffen Durinck, Ph.D.

Oncogenomics Section
Pediatric Oncology Branch
National Cancer Institute, National Institutes of Health
URL: http://home.ccr.cancer.gov/oncology/oncogenomics/

Phone: 301-402-8103
Address:
Advanced Technology Center,
8717 Grovemont Circle
Gaithersburg, MD 20877


