[BioC] retrieving genomic sequences with biomaRt

Steffen Durinck durincks at mail.nih.gov
Tue Oct 10 21:24:00 CEST 2006


Hi Jose,

For one biomaRt function, the implementation in MySQL or web service 
mode is not the same: getSequence.
If you currently use getSequence in MySQL mode you need to give a 
chromosome name + start and end positions of the sequences you want to 
retrieve, and you can not use the seqType argument as 'only' genomic 
sequences (regardless of their annotation) are returned. 
If you use getSequence in web service mode, you have to use the seqType 
argument as the Ensembl BioMart web service restricts sequence retrieval 
to annotated sequences such as either protein, 5'UTR, 3'UTR, ... (full 
list see MartView of Ensembl).  The result of this query will be a 
data.frame with all sequences that have this annotation  (e.g. 5'UTR) 
and are located on the given chromosome between the given start and end 
positions.

This should be better documented in the getSequence function and the 
getSequence function should throw a proper error message.  I will make 
these fixes asap.

Cheers,
Steffen


J.delasHeras at ed.ac.uk wrote:
>
> Hi Steffen,
>
> that is more or less what I tried, but I get the following error after 
> the call to 'getSequence'
>
> "Error in inherits(x, "factor") : argument "seqType" is missing, with 
> no default"
>
> The only thing I seem to do different is that I didn't indicate 
> mysql=TRUE because I haven't installed the package yet... I will, 
> 'though, but I don't suppose that's the cause of the error?
>
> Thanks for your help, Steffen!
>
> Jose
>
> Quoting Steffen Durinck <durincks at mail.nih.gov>:
>
>> Hi Jose,
>>
>> This should be possible with biomaRt, I would recommend using it in
>> MySQL mode as the web service sometimes hangs for sequences.
>>
>> Which type of identifiers do you have?
>>
>> As an example with Entrez Gene identifiers you could do:
>>
>> library(biomaRt)
>> ensmart=useMart("ensembl",dataset="hsapiens_gene_ensembl", mysql=TRUE)
>> entrez = c("100","330")
>> gene = getGene(id=entrez, type="entrezgene", mart=ensmart)
>> getSequence(chromosome = gene$chromosome, start = gene$start - 2000, end
>> = gene$end + 1000, mart=ensmart)
>>
>> Note that you'll have to make it a little bit more complicated than this
>> as you have to take into account the strand on which the genes are
>> located.  The strand is also returned by the getGene function.
>>
>> best,
>> Steffen
>>
>> J.delasHeras at ed.ac.uk wrote:
>>> Hi,
>>>
>>> I am putting a little script together to look at promoter regions, 
>>> search for CpG islands etc, given a list of gene ids.
>>>
>>> I thought I would be able to use biomaRt to retrieve the sequence of 
>>> interest, but I just can't find how right now... either I'm too 
>>> tired (possible) or I'm missing something and it's just not 
>>> possible. In particular 'getSequence' looked promising, but no luck 
>>> so far.
>>>
>>> What I want to do is:
>>> 1) I have a gene id, from it I want to locate the chromosome 
>>> location of the gene, the start of transcription.
>>> 2) knowing the transcriptional start, I want to retrieve the genomic 
>>> sequence around that area... say 2000bp upstream and 1000 downstream.
>>>
>>> Anybody knows if this is possible and/or could give me any pointers?
>>>
>>> Thank you!
>>>
>>> Jose
>>>
>>>
>>
>>
>> -- 
>> Steffen Durinck, Ph.D.
>>
>> Oncogenomics Section
>> Pediatric Oncology Branch
>> National Cancer Institute, National Institutes of Health
>> URL: http://home.ccr.cancer.gov/oncology/oncogenomics/
>>
>> Phone: 301-402-8103
>> Address:
>> Advanced Technology Center,
>> 8717 Grovemont Circle
>> Gaithersburg, MD 20877
>>
>
>
>


-- 
Steffen Durinck, Ph.D.

Oncogenomics Section
Pediatric Oncology Branch
National Cancer Institute, National Institutes of Health
URL: http://home.ccr.cancer.gov/oncology/oncogenomics/

Phone: 301-402-8103
Address:
Advanced Technology Center,
8717 Grovemont Circle
Gaithersburg, MD 20877



More information about the Bioconductor mailing list