[BioC] All 3' UTRs for an organism - biomaRt package
Steffen
sdurinck at lbl.gov
Tue Oct 2 02:11:32 CEST 2007
Dear Urska,
Your errors show that the getSequence function needed extra argument
checking to help the user in understanding what is going wrong. This has
been added to the new biomaRt version which will be included in the new
BioConductor release later this week.
Firstly, getSequence always needs the type and seqType arguments. The
type argument specifies the type of identifiers that you use as input
and also specifies the type of identifiers that will be present in the
output so you know to which genes the retrieved sequences belong.
Secondly, getSequence uses two ways to specify the sequences you are
interested in:
a) you give a list of identifiers and getSequence will retrieve the
sequences for the corresponding genes (here you'll use the id argument)
b) you give a chromosomal location and getSequence will retrieve the
sequences of all genes that are located between the given positions.
(here you'll use the chromosome, start and end arguments)
An additional problem that you encounter is that the data volume you
want to retrieve is so large (as you query for all sequences of all
genes) that this possibly resulted in a time-out.
Here's how I would retrieve the sequences:
First retrieve all ensembl gene identifiers in the hsapiens dataset by:
ensmart=useMart("ensembl", dataset="hsapiens_gene_ensembl")
ensemblid = getBM("ensembl_gene_id", mart=ensmart)
This should give 31189 unique ensembl gene identifiers.
To retrieve the 3UTRs for all these identifiers you could now retrieve
them in batches of 5000 at once e.g.:
utr3Batch1 = getSequence(id = ensemblid[1:5000,1],
type="ensembl_gene_id", seqType="3utr", mart=ensmart)
utr3Batch2 = getSequence(id = ensemblid[5001:10000,1],
type="ensembl_gene_id", seqType="3utr", mart=ensmart)
etc.
Note that you'll have to do some cleaning as genes might have multiple
3'UTRs associated with them for alternatively spliced transcripts and
others might have no annotated 3'UTR.
Cheers,
Steffen
Urska Cvek wrote:
> I am using the latest version of biomaRt (1.10.1) package with R version
> 2.5.1, and would like to retrieve all 3' UTR sequences in the Ensembl 46,
> NCBI 36 (current versions) for the homo sapiens organism. I have used the
> Vignette/documentation and tried:
>
>
>> ensmart=useMart("ensembl", dataset="hsapiens_gene_ensembl")
>>
> Checking attributes and filters ... ok
>
>> getSequence(type="ensembl", seqType="3utr", mart=ensmart)
>>
> Error in getSequence(type = "ensembl", seqType = "3utr", mart = ensmart) :
> Invalid type argument. Use the listFilters function to select a
> valid type argument.
>
> although both arguments to "type" and "seqType" seem to be correct, it
> looks like there is something wrong with the input.
>
> I also tried
>
>
>> getSequence(type="entrezgene", seqType="3utr", mart=ensmart)
>>
>
>
>> utr3_1 = getSequence(chromosome=1, start=1, end=10000000,
>>
> type="entrezgene", seqType="3utr", mart=ensmart)
>
>> show(utr3_1)
>>
> NULL
>
> When I try the advanced option for retrieval, the command takes a long time
> to return, and then returns the following:
>
>> getBM(attributes=c("ensembl_gene_id","external_gene_id","3utr"),
>>
> mart=ensmart)
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,
> :
> line 21 did not have 3 elements
>
> I must be missing something... Any suggestions how I could get all 3' UTR
> sequences for the human organism using this package?
>
> Thank you in advance,
>
> U.C.
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
More information about the Bioconductor
mailing list