[BioC] BiomaRt & retrieving flanking regions
Hervé Pagès
hpages at fhcrc.org
Sun Dec 29 20:53:54 CET 2013
Hi Greg,
Not sure exactly why you're getting this error. FWIW this can
also be done with the GenomicFeatures and BSgenome packages:
library(GenomicFeatures)
txdb <- makeTranscriptDbFromBiomart("ensembl", "hsapiens_gene_ensembl")
gn <- genes(txdb)
Then:
> gn["ENSG00000165702"]
GRanges with 1 range and 1 metadata column:
seqnames ranges strand |
gene_id
<Rle> <IRanges> <Rle> |
<CharacterList>
ENSG00000165702 9 [135820932, 135867083] + |
ENSG00000165702
---
seqlengths:
1 2 ... LRG_98
LRG_99
249250621 243199373 ... 18750
13294
Obtain the ranges of the flanking regions with flank(). By default
flank() returns the upstream flanks:
> gnflanks <- flank(gn, width=100)
> gnflanks["ENSG00000165702"]
GRanges with 1 range and 1 metadata column:
seqnames ranges strand |
gene_id
<Rle> <IRanges> <Rle> |
<CharacterList>
ENSG00000165702 9 [135820832, 135820931] + |
ENSG00000165702
---
seqlengths:
1 2 ... LRG_98
LRG_99
249250621 243199373 ... 18750
13294
To extract the corresponding sequence, we can use the BSgenome
package for hg19. This assembly is compatible with the GRCh37.p12
assembly for all the main chromosomes *except* for chrM. Also we'll
need to adjust the names of the chromosomes because Ensembl/NCBI
and UCSC use different naming conventions:
> library(BSgenome.Hsapiens.UCSC.hg19)
> head(seqlevels(Hsapiens))
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"
Keep only flank regions for genes located on chromosomes 1 to 22, X
and Y, and rename the chromosomes:
seqlevels(gnflanks, force=TRUE) <- c(1:22, "X", "Y")
seqlevels(gnflanks) <- paste0("chr", seqlevels(gnflanks))
Finally, use getSeq() to extract the DNA sequence:
> getSeq(Hsapiens, gnflanks["ENSG00000165702"])
A DNAStringSet instance of length 1
width seq
[1] 100
TAATGACATCTGTGACGTGAAGAATCAAGGAGAT...GCCTAGTAAATGCTCACTCACTAGATAGGTGGC
Note that getSeq() can extract more than 1 sequence at once:
> my_genes <- c("ENSG00000165702", "ENSG00000252380", "ENSG00000200999")
> getSeq(Hsapiens, gnflanks[my_genes])
A DNAStringSet instance of length 3
width seq
[1] 100
TAATGACATCTGTGACGTGAAGAATCAAGGAGAT...GCCTAGTAAATGCTCACTCACTAGATAGGTGGC
[2] 100
AGTCAAGGTGCTAGTTTGTGATGGAGCGGCAGGC...TCAACTAAATAAAACTAGCAAACTAGCTACAAA
[3] 100
CTTACATGCAAGAAAACTTCTAAGAAAACTTATA...AGAAAAAGCAGAATGAGCATCAGTAAATATTTA
Cheers,
H.
On 12/28/2013 04:43 AM, Greg Slodkowicz wrote:
> I should've probably mentioned the exact error I'm getting:
>
> Error in getBM(attributes = attrs, filters = list(ensembl_gene_id =
> "ENSG00000165702", :
> Query ERROR: caught BioMart::Exception: non-BioMart die(): Can't use an
> undefined value as an ARRAY reference at
> /ensemblweb/wwwmart/www_74/biomart-perl/lib/BioMart/Dataset/GenomicSequence.pm
> line 396.
>
> G.
>
>
> On Sat, Dec 28, 2013 at 1:42 PM, Greg Slodkowicz <gregs at ebi.ac.uk> wrote:
>
>> Dear all,
>> I'm having some trouble fetching flanking regions for a gene from the
>> Ensembl Biomart:
>>
>> library(biomaRt)
>> ensembl <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
>> dataset="hsapiens_gene_ensembl",
>> host="www.ensembl.org", path="/biomart/martservice")
>>
>> attrs <- c("ensembl_gene_id", "upstream_flank")
>> gene_ann <- getBM(attributes=attrs,
>> filters=list(ensembl_gene_id="ENSG00000165702", upstream_flank=100),
>> mart=ensembl, checkFilters=F)
>> gene_ann
>>
>> (here's a Gist: https://gist.github.com/jergosh/8159092)
>>
>> I also tried passing the parameters differently but it doesn't make any
>> difference:
>>
>> gene_ann <- getBM(attributes=attrs, filters=c("ensembl_gene_id",
>> "upstream_flank"), values=list(ensembl_gene_id="ENSG00000165702",
>> upstream_flank=100), mart=ensembl, checkFilters=F)
>>
>> Am I doing something wrong?
>>
>> Best,
>> Greg
>>
>> --
>> Greg Slodkowicz
>> PhD student, Nick Goldman group
>> European Bioinformatics Institute (EMBL-EBI)
>>
>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list