Hi Hervé,
Thanks for a really comprehensive answer!  I actually need to fetch these
from many genomes (orthologs for ~30 species) so in the meantime I fell
back on PyCogent and it seems to work.

Best,
Greg


On Sun, Dec 29, 2013 at 8:53 PM, Hervé Pagès <hpages@fhcrc.org> wrote:

> Hi Greg,
>
> Not sure exactly why you're getting this error. FWIW this can
> also be done with the GenomicFeatures and BSgenome packages:
>
>   library(GenomicFeatures)
>   txdb <- makeTranscriptDbFromBiomart("ensembl", "hsapiens_gene_ensembl")
>   gn <- genes(txdb)
>
> Then:
>
>  > gn["ENSG00000165702"]
>   GRanges with 1 range and 1 metadata column:
>                     seqnames                 ranges strand | gene_id
>                        <Rle>              <IRanges>  <Rle> |
> <CharacterList>
>     ENSG00000165702        9 [135820932, 135867083]      + |
> ENSG00000165702
>     ---
>     seqlengths:
>                      1                 2 ...            LRG_98    LRG_99
>              249250621         243199373 ...             18750     13294
>
> Obtain the ranges of the flanking regions with flank(). By default
> flank() returns the upstream flanks:
>
>   > gnflanks <- flank(gn, width=100)
>
>   > gnflanks["ENSG00000165702"]
>   GRanges with 1 range and 1 metadata column:
>                     seqnames                 ranges strand | gene_id
>                        <Rle>              <IRanges>  <Rle> |
> <CharacterList>
>     ENSG00000165702        9 [135820832, 135820931]      + |
> ENSG00000165702
>     ---
>     seqlengths:
>                      1                 2 ...            LRG_98    LRG_99
>              249250621         243199373 ...             18750     13294
>
> To extract the corresponding sequence, we can use the BSgenome
> package for hg19. This assembly is compatible with the GRCh37.p12
> assembly for all the main chromosomes *except* for chrM. Also we'll
> need to adjust the names of the chromosomes because Ensembl/NCBI
> and UCSC use different naming conventions:
>
>   > library(BSgenome.Hsapiens.UCSC.hg19)
>   > head(seqlevels(Hsapiens))
>   [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"
>
> Keep only flank regions for genes located on chromosomes 1 to 22, X
> and Y, and rename the chromosomes:
>
>   seqlevels(gnflanks, force=TRUE) <- c(1:22, "X", "Y")
>   seqlevels(gnflanks) <- paste0("chr", seqlevels(gnflanks))
>
> Finally, use getSeq() to extract the DNA sequence:
>
>   > getSeq(Hsapiens, gnflanks["ENSG00000165702"])
>     A DNAStringSet instance of length 1
>       width seq
>   [1]   100 TAATGACATCTGTGACGTGAAGAATCAAGGAGAT...
> GCCTAGTAAATGCTCACTCACTAGATAGGTGGC
>
> Note that getSeq() can extract more than 1 sequence at once:
>
>   > my_genes <- c("ENSG00000165702", "ENSG00000252380", "ENSG00000200999")
>
>   > getSeq(Hsapiens, gnflanks[my_genes])
>     A DNAStringSet instance of length 3
>       width seq
>   [1]   100 TAATGACATCTGTGACGTGAAGAATCAAGGAGAT...
> GCCTAGTAAATGCTCACTCACTAGATAGGTGGC
>   [2]   100 AGTCAAGGTGCTAGTTTGTGATGGAGCGGCAGGC...
> TCAACTAAATAAAACTAGCAAACTAGCTACAAA
>   [3]   100 CTTACATGCAAGAAAACTTCTAAGAAAACTTATA...
> AGAAAAAGCAGAATGAGCATCAGTAAATATTTA
>
> Cheers,
> H.
>
>
>
> On 12/28/2013 04:43 AM, Greg Slodkowicz wrote:
>
>> I should've probably mentioned the exact error I'm getting:
>>
>> Error in getBM(attributes = attrs, filters = list(ensembl_gene_id =
>> "ENSG00000165702",  :
>>    Query ERROR: caught BioMart::Exception: non-BioMart die(): Can't use an
>> undefined value as an ARRAY reference at
>> /ensemblweb/wwwmart/www_74/biomart-perl/lib/BioMart/
>> Dataset/GenomicSequence.pm
>> line 396.
>>
>> G.
>>
>>
>> On Sat, Dec 28, 2013 at 1:42 PM, Greg Slodkowicz <gregs@ebi.ac.uk> wrote:
>>
>>  Dear all,
>>> I'm having some trouble fetching flanking regions for a gene from the
>>> Ensembl Biomart:
>>>
>>> library(biomaRt)
>>> ensembl <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
>>> dataset="hsapiens_gene_ensembl",
>>>        host="www.ensembl.org", path="/biomart/martservice")
>>>
>>> attrs <- c("ensembl_gene_id", "upstream_flank")
>>> gene_ann <- getBM(attributes=attrs,
>>> filters=list(ensembl_gene_id="ENSG00000165702", upstream_flank=100),
>>> mart=ensembl, checkFilters=F)
>>> gene_ann
>>>
>>> (here's a Gist: https://gist.github.com/jergosh/8159092)
>>>
>>> I also tried passing the parameters differently but it doesn't make any
>>> difference:
>>>
>>> gene_ann <- getBM(attributes=attrs, filters=c("ensembl_gene_id",
>>> "upstream_flank"), values=list(ensembl_gene_id="ENSG00000165702",
>>> upstream_flank=100), mart=ensembl, checkFilters=F)
>>>
>>> Am I doing something wrong?
>>>
>>> Best,
>>> Greg
>>>
>>> --
>>> Greg Slodkowicz
>>> PhD student, Nick Goldman group
>>> European Bioinformatics Institute (EMBL-EBI)
>>>
>>>
>>
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages@fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>



-- 
Greg Slodkowicz
PhD student, Nick Goldman group
European Bioinformatics Institute (EMBL-EBI)

	[[alternative HTML version deleted]]

