[BioC] getSequence ensembl biomaRt
James W. MacDonald
jmacdon at med.umich.edu
Fri Aug 14 15:41:51 CEST 2009
If you simply want to get the DNA sequence, then you should use the
BSgenome.Drerio.UCSC.danRer5 package:
> suppressMessages(library(BSgenome.Drerio.UCSC.danRer5))
> subseq(Drerio$chr15, 18357968,18360987)
  3020-letter "MaskedDNAString" instance (# for masking)
seq: CATATATCTTAAGCAGAGTGCACTGGACAGATCAGA...TAAAGGTTTTTTTCCCTGGTGACCTTCCACACCAAA
masks:
  maskedwidth maskedratio active names                               desc
1           0   0.0000000   TRUE AGAPS                      assembly gaps
2           0   0.0000000   TRUE   AMB           intra-contig ambiguities
3        1446   0.4788079  FALSE    RM                       RepeatMasker
4           0   0.0000000  FALSE   TRF Tandem Repeats Finder [period<=12]
all masks together:
  maskedwidth maskedratio
         1446   0.4788079
all active masks together:
  maskedwidth maskedratio
            0           0
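As a quick sanity check on the numbers in that summary: the reported length and the RepeatMasker ratio both follow from the coordinates, assuming the positions are 1-based and inclusive (which the 3020-letter result implies). A minimal base-R sketch; the variable names are only for illustration:

```r
# Coordinates passed to subseq(); treated as 1-based and inclusive.
region_start <- 18357968
region_end   <- 18360987

# Inclusive width: both endpoints count, hence the + 1.
region_width <- region_end - region_start + 1
region_width

# Fraction of the region covered by the RepeatMasker mask (1446 masked bases).
rm_ratio <- 1446 / region_width
round(rm_ratio, 7)
```

This prints 3020 and 0.4788079, the same width and maskedratio shown in the MaskedDNAString summary.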
You can convert to a string (small range converted here):
> toString(subseq(Drerio$chr15, 18357968,18358000))
[1] "CATATATCTTAAGCAGAGTGCACTGGACAGATC"
>
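If a plain character string for an arbitrary region is all that is needed, getSeq() can also fetch it in one call. The following is only a sketch, not something I have tailored to your setup: the helper name fetch_region is hypothetical, and it assumes BSgenome.Drerio.UCSC.danRer5 is installed (it returns NULL otherwise).

```r
# Hypothetical helper: return a genomic region as a character string,
# or NULL when the genome data package is not installed.
fetch_region <- function(chrom, start, end,
                         pkg = "BSgenome.Drerio.UCSC.danRer5") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message("Package ", pkg, " is not installed; returning NULL.")
    return(NULL)
  }
  # Attaching the data package also attaches BSgenome, which provides getSeq().
  suppressMessages(library(pkg, character.only = TRUE))
  # The data package exposes the genome object under the short name 'Drerio'.
  genome <- get("Drerio")
  # No gene model is involved, so intergenic regions work as well.
  region <- getSeq(genome, chrom, start = start, end = end)
  as.character(region)
}

fetch_region("chr15", 18357968, 18358000)
```

With the package installed, the result should correspond to the toString() output above.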
Best,
Jim
Mayra Eduardoff wrote:
> Hi James,
> thanks, I know ... my question is how to get a genomic DNA sequence
> (where there may be no gene)?
> Any ideas?
> kind regards
> mayra
>
> On Thu, Aug 13, 2009 at 2:53 PM, James W. MacDonald
> <jmacdon at med.umich.edu> wrote:
>
> Hi Mayra,
>
>
> Mayra Eduardoff wrote:
>
> Hi Steffen
>
>
> I want to retrieve a genomic sequence with biomaRt:
>
>
> sessionInfo()
> R version 2.9.1 (2009-06-26)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=German_Austria.1252;LC_CTYPE=German_Austria.1252;LC_MONETARY=German_Austria.1252;LC_NUMERIC=C;LC_TIME=German_Austria.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] BSgenome_1.12.3 cureos_0.3 Biostrings_2.12.8
> IRanges_1.2.3 zfv2.db_1.0.0 RSQLite_0.7-1
> [7] DBI_0.2-4 Agi4x44PreProcess_1.4.0 genefilter_1.24.2
> annotate_1.22.0 AnnotationDbi_1.6.1 venn_1.5
> [13] multtest_2.1.1 vsn_3.12.0 Biobase_2.5.5
> xtable_1.5-5 limma_2.18.2 biomaRt_2.0.0
>
>
>
> mart <- useMart("ensembl")
> mart <- useDataset(mart=mart, "drerio_gene_ensembl")
>
>
> seq <- getSequence(chromosome = 15, start = 18357968, end = 18360987,
> mart = mart)
>
> Error in getSequence(chromosome = 15, start = 18357968, end = 18360987, :
> Please specify the type of sequence that needs to be retrieved when using
> biomaRt in web service mode. Choose either gene_exon, transcript_exon,
> transcript_exon_intron, gene_exon_intron, cdna, coding,
> coding_transcript_flank, coding_gene_flank, transcript_flank, gene_flank,
> peptide, 3utr or 5utr
>
> Apart from the fact that I want a genomic region, even if I specify type it
> doesn't seem to work:
>
> seq <- getSequence(chromosome = 15, start = 18357968, end = 18360987,
> type="gene_exon", mart = mart)
> Error in getSequence(chromosome = 15, start = 18357968, end = 18360987, :
> Please specify the type of sequence that needs to be retrieved when using
> biomaRt in web service mode. Choose either gene_exon, transcript_exon,
> transcript_exon_intron, gene_exon_intron, cdna, coding,
> coding_transcript_flank, coding_gene_flank, transcript_flank, gene_flank,
> peptide, 3utr or 5utr
>
>
> or as in the documentation (although it doesn't make any sense to me to
> specify both seqType and type...)
>
>
> You have to specify seqType and type because the sequences don't
> come back in the same order you requested, so the type argument is
> used to label the sequences.
>
> Also, I don't see any way to get inter-genic sequences. For instance:
>
> > getSequence(15,18357968,18360987,seqType="cdna", mart=mart,
> type="ensembl_transcript_id")
> [1] cdna ensembl_transcript_id
> <0 rows> (or 0-length row.names)
>
> Because this portion of the zebrafish genome contains no known
> genes. However, if I pick a region that does contain a gene:
>
> > getSequence(15,18723006,18741517,seqType="cdna", mart=mart,
> type="ensembl_transcript_id")
>
> cdna
> 1
> AGGAGCCGCTCAGACCACACCAGTGCAGGGTCAGAACCTGGTGACAAATAATGTCTCAGTGGTGGAGGGCGAGACGGCCATCATCAGCTGCCGGGTGAAAAACAACGACGACTCCGTCATCCAACTGCTCAACCCCAACCGGCAGACTATCTACTTCAGAGACGTTAGACCTTTGAAGGACAGTCGGTTTCAGCTGGTAAACTTCTCCGACAACGAGCTCTTGGTGTCCCTGTCCAACGTGTCTCTGTCGGACGAGGGCCGCTACGTGTGTCAACTCTACACGGATCCACCGCAAGAAGCCTACGCCGACATCACTGTACTGGTTCCACCAGGCAACCCAATCTTAGAGTCCCGCGAGGAAATCGTGAGCGAGGGGAATGAGACCGAGATAACCTGCACCGCCATGGGCAGCAAACCTGCTTCCACCATCAAATGGATGAAAGGCGACCAACCACTGCAAGGTGAGGCGACTGTGGAGGAGTTATACGACAGGATGTTCACTGTCACCAGCCGGCTCAGGCTCACCGTCTCTAAGGAGGACGATGGAGTGGCCGTCATCTGCATCATTGACCATCCAGCCGTGAAGGACTTCCAGGCCCAGAAATACCTGGAAGTGCAGTATAAACCAGAAGTGAAGATTGTGGTGGGATTCCCAGAGGGTTTGACCAGAGAAGGAGAAAATCTCGAGCTGACATGCAAAGCTAAAGGAAAACCGCAGCCTCATCAAATTAACTGGCTCAAAGTGGATGATGATTTCCCCTCCCACGCCTTGGTAACTGGCTCTGATCTCTTCATCGAAAACCTTAACAAGTCCTACAACGGAACGTACCGCTGTGTGGCATCTAACTTAGTGGGAGAAGCCTACGATGATTACATCCTTTATGTATACGATTCAAGAGCAGATGGAGCGCCACAGAAAATTGATCATGCCGTCATCGGCGGAGTTGTCGCAGTGGTTGTGTTCGCCATGCTTTGTCTCCTGA
> TTGTTC
> TTGGCCGATATTTCGCCAGACACAAAGGGACCTACTTCACCCACGAAGCTAAAGGAGCGGATGACGCGGCGGACGCCGACACTGCCATCATCAACGCAGAGGGCGGACACAACAATTCGGATGACAAGAAGGAATACTACATTTAA
> ensembl_transcript_id
> 1 ENSDART00000062603
>
> Best,
>
> Jim
>
>
>
>
>
> seq <- getSequence(chromosome = 15, start = 18357968, end =
> 18360987,
> type="entrez", seqType="cdna", mart = mart)
> Error in getBM(c(seqType, type), filters = c("chromosome_name", "start", :
>
> Invalid attribute(s): entrez
> Please use the function 'listAttributes' to get valid attribute names
>
>
>
>
> I can't load in mysql mode anymore either:
> mart <- useMart("ensembl", mysql=TRUE)
> Error: mysql access to Ensembl is no longer available through this package
> the web service mode supports all queries. If mysql is needed a separate
> package will become available with limited mysql query support.
>
>
> I would be very grateful for your help!
>
>
> kind regards,
>
> Mayra
>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826
>
>
>
>
> --
> Mayra Eduardoff
> Institute of Molecular Biology
> University of Innsbruck
> Viktor-Franz Hess Haus
> Technikerstrasse 25
> 6020 Innsbruck
> Tel: +43 512 507 6286
> email: mayra.eduardoff at student.uibk.ac.at
>
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826