[Bioc-devel] how to get genomic sequences?
Herve Pages
hpages at fhcrc.org
Wed Mar 28 00:03:09 CEST 2007
Hi Roger,
You can use one of the Biostrings-based genome data packages for this.
Those packages contain the full genomic sequences for some organisms.
Here is how to proceed (with R-devel + Bioc-devel).
1) Install BSgenome
===================
> source("http://bioconductor.org/biocLite.R")
> biocLite("BSgenome")
> library(BSgenome)
> available.genomes()
[1] "BSgenome.Celegans.UCSC.ce2"
[2] "BSgenome.Dmelanogaster.BDGP.Release5"
[3] "BSgenome.Dmelanogaster.FlyBase.r51"
[4] "BSgenome.Dmelanogaster.UCSC.dm2"
[5] "BSgenome.Hsapiens.UCSC.hg16"
[6] "BSgenome.Hsapiens.UCSC.hg17"
[7] "BSgenome.Hsapiens.UCSC.hg18"
[8] "BSgenome.Mmusculus.UCSC.mm7"
[9] "BSgenome.Mmusculus.UCSC.mm8"
[10] "BSgenome.Scerevisiae.UCSC.sacCer1"
2) Install and load a specific genome
=====================================
> biocLite("BSgenome.Hsapiens.UCSC.hg18") # can take a long time (850M to download)
> library(BSgenome.Hsapiens.UCSC.hg18)
> ls(2)
[1] "Hsapiens"
> Hsapiens
Homo sapiens genome:
Single sequences (DNAString objects, see '?seqnames'):
chr1 chr2 chr3 chr4 chr5
chr6 chr7 chr8 chr9 chr10
chr11 chr12 chr13 chr14 chr15
chr16 chr17 chr18 chr19 chr20
chr21 chr22 chrX chrY chrM
chr5_h2_hap1 chr6_cox_hap1 chr6_qbl_hap2 chr1_random chr2_random
chr3_random chr4_random chr5_random chr6_random chr7_random
chr8_random chr9_random chr10_random chr11_random chr13_random
chr15_random chr16_random chr17_random chr18_random chr19_random
chr21_random chr22_random chrX_random
Multiple sequences (BStringViews objects, see '?mseqnames'):
upstream1000 upstream2000 upstream5000
(use the '$' or '[[' operator to access a given sequence)
3) Use getSeq() to retrieve the genomic sequence in a given chromosome, at given start and end
==============================================================================================
> getSeq(Hsapiens, "chrX", 100, 150)
[1] "CCTGAGCCAGCAGTGGCAACCCAATGGGGTCCCTTTCCATACTGTGGAAGC"
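Since the original question starts from a table of regions, the call above can be applied row by row. Below is a minimal sketch, not part of BSgenome: fetch_regions(), the regions data frame and its column names (chr, start, end) are all hypothetical names chosen for illustration. A toy in-memory "genome" stands in for Hsapiens so the example runs without downloading anything. Coordinates are 1-based and inclusive, as in the getSeq() call above (100 to 150 gives 51 letters).

```r
# Apply a fetch function to every row of a regions table.
# 'fetch' is any function(chr, start, end) returning one sequence.
# (fetch_regions is a hypothetical helper, not a BSgenome function.)
fetch_regions <- function(regions, fetch) {
  mapply(
    function(chr, start, end) as.character(fetch(chr, start, end)),
    regions$chr, regions$start, regions$end,
    USE.NAMES = FALSE
  )
}

# Toy stand-in for a genome: a named character vector of "chromosomes".
toy_genome <- c(chrA = "ACGTACGTAC", chrB = "TTTTGGGGCC")
fetch_toy <- function(chr, start, end) substr(toy_genome[[chr]], start, end)

# Hypothetical regions table: one row per region, 1-based inclusive coordinates.
regions <- data.frame(
  chr   = c("chrA", "chrB"),
  start = c(2, 5),
  end   = c(5, 8),
  stringsAsFactors = FALSE
)

seqs <- fetch_regions(regions, fetch_toy)
print(seqs)
```

With the hg18 package loaded, the same helper could presumably be called as fetch_regions(regions, function(chr, start, end) getSeq(Hsapiens, chr, start, end)), using the four-argument form shown above.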
If you need to retrieve a big chunk (> 100000 nucleotides), then it's much more efficient
to use as.BStringViews=TRUE, which returns a view on the chromosome instead of a character string:
> getSeq(Hsapiens, "chrX", 100, 5000000, as.BStringViews=TRUE)
Views on a 154913754-letter DNAString subject
Subject: CTAACCCTAACCCTAACCCTAACCCTAACCCTAA...TGTGGGTGTGTGGGTGTGGTGTGTGGGTGTGGT
Views:
    start     end   width
[1]   100 5000000 4999901 [CCTGAGCCAGCAGTGGCAACCCAA...CCTATTATTGACTTCACTTGAGCT]
See ?getSeq (from BSgenome package) for more info...
Finally, there have been some important improvements + changes in the devel versions
of Biostrings and BSgenome so I strongly suggest you use Bioc-devel for this.
Let me know if you need further help.
Cheers,
H.
Roger Liu wrote:
> Hi,
>
> I have a set of data with chromosome numbers and sequence coordinates,
> such as chr10, start 1000, end 2000.
> I tried using biomart to retrieve the genomic sequences for my dataset,
> without success. I set the seqType argument to
> seqType="genomic", and it reported the error "The type of sequence specified with
> seqType is not available. Please select from: cdna, peptide, 3utr, 5utr",
> but I have seen this "genomic" value for seqType in the help file. So
> what's going on there?
>
> Or can anyone recommend a package that can help me retrieve the genomic
> sequences from hg18, given known chromosome numbers and sequence
> coordinates (start and end)?
>
> Thanks.
>
> ZRL