[BioC] obtain DNA sequence
Hervé Pagès
hpages at fhcrc.org
Mon Sep 14 21:50:55 CEST 2009
Hi Simon,
The getSeq() function from the BSgenome package is provided for that
purpose:
> library(BSgenome.Mmusculus.UCSC.mm9)
> myseqs <- data.frame(
+   Chr=c("chr9", "chr6", "chr8", "chrX", "chr4", "chr11"),
+   Start=c(79466420, 50495860, 19687900,
+           90313740, 117732780, 4090400),
+   Stop=c(79466570, 50496010, 19688050,
+          90313890, 117732930, 4090550))
> myseqs
    Chr     Start      Stop
1  chr9  79466420  79466570
2  chr6  50495860  50496010
3  chr8  19687900  19688050
4  chrX  90313740  90313890
5  chr4 117732780 117732930
6 chr11   4090400   4090550
> getSeq(Mmusculus, myseqs$Chr, start=myseqs$Start, end=myseqs$Stop)
[1] "CTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCTGCCTCCAAGTGCTGGGATTAACGGTGTGCACCACCACTGCCTGGC"
[2] "TTACTGTCACCCTCAGAATCACGTGTTCAGATATCCAGCTTCCGGGTGACAAACCCACAAAATTCTCTTTTTTCTTTAACCTTACTCTCTCCAACACTTGACCTTTCTTTGTTTATTCCTTCTGGAGTGCCCAGGTCCTTATGCATTATGA"
[3] "GGTAGGTAAGTAATGGTCACCTATTCTCTTTCTATCTGGTATGTCTGCAGGTTGACAGGCTGGTGCCTGCCCTTCAACCCAGGAAGCAGAGCTTGTGTTCAATCATTATTGCACATTAACAAGGAAAAAAATGCCTTGTTGGATTCTTAAA"
[4] "TCAAAATGGCAAGAAAAACACTTAAGTTTTTATTACTCAGGGCTCACAGCAGCTAAAAGGTTTCAGCAATATTATATGGCATACAAATTGCAGATTTAACTTGGTTGAGGAGCGTCCCCATGCAATCACCATAATATTTTATTGTAGAATA"
[5] "TTCAAAACGTCCTCCTGCTTCCTCTGTGGTGACCAGCTATGACTCGGGGCATCCCTCCTCAAGGCCTTAGTGTTTTGGCTTTGCTCAGTTTCCATGAGGCCTGACCAACCCCTAGGAGTCTCCTCTTTCTGCCTCCTGCTACCTGGATGCA"
[6] "AGCCTGCTCTGTAGGGAACCTTTAGTGGGCTTGAAGTGTTCCCTGACTGCTCTTGAGCACTGGCCAAAAGCAAGAAAGCAGCTAGCCCATGAATGGCCCTGTGGGTGGCACAGGCACAGGCAGTGAAACCCCAAGAAGACCAGGTATAATG"
See ?getSeq for more information about this function.
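To get the result into a single csv file, as asked in the original question, something along the following lines should work. This is only a sketch: write_region_seqs is a hypothetical helper name (it is not part of BSgenome), the output file name is an arbitrary choice, and it assumes the sequences come back as one character string per row, as in the output above.

```r
# Hypothetical helper (not part of BSgenome): put each sequence next to
# its coordinates and write everything to one csv file.
write_region_seqs <- function(regions, seqs, file) {
  # One sequence is expected per row of the coordinate table.
  stopifnot(nrow(regions) == length(seqs))
  out <- cbind(regions, Sequence = as.character(seqs))
  write.csv(out, file = file, row.names = FALSE)
  invisible(out)
}

# With the objects from the session above, the call would look like:
# seqs <- getSeq(Mmusculus, myseqs$Chr, start=myseqs$Start, end=myseqs$Stop)
# write_region_seqs(myseqs, seqs, "myseqs_with_sequences.csv")
```

The resulting file has the columns Chr, Start, Stop and Sequence, one row per region, so it can be read back with read.csv() or opened in a spreadsheet.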
Cheers,
H.
Biddie, Simon (NIH/NCI) [F] wrote:
> Dear All,
>
> I am trying to obtain DNA sequences (mouse) from chromosome coordinates. I am relatively new to R and Bioconductor and would appreciate any help.
>
> I have a matrix in the following style:
>
>     Chr     Start      Stop
> 1  chr9  79466420  79466570
> 2  chr6  50495860  50496010
> 3  chr8  19687900  19688050
> 4  chrX  90313740  90313890
> 5  chr4 117732780 117732930
> 6 chr11   4090400   4090550
>
> I can use the following code to obtain a single sequence by typing in the chromosome number, start and stop manually:
>
>> library(BSgenome.Mmusculus.UCSC.mm9)
>
>> seq1 = subseq(Mmusculus$chr9,79466420,79466570)
>
>> as(seq1, "character")
>
> How would I do this for every row of the matrix, with the output written to a single txt or csv file, without having to type each row in by hand (I have up to 15,000)? Please find the sessionInfo below.
>
> Thank you for any help,
>
> Simon
>
>> sessionInfo()
> R version 2.8.1 (2008-12-22)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices datasets utils methods base
>
> other attached packages:
> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.11 BSgenome_1.10.5
> [3] Biostrings_2.10.22 IRanges_1.0.16
> [5] R.utils_1.1.3 R.oo_1.4.6
> [7] R.methodsS3_1.0.3
>
> loaded via a namespace (and not attached):
> [1] grid_2.8.1 lattice_0.17-25 Matrix_0.999375-23
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319