[BioC] question about how to get ranges in Biostrings
Hervé Pagès
hpages at fhcrc.org
Thu Feb 21 22:34:30 CET 2013
Hi Andress,
[cc'ing the list so others can benefit from or participate to the
thread]
On 02/21/2013 06:43 AM, Andres Weinfeld Avalos Figueroa wrote:
> Dear Dr Pagés
> Greetings from Guatemala.
> I am somewhat new in using BioC, and an intermediate R user.
>
> Last couple of days I have been working with the matchProbePair function
> on Sorghum genome, testing around 3,500 SSR primer pairs generated in
> 2011 for sugarcane for the ICSB. I am using Sorghum as a framework due
> to its genome stability and ploidy (besides genetic closeness) I have
> succeeded in evaluating the 10 chromosomes, and obtained a interesting
> amount of stringviews.
>
> Now I want to plot those theoretical amplicons along S. bicolor genome
> (chromosome by crhromosome), but I have not found the way to substract
> the ranges from this list. Being "result" the matchprobepair outputI
> know that >range(result[[1]]) does it, but I want to use lapply()
> function...
To extract the ranges (as an IRanges object) from the result (a Views
object), just use ranges():
> result <- matchProbePair(Fprobe, Rprobe, subject)
> rg <- ranges(result)
> rg
IRanges of length 9
start end width
[1] 6523541 6524340 800
[2] 7268925 8284316 1015392
[3] 9019256 9295206 275951
[4] 13503602 15492126 1988525
[5] 17612494 19210625 1598132
[6] 21840840 21862022 21183
[7] 22281529 22329032 47504
[8] 23045531 23096640 51110
[9] 23759210 24263538 504329
If you want to plot those ranges, for example with the Gviz package:
library(Gviz)
plotTracks(list(GenomeAxisTrack(),
AnnotationTrack(GRanges("chr3R", rg))))
If you want to use lapply(), then:
lapply(seq_along(rg),
function(i) {
... do something with start(rg)[i] and end(rg)[i] ...
})
Note that the plotting code and lapply code above would work directly
on the Views object (result) so there is no need to extract the
ranges in 'rg'.
If you want to extract the sequences of the amplicons, just pass
'result' to the DNAStringSet constructor:
> amplicons <- DNAStringSet(result)
> amplicons
A DNAStringSet instance of length 9
width seq
[1] 800
AGCTCCGAGTTCCTGCAATATTTGAACGAGGAC...CTTCCAAGCCATCCGCATATTTGTGAACAACG
[2] 1015392
AGCTCCGAGTTGAGATGAGGCTACCCACTTAAT...TATTTGGCCTCGGTCAAATCGAACTCGGAGCT
[3] 275951
AGCTCCGAGTTACGTGTATCGATTACCCCGATT...ACTAGGAAATAAATAAAGTAGAACTCGGAGCT
[4] 1988525
CGTTGTTCACATCCCTCTGCTGTTCCCGCTGAC...ACCGATGATAGAAAGAGTGTTAACTCGGAGCT
[5] 1598132
CGTTGTTCACAATCGTCAGATCGAAGAAGTGAC...CACTTGGAACTTGACACTTGGAACTCGGAGCT
[6] 21183
AGCTCCGAGTTCGTCGAGAATCTGCACGAGAAG...GCCATGTGCATAAGGGCTCGGAACTCGGAGCT
[7] 47504
AGCTCCGAGTTGCGCTTCCTGATGGGCACGGAT...CGGGTGTAAAAAGCTGAGAGCAACTCGGAGCT
[8] 51110
CGTTGTTCACATTTTGCTCAGAGCATTTGCATT...AAAACTAAACCCGCAGACTGATGTGAACAACG
[9] 504329
AGCTCCGAGTTCTAAAAAGAGCAGTTTTATCAG...GCTGCCAATAAAGGAAAAGGAAACTCGGAGCT
If you want to look at them individually, use [[ either on 'result'
or 'amplicons':
> result[[3]]
275951-letter "DNAString" instance
seq:
AGCTCCGAGTTACGTGTATCGATTACCCCGATTTTA...AAAAACTAGGAAATAAATAAAGTAGAACTCGGAGCT
> amplicons[[3]]
275951-letter "DNAString" instance
seq:
AGCTCCGAGTTACGTGTATCGATTACCCCGATTTTA...AAAAACTAGGAAATAAATAAAGTAGAACTCGGAGCT
If you want to loop on the sequences of the amplicons, you can use
lapply() directly on 'result' or 'amplicons':
lapply(amplicons, ...)
But if you have a lot of amplicons (e.g. more than 10-20k), that's
probably going to take a long time. Note that a lot of basic
operations are serialized and fast even on big DNAStringSet objects:
> alphabetFrequency(amplicons)
A C G T M R W S Y K V H D B N - +
[1,] 216 196 218 170 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 287220 220536 221447 286189 0 0 0 0 0 0 0 0 0 0 0 0 0
[3,] 79566 58099 58205 80081 0 0 0 0 0 0 0 0 0 0 0 0 0
[4,] 573547 421203 423538 570237 0 0 0 0 0 0 0 0 0 0 0 0 0
[5,] 453126 345037 344723 455246 0 0 0 0 0 0 0 0 0 0 0 0 0
[6,] 5902 4635 4783 5863 0 0 0 0 0 0 0 0 0 0 0 0 0
[7,] 13708 10336 10342 13118 0 0 0 0 0 0 0 0 0 0 0 0 0
[8,] 12589 12586 12343 13592 0 0 0 0 0 0 0 0 0 0 0 0 0
[9,] 142837 108612 110988 141892 0 0 0 0 0 0 0 0 0 0 0 0 0
Hope this helps,
H.
>
> This is a really basic question, I hope you can find a second to push me
> throug... I would really appreciate your help. I am not in my computer
> right now... don't have my work handy but in a couple of hours.
>
> Kind regards
> Andres Avalos
> Biotecnologist
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list