[BioC] question regarding BSGenome/Biostrings
Hervé Pagès
hpages at fhcrc.org
Fri Nov 21 03:33:23 CET 2008
Hi Elizabeth,
Elizabeth Purdom wrote:
> Hello,
>
> I am trying to run code like that in the pdf documentation of BSGenome
> to match patterns to all the chromosomes of a genome. I have copied the
> function 'runOneStrandAnalysis' from the documentation and have changed
> runAnalysis2 to be the human genome (BSgenome.Hsapiens.UCSC.hg18) as
> follows:
>
> runAnalysis2 <- function(dict0, outfile="")
> {
> seqnames <- seqnames(Hsapiens)
> runOneStrandAnalysis(dict0, Celegans, seqnames, "+", outfile=outfile,
> append=FALSE)
> runOneStrandAnalysis(dict0, Celegans, seqnames, "-", outfile=outfile,
> append=TRUE)
> }
You also need to replace the 2nd argument of your call to
runOneStrandAnalysis() by Hsapiens.
>
> When I run the code using the patterns in the example, it runs fine
> (though of course the patterns in the dictionary were intended for C
> elegans, but that's irrelevant):
>
> runAnalysis2(ce2dict0cw15, outfile="ce2dict0cw15_ana2.txt")
>
> Instead I want to find if regions in the genome are repeated. So I made
> a dictionary that is class 'XStringViews' by using views to grab the
> regions:
>
> MAPLENGTH<-36
> WRITELENGTH<-100
> myDictViews<-views(Hsapiens$chr1, 1:WRITELENGTH,
> (1:WRITELENGTH)+MAPLENGTH-1)
>
> And I get the following error:
> runAnalysis2(myDictViews, outfile="sample.txt")
> >>> Finding all hits in strand + of chromosome chr1 ...
> Error in .match.CWdna_PDict.exact(pdict, subject, fixed, count.only) :
> Biostrings internal error in _init_chrtrtable(): invalid code -1
>
> The problem is in matchPDict:
> mypdict<-PDict(myDictViews)
> subject <- Hsapiens[["chr1"]]
> mindex <- matchPDict(mypdict, subject)
> Error in .match.CWdna_PDict.exact(pdict, subject, fixed, count.only) :
> Biostrings internal error in _init_chrtrtable(): invalid code -1
>
> I have checked and there are only A,C,G,T values in this region. Is it
> because I have a 'XStringViews' object rather than a DNAStringSet
> object? If this is a bug (I know this part is still under progress) is
> there a quick fix, like converting from 'XStringViews' to 'DNAStringSet'?
This is an old bug when one of the 4 base letters was missing (in your
case the views contain no Gs).
The bug was fixed in the devel branch of BioC a couple of months before
the last BioC release. So I suspect you are still using an outdated
version of Biostrings (providing your sessionInfo() would help us to
figure out this).
Please update your installation and make sure you use BioC 2.3
and the latest version of Biostrings (v 2.10.6).
Then create the views with:
myDictViews <- Views(unmasked(Hsapiens$chr1),
start=1:WRITELENGTH,
end=(1:WRITELENGTH)+MAPLENGTH-1)
(note that views() is deprecated in favor of Views())
and try to run your code again with this 'myDictViews'.
>
> Also, if I want to do this on a large scale, is this a reasonable way to
> go or are there tricks for this that would speed it up? I'm still not
> sure that I will ultimately do this in R, but just to know.
If you want to slide the 36-nucleotide window along the entire
Human genome in order to find which views are repeated, yes it's going
to be a huge analysis and will require a lot of computing power.
I don't think the matchPDict() approach is the best way to go but
it's certainly doable with it.
Here are a few tips:
o If you blindly slide the window (of width 36) along chromosome 1
(or any other chromosome), you will run thru N-blocks. These blocks
are generally assembly gaps (inter-contig) but they can also be
found inside a contig (intra-contig Ns).
Of course you want to skip the Ns (or any other ambiguity letter
but I don't think there are any in hg18), even isolated Ns, because
PDict() doesn't support them.
One easy way to do this is to split chr1 into N-free views with:
as(Hsapiens$chr1, "XStringViews")
o Even on a machine with a lot of memory, there are currently some
limitations to the size of a PDict object. For a window of width
36, the max number of views that you will be able to store in
the PDict object should be between 10 and 15 millions.
Note that for 10 millions, the resulting PDict object will occupy
about 7GB in memory. So make sure that you have at least 16GB or
RAM if you plan to use PDict objects of this size.
Otherwise you will need to use smaller PDict objects (5 millions
views) and the entire analysis will be twice longer.
o The MIndex object returned by matchPDict() contains the coordinates
of all the hits and for each PDict object that you run along the
genome will collect tens or hundreds of millions of hits.
If you are only interested in knowing how many times each view
is repeated in the entire genome, it's highly recommended that
you use countPDict() instead of matchPDict(). That will help you
save a lot of memory.
o Use a streaming approach to save your results to a file (i.e.
save as soon as you can so you don't have to keep everything
in memory during the entire analysis). Also try to save in a
compact way. For example you could just use a 4-column format,
e.g.:
chromosome strand start nb_of_occurences
... ... ... ...
chr1 + 147776276 86
chr1 + 147776277 135
chr1 + 147776278 97
... ... ... ...
and save lines only for views that have other occurences in the
genome (nb_of_occurences >= 2).
But if you are only interested in views that are unique in the
genome (nb_of_occurences = 1), you could only save those and then
you don't need the nb_of_occurences column anymore.
Note that this is maybe the kind of analysis where using an indexed
genome (suffix tree or suffix array) could be more efficient. But I
cannot give any advice on this since I'm not familiar with this
kind of approach.
Hope this helps.
H.
>
> Thanks,
> Elizabeth
> UC, Berkeley
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list