[BioC] question regarding BSGenome/Biostrings

Fri Nov 21 03:33:23 CET 2008

Hi Elizabeth,

Elizabeth Purdom wrote:
> Hello,
> 
> I am trying to run code like that in the pdf documentation of BSGenome 
> to match patterns to all the chromosomes of a genome. I have copied the 
> function 'runOneStrandAnalysis' from the documentation and have changed 
> runAnalysis2 to be the human genome (BSgenome.Hsapiens.UCSC.hg18) as 
> follows:
> 
>  runAnalysis2 <- function(dict0, outfile="")
>  {
>  seqnames <- seqnames(Hsapiens)
>  runOneStrandAnalysis(dict0, Celegans, seqnames, "+", outfile=outfile, 
> append=FALSE)
>  runOneStrandAnalysis(dict0, Celegans, seqnames, "-", outfile=outfile, 
> append=TRUE)
>  }

You also need to replace the 2nd argument of your call to 
runOneStrandAnalysis() by Hsapiens.

> 
> When I run the code using the patterns in the example, it runs fine 
> (though of course the patterns in the dictionary were intended for C 
> elegans, but that's irrelevant):
> 
> runAnalysis2(ce2dict0cw15, outfile="ce2dict0cw15_ana2.txt")
> 
> Instead I want to find if regions in the genome are repeated. So I made 
> a dictionary that is class 'XStringViews' by using views to grab the 
> regions:
> 
> MAPLENGTH<-36
> WRITELENGTH<-100
> myDictViews<-views(Hsapiens$chr1, 1:WRITELENGTH, 
> (1:WRITELENGTH)+MAPLENGTH-1)
> 
> And I get the following error:
> runAnalysis2(myDictViews, outfile="sample.txt")
>  >>> Finding all hits in strand + of chromosome chr1 ...
> Error in .match.CWdna_PDict.exact(pdict, subject, fixed, count.only) :
>   Biostrings internal error in _init_chrtrtable(): invalid code -1
> 
> The problem is in matchPDict:
> mypdict<-PDict(myDictViews)
> subject <- Hsapiens[["chr1"]]
> mindex <- matchPDict(mypdict, subject)
> Error in .match.CWdna_PDict.exact(pdict, subject, fixed, count.only) :
>   Biostrings internal error in _init_chrtrtable(): invalid code -1
> 
> I have checked and there are only A,C,G,T values in this region. Is it 
> because I have a 'XStringViews' object rather than a DNAStringSet 
> object? If this is a bug (I know this part is still under progress) is 
> there a quick fix, like converting from 'XStringViews' to 'DNAStringSet'?

This is an old bug when one of the 4 base letters was missing (in your 
case the views contain no Gs).
The bug was fixed in the devel branch of BioC a couple of months before
the last BioC release. So I suspect you are still using an outdated
version of Biostrings (providing your sessionInfo() would help us to
figure out this).

Please update your installation and make sure you use BioC 2.3
and the latest version of Biostrings (v 2.10.6).

Then create the views with:

   myDictViews <- Views(unmasked(Hsapiens$chr1),
                        start=1:WRITELENGTH,
                        end=(1:WRITELENGTH)+MAPLENGTH-1)

(note that views() is deprecated in favor of Views())
and try to run your code again with this 'myDictViews'.

> 
> Also, if I want to do this on a large scale, is this a reasonable way to 
> go or are there tricks for this that would speed it up? I'm still not 
> sure that I will ultimately do this in R, but just to know.

If you want to slide the 36-nucleotide window along the entire
Human genome in order to find which views are repeated, yes it's going
to be a huge analysis and will require a lot of computing power.
I don't think the matchPDict() approach is the best way to go but
it's certainly doable with it.

Here are a few tips:

   o If you blindly slide the window (of width 36) along chromosome 1
     (or any other chromosome), you will run thru N-blocks. These blocks
     are generally assembly gaps (inter-contig) but they can also be
     found inside a contig (intra-contig Ns).
     Of course you want to skip the Ns (or any other ambiguity letter
     but I don't think there are any in hg18), even isolated Ns, because
     PDict() doesn't support them.
     One easy way to do this is to split chr1 into N-free views with:
       as(Hsapiens$chr1, "XStringViews")

   o Even on a machine with a lot of memory, there are currently some
     limitations to the size of a PDict object. For a window of width
     36, the max number of views that you will be able to store in
     the PDict object should be between 10 and 15 millions.
     Note that for 10 millions, the resulting PDict object will occupy
     about 7GB in memory. So make sure that you have at least 16GB or
     RAM if you plan to use PDict objects of this size.
     Otherwise you will need to use smaller PDict objects (5 millions
     views) and the entire analysis will be twice longer.

   o The MIndex object returned by matchPDict() contains the coordinates
     of all the hits and for each PDict object that you run along the
     genome will collect tens or hundreds of millions of hits.
     If you are only interested in knowing how many times each view
     is repeated in the entire genome, it's highly recommended that
     you use countPDict() instead of matchPDict(). That will help you
     save a lot of memory.

   o Use a streaming approach to save your results to a file (i.e.
     save as soon as you can so you don't have to keep everything
     in memory during the entire analysis). Also try to save in a
     compact way. For example you could just use a 4-column format,
     e.g.:
         chromosome strand     start  nb_of_occurences
         ...        ...          ...               ...
         chr1       +      147776276                86
         chr1       +      147776277               135
         chr1       +      147776278                97
         ...        ...          ...               ...
     and save lines only for views that have other occurences in the
     genome (nb_of_occurences >= 2).
     But if you are only interested in views that are unique in the
     genome (nb_of_occurences = 1), you could only save those and then
     you don't need the nb_of_occurences column anymore.

Note that this is maybe the kind of analysis where using an indexed
genome (suffix tree or suffix array) could be more efficient. But I
cannot give any advice on this since I'm not familiar with this
kind of approach.

Hope this helps.

H.

> 
> Thanks,
> Elizabeth
> UC, Berkeley
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319