[BioC] QuasR: how to use an indexed reference genome?

Fri May 17 15:01:29 CEST 2013

Hi Michael,

Thanks for your quick and clarifying response. 

Since it is not possible to use pre-built indices from the bowtie developers, I would be glad to have a small recipe (perhaps featured prominently in the vignette?) which

  1) explains the need for custom-built indices
  2) provides (perhaps) a standalone QuasR-specific command for creating one

My somewhat fuzzy grasp of the current approach is that 

  1) QuasR sees the string "BSgenome.Hsapiens.UCSC.hg19" on a call to qAlign
  2) QuasR then spends a few hours building a new package with the proper index
  3) and saves this package somewhere (I could not figure out where)

I agree that a one-time cost to create a QuasR index is quite reasonable, given all the benefits of the package.

Maybe it's just my own muddle-headedness, but I could not -- and still cannot :} -- figure out how to do it, where the result is written, and thus how to reuse the index on subsequent runs.

Can you give me some assistance?

Many thanks,

 - Paul

On May 16, 2013, at 11:41 PM, Michael Stadler wrote:

> Dear Paul,
> 
> You are right, it is not possibile to use pre-built indices with QuasR,
> such as those downloaded from the bowtie developers.
> 
> This is because QuasR needs to be able to go back to the original
> sequences, for example when working with bisulfite converted reads,
> which it does by accessing the fasta or BSgenome. Since we cannot
> control the sequences that went into an externally provided index, it
> would be more complicated to guarantee a consistent state (identical
> sequences in BSgenome and index, but also identical sequence names and
> order, etc.).
> 
> I do understand that building the index is tedious and will take a few
> hours, but it has to be done only once, and from there on will be
> available for futher QuasR analyses. This is still fast compared to the
> CPU-time that goes into the analysis of most NGS datasets.
> 
> If more people would prefer to use pre-built indices, and if the BioC
> agrees to host such big packages, it would also be conceivable to
> provide index packages corresponding to the BSgenome packages for download.
> 
> Regards,
> Michael
> 
> 
> 
> On 16.05.2013 22:48, Paul Shannon wrote:
>> I am new to QuasR, and alos quite new to aligning short reads to reference genomes more generally.
>> I cannot figure out how to use a pre-built indexed reference genome file with QuasR.   The examples supplied with the package work nicely.
>> Scaling up to using all of hg19 raises problems for me.  I apologize if I am missing the obvious.
>> 
>> To illustrate the problem, I call QuasR's qAlign method with just two arguments (quoting from the man page):
>> 
>>    sampleFile:  a text file listing input sequence files and  sample names
>> 
>>    genome: the reference genome for primary alignments, one of:
>> 
>>            * a string referring to a "BSgenome" package (e.g.
>>              ""BSgenome.Hsapiens.UCSC.hg19""), which will be
>>              downloaded automatically from Bioconductor if not present
>> 
>>            * the name of a fasta sequence file containing one or
>>              several sequences (chromosomes) to be used as a reference
>> 
>> QuasR apparently invokes the bowtie indexing program when supplied either of the two "genome" options:  a BSgenome package, or a fasta file.  But since indexing takes a long time -- hours, apparently --  I hoped to provide a ready-made index file, and found some described here:
>> 
>> 
>>   http://bowtie-bio.sourceforge.net/tutorial.shtml
>> 
>> specifically 
>> 
>>   ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19.ebwt.zip
>> 
>> Various attempts to specify this file, or any of its contents (unzipped) to QuasR fail with these messages:
>> 
>> 
>>   Error: The specified genome /Users/pshannon/s/data/public/bowtie/indexes/hg19.1.ebwt does not have the extension of a fasta file (fa,fasta,fna)> 
>>   Error: The specified genome has to be a file and not a directory: /Users/pshannon/s/data/public/bowtie/indexes
>> 
>> 
>> I'll be grateful for advice on how to do this properly.
>> 
>> Thanks,
>> 
>> - Paul
>> 
>> 
>>> sessionInfo()
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>> 
>> locale:
>> [1] C
>> 
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
>> 
>> other attached packages:
>> [1] Rsamtools_1.13.14     BSgenome_1.29.0       Biostrings_2.29.2     QuasR_1.1.4           GenomicRanges_1.13.12 XVector_0.1.0        
>> [7] IRanges_1.19.8        BiocGenerics_0.7.2    Rbowtie_1.1.3         BiocInstaller_1.11.1 
>> 
>> loaded via a namespace (and not attached):
>> [1] AnnotationDbi_1.23.11  Biobase_2.21.2         DBI_0.2-7              GenomicFeatures_1.13.8 RCurl_1.95-4.1        
>> [6] RSQLite_0.11.3         ShortRead_1.19.3       XML_3.95-0.2           biomaRt_2.17.0         bitops_1.0-5          
>> [11] compiler_3.0.0         grid_3.0.0             hwriter_1.3            lattice_0.20-15        rtracklayer_1.21.5    
>> [16] stats4_3.0.0           tools_3.0.0            zlibbioc_1.7.0        
>> 
>> 
>> 
>>