[BioC] QuasR: how to use an indexed reference genome?
Paul Shannon
paul.thurmond.shannon at gmail.com
Fri May 17 15:01:29 CEST 2013
Hi Michael,
Thanks for your quick and clarifying response.
Since it is not possible to use pre-built indices from the bowtie developers, I would be glad to have a small recipe (perhaps featured prominently in the vignette?) which
1) explains the need for custom-built indices
2) provides (perhaps) a standalone QuasR-specific command for creating one
My somewhat fuzzy grasp of the current approach is that
1) QuasR sees the string "BSgenome.Hsapiens.UCSC.hg19" on a call to qAlign
2) QuasR then spends a few hours building a new package with the proper index
3) and saves this package somewhere (I could not figure out where)
I agree that a one-time cost to create a QuasR index is quite reasonable, given all the benefits of the package.
Maybe it's just my own muddle-headedness, but I could not -- and still cannot :} -- figure out how to do it, where the result is written, and thus how to reuse the index on subsequent runs.
Can you give me some assistance?
Many thanks,
- Paul
On May 16, 2013, at 11:41 PM, Michael Stadler wrote:
> Dear Paul,
>
> You are right, it is not possibile to use pre-built indices with QuasR,
> such as those downloaded from the bowtie developers.
>
> This is because QuasR needs to be able to go back to the original
> sequences, for example when working with bisulfite converted reads,
> which it does by accessing the fasta or BSgenome. Since we cannot
> control the sequences that went into an externally provided index, it
> would be more complicated to guarantee a consistent state (identical
> sequences in BSgenome and index, but also identical sequence names and
> order, etc.).
>
> I do understand that building the index is tedious and will take a few
> hours, but it has to be done only once, and from there on will be
> available for futher QuasR analyses. This is still fast compared to the
> CPU-time that goes into the analysis of most NGS datasets.
>
> If more people would prefer to use pre-built indices, and if the BioC
> agrees to host such big packages, it would also be conceivable to
> provide index packages corresponding to the BSgenome packages for download.
>
> Regards,
> Michael
>
>
>
> On 16.05.2013 22:48, Paul Shannon wrote:
>> I am new to QuasR, and alos quite new to aligning short reads to reference genomes more generally.
>> I cannot figure out how to use a pre-built indexed reference genome file with QuasR. The examples supplied with the package work nicely.
>> Scaling up to using all of hg19 raises problems for me. I apologize if I am missing the obvious.
>>
>> To illustrate the problem, I call QuasR's qAlign method with just two arguments (quoting from the man page):
>>
>> sampleFile: a text file listing input sequence files and sample names
>>
>> genome: the reference genome for primary alignments, one of:
>>
>> * a string referring to a "BSgenome" package (e.g.
>> ""BSgenome.Hsapiens.UCSC.hg19""), which will be
>> downloaded automatically from Bioconductor if not present
>>
>> * the name of a fasta sequence file containing one or
>> several sequences (chromosomes) to be used as a reference
>>
>> QuasR apparently invokes the bowtie indexing program when supplied either of the two "genome" options: a BSgenome package, or a fasta file. But since indexing takes a long time -- hours, apparently -- I hoped to provide a ready-made index file, and found some described here:
>>
>>
>> http://bowtie-bio.sourceforge.net/tutorial.shtml
>>
>> specifically
>>
>> ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19.ebwt.zip
>>
>> Various attempts to specify this file, or any of its contents (unzipped) to QuasR fail with these messages:
>>
>>
>> Error: The specified genome /Users/pshannon/s/data/public/bowtie/indexes/hg19.1.ebwt does not have the extension of a fasta file (fa,fasta,fna)>
>> Error: The specified genome has to be a file and not a directory: /Users/pshannon/s/data/public/bowtie/indexes
>>
>>
>> I'll be grateful for advice on how to do this properly.
>>
>> Thanks,
>>
>> - Paul
>>
>>
>>> sessionInfo()
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] Rsamtools_1.13.14 BSgenome_1.29.0 Biostrings_2.29.2 QuasR_1.1.4 GenomicRanges_1.13.12 XVector_0.1.0
>> [7] IRanges_1.19.8 BiocGenerics_0.7.2 Rbowtie_1.1.3 BiocInstaller_1.11.1
>>
>> loaded via a namespace (and not attached):
>> [1] AnnotationDbi_1.23.11 Biobase_2.21.2 DBI_0.2-7 GenomicFeatures_1.13.8 RCurl_1.95-4.1
>> [6] RSQLite_0.11.3 ShortRead_1.19.3 XML_3.95-0.2 biomaRt_2.17.0 bitops_1.0-5
>> [11] compiler_3.0.0 grid_3.0.0 hwriter_1.3 lattice_0.20-15 rtracklayer_1.21.5
>> [16] stats4_3.0.0 tools_3.0.0 zlibbioc_1.7.0
>>
>>
>>
>>
More information about the Bioconductor
mailing list