[Bioc-devel] BamFile validation
delhomme at embl.de
Tue Jan 8 22:05:17 CET 2013
On 8 Jan 2013, at 19:53, Martin Morgan wrote:
> On 01/07/2013 12:32 PM, Nicolas Delhomme wrote:
>> Hi Martin, Marc,
>> I'm now implementing the use of BamFile objects in easyRNASeq and I like
>> them. I think it would be very useful if when constructing a BamFile the
>> existence of the path and index could be tested; i.e. this works:
>> BamFile("test.bam","test.bam.bai") although these files do not exist. Is
>> there a reason that this validation is not done? If there is, could a
>> validation parameter be added (set to FALSE by default to keep the current
>> behavior) that would check for the files' existence? The same goes for the
>> yieldSize argument, i.e. this works
>> BamFile("test.bam","test.bam.bai",yieldSize=-1), although I'm not sure what a
>> -1 yieldSize means. I can of course do these validations within easyRNASeq,
>> but anyone else building packages on top of BamFile would probably want to do
>> the same...
> I want to be able to specify a BAM file without opening it, and then open it later, e.g., in mclapply or after distributing to a cluster. Also, conceptually, I want to distinguish between processing an entire BAM file -- provide me with something for which isOpen(BamFile("foo")) == FALSE -- versus reading a chunk of a BamFile, i.e., already open. So I separated BamFile creation from open().
> I focus on open() in the above because opening the BAM file is a cheap way to validate that the BAM file exists -- it could be local or remote (http or ftp, so file.exists isn't sufficient) and even if the file 'exists', as Ryan mentions, it needs to actually be a BAM file, so it should, e.g., have a header. open() allows for all of these possibilities. Also, trying to open a non-existent file results in a clear enough error:
> > open(BamFile("sdfs"))
> Error in value[[3L]](cond) :
> failed to open BamFile: file(s) do not exist:
> So against the votes of the other contributors to this thread, I haven't made a change. Sorry about that.
No need to. I hadn't thought of use cases such as those you presented above, where not checking makes perfect sense. I'll use open() for validating.
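For anyone else building on BamFile, here is a minimal sketch of validating by opening, assuming Rsamtools is installed. canOpen is a hypothetical helper name chosen for illustration; it is not part of Rsamtools. It only relies on open(), close() and isOpen() as discussed above, and it treats any error from open() as "not valid".

```r
library(Rsamtools)

# Hypothetical helper (not part of Rsamtools): report whether a BamFile
# can be opened, without leaving it open afterwards.
canOpen <- function(bf) {
    stopifnot(is(bf, "BamFile"))
    # An already open BamFile has, by definition, been opened successfully.
    if (isOpen(bf)) return(TRUE)
    ok <- tryCatch({
        open(bf)   # errors if the file is missing or cannot be read as BAM
        TRUE
    }, error = function(e) FALSE)
    # Restore the "not open" state so the object can be opened later,
    # e.g. inside mclapply or on a cluster node.
    if (ok) close(bf)
    ok
}

canOpen(BamFile("sdfs"))
```

Because the check goes through open(), it should behave the same for local and remote (http or ftp) files, which a plain file.exists() test would not.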
> I added a check that yieldSize is a non-negative scalar integer, or NA.
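Package code that has to run against an Rsamtools version without this check could approximate the same constraint in base R. The sketch below is only an assumption about what "non-negative scalar integer, or NA" means in practice; isValidYieldSize is a hypothetical name and this is not the code that was added to Rsamtools.

```r
# Hypothetical helper (not the Rsamtools implementation): TRUE if x is
# a single NA or a single non-negative whole number.
isValidYieldSize <- function(x) {
    # Must be a scalar atomic value.
    if (!is.atomic(x) || length(x) != 1L) return(FALSE)
    # NA is accepted, matching the default yieldSize=NA_integer_.
    if (is.na(x)) return(TRUE)
    # Otherwise require a finite, non-negative, whole-valued number.
    is.numeric(x) && is.finite(x) && x >= 0 && x == trunc(x)
}

isValidYieldSize(-1)
isValidYieldSize(NA_integer_)
isValidYieldSize(1000L)
```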
>> A related point unclear at the moment in the documentation is what the index
>> filename should be: i.e. scanBam expects as the index the same value as for
>> the bam filename (that assumes the user has not renamed his bam.bai file and
>> you never know what users might be doing... :-S ... ) but the BamFile Rd page
>> says:
>> file: A character vector of BAM file paths
>> index: A character vector of indices (for BamFile);
>> so it's unclear to me what the index character vector should contain.
> Tried to clarify that, it's just a character vector containing the path to the index file. Generally, the code tries not to care about whether the index file is specified with a '.bai' extension, or without.
That was my perception :-) just wanted to be sure.
A related question: could you detail which functions require the .bai index to be present and which ones "just" benefit from it?
>> Thanks again for this set of classes, they're really handy!
>> Here's my sessionInfo:
>> R Under development (unstable) (2012-10-02 r60861) Platform:
>> x86_64-apple-darwin10.8.0 (64-bit)
>> locale:  en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>> attached base packages:  parallel stats graphics grDevices utils
>> datasets methods  base
>> other attached packages:  Rsamtools_1.11.14 Biostrings_2.27.8
>> GenomicRanges_1.11.21  IRanges_1.17.24 BiocGenerics_0.5.6
>> loaded via a namespace (and not attached):  bitops_1.0-5 stats4_2.16.0
>> tools_2.16.0 zlibbioc_1.5.0
>> --------------------------------------------------------------- Nicolas
>> Genome Biology Computational Support
>> European Molecular Biology Laboratory
>> Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 -
>> Postfach 10.2209 69102 Heidelberg, Germany
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793