[Bioc-devel] BamFile validation

Martin Morgan mtmorgan at fhcrc.org
Tue Jan 8 19:53:42 CET 2013

On 01/07/2013 12:32 PM, Nicolas Delhomme wrote:
> Hi Martin, Marc,
> I'm now implementing the use of BamFile objects in easyRNASeq and I like
> them. I think it would be very useful if when constructing a BamFile the
> existence of the path and index could be tested; i.e. this works:
> BamFile("test.bam","test.bam.bai") although these files do not exist. Is
> there a reason that this validation is not done? If there is, could a
> validation parameter be added (set to FALSE by default to keep the current
> behavior) that would check for the files' existence? The same goes for the
> yieldSize argument, i.e. this works
> BamFile("test.bam","test.bam.bai",yieldSize=-1), although I'm not sure what a
> -1 yieldSize means. I can of course do these validations within easyRNASeq,
> but anyone else building packages on top of BamFile would probably want to do
> the same...

I want to be able to specify a BAM file without opening it, and then open it 
later, e.g., in mclapply or after distributing to a cluster. Also, conceptually, 
I want to distinguish between processing an entire BAM file -- provide me with 
something for which isOpen(BamFile("foo")) == FALSE -- versus reading a chunk of 
a BamFile, i.e., already open. So I separated BamFile creation from open().

I focus on open() in the above because opening the BAM file is a cheap way to
validate that the BAM file exists. The file could be local or remote (http or
ftp, so file.exists isn't sufficient), and even if the file 'exists', as Ryan
mentions, it needs to actually be a BAM file, so it should, e.g., have a header.
open() allows for all of these possibilities. Also, trying to open a
non-existent file results in a clear enough error:

 > open(BamFile("sdfs"))
Error in value[[3L]](cond) :
   failed to open BamFile: file(s) do not exist:

So against the votes of the other contributors to this thread, I haven't made a 
change. Sorry about that.
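
For package authors who still want an up-front check, something along these
lines might do. It is only a sketch: it assumes Rsamtools is installed, and the
helper name isValidBam is made up for illustration (it is not part of
Rsamtools). It validates by opening, as described above, and closes the file
again so the object can still be opened later, e.g., on a worker.

```r
library(Rsamtools)

## Hypothetical helper, not part of Rsamtools: TRUE if the BamFile can be
## opened, FALSE if open() signals an error. The file is closed again
## afterwards so the caller gets back an unopened object.
isValidBam <- function(bf) {
    tryCatch({
        open(bf)
        close(bf)
        TRUE
    }, error = function(e) FALSE)
}

bf <- BamFile("sdfs")   # creating the object does not touch the file
isOpen(bf)              # FALSE
isValidBam(bf)          # FALSE, because open() fails for a missing file
```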

I added a check that yieldSize is a non-negative scalar integer, or NA.
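
As an illustration of what 'non-negative scalar integer, or NA' means in
practice, here is a base-R sketch. The function name isValidYieldSize is an
assumption for this example and the actual check in Rsamtools may be written
differently.

```r
## Hypothetical stand-in for the yieldSize check; the real Rsamtools code
## may differ. Accepts a single NA or a single non-negative whole number.
isValidYieldSize <- function(x) {
    if (!is.atomic(x) || length(x) != 1L)
        return(FALSE)                 # must be a scalar
    if (is.na(x))
        return(!is.nan(x))            # NA is allowed, NaN is not
    is.numeric(x) && x >= 0 && x == round(x) &&
        x <= .Machine$integer.max     # whole number that fits in an integer
}

isValidYieldSize(NA)        # TRUE
isValidYieldSize(100000L)   # TRUE
isValidYieldSize(-1)        # FALSE
```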

> A related point unclear at the moment in the documentation is what the index
> filename should be: i.e. scanBam expects as the index the same value as for
> the bam filename (that assumes the user has not renamed his bam.bai file and
> you never know what users might be doing... :-S ... ) but the BamFile Rd page
> says:
> file: A character vector of BAM file paths
> index: A character vector of indices (for BamFile);
> so it's unclear to me what the index character vector should contain.

I tried to clarify that; it's just a character vector containing the path to
the index file. Generally, the code tries not to care about whether the index
file is specified with a '.bai' extension, or without.


> Thanks again for this set of classes, they're really handy!
