[Bioc-devel] BamFile validation

Nicolas Delhomme delhomme at embl.de
Tue Jan 8 22:05:17 CET 2013


Hi Martin,

On 8 Jan 2013, at 19:53, Martin Morgan wrote:

> On 01/07/2013 12:32 PM, Nicolas Delhomme wrote:
>> Hi Martin, Marc,
>> 
>> I'm now implementing the use of BamFile objects in easyRNASeq and I like
>> them. I think it would be very useful if when constructing a BamFile the
>> existence of the path and index could be tested; i.e. this works:
>> BamFile("test.bam","test.bam.bai") although these files do not exist. Is
>> there a reason that this validation is not done? If there is, could a
>> validation parameter be added (set to FALSE by default to keep the current
>> behavior) that would check for the files' existence? The same goes for the
>> yieldSize argument, i.e. this works
>> BamFile("test.bam","test.bam.bai",yieldSize=-1), although I'm not sure what a
>> -1 yieldSize means. I can of course do these validations within easyRNASeq,
>> but anyone else building packages on top of BamFile would probably want to do
>> the same...
> 
> I want to be able to specify a BAM file without opening it, and then open it later, e.g., in mclapply or after distributing to a cluster. Also, conceptually, I want to distinguish between processing an entire BAM file -- provide me with something for which isOpen(BamFile("foo")) == FALSE -- versus reading a chunk of a BamFile, i.e., already open. So I separated BamFile creation from open().
> 
> I focus on open() in the above because opening the BAM file is a cheap way to validate that the BAM file exists -- it could be local or remote (http or ftp, so file.exists isn't sufficient) and even if the file 'exists', as Ryan mentions, it needs to actually be a BAM file, so it should, e.g., have a header. open() allows for all of these possibilities. Also, trying to open a non-existent file results in a clear enough error:
> 
> > open(BamFile("sdfs"))
> Error in value[[3L]](cond) :
>  failed to open BamFile: file(s) do not exist:
>  'sdfs'
> 
> So against the votes of the other contributors to this thread, I haven't made a change. Sorry about that.

No need to apologize. I hadn't thought of use cases like those you presented above, where not checking makes perfect sense. I'll use open() for validating.
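
For anyone else building on BamFile, here is a minimal sketch of that check. It assumes Rsamtools is installed and that open() fails for a missing or invalid file as you describe; canOpenBam is a hypothetical helper name of mine, not part of the package:

```r
library(Rsamtools)

## Hypothetical helper (not part of Rsamtools): returns TRUE if the
## BAM file can be opened, FALSE otherwise.
canOpenBam <- function(file) {
    bf <- BamFile(file)   # constructing the object does not open the file
    tryCatch({
        open(bf)          # expected to fail for a missing or non-BAM file
        close(bf)
        TRUE
    }, error = function(e) {
        message("cannot open '", file, "': ", conditionMessage(e))
        FALSE
    })
}

## ex1.bam ships with Rsamtools; "sdfs" is the non-existent file from above
canOpenBam(system.file("extdata", "ex1.bam", package = "Rsamtools"))
canOpenBam("sdfs")
```

Keeping the check in a separate function leaves the BamFile itself unopened, so it can still be handed to mclapply or a cluster afterwards.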

> 
> I added a check that yieldSize is a non-negative scalar integer, or NA.

Great, thanks.
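
For anyone stuck on a version without that check, something along these lines could be done in package code. This is only a rough sketch in plain R under my reading of "non-negative scalar integer, or NA"; isValidYieldSize is a made-up name and this is not the code committed to Rsamtools:

```r
## Hypothetical helper: TRUE for a single non-negative whole number or NA.
isValidYieldSize <- function(x) {
    if (length(x) != 1L || is.list(x)) return(FALSE)   # must be a scalar
    if (is.na(x)) return(TRUE)                          # NA is allowed
    is.numeric(x) && is.finite(x) && x >= 0 && x == trunc(x)
}

isValidYieldSize(100L)   # a positive integer
isValidYieldSize(NA)     # no yield size set
isValidYieldSize(-1)     # the case from my original mail
```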

> 
>> 
>> A related point unclear at the moment in the documentation is what the index
>> filename should be: i.e. scanBam expects as the index the same value as for
>> the bam filename (that assumes the user has not renamed his bam.bai file and
>> you never know what users might be doing... :-S ... ) but the BamFile Rd page
>> says:
>> 
>> file: A character vector of BAM file paths
>> index: A character vector of indices (for BamFile);
>> 
>> so it's unclear to me what the index character vector should contain.
> 
> Tried to clarify that, it's just a character vector containing the path to the index file. Generally, the code tries not to care about whether the index file is specified with a '.bai' extension, or without.

That was my perception :-) just wanted to be sure.

A related question: could you detail which functions require the bai index to be present and which ones "just" benefit from it?

Cheers,

Nico

> 
> Martin
> 
>> 
>> Thanks again for this set of classes, they're really handy!
>> 
>> Here's my sessionInfo:
>> 
>> R Under development (unstable) (2012-10-02 r60861)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>> 
>> locale:
>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>> 
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>> [8] base
>> 
>> other attached packages:
>> [1] Rsamtools_1.11.14     Biostrings_2.27.8     GenomicRanges_1.11.21
>> [4] IRanges_1.17.24       BiocGenerics_0.5.6    BiocInstaller_1.9.6
>> 
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-5   stats4_2.16.0  tools_2.16.0   zlibbioc_1.5.0
>> 
>> Cheers,
>> 
>> Nico
>> 
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>> 
>> Genome Biology Computational Support
>> 
>> European Molecular Biology Laboratory
>> 
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> 
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> 
> 
> 
> -- 
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793


