[Bioc-devel] BamFile validation
Nicolas Delhomme
delhomme at embl.de
Tue Jan 8 22:05:17 CET 2013
Hi Martin,
On 8 Jan 2013, at 19:53, Martin Morgan wrote:
> On 01/07/2013 12:32 PM, Nicolas Delhomme wrote:
>> Hi Martin, Marc,
>>
>> I'm now implementing the use of BamFile objects in easyRNASeq and I like
>> them. I think it would be very useful if when constructing a BamFile the
>> existence of the path and index could be tested; i.e. this works:
>> BamFile("test.bam","test.bam.bai") although these files do not exist. Is
>> there a reason that this validation is not done? If there is, could a
>> validation parameter be added (set to FALSE by default to keep the current
>> behavior) that would check for the files' existence? The same goes for the
>> yieldSize argument, i.e. this works
>> BamFile("test.bam","test.bam.bai",yieldSize=-1), although I'm not sure what a
>> -1 yieldSize means. I can of course do these validations within easyRNASeq,
>> but anyone else building packages on top of BamFile would probably want to do
>> the same...
>
> I want to be able to specify a BAM file without opening it, and then open it later, e.g., in mclapply or after distributing to a cluster. Also, conceptually, I want to distinguish between processing an entire BAM file -- provide me with something for which isOpen(BamFile("foo")) == FALSE -- versus reading a chunk of a BamFile, i.e., already open. So I separated BamFile creation from open().
>
> I focus on open() in the above because opening the BAM file is a cheap way to validate that the BAM file exists. It could be local or remote (http or ftp, so file.exists isn't sufficient), and even if the file 'exists', as Ryan mentions, it needs to actually be a BAM file, so it should, e.g., have a header. open() allows for all of these possibilities. Also, trying to open a non-existent file results in a clear enough error:
>
> > open(BamFile("sdfs"))
> Error in value[[3L]](cond) :
> failed to open BamFile: file(s) do not exist:
> 'sdfs'
>
> So against the votes of the other contributors to this thread, I haven't made a change. Sorry about that.
No need to apologize. I hadn't thought of use cases like those you presented above, where not checking makes perfect sense. I'll use open() for validating.
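
For anyone else building on top of BamFile, here is a minimal sketch of what I have in mind. The helper name isValidBam is hypothetical (it is not part of Rsamtools); it only relies on BamFile(), open() and close(), and assumes that any failure to open, as in the error shown above, means the file is unusable.

```r
library(Rsamtools)

# Hypothetical helper, not part of Rsamtools: returns TRUE if the BAM file
# can be opened (local or remote), FALSE if construction or open() fails.
isValidBam <- function(file, index = file) {
    tryCatch({
        bf <- BamFile(file, index = index)
        open(bf)   # errors if the file is missing or is not a BAM file
        close(bf)  # release the connection again; we only wanted the check
        TRUE
    }, error = function(e) FALSE)
}
```

Calling isValidBam("sdfs") on the non-existent file from the example above should then give FALSE instead of an error.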
>
> I added a check that yieldSize is a non-negative scalar integer, or NA.
Great, thanks.
>
>>
>> A related point unclear at the moment in the documentation is what the index
>> filename should be: i.e. scanBam expects as the index the same value as for
>> the bam filename (that assumes the user has not renamed his bam.bai file and
>> you never know what users might be doing... :-S ... ) but the BamFile Rd page
>> says:
>>
>> file: A character vector of BAM file paths
>> index: A character vector of indices (for BamFile);
>>
>> so it's unclear to me what the index character vector should contain.
>
> Tried to clarify that; it's just a character vector containing the path to the index file. Generally, the code tries not to care about whether the index file is specified with a '.bai' extension, or without.
That was my perception :-) just wanted to be sure.
A related question: could you detail which functions require the .bai index to be present and which ones "just" benefit from it?
Cheers,
Nico
>
> Martin
>
>>
>> Thanks again for this set of classes, they're really handy!
>>
>> Here's my sessionInfo:
>>
>> R Under development (unstable) (2012-10-02 r60861)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>
>> locale:
>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>> [8] base
>>
>> other attached packages:
>> [1] Rsamtools_1.11.14     Biostrings_2.27.8     GenomicRanges_1.11.21
>> [4] IRanges_1.17.24       BiocGenerics_0.5.6    BiocInstaller_1.9.6
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-5   stats4_2.16.0  tools_2.16.0   zlibbioc_1.5.0
>>
>> Cheers,
>>
>> Nico
>>
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>>
>> Genome Biology Computational Support
>>
>> European Molecular Biology Laboratory
>>
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793