[Bioc-devel] file registry - feedback

Valerie Obenchain vobencha at fhcrc.org
Wed Mar 12 04:57:51 CET 2014


Hi,

On 03/11/14 15:33, Hervé Pagès wrote:
> On 03/11/2014 02:52 PM, Hervé Pagès wrote:
>> On 03/11/2014 09:57 AM, Valerie Obenchain wrote:
>>> Hi Herve,
>>>
>>> On 03/10/2014 10:31 PM, Hervé Pagès wrote:
>>>> Hi Val,
>>>>
>>>> I think it would help understand the motivations behind this proposal
>>>> if you could give an example of a method where the user cannot supply
>>>> a file name but has to create a 'File' (or 'FileList') object first.
>>>> And how the file registry proposal below would help.
>>>> It looks like you have such an example in the GenomicFileViews package.
>>>> Do you think you could give more details?
>>>
>>> The most recent motivating use case was in creating subclasses of
>>> GenomicFileViews objects (BamFileViews, BigWigFileViews, etc.) We wanted
>>> to have a general constructor, something like GenomicFileViews(), that
>>> would create the appropriate subclass. However to create the correct
>>> subclass we needed to know if the files were bam, bw, fasta etc.
>>> Recognition of the file type by extension would allow us to do this with
>>> no further input from the user.
>>
>> That helps, thanks!
>>
>> Having this kind of general constructor sounds like it could indeed be
>> useful. Would be an opportunity to put all these *File classes (the 22
>> RTLFile subclasses defined in rtracklayer and the 5 RsamtoolsFile
>> subclasses defined in Rsamtools) under the same umbrella (i.e. a parent
>> virtual class) and use the name of this virtual class (e.g. File) for
>> the general constructor.
>>
>> Allowing a registration mechanism to extend the knowledge of this File()
>> constructor is an implementation detail. I don't see a lot of benefit to
>> it. Only a package that implements a concrete File subclass would
>> actually need to register the new subclass. Sounds easy enough to ask
>> whoever has commit access to the File() code to modify it. This
>> kind of update might also require adding the name of the package where
>> the new File subclass is implemented to the Depends/Imports/Suggests
>> of the package where File() lives, which is something that cannot be
>> done via a registration mechanism.
>
> This clean-up of the *File jungle would also be a good opportunity to:
>
>    - Choose what we want to do with reference classes: use them for all
>      the *File classes or for none of them. (Right now, those defined
>      in Rsamtools are reference classes, and those defined in
>      rtracklayer are not.)
>
>    - Move the I/O functionality currently in rtracklayer to a
>      separate package. Based on the number of contributed packages I
>      reviewed so far that were trying to reinvent the wheel because
>      they had no idea that the I/O function they needed was actually
>      in rtracklayer, I'd like to advocate for using a package name
>      that makes it very clear that it's all about I/O.

Thanks for the suggestions. This re-org sounds good to me. As you say, 
unifying the *File classes in a single package would make them more 
visible to other developers and enforce consistent behavior.

If you aren't in favor of a registration mechanism for 'discovery', how 
should a function with methods for many *File classes (e.g., import()) 
handle a character file name? import() uses FileForFormat() to discover 
the file type, makes the *File class and dispatches to the appropriate 
*File method. The registry was an attempt at generalizing this concept.
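
To make the lookup step concrete, here is a minimal sketch in base R of what
such a registry could look like. It is not the GenomicFileViews code:
registerFileType() follows the call signature used in this thread, but its
body is assumed, and findFileType() is a hypothetical helper that only
returns class names so the sketch runs without Rsamtools or rtracklayer.

```r
# Assumed layout: one registry entry per extension regex.
.fileTypeRegistry <- new.env(parent = emptyenv())

registerFileType <- function(class, package, regex) {
    # Keyed on the regex so one class can be registered for several extensions.
    assign(regex, list(class = class, package = package, regex = regex),
           envir = .fileTypeRegistry)
    invisible(TRUE)
}

# Hypothetical helper: map each file name to the single class whose
# registered regex matches it; fail loudly on no match or ambiguity.
findFileType <- function(fnames) {
    entries <- as.list(.fileTypeRegistry, all.names = TRUE)
    vapply(fnames, function(f) {
        hits <- Filter(function(e) grepl(e$regex, f), entries)
        classes <- unique(vapply(hits, function(e) e$class, character(1)))
        if (length(classes) != 1L)
            stop("expected exactly one registered class for '", f,
                 "', found ", length(classes))
        classes
    }, character(1), USE.NAMES = FALSE)
}

registerFileType("FaFileList", "Rsamtools", "\\.fa$")
registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
registerFileType("BamFileList", "Rsamtools", "\\.bam$")
registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")

findFileType(c("foo.bam", "bar.fasta"))
```

A function like import() could then take the class name and package from the
matching entry, call the constructor of that name, and dispatch on the
resulting *File object. The error on zero or multiple matches is one possible
way to surface the ambiguity problem rather than resolve it silently.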

What do you think about the use of a registry for Vince's idea of 
holding a digest/path reference to large data but not installing it 
until it's used? Other ideas of how / where this could be stored?

Val


>
> H.
>
>
>>
>> H.
>>
>>
>>>
>>> Val
>>>
>>>>
>>>> Thanks,
>>>> H.
>>>>
>>>>
>>>> On 03/10/2014 08:46 PM, Valerie Obenchain wrote:
>>>>> Hi all,
>>>>>
>>>>> I'm soliciting feedback on the idea of a general file 'registry' that
>>>>> would identify file types by their extensions. This is similar in spirit
>>>>> to FileForFormat() in rtracklayer but a more general abstraction that
>>>>> could be used across packages. The goal is to allow a user to supply
>>>>> only file name(s) to a method instead of first creating a 'File' class
>>>>> such as BamFile, FaFile, BigWigFile etc.
>>>>>
>>>>> A first attempt at this is in the GenomicFileViews package
>>>>> (https://github.com/Bioconductor/GenomicFileViews). A registry (lookup)
>>>>> is created as an environment at load time:
>>>>>
>>>>> .fileTypeRegistry <- new.env(parent=emptyenv())
>>>>>
>>>>> Files are registered with an information triplet consisting of class,
>>>>> package and regular expression to identify the extension. In
>>>>> GenomicFileViews we register FaFileList, BamFileList and BigWigFileList
>>>>> but any 'File' class can be registered that has a constructor of the
>>>>> same name.
>>>>>
>>>>> .onLoad <- function(libname, pkgname)
>>>>> {
>>>>>      registerFileType("FaFileList", "Rsamtools", "\\.fa$")
>>>>>      registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
>>>>>      registerFileType("BamFileList", "Rsamtools", "\\.bam$")
>>>>>      registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")
>>>>> }
>>>>>
>>>>> The makeFileType() helper creates the appropriate class. This function
>>>>> is used behind the scenes to do the lookup and coerce to the correct
>>>>> 'File' class.
>>>>>
>>>>>  > makeFileType(c("foo.bam", "bar.bam"))
>>>>> BamFileList of length 2
>>>>> names(2): foo.bam bar.bam
>>>>>
>>>>> New types can be added at any time with registerFileType():
>>>>>
>>>>> registerFileType("NewClass", "NewPackage", "\\.NewExtension$")
>>>>>
>>>>>
>>>>> Thoughts:
>>>>>
>>>>> (1) If this sounds generally useful where should it live? rtracklayer,
>>>>> GenomicFileViews or other? Alternatively it could be its own lightweight
>>>>> package (FileRegister) that creates the registry and provides the
>>>>> helpers. It would be up to the package authors that depend on
>>>>> FileRegister to register their own file types at load time.
>>>>>
>>>>> (2) To avoid potential ambiguities maybe searching should be by regex
>>>>> and package name. Still a work in progress.
>>>>>
>>>>>
>>>>> Valerie
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>
>>>
>>
>


