[Bioc-devel] file registry - feedback

Hervé Pagès hpages at fhcrc.org
Wed Mar 12 07:29:07 CET 2014


Hi Val,

On 03/11/2014 08:57 PM, Valerie Obenchain wrote:
> Hi,
>
> On 03/11/14 15:33, Hervé Pagès wrote:
>> On 03/11/2014 02:52 PM, Hervé Pagès wrote:
>>> On 03/11/2014 09:57 AM, Valerie Obenchain wrote:
>>>> Hi Herve,
>>>>
>>>> On 03/10/2014 10:31 PM, Hervé Pagès wrote:
>>>>> Hi Val,
>>>>>
>>>>> I think it would help understand the motivations behind this proposal
>>>>> if you could give an example of a method where the user cannot supply
>>>>> a file name but has to create a 'File' (or 'FileList') object first.
>>>>> And how the file registry proposal below would help.
>>>>> It looks like you have such an example in the GenomicFileViews
>>>>> package.
>>>>> Do you think you could give more details?
>>>>
>>>> The most recent motivating use case was in creating subclasses of
>>>> GenomicFileViews objects (BamFileViews, BigWigFileViews, etc.) We
>>>> wanted
>>>> to have a general constructor, something like GenomicFileViews(), that
>>>> would create the appropriate subclass. However to create the correct
>>>> subclass we needed to know if the files were bam, bw, fasta etc.
>>>> Recognition of the file type by extension would allow us to do this
>>>> with
>>>> no further input from the user.
>>>
>>> That helps, thanks!
>>>
>>> Having this kind of general constructor sounds like it could indeed be
>>> useful. Would be an opportunity to put all these *File classes (the 22
>>> RTLFile subclasses defined in rtracklayer and the 5 RsamtoolsFile
>>> subclasses defined in Rsamtools) under the same umbrella (i.e. a parent
>>> virtual class) and use the name of this virtual class (e.g. File) for
>>> the general constructor.
>>>
>>> Allowing a registration mechanism to extend the knowledge of this File()
>>> constructor is an implementation detail. I don't see a lot of benefit to
>>> it. Only a package that implements a concrete File subclass would
>>> actually need to register the new subclass. Sounds easy enough to ask
>>> to whoever has commit access to the File() code to modify it. This
>>> kind of update might also require adding the name of the package where
>>> the new File subclass is implemented to the Depends/Imports/Suggests
>>> of the package where File() lives, which is something that cannot be
>>> done via a registration mechanism.
>>
>> This clean-up of the *File jungle would also be a good opportunity to:
>>
>>    - Choose what we want to do with reference classes: use them for all
>>      the *File classes or for none of them. (Right now, those defined
>>      in Rsamtools are reference classes, and those defined in
>>      rtracklayer are not.)
>>
>>    - Move the I/O functionality currently in rtracklayer to a
>>      separate package. Based on the number of contributed packages I
>>      reviewed so far that were trying to reinvent the wheel because
>>      they had no idea that the I/O function they needed was actually
>>      in rtracklayer, I'd like to advocate for using a package name
>>      that makes it very clear that it's all about I/O.
>
> Thanks for the suggestions. This re-org sounds good to me. As you say,
> unifying the *File classes in a single package would make them more
> visible to other developers and enforce consistent behavior.
>
> If you aren't in favor of a registration mechanism for 'discovery' how
> should a function with methods for many *File classes (e.g., import())
> handle a character file name? import() uses FileForFormat() to discover
> the file type, makes the *File class and dispatches to the appropriate
> *File method. The registry was an attempt at generalizing this concept.

It would use File() (or whatever the name of this general constructor
would be) instead of FileForFormat(). Right now FileForFormat() seems
to know how to map almost any supported file extension to the
corresponding *File class:

   > class(FileForFormat("aaa.bed"))
   [1] "BEDFile"
   attr(,"package")
   [1] "rtracklayer"

   > class(FileForFormat("aaa.gff"))
   [1] "GFFFile"
   attr(,"package")
   [1] "rtracklayer"

   > class(FileForFormat("aaa.wig"))
   [1] "WIGFile"
   attr(,"package")
   [1] "rtracklayer"

   > class(FileForFormat("aaa.xz"))
   [1] "XZFile"
   attr(,"package")
   [1] "rtracklayer"

   > class(FileForFormat("aaa.2bit"))
   [1] "TwoBitFile"
   attr(,"package")
   [1] "rtracklayer"

   > class(FileForFormat("aaa.bam"))
   [1] "BamFile"
   attr(,"package")
   [1] "Rsamtools"

   > class(FileForFormat("aaa.fasta"))
   [1] "FastaFile"
   attr(,"package")
   [1] "rtracklayer"

   etc...

To my knowledge it doesn't use a registration mechanism for that.
It only fails for a couple of common extensions e.g.:

   > class(FileForFormat("aaa.fa"))
   Error in FileForFormat("aaa.fa") : Format 'fa' unsupported

but in such situation it seems easy enough to just add the
".fa" -> FastaFile mapping as permanent knowledge to FileForFormat(),
rather than having a registration mechanism that allows Rsamtools
to dynamically add the mapping at load time.

Another situation that can happen is that a new *File class is
implemented in a package (say foo) other than rtracklayer or
Rsamtools to support a new file format/extension. Again it sounds
easy enough to add the new mapping as permanent knowledge to
FileForFormat(). Bioconductor is a collaborative environment where
developers are encouraged to talk to each other :-)

Now I'm assuming that most of the file formats that are meaningful
for Bioconductor are already covered with an "ext" -> *File mapping
and that the need for new mappings is going to be a rare event.
But maybe I'm wrong. IMO a registration mechanism would be justified
if: (1) we anticipate many new *File classes and new mappings to come,
and (2) many of the new *File classes are going to be implemented
in many different packages outside rtracklayer and Rsamtools.

>
> What do you think about the use of a registry for Vince's idea of
> holding a digest/path reference to large data but not installing it
> until it's used? Other ideas of how / where this could be stored?

Maybe it's related to the file registry thing. But maybe it's a slightly
different topic. A real detailed example would help.

H.

PS: Note that the ".fasta" extension is currently mapped to FastaFile
(defined in rtracklayer) and FaFile (defined in Rsamtools).
If we had a registration mechanism in place to let rtracklayer
and Rsamtools register mappings at load time, we would end up with
one mapping overriding the other. A registration mechanism that
allows any package to register mappings would make his kind of
conflicts possible and probably more frequent than we would like.

>
> Val
>
>
>>
>> H.
>>
>>
>>>
>>> H.
>>>
>>>
>>>>
>>>> Val
>>>>
>>>>>
>>>>> Thanks,
>>>>> H.
>>>>>
>>>>>
>>>>> On 03/10/2014 08:46 PM, Valerie Obenchain wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I'm soliciting feedback on the idea of a general file 'registry' that
>>>>>> would identify file types by their extensions. This is similar in
>>>>>> spirit
>>>>>> to FileForformat() in rtracklayer but a more general abstraction that
>>>>>> could be used across packages. The goal is to allow a user to supply
>>>>>> only file name(s) to a method instead of first creating a 'File'
>>>>>> class
>>>>>> such as BamFile, FaFile, BigWigFile etc.
>>>>>>
>>>>>> A first attempt at this is in the GenomicFileViews package
>>>>>> (https://github.com/Bioconductor/GenomicFileViews). A registry
>>>>>> (lookup)
>>>>>> is created as an environment at load time:
>>>>>>
>>>>>> .fileTypeRegistry <- new.env(parent=emptyenv()
>>>>>>
>>>>>> Files are registered with an information triplet consisting of class,
>>>>>> package and regular expression to identify the extension. In
>>>>>> GenomicFileViews we register FaFileList, BamFileList and
>>>>>> BigWigFileList
>>>>>> but any 'File' class can be registered that has a constructor of the
>>>>>> same name.
>>>>>>
>>>>>> .onLoad <- function(libname, pkgname)
>>>>>> {
>>>>>>      registerFileType("FaFileList", "Rsamtools", "\\.fa$")
>>>>>>      registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
>>>>>>      registerFileType("BamFileList", "Rsamtools", "\\.bam$")
>>>>>>      registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")
>>>>>> }
>>>>>>
>>>>>> The makeFileType() helper creates the appropriate class. This
>>>>>> function
>>>>>> is used behind the scenes to do the lookup and coerce to the correct
>>>>>> 'File' class.
>>>>>>
>>>>>>  > makeFileType(c("foo.bam", "bar.bam"))
>>>>>> BamFileList of length 2
>>>>>> names(2): foo.bam bar.bam
>>>>>>
>>>>>> New types can be added at any time with registerFileType():
>>>>>>
>>>>>> registerFileType(NewClass, NewPackage, "\\.NewExtension$")
>>>>>>
>>>>>>
>>>>>> Thoughts:
>>>>>>
>>>>>> (1) If this sounds generally useful where should it live?
>>>>>> rtracklayer,
>>>>>> GenomicFileViews or other? Alternatively it could be its own
>>>>>> lightweight
>>>>>> package (FileRegister) that creates the registry and provides the
>>>>>> helpers. It would be up to the package authors that depend on
>>>>>> FileRegister to register their own files types at load time.
>>>>>>
>>>>>> (2) To avoid potential ambiguities maybe searching should be by regex
>>>>>> and package name. Still a work in progress.
>>>>>>
>>>>>>
>>>>>> Valerie
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>
>>>>
>>>
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list