[Bioc-devel] file registry - feedback

Valerie Obenchain vobencha at fhcrc.org
Tue Mar 25 21:18:27 CET 2014


Hi,

This discussion went off-line and I wanted to give a summary of what we 
decided to go with.

We'll create a new package, BiocFile, that has a minimal API.

API:
- 'File' class (virtual, reference class) and constructor
- close / open / isOpen
- import / export
- file registry
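
For the registry item, a minimal sketch modelled on the GenomicFileViews prototype from the original proposal (quoted below) might look like this. Only registerFileType() and makeFileType() are names from that proposal. The findFileType() helper, the argument names, and keying the environment by regex are assumptions made for illustration, not the planned BiocFile code.

```r
# Registry environment, created once at load time (as in the proposal).
.fileTypeRegistry <- new.env(parent = emptyenv())

# Store a class/package pair under the regex that identifies the extension.
# Keying by regex is an assumption; it lets one class own several extensions.
registerFileType <- function(class, package, regex) {
    stopifnot(is.character(class), length(class) == 1L,
              is.character(package), length(package) == 1L,
              is.character(regex), length(regex) == 1L)
    assign(regex, list(class = class, package = package),
           envir = .fileTypeRegistry)
    invisible(TRUE)
}

# Hypothetical helper: return the first entry whose regex matches all files.
findFileType <- function(files) {
    for (regex in ls(.fileTypeRegistry, all.names = TRUE)) {
        if (all(grepl(regex, files)))
            return(get(regex, envir = .fileTypeRegistry))
    }
    stop("no registered file type matches: ", paste(files, collapse = ", "))
}

# Look up the type, then call the constructor of the same name from the
# registering package. This needs that package to be installed.
makeFileType <- function(files) {
    type <- findFileType(files)
    constructor <- getExportedValue(type$package, type$class)
    constructor(files)
}

registerFileType("BamFileList", "Rsamtools", "\\.bam$")
registerFileType("FaFileList", "Rsamtools", "\\.fa$")
registerFileType("FaFileList", "Rsamtools", "\\.fasta$")

bamType <- findFileType(c("foo.bam", "bar.bam"))
```

The lookup here returns the first matching entry, so two packages registering the same extension would still be ambiguous; searching by regex and package name, as raised in the proposal, is one way around that.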

We won't require existing *File classes to implement yield but would 
'recommend' that new *File classes do. By getting this structure in 
place we can guide future *File developments in a consistent direction 
even if we can't harmonize all current classes. I'll start work on this 
after the release.
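
To make the shape of the API concrete, here is a rough sketch of a virtual 'File' reference class with one concrete subclass that implements open / close / isOpen plus the recommended yield. Everything below is an assumption for illustration: the 'TextFile' subclass, the field names, and the choice of reference-class methods rather than S4 generics are not decisions from this thread.

```r
library(methods)

# Virtual parent class: cannot be instantiated, only extended.
File <- setRefClass("File",
    contains = "VIRTUAL",
    fields = list(path = "character"))

# Hypothetical concrete subclass wrapping a base R text connection.
TextFile <- setRefClass("TextFile",
    contains = "File",
    fields = list(con = "ANY"),   # NULL while closed, a connection while open
    methods = list(
        isOpen = function() !is.null(con),
        open = function() {
            if (is.null(con))
                con <<- base::file(path, open = "r")
            invisible(.self)
        },
        close = function() {
            if (!is.null(con)) {
                base::close(con)
                con <<- NULL
            }
            invisible(.self)
        },
        # yield: return the next chunk of at most n lines.
        yield = function(n = 1L) {
            stopifnot(!is.null(con))
            readLines(con, n = n)
        }))

# Usage: write a small file, then read it back in chunks.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a", "b", "c"), tmp)

f <- TextFile$new(path = tmp, con = NULL)
f$open()
firstChunk <- f$yield(2L)
secondChunk <- f$yield(2L)
f$close()
```

Because the object is a reference class, open() and close() change its state in place, which is what lets successive yield() calls continue from where the previous one stopped.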

Thanks again for the input.

Valerie

On 03/11/2014 10:23 PM, Michael Lawrence wrote:
>
>
>
> On Tue, Mar 11, 2014 at 3:33 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>     On 03/11/2014 02:52 PM, Hervé Pagès wrote:
>
>         On 03/11/2014 09:57 AM, Valerie Obenchain wrote:
>
>             Hi Herve,
>
>             On 03/10/2014 10:31 PM, Hervé Pagès wrote:
>
>                 Hi Val,
>
>                 I think it would help understand the motivations behind
>                 this proposal
>                 if you could give an example of a method where the user
>                 cannot supply
>                 a file name but has to create a 'File' (or 'FileList')
>                 object first.
>                 And how the file registry proposal below would help.
>                 It looks like you have such an example in the
>                 GenomicFileViews package.
>                 Do you think you could give more details?
>
>
>             The most recent motivating use case was in creating
>             subclasses of
>             GenomicFileViews objects (BamFileViews, BigWigFileViews,
>             etc.) We wanted
>             to have a general constructor, something like
>             GenomicFileViews(), that
>             would create the appropriate subclass. However to create the
>             correct
>             subclass we needed to know if the files were bam, bw, fasta etc.
>             Recognition of the file type by extension would allow us to
>             do this with
>             no further input from the user.
>
>
>         That helps, thanks!
>
>         Having this kind of general constructor sounds like it could
>         indeed be
>         useful. Would be an opportunity to put all these *File classes
>         (the 22
>         RTLFile subclasses defined in rtracklayer and the 5 RsamtoolsFile
>         subclasses defined in Rsamtools) under the same umbrella (i.e. a
>         parent
>         virtual class) and use the name of this virtual class (e.g.
>         File) for
>         the general constructor.
>
>         Allowing a registration mechanism to extend the knowledge of
>         this File()
>         constructor is an implementation detail. I don't see a lot of
>         benefit to
>         it. Only a package that implements a concrete File subclass would
>         actually need to register the new subclass. It sounds easy enough
>         to ask whoever has commit access to the File() code to modify it. This
>         kind of update might also require adding the name of the package
>         where
>         the new File subclass is implemented to the Depends/Imports/Suggests
>         of the package where File() lives, which is something that cannot be
>         done via a registration mechanism.
>
>
>     This clean-up of the *File jungle would also be a good opportunity to:
>
>        - Choose what we want to do with reference classes: use them for all
>          the *File classes or for none of them. (Right now, those defined
>          in Rsamtools are reference classes, and those defined in
>          rtracklayer are not.)
>
>        - Move the I/O functionality currently in rtracklayer to a
>          separate package. Based on the number of contributed packages I
>          reviewed so far that were trying to reinvent the wheel because
>          they had no idea that the I/O function they needed was actually
>          in rtracklayer, I'd like to advocate for using a package name
>          that makes it very clear that it's all about I/O.
>
>
>
> I can see some benefit in renaming/reorganizing, but if they weren't
> able to perform a simple google search for functionality, I don't think
> the name of the package was the problem. "read gff bioconductor" returns
> rtracklayer as the top hit.
>
>
>     H.
>
>
>
>         H.
>
>
>
>             Val
>
>
>                 Thanks,
>                 H.
>
>
>                 On 03/10/2014 08:46 PM, Valerie Obenchain wrote:
>
>                     Hi all,
>
>                     I'm soliciting feedback on the idea of a general
>                     file 'registry' that
>                     would identify file types by their extensions. This
>                     is similar in
>                     spirit
>                     to FileForFormat() in rtracklayer but a more general
>                     abstraction that
>                     could be used across packages. The goal is to allow
>                     a user to supply
>                     only file name(s) to a method instead of first
>                     creating a 'File' class
>                     such as BamFile, FaFile, BigWigFile etc.
>
>                     A first attempt at this is in the GenomicFileViews
>                     package
>                     (https://github.com/Bioconductor/GenomicFileViews).
>                     A registry (lookup)
>                     is created as an environment at load time:
>
>                     .fileTypeRegistry <- new.env(parent=emptyenv())
>
>                     Files are registered with an information triplet
>                     consisting of class,
>                     package and regular expression to identify the
>                     extension. In
>                     GenomicFileViews we register FaFileList, BamFileList
>                     and BigWigFileList
>                     but any 'File' class can be registered that has a
>                     constructor of the
>                     same name.
>
>                     .onLoad <- function(libname, pkgname)
>                     {
>                         registerFileType("FaFileList", "Rsamtools", "\\.fa$")
>                         registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
>                         registerFileType("BamFileList", "Rsamtools", "\\.bam$")
>                         registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")
>                     }
>
>                     The makeFileType() helper creates the appropriate
>                     class. This function
>                     is used behind the scenes to do the lookup and
>                     coerce to the correct
>                     'File' class.
>
>                       > makeFileType(c("foo.bam", "bar.bam"))
>                     BamFileList of length 2
>                     names(2): foo.bam bar.bam
>
>                     New types can be added at any time with
>                     registerFileType():
>
>                     registerFileType("NewClass", "NewPackage",
>                     "\\.NewExtension$")
>
>
>                     Thoughts:
>
>                     (1) If this sounds generally useful where should it
>                     live? rtracklayer,
>                     GenomicFileViews or other? Alternatively it could be
>                     its own
>                     lightweight
>                     package (FileRegister) that creates the registry and
>                     provides the
>                     helpers. It would be up to the package authors that
>                     depend on
>                     FileRegister to register their own file types at
>                     load time.
>
>                     (2) To avoid potential ambiguities maybe searching
>                     should be by regex
>                     and package name. Still a work in progress.
>
>
>                     Valerie
>
>                     _________________________________________________
>                     Bioc-devel at r-project.org mailing list
>                     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org
>     Phone: (206) 667-5791
>     Fax: (206) 667-1319
>
>


-- 
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: vobencha at fhcrc.org
Phone:  (206) 667-3158
Fax:    (206) 667-1319


