[Bioc-devel] file registry - feedback

Hervé Pagès hpages at fhcrc.org
Tue Mar 11 23:33:53 CET 2014


On 03/11/2014 02:52 PM, Hervé Pagès wrote:
> On 03/11/2014 09:57 AM, Valerie Obenchain wrote:
>> Hi Herve,
>>
>> On 03/10/2014 10:31 PM, Hervé Pagès wrote:
>>> Hi Val,
>>>
>>> I think it would help understand the motivations behind this proposal
>>> if you could give an example of a method where the user cannot supply
>>> a file name but has to create a 'File' (or 'FileList') object first.
>>> And how the file registry proposal below would help.
>>> It looks like you have such an example in the GenomicFileViews package.
>>> Do you think you could give more details?
>>
>> The most recent motivating use case was in creating subclasses of
>> GenomicFileViews objects (BamFileViews, BigWigFileViews, etc.). We wanted
>> to have a general constructor, something like GenomicFileViews(), that
>> would create the appropriate subclass. However, to create the correct
>> subclass we needed to know if the files were bam, bw, fasta, etc.
>> Recognition of the file type by extension would allow us to do this with
>> no further input from the user.
>
> That helps, thanks!
>
> Having this kind of general constructor sounds like it could indeed be
> useful. Would be an opportunity to put all these *File classes (the 22
> RTLFile subclasses defined in rtracklayer and the 5 RsamtoolsFile
> subclasses defined in Rsamtools) under the same umbrella (i.e. a parent
> virtual class) and use the name of this virtual class (e.g. File) for
> the general constructor.
>
> Allowing a registration mechanism to extend the knowledge of this File()
> constructor is an implementation detail. I don't see a lot of benefit to
> it. Only a package that implements a concrete File subclass would
> actually need to register the new subclass. It sounds easy enough to ask
> whoever has commit access to the File() code to modify it. This
> kind of update might also require adding the name of the package where
> the new File subclass is implemented to the Depends/Imports/Suggests
> of the package where File() lives, which is something that cannot be
> done via a registration mechanism.
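
To make the alternative concrete, here is a minimal sketch of a general
constructor named after a virtual parent class, with the extension lookup
hard-coded in the constructor's own package rather than filled in by a
registry. The class names (DemoFile, DemoBamFile, DemoBigWigFile) and the
lookup table are hypothetical stand-ins, not the real rtracklayer or
Rsamtools classes.

```r
library(methods)

# Hypothetical virtual parent class and two concrete subclasses.
setClass("DemoFile", representation("VIRTUAL", path = "character"))
setClass("DemoBamFile", contains = "DemoFile")
setClass("DemoBigWigFile", contains = "DemoFile")

# Hard-coded lookup: extension regex -> name of the concrete subclass.
# Supporting a new subclass means editing this table by hand.
.demoFileTypes <- c("\\.bam$" = "DemoBamFile", "\\.bw$" = "DemoBigWigFile")

# General constructor that shares its name with the virtual class.
DemoFile <- function(path) {
    stopifnot(is.character(path), length(path) == 1L)
    # One logical per registered regex: does it match this path?
    hit <- vapply(names(.demoFileTypes), grepl, logical(1), x = path)
    if (sum(hit) != 1L)
        stop("cannot determine a unique file type for '", path, "'")
    new(.demoFileTypes[[which(hit)]], path = path)
}

x <- DemoFile("foo.bam")
is(x, "DemoBamFile")
```

In this layout the table and the subclasses it names live side by side, so
any dependency on the defining packages is declared once, in the package
that holds the constructor.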

This clean-up of the *File jungle would also be a good opportunity to:

   - Choose what we want to do with reference classes: use them for all
     the *File classes or for none of them. (Right now, those defined
     in Rsamtools are reference classes, and those defined in
     rtracklayer are not.)

   - Move the I/O functionality currently in rtracklayer to a
     separate package. Based on the number of contributed packages I
     reviewed so far that were trying to reinvent the wheel because
     they had no idea that the I/O function they needed was actually
     in rtracklayer, I'd like to advocate for using a package name
     that makes it very clear that it's all about I/O.
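
For readers weighing the first point, the practical difference between the
two kinds of classes is copy semantics. The sketch below uses two
hypothetical classes (ValueFile and RefFile) with an invented isOpen field;
it only illustrates the language behavior and says nothing about how the
Rsamtools or rtracklayer classes are actually defined.

```r
library(methods)

# Hypothetical classes, for illustration only.
setClass("ValueFile", representation(path = "character", isOpen = "logical"))
RefFile <- setRefClass("RefFile",
                       fields = list(path = "character", isOpen = "logical"))

# Ordinary S4 class: assignment copies, so changing v2 leaves v1 alone.
v1 <- new("ValueFile", path = "foo.bam", isOpen = FALSE)
v2 <- v1
v2@isOpen <- TRUE

# Reference class: r1 and r2 name the same object, so the change is shared.
r1 <- RefFile$new(path = "foo.bam", isOpen = FALSE)
r2 <- r1
r2$isOpen <- TRUE

v1@isOpen   # still FALSE
r1$isOpen   # now TRUE
```

Mixing the two behaviors under one parent class would mean that the same
user code mutates some File objects in place and silently copies others.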

H.


>
> H.
>
>
>>
>> Val
>>
>>>
>>> Thanks,
>>> H.
>>>
>>>
>>> On 03/10/2014 08:46 PM, Valerie Obenchain wrote:
>>>> Hi all,
>>>>
>>>> I'm soliciting feedback on the idea of a general file 'registry' that
>>>> would identify file types by their extensions. This is similar in
>>>> spirit to FileForFormat() in rtracklayer but a more general abstraction
>>>> that could be used across packages. The goal is to allow a user to
>>>> supply only file name(s) to a method instead of first creating a 'File'
>>>> object such as BamFile, FaFile, BigWigFile etc.
>>>>
>>>> A first attempt at this is in the GenomicFileViews package
>>>> (https://github.com/Bioconductor/GenomicFileViews). A registry (lookup)
>>>> is created as an environment at load time:
>>>>
>>>> .fileTypeRegistry <- new.env(parent=emptyenv())
>>>>
>>>> Files are registered with an information triplet consisting of class,
>>>> package and regular expression to identify the extension. In
>>>> GenomicFileViews we register FaFileList, BamFileList and BigWigFileList
>>>> but any 'File' class can be registered that has a constructor of the
>>>> same name.
>>>>
>>>> .onLoad <- function(libname, pkgname)
>>>> {
>>>>      registerFileType("FaFileList", "Rsamtools", "\\.fa$")
>>>>      registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
>>>>      registerFileType("BamFileList", "Rsamtools", "\\.bam$")
>>>>      registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")
>>>> }
>>>>
>>>> The makeFileType() helper creates the appropriate class. This function
>>>> is used behind the scenes to do the lookup and coerce to the correct
>>>> 'File' class.
>>>>
>>>>  > makeFileType(c("foo.bam", "bar.bam"))
>>>> BamFileList of length 2
>>>> names(2): foo.bam bar.bam
>>>>
>>>> New types can be added at any time with registerFileType():
>>>>
>>>> registerFileType("NewClass", "NewPackage", "\\.NewExtension$")
>>>>
>>>>
>>>> Thoughts:
>>>>
>>>> (1) If this sounds generally useful where should it live? rtracklayer,
>>>> GenomicFileViews or other? Alternatively it could be its own
>>>> lightweight package (FileRegister) that creates the registry and
>>>> provides the helpers. It would be up to the package authors that
>>>> depend on FileRegister to register their own file types at load time.
>>>>
>>>> (2) To avoid potential ambiguities maybe searching should be by regex
>>>> and package name. Still a work in progress.
>>>>
>>>>
>>>> Valerie
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>>
>
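
The registry described in the quoted proposal (an environment filled with
class/package/regex triplets, then searched by file name) can be sketched
roughly as follows. This is not the GenomicFileViews implementation: the
names ending in "Demo" are hypothetical, and the lookup stops at returning
the registered class and package names instead of building an object, so
the example runs without Rsamtools or rtracklayer installed.

```r
# Hypothetical registry: an environment keyed by extension regex.
.registryDemo <- new.env(parent = emptyenv())

# Store one (class, package) pair under its regex.
registerFileTypeDemo <- function(class, package, regex) {
    stopifnot(is.character(class), is.character(package),
              is.character(regex))
    assign(regex, list(class = class, package = package),
           envir = .registryDemo)
    invisible(regex)
}

# Find the single registered type whose regex matches every path.
findFileTypeDemo <- function(paths) {
    regexes <- ls(.registryDemo, all.names = TRUE)
    hits <- Filter(function(re) all(grepl(re, paths)), regexes)
    if (length(hits) != 1L)
        stop("expected exactly one registered type, found ", length(hits))
    get(hits, envir = .registryDemo)
}

registerFileTypeDemo("BamFileList", "Rsamtools", "\\.bam$")
registerFileTypeDemo("BigWigFileList", "rtracklayer", "\\.bw$")

type <- findFileTypeDemo(c("foo.bam", "bar.bam"))
type$class
type$package
```

Turning the result into an object would then mean looking up the
constructor of that name in the named package, for example with
getExportedValue(type$package, type$class), which only works if that
package is installed. That is the dependency question raised above.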

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


