[Bioc-devel] file registry - feedback

Valerie Obenchain vobencha at fhcrc.org
Tue Mar 11 18:08:35 CET 2014


On 03/11/2014 09:47 AM, Michael Lawrence wrote:
> Except for the checksum, the existing File classes should support this,
> where the package provides a dataset via data() that is just the serialized
> File object (path). One could create a FileWithChecksum class that
> decorates a File object with a checksum. Any attempts to read the file are
> intercepted by the decorator, which verifies the checksum, and then
> delegates.

Neat. Sounds like this is worth pursuing.
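
A minimal sketch of the decorator idea, using base R only. The names
SimpleFile, FileWithChecksum() and readFile() are hypothetical stand-ins,
not existing rtracklayer or Rsamtools API; a real version would wrap one of
the existing File classes and whatever generic is used to read it.

```r
library(methods)
library(tools)   # md5sum()

# Hypothetical stand-in for an existing File class.
setClass("SimpleFile", slots = c(path = "character"))

# Decorator: wraps a file object and keeps the digest taken when it was built.
setClass("FileWithChecksum",
         slots = c(file = "SimpleFile", checksum = "character"))

FileWithChecksum <- function(file) {
    stopifnot(file.exists(file@path))
    new("FileWithChecksum", file = file,
        checksum = unname(md5sum(file@path)))
}

# Hypothetical read generic, standing in for the real reader.
setGeneric("readFile", function(x, ...) standardGeneric("readFile"))

setMethod("readFile", "SimpleFile", function(x, ...) readLines(x@path))

# The decorator intercepts the read, verifies the digest, then delegates.
setMethod("readFile", "FileWithChecksum", function(x, ...) {
    current <- unname(md5sum(x@file@path))
    if (!identical(current, x@checksum))
        stop("checksum mismatch for ", x@file@path)
    readFile(x@file, ...)
})
```

Recomputing the digest on every read is expensive for large files, so a real
implementation might verify once per session instead.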

> Michael
> On Tue, Mar 11, 2014 at 8:53 AM, Vincent Carey
> <stvjc at channing.harvard.edu> wrote:
>> I'm going to suggest a use case that may motivate this type of development.
>> Up to 2010 or so, data packages generally made sense.  You have about
>> 100-500MB of serialized or pre-serialized stuff.  Installing it in an R
>> package is unpleasant from a resource consumption perspective but it works,
>> you can use data/extdata and work with data with programmatic access,
>> documentation and checkability.
>> More recently, it is easy to come across data resources that we'd like to
>> have package-like control over/access to, but installing such packages
>> makes no sense.  The volume is too big, and you want to work with the
>> resource with non-R tools as well from time to time.  You don't want to
>> move the data.
>> We should have a protocol for "packaging" data without installing it.  A
>> digest of the raw data resource should be computed and kept in the
>> registry.  A registered file can be part of a package that can be checked
>> and installed, but the data themselves do not move.  Genomic data in S3
>> buckets should provide a basic use case.
>> The digest is recomputed whenever we want to start working with the
>> registry/package to verify that we are working with the intended artifact.
>> On Tue, Mar 11, 2014 at 11:11 AM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
>>> Would it be better to let the user (registerer) specify a function, which
>>> could be a simple class constructor or something more complex in cases
>>> where that would be useful?

Yes, good suggestion.
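
A rough sketch of what registering a function could look like, assuming the
registry maps a regular expression directly to a constructor. The names
registerFileHandler(), makeFile() and ToyBamFile are hypothetical and only
illustrate the idea; they are not what GenomicFileViews currently does.

```r
# Registry keyed by regular expression; each value is a function of the path.
.handlerRegistry <- new.env(parent = emptyenv())

registerFileHandler <- function(pattern, constructor) {
    stopifnot(is.character(pattern), length(pattern) == 1L,
              is.function(constructor))
    assign(pattern, constructor, envir = .handlerRegistry)
    invisible(pattern)
}

# Call the first registered function whose pattern matches the path.
makeFile <- function(path) {
    for (pattern in ls(.handlerRegistry, all.names = TRUE)) {
        if (grepl(pattern, path))
            return(get(pattern, envir = .handlerRegistry)(path))
    }
    stop("no registered file type matches ", path)
}

# A toy constructor; in practice this could be a class constructor or a
# function that does extra setup before returning the object.
registerFileHandler("\\.bam$", function(path)
    structure(list(path = path), class = "ToyBamFile"))

x <- makeFile("foo.bam")
```

Because a function carries its own enclosing environment, no separate
package name is needed to find it at lookup time.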

>>> This could allow the concept to generalize to other things, such as
>>> databases that might need some startup machinery called before they are
>>> actually useful to the user.

The intent of the registry was to provide a way to look up files by their 
extension. I'm not sure how this applies to the database example. Do you 
envision creating multiple databases throughout an R session (vs. a 
single setup at load time)? For example, would a file with a type 'X' 
extension become a type 'X' database?

>>> This would also deal with Michael's point about package/where since
>>> functions have their own "where" information. Unless I'm missing some
>>> other intent for specifying a specific package?
>>> ~G
>>> On Tue, Mar 11, 2014 at 5:59 AM, Michael Lawrence
>>> <lawrence.michael at gene.com> wrote:
>>>> rtracklayer essentially has this, although registration is implicit
>>>> through extension of RTLFile or RsamtoolsFile, and the extension is taken
>>>> from the class name. There is a BigWigFile, corresponding to ".bigwig",
>>>> and that is extended by BWFile to support the ".bw" extension. The
>>>> expectation is that other packages would extend RTLFile to implicitly
>>>> register handlers.  I'm not sure there is a use case for generalization,
>>>> but this proposal makes registration more explicit, which is probably a
>>>> good thing. rtracklayer was just piggybacking on S4 registration.
>>>> I'm a little bit confused by the use of Lists rather than individual File
>>>> objects. Are you also proposing that all RTLFiles would need a
>>>> corresponding List, and that there would need to be an RTLFileList method
>>>> for the various generics?

No, I don't want to force the 'List' route. I was using them in 
GenomicFileViews so that's what I registered. The 'class' should be any 
class that has a constructor of the same name. Thinking about this more, 
the 'class' should probably be the individual File class instead of the 
List class. Coercion to List can be done inside the helper.
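
Roughly, the helper could build one object per path and collect them at the
end. This is only a sketch: ToyBamFile, toyRegistry and makeFiles() are
hypothetical, and a plain named list stands in for a typed List such as
BamFileList.

```r
# Toy constructor that shares its name with its class, as the registry
# convention assumes.
ToyBamFile <- function(path) structure(list(path = path), class = "ToyBamFile")

# Regular expression -> name of the individual File class.
toyRegistry <- list("\\.bam$" = "ToyBamFile")

makeFiles <- function(paths) {
    files <- lapply(paths, function(p) {
        hit <- Filter(function(rx) grepl(rx, p), names(toyRegistry))
        if (length(hit) == 0L)
            stop("no registered type for ", p)
        # Look up the constructor by the registered class name.
        ctor <- get(toyRegistry[[hit[1L]]], mode = "function")
        ctor(p)
    })
    names(files) <- basename(paths)
    files   # the coercion to a List class would happen here
}

out <- makeFiles(c("foo.bam", "bar.bam"))
```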

>>>> It may not be necessary to specify the package name. There should be an
>>>> environment (where) argument that defaults to topenv(parent.frame()), and
>>>> that should suffice.

I'll look into this.
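
A sketch of how the 'where' argument might replace the package name,
assuming each registry entry stores the environment in which to find the
constructor. The signature below and resolveConstructor() are hypothetical
and differ from the current registerFileType() in GenomicFileViews.

```r
.fileTypeRegistry <- new.env(parent = emptyenv())

# 'where' defaults to the top-level environment of the caller: the package
# namespace when called from .onLoad(), the global environment when called
# interactively.
registerFileType <- function(class, pattern,
                             where = topenv(parent.frame())) {
    stopifnot(is.character(class), is.character(pattern),
              is.environment(where))
    assign(pattern, list(class = class, where = where),
           envir = .fileTypeRegistry)
    invisible(NULL)
}

# Find the constructor named after the class, starting from 'where'.
resolveConstructor <- function(pattern) {
    entry <- get(pattern, envir = .fileTypeRegistry)
    get(entry$class, envir = entry$where, mode = "function")
}
```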

Any comments on whether this should be its own package or in an 
existing one?

Thanks for the input.

>>>> Michael
>>>> On Mon, Mar 10, 2014 at 8:46 PM, Valerie Obenchain
>>>> <vobencha at fhcrc.org> wrote:
>>>>> Hi all,
>>>>> I'm soliciting feedback on the idea of a general file 'registry' that
>>>>> would identify file types by their extensions. This is similar in spirit
>>>>> to FileForFormat() in rtracklayer but a more general abstraction that
>>>>> could be used across packages. The goal is to allow a user to supply only
>>>>> file name(s) to a method instead of first creating a 'File' class such as
>>>>> BamFile, FaFile, BigWigFile etc.
>>>>> A first attempt at this is in the GenomicFileViews package
>>>>> (https://github.com/Bioconductor/GenomicFileViews). A registry (lookup)
>>>>> is created as an environment at load time:
>>>>> .fileTypeRegistry <- new.env(parent=emptyenv())
>>>>> Files are registered with an information triplet consisting of class,
>>>>> package and regular expression to identify the extension. In
>>>>> GenomicFileViews we register FaFileList, BamFileList and BigWigFileList
>>>>> but any 'File' class can be registered that has a constructor of the same
>>>>> name.
>>>>> .onLoad <- function(libname, pkgname)
>>>>> {
>>>>>      registerFileType("FaFileList", "Rsamtools", "\\.fa$")
>>>>>      registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
>>>>>      registerFileType("BamFileList", "Rsamtools", "\\.bam$")
>>>>>      registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")
>>>>> }
>>>>> The makeFileType() helper creates the appropriate class. This function is
>>>>> used behind the scenes to do the lookup and coerce to the correct 'File'
>>>>> class.
>>>>> > makeFileType(c("foo.bam", "bar.bam"))
>>>>> BamFileList of length 2
>>>>> names(2): foo.bam bar.bam
>>>>> New types can be added at any time with registerFileType():
>>>>> registerFileType(NewClass, NewPackage, "\\.NewExtension$")
>>>>> Thoughts:
>>>>> (1) If this sounds generally useful where should it live? rtracklayer,
>>>>> GenomicFileViews or other? Alternatively it could be its own lightweight
>>>>> package (FileRegister) that creates the registry and provides the helpers.
>>>>> It would be up to the package authors that depend on FileRegister to
>>>>> register their own file types at load time.
>>>>> (2) To avoid potential ambiguities maybe searching should be by regex and
>>>>> package name. Still a work in progress.
>>>>> Valerie
>>>>          [[alternative HTML version deleted]]
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>> --
>>> Gabriel Becker
>>> Graduate Student
>>> Statistics Department
>>> University of California, Davis

Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: vobencha at fhcrc.org
Phone:  (206) 667-3158
Fax:    (206) 667-1319
