[Bioc-devel] XVector: abstraction

Hervé Pagès hpages at fhcrc.org
Mon Dec 9 20:33:28 CET 2013


On 12/09/2013 10:50 AM, Kasper Daniel Hansen wrote:
> I agree with Michael. I don't think we want to deprive ourselves of good
> approaches by a need for supporting Windows.  Especially in a case like
> this where on-disk representation is optional.

I agree mmap is appealing. I just didn't want to have to depend on
it in XVector, which is at the bottom of the package stack. For now my
focus/interest is more on the OnDiskVector concept/API. Specific storage
back-ends can be implemented as concrete subclasses. There are already
2 of them (DirectRaw and SerializedRaw). Others can be added for mmap
and HDF5 for example. They don't necessarily have to be implemented in
XVector.
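
To make the idea of "one abstract API, several storage back-ends" concrete,
here is a minimal sketch in Python. It is an illustration only, not the
XVector code and not R: the names OnDiskBytes, SeekReadBytes, MmapBytes and
demo are hypothetical, and the two back-ends are stand-ins chosen for the
example (plain seek/read and a read-only memory map), not the DirectRaw or
SerializedRaw classes mentioned above. Callers only see subseq(), so a
back-end can be swapped without touching them.

```python
import mmap
import os
import tempfile
from abc import ABC, abstractmethod


class OnDiskBytes(ABC):
    """Hypothetical abstract API: random access to bytes stored in a file."""

    def __init__(self, path):
        self.path = path

    def __len__(self):
        # Length of the vector is the size of the backing file in bytes.
        return os.path.getsize(self.path)

    @abstractmethod
    def subseq(self, start, width):
        """Return `width` bytes starting at 0-based offset `start`."""


class SeekReadBytes(OnDiskBytes):
    """Back-end using plain seek() + read(); does not depend on mmap."""

    def subseq(self, start, width):
        with open(self.path, "rb") as f:
            f.seek(start)
            return f.read(width)


class MmapBytes(OnDiskBytes):
    """Back-end using a read-only memory map of the whole file."""

    def subseq(self, start, width):
        with open(self.path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                # Slicing an mmap object copies the bytes out of the map.
                return m[start:start + width]


def demo():
    """Write a small file and read the same subsequence via both back-ends."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"ACGTACGTTTGA")
        path = tmp.name
    try:
        return [cls(path).subseq(4, 5) for cls in (SeekReadBytes, MmapBytes)]
    finally:
        os.remove(path)
```

Both back-ends return the same bytes for the same request, which is the
property the abstract class is meant to guarantee. A package low in the stack
would only need to ship the abstract class and the portable back-end; an
mmap-based one could live elsewhere.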

H.

>
> Kasper
>
>
> On Mon, Dec 9, 2013 at 1:46 PM, Michael Lawrence
> <lawrence.michael at gene.com> wrote:
>
>     On Mon, Dec 9, 2013 at 9:30 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>      > On 12/09/2013 05:39 AM, Michael Lawrence wrote:
>      >
>      >> Any thoughts about using mmap(), so that SharedRaw and OnDiskRaw just
>      >> operate on a pointer as the abstraction?
>      >>
>      >
>      > Martin mentioned mmap to me for this project but I had some concerns
>      > about Windows compatibility. Are there CRAN or BioC packages that use
>      > it? Would be interesting to have a look at them.
>      >
>
>     bigmemory is a CRAN package, and it is extended by bigmemoryExtras in
>     Bioconductor.
>
>     No Windows version available, of course. But seriously, who uses
>     Windows to crunch data? Easy enough to fall back to the in-memory
>     implementation.
>
>
>
>      > H.
>      >
>      >
>      >> Michael
>      >>
>      >>
>      >> On Sun, Dec 8, 2013 at 11:39 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>      >>
>      >>     Hi Michael,
>      >>
>      >>     The OnDiskXRaw virtual class (if this is what you're referring to)
>      >>     is still a very early work-in-progress. The idea is to experiment
>      >>     with on-disk representation of atomic vectors and direct random access
>      >>     to subsequences of the vector. The exact storage mode is implemented by
>      >>     concrete subclasses (currently only DirectRaw and SerializedRaw).
>      >>     OnDiskXRaw is actually analogous to SharedRaw except that with the latter
>      >>     the "shared" sequence of bytes resides in memory.
>      >>
>      >>     If we had "on-disk" support for all atomic vectors, it sounds like it
>      >>     would then be easy to support "on-disk" versions of higher-level
>      >>     objects like IRanges or GRanges. They would be defined as their
>      >>     "in-memory" counterpart except that the slots that are atomic vectors
>      >>     in the "in-memory" version would just need to be replaced by "on-disk"
>      >>     atomic vectors. "On-disk" versions of DNAString (and even DNAStringSet)
>      >>     objects could also easily be implemented e.g. by just making the
>      >>     "shared" slot an OnDiskXRaw object instead of a SharedRaw object.
>      >>
>      >>     Putting SharedRaw and OnDiskXRaw under the same umbrella (i.e. under
>      >>     a virtual class) and using that virtual class to specify the slot of
>      >>     higher-level objects like DNAString is tempting but realistically we
>      >>     don't operate on an on-disk object like we do on an in-memory object.
>      >>
>      >>     Having an "on-disk" version of DNAString with direct random access was
>      >>     in fact the initial motivation for OnDiskXRaw. The use case for this
>      >>     was to support direct random access in BSgenome objects without having
>      >>     to change the way the chromosomes are stored on disk (they're stored
>      >>     as serialized raw vectors). I've finally implemented this feature (will
>      >>     soon be pushed to BioC devel) but I changed the storage and didn't use
>      >>     OnDiskXRaw in the end.
>      >>
>      >>     H.
>      >>
>      >>
>      >>
>      >>     On 12/05/2013 06:43 AM, Michael Lawrence wrote:
>      >>
>      >>         A nice goal for the XVector package would be full implementation
>      >>         of the R vector API on top of the already existing memory-sharing
>      >>         (rather than memory-duplicating) data structures. The actual storage
>      >>         mode of the data should obviously be abstracted, e.g., on-disk should
>      >>         be treated the same as the externalptr representation. Much of the
>      >>         implementation will need to be in C, unless we want to pay the price
>      >>         of extracting things into ordinary R vectors. Should the abstraction
>      >>         therefore be dropped down to the C level, so that the implementations
>      >>         can more easily share with each other? Anything to gain here from the
>      >>         externalVector package?
>      >>
>
>
>     _______________________________________________
>     Bioc-devel at r-project.org mailing list
>     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


