[Bioc-devel] XVector: abstraction
Hervé Pagès
hpages at fhcrc.org
Mon Dec 9 20:33:28 CET 2013
On 12/09/2013 10:50 AM, Kasper Daniel Hansen wrote:
> I agree with Michael. I don't think we want to deprive ourselves of good
> approaches because of a need to support Windows, especially in a case
> like this where on-disk representation is optional.
I agree mmap is appealing. I just didn't want to have to depend on
it in XVector, which is at the bottom of the package stack. For now my
focus/interest is more on the OnDiskVector concept/API. Specific storage
back-ends can be implemented as concrete subclasses. There are already
2 of them (DirectRaw and SerializedRaw). Others can be added for mmap
and HDF5 for example. They don't necessarily have to be implemented in
XVector.
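
To make the concept/API idea a bit more concrete, here is a minimal sketch, written in Python rather than R purely for illustration. All names in it (OnDiskBytes, SeekBackend, MmapBackend, demo) are hypothetical and are not XVector classes; it only shows one abstract interface for random access to on-disk bytes with two interchangeable storage back-ends, one using plain seek/read and one using mmap.

```python
import mmap
import os
import tempfile
from abc import ABC, abstractmethod


class OnDiskBytes(ABC):
    """Hypothetical interface: random access to a byte vector kept in a file."""

    def __init__(self, path):
        self.path = path

    def __len__(self):
        # The vector length is the file size; nothing is loaded into memory.
        return os.path.getsize(self.path)

    @abstractmethod
    def read(self, start, end):
        """Return bytes [start, end) without reading the whole file."""

    def __getitem__(self, key):
        # This sketch supports contiguous slices only.
        if not isinstance(key, slice):
            raise TypeError("only slices are supported in this sketch")
        start, end, step = key.indices(len(self))
        if step != 1:
            raise ValueError("only contiguous slices are supported")
        return self.read(start, end)


class SeekBackend(OnDiskBytes):
    """Back-end that seeks to the offset and reads just the requested bytes."""

    def read(self, start, end):
        with open(self.path, "rb") as f:
            f.seek(start)
            return f.read(max(0, end - start))


class MmapBackend(OnDiskBytes):
    """Back-end that maps the file and slices the mapping (file must be non-empty)."""

    def read(self, start, end):
        with open(self.path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                return m[start:end]


def demo():
    # Write a small "sequence" to a temporary file, then read the same
    # subsequence through both back-ends.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"ACGTACGTTTGA")
        return [backend(path)[4:9] for backend in (SeekBackend, MmapBackend)]
    finally:
        os.remove(path)
```

Calling demo() reads the same subsequence through both back-ends. Code written against the abstract class does not need to know which back-end is in use, which is why an mmap-based one could live outside the base package.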
H.
>
> Kasper
>
>
> On Mon, Dec 9, 2013 at 1:46 PM, Michael Lawrence
> <lawrence.michael at gene.com> wrote:
>
> On Mon, Dec 9, 2013 at 9:30 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> > On 12/09/2013 05:39 AM, Michael Lawrence wrote:
> >
> >> Any thoughts about using mmap(), so that SharedRaw and OnDiskRaw
> >> just operate on a pointer as the abstraction?
> >>
> >
> > Martin mentioned mmap to me for this project but I had some concerns
> > about Windows compatibility. Are there CRAN or BioC packages that use
> > it? Would be interesting to have a look at them.
> >
>
> bigmemory is a CRAN package, and it is extended by bigmemoryExtras in
> Bioconductor.
>
> No Windows version available, of course. But seriously, who uses
> Windows to crunch data? Easy enough to fall back to the in-memory
> implementation.
>
>
>
> > H.
> >
> >
> >> Michael
> >>
> >>
> >> On Sun, Dec 8, 2013 at 11:39 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> >>
> >> Hi Michael,
> >>
> >> The OnDiskXRaw virtual class (if this is what you're referring to)
> >> is still a very early work-in-progress. The idea is to experiment
> >> with on-disk representation of atomic vectors and direct random
> >> access to subsequences of the vector. The exact storage mode is
> >> implemented by concrete subclasses (currently only DirectRaw and
> >> SerializedRaw). OnDiskXRaw is actually analogous to SharedRaw,
> >> except that with the latter the "shared" sequence of bytes resides
> >> in memory.
> >>
> >> If we had "on-disk" support for all atomic vectors, it sounds like
> >> it would then be easy to support "on-disk" versions of higher-level
> >> objects like IRanges or GRanges. They would be defined like their
> >> "in-memory" counterparts, except that the slots that are atomic
> >> vectors in the "in-memory" version would just need to be replaced
> >> by "on-disk" atomic vectors. "On-disk" versions of DNAString (and
> >> even DNAStringSet) objects could also easily be implemented, e.g. by
> >> just making the "shared" slot an OnDiskXRaw object instead of a
> >> SharedRaw object.
> >>
> >> Putting SharedRaw and OnDiskXRaw under the same umbrella (i.e. under
> >> a virtual class) and using that virtual class to specify the slot of
> >> higher-level objects like DNAString is tempting, but realistically
> >> we don't operate on an on-disk object like we do on an in-memory
> >> object.
> >>
> >> Having an "on-disk" version of DNAString with direct random access
> >> was in fact the initial motivation for OnDiskXRaw. The use case for
> >> this was to support direct random access in BSgenome objects without
> >> having to change the way the chromosomes are stored on disk (they're
> >> stored as serialized raw vectors). I've finally implemented this
> >> feature (will soon be pushed to BioC devel) but I changed the storage
> >> and didn't use OnDiskXRaw in the end.
> >>
> >> H.
> >>
> >>
> >>
> >> On 12/05/2013 06:43 AM, Michael Lawrence wrote:
> >>
> >> A nice goal for the XVector package would be full implementation of
> >> the R vector API on top of the already existing memory-sharing
> >> (rather than memory-duplicating) data structures. The actual storage
> >> mode of the data should obviously be abstracted, e.g., on-disk
> >> should be treated the same as the externalptr representation. Much
> >> of the implementation will need to be in C, unless we want to pay
> >> the price of extracting things into ordinary R vectors. Should the
> >> abstraction therefore be dropped down to the C level, so that the
> >> implementations can more easily share code with each other? Anything
> >> to gain here from the externalVector package?
> >>
> >> _______________________________________________
> >> Bioc-devel at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>
> >>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319