[Bioc-devel] XVector: abstraction

Hervé Pagès hpages at fhcrc.org
Mon Dec 9 08:39:22 CET 2013

Hi Michael,

The OnDiskXRaw virtual class (if this is what you're referring to)
is still a very early work-in-progress. The idea is to experiment
with on-disk representation of atomic vectors and direct random access
to subsequences of the vector. The exact storage mode is implemented by
concrete subclasses (currently only DirectRaw and SerializedRaw).
OnDiskXRaw is actually analogous to SharedRaw except that with the latter
the "shared" sequence of bytes resides in memory.
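
To make "direct random access to subsequences" concrete, here is a
minimal Python sketch. It is not XVector code: the class name
OnDiskBytes, the subseq() method and the header_size parameter are all
hypothetical, and the fixed-size header is only an assumption standing
in for whatever a given storage mode puts in front of the payload.

```python
import os
import tempfile


class OnDiskBytes:
    """Hypothetical on-disk byte vector: the data stays in the file
    and only the requested subsequence is read into memory."""

    def __init__(self, path, header_size=0):
        self.path = path
        # Assumption: the payload starts after a fixed-size header.
        self.header_size = header_size
        self.length = os.path.getsize(path) - header_size

    def subseq(self, start, width):
        """Return `width` bytes starting at 0-based offset `start`."""
        if start < 0 or width < 0 or start + width > self.length:
            raise IndexError("subsequence out of bounds")
        with open(self.path, "rb") as f:
            f.seek(self.header_size + start)  # jump straight to the data
            return f.read(width)


def demo():
    # Write a 4-byte made-up header followed by a 14-byte sequence.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"HDR!" + b"ACGTACGTNNACGT")
        path = tmp.name
    try:
        v = OnDiskBytes(path, header_size=4)
        return v.length, v.subseq(8, 4)
    finally:
        os.remove(path)
```

Calling demo() reads 4 bytes from the middle of the file without
loading the rest of it.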

If we had "on-disk" support for all atomic vectors, it sounds like it
would then be easy to support "on-disk" versions of higher-level
objects like IRanges or GRanges. They would be defined like their
"in-memory" counterparts except that the slots that are atomic vectors
in the "in-memory" version would just need to be replaced by "on-disk"
atomic vectors. "On-disk" versions of DNAString (and even DNAStringSet)
objects could also easily be implemented e.g. by just making the
"shared" slot an OnDiskXRaw object instead of a SharedRaw object.
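
The slot-replacement idea can be sketched as follows, in Python rather
than S4 and purely as an illustration. The names OnDiskInts, Ranges and
write_ints are hypothetical, and the file layout (a flat run of 32-bit
little-endian integers) is an assumption, not the format used by any
Bioconductor package. The point is that the higher-level class is
written once and only the objects placed in its slots change.

```python
import os
import struct
import tempfile


class OnDiskInts:
    """Hypothetical on-disk vector of 32-bit little-endian integers."""

    ITEM = struct.Struct("<i")

    def __init__(self, path):
        self.path = path

    def __len__(self):
        return os.path.getsize(self.path) // self.ITEM.size

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        with open(self.path, "rb") as f:
            f.seek(i * self.ITEM.size)  # random access to element i
            return self.ITEM.unpack(f.read(self.ITEM.size))[0]


class Ranges:
    """Toy stand-in for a ranges object; its 'slots' can hold any
    vector-like object, in memory or on disk."""

    def __init__(self, start, width):
        if len(start) != len(width):
            raise ValueError("start and width must have the same length")
        self.start = start
        self.width = width

    def end(self, i):
        return self.start[i] + self.width[i] - 1


def write_ints(values):
    """Write integers to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        for v in values:
            tmp.write(OnDiskInts.ITEM.pack(v))
        return tmp.name


def demo():
    starts, widths = [1, 10, 100], [5, 3, 20]
    paths = [write_ints(starts), write_ints(widths)]
    try:
        in_memory = Ranges(starts, widths)
        on_disk = Ranges(OnDiskInts(paths[0]), OnDiskInts(paths[1]))
        return ([in_memory.end(i) for i in range(3)],
                [on_disk.end(i) for i in range(3)])
    finally:
        for p in paths:
            os.remove(p)
```

Both Ranges objects in demo() give the same end positions; only the
slot contents differ.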

Putting SharedRaw and OnDiskXRaw under the same umbrella (i.e. under
a virtual class) and using that virtual class to specify the slot of
higher-level objects like DNAString is tempting but realistically we
don't operate on an on-disk object like we do on an in-memory object.
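
One way to read this caveat, sketched in Python under stated
assumptions: a common abstract class is easy to write, but code that is
reasonable for an in-memory object can be costly for an on-disk one.
The names ByteStore, InMemoryStore, OnDiskStore and count_byte are
hypothetical, and the reads counter exists only to make the access
pattern visible.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class ByteStore(ABC):
    """Hypothetical common parent for in-memory and on-disk storage."""

    @abstractmethod
    def __len__(self): ...

    @abstractmethod
    def subseq(self, start, width): ...


class InMemoryStore(ByteStore):
    def __init__(self, data):
        self.data = bytes(data)

    def __len__(self):
        return len(self.data)

    def subseq(self, start, width):
        return self.data[start:start + width]


class OnDiskStore(ByteStore):
    def __init__(self, path):
        self.path = path
        self.reads = 0  # number of file accesses, for illustration only

    def __len__(self):
        return os.path.getsize(self.path)

    def subseq(self, start, width):
        self.reads += 1
        with open(self.path, "rb") as f:
            f.seek(start)
            return f.read(width)


def count_byte(store, byte, chunk=1):
    """Count occurrences of a single byte, fetching `chunk` bytes at a time."""
    total = 0
    for start in range(0, len(store), chunk):
        total += store.subseq(start, chunk).count(byte)
    return total


def demo():
    data = b"ACGTACGTAA"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        mem = count_byte(InMemoryStore(data), b"A")
        naive = OnDiskStore(path)
        naive_count = count_byte(naive, b"A")            # one read per byte
        chunked = OnDiskStore(path)
        chunked_count = count_byte(chunked, b"A", chunk=4)
        return mem, naive_count, naive.reads, chunked_count, chunked.reads
    finally:
        os.remove(path)
```

The element-by-element loop is harmless in memory but triggers one file
access per byte on disk, so generic code written against the shared
parent still has to know which kind of object it is handling.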

Having an "on-disk" version of DNAString with direct random access was
in fact the initial motivation for OnDiskXRaw. The use case for this
was to support direct random access in BSgenome objects without having
to change the way the chromosomes are stored on disk (they're stored
as serialized raw vectors). I've finally implemented this feature (will
soon be pushed to BioC devel) but I changed the storage and didn't use
OnDiskXRaw in the end.


On 12/05/2013 06:43 AM, Michael Lawrence wrote:
> A nice goal for the XVector package would be full implementation of the R
> vector API on top of the already existing memory-sharing (rather than
> memory-duplicating) data structures. The actual storage mode of the data
> should obviously be abstracted, e.g., on-disk should be treated the same
> as the externalptr representation. Much of the implementation will need to
> be in C, unless we want to pay the price of extracting things into ordinary
> R vectors. Should the abstraction be therefore dropped down to the C level,
> so that the implementations can more easily share from each other? Anything
> to gain here from the externalVector package?

Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
