[Bioc-devel] Changes to DataFrame
Pages, Herve
hpages at fredhutch.org
Wed Aug 28 03:39:59 CEST 2019
Hi developers,
Short story: these changes shouldn't affect you but I recommend you read
the long story just in case.
Long story:
Some of you may have already noticed that I was making changes to the
DataFrame class. The idea is to "make room" for other data-frame-like
containers by having DataFrame become a virtual class with no slots, with
concrete subclasses that provide specific
representations/implementations. This will make it easier to experiment
with on-disk data frame representations (e.g. SQL-based, Parquet-based,
etc.) and make these data-frame-like containers reusable in any place
where a DataFrame object is currently expected. The typical use cases we
have in mind are to support on-disk storage of the metadata columns of
Vector-like derivatives or on-disk storage of the colData slot of a
SummarizedExperiment object or derivative.
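
To illustrate the intended end state (a virtual parent with no slots and concrete subclasses that each provide their own representation), here is a minimal sketch in plain S4. The class names ToyDataFrame, ToyDFrame and ToyDiskFrame, the toyBackend() generic and the describe() helper are all hypothetical and exist only for this example; none of this is the actual S4Vectors code.

```r
library(methods)

# Hypothetical toy classes, named so they cannot clash with S4Vectors.
# The parent is virtual and defines no slots.
setClass("ToyDataFrame", representation("VIRTUAL"))

# In-memory representation: columns held in an ordinary list.
setClass("ToyDFrame", contains = "ToyDataFrame",
         representation(columns = "list"))

# Stand-in for an on-disk representation: it only records a file path.
setClass("ToyDiskFrame", contains = "ToyDataFrame",
         representation(path = "character"))

# Each concrete subclass supplies its own method for the generic.
setGeneric("toyBackend", function(x) standardGeneric("toyBackend"))
setMethod("toyBackend", "ToyDFrame", function(x) "in-memory")
setMethod("toyBackend", "ToyDiskFrame", function(x) "on-disk")

# Code that only requires "a ToyDataFrame" accepts either subclass.
describe <- function(x) {
    stopifnot(is(x, "ToyDataFrame"))
    paste(class(x), "uses", toyBackend(x), "storage")
}

mem <- new("ToyDFrame", columns = list(a = 1:3))
disk <- new("ToyDiskFrame", path = "toy.parquet")  # path is illustrative only

describe(mem)
describe(disk)
isVirtualClass("ToyDataFrame")
```

Because the parent is virtual, new("ToyDataFrame") fails, so every instance necessarily belongs to one of the concrete subclasses.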
The first round of changes I made was to introduce the DFrame class as a
subclass of DataFrame, and to have the DataFrame() constructor return a
DFrame object instead of a DataFrame. Note that DFrame uses exactly the
same internal representation as DataFrame (i.e. it does not add any slot
to the current representation of DataFrame) so for now DFrame and
DataFrame objects are equivalent (but this will change in the future
when DataFrame "loses" its slots). However, you should no longer see
DataFrame instances. More precisely: unless you use new("DataFrame",
...) (which you should not, you should always use the DataFrame()
constructor instead), you will always get DFrame instances instead of
DataFrame instances. In order to make this change as transparent as
possible to the end-user, show() still reports that the object is a
DataFrame. Note that this is actually true because is(x, "DataFrame") is
true on a DFrame object so we are not lying, just hiding the truth ;-)
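
A quick way to check this in a session, assuming S4Vectors 0.23.20 or later is installed (the column names below are made up for the example):

```r
library(S4Vectors)  # assumes S4Vectors >= 0.23.20

# Always go through the constructor, never new("DataFrame", ...).
df <- DataFrame(id = 1:3, label = c("a", "b", "c"))

class(df)            # concrete class of the instance: DFrame
is(df, "DataFrame")  # TRUE, because DFrame extends DataFrame
df                   # show() still labels the object as a DataFrame
```

Code that tests objects with is(x, "DataFrame") keeps working unchanged, whereas code that compares class(x) to "DataFrame" directly will no longer match.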
The only situation where you'll still see a DataFrame instance is when
you use readRDS(), load(), or data() to deserialize an object that was
created before these changes. Nothing wrong with these "old" objects
though: they're still valid objects and should keep working as before.
Note however that their population will naturally start to shrink from
now on until they completely disappear at some point in the future. FWIW
we've actually started to consider some strategies/tricks to accelerate
their eradication from planet earth.
I made similar changes to the DataFrameList class and subclasses.
These changes are in S4Vectors 0.23.20 and IRanges 2.19.14.
I think I've taken care of all software packages that this first round
of changes broke. Let me know if I didn't.
We're still a long way from having DataFrame be a virtual class with no
slots (and with DFrame being its "canonical" subclass i.e. providing the
current in-memory representation) so expect more changes in the future.
I'll report later here as we make significant progress on this but the
next major round of changes should not happen before the next BioC
release (i.e. when we start the BioC 3.11 6-month devel cycle).
Regards,
H.
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages using fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319