[Bioc-devel] How to trigger h5read.<classname> with h5read function in rhdf5

Hayden, Nathaniel nhayden at fhcrc.org
Fri Aug 8 23:55:17 CEST 2014


Hi, Bernd. Glad to hear from you! I'm out of the office so will get back to you with a more considered response Monday. My initial thought is yes: I agree the task of extending the functionality to truly arbitrary yet-unknown classes is daunting, but it seems we can get this encoding-reconstruction cycle a lot closer--and hence provide a lot of valuable benefit--with far less effort than getting it to work in 100% of cases.

I would add that part of the motivation for choosing a non-rds file format for a current project was to eliminate dependence on package versions for transporting Bioc objects between sessions.

Thanks,
Nate

----- Original Message -----
From: "Bernd Fischer" <bernd.fischer at embl.de>
To: "Nathaniel Hayden" <nhayden at fhcrc.org>
Cc: bioc-devel at r-project.org
Sent: Thursday, August 7, 2014 1:25:52 PM
Subject: Re: How to trigger h5read.<classname> with h5read function in rhdf5

Dear Nathaniel! 

h5read is in fact designed to be able to call a user-defined function
h5read.<myclass>, but this is not yet fully implemented or tested.
I put it on hold because of the complexity of the task. But maybe you and the
Bioc-devel list can help.


I can imagine the following scenario:
- h5write writes the attr(foo, "class") <- "myclass" attribute to the HDF5 object.
  This is already set up; one can invoke it by using write.attributes=TRUE, as you mentioned.
  h5write is a generic function, so one can write one's own h5write.<myclass> function.
- Before h5read reads the object, it tries to read the class attribute and invokes h5read.<myclass>,
  which is defined somewhere outside rhdf5.
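
To make the second step concrete, here is a minimal sketch in base R of how such a lookup could work. It is only an illustration under assumptions: dispatch_h5read and the class "myclass" are made-up names, nothing here is rhdf5's actual code, and the in-memory object stands in for data whose class attribute has already been read from the file.

```r
# Hypothetical helper (not part of rhdf5): look for a function named
# h5read.<class> and call it on the object if one exists.
dispatch_h5read <- function(obj, prefix = "h5read") {
  cl <- attr(obj, "class")
  if (is.null(cl)) {
    return(obj)                      # no class attribute stored: return as-is
  }
  for (k in cl) {
    fname <- paste(prefix, k, sep = ".")
    if (exists(fname, mode = "function")) {
      reader <- get(fname, mode = "function")
      return(reader(obj))            # first matching reader wins
    }
  }
  obj                                # no reader defined for any class
}

# Example reader for the made-up class "myclass"; in practice this would
# live in the package that defines the class.
h5read.myclass <- function(obj) {
  structure(unclass(obj), class = "myclass", restored = TRUE)
}

# Stand-in for a dataset read back together with its "class" attribute
raw <- structure(1:3, class = "myclass")
res <- dispatch_h5read(raw)
```

The lookup only finds readers that are already visible in the session, which is exactly why problem 1.) below matters.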

The problems I came across are: 
1.) Usually, h5read.<myclass> is implemented in some package "mypackage".
How do I know which package it is if the package is not yet loaded?
Do we have to store an additional "BioCpackage" attribute in the HDF5 object?
2.) What happens if the package provider changes the class definition in the next BioC release?
Do we have to store a package version number as well?
3.) How shall we deal with R attributes?
HDF5 attributes cannot store all R attributes, because HDF5 attributes are restricted to
a maximum size, whereas R attributes can be almost as large as you like. One way would be
to store attributes in a group called /obj.ATTRIBUTES.
E.g. assume you have an R object foo with attributes names = c("A","B",…) of length 2^30
and geneNames = c("ENSGA","ENSGB",…).
Should h5write write the following:

/foo : an HDF5 object, e.g. an integer array
/foo.ATTRIBUTES : a group
/foo.ATTRIBUTES/names : a string vector
/foo.ATTRIBUTES/geneNames : a string vector

This definitely breaks if someone wants to write a list that contains both
elements "foo" and "foo.ATTRIBUTES". Is this acceptable?
4.) What is the best standard for storing S3/S4 objects in HDF5?
Assume there is an object foo of class baa with slots a = "integer", b = "double" and c = "mysecondclass".
Should h5write write the following:

/foo : a group with attributes class="baa", BioCpackage="baapackage"
/foo/slots : a group
/foo/slots/a : integer
/foo/slots/b : double
/foo/slots/c : a group with attributes class="mysecondclass", BioCpackage="mysecondpackage"
/foo/slots/c/slots

If foo has additional attributes as above, h5write would write in addition:

/foo.ATTRIBUTES : a group
/foo.ATTRIBUTES/names : a string vector
/foo.ATTRIBUTES/geneNames : a string vector

This standard would allow the definition of a function that reads S3/S4 objects of any kind
and still allow users to define their own function h5read.<myclass>.
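
As a rough illustration of the layout in 4.), the following base R sketch lists the paths such a scheme would produce for a small S4 object, plus a check for the name clash mentioned in 3.). Assumptions: h5_layout and has_attr_clash are hypothetical helpers, the class definitions are invented to mirror the example (with "numeric" standing in for the double slot), and nothing is written to an HDF5 file.

```r
library(methods)

# Made-up classes mirroring the example: baa has slots a, b and c
setClass("mysecondclass", slots = c(x = "numeric"))
setClass("baa", slots = c(a = "integer", b = "numeric", c = "mysecondclass"))

# Hypothetical helper: return a named character vector whose names are the
# HDF5 paths of the proposed layout and whose values describe each node.
h5_layout <- function(obj, path) {
  if (!isS4(obj)) {
    # Plain R object: a single dataset, described by its first class
    return(stats::setNames(class(obj)[1], path))
  }
  out <- c(
    stats::setNames(sprintf("group (class=%s)", as.character(class(obj))), path),
    stats::setNames("group", paste(path, "slots", sep = "/"))
  )
  # Recurse into every slot, so nested S4 objects get their own /slots group
  for (s in slotNames(obj)) {
    out <- c(out, h5_layout(slot(obj, s), paste(path, "slots", s, sep = "/")))
  }
  out
}

# Hypothetical check for the clash in 3.): TRUE if some name "n" occurs
# together with "n.ATTRIBUTES" among the names to be written.
has_attr_clash <- function(nms) {
  any(paste0(nms, ".ATTRIBUTES") %in% nms)
}

foo <- new("baa", a = 1:3, b = c(0.5, 1.5), c = new("mysecondclass", x = 1))
layout <- h5_layout(foo, "/foo")
print(layout)
```

A generic reader could walk the same paths in reverse, which is what would make a fallback for classes without their own h5read.<myclass> possible.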


What do you think about this? I guess that is the direction that you have in mind. Any other 
suggestions and comments are welcome. 


Bernd 

On 07.08.2014, at 02:49, Nathaniel Hayden < nhayden at fhcrc.org > wrote: 


When reading from an hdf5 file I would like to automatically call a function I define when datasets of an arbitrary type (see: 'class') are read from an hdf5 file. Since it looks like the existing infrastructure (courtesy of the 'callGeneric' parameter in h5read) in rhdf5 was made for this, I would like to avoid duplicating work. But I can't find an example of the h5read.<classname> functionality indicated in the callGeneric description in the h5read man page. 

A simple example is if the type is integer, I want as.integer to be automatically called on the read-in object before it gets passed back. But I intend to extend this to other Bioconductor classes of arbitrary complexity. 

Based on the documentation, it seems like either using attr(foo, "class") <- "integer" (in conjunction with h5write(<...>, write.attributes=TRUE)) or adding a 'class' attribute through the h5writeAttribute interface should be enough to trigger the h5read.integer function upon calling h5read. Neither seems to work. Note that I can pass read.attributes=TRUE and the attributes get assigned to the object (for example, the object comes back with a "class" attribute), but that's not exactly what I'm after.

In looking at the R/h5read.R source code, it looks like the block where the h5read.<classname> call gets set up (around line 59) queries the "class" attribute of the read-in obj before the h5 object's attributes are actually read, so the 'cl' variable never seems to get set. 

Here's an example where I would expect h5read.<classname> to be invoked, but it isn't:

library(rhdf5) 
h5read.integer <- function(obj) { as.integer(obj) } ## h5read.<classname> 
debug(h5read.integer) 
exists(paste("h5read","integer",sep="."),mode="function") 

h5fl <- tempfile(fileext=".h5") 
h5createFile(h5fl) 
ints <- 42L:33L 
attr(ints, "class") <- "integer" 
h5write(ints, h5fl, "foo", write.attributes=TRUE) 
H5close() 

## h5writeAttribute route 
##fid <- H5Fopen(h5fl) 
##did <- H5Dopen(fid, "foo") 
##h5writeAttribute("integer", did, name="class") 
##H5close() 

##res <- h5read(h5fl, "foo", read.attributes=FALSE) 
res <- h5read(h5fl, "foo", read.attributes=TRUE) 

Running the external h5dump utility confirms that a "class" attribute is attached to the foo DATASET, which seems to match what the h5read man page prescribes. If I edit the source code to set the 'cl' variable to "integer" my h5read.integer function gets invoked, as expected. 

Any help would be much appreciated. Thank you. 


