[Bioc-devel] as.data.frame for GRanges when one meta column is a data frame

Jialin Ma marlin- at gmx.cn
Fri Jul 6 00:16:43 CEST 2018


Dear Hervé,

It seems that the print method breaks not because the data frame
contains a nested data frame, but because the nested data frame has
the "AsIs" class. For example:

> df <- data.frame(x = c(1,2))
> df$d <- data.frame(z = c(3,4))
> df
  x z
1 1 3
2 2 4
> str(df)
'data.frame':	2 obs. of  2 variables:
 $ x: num  1 2
 $ d:'data.frame':	2 obs. of  1 variable:
  ..$ z: num  3 4
> df$d <- I(data.frame(z = c(3,4)))
> df
Error in dim(rvec) <- dim(x) : 
  dims [product 2] do not match the length of object [1]
> str(df)
'data.frame':	2 obs. of  2 variables:
 $ x: num  1 2
 $ d:Classes ‘AsIs’ and 'data.frame':	2 obs. of  1 variable:
  ..$ z: num  3 4
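
If the "AsIs" class is indeed the trigger, one possible workaround is to
strip it from the nested column before printing. The sketch below uses
only base R. `strip_asis` is a hypothetical helper name, not part of any
package, and whether printing then succeeds may depend on the R version:

```r
# Build the problematic object from the example above.
df <- data.frame(x = c(1, 2))
df$d <- I(data.frame(z = c(3, 4)))

# Hypothetical helper: drop "AsIs" from the class vector, keep the rest.
strip_asis <- function(col) {
  class(col) <- setdiff(class(col), "AsIs")
  col
}

# Detect nested data frame columns and remove "AsIs" from each of them.
nested <- vapply(df, is.data.frame, logical(1))
for (nm in names(df)[nested]) {
  df[[nm]] <- strip_asis(df[[nm]])
}

class(df$d)  # now a plain "data.frame", as in the first example
```

The column stays nested; only the class attribute changes, so this does
not flatten anything.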


Also, as far as I know, nested data frames are used in some packages
such as jsonlite:

> df <- data.frame(x = c(1,2))
> df$d <- data.frame(z = c(3,4))
> jsonlite::toJSON(df)
[{"x":1,"d":{"z":3}},{"x":2,"d":{"z":4}}] 

> str(jsonlite::fromJSON(txt =
'[{"x":1,"d":{"z":3}},{"x":2,"d":{"z":4}}]'))
'data.frame':	2 obs. of  2 variables:
 $ x: int  1 2
 $ d:'data.frame':	2 obs. of  1 variable:
  ..$ z: int  3 4

But I agree with you that it may be more consistent to flatten the
nested data frame. I will make changes to my package in order to fix
the errors.
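
For illustration, here is a minimal base-R sketch of what such flattening
could look like. `flatten_df` is a hypothetical helper, not the S4Vectors
implementation, and it assumes every nested data frame has at least one
column. It prefixes nested column names with the parent name (e.g.
`d.z`), whereas the devel as.data.frame() output quoted below keeps the
bare name `x`:

```r
# Hypothetical helper: recursively replace data frame columns by their
# own columns, so the result contains no nested data frames.
flatten_df <- function(df) {
  pieces <- lapply(names(df), function(nm) {
    col <- df[[nm]]
    if (is.data.frame(col)) {
      # Drop "AsIs" (if present) so the column is a plain data frame.
      class(col) <- setdiff(class(col), "AsIs")
      flat <- flatten_df(col)
      # Prefix with the parent column name to avoid name clashes.
      names(flat) <- paste(nm, names(flat), sep = ".")
      flat
    } else {
      df[nm]  # ordinary column, kept as a one-column data frame
    }
  })
  do.call(cbind, pieces)
}

df <- data.frame(x = c(1, 2))
df$d <- data.frame(z = c(3, 4))
flat <- flatten_df(df)
names(flat)
```

With the example above, the result has two ordinary columns named `x`
and `d.z`.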

Many thanks,
Jialin



On Thu, 2018-07-05 at 10:59 -0700, Hervé Pagès wrote:
> Hi Jialin,
> 
> Note that up to BioC 3.7, as.data.frame(gr) in your example
> was returning a broken data.frame:
> 
>    > as.data.frame(gr)
>    Error in dim(rvec) <- dim(x) :
>      dims [product 6] do not match the length of object [1]
> 
> More precisely, the call to as.data.frame(gr) is successful and
> returns a data.frame but that data.frame cannot be displayed:
> 
>    > df2 <- as.data.frame(gr)
>    > df2
>    Error in dim(rvec) <- dim(x) :
>      dims [product 6] do not match the length of object [1]
> 
> The problem is with the print.data.frame() method:
> 
>    > print.data.frame(df2)
>    Error in dim(rvec) <- dim(x) :
>      dims [product 6] do not match the length of object [1]
> 
> Feel free to bring this up to the R devel folks.
> 
> Anyway, since it's not clear whether data.frame objects are actually
> expected to support nesting, it's safer to have as.data.frame()
> getting rid of the nesting.
> 
> Furthermore: as.data.frame() **has** to "un-nest" nested objects
> in the general case, e.g. when the nested objects are S4
> vector-like objects like Hits, GRanges, DataFrame, etc. That's
> because an ordinary data.frame cannot contain these objects. So it
> seems preferable to un-nest everything rather than making an
> exception when the metadata column is a data.frame. In particular,
> this exception would lead to inconsistent behavior if the data.frame
> column is replaced with a DataFrame.
> 
> For the record, here is the commit that refactored as.data.frame()
> to un-nest everything:
> 
> https://github.com/Bioconductor/S4Vectors/commit/d84bc18dea7a232061946fbfe30d2072b88705a7
> 
> With this new approach, as.data.frame() can work on "complicated"
> objects i.e. on objects with an arbitrary number of nesting levels.
> 
> Hope this makes sense.
> 
> Cheers,
> H.
> 
> 
> On 07/04/2018 01:38 PM, Jialin Ma wrote:
> > Dear all,
> > 
> > It seems that the devel branch of Bioconductor has made
> > changes/improvements to the behavior of as.data.frame. When the
> > input is a GRanges with a metadata column that is a data frame,
> > as.data.frame in devel flattens the nested data frame. I made an
> > example below:
> > 
> >   library(GenomicRanges)
> >   gr <- GRanges("chr2", IRanges(1:6, width = 2))
> >   gr$df <- data.frame(x = runif(6))
> >   str(as.data.frame(gr))
> > 
> > which shows:
> > 
> >    'data.frame':	6 obs. of  6 variables:
> >    $ seqnames: Factor w/ 1 level "chr2": 1 1 1 1 1 1
> >    $ start   : int  1 2 3 4 5 6
> >    $ end     : int  2 3 4 5 6 7
> >    $ width   : int  2 2 2 2 2 2
> >    $ strand  : Factor w/ 3 levels "+","-","*": 3 3 3 3 3 3
> >    $ x       : num  0.55 0.058 0.966 0.75 0.764 ...
> > 
> > with session info:
> > 
> >    R version 3.5.0 (2018-04-23)
> >    Platform: x86_64-suse-linux-gnu (64-bit)
> >    Running under: openSUSE Tumbleweed
> > 
> >    attached base packages:
> >    [1] parallel  stats4    stats     graphics  grDevices utils     datasets
> >    [8] methods   base
> >    
> >    other attached packages:
> >    [1] GenomicRanges_1.33.6 GenomeInfoDb_1.17.1  IRanges_2.15.14
> >    [4] S4Vectors_0.19.17    BiocGenerics_0.27.1  magrittr_1.5
> >    
> >    loaded via a namespace (and not attached):
> >    [1] zlibbioc_1.27.0        compiler_3.5.0         XVector_0.21.3
> >    [4] tools_3.5.0            GenomeInfoDbData_1.1.0 RCurl_1.95-4.10
> >    [7] yaml_2.1.19            bitops_1.0-6
> >    
> > 
> > In the old version, the same code gives the following result:
> > 
> >    'data.frame':	6 obs. of  6 variables:
> >    $ seqnames: Factor w/ 1 level "chr2": 1 1 1 1 1 1
> >    $ start   : int  1 2 3 4 5 6
> >    $ end     : int  2 3 4 5 6 7
> >    $ width   : int  2 2 2 2 2 2
> >    $ strand  : Factor w/ 3 levels "+","-","*": 3 3 3 3 3 3
> >    $ df      :Classes ‘AsIs’ and 'data.frame':	6 obs. of  1 variable:
> >      ..$ x: num  0.935 0.577 0.245 0.687 0.194 ...
> > 
> > with session info:
> > 
> >    R version 3.5.0 (2018-04-23)
> >    Platform: x86_64-suse-linux-gnu (64-bit)
> >    Running under: openSUSE Tumbleweed
> >    
> >    attached base packages:
> >    [1] parallel  stats4    stats     graphics  grDevices utils     datasets
> >    [8] methods   base
> >    
> >    other attached packages:
> >    [1] GenomicRanges_1.32.3 GenomeInfoDb_1.17.1  IRanges_2.14.10
> >    [4] S4Vectors_0.18.3     BiocGenerics_0.27.1  magrittr_1.5
> >    
> >    loaded via a namespace (and not attached):
> >    [1] zlibbioc_1.27.0        compiler_3.5.0         BiocInstaller_1.30.0
> >    [4] XVector_0.21.3         tools_3.5.0            GenomeInfoDbData_1.1.0
> >    [7] RCurl_1.95-4.10        yaml_2.1.19            bitops_1.0-6
> >    
> > 
> > I personally feel that automatically flattening the nested data
> > frame may not be the right behavior. I am not sure about it, but I
> > would like to suggest keeping a data frame column as is when using
> > as.data.frame (and also not adding the "AsIs" class, as it causes
> > an error when showing the converted data frame).
> > 
> > Any thoughts?
> > 
> > Best regards,
> > Jialin
> > 
> > 
> > 
> > -------- Forwarded Message --------
> > From: "Shepherd, Lori" <Lori.Shepherd at RoswellPark.org>
> > To: marlin- at gmx.cn <marlin- at gmx.cn>
> > Subject: failing Bioconductor package TnT
> > Date: Tue, 3 Jul 2018 12:25:20 +0000
> > 
> > > Dear TnT maintainer,
> > > 
> > > I'd like to bring to your attention that the TnT package is
> > > failing to pass 'R CMD build' on all platforms in the devel
> > > version of Bioconductor (i.e. BioC 3.8):
> > > 
> > > http://bioconductor.org/checkResults/devel/bioc-LATEST/TnT
> > > 
> > > Would you mind taking a look at this? Don't hesitate to ask on
> > > the bioc-devel at r-project.org mailing list if you have any
> > > questions or need help.
> > > 
> > > 
> > > While devel is a place to experiment with new features, we expect
> > > packages to build and check cleanly in a reasonable time period
> > > and not stay broken for any extended period of time. The package
> > > has been failing since 06/11/18.
> > > If no action is taken over the next few weeks we will begin the
> > > deprecation process for your package.
> > > 
> > > 
> > > Thank you for your time and effort, and your continued
> > > contribution to Bioconductor.
> > > 
> > > Please be advised that Bioconductor has switched from svn to Git.
> > > Some helpful links can be found here:
> > > https://bioconductor.org/developers/how-to/git/
> > > http://bioconductor.org/developers/how-to/git/bug-fix-in-release-and-devel/
> > > 
> > > 
> > > 
> > > Lori Shepherd
> > > Bioconductor Core Team
> > > Roswell Park Cancer Institute
> > > Department of Biostatistics & Bioinformatics
> > > Elm & Carlton Streets
> > > Buffalo, New York 14263
> > > 
> > 