[Bioc-devel] Printing DataFrame with nested data.frame/DataFrame/DataFrameList

Thu Sep 28 10:19:03 CEST 2017

Dear all,

I have a package under review at
https://github.com/Bioconductor/Contributions/issues/487, in which I
would like to use a GRanges with a nested data.frame or DataFrameList to
represent the track data internally.

However, the default show method does not seem to work well with such
structures.

Here is an example of a GRanges in which one metadata column is a
one-column data frame:

    gr <- GRanges("chr21", IRanges(1:5, width = 1))
    gr$df <- data.frame(x = 1:5)
    show(gr)

    GRanges object with 5 ranges and 1 metadata column:
    Error in .Method(..., deparse.level = deparse.level) : 
      number of rows of matrices must match (see arg 3)

However, if the nested data frame has two columns, it can be printed
out correctly:

    gr <- GRanges("chr21", IRanges(1:5, width = 1))
    gr$df <- data.frame(x = 1:5, y = 11:15)
    show(gr)

    GRanges object with 5 ranges and 1 metadata column:
          seqnames    ranges strand |           df
             <Rle> <IRanges>  <Rle> | <data.frame>
      [1]    chr21    [1, 1]      * |         1:11
      [2]    chr21    [2, 2]      * |         2:12
      [3]    chr21    [3, 3]      * |         3:13
      [4]    chr21    [4, 4]      * |         4:14
      [5]    chr21    [5, 5]      * |         5:15
      -------
      seqinfo: 1 sequence from an unspecified genome; no seqlengths

In some cases it prints with a warning message, but the output is
wrong:

    gr <- GRanges("chr21", IRanges(1:5, width = 1), emm = 6:10)
    gr$df <- data.frame(x = 1:5)
    show(gr)

    # The nested df is not printed in the correct format, even though
    # it has only one column.

    GRanges object with 5 ranges and 2 metadata columns:
          seqnames    ranges strand |       emm           df
             <Rle> <IRanges>  <Rle> | <integer> <data.frame>
      [1]    chr21    [1, 1]      * |         6    1,2,3,...
      [2]    chr21    [2, 2]      * |         7    1,2,3,...
      [3]    chr21    [3, 3]      * |         8    1,2,3,...
      [4]    chr21    [4, 4]      * |         9    1,2,3,...
      [5]    chr21    [5, 5]      * |        10    1,2,3,...
      -------
      seqinfo: 1 sequence from an unspecified genome; no seqlengths
    Warning message:
    In (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  :
      row names were found from a short variable and have been discarded

A nested DataFrameList cannot be printed at all:

    DF <- DataFrame(x = 1:2)
    DF$split = split(DataFrame(aa = 1:4), c(1,1,2,2))
    show(DF)

    DataFrame with 2 rows and 2 columns
    Error in dim(object) <- c(nrow(object), prod(tail(dim(object), -1))) :
      invalid first argument

    class(DF$split)

    [1] "CompressedSplitDataFrameList"
    attr(,"package")
    [1] "IRanges"

In the case above, I understand that it is hard to create a short
string representation of the nested structure, but I think printing
the dimensions of each nested element may be sufficient.
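
A rough sketch of what such a dimension summary could look like, using
only base R. It assumes the nested column is a plain list of base data
frames rather than a DataFrameList, and `summarize_dims` is a
hypothetical helper name, not a function in S4Vectors or IRanges:

```r
# Hypothetical helper: build one short "<rows x cols>" label per nested
# element, which a show method could print in place of the element itself.
summarize_dims <- function(x) {
  vapply(
    x,
    function(el) sprintf("<%d x %d>", nrow(el), ncol(el)),
    character(1),
    USE.NAMES = FALSE
  )
}

# Base-R stand-in for the split column: two data frames of two rows each.
nested <- split(data.frame(aa = 1:4), c(1, 1, 2, 2))
labels <- summarize_dims(nested)
print(labels)
# [1] "<2 x 1>" "<2 x 1>"
```

Each row of the parent table would then show one fixed-width label no
matter how large the nested element is.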

Any comments?

Best,
Jialin

-----------
Session Info:

    R version 3.4.1 (2017-06-30)
    Platform: x86_64-suse-linux-gnu (64-bit)
    Running under: openSUSE Tumbleweed

    Matrix products: default
    BLAS: /usr/lib64/R/lib/libRblas.so
    LAPACK: /usr/lib64/R/lib/libRlapack.so

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
     [9] LC_ADDRESS=C               LC_TELEPHONE=C            
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

    attached base packages:
    [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
    [8] methods   base     

    other attached packages:
    [1] Biobase_2.37.2        GenomicRanges_1.29.14 GenomeInfoDb_1.13.4  
    [4] IRanges_2.11.17       S4Vectors_0.15.8      BiocGenerics_0.23.1  
    [7] magrittr_1.5         

    loaded via a namespace (and not attached):
    [1] zlibbioc_1.23.0         compiler_3.4.1          XVector_0.17.1         
    [4] tools_3.4.1             GenomeInfoDbData_0.99.1 RCurl_1.95-4.8         
    [7] ulimit_0.0-3            bitops_1.0-6

Console output after converting the nested column to a plain list of
data frames:

    DF$split <- DF$split %>% as.list %>% lapply(as.data.frame)
    DF

    DataFrame with 2 rows and 2 columns
              x  split
      <integer> <list>
    1         1    1,2
    2         2    3,4