[Bioc-sig-seq] as.data.frame on GRanges object with DNAStringSet in values

Hervé Pagès hpages at fhcrc.org
Thu Jun 16 01:26:40 CEST 2011


On 11-06-15 03:38 PM, Michael Lawrence wrote:
>
>
> 2011/6/15 Hervé Pagès <hpages at fhcrc.org>
>
>     Hi Michael, Janet,
>
>     I just added an "as.vector" method for XStringSet objects to
>     Biostrings 2.21.6:
>
>      > library(Biostrings)
>      > x <- DNAStringSet(c("aaatg", "gt"))
>      > as.vector(x)
>       [1] "AAATG" "GT"
>
>     But that doesn't solve Janet's problem:
>
>      > df <- DataFrame(id=c("ID1", "ID2"), seqs=x)
>      > df
>       DataFrame with 2 rows and 2 columns
>                  id           seqs
>          <character> <DNAStringSet>
>       1         ID1          AAATG
>       2         ID2             GT
>      > as.data.frame(df)
>
>       Error in as.data.frame.default(y, optional = TRUE, ...) :
>         cannot coerce class 'structure("DNAStringSet", package = "Biostrings")' into a data.frame
>
>     Michael?
>
>
> Well, sorry for that. I just added a coercion from Vector to data.frame
> through as.vector, so this works.

Thanks!

> But someone might add a coercion from
> List to data.frame that would treat the elements as columns. Would this
> make sense?

Hard to tell. Maybe sometimes it would make sense, but sometimes it
definitely does not (e.g. DNAStringSet).

> AtomicList to data.frame does something even stranger: it
> creates a two column data frame with the unlisted values and
> names/indices rep'd out as a factor. Actually, that's kind of cool,
> since usually one does not have a list with equal element lengths, but
> it's somewhat unintuitive. But why does it apply only to AtomicList?

Glad you brought this up.
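
For readers without IRanges at hand, base R's stack() produces a
similarly shaped result from a named list: one column of unlisted
values and one factor column with the names repeated once per element.
This is only a rough analogue, not the AtomicList method itself; the
plain list and the names "a" and "b" are assumptions made for
illustration.

```r
# A named base-R list standing in for an AtomicList (illustrative assumption).
x <- list(a = 1:4, b = 0:-2)

# stack() unlists the values and repeats each name once per element,
# returning the names as a factor column called 'ind'.
df <- stack(x)
df

dim(df)            # number of rows = total number of elements, 2 columns
is.factor(df$ind)  # the repeated names are stored as a factor
```

The result has one row per unlisted value rather than one row per list
element, which is what makes this kind of coercion surprising.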

For the record, "as.vector" also unrolls an AtomicList:

   > as.vector(IntegerList(1:4, 0:-2))
   [1]  1  2  3  4  0 -1 -2

IMO, we should not do things like that, because:

   1) The same can be achieved with unlist():

     > unlist(IntegerList(1:4, 0:-2))
     [1]  1  2  3  4  0 -1 -2

   2) It's totally unintuitive to use as.vector for unlisting
      a list (as.vector on a standard list does not do that).

   3) There is a strong expectation that as.vector() will preserve
      the length of its input.

So I propose to deprecate those "as.vector" and "as.data.frame"
methods for AtomicList objects.
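
Points 2) and 3) above can be checked on an ordinary list in base R.
This is just a minimal sketch: a plain list is used in place of the
IntegerList so that no Bioconductor package is needed, and the variable
names are made up for illustration.

```r
# A plain base-R list standing in for IntegerList(1:4, 0:-2);
# this substitution is an assumption made for illustration only.
x <- list(1:4, 0:-2)

# as.vector() on a standard list leaves it a list of the same length.
v <- as.vector(x)
length(v) == length(x)
is.list(v)

# unlist() is the explicit way to flatten, and it changes the length.
u <- unlist(x)
u
length(u)
```

Here as.vector() returns a list of length 2, while unlist() returns an
integer vector of length 7.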

H.


> Anyway, given the special correspondence between an XStringSet and a
> character vector, we could always add an as.data.frame method for
> XStringSet, just to make sure stuff behaves as expected.
>
>     Thanks,
>     H.
>
>
>      > sessionInfo()
>     R version 2.14.0 Under development (unstable) (2011-05-30 r56024)
>     Platform: x86_64-unknown-linux-gnu (64-bit)
>
>     locale:
>       [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>       [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>       [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
>       [7] LC_PAPER=C                 LC_NAME=C
>       [9] LC_ADDRESS=C               LC_TELEPHONE=C
>     [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>
>
>     attached base packages:
>     [1] stats     graphics  grDevices utils     datasets  methods   base
>
>     other attached packages:
>     [1] Biostrings_2.21.6 IRanges_1.11.10
>
>
>
>     On 11-06-15 12:49 PM, Janet Young wrote:
>
>         yes - as.character seems a good choice, I think
>
>         thanks,
>
>         Janet
>
>         On Jun 15, 2011, at 12:46 PM, Michael Lawrence wrote:
>
>             So you would expect that the DNAStringSet is converted to a
>             character vector? DNAStringSet (technically XStringSet) then
>             just needs an as.vector method that delegates to as.character.
>
>             Michael
>
>
>             On Wed, Jun 15, 2011 at 12:37 PM, Janet Young <jayoung at fhcrc.org> wrote:
>
>             Hi there,
>
>             I'm trying to use as.data.frame on a GRanges object. On
>             regular GRanges objects it works fine but I have some
>             objects that contain a DNAStringSet in the values column,
>             which isn't built in to the as.data.frame method.  Is it
>             possible to add the ability to coerce the DNAStringSet too,
>             please?
>
>             Here's some code that demonstrates the issue:
>
>             ################
>             library(GenomicRanges)
>             library(Biostrings)
>
>             gr1 <- GRanges(seqnames=rep("chr1", 3),
>                            ranges=IRanges(start=c(1, 101, 201), width=50),
>                            strand=c("+", "-", "+"),
>                            genenames=c("seq1", "seq2", "seq3"))
>
>             as.data.frame(gr1)
>             # works
>
>             gr2<- gr1
>             values(gr2)[, "myseqs"] <- DNAStringSet(c("AACGTG", "ACGGTGGTGTT", "GAGGCTG"))
>
>             as.data.frame(gr2)
>             # Error in as.data.frame.default(y, optional = TRUE, ...) :
>             #   cannot coerce class 'structure("DNAStringSet", package = "Biostrings")' into a data.frame
>             ################
>
>             and here's   sessionInfo() output:
>
>             R version 2.13.0 (2011-04-13)
>             Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
>             locale:
>             [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
>             attached base packages:
>             [1] stats     graphics  grDevices utils     datasets  methods   base
>
>             other attached packages:
>             [1] Biostrings_2.20.1   GenomicRanges_1.4.6 IRanges_1.10.4
>
>             ################
>
>
>             You might wonder why I'm storing sequences in the GRanges
>             values - in my real data they're sequencing reads that have
>             mapped back to that region, but I'm still curious to
>             maintain the sequence itself (for the moment) because it's
>             not always identical to the underlying genomic sequence of
>             that region (investigating mapping issues).
>
>             (and my desire to use as.data.frame relates to a suggestion
>             from Herve to let me work around some issues with the
>             identical function)
>
>             thanks,
>
>             Janet
>
>             _______________________________________________
>             Bioc-sig-sequencing mailing list
>             Bioc-sig-sequencing at r-project.org
>             https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org
>     Phone:  (206) 667-5791
>     Fax:    (206) 667-1319
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


