[BioC] [Bioc-sig-seq] as.data.frame on GRanges object with DNAStringSet in values
Hervé Pagès
hpages at fhcrc.org
Fri Oct 7 19:21:14 CEST 2011
Hi Michael,
On 11-09-29 02:17 PM, Michael Lawrence wrote:
> I saw that all coercions to atomic vectors from AtomicList are now
> deprecated. You had proposed deprecating as.vector(), because it should
> not unlist, and I agreed. Really as.vector() should return an ordinary R
> list. However, as.character(), as.numeric(), etc, in base R will unlist.
They don't seem to do that:
> as.integer(list(a=1:3, b=4:-2))
Error: (list) object cannot be coerced to type 'integer'
> as.character(list(a=1:3, b=4:-2))
[1] "1:3" "c(4, 3, 2, 1, 0, -1, -2)"
So they either refuse to do the coercion or they do it in a strange
way. Note that in the latter case they honor the strong expectation
that the output of the as.<atomic_type> coercion functions must have
the same length as the input (with positions of the elements being
preserved). unlist() would not honor this.
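A small base-R sketch of that length argument (illustrative only; the variable names are arbitrary and nothing here depends on IRanges). It compares the length of what as.character() returns with what unlist() returns on the same list, and traps the as.integer() error rather than letting it stop the script:

```r
# Same two-element list as in the examples above.
x <- list(a = 1:3, b = 4:-2)

# as.character() keeps one output element per list element.
chr <- as.character(x)
length(chr) == length(x)

# unlist() flattens, so its length is the total number of values.
flat <- unlist(x)
length(flat)

# as.integer() refuses this list; capture the error as NULL.
int_result <- tryCatch(as.integer(x), error = function(e) NULL)
is.null(int_result)
```

Here length(chr) matches length(x), while length(flat) is the sum of the element lengths, which is the behavior a length-preserving coercion should not have.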
H.
> I'd like to keep consistency with base R. Do we really need to deprecate
> those, as well?
>
> Michael
>
> 2011/6/15 Michael Lawrence <michafla at gene.com>
>
>
>
> 2011/6/15 Hervé Pagès <hpages at fhcrc.org>
>
> On 11-06-15 03:38 PM, Michael Lawrence wrote:
>
>
>
> 2011/6/15 Hervé Pagès <hpages at fhcrc.org>
>
>
> Hi Michael, Janet,
>
> I just added an "as.vector" method for XStringSet objects to
> Biostrings 2.21.6:
>
> > library(Biostrings)
> > x <- DNAStringSet(c("aaatg", "gt"))
> > as.vector(x)
> [1] "AAATG" "GT"
>
> But that doesn't solve Janet's problem:
>
> > df <- DataFrame(id=c("ID1", "ID2"), seqs=x)
> > df
> DataFrame with 2 rows and 2 columns
>            id           seqs
>   <character> <DNAStringSet>
> 1         ID1          AAATG
> 2         ID2             GT
> > as.data.frame(df)
>
> Error in as.data.frame.default(y, optional = TRUE, ...) :
>   cannot coerce class 'structure("DNAStringSet", package = "Biostrings")' into a data.frame
>
> Michael?
>
>
> Well, sorry for that. I just added a coercion from Vector to data.frame
> through as.vector, so this works.
>
>
> Thanks!
>
>
> But someone might add a coercion from List to data.frame that would
> treat the elements as columns. Would this make sense?
>
>
> Hard to tell. Maybe sometimes it would make sense, but sometimes it
> definitely does not (e.g. DNAStringSet).
>
>
> AtomicList to data.frame does something even stranger: it creates a
> two column data frame with the unlisted values and names/indices
> rep'd out as a factor. Actually, that's kind of cool, since usually
> one does not have a list with equal element lengths, but it's
> somewhat unintuitive. But why does it apply only to AtomicList?
>
>
> Glad you brought this up.
>
> For the record, "as.vector" also unrolls an AtomicList:
>
> > as.vector(IntegerList(1:4, 0:-2))
> [1] 1 2 3 4 0 -1 -2
>
> IMO, we should not do things like that. Because:
>
> 1) The same can be achieved with unlist():
>
> > unlist(IntegerList(1:4, 0:-2))
> [1] 1 2 3 4 0 -1 -2
>
> 2) It's totally unintuitive to use as.vector for unlisting
> a list (as.vector on a standard list does not do that).
>
> 3) There is a strong expectation that as.vector() will preserve
> the length of its input.
>
> So I propose to deprecate those "as.vector" and "as.data.frame"
> methods for AtomicList objects.
>
>
> Sounds good to me. In fact, the stack method on List is almost
> identical to as.data.frame on AtomicList (and the stack method
> actually makes sense). You could make as.vector return an ordinary
> list, since list is a vector.
>
> H.
>
>
> Anyway, given the special correspondence between an XStringSet and a
> character vector, we could always add an as.data.frame method for
> XStringSet, just to make sure stuff behaves as expected.
>
> Thanks,
> H.
>
>
> > sessionInfo()
> R version 2.14.0 Under development (unstable) (2011-05-30 r56024)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
> [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Biostrings_2.21.6 IRanges_1.11.10
>
>
>
> On 11-06-15 12:49 PM, Janet Young wrote:
>
> yes - as.character seems a good choice, I think
>
> thanks,
>
> Janet
>
> On Jun 15, 2011, at 12:46 PM, Michael Lawrence wrote:
>
> So you would expect that the DNAStringSet is converted to a
> character vector? DNAStringSet (technically XStringSet) then
> just needs an as.vector method that delegates to as.character.
>
> Michael
>
>
> On Wed, Jun 15, 2011 at 12:37 PM, Janet Young <jayoung at fhcrc.org> wrote:
>
> Hi there,
>
> I'm trying to use as.data.frame on a GRanges object. On regular
> GRanges objects it works fine, but I have some objects that contain
> a DNAStringSet in the values column, which the as.data.frame method
> doesn't handle. Is it possible to add the ability to coerce the
> DNAStringSet too, please?
>
> Here's some code that demonstrates the issue:
>
> ################
> library(GenomicRanges)
> library(Biostrings)
>
> gr1 <- GRanges(seqnames=rep("chr1",3),
>                ranges=IRanges(start=c(1,101,201), width=50),
>                strand=c("+","-","+"),
>                genenames=c("seq1","seq2","seq3"))
>
> as.data.frame(gr1)
> # works
>
> gr2 <- gr1
> values(gr2)[,"myseqs"] <- DNAStringSet(c("AACGTG", "ACGGTGGTGTT", "GAGGCTG"))
>
> as.data.frame(gr2)
> # Error in as.data.frame.default(y, optional = TRUE, ...) :
> #   cannot coerce class 'structure("DNAStringSet", package = "Biostrings")' into a data.frame
> ################
>
> and here's sessionInfo() output:
>
> R version 2.13.0 (2011-04-13)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Biostrings_2.20.1 GenomicRanges_1.4.6 IRanges_1.10.4
>
> ################
>
>
> You might wonder why I'm storing sequences in the GRanges values:
> in my real data they're sequencing reads that have mapped back to
> that region, but I still want to keep the sequence itself (for the
> moment) because it's not always identical to the underlying genomic
> sequence of that region (I'm investigating mapping issues).
>
> (My desire to use as.data.frame relates to a suggestion from Herve
> that lets me work around some issues with the identical function.)
>
> thanks,
>
> Janet
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
>
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
>
>
>
>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319