[BioC] Easy way to convert CharacterList to character, collapsing each element?
Hervé Pagès
hpages at fhcrc.org
Tue Dec 17 07:35:41 CET 2013
Hi Michael,
On 12/16/2013 05:15 PM, Michael Lawrence wrote:
> There is a function in rtracklayer called pasteCollapse. It is hidden
> behind the namespace but it does exactly what you want. Just use ":::".
> Implemented in C for speed, and arguably simpler than the R one
> suggested in this thread. It just yields a character vector, not a
> Biostrings container, so maybe it could be pushed down into IRanges?
Or we could make strunsplit() a generic function and have 2
methods:
- One for CharacterList objects that returns a character vector.
Would be in IRanges and would use the pasteCollapse C code (after
we move it to IRanges).
- One for XStringSetList objects that returns an XStringSet object.
Would be in Biostrings. With the implementation I gave earlier
(based on the unlist/relist trick) it's almost as fast as
pasteCollapse but it would be easy to implement it in C to make
it even faster.
The mapping between the input and output types of strunsplit() is the
same as with unlist() or [[.
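
To make the dispatch idea concrete, here is a minimal sketch in base R only. It is not the IRanges or Biostrings implementation: an ordinary list of character vectors stands in for a CharacterList (an assumption, so that nothing beyond the methods package is needed), and vapply()/paste() stand in for the pasteCollapse C code. An XStringSetList method would be registered on the same generic in the same way.

```r
library(methods)

## Generic with the name and arguments proposed above.
setGeneric("strunsplit", function(x, sep=",") standardGeneric("strunsplit"))

## Stand-in for the CharacterList method: a plain list of character
## vectors goes in, a plain character vector of the same length comes out.
setMethod("strunsplit", "list", function(x, sep=",") {
    stopifnot(is.character(sep), length(sep) == 1L, !is.na(sep))
    vapply(x, function(elt) paste(as.character(elt), collapse=sep),
           character(1))
})

x <- list(A=c("id35", "id2", "id18"), B=character(0), C="id4",
          D=c("id2", "id4"))
res <- strunsplit(x)
```

Empty elements collapse to the empty string, so the result always has the same length and names as the input.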
H.
>
> Michael
>
>
> On Mon, Dec 16, 2013 at 4:21 PM, Ryan C. Thompson <rct at thompsonclan.org> wrote:
>
> Thanks! I look forward to seeing this in the next release.
>
>
>
> On 12/16/2013 04:16 PM, Hervé Pagès wrote:
>
> Hi Ryan,
>
> Here is one way to do this using Biostrings:
>
> library(Biostrings)
>
> strunsplit <- function(x, sep=",")
> {
>     if (!is(x, "XStringSetList"))
>         x <- Biostrings:::XStringSetList("B", x)
>     if (!isSingleString(sep))
>         stop("'sep' must be a single character string")
>
>     ## unlist twice.
>     unlisted_x <- unlist(x, use.names=FALSE)
>     unlisted_ans0 <- unlist(unlisted_x, use.names=FALSE)
>
>     ## insert 'sep'.
>     unlisted_x_width <- width(unlisted_x)
>     x_partitioning <- PartitioningByEnd(x)
>     at <- cumsum(unlisted_x_width)[-end(x_partitioning)] + 1L
>     unlisted_ans <- replaceAt(unlisted_ans0, at, value=sep)
>
>     ## relist.
>     ans_width <- sum(relist(unlisted_x_width, x_partitioning))
>     x_eltlens <- width(x_partitioning)
>     idx <- which(x_eltlens >= 2L)
>     ans_width[idx] <- ans_width[idx] + (x_eltlens[idx] - 1L) * nchar(sep)
>     relist(unlisted_ans, PartitioningByWidth(ans_width))
> }
>
> Then:
>
>   > x <- CharacterList(A=c("id35", "id2", "id18"), B=NULL,
>                        C="id4", D=c("id2", "id4"))
>   > strunsplit(x)
>     A BStringSet instance of length 4
>       width seq                      names
>   [1]    13 id35,id2,id18            A
>   [2]     0                          B
>   [3]     3 id4                      C
>   [4]     7 id2,id4                  D
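
The widths in this output follow from the arithmetic in the "relist" step: each element's width is the total number of characters in its strings, plus one separator for every string after the first. A small base R check of that arithmetic, with a plain list standing in for the CharacterList (an assumption, so the snippet runs without Biostrings):

```r
x <- list(A=c("id35", "id2", "id18"), B=character(0), C="id4",
          D=c("id2", "id4"))
sep <- ","

## Total characters per list element, before any separator is inserted.
elt_nchar <- vapply(x, function(elt) sum(nchar(elt)), integer(1))

## An element with n >= 2 strings gains (n - 1) separators; empty and
## length-1 elements gain none.
x_eltlens <- lengths(x)
ans_width <- elt_nchar + pmax(x_eltlens - 1L, 0L) * nchar(sep)

## Cross-check against a direct paste()-based collapse.
collapsed <- vapply(x, paste, character(1), collapse=sep)
stopifnot(identical(unname(ans_width), unname(nchar(collapsed))))
```

For this input, ans_width works out to 13, 0, 3 and 7, the same values as the width column above.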
>
> I'll add this to Biostrings.
>
> Cheers,
> H.
>
>
> On 12/16/2013 03:04 PM, Ryan C. Thompson wrote:
>
> Hi all,
>
> I have some annotation data in a DataFrame, and of course since
> annotations are not one-to-one, some of the columns are CharacterList or
> similar classes. I would like to know if there is an efficient way to
> collapse a CharacterList to a character vector of the same length, such
> that for elements of length > 1, those elements are collapsed with a
> given separator. The following is what I came up with, but it is very
> slow for large CharacterLists:
>
> library(stringr)
> library(plyr)
> flatten.CharacterList <- function(x, sep=",") {
>     if (is.list(x)) {
>         x[!is.na(x)] <- laply(x[!is.na(x)], str_c, collapse=sep,
>                               .parallel=TRUE)
>         x <- as(x, "character")
>     }
>     x
> }
>
> -Ryan
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319