[BioC] Easy way to convert CharacterList to character, collapsing each element?

Tue Dec 17 07:35:41 CET 2013

Hi Michael,

On 12/16/2013 05:15 PM, Michael Lawrence wrote:
> There is a function in rtracklayer called pasteCollapse. It is hidden
> behind the namespace but it does exactly what you want. Just use ":::".
> Implemented in C for speed, and arguably simpler than the R one
> suggested in this thread. It just yields a character vector, not a
> Biostrings container, so maybe it could be pushed down into IRanges?

Or we could make strunsplit() a generic function with 2 methods:

   - One for CharacterList objects that returns a character vector.
     Would be in IRanges and would use the pasteCollapse C code (after
     we move it to IRanges).

   - One for XStringSetList objects that returns an XStringSet object.
     Would be in Biostrings. With the implementation I gave earlier
     (based on the unlist/relist trick) it's almost as fast as
     pasteCollapse but it would be easy to implement it in C to make
     it even faster.

The mapping between the input and output types of strunsplit() would
then be the same as with unlist() or [[: a CharacterList gives a
character vector, and an XStringSetList gives an XStringSet.
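
To illustrate the dispatch idea, here is a minimal sketch that uses only
the methods package. It is not the proposed implementation: the signature
of the generic is an assumption, and a plain list of character vectors
stands in for CharacterList so that the example runs without IRanges or
the pasteCollapse C code.

```r
library(methods)

## Hypothetical generic. The name strunsplit() comes from this thread;
## the (x, sep) signature is an assumption.
setGeneric("strunsplit",
    function(x, sep=",") standardGeneric("strunsplit")
)

## Stand-in method for plain lists of character vectors. A real
## CharacterList method would live in IRanges and call C code instead.
setMethod("strunsplit", "list",
    function(x, sep=",")
    {
        stopifnot(is.character(sep), length(sep) == 1L, !is.na(sep))
        ## Collapse each list element into a single string.
        ## Empty or NULL elements become "".
        vapply(x,
               function(elt) paste(as.character(elt), collapse=sep),
               character(1))
    }
)

x <- list(A=c("id35", "id2", "id18"), B=NULL, C="id4", D=c("id2", "id4"))
res <- strunsplit(x)
## 'res' is a named character vector with the same length as 'x'.
```

An XStringSetList method would be registered the same way with
setMethod(), and would return an XStringSet instead of a character
vector.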

H.

>
> Michael
>
>
> On Mon, Dec 16, 2013 at 4:21 PM, Ryan C. Thompson <rct at thompsonclan.org> wrote:
>
>     Thanks! I look forward to seeing this in the next release.
>
>
>
>     On 12/16/2013 04:16 PM, Hervé Pagès wrote:
>
>         Hi Ryan,
>
>         Here is one way to do this using Biostrings:
>
>            library(Biostrings)
>
>            strunsplit <- function(x, sep=",")
>            {
>              if (!is(x, "XStringSetList"))
>                  x <- Biostrings:::XStringSetList("B", x)
>              if (!isSingleString(sep))
>                  stop("'sep' must be a single character string")
>
>              ## unlist twice.
>              unlisted_x <- unlist(x, use.names=FALSE)
>              unlisted_ans0 <- unlist(unlisted_x, use.names=FALSE)
>
>              ## insert 'sep'.
>              unlisted_x_width <- width(unlisted_x)
>              x_partitioning <- PartitioningByEnd(x)
>              at <- cumsum(unlisted_x_width)[-end(x_partitioning)] + 1L
>              unlisted_ans <- replaceAt(unlisted_ans0, at, value=sep)
>
>              ## relist.
>              ans_width <- sum(relist(unlisted_x_width, x_partitioning))
>              x_eltlens <- width(x_partitioning)
>              idx <- which(x_eltlens >= 2L)
>              ans_width[idx] <- ans_width[idx] +
>                                (x_eltlens[idx] - 1L) * nchar(sep)
>              relist(unlisted_ans, PartitioningByWidth(ans_width))
>            }
>
>         Then:
>
>            > x <- CharacterList(A=c("id35", "id2", "id18"), B=NULL,
>            +                    C="id4", D=c("id2", "id4"))
>            > strunsplit(x)
>              A BStringSet instance of length 4
>                width seq                                               names
>            [1]    13 id35,id2,id18                                     A
>            [2]     0                                                   B
>            [3]     3 id4                                               C
>            [4]     7 id2,id4                                           D
>
>         I'll add this to Biostrings.
>
>         Cheers,
>         H.
>
>
>         On 12/16/2013 03:04 PM, Ryan C. Thompson wrote:
>
>             Hi all,
>
>             I have some annotation data in a DataFrame, and of course
>             since annotations are not one-to-one, some of the columns
>             are CharacterList or similar classes. I would like to know
>             if there is an efficient way to collapse a CharacterList to
>             a character vector of the same length, such that for
>             elements of length > 1, those elements are collapsed with a
>             given separator. The following is what I came up with, but
>             it is very slow for large CharacterLists:
>
>             library(stringr)
>             library(plyr)
>             flatten.CharacterList <- function(x, sep=",") {
>                 if (is.list(x)) {
>                   x[!is.na(x)] <- laply(x[!is.na(x)], str_c, collapse=sep,
>                                         .parallel=TRUE)
>                   x <- as(x, "character")
>                 }
>                 x
>             }
>
>             -Ryan
>
>             _______________________________________________
>             Bioconductor mailing list
>             Bioconductor at r-project.org
>             https://stat.ethz.ch/mailman/listinfo/bioconductor
>             Search the archives:
>             http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319