[R] Compressing String in R

jim holtman jholtman at gmail.com
Wed Dec 24 17:43:24 CET 2008


Since your strings use only 4 distinct characters, you can create a
table of all 256 (4^4) combinations of 4 of them, so that each group of
4 characters reduces to one byte instead of 4.  This is fine if you just
want to store them.

> x <- expand.grid(c("A","C","G","T"),
+     c("A", "C", "G", "T"),
+     c("A", "C", "G", "T"),
+     c("A", "C", "G", "T"))
> gene.table <- apply(x, 1, paste, collapse='')
> # convert the string (its length here is a multiple of 4; more logic is needed if it is not)
> gene <- "ACGATACGGCGACCACCGAGATCTACACTCTTCCCC"
> # break into 4 character strings
> start <- seq(1, by=4, to=nchar(gene))
> strings <- mapply(substr, gene, start, start+3)
> # create new compressed string
> comp <- as.raw(match(strings, gene.table) - 1)
> # convert back
> paste(gene.table[as.integer(comp) + 1], collapse='')
[1] "ACGATACGGCGACCACCGAGATCTACACTCTTCCCC"
>
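
For strings whose length is not a multiple of 4 (the 34-character example
in the original question, for instance), one possible approach is to pad
the string and remember its true length.  The following is only a sketch
under stated assumptions: pack_gene() and unpack_gene() are hypothetical
helper names, not existing R functions, the padding character "A" is an
arbitrary choice, and the input is assumed to be a single non-empty string
containing only upper-case A, C, G and T.  strrep() requires R >= 3.3.0.

```r
# Lookup table of all 256 four-base combinations (same construction as above).
bases <- c("A", "C", "G", "T")
gene.table <- apply(expand.grid(bases, bases, bases, bases), 1,
                    paste, collapse = "")

# Hypothetical helper: pack a DNA string into raw bytes, 4 bases per byte.
# Assumes a single non-empty string made up only of A, C, G and T.
pack_gene <- function(gene) {
  n <- nchar(gene)
  pad <- (4 - n %% 4) %% 4                  # bases needed to reach a multiple of 4
  padded <- paste0(gene, strrep("A", pad))  # "A" is an arbitrary filler
  start <- seq(1, nchar(padded), by = 4)
  chunks <- substring(padded, start, start + 3)
  idx <- match(chunks, gene.table)
  stopifnot(!anyNA(idx))                    # fail on unexpected characters
  # Keep the true length so the padding can be dropped again later.
  list(n = n, bytes = as.raw(idx - 1))
}

# Hypothetical helper: reverse of pack_gene().
unpack_gene <- function(packed) {
  full <- paste(gene.table[as.integer(packed$bytes) + 1], collapse = "")
  substr(full, 1, packed$n)                 # strip the padding
}

gene <- "ACGATACGGCGACCACCGAGATCTACACTCTTCC"   # 34 bases
packed <- pack_gene(gene)
length(packed$bytes)                    # 9 bytes for 34 bases
identical(unpack_gene(packed), gene)    # TRUE
```

The true length has to be stored alongside the bytes; otherwise the
padding cannot be told apart from real trailing "A" bases.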


On Wed, Dec 24, 2008 at 10:26 AM, Gundala Viswanath <gundalav at gmail.com> wrote:
> Dear all,
>
> What's the R way to compress a string into a shorter one, 2~3 chars/digits in length?
> In particular, I want to compress strings of length >= 30 characters,
> e.g. ACGATACGGCGACCACCGAGATCTACACTCTTCC
>
> The reason I want to do that is that there are billions
> of such strings I want to print out, and I need to save disk space.
>
> - Gundala Viswanath
> Jakarta - Indonesia
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?


