[R] sample and rearrange

Thu May 20 00:24:21 CEST 2010

On May 19, 2010, at 5:01 PM, Wu Gong wrote:

>
> It took me a day to make sense of Jim's code :(
>
> Hope my comments will help.
>
> ## Transform data to matrix
> x <- as.matrix(x)
>
> ## Apply function to each row
> ## Create a function to rearrange bases
> result <- apply(x, 1, function(eachrow){
>
> ## Split each gene to bases
> ## Exclude the first column, which is the id
> 	bases <- strsplit(eachrow[-1], '')
> 	
> ## Transform list to matrix
> ## Because the result of function strsplit is a list
> 	bases <- do.call(rbind,bases)
> 	
> ## Recombine bases by connecting all bases in each column
> 	recombine <- apply(bases, 2, paste, collapse="")
> 	
> ## Add id
> ## Transpose recombine
> 	cbind(eachrow[1], t(recombine))
> })
>
> ## Transpose the result matrix	
> result <- t(result)

It will come more quickly as you learn more. I also looked at Jim's
solution by pulling it apart, although I did not spend a whole day at
it, maybe ten minutes. I thought a three-line version was more
informative, because it did not make everything scroll off the console:

 > x <- read.table(textConnection("SampleID        A1      A2      A3      A4
+  GM920222        GATTGCC GATTGCC GATAGAC GATAGAC
+  GM930040        GTCATCA GAGTGCA ACTATAA GATTGCC
+  GM930040        GTCATCA GAGTGCA ACTATAA GATTGCC"), header=TRUE, as.is=TRUE)
 > x <- as.matrix(x)
 > t(apply(x, 1, function(.row){
+      # separate characters
+      z <- do.call(rbind, strsplit(.row[-1], ''))
+      # combine each column
+      z.col <- t(apply(z, 2, paste, collapse=''))
+      # add the ID
+      cbind(.row[1], z.col)
+  }))
      [,1]       [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]
[1,] "GM920222" "GGGG" "AAAA" "TTTT" "TTAA" "GGGG" "CCAA" "CCCC"
[2,] "GM930040" "GGAG" "TACA" "CGTT" "ATAT" "TGTG" "CCAC" "AAAC"
[3,] "GM930040" "GGAG" "TACA" "CGTT" "ATAT" "TGTG" "CCAC" "AAAC"

# I usually see if I can get the inner-most function to work:

 > z <- do.call(rbind, strsplit(x[1,], ''))
Warning message:
In function (..., deparse.level = 1)  :
   number of columns of result is not a multiple of vector length (arg 2)
 > z
          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
SampleID "G"  "M"  "9"  "2"  "0"  "2"  "2"  "2"
A1       "G"  "A"  "T"  "T"  "G"  "C"  "C"  "G"
A2       "G"  "A"  "T"  "T"  "G"  "C"  "C"  "G"
A3       "G"  "A"  "T"  "A"  "G"  "A"  "C"  "G"
A4       "G"  "A"  "T"  "A"  "G"  "A"  "C"  "G"

# So I guess I didn't get an exact replica, since Jim had excluded the
# first element in the row:

 > z <- do.call(rbind, strsplit(x[1,-1], ''))  # there ... cleaner
 > z
    [,1] [,2] [,3] [,4] [,5] [,6] [,7]
A1 "G"  "A"  "T"  "T"  "G"  "C"  "C"
A2 "G"  "A"  "T"  "T"  "G"  "C"  "C"
A3 "G"  "A"  "T"  "A"  "G"  "A"  "C"
A4 "G"  "A"  "T"  "A"  "G"  "A"  "C"
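
The warning in the first attempt appears because the ID has eight
characters while each sequence has seven, so rbind() recycles the
shorter vectors to fill an eighth column. A small sketch of that, with
the first row of the example typed in by hand (row1 and pieces are
names made up for this illustration, not part of Jim's code):

```r
# First row of the example data, entered by hand for illustration
row1 <- c(SampleID = "GM920222", A1 = "GATTGCC", A2 = "GATTGCC",
          A3 = "GATAGAC", A4 = "GATAGAC")

# The ID is one character longer than the sequences
nchar(row1)

# Split every element into single characters (a list of five vectors)
pieces <- strsplit(row1, "")

# rbind() pads the shorter vectors by recycling and warns about it;
# the warning is suppressed here only to keep the output quiet
z <- suppressWarnings(do.call(rbind, pieces))

dim(z)        # five rows, eight columns
z["A1", 8]    # the recycled first base of A1
```

Dropping the ID before splitting, as in x[1,-1], leaves vectors of equal
length, so nothing is recycled and the warning goes away.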

That seemed to help understand what was going on in the middle of the  
functions. Now I wondered if the transpose could be avoided. So I  
tried cbind instead of rbind:

 > z <- do.call(cbind, strsplit(x[1,-1], ''))
 > z
      A1  A2  A3  A4
[1,] "G" "G" "G" "G"
[2,] "A" "A" "A" "A"
[3,] "T" "T" "T" "T"
[4,] "T" "T" "A" "A"
[5,] "G" "G" "G" "G"
[6,] "C" "C" "A" "A"
[7,] "C" "C" "C" "C"
 > z.col <- apply(z, 2, paste, collapse='')
 > z.col
        A1        A2        A3        A4
"GATTGCC" "GATTGCC" "GATAGAC" "GATAGAC"

## Nope, that does not work: it just reassembles the original sequences.
## So try apply over the rows (MARGIN = 1) instead ...
 > z.col <- apply(z, 1, paste, collapse='')
 > z.col
[1] "GGGG" "AAAA" "TTTT" "TTAA" "GGGG" "CCAA" "CCCC"

## OK that worked. Now see if it works inside the whole sequence:

 > x <- as.matrix(x)
 > t(apply(x, 1, function(.row){
+      # separate characters
+      z <- do.call(cbind, strsplit(.row[-1], ''))
+      # combine each column
+      z.col <- apply(z, 1, paste, collapse='')
+      # add the ID
+      cbind(.row[1], z.col)
+  }))
      [,1]       [,2]       [,3]       [,4]       [,5]       [,6]       [,7]
[1,] "GM920222" "GM920222" "GM920222" "GM920222" "GM920222" "GM920222" "GM920222"
[2,] "GM930040" "GM930040" "GM930040" "GM930040" "GM930040" "GM930040" "GM930040"
[3,] "GM930040" "GM930040" "GM930040" "GM930040" "GM930040" "GM930040" "GM930040"
      [,8]   [,9]   [,10]  [,11]  [,12]  [,13]  [,14]
[1,] "GGGG" "AAAA" "TTTT" "TTAA" "GGGG" "CCAA" "CCCC"
[2,] "GGAG" "TACA" "CGTT" "ATAT" "TGTG" "CCAC" "AAAC"
[3,] "GGAG" "TACA" "CGTT" "ATAT" "TGTG" "CCAC" "AAAC"

Well, not exactly. Because z.col is now a plain vector, cbind() recycled
the ID down a 7 x 2 matrix, and apply() flattened that column by column.
So transpose z.col before binding it to the ID:

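
The fourteen columns can be reproduced outside apply(). This is only a
sketch with the first row's values typed in by hand; id, tall and wide
are names invented for the illustration:

```r
# Values from the first sample, entered by hand for illustration
id    <- "GM920222"
z.col <- c("GGGG", "AAAA", "TTTT", "TTAA", "GGGG", "CCAA", "CCCC")

# A plain vector is treated as a column, so the ID is recycled 7 times
tall <- cbind(id, z.col)
dim(tall)

# t() turns the vector into a 1 x 7 matrix, so cbind() adds one ID cell
wide <- cbind(id, t(z.col))
dim(wide)

# apply() flattens each returned matrix column by column, which is why
# the 7 x 2 version shows seven copies of the ID before the sequences
as.vector(tall)
```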
 > x <- as.matrix(x)
 > t(apply(x, 1, function(.row){
+      # separate characters
+      z <- do.call(cbind, strsplit(.row[-1], ''))
+      # combine each column
+      z.col <- apply(z, 1, paste, collapse='')
+      # add the ID, transposing z.col into a single row first
+      cbind(.row[1], t(z.col))
+  }))
      [,1]       [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]
[1,] "GM920222" "GGGG" "AAAA" "TTTT" "TTAA" "GGGG" "CCAA" "CCCC"
[2,] "GM930040" "GGAG" "TACA" "CGTT" "ATAT" "TGTG" "CCAC" "AAAC"
[3,] "GM930040" "GGAG" "TACA" "CGTT" "ATAT" "TGTG" "CCAC" "AAAC"

So I got to the same place but didn't really achieve any savings: the
inner t() just moved from the apply() call to the cbind() call.
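
For what it is worth, here is one way the inner matrix and both inner
transposes might be skipped altogether. This is a sketch, not something
from Jim's code: it assumes a version of R that has paste0(), it builds
the example data by hand rather than with read.table(), and
rearrange_row is a made-up helper name.

```r
# Example data entered by hand; same values as the read.table() call
x <- as.matrix(data.frame(
  SampleID = c("GM920222", "GM930040", "GM930040"),
  A1 = c("GATTGCC", "GTCATCA", "GTCATCA"),
  A2 = c("GATTGCC", "GAGTGCA", "GAGTGCA"),
  A3 = c("GATAGAC", "ACTATAA", "ACTATAA"),
  A4 = c("GATAGAC", "GATTGCC", "GATTGCC"),
  stringsAsFactors = FALSE
))

# Hypothetical helper: rearrange the bases of one row
rearrange_row <- function(.row) {
  # split each sequence into bases; drop names so they are not
  # mistaken for named arguments of paste0()
  bases <- unname(strsplit(.row[-1], ""))
  # paste0() is vectorised, so this joins the first bases of all
  # sequences, then the second bases, and so on
  c(.row[1], do.call(paste0, bases))
}

# apply() returns one column per input row, hence the outer t()
result <- unname(t(apply(x, 1, rearrange_row)))
result
```

The outer t() is still needed, because apply() always stores each
function result as a column.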

>
> -----
> A R learner.

David "also still learning" Winsemius, MD
West Hartford, CT