[R] Apply function to every 20 rows between pairs of columns in a matrix

Tue Nov 12 04:43:35 CET 2013

HI,

set.seed(25)
dat1 <- as.data.frame(matrix(sample(c("A","T","G","C"),46482*56,replace=TRUE),ncol=56,nrow=46482),stringsAsFactors=FALSE)
 lst1 <- split(dat1,as.character(gl(nrow(dat1),20,nrow(dat1))))
res <- lapply(lst1,function(x) sapply(x[,1:8],function(y) sapply(x[,9:56], function(z) sum(y==z)/20)))

 length(res)
#[1] 2325  ### check here
 dim(res[[1]])
#[1] 48  8

A.K.

Hi all, I have a set of genetic SNP data that looks like 

Founder1 Founder2 Founder3 Founder4 Founder5 Founder6 Founder7 Founder8 Sample1 Sample2 Sample3 Sample... 
A A A T T T T T A T A T 
A A A T T T T T A T A T 
A A A T T T T T A T A T 
A A A T T T T T A T A T 
A A A T T T T T A T A T 
A A A T T T T T A T A T 
A A A T T T T T A T A T 
A A A T T T T T A T A T 
A A A T T T T T A T A T 
A A A T T T T T A T A T 
A A A T T T T T A T A T 
A A A T T T T T A T A T 

The size of the matrix is 56 columns by 46482 rows. I need to 
first bin the matrix by every 20 rows, then compare each of the first 8 
columns (founders) to each columns 9-56, and divide the total number of 
matching letters/alleles by the total number of rows (20). Ultimately I 
need 48 8 column by 2342 row matrices, which are essentially similarity 
matrices. I have tried to extract each pair separately by something like 

"length(cbind(odd[,9],odd[,1])[cbind(odd[,9],cbind(odd[,9],odd[,1])[,1])[,1]=="T"
& cbind(odd[,9],odd[,1])[,2]=="T",])/nrow(cbind(odd[,9],odd[,1]))" 

but this is no where near efficient, and I do not know of a 
faster way of applying the function to every 20 rows and across multiple
pairs. 

In the example given above, if the rows were all identical like 
shown across 20 rows, then the first row of the matrix for Sample1 would
be 

1 1 1 0 0 0 0