[R] Quantile and rowMean from multiple files in a folder

Sun Apr 13 16:54:55 CEST 2014

Hi,

I am formatting the codes using library(formatR).  Hopefully, it will not be mangled in the email.
dir.create("final")
lst1 <- split(list.files(pattern = ".csv"), gsub("\\_.*", "", list.files(pattern = ".csv")))

lst2 <- lapply(lst1, function(x1) lapply(x1, function(x2) { lines1 <- readLines(x2) header1 <- lines1[1:2] dat1 <- read.table(text = lines1, header = FALSE, sep = ",", stringsAsFactors = FALSE,  skip = 2) colnames(dat1) <- Reduce(paste, strsplit(header1, ",")) dat1[-c(nrow(dat1), nrow(dat1) - 1), ]
}))

library(plyr) 

lapply(seq_along(lst2), function(i) { lstN <- lapply(lst2[[i]], function(x) x[, -1]) lstQ1 <- lapply(lstN, function(x) numcolwise(function(y) quantile(y, seq(0, 1,  by = 0.01), na.rm = TRUE))(x)) arr1 <- array(unlist(lstQ1), dim = c(dim(lstQ1[[1]]), length(lstQ1)), dimnames = list(NULL,  lapply(lstQ1, names)[[1]])) res <- rowMeans(arr1, dims = 2, na.rm = TRUE) colnames(res) <- gsub(" ", "_", colnames(res)) res1 <- data.frame(Percentiles = paste0(seq(0, 100, by = 1), "%"), res, stringsAsFactors = FALSE) write.csv(res1, paste0(paste(getwd(), "final", paste(names(lst1)[[i]], "Quantile",  sep = "_"), sep = "/"), ".csv"), row.names = FALSE, quote = FALSE)
})

ReadOut1 <- lapply(list.files(recursive = TRUE)[grep("Quantile", list.files(recursive = TRUE))],  function(x) read.csv(x, header = TRUE, stringsAsFactors = FALSE))
sapply(ReadOut1, dim)
#     [,1] [,2]
#[1,]  101  101
#[2,]  258  258

lapply(ReadOut1,function(x) x[1:2,1:3])
#[[1]]
#  Percentiles  txav_DJF txav_MAM
#1          0% -12.68566  7.09702
#2          1% -12.59062  7.15338
#
#[[2]]
#  Percentiles  txav_DJF txav_MAM
#1          0% -12.75516 6.841840
#2          1% -12.68244 6.910664 

###Q2:

dir.create("Indices")
names1 <- lapply(ReadOut1, function(x) names(x))[[1]]
lstNew <- simplify2array(ReadOut1) lapply(2:nrow(lstNew), function(i) { dat1 <- data.frame(lstNew[1], do.call(cbind, lstNew[i, ]), stringsAsFactors = FALSE) colnames(dat1) <- c(rownames(lstNew)[1], paste(names(lst1), rep(rownames(lstNew)[i],  length(lst1)), sep = "_")) write.csv(dat1, paste0(paste(getwd(), "Indices", rownames(lstNew)[i], sep = "/"),  ".csv"), row.names = FALSE, quote = FALSE)
}) ## Output2:
ReadOut2 <- lapply(list.files(recursive = TRUE)[grep("Indices", list.files(recursive = TRUE))],  function(x) read.csv(x, header = TRUE, stringsAsFactors = FALSE))
length(ReadOut2)
# [1] 257 

head(ReadOut2[[1]], 2)
#  Percentiles G100_pav_ANN G101_pav_ANN
#1          0%     1.054380     1.032740
#2          1%     1.069457     1.045689 

A.K.

On Sunday, April 13, 2014 2:46 AM, Zilefac Elvis <zilefacelvis at yahoo.com> wrote:

Hi AK,
Q1) I need your help again. Using the previous data (attached) and the previous code below,instead of taking rowMeans, let's do quantile(x,seq(0,1,by=0.01)). 

Delete the last 2 rows (Trend and p<) in each file before doing quantile(x,seq(0,1,by=0.01)).

For example, assume that I want to calculate quantile(x,seq(0,1,by=0.01)) for each column of Site G100. I will do so for the 5 sims of site G100 and then take their average. This will be approximately close to the true value than just calculating quantile(x,seq(0,1,by=0.01)) from one sim. Please do this same thing for all the files.

So, when you do rowMeans, it should be the mean of quantile(x,seq(0,1,by=0.01)) calculated from all sims in that Site.

Output

The number of files in "final" remains the same (2 files). The "Year" column(will be replaced) will contain  the names of quantile(x,seq(0,1,by=0.01)) such as  0%           1%           2%           3%           4%           5%           6%, ..., 98%          99%         100% . You can give this column any name such as "Percentiles".

Q2)  From the folder "final", please go to each file identified by site name, take a column, say col1 of txav  from each file, create a dataframe whose colnames are site codes (names of files in "final"). Create a folder called "Indices" and place this dataframe in it. The filename for the dataframe is txav, say. So, in "Indices", you will have one file having 3 columns [, c(Percentiles, G100,G101)]. The idea is that I want to be able to pick any column from files in "final" and form a dataframe from which I will generate my qqplot or boxplot.

Thanks very much AK.
Atem
This should be the final step of this my drama, at least for now.
#==============================================================================================================

dir.create("final")
lst1 <- split(list.files(pattern=".csv"),gsub("\\_.*","",list.files(pattern=".csv"))) lst2 <-  lapply(lst1,function(x1) lapply(x1, function(x2) {lines1 <- readLines(x2); header1 <- lines1[1:2]; dat1 <- read.table(text=lines1,header=FALSE,sep=",",stringsAsFactors=FALSE, skip=2); colnames(dat1) <- Reduce(paste,strsplit(header1,","));dat1}))

lstYear <- lapply(lst2,function(x) lapply(x, function(y) y[,1,drop=FALSE])[[1]]) 

lapply(seq_along(lst2),function(i) {lstN <-lapply(lst2[[i]],function(x) x[,-1]); arr1 <- array(unlist(lstN),dim=c(dim(lstN[[1]]),length(lstN)),dimnames=list(NULL,lapply(lstN,names)[[1]]));res <- cbind(lstYear[[i]],rowMeans(arr1,dims=2,na.rm=TRUE)); names(res) <- gsub("\\_$","",gsub(" ", "_",names(res))); res[,1] <- gsub(" <", "",res[,1]); write.csv(res,paste0(paste(getwd(),"final",names(lst1)
[[i]],sep="/"),".csv"),row.names=FALSE,quote=FALSE)  }) 

#====================================================================================================