[R] Reading in and modifying multiple datasets in a loop

Mon Oct 24 23:10:32 CEST 2011

Thanks Uwe. This works perfectly.

#######

owd <- setwd(pth) 
fls <- list.files(pattern="^chr") 
ufls <- unique(sapply(strsplit(fls, "_"), "[", 1)) 
for(i in ufls){ 
     of <- strsplit(i, "\\.")[[1]] 
     of <- paste(of[1], tail(of, 1), sep=".") 
     impute2databel(genofile = i, 
                    samplefile = paste(i, "info", sep="_"), 
                    outfile = of, 
                    makeprob=TRUE, old=FALSE) 
} 
setwd(owd) 

####

I have a question regarding how strsplit works.

When my files are the following:

        chr1.one.phased.impute2.chunk1
        chr1.one.phased.impute2.chunk1_info
        chr1.one.phased.impute2.chunk1_info_by_sample
        chr1.one.phased.impute2.chunk1_summary
        chr1.one.phased.impute2.chunk1_warnings

ufls <- unique(sapply(strsplit(fls, "_"), "[", 1))

This works like a charm.
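
For reference, here is a minimal sketch of what that line does step by step on the five names listed. The vector fls is typed in by hand here rather than read with list.files(), and the intermediate names pieces and first_piece are only there for illustration:

```r
# File names typed by hand for illustration; normally they come from list.files()
fls <- c("chr1.one.phased.impute2.chunk1",
         "chr1.one.phased.impute2.chunk1_info",
         "chr1.one.phased.impute2.chunk1_info_by_sample",
         "chr1.one.phased.impute2.chunk1_summary",
         "chr1.one.phased.impute2.chunk1_warnings")

# strsplit() returns a list holding one character vector of pieces per file name
pieces <- strsplit(fls, "_")

# "[" with index 1 picks the first piece of each vector,
# i.e. everything before the first underscore
first_piece <- sapply(pieces, "[", 1)

# All five names share the same first piece, so unique() keeps one entry per chunk
ufls <- unique(first_piece)
print(ufls)
```

So the approach depends on the first underscore being the one that starts the suffix.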

I have another dataset where the files are

        study1_chr1.one.phased.impute2.chunk1
        study1_chr1.one.phased.impute2.chunk1_info
        study1_chr1.one.phased.impute2.chunk1_info_by_sample
        study1_chr1.one.phased.impute2.chunk1_summary
        study1_chr1.one.phased.impute2.chunk1_warnings

... and so on.

I wanted to run the same loop, but I was unable to change the strsplit call so that it works when the files are named as above.

I tried 

ufls <- unique(sapply(strsplit(fls, "_"), "[", 2)) 

but this knocks off "study1" (modified code below). What do I need to change to make this run?

####

fls <- list.files(pattern="study1_chr")
ufls <- unique(sapply(strsplit(fls, "_"), "[", 2)) 

library(GenABEL)

for(i in ufls){
     of <- strsplit(i, "\\.")[[1]]
     of <- paste(of[1], tail(of, 1), sep=".")
     impute2databel(genofile = i,
                    samplefile = paste(i, "info", sep="_"),
                    outfile = of,
                    makeprob=TRUE, old=FALSE)

}

#####
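
One possible direction, as a sketch only: strip the trailing suffix with sub() instead of splitting on every underscore, so that the "study1_" prefix stays in the name. This assumes the only suffixes are the four shown above (_info, _info_by_sample, _summary, _warnings). The file names are typed in by hand, and the helper name chunk_base is made up for illustration. The output name is built with the same two lines as in the loop:

```r
# File names typed by hand for illustration; normally they come from list.files()
fls <- c("study1_chr1.one.phased.impute2.chunk1",
         "study1_chr1.one.phased.impute2.chunk1_info",
         "study1_chr1.one.phased.impute2.chunk1_info_by_sample",
         "study1_chr1.one.phased.impute2.chunk1_summary",
         "study1_chr1.one.phased.impute2.chunk1_warnings")

# Hypothetical helper: drop a trailing _info, _info_by_sample, _summary or _warnings.
# Names without such a suffix are returned unchanged.
chunk_base <- function(x) sub("_(info_by_sample|info|summary|warnings)$", "", x)

# One entry per chunk, with the study prefix kept
ufls <- unique(chunk_base(fls))
print(ufls)

# Output name as in the loop: first and last dot-separated pieces
of <- strsplit(ufls[1], "\\.")[[1]]
of <- paste(of[1], tail(of, 1), sep = ".")
print(of)
```

With ufls built this way, the rest of the loop (genofile = i, samplefile = paste(i, "info", sep="_")) would not need to change.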

Thanks,

 Debs

----- Original Message -----
From: Debs Majumdar <debs_stata at yahoo.com>
To: "r-help at r-project.org" <r-help at r-project.org>
Cc: 
Sent: Friday, October 21, 2011 2:32 PM
Subject: Reading in and modifying multiple datasets in a loop

Hi,

  I have been given a set of around 300 files where there are 5 files corresponding to each chunk.

E.g. Chunk 1 for chr1 contains these 5 files:

        chr1.one.phased.impute2.chunk1
        chr1.one.phased.impute2.chunk1_info
        chr1.one.phased.impute2.chunk1_info_by_sample
        chr1.one.phased.impute2.chunk1_summary
        chr1.one.phased.impute2.chunk1_warnings

For chr1 there are 47 chunks, chr2 has 42 chunks ... and it ends at chr22 with 23 chunks.

I am using the DatABEL package to convert them to databel format using the following command:

impute2databel(genofile="chr1.one.phased.impute2.chunk1", samplefile="chr1.one.phased.impute2.chunk1_info", outfile="chr1.chunk1", makeprob=TRUE, old=FALSE)  

which uses two files per chunk.

Is there a way I can automate this so that the code goes through each chunk of each chromosome and does the conversion to databel format?

Thanks,

 -Debs