[R] Scripting in R -- pattern matching, logic, system calls, the works!
Dan Davison
davison at stats.ox.ac.uk
Tue Sep 16 17:31:51 CEST 2008
Instead of writing some long, ugly, "script", the way to use R is to
break problems down into distinct tasks. Reading data is one task, and
performing regressions on the data, plotting & summarising are
different tasks. Write functions to do each task in general, and then
use those functions.
So one task is reading the data from a Coverage dir. You want to do a
linear regression on the data, so you want to have the data stored as
a data frame. Following on from Don McQueen's good advice, here's a
function that does the job:
read.data.from.coverage.dir <- function(dir, pattern="Length_[0-9]+", min.length=0, max.length=Inf) {
## return a data frame with lengths in first column and means of
## file contents in second column
files <- list.files(dir, pattern)
lengths <- as.numeric(gsub("Length_", "", files, perl=TRUE))
files <- files[lengths >= min.length && lengths <= max.length]
get.mean.from.file <- function(file) mean(scan(file.path(dir,file), quiet=TRUE))
data.frame(x=lengths, y=sapply(files, get.mean.from.file))
}
And here's a function, that uses the first one, to get all the data
from your various coverage dirs
get.all.data <- function(topdir) {
coverage.dirs <- list.files(path=topdir, pattern="Coverage", full.names=TRUE)
lapply(coverage.dirs, read.data.from.coverage.dir)
}
So now you can do
## read all the data
all.data <- get.all.data(topdir="~")
## perform all the regressions
regression.fits <- lapply(all.data, function(df) lm(y ~ x, data=df))
## summarise them
summaries <- lapply(regression.fits, summary)
## etc
All those commands are generating lists of objects; lapply is a
shorthand for doing a for loop over a list.
You can use sink() to redirect output, but it would probably be better
to create tables and/or figures in R first, then write them to files.
Dan
On Tue, Sep 16, 2008 at 07:01:42AM -0700, bioinformatics_guy wrote:
>
> Don,
> Excellent advice. I've gone back and done a bit of coding and wanted to see
> what you think and possibly "shore up" some of the technical stuff I am
> still having a bit of difficulty with.
>
> I'll past the code I have to date with any important annotations:
>
> topdir="~"
> library(gmodels)
>
> setwd(topdir)
>
> ### Will probably want to do two for loops as opposed to recursive
> files=list.files(path=topdir,pattern="Coverage")
>
> for (i in files)
> {
> dir=paste("~/hangers/",i,sep="")
>
> files2=list.files(path=dir,pattern="Length")
>
> ### Make an empty matrix that will have the independent variable as
> the filenum and the dependent variable
> ### as the mean of the length or should I have two vectors for the
> regression. Basically the Length_(\d+) is the independent variable (which
> is taken from the filename) which all the regressions will have and then
> inside the Length_(\d+) is a 1d set of numbers which I take the mean of
> which in turn becomes the dependent variable. So in essence the points are:
> f(length)=mean(length$V1)
> f(45)=50
> f(50)=60
> etc ...
>
>
> for (j in files2)
> {
> ## I just rearranged the following line but I'm not sure what the
> command is doing
> ## I am assuming 'as.numeric' means take the input as a number
> instead of a string and the gsub has #me stumped
>
> filenum=as.numeric(gsub('Length_','',j))
>
> ## Can I assign variables at the top instead of hardcoding? like
> upper=50 , lower=30?
> ## And I don't need to put brackets for this if statement do I?
> Does it basically just
> ## say that if the filenum is outside those parameters, just go to
> the next j in files2?
> if (filenum > 200 | filenum < -10) next
>
> dir2=paste("~/hangers",i,j,sep="/")
>
> tmp=read.table(dir2)
>
> mean(tmp($V1))
>
> Now should I put these in a matrix or a vector (all j values (length
> vs mean(tmp$V1) for each i iteration)
> }
> }
>
> I think lastly, Id like to get a print out of each of the regressions (each
> iteration of i). Is that when I use the summary command? And, like in
> unix, can I redirect the output to a file?
>
> Best
>
>
> Don MacQueen wrote:
> >
> > I can't go through all the details, but hopefully this will help get
> > you started.
> >
> > If you look at the help page for the list.files() function, you will see
> > this:
> >
> > list.files(path = ".", pattern = NULL, all.files = FALSE,
> > full.names = FALSE, recursive = FALSE,
> > ignore.case = FALSE)
> >
> > The "." in path means to start at your current working directory.
> > Assuming your 5 Coverage directories are subdirectories of your
> > current working directory, that's what you want.
> >
> > Then, setting recursive to TRUE will cause it to also list the
> > contents of all subdirectories. Since your Length files are in the
> > Coverage subdirectories, that's what you want.
> >
> > Finally, the pattern argument returns only files that match the
> > pattern, so something like
> > patter="Length"
> > should get you just the files you want.
> >
> > The result is a character vector containing the names of all your
> > Length files. Try it and see.
> >
> > Then, a simple loop over the over the vector of filenames, with an
> > appropriate scan() or read.table() command for each, will read the
> > data in.
> >
> > If you need to restrict the files, say Length_20, Length_25,
> > Length_30, etc. then you'll have to do some more work.
> > Look at
> > as.numeric(gsub( 'Length_', '', filename))
> > to get just the number part of the filename, as a number, and then
> > you can use numeric inequalities to identify whether or not any
> > particular file is to be processed.
> >
> > Since you haven't shown what the contents of your files look like
> > (two columns of numbers or what), I have no idea what to suggest for
> > the part having to do with reading them in, plotting or doing linear
> > regression.
> >
> > The basic function for linear regression is lm().
> >
> >
> > Here is a summary:
> >
> > files <- list.files( '~' , pattern='Length', recursive=TRUE)
> >
> > for (fl in files) {
> >
> > ## optional, to restrict to only certain files
> > filenum <- as.numeric(gsub( 'Length_', '', filename))
> >
> > ## skip to next file if it isn't in the correct number range
> > if (filenum > 50 | filenum < 20) next
> >
> > ## a command to read the current file. perhaps:
> > ## tmp <- read.table(fl)
> >
> > ## commands to do statistics on the data in the current file. perhaps:
> > ## fit <- lm( y ~ y, data=tmp)
> >
> > ## some output
> > cat('------ file =',fl,'-----\n')
> > print(fit)
> >
> > }
> >
> > This example doesn't restrict only to certain Coverage subdirectories.
> >
> > -Don
> >
> >
> >
> > At 9:29 AM -0700 9/15/08, bioinformatics_guy wrote:
> >>Im very new to R so this might be a very simple question. First I'll lay
> out
> >>the hierarchy of my directories, goals.
> >>
> >>I have say 5 directories of form "Coverage_(some number)" and each one of
> >>these I have text files of form "Length_(some number)" which are comprised
> >>of say 30 numbers. Each one of these Length files (which are basically
> >>incremented by 5 from 0 to 100, Length_(0,5,10,15,20) are to be averaged
> >>where the average is the y-value and the length is the x-value in a linear
> >>regression.
> >>
> >>What I want to do is, write a script that looks in each of the coverage
> >>directories and then reads in each of the files, takes the means, and
> plots
> >>them in form I specified above. The catch is, what if I only want to plot
> >>say Length_(20-50) and what command/method is best for a linear
> regression?
> >>I've looked at m1(), but have not gotten it to work properly.
> >>
> >>Below is some of the code I've put together:
> >>
> >>topdir="~"
> >>
> >>setwd(topdir)
> >>
> >>### Took this function from a friend so I'm not sure what its doing
> besides
> >>grep-ing a directory?
> >>ll<-function(string)
> >>{
> >> grep(string,dir(),value=T)
> >>}
> >>
> >>### I believe this is looking for all files of form below
> >>subdir = ll("Coverage_[1-9][0-9]$")
> >>
> >>### A for loop iterating through each of the sub directories.
> >>for (i in subdir)
> >>{
> >> #not sure what this line is doing as I found it on the internet
> >> on a
> >>similar function
> >> setwd(paste(getwd(),i,sep="/"))
> >> #This makes a vector of all the file names
> >> filelist=ll("Length_")
> >>
> >>Can I use a regex or logic to only take the filelist variables I want?
> >>And can I now get the mean of each Length_* and set in a matrix (length x
> >>mean)?
> >>
> >>Then finally, how to do a linear regression of this.
> >>
> >>--
> >>View this message in context: http:// www.
> >>nabble.com/Scripting-in-R----pattern-matching%2C-logic%2C-system-calls%2C-the-works%21-tp19496451p19496451.html
> >>Sent from the R help mailing list archive at Nabble.com.
> >>
> >>______________________________________________
> >>R-help at r-project.org mailing list
> >>https:// stat.ethz.ch/mailman/listinfo/r-help
> >>PLEASE do read the posting guide http:// www.
> R-project.org/posting-guide.html
> >>and provide commented, minimal, self-contained, reproducible code.
> >
> >
> > --
> > --------------------------------------
> > Don MacQueen
> > Environmental Protection Department
> > Lawrence Livermore National Laboratory
> > Livermore, CA, USA
> > 925-423-1062
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> >
>
> --
> View this message in context: http://www.nabble.com/Scripting-in-R----pattern-matching%2C-logic%2C-system-calls%2C-the-works%21-tp19496451p19512508.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
http://www.stats.ox.ac.uk/~davison
More information about the R-help
mailing list