[R] Scripting in R -- pattern matching, logic, system calls, the works!
bioinformatics_guy
wwwhitener at gmail.com
Tue Sep 16 16:01:42 CEST 2008
Don,
Excellent advice. I've gone back and done a bit of coding and wanted to see
what you think and possibly "shore up" some of the technical stuff I am
still having a bit of difficulty with.
I'll paste the code I have to date, with any important annotations:
topdir="~"
library(gmodels)
setwd(topdir)
### Will probably want to do two for loops as opposed to recursive
files=list.files(path=topdir,pattern="Coverage")
for (i in files)
{
dir=paste("~/hangers/",i,sep="")
files2=list.files(path=dir,pattern="Length")
### Make an empty matrix that will have the file number as the independent
### variable and the mean of the lengths as the dependent variable, or should
### I have two vectors for the regression? Basically the Length_(\d+) number
### (taken from the filename) is the independent variable, which all the
### regressions will share. Inside each Length_(\d+) file is a 1-D set of
### numbers; their mean becomes the dependent variable. So in essence the
### points are:
###   f(length)=mean(length$V1)
###   f(45)=50
###   f(50)=60
###   etc ...
for (j in files2)
{
## I just rearranged the following line, but I'm not sure what the
## command is doing. I am assuming 'as.numeric' means take the input as a
## number instead of a string, and the gsub has me stumped.
filenum=as.numeric(gsub('Length_','',j))
## Can I assign variables at the top instead of hardcoding, like
## upper=50, lower=30?
## And I don't need to put brackets for this if statement, do I?
## Does it basically just say that if the filenum is outside those
## parameters, go to the next j in files2?
if (filenum > 200 | filenum < -10) next
dir2=paste("~/hangers",i,j,sep="/")
tmp=read.table(dir2)
mean(tmp$V1)
## Now should I put these in a matrix or a vector (all j values: length
## vs mean(tmp$V1), for each i iteration)?
}
}
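
A minimal sketch of the filename parsing and range check discussed in the comments above. The file names and the lower/upper limits are made up for illustration; they are assumptions, not the real directory contents.

```r
## Hypothetical file names standing in for the contents of one Coverage directory
files2 <- c("Length_0", "Length_25", "Length_45", "Length_250")

## Range limits set once at the top instead of hardcoded inside the loop
lower <- 20
upper <- 50

kept <- numeric(0)
for (j in files2) {
  ## gsub() replaces every match of the pattern 'Length_' with '' (nothing),
  ## leaving only the digits; as.numeric() turns that string into a number
  filenum <- as.numeric(gsub("Length_", "", j))

  ## A single statement after if() needs no braces; 'next' skips the rest of
  ## the loop body and moves on to the next j
  if (filenum > upper | filenum < lower) next

  kept <- c(kept, filenum)
}
print(kept)
```

With these made-up names only 25 and 45 fall inside the limits, so those are the only values kept.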
I think lastly, I'd like to get a printout of each of the regressions (one
per iteration of i). Is that when I use the summary command? And, like in
Unix, can I redirect the output to a file?
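
One possible way to collect the points and write each regression to a file, sketched under assumptions: the directory tree is built in tempdir() with invented numbers, and the names top, outfile, and fits are hypothetical. It uses two vectors (length and mean) per Coverage directory, lm() for the fit, and capture.output() to send the summary to a file; sink() is the closer analogue of a shell redirect and would also work.

```r
## Build a throwaway directory tree (hypothetical layout) so the sketch runs anywhere
top <- file.path(tempdir(), "hangers_demo")
for (cov in c("Coverage_10", "Coverage_20")) {
  dir.create(file.path(top, cov), recursive = TRUE, showWarnings = FALSE)
  for (n in seq(20, 50, by = 5)) {
    ## invented data: 30 values centred on 2 * n, shifted a little for some n
    vals <- 2 * n + (n %% 10) + seq(-1, 1, length.out = 30)
    write.table(vals, file.path(top, cov, paste("Length_", n, sep = "")),
                row.names = FALSE, col.names = FALSE)
  }
}

outfile <- file.path(top, "regressions.txt")
unlink(outfile)  # start from an empty report on every run

fits <- list()
for (i in list.files(top, pattern = "Coverage")) {
  files2 <- list.files(file.path(top, i), pattern = "Length")

  ## x: the number taken from each filename; y: the mean of column V1
  len <- as.numeric(gsub("Length_", "", files2))
  avg <- sapply(files2, function(j) mean(read.table(file.path(top, i, j))$V1))

  fits[[i]] <- lm(avg ~ len)

  ## Append a header line, then the printed summary, to the report file
  cat("------ dir =", i, "------\n", file = outfile, append = TRUE)
  capture.output(summary(fits[[i]]), file = outfile, append = TRUE)
}
```

Keeping the fitted models in a list means they can still be inspected after the loop, for example with coef(fits[["Coverage_10"]]).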
Best
Don MacQueen wrote:
>
> I can't go through all the details, but hopefully this will help get
> you started.
>
> If you look at the help page for the list.files() function, you will see
> this:
>
> list.files(path = ".", pattern = NULL, all.files = FALSE,
> full.names = FALSE, recursive = FALSE,
> ignore.case = FALSE)
>
> The "." in path means to start at your current working directory.
> Assuming your 5 Coverage directories are subdirectories of your
> current working directory, that's what you want.
>
> Then, setting recursive to TRUE will cause it to also list the
> contents of all subdirectories. Since your Length files are in the
> Coverage subdirectories, that's what you want.
>
> Finally, the pattern argument returns only files that match the
> pattern, so something like
> pattern="Length"
> should get you just the files you want.
>
> The result is a character vector containing the names of all your
> Length files. Try it and see.
>
> Then, a simple loop over the vector of filenames, with an
> appropriate scan() or read.table() command for each, will read the
> data in.
>
> If you need to restrict the files, say Length_20, Length_25,
> Length_30, etc. then you'll have to do some more work.
> Look at
> as.numeric(gsub( 'Length_', '', filename))
> to get just the number part of the filename, as a number, and then
> you can use numeric inequalities to identify whether or not any
> particular file is to be processed.
>
> Since you haven't shown what the contents of your files look like
> (two columns of numbers or what), I have no idea what to suggest for
> the part having to do with reading them in, plotting or doing linear
> regression.
>
> The basic function for linear regression is lm().
>
>
> Here is a summary:
>
> files <- list.files( '~' , pattern='Length', recursive=TRUE)
>
> for (fl in files) {
>
> ## optional, to restrict to only certain files
> ## (basename() drops the directory part that recursive=TRUE adds)
> filenum <- as.numeric(gsub( 'Length_', '', basename(fl)))
>
> ## skip to next file if it isn't in the correct number range
> if (filenum > 50 | filenum < 20) next
>
> ## a command to read the current file. perhaps:
> ## tmp <- read.table(fl)
>
> ## commands to do statistics on the data in the current file. perhaps:
> ## fit <- lm( y ~ x, data=tmp)
>
> ## some output
> cat('------ file =',fl,'-----\n')
> print(fit)
>
> }
>
> This example doesn't restrict only to certain Coverage subdirectories.
>
> -Don
>
>
>
> At 9:29 AM -0700 9/15/08, bioinformatics_guy wrote:
>>I'm very new to R, so this might be a very simple question. First I'll lay out
>>the hierarchy of my directories and my goals.
>>
>>I have, say, 5 directories of the form "Coverage_(some number)", and in each
>>one of these I have text files of the form "Length_(some number)", each made
>>up of, say, 30 numbers. Each of these Length files (which are basically
>>incremented by 5 from 0 to 100: Length_(0,5,10,15,20,...)) is to be averaged,
>>where the average is the y-value and the length is the x-value in a linear
>>regression.
>>
>>What I want to do is write a script that looks in each of the coverage
>>directories, reads in each of the files, takes the means, and plots
>>them in the form I specified above. The catch is, what if I only want to plot
>>say Length_(20-50), and what command/method is best for a linear regression?
>>I've looked at m1(), but have not gotten it to work properly.
>>
>>Below is some of the code I've put together:
>>
>>topdir="~"
>>
>>setwd(topdir)
>>
>>### Took this function from a friend so I'm not sure what it's doing besides
>>### grep-ing a directory?
>>ll<-function(string)
>>{
>> grep(string,dir(),value=T)
>>}
>>
>>### I believe this is looking for all files of form below
>>subdir = ll("Coverage_[1-9][0-9]$")
>>
>>### A for loop iterating through each of the sub directories.
>>for (i in subdir)
>>{
>> # not sure what this line is doing, as I found it on the internet in a
>> # similar function
>> setwd(paste(getwd(),i,sep="/"))
>> #This makes a vector of all the file names
>> filelist=ll("Length_")
>>
>>Can I use a regex or logic to only take the filelist variables I want?
>>And can I now get the mean of each Length_* and set in a matrix (length x
>>mean)?
>>
>>Then finally, how to do a linear regression of this.
>>
>>--
>>View this message in context: http://www.nabble.com/Scripting-in-R----pattern-matching%2C-logic%2C-system-calls%2C-the-works%21-tp19496451p19496451.html
>>Sent from the R help mailing list archive at Nabble.com.
>>
>>______________________________________________
>>R-help at r-project.org mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>and provide commented, minimal, self-contained, reproducible code.
>
>
> --
> --------------------------------------
> Don MacQueen
> Environmental Protection Department
> Lawrence Livermore National Laboratory
> Livermore, CA, USA
> 925-423-1062
>
--
View this message in context: http://www.nabble.com/Scripting-in-R----pattern-matching%2C-logic%2C-system-calls%2C-the-works%21-tp19496451p19512508.html
Sent from the R help mailing list archive at Nabble.com.