[R] Averaging within a range of values
Jeff Newmiller
jdnewmil at dcn.davis.ca.us
Sat Jan 14 07:23:30 CET 2012
I don't think my advice to use cut() actually works here, because the
ranges overlap. A reproducible example follows. As the posting guide
indicates, you should provide self-contained examples like this in
future questions posted to the list.
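As a brief aside before the full example, here is a minimal sketch of why a single cut() call cannot handle overlapping ranges. cut() assigns each value to at most one interval, so a position that lies inside two gene ranges can only be labelled once. The positions and breaks below are made up for illustration and are not the poster's data.

```r
# illustrative values only: G1 = [200, 700] and G2 = [500, 1000] overlap
pos <- c( 200, 500, 600, 800 )
# cut() takes one sorted set of breaks, so each position gets a single
# interval label (or NA), never two
bins <- cut( pos, breaks=c( 200, 500, 700, 1000 ), include.lowest=TRUE )
# explicit range tests let one position belong to several groups
inG1 <- ( 200 <= pos ) & ( pos <= 700 )
inG2 <- ( 500 <= pos ) & ( pos <= 1000 )
both <- pos[ inG1 & inG2 ]
```

Here 500 and 600 fall inside both ranges, but bins holds only one label for each of them, which is why the solutions below test each range separately.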
# begin example
tc <- textConnection(
"Group Start End
G1 200 700
G2 500 1000
G3 2000 3000
G4 4000 6000
G5 7000 8000
" )
d1 <- read.table( tc, header=TRUE )
close(tc)
tc <- textConnection(
"Pos C0 C1
200 0.9 0.6
500 0.8 0.8
800 0.9 0.7
1000 0.7 0.6
2000 0.6 0.4
2500 1.2 0.8
3000 0.6 1.5
3500 0.7 0.7
4000 0.8 0.8
4500 0.6 0.6
5000 0.9 0.9
5500 0.7 0.8
6000 0.8 0.7
6500 0.4 0.4
7000 0.5 0.8
7500 0.7 0.9
8000 0.9 0.5
8500 0.8 0.6
9000 0.9 0.8
" )
d2 <- read.table( tc, header=TRUE )
close(tc)
library(plyr)
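# get speed by using more memory
# cross join (Cartesian product): pair every group with every position;
# by=NULL requests this explicitly (merge has no by.all argument)
d3 <- merge( d1, d2, by=NULL )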
# remove combinations that do not fit
d3 <- d3[ ( d3$Start <= d3$Pos ) & ( d3$Pos <= d3$End ), ]
d4a <- ddply( d3
            , "Group"
            , function( df ) {
                c( C0mean=mean(df$C0), C1mean=mean(df$C1) )
              }
            )
# if you work with a large dataset, you may not be able to afford a
# cross join, so use a slower calculation that conserves memory
d4b <- ldply( seq_along( d1$Group )
            , function( idx, gpdf, dta ) {
                group <- gpdf$Group[ idx ]
                start <- gpdf$Start[ idx ]
                end <- gpdf$End[ idx ]
                subdta <- dta[ ( start <= dta$Pos ) & ( dta$Pos <= end ), ]
                data.frame( Group=group
                          , C0mean=mean( subdta$C0 )
                          , C1mean=mean( subdta$C1 ) )
              }
            , gpdf = d1
            , dta = d2
            )
# end of suggested solutions
# there are other ways as well, such as using the aggregate function or
# the sqldf package
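For the aggregate route mentioned above, a minimal base-R sketch might look like the following. It rebuilds the same d1 and d2 so that it runs on its own. The name d4c is only an illustrative choice, and the result columns keep the names C0 and C1 rather than C0mean and C1mean.

```r
# same data as in the example above, rebuilt so this block is self-contained
d1 <- data.frame( Group=c( "G1", "G2", "G3", "G4", "G5" )
                , Start=c( 200, 500, 2000, 4000, 7000 )
                , End=c( 700, 1000, 3000, 6000, 8000 ) )
d2 <- data.frame( Pos=c( 200, 500, 800, 1000, seq( 2000, 9000, by=500 ) )
                , C0=c( 0.9, 0.8, 0.9, 0.7, 0.6, 1.2, 0.6, 0.7, 0.8, 0.6
                      , 0.9, 0.7, 0.8, 0.4, 0.5, 0.7, 0.9, 0.8, 0.9 )
                , C1=c( 0.6, 0.8, 0.7, 0.6, 0.4, 0.8, 1.5, 0.7, 0.8, 0.6
                      , 0.9, 0.8, 0.7, 0.4, 0.8, 0.9, 0.5, 0.6, 0.8 ) )
# cross join, then keep only positions inside each group's range
d3 <- merge( d1, d2, by=NULL )
d3 <- d3[ ( d3$Start <= d3$Pos ) & ( d3$Pos <= d3$End ), ]
# average C0 and C1 within each group (d4c is a hypothetical name)
d4c <- aggregate( cbind( C0, C1 ) ~ Group, data=d3, FUN=mean )
```

This still pays the memory cost of the cross join, so it is an alternative to the ddply call rather than to the memory-conserving ldply version.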
On Fri, 13 Jan 2012, doggysaywhat wrote:
> My apologies for the context problem. I'll explain.
>
> df1 is a matrix of genes labeled g1 through g5 with start positions in the
> START column and end positions in the END column.
>
> df2 is a matrix of chromatin modification values at positions along the DNA.
>
> I want to average chromatin modification values for each gene from the start
> to the end position. So this would involve pulling out all values for
> column C0 that are between pos 200 and 700 for the first gene and averaging
> them. Then, I would pull all values from 500 to 1000, and continue for each
> gene.
>
> The example I gave previously was a short one, but I will be doing this for
> around 1000 genes with different positions. This is why just removing one
> group.
>
> This was something I tried to come up with that allowed me to use start and
> end positions. Your advice to use the cut is working.
>
> start<-df1[,2]
> end<-df1[,3]
>
> while(i<length(start)){
> i<-i+1
> print(cut(df2[,1],c(start[i],end[i])))
> }
>
> These were the results
>
> [1] <NA> (200,700] <NA> <NA> <NA> <NA> <NA>
> [8] <NA> <NA> <NA> <NA> <NA> <NA> <NA>
> [15] <NA> <NA> <NA> <NA> <NA>
> Levels: (200,700]
> [1] <NA> <NA> (500,1e+03] (500,1e+03] <NA> <NA>
> [7] <NA> <NA> <NA> <NA> <NA> <NA>
> [13] <NA> <NA> <NA> <NA> <NA> <NA>
> [19] <NA>
> Levels: (500,1e+03]
> [1] <NA> <NA> <NA> <NA> <NA>
> [6] (2e+03,3e+03] (2e+03,3e+03] <NA> <NA> <NA>
> [11] <NA> <NA> <NA> <NA> <NA>
> [16] <NA> <NA> <NA> <NA>
> Levels: (2e+03,3e+03]
> [1] <NA> <NA> <NA> <NA> <NA>
> [6] <NA> <NA> <NA> <NA> (4e+03,6e+03]
> [11] (4e+03,6e+03] (4e+03,6e+03] (4e+03,6e+03] <NA> <NA>
> [16] <NA> <NA> <NA> <NA>
> Levels: (4e+03,6e+03]
> [1] <NA> <NA> <NA> <NA> <NA>
> [6] <NA> <NA> <NA> <NA> <NA>
> [11] <NA> <NA> <NA> <NA> <NA>
> [16] (7e+03,8e+03] (7e+03,8e+03] <NA> <NA>
> Levels: (7e+03,8e+03]
>
>
> This is producing the right bins for each of the results, but I'm not sure
> how to put this into a data frame. When I did this:
>
>
> start<-df1[,2]
> end<-df1[,3]
>
> while(i<length(start)){
> i<-i+1
> bins<-(cut(df2[,1],c(start[i],end[i])))
> }
>
> the bins variable was the last level.
> Is there a way to assign the results of the while statement to a
> dataframe?
>
> Many thanks
>
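On collecting loop results in a data frame: one common approach is to build one small data frame per gene with lapply() and stack them with do.call( rbind, ... ), rather than overwriting a single variable on each pass. The sketch below assumes df1 and df2 have the same columns as the example at the top of this message. The two tiny data frames are stand-ins, not the poster's data, and inclusive range tests replace cut().

```r
# stand-in data with the same layout as the poster's df1 and df2
df1 <- data.frame( Group=c( "G1", "G2" )
                 , Start=c( 200, 500 )
                 , End=c( 700, 1000 ) )
df2 <- data.frame( Pos=c( 200, 500, 800, 1000 )
                 , C0=c( 0.9, 0.8, 0.9, 0.7 )
                 , C1=c( 0.6, 0.8, 0.7, 0.6 ) )
# one single-row data frame per gene, kept in a list
pieces <- lapply( seq_len( nrow( df1 ) ), function( i ) {
  keep <- ( df1$Start[ i ] <= df2$Pos ) & ( df2$Pos <= df1$End[ i ] )
  data.frame( Group=df1$Group[ i ]
            , C0mean=mean( df2$C0[ keep ] )
            , C1mean=mean( df2$C1[ keep ] ) )
} )
# stack the pieces into a single data frame, one row per gene
result <- do.call( rbind, pieces )
```

Keeping the pieces in a list and binding once at the end avoids growing a data frame inside the loop.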
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>     Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k