[R] Averaging within a range of values

Sat Jan 14 07:23:30 CET 2012

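I don't think my advice to use cut actually works here, because the ranges
overlap (G1 and G2 share positions 500 through 700) and cut puts each value
into at most one interval.  A reproducible example of two approaches that
do handle the overlap is given below.  As the posting guide indicates, you
should provide self-contained examples like this in future questions posted
to the list.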
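
A quick sketch of the overlap problem, using the first four positions and
the first two ranges from the data below.  The names pos, in_g1, in_g2,
both and g1_bins are made up for this illustration only.

```r
# first four positions from the example data
pos <- c( 200, 500, 800, 1000 )
# membership in G1 (200 to 700) and G2 (500 to 1000), end points included
in_g1 <- ( 200 <= pos ) & ( pos <= 700 )
in_g2 <- ( 500 <= pos ) & ( pos <= 1000 )
# positions that belong to both groups
both <- pos[ in_g1 & in_g2 ]
# cut returns one factor value per position, so a position can never be
# assigned to two ranges at once; with the default right=TRUE the interval
# is (200,700], which also leaves out position 200 itself
g1_bins <- cut( pos, c( 200, 700 ) )
print( both )
print( g1_bins )
```

Position 500 belongs to both G1 and G2, which a single factor cannot
express, so each range has to be tested separately.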

# begin example
tc <- textConnection(
"Group       Start         End
G1             200         700
G2             500        1000
G3            2000        3000
G4            4000        6000
G5            7000        8000
" )
d1 <- read.table( tc, header=TRUE )
close(tc)

tc <- textConnection(
"Pos    C0    C1
  200   0.9   0.6
  500   0.8   0.8
  800   0.9   0.7
1000   0.7   0.6
2000   0.6   0.4
2500   1.2   0.8
3000   0.6   1.5
3500   0.7   0.7
4000   0.8   0.8
4500   0.6   0.6
5000   0.9   0.9
5500   0.7   0.8
6000   0.8   0.7
6500   0.4   0.4
7000   0.5   0.8
7500   0.7   0.9
8000   0.9   0.5
8500   0.8   0.6
9000   0.9   0.8
" )
d2 <- read.table( tc, header=TRUE )
close(tc)

library(plyr)

# get speed by using more memory
# cross join: pair every row of d1 with every row of d2
# (by=NULL asks merge for the Cartesian product)
d3 <- merge( d1, d2, by=NULL )
# remove combinations that do not fit
d3 <- d3[ ( d3$Start <= d3$Pos ) & ( d3$Pos <= d3$End ), ]
d4a <- ddply( d3
             , "Group"
             , function( df ) {
                 c( C0mean=mean(df$C0), C1mean=mean(df$C1) )
               }
             )

# if you work with a large dataset, you may not be able to afford a
# cross join, so use a slower calculation that conserves memory
d4b <- ldply( seq_along( d1$Group )
              , function( idx, gpdf, dta ) {
                  group <- gpdf$Group[ idx ]
                  start <- gpdf$Start[ idx ]
                  end <- gpdf$End[ idx ]
                  subdta <- dta[ ( start <= dta$Pos ) & ( dta$Pos <= end ), ]
                  data.frame( Group=group
                            , C0mean=mean( subdta$C0 )
                            , C1mean=mean( subdta$C1 ) )
                }
              , gpdf = d1
              , dta = d2
              )

# end of suggested solutions
# there are other ways as well, such as using the aggregate function or 
# the sqldf package
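
For the aggregate route, a minimal sketch in base R might look like the
following.  It rebuilds the same d1 and d2 with data.frame so that it runs
on its own, and the result name d4c is just my choice here.  Like the first
solution, it assumes the cross join fits in memory.

```r
# same data as in the example above, built without textConnection
d1 <- data.frame( Group = c( "G1", "G2", "G3", "G4", "G5" )
                , Start = c( 200, 500, 2000, 4000, 7000 )
                , End   = c( 700, 1000, 3000, 6000, 8000 ) )
d2 <- data.frame( Pos = c( 200, 500, 800, 1000, 2000, 2500, 3000, 3500
                         , 4000, 4500, 5000, 5500, 6000, 6500, 7000
                         , 7500, 8000, 8500, 9000 )
                , C0 = c( 0.9, 0.8, 0.9, 0.7, 0.6, 1.2, 0.6, 0.7, 0.8, 0.6
                        , 0.9, 0.7, 0.8, 0.4, 0.5, 0.7, 0.9, 0.8, 0.9 )
                , C1 = c( 0.6, 0.8, 0.7, 0.6, 0.4, 0.8, 1.5, 0.7, 0.8, 0.6
                        , 0.9, 0.8, 0.7, 0.4, 0.8, 0.9, 0.5, 0.6, 0.8 ) )

# cross join, then keep only positions inside each group's range
d3 <- merge( d1, d2, by=NULL )
d3 <- d3[ ( d3$Start <= d3$Pos ) & ( d3$Pos <= d3$End ), ]
# mean of C0 and C1 within each group
d4c <- aggregate( cbind( C0, C1 ) ~ Group, data=d3, FUN=mean )
print( d4c )
```

The result has one row per group, with the means in columns named C0 and
C1 rather than C0mean and C1mean.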

On Fri, 13 Jan 2012, doggysaywhat wrote:

> My apologies for the context problem.  I'll explain.
>
> df1 is a matrix of genes labeled g1 through g5 with start positions in the
> START column and end positions in the END column.
>
> df2 is a matrix of chromatin modification values at positions along the DNA.
>
> I want to average chromatin modification values for each gene from the start
> to the end position.  So this would involve pulling out all values for
> column C0 that are between pos 200 and 700 for the first gene and averaging
> them.  Then, I would pull all values from 500 to 1000, and continue for each
> gene.
>
> The example I gave previously was a short one, but I will be doing this for
> around 1000 genes with different positions.  This is why just removing one
> group.
>
> This was something I tried to come up with that allowed me to use start and
> end positions.  Your advice to use the cut is working.
>
> start<-df1[,2]
> end<-df1[,3]
>
> while(i<length(start)){
>          i<-i+1
>           print(cut(df2[,1],c(start[i],end[i])))
> }
>
> These were the results
>
> [1] <NA>      (200,700] <NA>      <NA>      <NA>      <NA>      <NA>
> [8] <NA>      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>
> [15] <NA>      <NA>      <NA>      <NA>      <NA>
> Levels: (200,700]
> [1] <NA>        <NA>        (500,1e+03] (500,1e+03] <NA>        <NA>
> [7] <NA>        <NA>        <NA>        <NA>        <NA>        <NA>
> [13] <NA>        <NA>        <NA>        <NA>        <NA>        <NA>
> [19] <NA>
> Levels: (500,1e+03]
> [1] <NA>          <NA>          <NA>          <NA>          <NA>
> [6] (2e+03,3e+03] (2e+03,3e+03] <NA>          <NA>          <NA>
> [11] <NA>          <NA>          <NA>          <NA>          <NA>
> [16] <NA>          <NA>          <NA>          <NA>
> Levels: (2e+03,3e+03]
> [1] <NA>          <NA>          <NA>          <NA>          <NA>
> [6] <NA>          <NA>          <NA>          <NA>          (4e+03,6e+03]
> [11] (4e+03,6e+03] (4e+03,6e+03] (4e+03,6e+03] <NA>          <NA>
> [16] <NA>          <NA>          <NA>          <NA>
> Levels: (4e+03,6e+03]
> [1] <NA>          <NA>          <NA>          <NA>          <NA>
> [6] <NA>          <NA>          <NA>          <NA>          <NA>
> [11] <NA>          <NA>          <NA>          <NA>          <NA>
> [16] (7e+03,8e+03] (7e+03,8e+03] <NA>          <NA>
> Levels: (7e+03,8e+03]
>
>
> This is producing the right bins for each of the results, but I'm not sure
> how to put this into a data frame.  When I did this.
>
>
> start<-df1[,2]
> end<-df1[,3]
>
> while(i<length(start)){
>          i<-i+1
>           bins<-(cut(df2[,1],c(start[i],end[i])))
> }
>
> the bins variable was the last level.
> Is there a way to assign the results of the while statement to a
> dataframe?
>
> Many thanks

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k