[R] Writing a summary file in R

Dennis Murphy djmuser at gmail.com
Thu Jul 28 03:42:09 CEST 2011


Hi:

Is this more or less what you're after?

## Note: This is the preferred way to send your data by e-mail.
## I used dput(data-frame-name) to produce this,
## where data-frame-name = 'df' on my end.
df <- structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr3",
"chr4", "chr4", "chr7", "chr7", "chr9", "chr11", "chr11", "chr22",
"chr22", "chr22"), V2 = c(100L, 100L, 200L, 500L, 450L, 100L,
100L, 350L, 350L, 100L, 679L, 679L, 100L, 100L, 300L), V3 = c(159L,
159L, 260L, 750L, 700L, 300L, 300L, 600L, 600L, 125L, 687L, 687L,
200L, 200L, 400L), V4 = c(104L, 145L, 205L, 600L, 500L, 150L,
175L, 400L, 550L, 100L, 680L, 681L, 105L, 110L, 350L), V5 = c(104L,
145L, 205L, 600L, 500L, 150L, 175L, 400L, 550L, 100L, 680L, 681L,
105L, 110L, 350L), V6 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L,
1L, 1L, 0L, 1L, 1L, 0L), V7 = c(0.05, 0.04, 0.12, 0.09, 0.03,
0.05, 0, 0.06, 0, 0.1, 0.07, 0, 0.03, 0.08, 0), V8 = c("+", "+",
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+"
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8"
), class = "data.frame", row.names = c(NA, -15L))

############
# This is the structure you should see:
> str(df)
'data.frame':   15 obs. of  8 variables:
 $ V1: chr  "chr1" "chr1" "chr1" "chr1" ...
 $ V2: int  100 100 200 500 450 100 100 350 350 100 ...
 $ V3: int  159 159 260 750 700 300 300 600 600 125 ...
 $ V4: int  104 145 205 600 500 150 175 400 550 100 ...
 $ V5: int  104 145 205 600 500 150 175 400 550 100 ...
 $ V6: int  1 1 1 1 1 1 0 1 0 1 ...
 $ V7: num  0.05 0.04 0.12 0.09 0.03 0.05 0 0.06 0 0.1 ...
 $ V8: chr  "+" "+" "+" "+" ...
############
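
As a rough illustration of the dput() note above, here is a minimal sketch of the round trip: the sender dumps a data frame as text and the reader rebuilds it by evaluating that text. The object names small, txt and rebuilt are made up for this example and are not part of the original data.

```r
# A tiny stand-in data frame (hypothetical; two rows in the style of df)
small <- data.frame(V1 = c("chr1", "chr3"),
                    V2 = c(100L, 450L),
                    stringsAsFactors = FALSE)

# Sender side: dput() prints R code that recreates the object;
# capture.output() collects that text so it can be pasted into a message
txt <- paste(capture.output(dput(small)), collapse = "\n")
cat(txt, "\n")

# Reader side: evaluating the pasted text rebuilds the data frame
rebuilt <- eval(parse(text = txt))
print(all.equal(small, rebuilt))
```

In practice the reader simply pastes the structure(...) text after 'df <- ' at the prompt, as was done above; eval(parse()) is only used here to keep the sketch self-contained.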

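# Method 1: Write a function and call ddply()
library('plyr')
# summfun() receives the rows for one group and returns a one-row data frame
summfun <- function(d) {
    # summary() of V7 -> list -> one-row data frame
    dsum <- as.data.frame(as.list(summary(d[['V7']])))
    names(dsum) <- c('Min', 'Q1', 'Median', 'Mean', 'Q3', 'Max')
    # V3 is constant within each (V1, V2) group here, so take the first value
    data.frame(V3 = d[1, 'V3'], dsum)
}
ddply(df, .(V1, V2), summfun)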

The idea behind summfun is this: ddply() works best with functions that
take a data frame as input and return a data frame (or a scalar). Inside
summfun, summary(V7) is coerced first to a list and then to a data frame,
which is stored in dsum. The names are changed for convenience. dsum
has one row, so we add V3 as a column before returning it.
ddply() attaches the grouping variables to the output automatically;
you can also include them in the data frame you return, and ddply()
will not duplicate them in the output.
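
To make the coercion step concrete, here is a small base R sketch of what happens for a single group, using the two V7 values from the chr1 / 100 / 159 region. The names v7, s and one_row are mine, and the unclass() call is a defensive addition that is not in summfun itself.

```r
# V7 values for the chr1 / 100 / 159 region in the data above
v7 <- c(0.05, 0.04)

# summary() returns a named vector of six values
s <- summary(v7)

# Named vector -> named list -> one-row data frame
# (unclass() is an extra precaution added for this sketch)
one_row <- as.data.frame(as.list(unclass(s)))
names(one_row) <- c("Min", "Q1", "Median", "Mean", "Q3", "Max")

# Add the region end as a column, as summfun does with V3
one_row <- data.frame(V3 = 159L, one_row)
print(one_row)
```

The values should agree with the summary() output quoted in the original question for this region.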

The alternative in ddply(), which is simpler code, returns the six
values from summary() as separate rows within each group. In this event,
it is useful to carry along the names of the summaries so that one can
recast the data into one row per group with the cast() function from
the reshape package:

# Method 2: Summarize and reshape
# V3 is unnecessary but it is useful to carry it along for the output
# as.numeric() drops the summary class so that summ is a plain numeric column
u <- ddply(df, .(V1, V2, V3), summarise, summ = as.numeric(summary(V7)),
                       summtype = names(summary(V7)))
library('reshape')
cast(u, V1 + V2 + V3 ~ summtype, value = 'summ')
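
If plyr is not available, the same one-row-per-region idea can be sketched in base R, and the result can then be written out as a tab-delimited file. This is only a sketch under assumptions: toy is a six-row subset of df (chr1 and chr4 rows, with only the columns needed), quantile() and mean() stand in for summary(), and out_file is a hypothetical path in a temporary directory rather than the poster's actual output file.

```r
# Subset of the data frame from this post (chr1 and chr4 rows only)
toy <- data.frame(
  V1 = c("chr1", "chr1", "chr1", "chr1", "chr4", "chr4"),
  V2 = c(100L, 100L, 200L, 500L, 100L, 100L),
  V3 = c(159L, 159L, 260L, 750L, 300L, 300L),
  V7 = c(0.05, 0.04, 0.12, 0.09, 0.05, 0),
  stringsAsFactors = FALSE
)

# One sub-data frame per region (chromosome, start, end)
regions <- split(toy, list(toy$V1, toy$V2, toy$V3), drop = TRUE)

# Build a one-row summary for each region
rows <- lapply(regions, function(d) {
  q <- quantile(d$V7, c(0, 0.25, 0.5, 0.75, 1), names = FALSE)
  data.frame(V1 = d$V1[1], V2 = d$V2[1], V3 = d$V3[1],
             Min = q[1], Q1 = q[2], Median = q[3],
             Mean = mean(d$V7), Q3 = q[4], Max = q[5],
             stringsAsFactors = FALSE)
})

# Stack the rows into one data frame
res <- do.call(rbind, rows)
rownames(res) <- NULL
print(res)

# Write the summary as a tab-delimited text file (hypothetical path)
out_file <- file.path(tempdir(), "out.summary.txt")
write.table(res, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

The data frame returned by Method 1 or Method 2 could be passed to write.table() in the same way.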

HTH,
Dennis

PS: I may be one of those folks to whom David was referring in
relation to plyr :)

On Wed, Jul 27, 2011 at 4:02 PM, a217 <ajn21 at case.edu> wrote:
> Hello,
>
> I have an input file:
> http://r.789695.n4.nabble.com/file/n3700031/testOut.txt
>
> where column 1 is chromosome, column 2 is start of region, column 3 is end of
> region, columns 4 and 5 are base position, column 6 is total reads, column 7
> is methylation data, and column 8 is the strand.
>
>
> I would like a summary output file such as:
> http://r.789695.n4.nabble.com/file/n3700031/out.summary.txt
>
> where column 1 is chromosome, column 2 is start of region, column 3 is end
> of region, column 4 is total reads in general, column 5 is total reads >=1,
> column 6 is (col4/col5) or the percentage, and at the end I'd like to list 6
> more columns based on summary results from summary() function in R.
>
> The summary() function will be used to analyze all of the methylation data
> (col7 from input) for each region (bounded by col2 and col3).
>
> For example for chr1 100 159 summary() gives:
>  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  0.0400  0.0425  0.0450  0.0450  0.0475  0.0500
>
> which is simply the methylation data input into summary() only in the region
> of chr1 100 159.
>
> I know how to perform all of the required functions line-by-line, but the
> hard part for me is essentially taking the input data with multiple
> positions in each region and assigning all of the summary results to one
> line identified by the region.
>
> If any of you have any suggestions I would appreciate it.
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Writing-a-summary-file-in-R-tp3700031p3700031.html
> Sent from the R help mailing list archive at Nabble.com.