[R] Writing a summary file in R

Thu Jul 28 04:11:17 CEST 2011

On Jul 27, 2011, at 9:42 PM, Dennis Murphy wrote:

> Hi:
>
> Is this more or less what you're after?
>
> ## Note: This is the preferred way to send your data by e-mail.
> ## I used dput(data-frame-name) to produce this,
> ## where data-frame-name = 'df' on my end.
> df <- structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr3",
> "chr4", "chr4", "chr7", "chr7", "chr9", "chr11", "chr11", "chr22",
> "chr22", "chr22"), V2 = c(100L, 100L, 200L, 500L, 450L, 100L,
> 100L, 350L, 350L, 100L, 679L, 679L, 100L, 100L, 300L), V3 = c(159L,
> 159L, 260L, 750L, 700L, 300L, 300L, 600L, 600L, 125L, 687L, 687L,
> 200L, 200L, 400L), V4 = c(104L, 145L, 205L, 600L, 500L, 150L,
> 175L, 400L, 550L, 100L, 680L, 681L, 105L, 110L, 350L), V5 = c(104L,
> 145L, 205L, 600L, 500L, 150L, 175L, 400L, 550L, 100L, 680L, 681L,
> 105L, 110L, 350L), V6 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L,
> 1L, 1L, 0L, 1L, 1L, 0L), V7 = c(0.05, 0.04, 0.12, 0.09, 0.03,
> 0.05, 0, 0.06, 0, 0.1, 0.07, 0, 0.03, 0.08, 0), V8 = c("+", "+",
> "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+"
> )), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8"
> ), class = "data.frame", row.names = c(NA, -15L))
>
> ############
> # This is the structure you should see:
>> str(df)
> 'data.frame':   15 obs. of  8 variables:
> $ V1: chr  "chr1" "chr1" "chr1" "chr1" ...
> $ V2: int  100 100 200 500 450 100 100 350 350 100 ...
> $ V3: int  159 159 260 750 700 300 300 600 600 125 ...
> $ V4: int  104 145 205 600 500 150 175 400 550 100 ...
> $ V5: int  104 145 205 600 500 150 175 400 550 100 ...
> $ V6: int  1 1 1 1 1 1 0 1 0 1 ...
> $ V7: num  0.05 0.04 0.12 0.09 0.03 0.05 0 0.06 0 0.1 ...
> $ V8: chr  "+" "+" "+" "+" ...
> ############
>
> # Method 1: Write a function and call ddply()
> summfun <- function(d)  {
>    dsum <- as.data.frame(as.list(summary(d[['V7']])))
>    names(dsum) <- c('Min', 'Q1', 'Median', 'Mean', 'Q3', 'Max')
>    data.frame(V3 = d[1, 'V3'], dsum)
>  }
> library('plyr')
> ddply(df, .(V1, V2), summfun)
>
> The idea behind summfun is this: ddply() prefers functions that take a
> data frame as input and a data frame (or scalar) as output. dsum
> converts summary(V7) to a data frame by first coercing it into a list
> and then to a data frame. The names are changed for convenience. dsum
> has one line, so we add V3 to the data frame before outputting it.
> ddply() will attach the grouping variables to the output
> automatically; however, you can put them into the output data frame
> and ddply() will not duplicate the grouping variables in the output.
>
> The alternative in ddply(), which is simpler code, outputs the results
> from summary() in different rows for each grouping. In this event, it
> is useful to carry along the names of the summaries so that one can
> recast the data with the cast() function from the reshape package:
>
> # Method 2: Summarize and reshape
> # V3 is unnecessary but it is useful to carry it along for the output
> u <- ddply(df, .(V1, V2, V3), summarise, summ = summary(V7),
>                       summtype = names(summary(V7)))
> library('reshape')
> cast(u, V1 + V2 + V3 ~ summtype, value = 'summ')
>
> HTH,
> Dennis
>
> PS: I may be one of those folks to whom David was referring in
> relation to plyr :)

I've been really impressed at Dennis' facility with plyr, reshape, and  
reshape2. Note that the 'reshape' function has nothing to do with the  
'reshape' package.  Here's what I came up with using base functions:

 > str(inpdat)
'data.frame':	15 obs. of  8 variables:
  $ chromosome : chr  "chr1" "chr1" "chr1" "chr1" ...
  $ startreg   : int  100 100 200 500 450 100 100 350 350 100 ...
  $ endreg     : int  159 159 260 750 700 300 300 600 600 125 ...
  $ base1      : int  104 145 205 600 500 150 175 400 550 100 ...
  $ base2      : int  104 145 205 600 500 150 175 400 550 100 ...
  $ totalreads : int  1 1 1 1 1 1 0 1 0 1 ...
  $ methylation: num  0.05 0.04 0.12 0.09 0.03 0.05 0 0.06 0 0.1 ...
  $ strand     : chr  "+" "+" "+" "+" ...
# The split into distinct 'chromosome' and 'startreg' categories:
splinp <- split(inpdat, paste(inpdat$chromosome, inpdat$startreg) )

# Process within separate categories: the tapply, aggragate and by  
functions are all related

 > df <- as.data.frame( t(sapply(splinp, function(x) list(chr=x 
$chromosome[1], strt=x$startreg[1], end=x$endreg[1],  
frac=sum(x[['totalreads']]>=1)/nrow(x) )) ) )
# You often need the t() function when working with apply functions
 > df
             chr strt end frac
chr1 100   chr1  100 159    1
chr1 200   chr1  200 260    1
chr1 500   chr1  500 750    1
chr11 679 chr11  679 687  0.5
chr22 100 chr22  100 200    1
chr22 300 chr22  300 400    0
chr3 450   chr3  450 700    1
chr4 100   chr4  100 300  0.5
chr7 350   chr7  350 600  0.5
chr9 100   chr9  100 125    1

 > as.data.frame(t(sapply(splinp, function(x) summary(x 
$methylation )) ) )
           Min. 1st Qu. Median  Mean 3rd Qu. Max.
chr1 100  0.04  0.0425  0.045 0.045  0.0475 0.05
chr1 200  0.12  0.1200  0.120 0.120  0.1200 0.12
chr1 500  0.09  0.0900  0.090 0.090  0.0900 0.09
chr11 679 0.00  0.0175  0.035 0.035  0.0525 0.07
chr22 100 0.03  0.0425  0.055 0.055  0.0675 0.08
chr22 300 0.00  0.0000  0.000 0.000  0.0000 0.00
chr3 450  0.03  0.0300  0.030 0.030  0.0300 0.03
chr4 100  0.00  0.0125  0.025 0.025  0.0375 0.05
chr7 350  0.00  0.0150  0.030 0.030  0.0450 0.06
chr9 100  0.10  0.1000  0.100 0.100  0.1000 0.10

# The coup de grace: bind the columns together

 > df.summ <- as.data.frame(t(sapply(splinp, function(x) summary(x 
$methylation )) ) )
 > cbind(df, df.summ)
             chr strt end frac Min. 1st Qu. Median  Mean 3rd Qu. Max.
chr1 100   chr1  100 159    1 0.04  0.0425  0.045 0.045  0.0475 0.05
chr1 200   chr1  200 260    1 0.12  0.1200  0.120 0.120  0.1200 0.12
chr1 500   chr1  500 750    1 0.09  0.0900  0.090 0.090  0.0900 0.09
chr11 679 chr11  679 687  0.5 0.00  0.0175  0.035 0.035  0.0525 0.07
chr22 100 chr22  100 200    1 0.03  0.0425  0.055 0.055  0.0675 0.08
chr22 300 chr22  300 400    0 0.00  0.0000  0.000 0.000  0.0000 0.00
chr3 450   chr3  450 700    1 0.03  0.0300  0.030 0.030  0.0300 0.03
chr4 100   chr4  100 300  0.5 0.00  0.0125  0.025 0.025  0.0375 0.05
chr7 350   chr7  350 600  0.5 0.00  0.0150  0.030 0.030  0.0450 0.06
chr9 100   chr9  100 125    1 0.10  0.1000  0.100 0.100  0.1000 0.10

-- 
Best;

David.

>
> On Wed, Jul 27, 2011 at 4:02 PM, a217 <ajn21 at case.edu> wrote:
>> Hello,
>>
>> I have an input file:
>> http://r.789695.n4.nabble.com/file/n3700031/testOut.txt testOut.txt
>>
>> where col 1 is chromosome, column2 is start of region, column 3 is  
>> end of
>> region, column 4 and 5 is base position, column 6 is total reads,  
>> column 7
>> is methylation data, and column 8 is the strand.
>>
>>
>> I would like a summary output file such as:
>> http://r.789695.n4.nabble.com/file/n3700031/out.summary.txt  
>> out.summary.txt
>>
>> where column 1 is chromosome, column 2 is start of region, column 3  
>> is end
>> of region, column 4 is total reads in general, column 5 is total  
>> reads >=1,
>> column 6 is (col4/col5) or the percentage, and at the end I'd like  
>> to list 6
>> more columns based on summary results from summary() function in R.
>>
>> The summary() function will be used to analyze all of the  
>> methylation data
>> (col7 from input) for each region (bounded by col2 and col3).
>>
>> For example for chr1 100 159 summary() gives:
>>  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>  0.0400  0.0425  0.0450  0.0450  0.0475  0.0500
>>
>> which is simply the methylation data input into summary() only in  
>> the region
>> of chr1 100 159.
>>
>> I know how to perform all of the required functions line-by-line,  
>> but the
>> hard part for me is essentially taking the input data with multiple
>> positions in each region and assigning all of the summary results  
>> to one
>> line identified by the region.
>>
>> If any of you have any suggestions I would appreciate it.
>>
>> --
>> View this message in context: http://r.789695.n4.nabble.com/Writing-a-summary-file-in-R-tp3700031p3700031.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT