[R] Writing a summary file in R
a217
ajn21 at case.edu
Wed Aug 3 22:57:58 CEST 2011
Just a very simple follow-up. In the summary table (listed as "summ" below),
the "TR" column I would like to display the total number of rows (i.e.
counts) which I have done via "NROW()" function.
However, in the "RG1" I would only like to count the number of rows with a
'totalread' count >= 1 (i.e. rows that don't contain zero).
This may be confusing given the data I've provided, but values in the
'totalreads' column don't have to be 1 or 0, they can be any value.
Therefore using "sum()" won't work in every case.
As you can see I've tried using NROW() below for "RG1" but it didn't work
out like I had planned.
For example, given the input data, "chr4 100 300" should have RG1=1 and
percent=0.5. Instead, it just counts every row regardless of value.
The solution is probably something very simple I'm overlooking, but if you
could help I'd appreciate it.
Below is the code I've slightly modified from David's reply:
###############################Code##############################
> colnames(data) <-
> c("chr","start","end","base1","base2","totalreads","methylation","strand")
> data
#this is the input file
####################################
chr start end base1 base2 totalreads methylation strand
1 chr1 100 159 104 104 1 0.05 +
2 chr1 100 159 145 145 1 0.04 +
3 chr1 200 260 205 205 1 0.12 +
4 chr1 500 750 600 600 1 0.09 +
5 chr3 450 700 500 500 1 0.03 +
6 chr4 100 300 150 150 1 0.05 +
7 chr4 100 300 175 175 0 0.00 +
8 chr7 350 600 400 400 1 0.06 +
9 chr7 350 600 550 550 0 0.00 +
10 chr9 100 125 100 100 1 0.10 +
11 chr11 679 687 680 680 1 0.07 +
12 chr11 679 687 681 681 0 0.00 +
13 chr22 100 200 105 105 1 0.03 +
14 chr22 100 200 110 110 1 0.08 +
15 chr22 300 400 350 350 0 0.00 +
####################################
> splinp <- split(data, paste(data$chr, data$start))
> df <- as.data.frame(t(sapply(splinp, function(x) list(end=x$end[1],
> TR=NROW(x[['totalreads']]), RG1=NROW(x[['totalreads']]>=1),
> percent=(NROW(x[['totalreads']]>=1)/NROW(x[['totalreads']]))))))
> df
#######################
end TR RG1 percent
chr1 100 159 2 2 1
chr1 200 260 1 1 1
chr1 500 750 1 1 1
chr11 679 687 2 2 1
chr22 100 200 2 2 1
chr22 300 400 1 1 1
chr3 450 700 1 1 1
chr4 100 300 2 2 1
chr7 350 600 2 2 1
chr9 100 125 1 1 1
#######################
> df.summ <- as.data.frame(t(sapply(splinp, function(x)
> summary(x$methylation))))
> summ<-cbind(df,df.summ)
> summ
#the finished output
###########################################
end TR RG1 percent Min. 1st Qu. Median Mean 3rd Qu. Max.
chr1 100 159 2 2 1 0.04 0.0425 0.045 0.045 0.0475 0.05
chr1 200 260 1 1 1 0.12 0.1200 0.120 0.120 0.1200 0.12
chr1 500 750 1 1 1 0.09 0.0900 0.090 0.090 0.0900 0.09
chr11 679 687 2 2 1 0.00 0.0175 0.035 0.035 0.0525 0.07
chr22 100 200 2 2 1 0.03 0.0425 0.055 0.055 0.0675 0.08
chr22 300 400 1 1 1 0.00 0.0000 0.000 0.000 0.0000 0.00
chr3 450 700 1 1 1 0.03 0.0300 0.030 0.030 0.0300 0.03
chr4 100 300 2 2 1 0.00 0.0125 0.025 0.025 0.0375 0.05
chr7 350 600 2 2 1 0.00 0.0150 0.030 0.030 0.0450 0.06
chr9 100 125 1 1 1 0.10 0.1000 0.100 0.100 0.1000 0.10
############################################
##############################################################
David Winsemius wrote:
>
> On Jul 27, 2011, at 9:42 PM, Dennis Murphy wrote:
>
>> Hi:
>>
>> Is this more or less what you're after?
>>
>> ## Note: This is the preferred way to send your data by e-mail.
>> ## I used dput(data-frame-name) to produce this,
>> ## where data-frame-name = 'df' on my end.
>> df <- structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr3",
>> "chr4", "chr4", "chr7", "chr7", "chr9", "chr11", "chr11", "chr22",
>> "chr22", "chr22"), V2 = c(100L, 100L, 200L, 500L, 450L, 100L,
>> 100L, 350L, 350L, 100L, 679L, 679L, 100L, 100L, 300L), V3 = c(159L,
>> 159L, 260L, 750L, 700L, 300L, 300L, 600L, 600L, 125L, 687L, 687L,
>> 200L, 200L, 400L), V4 = c(104L, 145L, 205L, 600L, 500L, 150L,
>> 175L, 400L, 550L, 100L, 680L, 681L, 105L, 110L, 350L), V5 = c(104L,
>> 145L, 205L, 600L, 500L, 150L, 175L, 400L, 550L, 100L, 680L, 681L,
>> 105L, 110L, 350L), V6 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L,
>> 1L, 1L, 0L, 1L, 1L, 0L), V7 = c(0.05, 0.04, 0.12, 0.09, 0.03,
>> 0.05, 0, 0.06, 0, 0.1, 0.07, 0, 0.03, 0.08, 0), V8 = c("+", "+",
>> "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+"
>> )), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8"
>> ), class = "data.frame", row.names = c(NA, -15L))
>>
>> ############
>> # This is the structure you should see:
>>> str(df)
>> 'data.frame': 15 obs. of 8 variables:
>> $ V1: chr "chr1" "chr1" "chr1" "chr1" ...
>> $ V2: int 100 100 200 500 450 100 100 350 350 100 ...
>> $ V3: int 159 159 260 750 700 300 300 600 600 125 ...
>> $ V4: int 104 145 205 600 500 150 175 400 550 100 ...
>> $ V5: int 104 145 205 600 500 150 175 400 550 100 ...
>> $ V6: int 1 1 1 1 1 1 0 1 0 1 ...
>> $ V7: num 0.05 0.04 0.12 0.09 0.03 0.05 0 0.06 0 0.1 ...
>> $ V8: chr "+" "+" "+" "+" ...
>> ############
>>
>> # Method 1: Write a function and call ddply()
>> summfun <- function(d) {
>> dsum <- as.data.frame(as.list(summary(d[['V7']])))
>> names(dsum) <- c('Min', 'Q1', 'Median', 'Mean', 'Q3', 'Max')
>> data.frame(V3 = d[1, 'V3'], dsum)
>> }
>> library('plyr')
>> ddply(df, .(V1, V2), summfun)
>>
>> The idea behind summfun is this: ddply() prefers functions that take a
>> data frame as input and a data frame (or scalar) as output. dsum
>> converts summary(V7) to a data frame by first coercing it into a list
>> and then to a data frame. The names are changed for convenience. dsum
>> has one line, so we add V3 to the data frame before outputting it.
>> ddply() will attach the grouping variables to the output
>> automatically; however, you can put them into the output data frame
>> and ddply() will not duplicate the grouping variables in the output.
>>
>> The alternative in ddply(), which is simpler code, outputs the results
>> from summary() in different rows for each grouping. In this event, it
>> is useful to carry along the names of the summaries so that one can
>> recast the data with the cast() function from the reshape package:
>>
>> # Method 2: Summarize and reshape
>> # V3 is unnecessary but it is useful to carry it along for the output
>> u <- ddply(df, .(V1, V2, V3), summarise, summ = summary(V7),
>> summtype = names(summary(V7)))
>> library('reshape')
>> cast(u, V1 + V2 + V3 ~ summtype, value = 'summ')
>>
>> HTH,
>> Dennis
>>
>> PS: I may be one of those folks to whom David was referring in
>> relation to plyr :)
>
> I've been really impressed at Dennis' facility with plyr, reshape, and
> reshape2. Note that the 'reshape' function has nothing to do with the
> 'reshape' package. Here's what I came up with using base functions:
>
> > str(inpdat)
> 'data.frame': 15 obs. of 8 variables:
> $ chromosome : chr "chr1" "chr1" "chr1" "chr1" ...
> $ startreg : int 100 100 200 500 450 100 100 350 350 100 ...
> $ endreg : int 159 159 260 750 700 300 300 600 600 125 ...
> $ base1 : int 104 145 205 600 500 150 175 400 550 100 ...
> $ base2 : int 104 145 205 600 500 150 175 400 550 100 ...
> $ totalreads : int 1 1 1 1 1 1 0 1 0 1 ...
> $ methylation: num 0.05 0.04 0.12 0.09 0.03 0.05 0 0.06 0 0.1 ...
> $ strand : chr "+" "+" "+" "+" ...
> # The split into distinct 'chromosome' and 'startreg' categories:
> splinp <- split(inpdat, paste(inpdat$chromosome, inpdat$startreg) )
>
> # Process within separate categories: the tapply, aggragate and by
> functions are all related
>
> > df <- as.data.frame( t(sapply(splinp, function(x) list(chr=x
> $chromosome[1], strt=x$startreg[1], end=x$endreg[1],
> frac=sum(x[['totalreads']]>=1)/nrow(x) )) ) )
> # You often need the t() function when working with apply functions
> > df
> chr strt end frac
> chr1 100 chr1 100 159 1
> chr1 200 chr1 200 260 1
> chr1 500 chr1 500 750 1
> chr11 679 chr11 679 687 0.5
> chr22 100 chr22 100 200 1
> chr22 300 chr22 300 400 0
> chr3 450 chr3 450 700 1
> chr4 100 chr4 100 300 0.5
> chr7 350 chr7 350 600 0.5
> chr9 100 chr9 100 125 1
>
> > as.data.frame(t(sapply(splinp, function(x) summary(x
> $methylation )) ) )
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> chr1 100 0.04 0.0425 0.045 0.045 0.0475 0.05
> chr1 200 0.12 0.1200 0.120 0.120 0.1200 0.12
> chr1 500 0.09 0.0900 0.090 0.090 0.0900 0.09
> chr11 679 0.00 0.0175 0.035 0.035 0.0525 0.07
> chr22 100 0.03 0.0425 0.055 0.055 0.0675 0.08
> chr22 300 0.00 0.0000 0.000 0.000 0.0000 0.00
> chr3 450 0.03 0.0300 0.030 0.030 0.0300 0.03
> chr4 100 0.00 0.0125 0.025 0.025 0.0375 0.05
> chr7 350 0.00 0.0150 0.030 0.030 0.0450 0.06
> chr9 100 0.10 0.1000 0.100 0.100 0.1000 0.10
>
> # The coup de grace: bind the columns together
>
> > df.summ <- as.data.frame(t(sapply(splinp, function(x) summary(x
> $methylation )) ) )
> > cbind(df, df.summ)
> chr strt end frac Min. 1st Qu. Median Mean 3rd Qu. Max.
> chr1 100 chr1 100 159 1 0.04 0.0425 0.045 0.045 0.0475 0.05
> chr1 200 chr1 200 260 1 0.12 0.1200 0.120 0.120 0.1200 0.12
> chr1 500 chr1 500 750 1 0.09 0.0900 0.090 0.090 0.0900 0.09
> chr11 679 chr11 679 687 0.5 0.00 0.0175 0.035 0.035 0.0525 0.07
> chr22 100 chr22 100 200 1 0.03 0.0425 0.055 0.055 0.0675 0.08
> chr22 300 chr22 300 400 0 0.00 0.0000 0.000 0.000 0.0000 0.00
> chr3 450 chr3 450 700 1 0.03 0.0300 0.030 0.030 0.0300 0.03
> chr4 100 chr4 100 300 0.5 0.00 0.0125 0.025 0.025 0.0375 0.05
> chr7 350 chr7 350 600 0.5 0.00 0.0150 0.030 0.030 0.0450 0.06
> chr9 100 chr9 100 125 1 0.10 0.1000 0.100 0.100 0.1000 0.10
>
> --
> Best;
>
> David.
>
>>
>> On Wed, Jul 27, 2011 at 4:02 PM, a217 <ajn21 at case.edu> wrote:
>>> Hello,
>>>
>>> I have an input file:
>>> http://r.789695.n4.nabble.com/file/n3700031/testOut.txt testOut.txt
>>>
>>> where col 1 is chromosome, column2 is start of region, column 3 is
>>> end of
>>> region, column 4 and 5 is base position, column 6 is total reads,
>>> column 7
>>> is methylation data, and column 8 is the strand.
>>>
>>>
>>> I would like a summary output file such as:
>>> http://r.789695.n4.nabble.com/file/n3700031/out.summary.txt
>>> out.summary.txt
>>>
>>> where column 1 is chromosome, column 2 is start of region, column 3
>>> is end
>>> of region, column 4 is total reads in general, column 5 is total
>>> reads >=1,
>>> column 6 is (col4/col5) or the percentage, and at the end I'd like
>>> to list 6
>>> more columns based on summary results from summary() function in R.
>>>
>>> The summary() function will be used to analyze all of the
>>> methylation data
>>> (col7 from input) for each region (bounded by col2 and col3).
>>>
>>> For example for chr1 100 159 summary() gives:
>>> Min. 1st Qu. Median Mean 3rd Qu. Max.
>>> 0.0400 0.0425 0.0450 0.0450 0.0475 0.0500
>>>
>>> which is simply the methylation data input into summary() only in
>>> the region
>>> of chr1 100 159.
>>>
>>> I know how to perform all of the required functions line-by-line,
>>> but the
>>> hard part for me is essentially taking the input data with multiple
>>> positions in each region and assigning all of the summary results
>>> to one
>>> line identified by the region.
>>>
>>> If any of you have any suggestions I would appreciate it.
>>>
>>> --
>>> View this message in context:
>>> http://r.789695.n4.nabble.com/Writing-a-summary-file-in-R-tp3700031p3700031.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> West Hartford, CT
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
View this message in context: http://r.789695.n4.nabble.com/Writing-a-summary-file-in-R-tp3700031p3716837.html
Sent from the R help mailing list archive at Nabble.com.
More information about the R-help
mailing list