[BioC] What is the easiest way to make the GC content and length matrix for package cqn?
Martin Morgan
mtmorgan at fhcrc.org
Wed Apr 9 00:38:29 CEST 2014
On 04/08/2014 01:03 PM, Thornton, Matthew wrote:
> Hello!
>
> I have some RNASeq data that I am analyzing with edgeR and I would like to use the cqn package to correct for GC bias. I have aligned the data to the UCSC hg38 genome.
>
>>From googling I have found the the bedtools 'nuc' command will give me the GC content with ranges and the length. Providing that I have a bed file of the hg38 genome.
>
> What I need to make sure of is that I am calculating the length and the GC content for the correct intervals. I bin my reads using GenomicFeatures like this (thank you Homer)
>
> # Load Libraries
> library(GenomicFeatures)
> library(GenomicRanges)
>
> txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene')
>
> tr_by_gene <- transcriptsBy(txdb,'gene')
>
> library(Rsamtools)
>
> r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam")
> r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam")
> r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam")
> r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam")
>
> # counts reads overlapping the genes.
> c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m)
> c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p)
> c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m)
> c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p)
>
> # Create a count table
> countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p, AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE)
>
For the preceding (I know you didn't ask!) you could represent all of your files
files <- c(r_AE89_m="sorted_trimmed_AE89_minus.bam",
r_AE89_p="sorted_trimmed_AE89_plus.1.bam",
r_AE90_m="sorted_trimmed_AE90_minus.bam",
r_AE90_p="sorted_trimmed_AE90_plus.bam")
as a 'BamFileList'
bam_files <- BamFileList(files)
write a 'COUNTER' function that takes features, reads, and possibly other arguments
COUNTER <- function(features, reads, ...)
## ignore some arguments
countOverlaps(features, reads)
And then perform the summary
se <- summarizeOverlaps(tx_by_gene, bam_files, COUNTER)
head(assay(se), 3)
head(rowData(se))
Once this is working to your satisfaction, modify the recipe to count each file
on a separate thread, limiting memory use by changing the 'yieldSize' (number of
records read at any one time) on the BamFileList
library(BiocParallel) # count one file per core
bam_files <- BamFileList(files, yieldSize=1000000)
se <- summarizeOverlaps(tx_by_gene, bam_files, COUNTER)
Add a param=ScanBamParam() argument to control which reads and possible other
information (e.g., mapping quality) that gets read in and hence is available for
use in the COUNTER function.
The help page ?summarizeOverlaps has some built-in functions that count in
different, relevant, ways.
Martin
> rownames(countTable)=names(tr_by_gene)
>
> Is there a way to get lengths and gc content from 'tr_by-gene'? Is there a way to create a bed file from 'tr_by_gene' so that I can then use bedtools nuc?
>
> Thank you.
>
> Matt
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list