[BioC] What is the easiest way to make the GC content and length matrix for package cqn?

Tue Apr 8 22:03:10 CEST 2014

Hello!

I have some RNASeq data that I am analyzing with edgeR and I would like to use the cqn package to correct for GC bias.  I have aligned the data to the UCSC hg38 genome.

>From googling I have found the the bedtools 'nuc' command will give me the GC content with ranges and the length. Providing that I have a bed file of the hg38 genome.

What I need to make sure of is that I am calculating the length and the GC content for the correct intervals. I bin my reads using GenomicFeatures like this (thank you Homer)

# Load Libraries
library(GenomicFeatures)
library(GenomicRanges)

txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene')

tr_by_gene <- transcriptsBy(txdb,'gene')

library(Rsamtools)

r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam")
r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam")
r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam")
r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam")

# counts reads overlapping the genes.
c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m)
c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p)
c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m)
c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p)

# Create a count table
countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p, AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE)

rownames(countTable)=names(tr_by_gene)

Is there a way to get lengths and gc content from 'tr_by-gene'? Is there a way to create a bed file from 'tr_by_gene' so that I can then use bedtools nuc?

Thank you.

Matt