[BioC] What is the easiest way to make the GC content and length matrix for package cqn?

Wed Apr 9 00:23:30 CEST 2014

On 04/08/2014 02:33 PM, Harris A. Jaffee wrote:
> I would like to know the answer!  This is all I have.
>
>> width(tr_by_gene)
> IntegerList of length 25529
> [["1"]] 6694
> [["10"]] 9969
> [["100"]] 32214
> [["1000"]] 226516
> [["10000"]] 355050 343564 343866 355040 343554 343856
> [["100008586"]] 187588 6118 6109
> [["100008587"]] 156 156 156 156 156 156 156 156
> [["100008588"]] 1869 1869 1869 1869 1869 1869 1869
> [["100008589"]] 188093 5066 5077 5070
> [["100009613"]] 2915
> ...
> <25519 more elements>
>
> Then my guess would be to build the "DNAStringSetList" (see Biostrings)
> underlying your transcripts, say x, and do
>
> 	sapply(x, letterFrequency, letters="GC")
>
> Hopefully there's something better.

Yes, that's the idea. You want to try avoiding the sapply() though,
which is generally slow:

   ## Using BSgenome.Hsapiens.NCBI.GRCh38 for now (which hg38 is based
   ## on but, sadly, the reference sequences are named differently).
   ## Maybe we should have BSgenome.Hsapiens.NCBI.hg38 too, so it's
   ## easier to work with TranscriptDb objects obtained from UCSC.
   library(BSgenome)
   genome <- getBSgenome("GRCh38")

   ## The following hack would not be needed if hg38 and GRCh38 were
   ## using the same chromosome names. We'll keep only chr1-22, chrX,
   ## chrY, chrM, because it's too complicated to safely map and rename
   ## the other sequences.
   seqlevels(tr_by_gene, force=TRUE) <- seqlevels(tr_by_gene)[1:25]
   seqlevels(genome)[1:25] <- seqlevels(tr_by_gene)
   genome(tr_by_gene) <- "GRCh38"

   ## Extract the transcript sequences grouped by gene:
   trseq_by_gene <- getSeq(genome, tr_by_gene)

   ## Compute the transcript GC counts grouped by gene:
   af <- alphabetFrequency(unlist(trseq_by_gene, use.names=FALSE), 
baseOnly=TRUE)
   trGCcount_by_gene <- relist(af[ , "C"] + af[ , "G"], trseq_by_gene)

Then:

   > trGCcount_by_gene
   IntegerList of length 24443
   [["1"]] 3856
   [["10"]] 3697
   [["100"]] 17142
   [["1000"]] 84322
   [["100008586"]] 3156 2539 2537
   [["100009613"]] 1578
   [["100009676"]] 1463
   [["10001"]] 6854 6854 6854 6854
   [["10002"]] 2652 4134
   [["10003"]] 20205
   ...
   <24433 more elements>

Note that BSgenome.Hsapiens.NCBI.GRCh38 is only available in BioC 2.14.
I'll think of a way to make a thin BSgenome.Hsapiens.UCSC.hg38 package
that depends on BSgenome.Hsapiens.NCBI.GRCh38 for the sequences but
presents them to the user with the UCSC names.

Cheers,
H.

>
> On Apr 8, 2014, at 4:03 PM, Thornton, Matthew wrote:
>
>> Hello!
>>
>> I have some RNASeq data that I am analyzing with edgeR and I would like to use the cqn package to correct for GC bias.  I have aligned the data to the UCSC hg38 genome.
>>
>>>  From googling I have found the the bedtools 'nuc' command will give me the GC content with ranges and the length. Providing that I have a bed file of the hg38 genome.
>>
>> What I need to make sure of is that I am calculating the length and the GC content for the correct intervals. I bin my reads using GenomicFeatures like this (thank you Homer)
>>
>> # Load Libraries
>> library(GenomicFeatures)
>> library(GenomicRanges)
>>
>> txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene')
>>
>> tr_by_gene <- transcriptsBy(txdb,'gene')
>>
>> library(Rsamtools)
>>
>> r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam")
>> r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam")
>> r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam")
>> r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam")
>>
>> # counts reads overlapping the genes.
>> c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m)
>> c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p)
>> c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m)
>> c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p)
>>
>> # Create a count table
>> countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p, AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE)
>>
>> rownames(countTable)=names(tr_by_gene)
>>
>> Is there a way to get lengths and gc content from 'tr_by-gene'? Is there a way to create a bed file from 'tr_by_gene' so that I can then use bedtools nuc?
>>
>> Thank you.
>>
>> Matt
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319