[BioC] What is the easiest way to make the GC content and length matrix for package cqn?
Hervé Pagès
hpages at fhcrc.org
Wed Apr 9 00:23:30 CEST 2014
On 04/08/2014 02:33 PM, Harris A. Jaffee wrote:
> I would like to know the answer! This is all I have.
>
>> width(tr_by_gene)
> IntegerList of length 25529
> [["1"]] 6694
> [["10"]] 9969
> [["100"]] 32214
> [["1000"]] 226516
> [["10000"]] 355050 343564 343866 355040 343554 343856
> [["100008586"]] 187588 6118 6109
> [["100008587"]] 156 156 156 156 156 156 156 156
> [["100008588"]] 1869 1869 1869 1869 1869 1869 1869
> [["100008589"]] 188093 5066 5077 5070
> [["100009613"]] 2915
> ...
> <25519 more elements>
>
> Then my guess would be to build the "DNAStringSetList" (see Biostrings)
> underlying your transcripts, say x, and do
>
> sapply(x, letterFrequency, letters="GC")
>
> Hopefully there's something better.
Yes, that's the idea. You want to try avoiding the sapply() though,
which is generally slow:
## Using BSgenome.Hsapiens.NCBI.GRCh38 for now (which hg38 is based
## on but, sadly, the reference sequences are named differently).
## Maybe we should have BSgenome.Hsapiens.NCBI.hg38 too, so it's
## easier to work with TranscriptDb objects obtained from UCSC.
library(BSgenome)
genome <- getBSgenome("GRCh38")
## The following hack would not be needed if hg38 and GRCh38 were
## using the same chromosome names. We'll keep only chr1-22, chrX,
## chrY, chrM, because it's too complicated to safely map and rename
## the other sequences.
seqlevels(tr_by_gene, force=TRUE) <- seqlevels(tr_by_gene)[1:25]
seqlevels(genome)[1:25] <- seqlevels(tr_by_gene)
genome(tr_by_gene) <- "GRCh38"
## Extract the transcript sequences grouped by gene:
trseq_by_gene <- getSeq(genome, tr_by_gene)
## Compute the transcript GC counts grouped by gene:
af <- alphabetFrequency(unlist(trseq_by_gene, use.names=FALSE),
baseOnly=TRUE)
trGCcount_by_gene <- relist(af[ , "C"] + af[ , "G"], trseq_by_gene)
Then:
> trGCcount_by_gene
IntegerList of length 24443
[["1"]] 3856
[["10"]] 3697
[["100"]] 17142
[["1000"]] 84322
[["100008586"]] 3156 2539 2537
[["100009613"]] 1578
[["100009676"]] 1463
[["10001"]] 6854 6854 6854 6854
[["10002"]] 2652 4134
[["10003"]] 20205
...
<24433 more elements>
Note that BSgenome.Hsapiens.NCBI.GRCh38 is only available in BioC 2.14.
I'll think of a way to make a thin BSgenome.Hsapiens.UCSC.hg38 package
that depends on BSgenome.Hsapiens.NCBI.GRCh38 for the sequences but
presents them to the user with the UCSC names.
Cheers,
H.
>
> On Apr 8, 2014, at 4:03 PM, Thornton, Matthew wrote:
>
>> Hello!
>>
>> I have some RNASeq data that I am analyzing with edgeR and I would like to use the cqn package to correct for GC bias. I have aligned the data to the UCSC hg38 genome.
>>
>>> From googling I have found the the bedtools 'nuc' command will give me the GC content with ranges and the length. Providing that I have a bed file of the hg38 genome.
>>
>> What I need to make sure of is that I am calculating the length and the GC content for the correct intervals. I bin my reads using GenomicFeatures like this (thank you Homer)
>>
>> # Load Libraries
>> library(GenomicFeatures)
>> library(GenomicRanges)
>>
>> txdb <- makeTranscriptDbFromUCSC(genome='hg38',tablename='refGene')
>>
>> tr_by_gene <- transcriptsBy(txdb,'gene')
>>
>> library(Rsamtools)
>>
>> r_AE89_m <- readGAlignmentsFromBam("sorted_trimmed_AE89_minus.bam")
>> r_AE89_p <- readGAlignmentsFromBam("sorted_trimmed_AE89_plus.1.bam")
>> r_AE90_m <- readGAlignmentsFromBam("sorted_trimmed_AE90_minus.bam")
>> r_AE90_p <- readGAlignmentsFromBam("sorted_trimmed_AE90_plus.bam")
>>
>> # counts reads overlapping the genes.
>> c_AE89_m <- countOverlaps(tr_by_gene,r_AE89_m)
>> c_AE89_p <- countOverlaps(tr_by_gene,r_AE89_p)
>> c_AE90_m <- countOverlaps(tr_by_gene,r_AE90_m)
>> c_AE90_p <- countOverlaps(tr_by_gene,r_AE90_p)
>>
>> # Create a count table
>> countTable <- data.frame(AE89_minus=c_AE89_m, AE89_1_plus=c_AE89_p, AE90_minus=c_AE90_m, AE90_plus=c_AE90_p, stringsAsFactors=FALSE)
>>
>> rownames(countTable)=names(tr_by_gene)
>>
>> Is there a way to get lengths and gc content from 'tr_by-gene'? Is there a way to create a bed file from 'tr_by_gene' so that I can then use bedtools nuc?
>>
>> Thank you.
>>
>> Matt
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list