[BioC] countOverlaps Warnings
Valerie Obenchain
vobencha at fhcrc.org
Mon Apr 21 22:32:58 CEST 2014
Hi Pankaj,
There are several options for counting paired-end reads.
Both readGAlignmentPairs() and readGAlignmentsList() are appropriate for
paired-end data and are in the GenomicAlignments package. They use the
same counting algorithm but return the data in different containers.
readGAlignmentPairs() returns only pairs while readGAlignmentsList()
returns pairs as well as singletons, reads with unmapped pairs etc. See
the ?readGAlignments man page for details.
Another counting function is summarizeOverlaps() which uses the above
functions 'under the hood'. It returns the counts in the 'assays' slot
of a SummarizedExperiment object. Since you have multiple bam files to
count you may want to use the BamFileList method.
The man page has examples of counting paired-end bam files.
library(Rsamtools)
?summarizeOverlaps
Other packages that offer count functions are Rsubread and easyRNASeq.
Valerie
On 04/21/2014 09:53 AM, Pankaj Agarwal wrote:
> Hi,
>
> I am analyzing a matched pair tumor/normal rna-seq data set. After aligning with bowtie2 against UCSC hg19 index, I am trying to get the counts for each of the samples using the iGenomes UCSC GTF file. I came across the following tutorial which shows different ways to do it in Bioconductor:
> http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_12_16_2013/Rrnaseq/Rrnaseq.pdf
>
> Following along slide 28 "Read Counting with countOverlaps", I am able to generate the counts but get the following Warnings. I just want to be sure that there is nothing to be worried about because I don't understand the meaning of these warnings. My code is as follows:
>
>> library(GenomicFeatures)
> Loading required package: AnnotationDbi
>> library(Rsamtools)
>> library(rtracklayer)
>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite")
>> eByg <- exonsBy(txdb, by="gene")
>> head(eByg)
> GRangesList of length 6:
> $A1BG
> GRanges with 8 ranges and 2 metadata columns:
> seqnames ranges strand | exon_id exon_name
> <Rle> <IRanges> <Rle> | <integer> <character>
> [1] chr19 [58858172, 58858395] - | 163177 <NA>
> [2] chr19 [58858719, 58859006] - | 163178 <NA>
> [3] chr19 [58861736, 58862017] - | 163179 <NA>
> [4] chr19 [58862757, 58863053] - | 163180 <NA>
> [5] chr19 [58863649, 58863921] - | 163181 <NA>
> [6] chr19 [58864294, 58864563] - | 163182 <NA>
> [7] chr19 [58864658, 58864693] - | 163183 <NA>
> [8] chr19 [58864770, 58864865] - | 163184 <NA>
>
> $A1BG-AS1
> GRanges with 4 ranges and 2 metadata columns:
> seqnames ranges strand | exon_id exon_name
> [1] chr19 [58863336, 58864410] + | 156640 <NA>
> [2] chr19 [58864745, 58864840] + | 156641 <NA>
> [3] chr19 [58865080, 58865223] + | 156642 <NA>
> [4] chr19 [58865735, 58866549] + | 156643 <NA>
>
> ...
> <4 more elements>
> ---
> seqlengths:
> chr12 chr8 ... chr1_gl000192_random
> NA NA ... NA
>> samples = c("BRPC13-1118_L1.D710_501.sorted", "BRPC_13-764_L2.sorted", "DU04_PBMC_RNA_L1.sorted", "DU06_PBMC_RNA_L1.sorted")
>>
>> samples
> [1] "BRPC13-1118_L1.D710_501.sorted" "BRPC_13-764_L2.sorted"
> [3] "DU04_PBMC_RNA_L1.sorted" "DU06_PBMC_RNA_L1.sorted"
>>
>> samplespath <- paste("./", samples, ".bam", sep="")
>>
>> samplespath
> [1] "./BRPC13-1118_L1.D710_501.sorted.bam"
> [2] "./BRPC_13-764_L2.sorted.bam"
> [3] "./DU04_PBMC_RNA_L1.sorted.bam"
> [4] "./DU06_PBMC_RNA_L1.sorted.bam"
>>
>> names(samplespath) <- samples
>>
>> samplespath
> BRPC13-1118_L1.D710_501.sorted BRPC_13-764_L2.sorted
> "./BRPC13-1118_L1.D710_501.sorted.bam" "./BRPC_13-764_L2.sorted.bam"
> DU04_PBMC_RNA_L1.sorted DU06_PBMC_RNA_L1.sorted
> "./DU04_PBMC_RNA_L1.sorted.bam" "./DU06_PBMC_RNA_L1.sorted.bam"
>>
>> countDF <- data.frame(row.names=names(eByg))
>>
>> countDF
> data frame with 0 columns and 23710 rows
>>
>> dim(countDF)
> [1] 23710 0
>>
>> for(i in samplespath) {
> + aligns <- readGAlignmentsFromBam(i)
> + counts <- countOverlaps(eByg, aligns, ignore.strand=TRUE)
> + countDF <- cbind(countDF, counts)
> + }
> Warning messages:
> 1: In .Seqinfo.mergexy(x, y) :
> Each of the 2 combined objects has sequence levels not in the other:
> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random
> - in 'y': chrM
> Make sure to always combine/compare objects based on the same reference
> genome (use suppressWarnings() to suppress this warning).
> 2: In .Seqinfo.mergexy(x, y) :
> Each of the 2 combined objects has sequence levels not in the other:
> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random
> - in 'y': chrM
> Make sure to always combine/compare objects based on the same reference
> genome (use suppressWarnings() to suppress this warning).
> 3: In .Seqinfo.mergexy(x, y) :
> Each of the 2 combined objects has sequence levels not in the other:
> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random
> - in 'y': chrM
> Make sure to always combine/compare objects based on the same reference
> genome (use suppressWarnings() to suppress this warning).
> 4: In .Seqinfo.mergexy(x, y) :
> Each of the 2 combined objects has sequence levels not in the other:
> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random
> - in 'y': chrM
> Make sure to always combine/compare objects based on the same reference
> genome (use suppressWarnings() to suppress this warning).
>>
>
> I also wanted to verify if what I am doing is correct for paired end reads.
>
> Thanks,
>
> - Pankaj
>
>
> =======================================
> sessionInfo()
> R version 3.0.3 (2014-03-06)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109
Email: vobencha at fhcrc.org
Phone: (206) 667-3158
More information about the Bioconductor
mailing list