[BioC] countOverlaps Warnings
Valerie Obenchain
vobencha at fhcrc.org
Wed Apr 23 18:49:34 CEST 2014
Pankaj,
How did it go? Were you able to get readGAlignmentPairs() or
readGAlignmentsList() to work for your case?
Valerie
On 04/21/2014 01:32 PM, Valerie Obenchain wrote:
> Hi Pankaj,
>
> There are several options for counting paired-end reads.
>
> Both readGAlignmentPairs() and readGAlignmentsList() are appropriate for
> paired-end data and are in the GenomicAlignments package. They use the
> same counting algorithm but return the data in different containers.
> readGAlignmentPairs() returns only pairs while readGAlignmentsList()
> returns pairs as well as singletons, reads with unmapped pairs etc. See
> the ?readGAlignments man page for details.
>
> Another counting function is summarizeOverlaps() which uses the above
> functions 'under the hood'. It returns the counts in the 'assays' slot
> of a SummarizedExperiment object. Since you have multiple bam files to
> count you may want to use the BamFileList method.
>
> The man page has examples of counting paired-end bam files.
> library(Rsamtools)
> ?summarizeOverlaps
>
> Other packages that offer count functions are Rsubread and easyRNASeq.
>
> Valerie
>
>
>
>
> On 04/21/2014 09:53 AM, Pankaj Agarwal wrote:
>> Hi,
>>
>> I am analyzing a matched pair tumor/normal rna-seq data set. After
>> aligning with bowtie2 against UCSC hg19 index, I am trying to get the
>> counts for each of the samples using the iGenomes UCSC GTF file. I
>> came across the following tutorial which shows different ways to do it
>> in Bioconductor:
>> http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_12_16_2013/Rrnaseq/Rrnaseq.pdf
>>
>>
>> Following along slide 28 "Read Counting with countOverlaps", I am able
>> to generate the counts but get the following Warnings. I just want to
>> be sure that there is nothing to be worried about because I don't
>> understand the meaning of these warnings. My code is as follows:
>>
>>> library(GenomicFeatures)
>> Loading required package: AnnotationDbi
>>> library(Rsamtools)
>>> library(rtracklayer)
>>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite")
>>> eByg <- exonsBy(txdb, by="gene")
>>> head(eByg)
>> GRangesList of length 6:
>> $A1BG
>> GRanges with 8 ranges and 2 metadata columns:
>> seqnames ranges strand | exon_id exon_name
>> <Rle> <IRanges> <Rle> | <integer> <character>
>> [1] chr19 [58858172, 58858395] - | 163177 <NA>
>> [2] chr19 [58858719, 58859006] - | 163178 <NA>
>> [3] chr19 [58861736, 58862017] - | 163179 <NA>
>> [4] chr19 [58862757, 58863053] - | 163180 <NA>
>> [5] chr19 [58863649, 58863921] - | 163181 <NA>
>> [6] chr19 [58864294, 58864563] - | 163182 <NA>
>> [7] chr19 [58864658, 58864693] - | 163183 <NA>
>> [8] chr19 [58864770, 58864865] - | 163184 <NA>
>>
>> $A1BG-AS1
>> GRanges with 4 ranges and 2 metadata columns:
>> seqnames ranges strand | exon_id exon_name
>> [1] chr19 [58863336, 58864410] + | 156640 <NA>
>> [2] chr19 [58864745, 58864840] + | 156641 <NA>
>> [3] chr19 [58865080, 58865223] + | 156642 <NA>
>> [4] chr19 [58865735, 58866549] + | 156643 <NA>
>>
>> ...
>> <4 more elements>
>> ---
>> seqlengths:
>> chr12 chr8 ... chr1_gl000192_random
>> NA NA ... NA
>>> samples = c("BRPC13-1118_L1.D710_501.sorted",
>>> "BRPC_13-764_L2.sorted", "DU04_PBMC_RNA_L1.sorted",
>>> "DU06_PBMC_RNA_L1.sorted")
>>>
>>> samples
>> [1] "BRPC13-1118_L1.D710_501.sorted" "BRPC_13-764_L2.sorted"
>> [3] "DU04_PBMC_RNA_L1.sorted" "DU06_PBMC_RNA_L1.sorted"
>>>
>>> samplespath <- paste("./", samples, ".bam", sep="")
>>>
>>> samplespath
>> [1] "./BRPC13-1118_L1.D710_501.sorted.bam"
>> [2] "./BRPC_13-764_L2.sorted.bam"
>> [3] "./DU04_PBMC_RNA_L1.sorted.bam"
>> [4] "./DU06_PBMC_RNA_L1.sorted.bam"
>>>
>>> names(samplespath) <- samples
>>>
>>> samplespath
>> BRPC13-1118_L1.D710_501.sorted
>> BRPC_13-764_L2.sorted
>> "./BRPC13-1118_L1.D710_501.sorted.bam"
>> "./BRPC_13-764_L2.sorted.bam"
>> DU04_PBMC_RNA_L1.sorted
>> DU06_PBMC_RNA_L1.sorted
>> "./DU04_PBMC_RNA_L1.sorted.bam"
>> "./DU06_PBMC_RNA_L1.sorted.bam"
>>>
>>> countDF <- data.frame(row.names=names(eByg))
>>>
>>> countDF
>> data frame with 0 columns and 23710 rows
>>>
>>> dim(countDF)
>> [1] 23710 0
>>>
>>> for(i in samplespath) {
>> + aligns <- readGAlignmentsFromBam(i)
>> + counts <- countOverlaps(eByg, aligns, ignore.strand=TRUE)
>> + countDF <- cbind(countDF, counts)
>> + }
>> Warning messages:
>> 1: In .Seqinfo.mergexy(x, y) :
>> Each of the 2 combined objects has sequence levels not in the other:
>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7,
>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1,
>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random,
>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random,
>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220,
>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219,
>> chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random,
>> chr1_gl000192_random
>> - in 'y': chrM
>> Make sure to always combine/compare objects based on the same
>> reference
>> genome (use suppressWarnings() to suppress this warning).
>> 2: In .Seqinfo.mergexy(x, y) :
>> Each of the 2 combined objects has sequence levels not in the other:
>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7,
>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1,
>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random,
>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random,
>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220,
>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219,
>> chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random,
>> chr1_gl000192_random
>> - in 'y': chrM
>> Make sure to always combine/compare objects based on the same
>> reference
>> genome (use suppressWarnings() to suppress this warning).
>> 3: In .Seqinfo.mergexy(x, y) :
>> Each of the 2 combined objects has sequence levels not in the other:
>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7,
>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1,
>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random,
>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random,
>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220,
>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219,
>> chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random,
>> chr1_gl000192_random
>> - in 'y': chrM
>> Make sure to always combine/compare objects based on the same
>> reference
>> genome (use suppressWarnings() to suppress this warning).
>> 4: In .Seqinfo.mergexy(x, y) :
>> Each of the 2 combined objects has sequence levels not in the other:
>> - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7,
>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1,
>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random,
>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random,
>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220,
>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219,
>> chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random,
>> chr1_gl000192_random
>> - in 'y': chrM
>> Make sure to always combine/compare objects based on the same
>> reference
>> genome (use suppressWarnings() to suppress this warning).
>>>
>>
>> I also wanted to verify if what I am doing is correct for paired end
>> reads.
>>
>> Thanks,
>>
>> - Pankaj
>>
>>
>> =======================================
>> sessionInfo()
>> R version 3.0.3 (2014-03-06)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109
Email: vobencha at fhcrc.org
Phone: (206) 667-3158
More information about the Bioconductor
mailing list