[BioC] countOverlaps Warnings

Thu Apr 24 23:07:15 CEST 2014

I think you have an older version of summarizeOverlaps() which had a 
restriction on using 'yieldSize' with paired-end and emits the warning 
you see below. The reason you didn't see the warning the first time is 
because the reads were being treated as single-end (hence almost double 
the count difference).

To update to the current version you'll need R 3.1 and Bioconductor 2.14 
(your sessionInfo() below shows R 3.0.3). Even with the update there is 
no guarantee the counts you see with htseq-count and summarizeOverlaps 
will be 100% the same. The pairing algorithm (i.e., how we define which 
reads are true pairs) is documented on the ?readGAlignments man page in 
the 'Pairing criteria' section. The HTSeq docs probably define how they 
determine pairs.

Once you update you may also want to check out the 'fragments' argument 
to summarizeOverlaps(). This allows you to read in not only the records 
paired according to the algorithm but records with unmapped pairs, 
singletons, and other fragments. This may not be appropriate for your 
current task but does provide and interesting separation of the paired 
reads and 'other' fragments.

Valerie

On 04/24/14 12:44, Pankaj Agarwal wrote:
> Yes, these are p.e reads.  I made the changes you suggested as follows, everything else being the same as before:
>
> bfl <- BamFileList(samplespath, yieldSize=50000, index=character(), asMates=TRUE)
> ...
> countDF_intsectionNotEmpty_042414A <- summarizeOverlaps(eByg, bfl, mode="IntersectionNotEmpty", ignore.strand=TRUE, singleEnd=FALSE)
> Warning in FUN(bf, param = param) : 'yieldSize' set to 'NA'
> Warning in FUN(bf, param = param) : 'yieldSize' set to 'NA'
> ...
>
> Not sure why I got these warning this time (did not see them last time).
>
> The results are much closer to what I got with htseq-count, though still not very close:
>
> htseq-count:
>>                       A.bam  B. bam  C. bam  D. bam
>> A1BG            7            67             90     72
>> A1BG-AS1     1            69          41       51
>> A1CF             2          16            0        3
>> A2M              1536    10866        34      173
>
> summarizeOverlaps:
>>                       A.bam  B. bam  C. bam  D. bam
>> A1BG            7            65             78     66
>> A1BG-AS1     1            66          37       47
>> A1CF             2          16            0        3
>> A2M              1001    6897        28      160
>
> I got the GTF files from the following website:
>
> http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn
>
> Homo sapiens
> Ensembl	GRCh37
> NCBI	build37.2
> UCSC	hg19
>
> I used the UCSC hg19 - which has the bowtie2 index, which I used for alignment with bowtie2, and the GTF file, which I used as the annotation source.  This same GTF was used for htseq-count also.
>
> Thanks,
>
> - Pankaj
>
> ________________________________________
> From: Valerie Obenchain [vobencha at fhcrc.org]
> Sent: Thursday, April 24, 2014 9:39 AM
> To: Pankaj Agarwal; bioconductor at r-project.org
> Subject: Re: [BioC] countOverlaps Warnings
>
> Are these paired-end reads? If yes set 'singleEnd=FALSE' in the
> summarizeOverlaps() call.
>
>     summarizeOverlaps(eByg, bfl, "IntersectionNotEmpty",
>                       ignore.strand=TRUE, singleEnd=FALSE)
>
> To 'stamp' the BamFileList as containing paired-end reads set the
> 'asMates' arg to TRUE.
>
>     bfl <- BamFileList(samplespath, asMates=TRUE)
>
> 'yieldSize' reads data in by chunks and is useful when the bam files are
> too large to fit in memory.
>
> Can you point me to where you download 'genes.gtf'?
>
> Valerie
>
> On 04/23/14 19:40, Pankaj Agarwal wrote:
>> Valerie,
>>
>> I used summarizeOverlaps() instead, since it seemed simpler to use.  I also used htseq-count to get the counts using the same parameters, but I get very different results.
>>   From what I have read and understood, both should give similar results since the counts should be reported at the "gene" level and not the "isoform" level.
>> I would feel much more comfortable with the results if they were both at least close.
>>
>> For both I used the UCSC GTF file in the iGenomes distribution.
>> For htseq-count I used BAM files sorted by name since it required it, but for summarizeOverlaps I used BAM files sorted by coordinates (the default sorting done by samtools).
>> Example of counts generated by the two methods are given after the code:
>>
>> The code for htseq-count that I used is as follows (once for each sample):
>>
>> samtools view -h A.sorted_byname.bam | htseq-count -t exon -i gene_id -m intersection-nonempty -s no - iGenomes/UCSC/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf > A.htseq.count
>>
>>    The code for summarizeOverlaps() is as follows:
>>
>>> library(GenomicFeatures)
>> Loading required package: BiocGenerics
>> Loading required package: parallel
>> ...
>>> library(Rsamtools)
>> Loading required package: Biostrings
>>> library(rtracklayer)
>>> library(GenomicRanges)
>>>
>>> txdb <- makeTranscriptDbFromGFF(file=iGenomes/UCSC/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf", format="gtf")
>> extracting transcript information
>> Estimating transcript ranges.
>> Extracting gene IDs
>> Processing splicing information for gtf file.
>> Deducing exon rank from relative coordinates provided
>> Prepare the 'metadata' data frame ... metadata: OK
>> Now generating chrominfo from available sequence names. No chromosome length information is available.
>> Warning messages:
>> 1: In .deduceExonRankings(exs, format = "gtf") :
>>     Infering Exon Rankings.  If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
>> 2: In matchCircularity(chroms, circ_seqs) :
>>     None of the strings in your circ_seqs argument match your seqnames.
>>>
>>
>> - are these warnings message problems?
>>
>>> saveDb(txdb, file="./data/ucsc_igenomes_hg19_gtf.sqlite")
>>>
>>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite")
>>>
>>> eByg <- exonsBy(txdb, by="gene")
>>>
>>> head(eByg)
>> GRangesList of length 6:
>> $A1BG
>> GRanges with 8 ranges and 2 metadata columns:
>>         seqnames               ranges strand |   exon_id   exon_name
>>            <Rle>            <IRanges>  <Rle> | <integer> <character>
>>     [1]    chr19 [58858172, 58858395]      - |    163177        <NA>
>>     [2]    chr19 [58858719, 58859006]      - |    163178        <NA>
>>>
>>> samples = c("A.sorted", "B.sorted", "C.sorted", "D.sorted")
>>>
>>> samples
>> [1] "A.sorted" "B.sorted"
>> [3] "C.sorted" "D.sorted"
>>>
>>> samplespath
>> [1] "../A.sorted.bam"
>> [2] "../B.bam"
>> [3] "../C.sorted.bam"
>> [4] "../D.sorted.bam"
>>>
>>> names(samplespath) <- samples
>>> bfl <- BamFileList(samplespath, yieldSize=50000, index=character())
>>
>> I am not sure what yieldSize means, just used it from the example I followed.
>>
>>> bfl
>> BamFileList of length 4
>> names(4): A.sorted ... B.sorted
>>>
>>> countDF_intsectionNotEmpty <- summarizeOverlaps(eByg, bfl, mode="IntersectionNotEmpty", ignore.strand=TRUE)
>>>
>>> head(countDF_intsectionNotEmpty)
>> class: SummarizedExperiment
>> dim: 1 4
>> exptData(0):
>> assays(1): counts
>> rownames(1): A1BG
>> rowData metadata column names(0):
>> colnames(4): A1.sorted B.sorted  C.sorted D.sorted
>> colData names(0):
>>>
>>> class(countDF_intsectionNotEmpty)
>> [1] "SummarizedExperiment"
>> attr(,"package")
>> [1] "GenomicRanges"
>>>
>>> countDF_intsectionNotEmpty <- assays(countDF_intsectionNotEmpty)$counts
>>> colnames(countDF_intsectionNotEmpty) <- samples
>>> countDF_intsectionNotEmpty[1:4,]
>>                A.sorted   B.sorted   C.sorted   D.sorted
>> A1BG            13         119                162                  136
>> A1BG-AS1   2         112              52                   85
>> A1CF             3         28                0                    5
>> A2M              2543  17593       47                233
>> ...
>>
>> The counts that I get are very different from what I get using htseq-count, which are following for the same genes listed above:
>>
>>                       A.bam  B. bam  C. bam  D. bam
>> A1BG            7            67             90              72
>> A1BG-AS1  1            69          41      51
>> A1CF             2          16            0        3
>> A2M         1536    10866     34            173
>>    ...
>> ...
>>
>> I would appreciate your feedback.
>>
>> Thanks,
>>
>> - Pankaj
>>
>> -----Original Message-----
>> From: Valerie Obenchain [mailto:vobencha at fhcrc.org]
>> Sent: Wednesday, April 23, 2014 12:50 PM
>> To: Pankaj Agarwal; bioconductor at r-project.org
>> Subject: Re: [BioC] countOverlaps Warnings
>>
>> Pankaj,
>>
>> How did it go? Were you able to get readGAlignmentPairs() or
>> readGAlignmentsList() to work for your case?
>>
>> Valerie
>>
>>
>> On 04/21/2014 01:32 PM, Valerie Obenchain wrote:
>>> Hi Pankaj,
>>>
>>> There are several options for counting paired-end reads.
>>>
>>> Both readGAlignmentPairs() and readGAlignmentsList() are appropriate
>>> for paired-end data and are in the GenomicAlignments package. They use
>>> the same counting algorithm but return the data in different containers.
>>> readGAlignmentPairs() returns only pairs while readGAlignmentsList()
>>> returns pairs as well as singletons, reads with unmapped pairs etc.
>>> See the ?readGAlignments man page for details.
>>>
>>> Another counting function is summarizeOverlaps() which uses the above
>>> functions 'under the hood'. It returns the counts in the 'assays' slot
>>> of a SummarizedExperiment object. Since you have multiple bam files to
>>> count you may want to use the BamFileList method.
>>>
>>> The man page has examples of counting paired-end bam files.
>>> library(Rsamtools)
>>> ?summarizeOverlaps
>>>
>>> Other packages that offer count functions are Rsubread and easyRNASeq.
>>>
>>> Valerie
>>>
>>>
>>>
>>>
>>> On 04/21/2014 09:53 AM, Pankaj Agarwal wrote:
>>>> Hi,
>>>>
>>>> I am analyzing a matched pair tumor/normal rna-seq data set.  After
>>>> aligning with bowtie2 against UCSC hg19 index, I am trying to get the
>>>> counts for each of the samples using the iGenomes UCSC GTF file.  I
>>>> came across the following tutorial which shows different ways to do
>>>> it in Bioconductor:
>>>> http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_De
>>>> c_12_16_2013/Rrnaseq/Rrnaseq.pdf
>>>>
>>>>
>>>> Following along slide 28 "Read Counting with countOverlaps", I am
>>>> able to generate the counts but get the following Warnings.  I just
>>>> want to be sure that there is nothing to be worried about because I
>>>> don't understand the meaning of these warnings.  My code is as follows:
>>>>
>>>>> library(GenomicFeatures)
>>>> Loading required package: AnnotationDbi
>>>>> library(Rsamtools)
>>>>> library(rtracklayer)
>>>>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite")
>>>>> eByg <- exonsBy(txdb, by="gene")
>>>>> head(eByg)
>>>> GRangesList of length 6:
>>>> $A1BG
>>>> GRanges with 8 ranges and 2 metadata columns:
>>>>          seqnames               ranges strand |   exon_id   exon_name
>>>>             <Rle>            <IRanges>  <Rle> | <integer> <character>
>>>>      [1]    chr19 [58858172, 58858395]      - |    163177        <NA>
>>>>      [2]    chr19 [58858719, 58859006]      - |    163178        <NA>
>>>>      [3]    chr19 [58861736, 58862017]      - |    163179        <NA>
>>>>      [4]    chr19 [58862757, 58863053]      - |    163180        <NA>
>>>>      [5]    chr19 [58863649, 58863921]      - |    163181        <NA>
>>>>      [6]    chr19 [58864294, 58864563]      - |    163182        <NA>
>>>>      [7]    chr19 [58864658, 58864693]      - |    163183        <NA>
>>>>      [8]    chr19 [58864770, 58864865]      - |    163184        <NA>
>>>>
>>>> $A1BG-AS1
>>>> GRanges with 4 ranges and 2 metadata columns:
>>>>          seqnames               ranges strand | exon_id exon_name
>>>>      [1]    chr19 [58863336, 58864410]      + |  156640      <NA>
>>>>      [2]    chr19 [58864745, 58864840]      + |  156641      <NA>
>>>>      [3]    chr19 [58865080, 58865223]      + |  156642      <NA>
>>>>      [4]    chr19 [58865735, 58866549]      + |  156643      <NA>
>>>>
>>>> ...
>>>> <4 more elements>
>>>> ---
>>>> seqlengths:
>>>>                     chr12                  chr8 ...  chr1_gl000192_random
>>>>                        NA                    NA ...                    NA
>>>>> samples = c("BRPC13-1118_L1.D710_501.sorted",
>>>>> "BRPC_13-764_L2.sorted", "DU04_PBMC_RNA_L1.sorted",
>>>>> "DU06_PBMC_RNA_L1.sorted")
>>>>>
>>>>> samples
>>>> [1] "BRPC13-1118_L1.D710_501.sorted" "BRPC_13-764_L2.sorted"
>>>> [3] "DU04_PBMC_RNA_L1.sorted"        "DU06_PBMC_RNA_L1.sorted"
>>>>>
>>>>> samplespath <- paste("./", samples, ".bam", sep="")
>>>>>
>>>>> samplespath
>>>> [1] "./BRPC13-1118_L1.D710_501.sorted.bam"
>>>> [2] "./BRPC_13-764_L2.sorted.bam"
>>>> [3] "./DU04_PBMC_RNA_L1.sorted.bam"
>>>> [4] "./DU06_PBMC_RNA_L1.sorted.bam"
>>>>>
>>>>> names(samplespath) <- samples
>>>>>
>>>>> samplespath
>>>>            BRPC13-1118_L1.D710_501.sorted BRPC_13-764_L2.sorted
>>>> "./BRPC13-1118_L1.D710_501.sorted.bam"
>>>> "./BRPC_13-764_L2.sorted.bam"
>>>>                   DU04_PBMC_RNA_L1.sorted DU06_PBMC_RNA_L1.sorted
>>>>           "./DU04_PBMC_RNA_L1.sorted.bam"
>>>> "./DU06_PBMC_RNA_L1.sorted.bam"
>>>>>
>>>>> countDF <- data.frame(row.names=names(eByg))
>>>>>
>>>>> countDF
>>>> data frame with 0 columns and 23710 rows
>>>>>
>>>>> dim(countDF)
>>>> [1] 23710     0
>>>>>
>>>>> for(i in samplespath) {
>>>> + aligns <- readGAlignmentsFromBam(i) counts <- countOverlaps(eByg,
>>>> + aligns, ignore.strand=TRUE) countDF <- cbind(countDF, counts) }
>>>> Warning messages:
>>>> 1: In .Seqinfo.mergexy(x, y) :
>>>>      Each of the 2 combined objects has sequence levels not in the other:
>>>>      - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7,
>>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1,
>>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random,
>>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random,
>>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220,
>>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218,
>>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random,
>>>> chr4_gl000194_random, chr1_gl000192_random
>>>>      - in 'y': chrM
>>>>      Make sure to always combine/compare objects based on the same
>>>> reference
>>>>      genome (use suppressWarnings() to suppress this warning).
>>>> 2: In .Seqinfo.mergexy(x, y) :
>>>>      Each of the 2 combined objects has sequence levels not in the other:
>>>>      - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7,
>>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1,
>>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random,
>>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random,
>>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220,
>>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218,
>>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random,
>>>> chr4_gl000194_random, chr1_gl000192_random
>>>>      - in 'y': chrM
>>>>      Make sure to always combine/compare objects based on the same
>>>> reference
>>>>      genome (use suppressWarnings() to suppress this warning).
>>>> 3: In .Seqinfo.mergexy(x, y) :
>>>>      Each of the 2 combined objects has sequence levels not in the other:
>>>>      - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7,
>>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1,
>>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random,
>>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random,
>>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220,
>>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218,
>>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random,
>>>> chr4_gl000194_random, chr1_gl000192_random
>>>>      - in 'y': chrM
>>>>      Make sure to always combine/compare objects based on the same
>>>> reference
>>>>      genome (use suppressWarnings() to suppress this warning).
>>>> 4: In .Seqinfo.mergexy(x, y) :
>>>>      Each of the 2 combined objects has sequence levels not in the other:
>>>>      - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7,
>>>> chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1,
>>>> chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random,
>>>> chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random,
>>>> chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220,
>>>> chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218,
>>>> chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random,
>>>> chr4_gl000194_random, chr1_gl000192_random
>>>>      - in 'y': chrM
>>>>      Make sure to always combine/compare objects based on the same
>>>> reference
>>>>      genome (use suppressWarnings() to suppress this warning).
>>>>>
>>>>
>>>> I also wanted to verify if what I am doing is correct for paired end
>>>> reads.
>>>>
>>>> Thanks,
>>>>
>>>> - Pankaj
>>>>
>>>>
>>>> =======================================
>>>> sessionInfo()
>>>> R version 3.0.3 (2014-03-06)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>>       [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>>
>>
>>
>