[Bioc-devel] Making GenomicAlignments::readGAlignmentPairs() fail fast if given bad seqlevels in `which`

Peter Hickey peter.hickey at gmail.com
Wed Mar 16 19:53:44 CET 2016


Hi,

GenomicAlignments::readGAlignmentPairs() can take a while to
(correctly) fail if the `which` parameter contains a "bad" seqlevel.
It'd be great if it failed early in the following scenario (just
experienced).

An example BAM is available from
https://www.dropbox.com/sh/4avqxuqnhlv3r9c/AADqx-XqXV6c7Dc_SaSUq324a?dl=0
(110 MB; sorry, it needs to be large-ish in order to notice the
problem). The following code ought to reproduce the problem. Here I am
taking the example BAM of mouse data mapped to mm10 and using a
`which` based on hg19 (it was mistakenly assuming all my data were
human that led me to this problem). When a single "bad" seqlevel is
provided via `which` then it errors fast and with a helpful error.
However, if the `which` contains multiple seqlevels, some "good" and
some "bad", then it seemingly just hangs. I initially thought it had
just frozen indefinitely but it actually eventually returns the
correct error.

It'd be great if it failed fast in this situation.

Thanks,
Pete

library(GenomicAlignments)
library(BSgenome.Hsapiens.UCSC.hg19)
si <- seqinfo(BSgenome.Hsapiens.UCSC.hg19)

# mouse data mapped to mm10 (temporarily available from
https://www.dropbox.com/sh/4avqxuqnhlv3r9c/AADqx-XqXV6c7Dc_SaSUq324a?dl=0)
file <- "~/Desktop/tmp/SRR1781315.markdup.bam"

# Errors fast and helpfully because chr20 doesn't exist in mouse
readGAlignmentPairs(file, param = ScanBamParam(which = as(si, "GRanges")[20]))

# Takes a long time to error if some seqlevels exist (chr19) and some
don't exist (chr20) in sample
readGAlignmentPairs(file, param = ScanBamParam(which = as(si,
"GRanges")[19:20]))

> sessionInfo()
R Under development (unstable) (2016-03-11 r70310)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils
datasets  methods   base

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.39.4
  rtracklayer_1.31.7
 [4] GenomicAlignments_1.7.20          Rsamtools_1.23.3
  Biostrings_2.39.12
 [7] XVector_0.11.7                    SummarizedExperiment_1.1.22
  Biobase_2.31.3
[10] GenomicRanges_1.23.24             GenomeInfoDb_1.7.6
  IRanges_2.5.40
[13] S4Vectors_0.9.42                  BiocGenerics_0.17.3
  repete_0.0.0.9000
[16] devtools_1.10.0

loaded via a namespace (and not attached):
[1] XML_3.98-1.4        digest_0.6.9        bitops_1.0-6
zlibbioc_1.17.0     BiocParallel_1.5.20
[6] tools_3.3.0         RCurl_1.95-4.8      memoise_1.0.0



More information about the Bioc-devel mailing list