[BioC] Rsubread/Rsamtools mappings outside of chromosome ranges

Thomas Girke thomas.girke at ucr.edu
Sat Feb 11 04:44:19 CET 2012


With some Illumina libraries I have been running into problems importing
the read mappings from Rsubread into Rsamtools. After some testing I
found out that some reads reported in Rsubread's SAM output have their
end positions outside of the chromosome ranges. If those reads are
removed from the SAM file then the import into Rsamtools works just
fine. Below is a reproducible example of this problem.

I could think of several solutions to fix this, e.g. not reporting reads
outside of chromosome ranges or updating their length in the CIAGR string.
An additional option could be to remove invalid mappings during the import
into Rsamtools. 

Thanks,

Thomas

################################
## Read mapping with Rsubread ##
################################
library(Rsubread)
## Reference source: ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/
## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/
buildindex(basename="tair10chr.fasta", reference="tair10chr.fasta")
align(index="tair10chr.fasta", readfile1="SRR390302.fastq", output_file="SRR390302_subread.sam", nthreads=8, indels=1, TH1=2)

#####################################
## Process SAM file with Rsamtools ##
#####################################
library(Rsamtools)
asBam(file="SRR390302_subread.sam", destination="SRR390302_subread")
[1] "./data/SRR390302_subread.bam"
reads <- readBamGappedAlignments("SRR390302_subread.bam", use.names=FALSE)
Error in validObject(.Object) : 
  invalid class "GappedAlignments" object: 'ranges' contains values outside of sequence bounds

## The error disappears if the following two reads are removed from the SAM file:
## SRR390302.434558 and SRR390302.2716043. Both reads have their end positions outside 
## the range of Chr5 as illustrated here: 

## Relevant chunks of SAM file generated by Rsubread  
@SQ     SN:Chr1 LN:30427671
@SQ     SN:Chr2 LN:19698289
@SQ     SN:Chr3 LN:23459830
@SQ     SN:Chr4 LN:18585056
@SQ     SN:Chr5 LN:26975502
@SQ     SN:ChrC LN:154478
@SQ     SN:ChrM LN:366924
SRR390302.1     0       Chr1    27018257        2.0     35M     *       0       0       TGG...ACC     III...GD)0
SRR390302.2     4       *       0       0       *       *       0       0       TCC...AAA     III...+I@
...
SRR390302.434558        0       Chr5    26975485        5.0     35M     *       0       0       ATG...TGG     III...&4(
SRR390302.2716043       0       Chr5    26975486        3.0     35M     *       0       0       CAT...GGT     III...+2&


> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsamtools_1.6.3     Biostrings_2.22.0   GenomicRanges_1.6.4 IRanges_1.12.5      Rsubread_1.4.2     

loaded via a namespace (and not attached):
[1] BSgenome_1.22.0    RCurl_1.8-0        XML_3.6-2          bitops_1.0-4.1     rtracklayer_1.14.4 tools_2.14.0       zlibbioc_1.0.0



More information about the Bioconductor mailing list