[BioC] Rsubread/Rsamtools mappings outside of chromosome ranges

Martin Morgan mtmorgan at fhcrc.org
Sun Feb 12 18:28:21 CET 2012


On 02/10/2012 07:44 PM, Thomas Girke wrote:
> With some Illumina libraries I have been running into problems importing
> the read mappings from Rsubread into Rsamtools. After some testing I
> found out that some reads reported in Rsubread's SAM output have their
> end positions outside of the chromosome ranges. If those reads are
> removed from the SAM file then the import into Rsamtools works just
> fine. Below is a reproducible example of this problem.
>
> I could think of several solutions to fix this, e.g. not reporting reads
> outside of chromosome ranges or updating their length in the CIAGR string.
> An additional option could be to remove invalid mappings during the import
> into Rsamtools.

Hi Thomas --

for the latter and for the case where those reads are intrinsically not 
interesting, it might be reasonable to specify a range on 
readGappedAlignments that precludes the possibility of extension beyond 
sequence ends.

 > si <- seqinfo(BamFile(fl))
 > gr <- GRanges(seqnames(si), IRanges(34, seqlengths(si)-34))
 > bam <- readGappedAlignments(fl, param=ScanBamParam(which=gr))

One would like a better solution, either doing this automatically with a 
warning, or warning rather than stopping on reads overlapping ends.

Martin

>
> Thanks,
>
> Thomas
>
> ################################
> ## Read mapping with Rsubread ##
> ################################
> library(Rsubread)
> ## Reference source: ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/
> ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/
> buildindex(basename="tair10chr.fasta", reference="tair10chr.fasta")
> align(index="tair10chr.fasta", readfile1="SRR390302.fastq", output_file="SRR390302_subread.sam", nthreads=8, indels=1, TH1=2)
>
> #####################################
> ## Process SAM file with Rsamtools ##
> #####################################
> library(Rsamtools)
> asBam(file="SRR390302_subread.sam", destination="SRR390302_subread")
> [1] "./data/SRR390302_subread.bam"
> reads<- readBamGappedAlignments("SRR390302_subread.bam", use.names=FALSE)
> Error in validObject(.Object) :
>    invalid class "GappedAlignments" object: 'ranges' contains values outside of sequence bounds
>
> ## The error disappears if the following two reads are removed from the SAM file:
> ## SRR390302.434558 and SRR390302.2716043. Both reads have their end positions outside
> ## the range of Chr5 as illustrated here:
>
> ## Relevant chunks of SAM file generated by Rsubread
> @SQ     SN:Chr1 LN:30427671
> @SQ     SN:Chr2 LN:19698289
> @SQ     SN:Chr3 LN:23459830
> @SQ     SN:Chr4 LN:18585056
> @SQ     SN:Chr5 LN:26975502
> @SQ     SN:ChrC LN:154478
> @SQ     SN:ChrM LN:366924
> SRR390302.1     0       Chr1    27018257        2.0     35M     *       0       0       TGG...ACC     III...GD)0
> SRR390302.2     4       *       0       0       *       *       0       0       TCC...AAA     III...+I@
> ...
> SRR390302.434558        0       Chr5    26975485        5.0     35M     *       0       0       ATG...TGG     III...&4(
> SRR390302.2716043       0       Chr5    26975486        3.0     35M     *       0       0       CAT...GGT     III...+2&
>
>
>> sessionInfo()
> R version 2.14.0 (2011-10-31)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Rsamtools_1.6.3     Biostrings_2.22.0   GenomicRanges_1.6.4 IRanges_1.12.5      Rsubread_1.4.2
>
> loaded via a namespace (and not attached):
> [1] BSgenome_1.22.0    RCurl_1.8-0        XML_3.6-2          bitops_1.0-4.1     rtracklayer_1.14.4 tools_2.14.0       zlibbioc_1.0.0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



More information about the Bioconductor mailing list