[BioC] Rsubread/Rsamtools mappings outside of chromosome ranges

Thomas Girke thomas.girke at ucr.edu
Tue Feb 14 01:58:26 CET 2012


That's great, thanks a lot! 

BTW: are there any published/shareable performance test results
available comparing subread against other RNA-read to genome aligners
such as Tophat with respect to the accuracy of aligning reads across
exon/intron junctions?

Thanks,

Thomas

On Mon, Feb 13, 2012 at 11:41:42PM +0000, Wei Shi wrote:
> Hi Thomas,
> 
> I have just committed changes to both the release version and the devel version of Rsubread package to fix the bug of mapping reads out of chromosomal boundaries. They should be available in 24-36 hours. Please rebuild your index when you rerun your alignment.
> 
> Let me know if the problem persists or you encounter further problems.
> 
> Cheers,
> Wei
> 
> On Feb 13, 2012, at 11:12 AM, Thomas Girke wrote:
> 
> > Thanks Martin for your suggestion!
> > 
> > Thomas
> > 
> > On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote:
> >> On 02/10/2012 07:44 PM, Thomas Girke wrote:
> >>> With some Illumina libraries I have been running into problems importing
> >>> the read mappings from Rsubread into Rsamtools. After some testing I
> >>> found out that some reads reported in Rsubread's SAM output have their
> >>> end positions outside of the chromosome ranges. If those reads are
> >>> removed from the SAM file then the import into Rsamtools works just
> >>> fine. Below is a reproducible example of this problem.
> >>> 
> >>> I could think of several solutions to fix this, e.g. not reporting reads
> >>> outside of chromosome ranges or updating their length in the CIAGR string.
> >>> An additional option could be to remove invalid mappings during the import
> >>> into Rsamtools.
> >> 
> >> Hi Thomas --
> >> 
> >> for the latter and for the case where those reads are intrinsically not 
> >> interesting, it might be reasonable to specify a range on 
> >> readGappedAlignments that precludes the possibility of extension beyond 
> >> sequence ends.
> >> 
> >>> si <- seqinfo(BamFile(fl))
> >>> gr <- GRanges(seqnames(si), IRanges(34, seqlengths(si)-34))
> >>> bam <- readGappedAlignments(fl, param=ScanBamParam(which=gr))
> >> 
> >> One would like a better solution, either doing this automatically with a 
> >> warning, or warning rather than stopping on reads overlapping ends.
> >> 
> >> Martin
> >> 
> >>> 
> >>> Thanks,
> >>> 
> >>> Thomas
> >>> 
> >>> ################################
> >>> ## Read mapping with Rsubread ##
> >>> ################################
> >>> library(Rsubread)
> >>> ## Reference source: ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/
> >>> ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/
> >>> buildindex(basename="tair10chr.fasta", reference="tair10chr.fasta")
> >>> align(index="tair10chr.fasta", readfile1="SRR390302.fastq", output_file="SRR390302_subread.sam", nthreads=8, indels=1, TH1=2)
> >>> 
> >>> #####################################
> >>> ## Process SAM file with Rsamtools ##
> >>> #####################################
> >>> library(Rsamtools)
> >>> asBam(file="SRR390302_subread.sam", destination="SRR390302_subread")
> >>> [1] "./data/SRR390302_subread.bam"
> >>> reads<- readBamGappedAlignments("SRR390302_subread.bam", use.names=FALSE)
> >>> Error in validObject(.Object) :
> >>>   invalid class "GappedAlignments" object: 'ranges' contains values outside of sequence bounds
> >>> 
> >>> ## The error disappears if the following two reads are removed from the SAM file:
> >>> ## SRR390302.434558 and SRR390302.2716043. Both reads have their end positions outside
> >>> ## the range of Chr5 as illustrated here:
> >>> 
> >>> ## Relevant chunks of SAM file generated by Rsubread
> >>> @SQ     SN:Chr1 LN:30427671
> >>> @SQ     SN:Chr2 LN:19698289
> >>> @SQ     SN:Chr3 LN:23459830
> >>> @SQ     SN:Chr4 LN:18585056
> >>> @SQ     SN:Chr5 LN:26975502
> >>> @SQ     SN:ChrC LN:154478
> >>> @SQ     SN:ChrM LN:366924
> >>> SRR390302.1     0       Chr1    27018257        2.0     35M     *       0       0       TGG...ACC     III...GD)0
> >>> SRR390302.2     4       *       0       0       *       *       0       0       TCC...AAA     III...+I@
> >>> ...
> >>> SRR390302.434558        0       Chr5    26975485        5.0     35M     *       0       0       ATG...TGG     III...&4(
> >>> SRR390302.2716043       0       Chr5    26975486        3.0     35M     *       0       0       CAT...GGT     III...+2&
> >>> 
> >>> 
> >>>> sessionInfo()
> >>> R version 2.14.0 (2011-10-31)
> >>> Platform: x86_64-unknown-linux-gnu (64-bit)
> >>> 
> >>> locale:
> >>> [1] C
> >>> 
> >>> attached base packages:
> >>> [1] stats     graphics  grDevices utils     datasets  methods   base
> >>> 
> >>> other attached packages:
> >>> [1] Rsamtools_1.6.3     Biostrings_2.22.0   GenomicRanges_1.6.4 IRanges_1.12.5      Rsubread_1.4.2
> >>> 
> >>> loaded via a namespace (and not attached):
> >>> [1] BSgenome_1.22.0    RCurl_1.8-0        XML_3.6-2          bitops_1.0-4.1     rtracklayer_1.14.4 tools_2.14.0       zlibbioc_1.0.0
> >>> 
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >> 
> >> 
> >> -- 
> >> Computational Biology
> >> Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> >> 
> >> Location: M1-B861
> >> Telephone: 206 667-2793
> > 
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:4}}



More information about the Bioconductor mailing list