[BioC] Rsubread/Rsamtools mappings outside of chromosome ranges

Thomas Girke thomas.girke at ucr.edu
Tue Feb 14 23:51:03 CET 2012


Great! Ticket closed :).

On Tue, Feb 14, 2012 at 10:40:40PM +0000, Martin Morgan wrote:
> On 02/14/2012 02:36 PM, Wei Shi wrote:
> > Hi Thomas,
> >
> > After re-checking the changes we have made before for addressing the
> > issue of reads mapped out of the chromosome ranges, I found we
> > actually made the decision to allow reads to be mapped out of the
> > chromosome ranges, due to the fact that the reference sequences might
> > not be complete and reads could originate from outside of the
> > reference sequences used for mapping.
> >
> > So you may take Martin's suggestion to add extra parameters when you
> 
> Actually, Herve has patched GenomicRanges (1.7.25) in the development 
> branch so that ranges outside sequence bounds are now accepted (with a 
> warning).
> 
> Martin
> 
> > call 'readGappedAlignments' function to make it work for you. Or
> > alternatively you may try the 'featureCounts' function in Rsubread
> > package to count the number of reads mapped to each gene, if the
> > purpose of your analysis is perform RNA-seq analysis. This function
> > works on SAM file, so you do not need to convert it to a BAM file.
> > But you will have to provide an Arabidopsis anntation file it as it
> > has built-in annotation for mouse and human only at present.
> >
> > I think it might be better if the 'readGappedAlignments'  function
> > could allow reads being mapped outside of chromosomal ranges. We have
> > been successfully running through a SNP discovery pipeline using
> > subread/subjunc, samtools and vcftools on unix command lines, which
> > does not complain about the reads mapped outside of the chromosomal
> > ranges. I guess many other downstream analysis tools do not require
> > mapping locations of reads to be within the chromosomal ranges as
> > well.
> >
> > Hope this helps.
> >
> > Cheers, Wei
> >
> >
> >
> > On Feb 14, 2012, at 11:58 AM, Thomas Girke wrote:
> >
> >> That's great, thanks a lot!
> >>
> >> BTW: are there any published/shareable performance test results
> >> available comparing subread against other RNA-read to genome
> >> aligners such as Tophat with respect to the accuracy of aligning
> >> reads across exon/intron junctions?
> >>
> >> Thanks,
> >>
> >> Thomas
> >>
> >> On Mon, Feb 13, 2012 at 11:41:42PM +0000, Wei Shi wrote:
> >>> Hi Thomas,
> >>>
> >>> I have just committed changes to both the release version and the
> >>> devel version of Rsubread package to fix the bug of mapping reads
> >>> out of chromosomal boundaries. They should be available in 24-36
> >>> hours. Please rebuild your index when you rerun your alignment.
> >>>
> >>> Let me know if the problem persists or you encounter further
> >>> problems.
> >>>
> >>> Cheers, Wei
> >>>
> >>> On Feb 13, 2012, at 11:12 AM, Thomas Girke wrote:
> >>>
> >>>> Thanks Martin for your suggestion!
> >>>>
> >>>> Thomas
> >>>>
> >>>> On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote:
> >>>>> On 02/10/2012 07:44 PM, Thomas Girke wrote:
> >>>>>> With some Illumina libraries I have been running into
> >>>>>> problems importing the read mappings from Rsubread into
> >>>>>> Rsamtools. After some testing I found out that some reads
> >>>>>> reported in Rsubread's SAM output have their end positions
> >>>>>> outside of the chromosome ranges. If those reads are
> >>>>>> removed from the SAM file then the import into Rsamtools
> >>>>>> works just fine. Below is a reproducible example of this
> >>>>>> problem.
> >>>>>>
> >>>>>> I could think of several solutions to fix this, e.g. not
> >>>>>> reporting reads outside of chromosome ranges or updating
> >>>>>> their length in the CIAGR string. An additional option
> >>>>>> could be to remove invalid mappings during the import into
> >>>>>> Rsamtools.
> >>>>>
> >>>>> Hi Thomas --
> >>>>>
> >>>>> for the latter and for the case where those reads are
> >>>>> intrinsically not interesting, it might be reasonable to
> >>>>> specify a range on readGappedAlignments that precludes the
> >>>>> possibility of extension beyond sequence ends.
> >>>>>
> >>>>>> si<- seqinfo(BamFile(fl)) gr<- GRanges(seqnames(si),
> >>>>>> IRanges(34, seqlengths(si)-34)) bam<-
> >>>>>> readGappedAlignments(fl, param=ScanBamParam(which=gr))
> >>>>>
> >>>>> One would like a better solution, either doing this
> >>>>> automatically with a warning, or warning rather than stopping
> >>>>> on reads overlapping ends.
> >>>>>
> >>>>> Martin
> >>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Thomas
> >>>>>>
> >>>>>> ################################ ## Read mapping with
> >>>>>> Rsubread ## ################################
> >>>>>> library(Rsubread) ## Reference source:
> >>>>>> ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/
> >>>>>>
> >>>>>>
> ## Fastq source: 
> ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/
> >>>>>> buildindex(basename="tair10chr.fasta",
> >>>>>> reference="tair10chr.fasta") align(index="tair10chr.fasta",
> >>>>>> readfile1="SRR390302.fastq",
> >>>>>> output_file="SRR390302_subread.sam", nthreads=8, indels=1,
> >>>>>> TH1=2)
> >>>>>>
> >>>>>> ##################################### ## Process SAM file
> >>>>>> with Rsamtools ## #####################################
> >>>>>> library(Rsamtools) asBam(file="SRR390302_subread.sam",
> >>>>>> destination="SRR390302_subread") [1]
> >>>>>> "./data/SRR390302_subread.bam" reads<-
> >>>>>> readBamGappedAlignments("SRR390302_subread.bam",
> >>>>>> use.names=FALSE) Error in validObject(.Object) : invalid
> >>>>>> class "GappedAlignments" object: 'ranges' contains values
> >>>>>> outside of sequence bounds
> >>>>>>
> >>>>>> ## The error disappears if the following two reads are
> >>>>>> removed from the SAM file: ## SRR390302.434558 and
> >>>>>> SRR390302.2716043. Both reads have their end positions
> >>>>>> outside ## the range of Chr5 as illustrated here:
> >>>>>>
> >>>>>> ## Relevant chunks of SAM file generated by Rsubread @SQ
> >>>>>> SN:Chr1 LN:30427671 @SQ     SN:Chr2 LN:19698289 @SQ
> >>>>>> SN:Chr3 LN:23459830 @SQ     SN:Chr4 LN:18585056 @SQ
> >>>>>> SN:Chr5 LN:26975502 @SQ     SN:ChrC LN:154478 @SQ
> >>>>>> SN:ChrM LN:366924 SRR390302.1     0       Chr1    27018257
> >>>>>> 2.0     35M     *       0       0       TGG...ACC
> >>>>>> III...GD)0 SRR390302.2     4       *       0       0
> >>>>>> *       *       0       0       TCC...AAA     III...+I@
> >>>>>> ... SRR390302.434558        0       Chr5    26975485
> >>>>>> 5.0     35M     *       0       0       ATG...TGG
> >>>>>> III...&4( SRR390302.2716043       0       Chr5    26975486
> >>>>>> 3.0     35M     *       0       0       CAT...GGT
> >>>>>> III...+2&
> >>>>>>
> >>>>>>
> >>>>>>> sessionInfo()
> >>>>>> R version 2.14.0 (2011-10-31) Platform:
> >>>>>> x86_64-unknown-linux-gnu (64-bit)
> >>>>>>
> >>>>>> locale: [1] C
> >>>>>>
> >>>>>> attached base packages: [1] stats     graphics  grDevices
> >>>>>> utils     datasets  methods   base
> >>>>>>
> >>>>>> other attached packages: [1] Rsamtools_1.6.3
> >>>>>> Biostrings_2.22.0   GenomicRanges_1.6.4 IRanges_1.12.5
> >>>>>> Rsubread_1.4.2
> >>>>>>
> >>>>>> loaded via a namespace (and not attached): [1]
> >>>>>> BSgenome_1.22.0    RCurl_1.8-0        XML_3.6-2
> >>>>>> bitops_1.0-4.1     rtracklayer_1.14.4 tools_2.14.0
> >>>>>> zlibbioc_1.0.0
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Bioconductor mailing list Bioconductor at r-project.org
> >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search
> >>>>>> the archives:
> >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>>>
> >>>>>
> >>>>>
> >>>>>>
> -- 
> >>>>> Computational Biology Fred Hutchinson Cancer Research Center
> >>>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> >>>>>
> >>>>> Location: M1-B861 Telephone: 206 667-2793
> >>>>
> >>>> _______________________________________________ Bioconductor
> >>>> mailing list Bioconductor at r-project.org
> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
> >>>> archives:
> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>
> >>>
> >>>
> >>>>
> ______________________________________________________________________
> >>> The information in this email is confidential and intended solely
> >>> for the addressee. You must not disclose, forward, print or use
> >>> it without the permission of the sender.
> >>> ______________________________________________________________________
> >
> >>>
> >
> > ______________________________________________________________________
> >
> >
> The information in this email is confidential and intended solely for 
> the addressee.
> > You must not disclose, forward, print or use it without the
> > permission of the sender.
> > ______________________________________________________________________
> 
> 
> --
> >
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> 
> Location: M1-B861
> Telephone: 206 667-2793



More information about the Bioconductor mailing list