[BioC] Rsubread/Rsamtools mappings outside of chromosome ranges
Thomas Girke
thomas.girke at ucr.edu
Mon Feb 13 01:12:47 CET 2012
Thanks Martin for your suggestion!
Thomas
On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote:
> On 02/10/2012 07:44 PM, Thomas Girke wrote:
> > With some Illumina libraries I have been running into problems importing
> > the read mappings from Rsubread into Rsamtools. After some testing I
> > found out that some reads reported in Rsubread's SAM output have their
> > end positions outside of the chromosome ranges. If those reads are
> > removed from the SAM file then the import into Rsamtools works just
> > fine. Below is a reproducible example of this problem.
> >
> > I could think of several solutions to fix this, e.g. not reporting reads
> > outside of chromosome ranges or updating their length in the CIAGR string.
> > An additional option could be to remove invalid mappings during the import
> > into Rsamtools.
>
> Hi Thomas --
>
> for the latter and for the case where those reads are intrinsically not
> interesting, it might be reasonable to specify a range on
> readGappedAlignments that precludes the possibility of extension beyond
> sequence ends.
>
> > si <- seqinfo(BamFile(fl))
> > gr <- GRanges(seqnames(si), IRanges(34, seqlengths(si)-34))
> > bam <- readGappedAlignments(fl, param=ScanBamParam(which=gr))
>
> One would like a better solution, either doing this automatically with a
> warning, or warning rather than stopping on reads overlapping ends.
>
> Martin
>
> >
> > Thanks,
> >
> > Thomas
> >
> > ################################
> > ## Read mapping with Rsubread ##
> > ################################
> > library(Rsubread)
> > ## Reference source: ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/
> > ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/
> > buildindex(basename="tair10chr.fasta", reference="tair10chr.fasta")
> > align(index="tair10chr.fasta", readfile1="SRR390302.fastq", output_file="SRR390302_subread.sam", nthreads=8, indels=1, TH1=2)
> >
> > #####################################
> > ## Process SAM file with Rsamtools ##
> > #####################################
> > library(Rsamtools)
> > asBam(file="SRR390302_subread.sam", destination="SRR390302_subread")
> > [1] "./data/SRR390302_subread.bam"
> > reads<- readBamGappedAlignments("SRR390302_subread.bam", use.names=FALSE)
> > Error in validObject(.Object) :
> > invalid class "GappedAlignments" object: 'ranges' contains values outside of sequence bounds
> >
> > ## The error disappears if the following two reads are removed from the SAM file:
> > ## SRR390302.434558 and SRR390302.2716043. Both reads have their end positions outside
> > ## the range of Chr5 as illustrated here:
> >
> > ## Relevant chunks of SAM file generated by Rsubread
> > @SQ SN:Chr1 LN:30427671
> > @SQ SN:Chr2 LN:19698289
> > @SQ SN:Chr3 LN:23459830
> > @SQ SN:Chr4 LN:18585056
> > @SQ SN:Chr5 LN:26975502
> > @SQ SN:ChrC LN:154478
> > @SQ SN:ChrM LN:366924
> > SRR390302.1 0 Chr1 27018257 2.0 35M * 0 0 TGG...ACC III...GD)0
> > SRR390302.2 4 * 0 0 * * 0 0 TCC...AAA III...+I@
> > ...
> > SRR390302.434558 0 Chr5 26975485 5.0 35M * 0 0 ATG...TGG III...&4(
> > SRR390302.2716043 0 Chr5 26975486 3.0 35M * 0 0 CAT...GGT III...+2&
> >
> >
> >> sessionInfo()
> > R version 2.14.0 (2011-10-31)
> > Platform: x86_64-unknown-linux-gnu (64-bit)
> >
> > locale:
> > [1] C
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > other attached packages:
> > [1] Rsamtools_1.6.3 Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5 Rsubread_1.4.2
> >
> > loaded via a namespace (and not attached):
> > [1] BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2 bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0 zlibbioc_1.0.0
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
More information about the Bioconductor
mailing list