[BioC] Rsubread/Rsamtools mappings outside of chromosome ranges
Wei Shi
shi at wehi.EDU.AU
Mon Feb 13 11:29:21 CET 2012
Dear Thomas,
Sorry for the problem you have encountered. This was a bug we have identified a little while ago and it has been fixed in the C version of subread aligner (http://subread.sourceforge.net). I'm now updating the C code in Rsubread package to fix this and will let you know when this is done (should be in a couple of days).
Cheers,
Wei
On Feb 13, 2012, at 11:12 AM, Thomas Girke wrote:
> Thanks Martin for your suggestion!
>
> Thomas
>
> On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote:
>> On 02/10/2012 07:44 PM, Thomas Girke wrote:
>>> With some Illumina libraries I have been running into problems importing
>>> the read mappings from Rsubread into Rsamtools. After some testing I
>>> found out that some reads reported in Rsubread's SAM output have their
>>> end positions outside of the chromosome ranges. If those reads are
>>> removed from the SAM file then the import into Rsamtools works just
>>> fine. Below is a reproducible example of this problem.
>>>
>>> I could think of several solutions to fix this, e.g. not reporting reads
>>> outside of chromosome ranges or updating their length in the CIAGR string.
>>> An additional option could be to remove invalid mappings during the import
>>> into Rsamtools.
>>
>> Hi Thomas --
>>
>> for the latter and for the case where those reads are intrinsically not
>> interesting, it might be reasonable to specify a range on
>> readGappedAlignments that precludes the possibility of extension beyond
>> sequence ends.
>>
>>> si <- seqinfo(BamFile(fl))
>>> gr <- GRanges(seqnames(si), IRanges(34, seqlengths(si)-34))
>>> bam <- readGappedAlignments(fl, param=ScanBamParam(which=gr))
>>
>> One would like a better solution, either doing this automatically with a
>> warning, or warning rather than stopping on reads overlapping ends.
>>
>> Martin
>>
>>>
>>> Thanks,
>>>
>>> Thomas
>>>
>>> ################################
>>> ## Read mapping with Rsubread ##
>>> ################################
>>> library(Rsubread)
>>> ## Reference source: ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/
>>> ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/
>>> buildindex(basename="tair10chr.fasta", reference="tair10chr.fasta")
>>> align(index="tair10chr.fasta", readfile1="SRR390302.fastq", output_file="SRR390302_subread.sam", nthreads=8, indels=1, TH1=2)
>>>
>>> #####################################
>>> ## Process SAM file with Rsamtools ##
>>> #####################################
>>> library(Rsamtools)
>>> asBam(file="SRR390302_subread.sam", destination="SRR390302_subread")
>>> [1] "./data/SRR390302_subread.bam"
>>> reads<- readBamGappedAlignments("SRR390302_subread.bam", use.names=FALSE)
>>> Error in validObject(.Object) :
>>> invalid class "GappedAlignments" object: 'ranges' contains values outside of sequence bounds
>>>
>>> ## The error disappears if the following two reads are removed from the SAM file:
>>> ## SRR390302.434558 and SRR390302.2716043. Both reads have their end positions outside
>>> ## the range of Chr5 as illustrated here:
>>>
>>> ## Relevant chunks of SAM file generated by Rsubread
>>> @SQ SN:Chr1 LN:30427671
>>> @SQ SN:Chr2 LN:19698289
>>> @SQ SN:Chr3 LN:23459830
>>> @SQ SN:Chr4 LN:18585056
>>> @SQ SN:Chr5 LN:26975502
>>> @SQ SN:ChrC LN:154478
>>> @SQ SN:ChrM LN:366924
>>> SRR390302.1 0 Chr1 27018257 2.0 35M * 0 0 TGG...ACC III...GD)0
>>> SRR390302.2 4 * 0 0 * * 0 0 TCC...AAA III...+I@
>>> ...
>>> SRR390302.434558 0 Chr5 26975485 5.0 35M * 0 0 ATG...TGG III...&4(
>>> SRR390302.2716043 0 Chr5 26975486 3.0 35M * 0 0 CAT...GGT III...+2&
>>>
>>>
>>>> sessionInfo()
>>> R version 2.14.0 (2011-10-31)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] Rsamtools_1.6.3 Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5 Rsubread_1.4.2
>>>
>>> loaded via a namespace (and not attached):
>>> [1] BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2 bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0 zlibbioc_1.0.0
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>> --
>> Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>> Location: M1-B861
>> Telephone: 206 667-2793
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}
More information about the Bioconductor
mailing list