[BioC] Rsubread/Rsamtools mappings outside of chromosome ranges
Wei Shi
shi at wehi.EDU.AU
Tue Feb 14 23:42:47 CET 2012
That's great! Thanks Martin.
Cheers,
Wei
On Feb 15, 2012, at 9:40 AM, Martin Morgan wrote:
> On 02/14/2012 02:36 PM, Wei Shi wrote:
>> Hi Thomas,
>>
>> After re-checking the changes we have made before for addressing the
>> issue of reads mapped out of the chromosome ranges, I found we
>> actually made the decision to allow reads to be mapped out of the
>> chromosome ranges, due to the fact that the reference sequences might
>> not be complete and reads could originate from outside of the
>> reference sequences used for mapping.
>>
>> So you may take Martin's suggestion to add extra parameters when you
>
> Actually, Herve has patched GenomicRanges (1.7.25) in the development branch so that ranges outside sequence bounds are now accepted (with a warning).
>
> Martin
>
>> call 'readGappedAlignments' function to make it work for you. Or
>> alternatively you may try the 'featureCounts' function in Rsubread
>> package to count the number of reads mapped to each gene, if the
>> purpose of your analysis is perform RNA-seq analysis. This function
>> works on SAM file, so you do not need to convert it to a BAM file.
>> But you will have to provide an Arabidopsis anntation file it as it
>> has built-in annotation for mouse and human only at present.
>>
>> I think it might be better if the 'readGappedAlignments' function
>> could allow reads being mapped outside of chromosomal ranges. We have
>> been successfully running through a SNP discovery pipeline using
>> subread/subjunc, samtools and vcftools on unix command lines, which
>> does not complain about the reads mapped outside of the chromosomal
>> ranges. I guess many other downstream analysis tools do not require
>> mapping locations of reads to be within the chromosomal ranges as
>> well.
>>
>> Hope this helps.
>>
>> Cheers, Wei
>>
>>
>>
>> On Feb 14, 2012, at 11:58 AM, Thomas Girke wrote:
>>
>>> That's great, thanks a lot!
>>>
>>> BTW: are there any published/shareable performance test results
>>> available comparing subread against other RNA-read to genome
>>> aligners such as Tophat with respect to the accuracy of aligning
>>> reads across exon/intron junctions?
>>>
>>> Thanks,
>>>
>>> Thomas
>>>
>>> On Mon, Feb 13, 2012 at 11:41:42PM +0000, Wei Shi wrote:
>>>> Hi Thomas,
>>>>
>>>> I have just committed changes to both the release version and the
>>>> devel version of Rsubread package to fix the bug of mapping reads
>>>> out of chromosomal boundaries. They should be available in 24-36
>>>> hours. Please rebuild your index when you rerun your alignment.
>>>>
>>>> Let me know if the problem persists or you encounter further
>>>> problems.
>>>>
>>>> Cheers, Wei
>>>>
>>>> On Feb 13, 2012, at 11:12 AM, Thomas Girke wrote:
>>>>
>>>>> Thanks Martin for your suggestion!
>>>>>
>>>>> Thomas
>>>>>
>>>>> On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote:
>>>>>> On 02/10/2012 07:44 PM, Thomas Girke wrote:
>>>>>>> With some Illumina libraries I have been running into
>>>>>>> problems importing the read mappings from Rsubread into
>>>>>>> Rsamtools. After some testing I found out that some reads
>>>>>>> reported in Rsubread's SAM output have their end positions
>>>>>>> outside of the chromosome ranges. If those reads are
>>>>>>> removed from the SAM file then the import into Rsamtools
>>>>>>> works just fine. Below is a reproducible example of this
>>>>>>> problem.
>>>>>>>
>>>>>>> I could think of several solutions to fix this, e.g. not
>>>>>>> reporting reads outside of chromosome ranges or updating
>>>>>>> their length in the CIAGR string. An additional option
>>>>>>> could be to remove invalid mappings during the import into
>>>>>>> Rsamtools.
>>>>>>
>>>>>> Hi Thomas --
>>>>>>
>>>>>> for the latter and for the case where those reads are
>>>>>> intrinsically not interesting, it might be reasonable to
>>>>>> specify a range on readGappedAlignments that precludes the
>>>>>> possibility of extension beyond sequence ends.
>>>>>>
>>>>>>> si<- seqinfo(BamFile(fl)) gr<- GRanges(seqnames(si),
>>>>>>> IRanges(34, seqlengths(si)-34)) bam<-
>>>>>>> readGappedAlignments(fl, param=ScanBamParam(which=gr))
>>>>>>
>>>>>> One would like a better solution, either doing this
>>>>>> automatically with a warning, or warning rather than stopping
>>>>>> on reads overlapping ends.
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Thomas
>>>>>>>
>>>>>>> ################################ ## Read mapping with
>>>>>>> Rsubread ## ################################
>>>>>>> library(Rsubread) ## Reference source:
>>>>>>> ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/
>>>>>>>
>>>>>>>
> ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/
>>>>>>> buildindex(basename="tair10chr.fasta",
>>>>>>> reference="tair10chr.fasta") align(index="tair10chr.fasta",
>>>>>>> readfile1="SRR390302.fastq",
>>>>>>> output_file="SRR390302_subread.sam", nthreads=8, indels=1,
>>>>>>> TH1=2)
>>>>>>>
>>>>>>> ##################################### ## Process SAM file
>>>>>>> with Rsamtools ## #####################################
>>>>>>> library(Rsamtools) asBam(file="SRR390302_subread.sam",
>>>>>>> destination="SRR390302_subread") [1]
>>>>>>> "./data/SRR390302_subread.bam" reads<-
>>>>>>> readBamGappedAlignments("SRR390302_subread.bam",
>>>>>>> use.names=FALSE) Error in validObject(.Object) : invalid
>>>>>>> class "GappedAlignments" object: 'ranges' contains values
>>>>>>> outside of sequence bounds
>>>>>>>
>>>>>>> ## The error disappears if the following two reads are
>>>>>>> removed from the SAM file: ## SRR390302.434558 and
>>>>>>> SRR390302.2716043. Both reads have their end positions
>>>>>>> outside ## the range of Chr5 as illustrated here:
>>>>>>>
>>>>>>> ## Relevant chunks of SAM file generated by Rsubread @SQ
>>>>>>> SN:Chr1 LN:30427671 @SQ SN:Chr2 LN:19698289 @SQ
>>>>>>> SN:Chr3 LN:23459830 @SQ SN:Chr4 LN:18585056 @SQ
>>>>>>> SN:Chr5 LN:26975502 @SQ SN:ChrC LN:154478 @SQ
>>>>>>> SN:ChrM LN:366924 SRR390302.1 0 Chr1 27018257
>>>>>>> 2.0 35M * 0 0 TGG...ACC
>>>>>>> III...GD)0 SRR390302.2 4 * 0 0
>>>>>>> * * 0 0 TCC...AAA III...+I@
>>>>>>> ... SRR390302.434558 0 Chr5 26975485
>>>>>>> 5.0 35M * 0 0 ATG...TGG
>>>>>>> III...&4( SRR390302.2716043 0 Chr5 26975486
>>>>>>> 3.0 35M * 0 0 CAT...GGT
>>>>>>> III...+2&
>>>>>>>
>>>>>>>
>>>>>>>> sessionInfo()
>>>>>>> R version 2.14.0 (2011-10-31) Platform:
>>>>>>> x86_64-unknown-linux-gnu (64-bit)
>>>>>>>
>>>>>>> locale: [1] C
>>>>>>>
>>>>>>> attached base packages: [1] stats graphics grDevices
>>>>>>> utils datasets methods base
>>>>>>>
>>>>>>> other attached packages: [1] Rsamtools_1.6.3
>>>>>>> Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5
>>>>>>> Rsubread_1.4.2
>>>>>>>
>>>>>>> loaded via a namespace (and not attached): [1]
>>>>>>> BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2
>>>>>>> bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0
>>>>>>> zlibbioc_1.0.0
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search
>>>>>>> the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
> --
>>>>>> Computational Biology Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>>>>>
>>>>>> Location: M1-B861 Telephone: 206 667-2793
>>>>>
>>>>> _______________________________________________ Bioconductor
>>>>> mailing list Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
>>>>> archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>>
>>>>>
> ______________________________________________________________________
>>>> The information in this email is confidential and intended solely
>>>> for the addressee. You must not disclose, forward, print or use
>>>> it without the permission of the sender.
>>>> ______________________________________________________________________
>>
>>>>
>>
>> ______________________________________________________________________
>>
>>
> The information in this email is confidential and inte...{{dropped:24}}
More information about the Bioconductor
mailing list