[BioC] About Rsubread

Fri Jun 15 02:09:48 CEST 2012

Dear Lucia,

For your version 2 of alignment, align() function firstly used mapping quality scores to break ties if more than 1 best locations were found for a read. If one of the mapping locations was found to have a higher mapping score than others, that location will be selected and reported for that read. However, this read will be reported as a unmapped read in your version 3 of alignment, because it was mapped to multiple locations and the '-unique=TRUE' option instructed align() to not report it. So 'tieBreakQS=TRUE' changed a multi-mapping read to a unique-mapping read. This is why you got more mapped reads in (2) than (3).  I would recommend setting 'tieBreakQS=TRUE" if you are worried about the multi-mapping reads.

For you second question, you are right in that the set of mapped reads reported in version 2 should be subset of mapped reads reported in version 1. However, the mapping locations could be different for the same read if it was mapped to more than 1 locations. For version 1, the mapping location for such a read was randomly chosen. However, the location with best mapping quality score was chosen in version 2. If the mapping locations of such reads reported by version 2 fell within the gene region in your annotation but their locations reported by version 1 did not,  these reads will counted in read summarization 2 but not counted in summarization 1.

So the align() and featureCounts() function did what they are supposed to do for your different versions of running.

Let me know if this is unclear to you.

Cheers,
Wei

On Jun 15, 2012, at 4:55 AM, Lucia Spangenberg [guest] wrote:

> 
> Dear list, 
> I have a question about the arguments of the align function in the Rsubread package. I have mapped my RNA-seq SOLiD data (single-end, 16 samples, 50bp long reads, human) with Rsubread using the align() function in 3 versions:
> 
> -default parameters (1)
> -unique=TRUE and tieBreakQS=TRUE (2)
> -unique=TRUE (3)
> 
> For my surprise, the percentage of mapped reads is ordered like this: (1)>(2)>(3), for all samples.
> Why is it, that when unique and tieBreakQS=TRUE (2) is used, I get more mapped reads than only with unique (3)? tieBreaksQS argument should only decide, when two reads are equally optimally aligned, which read has to be kept. I expected something like this: 
> (1)>(2)=(3) approximately. 
> Where is my reasoning mistake? 
> 
> On the other hand, after the counting procedure using the featureCounts() function (gtf only with genes), I retrieved in some samples more genes with alignment (2) than (1). I thought that the set of mapped reads of (2) should be contained in (1)? Is this also wrong? It does not happen in many samples, and the difference is not that big, but is unexpected for me.
> 
> So, if anyone could help and sees where my thinking mistake is, I would be very thankful!
> 
> Cheers, 
> Lucía
> 
> 
> 
> 
> -- output of sessionInfo(): 
> 
> sessionInfo()
> +R version 2.15.0 (2012-03-30)
> Platform: x86_64-redhat-linux-gnu (64-bit)
> 
> locale:
> [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
> [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
> [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8   
> [7] LC_PAPER=C                 LC_NAME=C                 
> [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> 
> other attached packages:
> [1] Rsubread_1.6.3
> 
> loaded via a namespace (and not attached):
> [1] tools_2.15.0
> 
> 
> --
> Sent via the guest posting facility at bioconductor.org.

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}