[Bioc-sig-seq] unique reads count

Thu Jan 28 03:53:17 CET 2010

On 01/27/2010 05:34 PM, joseph wrote:
> Thanks for the clarification.
> How do I filter the reads that occur exactly once?

Hi Joseph --

One way is to

  rnk = srrank(sread(rfq))

identical reads are assigned the same srrank, so

  rnk %in% which(tabulate(rnk)==1)

returns those reads whose rank occurs exactly once and

  rfq1 = rfq[rnk %in% which(tabulate(rnk)==1)]

subsets the rfq object. We should check this with

  any(srduplicated(sread(rfq1)))

which should return FALSE.

Maybe worth pointing out that the reads might have different quality
base scores, and then it's not clear that they are the 'same'. Also in
aligned reads that they might not have been aligned to the same
chr/strand/position even when the sequence (and base quality!, yes, some
aligners chose amongst equally likely alignment positions by tossing a
coin) is the same.

Martin

> 
> 
> 
> ________________________________
> From: Martin Morgan <mtmorgan at fhcrc.org>
> To: joseph <jdsandjd at yahoo.com>
> Cc: bioc-sig-sequencing at r-project.org
> Sent: Wed, January 27, 2010 5:29:00 PM
> Subject: Re: [Bioc-sig-seq] unique reads count
> 
> On 01/27/2010 02:57 PM, joseph wrote:
>> If I understand this correctly, !duplicated does not count unique reads:
>> !duplicated(c("A", "B", "B", "C"))
>> [1]  TRUE  TRUE FALSE  TRUE
>>
>> in your example, there are only two uniques (not 3 as counted by !duplicated).
> 
> Well, it depends on what your definition of 'duplicate' is, but yes, I
> think we agree.
> 
>> Is it correct to say that the number of unique reads is given by the nReads when  nOccurrences=1 using the tables function? 
> 
> again unique needs definition (see unique(c("A", "B", "B", "C")), for
> instance!) but yes, nOccurrences is the number of reads that occur
> exactly once.
> 
> Martin
> 
>>
>>
>> ________________________________
>> From: Martin Morgan <mtmorgan at fhcrc.org>
>> To: joseph <jdsandjd at yahoo.com>
>> Cc: bioc-sig-sequencing at r-project.org
>> Sent: Wed, January 27, 2010 1:06:55 PM
>> Subject: Re: [Bioc-sig-seq] unique reads count
>>
>> Hi Joseph --
>>
>> On 01/27/2010 12:33 PM, joseph wrote:
>>> Hello
>>> I have a ShortReadQ object: 
>>>> rfq
>>> class: ShortReadQ
>>> length: 16115723 reads; width: 34 cycles
>>>
>>> I used the negation of the result from srduplicated to count the unique reads:
>>>> sum(!srduplicated(sread(rfq)))
>>> [1] 4545719
>>>
>>> But also I looked at the frequency with which each read occurs using the tables function:
>>>> head(tables(rfq_s_3_mel)$distribution)
>>>   nOccurrences  nReads
>>> 1            1 4022038
>>> 2            2  255649
>>
>> srduplicated is behaving like 'duplicated', which is to return TRUE when
>> an element has already been seen
>>
>>> duplicated(c("A", "B", "B", "C"))
>> [1] FALSE FALSE  TRUE FALSE
>>
>> There's one duplicate, the second 'B'.
>>
>> After example(srduplicated) I have
>>
>>> tables(sread(rfq))$distribution
>>   nOccurrences nReads
>> 1            1    239
>> 2            2      7
>> 3            3      1
>>> sum(srduplicated(sread(rfq)))
>> [1] 9
>>
>> there are 7 reads that are the second of two reads, and 2 reads that are
>> the second and third of three reads.
>>
>> Martin
>>
>>>
>>> I expected that for nOccurrences=1, the nReads should be the same as what I got with !srduplicated.
>>>
>>> Can anybody explain why I got different counts?
>>> Thank you
>>> Joseph Dhahbi
>>>
>>>
>>>      
>>>     [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
> 
> 

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793