[Bioc-sig-seq] "identical" on PhredQuality objects - I'm confused

Janet Young jayoung at fhcrc.org
Wed Jun 15 21:23:45 CEST 2011


Hi, 

Marcus - thanks for the self-contained example. I probably should have included something like that too!

and Herve - thank you VERY much - that's helpful. I'm beginning to understand this, but I have a little way to go still - I'm amazed at the subtleties there are this stuff.

I'm glad you explained the false negatives too:   just this morning I saw one of those for the first time, and I would have spent a long time worrying about it if you hadn't explained...

I'll go with your ugly workaround (nothing arong with ugly for a biologist like me).  That has brought up another question (as.data.frame on GRanges), but I'll send a separate email about that, as it's a separate issue.

Janet




On Jun 15, 2011, at 10:45 AM, Hervé Pagès wrote:

> Hi Janet,
> 
> This is because identical() is not the appropriate tool when comparing
> objects that contain external pointers. It will sometimes report that
> different objects are identical, and sometimes report that identical
> objects are different.
> 
> With different objects:
> 
>  > library(Biostrings)
>  > x <- DNAString("AC")
>  > y <- DNAString("GG")
>  > identical(x, y)
>  [1] TRUE
> 
> With "identical" objects:
> 
>  > z <- subseq(DNAString("TGG"), start=2)
>  > z
>    2-letter "DNAString" instance
>  seq: GG
>  > identical(y, z)
>  [1] FALSE
> 
> The 2nd one (false negative) is a fair one: even if 'y' and 'z'
> represent the same object (conceptually), they actually have different
> internal representations:
> 
>  > y at shared
>  SharedRaw of length 2 (data starting at address 0x4833bf0)
>  > y at offset
>  [1] 0
>  > z at shared
>  SharedRaw of length 3 (data starting at address 0x4780350)
>  > z at offset
>  [1] 1
> 
> Since identical() performs low-level comparison, it will report that
> the objects are different.
> 
> For the 1st one (false positive), you should blame identical() for
> reporting that 2 external pointers are *always* identical:
> 
>  > xp1 <- .Call("externalptr_new", PACKAGE="IRanges")
>  > xp2 <- .Call("externalptr_new", PACKAGE="IRanges")
>  > IRanges:::address(xp1)
>  [1] "0x2e20178"
>  > IRanges:::address(xp2)
>  [1] "0x2e20530"
>  > identical(xp1, xp2)
>  [1] TRUE
> 
> Actually, the addresses shouldn't matter. What should matter is the
> data that the 2 external pointers are pointing at. But identical()
> doesn't try to "follow" the pointers in order to get to the data
> they are pointing at. It just reports that 2 external pointers are
> identical! Like here, where the 2 external pointers are pointing to
> data that differ in *value*:
> 
>  > identical(x at shared@xp, y at shared@xp)
>  [1] TRUE
> 
> One could try to make a case on the R-devel mailing list to change
> the behavior of identical() on external pointers, but that wouldn't
> solve the 2nd issue (false positives). This one could be addressed
> by having a new generic e.g. equal() with methods for Vector objects
> and their derivations that would know how to reliably compare 2
> objects. Most of the time, it would just do the same as identical(),
> but for objects that contain external pointers (like DNAString,
> DNAStringSet, PhredQuality, etc...), it would use implement a
> specialized logic.
> 
> Any suggestion on this would be appreciated.
> 
> In the meantime, the ugly workaround that I often use is to apply
> as.character on the objects I want to compare:
> 
>  > identical(as.character(x), as.character(y))
>  [1] FALSE
>  > identical(as.character(y), as.character(z))
>  [1] TRUE
> 
> Cheers,
> H.
> 
> On 11-06-14 06:58 PM, Janet Young wrote:
>> Hi there,
>> 
>> I'm investigating a mistake I made when running BWA (a little embarrassing, but it happens).  I had initially failed to use the -I flag on BWA to tell it to convert our Illumina ASCII-64 qualities appropriately.  Now I'm taking a look to see how much differnce it makes in our downstream analysis.
>> 
>> Anyway, that's not why I'm writing - while delving into this I've found something unexpected about the "identical" function, and I'd love to understand that better if possible.
>> 
>> I used scanBam to read in my two BWA output files - with and without the "-I" option when I ran BWA (same input seqs, everything else the same). As I expected, with the -I option BWA converted the qualities (my lane1read2_bwaWithI_bamscan object), and in the other it didn't (my lane1read2_bwaNotI_bamscan object).  That makes sense to me - here's how the quals look after reading BWA output in with scanBam.
>> 
>> ### when I use the -I option with BWA:
>>> head(lane1read2_bwaWithI_bamscan[[1]][["qual"]])
>>   A PhredQuality instance of length 6
>>     width seq
>> [1]    50 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHFHHHHFHFF
>> [2]    36 GFFEEHHFHHHHHHHEFEEEHHHHHFHHHEEFEE at E
>> [3]    50 ##################################################
>> [4]    31 HHHHHHHHHHHHHHHHHHHHHHEHHHHFHFH
>> [5]    24 HHHHHHHHHHHHHHDFHHHHHHHH
>> [6]    50 FFFFCD=8BGEBHGHHHHEHGFGHHDHEHFHEHHHHGHHHHHHHHHGHHH
>> 
>> ### when I don't use the -I option with BWA:
>>> head(lane1read2_bwaNotI_bamscan[[1]][["qual"]])
>>   A PhredQuality instance of length 6
>>     width seq
>> [1]    50 gggggggggggggggggggggggggggggggggggggggggeggggegee
>> [2]    36 feeddggegggggggdedddgggggegggddedd_d
>> [3]    50 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>> [4]    31 ggggggggggggggggggggggdggggegeg
>> [5]    24 ggggggggggggggcegggggggg
>> [6]    50 eeeebc\Wafdagfggggdgfefggcgdgegdggggfgggggggggfggg
>> 
>> 
>> The part I don't understand is when I use identical to test whether those two qual objects are the same, it says they are identical:
>> 
>>> identical( lane1read2_bwaWithI_bamscan[[1]][["qual"]], lane1read2_bwaNotI_bamscan[[1]][["qual"]])
>> [1] TRUE
>> 
>> In one sense they're the same (if you convert from Illumina qualities to normal Phred qualities), but how did identical know to do that?  I might be misunderstanding identical - I thought that EVERYTHING about the object had to be identical.
>> 
>> I want to understand this as it seems like a dangerous misunderstanding - I use identical a fair amount to make sure I'm not screwing things up.
>> 
>> Really I should probably have converted the "qual" item in my scamBam-generated list to a SolexaQuality object to keep track of the fact they're offset by 64 and not 33, but that's OK. At this point I just want to take a look at what's changed about the BWA output when I use that -I flag.
>> 
>> Any thoughts?
>> 
>> thanks,
>> 
>> Janet
>> 
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319



More information about the Bioc-sig-sequencing mailing list