[Bioc-sig-seq] "identical" on PhredQuality objects - I'm confused

Hervé Pagès hpages at fhcrc.org
Wed Jun 15 19:45:00 CEST 2011


Hi Janet,

This is because identical() is not the appropriate tool when comparing
objects that contain external pointers. It will sometimes report that
different objects are identical, and sometimes report that identical
objects are different.

With different objects:

   > library(Biostrings)
   > x <- DNAString("AC")
   > y <- DNAString("GG")
   > identical(x, y)
   [1] TRUE

With "identical" objects:

   > z <- subseq(DNAString("TGG"), start=2)
   > z
     2-letter "DNAString" instance
   seq: GG
   > identical(y, z)
   [1] FALSE

The 2nd one (false negative) is a fair one: even if 'y' and 'z'
represent the same object (conceptually), they actually have different
internal representations:

   > y at shared
   SharedRaw of length 2 (data starting at address 0x4833bf0)
   > y at offset
   [1] 0
   > z at shared
   SharedRaw of length 3 (data starting at address 0x4780350)
   > z at offset
   [1] 1

Since identical() performs low-level comparison, it will report that
the objects are different.

For the 1st one (false positive), you should blame identical() for
reporting that 2 external pointers are *always* identical:

   > xp1 <- .Call("externalptr_new", PACKAGE="IRanges")
   > xp2 <- .Call("externalptr_new", PACKAGE="IRanges")
   > IRanges:::address(xp1)
   [1] "0x2e20178"
   > IRanges:::address(xp2)
   [1] "0x2e20530"
   > identical(xp1, xp2)
   [1] TRUE

Actually, the addresses shouldn't matter. What should matter is the
data that the 2 external pointers are pointing at. But identical()
doesn't try to "follow" the pointers in order to get to the data
they are pointing at. It just reports that 2 external pointers are
identical! Like here, where the 2 external pointers are pointing to
data that differ in *value*:

   > identical(x at shared@xp, y at shared@xp)
   [1] TRUE

One could try to make a case on the R-devel mailing list to change
the behavior of identical() on external pointers, but that wouldn't
solve the 2nd issue (false positives). This one could be addressed
by having a new generic e.g. equal() with methods for Vector objects
and their derivations that would know how to reliably compare 2
objects. Most of the time, it would just do the same as identical(),
but for objects that contain external pointers (like DNAString,
DNAStringSet, PhredQuality, etc...), it would use implement a
specialized logic.

Any suggestion on this would be appreciated.

In the meantime, the ugly workaround that I often use is to apply
as.character on the objects I want to compare:

   > identical(as.character(x), as.character(y))
   [1] FALSE
   > identical(as.character(y), as.character(z))
   [1] TRUE

Cheers,
H.

On 11-06-14 06:58 PM, Janet Young wrote:
> Hi there,
>
> I'm investigating a mistake I made when running BWA (a little embarrassing, but it happens).  I had initially failed to use the -I flag on BWA to tell it to convert our Illumina ASCII-64 qualities appropriately.  Now I'm taking a look to see how much differnce it makes in our downstream analysis.
>
> Anyway, that's not why I'm writing - while delving into this I've found something unexpected about the "identical" function, and I'd love to understand that better if possible.
>
> I used scanBam to read in my two BWA output files - with and without the "-I" option when I ran BWA (same input seqs, everything else the same). As I expected, with the -I option BWA converted the qualities (my lane1read2_bwaWithI_bamscan object), and in the other it didn't (my lane1read2_bwaNotI_bamscan object).  That makes sense to me - here's how the quals look after reading BWA output in with scanBam.
>
> ### when I use the -I option with BWA:
>> head(lane1read2_bwaWithI_bamscan[[1]][["qual"]])
>    A PhredQuality instance of length 6
>      width seq
> [1]    50 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHFHHHHFHFF
> [2]    36 GFFEEHHFHHHHHHHEFEEEHHHHHFHHHEEFEE at E
> [3]    50 ##################################################
> [4]    31 HHHHHHHHHHHHHHHHHHHHHHEHHHHFHFH
> [5]    24 HHHHHHHHHHHHHHDFHHHHHHHH
> [6]    50 FFFFCD=8BGEBHGHHHHEHGFGHHDHEHFHEHHHHGHHHHHHHHHGHHH
>
> ### when I don't use the -I option with BWA:
>> head(lane1read2_bwaNotI_bamscan[[1]][["qual"]])
>    A PhredQuality instance of length 6
>      width seq
> [1]    50 gggggggggggggggggggggggggggggggggggggggggeggggegee
> [2]    36 feeddggegggggggdedddgggggegggddedd_d
> [3]    50 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
> [4]    31 ggggggggggggggggggggggdggggegeg
> [5]    24 ggggggggggggggcegggggggg
> [6]    50 eeeebc\Wafdagfggggdgfefggcgdgegdggggfgggggggggfggg
>
>
> The part I don't understand is when I use identical to test whether those two qual objects are the same, it says they are identical:
>
>> identical( lane1read2_bwaWithI_bamscan[[1]][["qual"]], lane1read2_bwaNotI_bamscan[[1]][["qual"]])
> [1] TRUE
>
> In one sense they're the same (if you convert from Illumina qualities to normal Phred qualities), but how did identical know to do that?  I might be misunderstanding identical - I thought that EVERYTHING about the object had to be identical.
>
> I want to understand this as it seems like a dangerous misunderstanding - I use identical a fair amount to make sure I'm not screwing things up.
>
> Really I should probably have converted the "qual" item in my scamBam-generated list to a SolexaQuality object to keep track of the fact they're offset by 64 and not 33, but that's OK. At this point I just want to take a look at what's changed about the BWA output when I use that -I flag.
>
> Any thoughts?
>
> thanks,
>
> Janet
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-sig-sequencing mailing list