[Bioc-devel] GenomicRanges: Seqinfo indexing with non-existing key

Hervé Pagès hpages at fhcrc.org
Thu Feb 27 22:55:46 CET 2014


Hi Julian, Dan,

On 02/27/2014 05:08 AM, Julian Gehring wrote:
> Hi Dan,
>
> Subestting your vector returns an element with both value and name set
> to 'NA'.  In contrast, subsetting the 'Seqinfo' object returns an
> element with 'seqname' set to the key - which can be easily
> misinterpreted.  Since the other columns ('seqlengths', 'genome',
> 'circular') are in many cases 'NA' anyway, it becomes hard to catch
> this.  Also, subsetting a GRanges (e.g. 'as(si, "GRanges")') with a
> non-existing key throws an error.

Yeah maybe this looks kind of weird but it's actually intended.

If you look at the man page:

   ‘x[i]’: A Seqinfo object can be subsetted only by name i.e. ‘i’ must
           be a character vector.  This is a convenient way to
           drop/add/reorder the rows (aka the sequence levels) of a
           Seqinfo object.

Now I need to convince you that the man page is not documenting a bug or
a non-sense feature (some man pages do).

Examples:

   > si <- Seqinfo(paste0("chr", 1:4), 11:14)
   > si
   Seqinfo of length 4
   seqnames seqlengths isCircular genome
   chr1             11         NA   <NA>
   chr2             12         NA   <NA>
   chr3             13         NA   <NA>
   chr4             14         NA   <NA>

Drop seqlevels:

   > si[seqnames(si)[-2]]
   Seqinfo of length 3
   seqnames seqlengths isCircular genome
   chr1             11         NA   <NA>
   chr3             13         NA   <NA>
   chr4             14         NA   <NA>

Add seqlevels:

   > si[c("chrX", seqnames(si))]
   Seqinfo of length 5
   seqnames seqlengths isCircular genome
   chrX             NA         NA   <NA>
   chr1             11         NA   <NA>
   chr2             12         NA   <NA>
   chr3             13         NA   <NA>
   chr4             14         NA   <NA>

Reorder seqlevels:

   > si[sample(seqnames(si))]
   Seqinfo of length 4
   seqnames seqlengths isCircular genome
   chr4             14         NA   <NA>
   chr2             12         NA   <NA>
   chr1             11         NA   <NA>
   chr3             13         NA   <NA>

Anyway the end-user should rarely have the need to directly subset
a Seqinfo object. Dropping/adding/reordering seqlevels is normally
done with the seqlevels() setter (which uses [ internally):

   > seqlevels(si) <- union(seqlevels(si), c("chr4", "chrX", "chr2"))
   > si
   Seqinfo of length 5
   seqnames seqlengths isCircular genome
   chr1             11         NA   <NA>
   chr2             12         NA   <NA>
   chr3             13         NA   <NA>
   chr4             14         NA   <NA>
   chrX             NA         NA   <NA>

Unlike [, using seqlevels() for dropping/adding/reordering seqlevels
works on any object that has a Seqinfo in it (e.g. GRanges), not just
on Seqinfo objects. [ is just the low-level operation that seqlevels
is based on.

Finally, I'll try to give some lame excuses for this design choice:

   - Subsetting a Seqinfo object is special anyway because it only
     supports subsetting by name (subsetting by integer or logical
     vector is not supported).

   - Because the seqnames column of a Seqinfo object is considered
     to be the primary key for the object, it cannot contain NAs or
     duplicated values (a lot of code would break if those were
     allowed).

   - Seqinfo objects are not intended and should not be seen as
     fully-fledged vector-like objects. Ok length() works on them
     but that's pretty much it. You cannot c(), duplicated(),
     unique(), head(), tail(), rev(), split(), etc.. on them.

   - From a more formal point of view, Seqinfo doesn't extend
     Vector so has no moral obligation to follow whatever other
     Vector subclasses do with [.

Mmmh... as I said, lame excuses. If not entirely convinced, can
big consistency fans (like myself) live with that?

Cheers,
H.


>
> Best wishes
> Julian
>
>
>
> On 27/02/14 14:00, Dan Du wrote:
>> Hi,
>>
>> To me it sounds like what I am expecting, from a R base point of view.
>>
>> x<-1:5
>> names(x)<-letters[1:5]
>> i<-'f'
>> x[i]
>>
>> Best,
>> Dan
>>
>> On Thu, 2014-02-27 at 13:49 +0100, Julian Gehring wrote:
>>> Hi,
>>>
>>> In the current bioc-stable and bioc-devel, a 'Seqinfo' object can be
>>> indexed successfully with any character key, even it is non-existing:
>>>
>>> #+BEGIN_SRC R
>>>    library(GenomicRanges)
>>>    library(BSgenome.Hsapiens.UCSC.hg19)
>>>
>>>    si = seqinfo(BSgenome.Hsapiens.UCSC.hg19)
>>>
>>>    ind2 = "999"
>>>    ind2 %in% seqnames(si) ## FALSE, no seqlevel with this name
>>>
>>>    si[ind2] ## works, but should not
>>> #+END_SRC
>>>
>>> Shouldn't this throw an error?
>>>
>>> Best wishes
>>> Julian
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list