[Bioc-devel] GenomicRanges: Seqinfo indexing with non-existing key
Hervé Pagès
hpages at fhcrc.org
Thu Feb 27 22:55:46 CET 2014
Hi Julian, Dan,
On 02/27/2014 05:08 AM, Julian Gehring wrote:
> Hi Dan,
>
> Subestting your vector returns an element with both value and name set
> to 'NA'. In contrast, subsetting the 'Seqinfo' object returns an
> element with 'seqname' set to the key - which can be easily
> misinterpreted. Since the other columns ('seqlengths', 'genome',
> 'circular') are in many cases 'NA' anyway, it becomes hard to catch
> this. Also, subsetting a GRanges (e.g. 'as(si, "GRanges")') with a
> non-existing key throws an error.
Yeah maybe this looks kind of weird but it's actually intended.
If you look at the man page:
‘x[i]’: A Seqinfo object can be subsetted only by name i.e. ‘i’ must
be a character vector. This is a convenient way to
drop/add/reorder the rows (aka the sequence levels) of a
Seqinfo object.
Now I need to convince you that the man page is not documenting a bug or
a non-sense feature (some man pages do).
Examples:
> si <- Seqinfo(paste0("chr", 1:4), 11:14)
> si
Seqinfo of length 4
seqnames seqlengths isCircular genome
chr1 11 NA <NA>
chr2 12 NA <NA>
chr3 13 NA <NA>
chr4 14 NA <NA>
Drop seqlevels:
> si[seqnames(si)[-2]]
Seqinfo of length 3
seqnames seqlengths isCircular genome
chr1 11 NA <NA>
chr3 13 NA <NA>
chr4 14 NA <NA>
Add seqlevels:
> si[c("chrX", seqnames(si))]
Seqinfo of length 5
seqnames seqlengths isCircular genome
chrX NA NA <NA>
chr1 11 NA <NA>
chr2 12 NA <NA>
chr3 13 NA <NA>
chr4 14 NA <NA>
Reorder seqlevels:
> si[sample(seqnames(si))]
Seqinfo of length 4
seqnames seqlengths isCircular genome
chr4 14 NA <NA>
chr2 12 NA <NA>
chr1 11 NA <NA>
chr3 13 NA <NA>
Anyway the end-user should rarely have the need to directly subset
a Seqinfo object. Dropping/adding/reordering seqlevels is normally
done with the seqlevels() setter (which uses [ internally):
> seqlevels(si) <- union(seqlevels(si), c("chr4", "chrX", "chr2"))
> si
Seqinfo of length 5
seqnames seqlengths isCircular genome
chr1 11 NA <NA>
chr2 12 NA <NA>
chr3 13 NA <NA>
chr4 14 NA <NA>
chrX NA NA <NA>
Unlike [, using seqlevels() for dropping/adding/reordering seqlevels
works on any object that has a Seqinfo in it (e.g. GRanges), not just
on Seqinfo objects. [ is just the low-level operation that seqlevels
is based on.
Finally, I'll try to give some lame excuses for this design choice:
- Subsetting a Seqinfo object is special anyway because it only
supports subsetting by name (subsetting by integer or logical
vector is not supported).
- Because the seqnames column of a Seqinfo object is considered
to be the primary key for the object, it cannot contain NAs or
duplicated values (a lot of code would break if those were
allowed).
- Seqinfo objects are not intended and should not be seen as
fully-fledged vector-like objects. Ok length() works on them
but that's pretty much it. You cannot c(), duplicated(),
unique(), head(), tail(), rev(), split(), etc.. on them.
- From a more formal point of view, Seqinfo doesn't extend
Vector so has no moral obligation to follow whatever other
Vector subclasses do with [.
Mmmh... as I said, lame excuses. If not entirely convinced, can
big consistency fans (like myself) live with that?
Cheers,
H.
>
> Best wishes
> Julian
>
>
>
> On 27/02/14 14:00, Dan Du wrote:
>> Hi,
>>
>> To me it sounds like what I am expecting, from a R base point of view.
>>
>> x<-1:5
>> names(x)<-letters[1:5]
>> i<-'f'
>> x[i]
>>
>> Best,
>> Dan
>>
>> On Thu, 2014-02-27 at 13:49 +0100, Julian Gehring wrote:
>>> Hi,
>>>
>>> In the current bioc-stable and bioc-devel, a 'Seqinfo' object can be
>>> indexed successfully with any character key, even it is non-existing:
>>>
>>> #+BEGIN_SRC R
>>> library(GenomicRanges)
>>> library(BSgenome.Hsapiens.UCSC.hg19)
>>>
>>> si = seqinfo(BSgenome.Hsapiens.UCSC.hg19)
>>>
>>> ind2 = "999"
>>> ind2 %in% seqnames(si) ## FALSE, no seqlevel with this name
>>>
>>> si[ind2] ## works, but should not
>>> #+END_SRC
>>>
>>> Shouldn't this throw an error?
>>>
>>> Best wishes
>>> Julian
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list