[Bioc-sig-seq] GRangesList with duplicate names

Fri Feb 25 20:50:04 CET 2011

Hi,

An important use case for GRangesList is the storage of a list
of genomic features grouped by their membership in a parent
feature (actually this was the original motivation for implementing
the GRangesList class).

Annotations are not perfect:

  > library(GenomicFeatures)
  > txdb <- makeTranscriptDbFromUCSC(tablename="refGene")
  > tx <- transcripts(txdb, columns="tx_name")

Here the transcript names are not unique:

  > any(duplicated(values(tx)$tx_name))
  [1] TRUE

Here tx_name is just a column in the values slot of the GRanges object.
Duplicates are of course perfectly ok here.

However, the following code generates a GRangesList object with
duplicated names:

  > ex <- exonsBy(txdb, "tx", use.names=TRUE)
  Warning message:
  In .set.group.names(ans, use.names, txdb, by) :
    some group names are NAs or duplicated
  > any(duplicated(names(ex)))
  [1] TRUE

Note the warning.
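
To see which group names trigger that warning, one can inspect them
directly. The sketch below uses a plain character vector as a stand-in
for names(ex) (the values are made up for illustration) and a
hypothetical helper, problem_names(), which is not part of
GenomicFeatures:

```r
# Hypothetical helper (not part of any package): report how many names
# are NA and which non-NA names occur more than once.
problem_names <- function(nms) {
  list(
    n_na       = sum(is.na(nms)),
    duplicated = unique(nms[!is.na(nms) & duplicated(nms)])
  )
}

# Made-up transcript names standing in for names(ex)
nms <- c("NM_0001", "NM_0002", "NM_0001", NA, "NM_0003", "NM_0002")
res <- problem_names(nms)
res$n_na        # number of NA names
res$duplicated  # names that appear more than once
```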

By default (i.e. without specifying use.names=TRUE), the internal ids
of the transcripts (tx_id col, guaranteed to be unique) would have been
put on the object here instead of the tx names. It's worth remembering
that the original version of exonsBy() always used those internal
ids but, a few months ago, the 'use.names' arg was added because some
users found it useful. See this thread:

https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2010-August/001387.html

Should we revisit this?

Also, if we disallow duplicated names, some basic operations, like
combining 2 GRangesList objects with clashing names, won't work anymore:

  > x <- GRangesList(a=GRanges(), b=GRanges())
  > names(c(x, x))
  [1] "a" "b" "a" "b"

Or, if one of the 2 objects has no names:

  > y <- GRangesList(GRanges(), GRanges())
  > names(c(x, y))
  [1] "a" "b" ""  "" 

Or if my index 'i' contains duplicated values:

  > i <- c(2, 1, 2)
  > x[i]  # will have duplicated names

etc...

(Note that validObject would fail on all those objects I just made.)
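
The same situations arise with ordinary lists in base R, which may make
the point easier to try out without GenomicRanges. This is only a
sketch; the use of make.unique() at the end is one possible workaround
and an assumption on my part, not something proposed in this thread:

```r
# Ordinary lists: c() and subsetting by a repeated index both yield
# duplicated names without complaint.
x <- list(a = 1, b = 2)
nms_c <- names(c(x, x))         # "a" "b" "a" "b"
nms_i <- names(x[c(2, 1, 2)])   # "b" "a" "b"

# One possible workaround (an assumption, see above): make.unique()
# appends a numeric suffix to repeated names.
nms_u <- make.unique(nms_c)     # "a" "b" "a.1" "b.1"
```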

I guess names are not required to be unique on ordinary lists
for those reasons too. Environments do require names to be unique
and can use a hash table internally for fast name lookup. Those 2
features make them the equivalent of dictionaries in Python.
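
A minimal base R sketch of an environment used that way (the variable
names are illustrative only):

```r
# An environment used as a dictionary: keys are unique by construction.
dict <- new.env(hash = TRUE)
assign("a", 1, envir = dict)
assign("b", 2, envir = dict)
assign("a", 3, envir = dict)   # overwrites the first "a", no duplicate key

keys  <- ls(dict)                                     # "a" "b"
value <- get("a", envir = dict)                       # 3
has_c <- exists("c", envir = dict, inherits = FALSE)  # FALSE
```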

Disclaimer: I've never been a fan of the fact that the names of
an ordinary list or vector in R can have duplicates. I've just
learned to live with it. A really bad consequence of this feature
is that subsetting 'x' by a non-unique name only returns the first
element with that name:

  > x <- c(a=2, b=3, a=4)
  > x[names(x)]
  a b a 
  2 3 2 

Without even a warning!
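
If all the elements carrying a given name are wanted, comparing the
names explicitly works in base R. A short sketch:

```r
x <- c(a = 2, b = 3, a = 4)

# Character subsetting matches only the first element with each name
first_a <- x["a"]             # a = 2

# Comparing the names explicitly returns every match
all_a <- x[names(x) == "a"]   # a = 2, a = 4
```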

For GRangesList objects, and more generally, for (almost) all the
List (L uppercase) subclasses defined in IRanges and GenomicRanges,
duplicated names are currently ok. And we don't issue a warning either
when subsetting by a non-unique name (we were just following what
ordinary lists do) but we could (and maybe we should).
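
What such a warning could look like, sketched on an ordinary list.
subset_by_name() is a hypothetical wrapper written for this
illustration; it is not an IRanges or GenomicRanges function and says
nothing about how those packages would implement it:

```r
# Hypothetical wrapper: subset by name, but warn when a requested name
# matches more than one element of 'x'.
subset_by_name <- function(x, nm) {
  hits <- table(names(x)[names(x) %in% nm])
  ambiguous <- names(hits)[hits > 1L]
  if (length(ambiguous) > 0L) {
    warning("non-unique name(s): ", paste(ambiguous, collapse = ", "),
            "; only the first match is returned")
  }
  x[nm]
}

x <- list(a = 1, b = 2, a = 3)

# "b" is unique: plain subsetting, no warning
ok <- subset_by_name(x, "b")

# "a" is duplicated: capture the warning message instead of the result
msg <- tryCatch(subset_by_name(x, "a"),
                warning = function(w) conditionMessage(w))
```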

Cheers,
H.

----- Original Message -----
From: "Martin Morgan" <mtmorgan at fhcrc.org>
To: "Steve Lianoglou" <mailinglist.honeypot at gmail.com>
Cc: bioc-sig-sequencing at r-project.org
Sent: Friday, February 25, 2011 9:46:02 AM
Subject: Re: [Bioc-sig-seq] GRangesList with duplicate names

On 02/25/2011 07:05 AM, Steve Lianoglou wrote:
> Hi,
> 
> I think I'm with Ivan and leaning towards not allowing duplicate names
> in a GRangesList, even though normal lists in R do allow duplicate
> names.
> 
> As Ivan suggested, I also often use the names of any R list when I
> want to use the list as something similar to a Python dictionary.

I cast my vote in the same direction, for similar reasons.

Dario's use case offered a different perspective on GRangesList, which I
had thought of as a collection of GRanges in a hierarchical relationship,
like exons-within-genes. Maybe this is just me, though.

Wanted also to suggest some alternatives, with

  a <- GRanges("A", IRanges(1:3, width=5))
  b <- GRanges("B", IRanges(5:7, width=10))
  c <- GRanges("C", IRanges(10:12, width=15))

The first is to use a GRangesList but store the case / control status as
elementMetadata / values and take advantage of the flexibility that
offers to record them as a factor

> grl <- GRangesList(a=a, b=b, c=c)
> values(grl)[["Status"]] <- factor(c("Cancer", "Cancer", "Control"))

The second is to more-or-less honor the notion of GRangesList as a
hierarchy, hence use a different data structure

> lst <- SimpleList(a=a, b=b, c=c)
> df <- DataFrame(Status=factor(c("Cancer", "Cancer", "Control")))
> elementMetadata(lst) <- df

The third might be relevant if the GRanges ('regions of interest') are
actually common across samples, e.g.,

  d <- GRanges("D", IRanges(c(1,5, 10), c(7, 16, 26)))

perhaps with measurements made on each

  assays <- SimpleList(asinhCounts=matrix(rnorm(9, 6, 2), 3))

and coordinated in a SummarizedExperiment

> ## some additional annotation on rows / cols
> names(d) <- paste("roi", seq_len(length(d)), sep="")
> rownames(df) <- paste("sample", seq_len(nrow(df)), sep="")
> sx <- SummarizedExperiment(assays, rowData=d, colData=df)
> sx
class: SummarizedExperiment
dim: 3 3
assays(1): asinhCounts
rownames(3): roi1 roi2 roi3
rowData values names(0):
colnames(3): sample1 sample2 sample3
colData names(1): Status

where measurements (e.g., asinh-transformed counts) associated with
ranges in all samples are part of 'assays', marginal values associated
with rows / ranges (e.g., significance values associated with
differential expression) are values(rowData(sx)), and marginal values
associated with columns / samples are colData(sx).

Martin

> Still, if the consensus turns out to allow duplicate names in
> *RangesList(s), perhaps it'd be nice for the validity method to
> fire off a warning that duplicate names exist in the list so the user
> knows something might be fishy.
> 
> -steve
> 
> On Fri, Feb 25, 2011 at 9:48 AM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
>> Hello Hervé,
>>
>> While we wait for comments from "power users", I just wanted to say
>> that non-unique names open the door for potentially more problems than
>> solutions.
>>
>> Imagine a Python dictionary or a Perl hash with multiple values per key.
>>
>> I wonder how many R/Bioconductor functions exploit the vector's
>> capability to hold multiple elements with the same name.
>>
>> Regardless, thanks for asking users' opinions.
>>
>> Ivan
>>
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1016 and 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>>
>> On Fri, Feb 25, 2011 at 3:08 AM, Pages, Herve <hpages at fhcrc.org> wrote:
>>> Hi Dario,
>>>
>>> A GRangesList object with duplicated names is apparently
>>> considered broken:
>>>
>>>> grl <- GRangesList(GRanges(), GRanges())
>>>> names(grl) <- c("a", "a")
>>>> validObject(grl)
>>> Error in `rownames<-`(`*tmp*`, value = c("a", "a")) :
>>>  duplicate rownames not allowed
>>>
>>> If we are ok with this feature, we should fix the "names<-"
>>> method (and any other code around that lets the user generate
>>> broken objects).
>>>
>>> But if we are not ok with this feature, we should modify
>>> the validity method for GRangesList objects. I tend to prefer
>>> this solution for 3 reasons:
>>>
>>>  1. Consistency with ordinary vectors: the names of a vector
>>>     in R are not required to be unique.
>>>
>>>  2. It's not uncommon to see the same name used for 2 different
>>>     genes. One might still want to be able to stick those names
>>>     on a GRangesList object where each top-level element corresponds
>>>     to a gene (e.g. exons grouped by gene).
>>>
>>>  3. It's easier to modify the validity method than to go around
>>>     trying to find and fix every piece of code in GenomicRanges
>>>     (and maybe other places) that can potentially produce a
>>>     GRangesList object with duplicated names.
>>>
>>> How do our power users feel about this?
>>>
>>> Thanks,
>>> H.
>>>
>>>
>>> ----- Original Message -----
>>> From: "Dario Strbenac" <D.Strbenac at garvan.org.au>
>>> To: bioc-sig-sequencing at r-project.org
>>> Sent: Thursday, February 24, 2011 10:00:11 PM
>>> Subject: [Bioc-sig-seq] GRangesList with duplicate names
>>>
>>> Hello,
>>>
>>> It is possible to create a GRangesList with duplicated names, but not to re-order it.
>>>
>>>> summary(grl)
>>>     Length       Class        Mode
>>>          3 GRangesList          S4
>>>> names(grl) <- c("Cancer", "Cancer", "Normal")
>>>> grl[3:1]
>>> Error in `rownames<-`(`*tmp*`, value = c("Normal", "Cancer", "Cancer")) :
>>>  duplicate rownames not allowed
>>>> sessionInfo()
>>> R version 2.12.0 (2010-10-15)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_AU.UTF-8
>>>  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] GenomicRanges_1.2.3 IRanges_1.8.9
>>>
>>> --------------------------------------
>>> Dario Strbenac
>>> Research Assistant
>>> Cancer Epigenetics
>>> Garvan Institute of Medical Research
>>> Darlinghurst NSW 2010
>>> Australia
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>
> 
> 
> 

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793
