[Bioc-devel] Changes in AnnotationDbi

Tue Jun 9 17:11:05 CEST 2015

As select() currently works, the returned keys are in the same order as
the input, with extra rows inserted as needed, and any one-to-nothing
mapping returns an NA. So by definition my (admittedly naive) method is
guaranteed to work. Your point is well taken, though: match() is safer.
My point, however, isn't to say how people should deal with duplicates,
and I didn't intend my example to be a model of what anybody should
do. Instead, I object to the idea that we would silently return something
that the average user might not expect, without some indication to alert
the user.
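
To make the difference between the two approaches concrete, here is a
minimal base-R sketch. The data frame 'res' is a made-up stand-in for what
select() returns (the probe and Entrez IDs are hypothetical, not real
annotation data), so nothing here depends on AnnotationDbi itself.

```r
# Hypothetical query keys and a made-up stand-in for a select() result:
# keys in input order, an extra row for the one-to-many key "p2", and an
# NA for the one-to-nothing key "p3".
keys <- c("p1", "p2", "p3")
res <- data.frame(
  PROBEID  = c("p1", "p2", "p2", "p3"),
  ENTREZID = c("101", "202", "203", NA),
  stringsAsFactors = FALSE
)

# Positional approach: keep the first row seen for each key.
by_dup <- res[!duplicated(res$PROBEID), ]

# match() approach: look up the first row for each key explicitly.
by_match <- res[match(keys, res$PROBEID), ]

# The two agree as long as 'res' preserves the input order.
agree <- identical(by_dup$ENTREZID, by_match$ENTREZID)

# If the row order were ever different, only match() would still line up.
shuffled       <- res[c(4, 2, 3, 1), ]
dup_shuffled   <- shuffled[!duplicated(shuffled$PROBEID), ]
match_shuffled <- shuffled[match(keys, shuffled$PROBEID), ]
```

With the rows shuffled, 'match_shuffled' is still in the order of 'keys',
while 'dup_shuffled' follows whatever order the rows happened to arrive in.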

Anyway, the issue is much more complicated than you seem to realize. If you
ask for more than one column to be returned (e.g., ENTREZID and SYMBOL) and
either has duplicates, you get multiple rows returned. So if there is a
single Entrez Gene ID, but multiple symbols, which symbol do you choose?
Which is the 'preferred' symbol? For that matter, which is the 'preferred'
Entrez Gene ID? Are there duplicates because NCBI hasn't yet discontinued
some duplicates that point to the same thing, or are there duplicates
because a given reporter is measuring multiple highly similar genes?
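
As a rough illustration of the two situations, the base-R sketch below
counts rows and distinct values per key. The data frame is again a made-up
stand-in for a two-column select() result; all IDs and symbols in it are
hypothetical.

```r
# Made-up stand-in for select(..., c("ENTREZID", "SYMBOL")):
#   p1: one ID, one symbol
#   p2: one ID, two symbols
#   p3: two IDs, each with its own symbol
res <- data.frame(
  PROBEID  = c("p1", "p2", "p2", "p3", "p3"),
  ENTREZID = c("101", "202", "202", "301", "302"),
  SYMBOL   = c("A", "B1", "B2", "C", "D"),
  stringsAsFactors = FALSE
)

# Keys that came back with more than one row.
n_rows    <- table(res$PROBEID)
ambiguous <- names(n_rows)[as.vector(n_rows) > 1]

# Distinct values per key, to tell the two cases apart.
n_ids  <- tapply(res$ENTREZID, res$PROBEID, function(x) length(unique(x)))
n_syms <- tapply(res$SYMBOL,   res$PROBEID, function(x) length(unique(x)))
```

Both p2 and p3 show up as ambiguous, but for different reasons, and the
counts say nothing about which row is the 'preferred' one.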

We have long insisted that the annotation packages we provide are 'as is',
and that we are not the arbiters of correctness. We simply give end users
the ability to easily get available data from within R. I would argue that
we should maintain that stance. I am not particularly enthused about any
filtering we might do, as I don't think we have the time or personnel
required to do it well. We should leave it to the manufacturer of the array
and the people at NCBI and Ensembl to do that part. And it should be up to
the end user to figure out which ID is the 'preferred' one. As a simple example:

> z <- mapIds(hugene20sttranscriptcluster.db,
+             keys(hugene20sttranscriptcluster.db), "ENTREZID", "PROBEID",
+             multiVals = "CharacterList")

> head(z[sapply(z, length) > 1])
CharacterList of length 6
[["16657436"]] 100287102 100288486 100287029 84771 100287596
[["16657440"]] 100422919 100422834 100422831 100302278
[["16657450"]] 729737 101929819 441124 100132062 101059936
[["16657473"]] 81399 729759 26683 441308
[["16657910"]] 8511 8510
[["16658119"]] 284630 100287898

If we now go to NCBI and look at the five IDs for the first entry
(16657436), they are all current, and map to DDX11L1, DDX11L9, DDX11L10,
DDX11L2, and DDX11L5. We can certainly choose the first one (and that is
what I do for my collaborators, in general), but is that the right thing
to do? If so, why?
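
For reference, 'choosing the first one' amounts to something like the
base-R sketch below. A plain list stands in for the CharacterList; the
first two entries are copied from the output above, and "p_single" is a
hypothetical one-to-one key added for contrast. As far as I know,
multiVals = "first" in mapIds() does the equivalent in one step.

```r
# Plain list standing in for the CharacterList 'z' shown above.
z <- list(
  "16657436" = c("100287102", "100288486", "100287029", "84771", "100287596"),
  "16657910" = c("8511", "8510"),
  "p_single" = "1234"  # hypothetical one-to-one mapping
)

# Take the first ID for every key.
first_id <- vapply(z, function(ids) ids[1], character(1))

# Record how many IDs were silently discarded for each key, so that the
# choice at least leaves a trace.
n_hits  <- lengths(z)
dropped <- n_hits - 1L
```

Nothing in this makes the first ID any more 'correct' than the others; it
is just the one that happens to be listed first.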

Best,

Jim

On Tue, Jun 9, 2015 at 5:52 AM, Simon Anders <anders at embl.de> wrote:

> Hi
>
> My two cents:
>
> On 04/06/15 19:50, James W. MacDonald wrote:
>
>> In other words, for me it is a common practice to do something like this:
>>
>> fit <- lmFit(eset, design)
>> fit2 <- eBayes(fit)
>> gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
>> gns <- gns[!duplicated(gns[,1]),]
>> fit2$genes <- gns
>>
>> I add in the step where dups are removed because I already know they are
>> there. But a naive user might instead do
>>
>> fit2$genes <- select(<chippackage>, featureNames(eset),
>> c("ENTREZID","SYMBOL"))
>>
>
> I'm not even that happy with James' first solution, as it relies on the
> order being correct after removing the duplicates. I'd feel safer to use
> 'match' to ensure that. (What if an EntrezId is not found in the Annotation
> DB? Will we have a line with NA, or is the line simply missing? The latter
> would break James' code.)
>
> What users really want here is a way to get the "preferred" symbol for an
> entrezId, and for lack of this, they accept simply a random one or the
> first one (in some unspecified collation). So, we should have a function,
> maybe 'select1', to select one and only one hit for each query value.
>
>   select1(x, keys, columns, keytype, requireUnique=FALSE, ... )
>
> This would query the AnnotationDbi object 'x' as does 'select', but return
> a data frame with the columns specified in 'columns', and the vector that
> was passed as 'keys' as row names, thus guaranteeing that each line in the
> data frame corresponds to one query key. If there were multiple records for
> a key, the first one is used, unless 'requireUnique' is set, in which case
> an error is issued. And if no record is present for a key, the data frame
> contains a row of NAs for this key.
>
> This would be quite convenient for any kind of ID conversion issues.
>
>   Simon
>
>
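
The behaviour described for 'select1' above could be sketched roughly as
follows. This is not an existing AnnotationDbi function: 'select1_sketch',
its 'res' and 'keycol' arguments, and the example data are all
hypothetical. It operates on a data frame standing in for a select()
result rather than on an annotation object, and it assumes the keys are
unique, since they become row names.

```r
# Hypothetical helper, not part of AnnotationDbi.
#   res:    data frame standing in for what select() would return
#   keys:   unique query keys
#   keycol: name of the column in 'res' that holds the keys
select1_sketch <- function(res, keys, keycol, requireUnique = FALSE) {
  found <- res[res[[keycol]] %in% keys, , drop = FALSE]
  if (requireUnique && anyDuplicated(found[[keycol]]) > 0) {
    stop("some keys map to more than one record")
  }
  # match() picks the first record for each key and gives NA for keys with
  # no record; indexing with NA produces a row of NAs.
  out <- res[match(keys, res[[keycol]]),
             setdiff(names(res), keycol), drop = FALSE]
  rownames(out) <- keys
  out
}

# Made-up example: "p2" has two records, "p9" has none.
res <- data.frame(
  PROBEID = c("p1", "p2", "p2"),
  SYMBOL  = c("A", "B1", "B2"),
  stringsAsFactors = FALSE
)
out <- select1_sketch(res, c("p1", "p2", "p9"), "PROBEID")
strict <- try(select1_sketch(res, c("p1", "p2"), "PROBEID",
                             requireUnique = TRUE), silent = TRUE)
```

Here 'out' has exactly one row per query key, with the first symbol for
"p2" and an NA row for "p9", while the requireUnique call stops with an
error because "p2" is ambiguous.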

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
