[Bioc-devel] Changes in AnnotationDbi

Tue Jun 9 11:52:14 CEST 2015

Hi

My two cents:

On 04/06/15 19:50, James W. MacDonald wrote:
> In other words, for me it is a common practice to do something like this:
>
> fit <- lmFit(eset, design)
> fit2 <- eBayes(fit)
> gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
> gns <- gns[!duplicated(gns[,1]),]
> fit2$genes <- gns
>
> I add in the step where dups are removed because I already know they are
> there. But a naive user might instead do
>
> fit2$genes <- select(<chippackage>, featureNames(eset),
> c("ENTREZID","SYMBOL"))

I'm not entirely happy even with James' first solution, as it relies on 
the rows still being in the same order as featureNames(eset) once the 
duplicates have been removed. I'd feel safer using 'match' to ensure 
that. (What if a query key is not found in the annotation DB? Will we 
get a row of NAs, or will the row simply be missing? The latter would 
break James' code.)
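
To illustrate the 'match' idea, here is a minimal sketch on a toy data 
frame that merely stands in for the output of 'select'. The column names 
and values are made up for the example, and I have not checked what 
'select' itself does with keys it cannot find; the point is only that 
'match' gives one row per key, in key order, either way.

```r
# Toy stand-in for a select() result (made-up values): probe "p2" maps to
# two genes, and "p3" is deliberately absent to mimic a dropped key.
keys <- c("p1", "p2", "p3")
res <- data.frame(
  PROBEID  = c("p1", "p2", "p2"),
  ENTREZID = c("10", "20", "21"),
  SYMBOL   = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# match() gives, for each key, the index of its first row in 'res'
# (NA if the key is absent). Indexing with NA yields a row of NAs,
# so 'aligned' has exactly one row per key, in the order of 'keys'.
aligned <- res[match(keys, res$PROBEID), ]
aligned$PROBEID <- keys    # restore the key in rows that were all NA
rownames(aligned) <- NULL
```

With the result aligned like this, assigning it to fit2$genes no longer 
depends on the row order that the query happens to return.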

What users really want here is a way to get the "preferred" symbol for 
an Entrez ID, and for lack of this, they simply accept a random one or 
the first one (in some unspecified order). So, we should have a 
function, maybe 'select1', to select one and only one hit for each query 
value.

   select1(x, keys, columns, keytype, requireUnique=FALSE, ... )

This would query the AnnotationDbi object 'x' as 'select' does, but 
return a data frame with the columns specified in 'columns', and with 
the vector that was passed as 'keys' as row names, thus guaranteeing 
that each row in the data frame corresponds to one query key. If there 
are multiple records for a key, the first one is used, unless 
'requireUnique' is set, in which case an error is raised. And if no 
record is present for a key, the data frame contains a row of NAs for 
this key.
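
A rough sketch of how this could look, not a finished implementation. 
'select1' does not exist in AnnotationDbi; both it and the helper 
'firstHitPerKey' are hypothetical names. The sketch assumes that the 
data frame returned by 'select' contains a column named after 'keytype', 
and unlike the description above it keeps that key column in the result. 
Row names are only set when the keys are unique and non-NA, since a data 
frame cannot have duplicated row names.

```r
# Hypothetical helper: reduce a select()-style data frame to one row per key.
firstHitPerKey <- function(res, keys, keycol, requireUnique = FALSE) {
  if (requireUnique) {
    # Keys that occur in more than one row of the result.
    dup <- intersect(keys, res[[keycol]][duplicated(res[[keycol]])])
    if (length(dup) > 0L)
      stop("multiple records for key(s): ", paste(dup, collapse = ", "))
  }
  # First matching row per key; keys without a match give a row of NAs.
  out <- res[match(keys, res[[keycol]]), , drop = FALSE]
  out[[keycol]] <- keys
  # Row names must be unique and non-missing, so use the keys only then.
  rownames(out) <- if (anyDuplicated(keys) || anyNA(keys)) NULL else keys
  out
}

# Hypothetical wrapper with the signature proposed above.
select1 <- function(x, keys, columns, keytype, requireUnique = FALSE, ...) {
  res <- AnnotationDbi::select(x, keys = keys, columns = columns,
                               keytype = keytype, ...)
  firstHitPerKey(res, keys, keytype, requireUnique)
}

# Toy check of the helper on made-up data (no annotation package needed).
toy <- data.frame(PROBEID  = c("p1", "p2", "p2"),
                  ENTREZID = c("10", "20", "21"),
                  SYMBOL   = c("A", "B", "C"),
                  stringsAsFactors = FALSE)
out <- firstHitPerKey(toy, c("p1", "p2", "p3"), "PROBEID")
```

In the toy example, "p2" gets its first record and "p3" gets a row of 
NAs; with requireUnique=TRUE the same call stops with an error because 
"p2" has two records.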

This would be quite convenient for any kind of ID conversion issues.

   Simon