[Bioc-devel] Changes in AnnotationDbi
Simon Anders
anders at embl.de
Tue Jun 9 11:52:14 CEST 2015
Hi
My two cents:
On 04/06/15 19:50, James W. MacDonald wrote:
> In other words, for me it is a common practice to do something like this:
>
> fit <- lmFit(eset, design)
> fit2 <- eBayes(fit)
> gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
> gns <- gns[!duplicated(gns[,1]),]
> fit2$genes <- gns
>
> I add in the step where dups are removed because I already know they are
> there. But a naive user might instead do
>
> fit2$genes <- select(<chippackage>, featureNames(eset),
> c("ENTREZID","SYMBOL"))
I'm not even that happy with James' first solution, as it relies on the
rows still lining up with the features, in the same order, after the
duplicates are removed. I'd feel safer using 'match' to ensure that.
(What if an Entrez ID is not found in the annotation DB? Will we get a
line with NA, or is the line simply missing? The latter would break
James' code.)
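A minimal sketch of the 'match' approach, using a toy data frame in place
of the real select() result. The probe IDs and values are made up, and the
key column is assumed to be called PROBEID:

```r
# Toy stand-in for the output of
#   select(<chippackage>, ids, c("ENTREZID","SYMBOL"))
# "p1" has two hits and "p2" has no row at all (the worst case asked about).
ids <- c("p1", "p2", "p3")
gns <- data.frame(
  PROBEID  = c("p1", "p1", "p3"),
  ENTREZID = c("10", "11", "30"),
  SYMBOL   = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# match() gives, for each id, the index of its first row in gns, or NA if
# the id is absent. Indexing with NA yields a row of NAs, so the result has
# exactly one row per id, in the order of ids.
aligned <- gns[match(ids, gns$PROBEID), ]
aligned$PROBEID <- ids        # restore the key for ids that had no hit
rownames(aligned) <- ids
```

This gives the same rows as the !duplicated() idiom when every id has at
least one row, but it stays aligned with the ids even when a row is missing.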
What users really want here is a way to get the "preferred" symbol for
an Entrez ID, and for lack of this, they simply accept an arbitrary one
or the first one (in some unspecified order). So, we should have a
function, maybe 'select1', to select one and only one hit for each query
value.
select1(x, keys, columns, keytype, requireUnique=FALSE, ... )
This would query the AnnotationDbi object 'x' just as 'select' does, but
return a data frame with the columns specified in 'columns', and the
vector that was passed as 'keys' as row names, thus guaranteeing that
each line in the data frame corresponds to one query key. If there are
multiple records for a key, the first one is used, unless
'requireUnique' is set, in which case an error is raised. And if no
record is present for a key, the data frame contains a row of NAs for
this key.
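A rough sketch of how such a function could behave. 'select1' does not
exist in AnnotationDbi; the 'selectFun' argument is an extra assumption,
added only so that the sketch can be tried against a mock without a
database. It also assumes that the result of 'select' contains a column
named after 'keytype', and that the keys are unique:

```r
# Hypothetical sketch of the proposed select1(); not an AnnotationDbi function.
select1 <- function(x, keys, columns, keytype, requireUnique = FALSE, ...,
                    selectFun = AnnotationDbi::select) {
  # Keys become row names, so this sketch insists that they are unique.
  stopifnot(!anyDuplicated(keys))
  res <- selectFun(x, keys = keys, columns = columns, keytype = keytype, ...)
  if (requireUnique) {
    hits <- res[[keytype]][res[[keytype]] %in% keys]
    dup <- unique(hits[duplicated(hits)])
    if (length(dup) > 0)
      stop("multiple records for key(s): ", paste(dup, collapse = ", "))
  }
  # First hit per key; keys without any record get a row of NAs.
  out <- res[match(keys, res[[keytype]]), columns, drop = FALSE]
  rownames(out) <- keys
  out
}

# Toy table and mock 'select' standing in for a real annotation package.
toy <- data.frame(
  PROBEID  = c("p1", "p1", "p3"),
  ENTREZID = c("10", "11", "30"),
  SYMBOL   = c("A", "B", "C"),
  stringsAsFactors = FALSE
)
mockSelect <- function(x, keys, columns, keytype, ...) {
  x[x[[keytype]] %in% keys, unique(c(keytype, columns)), drop = FALSE]
}

res <- select1(toy, c("p1", "p2", "p3"), c("ENTREZID", "SYMBOL"),
               keytype = "PROBEID", selectFun = mockSelect)
```

Here 'res' has one row per key: the first hit for "p1", a row of NAs for
"p2", and the single hit for "p3". With requireUnique=TRUE the same call
stops with an error, because "p1" has two records.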
This would be quite convenient for any kind of ID conversion task.
Simon