[Bioc-devel] Changes in AnnotationDbi

Tue Jun 9 15:35:02 CEST 2015

On 06/09/2015 02:52 AM, Simon Anders wrote:
> Hi
>
> My two cents:
>
> On 04/06/15 19:50, James W. MacDonald wrote:
>> In other words, for me it is a common practice to do something like this:
>>
>> fit <- lmFit(eset, design)
>> fit2 <- eBayes(fit)
>> gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
>> gns <- gns[!duplicated(gns[,1]),]
>> fit2$genes <- gns
>>
>> I add in the step where dups are removed because I already know they are
>> there. But a naive user might instead do
>>
>> fit2$genes <- select(<chippackage>, featureNames(eset),
>> c("ENTREZID","SYMBOL"))
>
> I'm not even that happy with James' first solution, as it relies on the order
> being correct after removing the duplicates. I'd feel safer to use 'match' to
> ensure that. (What if an EntrezId is not found in the Annotation DB? Will we
> have a line with NA, or is the line simply missing? The latter would break
> James' code.)
>
> What users really want here is a way to get the "preferred" symbol for an
> entrezId, and for lack of this, they accept simply a random one or the first one
> (in some unspecified collation). So, we should have a function, maybe 'select1',
> to select one and only one hit for each query value.
>
>    select1(x, keys, columns, keytype, requireUnique=FALSE, ... )
>
> This would query the AnnotationDbi object 'x' as does 'select', but return a
> data frame with the columns specified in 'columns', and the vector that was
> passed as 'keys' as row names, thus guaranteeing that each line in the data
> frame corresponds to one query key. If there were multiple records for a key,
> the first one is used, unless 'requireUnique' is set, in which case an error is
> issued. And if no record is present for a key, the data frame contains a row of
> NAs for this key.
>
> This would be quite convenient for any kind of ID conversion issues.

In case  you missed it in Marc's reply, and acknowledging that this is different 
from your suggestion, there is mapIds() for doing this on a single column basis, 
which is the common use case where one doesn't care too much about multiple 
mapping ids

 > org = org.Hs.eg.db
 > head(select(org, keys(org), "ALIAS"))
   ENTREZID    ALIAS
1        1      A1B
2        1      ABG
3        1      GAB
4        1 HYST2477
5        1     A1BG
6        2     A2MD
 > head(mapIds(org, keys(org), "ALIAS", "ENTREZID"))
      1      2      3      9     10     11
  "A1B" "A2MD" "A2MP" "AAC1" "AAC2" "AACP"
 > head(mapIds(org, keys(org), "ALIAS", "ENTREZID", multiVals="CharacterList"))
CharacterList of length 6
[["1"]] A1B ABG GAB HYST2477 A1BG
[["2"]] A2MD CPAMD5 FWP007 S863-7 A2M
[["3"]] A2MP A2MP1
[["9"]] AAC1 MNAT NAT-1 NATI NAT1
[["10"]] AAC2 NAT-2 PNAT NAT2
[["11"]] AACP NATP1 NATP
 > str(head(mapIds(org, keys(org), "ALIAS", "ENTREZID", multiVals="list")))
List of 6
  $ 1 : chr [1:5] "A1B" "ABG" "GAB" "HYST2477" ...
  $ 2 : chr [1:5] "A2MD" "CPAMD5" "FWP007" "S863-7" ...
  $ 3 : chr [1:2] "A2MP" "A2MP1"
  $ 9 : chr [1:5] "AAC1" "MNAT" "NAT-1" "NATI" ...
  $ 10: chr [1:4] "AAC2" "NAT-2" "PNAT" "NAT2"
  $ 11: chr [1:3] "AACP" "NATP1" "NATP"

Also since this is the devel list, there is

 > library(dplyr)
 > d = src_sqlite(org.Hs.eg_dbfile())
 > d
src:  sqlite 3.8.6 
[/home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.2-BiocDevel/org.Hs.eg.db/extdata/org.Hs.eg.sqlite]
tbls: accessions, alias, chrlengths, chromosome_locations, chromosomes,
   cytogenetic_locations, ec, ensembl, ensembl_prot, ensembl_trans,
   ensembl2ncbi, gene_info, genes, go, go_all, go_bp, go_bp_all, go_cc,
   go_cc_all, go_mf, go_mf_all, kegg, map_counts, map_metadata, metadata,
   ncbi2ensembl, omim, pfam, prosite, pubmed, refseq, sqlite_stat1, ucsc,
   unigene, uniprot
 > d %>% tbl("alias") %>% group_by(`_id`) %>% summarize(alias_symbol)
Source: sqlite 3.8.6 
[/home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.2-BiocDevel/org.Hs.eg.db/extdata/org.Hs.eg.sqlite]
From: <derived table> [?? x 2]

    _id alias_symbol
1    1         A1BG
2    2          A2M
3    3        A2MP1
4    4         NAT1
5    5         NAT2
6    6         NATP
7    7     SERPINA3
8    8        AADAC
9    9         AAMP
10  10        AANAT
.. ...          ...

(with lots of nice confusion there, including extensive masking of symbols 
between dplyr / AnnotationDbi, need for knowledge of the schema (basically a 
central id, ENTREZID for org packages, and tables of mappings from the central 
id to other ids), and the more-or-less arbitrary choice of alias_symbol).

Martin

>
>    Simon
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793