[BioC] edgeR adding RefSeq IDs

Mon Oct 21 16:50:17 CEST 2013

Hi Robin,

On Monday, October 21, 2013 9:23:34 AM, Robin [guest] wrote:
>
> My pipeline for adding annotations suddenly didn't work, maybe because some packages are updated. Could you check if something is wrong with my pipeline, I loose all the rows when I add annotations:
>
>> data <- read.table("~/Documents/Jobb/mRNA_Ch12/TreatedControlCounts.csv", row.names=1, sep="", header=T)
>> head(data)
>               T0h T0.25h T0.5h T1h T2h T3h T6h T12h T24h T48h C0h C0.25h C0.5h C1h C2h C3h
> NM_001001130  68     95    56  43  66  62  68   90   63   89  65     85    58  49  81  49
> NM_001001144   0      1     4   0   1   1   1    4    1    2   1      3     1   0   6   3
> NM_001001152  79    129    52  50  24  45 130  154  112  147  56     85    67  33  52  31
> NM_001001160   1      1     2   0   0   1   0    0    0    1   0      1     2   4   2   1
> NM_001001176   0      0     0   0   0   0   0    0    0    0   0      0     0   0   0   0
> NM_001001177   1      3     2   3   0   1   1    0    2    0   1      6     4   1   3   0
>               C6h C12h C24h C48h
> NM_001001130  76   73   48   77
> NM_001001144   0    1    2    1
> NM_001001152  93   77   65  109
> NM_001001160   0    2    1    0
> NM_001001176   0    0    0    0
> NM_001001177   0    3    0    2
>> y <- DGEList(counts=data[,1:20], genes=data[,0:1])

What are you expecting data[,0:1] to do?

Note that R is not Perl, so counting starts at 1, so R is ignoring the 
zero. What you are in fact doing is putting the first column of your 
'data' data.frame into the DGEList object, and since you are only using 
one column, R is dropping the dimension attribute, and thereby reducing 
to a vector. You later ask for the row.names() of this vector, and 
since it has no rows, it has no row.names either.

As an example:

> x <- data.frame(letters=1:26)
> row.names(x) <- letters
> x[,0:1]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 
23 24 25
[26] 26
> row.names(x[,0:1])
>

Not that the call to row.names above didn't return anything, because 
there wasn't anything to return.

In addition, you are going through a lot of unnecessary work below, 
trying to get mappings to Entrez Gene from RefSeq IDs. This is actually 
a simple one-liner.

gns <- select(org.Hs.eg.db, row.names(data), c("ENTREZID","SYMBOL"), 
"ACCNUM")

You may get a warning that you have some one-to-many mappings here. How 
you deal with that is up to you. I normally take the most naive 
approach possible and just do

gns <- gns[!duplicated(gns[,1]),]

After which you an check that

all.equal(gns$ACCNUM, row.names(data))

and then you can do

y <- DGEList(data, genes = gns)

Best,

Jim

>> library(org.Mm.eg.db)
>> idfound <- rownames(y$genes) %in% mappedRkeys(org.Mm.egREFSEQ)
>> y <- y[idfound,]
>> dim(y)
> [1]  0 20
>> egREFSEQ <- toTable(org.Mm.egREFSEQ)
>> m <- match(rownames(y$genes), egREFSEQ$accession)
>> y$genes$EntrezGene <- egREFSEQ$gene_id[m]
>> y$genes$Symbol <- egSYMBOL$symbol[m]
>> m <- match(y$genes$EntrezGene, egSYMBOL$gene_id)
>> y$genes$Symbol <- egSYMBOL$symbol[m]
>> head(y$genes)
> [1] genes      EntrezGene Symbol
> <0 rows> (or 0-length row.names)
>
>   -- output of sessionInfo():
>
>> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
>   [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] edgeR_3.4.0           limma_3.18.0          GSEABase_1.24.0
>   [4] annotate_1.40.0       org.Mm.eg.db_2.10.1   AnnotationForge_1.4.0
>   [7] org.Hs.eg.db_2.10.1   GOstats_2.28.0        graph_1.40.0
> [10] Category_2.28.0       GO.db_2.10.1          RSQLite_0.11.4
> [13] DBI_0.2-7             Matrix_1.0-14         lattice_0.20-24
> [16] AnnotationDbi_1.24.0  Biobase_2.22.0        BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
>   [1] genefilter_1.44.0 grid_3.0.2        IRanges_1.20.0    RBGL_1.38.0
>   [5] splines_3.0.2     stats4_3.0.2      survival_2.37-4   tools_3.0.2
>   [9] XML_3.98-1.1      xtable_1.7-1
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099