[Bioc-devel] Bug Report - for goseq - for RefSeq Genes track - RefSeq identifiers or Entrez Gene ID ?

Arora, Sonali sarora at fredhutch.org
Fri Oct 9 23:15:08 CEST 2015


Hi everyone,

I am using goseq to perform Gene Ontology analysis of RNA-seq data and I 
think I found the following bug.

Since my RNA-seq data was aligned using Refseq table - so I need to 
supply id as "refGene" which are actually EntrezId's from 
goseq::supportedGeneIDs()

The bug is - that the internal function(getgo) is expecting refseq 
identifiers and the external function (goseq) is expecting entrez id's..

When I use refGene as id ( which is actually an entrez gene ids). I get 
the following error -

 > pwf = nullp(genes, "hg19", "refGene")
Loading hg19 length data...
Warning message:
In pcls(G) : initial point very close to some inequality constraints
 > GO.wall = goseq(pwf, "hg19", "refGene")
Fetching GO annotations...
Error in names(tmp) = rep(names(map), times = as.numeric(summary(map)[,  :
   attempt to set an attribute on NULL


  After some debugging, I find that internal function getgo() returns a 
vector of NULL's using "Entrez Gene Id" as id.

 > go = getgo(names(genes), "hg19", "refGene")
 > head(go)
$<NA>
NULL

$<NA>
NULL

$<NA>
NULL

$<NA>
NULL

$<NA>
NULL

$<NA>
NULL

The last line of the function goseq::getgo() seems to be the issue..

Browse[2]> head(names(user2cat))
[1] "NM_000014" "NM_000015" "NM_000016" "NM_000017" "NM_000018" "NM_000019"
Browse[2]> head(genes)
[1] "1"         "100"       "1000"      "10000"     "10001" "100037417"
Browse[2]> head(user2cat[toupper(genes)])
$<NA>
NULL

$<NA>
NULL

$<NA>
NULL

$<NA>
NULL

$<NA>
NULL

$<NA>
NULL

If I give the getgo() function refseq identifiers instead of entrez id, 
then the internal function getgo works - but the nullp function fails as 
it is still expecting entrez gene id.

 > test_genes <- c(0, 0,1,0,0,0)
 > names(test_genes) =c("NM_000014", "NM_000015", "NM_000016", "NM_000017",
+                      "NM_000018", "NM_000019")
 > go = getgo(names(test_genes), "hg19", "refGene")
 > names(go)
[1] "NM_000014" "NM_000015" "NM_000016" "NM_000017" "NM_000018" "NM_000019"
 > pwf = nullp(test_genes, "hg19", "refGene")
Loading hg19 length data...
Error in getlength(names(DEgenes), genome, id) :
   The gene names specified do not match the gene names for genome hg19 
and ID refGene.
     Gene names given were: NM_000014, NM_000015, NM_000016, NM_000017, 
NM_000018, NM_000019, NA, NA, NA, NA
     Required gene names are: 84251, 113451, 25932, 55707, 55707, 84871, 
50651, 55707, 7049, 1629

A reproducible example -

# case - 1
test_genes <- c(0, 0,1,0,0,0)
names(test_genes) <- c(1,100,1000, 10000, 10001, 100037417)
go = getgo(names(test_genes), "hg19", "refGene")

# case -2
names(test_genes) =c("NM_000014", "NM_000015", "NM_000016", "NM_000017",
                      "NM_000018", "NM_000019")
go = getgo(names(test_genes), "hg19", "refGene")
pwf = nullp(test_genes, "hg19", "refGene")


 > sessionInfo()
R version 3.2.2 RC (2015-08-09 r68965)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United 
States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils datasets  
methods   base

other attached packages:
  [1] org.Hs.eg.db_3.2.2    AnnotationDbi_1.31.19 IRanges_2.3.24        
S4Vectors_0.7.22      Biobase_2.29.1 BiocGenerics_0.15.10  
goseq_1.21.1          RSQLite_1.0.0
  [9] DBI_0.3.1             geneLenDataBase_1.5.0 BiasedUrn_1.06.1

loaded via a namespace (and not attached):
  [1] XVector_0.9.4              GenomicRanges_1.21.30 
zlibbioc_1.15.0            GenomicAlignments_1.5.18 
BiocParallel_1.3.54        lattice_0.20-33
  [7] GenomeInfoDb_1.5.16        tools_3.2.2 grid_3.2.2                 
SummarizedExperiment_0.3.9 nlme_3.1-122               mgcv_1.8-7
[13] lambda.r_1.1.7             futile.logger_1.4.1 
Matrix_1.2-2               rtracklayer_1.29.28 
futile.options_1.0.0       bitops_1.0-6
[19] RCurl_1.95-4.7             biomaRt_2.25.3 
GO.db_3.2.2                GenomicFeatures_1.21.31 
Biostrings_2.37.8          Rsamtools_1.21.20
[25] XML_3.98-1.3

Please advise.

Thanks and Regards,
Sonali.



More information about the Bioc-devel mailing list