[Bioc-devel] Bug Report - for goseq - for RefSeq Genes track - RefSeq identifiers or Entrez Gene ID ?
Arora, Sonali
sarora at fredhutch.org
Fri Oct 9 23:15:08 CEST 2015
Hi everyone,
I am using goseq to perform Gene Ontology analysis of RNA-seq data and I
think I found the following bug.
Since my RNA-seq data was aligned using Refseq table - so I need to
supply id as "refGene" which are actually EntrezId's from
goseq::supportedGeneIDs()
The bug is - that the internal function(getgo) is expecting refseq
identifiers and the external function (goseq) is expecting entrez id's..
When I use refGene as id ( which is actually an entrez gene ids). I get
the following error -
> pwf = nullp(genes, "hg19", "refGene")
Loading hg19 length data...
Warning message:
In pcls(G) : initial point very close to some inequality constraints
> GO.wall = goseq(pwf, "hg19", "refGene")
Fetching GO annotations...
Error in names(tmp) = rep(names(map), times = as.numeric(summary(map)[, :
attempt to set an attribute on NULL
After some debugging, I find that internal function getgo() returns a
vector of NULL's using "Entrez Gene Id" as id.
> go = getgo(names(genes), "hg19", "refGene")
> head(go)
$<NA>
NULL
$<NA>
NULL
$<NA>
NULL
$<NA>
NULL
$<NA>
NULL
$<NA>
NULL
The last line of the function goseq::getgo() seems to be the issue..
Browse[2]> head(names(user2cat))
[1] "NM_000014" "NM_000015" "NM_000016" "NM_000017" "NM_000018" "NM_000019"
Browse[2]> head(genes)
[1] "1" "100" "1000" "10000" "10001" "100037417"
Browse[2]> head(user2cat[toupper(genes)])
$<NA>
NULL
$<NA>
NULL
$<NA>
NULL
$<NA>
NULL
$<NA>
NULL
$<NA>
NULL
If I give the getgo() function refseq identifiers instead of entrez id,
then the internal function getgo works - but the nullp function fails as
it is still expecting entrez gene id.
> test_genes <- c(0, 0,1,0,0,0)
> names(test_genes) =c("NM_000014", "NM_000015", "NM_000016", "NM_000017",
+ "NM_000018", "NM_000019")
> go = getgo(names(test_genes), "hg19", "refGene")
> names(go)
[1] "NM_000014" "NM_000015" "NM_000016" "NM_000017" "NM_000018" "NM_000019"
> pwf = nullp(test_genes, "hg19", "refGene")
Loading hg19 length data...
Error in getlength(names(DEgenes), genome, id) :
The gene names specified do not match the gene names for genome hg19
and ID refGene.
Gene names given were: NM_000014, NM_000015, NM_000016, NM_000017,
NM_000018, NM_000019, NA, NA, NA, NA
Required gene names are: 84251, 113451, 25932, 55707, 55707, 84871,
50651, 55707, 7049, 1629
A reproducible example -
# case - 1
test_genes <- c(0, 0,1,0,0,0)
names(test_genes) <- c(1,100,1000, 10000, 10001, 100037417)
go = getgo(names(test_genes), "hg19", "refGene")
# case -2
names(test_genes) =c("NM_000014", "NM_000015", "NM_000016", "NM_000017",
"NM_000018", "NM_000019")
go = getgo(names(test_genes), "hg19", "refGene")
pwf = nullp(test_genes, "hg19", "refGene")
> sessionInfo()
R version 3.2.2 RC (2015-08-09 r68965)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
methods base
other attached packages:
[1] org.Hs.eg.db_3.2.2 AnnotationDbi_1.31.19 IRanges_2.3.24
S4Vectors_0.7.22 Biobase_2.29.1 BiocGenerics_0.15.10
goseq_1.21.1 RSQLite_1.0.0
[9] DBI_0.3.1 geneLenDataBase_1.5.0 BiasedUrn_1.06.1
loaded via a namespace (and not attached):
[1] XVector_0.9.4 GenomicRanges_1.21.30
zlibbioc_1.15.0 GenomicAlignments_1.5.18
BiocParallel_1.3.54 lattice_0.20-33
[7] GenomeInfoDb_1.5.16 tools_3.2.2 grid_3.2.2
SummarizedExperiment_0.3.9 nlme_3.1-122 mgcv_1.8-7
[13] lambda.r_1.1.7 futile.logger_1.4.1
Matrix_1.2-2 rtracklayer_1.29.28
futile.options_1.0.0 bitops_1.0-6
[19] RCurl_1.95-4.7 biomaRt_2.25.3
GO.db_3.2.2 GenomicFeatures_1.21.31
Biostrings_2.37.8 Rsamtools_1.21.20
[25] XML_3.98-1.3
Please advise.
Thanks and Regards,
Sonali.
More information about the Bioc-devel
mailing list