[Bioc-devel] Bug Report - for goseq - for RefSeq Genes track - RefSeq identifiers or Entrez Gene ID ?
Nadia Davidson
nadia.davidson at mcri.edu.au
Sun Oct 11 05:14:26 CEST 2015
Hi Sonali,
Thanks for reporting this. I’ve moved away from Entrez IDs for refseq with newer genome versions, as I think goseq should take the refs symbols directly. Documentation still needs to be updated for this.
It’s concerning that you’re seeing this behaviour for hg19, so I’ll look into it. A work around is to use the Entrez IDs and provide goseq with the flag “knownGene” rather than “refGene”.
I’m currently on leave until January, so a fix might be slow.
Cheers,
Nadia.
> On 10 Oct 2015, at 8:15 am, Arora, Sonali <sarora at fredhutch.org> wrote:
>
> Hi everyone,
>
> I am using goseq to perform Gene Ontology analysis of RNA-seq data and I
> think I found the following bug.
>
> Since my RNA-seq data was aligned using Refseq table - so I need to
> supply id as "refGene" which are actually EntrezId's from
> goseq::supportedGeneIDs()
>
> The bug is - that the internal function(getgo) is expecting refseq
> identifiers and the external function (goseq) is expecting entrez id's..
>
> When I use refGene as id ( which is actually an entrez gene ids). I get
> the following error -
>
>> pwf = nullp(genes, "hg19", "refGene")
> Loading hg19 length data...
> Warning message:
> In pcls(G) : initial point very close to some inequality constraints
>> GO.wall = goseq(pwf, "hg19", "refGene")
> Fetching GO annotations...
> Error in names(tmp) = rep(names(map), times = as.numeric(summary(map)[, :
> attempt to set an attribute on NULL
>
>
> After some debugging, I find that internal function getgo() returns a
> vector of NULL's using "Entrez Gene Id" as id.
>
>> go = getgo(names(genes), "hg19", "refGene")
>> head(go)
> $<NA>
> NULL
>
> $<NA>
> NULL
>
> $<NA>
> NULL
>
> $<NA>
> NULL
>
> $<NA>
> NULL
>
> $<NA>
> NULL
>
> The last line of the function goseq::getgo() seems to be the issue..
>
> Browse[2]> head(names(user2cat))
> [1] "NM_000014" "NM_000015" "NM_000016" "NM_000017" "NM_000018" "NM_000019"
> Browse[2]> head(genes)
> [1] "1" "100" "1000" "10000" "10001" "100037417"
> Browse[2]> head(user2cat[toupper(genes)])
> $<NA>
> NULL
>
> $<NA>
> NULL
>
> $<NA>
> NULL
>
> $<NA>
> NULL
>
> $<NA>
> NULL
>
> $<NA>
> NULL
>
> If I give the getgo() function refseq identifiers instead of entrez id,
> then the internal function getgo works - but the nullp function fails as
> it is still expecting entrez gene id.
>
>> test_genes <- c(0, 0,1,0,0,0)
>> names(test_genes) =c("NM_000014", "NM_000015", "NM_000016", "NM_000017",
> + "NM_000018", "NM_000019")
>> go = getgo(names(test_genes), "hg19", "refGene")
>> names(go)
> [1] "NM_000014" "NM_000015" "NM_000016" "NM_000017" "NM_000018" "NM_000019"
>> pwf = nullp(test_genes, "hg19", "refGene")
> Loading hg19 length data...
> Error in getlength(names(DEgenes), genome, id) :
> The gene names specified do not match the gene names for genome hg19
> and ID refGene.
> Gene names given were: NM_000014, NM_000015, NM_000016, NM_000017,
> NM_000018, NM_000019, NA, NA, NA, NA
> Required gene names are: 84251, 113451, 25932, 55707, 55707, 84871,
> 50651, 55707, 7049, 1629
>
> A reproducible example -
>
> # case - 1
> test_genes <- c(0, 0,1,0,0,0)
> names(test_genes) <- c(1,100,1000, 10000, 10001, 100037417)
> go = getgo(names(test_genes), "hg19", "refGene")
>
> # case -2
> names(test_genes) =c("NM_000014", "NM_000015", "NM_000016", "NM_000017",
> "NM_000018", "NM_000019")
> go = getgo(names(test_genes), "hg19", "refGene")
> pwf = nullp(test_genes, "hg19", "refGene")
>
>
>> sessionInfo()
> R version 3.2.2 RC (2015-08-09 r68965)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 7 x64 (build 7601) Service Pack 1
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
> States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] parallel stats4 stats graphics grDevices utils datasets
> methods base
>
> other attached packages:
> [1] org.Hs.eg.db_3.2.2 AnnotationDbi_1.31.19 IRanges_2.3.24
> S4Vectors_0.7.22 Biobase_2.29.1 BiocGenerics_0.15.10
> goseq_1.21.1 RSQLite_1.0.0
> [9] DBI_0.3.1 geneLenDataBase_1.5.0 BiasedUrn_1.06.1
>
> loaded via a namespace (and not attached):
> [1] XVector_0.9.4 GenomicRanges_1.21.30
> zlibbioc_1.15.0 GenomicAlignments_1.5.18
> BiocParallel_1.3.54 lattice_0.20-33
> [7] GenomeInfoDb_1.5.16 tools_3.2.2 grid_3.2.2
> SummarizedExperiment_0.3.9 nlme_3.1-122 mgcv_1.8-7
> [13] lambda.r_1.1.7 futile.logger_1.4.1
> Matrix_1.2-2 rtracklayer_1.29.28
> futile.options_1.0.0 bitops_1.0-6
> [19] RCurl_1.95-4.7 biomaRt_2.25.3
> GO.db_3.2.2 GenomicFeatures_1.21.31
> Biostrings_2.37.8 Rsamtools_1.21.20
> [25] XML_3.98-1.3
>
> Please advise.
>
> Thanks and Regards,
> Sonali.
>
> ______________________________________________________________________
> This email has been scanned by the Symantec Email Security.cloud service.
> For more information please visit http://www.symanteccloud.com
>
> If you have any questions, please contact MCRI IT Servicedesk for further assistance.
> ______________________________________________________________________
______________________________________________________________________
This email has been scanned by the Symantec Email Security.cloud service.
For more information please visit http://www.symanteccloud.com
______________________________________________________________________
More information about the Bioc-devel
mailing list