[Bioc-devel] Bug Report - for goseq - for RefSeq Genes track - RefSeq identifiers or Entrez Gene ID ?

Mon Oct 12 20:47:02 CEST 2015

Thanks Nadia, I'll use the knownGene with Entrez Ids for my analysis.
Sonali.

On 10/10/2015 8:14 PM, Nadia Davidson wrote:
> Hi Sonali,
>
> Thanks for reporting this. I’ve moved away from Entrez IDs for refseq with newer genome versions, as I think goseq should take the refs symbols directly. Documentation still needs to be updated for this.
>
> It’s concerning that you’re seeing this behaviour for hg19, so I’ll look into it. A work around is to use the Entrez IDs and provide goseq with the flag “knownGene” rather than “refGene”.
>
> I’m currently on leave until January, so a fix might be slow.
>
> Cheers,
> Nadia.
>
>
>> On 10 Oct 2015, at 8:15 am, Arora, Sonali <sarora at fredhutch.org> wrote:
>>
>> Hi everyone,
>>
>> I am using goseq to perform Gene Ontology analysis of RNA-seq data and I
>> think I found the following bug.
>>
>> Since my RNA-seq data was aligned using Refseq table - so I need to
>> supply id as "refGene" which are actually EntrezId's from
>> goseq::supportedGeneIDs()
>>
>> The bug is - that the internal function(getgo) is expecting refseq
>> identifiers and the external function (goseq) is expecting entrez id's..
>>
>> When I use refGene as id ( which is actually an entrez gene ids). I get
>> the following error -
>>
>>> pwf = nullp(genes, "hg19", "refGene")
>> Loading hg19 length data...
>> Warning message:
>> In pcls(G) : initial point very close to some inequality constraints
>>> GO.wall = goseq(pwf, "hg19", "refGene")
>> Fetching GO annotations...
>> Error in names(tmp) = rep(names(map), times = as.numeric(summary(map)[,  :
>>    attempt to set an attribute on NULL
>>
>>
>>   After some debugging, I find that internal function getgo() returns a
>> vector of NULL's using "Entrez Gene Id" as id.
>>
>>> go = getgo(names(genes), "hg19", "refGene")
>>> head(go)
>> $<NA>
>> NULL
>>
>> $<NA>
>> NULL
>>
>> $<NA>
>> NULL
>>
>> $<NA>
>> NULL
>>
>> $<NA>
>> NULL
>>
>> $<NA>
>> NULL
>>
>> The last line of the function goseq::getgo() seems to be the issue..
>>
>> Browse[2]> head(names(user2cat))
>> [1] "NM_000014" "NM_000015" "NM_000016" "NM_000017" "NM_000018" "NM_000019"
>> Browse[2]> head(genes)
>> [1] "1"         "100"       "1000"      "10000"     "10001" "100037417"
>> Browse[2]> head(user2cat[toupper(genes)])
>> $<NA>
>> NULL
>>
>> $<NA>
>> NULL
>>
>> $<NA>
>> NULL
>>
>> $<NA>
>> NULL
>>
>> $<NA>
>> NULL
>>
>> $<NA>
>> NULL
>>
>> If I give the getgo() function refseq identifiers instead of entrez id,
>> then the internal function getgo works - but the nullp function fails as
>> it is still expecting entrez gene id.
>>
>>> test_genes <- c(0, 0,1,0,0,0)
>>> names(test_genes) =c("NM_000014", "NM_000015", "NM_000016", "NM_000017",
>> +                      "NM_000018", "NM_000019")
>>> go = getgo(names(test_genes), "hg19", "refGene")
>>> names(go)
>> [1] "NM_000014" "NM_000015" "NM_000016" "NM_000017" "NM_000018" "NM_000019"
>>> pwf = nullp(test_genes, "hg19", "refGene")
>> Loading hg19 length data...
>> Error in getlength(names(DEgenes), genome, id) :
>>    The gene names specified do not match the gene names for genome hg19
>> and ID refGene.
>>      Gene names given were: NM_000014, NM_000015, NM_000016, NM_000017,
>> NM_000018, NM_000019, NA, NA, NA, NA
>>      Required gene names are: 84251, 113451, 25932, 55707, 55707, 84871,
>> 50651, 55707, 7049, 1629
>>
>> A reproducible example -
>>
>> # case - 1
>> test_genes <- c(0, 0,1,0,0,0)
>> names(test_genes) <- c(1,100,1000, 10000, 10001, 100037417)
>> go = getgo(names(test_genes), "hg19", "refGene")
>>
>> # case -2
>> names(test_genes) =c("NM_000014", "NM_000015", "NM_000016", "NM_000017",
>>                       "NM_000018", "NM_000019")
>> go = getgo(names(test_genes), "hg19", "refGene")
>> pwf = nullp(test_genes, "hg19", "refGene")
>>
>>
>>> sessionInfo()
>> R version 3.2.2 RC (2015-08-09 r68965)
>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>> Running under: Windows 7 x64 (build 7601) Service Pack 1
>>
>> locale:
>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
>> States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>> [5] LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] parallel  stats4    stats     graphics  grDevices utils datasets
>> methods   base
>>
>> other attached packages:
>>   [1] org.Hs.eg.db_3.2.2    AnnotationDbi_1.31.19 IRanges_2.3.24
>> S4Vectors_0.7.22      Biobase_2.29.1 BiocGenerics_0.15.10
>> goseq_1.21.1          RSQLite_1.0.0
>>   [9] DBI_0.3.1             geneLenDataBase_1.5.0 BiasedUrn_1.06.1
>>
>> loaded via a namespace (and not attached):
>>   [1] XVector_0.9.4              GenomicRanges_1.21.30
>> zlibbioc_1.15.0            GenomicAlignments_1.5.18
>> BiocParallel_1.3.54        lattice_0.20-33
>>   [7] GenomeInfoDb_1.5.16        tools_3.2.2 grid_3.2.2
>> SummarizedExperiment_0.3.9 nlme_3.1-122               mgcv_1.8-7
>> [13] lambda.r_1.1.7             futile.logger_1.4.1
>> Matrix_1.2-2               rtracklayer_1.29.28
>> futile.options_1.0.0       bitops_1.0-6
>> [19] RCurl_1.95-4.7             biomaRt_2.25.3
>> GO.db_3.2.2                GenomicFeatures_1.21.31
>> Biostrings_2.37.8          Rsamtools_1.21.20
>> [25] XML_3.98-1.3
>>
>> Please advise.
>>
>> Thanks and Regards,
>> Sonali.
>>
>> ______________________________________________________________________
>> This email has been scanned by the Symantec Email Security.cloud service.
>> For more information please visit http://www.symanteccloud.com
>>
>> If you have any questions, please contact MCRI IT Servicedesk for further assistance.
>> ______________________________________________________________________
>
> ______________________________________________________________________
> This email has been scanned by the Symantec Email Security.cloud service.
> For more information please visit http://www.symanteccloud.com
> ______________________________________________________________________