[BioC] Making use of the Apis mellifera assembly 2 data in goSeq

Wed Feb 29 01:02:11 CET 2012

Aha!  Looks like if I enter a more general taxonomic id it partially works...

Again, the sessionInfo()

>  sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] org.Hs.eg.db_2.6.4     goseq_1.6.0            geneLenDataBase_0.99.8
 [4] BiasedUrn_1.04         RCurl_1.91-1           bitops_1.0-4.1
 [7] GO.db_2.6.1            RSQLite_0.11.1         DBI_0.2-5
[10] AnnotationDbi_1.16.17  Biobase_2.14.0

loaded via a namespace (and not attached):
 [1] biomaRt_2.10.0        Biostrings_2.22.0     BSgenome_1.22.0
 [4] GenomicFeatures_1.6.8 GenomicRanges_1.6.7   grid_2.14.0
 [7] IRanges_1.12.6        lattice_0.20-0        Matrix_1.0-1
[10] mgcv_1.7-9            nlme_3.1-102          rtracklayer_1.14.4
[13] tools_2.14.0          XML_3.9-4             zlibbioc_1.0.0

And if I say this (note, using the more general taxonomic id that includes all subspecies of Apis mellifera, 7460, as opposed to the taxonomic id for Apis mellifera ligustica)...

makeOrgPackageFromNCBI(version = "0.1", author = "Vanessa Corby-Harris <vanessa.corby at ars.usda.gov>", maintainer = "Vanessa Corby-Harris <vanessa.corby at ars.usda.gov>", outputDir = "/homeB/home4/u8/vlcorbyharris", tax_id = "7460", genus = "Apis", species = "mellifera")

I get this...

Getting data for gene2pubmed.gz
Populating gene2pubmed table:
table gene2pubmed filled
Getting data for gene2accession.gz
Populating gene2accession table:
table gene2accession filled
Getting data for gene2refseq.gz
Populating gene2refseq table:
table gene2refseq filled
Getting data for gene2unigene
Populating gene2unigene table:
table gene2unigene filled
Getting data for gene_info.gz
Populating gene_info table:
table gene_info filled
Getting data for gene2go.gz
Populating gene2go table:
Getting blast2GO data as a substitute for gene2go
table metadata filled
table map_metadata filled
table gene2go filled
table metadata filled
table map_metadata filled
Populating genes table:
genes table filled
Populating gene_info_temp table:
gene_info_temp table filled
Populating alias table:
alias table filled
Populating chromosomes table:
chromosomes table filled
Populating pubmed table:
pubmed table filled
Populating refseq table:
refseq table filled
Populating accessions table:
accessions table filled
Populating unigene table:
unigene table filled
Dropping GO IDs that are too new for the current GO.db
Dropping GO IDs that are too new for the current GO.db
Dropping GO IDs that are too new for the current GO.db
Populating go_bp table:
go_bp table filled
Populating go_mf table:
go_mf table filled
Populating go_cc table:
go_cc table filled
Populating go_bp_all table:
go_bp_all table filled
Populating go_mf_all table:
go_mf_all table filled
Populating go_cc_all table:
go_cc_all table filled
dropping table gene2pubmeddropping table gene2accessiondropping table gene2refseqdropping table gene2unigenedropping table gene_infodropping table gene2go
SELECT count(DISTINCT g.gene_id) FROM gene_info AS t, genes as g WHERE t._id=g._id AND t.gene_name NOT NULL
SELECT count(DISTINCT g.gene_id) FROM gene_info AS t, genes as g WHERE t._id=g._id AND t.symbol NOT NULL
SELECT count(DISTINCT t.symbol) FROM gene_info AS t, genes as g WHERE t._id=g._id AND t.symbol NOT NULL
SELECT count(DISTINCT g.gene_id) FROM chromosomes AS t, genes as g WHERE t._id=g._id AND t.chromosome NOT NULL
SELECT count(DISTINCT g.gene_id) FROM refseq AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
SELECT count(DISTINCT t.accession) FROM refseq AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
SELECT count(DISTINCT g.gene_id) FROM unigene AS t, genes as g WHERE t._id=g._id AND t.unigene_id NOT NULL
SELECT count(DISTINCT t.unigene_id) FROM unigene AS t, genes as g WHERE t._id=g._id AND t.unigene_id NOT NULL
SELECT count(DISTINCT g.gene_id) FROM accessions AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
SELECT count(DISTINCT t.accession) FROM accessions AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
SELECT count(DISTINCT g.gene_id) FROM alias AS t, genes as g WHERE t._id=g._id AND t.alias_symbol NOT NULL
table map_counts filled
Creating package in /homeB/home4/u8/vlcorbyharris/org.Amellifera.eg.db
[1] TRUE

Hooray!  But now when I ask it to load the package... drat!!!

> library(org.Amellifera.eg.db)
Error in library(org.Amellifera.eg.db) :
  âorg.Amellifera.eg.dbâ is not a valid installed package
> traceback()
2: stop(gettextf("%s is not a valid installed package", sQuote(package)),
       domain = NA)
1: library(org.Amellifera.eg.db)

It does not see the package as a valid installed package...

Could it be because in the folder associated with the org.Amellifera.eg.db does not contain a proper R file?  For example, if I look in the directory  ~/R/x86_64-unknown-linux-gnu-library/2.14/org.Hs.eg.db/R, which contains files for the org.Hs.eg.db package that I can properly install, it contains several files - org.Hs.eg.db,  org.Hs.eg.db.rdb, and org.Hs.eg.db.rdx.  But the same type of directory associated with org.Amellifera.eg.db contains only a zzz.R file - ~/R/x86_64-unknown-linux-gnu-library/2.14/org.Amellifera.eg.db/R has only the file named zzz.R.  BTW, for reference, here is the contents of both for one level up:

Associated with the org.Amellifera.eg.db database:

vlcorbyharris at service2:~/R/x86_64-unknown-linux-gnu-library/2.14/org.Amellifera.eg.db> ls
DESCRIPTION  inst  man  NAMESPACE  R  tests

Associated with the org.Hs.eg.db database package:

vlcorbyharris at service2:~/R/x86_64-unknown-linux-gnu-library/2.14/org.Hs.eg.db> ls
DESCRIPTION  extdata  help  html  INDEX  Meta  NAMESPACE  R

They are clearly different, suggesting that the package didn't install correctly using the function makeOrgPackageFromNCBI.

-----Original Message-----
From: Hervé Pagès [mailto:hpages at fhcrc.org] 
Sent: Tuesday, February 28, 2012 1:58 PM
To: Corby, Vanessa
Cc: bioconductor at r-project.org
Subject: Re: [BioC] Making use of the Apis mellifera assembly 2 data in goSeq

Vanessa,

On 02/28/2012 11:17 AM, Corby, Vanessa wrote:
> Hi again Hervé,
>
> I hit "Reply All" this time.  Sorry about that.
>
> Here is my session info:

Thanks! More below...

>
>> sessionInfo()
> R version 2.14.0 (2011-10-31)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] goseq_1.6.0            geneLenDataBase_0.99.8 BiasedUrn_1.04
>   [4] RCurl_1.91-1           bitops_1.0-4.1         GO.db_2.6.1
>   [7] RSQLite_0.11.1         DBI_0.2-5              AnnotationDbi_1.16.17
> [10] Biobase_2.14.0
>
> loaded via a namespace (and not attached):
>   [1] biomaRt_2.10.0        Biostrings_2.22.0     BSgenome_1.22.0
>   [4] GenomicFeatures_1.6.8 GenomicRanges_1.6.7   grid_2.14.0
>   [7] IRanges_1.12.6        lattice_0.20-0        Matrix_1.0-1
> [10] mgcv_1.7-9            nlme_3.1-102          rtracklayer_1.14.4
> [13] tools_2.14.0          XML_3.9-4             zlibbioc_1.0.0
>
> I should have looked through these guidelines in more detail.  I could have given the output of the traceback().
>
> V
>
> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Tuesday, February 28, 2012 12:04 PM
> To: Corby, Vanessa; bioconductor at r-project.org
> Subject: Re: [BioC] Making use of the Apis mellifera assembly 2 data 
> in goSeq
>
> Hi Vanessa,
>
> Please let's keep this discussion on the list (I should have said use the "Reply All" button instead of "Reply" in my previous email).
>
> On 02/28/2012 07:57 AM, Corby, Vanessa wrote:
>> Hello Hervé,
>>
>> So it looks like it made a .sqlite file called "org.Amellifera.eg.sqlite" in the specified directory, but I don't see an associated package.  I see in the AnnotationDbi information there is a wrapBaseDBPackages function.  Would this be the next step?
>
> More importantly, it seems that you got an error (see below), which would explain why the package is half-backed only.
>
>>
>> Here is my input into R and what I got out:
>>
>>> makeOrgPackageFromNCBI(version = "0.1", author = "Vanessa 
>>> Corby-Harris<vanessa.corby at ars.usda.gov>", maintainer = "Vanessa 
>>> Corby-Harris<vanessa.corby at ars.usda.gov>", outputDir = 
>>> "/homeB/home4/u8/vlcorbyharris", tax_id = "7469", genus = "Apis", 
>>> species = "mellifera")
>> Getting data for gene2pubmed.gz
>> Populating gene2pubmed table:
>> table gene2pubmed filled
>> Getting data for gene2accession.gz
>> Populating gene2accession table:
>> table gene2accession filled
>> Getting data for gene2refseq.gz
>> Populating gene2refseq table:
>> table gene2refseq filled
>> Getting data for gene2unigene
>> Populating gene2unigene table:
>> table gene2unigene filled
>> Getting data for gene_info.gz
>> Populating gene_info table:
>> table gene_info filled
>> Getting data for gene2go.gz
>> Populating gene2go table:
>> Getting blast2GO data as a substitute for gene2go Error in file(file,
>> "rt") : invalid 'description' argument In addition: Warning message:
>> In unzip(tmp) : error 1 in extracting from zip file

OK I can reproduce now. So it seems that when no GO data is available for a tax id (which is the case for tax id 7469), then
makeOrgPackageFromNCBI() tries to use blast2GO data as a substitute.
Problem is blast2GO data is not available either for tax id 7469:

   >
url.exists("http://bioinfo.cipf.es/b2gfar/_media/species:data:7469.annot.zip")
[1] FALSE

Hopefully a more competent person will be able to suggest a workaround.

In the mean time I've a small patch for makeOrgPackageFromNCBI() that I'm about to apply and that makes it display a more useful error message when the blast2GO data is not available.

Cheers,
H.

>
> I'm trying to reproduce this but makeOrgPackageFromNCBI() seems to take a long time (started this 10+ minutes ago). In the mean time could you please provide your sessionInfo()? This is an absolute requisite for us to troubleshoot. Also make sure you read our posting guide:
>
>     http://bioconductor.org/help/mailing-list/
>
> Thanks,
> H.
>
>>
>> Vanessa
>> -----Original Message-----
>> From: Hervé Pagès [mailto:hpages at fhcrc.org]
>> Sent: Monday, February 27, 2012 3:14 PM
>> To: Corby, Vanessa; bioconductor at r-project.org
>> Subject: Re: [BioC] Making use of the Apis mellifera assembly 2 data 
>> in goSeq
>>
>> On 02/27/2012 02:02 PM, Corby, Vanessa wrote:
>>> Argh.  This is what I was afraid of...
>>
>> Why not try to make yours? See the makeOrgPackageFromNCBI() function 
>> in
>> AnnotationDbi:
>>
>>      library(AnnotationDbi)
>>      ?makeOrgPackageFromNCBI
>>
>> (Would be nice if the goseq documentation could mention this.)
>>
>> If you run into problems while trying to use makeOrgPackageFromNCBI() to roll up your own org.Am.eg.db package, please come back with a precise description of the problem. Thanks!
>>
>> Cheers,
>> H.
>>
>>>
>>> Vanessa
>>>
>>> -----Original Message-----
>>> From: Hervé Pagès [mailto:hpages at fhcrc.org]
>>> Sent: Monday, February 27, 2012 2:51 PM
>>> To: Corby, Vanessa
>>> Cc: bioconductor at r-project.org
>>> Subject: Re: [BioC] Making use of the Apis mellifera assembly 2 data 
>>> in goSeq
>>>
>>> Vanessa,
>>>
>>> On 02/27/2012 12:04 PM, Corby, Vanessa wrote:
>>>> Page 3 of the goseq documentation states
>>>>
>>>> "In order to link GO categories to genes, goseq uses the organism packages from Bioconductor.
>>>> These packages are named org.<Genome>.<ID>.db, where<Genome>     is a
>>>> short string identifying the genome and<ID>     is a short string
>>>> identifying the gene identi_er. Currently, goseq will automatically 
>>>> retrieve the mapping between GO categories and genes from the relevant package (as long as it is installed) for commonly used genome/ID combinations."
>>>>
>>>> Let's say I want to use the organism package for the Apis mellifera version 2 genome, a supported genome according to the goseq supportedGenomes() command.  I see that there are several packages that follow the naming convention org.<Genome>.<ID>.db, but none of these is for the Apis version 2 genome.  What organism package do I install for Apis mellifera 2?  Can I use BSgenome.Amellifera.UCSC.apiMel2 even though it does not follow the proper naming scheme?
>>>
>>> The org.<Genome>.<ID>.db packages contain annotations (more precisely mappings between gene ids and other kinds of ids). The BSgenome.* packages (like Amellifera.UCSC.apiMel2) contain no annotations, only genomes (i.e. DNA sequences). Of course there is no way you can use one to replace an org.<Genome>.<ID>.db package.
>>>
>>> Cheers,
>>> H.
>>>
>>> PS: Please use the "Reply" button so emails are grouped by thread. This discussion is already splitted in 3 different threads on the list which is not going to help. Thanks!
>>>
>>>>
>>>> Thanks.
>>>>
>>>> Vanessa
>>>>
>>>> Research Molecular Biologist
>>>> USDA-ARS
>>>> Carl Hayden Bee Research Center
>>>> 2000 E. Allen Rd., Tucson, AZ  85719
>>>>
>>>> (520) 647-9269
>>>>
>>>> This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.
>>>>
>>>> 	[[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319