[BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria

Sarah Pohl Sarah.Pohl at helmholtz-hzi.de
Tue Aug 27 14:25:18 CEST 2013


Hey Mark,

I had a self-made file, but this one would also be fine if it worked. This time, this happened:

> txdb <- makeTranscriptDbFromGFF(file="NC_008463_ncbi.gff", format="gff3", dataSource="CDS", species="Pseudomonas aeruginosa", chrominfo=inf)
extracting transcript information
Error in .prepareGFF3TXS(data) :
  No Transcript information present in gff file
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] BSgenome_1.28.0         GenomicFeatures_1.12.3  AnnotationDbi_1.22.6
 [4] Biobase_2.20.1          VariantAnnotation_1.6.7 Rsamtools_1.12.3
 [7] Biostrings_2.28.0       GenomicRanges_1.12.4    IRanges_1.18.3
[10] BiocGenerics_0.6.0

loaded via a namespace (and not attached):
 [1] biomaRt_2.16.0     bitops_1.0-6       DBI_0.2-7          RCurl_1.95-4.1     RSQLite_0.11.4
 [6] rtracklayer_1.20.4 stats4_3.0.1       tools_3.0.1        XML_3.98-1.1       zlibbioc_1.6.0


What transcript information should be present in the gff3 file??

Sarah

> ------------------------------
>
> Message: 5
> Date: Fri, 23 Aug 2013 10:23:49 -0700
> From: Marc Carlson <mcarlson at fhcrc.org>
> To: bioconductor at r-project.org
> Subject: Re: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria
>       genomes
> Message-ID: <52179AA5.8070206 at fhcrc.org>
> Content-Type: text/plain; charset=UTF-8; format=flowed
>
> Thank you Sarah,
>
> That is much better.  Is this the file you were parsing here?
>
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Pseudomonas_aeruginosa_UCBPP_PA14_uid57977/NC_008463.gff
>
>
> Marc



On 08/23/2013 03:49 AM, Sarah Pohl wrote:
> Hey Marc,
>
> I'm sorry, I came here via gmane.org and didn't see the posting guide. I'll attach the relevant information this time.
> I tried with the chrominfo argument, and in a sense it works. At least there's no error about the missing chromosome size now. The main error stays the same, though.
>
> I checked my gff3 file with http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online yesterday and according to them it is fine.
>
> Here's the code:
> library(VariantAnnotation)
> library(GenomicFeatures)
> library(BSgenome)
> inf <- data.frame(cbind("NC_008463", 6537648, TRUE))
> txdb <- makeTranscriptDbFromGFF(file="//CPI-SL64001/spo12/BSgenome/annotation/NC_008463.gff", format="gff3", dataSource="CDS", species="Pseudomonas aeruginosa", chrominfo=inf)
>
> the error:
> Prepare the 'metadata' data frame ... metadata: OK
> Error in is.data.frame(arg) : object 'tables' not found
>
> and the session info:
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
> [4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] BSgenome_1.28.0         GenomicFeatures_1.12.3  AnnotationDbi_1.22.6
>   [4] Biobase_2.20.1          VariantAnnotation_1.6.7 Rsamtools_1.12.3
>   [7] Biostrings_2.28.0       GenomicRanges_1.12.4    IRanges_1.18.3
> [10] BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
>   [1] biomaRt_2.16.0     bitops_1.0-6       DBI_0.2-7          RCurl_1.95-4.1     RSQLite_0.11.4
>   [6] rtracklayer_1.20.4 stats4_3.0.1       tools_3.0.1        XML_3.98-1.1       zlibbioc_1.6.0
> Date: Thu, 22 Aug 2013 11:27:39 -0700
> From: Marc Carlson <mcarlson at fhcrc.org>
> To: bioconductor at r-project.org
> Subject: Re: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria
>          genomes
> Message-ID: <5216581B.8090608 at fhcrc.org>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>
>
> On 08/22/2013 02:12 AM, Sarah Pohl wrote:
>> Cook, Malcolm <MEC at ...> writes:
>>
>>> FYI, bioperl includes bp_genbank2gff3.pl
>>>
>>> which when run as
>>>
>>>> bp_genbank2gff3.pl NC_011025.gbk
>>> produces NC_011025.gbk.gff (attached)
>>>
>>> which loaded without error with transcript:
>>>
>>>> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3",
>> dataSource="NCBI",
>>> species="Some bact")
>>> extracting transcript information
>>> Extracting gene IDs
>>> extracting transcript information
>>> Processing splicing information for gff3 file.
>>> Deducing exon rank from relative coordinates provided
>>> Prepare the 'metadata' data frame ... metadata: OK
>>> Now generating chrominfo from available sequence names. No chromosome
>> length information is available.
>>> Warning messages:
>>> 1: In .deduceExonRankings(exs, format = "gff") :
>>>     Infering Exon Rankings.  If this is not what you expected, then please
>> be sure that you have provided a valid
>>> attribute for exonRankAttributeName
>>> 2: In matchCircularity(chroms, circ_seqs) :
>>>     None of the strings in your circ_seqs argument match your seqnames.
>>>> txdb
>>> TranscriptDb object:
>>> | Db type: TranscriptDb
>>> | Supporting package: GenomicFeatures
>>> | Data source: NCBI
>>> | Genus and Species: Some bact
>>> | miRBase build ID: NA
>>> | transcript_nrow: 631
>>> | exon_nrow: 631
>>> | cds_nrow: 631
>>> | Db created by: GenomicFeatures package from Bioconductor
>>> | Creation time: 2013-06-07 14:52:50 -0500 (Fri, 07 Jun 2013)
>>> | GenomicFeatures version at creation time: 1.10.2
>>> | RSQLite version at creation time: 0.11.2
>>> | DBSCHEMAVERSION: 1.0
>> Hey,
>>
>> I know I'm a bit late for this discussion, but I have a similar problem.
>>
>> I have a bacterial GBK file which I tried to convert using the
>> bp_genbank2gff3.pl script,
>>       perl bp_genbank2gff3.pl annotation/NC_008463.gbk -o annotation/
>> but I got the following error:
>>      "Can't call method "binomial" on an undefined value at bp_genbank2gff3.pl
>> line 672, <FH> line 208948."
>> So instead I converted it with Biopython and the BCBio module, which worked
>> fine.
>> Only now, when I try to load it with makeTranscriptDbFromGFF,
>>       txdb <- makeTranscriptDbFromGFF(file="NC_008463.gff", format="gff3",
>> dataSource="CDS", species="Pseudomonas aeruginosa")
>> I also get an error:
>>       Error in unique(tables[["transcripts"]][["tx_chrom"]]) :
>>       'unique': Error: object 'tables' not found
>>
>> Why does this happen and what can I do about it?
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> Hi Sarah,
>
> It's hard to help you because it's pretty difficult to know what
> actually happened after reading your post.  I can't be sure if the other
> scripts you mention produced a valid gff3 file and I have no idea which
> version of the software you are using.  Please see our posting guide here:
>
> http://www.bioconductor.org/help/mailing-list/posting-guide/
>
> But I will go out on a limb anyways and guess (based only the error code
> in your message), that your problem might get better if you passed in a
> value to the chrominfo argument.  You can see an example of how to use
> that argument in the manual page by pulling the manual page up like this:
>
> help(makeTranscriptDbFromGFF)
>
> Hope this helps,
>
>
>     Marc
>
> ________________________________
>
> Helmholtz-Zentrum f?r Infektionsforschung GmbH | Inhoffenstra?e 7 | 38124 Braunschweig | www.helmholtz-hzi.de
> Das HZI ist seit 2007 zertifiziertes Mitglied im "audit berufundfamilie"
>
> Vorsitzende des Aufsichtsrates: MinDir?in B?rbel Brumme-Bothe, Bundesministerium f?r Bildung und Forschung
> Stellvertreter: R?diger Eichel, Abteilungsleiter Nieders?chsisches Ministerium f?r Wissenschaft und Kultur
> Gesch?ftsf?hrung: Prof. Dr. Dirk Heinz; Ulf Richter, MBA
> Gesellschaft mit beschr?nkter Haftung (GmbH)
> Sitz der Gesellschaft: Braunschweig
> Handelsregister: Amtsgericht Braunschweig, HRB 477
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



------------------------------

Message: 6
Date: Fri, 23 Aug 2013 11:27:39 -0700
From: Dan Tenenbaum <dtenenba at fhcrc.org>
To: Alleene <astrickland at med.miami.edu>
Cc: Xi Wang <Xi.Wang at newcastle.edu.au>,
        "bioconductor at stat.math.ethz.ch bioconductor"
        <bioconductor at stat.math.ethz.ch>
Subject: Re: [BioC] SeqGSEA estiGeneNBstat()
Message-ID:
        <CAF42j23htuLRXS=5vStqaOc2eTpqqT8R1HKCij5XpWZu8dmfhA at mail.gmail.com>
Content-Type: text/plain; charset="ISO-8859-1"

CC'ing the maintainer. I think this person is referring to this post:
https://stat.ethz.ch/pipermail/bioconductor/2013-July/053717.html

Dan


On Wed, Aug 21, 2013 at 1:03 PM, Alleene <astrickland at med.miami.edu> wrote:
> I am having the same issue - any help would be appreciated.
>
> Thank you!
> -Alleene
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



------------------------------

Message: 7
Date: Fri, 23 Aug 2013 11:31:46 -0700
From: Valerie Obenchain <vobencha at fhcrc.org>
To: Dan Du <tooyoung at gmail.com>
Cc: Michael Lawrence <lawrence.michael at gene.com>,       Bioconductor List
        <bioconductor at stat.math.ethz.ch>
Subject: Re: [BioC] Semantics of GenomicRanges gaps()
Message-ID: <5217AA92.9040703 at fhcrc.org>
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed

Hi Dan,

Thanks for catching this. Herve is currently out of town. We'll put this
on the TODO and address it when he is back.

Valerie


On 08/21/2013 12:13 AM, Dan Du wrote:
> Hi all,
>
> Sorry to wake up this sleeping beauty.
>
> Just something more to add and consider, there is another inconsistency
> in the gaps function for GRanges, when there is no seqinfo present. The
> default gaps will not produce extra ranges for those unused strands,
> unless one specify the "end" argument, however only setting "start"
> won't trigger the extra results. This is the current behavior of both
> release and dev branch,
>
> ## rls
> GenomicRanges_1.13.36 IRanges_1.19.24
> ## dev
> GenomicRanges_1.12.4 IRanges_1.18.3
>
> ## case without seqinfo
> g<-GRanges(seqnames="1", IRanges(3, 10), strand='*')
> seqinfo=Seqinfo("1",seqlengths=1000))
> g
> GRanges with 1 range and 0 metadata columns:
>        seqnames    ranges strand
>           <Rle> <IRanges>  <Rle>
>    [1]        1   [3, 10]      *
>    ---
>    seqlengths:
>      1
>     NA
>
> ## default, no extra
> gaps(g) # this is expected, start always defaults to 1L
> GRanges with 1 range and 0 metadata columns:
>        seqnames    ranges strand
>           <Rle> <IRanges>  <Rle>
>    [1]        1    [1, 2]      *
>    ---
>    seqlengths:
>      1
>     NA
>
> ## with start, no extra
> gaps(g, start=2)
> GRanges with 1 range and 0 metadata columns:
>        seqnames    ranges strand
>           <Rle> <IRanges>  <Rle>
>    [1]        1    [2, 2]      *
>    ---
>    seqlengths:
>      1
>     NA
> ## with end, strand comes into play, giving two more
> gaps(g, end=20)
> GRanges with 4 ranges and 0 metadata columns:
>        seqnames    ranges strand
>           <Rle> <IRanges>  <Rle>
>    [1]        1  [ 1, 20]      +
>    [2]        1  [ 1, 20]      -
>    [3]        1  [ 1,  2]      *
>    [4]        1  [11, 20]      *
>    ---
>    seqlengths:
>      1
>     NA
>
> ## with both start and end, same as above
> gaps(g, start=2, end=20) #
> GRanges with 4 ranges and 0 metadata columns:
>        seqnames    ranges strand
>           <Rle> <IRanges>  <Rle>
>    [1]        1  [ 2, 20]      +
>    [2]        1  [ 2, 20]      -
>    [3]        1  [ 2,  2]      *
>    [4]        1  [11, 20]      *
>    ---
>    seqlengths:
>      1
> ## reset seqinfo
> seqinfo(g)<-Seqinfo("1",seqlengths=1000)
>
> ## now the extra ranges are back
> gaps(g, start=2)
> GRanges with 4 ranges and 0 metadata columns:
>        seqnames     ranges strand
>           <Rle>  <IRanges>  <Rle>
>    [1]        1 [ 2, 1000]      +
>    [2]        1 [ 2, 1000]      -
>    [3]        1 [ 2,    2]      *
>    [4]        1 [11, 1000]      *
>    ---
>    seqlengths:
>        1
>     1000
>
> Best,
> Dan
>
> On Mon, 2013-06-03 at 05:54 -0700, Michael Lawrence wrote:
>> Not sure how confusing this would be, but a common case is (like Alex's
>> input) a set of ranges where the strand is uniform, i.e., all ranges have
>> strand 'x'. In that case, gaps,GenomicRanges could behave just like
>> gaps,Ranges and return only ranges on strand 'x'. That would satisfy both
>> properties.
>>
>> A natural extension would be that if all ranges are '+' or '-' (none of
>> them are '*'), then do not add the '*' range spanning the whole sequence.
>> Also satisfies both properties.
>>
>> I only know of one use-case where all three strand types are mixed:
>> rna-seq, where the strand has been inferred only from the spliced
>> alignments. Not if gaps() would even be useful in that case, but if it
>> were, I think the current gaps behavior would make sense.
>>
>> Michael
>>
>>
>>
>>
>> On Sun, Jun 2, 2013 at 11:36 PM, Herv Pags <hpages at fhcrc.org> wrote:
>>
>>> Hi Alex,
>>>
>>>
>>> On 05/31/2013 01:46 AM, Alex Gutteridge wrote:
>>>
>>>> I just have a quick question/comment about the behaviour of the gaps
>>>> function in the GenomicRanges package, particularly with how it handles
>>>> * strand ranges. The current behaviour is as below:
>>>>
>>>>   range = GRanges(seqnames="1",IRanges(**start=300,end=700), strand="*",
>>>>> seqinfo=Seqinfo("1",**seqlengths=1000))
>>>>> range
>>>>>
>>>> GRanges with 1 range and 0 metadata columns:
>>>>         seqnames     ranges strand
>>>>            <Rle>  <IRanges>  <Rle>
>>>>     [1]        1 [300, 700]      *
>>>>     ---
>>>>     seqlengths:
>>>>         1
>>>>      1000
>>>>
>>>>> gaps(range)
>>>>>
>>>> GRanges with 4 ranges and 0 metadata columns:
>>>>         seqnames      ranges strand
>>>>            <Rle>   <IRanges>  <Rle>
>>>>     [1]        1 [  1, 1000]      +
>>>>     [2]        1 [  1, 1000]      -
>>>>     [3]        1 [  1,  299]      *
>>>>     [4]        1 [701, 1000]      *
>>>>     ---
>>>>     seqlengths:
>>>>         1
>>>>      1000
>>>>
>>>> My interpretation of "*" as a strand identifier is that it means the
>>>> range covers both + and - strands and so intuitively I was expecting the
>>>> 'gaps' to only cover the 3rd and 4th ranges returned above (not the
>>>> full-length + and - strand ranges).
>>>>
>>>
>>> It's even worse than that. If there is no range at all on a chromosome,
>>> gaps(gr) will return 3 ranges covering the full chromosome:
>>>
>>>    > gr <- GRanges("chr1",
>>>                    IRanges(start=300,end=700),
>>>                    seqlengths=c(chr1=1000,chr2=**1000))
>>>
>>>    > gr
>>>
>>>    GRanges with 1 range and 0 metadata columns:
>>>          seqnames     ranges strand
>>>             <Rle>  <IRanges>  <Rle>
>>>      [1]     chr1 [300, 700]      *
>>>      ---
>>>      seqlengths:
>>>       chr1  chr2
>>>       1000 1000
>>>
>>>    > gaps(gr)
>>>
>>>    GRanges with 7 ranges and 0 metadata columns:
>>>          seqnames      ranges strand
>>>             <Rle>   <IRanges>  <Rle>
>>>      [1]     chr1 [  1, 1000]      +
>>>      [2]     chr1 [  1, 1000]      -
>>>      [3]     chr1 [  1,  299]      *
>>>      [4]     chr1 [701, 1000]      *
>>>      [5]     chr2 [  1, 1000]      +
>>>      [6]     chr2 [  1, 1000]      -
>>>      [7]     chr2 [  1, 1000]      *
>>>      ---
>>>      seqlengths:
>>>       chr1  chr2
>>>       1000 1000
>>>
>>>
>>>   The semantics here implies to me
>>>> that the * strand is being thought of as a kind of imaginary third
>>>> strand and hence doesn't overlap with the + or - strands. This is
>>>> contrary to the semantics of findOverlaps which does appear to consider
>>>> a * strand range to overlap with + or - strand ranges:
>>>>
>>>>   findOverlaps(range,gaps(range)**)
>>>>>
>>>> Hits of length 2
>>>> queryLength: 1
>>>> subjectLength: 4
>>>>     queryHits subjectHits
>>>>      <integer>   <integer>
>>>>    1         1           1
>>>>    2         1           2
>>>>
>>>> To summarise I guess I was expecting findOverlaps(range,gaps(range)**) to
>>>> never return any overlaps under any circumstance (that I can think of!).
>>>>
>>>
>>> Let's call this property 2. This sounds indeed like a good property
>>> that it would be nice to have. But an even more important property is
>>> property 1: gaps() must be its own reverse operation i.e.
>>> 'gaps(gaps(gr))' must always return 'gr'.
>>>
>>> The current behavior of gaps() guarantees property 1. I'm not against
>>> changing gaps() behavior to also have property 2, as long as property
>>> 1 is preserved. Suggestions are welcome.
>>>
>>> Thanks,
>>> H.
>>>
>>>
>>>   Do people agree the behaviour of gaps() is not quite intuitive in this
>>>> case?
>>>>
>>>>   sessionInfo()
>>>>>
>>>> R version 2.15.2 (2012-10-26)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>    [7] LC_PAPER=C                 LC_NAME=C
>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] GenomicRanges_1.10.7 IRanges_1.16.6       BiocGenerics_0.4.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] parallel_2.15.2 stats4_2.15.2
>>>>
>>>>
>>> --
>>> Herv Pags
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>>
>>      [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>



------------------------------

Message: 8
Date: Fri, 23 Aug 2013 14:30:32 -0700
From: Marc Carlson <mcarlson at fhcrc.org>
To: bioconductor at r-project.org
Subject: Re: [BioC] defining accessors (cols, keys, keytypes, select)
        for annotation package built with AnnotationForge package
Message-ID: <5217D478.5070000 at fhcrc.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

On 08/20/2013 12:40 AM, Sashi wrote:
> Marc Carlson <mcarlson at ...> writes:
>
>> On 08/14/2013 04:58 AM, Sashi wrote:
>>> Marc Carlson <mcarlson <at> ...> writes:
>>>
>>>> Hi Sashi,
>>>>
>>>> The PDF from Gabor that you are looking at is much older and was from
>>>> before we even had the select method.  These days you probably don't
>>>> want to do that.  Especially if you want to implement a method like
>>>> select().  I strongly suspect that you really just want to be looking
> at
>>>> this vignette instead:
>>>>
>>>>
> http://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationForge/
> inst/doc/MakingNewAnnotationPackages.pdf
>>>> To answer your questions, GO is actually looking at a view that is
>>>> created in the database of the three GO tables (one for BP, MF and CC).
>>>> But you probably don't need that level of detail.  If you are using
>>>> org.At.tair.db to look at arabidopsis, then you may already have
>>>> everything you need.  And if you need another organism, you probably
>>>> want to look 1st at making an org package using
>>>> makeOrgPackageFromNCBI().  And if for some reason you want to expose
>>>> some entirely new database resource (IOW you don't want to make an
>>>> organism package but something else entirely), then you might need to
>>>> use the vignette above.
>>>>
>>>> I hope this helps you,
>>>>
>>>>      Marc
>>>>
>>>> On 08/13/2013 04:33 AM, Rameswara Sashi Kiran Challa wrote:
>>>>> Hi ,
>>>>>
>>>>> I am trying to build an annotation organism package by using
> Annotation
>>>>> Forge package. I followed this
>>>>>
> document<http://www.bioconductor.org/packages/2.12/bioc/vignettes/Annotation
> Forge/inst/doc/NewSchema.pdf>written
>>>>> by Gabor Csardi.
>>>>> I was able to build a sqlite database and create an Annotation package
>>>>> using the makeAnnDbPkg() function.
>>>>>
>>>>> I understand cols(), keys(), keytypes() and select() are set as
> generic
>>>>> methods in AnnotationDbi.
>>>>>
>>>>> When I look into methods-AnnotationDb.R script in AnnotationDbi
> package, I
>>>>> see cols() method is set and it actually reads all the columns of all
> the
>>>>> tables in the sqlite table.
>>>>>
>>>>> When I run *cols() *on *org.At.tair.db  *I get few values which are
>>>>> actually not field/column names in the sqlite db. For Eg. there is no
> table
>>>>> called "GO" in org.At.tair.sqlite database. I am unable to understand
> how
>>>>> it creates these values. Could someone please help me understand how
> and
>>>>> where exactly these accessor functions are defined and how and where
> are
>>>>> they to be modified to be able to access the data in the sqlite db
> that I
>>>>> am creating for the organism I am working on.
>>>>>
>>>>> ==========================
>>>>>
>>>>>> cols(org.At.tair.db)
>>>>>     [1] "TAIR"         "CHRLOC"       "CHRLOCEND"    "ENZYME"
> "PATH"
>>>>>
>>>>> [6] "PMID"         "REFSEQ"       "SYMBOL"       "GENENAME"     "GO"
>>>>>
>>>>>
>>>>> [11] "EVIDENCE"     "ONTOLOGY"     "GOALL"        "EVIDENCEALL"
>>> "ONTOLOGYALL"
>>>>> [16] "ARACYC"       "ARACYCENZYME" "ENTREZID"     "CHR"
>>>>> =======================================
>>>>>
>>>>> Please point me to any documentation available for the same.
>>>>>
>>>>> Thanks for your time,
>>>>> Sashi
>>>>>
>>>>>   [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor <at> ...
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor <at> ...
>>> Hi Marc,
>>>
>>> Thanks for your prompt reply. Referring to the document you pointed me
> to, I
>>> created another R script within the organism package skeleton( an R
> script
>>> apart from zzz.R) and set cols, keytypes accessor methods.
>>>
>>> As part of annotation packages Bimaps are created in every annotation
>>> package. How do we use these Bimaps in these accessor methods? Am I
> right in
>>> thinking that these Bimaps are to be used in these accessor methods? Or
>>> those Bimaps have to be accessed only via get(), mget(), toTable()
> methods?
>>> Also, can you please let me know if there is any documentation available
> on
>>> how the GO views are created? I see there are seperate tables like
> go_cc,
>>> go_mf, go_bp, etc under Arabidopsis annotation package. Is it necessary
> to
>>> have go_cc, go_mf, go_bp, go_mf_all, like tables in the sqlite database
> for
>>> the customized annotation package I am creating? Will not just a single
>>> table for all GO annotations suffice?
>>>
>>> Thanks again for your time,
>>> Sashi
>> Hi Sashi,
>>
>> I really doubt that you need to think about bimaps at all.  You don't
>> need them to implement select, cols, keytypes or keys.  And they are
>> really only still supported for the sake of older legacy code.  The get,
>> mget, and toTable methods are defined to help with bimaps, but you
>> probably don't need to use these methods anyways. So it's very unlikely
>> that you would even need to use bimaps let alone implement them.
>>
>> And the go view is just a SQLite database view.  A view is sort of like
>> a pre-canned database query that appears as a table.  Our "go view" is
>> really just the union of go_bp, go_mf, and go_cc tables. Those three
>> separate tables allow us to still keep the different terms (from the
>> different ontologies) as separate from each other in the database.  But
>> since we are using a view, we can also easily query all three of them
>> (as if they were lumped together) WITHOUT actually duplicating all that
>> data into another enormous table.  And the performance for this is still
>> great.
>>
>> You can read a bit about how SQLITE views are created here if you are
>> curious:
>>
>> http://www.sqlite.org/lang_createview.html
>>
>> But if you are making an org package, why not just use
>> makeOrgPackageFromNCBI?
>>
>>     Marc
>>
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at ...
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at ...
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
> Thanks a lot Marc!!
>
> It's good to know that BioConductor community is trying to move away from
> Bimaps and adopt cols, select, keys, keytypes methods for any sort of
> queries. Are Reverse maps that are part of Bimaps taken care by these
> accesors?
>
> As I understand, for older legacy code perhaps the makeOrgPackageFromNCBI is
> also still generating Bimaps and in near future, perhaps all the Annotation
> packages will just have a sqlite database and these accessors, defined. Am I
> correct?
>
> I had started by looking at how to build a sqlite db with some of the
> mappings we have and had not used makeOrgPackageFromNCBI function. My
> thinking was that having an understanding of sqlite db building will enable
> me to add any new mappings that are not part of NCBI.
>
> So, to summarize, for Annotation package development one approach is using
> makeOrgPackageFromNCBI() and the other approach is to make a sqlite db and
> then define these accessors, as given in the pdf you had linked me to
> earlier. And there will be no need of any Bimaps for the package development
> as such.
>
> Thanks for your time,
> -Sashi
Hi Sashi,

We don't aim to get rid of bimaps from packages that have already had
them before, but we definitely don't think that new packages need to
have anything other than cols, keys, keytypes and select. Reverse maps
are also unnecessary if you have defined those newer methods.

So yes any new stuff you are developing should really just focus on
those four accessors.  The other stuff is just older stuff that you
shouldn't ever need.  And if you do need it, when we need to change
something so that you no longer need it!  We want to make it EASIER for
people like you who are interested in exposing new resources.

Science is hard enough already.  ;)



   Marc



> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



------------------------------

Message: 9
Date: Fri, 23 Aug 2013 15:34:42 -0700
From: Valerie Obenchain <vobencha at fhcrc.org>
To: Peter Hickey <hickey at wehi.EDU.AU>
Cc: bioconductor at r-project.org
Subject: Re: [BioC] findOverlaps() fails with type = 'equal' fails
        when query is a CollapsedVCF object and subject is a GRanges object
Message-ID: <5217E382.7090107 at fhcrc.org>
Content-Type: text/plain; charset=windows-1252; format=flowed

Hi Peter,

Thanks for catching this oversight. 'equal' was omitted when
implementing findOverlaps() methods for SummarizedExperiment. The other
'overlap' methods call findOverlaps() so they were failing too.

Now fixed in GenomicRanges 1.13.37 (devel) and 1.12.5 (release). Both
should be available via biocLite() by Saturday noon PST.

Valerie

On 08/21/2013 09:04 PM, Peter Hickey wrote:
> Hi all,
>
> I got a bit lost trying to figure out why the following code does not work. The same code does work when type = 'any', 'start', 'end' or 'within', but not when type = 'equal'. When type = 'equal' it fails with one of the following:
> ## countOverlaps() error message
> Error in queryHits(findOverlaps(query, subject, maxgap = maxgap, minoverlap = minoverlap,  :
>    error in evaluating the argument 'x' in selecting a method for function 'queryHits': Error in match.arg(type) :
>    'arg' should be one of ?any?, ?start?, ?end?, ?within?
> ## findOverlaps() and overlapsAny() error message
> Error in match.arg(type) :
>    'arg' should be one of ?any?, ?start?, ?end?, ?within?
>
> So 'equal' isn't a valid option for when the subject is a CollapsedVCF object, whereas it is a valid option for when the subject is, say, a GRanges object, correct?
>
> Is it possible to extend findOverlaps(), countOverlaps() and overlapsAny() to allow for type = 'equal' when the subject is a CollapsedVCF object? Ideally this would also work if query and subject were both of CollapsedVCF class because I'm often interested in finding overlaps between two VCF files and I'd like to be able to do that with type = 'equal'.
>
> Thanks
> Peter Hickey
>
> #### DESCRIPTION ####
> # Peter Hickey
> # 22/08/2013
> # findOverlaps(), countOverlaps() and overlapsAny() don't work when `query` is a CollapsedVCF object, `subject` is a GRanges object and `type` is 'equal' yet they do work when `type` is 'any'.
>
> #### Load packages ####
> library(VariantAnnotation)
>
> #### Create VCF ####
> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> vcf <- readVcf(fl, "hg19")
>
> #### Create GRanges object ####
> gr <- GRanges(seqnames = '20', ranges = IRanges(start = 14370, end = 14370))
>
> #### countOverlaps ####
> countOverlaps(vcf, gr, type = 'any') # Works
> countOverlaps(vcf, gr, type = 'equal') # Doesn't work
> findOverlaps(vcf, gr, type = 'any') # Works
> findOverlaps(vcf, gr, type = 'equal') # Doesn't work
> overlapsAny(vcf, gr, type = 'any') # Works
> overlapsAny(vcf, gr, type = 'equal') # Doesn't work
>
> #### sessionInfo ####
> sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] VariantAnnotation_1.6.7 Rsamtools_1.12.3        Biostrings_2.28.0
> [4] GenomicRanges_1.12.4    IRanges_1.18.3          BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
>   [1] AnnotationDbi_1.22.6   Biobase_2.20.1         biomaRt_2.16.0
>   [4] bitops_1.0-6           BSgenome_1.28.0        DBI_0.2-7
>   [7] GenomicFeatures_1.12.3 RCurl_1.95-4.1         RSQLite_0.11.4
> [10] rtracklayer_1.20.4     stats4_3.0.0           tools_3.0.0
> [13] XML_3.98-1.1           zlibbioc_1.6.0
>
>
>
>
> --------------------------------
> Peter Hickey,
> PhD Student/Research Assistant,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Ph: +613 9345 2324
>
> hickey at wehi.edu.au
> http://www.wehi.edu.au
>
>
> ______________________________________________________________________
> The information in this email is confidential and intend...{{dropped:8}}
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>



------------------------------

Message: 10
Date: Fri, 23 Aug 2013 16:05:07 -0700
From: "Tim Triche, Jr." <tim.triche at gmail.com>
To: Jon Manning <Jonathan.Manning at ed.ac.uk>
Cc: "bioconductor at r-project.org" <bioconductor at r-project.org>
Subject: Re: [BioC] Control probe QC plots for Infinium 450k chips
        (lumi,  minfi...)
Message-ID:
        <CAC+N9BXW54A_k3ekWE9st5hzwifrHcqVRME-EZCh1QeJ14G2tQ at mail.gmail.com>
Content-Type: text/plain

heh.  yeah I wrote something like this a long time ago for methylumi... not
sure if there's an issue with the Final Report Format parser or what, but
normally the control probe names are preserved.  Anyways, here's the fugly
old code, which I might as well update in methylumi in SVN.

If a type of probe isn't found, the plot will try and search by name for
the specified controls.  If it finds those, it will break them out by name
(the assumption here is that the user knows what they are doing if they ask
for e.g. bisulfite conversion probes).

You might also look into shinyMethyl, which is another way of approaching
this.  There are coercions within methylumi that will take a MethyLumiSet
or a MethyLumiM object and turn it into a MethylSet or a RGChannelSet, so
presumably it would be possible to use shinyMethyl (which builds on minfi)
for QC.  I happen to think that minfi has a nice clean codebase and makes
it relatively easy to build upon, but that's just me.

Fugly old code follows...

library(methylumi)
PBLs <- readRDS('PBLs.rds') ## a MethyLumiSet I had lying around
## caveat: I have always read in data from IDATs, so I didn't test the
parser here

## examples:
## qc.probe.plot(PBLs, controltype='DNP') ## finds the probes by name
## qc.probe.plot(PBLs, controltype='bisulfite') ## finds the probes by type
## qc.probe.plot(PBLs, by.type=T, controltype='bisulfite') ## splits the
probes by type

## avert your eyes, this code predates my learning how to be hygienic in
R...
qc.probe.plot <- function (obj, controltype="negnorm", log2=T, by.type=F,
...)
{
    require("ggplot2")
    require("reshape2")
    require("scales")
    by.probe <- FALSE
    log2_trans <- log_trans(base = 2)
    if (class(obj) %in% c("MethyLumiSet", "MethyLumiM")) {
        qc <- controlData(obj)
        if (!identical(sampleNames(qc), sampleNames(obj))) {
            sampleNames(qc) <- sampleNames(obj)
        }
    }
    else if (class(obj) == "MethyLumiOOB" & tolower(controltype) ==
        "oob") {
        qc <- obj
    }
    else if (class(obj) == "MethyLumiQC") {
        qc <- obj
    }
    else {
        stop("Don't know how to QC this data you've given me...")
    }
    if (tolower(controltype) == "negnorm" || missing(controltype)) {
        rows <- grep("(Negative|Norm)", fData(qc)$Type, ignore.case = T)
    }
    else {
        rows <- grep(paste("^", controltype, sep = ""), fData(qc)$Type,
            ignore.case = T)
        if(length(rows) < 1) {
            ## try looking by name instead
            rows <- grep(paste("^", controltype, sep = ""),
featureNames(qc),
                         ignore.case = T)
            by.probe <- TRUE
        }
    }
    if (tolower(controltype) == "oob") {
        dat <- intensities.OOB.allelic(obj)
        rownames(dat$Cy5$M) <- paste(rownames(dat$Cy5$M), "M", sep = "_")
        rownames(dat$Cy5$U) <- paste(rownames(dat$Cy5$U), "U", sep = "_")
        rownames(dat$Cy3$M) <- paste(rownames(dat$Cy3$M), "M", sep = "_")
        rownames(dat$Cy3$U) <- paste(rownames(dat$Cy3$U), "U", sep = "_")
        dat$Cy5 <- rbind(dat$Cy5$M, dat$Cy5$U)
        dat$Cy3 <- rbind(dat$Cy3$M, dat$Cy3$U)
        names(dat) = gsub("Cy3", "green", gsub("Cy5", "red", names(dat)))
        dat <- lapply(dat, function(d) {
            datum = data.frame(d)
            colnames(datum) = gsub("^X", "", colnames(datum))
            m.probes = grepl("_M$", rownames(d))
            datum$probe = as.factor(gsub("_(M|U)$", "", rownames(datum)))
            datum$type = "unmethylated"
            datum$type[which(m.probes)] = "methylated"
            datum$type = as.factor(datum$type)
            return(datum)
        })
        dat$green$channel = "Cy5"
        dat$red$channel = "Cy3"
        dat.frame <- rbind(dat$red, dat$green)
        dat.frame$channel <- as.factor(dat.frame$channel)
        a.title <- "Out-of-band probe intensity plot"
    } else {
        probes <- featureNames(qc)[rows]
        type <- as.factor(fData(qc)$Type[rows])
        if (tolower(controltype) == "negnorm") {
            if ("NORM_C" %in% unique(fData(qc)$Type)) {
                colour.settings <- c(NEGATIVE = "darkgray", NORM_A = "red",
                  NORM_T = "darkred", NORM_C = "green", NORM_G =
"darkgreen")
                type = factor(type, levels = names(colour.settings))
                shape.settings <- c(20, 20, 20, 20, 20)
            } else {
                colour.settings <- c("darkgray", "darkgreen", "red")
                shape.settings <- c(20, 20, 20)
            }
        }
        dat <- c(red = "unmethylated", green = "methylated")
        Cy5 <- as.data.frame(assayDataElement(qc, dat[1])[rows, ])
        Cy5$channel <- "Cy5 (Red)"
        Cy5$probe <- as.factor(probes)
        Cy3 <- as.data.frame(assayDataElement(qc, dat[2])[rows, ])
        Cy3$channel <- "Cy3 (Green)"
        Cy3$probe <- as.factor(probes)
        Cy3$type <- Cy5$type <- type
        dat.frame <- rbind(Cy5, Cy3)
        dat.frame$channel <- as.factor(dat.frame$channel)
        a.title <- paste(controltype, "control probe plot")
        if (tolower(controltype) == "negnorm") {
            a.title <- "Negative & normalization control probes"
        }
    }
    more.args = list(...)
    if ("extra" %in% names(more.args))
        a.title = paste(a.title, more.args[["extra"]])
    qc <- melt(dat.frame, id = c("probe", "channel", "type"))
    geometry <- ifelse(tolower(controltype) == "negnorm",
ifelse(tolower(controltype) == "oob", "boxplot", "jitter"), "point")
    if (tolower(controltype) == "oob") {
        qc$grouping = paste(qc$variable, qc$type, sep = ".")
        p <- ggplot2::qplot(data = qc, colour = type, fill = type,
            group = grouping, x = variable, y = value, geom = "boxplot",
            main = a.title, xlab = "Sample", ylab = "Intensities") +
            coord_flip()
        if (log2)
            p <- p + scale_x_continuous(trans = "log2", limits = c(2^4,
2^16))
        else
            p <- p + scale_x_continuous(limits = c(2^4, 2^16))
    } else {
        if(by.probe) {
            p <- ggplot2::qplot(data = qc, colour = probe, shape = probe,
                x = value, y = variable, geom = geometry, main = a.title,
                ylab = "Sample", xlab = "Intensities")
        } else {
            p <- ggplot2::qplot(data = qc, colour = type, shape = type,
                x = value, y = variable, geom = geometry, main = a.title,
                ylab = "Sample", xlab = "Intensities")
        }
        p <- p + scale_y_discrete(limits = rev(sampleNames(obj)))
        if (log2) p <- p + scale_x_continuous(trans="log2", limits=c(2^4,
2^16))
        else p <- p + scale_x_continuous(limits = c(2^4, 2^16))
    }
    if (by.type) {
        p <- p + facet_grid(type ~ channel)
    } else if(by.probe) {
        p <- p + facet_grid(. ~ channel)
    } else {
        p <- p + facet_grid(. ~ channel)
    }
    if (tolower(controltype) == "negnorm") {
        p <- p + scale_colour_manual(values = colour.settings)
        p <- p + scale_shape_manual(values = shape.settings)
    }
    p <- p + theme_bw()
    return(p)
}











On Fri, Aug 23, 2013 at 4:05 AM, Jon Manning <Jonathan.Manning at ed.ac.uk>wrote:

> Hi Bioconductors,
>
> I'm trying to find out if there are good existing ways in Bioconductor of
> plotting out the QC data from Illumina Methylation 450k chips. I want to
> plot the control probe intensities separately for green and red channels,
> label the individual probe types properly (such as 'DNP (High)', 'Biotin
> (Med)' for the staining controls), and ideally indicate expected
> (qualitative) intensities. I've actually done this myself, but want to know
> if I've missed the BioC way.
>
> The detail:
>
> I have 'sample methylation profile', 'control probe profile' etc files
> (not the IDAT files). The way Lumi and friends read the controlData
> (retrievable from a controlData slot in the methyLumiM object), probe
> detail seems to be lost, such that for the staining controls ('DNP (High)',
> 'Biotin (Med)') I just get identifiers like STAINING, STAINING.1. The
> results of methylumi's qcplot() function are then supremely un-useful.
> Likewise Minfi's QC plotting functions seem from the docs not to label
> individual control types (though I don't have the IDAT data so Minfi isn't
> very useful to me anyway).
>
> I needed to finish my work, so I brewed my own method using ggplot() and
> cross-referencing the control data and annotation from the manifest
> (outside of Lumi et al), but I feel this is something others will have done
> and that I've therefore missed something. Any pointers?
>
> Many thanks,
>
> Jon Manning
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
> ______________________________**_________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives: http://news.gmane.org/gmane.**
> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>



--
*He that would live in peace and at ease, *
*Must not speak all he knows, nor judge all he sees.*
*
*
Benjamin Franklin, Poor Richard's
Almanack<http://archive.org/details/poorrichardsalma00franrich>

        [[alternative HTML version deleted]]



------------------------------

Message: 11
Date: Fri, 23 Aug 2013 16:07:50 -0700
From: Marc Carlson <mcarlson at fhcrc.org>
To: bioconductor at r-project.org
Subject: Re: [BioC] error while using goProfiles package on
        arabidopsis entrez gene IDs
Message-ID: <5217EB46.202 at fhcrc.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi,

So now I can see a little better what you are doing.  The problem is
what is happening inside of goProfiles.  Now this is not my package and
I have never really used it much myself, so I just did a little
debugging to see what was happening, and this is what I found:

The basicProfile() function is expecting you to give it a central ID for
the org package you name for it.  It seems to be assuming that this will
be an entrez gene ID.  But that is *not* what the arabidopsis community
usually uses.  That community likes to use TAIR IDs.  So the
org.At.tair.db, uses TAIR IDs as the central ID (this is why TAIR is in
the middle of the package name).  You can get and use entrez gene IDs
with the org.At.tair.db package, but they are not the central id that is
expected by many of the older methods like mget() etc.   These days, we
have moved away from that model and now use the select method.  We feel
it's less confusing since there is no longer the need to pay attention
to which key type is most important for a package etc.  Instead the
select() interface just asks you to provide the kind of key that you are
using.  We feel this is more transparent.

So anyways here is how I was able to make it run:

## 1st take some of your entrez gene IDs
egIDs <- c("839235", "838362", "838961", "837091", "837455", "837543")

## use select to quickly translate these into TAIR IDs, and then grab
that column of IDs back out.
## (You may find it more convenient to just start with the TAIR IDs that
you said were in your file, but I don't have those here)
tairIDs <-  as.character(select(org.At.tair.db, keys=egIDs, cols="TAIR",
keytype="ENTREZID")[[2]])

## THEN call basicProfile function and pass in tair IDs instead...
## Now when it calls mget on the GO mapping, it will actually get some
matches.
basicProfile(tairIDs, idType="Entrez", onto ="ANY", level=2,
orgPackage="org.At.tair.db", ord=FALSE)


I hope this helps you,


   Marc



On 08/22/2013 02:07 AM, dd [guest] wrote:
> Hi all,
> I was using goProfiles package for functional analysis using a genelist of 316 Arabidopsis entrez gene IDs as shown below in the R command sessionInfo().
>
> - Read a file containing Entrez IDs and TAIR IDs.
> - Subset the Entrez IDs and converted to character vector.
> - Used the vector as genelist.
> -Used goProfiles package function basicProfile for this genelist with organism package of Arabidopsis.
>
> OUTPUT :Error in GOtermslist[[i]] : subscript out of bounds.
>
> Can somebody please help me in finding any mistake I might have done?
>
> Thanks in advance.
>
>   -- output of sessionInfo():
>
> Console output :
>
>>> a<-read.table("tair_ids to gene_ids.csv" ,header=TRUE,sep=",")
>>
>>> b<-as.character(a[,2])
>>> head(b)
>> [1] "839235" "838362" "838961" "837091" "837455" "837543"
>>
>>> h<-basicProfile(b,idType="Entrez",onto ="ANY",level=2,orgPackage="org.At.tair.db",ord=FALSE)
>> Error in GOtermslist[[i]] : subscript out of bounds
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



------------------------------

Message: 12
Date: Fri, 23 Aug 2013 18:53:18 -0700 (PDT)
From: Luo Weijun <luo_weijun at yahoo.com>
To: Bioconductor at r-project.org, Oleg Moskvin <moskvin at wisc.edu>
Subject: Re: [BioC] pathview puzzle
Message-ID:
        <1377309198.3634.YahooMailBasic at web125504.mail.ne1.yahoo.com>
Content-Type: text/plain; charset=utf-8

Hi Oleg,
Thanks for the note. This is indeed a problem I didn?t realize previously! KEGG uses Entrez Gene ID for all other model organisms I?ve checked.
I am working on a generic fix (not only for E coli but other species with similar situation) and will incorporate that into the development version of pathview soon. Will keep you posted.
Thanks for pointing this out.
Weijun


--------------------------------------------
On Fri, 8/23/13, Oleg Moskvin <moskvin at wisc.edu> wrote:

 Subject: Re: [BioC] pathview puzzle

 Date: Friday, August 23, 2013, 12:19 PM

 Hi Weijun,

 Thank you for the response.

 The problem seems to be deeper than that and is connected to
 special handling of a particular species - E.coli - by KEGG.


 I looked into the pathview() code and here is what I see:

 1) gene.data is remapped internally via mol.sum() to have
 ENTREZ IDs;
 2) remapped gene.data is used by node.map() to map onto KEGG
 nodes using node.data
 3) the node.data used in (2) was originally extracted from
 the KEGG XML by node.info()

 The above route implies that the "name" entries in the KEGG
 XML of type="gene" have "speciesID:ENTREZ" format...

 And in the case of E.coli this doesn't hold true! See the
 examples of XML entries for H.sapience and E.coli from my
 yesterday's message (below).

 In fact, in KEGG XML for E.coli "gene" records b-numbers are
 used as IDs!

 So, for the cases like that, when KEGG fails to be
 consistent in the supplied XML structure, one may suggest
 introducing an "id.bypass" option to pathview() which will
 take the gene.data as is (with the IDs supplied by user that
 match KEGG XML ids; for example, b-numbers), and pass this
 directly to the step 3 (node matching).

 Thanks!

 Oleg



 On 08/22/13, Luo Weijun wrote:
 > Hi Oleg,
 > You are right, the problem is due to ID type
 inconsistency.
 > You have to specify gene.idtype when calling pathview
 function, if your gene id type is not Entrez Gene. I don?t
 think b-numbers are recognized for sure. For your gene name
 example, if you mean official gene symbols by ?gene
 name?, you should specify gene.idtype="SYMBOL" (lower case
 is fine):
 > eco2.out <- pathview(gene.data =
 T2.CEBF095.crt115.ASCH.DROP3.rel.gn, pathway.id = "02010",
 gene.idtype="SYMBOL", out.suffix = "T2ACSH", species =
 "eco", kegg.native=TRUE)


 On 08/22/13, Oleg Moskvin? wrote:

 >
 > <entry id="2" name="hsa:51343" type="gene"
 > link="http://www.kegg.jp/dbget-bin/www_bget?hsa:51343">
 > <graphics name="FZR1, CDC20C, CDH1, FZR, FZR2, HCDH,
 HCDH1" fgcolor="#000000" bgcolor="#BFFFBF"
 > type="rectangle" x="919" y="536" width="46"
 height="17"/>
 > </entry>
 >
 >
 > <entry id="4" name="eco:b1513" type="gene"
 > link="http://www.kegg.jp/dbget-bin/www_bget?eco:b1513">
 > <graphics name="lsrA" fgcolor="#000000"
 bgcolor="#BFFFBF"
 > type="rectangle" x="339" y="1882" width="46"
 height="17"/>
 > </entry>



------------------------------

Message: 13
Date: Sat, 24 Aug 2013 02:24:30 +0000
From: "Blanchette, Marco" <MAB at stowers.org>
To: "bioconductor at r-project.org" <bioconductor at r-project.org>
Subject: [BioC] Bug in makeOrgPackageFromNCBI from AnnotationForge?
Message-ID: <E05792C7-364D-4F0C-BE9E-821C302EFFCD at stowers.org>
Content-Type: text/plain; charset="us-ascii"

I am working on a project involving Schizosaccharomyces pombe as a source for genomic analysis and love to use ReportingTools html producing wrappers. However, I am struggling as there is no AnnotationDbi package available for this organism. I decided to finally take the plunge and try to see if I could be one myself using AnnotationForge and was quite exciting to find that I could perhaps melt one simply by using the makeOrgPackageFromNCBI(). Most likely, something went wrong and I suspect a bug somewhere in the pipeline. I have not dug deeper then trying to build the package and use it hoping that someone closer to the code could shed some lights. Here the steps I took:'

> library(AnnotationForge)
> makeOrgPackageFromNCBI(version = "0.1",
                       author = "Marco Blanchette <mab at stowers.org>",
                       maintainer = "Marco Blanchette <mab at stowers.org>",
                       outputDir = ".",
                       tax_id = "4896",
                       genus = "Schizosaccharomyces",
                       species = "pombe")

This step succeeded with only a warning:

Warning message:
In .makeSimpleTable(ug, table = "unigene", con) :
  no values found for table unigene in this data chunk.

I didn't think this was critical enough to raise any red flag, so I then proceeded with the installation that went smoothly

> library(devtools)
> install('org.Spombe.eg.db')
> library('org.Spombe.eg.db')

Then I try to use it with ReportingTools publish() but fail as it returns an error related to Entrez ID which I had a conversion table from biomaRt. I dug a bit deeper and found that none of the genes I was querying were in the database to finally realize that there was only 38 entries int the org.Spombe.eg.db database I had just created and installed... Check this out:

> keytypes(org.Spombe.eg.db)
 [1] "ENTREZID" "ACCNUM"   "ALIAS"    "CHR"      "PMID"     "REFSEQ"
 [7] "SYMBOL"   "UNIGENE"  "GENENAME" "GO"       "EVIDENCE" "ONTOLOGY"

Looking good! However:

> length(keys(org.Spombe.eg.db,'ENTREZID'))
[1] 38

Can someone close enough to the code shed some lights has to whether there is a bug in AnnotationForge or whether it is the NCBI database that is not conforming to what is expected? For instance, biomart has 5117 entrez ID

> library(biomaRt)
> mart <- useMart("fungi_mart_18","spombe_eg_gene")
> ensembl2entrez <- getBM(c('ensembl_gene_id','entrezgene'),mart=mart)
> sum(!is.na(ensembl2entrez$entrezgene))
[1] 5117

The ids I tested on the NCBI website return the correct genes. However, only 10 of the AnnotationForge EntrezID (out of a skirmish 38 ids) are found in biomaRt

> sum(keys(org.Spombe.eg.db,'ENTREZID') %in% ensembl2entrez$entrezgene)
[1] 10

Again, I would appreciate any comments or suggestions as to whether this is a bug or something I did wrong or a miss alignment between the NCBI S. pombe annotation and what is expected by AnnotationForge.

Thanks
--
Marco Blanchette, Ph.D.
Assistant Investigator
Stowers Institute for Medical Research
1000 East 50th St.

Kansas City, MO 64110

Tel: 816-926-4071
Cell: 816-726-8419
Fax: 816-926-2018



------------------------------

Message: 14
Date: Fri, 23 Aug 2013 23:46:53 -0400
From: Matthew McCall <mccallm at gmail.com>
To: Wolfgang Huber <whuber at embl.de>
Cc: j.m.boer at erasmusmc.nl, bioconductor at r-project.org
Subject: Re: [BioC] frma normalization and batch effects
Message-ID:
        <CADDc-iTKV9Eo-yCTTuwa+SB8ENBexZyVv9iGVDeb-TzFC9D93w at mail.gmail.com>
Content-Type: text/plain

Judith,

You're probably fine using the default frma summarization unless your data
are in some way atypical. The random effect summarization just allows the
probe effects in your data to differ slightly from the global (frozen)
probe effects. Also if you're going to have singletons for which you want
to predict in the future, then the default summarization is definitely the
way to go.

Depending on how big your dataset is going to be, you might consider
creating your own custom frma implementation using the frmaTools package.

Finally frma addresses one type of batch effect functioning at the probe
level. It does nothing when the batch effect exists at the probeset level.
So something like fSVA, after preprocessing with frma, would certainly be a
good idea as well.

Best,
Matt
On Aug 23, 2013 7:56 AM, "Wolfgang Huber" <whuber at embl.de> wrote:

> Hi Judith
>
> I am sure the frma people will have more specific recommendations, but in
> addition, both your questions below could be interpreted as questions of
> parameter choice in a (somewhat complex, since it includes the
> preprocessing and batch adjustment) classifier. An often useful way of
> making such choices is by cross-validation on a dataset that mimics the
> kind of data you expect to see in the future.
>
> I guess you might also enjoy Jeff Leek's recent talk:
> http://www.birs.ca/events/2013/5-day-workshops/13w5083/videos/watch/201308151110-Leek.mp4with frozen sva, and top scoring pairs
>
>         Best wishes
>         Wolfgang
>
> On 23 Aug 2013, at 10:55, Judith Boer [guest] <guest at bioconductor.org>
> wrote:
>
> >
> > Dear all,
> >
> > I am working on expression classifiers for leukemic subtypes using
> Affymetrix Plus2 arrays. The training data consists of several batches. The
> developed classifier will be used to predict the subtype of new sets of
> samples as well as single samples. So far, I co-normalized new arrays with
> the training set, but this is not ideal.
> >
> > I have read the frma paper by McCall et al, and it seems the perfect
> solutions. Before I start, I have a few conceptual questions:
> >
> > 1. The training data consists of several batches of different sizes,
> some of them biased  towards a single subtype. Does normalization per batch
> using summarize=?? random_effect??  remove biology in this case? ComBat
> clearly did, and I ended up not correcting for batch effect, which worked
> fine for the classifiers I am using. Any suggestion which summarization
> would be best to use in this case?
> >
> > 2. Is there a minimum of arrays to use with summarize=?? random_effect??
> ?
> >
> > Any suggestions on how to best implement frma in this project are very
> welcome!
> >
> > Cheers, Judith
> >
> >
> > -- output of sessionInfo():
> >
> > R version 2.15.2 (2012-10-26)
> > Platform: i386-w64-mingw32/i386 (32-bit)
> >
> > locale:
> > [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United
> Kingdom.1252
> > [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
> > [5] LC_TIME=English_United Kingdom.1252
> >
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
> >
> >
> > --
> > Sent via the guest posting facility at bioconductor.org.
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

        [[alternative HTML version deleted]]



------------------------------

Message: 15
Date: Fri, 23 Aug 2013 15:58:59 +0100
From: James Floyd <j.a.floyd at qmul.ac.uk>
To: <bioconductor at r-project.org>
Subject: [BioC] DESeq query
Message-ID: <CE3D3743.21A%j.a.floyd at qmul.ac.uk>
Content-Type: text/plain

Hi,

I am very new to RNA-seq and differential expression analyses, but have been
attempting to use DESeq in my analyses (package maintainer: Simon Anders).

I was going through the vignette provided here
<http://bioconductor.org/packages/2.12/bioc/vignettes/DESeq/inst/doc/DESeq.p
df> . Is the idea of performing the getVarianceStabilizedData to then be
able to go ahead and re-analyse the transformed count data with stabilised
variance? I ask because trying to use the value from
getVarianceStabilizedData as input to newCountDataSet doesn't work since the
"counts" are no longer integers.

Thanks in advance,
Jamie



        [[alternative HTML version deleted]]



------------------------------

Message: 16
Date: Fri, 23 Aug 2013 22:23:48 +0200
From: Xi Wang <xi.wang at newcastle.edu.au>
To: "Strickland, Alleene" <AStrickland at med.miami.edu>
Cc: "bioconductor at stat.math.ethz.ch     bioconductor"
        <bioconductor at stat.math.ethz.ch>
Subject: Re: [BioC] SeqGSEA estiGeneNBstat()
Message-ID:
        <CAPm7ooD4HsSkvt8=wtyLS6wckpguF8qcA9bNd8pj+LS66Z5__w at mail.gmail.com>
Content-Type: text/plain

Hi Alleene,

Have you solved your problems in running SeqGSEA? I'm very glad to help
solve out any running issues regarding SeqGSEA.

The NA values generated after running estiExonNBstat() were due to the only
one exon in each genes, where no differential splicing can happen.

Regards
Xi


On Fri, Aug 23, 2013 at 9:16 PM, Strickland, Alleene <
AStrickland at med.miami.edu> wrote:

> Dan,
> I was referring to that post - thank you! I didn't realize it went to the
> wrong place.
>
> -----Original Message-----
> From: dandante at gmail.com [mailto:dandante at gmail.com] On Behalf Of Dan
> Tenenbaum
> Sent: Friday, August 23, 2013 2:28 PM
> To: Strickland, Alleene
> Cc: bioconductor at stat.math.ethz.ch bioconductor; Xi Wang
> Subject: Re: [BioC] SeqGSEA estiGeneNBstat()
>
> CC'ing the maintainer. I think this person is referring to this post:
> https://stat.ethz.ch/pipermail/bioconductor/2013-July/053717.html
>
> Dan
>
>
> On Wed, Aug 21, 2013 at 1:03 PM, Alleene <astrickland at med.miami.edu>
> wrote:
> > I am having the same issue - any help would be appreciated.
> >
> > Thank you!
> > -Alleene
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

        [[alternative HTML version deleted]]



------------------------------

Message: 17
Date: Fri, 23 Aug 2013 11:38:56 +0200
From: Moritz Hess <ssehztirom at googlemail.com>
To: "James W. MacDonald" <jmacdon at uw.edu>,
        "bioconductor at r-project.org" <bioconductor at r-project.org>
Subject: Re: [BioC] Proper set up to test interaction of continuous
        covariate and factor levels (limma)
Message-ID:
        <CAMvLs2xLO3c4dSZ0GW91HQG=7qkT900vgODCUgJhj3LaG56q0Q at mail.gmail.com>
Content-Type: text/plain

Hi James,

thanks for your reply! My intention to use the above mentioned (although
improper as you mentioned) method was to be able to form clear tests. I
assumed that when I am testing for genes responding to the covariate e.g.
significant "cov" I am testing against the Baseline  in the reference group
(the group that has been assigned as reference by limma or lm). But I want
to test against the baseline in all groups. Basically I wanted to adapt the
procedure that has been proposed for factoral designs in the limma
userguide (Section 8.7).
I now realize that testing for significant coefficients for cov and
interaction of group only seems to be a good choice for me. However I was a
little puzzled by the fact that the effect size for cov would be only the
effect size / slope within GroupA, which I somehow confused with a
signifcant nonzero slope of the covariate exclusively in GroupA.

Best,

Moritz



2013/8/21 James W. MacDonald <jmacdon at uw.edu>

> Hi Moritz,
>
>
> On 8/20/2013 2:30 PM, Moritz Hess wrote:
>
>> Hi,
>> I am investigating the global gene expression response to a continuous
>> environmental covariate in two groups of individuals using limma.
>> I am interested in genes whose  expression is:
>> A) correlated with the covariate
>> B) differentially correlated with the covariate in the two groups of
>> individuals (Interaction of Group and Covariate)
>>
>> In order to be able to set up the proper contrasts I split the Covariate
>> in
>> two Vectors where one Vector contains only the samples with the lowest
>> level of the covariate
>> e.g.  CovBase = c(0,0,0,3,0,3,0,0)
>> and where the other vector contains all the samples with higher levels of
>> the covariate
>> e.g. Cov = c(34,2,5,0,5,0,2,34)
>> and set up a design matrix without intercept:
>>
>> ~ Group + CovBase + Cov
>>
>>
>> The contrasts I am testing are specified as follows:
>> Effect of Covariate: Cov-CovBase
>> Interaction of Group and Covariate: (GroupA-GroupB) - (Cov-CovBase)
>>
>> Does this procedure makes sense ?
>>
>
> It doesn't make sense to me. In the first place the formula you use will
> create an intercept. Secondly, if you want the covariate to be continuous,
> then why are you splitting it like that? And why do you think values of 3
> are lower than values of 2?
>
> If I were trying to do what you say you are doing, then I would just do
>
> cov <- c(34,2,5,3,5,3,2,34)
> group <- factor(whateveryourgroupsare)
>
> design <- model.matrix(~cov*group)
>
> Best,
>
> Jim
>
>
>
>> Thank you very much in advance
>>
>> Moritz
>>
>>
>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives: http://news.gmane.org/gmane.**
>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
>


--

*Moritz He?
PhD Candidate
*
*Research associate
Forest Research Institute
of Baden W?rttemberg (FVA)
Wonnhalde 4
79100 Freiburg (Germany)

phone +49 761 4018 301*

        [[alternative HTML version deleted]]



------------------------------

Message: 18
Date: Sat, 24 Aug 2013 08:37:31 +0200
From: Michael Love <michaelisaiahlove at gmail.com>
To: James Floyd <j.a.floyd at qmul.ac.uk>
Cc: bioconductor at r-project.org
Subject: Re: [BioC] DESeq query
Message-ID:
        <CADqzidUGQTocF7A=Hsvx4viA=AZpiR8DFgJFoXGa=1u2DGMCxw at mail.gmail.com>
Content-Type: text/plain

Hi Jamie,

The point of providing the variance stabilization is for applications other
than DE testing, for example clustering of samples or other machine
learning applications.

For DE testing we suggest to use the count-based methods.

Mike
On Aug 24, 2013 8:07 AM, "James Floyd" <j.a.floyd at qmul.ac.uk> wrote:

> Hi,
>
> I am very new to RNA-seq and differential expression analyses, but have
> been
> attempting to use DESeq in my analyses (package maintainer: Simon Anders).
>
> I was going through the vignette provided here
> <
> http://bioconductor.org/packages/2.12/bioc/vignettes/DESeq/inst/doc/DESeq.p
> df> . Is the idea of performing the getVarianceStabilizedData to then be
> able to go ahead and re-analyse the transformed count data with stabilised
> variance? I ask because trying to use the value from
> getVarianceStabilizedData as input to newCountDataSet doesn't work since
> the
> "counts" are no longer integers.
>
> Thanks in advance,
> Jamie
>
>
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

        [[alternative HTML version deleted]]



------------------------------

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor


End of Bioconductor Digest, Vol 126, Issue 23
*********************************************

________________________________

Helmholtz-Zentrum für Infektionsforschung GmbH | Inhoffenstraße 7 | 38124 Braunschweig | www.helmholtz-hzi.de
Das HZI ist seit 2007 zertifiziertes Mitglied im "audit berufundfamilie"

Vorsitzende des Aufsichtsrates: MinDir’in Bärbel Brumme-Bothe, Bundesministerium für Bildung und Forschung
Stellvertreter: Rüdiger Eichel, Abteilungsleiter Niedersächsisches Ministerium für Wissenschaft und Kultur
Geschäftsführung: Prof. Dr. Dirk Heinz; Ulf Richter, MBA
Gesellschaft mit beschränkter Haftung (GmbH)
Sitz der Gesellschaft: Braunschweig
Handelsregister: Amtsgericht Braunschweig, HRB 477


More information about the Bioconductor mailing list