[Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

Tue Sep 9 13:02:25 CEST 2014

I'm in favor of this display. The seqinfo output at the bottom has always
been annoying (over-emphasized).

On Mon, Sep 8, 2014 at 10:08 PM, Vincent Carey <stvjc at channing.harvard.edu>
wrote:

>
>
> On Tue, Sep 9, 2014 at 12:30 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>> On 09/08/2014 06:42 PM, Michael Lawrence wrote:
>>
>>> Instead of printing out multiple lines of a table that is rarely of
>>> interest, could we develop Peter's idea toward something like:
>>>
>>> hg19:chr1  hg19:chr2 ...
>>> [lengths ...]
>>>
>>> Not sure what condensed notation would be useful for circularity.
>>>
>>
>> I don't know either. I'm worried that this would make the seqinfo
>> stuff look like a named vector and that the user would expect
>> hg19:chr1, hg19:chr2, etc... to be valid names.
>>
>> With the table-like layout, some screen real estate can always be
>> saved by printing less lines:
>>
>>
> What I had in mind was
>
>
>>   > gr
>>   GRanges with 3 ranges and 0 metadata columns:
>>
>       genome: hg19
>
>>         seqnames               ranges strand
>>            <Rle>            <IRanges>  <Rle>
>>     [1]    chr14 [19069583, 19069654]      +
>>     [2]    chr14 [19363738, 19363809]      +
>>     [3]    chr14 [19363755, 19363826]      -
>>     [4]    chr14 [19369799, 19369870]      +
>>
>
>
> you could then probably dispense with the seqlengths.  i have
> never found them too useful except as a key to the  genome.
>
> if there are multiple genomes, we have something like
>
> genomes: hg19, mm9
>
> the point is to make it prominent, particularly at a time of transition.
>
>
>
>> --- seqinfo: 60 seqlevels (2 circulars) on 2 genomes (hg19, mm10) ---
>>     seqlevels                                seqlengths isCircular genome
>>     chr1                                      249250621       <NA>   hg19
>>     chr10                                     135534747       <NA>   hg19
>>     ...                                             ...        ...    ...
>>     chrX                                      155270560       <NA>   hg19
>>     chrY                                       59373566       <NA>   hg19
>>
>> I agree that the exact content of the seqinfo table itself is rarely
>> of interest so printing only 3 or 4 lines is OK. IMO it's important
>> to make the user aware of the existence of this hidden table and to
>> display it like what it really is (i.e. a table). Also displaying the
>> column names is a well established tradition and serves the purpose
>> of providing a quick summary of the accessors that are available to
>> access those fields.
>>
>> H.
>>
>>
>>>
>>> On Mon, Sep 8, 2014 at 5:21 PM, Peter Hickey <hickey at wehi.edu.au
>>> <mailto:hickey at wehi.edu.au>> wrote:
>>>
>>>     Perhaps it might be useful to have some way of highlighting if any
>>>     of the chromosomes are circular or highlighting if there are
>>>     multiple genomes present? Otherwise this information might be hidden
>>>     in the "…"
>>>
>>>     Cheers,
>>>     Pete
>>>
>>>
>>>     On 09/09/2014, at 9:44 AM, Hervé Pagès <hpages at fhcrc.org
>>>     <mailto:hpages at fhcrc.org>> wrote:
>>>
>>>      > On 09/08/2014 02:28 PM, Peter Hickey wrote:
>>>      >> Just a vote for still allowing for multiple genomes in a Seqinfo
>>>     object (in a GRanges object). My use case is in bisulfite-sequencing
>>>     experiments where there is often a spike-in of a lambda phage genome
>>>     along with the genome of interest (human or mouse). It's often
>>>     useful to keep all data from a single library together in the same
>>>     objet but process according to genome(x) for each seqlevel.
>>>      >
>>>      > Note taken. Thanks Pete! It's always great to know about concrete
>>> use
>>>      > cases.
>>>      >
>>>      >>
>>>      >> FWIW, I like Vincent's proposal of selectSome(unique(genome(x)))
>>>     in the show method.
>>>      >
>>>      > Or what about displaying the genome next to the seqlevel it's
>>>      > associated with? Like e.g.:
>>>      >
>>>      >  > gr
>>>      >  GRanges with 3 ranges and 0 metadata columns:
>>>      >        seqnames               ranges strand
>>>      >           <Rle>            <IRanges>  <Rle>
>>>      >    [1]    chr14 [19069583, 19069654]      +
>>>      >    [2]    chr14 [19363738, 19363809]      +
>>>      >    [3]    chr14 [19363755, 19363826]      -
>>>      >    [4]    chr14 [19369799, 19369870]      +
>>>      >    ---
>>>      >    seqinfo:
>>>      >      seqlevels             seqlengths isCircular genome
>>>      >      chr1                   249250621       <NA>   hg19
>>>      >      chr10                  135534747       <NA>   hg19
>>>      >      chr11                  135006516       <NA>   hg19
>>>      >      ...                          ...        ...    ...
>>>      >      chrUn_gl000249             38502       <NA>   hg19
>>>      >      chrX                   155270560       <NA>   hg19
>>>      >      chrY                    59373566       <NA>   hg19
>>>      >
>>>      > That way, we also raise awareness about the isCircular field.
>>>      > The current choice to only display the seqlengths pre-dates the
>>>      > existence of the seqinfo slot but might be a little bit misleading
>>>      > those days since it only exposes some arbitrary seqinfo fields.
>>>      >
>>>      > H.
>>>      >
>>>      >>
>>>      >> Cheers,
>>>      >> Pete
>>>      >>
>>>      >>
>>>      >>> I might have requested the genome annotation, but I'm pretty
>>>     sure it wasn't
>>>      >>> me who decide on tracking it on a per-sequence basis. I could
>>>     imagine use
>>>      >>> cases for that though, e.g., when diagnosing sequencing
>>>     contamination (like
>>>      >>> human vs. mouse). But most other tools and file formats expect
>>>     a single
>>>      >>> genome per "track", so, for example, rtracklayer has an
>>>     internal function
>>>      >>> singleGenome() to take care of this.
>>>      >>>
>>>      >>> On Mon, Sep 8, 2014 at 10:50 AM, Herv? Pag?s <hpages at fhcrc.org
>>>     <mailto:hpages at fhcrc.org>> wrote:
>>>      >>>
>>>      >>>> Hi Vince,
>>>      >>>>
>>>      >>>> Yes it would make sense to have the "show" method report the
>>>     genome
>>>      >>>> when genome(x) contains a unique non-NA value. I think the main
>>>      >>>> use case for having the genome defined at the sequence level
>>>     instead
>>>      >>>> of the whole object level is metagenomics. Maybe Michael has
>>>     some other
>>>      >>>> good use cases to share since IIRC he requested the addition
>>>     of the
>>>      >>>> genome field a couple of years ago and made the case for
>>> having it
>>>      >>>> defined at the sequence level.
>>>      >>>>
>>>      >>>> Cheers,
>>>      >>>> H.
>>>      >>>>
>>>      >>>>
>>>      >>>> On 09/08/2014 07:21 AM, Vincent Carey wrote:
>>>      >>>>
>>>      >>>>> For GRanges x, my naive expectation is that genome(x) returns
>>>     a length-
>>>      >>>>>
>>>      >>>>> one tag identifying the genome to which chromosomal
>>> coordinates
>>>      >>>>>
>>>      >>>>> correspond.  The genome() method seems to have
>>> sequence-specific
>>>      >>>>>
>>>      >>>>> semantics, which makes sense, but when we identify sequence
>>>      >>>>>
>>>      >>>>> with chromosome, it seems too complicated.  Is there a use
>>>     case for
>>>      >>>>>
>>>      >>>>> a GRanges with sequences from several different genomes?
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> One reason I am inquiring is that I feel it would be nice to
>>>     have the
>>>      >>>>> GRanges show() method report, prominently, the genome in use
>>>     (or NA
>>>      >>>>>
>>>      >>>>> if unspecified).  This could be accomplished by reporting
>>>      >>>>> unique(genome(x)), and perhaps that would be satisfactory.
>>>      >>>>>
>>>      >>>>> after example(genome) :
>>>      >>>>>
>>>      >>>>> seqinfo(txdb)
>>>      >>>>>>
>>>      >>>>>
>>>      >>>>> Seqinfo of length 15
>>>      >>>>>
>>>      >>>>> seqnames seqlengths isCircular genome
>>>      >>>>>
>>>      >>>>> CH2L       23011544      FALSE    dm3
>>>      >>>>>
>>>      >>>>> CH2R       21146708      FALSE    dm3
>>>      >>>>>
>>>      >>>>> CH3L       24543557      FALSE    dm3
>>>      >>>>>
>>>      >>>>> CH3R       27905053      FALSE    dm3
>>>      >>>>>
>>>      >>>>> CH4         1351857      FALSE    dm3
>>>      >>>>>
>>>      >>>>> ...             ...        ...    ...
>>>      >>>>>
>>>      >>>>> CH3LHet     2555491      FALSE    dm3
>>>      >>>>>
>>>      >>>>> CH3RHet     2517507      FALSE    dm3
>>>      >>>>>
>>>      >>>>> CHXHet       204112      FALSE    dm3
>>>      >>>>>
>>>      >>>>> CHYHet       347038      FALSE    dm3
>>>      >>>>>
>>>      >>>>> CHUextra   29004656      FALSE    dm3
>>>      >>>>>
>>>      >>>>> genome(seqinfo(txdb))
>>>      >>>>>>
>>>      >>>>>
>>>      >>>>>     CH2L     CH2R     CH3L     CH3R      CH4      CHX
>>>     CHU        M
>>>      >>>>>
>>>      >>>>>    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"
>>>     "dm3"    "dm3"
>>>      >>>>>
>>>      >>>>>  CH2LHet  CH2RHet  CH3LHet  CH3RHet   CHXHet   CHYHet CHUextra
>>>      >>>>>
>>>      >>>>>    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"
>>>      >>>>>
>>>      >>>>>        [[alternative HTML version deleted]]
>>>      >>>>>
>>>      >>>>> _______________________________________________
>>>      >>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>     mailing list
>>>      >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>      >>>>>
>>>      >>>>>
>>>      >>>> --
>>>      >>>> Herv? Pag?s
>>>      >>>>
>>>      >>>> Program in Computational Biology
>>>      >>>> Division of Public Health Sciences
>>>      >>>> Fred Hutchinson Cancer Research Center
>>>      >>>> 1100 Fairview Ave. N, M1-B514
>>>      >>>> P.O. Box 19024
>>>      >>>> Seattle, WA 98109-1024
>>>      >>>>
>>>      >>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>>      >>>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>>      >>>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>>      >>>>
>>>      >>>>
>>>      >>>> _______________________________________________
>>>      >>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>     mailing list
>>>      >>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>      >>>>
>>>      >>
>>>      >> --------------------------------
>>>      >> Peter Hickey,
>>>      >> PhD Student/Research Assistant,
>>>      >> Bioinformatics Division,
>>>      >> Walter and Eliza Hall Institute of Medical Research,
>>>      >> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>>      >> Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>>>      >>
>>>      >> hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>>>      >> http://www.wehi.edu.au
>>>      >>
>>>      >>
>>>     ____________________________________________________________
>>> __________
>>>      >> The information in this email is confidential and
>>>     intend...{{dropped:6}}
>>>     >>
>>>     >> _______________________________________________
>>>     >>Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>> mailing list
>>>     >>https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>     >>
>>>     >
>>>     > --
>>>      > Hervé Pagès
>>>     >
>>>     > Program in Computational Biology
>>>     > Division of Public Health Sciences
>>>     > Fred Hutchinson Cancer Research Center
>>>     > 1100 Fairview Ave. N, M1-B514
>>>     > P.O. Box 19024
>>>     > Seattle, WA 98109-1024
>>>     >
>>>     > E-mail:hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>>     > Phone:(206) 667-5791 <tel:%28206%29%20667-5791>
>>>     > Fax:(206) 667-1319 <tel:%28206%29%20667-1319>
>>>
>>>     --------------------------------
>>>     Peter Hickey,
>>>     PhD Student/Research Assistant,
>>>     Bioinformatics Division,
>>>     Walter and Eliza Hall Institute of Medical Research,
>>>     1G Royal Parade, Parkville, Vic 3052, Australia.
>>>     Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>>>
>>>     hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>>>     http://www.wehi.edu.au
>>>
>>>
>>>     ____________________________________________________________
>>> __________
>>>     The information in this email is confidential and
>>> intend...{{dropped:8}}
>>>
>>>
>>>     _______________________________________________
>>>     Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
>>> list
>>>     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>

	[[alternative HTML version deleted]]