[Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

Tue Sep 9 13:42:34 CEST 2014

On 09/09/2014 04:02 AM, Michael Lawrence wrote:
> I'm in favor of this display. The seqinfo output at the bottom has always
> been annoying (over-emphasized).

the fact that the lengths are 'NA' can be a helpful prompt to do something about 
it, e.g., add seqinfo when inputting the data. Also they are helpful when one is 
told that seqlengths are incompatible during, e.g., findOverlaps. But I like the 
idea of less but more informative display of seqinfo, along the lines suggested 
by Vince.

seqinfo: 60 seqlevels (2 circular) on 2 genomes (hg19, mm10); 60 'NA' seqlengths

Martin

>
> On Mon, Sep 8, 2014 at 10:08 PM, Vincent Carey <stvjc at channing.harvard.edu>
> wrote:
>
>>
>>
>> On Tue, Sep 9, 2014 at 12:30 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>
>>> On 09/08/2014 06:42 PM, Michael Lawrence wrote:
>>>
>>>> Instead of printing out multiple lines of a table that is rarely of
>>>> interest, could we develop Peter's idea toward something like:
>>>>
>>>> hg19:chr1  hg19:chr2 ...
>>>> [lengths ...]
>>>>
>>>> Not sure what condensed notation would be useful for circularity.
>>>>
>>>
>>> I don't know either. I'm worried that this would make the seqinfo
>>> stuff look like a named vector and that the user would expect
>>> hg19:chr1, hg19:chr2, etc... to be valid names.
>>>
>>> With the table-like layout, some screen real estate can always be
>>> saved by printing less lines:
>>>
>>>
>> What I had in mind was
>>
>>
>>>    > gr
>>>    GRanges with 3 ranges and 0 metadata columns:
>>>
>>        genome: hg19
>>
>>>          seqnames               ranges strand
>>>             <Rle>            <IRanges>  <Rle>
>>>      [1]    chr14 [19069583, 19069654]      +
>>>      [2]    chr14 [19363738, 19363809]      +
>>>      [3]    chr14 [19363755, 19363826]      -
>>>      [4]    chr14 [19369799, 19369870]      +
>>>
>>
>>
>> you could then probably dispense with the seqlengths.  i have
>> never found them too useful except as a key to the  genome.
>>
>> if there are multiple genomes, we have something like
>>
>> genomes: hg19, mm9
>>
>> the point is to make it prominent, particularly at a time of transition.
>>
>>
>>
>>> --- seqinfo: 60 seqlevels (2 circulars) on 2 genomes (hg19, mm10) ---
>>>      seqlevels                                seqlengths isCircular genome
>>>      chr1                                      249250621       <NA>   hg19
>>>      chr10                                     135534747       <NA>   hg19
>>>      ...                                             ...        ...    ...
>>>      chrX                                      155270560       <NA>   hg19
>>>      chrY                                       59373566       <NA>   hg19
>>>
>>> I agree that the exact content of the seqinfo table itself is rarely
>>> of interest so printing only 3 or 4 lines is OK. IMO it's important
>>> to make the user aware of the existence of this hidden table and to
>>> display it like what it really is (i.e. a table). Also displaying the
>>> column names is a well established tradition and serves the purpose
>>> of providing a quick summary of the accessors that are available to
>>> access those fields.
>>>
>>> H.
>>>
>>>
>>>>
>>>> On Mon, Sep 8, 2014 at 5:21 PM, Peter Hickey <hickey at wehi.edu.au
>>>> <mailto:hickey at wehi.edu.au>> wrote:
>>>>
>>>>      Perhaps it might be useful to have some way of highlighting if any
>>>>      of the chromosomes are circular or highlighting if there are
>>>>      multiple genomes present? Otherwise this information might be hidden
>>>>      in the "…"
>>>>
>>>>      Cheers,
>>>>      Pete
>>>>
>>>>
>>>>      On 09/09/2014, at 9:44 AM, Hervé Pagès <hpages at fhcrc.org
>>>>      <mailto:hpages at fhcrc.org>> wrote:
>>>>
>>>>       > On 09/08/2014 02:28 PM, Peter Hickey wrote:
>>>>       >> Just a vote for still allowing for multiple genomes in a Seqinfo
>>>>      object (in a GRanges object). My use case is in bisulfite-sequencing
>>>>      experiments where there is often a spike-in of a lambda phage genome
>>>>      along with the genome of interest (human or mouse). It's often
>>>>      useful to keep all data from a single library together in the same
>>>>      objet but process according to genome(x) for each seqlevel.
>>>>       >
>>>>       > Note taken. Thanks Pete! It's always great to know about concrete
>>>> use
>>>>       > cases.
>>>>       >
>>>>       >>
>>>>       >> FWIW, I like Vincent's proposal of selectSome(unique(genome(x)))
>>>>      in the show method.
>>>>       >
>>>>       > Or what about displaying the genome next to the seqlevel it's
>>>>       > associated with? Like e.g.:
>>>>       >
>>>>       >  > gr
>>>>       >  GRanges with 3 ranges and 0 metadata columns:
>>>>       >        seqnames               ranges strand
>>>>       >           <Rle>            <IRanges>  <Rle>
>>>>       >    [1]    chr14 [19069583, 19069654]      +
>>>>       >    [2]    chr14 [19363738, 19363809]      +
>>>>       >    [3]    chr14 [19363755, 19363826]      -
>>>>       >    [4]    chr14 [19369799, 19369870]      +
>>>>       >    ---
>>>>       >    seqinfo:
>>>>       >      seqlevels             seqlengths isCircular genome
>>>>       >      chr1                   249250621       <NA>   hg19
>>>>       >      chr10                  135534747       <NA>   hg19
>>>>       >      chr11                  135006516       <NA>   hg19
>>>>       >      ...                          ...        ...    ...
>>>>       >      chrUn_gl000249             38502       <NA>   hg19
>>>>       >      chrX                   155270560       <NA>   hg19
>>>>       >      chrY                    59373566       <NA>   hg19
>>>>       >
>>>>       > That way, we also raise awareness about the isCircular field.
>>>>       > The current choice to only display the seqlengths pre-dates the
>>>>       > existence of the seqinfo slot but might be a little bit misleading
>>>>       > those days since it only exposes some arbitrary seqinfo fields.
>>>>       >
>>>>       > H.
>>>>       >
>>>>       >>
>>>>       >> Cheers,
>>>>       >> Pete
>>>>       >>
>>>>       >>
>>>>       >>> I might have requested the genome annotation, but I'm pretty
>>>>      sure it wasn't
>>>>       >>> me who decide on tracking it on a per-sequence basis. I could
>>>>      imagine use
>>>>       >>> cases for that though, e.g., when diagnosing sequencing
>>>>      contamination (like
>>>>       >>> human vs. mouse). But most other tools and file formats expect
>>>>      a single
>>>>       >>> genome per "track", so, for example, rtracklayer has an
>>>>      internal function
>>>>       >>> singleGenome() to take care of this.
>>>>       >>>
>>>>       >>> On Mon, Sep 8, 2014 at 10:50 AM, Herv? Pag?s <hpages at fhcrc.org
>>>>      <mailto:hpages at fhcrc.org>> wrote:
>>>>       >>>
>>>>       >>>> Hi Vince,
>>>>       >>>>
>>>>       >>>> Yes it would make sense to have the "show" method report the
>>>>      genome
>>>>       >>>> when genome(x) contains a unique non-NA value. I think the main
>>>>       >>>> use case for having the genome defined at the sequence level
>>>>      instead
>>>>       >>>> of the whole object level is metagenomics. Maybe Michael has
>>>>      some other
>>>>       >>>> good use cases to share since IIRC he requested the addition
>>>>      of the
>>>>       >>>> genome field a couple of years ago and made the case for
>>>> having it
>>>>       >>>> defined at the sequence level.
>>>>       >>>>
>>>>       >>>> Cheers,
>>>>       >>>> H.
>>>>       >>>>
>>>>       >>>>
>>>>       >>>> On 09/08/2014 07:21 AM, Vincent Carey wrote:
>>>>       >>>>
>>>>       >>>>> For GRanges x, my naive expectation is that genome(x) returns
>>>>      a length-
>>>>       >>>>>
>>>>       >>>>> one tag identifying the genome to which chromosomal
>>>> coordinates
>>>>       >>>>>
>>>>       >>>>> correspond.  The genome() method seems to have
>>>> sequence-specific
>>>>       >>>>>
>>>>       >>>>> semantics, which makes sense, but when we identify sequence
>>>>       >>>>>
>>>>       >>>>> with chromosome, it seems too complicated.  Is there a use
>>>>      case for
>>>>       >>>>>
>>>>       >>>>> a GRanges with sequences from several different genomes?
>>>>       >>>>>
>>>>       >>>>>
>>>>       >>>>> One reason I am inquiring is that I feel it would be nice to
>>>>      have the
>>>>       >>>>> GRanges show() method report, prominently, the genome in use
>>>>      (or NA
>>>>       >>>>>
>>>>       >>>>> if unspecified).  This could be accomplished by reporting
>>>>       >>>>> unique(genome(x)), and perhaps that would be satisfactory.
>>>>       >>>>>
>>>>       >>>>> after example(genome) :
>>>>       >>>>>
>>>>       >>>>> seqinfo(txdb)
>>>>       >>>>>>
>>>>       >>>>>
>>>>       >>>>> Seqinfo of length 15
>>>>       >>>>>
>>>>       >>>>> seqnames seqlengths isCircular genome
>>>>       >>>>>
>>>>       >>>>> CH2L       23011544      FALSE    dm3
>>>>       >>>>>
>>>>       >>>>> CH2R       21146708      FALSE    dm3
>>>>       >>>>>
>>>>       >>>>> CH3L       24543557      FALSE    dm3
>>>>       >>>>>
>>>>       >>>>> CH3R       27905053      FALSE    dm3
>>>>       >>>>>
>>>>       >>>>> CH4         1351857      FALSE    dm3
>>>>       >>>>>
>>>>       >>>>> ...             ...        ...    ...
>>>>       >>>>>
>>>>       >>>>> CH3LHet     2555491      FALSE    dm3
>>>>       >>>>>
>>>>       >>>>> CH3RHet     2517507      FALSE    dm3
>>>>       >>>>>
>>>>       >>>>> CHXHet       204112      FALSE    dm3
>>>>       >>>>>
>>>>       >>>>> CHYHet       347038      FALSE    dm3
>>>>       >>>>>
>>>>       >>>>> CHUextra   29004656      FALSE    dm3
>>>>       >>>>>
>>>>       >>>>> genome(seqinfo(txdb))
>>>>       >>>>>>
>>>>       >>>>>
>>>>       >>>>>     CH2L     CH2R     CH3L     CH3R      CH4      CHX
>>>>      CHU        M
>>>>       >>>>>
>>>>       >>>>>    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"
>>>>      "dm3"    "dm3"
>>>>       >>>>>
>>>>       >>>>>  CH2LHet  CH2RHet  CH3LHet  CH3RHet   CHXHet   CHYHet CHUextra
>>>>       >>>>>
>>>>       >>>>>    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"
>>>>       >>>>>
>>>>       >>>>>        [[alternative HTML version deleted]]
>>>>       >>>>>
>>>>       >>>>> _______________________________________________
>>>>       >>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>>      mailing list
>>>>       >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>       >>>>>
>>>>       >>>>>
>>>>       >>>> --
>>>>       >>>> Herv? Pag?s
>>>>       >>>>
>>>>       >>>> Program in Computational Biology
>>>>       >>>> Division of Public Health Sciences
>>>>       >>>> Fred Hutchinson Cancer Research Center
>>>>       >>>> 1100 Fairview Ave. N, M1-B514
>>>>       >>>> P.O. Box 19024
>>>>       >>>> Seattle, WA 98109-1024
>>>>       >>>>
>>>>       >>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>>>       >>>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>>>       >>>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>>>       >>>>
>>>>       >>>>
>>>>       >>>> _______________________________________________
>>>>       >>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>>      mailing list
>>>>       >>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>       >>>>
>>>>       >>
>>>>       >> --------------------------------
>>>>       >> Peter Hickey,
>>>>       >> PhD Student/Research Assistant,
>>>>       >> Bioinformatics Division,
>>>>       >> Walter and Eliza Hall Institute of Medical Research,
>>>>       >> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>>>       >> Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>>>>       >>
>>>>       >> hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>>>>       >> http://www.wehi.edu.au
>>>>       >>
>>>>       >>
>>>>      ____________________________________________________________
>>>> __________
>>>>       >> The information in this email is confidential and
>>>>      intend...{{dropped:6}}
>>>>      >>
>>>>      >> _______________________________________________
>>>>      >>Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>> mailing list
>>>>      >>https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>      >>
>>>>      >
>>>>      > --
>>>>       > Hervé Pagès
>>>>      >
>>>>      > Program in Computational Biology
>>>>      > Division of Public Health Sciences
>>>>      > Fred Hutchinson Cancer Research Center
>>>>      > 1100 Fairview Ave. N, M1-B514
>>>>      > P.O. Box 19024
>>>>      > Seattle, WA 98109-1024
>>>>      >
>>>>      > E-mail:hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>>>      > Phone:(206) 667-5791 <tel:%28206%29%20667-5791>
>>>>      > Fax:(206) 667-1319 <tel:%28206%29%20667-1319>
>>>>
>>>>      --------------------------------
>>>>      Peter Hickey,
>>>>      PhD Student/Research Assistant,
>>>>      Bioinformatics Division,
>>>>      Walter and Eliza Hall Institute of Medical Research,
>>>>      1G Royal Parade, Parkville, Vic 3052, Australia.
>>>>      Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>>>>
>>>>      hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>>>>      http://www.wehi.edu.au
>>>>
>>>>
>>>>      ____________________________________________________________
>>>> __________
>>>>      The information in this email is confidential and
>>>> intend...{{dropped:8}}
>>>>
>>>>
>>>>      _______________________________________________
>>>>      Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
>>>> list
>>>>      https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793