[Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

Michael Lawrence lawrence.michael at gene.com
Tue Sep 9 15:38:07 CEST 2014


Agreed, that looks a lot nicer.

On Tue, Sep 9, 2014 at 4:42 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:

> On 09/09/2014 04:02 AM, Michael Lawrence wrote:
>
>> I'm in favor of this display. The seqinfo output at the bottom has always
>> been annoying (over-emphasized).
>>
>
> the fact that the lengths are 'NA' can be a helpful prompt to do something
> about it, e.g., add seqinfo when inputting the data. Also they are helpful
> when one is told that seqlengths are incompatible during, e.g.,
> findOverlaps. But I like the idea of less but more informative display of
> seqinfo, along the lines suggested by Vince.
>
> seqinfo: 60 seqlevels (2 circular) on 2 genomes (hg19, mm10); 60 'NA' seqlengths
>
> Martin
>
>
>> On Mon, Sep 8, 2014 at 10:08 PM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
>>
>>
>>>
>>> On Tue, Sep 9, 2014 at 12:30 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>
>>>> On 09/08/2014 06:42 PM, Michael Lawrence wrote:
>>>>
>>>>> Instead of printing out multiple lines of a table that is rarely of
>>>>> interest, could we develop Peter's idea toward something like:
>>>>>
>>>>> hg19:chr1  hg19:chr2 ...
>>>>> [lengths ...]
>>>>>
>>>>> Not sure what condensed notation would be useful for circularity.
>>>>>
>>>>>
>>>> I don't know either. I'm worried that this would make the seqinfo
>>>> stuff look like a named vector and that the user would expect
>>>> hg19:chr1, hg19:chr2, etc... to be valid names.
>>>>
>>>> With the table-like layout, some screen real estate can always be
>>>> saved by printing fewer lines:
>>>>
>>>>
>>> What I had in mind was
>>>
>>>     > gr
>>>     GRanges with 3 ranges and 0 metadata columns:
>>>
>>>         genome: hg19
>>>
>>>          seqnames               ranges strand
>>>             <Rle>            <IRanges>  <Rle>
>>>      [1]    chr14 [19069583, 19069654]      +
>>>      [2]    chr14 [19363738, 19363809]      +
>>>      [3]    chr14 [19363755, 19363826]      -
>>>      [4]    chr14 [19369799, 19369870]      +
>>>
>>> you could then probably dispense with the seqlengths.  i have
>>> never found them too useful except as a key to the  genome.
>>>
>>> if there are multiple genomes, we have something like
>>>
>>> genomes: hg19, mm9
>>>
>>> the point is to make it prominent, particularly at a time of transition.
>>>
>>>
>>>
>>>>  --- seqinfo: 60 seqlevels (2 circulars) on 2 genomes (hg19, mm10) ---
>>>>      seqlevels                                seqlengths isCircular genome
>>>>      chr1                                      249250621       <NA>   hg19
>>>>      chr10                                     135534747       <NA>   hg19
>>>>      ...                                             ...        ...    ...
>>>>      chrX                                      155270560       <NA>   hg19
>>>>      chrY                                       59373566       <NA>   hg19
>>>>
>>>> I agree that the exact content of the seqinfo table itself is rarely
>>>> of interest so printing only 3 or 4 lines is OK. IMO it's important
>>>> to make the user aware of the existence of this hidden table and to
>>>> display it like what it really is (i.e. a table). Also displaying the
>>>> column names is a well established tradition and serves the purpose
>>>> of providing a quick summary of the accessors that are available to
>>>> access those fields.
>>>>
>>>> H.
>>>>
>>>>
>>>>
>>>>> On Mon, Sep 8, 2014 at 5:21 PM, Peter Hickey <hickey at wehi.edu.au> wrote:
>>>>>
>>>>>      Perhaps it might be useful to have some way of highlighting if any
>>>>>      of the chromosomes are circular or highlighting if there are
>>>>>      multiple genomes present? Otherwise this information might be hidden
>>>>>      in the "…"
>>>>>
>>>>>      Cheers,
>>>>>      Pete
>>>>>
>>>>>
>>>>>      On 09/09/2014, at 9:44 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>>
>>>>>       > On 09/08/2014 02:28 PM, Peter Hickey wrote:
>>>>>       >> Just a vote for still allowing for multiple genomes in a Seqinfo
>>>>>       >> object (in a GRanges object). My use case is in bisulfite-sequencing
>>>>>       >> experiments where there is often a spike-in of a lambda phage genome
>>>>>       >> along with the genome of interest (human or mouse). It's often
>>>>>       >> useful to keep all data from a single library together in the same
>>>>>       >> object but process according to genome(x) for each seqlevel.
>>>>>       >
>>>>>       > Note taken. Thanks Pete! It's always great to know about concrete use
>>>>>       > cases.
>>>>>       >
>>>>>       >>
>>>>>       >> FWIW, I like Vincent's proposal of selectSome(unique(genome(x)))
>>>>>       >> in the show method.
>>>>>       >
>>>>>       > Or what about displaying the genome next to the seqlevel it's
>>>>>       > associated with? Like e.g.:
>>>>>       >
>>>>>       >  > gr
>>>>>       >  GRanges with 3 ranges and 0 metadata columns:
>>>>>       >        seqnames               ranges strand
>>>>>       >           <Rle>            <IRanges>  <Rle>
>>>>>       >    [1]    chr14 [19069583, 19069654]      +
>>>>>       >    [2]    chr14 [19363738, 19363809]      +
>>>>>       >    [3]    chr14 [19363755, 19363826]      -
>>>>>       >    [4]    chr14 [19369799, 19369870]      +
>>>>>       >    ---
>>>>>       >    seqinfo:
>>>>>       >      seqlevels             seqlengths isCircular genome
>>>>>       >      chr1                   249250621       <NA>   hg19
>>>>>       >      chr10                  135534747       <NA>   hg19
>>>>>       >      chr11                  135006516       <NA>   hg19
>>>>>       >      ...                          ...        ...    ...
>>>>>       >      chrUn_gl000249             38502       <NA>   hg19
>>>>>       >      chrX                   155270560       <NA>   hg19
>>>>>       >      chrY                    59373566       <NA>   hg19
>>>>>       >
>>>>>       > That way, we also raise awareness about the isCircular field.
>>>>>       > The current choice to only display the seqlengths pre-dates the
>>>>>       > existence of the seqinfo slot but might be a little bit misleading
>>>>>       > these days since it only exposes some arbitrary seqinfo fields.
>>>>>       >
>>>>>       > H.
>>>>>       >
>>>>>       >>
>>>>>       >> Cheers,
>>>>>       >> Pete
>>>>>       >>
>>>>>       >>
>>>>>       >>> I might have requested the genome annotation, but I'm pretty sure it wasn't
>>>>>       >>> me who decided on tracking it on a per-sequence basis. I could imagine use
>>>>>       >>> cases for that though, e.g., when diagnosing sequencing contamination (like
>>>>>       >>> human vs. mouse). But most other tools and file formats expect a single
>>>>>       >>> genome per "track", so, for example, rtracklayer has an internal function
>>>>>       >>> singleGenome() to take care of this.
>>>>>       >>>
>>>>>       >>> On Mon, Sep 8, 2014 at 10:50 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>>       >>>
>>>>>       >>>> Hi Vince,
>>>>>       >>>>
>>>>>       >>>> Yes it would make sense to have the "show" method report the genome
>>>>>       >>>> when genome(x) contains a unique non-NA value. I think the main
>>>>>       >>>> use case for having the genome defined at the sequence level instead
>>>>>       >>>> of the whole object level is metagenomics. Maybe Michael has some other
>>>>>       >>>> good use cases to share since IIRC he requested the addition of the
>>>>>       >>>> genome field a couple of years ago and made the case for having it
>>>>>       >>>> defined at the sequence level.
>>>>>       >>>>
>>>>>       >>>> Cheers,
>>>>>       >>>> H.
>>>>>       >>>>
>>>>>       >>>>
>>>>>       >>>> On 09/08/2014 07:21 AM, Vincent Carey wrote:
>>>>>       >>>>
>>>>>       >>>>> For GRanges x, my naive expectation is that genome(x) returns a
>>>>>       >>>>> length-one tag identifying the genome to which chromosomal coordinates
>>>>>       >>>>> correspond.  The genome() method seems to have sequence-specific
>>>>>       >>>>> semantics, which makes sense, but when we identify sequence
>>>>>       >>>>> with chromosome, it seems too complicated.  Is there a use case for
>>>>>       >>>>> a GRanges with sequences from several different genomes?
>>>>>       >>>>>
>>>>>       >>>>> One reason I am inquiring is that I feel it would be nice to have the
>>>>>       >>>>> GRanges show() method report, prominently, the genome in use (or NA
>>>>>       >>>>> if unspecified).  This could be accomplished by reporting
>>>>>       >>>>> unique(genome(x)), and perhaps that would be satisfactory.
>>>>>       >>>>>
>>>>>       >>>>> after example(genome) :
>>>>>       >>>>>
>>>>>       >>>>> > seqinfo(txdb)
>>>>>       >>>>> Seqinfo of length 15
>>>>>       >>>>> seqnames seqlengths isCircular genome
>>>>>       >>>>> CH2L       23011544      FALSE    dm3
>>>>>       >>>>> CH2R       21146708      FALSE    dm3
>>>>>       >>>>> CH3L       24543557      FALSE    dm3
>>>>>       >>>>> CH3R       27905053      FALSE    dm3
>>>>>       >>>>> CH4         1351857      FALSE    dm3
>>>>>       >>>>> ...             ...        ...    ...
>>>>>       >>>>> CH3LHet     2555491      FALSE    dm3
>>>>>       >>>>> CH3RHet     2517507      FALSE    dm3
>>>>>       >>>>> CHXHet       204112      FALSE    dm3
>>>>>       >>>>> CHYHet       347038      FALSE    dm3
>>>>>       >>>>> CHUextra   29004656      FALSE    dm3
>>>>>       >>>>>
>>>>>       >>>>> > genome(seqinfo(txdb))
>>>>>       >>>>>     CH2L     CH2R     CH3L     CH3R      CH4      CHX      CHU        M
>>>>>       >>>>>    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"
>>>>>       >>>>>  CH2LHet  CH2RHet  CH3LHet  CH3RHet   CHXHet   CHYHet CHUextra
>>>>>       >>>>>    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"
>>>>>       >>>>>        [[alternative HTML version deleted]]
>>>>>       >>>>>
>>>>>       >>>>> _______________________________________________
>>>>>       >>>>> Bioc-devel at r-project.org mailing list
>>>>>       >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>       >>>>>
>>>>>       >>>>>
>>>>>       >>>> --
>>>>>       >>>> Hervé Pagès
>>>>>       >>>>
>>>>>       >>>> Program in Computational Biology
>>>>>       >>>> Division of Public Health Sciences
>>>>>       >>>> Fred Hutchinson Cancer Research Center
>>>>>       >>>> 1100 Fairview Ave. N, M1-B514
>>>>>       >>>> P.O. Box 19024
>>>>>       >>>> Seattle, WA 98109-1024
>>>>>       >>>>
>>>>>       >>>> E-mail: hpages at fhcrc.org
>>>>>       >>>> Phone: (206) 667-5791
>>>>>       >>>> Fax: (206) 667-1319
>>>>>       >>>>
>>>>>       >>>>
>>>>>       >>>>
>>>>>       >>
>>>>>       >> --------------------------------
>>>>>       >> Peter Hickey,
>>>>>       >> PhD Student/Research Assistant,
>>>>>       >> Bioinformatics Division,
>>>>>       >> Walter and Eliza Hall Institute of Medical Research,
>>>>>       >> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>>>>       >> Ph: +613 9345 2324
>>>>>       >>
>>>>>       >> hickey at wehi.edu.au
>>>>>       >> http://www.wehi.edu.au
>>>>>       >>
>>>>>       >>
>>>>>      >>
>>>>>      >
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
