[Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

Tue Sep 9 06:30:56 CEST 2014

On 09/08/2014 06:42 PM, Michael Lawrence wrote:
> Instead of printing out multiple lines of a table that is rarely of
> interest, could we develop Peter's idea toward something like:
>
> hg19:chr1  hg19:chr2 ...
> [lengths ...]
>
> Not sure what condensed notation would be useful for circularity.

I don't know either. I'm worried that this would make the seqinfo
stuff look like a named vector and that the user would expect
hg19:chr1, hg19:chr2, etc... to be valid names.

With the table-like layout, some screen real estate can always be
saved by printing less lines:

   > gr
   GRanges with 3 ranges and 0 metadata columns:
         seqnames               ranges strand
            <Rle>            <IRanges>  <Rle>
     [1]    chr14 [19069583, 19069654]      +
     [2]    chr14 [19363738, 19363809]      +
     [3]    chr14 [19363755, 19363826]      -
     [4]    chr14 [19369799, 19369870]      +
     --- seqinfo: 60 seqlevels (2 circulars) on 2 genomes (hg19, mm10) ---
     seqlevels                                seqlengths isCircular genome
     chr1                                      249250621       <NA>   hg19
     chr10                                     135534747       <NA>   hg19
     ...                                             ...        ...    ...
     chrX                                      155270560       <NA>   hg19
     chrY                                       59373566       <NA>   hg19

I agree that the exact content of the seqinfo table itself is rarely
of interest so printing only 3 or 4 lines is OK. IMO it's important
to make the user aware of the existence of this hidden table and to
display it like what it really is (i.e. a table). Also displaying the
column names is a well established tradition and serves the purpose
of providing a quick summary of the accessors that are available to
access those fields.

H.

>
>
> On Mon, Sep 8, 2014 at 5:21 PM, Peter Hickey <hickey at wehi.edu.au
> <mailto:hickey at wehi.edu.au>> wrote:
>
>     Perhaps it might be useful to have some way of highlighting if any
>     of the chromosomes are circular or highlighting if there are
>     multiple genomes present? Otherwise this information might be hidden
>     in the "…"
>
>     Cheers,
>     Pete
>
>
>     On 09/09/2014, at 9:44 AM, Hervé Pagès <hpages at fhcrc.org
>     <mailto:hpages at fhcrc.org>> wrote:
>
>      > On 09/08/2014 02:28 PM, Peter Hickey wrote:
>      >> Just a vote for still allowing for multiple genomes in a Seqinfo
>     object (in a GRanges object). My use case is in bisulfite-sequencing
>     experiments where there is often a spike-in of a lambda phage genome
>     along with the genome of interest (human or mouse). It's often
>     useful to keep all data from a single library together in the same
>     objet but process according to genome(x) for each seqlevel.
>      >
>      > Note taken. Thanks Pete! It's always great to know about concrete use
>      > cases.
>      >
>      >>
>      >> FWIW, I like Vincent's proposal of selectSome(unique(genome(x)))
>     in the show method.
>      >
>      > Or what about displaying the genome next to the seqlevel it's
>      > associated with? Like e.g.:
>      >
>      >  > gr
>      >  GRanges with 3 ranges and 0 metadata columns:
>      >        seqnames               ranges strand
>      >           <Rle>            <IRanges>  <Rle>
>      >    [1]    chr14 [19069583, 19069654]      +
>      >    [2]    chr14 [19363738, 19363809]      +
>      >    [3]    chr14 [19363755, 19363826]      -
>      >    [4]    chr14 [19369799, 19369870]      +
>      >    ---
>      >    seqinfo:
>      >      seqlevels             seqlengths isCircular genome
>      >      chr1                   249250621       <NA>   hg19
>      >      chr10                  135534747       <NA>   hg19
>      >      chr11                  135006516       <NA>   hg19
>      >      ...                          ...        ...    ...
>      >      chrUn_gl000249             38502       <NA>   hg19
>      >      chrX                   155270560       <NA>   hg19
>      >      chrY                    59373566       <NA>   hg19
>      >
>      > That way, we also raise awareness about the isCircular field.
>      > The current choice to only display the seqlengths pre-dates the
>      > existence of the seqinfo slot but might be a little bit misleading
>      > those days since it only exposes some arbitrary seqinfo fields.
>      >
>      > H.
>      >
>      >>
>      >> Cheers,
>      >> Pete
>      >>
>      >>
>      >>> I might have requested the genome annotation, but I'm pretty
>     sure it wasn't
>      >>> me who decide on tracking it on a per-sequence basis. I could
>     imagine use
>      >>> cases for that though, e.g., when diagnosing sequencing
>     contamination (like
>      >>> human vs. mouse). But most other tools and file formats expect
>     a single
>      >>> genome per "track", so, for example, rtracklayer has an
>     internal function
>      >>> singleGenome() to take care of this.
>      >>>
>      >>> On Mon, Sep 8, 2014 at 10:50 AM, Herv? Pag?s <hpages at fhcrc.org
>     <mailto:hpages at fhcrc.org>> wrote:
>      >>>
>      >>>> Hi Vince,
>      >>>>
>      >>>> Yes it would make sense to have the "show" method report the
>     genome
>      >>>> when genome(x) contains a unique non-NA value. I think the main
>      >>>> use case for having the genome defined at the sequence level
>     instead
>      >>>> of the whole object level is metagenomics. Maybe Michael has
>     some other
>      >>>> good use cases to share since IIRC he requested the addition
>     of the
>      >>>> genome field a couple of years ago and made the case for having it
>      >>>> defined at the sequence level.
>      >>>>
>      >>>> Cheers,
>      >>>> H.
>      >>>>
>      >>>>
>      >>>> On 09/08/2014 07:21 AM, Vincent Carey wrote:
>      >>>>
>      >>>>> For GRanges x, my naive expectation is that genome(x) returns
>     a length-
>      >>>>>
>      >>>>> one tag identifying the genome to which chromosomal coordinates
>      >>>>>
>      >>>>> correspond.  The genome() method seems to have sequence-specific
>      >>>>>
>      >>>>> semantics, which makes sense, but when we identify sequence
>      >>>>>
>      >>>>> with chromosome, it seems too complicated.  Is there a use
>     case for
>      >>>>>
>      >>>>> a GRanges with sequences from several different genomes?
>      >>>>>
>      >>>>>
>      >>>>> One reason I am inquiring is that I feel it would be nice to
>     have the
>      >>>>> GRanges show() method report, prominently, the genome in use
>     (or NA
>      >>>>>
>      >>>>> if unspecified).  This could be accomplished by reporting
>      >>>>> unique(genome(x)), and perhaps that would be satisfactory.
>      >>>>>
>      >>>>> after example(genome) :
>      >>>>>
>      >>>>> seqinfo(txdb)
>      >>>>>>
>      >>>>>
>      >>>>> Seqinfo of length 15
>      >>>>>
>      >>>>> seqnames seqlengths isCircular genome
>      >>>>>
>      >>>>> CH2L       23011544      FALSE    dm3
>      >>>>>
>      >>>>> CH2R       21146708      FALSE    dm3
>      >>>>>
>      >>>>> CH3L       24543557      FALSE    dm3
>      >>>>>
>      >>>>> CH3R       27905053      FALSE    dm3
>      >>>>>
>      >>>>> CH4         1351857      FALSE    dm3
>      >>>>>
>      >>>>> ...             ...        ...    ...
>      >>>>>
>      >>>>> CH3LHet     2555491      FALSE    dm3
>      >>>>>
>      >>>>> CH3RHet     2517507      FALSE    dm3
>      >>>>>
>      >>>>> CHXHet       204112      FALSE    dm3
>      >>>>>
>      >>>>> CHYHet       347038      FALSE    dm3
>      >>>>>
>      >>>>> CHUextra   29004656      FALSE    dm3
>      >>>>>
>      >>>>> genome(seqinfo(txdb))
>      >>>>>>
>      >>>>>
>      >>>>>     CH2L     CH2R     CH3L     CH3R      CH4      CHX
>     CHU        M
>      >>>>>
>      >>>>>    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"
>     "dm3"    "dm3"
>      >>>>>
>      >>>>>  CH2LHet  CH2RHet  CH3LHet  CH3RHet   CHXHet   CHYHet CHUextra
>      >>>>>
>      >>>>>    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"
>      >>>>>
>      >>>>>        [[alternative HTML version deleted]]
>      >>>>>
>      >>>>> _______________________________________________
>      >>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>     mailing list
>      >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>      >>>>>
>      >>>>>
>      >>>> --
>      >>>> Herv? Pag?s
>      >>>>
>      >>>> Program in Computational Biology
>      >>>> Division of Public Health Sciences
>      >>>> Fred Hutchinson Cancer Research Center
>      >>>> 1100 Fairview Ave. N, M1-B514
>      >>>> P.O. Box 19024
>      >>>> Seattle, WA 98109-1024
>      >>>>
>      >>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>      >>>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>      >>>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>      >>>>
>      >>>>
>      >>>> _______________________________________________
>      >>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>     mailing list
>      >>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>      >>>>
>      >>
>      >> --------------------------------
>      >> Peter Hickey,
>      >> PhD Student/Research Assistant,
>      >> Bioinformatics Division,
>      >> Walter and Eliza Hall Institute of Medical Research,
>      >> 1G Royal Parade, Parkville, Vic 3052, Australia.
>      >> Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>      >>
>      >> hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>      >> http://www.wehi.edu.au
>      >>
>      >>
>     ______________________________________________________________________
>      >> The information in this email is confidential and
>     intend...{{dropped:6}}
>     >>
>     >> _______________________________________________
>     >>Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
>     >>https://stat.ethz.ch/mailman/listinfo/bioc-devel
>     >>
>     >
>     > --
>      > Hervé Pagès
>     >
>     > Program in Computational Biology
>     > Division of Public Health Sciences
>     > Fred Hutchinson Cancer Research Center
>     > 1100 Fairview Ave. N, M1-B514
>     > P.O. Box 19024
>     > Seattle, WA 98109-1024
>     >
>     > E-mail:hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     > Phone:(206) 667-5791 <tel:%28206%29%20667-5791>
>     > Fax:(206) 667-1319 <tel:%28206%29%20667-1319>
>
>     --------------------------------
>     Peter Hickey,
>     PhD Student/Research Assistant,
>     Bioinformatics Division,
>     Walter and Eliza Hall Institute of Medical Research,
>     1G Royal Parade, Parkville, Vic 3052, Australia.
>     Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>
>     hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>     http://www.wehi.edu.au
>
>
>     ______________________________________________________________________
>     The information in this email is confidential and intend...{{dropped:8}}
>
>
>     _______________________________________________
>     Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
>     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319