[Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

Vincent Carey stvjc at channing.harvard.edu
Tue Sep 9 07:08:23 CEST 2014


On Tue, Sep 9, 2014 at 12:30 AM, Hervé Pagès <hpages at fhcrc.org> wrote:

> On 09/08/2014 06:42 PM, Michael Lawrence wrote:
>
>> Instead of printing out multiple lines of a table that is rarely of
>> interest, could we develop Peter's idea toward something like:
>>
>> hg19:chr1  hg19:chr2 ...
>> [lengths ...]
>>
>> Not sure what condensed notation would be useful for circularity.
>>
>
> I don't know either. I'm worried that this would make the seqinfo
> stuff look like a named vector and that the user would expect
> hg19:chr1, hg19:chr2, etc... to be valid names.
>
> With the table-like layout, some screen real estate can always be
> saved by printing less lines:
>
>
What I had in mind was


>   > gr
>   GRanges with 3 ranges and 0 metadata columns:
>
      genome: hg19

>         seqnames               ranges strand
>            <Rle>            <IRanges>  <Rle>
>     [1]    chr14 [19069583, 19069654]      +
>     [2]    chr14 [19363738, 19363809]      +
>     [3]    chr14 [19363755, 19363826]      -
>     [4]    chr14 [19369799, 19369870]      +
>


you could then probably dispense with the seqlengths.  i have
never found them too useful except as a key to the  genome.

if there are multiple genomes, we have something like

genomes: hg19, mm9

the point is to make it prominent, particularly at a time of transition.



> --- seqinfo: 60 seqlevels (2 circulars) on 2 genomes (hg19, mm10) ---
>     seqlevels                                seqlengths isCircular genome
>     chr1                                      249250621       <NA>   hg19
>     chr10                                     135534747       <NA>   hg19
>     ...                                             ...        ...    ...
>     chrX                                      155270560       <NA>   hg19
>     chrY                                       59373566       <NA>   hg19
>
> I agree that the exact content of the seqinfo table itself is rarely
> of interest so printing only 3 or 4 lines is OK. IMO it's important
> to make the user aware of the existence of this hidden table and to
> display it like what it really is (i.e. a table). Also displaying the
> column names is a well established tradition and serves the purpose
> of providing a quick summary of the accessors that are available to
> access those fields.
>
> H.
>
>
>>
>> On Mon, Sep 8, 2014 at 5:21 PM, Peter Hickey <hickey at wehi.edu.au
>> <mailto:hickey at wehi.edu.au>> wrote:
>>
>>     Perhaps it might be useful to have some way of highlighting if any
>>     of the chromosomes are circular or highlighting if there are
>>     multiple genomes present? Otherwise this information might be hidden
>>     in the "…"
>>
>>     Cheers,
>>     Pete
>>
>>
>>     On 09/09/2014, at 9:44 AM, Hervé Pagès <hpages at fhcrc.org
>>     <mailto:hpages at fhcrc.org>> wrote:
>>
>>      > On 09/08/2014 02:28 PM, Peter Hickey wrote:
>>      >> Just a vote for still allowing for multiple genomes in a Seqinfo
>>     object (in a GRanges object). My use case is in bisulfite-sequencing
>>     experiments where there is often a spike-in of a lambda phage genome
>>     along with the genome of interest (human or mouse). It's often
>>     useful to keep all data from a single library together in the same
>>     objet but process according to genome(x) for each seqlevel.
>>      >
>>      > Note taken. Thanks Pete! It's always great to know about concrete
>> use
>>      > cases.
>>      >
>>      >>
>>      >> FWIW, I like Vincent's proposal of selectSome(unique(genome(x)))
>>     in the show method.
>>      >
>>      > Or what about displaying the genome next to the seqlevel it's
>>      > associated with? Like e.g.:
>>      >
>>      >  > gr
>>      >  GRanges with 3 ranges and 0 metadata columns:
>>      >        seqnames               ranges strand
>>      >           <Rle>            <IRanges>  <Rle>
>>      >    [1]    chr14 [19069583, 19069654]      +
>>      >    [2]    chr14 [19363738, 19363809]      +
>>      >    [3]    chr14 [19363755, 19363826]      -
>>      >    [4]    chr14 [19369799, 19369870]      +
>>      >    ---
>>      >    seqinfo:
>>      >      seqlevels             seqlengths isCircular genome
>>      >      chr1                   249250621       <NA>   hg19
>>      >      chr10                  135534747       <NA>   hg19
>>      >      chr11                  135006516       <NA>   hg19
>>      >      ...                          ...        ...    ...
>>      >      chrUn_gl000249             38502       <NA>   hg19
>>      >      chrX                   155270560       <NA>   hg19
>>      >      chrY                    59373566       <NA>   hg19
>>      >
>>      > That way, we also raise awareness about the isCircular field.
>>      > The current choice to only display the seqlengths pre-dates the
>>      > existence of the seqinfo slot but might be a little bit misleading
>>      > those days since it only exposes some arbitrary seqinfo fields.
>>      >
>>      > H.
>>      >
>>      >>
>>      >> Cheers,
>>      >> Pete
>>      >>
>>      >>
>>      >>> I might have requested the genome annotation, but I'm pretty
>>     sure it wasn't
>>      >>> me who decide on tracking it on a per-sequence basis. I could
>>     imagine use
>>      >>> cases for that though, e.g., when diagnosing sequencing
>>     contamination (like
>>      >>> human vs. mouse). But most other tools and file formats expect
>>     a single
>>      >>> genome per "track", so, for example, rtracklayer has an
>>     internal function
>>      >>> singleGenome() to take care of this.
>>      >>>
>>      >>> On Mon, Sep 8, 2014 at 10:50 AM, Herv? Pag?s <hpages at fhcrc.org
>>     <mailto:hpages at fhcrc.org>> wrote:
>>      >>>
>>      >>>> Hi Vince,
>>      >>>>
>>      >>>> Yes it would make sense to have the "show" method report the
>>     genome
>>      >>>> when genome(x) contains a unique non-NA value. I think the main
>>      >>>> use case for having the genome defined at the sequence level
>>     instead
>>      >>>> of the whole object level is metagenomics. Maybe Michael has
>>     some other
>>      >>>> good use cases to share since IIRC he requested the addition
>>     of the
>>      >>>> genome field a couple of years ago and made the case for having
>> it
>>      >>>> defined at the sequence level.
>>      >>>>
>>      >>>> Cheers,
>>      >>>> H.
>>      >>>>
>>      >>>>
>>      >>>> On 09/08/2014 07:21 AM, Vincent Carey wrote:
>>      >>>>
>>      >>>>> For GRanges x, my naive expectation is that genome(x) returns
>>     a length-
>>      >>>>>
>>      >>>>> one tag identifying the genome to which chromosomal coordinates
>>      >>>>>
>>      >>>>> correspond.  The genome() method seems to have
>> sequence-specific
>>      >>>>>
>>      >>>>> semantics, which makes sense, but when we identify sequence
>>      >>>>>
>>      >>>>> with chromosome, it seems too complicated.  Is there a use
>>     case for
>>      >>>>>
>>      >>>>> a GRanges with sequences from several different genomes?
>>      >>>>>
>>      >>>>>
>>      >>>>> One reason I am inquiring is that I feel it would be nice to
>>     have the
>>      >>>>> GRanges show() method report, prominently, the genome in use
>>     (or NA
>>      >>>>>
>>      >>>>> if unspecified).  This could be accomplished by reporting
>>      >>>>> unique(genome(x)), and perhaps that would be satisfactory.
>>      >>>>>
>>      >>>>> after example(genome) :
>>      >>>>>
>>      >>>>> seqinfo(txdb)
>>      >>>>>>
>>      >>>>>
>>      >>>>> Seqinfo of length 15
>>      >>>>>
>>      >>>>> seqnames seqlengths isCircular genome
>>      >>>>>
>>      >>>>> CH2L       23011544      FALSE    dm3
>>      >>>>>
>>      >>>>> CH2R       21146708      FALSE    dm3
>>      >>>>>
>>      >>>>> CH3L       24543557      FALSE    dm3
>>      >>>>>
>>      >>>>> CH3R       27905053      FALSE    dm3
>>      >>>>>
>>      >>>>> CH4         1351857      FALSE    dm3
>>      >>>>>
>>      >>>>> ...             ...        ...    ...
>>      >>>>>
>>      >>>>> CH3LHet     2555491      FALSE    dm3
>>      >>>>>
>>      >>>>> CH3RHet     2517507      FALSE    dm3
>>      >>>>>
>>      >>>>> CHXHet       204112      FALSE    dm3
>>      >>>>>
>>      >>>>> CHYHet       347038      FALSE    dm3
>>      >>>>>
>>      >>>>> CHUextra   29004656      FALSE    dm3
>>      >>>>>
>>      >>>>> genome(seqinfo(txdb))
>>      >>>>>>
>>      >>>>>
>>      >>>>>     CH2L     CH2R     CH3L     CH3R      CH4      CHX
>>     CHU        M
>>      >>>>>
>>      >>>>>    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"
>>     "dm3"    "dm3"
>>      >>>>>
>>      >>>>>  CH2LHet  CH2RHet  CH3LHet  CH3RHet   CHXHet   CHYHet CHUextra
>>      >>>>>
>>      >>>>>    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"
>>      >>>>>
>>      >>>>>        [[alternative HTML version deleted]]
>>      >>>>>
>>      >>>>> _______________________________________________
>>      >>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>     mailing list
>>      >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>      >>>>>
>>      >>>>>
>>      >>>> --
>>      >>>> Herv? Pag?s
>>      >>>>
>>      >>>> Program in Computational Biology
>>      >>>> Division of Public Health Sciences
>>      >>>> Fred Hutchinson Cancer Research Center
>>      >>>> 1100 Fairview Ave. N, M1-B514
>>      >>>> P.O. Box 19024
>>      >>>> Seattle, WA 98109-1024
>>      >>>>
>>      >>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>      >>>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>      >>>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>      >>>>
>>      >>>>
>>      >>>> _______________________________________________
>>      >>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>     mailing list
>>      >>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>      >>>>
>>      >>
>>      >> --------------------------------
>>      >> Peter Hickey,
>>      >> PhD Student/Research Assistant,
>>      >> Bioinformatics Division,
>>      >> Walter and Eliza Hall Institute of Medical Research,
>>      >> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>      >> Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>>      >>
>>      >> hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>>      >> http://www.wehi.edu.au
>>      >>
>>      >>
>>     ____________________________________________________________
>> __________
>>      >> The information in this email is confidential and
>>     intend...{{dropped:6}}
>>     >>
>>     >> _______________________________________________
>>     >>Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
>> list
>>     >>https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>     >>
>>     >
>>     > --
>>      > Hervé Pagès
>>     >
>>     > Program in Computational Biology
>>     > Division of Public Health Sciences
>>     > Fred Hutchinson Cancer Research Center
>>     > 1100 Fairview Ave. N, M1-B514
>>     > P.O. Box 19024
>>     > Seattle, WA 98109-1024
>>     >
>>     > E-mail:hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>     > Phone:(206) 667-5791 <tel:%28206%29%20667-5791>
>>     > Fax:(206) 667-1319 <tel:%28206%29%20667-1319>
>>
>>     --------------------------------
>>     Peter Hickey,
>>     PhD Student/Research Assistant,
>>     Bioinformatics Division,
>>     Walter and Eliza Hall Institute of Medical Research,
>>     1G Royal Parade, Parkville, Vic 3052, Australia.
>>     Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>>
>>     hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>>     http://www.wehi.edu.au
>>
>>
>>     ____________________________________________________________
>> __________
>>     The information in this email is confidential and
>> intend...{{dropped:8}}
>>
>>
>>     _______________________________________________
>>     Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
>> list
>>     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list