[Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method
Martin Morgan
mtmorgan at fhcrc.org
Tue Sep 9 13:42:34 CEST 2014
On 09/09/2014 04:02 AM, Michael Lawrence wrote:
> I'm in favor of this display. The seqinfo output at the bottom has always
> been annoying (over-emphasized).
the fact that the lengths are 'NA' can be a helpful prompt to do something about
it, e.g., add seqinfo when inputting the data. Also they are helpful when one is
told that seqlengths are incompatible during, e.g., findOverlaps. But I like the
idea of less but more informative display of seqinfo, along the lines suggested
by Vince.
seqinfo: 60 seqlevels (2 circular) on 2 genomes (hg19, mm10); 60 'NA' seqlengths
Martin
>
> On Mon, Sep 8, 2014 at 10:08 PM, Vincent Carey <stvjc at channing.harvard.edu>
> wrote:
>
>>
>>
>> On Tue, Sep 9, 2014 at 12:30 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>
>>> On 09/08/2014 06:42 PM, Michael Lawrence wrote:
>>>
>>>> Instead of printing out multiple lines of a table that is rarely of
>>>> interest, could we develop Peter's idea toward something like:
>>>>
>>>> hg19:chr1 hg19:chr2 ...
>>>> [lengths ...]
>>>>
>>>> Not sure what condensed notation would be useful for circularity.
>>>>
>>>
>>> I don't know either. I'm worried that this would make the seqinfo
>>> stuff look like a named vector and that the user would expect
>>> hg19:chr1, hg19:chr2, etc... to be valid names.
>>>
>>> With the table-like layout, some screen real estate can always be
>>> saved by printing less lines:
>>>
>>>
>> What I had in mind was
>>
>>
>>> > gr
>>> GRanges with 3 ranges and 0 metadata columns:
>>>
>> genome: hg19
>>
>>> seqnames ranges strand
>>> <Rle> <IRanges> <Rle>
>>> [1] chr14 [19069583, 19069654] +
>>> [2] chr14 [19363738, 19363809] +
>>> [3] chr14 [19363755, 19363826] -
>>> [4] chr14 [19369799, 19369870] +
>>>
>>
>>
>> you could then probably dispense with the seqlengths. i have
>> never found them too useful except as a key to the genome.
>>
>> if there are multiple genomes, we have something like
>>
>> genomes: hg19, mm9
>>
>> the point is to make it prominent, particularly at a time of transition.
>>
>>
>>
>>> --- seqinfo: 60 seqlevels (2 circulars) on 2 genomes (hg19, mm10) ---
>>> seqlevels seqlengths isCircular genome
>>> chr1 249250621 <NA> hg19
>>> chr10 135534747 <NA> hg19
>>> ... ... ... ...
>>> chrX 155270560 <NA> hg19
>>> chrY 59373566 <NA> hg19
>>>
>>> I agree that the exact content of the seqinfo table itself is rarely
>>> of interest so printing only 3 or 4 lines is OK. IMO it's important
>>> to make the user aware of the existence of this hidden table and to
>>> display it like what it really is (i.e. a table). Also displaying the
>>> column names is a well established tradition and serves the purpose
>>> of providing a quick summary of the accessors that are available to
>>> access those fields.
>>>
>>> H.
>>>
>>>
>>>>
>>>> On Mon, Sep 8, 2014 at 5:21 PM, Peter Hickey <hickey at wehi.edu.au
>>>> <mailto:hickey at wehi.edu.au>> wrote:
>>>>
>>>> Perhaps it might be useful to have some way of highlighting if any
>>>> of the chromosomes are circular or highlighting if there are
>>>> multiple genomes present? Otherwise this information might be hidden
>>>> in the "…"
>>>>
>>>> Cheers,
>>>> Pete
>>>>
>>>>
>>>> On 09/09/2014, at 9:44 AM, Hervé Pagès <hpages at fhcrc.org
>>>> <mailto:hpages at fhcrc.org>> wrote:
>>>>
>>>> > On 09/08/2014 02:28 PM, Peter Hickey wrote:
>>>> >> Just a vote for still allowing for multiple genomes in a Seqinfo
>>>> object (in a GRanges object). My use case is in bisulfite-sequencing
>>>> experiments where there is often a spike-in of a lambda phage genome
>>>> along with the genome of interest (human or mouse). It's often
>>>> useful to keep all data from a single library together in the same
>>>> objet but process according to genome(x) for each seqlevel.
>>>> >
>>>> > Note taken. Thanks Pete! It's always great to know about concrete
>>>> use
>>>> > cases.
>>>> >
>>>> >>
>>>> >> FWIW, I like Vincent's proposal of selectSome(unique(genome(x)))
>>>> in the show method.
>>>> >
>>>> > Or what about displaying the genome next to the seqlevel it's
>>>> > associated with? Like e.g.:
>>>> >
>>>> > > gr
>>>> > GRanges with 3 ranges and 0 metadata columns:
>>>> > seqnames ranges strand
>>>> > <Rle> <IRanges> <Rle>
>>>> > [1] chr14 [19069583, 19069654] +
>>>> > [2] chr14 [19363738, 19363809] +
>>>> > [3] chr14 [19363755, 19363826] -
>>>> > [4] chr14 [19369799, 19369870] +
>>>> > ---
>>>> > seqinfo:
>>>> > seqlevels seqlengths isCircular genome
>>>> > chr1 249250621 <NA> hg19
>>>> > chr10 135534747 <NA> hg19
>>>> > chr11 135006516 <NA> hg19
>>>> > ... ... ... ...
>>>> > chrUn_gl000249 38502 <NA> hg19
>>>> > chrX 155270560 <NA> hg19
>>>> > chrY 59373566 <NA> hg19
>>>> >
>>>> > That way, we also raise awareness about the isCircular field.
>>>> > The current choice to only display the seqlengths pre-dates the
>>>> > existence of the seqinfo slot but might be a little bit misleading
>>>> > those days since it only exposes some arbitrary seqinfo fields.
>>>> >
>>>> > H.
>>>> >
>>>> >>
>>>> >> Cheers,
>>>> >> Pete
>>>> >>
>>>> >>
>>>> >>> I might have requested the genome annotation, but I'm pretty
>>>> sure it wasn't
>>>> >>> me who decide on tracking it on a per-sequence basis. I could
>>>> imagine use
>>>> >>> cases for that though, e.g., when diagnosing sequencing
>>>> contamination (like
>>>> >>> human vs. mouse). But most other tools and file formats expect
>>>> a single
>>>> >>> genome per "track", so, for example, rtracklayer has an
>>>> internal function
>>>> >>> singleGenome() to take care of this.
>>>> >>>
>>>> >>> On Mon, Sep 8, 2014 at 10:50 AM, Herv? Pag?s <hpages at fhcrc.org
>>>> <mailto:hpages at fhcrc.org>> wrote:
>>>> >>>
>>>> >>>> Hi Vince,
>>>> >>>>
>>>> >>>> Yes it would make sense to have the "show" method report the
>>>> genome
>>>> >>>> when genome(x) contains a unique non-NA value. I think the main
>>>> >>>> use case for having the genome defined at the sequence level
>>>> instead
>>>> >>>> of the whole object level is metagenomics. Maybe Michael has
>>>> some other
>>>> >>>> good use cases to share since IIRC he requested the addition
>>>> of the
>>>> >>>> genome field a couple of years ago and made the case for
>>>> having it
>>>> >>>> defined at the sequence level.
>>>> >>>>
>>>> >>>> Cheers,
>>>> >>>> H.
>>>> >>>>
>>>> >>>>
>>>> >>>> On 09/08/2014 07:21 AM, Vincent Carey wrote:
>>>> >>>>
>>>> >>>>> For GRanges x, my naive expectation is that genome(x) returns
>>>> a length-
>>>> >>>>>
>>>> >>>>> one tag identifying the genome to which chromosomal
>>>> coordinates
>>>> >>>>>
>>>> >>>>> correspond. The genome() method seems to have
>>>> sequence-specific
>>>> >>>>>
>>>> >>>>> semantics, which makes sense, but when we identify sequence
>>>> >>>>>
>>>> >>>>> with chromosome, it seems too complicated. Is there a use
>>>> case for
>>>> >>>>>
>>>> >>>>> a GRanges with sequences from several different genomes?
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> One reason I am inquiring is that I feel it would be nice to
>>>> have the
>>>> >>>>> GRanges show() method report, prominently, the genome in use
>>>> (or NA
>>>> >>>>>
>>>> >>>>> if unspecified). This could be accomplished by reporting
>>>> >>>>> unique(genome(x)), and perhaps that would be satisfactory.
>>>> >>>>>
>>>> >>>>> after example(genome) :
>>>> >>>>>
>>>> >>>>> seqinfo(txdb)
>>>> >>>>>>
>>>> >>>>>
>>>> >>>>> Seqinfo of length 15
>>>> >>>>>
>>>> >>>>> seqnames seqlengths isCircular genome
>>>> >>>>>
>>>> >>>>> CH2L 23011544 FALSE dm3
>>>> >>>>>
>>>> >>>>> CH2R 21146708 FALSE dm3
>>>> >>>>>
>>>> >>>>> CH3L 24543557 FALSE dm3
>>>> >>>>>
>>>> >>>>> CH3R 27905053 FALSE dm3
>>>> >>>>>
>>>> >>>>> CH4 1351857 FALSE dm3
>>>> >>>>>
>>>> >>>>> ... ... ... ...
>>>> >>>>>
>>>> >>>>> CH3LHet 2555491 FALSE dm3
>>>> >>>>>
>>>> >>>>> CH3RHet 2517507 FALSE dm3
>>>> >>>>>
>>>> >>>>> CHXHet 204112 FALSE dm3
>>>> >>>>>
>>>> >>>>> CHYHet 347038 FALSE dm3
>>>> >>>>>
>>>> >>>>> CHUextra 29004656 FALSE dm3
>>>> >>>>>
>>>> >>>>> genome(seqinfo(txdb))
>>>> >>>>>>
>>>> >>>>>
>>>> >>>>> CH2L CH2R CH3L CH3R CH4 CHX
>>>> CHU M
>>>> >>>>>
>>>> >>>>> "dm3" "dm3" "dm3" "dm3" "dm3" "dm3"
>>>> "dm3" "dm3"
>>>> >>>>>
>>>> >>>>> CH2LHet CH2RHet CH3LHet CH3RHet CHXHet CHYHet CHUextra
>>>> >>>>>
>>>> >>>>> "dm3" "dm3" "dm3" "dm3" "dm3" "dm3" "dm3"
>>>> >>>>>
>>>> >>>>> [[alternative HTML version deleted]]
>>>> >>>>>
>>>> >>>>> _______________________________________________
>>>> >>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>> mailing list
>>>> >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>> >>>>>
>>>> >>>>>
>>>> >>>> --
>>>> >>>> Herv? Pag?s
>>>> >>>>
>>>> >>>> Program in Computational Biology
>>>> >>>> Division of Public Health Sciences
>>>> >>>> Fred Hutchinson Cancer Research Center
>>>> >>>> 1100 Fairview Ave. N, M1-B514
>>>> >>>> P.O. Box 19024
>>>> >>>> Seattle, WA 98109-1024
>>>> >>>>
>>>> >>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>>> >>>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>>> >>>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>>> >>>>
>>>> >>>>
>>>> >>>> _______________________________________________
>>>> >>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>> mailing list
>>>> >>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>> >>>>
>>>> >>
>>>> >> --------------------------------
>>>> >> Peter Hickey,
>>>> >> PhD Student/Research Assistant,
>>>> >> Bioinformatics Division,
>>>> >> Walter and Eliza Hall Institute of Medical Research,
>>>> >> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>>> >> Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>>>> >>
>>>> >> hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>>>> >> http://www.wehi.edu.au
>>>> >>
>>>> >>
>>>> ____________________________________________________________
>>>> __________
>>>> >> The information in this email is confidential and
>>>> intend...{{dropped:6}}
>>>> >>
>>>> >> _______________________________________________
>>>> >>Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>> mailing list
>>>> >>https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>> >>
>>>> >
>>>> > --
>>>> > Hervé Pagès
>>>> >
>>>> > Program in Computational Biology
>>>> > Division of Public Health Sciences
>>>> > Fred Hutchinson Cancer Research Center
>>>> > 1100 Fairview Ave. N, M1-B514
>>>> > P.O. Box 19024
>>>> > Seattle, WA 98109-1024
>>>> >
>>>> > E-mail:hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>>> > Phone:(206) 667-5791 <tel:%28206%29%20667-5791>
>>>> > Fax:(206) 667-1319 <tel:%28206%29%20667-1319>
>>>>
>>>> --------------------------------
>>>> Peter Hickey,
>>>> PhD Student/Research Assistant,
>>>> Bioinformatics Division,
>>>> Walter and Eliza Hall Institute of Medical Research,
>>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>>> Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>>>>
>>>> hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>>>> http://www.wehi.edu.au
>>>>
>>>>
>>>> ____________________________________________________________
>>>> __________
>>>> The information in this email is confidential and
>>>> intend...{{dropped:8}}
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
>>>> list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioc-devel
mailing list