[Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method
Michael Lawrence
lawrence.michael at gene.com
Tue Sep 9 13:02:25 CEST 2014
I'm in favor of this display. The seqinfo output at the bottom has always
been annoying (over-emphasized).
On Mon, Sep 8, 2014 at 10:08 PM, Vincent Carey <stvjc at channing.harvard.edu>
wrote:
>
>
> On Tue, Sep 9, 2014 at 12:30 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>> On 09/08/2014 06:42 PM, Michael Lawrence wrote:
>>
>>> Instead of printing out multiple lines of a table that is rarely of
>>> interest, could we develop Peter's idea toward something like:
>>>
>>> hg19:chr1 hg19:chr2 ...
>>> [lengths ...]
>>>
>>> Not sure what condensed notation would be useful for circularity.
>>>
>>
>> I don't know either. I'm worried that this would make the seqinfo
>> stuff look like a named vector and that the user would expect
>> hg19:chr1, hg19:chr2, etc... to be valid names.
>>
>> With the table-like layout, some screen real estate can always be
>> saved by printing less lines:
>>
>>
> What I had in mind was
>
>
>> > gr
>> GRanges with 3 ranges and 0 metadata columns:
>>
> genome: hg19
>
>> seqnames ranges strand
>> <Rle> <IRanges> <Rle>
>> [1] chr14 [19069583, 19069654] +
>> [2] chr14 [19363738, 19363809] +
>> [3] chr14 [19363755, 19363826] -
>> [4] chr14 [19369799, 19369870] +
>>
>
>
> you could then probably dispense with the seqlengths. i have
> never found them too useful except as a key to the genome.
>
> if there are multiple genomes, we have something like
>
> genomes: hg19, mm9
>
> the point is to make it prominent, particularly at a time of transition.
>
>
>
>> --- seqinfo: 60 seqlevels (2 circulars) on 2 genomes (hg19, mm10) ---
>> seqlevels seqlengths isCircular genome
>> chr1 249250621 <NA> hg19
>> chr10 135534747 <NA> hg19
>> ... ... ... ...
>> chrX 155270560 <NA> hg19
>> chrY 59373566 <NA> hg19
>>
>> I agree that the exact content of the seqinfo table itself is rarely
>> of interest so printing only 3 or 4 lines is OK. IMO it's important
>> to make the user aware of the existence of this hidden table and to
>> display it like what it really is (i.e. a table). Also displaying the
>> column names is a well established tradition and serves the purpose
>> of providing a quick summary of the accessors that are available to
>> access those fields.
>>
>> H.
>>
>>
>>>
>>> On Mon, Sep 8, 2014 at 5:21 PM, Peter Hickey <hickey at wehi.edu.au
>>> <mailto:hickey at wehi.edu.au>> wrote:
>>>
>>> Perhaps it might be useful to have some way of highlighting if any
>>> of the chromosomes are circular or highlighting if there are
>>> multiple genomes present? Otherwise this information might be hidden
>>> in the "…"
>>>
>>> Cheers,
>>> Pete
>>>
>>>
>>> On 09/09/2014, at 9:44 AM, Hervé Pagès <hpages at fhcrc.org
>>> <mailto:hpages at fhcrc.org>> wrote:
>>>
>>> > On 09/08/2014 02:28 PM, Peter Hickey wrote:
>>> >> Just a vote for still allowing for multiple genomes in a Seqinfo
>>> object (in a GRanges object). My use case is in bisulfite-sequencing
>>> experiments where there is often a spike-in of a lambda phage genome
>>> along with the genome of interest (human or mouse). It's often
>>> useful to keep all data from a single library together in the same
>>> objet but process according to genome(x) for each seqlevel.
>>> >
>>> > Note taken. Thanks Pete! It's always great to know about concrete
>>> use
>>> > cases.
>>> >
>>> >>
>>> >> FWIW, I like Vincent's proposal of selectSome(unique(genome(x)))
>>> in the show method.
>>> >
>>> > Or what about displaying the genome next to the seqlevel it's
>>> > associated with? Like e.g.:
>>> >
>>> > > gr
>>> > GRanges with 3 ranges and 0 metadata columns:
>>> > seqnames ranges strand
>>> > <Rle> <IRanges> <Rle>
>>> > [1] chr14 [19069583, 19069654] +
>>> > [2] chr14 [19363738, 19363809] +
>>> > [3] chr14 [19363755, 19363826] -
>>> > [4] chr14 [19369799, 19369870] +
>>> > ---
>>> > seqinfo:
>>> > seqlevels seqlengths isCircular genome
>>> > chr1 249250621 <NA> hg19
>>> > chr10 135534747 <NA> hg19
>>> > chr11 135006516 <NA> hg19
>>> > ... ... ... ...
>>> > chrUn_gl000249 38502 <NA> hg19
>>> > chrX 155270560 <NA> hg19
>>> > chrY 59373566 <NA> hg19
>>> >
>>> > That way, we also raise awareness about the isCircular field.
>>> > The current choice to only display the seqlengths pre-dates the
>>> > existence of the seqinfo slot but might be a little bit misleading
>>> > those days since it only exposes some arbitrary seqinfo fields.
>>> >
>>> > H.
>>> >
>>> >>
>>> >> Cheers,
>>> >> Pete
>>> >>
>>> >>
>>> >>> I might have requested the genome annotation, but I'm pretty
>>> sure it wasn't
>>> >>> me who decide on tracking it on a per-sequence basis. I could
>>> imagine use
>>> >>> cases for that though, e.g., when diagnosing sequencing
>>> contamination (like
>>> >>> human vs. mouse). But most other tools and file formats expect
>>> a single
>>> >>> genome per "track", so, for example, rtracklayer has an
>>> internal function
>>> >>> singleGenome() to take care of this.
>>> >>>
>>> >>> On Mon, Sep 8, 2014 at 10:50 AM, Herv? Pag?s <hpages at fhcrc.org
>>> <mailto:hpages at fhcrc.org>> wrote:
>>> >>>
>>> >>>> Hi Vince,
>>> >>>>
>>> >>>> Yes it would make sense to have the "show" method report the
>>> genome
>>> >>>> when genome(x) contains a unique non-NA value. I think the main
>>> >>>> use case for having the genome defined at the sequence level
>>> instead
>>> >>>> of the whole object level is metagenomics. Maybe Michael has
>>> some other
>>> >>>> good use cases to share since IIRC he requested the addition
>>> of the
>>> >>>> genome field a couple of years ago and made the case for
>>> having it
>>> >>>> defined at the sequence level.
>>> >>>>
>>> >>>> Cheers,
>>> >>>> H.
>>> >>>>
>>> >>>>
>>> >>>> On 09/08/2014 07:21 AM, Vincent Carey wrote:
>>> >>>>
>>> >>>>> For GRanges x, my naive expectation is that genome(x) returns
>>> a length-
>>> >>>>>
>>> >>>>> one tag identifying the genome to which chromosomal
>>> coordinates
>>> >>>>>
>>> >>>>> correspond. The genome() method seems to have
>>> sequence-specific
>>> >>>>>
>>> >>>>> semantics, which makes sense, but when we identify sequence
>>> >>>>>
>>> >>>>> with chromosome, it seems too complicated. Is there a use
>>> case for
>>> >>>>>
>>> >>>>> a GRanges with sequences from several different genomes?
>>> >>>>>
>>> >>>>>
>>> >>>>> One reason I am inquiring is that I feel it would be nice to
>>> have the
>>> >>>>> GRanges show() method report, prominently, the genome in use
>>> (or NA
>>> >>>>>
>>> >>>>> if unspecified). This could be accomplished by reporting
>>> >>>>> unique(genome(x)), and perhaps that would be satisfactory.
>>> >>>>>
>>> >>>>> after example(genome) :
>>> >>>>>
>>> >>>>> seqinfo(txdb)
>>> >>>>>>
>>> >>>>>
>>> >>>>> Seqinfo of length 15
>>> >>>>>
>>> >>>>> seqnames seqlengths isCircular genome
>>> >>>>>
>>> >>>>> CH2L 23011544 FALSE dm3
>>> >>>>>
>>> >>>>> CH2R 21146708 FALSE dm3
>>> >>>>>
>>> >>>>> CH3L 24543557 FALSE dm3
>>> >>>>>
>>> >>>>> CH3R 27905053 FALSE dm3
>>> >>>>>
>>> >>>>> CH4 1351857 FALSE dm3
>>> >>>>>
>>> >>>>> ... ... ... ...
>>> >>>>>
>>> >>>>> CH3LHet 2555491 FALSE dm3
>>> >>>>>
>>> >>>>> CH3RHet 2517507 FALSE dm3
>>> >>>>>
>>> >>>>> CHXHet 204112 FALSE dm3
>>> >>>>>
>>> >>>>> CHYHet 347038 FALSE dm3
>>> >>>>>
>>> >>>>> CHUextra 29004656 FALSE dm3
>>> >>>>>
>>> >>>>> genome(seqinfo(txdb))
>>> >>>>>>
>>> >>>>>
>>> >>>>> CH2L CH2R CH3L CH3R CH4 CHX
>>> CHU M
>>> >>>>>
>>> >>>>> "dm3" "dm3" "dm3" "dm3" "dm3" "dm3"
>>> "dm3" "dm3"
>>> >>>>>
>>> >>>>> CH2LHet CH2RHet CH3LHet CH3RHet CHXHet CHYHet CHUextra
>>> >>>>>
>>> >>>>> "dm3" "dm3" "dm3" "dm3" "dm3" "dm3" "dm3"
>>> >>>>>
>>> >>>>> [[alternative HTML version deleted]]
>>> >>>>>
>>> >>>>> _______________________________________________
>>> >>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>> mailing list
>>> >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>> >>>>>
>>> >>>>>
>>> >>>> --
>>> >>>> Herv? Pag?s
>>> >>>>
>>> >>>> Program in Computational Biology
>>> >>>> Division of Public Health Sciences
>>> >>>> Fred Hutchinson Cancer Research Center
>>> >>>> 1100 Fairview Ave. N, M1-B514
>>> >>>> P.O. Box 19024
>>> >>>> Seattle, WA 98109-1024
>>> >>>>
>>> >>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>> >>>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>> >>>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>> >>>>
>>> >>>>
>>> >>>> _______________________________________________
>>> >>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>> mailing list
>>> >>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>> >>>>
>>> >>
>>> >> --------------------------------
>>> >> Peter Hickey,
>>> >> PhD Student/Research Assistant,
>>> >> Bioinformatics Division,
>>> >> Walter and Eliza Hall Institute of Medical Research,
>>> >> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>> >> Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>>> >>
>>> >> hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>>> >> http://www.wehi.edu.au
>>> >>
>>> >>
>>> ____________________________________________________________
>>> __________
>>> >> The information in this email is confidential and
>>> intend...{{dropped:6}}
>>> >>
>>> >> _______________________________________________
>>> >>Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>> mailing list
>>> >>https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>> >>
>>> >
>>> > --
>>> > Hervé Pagès
>>> >
>>> > Program in Computational Biology
>>> > Division of Public Health Sciences
>>> > Fred Hutchinson Cancer Research Center
>>> > 1100 Fairview Ave. N, M1-B514
>>> > P.O. Box 19024
>>> > Seattle, WA 98109-1024
>>> >
>>> > E-mail:hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>> > Phone:(206) 667-5791 <tel:%28206%29%20667-5791>
>>> > Fax:(206) 667-1319 <tel:%28206%29%20667-1319>
>>>
>>> --------------------------------
>>> Peter Hickey,
>>> PhD Student/Research Assistant,
>>> Bioinformatics Division,
>>> Walter and Eliza Hall Institute of Medical Research,
>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>> Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>>>
>>> hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>>> http://www.wehi.edu.au
>>>
>>>
>>> ____________________________________________________________
>>> __________
>>> The information in this email is confidential and
>>> intend...{{dropped:8}}
>>>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
>>> list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>
[[alternative HTML version deleted]]
More information about the Bioc-devel
mailing list