[Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method
Michael Lawrence
lawrence.michael at gene.com
Tue Sep 9 15:38:07 CEST 2014
Agreed, that looks a lot nicer.
On Tue, Sep 9, 2014 at 4:42 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 09/09/2014 04:02 AM, Michael Lawrence wrote:
>
>> I'm in favor of this display. The seqinfo output at the bottom has always
>> been annoying (over-emphasized).
>>
>
> the fact that the lengths are 'NA' can be a helpful prompt to do something
> about it, e.g., add seqinfo when inputting the data. Also they are helpful
> when one is told that seqlengths are incompatible during, e.g.,
> findOverlaps. But I like the idea of less but more informative display of
> seqinfo, along the lines suggested by Vince.
>
> seqinfo: 60 seqlevels (2 circular) on 2 genomes (hg19, mm10); 60 'NA'
> seqlengths
>
> Martin
>
>
>> On Mon, Sep 8, 2014 at 10:08 PM, Vincent Carey <
>> stvjc at channing.harvard.edu>
>> wrote:
>>
>>
>>>
>>> On Tue, Sep 9, 2014 at 12:30 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>
>>> On 09/08/2014 06:42 PM, Michael Lawrence wrote:
>>>>
>>>> Instead of printing out multiple lines of a table that is rarely of
>>>>> interest, could we develop Peter's idea toward something like:
>>>>>
>>>>> hg19:chr1 hg19:chr2 ...
>>>>> [lengths ...]
>>>>>
>>>>> Not sure what condensed notation would be useful for circularity.
>>>>>
>>>>>
>>>> I don't know either. I'm worried that this would make the seqinfo
>>>> stuff look like a named vector and that the user would expect
>>>> hg19:chr1, hg19:chr2, etc... to be valid names.
>>>>
>>>> With the table-like layout, some screen real estate can always be
>>>> saved by printing less lines:
>>>>
>>>>
>>>> What I had in mind was
>>>
>>>
>>> > gr
>>>> GRanges with 3 ranges and 0 metadata columns:
>>>>
>>>> genome: hg19
>>>
>>> seqnames ranges strand
>>>> <Rle> <IRanges> <Rle>
>>>> [1] chr14 [19069583, 19069654] +
>>>> [2] chr14 [19363738, 19363809] +
>>>> [3] chr14 [19363755, 19363826] -
>>>> [4] chr14 [19369799, 19369870] +
>>>>
>>>>
>>>
>>> you could then probably dispense with the seqlengths. i have
>>> never found them too useful except as a key to the genome.
>>>
>>> if there are multiple genomes, we have something like
>>>
>>> genomes: hg19, mm9
>>>
>>> the point is to make it prominent, particularly at a time of transition.
>>>
>>>
>>>
>>> --- seqinfo: 60 seqlevels (2 circulars) on 2 genomes (hg19, mm10) ---
>>>> seqlevels seqlengths isCircular
>>>> genome
>>>> chr1 249250621 <NA>
>>>> hg19
>>>> chr10 135534747 <NA>
>>>> hg19
>>>> ... ... ...
>>>> ...
>>>> chrX 155270560 <NA>
>>>> hg19
>>>> chrY 59373566 <NA>
>>>> hg19
>>>>
>>>> I agree that the exact content of the seqinfo table itself is rarely
>>>> of interest so printing only 3 or 4 lines is OK. IMO it's important
>>>> to make the user aware of the existence of this hidden table and to
>>>> display it like what it really is (i.e. a table). Also displaying the
>>>> column names is a well established tradition and serves the purpose
>>>> of providing a quick summary of the accessors that are available to
>>>> access those fields.
>>>>
>>>> H.
>>>>
>>>>
>>>>
>>>>> On Mon, Sep 8, 2014 at 5:21 PM, Peter Hickey <hickey at wehi.edu.au
>>>>> <mailto:hickey at wehi.edu.au>> wrote:
>>>>>
>>>>> Perhaps it might be useful to have some way of highlighting if any
>>>>> of the chromosomes are circular or highlighting if there are
>>>>> multiple genomes present? Otherwise this information might be
>>>>> hidden
>>>>> in the "…"
>>>>>
>>>>> Cheers,
>>>>> Pete
>>>>>
>>>>>
>>>>> On 09/09/2014, at 9:44 AM, Hervé Pagès <hpages at fhcrc.org
>>>>> <mailto:hpages at fhcrc.org>> wrote:
>>>>>
>>>>> > On 09/08/2014 02:28 PM, Peter Hickey wrote:
>>>>> >> Just a vote for still allowing for multiple genomes in a
>>>>> Seqinfo
>>>>> object (in a GRanges object). My use case is in
>>>>> bisulfite-sequencing
>>>>> experiments where there is often a spike-in of a lambda phage
>>>>> genome
>>>>> along with the genome of interest (human or mouse). It's often
>>>>> useful to keep all data from a single library together in the same
>>>>> objet but process according to genome(x) for each seqlevel.
>>>>> >
>>>>> > Note taken. Thanks Pete! It's always great to know about
>>>>> concrete
>>>>> use
>>>>> > cases.
>>>>> >
>>>>> >>
>>>>> >> FWIW, I like Vincent's proposal of
>>>>> selectSome(unique(genome(x)))
>>>>> in the show method.
>>>>> >
>>>>> > Or what about displaying the genome next to the seqlevel it's
>>>>> > associated with? Like e.g.:
>>>>> >
>>>>> > > gr
>>>>> > GRanges with 3 ranges and 0 metadata columns:
>>>>> > seqnames ranges strand
>>>>> > <Rle> <IRanges> <Rle>
>>>>> > [1] chr14 [19069583, 19069654] +
>>>>> > [2] chr14 [19363738, 19363809] +
>>>>> > [3] chr14 [19363755, 19363826] -
>>>>> > [4] chr14 [19369799, 19369870] +
>>>>> > ---
>>>>> > seqinfo:
>>>>> > seqlevels seqlengths isCircular genome
>>>>> > chr1 249250621 <NA> hg19
>>>>> > chr10 135534747 <NA> hg19
>>>>> > chr11 135006516 <NA> hg19
>>>>> > ... ... ... ...
>>>>> > chrUn_gl000249 38502 <NA> hg19
>>>>> > chrX 155270560 <NA> hg19
>>>>> > chrY 59373566 <NA> hg19
>>>>> >
>>>>> > That way, we also raise awareness about the isCircular field.
>>>>> > The current choice to only display the seqlengths pre-dates the
>>>>> > existence of the seqinfo slot but might be a little bit
>>>>> misleading
>>>>> > those days since it only exposes some arbitrary seqinfo fields.
>>>>> >
>>>>> > H.
>>>>> >
>>>>> >>
>>>>> >> Cheers,
>>>>> >> Pete
>>>>> >>
>>>>> >>
>>>>> >>> I might have requested the genome annotation, but I'm pretty
>>>>> sure it wasn't
>>>>> >>> me who decide on tracking it on a per-sequence basis. I could
>>>>> imagine use
>>>>> >>> cases for that though, e.g., when diagnosing sequencing
>>>>> contamination (like
>>>>> >>> human vs. mouse). But most other tools and file formats
>>>>> expect
>>>>> a single
>>>>> >>> genome per "track", so, for example, rtracklayer has an
>>>>> internal function
>>>>> >>> singleGenome() to take care of this.
>>>>> >>>
>>>>> >>> On Mon, Sep 8, 2014 at 10:50 AM, Herv? Pag?s <
>>>>> hpages at fhcrc.org
>>>>> <mailto:hpages at fhcrc.org>> wrote:
>>>>> >>>
>>>>> >>>> Hi Vince,
>>>>> >>>>
>>>>> >>>> Yes it would make sense to have the "show" method report the
>>>>> genome
>>>>> >>>> when genome(x) contains a unique non-NA value. I think the
>>>>> main
>>>>> >>>> use case for having the genome defined at the sequence level
>>>>> instead
>>>>> >>>> of the whole object level is metagenomics. Maybe Michael has
>>>>> some other
>>>>> >>>> good use cases to share since IIRC he requested the addition
>>>>> of the
>>>>> >>>> genome field a couple of years ago and made the case for
>>>>> having it
>>>>> >>>> defined at the sequence level.
>>>>> >>>>
>>>>> >>>> Cheers,
>>>>> >>>> H.
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> On 09/08/2014 07:21 AM, Vincent Carey wrote:
>>>>> >>>>
>>>>> >>>>> For GRanges x, my naive expectation is that genome(x)
>>>>> returns
>>>>> a length-
>>>>> >>>>>
>>>>> >>>>> one tag identifying the genome to which chromosomal
>>>>> coordinates
>>>>> >>>>>
>>>>> >>>>> correspond. The genome() method seems to have
>>>>> sequence-specific
>>>>> >>>>>
>>>>> >>>>> semantics, which makes sense, but when we identify sequence
>>>>> >>>>>
>>>>> >>>>> with chromosome, it seems too complicated. Is there a use
>>>>> case for
>>>>> >>>>>
>>>>> >>>>> a GRanges with sequences from several different genomes?
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> One reason I am inquiring is that I feel it would be nice
>>>>> to
>>>>> have the
>>>>> >>>>> GRanges show() method report, prominently, the genome in
>>>>> use
>>>>> (or NA
>>>>> >>>>>
>>>>> >>>>> if unspecified). This could be accomplished by reporting
>>>>> >>>>> unique(genome(x)), and perhaps that would be satisfactory.
>>>>> >>>>>
>>>>> >>>>> after example(genome) :
>>>>> >>>>>
>>>>> >>>>> seqinfo(txdb)
>>>>> >>>>>>
>>>>> >>>>>
>>>>> >>>>> Seqinfo of length 15
>>>>> >>>>>
>>>>> >>>>> seqnames seqlengths isCircular genome
>>>>> >>>>>
>>>>> >>>>> CH2L 23011544 FALSE dm3
>>>>> >>>>>
>>>>> >>>>> CH2R 21146708 FALSE dm3
>>>>> >>>>>
>>>>> >>>>> CH3L 24543557 FALSE dm3
>>>>> >>>>>
>>>>> >>>>> CH3R 27905053 FALSE dm3
>>>>> >>>>>
>>>>> >>>>> CH4 1351857 FALSE dm3
>>>>> >>>>>
>>>>> >>>>> ... ... ... ...
>>>>> >>>>>
>>>>> >>>>> CH3LHet 2555491 FALSE dm3
>>>>> >>>>>
>>>>> >>>>> CH3RHet 2517507 FALSE dm3
>>>>> >>>>>
>>>>> >>>>> CHXHet 204112 FALSE dm3
>>>>> >>>>>
>>>>> >>>>> CHYHet 347038 FALSE dm3
>>>>> >>>>>
>>>>> >>>>> CHUextra 29004656 FALSE dm3
>>>>> >>>>>
>>>>> >>>>> genome(seqinfo(txdb))
>>>>> >>>>>>
>>>>> >>>>>
>>>>> >>>>> CH2L CH2R CH3L CH3R CH4 CHX
>>>>> CHU M
>>>>> >>>>>
>>>>> >>>>> "dm3" "dm3" "dm3" "dm3" "dm3" "dm3"
>>>>> "dm3" "dm3"
>>>>> >>>>>
>>>>> >>>>> CH2LHet CH2RHet CH3LHet CH3RHet CHXHet CHYHet
>>>>> CHUextra
>>>>> >>>>>
>>>>> >>>>> "dm3" "dm3" "dm3" "dm3" "dm3" "dm3"
>>>>> "dm3"
>>>>> >>>>>
>>>>> >>>>> [[alternative HTML version deleted]]
>>>>> >>>>>
>>>>> >>>>> _______________________________________________
>>>>> >>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>>> mailing list
>>>>> >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>> --
>>>>> >>>> Herv? Pag?s
>>>>> >>>>
>>>>> >>>> Program in Computational Biology
>>>>> >>>> Division of Public Health Sciences
>>>>> >>>> Fred Hutchinson Cancer Research Center
>>>>> >>>> 1100 Fairview Ave. N, M1-B514
>>>>> >>>> P.O. Box 19024
>>>>> >>>> Seattle, WA 98109-1024
>>>>> >>>>
>>>>> >>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>>>> >>>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>>>> >>>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> _______________________________________________
>>>>> >>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>>> mailing list
>>>>> >>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>> >>>>
>>>>> >>
>>>>> >> --------------------------------
>>>>> >> Peter Hickey,
>>>>> >> PhD Student/Research Assistant,
>>>>> >> Bioinformatics Division,
>>>>> >> Walter and Eliza Hall Institute of Medical Research,
>>>>> >> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>>>> >> Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>>>>> >>
>>>>> >> hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>>>>> >> http://www.wehi.edu.au
>>>>> >>
>>>>> >>
>>>>> ____________________________________________________________
>>>>> __________
>>>>> >> The information in this email is confidential and
>>>>> intend...{{dropped:6}}
>>>>> >>
>>>>> >> _______________________________________________
>>>>> >>Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>>> mailing list
>>>>> >>https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>> >>
>>>>> >
>>>>> > --
>>>>> > Hervé Pagès
>>>>> >
>>>>> > Program in Computational Biology
>>>>> > Division of Public Health Sciences
>>>>> > Fred Hutchinson Cancer Research Center
>>>>> > 1100 Fairview Ave. N, M1-B514
>>>>> > P.O. Box 19024
>>>>> > Seattle, WA 98109-1024
>>>>> >
>>>>> > E-mail:hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>>>> > Phone:(206) 667-5791 <tel:%28206%29%20667-5791>
>>>>> > Fax:(206) 667-1319 <tel:%28206%29%20667-1319>
>>>>>
>>>>> --------------------------------
>>>>> Peter Hickey,
>>>>> PhD Student/Research Assistant,
>>>>> Bioinformatics Division,
>>>>> Walter and Eliza Hall Institute of Medical Research,
>>>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>>>> Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>>>>>
>>>>> hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>>>>> http://www.wehi.edu.au
>>>>>
>>>>>
>>>>> ____________________________________________________________
>>>>> __________
>>>>> The information in this email is confidential and
>>>>> intend...{{dropped:8}}
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>>> mailing
>>>>> list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>>
>>>>>
>>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org
>>>> Phone: (206) 667-5791
>>>> Fax: (206) 667-1319
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>>
>>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
[[alternative HTML version deleted]]
More information about the Bioc-devel
mailing list