[Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

Tue Sep 9 01:44:29 CEST 2014

On 09/08/2014 02:28 PM, Peter Hickey wrote:
> Just a vote for still allowing for multiple genomes in a Seqinfo object (in a GRanges object). My use case is in bisulfite-sequencing experiments where there is often a spike-in of a lambda phage genome along with the genome of interest (human or mouse). It's often useful to keep all data from a single library together in the same objet but process according to genome(x) for each seqlevel.

Note taken. Thanks Pete! It's always great to know about concrete use
cases.

>
> FWIW, I like Vincent's proposal of selectSome(unique(genome(x))) in the show method.

Or what about displaying the genome next to the seqlevel it's
associated with? Like e.g.:

   > gr
   GRanges with 3 ranges and 0 metadata columns:
         seqnames               ranges strand
            <Rle>            <IRanges>  <Rle>
     [1]    chr14 [19069583, 19069654]      +
     [2]    chr14 [19363738, 19363809]      +
     [3]    chr14 [19363755, 19363826]      -
     [4]    chr14 [19369799, 19369870]      +
     ---
     seqinfo:
       seqlevels             seqlengths isCircular genome
       chr1                   249250621       <NA>   hg19
       chr10                  135534747       <NA>   hg19
       chr11                  135006516       <NA>   hg19
       ...                          ...        ...    ...
       chrUn_gl000249             38502       <NA>   hg19
       chrX                   155270560       <NA>   hg19
       chrY                    59373566       <NA>   hg19

That way, we also raise awareness about the isCircular field.
The current choice to only display the seqlengths pre-dates the
existence of the seqinfo slot but might be a little bit misleading
those days since it only exposes some arbitrary seqinfo fields.

H.

>
> Cheers,
> Pete
>
>
>> I might have requested the genome annotation, but I'm pretty sure it wasn't
>> me who decide on tracking it on a per-sequence basis. I could imagine use
>> cases for that though, e.g., when diagnosing sequencing contamination (like
>> human vs. mouse). But most other tools and file formats expect a single
>> genome per "track", so, for example, rtracklayer has an internal function
>> singleGenome() to take care of this.
>>
>> On Mon, Sep 8, 2014 at 10:50 AM, Herv? Pag?s <hpages at fhcrc.org> wrote:
>>
>>> Hi Vince,
>>>
>>> Yes it would make sense to have the "show" method report the genome
>>> when genome(x) contains a unique non-NA value. I think the main
>>> use case for having the genome defined at the sequence level instead
>>> of the whole object level is metagenomics. Maybe Michael has some other
>>> good use cases to share since IIRC he requested the addition of the
>>> genome field a couple of years ago and made the case for having it
>>> defined at the sequence level.
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>> On 09/08/2014 07:21 AM, Vincent Carey wrote:
>>>
>>>> For GRanges x, my naive expectation is that genome(x) returns a length-
>>>>
>>>> one tag identifying the genome to which chromosomal coordinates
>>>>
>>>> correspond.  The genome() method seems to have sequence-specific
>>>>
>>>> semantics, which makes sense, but when we identify sequence
>>>>
>>>> with chromosome, it seems too complicated.  Is there a use case for
>>>>
>>>> a GRanges with sequences from several different genomes?
>>>>
>>>>
>>>> One reason I am inquiring is that I feel it would be nice to have the
>>>> GRanges show() method report, prominently, the genome in use (or NA
>>>>
>>>> if unspecified).  This could be accomplished by reporting
>>>> unique(genome(x)), and perhaps that would be satisfactory.
>>>>
>>>> after example(genome) :
>>>>
>>>> seqinfo(txdb)
>>>>>
>>>>
>>>> Seqinfo of length 15
>>>>
>>>> seqnames seqlengths isCircular genome
>>>>
>>>> CH2L       23011544      FALSE    dm3
>>>>
>>>> CH2R       21146708      FALSE    dm3
>>>>
>>>> CH3L       24543557      FALSE    dm3
>>>>
>>>> CH3R       27905053      FALSE    dm3
>>>>
>>>> CH4         1351857      FALSE    dm3
>>>>
>>>> ...             ...        ...    ...
>>>>
>>>> CH3LHet     2555491      FALSE    dm3
>>>>
>>>> CH3RHet     2517507      FALSE    dm3
>>>>
>>>> CHXHet       204112      FALSE    dm3
>>>>
>>>> CHYHet       347038      FALSE    dm3
>>>>
>>>> CHUextra   29004656      FALSE    dm3
>>>>
>>>> genome(seqinfo(txdb))
>>>>>
>>>>
>>>>      CH2L     CH2R     CH3L     CH3R      CH4      CHX      CHU        M
>>>>
>>>>     "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"
>>>>
>>>>   CH2LHet  CH2RHet  CH3LHet  CH3RHet   CHXHet   CHYHet CHUextra
>>>>
>>>>     "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"
>>>>
>>>>         [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>> --
>>> Herv? Pag?s
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>
> --------------------------------
> Peter Hickey,
> PhD Student/Research Assistant,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Ph: +613 9345 2324
>
> hickey at wehi.edu.au
> http://www.wehi.edu.au
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:21}}