[Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

Hervé Pagès hpages at fhcrc.org
Mon Sep 8 22:00:53 CEST 2014


On 09/08/2014 11:41 AM, Michael Lawrence wrote:
> I might have requested the genome annotation, but I'm pretty sure it
> wasn't me who decide on tracking it on a per-sequence basis.

OK, maybe. Don't trust my memory too much on this. No regrets though.
I think it was the right thing to do ;-) Just because the SAM/BAM
format does it is a good enough reason for us to do it too. According
to the SAM Spec, the header can look like:

   @HD	VN:1.3	SO:coordinate
   @SQ	SN:chr1_hg19	LN:45	AS:hg19
   @SQ	SN:chr1_mm10	LN:42	AS:mm10

The only problem is that seqinfo(BamFile("test.bam")) seems to ignore
the AS tag (genome assembly identifier) at the moment:

   > seqinfo(BamFile("test.bam"))
   Seqinfo of length 2
   seqnames seqlengths isCircular genome
   chr1_hg19        45         NA   <NA>
   chr1_mm10        42         NA   <NA>

Hopefully that can be addressed. But that's a separate issue...

Cheers,
H.


> I could
> imagine use cases for that though, e.g., when diagnosing sequencing
> contamination (like human vs. mouse). But most other tools and file
> formats expect a single genome per "track", so, for example, rtracklayer
> has an internal function singleGenome() to take care of this.
>
> On Mon, Sep 8, 2014 at 10:50 AM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Hi Vince,
>
>     Yes it would make sense to have the "show" method report the genome
>     when genome(x) contains a unique non-NA value. I think the main
>     use case for having the genome defined at the sequence level instead
>     of the whole object level is metagenomics. Maybe Michael has some other
>     good use cases to share since IIRC he requested the addition of the
>     genome field a couple of years ago and made the case for having it
>     defined at the sequence level.
>
>     Cheers,
>     H.
>
>
>     On 09/08/2014 07:21 AM, Vincent Carey wrote:
>
>         For GRanges x, my naive expectation is that genome(x) returns a
>         length-
>
>         one tag identifying the genome to which chromosomal coordinates
>
>         correspond.  The genome() method seems to have sequence-specific
>
>         semantics, which makes sense, but when we identify sequence
>
>         with chromosome, it seems too complicated.  Is there a use case for
>
>         a GRanges with sequences from several different genomes?
>
>
>         One reason I am inquiring is that I feel it would be nice to
>         have the
>         GRanges show() method report, prominently, the genome in use (or NA
>
>         if unspecified).  This could be accomplished by reporting
>         unique(genome(x)), and perhaps that would be satisfactory.
>
>         after example(genome) :
>
>             seqinfo(txdb)
>
>
>         Seqinfo of length 15
>
>         seqnames seqlengths isCircular genome
>
>         CH2L       23011544      FALSE    dm3
>
>         CH2R       21146708      FALSE    dm3
>
>         CH3L       24543557      FALSE    dm3
>
>         CH3R       27905053      FALSE    dm3
>
>         CH4         1351857      FALSE    dm3
>
>         ...             ...        ...    ...
>
>         CH3LHet     2555491      FALSE    dm3
>
>         CH3RHet     2517507      FALSE    dm3
>
>         CHXHet       204112      FALSE    dm3
>
>         CHYHet       347038      FALSE    dm3
>
>         CHUextra   29004656      FALSE    dm3
>
>             genome(seqinfo(txdb))
>
>
>               CH2L     CH2R     CH3L     CH3R      CH4      CHX
>         CHU        M
>
>              "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"
>         "dm3"    "dm3"
>
>            CH2LHet  CH2RHet  CH3LHet  CH3RHet   CHXHet   CHYHet CHUextra
>
>              "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"    "dm3"
>
>                  [[alternative HTML version deleted]]
>
>         _________________________________________________
>         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>         mailing list
>         https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>     _________________________________________________
>     Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
>     https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>     <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list