[Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels

Hervé Pagès hpages at fhcrc.org
Tue Sep 17 21:10:45 CEST 2013


On 09/17/2013 11:30 AM, Michael Lawrence wrote:
> Sounds great. Could we have a more intuitive name? Like naturalOrder().

That would be confusing because order() does something different:

   > order(c("chrY", "chr1", "chr9", "chr2"))
   [1] 2 4 3 1

   > makeSeqnameIds(c("chrY", "chr1", "chr9", "chr2"))
   [1] 4 1 3 2

makeSeqnameIds() is more like rank() except on how it deals with
duplicates:

   > makeSeqnameIds(c("chrY", "chr1", "chr9", "chr2", "chr9"))
   [1] 4 1 3 2 3

Whatever 'ties.method' you use, rank() will put "chrY" at rank 5.

IIRC makeSeqnameIds() needs to support duplicates because of how it's
used internally in the makeTranscriptDb* functions.

Alternatively we could add sortSeqlevels() as a wrapper to
makeSeqnameIds() so you would be able to do:

   seqlevels(gr) <- sortSeqlevels(seqlevels(gr))

Does that work?

Thanks,
H.


> And then the method to fix the GRanges could be restoreNaturalOrder().
> Just some suggestions...
>
>
>
>
> On Tue, Sep 17, 2013 at 10:58 AM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     For sorting the chromosome names in "natural" order, you can use
>     the makeSeqnameIds() helper function in GenomicRanges:
>
>        seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(__seqlevels(gr))]
>
>     It recognizes several naming conventions (chr-prefixed or not,
>     roman, etc...) e.g.:
>
>        > makeSeqnameIds(c("Y", "1", "9", "M", "2"))
>        [1] 4 1 3 5 2
>
>        > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
>        [1] 4 1 3 5 2
>
>     I still need to improve the documentation of this function, put
>     more examples, and add unit tests. It's used internally by the
>     makeTranscriptDb* functions in GenomicFeatures to assign internal
>     ids to the chromosomes so that all the TranscriptDb extractors
>     can then return GRanges and GRangesList objects with the seqlevels
>     in the "natural" order.
>
>     Cheers,
>     H.
>
>
>
>     On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:
>
>         Actually, what I (and I think several others) just want is
>             fixSeqnames()
>         which sorts and fixes them to some convention.  I don't really
>         care which
>         one, but I want to do some easy harmonization, preferably in
>         line with
>         other bioc tools.  I know this is hard to do for all organisms,
>         but having
>         something that deals with the major ones (human, mouse, fruit
>         fly, yeast,
>         some monkeys) would be extremely valuable.  If this function
>         existed, I
>         could just throw all my GRanges at it.
>
>         Kasper
>
>
>         On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence
>         <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>
>
>             wrote:
>
>
>             When I hit this I usually just do:
>
>             seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))
>
>             It would be nice if there were a function for sorting
>             chromosome names
>             according to a specified naming convention. Like
>             sortSeqlevels(gr,
>             UCSCNaming()) or the alternative replacement syntax:
>             seqinfo(gr) <-
>             sort(seqinfo(gr), UCSCNaming()). Although sort is only
>             single-dispatch.
>
>             Michael
>
>
>
>
>
>
>             On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
>             <stvjc at channing.harvard.edu
>             <mailto:stvjc at channing.harvard.edu>>__wrote:
>
>                 I am replying to this as Michael mentions that this is a
>                 powerful
>                 operation.  Here's my
>                 problem that I believe is related, but I do not see a
>                 straightforward
>                 solution
>
>                     bhmm19uni
>
>                 GRanges with 559456 ranges and 4 metadata columns:
>                            seqnames                 ranges strand   |
>                             name
>                 score
>                               <Rle>              <IRanges>  <Rle>   |
>                      <character>
>                 <numeric>
>                          1     chr1 [ 67229025,  67341224]      *   |
>                 13_Heterochrom/lo
>                       0
>                          2     chr1 [203057955, 203060154]      *   |
>                     12_Repressed
>                       0
>                          3     chr1 [  8458427,   8486226]      *   |
>                 13_Heterochrom/lo
>                       0
>                          4     chr1 [ 16904427,  16904826]      *   |
>                 5_Strong_Enhancer
>                       0
>                          5     chr1 [ 25289427,  25293426]      *   |
>                      11_Weak_Txn
>                       0
>                        ...      ...                    ...    ... ...
>                              ...
>                     ...
>                     570766    chr22   [51100931, 51101330]      *   |
>                 2_Weak_Promoter
>                       0
>                     570767    chr22   [51101331, 51101530]      *   |
>                 6_Weak_Enhancer
>                       0
>                     570768    chr22   [51101531, 51101730]      *   |
>                 2_Weak_Promoter
>                       0
>                     570769    chr22   [51101731, 51178130]      *   |
>                 13_Heterochrom/lo
>                       0
>                     570770    chr22   [51178131, 51178330]      *   |
>                 15_Repetitive/CNV
>                       0
>                                             thick     numcode
>                                         <IRanges> <character>
>                          1 [ 67001613,  67113812]          13
>                          2 [201324578, 201326777]          12
>                          3 [  8381014,   8408813]          13
>                          4 [ 16777014,  16777413]           5
>                          5 [ 25162014,  25166013]          11
>                        ...                    ...         ...
>                     570766   [49447797, 49448196]           2
>                     570767   [49448197, 49448396]           6
>                     570768   [49448397, 49448596]           2
>                     570769   [49448597, 49524996]          13
>                     570770   [49524997, 49525196]          15
>                     ---
>                     seqlengths:
>                           chr1     chr10     chr11      chr5 ...
>                   chr2     chr21
>                    chr4
>                      249250621 135534747 135006516 180915260 ...
>                 243199373  48129895
>                 191154276
>
>                 If I try to make a karyogram with ggbio, the chromosomes
>                 are in an
>                 unnatural order.  What is the right approach to getting
>                 the seqinfo in a
>                 natural order with respect to chromosome names and
>                 lengths?  Reassigning
>                 seqlevels seems dangerous but I haven't experimented
>                 fully.  It must be a
>                 common use case but I do not see it in doc.
>
>
>
>                 On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
>                 lawrence.michael at gene.com
>                 <mailto:lawrence.michael at gene.com>> wrote:
>
>                     On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès
>                     <hpages at fhcrc.org <mailto:hpages at fhcrc.org>> wrote:
>
>                         Michael,
>
>
>                         On 05/20/2013 08:44 PM, Michael Lawrence wrote:
>
>
>
>
>                             On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès
>                             <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>                             <mailto:hpages at fhcrc.org
>                             <mailto:hpages at fhcrc.org>>> wrote:
>
>                                   Hi,
>
>
>                                   On 05/20/2013 03:15 PM, Michael
>                             Lawrence wrote:
>
>                                       On Mon, May 20, 2013 at 3:11 PM,
>                             Martin Morgan
>                                       <mtmorgan at fhcrc.org
>                             <mailto:mtmorgan at fhcrc.org>
>                             <mailto:mtmorgan at fhcrc.org
>                             <mailto:mtmorgan at fhcrc.org>>> wrote:
>
>                                           Hi Michael --
>
>
>                                           On 5/20/2013 5:56 AM, Michael
>                             Lawrence wrote:
>
>                                               While it's cool that
>                             seqlevels<- has become so
>
>                     flexible,
>
>                                               I still claim
>                                               that
>                                               rename/keep/drop would be
>                             a lot easier to read,
>
>             because
>
>                                               they describe the
>                                               high-level operation, and
>                             there's no reason to
>
>                     deprecate
>
>                                               them. They're
>                                               also
>                                               easier to remember. For
>                             example, for dropping with
>                                               seqlevels<-, the user
>                                               needs
>                                               to remember that "force"
>                             is necessary to drop. Much
>                                               easier to just say
>                                               "dropSeqlevels(), please".
>                             And reimplementing
>                                               keepSeqlevels is still too
>                                               complicated. Is it such a
>                             maintenance burden to have
>                                               these simple
>                                               wrappers sit
>                                               on top of the low-level
>                             manipulators?
>
>                                               Another suggestion:
>                             renameSeqlevels should not
>
>             require
>
>                     a
>
>                                               named vector (in
>                                               cases
>                                               where it is unnamed,
>                             require that it is parallel to
>
>             the
>
>                                               existing
>                                               seqlevels, as
>                                               with seqlevels<-).
>
>
>                                           I didn't really indicate what
>                             drove my desire to see
>                                           keepSeqlevels
>                                           deprecated. keepSeqlevels,
>                             seqlevels<-, and isActiveSeq
>
>                     were
>
>                                           developed more
>                                           or less independently, and
>                             have different contracts. The
>                                           different
>                                           contracts are confusing to the
>                             user, as is creating a
>
>                     usable
>
>                                           help system
>                                           when there are multiple end
>                             points for what is a common
>                                           operation. The help
>                                           pages of each were inadequate
>                             in different ways. And
>
>                     because
>
>                                           the code was
>                                           developed independently,
>                             support for different objects
>
>             was
>
>                                           inconsistent. So
>                                           actually, yes, the maintenance
>                             (and use) burden for the
>                                           previous state of
>                                           affairs was substantial.
>
>                                           On the other hand, I agree
>                             that keepSeqlevels is
>
>             convenient
>
>                                           as a simple
>                                           wrapper around seqlevels<-, in
>                             the same way that setNames
>                                           and names<- are
>                                           both useful.
>
>                                           So we could iterate to
>
>                                               keepSeqlevels <-
>                                                   function(x, value, ...)
>                                               {
>                                                   seqlevels(x,
>                             force=TRUE) <- value
>                                                   x
>                                               }
>
>                                           but I'd be less enthusiastic
>                             about maintaining the
>
>             original
>
>                                           contract of
>                                           keepSeqlevels, where
>                             keepSeqlevels(gr1, gr2), would have
>                                           worked for two
>                                           GRanges objects.
>
>
>                                   Why would this be called
>                             keepSeqlevels() if, depending on how
>
>             it's
>
>                             used,
>                                   it will either add, drop, rename, or
>                             permute the seqlevels?
>                                   Couldn't this be called setSeqlevels?
>
>
>                             I thought that new2old was necessary for
>                             permuting.
>
>
>                         The seqlevels setter has no 'new2old' arg.
>
>
>                            You're right that
>
>                             adding should be disallowed for
>                             keepSeqlevels(). Adding seqlevels is
>
>                     not
>
>                             a common operation. The two common
>                             operations are:
>                             * Permuting, usually because the data were
>                             imported in non-canonical
>                             order (seqnameStyle was designed to address
>                             this, no?).
>
>
>                         Permuting is straightforward with seqlevels<-:
>
>                             > seqlevels(gr)
>                             [1] "chr1" "chr2" "chr3"
>
>                             > seqlevels(gr) <- rev(seqlevels(gr))
>
>                             > seqlevels(gr)
>                             [1] "chr3" "chr2" "chr1"
>
>
>                            * Subsetting, either by keeping or dropping.
>
>
>
>                         Also straightforward:
>
>                             > seqlevels(gr, force=TRUE) <- "chr2"
>
>                             > seqlevels(gr)
>                             [1] "chr2"
>
>
>                            Two main reasons: taking a
>
>                             small slice of the data for prototyping, or
>                             removing problematic
>                             chromosomes (sex, circular). This goes back
>                             to bsapply and the
>
>             exclude
>
>                             argument.
>
>
>                         There are other important use cases:
>
>                             - Dropping seqlevels that are *not* in use.
>                         This happens for example
>                               when subsetting a BAM file with
>                         Rsamtools::filterBam to keep only
>                               the alignments located on a given
>                         chromosome. The entire sequence
>                               dictionary (in the header) is propagated
>                         to the resulting BAM file
>                               so when you read the file with
>                         readGAlignments() you end up with a
>                               bunch of useless seqlevels that you
>                         generally start by dropping.
>                               You don't want to drop any alignment so
>                         force=FALSE is your
>
>             friend.
>
>
>
>                     This is sort of tangential to the discussion, but do
>                     you really want to
>
>             do
>
>                     this? I would preserve the universe as given by the BAM.
>
>
>                             - An even more common operation is renaming:
>                         90% of the times that's
>                               what I use seqlevels<- for, and that's
>                         what I tell people to use
>                               when they need to rename their
>                         chromosomes. Also straightforward:
>
>                                 > seqlevels(gr)
>                                 [1] "chr1" "chr2" "chr3" "chrM"
>
>                                 > seqlevels(gr) <- sub("^chr", "",
>                         seqlevels(gr))
>                                 > seqlevels(gr)
>                                 [1] "1" "2" "3" "M"
>
>                                 > seqlevels(gr)[seqlevels(gr) == "M"] <-
>                         "MT"
>                                 > seqlevels(gr)
>                                 [1] "1"  "2"  "3"  "MT"
>
>                               Note that this is simpler to use than
>                         renameSeqlevels() which
>                               always required you to build the entire
>                         right value as a named
>                               vector.
>
>
>                     Sure, renaming is a common use case. Not sure how I
>                     forgot that.
>
>                     As I wrote earlier in the thread, renameSeqlevels
>                     should be changed so
>
>             as
>
>                     not to require naming of the vector.
>
>
>                         So the added-value of keepSeqlevels() seems to
>                         boil down to:
>
>                             (a) it always uses force=TRUE
>
>                             (b) it preserves the original object and
>                         returns the modified object
>
>                         If you want to restrict the use of
>                         keepSeqlevels() to permuting and
>                         dropping, you'll also have to disallow renaming
>                         (in addition to
>
>             adding).
>
>                         Then its name will more or less reflect what it
>                         does (if "keep" means
>                         "permute" or "drop"). The final result will be
>                         something that does
>
>             less
>
>                         than setSeqlevels() and that is not necessarily
>                         easier to read: both
>                         will set on the object whatever vector of
>                         seqlevels they are supplied,
>                         except that one will fail when doing so would
>                         actually result in
>
>             adding
>
>                         or renaming seqlevels.
>
>
>                     So it looks like seqlevels<- is pretty powerful. Now
>                     I see that if I
>                     scroll
>                     down to the examples in the man page, I find the
>                     actual documentation.
>                     It's
>                     cool to learn that we can replace seqlevels on TxDb
>                     objects. My argument
>                     has always been that since these low-level
>                     operations are so powerful,
>                     it's
>                     nice to have high-level operations to clarify the
>                     code. We have to think
>                     hard to know what the RHS will do in that
>                     replacement. It's probably one
>                     of
>                     the most complex replacement operations; far more
>                     complex than the
>
>             typical
>
>                     one, including levels<- on factor. There's nothing
>                     wrong with that; it's
>                     just complexity that we should be able to hide.
>
>                     Michael
>
>                     H.
>
>
>
>                             Michael
>
>
>                                   H.
>
>
>
>                                       Well, I wasn't even aware of that
>                             feature, so you've made
>
>             your
>
>                                       point about
>                                       the documentation ;) Sounds like a
>                             good compromise to me.
>
>                                       Thanks for understanding,
>                                       Michael
>
>
>
>                                           Martin
>
>
>                                               Michael
>
>
>
>
>
>
>                                               On Sat, May 18, 2013 at
>                             6:00 PM, Martin Morgan
>                                               <mtmorgan at fhcrc.org
>                             <mailto:mtmorgan at fhcrc.org>
>                             <mailto:mtmorgan at fhcrc.org
>                             <mailto:mtmorgan at fhcrc.org>>
>                                               <mailto:mtmorgan at fhcrc.org
>                             <mailto:mtmorgan at fhcrc.org> <mailto:
>
>             mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>
>
>
>                                               wrote:
>
>                                                     On 05/18/2013 05:42
>                             PM, Martin Morgan wrote:
>
>                                                         Some of the most
>                             common operations are
>                                               straight-forward, e.g.,
>
>                                                               > gr =
>                             GRanges(paste0("chr", c(1:22,
>                                               "X", "Y")), IRanges(1,
>                                               100))
>                                                               >
>                             seqlevels(gr) = sub("chr", "ch",
>                                               seqlevels(gr))
>
>
>                                                     To get a more
>                             comprehensive example I should
>
>             have
>
>                                               followed my own
>                                               advice and
>                                                     grabbed from the
>                             help page!
>
>                                                           ## Rename:
>
>                               seqlevels(txdb) <- sub("chr", "",
>                                               seqlevels(txdb))
>                                                           seqlevels(txdb)
>
>
>                               seqlevels(txdb) <- paste0("CH",
>                             seqlevels(txdb))
>                                                           seqlevels(txdb)
>
>
>                               seqlevels(txdb)[seqlevels(__**__**__txdb)
>
>             ==
>
>
>
>                                               "CHM"] <- "M"
>
>
>                                                           seqlevels(txdb)
>
>
>                                                     --
>                                                     Computational
>                             Biology / Fred Hutchinson Cancer
>                                               Research Center
>                                                     1100 Fairview Ave. N.
>                                                     PO Box 19024
>                             Seattle, WA 98109
>
>                                                     Location: Arnold
>                             Building M1 B861
>                                                     Phone: (206)
>                             667-2793 <tel:%28206%29%20667-2793>
>
>             <tel:%28206%29%20667-2793>
>
>                                               <tel:%28206%29%20667-2793>
>
>
>
>
>                                           --
>                                           Dr. Martin Morgan, PhD
>
>                                           Fred Hutchinson Cancer
>                             Research Center
>                                           1100 Fairview Ave. N.
>                                           PO Box 19024 Seattle, WA 98109
>
>
>                                                [[alternative HTML
>                             version deleted]]
>
>
>                               ________________________________**___________________
>                             Bioc-devel at r-project.org
>                             <mailto:Bioc-devel at r-project.org>
>                             <mailto:Bioc-devel at r-project.
>                             <mailto:Bioc-devel at r-project.>*__*org<
>
>                     Bioc-devel at r-project.org
>                     <mailto:Bioc-devel at r-project.org>>
>
>
>                                       mailing list
>                             https://stat.ethz.ch/mailman/___**_listinfo/bioc-devel
>                             <https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel><
>
>                     https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>                     <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
>
>
>
>                               <https://stat.ethz.ch/mailman/__**listinfo/bioc-devel <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>
>                     https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>                     <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>
>
>
>
>                                   --
>                                   Hervé Pagès
>
>                                   Program in Computational Biology
>                                   Division of Public Health Sciences
>
>                                   Fred Hutchinson Cancer Research Center
>                                   1100 Fairview Ave. N, M1-B514
>                                   P.O. Box 19024
>                                   Seattle, WA 98109-1024
>
>                                   E-mail: hpages at fhcrc.org
>                             <mailto:hpages at fhcrc.org>
>                             <mailto:hpages at fhcrc.org
>                             <mailto:hpages at fhcrc.org>>
>                                   Phone: (206) 667-5791
>                             <tel:%28206%29%20667-5791>
>                             <tel:%28206%29%20667-5791>
>                                   Fax: (206) 667-1319
>                             <tel:%28206%29%20667-1319>
>                             <tel:%28206%29%20667-1319>
>
>
>
>                         --
>                         Hervé Pagès
>
>                         Program in Computational Biology
>                         Division of Public Health Sciences
>                         Fred Hutchinson Cancer Research Center
>                         1100 Fairview Ave. N, M1-B514
>                         P.O. Box 19024
>                         Seattle, WA 98109-1024
>
>                         E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>                         Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>                         Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>                               [[alternative HTML version deleted]]
>
>
>                     _________________________________________________
>                     Bioc-devel at r-project.org
>                     <mailto:Bioc-devel at r-project.org> mailing list
>                     https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>                     <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>
>
>                       [[alternative HTML version deleted]]
>
>
>             _________________________________________________
>             Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>             mailing list
>             https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>
>                  [[alternative HTML version deleted]]
>
>
>
>         _________________________________________________
>         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>         mailing list
>         https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list