[Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels

Hervé Pagès hpages at fhcrc.org
Tue May 21 07:36:03 CEST 2013


Michael,

On 05/20/2013 08:44 PM, Michael Lawrence wrote:
>
>
>
> On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Hi,
>
>
>     On 05/20/2013 03:15 PM, Michael Lawrence wrote:
>
>         On Mon, May 20, 2013 at 3:11 PM, Martin Morgan
>         <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>> wrote:
>
>             Hi Michael --
>
>
>             On 5/20/2013 5:56 AM, Michael Lawrence wrote:
>
>                 While it's cool that seqlevels<- has become so flexible,
>                 I still claim
>                 that
>                 rename/keep/drop would be a lot easier to read, because
>                 they describe the
>                 high-level operation, and there's no reason to deprecate
>                 them. They're
>                 also
>                 easier to remember. For example, for dropping with
>                 seqlevels<-, the user
>                 needs
>                 to remember that "force" is necessary to drop. Much
>                 easier to just say
>                 "dropSeqlevels(), please". And reimplementing
>                 keepSeqlevels is still too
>                 complicated. Is it such a maintenance burden to have
>                 these simple
>                 wrappers sit
>                 on top of the low-level manipulators?
>
>                 Another suggestion: renameSeqlevels should not require a
>                 named vector (in
>                 cases
>                 where it is unnamed, require that it is parallel to the
>                 existing
>                 seqlevels, as
>                 with seqlevels<-).
>
>
>             I didn't really indicate what drove my desire to see
>             keepSeqlevels
>             deprecated. keepSeqlevels, seqlevels<-, and isActiveSeq were
>             developed more
>             or less independently, and have different contracts. The
>             different
>             contracts are confusing to the user, as is creating a usable
>             help system
>             when there are multiple end points for what is a common
>             operation. The help
>             pages of each were inadequate in different ways. And because
>             the code was
>             developed independently, support for different objects was
>             inconsistent. So
>             actually, yes, the maintenance (and use) burden for the
>             previous state of
>             affairs was substantial.
>
>             On the other hand, I agree that keepSeqlevels is convenient
>             as a simple
>             wrapper around seqlevels<-, in the same way that setNames
>             and names<- are
>             both useful.
>
>             So we could iterate to
>
>                 keepSeqlevels <-
>                     function(x, value, ...)
>                 {
>                     seqlevels(x, force=TRUE) <- value
>                     x
>                 }
>
>             but I'd be less enthusiastic about maintaining the original
>             contract of
>             keepSeqlevels, where keepSeqlevels(gr1, gr2), would have
>             worked for two
>             GRanges objects.
>
>
>     Why would this be called keepSeqlevels() if, depending on how it's used,
>     it will either add, drop, rename, or permute the seqlevels?
>     Couldn't this be called setSeqlevels?
>
>
> I thought that new2old was necessary for permuting.

The seqlevels setter has no 'new2old' arg.

> You're right that
> adding should be disallowed for keepSeqlevels(). Adding seqlevels is not
> a common operation. The two common operations are:
> * Permuting, usually because the data were imported in non-canonical
> order (seqnameStyle was designed to address this, no?).

Permuting is straightforward with seqlevels<-:

   > seqlevels(gr)
   [1] "chr1" "chr2" "chr3"

   > seqlevels(gr) <- rev(seqlevels(gr))

   > seqlevels(gr)
   [1] "chr3" "chr2" "chr1"

> * Subsetting, either by keeping or dropping.

Also straightforward:

   > seqlevels(gr, force=TRUE) <- "chr2"

   > seqlevels(gr)
   [1] "chr2"

> Two main reasons: taking a
> small slice of the data for prototyping, or removing problematic
> chromosomes (sex, circular). This goes back to bsapply and the exclude
> argument.

There are other important use cases:

   - Dropping seqlevels that are *not* in use. This happens for example
     when subsetting a BAM file with Rsamtools::filterBam to keep only
     the alignments located on a given chromosome. The entire sequence
     dictionary (in the header) is propagated to the resulting BAM file
     so when you read the file with readGAlignments() you end up with a
     bunch of useless seqlevels that you generally start by dropping.
     You don't want to drop any alignment so force=FALSE is your friend.

   - An even more common operation is renaming: 90% of the times that's
     what I use seqlevels<- for, and that's what I tell people to use
     when they need to rename their chromosomes. Also straightforward:

       > seqlevels(gr)
       [1] "chr1" "chr2" "chr3" "chrM"

       > seqlevels(gr) <- sub("^chr", "", seqlevels(gr))
       > seqlevels(gr)
       [1] "1" "2" "3" "M"

       > seqlevels(gr)[seqlevels(gr) == "M"] <- "MT"
       > seqlevels(gr)
       [1] "1"  "2"  "3"  "MT"

     Note that this is simpler to use than renameSeqlevels() which
     always required you to build the entire right value as a named
     vector.

So the added-value of keepSeqlevels() seems to boil down to:

   (a) it always uses force=TRUE

   (b) it preserves the original object and returns the modified object

If you want to restrict the use of keepSeqlevels() to permuting and
dropping, you'll also have to disallow renaming (in addition to adding).
Then its name will more or less reflect what it does (if "keep" means
"permute" or "drop"). The final result will be something that does less
than setSeqlevels() and that is not necessarily easier to read: both
will set on the object whatever vector of seqlevels they are supplied,
except that one will fail when doing so would actually result in adding
or renaming seqlevels.

H.

>
> Michael
>
>
>     H.
>
>
>
>         Well, I wasn't even aware of that feature, so you've made your
>         point about
>         the documentation ;) Sounds like a good compromise to me.
>
>         Thanks for understanding,
>         Michael
>
>
>
>             Martin
>
>
>                 Michael
>
>
>
>
>
>
>                 On Sat, May 18, 2013 at 6:00 PM, Martin Morgan
>                 <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>                 <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>>
>                 wrote:
>
>                       On 05/18/2013 05:42 PM, Martin Morgan wrote:
>
>                           Some of the most common operations are
>                 straight-forward, e.g.,
>
>                                 > gr = GRanges(paste0("chr", c(1:22,
>                 "X", "Y")), IRanges(1,
>                 100))
>                                 > seqlevels(gr) = sub("chr", "ch",
>                 seqlevels(gr))
>
>
>                       To get a more comprehensive example I should have
>                 followed my own
>                 advice and
>                       grabbed from the help page!
>
>                             ## Rename:
>                             seqlevels(txdb) <- sub("chr", "",
>                 seqlevels(txdb))
>                             seqlevels(txdb)
>
>                             seqlevels(txdb) <- paste0("CH", seqlevels(txdb))
>                             seqlevels(txdb)
>
>                             seqlevels(txdb)[seqlevels(__**__txdb) ==
>                 "CHM"] <- "M"
>
>
>                             seqlevels(txdb)
>
>
>                       --
>                       Computational Biology / Fred Hutchinson Cancer
>                 Research Center
>                       1100 Fairview Ave. N.
>                       PO Box 19024 Seattle, WA 98109
>
>                       Location: Arnold Building M1 B861
>                       Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>                 <tel:%28206%29%20667-2793>
>
>
>
>
>             --
>             Dr. Martin Morgan, PhD
>
>             Fred Hutchinson Cancer Research Center
>             1100 Fairview Ave. N.
>             PO Box 19024 Seattle, WA 98109
>
>
>                  [[alternative HTML version deleted]]
>
>         _________________________________________________
>         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>         mailing list
>         https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list