[Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels

Hervé Pagès hpages at fhcrc.org
Tue May 21 23:34:12 CEST 2013



On 05/21/2013 08:00 AM, Michael Lawrence wrote:
>
>
>
> On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Michael,
>
>
>     On 05/20/2013 08:44 PM, Michael Lawrence wrote:
>
>
>
>
>         On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès <hpages at fhcrc.org
>         <mailto:hpages at fhcrc.org>
>         <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:
>
>              Hi,
>
>
>              On 05/20/2013 03:15 PM, Michael Lawrence wrote:
>
>                  On Mon, May 20, 2013 at 3:11 PM, Martin Morgan
>                  <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>         <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>> wrote:
>
>                      Hi Michael --
>
>
>                      On 5/20/2013 5:56 AM, Michael Lawrence wrote:
>
>                          While it's cool that seqlevels<- has become so
>         flexible,
>                          I still claim
>                          that
>                          rename/keep/drop would be a lot easier to read,
>         because
>                          they describe the
>                          high-level operation, and there's no reason to
>         deprecate
>                          them. They're
>                          also
>                          easier to remember. For example, for dropping with
>                          seqlevels<-, the user
>                          needs
>                          to remember that "force" is necessary to drop. Much
>                          easier to just say
>                          "dropSeqlevels(), please". And reimplementing
>                          keepSeqlevels is still too
>                          complicated. Is it such a maintenance burden to
>         have
>                          these simple
>                          wrappers sit
>                          on top of the low-level manipulators?
>
>                          Another suggestion: renameSeqlevels should not
>         require a
>                          named vector (in
>                          cases
>                          where it is unnamed, require that it is
>         parallel to the
>                          existing
>                          seqlevels, as
>                          with seqlevels<-).
>
>
>                      I didn't really indicate what drove my desire to see
>                      keepSeqlevels
>                      deprecated. keepSeqlevels, seqlevels<-, and
>         isActiveSeq were
>                      developed more
>                      or less independently, and have different
>         contracts. The
>                      different
>                      contracts are confusing to the user, as is creating
>         a usable
>                      help system
>                      when there are multiple end points for what is a common
>                      operation. The help
>                      pages of each were inadequate in different ways.
>         And because
>                      the code was
>                      developed independently, support for different
>         objects was
>                      inconsistent. So
>                      actually, yes, the maintenance (and use) burden for the
>                      previous state of
>                      affairs was substantial.
>
>                      On the other hand, I agree that keepSeqlevels is
>         convenient
>                      as a simple
>                      wrapper around seqlevels<-, in the same way that
>         setNames
>                      and names<- are
>                      both useful.
>
>                      So we could iterate to
>
>                          keepSeqlevels <-
>                              function(x, value, ...)
>                          {
>                              seqlevels(x, force=TRUE) <- value
>                              x
>                          }
>
>                      but I'd be less enthusiastic about maintaining the
>         original
>                      contract of
>                      keepSeqlevels, where keepSeqlevels(gr1, gr2), would
>         have
>                      worked for two
>                      GRanges objects.
>
>
>              Why would this be called keepSeqlevels() if, depending on
>         how it's used,
>              it will either add, drop, rename, or permute the seqlevels?
>              Couldn't this be called setSeqlevels?
>
>
>         I thought that new2old was necessary for permuting.
>
>
>     The seqlevels setter has no 'new2old' arg.
>
>
>         You're right that
>         adding should be disallowed for keepSeqlevels(). Adding
>         seqlevels is not
>         a common operation. The two common operations are:
>         * Permuting, usually because the data were imported in non-canonical
>         order (seqnameStyle was designed to address this, no?).
>
>
>     Permuting is straightforward with seqlevels<-:
>
>        > seqlevels(gr)
>        [1] "chr1" "chr2" "chr3"
>
>        > seqlevels(gr) <- rev(seqlevels(gr))
>
>        > seqlevels(gr)
>        [1] "chr3" "chr2" "chr1"
>
>
>         * Subsetting, either by keeping or dropping.
>
>
>     Also straightforward:
>
>        > seqlevels(gr, force=TRUE) <- "chr2"
>
>        > seqlevels(gr)
>        [1] "chr2"
>
>
>         Two main reasons: taking a
>         small slice of the data for prototyping, or removing problematic
>         chromosomes (sex, circular). This goes back to bsapply and the
>         exclude
>         argument.
>
>
>     There are other important use cases:
>
>        - Dropping seqlevels that are *not* in use. This happens for example
>          when subsetting a BAM file with Rsamtools::filterBam to keep only
>          the alignments located on a given chromosome. The entire sequence
>          dictionary (in the header) is propagated to the resulting BAM file
>          so when you read the file with readGAlignments() you end up with a
>          bunch of useless seqlevels that you generally start by dropping.
>          You don't want to drop any alignment so force=FALSE is your friend.
>
>
> This is sort of tangential to the discussion, but do you really want to
> do this? I would preserve the universe as given by the BAM.
>
>        - An even more common operation is renaming: 90% of the times that's
>          what I use seqlevels<- for, and that's what I tell people to use
>          when they need to rename their chromosomes. Also straightforward:
>
>            > seqlevels(gr)
>            [1] "chr1" "chr2" "chr3" "chrM"
>
>            > seqlevels(gr) <- sub("^chr", "", seqlevels(gr))
>            > seqlevels(gr)
>            [1] "1" "2" "3" "M"
>
>            > seqlevels(gr)[seqlevels(gr) == "M"] <- "MT"
>            > seqlevels(gr)
>            [1] "1"  "2"  "3"  "MT"
>
>          Note that this is simpler to use than renameSeqlevels() which
>          always required you to build the entire right value as a named
>          vector.
>
>
> Sure, renaming is a common use case. Not sure how I forgot that.
>
> As I wrote earlier in the thread, renameSeqlevels should be changed so
> as not to require naming of the vector.
>
>     So the added-value of keepSeqlevels() seems to boil down to:
>
>        (a) it always uses force=TRUE
>
>        (b) it preserves the original object and returns the modified object
>
>     If you want to restrict the use of keepSeqlevels() to permuting and
>     dropping, you'll also have to disallow renaming (in addition to adding).
>     Then its name will more or less reflect what it does (if "keep" means
>     "permute" or "drop"). The final result will be something that does less
>     than setSeqlevels() and that is not necessarily easier to read: both
>     will set on the object whatever vector of seqlevels they are supplied,
>     except that one will fail when doing so would actually result in adding
>     or renaming seqlevels.
>
>
> So it looks like seqlevels<- is pretty powerful. Now I see that if I
> scroll down to the examples in the man page, I find the actual
> documentation. It's cool to learn that we can replace seqlevels on TxDb
> objects.

Has been working on TranscriptDb objects for more than

> My argument has always been that since these low-level
> operations are so powerful, it's nice to have high-level operations to
> clarify the code. We have to think hard to know what the RHS will do in
> that replacement.

When you do something like seqlevels(gr) <- new_seqlevels, you're saying
"can you please replace the current seqlevels by those new ones".
That's all. The request/contract is simple and all the complexity is
taken care of internally by seqlevels<-. It might be that, depending on
what's currently on 'gr', this will actually result in dropping, adding,
permuting. Or *a combination of the 3*. Like here:

   > seqlevels(gr)
   [1] "chr1" "chr2" "chr3"

   > seqlevels(gr) <- c("chr4", "chr3", "chr1")
   > seqlevels(gr)
   [1] "chr4" "chr3" "chr1"

   # one seqlevel was drop, one was added, and the 2 seqlevels that were
   # preserved were reordered.

When you want to subset a vector to keep all elements with an even
index you do x[c(FALSE, TRUE)]. Is that keeping or dropping elements?
Do we need to have a subsetting operator for keeping elements, another
one for dropping elements, and another one for permuting them (like
in x[sample(length(x))])?

H.

> It's probably one of the most complex replacement
> operations; far more complex than the typical one, including levels<- on
> factor. There's nothing wrong with that; it's just complexity that we
> should be able to hide.
>
> Michael
>
>     H.
>
>
>         Michael
>
>
>              H.
>
>
>
>                  Well, I wasn't even aware of that feature, so you've
>         made your
>                  point about
>                  the documentation ;) Sounds like a good compromise to me.
>
>                  Thanks for understanding,
>                  Michael
>
>
>
>                      Martin
>
>
>                          Michael
>
>
>
>
>
>
>                          On Sat, May 18, 2013 at 6:00 PM, Martin Morgan
>                          <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>         <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
>                          <mailto:mtmorgan at fhcrc.org
>         <mailto:mtmorgan at fhcrc.org> <mailto:mtmorgan at fhcrc.org
>         <mailto:mtmorgan at fhcrc.org>>>>
>
>                          wrote:
>
>                                On 05/18/2013 05:42 PM, Martin Morgan wrote:
>
>                                    Some of the most common operations are
>                          straight-forward, e.g.,
>
>                                          > gr = GRanges(paste0("chr",
>         c(1:22,
>                          "X", "Y")), IRanges(1,
>                          100))
>                                          > seqlevels(gr) = sub("chr", "ch",
>                          seqlevels(gr))
>
>
>                                To get a more comprehensive example I
>         should have
>                          followed my own
>                          advice and
>                                grabbed from the help page!
>
>                                      ## Rename:
>                                      seqlevels(txdb) <- sub("chr", "",
>                          seqlevels(txdb))
>                                      seqlevels(txdb)
>
>                                      seqlevels(txdb) <- paste0("CH",
>         seqlevels(txdb))
>                                      seqlevels(txdb)
>
>
>         seqlevels(txdb)[seqlevels(__**____txdb) ==
>
>                          "CHM"] <- "M"
>
>
>                                      seqlevels(txdb)
>
>
>                                --
>                                Computational Biology / Fred Hutchinson
>         Cancer
>                          Research Center
>                                1100 Fairview Ave. N.
>                                PO Box 19024 Seattle, WA 98109
>
>                                Location: Arnold Building M1 B861
>                                Phone: (206) 667-2793
>         <tel:%28206%29%20667-2793> <tel:%28206%29%20667-2793>
>                          <tel:%28206%29%20667-2793>
>
>
>
>
>                      --
>                      Dr. Martin Morgan, PhD
>
>                      Fred Hutchinson Cancer Research Center
>                      1100 Fairview Ave. N.
>                      PO Box 19024 Seattle, WA 98109
>
>
>                           [[alternative HTML version deleted]]
>
>                  ___________________________________________________
>         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>         <mailto:Bioc-devel at r-project.__org
>         <mailto:Bioc-devel at r-project.org>>
>                  mailing list
>         https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>
>                  <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>
>
>              --
>              Hervé Pagès
>
>              Program in Computational Biology
>              Division of Public Health Sciences
>
>              Fred Hutchinson Cancer Research Center
>              1100 Fairview Ave. N, M1-B514
>              P.O. Box 19024
>              Seattle, WA 98109-1024
>
>              E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>         <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>              Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>         <tel:%28206%29%20667-5791>
>              Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>         <tel:%28206%29%20667-1319>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list