[Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels
Hervé Pagès
hpages at fhcrc.org
Tue May 21 23:34:12 CEST 2013
On 05/21/2013 08:00 AM, Michael Lawrence wrote:
>
>
>
> On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
> Michael,
>
>
> On 05/20/2013 08:44 PM, Michael Lawrence wrote:
>
>
>
>
> On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>
> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:
>
> Hi,
>
>
> On 05/20/2013 03:15 PM, Michael Lawrence wrote:
>
> On Mon, May 20, 2013 at 3:11 PM, Martin Morgan
> <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
> <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>> wrote:
>
> Hi Michael --
>
>
> On 5/20/2013 5:56 AM, Michael Lawrence wrote:
>
> While it's cool that seqlevels<- has become so
> flexible,
> I still claim
> that
> rename/keep/drop would be a lot easier to read,
> because
> they describe the
> high-level operation, and there's no reason to
> deprecate
> them. They're
> also
> easier to remember. For example, for dropping with
> seqlevels<-, the user
> needs
> to remember that "force" is necessary to drop. Much
> easier to just say
> "dropSeqlevels(), please". And reimplementing
> keepSeqlevels is still too
> complicated. Is it such a maintenance burden to
> have
> these simple
> wrappers sit
> on top of the low-level manipulators?
>
> Another suggestion: renameSeqlevels should not
> require a
> named vector (in
> cases
> where it is unnamed, require that it is
> parallel to the
> existing
> seqlevels, as
> with seqlevels<-).
>
>
> I didn't really indicate what drove my desire to see
> keepSeqlevels
> deprecated. keepSeqlevels, seqlevels<-, and
> isActiveSeq were
> developed more
> or less independently, and have different
> contracts. The
> different
> contracts are confusing to the user, as is creating
> a usable
> help system
> when there are multiple end points for what is a common
> operation. The help
> pages of each were inadequate in different ways.
> And because
> the code was
> developed independently, support for different
> objects was
> inconsistent. So
> actually, yes, the maintenance (and use) burden for the
> previous state of
> affairs was substantial.
>
> On the other hand, I agree that keepSeqlevels is
> convenient
> as a simple
> wrapper around seqlevels<-, in the same way that
> setNames
> and names<- are
> both useful.
>
> So we could iterate to
>
> keepSeqlevels <-
> function(x, value, ...)
> {
> seqlevels(x, force=TRUE) <- value
> x
> }
>
> but I'd be less enthusiastic about maintaining the
> original
> contract of
> keepSeqlevels, where keepSeqlevels(gr1, gr2), would
> have
> worked for two
> GRanges objects.
>
>
> Why would this be called keepSeqlevels() if, depending on
> how it's used,
> it will either add, drop, rename, or permute the seqlevels?
> Couldn't this be called setSeqlevels?
>
>
> I thought that new2old was necessary for permuting.
>
>
> The seqlevels setter has no 'new2old' arg.
>
>
> You're right that
> adding should be disallowed for keepSeqlevels(). Adding
> seqlevels is not
> a common operation. The two common operations are:
> * Permuting, usually because the data were imported in non-canonical
> order (seqnameStyle was designed to address this, no?).
>
>
> Permuting is straightforward with seqlevels<-:
>
> > seqlevels(gr)
> [1] "chr1" "chr2" "chr3"
>
> > seqlevels(gr) <- rev(seqlevels(gr))
>
> > seqlevels(gr)
> [1] "chr3" "chr2" "chr1"
>
>
> * Subsetting, either by keeping or dropping.
>
>
> Also straightforward:
>
> > seqlevels(gr, force=TRUE) <- "chr2"
>
> > seqlevels(gr)
> [1] "chr2"
>
>
> Two main reasons: taking a
> small slice of the data for prototyping, or removing problematic
> chromosomes (sex, circular). This goes back to bsapply and the
> exclude
> argument.
>
>
> There are other important use cases:
>
> - Dropping seqlevels that are *not* in use. This happens for example
> when subsetting a BAM file with Rsamtools::filterBam to keep only
> the alignments located on a given chromosome. The entire sequence
> dictionary (in the header) is propagated to the resulting BAM file
> so when you read the file with readGAlignments() you end up with a
> bunch of useless seqlevels that you generally start by dropping.
> You don't want to drop any alignment so force=FALSE is your friend.
>
>
> This is sort of tangential to the discussion, but do you really want to
> do this? I would preserve the universe as given by the BAM.
>
> - An even more common operation is renaming: 90% of the times that's
> what I use seqlevels<- for, and that's what I tell people to use
> when they need to rename their chromosomes. Also straightforward:
>
> > seqlevels(gr)
> [1] "chr1" "chr2" "chr3" "chrM"
>
> > seqlevels(gr) <- sub("^chr", "", seqlevels(gr))
> > seqlevels(gr)
> [1] "1" "2" "3" "M"
>
> > seqlevels(gr)[seqlevels(gr) == "M"] <- "MT"
> > seqlevels(gr)
> [1] "1" "2" "3" "MT"
>
> Note that this is simpler to use than renameSeqlevels() which
> always required you to build the entire right value as a named
> vector.
>
>
> Sure, renaming is a common use case. Not sure how I forgot that.
>
> As I wrote earlier in the thread, renameSeqlevels should be changed so
> as not to require naming of the vector.
>
> So the added-value of keepSeqlevels() seems to boil down to:
>
> (a) it always uses force=TRUE
>
> (b) it preserves the original object and returns the modified object
>
> If you want to restrict the use of keepSeqlevels() to permuting and
> dropping, you'll also have to disallow renaming (in addition to adding).
> Then its name will more or less reflect what it does (if "keep" means
> "permute" or "drop"). The final result will be something that does less
> than setSeqlevels() and that is not necessarily easier to read: both
> will set on the object whatever vector of seqlevels they are supplied,
> except that one will fail when doing so would actually result in adding
> or renaming seqlevels.
>
>
> So it looks like seqlevels<- is pretty powerful. Now I see that if I
> scroll down to the examples in the man page, I find the actual
> documentation. It's cool to learn that we can replace seqlevels on TxDb
> objects.
Has been working on TranscriptDb objects for more than
> My argument has always been that since these low-level
> operations are so powerful, it's nice to have high-level operations to
> clarify the code. We have to think hard to know what the RHS will do in
> that replacement.
When you do something like seqlevels(gr) <- new_seqlevels, you're saying
"can you please replace the current seqlevels by those new ones".
That's all. The request/contract is simple and all the complexity is
taken care of internally by seqlevels<-. It might be that, depending on
what's currently on 'gr', this will actually result in dropping, adding,
permuting. Or *a combination of the 3*. Like here:
> seqlevels(gr)
[1] "chr1" "chr2" "chr3"
> seqlevels(gr) <- c("chr4", "chr3", "chr1")
> seqlevels(gr)
[1] "chr4" "chr3" "chr1"
# one seqlevel was drop, one was added, and the 2 seqlevels that were
# preserved were reordered.
When you want to subset a vector to keep all elements with an even
index you do x[c(FALSE, TRUE)]. Is that keeping or dropping elements?
Do we need to have a subsetting operator for keeping elements, another
one for dropping elements, and another one for permuting them (like
in x[sample(length(x))])?
H.
> It's probably one of the most complex replacement
> operations; far more complex than the typical one, including levels<- on
> factor. There's nothing wrong with that; it's just complexity that we
> should be able to hide.
>
> Michael
>
> H.
>
>
> Michael
>
>
> H.
>
>
>
> Well, I wasn't even aware of that feature, so you've
> made your
> point about
> the documentation ;) Sounds like a good compromise to me.
>
> Thanks for understanding,
> Michael
>
>
>
> Martin
>
>
> Michael
>
>
>
>
>
>
> On Sat, May 18, 2013 at 6:00 PM, Martin Morgan
> <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
> <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
> <mailto:mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org> <mailto:mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>>>>
>
> wrote:
>
> On 05/18/2013 05:42 PM, Martin Morgan wrote:
>
> Some of the most common operations are
> straight-forward, e.g.,
>
> > gr = GRanges(paste0("chr",
> c(1:22,
> "X", "Y")), IRanges(1,
> 100))
> > seqlevels(gr) = sub("chr", "ch",
> seqlevels(gr))
>
>
> To get a more comprehensive example I
> should have
> followed my own
> advice and
> grabbed from the help page!
>
> ## Rename:
> seqlevels(txdb) <- sub("chr", "",
> seqlevels(txdb))
> seqlevels(txdb)
>
> seqlevels(txdb) <- paste0("CH",
> seqlevels(txdb))
> seqlevels(txdb)
>
>
> seqlevels(txdb)[seqlevels(__**____txdb) ==
>
> "CHM"] <- "M"
>
>
> seqlevels(txdb)
>
>
> --
> Computational Biology / Fred Hutchinson
> Cancer
> Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
> <tel:%28206%29%20667-2793> <tel:%28206%29%20667-2793>
> <tel:%28206%29%20667-2793>
>
>
>
>
> --
> Dr. Martin Morgan, PhD
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
>
> [[alternative HTML version deleted]]
>
> ___________________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> <mailto:Bioc-devel at r-project.__org
> <mailto:Bioc-devel at r-project.org>>
> mailing list
> https://stat.ethz.ch/mailman/____listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
> <tel:%28206%29%20667-1319>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list