[Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels
Tim Triche, Jr.
tim.triche at gmail.com
Tue Sep 17 21:12:10 CEST 2013
I have been doing this in a kludgey fashion for a long long time, and would favor restoreNaturalOrder() as a clear name for what it does. It would be great to have this in BioC-2.13 release.
Thanks Michael and Herve and Kasper for bringing up and solving this cleanly.
--t
On Sep 17, 2013, at 11:30 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
> Sounds great. Could we have a more intuitive name? Like naturalOrder(). And
> then the method to fix the GRanges could be restoreNaturalOrder(). Just
> some suggestions...
>
>
>
>
> On Tue, Sep 17, 2013 at 10:58 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>> For sorting the chromosome names in "natural" order, you can use
>> the makeSeqnameIds() helper function in GenomicRanges:
>>
>> seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(**seqlevels(gr))]
>>
>> It recognizes several naming conventions (chr-prefixed or not,
>> roman, etc...) e.g.:
>>
>>> makeSeqnameIds(c("Y", "1", "9", "M", "2"))
>> [1] 4 1 3 5 2
>>
>>> makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
>> [1] 4 1 3 5 2
>>
>> I still need to improve the documentation of this function, put
>> more examples, and add unit tests. It's used internally by the
>> makeTranscriptDb* functions in GenomicFeatures to assign internal
>> ids to the chromosomes so that all the TranscriptDb extractors
>> can then return GRanges and GRangesList objects with the seqlevels
>> in the "natural" order.
>>
>> Cheers,
>> H.
>>
>>
>>
>> On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:
>>
>>> Actually, what I (and I think several others) just want is
>>> fixSeqnames()
>>> which sorts and fixes them to some convention. I don't really care which
>>> one, but I want to do some easy harmonization, preferably in line with
>>> other bioc tools. I know this is hard to do for all organisms, but having
>>> something that deals with the major ones (human, mouse, fruit fly, yeast,
>>> some monkeys) would be extremely valuable. If this function existed, I
>>> could just throw all my GRanges at it.
>>>
>>> Kasper
>>>
>>>
>>> On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence <
>>> lawrence.michael at gene.com
>>>
>>>> wrote:
>>>
>>> When I hit this I usually just do:
>>>>
>>>> seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))
>>>>
>>>> It would be nice if there were a function for sorting chromosome names
>>>> according to a specified naming convention. Like sortSeqlevels(gr,
>>>> UCSCNaming()) or the alternative replacement syntax: seqinfo(gr) <-
>>>> sort(seqinfo(gr), UCSCNaming()). Although sort is only single-dispatch.
>>>>
>>>> Michael
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
>>>> <stvjc at channing.harvard.edu>**wrote:
>>>>
>>>> I am replying to this as Michael mentions that this is a powerful
>>>>> operation. Here's my
>>>>> problem that I believe is related, but I do not see a straightforward
>>>>> solution
>>>>>
>>>>> bhmm19uni
>>>>> GRanges with 559456 ranges and 4 metadata columns:
>>>>> seqnames ranges strand | name
>>>>> score
>>>>> <Rle> <IRanges> <Rle> | <character>
>>>>> <numeric>
>>>>> 1 chr1 [ 67229025, 67341224] * | 13_Heterochrom/lo
>>>>> 0
>>>>> 2 chr1 [203057955, 203060154] * | 12_Repressed
>>>>> 0
>>>>> 3 chr1 [ 8458427, 8486226] * | 13_Heterochrom/lo
>>>>> 0
>>>>> 4 chr1 [ 16904427, 16904826] * | 5_Strong_Enhancer
>>>>> 0
>>>>> 5 chr1 [ 25289427, 25293426] * | 11_Weak_Txn
>>>>> 0
>>>>> ... ... ... ... ... ...
>>>>> ...
>>>>> 570766 chr22 [51100931, 51101330] * | 2_Weak_Promoter
>>>>> 0
>>>>> 570767 chr22 [51101331, 51101530] * | 6_Weak_Enhancer
>>>>> 0
>>>>> 570768 chr22 [51101531, 51101730] * | 2_Weak_Promoter
>>>>> 0
>>>>> 570769 chr22 [51101731, 51178130] * | 13_Heterochrom/lo
>>>>> 0
>>>>> 570770 chr22 [51178131, 51178330] * | 15_Repetitive/CNV
>>>>> 0
>>>>> thick numcode
>>>>> <IRanges> <character>
>>>>> 1 [ 67001613, 67113812] 13
>>>>> 2 [201324578, 201326777] 12
>>>>> 3 [ 8381014, 8408813] 13
>>>>> 4 [ 16777014, 16777413] 5
>>>>> 5 [ 25162014, 25166013] 11
>>>>> ... ... ...
>>>>> 570766 [49447797, 49448196] 2
>>>>> 570767 [49448197, 49448396] 6
>>>>> 570768 [49448397, 49448596] 2
>>>>> 570769 [49448597, 49524996] 13
>>>>> 570770 [49524997, 49525196] 15
>>>>> ---
>>>>> seqlengths:
>>>>> chr1 chr10 chr11 chr5 ... chr2 chr21
>>>>> chr4
>>>>> 249250621 135534747 135006516 180915260 ... 243199373 48129895
>>>>> 191154276
>>>>>
>>>>> If I try to make a karyogram with ggbio, the chromosomes are in an
>>>>> unnatural order. What is the right approach to getting the seqinfo in a
>>>>> natural order with respect to chromosome names and lengths? Reassigning
>>>>> seqlevels seems dangerous but I haven't experimented fully. It must be
>>>>> a
>>>>> common use case but I do not see it in doc.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
>>>>> lawrence.michael at gene.com> wrote:
>>>>>
>>>>> On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès <hpages at fhcrc.org>
>>>>>> wrote:
>>>>>>
>>>>>> Michael,
>>>>>>>
>>>>>>>
>>>>>>> On 05/20/2013 08:44 PM, Michael Lawrence wrote:
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès <hpages at fhcrc.org
>>>>>>>> <mailto:hpages at fhcrc.org>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/20/2013 03:15 PM, Michael Lawrence wrote:
>>>>>>>>
>>>>>>>> On Mon, May 20, 2013 at 3:11 PM, Martin Morgan
>>>>>>>> <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>> wrote:
>>>>>>>>
>>>>>>>> Hi Michael --
>>>>>>>>
>>>>>>>>
>>>>>>>> On 5/20/2013 5:56 AM, Michael Lawrence wrote:
>>>>>>>>
>>>>>>>> While it's cool that seqlevels<- has become so
>>>>>>> flexible,
>>>>>>
>>>>>>> I still claim
>>>>>>>> that
>>>>>>>> rename/keep/drop would be a lot easier to read,
>>>>>>> because
>>>>
>>>>> they describe the
>>>>>>>> high-level operation, and there's no reason to
>>>>>>> deprecate
>>>>>>
>>>>>>> them. They're
>>>>>>>> also
>>>>>>>> easier to remember. For example, for dropping with
>>>>>>>> seqlevels<-, the user
>>>>>>>> needs
>>>>>>>> to remember that "force" is necessary to drop. Much
>>>>>>>> easier to just say
>>>>>>>> "dropSeqlevels(), please". And reimplementing
>>>>>>>> keepSeqlevels is still too
>>>>>>>> complicated. Is it such a maintenance burden to have
>>>>>>>> these simple
>>>>>>>> wrappers sit
>>>>>>>> on top of the low-level manipulators?
>>>>>>>>
>>>>>>>> Another suggestion: renameSeqlevels should not
>>>>>>> require
>>>>
>>>>> a
>>>>>>
>>>>>>> named vector (in
>>>>>>>> cases
>>>>>>>> where it is unnamed, require that it is parallel to
>>>>>>> the
>>>>
>>>>> existing
>>>>>>>> seqlevels, as
>>>>>>>> with seqlevels<-).
>>>>>>>>
>>>>>>>>
>>>>>>>> I didn't really indicate what drove my desire to see
>>>>>>>> keepSeqlevels
>>>>>>>> deprecated. keepSeqlevels, seqlevels<-, and isActiveSeq
>>>>>>> were
>>>>>>
>>>>>>> developed more
>>>>>>>> or less independently, and have different contracts. The
>>>>>>>> different
>>>>>>>> contracts are confusing to the user, as is creating a
>>>>>>> usable
>>>>>>
>>>>>>> help system
>>>>>>>> when there are multiple end points for what is a common
>>>>>>>> operation. The help
>>>>>>>> pages of each were inadequate in different ways. And
>>>>>>> because
>>>>>>
>>>>>>> the code was
>>>>>>>> developed independently, support for different objects
>>>>>>> was
>>>>
>>>>> inconsistent. So
>>>>>>>> actually, yes, the maintenance (and use) burden for the
>>>>>>>> previous state of
>>>>>>>> affairs was substantial.
>>>>>>>>
>>>>>>>> On the other hand, I agree that keepSeqlevels is
>>>>>>> convenient
>>>>
>>>>> as a simple
>>>>>>>> wrapper around seqlevels<-, in the same way that
>>>>>>>> setNames
>>>>>>>> and names<- are
>>>>>>>> both useful.
>>>>>>>>
>>>>>>>> So we could iterate to
>>>>>>>>
>>>>>>>> keepSeqlevels <-
>>>>>>>> function(x, value, ...)
>>>>>>>> {
>>>>>>>> seqlevels(x, force=TRUE) <- value
>>>>>>>> x
>>>>>>>> }
>>>>>>>>
>>>>>>>> but I'd be less enthusiastic about maintaining the
>>>>>>> original
>>>>
>>>>> contract of
>>>>>>>> keepSeqlevels, where keepSeqlevels(gr1, gr2), would have
>>>>>>>> worked for two
>>>>>>>> GRanges objects.
>>>>>>>>
>>>>>>>>
>>>>>>>> Why would this be called keepSeqlevels() if, depending on how
>>>>>>> it's
>>>>
>>>>> used,
>>>>>>>> it will either add, drop, rename, or permute the seqlevels?
>>>>>>>> Couldn't this be called setSeqlevels?
>>>>>>>>
>>>>>>>>
>>>>>>>> I thought that new2old was necessary for permuting.
>>>>>>> The seqlevels setter has no 'new2old' arg.
>>>>>>>
>>>>>>>
>>>>>>> You're right that
>>>>>>>
>>>>>>>> adding should be disallowed for keepSeqlevels(). Adding seqlevels is
>>>>>>> not
>>>>>>
>>>>>>> a common operation. The two common operations are:
>>>>>>>> * Permuting, usually because the data were imported in non-canonical
>>>>>>>> order (seqnameStyle was designed to address this, no?).
>>>>>>> Permuting is straightforward with seqlevels<-:
>>>>>>>
>>>>>>>> seqlevels(gr)
>>>>>>> [1] "chr1" "chr2" "chr3"
>>>>>>>
>>>>>>>> seqlevels(gr) <- rev(seqlevels(gr))
>>>>>>>
>>>>>>>> seqlevels(gr)
>>>>>>> [1] "chr3" "chr2" "chr1"
>>>>>>>
>>>>>>>
>>>>>>> * Subsetting, either by keeping or dropping.
>>>>>>>
>>>>>>> Also straightforward:
>>>>>>>
>>>>>>>> seqlevels(gr, force=TRUE) <- "chr2"
>>>>>>>
>>>>>>>> seqlevels(gr)
>>>>>>> [1] "chr2"
>>>>>>>
>>>>>>>
>>>>>>> Two main reasons: taking a
>>>>>>>
>>>>>>>> small slice of the data for prototyping, or removing problematic
>>>>>>>> chromosomes (sex, circular). This goes back to bsapply and the
>>>>>>> exclude
>>>>
>>>>> argument.
>>>>>>> There are other important use cases:
>>>>>>>
>>>>>>> - Dropping seqlevels that are *not* in use. This happens for
>>>>>>> example
>>>>>>> when subsetting a BAM file with Rsamtools::filterBam to keep only
>>>>>>> the alignments located on a given chromosome. The entire sequence
>>>>>>> dictionary (in the header) is propagated to the resulting BAM
>>>>>>> file
>>>>>>> so when you read the file with readGAlignments() you end up with
>>>>>>> a
>>>>>>> bunch of useless seqlevels that you generally start by dropping.
>>>>>>> You don't want to drop any alignment so force=FALSE is your
>>>>>> friend.
>>>>
>>>>>
>>>>>>>
>>>>>>> This is sort of tangential to the discussion, but do you really want
>>>>>> to
>>>>> do
>>>>
>>>>> this? I would preserve the universe as given by the BAM.
>>>>>>
>>>>>>
>>>>>> - An even more common operation is renaming: 90% of the times
>>>>>>> that's
>>>>>>> what I use seqlevels<- for, and that's what I tell people to use
>>>>>>> when they need to rename their chromosomes. Also straightforward:
>>>>>>>
>>>>>>>> seqlevels(gr)
>>>>>>> [1] "chr1" "chr2" "chr3" "chrM"
>>>>>>>
>>>>>>>> seqlevels(gr) <- sub("^chr", "", seqlevels(gr))
>>>>>>>> seqlevels(gr)
>>>>>>> [1] "1" "2" "3" "M"
>>>>>>>
>>>>>>>> seqlevels(gr)[seqlevels(gr) == "M"] <- "MT"
>>>>>>>> seqlevels(gr)
>>>>>>> [1] "1" "2" "3" "MT"
>>>>>>>
>>>>>>> Note that this is simpler to use than renameSeqlevels() which
>>>>>>> always required you to build the entire right value as a named
>>>>>>> vector.
>>>>>>>
>>>>>>>
>>>>>>> Sure, renaming is a common use case. Not sure how I forgot that.
>>>>>>
>>>>>> As I wrote earlier in the thread, renameSeqlevels should be changed so
>>>>> as
>>>>
>>>>> not to require naming of the vector.
>>>>>>
>>>>>>
>>>>>> So the added-value of keepSeqlevels() seems to boil down to:
>>>>>>>
>>>>>>> (a) it always uses force=TRUE
>>>>>>>
>>>>>>> (b) it preserves the original object and returns the modified
>>>>>>> object
>>>>>>>
>>>>>>> If you want to restrict the use of keepSeqlevels() to permuting and
>>>>>>> dropping, you'll also have to disallow renaming (in addition to
>>>>>> adding).
>>>>
>>>>> Then its name will more or less reflect what it does (if "keep" means
>>>>>>> "permute" or "drop"). The final result will be something that does
>>>>>> less
>>>>
>>>>> than setSeqlevels() and that is not necessarily easier to read: both
>>>>>>> will set on the object whatever vector of seqlevels they are supplied,
>>>>>>> except that one will fail when doing so would actually result in
>>>>>> adding
>>>>
>>>>> or renaming seqlevels.
>>>>>>>
>>>>>>>
>>>>>>> So it looks like seqlevels<- is pretty powerful. Now I see that if I
>>>>>> scroll
>>>>>> down to the examples in the man page, I find the actual documentation.
>>>>>> It's
>>>>>> cool to learn that we can replace seqlevels on TxDb objects. My
>>>>>> argument
>>>>>> has always been that since these low-level operations are so powerful,
>>>>>> it's
>>>>>> nice to have high-level operations to clarify the code. We have to
>>>>>> think
>>>>>> hard to know what the RHS will do in that replacement. It's probably
>>>>>> one
>>>>>> of
>>>>>> the most complex replacement operations; far more complex than the
>>>>> typical
>>>>
>>>>> one, including levels<- on factor. There's nothing wrong with that; it's
>>>>>> just complexity that we should be able to hide.
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> H.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Michael
>>>>>>>>
>>>>>>>>
>>>>>>>> H.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Well, I wasn't even aware of that feature, so you've made
>>>>>>> your
>>>>
>>>>> point about
>>>>>>>> the documentation ;) Sounds like a good compromise to me.
>>>>>>>>
>>>>>>>> Thanks for understanding,
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Martin
>>>>>>>>
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 18, 2013 at 6:00 PM, Martin Morgan
>>>>>>>> <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>>>>>>>> <mailto:mtmorgan at fhcrc.org <mailto:
>>>>>>> mtmorgan at fhcrc.org
>>>>
>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On 05/18/2013 05:42 PM, Martin Morgan wrote:
>>>>>>>>
>>>>>>>> Some of the most common operations are
>>>>>>>> straight-forward, e.g.,
>>>>>>>>
>>>>>>>>> gr = GRanges(paste0("chr", c(1:22,
>>>>>>>> "X", "Y")), IRanges(1,
>>>>>>>> 100))
>>>>>>>>> seqlevels(gr) = sub("chr", "ch",
>>>>>>>> seqlevels(gr))
>>>>>>>>
>>>>>>>>
>>>>>>>> To get a more comprehensive example I should
>>>>>>> have
>>>>
>>>>> followed my own
>>>>>>>> advice and
>>>>>>>> grabbed from the help page!
>>>>>>>>
>>>>>>>> ## Rename:
>>>>>>>> seqlevels(txdb) <- sub("chr", "",
>>>>>>>> seqlevels(txdb))
>>>>>>>> seqlevels(txdb)
>>>>>>>>
>>>>>>>> seqlevels(txdb) <- paste0("CH",
>>>>>>>> seqlevels(txdb))
>>>>>>>> seqlevels(txdb)
>>>>>>>>
>>>>>>>> seqlevels(txdb)[seqlevels(__****
>>>>>>>> **__txdb)
>>>>>>> ==
>>>>
>>>>>
>>>>>>
>>>>>>>> "CHM"] <- "M"
>>>>>>>>
>>>>>>>>
>>>>>>>> seqlevels(txdb)
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Computational Biology / Fred Hutchinson Cancer
>>>>>>>> Research Center
>>>>>>>> 1100 Fairview Ave. N.
>>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>>
>>>>>>>> Location: Arnold Building M1 B861
>>>>>>>> Phone: (206) 667-2793
>>>>>>> <tel:%28206%29%20667-2793>
>>>>
>>>>> <tel:%28206%29%20667-2793>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Dr. Martin Morgan, PhD
>>>>>>>>
>>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>>> 1100 Fairview Ave. N.
>>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>>
>>>>>>>>
>>>>>>>> [[alternative HTML version deleted]]
>>>>>>>>
>>>>>>>> ______________________________****___________________
>>>>>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.***
>>>>>>>> *org<
>>>>>>> Bioc-devel at r-project.org>
>>>>>>
>>>>>>>
>>>>>>>>> mailing list
>>>>>>>> https://stat.ethz.ch/mailman/_****_listinfo/bioc-devel<https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel>
>>>>>>>> <
>>>>>>> https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel<https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>>>>>>
>>>>>>>
>>>>>>>> <https://stat.ethz.ch/mailman/****listinfo/bioc-devel<https://stat.ethz.ch/mailman/**listinfo/bioc-devel>
>>>>>>>> <
>>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Hervé Pagès
>>>>>>>>
>>>>>>>> Program in Computational Biology
>>>>>>>> Division of Public Health Sciences
>>>>>>>>
>>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>>> P.O. Box 19024
>>>>>>>> Seattle, WA 98109-1024
>>>>>>>>
>>>>>>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>>>>>>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>>>>>>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>> Hervé Pagès
>>>>>>>
>>>>>>> Program in Computational Biology
>>>>>>> Division of Public Health Sciences
>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>> P.O. Box 19024
>>>>>>> Seattle, WA 98109-1024
>>>>>>>
>>>>>>> E-mail: hpages at fhcrc.org
>>>>>>> Phone: (206) 667-5791
>>>>>>> Fax: (206) 667-1319
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>>
>>>>>> ______________________________**_________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>> [[alternative HTML version deleted]]
>>>>
>>>>
>>>> ______________________________**_________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>> [[alternative HTML version deleted]]
>>>
>>>
>>>
>>> ______________________________**_________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
More information about the Bioc-devel
mailing list