[Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels

Tue Sep 17 19:58:22 CEST 2013

For sorting the chromosome names in "natural" order, you can use
the makeSeqnameIds() helper function in GenomicRanges:

   seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(seqlevels(gr))]

It recognizes several naming conventions (chr-prefixed or not,
roman, etc...) e.g.:

   > makeSeqnameIds(c("Y", "1", "9", "M", "2"))
   [1] 4 1 3 5 2

   > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
   [1] 4 1 3 5 2

I still need to improve the documentation of this function, put
more examples, and add unit tests. It's used internally by the
makeTranscriptDb* functions in GenomicFeatures to assign internal
ids to the chromosomes so that all the TranscriptDb extractors
can then return GRanges and GRangesList objects with the seqlevels
in the "natural" order.

Cheers,
H.

On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:
> Actually, what I (and I think several others) just want is
>    fixSeqnames()
> which sorts and fixes them to some convention.  I don't really care which
> one, but I want to do some easy harmonization, preferably in line with
> other bioc tools.  I know this is hard to do for all organisms, but having
> something that deals with the major ones (human, mouse, fruit fly, yeast,
> some monkeys) would be extremely valuable.  If this function existed, I
> could just throw all my GRanges at it.
>
> Kasper
>
>
> On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence <lawrence.michael at gene.com
>> wrote:
>
>> When I hit this I usually just do:
>>
>> seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))
>>
>> It would be nice if there were a function for sorting chromosome names
>> according to a specified naming convention. Like sortSeqlevels(gr,
>> UCSCNaming()) or the alternative replacement syntax: seqinfo(gr) <-
>> sort(seqinfo(gr), UCSCNaming()). Although sort is only single-dispatch.
>>
>> Michael
>>
>>
>>
>>
>>
>>
>> On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
>> <stvjc at channing.harvard.edu>wrote:
>>
>>> I am replying to this as Michael mentions that this is a powerful
>>> operation.  Here's my
>>> problem that I believe is related, but I do not see a straightforward
>>> solution
>>>
>>>> bhmm19uni
>>> GRanges with 559456 ranges and 4 metadata columns:
>>>           seqnames                 ranges strand   |              name
>>> score
>>>              <Rle>              <IRanges>  <Rle>   |       <character>
>>> <numeric>
>>>         1     chr1 [ 67229025,  67341224]      *   | 13_Heterochrom/lo
>>>      0
>>>         2     chr1 [203057955, 203060154]      *   |      12_Repressed
>>>      0
>>>         3     chr1 [  8458427,   8486226]      *   | 13_Heterochrom/lo
>>>      0
>>>         4     chr1 [ 16904427,  16904826]      *   | 5_Strong_Enhancer
>>>      0
>>>         5     chr1 [ 25289427,  25293426]      *   |       11_Weak_Txn
>>>      0
>>>       ...      ...                    ...    ... ...               ...
>>>    ...
>>>    570766    chr22   [51100931, 51101330]      *   |   2_Weak_Promoter
>>>      0
>>>    570767    chr22   [51101331, 51101530]      *   |   6_Weak_Enhancer
>>>      0
>>>    570768    chr22   [51101531, 51101730]      *   |   2_Weak_Promoter
>>>      0
>>>    570769    chr22   [51101731, 51178130]      *   | 13_Heterochrom/lo
>>>      0
>>>    570770    chr22   [51178131, 51178330]      *   | 15_Repetitive/CNV
>>>      0
>>>                            thick     numcode
>>>                        <IRanges> <character>
>>>         1 [ 67001613,  67113812]          13
>>>         2 [201324578, 201326777]          12
>>>         3 [  8381014,   8408813]          13
>>>         4 [ 16777014,  16777413]           5
>>>         5 [ 25162014,  25166013]          11
>>>       ...                    ...         ...
>>>    570766   [49447797, 49448196]           2
>>>    570767   [49448197, 49448396]           6
>>>    570768   [49448397, 49448596]           2
>>>    570769   [49448597, 49524996]          13
>>>    570770   [49524997, 49525196]          15
>>>    ---
>>>    seqlengths:
>>>          chr1     chr10     chr11      chr5 ...      chr2     chr21
>>>   chr4
>>>     249250621 135534747 135006516 180915260 ... 243199373  48129895
>>> 191154276
>>>
>>> If I try to make a karyogram with ggbio, the chromosomes are in an
>>> unnatural order.  What is the right approach to getting the seqinfo in a
>>> natural order with respect to chromosome names and lengths?  Reassigning
>>> seqlevels seems dangerous but I haven't experimented fully.  It must be a
>>> common use case but I do not see it in doc.
>>>
>>>
>>>
>>> On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
>>> lawrence.michael at gene.com> wrote:
>>>
>>>> On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>
>>>>> Michael,
>>>>>
>>>>>
>>>>> On 05/20/2013 08:44 PM, Michael Lawrence wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès <hpages at fhcrc.org
>>>>>> <mailto:hpages at fhcrc.org>> wrote:
>>>>>>
>>>>>>      Hi,
>>>>>>
>>>>>>
>>>>>>      On 05/20/2013 03:15 PM, Michael Lawrence wrote:
>>>>>>
>>>>>>          On Mon, May 20, 2013 at 3:11 PM, Martin Morgan
>>>>>>          <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>> wrote:
>>>>>>
>>>>>>              Hi Michael --
>>>>>>
>>>>>>
>>>>>>              On 5/20/2013 5:56 AM, Michael Lawrence wrote:
>>>>>>
>>>>>>                  While it's cool that seqlevels<- has become so
>>>> flexible,
>>>>>>                  I still claim
>>>>>>                  that
>>>>>>                  rename/keep/drop would be a lot easier to read,
>> because
>>>>>>                  they describe the
>>>>>>                  high-level operation, and there's no reason to
>>>> deprecate
>>>>>>                  them. They're
>>>>>>                  also
>>>>>>                  easier to remember. For example, for dropping with
>>>>>>                  seqlevels<-, the user
>>>>>>                  needs
>>>>>>                  to remember that "force" is necessary to drop. Much
>>>>>>                  easier to just say
>>>>>>                  "dropSeqlevels(), please". And reimplementing
>>>>>>                  keepSeqlevels is still too
>>>>>>                  complicated. Is it such a maintenance burden to have
>>>>>>                  these simple
>>>>>>                  wrappers sit
>>>>>>                  on top of the low-level manipulators?
>>>>>>
>>>>>>                  Another suggestion: renameSeqlevels should not
>> require
>>>> a
>>>>>>                  named vector (in
>>>>>>                  cases
>>>>>>                  where it is unnamed, require that it is parallel to
>> the
>>>>>>                  existing
>>>>>>                  seqlevels, as
>>>>>>                  with seqlevels<-).
>>>>>>
>>>>>>
>>>>>>              I didn't really indicate what drove my desire to see
>>>>>>              keepSeqlevels
>>>>>>              deprecated. keepSeqlevels, seqlevels<-, and isActiveSeq
>>>> were
>>>>>>              developed more
>>>>>>              or less independently, and have different contracts. The
>>>>>>              different
>>>>>>              contracts are confusing to the user, as is creating a
>>>> usable
>>>>>>              help system
>>>>>>              when there are multiple end points for what is a common
>>>>>>              operation. The help
>>>>>>              pages of each were inadequate in different ways. And
>>>> because
>>>>>>              the code was
>>>>>>              developed independently, support for different objects
>> was
>>>>>>              inconsistent. So
>>>>>>              actually, yes, the maintenance (and use) burden for the
>>>>>>              previous state of
>>>>>>              affairs was substantial.
>>>>>>
>>>>>>              On the other hand, I agree that keepSeqlevels is
>> convenient
>>>>>>              as a simple
>>>>>>              wrapper around seqlevels<-, in the same way that setNames
>>>>>>              and names<- are
>>>>>>              both useful.
>>>>>>
>>>>>>              So we could iterate to
>>>>>>
>>>>>>                  keepSeqlevels <-
>>>>>>                      function(x, value, ...)
>>>>>>                  {
>>>>>>                      seqlevels(x, force=TRUE) <- value
>>>>>>                      x
>>>>>>                  }
>>>>>>
>>>>>>              but I'd be less enthusiastic about maintaining the
>> original
>>>>>>              contract of
>>>>>>              keepSeqlevels, where keepSeqlevels(gr1, gr2), would have
>>>>>>              worked for two
>>>>>>              GRanges objects.
>>>>>>
>>>>>>
>>>>>>      Why would this be called keepSeqlevels() if, depending on how
>> it's
>>>>>> used,
>>>>>>      it will either add, drop, rename, or permute the seqlevels?
>>>>>>      Couldn't this be called setSeqlevels?
>>>>>>
>>>>>>
>>>>>> I thought that new2old was necessary for permuting.
>>>>>>
>>>>>
>>>>> The seqlevels setter has no 'new2old' arg.
>>>>>
>>>>>
>>>>>   You're right that
>>>>>> adding should be disallowed for keepSeqlevels(). Adding seqlevels is
>>>> not
>>>>>> a common operation. The two common operations are:
>>>>>> * Permuting, usually because the data were imported in non-canonical
>>>>>> order (seqnameStyle was designed to address this, no?).
>>>>>>
>>>>>
>>>>> Permuting is straightforward with seqlevels<-:
>>>>>
>>>>>    > seqlevels(gr)
>>>>>    [1] "chr1" "chr2" "chr3"
>>>>>
>>>>>    > seqlevels(gr) <- rev(seqlevels(gr))
>>>>>
>>>>>    > seqlevels(gr)
>>>>>    [1] "chr3" "chr2" "chr1"
>>>>>
>>>>>
>>>>>   * Subsetting, either by keeping or dropping.
>>>>>>
>>>>>
>>>>> Also straightforward:
>>>>>
>>>>>    > seqlevels(gr, force=TRUE) <- "chr2"
>>>>>
>>>>>    > seqlevels(gr)
>>>>>    [1] "chr2"
>>>>>
>>>>>
>>>>>   Two main reasons: taking a
>>>>>> small slice of the data for prototyping, or removing problematic
>>>>>> chromosomes (sex, circular). This goes back to bsapply and the
>> exclude
>>>>>> argument.
>>>>>>
>>>>>
>>>>> There are other important use cases:
>>>>>
>>>>>    - Dropping seqlevels that are *not* in use. This happens for example
>>>>>      when subsetting a BAM file with Rsamtools::filterBam to keep only
>>>>>      the alignments located on a given chromosome. The entire sequence
>>>>>      dictionary (in the header) is propagated to the resulting BAM file
>>>>>      so when you read the file with readGAlignments() you end up with a
>>>>>      bunch of useless seqlevels that you generally start by dropping.
>>>>>      You don't want to drop any alignment so force=FALSE is your
>> friend.
>>>>>
>>>>>
>>>> This is sort of tangential to the discussion, but do you really want to
>> do
>>>> this? I would preserve the universe as given by the BAM.
>>>>
>>>>
>>>>>    - An even more common operation is renaming: 90% of the times that's
>>>>>      what I use seqlevels<- for, and that's what I tell people to use
>>>>>      when they need to rename their chromosomes. Also straightforward:
>>>>>
>>>>>        > seqlevels(gr)
>>>>>        [1] "chr1" "chr2" "chr3" "chrM"
>>>>>
>>>>>        > seqlevels(gr) <- sub("^chr", "", seqlevels(gr))
>>>>>        > seqlevels(gr)
>>>>>        [1] "1" "2" "3" "M"
>>>>>
>>>>>        > seqlevels(gr)[seqlevels(gr) == "M"] <- "MT"
>>>>>        > seqlevels(gr)
>>>>>        [1] "1"  "2"  "3"  "MT"
>>>>>
>>>>>      Note that this is simpler to use than renameSeqlevels() which
>>>>>      always required you to build the entire right value as a named
>>>>>      vector.
>>>>>
>>>>>
>>>> Sure, renaming is a common use case. Not sure how I forgot that.
>>>>
>>>> As I wrote earlier in the thread, renameSeqlevels should be changed so
>> as
>>>> not to require naming of the vector.
>>>>
>>>>
>>>>> So the added-value of keepSeqlevels() seems to boil down to:
>>>>>
>>>>>    (a) it always uses force=TRUE
>>>>>
>>>>>    (b) it preserves the original object and returns the modified object
>>>>>
>>>>> If you want to restrict the use of keepSeqlevels() to permuting and
>>>>> dropping, you'll also have to disallow renaming (in addition to
>> adding).
>>>>> Then its name will more or less reflect what it does (if "keep" means
>>>>> "permute" or "drop"). The final result will be something that does
>> less
>>>>> than setSeqlevels() and that is not necessarily easier to read: both
>>>>> will set on the object whatever vector of seqlevels they are supplied,
>>>>> except that one will fail when doing so would actually result in
>> adding
>>>>> or renaming seqlevels.
>>>>>
>>>>>
>>>> So it looks like seqlevels<- is pretty powerful. Now I see that if I
>>>> scroll
>>>> down to the examples in the man page, I find the actual documentation.
>>>> It's
>>>> cool to learn that we can replace seqlevels on TxDb objects. My argument
>>>> has always been that since these low-level operations are so powerful,
>>>> it's
>>>> nice to have high-level operations to clarify the code. We have to think
>>>> hard to know what the RHS will do in that replacement. It's probably one
>>>> of
>>>> the most complex replacement operations; far more complex than the
>> typical
>>>> one, including levels<- on factor. There's nothing wrong with that; it's
>>>> just complexity that we should be able to hide.
>>>>
>>>> Michael
>>>>
>>>> H.
>>>>>
>>>>>
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>>      H.
>>>>>>
>>>>>>
>>>>>>
>>>>>>          Well, I wasn't even aware of that feature, so you've made
>> your
>>>>>>          point about
>>>>>>          the documentation ;) Sounds like a good compromise to me.
>>>>>>
>>>>>>          Thanks for understanding,
>>>>>>          Michael
>>>>>>
>>>>>>
>>>>>>
>>>>>>              Martin
>>>>>>
>>>>>>
>>>>>>                  Michael
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>                  On Sat, May 18, 2013 at 6:00 PM, Martin Morgan
>>>>>>                  <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>>>>>>                  <mailto:mtmorgan at fhcrc.org <mailto:
>> mtmorgan at fhcrc.org
>>>>>>>
>>>>>>
>>>>>>                  wrote:
>>>>>>
>>>>>>                        On 05/18/2013 05:42 PM, Martin Morgan wrote:
>>>>>>
>>>>>>                            Some of the most common operations are
>>>>>>                  straight-forward, e.g.,
>>>>>>
>>>>>>                                  > gr = GRanges(paste0("chr", c(1:22,
>>>>>>                  "X", "Y")), IRanges(1,
>>>>>>                  100))
>>>>>>                                  > seqlevels(gr) = sub("chr", "ch",
>>>>>>                  seqlevels(gr))
>>>>>>
>>>>>>
>>>>>>                        To get a more comprehensive example I should
>> have
>>>>>>                  followed my own
>>>>>>                  advice and
>>>>>>                        grabbed from the help page!
>>>>>>
>>>>>>                              ## Rename:
>>>>>>                              seqlevels(txdb) <- sub("chr", "",
>>>>>>                  seqlevels(txdb))
>>>>>>                              seqlevels(txdb)
>>>>>>
>>>>>>                              seqlevels(txdb) <- paste0("CH",
>>>>>> seqlevels(txdb))
>>>>>>                              seqlevels(txdb)
>>>>>>
>>>>>>                              seqlevels(txdb)[seqlevels(__****__txdb)
>> ==
>>>>
>>>>>>
>>>>>>                  "CHM"] <- "M"
>>>>>>
>>>>>>
>>>>>>                              seqlevels(txdb)
>>>>>>
>>>>>>
>>>>>>                        --
>>>>>>                        Computational Biology / Fred Hutchinson Cancer
>>>>>>                  Research Center
>>>>>>                        1100 Fairview Ave. N.
>>>>>>                        PO Box 19024 Seattle, WA 98109
>>>>>>
>>>>>>                        Location: Arnold Building M1 B861
>>>>>>                        Phone: (206) 667-2793
>> <tel:%28206%29%20667-2793>
>>>>>>                  <tel:%28206%29%20667-2793>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>              --
>>>>>>              Dr. Martin Morgan, PhD
>>>>>>
>>>>>>              Fred Hutchinson Cancer Research Center
>>>>>>              1100 Fairview Ave. N.
>>>>>>              PO Box 19024 Seattle, WA 98109
>>>>>>
>>>>>>
>>>>>>                   [[alternative HTML version deleted]]
>>>>>>
>>>>>>          ______________________________**___________________
>>>>>>          Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.**org<
>>>> Bioc-devel at r-project.org>
>>>>>>>
>>>>>>          mailing list
>>>>>>          https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel<
>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>>>>>>
>>>>>>          <https://stat.ethz.ch/mailman/**listinfo/bioc-devel<
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>      --
>>>>>>      Hervé Pagès
>>>>>>
>>>>>>      Program in Computational Biology
>>>>>>      Division of Public Health Sciences
>>>>>>
>>>>>>      Fred Hutchinson Cancer Research Center
>>>>>>      1100 Fairview Ave. N, M1-B514
>>>>>>      P.O. Box 19024
>>>>>>      Seattle, WA 98109-1024
>>>>>>
>>>>>>      E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>>>>>      Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>>>>>      Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>>>>>
>>>>>>
>>>>>>
>>>>> --
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpages at fhcrc.org
>>>>> Phone:  (206) 667-5791
>>>>> Fax:    (206) 667-1319
>>>>>
>>>>
>>>>          [[alternative HTML version deleted]]
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>>
>>
>>          [[alternative HTML version deleted]]
>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>
> 	[[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319