[Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels
Hervé Pagès
hpages at fhcrc.org
Tue Sep 17 21:56:24 CEST 2013
On 09/17/2013 10:58 AM, Hervé Pagès wrote:
> For sorting the chromosome names in "natural" order, you can use
> the makeSeqnameIds() helper function in GenomicRanges:
>
> seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(seqlevels(gr))]
I meant:
seqlevels(gr) <- seqlevels(gr)[order(makeSeqnameIds(seqlevels(gr)))]
Probably the source of Michael's confusion... sorry.
sortSeqlevels() just added to GenomicRanges 1.13.44:
seqlevels <- c("chr2RHet", "chr4", "chrUextra", "chrYHet",
"chrM", "chrXHet", "chr2LHet", "chrU",
"chr3L", "chr3R", "chr2R", "chrX")
Then:
> sortSeqlevels(seqlevels)
[1] "chr2R" "chr3L" "chr3R" "chr4" "chrX"
"chrU"
[7] "chrM" "chr2LHet" "chr2RHet" "chrXHet" "chrYHet"
"chrUextra"
> sortSeqlevels(seqlevels, X.is.sexchrom=TRUE)
[1] "chr2R" "chr3L" "chr3R" "chr4" "chrX"
"chrU"
[7] "chrM" "chr2LHet" "chr2RHet" "chrXHet" "chrYHet"
"chrUextra"
H.
>
> It recognizes several naming conventions (chr-prefixed or not,
> roman, etc...) e.g.:
>
> > makeSeqnameIds(c("Y", "1", "9", "M", "2"))
> [1] 4 1 3 5 2
>
> > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
> [1] 4 1 3 5 2
>
> I still need to improve the documentation of this function, put
> more examples, and add unit tests. It's used internally by the
> makeTranscriptDb* functions in GenomicFeatures to assign internal
> ids to the chromosomes so that all the TranscriptDb extractors
> can then return GRanges and GRangesList objects with the seqlevels
> in the "natural" order.
>
> Cheers,
> H.
>
>
> On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:
>> Actually, what I (and I think several others) just want is
>> fixSeqnames()
>> which sorts and fixes them to some convention. I don't really care which
>> one, but I want to do some easy harmonization, preferably in line with
>> other bioc tools. I know this is hard to do for all organisms, but
>> having
>> something that deals with the major ones (human, mouse, fruit fly, yeast,
>> some monkeys) would be extremely valuable. If this function existed, I
>> could just throw all my GRanges at it.
>>
>> Kasper
>>
>>
>> On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence
>> <lawrence.michael at gene.com
>>> wrote:
>>
>>> When I hit this I usually just do:
>>>
>>> seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))
>>>
>>> It would be nice if there were a function for sorting chromosome names
>>> according to a specified naming convention. Like sortSeqlevels(gr,
>>> UCSCNaming()) or the alternative replacement syntax: seqinfo(gr) <-
>>> sort(seqinfo(gr), UCSCNaming()). Although sort is only single-dispatch.
>>>
>>> Michael
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
>>> <stvjc at channing.harvard.edu>wrote:
>>>
>>>> I am replying to this as Michael mentions that this is a powerful
>>>> operation. Here's my
>>>> problem that I believe is related, but I do not see a straightforward
>>>> solution
>>>>
>>>>> bhmm19uni
>>>> GRanges with 559456 ranges and 4 metadata columns:
>>>> seqnames ranges strand | name
>>>> score
>>>> <Rle> <IRanges> <Rle> | <character>
>>>> <numeric>
>>>> 1 chr1 [ 67229025, 67341224] * | 13_Heterochrom/lo
>>>> 0
>>>> 2 chr1 [203057955, 203060154] * | 12_Repressed
>>>> 0
>>>> 3 chr1 [ 8458427, 8486226] * | 13_Heterochrom/lo
>>>> 0
>>>> 4 chr1 [ 16904427, 16904826] * | 5_Strong_Enhancer
>>>> 0
>>>> 5 chr1 [ 25289427, 25293426] * | 11_Weak_Txn
>>>> 0
>>>> ... ... ... ... ... ...
>>>> ...
>>>> 570766 chr22 [51100931, 51101330] * | 2_Weak_Promoter
>>>> 0
>>>> 570767 chr22 [51101331, 51101530] * | 6_Weak_Enhancer
>>>> 0
>>>> 570768 chr22 [51101531, 51101730] * | 2_Weak_Promoter
>>>> 0
>>>> 570769 chr22 [51101731, 51178130] * | 13_Heterochrom/lo
>>>> 0
>>>> 570770 chr22 [51178131, 51178330] * | 15_Repetitive/CNV
>>>> 0
>>>> thick numcode
>>>> <IRanges> <character>
>>>> 1 [ 67001613, 67113812] 13
>>>> 2 [201324578, 201326777] 12
>>>> 3 [ 8381014, 8408813] 13
>>>> 4 [ 16777014, 16777413] 5
>>>> 5 [ 25162014, 25166013] 11
>>>> ... ... ...
>>>> 570766 [49447797, 49448196] 2
>>>> 570767 [49448197, 49448396] 6
>>>> 570768 [49448397, 49448596] 2
>>>> 570769 [49448597, 49524996] 13
>>>> 570770 [49524997, 49525196] 15
>>>> ---
>>>> seqlengths:
>>>> chr1 chr10 chr11 chr5 ... chr2 chr21
>>>> chr4
>>>> 249250621 135534747 135006516 180915260 ... 243199373 48129895
>>>> 191154276
>>>>
>>>> If I try to make a karyogram with ggbio, the chromosomes are in an
>>>> unnatural order. What is the right approach to getting the seqinfo
>>>> in a
>>>> natural order with respect to chromosome names and lengths?
>>>> Reassigning
>>>> seqlevels seems dangerous but I haven't experimented fully. It must
>>>> be a
>>>> common use case but I do not see it in doc.
>>>>
>>>>
>>>>
>>>> On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
>>>> lawrence.michael at gene.com> wrote:
>>>>
>>>>> On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès <hpages at fhcrc.org>
>>>>> wrote:
>>>>>
>>>>>> Michael,
>>>>>>
>>>>>>
>>>>>> On 05/20/2013 08:44 PM, Michael Lawrence wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès <hpages at fhcrc.org
>>>>>>> <mailto:hpages at fhcrc.org>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>> On 05/20/2013 03:15 PM, Michael Lawrence wrote:
>>>>>>>
>>>>>>> On Mon, May 20, 2013 at 3:11 PM, Martin Morgan
>>>>>>> <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>> wrote:
>>>>>>>
>>>>>>> Hi Michael --
>>>>>>>
>>>>>>>
>>>>>>> On 5/20/2013 5:56 AM, Michael Lawrence wrote:
>>>>>>>
>>>>>>> While it's cool that seqlevels<- has become so
>>>>> flexible,
>>>>>>> I still claim
>>>>>>> that
>>>>>>> rename/keep/drop would be a lot easier to read,
>>> because
>>>>>>> they describe the
>>>>>>> high-level operation, and there's no reason to
>>>>> deprecate
>>>>>>> them. They're
>>>>>>> also
>>>>>>> easier to remember. For example, for dropping with
>>>>>>> seqlevels<-, the user
>>>>>>> needs
>>>>>>> to remember that "force" is necessary to drop. Much
>>>>>>> easier to just say
>>>>>>> "dropSeqlevels(), please". And reimplementing
>>>>>>> keepSeqlevels is still too
>>>>>>> complicated. Is it such a maintenance burden to
>>>>>>> have
>>>>>>> these simple
>>>>>>> wrappers sit
>>>>>>> on top of the low-level manipulators?
>>>>>>>
>>>>>>> Another suggestion: renameSeqlevels should not
>>> require
>>>>> a
>>>>>>> named vector (in
>>>>>>> cases
>>>>>>> where it is unnamed, require that it is parallel to
>>> the
>>>>>>> existing
>>>>>>> seqlevels, as
>>>>>>> with seqlevels<-).
>>>>>>>
>>>>>>>
>>>>>>> I didn't really indicate what drove my desire to see
>>>>>>> keepSeqlevels
>>>>>>> deprecated. keepSeqlevels, seqlevels<-, and isActiveSeq
>>>>> were
>>>>>>> developed more
>>>>>>> or less independently, and have different contracts.
>>>>>>> The
>>>>>>> different
>>>>>>> contracts are confusing to the user, as is creating a
>>>>> usable
>>>>>>> help system
>>>>>>> when there are multiple end points for what is a common
>>>>>>> operation. The help
>>>>>>> pages of each were inadequate in different ways. And
>>>>> because
>>>>>>> the code was
>>>>>>> developed independently, support for different objects
>>> was
>>>>>>> inconsistent. So
>>>>>>> actually, yes, the maintenance (and use) burden for the
>>>>>>> previous state of
>>>>>>> affairs was substantial.
>>>>>>>
>>>>>>> On the other hand, I agree that keepSeqlevels is
>>> convenient
>>>>>>> as a simple
>>>>>>> wrapper around seqlevels<-, in the same way that
>>>>>>> setNames
>>>>>>> and names<- are
>>>>>>> both useful.
>>>>>>>
>>>>>>> So we could iterate to
>>>>>>>
>>>>>>> keepSeqlevels <-
>>>>>>> function(x, value, ...)
>>>>>>> {
>>>>>>> seqlevels(x, force=TRUE) <- value
>>>>>>> x
>>>>>>> }
>>>>>>>
>>>>>>> but I'd be less enthusiastic about maintaining the
>>> original
>>>>>>> contract of
>>>>>>> keepSeqlevels, where keepSeqlevels(gr1, gr2), would
>>>>>>> have
>>>>>>> worked for two
>>>>>>> GRanges objects.
>>>>>>>
>>>>>>>
>>>>>>> Why would this be called keepSeqlevels() if, depending on how
>>> it's
>>>>>>> used,
>>>>>>> it will either add, drop, rename, or permute the seqlevels?
>>>>>>> Couldn't this be called setSeqlevels?
>>>>>>>
>>>>>>>
>>>>>>> I thought that new2old was necessary for permuting.
>>>>>>>
>>>>>>
>>>>>> The seqlevels setter has no 'new2old' arg.
>>>>>>
>>>>>>
>>>>>> You're right that
>>>>>>> adding should be disallowed for keepSeqlevels(). Adding seqlevels is
>>>>> not
>>>>>>> a common operation. The two common operations are:
>>>>>>> * Permuting, usually because the data were imported in non-canonical
>>>>>>> order (seqnameStyle was designed to address this, no?).
>>>>>>>
>>>>>>
>>>>>> Permuting is straightforward with seqlevels<-:
>>>>>>
>>>>>> > seqlevels(gr)
>>>>>> [1] "chr1" "chr2" "chr3"
>>>>>>
>>>>>> > seqlevels(gr) <- rev(seqlevels(gr))
>>>>>>
>>>>>> > seqlevels(gr)
>>>>>> [1] "chr3" "chr2" "chr1"
>>>>>>
>>>>>>
>>>>>> * Subsetting, either by keeping or dropping.
>>>>>>>
>>>>>>
>>>>>> Also straightforward:
>>>>>>
>>>>>> > seqlevels(gr, force=TRUE) <- "chr2"
>>>>>>
>>>>>> > seqlevels(gr)
>>>>>> [1] "chr2"
>>>>>>
>>>>>>
>>>>>> Two main reasons: taking a
>>>>>>> small slice of the data for prototyping, or removing problematic
>>>>>>> chromosomes (sex, circular). This goes back to bsapply and the
>>> exclude
>>>>>>> argument.
>>>>>>>
>>>>>>
>>>>>> There are other important use cases:
>>>>>>
>>>>>> - Dropping seqlevels that are *not* in use. This happens for
>>>>>> example
>>>>>> when subsetting a BAM file with Rsamtools::filterBam to keep
>>>>>> only
>>>>>> the alignments located on a given chromosome. The entire
>>>>>> sequence
>>>>>> dictionary (in the header) is propagated to the resulting BAM
>>>>>> file
>>>>>> so when you read the file with readGAlignments() you end up
>>>>>> with a
>>>>>> bunch of useless seqlevels that you generally start by dropping.
>>>>>> You don't want to drop any alignment so force=FALSE is your
>>> friend.
>>>>>>
>>>>>>
>>>>> This is sort of tangential to the discussion, but do you really
>>>>> want to
>>> do
>>>>> this? I would preserve the universe as given by the BAM.
>>>>>
>>>>>
>>>>>> - An even more common operation is renaming: 90% of the times
>>>>>> that's
>>>>>> what I use seqlevels<- for, and that's what I tell people to use
>>>>>> when they need to rename their chromosomes. Also
>>>>>> straightforward:
>>>>>>
>>>>>> > seqlevels(gr)
>>>>>> [1] "chr1" "chr2" "chr3" "chrM"
>>>>>>
>>>>>> > seqlevels(gr) <- sub("^chr", "", seqlevels(gr))
>>>>>> > seqlevels(gr)
>>>>>> [1] "1" "2" "3" "M"
>>>>>>
>>>>>> > seqlevels(gr)[seqlevels(gr) == "M"] <- "MT"
>>>>>> > seqlevels(gr)
>>>>>> [1] "1" "2" "3" "MT"
>>>>>>
>>>>>> Note that this is simpler to use than renameSeqlevels() which
>>>>>> always required you to build the entire right value as a named
>>>>>> vector.
>>>>>>
>>>>>>
>>>>> Sure, renaming is a common use case. Not sure how I forgot that.
>>>>>
>>>>> As I wrote earlier in the thread, renameSeqlevels should be changed so
>>> as
>>>>> not to require naming of the vector.
>>>>>
>>>>>
>>>>>> So the added-value of keepSeqlevels() seems to boil down to:
>>>>>>
>>>>>> (a) it always uses force=TRUE
>>>>>>
>>>>>> (b) it preserves the original object and returns the modified
>>>>>> object
>>>>>>
>>>>>> If you want to restrict the use of keepSeqlevels() to permuting and
>>>>>> dropping, you'll also have to disallow renaming (in addition to
>>> adding).
>>>>>> Then its name will more or less reflect what it does (if "keep" means
>>>>>> "permute" or "drop"). The final result will be something that does
>>> less
>>>>>> than setSeqlevels() and that is not necessarily easier to read: both
>>>>>> will set on the object whatever vector of seqlevels they are
>>>>>> supplied,
>>>>>> except that one will fail when doing so would actually result in
>>> adding
>>>>>> or renaming seqlevels.
>>>>>>
>>>>>>
>>>>> So it looks like seqlevels<- is pretty powerful. Now I see that if I
>>>>> scroll
>>>>> down to the examples in the man page, I find the actual documentation.
>>>>> It's
>>>>> cool to learn that we can replace seqlevels on TxDb objects. My
>>>>> argument
>>>>> has always been that since these low-level operations are so powerful,
>>>>> it's
>>>>> nice to have high-level operations to clarify the code. We have to
>>>>> think
>>>>> hard to know what the RHS will do in that replacement. It's
>>>>> probably one
>>>>> of
>>>>> the most complex replacement operations; far more complex than the
>>> typical
>>>>> one, including levels<- on factor. There's nothing wrong with that;
>>>>> it's
>>>>> just complexity that we should be able to hide.
>>>>>
>>>>> Michael
>>>>>
>>>>> H.
>>>>>>
>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>> H.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Well, I wasn't even aware of that feature, so you've made
>>> your
>>>>>>> point about
>>>>>>> the documentation ;) Sounds like a good compromise to me.
>>>>>>>
>>>>>>> Thanks for understanding,
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 18, 2013 at 6:00 PM, Martin Morgan
>>>>>>> <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>>>>>>> <mailto:mtmorgan at fhcrc.org <mailto:
>>> mtmorgan at fhcrc.org
>>>>>>>>
>>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>> On 05/18/2013 05:42 PM, Martin Morgan wrote:
>>>>>>>
>>>>>>> Some of the most common operations are
>>>>>>> straight-forward, e.g.,
>>>>>>>
>>>>>>> > gr = GRanges(paste0("chr",
>>>>>>> c(1:22,
>>>>>>> "X", "Y")), IRanges(1,
>>>>>>> 100))
>>>>>>> > seqlevels(gr) = sub("chr", "ch",
>>>>>>> seqlevels(gr))
>>>>>>>
>>>>>>>
>>>>>>> To get a more comprehensive example I should
>>> have
>>>>>>> followed my own
>>>>>>> advice and
>>>>>>> grabbed from the help page!
>>>>>>>
>>>>>>> ## Rename:
>>>>>>> seqlevels(txdb) <- sub("chr", "",
>>>>>>> seqlevels(txdb))
>>>>>>> seqlevels(txdb)
>>>>>>>
>>>>>>> seqlevels(txdb) <- paste0("CH",
>>>>>>> seqlevels(txdb))
>>>>>>> seqlevels(txdb)
>>>>>>>
>>>>>>> seqlevels(txdb)[seqlevels(__****__txdb)
>>> ==
>>>>>
>>>>>>>
>>>>>>> "CHM"] <- "M"
>>>>>>>
>>>>>>>
>>>>>>> seqlevels(txdb)
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Computational Biology / Fred Hutchinson
>>>>>>> Cancer
>>>>>>> Research Center
>>>>>>> 1100 Fairview Ave. N.
>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>
>>>>>>> Location: Arnold Building M1 B861
>>>>>>> Phone: (206) 667-2793
>>> <tel:%28206%29%20667-2793>
>>>>>>> <tel:%28206%29%20667-2793>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Dr. Martin Morgan, PhD
>>>>>>>
>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>> 1100 Fairview Ave. N.
>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>
>>>>>>>
>>>>>>> [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> ______________________________**___________________
>>>>>>> Bioc-devel at r-project.org
>>>>>>> <mailto:Bioc-devel at r-project.**org<
>>>>> Bioc-devel at r-project.org>
>>>>>>>>
>>>>>>> mailing list
>>>>>>> https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel<
>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>>>>>>>
>>>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel<
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Hervé Pagès
>>>>>>>
>>>>>>> Program in Computational Biology
>>>>>>> Division of Public Health Sciences
>>>>>>>
>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>> P.O. Box 19024
>>>>>>> Seattle, WA 98109-1024
>>>>>>>
>>>>>>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>>>>>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>>>>>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> --
>>>>>> Hervé Pagès
>>>>>>
>>>>>> Program in Computational Biology
>>>>>> Division of Public Health Sciences
>>>>>> Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>> P.O. Box 19024
>>>>>> Seattle, WA 98109-1024
>>>>>>
>>>>>> E-mail: hpages at fhcrc.org
>>>>>> Phone: (206) 667-5791
>>>>>> Fax: (206) 667-1319
>>>>>>
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>>
>>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>
>> [[alternative HTML version deleted]]
>>
>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list