[Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels
Hervé Pagès
hpages at fhcrc.org
Tue Sep 17 21:10:45 CEST 2013
On 09/17/2013 11:30 AM, Michael Lawrence wrote:
> Sounds great. Could we have a more intuitive name? Like naturalOrder().
That would be confusing because order() does something different:
> order(c("chrY", "chr1", "chr9", "chr2"))
[1] 2 4 3 1
> makeSeqnameIds(c("chrY", "chr1", "chr9", "chr2"))
[1] 4 1 3 2
makeSeqnameIds() is more like rank() except on how it deals with
duplicates:
> makeSeqnameIds(c("chrY", "chr1", "chr9", "chr2", "chr9"))
[1] 4 1 3 2 3
Whatever 'ties.method' you use, rank() will put "chrY" at rank 5.
IIRC makeSeqnameIds() needs to support duplicates because of how it's
used internally in the makeTranscriptDb* functions.
Alternatively we could add sortSeqlevels() as a wrapper to
makeSeqnameIds() so you would be able to do:
seqlevels(gr) <- sortSeqlevels(seqlevels(gr))
Does that work?
Thanks,
H.
> And then the method to fix the GRanges could be restoreNaturalOrder().
> Just some suggestions...
>
>
>
>
> On Tue, Sep 17, 2013 at 10:58 AM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
> For sorting the chromosome names in "natural" order, you can use
> the makeSeqnameIds() helper function in GenomicRanges:
>
> seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(__seqlevels(gr))]
>
> It recognizes several naming conventions (chr-prefixed or not,
> roman, etc...) e.g.:
>
> > makeSeqnameIds(c("Y", "1", "9", "M", "2"))
> [1] 4 1 3 5 2
>
> > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
> [1] 4 1 3 5 2
>
> I still need to improve the documentation of this function, put
> more examples, and add unit tests. It's used internally by the
> makeTranscriptDb* functions in GenomicFeatures to assign internal
> ids to the chromosomes so that all the TranscriptDb extractors
> can then return GRanges and GRangesList objects with the seqlevels
> in the "natural" order.
>
> Cheers,
> H.
>
>
>
> On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:
>
> Actually, what I (and I think several others) just want is
> fixSeqnames()
> which sorts and fixes them to some convention. I don't really
> care which
> one, but I want to do some easy harmonization, preferably in
> line with
> other bioc tools. I know this is hard to do for all organisms,
> but having
> something that deals with the major ones (human, mouse, fruit
> fly, yeast,
> some monkeys) would be extremely valuable. If this function
> existed, I
> could just throw all my GRanges at it.
>
> Kasper
>
>
> On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence
> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>
>
> wrote:
>
>
> When I hit this I usually just do:
>
> seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))
>
> It would be nice if there were a function for sorting
> chromosome names
> according to a specified naming convention. Like
> sortSeqlevels(gr,
> UCSCNaming()) or the alternative replacement syntax:
> seqinfo(gr) <-
> sort(seqinfo(gr), UCSCNaming()). Although sort is only
> single-dispatch.
>
> Michael
>
>
>
>
>
>
> On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
> <stvjc at channing.harvard.edu
> <mailto:stvjc at channing.harvard.edu>>__wrote:
>
> I am replying to this as Michael mentions that this is a
> powerful
> operation. Here's my
> problem that I believe is related, but I do not see a
> straightforward
> solution
>
> bhmm19uni
>
> GRanges with 559456 ranges and 4 metadata columns:
> seqnames ranges strand |
> name
> score
> <Rle> <IRanges> <Rle> |
> <character>
> <numeric>
> 1 chr1 [ 67229025, 67341224] * |
> 13_Heterochrom/lo
> 0
> 2 chr1 [203057955, 203060154] * |
> 12_Repressed
> 0
> 3 chr1 [ 8458427, 8486226] * |
> 13_Heterochrom/lo
> 0
> 4 chr1 [ 16904427, 16904826] * |
> 5_Strong_Enhancer
> 0
> 5 chr1 [ 25289427, 25293426] * |
> 11_Weak_Txn
> 0
> ... ... ... ... ...
> ...
> ...
> 570766 chr22 [51100931, 51101330] * |
> 2_Weak_Promoter
> 0
> 570767 chr22 [51101331, 51101530] * |
> 6_Weak_Enhancer
> 0
> 570768 chr22 [51101531, 51101730] * |
> 2_Weak_Promoter
> 0
> 570769 chr22 [51101731, 51178130] * |
> 13_Heterochrom/lo
> 0
> 570770 chr22 [51178131, 51178330] * |
> 15_Repetitive/CNV
> 0
> thick numcode
> <IRanges> <character>
> 1 [ 67001613, 67113812] 13
> 2 [201324578, 201326777] 12
> 3 [ 8381014, 8408813] 13
> 4 [ 16777014, 16777413] 5
> 5 [ 25162014, 25166013] 11
> ... ... ...
> 570766 [49447797, 49448196] 2
> 570767 [49448197, 49448396] 6
> 570768 [49448397, 49448596] 2
> 570769 [49448597, 49524996] 13
> 570770 [49524997, 49525196] 15
> ---
> seqlengths:
> chr1 chr10 chr11 chr5 ...
> chr2 chr21
> chr4
> 249250621 135534747 135006516 180915260 ...
> 243199373 48129895
> 191154276
>
> If I try to make a karyogram with ggbio, the chromosomes
> are in an
> unnatural order. What is the right approach to getting
> the seqinfo in a
> natural order with respect to chromosome names and
> lengths? Reassigning
> seqlevels seems dangerous but I haven't experimented
> fully. It must be a
> common use case but I do not see it in doc.
>
>
>
> On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
> lawrence.michael at gene.com
> <mailto:lawrence.michael at gene.com>> wrote:
>
> On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès
> <hpages at fhcrc.org <mailto:hpages at fhcrc.org>> wrote:
>
> Michael,
>
>
> On 05/20/2013 08:44 PM, Michael Lawrence wrote:
>
>
>
>
> On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès
> <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> <mailto:hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>>> wrote:
>
> Hi,
>
>
> On 05/20/2013 03:15 PM, Michael
> Lawrence wrote:
>
> On Mon, May 20, 2013 at 3:11 PM,
> Martin Morgan
> <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>
> <mailto:mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>>> wrote:
>
> Hi Michael --
>
>
> On 5/20/2013 5:56 AM, Michael
> Lawrence wrote:
>
> While it's cool that
> seqlevels<- has become so
>
> flexible,
>
> I still claim
> that
> rename/keep/drop would be
> a lot easier to read,
>
> because
>
> they describe the
> high-level operation, and
> there's no reason to
>
> deprecate
>
> them. They're
> also
> easier to remember. For
> example, for dropping with
> seqlevels<-, the user
> needs
> to remember that "force"
> is necessary to drop. Much
> easier to just say
> "dropSeqlevels(), please".
> And reimplementing
> keepSeqlevels is still too
> complicated. Is it such a
> maintenance burden to have
> these simple
> wrappers sit
> on top of the low-level
> manipulators?
>
> Another suggestion:
> renameSeqlevels should not
>
> require
>
> a
>
> named vector (in
> cases
> where it is unnamed,
> require that it is parallel to
>
> the
>
> existing
> seqlevels, as
> with seqlevels<-).
>
>
> I didn't really indicate what
> drove my desire to see
> keepSeqlevels
> deprecated. keepSeqlevels,
> seqlevels<-, and isActiveSeq
>
> were
>
> developed more
> or less independently, and
> have different contracts. The
> different
> contracts are confusing to the
> user, as is creating a
>
> usable
>
> help system
> when there are multiple end
> points for what is a common
> operation. The help
> pages of each were inadequate
> in different ways. And
>
> because
>
> the code was
> developed independently,
> support for different objects
>
> was
>
> inconsistent. So
> actually, yes, the maintenance
> (and use) burden for the
> previous state of
> affairs was substantial.
>
> On the other hand, I agree
> that keepSeqlevels is
>
> convenient
>
> as a simple
> wrapper around seqlevels<-, in
> the same way that setNames
> and names<- are
> both useful.
>
> So we could iterate to
>
> keepSeqlevels <-
> function(x, value, ...)
> {
> seqlevels(x,
> force=TRUE) <- value
> x
> }
>
> but I'd be less enthusiastic
> about maintaining the
>
> original
>
> contract of
> keepSeqlevels, where
> keepSeqlevels(gr1, gr2), would have
> worked for two
> GRanges objects.
>
>
> Why would this be called
> keepSeqlevels() if, depending on how
>
> it's
>
> used,
> it will either add, drop, rename, or
> permute the seqlevels?
> Couldn't this be called setSeqlevels?
>
>
> I thought that new2old was necessary for
> permuting.
>
>
> The seqlevels setter has no 'new2old' arg.
>
>
> You're right that
>
> adding should be disallowed for
> keepSeqlevels(). Adding seqlevels is
>
> not
>
> a common operation. The two common
> operations are:
> * Permuting, usually because the data were
> imported in non-canonical
> order (seqnameStyle was designed to address
> this, no?).
>
>
> Permuting is straightforward with seqlevels<-:
>
> > seqlevels(gr)
> [1] "chr1" "chr2" "chr3"
>
> > seqlevels(gr) <- rev(seqlevels(gr))
>
> > seqlevels(gr)
> [1] "chr3" "chr2" "chr1"
>
>
> * Subsetting, either by keeping or dropping.
>
>
>
> Also straightforward:
>
> > seqlevels(gr, force=TRUE) <- "chr2"
>
> > seqlevels(gr)
> [1] "chr2"
>
>
> Two main reasons: taking a
>
> small slice of the data for prototyping, or
> removing problematic
> chromosomes (sex, circular). This goes back
> to bsapply and the
>
> exclude
>
> argument.
>
>
> There are other important use cases:
>
> - Dropping seqlevels that are *not* in use.
> This happens for example
> when subsetting a BAM file with
> Rsamtools::filterBam to keep only
> the alignments located on a given
> chromosome. The entire sequence
> dictionary (in the header) is propagated
> to the resulting BAM file
> so when you read the file with
> readGAlignments() you end up with a
> bunch of useless seqlevels that you
> generally start by dropping.
> You don't want to drop any alignment so
> force=FALSE is your
>
> friend.
>
>
>
> This is sort of tangential to the discussion, but do
> you really want to
>
> do
>
> this? I would preserve the universe as given by the BAM.
>
>
> - An even more common operation is renaming:
> 90% of the times that's
> what I use seqlevels<- for, and that's
> what I tell people to use
> when they need to rename their
> chromosomes. Also straightforward:
>
> > seqlevels(gr)
> [1] "chr1" "chr2" "chr3" "chrM"
>
> > seqlevels(gr) <- sub("^chr", "",
> seqlevels(gr))
> > seqlevels(gr)
> [1] "1" "2" "3" "M"
>
> > seqlevels(gr)[seqlevels(gr) == "M"] <-
> "MT"
> > seqlevels(gr)
> [1] "1" "2" "3" "MT"
>
> Note that this is simpler to use than
> renameSeqlevels() which
> always required you to build the entire
> right value as a named
> vector.
>
>
> Sure, renaming is a common use case. Not sure how I
> forgot that.
>
> As I wrote earlier in the thread, renameSeqlevels
> should be changed so
>
> as
>
> not to require naming of the vector.
>
>
> So the added-value of keepSeqlevels() seems to
> boil down to:
>
> (a) it always uses force=TRUE
>
> (b) it preserves the original object and
> returns the modified object
>
> If you want to restrict the use of
> keepSeqlevels() to permuting and
> dropping, you'll also have to disallow renaming
> (in addition to
>
> adding).
>
> Then its name will more or less reflect what it
> does (if "keep" means
> "permute" or "drop"). The final result will be
> something that does
>
> less
>
> than setSeqlevels() and that is not necessarily
> easier to read: both
> will set on the object whatever vector of
> seqlevels they are supplied,
> except that one will fail when doing so would
> actually result in
>
> adding
>
> or renaming seqlevels.
>
>
> So it looks like seqlevels<- is pretty powerful. Now
> I see that if I
> scroll
> down to the examples in the man page, I find the
> actual documentation.
> It's
> cool to learn that we can replace seqlevels on TxDb
> objects. My argument
> has always been that since these low-level
> operations are so powerful,
> it's
> nice to have high-level operations to clarify the
> code. We have to think
> hard to know what the RHS will do in that
> replacement. It's probably one
> of
> the most complex replacement operations; far more
> complex than the
>
> typical
>
> one, including levels<- on factor. There's nothing
> wrong with that; it's
> just complexity that we should be able to hide.
>
> Michael
>
> H.
>
>
>
> Michael
>
>
> H.
>
>
>
> Well, I wasn't even aware of that
> feature, so you've made
>
> your
>
> point about
> the documentation ;) Sounds like a
> good compromise to me.
>
> Thanks for understanding,
> Michael
>
>
>
> Martin
>
>
> Michael
>
>
>
>
>
>
> On Sat, May 18, 2013 at
> 6:00 PM, Martin Morgan
> <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>
> <mailto:mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>>
> <mailto:mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org> <mailto:
>
> mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>
>
>
> wrote:
>
> On 05/18/2013 05:42
> PM, Martin Morgan wrote:
>
> Some of the most
> common operations are
> straight-forward, e.g.,
>
> > gr =
> GRanges(paste0("chr", c(1:22,
> "X", "Y")), IRanges(1,
> 100))
> >
> seqlevels(gr) = sub("chr", "ch",
> seqlevels(gr))
>
>
> To get a more
> comprehensive example I should
>
> have
>
> followed my own
> advice and
> grabbed from the
> help page!
>
> ## Rename:
>
> seqlevels(txdb) <- sub("chr", "",
> seqlevels(txdb))
> seqlevels(txdb)
>
>
> seqlevels(txdb) <- paste0("CH",
> seqlevels(txdb))
> seqlevels(txdb)
>
>
> seqlevels(txdb)[seqlevels(__**__**__txdb)
>
> ==
>
>
>
> "CHM"] <- "M"
>
>
> seqlevels(txdb)
>
>
> --
> Computational
> Biology / Fred Hutchinson Cancer
> Research Center
> 1100 Fairview Ave. N.
> PO Box 19024
> Seattle, WA 98109
>
> Location: Arnold
> Building M1 B861
> Phone: (206)
> 667-2793 <tel:%28206%29%20667-2793>
>
> <tel:%28206%29%20667-2793>
>
> <tel:%28206%29%20667-2793>
>
>
>
>
> --
> Dr. Martin Morgan, PhD
>
> Fred Hutchinson Cancer
> Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
>
> [[alternative HTML
> version deleted]]
>
>
> ________________________________**___________________
> Bioc-devel at r-project.org
> <mailto:Bioc-devel at r-project.org>
> <mailto:Bioc-devel at r-project.
> <mailto:Bioc-devel at r-project.>*__*org<
>
> Bioc-devel at r-project.org
> <mailto:Bioc-devel at r-project.org>>
>
>
> mailing list
> https://stat.ethz.ch/mailman/___**_listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel><
>
> https://stat.ethz.ch/mailman/____listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
>
>
>
> <https://stat.ethz.ch/mailman/__**listinfo/bioc-devel <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>
> <mailto:hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>>
> Phone: (206) 667-5791
> <tel:%28206%29%20667-5791>
> <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319
> <tel:%28206%29%20667-1319>
> <tel:%28206%29%20667-1319>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
> [[alternative HTML version deleted]]
>
>
> _________________________________________________
> Bioc-devel at r-project.org
> <mailto:Bioc-devel at r-project.org> mailing list
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>
>
> [[alternative HTML version deleted]]
>
>
> _________________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> mailing list
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>
> [[alternative HTML version deleted]]
>
>
>
> _________________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> mailing list
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list