[Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels
Hervé Pagès
hpages at fhcrc.org
Tue Sep 17 22:22:18 CEST 2013
OK I just made sortSeqlevels() a generic with a method for character
vectors and a default method. So you can do sortSeqlevels() directly on
a Seqinfo object, a GRanges object and anything that contains a Seqinfo
component:
> gr
GRanges with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
---
seqlengths:
chr2RHet chr4 chrUextra chrYHet ... chr3R chr2R
chrX
NA NA NA NA ... NA NA
NA
> sortSeqlevels(gr)
GRanges with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
---
seqlengths:
chr2R chr3L chr3R chr4 ... chrXHet chrYHet
chrUextra
NA NA NA NA ... NA NA
NA
H.
On 09/17/2013 01:05 PM, Michael Lawrence wrote:
> Thanks, Hervé, that's great.
>
> Would it make sense to have a sort or sortSeqlevels method for Seqinfo
> and then a sortSeqlevels,ANY method that just does seqinfo(x) <-
> sort(seqinfo(x))?
>
> Michael
>
>
>
> On Tue, Sep 17, 2013 at 12:56 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
> On 09/17/2013 10:58 AM, Hervé Pagès wrote:
>
> For sorting the chromosome names in "natural" order, you can use
> the makeSeqnameIds() helper function in GenomicRanges:
>
> seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(__seqlevels(gr))]
>
>
> I meant:
>
> seqlevels(gr) <-
> seqlevels(gr)[order(__makeSeqnameIds(seqlevels(gr)))__]
>
> Probably the source of Michael's confusion... sorry.
>
> sortSeqlevels() just added to GenomicRanges 1.13.44:
>
> seqlevels <- c("chr2RHet", "chr4", "chrUextra", "chrYHet",
> "chrM", "chrXHet", "chr2LHet", "chrU",
> "chr3L", "chr3R", "chr2R", "chrX")
>
> Then:
>
> > sortSeqlevels(seqlevels)
> [1] "chr2R" "chr3L" "chr3R" "chr4" "chrX" "chrU"
> [7] "chrM" "chr2LHet" "chr2RHet" "chrXHet" "chrYHet"
> "chrUextra"
>
> > sortSeqlevels(seqlevels, X.is.sexchrom=TRUE)
> [1] "chr2R" "chr3L" "chr3R" "chr4" "chrX" "chrU"
> [7] "chrM" "chr2LHet" "chr2RHet" "chrXHet" "chrYHet"
> "chrUextra"
>
> H.
>
>
>
>
> It recognizes several naming conventions (chr-prefixed or not,
> roman, etc...) e.g.:
>
> > makeSeqnameIds(c("Y", "1", "9", "M", "2"))
> [1] 4 1 3 5 2
>
> > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
> [1] 4 1 3 5 2
>
> I still need to improve the documentation of this function, put
> more examples, and add unit tests. It's used internally by the
> makeTranscriptDb* functions in GenomicFeatures to assign internal
> ids to the chromosomes so that all the TranscriptDb extractors
> can then return GRanges and GRangesList objects with the seqlevels
> in the "natural" order.
>
> Cheers,
> H.
>
>
> On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:
>
> Actually, what I (and I think several others) just want is
> fixSeqnames()
> which sorts and fixes them to some convention. I don't
> really care which
> one, but I want to do some easy harmonization, preferably in
> line with
> other bioc tools. I know this is hard to do for all
> organisms, but
> having
> something that deals with the major ones (human, mouse,
> fruit fly, yeast,
> some monkeys) would be extremely valuable. If this function
> existed, I
> could just throw all my GRanges at it.
>
> Kasper
>
>
> On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence
> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>
>
> wrote:
>
>
> When I hit this I usually just do:
>
> seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))
>
> It would be nice if there were a function for sorting
> chromosome names
> according to a specified naming convention. Like
> sortSeqlevels(gr,
> UCSCNaming()) or the alternative replacement syntax:
> seqinfo(gr) <-
> sort(seqinfo(gr), UCSCNaming()). Although sort is only
> single-dispatch.
>
> Michael
>
>
>
>
>
>
> On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
> <stvjc at channing.harvard.edu
> <mailto:stvjc at channing.harvard.edu>>__wrote:
>
> I am replying to this as Michael mentions that this
> is a powerful
> operation. Here's my
> problem that I believe is related, but I do not see
> a straightforward
> solution
>
> bhmm19uni
>
> GRanges with 559456 ranges and 4 metadata columns:
> seqnames ranges strand
> | name
> score
> <Rle> <IRanges> <Rle>
> | <character>
> <numeric>
> 1 chr1 [ 67229025, 67341224] *
> | 13_Heterochrom/lo
> 0
> 2 chr1 [203057955, 203060154] *
> | 12_Repressed
> 0
> 3 chr1 [ 8458427, 8486226] *
> | 13_Heterochrom/lo
> 0
> 4 chr1 [ 16904427, 16904826] *
> | 5_Strong_Enhancer
> 0
> 5 chr1 [ 25289427, 25293426] *
> | 11_Weak_Txn
> 0
> ... ... ... ...
> ... ...
> ...
> 570766 chr22 [51100931, 51101330] *
> | 2_Weak_Promoter
> 0
> 570767 chr22 [51101331, 51101530] *
> | 6_Weak_Enhancer
> 0
> 570768 chr22 [51101531, 51101730] *
> | 2_Weak_Promoter
> 0
> 570769 chr22 [51101731, 51178130] *
> | 13_Heterochrom/lo
> 0
> 570770 chr22 [51178131, 51178330] *
> | 15_Repetitive/CNV
> 0
> thick numcode
> <IRanges> <character>
> 1 [ 67001613, 67113812] 13
> 2 [201324578, 201326777] 12
> 3 [ 8381014, 8408813] 13
> 4 [ 16777014, 16777413] 5
> 5 [ 25162014, 25166013] 11
> ... ... ...
> 570766 [49447797, 49448196] 2
> 570767 [49448197, 49448396] 6
> 570768 [49448397, 49448596] 2
> 570769 [49448597, 49524996] 13
> 570770 [49524997, 49525196] 15
> ---
> seqlengths:
> chr1 chr10 chr11 chr5 ...
> chr2 chr21
> chr4
> 249250621 135534747 135006516 180915260 ...
> 243199373 48129895
> 191154276
>
> If I try to make a karyogram with ggbio, the
> chromosomes are in an
> unnatural order. What is the right approach to
> getting the seqinfo
> in a
> natural order with respect to chromosome names and
> lengths?
> Reassigning
> seqlevels seems dangerous but I haven't experimented
> fully. It must
> be a
> common use case but I do not see it in doc.
>
>
>
> On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
> lawrence.michael at gene.com
> <mailto:lawrence.michael at gene.com>> wrote:
>
> On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès
> <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
> wrote:
>
> Michael,
>
>
> On 05/20/2013 08:44 PM, Michael Lawrence wrote:
>
>
>
>
> On Mon, May 20, 2013 at 4:13 PM, Hervé
> Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>
> <mailto:hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>>> wrote:
>
> Hi,
>
>
> On 05/20/2013 03:15 PM, Michael
> Lawrence wrote:
>
> On Mon, May 20, 2013 at 3:11
> PM, Martin Morgan
> <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>
> <mailto:mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>>> wrote:
>
> Hi Michael --
>
>
> On 5/20/2013 5:56 AM,
> Michael Lawrence wrote:
>
> While it's cool that
> seqlevels<- has become so
>
> flexible,
>
> I still claim
> that
> rename/keep/drop would
> be a lot easier to read,
>
> because
>
> they describe the
> high-level operation,
> and there's no reason to
>
> deprecate
>
> them. They're
> also
> easier to remember.
> For example, for dropping with
> seqlevels<-, the user
> needs
> to remember that
> "force" is necessary to drop. Much
> easier to just say
> "dropSeqlevels(),
> please". And reimplementing
> keepSeqlevels is still too
> complicated. Is it
> such a maintenance burden to
> have
> these simple
> wrappers sit
> on top of the
> low-level manipulators?
>
> Another suggestion:
> renameSeqlevels should not
>
> require
>
> a
>
> named vector (in
> cases
> where it is unnamed,
> require that it is parallel to
>
> the
>
> existing
> seqlevels, as
> with seqlevels<-).
>
>
> I didn't really indicate
> what drove my desire to see
> keepSeqlevels
> deprecated. keepSeqlevels,
> seqlevels<-, and isActiveSeq
>
> were
>
> developed more
> or less independently, and
> have different contracts.
> The
> different
> contracts are confusing to
> the user, as is creating a
>
> usable
>
> help system
> when there are multiple
> end points for what is a common
> operation. The help
> pages of each were
> inadequate in different ways. And
>
> because
>
> the code was
> developed independently,
> support for different objects
>
> was
>
> inconsistent. So
> actually, yes, the
> maintenance (and use) burden for the
> previous state of
> affairs was substantial.
>
> On the other hand, I agree
> that keepSeqlevels is
>
> convenient
>
> as a simple
> wrapper around
> seqlevels<-, in the same way that
> setNames
> and names<- are
> both useful.
>
> So we could iterate to
>
> keepSeqlevels <-
> function(x, value,
> ...)
> {
> seqlevels(x,
> force=TRUE) <- value
> x
> }
>
> but I'd be less
> enthusiastic about maintaining the
>
> original
>
> contract of
> keepSeqlevels, where
> keepSeqlevels(gr1, gr2), would
> have
> worked for two
> GRanges objects.
>
>
> Why would this be called
> keepSeqlevels() if, depending on how
>
> it's
>
> used,
> it will either add, drop, rename,
> or permute the seqlevels?
> Couldn't this be called setSeqlevels?
>
>
> I thought that new2old was necessary for
> permuting.
>
>
> The seqlevels setter has no 'new2old' arg.
>
>
> You're right that
>
> adding should be disallowed for
> keepSeqlevels(). Adding seqlevels is
>
> not
>
> a common operation. The two common
> operations are:
> * Permuting, usually because the data
> were imported in non-canonical
> order (seqnameStyle was designed to
> address this, no?).
>
>
> Permuting is straightforward with seqlevels<-:
>
> > seqlevels(gr)
> [1] "chr1" "chr2" "chr3"
>
> > seqlevels(gr) <- rev(seqlevels(gr))
>
> > seqlevels(gr)
> [1] "chr3" "chr2" "chr1"
>
>
> * Subsetting, either by keeping or dropping.
>
>
>
> Also straightforward:
>
> > seqlevels(gr, force=TRUE) <- "chr2"
>
> > seqlevels(gr)
> [1] "chr2"
>
>
> Two main reasons: taking a
>
> small slice of the data for prototyping,
> or removing problematic
> chromosomes (sex, circular). This goes
> back to bsapply and the
>
> exclude
>
> argument.
>
>
> There are other important use cases:
>
> - Dropping seqlevels that are *not* in
> use. This happens for
> example
> when subsetting a BAM file with
> Rsamtools::filterBam to keep
> only
> the alignments located on a given
> chromosome. The entire
> sequence
> dictionary (in the header) is
> propagated to the resulting BAM
> file
> so when you read the file with
> readGAlignments() you end up
> with a
> bunch of useless seqlevels that you
> generally start by dropping.
> You don't want to drop any alignment
> so force=FALSE is your
>
> friend.
>
>
>
> This is sort of tangential to the discussion,
> but do you really
> want to
>
> do
>
> this? I would preserve the universe as given by
> the BAM.
>
>
> - An even more common operation is
> renaming: 90% of the times
> that's
> what I use seqlevels<- for, and that's
> what I tell people to use
> when they need to rename their
> chromosomes. Also
> straightforward:
>
> > seqlevels(gr)
> [1] "chr1" "chr2" "chr3" "chrM"
>
> > seqlevels(gr) <- sub("^chr", "",
> seqlevels(gr))
> > seqlevels(gr)
> [1] "1" "2" "3" "M"
>
> > seqlevels(gr)[seqlevels(gr) ==
> "M"] <- "MT"
> > seqlevels(gr)
> [1] "1" "2" "3" "MT"
>
> Note that this is simpler to use than
> renameSeqlevels() which
> always required you to build the
> entire right value as a named
> vector.
>
>
> Sure, renaming is a common use case. Not sure
> how I forgot that.
>
> As I wrote earlier in the thread,
> renameSeqlevels should be changed so
>
> as
>
> not to require naming of the vector.
>
>
> So the added-value of keepSeqlevels() seems
> to boil down to:
>
> (a) it always uses force=TRUE
>
> (b) it preserves the original object and
> returns the modified
> object
>
> If you want to restrict the use of
> keepSeqlevels() to permuting and
> dropping, you'll also have to disallow
> renaming (in addition to
>
> adding).
>
> Then its name will more or less reflect what
> it does (if "keep" means
> "permute" or "drop"). The final result will
> be something that does
>
> less
>
> than setSeqlevels() and that is not
> necessarily easier to read: both
> will set on the object whatever vector of
> seqlevels they are
> supplied,
> except that one will fail when doing so
> would actually result in
>
> adding
>
> or renaming seqlevels.
>
>
> So it looks like seqlevels<- is pretty powerful.
> Now I see that if I
> scroll
> down to the examples in the man page, I find the
> actual documentation.
> It's
> cool to learn that we can replace seqlevels on
> TxDb objects. My
> argument
> has always been that since these low-level
> operations are so powerful,
> it's
> nice to have high-level operations to clarify
> the code. We have to
> think
> hard to know what the RHS will do in that
> replacement. It's
> probably one
> of
> the most complex replacement operations; far
> more complex than the
>
> typical
>
> one, including levels<- on factor. There's
> nothing wrong with that;
> it's
> just complexity that we should be able to hide.
>
> Michael
>
> H.
>
>
>
> Michael
>
>
> H.
>
>
>
> Well, I wasn't even aware of
> that feature, so you've made
>
> your
>
> point about
> the documentation ;) Sounds
> like a good compromise to me.
>
> Thanks for understanding,
> Michael
>
>
>
> Martin
>
>
> Michael
>
>
>
>
>
>
> On Sat, May 18, 2013
> at 6:00 PM, Martin Morgan
> <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>
> <mailto:mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>>
>
> <mailto:mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org> <mailto:
>
> mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>
>
>
> wrote:
>
> On 05/18/2013
> 05:42 PM, Martin Morgan wrote:
>
> Some of the
> most common operations are
> straight-forward, e.g.,
>
> > gr =
> GRanges(paste0("chr",
> c(1:22,
> "X", "Y")), IRanges(1,
> 100))
> >
> seqlevels(gr) = sub("chr", "ch",
> seqlevels(gr))
>
>
> To get a more
> comprehensive example I should
>
> have
>
> followed my own
> advice and
> grabbed from the
> help page!
>
> ## Rename:
>
> seqlevels(txdb) <- sub("chr", "",
> seqlevels(txdb))
>
> seqlevels(txdb)
>
>
> seqlevels(txdb) <- paste0("CH",
> seqlevels(txdb))
>
> seqlevels(txdb)
>
>
> seqlevels(txdb)[seqlevels(__**__**__txdb)
>
> ==
>
>
>
> "CHM"] <- "M"
>
>
>
> seqlevels(txdb)
>
>
> --
> Computational
> Biology / Fred Hutchinson
> Cancer
> Research Center
> 1100 Fairview
> Ave. N.
> PO Box 19024
> Seattle, WA 98109
>
> Location: Arnold
> Building M1 B861
> Phone: (206)
> 667-2793 <tel:%28206%29%20667-2793>
>
> <tel:%28206%29%20667-2793>
>
> <tel:%28206%29%20667-2793>
>
>
>
>
> --
> Dr. Martin Morgan, PhD
>
> Fred Hutchinson Cancer
> Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
>
> [[alternative HTML
> version deleted]]
>
>
> ________________________________**___________________
> Bioc-devel at r-project.org
> <mailto:Bioc-devel at r-project.org>
> <mailto:Bioc-devel at r-project.
> <mailto:Bioc-devel at r-project.>*__*org<
>
> Bioc-devel at r-project.org
> <mailto:Bioc-devel at r-project.org>>
>
>
> mailing list
> https://stat.ethz.ch/mailman/___**_listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel><
>
> https://stat.ethz.ch/mailman/____listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
>
>
>
> <https://stat.ethz.ch/mailman/__**listinfo/bioc-devel <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>
> <mailto:hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>>
> Phone: (206) 667-5791
> <tel:%28206%29%20667-5791>
> <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319
> <tel:%28206%29%20667-1319>
> <tel:%28206%29%20667-1319>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
> [[alternative HTML version deleted]]
>
>
> _________________________________________________
> Bioc-devel at r-project.org
> <mailto:Bioc-devel at r-project.org> mailing list
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>
>
> [[alternative HTML version deleted]]
>
>
> _________________________________________________
> Bioc-devel at r-project.org
> <mailto:Bioc-devel at r-project.org> mailing list
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>
> [[alternative HTML version deleted]]
>
>
>
> _________________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> mailing list
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list