[Bioc-devel] isActiveSeq deprecated

Hervé Pagès hpages at fhcrc.org
Tue Sep 24 08:22:00 CEST 2013


Hi Florian, Marc,

On 09/18/2013 11:55 PM, Hahne, Florian wrote:
> Hi Marc, Herve,
> I also noticed this behaviour:
>
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> seqlevels(txdb, force=TRUE) <- c("chr1", "ch2")
> oldLevs <- seqlevels(txdb)
> seqlevels(txdb, force=TRUE) <- "chr1"
>
> seqlevels(txdb, force=TRUE) <- oldLevs
> Error in .seqinfo.TranscriptDbReplace(x, new2old = new2old, force = force,
>   :
>    The replacement value must be either a 1 to 1 replacement or a subset of
> the original set when replacing the 'seqinfo' of a TranscriptDb object
>
>
> But:
>
> restoreSeqlevels(txdb)
> seqlevels(txdb, force=TRUE) <- oldLevs
>
>
> This is probably intentional, but I found it to be rather confusing. If
> one should be able to use the seqlevels replacement method to control
> active and inactive chromosomes in the TranscriptDb object, then having
> this mandatory step of restoring all chromosomes to the active state
> followed by another restriction to be quite cumbersome. At least a
> somewhat more useful error message would help in that case. Or couldn't
> the replacement method always call restoreSeqlevels internally.

Somewhat related to this, I also found the following issue:

   > library(TxDb.Hsapiens.UCSC.hg19.knownGene)
   > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
   > seqlevels(txdb, force=TRUE) <- "chrM"
   > seqlengths(txdb)
    chrM
   16571
   > seqlevels(txdb) <- "chr1"
   > seqlengths(txdb)
    chr1
   16571

The 2nd call to the seqlevels() setter actually performed a
*renaming* operation (a clue for this is that I didn't have to
use force=TRUE). Renaming the seqlevels of a TranscriptDb object
is something I think we want to support, e.g. when there is the
need to use a different naming convention (like in
seqlevels(txdb) <- "M"). This is exactly the syntax that one
would use to rename the seqlevels of a GRanges object:

   > gr <- GRanges("chrM", IRanges(1:2, 10))
   > seqlevels(gr) <- "chr1"
   > gr
   GRanges with 2 ranges and 0 metadata columns:
         seqnames    ranges strand
            <Rle> <IRanges>  <Rle>
     [1]     chr1   [1, 10]      *
     [2]     chr1   [2, 10]      *
     ---
     seqlengths:
      chr1
        NA

There is nothing that prevents the user from doing this kind of silly
renaming except that, in the case of the TranscriptDb object, this is
almost certainly not what the user intended to do. What s/he really
wanted was selecting chr1 instead of chrM, which can be achieved with:

   > restoreSeqlevels(txdb)
   > seqlevels(txdb, force=TRUE) <- "chr1"
   > seqlengths(txdb)
        chr1
   249250621

Yes having to restore all chromosomes to the active state before one
can make a new selection is cumbersome so we probably need to revisit
this.

>
> It would also help to point out somewhere (I may have missed it) that
> TranscriptDb seqlevels (or rather the internal GRanges) are actually pass
> by reference:
>
> txdb2 <- txdb
> seqlevels(txdb)
> seqlevels(txdb2)
>
> restoreSeqlevels(txdb)
> seqlevels(txdb2)

That's because TranscriptDb is a reference class. IMHO it shouldn't.

>
> Also Herve: what is the definition of a "used" seqlevel? Does that mean
> that a factor level is defined, but no range in the GRanges object is
> assigned this level?

Yes. Like here:

   > gr <- GRanges("chrM", IRanges(1:2, 10), seqlengths=c(chr1=5000, 
chrM=25))
   > gr
   GRanges with 2 ranges and 0 metadata columns:
         seqnames    ranges strand
            <Rle> <IRanges>  <Rle>
     [1]     chrM   [1, 10]      *
     [2]     chrM   [2, 10]      *
     ---
     seqlengths:
      chr1 chrM
      5000   25

Only chrM is in use:

   > seqlevelsInUse(gr)
   [1] "chrM"

The idiom to drop seqlevels that are not in use is:

   > seqlevels(gr) <- seqlevelsInUse(gr)
   > gr
   GRanges with 2 ranges and 0 metadata columns:
         seqnames    ranges strand
            <Rle> <IRanges>  <Rle>
     [1]     chrM   [1, 10]      *
     [2]     chrM   [2, 10]      *
     ---
     seqlengths:
      chrM
        25

Of course, in that case, force=TRUE is not needed because the operation
is guaranteed to not "shrink" the GRanges object.

> Why would those exists in a TranscriptDb object at
> all?

The notion of "seqlevels in use" is not formally defined for a
TranscriptDb object:

   > seqlevelsInUse(txdb)
   Error in (function (classes, fdef, mtable)  :
     unable to find an inherited method for function ‘seqlevelsInUse’ 
for signature ‘"TranscriptDb"’

but it could be: a seqlevel or chromosome could be considered to be
in use if there is at least 1 feature (transcript, exon or CDS) in
the db that is on it. I think for most TranscriptDb objects, all the
seqlevels are actually in use. However it's conceivable that a
TranscriptDb object could have some extra seqlevels that are not in
use (the seqlevels + their lengths + their circularity flags are stored
in a separate table, the 'chrominfo' table).

Anyway, when I added the seqlevelsInUse() generic a few months ago,
I didn't write a method for TranscriptDb objects because I couldn't
think of a use case for it. And also because without any change to
the current db schema, this would be a costly operation as it would
need to scan the entire database, which cannot be done with a single
query.

Now with the seqlevels() setter allowing to reduce the seqlevels
of a TranscriptDb object, we face the following choices: (a)
implement the seqlevelsInUse,TranscriptDb method and make the
'force' arg of the seqlevels() setter behave consistently with
that, or (b) not implement the seqlevelsInUse,TranscriptDb method
and make the 'force' arg meaningless. As I said earlier, my preference
goes for (b).

H.

>
> Florian
>
>
>
>
>
> On 9/18/13 9:08 PM, "Hervé Pagès" <hpages at fhcrc.org> wrote:
>
>> Note that currently it's kind of inconsistent anyway because it doesn't
>> look at the seqlevels that are in use. For example:
>>
>>    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>
>> Trying to drop chrUn_gl000249 (which I know is not in use):
>>
>>    > seqlevels(txdb) <- setdiff(seqlevels(txdb), "chrUn_gl000249")
>>    Error in .seqinfo.TranscriptDbReplace(x, new2old = new2old, force =
>> force,  :
>>      You need to use force=TRUE if you want to drop seqlevels.
>>
>> 'force=TRUE' should only be required when trying to drop seqlevels
>> that are in use.
>>
>> So the choice are more between: (a) consistent, (b) convenient,
>> or (c) leave it as it is (which is neither convenient or consistent).
>>
>> I vote for (b).
>>
>> H.
>>
>>
>> On 09/18/2013 11:31 AM, Marc Carlson wrote:
>>> I actually considered this, but I opted to do it this way just for the
>>> sake of being consistent (which was my whole mission for implementing
>>> seqlevels in here in the 1st place).  Now I could make it more
>>> convenient here and break consistency with how it is used elsewhere, but
>>> what do people prefer?
>>>
>>> Consistent or convenient?
>>>
>>>
>>>     Marc
>>>
>>>
>>>
>>> On 09/18/2013 10:40 AM, Hervé Pagès wrote:
>>>> Hi Marc,
>>>>
>>>> Wouldn't it make sense to just ignore the 'force' arg when
>>>> dropping the seqlevels of a TranscriptDb?
>>>>
>>>> The 'force' argument is FALSE by default and this prevents
>>>> seqlevels<- to shrink GRanges or other vector-like objects
>>>> when the user tries to drop seqlevels that are in use.
>>>> Internally seqlevels<- calls seqlevelsInUse() to get the
>>>> seqlevels currently in use and see if they intersect with
>>>> the seqlevels to drop.
>>>>
>>>> In the TranscriptDb situation, people always have to use
>>>> 'force=TRUE' to drop seqlevels, regardless of whether the
>>>> levels to drop are in use or not (the seqlevelsInUse()
>>>> getter not being defined for TranscriptDb objects, I suspect
>>>> seqlevels<- doesn't look at this).
>>>>
>>>> So maybe 'force' could just be ignored for TranscriptDb objects?
>>>> That would make seqlevels<- a little bit more user-friendly on
>>>> those objects.
>>>>
>>>> Thanks,
>>>> H.
>>>>
>>>>
>>>> On 09/13/2013 10:38 AM, Marc Carlson wrote:
>>>>> Hi Florian,
>>>>>
>>>>> Yes we are trying to make things more uniform.  seqlevels() lets you
>>>>> rename as well as deactivate chromosomes you want to ignore, so it was
>>>>> really redundant with isActiveSeq().  So we are moving away from
>>>>> isActiveSeq() just so that users have less to learn about.  The reason
>>>>> why isActiveSeq was different from seqlevels was just because it was
>>>>> born for a TranscriptDb (which is based on an annotation database)
>>>>> instead of being born on a GRanges object.  So seqlevels was the more
>>>>> general tool.
>>>>>
>>>>>      Marc
>>>>>
>>>>>
>>>>>
>>>>> On 09/13/2013 07:24 AM, Hahne, Florian wrote:
>>>>>> Hi Marc,
>>>>>> I saw these warnings in Gviz, but they stem from GenomicFeatures
>>>>>>
>>>>>> Warning messages:
>>>>>> 1: 'isActiveSeq' is deprecated.
>>>>>> Use 'seqlevels' instead.
>>>>>> See help("Deprecated") and help("GenomicFeatures-deprecated").
>>>>>> 2: 'isActiveSeq' is deprecated.
>>>>>> Use 'seqlevels' instead.
>>>>>> See help("Deprecated") and help("GenomicFeatures-deprecated").
>>>>>> 3: 'isActiveSeq<-' is deprecated.
>>>>>> Use 'seqlevels' instead.
>>>>>> See help("Deprecated") and help("GenomicFeatures-deprecated").
>>>>>> 4: 'isActiveSeq<-' is deprecated.
>>>>>> Use 'seqlevels' instead.
>>>>>> See help("Deprecated") and help("GenomicFeatures-deprecated").
>>>>>> 5: 'isActiveSeq' is deprecated.
>>>>>> Use 'seqlevels' instead.
>>>>>> See help("Deprecated") and help("GenomicFeatures-deprecated").
>>>>>> 6: 'isActiveSeq<-' is deprecated.
>>>>>> Use 'seqlevels' instead.
>>>>>> See help("Deprecated") and help("GenomicFeatures-deprecated").
>>>>>>
>>>>>> So has the whole idea of active chromosomes in the data base been
>>>>>> dropped? I could not find anything in the change notes. Do I get it
>>>>>> right that you can now do
>>>>>> seqlevels(txdb, force=TRUE) <- "chr1"
>>>>>> if you just want the first chromosome to be active?
>>>>>>
>>>>>> Florian
>>>>>>
>>>>>
>>>>>
>>>>>      [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list