[BioC] Why does a call to "unique" removes a DNAStringSet names?
Hervé Pagès
hpages at fhcrc.org
Tue Dec 18 21:02:40 CET 2012
Hi Nico, Val,
Just to let you know that this is addressed in the latest
IRanges/Biostrings (devel):
dna <- DNAStringSet(c(a="AAA", b="CC", c="ATTA", d="CC", e="CG"))
mcols(dna) <- DataFrame(score=1:5)
Then:
> unique(dna)
A DNAStringSet instance of length 4
width seq names
[1] 3 AAA a
[2] 2 CC b
[3] 4 ATTA c
[4] 2 CG e
> mcols(unique(dna))
DataFrame with 4 rows and 1 column
score
<integer>
1 1
2 2
3 3
4 5
This new behavior is consistent with what unique() does on a
GRanges object or a Vector object in general.
Cheers,
H.
On 11/15/2012 03:27 PM, Hervé Pagès wrote:
> Hi Nico, Val,
>
> Yes sorry for taking so long Nico, I didn't notice your email
> before.
>
> 2 additional issues I didn't realize:
>
> (1) unique() does not drop the metadata columns of a DNAStringSet:
>
> > dset
> A DNAStringSet instance of length 2
> width seq names
> [1] 1 A a
> [2] 1 C a
> > mcols(dset)
> DataFrame with 2 rows and 1 column
> score
> <integer>
> 1 1
> 2 2
>
> > mcols(unique(dset))
> DataFrame with 2 rows and 1 column
> score
> <integer>
> 1 1
> 2 2
>
> (2) unique() doesn't treat DNAStringSet consistently with GRanges:
>
> > gr
> GRanges with 5 ranges and 2 metadata columns:
> seqnames ranges strand | score GC
> <Rle> <IRanges> <Rle> | <integer> <numeric>
> a chr1 [1, 10] - | 1 1
> b chr2 [2, 10] + | 2 0.888888888888889
> c chr2 [3, 10] + | 3 0.777777777777778
> d chr2 [2, 10] + | 2 0.888888888888889
> e chr2 [4, 10] * | 4 0.666666666666667
> ---
> seqlengths:
> chr1 chr2 chr3
> 1000 2000 1500
> > unique(gr)
> GRanges with 4 ranges and 2 metadata columns:
> seqnames ranges strand | score GC
> <Rle> <IRanges> <Rle> | <integer> <numeric>
> a chr1 [1, 10] - | 1 1
> b chr2 [2, 10] + | 2 0.888888888888889
> c chr2 [3, 10] + | 3 0.777777777777778
> e chr2 [4, 10] * | 4 0.666666666666667
> ---
> seqlengths:
> chr1 chr2 chr3
> 1000 2000 1500
>
> On a GRanges, it just does x[!duplicated(x)], so not only the names
> are propagated but also the metadata columns.
>
> So the choices are:
> (a) we do the same for DNAStringSet, even if that's not what
> base::unique() does,
> (b) we choose to have unique() drop the names and metadata columns
> of any Vector object (DNAStringSet, GRanges, etc...),
> (c) we add the 'use.names' and 'use.mcols' args to unique(), with
> defaults to FALSE? or to TRUE?
> (d) ?
>
> I have a small preference for (a) even though I'm not really sure what
> the use cases are. Whatever we do, we should have unique() behave
> consistently on any member of the Vector family and also treat
> names and metadata columns the same way.
>
> Thanks,
> H.
>
>
> On 11/15/2012 09:11 AM, Valerie Obenchain wrote:
>> Hi Nico,
>>
>> Sorry it's taken awhile to get back to you. I wanted to ask about what
>> behavior you'd expect from a call to unique() on a DNAStringSet, i.e.,
>> what is your use case?
>>
>>
>> unique() on a named character vector drops names:
>> chr <- c(a="A", c="C", aa="A", c="CC")
>> > unique(chr)
>> [1] "A" "C" "CC"
>>
>>
>> Same for a named list:
>> lst <- list(a="A", c="C", aa="A", c="CC")
>> > unique(lst)
>> [[1]]
>> [1] "A"
>>
>> [[2]]
>> [1] "C"
>>
>> [[3]]
>> [1] "CC"
>>
>>
>> unique() on a DNAStringSet was patterned after this behavior. If names
>> were kept, would it be useful to retain only the name of the first
>> duplicate? In the data above there are two "A"'s. Would you want 'a'
>> kept and 'aa' dropped?
>>
>> Valerie
>>
>>
>>
>>
>> On 07/26/2012 08:36 AM, Nicolas Delhomme wrote:
>>> Hi,
>>>
>>> I've just realized that a call to unique on a DNAStringSet would
>>> result in the names slot to disappear. There's nothing about this in
>>> the documentation, but if that's the desired effect, warning about it
>>> would be good :-)
>>>
>>> Here is how to reproduce it:
>>>
>>> library(Biostrings)
>>> dset<-DNAStringSet(c("A","C"))
>>> names(dset)<- c("a","a")
>>> dset
>>> unique(dset)
>>>
>>>
>>> It gives:
>>>
>>>> dset
>>> A DNAStringSet instance of length 2
>>> width seq names
>>> [1] 1 A a
>>> [2] 1 C a
>>>> unique(dset)
>>> A DNAStringSet instance of length 2
>>> width seq
>>> [1] 1 A
>>> [2] 1 C
>>>
>>> My sessionInfo():
>>>
>>> R version 2.15.1 (2012-06-22)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] C/UTF-8/C/C/C/C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.25.8 IRanges_1.15.24 BiocGenerics_0.3.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] stats4_2.15.1 tools_2.15.1
>>>
>>> Cheers,
>>>
>>> Nico
>>>
>>> ---------------------------------------------------------------
>>> Nicolas Delhomme
>>>
>>> Nathaniel Street Lab
>>> Department of Plant Physiology
>>> Umeå Plant Science Center
>>>
>>> Tel: +46 90 786 7989
>>> Email: nicolas.delhomme at plantphys.umu.se
>>> SLU - Umeå universitet
>>> Umeå S-901 87 Sweden
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list