[R] Re: duplicated() and unique() problems
Petr PIKAL
petr.pikal at precheza.cz
Tue Jun 8 12:58:08 CEST 2010
Hi
r-help-bounces at r-project.org wrote on 08.06.2010 08:44:39:
> Hi everybody
>
> I have found something (for me at least) strange with duplicated(). I will
> first provide a replicable example of a certain kind of behaviour that I
> find odd and then give a sample of unexpected results from my own data. I
> hope someone can help me understand this.
>
> Consider the following
>
> # this works as expected
>
> ex=sample(1:20, replace=TRUE)
>
> ex
>
> duplicated(ex)
>
> ex=sort(ex)
This is OK, as sort() returns your data in sorted order.
>
> ex
>
> duplicated(ex)
>
>
> # but why does duplicated() not work after order()?
>
> ex=sample(1:20, replace=TRUE)
>
> ex
>
> duplicated(ex)
>
> ex=order(ex)
This is not OK, as order() gives you the positions (indices) of the sorted elements, not your data.
> ex=sample(letters[1:5],20, replace=TRUE)
> ex
[1] "b" "b" "b" "e" "d" "c" "e" "a" "a" "d" "d" "d" "a" "e" "b" "c" "e" "d" "a"
[20] "a"
> ex<-order(ex)
> ex
[1] 8 9 13 19 20 1 2 3 15 6 16 5 10 11 12 18 4 7 14 17
>
ex=ex[order(ex)]
should give you the same result as sort(), with the possible exception of ties.
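A small sketch of the difference, using a made-up vector (the values are only an illustration, not the poster's data): order() returns a permutation of the positions 1..n, so its result can never contain repeats, whereas sort() and ex[order(ex)] return the data themselves.

```r
# Hypothetical example vector, not taken from the original post
ex <- c(3, 1, 3, 2, 1)

sort(ex)        # the data in ascending order: 1 1 2 3 3
order(ex)       # positions that would sort the data: 2 5 4 1 3
ex[order(ex)]   # indexing by those positions gives the sorted data

# order() yields each position exactly once, so nothing is ever duplicated
any(duplicated(order(ex)))           # FALSE

# the sorted data keep their repeats
duplicated(sort(ex))                 # FALSE TRUE FALSE FALSE TRUE

# for this vector both routes give the same result
identical(ex[order(ex)], sort(ex))   # TRUE
```

So duplicated() does work after order(); it is simply being asked about the positions rather than about the values.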
>
> duplicated(ex)
>
> Why does duplicated() not work after order() has been applied but it works
> fine after sort()? Is this an error or is there something I don't
> understand?
>
> I have been getting very strange results from duplicated() and unique() in a
> dataset I am analysing. Here is a little sample of my real-life problem
>
> > str(Masechaba$PROPDESC)
> Factor w/ 24545 levels " 06"," 71Hemilton str",..: 14527 8043 16113 16054 13875 15780 12522 7771 14824 12314 ...
> > # Create an indicator for whether PROPDESC is unique. Default FALSE
> > Masechaba$unique=FALSE
> > Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE
> > # Check if something happened
> > length(which(Masechaba$unique==TRUE))
> [1] 2174
> > length(which(Masechaba$unique==FALSE))
> [1] 476
> > Masechaba$duplicate=FALSE
> > Masechaba$duplicate[which(duplicated(Masechaba$PROPDESC)==TRUE)]=TRUE
> > length(which(Masechaba$duplicate==TRUE))
> [1] 476
> > length(which(Masechaba$duplicate==FALSE))
> [1] 2174
> > # Looks OK so far
> > # Test on a known duplicate. I expect one to be true and one to be false
> > Masechaba[which(Masechaba$PROPDESC==2363),10:12]
>       PROPDESC unique duplicate
> 24874     2363   TRUE     FALSE
> 31280     2363   TRUE      TRUE
>
> # This is strange. I expected that unique() and duplicated() would give the
> same results. The variable PROPDESC is clearly not unique in both cases.
No.
ex=sample(letters[1:5],10, replace=TRUE)
ex
[1] "b" "d" "d" "b" "a" "c" "b" "c" "d" "d"
unique(ex)
[1] "b" "d" "a" "c"
duplicated(ex)
[1] FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
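The functions give you different answers about your data because they answer
different questions: unique() returns the distinct values (a vector that is
usually shorter than the input), while duplicated() returns one logical value
per element, TRUE for every repeat after the first occurrence.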
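To make the relationship concrete, here is a sketch that reuses the vector printed above. It shows that the two results differ in length, and that dropping the elements flagged by duplicated() reproduces unique().

```r
# Same values as the sample printed above
ex <- c("b", "d", "d", "b", "a", "c", "b", "c", "d", "d")

length(ex)               # 10
length(unique(ex))       # 4 distinct values
length(duplicated(ex))   # 10, one flag per element
sum(duplicated(ex))      # 6 repeats, i.e. 10 - 4

# Keeping only the first occurrences gives the same vector as unique()
identical(ex[!duplicated(ex)], unique(ex))   # TRUE
```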
> > Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This looks strange. At first sight I am puzzled as to what result to
expect from such a construction.
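A possible explanation, sketched on toy data (the vector x below is hypothetical and only stands in for Masechaba$PROPDESC): unique(x) is shorter than x, and if it contains no NA then which(is.na(unique(x))==FALSE) is just 1..k, where k is the number of distinct values. The assignment therefore sets the first k rows to TRUE regardless of what they contain. The alternative flag at the end assumes that "unique" is meant as "this value occurs exactly once"; if something else was intended, it would need adjusting.

```r
# Hypothetical toy data standing in for Masechaba$PROPDESC
x <- c("a", "b", "a", "c", "b", "d")

# The construction from the original post: the index is simply 1..4,
# because unique(x) has 4 elements and none of them is NA
idx <- which(is.na(unique(x)) == FALSE)
idx                      # 1 2 3 4

flag_wrong <- rep(FALSE, length(x))
flag_wrong[idx] <- TRUE
flag_wrong               # TRUE TRUE TRUE TRUE FALSE FALSE

# One possible alternative, assuming "unique" should mean
# "this value occurs exactly once in x"
occurs_once <- !(duplicated(x) | duplicated(x, fromLast = TRUE))
occurs_once              # FALSE FALSE FALSE TRUE FALSE TRUE
```

This would also be consistent with the counts in the original post: 2174 TRUE values is the total of 2650 rows minus the 476 flagged by duplicated(), i.e. the number of distinct values.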
Regards
Petr
> # The totals are the same but not the individual results
> > table(Masechaba$unique,Masechaba$duplicate)
>
>         FALSE TRUE
>   FALSE   342  134
>   TRUE   1832  342
>
> I don't understand this. Is there something I am missing?
>
> Best regards
> Christaan
>
>
> P.S
> > sessionInfo()
> R version 2.11.1 (2010-05-31)
> x86_64-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] splines stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] plyr_0.1.9 maptools_0.7-34 lattice_0.18-8 foreign_0.8-40 Hmisc_3.8-0 survival_2.35-8 rgdal_0.6-26
> [8] sp_0.9-64
>
> loaded via a namespace (and not attached):
> [1] cluster_1.12.3 grid_2.11.1 tools_2.11.1
>